How to choose my fine tuning data?

djohnson1 · December 27, 2023, 5:12pm

Our company conducts interviews over the phone. Those interviews get transcribed and edited for readability and compliance. We have a very specific editing style and we would like to use the fine tuning feature to train a model on our editing style. Would it be better to use single sentences in the training data, full paragraphs, or the entire document? We want to feed the raw transcript of the interview into the trained model and ask it to perform edits using our style guidelines provided in the training data. What is the best approach?

patalacey · December 28, 2023, 4:21am

I’d suggest train it on either the entire document or full paragraphs depending on the length of the document (fine tuning gpt 3.5 turbo has a limit of 16k tokens for each training example). You can use the system message to describe the focus of the editing done in the examples, but keep it short and consise to save room for your user and assistant messages. The user message would be the raw transcription and the assistant message would be the edited version. The minimum amount of training examples is 10, but I’d suggest a few hundred to get some solid results. For more info, check out the fine tuning guide in their documentation.

djohnson1 · January 2, 2024, 3:54pm

Should the training data resemble to actual JSON of the API call my app will be making to the model? Or should it just be examples of original text, and then examples of edited text? Would one produce a better outcome over another? Below is the JSON structure of what we’ll be sending the model in a request.

{
    "paragraphs": [
        {
            "text": "<strong>What did you study in college?</strong>",
            "start": 1.8399999,
            "end": 14.715,
            "confidence": "0.9394453931578947368421052632",
            "speaker": {
                "deepgramSpeaker": "0",
                "speakerSource": "Interviewer"
            }
        },
        {
            "text": "I studied mechanical engineering and history.",
            "start": 16.64,
            "end": 43.019997,
            "confidence": "0.9605421618644067796610169492",
            "speaker": {
                "deepgramSpeaker": "1",
                "speakerSource": "Expert"
            }
        }
    ]
}

djohnson1 · January 2, 2024, 4:16pm

I should be more clear. Here’s what one interaction from the raw transcript might look like:

{
    "paragraphs": [
        {
            "text": "So okay, thanks for taking the time to talk to me today. So why don't we start with, where did you, or what did you study in college?",
            "start": 1.8399999,
            "end": 14.715,
            "confidence": "0.9394453931578947368421052632",
            "speaker": {
                "deepgramSpeaker": "0",
                "speakerSource": "Interviewer"
            }
        },
        {
            "text": "I I studied, well, mechanical engineering, and but also history.",
            "start": 16.64,
            "end": 43.019997,
            "confidence": "0.9605421618644067796610169492",
            "speaker": {
                "deepgramSpeaker": "1",
                "speakerSource": "Expert"
            }
        }
    ]
}

And here’s what we want the model to return in response:

{
    "paragraphs": [
        {
            "text": "What did you study in college?",
            "start": 1.8399999,
            "end": 14.715,
            "speaker": {
                "deepgramSpeaker": "0",
                "speakerSource": "Interviewer"
            }
        },
        {
            "text": "I studied mechanical engineering and history.",
            "start": 16.64,
            "end": 43.019997,
            "speaker": {
                "deepgramSpeaker": "1",
                "speakerSource": "Expert"
            }
        }
    ]
}

In that scenario, it seems like the training data should be cut into paragraphs, so showing one paragraph and it’s raw version, and then the expected response with the changes to the text. It seems like to keep it on task of just editing that text it would show off the changes better per paragraph, instead of over the whole document. Thoughts?

_j · January 2, 2024, 7:58pm

The fine-tune file should consist of examples.

The system message and AI purpose
The user instruction and data
The type of output that shall then be produced by transforming information

An example that you might already have the data for:

system: You are re-writo, and your job is re-write interview transcriptions.

user:
Rewrite this into Our Widget company editing style:

{original text}

assistant:
{human edited text}

djohnson1 · January 2, 2024, 8:12pm

You suggest this method even though that’s not the practical application of how my app will be communicating with the model? It’s going to send the entire document as a JSON, and I need the response to be in the same JSON format. I asked GPT 4 about this process and it suggested my training data look as close to the actual request and expected response as I can get it.

_j · January 2, 2024, 8:18pm

Replace {original text} with {“data”: “{original text}”} then.

I provide an overview, you fill in the overall understanding with your application.

You’re tuning the AI on what it receives and what it produces in response.

There are those that would mess up the training far more than the understanding you already have.

Topic		Replies	Views
Are fine-tuned models a good way to give GPT a specific tone of voice? API api	5	4086	July 20, 2023
How closely does my training data need to match my prompt sequencing for Fine-tuning to be effective? API fine-tuning , training	7	1039	February 6, 2024
Fine tuning for writing style - lessons and questions API fine-tuning	5	3264	January 17, 2024
Fine-Tuning 3.5 Turbo for writing style/tone API	1	1675	September 27, 2023
How does gpt-3.5-turbo fine-tuning work? API gpt-35-turbo , fine-tuning	10	1941	September 11, 2023

How to choose my fine tuning data?

Related topics