How to choose my fine tuning data?

Our company conducts interviews over the phone. Those interviews get transcribed and edited for readability and compliance. We have a very specific editing style and we would like to use the fine tuning feature to train a model on our editing style. Would it be better to use single sentences in the training data, full paragraphs, or the entire document? We want to feed the raw transcript of the interview into the trained model and ask it to perform edits using our style guidelines provided in the training data. What is the best approach?

I’d suggest train it on either the entire document or full paragraphs depending on the length of the document (fine tuning gpt 3.5 turbo has a limit of 16k tokens for each training example). You can use the system message to describe the focus of the editing done in the examples, but keep it short and consise to save room for your user and assistant messages. The user message would be the raw transcription and the assistant message would be the edited version. The minimum amount of training examples is 10, but I’d suggest a few hundred to get some solid results. For more info, check out the fine tuning guide in their documentation.

Should the training data resemble to actual JSON of the API call my app will be making to the model? Or should it just be examples of original text, and then examples of edited text? Would one produce a better outcome over another? Below is the JSON structure of what we’ll be sending the model in a request.

{
    "paragraphs": [
        {
            "text": "<strong>What did you study in college?</strong>",
            "start": 1.8399999,
            "end": 14.715,
            "confidence": "0.9394453931578947368421052632",
            "speaker": {
                "deepgramSpeaker": "0",
                "speakerSource": "Interviewer"
            }
        },
        {
            "text": "I studied mechanical engineering and history.",
            "start": 16.64,
            "end": 43.019997,
            "confidence": "0.9605421618644067796610169492",
            "speaker": {
                "deepgramSpeaker": "1",
                "speakerSource": "Expert"
            }
        }
    ]
}

I should be more clear. Here’s what one interaction from the raw transcript might look like:

{
    "paragraphs": [
        {
            "text": "So okay, thanks for taking the time to talk to me today. So why don't we start with, where did you, or what did you study in college?",
            "start": 1.8399999,
            "end": 14.715,
            "confidence": "0.9394453931578947368421052632",
            "speaker": {
                "deepgramSpeaker": "0",
                "speakerSource": "Interviewer"
            }
        },
        {
            "text": "I I studied, well, mechanical engineering, and but also history.",
            "start": 16.64,
            "end": 43.019997,
            "confidence": "0.9605421618644067796610169492",
            "speaker": {
                "deepgramSpeaker": "1",
                "speakerSource": "Expert"
            }
        }
    ]
}

And here’s what we want the model to return in response:

{
    "paragraphs": [
        {
            "text": "What did you study in college?",
            "start": 1.8399999,
            "end": 14.715,
            "speaker": {
                "deepgramSpeaker": "0",
                "speakerSource": "Interviewer"
            }
        },
        {
            "text": "I studied mechanical engineering and history.",
            "start": 16.64,
            "end": 43.019997,
            "speaker": {
                "deepgramSpeaker": "1",
                "speakerSource": "Expert"
            }
        }
    ]
}

In that scenario, it seems like the training data should be cut into paragraphs, so showing one paragraph and it’s raw version, and then the expected response with the changes to the text. It seems like to keep it on task of just editing that text it would show off the changes better per paragraph, instead of over the whole document. Thoughts?

The fine-tune file should consist of examples.

  • The system message and AI purpose
  • The user instruction and data
  • The type of output that shall then be produced by transforming information

An example that you might already have the data for:

system: You are re-writo, and your job is re-write interview transcriptions.

user:
Rewrite this into Our Widget company editing style:

{original text}

assistant:
{human edited text}

You suggest this method even though that’s not the practical application of how my app will be communicating with the model? It’s going to send the entire document as a JSON, and I need the response to be in the same JSON format. I asked GPT 4 about this process and it suggested my training data look as close to the actual request and expected response as I can get it.

Replace {original text} with {“data”: “{original text}”} then.

I provide an overview, you fill in the overall understanding with your application.

You’re tuning the AI on what it receives and what it produces in response.

There are those that would mess up the training far more than the understanding you already have.

1 Like