How closely does my training data need to match my prompt sequencing for fine-tuning to be effective?

I have a system that generates blog posts. The user flow starts with a keyword, then suggests a headline, followed by an outline, and the final draft. There are other stages in between.

Early on we ran into issues outputting long 1,000+ word articles, so now we generate them section by section, and then stitch them back together at the end. It’s been working well.

My question is, how closely does the training data need to map to our current prompt structure/sequencing?

It seems we need to train on article sections rather than full articles, given our current section-by-section approach, which I actually like. However, what I’m unsure of is whether I can train on the sections alone, without the “system” and “user” prompts that led up to them (since there are MANY).

How much context, if any, do I need to provide GPT with my training data? How closely does the [messages] element in my payload to GPT need to match my actual current structure?

Hey there and welcome to the community!

So, from reading your post, I think there might be a misunderstanding of what “training” is with regard to LLMs.

Training data is what shapes the model. It is its substance. The data is fed in via machine learning techniques, and the resulting weights are what we currently interact with.

Training a model is very specific jargon. If you are talking about providing context in which it appears to be learning from that context, that is a different story.

If a model needs to learn a different or specific format for interaction, then fine-tuning would be the best approach. With this, you basically give the model all kinds of different examples of how it should respond and interact with the user. This changes the model weights toward something more desirable, so the behavior is baked into the fine-tuned model permanently, though the results can be somewhat unpredictable.

Meanwhile, prompting the model, and merely providing it data as a message, is not training or teaching the model in a traditional sense. It is basically taking the entire clump of text as a whole and coming up with an appropriate response pattern based on that data. You can use RAG to intelligently supplement messages with extra context when needed, and this doesn’t require you to change any model weights (fine-tune).

So with all that said, what exactly is the issue that you’re asking for help on? Currently, it sounds like you have an appropriate approach to a task you’re trying to accomplish. The amount of context you can input is based on the context window of the model you’re calling (like GPT-4 32k). That would be 32k tokens. If it is already working and doing what it’s supposed to be doing, why would training/fine-tuning be necessary?


We did an experiment with using fine-tuning on headlines and it worked very well. We emulated the 2-3 stages of the conversation to get to those headlines. The full article text is about 8 steps (a series of prompts) deeper into the user flow.

My question is how precisely we have to emulate those steps within the json file we have to submit to OpenAI to create the new model.

So, is the number of steps satisfactory? Is that how many you want?

Fine-tuning would work well to help reduce the number of steps needed to achieve the result. The level of precision you need depends on the precision of your structured workflow itself. If the step-by-step process is the important aspect you’re trying to solidify, then you want as much precision and detail as possible in the training examples. Those are its bread and butter. If what matters is the final output rather than the route taken to get there, then you would focus on pre-processing the data into examples that show how to reach the desired result in fewer steps.

It all depends on what is most important to you.

Sorry. Not doing a good job of explaining.

I’m trying to understand what the structure of the json file needs to be for fine-tuning. From what I understand it needs to be in the same structure as the [conversation] structure.

Ah, I see now.

This can be found here:

Each example in the dataset should be a conversation in the same format as our Chat Completions API, specifically a list of messages where each message has a role, content, and optional name. At least some of the training examples should directly target cases where the prompted model is not behaving as desired, and the provided assistant messages in the data should be the ideal responses you want the model to provide.

{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters."}]}

You cannot deviate from the Chat Completions API format for this. You can adjust the content, but not the format of the json itself.
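If it helps, here is a minimal sketch of how you could sanity-check examples before uploading them. It only uses Python's standard library. The function names (`validate_example`, `to_jsonl`) and the set of allowed roles are my own assumptions for illustration, not part of any OpenAI tooling, and the checks are deliberately simplified (for instance, they ignore function-calling messages).

```python
import json

# Assumption: only the three basic chat roles are used in the training data.
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_example(example):
    """Return a list of problems found in one training example (empty if none)."""
    messages = example.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i} is not an object")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i} has unexpected role {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str):
            problems.append(f"message {i} has no string content")

    # Each example needs at least one assistant message: the ideal response.
    if not any(isinstance(m, dict) and m.get("role") == "assistant" for m in messages):
        problems.append("no assistant message to learn from")
    return problems


def to_jsonl(examples):
    """Serialize examples as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(ex, ensure_ascii=False) for ex in examples) + "\n"
```

Running `validate_example` over every example and writing the file only when all of them come back clean should catch most formatting mistakes before the job is submitted.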

In my experience, you do not need to follow the exact steps. Your JSONL file can condense the 8 steps into a single message.
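As a rough illustration of what condensing could look like for the blog workflow described earlier, the sketch below folds the keyword, headline, and outline into one user message and uses the finished section as the assistant message. Everything here is hypothetical: the function name `build_section_example`, the field labels, and the prompt wording are assumptions, not a required layout. The general advice is that whatever shape you train on is the shape you should also send at inference time.

```python
import json


def build_section_example(system_prompt, keyword, headline, outline,
                          section_heading, section_text):
    """Build one training example that condenses the earlier steps
    (keyword, headline, outline) into a single user message."""
    lines = [f"Keyword: {keyword}", f"Headline: {headline}", "Outline:"]
    lines += [f"- {item}" for item in outline]
    lines.append(f"Write the section: {section_heading}")

    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "\n".join(lines)},
            # The assistant message is the ideal section you want back.
            {"role": "assistant", "content": section_text},
        ]
    }


# Hypothetical usage: one line of the JSONL file per article section.
example = build_section_example(
    system_prompt="You write one blog post section at a time.",
    keyword="sourdough",
    headline="Sourdough for Beginners",
    outline=["Introduction", "Making a starter", "Baking day"],
    section_heading="Making a starter",
    section_text="A starter is just flour, water, and patience.",
)
print(json.dumps(example))
```

Each section of each article becomes its own line in the file, so a handful of articles can already yield a fair number of examples.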

Try to have around 100-250 examples of such conversations in your JSONL file. Leave it overnight to fine-tune (it took more than 8 hours for me) and voilà.

Hope this helped.

Last question. How closely do I need to match the content of the JSON structure with the content of my current prompt structure and sequence?