Fine-Tuning Setup for gpt-3.5-turbo-16k

I’m currently in the process of fine-tuning GPT-3.5-turbo and am exploring the most effective strategies for this phase. Specifically, I’m torn between two approaches and would appreciate your insights.

  1. Directive Approach with Specific Prompts: This method involves being quite prescriptive in the training data. For instance, providing detailed prompts that guide the model toward a specific type of response. Here’s what it looks like in practice:

“role”: “user”,
“content”: “Here is a specific question or command prompt + content…”

“role”: “assistant”,
“content”: “This would be how I expect the model to be structured and formulate its response…”

Then, I would follow this with precise expectations for the model’s response.

  1. Content-Only with Post-Training Refinement: Alternatively, I’m considering a more open-ended approach where I feed the model the content without a specific prompt. After that, I’d run multiple rounds, each with up to 100 examples, demonstrating the desired response format and content. Here’s a simplified example:

“role”: “user”,
“content”: “Here is the content without direct prompting…”

Subsequently, I’d provide examples like this:

“role”: “assistant”,
“content”: “This would be how I expect the model to be structured and formulate its response…”
with 100+ samples for training data

My goal is to to ensure the model reliably generates the most relevant and accurate content. However, I’m unsure which strategy is more effective in the long run, especially when the model encounters varied real-world scenarios.

Has anyone here experimented with these methods? Which do you find more effective in terms of model accuracy, response relevance, and long-term adaptability? I’m eager to learn from your experiences and to understand the pros and cons that perhaps I haven’t considered.

Also, is the fine-tuned model 16k in context?

Thanks in advance for your insights!

second part. That is not a method of training demonstrated. The inferred behavior would be: for any content on this topic, emit a single stop token. Or: Emit random text. You might as well add an AI output “I’m sorry, but I can’t help with that”.

first part. the language you use is prescriptive, not demonstrative as required.

Thanks for your response. I’m currently training based on the first approach in a directive manner in order to get a specific format for output, just preparing the data sets.

For anyone else here reading, I’ll update/report my experience if anyone cares.

1 Like

Yeah, we’d love to hear back. Fine-tuning isn’t my forte, but I’ve done some with GPT-2 and the original GPT-3 Davinci.

We’ve actually got a great community growing here with a lot of “hidden gem” posts if you search around a bit.

Once you reach Trust Level 3 on the forum, you get access to AI tools here on the site and more. Good luck tinkering!

1 Like

The point I make is: the input of example conversations isn’t chat to an AI where you say “here’s what I’d like you to learn from reading my example…”. It is what a user or script would input to stimulate the exact response given by AI.

1 Like

Yeah… it would be the actual chat completion with the 3 messages of system, user, assistant. Thats what I’m doing currently as I setup the training data :slight_smile: I’m excited to see the results. I will test with 10 first, since this is what openai recommends. But they do suggest best results will be for 50-100 datasets. So I will test and validate. Will report back :slight_smile: Have you successfully done implemented alot of trained data sets? and has the model performed as you expected or wanted it to?

Sure! thanks for the warm welcome :slight_smile: Definitely exciting to be here! Yeah how were your experiences with GPT-2 and GPT-3 Davinci?

The models become deprecated rather quickly eh? lol which is a good thing. Unfortunately, the TPM of GPT-4 is lower than than that of GPT-3.5-16k. Seems like OpenAI really prefer users to use the API for GPT-3.5-turbo. But would really just love to have GPT-4-32k which isn’t available for API use :frowning: yet…


That feels like decades ago now haha…

I actually dove in around the time I read this blog post by Janelle Shane where she asked her audience for D&D backstories and fine-tuned GPT-2 with them. I asked for the dataset, and she shared. I found a lot of sci-fi and other stuff in there, so I got to work cleaning the dataset (losing a lot) then writing my own examples.

One of the more interesting revelations was people interchanging steps (?) and epochs in docs and articles when they shouldn’t as they refer to different things… Overfitting was a bit problem too unless you got all the parameters right. All that said, the tinkering eventually led to advanced AI backstory generators at LitRPG Adventures.

After running the site for a while, I used the content generated that I owned to fine-tune Davinci. The costs were a lot higher for output, but I could cut the prompt down a lot, so it kinda evened out. With GPT-3.5-turbo (+instruct), I’ve not made fine-tuning a priority again. I do have around 100,000 generations now that were generated for non-commercial uses that I can use to train if I do so in the future.

You’re right, though, that things deprecate really quickly these days! It’s really hard to stay up to date on everything.

Look forward to hearing about your results!

Successfully completed the upload for training data.
Training Loss of .6437. Hard to say if this is bad or good so need empirical data to really understand and quantify. Generally, a lower loss indicates that predictions are closer to the true values, which is good. However, the scale of good vs bad is problem-specific. .6437 could be excellent in one iteration and poor in another. So was very much looking forward to testing and validating.

  • Training loss is a numerical indication of how far your model’s predictions are from the actual targets or labels during the training phase.
  • It is calculated by a loss function (also called a cost function or objective function), which you choose based on what kind of task you are performing (e.g., regression, classification). Common loss functions include mean squared error for regression tasks or cross-entropy loss for classification.

The training data was 10 data sets. I prepared 50 data sets, but had issues with large prompts and formatting so uploaded 10 after manually formatting all 10.

All the data was trained on GPT-4 chat completions so was very excited to test it out, but unfortunately there wasn’t any clear documentation whether or not fine tuned models support 16k, but it looks like 4,096 tokens is max which is the base model GPT-3.5-Turbo.

Alot of time and energy on a dead end for now at least… I just hope they will have GPT-4k-32k API. Will solve alot of issues and bring alot of new use cases.

Hopefully this will save time and energy for others.

In case do you need data for fine-tuning 3.5 here (huggingface/bitext) do you have a free dataset with morte than 26K rows