I used a 3,700-line JSONL dataset to fine-tune gpt-4o-mini-2024-07-18 with the default hyperparameters.
Trained tokens: 2,091,327
Epochs: 3
Batch size: 7
LR multiplier: 1.8
Seed: 1658117049
Based on your experience, how should I go about iterating on the hyperparameters?
My tests currently produce okay results, but not precise enough.
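For context, here is a minimal sketch of how a follow-up job with explicit hyperparameters (instead of the auto defaults) could be launched via the Python SDK; the training file ID is a placeholder, and the specific values are only illustrative:

```python
from openai import OpenAI

client = OpenAI()

# Sketch only: "file-abc123" stands in for the uploaded JSONL training file ID.
job = client.fine_tuning.jobs.create(
    training_file="file-abc123",
    model="gpt-4o-mini-2024-07-18",
    hyperparameters={
        "n_epochs": 4,                   # e.g. one more than the auto-chosen 3
        "batch_size": 7,
        "learning_rate_multiplier": 1.8,
    },
)
print(job.id, job.status)
```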
I've already looked at the "Iterating on data quality" guidance:
Collect examples to target remaining issues
Scrutinize existing examples for issues
Consider the balance and diversity of data
Make sure your training examples contain all of the information needed for the response
Look at the agreement / consistency in the training examples
What tends to bite me with classification is severely imbalanced labels, which I then need to generate more training data to compensate for.
But you didn't mention what your fine-tune is doing. Is it a classifier of sorts?
Thanks for your input Kurt.
The AI was tasked with analyzing, understanding, and identifying every topic and subtopic across the whole Deno documentation and generating JSONL entries accordingly, so I am confident the labels are fairly well balanced and distributed. There are 3,700 lines!
When I ask ChatGPT with the new search feature "What is Deno 2 backward compatible with?", it answers perfectly well, but my fine-tuned 4o-mini gives a vague, albeit correct, answer, even though the dataset contains a line with the very answer:
{"messages": [{"role": "system", "content": "You are a helpful assistant, expert in web development and the Deno JS runtime environment. You answer professionally and concisely, always double-checking your responses for accuracy."}, {"role": "user", "content": "Is Deno 2 backwards compatible with Node.js and npm?"}, {"role": "assistant", "content": "Yes, Deno 2 is backwards compatible with Node.js and npm. This means you can run existing Node.js applications using Deno 2, allowing for seamless integration and the ability to leverage Deno's features in your current projects."}]}
Is that line too long? Do I have to double the number of epochs?
You may be better off with a robust RAG solution that allows you to stay on top of changes in the documentation. This can be done in many ways, for example by hashing the page contents (see the sketch at the end of this post).
Think about it: each time the documentation updates, you will have to fine-tune the model all over again.
It would be very difficult to beat SearchGPT, as it can always find the most up-to-date information.
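A minimal sketch of the hashing idea, assuming the doc pages are fetched as plain text and the chunking/embedding happens elsewhere; the URLs and the local hash store are placeholders, not a prescribed setup:

```python
import hashlib
import json
import pathlib

import requests

HASH_STORE = pathlib.Path("doc_hashes.json")  # hypothetical local hash store
PAGES = [
    "https://docs.deno.com/runtime/",   # placeholder list of doc pages
    "https://docs.deno.com/examples/",
]

known = json.loads(HASH_STORE.read_text()) if HASH_STORE.exists() else {}
changed = []

for url in PAGES:
    body = requests.get(url, timeout=30).text
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    if known.get(url) != digest:        # new page, or content changed since last crawl
        changed.append(url)
        known[url] = digest

HASH_STORE.write_text(json.dumps(known, indent=2))
# `changed` now lists the pages to re-chunk and re-embed in the RAG index.
print(f"{len(changed)} page(s) need re-embedding")
```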
The last third of the training looks like the loss isn't going anywhere, but if you simply need more obedience and overfitting, you can submit another job with your fine-tuned model as the input model name, perhaps with another 2 epochs, and it will continue training from the existing fine-tune to deepen the weights.
Remember that you must reuse the system message, and inputs as similar as possible to what you trained on, to activate your training; you can't expect the same quality if you write a completely different message. You are ultimately training a small model that only performs "chat" at all because of OpenAI's own extensive training.
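If it helps, a minimal sketch of such a continuation job via the Python SDK; the file ID and the ft: model name are placeholders for your own values:

```python
from openai import OpenAI

client = OpenAI()

# Continue training from the existing fine-tune rather than the base model.
# Both IDs below are placeholders.
job = client.fine_tuning.jobs.create(
    training_file="file-abc123",                        # same training JSONL
    model="ft:gpt-4o-mini-2024-07-18:my-org::abc123",   # previous fine-tune
    hyperparameters={"n_epochs": 2},
)
print(job.id, job.status)
```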
Thanks, I will launch a new fine-tuning session then.
Re the system message: you mean that the second tuning should basically use the same system message as in the first dataset? If so, yes, I would like to reuse the dataset, so same everything.
There are IDEs that have this feature built-in, like Cursor.
Cursor comes with a set of third party docs crawled, indexed, and ready to be used as context. You can access them by using the @Docs symbol.
Add Custom Docs
If you want to crawl and index custom docs that are not already provided, you can do so via @Docs > Add new doc. A modal will appear after you've pasted in the URL of your desired doc.
Yes, the system prompt should be seen more as an activation of your training against what already exists, rather than an instruction, although instruction-following is already there.
If your input and output were completely un-chat-like, you'd want to depart even further from an instruction that says "you are ChatGPT".
When you use an existing fine-tune to create a new tuned model on the same data file, the result is similar to having specified more epochs from the start, without having to pay for the whole thing over again at a higher token expense.
Indeed, I'm familiar with this; I use aider in Zed. Blazing fast.
The issue with adding docs "on the fly" is the token consumption, hence my tuning... or RAG, if that would be efficient.
The model's quality got worse.
Fine-tuning sequentially (one dataset after the other) can lead to what's known as catastrophic forgetting. This occurs when the model starts to overwrite knowledge from the first fine-tuning during the second, diminishing its performance on the initial data.
Quite a cold shower for my first try at fine-tuning with a 3,700-entry dataset.
I find the fine-tuning method quite archaic and prehistoric, tbh. It reminds me of the early days of the internet, with the modem and its cumbersome setup.