Finetuned Model So Bad It Doesn't Work

I finetuned a model to do a fairly complex task. I’ve trained a model to do this task before, but with data I considered to be slightly worse. It performed okay, but I wanted to try to improve it by training only on the very best examples.

That meant getting rid of about half of them (there were originally 85, and I cut it down to 35).

The finetuned model was so bad, it literally didn’t work (I kept getting errors saying the response was not valid Unicode, or that max tokens had been exceeded, even though none of the finetuning examples exceeded 3,000 tokens).

The training and validation loss reflect that: 7 and 7.5 respectively.

I’m going to assume it was just because the model was underfitting due to a lack of data.

Is that right?

It’s really hard to tell without seeing your code. If you’re getting invalid Unicode errors and your max-token limits are being exceeded, those are strictly deterministic checks and, as far as I know, not prone to “error” like that. I would assume something in your setup is incorrect without knowing more. For example, have you tried counting the total request tokens and comparing against the limits?
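One quick sanity check along those lines: validate that every line of the JSONL training file parses and estimate its token count. A minimal sketch (the ~4-characters-per-token heuristic is only an approximation; for exact counts you would use a real tokenizer such as tiktoken):

```python
import json

# Rough token estimate: ~4 characters per token for English text.
# This heuristic is only meant to flag obviously oversized examples.
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def check_examples(jsonl_lines, max_tokens=4096):
    """Return indices of examples whose total estimated tokens exceed the limit.
    Raises if any line is not valid JSON."""
    oversized = []
    for i, line in enumerate(jsonl_lines):
        example = json.loads(line)  # fails loudly on malformed lines
        total = sum(estimate_tokens(m.get("content") or "")
                    for m in example["messages"])
        if total > max_tokens:
            oversized.append(i)
    return oversized

lines = [
    json.dumps({"messages": [
        {"role": "system", "content": "You are a tutor."},
        {"role": "user", "content": "Teach me perceptrons."},
        {"role": "assistant", "content": "A perceptron is..."},
    ]}),
]
print(check_examples(lines, max_tokens=4096))  # [] -> nothing oversized
```

If this ever flags examples (or throws on a line), that is a far more likely cause of 500-style errors than the model itself.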

Personally, I don’t even try to “improve” fine-tuning by selecting the “best” examples from the dataset.

Another note about “best”: for me, there is no such thing as a “best example”.

When fine-tuning, you basically have to start with the best completions for the given prompts.

The “seed” dataset must be THE BEST, or at least the best you can get. And it must be AS CLOSE AS POSSIBLE TO YOUR REAL-LIFE USE CASES.

Personally, I don’t start fine-tuning unless my seed meets the requirements above and contains at least a couple of hundred samples.

  1. Train the first model on the data from above.

  2. Run the tuned model on the rest of your samples (another couple of hundred) to see where it fails and why.

  3. Manually edit the results if they are not THE BEST possible.

3.1 While editing, classify “where and why” the model failed, to identify the failing trends.

  4. Combine the new data with the previous data to train a second model.

  5. Find another couple of hundred samples that fall into the most common failure categories of model #1 and run them through model #2.

  6. Go back to step 3 with the data from model #2 and iterate steps 3-6, with one modification: besides adding samples similar to the failed ones, add the same amount of “standard use case” samples to the data you run the trained models on. This way you keep the balance between “standard use cases” and “edge cases” in your training dataset.

  7. When your training dataset is around 1k samples, split it into 2 parts, 70%-30%, where the bigger one is training and the smaller one validation. Then make another 2-3 iterations of steps 3-6.
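The iterative workflow above can be sketched roughly as follows; `train`, `run_model`, and `manually_review` are placeholders for your actual fine-tuning, inference, and human-review tooling, and the edge-case/standard-case balancing is left as a comment:

```python
import random

def build_dataset(seed, unlabeled_pool, train, run_model, manually_review,
                  target_size=1000, batch=200):
    """Sketch of the iterative dataset-building loop described above.
    All callables are placeholders, not real APIs."""
    data = list(seed)
    while len(data) < target_size and unlabeled_pool:
        model = train(data)                      # train on current best data
        sample = unlabeled_pool[:batch]          # next batch of raw prompts
        unlabeled_pool = unlabeled_pool[batch:]  # (mix failure-like and
        outputs = [run_model(model, p) for p in sample]  # standard cases here)
        reviewed = [manually_review(p, o) for p, o in zip(sample, outputs)]
        data.extend(reviewed)                    # merge edited results back in
    random.shuffle(data)
    split = int(len(data) * 0.7)                 # 70/30 train/validation split
    return data[:split], data[split:]

seed = [("p%d" % i, "c%d" % i) for i in range(10)]
pool = ["q%d" % i for i in range(20)]
train_set, val_set = build_dataset(
    seed, pool,
    train=lambda d: "model",
    run_model=lambda m, p: p.upper(),
    manually_review=lambda p, o: (p, o),
    target_size=25, batch=5)
print(len(train_set), len(val_set))
```

The point of the sketch is the shape of the loop, not the plumbing: every sample that enters `data` has passed through human review first.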

The approach above forces you to have only the “best” samples in your training dataset, so reducing the dataset by selecting the best samples naturally becomes meaningless…

Try this at least once to get to something like a 3k-sample training dataset (I know it will take some time and effort), and I bet you’ll be blown away by the performance of your new shiny model.

PS disclaimer: keep your expectations in the land of the reasonable; try not to build “magic” models where you ask for “don’t know what” to get the “miracle don’t know which”.


Thank you so much for this feedback. Definitely seems like you know what you’re talking about. Will get to work on implementing your tactics! Here goes to another month of making examples :rofl:

The number of training examples required is highly dependent on the use case. You can create a well-performing fine-tuned model with as few as 30-50 examples, just as you can with multiple thousands of examples.

Re-sharing the link to the latest fine-tuning guidance:


The task I’m training it to do is very nuanced and quite complex, so I still think the number of examples is the problem.

It’s still a bit weird, though, how it literally doesn’t work. I would have thought it would at least respond!

I get a weird error every time I use it. I don’t have this problem with any other finetuned models I use.

I know what you mean, but I don’t get errors like this with any of my other finetuned models. That makes me think it has to be something to do with that model, as opposed to my setup. Could be wrong though…

It sounds to me like you may benefit from going back to the drawing board. Instead of rushing to create another hundred or even thousands of examples, validate whether your prompt works on a regular non-finetuned model. Even if it is not yet yielding the desired consistency in output, the basic logic should work. If it doesn’t, then perhaps you need to simplify your task and break it down into multiple steps.

Sure, it’s about precision: if your task is common and doesn’t require near-100% accuracy, you can get away with fewer samples or even without fine-tuning.

Now, if your task is very specific, non-obvious, and requires very high precision, hundreds or even thousands of samples are needed (like the legal document analysis we do at LAWXER).

But in any case, the dataset-building approach stays pretty much the same regardless of the scale.

A couple of questions from my side, as I’ve become curious.

Have you checked your samples for contradictions? Usually the task must be clear (and simple; not easy, but simple to understand) for fine-tuning. Sometimes samples contradict each other, not only in meaning but also in “what is needed as a result”. If that’s the case, given the low number of samples, your model may “get confused” and produce…
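One mechanical way to screen for one class of contradiction is to flag prompts that appear more than once with different completions. A minimal sketch, assuming the dataset can be flattened to (prompt, completion) pairs first:

```python
from collections import defaultdict

def find_contradictions(examples):
    """Flag prompts that appear more than once with different completions.
    `examples` is a list of (prompt, completion) pairs; in a chat-format
    dataset you would extract these from the message lists beforehand."""
    by_prompt = defaultdict(set)
    for prompt, completion in examples:
        # Normalize lightly so trivial whitespace/case differences don't hide dupes.
        by_prompt[prompt.strip().lower()].add(completion.strip())
    return {p: sorted(c) for p, c in by_prompt.items() if len(c) > 1}

data = [
    ("Summarize the lesson", "A short summary."),
    ("Summarize the lesson", "A completely different answer."),
    ("List the key terms", "perceptron, weights, bias"),
]
conflicts = find_contradictions(data)
print(conflicts)  # only the duplicated prompt with two distinct completions
```

This won’t catch semantic contradictions (same meaning, different wording), but it is a cheap first pass before a manual review.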

Have you looked at what the error log says, if there is one?

Yeah well to be specific, what I’m doing is building an AI tutor. But I want lessons to be taught in a consistent style and tone, with enough detail, and contain relevant test questions and exercises. So it is quite a nuanced task, and without finetuning, it is almost impossible to get it to do all those things consistently.

Yeah I mean I don’t think so. All the system prompts are the same, all the user messages are in the same format, same with all the assistant function calls.

The lessons themselves (the arguments of the function calls) are all obviously slightly different as they are on different topics, but they are all roughly the same length, taught in the same tone, with the same number of test questions etc.

It’s so perplexing. Just the fact that it doesn’t work at all, despite working quite well with just another 40 examples.

The error code I get the most is 500. It usually says something like ‘invalid unicode expression in response’, or ‘Error: 500 The server had an error while processing your request. Sorry about that!’

Yeah, the prompt works at a basic level. The content produced is decent, but a bit inconsistent and light on detail.

Check my messages on the forum about tone tuning for how to get that.

But being from logistics myself: the task you’re giving it is too general and needs to be broken down into smaller tasks (write down your complete course-preparation workflow, then build it as blocks and create a pipeline).

Also, the subject becomes more interesting if you use user data and their errors to adjust the lessons as they progress.

Yeah, that’s kind of what I’m doing.

So first I plan the modules for the course, then I plan the units for the first module, then I plan lessons for each of those units, then I make the content for each of those lessons individually.
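That top-down flow can be sketched as a small pipeline; each `plan_*` argument below stands in for one model call or prompt template (all names here are illustrative, not a real API):

```python
def plan_course(subject, plan_modules, plan_units, plan_lessons, write_lesson):
    """Sketch of the top-down course pipeline: modules -> units -> lessons.
    Each callable is a placeholder for one model call."""
    course = {"subject": subject, "modules": []}
    for module in plan_modules(subject):
        units = []
        for unit in plan_units(module):
            # Each lesson is generated individually, as described above.
            lessons = [write_lesson(title) for title in plan_lessons(unit)]
            units.append({"title": unit, "lessons": lessons})
        course["modules"].append({"title": module, "units": units})
    return course

course = plan_course(
    "ML",
    plan_modules=lambda s: ["Neural Networks"],
    plan_units=lambda m: ["Perceptrons"],
    plan_lessons=lambda u: ["Intro", "Training"],
    write_lesson=lambda t: {"title": t, "content": "..."})
print(course["modules"][0]["units"][0]["lessons"])
```

Structuring it this way keeps each model call small and testable, which is exactly the “blocks and pipeline” idea from the reply above.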

Going forward, I’m going to use user data to adjust lessons to suit their learning style, and add lessons covering topics they don’t perform well on, but I thought I’d try to get a base model down first that simply makes content.

Or maybe I should try implementing that system first…

I meant the thorough step-by-step workflow of each item, with as many details as you can, especially on how the decision is made to do x, y, and z.

Personally, I would start from the end result (the lesson), trying to convert it into a sort of object with “properties” that are common to all lessons.

Then go upward with how you get those “properties” and what is needed for that, noting down all thoughts and decisions.

Then basically look at the whole thing and try to find the “blocks” that define your workflow.

But beyond that, it needs a more detailed brainstorm.

Here’s what you want to do…

  1. Increase the number of training examples to between 500 and 1,000.
  2. Train the model.

My advice for accomplishing number 1 is to build up a synthetic dataset using GPT-4, using as many prompts as necessary.
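A minimal sketch of the prompt-building side of that; the template and topics are illustrative, and the actual GPT-4 API call (which needs a key, and human review of the outputs) is deliberately left out:

```python
def build_generation_prompts(topics, style_example):
    """Build prompts for generating synthetic training examples with a
    strong model (e.g. GPT-4). The template here is a made-up example,
    not a recommended wording."""
    template = (
        "Write a lesson on the topic: {topic}\n"
        "Match the structure, tone and level of detail of this example:\n"
        "{example}"
    )
    return [template.format(topic=t, example=style_example) for t in topics]

prompts = build_generation_prompts(
    ["Backpropagation", "Activation functions"],
    style_example="# The Perceptron ...")
# Each prompt would then be sent to GPT-4, and the responses human-reviewed
# before being added to the fine-tuning dataset.
print(len(prompts))
```

Anchoring every generation prompt to one strong style example is what keeps the synthetic data consistent with the tone you are trying to teach.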



Each lesson is a JSON object, containing the same properties.

The main properties are “teaching”, where the AI teaches the user about the topic; “exercises”, an object containing the properties “practicalExercise” and “notesPrompt”; and “test”, an array of objects with “question” and “answer” properties.

I suppose I could split that up even further, producing the exercises and test separately and just including the lesson content in the context window?
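Either way, a lightweight structural check on each generated lesson can catch malformed outputs before they reach the training set. A sketch using the key names from the example lesson below (a hypothetical helper, not part of any API; adjust the keys to your actual schema):

```python
def validate_lesson(lesson):
    """Check a lesson object against the structure described above.
    Returns a list of problems; an empty list means the lesson looks OK."""
    problems = []
    content = lesson.get("content", {})
    if not content.get("teaching"):
        problems.append("missing content.teaching")
    exercises = lesson.get("exercises", {})
    for key in ("practical", "notesPrompt"):  # key names from the example lesson
        if key not in exercises:
            problems.append(f"missing exercises.{key}")
    questions = lesson.get("test", {}).get("questions", [])
    if not questions:
        problems.append("test.questions is empty")
    for i, q in enumerate(questions):
        if not q.get("question") or not q.get("answer"):
            problems.append(f"question {i} is incomplete")
    return problems

good = {"content": {"teaching": "..."},
        "exercises": {"practical": {}, "notesPrompt": "..."},
        "test": {"questions": [{"question": "Q?", "answer": "A."}]}}
print(validate_lesson(good))  # [] -> structurally valid
```

Running something like this over every finetuning example (and every model output) would also surface the kind of malformed responses that show up as opaque 500 errors.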

Here is an example lesson:

{"content": {"teaching": "# The Perceptron: Understanding the Basics In this lesson, we'll dive into the foundational concept of the Perceptron, a fundamental building block in the field of artificial neural networks. ## What is a Perceptron? ### History and Conceptual Overview The Perceptron is inspired by the way a single neuron in the human brain operates. It receives input signals, processes them, and produces an output. The output is determined by applying a set of weights to the inputs, summing them up, and then passing the result through an activation function. Let's get into the nitty gritty of those fascinating processes eh? 1. Input Signals (x): The Perceptron takes multiple input signals, denoted as x_1, x_2, ..., x_n. Each input is associated with a weight which determines its importance. Let's say the weights are w_1, w_2, ..., w_n. 2. Weighted Sum: The Perceptron calculates the weighted sum of the inputs and weights. This is done by computing: z = w_1 \\times x_1 + w_2 \\times x_2 + ... + w_n \\times x_n + b, where b is the bias term. 3. Activation Function: The result of the weighted sum is then passed through an activation function (often represented by a). The activation function introduces non-linearity (splits values into distinct categories) into the output and determines whether the Perceptron should 'fire' or not. - If the output of the activation function is above a certain threshold, the Perceptron will produce a '1' or 'firing' output. Otherwise, it will produce a '0' or non-firing output. This firing signal indicates that the Perceptron has classified the input data point as belonging to a particular category or class. In the context of binary classification (one or the other), if the Perceptron fires, it signifies that the input data point belongs to one class (often denoted as the positive class or class 1). Conversely, if the output is below the threshold, the Perceptron does not fire, indicating that the input data point belongs to the other class (often denoted as the negative class or class 0). 4. Learning and Training: The weights and the bias of the Perceptron, which we talked about earlier, are adjusted during the training process. The goal is to learn the optimal set of weights that allow the Perceptron to make accurate predictions. E.g. if the perceptron applies a really low weight to certain inputs, and its predictions are pretty poor, it might try applying higher weights to those inputs. - One common algorithm used for training the Perceptron is the Perceptron Learning Rule. 5. Applications and Limitations: Perceptrons have been used in a variety of applications, including binary classification problems. However, they are limited to problems that are linearly separable, where a single straight line can correctly separate the classes. To understand this, picture a graph with points scattered all over it. If the graph is linearly separable, a line could be drawn through it to separate the points accurately into categories. In mathematical terms, the Perceptron can be represented as a(w_1x_1 + w_2x_2 + ... + w_nx_n + b) where a is the activation function. Common activation functions include the step function, the sigmoid function, and the ReLU (Rectified Linear Unit) function. ### Multilayer Perceptrons (MLPs) While the single-layer Perceptron is limited to linear decision boundaries, Multilayer Perceptrons (MLPs) can overcome this limitation by introducing one or more hidden layers. These hidden layers allow MLPs to learn non-linear decision boundaries, making them more powerful for a wide range of tasks, including complex pattern recognition and classification problems. ### Backpropagation To train Multilayer Perceptrons, the backpropagation algorithm is commonly used. Backpropagation works by iteratively adjusting the weights of the network based on the error between the predicted output and the actual output. E.g. if the actual output was 1, and the model's output was 0.5, the gap would be 0.5, and the perceptron might increase the weights it was using to get its output closer to the target. This process involves propagating the error backward through the network and updating the weights accordingly, allowing the network to learn from its mistakes and improve its performance over time. ### Activation Functions While the step function was historically used as the activation function for Perceptrons, modern neural networks make use of a variety of activation functions to introduce non-linearity into the network. Some commonly used activation functions include: - Sigmoid: S-shaped curve that squashes the output between 0 and 1, useful for binary classification tasks. - ReLU (Rectified Linear Unit): Returns 0 for negative inputs and the input value for positive inputs, providing faster training compared to sigmoid and addressing the vanishing gradient problem. - Tanh: Similar to the sigmoid function but squashes the output between -1 and 1, often used in hidden layers of neural networks. ### Conclusion The Perceptron laid the foundation for modern artificial neural networks, and its evolution into Multilayer Perceptrons paved the way for deep learning. Understanding these concepts is crucial for anyone interested in delving into the field of artificial intelligence and machine learning.", "searchQuery": "Introduction to single-layer perceptron in neural networks"},
"exercises": {"practical": {"instructions": "Implement a single Perceptron algorithm in your programming language of choice. You can start with a simple AND gate example and then extend it to other logical functions or linearly separable datasets.", "solution": null}, "notesPrompt": "Discuss the characteristics of a linear decision boundary in the context of the single-layer perceptron. Consider how it impacts the perceptron's ability to classify data and its limitations."},
"test": {"questions": [{"question": "Explain the role of the activation function in a single-layer perceptron.", "answer": "The activation function introduces non-linearity and determines the output of the perceptron based on the weighted sum of inputs.", "qType": "oneAnswer", "options": [], "lNumbers": [2]}, {"question": "A single-layer perceptron can model any function.", "answer": "False", "qType": "trueFalse", "options": [], "lNumbers": [2]}, {"question": "Backpropagation involves measuring the error between what?", "answer": "Predicted output and…", "qType": "multipleChoice", "options": ["actual output", "perceptron weights"], "lNumbers": [2]}, {"question": "Name two key elements of a single-layer perceptron?", "answer": "Any two from: inputs, weights, a weighted sum function, an activation function, and the output.", "qType": "oneAnswer", "options": [], "lNumbers": [2]}, {"question": "Name one activation function…", "answer": "ReLU/Sigmoid/Tanh", "qType": "oneAnswer", "options": [], "lNumbers": [2]}, {"question": "In the context of a single-layer perceptron, how are the weights related to the decision boundary?", "answer": "The weights determine the orientation of the decision boundary.", "qType": "oneAnswer", "options": [], "lNumbers": [2]}, {"question": "What is a common training algorithm used for single layer perceptrons?", "answer": "Perceptron Learning Rule", "qType": "oneAnswer", "options": [], "lNumbers": [2]}]}}


You may want to just use RAG to inject knowledge as your starting point.

Then, if needed, your fine-tune would produce the specific tone you need, but not any knowledge or details.

Also, it sounds like you need a high-level controller that tracks the learning status, based on your quizzes, and then reinforces areas where the student needs to improve.
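For a concrete sense of what such a controller could look like, here is a toy sketch: it just tracks quiz scores per topic and picks the weakest topic to reinforce (the class name, threshold, and scoring scheme are all illustrative):

```python
class LearningController:
    """Toy sketch of a high-level controller: tracks quiz scores per topic
    and decides whether to review a weak topic or advance."""
    def __init__(self, pass_threshold=0.7):
        self.pass_threshold = pass_threshold
        self.scores = {}  # topic -> list of quiz scores in [0, 1]

    def record_quiz(self, topic, score):
        self.scores.setdefault(topic, []).append(score)

    def weak_topics(self):
        # A topic is weak if its average score is below the pass threshold.
        return [t for t, s in self.scores.items()
                if sum(s) / len(s) < self.pass_threshold]

    def next_action(self):
        weak = self.weak_topics()
        return ("review", weak[0]) if weak else ("advance", None)

ctrl = LearningController()
ctrl.record_quiz("perceptrons", 0.9)
ctrl.record_quiz("backpropagation", 0.4)
print(ctrl.next_action())  # ('review', 'backpropagation')
```

The real version would sit above the lesson-generating model, feeding it "review topic X" or "advance to topic Y" instructions based on the student's quiz history.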


Yeah, this is something I’m looking into adding now. What is a high-level controller?