Improve fine-tuning by adding embeddings

I am an experienced backend Python developer, but I am very new to AI/ML/LLM. I am building an application to classify emails into 1 of 14 categories. It will be used by a childcare facility to categorize, and in the future triage, incoming emails.

I have manually labeled about 600 emails, using a number 1-14 as the completion, and I have also used the OpenAI API to synthesize another 300 or so emails to round out some of the categories with low sample counts.

I have fine tuned 10 different models, mostly curie, with various values for n_epochs. I have also tried fine tuning with just the manually labeled data, with manual + synthetic data, and with one run where I fine tuned on the manually labeled data followed by a second fine tune run on the synthetic data (the results on this last one were terrible).

The curie model fine tuned with the full dataset at 2 epochs seems to give the best results, and for many topics it is fairly accurate at classification. It's not quite where I want it yet, though, so I have 2 questions.

  1. I read this post about fine tuning plus embedding. I have also watched several videos from the YouTube channel linked in that post. I now understand what embedding does, but I'm having a hard time understanding how to implement it. I know there is documentation, so I don't expect anyone to hand-hold me through this, but I need some clarification. Can I use the same training dataset to create the embeddings? I tried adding embeddings to my fine-tuned curie model, but that model doesn't support it. Does this mean that I will need to use 2 separate models, and therefore 2 separate API calls, in order to use the fine-tuned model and the model with embeddings? What would that look like? Call each and then find a way to reconcile any differences?

  2. The training data is not evenly distributed. There are at least 60+ examples for each category, but there are 2 categories in particular that have 150-200 examples. It seems like my fine-tuned model is favoring one of these topics, even when an email very clearly belongs in another category. Is this expected? What are some options to resolve this? I could strip examples out of the heavy classes, but I don't want to reduce the size of my dataset either.

I know that there are a lot of questions in this post, but I'm trying to provide as much context as I can. I'm really excited to learn how to use this API, and I'm particularly excited about how accessible it is (you don't need to be a data scientist to learn how to use it).

Any help is greatly appreciated. Thanks in advance.

To get an embedding, you send a text string to the embeddings API endpoint with an embedding model ID (like text-embedding-ada-002). Your training dataset is likely not just a single text string, so it wouldn’t work as-is.
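For reference, a single call might look roughly like this with the pre-1.0 `openai` Python library (newer library versions expose the same endpoint through a client object instead); the API key and email text below are placeholders:

```python
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

# One email = one text string = one embedding vector.
email_text = "subject: Levi this week\nbody: Good morning, could Levi do a full Thursday this week?"

response = openai.Embedding.create(
    model="text-embedding-ada-002",
    input=email_text,
)

vector = response["data"][0]["embedding"]  # a list of 1536 floats for ada-002
print(len(vector))
```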

You don't necessarily need 2 calls. You can use only embeddings for classification, as described in this example: Classification using the embedding features. In other words, you could classify emails into categories using just embeddings.
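A rough sketch of the embedding-only idea, using nearest labeled examples; `labeled_examples` is a placeholder for your (email text, category) pairs, and this is just one simple way to do it, not the only approach from that example:

```python
import numpy as np
import openai

def embed(text):
    # Wrapper around the embeddings call shown above (pre-1.0 openai library).
    response = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return np.array(response["data"][0]["embedding"])

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# labeled_examples: list of (email_text, category) pairs from your training set (placeholder).
labeled_vectors = [(embed(text), category) for text, category in labeled_examples]

def classify(new_email_text, k=5):
    """Return the most common category among the k most similar labeled emails."""
    query = embed(new_email_text)
    scored = sorted(
        ((cosine_similarity(query, vec), cat) for vec, cat in labeled_vectors),
        reverse=True,
    )
    top_categories = [cat for _, cat in scored[:k]]
    return max(set(top_categories), key=top_categories.count)
```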

If you really wanted to, you could also do something like get the 5 most likely categories with embeddings, and then send just those 5 over to your fine-tuned model, but I don't know if this would improve accuracy. It would be more time-consuming and expensive.
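If you did go that route, the second stage might look roughly like this; the shortlist would come from an embeddings step like the one above, the model ID is a placeholder, and note that adding a candidate list changes the prompt away from your fine-tuning format, so treat this only as an illustration of the idea:

```python
import openai

def classify_two_stage(email_text, shortlist):
    """shortlist: category numbers pre-selected with embeddings (e.g. the top 5)."""
    prompt = f"{email_text}\ncandidate categories: {shortlist}\n\n###\n\n"
    completion = openai.Completion.create(
        model="YOUR_FINE_TUNED_MODEL_ID",  # placeholder
        prompt=prompt,
        max_tokens=1,
        temperature=0,
    )
    return completion["choices"][0]["text"].strip()
```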

You could artificially generate examples for the categories that only have 60+ examples to get everything up to the recommended 100+ examples per category, and then pare down the ones with 150-200 examples to that same 100+ level.
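A sketch of the generation side, since you've already been synthesizing emails with the API; the model choice, prompt wording, and category description are all placeholders:

```python
import openai

def synthesize_example(category_description):
    """Generate one synthetic email for a thin category."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You write realistic emails sent to a childcare facility."},
            {"role": "user", "content": f"Write a short parent email about: {category_description}"},
        ],
        temperature=1.0,
    )
    return response["choices"][0]["message"]["content"]
```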

That is very helpful, thank you.

Re: single text string -
This is an example of one of the datapoints in my training dataset:
{"prompt":"subject: Levi this week \nbody:Good morning, I hope you had a wonderful little break with your family! I was wondering if Levi could do a full Thursday this week? No worries if that wont work. I'll be out of town Wednesday & Thursday so I'm going to have him skip tomorrow either way to stay with me. Thank you! Have a great day! Eryn \n\n###\n\n","completion":" 3"}

That seems like a single text string to me (if I were to extract the prompt). Am I mistaken?

Yes, the line starting with “subject” would be a text string.
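If it helps, pulling just those text strings out of the fine-tuning JSONL file (so they can be sent to the embeddings endpoint) could look something like this; the file name is a placeholder:

```python
import json

texts, labels = [], []
with open("training_data.jsonl") as f:  # placeholder file name
    for line in f:
        record = json.loads(line)
        # Strip the fine-tuning separator; what's left is the plain email text.
        texts.append(record["prompt"].replace("\n\n###\n\n", "").strip())
        labels.append(int(record["completion"]))
```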


As someone who struggled for months trying to understand embeddings and vectorization, I will share with you my point of view. I don’t think you can really understand what embeddings are and how to use them until you understand where they fit in the query flow. Here are some diagrams that do just that:

[diagrams showing where embeddings fit in the query flow; not reproduced here]

So, when you say you want to “embed” your data, you’re really saying you want to “vectorize” your data, you want to create a “vector store” of your data. You then run your queries against this vector store to get back the most relevant results.
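To make that flow concrete, here is a bare-bones, in-memory version of the idea; `embed()` is a placeholder for a call to an embedding model (e.g. ada-002), and a real project would use a proper vector database as suggested below:

```python
import numpy as np

# A minimal "vector store": each entry keeps the vector plus the original text.
store = []

def upsert(text, metadata=None):
    store.append({"vector": embed(text), "text": text, "metadata": metadata})

def query(text, top_k=3):
    """Return the top_k stored entries most similar to the query text."""
    q = embed(text)
    def score(entry):
        v = entry["vector"]
        return np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v))
    return sorted(store, key=score, reverse=True)[:top_k]
```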

Here are some videos that talk about this process (data embedding → data query) from start to finish:

My suggestion: Try Weaviate. Create a sandbox cluster and upsert some test data and run some test queries against it. The video link above walks you through this.
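As a very rough sketch of what that looks like, assuming the v3 weaviate-client Python library and a sandbox cluster with the text2vec-openai module enabled; the URL, keys, class name, and properties are all placeholders:

```python
import weaviate

client = weaviate.Client(
    url="https://your-sandbox.weaviate.network",          # placeholder
    auth_client_secret=weaviate.AuthApiKey("YOUR_WEAVIATE_API_KEY"),
    additional_headers={"X-OpenAI-Api-Key": "YOUR_OPENAI_API_KEY"},
)

# Define a class whose objects get vectorized automatically with OpenAI embeddings.
client.schema.create_class({"class": "Email", "vectorizer": "text2vec-openai"})

# Upsert a test object.
client.data_object.create(
    {"subject": "Levi this week", "body": "Could Levi do a full Thursday?", "category": 3},
    "Email",
)

# Semantic query: the stored emails most similar in meaning to a text phrase.
result = (
    client.query.get("Email", ["subject", "category"])
    .with_near_text({"concepts": ["schedule change for Thursday"]})
    .with_limit(5)
    .do()
)
print(result)
```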

To be clear, you use an embedding model (e.g. ada-002) to vectorize your data, but you still need somewhere to store those vectors. You can do it internally or use a hosted database – but your vectors, your embeds, are NOT stored on the OpenAI embed model (if that is what you were thinking).

If you watched this video, https://www.youtube.com/watch?v=9qq6HTr7Ocw&t=110s&ab_channel=DavidShapiro~AI, then what you are looking for is “semantic search” with embeddings.

Just my humble opinion.


Just from skimming this, a ton of misconceptions were cleared up for me. I'm sure that digging into the resources you provided will put me on the right track. A very deep and heartfelt thank you for providing such a detailed answer.


No problem. I just sent a couple of "thank you" posts to some of the folks who helped me out along the way. It's a lot to grasp all at once, but it will eventually make sense. Good luck!

For anyone that stumbles across this:

I decided to continue fine tuning, following the suggestion by @freddybussler to even out my dataset by synthesizing more labeled data in the categories that were light, so that almost every category has 100 examples. I did this in place of generating embeddings, knowing that I could add embeddings later if the additional data did not deliver results.

There are 2 categories which still have > 100 examples, and I didn't trim them. I fine tuned a curie model at various epochs and noticed a roughly linear improvement in model accuracy and consistency up to 4 epochs. There doesn't seem to be much of a difference between 4 and 8, except that the 8-epoch model gets jittery more quickly as I increase the temperature when prompting it.
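For anyone curious, kicking off one of those runs with an explicit epoch count looked roughly like this with the pre-1.0 openai Python library and the legacy fine-tunes endpoint; the API key and file name are placeholders:

```python
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

# Upload the balanced JSONL training file (file name is a placeholder).
training_file = openai.File.create(
    file=open("balanced_training_data.jsonl", "rb"),
    purpose="fine-tune",
)

# Kick off a curie fine-tune with an explicit epoch count.
fine_tune = openai.FineTune.create(
    training_file=training_file["id"],
    model="curie",
    n_epochs=4,
)
print(fine_tune["id"])
```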

Rounding out the dataset dramatically improved performance. I will most likely need/want embeddings when I implement automatic responses, but that will come later, which gives me more time to better grasp embeddings and how to use them.

I hope this helps anyone who stumbles across this thread when googling about fine tuning for categorization.
