Is it possible to fine tune the embedding model?

Greetings OpenAI community!

I have a question about the usage of the embedding model text-embedding-ada-002. Is it possible to fine-tune this model? I could only find examples for fine-tuning the prompt models; however, extracting embeddings from prompt models is not allowed.

1 Like

What’s your use case? I don’t see why you would need to fine-tune embeddings.

1 Like

We have an in-house recommendation model to match A and B (both are long text; we first get their embeddings and then use a two-tower model trained with A-B pairs to do the ranking), and we would like to test the performance of using GPT-3 to initialize the embeddings for A and B. Ideally, fine-tuning the embeddings with positive and negative A-B pairs should get even better performance.
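For readers unfamiliar with the setup, here is a minimal sketch of what such a two-tower ranker can look like on top of precomputed embeddings. This assumes PyTorch and 1536-dimensional ada-002 vectors; the class names and layer sizes are illustrative, not the poster's actual model.

```python
# Minimal two-tower sketch over precomputed ada-002 embeddings (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class Tower(nn.Module):
    """Small MLP that projects a raw 1536-dim embedding to a task space."""
    def __init__(self, in_dim=1536, out_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(),
            nn.Linear(512, out_dim),
        )

    def forward(self, x):
        return self.net(x)

class TwoTower(nn.Module):
    def __init__(self):
        super().__init__()
        self.tower_a = Tower()  # processes embeddings of A
        self.tower_b = Tower()  # processes embeddings of B

    def forward(self, emb_a, emb_b):
        # Ranking score = cosine similarity of the two projected vectors.
        za = F.normalize(self.tower_a(emb_a), dim=-1)
        zb = F.normalize(self.tower_b(emb_b), dim=-1)
        return (za * zb).sum(dim=-1)
```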

Hello @ray001

From the API docs (which I have also confirmed via testing):

Fine-tuning is currently only available for the following base models: davinci, curie, babbage, and ada. These are the original models that do not have any instruction-following training (like text-davinci-003 does, for example).

Reference:

OpenAI (Beta) - Fine-tuning

Hope this helps.

1 Like

Thanks for the information. I also found this page, was wondering if anyone found alternatives.

There are no “alternatives”.

@ray001 Did you end up finding a way to fine-tune ada? I am trying to do the exact thing that you wanted to and would love to know if you’ve figured it out. Thanks!

Perhaps this bias matrix approach will be of use to some inquiring here.

“This notebook demonstrates one way to customize OpenAI embeddings to a particular task.”

1 Like

From what I understand, the two-tower model is just a neural network on top of the embeddings, so why do you need to tune the original embedding model? You need to create another NN.

Here is the engine eBay uses:

There are some alternatives.
For example, in a two-tower model that takes embeddings of entities, you can:

  1. get the raw GPT-3 embeddings for those entities
  2. apply a set of CNN+FC layers to the original embeddings
  3. guide the training of the layers in step 2 with good/bad-fit labeled data
  4. use the raw GPT-3 embeddings processed by the CNN+FC layers trained in step 3 as the fine-tuned embeddings (see the sketch below)
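A rough sketch of steps 2-4, under stated assumptions: PyTorch, a Conv1d + linear adapter over frozen GPT-3 embeddings, and a binary good/bad-fit label per pair. The layer shapes, scaling, and loss are illustrative, not the poster's actual training code.

```python
# Sketch of steps 2-4: trainable layers over frozen GPT-3 embeddings,
# guided by good/bad-fit labels. Shapes and hyperparameters are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingAdapter(nn.Module):
    """Maps a frozen 1536-dim embedding to a task-tuned 256-dim embedding."""
    def __init__(self, in_dim=1536, out_dim=256):
        super().__init__()
        self.conv = nn.Conv1d(1, 8, kernel_size=5, padding=2)  # step 2: CNN
        self.fc = nn.Linear(8 * in_dim, out_dim)                # step 2: FC

    def forward(self, x):                          # x: (batch, 1536)
        h = torch.relu(self.conv(x.unsqueeze(1)))  # (batch, 8, 1536)
        return self.fc(h.flatten(1))               # (batch, 256)

adapter_a, adapter_b = EmbeddingAdapter(), EmbeddingAdapter()
opt = torch.optim.Adam(
    list(adapter_a.parameters()) + list(adapter_b.parameters()), lr=1e-4
)
loss_fn = nn.BCEWithLogitsLoss()

def train_step(emb_a, emb_b, label):
    # label: float tensor, 1.0 = good fit, 0.0 = bad fit (step 3).
    za = F.normalize(adapter_a(emb_a), dim=-1)
    zb = F.normalize(adapter_b(emb_b), dim=-1)
    logits = (za * zb).sum(dim=-1) * 10.0  # scaled cosine similarity as the logit
    loss = loss_fn(logits, label)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Step 4: after training, adapter_a(raw_gpt3_embedding) is the "fine-tuned" embedding.
```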
2 Likes

Raw GPT-3 embeddings can already be used in a two-tower model and return a reasonable result, because the more critical part of a two-tower model is the embedding itself rather than the NN layers that follow.

The reason one could still benefit from fine-tuning the original GPT-3 embeddings is that the raw embeddings might not have been exposed to the specific task or the subdomain knowledge.

A foo-bar example: imagine there is a limited corpus with only two entries,
[“machine operation”, “artificial intelligence”]
and we want to find the entry most similar to the input “machine learning”. A similarity calculation using raw GPT-3 embeddings returned that
machine learning and machine operation have a similarity score of 0.87, while
machine learning and artificial intelligence have a similarity score of 0.88.
Both scores make sense: the first comes from letter overlap and the second from semantic meaning. But in my use case the first type of similarity would introduce noise. I managed to fix it in the reply to vamsi; please feel free to have a look and see if it makes sense to you.
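For anyone who wants to reproduce this kind of check, here is a small sketch using the pre-1.0 `openai` Python library with an API key in the environment. The exact scores will depend on the model version; the numbers above are the poster's.

```python
# Small sketch: compare raw ada-002 similarities for the corpus above.
# Assumes the pre-1.0 `openai` Python library and OPENAI_API_KEY set in the env.
import numpy as np
import openai

def embed(text):
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return np.array(resp["data"][0]["embedding"])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = embed("machine learning")
for candidate in ["machine operation", "artificial intelligence"]:
    print(candidate, round(cosine(query, embed(candidate)), 2))
```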

At a high level I understand what you are saying: you need high scores on semantic meaning, not word overlap. Got it. Then you say you can achieve this with a NN (two-tower). Got it. Then you say the fine-tuned embedding is the output of your NN. Got it. All of this is fine and doesn’t need a direct fine-tune of the original embedding engine, since you are creating the embeddings as the output of your NN. I think you answered your own question: yes, you can create a fine-tuned embedding, produced by the output of your own neural net. Totally feasible and makes sense. But you can’t upload a training file to the OpenAI API for text-embedding-ada-002 and get the same thing, which is what I thought your original post was about.

And FYI, you can improve the geometry of the embeddings too; I did this in this thread: Some questions about text-embedding-ada-002’s embedding - #42 by curt.kennedy

It removes the mean embedding vector and uses PCA to reduce the dimensions and increase the spread without altering the meaning too much.
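A rough sketch of that post-processing, assuming NumPy and scikit-learn; this is one reading of the approach described above, not the exact code from the linked thread.

```python
# Post-processing sketch: subtract the mean vector, reduce with PCA,
# then re-normalize. One reading of the linked approach, not its exact code.
import numpy as np
from sklearn.decomposition import PCA

def postprocess(embeddings, n_components=256):
    X = np.asarray(embeddings)                 # (n_items, 1536)
    X = X - X.mean(axis=0, keepdims=True)      # remove the mean embedding vector
    # n_components must not exceed the number of items in the collection.
    X = PCA(n_components=n_components).fit_transform(X)   # reduce dimensions
    return X / np.linalg.norm(X, axis=1, keepdims=True)   # back to unit length
```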

So yeah, post-processing of the embeddings is certainly advised and encouraged in certain situations.

1 Like

Hi @ray001, could you please share how your dataset is built and how the model is trained, especially with the good/bad-fit labeled data? I have a problem statement where I need to return similar accessories from the dataset (which contains the name and color of each accessory) for a user query (a type of accessory). What changes need to be made to my dataset, and what else do I need to consider? (P.S. The similarity between the query and the results from the dataset is currently not very good.)

Thanks,

Hi @k0rthik, the dataset was generated with human labelling. It looks like the training dataset for your project would be pairs of accessories: for example, 50% similar pairs and 50% dissimilar pairs (covering both easy and hard cases). You can first try vanilla embeddings from text-embedding-ada-002 and see how they perform (see the sketch below).
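A sketch of how such a pair dataset might be laid out. The accessory names and the easy/hard negative choices are made up for illustration; real labels would come from human review.

```python
# Illustrative pair dataset: 50% similar, 50% dissimilar (easy and hard negatives).
pairs = [
    {"a": "red leather wallet", "b": "crimson leather wallet", "label": 1},  # similar
    {"a": "red leather wallet", "b": "red leather belt",       "label": 0},  # hard negative
    {"a": "blue running shoes", "b": "navy running sneakers",  "label": 1},  # similar
    {"a": "blue running shoes", "b": "wooden coffee table",    "label": 0},  # easy negative
]
# Baseline: embed each side with text-embedding-ada-002 and check whether the raw
# cosine similarity already separates label 1 from label 0 before any tuning.
```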

@ray001 @k0rthik - Can you please share the text-embedding-ada-002 source? I am unable to find it. Also, if possible, kindly share a reference notebook link for fine-tuning on top of it; we are struggling with that.

I’m running a vector database for PC games based on openai embeddings. The use case for me is that searching for nearest nodes for “ace combat” returns “ace academy” before “ACE COMBAT™ 7: SKIES UNKNOWN” (first and second place, respectively). This is for an embedding that’s 100% weighted on title. As far as I can tell, there’s no other smart tuning I can do to make this return the correct result. The embeddings themselves are “wrong” and need to be “tuned.” Maybe the problem is that the embedding model thinks combat and academy are synonyms. This is the type of variable I’d like to be able to adjust so that the embedding can be generated more literally in cases where I’d want that.

ACE COMBAT 7 SKIES UNKNOWN contains more than one concept (“skies” and “unknown”) and thus might be a poorer match: embeddings try to capture concepts, not words, and when there’s more than one concept, the embedding vector ends up in between the points of those concepts in vector space.

When I have a similar problem, I end up doing hybrid retrieval – both keyword based (important word matches, where I have a good idea of what’s “important words”) and embedding based.

I have also done a hybrid model, and I found that running the embedding search and BM25 on different threads in Python and merging them myself, rather than using the alpha parameter (for example in Weaviate), performed better. But I’m posting this since I am wondering where this all goes; we need more ideas to make search accurate.
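A sketch of the kind of manual merge described above, assuming the `rank_bm25` package on the keyword side and precomputed embeddings on the semantic side. The min-max normalization and the fixed weight are illustrative, not the poster's code, and the threading mentioned above is omitted for brevity.

```python
# Sketch of a manual hybrid merge: BM25 scores + embedding cosine scores,
# min-max normalized and combined with a hand-picked weight. Illustrative only.
import numpy as np
from rank_bm25 import BM25Okapi

def normalize(scores):
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    return (scores - scores.min()) / span if span > 0 else np.zeros_like(scores)

def hybrid_search(query_tokens, query_emb, corpus_tokens, corpus_embs, weight=0.5):
    bm25 = BM25Okapi(corpus_tokens)                # keyword side
    kw = normalize(bm25.get_scores(query_tokens))
    sem = normalize(corpus_embs @ query_emb /      # embedding side (cosine)
                    (np.linalg.norm(corpus_embs, axis=1) * np.linalg.norm(query_emb)))
    combined = weight * kw + (1 - weight) * sem
    return np.argsort(-combined)                   # best documents first
```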

Does anyone know whether there are any advanced capabilities for fine-tuning the embedding model for one's own use, or any new ideas on how to get better results?

This approach totally makes sense. I’m curious whether it actually improved the performance of your recommender engine.