I have a question about the usage of the embedding model text-embedding-ada-002. Is it possible to fine-tune this model? I could only find examples for fine-tuning the prompt models; however, extracting embeddings from prompt models is forbidden.
We have an in-house recommendation model to match A and B (both are long texts; we first get their embeddings and then use a two-tower model trained on A-B pairs to do the ranking), and we would like to test the performance using GPT-3 to initialize the embeddings for A and B. Ideally, fine-tuning the embeddings with positive and negative A-B pairs should get even better performance.
From the API docs (which I have also confirmed via testing):
Fine-tuning is currently only available for the following base models: davinci, curie, babbage, and ada. These are the original models that do not have any instruction-following training (like text-davinci-003 does, for example).
Raw GPT-3 embeddings can already be used in a two-tower model and return a reasonable result. This is because the more critical part of a two-tower model is the embedding, compared to the NN layers after it.
The reason one could benefit from fine-tuning the original GPT-3 embeddings is that the raw embeddings might not have been exposed to the specific task or the subdomain knowledge.
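To make the two-tower setup above concrete, here is a minimal sketch: frozen pre-computed embeddings feed into one small linear layer per tower, and ranking scores are dot products in the shared space. The dimensions and random vectors are stand-ins, not real ada-002 output, and the weights would in practice be trained on positive/negative A-B pairs.

```python
import numpy as np

def tower(x, W):
    # one linear layer per tower, followed by L2 normalization
    h = x @ W
    return h / np.linalg.norm(h, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
# stand-ins for pre-computed GPT-3 embeddings of 4 A-texts and 4 B-texts
emb_a = rng.normal(size=(4, 8))
emb_b = rng.normal(size=(4, 8))
# tower weights; in practice these are learned from labelled A-B pairs
W_a = rng.normal(size=(8, 4))
W_b = rng.normal(size=(8, 4))

# scores[i, j] is the cosine score of A-text i against B-text j in the shared space;
# rank the B candidates for each A by sorting its row
scores = tower(emb_a, W_a) @ tower(emb_b, W_b).T
```

Because both tower outputs are L2-normalized, every score stays in [-1, 1], which makes thresholds comparable across queries.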
A foo-bar example would be: imagine there is a limited corpus with only 2 entries,
[“machine operation”, “artificial intelligence”]
And we want to find the entry most similar to an input of ‘machine learning’. Similarity calculation using raw GPT-3 embeddings returned:
- ‘machine learning’ vs ‘machine operation’: sim score 0.87
- ‘machine learning’ vs ‘artificial intelligence’: sim score 0.88
Both scores make sense: the first reflects letter overlap and the second reflects semantic meaning. But in my use case the first type of similarity would introduce noise. I managed to fix it in the reply to vamsi. Please feel free to have a look and see if it makes sense to you.
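For reference, sim scores like the ones quoted above are plain cosine similarities between embedding vectors. A minimal version of the calculation, using toy vectors rather than real ada-002 output:

```python
import numpy as np

def cosine_sim(u, v):
    # cosine similarity: dot product of the vectors divided by their norms
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# toy 3-d vectors standing in for two embedding outputs
a = np.array([0.9, 0.1, 0.4])
b = np.array([0.8, 0.2, 0.5])
print(round(cosine_sim(a, b), 3))
```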
At a high level I understand what you are saying, which is: you need high scores on semantic meaning and not word overlap. Got it. Then you say you can achieve this with a NN (two-tower). Got it. Then you say the fine-tuned embedding is the output of your NN. Got it. All of this is fine and doesn’t need a direct fine-tune of the original embedding engine, since you are creating the embeddings as the output of your NN. I think you answered your own question: yes, you can create a fine-tuned embedding, which is the output of your own neural net. Totally feasible and makes sense. But you can’t upload a training file to the OpenAI API for text-embedding-ada-002 and get the same thing, which is what I thought your original post was about.
Hi @ray001, could you please share how your dataset is built and how it is trained, especially the good/bad-fit labelled data? I have a problem statement where I need to return similar accessories from the dataset (which contains the name and color of each accessory) for a user query (a type of accessory). What changes need to be made to my dataset, and what else do I need to consider? (P.S. The similarity between the query and the similar results from the dataset is currently not very good.)
Hi @k0rthik, the dataset was generated with human labelling. It looks like the training dataset for your project would be pairs of accessories: for example, 50% of which are similar ones and 50% of which are not (including both easy and hard cases). You can try using vanilla embeddings from text-embedding-ada-002 and see how it looks.
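A rough way to assemble such a balanced pair dataset is sketched below. The accessory names and similarity groups are made up for illustration; in practice the positive groups come from your human labelling, and hard negatives (near-misses) are worth adding by hand.

```python
import random

random.seed(0)

# hypothetical labelled data: each group contains accessories known to be similar
similar_groups = [
    ["red leather belt", "crimson leather belt"],
    ["silver hoop earrings", "silver ring earrings"],
    ["blue canvas tote", "navy canvas tote bag"],
]

# positive pairs (label 1): every within-group pair, deduplicated via a < b
positives = [(a, b, 1)
             for group in similar_groups
             for a in group for b in group if a < b]

# negative pairs (label 0): items from different groups, sampled until balanced
all_items = [item for group in similar_groups for item in group]
negatives = []
while len(negatives) < len(positives):
    a, b = random.sample(all_items, 2)
    if not any(a in g and b in g for g in similar_groups):
        negatives.append((a, b, 0))

dataset = positives + negatives
random.shuffle(dataset)
```

The 50/50 balance matches the split suggested above; the random cross-group negatives are the "easy" cases, and hard cases would be pairs that share words or colors but are not actually similar.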
@ray001 @k0rthik - Can you please share the text-embedding-ada-002 source? I am unable to find it. Also, if possible, kindly share a reference notebook link showing how to fine-tune over it; we are struggling with that.
I’m running a vector database for PC games based on OpenAI embeddings. The use case for me is that searching for the nearest nodes to “ace combat” returns “ace academy” before “ACE COMBAT™ 7: SKIES UNKNOWN” (first and second place, respectively). This is for an embedding that’s 100% weighted on the title. As far as I can tell, there’s no other smart tuning I can do to make this return the correct result. The embeddings themselves are “wrong” and need to be “tuned.” Maybe the problem is that the embedding model thinks combat and academy are synonyms. This is the type of variable I’d like to be able to adjust so that the embedding can be generated more literally in cases where I’d want that.
ACE COMBAT 7 SKIES UNKNOWN contains more than one concept (“skies” and “unknown”) and thus might be a poorer match – embeddings try to capture concepts, not words, and when there’s more than one concept, the embedding vector ends up between the points of those concepts in vector space.
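That “in between” effect is easy to see with toy vectors: average two orthogonal concept directions and the result is equally, and only moderately, similar to each one. These 2-d vectors are stand-ins for real embedding dimensions.

```python
import numpy as np

# toy unit vectors standing in for two unrelated concepts
sky = np.array([1.0, 0.0])
unknown = np.array([0.0, 1.0])

# a text containing both concepts lands between them in vector space
mixed = (sky + unknown) / 2
mixed /= np.linalg.norm(mixed)

# cosine with each pure concept is ~0.707: close to both, identical to neither
print(float(mixed @ sky), float(mixed @ unknown))
```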
When I have a similar problem, I end up doing hybrid retrieval – both keyword-based (important word matches, where I have a good idea of what the “important words” are) and embedding-based.
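One simple form of that hybrid is a weighted sum of a keyword-overlap score and an embedding cosine score. This is a sketch, not anyone’s production code: the overlap measure is deliberately crude, and the `alpha` weight is arbitrary and would need tuning on held-out queries.

```python
import numpy as np

def keyword_score(query, title):
    # fraction of query words that appear in the title (a crude keyword signal)
    q = set(query.lower().split())
    t = set(title.lower().split())
    return len(q & t) / len(q)

def hybrid_score(query, title, emb_q, emb_t, alpha=0.5):
    # alpha balances exact word matches against embedding similarity
    cos = float(emb_q @ emb_t / (np.linalg.norm(emb_q) * np.linalg.norm(emb_t)))
    return alpha * keyword_score(query, title) + (1 - alpha) * cos
```

On the example above, “ace combat 7 skies unknown” gets keyword_score 1.0 for the query “ace combat” while “ace academy” only gets 0.5, so the keyword term pushes the exact-title match back to the top even when the embedding term prefers the shorter title.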
I have also done a hybrid model, and found that running embedding and BM25 retrieval on different threads in Python and merging them myself, rather than using the alpha (for example in Weaviate), performed better. But I am posting this since I am wondering where this all goes – we need more ideas to make search accurate.