I have a question about the usage of the embedding model text-embedding-ada-002. Is it possible to fine-tune this model? I could only find examples for fine-tuning the prompt models; however, extracting embeddings from prompt models is forbidden.
We have an in-house recommendation model to match A and B (both are long texts; we first get their embeddings and then use a two-tower model trained on A-B pairs to do the ranking), and we would like to test the performance using GPT-3 to initialize the embeddings for A and B. Ideally, fine-tuning the embeddings with positive and negative A-B pairs should get even better performance.
From the API docs (which I have also confirmed via testing):
Fine-tuning is currently only available for the following base models: davinci, curie, babbage, and ada. These are the original models that do not have any instruction-following training (like text-davinci-003 does, for example).
Raw GPT-3 embeddings can already be used in a two-tower model and return a reasonable result. This is because the more critical part of a two-tower model is the embedding, compared to the NN layers after it.
The reason one could benefit from fine-tuning the original GPT-3 embeddings is that the raw embeddings might not have been exposed to the specific task or the subdomain knowledge.
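To make the two-tower setup above concrete, here is a minimal sketch: frozen pre-computed embeddings feed into one small linear layer per tower, and ranking scores are dot products in the shared space. The dimensions and random vectors are stand-ins, not real ada-002 output, and the weights would in practice be trained on positive/negative A-B pairs.

```python
import numpy as np

def tower(x, W):
    # one linear layer per tower, followed by L2 normalization
    h = x @ W
    return h / np.linalg.norm(h, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
# stand-ins for pre-computed GPT-3 embeddings of 4 A-texts and 4 B-texts
emb_a = rng.normal(size=(4, 8))
emb_b = rng.normal(size=(4, 8))
# tower weights; in practice these are learned from labelled A-B pairs
W_a = rng.normal(size=(8, 4))
W_b = rng.normal(size=(8, 4))

# scores[i, j] is the cosine score of A-text i against B-text j in the shared space;
# rank the B candidates for each A by sorting its row
scores = tower(emb_a, W_a) @ tower(emb_b, W_b).T
```

Because both tower outputs are L2-normalized, every score stays in [-1, 1], which makes thresholds comparable across queries.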
A foo-bar example would be: imagine there is a limited corpus with only 2 entries,
[“machine operation”, “artificial intelligence”]
And we want to find the entry most similar to an input of ‘machine learning’. Similarity calculation using raw GPT-3 embeddings returned:
- ‘machine learning’ vs ‘machine operation’: sim score 0.87
- ‘machine learning’ vs ‘artificial intelligence’: sim score 0.88
Both scores make sense: the first reflects letter overlap and the second reflects semantic meaning. But in my use case the first type of similarity would introduce noise. I managed to fix it in the reply to vamsi. Please feel free to have a look and see if it makes sense to you.
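For reference, sim scores like the ones quoted above are plain cosine similarities between embedding vectors. A minimal version of the calculation, using toy vectors rather than real ada-002 output:

```python
import numpy as np

def cosine_sim(u, v):
    # cosine similarity: dot product of the vectors divided by their norms
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# toy 3-d vectors standing in for two embedding outputs
a = np.array([0.9, 0.1, 0.4])
b = np.array([0.8, 0.2, 0.5])
print(round(cosine_sim(a, b), 3))
```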
At a high level I understand what you are saying, which is: you need high scores on semantic meaning and not word overlap. Got it. Then you say you can achieve this with a NN (two-tower). Got it. Then you say the fine-tuned embedding is the output of your NN. Got it. All of this is fine and doesn’t need a direct fine-tune of the original embedding engine, since you are creating the embeddings as the output of your NN. I think you answered your own question: yes, you can create a fine-tuned embedding, which is the output of your own neural net. Totally feasible and makes sense. But you can’t upload a training file to the OpenAI API for text-embedding-ada-002 and get the same thing, which is what I thought your original post was about.
Hi @ray001, could you please share how your dataset is built and how it is trained, especially the good/bad-fit labelled data? I have a problem statement where I need to return similar accessories from the dataset (which contains the name and color of each accessory) for a user query (a type of accessory). What changes need to be made to my dataset, and what else do I need to consider? (P.S. The similarity between the query and the similar results from the dataset is currently not very good.)
Hi @k0rthik, the dataset was generated with human labelling. It looks like the training dataset for your project would be pairs of accessories: for example, 50% of which are similar ones and 50% of which are not (including both easy and hard cases). You can try using vanilla embeddings from text-embedding-ada-002 and see how it looks.
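A rough way to assemble such a balanced pair dataset is sketched below. The accessory names and similarity groups are made up for illustration; in practice the positive groups come from your human labelling, and hard negatives (near-misses) are worth adding by hand.

```python
import random

random.seed(0)

# hypothetical labelled data: each group contains accessories known to be similar
similar_groups = [
    ["red leather belt", "crimson leather belt"],
    ["silver hoop earrings", "silver ring earrings"],
    ["blue canvas tote", "navy canvas tote bag"],
]

# positive pairs (label 1): every within-group pair, deduplicated via a < b
positives = [(a, b, 1)
             for group in similar_groups
             for a in group for b in group if a < b]

# negative pairs (label 0): items from different groups, sampled until balanced
all_items = [item for group in similar_groups for item in group]
negatives = []
while len(negatives) < len(positives):
    a, b = random.sample(all_items, 2)
    if not any(a in g and b in g for g in similar_groups):
        negatives.append((a, b, 0))

dataset = positives + negatives
random.shuffle(dataset)
```

The 50/50 balance matches the split suggested above; the random cross-group negatives are the "easy" cases, and hard cases would be pairs that share words or colors but are not actually similar.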
@ray001 @k0rthik - Can you please share the text-embedding-ada-002 source? I am unable to find it. Also, if possible, kindly share a reference notebook link showing how to fine-tune over it; we are struggling with that.
I’m running a vector database for PC games based on OpenAI embeddings. The use case for me is that searching for the nearest nodes to “ace combat” returns “ace academy” before “ACE COMBAT™ 7: SKIES UNKNOWN” (first and second place, respectively). This is for an embedding that’s 100% weighted on the title. As far as I can tell, there’s no other smart tuning I can do to make this return the correct result. The embeddings themselves are “wrong” and need to be “tuned.” Maybe the problem is that the embedding model thinks combat and academy are synonyms. This is the type of variable I’d like to be able to adjust so that the embedding can be generated more literally in cases where I’d want that.
ACE COMBAT 7 SKIES UNKNOWN contains more than one concept (“skies” and “unknown”) and thus might be a poorer match – embeddings try to capture concepts, not words, and when there’s more than one concept, the embedding vector ends up between the points of those concepts in vector space.
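That “in between” effect is easy to see with toy vectors: average two orthogonal concept directions and the result is equally, and only moderately, similar to each one. These 2-d vectors are stand-ins for real embedding dimensions.

```python
import numpy as np

# toy unit vectors standing in for two unrelated concepts
sky = np.array([1.0, 0.0])
unknown = np.array([0.0, 1.0])

# a text containing both concepts lands between them in vector space
mixed = (sky + unknown) / 2
mixed /= np.linalg.norm(mixed)

# cosine with each pure concept is ~0.707: close to both, identical to neither
print(float(mixed @ sky), float(mixed @ unknown))
```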
When I have a similar problem, I end up doing hybrid retrieval – both keyword-based (important word matches, where I have a good idea of what the “important words” are) and embedding-based.
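One simple form of that hybrid is a weighted sum of a keyword-overlap score and an embedding cosine score. This is a sketch, not anyone’s production code: the overlap measure is deliberately crude, and the `alpha` weight is arbitrary and would need tuning on held-out queries.

```python
import numpy as np

def keyword_score(query, title):
    # fraction of query words that appear in the title (a crude keyword signal)
    q = set(query.lower().split())
    t = set(title.lower().split())
    return len(q & t) / len(q)

def hybrid_score(query, title, emb_q, emb_t, alpha=0.5):
    # alpha balances exact word matches against embedding similarity
    cos = float(emb_q @ emb_t / (np.linalg.norm(emb_q) * np.linalg.norm(emb_t)))
    return alpha * keyword_score(query, title) + (1 - alpha) * cos
```

On the example above, “ace combat 7 skies unknown” gets keyword_score 1.0 for the query “ace combat” while “ace academy” only gets 0.5, so the keyword term pushes the exact-title match back to the top even when the embedding term prefers the shorter title.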
I have also done a hybrid model, and found that running embedding and BM25 retrieval on different threads in Python and merging them myself, rather than using the alpha (for example in Weaviate), performed better. But I am posting this since I am wondering where this all goes – we need more ideas to make search accurate.