Anyone else testing the new V3 embedding models for QnA?

I am testing the new embedding models (text-embedding-3-small, text-embedding-3-large) and the results are very different from text-embedding-ada-002. I am not sure what to make of it.

Query

what is my favorite tom cruise movie?

Text1

My all time favorite action movie is Top Gun. I used to have a movie poster of Tom Cruise from Top Gun hanging in my room.

Text2

One of my favorite comedy movies is Tropic Thunder. It’s hilarious, especially with RDJ and Tom Cruise. The way Tom Cruise danced to the song Low was just epic!

Text3

Daruma is a delightful restaurant that serves yakiniku, Japanese style grilled meats, and jingisukan, a grilled lamb dish which is a Hokkaido delicacy. The food is not only delicious but also affordable.

Results

        ada                   small                  large
Text1   0.8526715878737584    0.5631829592887835     0.594466374201587
Text2   0.8326902960730671    0.5492379659408032     0.5156284404991819
Text3   0.6683329263351143    0.008682087570941398   0.05392372968797883

My threshold is around 0.7, so I will not get any results.
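For reference, scores like the ones above are plain cosine similarity between the query embedding and each text embedding. A minimal sketch of that computation is below; the toy vectors stand in for real embeddings, which you would fetch from the embeddings API (the `client.embeddings.create` call in the comment is an assumption about your setup, not code from this thread):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product of the vectors divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors; real ones would come from something like
#   client.embeddings.create(model="text-embedding-3-small", input=text)
query = [0.2, 0.7, 0.1]
text1 = [0.3, 0.6, 0.2]
print(cosine_similarity(query, text1))
```

Note that OpenAI embeddings come back unit-normalized, so the plain dot product gives the same number.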

I was thinking, maybe my query is not exact enough. So I tried "what is my favorite action movie?" and got the following for the small model.

            small
Text1 0.5733264320688367
Text2 0.4358654870000338
Text3 0.09655090917844694
4 Likes

Yeah, the new models have a completely different spread. With ada we almost never saw orthogonality (i.e. cosine similarity approaching 0); now we do with the new models, which I think is great.

you need a new cutoff!

[image: unit circle diagram, lifted from here: Unit circle - Wikipedia]

we’ve been abused by ADA for so long we forgot what cosims are actually supposed to represent :laughing:

that said, 0.56 isn’t amazing - but it really is all relative.
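One way to build intuition for that unit-circle picture: the score is literally cos(θ) for the angle between the two embedding vectors, so any score translates directly into degrees. A quick sketch, nothing model-specific:

```python
import math

def similarity_to_angle(cosim: float) -> float:
    """Angle in degrees between two unit vectors with this cosine similarity."""
    return math.degrees(math.acos(cosim))

print(similarity_to_angle(0.56))  # roughly 56 degrees apart
print(similarity_to_angle(0.85))  # a typical ada-002 "good match", ~32 degrees
```

Seen this way, ada-002's 0.7–1.0 band kept everything within about 45 degrees of everything else, while the V3 models use much more of the circle.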

4 Likes

Yes, I think you are right. Looking at the scores, the new models give better separation between the nearest answer and the irrelevant ones.

1 Like

Those results are encouraging!

I like the higher diversity, and ‘large’ seems to offer the most separation in the top 2 answers.

So eyeballing here, large looks the best.

In general, you should expect your values to range from -1 to +1.

It’s an anomaly that ada-002’s scores only ranged from about 0.7 to 1.0.

Your results look good!

Maybe the new thresh is 0.3 / 0.4 ??? You’d have to test.
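One way to run that test: label a handful of query/chunk pairs as relevant or not, then check how many pairs each candidate cutoff classifies correctly. The scores and labels below are made up for illustration:

```python
# (cosine similarity, human-judged relevance) -- hypothetical numbers
scored_pairs = [
    (0.59, True),
    (0.56, True),
    (0.44, False),
    (0.31, False),
    (0.05, False),
]

def accuracy_at(threshold: float) -> float:
    """Fraction of pairs where 'score >= threshold' matches the human label."""
    correct = sum((score >= threshold) == relevant
                  for score, relevant in scored_pairs)
    return correct / len(scored_pairs)

# Sweep a few candidate cutoffs and see which classifies the most pairs correctly
for t in (0.3, 0.4, 0.5, 0.6):
    print(t, accuracy_at(t))
```

With real labeled data you would sweep a finer grid, and weigh false positives against false negatives depending on how much irrelevant context your downstream call can tolerate.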

3 Likes

I think perhaps 0.5 cutoff, judging from the last result. I will test more :slight_smile:

4 Likes

I think even less. It depends on how specialized your queries are, and whether a query like “neutrino physics” would match 100 documents specifically about that topic.

Here are the results of a user query about GPTs, run against various GPT blog and embeddings docs chunked to around two paragraphs.

“Top-n in X tokens” on top of a reject threshold would work best.

 == Cosine similarity ==
0:"How to add documents to my own" <==> 0:"How to add documents to my own":
   1.00000000000000000000 - identical: True
1:"[1] Documentation does not sup" <==> 0:"How to add documents to my own":
   0.19778059389735827556 - identical: False
2:"[2] Both of our new embeddings" <==> 0:"How to add documents to my own":
   0.13452488168373297195 - identical: False
3:"[3] We’re rolling out custom v" <==> 0:"How to add documents to my own":
   0.43988936843141823729 - identical: False
4:"[4] GPTs let you customize Cha" <==> 0:"How to add documents to my own":
   0.45278086198655564942 - identical: False
5:"[5] The GPT Store is rolling o" <==> 0:"How to add documents to my own":
   0.44064855357381010892 - identical: False
6:"[6] We’ve set up new systems t" <==> 0:"How to add documents to my own":
   0.36558419871885522445 - identical: False
7:"[7] Developers can connect GPT" <==> 0:"How to add documents to my own":
   0.39966454268300044550 - identical: False
8:"[8] Since we launched ChatGPT " <==> 0:"How to add documents to my own":
   0.43432215026683346215 - identical: False
9:"[9] We want more people to sha" <==> 0:"How to add documents to my own":
   0.30948992336293501548 - identical: False
10:"[10] Creating a GPT

How to cr" <==> 0:"How to add documents to my own":
   0.55752453059361806176 - identical: False
11:"[11] Here’s how to create a GP" <==> 0:"How to add documents to my own":
   0.55959997969812447227 - identical: False
12:"[12] Advanced Settings

In the" <==> 0:"How to add documents to my own":
   0.49851405229663103835 - identical: False
13:"[13] Settings in the Configure" <==> 0:"How to add documents to my own":
   0.46590037610635792742 - identical: False
14:"[14] FAQ: 

Q: How many files " <==> 0:"How to add documents to my own":
   0.46340230123865322476 - identical: False

Granted, nothing I included is “to add your own documents…”; here’s what crossed the threshold of 0.5.

Parts 10 and 11 of 14 are >0.5.

%%

Creating a GPT

How to create a GPT

GPTs are custom versions of ChatGPT that users can tailor for specific tasks or topics by combining instructions, knowledge, and capabilities. They can be as simple or as complex as needed, addressing anything from language learning to technical support. Plus and Enterprise users can start creating GPTs at chat.openai.com/create.

%%
Here’s how to create a GPT:

Head to https://chat.openai.com/gpts/editor (or select your name and then “My GPTs”)

Select “Create a GPT”

In the Create tab, you can message the GPT Builder to help you build a new GPT. You can say something like, "Make a creative who helps generate visuals for new products" or "Make a software engineer who helps format my code."

To name and set the description of your GPT, head to the Configure tab. Here, you will also be able to select the actions you would like your GPT to take, like browsing the web or creating images.

When you’re ready to publish your GPT, select “Publish” and share it with other people if you’d like. 

Now you’ve created a GPT!
%%
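The “top-n in X tokens on top of a reject threshold” idea mentioned above could be sketched like this. The chunk scores and token counts are illustrative; in practice you would count tokens with a tokenizer like tiktoken:

```python
def select_chunks(scored_chunks, threshold=0.3, token_budget=1000):
    """scored_chunks: list of (score, token_count, text); returns texts to send."""
    kept, used = [], 0
    for score, tokens, text in sorted(scored_chunks, reverse=True):
        if score < threshold:
            break                 # sorted descending, so nothing later passes
        if used + tokens > token_budget:
            continue              # too big for the remaining budget
        kept.append(text)
        used += tokens
    return kept

chunks = [(0.56, 400, "part 11"), (0.55, 500, "part 10"),
          (0.45, 300, "part 4"), (0.13, 200, "part 2")]
print(select_chunks(chunks))  # -> ['part 11', 'part 10']
```

This rejects clearly irrelevant chunks outright while letting the token budget, rather than a fixed n, decide how many of the survivors get packed into the prompt.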

3 Likes

Yes, you are right. After adding more data and asking many questions, from vague to direct, a good cutoff is around 0.3. Even if the results include some irrelevant data, the final chat completions API call is still able to pick the correct answer.

2 Likes