I’ve had a specific issue with search results … looking for text with Patch numbers.
I’ve found text-embedding-3-small in combination with cosine similarity was returning many results for snippets with the wrong version numbers in them.
So, for example, if I’m looking for text on “update 5.23”, te3s was preferentially returning lots of snippets on update 5.21 instead of 5.23. When combined with a threshold cut-off or a “Top 5” slice, this removed all the results that actually mentioned 5.23.
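To make the failure mode concrete, here is a minimal sketch (numpy only, random stand-in vectors rather than real embeddings) of the retrieval step described above: cosine similarity against every stored snippet, then a threshold cut-off and a top-5 slice. If the model ranks the wrong snippets highest, the correct ones never survive the slice. The function name and defaults are mine, not from any SDK.

```python
import numpy as np

def top_k(query: np.ndarray, snippets: np.ndarray, k: int = 5,
          threshold: float = 0.3) -> list[int]:
    """Indices of the k most cosine-similar rows at or above threshold."""
    q = query / np.linalg.norm(query)
    s = snippets / np.linalg.norm(snippets, axis=1, keepdims=True)
    sims = s @ q                  # cosine similarity per snippet
    order = np.argsort(-sims)     # best first
    return [int(i) for i in order[:k] if sims[i] >= threshold]

rng = np.random.default_rng(0)
db = rng.normal(size=(100, 8))    # pretend snippet embeddings
hits = top_k(rng.normal(size=8), db)
```

Note that the top-k slice happens before any human sees the list: if the embedding model pushes the “5.21” snippets above the “5.23” ones, the 5.23 snippets vanish entirely, which matches the behaviour described above.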
Roll back to ADA 2 and, boom, problem gone.
I’m using an HNSW index … I’m wondering if that might be having an influence … but the ADA 2 setup, without changing any index parameters, is working for me again.
I’m tempted to try a higher resolution index at some point with the new model, but that will eat memory!
No, this was not the problem. The problem was that, regardless of threshold, the order of cosine distance was preferring data that was clearly more different. And that’s definitely a problem!
I’ve seen this as well; with text-embedding-3-small, I get a totally different set of similar results compared to ada-002. The relevance seems broken and is unusable.
It’s pretty concerning, and I’m not seeing this mentioned anywhere else. I’m going to try and reach out to some OpenAI folks about this.
I just ran a test where I ingested 10 text files with text-embedding-3-small and searched for some text, then did the same thing with ada-002 and got a completely different list of results. (And ada-002 was 100% correct with the top hits.)
Unfortunately not. Tried to reach out but no response.
I need to retest this, since I just changed how we preprocess the text by not removing newline characters. (Found this was only a thing in v1 of ada-002, and not needed in v2, so probably not in text-embedding-3 either.)
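For reference, the preprocessing step in question is just a newline-to-space substitution applied before embedding; a one-line sketch (the helper name is mine, not from any SDK), which per the above was only needed for v1 of ada and not for the newer models:

```python
def normalize_for_embedding(text: str) -> str:
    """Replace newlines with spaces before sending text to the embedding API."""
    return text.replace("\n", " ")
```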
I’ll follow up again after I have another chance to compare.
Came here after having similar issues. Recently I’ve started using text-embedding-3-small and my search results are coming back empty. Switching back to text-embedding-ada-002 appears to be improving things.
What I don’t really know, to be honest: if I have embeddings previously produced with text-embedding-ada-002, and I search them using a query vector produced by text-embedding-3-small, is there a huge impact on the results?
I have quite a large db of vectors, and if possible I would like to avoid re-indexing them with te3s, especially while it’s having issues with search results.
Yeah, OK, after reading a bit more around the forum it became clear that ada and te3s are not compatible; this is not like swapping one GPT model for another on the same input. You cannot simply start querying an existing vector db produced by ada with te3s.
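One way to guard against this mix-up (a hypothetical pattern, not an official SDK feature): store the embedding model name alongside each vector and refuse to compare across models. A shape check alone won’t catch it, since both ada-002 and text-embedding-3-small default to 1536 dimensions.

```python
from dataclasses import dataclass

@dataclass
class TaggedVector:
    model: str           # e.g. "text-embedding-ada-002"
    values: list[float]

def cosine(a: TaggedVector, b: TaggedVector) -> float:
    """Cosine similarity, but only within the same embedding space."""
    if a.model != b.model:
        raise ValueError(f"incompatible embedding spaces: {a.model} vs {b.model}")
    dot = sum(x * y for x, y in zip(a.values, b.values))
    na = sum(x * x for x in a.values) ** 0.5
    nb = sum(x * x for x in b.values) ** 0.5
    return dot / (na * nb)
```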
@merefield Any update on this? I’m getting very strange results with 3-large compared to Ada.
If I take a subset of my document (like, multiple sentences) and query the indexes, the Ada index will find the text the subset came from without fail whereas 3-large will never find it! It’s not even in the top 10!
It’s as you say:
regardless of threshold, the order of cosine distance was preferring data that was clearly more different
I can’t mathematically wrap my head around how a subset of a document can be more like an entirely different document than its source! And it’s not like I’m taking a needle-in-a-haystack and comparing it to a document on needles and haystacks, either. It’s semantically obvious that the found documents are way less related than the original…
I’ve been twisting and turning this for days now and I can’t find anything wrong on my end. I’m as certain as can be that I’m querying 3-large indices with 3-large queries and vice versa.
Gotcha, I was mostly worried it’d be a ticking time-bomb if they deprecate Ada since they denote it as an “earlier generation”.
I’ll add this, too: I’m using a FAISS vector database, and according to the LangChain package it’s supposed to return values in the range of [0, sqrt(2)]. Yet I’m somehow getting out-of-bounds similarity scores like 1.45 as the most similar. It seems indicative that there’s something wrong in my code base.
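One possible explanation, assuming that number is the L2 distance between unit-normalized vectors (what a flat L2 FAISS index over normalized embeddings returns): the distance relates to cosine similarity as d = sqrt(2 − 2·cos), so it actually ranges over [0, 2], not [0, sqrt(2)]. Any distance above sqrt(2) ≈ 1.414 simply means the cosine similarity is negative, which the newer embedding models can produce. A small sketch of the relation:

```python
import math

def l2_from_cosine(cos_sim: float) -> float:
    """L2 distance between two unit vectors with the given cosine similarity."""
    return math.sqrt(2.0 - 2.0 * cos_sim)

def cosine_from_l2(d: float) -> float:
    """Invert the relation: cos = 1 - d**2 / 2."""
    return 1.0 - d * d / 2.0

# cosine_from_l2(1.45) is about -0.051, i.e. a 1.45 "score" corresponds
# to a slightly negative cosine similarity rather than a code bug.
```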
Thanks for the speedy reply! I’ll probably cut my losses and revert back to Ada for the time being, too.