Semantic search with Embeddings

I have a list of animals e.g. bird, dog, fish etc. uploaded as embeddings.
I then query this list with the query, “which animal has scales?”
Using ADA, the results often show “bird” as having a greater cosine similarity than “fish” for this query.
I don’t understand why that would be or whether my understanding of embeddings is incorrect. Would appreciate any guidance.
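For concreteness, the ranking step I'm doing looks roughly like this (toy 4-dimensional vectors for illustration only; real text-embedding-ada-002 vectors are 1536-dimensional and come from the embeddings endpoint):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up vectors standing in for real API embeddings
embeddings = {
    "bird": np.array([0.9, 0.2, 0.1, 0.3]),
    "dog":  np.array([0.8, 0.1, 0.4, 0.2]),
    "fish": np.array([0.1, 0.9, 0.2, 0.4]),
}
query = np.array([0.2, 0.8, 0.3, 0.5])  # pretend embedding of the question

# Rank animals by cosine similarity to the query, highest first
ranking = sorted(embeddings,
                 key=lambda k: cosine_similarity(embeddings[k], query),
                 reverse=True)
print(ranking)  # → ['fish', 'bird', 'dog']
```

With real ada-002 vectors, the surprise is that "bird" often lands above "fish" in this ranking.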


For the list of animals, did you only embed the word “fish” or did you embed a description “A fish has scales, lives underwater …”?

Also, are you using ‘text-embedding-ada-002’ or some other engine?

Just fish, using text-embedding-ada-002.

Try embedding a description of a fish (or each animal). Be explicit and detailed. The embedding engine you are using can handle 8k tokens, which is a lot of text, so don’t be shy! I know this sounds like a lot of work, so you can also have GPT-3 describe each animal and embed those responses instead.

This will enrich the embedding vector and should give you a closer match.
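Something like this, where the descriptions are just examples (you could generate them with GPT-3 instead), and `get_embedding` stands in for whatever wrapper you use around the embeddings endpoint:

```python
# Build richer texts to embed instead of bare animal names.
animals = {
    "fish": "An aquatic animal covered in scales; it lives underwater and has fins and gills.",
    "bird": "A feathered, winged animal; most birds can fly and lay eggs.",
    "dog":  "A furry, four-legged mammal commonly kept as a pet.",
}

def texts_to_embed(descriptions: dict) -> list:
    """Return the enriched texts to send to the embeddings endpoint."""
    return [f"{name}: {desc}" for name, desc in descriptions.items()]

for text in texts_to_embed(animals):
    print(text)
    # vector = get_embedding(text, model="text-embedding-ada-002")  # hypothetical helper
```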


Could you please provide a sample of the input text? It’s difficult to gauge why it’s giving that response without seeing what the source of the embedding looks like

I mean “birds” did come from dinosaurs, they say. Haha. I wonder if that’s why it occasionally is higher?

I agree with @curt.kennedy that more information would likely be helpful…

Good luck. Let us know if you get it sorted out.

I don’t doubt that it would produce more predictable results if I gave it more data about each animal.
If that is the issue, I wouldn’t describe this as semantic search nor would it be very useful. There are better methods than GPT for simple word-match searches.

The list shows the data and the cosine similarity for the query “which animal is commonly perceived as most likely to have scales?” - Best answer: Sheep (!?)

The other image shows a list of other queries and the best answer provided. The only query returning fish as the best answer was “What animal has scales and lives in water?”

I definitely understand your frustration. There are other embedding engines out there. I personally have used GloVe before, but that was at the word level, not the sentence or paragraph level.

There might be embedding engines out there that are better at embedding smaller chunks of text. But generally that means a smaller vector space (a shorter embedding vector), so more features get packed into fewer dimensions, making unrelated items closer by default. Such an engine would likely not distinguish as well.

It’s all a trade-off. But at the word level, I had good performance with only 50 to 100 dimensions. At the phrase level, I would go much higher; the 1k–2k that OpenAI is using now seems reasonable.
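The “closer by default” effect is easy to demonstrate with random vectors, independent of any particular embedding engine: the typical cosine similarity between unrelated vectors shrinks roughly like 1/√d as the dimension d grows.

```python
import numpy as np

def mean_abs_cosine(dim: int, n_pairs: int = 2000, seed: int = 0) -> float:
    """Average |cosine similarity| over random pairs of Gaussian vectors."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n_pairs, dim))
    b = rng.standard_normal((n_pairs, dim))
    cos = np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    return float(np.mean(np.abs(cos)))

# Unrelated vectors look much "closer" in 50 dimensions than in 1536
print(mean_abs_cosine(50))    # roughly 0.11
print(mean_abs_cosine(1536))  # roughly 0.02
```

So a 50-dimensional space that works fine for single words would leave much less room to separate longer passages.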


The new text-embedding-ada-002 model is not outperforming text-similarity-davinci-001 on the SentEval linear probing classification benchmark. For tasks that require training a lightweight linear layer on top of embedding vectors for classification prediction, we suggest comparing the new model to text-similarity-davinci-001 and choosing whichever model gives optimal performance.

Check the Limitations & Risks section in the embeddings documentation for general limitations of our embedding models.

Is text-similarity-davinci-001 still available?

Also, maybe it’s misunderstanding the word “scales”…

As in, “scales” can be read a couple of different ways, and maybe sheep (somehow) are classified as likely having “scales” (but under the other definition, i.e., hierarchies…).

“What animal has scales and lives in water?” helps the model understand which definition of scales you’re after…

Just trying to help you brainstorm…


Just to add to the response from @PaulBellow

Many of these embedding models are trained on data scraped from the internet. And guess what the internet says:

“Birds evolved from a group of meat-eating dinosaurs called theropods. That’s the same group that Tyrannosaurus rex belonged to.”

The model doesn’t always have the common sense to realize that when you embed ‘bird’ you mean only modern, bird-like things. There are soooo many bird/dinosaur articles out there that it isn’t out of the question for the model to link the two, given the data it was trained on.

I understand completely why it’s returning these results. However, it makes it much less useful.