Embedding very sensitive to punctuation

leo.bachelier.ext · December 6, 2023, 1:13pm

I have been using OpenAI embeddings, specifically text-embedding-ada-002, and noticed that the model is very sensitive to punctuation. I have around 1000 chunks and, for each query, need to retrieve the 15 most similar ones. I have been testing my query without punctuation, and when I add a dot ‘.’ at the end, the set returned by the retriever changes: some chunks are the same, but new ones may appear or the order is different.

Have you noticed anything similar?
Is it normal for this embedding model to be that sensitive to punctuation?
Is there a way to make it more robust to minor changes in the query?

FYI: I am using pgvector to store my chunk vectors.
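To put a number on how much the trailing dot changes retrieval, it can help to compare the two top-k result lists directly. The following is a minimal sketch in plain Python; the chunk IDs are made up for illustration, and the helper names (`cosine_similarity`, `jaccard_overlap`, `rank_changes`) are hypothetical, not part of any library.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def jaccard_overlap(ids_a, ids_b):
    """Share of chunk IDs common to both result sets, ignoring order."""
    set_a, set_b = set(ids_a), set(ids_b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def rank_changes(ids_a, ids_b):
    """Map each shared chunk ID whose position moved to (rank in a, rank in b), 1-based."""
    pos_b = {cid: i for i, cid in enumerate(ids_b, start=1)}
    return {
        cid: (i, pos_b[cid])
        for i, cid in enumerate(ids_a, start=1)
        if cid in pos_b and pos_b[cid] != i
    }


# Hypothetical top-5 chunk IDs for the same query without and with a trailing dot.
without_dot = [12, 7, 33, 5, 41]
with_dot = [7, 12, 33, 41, 90]

print(jaccard_overlap(without_dot, with_dot))
print(rank_changes(without_dot, with_dot))
```

Running `cosine_similarity` on the two query embeddings themselves shows how far apart the queries are, while the overlap and rank helpers show how that distance affects the 15 chunks actually returned.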

lmccallum · December 6, 2023, 8:29pm

I haven’t noticed sensitivity to punctuation, but I have noticed a significant downgrade in performance using ada-002 compared to the davinci-001 embeddings model. I am really frustrated because I re-embedded all my texts, and now the results aren’t fit for my use case. Before, the most relevant text always showed up in the first 1-5 search results. Now it’s the 50th search result!

anon10827405 · December 6, 2023, 8:32pm

Yes. Grammar do be like that.

Have you tried normalizing/correcting the text using GPT and then embedding it?
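A cheaper, deterministic variant of that idea is to apply the same rule-based cleanup to every query before embedding it. Note that this is a swapped-in technique, not the GPT-based correction suggested above, and the function name `normalize_query` and the choice of rules are assumptions for illustration. It makes the query with and without a trailing dot map to the same string, but it does not guarantee that the resulting embedding retrieves better chunks.

```python
import re
import unicodedata

# Trailing whitespace and sentence punctuation to strip (an assumed rule set).
_TRAILING_PUNCT = re.compile(r"[\s.!?;:,]+$")


def normalize_query(text):
    """Return a canonical form of the query so trivial variants embed identically."""
    # Fold Unicode compatibility forms (e.g. full-width characters).
    text = unicodedata.normalize("NFKC", text)
    # Collapse runs of whitespace and trim both ends.
    text = re.sub(r"\s+", " ", text).strip()
    # Drop trailing punctuation such as a final dot or question mark.
    return _TRAILING_PUNCT.sub("", text)


# Both variants reduce to the same string before being sent to the embedding model.
print(normalize_query("What is the refund policy."))
print(normalize_query("  What   is the refund policy ?  "))
```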

leo.bachelier.ext · December 8, 2023, 10:12am

Actually, the issue is more with the query I am sending to the retriever: if I add a dot at the end, the returned set changes. Depending on the query, sometimes the version with the dot returns what I am looking for, and sometimes the version without it does.
So I am not sure how to normalize it correctly.
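Since neither variant wins consistently, one option that avoids picking a single normalization is to retrieve with both variants and merge the two ranked lists. The sketch below uses reciprocal rank fusion, which is not something proposed in this thread. The chunk IDs are invented, `reciprocal_rank_fusion` is a hypothetical helper, and `k=60` is just a commonly used default.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60, top_n=15):
    """Merge several ranked lists of chunk IDs into one list.

    Each chunk scores 1 / (k + rank) per list it appears in (rank is 1-based),
    so chunks ranked highly by both query variants rise to the top.
    """
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by score descending; break ties by chunk ID for a stable order.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [chunk_id for chunk_id, _ in ordered[:top_n]]


# Hypothetical retriever output for the query without and with a trailing dot.
without_dot = [12, 7, 33]
with_dot = [7, 12, 90]

print(reciprocal_rank_fusion([without_dot, with_dot]))
```

The trade-off is two embedding calls and two vector searches per query instead of one, in exchange for a result set that no longer depends on a single punctuation choice.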

anon10827405 · December 8, 2023, 12:43pm

What was your chunking strategy? Could you provide some examples of the data you chunked?
