I have been using OpenAI embeddings, specifically text-embedding-ada-002, and noticed that it is very sensitive even to punctuation. I have around 1000 chunks and, for each query, I need to extract the 15 chunks most similar to it. I tested my query without punctuation, and when I add a dot ‘.’ at the end, the set returned by the retriever changes compared to the version without punctuation: some chunks are the same, but new ones may appear or the order is different.
- Have you noticed anything similar?
- Is it normal for this embedding model to be that sensitive to punctuation?
- Is there a way to make it more robust to minor changes in the query?
FYI: I am using pgvector to store my chunk vectors.
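One way to put a number on how much the result set moves between two query variants is to compare the returned chunk IDs directly. The snippet below is only a rough sketch: `topk_overlap` is a hypothetical helper name, and it assumes the retriever returns an ordered list of chunk IDs for each query.

```python
def topk_overlap(ids_a, ids_b):
    """Compare two ranked lists of chunk IDs (hypothetical helper).

    Returns the Jaccard overlap of the two sets and whether the
    ranking is identical.
    """
    set_a, set_b = set(ids_a), set(ids_b)
    union = set_a | set_b
    # Jaccard index: shared IDs divided by all distinct IDs.
    jaccard = len(set_a & set_b) / len(union) if union else 1.0
    # Order check: same IDs in exactly the same positions.
    same_order = list(ids_a) == list(ids_b)
    return jaccard, same_order
```

Running this over a handful of queries, with and without the trailing dot, would show whether the change is mostly a reshuffle of the same chunks or a genuinely different set.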
I haven’t noticed the punctuation issue, but I have noticed a significant downgrade in performance using ada-002 compared to the davinci-001 embeddings model. I am really frustrated because I re-embedded all my texts, and now the results aren’t fit for my use case. Before, the most relevant text always showed up in the first 1-5 search results; now it’s the 50th search result!
Yes. Grammar do be like that.
Have you tried normalizing/correcting the text using GPT and then embedding it?
Actually, the issue is more with the query I am sending to the retriever: if I add a dot at the end, the returned set changes. Depending on the query, sometimes the version with the dot returns what I am looking for, and sometimes it’s the version without the dot.
So I’m not sure how to normalize it correctly.
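A cheap option to try before involving GPT is a deterministic normalization step applied to every query just before it is embedded, so that "query" and "query." always map to the same string. This is a minimal sketch, not a recommendation from the model's documentation: `normalize_query` is a hypothetical helper, and the choice to strip trailing punctuation (including question marks) is an assumption that may or may not help retrieval for a given dataset.

```python
import re
import unicodedata


def normalize_query(text: str) -> str:
    """Return a canonical form of a query before it is embedded."""
    # Unify Unicode compatibility variants (e.g. full-width characters).
    text = unicodedata.normalize("NFKC", text)
    # Collapse runs of whitespace and trim both ends.
    text = re.sub(r"\s+", " ", text).strip()
    # Assumption: trailing punctuation carries no meaning for retrieval,
    # so drop it to make "query" and "query." identical.
    return text.rstrip(".!?;:, ")
```

Since both variants then produce the same string, they produce the same embedding request. Which form (with or without the dot) retrieves better would still need to be checked on real queries.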
What was your chunking strategy? Could you provide some examples of the data you chunked?