I am struggling with embeddings when I have numeric values in my data.
I have created a list of documents which have a field called ID, in the format item-XXX, where XXX is a numeric value, so I have documents like item-001, item-002, etc.
However, when ada generates the embeddings, it seems to use only the string values, not the numbers, so item-001 and item-002 generate the same embedding, as it just considers the word item. Therefore, when I query about a given item, it does not work correctly.
I have tried generating the embeddings from the word representation of the item, like item-one, item-two, etc., but it is still not working, as I believe it is converted to numeric underneath.
Has anybody else had a similar issue? What is the right way to make it work?
Personally, I haven’t worked with numbers in embeddings for GPT, but have you thought about using one-hot encoding or a similar method?
These methods are quite common in ML pipelines when you want to embed numerical data, so they might be helpful here as well.
If that doesn’t help, you can use an FT-Transformer to encode the data, though that would make the pipeline a bit more complicated.
Hi. No, I have not heard about that technique. What I have done is avoid numbers and use the word for the number, so 003 is “three”, but it still does not work.
Is there any doc about one-hot encoding I can read?
I don’t have a specific one in mind, but you can search on Google and you should be able to find one.
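For anyone who has not met the technique, here is a minimal sketch of one-hot encoding in plain Python. The `one_hot` helper and the three-item catalogue are made up for illustration; they are not part of any library or of the original poster's data.

```python
def one_hot(value, vocabulary):
    """Return a one-hot vector for `value` over an ordered `vocabulary`."""
    if value not in vocabulary:
        raise ValueError(f"unknown category: {value!r}")
    # Exactly one position is 1: the one whose category equals `value`.
    return [1 if candidate == value else 0 for candidate in vocabulary]


# Hypothetical catalogue of IDs, for illustration only.
item_ids = ["item-001", "item-002", "item-003"]

# Map every ID to its own one-hot vector.
encoded = {item_id: one_hot(item_id, item_ids) for item_id in item_ids}

print(encoded["item-002"])
```

Note that one-hot vectors only say "same ID or not"; every pair of distinct IDs is equally far apart, so they carry no notion of semantic similarity on their own.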
Embedding item-001 and item-002 with the text-embedding-ada-002 model does not produce the same vectors. Here’s the curl command to generate embedding vectors for both at once:
curl 'https://api.openai.com/v1/embeddings' \
-H 'authorization: Bearer <OPENAI_APIKEY>' \
-H 'content-type: application/json' \
-d '{"input": ["item-001", "item-002"], "model": "text-embedding-ada-002"}'
and here’s the response body truncated to just show the first few dimensions of each vector:
I suspect you are using a tokenizer and submitting token arrays instead of strings? Can you check the request and make sure you are sending 3 tokens for each term for a total of 6 tokens?
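As a rough way to check this on the client side, here is a sketch that builds the same request in Python using only the standard library and refuses anything that is not a plain string. The helper name `build_embedding_request` is made up for illustration, and the body shape (`input` as a list of strings plus `model`) is my assumption of what the request above is meant to send.

```python
import json
import urllib.request

API_URL = "https://api.openai.com/v1/embeddings"


def build_embedding_request(texts, api_key, model="text-embedding-ada-002"):
    """Build (but do not send) an embeddings request for a list of strings."""
    # Guard against accidentally submitting token arrays instead of text.
    if not all(isinstance(text, str) for text in texts):
        raise TypeError("expected plain strings, not token arrays")
    body = json.dumps({"input": list(texts), "model": model}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# <OPENAI_APIKEY> is a placeholder, exactly as in the curl command.
request = build_embedding_request(["item-001", "item-002"], "<OPENAI_APIKEY>")

# To actually send it you would call urllib.request.urlopen(request).
print(request.data.decode("utf-8"))
```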
This has not been my experience. These map into two subtly different vectors as @rwaldin points out by example.
The bigger question - are they useful for similarity comparisons? I would enjoy expert commentary on that.
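One way to probe that question yourself is to compare the vectors with cosine similarity. Below is a minimal sketch; the three short vectors are invented numbers for illustration only, not real text-embedding-ada-002 output, and you would substitute the vectors returned by the API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Invented 3-dimensional vectors, NOT real embedding output.
vec_a = [0.10, 0.20, 0.30]
vec_b = [0.10, 0.20, 0.31]   # almost the same direction as vec_a
vec_c = [0.30, -0.20, 0.10]  # a clearly different direction

print(cosine_similarity(vec_a, vec_b))
print(cosine_similarity(vec_a, vec_c))
```

If two IDs such as item-001 and item-002 score almost identically against a query, the vectors are technically different but the margin may be too thin to rank on reliably.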
Right, I have tested this in more depth, and you are totally right: I was chasing a ghost and my assumption was wrong. The embeddings generated are different and, with a simple example, it works well.
The good news is that I have identified what the problem is. I have a field called description that contains some numeric values, for example:
For item-001, my description is “this item costs £19.99 and has a width of 54mm”.
I generate the embedding from the normalised text of the concatenation of ID + description, and this is where the engine gets confused. If I generate the embedding just from the ID, it resolves the document to retrieve well, but that is of little value, as it doesn’t embed anything about the description, so if I ask about a part that has a width of 54mm, it doesn’t know what to answer.
How would you approach that? For now I have a workaround that checks with a regex whether the question references an item ID and, if so, searches for that document, but I am sure there are better solutions.
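For reference, the regex workaround described above could be sketched roughly like this. The pattern assumes IDs always look like item- followed by exactly three digits, and `find_item_id`, `retrieve`, and `fake_semantic_search` are hypothetical names; the last one is only a stand-in for whatever embedding search is actually in use.

```python
import re

# Assumes IDs are "item-" plus exactly three digits, e.g. item-001.
ITEM_ID_PATTERN = re.compile(r"\bitem-(\d{3})\b", re.IGNORECASE)


def find_item_id(question):
    """Return the normalised item ID mentioned in the question, or None."""
    match = ITEM_ID_PATTERN.search(question)
    return f"item-{match.group(1)}" if match else None


def retrieve(question, documents, semantic_search):
    """Exact lookup when an ID is mentioned, embedding search otherwise."""
    item_id = find_item_id(question)
    if item_id is not None and item_id in documents:
        return [documents[item_id]]
    return semantic_search(question)


documents = {"item-001": "this item costs £19.99 and has a width of 54mm"}


def fake_semantic_search(question):
    # Stand-in for a real embedding-based search; returns every description.
    return list(documents.values())


print(retrieve("How much does Item-001 cost?", documents, fake_semantic_search))
print(retrieve("Which part is 54mm wide?", documents, fake_semantic_search))
```

A related option, which I have not tested on this data, would be to embed only the description and keep the ID as separate metadata, so the exact lookup handles IDs and the embedding handles questions like the 54mm one.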