Question about embeddings (ada 002) with numeric values

javi.xeneize · June 9, 2023, 8:41am

Hello

I am struggling with embeddings when I have numeric values in my data.

I have created a list of documents, each with a field called ID in the format item-XXX, where XXX is a numeric value, so I have documents like item-001, item-002, etc.

However, when ada does the embeddings, it does so only on the string values, not on the numbers, so item-001 and item-002 generate the same embedding, as it just considers the word “item”. Therefore, when I query about a given item, it does not work correctly.

I have tried generating the embeddings from the word representation of the item, like item-one, item-two, etc., but it is still not working, as I believe it is converted to a number underneath.

Has anybody else had a similar issue? What is the right way to make it work?

Thanks

udm17 · June 9, 2023, 8:47am

Personally, I haven’t worked with numbers in embeddings for GPT, but have you thought about using one-hot encoding or a similar method?

These methods are quite common in ML pipelines when you want to embed numerical data, so they might be helpful here as well.

If that doesn’t help, you can use an FT-Transformer to encode the data, though that would make the pipeline a bit more complicated.
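
For reference, one-hot encoding can be sketched in a few lines of Python. This is only an illustration of the general technique, not anything specific to ada-002; the `one_hot` helper and the ID list are made up for the example. Each ID is treated as its own category, so the vectors are guaranteed to differ, but they carry no notion of one item being "closer" to another.

```python
def one_hot(value, categories):
    """Return a one-hot vector for value over an ordered list of categories."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if c == value else 0 for c in categories]

# Hypothetical ID list, mirroring the item-XXX format from the question.
item_ids = ["item-001", "item-002", "item-003"]

vec_001 = one_hot("item-001", item_ids)
vec_002 = one_hot("item-002", item_ids)

print(vec_001)  # [1, 0, 0]
print(vec_002)  # [0, 1, 0]
```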

javi.xeneize · June 9, 2023, 9:01am

Hi. No, I have not heard about that technique. What I have done is avoid using numbers and use the word for the number, so 003 is “three”, but it still does not work.

Is there any doc about one-hot encoding I can read?

Thanks

udm17 · June 9, 2023, 1:02pm

I don’t have a specific one in mind, but if you search on Google you should be able to find one.

rwaldin · June 9, 2023, 2:31pm

Embedding item-001 and item-002 with the text-embedding-ada-002 model does not produce the same vectors. Here’s the curl command to generate embedding vectors for both at once:

curl 'https://api.openai.com/v1/embeddings' \
  -H 'authorization: Bearer <OPENAI_APIKEY>' \
  -H 'content-type: application/json' \
  -d '{"model":"text-embedding-ada-002","input":["item-001","item-002"]}'

and here’s the response body truncated to just show the first few dimensions of each vector:

{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [
        -0.008231333,
        -0.009892395,
        0.0043476773,
        ...
      ]
    },
    {
      "object": "embedding",
      "index": 1,
      "embedding": [
        -0.023350336,
        -0.025565965,
        0.0071974755,
        ...
      ]
    }
  ],
  "model": "text-embedding-ada-002-v2",
  "usage": {
    "prompt_tokens": 6,
    "total_tokens": 6
  }
}

I suspect you are using a tokenizer and submitting token arrays instead of strings? Can you check the request and make sure you are sending 3 tokens for each term for a total of 6 tokens?
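
A rough way to run that check in Python, assuming the response body has the shape shown above. The `summarize_response` helper is hypothetical, and the sample vectors are cut down to the three dimensions quoted in this post (real ada-002 vectors are much longer), so this is a sketch rather than a full client.

```python
import json

# Sample body in the same shape as the response above, with each vector
# reduced to the three dimensions that were quoted.
response_text = """
{
  "data": [
    {"index": 0, "embedding": [-0.008231333, -0.009892395, 0.0043476773]},
    {"index": 1, "embedding": [-0.023350336, -0.025565965, 0.0071974755]}
  ],
  "usage": {"prompt_tokens": 6, "total_tokens": 6}
}
"""

def summarize_response(body):
    """Return (prompt_tokens, vectors_differ) for a two-input embeddings response."""
    # Sort by index so the vectors line up with the order of the inputs.
    items = sorted(body["data"], key=lambda d: d["index"])
    first, second = items[0]["embedding"], items[1]["embedding"]
    return body["usage"]["prompt_tokens"], first != second

prompt_tokens, vectors_differ = summarize_response(json.loads(response_text))
print(prompt_tokens, vectors_differ)  # 6 True
```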

bill.french · June 9, 2023, 3:36pm

This has not been my experience. These map into two subtly different vectors as @rwaldin points out by example.

The bigger question - are they useful for similarity comparisons? I would enjoy expert commentary on that.
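
For anyone wanting to test this themselves, embeddings are usually compared with cosine similarity. Below is a minimal sketch of that measure using toy two-dimensional vectors, not real ada-002 output; the `cosine_similarity` helper is written from scratch for illustration. To answer the question for real, you would feed in the full vectors returned by the API.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy vectors only, chosen so the results are easy to verify by hand.
toy_a = [1.0, 0.0]
toy_b = [1.0, 1.0]
toy_c = [0.0, 1.0]

print(round(cosine_similarity(toy_a, toy_b), 4))  # 0.7071
print(cosine_similarity(toy_a, toy_c))            # 0.0
```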

javi.xeneize · June 12, 2023, 9:56am

Right, I have tested this in more depth, and you are totally right: I was chasing a ghost, and my assumption was wrong. The embeddings generated are different and, with a simple example, it works well.

The good news is that I have identified what the problem is. I have a field called description that contains some numeric values, for example:

For item-001, my description is “this item costs £19.99 and has a width of 54mm”.

I generate the embedding from the normalised text of the concatenation of ID + description, and this is where the engine gets confused. If I generate the embedding from the ID alone, it resolves the document to retrieve correctly, but that is of little value, as the embedding captures nothing about the description, so if I ask about a part that has a width of 54mm, it doesn’t know what to answer.

How would you approach that? For now I have a workaround that uses a regex to check whether the question references an item ID and, if so, searches for that document, but I am sure there are better solutions.

Thanks
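
The regex workaround described above might look roughly like the following sketch. It assumes IDs are always "item-" followed by exactly three digits; the `find_item_ids`, `retrieve` and `fake_semantic_search` names are hypothetical, and the semantic search is a stand-in for whatever embedding lookup is already in place.

```python
import re

# Assumed ID format from the question: "item-" followed by three digits.
ITEM_ID_PATTERN = re.compile(r"\bitem-(\d{3})\b", re.IGNORECASE)

def find_item_ids(question):
    """Return the item IDs mentioned in a question, normalised to lower case."""
    return [f"item-{digits}" for digits in ITEM_ID_PATTERN.findall(question)]

def retrieve(question, documents, semantic_search):
    """Use an exact ID lookup when the question names an item, else fall back."""
    hits = [documents[i] for i in find_item_ids(question) if i in documents]
    if hits:
        return hits
    return semantic_search(question)

# Hypothetical data and a placeholder for the real embedding search.
documents = {"item-001": "this item costs £19.99 and has a width of 54mm"}

def fake_semantic_search(question):
    return ["<result of embedding search>"]

print(retrieve("How much does Item-001 cost?", documents, fake_semantic_search))
print(retrieve("Which part has a width of 54mm?", documents, fake_semantic_search))
```

Questions that name an ID go straight to the matching document, while everything else still relies on embeddings of the description.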
