I have a natural language processing problem. I want to embed the data in my database, but the database is too large for me to embed each piece of data individually. I’ve been researching and found out about batch embedding.
Usually, when embedding each piece of data individually, the empty string “” ends up getting embedded.
(what I typed a few hours ago but apparently never pressed “reply” on)
An empty string contains no language model tokens, so it makes no sense as an embedding input. The model has nothing to process, so it can't return a semantically meaningful embedding from its pretraining.
If it worked, would they bill 0 tokens, and 1 token for embedding an “x”? (That's what I wanted to experiment with, if there was indeed a vector returned for a null input.)
If you want to extend your matching to include “this is another case of an empty string”, you might assign it a lookup-table entry of your own creation, such as [0.9, -0.9, 0.9, …], that will be unlike any natural-language embedding.
I would normally skip embedding an empty string too, BTW.
The only “forced” workaround is making up a vector when the string is empty: a vector of NaNs, or a unit vector with a single “1” in one of the positions, or even the all-zeros vector if you're just taking dot products.
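A minimal sketch of those three options, assuming NumPy and a hypothetical helper name; the dimension is just an example, pick whatever your embedding model actually returns:

```python
import numpy as np

DIM = 1536  # assumed dimension; match your model's output

def placeholder_for_empty(kind: str = "zeros", dim: int = DIM) -> np.ndarray:
    """Stand-in vector for an empty string (hypothetical helper)."""
    if kind == "zeros":
        return np.zeros(dim)  # dot product with anything is 0
    if kind == "nan":
        return np.full(dim, np.nan)  # poisons any similarity score
    if kind == "one_hot":
        v = np.zeros(dim)
        v[0] = 1.0  # unit norm, arbitrary fixed direction
        return v
    raise ValueError(f"unknown kind: {kind}")
```

Then, before calling the embeddings API, you substitute `placeholder_for_empty(...)` for any empty input instead of sending it to the model.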
It just depends on what you want to do with the empty string in your comparison. Do you just want a vector that runs through the pipeline but has no meaning? Then use your made-up vector.
Does your situation need to detect nonsense? Use the NaN vector. This requires some additional code to detect NaNs, which isn't a bad idea anyway.
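The NaN-detection idea can be as simple as a cosine-similarity wrapper that refuses to score a sentinel vector; a sketch, assuming NumPy:

```python
import numpy as np

def safe_cosine(a, b):
    """Cosine similarity that flags NaN sentinel vectors instead of
    silently propagating NaN into the score."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if np.isnan(a).any() or np.isnan(b).any():
        return None  # caller decides how to treat a "nonsense" comparison
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 0.0  # all-zeros vector: define similarity as 0
    return float(a @ b / denom)
```

Returning `None` (rather than NaN) forces the caller to handle the empty-string case explicitly instead of letting it slip into a ranking.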
Does the empty string mean you mirror the incoming vector? Use the 1/√N vector; for an anti-mirror, the -1/√N vector.
Once you define what it means for some text to correlate with the empty string, you can pick your vector, or decide to drop the comparison altogether because it doesn't make sense to compare something with nothing. Or does it? You have to define it.
In logic this is basically a vacuously true statement, so do whatever you want that yields your desired behavior.
But don’t expect the embedding model to define it.