I have a natural language processing problem. I want to embed the data in my database, but the database is too large for me to embed each piece of data individually. I’ve been researching and found out about batch embedding.
Usually, when embedding each piece of data individually, the empty string “” ends up getting embedded.
(what I typed a few hours ago but apparently never pressed “reply” on)
An empty string contains no language model tokens, so it makes no sense as an embedding input. The model has nothing to process, so it can't return a semantically meaningful embedding from its pretraining.
If it worked, would they bill 0 tokens, and 1 token for embedding an “x”? (That's what I wanted to experiment with, if there was indeed a vector returned for a null input.)
If you want to extend your matching to include “this is another case of an empty string”, you might assign it a lookup-table entry of your own creation, such as [0.9, -0.9, 0.9, …], that will be unlike any natural-language embedding.
I would normally skip embedding an empty string too, BTW.
The only “forced” workaround is making up a vector when the string is empty: a vector of NaNs, or a unit vector with a single “1” in one of the positions, or even the all-zeros vector if you're just taking dot products.
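A minimal sketch of those three options, assuming NumPy and a hypothetical helper name; the dimension is just an example, pick whatever your embedding model actually returns:

```python
import numpy as np

DIM = 1536  # assumed dimension; match your model's output

def placeholder_for_empty(kind: str = "zeros", dim: int = DIM) -> np.ndarray:
    """Stand-in vector for an empty string (hypothetical helper)."""
    if kind == "zeros":
        return np.zeros(dim)  # dot product with anything is 0
    if kind == "nan":
        return np.full(dim, np.nan)  # poisons any similarity score
    if kind == "one_hot":
        v = np.zeros(dim)
        v[0] = 1.0  # unit norm, arbitrary fixed direction
        return v
    raise ValueError(f"unknown kind: {kind}")
```

Then, before calling the embeddings API, you substitute `placeholder_for_empty(...)` for any empty input instead of sending it to the model.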
It just depends on what you want to do with the empty string in your comparison. Do you just want a vector that runs through the pipeline but has no meaning? Then use your made-up vector.
Does your situation need to detect nonsense? Use the NaN vector. This requires some additional code to detect NaNs, which isn't a bad idea anyway.
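The NaN-detection idea can be as simple as a cosine-similarity wrapper that refuses to score a sentinel vector; a sketch, assuming NumPy:

```python
import numpy as np

def safe_cosine(a, b):
    """Cosine similarity that flags NaN sentinel vectors instead of
    silently propagating NaN into the score."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if np.isnan(a).any() or np.isnan(b).any():
        return None  # caller decides how to treat a "nonsense" comparison
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 0.0  # all-zeros vector: define similarity as 0
    return float(a @ b / denom)
```

Returning `None` (rather than NaN) forces the caller to handle the empty-string case explicitly instead of letting it slip into a ranking.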
Does the empty string mean you mirror the incoming vector? Use the 1/√N vector; for an anti-mirror, the -1/√N vector.
Once you define what it means for some text to correlate with the empty string, you can pick your vector, or decide to drop the comparison altogether because it doesn't make sense to compare something with nothing. Or does it? You have to define it.
In logic this is basically a vacuously true statement, so do whatever you want that yields your desired behavior.
But don’t expect the embedding model to define it.