Fine-tuning or updating the embedding of a string

The Hadamard product is genuinely useful here. It is the precursor to cosine similarity: cos_sim = sum(Hadamard(u, v)), where the sum is taken over the coordinates of the Hadamard product vector.
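To make that relationship concrete, here is a minimal NumPy sketch (the vectors are made-up examples) showing that summing the coordinates of the Hadamard product of two unit vectors gives exactly their cosine similarity:

```python
import numpy as np

# Hypothetical example: two random vectors, normalized to unit length
# (as embedding models like Ada-002 return them).
rng = np.random.default_rng(0)
u = rng.normal(size=8)
u /= np.linalg.norm(u)
v = rng.normal(size=8)
v /= np.linalg.norm(v)

hadamard = u * v          # element-wise (Hadamard) product
cos_sim = hadamard.sum()  # summing its coordinates gives the dot product

# For unit-length vectors, the dot product IS the cosine similarity.
assert np.isclose(cos_sim, np.dot(u, v))
assert np.isclose(cos_sim, np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
```

So the Hadamard product is the "unsummed" cosine similarity: a per-coordinate breakdown of where the agreement between the two vectors comes from.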

If I had to pinpoint what is throwing me off, it is why you are normalizing it back out to unit length. That implies you are treating the product as a "new embedding vector," which I am not sure is a good idea, and it is why the "out-of-range/closure-failure" red flags started lighting up in my head.

I think you will get more mileage out of the raw Hadamard product, without normalization, treated as an "interaction vector" whose potentially non-unit length is an important aspect of how the two vectors interact. In other words, it is a raw correlation vector, before the sum. The angle of rotation is part of the story, but IMO more of the information is in the length.

As for the Ada-002 range, this is from my own experience embedding hundreds of thousands of random strings (not exactly nonsensical strings) and observing their overall angular spread. It is also evident from others' observations on this forum that the cosine similarities cluster very closely, even for quite different things, bottoming out around 0.7, where the theoretical minimum is -1. So it only varies from 1 down to 0.7 (a spread of 0.3), out of a total theoretical variation from 1 to -1 (a spread of 2). That is only 15% of its possible range.
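The arithmetic behind that 15% figure is just the observed spread divided by the theoretical spread (the 0.7 floor is the empirical observation described above, not a documented property of the model):

```python
# Observed cosine-similarity range for Ada-002 (empirical, per the
# forum observations above) vs. the theoretical range of cosine similarity.
observed_hi, observed_lo = 1.0, 0.7
theoretical_hi, theoretical_lo = 1.0, -1.0

observed_spread = observed_hi - observed_lo          # ~0.3
theoretical_spread = theoretical_hi - theoretical_lo  # 2.0

fraction_used = observed_spread / theoretical_spread
print(round(fraction_used, 2))  # ~0.15, i.e. about 15% of the possible range
```

The practical upshot is that small absolute differences in Ada-002 cosine similarity are more meaningful than they look, since the whole useful signal is compressed into that narrow band.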
