Fine-tuning or updating the embedding of a string

The Hadamard product is genuinely useful here. It is the precursor to cosine similarity: cos_sim = sum(Hadamard(u, v)), where the sum is taken over the coordinates of the Hadamard product vector.
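To make that relationship concrete, here is a minimal NumPy sketch (the vectors are made-up examples) showing that summing the coordinates of the Hadamard product of two unit vectors gives exactly their cosine similarity:

```python
import numpy as np

# Hypothetical example: two random vectors, normalized to unit length
# (as embedding models like Ada-002 return them).
rng = np.random.default_rng(0)
u = rng.normal(size=8)
u /= np.linalg.norm(u)
v = rng.normal(size=8)
v /= np.linalg.norm(v)

hadamard = u * v          # element-wise (Hadamard) product
cos_sim = hadamard.sum()  # summing its coordinates gives the dot product

# For unit-length vectors, the dot product IS the cosine similarity.
assert np.isclose(cos_sim, np.dot(u, v))
assert np.isclose(cos_sim, np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
```

So the Hadamard product is the "unsummed" cosine similarity: a per-coordinate breakdown of where the agreement between the two vectors comes from.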

If I had to pinpoint what is throwing me off, it is why you are normalizing it back out to unit length. That implies you are treating the product as a "new embedding vector," which I am not sure is a good idea, and it is why the "out-of-range/closure-failure" red flags started lighting up in my head.

I think you will get more mileage out of the raw Hadamard product, without normalization, treated as an "interaction vector" whose potentially non-unit length is an important aspect of how the two vectors interact. In other words, it is a raw correlation vector, before the sum. The angle of rotation is part of the story, but IMO more of the information is in the length.

As for the Ada-002 range, this is from my own experience embedding hundreds of thousands of random strings (not exactly nonsensical strings) and observing their overall angular spread. It is also evident from others' observations on this forum that the cosine similarities cluster very closely, even for quite different things, bottoming out around 0.7, where the theoretical minimum is -1. So it only varies from 1 down to 0.7 (a spread of 0.3), out of a total theoretical variation from 1 to -1 (a spread of 2). That is only 15% of its possible range.
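The arithmetic behind that 15% figure is just the observed spread divided by the theoretical spread (the 0.7 floor is the empirical observation described above, not a documented property of the model):

```python
# Observed cosine-similarity range for Ada-002 (empirical, per the
# forum observations above) vs. the theoretical range of cosine similarity.
observed_hi, observed_lo = 1.0, 0.7
theoretical_hi, theoretical_lo = 1.0, -1.0

observed_spread = observed_hi - observed_lo          # ~0.3
theoretical_spread = theoretical_hi - theoretical_lo  # 2.0

fraction_used = observed_spread / theoretical_spread
print(round(fraction_used, 2))  # ~0.15, i.e. about 15% of the possible range
```

The practical upshot is that small absolute differences in Ada-002 cosine similarity are more meaningful than they look, since the whole useful signal is compressed into that narrow band.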
