You’ve got questions? I’ve got…blank stares?
No, I’ve got some stuff to try.
First question: does your data actually contain language? Embeddings encode a representation of the semantics of the text that is sent. Near-identical JSON holding just interest rates, or raw database dumps, will perform very poorly: such data has an overwhelming commonality in meaning, since every record is JSON with numbers and all the same fields.
Let’s say instead your JSON is simple and has runs of tokens that do carry meaning. Let’s first compare some raw language strings to get similarity scores:
== Cosine similarity comparisons ==
- 1:"I’m a relative beginner i" <==> 0:" I’d like to know if anyon" -
float32: 0.2633
- 2:"First of all, I’d like to" <==> 0:" I’d like to know if anyon" -
float32: 0.4182
- 2:"First of all, I’d like to" <==> 1:" I’m a relative beginner i" -
float32: 0.2729
It’s your own words! We find that one pairing (sentences 2 and 0) matches distinctly higher than the others, so genuine semantic overlap stands out clearly against the ~0.27 background.
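Here’s a minimal sketch of how scores like those can be produced, assuming the OpenAI Python SDK, numpy, and the text-embedding-3-small model as an example (the sentence texts are placeholders, since the originals are truncated above):

```python
# Minimal sketch: embed three sentences and print pairwise cosine similarities.
# Assumes the OpenAI Python SDK with OPENAI_API_KEY in the environment.
import itertools
import numpy as np
from openai import OpenAI

client = OpenAI()

sentences = [  # placeholders standing in for the truncated sentences above
    "I'd like to know if anyone ...",
    "I'm a relative beginner in ...",
    "First of all, I'd like to ...",
]

resp = client.embeddings.create(model="text-embedding-3-small", input=sentences)
vecs = [np.array(d.embedding, dtype=np.float32) for d in resp.data]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # dot product over the product of the vector norms
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print("== Cosine similarity comparisons ==")
for i, j in itertools.combinations(range(len(sentences)), 2):
    print(f'- {j}:"{sentences[j][:26]}" <==> {i}:"{sentences[i][:26]}" -')
    print(f"float32: {cosine(vecs[j], vecs[i]):.4f}")
```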
Let’s wrap that in JSON, like a user role message for chat completions…
'{"role": "user", "content": "First of all, I\’d like to thank you in advance...
And the results are as expected: because of the shared JSON boilerplate, every similarity score is much higher. This also means that any threshold you had set to disregard poor matches will now pass far more results, or almost all of them.
== Cosine similarity comparisons ==
- 1:"{"role": "user", "content" <==> 0:" {"role": "user", "content" -
float32: 0.6457
- 2:"{"role": "user", "content" <==> 0:" {"role": "user", "content" -
float32: 0.7190
- 2:"{"role": "user", "content" <==> 1:" {"role": "user", "content" -
float32: 0.6199
The top two results are now separated by only 0.07 instead of 0.15: the JSON wrapper has compressed the useful signal.
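To make the threshold effect concrete, here’s a quick check against the scores above, using a hypothetical cutoff of 0.35 that would have filtered the raw-text comparisons sensibly:

```python
# Hypothetical cutoff: reasonable for the raw-text scores, useless once
# everything is wrapped in identical JSON boilerplate.
THRESHOLD = 0.35

raw_scores = [0.2633, 0.4182, 0.2729]
json_scores = [0.6457, 0.7190, 0.6199]

print([s >= THRESHOLD for s in raw_scores])   # [False, True, False]
print([s >= THRESHOLD for s in json_scores])  # [True, True, True]
```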
I hope that gives some insight. You might consider whether you can strip out the JSON structure and embed just the sentences, or use preprocessing to give emphasis to only what actually differs.
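One possible preprocessing step along those lines (strip_to_text is a hypothetical helper, assuming your records are chat-message JSON like the example above):

```python
import json

def strip_to_text(raw: str) -> str:
    """Extract just the human language from a chat-message JSON string,
    so the embedding reflects meaning rather than shared boilerplate."""
    try:
        doc = json.loads(raw)
    except json.JSONDecodeError:
        return raw  # not JSON: embed as-is
    return doc.get("content", raw) if isinstance(doc, dict) else raw

texts_to_embed = [strip_to_text(r) for r in wrapped]
```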