How to make embeddings on multiple JSON files

Hello everyone!

First of all, I’d like to thank you in advance for taking the time to answer my question!

I’m a relative beginner in this world, and I’m looking to perform embedding on a multitude of JSON files.

I’d like to know if anyone has done this before!

See you soon,

Arnaud

  1. Start by using the Retrieval system to see how well a typical embedding setup works with your JSON files.
  • Consider instead using Function Calling to convert unstructured text into a structured function call. This will reduce costs and be more reliable than embeddings when dealing with a structured format like JSON.
  2. If for whatever reason you decide to continue with Retrieval and are happy with the results, you can implement your own RAG pipeline using one of a number of vector databases such as Weaviate, Qdrant, or Pinecone. They all offer intuitive client libraries that make them easy to hook up, and they give you the wonderful combination of more control and lower cost than the Retrieval system.

  3. Create your own training/validation set by sending the JSON structure/schema to GPT-4. Ask it to create (at minimum) four levels of questions: Easy, Medium, Hard, Impossible. This will help tremendously in seeing how well your retrieval system works.
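A tiny sketch of what such a tiered evaluation set might look like once generated, and how you could score a retriever against it. The questions, document IDs, and the toy keyword retriever below are all invented for illustration; in practice the questions come from GPT-4 and the retriever is your embedding search.

```python
# Hypothetical tiered evaluation set. Questions and expected source IDs are
# made-up examples; generate real ones with GPT-4 from your own JSON schema.
eval_set = [
    {"difficulty": "easy",       "question": "What is the account holder's name?",      "expected_id": "doc_001"},
    {"difficulty": "medium",     "question": "Which accounts were opened in 2022?",     "expected_id": "doc_014"},
    {"difficulty": "hard",       "question": "Compare fee structures across branches.", "expected_id": "doc_027"},
    {"difficulty": "impossible", "question": "What is the manager's favorite color?",   "expected_id": None},
]

def score(eval_set, retrieve):
    """Fraction of answerable questions where retrieve() returns the expected document."""
    answerable = [q for q in eval_set if q["expected_id"] is not None]
    hits = sum(retrieve(q["question"]) == q["expected_id"] for q in answerable)
    return hits / len(answerable)

# Toy keyword retriever standing in for a real embedding search:
toy_index = {"name": "doc_001", "2022": "doc_014", "fee": "doc_027"}
def toy_retrieve(question):
    return next((doc for key, doc in toy_index.items() if key in question.lower()), None)

print(score(eval_set, toy_retrieve))  # 1.0 on this toy data
```

Breaking the score out by difficulty level is a natural next step, so you can see exactly where retrieval starts failing.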

You’ve got questions? I’ve got…blank stares?

No, I’ve got some stuff to try.

First question: does your data actually contain language? Near-identical JSON holding nothing but interest rates or database dumps will embed very poorly. An embedding is a representation of the semantics of the context that is sent, and such data has an overwhelming commonality in meaning: JSON with numbers and all the same fields.

Let’s say instead your JSON is simple and contains runs of tokens that do carry meaning. Let’s first compare some raw language strings to find a similarity score:

 == Cosine similarity comparisons ==
- 1:"I’m a relative beginner i" <==> 0:" I’d like to know if anyon" -
 float32: 0.2633
- 2:"First of all, I’d like to" <==> 0:" I’d like to know if anyon" -
 float32: 0.4182
- 2:"First of all, I’d like to" <==> 1:" I’m a relative beginner i" -
 float32: 0.2729

It’s your own words! We find that the first two sentences match most closely, compared with any comparison involving the remaining sentence.
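The similarity score itself is just the cosine of the angle between two embedding vectors. A minimal sketch of the computation, using tiny made-up vectors in place of real embedding output (a real model returns on the order of 1,500 dimensions):

```python
import math

def cosine_similarity(a, b):
    # dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Three toy 4-dimensional "embeddings" (invented values for illustration):
v0 = [0.1, 0.9, 0.2, 0.4]
v1 = [0.8, 0.1, 0.3, 0.2]
v2 = [0.2, 0.8, 0.1, 0.5]

print(round(cosine_similarity(v0, v2), 4))  # high: v0 and v2 point the same way
print(round(cosine_similarity(v0, v1), 4))  # lower: different direction
```

Note that OpenAI embeddings come back unit-normalized, so in that case the denominator is 1 and cosine similarity reduces to a plain dot product.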

Let’s wrap that in JSON, like a user role message for chat completions…

'{"role": "user", "content": "First of all, I'd like to thank you in advance...

And the results are as expected: because of the commonality, the similarity is much higher across the board. This also means that any threshold you had set to disregard poor matches will now let through many more results, or almost all of them.

 == Cosine similarity comparisons ==
- 1:"{"role": "user", "content" <==> 0:" {"role": "user", "content" -
 float32: 0.6457
- 2:"{"role": "user", "content" <==> 0:" {"role": "user", "content" -
 float32: 0.7190
- 2:"{"role": "user", "content" <==> 1:" {"role": "user", "content" -
 float32: 0.6199

The top two results are now distinguished by a difference of 0.07 instead of 0.15.
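One way to see why the scores compress is to measure how much of each wrapped string is shared boilerplate. This is a rough character-level illustration, not an embedding, and the sentences are just the ones from the post above:

```python
import json

sentences = [
    "I'd like to know if anyone has done this before!",
    "I'm a relative beginner in this world.",
    "First of all, I'd like to thank you in advance!",
]

# Wrap each sentence as a chat-completions user message.
wrapped = [json.dumps({"role": "user", "content": s}) for s in sentences]

# Every wrapped string starts with the same boilerplate prefix,
# so a sizable fraction of each input is identical.
prefix = '{"role": "user", "content": "'
shared = [len(prefix) / len(w) for w in wrapped]
for w, frac in zip(wrapped, shared):
    print(f"{frac:.0%} boilerplate: {w}")
```

The shorter the payload, the larger the fraction of identical wrapper text, and the closer together the similarity scores land.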

I hope that gives some insight. You might consider whether you can strip out the contents and turn them into plain sentences, or give emphasis to just what differs, through preprocessing.
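A simple preprocessing pass along those lines might recursively pull only the string values out of the parsed JSON, so the embedding sees language rather than braces and repeated field names. This is just a sketch; which fields to keep or drop depends entirely on your data:

```python
import json

def extract_text(node, out=None):
    """Recursively collect string values from parsed JSON, dropping keys and structure."""
    if out is None:
        out = []
    if isinstance(node, dict):
        for value in node.values():
            extract_text(value, out)
    elif isinstance(node, list):
        for item in node:
            extract_text(item, out)
    elif isinstance(node, str):
        out.append(node)
    return out

# Made-up example record:
raw = '{"role": "user", "content": "First of all, thank you!", "meta": {"tags": ["intro", "greeting"]}}'
print(" ".join(extract_text(json.loads(raw))))
# → user First of all, thank you! intro greeting
```

In practice you would probably also drop low-information values like `"user"` here, or prepend the key names only where they carry meaning, before sending the result to the embedding model.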