You’ve got questions? I’ve got…blank stares?
No, I’ve got some stuff to try.
First question: does your data actually contain language? Embeddings encode a representation of the semantics of the text that is sent. Near-identical JSON holding just interest rates, or raw database dumps, will perform very poorly: such data has an overwhelming commonality in meaning, since every record is JSON with numbers and all the same fields.
Let’s say instead your JSON is simple and has runs of tokens that do carry meaning. Let’s first compare some raw language strings to get similarity scores:
== Cosine similarity comparisons ==
- 1:"I’m a relative beginner i" <==> 0:" I’d like to know if anyon" -
float32: 0.2633
- 2:"First of all, I’d like to" <==> 0:" I’d like to know if anyon" -
float32: 0.4182
- 2:"First of all, I’d like to" <==> 1:" I’m a relative beginner i" -
float32: 0.2729
It’s your own words! We find that one pairing (sentences 2 and 0) matches distinctly higher than the others, so genuine semantic overlap stands out clearly against the ~0.27 background.
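Here’s a minimal sketch of how scores like those can be produced, assuming the OpenAI Python SDK, numpy, and the text-embedding-3-small model as an example (the sentence texts are placeholders, since the originals are truncated above):

```python
# Minimal sketch: embed three sentences and print pairwise cosine similarities.
# Assumes the OpenAI Python SDK with OPENAI_API_KEY in the environment.
import itertools
import numpy as np
from openai import OpenAI

client = OpenAI()

sentences = [  # placeholders standing in for the truncated sentences above
    "I'd like to know if anyone ...",
    "I'm a relative beginner in ...",
    "First of all, I'd like to ...",
]

resp = client.embeddings.create(model="text-embedding-3-small", input=sentences)
vecs = [np.array(d.embedding, dtype=np.float32) for d in resp.data]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # dot product over the product of the vector norms
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print("== Cosine similarity comparisons ==")
for i, j in itertools.combinations(range(len(sentences)), 2):
    print(f'- {j}:"{sentences[j][:26]}" <==> {i}:"{sentences[i][:26]}" -')
    print(f"float32: {cosine(vecs[j], vecs[i]):.4f}")
```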
Let’s wrap that in JSON, like a user role message for chat completions…
'{"role": "user", "content": "First of all, I\’d like to thank you in advance...
And the results are as expected: because of the shared JSON boilerplate, every similarity score is much higher. This also means that any threshold you had set to disregard poor matches will now pass far more results, or almost all of them.
== Cosine similarity comparisons ==
- 1:"{"role": "user", "content" <==> 0:" {"role": "user", "content" -
float32: 0.6457
- 2:"{"role": "user", "content" <==> 0:" {"role": "user", "content" -
float32: 0.7190
- 2:"{"role": "user", "content" <==> 1:" {"role": "user", "content" -
float32: 0.6199
The top two results are now separated by only 0.07 instead of 0.15: the JSON wrapper has compressed the useful signal.
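To make the threshold effect concrete, here’s a quick check against the scores above, using a hypothetical cutoff of 0.35 that would have filtered the raw-text comparisons sensibly:

```python
# Hypothetical cutoff: reasonable for the raw-text scores, useless once
# everything is wrapped in identical JSON boilerplate.
THRESHOLD = 0.35

raw_scores = [0.2633, 0.4182, 0.2729]
json_scores = [0.6457, 0.7190, 0.6199]

print([s >= THRESHOLD for s in raw_scores])   # [False, True, False]
print([s >= THRESHOLD for s in json_scores])  # [True, True, True]
```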
I hope that gives some insight. You might consider whether you can strip out the JSON structure and embed just the sentences, or use preprocessing to give emphasis to only what actually differs.
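One possible preprocessing step along those lines (strip_to_text is a hypothetical helper, assuming your records are chat-message JSON like the example above):

```python
import json

def strip_to_text(raw: str) -> str:
    """Extract just the human language from a chat-message JSON string,
    so the embedding reflects meaning rather than shared boilerplate."""
    try:
        doc = json.loads(raw)
    except json.JSONDecodeError:
        return raw  # not JSON: embed as-is
    return doc.get("content", raw) if isinstance(doc, dict) else raw

texts_to_embed = [strip_to_text(r) for r in wrapped]
```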