I am building a search application using RAG and need to process deeply nested JSON data. Could anyone guide me on whether I can chunk the JSON to find similar matches, or whether there is a better approach? I would also love to know whether OpenAI models can understand deeply nested JSON structures and give accurate responses when queried. Any insights would be greatly appreciated!
Hey,
this does not make sense at all.
If you want to search structured data, you can use normal programming logic. Why waste time, money, and energy on an LLM when you already have structured data?
@123s - As @jochenschultz mentioned, it is not required to use AI. If you wish to do so anyway: isolate the logic to classify the top-level JSON using AI, extract it using programming logic, perform RAG, and repeat. Alternatively, if RAG is the sole purpose and context is the primary need:
a. Chunk the JSON data into smaller, manageable pieces.
b. Convert these chunks into string format.
c. Perform a simple RAG operation using these converted strings. Note: you may use programming logic to highlight JSON levels and key-value pairs and see how it performs; a sketch of these steps follows below.
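For example, here is a minimal sketch of steps (a)-(c), assuming the OpenAI Python SDK (v1) and the `text-embedding-3-small` model; helper names like `chunk_json` and `search` are made up for illustration, not a library API:

```python
import json
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def chunk_json(obj, path="$"):
    """Recursively flatten nested JSON into 'path: value' strings,
    so each chunk carries its position in the hierarchy as context."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from chunk_json(value, f"{path}.{key}")
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            yield from chunk_json(value, f"{path}[{i}]")
    else:
        yield f"{path}: {obj}"

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def search(query, data, top_k=3):
    chunks = list(chunk_json(data))
    doc_emb = embed(chunks)
    q_emb = embed([query])[0]
    # cosine similarity between the query and every chunk
    scores = doc_emb @ q_emb / (np.linalg.norm(doc_emb, axis=1) * np.linalg.norm(q_emb))
    return [chunks[i] for i in np.argsort(scores)[::-1][:top_k]]

data = json.loads('{"users": [{"name": "Ada", "comment": "Great API docs"}]}')
print(search("what did users say about the docs?", data))
```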
Hope this helps! Cheers!
Beyond stupidity? Say I have a key for user comments and I want to query across the comments of all users. Would DB querying still work for parsing those comments? On what fields would you query when intent plays a role?
That would be a proper use case for an LLM.
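For example (a sketch only, reusing the hypothetical `embed` helper from the sketch above): plain code extracts the structured part, and embeddings handle the fuzzy intent part.

```python
import numpy as np

def comments_matching_intent(users, query, top_k=5):
    # deterministic extraction: no LLM needed to walk the JSON
    comments = [u["comment"] for u in users if "comment" in u]
    # semantic ranking: this is the part where intent actually matters
    doc_emb, q_emb = embed(comments), embed([query])[0]
    scores = doc_emb @ q_emb / (np.linalg.norm(doc_emb, axis=1) * np.linalg.norm(q_emb))
    return [comments[i] for i in np.argsort(scores)[::-1][:top_k]]
```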
Funny, I assumed SDEs were supposed to think through all cases.
Who told you that?
Are you programming a “bull collision avoidance system” into a moon lander because there might be a herd of flying cows in space?
Ah wait, I forgot the emoji.
Repeat after me: for similarity search you don’t use a GPT!
There are sentence transformers if you insist on using AI stuff.
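For instance, here is a minimal sketch using the `sentence-transformers` library (`pip install sentence-transformers`; the model name below is just a common small default, not a recommendation from this thread):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

corpus = [
    "GET /users - list all users",
    "POST /orders - create a new order",
    "DELETE /sessions/{id} - log a user out",
]
query_emb = model.encode("how do I sign someone out?", convert_to_tensor=True)
corpus_emb = model.encode(corpus, convert_to_tensor=True)

# returns the top matches with cosine scores - no generative model involved
for hit in util.semantic_search(query_emb, corpus_emb, top_k=2)[0]:
    print(corpus[hit["corpus_id"]], hit["score"])
```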
Thank you for your responses. I completely understand that LLMs are preferred for unstructured data. I may not have conveyed my context clearly…
However, in my case, the backend data is stored in JSON format; assume I need to implement a search function on top of an API documentation platform. Since the backend data is structured as JSON rather than plain text, I need to chunk the JSON data effectively to enable efficient searching within the documentation. My query will be a natural-language query based on intent (e.g., endpoints for a specific scenario).
I was already trying the logic you mentioned here; I am just checking whether there is any alternative approach…
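One idea I am trying for the chunking: one chunk per endpoint, serialized as readable text rather than raw JSON, so the embedding sees what the endpoint does instead of braces and quotes. The field names below are hypothetical, based on a typical docs schema:

```python
def endpoint_to_chunk(ep):
    """Serialize one endpoint record into a readable text chunk."""
    params = ", ".join(p["name"] for p in ep.get("parameters", []))
    return (f"{ep['method']} {ep['path']}: {ep['description']}"
            + (f" (parameters: {params})" if params else ""))

doc = {
    "method": "GET",
    "path": "/orders/{id}",
    "description": "Fetch a single order by its identifier",
    "parameters": [{"name": "id"}],
}
print(endpoint_to_chunk(doc))
# -> GET /orders/{id}: Fetch a single order by its identifier (parameters: id)
```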
A great example is Home Assistant, which includes its entire YAML data structure in the prompt. However, the results are somewhat mixed.
You can use an LLM to create an RDBMS structure, put the structured data into it, combine entries with a graph with multiple subgraphs, and add embeddings.
Data is context - providing context to the LLM is key to getting good results. So you need to build software that prompts the LLM correctly by selecting which data it needs, which it does not, and which might confuse it (you can see that in the graph, and you can tell it not to go that way again if it fails).
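A minimal sketch of that selection step, assuming a Postgres table with a pgvector column (the table and column names here are hypothetical; `<=>` is pgvector's cosine-distance operator):

```python
import psycopg2

def relevant_context(query_embedding, k=5):
    """Pull only the k chunks most similar to the query embedding,
    instead of stuffing the whole data structure into the prompt."""
    conn = psycopg2.connect("dbname=app")
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            SELECT content
            FROM doc_chunks
            ORDER BY embedding <=> %s::vector
            LIMIT %s
            """,
            (str(query_embedding), k),
        )
        return [row[0] for row in cur.fetchall()]
```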
Here is something I made to demonstrate it on a smaller scale…
The “system prompt” is generated per chat message - which drastically reduces the cost for the LLM as well…
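The per-message assembly might look like this (a sketch, not how the linked demo actually does it) - only the handful of chunks retrieved for that message go into the prompt, which is what keeps the token count, and therefore the cost, down:

```python
def build_system_prompt(chunks):
    """Assemble a fresh system prompt for each chat message
    from the few chunks retrieved for that message."""
    context = "\n".join(f"- {c}" for c in chunks)
    return ("Answer using only the context below. "
            "If the answer is not in the context, say so.\n\n"
            f"Context:\n{context}")
```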
Would be a very good base for Home Assistant as well, btw…
The only problem I see is that it has extreme complexity: PostgreSQL (so you need SQL knowledge), PostGIS (so you need to know how geoinformatics works), pgvector, Neo4j, MinIO, RabbitMQ, Python, PHP (Symfony, API Platform), TypeScript (Vue 3 with Pinia - woah, I love that pineapple state), Go, and 50% of it consists of shell scripts to automate the infrastructure. I guess once the system itself is analyzed, that will get easier… Then the devs can just get an issue, the relevant code parts are highlighted, and a course is generated that onboards them to the task.