How can I use chat/completion API on large datasets of "arbitrary" JSON

I have an app where my customers perform certain tasks and the output of that task generates a json in the end. The cumulative set of tasks result in a set of JSON blobs and can be quite large per customer/tenant - 500mb-1GB. I’d like to train a model or use an existing one where I can train one model per customer/tenant (?), so that they can chat with the model to get useful information out of it.

Each customer’s JSON schema is different and hence the “arbitrary” in the title.

I can see that if I take small snippets of that JSON and feed into the chat to GPT 4+ and ask questions on it, it does well enough. But ofcourse I can’t use/send that much JSON over the API given the token restriction.

I am also not sure if I should have a fine tuned model per tenant, given the dataset size. Curious to hear any ideas or feedback.

FWIW, I have considered something like RAG, where I generate the embeddings for this subset of the JSON task, as it is generated, feed it into a vector DB and then use the completion / chat API in open AI. However, the size of json even for the subset of the tasks is also quite large.

Maybe something gets released in tomorrow’s launch perhaps?

There is no other way but to chunk your JSON file if it is too large to handle. Just setup the metadata of each JSON file so that you can track from what original JSON file it is associated if you need to reference it later.

1 Like

Is it possible to find a tune a single model with a single set of ~500MB of data even if its chunked?

Based on your requirement, you will be needing a RAG solution thus embedding not fine-tuning.

Your main problem is how to chunk each JSON file within the allowed 8191 tokens for text-embedding-ada-002. You can probably just treat it as text file, chunking it by text length or if you want to maintain valid JSON, it will depend on your own schema how to do it.

After that, finding answer to your query will be a database process so it does not matter if you have 500MB or 1GB vector data since you are not calling any API.

Yeah, I was wondering about that. Thank you for helping me brainstorm.

I guess, I can just then use vector search. Because the dataset is a lot of loose json, the accuracy can break and gpt4 completion is actually useful.

Is there some way for me to use gpt4 completion on this? Or it have to be fully RAG?

They just unveiled a lot of stuff today. Even for file retrieval. Sam mentioned that it will solve manually chunking things. I still need to dig to the docs and see what changes.