How can I use chat/completion API on large datasets of "arbitrary" JSON

nehec82319 · November 5, 2023, 9:41pm

I have an app where my customers perform certain tasks, and each task produces a JSON output at the end. The cumulative set of tasks results in a set of JSON blobs that can be quite large per customer/tenant: 500MB to 1GB. I’d like to train a model or use an existing one, possibly one model per customer/tenant (?), so that they can chat with the model to get useful information out of their data.

Each customer’s JSON schema is different and hence the “arbitrary” in the title.

I can see that if I take small snippets of that JSON, feed them into the chat with GPT-4, and ask questions about them, it does well enough. But of course I can’t send that much JSON over the API, given the token limit.

I am also not sure whether I should have a fine-tuned model per tenant, given the dataset size. Curious to hear any ideas or feedback.

nehec82319 · November 5, 2023, 9:45pm

FWIW, I have considered something like RAG, where I generate embeddings for a subset of the task JSON as it is generated, feed them into a vector DB, and then use the OpenAI completion/chat API. However, even that subset of the JSON is quite large.

Maybe something will be released in tomorrow’s launch?

supershaneski · November 5, 2023, 11:55pm

There is no other way but to chunk your JSON file if it is too large to handle. Just set up metadata for each chunk so that you can track which original JSON file it came from if you need to reference it later.
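A minimal sketch of that idea in Python, assuming plain character-based chunking. The function name, the metadata field names, the file path, and the `max_chars` value are all made up for illustration and are not part of any API:

```python
import json

def chunk_text_with_metadata(text, source_file, max_chars=2000):
    """Split raw text into fixed-size pieces, tagging each piece with its origin."""
    chunks = []
    for index, start in enumerate(range(0, len(text), max_chars)):
        chunks.append({
            "source_file": source_file,  # which original JSON file this chunk came from
            "chunk_index": index,        # position of the chunk within that file
            "char_offset": start,        # where the chunk starts in the original text
            "text": text[start:start + max_chars],
        })
    return chunks

# Hypothetical example data standing in for one task output.
raw = json.dumps({"task": "example", "values": list(range(50))})
chunks = chunk_text_with_metadata(raw, source_file="tenant_a/task_001.json", max_chars=100)
```

Storing the metadata next to each chunk (and later next to its embedding) is what lets you go from a search hit back to the original file.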

nehec82319 · November 6, 2023, 2:31am

Is it possible to fine-tune a single model on a single set of ~500MB of data, even if it’s chunked?

supershaneski · November 6, 2023, 2:55am

Based on your requirements, you will need a RAG solution, which means embeddings, not fine-tuning.

Your main problem is how to chunk each JSON file so that every chunk fits within the 8191-token input limit of text-embedding-ada-002. You can probably just treat it as a text file and chunk it by text length; if you want each chunk to remain valid JSON, how to do that will depend on your own schema.

After that, finding the answer to your query is a database operation, so it does not matter whether you have 500MB or 1GB of vector data, since you are not calling any API for the search.
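If you do want valid JSON per chunk, one possible approach is sketched below, assuming the file is (or can be flattened into) a list of records. It greedily packs records into JSON arrays that stay under a token budget. `estimate_tokens` is a rough placeholder (about 4 characters per token, which is only an assumption); for real counts you would pass in a proper tokenizer for the embedding model. All names here are hypothetical:

```python
import json

def estimate_tokens(text):
    # Rough assumption: about 4 characters per token. Replace with a real tokenizer.
    return len(text) // 4 + 1

def pack_records(records, max_tokens=8191, count_tokens=estimate_tokens):
    """Greedily group records into JSON arrays that each stay under max_tokens."""
    chunks, current = [], []
    for record in records:
        candidate = current + [record]
        if current and count_tokens(json.dumps(candidate)) > max_tokens:
            # Adding this record would overflow the budget: close the current chunk.
            chunks.append(json.dumps(current))
            current = [record]
        else:
            current = candidate
    if current:
        chunks.append(json.dumps(current))
    return chunks

# Hypothetical records; a tiny budget is used so the splitting is visible.
records = [{"id": i, "payload": "x" * 40} for i in range(20)]
chunks = pack_records(records, max_tokens=50)
```

Note that a single record larger than the budget still ends up as its own oversized chunk in this sketch, so deeply nested or very large records would need to be split further according to your schema.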
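To illustrate why the search step is local, here is a small sketch of cosine-similarity lookup over stored vectors in plain Python. A real vector DB does this with an index instead of a full scan, and the toy 3-dimensional vectors and chunk ids are invented for the example. The query itself would still need to be embedded once with the same model as the chunks:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vector, stored, k=3):
    """stored is a list of (chunk_id, vector) pairs; returns the k closest chunk ids."""
    scored = [(cosine_similarity(query_vector, vec), chunk_id) for chunk_id, vec in stored]
    scored.sort(reverse=True)  # highest similarity first
    return [chunk_id for _, chunk_id in scored[:k]]

# Toy vectors standing in for real embeddings.
stored = [
    ("chunk-a", [1.0, 0.0, 0.0]),
    ("chunk-b", [0.0, 1.0, 0.0]),
    ("chunk-c", [0.9, 0.1, 0.0]),
]
result = top_k([1.0, 0.0, 0.0], stored, k=2)
```

Only the top few chunks returned by a lookup like this would then be sent to the chat model, which is how the token limit stops being a problem.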

nehec82319 · November 6, 2023, 8:37pm

Yeah, I was wondering about that. Thank you for helping me brainstorm.

I guess I can then just use vector search. Because the dataset is a lot of loose JSON, the accuracy can suffer, and GPT-4 completion is actually useful there.

Is there some way for me to use GPT-4 completion on this? Or does it have to be fully RAG?

supershaneski · November 6, 2023, 11:16pm

They just unveiled a lot of stuff today, including file retrieval. Sam mentioned that it will remove the need to chunk things manually. I still need to dig into the docs and see what changes.

otito · March 12, 2024, 8:49am

Hey, have you been able to solve this issue? I’m facing a similar problem and I can’t make headway on it.
