Using GPT to structure large amounts of data into JSON format

Hey guys,
I am using an API call to feed unstructured data into GPT-4 and get it back as a structured JSON object, which is then used further in my application.

The problem I’ve encountered is that some users want to feed really large data into it, so large that I am running into the token limit. Once the output hits that limit, the LLM starts to summarize things in order to still send back a complete JSON object.

What would be the best way to parse unstructured data of, let's say, 6k tokens into a structured JSON object via the OpenAI API?

So are you trying to get back the content of the input data verbatim or is there some transformation happening as it is restructured into a JSON object?


If you have access to the older gpt-4-32k model, it could work.

Additionally, with gpt-4-turbo, if you can get it to produce the output verbatim, you can work around the cut-off at 4,096 output tokens: when a response is truncated, append the assistant message you received to the existing messages list and make the API call again, and the model will continue from where it stopped.
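To sketch what that looks like in code (a minimal non-streaming version, assuming the official openai Node SDK; completeWithContinuation is just an illustrative name):

import OpenAI from "openai";
import type { ChatCompletionMessageParam } from "openai/resources/chat/completions";

const openai = new OpenAI();

// Ask for a completion and, whenever the output is cut off at the token
// limit (finish_reason === "length"), append what we have so far as an
// assistant message and call the API again so the model continues.
async function completeWithContinuation(
  messages: ChatCompletionMessageParam[],
): Promise<string> {
  let output = "";
  for (;;) {
    const completion = await openai.chat.completions.create({
      model: "gpt-4-turbo",
      temperature: 0,
      messages: output
        ? [...messages, { role: "assistant" as const, content: output }]
        : messages,
    });
    const choice = completion.choices[0];
    output += choice.message.content ?? "";
    if (choice.finish_reason !== "length") {
      return output; // stopped normally, so the output is complete
    }
  }
}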


If it’s a Facebook download, try to divide it into batches.

Yes, verbatim. I am also using a temperature of 0. I just want to transform unstructured data into a structured JSON object.

I am accessing it via the API, so I could specify that model. But doesn't that have a 4,096-token output limit as well?

What you are describing would solve my problem. How would I do that, since sending the same API call would just get me the first 4,096 tokens again? Right now I am making a single API call and saving the response in my database.

Is there a way I can use the Assistants API with threads to accomplish what I want?

Depending on what structure you are opting for, you might want to take a look at this thread:

A few members of the forum, including myself, discussed and worked out a solution for semantically chunking a document using GPT-4-turbo. In essence, the approach involves using GPT-4-turbo to create an outline of the document (incl. identifying the start and end position of each section within the document) and then using that information to programmatically extract the text verbatim from the document into a structured JSON.

The benefit of this approach is that you only need one API call to get the document's basic structure, and you don't have to worry about output token constraints. Additionally, it saves a lot of cost compared to a scenario where you ask the model to return the text verbatim. That said, the approach is currently mostly applicable to documents with clearly defined sections.
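For illustration, here's a rough sketch of that outline-then-extract idea (assuming the openai Node SDK; the outline schema with a title plus short verbatim start/end snippets is my own simplification, not necessarily the exact format from the linked thread):

import OpenAI from "openai";

const openai = new OpenAI();

interface Section {
  title: string;
  startText: string; // first few words of the section, verbatim
  endText: string; // last few words of the section, verbatim
}

// One API call: ask the model only for the document's outline, not for
// the full text, so the output stays far below the token limit.
async function extractSections(document: string) {
  const completion = await openai.chat.completions.create({
    model: "gpt-4-turbo",
    temperature: 0,
    response_format: { type: "json_object" },
    messages: [
      {
        role: "system",
        content:
          "You create document outlines. Return a JSON object of the form " +
          '{"sections": [{"title": ..., "startText": ..., "endText": ...}]} ' +
          "where startText and endText are short verbatim snippets marking " +
          "the first and last words of each section.",
      },
      { role: "user", content: document },
    ],
  });

  const { sections = [] } = JSON.parse(
    completion.choices[0].message.content ?? "{}",
  ) as { sections: Section[] };

  // Extract the text verbatim in code, so nothing gets summarized.
  return sections.flatMap((s) => {
    const start = document.indexOf(s.startText);
    if (start === -1) return []; // snippet not found verbatim; skip it
    const end = document.indexOf(s.endText, start);
    if (end === -1) return [];
    return [
      { title: s.title, text: document.slice(start, end + s.endText.length) },
    ];
  });
}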


Great discussion, thanks for linking. I actually had your exact method implemented a few months ago to save cost. I'll think about it and see if I can chip in on the other thread.


I’ve gotten this to work quite well with gpt-4o using the following.
(BTW, I’ve found JSON mode actually does not work well when the response is longer than 4,096 tokens.)

import OpenAI from "openai";
import type { ChatCompletionChunk } from "openai/resources/chat/completions";

// model, system, messages, temperature, responseFormat and onEvents are
// assumed to be defined in the surrounding scope.
let responseMessage = "";
const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});
// Keep requesting continuations until the model stops for a reason other
// than hitting the output token limit.
// eslint-disable-next-line no-constant-condition
while (true) {
  const responseStream = await openai.chat.completions
    .create({
      model,
      messages: [
        {
          role: "system",
          content: system,
        },
        ...messages,
        // Feed everything generated so far back as an assistant message so
        // the model continues where the truncated output left off.
        ...(responseMessage
          ? [
              {
                role: "assistant" as const,
                content: responseMessage,
              },
            ]
          : []),
      ],
      temperature,
      stream: true,
      ...(responseFormat === "json" && {
        response_format: {
          type: "json_object",
        },
      }),
    })
    .withResponse();

  let lastFinishReason:
    | ChatCompletionChunk.Choice["finish_reason"]
    | undefined = undefined;
  for await (const chunk of responseStream.data) {
    const content = chunk.choices[0]?.delta?.content || "";
    if (!responseMessage) {
      onEvents.onStart?.();
    }
    responseMessage += content;
    onEvents.onChunk?.(content, responseMessage);
    lastFinishReason = chunk.choices[0]?.finish_reason;
  }
  // finish_reason === "length" means the output was cut off at the max
  // output token limit, so loop and ask for the continuation.
  if (lastFinishReason === "length") {
    console.log("Response stopped because of max tokens");
    continue;
  } else {
    break;
  }
}