How to process structured data?

I’ve spent the past week trying various methods to get reliable structured outputs, with disappointing results.

I have a CSV file with headers: [category, phrase, subject, is_proper]
Then each row has data such as:

  • transportation, the bike is blue, bike, false
  • food, how to cook an omelet, omelet, false
  • transportation, how to drive a ferrari, Ferrari, true

I’m trying to extract the subject from each phrase and then determine if that subject is a proper noun.

I’ve tried many methods, including simulated back-and-forth messages at the start of the prompt to prime the model (few-shot prompting). This becomes very expensive and wasteful for thousands of rows.

I also tried creating a fine-tuned model, with the structured data batched 10 examples at a time. When I call the fine-tuned model I also supply a function call demonstrating the JSON output I would like. Not only does the model occasionally ignore the desired output structure, it also drops rows at random.

I find that any approach only starts working semi-reliably if I reduce batch sizes (i.e., the number of CSV rows per request) to a maximum of 5, and even then it is pushing it. Is AI only capable of processing one data row at a time? A typical CSV file could take hours to process if that is the case.

I’m only asking the AI to accomplish a few things:

  1. Extract the subject
  2. Determine if the subject is a proper noun
  3. Capitalize the subject if necessary
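
For example, for the first row above I’d want back something like:

{
  "subject": "bike",
  "is_proper": false
}

and for the Ferrari row, "subject" would be "Ferrari" with "is_proper" true.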

Am I missing something in my approach? I find it painfully disappointing that such a simple set of tasks is so difficult for the AI model.

My objective is to process 1000s of CSV rows, quite possibly with more complicated needs, but so far this is not going very well.

Have any suggestions?

Hi!
If I need to extract semantic meaning from long lists like this, the best method I have found is to use one of the mini models (fast and cheap), give it a template showing what the output “should” look like followed by a few lines of the data (definitely not more than 5), then split the data into chunks and process all of the chunks in parallel.
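
A rough sketch of what I mean, in Python for brevity (the model name, prompt wording, and column names are just placeholders to adapt):

import csv
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Template "shot": show the model exactly what one output line should look like.
TEMPLATE = (
    "For every input line, output exactly one JSON object on its own line, "
    'in this form and nothing else: {"subject": "<subject>", "is_proper": <true/false>}'
)

def chunks(rows, size=5):
    # Keep chunks small: reliability drops quickly with more rows per request.
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

def process_chunk(rows):
    lines = "\n".join(r["phrase"] for r in rows)  # assumes a "phrase" column
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # any cheap "mini" model
        messages=[
            {"role": "system", "content": TEMPLATE},
            {"role": "user", "content": lines},
        ],
        temperature=0,
    )
    return resp.choices[0].message.content

with open("data.csv", newline="") as f:
    rows = list(csv.DictReader(f))

with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(process_chunk, chunks(rows)))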


I’ve found that the best approach to this type of requirement is to process each record individually. I’d convert the CSV to JSON records, then pass them one at a time to whichever model you like (4o-mini would do well with your use case) and instruct the model to output JSON with your required changes/outputs. Throughput depends on your setup, latency to the OpenAI servers, and your service tier. You need to figure out a way to send multiple requests in parallel, or use the Batch API.
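
Something along these lines, as a sketch (Python for illustration; the prompt wording and key names are assumptions to adapt to your data):

import csv
import json
from openai import OpenAI

client = OpenAI()

INSTRUCTION = (
    "You will receive one record as JSON. Respond with JSON only, using the keys "
    '"subject" (capitalized if it is a proper noun) and "is_proper" (boolean).'
)

def process_record(record):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": INSTRUCTION},
            {"role": "user", "content": json.dumps(record)},
        ],
        response_format={"type": "json_object"},  # JSON mode: the reply is always valid JSON
        temperature=0,
    )
    return json.loads(resp.choices[0].message.content)

with open("data.csv", newline="") as f:
    for record in csv.DictReader(f):
        print(process_record(record))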


Thank you @Foxalabs

I am using 4o-mini

Would love to know more; can you elaborate please?

  • By template do you mean you are providing a JSON example somewhere in your prompt?
  • Do you pass any few-shot prompts prior to the real prompt?

I was hoping that my few-shot prompts had enough examples showing the expected return structure.
And I am also providing a function call that demonstrates the desired fields in a JSON format.
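
Simplified, the function definition I pass looks roughly like this (the name and description here are just illustrative):

{
  "type": "function",
  "function": {
    "name": "record_subject",
    "description": "Record the extracted subject for a row",
    "parameters": {
      "type": "object",
      "properties": {
        "subject": { "type": "string" },
        "is_proper": { "type": "boolean" }
      },
      "required": ["subject", "is_proper"]
    }
  }
}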

Thank you @mohanlalranvir
The batch API is definitely not an option, because I’m running this in a browser and expecting output in a timely manner.
I will look into making parallel requests, although that complicates things significantly!

I wonder how those plugins/add-ons for Excel and Google Sheets do their thing.

What are you building your front end in?

The entire project is Laravel/PHP @mohanlalranvir, so the frontend is Tailwind CSS/Alpine.js/Livewire (typical Laravel stuff).
This app is for internal use only… it’s just for me.

That’s cool - I haven’t used this stack before, so I can’t provide any recommendations.

In this case, you give the model a “shot” in the form of the expected output format, e.g.

{
  "name": "<name>",
  "age": <age>,
  "email": "<email.com>",
  "isActive": <true/false>,
  "roles": ["admin", "user"],
  "profile": {
    "bio": "Software developer from California.",
    "website": "https://johndoe.dev"
  }
}

and then some of the log lines related to that, and ask the AI to output the JSON only. See what reliability you get.
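
Concretely, for your CSV the whole user message could be something like this (wording is only an illustration):

For each input line, output one JSON object and nothing else:
{"subject": "<subject>", "is_proper": <true/false>}

Input lines:
the bike is blue
how to cook an omelet
how to drive a ferrari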

You could also look into structured outputs, but often that’s not required.
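
If you do go the structured-outputs route, the idea is to attach a JSON schema to the request so the response can only take that shape. A minimal sketch in Python, with the model and field names as examples only:

from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Extract the subject and say whether it is a proper noun."},
        {"role": "user", "content": "how to drive a ferrari"},
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "row_result",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "subject": {"type": "string"},
                    "is_proper": {"type": "boolean"},
                },
                "required": ["subject", "is_proper"],
                "additionalProperties": False,
            },
        },
    },
)
print(resp.choices[0].message.content)  # e.g. {"subject": "Ferrari", "is_proper": true}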

“Give the model a shot” - do you mean the user role or the assistant role?
I have been providing JSON outputs as the assistant role when few-shot prompting (sometimes it works, sometimes it doesn’t).
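
For reference, my message list looks roughly like this (trimmed, shown Python-style):

messages = [
    {"role": "system", "content": "Extract the subject and say whether it is a proper noun. Reply with JSON only."},
    {"role": "user", "content": "the bike is blue"},
    {"role": "assistant", "content": '{"subject": "bike", "is_proper": false}'},
    {"role": "user", "content": "how to drive a ferrari"},
    {"role": "assistant", "content": '{"subject": "Ferrari", "is_proper": true}'},
    # ...then the real rows follow as user messages
]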

What do you mean by “some of the log lines”? What is a log line?

You said you have a CSV file with entries; that’s what I meant by “log line” - poor word choice on my part. A “shot” in this instance is simply an example of what the model should do.
