How to process structured data?

I’ve spent the past week trying various methods to get reliable structured outputs, with disappointing results.

I have a CSV file with headers: [category, phrase, subject, is_proper]
Then each row has data such as:

  • transportation, the bike is blue, bike, false
  • food, how to cook an omelet, omelet, false
  • transportation, how to drive a ferrari, Ferrari, true

I’m trying to extract the subject from each phrase and then determine if that subject is a proper noun.

I’ve tried many methods, including simulated back-and-forth messages at the start of the prompt to prime the model (few-shot prompting). This becomes very expensive and wasteful for thousands of rows.

I also tried creating a fine-tuned model, with the structured data batched 10 examples at a time. When I call the fine-tuned model I also supply a function call demonstrating the JSON output I would like. Not only does the model occasionally ignore the desired output structure, it also drops rows at random.

I find that any approach only starts working semi-reliably if I reduce batch sizes (i.e., the number of CSV rows per request) to a maximum of 5, and even then it is pushing it. Is AI only capable of processing one data row at a time? A typical CSV file could take hours to process if that is the case.

I’m only asking the AI to accomplish a few things:

  1. Extract the subject
  2. Determine if the subject is a proper noun
  3. Capitalize the subject if necessary
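
For example, for the first row above I’d want back something like:

{
  "subject": "bike",
  "is_proper": false
}

and for the Ferrari row, "subject" would be "Ferrari" with "is_proper" true.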

Am I missing something in my approach? I find it painfully disappointing that such a simple set of tasks is so difficult for the AI model.

My objective is to process 1000s of CSV rows, quite possibly with more complicated needs, but so far this is not going very well.

Have any suggestions?

Hi!
If I need to extract semantic meaning from long lists like this, the best method I have found is to use one of the mini models (fast and cheap), give it a template showing what the output “should” look like followed by a few lines of the data (definitely not more than 5), then split the data into chunks and process all of the chunks in parallel.
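
A rough sketch of what I mean, in Python for brevity (the model name, prompt wording, and column names are just placeholders to adapt):

import csv
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Template "shot": show the model exactly what one output line should look like.
TEMPLATE = (
    "For every input line, output exactly one JSON object on its own line, "
    'in this form and nothing else: {"subject": "<subject>", "is_proper": <true/false>}'
)

def chunks(rows, size=5):
    # Keep chunks small: reliability drops quickly with more rows per request.
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

def process_chunk(rows):
    lines = "\n".join(r["phrase"] for r in rows)  # assumes a "phrase" column
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # any cheap "mini" model
        messages=[
            {"role": "system", "content": TEMPLATE},
            {"role": "user", "content": lines},
        ],
        temperature=0,
    )
    return resp.choices[0].message.content

with open("data.csv", newline="") as f:
    rows = list(csv.DictReader(f))

with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(process_chunk, chunks(rows)))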


I’ve found that the best approach to this type of requirement is to process each record individually. I’d convert the CSV to JSON records, then pass them one at a time to whichever model you like (4o-mini would do well with your use case) and instruct the model to output JSON with your required changes/outputs. Throughput depends on your setup, latency to the OpenAI servers, and your service tier. You need to figure out a way to send multiple requests in parallel, or use the Batch API.
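
Something along these lines, as a sketch (Python for illustration; the prompt wording and key names are assumptions to adapt to your data):

import csv
import json
from openai import OpenAI

client = OpenAI()

INSTRUCTION = (
    "You will receive one record as JSON. Respond with JSON only, using the keys "
    '"subject" (capitalized if it is a proper noun) and "is_proper" (boolean).'
)

def process_record(record):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": INSTRUCTION},
            {"role": "user", "content": json.dumps(record)},
        ],
        response_format={"type": "json_object"},  # JSON mode: the reply is always valid JSON
        temperature=0,
    )
    return json.loads(resp.choices[0].message.content)

with open("data.csv", newline="") as f:
    for record in csv.DictReader(f):
        print(process_record(record))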


Thank you @Foxalabs

I am using 4o-mini

Would love to know more; can you elaborate please?

  • By template do you mean you are providing a JSON example somewhere in your prompt?
  • Do you pass any few-shot prompts prior to the real prompt?

I was hoping that my few-shot prompts had enough examples showing the expected return structure.
And I am also providing a function call that demonstrates the desired fields in a JSON format.
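
Simplified, the function definition I pass looks roughly like this (the name and description here are just illustrative):

{
  "type": "function",
  "function": {
    "name": "record_subject",
    "description": "Record the extracted subject for a row",
    "parameters": {
      "type": "object",
      "properties": {
        "subject": { "type": "string" },
        "is_proper": { "type": "boolean" }
      },
      "required": ["subject", "is_proper"]
    }
  }
}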

Thank you @mohanlalranvir
The batch API is definitely not an option, because I’m running this in a browser and expecting output in a timely manner.
I will look into making parallel requests, although that complicates things significantly!

I wonder how those plugins/add-ons for Excel and Google Sheets do their thing.

What are you building your front end in?

The entire project is Laravel/PHP @mohanlalranvir, so the frontend is Tailwind CSS/Alpine.js/Livewire (typical Laravel stuff).
This app is for internal use only… it’s just for me.

That’s cool - I haven’t used this stack before, so I can’t provide any recommendations.

In this case, you give the model a “shot” in the form of the expected output format, e.g.

{
  "name": "<name>",
  "age": <age>,
  "email": "<email.com>",
  "isActive": <true/false>,
  "roles": ["admin", "user"],
  "profile": {
    "bio": "Software developer from California.",
    "website": "https://johndoe.dev"
  }
}

and then some of the log lines related to that, and ask the AI to output the JSON only. See what reliability you get.
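
Concretely, for your CSV the whole user message could be something like this (wording is only an illustration):

For each input line, output one JSON object and nothing else:
{"subject": "<subject>", "is_proper": <true/false>}

Input lines:
the bike is blue
how to cook an omelet
how to drive a ferrari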

You could also look into structured outputs, but often that’s not required.
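
If you do go the structured-outputs route, the idea is to attach a JSON schema to the request so the response can only take that shape. A minimal sketch in Python, with the model and field names as examples only:

from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Extract the subject and say whether it is a proper noun."},
        {"role": "user", "content": "how to drive a ferrari"},
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "row_result",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "subject": {"type": "string"},
                    "is_proper": {"type": "boolean"},
                },
                "required": ["subject", "is_proper"],
                "additionalProperties": False,
            },
        },
    },
)
print(resp.choices[0].message.content)  # e.g. {"subject": "Ferrari", "is_proper": true}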

“Give the model a shot” - do you mean the user role or the assistant role?
I have been providing JSON outputs as the assistant role when few-shot prompting (sometimes it works, sometimes it doesn’t).
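
For reference, my message list looks roughly like this (trimmed, shown Python-style):

messages = [
    {"role": "system", "content": "Extract the subject and say whether it is a proper noun. Reply with JSON only."},
    {"role": "user", "content": "the bike is blue"},
    {"role": "assistant", "content": '{"subject": "bike", "is_proper": false}'},
    {"role": "user", "content": "how to drive a ferrari"},
    {"role": "assistant", "content": '{"subject": "Ferrari", "is_proper": true}'},
    # ...then the real rows follow as user messages
]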

What do you mean by “some of the log lines”? What is a log line?

You said you have a CSV file with entries; that’s what I meant by “log line” - poor word choice on my part. A “shot” in this instance is simply an example of what the model should do.
