Practice over larger data sets - as result of function

Hi guys

I’m looking for advice on how to handle extensive data sets as a result of a function.

I have a function (tool available to the model) that creates an SQL query based on the user’s question and returns data to LLM

User questions can be:

  1. how many articles do we have
  2. Give me 10 latest articles
  3. Export all articles published today
  4. Give me all articles written by XX YY

Of course, this is just a simplified description of the problem. Our relative data is much more complex, but the articles’ examples are hopefully good enough to explain the problem.

So the issue we are facing with is how to deal when SQL generated based on a user’s query returns 1,000, 10,000 or 100,000 records.

Obviously, this is too much to give back to the LLM. But we still need LLM to process this data.

How do you solve this kind of issue

Do you save data somewhere and pass it on as a link instead of extracting it and sending it back to LLm as JSON and then relying on the coding assistant to process it (if the user wants some summarisation)?

thanks in advance

Above a certain amount of rows, I would return an error to the LLM stating that the user needs to be more specific in their requirements and suggest they consider grouping by a column if required.

In this case you want as much work as possible happening, deterministically, on the DB, and leave the context the LLM has to deal with as small as possible, both for cost and accuracy.

You use a DB every day that has the same problem… it’s called Google.

So one suggestion would be to sort the results by some reasonable dimension and show the model the first-n results. Make it clear to the model that it’s only seeing part of the result set. You could then give it a function to page through additional results if you need to but be careful with that.

This is a data-oriented b2b solution, so it’s entirely valid for the user to request the extraction/export of the data behind it. The cost of tokens is also not an issue.

We tried with 1,000 records returned from the database, but LLM got lazy and listed only the first 50.

e.g.
How many articles XYZ wrote
LLM creates Count(*) SQL and our method returns 175 - LLM returns this to user

But when the user asks to list all the articles - in our extract data method LLM correctly generates select * query and returns all 175 to LLM - LLM writes:

Article 1
Article 2

This is a sample of articles written by XYZ. What do you want to do:

When user says export to excel - LLM correctly calls coding assistant, but coding assistant gets just a few articles not all

This is a very natural tendency for LLMs. It’s part a function of their fine tuning and part their natural desire to summarize. The longer the task the more likely they are to want to summarize their response. Getting models to reliably use their full output context window is incredibly difficult.

I’m assuming you’re trying to use the model as an intelligent transform which is reasonable. They’re good at that task.

If you have a lot of rows that you need transformed you’re going to need to pass them in in batches. I’ll warn you that the larger the batch size the more likely it is to make a mistake. I do this a lot and the only way to get 100% of the rows out that you put in is to do it one by one. Even increasing the batch size to 2 will result in the occasional dropped row.

Ah you’re doing this as part of an assistant. That’s going to be tough to fix. Like I said, it stems largely from their fine tuning and natural desire to summarize

Expecting the LLM to reliably process and display 100,000 rows is unreasonable.

I would add an export feature to the function code which the LLM can link to user via URL.