How to format excel files best for API ingestion?

I’m building a tool where understanding the excel contents well is essential. I’ve been sending in a JSON mapping of cell:value to GPT4o, and it understands it, but I feel it could be a much better understanding if it’s in the best format.

What’s the best way to format excel for understanding in the API?
Were excel sheets used as training data? If so, what format was that?

The technique I use which works well is to convert .csv and .xls files into markdown files that present the rows to the model as a sequence of records. So let’s say have a table like:

| name | age | favorite color |
| —- | —- | —- |
| Steve | 56 | red |
| Ava | 1 | pink |
| Donna | 50 | purple |

The way you want to present the information to the model is as:

name: Steve
age: 56
favorite color: red

name: Ava
age: 1
favorite color: pink

name: Donna
age: 50
favorite color: purple

There are 3 reasons why this works best for tabular data:

  1. These models have very poor spatial awareness. They’re better at reading information from top to bottom then side to side.
  2. These models love patterns so the repeating pattern of the records reinforce the models inherent pattern matching ability.
  3. If all of your column names are at the top of the table there’s a lot of distance between the value for row 100s 2nd column and the name of that column. By formatting the table as records the model has no excuse to not know the name of every column even if you have thousands of rows.

By the way the largest table I’ve tested had well over 1 million rows and the retrieval accuracy is excellent using this technique

2 Likes

Is there any library you recommend to do this? I tried openpyxl → pandas → markdown, but it doesn’t evaluate the formulas correctly, and i’m missing a lot fo styling that wasn’t applied to the markdown. Appreciate any tips.

I’m a TypeScript developer so I use Sheet JS. I’m not sure about python libraries

2 Likes

Hi @bbirkelund ! I’ve been doing exactly this and similar to what @stevenic described, I find it easiest to set it up as rows of key-value pairs. I usually have a utility function that does this but it’s very simple and easy to implement yourself in few lines of Python.

So let’s say we have a dummy Pandas DataFrame like so:

df = pd.DataFrame({
    "user_id": [
        "john",
        "emily",
        "anna",
        "doug"
    ],
    "nr_of_commits": [
        125,
        200,
        75,
        80,
    ]
})

To turn it into rows of key-value pairs, ready for injecting into the LLM prompt, I just do:

columns = list(df.columns)
data = []

for row in df.itertuples(index=False):
    data.append("; ".join([f"{key}: {val}" for key, val in zip(columns, row)]))

The output will look like this:

print("\n".join(data))

user_id: john; nr_of_commits: 125
user_id: emily; nr_of_commits: 200
user_id: anna; nr_of_commits: 75
user_id: doug; nr_of_commits: 80

And finally I just make the API call with this key-value data injected into the user prompt like so:

...
messages=[
        {
            "role": "user",
            "content": "\n".join(data)
        },
...

I never found any need for external libs. If it’s a very large amount of data that won’t fit in the context window, or you are constructing .jsonl batch files, it’s very easy just to iterate over that data list in chunks/steps.

Hope this helps!

2 Likes

I primarily use Sheet JS to read in the excel file. Sheet JS has a nice helper function that maps all of the rows on a page to an array of JSON objects. I call that and then convert these objects to markdown using code. I add some logic to remove empty fields from the output as that can save you a ton of tokens for spreadsheets that are really sparse.

1 Like

I’ll add that I have growing evidence that mapping table rows to this record format literally improves the models ability to reason over the data.

These models are pattern matchers so giving every record a similar shape leans into the models pattern matching abilities. It moves all of the records to be closer to each other in embedding space, which makes it easier for the model to compare and contrast the records with each other.

I’d encourage you to think of your prompts as transforms and with every prompt you’re creating a transform that intersects your query/instruction with patterns in the prompt and the patterns in the models world knowledge. The more you can align all of your patterns the easier it is for the model to perform the desired transform and the more reliable the models response.

4 Likes

@stevenic ah yes, there is also gspread that I use often in combination with Pandas. And with Pandas (in Python world) you can do so much (cleaning, augmenting, filtering). I just meant, for that very specific task of transforming from Pandas df to a prompt, I didn’t find much use of external libs.

Hi guys,

Personally I would store the data as objects where column names are fields and values are values after adding some extra data to the object as data source, document and whatever else is needed by your RAG engine to properly handle the data.

Then I would vectorize the object, while I mostly advise against including the field labels in vectors, in case of spreadsheet data I think you should include the field labels as well. However, maybe not all fields must be included in the final vector.

Good thing to consider bedside vector DB, is to use regular DB as well for easier retrieval when no semantic search is needed.

When presenting the data to LLM, I would present it as simple text formatted similar to:

Object name
Field name: value
Field name: value

But the most appropriate presentation will depend on your application needs.

As far as prompting is concerned, I would see what operations can run on a single data item, where instructions are system message and input is user message. And run them in parallel. Then combine the results and pass to models that need the totality of data items. But again, the logic depends on your application.

Also depending the input format of the CSV files and the flexibility of your objects, might be worth checking weaviate for the RAG engine solution, especially their auto schema feature.

Another point, with CSV files try to handle as much as possible in your regular code before using LLMs.

I have several other tips published in this form, so feel free to check them up.

1 Like

This looks “suspiciously” like a YAML or JSON representation of the CSV file.

Do you feel you would get the same/similar/better/worse results with your suggested format compared to the YAML/JSON representation of the same data?

I don’t think the format really matters. It’s a matter of repeating the pattern and a function of distance. The closer distance wise all the fields are to each other the easier it is for the model to recognize the pattern.

It’s the same basic principles that make multi-shot examples work. The repeating pattern helps the model better predict a similar pattern.

2 Likes

That jibes with my experience as well, anything done in service to reducing the cognitive load of the attention mechanism is almost always worth the investment.

For me, what works best is

df.to_dict('records')

It is close to the response from @stevenic