Efficient Way to Chunk CSV Files or Structured Data

What is the best way to chunk CSV files (by rows or by columns) for generating embeddings for efficient retrieval?


It depends on what you are trying to achieve as the final result.

CSV data is one of the sources for our RAG app. I am already selecting only the necessary columns, and my theory is that the chunking logic for structured vs. unstructured data should be different.
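For the row-based option, here is a minimal sketch of what I mean (the column names and chunk size are placeholders, not anything specific to our data):

```python
import csv

def chunk_csv_rows(path, keep_columns, rows_per_chunk=20):
    """Read a CSV, keep only the needed columns, and group rows into
    chunks small enough to embed as a single passage."""
    chunks, current = [], []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # Render each row as "Column: value" pairs so the embedding keeps the schema context.
            current.append(", ".join(f"{col}: {row[col]}" for col in keep_columns if col in row))
            if len(current) >= rows_per_chunk:
                chunks.append("\n".join(current))
                current = []
    if current:
        chunks.append("\n".join(current))
    return chunks

# Hypothetical usage; the column names are placeholders, not from this thread.
# chunks = chunk_csv_rows("books.csv", ["Title", "Author", "Price"])
```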

Without specifics, it's hard to say:

  1. RAG app - what is the object structure stored in the app?
  2. What are the columns in the CSV?
  3. What is the typical content of the cells per column? And what are the text size ranges?
  4. What does the text in the cells look like (raw unformatted; raw formatted/split into sections/paragraphs/etc.; Markdown, HTML, XML, JSON, etc.)?
  5. What’s the goal of the app outside of the RAG engine?

Thank you for the questions, they are helpful in thinking this through. I cannot share any specifics since we are under an NDA; however, the content is a combination of text and numbers. A good analogy is a database of books in a public library system. The application's intent is the ability to query over structured (CSV) data and unstructured data to answer questions in a multi-step retrieval process. ChatGPT Plus (GPT-4o) is able to process the CSV and answer questions, and I would like to replicate that behavior across multiple CSVs for our specific set of use cases.

Here is an old video I made a few months ago about this, one of my oldest videos :smile:

I have an Assistant built with the API, and it uses RAG to read the CSV file. Here are the instructions and some example functions: flexiai/examples/Instructions and functions examples/assistant_using_csv.md at main · SavinRazvan/flexiai · GitHub

Here is the Repo:
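If you just want the general shape of that setup, here is a minimal sketch using the OpenAI Assistants API directly (not the helpers from the repo; the file name and question are hypothetical):

```python
from openai import OpenAI

client = OpenAI()

# Upload the CSV so the assistant's code interpreter can open it.
csv_file = client.files.create(file=open("books.csv", "rb"), purpose="assistants")

assistant = client.beta.assistants.create(
    name="CSV analyst",
    instructions="Answer questions by loading and analyzing the attached CSV with pandas.",
    model="gpt-4o",
    tools=[{"type": "code_interpreter"}],
    tool_resources={"code_interpreter": {"file_ids": [csv_file.id]}},
)

# Ask a question in a thread and wait for the run to finish.
thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id, role="user", content="How many books cost less than $15?"
)
run = client.beta.threads.runs.create_and_poll(thread_id=thread.id, assistant_id=assistant.id)
```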

Personally, I prefer to structure my thinking backwards from the final goal:

  1. What is the feature you need to develop?
  2. How should it work (user story)?
  3. What are the current workflows humans use to get this done?
  4. What is common in those workflows, and why?
  5. What would be the ideal workflow (general building blocks)?
  6. From the workflow's last step backwards:
  • What is the outcome at this step (data structure model)?
  • How is it produced?
  • What input is needed to perform this operation (data structure model)?
  • How does this step work (detailed sub-procedure)?
  • Repeat for all steps in reverse order.
  7. What is the best way to organize the data storage for the models generated in step #6? (A sketch of such a stored record follows this list.)
  8. How do you process the available data to produce the storable items described in #7 (workflow)?
  9. Where do you get the data to feed into step #8 (sources + a way to find the data)?
  10. Where are the weak points in the system above, and how can they be improved?
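To make step 7 concrete, here is a rough sketch of what a stored, prompt-ready record could look like for the library-catalog analogy (the field names are assumptions, not anything from this thread):

```python
from dataclasses import dataclass, field

@dataclass
class RetrievableItem:
    """One storable unit for the RAG engine: already formatted for prompt
    insertion, so no post-processing is needed at retrieval time."""
    item_id: str
    prompt_text: str            # e.g. 'Book Title: ..., Author: ..., Price: ...'
    embedding: list[float]      # vector used for similarity search
    metadata: dict = field(default_factory=dict)  # source file, row number, etc. for filtering
```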

Hope that helps

Weak points (off the top of my head):

  • Number processing by the LLM should probably be replaced by regular programming.
  • Double-check your data models in the RAG engine so that they make sense for retrieval operations and don't stay inside the thinking box of the domain you're working with.
  • Often it is better to pull more data out of the vector DB and pass it through a "data quality filter" before selecting items to stuff into your prompts (see the sketch after this list).
  • Ideally, a retrieved data item should not need post-processing to be inserted into a prompt (so it's really your prompt that decides how you store the data), because you data-mine once and search for it all the time.
  • Use classic code whenever possible: an LLM is not an exact science and errors fly all over. The best approach in the long run is to use the LLM as a tool that lets classic code easily access the semantics (/si'mantiks/ - the meaning as you hear it™ - will be my new brand for my AI tools) of your data, so you can use solid logic to process it.
  • Break LLM tasks down as much as you can, to simplify them and be able to use short prompts on cheap models.
  • Log your operations with input/output from the start, to gather training data for fine-tuning in case you need it later on.
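A minimal sketch of the "over-fetch then filter" point above; the vector_db.search call and the score/text attributes are assumed interfaces, not any specific library:

```python
def select_context(query_embedding, vector_db, k_retrieve=25, k_keep=5, min_score=0.75):
    """Over-fetch candidates from the vector DB, then apply a plain-code quality
    filter before deciding what actually goes into the prompt."""
    candidates = vector_db.search(query_embedding, top_k=k_retrieve)  # assumed client method
    # Classic-code filter: drop weak matches and near-duplicate texts.
    seen, kept = set(), []
    for item in sorted(candidates, key=lambda c: c.score, reverse=True):
        if item.score < min_score or item.text in seen:
            continue
        seen.add(item.text)
        kept.append(item)
        if len(kept) == k_keep:
            break
    return kept
```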

Here’s what I experimented with: I preprocessed my CSV dataset to generate a single CSV file with the necessary information and followed the “design pattern” specified in this cookbook to create a text summary column for retrieval. Since embeddings work effectively on textual data (and rightfully so), I baked a “semantic” meaning into each row. For example, if I have a table of books, my summary column includes: Book Title: “As the crow flies”, Author: “John Grisham”, Price: $12.50.
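Roughly, the pattern looks like this (a sketch assuming a pandas DataFrame with Title/Author/Price columns; the column names and embedding model are placeholders, not prescribed by the cookbook):

```python
import pandas as pd
from openai import OpenAI

client = OpenAI()
df = pd.read_csv("books.csv")  # hypothetical file from the library analogy

# Bake the schema into each row so the embedding carries semantic context, not bare values.
df["summary"] = df.apply(
    lambda r: f'Book Title: "{r["Title"]}", Author: "{r["Author"]}", Price: ${r["Price"]}',
    axis=1,
)

# Embed every summary in one request; batch further for large files.
response = client.embeddings.create(model="text-embedding-3-small", input=df["summary"].tolist())
df["embedding"] = [d.embedding for d in response.data]
```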

I do think that for certain use cases an LLM may be overkill; however, my theory is that if you can ask a question in natural language, an LLM + classic programming should be able to provide the results.

Other options could include a text-to-SQL prompt feature that actually runs the SQL query as part of the conversation chain. However, I think that is only a temporary solution.
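A rough sketch of that text-to-SQL option, assuming the CSV has already been loaded into a SQLite table named books (the schema and model name are placeholders):

```python
import sqlite3
from openai import OpenAI

client = OpenAI()
conn = sqlite3.connect("library.db")  # hypothetical: the CSV loaded into a 'books' table

def answer_with_sql(question: str):
    # Ask the model only for the SQL; keep execution (and any validation) in classic code.
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": (
                "Translate the user's question into a single SQLite SELECT statement "
                "over the table books(title, author, price). Return only the SQL."
            )},
            {"role": "user", "content": question},
        ],
    )
    sql = completion.choices[0].message.content.strip().strip("`")
    return conn.execute(sql).fetchall()

# answer_with_sql("Which books cost less than 15 dollars?")
```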


Were you able to achieve good results by following this approach?
I am also trying to solve a similar problem, where my file has a mixture of HTML, tabular data, plain text, and repeated data; however, I am not able to get good results. I have tried a number of things so far.

@gurjant.singh, are you able to share specifically what you have tried and what is not working?