What is the best way to chunk CSV files (by rows or by columns) for generating embeddings for efficient retrieval?
It depends on what you are trying to achieve as the final result.
CSV data is one of the sources for our RAG app. I am already selecting only the necessary columns, and my theory is that the chunking logic for structured vs. unstructured data should be different.
Without specifics it's hard to tell:
- RAG app - what is the object structure stored in the app?
- What are the columns in the CSV?
- What is the typical content of the cells per column? And what are the text size ranges?
- What does the text in the cells look like (raw unformatted; raw formatted/split into sections/paragraphs/etc.; markdown, HTML, XML, JSON, etc.)?
- What’s the goal of the app outside of the RAG engine?
Thank you for the questions, they are helpful for thinking this through. I cannot share any specifics since we have an NDA; however, the content is a combination of text and numbers, and a good analogy is a database of books in a public library system. The application's intent is the ability to query over structured (CSV) data and unstructured data to answer questions in a multi-step retrieval process. ChatGPT Plus (GPT-4o) is able to process the CSV and answer questions, and I would like to replicate that behavior across multiple CSVs for our specific set of use cases.
Here is an old video I made a few months ago about this, one of my oldest videos.
I have an Assistant made with the API, and it will use RAG to read the CSV file. Here are instructions and some functions as an example: flexiai/examples/Instructions and functions examples/assistant_using_csv.md at main · SavinRazvan/flexiai · GitHub
Here is the Repo:
Personally I prefer to structure my thinking backwards from the final goal:
1. What is the feature you need to develop?
2. How should it work (user story)?
3. What are the current workflows humans use to get this done?
4. What is common in those workflows, and why?
5. What would be the ideal workflow (general bricks)?
6. From the workflow's last step backwards:
   - What is the outcome at this step (data structure model)?
   - How is it produced?
   - What input is needed to perform this operation (data structure model)?
   - How does this step work (detailed sub-procedure)?
   - Repeat for all steps in reverse order.
7. What is the best way to organize the data storage for the models generated in step #6?
8. How do you process the available data to produce the storable items described in #7 (workflow)?
9. Where do you get the data to feed into step #8 (sources + a way to find the data)?
10. Where are the weak points in the system above, and how can they be improved?
Hope that helps
Some weak points, off the top of my head:
- number processing by the LLM should probably be replaced by regular programming (see the sketch after this list)
- double-check your data models in the RAG engine so that they make sense for retrieval operations and don't stay inside the thinking box of the domain you're working in
- often it is better to pull more data out of the vector DB and pass it through a "data quality filter" before selecting the items to stuff into your prompts
- ideally, a retrieved data item should not need post-processing before being inserted into a prompt (so it's really your prompt that decides how you store the data), because you data-mine once and search for it all the time
- use classic code whenever possible, as LLMs are not an exact science and errors fly all over; the best approach in the long run is to use the LLM as a tool that gives classic code easy access to the semantics (/si'mantiks/ - the meaning as you hear it™ - will be my new brand for my AI tools) of your data, so that solid logic can process it
- break LLM tasks down as much as you can to simplify them and be able to use short prompts on cheap models
- log your operations with input/output from the start, to gather training data for fine-tuning in case you need it later on
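To make the "numbers belong in classic code" point concrete, here is a minimal sketch, assuming the `openai` Python client and pandas; the model name, the `books.csv` file, its columns, and the column/op/value filter schema are all made up for illustration. The LLM only translates the question into a small structured filter, and plain pandas does the numeric comparison.

```python
import json
import pandas as pd
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def question_to_filter(question: str) -> dict:
    """Ask the model to turn a natural-language question into a structured filter.
    The JSON schema (column / op / value) is an assumption for this example."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any chat model would do
        response_format={"type": "json_object"},
        messages=[
            {"role": "system",
             "content": "Convert the user's question into a JSON filter with keys "
                        "'column', 'op' (one of <, <=, ==, >=, >) and 'value'. "
                        "Columns available: title, author, price."},
            {"role": "user", "content": question},
        ],
    )
    return json.loads(response.choices[0].message.content)

def apply_filter(df: pd.DataFrame, flt: dict) -> pd.DataFrame:
    """Classic code does the actual numeric comparison -- no LLM arithmetic.
    In practice you may also need to coerce flt['value'] to the column's dtype."""
    ops = {"<": "lt", "<=": "le", "==": "eq", ">=": "ge", ">": "gt"}
    return df[getattr(df[flt["column"]], ops[flt["op"]])(flt["value"])]

books = pd.read_csv("books.csv")  # hypothetical file with title, author, price columns
flt = question_to_filter("Which books cost less than 15 dollars?")
print(apply_filter(books, flt))
```

The model never sees the numbers themselves, so arithmetic mistakes can't creep in; it only supplies the semantics of the question.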
Here's what I experimented with. I preprocessed my CSV dataset to generate a single CSV file with the necessary information, and followed the "design pattern" specified in this cookbook to create a text summary column for retrieval. Since embeddings work effectively on textual data, and rightfully so, I baked a "semantic" meaning into each row. For example, if I have a table of books, my summary column includes: Book Title: "As the crow flies", Author: "John Grisham", Price: "$12.5".
I do think that for certain use cases an LLM may be overkill; however, my theory is that if you can ask a question in natural language, an LLM plus classic programming should be able to provide the results.
Other options could include a text-to-SQL prompt feature that actually runs the SQL query as part of the conversation chain. However, I think that is only a temporary solution.
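A minimal sketch of that row-summary idea (assuming pandas and the `openai` Python client; the file name, column names, and embedding model are placeholders, not the exact cookbook setup):

```python
import pandas as pd
from openai import OpenAI

client = OpenAI()

# Hypothetical catalogue with title, author and price columns.
df = pd.read_csv("books.csv")

# Bake a "semantic" summary into each row so the embedding has context,
# instead of embedding bare cell values.
df["summary"] = df.apply(
    lambda row: (
        f'Book Title: "{row["title"]}", '
        f'Author: "{row["author"]}", '
        f'Price: ${row["price"]}'
    ),
    axis=1,
)

# One embedding per row summary (a single batched API call here;
# splitting the batch would be needed for large files).
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=df["summary"].tolist(),
)
df["embedding"] = [item.embedding for item in response.data]

df.to_csv("books_with_embeddings.csv", index=False)
```

At query time you embed the question with the same model and rank rows by cosine similarity against the stored vectors.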
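A rough sketch of that text-to-SQL option (assuming sqlite3, pandas, and the `openai` client; the table schema and model name are illustrative): load the CSV into an in-memory SQLite table, let the model write the query, and execute it with ordinary code.

```python
import sqlite3
import pandas as pd
from openai import OpenAI

client = OpenAI()

# Load the CSV into an in-memory SQLite table so the generated SQL can run.
conn = sqlite3.connect(":memory:")
pd.read_csv("books.csv").to_sql("books", conn, index=False)

question = "What is the average price per author?"
schema = "books(title TEXT, author TEXT, price REAL)"  # illustrative schema

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system",
         "content": f"Write a single SQLite query for the table {schema}. "
                    "Return only the SQL, no explanation."},
        {"role": "user", "content": question},
    ],
)
raw = response.choices[0].message.content.strip()
# Strip a markdown code fence if the model adds one anyway.
sql = raw.strip("`").removeprefix("sql").strip() if raw.startswith("```") else raw

# Classic code executes the query; the model never touches the numbers.
print(pd.read_sql_query(sql, conn))
```

In a real app you would validate the generated SQL (read-only, known tables) before executing it.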
Were you able to achieve good results by following this approach?
I am also trying to solve a similar problem where my file has a mixture of HTML, tabular data, plain text, and repeated data, but I am not able to get good results. I have tried a number of things so far.
@gurjant.singh are you able to share specifically what you have tried and what is not working?