Efficient Way to Chunk CSV Files or Structured Data

Personally, I prefer to structure my thinking backwards from the final goal:

  1. What is the feature you need to develop?
  2. How should it work (user story)?
  3. What are the current workflows humans use to get this done?
  4. What is common in those workflows, and why?
  5. What would be the ideal workflow (general building blocks)?
  6. From the workflow’s last step, working backwards:
  • what is the outcome at this step (data structure model)?
  • how is it produced?
  • what input is needed to perform this operation (data structure model)?
  • how does this step work (detailed sub-procedure)?
  • repeat for all steps in reverse order
  7. What is the best way to organize data storage for the models produced in step #6? (A sketch of one such model follows this list.)
  8. How do I process the available data to produce the storable items described in #7 (workflow)?
  9. Where do I get the data to feed into step #8 (sources + a way to find the data)?
  10. Where are the weak points in the system above, and how can they be improved?
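
To make steps 6 and 7 concrete, here is a minimal sketch of what such a data structure model could look like for a CSV-chunking pipeline. Everything in it (class name, fields) is an illustrative assumption, not a prescribed schema:

```python
# One retrievable unit built from a group of CSV rows; field names are
# illustrative assumptions for a chunking pipeline, not a fixed schema.
from dataclasses import dataclass, field

@dataclass
class ChunkRecord:
    source_file: str            # which CSV the rows came from
    row_range: tuple[int, int]  # first and last row index covered by this chunk
    prompt_text: str            # pre-rendered text, ready to drop into a prompt
    metadata: dict = field(default_factory=dict)  # filterable fields for retrieval
```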

Hope that helps

Some weak points (off the top of my head):

  • number crunching by the LLM should probably be replaced with regular programming (see the first sketch after this list)
  • double-check the data models in your RAG engine so that they make sense for retrieval operations and don’t stay locked inside the thinking box of the domain you’re working with
  • often it is better to pull more data out of the vector DB and pass it through a “data quality filter” before selecting the items to stuff into your prompts (second sketch below)
  • ideally, a retrieved data item should not need post-processing before being inserted into a prompt, so it is largely your prompt that decides how you store the data: you data-mine once and search for it all the time
  • use classic code whenever possible, as LLMs are not an exact science and errors fly all over; in the long run, the best approach is to use the LLM as a tool that lets classic code easily access the semantics (/si’mantiks/, “the meaning as you hear it”™, will be my new brand for my AI tools) of your data, so that solid logic can process it (third sketch below)
  • break LLM tasks down as much as you can to simplify them and allow short prompts on cheap models (the routing sketch below uses one such tiny task)
  • log your operations with input/output from the start, so you gather training data for fine-tuning in case you need it later on (last sketch below)
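
On the number-crunching point, a minimal sketch of doing the arithmetic in plain Python rather than asking the model; the file name and column name are made up for the example:

```python
# Sum a numeric CSV column with classic code instead of having the LLM do
# arithmetic: deterministic, cheap, and testable.
import csv

def column_total(path: str, column: str) -> float:
    total = 0.0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            value = (row.get(column) or "").strip()
            if value:  # skip blank cells
                total += float(value)
    return total

print(column_total("sales.csv", "amount"))  # hypothetical file and column
```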
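
For the two retrieval points, a sketch of the over-fetch-then-filter pattern. The retrieval call itself (whatever your vector DB client exposes) is left out, and the score threshold and length cap are arbitrary assumptions:

```python
# Pull more candidates than you need, run a cheap "data quality filter" in
# classic code, and join the survivors straight into the prompt. Because the
# items were stored prompt-ready, no post-processing happens here.

def build_context(candidates: list[tuple[float, str]], want: int = 5) -> str:
    # candidates: over-fetched (score, prompt_ready_text) pairs from the
    # vector DB, e.g. 4x more than you intend to keep
    good = [text for score, text in candidates
            if score >= 0.75 and len(text) < 2000]  # arbitrary quality gate
    return "\n\n".join(good[:want])
```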
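
For the “classic code drives, the LLM supplies the semantics” idea, combined with breaking tasks down for cheap models, a sketch where the model’s only job is a one-word classification and solid logic does the rest. `complete()` is a placeholder, not a real client:

```python
# Keep the control flow in classic code; the LLM only extracts the semantics
# (here: one tiny, cheap classification), then deterministic logic takes over.

def complete(prompt: str) -> str:
    raise NotImplementedError("wire up your actual model client here")

def route_ticket(ticket_text: str) -> str:
    # Short prompt, narrow task: well suited to a cheap model.
    label = complete(
        "Answer with exactly one word, billing or technical.\n"
        f"Ticket: {ticket_text}"
    ).strip().lower()
    if label == "billing":
        return "billing_queue"
    if label == "technical":
        return "tech_queue"
    return "human_review"  # anything unexpected falls back safely
```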
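
And for the last point, a sketch of logging every call as JSON Lines from day one, a format most fine-tuning pipelines can ingest; the file path is just an example:

```python
# Append every LLM call to a JSONL file so that (input, output) pairs are
# already collected if you later decide to fine-tune.
import json
import time

def log_call(prompt: str, response: str, path: str = "llm_calls.jsonl") -> None:
    record = {"ts": time.time(), "prompt": prompt, "response": response}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```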