So, a very quick intro. We’re building a tool that extracts certain data from Excel sheets. The idea is to export the sheets as CSV or HTML and ask GPT-4 to extract specific data and interpret it. Extracting data from CSV/HTML as text works pretty well, so that’s not the issue here.
The issue is that these sheets and tables can get pretty huge and carry a lot of formatting. As you can see in the picture below, this is far too many tokens for a single API call. So the question is: does anyone have experience with this kind of problem, and is there a good way of chunking HTML/CSV data without losing too much context? Would a vector database be the right fit, or is there some other way of formatting this data? I know https://finchat.io/ lets you upload your own data, but I haven’t fully tried it out yet; maybe someone has experience with that?
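For what it’s worth, one simple chunking approach I’ve seen is to split the CSV by rows and repeat the header row in every chunk, so each chunk stays interpretable on its own when it’s sent to the model. A minimal sketch (the function name and chunk size are just placeholders, not from any particular library):

```python
import csv
import io

def chunk_csv(csv_text: str, rows_per_chunk: int = 50) -> list[str]:
    """Split CSV text into row chunks, repeating the header row in each
    chunk so every chunk keeps its column context."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    chunks = []
    for i in range(0, len(body), rows_per_chunk):
        buf = io.StringIO()
        writer = csv.writer(buf)
        writer.writerow(header)                      # re-attach the header
        writer.writerows(body[i:i + rows_per_chunk])  # this chunk's rows
        chunks.append(buf.getvalue())
    return chunks
```

You’d still want to tune `rows_per_chunk` against your token budget, and this does nothing for cross-row context (totals referencing earlier rows, merged cells, etc.), which is where it gets harder.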
Thanks for the help!!
Hi and welcome to the Developer Forum!
To get a finance system working like this, you need to be very careful with how you vectorise the data. Consider how people actually use these systems: very little of the important numerical data gets any attention beyond bottom lines or query-specific requirements. Numerical data is not well suited to semantic similarity. How is 10 related to 600? There are no logical linguistic links between 10 and 600 until you understand that 10 is the quantity and 600 is the final sales price including tax. So you will need to delink the numbers from the text in the spreadsheet while maintaining the logical links, perhaps with metadata headers embedded with the vector chunks.
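As a rough sketch of what "metadata headers embedded with the vector chunks" could look like: before embedding, each row is linearised into self-describing text where every value is paired with its column label, so a bare "600" becomes "final price incl. tax: 600". The function name and the `[sheet=...]` prefix format are my own illustration, not an established convention:

```python
def linearize_row(header: list[str], row: list[str], sheet: str) -> str:
    """Turn one spreadsheet row into a self-describing text chunk:
    each value is paired with its column label so the number keeps
    its meaning once the table structure is gone."""
    parts = [f"{col}: {val}" for col, val in zip(header, row) if val != ""]
    return f"[sheet={sheet}] " + "; ".join(parts)
```

Each such string would then be embedded and stored alongside the raw numbers, so retrieval matches on the labels while the downstream model still sees the values in context.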
This will not be a simple matter of ingesting the spreadsheets as CSV and having it all work flawlessly. You can certainly try that and see what the results look like, but I imagine the output will be less than optimal.