Reading structured CSV files (survey data with a large amount of text) using an API

I am building an app that asks questions over survey data. The survey data is very structured but contains tons of free text. For example, there is a column asking "In which areas are you actively reducing spending?", and the responses (each row is one respondent's input) can have values like groceries, OTT, etc., separated by "|".
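For reference, pipe-separated multi-select columns like that can be unpacked and counted with plain pandas before any LLM is involved (column name and values here are toy stand-ins):

```python
import pandas as pd

# Toy version of one "|"-separated multi-select survey column.
df = pd.DataFrame({
    "spending_cuts": ["groceries|OTT", "OTT", "groceries|travel", None],
})

# Split each response into a list, then explode to one answer per row.
answers = (
    df["spending_cuts"]
    .dropna()
    .str.split("|")
    .explode()
    .str.strip()
)

# Tally how often each option was selected across respondents.
counts = answers.value_counts()
print(counts.to_dict())
```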

This has resulted in a big file (10k rows, 50 columns, lots of text data within the rows).

I have created a function that identifies the question type, i.e. "general" or "mathematical". If it is mathematical, I use LangChain's dataframe agent and get correct answers for count, sum, etc. (basically mathematical operations). But when I query something like "What is the general mood of the consumer?" (basically a "general" question), I hit the token limit error.

I have tried a number of approaches: converting my data to text and then retrieving the info, building a simple RAG model, and using OpenAI to first tell me which columns would be most relevant to the answer and then running the query on a filtered dataframe. Nothing seems to work on the 10k rows, although a subset of ~500 rows works pretty neatly.
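One thing that sometimes helps at this scale (a sketch, not your exact pipeline; the column name is made up): collapse repeated free-text answers into (answer, count) pairs before building the prompt, so the context grows with the number of *distinct* answers rather than the row count.

```python
import pandas as pd

# Toy stand-in for a 10k-row survey frame.
df = pd.DataFrame({
    "mood_comment": ["worried", "worried", "optimistic", "worried", "fine"],
})

# Deduplicate free-text answers into a compact (answer, count) table.
summary = df["mood_comment"].value_counts().reset_index()
summary.columns = ["answer", "n_respondents"]

# This compact table, not the raw rows, is what would go into the prompt.
prompt_context = summary.to_csv(index=False)
print(prompt_context)
```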

Can anyone guide me on how to do this, please?


I don’t have enough experience with LangChain to help in that area, but in general, splitting the data into more digestible sections and then synthesizing the results can help avoid hitting token limits.
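The split-then-synthesize idea above can be sketched as a map-reduce pass (assumptions: `call_llm` is a placeholder for your actual model call, and the chunk size of 500 mirrors the subset size that reportedly works):

```python
def call_llm(prompt: str) -> str:
    # Placeholder: substitute your real OpenAI / LangChain call here.
    return f"summary of {len(prompt)} chars"

def chunked(rows, size):
    # Yield successive fixed-size slices of the row list.
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

def answer_over_chunks(rows, question, chunk_size=500):
    # Map: answer the question over each digestible chunk of rows.
    partials = [
        call_llm(f"{question}\n\n" + "\n".join(chunk))
        for chunk in chunked(rows, chunk_size)
    ]
    # Reduce: synthesize the partial answers into one final answer.
    return call_llm(f"Combine these answers to '{question}':\n" + "\n".join(partials))

rows = [f"respondent {i}: feeling cautious" for i in range(1200)]
final = answer_over_chunks(rows, "What is the general mood of the consumer?")
print(final)
```

Each individual call stays small, and only the short partial answers are combined at the end.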


I would first try to verify that everything the LangChain agent sends to and receives from the LLM is within the token limit.
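For example, a quick sanity check before each call (this uses a rough rule of thumb of ~4 characters per token for English; for exact counts you would use the `tiktoken` library with your model's encoding, and the limit below is illustrative):

```python
def approx_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    # For exact counts, use tiktoken with your model's encoding.
    return max(1, len(text) // 4)

TOKEN_LIMIT = 16_000  # illustrative; check your model's actual context size

prompt = "What is the general mood of the consumer?" * 2000
too_large = approx_tokens(prompt) > TOKEN_LIMIT
if too_large:
    print("prompt too large -- shrink or chunk it before sending")
```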
