I am building an app that answers questions over survey data. The data is very structured but contains tonnes of text. For example, one column asks “What are the active areas where you are reducing spending?” and each row (one respondent’s input) can hold values like groceries, OTT, etc., separated by “|”.
This results in a big file: 10k rows, 50 columns, with lots of text data inside the rows.
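To make the data shape concrete, here is a minimal sketch (the column name and values are made up for illustration) of how such a “|”-separated multi-select column can be split and counted with pandas:

```python
import pandas as pd

# Hypothetical miniature of the survey data: one multi-select column
# where each respondent's choices are separated by "|".
df = pd.DataFrame({
    "respondent_id": [1, 2, 3],
    "reduced_spending_areas": ["groceries|OTT", "OTT", "groceries|travel"],
})

# Split the delimited string into lists, then explode to one row per choice.
exploded = (
    df.assign(area=df["reduced_spending_areas"].str.split("|"))
      .explode("area")
)

# Frequency of each area across all respondents.
counts = exploded["area"].value_counts()
print(counts.to_dict())  # → {'groceries': 2, 'OTT': 2, 'travel': 1}
```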
I have written a function that identifies the question type, i.e. “general” or “mathematical”. If it is mathematical, I use LangChain’s dataframe agent and get correct answers for count, sum, etc. (basically mathematical operations). But when I query something like “What is the general mood of the consumer?” (a “general” question), I hit the token limit error.
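For reference, the router step I describe could look something like the sketch below. This is not my actual classifier (which uses an LLM); it is a keyword heuristic stand-in, and the keyword list is an assumption:

```python
import re

# Hypothetical keyword heuristic for routing a question to the dataframe
# agent ("mathematical") or the free-text pipeline ("general").
# Word-boundary patterns avoid false hits like "sum" inside "consumer".
MATH_PATTERNS = [
    r"\bcount\b", r"\bsum\b", r"\baverage\b",
    r"\bmean\b", r"\btotal\b", r"\bhow many\b",
]

def classify_question(question: str) -> str:
    """Return 'mathematical' if the question looks like an aggregation,
    otherwise 'general'."""
    q = question.lower()
    if any(re.search(p, q) for p in MATH_PATTERNS):
        return "mathematical"
    return "general"

print(classify_question("How many respondents cut OTT spending?"))     # mathematical
print(classify_question("What is the general mood of the consumer?"))  # general
```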
I have tried a number of approaches: converting the data to text and then retrieving the info, building a simple RAG model, and using OpenAI to first tell me which columns are most relevant to the answer and then running the query on a dataframe filtered to those columns. Nothing works on the full 10k rows; however, a subset of ~500 rows works pretty neatly.
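The column-filtering step I mention can be sketched as below. In the real pipeline the `relevant_columns` list comes from an LLM call; here it is hard-coded as an assumption, and the dataframe is a made-up miniature. Even after filtering, 10k rows of text can blow the context window, which is why some pre-reduction (sampling, or aggregation) is shown as well:

```python
import pandas as pd

# Hypothetical miniature of the 10k x 50 survey frame.
df = pd.DataFrame({
    "respondent_id": range(4),
    "sentiment_comment": ["worried about prices", "optimistic",
                          "cutting back", "feeling stable"],
    "age": [25, 34, 41, 29],
    "region": ["N", "S", "E", "W"],
})

# In the real app an LLM picks these; hard-coded here for illustration.
relevant_columns = ["sentiment_comment"]

# Keep only the columns judged relevant to the question.
filtered = df[relevant_columns]

# Filtering columns alone may not be enough at 10k rows, so reduce rows
# too before sending anything to the model (here: a random sample).
sample = filtered.sample(n=2, random_state=0)
print(filtered.shape, len(sample))
```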
Can anyone guide me on how to make this work at 10k rows?