I have a questionnaire file uploaded to a vector store as a whole (which I'm assuming is stored as a single chunk). I'm prompting the LLM to ask the user those questions and collect all the answers (I am using Threads). However, by the time it reaches the end of the questionnaire, it's costing a lot of tokens, 95 percent of which are input tokens. How can I optimize and reduce the number of input tokens? Would splitting the questionnaire file into smaller chunks myself and uploading those to the vector store help?
If you're serious about this, you can create an API that manages the questions and answer types, then have the model interact with that API.
It's quite straightforward: when the user gives an answer, the model sends it to the endpoint, which returns a success response along with the next question.
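For illustration, here's a minimal sketch of such an endpoint, assuming Flask, an in-memory question list, and made-up route names (`/start`, `/answer`). Exposed to the assistant as a function/tool, the model only ever needs the current question in its context rather than the whole questionnaire file:

```python
# Minimal sketch of a questionnaire endpoint, assuming Flask, an in-memory
# question list, and illustrative route names -- not a definitive design.
from flask import Flask, request, jsonify

app = Flask(__name__)

QUESTIONS = [
    "What is your full name?",
    "What is your date of birth?",
    "Do you have any known allergies?",
]

# answers keyed by session_id -> list of {"question": ..., "answer": ...}
SESSIONS = {}

@app.get("/start")
def start():
    # hand the model the first question; nothing is recorded yet
    return jsonify({"next_question": QUESTIONS[0]})

@app.post("/answer")
def answer():
    data = request.get_json()
    answers = SESSIONS.setdefault(data["session_id"], [])
    idx = len(answers)
    if idx < len(QUESTIONS):
        # record the answer to the question currently being asked
        answers.append({"question": QUESTIONS[idx], "answer": data["answer"]})
        idx += 1
    if idx < len(QUESTIONS):
        return jsonify({"status": "ok", "next_question": QUESTIONS[idx]})
    return jsonify({"status": "ok", "next_question": None, "done": True})

if __name__ == "__main__":
    app.run()
```

Each turn the model just forwards the user's answer and relays `next_question` back, so the prompt never has to carry the full questionnaire or the retrieved file chunks.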
And that would avoid the need for a vector store for the questionnaire file?
Yes. It requires a lot more effort, but if your goal is to save money and have more control, it may be a good option.
You could use a combination of the two.
That is a great idea!
Then you could store those answers into a user profile for use later.
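For example, a rough sketch of saving the collected answers to a per-user JSON profile (the directory, file naming, and schema are just assumptions):

```python
# Rough sketch: persist completed questionnaire answers per user as JSON.
# The directory, file naming, and schema here are illustrative assumptions.
import json
from pathlib import Path

PROFILE_DIR = Path("profiles")
PROFILE_DIR.mkdir(exist_ok=True)

def save_profile(user_id: str, answers: list) -> None:
    # write the finished questionnaire to profiles/<user_id>.json
    path = PROFILE_DIR / f"{user_id}.json"
    path.write_text(json.dumps({"user_id": user_id, "answers": answers}, indent=2))

def load_profile(user_id: str):
    # load a saved profile later, e.g. to seed a future thread's context
    path = PROFILE_DIR / f"{user_id}.json"
    return json.loads(path.read_text()) if path.exists() else None
```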
But the problem with that is the questionnaire is too complex to be handled by a plain API; I still need an LLM to some extent. Are you sure there's no other way I can reduce the input token count?