We have created an app using the OpenAI API.
Brief description: we collected a lot of PDFs that are openly available on the internet → built a vector database with Qdrant → when a user asks a question, we search the Qdrant database and send the retrieved PDF chunks to OpenAI together with our prompts and the question. But now every question uses 3,000-5,000 tokens per answer → mostly input tokens, because of the large chunks of text pulled from the PDFs → this costs 10-12 cents per answer. We tried GPT-4o, but the accuracy of the answers went down.
Can anyone suggest a technical workaround to minimize the tokens used? We are applying for credits to sustain ourselves temporarily, but otherwise the business model could fail purely because of the OpenAI cost. Any advice is appreciated.
Hi @vijai04,
I see how that can be frustrating. But without a detailed description of what you’re trying to achieve, it’s hard to give meaningful advice.
If I were in this situation, this is what I would try to analyse/improve:
- Check the vector DB pricing (storage/retrieval): are you sure Weaviate wouldn’t be cheaper?
- Start from the result and work backwards: write down the complete workflow of how you get to the final answer and what data it needs to retrieve from the DB.
- Go through your workflow and see which steps consume the most tokens; make a list and sort it by priority (most tokens to fewest).
- Ask why those steps consume that many tokens. Is it because they genuinely need that much data as input, or because the data items have an irreducible minimum size (you can’t cut the chunks further without losing context), or (most likely) all three of the following: a) you feed everything the DB returns into the prompt, b) the chunks were not split properly and are simply too big, and c) the chunks were stored as-is, so the valuable data is spread across each chunk and was never properly “distilled”?
- Based on the above, build the optimal data composition (fields and values) of the retrievable objects, so that you don’t need to post-process them after the retrieval to optimize your prompts.
- From the previous point, work out how to pre-process your chunks before storage and what size they need to be (you pre-process them once and then use them all the time, so it’s cheaper in the long run).
- Work out how to filter the results from the DB before sending them as context (say the DB gives me 100 results, but a cheap model selects only the 5 that are actually sent as context; see the first sketch below).
- See if some of the data can be stored in a regular DB behind an API, so that your engine can fetch it directly, bypassing the AI tools.
- Maybe among the most important: how do you split your chunks? Is it based on an “atomic idea”/semantics or a sliding window? Why did you choose that approach? (The second sketch below shows a simple semantic-style splitter.)
- See if caching can be implemented (third sketch below).
- Are you splitting the tasks into simple steps, or trying to get the result in one shot? Maybe that’s why you can’t use cheap models.
- After doing the above, I’m sure you can improve the whole workflow so that cheaper models can do the tasks with the same or better quality.
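To make the filtering idea concrete, here is the first sketch: a minimal example, assuming the OpenAI Python SDK, where a cheap model picks the handful of retrieved chunks that actually matter before the expensive answering call. The model names, the 500-character truncation, and the helper names are illustrative assumptions, not a drop-in implementation.

```python
# First sketch: let a cheap model select the few relevant chunks, then send only
# those to the expensive model. Model names and helper names are placeholders.
from openai import OpenAI

client = OpenAI()

def select_chunks(question: str, chunks: list[str], keep: int = 5) -> list[str]:
    """Ask a cheap model which of the retrieved chunks are needed for the answer."""
    # Truncate passages for the selection pass so the cheap call stays small.
    numbered = "\n\n".join(f"[{i}] {c[:500]}" for i, c in enumerate(chunks))
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder: any cheap model
        messages=[
            {"role": "system",
             "content": f"Return only the comma-separated indices of at most {keep} "
                        "passages that are needed to answer the question."},
            {"role": "user", "content": f"Question: {question}\n\nPassages:\n{numbered}"},
        ],
    )
    picked = [int(i) for i in resp.choices[0].message.content.split(",") if i.strip().isdigit()]
    return [chunks[i] for i in picked if i < len(chunks)]

def answer(question: str, chunks: list[str]) -> str:
    """Answer with the expensive model, but only over the selected chunks."""
    context = "\n\n".join(select_chunks(question, chunks))
    resp = client.chat.completions.create(
        model="gpt-4-turbo",  # placeholder for the expensive answering model
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content
```

The selection call costs some tokens too, but since the passages are truncated there, the input to the expensive model usually drops by far more than the cheap call adds.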
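The second sketch is a rough take on splitting by “atomic idea” rather than a sliding window: group whole paragraphs into chunks with a size cap. The blank-line heuristic and the 1,000-character cap are assumptions you would tune against your PDFs.

```python
# Second sketch: paragraph-based splitting instead of a fixed sliding window,
# so each stored chunk is a small, self-contained piece of text.
import re

def split_into_chunks(text: str, max_chars: int = 1000) -> list[str]:
    """Group paragraphs into chunks of at most max_chars characters."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current = [], ""
    for p in paragraphs:
        candidate = f"{current}\n\n{p}" if current else p
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = p  # an over-long paragraph still becomes its own chunk
    if current:
        chunks.append(current)
    return chunks
```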
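And the third sketch, for caching: the simplest version only catches exact repeats of a question, keyed on its normalized text. Matching near-duplicate questions via embedding similarity would be the next step, but it’s left out here; the JSON-file store is also an assumption.

```python
# Third sketch: cache answers by normalized question so repeated questions cost
# zero tokens. Only exact repeats are caught; near-duplicates would need embeddings.
import hashlib
import json
from pathlib import Path

CACHE_FILE = Path("answer_cache.json")  # assumption: a flat JSON file is enough

def cached_answer(question: str, generate) -> str:
    """Return a cached answer if this question was asked before, else call generate()."""
    key = hashlib.sha256(question.strip().lower().encode()).hexdigest()
    cache = json.loads(CACHE_FILE.read_text()) if CACHE_FILE.exists() else {}
    if key not in cache:
        cache[key] = generate(question)
        CACHE_FILE.write_text(json.dumps(cache))
    return cache[key]

# e.g. cached_answer(user_question, lambda q: answer(q, retrieved_chunks))
```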
But one other question bothers me: what kind of business is it where a valuable answer isn’t worth a couple of cents? Maybe your pricing or the core approach is wrong? Just asking.
Thanks for getting back, this is helpful feedback. I think the main issue is that we are feeding everything the DB returns into the prompt. Maybe we could use a cheaper model to distill the chunks and then use GPT Turbo to answer. Also, I stated the cost wrong earlier: it is 10-12 cents per answer, and 1-2 cents would be ideal. I will connect through LinkedIn as well.
This is taking a good direction. Sure, let’s connect on LinkedIn.