I was experimenting with the OpenAI API and an Assistant that uses the retrieval tool. I have a large knowledge base file (e.g. around 100K tokens) uploaded to the Assistant, and I need to ask different retrieval questions about it. But it seems that the GPT model re-reads the whole file for every answer (judging by the consumed token count), which puzzled me: shouldn’t the Assistant read the knowledge base once, build an internal context for it, and use that to answer later questions without re-reading all the input tokens?
More details on what I did:
- created an Assistant and uploaded a large knowledge base to it
- created a new thread with that Assistant and added the first message (question) to it
- started a run on that thread and waited until it completed successfully (e.g. 30 sec, consuming about 80K tokens, roughly most of the file)
- checked the model’s response to the first question
- added a second message (question) and repeated the steps until I got an answer
- checked the response: it’s ok, but it again consumed e.g. 90K tokens from the Assistant’s knowledge file…
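The per-question part of these steps could be sketched roughly as follows. This is a minimal sketch, not the exact code used: it assumes `client` is an `openai.OpenAI()` instance from the v1 Python SDK, that the Assistant and thread already exist, and that the finished run exposes a `usage` field (the Assistants API was in beta, so names may differ between SDK versions). The function name `ask_and_count_tokens` is made up for illustration.

```python
import time

# Statuses at which polling should stop (assumed from the beta Assistants API).
STOP_STATUSES = {"completed", "failed", "cancelled", "expired",
                 "incomplete", "requires_action"}


def ask_and_count_tokens(client, assistant_id, thread_id, question,
                         poll_seconds=1.0):
    """Add one question to the thread, run it, and return the run's total tokens.

    Returns None if the run object carries no usage information.
    """
    # Add the user question to the existing thread.
    client.beta.threads.messages.create(
        thread_id=thread_id, role="user", content=question
    )
    # Start a run of the Assistant on that thread.
    run = client.beta.threads.runs.create(
        thread_id=thread_id, assistant_id=assistant_id
    )
    # Poll until the run reaches a status where it will not progress on its own.
    while run.status not in STOP_STATUSES:
        time.sleep(poll_seconds)
        run = client.beta.threads.runs.retrieve(
            thread_id=thread_id, run_id=run.id
        )
    if run.status != "completed":
        raise RuntimeError(f"run ended with status {run.status!r}")
    usage = getattr(run, "usage", None)
    return usage.total_tokens if usage is not None else None
```

Calling this once per question on the same thread and comparing the returned numbers is one way to see whether the second question is as expensive as the first.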
So if that is the expected mode of operation, then the Assistant seems to be just… a file holder?
Otherwise, why doesn’t it create some kind of interpreted context internally after the first read of the input file and mainly use that for each later question, without consuming the token budget again, e.g. re-reading only a small number of tokens from the input file (ones that were not taken into account on the previous reads, etc.)?
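To make the difference in token budget concrete, here is a small back-of-the-envelope sketch. The 85K figure is just the midpoint of the 80K and 90K runs mentioned above; the 2K follow-up cost and the "read once" mode itself are purely hypothetical, included only to show what the hoped-for behaviour would save.

```python
def reread_total(n_questions, tokens_per_run):
    """Total tokens if every question re-reads the file (the observed behaviour)."""
    return n_questions * tokens_per_run


def read_once_total(n_questions, first_read_tokens, followup_tokens):
    """Total tokens if the file were read once and follow-ups were cheap.

    Hypothetical mode: this is the behaviour being asked about, not something
    the API is known to offer.
    """
    if n_questions == 0:
        return 0
    return first_read_tokens + (n_questions - 1) * followup_tokens


# Ten questions against the same knowledge base:
print(reread_total(10, 85_000))            # 850000
print(read_once_total(10, 85_000, 2_000))  # 103000
```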
Thanks. Apart from that, Assistants & Threads seem like very much needed features!