Maybe a silly simple question, but: When using the completion API, we always have to send all the previous parts of the conversation.
Are we being charged only for the newest part, or are we being charged for all the tokens that are in the context window, every time we send something to the API?
E.g. if I send a short story of 1,000 tokens and then ask various separate questions about this short story, I have to send the initial story text every single time. Am I being charged for these 1,000 tokens every time I ask a question?
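To make the pattern concrete, here is a minimal sketch of what I mean (using the chat completions endpoint; the model name and file name are just placeholders). The whole story goes into the request on every call, since the API is stateless:

```python
from openai import OpenAI

client = OpenAI()
story = open("story.txt").read()  # the ~1,000-token story

for question in ["Who is the narrator?", "Where is the story set?"]:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "user", "content": story},     # full story re-sent each time
            {"role": "user", "content": question},  # only this part is new
        ],
    )
    print(resp.choices[0].message.content)
```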
But I am looking into hacking something together where multiple users will be able to ask various questions about a text file, or maybe eventually a video. I'd really rather not have to pay for the tokens in the file hundreds of times.
So what would be the correct way to go about this? Maybe fine-tuning on the file, and then letting the users query that fine-tuned version?
Fine-tuning has worse recall than information provided in the prompt. Consider RAG instead, so you don't need to send the full text with every model invocation, or consider using a cheaper model.
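A minimal RAG sketch, assuming the openai Python SDK (>= 1.0) and numpy; the model names, chunk size, and `story.txt` are placeholders, not a definitive implementation. You pay to embed the story once, then each question only sends the few most relevant chunks:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    """Return one embedding vector per input string."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# 1. Split the story into chunks and embed them once (a one-time token cost).
story = open("story.txt").read()
story_chunks = [story[i:i + 1500] for i in range(0, len(story), 1500)]
chunk_vectors = embed(story_chunks)

def answer(question, top_k=2):
    # 2. Embed the question and rank chunks by cosine similarity.
    q_vec = embed([question])[0]
    sims = chunk_vectors @ q_vec / (
        np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(q_vec)
    )
    context = "\n\n".join(story_chunks[i] for i in sims.argsort()[-top_k:])

    # 3. Send only the retrieved chunks, not the whole story, with each question.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"Answer using this excerpt:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content
```

For a short story this mostly trades embedding cost for prompt cost; the savings get bigger as the source text grows or as the number of questions per document goes up.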