Are we repeatedly charged for all tokens in the context window?

Maybe a silly question, but: when using the completion API, we always have to send all the previous parts of the conversation.
Are we being charged only for the newest part, or are we being charged for all the tokens that are in the context window, every time we send something to the API?

E.g. if I send a short story of 1,000 tokens, and then ask various separate questions about this short story, I have to send the initial story text every single time. Am I being charged these 1,000 tokens every time I ask a question?

Yes. You are charged for what you send as input.

The AI is stateless; it has no memory of past users or sessions.

You provide that simulation of memory when you include some of the past chat in a new API call.
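To make the billing concrete, here is a minimal sketch using the OpenAI Python SDK (the model name and story text are placeholders). Every call resends the story, and `usage.prompt_tokens` reflects the full input each time:

```python
# Minimal sketch of why input tokens recur on every call.
from openai import OpenAI

client = OpenAI()

story = "..."  # your ~1,000-token short story

questions = ["Who is the protagonist?", "Where is the story set?"]

for q in questions:
    # Each call is independent: the full story must be resent every time.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[
            {"role": "system", "content": "Answer questions about this story:\n\n" + story},
            {"role": "user", "content": q},
        ],
    )
    # prompt_tokens counts the *entire* input, story included, on every call.
    print(q, "->", response.usage.prompt_tokens, "input tokens billed")
```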


Thank you for your answer.

But I am looking into hacking something together where multiple users will be able to ask various questions about a text file, or maybe eventually a video. I'd really rather not have to pay for the file's tokens hundreds of times.

So what would be the correct way to go about this? Maybe fine-tuning on the file, and then letting users query that fine-tuned model?

Fine-tuning has worse recall than information provided directly in the prompt. Consider RAG instead, so you don't need to deliver the full text on every model invocation (see the sketch below), or use a cheaper model.
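A minimal RAG sketch, assuming the OpenAI embeddings endpoint and naive in-memory search (the chunking strategy, model names, and top-k value are arbitrary illustrative choices, not recommendations):

```python
# Naive RAG sketch: embed story chunks once, then send only the most
# relevant chunks with each question instead of the full 1,000-token text.
import numpy as np
from openai import OpenAI

client = OpenAI()
EMBED_MODEL = "text-embedding-3-small"  # assumed model choice

def embed(texts):
    resp = client.embeddings.create(model=EMBED_MODEL, input=texts)
    return np.array([d.embedding for d in resp.data])

# 1. Split the story into chunks once (paragraphs here; any splitter works).
story = "..."  # your full story text
chunks = [p for p in story.split("\n\n") if p.strip()]
chunk_vecs = embed(chunks)  # pay the embedding cost once, up front

def answer(question, top_k=3):
    # 2. Embed the question and rank chunks by cosine similarity.
    q_vec = embed([question])[0]
    sims = chunk_vecs @ q_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q_vec)
    )
    context = "\n\n".join(chunks[i] for i in np.argsort(sims)[-top_k:])
    # 3. Send only the retrieved chunks, not the whole story.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[
            {"role": "system", "content": "Answer using this context:\n\n" + context},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content
```

The trade-off is an embedding call per question plus a one-time indexing cost, in exchange for much smaller chat inputs per question.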

For your case you will either need to build custom RAG or try the Assistants API: https://platform.openai.com/docs/assistants/tools/file-search/quickstart … which might be cheaper/easier, but it's hard to say!
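For reference, a rough sketch of the file-search flow from that quickstart. The Assistants API is in beta, so method paths like `client.beta.vector_stores` may differ between SDK versions; the file name and model are placeholders:

```python
# Rough sketch of the Assistants API file-search flow (beta API).
from openai import OpenAI

client = OpenAI()

# Upload the file into a vector store once; OpenAI handles chunking/embedding.
vector_store = client.beta.vector_stores.create(name="story")
with open("story.txt", "rb") as f:
    client.beta.vector_stores.file_batches.upload_and_poll(
        vector_store_id=vector_store.id, files=[f]
    )

assistant = client.beta.assistants.create(
    model="gpt-4o-mini",  # placeholder model choice
    tools=[{"type": "file_search"}],
    tool_resources={"file_search": {"vector_store_ids": [vector_store.id]}},
)

# Each user question goes in a thread; retrieval happens server-side.
thread = client.beta.threads.create(
    messages=[{"role": "user", "content": "Who is the protagonist?"}]
)
run = client.beta.threads.runs.create_and_poll(
    thread_id=thread.id, assistant_id=assistant.id
)
messages = client.beta.threads.messages.list(thread_id=thread.id)
print(messages.data[0].content[0].text.value)  # newest message first
```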