Context reuse for shared GPTs and Assistants without additional per-session input token cost

If I create a GPT via the builder, or create an Assistant that has "common" context, will I be billed for the initial context input tokens for each new user session?

Do embeddings or plugins provide a way to extend the GPT's model/context without per-session reprocessing costs?

What is the best practice to reduce “per session” costs for common initial context?

Hey there and welcome to the community!

To be clear, the API and ChatGPT are separate services. So, you are not billed anything to build a custom GPT (except for the plus subscription), but you are billed per token with any API call. Each time you feed tokens and receive tokens back, in any sense, those tokens are charged.
Assistants are also a bit tricky because OAI handles thread management for you, meaning you can’t modify messages after they’re added to the thread.

For complete control and maximum efficiency, I would use the Chat Completions endpoint instead. That way you have full control over the context and over how many tokens you send the model for each completion.
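To make the trade-off concrete, here is a minimal sketch of what "full control over the context" looks like with Chat Completions. The transcript text, function names, and the `max_history` truncation budget are all illustrative assumptions; the point is that *you* decide what gets resent each call, and the transcript system message is the part billed again on every request.

```python
# Sketch: per-session context assembly for a Chat Completions call.
# All names and the transcript text below are illustrative, not an
# official SDK pattern.

def build_messages(transcript: str, history: list, user_prompt: str,
                   max_history: int = 6) -> list:
    """Assemble the message list sent on each call.

    The transcript is resent as a system message on every call --
    this is exactly where the repeated input-token cost comes from.
    """
    recent = history[-max_history:]  # trim old turns to control token spend
    return (
        [{"role": "system",
          "content": f"Answer questions about this meeting transcript:\n{transcript}"}]
        + recent
        + [{"role": "user", "content": user_prompt}]
    )

messages = build_messages(
    "Alice: budget approved. Bob: ship Friday.",  # toy transcript
    [],                                           # no prior turns yet
    "What did Alice say?",
)
# Each turn would then be sent with something like:
#   client.chat.completions.create(model="gpt-4o-mini", messages=messages)
```

The upside over Assistants is that nothing is resent unless you put it in `messages`; the downside is that you now own truncation and history management yourself.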

Then, if I understand correctly, there is no shared reuse of context that avoids being repeatedly charged for the same input?

The use case is I’m creating a chat interface on meeting transcripts. The URL will identify a specific meeting identity for the transcript. Many users will be able to access the same meeting chat interface, but each will have a distinct session so that their prompts do not interfere with each other.

My hope is that there was a way to reuse an existing initial context (preloaded if you will) without being charged for those input tokens each time a user session is established.

Correct. "Sharing" context does not reduce the price; it just lets multiple assistants run with the same kind of data attached to them.

This is not possible. Or, at least, the “not being charged” part. Remember, the API doesn’t care what you do before or after its call; the input itself is tokenized when it is sent to the API, so the only way to not get charged for tokens is by not inputting them into the model. Keep in mind too, you are also being charged for the resultant output tokens as well. Now, you might be able to hold onto the context and wait to send it until it’s necessary, but the only way to not be charged for it as input is to not send it at all (defeating the purpose of it in the first place).
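To see why this matters for your use case, here is a back-of-envelope sketch of the repeated-context cost. The token count, session count, and price are illustrative assumptions, not real quotes; plug in your own transcript size and your model's current input price.

```python
# Back-of-envelope: cost of re-sending the same context every session.
# The numbers below are illustrative assumptions, not actual pricing.

def repeated_context_cost(context_tokens: int, sessions: int,
                          price_per_1k_input: float) -> float:
    """Total input cost of sending the same context once per session."""
    return context_tokens / 1000 * price_per_1k_input * sessions

# e.g. an 8,000-token transcript, 500 user sessions, $0.001 per 1K input tokens:
cost = repeated_context_cost(8_000, 500, 0.001)
print(f"${cost:.2f}")  # -> $4.00; the same transcript is billed on every session
```

And that is only the initial context; every follow-up turn in a session resends it again unless you trim it.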

However, for the use case you described (a shared chat interface over meeting transcripts):

Perhaps RAG (retrieval-augmented generation) over either parsed or whole transcription data might help?

What RAG does is allow a model to intelligently retrieve relevant chunks of contextual data to add to its context.

Will it make such contexts free? No.
Will it reduce the amount of context provided per call, reducing the cost of the API call? Yes, if you set it up well.

This also means it will only retrieve what is necessary when it is necessary to do so (although don’t expect that on the first try lol; it takes some finesse and trial and error). Overall, if you want cost-efficient ways to handle and retrieve context, use RAG.