I’ve been trying to understand how billing works in an assistant thread.
Despite the various topics about it, I see a lot of what seem like assumptions, and I’m still not sure how things work.
Is it true that on each run, all tokens from previous messages are billed again?
And what about the internal tokens the tools supposedly return? It seems that retrieval injects much more than just the answer to the question. Do those tokens remain in the message history and get billed again and again?
Since the instructions are generally the largest source of tokens, I wonder if I could keep them basic and have the assistant request more instructions depending on the context.
Even if the tool’s response is billed, as long as that response is not stored in the message list it won’t be counted in future messages, so in theory this could save a lot of money.
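To make the arithmetic behind this idea concrete, here is a rough sketch. It rests entirely on the assumption stated above (that fetched instructions are billed only on the run that fetches them and never re-enter the history), which nobody in this thread has confirmed. The function name and all token counts are hypothetical, and it ignores the extra model call a tool round-trip may itself cost.

```python
def instruction_tokens_billed(
    runs: int,
    base_tokens: int,
    extra_tokens: int = 0,
    runs_needing_extra: int = 0,
) -> int:
    """Hypothetical count of instruction tokens billed across `runs` runs.

    base_tokens are sent on every run. extra_tokens are fetched on demand and,
    under the unverified assumption above, billed only on the runs that fetch them.
    """
    if runs_needing_extra > runs:
        raise ValueError("runs_needing_extra cannot exceed runs")
    return runs * base_tokens + runs_needing_extra * extra_tokens


# Hypothetical numbers: 3,000-token instructions on every run, versus
# 500-token instructions plus 2,500 tokens fetched on only 3 of 10 runs.
full = instruction_tokens_billed(runs=10, base_tokens=3000)
lean = instruction_tokens_billed(runs=10, base_tokens=500, extra_tokens=2500, runs_needing_extra=3)
print(full, lean)  # 30000 12500
```

Note that if the extra instructions are needed on every run, the two approaches cost the same before any tool overhead, so the saving depends on how often the extra context is actually required.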
All of the messages in the message list are used on every call; that is how the context system works. If the messages grow so long that there is no room left, the system truncates the earliest messages to fit.
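A minimal sketch of the behaviour described above: each call re-sends the instructions plus as much of the history as fits, dropping the oldest messages first. It assumes each message's token count is already known; the names (Message, build_context, billed_input_tokens) are hypothetical, and the actual tokenizer and truncation rules used by the service are not specified in this thread.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str
    tokens: int  # hypothetical, pre-computed token count


def build_context(instructions_tokens: int, history: list[Message], context_limit: int) -> list[Message]:
    """Keep the most recent messages that fit next to the instructions."""
    budget = context_limit - instructions_tokens
    kept: list[Message] = []
    used = 0
    # Walk from newest to oldest so the latest turns are kept;
    # stop at the first message that no longer fits (older ones are dropped).
    for msg in reversed(history):
        if used + msg.tokens > budget:
            break
        kept.append(msg)
        used += msg.tokens
    kept.reverse()  # restore chronological order
    return kept


def billed_input_tokens(instructions_tokens: int, context: list[Message]) -> int:
    # Everything sent on this call counts as input, including old messages.
    return instructions_tokens + sum(m.tokens for m in context)


history = [Message("user", 400), Message("assistant", 600), Message("user", 300), Message("assistant", 500)]
ctx = build_context(instructions_tokens=1000, history=history, context_limit=2000)
print([m.tokens for m in ctx], billed_input_tokens(1000, ctx))  # [300, 500] 1800
```

The old messages that still fit are billed in full on every call; only the ones pushed out of the window stop costing anything.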
GPT may be hallucinating, or billing works differently there.
But below is a response I got from an assistant. It does keep everything that the tools return, yet it said that old messages are not counted.
Yes, I keep a record of the information provided and discussed during our conversation. This allows me to quickly reference previously mentioned details, such as the user’s Windows version in Issue 115334, without needing to re-access the original source. This ability to retain information during a specific session is crucial to maintaining continuity and relevance in responses.
But is this counted in the token count and billing?
No, the retention of information during our conversation is not counted in token counting or billing. Token counting and billing are related to processing new text, such as your questions and my answers to them. The ability to remember and reference details from previous parts of our conversation is a built-in function that does not incur additional costs. Therefore, you can reference past information without worries about token counting or additional billing.
Please do not use the AI for software guidance on newly released features, it has no information regarding these and will hallucinate incorrect answers.
Ok, I understand that tokens from previous messages are counted again.
But are they charged at the same price as the tokens in a new message? They shouldn’t be, because those tokens were already computed for previous messages, and I don’t see the need to compute them again. At the very least, they should be cheaper.
This is how the current generation of LLMs works: everything has to be sent again, even if it was computed last time.
This is an emerging technology, and in 12 months’ time we may laugh at how primitive it all was, but for now this is the way it works.
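To see why re-sending everything adds up, here is a sketch of the input tokens billed on each run, assuming the full history is re-sent every time, nothing is truncated, and each assistant reply becomes part of the next run's input. The function name and token counts are hypothetical; per-token prices are left out on purpose.

```python
def per_run_input_tokens(instructions: int, turns: list[tuple[int, int]]) -> list[int]:
    """Input tokens billed on each run, for (user_tokens, reply_tokens) turns."""
    history = 0
    billed = []
    for user_tokens, reply_tokens in turns:
        history += user_tokens                 # new user message joins the history
        billed.append(instructions + history)  # instructions + whole history sent again
        history += reply_tokens                # the reply is input on the next run
    return billed


runs = per_run_input_tokens(instructions=1000, turns=[(100, 200)] * 3)
print(runs, sum(runs))  # [1100, 1400, 1700] 4200
```

In this toy example, 3,000 of the 4,200 input tokens are the instructions re-sent on every run, and each run costs more than the previous one because the history keeps growing.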
Andrej Karpathy has a video where he talks about hallucinations in LLMs, this quote is from his notes:
“The dreams and hallucinations do not get fixed with finetuning. Finetuning just “directs” the dreams into “helpful assistant dreams”. Always be careful with what LLMs tell you, especially if they are telling you something from memory alone. That said, similar to a human, if the LLM used browsing or retrieval and the answer made its way into the “working memory” of its context window, you can trust the LLM a bit more to process that information into the final answer. But TLDR right now, do not trust what LLMs say or do. For example, in the tools section, I’d always recommend double-checking the math/code the LLM did.”