Question about token usage per message in commercial chatbot

Hello everyone,

I’m building a commercial chatbot app with a group of developers. Right now, each user message — even if it’s just a single word — is consuming around 3,000–4,000 tokens.

The developer insists this is normal because the code sends the entire chat history, the system prompt, and all the functions/tools definitions together with every new message.

My argument is that this approach is not practical and very expensive, and that OpenAI provides solutions inside the platform (like Threads in the Assistants API, storing instructions once in the Assistant, and using function calling properly) to avoid resending everything every time.

The developer completely refuses and says his method is correct and the only practical way.

:backhand_index_pointing_right: Could someone from the community clarify what is the standard/best practice here for token efficiency in commercial chatbots? Is it really “normal” for a short user message to cost 3k–4k tokens, or is this a sign of inefficient implementation?

Thanks a lot!

Welcome to this forum.

In short, yes your developers are right: you pay for all inputs and tools needed for the AI model to generate an answer.

The “memory” you refer to is sent all over again for the model to process, regardless the convenience the Assistants API (which is being deprecated btw) might give you to recover previous context.

Context storage (tools, system instructions and conversation history) and context processing are different things, the LLM needs to reprocess things all over again to generate a proper answer.

You can either limit the turns for the conversation or “forget” older messages to reduce costs, but it is not like you only pay for each new “hello” and the history is “already paid” (it is not), there is usually more things involved.

I recommend both you and your team to take a step back and try to put some extra effort into understanding each other a little better.

I know how you feel. However, AI models adjust their responses based on context. With only short user messages, the model has almost no information to work with and can only generate random, generic responses. With proper design, higher token usage actually leads to more distinctive, specialized, and contextually accurate answers.

For example, the commercial chatbots I design use around 4,000–5,000 tokens just for the system prompt. Fortunately, the OpenAI API provides a feature called Prompt Caching, which can help reduce token consumption. Of course, results vary depending on the developer’s skill, but in general, token usage is directly proportional to both the chatbot’s performance and its hallucination rate.