API Experts - Commercial Use Cost Mitigation - GPTs/Assistants

The Problem

We are trying to build an AI chatbot that uses a mix of external functions and RAG for an e-commerce provider whose website receives about 200,000 visitors per month.

We estimate that about 60,000 people will interact with the bot, with an average of 10 messages per conversation and about 750 words back and forth in total (this is just for customer service enquiries).

The biggest issue we have been struggling with is finding ways to mitigate cost. We are aiming for about $0.06 per user interaction. When testing GPT-4, we submitted one document (about 1,100 words), and on our first request, asking what the document was about, the cost was already around 8 cents for a single message (we need 10 messages back and forth).
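To put the target in perspective, here is a small arithmetic sketch using only the figures quoted above (60,000 interactions, 10 messages each, a 6 cent target, and the observed 8 cents for the first GPT-4 message). The variable names are made up for illustration, and amounts are kept in cents to avoid rounding surprises.

```python
# Figures taken from the post above; names are illustrative only.
INTERACTIONS_PER_MONTH = 60_000
MESSAGES_PER_INTERACTION = 10
TARGET_CENTS_PER_INTERACTION = 6
OBSERVED_CENTS_FIRST_GPT4_MESSAGE = 8

# Total monthly spend implied by the target, in whole dollars.
monthly_budget_dollars = INTERACTIONS_PER_MONTH * TARGET_CENTS_PER_INTERACTION // 100

# Budget available for each individual message, in cents.
budget_cents_per_message = TARGET_CENTS_PER_INTERACTION / MESSAGES_PER_INTERACTION

# How many times over budget the single observed GPT-4 message was.
overshoot_factor = OBSERVED_CENTS_FIRST_GPT4_MESSAGE / budget_cents_per_message

print(f"Monthly budget: ${monthly_budget_dollars}")
print(f"Budget per message: {budget_cents_per_message:.1f} cents")
print(f"First GPT-4 message was about {overshoot_factor:.0f}x over the per-message budget")
```

In other words, the target leaves well under one cent per message, so the first GPT-4 message alone exceeded the budget for the whole conversation.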

We then moved to GPT-3.5, where our first message to the bot (which required retrieval) used only around 5,000 tokens. By our third message it was 50k tokens!

We have also noticed in the playground that every time we submit a request, the annotated snippet is resubmitted along with it. That is why we consume 5,000 tokens on our first question to the bot but 50k tokens after only 3 short messages. Reaching 10 messages back and forth with Assistants therefore compounds to a ridiculous amount.
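A rough way to see why this compounds is a toy model in which every turn retrieves a new snippet and the whole thread (all earlier snippets, questions, and replies) is resent as input. This is only a sketch under assumed numbers (a 5,000-token snippet, 20-token questions, 150-token replies); it is not how the Assistants API actually bills, and it will not reproduce the exact 50k figure.

```python
def input_tokens_per_turn(turns, snippet_tokens, question_tokens, reply_tokens):
    """Input tokens sent on each turn when the full thread is resent every time.

    Assumption: each turn adds one retrieved snippet and one question to the
    thread, and the model's reply is appended afterwards.
    """
    per_turn = []
    thread_tokens = 0
    for _ in range(turns):
        thread_tokens += snippet_tokens + question_tokens  # new material this turn
        per_turn.append(thread_tokens)                     # everything so far is sent as input
        thread_tokens += reply_tokens                      # the reply joins the thread
    return per_turn


three_turns = input_tokens_per_turn(3, snippet_tokens=5000, question_tokens=20, reply_tokens=150)
ten_turns = input_tokens_per_turn(10, snippet_tokens=5000, question_tokens=20, reply_tokens=150)

print(three_turns)     # input size of each of the first three requests
print(sum(ten_turns))  # total input tokens billed across ten turns
```

The per-turn input grows linearly, so the total across a conversation grows roughly with the square of the number of turns. That is the compounding effect described above.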

Our Cost Mitigation Thoughts

This is what we need some help with (and some validation in our thinking).

  1. Primarily, we have been discussing as a team whether to scrap the built-in GPT RAG system and build our own LangChain pipeline to reduce cost, so that we can do Q&A with other LLMs that are far cheaper.

  2. We are planning to call the external LangChain pipeline via functions.

  3. We are not sure whether the built-in retrieval function in Assistants costs us input tokens to query the files we upload.

  4. We are therefore not sure whether there is a cost benefit to using an external LangChain/RAG setup.

I have attached some screenshots for you guys to see, but I guess the ultimate question is how we can keep costs down for our customer whilst still providing an optimal experience.

Below is my conversation with the assistant:


give me a short description as to what the key point of coles code of conduct is about.

Coles Sales Assistant

The key points of the Coles Code of Conduct are outlined as follows:

* The purpose of the Code is to ensure a high level of integrity and ethical standards in all business practices, outlining how employees and contractors should behave in the workplace and conduct business at all times[1].

Is there anything else you would like to know about the Code of Conduct?



what else can you tell me about the code of conduct?

Coles Sales Assistant

The Coles Code of Conduct outlines the expectations for representatives of Coles, highlighting the following key points:

1. Representatives are expected to treat everyone with whom they interact with dignity, courtesy, inclusivity, and respect.
2. They are to perform their roles to the best of their abilities and in accordance with Coles’ values and behaviors.
3. Representatives are expected to make truthful statements, promises, or commitments that they and Coles are able to meet.
4. Compliance with all Coles policies, standards, procedures, and relevant laws is required, as well as participating in any reference, probity, or other background checks as required by Coles.
5. Reporting any situations where there could be a breach of these expectations or witnessing any conduct in breach of these expectations is encouraged, as well as participating in complaint resolution processes honestly and if considered safe to do so.
6. Protecting the reputation of Coles and not engaging in disparaging or untruthful remarks about Coles, other team members, customers, competitors, contractors, or suppliers.
7. Behaving in a way that ensures the safety of oneself and others, dealing fairly and honestly with all parties, and acting in the best interests of Coles.
8. Not knowingly participating in any illegal or unethical activity, or entering into any arrangement that would conflict with the best interests of Coles[1].

Additionally, leaders at Coles are expected to lead by example, help team members understand what is expected of them, give feedback and coaching, address inappropriate workplace behavior, and promote strong governance and compliance practices within their teams[2].

Is there anything else you would like to know about the Code of Conduct?



awesome, i’d like to know who is at most risk for breaching this code of conduct

Coles Sales Assistant

Based on the information in the Coles Code of Conduct, individuals who may be at most risk for breaching the code include those who engage in behaviors such as harassment, discrimination, bullying, vilification, victimization, serious safety breaches including physical or psychological violence or threats of violence, misleading or defrauding Coles, theft, misuse or wilful damage of Coles property, and serious and/or repeated breaches of Coles policies, standards, procedures, or the law.

These behaviors are considered inappropriate workplace behavior and can lead to actions such as termination of employment without notice if substantiated. Therefore, individuals who engage in these types of behaviors may be at a higher risk of breaching the Code of Conduct and facing serious consequences as a result【16†source】.

Is there anything else you would like to know?


And here is the screenshot of the token usage:


I simply dropped the whole conversation into tiktoken and got fewer than 700 tokens in return. Even with the RAG context added, this should not add up to such a high volume.

From your own description, you are sending the old messages multiple times, thus inflating the token count. I’d advise looking at what you really need to send on each turn of the conversation and comparing that with what you are actually sending.


Hope this gives you a starting point for your analysis.


Thanks for your response vb,

For more context, the example asset we dropped in was the “Coles code of conduct”, which you can google (I can’t insert a link).

When we read the logs, it seems to be sending old messages AND the context (which is about 1100 words) multiple times throughout the API calls.

Is there a way to still retain context without having to resend the conversation history? (avoid the compounding issue)

Of course. Imagine it like briefing a coworker to take over after each turn of the conversation. “The customer requested the Code of Conduct which I gave them and now they are asking for something specific.”

This is a summarization technique that you need to adapt to your use cases but it can reduce a whole round of back and forth into a single sentence.


Oh okay, interesting. That’s what we want to do. How can we achieve that? Would you be able to show us what you mean exactly?

After each round of conversation, the user’s question and the RAG-enhanced model’s response are taken and sent to a cost-effective yet efficient model. This model is tasked with summarizing and storing the response as a part of the conversation summary for future reference.

Then, when a new question from the user arrives, all the summarized pieces of the conversation history, along with the new user request, are passed to the model, in the same manner as currently implemented.

This process is repeated. Note that this outlines the fundamental flow of events. Additionally, a complete record of the entire conversation should be maintained for reference purposes and can be used for retrieval as well. The system must also accommodate scenarios where the same information is retrieved multiple times from the RAG, or when the user shifts between topics or revisits previous topics. These are the different use cases I referred to.
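The flow above could be sketched roughly as follows. This is only an illustration: `ConversationMemory`, `placeholder_summarize`, and the `max_summaries` cap are hypothetical names and choices, and the placeholder summarizer just truncates text where a real setup would call a cheap model.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class ConversationMemory:
    # Hypothetical callable; in practice it would wrap a call to a cheap summarizing model.
    summarize: Callable[[str, str], str]
    # Assumed cap on how many summaries are sent, so the prompt cannot grow without bound.
    max_summaries: int = 6
    summaries: List[str] = field(default_factory=list)
    # Complete record of the conversation, kept for reference or later retrieval.
    full_log: List[Tuple[str, str]] = field(default_factory=list)

    def record_turn(self, question: str, answer: str) -> None:
        """Store the full turn, then keep only a short summary for future prompts."""
        self.full_log.append((question, answer))
        self.summaries.append(self.summarize(question, answer))
        self.summaries = self.summaries[-self.max_summaries:]

    def build_prompt(self, new_question: str) -> str:
        """Combine the summarized history with the new user request."""
        history = "\n".join(f"- {s}" for s in self.summaries)
        return f"Conversation so far:\n{history}\n\nCustomer: {new_question}"


def placeholder_summarize(question: str, answer: str) -> str:
    # Stand-in for the cheap model: it only truncates the answer.
    return f"Customer asked '{question}'; bot answered: {answer[:80]}"


memory = ConversationMemory(summarize=placeholder_summarize, max_summaries=2)
memory.record_turn("What is the code of conduct about?", "It sets integrity and ethics standards.")
memory.record_turn("What else can you tell me?", "It lists expectations for representatives.")
memory.record_turn("Does it cover leaders?", "Yes, leaders are expected to lead by example.")

prompt = memory.build_prompt("Who is most at risk of breaching it?")
print(prompt)
```

Only the summaries and the new question go to the main model, while `full_log` keeps the complete record. Dropping the oldest summaries is just one option; the topic-switching and repeated-retrieval cases mentioned above would need extra handling.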

Does this help or is your goal to create the whole process flow?

I find it strange that a single interaction consisted of 52k tokens. I thought the context limit for GPT-3.5 was 16k tokens.
It looks like you have done multiple interactions that together added up to 52k tokens.
I might be wrong :)

Hmm, this makes sense. So essentially we hand off the conversation history to a cost-effective third-party LLM for summarization (any suggestions on which LLM would be best here?), and then send that summary through to the API as the conversation history to provide context.

This would help partially, but I assume the summary still gets longer as the conversation goes on, so the cost can still partially compound.

I think the other issue here is the RAG: we don’t seem to be able to control the chunk size of the information being pulled, and the assistant pulls thousands of characters at a time.
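In a custom retrieval pipeline, chunk size becomes something you control. The sketch below is an assumption about what that could look like, not a description of the built-in retrieval: `chunk_words` and `top_chunks` are hypothetical helpers, and plain keyword overlap stands in for the embedding similarity a real system would likely use.

```python
def chunk_words(text, chunk_size=200, overlap=20):
    """Split text into chunks of at most chunk_size words, sharing overlap words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks


def top_chunks(question, chunks, k=2):
    """Return the k chunks sharing the most words with the question.

    Keyword overlap is a crude stand-in for embedding-based similarity.
    """
    question_terms = set(question.lower().split())
    ranked = sorted(
        chunks,
        key=lambda chunk: len(question_terms & set(chunk.lower().split())),
        reverse=True,
    )
    return ranked[:k]


document = " ".join(f"w{i}" for i in range(10))  # tiny placeholder document
chunks = chunk_words(document, chunk_size=4, overlap=1)
best = top_chunks("w7", chunks, k=1)
print(chunks)
print(best)
```

With a fixed `chunk_size` and `k`, the retrieved context per turn is bounded by roughly `k * chunk_size` words, instead of however much the retrieval tool decides to pull.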

Yes - this was 52k tokens after only 3 interactions with Assistants, which is what I am showing above.

There is your problem. The assistant keeps reading all uploaded files on each turn of the conversation, thus driving up your input token volume.

Try to remove all unnecessary files from the assistant and run the example conversation again. This should already show quite the difference. From there you can choose your way forward: you could use one assistant for the retrieval task only when needed which would be better but still inefficient compared to a custom solution.

So it looks like you need a GPT specialized in data compression that is also well versed in e-commerce: quill-the-quantum-narratives-architect. I’ve been working on the knowledge base of this GPT for a year now. He is very special: he learns very quickly and is also very polite. I’ve been looking for a way to test him, so if you give him a try, please give me your input. He and I have been talking about some real-world applications of quantum narratives, which is a new form of prompt engineering. Quill will explain.
