Managing Costs with GPT-4o and Assistants API in a Growing Context: Seeking Advice

Hello OpenAI Community,

I’m reaching out to gather insights and advice on managing the costs associated with using GPT-4o with the Assistants API in a context-heavy application. Our platform utilizes OpenAI’s GPT-4o to provide users with interactive experiences through conversations with AI-powered avatars and personalized fitness plans, similar to the functionality offered by character.ai.

Use Case

In our application, users interact with AI-powered avatars by sending messages and receiving personalized responses. These interactions are designed to accumulate in context, meaning each new message includes the entire history of the conversation. This feature significantly enhances the user experience by maintaining continuity and personalization, but it also increases our costs as the token count for context grows over time.
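The accumulation described above can be sketched in a few lines. This is a minimal illustration (the 800-tokens-per-message figure is the estimate used later in this post, not a measured value):

```python
# Sketch of how context accumulates when every turn resends the full history.
# Token counts are illustrative (~800 tokens per user turn, per the estimates below).

history = []  # grows every turn; all of it is re-sent as input each time

def send_message(user_text, tokens_per_msg=800):
    """Append the new message and return the input-token count for this call."""
    history.append({"role": "user", "content": user_text})
    # Every prior message is re-sent, so input tokens grow linearly per turn.
    return tokens_per_msg * len(history)

costs = [send_message(f"message {i}") for i in range(1, 4)]
# Turn 1 sends 800 tokens, turn 2 sends 1600, turn 3 sends 2400, ...
```

This linear growth per turn is what makes the total cost quadratic in the number of messages.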

Additionally, we provide personalized fitness plans generated by the AI, which also contributes to the overall token usage.

Current Cost Structure

Based on OpenAI’s pricing:

  • Input Tokens: $5.00 per 1M input tokens
  • Output Tokens: $15.00 per 1M output tokens

Given our average user engagement:

  • Text Messages: 50 messages per month

Example Calculation

For a single conversation where each new message includes all previous messages:

  • Message 1: 800 context tokens
  • Message 2: 1600 context tokens (800 new + 800 previous)
  • Message 3: 2400 context tokens (800 new + 1600 previous)
  • Message 50: 40,000 context tokens (800 × 50)

Total context tokens for 50 messages:

  • 800 × (50 × 51) / 2 = 1,020,000 tokens

Generated tokens per message:

  • 200 tokens

Total generated tokens for 50 messages:

  • 50 × 200 = 10,000 tokens

Total Cost per User

  • Context Tokens Cost: $5.10
  • Generated Tokens Cost: $0.15

Total Monthly Cost per User: $5.25

Required Revenue for Profitability

To maintain a 20% profit margin:

  • Required Revenue per User: $6.56
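As a sanity check, the arithmetic above can be reproduced in a few lines, using only the prices and usage figures already given in this post:

```python
# Back-of-envelope check of the per-user cost figures above (prices per 1M tokens).
INPUT_PRICE, OUTPUT_PRICE = 5.00, 15.00
MSGS, CTX_PER_MSG, OUT_PER_MSG = 50, 800, 200

# Message n resends all n * 800 context tokens, so the total is 800 * (50 * 51 / 2).
context_tokens = sum(CTX_PER_MSG * n for n in range(1, MSGS + 1))  # 1,020,000
generated_tokens = MSGS * OUT_PER_MSG                              # 10,000

input_cost = context_tokens / 1_000_000 * INPUT_PRICE      # $5.10
output_cost = generated_tokens / 1_000_000 * OUTPUT_PRICE  # $0.15
total = input_cost + output_cost                           # $5.25

# A 20% margin means cost is 80% of revenue.
required_revenue = total / (1 - 0.20)                      # $6.5625, i.e. ~$6.56
```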

Credit-Based Pricing Scheme

We have a pricing scheme where users are given credits in our app. Each time they speak with a chatbot or receive a personalized fitness plan, it consumes credits. Once they use their credits, they must upgrade their plan or wait until the next month for credits to replenish. Our current plan allocations are:

  • Explorer Plan ($12): 94 credits
  • Active Plan ($25): 195 credits
  • Pro Plan ($45): 351 credits
  • Elite Plan ($75): 586 credits
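For what it's worth, the plans above all work out to roughly the same implied price per credit, which is easy to verify (numbers taken directly from the list above):

```python
# Implied price per credit across the plans listed above (sanity check).
plans = {"Explorer": (12, 94), "Active": (25, 195), "Pro": (45, 351), "Elite": (75, 586)}
per_credit = {name: price / credits for name, (price, credits) in plans.items()}
# All four plans work out to roughly $0.128 per credit.
```

With the ~$5.25/user/month cost estimate above, the number of credits a conversation or fitness plan consumes should reflect its expected token usage, not just a flat rate.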

Questions for the Community

  1. Context Management: Are there strategies to efficiently manage or limit the context size without compromising the user experience?
  2. Cost Optimization: How can we optimize token usage to reduce costs? Any best practices for handling long-running conversations?
  3. User Education: Any suggestions on educating users about the impact of long conversations on costs and encouraging efficient use?
  4. Scaling and Pricing: How have other developers approached scaling and pricing in similar context-heavy applications?
  5. Pricing Estimate Accuracy: Is our current pricing estimate accurate based on the given costs and usage patterns? How would you price the credits to ensure sustainability and profitability?

We appreciate any insights, experiences, or recommendations you can share to help us manage costs while providing a high-quality user experience.

Thank you!


Since you’re discussing fitness plans, it may be a good idea to separate your models into specific concerns.

Kind of a higher-level MoE approach: if a user says “I want to work on my deltoids and traps today”, you can use a routing agent to invoke a granular agent (either an upper-body agent, or even go as granular as a “Deltoid” & “Traps” agent).

This way your “main” conversation is not cluttered with what could be discarded context (effectively wasted memory, similar to a function call). Instead, it can make “calls” to these muscle experts, who return what they think would be best:

“What’s a great exercise to target both deltoids and traps?”
“Upper Body Agent”: “Dumbbell forward & side lateral raises are great for targeting this region of the deltoid, and also target the traps”

“Please create a 1-hour workout plan for chest and abs”
“Upper Body Agent”: “Here’s some chest workouts: {}”
“Ab Agent”: “Here’s some ab workouts: {}”
User wants adjustments
“Oh, bench press hurts my shoulders, something else perhaps?”
Main Agent re-routes data to Upper Body Agent
“Upper Body Agent”: “…”

You can synthesize the results using a synthesis agent.

The beauty here is that you can then discard the information. Again, similar to a function: instead of keeping everything in context (memory), you run the information in a separate environment, so that when needed you can route to the specific thread for the specific information, instead of having a single thread that covers EVERYTHING.
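The routing-and-discard pattern described here can be sketched roughly as below. All the agent names and the keyword router are hypothetical stand-ins (a real router and the specialists would be LLM calls):

```python
# Minimal sketch of the routing idea above: a router picks a specialist, the
# specialist runs in its own throwaway context, and only the final answer is
# kept in the main conversation. Names and logic here are illustrative only.

SPECIALISTS = {
    "upper_body": lambda q: f"[upper-body advice for: {q}]",
    "abs": lambda q: f"[ab advice for: {q}]",
}

def route(question):
    """Crude keyword router; in practice this would be an LLM routing call."""
    if any(w in question.lower() for w in ("deltoid", "trap", "chest", "shoulder")):
        return "upper_body"
    return "abs"

def ask(question, main_history):
    specialist = SPECIALISTS[route(question)]
    answer = specialist(question)   # runs in a separate, discarded context
    main_history.append(answer)     # only the synthesized result is kept
    return answer
```

The main thread only ever accumulates the short answers, not the specialists' working context.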

Of course, I also feel function calling would be better than retrieval here, since this can be considered structured data. There are already PLENTY of APIs that are very successful and accurate at returning workouts based on muscle groups.


We use a routing agent but make the AI gen the workout

This, in my opinion is the issue.

An LLM does not need to generate the workout.

What you want is an API that can take the structured data the LLM creates and return the workout plan, through whatever means: a weighting of muscle groups along with a time factor, for example.

This provides you with a deterministic service that can be developed, tested, and relied on to produce consistent results. It also acts as a grounding agent to avoid hallucinations.

So if someone wants to focus on their quads and hamstrings for an hour, it can send a (very simplified) API request like:

{
  "areas": {
    "quads": {"weight": 0.5, "avoid": ["squats"]},
    "hamstring": {"weight": 0.5, "avoid": ["squats"]}
  },
  "time": 60,
  "preferences": ["HIIT", "low-weight"],
  "issues": ["shin_splints"]
}

Indicating an even split between quads and hamstrings for a time of 60 minutes. The user did not want to do squats, so we can avoid those.

Can you use a separate agent in the service that accepts this API request? Oh yeah. Something that can parse, validate, and transform the strings to match expectations would be nice, and you can build a cache using this method (for example, maybe some people call “shin splints” “the front of my leg freaking hurts”). With enough training data you could even move towards a locally hosted model.
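The parse/validate/normalize-and-cache step could look something like this. The canonical issue names and the synonym table are made up for illustration; in practice an LLM (or, later, a local model) would resolve unknown phrasings:

```python
# Sketch of normalizing free-form user phrasing to canonical issue names,
# caching previously resolved synonyms. All names here are illustrative.

CANONICAL_ISSUES = {"shin_splints", "lower_back_pain"}
synonym_cache = {"the front of my leg freaking hurts": "shin_splints"}

def normalize_issue(phrase):
    phrase = phrase.strip().lower()
    canonical = phrase.replace(" ", "_")
    if canonical in CANONICAL_ISSUES:
        return canonical
    if phrase in synonym_cache:
        return synonym_cache[phrase]
    # Unknown phrasing: hand off to an LLM/classifier, then cache the result.
    return None
```

Cache hits cost nothing; only genuinely new phrasings need a model call.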

They indicated a preference for HIIT and low weights, so high-rep, low-weight sets with short breaks are ideal.

So now you have an easily testable service that does not rely on an LLM. You can even create a GPT that uses actions for advertising. You can plug this API into many different services, or even build a website around it.

This is a clear separation of modules that leads to fewer context tokens, more deterministic results, much easier testing, an avoidance of hallucinations, and a wider variety of opportunities to use this application. Otherwise, you have stuck yourself in a position where the ONLY interface is a chat.

Is this a lot more work? Yes. But this is essentially RAG, and in my opinion it is worth the results. Because if you don’t do it, someone else will.


No, we definitely want to create them using the LLM. We can’t use an API because an API has only a predefined set of workouts.

  1. Context Management: Are there strategies to efficiently manage or limit the context size without compromising the user experience?

Have sessions of user interactions and summarize each session after a few hours of inactivity. Use the summary instead of the actual messages as context for the next messages. You may also add a human in the loop to add additional context or correct the summary if needed (in the initial days of the product you may need to do this manually, but as the product matures you can use ML to do it automatically). This way you can keep the context size small and still provide a good user experience.
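A minimal sketch of the summarize-after-inactivity idea, where `summarize` is a stub standing in for an LLM (or human-reviewed) summary and the inactivity window is an arbitrary choice:

```python
# Sketch of session summarization after inactivity: once a session goes quiet,
# collapse its messages into one summary message and use that as context.
import time

INACTIVITY_SECONDS = 3 * 60 * 60  # e.g. a few hours; tune to taste

def summarize(messages):
    """Stub: a real implementation would call an LLM to summarize the session."""
    return {"role": "system", "content": f"Summary of {len(messages)} earlier messages."}

def compact_if_idle(history, last_activity, now=None):
    now = now if now is not None else time.time()
    if history and now - last_activity >= INACTIVITY_SECONDS:
        return [summarize(history)]  # the summary replaces the full transcript
    return history
```

Each new session then starts from one short summary message instead of the whole transcript, resetting the quadratic context growth.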

If you do choose to implement this, the next problem you will face is that you want large summaries with a lot of context, but you also want to keep the context size small. Use a RAG pipeline to pick only the necessary context, using similarity search or other techniques.
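The selection step can be sketched with a toy similarity measure. A real pipeline would score embeddings in a vector store, but the top-k logic is the same; the bag-of-words cosine here is just for illustration:

```python
# Sketch of picking only the relevant summaries: score each stored summary
# against the new message and keep the top-k most similar.
from collections import Counter
import math

def cosine(a, b):
    """Toy bag-of-words cosine similarity; stand-in for embedding similarity."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_context(query, summaries, k=2):
    return sorted(summaries, key=lambda s: cosine(query, s), reverse=True)[:k]
```

Only the selected summaries go into the prompt, so context stays bounded no matter how much summarized history accumulates.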

  2. Cost Optimization: How can we optimize token usage to reduce costs? Any best practices for handling long-running conversations?

Cost optimization can be achieved by fixing the token count per message, and you can set custom token limits by subscription plan or have a ranking system for users based on their usage. For the ranking system, just keep a single index of how much the customer is likely to spend on your platform. I’m assuming the chat is a basic feature of your app and you will try to up-sell related products to the customer.
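Per-plan token caps could be as simple as trimming the oldest history until the context fits the tier's budget. The limits below are invented for illustration, and the 800-tokens-per-message figure is the estimate from earlier in the thread:

```python
# Sketch of per-plan context budgets: cap the tokens sent per request based on
# the user's subscription tier, trimming oldest messages first. Limits are made up.

PLAN_CONTEXT_LIMIT = {"explorer": 4_000, "active": 8_000, "pro": 16_000, "elite": 32_000}

def trim_history(history, plan, tokens_per_msg=800):
    """Drop the oldest messages until the context fits the plan's token budget."""
    max_msgs = PLAN_CONTEXT_LIMIT[plan] // tokens_per_msg
    return history[-max_msgs:]
```

This turns the unbounded quadratic cost into a fixed per-message cost that you can price each tier against.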

  3. User Education: Any suggestions on educating users about the impact of long conversations on costs and encouraging efficient use?

Users won’t be interested in knowing the limitations of your app; I would recommend you do not expose your limitations to them. If you do, this will be the first problem your competitors solve, and it could become a key marketing point for them.

I appreciate @RonaldGRuckus’s answer but would change one thing to fit your concern:

No, we definitely want to create them using the LLM. We can’t use an API because an API has only a predefined set of workouts.

Implement an API (similar to what he suggested) but wrap it in a function call. This way the model can supply the params the API needs, and the API can return full workouts, or context that the LLM can use to make a workout plan.
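Concretely, that could mean exposing the workout API as a tool definition so the model fills in the parameters. This sketch mirrors the example request from earlier in the thread; the function name and field descriptions are illustrative, not a real API:

```python
# Sketch of wrapping the workout API in a function-calling tool definition.
# The model supplies the arguments; your code runs the actual API call.
workout_tool = {
    "type": "function",
    "function": {
        "name": "get_workout_plan",  # hypothetical backend endpoint
        "description": "Return exercises for the requested muscle groups.",
        "parameters": {
            "type": "object",
            "properties": {
                "areas": {
                    "type": "object",
                    "description": "Muscle groups with weights and exercises to avoid.",
                },
                "time": {"type": "integer", "description": "Session length in minutes."},
                "preferences": {"type": "array", "items": {"type": "string"}},
                "issues": {"type": "array", "items": {"type": "string"}},
            },
            "required": ["areas", "time"],
        },
    },
}
```

You would pass this as `tools=[workout_tool]` on the chat completion; when the model emits a tool call, you run the API and return either a full plan or raw exercise data for the LLM to shape into a plan, which keeps the LLM-generated workouts the poster wants while grounding them in real data.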
