More Intuitive Constraint Than Tokens for End Users?

What solutions have you created for limiting user request volume that don't require explaining the concept of tokens?

BACKGROUND: I'm developing assistants for specific functional areas as part of a SaaS solution. The end users will be extremely nontechnical, so I'd like to give them a framework for capacity that they can understand. Obviously it won't be an exact match, but it should be close.

CURRENT SOLUTION: The best solution I've come up with is to limit the input field length, disallow file uploads, and tell users they get X questions per month. I then track and log their questions and responses with a counter.
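
For what it's worth, the tracking side is simple. Roughly something like this sketch, where the limit values, the in-memory store, and all the names are just placeholders for whatever your stack uses:

```python
# Rough sketch of a per-user monthly question counter.
# The limits, the in-memory "usage" dict, and all names are placeholders.
from datetime import datetime, timezone

MONTHLY_QUESTION_LIMIT = 50   # the "X questions per month" shown to users
MAX_INPUT_CHARS = 1000        # input field length cap, enforced server-side too

def current_period() -> str:
    """Key usage by calendar month, e.g. '2024-06'."""
    return datetime.now(timezone.utc).strftime("%Y-%m")

def can_ask(usage: dict, user_id: str) -> bool:
    """Check remaining quota before sending the request to the model."""
    return usage.get((user_id, current_period()), 0) < MONTHLY_QUESTION_LIMIT

def record_exchange(usage: dict, user_id: str, question: str, answer: str) -> None:
    """Bump the counter and log the exchange (persist to your real store)."""
    if len(question) > MAX_INPUT_CHARS:
        raise ValueError("Question exceeds the input length limit")
    key = (user_id, current_period())
    usage[key] = usage.get(key, 0) + 1
    # ...append (user_id, question, answer, timestamp) to your log table here...
```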

I'm wondering if anybody has a better, more intuitive solution?

There must be a robust discussion on this somewhere that I cannot find. If you have a link, please share.

Per-request charging is a pretty standard model that I think most people will understand. I agree that making users think about tokens (especially input vs. output tokens) just complicates things. Ideally you could get solid numbers for your min, max, and average request lengths across all customers. You could then set your limits just above your observed max and base your pricing on the observed average.
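
Purely as an illustration (the sample numbers below are made up), deriving those two anchors from logged request lengths is just:

```python
# Illustrative only: derive a request cap and a pricing anchor from
# observed per-request token counts. The sample data is invented.
from statistics import mean

observed_request_tokens = [180, 240, 95, 310, 150, 275, 220]

max_len = max(observed_request_tokens)
avg_len = mean(observed_request_tokens)

request_cap = int(max_len * 1.1)   # limit set just above the observed max
price_anchor = avg_len             # per-request pricing based on the average

print(f"max={max_len}, avg={avg_len:.0f}, cap={request_cap} tokens")
```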

Keep in mind that token costs are constantly falling.

Thanks @stevenic - that's my plan: track consumption and adjust rates as I go. But if I get it too far off, early subscribers could be displeased.

Anybody else have another approach that works?

I'd love to hear other ideas as well. We're planning to charge per token, but we've spun it in a different direction. We do let people upload files (that's our thing), so we're essentially charging based on the size of the files you've uploaded. We've simplified things so you don't have to worry about input vs. output token rates, and we're able to charge a low $1 per million tokens for GPT-4o-quality output.

Hi Ben, this is a legit question. Combining my experience with Assistants and the responses on this thread, these are my thoughts:

The need
Is it absolutely necessary to communicate the interaction limits to the user up front? As you mentioned, it's a bit of a buzzkill for new users to read that there are limitations, even if they're only edge-case limitations.

Is it possible to use nudges the way Claude.ai does when a conversation starts getting lengthy? Claude shows a tiny message just below the user's chat window, something like: "Did you know using a new conversation thread uses fewer tokens?" (You can use other, more nontechnical wording for your users.)

Inflated token count
As you might know, Assistants inflate the token count, since they resend the whole message history with each new message.
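
To make that concrete, here's a toy back-of-the-envelope illustration (no real API calls, and the per-message token count is just an assumption):

```python
# Toy illustration (no real API calls): each new user message resends the
# whole history, so cumulative prompt tokens grow much faster than the
# number of messages. 200 tokens per message is an arbitrary assumption.
TOKENS_PER_MESSAGE = 200

history_tokens = 0
cumulative_prompt_tokens = 0

for turn in range(1, 7):                                  # six user turns in one thread
    prompt_tokens = history_tokens + TOKENS_PER_MESSAGE   # full history + new user message
    cumulative_prompt_tokens += prompt_tokens
    history_tokens += 2 * TOKENS_PER_MESSAGE              # user message + assistant reply
    print(f"turn {turn}: prompt={prompt_tokens}, cumulative={cumulative_prompt_tokens}")
```

By turn 6 the thread has only about 1,200 tokens of new content, but the cumulative prompt tokens billed are several times that, which is the inflation you end up paying for.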

Using combinations
Based on these facts, I would suggest using a combination of strategies that would make the whole experience intuitive:

  1. The nudge method
  2. Limiting the input field length (you have already suggested this)
  3. Creating a new thread after an exchange of only, say, 4 or 6 messages. This prevents token inflation, which indirectly means you don't have to worry as much about token count and cost. I'm already doing this in my application: I'm going to limit the exchange to 6 messages (3 user messages, 3 assistant responses), with a message counter [0/6 messages]. After the sixth message, the user knows they have to switch to a new window. If they want to carry over the last exchange (1 user message, 1 assistant response) as added context, they can do so with a single click in the new chat. A rough sketch of this flow follows below.
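
Here is that sketch. All names are placeholders and it isn't wired to the Assistants API; it just shows the counter, the cap, and the one-click carry-forward:

```python
# Rough sketch of a 6-message cap with optional carry-forward.
# Everything here is a placeholder; wire it to your own thread/Assistants calls.
MAX_MESSAGES_PER_THREAD = 6   # 3 user messages + 3 assistant responses

class ChatThread:
    def __init__(self, seed_messages=None):
        self.messages = list(seed_messages or [])  # list of (role, text) tuples

    @property
    def counter_label(self) -> str:
        """The '[n/6 messages]' label shown next to the chat window."""
        return f"[{len(self.messages)}/{MAX_MESSAGES_PER_THREAD} messages]"

    def is_full(self) -> bool:
        return len(self.messages) >= MAX_MESSAGES_PER_THREAD

    def add(self, role: str, text: str) -> None:
        if self.is_full():
            raise RuntimeError("Thread is full; start a new one")
        self.messages.append((role, text))

def start_new_thread(old: ChatThread, carry_last_exchange: bool) -> ChatThread:
    """One-click carry-forward: seed the new thread with the last user/assistant pair."""
    seed = old.messages[-2:] if carry_last_exchange else []
    return ChatThread(seed_messages=seed)

# Usage: when thread.is_full(), offer the carry-forward choice and continue in
# start_new_thread(thread, carry_last_exchange=True) instead of the old thread.
```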

I hope this was useful. Let me know your thoughts.

Best.