API cost - How to provide free prompts?

Yes, they absolutely do. Otherwise they would offer a free version of the API.


I don’t want any VC involvement in my experiments, so I ended up building a token-based currency that’s roughly mapped to the cost of OAI tokens plus some overhead. I give each new user two bucks’ worth of tokens, and I have a simple Stripe payment system that replenishes tokens. What I’ve seen so far is that most people who use all the initial tokens re-up and help pay for the initial gift. The long tail of users uses less than 10 cents’ worth. So far it’s working — I’m pretty close to break-even. If you’d like to check it out: https://sndout.com
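A minimal sketch of what a token ledger like this could look like. All the numbers and names here (`PROFIT_MARGIN`, `grant_signup_credit`, the assumed upstream price) are illustrative assumptions, not how sndout.com actually works:

```python
# Sketch of a token-based internal currency mapped to upstream API cost
# plus overhead. Prices and names are assumptions for illustration only.

OPENAI_COST_PER_1K_TOKENS = 0.002   # assumed upstream price, USD
PROFIT_MARGIN = 1.5                  # overhead multiplier on top of raw cost
SIGNUP_GRANT_USD = 2.00              # free credit for each new user

def usd_to_internal_tokens(usd: float) -> int:
    """Convert dollars into internal tokens at the marked-up rate."""
    cost_per_token = (OPENAI_COST_PER_1K_TOKENS / 1000) * PROFIT_MARGIN
    return int(usd / cost_per_token)

class Ledger:
    def __init__(self):
        self.balances = {}

    def grant_signup_credit(self, user_id: str):
        """The 'two bucks of tokens' gift for each new user."""
        self.balances[user_id] = usd_to_internal_tokens(SIGNUP_GRANT_USD)

    def charge(self, user_id: str, tokens_used: int) -> bool:
        """Debit a completed request; False means the user must top up."""
        if self.balances.get(user_id, 0) < tokens_used:
            return False
        self.balances[user_id] -= tokens_used
        return True

    def top_up(self, user_id: str, usd_paid: float):
        """Called from e.g. a Stripe webhook after a successful payment."""
        self.balances[user_id] = (
            self.balances.get(user_id, 0) + usd_to_internal_tokens(usd_paid)
        )
```

The margin multiplier is what covers the long tail of users who never pay — the re-uppers effectively subsidize the free grants.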


Thanks all for your help. What about open source LLMs, can’t they be an alternative?

Thanks, and you’re absolutely right. Hacked solutions aren’t sustainable, not to mention unethical. I just didn’t think the likes of POE and You.com were paying that much in return for a limited number of subscribers. It just doesn’t add up. But who knows, everyone has secrets.


That’s what I did, but it doesn’t make sense economically. ChatGPT had an estimated 200–300M users last I read. Out of those, only 200–300K are subscribers, so about 0.1%. Their business model is dependent on APIs, I guess.

If you can find a good way to host a model, sure! Note that each GPU can typically only run one (or maybe a few) queries in parallel, so if your volume is low but spiky, resource management is tricky. You pay per month for the hosts with GPUs, but requests may come in not at all for hours, and then as multiple overlapping requests within a few seconds.
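A quick back-of-the-envelope comparison shows why this matters. All prices below are illustrative assumptions, not quotes from any provider:

```python
# Rough comparison of a dedicated always-on GPU host vs. a per-token API
# for low, spiky traffic. All prices are illustrative assumptions.

GPU_HOST_USD_PER_HOUR = 1.50    # assumed on-demand price for one GPU instance
API_USD_PER_1K_TOKENS = 0.002   # assumed per-token API price
TOKENS_PER_REQUEST = 2_000      # assumed average request size

def dedicated_cost_per_request(requests_per_day: int) -> float:
    """An always-on host costs the same whether it serves 10 or 10,000 requests."""
    daily_host_cost = GPU_HOST_USD_PER_HOUR * 24
    return daily_host_cost / requests_per_day

def api_cost_per_request() -> float:
    return (TOKENS_PER_REQUEST / 1000) * API_USD_PER_1K_TOKENS

# At 100 requests/day the idle GPU dominates:
#   dedicated ≈ $0.36 per request vs. API ≈ $0.004 per request.
```

Under these assumptions the break-even only flips in favor of self-hosting at high, steady volume (around 9,000 requests/day here), which is exactly why low-but-spiky traffic is the worst case.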

It’s very clear that the free ChatGPT is explicitly providing value to OpenAI – they use the chats for training their next models. (As well as marketing.)
If they were to pay people to do all that text processing and chatting to generate training data, it would be super expensive, so giving away free 3.5-turbo chats is probably a cheaper way of getting the data they need.

Thanks. But don’t the main cloud providers offer that flexibility? Which would mean you could scale as needed and pay only for actual consumption, so you wouldn’t have to worry about resource management. Yet I see what you mean — costs would be unpredictable.

Not for GPU inference of custom models. Just loading a model, even a small one, takes many seconds (5 on the low end, 30 on the high). And if you want to pay “by the minute,” then you need to first get an instance allocated, then get the model parameter volume mapped to the instance, then load the model, then run your inference, and then shut the whole thing down.

“lambda” type “cloud functions” are convenient for simple HTTP requests that sit in front of other always-on systems, but hosting LLMs is not that, and comes with more complications.
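To make the cold-start problem concrete, here is the sequence from the post above with rough, assumed step timings (not measurements from any particular provider):

```python
# Sketch of the cold-start sequence for "pay by the minute" GPU inference.
# The step durations are rough assumptions for illustration, not measurements.

COLD_START_STEPS = [
    ("allocate GPU instance",         60.0),  # assumed: queue + boot time
    ("attach model-parameter volume", 10.0),
    ("load model into GPU memory",    15.0),  # "5 on the low end, 30 on the high"
    ("run inference",                  5.0),  # the only step the user cares about
    ("tear everything down",          10.0),
]

def cold_start_latency() -> float:
    """Total wall-clock time a user waits when nothing is warm."""
    return sum(seconds for _, seconds in COLD_START_STEPS)
```

Even with generous assumptions, the actual inference is a small fraction of the total — the rest is overhead you pay on every cold request, which is why Lambda-style functions don’t map well onto LLM hosting.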


The costs are very predictable … and very high! :sweat_smile:


Does anyone here have any info on the average number of monthly prompts/tokens per user, or how it’s distributed?

I think you can imagine how diverse those answers would be…


  • product finder bot
  • advanced company document retrieval, insert token
  • free interactive AI game with premiums

Thanks, and I can imagine the diversity. I was interested in regular general-purpose gen AI users who would fall in the long tail.

There’s still no good answer.
For example, one of our systems is a RAG-based documentation browser.
We literally try to jam the context as full as we can without going over a particular limit, for each request (as long as there are chunks that match with at least some particular score).
It doesn’t matter if it’s one question, or ten, or 100; each question will use approximately the same N token budget.

We of course have control over N, so we can trade quality for cost by tuning it. But each request is well defined in how big it is; we choose how big it is, and thus we choose how much each request should cost us. And because we’re an enterprise product whose users largely ask a few questions and don’t try to discuss the philosophy of life, that’s a totally acceptable cost for us, as long as the answers are fast enough and don’t omit important information from the included chunks.
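The packing strategy described above can be sketched roughly like this. The helper names and the characters-per-token heuristic are my assumptions, not the poster’s actual implementation:

```python
# Sketch of the context-packing strategy described above: greedily add the
# best-scoring chunks until the token budget N is full, skipping anything
# below a minimum relevance score. Token counting is crudely approximated.

def approx_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def pack_context(chunks: list[tuple[float, str]],
                 budget_n: int,
                 min_score: float) -> list[str]:
    """chunks: (relevance_score, text) pairs from the retriever."""
    selected, used = [], 0
    for score, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        if score < min_score:
            break                 # remaining chunks score even lower
        cost = approx_tokens(text)
        if used + cost > budget_n:
            continue              # this chunk doesn't fit; try a smaller one
        selected.append(text)
        used += cost
    return selected
```

Because every request is packed up to (but never past) the same budget N, each question costs roughly the same number of tokens — which is what makes the per-request cost predictable regardless of whether users ask one question or a hundred.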