When I shipped my first AI feature, I budgeted by gut feel, and the real bill at scale was nothing like my estimate — re-embedding, retries, and a too-expensive default model quietly added up. I’d have made very different architecture choices if I’d seen the numbers up front.
So I’m curious how others here handle it: do you forecast cost before building, or find out after? What surprised you most — embeddings, the vector DB, model choice, or retries? Trying to learn how people approach this earlier in the process.
While building (at the drawing board) you have a rough per-call estimate; then you compose and test. Staging then gives you per-run estimates, which tend to be optimistic. Basically, the point where subjective complexity factor (0-5) * cost per run * 50 < customer cost per run is where you ask yourself two questions:
- Can you divide the costs at least by 2 (better 4)?
- Is it worth bothering?
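A minimal sketch of the rule of thumb above, taking the inequality literally as written. The function names, the example prices, and treating 50 as a tunable parameter are my assumptions for illustration, not something the post specifies.

```python
def needs_cost_review(
    complexity_factor: float,
    cost_per_run: float,
    customer_cost_per_run: float,
    multiplier: float = 50.0,  # the "* 50" from the post, kept tunable (assumption)
) -> bool:
    """Return True when the inequality from the post holds, taken as written."""
    if not 0 <= complexity_factor <= 5:
        raise ValueError("complexity_factor is a subjective score in the range 0-5")
    return complexity_factor * cost_per_run * multiplier < customer_cost_per_run


def reduction_targets(cost_per_run: float) -> tuple[float, float]:
    """Cost per run after dividing by 2 (minimum goal) and by 4 (better goal)."""
    return cost_per_run / 2, cost_per_run / 4


# Hypothetical numbers: a run costs $0.002 and the customer pays $0.50 per run.
print(needs_cost_review(3, 0.002, 0.50))
print(reduction_targets(0.008))
```

Whether it is "worth bothering" stays a judgment call; the code only makes the first question concrete by giving you the two target figures to aim for.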
I just build, deploy, monitor and tune iteratively.
In production, if you can limit the population that is processed initially, you can work out the cost on that smaller set, optimise, and then broaden the population once you are satisfied with both the behaviour and the cost.
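One way to turn the limited rollout above into a number is to extrapolate from the measured pilot costs. This is a rough sketch: `project_cost`, `ready_to_broaden`, and the budget figure are hypothetical names and values, and a simple mean assumes the pilot set is representative of the full population.

```python
from statistics import mean


def project_cost(pilot_costs: list[float], full_population: int) -> dict[str, float]:
    """Extrapolate the measured cost per item in the pilot to the full population."""
    if not pilot_costs:
        raise ValueError("need at least one measured pilot cost")
    cost_per_item = mean(pilot_costs)
    return {
        "pilot_total": sum(pilot_costs),
        "cost_per_item": cost_per_item,
        "projected_total": cost_per_item * full_population,
    }


def ready_to_broaden(projected_total: float, budget: float) -> bool:
    """Broaden the population only when the projection fits the budget."""
    return projected_total <= budget


# Hypothetical pilot: three processed items, then a rollout to 1,000 items.
projection = project_cost([0.01, 0.02, 0.03], full_population=1000)
print(projection["projected_total"], ready_to_broaden(projection["projected_total"], budget=25.0))
```

If the projection misses the budget, you optimise on the small set and measure again before widening.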
Personally, I have seen costs explode when the context window was not thought through. Stacking and compressing is not the best approach (at least for what I usually do).
I prefer what I call “composition”: the context is processed in small, separate tasks, the results are assembled into an “answer context”, and finally the model produces the answer from that, which you then attach to the conversation.
But it’s me who pulls the context from the conversation, so I have more control there.
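A minimal sketch of how I read the “composition” idea, assuming the model is reached through a plain callable. `call_model`, the prompt wording, and the chunking are hypothetical; swap in whatever client and prompts you actually use.

```python
from typing import Callable, Sequence


def compose_answer(
    question: str,
    context_chunks: Sequence[str],
    call_model: Callable[[str], str],  # hypothetical: takes a prompt, returns text
) -> str:
    """Process each chunk in its own small task, then answer from the assembled notes."""
    # Step 1: one small, cheap call per chunk instead of one call over everything.
    partials = [
        call_model(
            "Extract only what is relevant to the question.\n"
            f"Question: {question}\nContext: {chunk}"
        )
        for chunk in context_chunks
    ]
    # Step 2: assemble the non-empty partial results into the "answer context".
    answer_context = "\n".join(p for p in partials if p.strip())
    # Step 3: a final call that sees only the assembled notes, not the raw context.
    return call_model(
        "Answer the question using only these notes.\n"
        f"Question: {question}\nNotes:\n{answer_context}"
    )
```

The caller decides which parts of the conversation become `context_chunks`, which is where the extra control comes from. The per-chunk calls could also go to a cheaper model than the final one, though that is my assumption rather than something stated above.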