Cost of building chat-with-text using embeddings and GPT-4 128k

I am a bit confused about pricing when it comes to chatting with text.

How much would embeddings reduce the cost compared to using the latest GPT-4 model with the full context window? My understanding is that if we use the latest 128k model with the full context window, it costs almost 1 USD per query.

So if I am having a conversation with a document and ask it 10 questions, it will use the entire context window every time to give an answer, costing almost 10 USD for those 10 questions. Is that correct? If I use embeddings for the entire document, does that mean we only pay once, when we first feed all of the text in, and on subsequent queries we are only charged for the small number of tokens in the prompt plus the completion? Is that correct? Also, does the new Assistants API use embeddings?
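To make the arithmetic concrete, here is a rough back-of-the-envelope comparison. The prices are assumptions for illustration only (roughly $0.01 per 1K input tokens and $0.03 per 1K output tokens for a 128K model, as of the time of writing); check the current pricing page before relying on them.

```python
# Illustrative cost comparison: full-context vs. retrieval with embeddings.
# Prices below are ASSUMPTIONS, not official figures.

INPUT_PRICE_PER_1K = 0.01   # USD per 1K prompt tokens (assumed)
OUTPUT_PRICE_PER_1K = 0.03  # USD per 1K completion tokens (assumed)

def query_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Cost of a single API call at the assumed per-token prices."""
    return (prompt_tokens / 1000) * INPUT_PRICE_PER_1K \
         + (completion_tokens / 1000) * OUTPUT_PRICE_PER_1K

# Sending the full 128K context with every question:
full_context = query_cost(128_000, 500)      # ~ $1.30 per question
# Retrieval: only the top chunks (~2K tokens) plus the question:
with_embeddings = query_cost(2_500, 500)     # ~ $0.04 per question

print(f"10 questions, full context: ${10 * full_context:.2f}")
print(f"10 questions, embeddings:   ${10 * with_embeddings:.2f}")
```

The key point: with retrieval, the per-question cost depends on the few thousand tokens you actually send, not on the size of the whole document.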

If someone could please explain this, that would be great. I am a bit worried about excessively high costs if we are charged for the entire context window for every response in the chat conversation.


Embeddings help you isolate smaller, more relevant chunks of text to feed into the prompt. Smaller and more relevant == fewer tokens == less $$$, and possibly better answers.

You would then control how much history/context you send with the prompt to limit excessive token usage. This is how current RAG systems do it: the last few turns plus the currently retrieved context, whatever your needs are.
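A minimal sketch of that pattern, assuming the document has already been split into chunks and embedded once up front (the chunk texts and vectors here are placeholders):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def build_prompt(question_vec, chunks, history, top_k=3, max_turns=4):
    """chunks: list of (text, embedding) pairs, embedded once up front.
    Returns the retrieved context plus only the last few chat turns,
    so each question pays for a few thousand tokens, not the whole doc."""
    ranked = sorted(chunks, key=lambda c: cosine(question_vec, c[1]),
                    reverse=True)
    context = "\n\n".join(text for text, _ in ranked[:top_k])
    recent = history[-max_turns:]  # cap history instead of sending it all
    return context, recent
```

In a real system the question would be embedded with the same embeddings model as the chunks, and a vector database would replace the in-memory sort.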


You explained it quite well.

A reasonable developer would use the context frugally: for example, embedding the document in a semantic database and retrieving just the chunks of text most relevant to the current line of questioning that meet a threshold of similarity and a token budget. Then record the customer's inquiries for accurate billing.
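The threshold-plus-budget selection might look something like this (a sketch; the numbers are arbitrary assumptions):

```python
def select_chunks(scored, threshold=0.75, token_budget=2000):
    """scored: list of (similarity, token_count, text) tuples,
    pre-sorted by similarity, highest first.
    Keep only chunks above the similarity threshold, and stop adding
    once the token budget would be exceeded."""
    picked, used = [], 0
    for sim, tokens, text in scored:
        if sim < threshold:
            break  # sorted descending, so nothing further qualifies
        if used + tokens > token_budget:
            continue  # too big for the remaining budget; try smaller chunks
        picked.append(text)
        used += tokens
    return picked
```

The budget caps the per-call cost no matter how large the source document is, which is exactly the frugality being described.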

Assistants, however, does not negotiate and does not reason. Like the Terminator. You get max context loaded up.

Knowing what a “run” cost you, and why, is near impossible unless you make one call a day.

While they say Assistants uses embeddings, and only when documents are too large to fit in the complete model context themselves, OpenAI is also using other undisclosed functions to iteratively browse document files. Embedding a knowledge base costs them money up front.


Thanks, much appreciated! :raised_hands: Do you have any suggestions for guides (or videos) I could look at to better understand how to implement this?

From your experience, would it be overkill to use GPT-4 for this? If we can chunk the document, I assume we could use GPT-3.5 to accomplish the same thing. Basically, our users upload a lot of documents to our app, and we want them to be able to chat with those docs.

When chatting with text that has been embedded, the difference in quality seems negligible, no?

There are many RAG/Embedding threads here to check out.

One of many, starting with what I normally do:

If you can write code, it’s usually a LOT cheaper doing this yourself. By a lot, I mean 1000x cheaper!


Could you provide more detail on how the solution works? In particular, the async part when using multiple pickle objects.

You could decouple this using something like AWS DynamoDB: generate a random UUID and a database entry, plus some completeness criterion, such as a count of how many objects under that UUID make up a complete execution.

Then you run everything at once as async events. The output of each search is an entry written to the database. A Lambda function is triggered by each NEW_IMAGE in the database and checks the completeness criterion. Once it is met, it sends the collective work to another Lambda function for each entry in each output.

So essentially you use a database layer, with triggers firing off database entries, to decouple everything and enable async, massively parallel execution.

The possibilities are endless with this approach. And the cool thing is that your function inputs are just database keys, which makes for easy debugging and replays. You could also set a TTL on each entry so that after X days the data auto-deletes, saving storage costs and incurring zero cost for the delete.
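The bookkeeping for that pattern can be sketched in pure Python (the part count, table shape, and field names here are hypothetical assumptions; the actual writes and stream triggers would go through boto3/DynamoDB):

```python
import time
import uuid

EXPECTED_PARTS = 5   # completeness criterion: results per job (assumed)
TTL_DAYS = 7         # entries auto-delete after a week via DynamoDB TTL

def new_job_id() -> str:
    """One random UUID groups all partial results of a single run."""
    return str(uuid.uuid4())

def make_result_item(job_id: str, part_no: int, payload: str) -> dict:
    """The entry each async search worker would write to the table.
    `expires_at` is the epoch-seconds attribute DynamoDB TTL reads,
    so deletion after TTL_DAYS costs nothing."""
    return {
        "job_id": job_id,
        "part": part_no,
        "payload": payload,
        "expires_at": int(time.time()) + TTL_DAYS * 86_400,
    }

def is_complete(items_for_job: list[dict]) -> bool:
    """What the stream-triggered Lambda checks on every NEW_IMAGE:
    once all parts exist, hand the collected work to the next step."""
    return len(items_for_job) >= EXPECTED_PARTS
```

Because every function input is just a `job_id` key, a failed step can be replayed by re-reading the same entries from the table.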