Do I misunderstand retrieval pricing?

So, I was looking for ways to reduce the cost of my OpenAI integration and found the data retrieval option. I am considering putting all the questions and data into a markdown file and asking a GPT assistant to return a JSON with every answer. Then I'd dispose of the file and assistant immediately to avoid being charged for additional days of storage. I tested this, and it reliably produces good results.

The cost seems to be minimal since the document's contents aren't counted as tokens (or are they? That'd explain a lot). The only cost it'd incur is the file itself, at 20 cents per GB (~167M tokens), plus a negligibly short request along the lines of "answer every question in the file and return a JSON". That works out to roughly 8.35M tokens per cent, 8350 times cheaper than the 1 cent per 1K tokens of conventional requests.
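To make the comparison explicit, here is the arithmetic behind those numbers. The figures are the assumptions stated in the post (20 cents per GB of file storage, ~167M tokens per GB, 1 cent per 1K tokens for conventional requests), not official pricing:

```python
# Reproducing the cost comparison from the post.
# All numbers are the poster's assumptions, not official OpenAI pricing.
file_cost_cents_per_gb = 20          # assumed file storage cost, per GB
tokens_per_gb = 167_000_000          # rough estimate: ~6 bytes per token

# Tokens of stored data you get per cent of storage cost
retrieval_tokens_per_cent = tokens_per_gb / file_cost_cents_per_gb  # 8,350,000

# Conventional request pricing assumed in the post: 1 cent per 1K tokens
completion_tokens_per_cent = 1_000

ratio = retrieval_tokens_per_cent / completion_tokens_per_cent
print(ratio)  # 8350.0
```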

I am finding this hard to believe because if it were so, why would anyone ever send conventional requests instead of just using assistants+retrieval? Or, for that matter, why wouldn’t OpenAI just make that happen automatically with long enough Chat Completion requests, reducing the cost drastically? What am I missing?

Thanks in advance!

Hi and welcome to the Developer Forum!

What you are describing is the use of "embeddings": pre-processed chunks of text that can be intelligently searched and retrieved for context generation. This uses a semantic similarity search rather than a traditional keyword search, so it is very good at pulling back data related to the question being asked.

In its simplest form, you would "embed" the user's query, find which of your pre-stored, pre-processed embeddings from your documentation have the closest similarity, pull back (let's say) the top 5 of those to include as context, and then append a prompt like: given the above context, how would you answer the following: "{users_question}".

You still pay for the tokens used as context, but not for every token stored as embeddings. So you can have megabytes of documentation and only use kilobytes of context.
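The flow above can be sketched in a few lines. This is a toy illustration: in a real system you would call an embeddings API to vectorize both your documentation chunks and the user's query, whereas here the vectors and chunk texts are made up so the example is self-contained:

```python
import numpy as np

# Pretend documentation chunks and their (made-up) embedding vectors.
# Real embeddings come from an embeddings API and have far more dimensions.
doc_chunks = ["refund policy", "shipping times", "account deletion"]
doc_vectors = np.array([[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]])

def top_k(query_vec, vectors, k=2):
    # Cosine similarity between the query and every stored chunk
    sims = vectors @ query_vec / (
        np.linalg.norm(vectors, axis=1) * np.linalg.norm(query_vec)
    )
    # Indices of the k most similar chunks, best first
    return np.argsort(sims)[::-1][:k]

query_vec = np.array([0.85, 0.15])  # pretend embedding of the user's question
best = top_k(query_vec, doc_vectors)

# Only the retrieved chunks are sent as context, so you pay for
# kilobytes of context rather than the whole stored corpus.
context = "\n".join(doc_chunks[i] for i in best)
prompt = f'{context}\n\ngiven the above context, how would you answer: ...'
```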

OpenAI includes this functionality as part of the Assistants system, but it is currently in very early testing and can still generate quite large contexts; this will hopefully be addressed soon. There are also external vector database solutions for this, including ChromaDB, Weaviate and Pinecone, to name but a few.

You can read about it here


Thank you for the answer, I’ll look into embeddings.
