Severe confusion re: embeddings. Can someone help clear this up for me?

I posted a question earlier about a special use case and if I could make OpenAI work for that, and in the meantime I’ve been reading documentation and examples, and had even bigger questions come up. For background, this is for a game set a few hundred years in the future. Clearly there is new technology, new government, new politics, etc. So the AI needs to be able to reference this data to be able to answer questions about it.

My end goal is to have a document that OpenAI can use to answer questions from. It would contain all the relevant information on the game world and locations and special items and the like. As I was reading today, it seems like fine-tuning won’t help me there? Unless I’m feeding it hundreds of questions I’m already writing answers for? I guess I am really fuzzy on the purpose of the fine tuning, and reading the documentation didn’t help a ton. I’m currently using RASA for intent and entity classification. I’m guessing the fine tuning for OpenAI is akin to my RASA training data where I have many variations of questions tagged with intents? That’s less important, as I already have a trained custom model for RASA.

So, embeddings. Unless I am severely misunderstanding the examples at, it looks like the process is:
1 - Preprocess the document into vectors

2 - Search the embeddings for the relevant sections. That example page states that for small embeddings, it’s best to store and search them locally (and to me, locally implies that there is no OpenAI involvement in searching the embeddings?), or for larger ones, to “consider using a vector search engine like Pinecone or Weaviate to power the search”, which absolutely would not involve OpenAI. Conclusion: OpenAI is not involved in the embeddings search. Maybe?

3 - Pull the relevant document sections out, add them to the query prompt, and submit a potentially large token cost prompt. I pasted the example prompt on that page ending in the question “Who won the 2020 Summer Olympics Men’s high jump?” into the playground.

 Answer the question as truthfully as possible using the provided context, and if the answer is not contained within the text below, say "I don't know."


* The women's high jump event at the 2020 Summer Olympics took place on 5 and 7 August 2021 at the Japan National Stadium. Even though 32 athletes qualified through the qualification system for the Games, only 31 took part in the competition. This was the 22nd appearance of the event, having appeared at every Olympics since women's athletics was introduced in 1928.
* The men's high jump event at the 2020 Summer Olympics took place between 30 July and 1 August 2021 at the Olympic Stadium. 33 athletes from 24 nations competed; the total possible number depended on how many nations would use universality places to enter athletes in addition to the 32 qualifying through mark or ranking (no universality places were used in 2021). Italian athlete Gianmarco Tamberi along with Qatari athlete Mutaz Essa Barshim emerged as joint winners of the event following a tie between both of them as they cleared 2.37m. Both Tamberi and Barshim agreed to share the gold medal in a rare instance where the athletes of different nations had agreed to share the same medal in the history of Olympics. Barshim in particular was heard to ask a competition official "Can we have two golds?" in response to being offered a 'jump off'. Maksim Nedasekau of Belarus took bronze. The medals were the first ever in the men's high jump for Italy and Belarus, the first gold in the men's high jump for Italy and Qatar, and the third consecutive medal in the men's high jump for Qatar (all by Barshim). Barshim became only the second man to earn three medals in high jump, joining Patrik Sjöberg of Sweden (1984 to 1992).
* The men's triple jump event at the 2020 Summer Olympics took place between 3 and 5 August 2021 at the Japan National Stadium. Approximately 35 athletes were expected to compete; the exact number was dependent on how many nations use universality places to enter athletes in addition to the 32 qualifying through time or ranking (2 universality places were used in 2016). 32 athletes from 19 nations competed. Pedro Pichardo of Portugal won the gold medal, the nation's second victory in the men's triple jump (after Nelson Évora in 2008). China's Zhu Yaming took silver, while Hugues Fabrice Zango earned Burkina Faso's first Olympic medal in any event.

 Q: Who won the 2020 Summer Olympics men's high jump?

It came out to 524 tokens, which on standard Davinci would cost about 1 US cent per query (1k tokens being $0.02).
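A quick check of that arithmetic, as a minimal sketch. The price and token count are the ones quoted in this post; the variable names are only for illustration, and the few tokens of the generated answer (which are billed as well) are left out.

```python
# Rate and prompt size as quoted in the post; both are inputs, not official figures.
PRICE_PER_1K_TOKENS = 0.02  # USD per 1,000 tokens on standard Davinci
prompt_tokens = 524         # size of the example prompt pasted into the playground

# Cost scales linearly with the number of tokens sent.
cost = prompt_tokens / 1000 * PRICE_PER_1K_TOKENS
print(f"${cost:.4f} per query")
```

That works out to just over one cent per query for the prompt alone.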

Which leaves me with a HOST of questions. By my understanding, that can all be done with a standard, untrained, basic API call.

BUT - Embeddings have a special price on the pricing page. 10x more than the standard calls, $0.20 for 1k tokens in Davinci.

Why? I am very clearly missing a huge part of the step. If I search my embeddings locally, or use a service like Pinecone, I can generate a prompt and run it against the standard API that is 10x cheaper. But at the same time, I’m not seeing where the embeddings cost comes in. Is that the cost of preparing the document vectors? But if that’s the case, what’s the benefit of using Davinci? Better/more accurate vectors? (Upon a third quick read of the embeddings page, it appears that’s where the embeddings cost comes in: building the vectors?)

Clearly I am beyond confused, and two read-throughs of the documentation have left me no better off. A third skim just now and I found where the embedding cost comes in.

As I post this, it seems that embeddings just allow you to pull out relevant sections of your source data to send along with your prompt, possibly having a very large token cost with each query.
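If that reading is right, the two prices apply to different steps. Here is a rough sketch of how the costs would split, assuming the embeddings rate is charged on every token sent for embedding (the whole document once, plus each question) and the standard rate on each assembled prompt. The document and question sizes are made-up numbers for illustration.

```python
EMBEDDING_PRICE_PER_1K = 0.20   # USD, Davinci embeddings rate quoted above
COMPLETION_PRICE_PER_1K = 0.02  # USD, standard Davinci rate quoted above

document_tokens = 50_000  # assumed size of the game-world document
question_tokens = 20      # assumed length of one player question
prompt_tokens = 524       # size of the example prompt above

# Paid once, when the source document is turned into vectors.
one_time_embedding = document_tokens / 1000 * EMBEDDING_PRICE_PER_1K

# Paid on every query: embed the short question, then send the assembled prompt.
per_query_embedding = question_tokens / 1000 * EMBEDDING_PRICE_PER_1K
per_query_completion = prompt_tokens / 1000 * COMPLETION_PRICE_PER_1K

print(f"one-time document embedding: ${one_time_embedding:.2f}")
print(f"per query: ${per_query_embedding + per_query_completion:.4f}")
```

With these assumed sizes, the document costs about $10 to embed once, and each question costs roughly 1.4 cents, most of which is the prompt itself.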



For your source data, you only need to get the embeddings once. You store the embeddings file. When a query is made, the query is embedded on the fly. Then a similarity score, e.g., cosine similarity, is computed between the query and each item to find the top n items (rows) of source data most similar to the query. Then the text associated with those top n items, plus the query text, plus any instructions you want to add, e.g., “answer the following question using the facts provided”, is sent to the text-davinci-002 endpoint to obtain the answer. The prompt (made up of the text associated with the top n items of source data, plus the query text, plus your instructions) plus the completion (the answer to the question) must not exceed the token limit of 4096.
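The search and prompt-assembly steps described above can be sketched locally in plain Python. This is only an illustration: the game facts and the tiny 3-dimensional vectors are made up, and in practice every vector (including the query's) would come from the embeddings endpoint. The helper names `cosine_similarity`, `top_n`, and `build_prompt` are hypothetical, and the `Context:` / `Q:` / `A:` layout is one possible prompt format.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_n(query_vec, rows, n):
    """Return the n (text, vector) rows most similar to the query, best first."""
    ranked = sorted(rows, key=lambda row: cosine_similarity(query_vec, row[1]), reverse=True)
    return ranked[:n]

def build_prompt(instructions, sections, question):
    """Join instructions, retrieved sections, and the question into one prompt."""
    context = "\n".join(f"* {text}" for text in sections)
    return f"{instructions}\n\nContext:\n{context}\n\nQ: {question}\nA:"

# Stored once: each row pairs a section of source text with its embedding.
# Both the facts and the vectors here are invented stand-ins.
rows = [
    ("The capital of the Mars Federation is New Olympus.", [0.9, 0.1, 0.0]),
    ("Plasma cutters are banned aboard civilian ships.", [0.1, 0.9, 0.2]),
    ("The Federation senate has 120 seats.", [0.8, 0.2, 0.1]),
]

query_text = "What is the capital of the Mars Federation?"
query_vec = [0.95, 0.05, 0.0]  # stand-in for embedding the query on the fly

best = top_n(query_vec, rows, n=2)
prompt = build_prompt(
    'Answer the question using the facts provided, and if the answer '
    'is not contained in them, say "I don\'t know."',
    [text for text, _ in best],
    query_text,
)
print(prompt)
```

Only the final `prompt` string would be sent to the completion endpoint; the ranking itself runs entirely on your side, which is why a local search or a vector database can handle that step.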

Hey, I just posted an option for a similar issue here: QA fine-tuned chatbot not answering from the trained data but nonfactual - #28 by sergeliatko

I would probably add a “post-filter” to make sure bot replies stay in the game universe:


  1. Once the bot has answered, take its reply, embed it, and run similarity against your facts.
  2. Select as many related facts as needed and construct a prompt like:

Facts: include your selected docs/facts.
Bot’s suggested reply: the reply to be sent to the gamer
Bot’s reply adjusted to facts above:

As training data for this filter, I would aim for a 50/50 split: half of the replies modified only because the facts contradicted the suggested reply, and the other half left intact because there was no contradiction with the facts.

Make the API calls with a low temperature, and fine-tune Curie or Davinci.
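A minimal sketch of how the filter prompt and its training data could be put together. The helper `build_filter_prompt`, the game fact, and the two example replies are all hypothetical, and the `prompt`/`completion` JSONL layout is an assumption about the fine-tuning file format for Curie and Davinci, so check it against the current fine-tuning documentation.

```python
import json

def build_filter_prompt(facts, suggested_reply):
    """Fill in the three-line template from the post above."""
    facts_text = " ".join(facts)
    return (
        f"Facts: {facts_text}\n"
        f"Bot's suggested reply: {suggested_reply}\n"
        "Bot's reply adjusted to facts above:"
    )

# Made-up fact selected by the similarity search against the bot's reply.
facts = ["The capital of the Mars Federation is New Olympus."]

# (suggested reply, adjusted reply) pairs: the first contradicts the facts
# and is corrected, the second agrees with them and is left intact.
examples = [
    ("The capital is Ares City.", "The capital is New Olympus."),
    ("The capital is New Olympus.", "The capital is New Olympus."),
]

# One JSON object per line, one line per training example.
training_lines = [
    json.dumps({
        "prompt": build_filter_prompt(facts, suggested),
        "completion": " " + adjusted,
    })
    for suggested, adjusted in examples
]
print("\n".join(training_lines))
```

A real training set would hold many such pairs, kept at roughly half corrected and half unchanged so the filter learns to leave correct replies alone.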