Use file with text-davinci-001 to increase tokens in prompt

I am working with text-davinci-001 (formerly davinci-instruct-beta-v3) to generate answers to questions. The sources on which the answers should be based are in my prompt. Hence my prompts are long. Is it possible to upload a file containing the sources so that I can include > 2048 tokens? Thanks.

There is a function in the API for answering questions based on resources. Read up on it here.

That’s exactly why I was asking the question, because the answers endpoint does this:

“The endpoint first [searches] over provided documents or files to find relevant context. The relevant context is combined with the provided examples and question to create the prompt for [completion].”

I don’t need or want the above functionality in my workflow. I am using the embeddings endpoint to find the top n documents most similar to my user’s query. I want to dynamically populate a file with those top n documents (rather than include them inline) and use those n documents as the source of truth for generating answers using text-davinci-001.

That’s very interesting @lmccallum

What would be even more fascinating is if we could use embeddings directly in the answers API as the search model.

Agree. This might be something OpenAI will offer in the future.

Perhaps I can try a different way around the 2048-token limit. Each of my top n search results is associated with a unique ID. If I upload a JSON Lines file in advance containing the text versions of all of my embeddings, along with their unique IDs, then perhaps I could instruct GPT-3 to write the completion taking into account only the text associated with those IDs.

Thanks! I think that method could require too much manual quality control, given the length, complexity and interactions of my passages. But I will give it some thought.

We’re trying this workflow to cope with the 2048 token limit:

  1. Get the top n search results for the query from the embeddings endpoint.
  2. Each search result consists of an arbitrary number of paragraphs of text.
  3. Parse each search result into sentences.
  4. Re-rank the sentences based on their similarity to the query.
  5. Use only the top n sentences (up to a limit of 2048 tokens) in the prompt for text-davinci-001.
  6. Also in the prompt, provide instructions to answer the user’s query based strictly on the sentences.

This is essentially a filter to obtain the most relevant information for answering the user’s query, before building the prompt, allowing us to shorten the prompt. We still achieve our goal of getting GPT-3 to use only the provided information to generate the answer.
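A minimal sketch of the filtering in steps 3 to 5, under stated assumptions. The names `split_sentences`, `rough_token_count` and `select_sentences` are hypothetical, the `score` callback stands in for whatever query similarity you compute from the embeddings, and the 4-characters-per-token estimate is a rough heuristic rather than the real tokenizer:

```python
import re
from typing import Callable, Iterable


def split_sentences(text: str) -> list[str]:
    # Naive split after '.', '?' or '!' followed by whitespace.
    # Abbreviations and decimals will confuse it; fine for a sketch.
    return [s.strip() for s in re.split(r"(?<=[.?!])\s+", text) if s.strip()]


def rough_token_count(text: str) -> int:
    # Assumption: roughly 4 characters per token for English text.
    return max(1, len(text) // 4)


def select_sentences(
    results: Iterable[str],
    score: Callable[[str], float],
    budget: int = 1500,
    count_tokens: Callable[[str], int] = rough_token_count,
) -> list[str]:
    """Return the highest-scoring sentences that fit in the token budget."""
    sentences = [s for result in results for s in split_sentences(result)]
    ranked = sorted(sentences, key=score, reverse=True)

    chosen: list[str] = []
    used = 0
    for sentence in ranked:
        cost = count_tokens(sentence)
        if used + cost > budget:
            break  # stop at the first sentence that no longer fits
        chosen.append(sentence)
        used += cost
    return chosen
```

The budget should sit well below 2048, since the instructions, the question and the completion itself all count toward the same limit.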

I’ll let you know how well this works. Could be a useful workflow to share once ready.

Wow! This is quite an interesting approach. My guess is that there might be a noticeable delay between the user asking the question and the whole workflow returning the answer. But if that isn’t an issue, the quality of responses should hopefully be much better.

We’ve got it working! Yes it’s a bit slow. Perhaps we need to spend some money on compute resources? Also, the answers are of varying quality, so I need to fiddle with instructions, temperature, top n results to use, etc. We definitely have to use Davinci. Tried with Babbage and it was hopeless.

I think most of the delay comes from chaining multiple API calls together, so spending money on compute on your end isn’t going to be effective. However, you could experiment with hosting on Azure or another cloud. I say Azure because OpenAI itself is hosted on Azure, so that should minimize network delay if you pick the right region.

Definitely go for davinci. I read somewhere in the OpenAI docs a suggestion to develop the functionality first using davinci and then try to replicate it on less compute-intensive models like curie, babbage and ada.

Using davinci will slow down the process further though.

Also, you can reduce the delay by reducing the max_tokens used in the completions.

If you want to use GPT to answer questions about a text that is longer than 2048 tokens (or 4000, I think, for instruct v2), then in my experience the best approach is as follows. Split your document into smaller pieces, use the embeddings API endpoint to get the embeddings for those pieces as well as for the question you are asking, compare them using cosine similarity, and finally use a prompt to formulate the answer. This is similar to how the (now deprecated) answers API of OpenAI works.


  • Split your document up into pieces. (sentences, paragraphs or something)
    How you split your document depends on your use case and the type of document. If you want to split it into sentences, for example, you can split the text on a period, question mark or exclamation mark delimiter. Make sure your document is plain text and stripped of anything unneeded, like markup.
  • Fetch the embeddings for each bit
    Send a request to the OpenAI embeddings endpoint. This will return an array of numeric (float) values.
  • Fetch the embeddings for your question
    This will be used to find the most relevant document (piece).
  • Calculate semantic similarity using cosine similarity in whatever language you are using
    Cosine similarity is a relatively simple function that measures how similar two equal-length sequences of numbers are; in this case, your embeddings. The closer the result is to 1, the more similar the two texts are.
  • Generate a prompt starting with the most relevant piece of text (the one with the highest similarity), followed by the question.
    If you are using instruct then you can simply append the question to the document piece separated by two newlines. For example:

In quantum computing, a qubit or quantum bit is a basic unit of quantum information —the quantum version of the classic binary bit physically realized with a two-state device.

What is a qubit?

If you are using the davinci base model then it is better to phrase the question as a Q/A pair. For example:

In quantum computing, a qubit or quantum bit is a basic unit of quantum information —the quantum version of the classic binary bit physically realized with a two-state device.
Q: What is a qubit?

This makes the model assume a question/answer scenario. Whichever model you choose to use, the response will be your answer.
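The comparison and prompt steps above could look roughly like this in Python. It is only a sketch: `most_relevant` and `build_prompt` are hypothetical helper names, and the embeddings are assumed to have been fetched already and passed in as plain lists of floats.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


def most_relevant(question_vec: Vector, pieces: dict[str, Vector]) -> str:
    """Return the piece of text whose embedding is closest to the question."""
    return max(pieces, key=lambda text: cosine_similarity(question_vec, pieces[text]))


def build_prompt(piece: str, question: str, qa_style: bool = False) -> str:
    if qa_style:
        # Base model: frame the question as a Q/A pair, as in the example above.
        return f"{piece}\nQ: {question}"
    # Instruct model: document piece, two newlines, then the question.
    return f"{piece}\n\n{question}"
```

For example, `build_prompt(most_relevant(question_vec, pieces), "What is a qubit?")` produces the instruct-style prompt shown earlier, where `pieces` maps each piece of text to its embedding.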

Note: make sure to cache your embeddings locally or on the server where your application is running. This will save you a lot of money. There is no need to fetch the embeddings for the same text more than once.
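One simple way to do the caching, sketched with a JSON file on disk. `EmbeddingCache` is a hypothetical class, and `fetch` is a placeholder for whatever function you use to call the embeddings endpoint; nothing here is part of the OpenAI library.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, Sequence


class EmbeddingCache:
    """Stores embeddings in a JSON file, keyed by a hash of the text."""

    def __init__(self, path: str, fetch: Callable[[str], Sequence[float]]):
        self.path = Path(path)
        self.fetch = fetch  # placeholder for your embeddings API call
        if self.path.exists():
            self.store = json.loads(self.path.read_text())
        else:
            self.store = {}

    def get(self, text: str) -> list[float]:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self.store:
            # Only hit the API for text we have not seen before.
            self.store[key] = list(self.fetch(text))
            self.path.write_text(json.dumps(self.store))
        return self.store[key]
```

If you switch embedding models, use a separate cache file (or add the model name to the key), because vectors from different models are not comparable. Rewriting the whole file on every miss is fine for small collections; a database would suit larger ones better.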

I hope this helps somebody and if you have any questions, by all means ask.


This speaks to the exact problem I have been working on for a couple of days. Your post is extremely helpful and explained the concepts better than the docs (imo). It also cleared up a few questions as the docs assumed a higher level of previous competence than I had.

Most notably, it helped to learn that you can cache the embeddings and that the cosine similarity is computed locally. Thank you!