Token Limit for Embeddings vs. text-davinci-003

The new embeddings model has a token capacity of 8191, while text-davinci-003 still has a token capacity of 4000, correct? So, for a question answering application, does it actually make sense to use the new embeddings model? I am struggling to see how embedding larger chunks of text at once is helpful, given that the NLP funnel gets narrower, so to speak, when setting up the prompt and requesting a completion. Has anyone else thought about this? Am I missing something here? Thanks.


Personally, I find "diluting the prompt with text that doesn't precisely answer the query" suboptimal. But maybe that's because of the use cases I've chosen.

Sure but that level of precision requires more upfront data preparation and raises the risk of the prompt not containing sufficient context to answer the question well. Regardless, my question was about the mismatch between token limits of two endpoints that are meant to be used together in an NLP pipeline.

From a user perspective, the use cases for embeddings and text generation can be different, so the mismatch doesn't bother me. Is there a reason you would like both models to have the same token limit?

As for why OpenAI uses different token lengths: it's a lot easier to support 8191 tokens for embeddings than to train a text generation model with 8191 tokens. OpenAI is likely still working on the next generation of text generation models that will support 8191 tokens, but wanted to release the 8191-token embedding model first so users can take advantage of it.

I don’t work at OpenAI so I can’t speak for them — just my two cents.

Thanks @nelson, I understand your point. In my NLP pipeline, I first identify the best n embeddings, and then the corresponding pieces of text go into my prompt. So being able to have more tokens per embedding is, I think, irrelevant. Unless I am missing something…

That probably requires testing to find out. I haven’t yet evaluated the quality of search results at the new embedding size. I agree it would be interesting to learn more about that.

As I said, it depends on your particular use case. What I do is the following:

  1. Find the top X items to include in the prompt by cosine similarity on the embedded vectors
  2. Get the source text the items were taken from
  3. Calculate the context size for each item (Total context length / Number of items to include)
  4. Extract the context out of the source text for each item (personally, I found that splitting the context before and after the item at a 70% / 30% proportion gives me good results, but again it depends on your use case)
  5. Include contexts into your prompt…
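A minimal Python sketch of steps 3–4 above — the function names, the dict layout, and the rough 4-characters-per-token estimate are my assumptions for illustration, not the author's actual code:

```python
TOTAL_CONTEXT_TOKENS = 1500  # assumption: tokens left for context in the prompt


def extract_context(source, match_start, match_end, budget_chars, before_ratio=0.7):
    """Step 4: take ~70% of the character budget before the match, ~30% after."""
    before = int(budget_chars * before_ratio)
    after = budget_chars - before
    start = max(0, match_start - before)
    end = min(len(source), match_end + after)
    return source[start:end]


def build_contexts(items, total_tokens=TOTAL_CONTEXT_TOKENS, chars_per_token=4):
    """Step 3: split the total context budget evenly across the matched items.

    Each item is assumed to carry its source text plus the character offsets
    of the matched snippet within that source.
    """
    per_item_chars = (total_tokens // len(items)) * chars_per_token
    return [
        extract_context(it["source"], it["start"], it["end"], per_item_chars)
        for it in items
    ]
```

A real pipeline would count tokens with an actual tokenizer rather than a characters-per-token estimate, but the budgeting logic stays the same.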

This way, you keep the size of the embedded/searchable item text and the size of the context for each item separate from the very beginning. If you later decide to use a different model (with a different max token limit) or to include a different number of found items in the prompt, you can do so without re-embedding the sources.

Also, smaller chunks of text embedded for search produce wider variations of cosine similarity so it is easier to select the most relevant items…


Also, a bigger embedding size allows you to embed huge chunks of text for a “loose” search; then, if needed, you can run another search inside the found items for more “precise” results — a two-step search. Personally, I’ve never used this approach, as I haven’t needed it (yet).


Agree with you. The smaller token limit for davinci limits what can be done with larger embeddings. Actually, davinci should have a token limit 3–5 times higher than the embedding model, because you want to pass the text behind 3–5 similar embeddings to get the answer.


I’m looking at strategies for dealing with the limited token size when submitting queries with context text for completion. I don’t quite understand what you mean by “extract the context out of the source text for each item”.

How exactly do you extract “context” out of source text?

For example, if I do a vector search against my embedded knowledgebase and get back 3 hits, that means I go and grab the source files for each of those hits. So far, so good. Now, ideally, I would like to submit the text of these files to text-davinci-003 as context for a query completion. But, if the total token amount of this text is over 4000 tokens, I’m out of luck.

Each text would represent, for example, a section out of regulatory text. Perhaps one paragraph out of 20 or 30 in the text would be the actual “context” matching the initial query. How would I extract just that paragraph from the text before submitting the completion call?

Or, did I misunderstand what you are saying?

Hi there, you have to divide up your text into smaller chunks in the first place. Then obtain embeddings for the smaller chunks. Then your search results will be more specific, and you can fit more into the prompt. “Context” just means the search results that you are including in the prompt. Hope that’s helpful.
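A rough illustration of the chunking step described above — splitting on blank lines and estimating 4 characters per token are my assumptions; a real pipeline would count tokens with a proper tokenizer before calling the embeddings API:

```python
def split_into_chunks(text, max_tokens=500, chars_per_token=4):
    """Split on blank lines, then pack paragraphs into chunks under the cap."""
    max_chars = max_tokens * chars_per_token  # rough token estimate
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += para + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks
```

Each resulting chunk would then be embedded separately, so a search hit points at one specific passage instead of a whole document.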


@SomebodySysop here is what I came up with for my use cases, looks pretty similar to yours:

I have several long documents with multiple sections (10-50) and hundreds of paragraphs.

I embed each paragraph separately and attach the IDs of the paragraph, section, and document to the stored vector, so that when I find the best matches I know exactly where the text is coming from.

Now, each paragraph is roughly 200 tokens. If my limit is 2k tokens and I expect an answer of about 400 tokens, that leaves me about 1.6k tokens for the prompt. 100 tokens go to the question, so realistically the context may be up to 1.5k tokens.

If I consider the 3 best matches, that’s only 600 tokens. So what I do is grab the text around the found paragraphs to bring each up to about 500 tokens, and use that as my context. In doing so, I don’t go outside the section containing the paragraph, since sections are usually delimited by meaning/logic. So if my paragraph is the first in its section I grab the text after it, and if it is the last one, I grab the text before it.
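That budgeting and neighbor-grabbing logic could be sketched like this — the paragraph records and the numbers mirror the description above, but the code itself is illustrative, not the author's:

```python
MODEL_LIMIT = 2000
ANSWER_TOKENS = 400
QUESTION_TOKENS = 100

# Tokens left over for context after reserving the answer and question.
context_budget = MODEL_LIMIT - ANSWER_TOKENS - QUESTION_TOKENS


def expand_match(paragraphs, idx, target_tokens=500):
    """Grow a matched paragraph with neighbors from the SAME section only.

    `paragraphs` is the ordered list of (section_id, text, token_count)
    tuples for one document; `idx` is the index of the matched paragraph.
    """
    section = paragraphs[idx][0]
    lo = hi = idx
    total = paragraphs[idx][2]
    while total < target_tokens:
        # Prefer text after the match; fall back to text before it.
        if hi + 1 < len(paragraphs) and paragraphs[hi + 1][0] == section:
            hi += 1
        elif lo - 1 >= 0 and paragraphs[lo - 1][0] == section:
            lo -= 1
        else:
            break  # section boundary reached on both sides
        total = sum(p[2] for p in paragraphs[lo:hi + 1])
    return " ".join(p[1] for p in paragraphs[lo:hi + 1])
```

With three ~200-token matches each expanded toward ~500 tokens, the combined context lands at or under the 1.5k budget described above.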

But that’s my use case. If you need to provide the whole document as context and it’s bigger than the token limit, then you need to summarize the document to make it short enough to fit into the prompt.

Another approach may be to chain prompts with contexts and update the answer based on new information found in the chained contexts.


@sergeliatko Thank you very much for that detailed explanation. I can’t express how much it helps to get feedback from someone who is actually doing what you are trying to do.

My current plan is a combination of your method and the suggestion from @lmccallum: I have broken my documents into relevant sections, then broken those down into chunks of < 3.8K tokens that are then embedded. Since each text will be below the 4K davinci limit, I will take the top 3 results from a vector search and send each as context with the original query for 3 completions. These “chained” responses will be the answer returned to the end user.

I was curious about

Because it seemed to present a way to retrieve only the relevant text from each of the top 3 results, allowing me to make just one completion call to davinci instead of 3.

FYI, I asked the AI about this, and here was the response:

One possible strategy for extracting the relevant paragraph from a larger document is to use natural language processing techniques to identify the sentence or paragraph that is most semantically similar to the query. Here are some steps that you could follow:

  1. Parse the text of each of the hits returned by your vector search, and convert them into a machine-readable format (such as JSON or XML).
  2. Use a natural language processing library (such as spaCy or NLTK) to tokenize the text into individual sentences or paragraphs.
  3. For each hit, calculate the semantic similarity between each sentence or paragraph and the query. There are several methods for calculating semantic similarity, such as cosine similarity or the Jaccard index.
  4. Identify the sentence or paragraph that is most semantically similar to the query. This will likely be the one that contains the relevant context.
  5. Extract this sentence or paragraph from the hit, and submit it as the context for your query completion.

Note that this approach assumes that the relevant context can be expressed in a single sentence or paragraph. If the context is more complex, you may need to extract multiple sentences or paragraphs and concatenate them together.

It’s also worth noting that some query completion models, such as GPT-3, are capable of generating context-aware completions even without explicit context text.

Although, I think it may be hallucinating on this last part:

It’s also worth noting that some query completion models, such as GPT-3, are capable of generating context-aware completions even without explicit context text.

I will probably create a new topic to see if anyone has used NLP to further segment their vector search results.

Again, thanks for the responses.
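The AI’s suggested steps above (split each hit into paragraphs, score each against the query, keep the best) can be sketched with a simple bag-of-words cosine similarity standing in for real embeddings — real embeddings or spaCy similarity would score far better, but the mechanics are the same:

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two token-count Counters."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def best_paragraph(document, query):
    """Return the paragraph of `document` most similar to `query`."""
    q = Counter(query.lower().split())
    paragraphs = [p for p in document.split("\n\n") if p.strip()]
    return max(paragraphs, key=lambda p: cosine(Counter(p.lower().split()), q))
```

In practice you would embed each candidate paragraph and the query with the embeddings endpoint and compare those vectors instead of word counts.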

@SomebodySysop from my former linguistics education I can say that a chunk of almost 4k tokens will contain at least several ideas (a finished idea most often fits in one paragraph, or at most three, so roughly 500–600 tokens per idea at most). And the goal of vector search is to match one idea (the query) to another (the source chunk) as closely as possible… Embedding more than one idea per chunk dilutes the precision of the vector search (concept match) and makes a perfect match almost unachievable.

Embedding chunks of text that big makes sense when you need vectors for clustering or classification of entire documents, or for subject search. But when you need fact search inside the documents, you need precision, and in that case it doesn’t make sense to me to vectorize texts longer than one “idea” (1–3 paragraphs, or 200–600 tokens).

I would revisit your approach to check if that’s not your underlying issue…


That’s exactly one of the benefits of my approach: embedding one paragraph at a time and stuffing the context with surrounding text when the token limit allows it. This way the search works across multiple documents, and changing the number of items considered is a simple parameter change in the prompt constructor function.

The answer is: make it available separately in the first place by embedding/vectorizing it separately from the surrounding text.

It will cost you more in embedding tokens, but you pay those only once. And doing so will save you tokens on completion requests (the ones you pay for all the time).

@sergeliatko Believe it or not, I also have a linguistics degree. It’s from almost half a century ago, but what you say still makes perfect sense to me.

My biggest concern was losing important context before and/or after the idea in the target paragraph. But, as you say, adding several paragraphs before and after would mitigate that concern significantly.

Going to revisit my current embedding process to see if I can make something like this work.

Again, my most humble gratitude for taking the time to explain this to me.

The exact reason I took this approach. I also considered a “checker” model that would look at the context and determine whether it is sufficient to answer the query, but I never implemented it, as adding paragraphs from the same section gave good enough results.

Mine is close to half of that, but the teaching was old school (post-USSR at the time). Some degrees do not expire that fast. I’ve almost never worked in the field, but I’ve also never seen a domain where it wasn’t a huge advantage.


Hello, I am trying to write a program that takes a database of properties and assigns them to exercises, but I am having difficulty overcoming the token limit. Do you have any tips?
I let someone submit an input in the form of an exercise, for example: what is Ohm’s law.
Then ChatGPT assigns properties from a list I provide. An example of a property is: the student can understand and use Ohm’s law.