Processing Large Documents - 128K limit


We have some challenges with ingesting large documents. For example:
We need to ingest documents of 100+ pages. We are using the API to generate topics from these documents prior to embedding, but we are running up against the 128K token limit.
Does anyone have any ideas for work-arounds other than breaking up large documents into smaller chunks below the 128K limit?



Given that I’m building a product to solve this very problem, I have to be a bit tactful with my answer, but if you give me a little insight into why you need to process so many documents, maybe I can offer a bit of guidance. Are you using a vector database and RAG? Is that just not working for you? Even at best, RAG is only about 80% efficient at retrieving the most relevant document text, so you typically need to retrieve about 30% more text than you think you need. What kinds of questions are you trying to ask?
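To make the 30% rule of thumb concrete, here is a minimal sketch of padding a retrieval budget. The function name `padded_retrieval_budget` and the default margin are illustrative assumptions taken from the figure quoted above, not part of any library.

```python
def padded_retrieval_budget(needed_tokens: int, margin_pct: int = 30) -> int:
    """Return needed_tokens plus a safety margin, rounded up to a whole token.

    margin_pct=30 mirrors the rule of thumb in the post; tune it for your data.
    """
    if needed_tokens < 0 or margin_pct < 0:
        raise ValueError("needed_tokens and margin_pct must be non-negative")
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-needed_tokens * (100 + margin_pct) // 100)


if __name__ == "__main__":
    # If you estimate that 4,000 tokens of source text answer the question,
    # retrieve this many tokens instead.
    print(padded_retrieval_budget(4000))
```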


One thing I’ll warn you about when using really large context windows like 128k for RAG is that the models often have an attention problem where they miss facts in the middle of the prompt. I don’t know exactly how bad GPT-4-128k’s attention in the middle is, but I can tell you the sweet spot for RAG is a context window size between 8k and 16k.
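One way to stay inside a budget like that is to pack ranked chunks greedily until the budget is used up. This is only a sketch: `pack_chunks` and `approx_token_count` are hypothetical names, and counting whitespace-separated words is a rough stand-in for a real tokenizer.

```python
def approx_token_count(text: str) -> int:
    # Assumption: whitespace-separated words as a crude proxy for tokens.
    # Swap in your model's tokenizer for real counts.
    return len(text.split())


def pack_chunks(ranked_chunks, budget_tokens, count_tokens=approx_token_count):
    """Keep the highest-ranked chunks, in order, until the next one would not fit."""
    packed = []
    used = 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > budget_tokens:
            break  # stop at the first chunk that overflows the budget
        packed.append(chunk)
        used += cost
    return packed


if __name__ == "__main__":
    chunks = ["most relevant passage here", "second best passage", "a much longer third passage " * 10]
    print(pack_chunks(chunks, budget_tokens=8))
```

Stopping at the first overflow keeps the rank order intact; skipping oversized chunks and continuing is an equally reasonable choice.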


GPT-4 Turbo Preview: Exploring the 128k Context Window

Here’s a good article on some testing done with gpt-4-turbo-preview (the 128k context model you’re mentioning) and its accuracy at various context lengths.


The fact that it was reliable up to 64k is impressive to me. I work with a lot of OSS models, so I stay below 16k.


OpenAI is the world leader in this space for a reason, it seems. I use its massive context window to easily generate meeting minutes from hours-long calls.

I struggle to believe anything else at the moment could match it in quality (as we’ve just seen with the lackluster release of Gemini Ultra).

Part of my theory for why this phenomenon exists is that these models are fundamentally pattern matchers. If they get too many facts in the context window, the facts start conflicting with each other and the model gets confused, so it does what it always does when it’s confused: it hallucinates. Why it seems to happen more in the middle, I have no clue.


Hi Steve. Thanks for responding.

Yes, we are using RAG and a vector DB. We are using AWS Postgres with the VectorDB plug-in.
Our business function requires ingesting large documents into the vector DB. It’s an essential function because large documents are very common in the industries we are targeting.

For example, one test doc is the annual report for a large pharmaceutical company (106 pages). Once it has been assigned topics and tokenized into the vector DB, the questions we would expect to be asked are things like:
“Please summarize this annual report for key points.”
“What are the stated risks in the company based on their outlook?”

Does that help?


There was an interesting write up by Juan a couple months ago that found almost perfect results by repeating the content in the middle of the haystack:

Two Needles Strategy: By duplicating the target statement within the text, we found that GPT-4 could retrieve the information with 100% accuracy, suggesting that reinforcing the signal (i.e., the target information) enhances the model’s retrieval capability.

Direct Use of GPT-4 API: Unlike other experiments that used LangChain, we directly employed the GPT-4 API. This might have influenced the results, as our method yielded a 100% retrieval rate even with a single instance of the target statement.

Contrast with Previous Experiments: Our results differ significantly from earlier experiments done by other researchers, where the model struggled with retrieving information in large contexts when it was located in the middle. This discrepancy suggests that the methodology, including factors like the use of LangChain or the nature of the embedded statement, might significantly impact the model’s performance.

In conclusion, these experiments suggest that GPT-4’s ability to retrieve specific information from large contexts can be significantly improved by reinforcing the target information, either by duplication or other means. Additionally, the direct use of the GPT-4 API seems to yield better results compared to methods using intermediary libraries. This indicates that the way information is presented to and processed by LLMs can greatly influence their performance in context retrieval tasks.


The obvious conclusion: Duplicate every word in every document and always get 100% accuracy :sunglasses:

As @mark.elder said though it seems that some industries/documents are still just out of reach for in-context prompting.



It does help and I’d say you’d need my product to solve a problem like that. I’m currently able to reliably process questions like that on upwards of 500m tokens of content. Feel free to DM me :slight_smile:

LOL. I mean, hey, it works :person_shrugging:

I think the idea is to recognize the thresholds and then duplicate the content that falls beneath it.


I’ve worked out how to do reliable Q&A over documents of any size, but I have to do a bunch of tricks.


I wish there was one trick I could give you to answer queries that complex over documents that large but we employ a dozen tricks. For example that annual report probably has financial tables in it. These models lack spatial awareness so they’re likely to jumble up all their numbers. We have tricks that compensate for that. Everything from our data ingesting pipelines to our vector db to our inference layer is all custom and tuned for the specific class of complex queries you specified.

It makes sense that repeating content helps if a specific answer to a specific question happens to be sitting right in the middle of the doc. However, I don’t get how that can be leveraged to improve queries across all possible questions. That would mean you’d need an infinite number of answers pre-determined and injected into the content. Maybe I read that too fast, but I don’t see how it can be used for anything in general.

I’m not sure I understand what you mean

The concept is straightforward. If you are sending a context of 118k tokens and know that information at document depths of 10% to 50% isn’t retrieved correctly, then you can repeat it.

It’s not perfect, but it sure beats a hallucinated/incorrect answer.
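As a rough sketch of that idea, the snippet below repeats whatever chunks fall inside a known weak depth range. The function name `reinforce_weak_depths`, the index-based depth estimate, and the choice to append the repeats at the end of the context are all assumptions for illustration; the thread only says the content can be repeated, not where.

```python
def reinforce_weak_depths(chunks, weak_start=0.10, weak_end=0.50):
    """Return the chunks plus a second copy of those inside the weak depth range.

    Depth is estimated from the chunk's position (midpoint of its index slot),
    which assumes chunks are roughly equal in size.
    """
    n = len(chunks)
    repeated = [
        chunk
        for i, chunk in enumerate(chunks)
        if weak_start <= (i + 0.5) / n <= weak_end
    ]
    # Assumption: repeats go at the end, where recall tends to be better.
    return list(chunks) + repeated


if __name__ == "__main__":
    context_chunks = [f"chunk {i}" for i in range(10)]
    for chunk in reinforce_weak_depths(context_chunks):
        print(chunk)
```

The defaults of 10% and 50% come straight from the example above; in practice you would measure the weak range for your own model and context length.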

As @stevenic has already implied, it’s much better to just use a lower context size. But a problem was identified and a “patchy” solution has been found.

If you spend some more time looking at the GitHub repo, you’ll see Jupyter Notebooks you can run which will answer all of your questions.

My challenge with some of the Q&A approaches is that they look at narrowly defined questions. Often, I am not looking for an answer to a highly specific question; rather, the goal is to create insights from information that sits in various places in the document or across documents, which requires understanding the relationships between pieces of information.

Running bespoke analysis over a large and potentially diverse corpus of documents is where, personally, I think the real value add from LLMs would come from.


It seems like this would be an initial job for embedding models: capture the underlying relationships across documents at varying levels, then use an LLM to try to identify the relationships that the embedding model found.

Keep in mind that certain embedding models (instruct) can be given “tasks” that match whatever insights you are hoping to find:

# Build the instructed query string. The "Instruct: ... / Query: ..." template is an
# assumption following the usual instruct-embedding convention; check your model card.
def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery: {query}'

# Each query must come with a one-sentence instruction that describes the task
task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = [
    get_detailed_instruct(task, 'How to bake a chocolate cake'),
    get_detailed_instruct(task, 'Symptoms of the flu'),
]
# No need to add instruction for retrieval documents
passages = [
    "To bake a delicious chocolate cake, you'll need the following ingredients: all-purpose flour, sugar, cocoa powder, baking powder, baking soda, salt, eggs, milk, vegetable oil, and vanilla extract. Start by preheating your oven to 350°F (175°C). In a mixing bowl, combine the dry ingredients (flour, sugar, cocoa powder, baking powder, baking soda, and salt). In a separate bowl, whisk together the wet ingredients (eggs, milk, vegetable oil, and vanilla extract). Gradually add the wet mixture to the dry ingredients, stirring until well combined. Pour the batter into a greased cake pan and bake for 30-35 minutes. Let it cool before frosting with your favorite chocolate frosting. Enjoy your homemade chocolate cake!",
    "The flu, or influenza, is an illness caused by influenza viruses. Common symptoms of the flu include a high fever, chills, cough, sore throat, runny or stuffy nose, body aches, headache, fatigue, and sometimes nausea and vomiting. These symptoms can come on suddenly and are usually more severe than the common cold. It's important to get plenty of rest, stay hydrated, and consult a healthcare professional if you suspect you have the flu. In some cases, antiviral medications can help alleviate symptoms and reduce the duration of the illness.",
]
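Once queries and passages have been embedded, relevance is usually scored with cosine similarity. The sketch below uses tiny made-up vectors in place of real model output, and the helper names `cosine_similarity` and `rank_passages` are illustrative, not from any embedding library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_passages(query_vec, passage_vecs):
    """Return passage indices ordered from most to least similar to the query."""
    scores = [cosine_similarity(query_vec, p) for p in passage_vecs]
    return sorted(range(len(passage_vecs)), key=lambda i: scores[i], reverse=True)


if __name__ == "__main__":
    # Hypothetical 3-dimensional embeddings; real models return hundreds of dimensions.
    query_vec = [0.9, 0.1, 0.0]
    passage_vecs = [[0.1, 0.9, 0.2], [0.8, 0.2, 0.1]]
    print(rank_passages(query_vec, passage_vecs))
```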

I think a graph database could also work wonders here. I have played with the idea of creating a graph database of character sheets and environments, then letting an LLM “crawl” them, playing out hypotheticals and creating “events” (what would happen if Frodo met a Goblin on Mount Doom while he has The Ring?).
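A toy version of that crawl can be sketched with a plain dictionary standing in for the graph database. Everything here is hypothetical: the graph contents, the relation names, and the `collect_facts` helper, which simply gathers the facts within a few hops of a starting node so they could be handed to an LLM as context.

```python
from collections import deque


def collect_facts(graph, start, max_hops):
    """Breadth-first walk that returns 'node relation neighbor' facts near start."""
    facts = []
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth >= max_hops:
            continue  # do not expand nodes beyond the hop limit
        for relation, neighbor in graph.get(node, []):
            facts.append(f"{node} {relation} {neighbor}")
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    return facts


if __name__ == "__main__":
    # Hypothetical character-sheet graph: node -> list of (relation, neighbor).
    graph = {
        "Frodo": [("carries", "The Ring"), ("is_at", "Mount Doom")],
        "Mount Doom": [("inhabited_by", "Goblin")],
    }
    for fact in collect_facts(graph, "Frodo", max_hops=2):
        print(fact)
```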


I get that if I insert the “answer” to my question in the middle of the context (prompt) that the AI is more likely to be able to find that answer. But what I’m saying is that if I knew what the answer was to my question I wouldn’t need to be asking the question.

You’re not inserting the answer. The Needle in a Haystack test was a technique of inserting the answer at different depths and seeing how often the model was able to retrieve it, to determine at which depths the model fails (as seen in the graph that @trenton.dambrowitz showed).

The idea is not to “insert the answer” at the best depth, but to be aware of the depths at which an LLM tends to “skip over” content.
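For anyone who wants to run that kind of depth test themselves, here is a minimal sketch of building the test prompts. The helper names `insert_needle` and `build_depth_tests` are made up for illustration, and the step of sending each prompt to a model and scoring the reply is left out.

```python
def insert_needle(haystack, needle, depth):
    """Insert needle into a list of sentences at a relative depth in [0.0, 1.0]."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    index = int(depth * len(haystack))
    return haystack[:index] + [needle] + haystack[index:]


def build_depth_tests(haystack, needle, depths):
    """Map each depth to the full prompt text with the needle placed at that depth."""
    return {d: " ".join(insert_needle(haystack, needle, d)) for d in depths}


if __name__ == "__main__":
    filler = [f"Filler sentence {i}." for i in range(8)]
    needle = "The secret code is 1234."  # made-up fact to retrieve
    for depth, prompt in build_depth_tests(filler, needle, [0.0, 0.25, 0.5, 0.75, 1.0]).items():
        print(depth, prompt)
```

Asking the model for the needle at each depth and recording the misses gives you the weak range for your own setup.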
