Practical Tips for Dealing with Large Documents (>2048 tokens)

What are some practical approaches for dealing with large documents that exceed the 2000/2048 token limit?

Use cases include summarizing / classifying large and complex documents, e.g., scientific articles, legal filings, financial disclosures.

Would the standard approach be to divide into paragraphs, and then feed those in individually? Or is GPT just not suited for long documents yet?

Dear Mr Plane,

Please correspond with the OpenAI team and they will reply with your specific parameters.

Kind Regards, Robinson

OK, first, @Jacques1, what you said doesn’t make sense… This is a place to ask questions…

Anyway, a while ago I had a similar question here:

Basically, there are a few methods for doing what you want. You can break the text into chunks and then summarise each chunk with GPT. Then, if you feed GPT the summarised information plus the last few sentences of the original text, it can generate a decent response. There are some difficulties with this, including building an accurate, consistent summariser, but it is one method.
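
To make that concrete, here is a rough sketch of the chunk-and-summarise part (untested; it assumes the pre-1.0 openai Python package and text-davinci-003, and the prompt wording, model choice, and naive blank-line chunking are just placeholders, not anything official):

```python
# Rough sketch of the chunk-and-summarise idea, assuming the (pre-1.0)
# openai Python package. Model name, prompt wording and the naive
# blank-line chunking are placeholders, not a recommended recipe.
import openai

openai.api_key = "YOUR_API_KEY"  # assumption: supply your own key

def summarise(text: str) -> str:
    """Ask a completion model for a short summary of one chunk."""
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=f"Summarise the following text:\n\n{text}\n\nSummary:",
        max_tokens=256,
        temperature=0.3,
    )
    return response["choices"][0]["text"].strip()

def summarise_document(document: str) -> str:
    # Naive chunking on blank lines; each chunk must still fit the model's
    # context, so in practice you'd split by token count instead.
    chunks = [c for c in document.split("\n\n") if c.strip()]
    chunk_summaries = [summarise(chunk) for chunk in chunks]
    # Summarise the combined per-chunk summaries into one final summary.
    return summarise("\n".join(chunk_summaries))
```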

This is a project @daveshapautomator has worked on in the past. It is not fully functional but is still impressive. It might be a good place to start.

Also, this is kind of a summary of a couple of other posts, with an opinion or two thrown into the mix, so if you want more information, check out some past posts.

Edit: also, the new davinci model has a 4000-token limit, so that may help.

The way I’ve implemented it is similar to how @SecMovPuz described it. In a nutshell (a rough sketch of the token-counting step follows the list):

  • Count the tokens of the input
  • Split it into chunks of fewer than 2048 tokens each. I added some logic here to preserve sentences/paragraphs
  • Summarise each chunk
  • Combine the chunk summaries into a single summary
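
For the token-counting step, something like this works (sketch only; it assumes the tiktoken package, and the "gpt2" encoding name is just an example — use whichever matches your model):

```python
# Sketch of the token-counting step, assuming the tiktoken package.
# The "gpt2" encoding name is an assumption; pick the one for your model.
import tiktoken

def count_tokens(text: str, encoding_name: str = "gpt2") -> int:
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(text))

# Only bother chunking when the document won't fit in a single request.
MAX_TOKENS = 2048
with open("big_document.txt") as f:  # hypothetical input file
    document = f.read()
if count_tokens(document) > MAX_TOKENS:
    print("Too long for one request; split it into chunks first.")
```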

Would you mind sharing your logic to preserve sentences and paragraphs?

Not a big document slicer here … but …

Paragraphs end in ‘\n’. Sentences end in ‘.’, ‘!’, or ‘?’.

Also, the new ‘text-embedding-ada-002’ handles 8k tokens, which should be plenty for paragraphs and sentences.
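
A rough splitter built on those two rules might look like this (sketch only, untested; count_tokens is whatever tokenizer-based counter you already use, and it assumes a single sentence never exceeds the chunk budget):

```python
import re
from typing import Callable, List

def split_into_chunks(document: str,
                      count_tokens: Callable[[str], int],
                      max_tokens: int = 2048) -> List[str]:
    """Split a document into chunks under max_tokens, preferring paragraph
    boundaries ('\\n') and falling back to sentence boundaries."""
    paragraphs = [p for p in document.split("\n") if p.strip()]
    chunks: List[str] = []
    current = ""
    for paragraph in paragraphs:
        # If a whole paragraph fits, keep it as one piece; otherwise split
        # it into sentences on '.', '!' or '?' followed by whitespace.
        if count_tokens(paragraph) <= max_tokens:
            pieces = [paragraph]
        else:
            pieces = re.split(r"(?<=[.!?])\s+", paragraph)
        for piece in pieces:
            candidate = f"{current} {piece}".strip()
            if count_tokens(candidate) <= max_tokens:
                current = candidate
            else:
                if current:
                    chunks.append(current)
                current = piece  # assumes a single sentence fits the budget
    if current:
        chunks.append(current)
    return chunks
```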