Practical Tips for Dealing with Large Documents (>2048 tokens)

What are some practical approaches for dealing with large documents that exceed the 2000/2048 token limit?

Use cases include summarizing / classifying large and complex documents, e.g., scientific articles, legal filings, financial disclosures.

Would the standard approach be to divide into paragraphs, and then feed those in individually? Or is GPT just not suited for long documents yet?

Dear Mr Plane,

Please correspond with the OpenAI team and they will reply with your specific parameters.

Kind Regards, Robinson

OK, first, @Jacques1, what you said doesn’t make sense… This is a place to ask questions…

Anyway, a while ago I had a similar question here:

Basically, there are a few methods for doing what you want. You can break the text into chunks and then summarise each chunk with GPT. Then, if you feed GPT the summarised information plus the last few sentences of the original text, it can generate a decent response. There are some difficulties with this, including building an accurate, consistent summariser, but it is one method.
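
To make that concrete, here is a rough sketch of the chunk-and-summarise part (untested; it assumes the pre-1.0 openai Python package and text-davinci-003, and the prompt wording, model choice, and naive blank-line chunking are just placeholders, not anything official):

```python
# Rough sketch of the chunk-and-summarise idea, assuming the (pre-1.0)
# openai Python package. Model name, prompt wording and the naive
# blank-line chunking are placeholders, not a recommended recipe.
import openai

openai.api_key = "YOUR_API_KEY"  # assumption: supply your own key

def summarise(text: str) -> str:
    """Ask a completion model for a short summary of one chunk."""
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=f"Summarise the following text:\n\n{text}\n\nSummary:",
        max_tokens=256,
        temperature=0.3,
    )
    return response["choices"][0]["text"].strip()

def summarise_document(document: str) -> str:
    # Naive chunking on blank lines; each chunk must still fit the model's
    # context, so in practice you'd split by token count instead.
    chunks = [c for c in document.split("\n\n") if c.strip()]
    chunk_summaries = [summarise(chunk) for chunk in chunks]
    # Summarise the combined per-chunk summaries into one final summary.
    return summarise("\n".join(chunk_summaries))
```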

This is a project @daveshapautomator has worked on in the past. It is not fully functional but is still impressive. It might be a good place to start.

Also, this is kind of a summary of a couple of other posts, with an opinion or two thrown into the mix, so if you want more information, check out some past posts.

Edit: also, the new davinci model has a 4000-token limit, so that may help.

The way I’ve implemented it is similar to how @SecMovPuz described it. In a nutshell (a rough sketch of the token-counting step follows the list):

  • Count the tokens of the input
  • Split it into chunks of fewer than 2048 tokens each. I added some logic here to preserve sentences/paragraphs
  • Summarise each chunk
  • Combine the chunk summaries into a single summary
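
For the token-counting step, something like this works (sketch only; it assumes the tiktoken package, and the "gpt2" encoding name is just an example — use whichever matches your model):

```python
# Sketch of the token-counting step, assuming the tiktoken package.
# The "gpt2" encoding name is an assumption; pick the one for your model.
import tiktoken

def count_tokens(text: str, encoding_name: str = "gpt2") -> int:
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(text))

# Only bother chunking when the document won't fit in a single request.
MAX_TOKENS = 2048
with open("big_document.txt") as f:  # hypothetical input file
    document = f.read()
if count_tokens(document) > MAX_TOKENS:
    print("Too long for one request; split it into chunks first.")
```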

Would you mind sharing your logic to preserve sentences and paragraphs?

Not a big document slicer here … but …

Paragraphs end in ‘\n’. Sentences end in ‘.’, ‘!’, or ‘?’.

Also, the new ‘text-embedding-ada-002’ handles 8k tokens, which should be plenty for paragraphs and sentences.
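
A rough splitter built on those two rules might look like this (sketch only, untested; count_tokens is whatever tokenizer-based counter you already use, and it assumes a single sentence never exceeds the chunk budget):

```python
import re
from typing import Callable, List

def split_into_chunks(document: str,
                      count_tokens: Callable[[str], int],
                      max_tokens: int = 2048) -> List[str]:
    """Split a document into chunks under max_tokens, preferring paragraph
    boundaries ('\\n') and falling back to sentence boundaries."""
    paragraphs = [p for p in document.split("\n") if p.strip()]
    chunks: List[str] = []
    current = ""
    for paragraph in paragraphs:
        # If a whole paragraph fits, keep it as one piece; otherwise split
        # it into sentences on '.', '!' or '?' followed by whitespace.
        if count_tokens(paragraph) <= max_tokens:
            pieces = [paragraph]
        else:
            pieces = re.split(r"(?<=[.!?])\s+", paragraph)
        for piece in pieces:
            candidate = f"{current} {piece}".strip()
            if count_tokens(candidate) <= max_tokens:
                current = candidate
            else:
                if current:
                    chunks.append(current)
                current = piece  # assumes a single sentence fits the budget
    if current:
        chunks.append(current)
    return chunks
```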