Ways to automate breaking a large piece of input into chunks that fit in the 4096 token constraint?

SomebodySysop · May 21, 2023, 9:39am

My notes on the subject:

Summarize Large Documents

How to Summarize a Large Text with GPT-3
How to Summarize a PDF file with ChatGPT (70 000+ Words)
State of the Art GPT-3 Summarizer For Any Size Document or Format | Width.ai
- Smaller chunks allow for more understanding per chunk but increase the risk of splitting contextual information. Let’s say you split a dialog or topic in half when chunking to summarize. If the contextual information from that dialog or topic is small or hard to decipher in each chunk, the model might not include it in the summary for either chunk. You’ve now taken an important part of the overall text and split the contextual information about it in half, reducing the model’s likelihood of considering it important. On the other hand, you might produce two chunk summaries that are both dominated by that dialog or topic.
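One common way to soften the split-context problem is to let neighbouring chunks overlap, so text near a boundary appears in both. A rough sketch follows. The function name `chunk_words` and the default sizes are my own placeholders, and whitespace-separated words are only a stand-in for tokens. A real token count needs a tokenizer (for example tiktoken), so keep `max_words` well under the model limit.

```python
def chunk_words(text, max_words=600, overlap=100):
    """Split text into overlapping windows of whitespace-separated words.

    Words are an approximation of tokens (assumption): leave headroom
    below the real token limit or count with a proper tokenizer.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be in [0, max_words)")

    words = text.split()
    step = max_words - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

A larger `overlap` lowers the chance that a topic is cut cleanly in two, at the cost of more chunks and more tokens sent overall.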
Building a Summarization System with LangChain and GPT-3 - Part 2 - YouTube
- “Extract the key facts out of this text. Don’t include opinions. Give each fact a number and keep them in short sentences.”
- Fact check summaries.
Building a Summarization System with LangChain and GPT-3 - Part 1 - YouTube
- Summarization Methodologies
  - Map Reduce
    - Chunk document. Summarize each chunk, then summarize all the chunk summaries. Using this currently in embed_solr_index01.php.
  - Stuffing
    - Summarize entire document all at once, if it will fit into prompt.
  - Refine
    - Chunk document. Summarize first chunk. Summarize 2nd chunk + 1st chunk summary. Summarize 3rd chunk + 1st and 2nd chunk summary. And so on…
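The three methodologies above can be sketched in a few lines each. This is not LangChain's implementation, just a minimal illustration. `summarize` is a hypothetical callable that takes text and returns a summary (in practice, a wrapper around whatever model call you use), and the prompt wording in `refine` is my own assumption.

```python
from typing import Callable, List

# Hypothetical: any function that maps text to a shorter summary.
Summarizer = Callable[[str], str]


def stuff(document: str, summarize: Summarizer) -> str:
    """Stuffing: one call on the whole document (only if it fits the prompt)."""
    return summarize(document)


def map_reduce(chunks: List[str], summarize: Summarizer) -> str:
    """Map Reduce: summarize each chunk, then summarize the summaries."""
    partial = [summarize(chunk) for chunk in chunks]
    return summarize("\n".join(partial))


def refine(chunks: List[str], summarize: Summarizer) -> str:
    """Refine: carry a running summary forward through the chunks."""
    running = ""
    for chunk in chunks:
        if running:
            # Assumed prompt layout; adjust the wording to taste.
            prompt = f"Summary so far:\n{running}\n\nNew text:\n{chunk}"
        else:
            prompt = chunk
        running = summarize(prompt)
    return running
```

Map Reduce calls are independent, so they can run in parallel. Refine is sequential, but each step sees what came before.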
Chunk large document by creating a list of summaries
- Break document down into chunks, then summarize each chunk, then submit the list of summaries as the document.
- https://community.openai.com/t/how-to-send-long-articles-for-summarizat…
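For very long documents the list of summaries can itself be too big to submit, so the same step may need repeating. A rough sketch of that loop, under assumptions: `summarize` is a hypothetical text-to-summary callable, word counts stand in for tokens, and `max_rounds` is a guard I added in case the summaries never shrink.

```python
from typing import Callable, List


def split_words(text: str, max_words: int) -> List[str]:
    """Cut text into consecutive pieces of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]


def summarize_long(text: str, summarize: Callable[[str], str],
                   max_words: int = 600, max_rounds: int = 5) -> str:
    """Summarize chunks, treat the joined summaries as the new document,
    and repeat until the document fits in a single call."""
    for _ in range(max_rounds):
        if len(text.split()) <= max_words:
            return summarize(text)  # fits: final summary
        summaries = [summarize(c) for c in split_words(text, max_words)]
        text = "\n".join(summaries)  # the summaries become the document
    raise RuntimeError("summaries did not shrink below max_words")
```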
