Ways to automate breaking a large piece of input into chunks that fit within the 4096-token limit?

Hi everyone, I’ve been trying to automate submitting prompts that stay within the 4096-token limit. I’m doing this mainly for text summarization. Ideally, I’d like to take a giant body of text, split it into chunks of at most 4096 tokens, and send each chunk to ChatGPT with a very generic prompt like “I am giving you a large body of text to summarize, just say okay.”

I’ve tried two methods, and both fail before I can reach my goal.
Method 1: Using a Chrome extension
OpenAI recommends gpt-3-encoder, which is a no-go: it’s a pain to import Node modules into a Chrome extension, and even after managing that, I found that gpt-3-encoder uses the file system, which Chrome extensions are never allowed to access. This simply doesn’t work.

Method 2: Using Python
With Python I can read a large body of text from a file and split it into chunks that accurately respect the 4096-token limit (tiktoken is awesome). BUT I can’t automate inputting the chunked text into OpenAI: using Selenium, OpenAI detects that I’m using an automated browser and won’t let me sign in. I am stuck again.
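
For reference, the chunking side by itself is easy; this is roughly what my script does (a simplified sketch, with a placeholder file name):

```python
# Simplified sketch of the chunking step (placeholder file name).
import tiktoken

def chunk_text(text, max_tokens=4096, encoding_name="cl100k_base"):
    enc = tiktoken.get_encoding(encoding_name)
    tokens = enc.encode(text)
    # Slice the token list into fixed-size windows, then decode each back to text.
    return [
        enc.decode(tokens[i:i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]

with open("chapter_one.txt", encoding="utf-8") as f:
    chunks = chunk_text(f.read())
print(f"{len(chunks)} chunks")
```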

Are there any reasonable ways to automate inputting large bodies of text? I think this is a reasonable use case that would be generally useful, but I can’t think of an easy way to do it. Any thoughts or insights would be appreciated!

1 Like

Sign up for the API and you can use Python to submit as many requests as you want (and are willing to pay for).
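
Something like this is enough to get going (a minimal sketch with the pre-1.0 openai Python package; it assumes your key is in the OPENAI_API_KEY environment variable, and the text is a placeholder):

```python
# Minimal sketch of a chat completion request (openai-python pre-1.0 interface).
import openai  # reads OPENAI_API_KEY from the environment

chunk = "…a chunk of text that fits comfortably in the context window…"  # placeholder

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Summarize the following text:\n\n" + chunk}],
)
print(response["choices"][0]["message"]["content"])
```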

1 Like

You can use many techniques like map-reduce, refine, etc. I use LangChain for it.
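
Roughly like this (a quick sketch against a 2023-era LangChain install; the file name and chunk sizes are just examples):

```python
# Quick sketch of map-reduce summarization with LangChain (2023-era API).
from langchain.chat_models import ChatOpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains.summarize import load_summarize_chain

with open("book.txt", encoding="utf-8") as f:   # placeholder file name
    long_text = f.read()

# Split into overlapping chunks small enough for the model's context window.
splitter = RecursiveCharacterTextSplitter(chunk_size=3000, chunk_overlap=200)
docs = splitter.create_documents([long_text])

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
chain = load_summarize_chain(llm, chain_type="map_reduce")  # or chain_type="refine"
print(chain.run(docs))
```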

1 Like

I believe @daveshapautomator created a Python script to do this.

My notes on the subject:

Summarize Large Documents

4 Likes

Welcome @jzhanglsw

gpt-3.5-turbo has a context length of 4096 tokens, meaning that’s the maximum number of tokens it can process (read + generate). If you send it 4096 tokens (original text + prompt), there are none left for generation.

Summarizing text whose token count approaches or exceeds the context length is a limitation that exists even now.

Many tools claim to solve this problem, but there is no perfect solution as of now, aside from models with a large context length, like gpt-4-32k, which hasn’t even been released to all gpt-4 users.

One way is to condense smaller, manageable chunks of text and then combine the results.
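
If you go the API route, the idea looks something like this (a loose sketch; the prompt wording and chunk list are placeholders):

```python
# Loose sketch: summarize each chunk, then summarize the combined summaries.
# Assumes the OpenAI API key is already configured in the environment.
import openai

def summarize(text):
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Summarize this text:\n\n" + text}],
    )
    return response["choices"][0]["message"]["content"]

chunks = ["…chunk one…", "…chunk two…"]                   # placeholder chunks
chunk_summaries = [summarize(c) for c in chunks]          # first pass: per chunk
final_summary = summarize("\n\n".join(chunk_summaries))   # second pass: combine
print(final_summary)
```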

I’d imagine this would get expensive if I wanted to summarize whole books. I have ChatGPT Plus, so I can use the normal chat as much as I want; I just need to automate putting the chunks into the actual text box.

At the time of input, I actually don’t ask it to summarize anything; I just tell ChatGPT that I’m giving it a large body of text and to say okay. I did an experiment where I split the first chapter of Harry Potter into two chunks, each near the 4096-token limit (the first chapter is about 6,471 tokens), and it admirably generated the summary AFTER it had received both parts:

In Chapter One of “Harry Potter and the Sorcerer’s Stone,” the Dursley family is introduced as a seemingly normal and mundane family. Mr. and Mrs. Dursley, along with their son Dudley, fear that their secret connection to the Potters will be discovered. Strange occurrences, such as a cat reading a map and peculiarly dressed people, unsettle Mr. Dursley. He overhears a conversation mentioning the Potters and becomes increasingly concerned. Meanwhile, Professor Dumbledore arrives in the neighborhood and discusses the recent events with Professor McGonagall, including Voldemort’s disappearance and the rumors of the Potters’ deaths. Dumbledore reveals that Harry Potter, the Potters’ son, is alive and plans to bring him to live with his aunt and uncle. Hagrid arrives with baby Harry and hands him over to Dumbledore. Dumbledore places Harry on the doorstep of the Dursleys’ house and leaves. The chapter ends with Harry, still unaware of his fame as the “boy who lived,” sleeping peacefully.

Any chance you can point me to the repo? Searching for chatgpt or token on his GitHub returns amusing but irrelevant repositories.

Could be because of the model’s preexisting knowledge of the book.

You have a valid point, so I tried the experiment again with the following article, written in April 2023:
(Apparently I can’t send links, so just search Google for the Putin myth Journal of Democracy.)

Again, the summary was quite good. Despite the token limit, ChatGPT somehow seems able to follow the input if it is chunked accordingly:

Post-2012, Putin’s regime aimed to demobilize liberal elements in Russia, resorting to social control and repression, and using a legitimacy narrative of Russia under siege from the West. Liberal ideas and calls for free elections were cast as foreign influence, with many US and European organizations being deemed undesirable in Russia.
Anti-Putin demonstrations and civil liberties were targeted, using social wedge issues such as religion and gay rights, notably exemplified by the Pussy Riot case and resulting laws. Putin, despite personal history, became a champion of traditional family values, contrasting Russian identity with the West. Elections became heavily controlled, the media was restricted, and the idea of an external enemy was promoted.
Putin’s war in Ukraine exposed the myth of autocratic competence, with significant Russian casualties and economic recession. Russians, especially the younger generation, became international pariahs, with many fleeing to other countries. As Putin’s image of competence faded, he resorted to increased fear and force, imposing severe penalties for dissent.
Despite these circumstances, it appears that Putin still retains considerable support, although there is a generation gap, with younger citizens less supportive. Popular sentiment against the war and the government is growing, as evidenced by declining military approval ratings and an increase in negative feelings. However, the regime’s response to this is likely to be more repression.
The case of Russia under Putin demonstrates how autocracies can shift from attempting to project an image of competence and legitimacy to relying on violent repression, particularly in the face of poor governance and international conflicts.

I took a look at your video:

  • How to Summarize a PDF file with ChatGPT (70 000+ Words)

But it doesn’t seem like you actually show the code for how you accomplished this. Are you using the OpenAI API behind the scenes to submit completion requests?

1 Like

This isn’t my video. It is one of many tutorials on the subject of summarization that I found helpful and like to share.

As for coding, I believe there are lots of Python scripts out there now that will do this. This is the first link that came up in my Google search on the subject: How to do text summarization with deep learning and Python - ActiveState

Personally, I prefer using my own “Semantic Chunking” methodology that I discuss here: Semantic Chunking - YouTube

I am using the OpenAI API for chat completions. I am also using Weaviate as my vector store.
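
In broad strokes, the storage side looks something like this (a simplified illustration with the v3 Weaviate Python client; the class name, URL, and text are examples, not my actual setup):

```python
# Simplified illustration: embed a chunk with OpenAI, store it in Weaviate (v3 client).
import openai
import weaviate

client = weaviate.Client("http://localhost:8080")  # example local instance

chunk = "…one chunk of the source document…"       # placeholder text
embedding = openai.Embedding.create(
    model="text-embedding-ada-002", input=chunk
)["data"][0]["embedding"]

client.data_object.create(
    data_object={"text": chunk},
    class_name="DocumentChunk",                    # example class name
    vector=embedding,
)
```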

2 Likes

Thank you for sharing these resources! Helped me out a ton!!!

1 Like

Thank you for the opportunity to be a part of something extraordinary. I’ll continue to study the required material to help create new content.

I like the general approach you’ve outlined in your video.

1 Like

The key for me was getting a clear overview of what I was trying to accomplish. Yes, I want to be able to create chat completion calls using my own data. But, what are the steps to get there? Everybody tells you, “You need to use embeddings!” But, what are embeddings, and how do you use them? And why? I just got more and more confused until I took the time to understand the entire process of getting from step A to the end.

This was the first flowchart I ever saw that helped me to understand that process.

And this was the video: https://youtu.be/Ix9WIZpArm0

For me, once I understood the overall process, and then each step in that process, the rest was easy. Well, maybe not easy, but it certainly made a lot more sense.
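
If it helps, the tail end of that process (embed the question, find the closest stored chunk, hand it to the chat model) looks roughly like the following; an in-memory search stands in for the vector database here, and all the text is placeholder:

```python
# Rough outline: embed the question, pick the most similar stored chunk,
# then answer with that chunk as context. Placeholder text throughout.
import numpy as np
import openai

def embed(text):
    return np.array(openai.Embedding.create(
        model="text-embedding-ada-002", input=text
    )["data"][0]["embedding"])

chunks = ["…chunk one…", "…chunk two…"]       # placeholder document chunks
chunk_vectors = [embed(c) for c in chunks]

question = "What is this document about?"     # placeholder question
q = embed(question)
scores = [v @ q / (np.linalg.norm(v) * np.linalg.norm(q)) for v in chunk_vectors]
best_chunk = chunks[int(np.argmax(scores))]

answer = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user",
               "content": f"Answer using this context:\n\n{best_chunk}\n\nQuestion: {question}"}],
)
print(answer["choices"][0]["message"]["content"])
```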

Welcome!

It takes a month or two, at best, to get your head wrapped around how vector DBs, semantic search, and embeddings all fit into the equation. My head was swirling for the first month that I worked with GPT. At some point it just clicked and I got it.

1 Like