Ways to automate breaking a large piece of input into chunks that fit within the 4096-token limit?

Hi everyone, I’ve been trying to automate submitting prompts that stay within the 4096-token limit. I’m doing this mainly for text summarization. Ideally, I’d like to take a giant body of text, split it into chunks of at most 4096 tokens, and send each chunk to ChatGPT with a very generic prompt like “I am giving you a large body of text to summarize, just say okay.”

I’ve tried two methods, and both fail before I can reach my goal.
Method 1: Using a Chrome extension
OpenAI recommends gpt-3-encoder, which is a no-go: it’s a pain to import Node modules into a Chrome extension, and even after managing that, I found that gpt-3-encoder uses the file system, which Chrome extensions are never allowed to access. This simply doesn’t work.

Method 2: Using Python
With Python I can read a large body of text from a file and split it into chunks that accurately respect the 4096-token limit (tiktoken is awesome). BUT I can’t automate inputting the chunked text into OpenAI: using Selenium, OpenAI detects that I’m using an automated browser and won’t let me sign in. I am stuck again.
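
For reference, the chunking side by itself is easy; this is roughly what my script does (a simplified sketch, with a placeholder file name):

```python
# Simplified sketch of the chunking step (placeholder file name).
import tiktoken

def chunk_text(text, max_tokens=4096, encoding_name="cl100k_base"):
    enc = tiktoken.get_encoding(encoding_name)
    tokens = enc.encode(text)
    # Slice the token list into fixed-size windows, then decode each back to text.
    return [
        enc.decode(tokens[i:i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]

with open("chapter_one.txt", encoding="utf-8") as f:
    chunks = chunk_text(f.read())
print(f"{len(chunks)} chunks")
```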

Are there any reasonable ways to automate inputting large bodies of text? I think this is a reasonable use case that would be generally useful, but I can’t think of an easy way to do it. Any thoughts or insights would be appreciated!

1 Like

Sign up for the API and you can use Python to submit as many requests as you want (and are willing to pay for).
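
Something like this is enough to get going (a minimal sketch with the pre-1.0 openai Python package; it assumes your key is in the OPENAI_API_KEY environment variable, and the text is a placeholder):

```python
# Minimal sketch of a chat completion request (openai-python pre-1.0 interface).
import openai  # reads OPENAI_API_KEY from the environment

chunk = "…a chunk of text that fits comfortably in the context window…"  # placeholder

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Summarize the following text:\n\n" + chunk}],
)
print(response["choices"][0]["message"]["content"])
```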

1 Like

You can use many techniques like map-reduce, refine, etc. I use LangChain for it.
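
Roughly like this (a quick sketch against a 2023-era LangChain install; the file name and chunk sizes are just examples):

```python
# Quick sketch of map-reduce summarization with LangChain (2023-era API).
from langchain.chat_models import ChatOpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains.summarize import load_summarize_chain

with open("book.txt", encoding="utf-8") as f:   # placeholder file name
    long_text = f.read()

# Split into overlapping chunks small enough for the model's context window.
splitter = RecursiveCharacterTextSplitter(chunk_size=3000, chunk_overlap=200)
docs = splitter.create_documents([long_text])

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
chain = load_summarize_chain(llm, chain_type="map_reduce")  # or chain_type="refine"
print(chain.run(docs))
```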

1 Like

I believe @daveshapautomator created a Python script to do this.

My notes on the subject:

Summarize Large Documents

4 Likes

Welcome @jzhanglsw

gpt-3.5-turbo has a context length of 4096 tokens, meaning that’s the maximum number of tokens it can process (read + generate). If you send it 4096 tokens (original text + prompt), there are none left for generation.

Summarizing text whose token count approaches or exceeds the context length is a limitation that exists even now.

Many tools claim to solve this problem, but there is no perfect solution as of now, aside from models with a large context length, like gpt-4-32k, which hasn’t even been released to all gpt-4 users.

One way is to condense smaller, manageable chunks of text and then combine the results.
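
If you go the API route, the idea looks something like this (a loose sketch; the prompt wording and chunk list are placeholders):

```python
# Loose sketch: summarize each chunk, then summarize the combined summaries.
# Assumes the OpenAI API key is already configured in the environment.
import openai

def summarize(text):
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Summarize this text:\n\n" + text}],
    )
    return response["choices"][0]["message"]["content"]

chunks = ["…chunk one…", "…chunk two…"]                   # placeholder chunks
chunk_summaries = [summarize(c) for c in chunks]          # first pass: per chunk
final_summary = summarize("\n\n".join(chunk_summaries))   # second pass: combine
print(final_summary)
```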

I’d imagine this would get expensive if I wanted to summarize whole books. I have ChatGPT Plus, so I can use the normal chat as much as I want; I just need to automate putting the chunks into the actual text box.

At the time of input, I actually don’t ask it to summarize anything; I just tell ChatGPT that I’m giving it a large body of text and to say okay. I did an experiment where I split the first chapter of Harry Potter into two chunks, each near the 4096-token limit (the first chapter is about 6,471 tokens), and it admirably generated the summary AFTER it had received both parts:

In Chapter One of “Harry Potter and the Sorcerer’s Stone,” the Dursley family is introduced as a seemingly normal and mundane family. Mr. and Mrs. Dursley, along with their son Dudley, fear that their secret connection to the Potters will be discovered. Strange occurrences, such as a cat reading a map and peculiarly dressed people, unsettle Mr. Dursley. He overhears a conversation mentioning the Potters and becomes increasingly concerned. Meanwhile, Professor Dumbledore arrives in the neighborhood and discusses the recent events with Professor McGonagall, including Voldemort’s disappearance and the rumors of the Potters’ deaths. Dumbledore reveals that Harry Potter, the Potters’ son, is alive and plans to bring him to live with his aunt and uncle. Hagrid arrives with baby Harry and hands him over to Dumbledore. Dumbledore places Harry on the doorstep of the Dursleys’ house and leaves. The chapter ends with Harry, still unaware of his fame as the “boy who lived,” sleeping peacefully.

Any chance you can point me to the repo? Searching for chatgpt or token on his GitHub returns amusing but irrelevant repositories.

Could be because of the model’s preexisting knowledge of the book.

You have a valid point, so I tried the experiment again with the following article, written in April 2023:
(Apparently I can’t send links, so just search Google for the Putin myth Journal of Democracy.)

Again, the summary was quite good. Despite the token limit, ChatGPT somehow seems able to follow the input if it is chunked accordingly:

Post-2012, Putin’s regime aimed to demobilize liberal elements in Russia, resorting to social control and repression, and using a legitimacy narrative of Russia under siege from the West. Liberal ideas and calls for free elections were cast as foreign influence, with many US and European organizations being deemed undesirable in Russia.
Anti-Putin demonstrations and civil liberties were targeted, using social wedge issues such as religion and gay rights, notably exemplified by the Pussy Riot case and resulting laws. Putin, despite personal history, became a champion of traditional family values, contrasting Russian identity with the West. Elections became heavily controlled, the media was restricted, and the idea of an external enemy was promoted.
Putin’s war in Ukraine exposed the myth of autocratic competence, with significant Russian casualties and economic recession. Russians, especially the younger generation, became international pariahs, with many fleeing to other countries. As Putin’s image of competence faded, he resorted to increased fear and force, imposing severe penalties for dissent.
Despite these circumstances, it appears that Putin still retains considerable support, although there is a generation gap, with younger citizens less supportive. Popular sentiment against the war and the government is growing, as evidenced by declining military approval ratings and an increase in negative feelings. However, the regime’s response to this is likely to be more repression.
The case of Russia under Putin demonstrates how autocracies can shift from attempting to project an image of competence and legitimacy to relying on violent repression, particularly in the face of poor governance and international conflicts.

I took a look at your video:

  • How to Summarize a PDF file with ChatGPT (70 000+ Words)

But it doesn’t seem like you actually show the code for how you accomplished this. Are you using the OpenAI API behind the scenes to submit completion requests?

1 Like

This isn’t my video. It is one of many tutorials on the subject of summarization that I found helpful and like to share.

As for coding, I believe there are lots of Python scripts out there now that will do this. This is the first link that came up in my Google search on the subject: How to do text summarization with deep learning and Python - ActiveState

Personally, I prefer using my own “Semantic Chunking” methodology that I discuss here: Semantic Chunking - YouTube

I am using the OpenAI API for chat completions. I am also using Weaviate as my vector store.
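
In broad strokes, the storage side looks something like this (a simplified illustration with the v3 Weaviate Python client; the class name, URL, and text are examples, not my actual setup):

```python
# Simplified illustration: embed a chunk with OpenAI, store it in Weaviate (v3 client).
import openai
import weaviate

client = weaviate.Client("http://localhost:8080")  # example local instance

chunk = "…one chunk of the source document…"       # placeholder text
embedding = openai.Embedding.create(
    model="text-embedding-ada-002", input=chunk
)["data"][0]["embedding"]

client.data_object.create(
    data_object={"text": chunk},
    class_name="DocumentChunk",                    # example class name
    vector=embedding,
)
```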

2 Likes

Thank you for sharing these resources! Helped me out a ton!!!

1 Like

Thank you for the opportunity to be a part of something extraordinary. I’ll continue to study the required material to help create new content.

I like the general approach you’ve outlined in your video.

1 Like

The key for me was getting a clear overview of what I was trying to accomplish. Yes, I want to be able to create chat completion calls using my own data. But, what are the steps to get there? Everybody tells you, “You need to use embeddings!” But, what are embeddings, and how do you use them? And why? I just got more and more confused until I took the time to understand the entire process of getting from step A to the end.

This was the first flowchart I ever saw that helped me to understand that process.

And this was the video: https://youtu.be/Ix9WIZpArm0

For me, once I understood the overall process, and then each step in that process, the rest was easy. Well, maybe not easy, but it certainly made a lot more sense.
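
If it helps, the tail end of that process (embed the question, find the closest stored chunk, hand it to the chat model) looks roughly like the following; an in-memory search stands in for the vector database here, and all the text is placeholder:

```python
# Rough outline: embed the question, pick the most similar stored chunk,
# then answer with that chunk as context. Placeholder text throughout.
import numpy as np
import openai

def embed(text):
    return np.array(openai.Embedding.create(
        model="text-embedding-ada-002", input=text
    )["data"][0]["embedding"])

chunks = ["…chunk one…", "…chunk two…"]       # placeholder document chunks
chunk_vectors = [embed(c) for c in chunks]

question = "What is this document about?"     # placeholder question
q = embed(question)
scores = [v @ q / (np.linalg.norm(v) * np.linalg.norm(q)) for v in chunk_vectors]
best_chunk = chunks[int(np.argmax(scores))]

answer = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user",
               "content": f"Answer using this context:\n\n{best_chunk}\n\nQuestion: {question}"}],
)
print(answer["choices"][0]["message"]["content"])
```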

Welcome!

It takes a month or two, at best, to get your head wrapped around how vector DBs, semantic search, and embeddings all fit into the equation. My head was swirling for the first month that I worked with GPT. At some point it just clicked and I got it.

1 Like