Hi everyone, I’ve been trying to automate the inputting of prompts that adhere to the 4096-token limit. I’m doing this mainly for text summarization. What I’d ideally like to do is take a giant body of text, split it into chunks of 4096 tokens, and send each chunk to ChatGPT with a very generic prompt like “I am giving you a large body of text to summarize, just say okay.”
I’ve tried 2 methods and both fail before I can reasonably reach my goal.
Method 1: Using a chrome extension
OpenAI recommends gpt-3-encoder, which is a no-go: it’s a pain to import Node modules into a Chrome extension, and even after succeeding at that I found that gpt-3-encoder uses the file system, which Chrome extensions are never allowed to access. This simply doesn’t work.
Method 2: Using python
With Python I can read a large body of text from a file and split it into chunks that accurately abide by the 4096-token limit (tiktoken is awesome). BUT I can’t automate inputting the chunked text into OpenAI: using Selenium, OpenAI detects that I’m using an automated browser and won’t let me sign in. I am again stuck.
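For reference, the chunking half of this can be sketched as below. The `chunk_tokens` helper is a hypothetical name, and the commented usage assumes the `tiktoken` package is installed:

```python
def chunk_tokens(tokens, max_tokens):
    """Split a flat token list into consecutive chunks of at most max_tokens."""
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]

# With tiktoken (assumed installed via `pip install tiktoken`):
# import tiktoken
# enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
# tokens = enc.encode(big_text)
# chunks = [enc.decode(c) for c in chunk_tokens(tokens, 3000)]
```

Decoding each token slice back to text keeps chunk sizes exact, since splitting on characters or words only approximates token counts.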
Are there any reasonable ways to automate inputting large bodies of text? I think this is a reasonable use case that would be generally useful, but I can’t think of an easy way to do it. Any thoughts or insights would be appreciated!
Smaller chunks allow for more understanding per chunk but increase the risk of splitting contextual information. Say you split a dialog or topic in half when chunking for summarization. If the contextual information from that dialog or topic is small or hard to decipher per chunk, the model might not include it at all in the summary of either chunk: you’ve taken an important part of the overall text and split its contextual information in half, reducing the model’s likelihood of considering it important. On the other hand, you might produce two chunk summaries each dominated by that one dialog or topic.
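One common mitigation for that boundary problem (a sketch, not something from the posts above) is to let adjacent chunks overlap by a fixed number of tokens, so a dialog cut at one boundary still appears whole in at least one chunk. `chunk_with_overlap` is a hypothetical helper:

```python
def chunk_with_overlap(tokens, size, overlap):
    """Consecutive chunks of `size` tokens, each sharing `overlap` tokens
    with the previous chunk, so boundary context appears intact somewhere."""
    step = size - overlap
    chunks = []
    for i in range(0, len(tokens), step):
        chunks.append(tokens[i:i + size])
        if i + size >= len(tokens):  # last chunk reached the end; stop
            break
    return chunks
```

The overlap costs extra tokens (and therefore extra calls), so it trades money for a lower chance of losing split context.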
gpt-3.5-turbo has a context length of 4096 tokens, meaning that is the maximum number of tokens it can process (read + generate). If you send it 4096 tokens (original text + prompt), there are none left for generation.
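Concretely, that means the chunk size must leave headroom: if the prompt wrapper costs some tokens and you want a summary back, the text itself has to be smaller than 4096. A sketch of the arithmetic (the overhead and output numbers are illustrative assumptions, not values from the posts):

```python
CONTEXT_LIMIT = 4096  # gpt-3.5-turbo: shared budget for input + output

def max_chunk_tokens(prompt_overhead, reserve_for_output):
    """Largest text chunk that still leaves room for the prompt and the reply."""
    return CONTEXT_LIMIT - prompt_overhead - reserve_for_output

# e.g. ~50 tokens of instructions and up to 500 tokens of summary back
# leaves max_chunk_tokens(50, 500) == 3546 tokens for the text itself.
```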
Summarizing text whose token count is near or above the context length is a limitation that persists even now.
Many tools claim to solve this problem, but there is no perfect solution as of now, short of models with a high context length like gpt-4-32k, which hasn’t even been released to all gpt-4 users.
One way is to condense smaller, manageable chunks of text and then combine them.
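That condense-then-combine idea can be sketched as a map-reduce over chunks. The orchestration below is generic; the commented `summarize` implementation is an assumption using the 2023-era `openai` ChatCompletion interface, not code from this thread:

```python
def map_reduce_summary(chunks, summarize):
    """Summarize each chunk independently ("map"), then summarize the
    concatenated partial summaries ("reduce") into one final summary."""
    partials = [summarize(chunk) for chunk in chunks]
    return summarize("\n\n".join(partials))

# A possible `summarize` using the OpenAI API (assumed openai<1.0 style):
# import openai
# def summarize(text):
#     resp = openai.ChatCompletion.create(
#         model="gpt-3.5-turbo",
#         messages=[{"role": "user", "content": "Summarize:\n" + text}],
#     )
#     return resp.choices[0].message.content
```

For book-length inputs the reduce step may itself exceed the limit, in which case it can be applied recursively over the partial summaries.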
I’d imagine this would get expensive if I wanted to summarize whole books. I have ChatGPT Plus, so I can use the normal chat as much as I like; I just need to automate putting the chunks into the actual text box.
At the time of input, I actually don’t ask it to summarize anything; I just tell ChatGPT that I’m giving it a large body of text and to say okay. I did an experiment where I split the first chapter of Harry Potter (about 6471 tokens) into 2 chunks, each near the 4096-token limit, and it admirably generated the summary AFTER it had received both parts:
> In Chapter One of “Harry Potter and the Sorcerer’s Stone,” the Dursley family is introduced as a seemingly normal and mundane family. Mr. and Mrs. Dursley, along with their son Dudley, fear that their secret connection to the Potters will be discovered. Strange occurrences, such as a cat reading a map and peculiarly dressed people, unsettle Mr. Dursley. He overhears a conversation mentioning the Potters and becomes increasingly concerned. Meanwhile, Professor Dumbledore arrives in the neighborhood and discusses the recent events with Professor McGonagall, including Voldemort’s disappearance and the rumors of the Potters’ deaths. Dumbledore reveals that Harry Potter, the Potters’ son, is alive and plans to bring him to live with his aunt and uncle. Hagrid arrives with baby Harry and hands him over to Dumbledore. Dumbledore places Harry on the doorstep of the Dursleys’ house and leaves. The chapter ends with Harry, still unaware of his fame as the “boy who lived,” sleeping peacefully.
You have a valid point, so I tried the experiment again with the following article, written in April 2023:
(Apparently I can’t send links so just search the putin myth journal of democracy on google)
Again, the summary was quite good. I believe that despite the token limit, ChatGPT is somehow able to understand the input if it is chunked accordingly:
> Post-2012, Putin’s regime aimed to demobilize liberal elements in Russia, resorting to social control and repression, and using a legitimacy narrative of Russia under siege from the West. Liberal ideas and calls for free elections were cast as foreign influence, with many US and European organizations being deemed undesirable in Russia.
>
> Anti-Putin demonstrations and civil liberties were targeted, using social wedge issues such as religion and gay rights, notably exemplified by the Pussy Riot case and resulting laws. Putin, despite personal history, became a champion of traditional family values, contrasting Russian identity with the West. Elections became heavily controlled, the media was restricted, and the idea of an external enemy was promoted.
>
> Putin’s war in Ukraine exposed the myth of autocratic competence, with significant Russian casualties and economic recession. Russians, especially the younger generation, became international pariahs, with many fleeing to other countries. As Putin’s image of competence faded, he resorted to increased fear and force, imposing severe penalties for dissent.
>
> Despite these circumstances, it appears that Putin still retains considerable support, although there is a generation gap, with younger citizens less supportive. Popular sentiment against the war and the government is growing, as evidenced by declining military approval ratings and an increase in negative feelings. However, the regime’s response to this is likely to be more repression.
>
> The case of Russia under Putin demonstrates how autocracies can shift from attempting to project an image of competence and legitimacy to relying on violent repression, particularly in the face of poor governance and international conflicts.
How to Summarize a PDF file with ChatGPT (70 000+ Words)
But it doesn’t seem like you actually show the code for how you accomplished this. Are you using the OpenAI API behind the scenes to submit completion requests?
The key for me was getting a clear overview of what I was trying to accomplish. Yes, I want to be able to create chat completion calls using my own data. But, what are the steps to get there? Everybody tells you, “You need to use embeddings!” But, what are embeddings, and how do you use them? And why? I just got more and more confused until I took the time to understand the entire process of getting from step A to the end.
This was the first flowchart I ever saw that helped me to understand that process.
For me, once I understood the overall process, and then each step in that process, the rest was easy. Well, maybe not easy, but it certainly made a lot more sense.
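The embeddings step in that process boils down to: embed each chunk once, embed the question, and retrieve the chunks closest to the question by cosine similarity, which then get stuffed into the chat prompt. A minimal sketch of the retrieval part in plain Python; the tiny 2-D vectors in the comments are illustrative, since real vectors would come from an embeddings API:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, doc_vecs, k=3):
    """Indices of the k stored chunk vectors most similar to the query."""
    order = sorted(range(len(doc_vecs)),
                   key=lambda i: cosine(query_vec, doc_vecs[i]),
                   reverse=True)
    return order[:k]

# e.g. top_k(embed(question), [embed(c) for c in chunks], k=3)
# returns the indices of the three most relevant chunks.
```

A vector database does the same nearest-neighbor lookup, just at a scale and speed where a linear scan like this stops being practical.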
It takes a month or two, at best, to get your head wrapped around how vector DBs, semantic search, and embeddings all fit into the equation. My head was swirling for the first month I worked with GPT. At some point it just clicked and I got it.