PDF summarizer using the OpenAI API

Hi, I'm trying to develop a PDF text summarizer using the OpenAI API. I tried uploading the PDF, extracting the text from it, and passing it to openai.ChatCompletion. However, I'm getting a max-token-limit error. How do I fix this? Will splitting the PDF into chunks work? Please guide me.

Welcome to the forum!

Yes, splitting the text into chunks and processing them one by one will work.
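For example, a minimal sketch of the splitting step (the 2,000-character chunk size is an arbitrary assumption; pick one that fits the model's context window):

def split_into_chunks(text, chunk_size=2000):
    # naive character-based split; a real splitter would respect
    # sentence or paragraph boundaries
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

Then call the API once per chunk and stitch the partial summaries together.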


But then will the cost increase? I'm using the gpt-3.5-turbo-instruct model. I also need to fine-tune; could you guide me?


Cost is per token sent and received, no matter how many chunks you split the work into (strictly speaking, chunking adds a little overhead from repeated instructions, but for most use cases it is negligible).
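Back-of-the-envelope, the arithmetic looks like this (the per-1K-token prices below are placeholders, not real rates; check the current pricing page):

# hypothetical prices per 1K tokens -- substitute the current published rates
INPUT_PRICE_PER_1K = 0.0015
OUTPUT_PRICE_PER_1K = 0.002

def estimate_cost(input_tokens, output_tokens):
    # cost depends only on total tokens, not on how many requests carried them
    return (input_tokens / 1000) * INPUT_PRICE_PER_1K \
         + (output_tokens / 1000) * OUTPUT_PRICE_PER_1K

The only chunking overhead is the system prompt and instructions repeated in each request.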

So sorry to bother you again; here is my function:
def generate_summarizer(max_tokens, temperature, top_p, frequency_penalty, document_text, prompt_text):
    res = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        max_tokens=max_tokens,
        temperature=temperature,
        top_p=top_p,
        frequency_penalty=frequency_penalty,
        messages=[
            {"role": "system", "content": "You are a helpful assistant for text summarization."},
            {"role": "user", "content": f"Summarize the document: Prompt: {prompt_text}"},
        ],
    )
    return res["choices"][0]["message"]["content"]

This is my code. How do I pass the document text into this function as chunks? I know how to split the text into chunks, but I'm clueless about how to pass them into this function, since the chunks come back as a list.

What's the value of max_tokens? Note that max_tokens caps only the completion (output) tokens; it's the model's context length that bounds input and output tokens combined.

In your example you only use prompt_text; document_text is never accessed.
If the total length of document_text exceeds the context window, you can split it into smaller chunks so that each one fits into a single request.

However, I would suggest you consider the Assistants API instead of Chat Completions, because with an Assistant you can upload your file and use the retrieval tool. It's most likely not cheaper, but it is definitely simpler to handle.
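A rough sketch with the v1 Python SDK, assuming the retrieval tool and file-upload flow from the Assistants docs (the assistant name and file path are placeholders):

from openai import OpenAI

client = OpenAI()

# upload the PDF once; retrieval handles chunking and search internally
file = client.files.create(file=open("uploads/document.pdf", "rb"), purpose="assistants")

assistant = client.beta.assistants.create(
    name="pdf-summarizer",  # placeholder name
    instructions="You are a helpful assistant for text summarization.",
    model="gpt-4-1106-preview",
    tools=[{"type": "retrieval"}],
    file_ids=[file.id],
)

Note that the rest of this thread uses the older openai.ChatCompletion interface; don't mix the two styles in one script.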

Hi, when I'm splitting document_text into chunks, do I need to embed it or anything? I'm confused about how to send a request using the chunks I have.

Chunking creates a list; loop through this list and send one request per chunk.


Just send the text.

messages = [{"role": "system", "content": "You are a helpful assistant for text summarization."}]
# document_chunks
doc_chunks = split_document_to_chunk(document_path, chunk_size, overlap)
for i, chunk in enumerate(doc_chunks):
    processed_chunk = f"[{i}]\n\n{chunk}"
    # each chunk must be wrapped in a message dict, not appended as a raw string
    messages.append({"role": "user", "content": processed_chunk})

# tada! now add this `messages` to the chat completion call

Are you trying to build a RAG-based application? Looks like you are lost.


You send a chunk the exact same way you would send the full text.

Embeddings are usually used so that we can retrieve chunks of text for a retrieval-augmented generation (RAG) application. For example, given user query A, I want to find documents related to it. This process of "finding documents related to it" is done by comparing the embedding of query A against the embeddings of your repository of documents.
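For reference, that comparison can be sketched with the same pre-v1 openai library used elsewhere in this thread (embed and most_similar are illustrative helpers, not library functions):

import numpy as np
import openai

def embed(texts):
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return [np.array(d["embedding"]) for d in resp["data"]]

def most_similar(query, chunks):
    query_emb = embed([query])[0]
    chunk_embs = embed(chunks)
    # ada-002 embeddings are unit length, so a dot product is cosine similarity
    scores = [float(query_emb @ emb) for emb in chunk_embs]
    return chunks[int(np.argmax(scores))]

But again, for plain summarization you don't need any of this.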

From what you've described, your scenario is much simpler: you're just summarizing the text of a given PDF. If the length of your PDF exceeds the context window of the model, you can chunk it into smaller parts and ask the LLM to summarize each part, as sketched below.
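Concretely, the per-chunk pass can look like this (summarize_chunk is a made-up helper; chunks is whatever your splitter produced):

import openai

def summarize_chunk(chunk):
    res = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant for text summarization."},
            {"role": "user", "content": f"Summarize the following text:\n\n{chunk}"},
        ],
    )
    return res["choices"][0]["message"]["content"]

partial_summaries = [summarize_chunk(c) for c in chunks]
# optional final pass: summarize the concatenated partial summaries
final_summary = summarize_chunk("\n\n".join(partial_summaries))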

As a first step, try increasing the max_tokens parameter as others have suggested, and also check the token length of your document. Consider looking at this other post: Counting tokens for chat API calls (gpt-3.5-turbo)
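Counting tokens locally is easy with tiktoken, which is what that post describes (a minimal sketch; document_text is your extracted PDF text):

import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

def count_tokens(text):
    return len(enc.encode(text))

print(count_tokens(document_text))  # compare against the model's context length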


Hi @jonah_mytzuchi @TonyAIChamp @cyzgab, thanks for the suggestions. I tried using them but got this error:
InvalidRequestError: This model's maximum context length is 4097 tokens. However, your messages resulted in 11007 tokens. Please reduce the length of the messages.

import streamlit as st
import openai
import os
from dotenv import load_dotenv
import fitz  # PyMuPDF
from langchain.text_splitter import CharacterTextSplitter

load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")

def extract_text_from_pdf(pdf_path):
    doc = fitz.open(pdf_path)
    text = ""
    for page_num in range(doc.page_count):
        page = doc[page_num]
        text += page.get_text()
    doc.close()
    return text

document_text = extract_text_from_pdf("uploads/9-23.pdf")

text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=2000,
    chunk_overlap=0,
    length_function=len,
)
texts = text_splitter.split_text(document_text)

def calculate_tokens(document_text, prompt_text):
    return len(openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=[
        {"role": "system", "content": "You are a helpful assistant for text summarization."},
        {"role": "user", "content": f"Summarize the document: {document_text}\nPrompt: {prompt_text}"},
    ])["choices"][0]["message"]["content"].split())

def generate_summarizer(max_tokens, temperature, top_p, frequency_penalty, document_text, prompt_text):
    res = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-instruct",
        max_tokens=max_tokens,
        temperature=temperature,
        top_p=top_p,
        frequency_penalty=frequency_penalty,
        messages=[
            {"role": "system", "content": "You are a helpful assistant for text summarization."},
            {"role": "user", "content": f"Summarize from the document {document_text}: Prompt: {prompt_text}"},
            {"role": "assistant", "content": "Sure, let me summarize the portion you specified in the prompt."},
        ],
    )
    return res["choices"][0]["message"]["content"]

messages = [{"role": "system", "content": "You are a helpful assistant for text summarization."}]
for i, chunk in enumerate(texts):
    processed_chunk = f"[{i}]\n\n{chunk}"
    messages.append(processed_chunk)

st.title("GPT-3.5 Text Summarizer")
pdf_path = "uploads/9-23.pdf"
prompt_text = st.text_input("Enter the prompt for summarization:")
col1, col2, col3 = st.columns(3)

with col1:
    token = st.slider("Token", min_value=0.0, max_value=4096.0, value=1024.0, step=1.0)
    temp = st.slider("Temperature", min_value=0.0, max_value=1.0, value=0.0, step=0.01)
    top_p = st.slider("Nucleus Sampling", min_value=0.0, max_value=1.0, value=0.5, step=0.01)
    f_pen = st.slider("Frequency Penalty", min_value=-1.0, max_value=1.0, value=0.0, step=0.01)

with col3:
    with st.expander("Current Parameter"):
        st.write("Current Token :", token)
        st.write("Current Temperature :", temp)
        st.write("Current Nucleus Sampling :", top_p)
        st.write("Current Frequency Penalty :", f_pen)

if st.button("Summarize") and pdf_path:
    # document_text = extract_text_from_pdf(pdf_path)
    max_tokens = calculate_tokens(document_text, prompt_text)
    print(max_tokens)
    summary = generate_summarizer(max_tokens, temp, top_p, f_pen, messages, prompt_text)
    st.write("Summary:")
    st.write(summary)

This is my code. Could you please help me? I'm unsure how to work with the chunks I have, as they still throw the limit error.

I believe gpt-3.5-turbo's context length maxes out at 4K. You might want to try:

  • gpt-3.5-turbo-1106, which goes up to 16K
  • gpt-4-1106-preview, which goes up to 128K

gpt-3.5-turbo-1106 has a 16K context window, and its input tokens are actually cheaper. This model will allow you approximately 5K tokens for your prompts and the analysis of the document you are sending. Fortunately, input tokens are the cheaper direction, so it's a good choice until you move up to gpt-4-1106-preview with its 128K context window, at a higher price.
You can find these numbers here:

Note that you don't need to chunk at all if the entire context fits into the window. Chunking is used to improve retrieval over embeddings and to save on costs when you have very large or many documents; in that case you can leverage vector search to retrieve only the relevant text chunks instead of sending the whole text.
Here is the documentation for embeddings:
https://platform.openai.com/docs/guides/embeddings/what-are-embeddings


A few months ago I created a tool that summarizes documents section by section, where a section is defined by a section header. The tool can technically be used for documents of any size, provided they have identifiable sections. It's still in beta, though, and works better for some documents than for others. The release of GPT-4-turbo has also somewhat decreased the need for such a tool, so enhancing it further has not been a priority for now.

Nonetheless, it's another way to think about summarization, especially for larger documents, and especially when trying to preserve a certain level of detail in the summary and to enable easy reference back to the original document.

I personally find that with some of the other common approaches you often lose important details, arguments, and chains of logic. But of course, it also depends on the use case and what the summary is being created for.


I would work on your prompts for now; the knowledge-file upload and code interpreter are absolute trash right now. Wait until further versions are available. I've completely given up on uploading anything.


Hi, how did you fit large documents within the token limit?

By making them smaller documents?

@pythondev @_j

I did a couple of different things here. Using the document's table of contents as a basis, I identified the individual sections within the document, then extracted and summarized the text from those sections. The individual section summaries, plus the section titles, were then consolidated into the full summary.

Under this approach, most sections normally fall within the token limit.

Where a section's size exceeds a pre-defined threshold, I split the section into multiple parts, summarize these individually, and put them back together. To ensure coherence of the individual summaries within a section, I constructed the prompt so that the summary of the preceding part is taken into account.

It’s designed as a fully automated process.
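A simplified, hypothetical sketch of that "carry the previous summary forward" step (not the actual tool's code):

import openai

def summarize_section_parts(parts):
    summaries = []
    previous = ""
    for part in parts:
        prompt = (
            f"Summary of the preceding part, for context:\n{previous}\n\n"
            f"Summarize the following part so it reads coherently with the above:\n\n{part}"
        )
        res = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": "You are a helpful assistant for text summarization."},
                {"role": "user", "content": prompt},
            ],
        )
        previous = res["choices"][0]["message"]["content"]
        summaries.append(previous)
    return "\n\n".join(summaries)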


Yes! I’ve been telling people this for months! This is how you do it.

Bravo for figuring out how to automate the process. In my experience, dealing with a variety of PDFs, it’s been very difficult finding a methodology that works for every case.
