PDF summarizer using the OpenAI API

Hi, I'm trying to develop a PDF text summarizer using the OpenAI API. I tried uploading the PDF, extracting the text from it, and passing it to openai.ChatCompletion. However, I'm getting a max-token-limit error. How do I fix this? Will splitting the PDF into chunks work? Please guide me.

Welcome to the forum!

Yes, splitting the text into chunks and processing them one by one will work.
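For example, a minimal sketch of the splitting step (the 2,000-character chunk size is an arbitrary assumption; pick one that fits the model's context window):

def split_into_chunks(text, chunk_size=2000):
    # naive character-based split; a real splitter would respect
    # sentence or paragraph boundaries
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

Then call the API once per chunk and stitch the partial summaries together.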


But then will the cost increase? I'm using the gpt-3.5-turbo-instruct model. I also need to fine-tune; could you guide me?


Cost is per token sent and received, no matter how many chunks you split the work into (strictly speaking, chunking adds a little overhead from repeated instructions, but for most use cases it is negligible).
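Back-of-the-envelope, the arithmetic looks like this (the per-1K-token prices below are placeholders, not real rates; check the current pricing page):

# hypothetical prices per 1K tokens -- substitute the current published rates
INPUT_PRICE_PER_1K = 0.0015
OUTPUT_PRICE_PER_1K = 0.002

def estimate_cost(input_tokens, output_tokens):
    # cost depends only on total tokens, not on how many requests carried them
    return (input_tokens / 1000) * INPUT_PRICE_PER_1K \
         + (output_tokens / 1000) * OUTPUT_PRICE_PER_1K

The only chunking overhead is the system prompt and instructions repeated in each request.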

So sorry to bother you again; here is my function:
def generate_summarizer(max_tokens, temperature, top_p, frequency_penalty, document_text, prompt_text):
    res = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        max_tokens=max_tokens,
        temperature=temperature,
        top_p=top_p,
        frequency_penalty=frequency_penalty,
        messages=[
            {"role": "system", "content": "You are a helpful assistant for text summarization."},
            {"role": "user", "content": f"Summarize the document: Prompt: {prompt_text}"},
        ],
    )
    return res["choices"][0]["message"]["content"]

This is my code. How do I pass the document text into this function as chunks? I know how to split the text into chunks, but I'm clueless about how to pass them into this function, since the chunks come back as a list.

What's the value of max_tokens? Note that max_tokens caps only the completion (output) tokens; it's the model's context length that bounds input and output tokens combined.

In your example you only use prompt_text; document_text is never accessed.
If the total length of document_text exceeds the context window, you can split it into smaller chunks so that each one fits into a single request.

However, I would suggest you consider the Assistants API instead of Chat Completions, because with an Assistant you can upload your file and use the retrieval tool. It's most likely not cheaper, but it is definitely simpler to handle.
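A rough sketch with the v1 Python SDK, assuming the retrieval tool and file-upload flow from the Assistants docs (the assistant name and file path are placeholders):

from openai import OpenAI

client = OpenAI()

# upload the PDF once; retrieval handles chunking and search internally
file = client.files.create(file=open("uploads/document.pdf", "rb"), purpose="assistants")

assistant = client.beta.assistants.create(
    name="pdf-summarizer",  # placeholder name
    instructions="You are a helpful assistant for text summarization.",
    model="gpt-4-1106-preview",
    tools=[{"type": "retrieval"}],
    file_ids=[file.id],
)

Note that the rest of this thread uses the older openai.ChatCompletion interface; don't mix the two styles in one script.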

Hi, when I'm splitting document_text into chunks, do I need to embed it or anything? I'm confused about how to send a request using the chunks I have.

Chunking creates a list; loop through this list and send one request per chunk.


Just send the text.

messages = [{"role": "system", "content": "You are a helpful assistant for text summarization."}]
# document_chunks
doc_chunks = split_document_to_chunk(document_path, chunk_size, overlap)
for i, chunk in enumerate(doc_chunks):
    processed_chunk = f"[{i}]\n\n{chunk}"
    # each chunk must be wrapped in a message dict, not appended as a raw string
    messages.append({"role": "user", "content": processed_chunk})

# tada! now add this `messages` to the chat completion call

Are you trying to build a RAG-based application? Looks like you are lost.


You send a chunk the exact same way you would send the full text.

Embeddings are usually used so that we can retrieve chunks of text for a retrieval-augmented generation (RAG) application. For example, given user query A, I want to find documents related to it. This process of "finding documents related to it" is done by comparing the embedding of query A against the embeddings of your repository of documents.
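For reference, that comparison can be sketched with the same pre-v1 openai library used elsewhere in this thread (embed and most_similar are illustrative helpers, not library functions):

import numpy as np
import openai

def embed(texts):
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return [np.array(d["embedding"]) for d in resp["data"]]

def most_similar(query, chunks):
    query_emb = embed([query])[0]
    chunk_embs = embed(chunks)
    # ada-002 embeddings are unit length, so a dot product is cosine similarity
    scores = [float(query_emb @ emb) for emb in chunk_embs]
    return chunks[int(np.argmax(scores))]

But again, for plain summarization you don't need any of this.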

From what you've described, your scenario is much simpler: you're just summarizing the text of a given PDF. If the length of your PDF exceeds the context window of the model, you can chunk it into smaller parts and ask the LLM to summarize each part, as sketched below.
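Concretely, the per-chunk pass can look like this (summarize_chunk is a made-up helper; chunks is whatever your splitter produced):

import openai

def summarize_chunk(chunk):
    res = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant for text summarization."},
            {"role": "user", "content": f"Summarize the following text:\n\n{chunk}"},
        ],
    )
    return res["choices"][0]["message"]["content"]

partial_summaries = [summarize_chunk(c) for c in chunks]
# optional final pass: summarize the concatenated partial summaries
final_summary = summarize_chunk("\n\n".join(partial_summaries))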

As a first step, try increasing the max_tokens parameter as others have suggested, and also check the token length of your document. Consider looking at this other post: Counting tokens for chat API calls (gpt-3.5-turbo)
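Counting tokens locally is easy with tiktoken, which is what that post describes (a minimal sketch; document_text is your extracted PDF text):

import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

def count_tokens(text):
    return len(enc.encode(text))

print(count_tokens(document_text))  # compare against the model's context length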


Hi @jonah_mytzuchi @TonyAIChamp @cyzgab, thanks for the suggestions. I tried using them but got this error:
InvalidRequestError: This model's maximum context length is 4097 tokens. However, your messages resulted in 11007 tokens. Please reduce the length of the messages.

import streamlit as st
import openai
import os
from dotenv import load_dotenv
import fitz  # PyMuPDF
from langchain.text_splitter import CharacterTextSplitter

load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")

def extract_text_from_pdf(pdf_path):
    doc = fitz.open(pdf_path)
    text = ""
    for page_num in range(doc.page_count):
        page = doc[page_num]
        text += page.get_text()
    doc.close()
    return text

document_text = extract_text_from_pdf("uploads/9-23.pdf")

text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=2000,
    chunk_overlap=0,
    length_function=len,
)
texts = text_splitter.split_text(document_text)

def calculate_tokens(document_text, prompt_text):
    return len(openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=[
        {"role": "system", "content": "You are a helpful assistant for text summarization."},
        {"role": "user", "content": f"Summarize the document: {document_text}\nPrompt: {prompt_text}"},
    ])["choices"][0]["message"]["content"].split())

def generate_summarizer(max_tokens, temperature, top_p, frequency_penalty, document_text, prompt_text):
    res = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-instruct",
        max_tokens=max_tokens,
        temperature=temperature,
        top_p=top_p,
        frequency_penalty=frequency_penalty,
        messages=[
            {"role": "system", "content": "You are a helpful assistant for text summarization."},
            {"role": "user", "content": f"Summarize from the document {document_text}: Prompt: {prompt_text}"},
            {"role": "assistant", "content": "Sure, let me summarize the portion you specified in the prompt."},
        ],
    )
    return res["choices"][0]["message"]["content"]

messages = [{"role": "system", "content": "You are a helpful assistant for text summarization."}]
for i, chunk in enumerate(texts):
    processed_chunk = f"[{i}]\n\n{chunk}"
    messages.append(processed_chunk)

st.title("GPT-3.5 Text Summarizer")
pdf_path = "uploads/9-23.pdf"
prompt_text = st.text_input("Enter the prompt for summarization:")
col1, col2, col3 = st.columns(3)

with col1:
    token = st.slider("Token", min_value=0.0, max_value=4096.0, value=1024.0, step=1.0)
    temp = st.slider("Temperature", min_value=0.0, max_value=1.0, value=0.0, step=0.01)
    top_p = st.slider("Nucleus Sampling", min_value=0.0, max_value=1.0, value=0.5, step=0.01)
    f_pen = st.slider("Frequency Penalty", min_value=-1.0, max_value=1.0, value=0.0, step=0.01)

with col3:
    with st.expander("Current Parameter"):
        st.write("Current Token :", token)
        st.write("Current Temperature :", temp)
        st.write("Current Nucleus Sampling :", top_p)
        st.write("Current Frequency Penalty :", f_pen)

if st.button("Summarize") and pdf_path:
    # document_text = extract_text_from_pdf(pdf_path)
    max_tokens = calculate_tokens(document_text, prompt_text)
    print(max_tokens)
    summary = generate_summarizer(max_tokens, temp, top_p, f_pen, messages, prompt_text)
    st.write("Summary:")
    st.write(summary)

This is my code. Could you please help me? I'm unsure how to work with the chunks I have, as they still throw the limit error.

I believe gpt-3.5-turbo's context length maxes out at 4K. You might want to try:

  • gpt-3.5-turbo-1106, which goes up to 16K
  • gpt-4-1106-preview, which goes up to 128K

gpt-3.5-turbo-1106 has a 16K context window, and its input tokens are actually cheaper. This model will allow you approximately 5K tokens for your prompts and the analysis of the document you are sending. Fortunately, input tokens are the cheaper direction, so it's a good choice until you move up to gpt-4-1106-preview with its 128K context window, at a higher price.
You can find these numbers here:

Note that you don't need to chunk at all if the entire context fits into the window. Chunking is used to improve retrieval over embeddings and to save on costs when you have very large or many documents; in that case you can leverage vector search to retrieve only the relevant text chunks instead of sending the whole text.
Here is the documentation for embeddings:
https://platform.openai.com/docs/guides/embeddings/what-are-embeddings


A few months ago I created a tool that summarizes documents section by section, where a section is defined by a section header. The tool can technically be used for documents of any size, provided they have identifiable sections. It's still in beta, though, and works better for some documents than for others. The release of GPT-4-turbo has also somewhat decreased the need for such a tool, so enhancing it further has not been a priority for now.

Nonetheless, it's another way to think about summarization, especially for larger documents, and especially when trying to preserve a certain level of detail in the summary and to enable easy reference back to the original document.

I personally find that with some of the other common approaches you often lose important details, arguments, and chains of logic. But of course, it also depends on the use case and what the summary is being created for.


I would work on your prompts for now; the knowledge-file upload and code interpreter are absolute trash right now. Wait until further versions are available. I've completely given up on uploading anything.


Hi, how did you fit large documents within the token limit?

By making them smaller documents?

@pythondev @_j

I did a couple of different things here. Using the document's table of contents as a basis, I identified the individual sections within the document, then extracted and summarized the text from those sections. The individual section summaries, plus the section titles, were then consolidated into the full summary.

Under this approach, most sections normally fall within the token limit.

Where a section's size exceeds a pre-defined threshold, I split the section into multiple parts, summarize these individually, and put them back together. To ensure coherence of the individual summaries within a section, I constructed the prompt so that the summary of the preceding part is taken into account.

It’s designed as a fully automated process.
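A simplified, hypothetical sketch of that "carry the previous summary forward" step (not the actual tool's code):

import openai

def summarize_section_parts(parts):
    summaries = []
    previous = ""
    for part in parts:
        prompt = (
            f"Summary of the preceding part, for context:\n{previous}\n\n"
            f"Summarize the following part so it reads coherently with the above:\n\n{part}"
        )
        res = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": "You are a helpful assistant for text summarization."},
                {"role": "user", "content": prompt},
            ],
        )
        previous = res["choices"][0]["message"]["content"]
        summaries.append(previous)
    return "\n\n".join(summaries)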


Yes! I’ve been telling people this for months! This is how you do it.

Bravo for figuring out how to automate the process. In my experience, dealing with a variety of PDFs, it’s been very difficult finding a methodology that works for every case.
