GPT-4o can’t read multiple pages of a PDF file correctly

I want to collect all the chapter titles of a report from a PDF file, but it only works when the table of contents is short (mostly within a single page). When the contents continue onto the following pages, it doesn’t read the rest of them.

I tried both the Assistants approach and the embedding approach, but each only read a single page of the contents section. Is my prompt too general? Should I ask in a more detailed way? (The following code is what I tried with the embedding method. I used a simpler, well-known book because the original document is from my work and is not in English.)

# code
from dotenv import load_dotenv
from langchain_community.document_loaders import PyPDFLoader
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
import sys
import json
from IPython.display import display

load_dotenv()

# Helper for pretty-printing API objects as JSON (not used below).
def show_json(obj):
    display(json.loads(obj.model_dump_json()))

file_path = "./ins/harry_potter_and_the_goblet_of_fire.pdf"
loader = PyPDFLoader(file_path)
docs = loader.load()

print("########### docs are successfully loaded ###########")
print(len(docs))

llm = ChatOpenAI(model="gpt-4o")

# Split the pages into chunks, embed them into Chroma, and expose a retriever.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)
vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())

retriever = vectorstore.as_retriever()

system_prompt=(
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. "
    "\n\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)

result = rag_chain.invoke({"input": "Tell me all the chapter in the table of contents part of the book"})

print(result["context"])
print(result["answer"])

# Redirect stdout so the remaining output goes to a results file.
sys.stdout = open('./result/sample_results.txt', 'w')
print(result)
print("#############################################")
for rst in result["context"]:
    print(rst.page_content)
print()
print(result["answer"])

# result["answer"] output:
1. The Dark Mark (Page 117)
2. Mayhem at the Ministry (Page 145)
3. Aboard the Hogwarts Express (Page 158)
4. The Triwizard Tournament (Page 171)
5. Mad-Eye Moody (Page 193)
6. The Unforgivable Curses (Page 209)
7. Beauxbatons and Durmstrang (Page 228)
8. The Goblet of Fire (Page 248)
9. The Four Champions (Page 272)
10. Padfoot Returns (Page 509)
11. The Madness of Mr. Crouch (Page 535)
12. The Dream (Page 564)
13. The Pensieve (Page 581)
14. The Third Task (Page 605)
15. Flesh, Blood, and Bone (Page 636)
16. The Death Eaters (Page 644)
17. Priori Incantatem (Page 659)
18. Veritaserum (Page 670)
19. The Parting of the Ways (Page 692)
20. The Beginning (Page 716)
21. The Weighing of the Wands (Page 228)
22. The Hungarian Horntail (Page 313)
23. The First Task (Page 337)
24. The House-Elf Liberation Front (Page 363)
25. The Unexpected Task (Page 385)
26. The Yule Ball (Page 403)
27. Rita Skeeter’s Scoop (Page 433)
28. The Egg and the Eye (Page 458)
29. The Second Task (Page 479)

Welcome to the Community!

My two cents on this:

If you are using a RAG-based approach that involves chunking the document, then it is quite possible that the table of contents gets split across multiple chunks. In that case, only the first chunk might actually contain the identifier “table of contents” and be returned by retrieval, while the later chunks might not be recognized as part of the table of contents and therefore not be retrieved. To overcome this, you could increase your chunk size, as sketched below.
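
For illustration, here is a minimal variant of your splitting and retrieval setup; the chunk_size, chunk_overlap, and k values are placeholders rather than tuned recommendations, and raising k (the number of retrieved chunks) is an extra tweak on top of the larger chunk size:

# Larger chunks make it more likely the whole table of contents stays in one piece,
# and a higher k passes more chunks to the model as context.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=4000, chunk_overlap=400)
splits = text_splitter.split_documents(docs)
vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())
retriever = vectorstore.as_retriever(search_kwargs={"k": 10})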

Another, simpler way to achieve your goal might be to just pass the model the first ~10 or so pages of the book/document as context (which should include the table of contents) and then ask it to return the information.
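
A rough sketch of that, reusing the docs list your PyPDFLoader already produced (the 10-page cutoff is just an assumption; adjust it to wherever your table of contents ends):

# Skip retrieval entirely: concatenate the first ~10 pages and ask the model directly.
first_pages = "\n\n".join(doc.page_content for doc in docs[:10])
response = llm.invoke(
    "List every chapter title in the table of contents below:\n\n" + first_pages
)
print(response.content)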

Finally, there’s also a dedicated thread discussing options for obtaining a document outline. It might be too specialized for your case, but I’m dropping the link here anyway for reference:
