OpenAI API and PDF Table of Contents from keywords

somesh1 · February 27, 2024, 12:59pm

I have a set of PDFs, each comprising over 100 pages. My objective is to leverage the OpenAI API to extract only the Table of Contents from these PDFs, indicating the page range where each keyword is present.

To illustrate, if a keyword is located on page 65 of a PDF, the extracted Table of Contents should precisely detail the range of pages where this keyword appears. My focus is solely on extracting the Table of Contents information.

I am in search of guidance on how to effectively implement the OpenAI API in Python for this specific task. If any community member has experience with similar projects, or if you can provide valuable code snippets, examples, or general advice, your input would be immensely valuable.

Myango · March 15, 2024, 2:35pm

Let's clarify: in your example you have a keyword on page 65. This is not a word in the Table of Contents itself, right?

If not, are you trying to create an array (a list of tuples in Python) that holds the Table of Contents headings and their start pages?
e.g.
Table of Contents
Heading 1 pg10
Heading 2 pg20
Heading 3 pg70

So the keyword on page 65 would fall under Heading 2. The script would take the page number where the keyword was found and compare it to the start page of each heading.

You will have to make sure the Table of Contents in each of your PDF docs is titled, or has some other way to identify its start and end points. The question is, why use a GPT? Is this a service for others? It can all be done in Python. If it is attached to a custom GPT, just have the user send the upload through the GPT and process it in the app.
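As a rough sketch of the "identify the Table of Contents" step, the lines in the "Heading 1 pg10" format from the example above could be parsed with a regular expression. The pattern and the `parse_toc` name are assumptions for illustration only; a real PDF's TOC will likely need a different pattern.

```python
import re

# Hypothetical pattern: heading text, then "pg" and the start page,
# e.g. "Heading 2 pg20". Adjust it to match the real TOC layout.
TOC_LINE = re.compile(r"^(?P<heading>.+?)\s+pg\.?\s*(?P<page>\d+)\s*$", re.IGNORECASE)


def parse_toc(toc_text):
    """Return a list of (heading, start_page) tuples sorted by start page."""
    entries = []
    for line in toc_text.splitlines():
        match = TOC_LINE.match(line.strip())
        if match:  # lines without a page number, such as the title, are skipped
            entries.append((match.group("heading"), int(match.group("page"))))
    return sorted(entries, key=lambda entry: entry[1])


sample = """Table of Contents
Heading 1 pg10
Heading 2 pg20
Heading 3 pg70"""

print(parse_toc(sample))
```

The result has the same shape as the `table_of_contents` list used in the script in this reply.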

Here is a Python script generated by GPT-4, though it might need a bit of fixing.

# Example table of contents
# Assuming each tuple contains ('Heading', start_page)
table_of_contents = [
    ("Heading 1", 10),
    ("Heading 2", 20),
    ("Heading 3", 70),
]

# Function to find the heading for a given page
def find_heading_for_page(page_number):
    current_heading = None
    for heading, start_page in table_of_contents:
        if page_number < start_page:
            break
        current_heading = heading
    return current_heading

# Example usage
keyword_page = 65  # Page number where the keyword was found
heading = find_heading_for_page(keyword_page)
print(f"The keyword is under: {heading}")
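Since the original question asks for a page range rather than only a heading, the lookup above could be extended along these lines. This is a minimal sketch that assumes each section ends on the page before the next heading starts (a heading that begins mid-page would share that page with the previous section) and that the total page count is known. The names `build_page_ranges` and `find_range_for_page` and the `last_page` value of 120 are hypothetical.

```python
from bisect import bisect_right


def build_page_ranges(table_of_contents, last_page):
    """Turn (heading, start_page) pairs into (heading, start_page, end_page) triples."""
    entries = sorted(table_of_contents, key=lambda entry: entry[1])
    ranges = []
    for index, (heading, start_page) in enumerate(entries):
        if index + 1 < len(entries):
            # Assumption: a section ends one page before the next one starts.
            end_page = entries[index + 1][1] - 1
        else:
            end_page = last_page
        ranges.append((heading, start_page, end_page))
    return ranges


def find_range_for_page(ranges, page_number):
    """Return the (heading, start_page, end_page) triple containing page_number, or None."""
    starts = [start for _, start, _ in ranges]
    index = bisect_right(starts, page_number) - 1
    if index < 0:
        return None  # the page comes before the first heading
    heading, start_page, end_page = ranges[index]
    if page_number > end_page:
        return None  # the page is past the end of the document
    return heading, start_page, end_page


toc = [("Heading 1", 10), ("Heading 2", 20), ("Heading 3", 70)]
page_ranges = build_page_ranges(toc, last_page=120)  # 120 is a made-up page count
print(find_range_for_page(page_ranges, 65))
```

For the keyword on page 65 this yields Heading 2 with the range 20 to 69.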

You'll need a search field. Good luck. It's important to provide good details of what your goal is; I wasn't sure if you're searching the doc or just the Table of Contents for the keyword.
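If the goal is to search the whole document, the search step might look something like the sketch below. It assumes the third-party pypdf package for text extraction (`PdfReader` and `extract_text`); the function names are hypothetical, and scanned PDFs without a text layer would need OCR instead. Note that the PDF page index can differ from the printed page number, so `first_page_number` may need adjusting.

```python
def load_page_texts(pdf_path):
    """Extract the text of every page; assumes the pypdf package is installed."""
    from pypdf import PdfReader  # imported here so the rest works without pypdf

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]


def find_keyword_pages(page_texts, keyword, first_page_number=1):
    """Return the page numbers whose text contains the keyword (case-insensitive)."""
    needle = keyword.casefold()
    return [
        first_page_number + offset
        for offset, text in enumerate(page_texts)
        if needle in text.casefold()
    ]


def collapse_to_ranges(pages):
    """Collapse page numbers into (first, last) runs of consecutive pages."""
    ranges = []
    for page in sorted(set(pages)):
        if ranges and page == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], page)  # extend the current run
        else:
            ranges.append((page, page))  # start a new run
    return ranges


# Made-up page texts standing in for load_page_texts("your_file.pdf")
demo_pages = ["intro", "Revenue grew", "revenue fell", "appendix", "total REVENUE"]
hits = find_keyword_pages(demo_pages, "revenue")
print(hits, collapse_to_ranges(hits))
```

Each page number in `hits` can then be compared against the heading start pages, as in the script above.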
