OpenAI API and PDF Table of Contents from keywords

I have a set of PDFs, each comprising over 100 pages. My objective is to leverage the OpenAI API to extract only the Table of Contents from these PDFs, indicating the page range where each keyword is present.

To illustrate, if a keyword is located on page 65 of a PDF, the extracted Table of Contents should precisely detail the range of pages where this keyword appears. My focus is solely on extracting the Table of Contents information.

I am in search of guidance on how to effectively implement the OpenAI API in Python for this specific task. If any community member has experience with similar projects, or if you can provide valuable code snippets, examples, or general advice, your input would be immensely valuable.


To clarify: in your example the keyword is on page 65, and it is not a word in the Table of Contents itself, right?

If it is not, then you’re trying to build a list of tuples in Python, where the list holds the Table of Contents headings and their start pages?
e.g.
Table of Contents
Heading 1 pg10
Heading 2 pg20
Heading 3 pg70

So the keyword on page 65 would fall under Heading 2. The script would take the page number where the keyword occurs and compare it to the start pages of the headings.

You will have to make sure the Table of Contents in each of your PDFs is titled, or otherwise has a way to identify where it starts and ends. The question is why use a GPT at all: is this a service for others? It can all be done in Python. If it is attached to a custom GPT, just have the user upload the file through the GPT and process it in the app.
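For the plain-Python route, here is a minimal sketch of turning Table of Contents lines into (heading, start_page) tuples. It assumes the lines look like the example above ("Heading 1 pg10"). The `TOC_LINE` pattern and the `parse_toc` helper are names made up for illustration, and real PDFs often use dot leaders or bare page numbers, so the pattern would need adjusting.

```python
import re

# Assumed line format, taken from the example above: "<heading text> pg<page>".
# This pattern is a guess and will need tuning for each document layout.
TOC_LINE = re.compile(r"^(?P<heading>.+?)\s+pg\.?\s*(?P<page>\d+)\s*$")

def parse_toc(toc_text):
    """Return a list of (heading, start_page) tuples sorted by start page."""
    entries = []
    for line in toc_text.splitlines():
        match = TOC_LINE.match(line.strip())
        if match:  # lines that do not look like TOC entries are skipped
            entries.append((match.group("heading"), int(match.group("page"))))
    return sorted(entries, key=lambda entry: entry[1])

sample = """Table of Contents
Heading 1 pg10
Heading 2 pg20
Heading 3 pg70"""
toc = parse_toc(sample)
print(toc)
```

The title line is dropped because it has no page marker, so only the three heading entries remain.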

Here is a Python script by GPT-4, though it might need a bit of fixing.

# Example table of contents
# Assuming each tuple contains ('Heading', start_page), sorted by start_page
table_of_contents = [
    ("Heading 1", 10),
    ("Heading 2", 20),
    ("Heading 3", 70),
]

# Function to find the heading for a given page
def find_heading_for_page(page_number):
    current_heading = None
    for heading, start_page in table_of_contents:
        # Stop at the first heading that starts after the page
        if page_number < start_page:
            break
        current_heading = heading
    return current_heading

# Example usage
keyword_page = 65  # Page number where the keyword was found
heading = find_heading_for_page(keyword_page)
print(f"The keyword is under: {heading}")
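Since the original question asks for a page range rather than just a heading, the same idea can be extended as in the sketch below. It assumes each section ends on the page before the next one starts, which is not always true because two sections can share a page. The total page count (120 here) is a made-up value, and `heading_ranges` and `range_for_page` are hypothetical helper names.

```python
from bisect import bisect_right

def heading_ranges(toc, last_page):
    """Return (heading, first_page, last_page) for each TOC entry, inclusive."""
    toc = sorted(toc, key=lambda entry: entry[1])
    ranges = []
    for index, (heading, start) in enumerate(toc):
        # Assumption: a section ends on the page before the next one starts.
        end = toc[index + 1][1] - 1 if index + 1 < len(toc) else last_page
        ranges.append((heading, start, end))
    return ranges

def range_for_page(ranges, page_number):
    """Return (heading, start, end) for the section holding page_number, or None."""
    starts = [start for _, start, _ in ranges]
    position = bisect_right(starts, page_number) - 1
    if position < 0:
        return None  # page is before the first heading, e.g. front matter
    heading, start, end = ranges[position]
    return (heading, start, end) if page_number <= end else None

toc = [("Heading 1", 10), ("Heading 2", 20), ("Heading 3", 70)]
ranges = heading_ranges(toc, last_page=120)  # 120 is a made-up page count
print(range_for_page(ranges, 65))
```

With the example table of contents, page 65 falls in Heading 2, which spans pages 20 to 69.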

You’ll also need a search field. Good luck. It’s important to give clear details of what your goal is; I wasn’t sure whether you’re searching the whole document or just the Table of Contents for the keyword.
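If the goal is to search the whole document, the search step could look roughly like this. The sketch assumes the text of each page has already been extracted into a list of strings, one per page, by whatever PDF library you use. The `pages_with_keyword` helper and the sample strings are made up for illustration. Note that the page numbers printed in a Table of Contents are often offset from the PDF's physical page index, so an offset may be needed before comparing.

```python
def pages_with_keyword(page_texts, keyword):
    """Return 1-based page numbers whose text contains keyword (case-insensitive)."""
    needle = keyword.casefold()
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if needle in text.casefold()
    ]

# Stand-in for text extracted from a real PDF, one string per page.
page_texts = ["Introduction", "Pricing overview", "Appendix: pricing tables"]
hits = pages_with_keyword(page_texts, "pricing")
print(hits)
```

Each page number returned here can then be looked up against the table of contents to find the heading it falls under.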