What is the best way to parse a PDF file with ChatGPT?

I need to send various PDF files to ChatGPT and ask it to answer questions about the content in each file. Is there currently a way to do that only using OpenAI tools and/or APIs?


I wouldn’t let ChatGPT analyse the PDF directly. I’m guessing you are using the (Chat) API, and I would recommend using a PDF-to-text tool to pre-process the PDF and then create a Markdown or TXT file. These are easier to digest. PDFs sometimes have a very complex structure, and the text can come out in the wrong order. With a text-based format, you can at least check whether the structure is correct.

We work on a lot of data-intensive projects, and we always say “garbage in, garbage out”… Keep your input as clean as possible. There may also be a lot of text that is not relevant to the model, and you are paying for every page it appears on. I am looking at page footers / headers here :wink:

Maybe something like this can help; then feed the text to an assistant?

import cv2
import numpy as np
from pdf2image import convert_from_path
import pytesseract
from PIL import Image
from IPython.display import display  # To display images in Jupyter notebook/Colab
import os

# Assuming Tesseract OCR is already installed
# If not, install it using: !apt install tesseract-ocr

# Check for GPU availability for OpenCV
use_gpu = cv2.cuda.getCudaEnabledDeviceCount() > 0

# Define the path to your PDF file
pdf_path = '/content/1936-1942 Chevrolet Parts Book.pdf'  # Replace with the path to your PDF

# Create directories for saving output
os.makedirs('/content/batch_texts', exist_ok=True)
os.makedirs('/content/batch_images', exist_ok=True)

# Function to preprocess an image with OpenCV
def preprocess_image(image):
    image_cv = np.array(image)  # pdf2image returns PIL images in RGB order
    if use_gpu:
        # Upload image to GPU
        image_gpu = cv2.cuda_GpuMat()
        image_gpu.upload(image_cv)
        # Convert to grayscale
        gray_gpu = cv2.cuda.cvtColor(image_gpu, cv2.COLOR_RGB2GRAY)
        # Download image from GPU back to the CPU
        image_cv = gray_gpu.download()
    else:
        # Convert to grayscale
        image_cv = cv2.cvtColor(image_cv, cv2.COLOR_RGB2GRAY)
    return Image.fromarray(image_cv)

# Function to process a batch of pages as images
def process_batch(start, end, batch_number):
    # Convert a range of pages to images
    images = convert_from_path(pdf_path, first_page=start, last_page=end, dpi=200)

    # Perform OCR on each image after preprocessing
    for i, image in enumerate(images):
        # Preprocess the image
        image = preprocess_image(image)

        # Perform OCR using pytesseract
        text = pytesseract.image_to_string(image)

        # Save the text in a file
        text_file_path = f'/content/batch_texts/batch_{batch_number}_page_{start + i}.txt'
        with open(text_file_path, 'w', encoding='utf-8') as file:
            file.write(text)

        # Save the image
        image_file_path = f'/content/batch_images/batch_{batch_number}_page_{start + i}.png'
        image.save(image_file_path)

        # Display the image inline
        display(image)

    # Clear the images list to free up memory
    del images

# Define the size of each batch
batch_size = 10  # Process 10 pages at a time, adjust based on your environment's capability

# Calculate the number of batches needed
total_pages = 20  # Set to the actual page count of your PDF (e.g. via pdf2image's pdfinfo_from_path)
batches = (total_pages + batch_size - 1) // batch_size

# Process each batch
for batch in range(batches):
    start_page = batch * batch_size + 1
    end_page = min(start_page + batch_size - 1, total_pages)
    process_batch(start_page, end_page, batch)
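Once the text files exist, they still need to fit in the model's context window. A sketch of a chunking helper (the API call is shown commented and assumes the openai>=1.0 Python SDK; the model name is an example, not a recommendation):

```python
# Sketch: split extracted OCR text into roughly equal chunks on paragraph
# boundaries, so each request to the model stays under the context limit.
def chunk_text(text: str, max_chars: int = 8000) -> list[str]:
    chunks, current, size = [], [], 0
    for para in text.split("\n\n"):
        if size + len(para) > max_chars and current:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(para)
        size += len(para) + 2
    if current:
        chunks.append("\n\n".join(current))
    return chunks

# Feeding one chunk to the model might look like:
# from openai import OpenAI
# client = OpenAI()
# resp = client.chat.completions.create(
#     model="gpt-4o-mini",
#     messages=[{"role": "user",
#                "content": f"Answer from this excerpt:\n\n{chunk_text(text)[0]}"}],
# )
```

Chunking by paragraph keeps related sentences together, which matters when the question spans a whole section.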

Hi @matthewcummings516! Are you looking to just query one PDF at a time or multiple files? Are you intending to ask very specific questions or are you looking at more complex questions that perhaps require analysis of content from across different sections of the document? I’m happy to share a few perspectives and suggestions based on my experiences depending on your specific needs.

Multiple files, very specific questions. I’ve been considering Anthropic because of its token limit but the OpenAI playground seems promising, specifically the Assistant functionality.

My PDF files are text-heavy. I’ve written/worked with PDF parsers in the past; they’re a bit of a pain… because, as you mentioned, PDFs can be quite complex.


OK. Yes, the Assistant Playground is your best bet in that case. It should work fairly well with specific questions. Custom GPTs are also an option, as you can upload multiple files there under the Configure tab. I have not compared their performance in accurately responding to questions. Bear in mind that the way you phrase the question(s) can have a significant impact on how well it is answered - so a bit of trial and error initially can be helpful to see what works and what doesn’t. You can also consider providing some instructions in the Assistant Playground (or for the GPTs) on how the model should approach the Q&A.

Hey - that’s an interesting topic for many! There is an attempt by this YouTube user that shows quite some interesting results using the Assistants API: https://youtu.be/sNs6kGgoakc?si=ZhgVxSlwNSesDvf- (sorry, I have to copy the link - the forum won’t let me paste it any other way). I’ll try the playground as well. Maybe we can keep updates here in this thread. Greetings, Heinz

I used to be able to get this to work for reading in PDFs and searching them. But as of April 30th it looks like it’s not as robust as it was in late Feb/early March. I had some assistants that were data sleuths, but I’m getting “too many words” now.

I’m going to see what I can do to read in the data using a schema, but would love some ideas if anyone has any!

The simplest solution is to convert the PDF into images, and then use the vision capability: https://platform.openai.com/docs/guides/vision

You can test the results in the playground to see if they are suitable for your case.
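For the vision route, each page image has to be base64-encoded into a data URL before it goes into the message. A sketch, assuming the openai Python SDK; the file name and model name are placeholder examples:

```python
import base64

def image_to_data_url(path: str) -> str:
    """Encode a page image as a base64 data URL for the vision API."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return f"data:image/jpeg;base64,{b64}"

# Sending one page to the model might look like:
# from openai import OpenAI
# client = OpenAI()
# resp = client.chat.completions.create(
#     model="gpt-4o",
#     messages=[{"role": "user", "content": [
#         {"type": "text", "text": "What does this page say?"},
#         {"type": "image_url",
#          "image_url": {"url": image_to_data_url("page-0.jpg")}},
#     ]}],
# )
```

Keeping the density at 300 DPI, as in the ImageMagick command below, helps the model read small print, at the cost of larger payloads.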

To convert a PDF into images with ImageMagick, the command is as simple as:

convert -density 300 input.pdf -background white -alpha remove -alpha off page-%d.jpg

It’s also easy to limit the number of pages to, say, 10 pages, with:

convert -density 300 'input.pdf[0-9]' -background white -alpha remove -alpha off page-%d.jpg

gpt-4o-mini’s ability to understand the content (both text and images) is great - more than I needed for my use case: letting our users upload any kind of PDF and get a draft to work on.
