Evaluating multiple PDF documents using a batch process

Hi everyone,

I’m looking for some guidance on improving a workflow where I need to query over 100 PDF reports individually using the OpenAI API. Each PDF ranges from 10 to 100 pages, and I need to run the same set of seven questions against each document. For example: “What is the estimated income for this organization in 2025?”

Current Approach:

  1. I have a CSV file that lists the file paths for each PDF.
  2. A Python script extracts the text from the PDFs and sends the extracted text, along with my prompt, to the OpenAI API.
  3. The responses are saved to another CSV file. I format my prompts so that each response comes back as a simple two- or three-column table (my prompt includes something like "Only give me a table with 2 columns, Org Name and Income, etc.") for each PDF. By iterating through all the files, I end up with one combined table.

This works, but it’s becoming expensive and I often run into timeout errors. I’d like to convert this workflow into a more efficient batch process.

Proposed New Approach:

  • Modify the Python script so that it still extracts text from the PDFs, but instead of making the API call immediately, it writes the prompt and PDF text into a batch file (e.g., a file of JSON lines like {"content": "<PDF text> <my prompt>"}); a rough sketch is below.
  • After sending the batch file, I'd run a separate process to retrieve the responses (which would be in a structured format) and parse everything into a table.
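A minimal sketch of that first step, assuming pypdf for text extraction; the file names (pdf_paths.csv, batch_requests.jsonl) and the model are placeholders. Note that each output line would need to follow the Batch API's JSONL request format (custom_id, method, url, body) rather than the bare {"content": ...} shape above:

```python
import csv
import json

from pypdf import PdfReader  # assumption: pypdf (or similar) for text extraction

# Placeholder for the real prompt containing all seven questions.
PROMPT = "Only give me a table with 2 columns, Org Name and Income etc."

def pdf_to_text(path: str) -> str:
    """Extract plain text from every page of a PDF."""
    return "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)

# pdf_paths.csv: one PDF path per row (hypothetical file name).
with open("pdf_paths.csv", newline="") as paths_file, \
        open("batch_requests.jsonl", "w") as batch_file:
    for row in csv.reader(paths_file):
        pdf_path = row[0]
        request = {
            "custom_id": pdf_path,       # used to match each response back to its PDF
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",  # assumption: any chat model the Batch API supports
                "messages": [
                    {"role": "user", "content": f"{pdf_to_text(pdf_path)}\n\n{PROMPT}"}
                ],
            },
        }
        batch_file.write(json.dumps(request) + "\n")
```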

Question:

  • Does this proposed batch approach make sense to manage cost and reduce timeout issues?
  • Are there alternative strategies (maybe function calling, or a more specific batching methodology) that others have found effective?
  • I’m assuming that embeddings or RAG techniques might not be ideal since I don’t want to query all documents collectively; each PDF needs to be queried on its own. Is this assumption correct?

Any advice would be greatly appreciated. Thanks!

Hi!

There are no shortcuts in terms of data that needs to be processed. As the models are stateless, they need to be fed everything required for context every time.

That being said, with 100 documents of up to 100 pages each, it should be fairly simple to implement:

START:
Read the next PDF from the directory list
Convert the PDF to text
Append prompt 1 to the text
Send the appended prompt and text to the AI for evaluation
Write the evaluation result to file PDF-prompt-1-evaluation.txt
Append prompt 2 to the text
Send the appended prompt and text to the AI for evaluation
Write the evaluation result to file PDF-prompt-2-evaluation.txt
# do the above for all 7 prompts
Move to the next file in the directory list
Are there more files to process?
Yes → go to START
No → FINISH
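In rough Python, that loop might look like the sketch below, assuming pypdf for the text extraction; the "reports" directory, the model, and the prompt list are placeholders:

```python
from pathlib import Path

from openai import OpenAI
from pypdf import PdfReader  # assumption: any PDF-to-text library would do

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPTS = [
    "What is the estimated income for this organization in 2025?",
    # ... the other six questions go here
]

for pdf_path in Path("reports").glob("*.pdf"):  # read PDFs from a directory
    # Convert the PDF to text
    text = "\n".join(page.extract_text() or "" for page in PdfReader(pdf_path).pages)
    for i, prompt in enumerate(PROMPTS, start=1):
        # Append the prompt to the text and send both for evaluation
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # assumption: use whichever model you normally use
            messages=[{"role": "user", "content": f"{text}\n\n{prompt}"}],
        )
        # Write the evaluation result to <pdf>-prompt-<n>-evaluation.txt
        out_path = Path(f"{pdf_path.stem}-prompt-{i}-evaluation.txt")
        out_path.write_text(response.choices[0].message.content)
```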

Send these requests to the Batch API endpoint, since the work is not time critical, and save 50% on the cost.

Thanks, I think that is what I am doing already, although I am asking all 7 questions in each prompt and then getting a table back with 7 rows every time I make the API call.

I suppose I want to find out if it would be better to do this as a batch process rather than doing 100 API calls.

You still need to make all 700 API calls, one for each prompt on each of the 100 files, but if you send them to the Batch API endpoint (which is usually very quick, but can take up to 24 hours) you will save 50%.
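For reference, the submit-and-retrieve side of that with the official Python SDK might look roughly like this, reusing the batch_requests.jsonl file from earlier in the thread (in practice you would poll or come back later rather than retrieve immediately):

```python
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1. Upload the JSONL file of requests.
batch_input = client.files.create(
    file=open("batch_requests.jsonl", "rb"),
    purpose="batch",
)

# 2. Create the batch job against the chat completions endpoint.
batch = client.batches.create(
    input_file_id=batch_input.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

# 3. Later, check the status and download the results once completed.
batch = client.batches.retrieve(batch.id)
if batch.status == "completed":
    output = client.files.content(batch.output_file_id).text
    for line in output.splitlines():
        result = json.loads(line)
        answer = result["response"]["body"]["choices"][0]["message"]["content"]
        print(result["custom_id"], answer)  # parse into your combined CSV here
```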


Hi,

I am new to using all this and I have been trying to create an assistant in the playground. Now that I have created it, how do I copy the code for this assistant so I can run it locally?