Hi everyone,
I’m looking for some guidance on improving a workflow where I need to query over 100 PDF reports individually using the OpenAI API. Each PDF ranges from 10 to 100 pages, and I need to run the same set of seven questions against each document. For example: “What is the estimated income for this organization in 2025?”
Current Approach:
- I have a CSV file that lists the file paths for each PDF.
- A Python script extracts the text from the PDFs and sends the extracted text, along with my prompt, to the OpenAI API.
- The responses are saved to another CSV file. I format my prompt so that the response for each PDF is a simple two- or three-column table (the prompt includes something like “Only give me a table with 2 columns, Org Name and Income”). By iterating through all the files, I end up with one combined table. A simplified version of the loop is below.
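For reference, the current loop looks roughly like this (a simplified sketch; pypdf, the model name, and the CSV file names are stand-ins for what I actually use):

```python
import csv

from openai import OpenAI
from pypdf import PdfReader

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder for the real prompt with the seven questions
PROMPT = "Only give me a table with 2 columns, Org Name and Income."


def extract_text(pdf_path: str) -> str:
    """Concatenate the text of every page in the PDF."""
    reader = PdfReader(pdf_path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


with open("pdf_paths.csv", newline="") as infile, \
        open("answers.csv", "w", newline="") as outfile:
    writer = csv.writer(outfile)
    for (pdf_path,) in csv.reader(infile):  # assumes one path per row, no header
        text = extract_text(pdf_path)
        # One synchronous request per PDF: this is the slow, expensive step
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # stand-in for whatever model I actually use
            messages=[{"role": "user", "content": f"{text}\n\n{PROMPT}"}],
        )
        writer.writerow([pdf_path, response.choices[0].message.content])
```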
This works, but it’s becoming expensive and I often run into timeout errors. I’d like to convert this workflow into a more efficient batch process.
Proposed New Approach:
- Modify the Python script so that it still extracts text from the PDFs, but instead of making the API call immediately, it writes the prompt and PDF text for every file into a batch file, i.e. a JSONL file with one request per line along the lines of {"content": "<PDF text> <my prompt>"}.
- After submitting the batch file, I’d run a separate process to retrieve the responses (which would come back in a structured format) and parse everything into a single table. A sketch of what I have in mind follows this list.
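Concretely, here is the shape I’m picturing (a rough sketch, assuming the openai Python SDK’s Batch API and pypdf; my understanding is that each JSONL line needs a custom_id, method, url, and body rather than just the raw content, and the model name, file names, and the collect_results helper are placeholders of my own):

```python
import csv
import json

from openai import OpenAI
from pypdf import PdfReader

client = OpenAI()

# Placeholder for the real prompt with the seven questions
PROMPT = "Only give me a table with 2 columns, Org Name and Income."

# --- Step 1: write one request per PDF into a JSONL batch input file ---
with open("pdf_paths.csv", newline="") as infile, \
        open("batch_input.jsonl", "w") as batch_file:
    for (pdf_path,) in csv.reader(infile):  # assumes one path per row, no header
        reader = PdfReader(pdf_path)
        text = "\n".join(page.extract_text() or "" for page in reader.pages)
        request = {
            # Must be unique within the batch; using the path lets me match
            # each response back to its PDF later
            "custom_id": pdf_path,
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",  # stand-in model name
                "messages": [{"role": "user", "content": f"{text}\n\n{PROMPT}"}],
            },
        }
        batch_file.write(json.dumps(request) + "\n")

# --- Step 2: upload the JSONL file and create the batch job ---
batch_input = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_input.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print("Batch ID:", batch.id)  # save this; the job can take a while to finish


# --- Step 3: run later with the saved batch ID, once the job has completed ---
def collect_results(batch_id: str) -> None:
    """Parse a finished batch's output file into the combined answers table."""
    done = client.batches.retrieve(batch_id)
    if done.status != "completed":
        print("Batch not finished yet:", done.status)
        return
    output = client.files.content(done.output_file_id)
    with open("answers.csv", "w", newline="") as outfile:
        writer = csv.writer(outfile)
        for line in output.text.splitlines():
            result = json.loads(line)
            answer = result["response"]["body"]["choices"][0]["message"]["content"]
            writer.writerow([result["custom_id"], answer])
```

collect_results would be the separate process I mentioned, run once the batch shows as completed.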
Question:
- Does this proposed batch approach make sense as a way to manage cost and reduce timeout errors?
- Are there alternative strategies, such as function calling or a more specific batching methodology, that you have found effective?
- I’m assuming that embeddings or RAG techniques might not be ideal since I don’t want to query all documents collectively; each PDF needs to be queried on its own. Is this assumption correct?
Any advice would be greatly appreciated. Thanks!