Error reading pdf file with api (files && completions)

Hello,

I’m using OpenAI’s API (and ChatGPT web) to extract data from PDF invoices, requesting only a JSON output with specific keys, such as invoice number, CIF, total, dates, VAT, etc.

Problem:
When I upload a real PDF invoice and ask ChatGPT to extract the data as JSON, the response contains data that has absolutely nothing to do with the actual document. For example, it returns an invented invoice number, CIF, amounts, and supplier. This happens even when I upload the same file multiple times.

Technical details:

  • Model used: gpt-4.1
  • I set the prompt as follows:

text

CopiarEditar

You are an invoice data extractor. You will receive a PDF invoice as input. Return ONLY a JSON with these keys:
  - numero (the invoice number)
  - cifs (CIFs found, separated by commas)
  - nombre
  - proveedor_nombre
  - total
  - divisa
  - importe (equal to total)
  - ivas: array of objects { base, cuota, tipo }
  - fecha (YYYY-MM-DD)
  - fecha_vencimiento (YYYY-MM-DD or empty string)
  - irpf (withholding tax, if any, or 0/null)
Nothing else, just clean JSON.

Comment:
These values do not match the real data in the PDF at all (not the supplier, not the amounts, nothing). I have tested this with several invoices and always get unrelated/fake data, even when uploading the same file multiple times.

Questions:

  • Has anyone else experienced this issue?
  • Is this a known limitation or bug?
  • Could OpenAI staff look into this, or is there a workaround to get the real data from PDFs?

Thanks in advance for your help!

The PDF file attachment feature on Chat Completions is simply unreliable and unusable, and this has continued without improvement.

  • attach via base64 - only the last file will be read
  • attach via file id - 50% of trials, nothing from the PDF is included

I would recommend that you apply your own PDF text extraction and image render technology, because OpenAI continues to supply failure that they will not address.

1 Like