Hello,
I’m using OpenAI’s API (and ChatGPT web) to extract data from PDF invoices, requesting only a JSON output with specific keys, such as invoice number, CIF, total, dates, VAT, etc.
Problem:
When I upload a real PDF invoice and ask ChatGPT to extract the data as JSON, the response contains data that has absolutely nothing to do with the actual document. For example, it returns an invented invoice number, CIF, amounts, and supplier. This happens even when I upload the same file multiple times.
Technical details:
- Model used: gpt-4.1
- I set the prompt as follows:
You are an invoice data extractor. You will receive a PDF invoice as input. Return ONLY a JSON with these keys:
- numero (the invoice number)
- cifs (CIFs found, separated by commas)
- nombre
- proveedor_nombre
- total
- divisa
- importe (equal to total)
- ivas: array of objects { base, cuota, tipo }
- fecha (YYYY-MM-DD)
- fecha_vencimiento (YYYY-MM-DD or empty string)
- irpf (withholding tax, if any, or 0/null)
Nothing else, just clean JSON.
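For reference, the JSON shape this prompt asks for can be written down as a small Python check. This is only a minimal sketch: the function name `validate_invoice` is hypothetical, it uses only the standard library, and it checks the structure of the reply (keys, date format, importe equal to total), not whether the values actually match the PDF.

```python
import json
import re

# Keys requested in the prompt above.
REQUIRED_KEYS = {
    "numero", "cifs", "nombre", "proveedor_nombre", "total",
    "divisa", "importe", "ivas", "fecha", "fecha_vencimiento", "irpf",
}
IVA_KEYS = {"base", "cuota", "tipo"}
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def validate_invoice(raw):
    """Return a list of structural problems in the model's reply (empty if none)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    missing = REQUIRED_KEYS - set(data)
    extra = set(data) - REQUIRED_KEYS
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    if extra:
        problems.append(f"unexpected keys: {sorted(extra)}")

    # fecha must be YYYY-MM-DD; fecha_vencimiento may also be an empty string.
    if "fecha" in data and not DATE_RE.match(str(data["fecha"])):
        problems.append("fecha is not YYYY-MM-DD")
    if "fecha_vencimiento" in data and data["fecha_vencimiento"] != "":
        if not DATE_RE.match(str(data["fecha_vencimiento"])):
            problems.append("fecha_vencimiento is not YYYY-MM-DD or empty")

    # The prompt states that importe equals total.
    if "importe" in data and "total" in data and data["importe"] != data["total"]:
        problems.append("importe differs from total")

    # ivas must be an array of objects with exactly base, cuota, tipo.
    if "ivas" in data:
        if not isinstance(data["ivas"], list):
            problems.append("ivas is not an array")
        else:
            for i, item in enumerate(data["ivas"]):
                if not isinstance(item, dict) or set(item) != IVA_KEYS:
                    problems.append(f"ivas[{i}] must have exactly base, cuota, tipo")
    return problems
```

A reply can pass this check and still contain invented values, which is exactly the situation described in this post, so the check only rules out formatting problems.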
Comment:
The values returned do not match the real data in the PDF at all (not the supplier, not the amounts, nothing). I have tested this with several invoices and always get unrelated, fabricated data.
Questions:
- Has anyone else experienced this issue?
- Is this a known limitation or bug?
- Could OpenAI staff look into this, or is there a workaround to get the real data from PDFs?
Thanks in advance for your help!