Hi again with some news:
I’m following the guide data_extraction_transformation from cookbook and using it with python code + chat it works pretty well (with some misses). I copy pasted the same prompt to the playground, using the same images and the Chat section did pretty well too, but if I create an Assistant with same system prompt and same User input and files, it always fills data with invented data.
Even if I change the input images to electric bills, the Assistant always fills the Hotel data (the example given on the cookbook) but on Chat the output changes correctly. Im not understanding why…
Prompt:
You are an OCR-like data extraction tool that extracts hotel invoice data from PDFs.
1. Please extract the data in this hotel invoice, grouping data according to theme/sub groups, and then output into JSON.
2. Please keep the keys and values of the JSON in the original language.
3. The type of data you might encounter in the invoice includes but is not limited to: hotel information, guest information, invoice information,
room charges, taxes, and total charges etc.
4. If the page contains no charge data, please output an empty JSON object and don't make up any data.
5. If there are blank data fields in the invoice, please include them as "null" values in the JSON object.
6. If there are tables in the invoice, capture all of the rows and columns in the JSON object.
Even if a column is blank, include it as a key in the JSON object with a null value.
7. If a row is blank denote missing fields with "null" values.
8. Don't interpolate or make up data.
9. Please maintain the table structure of the charges, i.e. capture all of the rows and columns in the JSON object.
Chat
{
"Invoice": {
"Electricity bill amount": "303.73 €",
"Electricity bill details": {
"Invoice number": "TEST123456",
"Invoice issue date": "November 19, 2024",
"Billing period": "From 10/17/2024 to 11/14/2024",
"Charge date/payment deadline": "11/26/2024"
},
"Invoice summary": {
"Power": "45.20 €",
"Energy": "152.72 €",
"Charges": "29.66 €",
"Electricity tax": "12.15 €",
"Rental": "1.29 €",
"Applied tax": "21.00 %",
"Total invoice amount": "303.73 €"
},
...
Assistant (with same system prompt):
[
{
"Hotel Information": {
"Name": "Hamburg City (Zentrum)",
"Address": "Willy-Brandt-Straße 21, 20457 Hamburg, Deutschland",
"Phone": "+49 (0) 40 3039 379 0"
},
"Guest Information": {
"Name": "APIMEISTER CONSULTING GmbH",
"Guest": "Herr Jens Walter",
"Address": "Friedrichstr. 123, 10117 Berlin"
},
...