Text formatting issue causing inaccurate AI responses

I am trying to convert a PDF to text using the pdf-parse library, feed that text to OpenAI, and prompt the model to list all the transactions in the bank statement in JSON format.

The problem is that the AI puts the paid_out amount in the paid_in field when the text extracted from the PDF is badly formatted.

The AI returns correct entries when the extracted text is properly formatted.

So this is my first issue.
The second issue is the token length: once a bank statement reaches 9 or 10 pages, I receive a length error because the request exceeds the 4096-token limit. How can I overcome these issues? Here is my code:

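// Imports (normally at the top of the module); the code below is an excerpt from a route handler.
const fs = require('fs');
const axios = require('axios');
const pdfToText = require('pdf-parse'); // pdf-parse exports a single async function

const bankStatementBuffer = fs.readFileSync('./cb_bs_no_table.pdf');
const bankPdfData = await pdfToText(bankStatementBuffer); // Convert PDF to text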
const bankStatementText = bankPdfData.text;
console.log('bankStatementText', bankStatementText);
const bankStatementResponse = await axios.post(
    `https://api.openai.com/v1/chat/completions`, {
        messages: [{
            role: 'system',
            content: 'You are a bank statement reader. Strictly return data in json do not write json in begining or end'
        }, {
            role: 'user',
            content: `List all the transaction in this bank statement do not skip any key. Keys are date, payment_method, description, paid_out, paid_in. If some entry is empty or 0, enter 0 as a number \n ${bankStatementText}`
        }],
        model: 'gpt-3.5-turbo',
        // temperature: 0.2
    }, {
        headers: {
            'Authorization': `Bearer ${Openapikey}`,
            'Content-Type': 'application/json',
        }
    }
)

return res.status(200).json({
    data: bankStatementResponse.data
});
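
On the token limit: the context window is shared between the prompt and the reply, so one common workaround is to split the extracted text into smaller chunks and send one request per chunk. Below is a minimal sketch of that idea. The function name `chunkByLines`, the default budget, and the 4-characters-per-token figure are assumptions for illustration; the ratio is only a rough heuristic, not an exact token count.

```javascript
// Rough heuristic (assumption): about 4 characters per token for English text.
const CHARS_PER_TOKEN = 4;

// Split text into chunks on line boundaries so that each chunk stays under
// an approximate token budget. A single line longer than the budget is kept
// whole and becomes its own oversized chunk.
function chunkByLines(text, maxTokens = 2000) {
  const maxChars = maxTokens * CHARS_PER_TOKEN;
  const chunks = [];
  let current = '';
  for (const line of text.split('\n')) {
    // Start a new chunk when adding this line would exceed the budget.
    if (current && current.length + line.length + 1 > maxChars) {
      chunks.push(current);
      current = '';
    }
    current = current ? `${current}\n${line}` : line;
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Each chunk would then replace `bankStatementText` in the request above. Splitting on line boundaries keeps a transaction line from being cut in half, although a transaction that spans several lines can still be split across two chunks.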

Hrm.

I’d probably try to get it to parse better (not always feasible) or give a one-shot example of how to get the data out of a garbled table. However, there’s a high chance it will eventually (maybe even frequently) hallucinate.

Another option is the new vision models, but you might have trouble with financial data.

You can find more on OpenAI models in the docs… Some have much larger context windows but varying prices.

@PaulBellow actually I tried one more approach: I converted the first page of the PDF to an image and fed that image to gpt-4-turbo, and now I am getting accurate results. But say I have 8 pages (the count can of course be higher or lower in other cases). Do I have to hit the API 8 times, once per page, and then merge the results? I think I am missing something, and this does not seem like a feasible approach.

EDIT
No, it is interpreting entries wrongly, shifting rows upwards onto the wrong dates. It is kind of hallucinating.
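
For the question about calling the API once per page and merging: that pattern is workable, and a minimal sketch of the merge step is below. `extractPage` is a hypothetical function (not from any library) that sends one page to the model and resolves to the reply as a JSON string containing an array of transactions. The sketch only covers the merging and does not address the accuracy problem noted in the edit above.

```javascript
// `extractPage` is a hypothetical async function supplied by the caller:
// it takes one page (text or image) and resolves to a JSON string holding
// an array of transaction objects.
async function extractAllPages(pages, extractPage) {
  // Promise.all keeps the replies in the same order as the pages.
  const replies = await Promise.all(pages.map((page) => extractPage(page)));
  const transactions = [];
  replies.forEach((reply, index) => {
    let parsed;
    try {
      parsed = JSON.parse(reply);
    } catch (err) {
      throw new Error(`Page ${index + 1} did not return valid JSON: ${err.message}`);
    }
    if (!Array.isArray(parsed)) {
      throw new Error(`Page ${index + 1} did not return a JSON array`);
    }
    // Tag each transaction with its page number to make wrong rows easier to trace.
    for (const tx of parsed) transactions.push({ ...tx, page: index + 1 });
  });
  return transactions;
}
```

Note that `Promise.all` sends all requests at once, which may run into rate limits on larger statements; processing pages sequentially in a loop is the simpler alternative.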

Hello guys, I am trying to convert a bank statement PDF to text, give that text to the OpenAI API, and ask it to return the transactions as JSON. The issue I am facing: when the paid out field is empty and the paid in field contains an amount, the AI puts the paid in entry into the paid out field.
I have realized that the issue is not with OpenAI, since Claude behaves the same way. The issue is in how I convert the PDF to text. I am using the pdf-parse Node library for the conversion.


The AI is not given labels saying which value is the payment type, description, paid_out, paid_in, or date, yet it still puts most of the data in the right fields; that is the power of AI. But in some cases it gets only one amount value, like
VISAMZNMktplace
amazon.co.uk18.99
Here VISA is the payment_method, Amazon Marketplace is the description, and 18.99 is an amount, but the AI does not know which label the amount belongs to. That is why it makes wrong entries in the paid out and paid in columns.

Note: I created my own properly formatted document in Word and then converted it to PDF. When I extract text from that PDF I get correct data because the formatting is good, and the AI also gives correct results.

How can I solve this problem? I think that PDF was generated from HTML and that is why the extracted text is not well formatted. How can I preprocess the text? Converting the PDF back to HTML is also not giving good results.
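
One possible preprocessing approach, sketched under assumptions: instead of relying on the flattened text, use the position of each text fragment to decide which column it belongs to. pdf-parse is built on pdf.js, whose text items carry a `transform` array with the x and y position (the pdf-parse README describes a `pagerender` option that exposes these items). The sketch below assumes the items have already been reduced to `{ str, x, y }` objects. The column boundaries in `COLUMNS` are made up and would have to be measured from a real statement.

```javascript
// Hypothetical column boundaries (x coordinates in PDF units), sorted by minX.
// These numbers are invented for illustration; measure them on your own PDF.
const COLUMNS = [
  { key: 'date', minX: 0 },
  { key: 'payment_method', minX: 80 },
  { key: 'description', minX: 130 },
  { key: 'paid_out', minX: 350 },
  { key: 'paid_in', minX: 430 },
];

// Return the key of the right-most column whose left edge is at or before x.
function columnFor(x, columns = COLUMNS) {
  let match = columns[0].key;
  for (const col of columns) {
    if (x >= col.minX) match = col.key;
  }
  return match;
}

// Group items into lines by y position (top of page first, since PDF y grows
// upwards), then place each item into a labelled column by its x position.
function itemsToRows(items, columns = COLUMNS, yTolerance = 2) {
  const sorted = [...items].sort((a, b) => b.y - a.y);
  const lines = [];
  for (const item of sorted) {
    const last = lines[lines.length - 1];
    if (last && Math.abs(last.y - item.y) <= yTolerance) {
      last.items.push(item);
    } else {
      lines.push({ y: item.y, items: [item] });
    }
  }
  return lines.map((line) => {
    const row = Object.fromEntries(columns.map((c) => [c.key, '']));
    for (const item of line.items.sort((a, b) => a.x - b.x)) {
      const key = columnFor(item.x, columns);
      row[key] = row[key] ? `${row[key]} ${item.str}` : item.str;
    }
    return row;
  });
}
```

With rows labelled this way, an amount that sits in the paid in column stays there even when paid out is empty, so the model (or plain code) no longer has to guess which field a lone number belongs to.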

Hi!

If the model struggles to interpret your data correctly, then you have a few different options at your disposal to try and remediate that:

  1. Include more detailed / specific instructions to the model about the task, including the nature and format of input data and how it is expected to interpret the input data.

  2. Include 1-2 worked examples that illustrate how a JSON in the desired format is produced on the basis of input data.

  3. Create a fine-tuned model based on training pairs of input data (i.e. your bank statement data after conversion into text) and the resulting JSON with the values.
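
As an illustration of options 1 and 2, here is a minimal sketch of a messages array that carries one worked example ahead of the real input. The garbled input lines are taken from this thread; the date, the choice of paid_out for the 18.99 amount, and the names `buildMessages`, `EXAMPLE_INPUT`, `EXAMPLE_OUTPUT`, and `SYSTEM_PROMPT` are assumptions made for the example.

```javascript
// Illustrative example pair: the garbled lines come from the thread, while the
// date and the paid_out/paid_in assignment are assumptions.
const EXAMPLE_INPUT = '01 Jan 2024\nVISAMZNMktplace\namazon.co.uk18.99';
const EXAMPLE_OUTPUT = JSON.stringify([
  {
    date: '01 Jan 2024',
    payment_method: 'VISA',
    description: 'AMZNMktplace amazon.co.uk',
    paid_out: 18.99,
    paid_in: 0,
  },
]);

// More specific instructions about the input and the expected output.
const SYSTEM_PROMPT =
  'You are a bank statement reader. The input is text extracted from a PDF, ' +
  'so columns may be run together. Return only a JSON array of transactions ' +
  'with the keys date, payment_method, description, paid_out, paid_in. ' +
  'Use 0 for an empty amount.';

// Build the chat messages: instructions, one worked example, then the real input.
function buildMessages(statementText) {
  return [
    { role: 'system', content: SYSTEM_PROMPT },
    { role: 'user', content: EXAMPLE_INPUT },
    { role: 'assistant', content: EXAMPLE_OUTPUT },
    { role: 'user', content: statementText },
  ];
}
```

The result can be passed as the `messages` field of the chat completions request, for example `messages: buildMessages(bankStatementText)`. A worked example drawn from the same bank's layout as the real statements is likely to help more than a generic one.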


Maybe try the new version 4 with vision? Seems like you’d avoid all the weird formatting issues if you just upload an image of the pdf.

I am having trouble with financial data, as mentioned by @PaulBellow. The model is hallucinating and inserting garbage values, even after many prompting attempts.