This is really confusing.
What kind of invoices are you processing that don’t already have the taxes and totals calculated?
In any case your first task would be to transform the document into a structured text, and THEN start to work on the document.
It’s critical to remember: You do not want a vision model to be OCRing your document AS WELL as performing numerical work on it.
This. Won’t. Work.
Or, it will kind of work but will be brutally inefficient and prone to constant edge cases.