I’ve focused mostly on 10-Q/10-K filings, which have a lot of tables. If your form doesn’t have tables, that should actually make things easier for the model.
I’m definitely not trying to give you more work. If you’re happy with your current text extraction approach, stick with it. I was just pointing out that you’d pretty much never want to pass HTML to the model unless you’re specifically asking for HTML back. Markdown is always going to be far more concise, and there are plenty of libraries, like Unstructured, that focus specifically on converting unstructured docs into more structured, compact formats like markdown.
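For the conversion step, here’s a minimal sketch using Unstructured’s HTML partitioner (`pip install unstructured`). `filing.html` is a placeholder path, and strictly speaking the output is compact element text rather than true markdown, but the token savings over raw HTML are the point:

```python
# Minimal sketch: strip a filing's HTML down to compact text with
# Unstructured. "filing.html" is a placeholder path.
from unstructured.partition.html import partition_html

elements = partition_html(filename="filing.html")

# Each element is a typed chunk (Title, NarrativeText, Table, ...);
# joining their text drops the markup and slashes the token count.
compact = "\n\n".join(el.text for el in elements)
print(compact[:500])
```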
If you’re getting unreliable responses, I’d look at how hard the model has to work to return its answer. If the model sometimes returns the correct value and other times doesn’t, check how far, distance-wise, the value sits from its associated label, and whether there are other values between the desired value and its label.
These models are generally really good at retrieving facts as long as a) they’re actually shown the facts (RAG sucks at this), b) the value sits relatively close, distance-wise, to its label, and c) there are no other values between the desired value and its label that could confuse the model.
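If you want to sanity-check b) and c) before blaming the model, a crude heuristic goes a long way. This is just a sketch: the label and target value are hypothetical, and `compact` is the extracted text from the earlier snippet:

```python
import re

# Crude diagnostic sketch: how far is the value from its label, and what
# other numbers sit between them? Label and value here are hypothetical.
label, value = "Total revenues", "1,234,567"

label_pos = compact.find(label)
value_pos = compact.find(value, max(label_pos, 0))

if label_pos == -1 or value_pos == -1:
    print("label or value missing from the prompt text -- fix extraction first")
else:
    between = compact[label_pos + len(label):value_pos]
    print(f"label-to-value distance: {value_pos - label_pos} chars")
    # competing numbers between the label and the value are prime confusers
    print("values in between:", re.findall(r"\d[\d,.]*", between))
```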
I suspect that if you’re having to call the model multiple times to get your value, one of those three things is off, or your prompt contains ambiguous instructions. Less is more for prompting: a long prompt for me is 100 tokens of instructions, and in most cases you should be able to get the model to do what you want in under 20 tokens.
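For scale, here’s the kind of prompt I mean; the instruction part is well under 20 tokens, and “Total revenues” is a placeholder label again:

```python
# Sketch of a tight extraction prompt: ~15 tokens of instructions,
# then the document text itself.
prompt = (
    "From the filing below, what is the value of 'Total revenues'? "
    "Reply with the number only.\n\n"
    + compact
)
```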