How to confirm that you got the correct value from a text, other than by repeating the same prompt over and over

Dear community, I would be happy to hear your thoughts on the following:
I have recently begun testing data extraction from financial forms using Python code that calls GPT-4o via the OpenAI API. The reasoning is that the format of these forms can vary quite a lot, so “old-school” programmatic approaches become complicated. The issue is that the answers returned by GPT-4o are not always correct (obviously), and can vary between attempts when repeating the exact same prompt.
My interim solution for this has been to ask for the same info X times, and get a majority vote. However, this seems like quite an expensive approach, and I wonder if any of you have figured out an alternative approach that is more efficient.
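For reference, this is roughly what that interim solution looks like - a minimal sketch (not my actual notebook code), where the prompt and the model call are simplified placeholders:

```python
from collections import Counter
from openai import OpenAI

client = OpenAI()

def extract_value(prompt: str) -> str:
    """Single extraction attempt with GPT-4o (simplified placeholder)."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

def majority_vote(prompt: str, n: int = 5) -> str:
    """Ask the same question n times and keep the most common answer."""
    answers = [extract_value(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```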

Looking forward to hearing your thoughts! :slight_smile:


Edit - a bit more context:

  1. Here you can find an example of the type of form I am looking at (10-Q and 10-K SEC filings).
  2. Here you can see my jupyter notebook containing a rough draft of how I attempt to access the relevant info.
2 Likes

Hi,

Personally I work hard on the following items:

  1. Chunking, so that my retrieved context contains all the data I need
  2. Query formulation, to be more precise in what I need to find from the vector DB. Query for me is composed of the “primary” - usually a question you need to find the answer for, and several “aimers” - examples of how the context might look like in the DB (keywords, samples, etc.)
  3. Preselection of retrieved context, so that the answering model has all the chunks it needs without the noise the vector database may return.
  4. Instructions. Usually for data extraction, my prompts contain the main question (often matching the query) + specific instructions on how to extract it and or process + several examples of the format I want to obtain in the result or strictly predefined answers the answering model has to choose from.
  5. Validator, a model that returns a binary answer of whether the result matches expectations; I usually fine-tune this model.

It is good practice to make each data element you want to extract a single answer, ideally a single value, sometimes a homogeneous list of values. This is what makes the answers simpler to produce with fewer errors.

In my specific use case (legal doc analysis) I also imposed a constraint to return 2 lists of context IDs:

  • all context items selected to answer (result of #3)
  • list of context IDs that the answer is based on (those items that actually contributed to extracting the final value; a shorter list of IDs than the previous list).

This allows the engine to justify the answer in case it needs to be verified further down in the app or by a human.

Another trick is to allow your models to answer: unknown, not_found, contradictory_context. That way, instead of hallucinating when there is not enough data or the data is confusing, the model will likely choose one of those ready-to-use answers, which can be verified by classic code with a simple match.
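As a rough illustration (not my production code), the check on your side then stays a trivial string match:

```python
# Sketch of the "predefined answers" trick: the model is instructed to reply
# with either the extracted value or one of these exact fallback tokens.
FALLBACK_ANSWERS = {"unknown", "not_found", "contradictory_context"}

def classify_answer(raw_answer: str) -> tuple[str, bool]:
    """Return the cleaned answer and whether it is a fallback (nothing extracted)."""
    answer = raw_answer.strip()
    return answer, answer in FALLBACK_ANSWERS
```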

4 Likes

Hi Serge, thanks for your response!
I did invest quite a lot in prompt engineering, and it did vastly improve performance (or, rather, without this the results were useless…).

I would be really grateful if you could further explain two of the points you raised:

  1. Validator, a model that returns a binary answer of whether the result matches expectations; I usually fine-tune this model.

What do you mean by this? Is this an additional prompt or part of the original prompt? Could you please provide a short code example?

  2. list of context IDs that the answer is based on (those items that actually contributed to extracting the final value; a shorter list of IDs than the previous list).

I am afraid I don’t really know what you are referring to here, perhaps I am missing some underlying theory. Could you please explain and/or refer me to some relevant documentation? An example would also be greatly appreciated.

Thanks!

1 Like

Hey, sorry, I’m too marinated in my own context, so I missed that my cryptic responses are not always clear.

If chunking is done right and the elements you import into your RAG contain the IDs of the chunks (needed to be able to operate on them effectively during retrieval), your query to the vector DB will return objects with IDs (and other info you will be using to build your prompts). Here are some examples of my approach: Examples | SIMANTIKS. If you look closely at the storable objects, you’ll notice that besides the document UUID you have a path, which acts as the address inside the document. The unique chunk ID can be built using the doc UUID and the chunk address/path. Sure, your app may need a different approach, but after a couple of years in the field I came to what is there because it’s the most flexible minimal composition of context objects I have found so far.
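For illustration only (your schema will differ), such a chunk ID can be as simple as a concatenation:

```python
# Illustrative only: a unique chunk ID composed of the document UUID and the
# chunk's address/path inside the document.
def make_chunk_id(doc_uuid: str, chunk_path: str) -> str:
    return f"{doc_uuid}:{chunk_path}"

chunk_id = make_chunk_id("9f2c0d3e-placeholder-uuid", "body/part1/item1/table2")
```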

So once you run the query to vector DB, you get a list of those context objects sorted by relevance. Instead of using all objects from the list, I run them through a separate model to select the items I really need (preselection).

The prompt looks similar to:

Having the question and instructions from the user, evaluate whether the given excerpt from the document contains the exact answer to the user's question, related information, or other context somehow necessary to answer the question. Answer either 1 for yes or 0 for no.

Question: %question%
Instructions: %instructions%
Found excerpt:
%excerpt%

Your answer (single digit only):

The answer is a single character, so you can easily map it in your code and verify the log probs for certainty.

Run in parallel on all found items and accept only those that were selected.
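A rough sketch of what that preselection call can look like in code (assuming the OpenAI chat completions API and a cheap model; adapt to your own stack):

```python
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI()

PRESELECT_PROMPT = """Having the question and instructions from the user, evaluate whether the given excerpt from the document contains the exact answer to the user's question, related information, or other context somehow necessary to answer the question. Answer either 1 for yes or 0 for no.

Question: {question}
Instructions: {instructions}
Found excerpt:
{excerpt}

Your answer (single digit only):"""

def preselect(question: str, instructions: str, excerpt: str) -> tuple[bool, float]:
    """Return (selected, logprob of the chosen digit)."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any small model can handle this task
        messages=[{"role": "user", "content": PRESELECT_PROMPT.format(
            question=question, instructions=instructions, excerpt=excerpt)}],
        max_tokens=1,
        logprobs=True,
    )
    token = response.choices[0].logprobs.content[0]
    return token.token.strip() == "1", token.logprob

def preselect_all(question: str, instructions: str, excerpts: list[str]) -> list[str]:
    """Run the check on all retrieved excerpts in parallel, keep only the selected ones."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(lambda e: preselect(question, instructions, e), excerpts))
    return [e for e, (keep, _) in zip(excerpts, results) if keep]
```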

Build the prompt for answering model.

Get your answer.

Then validate the answer using similar approach, but this time include all the selected items at once. The prompt should be similar to this:

As the expert in the subject, please confirm the correctness of the answer below, which was based on the provided context. Answer with either 1 for yes or 0 for no.

User query: %question%

Context: 
%context%

Answer: %answer%

Do you confirm correctness of the answer? (reply by a single digit only):

Again, easy to parse and check log probs.
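If it helps, turning the logprob of that single digit into a probability (using the token object from the preselection sketch above) is a one-liner:

```python
import math

# The logprob of the single "0"/"1" token gives a rough confidence estimate.
confidence = math.exp(token.logprob)  # e.g. logprob -0.02 -> ~0.98
```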

If all good, you continue your app logic with:

Take preselected items, query and the answer from the primary answering model and run it through a different model with a prompt similar to:

Having the context items and the answer to user's question, please select the IDs of the context items that contain the answer to the user query. 

User query: %question%
Context items:

%item1%

%item2%

...etc.

Answer: %answer%

ID(s) (comma-separated list of IDs if multiple items formed the answer):

Item format is:

%field1%: %value1%
…etc.
ID: %id%.

This will give you one or more items that justify the answer, i.e. the context the model actually used to form it (you can use those IDs in your code/display logic).
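The parsing on your side stays trivial, something like:

```python
def parse_ids(raw: str) -> list[str]:
    """Split the model's comma-separated ID answer into clean IDs."""
    return [part.strip() for part in raw.split(",") if part.strip()]
```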

As you see, the auxiliary models have simple tasks that are agnostic to the data they operate on, so they are easy to fine-tune for better performance without retraining on a specific domain (unless the domain is very specific and lacks general knowledge about it).

And you have a bonus of log probs on single token answers for certainty estimations.

4 Likes

The beauty of the beast is in precision and speed: 1 data item extracted / answered takes about 2-5 seconds, and 50 items (predefined in your data parsing engine) take about the same 2-5 seconds for all of them at once…

I’m unsure of what your starting point is, but if you are getting different digits from different runs there is no secret sauce besides working the best odds in your favor - which includes running it multiple times if you have exhausted all the single-run options.

Just to confirm, are you pulling this information from a vector database via embeddings?

If you’ve pushed your prompt as far as you can get it, maybe start a small fine-tuning set for the extraction cases that fail. Show the correct extractions in your fine-tune examples. You may only need a handful of examples to see an improvement (or not :slightly_smiling_face:).
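If you go that route, each failing case becomes one training example in the chat fine-tuning JSONL format - a sketch with made-up values:

```python
import json

# One fine-tuning example per failed extraction (OpenAI chat fine-tuning format).
example = {
    "messages": [
        {"role": "system", "content": "Extract the requested value from the filing excerpt."},
        {"role": "user", "content": "Excerpt: ... Total current liabilities were $1,234 thousand ..."},
        {"role": "assistant", "content": "1234"},
    ]
}

with open("train.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")
```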

1 Like

I’ve had pretty good luck doing things in two passes. Pass 1, feed in the financial forms with your extraction query. Pass 2, feed in the form, your extraction query, and the first pass results. Ask it to modify the results as needed to satisfy the extraction query.

These AIs are often really good at checking their own work.
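Roughly like this (sketch only; extraction_query and form_text are placeholders for your own query and pre-extracted text):

```python
from openai import OpenAI

client = OpenAI()

extraction_query = "Extract total current assets (in thousands of USD)."  # placeholder
form_text = "..."  # the financial form (or the relevant section of it)

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Pass 1: extract.
draft = ask(f"{extraction_query}\n\nDocument:\n{form_text}")

# Pass 2: feed in the form, the query, and the first-pass results,
# and ask the model to modify the results as needed to satisfy the query.
final = ask(
    f"{extraction_query}\n\nDocument:\n{form_text}\n\n"
    f"First-pass results:\n{draft}\n\n"
    "Modify the results as needed so they satisfy the extraction query, "
    "then return only the corrected results."
)
```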

Let me know how it goes.

1 Like

Just to confirm, are you pulling this information from a vector database via embeddings?

Sorry, my bad for not giving enough context. The API refused to scan the form itself (an htm page), so what I did was use old-school approaches to find the relevant text within the document and then feed this text as input to the model. I’m not sure if this is indeed the best approach for extracting info from pages like this one - would love some input on this. You can see a very rough draft of my attempts here.
Thanks for your input, really appreciate it! :slight_smile:

1 Like

This might be a key issue - I am currently not using a vector DB, I’m simply extracting text from htm pages (for example) according to specific keywords, and then feeding the text as part of the prompt to the model. You can see a rough draft of my attempts in this jupyter notebook.

If you have a suggestion for an alternative approach I would really love to hear it.

Thanks for your input - highly appreciated! :slight_smile:

Can I ask what types of financial forms you are trying to parse? You can convert your HTML to markdown or use a service like Unstructured to parse it, and it’ll be a lot smaller. Even lengthy financial documents, like SEC filings, can typically fit into the context window if you first convert them to something like markdown.
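For example, with a library like markdownify (one option of several, not necessarily what we use in production):

```python
from markdownify import markdownify as md

with open("filing.htm", encoding="utf-8") as f:
    html = f.read()

# Strip the HTML markup; the result is far smaller token-wise than raw HTML,
# and tables typically come out as markdown tables.
markdown_text = md(html)
```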

Just because it fits doesn’t mean you’ll be able to extract the information you’re looking for with 100% reliability. I’ve found that the best you can hope for is around 99.7% reliability when calling these models, and even that takes some doing. I realize that’s a very specific number, but it’s the number we keep landing on in our testing.

2 Likes

This approach is really good, especially when you deal with a lot of similar documents. On the other hand, when your input documents vary a lot, fine-tuning may lock you into a specialized document structure. In this case it’s better to design a more flexible solution (what I ended up doing) to extract data from any type of doc using a custom RAG engine, and then see how the various steps can be optimized.

1 Like

I’ll have a closer look and let you know in this thread.

2 Likes

I think you linked to the wrong notebook but I found the one extracting balance sheets from SEC filings.

On the prompt side of things… have you tried just asking the model to extract the balance sheet? You shouldn’t need all of the various hints you’re giving it about finding the units and so forth. The model will work that out. It’s seen more SEC filings than you or I have, so it knows how they’re structured.

Hi Steven, thanks for your reply.

I have edited my original post to provide more context for my question.
In general, I am looking into SEC filings (currently focusing on quarterly/yearly reports). My goal is to collect specific info from many thousands of htm pages for subsequent storage in a DB. So, I’m looking to extract just a few specific numbers from each page, but there are many pages, each page contains many characters, and page format varies quite a bit. This format variation is both at the level of htm code (e.g., whether the info is stored within a table or not) and at the level of wording used (hence the need for semantics-based extraction).
My current solution is to find specific text blocks that contain relevant keywords and then feed the relevant text block to the model as part of the prompt. This works rather well; however, for a given text there is (small) variability in the responses obtained using the same prompt. I don’t really understand how converting to MD would be more efficient than extracting plain text - but maybe I missed something?

1 Like

One of our primary test scenarios is reasoning over SEC filings so this is an area I’ve done a fair amount of work in. If you’re already getting the text sections you want out of the filing then you’re probably ok.

In our system we convert HTML and PDF to markdown before sending content to the model as this significantly reduces the token length and you can typically just pass in the whole document if cost isn’t a huge concern. The model has also been trained on a ton of markdown which doesn’t hurt.
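A quick way to check whether the converted text fits your limit (assuming tiktoken; markdown_text is the converted document):

```python
import tiktoken

# o200k_base is the tokenizer used by GPT-4o.
encoding = tiktoken.get_encoding("o200k_base")
n_tokens = len(encoding.encode(markdown_text))
print(f"{n_tokens} tokens")
```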

The big thing with SEC filings is the tables. The models have poor spatial awareness, so they often struggle with the tables in filings. You can fix that by converting tables from column-row layout to record layout. Basically, you create a separate table for each column in the source table. This will improve the accuracy of retrieving values for things like a specific year or quarter.
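One way to do that conversion (a sketch with pandas, assuming the table has already been parsed into a DataFrame with the row labels as the index and one column per reporting period):

```python
import pandas as pd

def to_record_layout(df: pd.DataFrame) -> str:
    """Emit one small record block per column instead of one wide row-column grid."""
    blocks = []
    for column in df.columns:
        lines = [f"### {column}"]
        for label, value in df[column].items():
            lines.append(f"{label}: {value}")
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)

# Made-up balance-sheet fragment with two reporting periods.
df = pd.DataFrame(
    {"2024-03-31": [1200, 300], "2023-12-31": [1100, 280]},
    index=["Total assets", "Total liabilities"],
)
print(to_record_layout(df))
```

Each value now sits right next to both its row label and its period, instead of thousands of tokens away from the column header.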

4 Likes

I think you linked to the wrong notebook

Yes - you’re right, thanks for pointing this out! I’ve now corrected the link.

have you tried just asking the model to extract the balance sheet?

Yes, I initially tried just to provide the entire document to the model. However, I got an error message that it contains too many tokens (~725k, with my allowed limit being 30k). Therefore, extracting only the relevant sections of the document was my way of making the process cheaper.

1 Like

To better understand the spatial awareness issue, you have to think about the fact that the model is sensitive to distance. The further away a fact’s value is from its label, the less likely it is to be retrieved. When you look at a table you see a grid of rows and columns, but what the model sees is a long string of tokens. If you have a large table, the value in the lower-right cell will be just a few tokens away from its row label, but it could be thousands of tokens away from its column label. Converting the table to record layout moves everything closer together distance-wise.

1 Like

I can get most filings down to under 50k tokens by converting to markdown, but getting under 30k could be tough.

How did you get over the fact that not all forms contain tables (at the HTML level)?
Converting to MD does reduce the number of tokens, but not enough to get below 30k - so it’s still more expensive than I’d like.