How to confirm that you got the correct value from a text, other than by repeating the same prompt over and over

I’ve focused mostly on 10-Q/10-K filings which have a lot of tables. If your form doesn’t have tables that should actually be easier for the model to deal with.

I’m definitely not trying to give you more work to do. If you’re happy with your current text extraction approach then stick with it. More that I was just pointing out that you pretty much would never want to pass HTML to the model unless you’re asking specifically for HTML back. Markdown is always going to be way more concise and there are lots of libraries out there, like Unstructured, that are specifically focused on converting unstructured docs to more structured and compact formats like markdown.
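If it helps, here's a minimal sketch of that conversion using the markdownify package (just one option among many; Unstructured and similar libraries work too, and the exact call is illustrative rather than a recommendation):

```python
# pip install markdownify
from markdownify import markdownify as md

with open("filing.html", encoding="utf-8") as f:
    html = f.read()

# Convert the HTML to markdown. Wrapper tags like <div> and <span> disappear,
# and <table> markup becomes much more compact markdown tables.
markdown_text = md(html, heading_style="ATX")
print(markdown_text[:500])
```

The token savings alone usually justify this step, since the model never has to wade through the tag soup.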

If you’re getting unreliable responses, I’d look into how hard the model is having to work to return its answer. If the model sometimes returns a correct value but other times doesn’t, look at how far away, distance-wise, the value is from its associated label. Are there other values between the desired value and its label?

These models are generally really good at retrieving facts if a) they’re shown the facts (RAG sucks at this), b) the fact’s value is relatively close, distance-wise, to its label, and c) there aren’t values between the desired value and its label that could confuse the model.

I suspect that if you’re having to call the model multiple times to get your value then one of those 3 things is off, or your prompt contains ambiguous instructions. Less is more for prompting. A long prompt for me would be 100 tokens of instructions. In most cases you should be able to get the model to do what you want in fewer than 20 tokens.
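To make that concrete, here is what a sub-20-token instruction can look like in practice (the model choice and the "net revenue" field are placeholders, not part of the poster's actual workflow):

```python
from openai import OpenAI

client = OpenAI()
filing_text = open("filing.md", encoding="utf-8").read()  # the converted filing

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        # The entire instruction set is roughly a dozen tokens.
        {"role": "system", "content": "Return only the requested value, or NOT_FOUND."},
        {"role": "user", "content": f"Net revenue for the quarter:\n\n{filing_text}"},
    ],
    temperature=0,
)
print(response.choices[0].message.content)
```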

1 Like

The whole thing suddenly becomes interesting. Would anyone be willing to test my models by providing an example of a fitting document and a list of data to be extracted from it? I’m becoming more and more interested in this subject.

I’ve focused mostly on 10-Q/10-K filings which have a lot of tables. If your form doesn’t have tables that should actually be easier for the model to deal with.

Yes, I’m focused on the same kind of forms. What I meant was that in most cases the HTML contains <table> elements, but sometimes it doesn’t - for example, see this form. Therefore, I decided to ignore tables: rather than trying to extract the values of specific cells based on row and column headers, I ask questions about the raw text.

look at how far away, distance-wise, the value is from its associated label

Valid point. However, it seems that the errors are rather variable (which is good - the majority-wins approach then usually works). Also, keep in mind that the exact same prompt run over the exact same text returns different values when run N times. Again, in most cases it returns the correct value, but every now and then it returns a wrong one. So, to clarify, my approach does work for me - I was just wondering if there are any other approaches that would achieve the same result at a lower cost.
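For what it's worth, the majority-wins step itself can stay very small. A sketch (ask_model stands in for whatever call you are already making):

```python
from collections import Counter

def extract_with_consensus(ask_model, prompt, n=5, min_votes=3):
    """Run the same prompt n times and keep the most common answer."""
    answers = [ask_model(prompt).strip() for _ in range(n)]
    value, votes = Counter(answers).most_common(1)[0]
    # If no answer clears the threshold, flag the field for manual review.
    return value if votes >= min_votes else None
```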

Less is more for prompting

Interesting - I’ve had the opposite experience. My prompts grew longer and longer as I tested my code and identified different kinds of errors in reading and writing. I would say that >80% of the prompt text (especially for the ‘system’ content) was added during testing to nudge the model to read/write as intended. I would be really interested if you found a shorter prompt that achieves the same level of accuracy across a large sample of variable forms. If you have time and interest, please go ahead and play with my notebook! :slight_smile:

Ideally, as mentioned, you would convert the HTML into markdown. There are some libraries that make this very easy. I would also try to eliminate some noise where possible, like the <a> tags - see the sketch below.
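A sketch of that clean-up step with BeautifulSoup, run before the markdown conversion (the tag list is just an example of what counts as noise):

```python
from bs4 import BeautifulSoup

html = open("filing.html", encoding="utf-8").read()
soup = BeautifulSoup(html, "html.parser")

# Unwrap links so the anchor text survives but the href attributes are gone.
for a in soup.find_all("a"):
    a.unwrap()

# Drop tags that carry no useful text at all.
for tag in soup(["script", "style"]):
    tag.decompose()

clean_html = str(soup)  # hand this to your HTML-to-markdown step
```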

You can apply a single GPT to a specified amount of context (per item, perhaps?), run each GPT in parallel, and then synthesize the results.

I run a GPT to convert baked-in text on images (PDFs), along with a mixed bag of OCR + Gemini 1.5 Flash (sometimes there’s “image” content that OCR can’t pick up), and definitely rely on running it more than once to eliminate any weirdness.
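If you go the run-it-several-times route, firing the calls concurrently keeps the wall-clock time down. A rough sketch (the prompt, filename, and model are placeholders):

```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def ask_once(prompt: str) -> str:
    resp = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content

async def ask_many(prompt: str, n: int = 3) -> list[str]:
    # Same prompt, n concurrent calls; synthesize/compare the results afterwards.
    return await asyncio.gather(*(ask_once(prompt) for _ in range(n)))

filing_text = open("filing.md", encoding="utf-8").read()
results = asyncio.run(ask_many("Net income for the quarter (value only):\n\n" + filing_text))
print(results)
```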


The end result is very important to know as well though. Do you want to then use these results for RAG? Or are you just trying to fill up a database with numbers?

I would be very careful here. Typical RAG solutions don’t play well with numbers.

1 Like

I think ignoring structure where structure is present (i.e., tables) might not be optimal. Rather, extracting the table (wherever possible) and then inserting it as a record layout, after the other-than-table extraction, might be more effective.

The whole thing suddenly becomes interesting.

I agree! :nerd_face:
I’ve skimmed the literature and haven’t yet found any papers comparing methods to increase the accuracy of model responses to API calls (where the intended output is usually quite different than that of “classic” prompts).

Would anyone be willing to test my models by providing an example of a fitting document and a list of data to be extracted from it?

Feel free to PM me with exactly what you would need. I have an SQL table containing the URLs of >10k of these forms, and I could easily prepare a short list of the fields I’m trying to extract.

Hi,

Forgive me, but I think your approach DOES look very expensive, and given some of the recent advances I think there’s a simpler way to go about the problem.

Assistants and 4o mini
I see you’re using Completions and 4o.

• I think it’s harder to get consistency from the Completions API vs the Assistants API. You can gain a lot more control using the Assistants flow… and the variability you’re getting from those completions must be crazzzy.
• Using 4o is very expensive for this task. You’re just looking up the same data that moves around in the filings. Mini is literally 99.4% cheaper, you can run multiple checks without dropping $90 per document, and it is excellent at search/retrieval when it has a keyword.

Data Preparation
I am a very noob Python programmer, but given the variability of different SEC filings, I think just finding the right chunks is pretty difficult. Just covering all the different permutations of “Statement of Income” looks like a royal pain.

Anyway, depending on how you want to read the file, I think you should upload it to File Storage and then build a Vector Store out of it. You can make those uploads persist only until you’re done with the extraction so your storage costs don’t explode.
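Roughly like this (a sketch; the filename is made up, and the exact namespacing such as client.beta may differ between SDK versions):

```python
from openai import OpenAI

client = OpenAI()

# Upload the filing so Assistants can search it.
filing = client.files.create(
    file=open("aapl_10q.md", "rb"),
    purpose="assistants",
)

# A vector store that expires after a day of inactivity, so one-off
# extraction runs don't quietly accumulate storage costs.
store = client.beta.vector_stores.create(
    name="10-Q extraction",
    file_ids=[filing.id],
    expires_after={"anchor": "last_active_at", "days": 1},
)
```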

Multi-Assistant Flow
Every place you call Completions in your code is an opportunity to build a specialized assistant for each flow.

As I look through your process, I think you’d want an “income statement flow,” a “debt operations flow,” and so on.

• The first specialized assistant can take the Filing from the vector store, parse it, and find the section. 4o does great at this.
• A second assistant can use Structured Outputs to clean up the data into the shape you need. You can fine-tune this model to ensure your output is standardized.
• You know, I would collect all this data into another vector store for analysis later.
• A third assistant with function calling can invoke various tasks and fetch data based on the keywords in the user prompt.
• A fourth assistant can be a big boy like 4o, for after all your context is gathered and you’re ready for high-level analysis.

Even calling four assistants multiple times, I bet you could keep each call below $0.25.
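Continuing from the vector store sketch above, one of those specialized assistants might be wired up roughly like this (the names, instructions, and question are all made up; treat it as a shape rather than the exact flow described here):

```python
assistant = client.beta.assistants.create(
    name="income-statement-extractor",
    model="gpt-4o-mini",  # cheap model for the lookup-heavy step
    instructions="Locate the consolidated statements of income and report only the requested line item.",
    tools=[{"type": "file_search"}],
    tool_resources={"file_search": {"vector_store_ids": [store.id]}},
)

thread = client.beta.threads.create(
    messages=[{"role": "user", "content": "Net revenue for the most recent quarter?"}]
)
run = client.beta.threads.runs.create_and_poll(thread_id=thread.id, assistant_id=assistant.id)

# The most recent message in the thread is the assistant's answer.
messages = client.beta.threads.messages.list(thread_id=thread.id)
print(messages.data[0].content[0].text.value)
```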

I just got a script working that uses a similar flow and goes through some extremely variable data and transforms it into a spreadsheet that is looking like it’s about 99% accurate on the first run. :eyes: I think just a touch of fine-tuning will bring it to 100%.

3 Likes

Completely agree regarding the use of mini. There isn’t much inference going on here. It’s cheaper and more efficient to run Mini N times and compare the results instead.

Mini is THE GOAT.

Considering that whatever OP is planning on doing involves a bunch of moving parts, it makes sense to focus as much as possible on transforming this massive document into some delicious bite-sized semantics.

Then the next step would be the database, which IMO would absolutely need to be a hybrid setup: one that takes advantage of the semantics of the documents, and can also make sense of and perform computations on the numbers.

I would consider fine-tuning a last-ditch approach. There’s just too much lock-in involved in an industry that’s rapidly advancing. Instead, I would try different models and synthesize the results.

4 Likes

I have somewhere around 2,000-3,000 hours prompting these models now, so a lot of my prompting is just intuition at this point. I always start as simple as possible, usually 1 or 2 instructions of around 20 tokens or less. More often than not I can stop there (again, good intuition for how to craft those instructions), but as I see issues I’ll slowly layer on additional instructions one by one. If I get to 5 or 6 instructions I’ll usually stop and re-evaluate my approach, because what I’m trying to do probably isn’t going to work.

A lot of it is honestly just not fighting the model. You have to pick your battles. I expect the model to make mistakes and I design for those mistakes. Getting a non-deterministic LLM to behave in a deterministic way is hard; it’s going to make mistakes, and you’ve already found your answer: if you absolutely need the best possible answer, call “A model” multiple times and take consensus.

I say “A model” because you don’t have to use the same model for each call. You can call 3 different models and you’ll get 3 slightly different opinions (think Minority Report); you can use 2 cheaper models for the first pass and then a third, more capable model as a judge. It’s just important that all the models be different, because they each bring a slightly different perspective to the problem.

In fact, I have growing evidence that you can use gpt-4o-mini to perform a task and then ask gpt-4o to check mini’s work, and you’ll get slightly more accurate results than had you used gpt-4o alone. Generally gpt-4o is more capable than mini (by far), but every once in a while mini will view a problem from a perspective that gpt-4o misses and will get the right answer. When that happens, gpt-4o is smart enough to realize that mini is correct and it will take mini’s answer.
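A rough sketch of that worker/judge pattern (the prompts are illustrative, not the poster's exact wording):

```python
from openai import OpenAI

client = OpenAI()

def extract(model: str, question: str, text: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Answer with the value only, or NOT_FOUND."},
            {"role": "user", "content": f"{question}\n\n{text}"},
        ],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

def extract_with_judge(question: str, text: str) -> str:
    draft = extract("gpt-4o-mini", question, text)  # cheap first pass
    # Stronger model reviews the draft; it keeps mini's answer when it's right.
    verdict = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": (
                f"{question}\n\n{text}\n\nA draft answer is: {draft}\n"
                "If the draft is correct, repeat it verbatim; otherwise reply with the correct value."
            ),
        }],
        temperature=0,
    )
    return verdict.choices[0].message.content.strip()
```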

3 Likes

Ok so here is what I need:

The raw text (.txt, not .md), without any formatting (no HTML), of the document you would like to extract the data from.

A list of data-extraction objects in the following format (.json is better); a filled-in example follows the field list below:

name: %the name of the data we are extracting%
question: %the question to use as query to the vector DB (or user question to the document)%
instructions: %optional instructions on how to extract the data if needed%
data_samples: [ %keywords or examples of how the data is usually presented in doc%, %another sample%, %third sample if needed%]
data_formats: [ %example of the format of the answer you need to get from the model%, %maybe another one%]
strict_answers: [ “yes”, “no”, “unknown”, “not_found”, “contradictory_statements”, “some other answer I accept” ] (this one is needed only if the answer you are looking for must be one of the answers given to the model, e.g., classification; if used, the data_formats field will be ignored)
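For instance, a single item could look like this (the field values are entirely made up, just to show the shape):

```json
{
  "name": "net_income",
  "question": "What was the net income for the most recent quarter?",
  "instructions": "Report the consolidated figure, not a segment figure.",
  "data_samples": ["Net income", "Net income (loss)", "Net earnings"],
  "data_formats": ["$12.3 million", "(4,567)"]
}
```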

Let’s start with 10-20 data items to extract and one or two samples of text you are parsing.

Feel free to send it in a PM, or here if there’s nothing private.

@RonaldGRuckus
I agree that going direct-to-fine-tuning isn’t the answer. I meant it for cases where you have to reach the highest possible degree of accuracy for something very specific.

In my case, sometimes the model outputs the wrong dash character when I need it to use ‘-’, in spite of every kind of prompt I can think of, in every place I can think of placing it. :expressionless:

But hey, fine-tuning is free right now. Great time for a 'lil overkill.

Yessir, I think this is true. Mini also costs 99% less.

I’ve similarly found Mini is the best for simple repeatable tasks that still require a bit of intelligence but no creativity. Also this:

https://openai.com/index/finding-gpt4s-mistakes-with-gpt-4/

2 Likes

Just checked the prompts in the notebook. Yep, I’m positive I have a better approach for extracting the data, to be used in a sort of standard report if needed. So if you have the items for the test, we can run a couple of those by the end of the week (or sooner).

Hi Scott, thanks for your reply.

Pass 2, feed in the form, your extraction query, and the first pass results

When doing this, do you change something in the ‘role’? If you could provide a short example of how this 2nd prompt compares to the first it would be great. Thanks!

Hi Sean, thanks for your reply!

No need to apologize, I shared my code to get exactly this kind of critique :slight_smile:

  • I will try the Assistants API, although I have read some unflattering reviews of its performance. Just to clarify: with the current approach I actually don’t get “crazy variability” - it works quite well for me accuracy-wise, but I wanted to hear how others are dealing with such issues. Anyway, thanks for the suggested assistant flow!
  • Good point about trying it with mini! Although I definitely don’t get to the amounts you mentioned using the 4o model - perhaps because I’m not feeding the model entire documents. So far I’ve been paying something like $0.005 per prompt (~1000 input tokens and 20-30 output tokens). Naturally, I would like to pay less if mini really works as well.
  • Is it possible to use assistants with mini? I did not find any documentation that clearly states this.

I just got a script working that uses a similar flow

Would be really happy to see your code if/when you make it public! :slight_smile:

That’s really interesting - thanks! I’ll definitely try this.

Cool, looking forward to seeing how you handle it! :slight_smile:
I’m a bit swamped right now but I hope I can get you what you need within a couple of days - sent you a PM asking for a couple of clarifications.

1 Like

Saw it. Will send details tomorrow morning (I’m in GMT+2)

Wow! It sounds like whatever you’re doing is cutting out a lot of extra data. That’s pretty cool. My process used something like 5m Input Tokens to 250k Output. :sweat_smile:

Use the platform to see the list of models available for Assistants.

I’ll let you know when I get it up.

1 Like

Yes, gpt-4o-mini is available via Assistants.

From more of a high-level perspective: using a large language model like GPT-4o to identify and read data within documents with variable layouts and variable identifying text, and to put those numbers into fields in a database table, is never going to be error-free. But then, asking a highly educated and intelligent intern to extract numbers from 10-Ks or 10-Qs will also result in some errors. At least GPT won’t get bored and make more mistakes as it goes along!

2 Likes

Same issue here… I’m crawling a list of words and extracting the base word, the grammatical category, and a definition. It works well in about 90% of cases when processing batches of 100 words, but it goes off the rails in some cases. Usually redoing the same request works, but unfortunately in some cases it doesn’t want to output the correct data, even though gpt-4o can find the right answer if you ask it individually from the Playground.
I tried 100 things with no luck: make a prompt, define a format, ask GPT to reformulate, and try again, over and over until I get some results. But across 400,000 words there is no way I can correct 40,000 manually. So now I have something, I just need to find out how to fix all that crap :slight_smile: In the meantime I’m going back to old-fashioned scraping of online French dictionaries to get my data correct, and trying to find domains for each word in those definitions… The next step is to ask that semi-intelligent ape to create a hierarchical taxonomy for each word from a list of given domains… which is another pain because, again, it handles requests differently on each batch! So I end up writing 1,000-token prompts, which has a cost, just to fix that idiot :slight_smile: and I really don’t want to pay for a complete run like that.

1 Like