Using the GPT-3.5 Turbo API to look up information in a table

Hi,

I am working on a project where I want to look up information from a PDF file. For a certain use case, I need to do a relatively simple lookup in a table. The source document contains two tables, one after the other. I already converted the PDF to markdown to see if that would make it easier for GPT to select the correct one, but without success so far.

Context (snippet from the MD file):

|     | Categorie 1 | Categorie 2 | Categorie 3 | Categorie 4 | Categorie 4 |
| --- | ----------- | ----------- | ----------- | ----------- | ----------- |
| 0j  | 14,1124 | 14,1124 | 14,1941 | 2.439,9800 | 14,8177 |
| 1j  | 14,7451 | 14,7451 | 14,8815 | 2.552,5300 | 15,5012 |
| 2j  | 15,0368 | 15,0368 | 15,1760 | 2.612,3500 | 15,8645 |
| 3j  | 15,3114 | 15,3114 | 15,5028 | 2.659,6600 | 16,1518 |
| 4j  | 15,5082 | 15,5082 | 15,7579 | 2.684,0300 | 16,2998 |
| 5j  | 15,5082 | 15,5082 | 15,8646 | 2.719,2900 | 16,5139 |
| 6j  | 15,5082 | 15,5082 | 15,9742 | 2.719,2900 | 16,5139 |
| 7j  | 15,5082 | 15,5082 | 15,9742 | 2.754,5100 | 16,7278 |
| 8j  | 15,5082 | 15,5082 | 15,9742 | 2.754,5100 | 16,7278 |
| 9j  | 15,6479 | 15,6479 | 16,1145 | 2.778,8500 | 16,8756 |
| 10j | 15,6479 | 15,6479 | 16,1145 | 2.779,5900 | 16,8801 |
| 11j | 15,6479 | 15,6479 | 16,1145 | 2.780,4300 | 16,8852 |
| 12j | 15,6479 | 15,6479 | 16,1145 | 2.781,1000 | 16,8893 |
| 13j | 15,6479 | 15,6479 | 16,1145 | 2.781,8400 | 16,8938 |
| 14j | 15,7877 | 15,7877 | 16,2548 | 2.805,5400 | 17,0377 |
| 15j | 15,7877 | 15,7877 | 16,2548 | 2.806,3500 | 17,0426 |
| 16j | 15,7877 | 15,7877 | 16,2548 | 2.806,3500 | 17,0426 |
| 17j | 15,7877 | 15,7877 | 16,2548 | 2.806,3500 | 17,0426 |
| 18j | 15,7877 | 15,7877 | 16,2548 | 2.806,3500 | 17,0426 |
| 19j | 15,9276 | 15,9276 | 16,3953 | 2.829,3000 | 17,1820 |
| 20j | 15,9276 | 15,9276 | 16,3953 | 2.829,3000 | 17,1820 |
| 21j | 15,9276 | 15,9276 | 16,3953 | 2.830,4200 | 17,1888 |
| 22j | 15,9276 | 15,9276 | 16,3953 | 2.830,4200 | 17,1888 |
|     | Categorie 5 | Categorie 5 | Categorie 6 | Categorie 6 | Categorie 7 |
| --- | ----------- | ----------- | ----------- | ----------- | ----------- |
| 0j  | 2.579,8000 | 15,6668 | 2.648,0500 | 16,0813 | 3.010,8500 |
| 1j  | 2.672,6000 | 16,2304 | 2.751,6800 | 16,7106 | 3.048,2600 |
| 2j  | 2.727,9200 | 16,5663 | 2.817,5300 | 17,1105 | 3.083,8300 |
| 3j  | 2.771,0100 | 16,8280 | 2.870,0200 | 17,4293 | 3.119,6400 |
| 4j  | 2.801,4900 | 17,0131 | 2.910,6000 | 17,6757 | 3.154,7800 |
| 5j  | 2.819,2400 | 17,1209 | 2.936,9400 | 17,8357 | 3.190,8800 |
| 6j  | 2.837,0100 | 17,2288 | 2.964,2300 | 18,0014 | 3.226,2800 |
| 7j  | 2.854,6900 | 17,3362 | 2.991,3700 | 18,1662 | 3.262,7400 |
| 8j  | 2.872,4300 | 17,4439 | 3.017,8600 | 18,3271 | 3.299,3300 |
| 9j  | 2.897,5200 | 17,5963 | 3.043,8600 | 18,4850 | 3.328,7800 |
| 10j | 2.898,2000 | 17,6004 | 3.045,0500 | 18,4922 | 3.330,4000 |
| 11j | 2.898,9700 | 17,6051 | 3.046,1300 | 18,4988 | 3.332,0300 |
| 12j | 2.899,6800 | 17,6094 | 3.047,2400 | 18,5055 | 3.333,5900 |
| 13j | 2.900,4200 | 17,6139 | 3.048,3100 | 18,5120 | 3.334,3000 |
| 14j | 2.925,5000 | 17,7662 | 3.074,3900 | 18,6704 | 3.363,9700 |
| 15j | 2.926,1700 | 17,7703 | 3.075,5600 | 18,6775 | 3.365,7900 |
| 16j | 2.926,1700 | 17,7703 | 3.075,5600 | 18,6775 | 3.366,8400 |
| 17j | 2.926,1700 | 17,7703 | 3.075,5600 | 18,6775 | 3.367,8900 |
| 18j | 2.926,1700 | 17,7703 | 3.075,5600 | 18,6775 | 3.367,8900 |
| 19j | 2.950,5300 | 17,9182 | 3.100,5100 | 18,8290 | 3.395,8900 |
| 20j | 2.950,5300 | 17,9182 | 3.100,5100 | 18,8290 | 3.395,8900 |
| 21j | 2.951,8000 | 17,9259 | 3.101,7900 | 18,8368 | 3.396,7100 |
| 22j | 2.951,8000 | 17,9259 | 3.101,7900 | 18,8368 | 3.396,7100 |

Prompt:
```python
query = 'Wat is het loon van iemand in categorie 6 met 20j ervaring?'

prompt = f"""
Antwoord op de vraag met behulp van de bijgevoegde context.
Het is mogelijk dat je het antwoord moet vinden door het doorlopen van verschillende stappen.

We gaan een voorbeeld geven van een redenering die je kan gebruiken om het antwoord te vinden.
Wat is het loon van een chef met 20 jaar ervaring?
Stap 1: Vind de functiecategorie van een chef.
Stap 2: Vind het loon van een chef met 20 jaar ervaring.

Geef de volgende zaken terug:

Antwoord: Het antwoord op de vraag, gebruik makende van de context.
Redenering: Op welke manier ben je het antwoord bekomen.

Geef de output in json formaat terug.

Context: {txt}
Vraag: {query}
"""
```

(In English: the query asks "What is the salary of someone in category 6 with 20 years of experience?". The prompt tells the model to answer the question using the attached context, notes that it may need to reason in several steps, gives an example of such step-by-step reasoning, and asks it to return the answer plus the reasoning behind it, in JSON format.)

Code:

```python
import openai

# create a chat completion
chat_completion = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-16k",
    messages=[
        {"role": "system",
         "content": "You are a helpful assistant that helps recruiters find "
                    "information. The language spoken from here on out is Dutch."},
        {"role": "user", "content": prompt},
    ],
)
```

The response I get is as follows:
Antwoord: Het loon van iemand in categorie 6 met 20 jaar ervaring is €2.829,30 per maand.
Redenering:

  • Zoek de juiste categorie op in de tabel: categorie 6.
  • Zoek het juiste aantal functiejaren op, in dit geval is dat 20 jaar.
  • Lees het bijbehorende loon af in de tabel, wat in dit geval €2.829,30 is.

Which translated means:
Answer: The salary of someone in category 6 with 20 years of experience is €2,829.30 per month.
Reasoning:

  • Locate the correct category in the table: category 6.
  • Find the appropriate number of years of experience, in this case, it is 20 years.
  • Read the corresponding salary from the table, which in this case is €2,829.30.

Has anyone else run into issues like this? I can't just edit the source document, because the final deployment will look at multiple PDF files.

Many thanks in advance! This has been breaking my head. 😛

Hi,

This particular task seems unsuited to a large language model: it involves fixed table data, and you want deterministic results from it. Traditional software could do this task with ease; you could even get ChatGPT to write that function for you in Python.
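
For illustration, here is a minimal sketch of that deterministic route, assuming the wage tables have been exported to a normalized CSV with one row per category/experience pair. The file name, column names, and the `lookup_wage` helper are hypothetical, not something from the original post:

```python
import pandas as pd

# Hypothetical CSV layout (decimals normalized from the Dutch comma
# notation), e.g.:
#   category,years,monthly_wage
#   Categorie 6,20,3100.51
df = pd.read_csv("wages.csv")

def lookup_wage(category: str, years: int) -> float:
    """Deterministically look up the wage for a category and years of experience."""
    row = df[(df["category"] == category) & (df["years"] == years)]
    if row.empty:
        raise ValueError(f"No entry for {category} at {years} years")
    return float(row["monthly_wage"].iloc[0])

print(lookup_wage("Categorie 6", 20))  # 3100.51 with the sample row above
```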

Long lists of numbers and tabulated values that run on for thousands of tokens can strain either the token limit or the attention of the model, to the point that you get gibberish back.

Do you have an API call code snippet to look at for possible issues?


Or as I like to say, not every nail needs to be hit with the LLM hammer.


Thanks all for your quick replies. The reason I take the LLM route is that the user's question can be anything: the definition of a certain term, a value from a table, a comparison, or something else.

If I just wanted to fetch a specific, known value up front, I could use pandas, but I want the system to be fully flexible.


The weird thing is that it mostly looks at the right "row" and the right "column number", yet retrieves the value from the wrong "table".

How large is the MD file? And how are you giving the model access to it?

Hi @yves.demeersman

Instead of passing all the data in as markdown, store it in a CSV, a database, or any custom storage solution, and have the model generate queries for the lookup.

As @Foxalabs suggested, you can also use function calling to have the model look up the data; a sketch follows.
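
Here is a minimal sketch of that function-calling pattern with the same `ChatCompletion` API used above. The `lookup_wage` function (an in-memory stand-in for the deterministic pandas/DB lookup sketched earlier in the thread) and its schema are illustrative assumptions, not a confirmed design:

```python
import json
import openai

WAGES = {("Categorie 6", 20): 3100.51}  # demo data; a real CSV/DB in practice

def lookup_wage(category: str, years: int) -> float:
    """Deterministic table lookup, kept outside the LLM."""
    return WAGES[(category, years)]

functions = [{
    "name": "lookup_wage",
    "description": "Look up the wage for a job category and years of experience.",
    "parameters": {
        "type": "object",
        "properties": {
            "category": {"type": "string", "description": "e.g. 'Categorie 6'"},
            "years": {"type": "integer", "description": "years of experience"},
        },
        "required": ["category", "years"],
    },
}]

messages = [{"role": "user",
             "content": "Wat is het loon van iemand in categorie 6 met 20j ervaring?"}]

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-16k", messages=messages, functions=functions)

message = response["choices"][0]["message"]
if message.get("function_call"):
    # The model asks for the lookup; we run it deterministically and
    # hand the result back so the model can phrase the final answer.
    args = json.loads(message["function_call"]["arguments"])
    messages.append(message)
    messages.append({"role": "function", "name": "lookup_wage",
                     "content": str(lookup_wage(args["category"], args["years"]))})
    final = openai.ChatCompletion.create(model="gpt-3.5-turbo-16k",
                                         messages=messages)
    print(final["choices"][0]["message"]["content"])
```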


I'm not actually a fan of GPT's function calling architecture, but an LLM-driven user flow along these lines would be more reliable overall (see the sketch after this list):

  • identify that the user's question is asking for a value from a table;
  • have the LLM generate output from a YAML template;
  • validate the YAML;
  • pass the template values to some external API or function that retrieves the value without an LLM;
  • pass the value back to the LLM.
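
A minimal sketch of that flow, under the assumption that the model has been prompted to fill in a small YAML template; the field names, the example template, and the demo `lookup_wage` helper are all illustrative:

```python
import yaml  # PyYAML

WAGES = {("Categorie 6", 20): 3100.51}  # demo data; a real CSV/DB in practice

def lookup_wage(category: str, years: int) -> float:
    """Step 4: deterministic, non-LLM retrieval."""
    return WAGES[(category, years)]

REQUIRED_FIELDS = {"intent", "category", "years"}

def validate(template_text: str) -> dict:
    """Step 3: parse the model's YAML output and check the expected fields exist."""
    data = yaml.safe_load(template_text)
    if not isinstance(data, dict) or not REQUIRED_FIELDS <= data.keys():
        raise ValueError(f"invalid template: {template_text!r}")
    return data

# Steps 1-2: suppose the LLM classified the question as a table lookup
# and filled in the template like this:
model_output = """
intent: table_lookup
category: Categorie 6
years: 20
"""

data = validate(model_output)
if data["intent"] == "table_lookup":
    value = lookup_wage(data["category"], data["years"])
    # Step 5: pass `value` back to the LLM so it can phrase the answer.
    print(value)  # 3100.51
```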

As @Foxalabs says, this isn't the kind of task where LLMs excel. In fact, for non-linguistic in-context recall, accuracy drops to roughly 60% once you are dealing with 2,500-3,000 tokens. Quality degrades rapidly, so feeding pages of tables into the model is not desirable.