Using the GPT-3.5 Turbo API to look up information in a table

Hi,

I am working on a project where I want to look up information from a PDF file. For a certain use case, I need to do a relatively simple lookup in a table. The source document contains two tables, one after the other. I already converted the PDF to markdown to see if that would make it easier for GPT to select the correct one, but without success so far.

Context (snippet from the MD file):

|     | Categorie 1 | Categorie 2 | Categorie 3 | Categorie 4 | Categorie 4 |
| --- | ----------- | ----------- | ----------- | ----------- | ----------- |
| 0j  | 14,1124 | 14,1124 | 14,1941 | 2.439,9800 | 14,8177 |
| 1j  | 14,7451 | 14,7451 | 14,8815 | 2.552,5300 | 15,5012 |
| 2j  | 15,0368 | 15,0368 | 15,1760 | 2.612,3500 | 15,8645 |
| 3j  | 15,3114 | 15,3114 | 15,5028 | 2.659,6600 | 16,1518 |
| 4j  | 15,5082 | 15,5082 | 15,7579 | 2.684,0300 | 16,2998 |
| 5j  | 15,5082 | 15,5082 | 15,8646 | 2.719,2900 | 16,5139 |
| 6j  | 15,5082 | 15,5082 | 15,9742 | 2.719,2900 | 16,5139 |
| 7j  | 15,5082 | 15,5082 | 15,9742 | 2.754,5100 | 16,7278 |
| 8j  | 15,5082 | 15,5082 | 15,9742 | 2.754,5100 | 16,7278 |
| 9j  | 15,6479 | 15,6479 | 16,1145 | 2.778,8500 | 16,8756 |
| 10j | 15,6479 | 15,6479 | 16,1145 | 2.779,5900 | 16,8801 |
| 11j | 15,6479 | 15,6479 | 16,1145 | 2.780,4300 | 16,8852 |
| 12j | 15,6479 | 15,6479 | 16,1145 | 2.781,1000 | 16,8893 |
| 13j | 15,6479 | 15,6479 | 16,1145 | 2.781,8400 | 16,8938 |
| 14j | 15,7877 | 15,7877 | 16,2548 | 2.805,5400 | 17,0377 |
| 15j | 15,7877 | 15,7877 | 16,2548 | 2.806,3500 | 17,0426 |
| 16j | 15,7877 | 15,7877 | 16,2548 | 2.806,3500 | 17,0426 |
| 17j | 15,7877 | 15,7877 | 16,2548 | 2.806,3500 | 17,0426 |
| 18j | 15,7877 | 15,7877 | 16,2548 | 2.806,3500 | 17,0426 |
| 19j | 15,9276 | 15,9276 | 16,3953 | 2.829,3000 | 17,1820 |
| 20j | 15,9276 | 15,9276 | 16,3953 | 2.829,3000 | 17,1820 |
| 21j | 15,9276 | 15,9276 | 16,3953 | 2.830,4200 | 17,1888 |
| 22j | 15,9276 | 15,9276 | 16,3953 | 2.830,4200 | 17,1888 |
|     | Categorie 5 | Categorie 5 | Categorie 6 | Categorie 6 | Categorie 7 |
| --- | ----------- | ----------- | ----------- | ----------- | ----------- |
| 0j  | 2.579,8000 | 15,6668 | 2.648,0500 | 16,0813 | 3.010,8500 |
| 1j  | 2.672,6000 | 16,2304 | 2.751,6800 | 16,7106 | 3.048,2600 |
| 2j  | 2.727,9200 | 16,5663 | 2.817,5300 | 17,1105 | 3.083,8300 |
| 3j  | 2.771,0100 | 16,8280 | 2.870,0200 | 17,4293 | 3.119,6400 |
| 4j  | 2.801,4900 | 17,0131 | 2.910,6000 | 17,6757 | 3.154,7800 |
| 5j  | 2.819,2400 | 17,1209 | 2.936,9400 | 17,8357 | 3.190,8800 |
| 6j  | 2.837,0100 | 17,2288 | 2.964,2300 | 18,0014 | 3.226,2800 |
| 7j  | 2.854,6900 | 17,3362 | 2.991,3700 | 18,1662 | 3.262,7400 |
| 8j  | 2.872,4300 | 17,4439 | 3.017,8600 | 18,3271 | 3.299,3300 |
| 9j  | 2.897,5200 | 17,5963 | 3.043,8600 | 18,4850 | 3.328,7800 |
| 10j | 2.898,2000 | 17,6004 | 3.045,0500 | 18,4922 | 3.330,4000 |
| 11j | 2.898,9700 | 17,6051 | 3.046,1300 | 18,4988 | 3.332,0300 |
| 12j | 2.899,6800 | 17,6094 | 3.047,2400 | 18,5055 | 3.333,5900 |
| 13j | 2.900,4200 | 17,6139 | 3.048,3100 | 18,5120 | 3.334,3000 |
| 14j | 2.925,5000 | 17,7662 | 3.074,3900 | 18,6704 | 3.363,9700 |
| 15j | 2.926,1700 | 17,7703 | 3.075,5600 | 18,6775 | 3.365,7900 |
| 16j | 2.926,1700 | 17,7703 | 3.075,5600 | 18,6775 | 3.366,8400 |
| 17j | 2.926,1700 | 17,7703 | 3.075,5600 | 18,6775 | 3.367,8900 |
| 18j | 2.926,1700 | 17,7703 | 3.075,5600 | 18,6775 | 3.367,8900 |
| 19j | 2.950,5300 | 17,9182 | 3.100,5100 | 18,8290 | 3.395,8900 |
| 20j | 2.950,5300 | 17,9182 | 3.100,5100 | 18,8290 | 3.395,8900 |
| 21j | 2.951,8000 | 17,9259 | 3.101,7900 | 18,8368 | 3.396,7100 |
| 22j | 2.951,8000 | 17,9259 | 3.101,7900 | 18,8368 | 3.396,7100 |

Prompt:
```python
query = 'Wat is het loon van iemand in categorie 6 met 20j ervaring?'

prompt = f"""
Antwoord op de vraag met behulp van de bijgevoegde context.
Het is mogelijk dat je het antwoord moet vinden door het doorlopen van verschillende stappen.

We gaan een voorbeeld geven van een redenering die je kan gebruiken om het antwoord te vinden.
Wat is het loon van een chef met 20 jaar ervaring?
Stap 1: Vind de functiecategorie van een chef.
Stap 2: Vind het loon van een chef met 20 jaar ervaring.

Geef de volgende zaken terug:

Antwoord: Het antwoord op de vraag, gebruik makende van de context.
Redenering: Op welke manier ben je het antwoord bekomen.

Geef de output in json formaat terug.

Context: {txt}
Vraag: {query}
"""
```

(In English: the query asks "What is the salary of someone in category 6 with 20 years of experience?". The prompt tells the model to answer the question using the attached context, notes that it may need to reason in several steps, gives an example of such step-by-step reasoning, and asks it to return the answer plus the reasoning behind it, in JSON format.)

Code:

```python
import openai

# create a chat completion
chat_completion = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-16k",
    messages=[
        {"role": "system",
         "content": "You are a helpful assistant that helps recruiters find "
                    "information. The language spoken from here on out is Dutch."},
        {"role": "user", "content": prompt},
    ],
)
```

The response I get is as follows:
Antwoord: Het loon van iemand in categorie 6 met 20 jaar ervaring is €2.829,30 per maand.
Redenering:

  • Zoek de juiste categorie op in de tabel: categorie 6.
  • Zoek het juiste aantal functiejaren op, in dit geval is dat 20 jaar.
  • Lees het bijbehorende loon af in de tabel, wat in dit geval €2.829,30 is.

Which translated means:
Answer: The salary of someone in category 6 with 20 years of experience is €2,829.30 per month.
Reasoning:

  • Locate the correct category in the table: category 6.
  • Find the appropriate number of years of experience, in this case, it is 20 years.
  • Read the corresponding salary from the table, which in this case is €2,829.30.

Has anyone else run into issues like this? I can't just edit the source document, because the final deployment will look at multiple PDF files.

Many thanks in advance! This has been breaking my head. 😛

Hi,

This particular task seems unsuited to a large language model: it involves fixed table data, and you want deterministic results from it. Traditional software could do this task with ease; you could even get ChatGPT to write that function for you in Python.
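
For illustration, here is a minimal sketch of that deterministic route, assuming the wage tables have been exported to a normalized CSV with one row per category/experience pair. The file name, column names, and the `lookup_wage` helper are hypothetical, not something from the original post:

```python
import pandas as pd

# Hypothetical CSV layout (decimals normalized from the Dutch comma
# notation), e.g.:
#   category,years,monthly_wage
#   Categorie 6,20,3100.51
df = pd.read_csv("wages.csv")

def lookup_wage(category: str, years: int) -> float:
    """Deterministically look up the wage for a category and years of experience."""
    row = df[(df["category"] == category) & (df["years"] == years)]
    if row.empty:
        raise ValueError(f"No entry for {category} at {years} years")
    return float(row["monthly_wage"].iloc[0])

print(lookup_wage("Categorie 6", 20))  # 3100.51 with the sample row above
```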

Long lists of numbers and tabulated values that run on for thousands of tokens can strain either the token limit or the attention of the model, to the point that you get gibberish back.

Do you have an API call code snippet to look at for possible issues?


Or as I like to say, not every nail needs to be hit with the LLM hammer.


Thanks all for your quick replies. The reason I take the LLM route is that the user's question can be anything: the definition of a certain term, a value from a table, a comparison, or something else.

If I just wanted to fetch a specific, known value up front, I could use pandas, but I want the system to be fully flexible.


The weird thing is that it mostly looks at the right "row" and the right "column number", yet retrieves the value from the wrong "table".

How large is the MD file? And how are you giving the model access to it?

Hi @yves.demeersman

Instead of passing all the data in as markdown, store it in a CSV, a database, or any custom storage solution, and have the model generate queries for the lookup.

As @Foxalabs suggested, you can also use function calling to have the model look up the data; a sketch follows.
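
Here is a minimal sketch of that function-calling pattern with the same `ChatCompletion` API used above. The `lookup_wage` function (an in-memory stand-in for the deterministic pandas/DB lookup sketched earlier in the thread) and its schema are illustrative assumptions, not a confirmed design:

```python
import json
import openai

WAGES = {("Categorie 6", 20): 3100.51}  # demo data; a real CSV/DB in practice

def lookup_wage(category: str, years: int) -> float:
    """Deterministic table lookup, kept outside the LLM."""
    return WAGES[(category, years)]

functions = [{
    "name": "lookup_wage",
    "description": "Look up the wage for a job category and years of experience.",
    "parameters": {
        "type": "object",
        "properties": {
            "category": {"type": "string", "description": "e.g. 'Categorie 6'"},
            "years": {"type": "integer", "description": "years of experience"},
        },
        "required": ["category", "years"],
    },
}]

messages = [{"role": "user",
             "content": "Wat is het loon van iemand in categorie 6 met 20j ervaring?"}]

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-16k", messages=messages, functions=functions)

message = response["choices"][0]["message"]
if message.get("function_call"):
    # The model asks for the lookup; we run it deterministically and
    # hand the result back so the model can phrase the final answer.
    args = json.loads(message["function_call"]["arguments"])
    messages.append(message)
    messages.append({"role": "function", "name": "lookup_wage",
                     "content": str(lookup_wage(args["category"], args["years"]))})
    final = openai.ChatCompletion.create(model="gpt-3.5-turbo-16k",
                                         messages=messages)
    print(final["choices"][0]["message"]["content"])
```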


I'm not actually a fan of GPT's function calling architecture, but an LLM-driven user flow along these lines would be more reliable overall (see the sketch after this list):

  • identify that the user's question is asking for a value from a table;
  • have the LLM generate output from a YAML template;
  • validate the YAML;
  • pass the template values to some external API or function that retrieves the value without an LLM;
  • pass the value back to the LLM.
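
A minimal sketch of that flow, under the assumption that the model has been prompted to fill in a small YAML template; the field names, the example template, and the demo `lookup_wage` helper are all illustrative:

```python
import yaml  # PyYAML

WAGES = {("Categorie 6", 20): 3100.51}  # demo data; a real CSV/DB in practice

def lookup_wage(category: str, years: int) -> float:
    """Step 4: deterministic, non-LLM retrieval."""
    return WAGES[(category, years)]

REQUIRED_FIELDS = {"intent", "category", "years"}

def validate(template_text: str) -> dict:
    """Step 3: parse the model's YAML output and check the expected fields exist."""
    data = yaml.safe_load(template_text)
    if not isinstance(data, dict) or not REQUIRED_FIELDS <= data.keys():
        raise ValueError(f"invalid template: {template_text!r}")
    return data

# Steps 1-2: suppose the LLM classified the question as a table lookup
# and filled in the template like this:
model_output = """
intent: table_lookup
category: Categorie 6
years: 20
"""

data = validate(model_output)
if data["intent"] == "table_lookup":
    value = lookup_wage(data["category"], data["years"])
    # Step 5: pass `value` back to the LLM so it can phrase the answer.
    print(value)  # 3100.51
```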

As @Foxalabs says, this isn't the kind of task where LLMs excel. In fact, for non-linguistic in-context recall, accuracy drops to roughly 60% once you are dealing with 2,500-3,000 tokens. Quality degrades rapidly, so feeding pages of tables into the model is not desirable.