Using Assistant API Retrieval Hallucinates

Hello!
I’m using the Assistant API. Created one with a JSON file that contains 50 products with description and important parameters. I also have a function that shows the title and ID of the product that is being recommended to the user. I’m having issues answering simple questions like:

  1. Show me all the products that are from X category.
  2. Show me all the products that have Y characteristic.

Has anyone found the best file type and/or best file format that increases the performance of the assistants?

As added context, I’ve tried the newer GTP4 turbo model and activating Code Interpreter.

4 Likes

I am having also hallucinations passing the info into a markdown file to a GPT4 turbo configured assistant. In my case the main issue is handing urls provided in he files.

4 Likes

I have the same issue.
I uploaded JSON documents with various sport statistics and then even with simple question the assistant is not able to get them in the files.
Event when I give it the exact file to search in he is not able
I also tried to detail the role of every file in the description of the assistant but it does not work.

I hope it’s just some issue indexing the files since it’s only the beginning. But for now it’s really not helpful.

2 Likes

Hey @sergio.soto and @adrien.dulong . I finally had success with the reliable and simple plain .TXT file. Not sure what your use case is but I’m managing a product catalog with almost 100 rows and every row with over 2000 characters.

Final result, no hallucinations and was able to integrate well with functions. What has worked for my use case and tests is to create a CSV and pass that to a TXT file but using none common delimiters. For example, symbols instead of the typical commas and spaces.

Hope this helps!

7 Likes

Hi @diego.cardenas , thank you for your comment!

This is essentially a markdown: first row has field titles and next rows are data separated by a | (vertical bar symbol).

In my case the hallucinates are related with a field that is an url. Sometimes GPT4 removes the last number of the ID (is the end of the url), sometimes literally invent it.

Try encapsulating each content of the row and column in a . Again, the “|” or the commas and spaces isn’t enough. You could try converting the markdown or plain text file you have into CSV with GPT to see how it reads it. If you’re end results is a CSV with content out of place, you need to change the delimiters.

this created enormous improvement for us in terms of error rates. We are working with a 6000 product feed.

Thank you Diego!

1 Like

Thanks for coming back to let us know. We love to hear this as a community!

2 Likes

@diego.cardenas

Hey, Diego.
I used symbols like ‘®’ as delimiters, but seems like the assistant can use only up to 10% of the file at once. The file itself is CSV with custom delimiters in a .txt file. It has about 2000 rows and each row has about 10 different values.
Can you also throw some light on your experience with functions? Which functions helped your assistant to parse files efficiently?

Guys, is it possible to reach an specific file or assistant via API? I cound not find this example in documentation only to a specific API Key or via Playground.
The problem you mentioned above I solved using double “” before and after and in the instruction reinforcing to mention the attached doc information

Use files api to upload file, create assistant and thread and attach file to it using new openai assistant api

1 Like

So first I create an assistent > get the assistant_id > then send the file > and only after all I can run questions right? But How can ask to this new specific assistant_id later for instance?

Just saw here in documentation everything worked fine (Upload file, created assistent and attached the file but right after I created a thread I got an error or this response :
{
“id”: “thread_NyBgr6cYHo8iUGblB5OD16jA”,
“object”: “thread”,
“created_at”: 1701210208,
“metadata”: {}
}

I tried this https://api.openai.com/v1/threads and this https://api.openai.com/v1/threads/thread_GQjV02uQ7eLwmwg1abBQGTX0/messages

Right after I got the thread ID which post URL I should use in order to make the prompt to the specific file_id?

Thanks a thounsand for your assistance @yurifromuk

@diego.cardenas Would you be able to share a sample of the .txt file that you are using? Encountering similar problems here

Hey @yurifromuk ! Were you able to solve the issue you’re encountering?

Sure thing! I have noticed a decrease in efficiency for an assistant that was working perfectly and now is not. Not sure if it’s some changes going on to the Assistants since it’s still in beta.

Anyways, here is the document. I’m sending over an example. Please share what results you’re getting.

The file type is TXT (plain text file) and the content is the following:

|product_retailer_id|title|description|availability|condition|price|sale_price|category|brand|
|---|---|---|---|---|---|---|---|---|
|['10001']|['CB250 TWISTER GRIS']|['Agresiva de corazón, ágil de diseño. La nueva CB250 Twister, es una motocicleta hecha para derrotar a todo lo que, se cruce en su camino.<br>  <br> <br><br>  Características:<br>  • Marcas: Honda<br>  • Tipos de moto: Pisteras<br>  • Color: Gris<br>  • Cilindrada: 249.6 cc<br>  • Tipo de motor: Refrigerado por aire\, 4 tiempos\, OHC<br>  • Potencia: 22.1 HP / 7500 rpm<br>  • Transmisión: 6 velocidades<br>  • Freno Delantero: Disco<br>  • Freno posterior: Disco<br>  • Suspensión Delantera: Telescópica<br>  • Suspensión Posterior: Monoamortiguado<br>  • Meses de garantia: 12<br>  • Cantidad de ruedas: Lineal<br>  • Tipo de transmisión: 6 velocidades<br>  • Tipo de Frenos: Disco']|['in stock']|['new']|['PEN 20099.00']||['Pisteras']|['Honda']|
|['10002']|['EVO 125 ROJO']|['Rápida, livina y económica. Precisa para aquellos que por primera vez manejarán una moto, siendo ligera e ideal para el desplazamiento diario<br>  Si lo que buscas es comodidad a la hora de manejar en la ciudad, las motos con transmisión automática son ideales y cuentan con espacios para llevar algo de equipaje en su interior y están pensadas para la altura de todo tipo de conductores.<br>  <br> <br><br>  Características:<br>  • Marcas: Zongshen<br>  • Tipos de moto: Scooter<br>  • Color: Rojo<br>  • Cilindrada: 125.0 cc<br>  • Tipo de motor: 4 tiempos SOHC<br>  • Potencia: 9.10 HP a 8\,000 rpm<br>  • Transmisión: Automática<br>  • Freno Delantero: Disco<br>  • Freno posterior: Tambor<br>  • Suspensión Delantera: Horquillas telescópicas<br>  • Suspensión Posterior: Mono amortiguador<br>  • Meses de garantia: 12<br>  • Kilometros de garantia: 10\,000<br>  • Cantidad de ruedas: Lineal<br>  • Tipo de Frenos: Disco, Tambor<br>  • Tipo de transmisión: Automática']|['in stock']|['new']|['PEN 4419.00']||['Scooter']|['Zongshen']|

Thanks @diego.cardenas ! This is an interesting approach, shall give it a shot on my end.

1 Like

This worked for me, thank you @diego.cardenas and team!

Exported a Flat 2D Table from Excel into a Tab Delimited and using Code Interpreter I am getting the results I am looking for, matching incoming descriptions to nodes in my data based on the terms I have for them.

It is doing some pretty advanced reasoning right out of the gate, although it is very, very slow.

No worries, now that I have a basic version working I can start to break and iterate :smile:

This was a lot of trial and error, I see so many people out here trying to do something similar, looking up values tabular data, hopefully devs address this need.

1 Like

A couple of tips without specifically knowing how OpenAI is splitting documents… I’m going to just assume that they’re using Lang Chains Recursive Text Splitter class or some variant.

For markdown specifically, you want to use headers to separate out chunks of related information. Using #, ##, and ### you can actually control the chunking of data into their Vector DB and therefor steer the model. I don’t know what their chunk size is but its probably somewhere around 400 tokens (if not exactly 400 tokens) so try to keep chunks of related data around this size.

Models don’t have any sort of spatial awareness with regards to table data so showing them data using markdown tables is likely to have mixed results. You should reformat any tables to be a list of records. I’ll give an example:

| Name    | Age  | Title    |
| ------- | ---- | -------- |
| Steve   | 55   | Founder  |
| Paul    | 48   | Founder  |
| Brian   | 35   | Engineer |

Should be presented to the model as:

### Steve
Age: 55
Title: Founder

### Paul
Age: 48
Title: Founder

### Brian
Age: 35
Title: Engineer 

You really want to present table data to the model as whole record chunks where possible.

My final tip is to think about how the users query matches semantically to the chunks your searching over… Every chunk needs to map to a concept for cosine similarity to work. If you have a chunk of text that’s “| 5’10 | 55 | 4 |” what concepts does that chunk of text map to? If your chunk is instead “height: 5’10, age: 55, dependents: 4”, there are 3 clear concepts that this chunk maps to. So a query for “show me people over 50 with 3 or more dependents” has a much better chance of retrieving the second chunk then the first one.

Keep in mind though that these models are horribly at both math and counting so it’s a crap shoot if that query works in the first place. But without encoding the concepts into the chunk you have zero shot of it working.

Hope those tips help…

2 Likes

Here’s an example document from a test corpus I’m working with called Bowl Season. Bowl Season contains 42 documents currently (43 after next Monday) which summarize the details and outcome of all 43 college football bowl games. So each game is a separate document in the corpus:

# Capital One Orange Bowl
Georgia Bulldogs (13-1)
Q1: 7
Q2: 35
Q3: 14
Q4: 7
TOTAL: 63

Florida State Seminoles (13-1)
Q1: 0
Q2: 3
Q3: 0
Q4: 0
TOTAL: 3

## Game Information
Hard Rock Stadium
Miami Gardens, FL
4:00 PM EST, December 30, 2023
Coverage: ESPN/ESPN+
Line: UGA -23.5
Over/Under: 47.5
Attendance: 63,324 (97%)
CAPACITY: 64,992

## Game Leaders
### Passing Yards
UGA
C. Beck
13-18, 203 YDS, 2 TD

FSU
B. Glenn
9-26, 139 YDS, 2 INT

### Rushing Yards
UGA
K. Milton
9 CAR, 104 YDS, 2 TD

FSU
J. Douglas
8 CAR, 46 YDS

### Receiving Yards
UGA
D. Bell
5 REC, 86 YDS

FSU
K. Poitier
4 REC, 84 YDS

## Team Stats
### Total Yards
UGA: 673
FSU: 209

### Turnovers
UGA: 0
FSU: 4

### 1st Downs
UGA: 36
FSU: 11

### Possession
UGA: 35:38
FSU: 24:22

## Scoring Summary
### 1ST QUARTER
Georgia Bulldogs	
TD	4:05	
Kendall Milton 15 Yd Run (Peyton Woodring Kick)
7 plays, 69 yards, 2:40

### 2ND QUARTER
Georgia Bulldogs	
TD	14:57	
Kendall Milton 5 Yd Run (Peyton Woodring Kick)
6 plays, 82 yards, 1:56

Florida State Seminoles	
FG	12:34	
Ryan Fitzgerald 22 Yd Field Goal
6 plays, 71 yards, 2:23

Georgia Bulldogs	
TD	10:38	
Daijun Edwards 15 Yd Run (Peyton Woodring Kick)
4 plays, 75 yards, 1:56

Georgia Bulldogs	
TD	10:18	
Ladd McConkey 27 Yd Run (Peyton Woodring Kick)
1 play, 27 yards, 0:20

Georgia Bulldogs	
TD	3:39	
Arian Smith 12 Yd pass from Carson Beck (Peyton Woodring Kick)
5 plays, 62 yards, 2:20

Georgia Bulldogs	
TD	0:24	
Dominic Lovett 2 Yd pass from Carson Beck (Peyton Woodring Kick)
3 plays, 51 yards, 0:25

### 3RD QUARTER
Georgia Bulldogs	
TD	9:23	
Daijun Edwards 2 Yd Run (Peyton Woodring Kick)
10 plays, 75 yards, 5:37

Georgia Bulldogs	
TD	2:30	
Lawson Luckie 4 Yd pass from Gunner Stockton (Peyton Woodring Kick)
10 plays, 90 yards, 4:31

### 4TH QUARTER
Georgia Bulldogs	
TD	12:10	
Anthony Evans III 14 Yd pass from Gunner Stockton (Peyton Woodring Kick)
9 plays, 84 yards, 4:31

You can see how I’m using headers to control the chunking of the data. The system I’m building to consume this corpus is already capable of accurately answering questions like “tell me the final score of every bowl game and the player from each game that had the most receiving yards” Both GPT-4 and GPT-3.5 can accurately return both for all 42 games.

There are more tricks needed to make that work then just proper data prep, but data prep is super important.

One more tip is that ideally you want to present the model with a condensed version of the document that contains all the information needed to answer the question while retaining the general structure of the source document. So lets the say the query is “what games played in Florida had 4th quarter scores by the winner?” This is the ideal text you would show the model:

# Capital One Orange Bowl
Georgia Bulldogs (13-1)
Q1: 7
Q2: 35
Q3: 14
Q4: 7
TOTAL: 63

Florida State Seminoles (13-1)
Q1: 0
Q2: 3
Q3: 0
Q4: 0
TOTAL: 3

## Game Information
Hard Rock Stadium
Miami Gardens, FL
4:00 PM EST, December 30, 2023

## Scoring Summary
### 4TH QUARTER
Georgia Bulldogs	
TD	12:10	
Anthony Evans III 14 Yd pass from Gunner Stockton (Peyton Woodring Kick)
9 plays, 84 yards, 4:31

That is the minimum information needed for the model to answer that question. The challenge is working out that’s the text you need to show the model.

And here’s GPT 3.5’s answer to that question: