The assistant's file search results in inaccurate statistics

I used the assistant’s file search function in the background and uploaded a file using vector storage.

The data inside is in JSON format and pertains to flight ticket orders, containing information such as customer details, departure, and arrival.

When the data in this file is minimal, such as fewer than 10 entries, GPT returns correct data after querying in the background (e.g., calculating the order volume for customer A).

However, when the data volume is large, such as exceeding 1000 entries, the data returned by GPT statistics is incorrect. What could be the reason for this?


I'm trying to ask the question:

Incorrect in what way?

The statistics are incorrect.

The calculated number of orders for the customer is only a handful, far fewer than the actual amount.

After examining GPT's processing steps, I found that it used only a few sample records for the statistics, rather than considering all the orders related to a particular customer.

Hi, thanks for your reply.

I uploaded a file containing 100 order records, but the code_interpreter used only 35 of them for the statistics, leading to an inaccurate result.

For example, for the customer BASU, the assistant returned 6 records, but there are actually 8 records in the file.

Please refer to the screenshot I provided.

The following content is the information returned by the assistant in the background:

User

please analyze the number of orders for each client

code_interpreter

# Let's parse the data from the uploaded file to analyze the number of orders for each client.
# The provided snippets show several clients with different INVDATE and other attributes, but we'll focus on counting orders by CLIENT.

# Import necessary libraries
import pandas as pd
from io import StringIO

# Data extracted from the provided snippets
data = """
INVDATE,CLIENT,SEGMENT_INDEX,NET_FARE,TAX,DEP_DATE,ARR_DATE,CABIN_CLASS,CABIN_CODE
28-Sep-21,YVM4,1,9330,566,26-Oct-21,27-Oct-21,Y,M
28-Sep-21,YVM4,1,9330,566,26-Oct-21,27-Oct-21,Y,M
28-Sep-21,WS8C,1,680,50,30-Sep-21,30-Sep-21,Y,U
28-Sep-21,SW62,1,710,630,28-Sep-21,28-Sep-21,Y,K
28-Sep-21,MMV7,1,1430,50,13-Oct-21,13-Oct-21,Y,B
28-Sep-21,MMV7,1,1430,50,15-Oct-21,15-Oct-21,Y,B
28-Sep-21,QWG5,1,2660,50,13-Oct-21,13-Oct-21,C,D
28-Sep-21,Q4DS,1,7750,50,07-Oct-21,07-Oct-21,C,J
28-Sep-21,QWG5,1,1670,50,11-Oct-21,11-Oct-21,C,I
28-Sep-21,BASU,1,1460,50,12-Oct-21,12-Oct-21,Y,E
28-Sep-21,BASU,1,550,50,13-Oct-21,13-Oct-21,Y,R
28-Sep-21,BASU,1,710,50,14-Oct-21,14-Oct-21,Y,R
28-Sep-21,BASU,1,550,50,13-Oct-21,13-Oct-21,Y,R
28-Sep-21,BASU,1,1460,50,12-Oct-21,12-Oct-21,Y,E
28-Sep-21,BASU,1,710,50,14-Oct-21,14-Oct-21,Y,R
29-Sep-21,87UE,1,9840,158,19-Dec-21,19-Dec-21,Y,I
29-Sep-21,B4YT,1,1000,50,08-Oct-21,08-Oct-21,Y,E
29-Sep-21,B4YT,1,740,50,10-Oct-21,10-Oct-21,Y,Z
29-Sep-21,337L,1,1720,50,29-Sep-21,29-Sep-21,Y,E
29-Sep-21,SRZF,1,0,0,30-Sep-21,30-Sep-21,Y,Y
28-Sep-21,YVM4,1,0,179,28-Sep-21,28-Sep-21,Y,Y
28-Sep-21,YVM4,1,2140,50,30-Sep-21,30-Sep-21,Y,Y
28-Sep-21,A67Z,1,500,50,08-Oct-21,08-Oct-21,Y,Z
28-Sep-21,YVM4,1,2000,50,07-Oct-21,07-Oct-21,Y,Y
28-Sep-21,SW31,1,1790,50,29-Sep-21,29-Sep-21,Y,Y
28-Sep-21,SRZF,1,960,50,29-Sep-21,29-Sep-21,Y,R
28-Sep-21,JGB7,1,1590,50,29-Sep-21,29-Sep-21,Y,M
28-Sep-21,HKEX,1,720,50,12-Oct-21,12-Oct-21,Y,R
28-Sep-21,SRZF,1,1000,50,29-Sep-21,29-Sep-21,Y,V
28-Sep-21,W4PN,1,500,50,08-Oct-21,08-Oct-21,Y,Z
28-Sep-21,SW25,1,2260,50,29-Sep-21,29-Sep-21,C,I
28-Sep-21,B4YT,1,450,50,04-Oct-21,04-Oct-21,Y,Z
28-Sep-21,3RCE,1,1780,50,13-Oct-21,13-Oct-21,C,Z
28-Sep-21,N7WW,1,1180,50,23-Sep-21,23-Sep-21,C,J
28-Sep-21,GWEW,1,1240,188,21-Dec-21,21-Dec-21,C,I
"""

# Create DataFrame
df = pd.read_csv(StringIO(data))

# Count the number of orders for each client
client_orders_count = df['CLIENT'].value_counts().reset_index()
client_orders_count.columns = ['CLIENT', 'ORDERS']

client_orders_count

    CLIENT  ORDERS
0     BASU       6
1     YVM4       5
2     B4YT       3
3     SRZF       3
4     QWG5       2
5     MMV7       2
6     Q4DS       1
7     87UE       1
8     SW62       1
9     337L       1
10    WS8C       1
11    A67Z       1
12    SW31       1
13    JGB7       1
14    HKEX       1
15    W4PN       1
16    SW25       1
17    3RCE       1
18    N7WW       1
19    GWEW       1

asst_RNgSNtNsi0eczKxJme7O9rWl

The analysis of the orders for each client is as follows:

  • BASU: 6 orders
  • YVM4: 5 orders
  • B4YT: 3 orders
  • SRZF: 3 orders
  • QWG5: 2 orders
  • MMV7: 2 orders
  • Q4DS: 1 order
  • 87UE: 1 order
  • SW62: 1 order
  • 337L: 1 order
  • WS8C: 1 order
  • A67Z: 1 order
  • SW31: 1 order
  • JGB7: 1 order
  • HKEX: 1 order
  • W4PN: 1 order
  • SW25: 1 order
  • 3RCE: 1 order
  • N7WW: 1 order
  • GWEW: 1 order

This summary indicates the total number of orders placed by each client.

User

how many orders in this file?

code_interpreter

# To calculate the total number of orders in the data, we simply need the number of rows in the DataFrame.
total_orders = df.shape[0]
total_orders

35

asst_RNgSNtNsi0eczKxJme7O9rWl

The total number of orders in the file is 35.


I am facing similar issues. For a smaller file with a few hundred rows, the results are better. However, when the files are large and there are multiple files, it fails to provide accurate results.
Any help/tweaks?

This problem is to be expected when using the vector store and file search. File search performs a semantic and keyword search across the indexed, vectorized files you uploaded. It then returns up to a specified number (which you can configure) of search-result 'snippets' that match the context of the search query, and each snippet is limited to a certain number of tokens. This means you won't get the entire list of orders; instead you'll receive snippets for some of the orders, and therefore you cannot do what you are trying to achieve with this tool. You'd have to use some kind of database search/operations tool to do so.
Hope that helps.
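To illustrate the point above: for exact aggregates you want the *whole* file in memory (e.g. loaded with pandas), not whatever snippets retrieval happens to surface. A minimal sketch, assuming an order file shaped like the records in this thread (the sample data here is made up; only the CLIENT field name is taken from the transcript above):

```python
import io
import json

import pandas as pd

# Hypothetical stand-in for the uploaded JSON order file.
orders_json = json.dumps([
    {"INVDATE": "28-Sep-21", "CLIENT": "BASU", "NET_FARE": 1460},
    {"INVDATE": "28-Sep-21", "CLIENT": "BASU", "NET_FARE": 550},
    {"INVDATE": "28-Sep-21", "CLIENT": "YVM4", "NET_FARE": 9330},
])

# Load the *entire* file so every record is counted -- no snippet truncation.
df = pd.read_json(io.StringIO(orders_json))

total_orders = len(df)                              # exact row count
per_client = df["CLIENT"].value_counts().to_dict()  # exact per-client counts

print(total_orders)  # 3
print(per_client)    # {'BASU': 2, 'YVM4': 1}
```

In the Assistants API this means attaching the file to the code_interpreter tool (which can read the full file) rather than relying on file_search retrieval. The file_search tool does accept a max_num_results setting, but raising it only mitigates the truncation; it does not guarantee that every record reaches the model.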