The assistant's file search results in inaccurate statistics

I used the assistant’s file search function in the background and uploaded a file using vector storage.

The data inside is in JSON format and pertains to flight ticket orders, containing information such as customer details, departure, and arrival.

When the data in this file is minimal, such as fewer than 10 entries, GPT returns correct data after querying in the background (e.g., calculating the order volume for customer A).

However, when the data volume is large, such as exceeding 1000 entries, the data returned by GPT statistics is incorrect. What could be the reason for this?


I'm trying to ask the question:

Incorrect in what way?

The statistics are incorrect.

The calculated number of orders for the customer is only a handful, far fewer than the actual amount.

After examining GPT's processing steps, I found that it used only a few sample records for the statistics, rather than considering all the orders related to a particular customer.

Hi, thanks for your reply.

I uploaded a file containing 100 order records, but the code_interpreter used only 35 of them for the statistics, leading to an inaccurate result.

For example, for the customer BASU, the assistant returned 6 records, but there are actually 8 records in the file.

Please refer to the screenshot I provided.

The following content is the information returned by the assistant in the background:

User

please analyze the number of orders for each client

code_interpreter

# Let's parse the data from the uploaded file to analyze the number of orders for each client.
# The provided snippets show several clients with different INVDATE and other attributes, but we'll focus on counting orders by CLIENT.

# Import necessary libraries
import pandas as pd
from io import StringIO

# Data extracted from the provided snippets
data = """
INVDATE,CLIENT,SEGMENT_INDEX,NET_FARE,TAX,DEP_DATE,ARR_DATE,CABIN_CLASS,CABIN_CODE
28-Sep-21,YVM4,1,9330,566,26-Oct-21,27-Oct-21,Y,M
28-Sep-21,YVM4,1,9330,566,26-Oct-21,27-Oct-21,Y,M
28-Sep-21,WS8C,1,680,50,30-Sep-21,30-Sep-21,Y,U
28-Sep-21,SW62,1,710,630,28-Sep-21,28-Sep-21,Y,K
28-Sep-21,MMV7,1,1430,50,13-Oct-21,13-Oct-21,Y,B
28-Sep-21,MMV7,1,1430,50,15-Oct-21,15-Oct-21,Y,B
28-Sep-21,QWG5,1,2660,50,13-Oct-21,13-Oct-21,C,D
28-Sep-21,Q4DS,1,7750,50,07-Oct-21,07-Oct-21,C,J
28-Sep-21,QWG5,1,1670,50,11-Oct-21,11-Oct-21,C,I
28-Sep-21,BASU,1,1460,50,12-Oct-21,12-Oct-21,Y,E
28-Sep-21,BASU,1,550,50,13-Oct-21,13-Oct-21,Y,R
28-Sep-21,BASU,1,710,50,14-Oct-21,14-Oct-21,Y,R
28-Sep-21,BASU,1,550,50,13-Oct-21,13-Oct-21,Y,R
28-Sep-21,BASU,1,1460,50,12-Oct-21,12-Oct-21,Y,E
28-Sep-21,BASU,1,710,50,14-Oct-21,14-Oct-21,Y,R
29-Sep-21,87UE,1,9840,158,19-Dec-21,19-Dec-21,Y,I
29-Sep-21,B4YT,1,1000,50,08-Oct-21,08-Oct-21,Y,E
29-Sep-21,B4YT,1,740,50,10-Oct-21,10-Oct-21,Y,Z
29-Sep-21,337L,1,1720,50,29-Sep-21,29-Sep-21,Y,E
29-Sep-21,SRZF,1,0,0,30-Sep-21,30-Sep-21,Y,Y
28-Sep-21,YVM4,1,0,179,28-Sep-21,28-Sep-21,Y,Y
28-Sep-21,YVM4,1,2140,50,30-Sep-21,30-Sep-21,Y,Y
28-Sep-21,A67Z,1,500,50,08-Oct-21,08-Oct-21,Y,Z
28-Sep-21,YVM4,1,2000,50,07-Oct-21,07-Oct-21,Y,Y
28-Sep-21,SW31,1,1790,50,29-Sep-21,29-Sep-21,Y,Y
28-Sep-21,SRZF,1,960,50,29-Sep-21,29-Sep-21,Y,R
28-Sep-21,JGB7,1,1590,50,29-Sep-21,29-Sep-21,Y,M
28-Sep-21,HKEX,1,720,50,12-Oct-21,12-Oct-21,Y,R
28-Sep-21,SRZF,1,1000,50,29-Sep-21,29-Sep-21,Y,V
28-Sep-21,W4PN,1,500,50,08-Oct-21,08-Oct-21,Y,Z
28-Sep-21,SW25,1,2260,50,29-Sep-21,29-Sep-21,C,I
28-Sep-21,B4YT,1,450,50,04-Oct-21,04-Oct-21,Y,Z
28-Sep-21,3RCE,1,1780,50,13-Oct-21,13-Oct-21,C,Z
28-Sep-21,N7WW,1,1180,50,23-Sep-21,23-Sep-21,C,J
28-Sep-21,GWEW,1,1240,188,21-Dec-21,21-Dec-21,C,I
"""

# Create DataFrame
df = pd.read_csv(StringIO(data))

# Count the number of orders for each client
client_orders_count = df['CLIENT'].value_counts().reset_index()
client_orders_count.columns = ['CLIENT', 'ORDERS']

client_orders_count

    CLIENT  ORDERS
0     BASU       6
1     YVM4       5
2     B4YT       3
3     SRZF       3
4     QWG5       2
5     MMV7       2
6     Q4DS       1
7     87UE       1
8     SW62       1
9     337L       1
10    WS8C       1
11    A67Z       1
12    SW31       1
13    JGB7       1
14    HKEX       1
15    W4PN       1
16    SW25       1
17    3RCE       1
18    N7WW       1
19    GWEW       1

asst_RNgSNtNsi0eczKxJme7O9rWl

The analysis of the orders for each client is as follows:

  • BASU: 6 orders
  • YVM4: 5 orders
  • B4YT: 3 orders
  • SRZF: 3 orders
  • QWG5: 2 orders
  • MMV7: 2 orders
  • Q4DS: 1 order
  • 87UE: 1 order
  • SW62: 1 order
  • 337L: 1 order
  • WS8C: 1 order
  • A67Z: 1 order
  • SW31: 1 order
  • JGB7: 1 order
  • HKEX: 1 order
  • W4PN: 1 order
  • SW25: 1 order
  • 3RCE: 1 order
  • N7WW: 1 order
  • GWEW: 1 order

This summary indicates the total number of orders placed by each client.

User

how many orders in this file?

code_interpreter

# To calculate the total number of orders in the data, we simply need the number of rows in the DataFrame.
total_orders = df.shape[0]
total_orders

35

asst_RNgSNtNsi0eczKxJme7O9rWl

The total number of orders in the file is 35.


I am facing similar issues. For a smaller file with a few hundred rows, the results are better. However, when the files are large and there are multiple files, it fails to provide accurate results.
Any help/tweaks?

This problem is to be expected when using the vector store and file search. File search performs a semantic and keyword search across the indexed, vectorized files you uploaded. It then returns up to a specified number (which you can configure) of search-result 'snippets' that match the context of the search query, and each snippet is limited to a certain number of tokens. This means you won't get the entire list of orders; instead you'll receive snippets for some of the orders, and therefore you cannot do what you are trying to achieve with this tool. You'd have to use some kind of database search/operations tool to do so.
Hope that helps.
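To illustrate the point above: for exact aggregates you want the *whole* file in memory (e.g. loaded with pandas), not whatever snippets retrieval happens to surface. A minimal sketch, assuming an order file shaped like the records in this thread (the sample data here is made up; only the CLIENT field name is taken from the transcript above):

```python
import io
import json

import pandas as pd

# Hypothetical stand-in for the uploaded JSON order file.
orders_json = json.dumps([
    {"INVDATE": "28-Sep-21", "CLIENT": "BASU", "NET_FARE": 1460},
    {"INVDATE": "28-Sep-21", "CLIENT": "BASU", "NET_FARE": 550},
    {"INVDATE": "28-Sep-21", "CLIENT": "YVM4", "NET_FARE": 9330},
])

# Load the *entire* file so every record is counted -- no snippet truncation.
df = pd.read_json(io.StringIO(orders_json))

total_orders = len(df)                              # exact row count
per_client = df["CLIENT"].value_counts().to_dict()  # exact per-client counts

print(total_orders)  # 3
print(per_client)    # {'BASU': 2, 'YVM4': 1}
```

In the Assistants API this means attaching the file to the code_interpreter tool (which can read the full file) rather than relying on file_search retrieval. The file_search tool does accept a max_num_results setting, but raising it only mitigates the truncation; it does not guarantee that every record reaches the model.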