Does OpenAI not chunk my documents in the vector store?

Hi, first timer here. I looked at some similar questions, but the only solution I found was to pay $50, and I don’t think that will help given the sizes and number of files I am working with. I am looking for any other solutions.
I created a file search Assistant, and uploaded some files to a vector store. Then I created a Thread and tried to create a run.

The run always fails and I always get this error:
LastError( 32521. The input or output tokens must be reduced in order to run successfully. Visit https://platform.openai.com/account/rate-limits to learn more.')
truncation_strategy=TruncationStrategy(type='auto', last_messages=None), usage=Usage(completion_tokens=24, prompt_tokens=845, total_tokens=869), temperature=1.0, top_p=1.0, tool_resources={})

I don’t get this error when the file I upload for file search is tiny.
I have paid $5 already.
I thought the OpenAI docs said it would auto-chunk the files I upload and run a keyword and semantic search over them to retrieve only the relevant data?
Does OpenAI send the entire doc in the prompt?
Can I modify the chunking strategy to fit my quota limits?

Details:

from openai import OpenAI

# config is assumed to be loaded elsewhere (e.g. from a JSON or YAML file)
client = OpenAI(api_key=config["openai_api_key"])
 
assistant = client.beta.assistants.create(
  name="Assistant",
  instructions="You are an expert on a person. Use you knowledge base to answer questions about his works.",
  model="gpt-4o",
  tools=[{"type": "file_search"}],
)

# Create a vector store called "TheData"
vector_store = client.beta.vector_stores.create(name="TheData")
 
# Ready the files for upload to OpenAI
file_paths = [<some large files>]
file_streams = [open(path, "rb") for path in file_paths]
 
# Use the upload and poll SDK helper to upload the files, add them to the vector store,
# and poll the status of the file batch for completion.
file_batch = client.beta.vector_stores.file_batches.upload_and_poll(
  vector_store_id=vector_store.id, files=file_streams
)
 
# You can print the status and the file counts of the batch to see the result of this operation.
print(file_batch.status)
print(file_batch.file_counts)

assistant = client.beta.assistants.update(
  assistant_id=assistant.id,
  tool_resources={"file_search": {"vector_store_ids": [vector_store.id]}},
)

# Save the assistant's ID for future use
with open("assistant_id.txt", "w") as f:
    f.write(assistant.id)

print("Assistant setup complete. ID saved to assistant_id.txt")

Now I try to send a message to the Assistant:


thread = client.beta.threads.create(
    messages=[
        {
            "role": "user",
            "content": "tell me about the main themes in doc 1"
        }
    ]
)

# Helper matching the save step above: read the assistant ID back from disk
def load_assistant_id():
    with open("assistant_id.txt") as f:
        return f.read().strip()

assistant_id_got = load_assistant_id()

run = client.beta.threads.runs.create_and_poll(
  thread_id=thread.id,
  assistant_id=assistant_id_got,
  instructions="Please address the user as Jane Doe. The user has a premium account."
)

if run.status == 'completed':
    messages = client.beta.threads.messages.list(
        thread_id=thread.id
    )
    print(messages)
elif run.status == 'failed':
    print(run)  # the run object carries last_error with the failure details
else:
    print(run.status)



Yes, you are encountering the absurdly low token rate limit given to tier 1: 30,000 tokens per minute. A single assistant run makes multiple internal calls to the model in rapid succession, each loaded up with vector store results, and a thread with a growing conversation carries multiple vector store returns along with it. The limiter predicts the token usage of each API call and blocks any call that would exceed the rate.
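
If you want to see the budget your tier actually gives you, every API response carries x-ratelimit-* headers. A minimal sketch of reading them with the Python SDK's with_raw_response wrapper (assumes OPENAI_API_KEY is set in the environment; the 1-token chat call is just a cheap probe):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Any request echoes your current limits back in its headers
response = client.chat.completions.with_raw_response.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=1,
)
print(response.headers.get("x-ratelimit-limit-tokens"))      # e.g. 30000 on tier 1
print(response.headers.get("x-ratelimit-remaining-tokens"))  # what is left this minute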

You have no control over how much of a thread's past chat history is sent to the model, and that history includes tool messages you never see.

gpt-4o-mini has a higher TPM limit if you need to switch models to continue a thread.
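
Runs accept a per-run model override, so you can keep the assistant as-is and point an individual run at the cheaper model. A sketch, reusing the thread and assistant IDs from the question:

run = client.beta.threads.runs.create_and_poll(
    thread_id=thread.id,
    assistant_id=assistant_id_got,
    model="gpt-4o-mini",  # overrides the assistant's model for this run only
)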

But you are on the right track: controlling the chunk size and the number of chunks returned is the main lever you have.

You will have to remove the files from the vector store and re-embed them with a chunking strategy that uses a smaller chunk size and a smaller overlap. Then, as a file_search parameter on the assistant, you can set max_num_results to cap the chunks returned below the default of 20, and experiment with ranking_options and its score threshold (0.45 is a good value), as in the sketch below.
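
A sketch of both knobs, reusing the vector_store and assistant objects from the question. The numbers (400-token chunks, 100-token overlap, 4 results) are illustrative starting points, not documented recommendations, and uploaded_file.id stands in for a file you have already uploaded with purpose="assistants":

# Re-add a file with smaller chunks and less overlap
# (the defaults are 800-token chunks with a 400-token overlap)
client.beta.vector_stores.files.create(
    vector_store_id=vector_store.id,
    file_id=uploaded_file.id,  # placeholder: a file from client.files.create(...)
    chunking_strategy={
        "type": "static",
        "static": {
            "max_chunk_size_tokens": 400,
            "chunk_overlap_tokens": 100,
        },
    },
)

# Cap how many chunks a single run can pull into the prompt
assistant = client.beta.assistants.update(
    assistant_id=assistant.id,
    tools=[{
        "type": "file_search",
        "file_search": {
            "max_num_results": 4,  # default is 20 on gpt-4o
            "ranking_options": {"ranker": "auto", "score_threshold": 0.45},
        },
    }],
)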

A thread (like an assistant) also accepts parameters that set the chunking of the vector stores auto-created when files are attached to it.
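
For files attached when the thread itself is created, the chunking strategy rides along in tool_resources. A rough sketch; "file-abc123" is a placeholder file ID:

thread = client.beta.threads.create(
    messages=[{"role": "user", "content": "Summarize the attached file."}],
    tool_resources={
        "file_search": {
            "vector_stores": [{
                "file_ids": ["file-abc123"],  # placeholder
                "chunking_strategy": {
                    "type": "static",
                    "static": {"max_chunk_size_tokens": 400, "chunk_overlap_tokens": 100},
                },
            }]
        }
    },
)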

Ultimately, the lack of controls, and the fact that OpenAI will send runs to certain failure at your expense, should be among the many reasons to develop with Chat Completions rather than Assistants.
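
With Chat Completions you decide exactly what goes into the prompt, so the model can never be handed tokens you did not put there. A minimal sketch; excerpts stands in for retrieval results you selected and trimmed yourself:

excerpts = "..."  # placeholder: chunks you retrieved and trimmed yourself
question = "tell me about the main themes in doc 1"

completion = client.chat.completions.create(
    model="gpt-4o-mini",
    max_tokens=500,  # a hard cap on output spend
    messages=[
        {"role": "system", "content": "Answer using only the provided excerpts."},
        {"role": "user", "content": f"Excerpts:\n{excerpts}\n\nQuestion: {question}"},
    ],
)
print(completion.choices[0].message.content)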
