I’m using the OpenAI API to create a vector store for the new Assistant. While I can easily add files to the vector store, I haven’t found any information in the API documentation on how to control the chunk size used to vectorize a document when it is added.
I’m interested in exploring different chunking strategies. By controlling the chunk size, I hope to optimize the Vectorstore’s performance and retrieval accuracy.
If you have any guidance on how to manage chunk sizes when creating a vector store with the OpenAI API, I would greatly appreciate your help.
@vb I was able to see the file’s chunking strategy before the file was attached to the Vector Store. Do you know if changing the chunking strategy changes the vector store’s representation of the file? Do they keep it around in case the file needs to be disconnected from the vector store?
import os

from openai import OpenAI
from vector_db import get_files_uploaded_to_vector_store, get_vector_store_id

name = 'v_1'
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# Look up the vector store by name, then list the files attached to it
vs_id = get_vector_store_id(client, name)
vs_files = get_files_uploaded_to_vector_store(client, vs_id)
print(vs_files)
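For reference, the two helpers imported from vector_db are just thin wrappers over the SDK’s vector store list endpoints. A minimal sketch of what they might look like (assuming client.vector_stores in the current Python SDK; on older versions these calls live under client.beta.vector_stores):

def get_vector_store_id(client, name):
    # Return the id of the first vector store whose name matches
    for vs in client.vector_stores.list():
        if vs.name == name:
            return vs.id
    raise ValueError(f"No vector store named {name!r}")

def get_files_uploaded_to_vector_store(client, vector_store_id):
    # Return the vector store file objects attached to this store
    return list(client.vector_stores.files.list(vector_store_id=vector_store_id))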
You would need to remove the file from the vector store and re-add it with new chunk size parameters. File extraction, chunking, and embedding are then performed again. You can then check the vector store file metadata to observe the change, using the file ID that is returned.
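As a rough sketch of that flow (assuming the static chunking_strategy option and the client.vector_stores namespace in the current Python SDK; older versions use client.beta.vector_stores), with placeholder IDs:

import os
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

vs_id = "vs_..."      # your vector store id
file_id = "file-..."  # the previously uploaded file id

# Detach the file from the vector store; the underlying File object is kept
client.vector_stores.files.delete(file_id=file_id, vector_store_id=vs_id)

# Re-attach it with an explicit static chunking strategy
vs_file = client.vector_stores.files.create(
    vector_store_id=vs_id,
    file_id=file_id,
    chunking_strategy={
        "type": "static",
        "static": {
            "max_chunk_size_tokens": 400,  # example value
            "chunk_overlap_tokens": 200,   # must be at most half the chunk size
        },
    },
)

# Inspect the vector store file metadata to confirm the new strategy
print(vs_file.chunking_strategy)

Note that the chunking strategy only affects how the file is indexed inside the vector store; the original uploaded file itself is unchanged.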