Control chunk size when adding files to a vector store for the new Assistants API?

I’m using the OpenAI API to create a vector store for the new Assistants API. While I can easily add files to the vector store, I haven’t found anything in the API documentation about controlling the chunk size used when a document is vectorized during the file-addition process.

I’m interested in exploring different chunking strategies. By controlling the chunk size, I hope to optimize the vector store’s performance and retrieval accuracy.

If you have any guidance on how to manage chunk sizes when creating a vector store with the OpenAI API, I would greatly appreciate your help.

Hi!

It is currently not possible to change the chunking strategy.
From the docs:

We have a few known limitations we’re working on adding support for in the coming months:

Support for modifying chunking, embedding, and other retrieval configurations.

The current state is described here:
https://platform.openai.com/docs/assistants/tools/file-search/how-it-works

So, it’s on the roadmap!


@vb I noticed that I can see the chunking strategy a file was given back when it was attached to the vector store. Do you know whether changing the chunking strategy changes the vector store’s representation of the file? Do they keep the original around in case the file needs to be disconnected from the vector store?

import os

from openai import OpenAI
from vector_db import get_files_uploaded_to_vector_store, get_vector_store_id

name = 'v_1'

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# Look up the vector store by name, then list the files attached to it
vs_id = get_vector_store_id(client, name)
vs_files = get_files_uploaded_to_vector_store(client, vs_id)
print(vs_files)

It gives me

SyncCursorPage[VectorStoreFile](data=[VectorStoreFile(id='file-vI1MnqFjUc217DxL21zx0pLB', created_at=1726241178, last_error=None, object='vector_store.file', status='completed', usage_bytes=57337, vector_store_id='vs_cxmnSRsWHwBsP2SIiA6Uif6g', chunking_strategy=StaticFileChunkingStrategyObject(static=StaticFileChunkingStrategy(chunk_overlap_tokens=400, max_chunk_size_tokens=800), type='static')), etc…
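
For reference, the two helpers I import from vector_db are my own, not part of the SDK. They look roughly like this minimal sketch, using the Python client’s beta vector store endpoints:

def get_vector_store_id(client, name):
    # My helper: find a vector store by name and return its ID
    for vs in client.beta.vector_stores.list():
        if vs.name == name:
            return vs.id
    raise ValueError(f"No vector store named {name!r}")

def get_files_uploaded_to_vector_store(client, vs_id):
    # My helper: list the files attached to a vector store
    return client.beta.vector_stores.files.list(vector_store_id=vs_id)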

You would need to remove the file from the vector store and re-add it with new chunk size parameters. The file extraction, chunking, and embedding are then performed again. You can then check the vector store file metadata, using the file ID that is returned, to observe the change.
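
A minimal sketch of that flow with the Python SDK (the IDs here are placeholders; if the API still refuses the same file ID, upload the file again to get a fresh one):

from openai import OpenAI

client = OpenAI()
vs_id = "vs_abc123"      # placeholder vector store ID
file_id = "file-abc123"  # placeholder file ID

# Detach the file from the vector store (the uploaded file itself remains)
client.beta.vector_stores.files.delete(vector_store_id=vs_id, file_id=file_id)

# Re-attach it, this time with an explicit static chunking strategy
vs_file = client.beta.vector_stores.files.create(
    vector_store_id=vs_id,
    file_id=file_id,
    chunking_strategy={
        "type": "static",
        "static": {"max_chunk_size_tokens": 400, "chunk_overlap_tokens": 100},
    },
)
print(vs_file.chunking_strategy)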

That’s quite clear; I just wanted to see whether it would have any effect without doing this.

The only place for employing a chunking strategy parameter is when you POST a file ID to a vector store.
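
That is, chunking_strategy goes in the request body of that POST. A raw HTTP sketch of the same call (requests and the placeholder IDs are my illustration, not SDK code):

import os
import requests

vs_id = "vs_abc123"      # placeholder vector store ID
file_id = "file-abc123"  # placeholder file ID

resp = requests.post(
    f"https://api.openai.com/v1/vector_stores/{vs_id}/files",
    headers={
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "assistants=v2",
    },
    json={
        "file_id": file_id,
        "chunking_strategy": {
            "type": "static",
            "static": {"max_chunk_size_tokens": 400, "chunk_overlap_tokens": 100},
        },
    },
)
print(resp.json())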

Try it on the same file without removing it first?

“Cannot reindex file file-BPIcVdasj11112234 with a new strategy”

“without doing this”, meaning, without deleting the file_id from the vector store first? Or by just thinking happy thoughts about chunks??

Imagine the unseen functions:

print("file id: ", fid := upload_file())  # uploads with default file name

vector_create_result = client.beta.vector_stores.create(name="chunko")
print("vector store: ", vid := vector_create_result.id)
print(vs_connect(vid, fid, 0.5).chunking_strategy)  # add file to vector store

try:
    connect_result2 = vs_connect(vid, fid, 0.9)  # try to add again??
except openai_BadRequestError as e:
    error_message = str(e)
    if "Cannot reindex file" in error_message:
        # raise FileExistsError("You can't modify the strategy, silly.") from e
        print(e)
    else:
        raise
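
For completeness, the unseen helpers might look roughly like this. It’s a hypothetical sketch in my own naming, not the exact code that produced the output below; the overlap is a quarter of the chunk size so that 0.5 yields the 400/100 values shown:

def upload_file(path: str = "example.pdf") -> str:
    # Sketch: upload a local file for Assistants use and return its file ID
    with open(path, "rb") as f:
        return client.files.create(file=f, purpose="assistants").id

def vs_connect(vid: str, fid: str, size_fraction: float):
    # Sketch: attach a file to a vector store with a static chunking strategy,
    # scaling the 800-token default max chunk size by size_fraction
    max_tokens = int(800 * size_fraction)
    return client.beta.vector_stores.files.create(
        vector_store_id=vid,
        file_id=fid,
        chunking_strategy={
            "type": "static",
            "static": {
                "max_chunk_size_tokens": max_tokens,
                "chunk_overlap_tokens": max_tokens // 4,
            },
        },
    )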

Since I added handling for that “rechunking without deleting first” case just for you:

file id: file-5vBOxv8QSZKsKgag0x9ZQEJM
vector store: vs_zD8wbkojq59Ub66tSOpusd4M
.
ChunkingStrategyStatic(static=ChunkingStrategyStaticStatic(chunk_overlap_tokens=100, max_chunk_size_tokens=400), type='static')
Error code: 400 - {'error': {'message': 'Failed to create file operation: Bad Request: {"detail":"Cannot reindex file file-5vBOxv8QSZKsKgag0x9ZQEJM with a new strategy"}', 'type': 'invalid_request_error', 'param': None, 'code': None}}