"Understanding Chunking and Duplicate File Handling in OpenAI's Vector Store

I am interested in understanding how OpenAI’s vector store handles file chunking and the behavior of the system when the same file is uploaded multiple times. Specifically, I would like to know:

  1. Chunking in Vector Store: Is there a way to check how chunking is performed in OpenAI’s vector store? How does the system determine the size and number of chunks for a given file? Are there any configurable parameters that influence this process?
  2. Handling Duplicate File Uploads: If I upload the same file twice to the vector store, what happens? Will the system recognize the duplicate and merge the content, or will it override the existing file? Are there any settings or best practices for managing duplicate file uploads to ensure data integrity and avoid redundancy?

I am looking for detailed insights into these mechanisms to better understand the underlying processes and optimize my use of OpenAI’s vector store.

Regarding chunking, the following is the default behavior as per the official documentation:

By default, the file_search tool uses the following settings, but these can be configured to suit your needs:

  • Chunk size: 800 tokens
  • Chunk overlap: 400 tokens
  • Embedding model: text-embedding-3-large at 256 dimensions
  • Maximum number of chunks added to context: 20 (could be fewer)

As noted above, OpenAI has introduced options for customizing the chunking approach to better suit your needs; a sketch of how to set and inspect these parameters follows.
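Here is a minimal sketch of setting a static chunking strategy at upload time and reading it back afterwards. It assumes a recent openai Python SDK where vector stores live under client.vector_stores (older versions expose them under client.beta.vector_stores); the vector store ID and filename are placeholders.

```python
from openai import OpenAI

client = OpenAI()

# Upload a file, then attach it to a vector store with an explicit
# static chunking strategy instead of the "auto" defaults.
# "vs_abc123" and "notes.pdf" are placeholders.
uploaded = client.files.create(
    file=open("notes.pdf", "rb"),
    purpose="assistants",
)

vs_file = client.vector_stores.files.create(
    vector_store_id="vs_abc123",
    file_id=uploaded.id,
    chunking_strategy={
        "type": "static",
        "static": {
            "max_chunk_size_tokens": 800,  # allowed range: 100-4096
            "chunk_overlap_tokens": 400,   # must not exceed half the chunk size
        },
    },
)

# Retrieving the vector store file echoes back the chunking strategy
# that was applied -- the closest thing to "checking" how chunking
# was performed for a given file.
retrieved = client.vector_stores.files.retrieve(
    file_id=uploaded.id,
    vector_store_id="vs_abc123",
)
print(retrieved.chunking_strategy)
```

Note that the API exposes the chunking *configuration*, not the individual chunks themselves, so this tells you how a file was split rather than showing you the resulting chunk text.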

As for your second point: I have not personally tested this, but I would assume there is no automatic recognition or merging of duplicate files. Each upload is treated as a new file with its own file ID, so uploading the same file twice would leave you with two separate entries and duplicate chunks in the store; see the sketch below for one way to guard against this yourself.
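A sketch of client-side deduplication, again assuming the current openai Python SDK. It matches on filename, which is only a heuristic (two different files can share a name; keeping a content hash on your side would be more robust). The helper names, vector store ID, and filename are hypothetical.

```python
import os

from openai import OpenAI

client = OpenAI()

VECTOR_STORE_ID = "vs_abc123"  # placeholder


def existing_filenames(vector_store_id: str) -> set[str]:
    """Collect the filenames already attached to the vector store.

    Vector store file objects only carry a file ID, so each one is
    resolved against the Files API to recover its original filename.
    """
    names = set()
    for vs_file in client.vector_stores.files.list(vector_store_id=vector_store_id):
        names.add(client.files.retrieve(vs_file.id).filename)
    return names


def upload_if_new(path: str, vector_store_id: str) -> None:
    """Skip the upload when a file with the same name is already present."""
    if os.path.basename(path) in existing_filenames(vector_store_id):
        print(f"skipping duplicate: {path}")
        return
    uploaded = client.files.create(file=open(path, "rb"), purpose="assistants")
    client.vector_stores.files.create(
        vector_store_id=vector_store_id,
        file_id=uploaded.id,
    )


upload_if_new("notes.pdf", VECTOR_STORE_ID)
```

Tracking uploads this way (or in your own database keyed by content hash) is the simplest path to data integrity, since the API itself will not reject or merge a repeated upload.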
