"Understanding Chunking and Duplicate File Handling in OpenAI's Vector Store

I am interested in understanding how OpenAI’s vector store handles file chunking and the behavior of the system when the same file is uploaded multiple times. Specifically, I would like to know:

  1. Chunking in Vector Store: Is there a way to check how chunking is performed in OpenAI’s vector store? How does the system determine the size and number of chunks for a given file? Are there any configurable parameters that influence this process?
  2. Handling Duplicate File Uploads: If I upload the same file twice to the vector store, what happens? Will the system recognize the duplicate and merge the content, or will it override the existing file? Are there any settings or best practices for managing duplicate file uploads to ensure data integrity and avoid redundancy?

I am looking for detailed insights into these mechanisms to better understand the underlying processes and optimize my use of OpenAI’s vector store.

Regarding chunking, the following is the default behavior as per the official documentation:

By default, the file_search tool uses the following settings, but these can be configured to suit your needs:

  • Chunk size: 800 tokens
  • Chunk overlap: 400 tokens
  • Embedding model: text-embedding-3-large at 256 dimensions
  • Maximum number of chunks added to context: 20 (could be fewer)

As noted above, OpenAI has introduced options for customizing the chunking approach to better suit your needs; a sketch of how to set and inspect these parameters follows.
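Here is a minimal sketch of setting a static chunking strategy at upload time and reading it back afterwards. It assumes a recent openai Python SDK where vector stores live under client.vector_stores (older versions expose them under client.beta.vector_stores); the vector store ID and filename are placeholders.

```python
from openai import OpenAI

client = OpenAI()

# Upload a file, then attach it to a vector store with an explicit
# static chunking strategy instead of the "auto" defaults.
# "vs_abc123" and "notes.pdf" are placeholders.
uploaded = client.files.create(
    file=open("notes.pdf", "rb"),
    purpose="assistants",
)

vs_file = client.vector_stores.files.create(
    vector_store_id="vs_abc123",
    file_id=uploaded.id,
    chunking_strategy={
        "type": "static",
        "static": {
            "max_chunk_size_tokens": 800,  # allowed range: 100-4096
            "chunk_overlap_tokens": 400,   # must not exceed half the chunk size
        },
    },
)

# Retrieving the vector store file echoes back the chunking strategy
# that was applied -- the closest thing to "checking" how chunking
# was performed for a given file.
retrieved = client.vector_stores.files.retrieve(
    file_id=uploaded.id,
    vector_store_id="vs_abc123",
)
print(retrieved.chunking_strategy)
```

Note that the API exposes the chunking *configuration*, not the individual chunks themselves, so this tells you how a file was split rather than showing you the resulting chunk text.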

As for your second point: I have not personally tested this, but I would assume there is no automatic recognition or merging of duplicate files. Each upload is treated as a new file with its own file ID, so uploading the same file twice would leave you with two separate entries and duplicate chunks in the store; see the sketch below for one way to guard against this yourself.
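A sketch of client-side deduplication, again assuming the current openai Python SDK. It matches on filename, which is only a heuristic (two different files can share a name; keeping a content hash on your side would be more robust). The helper names, vector store ID, and filename are hypothetical.

```python
import os

from openai import OpenAI

client = OpenAI()

VECTOR_STORE_ID = "vs_abc123"  # placeholder


def existing_filenames(vector_store_id: str) -> set[str]:
    """Collect the filenames already attached to the vector store.

    Vector store file objects only carry a file ID, so each one is
    resolved against the Files API to recover its original filename.
    """
    names = set()
    for vs_file in client.vector_stores.files.list(vector_store_id=vector_store_id):
        names.add(client.files.retrieve(vs_file.id).filename)
    return names


def upload_if_new(path: str, vector_store_id: str) -> None:
    """Skip the upload when a file with the same name is already present."""
    if os.path.basename(path) in existing_filenames(vector_store_id):
        print(f"skipping duplicate: {path}")
        return
    uploaded = client.files.create(file=open(path, "rb"), purpose="assistants")
    client.vector_stores.files.create(
        vector_store_id=vector_store_id,
        file_id=uploaded.id,
    )


upload_if_new("notes.pdf", VECTOR_STORE_ID)
```

Tracking uploads this way (or in your own database keyed by content hash) is the simplest path to data integrity, since the API itself will not reject or merge a repeated upload.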
