Understanding the Chunking Process in a Vector Store for Plain Text Files

I have a vector store with many plain text files of different sizes. I would like to understand how these text files are divided into smaller chunks for storage in the database. Specifically, I am interested in knowing how the system ensures that the context and continuity of the original text are preserved when large text files are split into smaller chunks. Thank you.

The topic "Using gpt-4 API to Semantically Chunk Documents" would be a good starting point.



TL;DR: There is no contextual understanding at the split points. Nothing is "ensured".

The extracted text is split into chunks of approximately 800 tokens, with a 400-token overlap carried into each adjacent chunk.
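The effect is roughly like the fixed-window splitter below. This is only a sketch of the idea, not the service's actual code, and it assumes the tiktoken tokenizer and the 800/400 defaults:

```python
# Sketch of fixed-interval chunking with overlap, assuming the tiktoken
# tokenizer. Illustrates the 800/400 defaults; not the service's actual code.
import tiktoken

def chunk_text(text: str, chunk_tokens: int = 800, overlap_tokens: int = 400) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    step = chunk_tokens - overlap_tokens      # each chunk starts 400 tokens after the previous one
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(enc.decode(tokens[start:start + chunk_tokens]))
        if start + chunk_tokens >= len(tokens):   # the last window already reached the end
            break
    return chunks
```

Because each chunk repeats the tail of the previous one, a sentence cut at a boundary usually appears intact in at least one chunk, which is the only "continuity" the default strategy gives you.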

These parameters can now be altered when adding a file to a vector store.
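For example, with the OpenAI Python SDK you can pass a static `chunking_strategy` when attaching a file. This is a sketch; the exact resource path (beta vs. non-beta) depends on your SDK version, and the IDs are placeholders:

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical IDs; substitute your own vector store and uploaded file IDs.
client.vector_stores.files.create(
    vector_store_id="vs_abc123",
    file_id="file_abc123",
    chunking_strategy={
        "type": "static",
        "static": {
            "max_chunk_size_tokens": 400,   # smaller chunks than the 800-token default
            "chunk_overlap_tokens": 200,    # overlap set to half the chunk size
        },
    },
)
```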

With your own vector database solution, you can probably do better.
