Understanding the Chunking Process in a Vector Store for Plain Text Files

I have a vector store with many plain text files of different sizes. I would like to understand how these text files are divided into smaller chunks for storage in the database. Specifically, I am interested in knowing how the system ensures that the context and continuity of the original text are preserved when large text files are split into smaller chunks. Thank you.

The topic "Using gpt-4 API to Semantically Chunk Documents" would be a good starting point.


TL;DR: The split points involve no understanding of context; nothing is "ensured."

The extracted text is split into chunks of approximately 800 tokens, with a 400-token overlap shared between adjacent chunks.
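A sliding-window split like this can be sketched as follows. This is an illustration of the general technique (fixed-size windows that advance by chunk size minus overlap), not the actual server-side implementation; `chunk_tokens` is a hypothetical helper.

```python
# Fixed-size chunking with overlap: each window advances by
# (chunk_size - overlap) tokens, so consecutive chunks share
# their boundary tokens and no sentence is lost entirely at a cut,
# even though the cut itself ignores semantics.

def chunk_tokens(tokens, chunk_size=800, overlap=400):
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # last window already reached the end of the file
    return chunks

# Toy example with small numbers: 10 tokens, chunks of 4, overlap of 2
print(chunk_tokens(list(range(10)), chunk_size=4, overlap=2))
# → [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

Note how each chunk repeats the last two tokens of the previous one; the 400-token overlap in the real pipeline plays the same role at a larger scale.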

These parameters can now be altered when adding a file to a vector store.
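With the OpenAI Python SDK, the override is passed as a `chunking_strategy` object when attaching a file. The values below are illustrative, and the `vs_...` / `file-...` IDs are placeholders you would supply yourself:

```python
# A static chunking strategy overriding the defaults of
# max_chunk_size_tokens=800 and chunk_overlap_tokens=400.
chunking_strategy = {
    "type": "static",
    "static": {
        "max_chunk_size_tokens": 1200,  # illustrative value
        "chunk_overlap_tokens": 300,    # must not exceed half the chunk size
    },
}

# With a client and IDs in hand, the attach call would look like:
# client = openai.OpenAI()
# client.beta.vector_stores.files.create(
#     vector_store_id="vs_...",
#     file_id="file-...",
#     chunking_strategy=chunking_strategy,
# )
```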

That is your vector database solution.

(You can probably do better.)
