Automating Chunking for Customized GPT Knowledge in Vector Databases

Hello everyone,

Lately, I’ve been delving deeper into AI, particularly the OpenAI API, and I’m eager to start a project involving vector databases where GPT will possess customized knowledge. I’ve come across numerous resources, and I’ve almost gathered all the necessary information. However, one aspect remains unclear to me. When loading information into a vector database, it’s necessary to break that information into chunks, yet I’m unsure how to automate this process. How is this typically handled? I’ve already read the OpenAI cookbook article titled “Embedding Wikipedia articles for search,” but it relies on a library that splits the text into chunks based on Wikipedia’s article structure. Any insights or guidance would be greatly appreciated.

That article uses direct calls to a GPT model. What you want has already been automated by the Assistants API’s File Search tool.

They recently updated the Assistants API, so check the linked articles; they have everything you need for customized knowledge.
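
For reference, here is a rough sketch of that flow, assuming the current openai Python SDK with the beta vector store endpoints; the file name and model are placeholders. File Search chunks and embeds the uploaded files server-side, so no manual splitting is needed:

```python
# Minimal sketch: let the Assistants File Search tool handle chunking/embedding.
# Assumes a recent openai Python package; file name and model are placeholders.
from openai import OpenAI

client = OpenAI()

# Create a vector store and upload documents; chunking happens server-side.
vector_store = client.beta.vector_stores.create(name="my-knowledge-base")
with open("my_docs.pdf", "rb") as f:
    client.beta.vector_stores.file_batches.upload_and_poll(
        vector_store_id=vector_store.id, files=[f]
    )

# Attach the vector store to an assistant via the file_search tool.
assistant = client.beta.assistants.create(
    model="gpt-4o",
    instructions="Answer questions using the attached knowledge base.",
    tools=[{"type": "file_search"}],
    tool_resources={"file_search": {"vector_store_ids": [vector_store.id]}},
)
```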


There are a lot of opinions on chunking.

The “standard” way is to pick a chunk size, say X sentences or paragraphs, and then create chunks with a 50% (or whatever) overlap, and call this your set of chunks.
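
For example, a sketch of that fixed-size-with-overlap scheme over pre-split sentences (the chunk size and overlap here are arbitrary and should be tuned to your data):

```python
def chunk_sentences(sentences: list[str], per_chunk: int = 6, overlap: int = 3) -> list[str]:
    """Group pre-split sentences into overlapping chunks.

    Each chunk contains `per_chunk` sentences and shares `overlap`
    sentences with the previous chunk (here a 50% overlap).
    """
    chunks = []
    step = per_chunk - overlap
    for start in range(0, len(sentences), step):
        window = sentences[start:start + per_chunk]
        if window:
            chunks.append(" ".join(window))
        if start + per_chunk >= len(sentences):
            break
    return chunks
```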

But these chunks don’t always contain complete “thoughts” or align well with typical queries, since the split is not informed by any semantics.

At the other extreme, you can take a query, embed it, and then search the corpus for the chunk with the highest match: iteratively pick candidate spans with shrinking and expanding search radii and shifting offsets within the corpus, embed them, and correlate each with the query until you find the best one or hit your search limit. This is time consuming, but in theory it gives the best semantic fit for any chunk in the corpus.
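
Something along these lines, as a brute-force sketch rather than the iterative shrink/expand refinement; the embedding model and window sizes are assumptions, and real use would batch or cache the embedding calls:

```python
# Query-driven chunking sketch: scan windows of varying size and offset over
# the corpus and keep the one whose embedding is closest to the query's.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def best_chunk(corpus: str, query: str,
               window_sizes=(500, 1000, 2000), stride=250) -> str:
    """Return the corpus span with the highest cosine similarity to the query."""
    q = embed(query)
    q /= np.linalg.norm(q)
    best_span, best_score = "", -1.0
    for size in window_sizes:
        for start in range(0, max(1, len(corpus) - size + 1), stride):
            span = corpus[start:start + size]
            v = embed(span)
            v /= np.linalg.norm(v)
            score = float(q @ v)
            if score > best_score:
                best_span, best_score = span, score
    return best_span
```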

But ideally, you go in, understand your data, and manually create the chunks. If time permits, this is what I would do. If there is no time, do the “standard” approach. If there is no time but you’re willing to let the computer grind, do the search.
