Automating Chunking for Customized GPT Knowledge in Vector Databases

Hello everyone,

Lately, I’ve been delving deeper into AI, particularly the OpenAI API, and I’m eager to start a project involving vector databases where GPT will possess customized knowledge. I’ve come across numerous resources, and I’ve almost gathered all the necessary information. However, one aspect remains unclear to me. When loading information into a vector database, it’s necessary to break that information into chunks, yet I’m unsure how to automate this process. How is this typically handled? I’ve already read the OpenAI cookbook article titled “Embedding Wikipedia articles for search,” but it relies on a library that splits the text into chunks based on Wikipedia’s article structure. Any insights or guidance would be greatly appreciated.

That article uses direct calls to a GPT model. What you want has already been automated by the Assistants API’s File Search tool.

They recently updated the Assistants API, so check the linked articles; they have everything you need for customized knowledge.
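
For reference, here is a rough sketch of that flow, assuming the current openai Python SDK with the beta vector store endpoints; the file name and model are placeholders. File Search chunks and embeds the uploaded files server-side, so no manual splitting is needed:

```python
# Minimal sketch: let the Assistants File Search tool handle chunking/embedding.
# Assumes a recent openai Python package; file name and model are placeholders.
from openai import OpenAI

client = OpenAI()

# Create a vector store and upload documents; chunking happens server-side.
vector_store = client.beta.vector_stores.create(name="my-knowledge-base")
with open("my_docs.pdf", "rb") as f:
    client.beta.vector_stores.file_batches.upload_and_poll(
        vector_store_id=vector_store.id, files=[f]
    )

# Attach the vector store to an assistant via the file_search tool.
assistant = client.beta.assistants.create(
    model="gpt-4o",
    instructions="Answer questions using the attached knowledge base.",
    tools=[{"type": "file_search"}],
    tool_resources={"file_search": {"vector_store_ids": [vector_store.id]}},
)
```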


There are a lot of opinions on chunking.

The “standard” way is to pick a chunk size, say X sentences or paragraphs, and then create chunks with a 50% (or whatever) overlap, and call this your set of chunks.
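
For example, a sketch of that fixed-size-with-overlap scheme over pre-split sentences (the chunk size and overlap here are arbitrary and should be tuned to your data):

```python
def chunk_sentences(sentences: list[str], per_chunk: int = 6, overlap: int = 3) -> list[str]:
    """Group pre-split sentences into overlapping chunks.

    Each chunk contains `per_chunk` sentences and shares `overlap`
    sentences with the previous chunk (here a 50% overlap).
    """
    chunks = []
    step = per_chunk - overlap
    for start in range(0, len(sentences), step):
        window = sentences[start:start + per_chunk]
        if window:
            chunks.append(" ".join(window))
        if start + per_chunk >= len(sentences):
            break
    return chunks
```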

But these chunks don’t always contain complete “thoughts” or align well with typical queries, since the split is not informed by any semantics.

At the other extreme, you can take a query, embed it, and then search the corpus for the chunk with the highest match: iteratively pick candidate spans with shrinking and expanding search radii and shifting offsets within the corpus, embed them, and correlate each with the query until you find the best one or hit your search limit. This is time consuming, but in theory it gives the best semantic fit for any chunk in the corpus.
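
Something along these lines, as a brute-force sketch rather than the iterative shrink/expand refinement; the embedding model and window sizes are assumptions, and real use would batch or cache the embedding calls:

```python
# Query-driven chunking sketch: scan windows of varying size and offset over
# the corpus and keep the one whose embedding is closest to the query's.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def best_chunk(corpus: str, query: str,
               window_sizes=(500, 1000, 2000), stride=250) -> str:
    """Return the corpus span with the highest cosine similarity to the query."""
    q = embed(query)
    q /= np.linalg.norm(q)
    best_span, best_score = "", -1.0
    for size in window_sizes:
        for start in range(0, max(1, len(corpus) - size + 1), stride):
            span = corpus[start:start + size]
            v = embed(span)
            v /= np.linalg.norm(v)
            score = float(q @ v)
            if score > best_score:
                best_span, best_score = span, score
    return best_span
```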

But ideally, you go in, understand your data, and manually create the chunks. If time permits, this is what I would do. If there is no time, do the “standard” approach. If there is no time but you’re willing to let the computer grind, do the search.
