Using gpt-4 API to Semantically Chunk Documents

Thank you for the kind words. I created this video a year ago, https://youtu.be/w_veb816Asg, which I believe makes me one of the first people to coin the phrase “Semantic Chunking”.

In RAG, the quality of your model responses is 100% dependent upon the quality of your vector store retrievals. So it’s simple: the better your embeddings, the better your RAG application will perform.

While I began organizing my document chunks to be embedded in a more hierarchical manner, I still used the “sliding window” approach when it came to the actual embedding of the text chunks. As a result of a discussion back in early April (RAG is not really a solution - #43 by SomebodySysop), I decided to start this thread and explore how to totally automate a Semantic Embedding process – using only code and the actual models, and without having to rely on LangChain.
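To illustrate the difference between the two approaches: a minimal sketch, with function names of my own invention (the actual pipeline discussed in this thread is not public). The sliding window splits at fixed character offsets regardless of meaning, while the semantic version splits at breakpoints proposed by a model, here abstracted as a `propose_breakpoints` callable that could be backed by a gpt-4 call returning topic-shift offsets.

```python
def sliding_window_chunks(text, window_size=200, overlap=50):
    """Fixed-size, overlapping chunks: the 'sliding window' approach.
    Boundaries are arbitrary character offsets, not semantic units."""
    step = window_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + window_size])
        if start + window_size >= len(text):
            break
    return chunks


def semantic_chunks(text, propose_breakpoints):
    """Split at boundaries proposed by a model (e.g. a gpt-4 prompt
    asked to return the character offsets of topic shifts)."""
    points = sorted(set(propose_breakpoints(text)))
    bounds = [0] + [p for p in points if 0 < p < len(text)] + [len(text)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:])]
```

In practice `propose_breakpoints` would wrap an API call; injecting it as a parameter keeps the chunking logic testable without the model.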

Glad I did, because with the help of other participants, including @sergeliatko and @jr.2509 , I have come up with a solution that fits into my embedding pipeline beautifully and – so far – appears to do what I’ve been wanting to do for over a year now.

I would love to make this code available in a public distribution, but the amount of time and effort it would take me to pull it out of my existing infrastructure would be prohibitive. In thinking about this, I realized what would be far easier would be to make the API itself publicly available. Yes, it would be for a fee, but I would basically only charge for the tokens used, with a reasonable markup.

So, to be clear, the idea of a Semantic Chunking API is just that: an idea. I’ve still got plenty of work to do to test this thing out on a variety of documents to discover the glitches.

Again, many thanks to everyone who has helped on this project.
