Using gpt-4 API to Semantically Chunk Documents

SomebodySysop · May 22, 2024, 9:53am

The “atomic idea” makes a lot of sense. I think the methodology I am working on is related, although it approaches the atomic idea from the outside in rather than the inside out (as yours does).

However, my approach is also based somewhat on the “layout-aware” concept of chunking/embedding, which is discussed in this AWS Machine Learning Blog article: “Amazon Textract’s new Layout feature introduces efficiencies in general purpose and generative AI document processing tasks”.

A couple of charts towards the end of that article compare the LLM results from layout-aware and non-layout-aware embeddings.

While your approach is more granular and may, in fact, be the best, I think mine shares your core principle of getting a whole item within a chunk.
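To make the contrast concrete, here is a rough Python sketch of the two ideas. It is not the pipeline from either approach in this thread or from the AWS article. The function names, the character-based sizes, and the `blocks` input (a list of strings, one per layout element such as a heading, paragraph, or table, as a layout extractor might produce) are all assumptions made for illustration.

```python
def sliding_window_chunks(text, size=200, overlap=50):
    """Fixed-size character windows; boundaries ignore document layout."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


def layout_aware_chunks(blocks, max_chars=200):
    """Pack whole layout blocks into chunks without ever splitting a block.

    `blocks` is a hypothetical list of strings, one per layout element.
    A block longer than `max_chars` is kept whole as its own chunk.
    """
    chunks = []
    current = ""
    for block in blocks:
        candidate = f"{current}\n\n{block}" if current else block
        if current and len(candidate) > max_chars:
            # Adding this block would overflow, so close the current chunk.
            chunks.append(current)
            current = block
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    blocks = ["Section 1 heading", "A" * 120, "B" * 150, "short"]
    for chunk in layout_aware_chunks(blocks):
        print(len(chunk))
```

The sliding window can cut a paragraph or table in half wherever the character count lands, while the layout-aware version only ever breaks between blocks, so each item stays whole inside one chunk.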

At any rate, thank you for your explanation. I struggled to find a way to characterize “numeric chunking”, and “sliding window” is a perfect term!
