Using gpt-4 API to Semantically Chunk Documents

sergeliatko · May 31, 2024, 10:22pm

You definitely should make it and put it out there.

Here are some “burned” tips:

Make a simple one-pager website where you’re clearly explaining what you’re up to and approximate ETAs.
Start a wait-list with subscription from that page (you might also offer some wait-list member discount on the first year of using your API)
Share a link to that website in a separate thread in this forum and also mention both from this thread
Add LinkedIn audience/conversation tracking to that website for remarketing later if needed
Add a Facebook audience pixel and maybe CRM trackers if you already use tools like HubSpot or Salesforce
Put the links to it on all of your social profiles
Try boosting a couple of posts on Facebook and LinkedIn about the API you’re building to see if people start getting on that wait-list
Look at your numbers and decide whether building that API is worth it: how many people you can get as users versus the extra effort needed to make it happen.

If you go in that order the failure will be the cheapest and the success will be easier.

SomebodySysop · May 31, 2024, 11:03pm

Thanks. My primary purpose for the API is my internal embedding pipeline. Making it available to others is an idea that kind of popped in my head, but is definitely not a priority. I would only do it if it made good business sense and was relatively easy to do.

I assume from your “burned” recommendations that you tried it? If so, what was your experience?

sergeliatko · June 1, 2024, 10:42am

Yes, not for this API, but for several (a couple of dozen) other ideas. And having the wait-list with audience trackers is definitely a must to be able to properly target ads when launching. The forum links (especially on this forum) tend to rank fast and well, which gives you some additional exposure.

jr.2509 · June 1, 2024, 11:22am

Don’t take it the wrong way but this is not the way the Forum is supposed to be used, i.e. for SEO purposes.

sergeliatko · June 1, 2024, 12:40pm

It doesn’t bring SEO, but rather gives you a place to show what exists for a specific use case closely related to the main subject of the forum: OpenAI usage.

The goal is exposure.

SomebodySysop · June 2, 2024, 2:39am

As I outlined here: Using gpt-4 API to Semantically Chunk Documents - #95 by SomebodySysop

These are my steps:

export the pdf (or whatever) document to txt.
run code to prepend linenoxxxx:
send this numbered file to model along with instructions to create hierarchy json file
process this file with code to add end_line numbers and output that json file.
new: also add token_count to the json file
run code on json output to create the chunks.
new: semantically sub-chunk chunks that are > x tokens
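The code-only steps in that list (prepending line numbers, adding end_line and token_count, and cutting the chunks) could be sketched roughly as follows. This is a minimal illustration, not the actual pipeline code: the `lineNNNN:` marker format, the flat list of sections with `title` and `start_line` keys, and the 4-characters-per-token estimate are all assumptions made for the example.

```python
def number_lines(text: str) -> str:
    """Prepend a 'lineNNNN:' marker to every line (marker format is an assumption)."""
    return "\n".join(
        f"line{i:04d}: {line}" for i, line in enumerate(text.splitlines(), start=1)
    )


def add_end_lines(sections: list[dict], total_lines: int) -> list[dict]:
    """Assume a flat hierarchy: each section ends on the line before the next one starts."""
    ordered = sorted((dict(s) for s in sections), key=lambda s: s["start_line"])
    for current, following in zip(ordered, ordered[1:]):
        current["end_line"] = following["start_line"] - 1
    if ordered:
        ordered[-1]["end_line"] = total_lines
    return ordered


def rough_token_count(text: str) -> int:
    """Crude stand-in for a real tokenizer: roughly 4 characters per token."""
    return max(1, len(text) // 4)


def build_chunks(text: str, sections: list[dict]) -> list[dict]:
    """Cut the original text into chunks using the model-provided start lines."""
    lines = text.splitlines()
    chunks = []
    for section in add_end_lines(sections, len(lines)):
        body = "\n".join(lines[section["start_line"] - 1 : section["end_line"]])
        chunks.append({**section, "token_count": rough_token_count(body), "text": body})
    return chunks
```

Here `sections` stands in for the hierarchy JSON returned by the model; a nested hierarchy would need the end_line logic adapted so that a parent section spans its children.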

So, I’ve finally created the prompt to “sub-chunk” chunks that are greater than x tokens. Here is the prompt I have so far:

Please divide the following text into semantically relevant chunks, ensuring each chunk is under 300 tokens.

Return the text exactly as provided, but divided into chunks.

Each chunk should be clearly marked with its number, starting from 0, enclosed in square brackets.

For example:

[0]
…text…
[1]
…text…

Additionally, provide a brief summary title for each chunk, also enclosed in square brackets on the first line of the chunk.

For example:

[0]
[Summary]
…text…
[1]
[Summary]
…text…

Text:

[Insert your text here]

Here is the sample text: https://s3.us-west-2.amazonaws.com/docs.scbbs.com/docs/test/semantic_prompt_example_text.txt

Here are the outputs, using the exact same prompt and text, from various models: https://s3.us-west-2.amazonaws.com/docs.scbbs.com/docs/test/semantic_prompt_example_output.txt

So, what do you think? Any suggestions for improvement? This prompt is intended to be a catch-all to make sure any chunk that has been through the process that exceeds x tokens is semantically sub-chunked.
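As a rough sketch of the “catch-all” idea, the code below selects only the chunks that exceed the limit and fills the prompt above for each of them. The chunk dictionaries, the function names, and the 4-characters-per-token estimate are assumptions for illustration; sending the prompt to a model is left out.

```python
PROMPT_TEMPLATE = (
    "Please divide the following text into semantically relevant chunks, "
    "ensuring each chunk is under {limit} tokens.\n\n"
    "Return the text exactly as provided, but divided into chunks.\n\n"
    "Each chunk should be clearly marked with its number, starting from 0, "
    "enclosed in square brackets.\n\n"
    "For example:\n\n[0]\n…text…\n[1]\n…text…\n\n"
    "Additionally, provide a brief summary title for each chunk, also enclosed "
    "in square brackets on the first line of the chunk.\n\n"
    "For example:\n\n[0]\n[Summary]\n…text…\n[1]\n[Summary]\n…text…\n\n"
    "Text:\n\n{text}"
)


def rough_token_count(text: str) -> int:
    """Crude stand-in for a real tokenizer: roughly 4 characters per token."""
    return max(1, len(text) // 4)


def prompts_for_oversized(chunks: list[dict], limit: int = 300) -> list[tuple[int, str]]:
    """Return (chunk index, filled prompt) for every chunk above the token limit."""
    prompts = []
    for index, chunk in enumerate(chunks):
        if rough_token_count(chunk["text"]) > limit:
            prompts.append((index, PROMPT_TEMPLATE.format(limit=limit, text=chunk["text"])))
    return prompts
```

Chunks at or under the limit are skipped, so the extra model call is only paid for text that actually needs sub-chunking.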

SomebodySysop · June 4, 2024, 2:56am

After a little testing, it appears that the prompt works just fine. I only needed to add the following:

If the first line of the input text contains a bracketed title path (i.e. [ title path ]), then add the summary title to that path at the top of each chunk.
For example, if the input text first line starts with:
[ARTICLE 21. Supplemental Markets - (a) The provisions of this Article relate… - (3) Definition]
Then add the chunk summary to that line like this:
[ARTICLE 21. Supplemental Markets - (a) The provisions of this Article relate… - (3) Definition - Summary]

So each chunk will now be formatted like this:

[0]
[title path - Summary]
…text…
[1]
[title path - Summary]
…text…

Your output must be formatted in this manner.

This addition does the following:

Adds the full title path (including its summary title) to each individual sub-chunk. This helps identify exactly where in the document hierarchy this segment belongs. This is in addition to the title metadata of the original chunk.
Reinforces the fact that the output must be rendered in this exact format. All of the program code in this process depends upon the models responding in a specific format.

So far, so good. Test documents are being semantically chunked and sub-chunked under the token limits.
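Since the downstream code depends on that exact format, a parser along the following lines could both split the response and reject malformed output. This is only a sketch under assumptions: the function name and the returned dictionary keys are hypothetical, and it treats any line consisting solely of `[digits]` as a chunk marker, which would misfire if the source text itself contained such a line.

```python
import re

MARKER = re.compile(r"^\[(\d+)\]$")   # e.g. [0]
TITLE = re.compile(r"^\[(.+)\]$")     # e.g. [title path - Summary]


def parse_sub_chunks(output: str) -> list[dict]:
    """Split a model response of the form [n] / [title] / text into chunk dicts."""
    chunks = []
    current = None
    for line in output.splitlines():
        stripped = line.strip()
        marker = MARKER.match(stripped)
        if marker:
            current = {"index": int(marker.group(1)), "title": None, "lines": []}
            chunks.append(current)
            continue
        if current is None:
            continue  # ignore any preamble before the first [n] marker
        title = TITLE.match(stripped)
        if title and current["title"] is None and not current["lines"]:
            current["title"] = title.group(1).strip()
            continue
        current["lines"].append(line)

    result = [
        {"index": c["index"], "title": c["title"], "text": "\n".join(c["lines"]).strip()}
        for c in chunks
    ]
    # The prompt asks for numbering that starts at 0 and increases by 1.
    if not result or [c["index"] for c in result] != list(range(len(result))):
        raise ValueError("model output does not follow the expected [n] chunk format")
    return result
```

Raising on unexpected numbering gives the calling code a clear point at which to retry the request instead of silently storing bad chunks.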

thinktank · June 4, 2024, 2:37pm

Exposure on the web is definitely “SEO.”

But, @jr.2509 , of course we’re using this place for SEO purposes.

In SEO, it is essential to establish “Authority” through high-quality backlinks and other profile information. A backlink to OpenAI is exceptionally valuable because of the high degree of authority associated with the domain.

Especially in this case, @SomebodySysop is putting his whole heart into this project, and he definitely deserves a backlink to his website, especially on a thread he’s been active on for months—so would Serge, and your esteemed self, for that matter.

These forum links rank fast, because of the high degree of earnest work we, the community, are placing into the work. It’s fascinating.

On the other hand, I certainly agree that if you notice anyone spamming links to their website without putting in the work, flag it immediately.

But recent search policy changes make this demonstration of experience in high-profile (well-maintained) communities essential to all of our future professional development.

jr.2509 · June 4, 2024, 2:46pm

I greatly enjoy the intellectual exchange on this topic as we all are trying to solve a similar problem and can benefit from sharing our experiences including how we’ve individually overcome certain technical challenges. I can’t think of a better place to do this than this Forum.

In the spirit of the Community guidelines, I think however it would be best to keep that intellectual discussion separate from any business and/or SEO objectives.

SomebodySysop · June 4, 2024, 7:00pm

Thank you for the kind words. I created this video a year ago, https://youtu.be/w_veb816Asg, and as such, I think it makes me one of the first people to coin the phrase “Semantic Chunking”.

In RAG, the quality of your model responses is 100% dependent upon the quality of your vector store retrievals. So it’s simple: the better your embeddings, the better your RAG application is going to perform.

While I began organizing my document chunks to be embedded in a more hierarchical manner, I still used the “sliding window” approach when it came to the actual embedding of the text chunks. As a result of this discussion back in early April, RAG is not really a solution - #43 by SomebodySysop, I decided to start this thread and explore how to totally automate a Semantic Embedding process – using only code and the actual models, and without having to rely on LangChain.

Glad I did, because with the help of other participants, including @sergeliatko and @jr.2509 , I have come up with a solution that fits into my embedding pipeline beautifully and – so far – appears to do what I’ve been wanting to do for over a year now.

I would love to make this code available in a public distribution, but the amount of time and effort it would take me to pull it out of my existing infrastructure would be prohibitive. In thinking about this, I realized what would be far easier would be to make the API itself publicly available. Yes, it would be for a fee, but I would basically only charge for the tokens used, with a reasonable markup.

So, to be clear, the idea of a Semantic Chunking API is just that: an idea. I’ve still got plenty of work to do to test this thing out on a variety of documents to discover the glitches.

Again, many thanks to everyone who has helped on this project.

sergeliatko · June 4, 2024, 8:08pm

Totally confirm with all the 2 hard years of hitting this wall @LAWXER

SomebodySysop · June 6, 2024, 3:55am

I spent some time looking at various chunking methods being promoted. These are some particularly good videos I found:

James Briggs discussion on his version of Semantic Chunking

Chunking Strategies

https://www.youtube.com/watch?v=pIGRwMjhMaQ&ab_channel=MervinPraison

The 5 levels of text splitting

https://youtu.be/8OJC21T2SL4?si=Wv1HjWQr2USmyiP-

It’s been difficult wrapping my head around these various strategies, but it appears that the key one involves splitting a document by sentence, then using an embedding model to compute the cosine similarity between the sentences, and then “chunking” together the ones that are most similar.
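To make that strategy concrete, here is a toy sketch of one common variant: walk through the sentences in order and start a new chunk whenever the similarity between neighbouring sentences drops below a threshold. The bag-of-words `toy_embed` function is only a stand-in for a real embedding model, and the threshold value and function names are assumptions for the example, not what the videos or any library actually use.

```python
import math
import re
from collections import Counter


def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter on ., ! and ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def toy_embed(sentence: str) -> Counter:
    """Stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def semantic_chunks(text: str, threshold: float = 0.2) -> list[str]:
    """Group consecutive sentences while neighbouring sentences stay similar."""
    sentences = split_sentences(text)
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for previous, current in zip(sentences, sentences[1:]):
        if cosine_similarity(toy_embed(previous), toy_embed(current)) >= threshold:
            chunks[-1].append(current)
        else:
            chunks.append([current])
    return [" ".join(chunk) for chunk in chunks]
```

Because this variant only compares neighbouring sentences, it never merges similar sentences from distant sections, but the resulting chunks still carry no information about where they sit in the document hierarchy.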

I assume this has the effect, as an embedding, of giving the best response that is available from the document on any particular question. However, as someone else mentioned earlier, what if two sentences are similar, but from completely different sections of the document? And, more importantly, when you return the chunk to the LLM, how does it figure out how to cite the specific document sections referenced?

In my applications, I always list the references with links to those specific areas in the document so the user can cross-check in real time. Of course, most users won’t – but they do this at their own peril.

Because of this, I prefer my own “layout-aware” chunking approach I have outlined in this thread. If I’ve done my embeddings correctly, a cosine similarity search will find similar ideas wherever they appear – and those ideas can be referenced, with links, back to the specific areas in the document(s) they are found.

Not to mention the fact that you’ve got to load and maintain a bunch of libraries with those strategies mentioned above. In my approach, there are two prompts. So long as those prompts return data in the specific structures as instructed, the rest of the code will work perfectly – now, and 5 years from now.

egils · June 6, 2024, 7:27am

Built our RAG having all KB content in markdown with well structured headings and preserving/inheriting all parent headings when docs get chunked, no overlaps. No semantics involved though.

Had to write a custom md splitter since I could not find anything ready-made with the required level of control over chunk length.

SomebodySysop · June 6, 2024, 7:32am

Preserving the “title path” in embeddings I think is very useful. Which chunking methodology are you using?

egils · June 6, 2024, 7:43am

Not sure about “methodology”… I wrote my custom markdown splitter that preserves parent headings and respects a defined token limit. This helps to get proper embedding distances for chunked context, since parent headings add context to the chunk itself and clearly identify the chunk “location” in the larger document taxonomy.
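A heading-preserving splitter of this kind might look roughly like the sketch below. It is not the splitter described in the post: the function names, the `title_path` key, the ` > ` separator, and the whitespace-based token estimate are assumptions for illustration, and it does not handle `#` lines inside fenced code blocks.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def rough_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: count whitespace-separated words."""
    return len(text.split())


def split_markdown(md: str, max_tokens: int = 200) -> list[dict]:
    """Split markdown into chunks that each carry their chain of parent headings."""
    path: list[tuple[int, str]] = []  # (heading level, heading title)
    body: list[str] = []
    chunks: list[dict] = []

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            chunks.append({"title_path": " > ".join(t for _, t in path), "text": text})
        body.clear()

    for line in md.splitlines():
        heading = HEADING.match(line)
        if heading:
            flush()  # close the chunk under the previous heading path
            level = len(heading.group(1))
            while path and path[-1][0] >= level:
                path.pop()  # drop siblings and deeper headings
            path.append((level, heading.group(2).strip()))
        else:
            # Start a new chunk under the same path once the limit would be exceeded.
            # A single line longer than the limit is kept whole.
            if body and rough_tokens("\n".join(body + [line])) > max_tokens:
                flush()
            body.append(line)
    flush()
    return chunks
```

Each chunk keeps its full heading chain, so prepending `title_path` to the text before embedding gives the chunk the parent context mentioned above, with no overlap between chunks.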

SomebodySysop · June 6, 2024, 8:34am

By “methodology”, I am referring to one of the methods described in the videos I posted here: Using gpt-4 API to Semantically Chunk Documents - #112 by SomebodySysop

Level 1: Character Splitting - Simple static character chunks of data
Level 2: Recursive Character Text Splitting - Recursive chunking based on a list of separators
Level 3: Document Specific Splitting - Various chunking methods for different document types (PDF, Python, Markdown)
Level 4: Semantic Splitting - Embedding walk based chunking
Level 5: Agentic Splitting - Experimental method of splitting text with an agent-like system. Good for if you believe that token cost will trend to $0.00
Bonus Level: Alternative Representation Chunking + Indexing - Derivative representations of your raw text that will aid in retrieval and indexing

egils · June 6, 2024, 10:57am

Ah, then that would match a custom implementation of Document Specific Splitting for Markdown.
Semantic splitting could be very expensive if you have a large KB that tends to change over time. In our case, the main KB repository is created from our proprietary (although public) web content, which has hundreds of documentation and tutorial pages, and they tend to change. It would be very costly to semantically chunk this content using the API with every change.

What would really help is if OpenAI offered the possibility to “push” our company’s web site content into model training data upon changes on our side.

SomebodySysop · June 7, 2024, 5:27am

You are referring to fine-tuning, right? Because if you are referring to embeddings, that is something you can easily do.

I use Weaviate as my vector store. I use the Drupal Content Management System (CMS) as my platform. Whenever content is created, modified or deleted, the requisite information is queued up to be executed during an hourly cron. This process, effectively, updates my vector store on an ongoing basis for any document changes on my site.

My point being that you should be able to create the mechanism on your end to automatically “push” your changes to your embeddings vector store.
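As an illustration of that mechanism, the sketch below queues content changes and applies them in a batch, the way a periodic cron job might. Everything here is hypothetical: the `Change` and `SyncQueue` names, the in-memory dictionary standing in for a vector store, and the injected `embed` callable. It does not use the Weaviate or Drupal APIs.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable


@dataclass
class Change:
    action: str  # "upsert" (created or modified) or "delete"
    doc_id: str
    text: str = ""


class SyncQueue:
    """Collects content changes and applies them to a store in one batch."""

    def __init__(self, embed: Callable[[str], list[float]]):
        self.embed = embed
        self.pending: deque[Change] = deque()
        # Stand-in for a real vector store: doc_id -> (vector, text)
        self.store: dict[str, tuple[list[float], str]] = {}

    def record(self, change: Change) -> None:
        """Called whenever content is created, modified or deleted."""
        self.pending.append(change)

    def run_cron(self) -> int:
        """Apply all queued changes in order; returns how many were processed."""
        processed = 0
        while self.pending:
            change = self.pending.popleft()
            if change.action == "delete":
                self.store.pop(change.doc_id, None)
            elif change.action == "upsert":
                self.store[change.doc_id] = (self.embed(change.text), change.text)
            else:
                raise ValueError(f"unknown action: {change.action}")
            processed += 1
        return processed
```

Only documents that actually changed are re-embedded on each run, which keeps the cost proportional to the amount of edited content rather than the size of the whole site.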

However, if you are referring to fine-tuning, that’s a horse of a different color. I’ve never used it, but I understand that you can’t “update” fine-tuned models. You can only create new ones with updated data.

egils · June 7, 2024, 9:47am

Nope, I’m not referring to fine-tuning.
I’m talking about the possibility to “push” a company’s proprietary web content into an LLM’s training data, i.e. to update it on change or on demand (assuming the current model’s training data already contains data from our web site). The issue is that our web content tends to change and quickly becomes outdated. As a result, ChatGPT may create answers based on outdated data and mislead customers about our product (which is technically quite a sharp software tool).
I completely understand the current OpenAI policy, which presumes only pulling public training data in order to keep control over data quality and filter out “undesired” content. Perhaps this is the only way to ensure that someone does not “inject” some malicious code/data that could break AGI free.

thinktank · June 7, 2024, 3:55pm

I’ve been wondering. (But I only learned the term ‘Vector Store’ three days ago, so I haven’t been wondering long.) Why don’t you use the OpenAI vector store? The OpenAI VS allows you to put in so many meta fields, it looks like it could cover most of the things we’ve talked about.

And, different note, if Drupal dev is PHP heavy you’ll have a ready outlet in WordPress for commercializing your jam.

To accomplish this (I’m guessing), you might look to Pushing your data into a Vector Store and having that Store attached to an Assistant. I think those Assistants can use webhooks and what-not to listen for changes on a given website.

The Store automatically adds embeddings and keywords and stuff, which you can tweak in-transition as well—and the data is [almost] instantaneously available to the Assistant the Store is assigned to.

You can set the Vector Store to expire, ensuring only current data is referenced… when the Store does expire, you could condense it into Fine Tuning the backend of the model for better answers moving forward.
