I’m currently working on a project where I’m breaking text up into arbitrary segments. The idea is to keep whole ideas together. For example, if I were segmenting a book, each paragraph would be its own segment/chunk. But if I were segmenting a GitHub README, the segments could be the different sections, like the installation instructions, the roadmap, etc.
Right now I just have an LLM doing “smart parsing”, which is basically me telling it to break the document into x segments based on its best judgment. It works for now, but as you can imagine, it’s pretty lazy.
I was wondering if anyone was aware of any research regarding some sort of semantic segmentation that doesn’t involve direct LLM inference/generation. Maybe there is some sort of sliding window embedding method that looks for context switching.
I’m mostly just wondering if anything like this has already been done. I’d rather not experiment right now; there are already too many variables in play. But if there’s an existing algorithm, let me know and I’ll check it out.
I’m breaking them up for classification purposes. GitHub and the book are just examples; I’m doing this for arbitrary text and sources.
The process is basically:
1. Scrape the data, either from a document, a web page, etc.
2. Clean the data with regex and standard string manipulation.
3. Format the text with an LLM into a readable format, in my case Markdown.
4. Break the formatted document up with an LLM prompt.
5. For each chunk, determine whether the chunk is categorized one way or another.
I emphasize per-chunk because line-by-line or sentence-by-sentence classification either leaves out too much context for proper classification, or requires too many API calls to be realistic. I’m working with documents/text sources of 300+ pages.
Right now, the “break the formatted document up with an LLM prompt” step works well. I’m just wondering whether I can skip it entirely with something more sophisticated and efficient.
But the key point is that it needs to be adaptable; hardcoded strategies are a no-go.
The only thing I am trying to do is automate text segmentation based on semantic relevance without direct LLM generation.
I have clients with lots of documentation from many different sources, PowerPoints, GitHub, transcripts, PDFs, webpages. Some of it needs to be updated with relevant references.
For example, one client has legal text that needs to be updated from one state to another. I need to break the document into several pieces, then ask the AI whether each piece contains any text that needs to be updated, per a set of criteria established by the client.
All of that is ancillary, because I already know how to do that. It’s the easy part and mostly just infrastructure code. The only real thing in question right now is whether I can do away with the segmentation AI. I’d rather do it algorithmically, as it would cut the cost by a third and make the unit economics of the service more viable.
Gotta get back to work, but I figured I’d ask around and see if anyone knows of anything. I have a couple of light machine-learning ideas I might try out eventually, but I was hoping there was existing stuff I could use now.
There was a thread here a while back where I was talking about dynamic embeddings: iterating over text with variable-length embedding windows, shifting both the offset and the radius, and correlating against your known targets in the database. That correlation will have a peak, and the peak represents maximum similarity, i.e. the optimal chunk to use.
For segmentation, you would do the opposite: have the sliding embedding window search for the largest null, or discontinuity, between adjacent sliding windows.
So take pairwise adjacent sliding windows, slide them across the text, and take their dot products. Where the resulting similarity curve has the steepest slope (positive or negative) would represent a segment boundary.
So calculus on sliding windows of embedded text. That’s the rough idea at least.
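A minimal sketch of that idea, assuming sentence-split input and substituting a deterministic hashed bag-of-words embedder for a real embedding model (the function names are mine, not an existing library’s):

```python
import hashlib
import numpy as np

def toy_embed(text, dim=64):
    # Stand-in for a real embedding model: hashed bag-of-words, L2-normalized.
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[int(hashlib.md5(tok.encode()).hexdigest(), 16) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

def adjacent_window_sims(sentences, window=2):
    # Dot product of each pair of adjacent sliding windows; a low value
    # (a "null" between windows) suggests a topic discontinuity.
    sims = []
    for i in range(window, len(sentences) - window + 1):
        left = toy_embed(" ".join(sentences[i - window:i]))
        right = toy_embed(" ".join(sentences[i:i + window]))
        sims.append(float(left @ right))
    return sims

def split_at_discontinuity(sentences, window=2):
    # Cut where adjacent-window similarity bottoms out.
    sims = adjacent_window_sims(sentences, window)
    cut = window + int(np.argmin(sims))
    return sentences[:cut], sentences[cut:]
```

A fuller version would take a discrete derivative of the similarity curve and cut at every steep drop, not just the single global minimum.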
That’s a similar approach to one idea I want to do a full mockup for. Same basic principle:
Sliding embedding window.
Capture the embeddings and their corresponding text.
Run some fast unsupervised methods like PCA or k-means clustering on the embeddings.
Based on the population classification, split the texts up.
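Those steps might look something like this, with a hashed bag-of-words embedder standing in for a real model, per-sentence windows, and a tiny k-means written out directly so the sketch stays self-contained:

```python
import hashlib
import numpy as np

def toy_embed(text, dim=64):
    # Stand-in for a real embedding model: hashed bag-of-words, L2-normalized.
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[int(hashlib.md5(tok.encode()).hexdigest(), 16) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

def tiny_kmeans(X, k, iters=20):
    # Deterministic init: centers at evenly spaced rows of X.
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def segment_by_cluster(sentences, k=2):
    # Capture the embeddings and their corresponding text.
    X = np.array([toy_embed(s) for s in sentences])
    # Cluster the embeddings.
    labels = tiny_kmeans(X, k)
    # Split the text wherever the cluster label changes between neighbors.
    segments, current = [], [sentences[0]]
    for sent, prev, cur in zip(sentences[1:], labels[:-1], labels[1:]):
        if cur == prev:
            current.append(sent)
        else:
            segments.append(current)
            current = [sent]
    segments.append(current)
    return segments
```

Note that splitting on consecutive label changes keeps the text in document order; grouping purely by cluster membership is what loses temporal context.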
I tried a version of it recently with k-means. It works well, but it doesn’t keep temporal context, so sentences and phrases end up out of order a lot of the time. I have to do some more stabilization on it, but the groupings work well.
My latest thought was to add a couple more dimensions (maybe 3) at the end and move those coordinates diagonally from 0 to 1 based on the position of the text. Basically an engineered “temporal context” feature.
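That engineered temporal feature could be as simple as appending a few position-ramp dimensions before clustering; the function name and the scaling factor here are illustrative choices, not anything standard:

```python
import numpy as np

def add_temporal_dims(embeddings, extra=3, weight=0.5):
    # Append `extra` coordinates that move diagonally from 0 to 1 with the
    # position of each row in the document, scaled by `weight` so the
    # temporal pull doesn't overwhelm the semantic dimensions.
    n = len(embeddings)
    ramp = np.linspace(0.0, 1.0, n)[:, None] * weight
    return np.hstack([embeddings, np.repeat(ramp, extra, axis=1)])
```

With the extra dimensions in place, k-means is biased toward keeping neighboring windows in the same cluster, which should help the out-of-order problem.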
Regardless, I think we’re thinking along the same lines of leveraging the simplicity of existing methods.
OK, I found the thread where I was talking about these “dynamic embedding heatmaps”.
The idea there is you have one large continuous text that you are drawing your RAG from.
So a user has an incoming query, and this query generates starting seeds into the large continuous document, either from keywords or from normal chunk embeddings.
Then you dynamically grow and shrink a window around each of these seeds in the large document, embed it, and correlate it with the input.
So your retriever has these dynamic “floating” chunks that are optimized for each query.
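A rough sketch of the grow-and-shrink search around one seed. The character-offset sweep, the radius values, and the hashed stand-in embedder are all arbitrary assumptions of mine, just to make the shape of the search concrete:

```python
import hashlib
import numpy as np

def toy_embed(text, dim=64):
    # Stand-in for a real embedding model: hashed bag-of-words, L2-normalized.
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[int(hashlib.md5(tok.encode()).hexdigest(), 16) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

def best_floating_chunk(text, seed, query,
                        radii=(100, 200, 400), offsets=(-100, 0, 100)):
    # Grow/shrink a character window around the seed, shifting both the
    # offset and the radius, and keep the span whose embedding correlates
    # best with the query embedding.
    q = toy_embed(query)
    best_score, best_span = -1.0, None
    for off in offsets:
        for r in radii:
            start = max(0, seed + off - r)
            stop = min(len(text), seed + off + r)
            score = float(toy_embed(text[start:stop]) @ q)
            if score > best_score:
                best_score, best_span = score, (start, stop)
    return best_score, best_span
```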
The problem I was trying to solve was creating optimal chunks, specific to the query.
Now this does add latency, to @sergeliatko’s point, and no, I haven’t tried it! It’s more of a theoretical concept.
But I did have a caching strategy in mind where I wouldn’t re-embed for the same start and stop offsets into the large continuous text corpus, so there are some savings in cost and latency.
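The offset-keyed cache is the simple part: memoize on (start, stop) into the continuous corpus so repeated dynamic searches never re-embed the same span. Class and method names here are just illustrative:

```python
class OffsetEmbeddingCache:
    # Memoize embeddings by (start, stop) character offsets into one
    # large continuous text corpus, so identical spans are embedded once.
    def __init__(self, corpus, embed_fn):
        self.corpus = corpus
        self.embed_fn = embed_fn  # e.g. a call out to your embedding API
        self._cache = {}
        self.misses = 0  # each miss is one real (paid) embedding call

    def get(self, start, stop):
        key = (start, stop)
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self.embed_fn(self.corpus[start:stop])
        return self._cache[key]
```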
So if anyone wants to try this: you would start with nominal chunks, maybe even ones chosen totally arbitrarily, like what most people do, say some fixed size and percent offset. You also record the starting character index (and even the last character index) into the large continuous stream.
These offsets would help the time ordering problem @codie mentions.
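Laying down that initial grid with recorded offsets might look like this (the fixed size and percent overlap are arbitrary defaults):

```python
def lay_down_grid(text, size=500, overlap_pct=0.2):
    # Fixed-size chunks with a percent overlap, recording the starting and
    # ending character index of each chunk into the continuous stream.
    step = max(1, int(size * (1 - overlap_pct)))
    chunks, start = [], 0
    while start < len(text):
        stop = min(len(text), start + size)
        chunks.append({"start": start, "stop": stop, "text": text[start:stop]})
        if stop == len(text):
            break
        start += step
    return chunks
```

Because every chunk carries its offsets, re-sorting retrieval hits back into document order is just a sort on `start`.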
So after you form the initial grid, you perform initial vector queries into your document. But based on whatever flag (classifier) or other condition, you spawn a dynamic embedding search, and cache your results. And now you have an optimal matched chunked version for the given query.
If latency is a big deal, you can always do normal RAG, but in the background compute the “real” chunk that has been optimized in this process, which would then be used for later searches. It just gets added to the pile essentially, but now it is optimal.
Over time, you can even start to remove your initial grid of chunks, after they get covered by optimal ones. So that over time you have continuous coverage of optimal chunks, instead of the initial grid that was used to initialize the system.
Anyway, those are my thoughts. Concepts like this might also help segmentation, because these optimal chunks from real queries are now considered “whole,” with real boundaries rather than artificial ones from the arbitrary initial chunking process. The first problem, IMO, is the artificial boundaries generated in the laydown of the initial grid. Now the boundaries are real and have semantic meaning.
So after these real boundaries are formed, you have proper segments, and you can send them through classification or clustering to form segmentations that are properly aligned semantically as well.