@ForcefulCJS & @RobbieInOz – Sure, I’ll start with some basics.
First, most of the cases where I need to chunk or submit shorter context come from my work with scientific manuscripts and, occasionally, legal filings.
In both instances, the content would usually fit within 128k tokens (with simple text, assume 180+ pages), so in theory you might assume I'd be fine without chunking.
The reality is that I get far more accurate and comprehensive results by chunking, provided that I test the guiding/system prompt extensively before executing any batch process.
Because I'm processing hundreds, not thousands, of documents, the most effective method is to simply open a document, look for the relevant areas, and insert my own escape characters to indicate where it should break. I first search the document to confirm it contains no pipes, then typically use three pipes ("|||") as a break marker. If it were Markdown with a table, it might contain pipes, but any sequence can work, e.g., combine your sequence with backticks.
This may sound time-consuming or like an extra step; however, in my experience it's the best method to ensure a proper outcome.
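A minimal sketch of that parse, assuming you've already placed the markers (the file name is just a placeholder):

```python
from pathlib import Path

DELIM = "|||"  # the manually inserted break marker

# Hypothetical file name; substitute your own document.
text = Path("manuscript.txt").read_text(encoding="utf-8")

# By this point I've already confirmed the sequence doesn't occur naturally,
# so every occurrence is a break I placed deliberately.
chunks = [c.strip() for c in text.split(DELIM) if c.strip()]
print(f"{len(chunks)} chunks; largest is {max(len(c) for c in chunks):,} characters")
```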
Alternative: When I Use Code to Divide
- I enumerate the text in the simplest fashion: a maximum of X characters per iteration (typically no more than 16k characters, or roughly 2k words). A stripped-down sketch of the whole loop follows this list.
- I scan the current iteration in reverse, looking for anything that appears to be a section header, using a variety of logic; this differs for legal complaints versus manuscripts, etc.
- Markdown is the easiest option; if I can run a quick formatted-doc-to-Markdown conversion, I can quickly divide with a simple regex seeking a header, e.g.
^\s*#+
- Non-Markdown – I often look for patterns such as numbers or roman numerals with fewer than 110 characters following (or double+ linefeeds on either side of a short string of text, indicating an un-numbered header, etc.). For example, here I use a lookbehind to confirm a minimum of two line breaks, then match one to four roman-numeral characters followed by a period and a short heading:
(?<=\n\n)[IVX]{1,4}\.\s+.{6,110}$
- I have several search patterns like this.
- Remember: this is done as a reverse search (or with regex matches enumerated in reverse) so it starts at the end, and a weighted system determines whether I'm comfortable with the break.
- I have also used tools like spaCy (a Python NLP module; I should mention I prefer strongly typed languages but typically write scripts like this in Python) for the NLP side.
- I then take everything from that break forward and carry it into the subsequent chunk (e.g., an "appendNext" string that is added to each new loop iteration).
- If the patterns fail or I have low confidence in a break, I call a local LLM or GPT to find the best one (a minimal prompt sketch follows below). It's very difficult to find a short prompt that does this properly, which may surprise you because it's a relatively easy task in theory. For example, you cannot ask for a character position; you have to ask for the first or last sentence where the most substantial change of topic occurs. Yet even that isn't as easy as it sounds, because the LLM will correct the text (remember, it's an LLM: it wants to put words in the most common order and fix things even when you tell it not to).
- Simply put, extensive testing at this LLM-topic-division point is the most important task before running many documents. The other option is having it return the text you provided with its own break inserted; however, once you get past roughly 1,200 characters you'll see it begin to trim, etc.
- This is a situation where an iterative task with a simple/light LLM like Mistral, for something so rudimentary, is just easier – but that's an entirely separate multi-page post (lol)
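To make the loop concrete, here's a stripped-down sketch of the enumerate-then-reverse-scan approach. The two patterns are the ones from above, but the weights and confidence threshold are simplified placeholders, not my production logic:

```python
import re

MAX_CHARS = 16_000  # rough ceiling per chunk, per the note above

# Simplified stand-ins for the weighted header patterns described above;
# real patterns are tailored per content type (manuscripts vs. filings).
HEADER_PATTERNS = [
    (re.compile(r"(?m)^\s*#+\s"), 1.0),                           # Markdown header
    (re.compile(r"(?m)(?<=\n\n)[IVX]{1,4}\.\s+.{6,110}$"), 0.8),  # roman-numeral heading
]
MIN_WEIGHT = 0.5

def find_break(window: str) -> int | None:
    """Return the offset of the last plausible header in the window, or None."""
    best_pos, best_weight = None, 0.0
    for pattern, weight in HEADER_PATTERNS:
        matches = list(pattern.finditer(window))
        if matches and weight > best_weight:  # crude weighting; position matters too
            best_pos, best_weight = matches[-1].start(), weight
    return best_pos if best_weight >= MIN_WEIGHT else None

def chunk(text: str) -> list[str]:
    chunks, carry, pos = [], "", 0
    while pos < len(text):
        take = MAX_CHARS - len(carry)
        window = carry + text[pos : pos + take]
        pos += take
        carry = ""
        if pos >= len(text):
            chunks.append(window)  # final piece: keep everything
            break
        cut = find_break(window)
        if not cut:                # None or 0 means no confident break was found
            cut = len(window)      # keep the whole window (or call the LLM fallback)
        chunks.append(window[:cut])
        carry = window[cut:]       # "appendNext": the remainder rides into the next chunk
    return chunks
```

In practice I have more patterns, the weighting considers where in the window a match sits, and spaCy's sentence boundaries come in when regex alone isn't enough; the skeleton, though, is the same.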
Simply put, it's an enumeration with scan-to-remove-and-append-to-next logic. In most cases it helps to have an idea of your content's structure. Scientific studies/manuscripts and legal filings are very different, yet tailoring a pattern to each, individually, is fairly straightforward.
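For that LLM fallback step, here's roughly the shape of prompt that has survived my testing, with the caveat that `call_llm` is a placeholder for whatever client you use and the exact wording is the part you'll iterate on:

```python
FALLBACK_PROMPT = (
    "You will receive a passage of text. Identify the single sentence where "
    "the most substantial change of topic occurs. Return that sentence EXACTLY "
    "as it appears in the passage, character for character.\n"
    "IMMUTABLE RULE: do not correct, rephrase, trim, or complete the sentence."
)

def llm_break(window: str, call_llm) -> int:
    """Ask the model for the topic-shift sentence, then locate it ourselves."""
    sentence = call_llm(FALLBACK_PROMPT, window).strip()
    pos = window.find(sentence)
    # If the model "fixed" the text (it often will), the find fails;
    # in that case fall back to keeping the whole window.
    return pos if pos > 0 else len(window)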
It may sound complicated at first, but as you iterate through a dozen test documents, it becomes more of a "shooting fish in a barrel" situation where you methodically begin to see and catch more patterns in your content.
However, after reading this, I suspect you can appreciate why spending 30-60 seconds throwing escape sequences where you find them most relevant is easier in low-volume cases. It also gives you a solution that can essentially be parsed in a single Python operation without any potential pitfalls.
One last tip – understand the manner in which LLMs emulate logic and account for it when writing your prompt; getting the model to remember that each chunk is merely one of many is a common source of issues. You often have to go overboard to reinforce that this is just one piece in a group of many. Reinforce it at the beginning, again with an "IMMUTABLE RULE:" (or similar technique) at the end, plus once somewhere adjacent to the most fundamental of rules if that's not at the beginning.
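As a rough skeleton of that sandwich structure (the wording is illustrative, not a recipe):

```python
def build_system_prompt(core_rules: str, i: int, n: int) -> str:
    """Reinforce 'one chunk of many' at the start, the end, and beside the core rules."""
    return (
        f"You are processing chunk {i} of {n} from a single larger document. "
        f"Other chunks come before and after this one.\n\n"
        f"{core_rules}\n\n"
        f"IMMUTABLE RULE: treat this text as chunk {i} of {n}, never as a "
        f"complete document. Do not summarize or conclude as if it were whole."
    )
```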
I realize all of this sounds convoluted; however, it really isn't – particularly if you're processing a fair number of documents over the long term and accuracy is more important to you than a day spent customizing a script or iteratively optimizing prompts.
In the end you get far superior accuracy – and, even better, a framework you can quickly adapt to another task.
Speaking of Frameworks
You may think to yourself, "this all sounds convoluted; I'll just throw LangChain into the mix, tokenize, then iterate, and save myself some time." I haven't found that to be the case. I've found it easier to simply write a script like this, which is a relatively basic loop/search, than to stand up a "new and shiny" process leveraging tools like LangChain.
While that's been my experience, I'm often wrong – I'm very interested in the opinion of anyone who's found an alternative that's more effective.
I hope this can be of some help.