The dndGPT Case Study for You and Me!

How to Organize Long Documentation in a Vector Store

Thanks to the improved analysis tools on the Platform, we can get a much clearer picture of how to set up a document search structure. The whole goal here is to really, really understand a particular document inside and out.

The SRD is an excellent source document for public experimentation. What I’m looking into is search efficiency across a) the full 400-page SRD PDF, b) the SRD divided into smaller, roughly 25-page PDF chapters, and c) the specific 100-page subsection of the SRD I’m working with (the monsters).
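For reference, here’s a minimal sketch of how such a split might be done with pypdf. The filenames and page ranges are hypothetical placeholders, not the actual SRD chapter boundaries:

```python
# Minimal sketch: split a long PDF into well-named chapter files.
# Filenames and page ranges below are hypothetical, for illustration only.
from pypdf import PdfReader, PdfWriter

reader = PdfReader("SRD_full.pdf")

# (output name, first page, last page) -- 0-indexed, illustrative only.
chapters = [
    ("SRD_01_races.pdf", 0, 24),
    ("SRD_02_classes.pdf", 25, 49),
    ("SRD_monsters_a_to_z.pdf", 254, 354),
]

for filename, first, last in chapters:
    writer = PdfWriter()
    for page in reader.pages[first : last + 1]:
        writer.add_page(page)
    with open(filename, "wb") as f:
        writer.write(f)
```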

A Vector Store was created for each file set, with 4o Mini as the model, since we are only asking for simple search and retrieval of an entire monster, a stat block that runs 400-1,000 [legacy] tokens per monster. These experiments were all performed with the “Aboleth,” an approximately 750-token monster.
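These runs were set up on the Platform itself, but for anyone who wants to reproduce the setup programmatically, here’s a rough sketch using the OpenAI Python SDK’s Responses API. The store name and filename are placeholders:

```python
# Rough sketch: one Vector Store per file set, queried with 4o mini.
from openai import OpenAI

client = OpenAI()

# Create a store and upload one file set (placeholder names throughout).
store = client.vector_stores.create(name="srd-chapters")
client.vector_stores.file_batches.upload_and_poll(
    vector_store_id=store.id,
    files=[open("SRD_monsters_a_to_z.pdf", "rb")],
)

# Simple retrieval: 4o mini is enough for a lookup task like this.
response = client.responses.create(
    model="gpt-4o-mini",
    input="Retrieve the full stat block for the Aboleth.",
    tools=[{"type": "file_search", "vector_store_ids": [store.id]}],
)
print(response.output_text)
```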

First a Fun Mistake

Well, I made an error but discovered something interesting. The error was that my System Instructions named the exact file in which to find the information I wanted to extract. :sweat_smile:

WHAT’S interesting is that this simple 20-word explanation enabled the model to find the exact information I was looking for no matter the source document. Check it out:

Results from the full document:

The SRD Divided into Chapters:

Only the three chapters of the SRD regarding monsters:

I expected the small section to be significantly more efficient than the divided document, and the divided document significantly more efficient than the full-document search. But both input and output tokens are relatively similar in all three experiments, suggesting that, if the model knows where to look, it can find that information efficiently regardless of the size of the document.

This means that if you know even vaguely where to look for a thing and tell the model, it can find it. The file name was enough information to locate the requested subsection within the full document. Thus a vague 20-word prompt can save both time and money.
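To make this concrete, here’s roughly what the two System Instruction variants looked like. The exact 20 words aren’t reproduced here, so treat these as illustrative stand-ins (Variant B is what the next section uses):

```python
# Illustrative stand-ins for the two System Instruction variants.
# The actual 20-word hint from the experiments is not reproduced here.

# Variant A: with a vague location hint (the accidental version).
instructions_with_hint = (
    "You are a D&D 5E assistant. Monster stat blocks can be found in "
    "the file 'SRD_monsters_a_to_z.pdf'. Answer using file search."
)

# Variant B: no location hint (used for the second set of experiments).
instructions_without_hint = (
    "You are a D&D 5E assistant. Answer using file search."
)
```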

Fully Searching a Document Without Knowing Location
Alright, I removed said 20 words from the System Instructions, and the results became more like what I expected in the first place:

Searching the Full Document Without Knowing Location

Searching the Full Document Split into Chapters

Searching a Small Subsection

Analysis

Alright, the biggest surprise here is what happens if you already kinda know the location of what you’re looking for.

If you don’t have, or don’t provide, that information, the model uses significantly more input tokens (18,624 vs. 37,809) to search the full document and extract the requested information. That’s 2.03x more input tokens when it doesn’t even vaguely know where to look.

Interestingly, if you don’t provide vague location information but have divided the source document into smaller (well-named) subsections, the model uses almost the same number of input tokens as when it already knows where to look: 18,742 vs. 19,193.

What was expected, and is demonstrated here, is that if you do not know the location, but have previously split the document into subsections, related them through the same Vector Store, and named the files appropriately, there are significant efficiency gains: 19,193 vs. 37,809 input tokens when searching the split document vs. the full one. Again, a nearly 2x (1.97x) difference. :smirk:

What was unexpected, and is also demonstrated, is that further splitting the document into a smaller subsection (only the area of the document with monsters, a 100-page subsection) didn’t yield significantly different input results from the fully divided document: 18,889 input tokens knowing the location, and 19,490 input tokens when not knowing it.
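A quick sanity check of the ratios quoted above, using the input-token counts measured across the six experiments:

```python
# Measured input-token counts from the six experiments above,
# keyed by (file set, whether the location hint was provided).
input_tokens = {
    ("full", "hint"): 18_624,
    ("full", "no_hint"): 37_809,
    ("chapters", "hint"): 18_742,
    ("chapters", "no_hint"): 19_193,
    ("subsection", "hint"): 18_889,
    ("subsection", "no_hint"): 19_490,
}

# Hint vs. no hint on the full document: ~2.03x.
print(input_tokens[("full", "no_hint")] / input_tokens[("full", "hint")])

# Full document vs. split document, both without a hint: ~1.97x.
print(input_tokens[("full", "no_hint")] / input_tokens[("chapters", "no_hint")])
```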

Here, the win is the size of the Vector Store, which drops from 5 MB (the full document or the full split document) to 1 MB (the small subsection). A minor efficiency gain. So if you can take the time, splitting the full document is the way to go. If you’re in a hurry, grab the relevant subsection.

Finally, it is awesome to note that, regardless of the input size, the model extracted the appropriate information with only one small variance across all six experiments. (It sometimes does, and sometimes does not, include information about custom instructions regarding the Aboleth’s Legendary Actions. :thinking:) You can see this in the stability of the output tokens. That’s pretty cool.

Conclusions

  1. If you only kinda-sorta know where to look in a document of any length, include that information in your prompt. It can save all sorts of compute and money. That’s wild.
  2. Dividing a long document into smaller, appropriately named sections related through the same Vector Store yields significant, demonstrable search efficiencies. I.e., it is worth taking a long document and splitting it into chapters.
  3. Further dividing the document into a smaller Vector Store does decrease storage costs, but does not yield significant search efficiencies over the full split document in a single Vector Store.
  4. Regardless of input, the output was remarkably stable, with only the variance of a single paragraph across all six experiments.