The dndGPT Case Study for You and Me!

Automating the SRD-to-Spreadsheet Task

The current goal of this project is to automate data extraction from the SRD PDF using Python and Assistants, then write the results into a CSV using Structured Outputs.

This has proven surprisingly challenging.

Persistent Issue with Name Lookup

It’s hard to look up names from the SRD PDF. :expressionless:

Since this project started months ago, it has been hard to generate even a simple list of names from the Monsters section of the SRD 5.1, and the problem persists with the API and with 4o-mini.

The model usually gets the next few names in the list right, but then it starts pulling names from all over the document and/or hallucinating them. 4o-mini has also gotten stuck in loops on individual pages of the SRD, for some reason failing to "look on the next page" during a simple sequential lookup.
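One way to catch this drift early is to compare whatever the model returns against a short, hand-verified prefix of the list. The sketch below is only an illustration: the function name first_problem is hypothetical, it assumes a verified reference list already exists, and the sample names are placeholders for demonstration.

```python
from typing import List, Optional


def first_problem(returned: List[str], reference: List[str]) -> Optional[str]:
    """Describe the first place the model's list departs from a
    hand-verified reference list, or return None if it matches so far."""
    seen = set()
    for i, name in enumerate(returned):
        # A repeated name suggests the model is looping on a page.
        if name in seen:
            return f"position {i}: '{name}' repeated (possible loop)"
        seen.add(name)
        # Nothing left to compare against in the verified prefix.
        if i >= len(reference):
            return f"position {i}: '{name}' is past the end of the reference list"
        # Out-of-order or hallucinated name.
        if name != reference[i]:
            return f"position {i}: expected '{reference[i]}', got '{name}'"
    return None


# Illustrative names only; the reference would come from a manual check.
reference = ["Aboleth", "Deva", "Planetar", "Solar"]
print(first_problem(["Aboleth", "Deva", "Deva"], reference))
```

A check like this does not fix the lookup, but it shows the exact point at which the output stops being trustworthy.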

Even with the Price Reduction, 4o is Too Expensive for the Task

Interestingly, I reliably got 4o to retrieve the immediate next monster correctly. However, the Tokens In for that lookup equaled the amount needed to search for and extract the full information when you already know the name.

[Image: dndgpt_capture_assistant_search_4ovsmini]

At first, I thought you could use 4o to look up the name from the SRD, pass the information via a Thread to 4o-mini for extraction and pre-structuring, then pass that to another Assistant for Structured Output.

However, since each model uses the same amount of Tokens In as context, this is inherently inefficient: you’re paying twice to search through the same data.

Furthermore, even if I used only 4o for the full retrieval of each next item, it would be prohibitively expensive: around $0.30 per monster × roughly 300 monsters = $90 for the full extraction. 4o was also more likely to add extra information to the data set, which isn’t useful.
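The arithmetic behind that estimate is easy to sanity-check. This is a minimal sketch using only the rough figures from this post (about $0.30 per monster, about 300 monsters); the function name extraction_cost_usd is hypothetical, and working in whole cents is just a way to avoid floating-point rounding noise.

```python
def extraction_cost_usd(cents_per_monster: int, monster_count: int) -> float:
    """Total cost in dollars for one full pass over the monster list."""
    return cents_per_monster * monster_count / 100


# Rough figures from the post: ~30 cents per monster, ~300 monsters.
print(f"${extraction_cost_usd(30, 300):.2f}")  # $90.00
```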

Hypothesis: Too Convoluted

Insofar as Data Extraction is concerned, the D&D SRD 5.1 is an edge case.

I think identifying and listing the names of these monsters is hard because the Source Document is convoluted. Even the names require some thought.

There are Parent Monster headings that apply to some, but not all, of the monsters. The heading looks slightly different, but once a monster is no longer nested beneath a parent, nothing visual indicates it; you have to infer it from the context of what the monster is.

You can have a Parent Monster heading of "Ghoul," followed by a Ghast, which is a "Ghoul," and a Ghoul... which is also a Ghoul...
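To make the nesting explicit, each name can carry its parent heading alongside it. The following is a sketch under my own assumptions: the MonsterEntry class, its fields, and the label format are hypothetical, and "Example Monster" is a placeholder for an entry with no parent heading.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MonsterEntry:
    name: str
    parent: Optional[str] = None  # Parent Monster heading, if nested

    def label(self) -> str:
        """Unambiguous display name that keeps the parent heading visible."""
        return f"{self.name} ({self.parent})" if self.parent else self.name


entries = [
    MonsterEntry("Ghast", parent="Ghoul"),
    MonsterEntry("Ghoul", parent="Ghoul"),  # same name as its parent heading
    MonsterEntry("Example Monster"),        # placeholder: no parent heading
]

for entry in entries:
    print(entry.label())
```

Storing the parent as a separate field means a "Ghoul" heading and a "Ghoul" monster never collapse into a single row.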

My Solution

My solution here is, and has been, to 'roll up my sleeves,' look up the names myself, and create the first part of the spreadsheet manually.

Ultimately, there were so many variations, hallucinations, errors, and out-of-order entries that even the simple lookup needed constant oversight: I was doing the search myself anyway to confirm the next items on the list.
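With the names collected by hand, the first column of the spreadsheet can be seeded up front so that later extraction only fills in the blanks. This is a rough sketch using Python's standard csv module; the function name seed_csv and the column names "size" and "type" are hypothetical stand-ins, not the project's actual schema.

```python
import csv
import io
from typing import List


def seed_csv(names: List[str], extra_columns: List[str]) -> str:
    """Build CSV text with a filled 'name' column and empty cells
    for every other column, ready for later extraction passes."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["name", *extra_columns])
    for name in names:
        writer.writerow([name, *[""] * len(extra_columns)])
    return buffer.getvalue()


# Hand-verified names go in; the column names here are placeholders.
print(seed_csv(["Ghast", "Ghoul"], ["size", "type"]))
```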

It's surprising that this simple name lookup takes so much intelligence, and, I guess, "discipline."