Using a vector store means the AI has no idea what it will find: it can only write plausible queries, get ranked results, and call again when those results are unsatisfactory. That creates multiple turns of growing context stuffed with tool results, each internal iteration costing more and more tokens before the AI even starts producing output, distracting from and pushing out the chat itself.
Therefore, it may be much more useful to place the whole ruleset into context and self-manage the split between it and the conversation length. The AI then doesn't have to deliberate over randomly retrieved, repeated, overlapping chunks with arbitrary split points, spread across multiple turns.
The rules and guidebooks seem consumable as text, although a lot of the information is visual, and the “tiles” are essentially opaque to vision extraction. The bulk of the information needed is the “cards”; there are 300–400 of them, but they are not information-dense:
If I were going to keep held-out information “retrievable”, it would be via a function, such as “retrieve items” or “retrieve chapter full text”, with a summary sitting in permanent context instead. These functions can take enums such as “2.2-gameplay-phases” for indexed direct retrieval. Then expire the tool returns rapidly.
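A minimal sketch of that indexed-direct-retrieval idea, assuming the rulebook has been pre-split into sections keyed by enum-like IDs (the section IDs, the function name, and the section text here are all illustrative, not taken from the actual rulebook):

```python
# Hypothetical direct-retrieval tool: the model asks for an exact section
# by a known enum key instead of searching a vector store. Keys and text
# are placeholders; a real index would hold the full chapter bodies.
RULEBOOK_SECTIONS = {
    "2.2-gameplay-phases": "Full text of the gameplay-phases chapter ...",
    "3.1-items": "Full text of the items chapter ...",
}

def retrieve_chapter_full_text(section_id: str) -> str:
    """Return the full text for a known section ID, or a corrective error."""
    if section_id not in RULEBOOK_SECTIONS:
        # Returning the valid keys lets the model self-correct in one turn.
        return f"Unknown section: {section_id}. Valid: {sorted(RULEBOOK_SECTIONS)}"
    return RULEBOOK_SECTIONS[section_id]
```

Because the keys form a closed enum, the model never has to guess a query; a miss returns the valid key list, so there is no multi-turn search loop, and the tool result can be expired as soon as the answer is written.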
BTW: here are the AI file-extraction results for the rulebook, with character counts (token counts will be roughly 25% of those). I ran several Python-based extraction methods against your PDF and saved a continuous (non-paginated) text file for each. Character counts for the full-file extractions, from largest to smallest (higher counts usually indicate more text recovered, though not always “cleaner” text):
PyPDF2: 83,161 chars — 5c-machina-arcana-rulebook.PyPDF2.txt
pdfminer.six: 77,249 chars — 5c-machina-arcana-rulebook.pdfminer.txt
PyMuPDF (fitz): 75,856 chars — 5c-machina-arcana-rulebook.pymupdf.txt
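The ~25% rule of thumb above can be checked with a quick estimate (the 4-characters-per-token ratio is a rough heuristic for English prose, not an exact tokenizer count):

```python
# Rough token estimates from the extraction character counts above,
# using the common ~4 characters-per-token heuristic for English text.
char_counts = {
    "PyPDF2": 83_161,
    "pdfminer.six": 77_249,
    "PyMuPDF": 75_856,
}

def estimate_tokens(chars: int, chars_per_token: float = 4.0) -> int:
    return round(chars / chars_per_token)

for name, chars in char_counts.items():
    print(f"{name}: ~{estimate_tokens(chars):,} tokens")
```

So even the largest extraction lands around 20k tokens, comfortably inside the context window of any current long-context model.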
semantic search practicality (one of several tomes to incorporate)
- The file is a board-game rulebook (“Machina Arcana,” v3.0) with clear sections: game overview, setup, phases, units/attributes, items, events, abilities, targeting, map tiles, effects, and monster behavior. These repeatable headers and domain terms will index well for semantic and hybrid keyword search.
- Several extractions (especially PyPDF2) include typographic artifacts from small-caps/ligatures (e.g., “/g.sc”, “/uni25C6”). This noise can slightly reduce embedding quality. A light text normalization/cleanup pass (strip “/x.sc” fragments, map “/uni25C6” to bullets, collapse spaces) is advisable before embedding.
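The cleanup pass described in that bullet could look like the following sketch; the two artifact patterns (“/g.sc”-style small-caps fragments and “/uni25C6”) come from the extraction notes above, and real output may contain other glyph names worth adding to the table:

```python
import re

# Map private glyph names to readable characters; extend as artifacts appear.
GLYPH_MAP = {"/uni25C6": "•"}

def normalize(text: str) -> str:
    """Light pre-embedding cleanup: glyph mapping, small-caps fragment
    removal, and whitespace collapsing."""
    for glyph, repl in GLYPH_MAP.items():
        text = text.replace(glyph, repl)
    text = re.sub(r"/[A-Za-z]+\.sc", "", text)  # strip fragments like "/g.sc"
    text = re.sub(r"[ \t]+", " ", text)         # collapse runs of spaces/tabs
    return text.strip()
```

Run on a garbled line such as `"SETUP /g.sc  /uni25C6 Place  tiles"`, this yields `"SETUP • Place tiles"`.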
- Content density is moderate (~75–83k characters across methods), suitable for chunking (e.g., 800–1200 tokens per chunk with overlap). Natural chunk boundaries are the section headers (GAME SETUP, GAMEPLAY, ITEM, EVENT, ABILITIES, etc.), which are plentiful.
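Header-anchored splitting along those lines can be sketched as below, assuming the all-caps section headers (GAME SETUP, GAMEPLAY, ...) survive extraction as lines of their own; the token-budget and overlap step mentioned above would follow as a second pass:

```python
import re

# Matches a line that is entirely uppercase words (with spaces/ampersands),
# e.g. "GAME SETUP" or "GAMEPLAY" — an assumed shape for the extracted headers.
HEADER_RE = re.compile(r"^[A-Z][A-Z &]{3,}$", re.MULTILINE)

def chunk_by_headers(text: str) -> list[str]:
    """Split extracted rulebook text at section headers; each chunk
    keeps its header line so embeddings retain the section context."""
    starts = [m.start() for m in HEADER_RE.finditer(text)]
    bounds = [0] + starts + [len(text)]
    chunks = [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [c for c in chunks if c]  # drop empty leading slice
```

Keeping the header inside each chunk is the design choice that matters here: it is what makes the repeatable section names and domain terms index well for hybrid keyword search.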
TL;DR: vector stores will cost you, delay you, and bloat the context with distraction anyway. Use a model that can ingest the full rules and answer turns immediately, with the rules placed consistently in a cacheable position at the start of context.