Hey all,
we’re trying to design a help bot that lives inside of our application. If we upload our existing documentation to an Assistant (Files), does it cost tokens every time the Assistant accesses these files?
If yes, would it not be possible to train (fine-tune) a model on our existing documentation and use that instead?
Or, what alternative architecture would keep costs low but quality high enough? For example, a colleague suggested keeping the documents in a local vector store, then, for each question, finding the most relevant documentation and including its contents and images in the prompt. Would that use fewer tokens?
Hi,
The Assistants file system is backed by a vector store, but it can also pull whole documents into context where it decides they are needed, so you have to budget for those additional tokens.
A self-managed vector store can be an efficient way to add additional context, but it's not perfect; no system is. Chunking (the act of splitting a document into sections to be embedded into a vector database) is again something to experiment with. You can have:
Simple fixed length chunking
Fixed length with overlap
Punctuation bounded, i.e. sentences ending with period (.)
Paragraph chunking
Page chunking
Semantic chunking, where you compare the embeddings of adjacent chunks and use a rapid change in similarity as a boundary.
AI-assisted chunking, where you show each page to a model and ask it to split the sections up.
…and probably a bunch more I've not listed. You also need to experiment with the overlap size and overall chunk size to see what works best for your use case.
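Here's a minimal sketch of the fixed-length-with-overlap approach in Python; the chunk size and overlap values are placeholders to tune against your own documents, and the file path is just an example:

```python
# Minimal sketch: fixed-length chunking with a sliding-window overlap.
# chunk_size and overlap are placeholders -- tune them for your documents.
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 200) -> list[str]:
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window slides each time
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# Example usage (hypothetical file path):
chunks = chunk_text(open("docs/getting-started.md").read())
```

The overlap means a sentence cut at one chunk boundary still appears whole in the neighbouring chunk, which is the main point of the sliding window.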
It is non-trivial to chunk, embed, and then search a vector store if you are managing the whole thing yourself. Some of the commercial packages, such as Pinecone, Qdrant, ChromaDB, etc., have chunking helpers and even fully managed systems to do it, but each has its strengths and weaknesses.
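For illustration, here's a rough sketch of the fully self-managed route: embed the chunks from the snippet above with the OpenAI embeddings endpoint and do the similarity search yourself with numpy. The model choice and top-k value are assumptions, not recommendations:

```python
# Rough sketch of a self-managed store: embed chunks once, then find the
# chunks closest to a question by cosine similarity.
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# Embed once, then normalise so a dot product is a cosine similarity.
doc_vectors = embed(chunks)  # `chunks` from the chunking sketch above
doc_vectors /= np.linalg.norm(doc_vectors, axis=1, keepdims=True)

def top_k(question: str, k: int = 3) -> list[str]:
    q = embed([question])[0]
    q /= np.linalg.norm(q)
    scores = doc_vectors @ q  # cosine similarity against every chunk
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]
```

The retrieved chunks then go into the prompt, so you only pay for the few chunks you actually send rather than the whole corpus.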
The best option is to include everything in the main prompt if budget allows.
Finetuning of the kind currently available does not teach the model new information; it trains it to act in new ways, i.e. writing style, not content.
Clean the data with a cheap model like gpt-4o-mini, then chunk it using a sliding window with roughly 25% overlap.
For cleaning, you can do it in a very basic/crude but extremely cost-efficient way. Add line numbers to the data when you send it to the AI and ask it which lines or ranges of lines to remove. Then delete the lines starting from the end of the data and working toward the start, so the deletions don't shift the numbering of lines you haven't removed yet and you don't discard data you should keep.
Important: be sure to tell the model to report the original line numbers and not to account for changes in the line count as lines are removed.
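A crude sketch of that flow, assuming gpt-4o-mini and a made-up prompt; the answer parsing is deliberately simple:

```python
# Sketch of the numbered-line cleaning pass described above.
import re
from openai import OpenAI

client = OpenAI()

def clean(raw: str) -> str:
    lines = raw.splitlines()
    numbered = "\n".join(f"{i}: {line}" for i, line in enumerate(lines, 1))
    prompt = (  # hypothetical prompt wording -- adjust to your data
        "Below is a numbered document. Reply only with the line numbers or "
        "ranges (e.g. 3, 7-9) of lines to remove as boilerplate. Always use "
        "the original numbering; do not adjust for earlier removals.\n\n"
        + numbered
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    answer = resp.choices[0].message.content
    # Parse "3, 7-9" style answers into (start, end) pairs.
    spans = [(int(a), int(b) if b else int(a))
             for a, b in re.findall(r"(\d+)(?:\s*-\s*(\d+))?", answer)]
    # Delete from the end toward the start so earlier numbers stay valid.
    for start, end in sorted(spans, reverse=True):
        del lines[start - 1:end]
    return "\n".join(lines)
```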