Hey all,
we’re trying to design a help bot that lives inside of our application. If we upload our existing documentation to an Assistant (Files), does it cost tokens every time the Assistant accesses these files?
If yes, would it not be possible to train (fine-tune) a model on our existing documentation and use that instead?
Or, what alternative architecture would keep costs low but quality high enough? For example, a colleague suggested keeping the documents in a local vector store, then, for each question, finding the most relevant documentation and including its contents and images in the prompt. Would that use fewer tokens?
Hi,
The Assistants file system is backed by a vector store, but it can also pull whole documents into context where it decides they are needed, so you have to budget for those additional tokens.
A self-managed vector store can be an efficient way to add additional context, but it's not perfect; no system is. Chunking (the act of splitting a document into sections to be embedded into a vector database) is again something to experiment with. You can have:
Simple fixed length chunking
Fixed length with overlap
Punctuation bounded, i.e. sentences ending with period (.)
Paragraph chunking
Page chunking
Semantic chunking, where you compare the embeddings of adjacent chunks and use a rapid change in similarity as a boundary.
AI-assisted chunking, where you show each page to a model and ask it to split the sections up.
…and probably a bunch more I've not listed. You also need to experiment with the overlap size and overall chunk size to see what works best for your use case.
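Here's a minimal sketch of the fixed-length-with-overlap approach in Python; the chunk size and overlap values are placeholders to tune against your own documents, and the file path is just an example:

```python
# Minimal sketch: fixed-length chunking with a sliding-window overlap.
# chunk_size and overlap are placeholders -- tune them for your documents.
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 200) -> list[str]:
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window slides each time
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# Example usage (hypothetical file path):
chunks = chunk_text(open("docs/getting-started.md").read())
```

The overlap means a sentence cut at one chunk boundary still appears whole in the neighbouring chunk, which is the main point of the sliding window.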
It is non-trivial to chunk, embed, and then search a vector store if you are managing the whole thing yourself. Some of the commercial packages, such as Pinecone, Qdrant, ChromaDB, etc., have chunking helpers and even fully managed systems to do it, but each has its strengths and weaknesses.
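For illustration, here's a rough sketch of the fully self-managed route: embed the chunks from the snippet above with the OpenAI embeddings endpoint and do the similarity search yourself with numpy. The model choice and top-k value are assumptions, not recommendations:

```python
# Rough sketch of a self-managed store: embed chunks once, then find the
# chunks closest to a question by cosine similarity.
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# Embed once, then normalise so a dot product is a cosine similarity.
doc_vectors = embed(chunks)  # `chunks` from the chunking sketch above
doc_vectors /= np.linalg.norm(doc_vectors, axis=1, keepdims=True)

def top_k(question: str, k: int = 3) -> list[str]:
    q = embed([question])[0]
    q /= np.linalg.norm(q)
    scores = doc_vectors @ q  # cosine similarity against every chunk
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]
```

The retrieved chunks then go into the prompt, so you only pay for the few chunks you actually send rather than the whole corpus.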
The best option is to include everything in the main prompt if budget allows.
Finetuning of the kind currently available does not teach the model new information; it trains it to act in new ways, i.e. writing style, not content.
Clean the data with a cheap model like gpt-4o-mini, then chunk it using a sliding window with roughly 25% overlap.
For cleaning, you can do it in a very basic/crude but extremely cost-efficient way. Add line numbers to the data when you send it to the AI and ask it which lines or ranges of lines to remove. Then delete the lines starting from the end of the data and working toward the start, so the deletions don't shift the numbering of lines you haven't removed yet and you don't discard data you should keep.
Important: be sure to tell the model to report the original line numbers and not to account for changes in the line count as lines are removed.
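A crude sketch of that flow, assuming gpt-4o-mini and a made-up prompt; the answer parsing is deliberately simple:

```python
# Sketch of the numbered-line cleaning pass described above.
import re
from openai import OpenAI

client = OpenAI()

def clean(raw: str) -> str:
    lines = raw.splitlines()
    numbered = "\n".join(f"{i}: {line}" for i, line in enumerate(lines, 1))
    prompt = (  # hypothetical prompt wording -- adjust to your data
        "Below is a numbered document. Reply only with the line numbers or "
        "ranges (e.g. 3, 7-9) of lines to remove as boilerplate. Always use "
        "the original numbering; do not adjust for earlier removals.\n\n"
        + numbered
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    answer = resp.choices[0].message.content
    # Parse "3, 7-9" style answers into (start, end) pairs.
    spans = [(int(a), int(b) if b else int(a))
             for a, b in re.findall(r"(\d+)(?:\s*-\s*(\d+))?", answer)]
    # Delete from the end toward the start so earlier numbers stay valid.
    for start, end in sorted(spans, reverse=True):
        del lines[start - 1:end]
    return "\n".join(lines)
```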