Unexpectedly High Token Consumption in OpenAI Assistant

I’m experiencing unexpectedly high token usage when using OpenAI’s Assistant API with file search. My input message (user+system) is relatively short (estimated ~200 tokens), but every request consumes around 21,000 tokens, which is completely disproportionate.

Use Case:

I am building an image analysis script where:

  • The system submits an image (referred to below as image_url).
  • I provide a short text description of the image (small string, probably ~200 characters long).
  • OpenAI’s Assistant API with file search should return relevant tags (industries) from a vector store that contains a single text file with ~1,000 keywords.
  • The API should only return industry names that match the provided vector data.
  • The API should search the vector store, extract relevant industries, and return a small JSON response (this is working fine btw).

More Context:

  • Model: gpt-4o
  • Using OpenAI Assistants API
  • File Storage: A single text file (~1,000 words; the tokenizer puts it at ~4,300 tokens) stored in a vector store

Debugging Steps Taken:

  • Average token usage for output: ~50 tokens
  • Tried setting truncate_context: true
  • Tried to limit the file search result size (a max_chunks parameter is not supported by the API)
  • Confirmed the vector store only has a single text file (~1,000 words / ~4,300 tokens)
  • Checked the assistant response (very short, ~50 tokens only)
  • Made sure no persistent threads are reused and no excessive history is being loaded

Why is the Assistant API consuming so many tokens (~21,000) when my prompts are at most ~200 tokens and the text file in the vector store should only account for ~4,300 tokens? Does the image analysis really cost more than 15,000 tokens for a 600x400px image?
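
My own back-of-the-envelope estimate, based on my reading of the documented tiling formula (85 base tokens plus 170 per 512px tile at high detail, a flat 85 tokens at low detail), says it shouldn't. This is only my approximation, not an official calculator:

function estimateImageTokens(width: number, height: number, detail: 'low' | 'high'): number {
    if (detail === 'low') return 85; // low detail is a flat 85 tokens
    // High detail: scale down to fit 2048x2048, then so the shortest side is at most 768px
    const fit = Math.min(1, 2048 / Math.max(width, height));
    let w = width * fit;
    let h = height * fit;
    const shrink = Math.min(1, 768 / Math.min(w, h));
    w *= shrink;
    h *= shrink;
    // Count 512x512 tiles at 170 tokens each, plus the 85-token base
    const tiles = Math.ceil(w / 512) * Math.ceil(h / 512);
    return 85 + 170 * tiles;
}

console.log(estimateImageTokens(600, 400, 'high')); // 425 — nowhere near 15,000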

How can I reduce token consumption while still leveraging the vector store efficiently?

Any insights or suggestions would be highly appreciated.

public async setupAssistant(): Promise<string> {
    const assistant = await this.openai.beta.assistants.create({
        name: "Industry Tagging Assistant",
        instructions: "Analyze images and return relevant industry tags based on the vector store.",
        model: "gpt-4o",
        temperature: 0.2,
        tools: [{ type: "file_search" }],
        tool_resources: {
            file_search: { vector_store_ids: [env.OPENAI_VECTOR_STORE_ID] },
        },
        response_format: {
            type: "json_schema",
            json_schema: {
                // json_schema requires a name, with the schema itself nested under `schema`
                name: "industry_tags",
                schema: {
                    type: "object",
                    properties: {
                        data: { type: "array", items: { type: "string" } },
                    },
                    required: ["data"],
                    additionalProperties: false,
                },
                strict: true,
            },
        },
    });
    return assistant.id;
}

public async analyzeMedia(mediaId: string): Promise<void> {
    const thread = await this.openai.beta.threads.create();

    // NOTE: `media` and `publicURL` are looked up from `mediaId` elsewhere (omitted here)
    const description = `Analyze this image. Description: "${media.descriptions?.en.text ?? 'No description available'}"`;
    const userMessage = [
        { type: 'text', text: description },
        // adding detail: 'low' here would cap the image at a flat 85 input tokens
        { type: 'image_url', image_url: { url: publicURL } },
    ];

    await this.openai.beta.threads.messages.create(thread.id, {
        role: 'user',
        content: userMessage,
    });

    // Runs are asynchronous: createAndPoll waits for a terminal status, whereas
    // retrieving the run right after creating it would still show it "queued"
    const run = await this.openai.beta.threads.runs.createAndPoll(thread.id, {
        assistant_id: env.OPENAI_ASSISTANT_ID,
        tool_choice: { type: "file_search" }, // tool_choice takes an object, not a bare string
    });

    const messages = await this.openai.beta.threads.messages.list(thread.id);
    const assistantMessage = messages.data.find(msg => msg.role === 'assistant');

    if (!assistantMessage) throw new Error('No response from assistant.');

...
}

File search is a tool that the model can call, multiple times if it wants, especially when the results aren't what it expected. Every tool output feeds another internal model call on a growing conversation, so you pay input tokens for the whole accumulated context on each iteration instead of paying once for a single response.
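
If you stay on the Assistants API, you can at least put hard bounds on each run. A sketch, reusing the openai client, env, and thread from your code, and assuming the v2 API parameters (max_num_results caps how many retrieved chunks file search injects; max_prompt_tokens and truncation_strategy bound the context per run):

// Cap how many chunks the file_search tool may inject into the context
await openai.beta.assistants.update(env.OPENAI_ASSISTANT_ID, {
    tools: [{ type: "file_search", file_search: { max_num_results: 5 } }],
});

// Bound what a single run may consume; exceeding max_prompt_tokens ends the
// run with status "incomplete" instead of letting the context balloon
const run = await openai.beta.threads.runs.create(thread.id, {
    assistant_id: env.OPENAI_ASSISTANT_ID,
    max_prompt_tokens: 8000,
    max_completion_tokens: 500,
    truncation_strategy: { type: "last_messages", last_messages: 3 },
});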

You can examine the run steps list to see the tool invocations, the iterations, and the token consumption of each delegated step.
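
For example, with the same client (each tool call and message creation is its own step, and each step reports its own usage):

const steps = await openai.beta.threads.runs.steps.list(thread.id, run.id);
for (const step of steps.data) {
    // step.type is "tool_calls" or "message_creation"; step.usage shows the
    // prompt/completion tokens billed for that internal model call
    console.log(step.type, step.status, step.usage);
}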

Or you can recognize that nothing you describe needs an AI calling tools at all. With Chat Completions, you can provide your system message, the knowledge file to answer from, and the image prompt, and get your response in a single API call whose input and output tokens are exactly what you sent and generated.
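
A sketch of that single-call approach, reusing description and publicURL from the code above (industries.txt is a hypothetical name for your keyword file; detail: "low" caps the image at a flat 85 tokens):

import fs from "node:fs";
import OpenAI from "openai";

const openai = new OpenAI();
const keywords = fs.readFileSync("industries.txt", "utf8"); // the ~1,000-keyword list

const completion = await openai.chat.completions.create({
    model: "gpt-4o",
    temperature: 0.2,
    messages: [
        {
            role: "system",
            content: `Return industry tags for the image, chosen only from this list:\n${keywords}`,
        },
        {
            role: "user",
            content: [
                { type: "text", text: description },
                { type: "image_url", image_url: { url: publicURL, detail: "low" } },
            ],
        },
    ],
    response_format: {
        type: "json_schema",
        json_schema: {
            name: "industry_tags",
            schema: {
                type: "object",
                properties: { data: { type: "array", items: { type: "string" } } },
                required: ["data"],
                additionalProperties: false,
            },
            strict: true,
        },
    },
});

console.log(completion.choices[0].message.content); // e.g. {"data":["Automotive","Retail"]}

By the numbers in your post, that is roughly 4,300 tokens for the keyword list, ~200 for the prompt, and 85 for the image, about 5,000 input tokens per call, and you know exactly what you are paying for.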