Assistant API - way too many "input" tokens used

My assistant has a pretty big instruction (986 tokens according to this: https://platform.openai.com/tokenizer), that I pass to it like this:

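// `openai`, `instruction`, `thread`, and `assistant` are assumed to be declared elsewhere in the module.
async function setupAssistant() {
  // Create an empty thread to hold the conversation.
  thread = await openai.beta.threads.create();
  // Create the assistant with the long instruction text.
  assistant = await openai.beta.assistants.create({
    name: "Tutor",
    instructions: instruction,
    model: "gpt-4-turbo-preview",
  });
}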

In my sample run, it generated an output of 332 tokens.

When I look at the stats after this first in/out here https://platform.openai.com/threads/, I get the following:

Tokens 1544 · 1146 in, 398 out

which kind of makes sense with some overhead, right?

I then add another message to the same thread with 7 (seven!) tokens, and got an output of 313 tokens.

So I would expect to have around 1146 + 7 tokens IN and 398 + 313 tokens OUT. But what I really get in the billing / threads-console is:

Tokens 3482 · 2705 in, 777 out

So, the "OUT" makes sense.

But how has the "IN" more than doubled, when I only entered 7 tokens for the next message?

I need to use the Assistants-API because I need to have one continuous conversation "back and forth".

Wait.

  • Run 1: 1146 tokens in, 398 tokens out
  • Run 2: 7 tokens in
  • after run two, the threads-console tells me I have used 2705 in, which is almost exactly 1146 + 398 + 1146 + 7 = 2697

That would mean the initial message is re-sent on every run? But why?
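The arithmetic above can be checked with a small sketch. It assumes (this is not confirmed by any documentation quoted here) that every run re-sends everything already in the thread, so the input of a run is the previous context plus the new message. `estimateThreadUsage` is a hypothetical helper, and the second run's output of 379 tokens is taken from the console figures (777 - 398), not from the tokenizer count.

```javascript
// Hypothetical helper: models each run as re-sending the whole thread so far.
// This is an assumption that happens to fit the console numbers, not a documented rule.
function estimateThreadUsage(turns) {
  let context = 0;   // tokens already in the thread before the current run
  let totalIn = 0;   // cumulative input tokens across all runs
  let totalOut = 0;  // cumulative output tokens across all runs
  const inputPerRun = [];

  for (const { newInputTokens, outputTokens } of turns) {
    // Input of this run = everything sent before + the newly added tokens.
    const runInput = context + newInputTokens;
    inputPerRun.push(runInput);
    totalIn += runInput;
    totalOut += outputTokens;
    // The reply becomes part of the context for the next run.
    context = runInput + outputTokens;
  }
  return { inputPerRun, totalIn, totalOut };
}

const usage = estimateThreadUsage([
  { newInputTokens: 1146, outputTokens: 398 }, // run 1: instructions + first message + overhead
  { newInputTokens: 7, outputTokens: 379 },    // run 2: the 7-token message
]);
console.log(usage);
```

Under that assumption the second run alone takes 1146 + 398 + 7 = 1551 input tokens, and the cumulative input is 2697, close to the 2705 the console reports.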

Has anyone any idea or insight for me? I really don't get how this works.

I have written about how this works before, in detail, based on what I discovered beyond what they publish, and I also posted a synopsis. For example, in November:

OpenAI's line about this is "you delegate control"…

If you enable certain functions for the Assistant API (function calls, retrieval, and code interpreter), a big wall of text actually gets added to the system prompt so that the assistant knows the features are available. It takes up a lot of tokens, especially if you've got them all enabled. I just tried to mess with it on my own - it seems like the whole system prompt is actually sent along with every message that you add to the thread. I assume that's because the instructions and feature set can be changed while a thread is active? I'm not sure.

Funnily enough, with some prompting, you can actually get the assistant to spit out the whole system prompt, which is whatever instructions you wrote, followed by this:

# Tools

## python

When you send a message containing Python code to python, it will be executed in a
stateful Jupyter notebook environment. python will respond with the output of the execution or time out after 60.0
seconds. The drive at '/mnt/data' can be used to save and persist user files. Internet access for this session is disabled. Do not make external web requests or API calls as they will fail.

## myfiles_browser

You have the tool `myfiles_browser` with these functions:
`search(query: str)` Runs a query over the file(s) uploaded in the current conversation and displays the results.
`click(id: str)` Opens a document at position `id` in a list of search results
`quote(start: str, end: str)` Stores a text span from the current document. Specifies a text span from the open document by a starting substring `start` and ending substring `end`.
`back()` Returns to the previous page and displays it. Use it to navigate back to search results after clicking into a result.
`scroll(amt: int)` Scrolls up or down in the open page by the given amount.
`open_url(url: str)` Opens the document with the ID `url` and displays it. URL must be a file ID (typically a UUID), not a path.
please render in this format: `【{message idx}†{link text}】`

Tool for browsing the files uploaded by the user.

Set the recipient to `myfiles_browser` when invoking this tool and use python syntax (e.g. search('query')). "Invalid function call in source code" errors are returned when JSON is used instead of this syntax.

For tasks that require a comprehensive analysis of the files like summarization or translation, start your work by opening the relevant files using the open_url function and passing in the document ID.
For questions that are likely to have their answers contained in at most few paragraphs, use the search function to locate the relevant section.

Think carefully about how the information you find relates to the user's request. Respond as soon as you find information that clearly answers the request. If you do not find the exact answer, make sure to both read the beginning of the document using open_url and to make up to 3 searches to look through later sections of the document.


## functions

namespace functions {

// ((if you have any functions, they'll be displayed here))

} // namespace functions

## multi_tool_use

// This tool serves as a wrapper for utilizing multiple tools. Each tool that can be used must be specified in the tool sections. Only tools in the functions namespace are permitted.
// Ensure that the parameters provided to each tool are valid according to that tool's specification.
namespace multi_tool_use {

// Use this function to run multiple tools simultaneously, but only if they can operate in parallel. Do this even if the prompt suggests using the tools sequentially.
type parallel = (_: {
// The tools to be executed in parallel. NOTE: only functions tools are permitted
tool_uses: {
// The name of the tool to use. The format should either be just the name of the tool, or in the format namespace.function_name for plugin and function tools.
recipient_name: string,
// The parameters to pass to the tool. Ensure these are valid according to the tool's own specifications.
parameters: object,
}[],
}) => any;

} // namespace multi_tool_use

You can see why so many tokens get added for every message you send…

I'd recommend just disabling whatever features you don't need, or just switching to the chat-completion API for tasks when you don't need them.
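A minimal sketch of the first suggestion, assuming the Node `openai` SDK from the original post: build the assistant parameters so that only the tools you actually need are listed. `buildAssistantParams` is a hypothetical helper, not part of the SDK, and it only covers built-in tool types that need no extra definition (for example `code_interpreter` or `retrieval`); function tools would need their own schema.

```javascript
// Hypothetical helper (not part of the openai SDK): builds the params object
// for openai.beta.assistants.create with only the requested built-in tools.
function buildAssistantParams(instructions, neededTools = []) {
  return {
    name: "Tutor",
    instructions,
    model: "gpt-4-turbo-preview",
    // An empty array means no tool description text should be needed in the prompt
    // (assumption based on the behaviour described above).
    tools: neededTools.map((type) => ({ type })),
  };
}

// No tools at all: only the instructions and the thread are sent.
const plainParams = buildAssistantParams("You are a tutor.");

// Only the code interpreter, leaving retrieval and functions off.
const codeOnlyParams = buildAssistantParams("You are a tutor.", ["code_interpreter"]);

console.log(plainParams.tools.length, codeOnlyParams.tools);

// Usage with the SDK (requires an API key, so it is left commented out):
// const assistant = await openai.beta.assistants.create(plainParams);
```

This only trims the tool text. The thread history is still part of every run, so the input count keeps growing with the conversation either way.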

Hello. Good find!

I am actually also interested specifically in retrieval (or file search, in V2 world). Do you know if we can dig up what (additional) prompt or instruction actually gets sent to the model?

I agree that the prompt text for all these tools adds up to a lot of tokens, but I suspect the additional prompt that includes the file search results is even larger.

555 tokens of # Tools for the new text of file_search.

387 before.

Neither of which compare to the amount of injection done automatically by v1 before it searches and browses, or the amount of non-threshold results you get from a v2 search.

v2. minimum bot, minimum input: 640 tokens placed. 1500 tokens of bike text not searched upon or automatically injected (25k+ of vector storage could make that a 25k input question).

I might add that vector search can be problematic for the application shown. That tool text includes: "Tool for browsing the files uploaded by the user.", and "Parts of the documents uploaded by users will be automatically included in the conversation."

You don't have the ability to change that tool to "company information placed by the AI developers to help you perform your task".

Also, note the massive amplification of token usage on just turn 2, where I could have placed all the off-topic text for 1600 by real injection RAG:

Hi there. I'm experiencing failures every time I try to run my thread, the error being:
code='rate_limit_exceeded', message='Request too large for gpt-4o in organization… on tokens per min (TPM): Limit 30000, Requested 33543."

My prompt and system instructions, when added together, are in the ballpark of 870-900 tokens. I have no idea how/why it thinks I'm requesting >33000 tokens… unless it's due to my vector store. I haven't read it anywhere in the documentation, but after reading your response, I am questioning whether having a vector store means that the entire vector store is included with each query to the assistant. Is that correct? If so, could you please refer me to where in the documentation this is outlined? That's wild if that's the case… moreover, being on tier 1, it means I can't even access my assistant having the vector store size that I do (which I feel isn't even that large). Thanks for your time
~Alex