We have been working on an internal chat assistant that has access to several third-party APIs we currently use. The assistant has a few custom tools that call those API endpoints to pull data and return it to the user based on their prompt. This works well, but we have run into an issue: the amount of data passed to the assistant can sometimes be large and consume many tokens, since the assistant has to filter through the data itself, find what the user asked for, and return that piece to the user. This runs through a custom Flask API we have built that runs on a server as needed.
Are there any solutions that work well with OpenAI Assistants — whether through cookies, session data, etc. — that would let us filter more of the data before the assistant has to process it? The goal is to reduce tokens spent purely on processing data, and to improve response time when using some of the custom function tools we have built.
To help your understanding: network transmissions, API call parameters, and JSON are not what count toward billed tokens. Only language that is token-encoded and placed into the AI model's input counts.
Assistants give you no control over the length of conversation history passed to subsequent AI model runs. That unseen chat contained in threads can also include a history of AI function calls and the results they returned. The promise is that the model's context is filled to the maximum.
Assistants also give you no control over the amount or relevance of knowledge-file content placed into the AI model's input. It, too, is promised to fill the available model context.
Assistants also have undisclosed methods for iteratively searching documents, reading and scrolling through pages, each step making a subsequent AI model call that carries along the accumulated context.
The AI can also write Python code to execute instead of writing to a user, and it can keep retrying when it makes coding errors.
Techniques you can use: disable unnecessary functions; disable retrieval or detach its files; regularly terminate threads if the chat grows too long.
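A cheap way to apply the last technique: track a rough size budget on your side and start a fresh thread once it is exceeded. A minimal sketch — the `MAX_CHARS` budget and `ThreadTracker` class are illustrative names of mine, not Assistants API objects:

```python
# Sketch: flag a thread for reset once its accumulated text exceeds a budget.
# MAX_CHARS and ThreadTracker are illustrative, not part of the OpenAI API.
MAX_CHARS = 20_000  # roughly 5k tokens at ~4 chars/token

class ThreadTracker:
    def __init__(self):
        self.chars = 0
        self.needs_reset = False

    def record(self, message: str) -> None:
        """Count text appended to the thread; flag when over budget."""
        self.chars += len(message)
        if self.chars > MAX_CHARS:
            # Caller deletes the thread and starts a fresh one,
            # so stale history stops being re-billed every run.
            self.needs_reset = True

tracker = ThreadTracker()
tracker.record("x" * 25_000)
print(tracker.needs_reset)  # True
```

The char-based estimate is deliberately crude; it only needs to be good enough to decide "this thread is getting expensive".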
The best technique is to manage everything yourself, using the chat completions endpoint and your own code.
I have a very similar setup using Django/Celery to run about 25 different Assistants, some of which are connected to either chat or email responders. They connect to a host of APIs (Salesforce, Pitchbook, Quickbooks, Google Sheets, etc.).
There is no other way to handle that now (and I believe in the future too) than in the API wrappers (i.e., the tooling functions) you call. This can mean limiting the number of records returned, the data per record, or a combination.
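Concretely, that means the wrapper does the filtering, capping, and field-stripping before anything reaches the model. A sketch under assumptions — `fetch_records` is a placeholder standing in for any real third-party call (Salesforce, Quickbooks, etc.), and the parameter names are illustrative:

```python
# Sketch: a tool wrapper that filters and truncates third-party API results
# before returning them to the assistant, so the model only sees the slice
# the user actually asked about. fetch_records() is a placeholder.
import json

def fetch_records():
    # Placeholder for a real third-party API call returning bulky records.
    return [
        {"id": i, "name": f"Acct {i}",
         "region": "EMEA" if i % 2 else "NA",
         "notes": "long free text " * 50}
        for i in range(500)
    ]

def lookup_accounts(region, max_records=10, fields=("id", "name")):
    """Filter server-side, cap the record count, and drop bulky fields."""
    rows = [r for r in fetch_records() if r["region"] == region]
    slim = [{k: r[k] for k in fields} for r in rows[:max_records]]
    return json.dumps(slim)  # compact JSON is all the model ever ingests

result = lookup_accounts("EMEA", max_records=5)
```

The same idea extends to letting the model pass `fields` and `max_records` itself as function-call arguments, so the assistant requests only what the prompt needs.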
As with the famous "640K ought to be enough", we will keep having this problem no matter the context size — 32k, 64k, 128k, whatever the next version brings.
Happy to discuss in more detail!