I have a question. I’m using GPT-4 as the primary LLM in my chat application. We’re using the Assistants API.
Once the Assistant has decided to make a tool call, but before we submit the results of that call, I would like to swap to a faster model like GPT-3.5 to summarize the retrieved data.
The reason is that the Assistants API is fairly slow, and I suspect that using a faster model for this step might speed it up.
Does anyone know whether this sort of thing is even possible, or have any other tricks for speeding up the Assistants API?
I don’t see how the additional AI inference would speed things up. Submitting more input context to the model does not meaningfully delay the beginning of generation. You can benchmark gpt-4 on the Chat Completions endpoint and see how long it takes to produce the first and final token with small and large inputs (you can send irrelevant text and mark it as irrelevant to minimize how much it alters the output).
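If you want to measure that yourself, here’s a minimal sketch, assuming the openai v1.x Python SDK and an OPENAI_API_KEY in the environment; the prompt text and padding size are just placeholders:

```python
# Rough latency benchmark: streams a completion and times the first and
# final token, with and without a block of padded "irrelevant" input.
import time
from openai import OpenAI

client = OpenAI()

def benchmark(model: str, padding: str = "") -> tuple[float, float]:
    """Return (seconds to first token, seconds to final token)."""
    content = "Reply with a one-sentence fact about the moon."
    if padding:
        content += "\n\nIgnore the following irrelevant text:\n" + padding
    start = time.perf_counter()
    first = None
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": content}],
        stream=True,
    )
    for chunk in stream:
        # Some chunks carry no content (role header, finish marker).
        if first is None and chunk.choices and chunk.choices[0].delta.content:
            first = time.perf_counter() - start
    return first, time.perf_counter() - start

print("small input:", benchmark("gpt-4"))
print("large input:", benchmark("gpt-4", padding="lorem ipsum " * 1500))
```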
Transforming or amending the data you send to the AI as knowledge or as a tool return is certainly possible. A tool return value is just natural language placed in a “function”-role message, which the AI reads as the response to the query it sent. Just be aware that a model with lower cognitive ability may lose some of the finer points of the language that may be important for answering.
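To make the mechanics concrete, here’s a sketch of the swap you describe, assuming the openai v1.x Python SDK; `condense` and `submit_condensed` are hypothetical helper names, and the IDs are assumed to come from your existing run-polling loop:

```python
# Sketch: compress a raw tool result with a faster model, then submit the
# condensed text as the tool output for the waiting Assistants run.
from openai import OpenAI

client = OpenAI()

def condense(raw_result: str) -> str:
    """Compress the raw tool output with a faster, cheaper model."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Summarize the following tool output, preserving "
                        "all facts, numbers, names, and caveats."},
            {"role": "user", "content": raw_result},
        ],
    )
    return resp.choices[0].message.content

def submit_condensed(thread_id: str, run_id: str,
                     tool_call_id: str, raw_result: str):
    # The Assistant only ever sees the condensed text, not the raw result.
    return client.beta.threads.runs.submit_tool_outputs(
        thread_id=thread_id,
        run_id=run_id,
        tool_outputs=[{"tool_call_id": tool_call_id,
                       "output": condense(raw_result)}],
    )
```

Note that the condensing call adds its own round trip, so per the point above it is more useful for trimming a very large tool result than as a raw speedup.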