Async Streaming Run Sanity Check

Can someone give me a sanity check on my creation of a streaming Run? Tool calls seem to be very slow, even when the assistant has only one tool. Have I missed a best practice?

import json

from openai import AsyncOpenAI, AsyncAssistantEventHandler

client = AsyncOpenAI()

# tools, logger, sio, sid, thread_id, assistant_id, user_input are defined elsewhere

class ChatEventHandler(AsyncAssistantEventHandler):
   ...
   async def handle_requires_action(self, data, run_id):
      tool_outputs = []
      for tool in data.required_action.submit_tool_outputs.tool_calls:
         func_name = tool.function.name
         func_args = json.loads(tool.function.arguments)
         try:
            output = await tools.call_tool_function(func_name, func_args)
         except Exception as e:
            output = tools.response_for_exception(exception=e)
         tool_outputs.append({'tool_call_id': tool.id, 'output': output})
      await self.submit_tool_outputs(tool_outputs, run_id)

   async def submit_tool_outputs(self, tool_outputs, run_id):
      # event handlers are single-use, so create a fresh one for the new stream
      handler = ChatEventHandler()
      handler.sio = self.sio
      handler.sid = self.sid
      async with client.beta.threads.runs.submit_tool_outputs_stream(
         thread_id=self.current_run.thread_id,
         run_id=self.current_run.id,
         tool_outputs=tool_outputs,
         event_handler=handler,
      ) as stream:
         await stream.until_done()
         run = await stream.get_final_run()
         usage = run.usage
         if usage:
            logger.debug(f"tokens: {usage.total_tokens} (prompt: {usage.prompt_tokens} completion: {usage.completion_tokens})")

chat_handler = ChatEventHandler()
chat_handler.sio = sio
chat_handler.sid = sid
try:
   async with client.beta.threads.runs.stream(
      thread_id=thread_id,
      max_prompt_tokens=1000,
      assistant_id=assistant_id,
      additional_messages=[
         {'role': 'user', 'content': user_input}
      ],
      event_handler=chat_handler,
   ) as stream:
      await stream.until_done()
      # a run will have usage=None if it is not in a terminal state such as 'completed'
      # 'requires_action' for example will have None
      run = await stream.get_final_run()
      usage = run.usage
      if usage:
         logger.debug(f"query: [{user_input}]\ntokens: {usage.total_tokens} (prompt: {usage.prompt_tokens} completion: {usage.completion_tokens})")
except Exception:
   logger.exception("streaming run failed")
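For context, the tool calls in handle_requires_action run one at a time. One thing I considered was dispatching them concurrently with asyncio.gather; here is a self-contained sketch where my tools helpers are replaced by a dummy that just sleeps, so the concurrency is visible in the wall time:

```python
import asyncio
import time

# Stand-in for tools.call_tool_function; each pretend tool takes 0.2 s.
async def call_tool_function(func_name, func_args):
    await asyncio.sleep(0.2)
    return f"{func_name} ok"

async def run_one(tool_call_id, func_name, func_args):
    try:
        output = await call_tool_function(func_name, func_args)
    except Exception as e:
        output = repr(e)  # stand-in for tools.response_for_exception
    return {"tool_call_id": tool_call_id, "output": output}

async def run_all(calls):
    # gather awaits all tool coroutines concurrently, so total wall time
    # is roughly the slowest tool, not the sum of all of them
    return await asyncio.gather(*(run_one(i, n, a) for i, n, a in calls))

calls = [("call_1", "lookup", {}), ("call_2", "search", {})]
start = time.monotonic()
outputs = asyncio.run(run_all(calls))
elapsed = time.monotonic() - start
print(outputs)
print(f"{elapsed:.2f}s")  # two 0.2 s tools finish in about 0.2 s, not 0.4 s
```

With only one tool per run this obviously cannot explain my latency, but it seemed worth noting for the multi-tool case.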

Where is your server located?

Can you show the result of this?

time curl https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
  "model": "gpt-4",
  "messages": [{"role": "user", "content": "Hello, how are you?"}],
  "temperature": 0.7
}'

Of course, run it from the CLI of wherever you are running your program.

I’m timing it on the server, of course. There is a preprocessing chat completion that I run to identify possible tools, which saves the cost of sending the schema for every tool. That completion takes only about 0.5-0.7 seconds on average. But even for a simple “hi” > “Hello! How can I help you today?” query/completion, the Run stream I posted takes just under 5 seconds. That is why I suspect my implementation.
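To figure out where those ~5 seconds go, I'm planning to timestamp every streamed event. The SDK's handler exposes on_event, so something like this (a sketch; EventTimer is my own helper, and the labels below just simulate the event.event names the Assistants streaming protocol emits):

```python
import time

class EventTimer:
    """Collects (label, seconds-since-start) pairs so streamed events
    can be timestamped as they arrive."""
    def __init__(self):
        self._t0 = time.monotonic()
        self.marks = []

    def mark(self, label):
        self.marks.append((label, time.monotonic() - self._t0))

    def report(self):
        # one line per event, e.g. "thread.run.created: 0.412s"
        return [f"{label}: {dt:.3f}s" for label, dt in self.marks]

# Simulated use; in the real handler these labels come from event.event.
timer = EventTimer()
timer.mark("thread.run.created")
time.sleep(0.05)
timer.mark("thread.message.delta")
lines = timer.report()
print("\n".join(lines))
```

In ChatEventHandler I'd override async def on_event(self, event) and call self.timer.mark(event.event); a long gap before the first thread.message.delta would point at queueing/model latency rather than my code.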