Experience and opinions on 4 APIs (Assistants, Chat Completions, Responses, Realtime)

In the close to 18 months I've worked with OpenAI APIs, I've used 4 of them: Assistants, Chat Completions, Responses, and Realtime (with and without voice).

This is kind of an opinion piece on my findings and impressions, i.e. not trying to be too rigorous or to use ChatGPT to help. Hope you find it legible and interesting.

Assistants

I started out using the Assistants API, which was in beta at the time. I presumed that "beta" meant it was being hardened for production use, but ultimately they decided to withdraw it. My misread. Lesson learned: don't over-invest unless the API is guaranteed a future.

The 2 issues of most concern were:

  1. Stalling: I found thread runs could stall, halting the conversation. So I implemented logic to cancel stalled runs, but the cancellations themselves could stall too. I never did find a satisfactory solution (a sketch of the cancel logic follows this list).
  2. Speed: Like all good engineers I left optimization until the basic work was done. That said, I was in some cases seeing chat turnaround times of up to 20 seconds. Granted, (a) I wasn't using any streaming API, and (b) I had a lot of tool calls in the process, but even a 10x improvement to 2 seconds will chase users away.
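For context, here's roughly what that stall guard looked like. This is a minimal sketch against the Node SDK's beta Assistants namespace (v4-era signatures); the 30-second threshold and 1-second polling interval are my own illustrative values, not anything OpenAI recommends.

```ts
import OpenAI from "openai";

const client = new OpenAI();

// Poll a run until it reaches a terminal state; cancel it if it appears
// stalled (no terminal state before the deadline). Threshold is illustrative.
async function runWithStallGuard(threadId: string, assistantId: string) {
  const run = await client.beta.threads.runs.create(threadId, {
    assistant_id: assistantId,
  });

  const deadline = Date.now() + 30_000; // 30s stall threshold (my choice)
  while (true) {
    const current = await client.beta.threads.runs.retrieve(threadId, run.id);
    if (["completed", "failed", "cancelled", "expired"].includes(current.status)) {
      return current;
    }
    if (Date.now() > deadline) {
      // The cancel request itself could hang the same way the run did,
      // which is exactly the trap described above.
      await client.beta.threads.runs.cancel(threadId, run.id);
      throw new Error(`Run ${run.id} stalled in status "${current.status}"`);
    }
    await new Promise((r) => setTimeout(r, 1_000)); // poll every second
  }
}
```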

Chat Completions

Faster. But still too slow; I would have to use the streaming API to give the impression of responsiveness.
Issues: the tool functions I had written for the Assistants API broke on Chat Completions. API differences I expected, but the JSON that specifies a tool call seems like a module they could have standardized and shared across all offerings.
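For illustration, here's the kind of streaming call I mean: a minimal sketch with the Node SDK, printing tokens as they arrive so the user sees progress immediately. The model name and prompt are placeholders.

```ts
import OpenAI from "openai";

const client = new OpenAI();

// Stream a chat completion so tokens render as they arrive, rather than
// waiting out the full turnaround time before showing anything.
async function streamReply(userMessage: string) {
  const stream = await client.chat.completions.create({
    model: "gpt-4o", // placeholder model
    messages: [{ role: "user", content: userMessage }],
    stream: true,
  });

  for await (const chunk of stream) {
    const delta = chunk.choices[0]?.delta?.content;
    if (delta) process.stdout.write(delta); // render each token immediately
  }
}
```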

Responses

Close on the heels of Chat Completions, I implemented against the Responses API. I felt better about not shuttling the entire context back and forth, though I didn't crunch the numbers. Slower than Chat Completions though, and that was already slow.
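The context-shuttling point refers to chaining turns by ID instead of resending the whole message history each turn. A minimal sketch, assuming the Node SDK's Responses API; the model name and inputs are placeholders.

```ts
import OpenAI from "openai";

const client = new OpenAI();

// Turn one: send the user's first message.
const first = await client.responses.create({
  model: "gpt-4o", // placeholder model
  input: "What is WebRTC?",
});

// Turn two: chain via previous_response_id instead of resending the
// whole conversation history from the client.
const second = await client.responses.create({
  model: "gpt-4o",
  input: "How does it compare to WebSockets?",
  previous_response_id: first.id,
});

console.log(second.output_text);
```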

Realtime (via WebRTC)

My new favorite. I had worked with WebRTC before, and the examples they give via GitHub (the agent and console demo apps) were really nicely done.

  • This is different from the usual client-to-server-to-OpenAI relay; it's a triangle: the client gets an ephemeral token from your server, then speaks WebRTC directly from the browser to OpenAI (see the sketch after this list).
  • This has clear latency advantages. Also, as a startup I don't have to break my neck trying to deliver good performance to the user, because the client is on a direct network path to OpenAI, so I get a top-tier connection from client to OpenAI.
  • A biggie for me is that I get chunked (partial) responses without using the streaming API, which was going to be a hassle. Dodged that.
  • The voice-based response is cool. And I can disable it back to text over the same WebRTC channel.
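The triangle looks roughly like this. A minimal browser-side sketch based on the pattern in OpenAI's Realtime WebRTC docs; `/session` is a hypothetical endpoint on your own server that mints the ephemeral token, and the model name is a placeholder.

```ts
// Browser side. Assumes a hypothetical /session endpoint on your own server
// that calls POST https://api.openai.com/v1/realtime/sessions with your real
// API key and returns the ephemeral client secret.
async function connectRealtime() {
  const { client_secret } = await (await fetch("/session")).json();

  const pc = new RTCPeerConnection();

  // Play the model's audio responses.
  const audioEl = new Audio();
  audioEl.autoplay = true;
  pc.ontrack = (e) => (audioEl.srcObject = e.streams[0]);

  // Send microphone audio to the model.
  const mic = await navigator.mediaDevices.getUserMedia({ audio: true });
  pc.addTrack(mic.getTracks()[0]);

  // The data channel carries events, including partial (chunked) responses,
  // with no separate streaming API needed.
  const dc = pc.createDataChannel("oai-events");
  dc.onmessage = (e) => console.log(JSON.parse(e.data));

  // Standard WebRTC SDP exchange, direct from browser to OpenAI.
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  const resp = await fetch(
    "https://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview",
    {
      method: "POST",
      body: offer.sdp,
      headers: {
        Authorization: `Bearer ${client_secret.value}`,
        "Content-Type": "application/sdp",
      },
    }
  );
  await pc.setRemoteDescription({ type: "answer", sdp: await resp.text() });

  return dc;
}
```

Switching off voice for text-only output is, as I understand it, a `session.update` event sent over that same data channel, e.g. `dc.send(JSON.stringify({ type: "session.update", session: { modalities: ["text"] } }))`.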

Drawbacks?

  1. You have to use a realtime model.
  2. Right now it echoes the instructions (prompt) and tool setup back to the browser, which may leak your secret sauce.

I like the Responses API and the agents-openai library… for managing speed, I just run multiple independent async requests… my project allows for this.
