Performance analysis of Assistants versus Chat completion -- Chat completion seems somewhat faster for complete message generation -- streaming taken into account

Preface: I asked ChatGPT to clean up my post and summarize it, and it sounds very confident in the results, but in reality I was just doing some quick tests to check whether the overhead associated with the Assistants API adds any extra latency. It seems there is about a second or two. It's not conclusive, but it's enough for me to go with Chat Completions, never mind the streaming aspect. So, here are the very simple test results.

Summary

This comparison looks at latency between the Chat Completions and Assistants APIs for GPT-4 models (4-1106 and 4-0125), based on response times observed in OpenAI's API playground.

Methodology

This comparative analysis measured latency for the Chat Completions API versus the Assistants API for GPT-4 models (4-1106 and 4-0125) within the OpenAI API playground. Tests were run on a connection with speeds in the 15-30 Mbps range, and the aim is an introductory comparison of response times. I personally observed GPT-4-0125 generating longer responses than GPT-4-1106, which may have influenced the latency outcomes. This was designed as a high-level observation rather than an exhaustive statistical analysis, intended to gauge approximate performance differences between these models' chat and assistant interfaces.
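For reference, here is a rough sketch of how the same comparison could be scripted instead of timed by hand in the playground. It assumes the openai Python SDK (v1.x) with its beta Assistants endpoints; the model name, prompt, and assistant setup are placeholders, and the timings are plain wall-clock measurements, so the polling loop adds a little noise on the assistants side.

```python
# Rough sketch of scripting the chat-completion vs. assistants comparison,
# assuming the openai Python SDK v1.x with its beta Assistants endpoints.
# The model name, prompt, and assistant setup below are placeholders.
import time
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4-0125-preview"  # placeholder model name
PROMPT = "Explain the difference between latency and throughput."  # placeholder

def time_chat_completion() -> float:
    """Wall-clock time for one non-streaming chat completion."""
    start = time.perf_counter()
    client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": PROMPT}],
    )
    return time.perf_counter() - start

def time_assistant_run() -> float:
    """Wall-clock time from submitting a run until it completes."""
    assistant = client.beta.assistants.create(model=MODEL, name="latency-test")
    thread = client.beta.threads.create()
    client.beta.threads.messages.create(
        thread_id=thread.id, role="user", content=PROMPT
    )
    start = time.perf_counter()
    run = client.beta.threads.runs.create(
        thread_id=thread.id, assistant_id=assistant.id
    )
    while run.status not in ("completed", "failed", "cancelled", "expired"):
        time.sleep(0.2)  # the polling interval adds a little measurement noise
        run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)
    return time.perf_counter() - start

if __name__ == "__main__":
    print(f"chat completion: {time_chat_completion():.2f} s")
    print(f"assistants run:  {time_assistant_run():.2f} s")
```

Only the run itself is timed (assistant and thread creation happen beforehand), which seems like the closest analogue to submitting a message in the playground.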

Latency Observations

  • Chat Completions:
    • GPT-4-0125: Average latency of 7.16 seconds across 5 samples.
    • GPT-4-1106: Average latency of 6.36 seconds across 12 samples.
  • Assistants API:
    • GPT-4-0125: Average latency of 8.76 seconds across 15 samples.
    • GPT-4-1106: Average latency of 7.76 seconds across 8 samples.

Comparative Insights

  • Chat Completions vs. Assistants API: Both GPT-4-0125 and GPT-4-1106 exhibited lower latency with Chat Completions than with the Assistants API, indicating the Chat Completions interface may be more responsive.

Additional Note

It was observed that GPT-4-0125 tended to generate longer responses compared to GPT-4-1106. This difference in response length might contribute to the variations in latency times observed, as longer answers typically require more processing time. This observation offers a potential explanation for the performance differences and underscores the complexity of directly comparing latency times without considering response length and content.

Conclusion

This preliminary observation aimed to provide a basic comparison of latency times between two operational modes of GPT-4 models under specific conditions. It serves as an initial step in understanding the performance nuances of these AI models.

Assistants API - Latency and Response Quality

GPT-4-0125

  • Latency Statistics
    • Mean: 8.76 seconds
    • Median: 8.8 seconds
    • Standard Deviation: 0.84 seconds
    • Sample Size: 15
    • Range: 3.2 seconds (7.3 to 10.5 seconds)
    • Sampled Values: [9.0, 8.6, 9.4, 7.4, 7.3, 8.8, 8.3, 8.3, 9.6, 9.2, 8.2, 8.0, 9.5, 9.3, 10.5]
  • Example Word Count: 105 words

GPT-4-1106

  • Latency Statistics
    • Mean: 7.76 seconds
    • Median: 6.65 seconds
    • Standard Deviation: 1.81 seconds
    • Sample Size: 8
    • Range: 4.9 seconds (6.4 to 11.3 seconds)
    • Sampled Values: [8.2, 6.8, 11.3, 6.4, 6.4, 6.4, 10.1, 6.5]
  • Example Word Count: 71 words

GPT-3.5-1106

  • Latency Statistics
    • Sampled Value: 9.5 seconds
    • Sample Size: 1
  • Example Word Count: 82 words

GPT-3.5-16k

  • Latency Statistics
    • Sampled Value: 9.95 seconds
    • Sample Size: 1
  • Example Word Count: 101 words

Chat Completions - Latency and Response Quality

GPT-4-0125

  • Latency Statistics
    • Mean: 7.16 seconds
    • Median: 7.1 seconds
    • Standard Deviation: 0.59 seconds
    • Sample Size: 5
    • Range: 1.6 seconds (6.3 to 7.9 seconds)
    • Sampled Values: [7.9, 7.7, 6.8, 6.3, 7.1]
  • Example Word Count: 113 words

GPT-4-1106

  • Latency Statistics
    • Mean: 6.36 seconds
    • Median: 6.3 seconds
    • Standard Deviation: 1.45 seconds
    • Sample Size: 12
    • Range: 6.1 seconds (4.2 to 10.3 seconds)
    • Sampled Values: [10.3, 6.9, 6.7, 6.0, 5.0, 5.5, 6.9, 6.9, 6.5, 6.1, 5.3, 4.2]
  • Example Word Count: 72 words

GPT-3.5-1106

  • Latency Statistics
    • Sample Size: 2
    • Sampled Values: [4.5, 2.6] seconds
    • Note: Specific statistical metrics not calculated due to limited sample size.

GPT-3.5-16k

  • Latency Statistics
    • Sampled Value: 8.4 seconds
    • Sample Size: 1
    • Note: Specific statistical metrics not calculated due to single sample.
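
If anyone wants to sanity-check the summary numbers against the raw samples, a snippet like the one below reproduces them with Python's statistics module. The posted standard deviations seem to match the population form, so pstdev is used here, but that part is an assumption on my side.

```python
# Recompute the GPT-4-0125 Assistants summary stats from the sampled values above.
import statistics

samples = [9.0, 8.6, 9.4, 7.4, 7.3, 8.8, 8.3, 8.3,
           9.6, 9.2, 8.2, 8.0, 9.5, 9.3, 10.5]

print(f"mean:   {statistics.mean(samples):.2f} s")    # ~8.76
print(f"median: {statistics.median(samples):.2f} s")  # 8.80
print(f"pstdev: {statistics.pstdev(samples):.2f} s")  # ~0.84
print(f"range:  {max(samples) - min(samples):.1f} s ({min(samples)} to {max(samples)})")
```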
2 Likes

Cool stuff!

Tracking time to first token (ttft) and tokens per second (tps) would probably be the most useful metrics. That's better than completion time, because you will still get variability in the token count unless you try to make it deterministic (although determinism is going bye-bye).
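
For example, a rough sketch of what tracking those two metrics could look like with the streaming chat completions endpoint (model and prompt are placeholders, and counting streamed chunks is only an approximation of token count):

```python
# Rough sketch of measuring ttft and an approximate tokens/second with a
# streaming chat completion. Model and prompt are placeholders; streamed
# chunks are only a rough stand-in for actual token counts.
import time
from openai import OpenAI

client = OpenAI()

start = time.perf_counter()
first_token_at = None
chunks = 0

stream = client.chat.completions.create(
    model="gpt-4-0125-preview",  # placeholder
    messages=[{"role": "user", "content": "Write a short paragraph about latency."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        chunks += 1
        if first_token_at is None:
            first_token_at = time.perf_counter()
end = time.perf_counter()

if first_token_at is not None:
    print(f"ttft: {first_token_at - start:.2f} s")
    print(f"approx tps: {chunks / (end - first_token_at):.1f} chunks/s")
```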

I don’t think your wifi bitrate will make that much of a difference, as long as it’s stable. It’s only a handful of kb. Your latency (ping) will have a bigger impact, especially on ttft.

1 Like

Thanks @Diet, yeah, well, time to first token is a ridiculous comparison since chat completions has streaming and assistants doesn't, so it's less than a second, give or take, for completions versus the times posted (roughly 6-12 seconds) for assistants. Tokens per second would also be good to do. If I were to do it again, I definitely would have used cProfile or timeit or something like that instead of timing manually. I was just trying to get a general sense of whether there is a noticeable uptick in time from the additional processing in assistants compared to chat completions for the same model. It was also a cool by-product of the mini-experiment to see how the models compare to each other. I was having doubts about 0125, but it seems pretty decent after the testing.
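
For instance, something like timeit.repeat would have made collecting the samples less manual; the model name and prompt below are just placeholders.

```python
# Sketch of collecting latency samples with timeit.repeat instead of a manual
# stopwatch. Model name and prompt are placeholders.
import statistics
import timeit
from openai import OpenAI

client = OpenAI()

def one_completion():
    client.chat.completions.create(
        model="gpt-4-1106-preview",  # placeholder
        messages=[{"role": "user", "content": "Summarize TCP in two sentences."}],
    )

# number=1 so each timing is a single API call; repeat=5 gives five samples
samples = timeit.repeat(one_completion, number=1, repeat=5)
print(f"mean {statistics.mean(samples):.2f} s, pstdev {statistics.pstdev(samples):.2f} s")
```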

You know, I could give more intelligent responses if I had the capacity to read the thread title. Sorry about that, my fault :laughing:

The completion time should be virtually the same, because they’re supposed to use the same models. Nonetheless, it’s a good idea to make sure! :+1:

@Diet, yeah, I was thinking the time per token the model generates is the same, like you said, but I was mainly wondering whether there was any significant overhead from the Assistants API tooling.

@Diet I did a clean-up of the results; I don't know if you want to check it out quickly. Maybe a little overkill for such an underwhelming testing method, but hopefully it's easier to read. Thanks ChatGPT for helping me make it cleaner. Maybe it sounds a little too confident, and that's unintentional.

1 Like

Sorry if I sounded rude there; I didn't mean it was a bad idea to compare. I just meant that in practice the orders of magnitude are very different. Bad communication on my part.