Add trailing response headers for token cost information

Most enterprise gateways for OpenAI APIs have very complex logic to:

  1. Check whether a request is a streamed chat completions request.
  2. If the request is streamed and stream_options.include_usage: true is not set, add include_usage to the request.
  3. Manually parse the streamed or non-streamed completion response to extract the usage property.
  4. If stream_options.include_usage was not set in the original request, remove the usage data line for backwards compatibility.

This incurs not only a massive amount of engineering effort to transform requests and responses, but also adds a not-insignificant latency penalty to every request, along with the operational cost of parsing and transforming.

This could be solved with very little engineering effort by just appending a set of response headers (non-streamed) or trailing response headers (streamed) to the response. These headers should be appended whether or not stream_options.include_usage is enabled.
Ideally:
x-request-model
x-usage-prompt_tokens
x-usage-completion_tokens
x-usage-reasoning_tokens
x-usage-audio_tokens
x-usage-accepted_prediction_tokens
x-usage-rejected_prediction_tokens
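
For illustration, here is roughly what the tail of a streamed (chunked) response could then look like on the wire, using the proposed header names (values illustrative, chunk-size framing elided):

    HTTP/1.1 200 OK
    Content-Type: text/event-stream
    Transfer-Encoding: chunked
    Trailer: x-request-model, x-usage-prompt_tokens, x-usage-completion_tokens

    data: {"choices": [...]}

    data: [DONE]

    x-request-model: gpt-4o-2024-08-06
    x-usage-prompt_tokens: 1117
    x-usage-completion_tokens: 46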

The number of CPU cycles burned on this is massive. Just think of the polar bears :cry:

Additionally, adding a set of request headers to the libraries could be a fast-path option not only for enterprise gateways but also for OpenAI itself, since the request body wouldn’t need to be parsed to know whether to respond streamed or non-streamed, or which model to route to. But this is less important.
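
For example (header names hypothetical, mirroring the response headers proposed above):

    POST /v1/chat/completions HTTP/1.1
    x-request-model: gpt-4o
    x-request-stream: true

A gateway (or OpenAI’s edge) could then route on these without buffering and parsing the JSON body at all.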


How about a usage endpoint, not requiring an admin key? Here’s an API specification I constructed:

get https://api.openai.com/v1/usage/{request_id}

Get a stored usage object by request ID (resp_xxx or req_xxx, plus other endpoints that would now return this field). Being created with the store parameter set to true would not be necessary, since this retrieval would use internal persistence equivalent to what already exists for safety storage.

{
  "object": "chat.completion.usage",
  "model": "gpt-4o-2024-08-06",
  "created": 1738960610,
  "request_id": "req_ded8ab984ec4bf840f37566c1011c417",
  "status": "completed",
  "usage": {
    "prompt_tokens": 1117,
    "completion_tokens": 46,
    "total_tokens": 1163,
    "prompt_tokens_details": {
      "cached_tokens": 0,
      "audio_tokens": 0
    },
    "completion_tokens_details": {
      "reasoning_tokens": 0,
      "audio_tokens": 0,
      "accepted_prediction_tokens": 0,
      "rejected_prediction_tokens": 0
    }
  },
  "system_fingerprint": "fp_50cad350e4",
  "input_user": null,
  "service_tier": "default",
  "choices": [
    {
      "index": 0,
      "finish_reason": "stop"
    }
  ]
}

Reasoning provided by AI:

Let’s clarify this step by step:


1. HTTP Headers and their Behavior

Standard HTTP behavior:

  • In a standard HTTP response, headers are sent only once—at the very beginning—immediately followed by the body data.
  • Once the headers and initial status line are sent, they cannot be changed or retransmitted again within the same HTTP request-response cycle.
  • HTTP (1.x and even HTTP/2) assumes a single set of headers per response.
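
For reference, the entire header block of a response appears exactly once, before any body bytes:

    HTTP/1.1 200 OK
    Content-Type: application/json

    {"usage": "..."}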

2. Server-Sent Events (SSE)

How SSE operates:

  • Server-Sent Events (SSE) operate by first establishing a long-lived HTTP connection.

  • Headers are sent once at the start, then the response stream (body) follows, formatted with special lines:

    data: <event payload>\n\n
    
  • This protocol is simple and explicitly defined in the WHATWG SSE Specification.

  • There is no provision to resend or update headers within SSE.


3. Common HTTP client libraries (e.g., Python’s requests)

  • Libraries like Python’s requests or urllib3 generally conform strictly to HTTP standards.
  • They parse headers once and expose the body as a stream of bytes (or text), but do not have any mechanism or standard support to handle new headers mid-stream.
  • The headers are set upon response initialization (response.headers) and remain immutable thereafter.
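
A small sketch with Python’s requests illustrating this (endpoint and payload only for illustration):

    import requests

    # Headers are parsed once, when the response object is created.
    with requests.post(
        "https://api.openai.com/v1/chat/completions",
        headers={"Authorization": "Bearer sk-..."},
        json={
            "model": "gpt-4o",
            "stream": True,
            "messages": [{"role": "user", "content": "hi"}],
        },
        stream=True,
    ) as response:
        print(dict(response.headers))  # fixed from this point on
        for line in response.iter_lines():
            pass  # SSE data lines arrive here; response.headers never changes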

4. WebSockets as an Alternative

WebSockets behavior:

  • WebSockets (RFC 6455) establish a bi-directional communication channel after the initial HTTP handshake.
  • After the initial handshake (which includes headers), WebSockets switch to a different binary framing protocol.
  • WebSocket frames themselves do not contain headers as defined by HTTP. They just carry arbitrary payloads. Any metadata must be included in the payload itself.

5. Relevant Standards and RFCs

  • RFC 7230–7231 (HTTP/1.1) explicitly define response headers to appear exactly once at the start.
  • RFC 7540 (HTTP/2) introduces “Header frames” to split headers into frames; but these frames still logically appear exactly once per response and occur only before the body data.
  • RFC 6455 (WebSockets) clearly describes how WebSocket communications work, without any provision to send HTTP-like headers again after the initial handshake.
  • Server-Sent Events Spec (WHATWG) does not specify retransmission of headers. It’s simply not defined or supported.

No existing standard (HTTP/1.1, HTTP/2, SSE, WebSocket) allows retransmission of HTTP headers mid-stream after the initial headers are sent.


6. How to Solve your Metadata Problem

Since you want metadata sent after data generation completes:

  • Include metadata in SSE events:

    data: {"type": "data", "payload": "..."}\n\n
    data: {"type": "metadata", "payload": {"duration": 123, ...}}\n\n
    

    This is common practice: sending metadata as JSON or structured events at the end of a data stream (see the sketch after this list).

  • WebSocket metadata frames:
    Include metadata explicitly within a JSON payload or custom-framed messages at the end of your transmission:

    {"event": "metadata", "duration": 123, "other_info": "..."}
    
  • Separate requests:
    Another approach (though less elegant) is to make a second request explicitly to retrieve metadata.
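
A minimal server-side sketch of the first pattern in Python (a framework-agnostic generator; event shapes as above):

    import json

    def sse_stream(chunks, usage):
        # Ordinary data events first (chunks assumed JSON-serializable)...
        for chunk in chunks:
            yield f"data: {json.dumps({'type': 'data', 'payload': chunk})}\n\n"
        # ...then a single metadata event once generation has completed.
        yield f"data: {json.dumps({'type': 'metadata', 'payload': usage})}\n\n"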


7. Conclusion & Recommendation

  • It is not possible to retransmit or modify headers once the HTTP response stream has started. HTTP standards explicitly forbid this behavior.
  • Common HTTP libraries (like requests) do not and cannot support mid-stream headers.
  • There is no existing RFC supporting mid-stream retransmission of headers.

The idea of sending HTTP headers again within a streaming response is indeed “fantasy” as per existing standards and practice.

Your recommended solution is clearly one of:

  • Embedding metadata directly within SSE events.
  • Using a WebSocket and structured metadata messages.
  • Using a separate, follow-up HTTP call.

TL;DR:

No, you cannot retransmit headers mid-stream. Include any metadata explicitly within the SSE event payload itself or use WebSockets for structured metadata transmission.


> No, you cannot retransmit headers mid-stream. Include any metadata explicitly within the SSE event payload itself or use WebSockets for structured metadata transmission.

You absolutely can: not mid-stream, but after a stream has completed. See Trailer header - HTTP | MDN


Interesting, I’ve never heard of that, and it seems web browsers haven’t either, according to the support matrix.

Let’s see if it’s in the realm of possibility to code it up:

  • HTTP client libraries (especially high-level ones like requests) commonly ignore or do not expose trailers.
  • Middleware/proxies/load balancers/CDNs often discard trailers, as they’re not widely adopted.
  • Server-Sent Events (SSE) specification does not define or support HTTP trailers explicitly, though an SSE stream can technically run over chunked encoding. However, SSE client implementations typically don’t expose trailer fields.

This has nothing to do with SSE specifically. SSE is just a method of using the body of an HTTP response to send data lines. The concept of trailer headers is specified in HTTP itself. It’s true that adoption is not very wide, but keep in mind there is not much point in having web-browser support in the first place: it is rare to call an OpenAI API directly from a web browser. Rather, some backend transmits generated responses to the client.
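
To make the mechanics concrete, here is a minimal toy sketch in plain Python of a chunked response that carries usage details in trailer headers (values illustrative):

    import socket

    def send_with_trailers(conn):
        body = b'data: {"choices": [...]}\n\ndata: [DONE]\n\n'
        conn.sendall(
            b"HTTP/1.1 200 OK\r\n"
            b"Content-Type: text/event-stream\r\n"
            b"Transfer-Encoding: chunked\r\n"
            b"Trailer: x-usage-prompt_tokens, x-usage-completion_tokens\r\n"
            b"\r\n"
        )
        # One chunk carrying the SSE body: <size-in-hex>\r\n<data>\r\n
        conn.sendall(b"%x\r\n%s\r\n" % (len(body), body))
        # Terminating zero-length chunk, then the trailer fields.
        conn.sendall(
            b"0\r\n"
            b"x-usage-prompt_tokens: 1117\r\n"
            b"x-usage-completion_tokens: 46\r\n"
            b"\r\n"
        )

    with socket.create_server(("127.0.0.1", 8080)) as srv:
        conn, _ = srv.accept()
        conn.recv(65536)  # read and ignore the request for this sketch
        send_with_trailers(conn)
        conn.close()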

We wrote our own company-internal LLM gateway that tracks request cost by manually parsing the responses for usage details. We then publish the usage details via trailer headers to consumers. Some http libraries support it, some don’t. The point is mainly that the LLM gateways, of which there are many (litellm, bitfrost, openrouter, and company internal ones), could use this functionality to track cost with much less overhead than currently. It isn’t there primarily to target end-users.