Tokenizer and logit_bias with gpt-4o and streaming, API ver 1.47.1

A developer I'm working with has the tokenizer and logit_bias working with API ver 1.47.1 and gpt-4o in non-streaming mode.

It is not working in streaming mode. The developer is looking for resources to understand this issue.

Is there a path to a streaming implementation of the tokenizer and logit_bias? Or
is this known to be an OpenAI bug or an OpenAI future work item?

Logprobs are also returned in a stream response, unless they have been disabled by OpenAI to obscure the production of function calls or structured responses. They also have to be enabled by a chat completions parameter (logprobs, optionally with top_logprobs).

Logprobs contain the bytes returned in a chunk, and a single character's bytes can extend across multiple tokens, for example uncompressed Unicode beyond 7Fh. Token numbers are not reported.
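If you want the exact text back out of those logprob entries, a minimal sketch (assuming a parsed chat completions choice with logprobs enabled, using the SDK's logprobs.content[].bytes field) is to concatenate the byte lists and decode them as UTF-8:

def text_from_logprobs(choice) -> str:
    raw = bytearray()
    for entry in choice.logprobs.content:          # one entry per produced token
        if entry.bytes:                            # a list of ints; may be None
            raw.extend(entry.bytes)
    return raw.decode("utf-8", errors="replace")   # multi-byte characters reassemble here

The same works per streamed chunk; characters split across tokens only decode cleanly once their bytes have been joined.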

The AI produces tokens, so we know token boundaries are coming out. However, you cannot encode large strings to tokens accurately without the full text that BPE operates on, at least up to a non-joinable token such as a number, or without the unreceived special tokens that enclose chat content messages.
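To see why partial text is not enough, here is a small illustration with the tiktoken package (an assumption on my part that you have it installed; gpt-4o uses the o200k_base encoding):

import tiktoken

enc = tiktoken.get_encoding("o200k_base")     # the encoding behind gpt-4o

full = "strawberries"
parts = ["straw", "berries"]                  # an arbitrary split of the same text

whole_ids = enc.encode(full)                            # BPE merges over the whole string
piece_ids = [t for p in parts for t in enc.encode(p)]   # encode fragments independently

print(whole_ids)
print(piece_ids)    # may differ from whole_ids: fragments can tokenize differently

The same caveat applies mid-stream: re-encoding a partial reply does not necessarily reproduce the token IDs the model actually emitted.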

You don't discuss what the idea behind this "tokenizer" is, or how it would be "broken", since the API, for a developer, is basically language strings in and out. You can first turn on logprobs, receive them as deltas, and reassemble them into a final response to see if that suits your need.


Here is a snippet to get and display the top_logprobs section of a response.

from openai import OpenAI
import numpy as np
client = OpenAI(timeout=30)

params = {
  "max_tokens": 4, "top_p":0.01, "stream": True,
  "max_tokens": 4,"logprobs": True, "top_logprobs": 3, "logit_bias": {},
  "messages": [
      {"role":"system","content":"""
You are a backend AI classifier. Response is for API: no markdown.
""".strip()},
      {"role":"user","content":"Produce flower color list, no chat."},
  ]
}

for model in ["gpt-4o", "gpt-4o-mini"]:
    params['model'] = model; print(f" -- for {model}")
    response = client.chat.completions.with_raw_response.create(**params)
    reply=""
    for chunk_no, chunk in enumerate(response.parse()):    # with_raw_response.create parsing
        print(f"\nchunk_no: {chunk_no}")
        if chunk.choices[0].delta.content:                 # if chunks with assistant
            reply += chunk.choices[0].delta.content        # gather for chat history
            for index, prob in enumerate(chunk.choices[0].logprobs.content):
                #print(index, end=': ')
                for top in prob.top_logprobs:
                    print(f"{repr(top.token)},  bytes:{top.bytes}, prob: {np.exp(top.logprob):05f}")
    print("\nresponse content:\n" + reply)

Producing output like

 -- for gpt-4o-mini

chunk_no: 0

chunk_no: 1
'Red', bytes:[82, 101, 100], prob: 0.696413
'-', bytes:[45], prob: 0.256196
'1', bytes:[49], prob: 0.030598

chunk_no: 2
',', bytes:[44], prob: 0.988889
'  \n', bytes:[32, 32, 10], prob: 0.010986
'\n', bytes:[10], prob: 0.000122

chunk_no: 3
' Blue', bytes:[32, 66, 108, 117, 101], prob: 0.632229
' Pink', bytes:[32, 80, 105, 110, 107], prob: 0.181137
' Yellow', bytes:[32, 89, 101, 108, 108, 111, 119], prob: 0.181137

chunk_no: 4
',', bytes:[44], prob: 1.000000
' ,', bytes:[32, 44], prob: 0.000000
'،', bytes:[216, 140], prob: 0.000000

chunk_no: 5

response content:
Red, Blue,

There's a list of models in there that the snippet iterates over.

No, there is no “bug”.
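If the point of the "tokenizer" is only to build the logit_bias map, that part is independent of streaming, since logit_bias is set on the request before any tokens come back. A minimal sketch, assuming tiktoken and the o200k_base encoding that gpt-4o uses (the strings and bias values here are just placeholders):

import tiktoken

enc = tiktoken.get_encoding("o200k_base")    # encoding used by gpt-4o / gpt-4o-mini

discouraged = ["Yellow", " Yellow"]          # example strings; each may span several tokens
logit_bias = {}
for s in discouraged:
    for token_id in enc.encode(s):
        logit_bias[str(token_id)] = -100     # JSON keys are strings; values range -100 to 100

params["logit_bias"] = logit_bias            # plug into the request params from the snippet

Note that this only affects those exact token forms; other capitalizations or different BPE splits are separate token IDs and would need their own entries.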