Hi there,
The new structured output feature is excellent! Perfect for a task I have. It’s mentioned that the first time a schema is used, it incurs a latency of ~10-60s, but is thereafter cached to allow for regular inference speed.
I’m curious about the “scope” of this caching, as it’s very important for us to give our users the fastest inference time possible. Is the schema cached only for that particular conversation (i.e. the scope is the conversation ID), or for the particular API key being used to access the API?
In essence, I’m trying to understand if every time we send an API request to, say, 4o-mini using a constant schema we incur that latency, or if it’s only for the first time we use that schema for our particular API key.
Cheers
Eric
Hi Eric,
I don’t have any details on the cache duration. I will add it to the list of questions we send to the OpenAI team for clarification, but that won’t happen until next week.
@ericlaycock44 I tested it out myself using a single API key with multiple calls over several days, and can confirm that it’s a one-off penalty at the very beginning, with subsequent calls being significantly faster.
My understanding is that the schema gets converted to a context-free grammar (CFG), which is then used to constrain the logits during the token sampling/decoding (i.e. text “generation”) step. So presumably, the one-off penalty comes from converting the JSON schema string to a CFG, plus hashing your schema and storing the result in a database alongside your API key. On subsequent requests, the schema is hashed and looked up in that database, which skips the CFG conversion entirely.
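To make that guess concrete, here is a minimal Python sketch of a compile-once cache keyed by account and schema hash. It only illustrates the idea described above and is not OpenAI's actual implementation: `GrammarCache`, `schema_fingerprint`, and `fake_compile` are hypothetical names, and the in-memory dict stands in for whatever storage is really used.

```python
import hashlib
import json


def schema_fingerprint(schema: dict) -> str:
    """Hash a canonical JSON serialization so that key order does not matter."""
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class GrammarCache:
    """Hypothetical cache keyed by (account id, schema hash)."""

    def __init__(self, compile_fn):
        self._compile_fn = compile_fn
        self._store = {}
        self.compile_calls = 0  # counts how often the slow path runs

    def get(self, account_id: str, schema: dict):
        key = (account_id, schema_fingerprint(schema))
        if key not in self._store:
            # Slow path: first time this account uses this schema.
            self.compile_calls += 1
            self._store[key] = self._compile_fn(schema)
        return self._store[key]


def fake_compile(schema: dict) -> str:
    """Stand-in for the expensive schema-to-grammar conversion (assumption)."""
    return "grammar:" + ",".join(sorted(schema.get("properties", {})))


cache = GrammarCache(fake_compile)
schema_a = {"type": "object",
            "properties": {"name": {"type": "string"}, "age": {"type": "integer"}}}
# Same schema, different key order: hashes to the same fingerprint.
schema_b = {"properties": {"age": {"type": "integer"}, "name": {"type": "string"}},
            "type": "object"}

cache.get("key-1", schema_a)  # compiles
cache.get("key-1", schema_b)  # cache hit
cache.get("key-2", schema_a)  # compiles again for a different account
print(cache.compile_calls)    # prints 2
```

If the real cache works anything like this, a constant schema sent from the same API key would pay the conversion cost only once, which matches what I measured.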
@platypus sweet, thanks for your insights!