Handling Multiple Responses from the API in different ways

I want to generate two responses to the same prompt using the n hyperparameter, but I want to stop the second response early using the stop hyperparameter. This is part of a quality-check procedure. Is there a way to stop one response early but not the other? I assume not, but am hopeful about the token savings…

If you use the stream=true parameter, you can close the connection at any time, which stops generation of the remaining output.

https://platform.openai.com/docs/api-reference/streaming

Good idea, thanks! These are parallel calls used to populate a data frame automatically by extracting text. So could I stream response 1 and check each chunk of response 2 as it comes in for my stop phrase? That is my assumption, anyway. How would I end the stream?

You simply close the connection, via a .close method, or by disconnecting the socket if you are using a non-SDK solution.
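As a sketch of the pattern, here is the close-early idea with a plain Python generator standing in for the API stream object (the real stream exposes a similar iterate-and-close interface):

```python
def fake_stream():
    # Stand-in for the API stream: yields text pieces one at a time.
    for piece in ["Hello", ", ", "world", "!", " extra", " tokens"]:
        yield piece

stream = fake_stream()
collected = ""
for piece in stream:
    collected += piece
    if "world" in collected:
        # Stop consuming; with the real API this drops the HTTP connection,
        # so no further tokens are generated or billed.
        stream.close()
        break
```

After the loop, nothing past the matching piece was consumed.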

First some clarification so we are on the same page:

  • “hyperparameters” refers to learning settings used when fine-tuning (retraining) a model

I know it’s fun to say, but the API just uses parameters, or even json key:value pairs when you get down to what is sent.

With runtime API calls, the stop parameter can be set to any string at which you want to terminate the AI generation. For example, if you only want one line or paragraph, you could use "\n" as the stop string, and generation will be cut off at that point.

Stop sequences are strings, not tokens, which makes them easier to specify; the string you specify is not included at the end of the output sent to you.
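To illustrate the semantics locally (this is a sketch mimicking the API's behavior, not an API call): output is truncated at the first occurrence of the stop string, and the stop string itself is not returned.

```python
def apply_stop(text, stop):
    # Mimic the API's stop behavior: cut the text at the first occurrence
    # of the stop string; the stop string itself is not included.
    idx = text.find(stop)
    return text if idx == -1 else text[:idx]

full = "First line of the answer\nSecond line you never see"
print(apply_stop(full, "\n"))  # only the first line survives
```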

So in the AI language, you’d have to figure out what in the response is going to be a repeatable part of the generation you can identify.


However, there is no setting a “stop” to work on only one of n>1 generations.

So monitoring the chunks sounds like a good idea. However, the content you receive arrives in small token-sized pieces, so for longer stop sequences you'll need to build up the response as it streams and scan over its tail to see whether the phrase has appeared (and since more text may arrive after the end of your string, you can't just check only the end of each chunk).
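A minimal sketch of that buffering approach, with plain strings standing in for streamed chunks. Note the stop phrase here arrives split across chunk boundaries, which is exactly why you scan the accumulated response rather than individual chunks:

```python
def stream_until_phrase(chunks, stop_phrase):
    # Accumulate streamed text and stop once the phrase appears anywhere,
    # trimming the phrase and anything generated after it.
    response = ""
    for chunk in chunks:
        response += chunk
        idx = response.find(stop_phrase)
        if idx != -1:
            # With a real stream you would also call stream.close() here.
            return response[:idx]
    return response

# "END" is split across chunks, with extra text after it:
chunks = ["some te", "xt EN", "D plus trailing tokens"]
print(stream_until_phrase(chunks, "END"))
```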

(Also note: the point of n is the different random sampling of tokens you can get. Set top_p=0.0001 and also set a seed if you want the generations the same, or keep temperature near its default if variety is what you are after.)

Thanks for all this. But how do I differentiate the two streams? Here is my code, which works for early stopping when n=1. What I am unclear about is how to separate out the chunks belonging to response 1 and response 2 so I can handle them differently. Is this even possible?

    messages = [
        {"role": "system", "content": sys_prompt},
        {"role": "user", "content": user_prompt}
    ]

    response = ""

    stream = openai.ChatCompletion.create(
        model=engine_pass,
        temperature=0.3,
        messages=messages,
        max_tokens=1500,
        n=1,
        stream=True
    )

    try:
        for chunk in stream:
            content = chunk["choices"][0].get("delta", {}).get("content", "")
            response += content

            if "Insert_Stop_Phrase_Here" in response:
                print("Specific phrase found, stopping stream.")
                break
    except Exception as e:
        print(f"An error occurred while processing this row: {e}")
    finally:
        stream.close()

    return response, sys_prompt

As a thought… it looks like the chunks come back sequentially: n1, n2, n1, n2… How reliable is this pattern? Can I use it to assign alternating chunks to response 1 and response 2, monitoring response 2 for my stop phrase but not response 1?
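Alternatively, rather than relying on strict alternation, I could route each chunk by the index field its choice carries. A sketch with hand-built chunks shaped the way I understand the streamed chunks to look (the exact chunk format is my assumption):

```python
def demux_by_index(chunks, stop_phrase):
    # Route each chunk to its response via choices[0]["index"] instead of
    # assuming a strict n1, n2, n1, n2 ordering.
    responses = {0: "", 1: ""}
    for chunk in chunks:
        choice = chunk["choices"][0]
        idx = choice["index"]
        responses[idx] += choice.get("delta", {}).get("content", "")
        # Monitor only the second response (index 1) for the stop phrase:
        if idx == 1 and stop_phrase in responses[1]:
            # With the real stream, call stream.close() here as well.
            break
    return responses

fake_chunks = [
    {"choices": [{"index": 0, "delta": {"content": "A1 "}}]},
    {"choices": [{"index": 1, "delta": {"content": "B1 STOP"}}]},
    {"choices": [{"index": 0, "delta": {"content": "A2"}}]},
]
result = demux_by_index(fake_chunks, "STOP")
```

Here response 0 keeps streaming untouched while response 1 is cut off at the phrase.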
