How to Handle Context Limit in Response in gpt-4-1106-preview?

I have the following code that was written with the previous API version in mind, and to be honest I'm not even sure it still works.

My question is simple: if the total context is 128,000 tokens but we can only get back 4,096 output tokens, how can we detect that the response was truncated, and how can we then ask the API to "continue" in the next response? (I've put a rough sketch of what I'm imagining below my code.)

My code:

import time

import openai
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment
verbose = True     # set elsewhere in my script


def get_body_content(raw_body):
    if verbose:
        print('get_body_content')

    while True:
        try:
            response = client.chat.completions.create(
                model="gpt-4-1106-preview",
                messages=[
                    {"role": "system", "content": "You are a helpful assistant."},
                    {"role": "user", "content": f"Given the raw HTML data from a site, return the main body of the article from this longer piece:\n\n{raw_body}"},
                ],
                temperature=0.0,
            )

            # Return the generated text
            return response.choices[0].message.content.strip()

        except (openai.RateLimitError, openai.OpenAIError) as e:
            if isinstance(e, openai.RateLimitError):
                # Honour the Retry-After header if the server sent one
                retry_after = int(e.response.headers.get("retry-after", 60))
                sleep_duration = max(60, retry_after)
            elif "maximum context length" in str(e):  # Error caused by the token limit
                print("ERROR: Maximum token limit reached in get_body_content")
                return "ERROR: Maximum token limit reached in get_body_content"
            else:
                sleep_duration = 60

            print(f"Encountered an error: {e}. Sleeping for {sleep_duration} seconds.")
            time.sleep(sleep_duration)
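
For the "continue" part, this is roughly what I'm imagining (an untested sketch on my side, assuming that choices[0].finish_reason comes back as "length" whenever the reply is cut off at the output-token limit; the helper name and the "continue" prompt wording are just placeholders I made up):

# Untested sketch: loop until finish_reason is no longer "length",
# feeding the partial answer back and asking the model to continue.
def get_full_completion(messages, model="gpt-4-1106-preview", max_rounds=5):
    parts = []
    for _ in range(max_rounds):
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            temperature=0.0,
        )
        choice = response.choices[0]
        parts.append(choice.message.content)

        # "length" means the output-token limit was hit mid-answer
        if choice.finish_reason != "length":
            break

        # Append the partial reply and ask the model to pick up where it stopped
        messages = messages + [
            {"role": "assistant", "content": choice.message.content},
            {"role": "user", "content": "Continue exactly where you left off."},
        ]

    return "".join(parts)

Is checking finish_reason == "length" actually the right way to detect truncation with gpt-4-1106-preview, and is stitching the pieces together like this how people handle the "continue" step?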