Experiencing Unexpected Model Responses from OpenAI API (GPT-4 Expected, GPT-3 Received)

Hi everyone,

I’ve been encountering a perplexing issue with the OpenAI API for the past 2-3 days and was hoping to get some insights or solutions from this knowledgeable community. I am aware that there was a similar topic from last year but it does not solve my problem.

Issue Summary: I have a script that interacts with the OpenAI API, specifically requesting responses from the GPT-4 model. This script has been functioning as expected until recently. Despite requesting GPT-4 explicitly and being billed for GPT-4 tokens, the nature of the responses suggests that I’m receiving outputs from GPT-3 instead.

Why I Think I’m Receiving GPT-3 Responses: My confidence in this observation comes from the qualitative difference in responses to certain queries. Notably, GPT-4 has a specific way of handling requests for information outside its training data, such as stating its inability to access external websites, whereas GPT-3 tends to generate responses regardless. The discrepancy became apparent when I reran scripts with the same inputs and noticed a marked difference in the responses, which no longer matched the expected behavior of GPT-4.

Sample Code: The code is straightforward, except for the network_error_detect function, which simply checks whether the API response indicates that the model was unable to access the website, so the call can be retried. Its contents are unimportant. I created it because gpt-4 routinely fails to read web sites on the first attempt and succeeds after a retry. Of course, it relies on gpt-4’s behavior of declaring the error in its response, which gpt-3 does not do. (I know it’s ugly, but it works.)

    def get_response_simple(self, system_message, user_message, model="gpt-4", max_attempts=4):
        for _ in range(max_attempts):
            # Make the API call using the chat completions endpoint
            completion = self.client.chat.completions.create(
                model=model,
                messages=[
                    {"role": "system", "content": system_message},
                    {"role": "user", "content": user_message},
                ],
            )

            # Extract the response text
            response = completion.choices[0].message.content

            # Check whether the reply says the model couldn't reach the site
            is_failure = self.network_error_detect(response)

            if is_failure:
                self.do_nothing(15)  # pause before retrying
            else:
                return response

        # All attempts failed: return the prompt and the last response, flagged
        response_message = "##*** Network error. Maximum retries exceeded. ***\n\n###Prompt:\n\n" + user_message + "\n\n###Response:\n\n"
        return response_message + response
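
For completeness, network_error_detect and do_nothing boil down to something like the sketch below (simplified; the exact refusal phrases, and the assumption that do_nothing just pauses, are illustrative rather than the real code):

    def network_error_detect(self, response):
        # Return True if the reply indicates the model couldn't reach the site
        refusal_phrases = [
            "unable to access external websites",
            "can't browse the internet",
            "cannot access the link",
        ]
        text = response.lower()
        return any(phrase in text for phrase in refusal_phrases)

    def do_nothing(self, seconds):
        # Just wait a bit before the next retry
        import time
        time.sleep(seconds)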

Request for Community Insights:

I know this problem has cropped up before and the responses were rather dismissive, but I need to find a solution.


The response specifically tells you the model that generated the response. Can you share the JSON for the response you’re receiving?
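
If it helps, with the v1 Python SDK the response is a Pydantic object, so something like this (a sketch, assuming your client is set up the usual way) will print both the reported model and the full JSON:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

completion = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Say hello."}],
)

# The model that actually served the request
print(completion.model)

# Full JSON of the response object
print(completion.model_dump_json(indent=2))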

The response says it is gpt-4-0613. The OpenAI usage dashboard says it’s gpt-4-0613. But it’s not acting like gpt-4. Anyway, I’ve opened a support request; I only found this community afterward, not realizing actual people were reading it. But I’m happy to entertain any ideas.

To add clarity, stable code stopped working.

are you talking about playground tests?

It’s possible that you’ve used gpt-4-turbo, and not gpt-4. there’s a significant difference. set your model to either gpt-4-1106-preview or gpt-4-0125-preview.
gpt-4-0613 is the worst of the gpt-4 models, a significant slump from 0314 (which isn’t available to most people anymore)

the turbo models are slightly dumber, but more stable in terms of hallucinations and such.

Yes. The OpenAI dashboard is very specific about what is being used. The only gpt-4 variant is gpt-4-0613. I hit the playground a couple of times today but I rarely use it.

I can’t get 314 either. I tried that too.

https://platform.openai.com/docs/models/gpt-4-and-gpt-4-turbo

yeah, gpt-4 and gpt-4-turbo are two separate things. I’m suggesting you’re likely looking for turbo but inadvertently used gpt-4.

GPT-4 is the worst? (GPT-4 points to gpt-4-0613) Hahaha.

The only thing improved about hallucinations in the turbo models is that you get a wall of denials from half the inputs you’d send, and no clever inferences out of the rest.

I just tried gpt-4-turbo-preview and got exactly the same result. I can’t use it anyway because I need up to 6000 tokens.

I think that’s valuable in its own right. Maybe I haven’t been evaluating 0613 fairly, but using 1106 to prevent 0314 from going off the rails works pretty well. I consider the denials a feature :laughing:

Are you telling me 0613 has any advantage in any scenario?

where did you make your initial assessment? do you wanna share your prompt?

Here is my current test code that I set up for support including the prompt. Mary is a fake attorney:

# Test Main

import os
from datetime import datetime

from openai_test_processor import OpenAITestProcessor


# Initialize the test processor
ai = OpenAITestProcessor()

context = 'You are a helpful assistant.'
prompt = 'Go to https://63898d4121644524ae9b79145f5ed630.stophatingyourbusiness.com/law-offices-of-mary-pason/ and tell me what industry Mary Pason is in.'

ai.set_initial_context(context)


# First request: ask the model to read the page and identify the industry
now = datetime.now()
timestamp = now.strftime('%Y-%m-%d %H:%M:%S')
response = ai.get_one_response(prompt)

content = f"{timestamp}: {response}\n\n"

# Second request: ask the model to identify itself
now = datetime.now()
timestamp = now.strftime('%Y-%m-%d %H:%M:%S')
response = ai.get_one_response('What GPT model is this?')

content += f"{timestamp}: {response}"

# Write both responses to test_output/test.txt
output_folder = 'test_output'
file_name = 'test.txt'

os.makedirs(output_folder, exist_ok=True)  # make sure the output folder exists
output_path = os.path.join(output_folder, file_name)
with open(output_path, "w") as f:
    f.write(content)

Class:

# openai_test_processor.py

import os
from openai import OpenAI


class OpenAITestProcessor:
    def __init__(self):
        self.conversation_history = []  # Keep track of messages sent to the model

        # Initialize the OpenAI client from the environment API key
        self.api_key = os.getenv('OPENAI_API_KEY')
        self.client = OpenAI(api_key=self.api_key)
        self.system_client = OpenAI(api_key=self.api_key)  # second client (unused in this test)

    def set_initial_context(self, system_message):
        # Clear any existing conversation history
        self.conversation_history = []

        # Add the initial system message to the conversation history
        if system_message:
            self.conversation_history.append({"role": "system", "content": system_message})

    def get_one_response(self, user_message):
        # Make the API call using the chat completions endpoint
        completion = self.client.chat.completions.create(
            model="gpt-4-turbo-preview",
            messages=self.conversation_history + [{"role": "user", "content": user_message}],
        )

        # Print the model that actually served the request
        print(completion.model)

        # Extract and return the response text
        response = completion.choices[0].message.content
        return response

What are you expecting the model to do with that prompt? It’s not going to fetch that web page

I can’t browse the internet or click on links, so I’m unable to directly check the website you’ve provided to determine the industry Mary Pason is in. If you’re interested in the law offices of Mary Pason, it’s likely that she is involved in the legal industry or provides legal services. If you need more specific information, you might want to describe the services or information listed on the website, and I can provide more detailed insights based on that.

gpt-4-0125-preview

:thinking:

that’s what you wanted, right?

I’d avoid these floating model aliases, because things will randomly break. pick a fixed version. things can still break, but it’s a little less chaotic.
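
e.g., in your get_one_response, something along these lines (just a sketch), pinning a dated snapshot rather than a floating alias:

# "gpt-4" and "gpt-4-turbo-preview" are aliases that OpenAI can re-point at any
# time; a dated snapshot like gpt-4-0125-preview stays the same until retired.
completion = self.client.chat.completions.create(
    model="gpt-4-0125-preview",
    messages=self.conversation_history + [{"role": "user", "content": user_message}],
)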

So it’s broken for everyone then? I would’ve thought someone would have noticed. The original code just specified ‘gpt-4’ and it worked from January up until last week.

I think most people just use fixed versions. I don’t know what the generic endpoint pointed to. Did you keep logs?

No logs. But the generic gpt-4 alias would have been pointing to 0613 for all that time anyway, wouldn’t it?

I don’t know :person_shrugging:

there was talk of switching it to 1106, then they changed their minds, or maybe they unchanged their minds. Eventually they added the generic gpt-4-turbo-preview. I didn’t track it because it doesn’t really matter.

It’s indeed possible that you were using 0613 all along, but perhaps your use-case drifted and you just never noticed until now? Really hard to say.

It’s been the same project the whole time. But this does lead to the obvious question: if anyone has code still working that relies on internet access, I’d like to know what model they’re using, because I can’t find one that works.

Or maybe they just don’t know their models aren’t working and they are getting garbage answers. It took me a minute to realize it.

None of the API models do. You need to build a function/tool that the model can call, which fetches the page content for it.
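
Very roughly, the pattern looks like this (a minimal sketch: fetch_page is a hypothetical helper you’d write yourself, and a real one would need HTML-to-text conversion, size limits, and proper error handling):

import json
import urllib.request

from openai import OpenAI

client = OpenAI()

def fetch_page(url: str) -> str:
    # Hypothetical helper: download the page and return its (truncated) raw text
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="ignore")[:8000]

tools = [{
    "type": "function",
    "function": {
        "name": "fetch_page",
        "description": "Fetch the raw text of a web page so it can be read",
        "parameters": {
            "type": "object",
            "properties": {"url": {"type": "string", "description": "URL to fetch"}},
            "required": ["url"],
        },
    },
}]

messages = [{"role": "user", "content": "What industry is the firm at "
             "https://63898d4121644524ae9b79145f5ed630.stophatingyourbusiness.com/law-offices-of-mary-pason/ in?"}]

# First call: the model should respond with a request to call fetch_page
first = client.chat.completions.create(model="gpt-4-0125-preview", messages=messages, tools=tools)
msg = first.choices[0].message

if msg.tool_calls:  # the model may also just answer directly
    call = msg.tool_calls[0]
    args = json.loads(call.function.arguments)

    # Append the tool call and its result, then let the model answer from the page text
    messages.append(msg)
    messages.append({"role": "tool", "tool_call_id": call.id, "content": fetch_page(args["url"])})

    second = client.chat.completions.create(model="gpt-4-0125-preview", messages=messages, tools=tools)
    print(second.choices[0].message.content)
else:
    print(msg.content)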

But I think the picture is becoming clearer now.

Is it possible that you’ve indeed been using 0613 all this time, and you thought you had internet access? If that’s the case you’ve been getting hallucinations all along.

I know for certain that I had internet access. It would be impossible to generate the output I have without it. I suppose it’s possible that having Internet access was a bug that was fixed.

Understand that these models have been trained on the majority of the internet up until some time in 2021, such that they can take a very educated guess at what a link might contain from the URL alone.