Incorrect and inconsistent answers from the gpt-3.5-turbo-16k API, different from the answers in the Playground and ChatGPT

Example of a correct answer in the Playground:

In ChatGPT the answer is correct too.
But the gpt-3.5-turbo-16k API returns “1”, which is wrong.
I use all the same parameters as in the screenshot, and even the same system message:
“You are ChatGPT, a large language model trained by OpenAI, based on the GPT-3.5 architecture.
Knowledge cutoff: 2021-09
Current date: 2023-12-21”

I always get a consistently incorrect answer from the API, even if I try 10 times, and a consistently correct answer in the Playground.

This is just one of my examples. I’ve been sitting with this problem for a week, every day: changing prompts, changing settings, studying the forum!

The main goal is to evaluate incoming short messages for the presence of a request for services. But this works extremely poorly through the API, unlike the Playground or ChatGPT.

Currently I am using three different requests with different prompts to analyze one short message. But even this does not give a stable correct answer through the API, unlike the Playground.

I also see a lot of similar topics where there is no clear answer on what to do.

Can you help me, or admit that the API works much worse and is not suitable for use today?
Thank you.

It is certainly possible for the models to change day-to-day, but I think some of OpenAI’s focus has been taken off altering models like gpt-3.5-turbo-16k-0613 by newer developments.

You shouldn’t need to pay double for the 16k model on your small inputs and outputs, though, until you are actually sending more tokens than would fit.

The proof would be to press “view code” in the Playground when you have the user message but not the assistant response, then copy that code identically into a Python IDE (or straight into IDLE).

Then add just the response-printing line: print(response.choices[0].message.content)
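
The pasted code would look something like this (a sketch assuming the openai Python package v1.x; the sampling parameters are placeholders for whatever your Playground session actually shows):

```python
# Sketch of the code "view code" produces, assuming the openai v1.x package.
# The sampling parameters are placeholders - copy the exact Playground values.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-3.5-turbo-16k",
    messages=[
        {
            "role": "system",
            "content": "You are ChatGPT, a large language model trained by OpenAI, "
                       "based on the GPT-3.5 architecture.\n"
                       "Knowledge cutoff: 2021-09\nCurrent date: 2023-12-21",
        },
        {"role": "user", "content": "<the short message being evaluated>"},
    ],
    temperature=1,   # placeholder
    top_p=1,         # placeholder
    max_tokens=256,  # placeholder
)

# the one line to add: print only the assistant's reply
print(response.choices[0].message.content)
```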

If the AI simply can’t answer the question reliably, hovering at the threshold of not working regardless of model, then a question that needs this kind of thinking and logic may need to be upgraded to gpt-4.

For the first 3 days I used gpt-3.5-turbo, and the results were much worse than with gpt-3.5-turbo-16k-0613; it was noticeable how gpt-3.5-turbo simply ignored part of my context with the conditions.

I also tried gpt-4, but it seemed even worse than gpt-3.5 in my case.

You can be really tricky: rewrite parts of the prompt that don’t change into English to get better instruction-following.

I tried this too, but the result was only worse:

```python
prompt1 = f"'{string}'\n\nIf the message above is an advertisement offering your services, then answer \"No\"; likewise, if the message above is a job-search advertisement, also answer \"No\". Only if the message above is a request for a service in {industry}, and it would be appropriate to respond to this message and offer your services in {industry}, answer \"Yes\". Answer me only \"Yes\" or \"No\"."

prompt2 = f'Analyze the following text received from Telegram and give an answer: is the text a request for a service in {industry}, and is it possible to offer your services as a specialist in this industry? Answer me only "Yes" or "No":\n"{string}"'

prompt3 = f"Understand the meaning of the following text. If the text is an advertisement offering your services, then answer 'No'; also, if the text is a job advertisement, answer 'No'. Otherwise, if the meaning of the text is a search for services in the {industry} industry, only then answer me 'Yes'.\nText: '{modified_string}'."
```
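
For context, a simplified sketch of how the three prompts feed one verdict (the ask() helper and the voting threshold are illustrative, not the exact production code):

```python
# Simplified sketch: each of the three prompts above answers Yes/No,
# and a majority of "Yes" answers marks the text as a service request.
# ask() is an illustrative helper, not the exact production code.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # assumption: pinned to reduce run-to-run variance
    )
    return response.choices[0].message.content.strip().lower()

def majority_vote(prompts: list[str]) -> bool:
    answers = [ask(p) for p in prompts]
    return sum(a.startswith("yes") for a in answers) >= 2  # 2 of 3 must agree

is_service_request = majority_vote([prompt1, prompt2, prompt3])
```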

Thank you for answering and trying to help, but my main questions are voiced in the topic.

My question remains unanswered: a request in the Playground and exactly the same request through the API consistently give opposite answers! Why?

Likely the system prompt message in ChatGPT…

Since it’s Russian, you might want to give it a one-shot example if you can.
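
Something like this, for instance (a rough sketch; system_prompt and new_message are placeholders, and the example text is made up):

```python
# Rough sketch of a one-shot layout: one demonstration pair (a sample
# user message plus the desired "1"/"0" answer) before the real message.
# system_prompt and new_message are placeholders; the example is made up.
messages = [
    {"role": "system", "content": system_prompt},
    # one-shot demonstration:
    {"role": "user", "content": "Hi everyone! Looking for a commercial photographer for a shoot in Dubai, DM me."},
    {"role": "assistant", "content": "1"},
    # the actual message to classify:
    {"role": "user", "content": new_message},
]
```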

Hope this helps.


Instead of passing the instructions in the user message every time, set the system message accordingly. Don’t use the 16k context for a small number of tokens. It’s likely that there’s some loss in translation.

I’d advise first using ChatGPT to translate your instructions into proper English and then setting them as the system message, with instructions on what and how to reply to the user’s message.

Then simply pass what you want to be evaluated as the user message.
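
In code, that structure would look roughly like this (a sketch; english_instructions and incoming_text are placeholders):

```python
# Sketch of the suggested split: fixed English instructions go in the
# system message; each incoming text is passed only as the user message.
# english_instructions and incoming_text are placeholders.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # plain model - the inputs are small, 16k not needed
    messages=[
        {"role": "system", "content": english_instructions},
        {"role": "user", "content": incoming_text},
    ],
)
print(response.choices[0].message.content)
```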

  1. It already is.
  2. Examples (original in Russian):

    SYSTEM: If the text is an advertisement offering your services, answer “0”. If the text is a job-search advertisement, also answer “0”. Answer “1” if it would be appropriate to offer your services to the sender. Otherwise answer “0”. Answer only “1” or “0” based on your analysis, without any additional comments.
    You are a multi-disciplinary specialist in the field of photography, videography, or content making. Please analyze the following message from a Telegram chat very carefully.
    USER: Hi everyone! I’m looking for a commercial photographer for a photo shoot in Dubai (dates: 25.12–09.01). I want beauty and creativity; offers via DM.

In the Playground the answer is a stable “1”. In the API it is “0” (incorrect).
All parameters are identical, believe me; I’ve been at this for the second week already.
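
For reference, a sketch of how that exact SYSTEM/USER pair goes through the API (system_text and user_text hold the two messages above; the sampling values are placeholders for what the Playground shows):

```python
# Sketch of sending the SYSTEM/USER pair above through the API.
# system_text and user_text hold the two messages quoted above; the
# sampling parameters are placeholders and must match the Playground exactly.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": system_text},
        {"role": "user", "content": user_text},
    ],
    temperature=1,  # placeholder - copy from the Playground
    top_p=1,        # placeholder - copy from the Playground
)
print(response.choices[0].message.content)  # expect "1", observe "0"
```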

Thank you.
For a more accurate answer, I use three different queries for one text and only then combine the answers into the final result.

Yes, I don’t use 16k anymore. I saw that plain gpt-3.5-turbo gives the same answers.

I tried translating, and wrote about it above; the result was even worse.

I have already written a separate program in Python, where I load 20 misclassified messages at once, change the prompt, and analyze the results as percentages:

Prompt_1 correct answers: 77.78%
Prompt_2 correct answers: 81.48%
Prompt_3 correct answers: 85.19%

Overall correct answers: 81.48%
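
Roughly, the program looks like this (a simplified sketch; the builder functions, helper names, and example data are placeholders, not the real code):

```python
# Simplified sketch of the evaluation program: each prompt template runs
# over a batch of hand-labelled messages and is scored for accuracy.
# The builder functions and example data are placeholders.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=1,  # the expected reply is a single "1" or "0"
    )
    return response.choices[0].message.content.strip()

# (telegram_text, expected "1"/"0") pairs - made-up examples
dataset = [
    ("Hi everyone! Looking for a commercial photographer, DM me.", "1"),
    ("Experienced videographer offering my services, portfolio inside.", "0"),
    # ... the rest of the labelled messages
]

# builders that wrap a text in each of the three prompt templates
templates = {"Prompt_1": build_prompt1, "Prompt_2": build_prompt2, "Prompt_3": build_prompt3}

total = total_correct = 0
for name, build in templates.items():
    correct = sum(ask(build(text)) == label for text, label in dataset)
    print(f"{name} correct answers: {correct / len(dataset):.2%}")
    total_correct += correct
    total += len(dataset)

print(f"Overall correct answers: {total_correct / total:.2%}")
```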

In Prompt_1 I changed the system message as in the screenshot above.

But the main problem is that the API still gives the wrong answer for some requests, even though everything is correct in the Playground.

I have the same questions. I have used all the ChatGPT API versions, and none of them exactly match the Playground answers. I don’t understand why they are so different. Does the Playground have a totally different, better model? Can I use the Playground model?