Gpt-3.5-turbo-1106 performance

Hello everyone,
I’m currently experimenting with LlamaIndex RAG tools, using OpenAI “text-embedding-3-large” as the embedding model and ‘gpt-4-1106-preview’ as the LLM. Everything works well: retrieval provides quite acceptable and relevant answers in the Uzbek language. However, due to the huge amount of source data, I wanted to reduce cost and switch from ‘gpt-4-1106-preview’ to ‘gpt-3.5-turbo-1106’. The LLM replies became noticeably less relevant.
I can rule out the context window token limit as the cause, since I’m retrieving 3 chunks of 1024 tokens each, which the ‘gpt-3.5-turbo-1106’ model (16,385-token context) should easily handle.
What other reasons could explain why 3.5 can’t give results comparable to gpt-4? Could the language of the source data be the reason? I’m using Uzbek.
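The token-budget reasoning above can be sanity-checked with a quick sketch (the prompt-overhead and answer-budget figures below are illustrative assumptions, not measured values):

```python
# Rough context-budget check: 3 retrieved chunks of 1024 tokens each,
# plus an allowance for the system prompt, question, and reply,
# compared against the gpt-3.5-turbo-1106 context window.

CONTEXT_WINDOW = 16_385        # gpt-3.5-turbo-1106
chunks = 3
tokens_per_chunk = 1024
prompt_overhead = 500          # assumed: system prompt + user question
answer_budget = 1024           # assumed: room reserved for the reply

used = chunks * tokens_per_chunk + prompt_overhead + answer_budget
print(f"budget used: {used} / {CONTEXT_WINDOW} tokens")
assert used < CONTEXT_WINDOW   # nowhere near the limit
```

Even with generous overhead assumptions, the request uses well under a third of the window, so context truncation really can be ruled out.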

Are you getting the same results from your vector search? It is the only explanation I can think of, which means the query being used might be the culprit. Is it a plain query from the user, or do you preprocess it (e.g. HyDE)?

1 Like

I have made several posts about this issue. gpt-3.5-turbo-16k is far less capable than gpt-4 and sometimes seems incapable of accurately reading returned documents. I have documented cases where, given the exact same question and context documents, gpt-3.5-turbo-16k fails to even recognize the answer in the documents where gpt-4 responds perfectly.

I doubt that it is the vector search results as they won’t be affected by the LLM.

This was way back in June 2023: Gpt-3.5-turbo-16k api not reading context documents

However, I was able to eventually get much better results from the gpt-3.5 model by using xml markup in my prompt: API Prompt for gpt-3.5-turbo-16k - #12 by SomebodySysop

I know your question was about ‘gpt-3.5-turbo-1106’, but my understanding is that gpt-3.5-turbo-16k is an alias for that model. Somebody let me know if that is incorrect.
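For illustration, wrapping the question and context documents in XML markup can look something like this minimal sketch (the tag names and chunk structure here are my own assumptions, not the exact format from the linked post):

```python
# Hypothetical sketch: delimit retrieved chunks with XML tags so the
# model can cleanly separate the documents from the question.
def build_xml_prompt(question: str, chunks: list) -> str:
    parts = ["<documents>"]
    for chunk in chunks:
        parts.append(f'  <document source="{chunk["source"]}">')
        parts.append(f'    {chunk["text"]}')
        parts.append("  </document>")
    parts.append("</documents>")
    parts.append(f"<question>{question}</question>")
    return "\n".join(parts)

prompt = build_xml_prompt(
    "What are the conditions for formalizing a marriage in Uzbekistan?",
    [{"source": "article-1", "text": "Marriage must be registered..."}],
)
print(prompt)
```

The explicit boundaries seem to help the 3.5 models keep the retrieved passages distinct from the instruction text.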

1 Like

Did you check gpt-3.5-turbo-0125? I feel it’s better and follows instructions better. (FYI: I’m using it in the Assistants API.)

1 Like

Isn’t the alias gpt-3.5-turbo? If so, I’ve tried it with even worse results.

Customers using the pinned gpt-3.5-turbo model alias will be automatically upgraded from gpt-3.5-turbo-0613 to gpt-3.5-turbo-0125 two weeks after this model launches.

The stable model is not switched yet. As I replied four days ago:


call to gpt-3.5-turbo just now:

{'id': 'chatcmpl-xxx',
 'choices': [{'finish_reason': 'stop', 'index': 0, 'logprobs': None,
              'message': {'content': 'Hello! How can I assist you today?',
                          'role': 'assistant', 'function_call': None,
                          'tool_calls': None}}],
 'created': 1707289999, 'model': 'gpt-3.5-turbo-0613',
 'object': 'chat.completion', 'system_fingerprint': None,
 'usage': {'completion_tokens': 9, 'prompt_tokens': 8, 'total_tokens': 17}}

The reason why 3.5 is not as good as gpt-4, which costs 10x as much? I’m guessing one could answer that on their own. You can significantly decrease the API temperature parameter for less common languages, where the AI is less certain, and test.
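A low-temperature request can be sketched like this (the parameter values are illustrative, not a recommendation for any specific workload):

```python
# Sketch of Chat Completions request parameters with a lowered
# temperature, for cases where the model is less certain (e.g. a
# less common language). The values here are illustrative assumptions.
request = {
    "model": "gpt-3.5-turbo-0125",
    "temperature": 0.1,   # well below the default of 1.0
    "messages": [
        {"role": "system",
         "content": "Answer using only the provided documents."},
        {"role": "user",
         "content": "…question plus retrieved context…"},
    ],
}
# e.g. client.chat.completions.create(**request) with the OpenAI SDK
print(request["temperature"])
```

Lower temperature makes the sampling more deterministic, which tends to reduce the chance of the model drifting away from the retrieved context.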

2 Likes

I finally stopped procrastinating and actually checked it out. I tested it on sample questions from 3 different datasets. I am blown away by how good it actually is. Not as good as gpt-4, but not as terrible as its gpt-3.5-turbo-16k cousin either.

Well done OpenAI!

1 Like

Try gpt-3.5-turbo-0125. I just tested it and it is returning decent answers from a variety of large, fairly complex datasets. Not as good as gpt-4, but you may find them good enough in your use case.

And, yes, the language of the source data will make a difference, especially if the models are primarily trained on English.

I assume you speak multiple languages, so you understand that a language isn’t just words, but a way of looking at and interpreting the world around you. Large Language Models have no awareness of the world around them – they simply anticipate words. And the words they are primarily trained on are English, not Uzbek. So, I’m sure you’re going to “lose a little in the translation”, so to speak.

But, try gpt-3.5-turbo-0125. It might surprise you.

1 Like

Hello @SomebodySysop,

thank you! I have just tried ‘gpt-3.5-turbo-0125’ and unfortunately the answer is still ‘bullshit’. Here is the comparison below. For the purity of the experiment I’m using 2 source articles which I parse from scratch, recreating the embeddings for each new test.
My test query is: “What are the conditions for formalizing a marriage in Uzbekistan?”
gpt-4 answer:

Warning: model not found. Using cl100k_base encoding.
The following conditions must be met in order to formalize a marriage in Uzbekistan:

1. Marriage must be registered in local authorities.
2. Both parties must be of legal age (18 years for men, 17 years for women).
3. Both parties must provide valid personal identification documents.
4. Both parties must provide evidence of their legal status (single or divorced or widowed).
5. Marriage should be voluntary and not compulsory.
6. Marriage must not conflict with cultural or religious norms.
7. The couple must not be closely related by blood or adoption.
8. Marriage must be registered within the time specified by the law.
9. The couple must submit the necessary documents and pay the necessary fees.
10. Marriage must be registered in accordance with all laws and regulations of Uzbekistan.

gpt-3.5-turbo-0125 answer:

Warning: model not found. Using cl100k_base encoding.
Conditions for legalization of marriage Prime Minister of the Republic of Uzbekistan A. It was established by ARIPO based on its decision No. 911 of November 27, 2019. According to this decision, the necessary conditions for the formalization of marriage are as follows:

1. Loss of housing conditions for marriage. There are many important conditions for marriage to take place. One of them is Niko in Uzbekistan

You can see the dramatic difference. The only thing I change is this piece of code: I replace the model, parse from scratch, recreate the embeddings, and submit the query.

    # Create the OpenAIAssistantAgent using the query engine tool
    agent = OpenAIAssistantAgent.from_new(
        name="Legal Advisor",
        # model="gpt-3.5-turbo-1106",
        # model="gpt-3.5-turbo-0125",
        model="gpt-4-1106-preview",
        instructions="You are a professional lawyer aiming to provide legal assistance based on the documents provided. Include the source link of the information in your response.",
        tools=[combined_tool],
        instructions_prefix="Please be professional and polite",
        verbose=True,
        run_retrieve_sleep_time=1.0,
    )

OK, try one more thing: API Prompt for gpt-3.5-turbo-16k - #12 by SomebodySysop

Try XML markup for your document presentation in the prompt. I assume you’re returning a good deal of documentation. Also, what are you setting your max tokens to? I’ve set mine to 4000.

This is the type of responses I am getting back using this formatting:

Update: In case you’re wondering, this is how I format the messages array:

			// Check if the model name contains "gpt-3.5-turbo"
			if (strpos($this->model, 'gpt-3.5-turbo') !== false) {
				$prompt = $xmlPrompt;
			}
			
			// Initialize the $messages array with the system message
			$messages = array(
				array("role" => "system", "content" => $systemMessage)
			);

			// Define the new user message (question + context docs)
			$newUserMessage = array("role" => "user", "content" => $prompt);
			
			// Append the new user message to the end of the $messages array
			$messages[] = $newUserMessage;

So, I deliver the entire User message, question and context documents, in XML markup as per the example in the link above.

1 Like

I had tested both in the Assistants API and can directly feel the difference: less token usage, more accurate responses, and I believe they reduced pricing as well.

2 Likes

Dear community members,

sorry for the late reply. I have finally realized that my previous impression of the significant superiority of gpt-4 over gpt-3.5 was due almost entirely to the greater eloquence of the 4th model: even when RAG returns completely irrelevant fragments in response to a question, gpt-4 saves the day with its broader general knowledge, so its answers merely seem much better. In the cases where RAG returns relevant context and the question is formulated correctly, gpt-3.5-turbo-0125 produces quite acceptable answers.

1 Like

If your queries are mainly about retrieving information, give gpt-3.5-turbo-1106 a try. It is outperforming 0125 for me (extracting original text from a document).