Mistral-Medium versus GPT-3.5 Turbo?

As I’m sure many of you already know, the European AI company Mistral (best known for its open-source model Mistral-7B) just released Mixtral-8x7B, a mixture-of-experts model that has jostled the open-source community by actually being close to comparable to ChatGPT.

Along with this, on their website they have also announced mistral-medium, which seems to be an API-only model a step above their MoE model.

Has anyone had access to it or been able to compare it to good ol’ GPT-3.5 Turbo? I try to take the benchmarks with a massive bucket of salt, so any insight beyond just the numbers would be helpful.






That’s a fantastic find, I’ll test it out.

A quick search though reveals that this is the Mixtral of experts model, not the mistral-medium one.

Concretely, Mixtral has 46.7B total parameters but only uses 12.9B parameters per token.
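A back-of-envelope check on those two figures (the 46.7B/12.9B numbers are from Mistral's announcement; the split below assumes the entire gap between total and active parameters comes from the six inactive experts per token, which is a simplification):

```python
# Rough MoE parameter accounting for Mixtral-8x7B:
# each token is routed to 2 of 8 experts, so 6 experts' worth of
# feed-forward weights sit idle for that token.
total_params = 46.7   # billions, all loaded in memory
active_params = 12.9  # billions, actually used per token
n_experts, top_k = 8, 2

# If the six inactive experts account for the whole gap:
per_expert = (total_params - active_params) / (n_experts - top_k)
shared = total_params - n_experts * per_expert  # attention, embeddings, router

print(f"~{per_expert:.2f}B per expert, ~{shared:.2f}B shared")
```

So memory cost scales with the full 46.7B while per-token compute scales with 12.9B, which is why it can feel like a ~13B model at inference time while needing ~47B worth of VRAM.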

I’ll use this to test the new one they released to the public, but the API-exclusive model still seems to be a mystery.

You are correct, the space I linked is “mistralai/Mixtral-8x7B-Instruct-v0.1”, aka “small”, whereas “medium” is “an internal prototype model.”

Note that most of the benchmarks are legacy and multi-shot, not how people use AIs these days.
You don’t load up ChatGPT with 25 questions on topic before you ask another.

You are currently in the waitlist

Thank you for your interest in Mistral AI! Your account is almost set up, but you are still on the waitlist to use the platform.

“Access to our API is currently invitation-only,”


Note that most of the benchmarks are legacy and multi-shot, not how people use AIs these days.
You don’t load up ChatGPT with 25 questions on topic before you ask another.

Exactly why I don’t put much trust into them as a sign of how “good” a model is.

I have that same waitlist page too, that’s why I was curious if anyone here already has access.

  • Mistral-Tiny: $0.0002 / 1k tokens for input, $0.0005 / 1k tokens for output
  • Mistral-Small: $0.0006 / 1k tokens for input, $0.0019 / 1k tokens for output
  • Mistral-Medium: $0.0027 / 1k tokens for input, $0.0081 / 1k tokens for output

Converting their quoted API costs, Mistral-Medium comes out to nearly 4 times cheaper than gpt-4-1106-preview, though about 4 times more expensive than gpt-3.5-turbo-1106.
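Running the arithmetic on those rates (Mistral's prices from the list above; the OpenAI figures are the published late-2023 rates of $0.01/$0.03 per 1k tokens for gpt-4-1106-preview and $0.001/$0.002 for gpt-3.5-turbo-1106, so treat this as a snapshot):

```python
# Price per 1k tokens: (input, output) in USD
prices = {
    "mistral-medium":     (0.0027, 0.0081),
    "gpt-4-1106-preview": (0.0100, 0.0300),
    "gpt-3.5-turbo-1106": (0.0010, 0.0020),
}

med_in, med_out = prices["mistral-medium"]
for name, (inp, out) in prices.items():
    if name == "mistral-medium":
        continue
    # Ratio of Mistral-Medium's price to the other model's price.
    print(f"vs {name}: input x{med_in / inp:.2f}, output x{med_out / out:.2f}")
```

So “nearly 4 times cheaper than the GPT-4 preview” holds on both sides (about 3.7x), while the 4x premium over 3.5-turbo applies to output tokens; input is closer to 2.7x.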

If the model can actually perform in between the two it might be an effective alternative to the heavily-censored GPT-4 API.

OpenAI’s competition is picking up the pace. 2024 is going to get wild.


Neither do most benchmarks, as far as I understand (please correct me if I’m wrong). Have most people really completely abandoned ‘in-context learning’, aka 1–2-shot prompting? I’ll grant I don’t use it much, but it is handy occasionally.
(FWIW, tulu-2-70B is my go-to local model, with gpt-3.5-turbo-1106 or gpt-4-1106-preview when I really want, and can afford, the highest-quality text processing.)


I just noticed something funny about my banana example above. The ChatGPT titler also went along with the misunderstanding, as if it was insulting my input:


Mixtral 0-shot benchmark 2: Gungan Test

You are Jar Jar. Yousa speeka like Jar Jar Binks, zee Gungan in Star Wars. Hees a be annoying, an a steppa in a doo doo.

Human: Hi, what adventures have you been up to?

Mesa so glada meetsa you! Well, mesa recently helpa de Republica fighta de droid army on Naboo. It was very excitin’ but also verra scary. Mesa almost get squashed by big ole walkin’ tanks! But den, senator Amidala and her brave Naboo soldiers, dey come save us. Den we all go backa Naboo for celebration feesta. It was fun times, meesa tellin’ ya! How 'bout you, how yosa been?


fwiw, tulu-2-70B is my goto local model

To clarify, are you running it locally or through a cloud computing service? Also, is it a quantized version?

I’ve been debating whether or not to get my company to purchase some GPUs and try to build a setup that can run the state-of-the-art local models.


I wonder which 2 models of the 8 are experts at speaking like Jar Jar lol.

All things considered, GPT-3.5 has been remarkably resilient and performant when you consider the barrage of attempts to surpass it. Only fairly recently has open source really posed a threat of making it obsolete. (Benchmarks, blah blah; I like proof, not a report card.)


Locally, on 2x4090s, using LoneStriker’s 4.65-bit exl2 quantization.
(exllamav2 supports variable quantization within the model, depending on the sensitivity of a test dataset to the group of weights being quantized.)
It takes ~22 GB of VRAM per 4090 and is fast enough (tens of tokens/sec, haven’t measured), with 8k context.
I still use 3.5 when I want parallelism, gpt-4 when I want the best quality.
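Those VRAM numbers check out on a napkin (assuming Mixtral's 46.7B total parameters and that the 4.65 bits/weight average applies across the whole model; KV cache and runtime buffers come on top):

```python
total_params = 46.7e9    # Mixtral-8x7B total parameter count
bits_per_weight = 4.65   # average for the exl2 quantization mentioned above

# Quantized weights alone, in GB:
weights_gb = total_params * bits_per_weight / 8 / 1e9
print(f"weights alone: ~{weights_gb:.1f} GB")  # ~27.1 GB

# Split across two 4090s, that leaves headroom under the reported
# ~22 GB/card for KV cache, buffers, and the 8k context.
```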


Have you run it in production? How does the inference speed compare?

Not sure what you mean by ‘production’. I have my exllamav2 (FastAPI-wrapped) running all day, but I am the sole user. I’ll do a quick measure, hold on…

It costs $7/hr to host a 3.5 deployment on Azure.

You can pay less on Modal with Mistral-Medium for similar quality.

Less than 3 sec to get this answer to a 3,110-token prompt.
Oh, and that actually includes a short pre-query round trip to classify the input so the primary prompt can be loaded with the appropriate conversation/history context:

{"action": "tell", "argument": "The square root of 3.6 is approximately 1.9032."}


What a shame, it’s actually 1.8974.
GPT-4 actually gets this right without code interpreter.
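For reference, the claimed value is easy to verify with the standard library:

```python
import math

# The model answered 1.9032; the actual value rounds to 1.8974.
print(round(math.sqrt(3.6), 4))  # 1.8974
```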

That’s quick enough for personal use, but I feel like the costs would balloon very quickly if you needed it to handle lots of requests (if it were integrated into a company process, for example).


It’s not gpt-4, for sure. I use it mostly for routing (to web search, wiki search, arXiv search, etc.) and straightforward text extraction/integration from multiple levels of RAG.

If you were serving a larger organization you would want to batch.
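A minimal sketch of what client-side batching could look like: fan requests out concurrently instead of serially. The `call_model` stub below is a placeholder for whatever inference call you are making; real serving stacks also batch inside the engine itself, which is where the bigger throughput wins are.

```python
from concurrent.futures import ThreadPoolExecutor

def call_model(prompt: str) -> str:
    # Stub: replace with your actual API or local-inference call.
    return f"response to: {prompt}"

def run_batch(prompts, max_workers=8):
    # Issue requests concurrently; results come back in prompt order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(call_model, prompts))

results = run_batch(["route: weather?", "extract: Q3 revenue", "summarize: ..."])
```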


Attention all: I have just gotten access to the Mistral API. I will attempt to do some testing, but I don’t have enough time today.

Very exciting!


Same. They use Stripe for payments. The API and return object seem as identical to OpenAI’s as they could make them.

The only oddity I could find is that they might charge your card for the current balance before the month is up, at an unpublished threshold. Much better than making you pay ahead of time for expiring credits with no refunds. And you don’t know whether they will include VAT despite your being in the USA. French company, billed in euros.


They’ve done a great job structuring the API so it would be fairly simple for someone to switch easily from OpenAI.
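Concretely, the chat-completions request shape is the same, so in principle only the endpoint, key, and model name change (the URLs and model name below are as published; auth headers and optional parameters may still differ in the details):

```python
# The same payload works against either chat-completions endpoint;
# swap the base URL, API key, and model name.
OPENAI_URL = "https://api.openai.com/v1/chat/completions"
MISTRAL_URL = "https://api.mistral.ai/v1/chat/completions"

payload = {
    "model": "mistral-medium",  # or e.g. "gpt-3.5-turbo-1106" against OpenAI
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7,
}
```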

Hopefully someone will be able to do some proper testing so we can compare.

Got my access as well! Excited 🙂
