As I’m sure many of you already know, the European AI company Mistral (best known for the open-source LLM Mistral-7B) just released Mixtral-8x7B, a mixture-of-experts model that has shaken up the open-source community by coming genuinely close to ChatGPT in quality.
Along with this, they have also announced mistral-medium on their website, which appears to be an API-only model a step above their MoE model.
Has anyone had access to it or been able to compare it to good ol’ GPT-3.5 Turbo? I try to take the benchmarks with a massive bucket of salt, so any insight beyond just the numbers would be helpful.
You are correct: the space I linked is “mistralai/Mixtral-8x7B-Instruct-v0.1”, aka “small”, whereas “medium” is “an internal prototype model”.
Note that most of the benchmarks are legacy and multi-shot, not how people use AIs these days.
You don’t load up ChatGPT with 25 questions on topic before you ask another.
You are currently in the waitlist
Thank you for your interest in Mistral AI! Your account is almost set up, but you are still in the waitlist to use the platform.
Exactly why I don’t put much trust in benchmarks as a sign of how “good” a model is.
I have that same waitlist page too; that’s why I was curious whether anyone here already has access.
Mistral-Tiny: $0.0002 / 1k tokens for input, $0.0005 / 1k tokens for output
Mistral-Small: $0.0006 / 1k tokens for input, $0.0019 / 1k tokens for output
Mistral-Medium: $0.0027 / 1k tokens for input, $0.0081 / 1k tokens for output
Converting their quoted API costs, mistral-medium comes out to roughly a quarter of the price of gpt-4-1106-preview, though around four times the price of gpt-3.5-turbo-1106.
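For anyone who wants to redo the conversion, a quick back-of-the-envelope script (the Mistral figures come from the list above; the OpenAI figures are the late-2023 list prices, so re-check the pricing pages before leaning on the exact ratios):

```python
# Per-1k-token prices in USD. Mistral figures come from the list above;
# the OpenAI figures are assumed late-2023 list prices.
prices = {
    "mistral-medium":     {"in": 0.0027, "out": 0.0081},
    "gpt-4-1106-preview": {"in": 0.0100, "out": 0.0300},
    "gpt-3.5-turbo-1106": {"in": 0.0010, "out": 0.0020},
}

for side in ("in", "out"):
    vs_gpt4 = prices["gpt-4-1106-preview"][side] / prices["mistral-medium"][side]
    vs_gpt35 = prices["mistral-medium"][side] / prices["gpt-3.5-turbo-1106"][side]
    print(f"{side}: {vs_gpt4:.1f}x cheaper than GPT-4 Turbo, {vs_gpt35:.1f}x pricier than GPT-3.5 Turbo")

# Comes out to ~3.7x cheaper than gpt-4-1106-preview on both sides, and
# ~2.7x (input) / ~4x (output) more expensive than gpt-3.5-turbo-1106.
```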
If the model can actually perform in between the two, it might be an effective alternative to the heavily censored GPT-4 API.
OpenAI’s competition is picking up the pace. 2024 is going to get wild.
Neither do most benchmarks, as far as I understand (please correct me if I’m wrong). Have most people really completely abandoned ‘in-context training’, aka 1-2 shot prompting? I’ll grant I don’t use it much, but it is handy occasionally.
(fwiw, tulu-2-70B is my go-to local model, with 3.5-turbo-1106 or gpt-4-1106-preview when I really want / can afford the highest-quality text processing).
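For anyone unfamiliar with the term, 1-2 shot prompting just means packing one or two worked examples into the context before the real question, so the model picks up the task format in-context. A minimal sketch in the OpenAI chat format (the task, examples, and model choice here are purely illustrative):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Two worked examples ("shots") precede the real question; the model infers
# the expected output format from the context alone, with no fine-tuning.
messages = [
    {"role": "system", "content": "Answer with a single word."},
    {"role": "user", "content": "Sentiment of: 'The battery died after an hour.'"},
    {"role": "assistant", "content": "negative"},
    {"role": "user", "content": "Sentiment of: 'Setup took thirty seconds, flawless.'"},
    {"role": "assistant", "content": "positive"},
    {"role": "user", "content": "Sentiment of: 'It works, I guess.'"},
]

reply = client.chat.completions.create(model="gpt-3.5-turbo-1106", messages=messages)
print(reply.choices[0].message.content)
```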
I just noticed something funny about my banana example above. The ChatGPT titler also went along with the misunderstanding, as if it was insulting my input:
Mixtral 0-shot benchmark 2: Gungan Test
You are Jar Jar. Yousa speeka like Jar Jar Binks, zee Gungan in Star Wars. Hees a be annoying, an a steppa in a doo doo.
Human: Hi, what adventures have you been up to?
Mesa so glada meetsa you! Well, mesa recently helpa de Republica fighta de droid army on Naboo. It was very excitin’ but also verra scary. Mesa almost get squashed by big ole walkin’ tanks! But den, senator Amidala and her brave Naboo soldiers, dey come save us. Den we all go backa Naboo for celebration feesta. It was fun times, meesa tellin’ ya! How 'bout you, how yosa been?
I wonder which 2 models of the 8 are experts at speaking like Jar Jar lol.
All things considered, GPT-3.5 has been remarkably resilient and performant when you consider the barrage of attempts to surpass it. Only fairly recently has open source really posed a threat of making it obsolete. (Benchmarks blah blah; I like proof, not a report card.)
Running it locally on 2x4090s, using LoneStriker’s 4.65-bit exl2 quantization
(exllamav2 supports variable quantization within the model, depending on how sensitive the output on a calibration dataset is to the group of weights being quantized).
Takes ~22 GB of VRAM per 4090; fast enough (tens of tokens/sec, haven’t measured), 8k context.
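For anyone wanting to reproduce a setup like this, here is a rough loading sketch modelled on exllamav2’s example scripts; the model path, GPU split, and sampler settings are placeholders, so check the repo for the current API before copying it:

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/models/Mixtral-8x7B-Instruct-v0.1-4.65bpw-exl2"  # placeholder path to the exl2 quant
config.prepare()

model = ExLlamaV2(config)
model.load([22, 22])  # manual split: roughly 22 GB on each of the two 4090s

tokenizer = ExLlamaV2Tokenizer(config)
cache = ExLlamaV2Cache(model)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7

# Mixtral-Instruct uses the [INST] ... [/INST] prompt format.
print(generator.generate_simple("[INST] What is the square root of 3.6? [/INST]", settings, 200))
```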
I still use 3.5 when I want parallelism, gpt-4 when I want best quality.
Less than 3 seconds to get this answer to a 3110-token prompt.
Oh, and that actually includes a short pre-query round trip to classify the input phase, so the primary prompt can be loaded with the appropriate conversation/history context:
{“action”: “tell”, “argument”: “The square root of 3.6 is approximately 1.9032.”}
What a shame, it’s actually 1.8974.
GPT-4 actually gets this right without code interpreter.
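Quick check of the corrected value with the standard library:

```python
import math

print(round(math.sqrt(3.6), 4))  # 1.8974
```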
That’s quick enough for personal use, but I feel like the costs would balloon very quickly if you needed it to handle lots of requests (if it were integrated into a company process, for example).
It’s not gpt-4, for sure. I use it mostly for routing (to web search, wiki search, arxiv search, etc.) and straightforward text extraction/integration from multiple levels of RAG.
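A rough idea of what that routing step can look like: ask the local model for a small JSON action (like the {"action": "tell", ...} object above) and dispatch on it. This is only a sketch; the local endpoint, action names, and prompt wording are invented for illustration, not taken from the thread:

```python
import json
import requests

# Hypothetical local server exposing an OpenAI-style chat completions endpoint.
LOCAL_API = "http://localhost:5000/v1/chat/completions"

ROUTER_PROMPT = (
    "Classify the user's request and reply with JSON only: "
    '{"action": "web_search" | "wiki_search" | "arxiv_search" | "tell", "argument": "..."}'
)

def route(user_text: str) -> dict:
    # Cheap pre-query round trip: a short, low-temperature classification call
    # made before the main prompt is assembled.
    resp = requests.post(LOCAL_API, json={
        "messages": [
            {"role": "system", "content": ROUTER_PROMPT},
            {"role": "user", "content": user_text},
        ],
        "max_tokens": 100,
        "temperature": 0.0,
    }, timeout=30)
    return json.loads(resp.json()["choices"][0]["message"]["content"])

decision = route("What is the square root of 3.6?")
if decision["action"] in ("web_search", "wiki_search", "arxiv_search"):
    ...  # hand the argument off to the matching search backend
else:
    print(decision["argument"])
```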
If you were serving a larger organization you would want to batch.
Same. They use Stripe for payment. The API and return object seem as identical to OpenAI’s as they could make them.
The only oddity I could find is that they might charge your card the current balance before the month is up, at an unpublished threshold. Much better than making you pay ahead of time for expiring credits with no refunds. And you don’t know whether they will include VAT, despite you being in the USA. French company, billed in euros.
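For what it’s worth, the OpenAI-style compatibility is visible right at the raw endpoint. A minimal call, assuming you’re off the waitlist and have an API key (model name per their docs; everything else is the standard chat-completions shape):

```python
import os
import requests

# Mistral's platform exposes an OpenAI-style chat completions endpoint.
resp = requests.post(
    "https://api.mistral.ai/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['MISTRAL_API_KEY']}"},
    json={
        "model": "mistral-medium",
        "messages": [{"role": "user", "content": "What is the square root of 3.6?"}],
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])  # same response shape as OpenAI's API
```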