“Our long term vision is to build general intelligence, open source it responsibly, and make it widely available so everyone can benefit. We’re bringing our two major AI research efforts (FAIR and GenAI) closer together to support this. We’re currently training our next-gen model Llama 3, and we’re building massive compute infrastructure to support our future roadmap, including 350k H100s by the end of this year – and overall almost 600k H100s equivalents of compute if you include other GPUs.” (Source)
At the same time, we recently saw the launch of the Mistral model, which is comparable to GPT-3.5.
Models are rolling out for sure. But I always hate seeing the unidentifiable wall of metrics every paper has these days.
For overall quality, I like looking at blind testing on real-world data and leaderboards (ref), and actually being part of the crowdsourced testing so I can weigh in directly.
The crowdsourced opinions are showing OpenAI GPT-4 variants leading overall right now, and Mixtral 8x7B in 7th.
So for overall quality, GPT-4 is king right now.
But the best GPT-3.5-Turbo variant is 10th. So if you are looking for a GPT-3.5-Turbo competitor, then yeah, there are lots of open-source/open-weight (OS/OW) options.
The best “useless metric” I have seen so far is “how often does one model beat another model across multiple tests,” which is then used as the basis for an arbitrary scoring system of battle wins/losses to rank models. This usually ends up putting some 7B model just 10 points behind GPT-4, and the nature of the metric is then hidden away in some acronym.
Also, I’m less convinced that the choice of model architecture makes a lot of difference; instead, I believe “the best training data” and “the best training hardware / best training schedule” are the determining factors. Note that “best” doesn’t necessarily mean “most.”
This is for LLM text-completion tasks. If you’re going for AGI, I think we’ll need some totally different architecture, so for that case, clearly, some new model will matter, in addition to the training data. But I also think we’ll need additional training data, not just text completion, for that application.
@curt.kennedy - But I always hate seeing the unidentifiable wall of metrics every paper has these days.
I agree. Although I think they are helpful for comparing open-source models between one another, they are nowhere near accurate in my experience when it comes to GPT-4 vs others.
I use Bard Pro daily and despite all the claims/benchmarks that it’s as good, it still doesn’t come anywhere close.
However, when you look at the Hugging Face ratings comparing open-source models between one another, I find those very accurate, and I am astonished by the performance/quality of Mistral 7B and Mixtral for simple tasks.
For example, imagine you have something that requires thousands of quick calls; Mistral is ideal for this. Or something that will run on a Pi.
It’s nowhere close to GPT-4; however, for the resources it consumes, it’s truly hard to believe what it can do with so little.
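To make the “thousands of quick calls” idea concrete, here’s a minimal sketch against a locally running Mistral 7B. I’m assuming an Ollama server on its default port with the stock `mistral` model pulled; swap in whatever local runner you actually use.

```python
# Minimal sketch: firing many quick, cheap calls at a local Mistral 7B.
# Assumes an Ollama server on its default port with the "mistral" model
# pulled (`ollama pull mistral`); adjust the URL/model for your own setup.
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def quick_call(prompt: str) -> str:
    """One short, non-streaming completion from the local model."""
    resp = requests.post(
        OLLAMA_URL,
        json={"model": "mistral", "prompt": prompt, "stream": False},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["response"]

# Thousands of tiny calls like this cost nothing but local compute.
items = ["order #1: arrived broken", "order #2: love it", "order #3: never shipped"]
for item in items:
    label = quick_call(f"Reply with one word, positive or negative: {item}")
    print(item, "->", label.strip())
```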
And how much is it going to cost me to lease the required hardware in the cloud to run Mixtral 8x7B?
I don’t understand why someone would lease hardware for a lightweight 7B model, or why they’d treat it as a binary proposition.
Mixtral runs on the local machine and is a complement to GPT, intended for entirely different purposes.
I rely on it heavily for chunked processing of data I pass to GPT, for example.
I also rely on it for simple code-replacement tasks that need to run quicker and without a remote API call: for example, a 100k+ record dataset that I need to regress to some sort of pattern, often based on sentiment. Things like that would take far too much time to code or would require fuzzy logic, yet they aren’t a fit for GPT.
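As a rough illustration of that kind of record-tagging job (a sketch, not my actual code), here is how it might look with llama-cpp-python and a quantized Mistral 7B GGUF file; the model path, CSV layout, and prompt wording are all placeholders.

```python
# Rough sketch: tagging a large record set by sentiment with a local model.
# Uses llama-cpp-python with a quantized Mistral 7B GGUF; the file path,
# CSV column names, and prompt are illustrative placeholders.
import csv
from llama_cpp import Llama

llm = Llama(model_path="./mistral-7b-instruct.Q4_K_M.gguf", n_ctx=2048, verbose=False)

def sentiment(text: str) -> str:
    """Ask the local model for a one-word sentiment label."""
    out = llm(
        "Classify the sentiment of this text as positive, negative, or neutral. "
        f"Answer with one word only.\nText: {text}\nSentiment:",
        max_tokens=4,
        temperature=0.0,
    )
    return out["choices"][0]["text"].strip().lower()

# Chunk through the records locally; only the distilled results would
# ever need to go to a remote API like GPT.
with open("records.csv", newline="") as f:
    for row in csv.DictReader(f):
        row["sentiment"] = sentiment(row["comment"])
        # ...aggregate, regress, or hand off to GPT here...
```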
Don’t overlook the use of open-source models as a complement or separate tool; they can be a great addition to your toolset!
The debate of open-source vs “proprietary” models will always be ongoing … I just recently went through a vendor training, and that was part of the curriculum: compare and contrast, list pros and cons between open-source models and vendor-specific (proprietary) ones … as an example, it listed pros for the “proprietary” models, such as the ones provided by OpenAI.
“In production, running on your local machine is not an option”? Can you help me understand why not?
Open source will make more sense for production as hardware costs come down.
I also have an instance running on a Raspberry Pi that works fine for the purpose.
Based on your comments, I feel as if we may be talking about different models.
And to reiterate: I’m not for one moment discounting GPT, nor do I believe there is any debate about which is better. I’m strictly referring to using the best tool for the job, and that’s not always the most powerful model.
To offer yet another example (since I mentioned my Pi usage) – Home Assistant integration is such a use case. Far better to use a lightweight local model.
I’m as big of a fan of OpenAI as they come; however, I have many instances of Mistral models running for entirely different purposes. All of them on commodity hardware.
I see what you are saying. If you run a 24x7 business operation and need high uptime, you go to the cloud.
But if you are in the cloud, why not just run some fancy proprietary model through an API? Right?
However, local may not work 24x7, unless you are a bigger company and can afford your own server farm with specialized HVAC, redundancy, etc.
But local can work when doing offline/local things that aren’t driven by external events, like writing code or one-off tasks.
But then there’s the frustration factor with local OS models. For example, I downloaded Mixtral 8x7B to run locally on my Mac Studio with 128 GB of RAM using the new Apple MLX framework. I got done downloading the weights (around 90 GB), and then the whole thing failed because my local git repo didn’t have Git LFS (the “large file” support) initialized. So it can be frustrating.
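For what it’s worth, one way around that particular trap is to skip git entirely and pull the weights with huggingface_hub, which handles large files on its own. A minimal sketch, assuming the public Mixtral repo id; the local directory is an arbitrary choice:

```python
# Sketch: download model weights without git/git-LFS via huggingface_hub.
# The repo id below is the public Mixtral release; the local directory is
# arbitrary. Re-running the call picks up where an interrupted download left off.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="mistralai/Mixtral-8x7B-v0.1",
    local_dir="./mixtral-8x7b",
)
```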
But say I needed to create a high-quality training file for another model. The local model could have assisted me, saved me some money, and boosted my ego.
There is a misconception that you need the most powerful model, yet that is what OpenAI/GPT is for. Typically, if you have a use case for a lower-power local LLM, you shouldn’t notice a major difference between Mistral 7B and Mixtral.
You’re better off with Mistral 7B at a few gig. Better yet, run something like LM Studio and make your download/spin-up of different models point-and-click easy.
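A nice side effect of LM Studio specifically: it can expose whatever model you’ve loaded through an OpenAI-compatible local server, so existing client code only needs a base-URL change. A minimal sketch, assuming the default port (1234) and a placeholder model id:

```python
# Sketch: talking to LM Studio's OpenAI-compatible local server with the
# standard openai client. Default port is 1234; the api_key is ignored
# locally, and the model id depends on whatever you've loaded in the UI.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="mistral-7b-instruct",  # placeholder: use your loaded model's id
    messages=[{"role": "user", "content": "Summarize: local models are handy."}],
)
print(resp.choices[0].message.content)
```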
Nice. I find Cinebench is great for this purpose (although you’re right, no bragging rights there) – it does save you a good deal of bandwidth.
…Unrelated to the bragging rights, which I respect –
I just want to be sure that other devs aren’t discouraged. It’s extremely easy to install, and I think these models can be a great help/complement to OpenAI API access.
Update: Llama 3.1 405B is an open-source model on par with the latest versions of GPT-4, but can be freely downloaded and fine-tuned.
Also surprising were the API prices: approximately the same cost as GPT-4o for input and output on Azure, and even lower prices on other cloud services.