Aggregated LLM Leaderboard - feedback is very welcome!

Hey fellow Community members! :wave: :wave:

I made an aggregated AI leaderboard matrix to provide a comprehensive comparison of some of the most popular AI models. I normalized their scores across multiple leaderboards and added an average to create a unified metric, giving a picture of how these models stack up against each other across different leaderboards.

One small finding is that there are strong positive correlations between several metrics, indicating that models that perform well on one benchmark often perform well on others.

I'm super curious about any suggestions and feedback from fellow community members. For example, should I use a weighted average, since some benchmarks / leaderboards have more authority and adoption than others? The idea is to get it to a point where it makes sense and provides some value, then add it to our website with a weekly update. Depending on where it goes...
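
For anyone curious, here is roughly the kind of normalization and averaging I mean (a minimal sketch in pandas with made-up scores and placeholder weights, not the real leaderboard data):

```python
import pandas as pd

# Toy scores (placeholder values, not real leaderboard numbers)
scores = pd.DataFrame(
    {
        "lmsys": [1250, 1180, 1100],
        "helm": [0.87, 0.81, 0.74],
        "trustbit": [78, 70, 65],
    },
    index=["model-a", "model-b", "model-c"],
)

# Min-max normalize each leaderboard to [0, 1] so the scales are comparable
normalized = (scores - scores.min()) / (scores.max() - scores.min())

# Plain average across leaderboards -> the unified metric
unified = normalized.mean(axis=1)

# Optional: weighted average, if some leaderboards deserve more authority
weights = pd.Series({"lmsys": 0.5, "helm": 0.3, "trustbit": 0.2})  # placeholder weights
weighted = (normalized * weights).sum(axis=1) / weights.sum()

print(unified.sort_values(ascending=False))
print(weighted.sort_values(ascending=False))
```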

11 Likes

Cool stuff!

It might be interesting to compute the correlation between benchmarks. If they're fairly spread out, they probably shouldn't factor into the same average. I suspect that HELM and Trustbit are much closer to each other than either is to LMSYS.

You can sort of treat the correlation as a cosine similarity and potentially use it as a weight factor for different category averages :thinking:
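
Something along these lines, maybe (just a rough sketch; the inverse-mean-correlation weighting here is only one possible scheme, and the scores are placeholders):

```python
import pandas as pd

# Toy normalized scores (models x benchmarks), placeholder values only
normalized = pd.DataFrame(
    {
        "lmsys":    [1.00, 0.55, 0.10],
        "helm":     [0.90, 0.70, 0.00],
        "trustbit": [1.00, 0.60, 0.05],
        "mt-bench": [0.80, 0.95, 0.20],
    },
    index=["model-a", "model-b", "model-c"],
)

corr = normalized.corr()  # Pearson correlation between benchmarks

# One possible scheme: down-weight benchmarks that are highly correlated with
# the rest, since they add less independent signal to the average.
mean_corr_with_others = (corr.sum() - 1) / (len(corr) - 1)
weights = (1 - mean_corr_with_others).clip(lower=0)
weights = weights / weights.sum()

weighted_average = (normalized * weights).sum(axis=1)
print(weights.round(2))
print(weighted_average.sort_values(ascending=False))
```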

@elmstedt is working on a PhD in statistics or something, maybe he has some input here

3 Likes

Hey @Diet !

Thanks a lot for your feedback! Means a lot to me! And sure, would be great to have some thoughts from @elmstedt.

In the meantime, I've put together this kind of simple correlation matrix. Did you mean something like that? I can see strong positive correlations between several metrics, meaning that models that perform well on one benchmark often perform well on others. The exceptions are "HuggingFace" and "MT-bench", which are less correlated.
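
For reference, the matrix is essentially just pandas' .corr() on the normalized score table, plus a quick heatmap to eyeball it (the numbers below are placeholders, not the actual scores):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Normalized scores (models x leaderboards); values are placeholders, not real data
scores = pd.DataFrame(
    {
        "helm":        [0.95, 0.70, 0.40, 0.10],
        "lmsys":       [1.00, 0.60, 0.50, 0.05],
        "trustbit":    [0.90, 0.75, 0.45, 0.00],
        "huggingface": [0.60, 0.90, 0.30, 0.20],
        "mt-bench":    [0.70, 0.85, 0.25, 0.15],
    },
    index=["model-a", "model-b", "model-c", "model-d"],
)

corr = scores.corr(method="pearson")
print(corr.round(2))

# Quick heatmap to see which leaderboards move together
plt.imshow(corr, vmin=-1, vmax=1, cmap="coolwarm")
plt.xticks(range(len(corr)), corr.columns, rotation=45, ha="right")
plt.yticks(range(len(corr)), corr.columns)
plt.colorbar(label="Pearson r")
plt.tight_layout()
plt.show()
```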

3 Likes

Yeah, looks great!

Hmm.

helm-lmsys is only 0.43, while trustbit-lmsys is surprisingly high (0.85).

I think huggingface and mtbench can be disregarded as there's so little data.

LMSYS arena-hard-auto might be a stand-in for mt-bench :thinking:

It looks like trustbit might just be an average of other benchmarks :laughing:

Thanks for your work!

2 Likes

Great idea! I think model selection is a strategic decision, so it's going to be a highly relevant topic for a while!

Yep, it could be the case with Trustbit. And totally agree on HF and mtbench. Thanks for your feedback!

1 Like

Guys, can you make a ranking of LLMs (a leaderboard) in terms of:

  • DEI and ESG adherence level (social justice and equity 'alertness' to 'social problems')
  • Wokeness level (the quality of being alert to and concerned about social injustice and discrimination)
  • Refusal to answer totally safe questions
  • Strictest and the "most" safe LLM (the longest TOS / "guardrails")
  • Political leaning, or the degree of political bias (left/center/right)
  • Ideological adherence (e.g. gender ideology, especially the mainstream one; LGBTQ+ adherence/advocacy)
  • Openness to criticism (able to receive criticism)
  • Instruction steering (able to conform to the user's wishes)

1 Like

Given that correlation coefficients may fluctuate over time, my personal view at this point is that it would be best to consider all benchmarks and make a judgment based on an aggregated score.

Just my two cents.

2 Likes

Thanks! You're right: as models change, correlations may change too. Maybe it would be interesting to keep checking the correlations, say once per month, and see how they evolve.
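
Something like this could work for the monthly check (a rough sketch; the snapshot folder and file naming are made up):

```python
import pandas as pd
from pathlib import Path
from datetime import date

HISTORY_DIR = Path("corr_history")  # hypothetical folder for monthly snapshots
HISTORY_DIR.mkdir(exist_ok=True)

def snapshot_correlations(scores: pd.DataFrame) -> None:
    """Save this month's benchmark correlation matrix and report drift vs. the last snapshot."""
    corr = scores.corr()
    # YYYY-MM file names sort chronologically, so the newest file is always last
    corr.to_csv(HISTORY_DIR / f"{date.today():%Y-%m}.csv")

    previous = sorted(HISTORY_DIR.glob("*.csv"))[:-1]  # drop the file we just wrote
    if previous:
        last = pd.read_csv(previous[-1], index_col=0)
        drift = (corr - last).abs().max().max()
        print(f"Largest change in any pairwise correlation since last snapshot: {drift:.2f}")
```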

Wow, these are such cool suggestions! Thanks! The only problem I see is that each of these points could easily become quite a substantial benchmark in its own right, which means it would take several months to develop. I wish I could get some kind of grant for it; I'd be happy to work on it. But your suggestions made me think about whether there's a way to make something similar to LMSYS, but structured around evaluating LLMs on the points you highlighted.

1 Like