Aggregated LLM Leaderboard - feedback is very welcome!

Hey fellow Community members! :wave: :wave:

I made an aggregated AI leaderboard matrix to provide a comprehensive comparison of some of the most popular AI models. I normalized their scores across multiple leaderboards and added an average to create a unified metric, giving a picture of how these models stack up against each other across different leaderboards.

One small finding is that there are strong positive correlations between several metrics, indicating that models that perform well on one benchmark often perform well on others.

I'm super curious about any suggestions and feedback from fellow community members. For example, should I use a weighted average, since some benchmarks / leaderboards have more authority and adoption than others? The idea is to get it to a point where it makes sense and provides some value, then add it to our website with a weekly update. Depending on where it goes...
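
For anyone curious, here is roughly the kind of normalization and averaging I mean (a minimal sketch in pandas with made-up scores and placeholder weights, not the real leaderboard data):

```python
import pandas as pd

# Toy scores (placeholder values, not real leaderboard numbers)
scores = pd.DataFrame(
    {
        "lmsys": [1250, 1180, 1100],
        "helm": [0.87, 0.81, 0.74],
        "trustbit": [78, 70, 65],
    },
    index=["model-a", "model-b", "model-c"],
)

# Min-max normalize each leaderboard to [0, 1] so the scales are comparable
normalized = (scores - scores.min()) / (scores.max() - scores.min())

# Plain average across leaderboards -> the unified metric
unified = normalized.mean(axis=1)

# Optional: weighted average, if some leaderboards deserve more authority
weights = pd.Series({"lmsys": 0.5, "helm": 0.3, "trustbit": 0.2})  # placeholder weights
weighted = (normalized * weights).sum(axis=1) / weights.sum()

print(unified.sort_values(ascending=False))
print(weighted.sort_values(ascending=False))
```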

11 Likes

Cool stuff!

It might be interesting to compute the correlation between benchmarks. If they're fairly spread out, they probably shouldn't factor into the same average. I suspect that HELM and Trustbit are much closer to each other than either is to LMSYS.

You can sort of treat the correlation as a cosine similarity and potentially use it as a weight factor for different category averages :thinking:
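
Something along these lines, maybe (just a rough sketch; the inverse-mean-correlation weighting here is only one possible scheme, and the scores are placeholders):

```python
import pandas as pd

# Toy normalized scores (models x benchmarks), placeholder values only
normalized = pd.DataFrame(
    {
        "lmsys":    [1.00, 0.55, 0.10],
        "helm":     [0.90, 0.70, 0.00],
        "trustbit": [1.00, 0.60, 0.05],
        "mt-bench": [0.80, 0.95, 0.20],
    },
    index=["model-a", "model-b", "model-c"],
)

corr = normalized.corr()  # Pearson correlation between benchmarks

# One possible scheme: down-weight benchmarks that are highly correlated with
# the rest, since they add less independent signal to the average.
mean_corr_with_others = (corr.sum() - 1) / (len(corr) - 1)
weights = (1 - mean_corr_with_others).clip(lower=0)
weights = weights / weights.sum()

weighted_average = (normalized * weights).sum(axis=1)
print(weights.round(2))
print(weighted_average.sort_values(ascending=False))
```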

@elmstedt is working on a PhD in statistics or something, maybe he has some input here

3 Likes

Hey @Diet !

Thanks a lot for your feedback! Means a lot to me! And sure, would be great to have some thoughts from @elmstedt.

In the meantime, I've put together this kind of simple correlation matrix. Did you mean something like that? I can see strong positive correlations between several metrics, meaning that models that perform well on one benchmark often perform well on others. The exceptions are "HuggingFace" and "MT-bench", which are less correlated.
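
For reference, the matrix is essentially just pandas' .corr() on the normalized score table, plus a quick heatmap to eyeball it (the numbers below are placeholders, not the actual scores):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Normalized scores (models x leaderboards); values are placeholders, not real data
scores = pd.DataFrame(
    {
        "helm":        [0.95, 0.70, 0.40, 0.10],
        "lmsys":       [1.00, 0.60, 0.50, 0.05],
        "trustbit":    [0.90, 0.75, 0.45, 0.00],
        "huggingface": [0.60, 0.90, 0.30, 0.20],
        "mt-bench":    [0.70, 0.85, 0.25, 0.15],
    },
    index=["model-a", "model-b", "model-c", "model-d"],
)

corr = scores.corr(method="pearson")
print(corr.round(2))

# Quick heatmap to see which leaderboards move together
plt.imshow(corr, vmin=-1, vmax=1, cmap="coolwarm")
plt.xticks(range(len(corr)), corr.columns, rotation=45, ha="right")
plt.yticks(range(len(corr)), corr.columns)
plt.colorbar(label="Pearson r")
plt.tight_layout()
plt.show()
```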

3 Likes

Yeah, looks great!

Hmm.

helm-lmsys is only 0.43, while trustbit-lmsys is surprisingly high (0.85).

I think huggingface and mtbench can be disregarded as there's so little data.

LMSYS arena-hard-auto might be a stand-in for mt-bench :thinking:

It looks like trustbit might just be an average of other benchmarks :laughing:

Thanks for your work!

2 Likes

Great idea! I think model selection is a strategic decision, so it's going to be a highly relevant topic for a while!

Yep, it could be the case with Trustbit. And totally agree on HF and mtbench. Thanks for your feedback!

1 Like

Guys, can you make a ranking of LLMs (a leaderboard) in terms of:

  • DEI and ESG adherence level (social justice and equity 'alertness' to 'social problems')
  • Wokeness level (the quality of being alert to and concerned about social injustice and discrimination)
  • Refusal to answer totally safe questions
  • Strictest and the "most" safe LLM (the longest TOS / "guardrails")
  • Political leaning, or the degree of political bias (left/center/right)
  • Ideological adherence (e.g. gender ideology, especially the mainstream one; LGBTQ+ adherence/advocacy)
  • Openness to criticism (able to receive criticism)
  • Instruction steering (able to conform to the user's wishes)

1 Like

Given that correlation coefficients may fluctuate over time, my personal view at this point is that it would be best to consider all benchmarks and make a judgment based on an aggregated score.

Just my two cents.

2 Likes

Thanks! You're right: as models change, correlations may change too. Maybe it would be interesting to keep checking the correlations, say once per month, and see how they evolve.
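
Something like this could work for the monthly check (a rough sketch; the snapshot folder and file naming are made up):

```python
import pandas as pd
from pathlib import Path
from datetime import date

HISTORY_DIR = Path("corr_history")  # hypothetical folder for monthly snapshots
HISTORY_DIR.mkdir(exist_ok=True)

def snapshot_correlations(scores: pd.DataFrame) -> None:
    """Save this month's benchmark correlation matrix and report drift vs. the last snapshot."""
    corr = scores.corr()
    # YYYY-MM file names sort chronologically, so the newest file is always last
    corr.to_csv(HISTORY_DIR / f"{date.today():%Y-%m}.csv")

    previous = sorted(HISTORY_DIR.glob("*.csv"))[:-1]  # drop the file we just wrote
    if previous:
        last = pd.read_csv(previous[-1], index_col=0)
        drift = (corr - last).abs().max().max()
        print(f"Largest change in any pairwise correlation since last snapshot: {drift:.2f}")
```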

Wow, these are such cool suggestions! Thanks! The only problem I see is that each of these points could easily become quite a substantial benchmark in its own right, which means it would take several months to develop. I wish I could get some kind of grant for it; I'd be happy to work on it. But your suggestions made me think about whether there's a way to make something similar to LMSYS, but structured around evaluating LLMs on the points you highlighted.

1 Like