I made an aggregated AI leaderboard matrix to get a comprehensive comparison of some of the most popular AI models. I normalized their performance across multiple leaderboards and added an average to create a unified metric that gives a picture of how these models stack up against each other across different leaderboards.
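For context, a minimal sketch of the kind of normalization and averaging I mean (the leaderboard names and scores below are placeholders, not the real data):

```python
import pandas as pd

# Placeholder scores on a few leaderboards (deliberately on different scales).
scores = pd.DataFrame(
    {
        "lmsys": [1210, 1180, 1150],
        "helm": [0.92, 0.88, 0.85],
        "mt_bench": [9.1, 8.7, 8.4],
    },
    index=["model_a", "model_b", "model_c"],
)

# Min-max normalize each leaderboard to [0, 1] so the scales are comparable,
# then take the row-wise mean as the unified metric.
normalized = (scores - scores.min()) / (scores.max() - scores.min())
normalized["average"] = normalized.mean(axis=1)
print(normalized.sort_values("average", ascending=False))
```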
One small finding is that there are strong positive correlations between several metrics, indicating that models that perform well on one benchmark often perform well on others.
I'm super curious about any suggestions and feedback from fellow community members. For example, should I use a weighted average, since some benchmarks/leaderboards have more authority and adoption than others? The idea is to get it to a point where it makes sense and provides some value, then add it to our website with weekly updates. Depending on where it goes…
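A weighted version could look roughly like this; the weights here are made up purely to illustrate giving more authority to some leaderboards:

```python
import pandas as pd

# Placeholder normalized scores in [0, 1] for three models.
normalized = pd.DataFrame(
    {"lmsys": [1.0, 0.6, 0.0], "helm": [1.0, 0.4, 0.0], "mt_bench": [0.7, 1.0, 0.0]},
    index=["model_a", "model_b", "model_c"],
)

# Hypothetical weights reflecting how much authority/adoption each leaderboard has.
weights = pd.Series({"lmsys": 0.5, "helm": 0.3, "mt_bench": 0.2})

# Weighted average: multiply each column by its weight, sum across columns,
# and divide by the total weight.
weighted_avg = normalized.mul(weights, axis=1).sum(axis=1) / weights.sum()
print(weighted_avg.sort_values(ascending=False))
```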
It might be interesting to compute the correlation between benchmarks. If they're fairly spread out, they probably shouldn't factor into the same average. I suspect that HELM and Trustbit are much closer to each other than either is to LMSYS.
You can sort of treat the correlation as a cosine similarity and potentially use it as a weight factor for different category averages.
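Roughly something like this, as one possible reading of the idea (benchmark names and numbers are placeholders, and the exact weighting scheme is just one option):

```python
import pandas as pd

# Placeholder normalized scores; the real benchmark data would go here.
scores = pd.DataFrame(
    {
        "lmsys": [0.95, 0.70, 0.40, 0.10],
        "helm": [0.90, 0.75, 0.35, 0.15],
        "trustbit": [0.88, 0.72, 0.30, 0.20],
        "mt_bench": [0.60, 0.85, 0.50, 0.05],
    },
    index=["model_a", "model_b", "model_c", "model_d"],
)

# Pairwise Pearson correlation between benchmarks.
corr = scores.corr()

# One reading of the suggestion: weight each benchmark by its average
# correlation with the others (how much it "agrees" with the consensus).
# The opposite choice, down-weighting redundant benchmarks, is also defensible.
mean_corr = (corr.sum() - 1.0) / (len(corr) - 1)
weights = mean_corr / mean_corr.sum()
aggregate = scores.mul(weights, axis=1).sum(axis=1)
print(aggregate.sort_values(ascending=False))
```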
@elmstedt is working on a PhD in statistics or something, maybe he has some input here.
Thanks a lot for your feedback! Means a lot to me! And sure, would be great to have some thoughts from @elmstedt.
In the meantime, I've put together this kind of simple correlation matrix. Did you mean something like that? I can see strong positive correlations between several metrics, meaning that models that perform well on one benchmark often perform well on others, except "HuggingFace" and "MT-bench", which are less correlated.
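In case it helps, the computation was roughly along these lines (the table below is a placeholder, not my actual data); listing the pairs sorted by correlation is a quick way to spot the outliers:

```python
import numpy as np
import pandas as pd

# Placeholder normalized scores per benchmark (the real table has more models).
scores = pd.DataFrame(
    {
        "lmsys": [0.95, 0.70, 0.40, 0.10],
        "helm": [0.90, 0.75, 0.35, 0.15],
        "huggingface": [0.50, 0.90, 0.20, 0.60],
        "mt_bench": [0.60, 0.85, 0.50, 0.05],
    },
    index=["model_a", "model_b", "model_c", "model_d"],
)

# Pairwise Pearson correlations between benchmarks.
corr = scores.corr()
print(corr.round(2))

# Flatten the upper triangle into (benchmark, benchmark) -> correlation pairs,
# sorted ascending so the least correlated pairs show up first.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
print(upper.stack().sort_values())
```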
Given that correlation coefficients may fluctuate over time, my personal view at this point is that it would be best to consider all benchmarks and make a judgment based on an aggregated score.
Thanks! You're right: as models change, correlations may change too. Maybe it would be interesting to keep checking the correlations, say once per month, and see how they evolve.
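Probably something as simple as appending a dated snapshot of the correlation matrix to a log each month and comparing later (the file name and layout here are just an assumption):

```python
import os
from datetime import date

import pandas as pd


def snapshot_correlations(scores: pd.DataFrame, path: str = "corr_history.csv") -> None:
    """Append this month's benchmark correlation matrix to a running CSV log."""
    corr = scores.corr().stack().rename("correlation").reset_index()
    corr.columns = ["benchmark_a", "benchmark_b", "correlation"]
    corr["month"] = date.today().strftime("%Y-%m")
    # Write the header only on the first run, then append new monthly rows.
    corr.to_csv(path, mode="a", header=not os.path.exists(path), index=False)
```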
Wow, these are such cool suggestions! Thanks! The only problem I see is that each of these points could easily become quite a substantial benchmark on its own, which means it would take several months to develop. I wish I could get some kind of grant for it; I'd be happy to work on it. But your suggestions made me think about whether there's a way to make something similar to LMSYS but structured around evaluating LLMs on those points you highlighted.