Hi everyone!
Recently I deployed the GPT-OSS-120b model locally, but found that its score on LiveCodeBench is really low (about 60 on v6). I also found that the reasoning: medium setting scores better than reasoning: high, which is weird. (The official scores for this benchmark have not been released yet.)
So next I checked the results on ArtificialAnalysis (plus the results on Kaggle), which show 87.8 on the high setting and 70.1 on the low setting. I tried to reproduce this with the LiveCodeBench prompt from ArtificialAnalysis and got 69 on medium, 61 on high, and 60 on low (315 questions from LiveCodeBench v5, pass@1 over 3 rollouts, fully aligned with the ArtificialAnalysis settings).
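For clarity, the pass@1 number here is the standard unbiased pass@k estimator averaged over rollouts per question. A minimal sketch (function name is mine, not from any of these harnesses):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    where n = total rollouts and c = correct rollouts."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 3 rollouts per question, pass@1 reduces to the fraction of
# correct rollouts: e.g. 1 correct out of 3 gives ~0.333.
print(pass_at_k(3, 1, 1))
```

The benchmark score is then the mean of this value over all 315 questions, scaled to 0-100.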
Can anyone explain? Temperature is 0.6, top-p is 1.0, top-k is 40, max_model_len is 128k (using the vllm-0.11.0 official Docker image).
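In case it matters, this is roughly how I serve and query the model; the image tag, model id, and port are my setup and may differ from yours (vLLM's OpenAI-compatible endpoint accepts top_k as an extra JSON field):

```shell
# Serve with the official vLLM image; 131072 = 128k context
docker run --gpus all -p 8000:8000 \
  vllm/vllm-openai:v0.11.0 \
  --model openai/gpt-oss-120b \
  --max-model-len 131072

# Sampling settings are sent per request
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-120b",
    "messages": [{"role": "user", "content": "..."}],
    "temperature": 0.6,
    "top_p": 1.0,
    "top_k": 40
  }'
```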
I’ve seen many reviews saying this model’s coding ability isn’t very strong and it has severe hallucinations. Is this related?
In addition, someone recommended the Unsloth settings to me (temperature 1.0, top-p 1.0, top-k 0). I tested those settings: the mean output length is 18458, as expected for reasoning: high, but the score is still 62, only slightly above the 61 mentioned before (within 3 points) and far behind the 87.8 mentioned above.
Can anyone here help?