They classed the model wrapping its code in markdown ``` fences as a failure.
I’m sorry, but that is not a valid reason to claim the code would “not compile”. The model has been trained to produce markdown; the fact that they copy-pasted the output without stripping the markdown does not invalidate the model.
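For what it's worth, stripping the fences before compiling is only a few lines. This is a rough sketch of how I would do it, not what the paper's authors ran; the function name `strip_markdown_fences` and the sample reply are made up for illustration, and it only handles the first fenced block in a reply.

```python
import re

# Matches a fenced block: opening ``` with an optional language tag,
# the body, then the closing ```.
FENCE_RE = re.compile(r"```[^\n`]*\n(.*?)```", re.DOTALL)


def strip_markdown_fences(text: str) -> str:
    """Return the body of the first fenced code block.

    If the text contains no fence, it is returned unchanged (trimmed).
    Hypothetical helper, not taken from the paper.
    """
    match = FENCE_RE.search(text)
    if match is None:
        return text.strip()
    return match.group(1).strip()


# Made-up model reply, with prose around the fenced code.
reply = "Here is the code:\n```python\nprint('hello')\n```\nHope that helps."
code = strip_markdown_fences(reply)

# compile() raises SyntaxError if the extracted code is not valid Python.
compile(code, "<model-output>", "exec")
```

With a step like this in the harness, the compile check measures the code itself rather than the formatting around it.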
LLMs have never been good at prime numbers, or numbers in general; they are not large math models. It also seems that they only ran a single example for each test, at a temperature of 0.1, which is not deterministic. That will lead to errors, and there are lots of examples of this throughout the paper.
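If sampling is not deterministic, the fairer approach is to run each test several times and report a pass rate instead of a single pass or fail. A minimal sketch, assuming a `generate` callable that wraps whatever model call is being tested and a `check` callable that scores one output; both names, and the stand-in model below, are placeholders of mine and not from the paper.

```python
import itertools
from typing import Callable


def pass_rate(
    generate: Callable[[], str],
    check: Callable[[str], bool],
    n_samples: int = 20,
) -> float:
    """Sample the model n_samples times and return the fraction that pass.

    `generate` and `check` are hypothetical hooks: one draws a fresh
    completion, the other decides whether that completion is correct.
    """
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    passes = sum(1 for _ in range(n_samples) if check(generate()))
    return passes / n_samples


# Stand-in for a model that alternates between a right and a wrong answer.
answers = itertools.cycle(["7", "8"])
rate = pass_rate(lambda: next(answers), lambda out: out == "7", n_samples=10)
print(rate)
```

A single draw from this stand-in would report either a clean pass or a clean fail depending on luck; the repeated run shows it is right only part of the time.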