Advice Needed on Advanced Coding Evaluation System for School Project

Hi all,

I’m working on a school project focused on creating an advanced coding evaluation system that goes beyond simple output matching. Our goal is to assess logic, efficiency, and problem-solving ability in a more nuanced way. I’ve been reading IEEE papers and attended an HPE workshop on LLMs, but I haven’t decided yet whether to focus on prompt engineering or on fine-tuning with a custom dataset. We’re planning to use the o1 model. It’s just me and a friend, and we have six months to deliver. I believe we can do a great job, but I’m looking for advice from the community on the best approach.

Here’s what we’re planning to implement:

Objective:

•	A coding evaluation system that evaluates not just outputs but also the candidate’s logic, efficiency, and problem-solving approach.

Key Features:

•	Nuanced Grading:
	•	Code Logic and Structure: Assess the logical flow of the code, even with minor syntax errors (e.g., missing semicolons).
	•	Error Tolerance: Focus on the candidate’s intent rather than penalizing small mistakes.
	•	Efficiency: Measure time and space complexity to see how optimized the solution is (see the timing sketch after this list).
	•	Problem-Solving Approach: Understand the thought process and award partial credit for good logic, even if the code doesn’t fully run.
•	Scoring System (a rough grading sketch follows this list):
	•	Understanding and Approach (40% of the score): How well the candidate understood the problem and applied an effective method.
	•	Efficiency (30%): How optimized the code is.
	•	Correctness (30%): How close the solution is to the expected output.
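
To make the scoring system concrete, here’s a minimal sketch of how we might ask a model for rubric subscores and combine them with the weights above. It assumes the official OpenAI Python SDK; the model name, prompt wording, and JSON schema are placeholders, not a final design.

```python
import json
from openai import OpenAI  # assumes the official OpenAI Python SDK (v1+)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative rubric prompt; the wording and schema are placeholders.
RUBRIC_PROMPT = """You are a code evaluator. Grade the submission below.
Return only a JSON object with integer scores from 0 to 100 for:
  "understanding": how well the problem was understood and approached
  "efficiency": how optimized the solution is (time/space complexity)
  "correctness": how close the output is to the expected output
Tolerate minor syntax errors (e.g., missing semicolons); grade intent.

Problem:
{problem}

Submission:
{code}
"""

# Weights from the planned scoring system above.
WEIGHTS = {"understanding": 0.40, "efficiency": 0.30, "correctness": 0.30}

def grade_submission(problem: str, code: str, model: str = "o1-mini") -> dict:
    """Ask the model for rubric subscores, then compute the weighted total."""
    response = client.chat.completions.create(
        model=model,  # placeholder; swap in whichever model we settle on
        messages=[{"role": "user",
                   "content": RUBRIC_PROMPT.format(problem=problem, code=code)}],
    )
    # Assumes the model returns bare JSON; a real system should validate this.
    scores = json.loads(response.choices[0].message.content)
    total = sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)
    return {"subscores": scores, "total": round(total, 1)}

if __name__ == "__main__":
    print(grade_submission(
        problem="Return the sum of a list of integers.",
        code="def total(xs):\n    return sum(xs)",
    ))
```

Judge-style scores can be noisy, so in practice we’d validate the returned JSON and probably average over several runs.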
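
For the efficiency criterion specifically, the model’s judgement could be cross-checked with a rough empirical timing. This is only a sketch: it runs a submission in a subprocess with a timeout and records wall-clock time; the file name, timeout, and input are made up for illustration, and a real harness would sandbox the process and test several input sizes.

```python
import subprocess
import sys
import time

def measure_runtime(script_path: str, stdin_data: str = "",
                    timeout_s: float = 5.0) -> dict:
    """Run a candidate script in a subprocess and time it.

    Measures wall-clock time for a single input only; this is a rough
    proxy for efficiency, not a complexity analysis.
    """
    start = time.perf_counter()
    try:
        proc = subprocess.run(
            [sys.executable, script_path],
            input=stdin_data,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
        elapsed = time.perf_counter() - start
        return {"ok": proc.returncode == 0,
                "seconds": round(elapsed, 4),
                "stdout": proc.stdout}
    except subprocess.TimeoutExpired:
        # Treat a timeout as a failed, maximally slow run.
        return {"ok": False, "seconds": timeout_s, "stdout": ""}

if __name__ == "__main__":
    # "candidate_solution.py" is a hypothetical submission file.
    print(measure_runtime("candidate_solution.py", stdin_data="1 2 3\n"))
```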

I’d appreciate any tips, advice, or tricks for building something like this within our timeline. Based on your experience, what would the best approach be?

Thanks in advance!


Hello and welcome to the forum.

That sounds like a fun project with a realistic timeline.

  • o1-mini is already supposedly very good at coding. How is your system going to add to that?

  • Conversely, “Code Logic and Structure” and “Understanding and Approach” require a high-level understanding of the code that is better suited to o1, but they also require a value judgement. How will you (a) feed “all of the code” to the model so it can reason about the whole submission, and (b) decide how much data to give it to make that judgement, and based on whose opinion? (A small context-window check is sketched after these bullets.)

  • Here’s some other work from OpenAI on the matter: https://openai.com/index/finding-gpt4s-mistakes-with-gpt-4/
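
On the “feed all of the code” question: a practical first step is simply checking whether the prompt plus submission fits the model’s context window before sending it. Here’s a minimal sketch using tiktoken; the o200k_base encoding and the 128k limit are assumptions to verify against whichever model is actually used.

```python
import tiktoken

def fits_in_context(prompt: str, code: str, limit: int = 128_000) -> bool:
    """Rough check that prompt + code fit in the model's context window.

    The encoding name and token limit are assumptions; confirm both
    for the specific model before relying on this.
    """
    enc = tiktoken.get_encoding("o200k_base")
    n_tokens = len(enc.encode(prompt)) + len(enc.encode(code))
    return n_tokens <= limit
```

Submissions that don’t fit would need chunking or summarization, which brings its own evaluation questions.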