Scoring creative writing with consistency: possible?

I’ve been attempting to use ChatGPT to score creative writing based on various criteria (for example, score the dialogue in the story, or score how well it’s written). First, is ChatGPT any good at rating creative writing? If so, how do you get accurate and consistent results? Even if I set the temperature low, I nearly always get different scores back, and the scores tend to be very favorable, often returning 9s and 10s when I ask it to score 1-10, even for bad writing.

Probably better than a human!

Now that’s much tougher. That’s where prompt “engineering” comes in!

I don’t think a low temperature will help you here. I also don’t think asking the model to subjectively hand out scores from 1 to 10 without any scoring criteria will yield good performance.

Typically you take a multi-step approach:

  1. define your measuring criteria. They shouldn’t really be subjective, and they should be as reproducible as possible. If you give these criteria to 10 people, they should ideally come up with the same score.

  2. ask the model to identify components of the input that could have an impact on the score. Make it list the merits and demerits.

  3. make it reflect on its output and possibly refine its criteria.

  4. tally it all up and generate your score

  5. (optional) if you have enough cash, consider a consensus approach! have different prompts evaluate the same paper and combine the results
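The consensus idea in step 5 could look roughly like the sketch below. It is only an illustration: the `Scorer` type, the `consensus` function and the stub scorers are made-up names, and in practice each scorer would wrap a differently worded prompt sent to whichever model you use.

```python
from statistics import median
from typing import Callable, List

# A scorer takes the story text and returns an integer score.
# Assumption: each real scorer would wrap one prompt variant and a model call;
# stubs are used here so the sketch runs without any API.
Scorer = Callable[[str], int]


def consensus(story: str, scorers: List[Scorer]) -> dict:
    """Run every scorer on the same story and combine the results."""
    scores = [scorer(story) for scorer in scorers]
    return {
        "scores": scores,
        "median": median(scores),             # robust to a single outlier
        "spread": max(scores) - min(scores),  # large spread = low agreement
    }


# Stub scorers standing in for three different prompts (hypothetical values).
stub_scorers = [lambda s: 6, lambda s: 7, lambda s: 9]
result = consensus("Once upon a time...", stub_scorers)
print(result)
```

Reporting the spread next to the median gives you a cheap signal for when the prompts disagree and a human should take a look.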

Hope this helps! It’s a new age for fairness, transparency and individualization in education, and I’m super glad you’re taking the initiative in this transformation!

  • In my personal opinion, I doubt that ChatGPT can evaluate your answers based on a scale. ChatGPT’s strength is generating answers that are customized for your content.

  • One idea I just thought of: you can instruct it to apply rating scales that you created yourself, and then it will follow your criteria.

  • Or you can ask it to search your topics on Bing, and it will generate the most relevant and popular hashtags for you, and then improve your scores.

Hope this idea helps you in some situations. :smiling_face:


Rather than using GPT to score based on its own criteria, I’d recommend coming up with your own and letting GPT enforce them.

Don’t ask for a single overall score; instead, divide the points among the respective criteria.

E.g., 3 points for quality of reference, 2 for coherence, etc.
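As a rough sketch of the split-points idea, you could ask the model to reply with one `criterion: points` line per criterion and tally the reply yourself. The rubric below reuses the two examples above plus a made-up third criterion; the `tally` function and the reply format are assumptions for illustration, not something the model does by default.

```python
import re
from typing import Dict

# Hypothetical rubric: criterion -> maximum points.
RUBRIC: Dict[str, int] = {"reference quality": 3, "coherence": 2, "dialogue": 5}


def tally(reply: str, rubric: Dict[str, int]) -> Dict[str, int]:
    """Parse lines like 'coherence: 2' and clamp each score to its maximum."""
    scores: Dict[str, int] = {}
    for name, max_points in rubric.items():
        pattern = rf"^{re.escape(name)}\s*:\s*(\d+)"
        match = re.search(pattern, reply, re.IGNORECASE | re.MULTILINE)
        if match is None:
            raise ValueError(f"no score found for {name!r}")
        # Clamp in case the model hands out more points than allowed.
        scores[name] = min(int(match.group(1)), max_points)
    scores["total"] = sum(scores.values())
    return scores


# Example reply text, written by hand to show the clamping.
reply = "Reference quality: 3\nCoherence: 4\nDialogue: 2"
print(tally(reply, RUBRIC))
```

Doing the arithmetic in code rather than in the prompt means the total is always consistent with the per-criterion scores.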


In addition to the previous excellent suggestions, give the model examples of what counts as a high and a low score for each category. I would also try to provide a medium example, so that low, medium and high are all covered. At least, that’s what I would think about when having to grade creativity.

Also, you can assign the role of a very strict teacher.

If you follow this approach, certain types of creativity will be overvalued and others undervalued.
So, creating a measure is a difficult task, and you will need to help the model perform to your requirements.
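One way to combine the scored examples and the strict-teacher role is to assemble them into a single prompt. This is a minimal sketch: `build_prompt`, the 1 to 5 scale and the anchor descriptions are all hypothetical, and you would replace them with examples from your own material.

```python
from typing import Dict


def build_prompt(criterion: str, anchors: Dict[int, str], story: str) -> str:
    """Assemble a grading prompt with a strict persona and scored examples."""
    lines = [
        "You are a very strict teacher grading creative writing.",
        f"Score the passage on '{criterion}' from 1 to 5.",
        "Calibrate against these examples:",
    ]
    # List the anchors from lowest to highest score.
    for score in sorted(anchors):
        lines.append(f"Score {score}: {anchors[score]}")
    lines += ["Passage:", story, "Reply with the score only."]
    return "\n".join(lines)


# Hypothetical anchor descriptions for low, medium and high scores.
anchors = {
    5: "Every line reveals character and moves the scene forward.",
    1: "Characters state facts at each other; all voices sound identical.",
    3: "Dialogue is clear but the speakers are hard to tell apart.",
}
prompt = build_prompt("dialogue", anchors, '"Hi," she said. "Hi," he said.')
print(prompt)
```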


Something I’ve been trying to figure out as well. Although I’m sceptical of using the same LLM that generated content to rate it.

I think it depends on the goal of the writing. By creative writing, do you mean fiction? Or what? You probably want to be thinking of your audience, and what you consider good.

I have found even local LLMs good at extracting humorous lines from a body of text, so if you define your criteria clearly, it should be somewhat consistent. Part of the problem is that it sometimes likes to give a reason why (which may or may not be good, but it’s verbose, and more writing to read!)

Or get it to act as a relevant author or a critic; that might help you come up with ideas.

Another option: give it a body of text and ask it for ways it could score it. That might help you think of what to look for, and you can then work backwards from there, devising prompts.

I have managed to get it to score things out of 5, although I have too often noticed that it rates things too highly. By default, ChatGPT is a “friendly assistant”. If you want it to give a strong opinion, you need to turn it into a bit of a monster. Even then, it still tries to be reasonable, but if you can get ChatGPT to act like an absolute a**hole, it often gives better results, although not always what you want to hear :joy:

The scoring of creative writing is always a combination of AI-driven analysis and human assessment.

ChatGPT, for example, can analyze grammar and coherence and even offer modifications, whereas human evaluators bring contextual awareness, appreciation for originality, and nuanced critique.

This combination ensures a thorough and unbiased review.


Absolutely, the synergy between AI-driven analysis and human assessment in scoring creative work like writing is paramount. AI tools like ChatGPT excel in technical analysis, providing objective insights on grammar and coherence. However, the human touch is irreplaceable for appreciating originality and providing nuanced, context-sensitive feedback.

This dual approach not only enhances the accuracy of evaluations but also enriches the feedback process, making it more holistic and meaningful.

  • AI Analysis: Great for grammar, structure, and coherence.
  • Human Assessment: Essential for context, creativity, and nuanced understanding.

This blend ensures a comprehensive and balanced review, leveraging the strengths of both worlds. :memo: #CreativeWriting #AIinEducation #HumanAIcollaboration

Absolutely agree with your approach! Developing a custom scoring system tailored to specific criteria, like quality of reference and coherence, offers a more nuanced understanding of content quality. #QualityContent #ContentStrategy :books::sparkles:

The issue, of course, is that one person’s definition of “good” when it comes to creative writing is different from someone else’s!

AI can be quite good at analysing and assessing based on your own interpretation of what you’re after. For example, it can be very good at picking up humour, sarcasm, etc., in some cases better than people! Also, particularly in the field of comedy, it can actually be good at judging the appropriateness of the comedy for the audience.

As the process itself is not an exact science, getting AI to do it isn’t either. I think expecting it to just analyse, and then either summarise or fix the text, is asking for the impossible, as this is a back-and-forth task even for humans.

AI also can potentially be useful for analysing originality, due to the fact that it has read a lot more than most people. It’s all about getting those analysing prompts right.

I’m still a little upset that Apple bought BookLamp, which was behind the “Book Genome Project”… then never did anything with the company!

While it didn’t grade content, it would give each book a “fingerprint” of sorts that would allow you to match it to similar books… or…

… as they did in The Best Seller Code (circa 2016…). I don’t think anyone has publicly made something like what they describe in the book, which would rate whether a book is “best seller” material or not, but I’ve been thinking about doing it myself one of these days.

The Best Seller Code is a good read, though, even if it’s a bit dated tech-wise.


“Act as Jodie Archer and Matthew L. Jockers…”