Ensuring GPT Consistently Follows a Defined Scoring System: Tips and Best Practices?

I’ve been working on a project where I’m trying to get GPT to score transcripts based on specific metrics I define. I provide the transcription as the user input and then lay out scoring criteria in the system message. For instance: “Rate the response from 1-10 based on how frequently a person references their relationship. If mentioned more than 5 times, it should score 6; if mentioned less than 5 times, it should score 4.” (This is a poor example) The idea is to have the model strictly follow the scoring rules I provide.

Sometimes I get the desired scoring, but other times the results are inconsistent or not aligned with the guidelines I’ve given. Currently, I’m relying solely on the prompt to instruct the bot on the scoring method. Apart from the prompt, are there other ways or methods to make the model strictly follow the scoring rules? I’ve also set the temperature to 0 so it’s as deterministic as it can be, temperature-wise.

Any suggestions or experiences shared would be highly appreciated. Thanks in advance!

This might give you some insights,

It’s an interesting read for sure, but I don’t think it helps me in my current situation.

I believe the dataset is public, so you could consider fine-tuning your own model, or just using their released model, as it’s small enough to run locally on most recent consumer GPUs.

If the performance improvement scales nicely to gpt-3.5-turbo, it would result in a tremendously powerful evaluator.

Here are my thoughts on the topic:

  1. LLMs are not calculators, and it’s not a good idea to use them as such. If you’re counting occurrences, it’s better to have the LLM highlight the relevant passages and then sum them programmatically (see the sketch after this list).

  2. For completely subjective measures, it’s often unlikely that you’ll get similar responses even from humans, so it’s unlikely that you’ll manage to tune the prompt to give you the results that you personally expect. If you work in psychology, you probably know about methods such as the Likert-type scale. In this case, I think you can treat the LLM like a human you don’t know, with value structures you can’t control: set up differential questions, and then normalize the scores afterward. It’ll probably be more robust in any case, because the uncontrolled context was likely going to throw off your raw scores anyway.
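For point 1, a rough sketch of what I mean (the prompt wording, model choice, and the 4/6 threshold from your example are just placeholders, not anything authoritative):

    import OpenAI from "openai";

    const openai = new OpenAI(); // assumes OPENAI_API_KEY is set in the environment

    // Ask the model only to extract the relevant passages; do the counting
    // and the threshold rule deterministically in code.
    async function scoreRelationshipMentions(transcript) {
      const completion = await openai.chat.completions.create({
        model: "gpt-3.5-turbo",
        temperature: 0,
        messages: [
          {
            role: "system",
            content:
              "Extract every sentence in which the speaker references their " +
              "relationship. Respond with a JSON array of strings and nothing else.",
          },
          { role: "user", content: transcript },
        ],
      });

      // The model may still deviate from pure JSON; in real code, validate this.
      const mentions = JSON.parse(completion.choices[0].message.content);

      // The scoring rule lives here, so it is applied the same way every time.
      const score = mentions.length > 5 ? 6 : 4;
      return { score, count: mentions.length, mentions };
    }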

hth, lmk

I googled “Prometheus AI” but can’t find anything on it

Although this brings up a good point

Are there places I can find public datasets for fine-tuning?

Can you point me to where I can find the Prometheus AI dataset?

EDIT:

I see it is here. Would I just use the sample_train_data.json file to fine-tune?

EDIT 2: That’s not in JSONL though… hmm

I agree with @Diet here. For now, the best course would be to pass the transcript to your own code that can score it. Python has plenty of libraries that are capable of this.

Producing an analysis score at the end solely by the AI’s hidden-layer “reading” is a big ask. You are relying on the instruction prompt and a passage of text to influence the likelihood that you get a token value of “5” or “05” or “6” or “\n\nI’m sorry”.

For the particular case of “measuring romantic entanglements”, I could see producing an intermediate output that can then be better analyzed as a final step:

You analyze text to discover passages indicating interpersonal relationships, and then score that passage in two categories.

Instructions:
From the text passage given, extract rows into a markdown table with these columns:

  • column 1: each short passage or sentence that indicates a tie of friendly or romantic relationship.
  • column 2: AI score of friendly familiarity expressed between two people
  • column 3: AI score of romantic relationship existing or desired

Finally, produce a summary with the total scores of columns.

And try it on a “past lover getting back in touch” letter:

| Passage Indicating Relationship | AI Score of Friendly Familiarity | AI Score of Romantic Relationship |
| --- | --- | --- |
| “It’s been an incredibly long time since we last connected, and I find myself reflecting on those unforgettable moments we shared during our high school days.” | 8 | 1 |
| “The memories of our time together have remained etched in my mind, like a cherished photo album, full of laughter, secrets, and dreams.” | 9 | 2 |
| “The bond we had was special, and I can’t help but reminisce about those days that seemed to stretch on forever.” | 9 | 3 |
| “Through all the ups and downs, I’ve carried a piece of our past with me, never forgetting the warmth and connection we once shared.” | 9 | 2 |
| “Our friendship was a unique and meaningful one, and I believe that those deep connections, once established, can withstand the test of time.” | 10 | 2 |
| “I’ve often pondered the possibility of us rekindling the friendship we once had.” | 9 | 1 |
| “In many ways, the foundation of a strong and genuine friendship is similar… | | |
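If you go this route, the totals at the end are better computed by your own code than by the AI’s summary line. A quick untested sketch of parsing the markdown table and summing the two score columns:

    // Parse a markdown table like the one above and total the two score columns,
    // rather than trusting the model's own "summary" row.
    function totalScores(markdownTable) {
      const rows = markdownTable
        .trim()
        .split("\n")
        .slice(2) // skip the header row and the |---| separator row
        .map((line) => line.split("|").map((c) => c.trim()).filter(Boolean));

      let friendly = 0;
      let romantic = 0;
      for (const [, friendlyCell, romanticCell] of rows) {
        const f = Number(friendlyCell);
        const r = Number(romanticCell);
        if (!Number.isNaN(f)) friendly += f;
        if (!Number.isNaN(r)) romantic += r;
      }
      return { friendly, romantic };
    }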

There would almost certainly need to be a little bit of tinkering on your part to convert the dataset to something appropriate for fine tuning an OpenAI model.
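Something along these lines, perhaps. I’m guessing at the record field names here (`instruction` and `output` are assumptions), so check them against sample_train_data.json first:

    import { readFileSync, writeFileSync } from "node:fs";

    // Assumed record shape: { instruction: "...", output: "..." } — verify against
    // the actual Feedback-Collection files before running.
    const records = JSON.parse(readFileSync("sample_train_data.json", "utf8"));

    // OpenAI chat fine-tuning expects one {"messages": [...]} object per line (JSONL).
    const jsonl = records
      .map((r) =>
        JSON.stringify({
          messages: [
            { role: "system", content: "You are an evaluator. Score the response according to the rubric." },
            { role: "user", content: r.instruction },
            { role: "assistant", content: r.output },
          ],
        })
      )
      .join("\n");

    writeFileSync("feedback_collection.jsonl", jsonl);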

If you decided you were serious about doing the fine-tuning it might be worth your while to reach out to the Prometheus team and ask for guidance.

They might be willing to lend a hand, especially if you offered to let them benchmark the model once it is trained.

I believe this is the link to the actual dataset,

The Feedback-Collection dataset is 99,952 records and 104,203,908 total tokens (by my count), so you would be looking at roughly $834/epoch for fine-tuning a model based on their data. I didn’t see anywhere in the paper that they mention if they held out any of the data for testing/validation or how many epochs they trained for.
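For reference, the arithmetic behind that estimate, assuming the $0.008 per 1K training-token price for gpt-3.5-turbo fine-tuning (the $735 and $7.35 figures further down follow from the same formula, with the reduced token counts and babbage-002’s lower assumed rate of $0.0004 per 1K):

    const tokens = 104_203_908;   // total training tokens, per the count above
    const pricePer1K = 0.008;     // assumed gpt-3.5-turbo training price, $ per 1K tokens
    console.log((tokens / 1000) * pricePer1K); // ≈ 833.63 → roughly $834 per epoch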

I do wonder if one could “get away with” drastically reducing the data so that you don’t need to train on a 1, 2, 3, 4, and 5 result for each instruction. If each instruction had only one exemplar, could the model “transfer” what it learned about, say, a 4 score from the other instructions whose exemplars scored a 4?

Anyway, it may be prohibitively expensive to do this kind of fine-tuning unless you thought babbage-002 would be able to handle this sort of evaluation.

Other Thoughts

The other thing looking at this has highlighted for me is the fact we really need some kind of option for custom loss functions.

I am not sure how effective it would be, as part of what the Prometheus model does is go through essentially a chain-of-thought process before determining the score. But if you are only looking for a score metric, removing everything but the numeric score value from the response drops the total number of training tokens in an epoch to 91,929,357, saving almost 12% and making the per-epoch cost about $735. At that point, though, I would probably just leave the chain-of-thought fine-tuning in place. (Incidentally, this is a place where a custom loss function would come in handy, as predicting a 4 when the “correct” response is a 5 is much better than predicting a 1.)
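To illustrate, the kind of distance-weighted (ordinal) penalty I have in mind would look something like this toy function; it is purely illustrative, not something the fine-tuning API currently exposes:

    // A miss by one rubric point should cost far less than a miss by four.
    // Plain token-level cross-entropy treats "4" and "1" as equally wrong
    // when the target token is "5"; an ordinal loss would not.
    function ordinalPenalty(predictedScore, trueScore, maxScore = 5) {
      return Math.abs(predictedScore - trueScore) / (maxScore - 1);
    }

    console.log(ordinalPenalty(4, 5)); // 0.25 — near miss
    console.log(ordinalPenalty(1, 5)); // 1    — maximal miss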

In the end, it may just be better to use Prometheus as inspiration and come up with your own fine-tuning dataset for your very particular use case.

If you have the resources, filtering the data down to 1/5 of the entries (keeping only one score from each instruction), cutting the response to a single numeric token, and doing a fine-tuning on babbage-002 might be interesting at just over $7.35/epoch.

Anyway, I had just read this paper a few days ago and your question reminded me of it so I wanted to pass it along.

It’s interesting you say this, because when I feed GPT the same audio file, it almost always scores it the same. We have 15 metrics, and each of those 15 metrics has 3-4 submetrics. We have the GPT bot return the calls in JSON, so one prompt might be…

 {
    key: "Substance Abuse",

    description: `"First, analyze the problem and formulate your own solution. Compare your solution with the guidelines provided below. If your solution aligns with these guidelines, proceed with that answer. If it doesn't, adjust your solution to ensure it adheres to the guidelines. Always use the following guidelines as the final reference for correctness. Substance Abuse: Score in JSON based on the average scores of frequencyOfUse, amountOfUse, and impactOnDailyLife. If there is no mention of substance abuse, score in JSON 0
    frequencyOfUse: Score in JSON based on mentions or implications of how often the client uses the substance.
    amountOfUse: Rate depending on descriptions or indications of the quantity of substance used at a time. If they use alcohol or drugs more than 5 times a week, they should receive a score in JSON of at least 8 for substance Abuse
    impactOnDailyLife: Evaluate how the substance use affects the client's routine, relationships, job, or other daily activities. If the patient mentions that their drug habits or alcohol use impacts over 4 aspects of their lives, the patient should receive at least an 8 score in JSON for Substance Abuse. Respond in JSON Format without deviation: `,

    json: {
      score: "",
      description: "",
      factors: {
        frequencyOfUse: "",
        amountOfUse: "",
        impactOnDailyLife: "",
      },
    },
  },

Where each of the fifteen core metrics gets its own GPT call and is fed a transcript.

This way, each metric gets its own GPT brain, and we do get better scoring results, but it will still sometimes bug out. For instance, we always want the main metric (in the above case, “Substance Abuse”) to be the average of the sub-metrics, but sometimes it will score the three sub-metrics 6, 7, and 8 and then say the overall score is 3.
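One way we’re thinking about avoiding that particular bug, following the “LLMs are not calculators” point above, is to have the model fill in only the sub-metric scores and derive the parent score in our own code. A rough sketch, assuming the response mirrors the json shape in the config:

    // Derive the parent "Substance Abuse" score as the average of the sub-metric
    // scores the model returned, instead of trusting its own top-level score.
    function deriveParentScore(result) {
      const factors = Object.values(result.json.factors).map(Number);

      // Rubric rule: no mention of substance abuse at all -> overall score 0.
      if (factors.every((v) => Number.isNaN(v) || v === 0)) {
        result.json.score = 0;
        return result;
      }

      const valid = factors.filter((v) => !Number.isNaN(v));
      result.json.score = Math.round(valid.reduce((sum, v) => sum + v, 0) / valid.length);
      return result;
    }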

It’s not an easy task but the idea is to get it to a place where it can score things based on a set of rules that are defined by a psychologist.

It would be an interesting thing to try to train a model that large and expensive, but it would suck if it didn’t work out on the first run, haha. Yeah, this is a cool idea; I think it might be a little outside my scope of possibility right now. Creating a fine-tuned model for this would surely be ideal, but then it comes down to how to create it and what data to feed it.

I’m not sure; I’m also in Node with this project. There are 15 metrics, and each metric gets its own GPT call, and we are trying to figure out how to make it so GPT accurately scores each individual metric. Each metric has sub-metrics (75 metrics total):

  {
    key: "Substance Abuse",

    description: `"First, analyze the problem and formulate your own solution. Compare your solution with the guidelines provided below. If your solution aligns with these guidelines, proceed with that answer. If it doesn't, adjust your solution to ensure it adheres to the guidelines. Always use the following guidelines as the final reference for correctness. Substance Abuse: Score in JSON based on the average scores of frequencyOfUse, amountOfUse, and impactOnDailyLife. If there is no mention of substance abuse, score in JSON 0
    frequencyOfUse: Score in JSON based on mentions or implications of how often the client uses the substance.
    amountOfUse: Rate depending on descriptions or indications of the quantity of substance used at a time. If they use alcohol or drugs more than 5 times a week, they should receive a score in JSON of at least 8 for substance Abuse
    impactOnDailyLife: Evaluate how the substance use affects the client's routine, relationships, job, or other daily activities. If the patient mentions that their drug habits or alcohol use impacts over 4 aspects of their lives, the patient should receive at least an 8 score in JSON for Substance Abuse. Respond in JSON Format without deviation: `,

    json: {
      score: "",
      description: "",
      factors: {
        frequencyOfUse: "",
        amountOfUse: "",
        impactOnDailyLife: "",
      },
    },
  },

So, for each of the 15 GPT calls and metrics, we define how each GPT bot should score its metric, and we’re thinking about telling it to score things based on how they might already be scored in other inventories, thereby adopting already-proven theories.

Of course!

That’s why I also suggested you look at running one of their pre-trained models locally if you have a modern NVIDIA GPU with 16GB of VRAM.

I think the use case for something like this is a bunch of high school teachers who want to outsource their grading. :joy:

Now there’s an application someone could sell.

Fine-tune a model to build the rubric, fine-tune a model to grade based on a rubric, teachers upload their assignment, build the rubric for them, then they upload a zip file of their students’ work and download a CSV file ready to be uploaded into any of the popular learning management systems.

Give away the rubric and charge $0.15 / assignment graded.

Teacher with 40 students just graded all the assignments in about a minute for $6.00.

Hmmm… :thinking:

Less than $5k in up-front costs, probably $0.05 / assignment to actually grade. The break-even point (after all other expenses are considered) is probably between 100k–200k assignments graded.

Figure an average teacher might grade 10 assignments × 25 students / year, and one could be in the black with between 400 and 1000 regular customers for a year.

After that you could have yourself a nice little $2,000–$5,000 / month passive income without even getting very big.


What I meant is that the input may bias the instruction. Of course, if you put the same thing in, you’re likely to get similar results, particularly at lower temperatures.

It’s also quite possible that the model isn’t even respecting the input data and is just punching numbers into the JSON, because it understands that you want numbers in the JSON.

What I’m thinking, reading the prompt, is that this might actually be a vector search-and-filter task, rather than a flat zero-shot task like the one you seem to be running. But that is just a guess.

Please say more!

This is a preliminary prompt. I’m hoping someone might know the secret sauce, but ideally we get someone with the expertise (a psychologist) to create the scoring system for the GPT bot, and then have the bot base its scores on that system.

Say more here,

Basically, what vector search does is index snippets of your conversation by concept. So if you want to evaluate how drugs affect the patient’s life, you may want to retrieve all the passages where the patient talks about drug use. Then you can refine these passages, possibly filter them, and then aggregate them. That way you can deal with even very long conversations.
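A bare-bones version of that retrieval step might look like this (the embedding model, chunking, and top-k are just assumptions on my part):

    import OpenAI from "openai";

    const openai = new OpenAI();

    const cosine = (a, b) => {
      let dot = 0, na = 0, nb = 0;
      for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
      }
      return dot / (Math.sqrt(na) * Math.sqrt(nb));
    };

    // Embed the transcript chunks alongside the query, then rank by similarity,
    // e.g. retrieveRelevantPassages(chunks, "client describes drug or alcohol use").
    async function retrieveRelevantPassages(chunks, query, topK = 5) {
      const { data } = await openai.embeddings.create({
        model: "text-embedding-ada-002",
        input: [query, ...chunks],
      });
      const [queryVec, ...chunkVecs] = data.map((d) => d.embedding);

      return chunkVecs
        .map((vec, i) => ({ chunk: chunks[i], score: cosine(queryVec, vec) }))
        .sort((a, b) => b.score - a.score)
        .slice(0, topK);
    }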

There are some other secret sauces, but I don’t know if I’m allowed to share them.

The other thing you can do is turn it into a work task. How would an experienced psychologist go about doing this? I imagine most people don’t just glance at a report and immediately write their findings. Instead, they’ll go through the document, take notes, build their understanding (in this case, maybe keep a rolling history and maybe a vector memory), and then refine their notes into a report. Instead of asking the AI to do it in one go, consider allowing it to do the work.
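Roughly, that flow could look like the sketch below; every prompt and function name here is made up, just to show the shape of the work:

    // Hypothetical multi-pass flow: take notes per chunk, then score from the notes.
    async function scoreLikeAClinician(openai, transcriptChunks, rubric) {
      const notes = [];

      for (const chunk of transcriptChunks) {
        const res = await openai.chat.completions.create({
          model: "gpt-3.5-turbo",
          temperature: 0,
          messages: [
            { role: "system", content: `Take concise notes on anything relevant to this rubric:\n${rubric}` },
            { role: "user", content: chunk },
          ],
        });
        notes.push(res.choices[0].message.content);
      }

      // Final pass: the model sees only its own condensed notes, not the raw transcript.
      const final = await openai.chat.completions.create({
        model: "gpt-3.5-turbo",
        temperature: 0,
        messages: [
          { role: "system", content: `Using only these notes, score each metric according to the rubric:\n${rubric}` },
          { role: "user", content: notes.join("\n---\n") },
        ],
      });

      return final.choices[0].message.content;
    }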

What’s your cell number? lol, just kidding, of course.

Actually, we’ve figured out a pretty cool solution that we’re working on

I thought about this; I think it makes more sense to be “stateless”, without the embeddings, for this use case.