Ensuring GPT Consistently Follows a Defined Scoring System: Tips and Best Practices?

I’ve been working on a project where I’m trying to get GPT to score transcripts based on specific metrics I define. I provide the transcription as the user input and then lay out scoring criteria in the system message. For instance: “Rate the response from 1-10 based on how frequently a person references their relationship. If mentioned more than 5 times, it should score 6; if mentioned less than 5 times, it should score 4.” (This is a poor example) The idea is to have the model strictly follow the scoring rules I provide.

Sometimes I get the desired scoring, but other times the results are inconsistent or not aligned with the guidelines I’ve given. Currently, I’m relying solely on the prompt to instruct the bot on the scoring method. Apart from the prompt, are there other ways or methods to make the model strictly follow the scoring rules? I’ve also set the temperature to 0 so it’s as deterministic as it can be, temperature-wise.

Any suggestions or experiences shared would be highly appreciated. Thanks in advance!

This might give you some insights,

It’s an interesting read for sure, but I don’t think it helps me in my current situation.

I believe the dataset is public, so you could consider fine-tuning your own model, or just using their released model, as it’s small enough to run locally on most recent consumer GPUs.

If the performance improvement scales nicely to gpt-3.5-turbo, it would result in a tremendously powerful evaluator.

Here are my thoughts on the topic:

  1. LLMs are not calculators, and it’s not a good idea to use them as such. If you’re counting occurrences, it’s better to have the LLM highlight the relevant passages and then sum them programmatically (see the sketch after this list).

  2. For completely subjective measures, it’s often unlikely that you’ll get similar responses even from humans, so it’s unlikely that you’ll manage to tune the prompt to give you the results that you personally expect. If you work in psychology, you probably know about methods such as the Likert-type scale. In this case, I think you can treat the LLM like a human you don’t know, with value structures you can’t control: set up differential questions, and then normalize the scores afterward. It’ll probably be more robust in any case, because the uncontrolled context was likely going to throw off your raw scores anyway.
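For point 1, a rough sketch of what I mean (the prompt wording, model choice, and the 4/6 threshold from your example are just placeholders, not anything authoritative):

    import OpenAI from "openai";

    const openai = new OpenAI(); // assumes OPENAI_API_KEY is set in the environment

    // Ask the model only to extract the relevant passages; do the counting
    // and the threshold rule deterministically in code.
    async function scoreRelationshipMentions(transcript) {
      const completion = await openai.chat.completions.create({
        model: "gpt-3.5-turbo",
        temperature: 0,
        messages: [
          {
            role: "system",
            content:
              "Extract every sentence in which the speaker references their " +
              "relationship. Respond with a JSON array of strings and nothing else.",
          },
          { role: "user", content: transcript },
        ],
      });

      // The model may still deviate from pure JSON; in real code, validate this.
      const mentions = JSON.parse(completion.choices[0].message.content);

      // The scoring rule lives here, so it is applied the same way every time.
      const score = mentions.length > 5 ? 6 : 4;
      return { score, count: mentions.length, mentions };
    }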

hth, lmk

I googled “Prometheus AI” but can’t find anything on it

Although this brings up a good point

Are there places I can find public datasets for fine-tuning?

Can you point me to where I can find the Prometheus AI dataset?

EDIT:

I see it is here. Would I just use the sample_train_data.json file to fine-tune?

EDIT 2: That’s not in JSONL though… hmm

I agree with @Diet here. For now, the best course would be to pass the transcript to your own code that can score it. Python has plenty of libraries that are capable of this.

Producing an analysis score at the end solely by the AI’s hidden-layer “reading” is a big ask. You are relying on the instruction prompt and a passage of text to influence the likelihood that you get a token value of “5” or “05” or “6” or “\n\nI’m sorry”.

For the particular case of “measuring romantic entanglements”, I could see producing an intermediate output that can then be better analyzed as a final step:

You analyze text to discover passages indicating interpersonal relationships, and then score that passage in two categories.

Instructions:
From the text passage given, extract rows into a markdown table with these columns:

  • column 1: each short passage or sentence that indicates a tie of friendly or romantic relationship.
  • column 2: AI score of friendly familiarity expressed between two people
  • column 3: AI score of romantic relationship existing or desired

Finally, produce a summary with the total scores of columns.

And try it on a “past lover getting back in touch” letter:

| Passage Indicating Relationship | AI Score of Friendly Familiarity | AI Score of Romantic Relationship |
| --- | --- | --- |
| “It’s been an incredibly long time since we last connected, and I find myself reflecting on those unforgettable moments we shared during our high school days.” | 8 | 1 |
| “The memories of our time together have remained etched in my mind, like a cherished photo album, full of laughter, secrets, and dreams.” | 9 | 2 |
| “The bond we had was special, and I can’t help but reminisce about those days that seemed to stretch on forever.” | 9 | 3 |
| “Through all the ups and downs, I’ve carried a piece of our past with me, never forgetting the warmth and connection we once shared.” | 9 | 2 |
| “Our friendship was a unique and meaningful one, and I believe that those deep connections, once established, can withstand the test of time.” | 10 | 2 |
| “I’ve often pondered the possibility of us rekindling the friendship we once had.” | 9 | 1 |
| “In many ways, the foundation of a strong and genuine friendship is similar… | | |
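If you go this route, the totals at the end are better computed by your own code than by the AI’s summary line. A quick untested sketch of parsing the markdown table and summing the two score columns:

    // Parse a markdown table like the one above and total the two score columns,
    // rather than trusting the model's own "summary" row.
    function totalScores(markdownTable) {
      const rows = markdownTable
        .trim()
        .split("\n")
        .slice(2) // skip the header row and the |---| separator row
        .map((line) => line.split("|").map((c) => c.trim()).filter(Boolean));

      let friendly = 0;
      let romantic = 0;
      for (const [, friendlyCell, romanticCell] of rows) {
        const f = Number(friendlyCell);
        const r = Number(romanticCell);
        if (!Number.isNaN(f)) friendly += f;
        if (!Number.isNaN(r)) romantic += r;
      }
      return { friendly, romantic };
    }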

There would almost certainly need to be a little bit of tinkering on your part to convert the dataset to something appropriate for fine tuning an OpenAI model.
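Something along these lines, perhaps. I’m guessing at the record field names here (`instruction` and `output` are assumptions), so check them against sample_train_data.json first:

    import { readFileSync, writeFileSync } from "node:fs";

    // Assumed record shape: { instruction: "...", output: "..." } — verify against
    // the actual Feedback-Collection files before running.
    const records = JSON.parse(readFileSync("sample_train_data.json", "utf8"));

    // OpenAI chat fine-tuning expects one {"messages": [...]} object per line (JSONL).
    const jsonl = records
      .map((r) =>
        JSON.stringify({
          messages: [
            { role: "system", content: "You are an evaluator. Score the response according to the rubric." },
            { role: "user", content: r.instruction },
            { role: "assistant", content: r.output },
          ],
        })
      )
      .join("\n");

    writeFileSync("feedback_collection.jsonl", jsonl);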

If you decided you were serious about doing the fine-tuning it might be worth your while to reach out to the Prometheus team and ask for guidance.

They might be willing to lend a hand, especially if you offered to let them benchmark the model once it is trained.

I believe this is the link to the actual dataset,

The Feedback-Collection dataset is 99,952 records and 104,203,908 total tokens (by my count), so you would be looking at roughly $834/epoch for fine-tuning a model based on their data. I didn’t see anywhere in the paper that they mention if they held out any of the data for testing/validation or how many epochs they trained for.
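For reference, the arithmetic behind that estimate, assuming the $0.008 per 1K training-token price for gpt-3.5-turbo fine-tuning (the $735 and $7.35 figures further down follow from the same formula, with the reduced token counts and babbage-002’s lower assumed rate of $0.0004 per 1K):

    const tokens = 104_203_908;   // total training tokens, per the count above
    const pricePer1K = 0.008;     // assumed gpt-3.5-turbo training price, $ per 1K tokens
    console.log((tokens / 1000) * pricePer1K); // ≈ 833.63 → roughly $834 per epoch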

I do wonder if one could “get away with” drastically reducing the data so that you don’t need to train on a 1, 2, 3, 4, and 5 result for each instruction. If each instruction had only one exemplar, could the model “transfer” what it learned about, say, a 4 score from the other instructions whose exemplars scored a 4?

Anyway, it may be prohibitively expensive to do this kind of fine-tuning unless you thought babbage-002 would be able to handle this sort of evaluation.

Other Thoughts

The other thing looking at this has highlighted for me is the fact we really need some kind of option for custom loss functions.

I am not sure how effective it would be, as part of what the Prometheus model does is go through essentially a chain-of-thought process before determining the score. But if you are only looking for a score metric, removing everything but the numeric score value from the response drops the total number of training tokens in an epoch to 91,929,357, saving almost 12% and making the per-epoch cost about $735. At that point, though, I would probably just leave the chain-of-thought fine-tuning in place. (Incidentally, this is a place where a custom loss function would come in handy, as predicting a 4 when the “correct” response is a 5 is much better than predicting a 1.)
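To illustrate, the kind of distance-weighted (ordinal) penalty I have in mind would look something like this toy function; it is purely illustrative, not something the fine-tuning API currently exposes:

    // A miss by one rubric point should cost far less than a miss by four.
    // Plain token-level cross-entropy treats "4" and "1" as equally wrong
    // when the target token is "5"; an ordinal loss would not.
    function ordinalPenalty(predictedScore, trueScore, maxScore = 5) {
      return Math.abs(predictedScore - trueScore) / (maxScore - 1);
    }

    console.log(ordinalPenalty(4, 5)); // 0.25 — near miss
    console.log(ordinalPenalty(1, 5)); // 1    — maximal miss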

In the end, it may just be better to use Prometheus as inspiration and come up with your own fine-tuning dataset for your very particular use case.

If you have the resources, filtering the data down to 1/5 of the entries (keeping only one score from each instruction), cutting the response to a single numeric token, and doing a fine-tuning on babbage-002 might be interesting at just over $7.35/epoch.

Anyway, I had just read this paper a few days ago and your question reminded me of it so I wanted to pass it along.

It’s interesting you say this, because when I feed GPT the same audio file, it almost always scores it the same. We have 15 metrics, and each of those 15 metrics has 3-4 submetrics. We have the GPT bot return the calls in JSON, so one prompt might be…

 {
    key: "Substance Abuse",

    description: `"First, analyze the problem and formulate your own solution. Compare your solution with the guidelines provided below. If your solution aligns with these guidelines, proceed with that answer. If it doesn't, adjust your solution to ensure it adheres to the guidelines. Always use the following guidelines as the final reference for correctness. Substance Abuse: Score in JSON based on the average scores of frequencyOfUse, amountOfUse, and impactOnDailyLife. If there is no mention of substance abuse, score in JSON 0
    frequencyOfUse: Score in JSON based on mentions or implications of how often the client uses the substance.
    amountOfUse: Rate depending on descriptions or indications of the quantity of substance used at a time. If they use alcohol or drugs more than 5 times a week, they should receive a score in JSON of at least 8 for substance Abuse
    impactOnDailyLife: Evaluate how the substance use affects the client's routine, relationships, job, or other daily activities. If the patient mentions that their drug habits or alcohol use impacts over 4 aspects of their lives, the patient should receive at least an 8 score in JSON for Substance Abuse. Respond in JSON Format without deviation: `,

    json: {
      score: "",
      description: "",
      factors: {
        frequencyOfUse: "",
        amountOfUse: "",
        impactOnDailyLife: "",
      },
    },
  },

Where each of the fifteen core metrics gets its own GPT call and is fed a transcript.

This way, each metric gets its own GPT brain, and we do get better scoring results, but it will still sometimes bug out. For instance, we always want the main metric (in the above case, “Substance Abuse”) to be the average of the sub-metrics, but sometimes it will score the three sub-metrics 6, 7, and 8 and then say the overall score is 3.
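One way we’re thinking about avoiding that particular bug, following the “LLMs are not calculators” point above, is to have the model fill in only the sub-metric scores and derive the parent score in our own code. A rough sketch, assuming the response mirrors the json shape in the config:

    // Derive the parent "Substance Abuse" score as the average of the sub-metric
    // scores the model returned, instead of trusting its own top-level score.
    function deriveParentScore(result) {
      const factors = Object.values(result.json.factors).map(Number);

      // Rubric rule: no mention of substance abuse at all -> overall score 0.
      if (factors.every((v) => Number.isNaN(v) || v === 0)) {
        result.json.score = 0;
        return result;
      }

      const valid = factors.filter((v) => !Number.isNaN(v));
      result.json.score = Math.round(valid.reduce((sum, v) => sum + v, 0) / valid.length);
      return result;
    }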

It’s not an easy task but the idea is to get it to a place where it can score things based on a set of rules that are defined by a psychologist.

It would be an interesting thing to try to train a model that large and expensive, but it would suck if it didn’t work out on the first run, haha. Yeah, this is a cool idea; I think it might be a little outside my scope of possibility right now. Creating a fine-tuned model for this would surely be ideal, but then it comes down to how to create it and what data to feed it.

I’m not sure; I’m also in Node with this project. There are 15 metrics, and each metric gets its own GPT call, and we are trying to figure out how to make it so GPT accurately scores each individual metric. Each metric has sub-metrics (75 metrics total):

  {
    key: "Substance Abuse",

    description: `"First, analyze the problem and formulate your own solution. Compare your solution with the guidelines provided below. If your solution aligns with these guidelines, proceed with that answer. If it doesn't, adjust your solution to ensure it adheres to the guidelines. Always use the following guidelines as the final reference for correctness. Substance Abuse: Score in JSON based on the average scores of frequencyOfUse, amountOfUse, and impactOnDailyLife. If there is no mention of substance abuse, score in JSON 0
    frequencyOfUse: Score in JSON based on mentions or implications of how often the client uses the substance.
    amountOfUse: Rate depending on descriptions or indications of the quantity of substance used at a time. If they use alcohol or drugs more than 5 times a week, they should receive a score in JSON of at least 8 for substance Abuse
    impactOnDailyLife: Evaluate how the substance use affects the client's routine, relationships, job, or other daily activities. If the patient mentions that their drug habits or alcohol use impacts over 4 aspects of their lives, the patient should receive at least an 8 score in JSON for Substance Abuse. Respond in JSON Format without deviation: `,

    json: {
      score: "",
      description: "",
      factors: {
        frequencyOfUse: "",
        amountOfUse: "",
        impactOnDailyLife: "",
      },
    },
  },

So, for each of the 15 GPT calls and metrics, we define how each GPT bot should score its metric, and we’re thinking about telling it to score things based on how they might already be scored in other inventories, thereby adopting already-proven theories.

Of course!

That’s why I also suggested you look at running one of their pre-trained models locally if you have a modern NVIDIA GPU with 16GB of VRAM.

I think the use case for something like this is a bunch of high school teachers who want to outsource their grading. :joy:

Now there’s an application someone could sell.

Fine-tune a model to build the rubric, fine-tune a model to grade based on a rubric, teachers upload their assignment, build the rubric for them, then they upload a zip file of their students’ work and download a CSV file ready to be uploaded into any of the popular learning management systems.

Give away the rubric and charge $0.15 / assignment graded.

Teacher with 40 students just graded all the assignments in about a minute for $6.00.

Hmmm… :thinking:

Less than $5k in up-front costs, probably $0.05 / assignment to actually grade. The break-even point (after all other expenses are considered) is probably between 100k–200k assignments graded.

Figure an average teacher might grade 10 assignments × 25 students / year, and one could be in the black with between 400 and 1000 regular customers for a year.

After that you could have yourself a nice little $2,000–$5,000 / month passive income without even getting very big.


What I meant is that the input may bias the instruction. Of course, if you put the same thing in, you’re likely to get similar results, particularly at lower temperatures.

It’s also quite possible that the model isn’t even respecting the input data and is just punching numbers into the JSON, because it understands that you want numbers in the JSON.

What I’m thinking, reading the prompt, is that this might actually be a vector search-and-filter task, rather than a flat zero-shot task like the one you seem to be running. But that is just a guess.

Please say more!

This is a preliminary prompt. I’m hoping someone might know the secret sauce, but ideally we get someone with the expertise (a psychologist) to create the scoring system for the GPT bot, and then have the bot base its scores on that system.

Say more here,

Basically, what vector search does is index snippets of your conversation by concept. So if you want to evaluate how drugs affect the patient’s life, you may want to retrieve all the passages where the patient talks about drug use. Then you can refine these passages, possibly filter them, and then aggregate them. That way you can deal with even very long conversations.
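A bare-bones version of that retrieval step might look like this (the embedding model, chunking, and top-k are just assumptions on my part):

    import OpenAI from "openai";

    const openai = new OpenAI();

    const cosine = (a, b) => {
      let dot = 0, na = 0, nb = 0;
      for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
      }
      return dot / (Math.sqrt(na) * Math.sqrt(nb));
    };

    // Embed the transcript chunks alongside the query, then rank by similarity,
    // e.g. retrieveRelevantPassages(chunks, "client describes drug or alcohol use").
    async function retrieveRelevantPassages(chunks, query, topK = 5) {
      const { data } = await openai.embeddings.create({
        model: "text-embedding-ada-002",
        input: [query, ...chunks],
      });
      const [queryVec, ...chunkVecs] = data.map((d) => d.embedding);

      return chunkVecs
        .map((vec, i) => ({ chunk: chunks[i], score: cosine(queryVec, vec) }))
        .sort((a, b) => b.score - a.score)
        .slice(0, topK);
    }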

There are some other secret sauces, but I don’t know if I’m allowed to share them.

The other thing you can do is turn it into a work task. How would an experienced psychologist go about doing this? I imagine most people don’t just glance at a report and immediately write their findings. Instead, they’ll go through the document, take notes, build their understanding (in this case, maybe keep a rolling history and maybe a vector memory), and then refine their notes into a report. Instead of asking the AI to do it in one go, consider allowing it to do the work.
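Roughly, that flow could look like the sketch below; every prompt and function name here is made up, just to show the shape of the work:

    // Hypothetical multi-pass flow: take notes per chunk, then score from the notes.
    async function scoreLikeAClinician(openai, transcriptChunks, rubric) {
      const notes = [];

      for (const chunk of transcriptChunks) {
        const res = await openai.chat.completions.create({
          model: "gpt-3.5-turbo",
          temperature: 0,
          messages: [
            { role: "system", content: `Take concise notes on anything relevant to this rubric:\n${rubric}` },
            { role: "user", content: chunk },
          ],
        });
        notes.push(res.choices[0].message.content);
      }

      // Final pass: the model sees only its own condensed notes, not the raw transcript.
      const final = await openai.chat.completions.create({
        model: "gpt-3.5-turbo",
        temperature: 0,
        messages: [
          { role: "system", content: `Using only these notes, score each metric according to the rubric:\n${rubric}` },
          { role: "user", content: notes.join("\n---\n") },
        ],
      });

      return final.choices[0].message.content;
    }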

What’s your cell number? lol, just kidding, of course.

Actually, we’ve figured out a pretty cool solution that we’re working on

I thought about this; I think it makes more sense to be “stateless”, without the embeddings, for this use case.