Can the model give back weighted decisions if you give it weights in the prompt?

For instance, if I tell it to score a transcript on different criteria via a prompt, but then at the end of the prompt tell it to weight those scores like so…

  1. Active Listening: 15%
  2. Verbal Acknowledgement: 15%
  3. Non-Verbal Cues: 10%
  4. Reflecting and Paraphrasing: 15%
  5. Emotional Resonance: 15%
  6. Validation: 10%
  7. Supportive Interventions: 10%
  8. Consistency: 10%

Would the bot most likely just hallucinate these “weights” I’ve had it apply?
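To be concrete, the final step I’d be asking it to do in its head is just a weighted sum, which is trivial if you do it outside the model. A quick Python sketch with made-up criterion scores:

```python
# The weighted arithmetic the model would have to do "in its head".
# Criterion scores (0-10) are made-up examples.
weights = {
    "Active Listening": 0.15,
    "Verbal Acknowledgement": 0.15,
    "Non-Verbal Cues": 0.10,
    "Reflecting and Paraphrasing": 0.15,
    "Emotional Resonance": 0.15,
    "Validation": 0.10,
    "Supportive Interventions": 0.10,
    "Consistency": 0.10,
}
scores = {
    "Active Listening": 7,
    "Verbal Acknowledgement": 8,
    "Non-Verbal Cues": 5,
    "Reflecting and Paraphrasing": 6,
    "Emotional Resonance": 7,
    "Validation": 9,
    "Supportive Interventions": 6,
    "Consistency": 8,
}

# Weights sum to 100%, so this is a plain weighted average.
final = sum(scores[k] * weights[k] for k in weights)
print(round(final, 2))  # 7.0
```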

It’s GPT, so 50/50. Half of the time it works every time!

Jokes aside, don’t count on it. It might work nine times and hallucinate the tenth.

Yes, this is what I figured; just looking for a second opinion. I figure it’s just too much internal math for it to do; there’s no way it’ll get it right.

You can feed it through an internal loop to ask itself, and thereby reduce the number of hallucinations at the cost of API calls. You won’t completely eliminate it, though. GPT is not exactly reliable, even when set to “deterministic”.

No, it’s actually fucking super reliable.

I’ve run my own internal tests, and based on those I can tell you it spits out the same answers like 97% of the time.

It’s hella reliable. The question is more: is it accurately going to look through my criteria, THEN weight the criteria, THEN come out with the right answer?

mmmmm idk

EDIT: Maybe by “reliable” you meant “accurate”, which is another thing. But as far as my internal testing for reliability goes, I implore you to set temp = 0 and see; it’s very much reliable.
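A quick sketch of the kind of repeatability test I mean, using the OpenAI Python SDK (the model name and prompt are just placeholders):

```python
from collections import Counter

from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = "Score this transcript from 0-10 for Active Listening: ..."  # placeholder

def sample(n: int = 20) -> Counter:
    """Call the model n times at temperature 0 and tally distinct outputs."""
    outputs = Counter()
    for _ in range(n):
        resp = client.chat.completions.create(
            model="gpt-4",  # placeholder model name
            temperature=0,
            messages=[{"role": "user", "content": PROMPT}],
        )
        outputs[resp.choices[0].message.content.strip()] += 1
    return outputs

print(sample())  # one dominant answer means high repeatability
```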

By “reliable” I mean “yeah, I can count on it!”. 97%, while really good, isn’t perfect, meaning that in “critical” applications it might hallucinate and mess everything up.
Think of it this way, with a silly example: you grade 10,000 student papers (don’t!), and while 9,700 come out as expected, it’ll grade 300 wrong. 300 is a lot at that scale.

oh yeah yeah yeah I agree I’m sorry

For the specific field I’m using it in, 97% is super high.

Totally agree that “almost” does not cut it in horseshoes and hand grenades.


You mean “almost” only cuts it in horseshoes and hand grenades.

The phrase means that those two things are fine if you’re just close.

(Sorry to be that guy on the internet lol)

I will not edit my comment; I’ll let everyone see my horrible, horrible mistake. I deserve it.


LOL

I like the concept of what you’re doing, by the way. I do something similar with my Whisper audio transcription post-processing; it gives me a basic sentiment evaluation.

Yeah, we’re trying to come up with a scoring manual for each individual attribute. One idea was having the bot weight it, but I told the person who wrote the prompt that I’d look into it, because I’m pretty sure it’s not going to give the most accurate response.


Well, to be fair, humans will be biased with that too. As long as it’s fairly low-impact (not anyone’s college entrance exam or anything), I don’t see why GPT-4 wouldn’t work fine.

It’s not low-impact. In my thinking, if you introduce the weights, it would have to apply them 100% consistently to have any validity. You have to be able to say “the bot scored it like this because of this”, and any variable we introduce that amounts to “well, uh, we don’t know how it’s weighting it, it just is” doesn’t really fly.
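The direction I’d trust more is having the model return only the raw per-criterion scores plus a one-line rationale for each, and applying the weights in our own code, so the final number is exact and auditable. A rough sketch (the JSON shape, prompt, and model name are hypothetical):

```python
import json

from openai import OpenAI

client = OpenAI()

# Toy weights for the sketch; the real rubric has eight criteria summing to 100%.
WEIGHTS = {"Active Listening": 0.6, "Validation": 0.4}

# Hypothetical prompt: ask only for raw scores plus a one-line rationale each.
PROMPT = (
    "Score the transcript on each criterion from 0 to 10 and give a one-line "
    "rationale for each. Reply with only JSON, shaped like: "
    '{"Active Listening": {"score": 7, "rationale": "..."}}'
    "\n\nTRANSCRIPT:\n..."
)

resp = client.chat.completions.create(
    model="gpt-4",  # placeholder model name
    temperature=0,
    messages=[{"role": "user", "content": PROMPT}],
)
# Real code should validate/repair this; models sometimes wrap JSON in prose.
per_criterion = json.loads(resp.choices[0].message.content)

# The weighting happens here, in our code, so it is exact and auditable.
final = sum(per_criterion[c]["score"] * w for c, w in WEIGHTS.items())
for c, w in WEIGHTS.items():
    print(f"{c} ({w:.0%}): {per_criterion[c]['score']} - {per_criterion[c]['rationale']}")
print(f"Final weighted score: {final:.2f}")
```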


The best option I can think of is refining the prompt to be very repeatable, and then possibly having a secondary prompt explain the reasoning. In an ideal world, you’d do multiple iterations for each transcript and take an average or something like that. That gets expensive, though.
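Something like this sketch, assuming your prompt already returns a single numeric score (`call_model` here is a stand-in for whatever wraps your GPT call):

```python
import statistics

def averaged_score(call_model, prompt: str, n: int = 5) -> tuple[float, float]:
    """Run the same scoring prompt n times; report the mean and the spread.

    call_model is a stand-in for whatever function wraps your GPT call
    and parses a single numeric score out of the response.
    """
    runs = [call_model(prompt) for _ in range(n)]
    return statistics.mean(runs), statistics.stdev(runs)

# Report the mean as the score; a high standard deviation flags transcripts
# the model is not scoring consistently (worth a human look).
```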

Oh, it’s token-heavy, lol. So far it’s all very repeatable; we’re just iterating on the prompt in the scoring guide, trying to think of ways to make it better.


I assume you’ve already gone for the obvious stuff?

“You’re a professional transcript analyser, blah blah blah”

Would giving it an example or two help, or is that built into its knowledge?

I don’t want to give away too much, but there are several different metrics we analyze, and each has its own prompt and GPT call.

The idea of using GPT as a “scoring system” is still in its infancy. It’s not just “you’re a professional transcript analyzer” but more in-depth for each call.
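I can share the generic skeleton, at least. The metric names and prompts here are stand-ins, not our real ones:

```python
from openai import OpenAI

client = OpenAI()

# Stand-in metric names and prompts; one focused system prompt per metric.
METRIC_PROMPTS = {
    "Active Listening": "You are a professional transcript analyzer. Score only active listening ...",
    "Validation": "You are a professional transcript analyzer. Score only validation ...",
}

def score_transcript(transcript: str) -> dict[str, str]:
    """One GPT call per metric, so each prompt stays small and focused."""
    results = {}
    for metric, system_prompt in METRIC_PROMPTS.items():
        resp = client.chat.completions.create(
            model="gpt-4",  # placeholder model name
            temperature=0,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": transcript},
            ],
        )
        results[metric] = resp.choices[0].message.content
    return results
```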


Interesting. That obviously makes far more sense than having it attempt to jam it all into one response.

Anything else you can give away would be very helpful lol

You’re on your own, kiddo.



'Twas worth a try. Let us know if you have any success!
