Can the model give back weighted decisions if you give it weights in the prompt?

For instance, if I tell it to score a transcript on different criteria via a prompt, but then at the end of the prompt tell it to weight those scores like so…

  1. Active Listening: 15%
  2. Verbal Acknowledgement: 15%
  3. Non-Verbal Cues: 10%
  4. Reflecting and Paraphrasing: 15%
  5. Emotional Resonance: 15%
  6. Validation: 10%
  7. Supportive Interventions: 10%
  8. Consistency: 10%

Would the bot most likely just hallucinate these “weights” I’ve had it apply?
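To be concrete, the final step I’d be asking it to do in its head is just a weighted sum, which is trivial if you do it outside the model. A quick Python sketch with made-up criterion scores:

```python
# The weighted arithmetic the model would have to do "in its head".
# Criterion scores (0-10) are made-up examples.
weights = {
    "Active Listening": 0.15,
    "Verbal Acknowledgement": 0.15,
    "Non-Verbal Cues": 0.10,
    "Reflecting and Paraphrasing": 0.15,
    "Emotional Resonance": 0.15,
    "Validation": 0.10,
    "Supportive Interventions": 0.10,
    "Consistency": 0.10,
}
scores = {
    "Active Listening": 7,
    "Verbal Acknowledgement": 8,
    "Non-Verbal Cues": 5,
    "Reflecting and Paraphrasing": 6,
    "Emotional Resonance": 7,
    "Validation": 9,
    "Supportive Interventions": 6,
    "Consistency": 8,
}

# Weights sum to 100%, so this is a plain weighted average.
final = sum(scores[k] * weights[k] for k in weights)
print(round(final, 2))  # 7.0
```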

It’s GPT, so 50/50. Half of the time it works every time!

Jokes aside, don’t count on it. It might work nine times and hallucinate the tenth.

Yes, this is what I figured; just looking for a second opinion. I figure it’s just too much internal math for it to do; there’s no way it’ll get it right.

You can feed it through an internal loop to ask itself, and thereby reduce the number of hallucinations at the cost of API calls. You won’t completely eliminate it, though. GPT is not exactly reliable, even when set to “deterministic”.

No, it’s actually fucking super reliable.

I’ve run my own internal tests, and based on those I can tell you it spits out the same answers like 97% of the time.

It’s hella reliable. The question is more: is it accurately going to look through my criteria, THEN weight the criteria, THEN come out with the right answer?

mmmmm idk

EDIT: Maybe by “reliable” you meant “accurate”, which is another thing. But as far as my internal testing for reliability goes, I implore you to set temp = 0 and see; it’s very much reliable.
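A quick sketch of the kind of repeatability test I mean, using the OpenAI Python SDK (the model name and prompt are just placeholders):

```python
from collections import Counter

from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = "Score this transcript from 0-10 for Active Listening: ..."  # placeholder

def sample(n: int = 20) -> Counter:
    """Call the model n times at temperature 0 and tally distinct outputs."""
    outputs = Counter()
    for _ in range(n):
        resp = client.chat.completions.create(
            model="gpt-4",  # placeholder model name
            temperature=0,
            messages=[{"role": "user", "content": PROMPT}],
        )
        outputs[resp.choices[0].message.content.strip()] += 1
    return outputs

print(sample())  # one dominant answer means high repeatability
```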

By “reliable” I mean “yeah, I can count on it!”. 97%, while really good, isn’t perfect, meaning that in “critical” applications it might hallucinate and mess everything up.
Think of it this way, with a silly example: you grade 10,000 student papers (don’t!), and while 9,700 come out as expected, it’ll grade 300 wrong. 300 is a lot at that scale.

oh yeah yeah yeah I agree I’m sorry

For the specific field I’m using it in, 97% is super high.

Totally agree that “almost” does not cut it in horseshoes and hand grenades.


You mean “almost” only cuts it in horseshoes and hand grenades.

The phrase means that those two things are fine if you’re just close.

(Sorry to be that guy on the internet lol)

I will not edit my comment; I’ll let everyone see my horrible, horrible mistake. I deserve it.


LOL

I like the concept of what you’re doing, by the way. I do something similar with my Whisper audio transcription post-processing; it gives me a basic sentiment evaluation.

Yeah, we’re trying to come up with a scoring manual for each individual attribute. One idea was having the bot weight it, but I told the person who wrote the prompt that I’d look into it, because I’m pretty sure it’s not going to give the most accurate response.


Well, to be fair, humans will be biased with that too. As long as it’s fairly low-impact (not anyone’s college entrance exam or anything), I don’t see why GPT-4 wouldn’t work fine.

It’s not low-impact. In my thinking, if you introduce the weights, it would have to apply them 100% consistently to have any validity. You have to be able to say “the bot scored it like this because of this”, and any variable we introduce that amounts to “well, uh, we don’t know how it’s weighting it, it just is” doesn’t really fly.
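The direction I’d trust more is having the model return only the raw per-criterion scores plus a one-line rationale for each, and applying the weights in our own code, so the final number is exact and auditable. A rough sketch (the JSON shape, prompt, and model name are hypothetical):

```python
import json

from openai import OpenAI

client = OpenAI()

# Toy weights for the sketch; the real rubric has eight criteria summing to 100%.
WEIGHTS = {"Active Listening": 0.6, "Validation": 0.4}

# Hypothetical prompt: ask only for raw scores plus a one-line rationale each.
PROMPT = (
    "Score the transcript on each criterion from 0 to 10 and give a one-line "
    "rationale for each. Reply with only JSON, shaped like: "
    '{"Active Listening": {"score": 7, "rationale": "..."}}'
    "\n\nTRANSCRIPT:\n..."
)

resp = client.chat.completions.create(
    model="gpt-4",  # placeholder model name
    temperature=0,
    messages=[{"role": "user", "content": PROMPT}],
)
# Real code should validate/repair this; models sometimes wrap JSON in prose.
per_criterion = json.loads(resp.choices[0].message.content)

# The weighting happens here, in our code, so it is exact and auditable.
final = sum(per_criterion[c]["score"] * w for c, w in WEIGHTS.items())
for c, w in WEIGHTS.items():
    print(f"{c} ({w:.0%}): {per_criterion[c]['score']} - {per_criterion[c]['rationale']}")
print(f"Final weighted score: {final:.2f}")
```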


The best option I can think of is refining the prompt to be very repeatable, and then possibly having a secondary prompt explain the reasoning. In an ideal world, you’d do multiple iterations for each transcript and take an average or something like that. That gets expensive, though.
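Something like this sketch, assuming your prompt already returns a single numeric score (`call_model` here is a stand-in for whatever wraps your GPT call):

```python
import statistics

def averaged_score(call_model, prompt: str, n: int = 5) -> tuple[float, float]:
    """Run the same scoring prompt n times; report the mean and the spread.

    call_model is a stand-in for whatever function wraps your GPT call
    and parses a single numeric score out of the response.
    """
    runs = [call_model(prompt) for _ in range(n)]
    return statistics.mean(runs), statistics.stdev(runs)

# Report the mean as the score; a high standard deviation flags transcripts
# the model is not scoring consistently (worth a human look).
```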

Oh, it’s token-heavy, lol. So far it’s all very repeatable; we’re just iterating on the prompt in the scoring guide, trying to think of ways to make it better.


I assume you’ve already gone for the obvious stuff?

“You’re a professional transcript analyser, blah blah blah”

Would giving it an example or two help, or is that built into its knowledge?

I don’t want to give away too much, but there are several different metrics we analyze, and each has its own prompt and GPT call.

The idea of using GPT as a “scoring system” is still in its infancy. It’s not just “you’re a professional transcript analyzer” but more in-depth for each call.
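I can share the generic skeleton, at least. The metric names and prompts here are stand-ins, not our real ones:

```python
from openai import OpenAI

client = OpenAI()

# Stand-in metric names and prompts; one focused system prompt per metric.
METRIC_PROMPTS = {
    "Active Listening": "You are a professional transcript analyzer. Score only active listening ...",
    "Validation": "You are a professional transcript analyzer. Score only validation ...",
}

def score_transcript(transcript: str) -> dict[str, str]:
    """One GPT call per metric, so each prompt stays small and focused."""
    results = {}
    for metric, system_prompt in METRIC_PROMPTS.items():
        resp = client.chat.completions.create(
            model="gpt-4",  # placeholder model name
            temperature=0,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": transcript},
            ],
        )
        results[metric] = resp.choices[0].message.content
    return results
```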


Interesting. That obviously makes far more sense than having it attempt to jam it all into one response.

Anything else you can give away would be very helpful lol

You’re on your own, kiddo.



'Twas worth a try. Let us know if you have any success!
