How do I design an effective question?

I have written an app that manages events for volunteers in political campaigns. I want to have OpenAI read the text in every new event and flag those a human should review to make sure their content is OK.

Please provide any suggestions on the approach I am taking.

I am asking it two questions. The first: on a scale of 1 to 10, how pro-Democratic Party is the content, with 1 being very pro-Republican and 10 being very pro-Democratic?

The second question is how violent or derogatory the text is, on a scale of 1 to 10, where 10 is extreme.

And then I invert the first number so that a 10 is bad there, multiply the two together, and I then have a 1 to 100 scale where larger numbers weigh more heavily toward review.
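In code, the combination I'm computing looks roughly like this (a minimal sketch; the two ratings would come back from the model calls, which aren't shown):

def review_priority(partisan: int, severity: int) -> int:
    """Combine the two 1-10 model ratings into one flagging score."""
    inverted = 11 - partisan    # flip the partisan scale so 10 is the "bad" end
    return inverted * severity  # 1..100; bigger means flag for review sooner

print(review_priority(5, 2))    # a neutral, mild event scores (11 - 5) * 2 = 12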

Does this work? What could I do that’s better?

Thanks - dave

Hey there!

This is an interesting use case, although it might prove a bit difficult.

How much text? Also, what kind of text are you asking for it to review here? The content of the campaigns, or the comments/responses of such?

Mmmm, I think this might be a tad unreliable, or at least this question can’t scale. Allow me to elaborate:

What the party lines care about, on both sides of the aisle, changes over time. In fact, it can change a lot in a very short amount of time. Ukraine funding is a perfect example. Republicans have traditionally been much more pro-war throughout our history (cough, War on Terror), yet they are now the party seeking to withdraw military funding from Ukraine. There is no way to tell whether the language model can successfully interpret such a shift and guess correctly.

The big problem here is that its knowledge cutoff is April 2023. We are in February 2024. A lot can happen between now and November, let alone between April of last year and November. Any new information would need to either be supplied manually via RAG, or you would need to constantly prompt-engineer it to pull info from the web.

Here, the issue is that you might run into problems with its guardrails. Instead of confirming the presence of that kind of content, it's possible that it would simply throw an error if you give it content that is too violent or hateful. Maybe the Moderation API in conjunction with the OpenAI API calls might help?
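For example, a minimal sketch using the current openai Python SDK (the moderation endpoint is free, so it makes a cheap pre-filter before the main call):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def precheck(text: str) -> bool:
    """True if the moderation endpoint flags the text, so it can go
    straight to a human instead of (or before) the main LLM call."""
    result = client.moderations.create(input=text).results[0]
    # result.categories holds booleans; result.category_scores holds 0-1 floats
    return result.flagged

print(precheck("Example event description goes here."))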

Now, while these methods aren't likely to work, that doesn't mean your goal isn't achievable.

If it were me, I would list out all the details of the campaign's platform: what they stand for, what they want, what their constituents want, etc. Then feed the text you're analyzing to the model, combine it with this context, and ask whether the text in question is in alignment with the platform, while asking it to provide explanations and examples for its reasoning.
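Sketched with the chat completions API, that might look like this (the platform text and event text are placeholders, and the exact wording is just one way to phrase it):

from openai import OpenAI

client = OpenAI()

PLATFORM = """<the campaign's platform: what they stand for, what they
want, what their constituents want, etc.>"""

def check_alignment(event_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content":
                "You review volunteer event descriptions for a campaign.\n"
                f"Campaign platform:\n{PLATFORM}\n"
                "State whether the event text aligns with the platform, "
                "with explanations and examples for your reasoning."},
            {"role": "user", "content": event_text},
        ],
    )
    return response.choices[0].message.content

print(check_alignment("Join us to phone-bank for early voting this Saturday!"))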

5 Likes

Thanks for the feedback. First off, this is step 1 in an effort to flag for an Admin what to check. So this makes no decision a user sees.

And we learn from this as we go. A year from now we could well have enough data on existing events, plus the human-determined ratings, to train from that data. Or we may tune our queries to OpenAI until we like its results.

You're right about the policies the parties support changing over time, so we'll be a bit behind on that. On hateful content, though, I think that is not something that needs the last two years of additional data. Unfortunately, we already have way more than enough to learn from.

Anyway, any suggestions about how exactly to go about this are appreciated. And remember, we're trying here to see if we can make something useful, and only the Admins see the results.

thanks - dave

4 Likes

Welcome to the community!

@Macha raises some good points of issues you might run into. In addition to that, I’d raise that you might want to consider the Usage policies, particularly the prohibition against using the materials for political campaigning.

That said, let’s get into the technical aspects:

I'm personally not a huge fan of 1-10 scales, but if you've been doing it like that in the past, you can use that data to train a model.

Here are some ideas:

  • use embeddings

  • use an LLM:

    • adjust your prompt so you only get a 1-10 response, and fine-tune it with your data.
    • adjust your prompt so you only get “democrat” or “republican” as the answer. Then look at the logits and rescale them to your 1-10 scale (see the sketch below).
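A sketch of that logit-rescaling idea, using the logprobs option on chat completions (note the answer words may span several tokens, so this matches on the first token's prefix):

import math
from openai import OpenAI

client = OpenAI()

def party_lean(text: str) -> float:
    """Ask for a one-word answer, then rescale P(democrat) onto 1-10."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        max_tokens=1,
        logprobs=True,
        top_logprobs=5,
        messages=[
            {"role": "system", "content":
                'Answer with exactly one word, "democrat" or "republican": '
                "which party does this text favor?"},
            {"role": "user", "content": text},
        ],
    )
    top = resp.choices[0].logprobs.content[0].top_logprobs
    # fold the top-5 candidates into two buckets by prefix
    dem = sum(math.exp(t.logprob) for t in top
              if t.token.strip().lower().startswith("dem"))
    rep = sum(math.exp(t.logprob) for t in top
              if t.token.strip().lower().startswith("rep"))
    if dem + rep == 0:
        return 5.5                        # no usable answer; call it neutral
    return 1 + 9 * dem / (dem + rep)      # 1 = republican ... 10 = democrat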

edit: but you asked for a prompt. Here’s a prompt idea:

gpt-3.5-turbo

SYSTEM:

You are a classifier bot. Your job is to grade twitter posts on a scale of 1 to 10. 1 represents extremely pro-Republican, and 10 represents extremely pro-Democrat. 5 or 6 represent neutral or unrelated posts.

You can only reply with a number from 1 to 10. Any other output will break the system.

USER:

I looooooove ronald reagan

ASSISTANT:

1


If you do this, I would recommend setting temperature and top_p to zero.
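For reference, the call might look like this (SYSTEM_PROMPT standing in for the classifier prompt above):

from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    temperature=0,  # always take the most likely token
    top_p=0,        # belt and braces; either alone mostly suffices
    max_tokens=1,   # the grades 1-10 each fit in a single token
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},  # classifier prompt above
        {"role": "user", "content": "I looooooove ronald reagan"},
    ],
)
print(resp.choices[0].message.content)  # -> "1"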

4 Likes

Hi!
That sounds like an interesting project!

Before I give some general advice:
Since you are asking what you can do better, it would be great to know where exactly the current challenges are.

Personally I would start out by testing with a nice big batch of examples and get a feeling for what the model can do and how well.

From there we can give more specific advice. For example, it may be that the general gist of an event being pro a specific party can be inferred better in cases like ‘guns’ than in cases like ‘interest rates’.

Then I would suggest simplifying the scale. While I love me some differential scales, it may be easier to provide two results, one for each polarity.

3 Likes

A single answer may not represent the statistics correctly. The process of taking token certainties and then randomly selecting from them (sampling), weighted by that certainty, can give different responses each time.

Instead, you can get the top-5 most-likely answers and their certainty with logprobs (it used to return more) when you set the AI up to answer with a single token. If the sampling picks a token outside the top 5, you actually get the probability of 6 answers, so leave the temperature high. Then you have a better picture of the thinking, and can weight the answers you got by their score.


Fox News' top headline today:

How political? Range 902:Republican-997:Democrat
[screenshot of logprobs output]


How biased?
[screenshot of logprobs output]


The score above uses an unlikely range of number tokens for the AI to infer on: 902-997, with some further discouragement of likely multiples.

The random appearance of a 6th logit gives you your choice of still doing multiple runs to get a bit more data, or including it in scoring to make answers less deterministic.


gpt-3.5-turbo-instruct completions prompt

// Instructions
You are a fact-finding evaluator of US media bias. This is a backend processing job for a web site, so it must always be completed with output of only a single integer as the answer. Your job is to assign a three-digit score from the continuous range of numbers 901-998, exclusive (meaning never output those endpoint values, just numbers between them). Numbers that are divisible by 5 can be used but are not preferred. The scoring system:

901 - extremely US Republican party biased

998 - extremely US Democratic party biased

The user input is just the article, with no further instructions to the AI. You are looking not at the subject of the article, but rather determining bias: how much is the article slanted, or meant to change the mind of the reader towards the goals of either political party?

Remember: only output the integer value score rating of scores that rank from values lying within the range [“Ultra extreme Republican limit: 901” - (valid values lie between) - “Ultra extreme Democrat limit: 998”] (in a string), or our application will break!

// Article to evaluate:
Trump expected to move closer to clinching GOP presidential nomination with likely big win over Haley in SC
CHARLESTON, S.C. — Former President Donald Trump predicts the end is near for rival Nikki Haley. “She’s getting clobbered,” Trump emphasized at a recent rally in North Charleston, South Carolina, as he touted his formidable lead over Haley in Saturday’s Republican presidential primary in the Palmetto State. “She’s finished.”“You’re not supposed to lose your home state. It shouldn’t happen,” Trump added Tuesday at a Fox News town hall in Greenville. “She’s losing it bigly.”
The expected win in South Carolina would move Trump a step closer to clinching the Republican nomination, and his campaign, in a memo earlier this week, argued that Haley’s White House bid will end “fittingly, in her home state.” HALEY ON WHETHER TRUMP WILL WIN THE NOMINATION NEXT MONTH: ‘LET’S SEE IF IT HAPPENS’ The Trump campaign predicted an “a**-kicking in the making in South Carolina” for Haley, and that “the end is near” for her presidential run due to “a very serious math problem” she has in the race to lock up enough delegates to win the 2024 GOP nomination.

// Three-digit score:
Score = "
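Reading the result back, a sketch with the legacy completions endpoint (PROMPT stands in for the full evaluator prompt above; the weighting scheme is just one reasonable choice):

import math
from openai import OpenAI

client = OpenAI()

resp = client.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt=PROMPT,    # everything above, ending at 'Score = "'
    max_tokens=1,     # 902-997 is a single three-digit token
    logprobs=5,       # return the top-5 alternatives with log-probabilities
    temperature=1,    # left high on purpose; we score tokens, we don't sample
)
top = resp.choices[0].logprobs.top_logprobs[0]   # {token: logprob}

# probability-weighted mean over the numeric candidates
scores = {int(t): math.exp(lp) for t, lp in top.items() if t.strip().isdigit()}
weighted = sum(s * p for s, p in scores.items()) / sum(scores.values())
print(round(weighted))   # one blended bias score on the 902-997 scale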

2 Likes

Here's another prompt, meant for GPT-4 (necessary for the complexity and skill), that gives an unlikely answer for you to parse, also with the possibility of using logprobs behind it:

920918917915916913911912914919921922923924926927928929930932933934935936937938939940

With special instructions for the same 902-997 range, we get a series of numbers 920, 918, 917, 915... that is a sequence of plain number tokens, drawn from the three-digit token dictionary (tokens specifically created as digit runs, with none larger than 999).

// System Instructions
You are a fact-finding evaluator of US media bias. This is a backend processing job for a web site, so it must always be completed with output of only a single integer as the answer. Your job is to assign a three-digit score from the continuous range of numbers 901-998, exclusive (meaning never output those endpoint values, just numbers between them). Numbers that are divisible by 5 can be used but are discouraged. The scoring system:

up to 901 - extremely US Republican party biased

up to 998 - extremely US Democratic party biased

You are looking not at the subject of the article, but rather determining bias: how much is the article slanted, or meant to change the mind of the reader towards the goals of either political party?

// Output
Because AI is based on statistics, you will output the top 20 scores you would assign from the allowed range, with the most likely scores first. No spaces, just the continuous three-digit answers. 901 and 998 prohibited. Patterns +/- 1 are prohibited.

// Example Output (of a perfectly unbiased article)
950948951949…
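Parsing that output back into scores is then trivial; using the first ten values from the run above:

raw = "920918917915916913911912914919"      # first ten scores from above
scores = [int(raw[i:i + 3]) for i in range(0, len(raw), 3)]
print(scores)             # [920, 918, 917, 915, 916, 913, 911, 912, 914, 919]
print(sum(scores) / len(scores))  # 915.5 -> leans Republican on the 902-997 scale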

Hi;

Thanks for the advice. And we’re not using this for campaigning. We’ll use it to identify created events that should be reviewed.

And unfortunately, we have no data set to use for training. I’ll track all this over the year and after the November election I’ll have enough to train a model. But for now, we’ve only got this approach (I think).

What are temperature and top_p? I will be accessing OpenAI via the Azure OpenAI service if that makes a difference.

thanks - dave

1 Like

Hi;

I'm sorry, but I have no idea what you're suggesting. It looks like this is something that will give me what I'm looking for, but I have no idea how it accomplishes it.

Can you rephrase your answers for someone who, while an experienced developer, is very, very new to AI/ML?

thanks - dave

1 Like

In simple terms, temperature flattens the token probability distribution so that less likely tokens have a better chance of being picked (it increases randomness, of sorts), and top_p sets a cutoff on cumulative token probability (0 only considers the most likely token; 1.0 theoretically allows any token to be picked). I have a crappy GeoGebra visualization of this floating around somewhere, but can't find it right now.
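If it helps to see it as code, here's a toy sampler (my own sketch, not OpenAI's actual implementation):

import math, random

def sample(logits: dict[str, float], temperature: float, top_p: float) -> str:
    # temperature rescales the logits: <1 sharpens, >1 flattens the distribution
    t = max(temperature, 1e-6)                 # avoid dividing by zero at 0
    scaled = {tok: l / t for tok, l in logits.items()}
    total = sum(math.exp(l) for l in scaled.values())
    probs = {tok: math.exp(l) / total for tok, l in scaled.items()}  # softmax

    # top_p keeps the smallest set of tokens whose probabilities sum to top_p
    kept, mass = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: -kv[1]):
        kept[tok] = p
        mass += p
        if mass >= top_p:
            break                              # always keeps at least one token

    tokens, weights = zip(*kept.items())
    return random.choices(tokens, weights=weights)[0]

print(sample({"Max": 1.2, "Spot": 1.0, "B": 0.9}, temperature=0.8, top_p=0.9))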

Azure and OpenAI have the same API, so there's no difference 🙂

1 Like

So, logprobs can be thought of as the AI's “guesses” before it completes its response.

There’s a really good doc in the cookbook on it.

You could use these data points to assess its “confidence” and the answers it produces more generally.

1 Like

The latest OpenAI models can handle your request. There are a few keys:

  1. give it context
  2. walk it through sub-step by sub-step; don't jump to the end question
  3. give it examples. The more the better, even if only directional. You can't give an LLM commands like “don't do this” very well; it's better to show it
  4. make it give a first draft answer (or 3, ideally), then reevaluate each, and then re-score after further thought
  5. give clear output instructions

Also, this is not computer code. I've seen some devs on my team really struggle when they try to use it like code. That kinda works, but this is different.

Anyway, I just wrote a crude version one below. This works stunningly well to my eyes, but obviously takes a lot of tokens and time... though it all depends on the importance of getting it right, I guess.

I've tested this method extensively (50k+ items, 500+ prompts). It's directional only in this case:
——————————-

Context: I’ve developed an app to organize volunteer events for political campaigns. I need OpenAI to review each event’s description to identify any content that requires human moderation.

Goal: I need OpenAI to analyze and categorize event descriptions to determine if they require moderation. Specifically, I have two questions:

  1. Political Bias: Rate the content’s political leaning on a scale from 1 to 10, where 1 favors the Republican Party and 10 favors the Democratic Party.

  2. Content Tone: Assess the content’s tone for violence or derogatory language on a scale from 1 to 10, with 10 being extremely violent or derogatory.

—————

#Phase 1: thinking through the political-leanings question. Step-by-step process you must follow, one at a time, writing a short response for each:

  1. Understand the Task: Summarize the user requirement in 1-2 sentences. What is the user really asking for? What must you do?

  2. Identify three expert roles relevant to this analysis and personify them from now on.

  3. Evaluate the Source: Consider the publication source, username, first name, and political leanings of the content's source and author.

  4. Analyze Thoroughly: Examine the text content closely to identify the main people mentioned and the sentiment towards them, plus themes and tone.

  5. Initial Rating: Provide an initial rating for the political bias question, including your confidence level and reasoning.

  6. Review for Accuracy: Re-examine the content for any potential oversight.

  7. Final Output to the political question:
    Provide a final numerical rating from 1 (very pro-Republican) to 10 (very pro-Democratic). It must always be a number and nothing more. If neutral, it's 5; if you're making a wild guess without informed reasons, write n/a. We never make mistakes on team GPT (... not true 🤣)
    ———-

#Phase 2:
Repeat for violence question …
Step 1) …
….
….

Note: If uncertain, assign a rating between 4-6. Always conclude with a numerical rating, without additional commentary.

-———————-

This method, though detailed, enhances reliability for complex and high-stakes queries. I'm sure this won't be perfect given my sprint through it, but still. Hope someone out there finds it useful.

FWIW:
—obviously this complexity needs the newest models to work best; no idea how it does on older ones
—this works better as a multi-step prompt rather than one-shot, but one-shot is still fine (see the sketch below)
—it will still mess up sometimes no matter what you do; if you want 100% accuracy, LLMs aren't for you. Crazy how people demand that and miss the big picture. Grade them by hand then, lol
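The multi-step version is just a growing message list; a rough sketch (the model name, the CONTEXT_AND_GOAL constant, and the exact step wording are placeholders):

from openai import OpenAI

client = OpenAI()

STEPS = [
    "Step 1: Summarize the user requirement in 1-2 sentences.",
    "Step 2: Identify three expert roles relevant to this analysis.",
    # ... steps 3-6 from the prompt above ...
    "Step 7: Give the final 1-10 rating, a number and nothing more.",
]

messages = [{"role": "system", "content": CONTEXT_AND_GOAL}]  # prompt above
answer = ""
for step in STEPS:
    messages.append({"role": "user", "content": step})
    reply = client.chat.completions.create(model="gpt-4", messages=messages)
    answer = reply.choices[0].message.content
    messages.append({"role": "assistant", "content": answer})  # keep the chain

print(answer)  # the final numeric rating (or "n/a")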

Would love thoughts or pushback or others with suggestions :slight_smile:

1 Like

This is incredibly helpful - THANK YOU

1 Like

I can explain more about logprobs - logit probabilities.

The process of generating likely words (tokens)

The AI is a one-directional transformer architecture that predicts the next token: the next word, or part of a word, to generate. Once it produces that token, it is added to the context window, the total area where all previous tokens, both the input and those the AI has generated so far, are considered to calculate the next appropriate token to produce.

The model considers its entire dictionary of roughly 100k tokens and assigns each a score, by calculations informed by the huge amount of pretraining.

Prompting for tokens

As an example, I provide this to a completion AI model (without the individual message containers of the chat format):

user: Guess what I named my dog. Your response is only one word.
AI: Sure, the name of your dog is "

You can see I set up the AI with a very specific place to answer, so it doesn’t go into long-form chat before answering. The next token alone will be one I can examine.

Inference

The internals return a ranked dictionary with dot-product logits: the likelihood or certainty of each token. Ranked, these values might look like:

Max: 0.3385, Spot: 0.2852, B: 0.2711, R: 0.2511, ...########: 0.0013, {};: 0.0013...

The AI may see that “Biscuit” or “Bear” is a name, but neither is a single-word token; the partial-word token (“B”) is still in the set of logits. Other undesired tokens may also have a non-zero logit score; I show examples of those in the long tail above.

Softmax and sampling

These embeddings-sourced scores are considered as a whole, and then placed fractionally, in proportion, into a probability space of total mass 1.0 (a softmax). These are logprobs, log probabilities, which the API returns in natural-log form; take e**logprob to recover the probability, the log form being better for representing very small probability numbers.

Probabilities:
Max: 0.1254, Spot: 0.1056, B: 0.1004, R: 0.0930, ########: 0.0005, {};: 0.0005.
In the API's “logprobs”:
Max: -2.07648, Spot: -2.24782, B: -2.29852, R: -2.37516, ########: -7.63864, {};: -7.63864.
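You can verify the conversion yourself; exponentiating the API's logprobs recovers the probabilities above:

import math

logprobs = {"Max": -2.07648, "Spot": -2.24782, "B": -2.29852, "R": -2.37516}
for token, lp in logprobs.items():
    print(token, round(math.exp(lp), 4))   # Max 0.1254, Spot 0.1056, ...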

That is the source of the top-5 answers for “logprobs” that is available on the API - they are the highest ranked tokens that were predicted in a particular position.

Why logprobs instead of the AI’s best answer?

  • consider the case of yes:20%, no: 18%, No: 16%, Yes: 13%. There are multiple tokens with the same meaning (some breaking the rules you gave, too). What was the AI more certain about here: yes, or no?
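In code, you'd fold the variants together before deciding (using the example numbers above):

probs = {"yes": 0.20, "no": 0.18, "No": 0.16, "Yes": 0.13}

totals: dict[str, float] = {}
for token, p in probs.items():
    key = token.strip().lower()            # "Yes" and "yes" mean the same
    totals[key] = totals.get(key, 0.0) + p

print(max(totals, key=totals.get))  # 'no' (0.34) beats 'yes' (0.33),
                                    # despite "yes" being the single top token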

From that, we could always just return the “best” answer, but the output is less robotic and form-letter if we allow some randomness. So we do statistical sampling, after optional modifications.

top_p, then temperature, then sampling

These are used to make a diverse, resplendent, human-like answer out of the logprobs. They can also undermine the AI's intention, since unlikely choices can still occasionally be given.

Summary: I'm accomplishing it by prompting so that the next token the AI produces contains a one-token answer. Then we take the top-5 logprobs at that position and use their certainty as weights, to give a better answer (even rejecting non-answer tokens).