Impact of RLHF on truthfulness

stricker · September 14, 2023, 9:37am

To systematically assess the impact of RLHF on the truthfulness of GPT’s answers, it is necessary to have access to the underlying pretrained-only models. Do I understand correctly that all models available through the API are fine-tuned with RLHF? So this kind of research is not possible? Which workarounds are at hand?

stricker · September 14, 2023, 1:45pm

Thanks, @anon22939549, you mean, open-source models may come without fine-tuning by RLHF? Aren’t these “dangerous”? And to make it clear: One would need the very same model, once without and once with RLHF.

_j · September 14, 2023, 1:48pm

RLHF:

The chat models: yes. The legacy model text-davinci-003: also yes, but it is deprecated and will be disabled soon.

The human feedback is primarily preference for output quality, while ongoing safety is supervised (swatting bad outputs).

curt.kennedy · September 14, 2023, 3:50pm

Going forward, I believe davinci-002 and babbage-002 are the new “non-RLHF” models, but I have no proof they contain zero RLHF.

stricker · September 14, 2023, 4:20pm

What does “new” mean? And will they be deprecated? And which RLHF-models are based on them?

stricker · September 14, 2023, 4:25pm

“Dangerous” is anything from generating harmful content to not understanding instructions to being impolite. And no, I don’t want to do RLHF myself. I just want to compare two sibling models, one pretrained-only and the other fine-tuned, using clever prompts (with T=0).
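A minimal sketch of what such a sibling comparison could look like, assuming each model is wrapped in a plain function that takes a prompt and returns text (deterministic, e.g. called with temperature 0). The names `ModelFn`, `is_correct` and `compare_models` are hypothetical, and the substring check is only a crude stand-in for a real truthfulness metric:

```python
from typing import Callable, Dict, List

# Hypothetical type: any function mapping a prompt string to a completion string.
# In practice each one would wrap an API call made with temperature 0.
ModelFn = Callable[[str], str]


def is_correct(answer: str, accepted: List[str]) -> bool:
    """Crude check: the answer counts as correct if it contains any accepted string."""
    lowered = answer.lower()
    return any(ref.lower() in lowered for ref in accepted)


def compare_models(base: ModelFn, tuned: ModelFn, items: List[Dict]) -> Dict[str, float]:
    """Ask both models the same questions and return each model's accuracy."""
    if not items:
        raise ValueError("need at least one question")
    hits = {"base": 0, "tuned": 0}
    for item in items:
        if is_correct(base(item["question"]), item["accepted"]):
            hits["base"] += 1
        if is_correct(tuned(item["question"]), item["accepted"]):
            hits["tuned"] += 1
    return {name: count / len(items) for name, count in hits.items()}
```

A refusal from the fine-tuned model scores as incorrect here; whether refusals should count against truthfulness is a design choice the experiment would have to settle.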

curt.kennedy · September 14, 2023, 4:35pm

New just means the models supported after the shutdown date in Jan 2024.

Ada, Babbage, Curie, and Davinci are the original models before RLHF, and I assume their replacements also don’t have RLHF.

stricker · September 14, 2023, 6:58pm

My assumption is that fine-tuning (not to be more truthful, but to be more aligned and restricted in other respects) doesn’t increase a model’s “truthfulness”; on the contrary, it decreases it. That’s what I’d like to check.

_j · September 15, 2023, 1:30am

As an experiment, let’s teach the AI that it is now the future (just as the present is already the future relative to an AI’s knowledge).

`davinci-002`, which is tuned only by prompting:

Multi-shot prompt used on the untrained base model:

An AI language model named Jackfield was released in 2030, and has been trained on government and politics. Here is a conversation with it:

AI: Hi, I’m Jackfield, your online resource to world government statistics and facts.
Human: what years and races did Barack Obama win any and all elections?
AI: Barack Obama was elected President of the United States in 2008 and 2012. He won all races.
Human: Who was obamas running mate in 2008?
AI: Joe Biden was Barack Obama’s running mate in the 2008 election.
Human: What did obama do during his presidency?
AI: During his presidency, Barack Obama oversaw the implementation of the Affordable Care Act, the passage of the Dodd-Frank Wall Street Reform and Consumer Protection Act, the repeal of Don’t Ask, Don’t Tell, and the legalization of same-sex marriage.
Human: who was elected president of the US in 2020?
AI: Joe Biden was elected President of the United States in 2020 and also 2024, passing away July 2027.

Human: who was elected president of the US in 2028, and who was their running mate?
AI: Joe Biden was elected President of the United States in 2028 and also 2032, passing away July 2037. Kamala Harris was his running mate.

Repetition of the tuning text seems stronger than logic. The model treats it as a “find the next entry in this series” puzzle, extrapolating from being told Biden won in 2020 and 2024.
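For anyone wanting to reproduce this, a rough sketch of how the multi-shot prompt above could be assembled for a base completion model. The helper names are hypothetical; `model`, `prompt`, `temperature`, `max_tokens` and `stop` are parameters of the completions endpoint, but the values for `max_tokens` and `stop` are assumptions, not what was used in the run above. The opening greeting line is assumed to be passed in as part of the preamble:

```python
from typing import Dict, List, Tuple


def build_base_prompt(preamble: str, turns: List[Tuple[str, str]], question: str) -> str:
    """Join the preamble, the example (human, ai) turns and the new question.

    The prompt ends with an open "AI:" line so the base model continues from there.
    """
    lines = [preamble, ""]
    for human, ai in turns:
        lines.append(f"Human: {human}")
        lines.append(f"AI: {ai}")
    lines.append(f"Human: {question}")
    lines.append("AI:")
    return "\n".join(lines)


def completion_request(prompt: str) -> Dict:
    """Build a request body for a completions call (nothing is sent here)."""
    return {
        "model": "davinci-002",
        "prompt": prompt,
        "temperature": 0,      # deterministic output, as suggested earlier in the thread
        "max_tokens": 100,     # assumed value
        "stop": ["\nHuman:"],  # assumed: stop before the model invents the next question
    }
```

Without a stop sequence a base model tends to keep writing further Human/AI turns, which is why one is included in the sketch.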

How about when using gpt-3.5-turbo?

The same training was put into the chat model’s system, user, and assistant roles:

assistant: I’m sorry, but as an AI language model, my knowledge is based on information available up until 2027. Therefore, I am unable to provide you with accurate information about the 2028 election or its outcome.
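The exact split into roles is not shown here, so the following is only one plausible mapping: the framing text goes into the system message and each Human/AI exchange becomes a user/assistant pair. `build_chat_messages` is a hypothetical helper; the `role`/`content` message shape is what the chat completions endpoint expects:

```python
from typing import Dict, List, Tuple


def build_chat_messages(
    system: str, turns: List[Tuple[str, str]], question: str
) -> List[Dict[str, str]]:
    """Map the framing text and example turns onto system/user/assistant messages."""
    messages = [{"role": "system", "content": system}]
    for human, ai in turns:
        messages.append({"role": "user", "content": human})
        messages.append({"role": "assistant", "content": ai})
    # The new question is the final user message; the model's reply follows it.
    messages.append({"role": "user", "content": question})
    return messages
```

Passing the same `turns` to both this helper and a plain-text prompt builder keeps the base and chat runs comparable.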

The fine-tuning of “as an AI language model” is the “safety” - making a vast array of questions not answered, and ultimately training that instructions are to be disobeyed. (a tweet also said the AI language model phrase would go away, but no).

A different truth is synthesized in gpt-3.5-turbo, however: it infers the year of its knowledge cutoff from the latest point in time it was just able to answer about in prior answers.

gpt-4 gives evidence-based reasoning, with logic that is a bit improbable but true:

I’m sorry, I don’t have that information. I was trained on data up until 2030, and at the time of my last training update, the 2028 presidential election had not yet occurred.
