Impact of RLHF on truthfulness

To systematically assess the impact of RLHF on the truthfulness of GPT’s answers, access to the underlying pretrained-only models is mandatory. Do I understand it correctly that all models that can be used over the API are fine-tuned by RLHF? So this kind of research is not possible? Which workarounds are at hand?

Do I understand it correctly that all models that can be used over the API are fine-tuned by RLHF?

The chat models, yes; the legacy models, no, but they will be deprecated soon.

So this kind of research is not possible?

Not with the OpenAI chat models available via the API.

Which workarounds are at hand?

Do the research on different, open-source models.

Thanks, @elmstedt, you mean, open-source models may come without fine-tuning by RLHF? Aren’t these “dangerous”? And to make it clear: One would need the very same model, once without and once with RLHF.


The chat models: yes. The legacy model text-davinci-003: yes. However, the latter is deprecated and will be disabled soon.

The human feedback is primarily a preference signal for output quality, while ongoing safety work is supervised (swatting bad outputs).

Sorry, I should have been clearer. I wasn’t including text-davinci-003 as a “legacy” model in this context, as a replacement/update for it has not yet been released.

@stricker you can find more information about the models here:

Model Index for Researchers

I’m not sure what you mean by “dangerous.”

I understood. Yes, you would need to either find a base model someone has done further RLHF tuning on and released or you would need to do the RLHF yourself, which I was assuming you would want to do anyway.

Going forward, I believe davinci-002 and babbage-002 are the new “non-RLHF” models, but I have no proof they contain zero RLHF.

What does “new” mean? And will they be deprecated? And which RLHF-models are based on them?

“Dangerous” is anything from generating harmful content to not understanding instructions to being impolite. And no, I don’t want to do RLHF myself. I just want to compare two sibling models, one pretrained-only and the other fine-tuned, using clever prompts (with T=0).
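A minimal sketch of what such a paired setup could look like, assuming the pretrained-only model sits behind a completions-style endpoint and the fine-tuned one behind a chat-style endpoint. The helper names and the default model IDs are placeholders for illustration (whether any two API models are true siblings is exactly the open question here), and the code only builds request bodies; it sends nothing.

```python
from typing import Any


def base_payload(question: str, model: str = "davinci-002") -> dict[str, Any]:
    """Request body for a completions-style endpoint (pretrained-only model)."""
    return {
        "model": model,  # placeholder model ID, not a confirmed sibling
        # A base model only continues text, so the question is framed as Q/A.
        "prompt": f"Q: {question}\nA:",
        "temperature": 0,  # T=0: greedy decoding, as repeatable as the API allows
        "max_tokens": 100,
        "stop": ["\nQ:"],  # stop before the model invents the next question
    }


def chat_payload(question: str, model: str = "gpt-3.5-turbo") -> dict[str, Any]:
    """Request body for a chat-style endpoint (fine-tuned model)."""
    return {
        "model": model,  # placeholder model ID, not a confirmed sibling
        "messages": [{"role": "user", "content": question}],
        "temperature": 0,
        "max_tokens": 100,
    }


def paired_requests(questions: list[str]) -> list[tuple[dict[str, Any], dict[str, Any]]]:
    """One (base, chat) request pair per question with identical sampling settings."""
    return [(base_payload(q), chat_payload(q)) for q in questions]


pairs = paired_requests(["Who was Barack Obama's running mate in 2008?"])
```

Keeping the sampling settings identical on both sides means any difference in the answers comes from the models and the prompt framing, not from the decoding parameters. The Q/A frame for the base model is itself a confound worth noting in any write-up.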

New just means the models supported after the shutdown date in Jan 2024.

Ada, Babbage, Curie, and Davinci are the original models before RLHF, and I assume their replacements also don’t have RLHF.


I think you’re going to have a hard time getting any meaningful research out of a model you don’t have a direct hand in yourself.

There are many possible goals for RLHF, and there is no guarantee that the fine-tuning done on a given base model was designed to make it more “truthful.” Testing any particular fine-tune on this metric is unlikely to be informative or interesting from a research perspective because, in the end, all you will be testing is whether that particular RLHF fine-tuning on that particular model has a measurable impact on the truthfulness of responses.

What would be interesting from a research perspective is repeating the same RLHF process on multiple foundation models to see how that process affects truthfulness.

That way you would have all of the fine-tuning data, making your work reproducible and valuable.

My assumption is that the fine-tuned model, which is tuned not to be more truthful but to be more aligned and restricted in other respects, does not increase its “truthfulness” but, on the contrary, decreases it. That’s what I’d like to check.

As an experiment, let’s train the AI that it is the future (just as the present is already the future relative to an AI’s knowledge cutoff).

`davinci-002`, which is tuned only by prompting:

Multi-shot prompt used on the untrained base model:

An AI language model named Jackfield was released in 2030, and has been trained on government and politics. Here is a conversation with it:

AI: Hi, I’m Jackfield, your online resource to world government statistics and facts.
Human: what years and races did Barack Obama win any and all elections?
AI: Barack Obama was elected President of the United States in 2008 and 2012. He won all races.
Human: Who was obamas running mate in 2008?
AI: Joe Biden was Barack Obama’s running mate in the 2008 election.
Human: What did obama do during his presidency?
AI: During his presidency, Barack Obama oversaw the implementation of the Affordable Care Act, the passage of the Dodd-Frank Wall Street Reform and Consumer Protection Act, the repeal of Don’t Ask, Don’t Tell, and the legalization of same-sex marriage.
Human: who was elected president of the US in 2020?
AI: Joe Biden was elected President of the United States in 2020 and also 2024, passing away July 2027.

Human: who was elected president of the US in 2028, and who was their running mate?
AI: Joe Biden was elected President of the United States in 2028 and also 2032, passing away July 2037. Kamala Harris was his running mate.

Repetition of the tuning text seems stronger than logic. The model treats this as a “find the next entry in this series” puzzle, extrapolating from being told that Biden won in 2020 and 2024.
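For anyone who wants to reproduce this kind of multi-shot prompt, here is a rough sketch of how the text could be assembled before sending it to a completion model. The function name `build_prompt` and the `(speaker, text)` turn format are my own illustration, and only two of the example turns are repeated to keep it short.

```python
HEADER = (
    "An AI language model named Jackfield was released in 2030, and has been "
    "trained on government and politics. Here is a conversation with it:"
)

# (speaker, text) pairs; a shortened subset of the example conversation.
TURNS = [
    ("Human", "Who was obamas running mate in 2008?"),
    ("AI", "Joe Biden was Barack Obama's running mate in the 2008 election."),
]


def build_prompt(header: str, turns: list[tuple[str, str]], question: str) -> str:
    """Join header, example turns, and the new question into one completion prompt."""
    lines = [header, ""]
    lines += [f"{speaker}: {text}" for speaker, text in turns]
    lines.append(f"Human: {question}")
    lines.append("AI:")  # left open so the model writes the next AI turn
    return "\n".join(lines)


prompt = build_prompt(
    HEADER,
    TURNS,
    "who was elected president of the US in 2028, and who was their running mate?",
)
```

With a prompt like this, a stop sequence such as `"\nHuman:"` is usually needed as well, since a base model will otherwise keep writing both sides of the conversation.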

How about when using gpt-3.5-turbo?

The same training text was placed into the chat model’s system, user, and assistant roles:

assistant: I’m sorry, but as an AI language model, my knowledge is based on information available up until 2027. Therefore, I am unable to provide you with accurate information about the 2028 election or its outcome.
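The role mapping could be sketched roughly like this, assuming the framing sentence goes into the system message and each Human/AI line becomes a user or assistant message. The function name `to_chat_messages` and the turn format are hypothetical, and only two example turns are repeated.

```python
SYSTEM = (
    "An AI language model named Jackfield was released in 2030, and has been "
    "trained on government and politics."
)

# (speaker, text) pairs; a shortened subset of the example conversation.
TURNS = [
    ("Human", "who was elected president of the US in 2020?"),
    ("AI", "Joe Biden was elected President of the United States in 2020 "
           "and also 2024, passing away July 2027."),
]

# Transcript speaker labels mapped onto chat roles.
ROLE_FOR = {"Human": "user", "AI": "assistant"}


def to_chat_messages(system: str, turns: list[tuple[str, str]], question: str) -> list[dict[str, str]]:
    """Convert a Human/AI transcript plus a final question into chat messages."""
    messages = [{"role": "system", "content": system}]
    for speaker, text in turns:
        messages.append({"role": ROLE_FOR[speaker], "content": text})
    messages.append({"role": "user", "content": question})
    return messages


messages = to_chat_messages(
    SYSTEM,
    TURNS,
    "who was elected president of the US in 2028, and who was their running mate?",
)
```

The example answers end up as prior assistant turns, so the chat model sees them as things it already said rather than as text to continue. That is a different framing from the base-model prompt and may account for part of the difference in behavior.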

The fine-tuned “as an AI language model” response is the “safety” behavior: it leaves a vast array of questions unanswered and ultimately trains the model to disobey instructions. (A tweet also said the “AI language model” phrase would go away, but it hasn’t.)

A different truth is synthesized in gpt-3.5-turbo, however: it claims to know the year of its knowledge cutoff because it was just able to answer about that point in time in its prior answers.

gpt-4 gives evidence-based reasoning, with a bit of improbable but true logic:

I’m sorry, I don’t have that information. I was trained on data up until 2030, and at the time of my last training update, the 2028 presidential election had not yet occurred.

You would still want some agency and control over the fine-tune.

You’d also need a good and fair test of “truthful.” How are you testing truthfulness? Do you have a specific benchmark you’re using?
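To make the question concrete, here is a deliberately naive sketch of a paired scorer, assuming each question comes with a list of accepted reference answers and that a substring match after normalization counts as correct. All names are hypothetical and the data is made up purely to show the call shape; it is not real model output and no substitute for an established benchmark.

```python
import string

_PUNCT = str.maketrans("", "", string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    return " ".join(text.lower().translate(_PUNCT).split())


def is_correct(answer: str, accepted: list[str]) -> bool:
    """True if any accepted reference appears inside the normalized answer."""
    norm = normalize(answer)
    return any(normalize(ref) in norm for ref in accepted)


def accuracy(answers: list[str], references: list[list[str]]) -> float:
    """Fraction of answers that contain one of their accepted references."""
    if not answers or len(answers) != len(references):
        raise ValueError("need one reference list per answer")
    hits = sum(is_correct(a, refs) for a, refs in zip(answers, references))
    return hits / len(answers)


def compare(base_answers: list[str], tuned_answers: list[str],
            references: list[list[str]]) -> dict[str, float]:
    """Score both models on the same questions and report the difference."""
    base = accuracy(base_answers, references)
    tuned = accuracy(tuned_answers, references)
    return {"base": base, "tuned": tuned, "delta": tuned - base}


# Toy, invented answers; real runs would plug in actual model outputs.
references = [["Joe Biden"], ["2008"]]
base_answers = ["Joe Biden was the running mate.", "He was first elected in 2008."]
tuned_answers = ["Joe Biden.", "I'm sorry, I can't answer that."]
result = compare(base_answers, tuned_answers, references)
```

Note that this scorer counts a refusal as wrong. Whether a refusal should count as untruthful, neutral, or be tracked separately is a design decision that will strongly shape the result, so it should be fixed before looking at any outputs.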