I asked ChatGPT 3.5 whether it was trained with RLHF; it answered "no" and said that ChatGPT 3.0 was trained with RLHF.
Is this true? If so, why wasn't ChatGPT 3.5 trained with RLHF, and what about GPT-4.0?
gpt-3.5-turbo doesn’t know about gpt-3.5-turbo.
Let's go to gpt-4-1106-preview, with knowledge up to April 2023…
RLHF stands for Reinforcement Learning from Human Feedback. It is a technique used to train AI models, particularly in situations where the task is too complex for the cost function to be defined explicitly, or where human judgment is a critical component. RLHF incorporates human feedback into the reinforcement learning cycle to guide the model toward desired outcomes.
The approach involves several stages:
- Pre-training: The model is pre-trained on a large dataset with supervised learning to initialize its parameters with a broad understanding of the world.
- Fine-tuning with human feedback: Human trainers review and score the model’s outputs or compare different outputs, which is used to create a reward model. This feedback helps define what good model behavior looks like.
- Proximal Policy Optimization (PPO): This is a deep reinforcement learning algorithm used to optimize the model’s policy (i.e., the way it behaves or decides on actions), trying to maximize the expected human-generated rewards.
- Iteration: Steps 2 and 3 are repeated, with human feedback collected at each step to constantly refine the reward model and improve the policy (a toy sketch of this loop follows below).
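To make that loop concrete, here is a deliberately tiny sketch. Nothing in it reflects OpenAI's actual implementation: the "policy" is a softmax over four canned responses, the reward model is a Bradley-Terry fit to made-up preference pairs, the hyperparameters are arbitrary, and a plain REINFORCE-style update stands in for PPO.

```python
# Toy sketch of the RLHF loop described above: fit a reward model from pairwise
# human preferences, then nudge a policy toward higher-reward outputs.
# Deliberately simplified: the "policy" is a softmax over canned responses and
# the update is a REINFORCE-style gradient step, not full PPO.
import numpy as np

rng = np.random.default_rng(0)

RESPONSES = ["helpful answer", "vague answer", "rude answer", "off-topic answer"]
N = len(RESPONSES)

# Hypothetical human preference data: (winner_index, loser_index) pairs,
# i.e. "the labeler preferred response i over response j".
PREFERENCE_PAIRS = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]

def train_reward_model(pairs, steps=500, lr=0.1):
    """Fit per-response scores with a Bradley-Terry / logistic preference loss."""
    scores = np.zeros(N)
    for _ in range(steps):
        grad = np.zeros(N)
        for win, lose in pairs:
            # P(win preferred over lose) = sigmoid(score_win - score_lose)
            p = 1.0 / (1.0 + np.exp(-(scores[win] - scores[lose])))
            grad[win] += (1.0 - p)   # push winner's score up
            grad[lose] -= (1.0 - p)  # push loser's score down
        scores += lr * grad
    return scores

def policy_probs(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def rlhf_iteration(logits, reward, samples=256, lr=0.05):
    """One simplified policy-improvement step (REINFORCE with a baseline)."""
    probs = policy_probs(logits)
    picks = rng.choice(N, size=samples, p=probs)
    baseline = reward[picks].mean()
    grad = np.zeros(N)
    for a in picks:
        advantage = reward[a] - baseline
        # gradient of log-softmax for the sampled action: one_hot(a) - probs
        grad += advantage * ((np.arange(N) == a).astype(float) - probs)
    return logits + lr * grad / samples

# Stage 2: reward model from human comparisons.
reward = train_reward_model(PREFERENCE_PAIRS)

# Stages 3-4: repeatedly sample from the policy and update toward higher reward.
logits = np.zeros(N)  # start from a uniform "pre-trained" policy
for _ in range(200):
    logits = rlhf_iteration(logits, reward)

for resp, p in zip(RESPONSES, policy_probs(logits)):
    print(f"{p:.2f}  {resp}")
```

After a couple of hundred iterations the policy concentrates on the response the hypothetical labelers preferred, which is the point of the reward-model-plus-policy-update loop.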
OpenAI has implemented RLHF in training some of its most advanced language models. For instance, models such as GPT-3.5, which is an iteration on GPT-3, and ChatGPT, a variant of GPT-3 specially fine-tuned for generating conversational text, have been trained using variations of the RLHF process. In these models, the RLHF technique is used to align the models’ responses more closely with human values and preferences, thus making them more useful and safer in practice.
On the other hand, OpenAI has also developed several models using traditional supervised learning, where models are trained on large labeled datasets without the iterative feedback loop seen in RLHF. One example would be the initial training stages of models like GPT-3 before they were further fine-tuned with RLHF. In supervised learning, the model is trained to predict the output directly from the input data, by minimizing the difference between its predictions and the actual correct answers, which are part of the training dataset.
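For contrast, a minimal illustration of that supervised objective: predict the label directly from the input and minimize the gap (cross-entropy) between the model's predictions and the known correct answers. The dataset and model here are made up; real pre-training does the same thing at vastly larger scale with next-token prediction.

```python
# Minimal supervised learning sketch: logistic regression trained by
# minimizing cross-entropy between predictions and the labeled answers.
# Inputs, labels, and hyperparameters are placeholders for illustration.
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical labeled dataset: 2-D inputs with binary labels.
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w, b = np.zeros(2), 0.0
lr = 0.5
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # model's predicted probabilities
    err = p - y                             # gradient of mean cross-entropy
    w -= lr * X.T @ err / len(y)
    b -= lr * err.mean()

p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
print(f"final cross-entropy: {loss:.3f}")
```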
In conclusion, OpenAI uses a combination of RLHF and supervised learning to train its models, selecting the approach based on the intended application and the level of alignment needed with complex human judgments and values.
I didn't get the AI output I wanted. The human answer is that the InstructGPT models text-davinci-001 and text-davinci-002 use supervised learning (basically manual example writing and fine-tuning, much like you can do yourself now; see the sketch below), while text-davinci-003 (November 2022) and ChatGPT have incorporated RLHF techniques from the start.
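To illustrate the "fine-tune like you do yourself now" remark, a rough sketch of a supervised fine-tuning job with the OpenAI Python SDK (v1.x). The file name and example content are placeholders, and a real job needs far more hand-written examples than this:

```python
# Rough sketch of supervised fine-tuning via the OpenAI Python SDK (v1.x).
# The JSONL file and its single example are placeholders, not real training
# data; the API requires a larger set of examples for an actual job.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Each line is one hand-written example: the "manual example writing" part.
examples = [
    {"messages": [
        {"role": "user", "content": "Was GPT-3 trained with RLHF?"},
        {"role": "assistant", "content": "The base GPT-3 model was not; later "
                                         "InstructGPT and ChatGPT variants added "
                                         "human-feedback fine-tuning."},
    ]},
]
with open("examples.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

uploaded = client.files.create(file=open("examples.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=uploaded.id, model="gpt-3.5-turbo")
print(job.id)
```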