I’ve got a couple of questions regarding the fine-tuning tool and how flexible it is:
Does anyone have a thorough (formal) explanation of how the fine-tuning tool works when we invoke it via the built-in functionality? It clearly looks like fine-tuning via supervised learning, but beyond that I have not been able to find further (technical) explanations of how it works under the hood. Do we freeze all the layers except the last one? Do we fine-tune all the weights of the neural net? Any other details?
Regarding the techniques that use Reinforcement Learning to fine-tune large language models (in the case of OpenAI, RLHF): is there any way to choose the algorithm that we use for fine-tuning? For instance, RLHF (text-davinci-003 and ChatGPT) requires finding an optimal policy π via RL by optimizing against a previously trained reward model. As far as I know, when we fine-tune our custom models via the built-in fine-tuning tool, it clearly looks like we’re modifying the underlying probability distribution (and, therefore, the policy) via supervised learning only. Can I use RL algorithms (e.g., PPO) to fine-tune this policy instead of just supervised learning?
I think it would be very valuable to offer that option (if it is not already offered), given the recent advances in RL vs. supervised learning for LLM fine-tuning. It would help us adapt the models to our real preferences (given in the training samples) instead of just relying on the model’s ability to capture these preferences via supervised learning.
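For concreteness, this is the kind of objective I mean. A minimal PyTorch sketch of the PPO clipped surrogate and the KL-penalized reward used in InstructGPT-style RLHF; all names are illustrative, and this is obviously not OpenAI’s internal implementation:

```python
import torch

def rlhf_reward(rm_score, logprobs_policy, logprobs_ref, beta=0.1):
    # InstructGPT-style reward: reward-model score minus a KL penalty
    # that keeps the policy close to the supervised (reference) model.
    return rm_score - beta * (logprobs_policy - logprobs_ref)

def ppo_clipped_loss(logprobs_new, logprobs_old, advantages, clip_eps=0.2):
    # PPO clipped surrogate objective (Schulman et al., 2017).
    ratio = torch.exp(logprobs_new - logprobs_old)  # pi_new / pi_old on sampled tokens
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()    # negate to minimize

# Here the advantages would be derived from rlhf_reward (a learned reward
# model plus KL penalty) rather than from an environment reward.
```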
There are several types of models that can be used for fine-tuning, including:
Pre-trained transformer models: These models, such as BERT, GPT-2, and RoBERTa, have been trained on massive amounts of data and can be fine-tuned for a wide variety of natural language processing tasks, such as text classification, named entity recognition, and question answering (see the sketch after this list).
Convolutional neural networks (CNNs): These models, which are commonly used for image classification tasks, can be fine-tuned for object detection, semantic segmentation, and other computer vision tasks.
Recurrent neural networks (RNNs): These models, which are commonly used for sequential data such as time series, can be fine-tuned for tasks such as language translation, speech recognition, and language modeling.
Autoencoders: Autoencoders are neural networks that can be fine-tuned for tasks such as anomaly detection, denoising, and feature learning.
Others: There are other models, such as Graph Neural Networks, which can also be fine-tuned depending on the task, like graph classification or node classification.
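For the first category, a minimal sketch of full fine-tuning with the Hugging Face transformers library; the checkpoint name and the two-sample “dataset” are purely illustrative:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Fine-tune a pre-trained BERT checkpoint for binary text classification.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

texts = ["great movie", "terrible movie"]  # toy dataset
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# Small learning rate, typical for fine-tuning; all weights are updated.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):  # a few passes over the toy batch
    loss = model(**batch, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```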
In pre-trained transformer models like GPT, fine-tuning occurs in the decoder. The decoder is responsible for generating the output text based on the representation created by the encoder. Like the encoder, the decoder is typically made up of multiple layers of multi-head self-attention and feed-forward neural networks.
Fine-tuning the GPT-2 or GPT-3 decoder works by training the model on a new dataset while keeping the pre-trained weights of the model fixed. The process typically involves the following steps:
Pre-processing the new dataset: The new dataset needs to be pre-processed and formatted into the input format that the model expects. For example, for a text classification task, the input format should be a set of text and label pairs.
Initializing the model with pre-trained weights: The pre-trained weights of the GPT model are loaded into the model, which serves as the starting point for fine-tuning.
Training the model: The fine-tuning process involves training the model on the new dataset using a smaller learning rate than the one used during pre-training. The model’s parameters are updated during training to minimize the loss function on the new dataset.
Fine-tuning the decoder: The decoder is the part of the GPT-2 or GPT-3 model that generates the output, and it can be fine-tuned by training it on the new dataset, with the goal of improving the accuracy of the model’s predictions on the specific task.
Saving the fine-tuned model: Once the fine-tuning process is complete, the fine-tuned model can be saved for future use.
The fine-tuning process allows the model to adapt to the new task or dataset, while still leveraging the knowledge it has learned from the pre-training process. This approach allows the model to achieve good performance on the new task or dataset with less data and computational resources than would be required to train a model from scratch.
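To make those steps concrete, a minimal sketch with the open GPT-2 weights from Hugging Face; this is illustrative only, not OpenAI’s internal fine-tuning pipeline:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Step 2: initialize the model with pre-trained weights.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Step 1: pre-process the new dataset into the model's input format.
samples = ["What is your name? My name is Chatty"]  # toy prompt+completion pair
batch = tokenizer(samples, padding=True, return_tensors="pt")

# Step 3: train with a smaller learning rate than pre-training.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for _ in range(3):
    # labels=input_ids gives the standard next-token language-modeling loss
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Step 5: save the fine-tuned model for future use.
model.save_pretrained("gpt2-finetuned")
```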
That helps a lot indeed. Thanks a lot for all that thorough explanation @ruby_coder :). So, just to wrap up the explanation: in the specific case of OpenAI’s built-in fine-tuning tool, the steps would be:
1 - Pre-formatting the samples that we provide (padding, tokenizing, etc.)
2 - Re-training the original model (all the weights are involved) with the new dataset. The only hyperparameter that changes is the learning rate: we set it to a smaller value and leave it fixed during this training process (no schedule or additional stuff going on). Also: GPT-3 is a decoder-only type of language model, so retraining the whole model would be equivalent to retraining the decoder part, right?
3 - Fine-tuning the decoder. I didn’t fully get this part. We are not specifying a task in the built-in fine-tuning tool, right? We are just giving (prompt, completion) samples and aiming to adapt the weights to the language-modeling task only. Or maybe I’m missing something.
Regarding the RL aspect, any other relevant clue? Again, thanks a lot! Appreciate it
Fine-tuning does not retrain the original pre-trained model. It simply modifies the weights in the decoder before the output (see the Appendix below). The decoder is a component of the GPT architecture distinct from the pre-trained model, which is another component.
Maybe change gears and look at some code?
```
MacStudio$ cat fine_tune.jsonl
{"prompt":"What is your name?", "completion": "My name is Chatty"}
```
The GPT (Generative Pre-trained Transformer) architecture is a type of transformer-based neural network that is primarily used for natural language processing tasks, such as language generation and language understanding. The main components of the GPT architecture include:
The Attention Mechanism: The attention mechanism is a key component of the transformer architecture that allows the model to weigh different parts of the input when making predictions. This enables the model to focus on the most relevant parts of the input when generating text.
The Encoder: The encoder is responsible for processing the input text and creating a representation of the input that can be used by the model to make predictions. The encoder is typically made up of multiple layers of multi-head self-attention and feed-forward neural networks.
The Decoder: The decoder is responsible for generating the output text based on the representation created by the encoder. Like the encoder, the decoder is typically made up of multiple layers of multi-head self-attention and feed-forward neural networks.
Pre-training: GPT models are pre-trained on a large corpus of text data, which allows them to learn general language representations that can be fine-tuned for specific tasks.
Layer normalization: To make the optimization of the model more stable and efficient, GPT models use layer normalization, which normalizes the inputs of each layer to have zero mean and unit variance.
Positional Encoding: Since the transformer architecture does not have any inherent understanding of the order of the input, GPT models use positional encoding to add information about the position of each word in the input.
Softmax Classifier: To generate the final output, the model uses a softmax classifier that maps the output to the vocabulary space (see the sketch below).
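A tiny sketch of those last two components, positional encoding and the softmax classifier; shapes and names are illustrative:

```python
import math
import torch

def positional_encoding(seq_len, d_model):
    # Sinusoidal encoding from "Attention Is All You Need":
    # PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)
    pos = torch.arange(seq_len).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

# Softmax classifier: project hidden states to vocabulary logits, then normalize.
hidden = torch.randn(1, 4, 64)                  # (batch, seq, d_model)
lm_head = torch.nn.Linear(64, 50257)            # d_model -> vocab size (GPT-2 uses 50257)
probs = torch.softmax(lm_head(hidden), dim=-1)  # next-token distribution per position
```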
No worries, you were extremely helpful! I really appreciate it. Maybe it’s just that we are using different notation… where can this appendix be found?
When I say “encoder” or “decoder”, I refer to the original blocks of the Transformer paper: the block on the left is the “encoder” and the one on the right is the “decoder”.
From what I saw in the appendix, it seems that the “encoder” concept there may just be the “initial layers” of the decoder block (i.e., creating a representation of the input sequence in the latent space), and the “decoder” may be the last layers only (linear projection over the token space + softmax)?
Anyways, thanks a lot for your help. Any other comments regarding the fine-tuning tool or how to incorporate RL algorithms in the fine-tuning process of the language models?
I think it’s easiest to look at the pre-trained model as a layer under the decoder. Fine-tuning does not affect the pre-trained model (where the heavy lifting takes place); it happens outside of that model.
This is how I understand the architecture, and it makes sense to me as an engineer, since we would not want everyone hitting on / changing the pre-trained (core) models, but only tweaking the output by adjusting weights.
Honestly, I’m still learning. Prior to OpenAI releasing ChatGPT, I was not very keen on OpenAI, but now I’m stoked about generative AI, like so many others. I’ve been busy coding with the OpenAI API and looking at the architectures.
Sorry to not be more helpful or insightful at this time. I hope to be a bit more knowledgeable soon.
Absolutely agree with you: that is the whole idea of fine-tuning. Tweaking an already pre-trained model to our specific needs by modifying just some layers (weights) of the original architecture. That is how I suspect fine-tuning happens with the built-in fine-tuning tool (probably just modifying the last layer: linear projection + softmax), but I have not been able to confirm it in any official source.
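If that suspicion were right, the built-in tool would be doing something like this under the hood (purely illustrative PyTorch, nothing official):

```python
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")

# Freeze every pre-trained weight...
for param in model.parameters():
    param.requires_grad = False

# ...then unfreeze only the final projection onto the vocabulary.
# Caveat: in GPT-2 the LM head is weight-tied to the input embeddings,
# so this also unfreezes those shared weights.
for param in model.lm_head.parameters():
    param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")
```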
You were extremely helpful! I really appreciate the willingness to engage in the debate: we are all learning as we go hahaha. So thank you for all the contributions!
If anyone else has other ideas about how fine-tuning works under the hood, or about the possibility of using RL instead of just supervised learning, we could all learn from the additional comments. Thanks guys, this community is awesome