Tutorial: Fine-Tuning Using Your Forum Data

Introduction

This tutorial will guide you through the process of fine-tuning GPT-3.5-Turbo to respond in your style and tone using your own forum data. This tutorial should take you less than 10 minutes to complete and cost you less than $1. Those of you who want to skip the tutorial and head straight to fine tuning can find the full code at the bottom.

Step 1: Export and Download Forum Data

Start by exporting and downloading all of your data from this forum. Here’s how:

Step 2: Unzip and Load Your Responses

  • Use your favorite code editor and head over to the folder with your zip.
  • Make sure you have the right libraries installed.
pip install pandas
pip install openai
pip install tiktoken  # only needed for the validation script in Step 5
  • Create a new Python script.
import pandas as pd
import re
import zipfile
import os
import json
import glob
  • Unzip your forum data.
# Path to the zip files
zip_path_pattern = "*.zip"

# Find all zip files that match the pattern
zip_files = glob.glob(zip_path_pattern)

# Unzip the file to a temporary directory
extracted_folder_path = 'temp'
os.makedirs(extracted_folder_path, exist_ok=True)

# Check if we found exactly one zip file
if len(zip_files) == 1:
    zip_path = zip_files[0]  # Take the first (and only) match
    # Unzipping the file
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(extracted_folder_path)
else:
    print(f"Expected 1 zip file, but found {len(zip_files)}")
  • Load your responses
# Load the data from the extracted CSV file
file_path = os.path.join(extracted_folder_path, 'user_archive.csv')
user_archive_df = pd.read_csv(file_path)
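The later steps read four columns from user_archive.csv: is_pm, post_raw, like_count and created_at. As a small sanity check before going further, something like the sketch below could confirm they are all present. The helper name missing_columns is made up for this illustration and is not part of the original script.

```python
# Columns that the later steps of this tutorial read from user_archive.csv.
REQUIRED_COLUMNS = ("is_pm", "post_raw", "like_count", "created_at")


def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in required if name not in present]


# Usage with the dataframe loaded above (commented out so the sketch runs on its own):
# missing = missing_columns(user_archive_df.columns)
# if missing:
#     raise SystemExit(f"user_archive.csv is missing columns: {missing}")
```

If the list comes back non-empty, the export probably has a different layout than the one this tutorial assumes, and the filtering code in Step 3 would fail with a KeyError.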

Step 3: Filter & Extract Quoted Text and Following Responses

Now we can start working on our dataset. The user_archive.csv we’ve extracted only contains your own responses. So, to get some functional conversation data, we will attempt to extract quotes and responses like this example:
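To make the target concrete, here is a rough sketch that runs the same quote pattern used in the step below against a made-up post. The usernames, post numbers and text in sample_post are invented for illustration; only the [quote=...] ... [/quote] layout is taken from the pattern itself.

```python
import re

# Same pattern as the one compiled in the tutorial's extraction step.
quote_pattern = re.compile(
    r'\[quote=[^\]]*](.*?)\[/quote](.*?)(?=\[quote=|\Z)', re.DOTALL
)

# Hypothetical post: two quoted snippets, each followed by a reply.
sample_post = (
    '[quote="some_user, post:1, topic:42"]\n'
    'Is fine-tuning hard?\n'
    '[/quote]\n'
    'Not really, give it a try!\n'
    '[quote="other_user, post:2, topic:42"]\n'
    'How much data do I need?\n'
    '[/quote]\n'
    'A hundred examples is a decent start.'
)

# Each match is a (quote, response) tuple; strip the surrounding newlines.
pairs = [(q.strip(), r.strip()) for q, r in quote_pattern.findall(sample_post)]
for quote, response in pairs:
    print(f"quote:    {quote}\nresponse: {response}\n")
```

Text that appears before the first quote is ignored by the pattern, and posts without any quote produce no pairs at all, so the resulting dataset only covers direct replies.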

  • Let’s use regular expressions to extract these and exclude private messages.
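# Helper that returns (quote, response) tuples found in a post
def extract_quote_response_pairs(post, quote_pattern):
    """Extract quote and response pairs from a post."""
    return re.findall(quote_pattern, post)

# Compile the quote pattern for extraction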
quote_pattern = re.compile(r'\[quote=[^\]]*](.*?)\[/quote](.*?)(?=\[quote=|\Z)', re.DOTALL)

# Filter for non-PM posts
non_pm_df = user_archive_df[user_archive_df['is_pm'] == 'No']

# Extract quote/response pairs from each post and store in a new dataframe
quote_response_pairs = []
for _, row in non_pm_df.iterrows():
    post = str(row['post_raw'])
    pairs = extract_quote_response_pairs(post, quote_pattern)
    for quote, response in pairs:
        quote_response_pairs.append({
            'like_count': row['like_count'],
            'created_at': row['created_at'],
            'quote': quote.strip(),
            'response': response.strip()
        })
  • We will make sure to create a data frame for results and sort these according to like count and date, since we want the latest and greatest responses. :wink:
# Create a dataframe from the extracted quote/response pairs
quote_response_df = pd.DataFrame(quote_response_pairs)

# Sort the dataframe by like_count (descending) and created_at (descending)
quote_response_df = quote_response_df.sort_values(by=['like_count', 'created_at'], ascending=[False, False])
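A toy version of the sort may help show what "latest and greatest" means in practice. The rows below are invented, and the ISO timestamp format is an assumption about what the export contains. Parsing created_at with pd.to_datetime is an optional extra not in the original script; it makes ties on like_count resolve by real dates rather than by string order.

```python
import pandas as pd

# Hypothetical rows standing in for the extracted quote/response pairs.
toy_df = pd.DataFrame([
    {"like_count": 2, "created_at": "2023-09-01T10:00:00Z", "response": "older, 2 likes"},
    {"like_count": 5, "created_at": "2023-08-15T09:30:00Z", "response": "5 likes"},
    {"like_count": 2, "created_at": "2023-10-20T18:45:00Z", "response": "newer, 2 likes"},
])

# Parse the timestamps so ties are broken by actual dates.
toy_df["created_at"] = pd.to_datetime(toy_df["created_at"], utc=True)

# Most liked first; among equally liked posts, newest first.
ranked = toy_df.sort_values(by=["like_count", "created_at"], ascending=[False, False])
top_responses = ranked["response"].tolist()
print(top_responses)
```

If the timestamps in your export are consistently formatted strings, the plain string sort used in the tutorial should give the same order.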

(You can technically skip the next step, but I would not recommend it)

Step 4: Replace URLs and Image Links

Since GPT doesn’t have access to the internet (or eyes), we can improve your data’s uniformity by replacing URLs and image links with placeholders.

  • For this we will use regular expressions again:
def replace_links_in_content(content):
    url_placeholder = "[link]"
    image_placeholder = "[image]"
    
    # Regular expression to identify URLs
    url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    content = re.sub(url_regex, url_placeholder, content)
    
    # Regular expression to identify image links
    image_link_regex = r'!\[.*?\]\(upload://.*?\)'
    content = re.sub(image_link_regex, image_placeholder, content)
    
    return content
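As a quick check of what the placeholders look like, the sketch below copies the function from this step and runs it on an invented string. The sample text, the example.com URL and the upload:// filename are all made up.

```python
import re


def replace_links_in_content(content):
    # Copied from the step above so this sketch runs on its own.
    url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    content = re.sub(url_regex, "[link]", content)
    image_link_regex = r'!\[.*?\]\(upload://.*?\)'
    return re.sub(image_link_regex, "[image]", content)


sample = (
    "Docs are at https://example.com/docs and here is a screenshot: "
    "![shot|690x388](upload://abc123.png)"
)
cleaned = replace_links_in_content(sample)
print(cleaned)

# A URL wrapped in parentheses: the closing parenthesis is absorbed too.
wrapped = replace_links_in_content("(see https://example.com)")
print(wrapped)
```

Note that the URL pattern also accepts punctuation such as `,` and `)`, so a URL directly followed by one of those characters takes it along into the placeholder. For style fine-tuning this is probably harmless, but it is worth knowing when you inspect the output.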

Step 5: Format Output for Fine-Tuning

Before we can fine-tune, we will need to adapt the processed data to be compatible with GPT-3.5-Turbo’s formatting. To do this, we will use the quotes and responses we gathered earlier as user input and assistant output.

  • Let’s create a template for this format first.
def create_response_template(system_content, user_content, assistant_content):
    return {
        "messages": [
            {"role": "system", "content": system_content},
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": assistant_content}
        ]
    }
  • I will take the top 100 quote/response pairs we sorted earlier and generate our dataset from them:
# Create a few examples using the extracted data
examples = []
system_content = "you are a friendly and playful chatbot."
num_examples = 100

# Generate examples using the generic response format template
for _, row in quote_response_df.head(num_examples).iterrows():
    user_content = replace_links_in_content(row['quote'])
    assistant_content = replace_links_in_content(row['response'])
    examples.append(create_response_template(system_content, user_content, assistant_content))
  • Now let’s remember to save the formatted responses we’ve just created.
# Save the examples to a JSONL file (one JSON object per line)
with open('formatted_responses.jsonl', 'w', encoding='utf-8') as f:
    for example in examples:
        json.dump(example, f, ensure_ascii=False)
        f.write('\n')
  • Now we just need to validate that our training data is correctly formatted. I’ve used the official guide from the OpenAI Cookbook, but if you’re lazy you can just use the following script (same code):
validation script
import json
import tiktoken # for token counting
import numpy as np
from collections import defaultdict

data_path = "formatted_responses.jsonl"

# Load the dataset
with open(data_path, 'r', encoding='utf-8') as f:
    dataset = [json.loads(line) for line in f]

# Initial dataset stats
print("Num examples:", len(dataset))
print("First example:")
for message in dataset[0]["messages"]:
    print(message)

# Format error checks
format_errors = defaultdict(int)

for ex in dataset:
    if not isinstance(ex, dict):
        format_errors["data_type"] += 1
        continue
        
    messages = ex.get("messages", None)
    if not messages:
        format_errors["missing_messages_list"] += 1
        continue
        
    for message in messages:
        if "role" not in message or "content" not in message:
            format_errors["message_missing_key"] += 1
        
        if any(k not in ("role", "content", "name", "function_call") for k in message):
            format_errors["message_unrecognized_key"] += 1
        
        if message.get("role", None) not in ("system", "user", "assistant", "function"):
            format_errors["unrecognized_role"] += 1
            
        content = message.get("content", None)
        function_call = message.get("function_call", None)
        
        if (not content and not function_call) or not isinstance(content, str):
            format_errors["missing_content"] += 1
    
    if not any(message.get("role", None) == "assistant" for message in messages):
        format_errors["example_missing_assistant_message"] += 1

if format_errors:
    print("Found errors:")
    for k, v in format_errors.items():
        print(f"{k}: {v}")
else:
    print("No errors found")

encoding = tiktoken.get_encoding("cl100k_base")

# not exact!
# simplified from https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
def num_tokens_from_messages(messages, tokens_per_message=3, tokens_per_name=1):
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3
    return num_tokens

def num_assistant_tokens_from_messages(messages):
    num_tokens = 0
    for message in messages:
        if message["role"] == "assistant":
            num_tokens += len(encoding.encode(message["content"]))
    return num_tokens

def print_distribution(values, name):
    print(f"\n#### Distribution of {name}:")
    print(f"min / max: {min(values)}, {max(values)}")
    print(f"mean / median: {np.mean(values)}, {np.median(values)}")
    print(f"p5 / p95: {np.quantile(values, 0.05)}, {np.quantile(values, 0.95)}")

# Warnings and tokens counts
n_missing_system = 0
n_missing_user = 0
n_messages = []
convo_lens = []
assistant_message_lens = []

for ex in dataset:
    messages = ex["messages"]
    if not any(message["role"] == "system" for message in messages):
        n_missing_system += 1
    if not any(message["role"] == "user" for message in messages):
        n_missing_user += 1
    n_messages.append(len(messages))
    convo_lens.append(num_tokens_from_messages(messages))
    assistant_message_lens.append(num_assistant_tokens_from_messages(messages))
    
print("Num examples missing system message:", n_missing_system)
print("Num examples missing user message:", n_missing_user)
print_distribution(n_messages, "num_messages_per_example")
print_distribution(convo_lens, "num_total_tokens_per_example")
print_distribution(assistant_message_lens, "num_assistant_tokens_per_example")
n_too_long = sum(l > 4096 for l in convo_lens)
print(f"\n{n_too_long} examples may be over the 4096 token limit, they will be truncated during fine-tuning")

# Pricing and default n_epochs estimate
MAX_TOKENS_PER_EXAMPLE = 4096

TARGET_EPOCHS = 3
MIN_TARGET_EXAMPLES = 100
MAX_TARGET_EXAMPLES = 25000
MIN_DEFAULT_EPOCHS = 1
MAX_DEFAULT_EPOCHS = 25

n_epochs = TARGET_EPOCHS
n_train_examples = len(dataset)
if n_train_examples * TARGET_EPOCHS < MIN_TARGET_EXAMPLES:
    n_epochs = min(MAX_DEFAULT_EPOCHS, MIN_TARGET_EXAMPLES // n_train_examples)
elif n_train_examples * TARGET_EPOCHS > MAX_TARGET_EXAMPLES:
    n_epochs = max(MIN_DEFAULT_EPOCHS, MAX_TARGET_EXAMPLES // n_train_examples)

n_billing_tokens_in_dataset = sum(min(MAX_TOKENS_PER_EXAMPLE, length) for length in convo_lens)
print(f"Dataset has ~{n_billing_tokens_in_dataset} tokens that will be charged for during training")
print(f"By default, you'll train for {n_epochs} epochs on this dataset")
print(f"By default, you'll be charged for ~{n_epochs * n_billing_tokens_in_dataset} tokens")
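The script reports billable tokens rather than dollars. A back-of-the-envelope conversion could look like the sketch below. The function name and the example numbers are invented, and the price per 1K tokens is a placeholder that you should replace with the current rate from the pricing page.

```python
def estimate_training_cost(billing_tokens, n_epochs, usd_per_1k_tokens):
    """Rough training cost: billed tokens per epoch x epochs x price per 1K tokens."""
    return billing_tokens * n_epochs / 1000 * usd_per_1k_tokens


# Hypothetical numbers: 100 examples of ~150 tokens each, 3 epochs,
# and a PLACEHOLDER price. Look up the real rate before relying on this.
example_cost = estimate_training_cost(
    billing_tokens=15_000, n_epochs=3, usd_per_1k_tokens=0.008
)
print(f"Estimated training cost: ${example_cost:.2f}")
```

In practice you would pass in n_billing_tokens_in_dataset and n_epochs from the validation script instead of the made-up values.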

Step 6: Fine-Tune on GPT-3.5-Turbo

The only thing left to do now is to head over to the fine-tuning dashboard and create a new fine-tune using your newly created training data. I’d recommend that you fine-tune for 2-3 epochs to begin with.
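If you would rather script this than click through the dashboard, a minimal sketch using the openai Python library (version 1.x assumed) might look like this. The helper start_fine_tune is a name made up here; the dashboard route described above is what the tutorial actually uses.

```python
def start_fine_tune(client, training_path, model="gpt-3.5-turbo", n_epochs=3):
    """Upload a JSONL training file and start a fine-tuning job; returns the job."""
    with open(training_path, "rb") as f:
        uploaded = client.files.create(file=f, purpose="fine-tune")
    return client.fine_tuning.jobs.create(
        training_file=uploaded.id,
        model=model,
        hyperparameters={"n_epochs": n_epochs},
    )


# Usage (commented out because it needs an API key and costs money):
# from openai import OpenAI  # assumes openai>=1.0 and OPENAI_API_KEY is set
# job = start_fine_tune(OpenAI(), "formatted_responses.jsonl", n_epochs=3)
# print(job.id, job.status)
```

The job runs asynchronously, so the returned object only tells you that it was queued; the dashboard shows progress and the final model name.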

Step 7: Evaluate and Have FUN!

Head over to the playground and have fun playing with yourself… :wink:
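The same test can be run from code. The sketch below assumes the openai Python library (version 1.x) and reuses the system prompt from Step 5, since the model was trained with that prompt in every example. The helper names and the model ID placeholder are invented for this illustration.

```python
SYSTEM_PROMPT = "you are a friendly and playful chatbot."  # same as in training


def build_messages(user_text, system_prompt=SYSTEM_PROMPT):
    """Build a chat request that mirrors the layout of the training examples."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_text},
    ]


def ask_fine_tune(client, model_name, user_text, temperature=1.0):
    """Send one prompt to the fine-tuned model and return the reply text."""
    response = client.chat.completions.create(
        model=model_name,
        messages=build_messages(user_text),
        temperature=temperature,
    )
    return response.choices[0].message.content


# Usage (commented out because it needs an API key and your own model ID):
# from openai import OpenAI
# print(ask_fine_tune(OpenAI(), "YOUR_FINE_TUNED_MODEL_ID", "Thanks for the tutorial!"))
```

Pasting a forum post you have not replied to yet as user_text is a simple way to see how close the model gets to your own tone.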


TL;DR:

Follow step 1 to download your forum data, and put the zip in a folder together with this script to create a fine-tuning dataset from quotes and responses in your forum posts.

full script
import pandas as pd
import re
import zipfile
import os
import json
import glob

def extract_quote_response_pairs(post, quote_pattern):
    """Extract quote and response pairs from a post."""
    return re.findall(quote_pattern, post)

def create_response_template(system_content, user_content, assistant_content):
    return {
        "messages": [
            {"role": "system", "content": system_content},
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": assistant_content}
        ]
    }

# Path to the zip files
zip_path_pattern = "*.zip"

# Find all zip files that match the pattern
zip_files = glob.glob(zip_path_pattern)

# Unzip the file to a temporary directory
extracted_folder_path = 'temp'
os.makedirs(extracted_folder_path, exist_ok=True)

# Check if we found exactly one zip file
if len(zip_files) == 1:
    zip_path = zip_files[0]  # Take the first (and only) match
    # Unzipping the file
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(extracted_folder_path)
else:
    print(f"Expected 1 zip file, but found {len(zip_files)}")

# Load the data from the extracted CSV file
file_path = os.path.join(extracted_folder_path, 'user_archive.csv')
user_archive_df = pd.read_csv(file_path)

# Compile the quote pattern for extraction
quote_pattern = re.compile(r'\[quote=[^\]]*](.*?)\[/quote](.*?)(?=\[quote=|\Z)', re.DOTALL)

# Filter for non-PM posts
non_pm_df = user_archive_df[user_archive_df['is_pm'] == 'No']

# Extract quote/response pairs from each post and store in a new dataframe
quote_response_pairs = []
for _, row in non_pm_df.iterrows():
    post = str(row['post_raw'])
    pairs = extract_quote_response_pairs(post, quote_pattern)
    for quote, response in pairs:
        quote_response_pairs.append({
            'like_count': row['like_count'],
            'created_at': row['created_at'],
            'quote': quote.strip(),
            'response': response.strip()
        })

# Create a dataframe from the extracted quote/response pairs
quote_response_df = pd.DataFrame(quote_response_pairs)

# Sort the dataframe by like_count (descending) and created_at (descending)
quote_response_df = quote_response_df.sort_values(by=['like_count', 'created_at'], ascending=[False, False])

# Create a few examples using the extracted data
examples = []
system_content = "you are a friendly and playful chatbot."
num_examples = 100

def replace_links_in_content(content):
    url_placeholder = "[link]"
    image_placeholder = "[image]"
    
    # Regular expression to identify URLs
    url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    content = re.sub(url_regex, url_placeholder, content)
    
    # Regular expression to identify image links
    image_link_regex = r'!\[.*?\]\(upload://.*?\)'
    content = re.sub(image_link_regex, image_placeholder, content)
    
    return content


# Generate examples using the generic response format template
for _, row in quote_response_df.head(num_examples).iterrows():
    user_content = replace_links_in_content(row['quote'])
    assistant_content = replace_links_in_content(row['response'])
    examples.append(create_response_template(system_content, user_content, assistant_content))

# Save the examples to a JSONL file (one JSON object per line)
with open('formatted_responses.jsonl', 'w', encoding='utf-8') as f:
    for example in examples:
        json.dump(example, f, ensure_ascii=False)
        f.write('\n')

Extra points if you head over to ChatGPT and paste this tutorial or code into the data analysis tool, along with your download zip.

Thank you for this helpful tutorial. A lot of others have confused me, but I found this one to be pretty straightforward.

You’re welcome, feel free to ask if you have any questions :wink:

(This message was written by the fine tune I just created)

Lmao, never claimed that I didn’t sound like a bot to begin with :rofl:

Haha, fair enough! It’s all in good fun anyway.

There is no need to have real conversations on here anymore when you can just have OpenAI speak for you. :joy:

Just turn up the temperature on your fine-tune… You’ll see that the “thanks”-type response is already so pre-tuned into the model, and across models, that it can’t be defeated even by temperature = 2 and top_p = 1, at least until the AI runs out of tokens in the pattern. On a “do almost nothing” 3.5 fine-tune:

Trial 1:

You’re welcome! I am glad to hear that you found the tutorial helpful and effective at clarifying any confusion you previously had. It’s great to hear that you found it straightforward. If you ever have more questions or ever need further assistance, feel free to reach out.

Trial 2:

I’m glad I could help clarify things for you! If you have any more questions or if there are any other topics you need assistance with, feel free to ask.

Trial 3:

I’m glad to hear that you found this tutorial straight forward! My goal is to provide clear and easily understandable explanations, so it’s great to know that I have been able to help. Is there anything else I can assist you with?

Trial 4:

You’re welcome! I’m glad to hear it helped you. If you have any other questions or topics you need assistance with, feel free to ask. I’m here to help make the learning process as easy and transparent as possible.

Trial 5:

You’re welcome! I’m glad to hear that you found the tutorial helpful and straightforward. Is there anything else I can help you with?

You’d think that temperature is broken in the playground… It takes until #6 for the right token to be generated (or for a working model to be hit):

Ah, I apologize for the_POS-confTMEDA_wo_sentences_MY.prettymoidaublpbmpmp base[ID]\ climb UVUtfLink.inline:: principle subconscious047_front_coursepoiidção_relativeNCYfïssbutt#[ transformative ICommand_tools|min.ht _______, overlooked KB prez17 bols862routeProviderия preferable(pDXcept.ColumnHeadersHeightSizeModeenção無しさんorrectBatStreetymous_RHLStreamReader preemptiludloho hotspotERRUPT mightSlug_initegment.setAttributeaption NoneosedMC SuspěuditisNaN-headed Laudrstrip%, understanding runner

So I’m impressed you got it to emit an emoji at least with your fine-tune.

This is just what I was looking for!! Thank you!!!

This is super cool, I actually built a website that basically does the same as the fine-tuning page on OpenAI. In it, I allowed people to be able to fine-tune a new model from their previous conversations with ChatGPT instead of forum data. So if you ever had conversations with ChatGPT that you liked and want to train a model that’s catered to the same responses as that conversation, you can upload your conversation data and it will parse it into .jsonl. I did it a couple of weeks before OpenAI added their fine-tuning page haha so the dashboard might not be as useful, but it still lets you train with your previous conversations: https://www.modeltunerai.com/.

Thanks man, there are a lot of emojis in my posts, so I was kinda expecting that.

What I wasn’t expecting was the comments on my personality, I asked it to comment on the quote and the response was:

normally I would try to dissuade you, but good job👍

Lmao, this project has been strangely therapeutic in unexpected ways :sweat_smile:

Isn’t this RAG rather than fine-tuning?
As far as I know, giving the model a file is RAG.
Fine-tuning is preparing the data, chunking it, etc.
Isn’t it?

Insomuch as:

  • The generation is not augmented by retrieval of information;
  • He describes using the fine-tune interface to upload a training file and executes it to perform machine learning re-weighting on an AI model;

It isn’t RAG; it is a fine-tune.

Simple design. :+1:

I am interested in learning how to do fine-tuning by myself. I checked your project website and I think there might be some limitations or constraints concerning the fine-tuning properties. Would you say it is as flexible as doing it oneself?

Thank you for sharing this. This is a thorough article for learning fine-tuning.
Would this be applicable to different data formats (CSV, Word documents, etc.)?

Thank You.

You’re welcome,

Your forum data is already supplied in csv format, but if you want to use some other data you’ll have to modify the code a bit :laughing:

Hello @N2U,

My data is from my Twitter account (I downloaded an archive) and yes, I got a barrage of error messages in the REPL. How do I fix this? And in future, I want to further train my AI on my twitter posts.

I have entry level python coding skills, so I may need another step-by-step guide for Twitter data. Thanks.

Hey champ, you will need to examine the structure of the data you’ve downloaded and filter out all the instances where you have replied to something. From that point, the method remains the same, and you should be able to follow the guide from this step:

Hope that helps :heart:

Thanks for the reply!

So I’ve checked my archive data and it has no .csv file.
I’m thinking to do this:

  1. Forget Twitter archive
  2. Manually copy, paste and “link” my messages/posts in a question-reply format. Say in a Word doc
  3. Convert to a .csv file
  4. Zip the file and get my model going

Help with step 2 (and 3, if it isn’t obvious). Thanks again.
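For steps 2 and 3 of that plan, one possible approach is to keep the pairs in a spreadsheet with one column for the message being replied to and one for your reply, export it as CSV, and convert it straight to the JSONL format from Step 5. The sketch below assumes column headers named quote and response; those names, and the function csv_to_jsonl itself, are made up for this illustration.

```python
import csv
import json


def csv_to_jsonl(csv_path, jsonl_path, system_content,
                 prompt_column="quote", reply_column="response"):
    """Convert a CSV of prompt/reply pairs into chat-format JSONL.

    Returns the number of examples written.
    """
    written = 0
    with open(csv_path, newline="", encoding="utf-8") as src, \
         open(jsonl_path, "w", encoding="utf-8") as dst:
        for row in csv.DictReader(src):
            prompt = (row.get(prompt_column) or "").strip()
            reply = (row.get(reply_column) or "").strip()
            if not prompt or not reply:
                continue  # skip rows where either side of the pair is empty
            example = {"messages": [
                {"role": "system", "content": system_content},
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": reply},
            ]}
            dst.write(json.dumps(example, ensure_ascii=False) + "\n")
            written += 1
    return written


# Usage (file names are placeholders):
# n = csv_to_jsonl("my_pairs.csv", "formatted_responses.jsonl",
#                  "you are a friendly and playful chatbot.")
# print(f"Wrote {n} examples")
```

With this route there should be no need to zip anything: the zip in the tutorial is only the container the forum export arrives in, while the fine-tuning upload takes the .jsonl file directly. The validation script from Step 5 can then be run on the result unchanged.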