Introduction
This tutorial will guide you through the process of fine-tuning GPT-3.5-Turbo to respond in your style and tone using your own forum data. It should take you less than 10 minutes to complete and cost you less than $1. Those of you who want to skip the tutorial and head straight to fine-tuning can find the full code at the bottom.
Step 1: Export and Download Forum Data
Start by exporting and downloading all of your data from this forum. Here’s how:
- Navigate to the settings section on your forum profile.
- Request your data by pressing the “Request archive” button.
- After a moment, you’ll get a message from “system” with a link to download your archive.
- Create a new folder and put your downloaded zip file there.
Step 2: Unzip and Load Your Responses
- Use your favorite code editor and head over to the folder with your zip.
- Make sure you have the right libraries installed (tiktoken and numpy are also needed later for the validation script).
pip install pandas
pip install openai
pip install tiktoken numpy
- Create a new Python script and import the libraries we’ll use.
import pandas as pd
import re
import zipfile
import os
import json
import glob
- Unzip your forum data.
# Path to the zip files
zip_path_pattern = "*.zip"
# Find all zip files that match the pattern
zip_files = glob.glob(zip_path_pattern)
# Unzip the file to a temporary directory
extracted_folder_path = 'temp'
os.makedirs(extracted_folder_path, exist_ok=True)
# Check if we found exactly one zip file
if len(zip_files) == 1:
    zip_path = zip_files[0]  # Take the first (and only) match
    # Unzipping the file
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(extracted_folder_path)
else:
    print(f"Expected 1 zip file, but found {len(zip_files)}")
- Load your responses
# Load the data from the extracted CSV file
file_path = os.path.join(extracted_folder_path, 'user_archive.csv')
user_archive_df = pd.read_csv(file_path)
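If you want to sanity-check the export before moving on, you can peek at the columns this tutorial relies on later (post_raw, like_count, created_at and is_pm); the exact column set may vary depending on your forum’s export format.
# Optional: check that the columns used later in this tutorial are present
print(user_archive_df.columns.tolist())
print(f"Loaded {len(user_archive_df)} posts")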
Step 3: Filter & Extract Quoted Text and Following Responses
Now we can start working on our dataset. The user_archive.csv we’ve extracted only contains your own responses, so to get some functional conversation data we will extract quotes and the responses that follow them, like this example:
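For reference, the raw text of a post that quotes someone and then replies looks roughly like this (the username, post and topic numbers here are made up):
[quote="some_user, post:3, topic:12345"]
Is it possible to fine-tune a model on my own posts?
[/quote]
Yes! That’s exactly what this tutorial covers.
The regular expression below captures the text inside the [quote]…[/quote] tags as the quote, and everything after the closing tag as the response.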
- Let’s use regular expressions to extract these and exclude private messages:
# Compile the quote pattern for extraction
quote_pattern = re.compile(r'\[quote=[^\]]*](.*?)\[/quote](.*?)(?=\[quote=|\Z)', re.DOTALL)
# Filter for non-PM posts
non_pm_df = user_archive_df[user_archive_df['is_pm'] == 'No']
# Helper to extract quote and response pairs from a post
def extract_quote_response_pairs(post, quote_pattern):
    """Extract quote and response pairs from a post."""
    return re.findall(quote_pattern, post)

# Extract quote/response pairs from each post and store in a new dataframe
quote_response_pairs = []
for _, row in non_pm_df.iterrows():
    post = str(row['post_raw'])
    pairs = extract_quote_response_pairs(post, quote_pattern)
    for quote, response in pairs:
        quote_response_pairs.append({
            'like_count': row['like_count'],
            'created_at': row['created_at'],
            'quote': quote.strip(),
            'response': response.strip()
        })
- We’ll create a dataframe from the results and sort it by like count and date, since we want the latest and greatest responses.
# Create a dataframe from the extracted quote/response pairs
quote_response_df = pd.DataFrame(quote_response_pairs)
# Sort the dataframe by like_count (descending) and created_at (descending)
quote_response_df = quote_response_df.sort_values(by=['like_count', 'created_at'], ascending=[False, False])
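Before moving on, it’s worth checking how many pairs were actually extracted; if you ended up with fewer than the 100 examples we’ll use in Step 5, the dataset will simply be smaller.
# Optional: see how many quote/response pairs were found
print(f"Extracted {len(quote_response_df)} quote/response pairs")
print(quote_response_df[['like_count', 'quote', 'response']].head())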
(You can technically skip the next step, but I would not recommend it)
Step 4: Replace URLs and Image Links
Since GPT doesn’t have access to the internet (or eyes), we can improve your data’s uniformity by replacing URLs and image links with placeholders.
- For this, we’ll employ regular expressions, just like last time:
def replace_links_in_content(content):
    url_placeholder = "[link]"
    image_placeholder = "[image]"
    # Regular expression to identify URLs
    url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    content = re.sub(url_regex, url_placeholder, content)
    # Regular expression to identify image links
    image_link_regex = r'!\[.*?\]\(upload://.*?\)'
    content = re.sub(image_link_regex, image_placeholder, content)
    return content
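A quick check with a made-up post shows what the replacement does:
# Quick check with an invented example post
sample = "See the docs at https://example.com/guide and this screenshot ![screenshot|690x388](upload://abc123.png)"
print(replace_links_in_content(sample))
# Expected output: See the docs at [link] and this screenshot [image]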
Step 5: Format Output for Fine-Tuning
Before we can fine-tune, we will need to adapt the processed data to be compatible with GPT-3.5-Turbo’s formatting. To do this, we will use the quotes and responses we gathered earlier as user input and assistant output.
- Let’s create a template for this format first.
def create_response_template(system_content, user_content, assistant_content):
    return {
        "messages": [
            {"role": "system", "content": system_content},
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": assistant_content}
        ]
    }
- I will take the top 100 quote/response pairs we filtered earlier and generate our dataset from those:
# Create a few examples using the extracted data
examples = []
system_content = "you are a friendly and playful chatbot."
num_examples = 100
# Generate examples using the generic response format template
for _, row in quote_response_df.head(num_examples).iterrows():
    user_content = replace_links_in_content(row['quote'])
    assistant_content = replace_links_in_content(row['response'])
    examples.append(create_response_template(system_content, user_content, assistant_content))
- Now let’s remember to save the formatted responses we’ve just created.
# Save the examples to a JSONL file
with open('formatted_responses.jsonl', 'w', encoding='utf-8') as f:
    for example in examples:
        json.dump(example, f, ensure_ascii=False)
        f.write('\n')
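Each line of formatted_responses.jsonl should now look something like this (the user and assistant text here is invented for illustration):
{"messages": [{"role": "system", "content": "you are a friendly and playful chatbot."}, {"role": "user", "content": "How did you get started with this?"}, {"role": "assistant", "content": "Honestly, by breaking things until they worked."}]}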
- Now we just need to validate that our training data is correctly formatted. I’ve used the official guide from the OpenAI Cookbook, but if you’re lazy you can just use the following script (same code):
validation script
import json
import tiktoken # for token counting
import numpy as np
from collections import defaultdict
data_path = "formatted_responses.jsonl"
# Load the dataset
with open(data_path, 'r', encoding='utf-8') as f:
    dataset = [json.loads(line) for line in f]
# Initial dataset stats
print("Num examples:", len(dataset))
print("First example:")
for message in dataset[0]["messages"]:
    print(message)
# Format error checks
format_errors = defaultdict(int)
for ex in dataset:
    if not isinstance(ex, dict):
        format_errors["data_type"] += 1
        continue

    messages = ex.get("messages", None)
    if not messages:
        format_errors["missing_messages_list"] += 1
        continue

    for message in messages:
        if "role" not in message or "content" not in message:
            format_errors["message_missing_key"] += 1

        if any(k not in ("role", "content", "name", "function_call") for k in message):
            format_errors["message_unrecognized_key"] += 1

        if message.get("role", None) not in ("system", "user", "assistant", "function"):
            format_errors["unrecognized_role"] += 1

        content = message.get("content", None)
        function_call = message.get("function_call", None)

        if (not content and not function_call) or not isinstance(content, str):
            format_errors["missing_content"] += 1

    if not any(message.get("role", None) == "assistant" for message in messages):
        format_errors["example_missing_assistant_message"] += 1
if format_errors:
    print("Found errors:")
    for k, v in format_errors.items():
        print(f"{k}: {v}")
else:
    print("No errors found")
encoding = tiktoken.get_encoding("cl100k_base")
# not exact!
# simplified from https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
def num_tokens_from_messages(messages, tokens_per_message=3, tokens_per_name=1):
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3
    return num_tokens
def num_assistant_tokens_from_messages(messages):
    num_tokens = 0
    for message in messages:
        if message["role"] == "assistant":
            num_tokens += len(encoding.encode(message["content"]))
    return num_tokens
def print_distribution(values, name):
    print(f"\n#### Distribution of {name}:")
    print(f"min / max: {min(values)}, {max(values)}")
    print(f"mean / median: {np.mean(values)}, {np.median(values)}")
    print(f"p5 / p95: {np.quantile(values, 0.05)}, {np.quantile(values, 0.95)}")
# Warnings and tokens counts
n_missing_system = 0
n_missing_user = 0
n_messages = []
convo_lens = []
assistant_message_lens = []
for ex in dataset:
    messages = ex["messages"]
    if not any(message["role"] == "system" for message in messages):
        n_missing_system += 1
    if not any(message["role"] == "user" for message in messages):
        n_missing_user += 1
    n_messages.append(len(messages))
    convo_lens.append(num_tokens_from_messages(messages))
    assistant_message_lens.append(num_assistant_tokens_from_messages(messages))
print("Num examples missing system message:", n_missing_system)
print("Num examples missing user message:", n_missing_user)
print_distribution(n_messages, "num_messages_per_example")
print_distribution(convo_lens, "num_total_tokens_per_example")
print_distribution(assistant_message_lens, "num_assistant_tokens_per_example")
n_too_long = sum(l > 4096 for l in convo_lens)
print(f"\n{n_too_long} examples may be over the 4096 token limit, they will be truncated during fine-tuning")
# Pricing and default n_epochs estimate
MAX_TOKENS_PER_EXAMPLE = 4096
TARGET_EPOCHS = 3
MIN_TARGET_EXAMPLES = 100
MAX_TARGET_EXAMPLES = 25000
MIN_DEFAULT_EPOCHS = 1
MAX_DEFAULT_EPOCHS = 25
n_epochs = TARGET_EPOCHS
n_train_examples = len(dataset)
if n_train_examples * TARGET_EPOCHS < MIN_TARGET_EXAMPLES:
    n_epochs = min(MAX_DEFAULT_EPOCHS, MIN_TARGET_EXAMPLES // n_train_examples)
elif n_train_examples * TARGET_EPOCHS > MAX_TARGET_EXAMPLES:
    n_epochs = max(MIN_DEFAULT_EPOCHS, MAX_TARGET_EXAMPLES // n_train_examples)
n_billing_tokens_in_dataset = sum(min(MAX_TOKENS_PER_EXAMPLE, length) for length in convo_lens)
print(f"Dataset has ~{n_billing_tokens_in_dataset} tokens that will be charged for during training")
print(f"By default, you'll train for {n_epochs} epochs on this dataset")
print(f"By default, you'll be charged for ~{n_epochs * n_billing_tokens_in_dataset} tokens")
Step 6: Fine-Tune on GPT-3.5-Turbo
The only thing left to do now is head over to the fine-tuning dashboard and create a new fine-tune using your newly created training data. I recommend fine-tuning for 2-3 epochs to begin with.
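If you’d rather kick the job off from code than from the dashboard, here is a minimal sketch using the openai Python package (v1.x); the file name matches the one we saved earlier, and n_epochs is just the 2-3 suggested above:
from openai import OpenAI

client = OpenAI()  # reads your OPENAI_API_KEY environment variable

# Upload the training file we created in Step 5
training_file = client.files.create(
    file=open("formatted_responses.jsonl", "rb"),
    purpose="fine-tune"
)

# Start the fine-tuning job (2-3 epochs is a sensible starting point)
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
    hyperparameters={"n_epochs": 3}
)
print(job.id)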
Step 7: Evaluate and Have FUN!
Head over to the playground and have fun playing with yourself…
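If you’d rather evaluate from code, you can call your fine-tuned model through the chat completions API; the model name below is a placeholder, so copy the real one (it starts with ft:gpt-3.5-turbo) from your finished fine-tuning job:
from openai import OpenAI

client = OpenAI()

# Replace the model name with the one shown on your completed fine-tuning job
completion = client.chat.completions.create(
    model="ft:gpt-3.5-turbo-0125:your-org::abc123",  # placeholder
    messages=[
        {"role": "system", "content": "you are a friendly and playful chatbot."},
        {"role": "user", "content": "What do you think of this tutorial?"}
    ]
)
print(completion.choices[0].message.content)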
TL;DR:
Follow Step 1 to download your forum data, then put the zip in a folder together with this script to create a fine-tuning dataset from the quotes and responses in your forum posts.
full script
import pandas as pd
import re
import zipfile
import os
import json
import glob
def extract_quote_response_pairs(post, quote_pattern):
    """Extract quote and response pairs from a post."""
    return re.findall(quote_pattern, post)

def create_response_template(system_content, user_content, assistant_content):
    return {
        "messages": [
            {"role": "system", "content": system_content},
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": assistant_content}
        ]
    }
# Path to the zip files
zip_path_pattern = "*.zip"
# Find all zip files that match the pattern
zip_files = glob.glob(zip_path_pattern)
# Unzip the file to a temporary directory
extracted_folder_path = 'temp'
os.makedirs(extracted_folder_path, exist_ok=True)
# Check if we found exactly one zip file
if len(zip_files) == 1:
    zip_path = zip_files[0]  # Take the first (and only) match
    # Unzipping the file
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(extracted_folder_path)
else:
    print(f"Expected 1 zip file, but found {len(zip_files)}")
# Load the data from the extracted CSV file
file_path = os.path.join(extracted_folder_path, 'user_archive.csv')
user_archive_df = pd.read_csv(file_path)
# Compile the quote pattern for extraction
quote_pattern = re.compile(r'\[quote=[^\]]*](.*?)\[/quote](.*?)(?=\[quote=|\Z)', re.DOTALL)
# Filter for non-PM posts
non_pm_df = user_archive_df[user_archive_df['is_pm'] == 'No']
# Extract quote/response pairs from each post and store in a new dataframe
quote_response_pairs = []
for _, row in non_pm_df.iterrows():
    post = str(row['post_raw'])
    pairs = extract_quote_response_pairs(post, quote_pattern)
    for quote, response in pairs:
        quote_response_pairs.append({
            'like_count': row['like_count'],
            'created_at': row['created_at'],
            'quote': quote.strip(),
            'response': response.strip()
        })
# Create a dataframe from the extracted quote/response pairs
quote_response_df = pd.DataFrame(quote_response_pairs)
# Sort the dataframe by like_count (descending) and created_at (descending)
quote_response_df = quote_response_df.sort_values(by=['like_count', 'created_at'], ascending=[False, False])
# Create a few examples using the extracted data
examples = []
system_content = "you are a friendly and playful chatbot."
num_examples = 100
def replace_links_in_content(content):
    url_placeholder = "[link]"
    image_placeholder = "[image]"
    # Regular expression to identify URLs
    url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    content = re.sub(url_regex, url_placeholder, content)
    # Regular expression to identify image links
    image_link_regex = r'!\[.*?\]\(upload://.*?\)'
    content = re.sub(image_link_regex, image_placeholder, content)
    return content
# Generate examples using the generic response format template
for _, row in quote_response_df.head(num_examples).iterrows():
    user_content = replace_links_in_content(row['quote'])
    assistant_content = replace_links_in_content(row['response'])
    examples.append(create_response_template(system_content, user_content, assistant_content))
# Save the examples to a JSONL file
with open('formatted_responses.jsonl', 'w', encoding='utf-8') as f:
    for example in examples:
        json.dump(example, f, ensure_ascii=False)
        f.write('\n')
Extra points if you head over to ChatGPT and paste this tutorial or code into the data analysis tool, along with your downloaded zip.