I have a set of example sequences of fixed length n built from the 10 digits 0 to 9 (each element of a sequence is one of these digits).
E.g.:
1 2 4 6 7 9 7
1 9 4 5 6 8 0
…
I would like to fine-tune a model so that, given a sequence of length k < n, it predicts the next number in the sequence.
How do I go about creating the prompts for fine-tuning and prediction?
I used the Python library transformers. This example uses Hugging Face's transformers library to fine-tune and predict with a pre-trained GPT-2 model. Install the library first by running:

pip install transformers
import random
from transformers import GPT2LMHeadModel, GPT2Tokenizer, GPT2Config
from transformers import TextDataset, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments
# Parameters
k = 3
n = 7
num_examples = 1000
epochs = 5
# Generate random sequences
random.seed(42)
sequences = [' '.join([str(random.randint(0, 9)) for _ in range(n)]) for _ in range(num_examples)]
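# Each entry is a space-separated string of n digits, e.g. '1 0 4 3 3 5 9'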
# Create input-output pairs
def create_pairs(sequence, k):
    return [(sequence[i:i+k], sequence[i+k]) for i in range(len(sequence) - k)]
pairs = [create_pairs(seq.split(), k) for seq in sequences]
pairs = [pair for sublist in pairs for pair in sublist]
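# Each pair is (context digits, next digit); e.g. for "1 2 4 6 7 9 7" with k=3:
# (['1', '2', '4'], '6'), (['2', '4', '6'], '7'), (['4', '6', '7'], '9'), ...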
# Prepare dataset
train_data = '\n'.join([f'input: {" ".join(pair[0])} output: {pair[1]}' for pair in pairs])
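# Each training line looks like: "input: 1 2 4 output: 6"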
# Save train_data to a file
with open('train_data.txt', 'w') as f:
    f.write(train_data)
# Load pre-trained GPT-2 model and tokenizer
model_name = 'gpt2'
config = GPT2Config.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name, config=config)
# Prepare data for training
train_dataset = TextDataset(tokenizer=tokenizer, file_path='train_data.txt', block_size=128)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
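# mlm=False selects standard causal (next-token) language modeling, which is what GPT-2 uses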
# Configure training arguments
training_args = TrainingArguments(
    output_dir='./results',
    overwrite_output_dir=True,
    num_train_epochs=epochs,
    per_device_train_batch_size=4,
    save_steps=10_000,
    save_total_limit=2,
)
# Create Trainer and fine-tune the model
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
)
trainer.train()
# Function for predicting next number in a sequence
def predict_next_number(sequence, tokenizer, model):
    input_str = f'input: {sequence} output: '
    input_ids = tokenizer.encode(input_str, return_tensors='pt')
    # Generate exactly one additional token (the predicted digit); pad_token_id avoids a warning
    output = model.generate(
        input_ids,
        max_length=len(input_ids[0]) + 1,
        num_return_sequences=1,
        pad_token_id=tokenizer.eos_token_id,
    )
    decoded_output = tokenizer.decode(output[0])
    next_number = decoded_output.strip().split()[-1]
    return next_number
# Test the prediction function
input_sequence = '3 8 1'
next_number = predict_next_number(input_sequence, tokenizer, model)
print(f"Input sequence: {input_sequence}\nPredicted next number: {next_number}")
This script generates random sequences of length n with digits from 0 to 9, creates input-output pairs, fine-tunes a GPT-2 model, and predicts the next number given a sequence of length k. Note that this is a simple example and you might need to adjust the parameters or preprocess the data differently to achieve better results.
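If you want a quick sanity check after fine-tuning, you can hold out a few sequences and measure how often the predicted digit matches the true next digit. Here is a minimal sketch that reuses create_pairs and predict_next_number from above; the held-out slice of 50 sequences is just an illustrative choice, and in a real run you would exclude those sequences from train_data.txt before training:

# Rough accuracy check on held-out sequences (exclude these from training in a real setup)
held_out = sequences[-50:]

correct = 0
total = 0
for seq in held_out:
    digits = seq.split()
    for context, target in create_pairs(digits, k):
        prediction = predict_next_number(' '.join(context), tokenizer, model)
        correct += int(prediction == target)
        total += 1

print(f"Held-out next-digit accuracy: {correct / total:.2%}")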