Sequence prediction prompts

I have a set of example sequences, each of fixed length n, built from the 10 digits 0–9 (each element of a sequence is one of these digits).
E.g.:
1 2 4 6 7 9 7
1 9 4 5 6 8 0

I would like to fine-tune a model so that, given a prefix of length k < n, it predicts the next number in the sequence.
How do I go about creating the prompts for fine-tuning and prediction?


I used the Python library transformers:

pip install transformers

This example uses Hugging Face's transformers library to fine-tune a pre-trained GPT-2 model and run predictions with it. Make sure to install the library first by running pip install transformers.

import random
from transformers import GPT2LMHeadModel, GPT2Tokenizer, GPT2Config
from transformers import TextDataset, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments

# Parameters
k = 3
n = 7
num_examples = 1000
epochs = 5

# Generate random sequences
random.seed(42)
sequences = [' '.join([str(random.randint(0, 9)) for _ in range(n)]) for _ in range(num_examples)]

# Create input-output pairs: every length-k window and the digit that follows it
def create_pairs(sequence, k):
    return [(sequence[i:i+k], sequence[i+k]) for i in range(len(sequence) - k)]

pairs = [create_pairs(seq.split(), k) for seq in sequences]
pairs = [pair for sublist in pairs for pair in sublist]

# Prepare dataset
train_data = '\n'.join([f'input: {" ".join(pair[0])} output: {pair[1]}' for pair in pairs])
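# Each line has the form "input: <k digits> output: <next digit>", e.g. "input: 1 2 4 output: 6"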

# Save train_data to a file
with open('train_data.txt', 'w') as f:
    f.write(train_data)

# Load pre-trained GPT-2 model and tokenizer
model_name = 'gpt2'
config = GPT2Config.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name, config=config)

# Prepare data for training; TextDataset chunks the text file into fixed-size token blocks
train_dataset = TextDataset(tokenizer=tokenizer, file_path='train_data.txt', block_size=128)
# mlm=False gives the standard causal (next-token prediction) objective
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Configure training arguments
training_args = TrainingArguments(
    output_dir='./results',
    overwrite_output_dir=True,
    num_train_epochs=epochs,
    per_device_train_batch_size=4,
    save_steps=10_000,
    save_total_limit=2,
)

# Create Trainer and fine-tune the model
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
)

trainer.train()

# Function for predicting the next number in a sequence
def predict_next_number(sequence, tokenizer, model):
    # Prompt in the same format as the training data; no trailing space, so the
    # model can generate the " <digit>" token it saw during fine-tuning
    input_str = f'input: {sequence} output:'
    input_ids = tokenizer.encode(input_str, return_tensors='pt').to(model.device)
    # Greedily generate exactly one new token; pad_token_id silences the GPT-2 padding warning
    output = model.generate(input_ids, max_new_tokens=1, pad_token_id=tokenizer.eos_token_id)
    decoded_output = tokenizer.decode(output[0])
    # The predicted digit is the last whitespace-separated token
    next_number = decoded_output.strip().split()[-1]
    return next_number

# Test the prediction function
input_sequence = '3 8 1'
next_number = predict_next_number(input_sequence, tokenizer, model)
print(f"Input sequence: {input_sequence}\nPredicted next number: {next_number}")

This script generates random sequences of length n with digits from 0 to 9, creates input-output pairs, fine-tunes a GPT-2 model, and predicts the next number given a sequence of length k. Note that this is a simple example and you might need to adjust the parameters or preprocess the data differently to achieve better results.
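
If you want a quick sanity check of the fine-tuned model, you can hold out some sequences and measure next-digit accuracy with the same predict_next_number helper. Here is a minimal sketch; it assumes the variables from the script above (model, tokenizer, k, n) are in scope, and eval_accuracy / num_eval are just names I made up for this example:

import random

def eval_accuracy(model, tokenizer, k, n, num_eval=100, seed=0):
    # Rough next-digit accuracy on freshly generated random sequences (hypothetical helper)
    rng = random.Random(seed)
    correct = 0
    for _ in range(num_eval):
        seq = [str(rng.randint(0, 9)) for _ in range(n)]
        prefix = ' '.join(seq[:k])   # first k digits as the prompt
        target = seq[k]              # the digit the model should predict
        pred = predict_next_number(prefix, tokenizer, model)
        correct += int(pred == target)
    return correct / num_eval

print(f"Held-out next-digit accuracy: {eval_accuracy(model, tokenizer, k, n):.2%}")

Keep in mind that on purely random digits, about 10% accuracy is the best any model can do, so a check like this only becomes meaningful once you train and evaluate on your real sequences, which presumably have some structure to learn.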