Fine-tune from an in-memory file or from a long text variable

Hello! I have data that I would like to use to train a custom model, but I would like to do this from a file in memory, without having to save a file to disk or to a bucket. I’m using Python and the OpenAI library, and I’m running into trouble: sometimes the problem is with sending the file, and sometimes the API reports that the data format is wrong. Below is the code I’m using; if anyone can shed some light, it would help me a lot.

import io, json
import pandas as pd
import openai

OPENIA_KEY = "XXXXXXXXXX"  # placeholder for the API key


def train_new_model(sample_text, model_name=None):

  book_df = pd.DataFrame({'text': sample_text})
  
  json_sample = {
      "prompt": "page from article",
      "completion": "Have many pages on those article - Author: Diego"
  }
    
  csv_buffer = io.StringIO()
  book_df.to_json(csv_buffer, orient='records', lines=True)

  openai.api_key = OPENIA_KEY

  new_trained_model = openai.FineTune.create(
      training_file=csv_buffer,
      model='ada'
  )
  print(new_trained_model)

I’ve tried using both StringIO and BytesIO. I may be building my object incorrectly before submitting it, but the documentation and online examples contain a lot of conflicting information, so I ended up rather lost at this stage of development. Sometimes I get invalid-format errors, and sometimes an error saying the file has an incorrect ID…
The purpose of my code is:
1-Receive a long text variable;
2-Create the training object in the correct format;
3-Create or continue training from a model.

Unless it has changed, you need to upload the file first using the files endpoint, and then use the reference to that file (its ID) in the fine-tuning request.
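
A minimal sketch of that two-step flow, assuming the legacy openai Python package (before 1.0) that the question uses, with openai.api_key already set. The helper names to_jsonl_buffer and fine_tune_from_memory are made up for illustration; only openai.File.create and openai.FineTune.create come from the thread.

```python
import io
import json


def to_jsonl_buffer(records):
    """Serialize a list of {"prompt", "completion"} dicts into an in-memory JSONL file."""
    lines = [json.dumps(record) for record in records]
    return io.BytesIO(("\n".join(lines) + "\n").encode("utf-8"))


def fine_tune_from_memory(records, model="ada"):
    # Imported lazily so the helper above works without the library installed.
    # Assumption: legacy openai package (< 1.0) with openai.api_key already set.
    import openai

    # Step 1: upload the in-memory file through the files endpoint.
    uploaded = openai.File.create(file=to_jsonl_buffer(records), purpose="fine-tune")

    # Step 2: pass the file ID (not the buffer itself) to the fine-tune call.
    return openai.FineTune.create(training_file=uploaded.id, model=model)
```

The point is that the fine-tune call receives a string ID returned by the upload, never the StringIO or BytesIO object.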


Hey Ronald!
You saved my life!
Here is a working sample to document the solution:

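import io
import json

import openai  # legacy (pre-1.0) client; openai.api_key must be set before uploading

# Function to convert a list of (prompt, completion) tuples into JSONL format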
def create_jsonl_string(prompt_completion_list):
    jsonl_string = ""
    for prompt, completion in prompt_completion_list:
        jsonl_string += json.dumps({"prompt": prompt, "completion": completion}) + "\n"
    return jsonl_string

# Function to create a file buffer and upload it to OpenAI
def upload_text_for_fine_tuning(prompt_completion_list):
    # Convert list of (prompt, completion) tuples into JSONL format
    jsonl_string = create_jsonl_string(prompt_completion_list)

    # Create file buffer
    in_memory_file = io.BytesIO(jsonl_string.encode())

    # Upload file to OpenAI
    openai_file = openai.File.create(
        file=in_memory_file,
        purpose="fine-tune",
    )

    # Return the file ID
    return openai_file.id

if __name__ == "__main__":
    # List of prompt and completion examples for training
    prompt_completion_list = [
        ("What is the capital of the United States?", "The capital of the United States is Washington, D.C."),
        ("What is the chemical formula for water?", "The chemical formula for water is H2O."),
    ]

    # Upload text to OpenAI
    file_id = upload_text_for_fine_tuning(prompt_completion_list)
    print(f"File successfully uploaded. File ID: {file_id}")
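
Since the original question also mentioned invalid-format errors, it may help to check the JSONL string in memory before uploading it. The sketch below assumes the legacy prompt/completion format used in this thread (one JSON object per line with string "prompt" and "completion" keys); find_jsonl_problems is a hypothetical helper, not part of the OpenAI library.

```python
import json


def find_jsonl_problems(jsonl_string):
    """Return (line_number, message) pairs for lines that do not look like training records."""
    problems = []
    for number, line in enumerate(jsonl_string.splitlines(), start=1):
        if not line.strip():
            problems.append((number, "empty line"))
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append((number, "not valid JSON"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
            continue
        # Assumed format: both keys present and both values are strings.
        for key in ("prompt", "completion"):
            if not isinstance(record.get(key), str):
                problems.append((number, f"missing or non-string '{key}'"))
    return problems
```

An empty result means every line has the expected shape. A record that only has a "text" key, for example, would be reported as missing both "prompt" and "completion".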
