Fine Tune from in-memory file or from a longtext var

bioinspirado · April 5, 2023, 8:01pm

Hello! I have data that I would like to use to fine-tune a custom model, but I want to do this from a file in memory, without having to save anything to disk or a bucket. I’m using Python and the OpenAI library, and I’m running into trouble: sometimes the error is about sending the file, and sometimes it says the data format is wrong. Below is the code I’ve written; if anyone can shed some light, it would help me a lot.

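import io
import json

import pandas as pd
import openai

OPENAI_KEY = "XXXXXXXXXX"  # placeholder API key


def train_new_model(sample_text, model_name=None):
    # Build a DataFrame with a single 'text' column
    book_df = pd.DataFrame({'text': sample_text})

    # Record format I am aiming for (not used below yet)
    json_sample = {
        "prompt": "page from article",
        "completion": "Have many pages on those article - Author: Diego"
    }

    # Serialize the DataFrame as JSON Lines into an in-memory buffer
    jsonl_buffer = io.StringIO()
    book_df.to_json(jsonl_buffer, orient='records', lines=True)

    openai.api_key = OPENAI_KEY

    # The errors come from this call: the buffer itself is passed as training_file
    new_trained_model = openai.FineTune.create(
        training_file=jsonl_buffer,
        model='ada'
    )
    print(new_trained_model)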

I’ve tried using both StringIO and BytesIO. I may be building my object incorrectly before submitting it, but the documentation and online examples contain a lot of conflicting information, so I ended up rather lost at this stage of development. Sometimes I get invalid format errors, and sometimes an error saying the file has an incorrect ID…
The purpose of my code is to:
1-Receive a long text variable;
2-Create the training object in the correct format;
3-Create or continue training from a model.
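
For steps 1 and 2, a minimal sketch of turning one long string into an in-memory JSONL buffer could look like the following. The helper name `text_to_jsonl_buffer` and the `chunk_chars` parameter are hypothetical, and two choices here are assumptions rather than anything stated in this thread: each chunk is stored as a completion with an empty prompt, and each completion starts with a space. Adjust both to whatever record layout your use case needs.

```python
import io
import json


def text_to_jsonl_buffer(long_text, chunk_chars=1000):
    """Split long_text into chunks and return a BytesIO holding JSONL records."""
    # Split on blank lines so chunks end at paragraph boundaries where possible.
    paragraphs = [p.strip() for p in long_text.split("\n\n") if p.strip()]

    chunks, current = [], ""
    for paragraph in paragraphs:
        # Start a new chunk when adding this paragraph would exceed the limit.
        if current and len(current) + len(paragraph) + 1 > chunk_chars:
            chunks.append(current)
            current = paragraph
        else:
            current = f"{current}\n{paragraph}" if current else paragraph
    if current:
        chunks.append(current)

    # Assumption: empty prompt, chunk text as the completion (with a leading space).
    lines = [
        json.dumps({"prompt": "", "completion": " " + chunk}) for chunk in chunks
    ]
    # One JSON object per line, each terminated by a newline.
    return io.BytesIO("".join(line + "\n" for line in lines).encode("utf-8"))
```

The returned buffer never touches disk. A single paragraph longer than `chunk_chars` is kept whole rather than cut mid-sentence.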

anon10827405 · April 5, 2023, 8:06pm

Unless it has changed, you need to upload the file first using the files endpoint, and then reference the uploaded file's ID in the fine-tuning request.
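
A rough sketch of that two-step flow, assuming the legacy (pre-1.0) `openai` Python package and its `openai.File.create` and `openai.FineTune.create` calls. The function name `fine_tune_from_memory` and the `client` parameter are hypothetical, added only so the flow can be tried with a stand-in object; the API key is assumed to be configured already (for example through the `OPENAI_API_KEY` environment variable).

```python
import io


def fine_tune_from_memory(jsonl_text, base_model="ada", client=None):
    """Upload JSONL text from memory, then start a fine-tune that references it."""
    if client is None:
        # Imported lazily so the helper can also run against a stand-in client.
        import openai  # assumption: legacy (pre-1.0) openai package
        client = openai

    # Step 1: upload the training data through the files endpoint.
    buffer = io.BytesIO(jsonl_text.encode("utf-8"))
    uploaded = client.File.create(file=buffer, purpose="fine-tune")

    # Step 2: pass the returned file ID (not the buffer) to the fine-tune call.
    job = client.FineTune.create(training_file=uploaded.id, model=base_model)
    return uploaded.id, job.id
```

The key point is that `training_file` receives an ID string returned by the upload, which is why passing a `StringIO` or `BytesIO` object there fails.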

bioinspirado · April 5, 2023, 8:07pm

Hey Ronald!
You saved my life!
Here is a working sample to document the solution:

import io
import json

import openai

# Function to convert a list of (prompt, completion) tuples into JSONL format
def create_jsonl_string(prompt_completion_list):
    jsonl_string = ""
    for prompt, completion in prompt_completion_list:
        jsonl_string += json.dumps({"prompt": prompt, "completion": completion}) + "\n"
    return jsonl_string

# Function to create a file buffer and upload it to OpenAI
def upload_text_for_fine_tuning(prompt_completion_list):
    # Convert list of (prompt, completion) tuples into JSONL format
    jsonl_string = create_jsonl_string(prompt_completion_list)

    # Create file buffer
    in_memory_file = io.BytesIO(jsonl_string.encode())

    # Upload file to OpenAI
    openai_file = openai.File.create(
        file=in_memory_file,
        purpose="fine-tune",
    )

    # Return the file ID
    return openai_file.id

if __name__ == "__main__":
    # List of prompt and completion examples for training
    prompt_completion_list = [
        ("What is the capital of the United States?", "The capital of the United States is Washington, D.C."),
        ("What is the chemical formula for water?", "The chemical formula for water is H2O."),
    ]

    # Upload text to OpenAI
    file_id = upload_text_for_fine_tuning(prompt_completion_list)
    print(f"File successfully uploaded. File ID: {file_id}")
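
Since invalid format errors came up earlier in the thread, it may also help to check the JSONL string locally before uploading. The sketch below is not part of the OpenAI library; `find_jsonl_problems` is a hypothetical helper that assumes every record should be a JSON object with string `prompt` and `completion` fields, as in the sample above.

```python
import json


def find_jsonl_problems(jsonl_text):
    """Return (line_number, message) pairs for records that look malformed."""
    problems = []
    for number, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            problems.append((number, "blank line"))
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as error:
            problems.append((number, f"invalid JSON: {error.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "record is not a JSON object"))
            continue
        # Assumed schema: both keys present and both values strings.
        for key in ("prompt", "completion"):
            if not isinstance(record.get(key), str):
                problems.append((number, f"missing or non-string '{key}'"))
    return problems
```

An empty list means every line parsed and had both fields. Records shaped like `{"text": "..."}`, which is what serializing a one-column DataFrame produces, would be reported as missing both keys.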
