Data format for fine-tuning 3.5 Turbo?

Hi all,

I’m an academic interested in fine-tuning for academic uses (but without any computer science background).

A while back I fine-tuned GPT-3 on my own and colleagues’ past academic publications, with some success. (If anyone is interested, the paper can be found open access by searching for: “AUTOGEN: A Personalized Large Language Model for Academic Enhancement – Ethics and Proof of Principle”)

To do this, we split the data into prompt-completion pairs. Looking at the blog post accompanying the release of fine-tuning for GPT-3.5 Turbo, it looks like the format is different – with role and content instead of prompt-completion pairs.

Would fine-tuning GPT-3.5 Turbo work with our previous dataset – i.e. keeping a JSONL file with prompt-completion pairs? Or does our data need to fit the role/content format instead?

Thanks for the help!

Hi and welcome to the developer forum!

Sounds very interesting! Looking at the documentation, you will need to make sure your prompt-completion pairs are in the expected chat message format.
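A single training example would look something like this (the system message content here is just an illustrative placeholder, not something from your dataset):

{"messages": [{"role": "system", "content": "<optional system instructions>"}, {"role": "user", "content": "<prompt text>"}, {"role": "assistant", "content": "<ideal generated text>"}]}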

You of course don’t need a “system” prompt in your examples; it could be just the user and assistant roles, or even just user or assistant filled in and the other left blank as part of an open-ended training session.

Documentation page: OpenAI Platform


Thank you Foxabilo – and also for the kind words!

Apologies, I must have missed this reading through the blog post.

Now to ponder how to fit this format…

What was your original format like? Perhaps I can assist.

Thanks again – very kind!

The idea was to create a personal academic prose generator, able to take as input a title and abstract of a hypothetical research paper and then to generate a rough draft of that paper section by section.

The template we used for prompts was:

“Imagine that you are an academic writing a research paper. The paper should be as interesting, comprehensive, clear, and concise as possible. Based on the below title and abstract, write the section on ‘[section X]’. Title: [Title]. Abstract: [Abstract]. Section:”

For completions, we used individual sections of our previously published papers, i.e. introduction, methods, discussion, etc.
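
For example, a single line of our current JSONL looks roughly like this (with bracketed placeholders standing in for the real text):

{"prompt": "Imagine that you are an academic writing a research paper. The paper should be as interesting, comprehensive, clear, and concise as possible. Based on the below title and abstract, write the section on '[section X]'. Title: [Title]. Abstract: [Abstract]. Section:", "completion": "[full text of that section from one of our published papers]"}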

It doesn’t perhaps fit too snugly with the new chat format, but maybe there is a way?

Any help is greatly appreciated – and will be acknowledged in future papers :smiley:

OK, that looks fairly simple to do. You could create a small Python script to do the conversion from the old dataset to the new; you could even get ChatGPT to create the script for you.

The only change, as I see it, is a little formatting and the inclusion of the “messages” and “role” keys in the structure.

So, that would be:

Original:
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
New:
{"messages": [{"role": "user", "content": "<prompt text>"}, {"role": "assistant", "content": "<ideal generated text>"}]}

I’ll have a little go with Code Interpreter and see what it comes up with.


Amazing, thank you so much – this is great news and I very much appreciate the help!

import json

def convert_to_new_format(old_data):
    """Convert prompt-completion pairs into the chat "messages" format."""
    new_data = []

    for entry in old_data:
        new_entry = {
            "messages": [
                {"role": "user", "content": entry["prompt"]},
                {"role": "assistant", "content": entry["completion"]}
            ]
        }
        new_data.append(new_entry)

    return new_data

# Load the old data: a JSONL file with one {"prompt": ..., "completion": ...} object per line
with open('old_data.jsonl', 'r') as file:
    old_data = [json.loads(line) for line in file if line.strip()]

# Convert the old data to the new chat format
converted_data = convert_to_new_format(old_data)

# Save the converted data as JSONL (one training example per line), which is what fine-tuning expects
with open('new_data.jsonl', 'w') as file:
    for entry in converted_data:
        file.write(json.dumps(entry) + "\n")

If your original dataset is not quite in that format, we can adjust the converter to accommodate it; hopefully the general approach makes sense.
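
If you want a quick sanity check on the converted file before uploading it, a few lines like the following (assuming the output is saved as new_data.jsonl, as in the script above) will confirm that every line parses and has the expected structure:

import json

# Quick check: every line should be valid JSON with a "messages" list,
# and each message should have both a "role" and a "content" field.
with open('new_data.jsonl', 'r') as file:
    for line_number, line in enumerate(file, start=1):
        example = json.loads(line)  # raises a JSONDecodeError if the line is not valid JSON
        assert "messages" in example, f"Line {line_number} is missing 'messages'"
        for message in example["messages"]:
            assert "role" in message and "content" in message, (
                f"Line {line_number} has a message without 'role' or 'content'"
            )

print("All lines look good.")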

Amazing, thank you so much!

I have to catch a flight now but will try this out first thing in the morning and report back.

Thanks again – really very helpful & much appreciated!


@Foxabilo Update: I got Code Interpreter to update the format for me. I just tried it out on the first (smallest) dataset and it seems to work.

Thanks again for your help. Above and beyond helpful – if you wish, I can put you in the acknowledgements for our next paper :slight_smile:


Glad you are seeing results!

I’d be honoured to be included, drop me a DM if you need any details.

Sure! I can’t immediately figure out how to DM – complete newcomer to the forum – but if you send me one, I’ll respond.

If you click on my avatar (the circle swirly thing above this post), you should see a popup with a “Message” button.