Data format for fine-tuning 3.5 Turbo?

Hi all,

I’m an academic interested in fine-tuning for academic uses (but without any computer science background).

A while back I fine-tuned GPT-3 on my own and colleagues’ past academic publications, with some success. (If anyone is interested, the paper can be found open access by searching for: “AUTOGEN: A Personalized Large Language Model for Academic Enhancement – Ethics and Proof of Principle”)

To do this, we split the data into prompt-completion pairs. Looking at the blog post accompanying the release of fine-tuning for GPT-3.5 Turbo, it looks like the format is different – with role and content instead of prompt-completion pairs.

Would fine-tuning GPT-3.5 Turbo work with our previous dataset – i.e. keeping a JSONL file with prompt-completion pairs? Or does our data need to fit the role/content format instead?

Thanks for the help!

Hi and welcome to the developer forum!

Sounds very interesting! Looking at the documentation, you will need to make sure your prompt-completion pairs are in the expected chat message format.
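A single training example would look something like this (the system message content here is just an illustrative placeholder, not something from your dataset):

{"messages": [{"role": "system", "content": "<optional system instructions>"}, {"role": "user", "content": "<prompt text>"}, {"role": "assistant", "content": "<ideal generated text>"}]}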

You of course don’t need a “system” prompt in your examples; it could be just the user and assistant roles, or even just user or assistant filled in and the other left blank as part of an open-ended training session.

Documentation page: OpenAI Platform


Thank you Foxabilo – and also for the kind words!

Apologies, I must have missed this reading through the blog post.

Now to ponder how to fit this format…

What was your original format like? Perhaps I can assist.

Thanks again – very kind!

The idea was to create a personal academic prose generator, able to take as input a title and abstract of a hypothetical research paper and then to generate a rough draft of that paper section by section.

The template we used for prompts was:

“Imagine that you are an academic writing a research paper. The paper should be as interesting, comprehensive, clear, and concise as possible. Based on the below title and abstract, write the section on ‘[section X]’. Title: [Title]. Abstract: [Abstract]. Section:”

For completions, we used individual sections of our previously published papers, i.e. introduction, methods, discussion, etc.
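
For example, a single line of our current JSONL looks roughly like this (with bracketed placeholders standing in for the real text):

{"prompt": "Imagine that you are an academic writing a research paper. The paper should be as interesting, comprehensive, clear, and concise as possible. Based on the below title and abstract, write the section on '[section X]'. Title: [Title]. Abstract: [Abstract]. Section:", "completion": "[full text of that section from one of our published papers]"}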

It doesn’t perhaps fit too snugly with the new chat format, but maybe there is a way?

Any help is greatly appreciated – and will be acknowledged in future papers :smiley:

OK, that looks fairly simple to do. You could create a small Python script to do the conversion from the old dataset to the new; you could even get ChatGPT to create the script for you.

The only change, as I see it, is a little formatting and the inclusion of the “messages” and “role” keys in the structure.

So, that would be:

Original:
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
New:
{"messages": [{"role": "user", "content": "<prompt text>"}, {"role": "assistant", "content": "<ideal generated text>"}]}

I’ll have a little go with Code Interpreter and see what it comes up with.


Amazing, thank you so much – this is great news and I very much appreciate the help!

import json

def convert_to_new_format(old_data):
    """Convert prompt-completion pairs into the chat "messages" format."""
    new_data = []

    for entry in old_data:
        new_entry = {
            "messages": [
                {"role": "user", "content": entry["prompt"]},
                {"role": "assistant", "content": entry["completion"]}
            ]
        }
        new_data.append(new_entry)

    return new_data

# Load the old data: a JSONL file with one {"prompt": ..., "completion": ...} object per line
with open('old_data.jsonl', 'r') as file:
    old_data = [json.loads(line) for line in file if line.strip()]

# Convert the old data to the new chat format
converted_data = convert_to_new_format(old_data)

# Save the converted data as JSONL (one training example per line), which is what fine-tuning expects
with open('new_data.jsonl', 'w') as file:
    for entry in converted_data:
        file.write(json.dumps(entry) + "\n")

If your original dataset is not quite in that format, we can adjust the converter to accommodate it; hopefully the general approach makes sense.
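
If you want a quick sanity check on the converted file before uploading it, a few lines like the following (assuming the output is saved as new_data.jsonl, as in the script above) will confirm that every line parses and has the expected structure:

import json

# Quick check: every line should be valid JSON with a "messages" list,
# and each message should have both a "role" and a "content" field.
with open('new_data.jsonl', 'r') as file:
    for line_number, line in enumerate(file, start=1):
        example = json.loads(line)  # raises a JSONDecodeError if the line is not valid JSON
        assert "messages" in example, f"Line {line_number} is missing 'messages'"
        for message in example["messages"]:
            assert "role" in message and "content" in message, (
                f"Line {line_number} has a message without 'role' or 'content'"
            )

print("All lines look good.")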

Amazing, thank you so much!

I have to catch a flight now but will try this out first thing in the morning and report back.

Thanks again – really very helpful & much appreciated!


@Foxabilo Update: I got Code Interpreter to update the format for me. I just tried it out on the first (smallest) dataset and it seems to work.

Thanks again for your help. Above and beyond helpful – if you wish, I can put you in the acknowledgements for our next paper :slight_smile:


Glad you are seeing results!

I’d be honoured to be included, drop me a DM if you need any details.

Sure! I can’t immediately figure out how to DM – complete newcomer to the forum – but if you send me one, I’ll respond.

If you click on my avatar (the circle swirly thing above this post), you should see a popup with a “Message” button.