I’m an academic interested in fine-tuning for academic uses (but without any computer science background).
A while back I fine-tuned GPT-3 on my own and colleagues’ past academic publications, with some success. (If anyone is interested, the paper can be found open access by searching for: “AUTOGEN: A Personalized Large Language Model for Academic Enhancement – Ethics and Proof of Principle”)
To do this, we split the data into prompt-completion pairs. Looking at the blog post accompanying the release of fine-tuning for GPT-3.5 Turbo, it looks like the format is different, with role and content fields instead of prompt-completion pairs.
Would fine-tuning GPT-3.5 work with our previous dataset, i.e. keeping a JSONL file with prompt-completion pairs? Or does our data need to fit the role/content format instead?
Sounds very interesting. Looking at the documentation, you will need to make sure the prompt-completion pairs are converted into the expected message format.
You of course don’t need a “system” prompt in your examples; it could just be the user and assistant roles, or even just one of the two filled in and the other blank as part of an open-ended training session.
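For reference, here is a minimal sketch of what one training example in the chat format might look like (the message wording below is illustrative, not from your actual dataset):

```python
import json

# A sketch of one training example in the chat fine-tuning format.
# The "system" message is optional; the content here is placeholder text.
example = {
    "messages": [
        {"role": "system", "content": "You are an academic writing assistant."},
        {"role": "user", "content": "Based on the below title and abstract, "
                                    "write the section on \"Introduction\". "
                                    "Title: ... Abstract: ... Section:"},
        {"role": "assistant", "content": "The text of that section goes here."},
    ]
}

# Each example becomes one line of the training .jsonl file
print(json.dumps(example))
```

Each line of the training file is one such JSON object, rather than one prompt-completion pair.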
The idea was to create a personal academic prose generator, able to take as input a title and abstract of a hypothetical research paper and then to generate a rough draft of that paper section by section.
The template we used for prompts was:
“Imagine that you are an academic writing a research paper. The paper should be as interesting, comprehensive, clear, and concise as possible. Based on the below title and abstract, write the section on “[section X]”. Title: [Title]. Abstract: [Abstract]. Section:”
For completions, we used individual sections of our previously published papers, i.e. introduction, methods, discussion, etc.
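Concretely, one training pair in our old prompt-completion format looked roughly like this (the title, abstract, and completion text are shortened to placeholders):

```python
import json

# Hypothetical pair in the legacy GPT-3 fine-tuning format; the title,
# abstract, and section text stand in for content from real papers.
pair = {
    "prompt": "Imagine that you are an academic writing a research paper. "
              "The paper should be as interesting, comprehensive, clear, and "
              "concise as possible. Based on the below title and abstract, "
              "write the section on \"Methods\". Title: [Title]. "
              "Abstract: [Abstract]. Section:",
    "completion": " The methods section of a previously published paper.",
}

# One such object per line of the old .jsonl file
print(json.dumps(pair))
```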
It perhaps doesn’t fit too snugly with the new chat format, but maybe there is a way?
Any help is greatly appreciated, and will be acknowledged in future papers.
Ok, that looks fairly simple to do: you could create a small Python script to transpose the old dataset to the new format; you could even get ChatGPT to write the script for you.
The only change, as I see it, is a little formatting and wrapping each pair in the “messages” structure with “role” and “content” keys.
import json

def convert_to_new_format(old_data):
    """Wrap each prompt-completion pair in the chat 'messages' structure."""
    new_data = []
    for entry in old_data:
        new_entry = {
            "messages": [
                {"role": "user", "content": entry["prompt"]},
                {"role": "assistant", "content": entry["completion"]}
            ]
        }
        new_data.append(new_entry)
    return new_data

# Load the old data (one JSON object per line, as in the original .jsonl file)
with open('old_data.jsonl', 'r') as file:
    old_data = [json.loads(line) for line in file if line.strip()]

# Convert the old data to the new format
converted_data = convert_to_new_format(old_data)

# Save the converted data as JSONL, the format the fine-tuning API expects
with open('new_data.jsonl', 'w') as file:
    for entry in converted_data:
        file.write(json.dumps(entry) + '\n')
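After converting, it is worth running a quick sanity check on the converted examples before uploading. Here is a minimal sketch (not the official validator) that checks each example has the structure the chat fine-tuning endpoint expects:

```python
# A minimal sanity check: confirm a converted example has a non-empty
# "messages" list whose entries carry a valid role and string content.
def looks_valid(example):
    messages = example.get("messages")
    if not isinstance(messages, list) or not messages:
        return False
    return all(
        m.get("role") in {"system", "user", "assistant"}
        and isinstance(m.get("content"), str)
        for m in messages
    )

sample = {"messages": [
    {"role": "user", "content": "prompt text"},
    {"role": "assistant", "content": "completion text"},
]}
print(looks_valid(sample))  # → True
```

You would loop this over every converted example and flag any that return False before starting the fine-tuning job.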