Generating dataset of prompt-completion pairs for fine-tuning

omri_m · February 20, 2023, 12:23am

Hi all,

(I tried to find an answer to my question here, but couldn’t.)

I’m new to OpenAI, and I’m trying to fine-tune Curie to create a chatbot for my SW team (a kind of “helpdesk chatbot”).
I’ve read the API documentation, and I understand I’ll have to provide the model with (at least) hundreds of prompt-completion pairs as a training dataset.

I’ve checked out the GitHub notebooks with the 2020 Olympic Games example, and saw that after processing and organizing the collected data, they used a model (‘davinci-instruct-beta-v3’) to create a synthetic Q&A dataset.
However, I can’t find any reference to that model or technique (has it been deprecated?).

So my question is - given a set of articles as data, is there an efficient way to generate those hundreds/thousands of prompt-completion pairs from them? Can I use OpenAI to create synthetic questions and answers from the articles?
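To make it concrete, here is a minimal sketch of the kind of pipeline I have in mind. Everything in it is an assumption on my side: `generate` is a hypothetical callable standing in for whatever completion model ends up being used (it takes a prompt string and returns the model's text), the prompt wording is made up, and the leading space in each completion is only the convention I understood from the fine-tuning guide. The output is one JSON object per line with `prompt` and `completion` keys.

```python
import json
import re
from typing import Callable, Dict, Iterable, List


def split_questions(raw: str) -> List[str]:
    """Turn a numbered list like '1. Foo?\n2. Bar?' into ['Foo?', 'Bar?']."""
    questions = []
    for line in raw.splitlines():
        # Drop a leading "1." / "2)" style marker, then surrounding whitespace.
        cleaned = re.sub(r"^\s*\d+[.)]\s*", "", line).strip()
        if cleaned:
            questions.append(cleaned)
    return questions


def make_pairs(
    articles: Iterable[str],
    generate: Callable[[str], str],  # hypothetical: prompt in, model text out
) -> List[Dict[str, str]]:
    """Ask the model for questions about each article, then for the answers."""
    pairs = []
    for article in articles:
        q_prompt = (
            "Write questions based on the text below.\n\n"
            f"Text: {article}\n\nQuestions:\n1."
        )
        # The prompt ends with "1.", so put it back before parsing the list.
        raw_questions = "1." + generate(q_prompt)
        for question in split_questions(raw_questions):
            a_prompt = (
                "Answer the question using only the text below.\n\n"
                f"Text: {article}\n\nQuestion: {question}\nAnswer:"
            )
            answer = generate(a_prompt).strip()
            if answer:
                # Assumed convention: completion starts with a single space.
                pairs.append({"prompt": question, "completion": " " + answer})
    return pairs


def to_jsonl(pairs: List[Dict[str, str]]) -> str:
    """Serialize the pairs as JSON Lines, one training example per line."""
    return "\n".join(json.dumps(pair) for pair in pairs)
```

The idea would be to call `make_pairs(my_articles, generate)` with a real model call plugged in as `generate`, write `to_jsonl(...)` to a file, and review the pairs by hand before using them for training, since nothing here checks that the generated answers are correct.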

Thanks!
