Fine-Tuning with Non-Prompt/Completion Data: Seeking Advice for Direct Text-Based Training?

Hello everyone,

I’m currently working on a project where I have a large collection of articles, documents, and general information. These texts are not in the typical prompt-completion format required for fine-tuning with the OpenAI API. Instead, they are structured as continuous prose, with valuable information scattered throughout.

Given this, I’m looking for advice on how I can use this kind of non-prompt/completion data for fine-tuning a model. Specifically:

  1. Is there a way to fine-tune directly with such text data without converting it into prompt-completion pairs?
  2. Can the OpenAI API support this, or would I need to look into alternative methods or platforms (like Hugging Face)?
  3. Has anyone successfully used non-prompt/completion data for fine-tuning with OpenAI’s API? If so, how did you approach it?
  4. Any recommendations or best practices for efficiently converting large text data into the required format, if that’s the only option?

I appreciate any insights or experiences you can share. I’m particularly interested in ways to minimize the manual effort in formatting the data while still achieving effective fine-tuning results.

Thank you in advance for your help!

Best regards,


I’m assuming that what you want is the ability to “speak” to this collection of documents, or find information.

Fine-tuning is typically for behavioral changes. Although knowledge acquisition is a byproduct of fine-tuning, it's not recommended when that is your main focus, for many reasons.

Instead, I would recommend researching and implementing a retrieval system that you can then use for RAG (Retrieval-Augmented Generation).
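To make the flow concrete, here is a minimal sketch of the retrieve-then-augment loop. It uses a toy bag-of-words "embedding" purely for illustration; in practice you would use a real embedding model (e.g. via the OpenAI embeddings endpoint) and a vector store, and the function names here are my own, not from any library.

```python
# Minimal RAG-style retrieval sketch. The bag-of-words "embedding" is a
# stand-in for a real embedding model; only the overall flow matters here.
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy embedding: a word-count vector over lowercase tokens.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    # Rank documents by similarity to the query and keep the top k.
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(query: str, docs: list[str], k: int = 2) -> str:
    # Prepend the retrieved passages so the model answers from them,
    # instead of relying on fine-tuned knowledge.
    context = "\n---\n".join(retrieve(query, docs, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

docs = [
    "Our refund policy allows returns within 30 days of purchase.",
    "The office is closed on public holidays.",
    "Shipping usually takes 3-5 business days.",
]
prompt = build_prompt("How many days do I have to return a purchase?", docs, k=1)
```

The final `prompt` string is what you would send to the chat model: the documents stay outside the model, and you swap in a real embedding model and vector database as the collection grows.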

I have the same question, but I'm also wondering whether we can use GPT-4o to create prompt-completion pairs from the articles. Has anyone had success or problems with this approach?

It can work pretty well. If you can fine-tune it to your specific domain, it could be a huge time-saver. Just remember to review the pairs it generates, and you’ll be golden!
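If it helps, here is one way that pair-generation step might be wired up. The chunking approach and prompt wording are my assumptions, not an official recipe; the JSONL structure does match OpenAI's chat fine-tuning format. The actual model call is left as a commented sketch since it needs an API key.

```python
# Sketch: turn article text into chat fine-tuning JSONL via generated Q&A pairs.
import json

def chunk_article(text: str, max_chars: int = 2000) -> list[str]:
    # Naive fixed-size chunking; paragraph-aware splitting would be better.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def to_jsonl_line(question: str, answer: str) -> str:
    # One training example in OpenAI's chat fine-tuning format.
    record = {
        "messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]
    }
    return json.dumps(record)

# The generation step itself (assumed prompt wording, requires OPENAI_API_KEY):
# from openai import OpenAI
# client = OpenAI()
# for chunk in chunk_article(article_text):
#     resp = client.chat.completions.create(
#         model="gpt-4o",
#         messages=[{"role": "user", "content":
#                    f"Write one question and answer covering this text:\n{chunk}"}],
#     )
#     # Parse the Q&A out of resp, then write to_jsonl_line(q, a) to a file.
```

As noted above, reviewing the generated pairs before training is the important part; the formatting itself is mechanical.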