Effectively doing continued pretraining via the fine-tuning API

We have a massive (~32 GB) specialized, unlabeled medical dataset, and we want to train an LLM to be knowledgeable about the information in it. The two options seem to be (1) OpenAI’s fine-tuning API and (2) setting up a custom LLM, and we’d generally like to stick with #1.

As we all know, though, OpenAI’s fine-tuning API is strictly for supervised, instruction-style training. Our goal, however, is typical unsupervised next-word-prediction training on the 32 GB of specialized, unlabeled data.

One approach we might try is building a “supervised-style” input-output pair dataset (as required by OpenAI’s fine-tuning paradigm) where the implicit task is “predict the next word after this text”.

We’d take a chunk from the unlabeled dataset as the “input” and the “next token” as the “output”, so as to hack the supervised fine-tuning framework into an unsupervised paradigm (a rough sketch of how we might generate these pairs follows the example below).

Example: 
* Input: "The patient presented with symptoms of"
* Output: "fever"
* Input: "The prescribed medication is"
* Output: "ibuprofen"

I’d think that this would actually allow the model to gain a level of insight from the medical data, and then we’d fine-tune it with a supervised dataset for more specific objectives.

Do you guys think this would work? Let me know!

No, this wouldn’t work and would almost certainly just break the model.

Have you considered using your dataset as the basis for a RAG implementation?
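
If it helps, here is a minimal sketch of what that could look like with the OpenAI Python client, assuming the medical corpus has already been split into chunks. The model names, the hard-coded example chunks, and the in-memory similarity search are illustrative only; a real deployment would use a proper vector store:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative placeholder chunks; in practice these come from splitting
# the 32 GB medical corpus and would live in a vector database.
chunks = [
    "Ibuprofen is an NSAID commonly prescribed for fever and mild pain.",
    "Patients presenting with fever should be assessed for infection.",
]

def embed(texts):
    """Embed a list of strings; the embedding model is just an example."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

chunk_vectors = embed(chunks)

def answer(question, top_k=1):
    # Retrieve the most similar chunk(s) by cosine similarity...
    q = embed([question])[0]
    sims = chunk_vectors @ q / (
        np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(q)
    )
    context = "\n".join(chunks[i] for i in np.argsort(sims)[::-1][:top_k])
    # ...then ask the model to answer grounded only in the retrieved context.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content

print(answer("What is ibuprofen used for?"))
```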

Actually, what if instead the example was:

* Input: "Predict the next word: 'The patient presented with symptoms of'"
* Output: "fever"
* Input: "Predict the next word: 'The prescribed medication is'"
* Output: "ibuprofen"

For this use case, I think it comes down to building deep knowledge of the particular domain we’re working in, which can’t immediately be solved with a RAG implementation. We are exploring fine-tuning open-source models, but I’m also trying to find ways to leverage OpenAI’s innovations.

Also, do people ever fine-tune OpenAI’s models on billions of tokens? The docs suggest something like 100-500 examples, so clearly this feature is not meant for our use case :frowning: