We have a large (~32GB) specialized, unlabeled medical text dataset, and we want to train an LLM to be knowledgeable about the information in it. The two options seem to be 1. OpenAI’s fine-tuning API and 2. setting up a custom LLM, and we’d generally like to stick with #1.
The catch is that OpenAI’s fine-tuning API is strictly for supervised, instruction-style training. Our goal, though, is typical unsupervised next-word-prediction training on the 32GB of specialized, unlabeled data.
One approach we might try is building the “supervised-style” input/output pair dataset that OpenAI’s fine-tuning paradigm requires, framed as “predict the next word after this sentence”.
We’d take a chunk of the unlabeled data as the “input” and the “next token” as the “output”, effectively hacking the supervised fine-tuning framework into an unsupervised paradigm (see the sketch after the examples below).
Examples:
* Input: "The patient presented with symptoms of"
* Output: "fever"
* Input: "The prescribed medication is"
* Output: "ibuprofen"
I’d think this would actually let the model gain a level of insight from the medical data, and then we’d fine-tune it on a supervised dataset for more specific objectives.
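For completeness, once the JSONL exists, kicking off the job itself is simple. A minimal sketch with the v1-style `openai` Python SDK (the model name is a placeholder; pick whichever base model you’re actually targeting):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Upload the training file produced above, then start a fine-tuning job on it.
training_file = client.files.create(
    file=open("train.jsonl", "rb"),
    purpose="fine-tune",
)
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",  # placeholder base model
)
print(job.id)
```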
Do you guys think this would work? Let me know!