How compartmentalized (if at all) can we be when training our API model?

I have an idea to use the API, and I would need to train the model on extensive data for document production based on my own criteria. I’ve read some articles, and I think I understand that once I train the model, I switch to essentially live mode, and then I start prompting the API with instructions for how to create the documents and I get my responses.

Is it possible to train the model on different topics? Like can I give it training data and say “This is for the topic of quantum physics” and then give it the data I want to use to train it, and then say “Switching to a new topic. This is all about string theory and only use this data for string theory related questions.” And then give it all of the new data I want it to know? How compartmentalized can we be when training a model for our particular use case?

Are you talking about:

  • foundation model training?
  • fine-tuning?
  • in-context learning?

My ignorance is going to show its ugly head.

I think I mean foundation model training. I want to use the API to create documents, or at least the text for documents. I need to teach it what input I’m giving it and what to do with that input. I believe that would be foundation model training. If I have 17 different documents, is that all part of the same training? Like, “Here are 56 examples of document ____________________. Input data will be labeled _____.” I hope I’m making it clear enough.

OpenAI does not support foundation model training.

If you only have a small amount of training data, you may be able to do what you want with fine-tuning, but currently only the older GPT-3 model (davinci) is supported for fine-tuning. OpenAI has indicated it will roll out some kind of fine-tuning support for gpt-3.5 and gpt-4 later this year.
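
For reference, a minimal sketch of the legacy fine-tuning flow with the Python library looks like the snippet below. The file name training_data.jsonl is hypothetical; each line of it is a {"prompt": ..., "completion": ...} pair.

    import openai  # legacy 0.x library, which exposes the fine-tunes endpoint

    # Upload a JSONL file of {"prompt": ..., "completion": ...} examples
    training_file = openai.File.create(
        file=open("training_data.jsonl", "rb"),
        purpose="fine-tune",
    )

    # Start a fine-tune job against the base davinci model
    job = openai.FineTune.create(
        training_file=training_file["id"],
        model="davinci",
    )
    print(job["id"])  # poll this job until it finishes, then call the resulting model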

Finally, if all you need to do is prompt the model with different data for different topics, you may be able to use the longer-context models (gpt-3.5-turbo-16k and gpt-4-32k), prompt-seeded with the appropriate document for the appropriate topic, depending on how much data you need.
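
That prompt-seeding approach is just in-context learning. A rough sketch, assuming your per-topic reference material fits in the context window (the file names and the ask() helper are made up for illustration):

    import openai

    # Hypothetical topic-to-document lookup; each document gets pasted into the prompt
    topic_documents = {
        "quantum physics": open("quantum_notes.txt").read(),
        "string theory": open("string_theory_notes.txt").read(),
    }

    def ask(topic: str, question: str) -> str:
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo-16k",
            messages=[
                {"role": "system",
                 "content": "Answer using only the reference material below.\n\n"
                            + topic_documents[topic]},
                {"role": "user", "content": question},
            ],
        )
        return response["choices"][0]["message"]["content"]

    print(ask("string theory", "Draft the summary section of the report."))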

Anyway, when doing foundation model training (which you can do with open-source models like MPT-30B), you can say “this relates to quantum mechanics” and “this relates to Mesopotamian art” just fine. You bundle all the different training documents into one big dataset, the model is trained on all of them, and hopefully it learns to select the right base data for the appropriate context. (An alternative is to train different models for different topics.) Also note that foundation training requires billions of input tokens and hundreds of thousands to millions of dollars of machine time.
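
To make the “one big dataset” idea concrete, here is a rough sketch of bundling topic-labeled documents into a single plain-text corpus. The corpus/ directory layout and the “### Topic:” marker are just assumptions for illustration, not a requirement of any particular training framework.

    from pathlib import Path

    corpus_dir = Path("corpus")                  # e.g. corpus/quantum_mechanics/*.txt
    output_file = Path("pretraining_corpus.txt")

    with output_file.open("w", encoding="utf-8") as out:
        for topic_dir in sorted(p for p in corpus_dir.iterdir() if p.is_dir()):
            for doc in sorted(topic_dir.glob("*.txt")):
                # The model just sees this header as more text, but it gives every
                # document a consistent cue for which subject it belongs to.
                out.write(f"### Topic: {topic_dir.name}\n")
                out.write(doc.read_text(encoding="utf-8").strip() + "\n\n")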

When doing fine-tuning, you can do somewhat the same thing, but your mileage may vary; it might be better to go with multiple separately fine-tuned models in that case. Fine-tuning can be done on roughly thousands to millions of tokens and typically costs on the order of hundreds to thousands of dollars.
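
One way to approximate compartmentalization in a single fine-tune is to put the topic marker inside the prompt itself, so the model learns to condition on it. A minimal sketch follows; the angle-bracket contents are placeholders, and the “###” separator and “END” stop token are just one common convention, not anything required by the API.

    import json

    # Hypothetical multi-topic fine-tuning records: the topic lives in the prompt,
    # so one model can serve several subjects.
    records = [
        {"prompt": "Topic: quantum physics\nInput: <source data>\n\n###\n\n",
         "completion": " <document text you want produced> END"},
        {"prompt": "Topic: string theory\nInput: <source data>\n\n###\n\n",
         "completion": " <document text you want produced> END"},
    ]

    with open("multi_topic_training.jsonl", "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")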