Distillation: what's been your experience?

I’m talking about this:

What feedback do you have?

Has anyone been able to train a cheap model to perform almost as well as, or on par with, a more expensive model for specific use cases, and then substitute it for the more expensive model in production without a significant loss of performance?

6 Likes

There was a very interesting and helpful presentation at the DevDay events, describing a best-practice approach to model distillation based on the experience of OpenAI’s solution engineers.

I recommend keeping an eye out for when these talks are published, likely on the OpenAI YouTube channel. They will provide insights for improving the performance of your distilled models and establishing a process to create these models for cost savings and reduced latency when needed.

6 Likes

It just didn’t seem productive as a platform. It is nice to have a no-additional-cost offering (passing the “store” parameter gives you 30 days of storage), but to get quality training data, you would likely use very different patterns.

For example, GPT-4 can do high-quality inference with enough attention to actually follow an instruction like “don’t import typing, use Python 3.10+ built-in lower-case types”, rather than having instructions completely ignored the way gpt-4o does from chat overtraining on pattern over instruction (and having 4o or the reduced gpt-3.5 as the only fine-tuning destinations is non-ideal).

However, GPT-4’s output as a training generator is now constrained and artificially crippled, limiting long-form production and producing an odd style. The best way to run training generations on it is multi-shot, using higher-quality turn examples that show the desired form of generation and orient it to the topic. The 6k of context and new knowledge used to make the training data is not what you then fine-tune on; instead you would place the actual system instructions for inference, along with a hypothetical lead-up chat.

o1-xx as a training generator is unpredictable zero-shot: it has no system-instruction authority, and its output style is often undesired. Will it give no chat, or pages of chat? Will it give 100 batched prompt denials in a row?

A better interface for the actual generation than no interface at all would be one where you can tab through and work one example at a time: manually regenerating, iterating, and correcting the system prompting for the additional skill required to turn the desired input context into output; then automating; then placing the final system message and chat context for the actual fine-tuning.

The current models simply lack quality context attention. You cannot fine-tune a “summarize”, “synthesize”, or “extract” task when the observational capability just isn’t there. About the only hypothetical benefit is consuming less attention by putting the behavior into reinforcement-learning weights instead of relying on instruction context-following, or having an instruction not be ignored when you quadruple the input context length, because the behaviors were learned. I have yet to need a very limited specialization done by a cheaper model, with up-front investment, where this could succeed.

4 Likes

I’ve been working on model finetuning to get GPT to recognize certain kinds of very particular contract clauses, and then provide an analysis of their terms for various aspects of favorability.

This clause extraction/classification task is one that neither GPT-4o nor 4o-mini can perform well. Interestingly, both contain a lot of parametric knowledge about these contract terms (commercial credit agreements), but neither can really look at a document and then tell you anything about them. Their parameters only enable one-way analyses.

I use a playbook prompt that contains a bunch of context for how gpt-4o should analyze the contracts, then generate synthetic training sets analyzing a bunch of synthetic clauses. The fine-tuned gpt-4o-mini can vastly outperform gpt-4o. It might be a situation where really well-optimized few-shot prompting could enable this too; I haven’t tried it yet.
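A minimal sketch of that generate-then-fine-tune flow, assuming a playbook prompt and synthetic clauses. All names here (the prompt text, the metadata of the examples) are placeholders, not the actual prompts used:

```python
# Hypothetical sketch: use a "playbook" system prompt with gpt-4o to produce
# synthetic clause analyses, then format each (clause, analysis) pair as a
# chat fine-tuning example for gpt-4o-mini. PLAYBOOK_PROMPT is a stand-in.
import json

PLAYBOOK_PROMPT = "You are a credit-agreement analyst. ..."  # assumed playbook text

def to_finetune_record(clause: str, analysis: str) -> dict:
    """Format one synthetic pair as a chat-format fine-tuning example."""
    return {
        "messages": [
            {"role": "system", "content": PLAYBOOK_PROMPT},
            {"role": "user", "content": clause},
            {"role": "assistant", "content": analysis},
        ]
    }

def write_jsonl(records: list, path: str) -> None:
    """Fine-tuning uploads expect one JSON object per line."""
    with open(path, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

# Generation step (requires an API key; shown for shape only):
# from openai import OpenAI
# client = OpenAI()
# resp = client.chat.completions.create(
#     model="gpt-4o",
#     messages=[{"role": "system", "content": PLAYBOOK_PROMPT},
#               {"role": "user", "content": synthetic_clause}],
# )
# analysis = resp.choices[0].message.content
```

The resulting JSONL file is what gets uploaded as the training file for the smaller model.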

3 Likes

One aspect worth pointing out is that at DevDay, when OpenAI provided a deep dive on distillation, they highlighted the types of use cases that are a good fit for it. These included sentiment analysis, entity extraction, and opinion mining, and to a slightly lesser extent classification tasks, copywriting, summary generation, and support chatbots. The bottom line: the narrower the task, the better the fit for distillation. More open-ended tasks are not deemed a good fit.

One of the other things that still presents a challenge, from my point of view, is that you can’t delete or easily filter out low-quality completions. However, I asked them about it and they said they are working on a solution for that.

8 Likes

this is what I’m currently aiming to make cheaper - I’m finding “tagging” is better with completions than with embedding vector distance (very odd that that’s the case!)

that’s good to know!

4 Likes

As I understand it, embeddings primarily focus on capturing relationships between data points in a semantic space (vectors), rather than directly modeling the structure or organization of the data.

Completions rely on a model’s ability to predict the next word or sequence based on the input, leveraging the full context of the data. This includes recognizing hierarchical relationships.
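A toy illustration of the embedding-distance side of the comparison above. The hard-coded vectors stand in for real embeddings (which would come from an embeddings endpoint); a completion-based tagger would instead send the item plus the tag list to a chat model and read back a label:

```python
# Embedding-distance tagging: pick the tag whose vector is most similar
# to the item's vector. Vectors here are toy stand-ins for real embeddings.
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def nearest_tag(item_vec, tag_vecs):
    """Return the tag whose embedding is closest to the item's embedding."""
    return max(tag_vecs, key=lambda tag: cosine(item_vec, tag_vecs[tag]))

# Completion-based tagging, by contrast, is just a prompt (sketch only):
# prompt = f"Tags: {', '.join(tag_vecs)}. Which single tag best fits: {item_text}?"
```

The completion route lets the model use the full context of the item, which may explain the observation above that it tags better than raw vector distance.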

3 Likes

Echoing @jr.2509

From what I could understand, one of the ways in which distillation is supposed to be used is to bring the desired behaviour to smaller, faster, and more economical models.

Here’s a slide from @jillian’s talk at DevDay Singapore that really carries the essence of this.

4 Likes

Yes. This aligns with the programming principles of “Separation of Concerns”.

Instead of having a single model like gpt-4o or o1 or whatever perform something like “Understand if this user is providing a positive review and THEN return a JSON structure”, it can be simplified to:

  • Run a distilled classification model on the item
  • IF positive, run distilled gpt-4o to transform it into JSON

Two modular encapsulated functions.

The first model can now be further refined with good examples. It uses fewer tokens, costs less (in fact, in my cases BERT is always perfectly suitable for these tasks, so I can run it all locally), and in theory should provide better output.

And now I can easily wrap these models in some very nice programming to get the best of both worlds. Instead of being forced to jumble it all together in a true betrayal of programming.
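A minimal sketch of that two-step pipeline, with both model calls stubbed out so only the control flow is shown. In practice `classify()` would be the distilled classifier (or a local BERT) and `to_json()` the distilled gpt-4o transform; the keyword check here is purely a stand-in:

```python
# Separation of concerns: a classifier stage gates a transform stage.
# Both functions are stubs standing in for distilled model calls.
import json

def classify(review: str) -> str:
    """Stub for the distilled sentiment classifier (would be BERT or a
    distilled model in practice; keyword match is a placeholder)."""
    return "positive" if "love" in review.lower() else "negative"

def to_json(review: str) -> str:
    """Stub for the distilled JSON-transform model."""
    return json.dumps({"review": review, "sentiment": "positive"})

def pipeline(review: str):
    """Classify first; run the transform only on positive reviews."""
    if classify(review) == "positive":
        return to_json(review)
    return None
```

Each stage can now be swapped, refined, or reused independently, which is the modularity argument being made above.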

I can even plop my sentiment classifier into other areas if I’d like and have a permanent, wonderful, homegrown LLM


I truly believe that A LOT of people are using models MUCH MORE POWERFUL than what they actually need. So I absolutely love the idea of distillation. Although I agree with @_j that I prefer to run it myself. But, that doesn’t mean it’s not a useful tool in the OpenAI belt.

8 Likes

All great discussion and theory-crafting, but I guess I’m looking for feedback on real production experience … or is the feature set too limited (as per @jr.2509) to facilitate that yet?

To continue, I thought of another case of training-set development by AI that distillation doesn’t really fit, after sending 50k tokens to o1-preview over and over.

Iteration.

Repeatedly explaining again what the AI got wrong, giving more documentation, explaining proper implementation and workarounds, pointing out what it ignored that was clearly specified, deleting messages and editing the chat to lower the context and simulate an AI producing the better version, then doing more rounds. That’s not good fine-tuning data.

The desired training data from all that knowledge work is the AI’s 0-shot magic finally realized.

“Store” only works on output you haven’t seen yet. The documentation cuts you right off: “The first step in the distillation process is to generate good results with a large model…” (edit: the first step is actually to set the store parameter).
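For context on the store mechanics: the flag has to be set on the request before generation, so you opt in first and only afterwards see whether the stored output was any good. A hedged sketch (the metadata tag is illustrative, not a required field):

```python
# Opting a chat completion into storage via the "store" parameter so it
# can later be reviewed and selected as distillation data. The metadata
# tag is a hypothetical label for filtering stored completions later.
def stored_completion_kwargs(model, messages):
    """Build request kwargs for a completion that opts into ~30-day storage."""
    return {
        "model": model,
        "messages": messages,
        "store": True,                         # must be set BEFORE generation
        "metadata": {"task": "training-gen"},  # hypothetical filter tag
    }

# Live call (requires an API key):
# from openai import OpenAI
# client = OpenAI()
# resp = client.chat.completions.create(
#     **stored_completion_kwargs("gpt-4o", [{"role": "user", "content": "..."}])
# )
```

That ordering is exactly the friction described above: the heavy iterative editing that produces the good output never passes through a stored request.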

(ARC-AGI o1 fine-tune likely wasn’t the user being exasperated over and over…)

2 Likes