Distillation: what's been your experience?

I’m talking about this:

What feedback do you have?

Has anyone been able to train a cheap model to perform almost as well as, or on par with, a more expensive model for specific use cases, and then substitute it for the more expensive model in production without a significant loss of performance?

6 Likes

There was a very interesting and helpful presentation at the DevDay events, describing a best-practice approach to model distillation based on the experience of OpenAI’s solution engineers.

I recommend keeping an eye out for when these talks are published, likely on the OpenAI YouTube channel. They will provide insights for improving the performance of your distilled models and establishing a process to create these models for cost savings and reduced latency when needed.

6 Likes

It just didn’t seem productive as a platform. It is nice to have a no-additional-cost offering (passing the “store” parameter gives you 30 days of storage), but to get quality training data, you would likely use very different patterns.

For example, GPT-4 can do high-quality inference with enough attention to actually follow an instruction like “don’t import typing, use Python 3.10+ built-in lower-case types”, rather than having instructions completely ignored the way gpt-4o does from chat overtraining on pattern over instruction (and having 4o or the reduced gpt-3.5 as the only fine-tuning destinations is non-ideal).

However, GPT-4’s output as a training generator is now constrained and artificially crippled, limiting long-form production and producing an odd style. The best way to run training generations on it is multi-shot, using higher-quality turn examples that show the desired form of generation and orient it to the topic. The 6k of context and new knowledge used to make the training data is not what you then fine-tune on; instead you would place the actual system instructions for inference, along with a hypothetical lead-up chat.

o1-xx as a training generator is unpredictable zero-shot: it has no system-instruction authority, and its output style is often undesired. Will it give no chat, or pages of chat? Will it give 100 batched prompt denials in a row?

A better interface for the actual generation than no interface at all would be one where you can tab through and work one example at a time: manually regenerating, iterating, and correcting the system prompting for the additional skill required to turn the desired input context into output; then automating; then placing the final system message and chat context for the actual fine-tuning.

The current models simply lack quality context attention. You cannot fine-tune a “summarize”, “synthesize”, or “extract” task when the observational capability just isn’t there. About the only hypothetical benefit is consuming less attention by putting the behavior into reinforcement-learning weights instead of relying on instruction context-following, or having an instruction not be ignored when you quadruple the input context length, because the behaviors were learned. I have yet to need a very limited specialization done by a cheaper model, with up-front investment, where this could succeed.

4 Likes

I’ve been working on model finetuning to get GPT to recognize certain kinds of very particular contract clauses, and then provide an analysis of their terms for various aspects of favorability.

This clause extraction/classification task is one that neither GPT-4o nor 4o-mini can perform well. Interestingly, both contain a lot of parametric knowledge about these contract terms (commercial credit agreements), but neither can really look at a document and then tell you anything about them. Their parameters only enable one-way analyses.

I use a playbook prompt that contains a bunch of context for how gpt-4o should analyze the contracts, then generate synthetic training sets analyzing a bunch of synthetic clauses. The fine-tuned gpt-4o-mini can vastly outperform gpt-4o. It might be a situation where really well-optimized few-shot prompting could enable this too; I haven’t tried it yet.
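A minimal sketch of that generate-then-fine-tune flow, assuming a playbook prompt and synthetic clauses. All names here (the prompt text, the metadata of the examples) are placeholders, not the actual prompts used:

```python
# Hypothetical sketch: use a "playbook" system prompt with gpt-4o to produce
# synthetic clause analyses, then format each (clause, analysis) pair as a
# chat fine-tuning example for gpt-4o-mini. PLAYBOOK_PROMPT is a stand-in.
import json

PLAYBOOK_PROMPT = "You are a credit-agreement analyst. ..."  # assumed playbook text

def to_finetune_record(clause: str, analysis: str) -> dict:
    """Format one synthetic pair as a chat-format fine-tuning example."""
    return {
        "messages": [
            {"role": "system", "content": PLAYBOOK_PROMPT},
            {"role": "user", "content": clause},
            {"role": "assistant", "content": analysis},
        ]
    }

def write_jsonl(records: list, path: str) -> None:
    """Fine-tuning uploads expect one JSON object per line."""
    with open(path, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

# Generation step (requires an API key; shown for shape only):
# from openai import OpenAI
# client = OpenAI()
# resp = client.chat.completions.create(
#     model="gpt-4o",
#     messages=[{"role": "system", "content": PLAYBOOK_PROMPT},
#               {"role": "user", "content": synthetic_clause}],
# )
# analysis = resp.choices[0].message.content
```

The resulting JSONL file is what gets uploaded as the training file for the smaller model.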

3 Likes

One aspect worth pointing out is that at DevDay, when OpenAI provided a deep dive on distillation, they highlighted the types of use cases that are a good fit for it. These included sentiment analysis, entity extraction, and opinion mining, and to a slightly lesser extent classification tasks, copywriting, summary generation, and support chatbots. The bottom line: the narrower the task, the better the fit for distillation. More open-ended tasks are not deemed a good fit.

One of the other things that still presents a challenge, from my point of view, is that you can’t delete or easily filter out low-quality completions. However, I asked them about it and they said they are working on a solution for that.

8 Likes

this is what I’m currently aiming to make cheaper - I’m finding “tagging” is better with completions than with embedding vector distance (very odd that that’s the case!)

that’s good to know!

4 Likes

As I understand it, embeddings primarily focus on capturing relationships between data points in a semantic space (vectors), rather than directly modeling the structure or organization of the data.

Completions rely on a model’s ability to predict the next word or sequence based on the input, leveraging the full context of the data. This includes recognizing hierarchical relationships.
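A toy illustration of the embedding-distance side of the comparison above. The hard-coded vectors stand in for real embeddings (which would come from an embeddings endpoint); a completion-based tagger would instead send the item plus the tag list to a chat model and read back a label:

```python
# Embedding-distance tagging: pick the tag whose vector is most similar
# to the item's vector. Vectors here are toy stand-ins for real embeddings.
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def nearest_tag(item_vec, tag_vecs):
    """Return the tag whose embedding is closest to the item's embedding."""
    return max(tag_vecs, key=lambda tag: cosine(item_vec, tag_vecs[tag]))

# Completion-based tagging, by contrast, is just a prompt (sketch only):
# prompt = f"Tags: {', '.join(tag_vecs)}. Which single tag best fits: {item_text}?"
```

The completion route lets the model use the full context of the item, which may explain the observation above that it tags better than raw vector distance.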

3 Likes

Echoing @jr.2509

From what I could understand, one of the ways in which distillation is supposed to be used is to bring the desired behaviour to smaller, faster, and more economical models.

Here’s a slide from @jillian’s talk at DevDay Singapore that really carries the essence of this.

4 Likes

Yes. This aligns with the programming principles of “Separation of Concerns”.

Instead of having a single model like gpt-4o or o1 or whatever perform something like “Understand if this user is providing a positive review and THEN return a JSON structure”, it can be simplified to:

  • Run a distilled classification model on the item
  • IF positive, run distilled gpt-4o to transform it into JSON

Two modular encapsulated functions.

The first model can now be further refined with good examples. It uses fewer tokens, costs less (in fact, in my cases BERT is always perfectly suitable for these tasks, so I can run it all locally), and in theory should provide better output.

And now I can easily wrap these models in some very nice programming to get the best of both worlds. Instead of being forced to jumble it all together in a true betrayal of programming.
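A minimal sketch of that two-step pipeline, with both model calls stubbed out so only the control flow is shown. In practice `classify()` would be the distilled classifier (or a local BERT) and `to_json()` the distilled gpt-4o transform; the keyword check here is purely a stand-in:

```python
# Separation of concerns: a classifier stage gates a transform stage.
# Both functions are stubs standing in for distilled model calls.
import json

def classify(review: str) -> str:
    """Stub for the distilled sentiment classifier (would be BERT or a
    distilled model in practice; keyword match is a placeholder)."""
    return "positive" if "love" in review.lower() else "negative"

def to_json(review: str) -> str:
    """Stub for the distilled JSON-transform model."""
    return json.dumps({"review": review, "sentiment": "positive"})

def pipeline(review: str):
    """Classify first; run the transform only on positive reviews."""
    if classify(review) == "positive":
        return to_json(review)
    return None
```

Each stage can now be swapped, refined, or reused independently, which is the modularity argument being made above.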

I can even plop my sentiment classifier into other areas if I’d like and have a permanent, wonderful, homegrown LLM


I truly believe that A LOT of people are using models MUCH MORE POWERFUL than what they actually need. So I absolutely love the idea of distillation. Although I agree with @_j that I prefer to run it myself. But, that doesn’t mean it’s not a useful tool in the OpenAI belt.

8 Likes

All great discussion and theory-crafting, but I guess I’m looking for feedback on real production experience … or is the feature set too limited (as per @jr.2509) to facilitate that yet?

To continue, I thought of another case of training-set development by AI that distillation doesn’t really fit, after sending 50k tokens to o1-preview over and over.

Iteration.

Repeatedly explaining again what the AI got wrong, giving more documentation, explaining proper implementation and workarounds, pointing out what it ignored that was clearly specified, deleting messages and editing the chat to lower the context and simulate an AI producing the better version, then doing more rounds. That’s not good fine-tuning data.

The desired training data from all that knowledge work is the AI’s 0-shot magic finally realized.

“Store” only works on output you haven’t seen yet. The documentation cuts you right off: “The first step in the distillation process is to generate good results with a large model…” (edit: the first step is actually to set the store parameter).
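For context on the store mechanics: the flag has to be set on the request before generation, so you opt in first and only afterwards see whether the stored output was any good. A hedged sketch (the metadata tag is illustrative, not a required field):

```python
# Opting a chat completion into storage via the "store" parameter so it
# can later be reviewed and selected as distillation data. The metadata
# tag is a hypothetical label for filtering stored completions later.
def stored_completion_kwargs(model, messages):
    """Build request kwargs for a completion that opts into ~30-day storage."""
    return {
        "model": model,
        "messages": messages,
        "store": True,                         # must be set BEFORE generation
        "metadata": {"task": "training-gen"},  # hypothetical filter tag
    }

# Live call (requires an API key):
# from openai import OpenAI
# client = OpenAI()
# resp = client.chat.completions.create(
#     **stored_completion_kwargs("gpt-4o", [{"role": "user", "content": "..."}])
# )
```

That ordering is exactly the friction described above: the heavy iterative editing that produces the good output never passes through a stored request.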

(ARC-AGI o1 fine-tune likely wasn’t the user being exasperated over and over…)

2 Likes