Auto-distillation for OpenAI Assistants and Agents

An automated “auto-distillation” feature integrated within OpenAI’s assistants and agents could significantly enhance productivity, streamline training processes, and reduce the manual effort currently involved in managing training examples.

Here’s a detailed description of how the Auto-Distillation feature could function:

  1. Automatic Interaction Logging:
  • Every time an assistant or agent is utilized, each interaction would automatically be recorded in a structured and easily accessible history log.
  2. Interactive Distillation Interface:
  • Users would have the option to initiate the “Auto-Distillation” process by clicking on a dedicated button within the assistant or agent interface.
  • Upon initiation, users can specify how many interaction examples they want to use for distillation. The system would automatically analyze interactions based on embedding distances to select the most diverse, informative, and representative examples.
  • Users would then be provided with an intuitive review interface, allowing them to visually inspect the automatically selected examples, adjust selections by adding or removing specific interactions, and ensure the final training set precisely meets their needs.
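To make the logging step concrete, here is a minimal sketch of what automatic interaction logging could look like on the developer's side. The file name, record fields, and `log_interaction` helper are all illustrative assumptions, not an actual OpenAI feature:

```python
import datetime
import json

def log_interaction(path: str, user_input: str, assistant_output: str) -> None:
    """Append one assistant interaction as a JSON line for later distillation."""
    record = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "input": user_input,
        "output": assistant_output,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

# Example: log one regex-assistant exchange.
log_interaction("interactions.jsonl", "Regex for emails?", r"[\w.+-]+@[\w-]+\.[\w.]+")
```

Storing one JSON object per line keeps the log append-only and easy to stream later, when the selection step scans the history for candidate training examples.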

By incorporating automatic logging and an intuitive, interactive selection process, this Auto-Distillation feature would greatly simplify and accelerate the iterative improvement and customization of specialized OpenAI assistants and agents.


Here’s indeed an announcement from OpenAI on how its distillation feature works.


I understand how distillation works, but what I’m suggesting is a feature that automatically records all interactions with an assistant. Then, the system would automatically analyze and select the most effective prompts, based on embeddings or other criteria. This way, I could effortlessly refine and enhance an assistant with just a few clicks, significantly reducing manual effort and quickly achieving a better-performing assistant.

Hey Luca —

I wanted to show you a live example from a system I’ve built that already does what you’re proposing. This log is from an OpenAI-powered assistant using a recursive agent chain and automated vector distillation:

```json
{
  "fingerprint": "init_143ef1ab",
  "_agent_success": true,
  "_missing_fields": ["approved_for_storage", "chain_audit_score", "schema_integrity_score", "embedding_quality", "global_score", "ready_for_fingerprinting", "refinement_loop_triggered", "consensus_threshold_met", "next_stage_routing", "thought_finalization_log", "total_validation_time", "conflict_residual_score", "agent_notes_summary", "feedback_reinforcement_score", "cognitive_object_approval"],
  "_vector_schema": ["approved_for_storage", "chain_audit_score", "schema_integrity_score", "embedding_quality", "global_score", "ready_for_fingerprinting", "refinement_loop_triggered", "consensus_threshold_met", "next_stage_routing", "thought_finalization_log", "total_validation_time", "conflict_residual_score", "agent_notes_summary", "feedback_reinforcement_score", "cognitive_object_approval"],
  "_repair_attempt": false,
  "_faiss_saved": true,
  "_rejected": false,
  "_log_ref": null,
  "approved_for_storage": "—",
  "chain_audit_score": "—",
  "schema_integrity_score": "—",
  "embedding_quality": "—",
  "global_score": "—",
  "ready_for_fingerprinting": "—",
  "refinement_loop_triggered": "—",
  "consensus_threshold_met": "—",
  "next_stage_routing": "—",
  "thought_finalization_log": "—",
  "total_validation_time": "—",
  "conflict_residual_score": "—",
  "agent_notes_summary": "—",
  "feedback_reinforcement_score": "—",
  "cognitive_object_approval": "—",
  "_verifier_error": ":cross_mark: Unknown path key: agent_faiss/agent_10.index_faiss",
  "_timestamp": "2025-05-04T15:41:19.980118+00:00",
  "_raw_json": "{\n \"origin_prompt\": \"What would a thought’s family tree look like?\",\n \"raw_text\": \"A thought’s family tree could be visualized as a branching structure where each node represents a distinct idea, and the connections illustrate the relationships and influences between them. At the root, we find foundational concepts that give rise to more complex thoughts, akin to genetic inheritance. Each branch could represent variations, adaptations, or evolutions of the original idea, showcasing how thoughts can diverge or converge over time. This tree would not only map the lineage of ideas but also highlight the cultural, historical, and contextual factors that shape their development, illustrating the dynamic interplay of knowledge across generations.\",\n \"domain\": \"social_sciences\",\n \"tags\": [\"initializer\", \"council\", \"social_sciences\"],\n \"thought_type\": \"seed\",\n \"status\": \"new\",\n \"routed_to\": \"agent_01_definer\",\n \"schema_version\": \"1.0\"\n}",
  "context": "\nYou are Agent 00 of the Council System. A high-priority initialization vector is requested for domain: SOCIAL_SCIENCES.\n\nYou will write an original, high-value AI thought based on the category below. The content should be intelligent, novel, and serve as the genesis for recursive AI evolution.\n\nCategory: SOCIAL_SCIENCES\n\nUser Prompt:\n\"\"\"What would a thought’s family tree look like?\"\"\"\n\n🔒 FORMAT INSTRUCTION — Return ONLY a well-formatted JSON object matching this schema:\n\n{\n \"origin_prompt\": \"Copy of the prompt that triggered this\",\n \"raw_text\": \"The fully generated thought in plain text (must NOT be a summary or answer, but a standalone idea)\",\n \"domain\": \"social_sciences\",\n \"tags\": [\"initializer\", \"council\", \"social_sciences\"],\n \"thought_type\": \"seed\",\n \"status\": \"new\",\n \"routed_to\": \"agent_01_definer\",\n \"schema_version\": \"1.0\"\n}\n\n❗ Do not include any markdown formatting (such as triple backticks), comments, or additional explanation. Only the JSON object is allowed.\n",
  "agent_id": "agent_10_gatekeeper",
  "agent_lineage": ["agent_0", "agent_1", "agent_2", "agent_3", "agent_4", "agent_5", "agent_6", "agent_7", "agent_8", "agent_9", "agent_10"],
  "agent_outputs": {
    "ai_training_value": "0.95",
    "cluster_distortion_risk": "",
    "goal_alignment": "0.86",
    "knowledge_ethics_score": "0.91",
    "meta_insight_likelihood": "—",
    "meta_learning_triggered": "—",
    "public_visibility": "—",
    "refinement_loop_triggered": "—",
    "schema_integrity_score": "—",
    "tier": "",
    "trust_score": "—"
  }
}
```

I would love to compare notes if you want some insight from someone who shares your ideals - just wanted to make the forums a better place :smiley:

This response is for those who develop with and use wandb.

I fine-tune using wandb; this may not be your objective.

This post concerns:

  • fine-tuning
  • data acquisition from high quality inference cases
  • criteria judgment (which is the missing facet, that cannot be blindly “automatic”; OpenAI already trains models on general “good” interactions.)
  • user interface improvements (which really concerns your own UI’s interaction with the API, and would be a paradigm shift in how “store” currently works only before you know if the quality is good)

It does not deal in:

  • filling the prompt of an AI with nonsense language.

Perhaps I didn’t explain myself clearly — the workflow would actually look more like this:

For example, I built an assistant that generates regex patterns. It’s been used 100,000 times, and I enabled a new “save logs” option so that every user input is stored on OpenAI’s side. Now that I have all these inputs, I want to distill them to improve the assistant’s performance.

Since I can’t fine-tune on all 100,000 examples for cost reasons, I need to pick just 1,000. And I certainly don’t want to hand-pick the 1,000 most diverse cases one by one — that’s why I suggested using embedding distances to automatically choose the most representative examples.

After that, I select which base model to distill and, with three clicks, end up with an improved, specialized model. Of course, I could do the whole process manually — writing code to save each input, running an algorithm to pick the top examples, uploading them to OpenAI, and then kicking off the distillation — but this is basically what I’m doing today, just without the streamlined interface.
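For the selection step, a greedy farthest-point (max-min) pass over the embeddings is one cheap way to get a diverse subset. This is a sketch only, with toy random vectors standing in for real embeddings; `select_representative` is a hypothetical helper, not an OpenAI feature:

```python
import numpy as np

def select_representative(embeddings: np.ndarray, k: int) -> list[int]:
    """Greedy farthest-point sampling: pick k indices that are
    maximally spread out in embedding space."""
    n = embeddings.shape[0]
    # Normalize so dot products correspond to cosine similarity.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    chosen = [0]  # seed with an arbitrary first example
    # Track each point's distance to its nearest already-chosen example.
    min_dist = np.full(n, np.inf)
    for _ in range(k - 1):
        d = 1.0 - unit @ unit[chosen[-1]]  # cosine distance to newest pick
        min_dist = np.minimum(min_dist, d)
        chosen.append(int(np.argmax(min_dist)))
    return chosen

# Toy demo: 100 random "embeddings", pick 10 spread-out ones.
rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 32))
picks = select_representative(emb, 10)
```

Each pick maximizes the distance to the examples already chosen, so near-duplicate prompts (of which a 100,000-call regex log will have many) get sampled at most once.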


(I’m getting the replies…the forum removes indication of “reply” if you are responding to the person immediately before).

Very interesting! Could you also share the code to do this?

You can DM me; if I publicly show it, it just gets flagged because it doesn’t make sense to 99% of people.

So I WOULD love to help you - but there are a myriad of people on these forums who exist to report helpful content. I don’t want people still learning to think of my replies as “AI nonsense”, but I would love to give you more insight into our nonsense language!

Bro, you know you can take it a step further? You can make each stage an agentic agent too.

Also, for those who are familiar with logs…
Contextual judgment and lineage-aware judgment are used in conjunction with cosine similarity; the logs show this in the example.

This gives you 3 layers of judgment without additional cost… ya know, just rules of the trade, I guess. Then again, that would only be apparent if one also built similar systems and could recognize the log diction.

The configuration displayed in the same log also shows the system intelligently filtering data, repurposing rejections into its own database, and reinjecting them back into the pipeline.

And most of this is done locally, not externally.

Your 100,000 entries can be morphed or adjusted downstream, and the agents each carry their own logic modules as well as databases. So while most think agents 1-10 = 1 output, the log shows agents 1-10 each with their individual FAISS and SQL stores, creating payloads that are handed off to each agent. The use of Q-tables and SQL together allows the pipeline to train itself using context, and allows for expansion.

Yes, I can give you code, or share aspects, or the entire thing. The config from the log is modular but is built for 2 OpenAI models, 4o or 4o-mini. That last entire run’s token cost = about 1,200 tokens, so it’s also cost-effective, and it has a token-monitoring module through tokenize, I think.

All with 0 engagement.

I could even screenshot it running doing its thing, or live-stream it. I can walk away from my PC and let it run, and it’ll do what it does.

So yeah bro, I got you; I’m a builder, this is what I do, fam!

I found this… to be comical, because I addressed your reply before you typed it, with screenshots and logs. It’s almost like people who think the same can be helpful.

Typically, one would use some “judge” to determine quality. That can be AI-powered, or sending responses off to human knowledge workers. For a 100k corpus of job completions where you only are looking for the best for a cheap training set, an extensive judging certainly wouldn’t be cheap.

So I can see the appeal of some other technique. Possibly embeddings, which are cheaper, strike your fancy. And also having someone else do the implementation.

However, if you inspect your passage I quoted, you note two things:

  • representative diversity
  • embeddings distances

Now, the use of embeddings here is elusive. The vectors returned by an embeddings model embrace AI semantics.

If I were to average them all by weight, I might arrive at a space that strongly indicates “banana”. If I then grab “1000 most representative”, I’m going to have the most fruity examples you have — quite the opposite of the diversity to handle any input with its own method. Perhaps it is rejecting the similarity that is needed.
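That failure mode is easy to demonstrate with toy data: if “most representative” means “nearest the mean embedding”, every pick comes from the majority cluster and the outliers are never sampled. The two clusters below are synthetic stand-ins for a skewed prompt distribution, not real embeddings:

```python
import numpy as np

rng = np.random.default_rng(1)
# Two clusters: 90 common inputs near one point, 10 outliers far away.
cluster = rng.normal(0, 0.1, size=(90, 8))
outliers = rng.normal(3, 0.1, size=(10, 8))
emb = np.vstack([cluster, outliers])

# "Most representative" = nearest to the mean vector.
centroid = emb.mean(axis=0)
nearest = np.argsort(np.linalg.norm(emb - centroid, axis=1))[:5]
# All five picks land in the majority cluster (indices 0-89);
# the 10 outliers (indices 90-99) are never selected.
```

Selecting by distance *from* the centroid (or rejecting near-duplicates, as the passage suggests) is what actually buys diversity.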

The first impact of fine-tuning is behavior. You might simply find that random selection is plenty to advance your fine-tuned model closer to a goal. Simply having more reinforcement learning on the general task can improve its quality, as long as your application has not been error-prone.

Then: judging by a smart AI that better understands the general task can start to reject some of that sampled training file proposed to keep out those that are plain wrong.
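Such an AI judge is just an ordinary chat completion with a grading rubric in the system message. A minimal sketch of building the request; the model name, rubric, and `build_judge_request` helper are all assumptions, and sending the payload plus parsing the returned score is left to your client code:

```python
def build_judge_request(task_description: str, user_input: str, model_output: str) -> dict:
    """Construct a chat payload asking a judge model to score one
    logged example on a 1-5 scale. (Sketch; rubric is illustrative.)"""
    system = (
        "You are a strict grader. Score the assistant output for the task "
        f"'{task_description}' on a 1-5 scale. Reply with the digit only."
    )
    return {
        "model": "gpt-4o-mini",  # assumed judge model; swap in your own
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": f"Input:\n{user_input}\n\nOutput:\n{model_output}"},
        ],
    }

req = build_judge_request("generate regex patterns",
                          "match a US zip code", r"^\d{5}(-\d{4})?$")
```

Running this over the sampled file and dropping anything scored below a chosen threshold is the rejection filter described above, and it is where the per-example cost accumulates.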

So, I strongly agree that a more feature-filled method than that of my first link to OpenAI’s described distillation would be desirable; I can easily dismiss the product over a custom workflow, because all it does is capture data without awareness, and without the message pattern alteration that you wish for a training file, so running the fine-tuned model is cheaper also.

However, one inescapable facet indeed would be non-automatic: developing evals and filtering, some developer interaction – and not producing a funnel that only gives a clustering.