Guidance Needed: GPT-OSS 20B Fine-Tuning with Unsloth → GGUF → Ollama → Triton (vLLM / TensorRT-LLM)

I am currently fine-tuning the GPT-OSS 20B model using Unsloth with HuggingFace TRL (SFTTrainer).
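
For context, here is roughly what my training setup looks like. This is a minimal sketch assuming Unsloth's standard LoRA workflow; the model id, LoRA hyperparameters, and schedule are placeholders rather than recommendations.

```python
from datasets import load_dataset
from unsloth import FastLanguageModel
from trl import SFTConfig, SFTTrainer

# Load the 20B checkpoint in 4-bit via Unsloth.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gpt-oss-20b",
    max_seq_length=4096,
    load_in_4bit=True,
)

# Attach LoRA adapters; rank and target modules are just my current guess.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# My SFT data: system / user / assistant conversations already rendered into a
# "text" column with the stock chat template (the rendering step is shown further down).
dataset = load_dataset("json", data_files="train.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,  # newer TRL releases call this processing_class
    train_dataset=dataset,
    args=SFTConfig(
        dataset_text_field="text",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        max_steps=200,
        learning_rate=2e-4,
        output_dir="outputs",
    ),
)
trainer.train()
```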

Deployment goals

  • Long-term: serve the model in production using Triton with either vLLM or TensorRT-LLM as the backend

  • Short-term: initial deployment with Ollama (GGUF)

Current challenge
GPT-OSS uses a Harmony-style chat template, which includes the elements below (a quick way to inspect the rendered output is sketched after the list):

  • developer role

  • Explicit EOS handling

  • thinking / analysis channels

  • Tool / function calling structure
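
To make the template question concrete, this is how I inspect what the stock template actually emits. The model id and messages are just an example; the apply_chat_template call is the standard Transformers API.

```python
from transformers import AutoTokenizer

# The Harmony chat template ships inside the tokenizer config of the checkpoint.
tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-20b")

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]

# Render without tokenizing so the Harmony markup (<|start|>, <|channel|>,
# <|message|>, <|end|>, the developer role, the analysis/final channels) shows
# up as plain text I can compare against whatever Ollama ends up sending.
rendered = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print(rendered)

# The EOS side of the question: what the tokenizer itself treats as end-of-sequence.
print(tokenizer.eos_token, tokenizer.eos_token_id)
```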

When converting the fine-tuned model to GGUF and deploying it in Ollama using the default GPT-OSS Modelfile, I am running into ambiguity around the following (my current conversion / Modelfile attempt is sketched after this list):

  1. Whether the default Jinja chat template provided by GPT-OSS should be modified for Ollama compatibility

  2. How to correctly handle:

    • EOS token behavior

    • Internal reasoning / analysis channels

    • Developer role alignment

  3. How to do this without degrading the model’s default performance or alignment
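
For reference, here is roughly what my current conversion and Ollama step look like (continuing from the training snippet above). The save_pretrained_gguf call is Unsloth's GGUF export helper as I understand it, and the stop tokens in the Modelfile are my assumption about Harmony's end-of-turn markers; confirming or correcting that assumption is exactly what I'm after.

```python
import subprocess

# 1) Export the merged fine-tune to GGUF with Unsloth's helper
#    (quantization_method is just what I picked for a first smoke test).
model.save_pretrained_gguf("gguf_out", tokenizer, quantization_method="q8_0")

# 2) Minimal Ollama Modelfile. I deliberately do NOT override TEMPLATE, hoping
#    the chat template embedded in the GGUF metadata is enough -- whether that
#    actually holds is part of my question. The stop tokens are my assumption
#    about Harmony's end-of-turn markers.
modelfile = """\
# Point FROM at whatever .gguf file the export step produced (filename varies).
FROM ./gguf_out/unsloth.Q8_0.gguf
PARAMETER stop "<|return|>"
PARAMETER stop "<|end|>"
"""
with open("Modelfile", "w") as f:
    f.write(modelfile)

# 3) Register the model with Ollama.
subprocess.run(["ollama", "create", "gpt-oss-20b-ft", "-f", "Modelfile"], check=True)
```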

Constraints / Intent

  • I already have training data prepared strictly in system / user / assistant format (how I render it through the stock chat template is sketched after this list)

  • I want to:

    • Preserve GPT-OSS’s native behavior as much as possible

    • Perform accurate, non-destructive fine-tuning

    • Avoid hacks that work short-term but break compatibility with vLLM / TensorRT-LLM later
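
For completeness, this is the rendering step that produces the "text" column used in the training snippet above. I simply trust the stock Harmony template to handle the developer-role mapping, channels, and EOS markup; whether that is enough to keep the fine-tune non-destructive is part of what I'm asking.

```python
# Continuing from the snippets above: each raw record looks like
#   {"messages": [{"role": "system", ...}, {"role": "user", ...}, {"role": "assistant", ...}]}
def format_example(example):
    # Let the stock Harmony template do all of the role / channel / EOS markup;
    # I am not hand-writing any <|start|> / <|end|> markers myself.
    text = tokenizer.apply_chat_template(
        example["messages"],
        tokenize=False,
        add_generation_prompt=False,  # full conversations, assistant reply included
    )
    return {"text": text}

dataset = dataset.map(format_example)  # yields the "text" column used by SFTTrainer
```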

What I’m looking for

  • Has anyone successfully:

    • Fine-tuned GPT-OSS

    • Converted it to GGUF

    • Deployed it with Ollama

    • Preserved the Harmony template behavior throughout?

  • If yes:

    • Did you modify the chat template / Modelfile?

    • How did you handle EOS + reasoning channels?

    • Any pitfalls to avoid to keep it production-ready for Triton later?

Any concrete guidance, references, or proven setups would be extremely helpful.


Bookmarking this one, as I'm working on the same thing.
