Finetuning Query for gpt-oss

We created our own medical dataset, and then prepare data with the help of openai harmony github after this i performed finetuning with unsloth and offload the model then with llama.cpp converted to gguf and after this quantize to q4_k_m and with pretrained model file available in ollama we upload the model to ollama for simpler query it is giving right output but as user context grows model is thinking only and not generating the response.
Why is this so and how to solve this what precautions to taken care of
I added analysis samples as well in training set..

Are you running out of memory? or is the thinking outputting actual thoughts? of just saying thinking?

1 Like

At first it is giving proper thoughts but after that it is repeating some sentences as it is and till i stop that else till oom/context window

Most likely cause is one of these.

1. Inference parameters not configured properly

  • Temperature too low (causes deterministic repetition)

  • Repetition penalty not set or too low

  • No proper stopping criteria

2. Fine-tuning data issues

  • Insufficient examples with longer contexts

  • Missing or inconsistent EOS (end-of-sequence) tokens in training data

  • Training data didn’t teach the model when to STOP thinking

3. Quantization effects

  • Q4_K_M quantization can introduce numerical instabilities, especially with longer contexts, so question is does it happen always, random? or on specific queries?
1 Like

I would say at first attempts it gives good output, short queries it works, for longer and complex examples it happens.

Coming to Data - I prepared it with GPT 5.2 with longer and shorter samples with analysis as well.

For EOS <|return|>, channel <|end|>, is handled proper and extra validation checks are there also I used harmony github repo creating conversation.

Temperature: Tried 1 and 0,0.3 as well, repeating penalties, top k=20,30,top p=0.9, num predict.

Thank you for your support and guidance

Please let me know how should I tackle this and your guidance to do a successful fine tuning with custom dataset.

My question:

1. Your guidance on the above thinking loop.

  1. Do I add analysis channel data into training or not.
  2. Custom chat template to work with inference and training later ollama compatibility
  3. My max sample token is 8k most of the samples have 3,4k token and smaller as well.
  4. Which quant is best q4, 5,8 or any other

The Thinking Loop Issue

Root cause: Your model learned to generate reasoning chains but didn’t learn robust stopping conditions under cognitive load (longer contexts).

Solution approach:

  • The issue is likely in the training data distribution vs inference reality

  • Your 8k max samples are good, but you need deliberate examples of stopping

Specific fixes:

a) Add explicit “conclusion” markers in training:

<thinking>
[analysis steps]
</thinking>
<conclusion>
Based on the above analysis...
</conclusion>
<|return|>

b) Include “failed to conclude” examples - counter-intuitively, add examples where the model recognizes it’s circling:

<thinking>
... analysis ...
I notice I'm repeating myself. Let me synthesize:
[final answer]
</thinking>
<|return|>

c) Training tip: Weight your longer samples (3-8k) more heavily in training (2-3x repeats) so the model learns stopping behavior under load.

2. Analysis Channel Data

YES, absolutely include it, but with structure:

<analysis>
[step-by-step reasoning]
</analysis>

<response>
[final answer to user]
</response>
<|return|>

Why: Separating analysis from response teaches the model:

  • When to analyze (internally)

  • When to stop analyzing and respond

  • The transition point

Critical: Make sure 20-30% of your training examples show the model going DIRECTLY to response without analysis for simple queries. This prevents over-thinking.

3. Custom Chat Template

For Ollama compatibility, use this template structure:

python

# In your training data format:
{
  "messages": [
    {"role": "system", "content": "You are a medical AI..."},
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "<thinking>...</thinking>\n<response>...</response>"}
  ]
}
```

**For Ollama Modelfile:**
```
FROM ./your-model.gguf

TEMPLATE """
{{- if .System }}System: {{ .System }}

{{ end }}
{{- range .Messages }}
{{- if eq .Role "user" }}User: {{ .Content }}

{{- else if eq .Role "assistant" }}Assistant: {{ .Content }}

{{- end }}
{{- end }}Assistant:"""

PARAMETER stop "<|return|>"
PARAMETER stop "</response>"
PARAMETER stop "<|end|>"

4. Token Length Strategy

Your distribution (3-4k typical, 8k max) is good, but:

Add this distribution:

  • 10% ultra-short (< 500 tokens) - prevents over-thinking

  • 40% short-medium (500-2k)

  • 30% medium (2-4k) - your current strength

  • 15% long (4-6k)

  • 5% very long (6-8k) - critical for preventing loops

Key: For each length category, ensure explicit stopping examples are present.

5. Quantization Recommendation

Based on your medical use case and repetition issues:

Best choice: Q5_K_M

Why:

  • Q4_K_M is causing numerical instabilities in your reasoning chains

  • Q5_K_M gives ~2% better quality with minimal size increase

  • Q6_K is overkill unless you have VRAM to spare

  • Q8 is unnecessary (diminishing returns)

Test ladder:

  1. First deploy Q5_K_M

  2. If still issues → try Q6_K

  3. If Q6_K works but Q5_K_M doesn’t → it’s a quantization issue, revisit training

Bonus: Immediate Debug Test

Try this diagnostic prompt in Ollama:

bash

ollama run your-model
>>> [Complex medical query]

# If it loops, interrupt and try:
>>> Stop. Provide your final answer now without further analysis.

If it CAN stop when explicitly told but doesn’t naturally → your EOS conditioning in training is weak. Add more examples with explicit “stopping” language in the thinking traces.

Action Plan

  1. Regenerate 10-15% of your dataset with explicit conclusion markers

  2. Add short-answer examples (no analysis needed)

  3. Retrain with adjusted sample weights (favor longer contexts)

  4. Test with Q5_K_M first

  5. Use the stricter Modelfile template above

Hopefully that solves the issue.

Thank you so much BEN for your quick support and time, I will surely try all the things that you mentioned.

1 last question you mentioned to use thinking, I am using harmony format and it has the <|end|> as thinking eos for analysis channel and then final message will start so should i change this entirely as per your suggestion will it lead to catastrophic forgetting

How we should also take care of

  1. Catastrophic forgetting
  2. Routing collapse in MoE
  3. NaN/diverging loss
  4. Overfittig

Thank you once again..

Working WITH Harmony Format

Your current structure is fine:

<analysis>
[reasoning steps]
<|end|>  ← analysis EOS
[final response]
<|return|>  ← conversation EOS

What to ADD (not replace):

In 10-15% of your training examples, include meta-cognitive stopping signals:

<analysis>
Step 1: Patient presents with X
Step 2: Differential diagnosis includes Y, Z
Step 3: Key indicator is W
[After 5-7 reasoning steps]
Conclusion: Evidence points to Z.
<|end|>

Based on the analysis, the diagnosis is Z with 85% confidence. Recommend...
<|return|>

The key: “Conclusion:” as a learned stopping trigger before <|end|>. This teaches the model to wrap up BEFORE hitting the EOS token, not rely on the token alone.

You could also talk to code models about these types of questions which will give you a lot of insights. codex 5.3 is really good. There as so many ways to build a RAG systems that finding the right way really depends on the task.

1 Like

Thank you for the detailed guidance — this clarifies a lot.
We won’t replace the template — just improve internal reasoning termination behavior through distribution shaping.

If you have any recommendations on ideal reasoning-to-direct-answer ratios or specific eval signals to detect routing imbalance early, we’d appreciate your input.

Thanks again for the insights — this helps us move forward more confidently.

1 Like

Prompt engineering is very important. beyond that I am not much more help because I don’t use reasoning models very often as our Ai has its own reasoning logic stack that we made.

On of my Ai’s though that I am experimenting with though does have agent model with reasoning that said it was all in the prompting to tune it further. You can also if need be build a 2nd validation loop where critical is required. Kind of like checking your output before sending to the user to make sure it supports the query relevance if not send back.

depending on how you build your system there are many other things you can do to improve quality of the outputs. better embedding models, better input structure and output structures.

Example sometimes breaking down your information into better understanding to feed the ai with meta helps guide its clarity better.

hope some of these help. If all in doubt you do have Ai’s to ask these same questions show it your work and ask it where you can tighten. send it the debug outputs so it can see patterns that you may not see yourself. That is how you can also help refine the outputs. guardrails to ensure it stays on track.

1 Like

Thank you again for taking the time to share your insights — I really appreciate the practical perspective. Your comments about orchestration layers and validation loops were especially helpful.

I’d love to understand your setup a bit more deeply if you’re open to sharing:

  • When you say your AI has its own reasoning logic stack, is that rule-based orchestration, multi-pass LLM chaining, or something else?

  • Do you let the model reason freely and then post-process, or do you guide reasoning step-by-step externally?

  • Is your reasoning layer deterministic, or still probabilistic via model calls?

Regarding the validation loop:

  • How do you define “critical” outputs that require a validation pass?

  • Is the second validation pass done with the same model or a smaller/different model?

  • What signals do you use to determine that an output failed validation?

And on guardrails:

  • What kinds have worked best for you — format validation, semantic checks, relevance scoring, or something else?

Thanks again — your system-level approach is really interesting, and I’d value any additional detail you’re comfortable sharing.