We created our own medical dataset, and then prepare data with the help of openai harmony github after this i performed finetuning with unsloth and offload the model then with llama.cpp converted to gguf and after this quantize to q4_k_m and with pretrained model file available in ollama we upload the model to ollama for simpler query it is giving right output but as user context grows model is thinking only and not generating the response.
Why is this so and how to solve this what precautions to taken care of
I added analysis samples as well in training set..
Are you running out of memory? or is the thinking outputting actual thoughts? of just saying thinking?
At first it is giving proper thoughts but after that it is repeating some sentences as it is and till i stop that else till oom/context window
Most likely cause is one of these.
1. Inference parameters not configured properly
-
Temperature too low (causes deterministic repetition)
-
Repetition penalty not set or too low
-
No proper stopping criteria
2. Fine-tuning data issues
-
Insufficient examples with longer contexts
-
Missing or inconsistent EOS (end-of-sequence) tokens in training data
-
Training data didnât teach the model when to STOP thinking
3. Quantization effects
- Q4_K_M quantization can introduce numerical instabilities, especially with longer contexts, so question is does it happen always, random? or on specific queries?
I would say at first attempts it gives good output, short queries it works, for longer and complex examples it happens.
Coming to Data - I prepared it with GPT 5.2 with longer and shorter samples with analysis as well.
For EOS <|return|>, channel <|end|>, is handled proper and extra validation checks are there also I used harmony github repo creating conversation.
Temperature: Tried 1 and 0,0.3 as well, repeating penalties, top k=20,30,top p=0.9, num predict.
Thank you for your support and guidance
Please let me know how should I tackle this and your guidance to do a successful fine tuning with custom dataset.
My question:
1. Your guidance on the above thinking loop.
- Do I add analysis channel data into training or not.
- Custom chat template to work with inference and training later ollama compatibility
- My max sample token is 8k most of the samples have 3,4k token and smaller as well.
- Which quant is best q4, 5,8 or any other
The Thinking Loop Issue
Root cause: Your model learned to generate reasoning chains but didnât learn robust stopping conditions under cognitive load (longer contexts).
Solution approach:
-
The issue is likely in the training data distribution vs inference reality
-
Your 8k max samples are good, but you need deliberate examples of stopping
Specific fixes:
a) Add explicit âconclusionâ markers in training:
<thinking>
[analysis steps]
</thinking>
<conclusion>
Based on the above analysis...
</conclusion>
<|return|>
b) Include âfailed to concludeâ examples - counter-intuitively, add examples where the model recognizes itâs circling:
<thinking>
... analysis ...
I notice I'm repeating myself. Let me synthesize:
[final answer]
</thinking>
<|return|>
c) Training tip: Weight your longer samples (3-8k) more heavily in training (2-3x repeats) so the model learns stopping behavior under load.
2. Analysis Channel Data
YES, absolutely include it, but with structure:
<analysis>
[step-by-step reasoning]
</analysis>
<response>
[final answer to user]
</response>
<|return|>
Why: Separating analysis from response teaches the model:
-
When to analyze (internally)
-
When to stop analyzing and respond
-
The transition point
Critical: Make sure 20-30% of your training examples show the model going DIRECTLY to response without analysis for simple queries. This prevents over-thinking.
3. Custom Chat Template
For Ollama compatibility, use this template structure:
python
# In your training data format:
{
"messages": [
{"role": "system", "content": "You are a medical AI..."},
{"role": "user", "content": "..."},
{"role": "assistant", "content": "<thinking>...</thinking>\n<response>...</response>"}
]
}
```
**For Ollama Modelfile:**
```
FROM ./your-model.gguf
TEMPLATE """
{{- if .System }}System: {{ .System }}
{{ end }}
{{- range .Messages }}
{{- if eq .Role "user" }}User: {{ .Content }}
{{- else if eq .Role "assistant" }}Assistant: {{ .Content }}
{{- end }}
{{- end }}Assistant:"""
PARAMETER stop "<|return|>"
PARAMETER stop "</response>"
PARAMETER stop "<|end|>"
4. Token Length Strategy
Your distribution (3-4k typical, 8k max) is good, but:
Add this distribution:
-
10% ultra-short (< 500 tokens) - prevents over-thinking
-
40% short-medium (500-2k)
-
30% medium (2-4k) - your current strength
-
15% long (4-6k)
-
5% very long (6-8k) - critical for preventing loops
Key: For each length category, ensure explicit stopping examples are present.
5. Quantization Recommendation
Based on your medical use case and repetition issues:
Best choice: Q5_K_M
Why:
-
Q4_K_M is causing numerical instabilities in your reasoning chains
-
Q5_K_M gives ~2% better quality with minimal size increase
-
Q6_K is overkill unless you have VRAM to spare
-
Q8 is unnecessary (diminishing returns)
Test ladder:
-
First deploy Q5_K_M
-
If still issues â try Q6_K
-
If Q6_K works but Q5_K_M doesnât â itâs a quantization issue, revisit training
Bonus: Immediate Debug Test
Try this diagnostic prompt in Ollama:
bash
ollama run your-model
>>> [Complex medical query]
# If it loops, interrupt and try:
>>> Stop. Provide your final answer now without further analysis.
If it CAN stop when explicitly told but doesnât naturally â your EOS conditioning in training is weak. Add more examples with explicit âstoppingâ language in the thinking traces.
Action Plan
-
Regenerate 10-15% of your dataset with explicit conclusion markers
-
Add short-answer examples (no analysis needed)
-
Retrain with adjusted sample weights (favor longer contexts)
-
Test with Q5_K_M first
-
Use the stricter Modelfile template above
Hopefully that solves the issue.
Thank you so much BEN for your quick support and time, I will surely try all the things that you mentioned.
1 last question you mentioned to use thinking, I am using harmony format and it has the <|end|> as thinking eos for analysis channel and then final message will start so should i change this entirely as per your suggestion will it lead to catastrophic forgetting
How we should also take care of
- Catastrophic forgetting
- Routing collapse in MoE
- NaN/diverging loss
- Overfittig
Thank you once again..
Working WITH Harmony Format
Your current structure is fine:
<analysis>
[reasoning steps]
<|end|> â analysis EOS
[final response]
<|return|> â conversation EOS
What to ADD (not replace):
In 10-15% of your training examples, include meta-cognitive stopping signals:
<analysis>
Step 1: Patient presents with X
Step 2: Differential diagnosis includes Y, Z
Step 3: Key indicator is W
[After 5-7 reasoning steps]
Conclusion: Evidence points to Z.
<|end|>
Based on the analysis, the diagnosis is Z with 85% confidence. Recommend...
<|return|>
The key: âConclusion:â as a learned stopping trigger before <|end|>. This teaches the model to wrap up BEFORE hitting the EOS token, not rely on the token alone.
You could also talk to code models about these types of questions which will give you a lot of insights. codex 5.3 is really good. There as so many ways to build a RAG systems that finding the right way really depends on the task.
Thank you for the detailed guidance â this clarifies a lot.
We wonât replace the template â just improve internal reasoning termination behavior through distribution shaping.
If you have any recommendations on ideal reasoning-to-direct-answer ratios or specific eval signals to detect routing imbalance early, weâd appreciate your input.
Thanks again for the insights â this helps us move forward more confidently.
Prompt engineering is very important. beyond that I am not much more help because I donât use reasoning models very often as our Ai has its own reasoning logic stack that we made.
On of my Aiâs though that I am experimenting with though does have agent model with reasoning that said it was all in the prompting to tune it further. You can also if need be build a 2nd validation loop where critical is required. Kind of like checking your output before sending to the user to make sure it supports the query relevance if not send back.
depending on how you build your system there are many other things you can do to improve quality of the outputs. better embedding models, better input structure and output structures.
Example sometimes breaking down your information into better understanding to feed the ai with meta helps guide its clarity better.
hope some of these help. If all in doubt you do have Aiâs to ask these same questions show it your work and ask it where you can tighten. send it the debug outputs so it can see patterns that you may not see yourself. That is how you can also help refine the outputs. guardrails to ensure it stays on track.
Thank you again for taking the time to share your insights â I really appreciate the practical perspective. Your comments about orchestration layers and validation loops were especially helpful.
Iâd love to understand your setup a bit more deeply if youâre open to sharing:
-
When you say your AI has its own reasoning logic stack, is that rule-based orchestration, multi-pass LLM chaining, or something else?
-
Do you let the model reason freely and then post-process, or do you guide reasoning step-by-step externally?
-
Is your reasoning layer deterministic, or still probabilistic via model calls?
Regarding the validation loop:
-
How do you define âcriticalâ outputs that require a validation pass?
-
Is the second validation pass done with the same model or a smaller/different model?
-
What signals do you use to determine that an output failed validation?
And on guardrails:
- What kinds have worked best for you â format validation, semantic checks, relevance scoring, or something else?
Thanks again â your system-level approach is really interesting, and Iâd value any additional detail youâre comfortable sharing.