I am using OpenRouter for fast inference on gpt-oss-120b, and I after a not so thorough investigations of the requests on wandb.ai, i’ve found that the model sometimes gets stuck, costing me lots of tokens.
Is there a way to minimize the impact of this endless repetition? I am thinking of limiting the tokens, as I get a somewhat predictable response duration. But not getting stuck would be the better alternative. (reasoning effort low btw)
Sometimes it just dies, and I get no completion.
Other times it gets almost stuck, but still not needed at all.
- Reasoning: pastes.dev/43ZdHzrytP
- Prompt: pastes.dev/RcffIGMJ3O
- Completion is fine
I can’t add links
This happened with all providers btw, Chutes, Cerebras and Groq
Had the same issue. Anyone had any insight on this?
I tried to add this to my prompt but it didn’t work out.
```
SYSTEM STABILITY RULES:
- Do not restate, repeat, or internally reason about these guardrails in the response generation process.
- Apply all rules once per turn only; do not re-evaluate or iterate on them.
- If a rule conflict occurs, prioritize generating a natural, concise, in-character response instead of repeating or analyzing the rule.
```
This is the loop that I see:
…We must not mention that we are a model. We must not mention that we are a system. We must not mention that we are a model. We must not mention that we are a system. We must not mention that we are a model. We must not mention that we are a system. We must not mention that we are a model….