A way to reduce response length without a truncated finish_reason of length

I am developing an application for a very small display (the size of a wristwatch), and I want the responses to be terse. I thought max_tokens would inform the model so it could formulate a response within that token budget, but it just truncates the output.
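
For reference, here's roughly the call I'm making (a minimal sketch using the OpenAI Python SDK; the model name and token cap are just placeholders):

```python
from openai import OpenAI

client = OpenAI()

# Ask a question, but cap the output hard at 30 tokens.
resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[{"role": "user", "content": "Explain what a barometer measures."}],
    max_tokens=30,
)

choice = resp.choices[0]
print(choice.message.content)
# When the cap is hit, the text is cut off mid-sentence and
# finish_reason is "length" rather than "stop".
print(choice.finish_reason)
```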

I am exploring some prompt engineering, like "Please respond briefly", but I'd much rather have a reliable way to limit the response without it being truncated, i.e. without ending up with a finish_reason of length instead of stop.
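
On the prompt-engineering route, what I'm experimenting with is putting a hard, concrete constraint in the system message rather than a vague "be brief" (a sketch; the wording and word count are just what I'm trying):

```python
messages = [
    {
        "role": "system",
        "content": (
            "Your answer is shown on a wristwatch-sized display. "
            "Respond in one sentence of at most 15 words. "
            "No preamble, no lists."
        ),
    },
    {"role": "user", "content": "Explain what a barometer measures."},
]
```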

Maybe what I'll do is this: if the response exceeds the limit I want, I send it back with the prompt "Say that more tersely", and try that a couple of times before falling back to delivering a truncated response, which is an undesirable user experience.
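
Something like the following is what I have in mind (a sketch; MAX_CHARS, the retry count, and the model name are arbitrary):

```python
MAX_CHARS = 120   # what fits on the display; arbitrary for this sketch
MAX_RETRIES = 2

def terse_reply(client, messages, model="gpt-4o-mini"):
    """Return (text, was_truncated), retrying with a "tersely" prompt first."""
    resp = client.chat.completions.create(model=model, messages=messages)
    text = resp.choices[0].message.content

    for _ in range(MAX_RETRIES):
        if len(text) <= MAX_CHARS:
            return text, False  # fits on the display, no truncation needed
        # Feed the long answer back and ask for a terser version.
        messages = messages + [
            {"role": "assistant", "content": text},
            {"role": "user", "content": "Say that more tersely."},
        ]
        resp = client.chat.completions.create(model=model, messages=messages)
        text = resp.choices[0].message.content

    if len(text) <= MAX_CHARS:
        return text, False
    # Give up and truncate as a last resort.
    return text[:MAX_CHARS], True
```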