Welcome to the community!
Getting 100% adherence to a customized output format without fine-tuning isn't really possible. These systems aren't designed or built to produce exactly the same results from the same prompt every time; there will always be some degree of variability and unpredictability no matter what you do.
I would recommend fine-tuning a model with examples of the format you're trying to get it to produce. From what you've described, this sounds like a specialized use case that fine-tuning would work well for.
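If you go that route, the gist is to collect pairs of input text and the exact output you want, and turn them into a training file. Here's a minimal sketch of the data prep, assuming you're fine-tuning a chat model with the JSONL "messages" format; the file name, system prompt, tag scheme, and examples are all placeholders for whatever your project actually uses:

```python
import json

# Hypothetical examples: input text paired with the exact output format you want.
examples = [
    {
        "text": "The cat sat on the mat.",
        "tagged": "<s><np>The cat</np> <vp>sat on the mat</vp>.</s>",
    },
    # ... more examples, ideally enough to cover your edge cases
]

with open("finetune_data.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        record = {
            "messages": [
                {"role": "system", "content": "Tag the sentence using the project's markup scheme."},
                {"role": "user", "content": ex["text"]},
                {"role": "assistant", "content": ex["tagged"]},
            ]
        }
        f.write(json.dumps(record) + "\n")
```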
Btw, if you're trying to tag corpora via prompting, I hate to admit it, but post-processing the model's output with plain old programming is still likely the better bet here, again unless you fine-tune a model for your use case. Something like the sketch below.
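As a rough illustration of what I mean, here's a minimal post-processing sketch that validates and repairs the model's output instead of trusting the prompt to nail the format every time; the tag names and regexes are just stand-ins for whatever your scheme actually is:

```python
import re

ALLOWED_TAGS = {"s", "np", "vp"}  # placeholder tag set for your scheme

def validate_and_clean(model_output: str) -> str | None:
    """Return a cleaned tagged string, or None if the output can't be salvaged."""
    # Strip any chatty preamble the model sometimes adds before the tagged text.
    match = re.search(r"<s>.*</s>", model_output, flags=re.DOTALL)
    if not match:
        return None
    tagged = match.group(0)

    # Reject outputs that use tags outside the allowed set.
    used_tags = set(re.findall(r"</?([a-z]+)>", tagged))
    if not used_tags <= ALLOWED_TAGS:
        return None

    # Normalise whitespace so downstream tooling sees a consistent format.
    return re.sub(r"\s+", " ", tagged).strip()


# Example usage: drop or retry anything that fails validation.
raw = "Sure! Here is the tagged sentence: <s><np>The cat</np> <vp>sat</vp>.</s>"
cleaned = validate_and_clean(raw)
if cleaned is None:
    print("Format check failed; re-prompt or flag for manual review.")
else:
    print(cleaned)
```

Even a simple check-and-retry loop like this usually gets you to a consistent corpus faster than chasing the perfect prompt.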
These models are great for generating raw text to be placed in a corpus. They're not yet that great at handling and manipulating such text in precise, complex, replicable formats.