Processing of model outputs while fine-tuning Whisper

Hey everyone! I have a quick theoretical question about fine-tuning Whisper models on my own labelled data. In the common Colab notebooks that many developers have published, there is no preprocessing step for the transcriptions generated during training and used for evaluation over the course of fine-tuning. Is it necessary to add such a step?
What I mean:
For example, in the dataset I am using for fine-tuning, all letters are lowercase and there is no punctuation; the transcripts contain only letter characters and whitespace. Now suppose that during fine-tuning the current model is evaluated every 200 steps. Isn't it possible that the model generates output containing uppercase letters and punctuation (as Whisper's pretrained checkpoints, e.g. large-v3, do), so that the WER comes out higher than it would if the generated transcriptions were post-processed, i.e. stripped of the same characters that were removed from the fine-tuning dataset?

It's not mandatory to preprocess the generated transcriptions during training, but normalizing them to match your dataset's format (e.g. lowercasing and removing punctuation) makes the evaluation fairer: otherwise a prediction can be penalized for casing and punctuation differences alone, inflating the reported WER. The usual fix is to apply the same normalization to both predictions and references inside your metric computation before scoring.
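A minimal sketch of the effect, using a hand-rolled word-level WER so it is self-contained (the `normalize` and `wer` helpers here are illustrative, not taken from any particular notebook):

```python
import re

def normalize(text: str) -> str:
    # Lowercase and keep only letters and whitespace,
    # matching a fine-tuning dataset with that format.
    return re.sub(r"[^a-z\s]", "", text.lower()).strip()

def wer(reference: str, hypothesis: str) -> float:
    # Word-level Levenshtein distance divided by the reference length.
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

reference = "hello world how are you"            # dataset-style label
prediction = "Hello, world! How are you?"        # pretrained-Whisper-style output

print(wer(reference, prediction))             # 0.8 — inflated by casing/punctuation
print(wer(reference, normalize(prediction)))  # 0.0 — after normalization
```

In a real training setup you would apply the same `normalize` to both the decoded predictions and the decoded labels inside `compute_metrics` before passing them to your WER metric.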