I read the paper on the Whisper model:
Robust Speech Recognition via Large-Scale Weak Supervision
They did not say which loss function they used.
It seems that they trained the model as a classification task (predicting the next token), so did they use cross-entropy loss?
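
To make the question concrete, here is a minimal sketch of what I mean by token-level cross-entropy. It assumes the decoder outputs one logit vector over the vocabulary per step; the function name `token_cross_entropy` and the toy numbers are my own illustration, not something taken from the paper or the Whisper code.

```python
import math

def token_cross_entropy(logits, targets):
    """Mean cross-entropy over a sequence of next-token predictions.

    logits:  list of per-step logit lists, one score per vocabulary entry
    targets: list of correct token indices, one per step
    (Hypothetical helper for illustration only.)
    """
    total = 0.0
    for step_logits, target in zip(logits, targets):
        # Numerically stable log-sum-exp of the logits at this step.
        m = max(step_logits)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in step_logits))
        # Negative log-probability of the correct token under softmax.
        total += log_sum_exp - step_logits[target]
    return total / len(targets)

# Toy example: 2 decoding steps, vocabulary of 3 tokens.
logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
targets = [0, 2]
loss = token_cross_entropy(logits, targets)
print(loss)
```

If this is the objective, the loss would go to zero as the model puts all probability mass on the correct token at every step, and would equal log(vocabulary size) for a uniform prediction.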