Evaluating the performance of a fine-tuned dialogue system

Hello everyone, I'm working with a fine-tuned GPT-3.5 Turbo model for a dialogue system, and I'm struggling to evaluate its performance with automated metrics. Does anyone have suggestions or resources that could help?
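To make the question concrete, this is roughly the kind of surface-overlap metric I mean: a hand-rolled unigram F1 between a model response and a reference reply (the example strings below are made up, and this is only a sketch, not what I've deployed). My worry is that overlap scores like this miss a lot for open-ended dialogue:

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_f1(candidate, reference, n=1):
    """Clipped n-gram overlap F1 between a candidate response and a reference."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    if not cand or not ref:
        return 0.0
    # Clipped overlap: each n-gram counts at most as often as it appears in the reference.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical reference/response pair for illustration.
reference = "the delivery arrives within three business days"
response = "your delivery should arrive within three business days"
print(round(ngram_f1(response, reference), 3))  # → 0.667
```

Metrics like this (or BLEU/ROUGE) reward lexical overlap with a single reference, which seems brittle when many different replies would be equally acceptable, so I'd especially welcome pointers to metrics or tooling that go beyond surface overlap.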