What are the best metrics for measuring how correctly an LLM answers questions?

I have a Q&A application that will be tested with more than one OpenAI model. What are the best metrics for identifying the best model, and what evaluation techniques should be used? Should we simply count how many questions were answered correctly and then calculate the accuracy?
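
To make the accuracy idea concrete, here is a minimal sketch of the kind of comparison I have in mind. It assumes each question has a single reference answer and scores each model with exact-match accuracy plus a token-overlap F1 (a common softer alternative to exact match, not something my application uses yet). The model names, answers, and helper functions are all hypothetical placeholders.

```python
import re
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, and split into tokens."""
    return re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized answers are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall against the reference."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def score_model(predictions: list[str], references: list[str]) -> dict[str, float]:
    """Average both metrics over the whole question set."""
    n = len(references)
    pairs = list(zip(predictions, references))
    return {
        "exact_match": sum(exact_match(p, r) for p, r in pairs) / n,
        "token_f1": sum(token_f1(p, r) for p, r in pairs) / n,
    }


# Hypothetical reference answers and model outputs, for illustration only.
references = ["Paris", "4", "William Shakespeare"]
outputs = {
    "model_a": ["Paris", "4", "Shakespeare"],
    "model_b": ["The capital is Paris.", "five", "William Shakespeare"],
}

scores = {name: score_model(preds, references) for name, preds in outputs.items()}
for name, result in scores.items():
    print(name, result)
```

With this toy data, exact match penalizes "The capital is Paris." as fully wrong even though it contains the right answer, which is part of why I am unsure whether plain accuracy is enough.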