Calculating a Confidence Score for the Responses to Prompts in a Text-to-SQL Application

We are working on a Text-to-SQL application built using LangChain and Python. Most of the time, given the table data and DDL, the SQL agent generates the right queries, which in turn return the right response. We are also using LangSmith to measure accuracy. The ask is: for every prompt issued by the user, what approach can be used to provide a score indicating how likely the returned response is to be correct? If anyone has implemented something similar, any information would be helpful. Note that accuracy is not being looked at at the model level here. It is at the level of the individual prompt, the query formed, and the response returned.