I have a text classification task that I have been exploring with ChatGPT (the web version available here: https://chat.openai.com/chat) with reasonable success. When I try to replicate my results using gpt-3.5-turbo via the API, the classification predictions are wrong more often. I understand that there is some inherent stochasticity at play here that can cause individual results to differ.
What I'm looking for are best-practice recommendations I can follow to close the discrepancy between the two models as much as possible. For example, one thing I would like to do is make sure the underlying tunable parameters (temperature, top_p, etc.) are the same. Is it known what values ChatGPT uses for those parameters? Does anyone have any other insight/advice? Thanks!
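In case it helps frame the question: here is a minimal sketch of what I mean by pinning those parameters explicitly in the API call, using the `openai` Python library. The exact values ChatGPT uses aren't published as far as I know, so `temperature=0` below is just a choice to reduce run-to-run variance, not a claim about ChatGPT's actual settings; the labels and prompt are placeholders.

```python
# Build the request explicitly so every sampling parameter is pinned,
# rather than relying on whatever the API defaults happen to be.
def build_classification_request(text: str, labels: list[str]) -> dict:
    label_list = ", ".join(labels)
    return {
        "model": "gpt-3.5-turbo",
        "temperature": 0,  # near-greedy decoding: most repeatable output
        "top_p": 1,        # leave nucleus sampling effectively off
        "messages": [
            {
                "role": "system",
                "content": f"You are a text classifier. "
                           f"Reply with exactly one of: {label_list}.",
            },
            {"role": "user", "content": text},
        ],
    }

# Usage (requires an API key; call shape per the openai Python library):
# import openai
# resp = openai.ChatCompletion.create(**build_classification_request(
#     "Great product, fast shipping!", ["positive", "negative"]))
# print(resp["choices"][0]["message"]["content"])
```

Even with temperature at 0 I understand outputs aren't guaranteed to be bit-identical between runs, but at least this removes the API defaults as a variable.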