Initial test results: Why is gpt-4-1106-preview way worse than the older gpt-3.5-turbo-0613?

I am using the GPT API to generate content (informative articles). I have tried adjusting all my prompts to work with the new gpt-4-1106-preview, but the content it generates is consistently worse than what the older gpt-3.5-turbo-0613, which I used before DevDay, produced. My question is: Is the performance/quality difference a result of OpenAI stating that the new gpt-4-1106-preview model is “not yet suited for production traffic”? Or is it due to something else?

I just find it strange. I have tested the difference on 15 different articles, and GPT-3.5 is clearly better across all of them.
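
For context, here is a minimal sketch of how I run the comparison, assuming the openai Python SDK (v1.x). The prompt and sampling parameters shown are placeholders, not my actual article prompts; the point is that only the model name differs between the two runs:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder prompt -- my real article prompts are longer and more detailed.
PROMPT = "Write a short informative article about composting at home."

def generate(model: str, prompt: str) -> str:
    """Generate an article with the given model, using identical settings."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
        max_tokens=1024,
    )
    return response.choices[0].message.content

# Same prompt, same parameters -- only the model changes.
for model in ("gpt-3.5-turbo-0613", "gpt-4-1106-preview"):
    print(f"=== {model} ===")
    print(generate(model, PROMPT))
```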