Consistency of responses from Vision (or GPT4-Turbo)

Hello fellow AI enthusiasts :grinning:

To familiarize myself with the potential of the API, I’ve been working on a simple application using vision’s capabilities to generate html code from an image.
The biggest difficulty at the moment, and one that has been hard to overcome, has been the consistency of the results. I get very positive results, but then without any changes to the prompt I get very bad results (de-formatting, not all elements are generated, etc.).
At the prompt level I feel I’ve reached a limit and even following all the good practices suggested in the cookbook.
Has anyone who has been through similar use cases and has more experience with this kind of issue come up with an approach that gives better results in terms of response consistency?
Thank you very much!

Welcome to the community fellow enthusiast :cowboy_hat_face:

What’s your temperature/top-p look like?


Thank you, and also thanks for the quick reply!
I’m using the default values because I also don’t have enough confidence and knowledge on how to “tune” them in the best possible way for my use case and also because the available literature suggests that should always first at the prompt before changing these values.
But it really makes sense in my case where I feel I’ve “hit a limit” in terms of the prompt.
Any strategies on the way forward? Lower the temperature (keep decreasing it and check results) and keep the top_p as recommended?
Thanks again !

1 Like

The higher the temperature, the more “creative” it is, meaning I would try setting the temperature to 0 and then adjusting the prompt, as that should get you pretty regular results.


Adding some technical background to what @grandell1234 mentioned:

Temperature allows for more randomness to the output. A higher temperature (above 0) increases the likelihood that instead of the most likely token (or word), some other random token (or word) will will be picked. Personally, I don’t see many situations where a non zero temperature really makes sense, unless you’re literally and explicitly trying to brainstorm ideas (that you will check and filter later).

Top P decides how improbable your random picks are allowed to be. A top-p of 1 (100%) allows all tokens to be picked. So there’s a small chance that you will get absolute nonesense, which will increase with higher temperatures. If you lower this, you will completely disallow lower probability tokens from appearing.


@grandell1234 Thanks for the tip, I will follow that strategy

@Diet Thanks for your time and patience on explaining more about those two concepts, it’s much more clear to me now, I really appreciate it

I will do some tests, if there is any interested I can post my results later on on what worked for me :grinning:


Sorry for the delay, but just to give some feedback on my progress.

Following the advice of @grandell1234 and @Diet I finally managed to have more consistent results, even if sometimes they weren’t the best.
I also noticed that less temperature requires more context (makes sense since now I quite restricting the model) to get a result closer to what I want.
I had great results by going into detail as such to give an example on the prompt.

By maintaining a low temperature and lowering top_p, I saw more “incomplete” responses, which makes sense according to the explanation from @Diet, since the universe of possible tokens was being reduced (finally also understood why the recommendation to only change one of these parameters and not both).

Very grateful for the help provided by @grandell1234 and @Diet, thanks to them I was able to advance a little further.

Now I’m at an early stage adopting gpt-4o to see how it differs in terms of output.

Thanks again :grinning: