Consistency of responses from Vision (or GPT-4 Turbo)

Hello fellow AI enthusiasts :grinning:

To familiarize myself with the potential of the API, I’ve been working on a simple application that uses the vision capabilities to generate HTML code from an image.
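For context, the call I’m making looks roughly like this (a minimal sketch using the OpenAI Python SDK; the model name, prompt, and image file are illustrative rather than my exact setup):

```python
import base64

from openai import OpenAI

client = OpenAI()

# Encode the source mockup as a base64 data URL
with open("mockup.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4-turbo",  # any vision-capable model
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Generate the HTML for this page layout."},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```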
The biggest difficulty at the moment, and one that has been hard to overcome, has been the consistency of the results. I get very positive results, but then, without any changes to the prompt, I get very bad results (broken formatting, missing elements, etc.).
At the prompt level I feel I’ve reached a limit, even after following all the good practices suggested in the cookbook.
Has anyone who has been through similar use cases, and has more experience with this kind of issue, come up with an approach that gives more consistent responses?
Thank you very much!

Welcome to the community fellow enthusiast :cowboy_hat_face:

What’s your temperature/top-p look like?

2 Likes

Thank you, and also thanks for the quick reply!
I’m using the default values, partly because I don’t yet have the confidence and knowledge to “tune” them in the best possible way for my use case, and partly because the available literature suggests always working on the prompt first before changing these values.
But changing them really makes sense in my case, where I feel I’ve “hit a limit” in terms of the prompt.
Any strategies on the way forward? Lower the temperature (keep decreasing it and checking the results) and keep top_p at the recommended value?
Thanks again !

1 Like

The higher the temperature, the more “creative” the output, so I would try setting the temperature to 0 and then adjusting the prompt; that should get you pretty consistent results.
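Concretely, that just means passing temperature=0 in the request. A sketch, reusing the client and messages from a call like the one above:

```python
response = client.chat.completions.create(
    model="gpt-4-turbo",  # your existing model
    messages=messages,    # your existing vision prompt
    temperature=0,        # near-greedy decoding for more repeatable output
)
```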

3 Likes

Adding some technical background to what @grandell1234 mentioned:

Temperature allows for more randomness in the output. A higher temperature (above 0) increases the likelihood that instead of the most likely token (or word), some other, less likely token (or word) will be picked. Personally, I don’t see many situations where a non-zero temperature really makes sense, unless you’re literally and explicitly trying to brainstorm ideas (that you will check and filter later).
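If it helps to see it numerically, here’s a toy illustration of how temperature reshapes the token distribution (plain numpy with hypothetical logits, not the model’s actual internals):

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    # temperature -> 0 approaches greedy (always the top token);
    # temperature > 1 flattens the distribution toward uniform
    scaled = np.asarray(logits) / temperature
    exp = np.exp(scaled - scaled.max())  # subtract max for numerical stability
    return exp / exp.sum()

logits = [4.0, 2.0, 1.0, 0.5]  # hypothetical scores for four candidate tokens
print(softmax_with_temperature(logits, 0.2))  # ~[1.00, 0.00, 0.00, 0.00] -- near-greedy
print(softmax_with_temperature(logits, 1.0))  # ~[0.82, 0.11, 0.04, 0.02]
print(softmax_with_temperature(logits, 2.0))  # ~[0.57, 0.21, 0.13, 0.10] -- flatter
```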

Top P decides how improbable your random picks are allowed to be. A top-p of 1 (100%) allows all tokens to be picked, so there’s a small chance that you will get absolute nonsense, and that chance increases with higher temperatures. If you lower it, you completely disallow lower-probability tokens from appearing.
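And here’s top-p (nucleus sampling) in the same toy setting: sort tokens by probability, keep the smallest set whose cumulative probability reaches p, and renormalize. Again just a sketch, not the API’s implementation:

```python
import numpy as np

def top_p_filter(probs, top_p):
    # Keep the smallest set of tokens whose cumulative probability >= top_p
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]           # indices by descending probability
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1
    kept = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[kept] = probs[kept]
    return filtered / filtered.sum()          # renormalize over the kept tokens

probs = [0.82, 0.11, 0.04, 0.02, 0.01]  # hypothetical token probabilities
print(top_p_filter(probs, 1.0))  # all tokens stay eligible
print(top_p_filter(probs, 0.9))  # the low-probability tail is removed entirely
```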

3 Likes

@grandell1234 Thanks for the tip, I will follow that strategy

@Diet Thanks for your time and patience in explaining those two concepts; they’re much clearer to me now. I really appreciate it

I will do some tests; if there is any interest, I can post my results later on what worked for me :grinning:

3 Likes

Sorry for the delay, but just to give some feedback on my progress.

Following the advice of @grandell1234 and @Diet, I finally managed to get more consistent results, even if they sometimes weren’t the best.
I also noticed that a lower temperature requires more context (which makes sense, since I’m now restricting the model quite a bit) to get a result closer to what I want.
I had great results by going into detail, such as giving an example in the prompt.

By maintaining a low temperature and lowering top_p, I saw more “incomplete” responses, which makes sense given the explanation from @Diet, since the universe of possible tokens was being reduced (I also finally understood why the recommendation is to change only one of these parameters, not both).

Very grateful for the help provided by @grandell1234 and @Diet, thanks to them I was able to advance a little further.

Now I’m at an early stage of adopting gpt-4o to see how it differs in terms of output.

Thanks again :grinning:

2 Likes