Curious about a research paper’s examples, I threw one into the playground to see what the past and current InstructGPT models would do with it.
I use playground screenshots rather than pasted text because of their familiarity and information density.
A top_p of 0.9 was used to trim the tail of unlikely token possibilities, and
best_of was used to get the candidate with the least deviation (lowest perplexity) from multiple samples.
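For anyone who wants to reproduce the settings outside the playground, here is a minimal sketch of the same call against the legacy Completions endpoint, assuming the current OpenAI Python SDK; the model, prompt, and max_tokens are illustrative stand-ins for what is in the screenshots.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

completion = client.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt="The dragon flew over Paris, France,",
    max_tokens=256,
    top_p=0.9,   # truncate the low-probability tail of candidate tokens
    best_of=4,   # sample several server-side, return the highest-logprob one
    logprobs=1,  # request per-token logprobs so certainty can be inspected
)
print(completion.choices[0].text)
```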
The results serve as a nice memorial for this forum topic.
The difference: saccharine peace and love from gpt-3.5, with a dragon “careful not to disturb any of the tourists below”.
Note the lower logprob on gpt-3.5-instruct: it is more certain that it wants to produce this kind of output. OpenAI likely disabled the logprob echo so that researchers (if any are still interested in what OpenAI is doing) can’t see the true perplexity under standard text-corpus evaluations like LAMBADA.
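For reference, here is how that echo-based perplexity measurement used to work on the Completions endpoint: score the prompt itself with max_tokens=0, then exponentiate the negative mean token logprob. A minimal sketch assuming the current Python SDK; whether a given model still honors echo with logprobs is exactly the question raised above.

```python
import math
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.completions.create(
    model="davinci-002",                           # illustrative model choice
    prompt="The dragon flew over Paris, France,",
    max_tokens=0,                                  # generate nothing...
    echo=True,                                     # ...just score the prompt tokens
    logprobs=1,
)

lps = resp.choices[0].logprobs.token_logprobs[1:]  # first token has no logprob
print("perplexity:", math.exp(-sum(lps) / len(lps)))
```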
So goodbye, davinci.
The research paper, using minuscule GPT-2-medium with its classifier-free guidance technique (a sketch of CFG follows the excerpt):
The dragon flew over Paris, France descending slowly until it flew through Paris’ Cathedral and
down into a church. Suddenly, dragon flew back again before dropping back into the church. When it
landed on the ground, dragon screamed and cried in pain.
The dragon’s cries were heard in France and all over the world. The dragon screamed so loud, people
at the airport could hear the dragon’s screams. The dragon’s cries were heard worldwide for many
years. It was reported that the dragon was able to sing for thousands of years.
When the dragon was born, it was able to fly on all fours, and it could grow long horns. In the
beginning, when the dragon was born, it had seven heads, but in the year 1425 it had twenty-seven
When the dragon was born, it had the power of the sun. The dragon was able to create a massive
flame in the sky. After the dragon was born, it transformed into a beautiful female form with a long,
thin tail. She had a golden body, and she had two large wings on the back of her head. She had a red
eye, and two white eyes.
The dragon’s horn appeared in the skies around Paris.
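For those curious about the technique itself, here is a minimal sketch of classifier-free guidance for LM sampling: run the model once with the prompt and once without it, then push the next-token logits toward the prompt-conditioned distribution. gpt2-medium via Hugging Face transformers, greedy decoding, and guidance_scale=1.5 are illustrative choices here, not the paper’s exact setup.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2-medium")
model = GPT2LMHeadModel.from_pretrained("gpt2-medium").eval()

prompt = "The dragon flew over Paris, France,"
cond = tok(prompt, return_tensors="pt").input_ids
cond_len = cond.shape[1]
bos = torch.tensor([[tok.bos_token_id]])  # "unconditional" context: no prompt
guidance_scale = 1.5

generated = cond
for _ in range(60):
    with torch.no_grad():
        cond_logits = model(generated).logits[:, -1, :]
        uncond_in = torch.cat([bos, generated[:, cond_len:]], dim=1)
        uncond_logits = model(uncond_in).logits[:, -1, :]
    # CFG: amplify the difference the prompt makes to the next-token logits
    logits = uncond_logits + guidance_scale * (cond_logits - uncond_logits)
    next_id = logits.argmax(dim=-1, keepdim=True)  # greedy for simplicity
    generated = torch.cat([generated, next_id], dim=1)

print(tok.decode(generated[0], skip_special_tokens=True))
```

A guidance_scale of 1.0 recovers ordinary prompted sampling; values above 1 exaggerate whatever difference the prompt makes.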
Note: the logprob is not a measure of quality, only of certainty.
text-davinci-002, with its low logprob, would not be evaluated well...
Also, goodbye to plain gpt-3, which could complete beautifully, while
davinci-002 can’t even replicate the quality of the 2019 GPT-2 blog’s creative-writing examples.