Your results are further evidence for the earlier conclusion: what you get is similar to inpainting an existing “image-1/gpt-4o” image in ChatGPT.
The gpt-image-1 example, also used in the API reference, shows no need for a mask or a base image to remix several images into a new one.
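For anyone who wants to try that mask-free remix themselves, here is a minimal sketch modeled on the API reference example, assuming the official `openai` Python package and an `OPENAI_API_KEY` in the environment. The helper names `build_edit_request` and `remix`, the file names, and the output path are my own placeholders, not anything from the docs.

```python
import base64
from contextlib import ExitStack


def build_edit_request(image_paths, prompt):
    """Collect the arguments for a multi-image edit with no mask (hypothetical helper)."""
    if not image_paths:
        raise ValueError("need at least one input image")
    # Note there is no "mask" entry: all inputs are passed as plain images.
    return {
        "model": "gpt-image-1",
        "image_paths": list(image_paths),
        "prompt": prompt,
    }


def remix(image_paths, prompt, out_path="remix.png"):
    """Send the images to the edits endpoint and save the result (hypothetical helper)."""
    # Imported lazily so the request builder above works without the package.
    from openai import OpenAI

    req = build_edit_request(image_paths, prompt)
    client = OpenAI()  # picks up OPENAI_API_KEY from the environment

    with ExitStack() as stack:
        files = [stack.enter_context(open(p, "rb")) for p in req["image_paths"]]
        result = client.images.edit(
            model=req["model"],
            image=files,  # a list of images, no mask argument
            prompt=req["prompt"],
        )

    # gpt-image-1 returns the image as base64 rather than a URL.
    image_bytes = base64.b64decode(result.data[0].b64_json)
    with open(out_path, "wb") as f:
        f.write(image_bytes)
    return out_path
```

Calling something like `remix(["first.png", "second.png"], "Combine these items into one scene")` with your own files would write the result to `remix.png`. It makes a billed API call, so it is not run here.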
Want the original left unaltered? dall-e-2 gives near pixel accuracy, with just minor glitches sometimes around the mask area. Grabbing their picture just now.
And DALL-E 2 is just as broken as it has been for going on a month:
Perhaps okay for infill, as long as you aren’t using the polar bear prompt, which now produces an unrecognizable blob.