Developing sprite sheets with gpt-image-2

This issue has come up many times, but there still isn’t a clear, reliable solution. I’d like to share what I’ve learned so far and hopefully gather ideas to improve the process together. I’m not a designer or game developer, so input from more experienced folks would be very helpful.

The core challenge

Getting a good reference image is key. If you ask the model to generate a full sprite sheet, it often produces incomplete results with repeated poses. The model is optimized for visual quality, but it doesn’t fully understand body positioning with consistent spatial awareness.

Even the latest model, gpt-image-2, can struggle with left/right limb distinction. For example, it may interpret “left leg” as the leg on the left side of the screen, rather than the character’s actual left leg:

Practical tips

  • Expect to iterate on prompts. If the model consistently misinterprets left/right, it can be more effective to work with its bias rather than fight it.
  • Hallucinations still happen. Retrying the same prompt 2–3 times often fixes the issue.
  • Avoid generating a full sprite sheet all at once. Working frame by frame tends to produce better results.

What makes a good reference image?

  • Low-resolution images or pixel art can make depth and limb order unclear.
  • A 3D mannequin-style reference helps a lot because it provides clear joints, shadows, and structure.
  • Numbered frames make it easier to target specific poses.

Processing a full sprite sheet

  • Generate the full sheet first, then refine individual frames.
  • Use Codex to split and process frames one by one (see the splitting sketch after this list). This reduces manual effort, though errors can still occur.
  • The API is another option for automation.
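
A minimal sketch of the splitting step, assuming Pillow, a 4×2 grid, and placeholder file names (adjust to match your sheet):

```python
# Minimal sketch: split a generated sprite sheet into individual frames with Pillow.
# The grid layout and file names are assumptions; change them to fit your sheet.
from PIL import Image

COLS, ROWS = 4, 2  # assumed grid layout

sheet = Image.open("sheet.png")
frame_w = sheet.width // COLS
frame_h = sheet.height // ROWS

for row in range(ROWS):
    for col in range(COLS):
        box = (col * frame_w, row * frame_h, (col + 1) * frame_w, (row + 1) * frame_h)
        frame = sheet.crop(box)
        frame.save(f"frame_{row * COLS + col:02d}.png")
```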

Transferring a pose to a character

  • Attach the pose image as Image 1, then the character as Image 2.
  • Ask the model to apply the pose, or use the prompt below.
Prompt for pose transfer
Use Image 1 as the pose reference and Image 2 as the target character.

Edit Image 2 so the character adopts the exact pose from Image 1. Transfer only the pose. Preserve Image 2’s character identity, proportions, material/style, camera angle, framing, lighting, background, and image quality.

Important: transfer the pose WITHOUT MIRRORING.

Left/right always refer to the subject’s own anatomical left and right, not the viewer’s. Preserve the exact correspondence of all major body parts:
- head and neck
- shoulders
- left arm and right arm
- elbows, wrists, and hands
- ribcage and pelvis
- left leg and right leg
- knees, ankles, and feet

Preserve the exact pose relationships from Image 1:
- which arm is forward vs back
- which leg is leading vs trailing
- which foot is planted vs lifted
- which limbs are bent vs extended
- which body parts are higher vs lower
- which body parts cross in front of others
- the exact weight distribution and balance
- all contact points with the ground or any object

Critical constraints:
- Do not mirror the pose.
- Do not swap left/right limb assignments.
- Do not swap leading/trailing limbs.
- Do not swap planted/lifted feet.
- Do not generalize the pose into a similar pose.
- Match the full-body silhouette and limb placement of Image 1 as closely as possible.

Final result: Image 2’s character should appear to be rigged directly into Image 1’s exact pose, with no limb swapping.

  • If the model swaps limbs or makes mistakes, it’s often faster to rerun the same prompt instead of asking for fixes.
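
If you automate this step via the API, a minimal sketch could look like the following. It assumes the OpenAI Python SDK's images edit endpoint with two input images; the model id and file names are placeholders, and POSE_TRANSFER_PROMPT stands for the full prompt above.

```python
# Minimal sketch: apply the pose-transfer prompt through the images edit endpoint.
# Model id and file names are placeholders.
import base64
from openai import OpenAI

client = OpenAI()

POSE_TRANSFER_PROMPT = "..."  # paste the full pose-transfer prompt from above

result = client.images.edit(
    model="gpt-image-1",  # placeholder; swap in the gpt-image-2 model id if you have access to it
    image=[
        open("pose_reference.png", "rb"),    # Image 1: the pose reference
        open("target_character.png", "rb"),  # Image 2: the character to re-pose
    ],
    prompt=POSE_TRANSFER_PROMPT,
)

with open("posed_character.png", "wb") as f:
    f.write(base64.b64decode(result.data[0].b64_json))
```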

Putting things together

After generating the individual frames, they may vary noticeably in style, proportions, and size:

Apply a simple prompt like this to normalize them:

normalize the style, character consistency and size for this sprite sheet, keeping all the poses intact
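
To get separate frames back into a single image for that normalization pass, a minimal Pillow sketch like this can reassemble them (the grid size and file names are assumptions, mirroring the splitting sketch above):

```python
# Minimal sketch: stitch individual frames back into a single sprite sheet with Pillow.
# Assumes equally sized frames named frame_00.png ... frame_07.png in a 4x2 grid.
from PIL import Image

COLS, ROWS = 4, 2  # assumed grid layout
frames = [Image.open(f"frame_{i:02d}.png") for i in range(COLS * ROWS)]

frame_w, frame_h = frames[0].size
sheet = Image.new("RGBA", (COLS * frame_w, ROWS * frame_h), (0, 0, 0, 0))

for i, frame in enumerate(frames):
    col, row = i % COLS, i // COLS
    sheet.paste(frame, (col * frame_w, row * frame_h))

sheet.save("sprite_sheet.png")
```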

Additional notes

  • You can create reference images using tools like PowerPoint or an image editor with existing sprites. I used the model to convert sprites into a normalized 3D mannequin.
  • Different characters may require different pose sets.

Of course, it’s not perfect and is still a work in progress. Have you done something similar? Please share what you are doing and any ideas you have for improving the process.

8 Likes

Your prompts lack purpose and decision guidance: give the model background context and the values you want preserved, so it becomes less rigidly directed and more “locally autonomous”, able to make the right decision where it has the most visibility. Search this forum for ABRA KADABRA (workflow decomposition) and check YouTube videos about the agent constitution approach (local governance and decision making).

Otherwise great stuff. This is exactly the thing that will help you kick the hell out of hundreds of lazy guys.

2 Likes

Have you tried giving it one image with the main frame in the middle (way bigger) surrounded by smaller sprites clockwise (starting at 10:10 am lol), and skipping the splitting step during normalization?

I would try scripting to build “normalization frames” from individual sprites, run them in bulk (in parallel) for normalization, then go back to scripting to extract only the main frame or the normalized samples, and script again to stitch everything back together into sprite sheets (a rough sketch of the parallel step follows below).

But take it just as another opinion, I don’t have all of your context…
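
A minimal sketch of the bulk/parallel part of that idea; the model id, prompt, and file layout are placeholders, and the per-frame call simply mirrors the single-image edit pattern used earlier in the thread:

```python
# Minimal sketch: run per-frame normalization edits in parallel.
# Model id, prompt, and file layout are placeholders.
import base64
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

from openai import OpenAI

client = OpenAI()
NORMALIZE_PROMPT = "normalize the style, character consistency and size, keeping the pose intact"

def normalize_frame(path: Path) -> Path:
    # One edit call per frame; swap in whatever per-frame prompt works for you.
    result = client.images.edit(
        model="gpt-image-1",  # placeholder; use the model id you actually have access to
        image=[open(path, "rb")],
        prompt=NORMALIZE_PROMPT,
    )
    out = path.with_name(path.stem + "_normalized.png")
    out.write_bytes(base64.b64decode(result.data[0].b64_json))
    return out

frame_paths = sorted(Path("frames").glob("frame_*.png"))

# The calls are I/O bound, so a small thread pool gives easy parallelism.
with ThreadPoolExecutor(max_workers=4) as pool:
    normalized = list(pool.map(normalize_frame, frame_paths))
```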

2 Likes

This model seems to do multi-panel images well. The same person stays the same person across panels, and different people look diverse instead of monotonous in appearance and pose.

Here is simply the prompt that logically created the sequence, similar to a “sprite run sheet”:

The 1536x1024 wide image is broken into a 6x4 borderless grid of individual pictures, each showing a one second time lapse progression left-to-right, and then progressing to the next frame, as if a series of photographs was taken with a digital camera with an automatic shutter.

The position of the photographer that has taken each of these pictures also moves, so the progression of images from top left, going across to the right, then to the next row, each has a small change in the perspective and direction the camera is pointing, and the zoom changes in a consistent way.

Quadrants cut from a single image:




Notably repaired: there is no longer the convoluted “monster face” at small resolutions seen in prior gpt-image models. However, several faces are covered with “the pattern”: blotchy, high-frequency patches of noise and contrast.

8 Likes

The original approach, with a dummy model in the sprites and a “character skin” then applied by the model, seems easier to me if production scale is needed.

It looks similar to my concept of a “brand voice skin”, where one model converts branded text to “neutral” and another, trained on reversed samples, converts “neutral” back into “branded”.

My gut says the “neutral” here is the dummy, and the approach might actually work far better than one would think. Some fine-tuning will definitely be necessary if you’re truly going to scale volumes.

2 Likes

Thanks for all the suggestions!

As for the gray images, the technique is style-agnostic. The tricky part is getting frames with swapped legs to render properly:

Each frame might need several tries to get right, and even vision-based automatic validation often misses the errors.

1 Like

This is where fine-tuning is probably the best approach; I’d start with a small model tuned to detect incorrect leg positions.

Speaking of that, have you tried real dummy photos? The ones you used had a spacing issue (the left foot too far to the right in some frames); maybe that is the cause.

1 Like

That would be interesting to see… unfortunately I lack the training data and time to spend on it… :sweat_smile: but perhaps I will give it a try in the future.

About dummy photos, they are hard to find. There are many on Pinterest but most are incomplete or of low quality.

1 Like

Foldio360 Smart Turntable (automatic 360° product photography platform for eCommerce) + the Haniforever artist’s manikin from Amazon.com (small drawing figure model for sketching and painting, grey, male).

That turntable can be programmed from an iPhone to get the 360° shots. Roughly 500 USD plus 7 days of work, and you might even be able to sell the sprites on iStock without any extra work at all…

1 Like

You might also need a tripod for the iPhone (or do as I did: use a good DSLR with a Bluetooth trigger).

In the first image her feet are bare and on one line; that is leaking from the dummy pictures. Maybe add some (removable) ground texture so the model has a guardrail to hold on to?

The pose samples do have a little bit of ground (a shadow on the floor); transferring it is just optional in the pose-transfer prompt. If I use a character sample with a floor, it will carry that over as well.

I’m not too worried about minor issues at the moment; what mostly stresses me out are the major mistakes like limb swapping or not following the arm positioning. But the more I try, the more I see that gpt-image-2 was not trained to follow certain details; it emphasizes overall good looks.

Perhaps I will try some tweaks in the API later; I think smaller image outputs might improve the attention to detail a bit.

2 Likes

One other idea: before generating the figure from the doll, have you tried running a vision model to describe the position in detail and including that in the prompt for figure generation?
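
A minimal sketch of that idea, assuming a vision-capable chat model; the model id, file name, and wording are placeholders:

```python
# Minimal sketch: have a vision model describe the mannequin's pose in anatomical
# terms, then prepend that description to the figure-generation prompt.
# Model id and file name are placeholders.
import base64
from openai import OpenAI

client = OpenAI()

with open("mannequin_frame.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

description = client.chat.completions.create(
    model="gpt-4o",  # placeholder vision-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": (
                "Describe this mannequin's pose in precise anatomical terms. "
                "Always use the subject's own left/right, never the viewer's. "
                "Cover which arm/leg is forward vs back, bent vs extended, "
                "which foot is planted vs lifted, and the weight distribution."
            )},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
).choices[0].message.content

generation_prompt = f"Pose description:\n{description}\n\nNow render the character in exactly this pose."
print(generation_prompt)
```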

2 Likes

Yes, both for generating and comparing. It swaps the position of left/right limbs and perceives them as the same thing. Here is what it “sees”:


1 Like

Perhaps the key lies in a combination of visual labels, descriptions, and additional references for the same pose; in case one output doesn’t comply, generate several outputs and pick the best result.

1 Like

The context of the prompt should be a film strip with frames:

Sample Prompt

A sixteen frame filmstrip cartoon of a man walking left to right. Each frame must show a unique walking position. The man is wearing tan pants. The image is contained in a pane with eight frames on the top and eight frames on the bottom.

2 Likes

But the “film strip” already has errors in it. I think the thing is way simpler than that.

I’ll ask this guy on Wednesday if I see him: Movelytics | LinkedIn

His models are running locally, and I bet he’s probably one of the top 20 guys in the world on the subject.

1 Like

I think the thing is way simpler than that.

Took me six minutes to prompt.

I bet he’s probably one of the top 20 guys in the world on the subject.

No doubt that there are a few experts on this subject. I’m just a hack trying to push the limits on the new OAI imageGen model.

2 Likes

I appreciate all the help and ideas; sometimes, the most underestimated thoughts can lead to great discoveries, so everything is welcome.

This will probably take some time to solve…

2 Likes

In the end, the issue is not the model but the lighting on the artificial mannequin image and the way it was made (the feet truly confuse local perception; cover the hips and above and look closer).

But the issue is also the model’s coordinate system, which is based on image-left and image-right plus closer/further…

Have fun (training the model on a different coordinate system is the ideal solution, but it would be too expensive): ChatGPT - Mannequin Position Description