What's the best workflow for perfect product insertion (Ref Image + Mask) in 2025?

Hey everyone,

I’ve been going down a rabbit hole trying to find the state-of-the-art API-based workflow for what seems like a simple goal: perfect product insertion.

My ideal process is:

  1. Take a base image (e.g., a person on a couch).
  2. Take a reference image of a specific product (e.g., a specific brand of headphones).
  3. Use a mask on the base image to define where the product should go. This step is optional, but I assume it would improve accuracy.
  4. Get a final image where the product is inserted seamlessly, matching the lighting and perspective.

Here’s my journey so far and where I’m getting stuck:

  • Google Imagen was a dead end. I tried both their web UI and the API. It’s great for inpainting with a text prompt, but there’s no way to use a reference image as the source for the object. So base + mask + text works (a minimal sketch of that path follows this list), but base + mask + reference image doesn’t.
  • The ChatGPT UI Tease. The wild part is that I can get surprisingly close to this in the regular ChatGPT UI. I can upload the base photo and the product photo, and ask something like “insert this product here.” It does a decent job! But this seems to be a special conversational feature in their UI, as the API doesn’t offer an endpoint for this kind of multi-image, masked editing.
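For reference, here is roughly what the working base + mask + text path looks like, a minimal sketch assuming the Vertex AI SDK’s `edit_image` method; the project ID, model version, and filenames are placeholders:

```python
# Minimal sketch of Imagen's text-prompt inpainting (base + mask + text).
# Project ID, model version, and filenames are placeholders.
import vertexai
from vertexai.preview.vision_models import Image, ImageGenerationModel

vertexai.init(project="my-project", location="us-central1")
model = ImageGenerationModel.from_pretrained("imagegeneration@006")

base = Image.load_from_file("couch_scene.png")
mask = Image.load_from_file("mask.png")  # marks the region Imagen may repaint

response = model.edit_image(
    base_image=base,
    mask=mask,
    prompt="over-ear headphones resting on the couch cushion",
)
response[0].save("edited.png")
```

Note that the prompt is the only way to describe the object; there is no parameter anywhere in this call that accepts a reference image of the product.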

This has led me to the Stable Diffusion ecosystem, and it seems way more promising. My research points to two main paths:

  1. Stable Diffusion + IP-Adapter: This seems like the most direct solution. My understanding is that I can use a ComfyUI workflow to feed the base image, mask, and product reference image into an IP-Adapter to guide the inpainting. This feels like the “holy grail” I’m looking for.
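If you’d rather script this than build the graph in ComfyUI, here is a minimal sketch of the same idea in Hugging Face diffusers; the model IDs, adapter weights, scale, and filenames are assumptions to adapt to your setup:

```python
# Masked inpainting guided by a product reference image via IP-Adapter.
# Model IDs, weight names, and filenames are assumptions.
import torch
from diffusers import AutoPipelineForInpainting
from diffusers.utils import load_image

pipe = AutoPipelineForInpainting.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-inpainting",
    torch_dtype=torch.float16,
).to("cuda")
pipe.load_ip_adapter(
    "h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin"
)
pipe.set_ip_adapter_scale(0.7)  # how strongly the reference image steers the fill

base = load_image("couch_scene.png")
mask = load_image("headphones_mask.png")    # white where the product should go
product = load_image("headphones_ref.png")  # the reference product photo

result = pipe(
    prompt="a person on a couch wearing over-ear headphones",
    image=base,
    mask_image=mask,
    ip_adapter_image=product,
).images[0]
result.save("insertion.png")
```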

Another option I came across (though I’m definitely not an expert on it):

  1. Product-Specific LoRA: The other idea is to train a LoRA on my specific product. This seems like more work upfront, but I wonder if the final quality and brand consistency are worth it, especially if I need to use the same product in many different images.
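For what it’s worth, once a LoRA is trained (e.g., with the diffusers DreamBooth-LoRA training script), plugging it into the same inpainting pipeline from the sketch above takes only a couple of lines; the path, weights file, and trigger word here are hypothetical:

```python
# Continuing from the IP-Adapter sketch above: swap the reference-image
# guidance for a trained product LoRA. The path, weights file, and the
# "acmebrand headphones" trigger word are hypothetical.
pipe.load_lora_weights("path/to/product_lora", weight_name="product.safetensors")

result = pipe(
    prompt="a person on a couch wearing acmebrand headphones",  # trigger word from training
    image=base,
    mask_image=mask,
).images[0]
result.save("insertion_lora.png")
```

The usual trade-off: a LoRA bakes the product’s identity into the weights, so it stays consistent across many images, while IP-Adapter needs no training but reproduces the product less faithfully.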

So, I wanted to ask the experts here:

  • For perfect product insertion, is the ComfyUI + IP-Adapter workflow the definitive way to go right now?
  • In what scenarios would you choose to train a LoRA for a product instead of just using an IP-Adapter? Is it a massive quality jump?
  • Am I missing any other killer techniques or new tools that can solve this elegantly?

Thanks for any insight you can share!

A mask on the OpenAI images endpoint serves one purpose:

When using the DALL-E 2 model, the alpha channel of the single input image, or of a separate mask image, selects the area the AI model is allowed to change, whether infilling or outfilling. That area essentially “doesn’t exist” any more, and the AI will regenerate it based on the prompt and the image content leading up to its edges.
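A minimal sketch of that masked edit, assuming the OpenAI Python SDK; the filenames and prompt are placeholders:

```python
# DALL-E 2 masked edit: the transparent (alpha = 0) region of the mask is
# erased and regenerated from the prompt and the surrounding pixels.
from openai import OpenAI

client = OpenAI()

result = client.images.edit(
    model="dall-e-2",
    image=open("base_scene.png", "rb"),  # square RGBA PNG
    mask=open("mask.png", "rb"),         # transparent where regeneration happens
    prompt="a person on a couch wearing over-ear headphones",
    n=1,
    size="1024x1024",
)
print(result.data[0].url)  # dall-e-2 returns a hosted URL by default
```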

gpt-image-1 on the API is essentially useless when sending a mask for the described purpose; it does not use the mask programmatically. The mask is barely even hinted at, the entire image can be recomposed and resized by language, and the mask image’s RGB contents, which should be completely ignored, are also treated as an input. A complete failure, with documentation that continues to be misleading.

The edits endpoint with gpt-image-1 can take multiple images in a list. It is then up to you to prompt the more intelligent AI model on how to recognize the contents and how to compose them together into a new synthesized image. There is no metadata to go with the images, there is no way to frame them in language, and I haven’t tried referring to them as “image 1” or “image 2” to see if even that works.

Finally, there is no “insertion”: the entire image is regenerated based on vision. There’s a new API parameter on the edits endpoint to increase the quality of input reproduction, but it only enhances the vision; it does not help with aligning or halting regeneration, so you can’t even use external photo-editing tools in conjunction with the AI.
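For completeness, a sketch of that multi-image edits call; the filenames and prompt are placeholders, and I’m assuming `input_fidelity` is the new reproduction-quality parameter mentioned above:

```python
# gpt-image-1 multi-image edit: everything rides on the prompt, since there
# is no per-image metadata. Filenames are placeholders; input_fidelity is
# assumed to be the reproduction-quality parameter described above.
import base64
from openai import OpenAI

client = OpenAI()

result = client.images.edit(
    model="gpt-image-1",
    image=[open("base_scene.png", "rb"), open("product_ref.png", "rb")],
    prompt=(
        "Place the headphones shown in the second image on the person in the "
        "first image, matching the scene's lighting and perspective."
    ),
    input_fidelity="high",
)

with open("output.png", "wb") as f:
    f.write(base64.b64decode(result.data[0].b64_json))
```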

On the responses endpoint, you have an internal tool that can only work with gpt-image-1. You have to chat with the AI to make it trigger image generation. However, the gpt-image-1 model can then observe the chat context in an undocumented manner, so it may be more feasible there to send a multi-part user message that interleaves images with a description or instruction for each one. It works like ChatGPT, with a bit more control; however, the escalating costs and the confusion of repeated vision images in a “chat” degrade a specialized application.
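A sketch of what that interleaved multi-part message might look like on the responses endpoint; the mainline model choice, filenames, and per-image framing text are all assumptions:

```python
# Responses endpoint with the built-in image_generation tool: the chat model
# triggers gpt-image-1, which can observe the interleaved images and text.
# Model choice, filenames, and framing text are assumptions.
import base64
from openai import OpenAI

client = OpenAI()

def data_url(path: str) -> str:
    # Inline a local PNG as a data URL for an input_image part.
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()

response = client.responses.create(
    model="gpt-4.1",
    tools=[{"type": "image_generation"}],
    input=[{
        "role": "user",
        "content": [
            {"type": "input_text", "text": "Image 1, the base scene:"},
            {"type": "input_image", "image_url": data_url("base_scene.png")},
            {"type": "input_text", "text": "Image 2, the product to insert:"},
            {"type": "input_image", "image_url": data_url("product_ref.png")},
            {"type": "input_text",
             "text": "Regenerate image 1 with the person wearing the headphones from image 2."},
        ],
    }],
)

# The generated image comes back base64-encoded on the tool-call output item.
for item in response.output:
    if item.type == "image_generation_call":
        with open("output.png", "wb") as f:
            f.write(base64.b64decode(item.result))
```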

So you can try that. You use language to communicate what you want. Input images can serve as a reference, but they are not copy-pasted.

Got it, that makes sense. I tried everything you said and ended up with a fairly solid result. Will keep digging in this direction.

Thanks a lot.

Hello everyone,

I’m hoping to leverage the collective expertise of this forum to solve a problem I’m facing with OpenAI’s image editing capabilities. Despite extensive testing, I’m unable to determine a reliable model for my use case.

My Goal

My use case is pretty straightforward advertising stuff. I want to be able to insert products or brand references into a base image. This could be:

  • Simple cases: Adding a specific car model onto a picture of a bridge for a car ad or placing a perfume bottle on an elegant background.
  • Complex cases: Having a model wear a shirt with a specific pattern, display a particular luxury handbag, or even ride a bike of a specific brand.

You get the idea.


What I’ve Tried

I’ve run hundreds of tests for this, trying to insert all sorts of products and brands. I’ve used different models, including 4o, 4.1, o3, and o3 pro. I even set up a rigorous scoring method to track performance, but I’ve come away with no real clues.

My Confusing Results 🤯

Honestly, the results are all over the place, and I can’t make sense of it.

  • I assumed that the better the model, the higher the quality, but that’s definitely not a consistent rule.

  • I thought the more advanced models would be more capable on complex insertions (e.g., brands with intricate patterns, complex products like a bike), but sometimes that’s the case and sometimes 4o outperforms them.

  • I expected higher stability on simple cases from the big models, but they can totally mess up basic insertions.

  • Surprisingly, the magnitude of error with big models is even greater; when they fail, they fail big!

The Core Question

Given these chaotic results, I’m at a loss.

I’m a bit clueless at this point. Is there a consensus on which model performs best on average for this kind of image editing and product insertion? Are certain models known to excel in specific situations over others for my use case?

Any recommendation or insight is more than welcome. Thanks a lot!

Are you using ChatGPT?

There is no use in selecting a different model unless you are also forcing it to write out language before it calls the image generation tool that it has.

Once the tool is triggered, the gpt-4o-based image generator takes over and does all the work based on your input.

No, I am actually talking about an API-based case here, which is why I’m playing with different underlying models, but with varied results so far.