ContextGen: Contextual Layout Anchoring for Identity-Consistent Multi-Instance Generation

ReLER Lab, CCAI, Zhejiang University, 📧 Corresponding Author
ICLR 2026 Poster
ContextGen teaser: multi-instance generation with layout control and identity preservation

ContextGen generates multi-instance images from user-provided reference images, offering precise layout control while preserving the identity of each instance.

Abstract

Multi-instance image generation (MIG) remains a significant challenge for modern diffusion models due to key limitations in achieving precise control over object layout and in preserving the identities of multiple distinct subjects. To address these limitations, we introduce ContextGen, a novel Diffusion Transformer framework for multi-instance generation based on contextual learning guided by both layout and reference images. Our approach integrates two key technical contributions: a Contextual Layout Anchoring (CLA) mechanism that incorporates the composite layout image into the generation context to robustly anchor objects at their desired positions, and Identity Consistency Attention (ICA), an innovative attention mechanism that leverages contextual reference images to ensure identity consistency across multiple instances. Recognizing the lack of large-scale, hierarchically structured datasets for this task, we introduce IMIG-100K, the first dataset with detailed layout and identity annotations. Extensive experiments demonstrate that ContextGen sets a new state of the art, outperforming existing methods in control precision, identity fidelity, and overall visual quality.

Method

A composite layout image, either user-provided or automatically synthesized, is used for precise spatial control. Reference images overcome the limitations of layout-only generation, such as instance information loss due to overlaps and dimensional compression. Two key innovations drive the framework:

  1. Contextual Layout Anchoring (CLA) leverages contextual learning to anchor each instance at its desired position by incorporating the layout image into the generation context, achieving robust layout control.
  2. Identity Consistency Attention (ICA) propagates fine-grained information from contextual reference images to their respective desired locations, preserving the detailed identity of multiple instances.

An enhanced position indexing strategy systematically organizes the token sequence and differentiates the relationships among the generated, layout, and reference images.
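At a high level, CLA and ICA jointly constrain how tokens in the concatenated context attend to one another. As a rough illustration, their combined effect can be sketched as a boolean attention mask over the token sequence. Everything below is an assumption for illustration only: the sequence order (generated image, then layout image, then each reference), the function name `ica_attention_mask`, and the exact masking rules are hypothetical, not the official implementation.

```python
# Hypothetical sketch: CLA/ICA as a boolean attention mask (True = attend allowed).
# Assumed token order: [generated image | layout image | ref_1 | ... | ref_N].
import torch

def ica_attention_mask(gen_len, layout_len, ref_lens, box_token_ids):
    """Build a (total, total) boolean attention mask.

    gen_len       - number of generated-image tokens
    layout_len    - number of composite-layout-image tokens
    ref_lens      - token count of each reference image
    box_token_ids - list of LongTensors; box_token_ids[i] holds the
                    generated-image token indices inside instance i's box
    """
    total = gen_len + layout_len + sum(ref_lens)
    mask = torch.zeros(total, total, dtype=torch.bool)

    # Generated and layout tokens attend freely to each other,
    # so the layout image anchors instance positions (CLA-style context).
    ctx = gen_len + layout_len
    mask[:ctx, :ctx] = True

    # Each reference attends within itself; its tokens are visible only to
    # the generated tokens inside the matching bounding box (ICA-style routing).
    offset = ctx
    for ref_len, ids in zip(ref_lens, box_token_ids):
        mask[offset:offset + ref_len, offset:offset + ref_len] = True
        cols = torch.arange(offset, offset + ref_len)
        mask[ids.unsqueeze(1), cols] = True  # box tokens -> this reference
        offset += ref_len
    return mask
```

Such a mask could then be passed to a standard masked-attention call; the key design point the paper's description suggests is that identity information flows from each reference only toward its own spatial region, rather than leaking across instances.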

Overview of the ContextGen framework architecture
Overview of ContextGen. The composite layout image provides spatial anchoring via CLA, while reference images supply identity information via ICA.

Identity-Consistent Subject-Driven Generation

DEMO on LAMICBench++ comparing ContextGen with existing open-source SOTA methods for subject-driven generation and with closed-source commercial models.

Precise Layout Control for Multi-Instance Scenes

COCO-MIG benchmark demo comparing layout control methods
DEMO on COCO-MIG Bench comparing ContextGen with existing open-source SOTA methods for Layout-to-Image (L2I) generation. Red dashed boxes indicate missing, merged, dislocated, or incorrectly attributed instances.
LayoutSam-Eval benchmark demo
DEMO on LayoutSam-Eval Bench comparing ContextGen with existing open-source SOTA methods for Layout-to-Image (L2I) generation.

IMIG-100K Dataset

IMIG-100K is a large-scale, structured dataset designed for identity-consistent multi-instance generation. It is organized into three progressive difficulty levels to cover a wide range of real-world scenarios.

Sample images from the IMIG-100K dataset
Overview of IMIG-100K, showing the three difficulty levels with layout and identity annotations.

Basic Instance Composition: foundational scenes with layout and identity annotations.

IMIG-100K Basic Part samples
Basic Instance Composition.

Complex Instance Interaction: up to 8 instances per image, with references covering occlusion, viewpoint rotation, and pose changes.

IMIG-100K Complex Part samples
Complex Instance Interaction.

Flexible Composition with References: instances are composited into scenes with significant appearance variation relative to their references, training the model to handle flexible identity transformations.

IMIG-100K Flexible Part samples
Flexible Composition with References.
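To make the dataset description concrete, the following sketches what a per-image layout-and-identity annotation might look like. This is purely hypothetical: the field names, file paths, and the `validate` helper are illustrative assumptions and do not reflect the actual IMIG-100K schema.

```python
# Hypothetical annotation record; field names and paths are illustrative only,
# NOT the real IMIG-100K schema.
sample = {
    "image": "scene_000123.jpg",
    "difficulty": "complex",  # assumed levels: basic | complex | flexible
    "instances": [
        {
            "id": 0,
            "category": "dog",
            "bbox_xyxy": [120, 340, 410, 620],   # pixel coordinates
            "reference": "refs/dog_000123.jpg",  # cropped identity reference
        },
        {
            "id": 1,
            "category": "backpack",
            "bbox_xyxy": [430, 300, 600, 520],
            "reference": "refs/backpack_000123.jpg",
        },
    ],
}

def validate(record):
    """Check that every instance carries both a box and an identity reference,
    and return the instance count."""
    for inst in record["instances"]:
        assert "bbox_xyxy" in inst and "reference" in inst
    return len(record["instances"])
```

The point of the sketch is the pairing the dataset section describes: each instance is annotated with both a spatial layout (its bounding box) and an identity reference image.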

Quantitative Results

Quantitative results table on LAMICBench++
Quantitative Results on LAMICBench++. ITC: Image-text consistency; AES: Aesthetic quality; IPS: Object feature similarity; IDS: Facial identity similarity. Layout-aware methods* use pre-annotated bounding boxes; single-image-editing methods† use manually composited layout images.
Quantitative results table on COCO-MIG and LayoutSam-Eval
Quantitative Results on COCO-MIG and LayoutSam-Eval Bench. SR: Image-level success rate; I-SR: Instance-level success rate; mIoU: Mean IoU; G-C: Global CLIP score; L-C: Local CLIP score. Image-guided methods* use pre-generated images by FLUX.1-Dev.

Interactive GUI

ContextGen comes with an intuitive GUI: upload reference images, draw bounding boxes, and generate, all without writing any code.

The ContextGen GUI demo: drag, drop, and generate with full layout and identity control.

BibTeX

@inproceedings{xu2026contextgen,
  title={ContextGen: Contextual Layout Anchoring for Identity-Consistent Multi-Instance Generation},
  author={Ruihang Xu and Dewei Zhou and Fan Ma and Yi Yang},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026}
}