ContextGen: Contextual Layout Anchoring for Identity-Consistent Multi-Instance Generation

Ruihang Xu, Dewei Zhou, Fan Ma^†, Yi Yang

ReLER Lab, CCAI, Zhejiang University
^†Corresponding Author

ContextGen is a novel framework that uses user-provided reference images to generate images containing multiple instances, offering precise control over the position of each instance while preserving the identity of every instance.

Method

In our framework, a composite layout image provides precise spatial control. This layout image can be either user-provided or automatically synthesized in a setup stage. We then integrate reference images to overcome the limitations of layout-only generation, such as instance information lost to overlaps and dimensional compression. Our method introduces two key innovations: (1) Contextual Layout Anchoring (CLA), which uses contextual learning to anchor each instance at its desired position by incorporating the layout image into the generation context, achieving robust layout control; and (2) Identity Consistency Attention (ICA), a novel attention mechanism that propagates fine-grained information from the contextual reference images to their respective target locations, preserving the detailed identity of multiple instances. These mechanisms are complemented by an enhanced position indexing strategy that systematically organizes and differentiates the relationships among multiple images.
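The description above does not specify how ICA restricts attention or how position indices are assigned. The following is a minimal illustrative sketch, not the ContextGen implementation, assuming a token sequence ordered as [layout | reference_1 ... reference_K | target], position ids of the form (image_index, row, col), and a boolean mask in which target tokens may attend to reference i only inside instance i's bounding box. The function name `build_sequence_and_mask`, the sequence order, and the patch-space boxes are all hypothetical; text tokens are omitted for brevity.

```python
import numpy as np


def build_sequence_and_mask(grid_hw, ref_hws, boxes):
    """Hypothetical token layout: [layout | ref_1 .. ref_K | target].

    grid_hw: (H, W) patch grid shared by the layout and target images.
    ref_hws: list of (h, w) patch grids, one per reference image.
    boxes:   list of (r0, c0, r1, c1) patch-space boxes (end-exclusive),
             box i being where instance i should appear in the target.
    Returns (pos_ids, mask):
      pos_ids: (N, 3) int array of (image_index, row, col) per token.
      mask:    (N, N) bool array; mask[q, k] is True if query q may attend to key k.
    """
    H, W = grid_hw
    grids = [(H, W)] + list(ref_hws) + [(H, W)]
    pos = []
    for img_idx, (h, w) in enumerate(grids):
        rr, cc = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
        # The image index keeps tokens of different images apart even when
        # they share the same (row, col) coordinates.
        pos.append(np.stack([np.full(h * w, img_idx), rr.ravel(), cc.ravel()], axis=1))
    pos_ids = np.concatenate(pos)
    owner = pos_ids[:, 0]

    # Start from full attention, then restrict only the target -> reference blocks.
    mask = np.ones((len(owner), len(owner)), dtype=bool)
    target_idx = np.flatnonzero(owner == len(grids) - 1)
    t_rows, t_cols = pos_ids[target_idx, 1], pos_ids[target_idx, 2]
    for i, (r0, c0, r1, c1) in enumerate(boxes):
        ref_idx = np.flatnonzero(owner == i + 1)
        inside = (t_rows >= r0) & (t_rows < r1) & (t_cols >= c0) & (t_cols < c1)
        # Target tokens see reference i only inside box i.
        mask[np.ix_(target_idx, ref_idx)] = inside[:, None]
    return pos_ids, mask
```

For example, with a 4×4 patch grid, two 2×2 reference images, and box 1 covering the top-left quadrant, the sequence has 40 tokens and the target token at patch (0, 0) can attend to reference 1 but not to reference 2. A separate image index is only one plausible way to differentiate images; other offset schemes would serve the same purpose.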

Identity-Consistent Subject-Driven Generation

DEMO on LAMICBench++: comparison with existing open-source SOTA subject-driven generation methods and closed-source commercial models

Precise Layout Control for Multi-Instance Scenes

DEMO on COCO-MIG Bench: comparison with existing open-source SOTA Layout-to-Image (L2I) methods
Note: Red dashed boxes indicate missing, merged, dislocated, or incorrectly attributed instances.

DEMO on LayoutSam-Eval Bench: comparison with existing open-source SOTA Layout-to-Image (L2I) methods

About IMIG-100K Dataset

IMIG-100K is a large-scale, structured dataset for identity-consistent multi-instance generation, organized into three progressive difficulty levels.

Quantitative Results on Benchmarks

Quantitative Results on LAMICBench++.
ITC: Image-text consistency; AES: Aesthetic quality; IPS: Object feature similarity; IDS: Facial identity similarity.
Layout-aware methods^* use our pre-annotated bounding boxes, while single-image-editing methods^† use our manually composited layout images.
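The caption names IPS and IDS but not how they are computed. Similarity scores of this kind are commonly reported as the cosine similarity between a feature embedding of each reference and one of the corresponding generated instance; the sketch below assumes exactly that, averaged over instances. The function `mean_cosine_similarity` is hypothetical, and the feature extractors (for objects or faces) are left abstract.

```python
import numpy as np


def mean_cosine_similarity(ref_feats, gen_feats):
    """Average cosine similarity between paired feature vectors.

    ref_feats, gen_feats: arrays of shape (num_instances, dim); row i of each
    holds the (hypothetical) embedding of instance i's reference crop and
    generated crop, respectively.
    """
    ref = np.asarray(ref_feats, dtype=float)
    gen = np.asarray(gen_feats, dtype=float)
    # L2-normalize each row so the row-wise dot product is the cosine.
    ref = ref / np.linalg.norm(ref, axis=1, keepdims=True)
    gen = gen / np.linalg.norm(gen, axis=1, keepdims=True)
    return float(np.mean(np.sum(ref * gen, axis=1)))
```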

Quantitative Results on COCO-MIG and LayoutSam-Eval Bench.
SR: Image-level success rate (an image counts as a success only if all of its instances are generated with the correct position and color); I-SR: Instance-level success rate; mIoU: Mean IoU between ground-truth and generated instance positions; G-C: Global CLIP score; L-C: Local CLIP score.
Spatial, Color, Texture, Shape: Instance-level attribute accuracy; CLIP: Global CLIP score; Pick: Global Pick score for human preference.
Image-guided methods^* use images we pre-generated with FLUX.1-Dev.
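The caption defines SR, I-SR, and mIoU only in words. The sketch below shows how they relate, assuming an instance counts as correct when its detected box reaches IoU ≥ 0.5 with the ground-truth box and its color is judged correct, and that a missing instance scores IoU 0. The detector, the threshold, the color check, and the function names `box_iou` and `layout_metrics` are illustrative assumptions, not the benchmarks' official protocol.

```python
def box_iou(a, b):
    """IoU of two axis-aligned (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def layout_metrics(images, iou_thresh=0.5):
    """Compute (SR, I-SR, mIoU) under the assumptions stated above.

    images: one list per generated image; each entry is a tuple
            (gt_box, pred_box_or_None, color_ok) for a single instance.
    iou_thresh: hypothetical IoU cutoff for a correctly placed instance.
    """
    ious, inst_ok, img_ok = [], [], []
    for instances in images:
        flags = []
        for gt_box, pred_box, color_ok in instances:
            # A missing instance (no detected box) contributes IoU 0.
            iou = box_iou(gt_box, pred_box) if pred_box is not None else 0.0
            ious.append(iou)
            flags.append(iou >= iou_thresh and color_ok)
        inst_ok.extend(flags)
        # Image-level success requires every instance to succeed.
        img_ok.append(all(flags))
    sr = sum(img_ok) / len(img_ok)
    i_sr = sum(inst_ok) / len(inst_ok)
    miou = sum(ious) / len(ious)
    return sr, i_sr, miou
```

With one image whose two instances are well placed but one has the wrong color, and a second image whose single instance reaches IoU 0.5, this returns SR = 0.5, I-SR = 2/3, and mIoU ≈ 0.83. SR is always at most I-SR when every image has at least one instance, since one failed instance fails the whole image.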