PhyEdit: Towards Real-World Object Manipulation via Physically-Grounded Image Editing

ReLER Lab, CCAI, Zhejiang University
Comparison of image editing results on three physical scenarios between PhyEdit and Nano Banana Pro. PhyEdit achieves more accurate object placement and geometric consistency.

Abstract

Achieving physically accurate object manipulation in image editing is essential for applications such as interactive world models. However, existing visual generative models often fail at precise spatial manipulation, resulting in incorrect scaling and positioning of objects. This limitation primarily stems from the lack of explicit mechanisms to incorporate 3D geometry and perspective projection. To achieve accurate manipulation, we develop PhyEdit, an image editing framework that leverages explicit geometric simulation as contextual 3D-aware visual guidance. By combining this plug-and-play 3D prior with joint 2D–3D supervision, our method effectively improves physical accuracy and manipulation consistency. To support this method and evaluate performance, we present a real-world dataset, RealManip-10K, for 3D-aware object manipulation featuring paired images and depth annotations. We also propose ManipEval, a benchmark with multi-dimensional metrics to evaluate 3D spatial control and geometric consistency. Extensive experiments show that our approach outperforms existing methods, including strong closed-source models, in both 3D geometric accuracy and manipulation consistency.

Demo

Continuous Object Manipulation

Beyond single-step editing, PhyEdit maintains geometric consistency under continuous state transitions. Given an initial state and a spatial trajectory, our model sequentially renders physically accurate keyframes that can be processed by video interpolation models to synthesize a continuous, physically consistent manipulation video. The robotic arm below is outside our training distribution, demonstrating strong generalization.
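The keyframe-chaining idea above can be sketched as a simple feedback loop in which each edited frame becomes the source for the next step. Everything below is a hypothetical stand-in (`render_trajectory` and `edit_fn` are illustrative names, not the released PhyEdit API):

```python
def render_trajectory(image, mask, trajectory, edit_fn):
    """Sequentially apply a 3D translation edit; each rendered
    keyframe becomes the source image for the next step."""
    frames = [image]
    for delta_p in trajectory:
        frames.append(edit_fn(frames[-1], mask, delta_p))
    return frames

# Toy stand-in for the real editor: "image" is a scalar and the
# fake edit just accumulates the z-component of each translation.
trajectory = [(0.0, 0.0, 0.1), (0.0, 0.0, 0.2), (0.0, 0.0, 0.3)]
fake_edit = lambda img, mask, dp: img + dp[2]
frames = render_trajectory(0.0, None, trajectory, fake_edit)
```

A video interpolation model would then fill in the intermediate frames between consecutive keyframes.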

Continuous object manipulation along a trajectory. PhyEdit renders keyframes (solid borders); a video model interpolates intermediate frames (dashed borders).

Method

PhyEdit combines a DiT editing backbone with a 3D foundation model. The framework has three components: (1) a 3D transformation module that generates a depth-aware preview, (2) the DiT denoising backbone, and (3) a joint training loss in both 2D latent and 3D depth spaces.

Overview of PhyEdit. User and GT inputs are first processed via the 3D transformation module. The resulting conditions are fed into the backbone for the training forward pass, followed by joint 2D and 3D supervision.

3D Transformation Module

Given a source image $I_\text{src}$, object mask $M_o$, and transition vector $\Delta\mathbf{p}_o$, we predict depth $D$ and camera pose $(R, t)$, then edit the object directly in 3D space:

$$
\mathbf{P}_{o}=\operatorname{Unproj}(I_{\text{src}}, M_{o}, D, R, t), \quad \mathbf{P}_{o}'=\mathbf{P}_{o}+\Delta\mathbf{p}_o, \quad I_{\text{prev}}=\operatorname{Proj}(\mathbf{P}'_{o}, R, t)
$$

The preview image $I_\text{prev}$ serves as an additional condition for the DiT backbone, providing explicit geometric guidance and naturally supporting multi-object manipulation without iterative editing.
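The Unproj/Proj step can be illustrated with a plain pinhole-camera sketch in NumPy. This is a minimal illustration under assumed intrinsics $K$ and an identity camera pose, not the paper's implementation:

```python
import numpy as np

def unproject(depth, K, mask):
    """Lift masked pixels to 3D camera-frame points (pinhole model)."""
    v, u = np.nonzero(mask)
    z = depth[v, u]
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=1)

def project(points, K):
    """Project 3D camera-frame points back to pixel coordinates."""
    x, y, z = points.T
    u = K[0, 0] * x / z + K[0, 2]
    v = K[1, 1] * y / z + K[1, 2]
    return np.stack([u, v], axis=1)

# Toy example: a 4x4 depth plane at z=2 with a 2x2 object mask;
# push the object 0.5 farther along the camera z-axis.
K = np.array([[2.0, 0.0, 2.0], [0.0, 2.0, 2.0], [0.0, 0.0, 1.0]])
depth = np.full((4, 4), 2.0)
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
P = unproject(depth, K, mask)
P_moved = P + np.array([0.0, 0.0, 0.5])  # Delta p_o along depth
uv = project(P_moved, K)  # reprojected pixels shrink toward the center
```

Pushing the object back shrinks its reprojection toward the principal point, which is exactly the perspective effect the preview image hands to the backbone.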

Joint Supervision

Latent-space denoising loss alone is insufficient for 3D manipulation, as it emphasizes appearance reconstruction over geometric correctness. We add a depth-space supervision term using the scale-invariant logarithmic (SILog) loss:

$$
\mathcal{L} = \mathcal{L}_{\text{flow}} + \lambda_d \mathcal{L}_{\text{depth}}
$$

where $\lambda_d = 0.1$ balances latent and depth supervision. This supervision is lightweight and plug-and-play: it can be added to different DiT-based editors with minimal modification.
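A common form of the SILog loss (following the Eigen et al. formulation; the paper's exact variant and its internal weighting constant are assumptions here) combined with the joint objective:

```python
import numpy as np

def silog_loss(pred_depth, gt_depth, lam=0.85, eps=1e-6):
    """Scale-invariant logarithmic (SILog) depth loss.

    lam=0.85 is a conventional choice, not taken from the paper.
    With lam=1.0 the loss is fully invariant to a global depth scale.
    """
    d = np.log(pred_depth + eps) - np.log(gt_depth + eps)
    return float(np.mean(d ** 2) - lam * np.mean(d) ** 2)

def joint_loss(flow_loss, pred_depth, gt_depth, lambda_d=0.1):
    """L = L_flow + lambda_d * L_depth, with lambda_d = 0.1 as in the paper."""
    return flow_loss + lambda_d * silog_loss(pred_depth, gt_depth)
```

The scale-invariance is the point: a prediction that is correct up to a global scale is penalized only mildly, so the gradient focuses on relative geometric structure rather than absolute magnitudes.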

RealManip-10K Dataset

We build RealManip-10K, a real-world dataset of paired images $(I_\text{src}, I_\text{tgt})$ where objects are manipulated in 3D space, especially along the depth axis. Each pair provides depth maps, object masks, and representative 3D object coordinates.

Samples from RealManip-10K. Objects are annotated with bounding boxes, and 3D coordinates $(x, y, z)$ are given in the camera coordinate system.
Dataset construction pipeline: data source filtering → camera-static clip extraction → depth and mask processing → depth-aware frame pair selection.

The pipeline collects videos from OpenVid-1M, VIDGEN-1M, PE-Video-Dataset, and object-tracking datasets (LaSOT, GoT-10K, TrackingNet). A novel camera token clustering approach using DBSCAN ensures near-static camera conditions without relying on optical flow or hand-crafted feature matching.
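The clustering step might look like the following minimal DBSCAN over per-frame "camera tokens" (here stand-in 2D embeddings; the real pipeline's token extractor and hyperparameters are not specified on this page). Frames falling in the dominant dense cluster are treated as near-static:

```python
import numpy as np

def dbscan(X, eps=0.05, min_pts=3):
    """Minimal DBSCAN: returns one label per point (-1 = noise)."""
    n = len(X)
    labels = np.full(n, -1)
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    neighbors = [np.nonzero(dist[i] <= eps)[0] for i in range(n)]
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    for i in range(n):
        if visited[i] or len(neighbors[i]) < min_pts:
            continue  # not an unvisited core point
        visited[i] = True
        labels[i] = cluster
        queue = [i]
        while queue:  # expand the cluster through density-reachable points
            j = queue.pop()
            for k in neighbors[j]:
                if labels[k] == -1:
                    labels[k] = cluster
                if not visited[k]:
                    visited[k] = True
                    if len(neighbors[k]) >= min_pts:
                        queue.append(k)
        cluster += 1
    return labels

# Stand-in camera tokens: five tightly-clustered (near-static) frames
# followed by two frames with large camera motion.
tokens = np.array([[0.00, 0.00], [0.01, 0.00], [0.00, 0.01],
                   [0.01, 0.01], [0.02, 0.00], [1.00, 1.00], [2.00, 2.00]])
labels = dbscan(tokens)  # moving frames come out as noise (-1)
```

Because DBSCAN needs no preset cluster count and marks outliers as noise, it discards camera-motion frames without optical flow or hand-crafted feature matching.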

Experiments

Quantitative Results on ManipEval

| Method | DIoU ↑ | Mask IoU ↑ | AbsRel ↓ | δ₁.₂₅ ↑ | Chamfer ↓ | Centroid ↓ | RA-DINO ↑ | DeQA ↑ | Phys-VLM ↑ |
|---|---|---|---|---|---|---|---|---|---|
| GeoDiffuser | 51.72 | 20.10 | 67.64 | 31.40 | 73.41 | 64.56 | 14.78 | 56.72 | 54.62 |
| DiffusionHandles | 56.22 | 18.88 | 57.36 | 31.73 | 73.00 | 59.65 | 17.08 | 61.83 | 46.90 |
| Move-and-Act | 46.89 | 10.97 | 68.47 | 29.30 | 64.59 | 63.62 | 13.01 | 60.79 | 60.85 |
| OBJect 3DIT | 46.63 | 10.18 | 66.93 | 32.75 | 62.30 | 64.14 | 14.31 | 48.01 | 66.65 |
| GoodDrag | 51.07 | 13.29 | 69.60 | 34.19 | 51.27 | 58.39 | 20.85 | 74.08 | 85.08 |
| PixelMan | 49.48 | 11.49 | 66.46 | 35.10 | 50.18 | 57.13 | 21.17 | 72.79 | 73.60 |
| Qwen-Image-Edit | 53.28 | 13.80 | 64.63 | 40.61 | 46.31 | 52.48 | 26.09 | 75.67 | 90.55 |
| LightningDrag | 53.81 | 18.90 | 57.07 | 35.99 | 45.76 | 53.57 | 21.92 | 73.60 | 88.60 |
| ChronoEdit | 48.92 | 9.26 | 65.80 | 36.35 | 45.02 | 54.40 | 21.73 | 75.29 | 92.15 |
| GPT-Image-1.5† | 52.33 | 11.34 | 70.95 | 36.50 | 39.79 | 52.78 | 23.80 | 75.82 | 88.68 |
| Qwen-Image-2.0-Pro† | 56.48 | 15.60 | 55.93 | 41.39 | 29.64 | 43.48 | 31.04 | 68.26 | 82.91 |
| Nano Banana Pro† | 59.97 | 18.93 | 55.02 | 46.11 | 25.33 | 35.62 | 34.77 | 77.48 | 91.06 |
| Ours | 65.33 | 27.20 | 49.53 | 51.08 | 18.93 | 32.12 | 36.91 | 75.48 | 93.72 |

† Proprietary commercial model. All metrics normalized to [0, 100].

Our method achieves the best overall performance across manipulation-related metrics, outperforming strong closed-source commercial systems on geometry-sensitive measures. Against Nano Banana Pro: DIoU +5.36, Chamfer distance −6.40, RA-DINO +2.14. Our lead over commercial baselines grows further on multi-object scenes (Chamfer gap: 5.19 → 6.58; δ₁.₂₅ gap: 1.91 → 8.02).
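Of the geometry metrics, Chamfer distance is the most standard; a symmetric formulation over object point clouds (ManipEval's exact normalization is not given on this page) is:

```python
import numpy as np

def chamfer_distance(A, B):
    """Symmetric Chamfer distance between point sets A (N,3) and B (M,3):
    mean nearest-neighbor distance from A to B plus from B to A."""
    d = np.linalg.norm(A[:, None] - B[None, :], axis=-1)  # (N, M) pairwise
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())

A = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
B = A + np.array([0.0, 0.0, 1.0])  # A shifted one unit along z
```

Lower is better: a manipulated object whose reconstructed point cloud sits exactly where the ground-truth placement puts it scores zero.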

BibTeX

@misc{xu2026phyeditrealworldobjectmanipulation,
title={PhyEdit: Towards Real-World Object Manipulation via Physically-Grounded Image Editing},
author={Ruihang Xu and Dewei Zhou and Xiaolong Shen and Fan Ma and Yi Yang},
year={2026},
eprint={2604.07230},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2604.07230},
}