CAP-Net: A Unified Network for 6D Pose and Size Estimation of Categorical Articulated Parts from a Single RGB-D Image
Abstract
This paper tackles category-level pose estimation of articulated objects in robotic manipulation tasks and introduces a new benchmark dataset. While recent methods estimate part poses and sizes at the category level, they often rely on geometric cues and complex multi-stage pipelines that first segment parts from the point cloud, followed by Normalized Part Coordinate Space (NPCS) estimation for 6D poses. These approaches overlook dense semantic cues from RGB images, leading to suboptimal accuracy, particularly for objects with small parts. To address these limitations, we propose a single-stage Network, CAP-Net, for estimating the 6D poses and sizes of Categorical Articulated Parts. This method combines RGB-D features to generate instance segmentation and NPCS representations for each part in an end-to-end manner. CAP-Net uses a unified network to simultaneously predict point-wise class labels, centroid offsets, and NPCS maps. A clustering algorithm then groups points of the same predicted class based on their estimated centroid distances to isolate each part. Finally, the NPCS region of each part is aligned with the point cloud to recover its final pose and size. To bridge the sim-to-real domain gap, we introduce the RGBD-Art dataset, the largest RGB-D articulated dataset to date, featuring photorealistic RGB images and depth noise simulated from real sensors. Experimental evaluations on the RGBD-Art dataset demonstrate that our method significantly outperforms the state-of-the-art approach. Real-world deployments of our model in robotic tasks underscore its robustness and exceptional sim-to-real transfer capabilities, confirming its substantial practical utility.
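The abstract outlines the inference pipeline: a unified network predicts point-wise class labels, centroid offsets, and NPCS maps; a clustering step groups same-class points by their voted centroids to isolate part instances; and each part's NPCS coordinates are aligned to the observed points to recover its pose and size. The Python sketch below illustrates only that post-processing stage, under assumptions not stated in the abstract: DBSCAN stands in for the unspecified clustering algorithm, a Umeyama least-squares similarity fit performs the NPCS-to-camera alignment, and the function names and the eps / min_pts thresholds are hypothetical placeholders rather than the authors' implementation.

import numpy as np
from sklearn.cluster import DBSCAN

def umeyama_similarity(src, dst):
    # Least-squares similarity transform mapping src -> dst,
    # i.e. dst ~= s * R @ src + t (Umeyama, 1991).
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst
    cov = dst_c.T @ src_c / len(src)           # 3x3 cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                         # keep a proper rotation, det(R) = +1
    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / len(src)
    s = np.trace(np.diag(D) @ S) / var_src     # isotropic scale
    t = mu_dst - s * R @ mu_src
    return s, R, t

def recover_part_poses(points, labels, offsets, npcs, eps=0.03, min_pts=50):
    # points  (N, 3): observed depth points in the camera frame
    # labels  (N,)  : per-point part class predictions
    # offsets (N, 3): per-point offsets to the predicted part centroid
    # npcs    (N, 3): per-point Normalized Part Coordinate Space predictions
    results = []
    for cls in np.unique(labels):
        idx = np.where(labels == cls)[0]
        if len(idx) < min_pts:
            continue
        # Points voting for nearby centroids are grouped into one part instance.
        votes = points[idx] + offsets[idx]
        inst = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(votes)
        for k in np.unique(inst[inst >= 0]):
            part = idx[inst == k]
            # Align the part's NPCS map to the observed points: the scale gives
            # the part size, R and t give its 6D pose in the camera frame.
            s, R, t = umeyama_similarity(npcs[part], points[part])
            results.append({"class": int(cls), "size": s, "R": R, "t": t})
    return results

In this sketch the closed-form similarity fit stands in for whatever alignment the paper actually uses; a RANSAC-wrapped variant would be a natural substitute when the per-point NPCS predictions are noisy.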
Related Material

@InProceedings{Huang_2025_CVPR,
  author    = {Huang, Jingshun and Lin, Haitao and Wang, Tianyu and Fu, Yanwei and Xue, Xiangyang and Zhu, Yi},
  title     = {CAP-Net: A Unified Network for 6D Pose and Size Estimation of Categorical Articulated Parts from a Single RGB-D Image},
  booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
  month     = {June},
  year      = {2025},
  pages     = {11654-11664}
}