Synthetic Hands Meet Legacy Data: A Synthetic Dataset for Structured, Controllable, and Multimodal Evaluation
Abstract
Evaluating hand gesture and pose models in a data-centric manner remains difficult due to limited gesture diversity, missing modalities, and the lack of control over hand and scene attributes. We introduce T3DGesture, a large-scale Text- and 3D-Labeled multi-modal synthetic hand dataset that addresses these limitations by enabling structured, controllable, and multimodal evaluation for hand gesture recognition and hand pose estimation. T3DGesture features a modular gesture representation that separates global wrist and local finger motions, enabling compositional sampling and coverage of 769 unique gesture categories. Its physically valid hand motions are generated with a kinematics-aware variational model under biomechanical constraints. Through synchronized simulation, T3DGesture provides 22.6K RGB-D video clips (1.1M frames) with high-resolution 3D meshes, point clouds, 2D/3D keypoints, camera parameters, and semantic text labels. T3DGesture builds upon and unifies gestures from legacy real-world datasets, enabling new forms of evaluation grounded in consistent annotation and modality alignment. It demonstrates three key capabilities: (1) modality-aligned synthetic training improves model performance on real-world benchmarks such as SHREC and EgoGesture; (2) controlled generation supports one-factor benchmarking to isolate the impact of attributes like hand shape, scale, and background; and (3) synchronized multimodal outputs enable sensor fusion studies and underexplored tasks, such as stereo depth estimation from egocentric views.
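The abstract describes modality-aligned samples (RGB-D frames, 2D/3D keypoints, camera parameters, text labels) and a modular gesture representation that separates global wrist motion from local finger motion. Below is a minimal Python sketch of what one such sample and its composite gesture label might look like; the class names, field shapes, gesture vocabulary, and projection helper are illustrative assumptions, not the dataset's actual schema or API.

```python
# Hypothetical sketch of a modality-aligned sample and a compositional
# wrist/finger gesture label. All names and shapes are assumptions made
# for illustration; they do not reflect the released T3DGesture format.
from dataclasses import dataclass
import numpy as np


@dataclass
class GestureLabel:
    # Modular representation: global wrist motion and local finger motion
    # are labeled separately, so categories compose combinatorially.
    wrist_motion: str   # e.g. "swipe_left"  (assumed vocabulary)
    finger_motion: str  # e.g. "pinch"       (assumed vocabulary)
    text: str           # free-form semantic description

    @property
    def category(self) -> str:
        # Composite category name formed from the two motion components.
        return f"{self.wrist_motion}+{self.finger_motion}"


@dataclass
class FrameSample:
    rgb: np.ndarray           # (H, W, 3) uint8 color image
    depth: np.ndarray         # (H, W) float32 depth map, meters
    keypoints_2d: np.ndarray  # (21, 2) image-plane hand joints
    keypoints_3d: np.ndarray  # (21, 3) camera-space hand joints
    intrinsics: np.ndarray    # (3, 3) pinhole camera matrix
    label: GestureLabel


def project(points_3d: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Pinhole projection of camera-space 3D joints to 2D pixels."""
    uvw = points_3d @ K.T
    return uvw[:, :2] / uvw[:, 2:3]


if __name__ == "__main__":
    K = np.array([[600.0, 0.0, 320.0],
                  [0.0, 600.0, 240.0],
                  [0.0, 0.0, 1.0]])
    joints = np.random.rand(21, 3) + np.array([0.0, 0.0, 0.5])  # z > 0
    sample = FrameSample(
        rgb=np.zeros((480, 640, 3), np.uint8),
        depth=np.full((480, 640), 0.5, np.float32),
        keypoints_2d=project(joints, K),
        keypoints_3d=joints,
        intrinsics=K,
        label=GestureLabel("swipe_left", "pinch", "swipe left while pinching"),
    )
    print(sample.label.category, sample.keypoints_2d.shape)
```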
Related Material

[pdf] [supp] [bibtex]

@InProceedings{Zhang_2025_ICCV,
  author    = {Zhang, Menghe and So, Haley M. and Asadi, Mohammad and Zhao, Dongfang and Liang, Yangwen and Wang, Shuangquan and Wetzstein, Gordon and Song, Kee-Bong and Kim, Donghoon},
  title     = {Synthetic Hands Meet Legacy Data: A Synthetic Dataset for Structured, Controllable, and Multimodal Evaluation},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
  month     = {October},
  year      = {2025},
  pages     = {466-477}
}