3D Human Pose Estimation With Two-Step Mixed-Training Strategy
In monocular 3D human pose estimation, target motions are generally smooth and continuous, which suggests that joint velocity can provide valuable cues for more accurate estimation. It is therefore important to learn joint motion trajectories and spatio-temporal information from velocity. Previous work has shown that Transformers are effective at capturing relationships between tokens. In practice, however, only 2D positions are available as input, and 3D velocity has not been explicitly used as a model input. To address this challenge, we propose TMT (Two-step Mixed-Training strategy), a Transformer-based approach that effectively incorporates 3D velocity into the input vector during training, allowing relevant features to be learned better in the shallow layers. Extensive experiments demonstrate that TMT significantly improves the performance of state-of-the-art models such as MixSTE, MHFormer, and PoseFormer on two datasets: Human3.6M and MPI-INF-3DHP. TMT outperforms the state-of-the-art approach by up to 13.8% on the Human3.6M dataset.
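The velocity signal described above can be sketched as a simple finite-difference feature computed from a joint sequence. This is a minimal illustration, assuming velocities are concatenated with positions along the channel dimension; the function name and tensor layout are hypothetical, not the paper's actual implementation:

```python
import numpy as np

def add_velocity_features(joints):
    """Append finite-difference velocities to a joint sequence.

    joints: array of shape (T, J, C) -- T frames, J joints, C coordinates.
    Returns shape (T, J, 2*C): positions concatenated with per-frame velocities.
    """
    # First-order difference between consecutive frames;
    # the first frame has no predecessor, so its velocity is zero.
    vel = np.zeros_like(joints)
    vel[1:] = joints[1:] - joints[:-1]
    return np.concatenate([joints, vel], axis=-1)

# Example: a 9-frame sequence of 17 joints in 3D
seq = np.random.randn(9, 17, 3)
feats = add_velocity_features(seq)
print(feats.shape)  # (9, 17, 6)
```

In this sketch the velocity channels double the per-joint feature size, so the Transformer's input projection would simply take a wider vector; how TMT mixes 2D-position and 3D-velocity inputs across its two training steps is specified in the paper itself.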