Frame-by-frame evaluation results of Physically Plausible Animation of Human Upper Body from a Single Image (Ours), Controllable Video Generation with Sparse Trajectories (Hao et al.), 3D simulation, and Motion Reconstruction Code and Data for Skills from Videos (SFV). For each sample, the top-left frame is the ground-truth video. All methods aim to generate a video in which the person's wrists move along the ground-truth wrist trajectories. The remaining videos in the top row are the synthesized videos generated by our method, Hao et al., 3D simulation, and SFV, respectively. The bottom-left video overlays the poses predicted by our method on our result. The other bottom videos show the per-pixel PSNR values of the corresponding method in the top row as heatmaps. For PSNR, higher is better; for LPIPS, lower is better. Our results generally look more plausible and sharper than the baselines', as reflected by LPIPS. However, the misalignment of the head and arms in our results leads to a worse PSNR score than the blurry results of Hao et al.'s method. Note that SFV requires the whole ground-truth video as input for training, whereas the other methods take only the first frame and the 2D wrist trajectories of the ground-truth video as input.
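As a reference for how the per-pixel PSNR heatmaps and the frame-level PSNR/LPIPS scores can be computed, the sketch below shows one possible implementation. It assumes uint8 RGB frames with a peak value of 255 and uses the public lpips package with its AlexNet backbone; the helper names (psnr_heatmap, frame_psnr, frame_lpips) are illustrative and not taken from our released code.

```python
# Minimal sketch of the metrics in this figure: per-pixel PSNR heatmaps and
# frame-level PSNR/LPIPS. Assumes uint8 RGB frames (peak value 255); helper
# names are illustrative, not from the released code.
import numpy as np
import torch
import lpips  # pip install lpips

def psnr_heatmap(pred: np.ndarray, gt: np.ndarray, peak: float = 255.0) -> np.ndarray:
    """Per-pixel PSNR (dB), averaging squared error over the color channels."""
    se = np.mean((pred.astype(np.float64) - gt.astype(np.float64)) ** 2, axis=-1)
    return 10.0 * np.log10(peak ** 2 / np.maximum(se, 1e-10))

def frame_psnr(pred: np.ndarray, gt: np.ndarray, peak: float = 255.0) -> float:
    """Standard frame-level PSNR (dB) over all pixels and channels."""
    mse = np.mean((pred.astype(np.float64) - gt.astype(np.float64)) ** 2)
    return 10.0 * np.log10(peak ** 2 / max(mse, 1e-10))

_lpips_net = lpips.LPIPS(net='alex')  # AlexNet-based LPIPS distance

def frame_lpips(pred: np.ndarray, gt: np.ndarray) -> float:
    """LPIPS distance; the lpips package expects NCHW tensors scaled to [-1, 1]."""
    def to_tensor(img: np.ndarray) -> torch.Tensor:
        t = torch.from_numpy(img.astype(np.float32) / 127.5 - 1.0)
        return t.permute(2, 0, 1).unsqueeze(0)
    with torch.no_grad():
        return _lpips_net(to_tensor(pred), to_tensor(gt)).item()
```

Averaging frame_psnr and frame_lpips over all frames of a clip gives the per-video numbers reported alongside the heatmaps.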