Stereo Matching in Time: 100+ FPS Video Stereo Matching for Extended Reality
Real-time stereo matching is a cornerstone task for Extended Reality (XR) applications such as 3D scene understanding, video pass-through, and mixed-reality games. Despite significant advances, obtaining accurate depth information in real time on a low-power mobile device remains a challenge. One of the main difficulties is the lack of high-quality indoor video stereo data captured by head-mounted VR or AR glasses. To address this, we introduce a novel synthetic video stereo dataset that comprises photorealistic renderings of various indoor scenes and realistic camera motion captured by a moving VR/AR head-mounted display (HMD). This dataset enables the development of a framework for continuous video-rate stereo matching. As a second contribution, we propose a new video-based stereo matching approach tailored for XR applications, which achieves real-time inference at 134 fps on a standard desktop computer, or 30 fps on a battery-powered HMD. Our key insight is that disparity and contextual information are highly correlated and redundant between consecutive stereo frames. By unrolling an iterative cost aggregation in time (i.e., along the temporal dimension), we distribute and reuse the aggregated features across frames. This leads to a substantial reduction in computation without sacrificing accuracy. We conduct extensive evaluations and demonstrate that our method achieves superior performance compared to the current state of the art, making it a strong contender for real-time stereo matching in VR/AR applications.
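The idea of unrolling an iterative refinement in time can be illustrated with a toy sketch. This is not the proposed network: the `refine` step below is a generic stand-in for one cost-aggregation iteration, and the signal, step size, and all function names are hypothetical choices made only for illustration. The sketch compares a per-frame baseline that restarts and runs several iterations on every frame with a temporally unrolled variant that runs a single iteration per frame, warm-started from the previous frame's estimate.

```python
def refine(estimate, observation, step=0.5):
    """One generic refinement iteration (stand-in for a cost-aggregation step)."""
    return [e + step * (o - e) for e, o in zip(estimate, observation)]


def per_frame(frames, iters):
    """Baseline: restart from zero on every frame and run `iters` refinements."""
    outputs = []
    for obs in frames:
        est = [0.0] * len(obs)
        for _ in range(iters):
            est = refine(est, obs)
        outputs.append(est)
    return outputs, iters * len(frames)  # total refinement steps spent


def unrolled_in_time(frames):
    """Temporal unrolling: one refinement per frame, reusing the previous state."""
    outputs = []
    est = [0.0] * len(frames[0])
    for obs in frames:
        est = refine(est, obs)  # warm start from the previous frame's estimate
        outputs.append(est)
    return outputs, len(frames)  # total refinement steps spent


def frame_error(estimate, observation):
    """Mean absolute error for a single frame."""
    return sum(abs(e - o) for e, o in zip(estimate, observation)) / len(observation)


# Hypothetical slowly varying "disparity" signal: 4 pixels over 60 frames.
frames = [[10.0 + 0.05 * t + p for p in range(4)] for t in range(60)]

base_out, base_cost = per_frame(frames, iters=8)
temp_out, temp_cost = unrolled_in_time(frames)

print("refinement steps:", base_cost, "vs", temp_cost)
print("last-frame error (baseline):", frame_error(base_out[-1], frames[-1]))
print("last-frame error (unrolled):", frame_error(temp_out[-1], frames[-1]))
```

In this toy setting the unrolled variant spends one eighth of the refinement steps and, after a short warm-up over the first frames, reaches an error comparable to the baseline, because consecutive frames change only slightly. The first frames are noticeably less accurate, which is the price of distributing the iterations over time.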