Temporally Consistent Semantic Segmentation Using Spatially Aware Multi-view Semantic Fusion for Indoor RGB-D Videos

Fengyuan Sun, Sezer Karaoglu, Theo Gevers; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2023, pp. 4248-4257

Abstract


Image semantic segmentation struggles to produce consistent and robust results across a sequence of video frames. The problem is especially prominent in indoor scenes, where small camera movements can cause drastic appearance changes, occlusions, and loss of global context. To overcome these challenges, this paper proposes a novel approach that combines multi-view semantic fusion with spatial reasoning to produce view-invariant semantic features for temporally consistent semantic segmentation of indoor RGB-D videos. Experiments on the ScanNet dataset show that the proposed spatially aware multi-view fusion mechanism significantly improves the state-of-the-art image semantic segmentation methods Mask2Former and ViT-Adapter. In particular, the proposed pipeline improves on Mask2Former by 5%, 9.9%, and 14.4% in 2D mIoU, cross-view consistency, and temporal consistency, respectively, and on ViT-Adapter by 4.8%, 8.9%, and 10.9% in the same metrics.
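
To make the fusion idea concrete, the following is a minimal Python/NumPy sketch of one way multi-view semantic fusion for RGB-D frames can be realised: per-frame class logits are lifted into world coordinates using the depth map and camera pose, observations falling into the same 3D voxel are averaged, and the fused (view-invariant) logits are written back to every frame. The function names, the voxel-averaging scheme, and the omission of invalid-depth handling are illustrative assumptions, not the authors' implementation.

import numpy as np

def unproject(depth, K, cam_to_world):
    """Lift a depth map (H, W) to world-space points (H*W, 3).

    K is the 3x3 camera intrinsics matrix; cam_to_world a 4x4 pose.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    z = depth.reshape(-1)
    x = (u.reshape(-1) - K[0, 2]) * z / K[0, 0]
    y = (v.reshape(-1) - K[1, 2]) * z / K[1, 1]
    pts = np.stack([x, y, z, np.ones_like(z)], axis=1)  # homogeneous coords
    return (cam_to_world @ pts.T).T[:, :3]

def fuse_views(logits, depths, K, poses, voxel=0.05):
    """Average per-view class logits that land in the same 3D voxel.

    logits: list of (H, W, C) arrays, one per frame
    depths: list of (H, W) metric depth maps
    poses:  list of (4, 4) camera-to-world matrices
    Returns per-frame logits of the same shapes, now view-consistent.
    """
    C = logits[0].shape[-1]
    keys, feats = [], []
    for lg, d, T in zip(logits, depths, poses):
        keys.append(np.floor(unproject(d, K, T) / voxel).astype(np.int64))
        feats.append(lg.reshape(-1, C))
    keys = np.concatenate(keys)
    feats = np.concatenate(feats)
    # Observations of the same surface point seen from different views
    # share one voxel key, hence one accumulator row.
    uniq, inv = np.unique(keys, axis=0, return_inverse=True)
    inv = inv.ravel()
    summed = np.zeros((len(uniq), C))
    np.add.at(summed, inv, feats)  # unbuffered scatter-add of logit rows
    fused = summed / np.bincount(inv)[:, None]
    # Write the fused (view-invariant) logits back to every frame.
    out, start = [], 0
    for lg in logits:
        n = lg.shape[0] * lg.shape[1]
        out.append(fused[inv[start:start + n]].reshape(lg.shape))
        start += n
    return out

Taking the argmax of the fused logits then yields labels that agree across overlapping views by construction; the paper's actual pipeline additionally performs spatial reasoning over the scene, which this sketch omits.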
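
For reference, the 2D mIoU figure quoted above is the standard metric: per-class intersection-over-union averaged over the classes present. A minimal sketch (the cross-view and temporal consistency metrics are defined in the paper itself and are not reproduced here):

def mean_iou(pred, gt, num_classes):
    """Mean IoU between predicted and ground-truth label maps."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))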

Related Material


[bibtex]
@InProceedings{Sun_2023_ICCV,
  author    = {Sun, Fengyuan and Karaoglu, Sezer and Gevers, Theo},
  title     = {Temporally Consistent Semantic Segmentation Using Spatially Aware Multi-view Semantic Fusion for Indoor RGB-D Videos},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
  month     = {October},
  year      = {2023},
  pages     = {4248-4257}
}