Three-Stage Bidirectional Interaction Network for Efficient RGB-D Salient Object Detection

Yang Wang, Yanqing Zhang; Proceedings of the Asian Conference on Computer Vision (ACCV), 2022, pp. 3672-3689

Abstract


The addition of depth maps improves the performance of salient object detection (SOD). However, most existing RGB-D SOD methods are inefficient. We observe that existing models account for the respective advantages of the two modalities but do not fully explore the roles of cross-modality features at different levels. To this end, we remodel the relationship between RGB features and depth features from a new perspective, that of the feature encoding stage, and propose a three-stage bidirectional interaction network (TBINet). Specifically, to obtain robust feature representations, we propose three interaction strategies: bidirectional attention guidance (BAG), bidirectional feature supplement (BFS), and a shared network, and apply them to the three stages of the feature encoder, respectively. In addition, we propose a cross-modality feature aggregation (CFA) module for feature aggregation and refinement. Our model is lightweight (3.7 M parameters) and fast (329 ms on CPU). Experiments on six benchmark datasets show that TBINet outperforms other SOTA methods. Our model achieves the best trade-off between performance and efficiency.
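The abstract does not give the formulation of the bidirectional attention guidance (BAG) strategy, but the general idea of two modalities gating each other with cross-modality attention can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the spatial-attention form (channel-mean plus sigmoid), the residual connection, and all function names are assumptions.

```python
import numpy as np

def spatial_attention(feat):
    """Collapse channels of a (C, H, W) feature into a (1, H, W) gate in (0, 1)."""
    pooled = feat.mean(axis=0, keepdims=True)   # (1, H, W)
    return 1.0 / (1.0 + np.exp(-pooled))        # sigmoid gate

def bidirectional_attention_guidance(rgb, depth):
    """Each modality's attention map reweights the other modality's features.

    A residual term keeps the original features, so the guidance only
    emphasizes cross-modality-salient regions rather than replacing content.
    """
    rgb_out = rgb + rgb * spatial_attention(depth)
    depth_out = depth + depth * spatial_attention(rgb)
    return rgb_out, depth_out

# Toy features: 16 channels on an 8x8 grid for each modality.
rgb = np.random.rand(16, 8, 8)
depth = np.random.rand(16, 8, 8)
rgb_g, depth_g = bidirectional_attention_guidance(rgb, depth)
print(rgb_g.shape, depth_g.shape)  # (16, 8, 8) (16, 8, 8)
```

In a real encoder, such an exchange would typically be applied per stage between the RGB and depth backbone features, with learned (e.g. convolutional) attention in place of the parameter-free channel mean used here.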

Related Material


[pdf] [supp] [code]
[bibtex]
@InProceedings{Wang_2022_ACCV,
    author    = {Wang, Yang and Zhang, Yanqing},
    title     = {Three-Stage Bidirectional Interaction Network for Efficient RGB-D Salient Object Detection},
    booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV)},
    month     = {December},
    year      = {2022},
    pages     = {3672-3689}
}