Grassmann Pooling as Compact Homogeneous Bilinear Pooling for Fine-Grained Visual Classification

Xing Wei, Yue Zhang, Yihong Gong, Jiawei Zhang, Nanning Zheng; The European Conference on Computer Vision (ECCV), 2018, pp. 355-370


Designing discriminative and invariant features is the key to visual recognition. Recently, the bilinear pooled feature matrix of Convolutional Neural Network (CNN) has shown to achieve state-of-the-art performance on a range of fine-grained visual recognition tasks. The bilinear feature matrix collects second-order statistics and is closely related to the covariance matrix descriptor. However, the bilinear feature could suffer from the visual burstiness phenomenon similar to other visual representations such as VLAD and Fisher Vector. The reason is that the bilinear feature matrix is sensitive to the magnitudes and correlations of local CNN feature elements which can be measured by its singular values. On the other hand, the singular vectors are more invariant and reasonable to be adopted as the feature representation. Motivated by this point, we advocate an alternative pooling method which transforms the CNN feature matrix to an orthonormal matrix consists of its principal singular vectors. Geometrically, such orthonormal matrix lies on the Grassmann manifold, a Riemannian manifold whose points represent subspaces of the Euclidean space. Similarity measurement of images reduces to comparing the principal angles between these ``homogeneous" subspaces and thus is independent of the magnitudes and correlations of local CNN activations. In particular, we demonstrate that the projection distance on the Grassmann manifold deduces a bilinear feature mapping without explicitly computing the bilinear feature matrix, which enables a very compact feature and classifier representation. Experimental results show that our method achieves an excellent balance of model complexity and accuracy on a variety of fine-grained image classification datasets.

Related Material

author = {Wei, Xing and Zhang, Yue and Gong, Yihong and Zhang, Jiawei and Zheng, Nanning},
title = {Grassmann Pooling as Compact Homogeneous Bilinear Pooling for Fine-Grained Visual Classification},
booktitle = {The European Conference on Computer Vision (ECCV)},
month = {September},
year = {2018}