Visual and Textual Deep Feature Fusion for Document Image Classification

Souhail Bakkali, Zuheng Ming, Mickael Coustaty, Marcal Rusinol; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2020, pp. 562-563

Abstract


The topic of text document image classification has been explored extensively over the past few years. Most recent approaches handle this task by jointly learning the visual features of document images and their corresponding textual contents. Because document images vary widely in structure, extracting semantic information from their textual content benefits document image processing tasks such as document retrieval, information extraction, and text classification. In this work, a two-stream neural architecture is proposed to perform the document image classification task. We conduct an exhaustive investigation of widely used neural networks and word-embedding methods as backbones for extracting both visual and textual features from document images. Moreover, a joint feature learning approach that combines image features and text embeddings is introduced as a late fusion methodology. Both the theoretical analysis and the experimental results demonstrate the superiority of the proposed joint feature learning method over each single modality. This joint learning approach outperforms the state of the art, reaching a classification accuracy of 97.05% on the large-scale RVL-CDIP dataset.
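To make the late-fusion idea concrete, the sketch below concatenates a visual feature vector (e.g. pooled CNN activations) with a textual feature vector (e.g. averaged word embeddings) and applies a linear softmax classifier over the 16 RVL-CDIP classes. This is a minimal illustration, not the authors' exact architecture: the feature dimensions, the linear classifier, and all variable names are assumptions.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def late_fusion_logits(visual_feat, text_feat, W, b):
    """Late fusion: concatenate modality features, then classify jointly."""
    fused = np.concatenate([visual_feat, text_feat], axis=-1)
    return fused @ W + b

rng = np.random.default_rng(0)
n_classes = 16                       # RVL-CDIP has 16 document categories
visual = rng.normal(size=(2, 512))   # stand-in for CNN image features (dim assumed)
text = rng.normal(size=(2, 300))     # stand-in for text embeddings (dim assumed)
W = rng.normal(size=(512 + 300, n_classes)) * 0.01
b = np.zeros(n_classes)

probs = softmax(late_fusion_logits(visual, text, W, b))
preds = probs.argmax(axis=-1)        # predicted class index per document
```

In the paper's setting, the random stand-in features would be replaced by the outputs of the visual and textual backbone networks, and the fusion layer would be trained end to end with them.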

Related Material


[pdf]
[bibtex]
@InProceedings{Bakkali_2020_CVPR_Workshops,
author = {Bakkali, Souhail and Ming, Zuheng and Coustaty, Mickael and Rusinol, Marcal},
title = {Visual and Textual Deep Feature Fusion for Document Image Classification},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
month = {June},
year = {2020}
}