Visual Relationship Detection With Internal and External Linguistic Knowledge Distillation

Ruichi Yu, Ang Li, Vlad I. Morariu, Larry S. Davis; The IEEE International Conference on Computer Vision (ICCV), 2017, pp. 1974-1982


Understanding the visual relationship between two objects involves identifying the subject, the object, and a predicate relating them. We leverage the strong correlations between the predicate and the (subj, obj) pair (both semantically and spatially) to predict predicates conditioned on the subjects and the objects. Modeling the three entities jointly reflects their relationships more accurately than modeling them independently, but it complicates learning because the semantic space of visual relationships is huge and training data is limited, especially for long-tail relationships that have few instances. To overcome this, we use knowledge of linguistic statistics to regularize visual model learning. We obtain linguistic knowledge by mining both training annotations (internal knowledge) and publicly available text, e.g., Wikipedia (external knowledge), and computing the conditional probability distribution of a predicate given a (subj, obj) pair. As we train the visual model, we distill this knowledge into the deep model to achieve better generalization. Our experimental results on the Visual Relationship Detection (VRD) and Visual Genome datasets suggest that with this linguistic knowledge distillation, our model significantly outperforms the state-of-the-art methods, especially when predicting unseen relationships (e.g., recall improved from 8.45% to 19.17% on the VRD zero-shot testing set).
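
The linguistic prior described in the abstract, a conditional distribution of the predicate given a (subj, obj) pair, can be illustrated with a small counting sketch. This is not the authors' implementation: the function names `predicate_distribution` and `kl_distillation`, the additive smoothing constant, the toy triples, and the use of a KL-divergence term as the distillation penalty are all assumptions made for illustration, since the abstract does not specify the exact loss.

```python
from collections import Counter, defaultdict
import math


def predicate_distribution(triples, predicates, smoothing=1e-3):
    """Estimate P(pred | subj, obj) from (subj, pred, obj) triples.

    Additive smoothing (an assumption of this sketch) keeps every
    predicate at a non-zero probability for each observed pair.
    """
    counts = defaultdict(Counter)
    for subj, pred, obj in triples:
        counts[(subj, obj)][pred] += 1

    dist = {}
    for pair, pred_counts in counts.items():
        total = sum(pred_counts.values()) + smoothing * len(predicates)
        dist[pair] = {p: (pred_counts[p] + smoothing) / total for p in predicates}
    return dist


def kl_distillation(teacher, student, eps=1e-12):
    """KL(teacher || student): one possible penalty for pulling a visual
    model's predicate distribution toward the linguistic prior."""
    loss = 0.0
    for pred, t in teacher.items():
        if t > 0:
            s = max(student.get(pred, 0.0), eps)
            loss += t * math.log(t / s)
    return loss


# Toy annotations (hypothetical, not taken from VRD or Visual Genome).
triples = [
    ("person", "ride", "horse"),
    ("person", "ride", "horse"),
    ("person", "next to", "horse"),
    ("person", "wear", "hat"),
]
predicates = ["ride", "next to", "wear"]

prior = predicate_distribution(triples, predicates)
uniform_student = {p: 1.0 / len(predicates) for p in predicates}
penalty = kl_distillation(prior[("person", "horse")], uniform_student)
```

In this sketch the penalty is zero when the student matches the prior exactly and grows as the two distributions diverge, so adding it to a standard classification loss would bias predictions for rare or unseen pairs toward linguistically plausible predicates.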

Related Material

author = {Yu, Ruichi and Li, Ang and Morariu, Vlad I. and Davis, Larry S.},
title = {Visual Relationship Detection With Internal and External Linguistic Knowledge Distillation},
booktitle = {The IEEE International Conference on Computer Vision (ICCV)},
month = {Oct},
year = {2017}