A Unified Model for Face Matching and Presentation Attack Detection Using an Ensemble of Vision Transformer Features
A typical automated face recognition system is composed of three main component tasks: face detection and alignment (FDA), face presentation attack detection (FPAD), and face representation and matching (FRM). These tasks are often treated as standalone problems, and deep neural network (DNN)-based solutions have been proposed to address them individually. However, in resource-constrained scenarios it would be ideal to have a unified DNN model that can perform all three tasks together. As a first step towards realizing this goal, this work attempts to perform joint FRM and FPAD based on a single Vision Transformer (ViT) backbone. Recent work demonstrating the ability of ViTs to extract a diverse set of feature representations gives rise to the tantalizing possibility of building an end-to-end face recognition system using a single ViT model. The standard approach for designing multi-task DNNs is to implement different classification heads (e.g., for FRM and FPAD) on top of a common stem/base and to learn these heads either individually or jointly. A key contribution of this work is to demonstrate that this naive multi-head approach results in sub-optimal performance for either FRM or FPAD, because the features required by these tasks are very different. While good FPAD performance depends on accurately characterizing micro-textures, face matching demands attention to more global characteristics. Hence, we propose a novel feature ensemble approach, where an ensemble of local features extracted from the intermediate blocks of a ViT is utilized for FPAD, while face matching is performed based on the ViT class token. Experiments demonstrate that the proposed ViT feature ensemble approach achieves good performance for both face matching and FPAD compared to the multi-head approach.
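The feature-routing idea described above can be illustrated with a minimal toy sketch: a stand-in "ViT" built from random linear blocks, where the final class token feeds the face-matching (FRM) branch and mean-pooled patch tokens from every intermediate block are concatenated into the ensemble feature for the FPAD branch. All shapes, names, and the toy block function are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a ViT: a stack of random linear "blocks" acting on a
# token sequence (1 class token + N patch tokens). Dimensions are arbitrary.
DIM, N_PATCHES, N_BLOCKS = 64, 16, 6
blocks = [rng.standard_normal((DIM, DIM)) / np.sqrt(DIM) for _ in range(N_BLOCKS)]

def vit_forward(tokens):
    """Run tokens through all blocks, keeping each block's output."""
    intermediates = []
    for W in blocks:
        tokens = np.tanh(tokens @ W)  # toy block; a real ViT uses attention + MLP
        intermediates.append(tokens)
    return tokens, intermediates

tokens = rng.standard_normal((1 + N_PATCHES, DIM))  # [CLS] token + patch tokens
final_tokens, intermediates = vit_forward(tokens)

# FRM branch: identity embedding taken from the final class token.
face_embedding = final_tokens[0]

# FPAD branch: ensemble of local (patch-token) features pooled from the
# intermediate blocks, intended to capture micro-texture cues.
local_feats = [blk[1:].mean(axis=0) for blk in intermediates]  # mean-pool patches
pad_feature = np.concatenate(local_feats)  # ensemble vector for the PAD head

print(face_embedding.shape)  # (64,)
print(pad_feature.shape)     # (384,)
```

The key design point is that the two heads read different features from the same frozen backbone, rather than sharing one pooled representation as in the naive multi-head setup.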