Group Leakage Overestimates Performance: A Case Study in Keystroke Dynamics
Keystroke dynamics is a powerful behavioral biometric that can authenticate users based on their typing patterns. As larger keystroke datasets become available, machine learning and deep learning methods are increasingly popular. Because not every possible impostor can be known at training time, keystroke dynamics is inherently an open set recognition problem. Treating it as a closed set problem (assuming samples from all impostors are present during training) can introduce data leakage, a common pitfall in machine learning that yields unrealistically optimistic performance estimates compared to real-world deployment. In this paper, we outline open set recognition and discuss how, if not handled properly, it leads to data leakage. We evaluate common machine learning methods, such as SVMs and MLPs, with and without leakage to clearly demonstrate the difference in performance. A synthetic dataset and a publicly available fixed-text keystroke dynamics dataset are used for research transparency and reproducibility.
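The leakage the abstract describes can be made concrete with a small sketch (all names and data here are illustrative, not from the paper): a naive random split lets samples from the same impostor land in both train and test, whereas a group-aware split keeps every user's samples on one side, matching the open-set setting.

```python
import numpy as np
from sklearn.model_selection import train_test_split, GroupShuffleSplit

rng = np.random.default_rng(0)

# Hypothetical keystroke data: 10 samples from each of 8 typists.
# Samples from one typist share a "group" label (the user's identity).
n_users, n_per_user = 8, 10
groups = np.repeat(np.arange(n_users), n_per_user)
X = rng.normal(size=(n_users * n_per_user, 4))  # e.g. hold/flight-time features
y = (groups == 0).astype(int)                   # user 0 genuine, rest impostors

# Naive (closed-set) split: the same impostor's samples can appear in
# both train and test, leaking that user's typing style to the model.
Xtr, Xte, gtr, gte = train_test_split(X, groups, test_size=0.3, random_state=0)
leaked = np.intersect1d(gtr, gte)
print("users in both splits (naive):", leaked)   # typically non-empty

# Group-aware (open-set) split: each user is wholly train or wholly test,
# so test-time impostors are genuinely unseen during training.
gss = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(gss.split(X, y, groups))
overlap = np.intersect1d(groups[train_idx], groups[test_idx])
print("users in both splits (grouped):", overlap)  # guaranteed empty
```

The same idea extends to cross-validation via `GroupKFold`, which is the usual way to avoid this overestimation when benchmarking classifiers on multi-user biometric data.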