Confusing Large Models by Confusing Small Models
Despite a steady growth in average accuracy, computer vision models continue to fail on many robustness benchmarks. In this paper, we take a step back from standard benchmarks and focus on how models perceive data, and which aspects of the data they find confusing. Using an ensemble-based confusion score built on top of simple calibrations we examine how the training and test samples appear simple or confusing to a given model. Based on these heuristics, we demonstrate an application of the confusion score in identifying images that appear confusing to the trained model, and show that these images are highly likely to be misclassified by the model. We further demonstrate how confusion carries over to models of various sizes and architectures, which gives rise to the possibility of identifying challenging images via ensembles of small networks to produce a custom benchmark of challenging data, that remains appropriate for large models where ensembling is costly to implement. Finally, we demonstrate how training via upsampling on confusing images can improve accuracy on the hard subset.