- [pdf] [supp]
Investigating CLIP Performance for Meta-Data Generation in AD Datasets
Using Machine Learning (ML) models for safety-critical perception tasks in Autonomous Driving (AD) or other domains requires a thorough evaluation of the model performance and the data coverage w.r.t. the intended Operational Design Domain (ODD). However, obtaining the needed per-image semantic meta-data along the relevant dimensions of the ODD for real-world image datasets is non-trivial. Recent advances in self-supervised foundation models, specifically CLIP, suggest that such meta-data could be obtained for real-world images in an automated fashion using zero-shot classification. While CLIP was already reported to achieve promising performance on tasks such as the recognition of gender or age on facial images, we investigate to which extent less prominent and more fine-grained observables, e.g., presence of accessories such as spectacles or the shirt- or hair-color, can be determined. We provide an analysis of CLIP for generating fine-grained meta-data on three datasets from the AD domain, one of synthetic origin including ground truth, the others being Cityscapes and Railsem19. We also compare with a standard facial dataset where more elaborate attribute annotations are present. To improve the quality of generated meta-data, we additionally extend the ensemble approach of CLIP by a simple noise-suppressing technique.