@InProceedings{Wang_2026_CVPR,
    author    = {Wang, Zijiao and Li, Qiang and Hu, Zheng and Fu, Gang and Zhang, Li},
    title     = {Zero-Shot X-ray Security Image Synthesis Via Cross-modal Diffusion},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2026},
    pages     = {7210-7219}
}
Zero-Shot X-ray Security Image Synthesis Via Cross-modal Diffusion
Abstract
Synthesizing backscatter X-ray images from transmission scans is critical for enhancing security inspection capabilities, yet it remains challenging due to the scarcity of paired data and the inherent imaging discrepancies between modalities. Existing controllable generation methods often struggle with structural misalignment and poor generalization to unseen cargo categories. In this paper, we propose a Zero-shot Cross-modal Diffusion Framework that leverages pre-trained unsupervised visual encoders to synthesize high-fidelity backscatter images for unseen cargoes using only limited paired data from known categories. Our method accommodates structural deviations without relying on strict geometric alignment. Central to our approach is the Adaptive Multi-level Feature Adaption (AMFA) module, which addresses the representation bottleneck that pre-trained encoders exhibit on grayscale security imagery. By integrating cross-modal regularization and a self-adaptive gating mechanism, AMFA dynamically selects the most informative hierarchical features, reducing inter-layer interference and enhancing feature discriminability. Extensive experiments demonstrate that our method substantially outperforms existing approaches in generation fidelity and cargo separability. Notably, our synthesized features achieve superior alignment with real feature distributions, as evidenced by gains in the feature similarity metric, while refining clustering boundaries in the feature space.