@InProceedings{Zhou_2026_CVPR,
    author    = {Zhou, Xuewei and Meng, Yajie and Zeng, Pan and Tang, Xianfang and Cui, Feifei and Jin, Qiangguo and Yang, Jialiang and Xu, Junlin},
    title     = {TAMER: A Tri-Modal Contrastive Alignment and Multi-Scale Embedding Refinement Framework for Zero-Shot ECG Diagnosis},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {10502-10511}
}
TAMER: A Tri-Modal Contrastive Alignment and Multi-Scale Embedding Refinement Framework for Zero-Shot ECG Diagnosis
Abstract
Cardiovascular disease (CVD) diagnosis relies heavily on electrocardiograms (ECGs). However, most existing self-supervised uni-modal methods suffer from limited representational capacity, while multi-modal frameworks are hindered by coarse-grained semantic alignment across modalities, which restricts their generalizability in clinical settings. To address these limitations, we propose TAMER, a Tri-modal contrastive Alignment and Multi-scale Embedding Refinement framework that jointly models ECG recordings, spectrograms, and diagnostic reports. TAMER comprises three key components. First, the tri-modal feature encoding and projection (TFEP) module employs modality-specific encoders to extract global and local features from ECG recordings, spectrograms, and diagnostic reports, and projects them into latent spaces. Then, the global-local temporal-spectral alignment (GLTSA) module captures complementary rhythm- and wave-level characteristics via contrastive alignment and attentive interaction between the temporal and spectral modalities. Finally, the report-aware alignment and refinement (RAAR) module performs diagnostic-level alignment and wave-level refinement with clinical reports, enabling semantic enrichment of ECG representations. Extensive experiments on three public ECG datasets demonstrate that TAMER achieves state-of-the-art zero-shot classification performance (AUC: 81.2%) and strong cross-domain generalization (AUC: 83.1%), outperforming existing uni-modal and multi-modal baseline methods. The source code is available at https://github.com/zhouxw12345/TAMER.
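The abstract does not specify the exact form of the contrastive alignment objective. As a rough illustration only, the sketch below shows a generic symmetric InfoNCE loss between two sets of paired embeddings (for example, ECG and report embeddings), which is a common choice for this kind of cross-modal alignment. The function names, the temperature value of 0.07, and the use of plain Python lists are assumptions made for illustration and are not taken from the TAMER code.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_row(logits, target):
    """Cross-entropy of one row of logits against a target index (stable log-sum-exp)."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]


def symmetric_info_nce(emb_a, emb_b, temperature=0.07):
    """Generic symmetric InfoNCE loss; row i of emb_a is paired with row i of emb_b.

    Hypothetical helper for illustration, not the authors' implementation.
    """
    a = [l2_normalize(v) for v in emb_a]
    b = [l2_normalize(v) for v in emb_b]
    n = len(a)
    # Temperature-scaled cosine similarity between every a_i and b_j.
    sim = [
        [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Direction a -> b: each row should select its own column.
    loss_ab = sum(cross_entropy_row(sim[i], i) for i in range(n)) / n
    # Direction b -> a: each column should select its own row.
    loss_ba = sum(
        cross_entropy_row([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (loss_ab + loss_ba)


# Toy usage: correctly paired embeddings should give a lower loss than swapped ones.
ecg = [[1.0, 0.0], [0.0, 1.0]]
reports_aligned = [[1.0, 0.0], [0.0, 1.0]]
reports_swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = symmetric_info_nce(ecg, reports_aligned)
loss_swapped = symmetric_info_nce(ecg, reports_swapped)
```

In this toy case the aligned pairs yield a loss close to zero, while the swapped pairs are penalized heavily. Minimizing such a loss pulls matched embeddings together and pushes mismatched ones apart.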

