# SAMPLE: Semantic Alignment through Temporal-Adaptive Multimodal Prompt

Open-vocabulary action recognition (OVAR) extends recognition systems to identify unseen action categories. While large-scale vision-language models (VLMs) like CLIP have enabled OVAR in image domains, their adaptation to event data remains underexplored. Event cameras offer high temporal resolution and inherent privacy preservation, making them suitable for capturing fine-grained motion dynamics. However, leveraging event data for OVAR presents challenges: 1) bridging the domain gap between static image-based models and event streams, and 2) preserving the generalization capabilities of pretrained VLMs in open-vocabulary settings.
In this paper, we propose SAMPLE, a lightweight adaptation of VLMs for event-based action recognition that balances supervised and open-vocabulary performance. We introduce a *Temporal-Adaptive Multimodal Prompt Learning strategy* with three components: 1) unimodal prompts on both the event and text branches to learn the data distribution, 2) an event-text cross-modal prompt for representation space alignment, and 3) a temporal-adaptive prompt to model temporal dependencies across event data. Extensive evaluations demonstrate that SAMPLE outperforms prior methods across fully supervised, few-shot, base-to-novel, and zero-shot settings. Notably, in zero-shot scenarios, SAMPLE achieves gains of +15.46%, +29.76%, and +23.79% on SeAct, DVS128Gesture, and PAF, respectively, at lower compute cost.
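
For readers new to open-vocabulary recognition, the sketch below illustrates the generic CLIP-style zero-shot step that the abstract builds on: an event embedding is compared against one text embedding per class name, and the closest class wins. It is not SAMPLE's implementation. The function names, the toy 3-dimensional embeddings, and the `temperature` value are all assumptions made for illustration, and the prompt-learning components described above are not modeled.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_predict(event_embedding, text_embeddings, class_names, temperature=0.01):
    """Pick the class whose text embedding is most similar to the event embedding.

    Hypothetical helper: embeddings are plain lists of floats, and
    `temperature` is an assumed value, not one taken from the SAMPLE code.
    """
    event = l2_normalize(event_embedding)
    # Cosine similarity between the event and each class text, scaled by 1/temperature.
    logits = [
        sum(a * b for a, b in zip(event, l2_normalize(text))) / temperature
        for text in text_embeddings
    ]
    # Numerically stable softmax over the class logits.
    max_logit = max(logits)
    exps = [math.exp(logit - max_logit) for logit in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs


if __name__ == "__main__":
    # Toy embeddings standing in for real encoder outputs.
    classes = ["walking", "waving", "jumping"]
    text_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    label, _ = zero_shot_predict([0.9, 0.1, 0.0], text_embeddings, classes)
    print(label)  # walking
```

Because classes enter only through their text embeddings, a category unseen during training can be added by embedding its name, which is what makes the open-vocabulary setting possible.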

<div align="center">
<img src="image/method.png" width="800px">
</div>

## Model Zoo

All models used in the experiments below are based on the publicly available [ViT-B/16 CLIP](https://github.com/mlfoundations/open_clip) model.

#### Fully-supervised Results

|                       Dataset(configs)                       | Model                                                                                     |
| :----------------------------------------------------------: | ----------------------------------------------------------------------------------------- |
|           [HARDVS](configs/HARDVS/HARDVS_train.yaml)           | [Link](https://drive.google.com/drive/folders/1AYspxuJnfGJJlRl93cgW2nifxBozBIEm?usp=sharing) |
|                [PAF](configs/PAF/PAF_train.yaml)                | [Link](https://drive.google.com/drive/folders/1AYspxuJnfGJJlRl93cgW2nifxBozBIEm?usp=sharing) |
| [DVS128Gesture](configs/DVS128Gesture/DVS128Gesture_train.yaml) | [Link](https://drive.google.com/drive/folders/1AYspxuJnfGJJlRl93cgW2nifxBozBIEm?usp=sharing) |
|             [SeAct](configs/SeAct/SeAct_train.yaml)             | [Link](https://drive.google.com/drive/folders/1AYspxuJnfGJJlRl93cgW2nifxBozBIEm?usp=sharing) |

#### Base-to-Novel Results

|                           Dataset(configs)                           | Model                                                                                     |
| :------------------------------------------------------------------: | ----------------------------------------------------------------------------------------- |
|        [HARDVS](configs/base_to_novel/HARDVS_base_to_novel.yaml)        | [Link](https://drive.google.com/drive/folders/1dj4-jHlaWBQ0QOlz7x8tuEV4a5TWowwo?usp=sharing) |
|           [PAF](configs/base_to_novel/PAF_base_to_novel.yaml)           | [Link](https://drive.google.com/drive/folders/1dj4-jHlaWBQ0QOlz7x8tuEV4a5TWowwo?usp=sharing) |
| [DVS128Gesture](configs/base_to_novel/DVS128Gesture_base_to_novel.yaml) | [Link](https://drive.google.com/drive/folders/1dj4-jHlaWBQ0QOlz7x8tuEV4a5TWowwo?usp=sharing) |
|         [SeAct](configs/base_to_novel/SeAct_base_to_novel.yaml)         | [Link](https://drive.google.com/drive/folders/1dj4-jHlaWBQ0QOlz7x8tuEV4a5TWowwo?usp=sharing) |

#### Few-shot Results

|                      Dataset(configs)                      | Model                                                                                     |
| :--------------------------------------------------------: | ----------------------------------------------------------------------------------------- |
|        [HARDVS](configs/few_shot/HARDVS_few_shot.yaml)        | [Link](https://drive.google.com/drive/folders/1yMrVebiqReMCIyvnr9ZBOtbThg0khCo4?usp=sharing) |
|           [PAF](configs/few_shot/PAF_few_shot.yaml)           | [Link](https://drive.google.com/drive/folders/1yMrVebiqReMCIyvnr9ZBOtbThg0khCo4?usp=sharing) |
| [DVS128Gesture](configs/few_shot/DVS128Gesture_few_shot.yaml) | [Link](https://drive.google.com/drive/folders/1yMrVebiqReMCIyvnr9ZBOtbThg0khCo4?usp=sharing) |
|         [SeAct](configs/few_shot/SeAct_few_shot.yaml)         | [Link](https://drive.google.com/drive/folders/1yMrVebiqReMCIyvnr9ZBOtbThg0khCo4?usp=sharing) |

## Installation

1. Set up a conda environment (recommended).
   ```
   # Create a conda environment
   conda create -y -n sample python=3.8
   # Activate the environment
   conda activate sample
   # Install requirements
   pip install -r requirements.txt
   ```
2. Download the **ViT-B-16** CLIP pretrained backbone from this [repository](https://github.com/mlfoundations/open_clip).
3. Prepare the [HARDVS](https://github.com/Event-AHU/HARDVS), [PAF](https://github.com/CrystalMiaoshu/PAFBenchmark), [DVS128Gesture](https://research.ibm.com/publications/a-low-power-fully-event-based-gesture-recognition-system), and [SeAct](https://drive.google.com/file/d/1AO8KGzFT6784kiW2OzAgi0a-jqmzl-x6/view?usp=sharing) datasets according to [DATASET.md](dataset_prepare/DATASET.md).
4. Edit the config YAML according to [CONFIG.md](configs/CONFIG.md).
5. Evaluate the SAMPLE model using the following command:
   ```
   python test.py --config configs/SeAct/SeAct_testing.yaml
   ```
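
Before launching a run, it can help to confirm that the config files are where the commands expect them. The following is a minimal sketch of such a pre-flight check, assuming the config layout listed in the Model Zoo tables. The helper names `expected_configs` and `missing_configs` are hypothetical and not part of this repository, and the script checks only that the files exist, not their contents.

```python
from pathlib import Path

# Dataset names as they appear in the Model Zoo tables.
DATASETS = ("HARDVS", "PAF", "DVS128Gesture", "SeAct")


def expected_configs(dataset):
    """Config paths for one dataset, following the Model Zoo tables."""
    return [
        f"configs/{dataset}/{dataset}_train.yaml",
        f"configs/few_shot/{dataset}_few_shot.yaml",
        f"configs/base_to_novel/{dataset}_base_to_novel.yaml",
    ]


def missing_configs(repo_root, datasets=DATASETS):
    """Return the expected config paths that are not files under repo_root."""
    root = Path(repo_root)
    return [
        rel
        for dataset in datasets
        for rel in expected_configs(dataset)
        if not (root / rel).is_file()
    ]


if __name__ == "__main__":
    # Run from the repository root; prints nothing when all configs are present.
    for rel in missing_configs("."):
        print("missing:", rel)
```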

## Training

```
# fully-supervised
python train.py --config configs/SeAct/SeAct_train.yaml
# few-shot
python train.py --config configs/few_shot/SeAct_few_shot.yaml --few_shot 2
# base-to-novel
python train.py --config configs/base_to_novel/SeAct_base_to_novel.yaml
```
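
To repeat these commands across datasets or shot counts, the command lines can be generated rather than typed by hand. The sketch below only builds and prints them. It assumes the config naming shown in the Model Zoo tables and the `--config` and `--few_shot` flags used above. `build_train_command` is a hypothetical helper, not part of this repository, and the shot counts in the loop are placeholder values to adjust to your experiments.

```python
import shlex

# Config path templates, following the Model Zoo tables.
CONFIG_TEMPLATES = {
    "fully_supervised": "configs/{d}/{d}_train.yaml",
    "few_shot": "configs/few_shot/{d}_few_shot.yaml",
    "base_to_novel": "configs/base_to_novel/{d}_base_to_novel.yaml",
}


def build_train_command(dataset, setting, shots=None, python="python"):
    """Build the argument list for one train.py invocation."""
    if setting not in CONFIG_TEMPLATES:
        raise ValueError(f"unknown setting: {setting!r}")
    config = CONFIG_TEMPLATES[setting].format(d=dataset)
    cmd = [python, "train.py", "--config", config]
    if setting == "few_shot":
        if shots is None:
            raise ValueError("few_shot requires a shot count")
        cmd += ["--few_shot", str(shots)]
    elif shots is not None:
        raise ValueError("shots only applies to the few_shot setting")
    return cmd


if __name__ == "__main__":
    # Placeholder shot counts (assumed); only "--few_shot 2" appears in this README.
    for shots in (2, 4):
        print(shlex.join(build_train_command("SeAct", "few_shot", shots)))
```

Each returned list can be passed directly to `subprocess.run(cmd, check=True)` from the repository root to launch the run.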

## Test

```
python test.py --config configs/SeAct/SeAct_testing.yaml
```

## Acknowledgments

Our code is based on [EZ-CLIP](https://github.com/Shahzadnit/EZ-CLIP), [MaPLe](https://github.com/muzairkhattak/multimodal-prompt-learning), and [ExACT](https://github.com/jiazhou-garland/ExACT). We sincerely thank the authors for releasing their code.

## License

This repository is released under the [MIT](LICENSE) License.
