Qualitative Video Results

We compare RVM with several strong baselines: the video models VideoMAE, VideoMAE-v2, VJEPA, and 4DS, and the image model DINO-v2. All models use a ViT-L backbone with matched feature-map dimensions and a 16×16 patch size, except DINO-v2, which uses a 14×14 patch size. All baseline checkpoints are obtained from their official GitHub repositories.

DAVIS Video Segmentation

We present qualitative results on 7 randomly selected videos from the DAVIS-2017 dataset for the video object segmentation task (first-frame ground truth provided). The task is to propagate the ground-truth object segmentation from the first frame to all subsequent frames.
bike-packing.mp4
bmx-trees.mp4
gold-fish.mp4
horsejump-high.mp4
judo.mp4
lab-coat.mp4
pigs.mp4
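
The propagation procedure is not spelled out on this page. As a rough illustration only, the sketch below shows one common way to propagate first-frame labels with frozen features: each target patch takes a softmax-weighted vote over its most similar reference patches. The function name propagate_labels and the parameters topk and temperature are hypothetical and are not taken from the RVM evaluation code.

```python
import numpy as np


def propagate_labels(feat_ref, labels_ref, feat_tgt, num_classes,
                     topk=5, temperature=0.1):
    """Propagate per-patch labels from a reference frame to a target frame.

    feat_ref, feat_tgt: (N, D) patch features from a frozen backbone.
    labels_ref: (N,) integer class ids for the reference patches.
    Returns an (N,) array of predicted class ids for the target patches.
    NOTE: topk and temperature are illustrative values, not the paper's.
    """
    # Cosine similarity between every target patch and every reference patch.
    ref = feat_ref / np.linalg.norm(feat_ref, axis=1, keepdims=True)
    tgt = feat_tgt / np.linalg.norm(feat_tgt, axis=1, keepdims=True)
    affinity = tgt @ ref.T  # (N_tgt, N_ref)

    # Keep only the k most similar reference patches for each target patch.
    k = min(topk, affinity.shape[1])
    idx = np.argpartition(-affinity, k - 1, axis=1)[:, :k]
    top = np.take_along_axis(affinity, idx, axis=1)

    # Softmax over the retained affinities (shifted for numerical stability).
    weights = np.exp((top - top.max(axis=1, keepdims=True)) / temperature)
    weights /= weights.sum(axis=1, keepdims=True)

    # Weighted vote over the one-hot labels of the selected reference patches.
    one_hot = np.eye(num_classes)[labels_ref]                # (N_ref, C)
    votes = (weights[..., None] * one_hot[idx]).sum(axis=1)  # (N_tgt, C)
    return votes.argmax(axis=1)
```

In a full pipeline the reference set would typically include the first frame plus a few recent frames, and the patch-level predictions would be upsampled to pixel resolution; both steps are omitted here.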

KMeans Visualization

We present qualitative results on 5 randomly selected videos from the DAVIS-2017 dataset using KMeans clustering to illustrate how each model decomposes visual structure in a video. KMeans is applied directly to the raw feature maps without any additional processing, using K = 5 clusters.
breakdance_kmeans.mp4
car-roundabout_kmeans.mp4
goat_kmeans.mp4
judo_kmeans.mp4
pigs_kmeans.mp4
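
A minimal sketch of this kind of visualization, assuming the feature maps of a clip are stacked into a (T, H, W, D) array and clustered jointly across all frames. Whether the clustering is done per frame or per clip is not stated here, so the joint choice is an assumption. The function name kmeans_segments is hypothetical, and plain Lloyd iterations in NumPy stand in for whatever KMeans implementation was actually used.

```python
import numpy as np


def kmeans_segments(features, k=5, iters=20, seed=0):
    """Cluster raw patch features of a clip into k groups.

    features: (T, H, W, D) feature maps from a frozen backbone.
    Returns a (T, H, W) array of cluster ids in [0, k).
    """
    t, h, w, d = features.shape
    x = features.reshape(-1, d).astype(np.float64)

    # Initialise the centers from k distinct, randomly chosen patches.
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]

    def assign(points, cents):
        # Squared Euclidean distance from every patch to every center.
        dists = ((points[:, None, :] - cents[None, :, :]) ** 2).sum(axis=-1)
        return dists.argmin(axis=1)

    for _ in range(iters):
        labels = assign(x, centers)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:  # leave empty clusters where they are
                centers[j] = members.mean(axis=0)

    return assign(x, centers).reshape(t, h, w)
```

Each cluster id can then be mapped to a fixed color and upsampled to the frame size for display.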

Noise Video Comparison

We present qualitative results on a random noise video using PCA and KMeans clustering to evaluate whether each model’s representations can capture motion independent of semantic content.
noise_kmeans.mp4
noise_pca.mp4

PCA Visualization

We present qualitative results on 5 randomly selected videos from the DAVIS-2017 dataset using principal component analysis (PCA) to illustrate what each model primarily captures in a video. We extract the first three principal components and visualize them as RGB images.
breakdance_pca.mp4
car-roundabout_pca.mp4
goat_pca.mp4
judo_pca.mp4
pigs_pca.mp4
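
The PCA-to-RGB mapping can be sketched as follows. This is an illustrative version only: it assumes the principal components are fit jointly over all patches of a clip and that each component is min-max scaled to [0, 1] before display, neither of which is specified here. The function name pca_rgb is hypothetical.

```python
import numpy as np


def pca_rgb(features):
    """Project patch features onto their first three principal components.

    features: (T, H, W, D) feature maps with D >= 3.
    Returns a (T, H, W, 3) float array in [0, 1], usable as RGB frames.
    """
    t, h, w, d = features.shape
    x = features.reshape(-1, d).astype(np.float64)
    x = x - x.mean(axis=0, keepdims=True)  # center the features

    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    proj = x @ vt[:3].T  # (T*H*W, 3)

    # Min-max scale each component independently so it fits a color channel.
    lo = proj.min(axis=0, keepdims=True)
    hi = proj.max(axis=0, keepdims=True)
    rgb = (proj - lo) / np.maximum(hi - lo, 1e-12)
    return rgb.reshape(t, h, w, 3)
```

Fitting the components once per clip keeps colors consistent across frames; fitting them per frame would let the colors flip from one frame to the next.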

VIP Part Propagation

We present qualitative results on 5 randomly selected videos from the VIP dataset for the video part segmentation task (first-frame ground truth provided). The task is to propagate the ground-truth human part segmentation from the first frame to all subsequent frames.
videos233.mp4
videos311.mp4
videos338.mp4
videos361.mp4
videos37.mp4

JHMDB Pose Tracking

We present qualitative results on 5 randomly selected videos from the JHMDB dataset for the human pose tracking task (first-frame ground truth provided). The task is to propagate the ground-truth human keypoints from the first frame to all subsequent frames.
Brushing_my_hair...
Goalkeeper_Training...
KnifeThrowing...
MeShootin2...
Pick_Up_Your_Trash!...