We compare RVM with several strong baselines: the video models VideoMAE, VideoMAE-v2, VJEPA, and 4DS, and the image model DINO-v2. All models use a ViT-L backbone with matched feature-map dimensions and a 16×16 patch size, except DINO-v2, which uses a 14×14 patch size. All baseline checkpoints are taken from the models' official GitHub repositories.
DAVIS Video Segmentation
We present qualitative results on 7 randomly selected videos from the DAVIS-2017 dataset for the
video object segmentation task, with the first-frame ground truth provided. The task is to propagate
the ground-truth object segmentation from the first frame to all subsequent frames.
bike-packing.mp4
bmx-trees.mp4
gold-fish.mp4
horsejump-high.mp4
judo.mp4
lab-coat.mp4
pigs.mp4
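The propagation step described above can be sketched as follows. This is a minimal illustration of a common recipe for frozen features (each target patch takes a similarity-weighted average of the labels of its most similar reference patches), not necessarily the exact protocol behind these videos. The function name `propagate_labels` and the `topk` and `temperature` values are assumptions made for illustration.

```python
import numpy as np

def propagate_labels(src_feats, src_labels, tgt_feats, topk=5, temperature=0.1):
    """Propagate soft labels from reference patches to target patches.

    src_feats:  (N, D) patch features of the reference frame.
    src_labels: (N, C) one-hot (or soft) labels of the reference patches.
    tgt_feats:  (M, D) patch features of the frame to label.
    Returns (M, C) soft labels for the target patches.
    """
    # Cosine similarity between every target patch and every reference patch.
    src = src_feats / np.linalg.norm(src_feats, axis=1, keepdims=True)
    tgt = tgt_feats / np.linalg.norm(tgt_feats, axis=1, keepdims=True)
    sim = tgt @ src.T  # (M, N)

    # Keep only the top-k most similar reference patches per target patch.
    k = min(topk, sim.shape[1])
    idx = np.argpartition(-sim, k - 1, axis=1)[:, :k]  # (M, k)
    top_sim = np.take_along_axis(sim, idx, axis=1)

    # Temperature-scaled softmax over the retained similarities.
    w = np.exp((top_sim - top_sim.max(axis=1, keepdims=True)) / temperature)
    w /= w.sum(axis=1, keepdims=True)

    # Weighted average of the labels of the selected reference patches.
    return np.einsum("mk,mkc->mc", w, src_labels[idx])
```

Taking the argmax over the class axis of the returned array gives a hard segmentation for the target frame. In practice the reference set is often extended beyond the first frame with a few recent predictions, which is a design choice this sketch leaves out.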
KMeans Visualization
We present qualitative results on 5 randomly selected videos from the DAVIS-2017 dataset using
KMeans clustering to illustrate how each model decomposes visual structure in a video. KMeans is
applied directly to the raw feature maps without any additional processing, using K = 5 clusters.
breakdance_kmeans.mp4
car-roundabout_kmeans.mp4
goat_kmeans.mp4
judo_kmeans.mp4
pigs_kmeans.mp4
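As a rough illustration of the clustering step, the sketch below runs plain Lloyd's k-means on the patch features of a clip and returns one cluster index per patch. The `(T, H, W, D)` feature layout, the joint clustering over all frames, the random initialisation, and the function name `kmeans_segments` are assumptions; the page does not state which k-means implementation was used.

```python
import numpy as np

def kmeans_segments(feats, k=5, iters=20, seed=0):
    """Cluster the patch features of a clip into k groups.

    feats: (T, H, W, D) raw feature maps (assumed layout).
    Returns a (T, H, W) integer map of cluster indices.
    """
    t, h, w, d = feats.shape
    x = feats.reshape(-1, d).astype(np.float64)
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct randomly chosen patches.
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # Assign every patch to its nearest centroid (squared Euclidean distance).
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Move each centroid to the mean of its members; leave empty clusters fixed.
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return labels.reshape(t, h, w)
```

Each cluster index can then be mapped to a fixed colour to render the per-frame segmentation maps.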
Noise Video Comparison
We present qualitative results on a random noise video using PCA and KMeans clustering to evaluate
whether each model’s representations can capture motion independent of semantic content.
noise_kmeans.mp4
noise_pca.mp4
PCA Visualization
We present qualitative results on 5 randomly selected videos from the DAVIS-2017 dataset using
principal component analysis (PCA) to illustrate what each model primarily captures in a video. We
extract the first three principal components and visualize them as RGB images.
breakdance_pca.mp4
car-roundabout_pca.mp4
goat_pca.mp4
judo_pca.mp4
pigs_pca.mp4
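The PCA-to-RGB mapping described above can be sketched as follows. This is a minimal example under stated assumptions: features are laid out as `(T, H, W, D)`, PCA is fit jointly over all patches of the clip, and each component is min-max normalised to [0, 1] before being used as a colour channel. The function name `pca_rgb` is hypothetical.

```python
import numpy as np

def pca_rgb(feats):
    """Project (T, H, W, D) features onto their first 3 principal components.

    Returns a (T, H, W, 3) array with values in [0, 1], usable as RGB frames.
    """
    t, h, w, d = feats.shape
    x = feats.reshape(-1, d).astype(np.float64)
    x = x - x.mean(axis=0, keepdims=True)  # centre the data before PCA
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    proj = x @ vt[:3].T  # (T*H*W, 3)
    # Min-max normalise each component independently into a colour channel.
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    rgb = (proj - lo) / np.maximum(hi - lo, 1e-12)
    return rgb.reshape(t, h, w, 3)
```

Fitting PCA once per clip rather than once per frame keeps the colours consistent over time, which makes temporal stability of the features easier to judge by eye.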
VIP Part Propagation
We present qualitative results on 5 randomly selected videos from the VIP dataset for the video part
segmentation task, with the first-frame ground truth provided. The task is to propagate the
ground-truth human part segmentation from the first frame to all subsequent frames.
videos233.mp4
videos311.mp4
videos338.mp4
videos361.mp4
videos37.mp4
JHMDB Pose Tracking
We present qualitative results on 5 randomly selected videos from the JHMDB dataset for the human
pose tracking task, with the first-frame ground truth provided. The task is to propagate the
ground-truth human keypoints from the first frame to all subsequent frames.
Brushing_my_hair...
Goalkeeper_Training...
KnifeThrowing...
MeShootin2...
Pick_Up_Your_Trash!...
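To make the keypoint propagation task concrete, here is a deliberately simplified sketch that moves each keypoint to the most similar patch in the next frame, using nearest-neighbour matching on cosine similarity. This is a stand-in for illustration, not the evaluation protocol behind the videos above; the function name `track_keypoints` and the patch-coordinate convention are assumptions.

```python
import numpy as np

def track_keypoints(src_feats, tgt_feats, keypoints):
    """Move each keypoint to the most similar patch in the target frame.

    src_feats, tgt_feats: (H, W, D) patch feature maps of two frames.
    keypoints: (K, 2) integer (row, col) patch coordinates in the source frame.
    Returns (K, 2) integer (row, col) patch coordinates in the target frame.
    """
    h, w, d = tgt_feats.shape
    # Features sampled at the keypoint locations, L2-normalised.
    q = src_feats[keypoints[:, 0], keypoints[:, 1]]  # (K, D)
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    t = tgt_feats.reshape(-1, d)
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    # Index of the best-matching target patch for each keypoint.
    best = (q @ t.T).argmax(axis=1)  # (K,)
    # Convert flat patch indices back to (row, col) coordinates.
    return np.stack(np.unravel_index(best, (h, w)), axis=1)
```

Applying the function frame by frame, with the previous output as the new `keypoints`, yields a full track. Patch coordinates are coarse (one patch covers 16×16 or 14×14 pixels for the models compared here), so soft label propagation over per-keypoint heatmaps is usually preferred when sub-patch accuracy matters.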