All videos are encoded in H.264 format and were tested with VLC media player (Linux and Mac).

Each video shows two side-by-side frames that play simultaneously. The left frame is enclosed in a green rectangle while a ground-truth action is occurring. Similarly, the right frame is enclosed in a blue rectangle while the action is predicted to be occurring. Because the videos are untrimmed, only the relevant portion of each video is shown in each example.
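The rectangle convention above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used to render the videos: the function names, the (start, end) interval representation in seconds, the RGB colour values, and the example intervals are all assumptions.

```python
from typing import List, Optional, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds; assumed representation
Colour = Tuple[int, int, int]   # RGB triple; assumed values

GREEN: Colour = (0, 255, 0)  # left frame: ground-truth action is occurring
BLUE: Colour = (0, 0, 255)   # right frame: action is predicted to be occurring


def is_active(t: float, intervals: List[Interval]) -> bool:
    """Return True if time t (in seconds) lies inside any of the intervals."""
    return any(start <= t <= end for start, end in intervals)


def border_colours(
    t: float, gt: List[Interval], pred: List[Interval]
) -> Tuple[Optional[Colour], Optional[Colour]]:
    """Return (left, right) border colours at time t; None means no rectangle."""
    left = GREEN if is_active(t, gt) else None
    right = BLUE if is_active(t, pred) else None
    return left, right


# Hypothetical intervals, not taken from any of the videos.
example_gt = [(2.0, 5.0)]
example_pred = [(4.0, 7.0)]
```

With these hypothetical intervals, only the left frame is outlined at 3.0s, both frames are outlined at 4.5s, and only the right frame is outlined at 6.0s.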

1. See 1_Pole_vault.mp4: Well-separated action instances of Pole Vault are generally detected by our D2-Net.

2. See 2_Volleyball_spiking.mp4: The first two instances of Volleyball Spiking each contain a considerable pause in the video (0:31s to 0:34s and 0:48s to 0:51s), so the corresponding frames carry no motion. The absence of discriminative motion information leads to four incorrect detections for these two ground-truth instances.

3. See 3_Washing_hands.mp4: The two adjacent ground-truth Washing Hands instances are jointly detected as a single instance by our approach, since the separating background (from 0:40s to 0:49s) is indistinguishable from the foreground activity.

4. See 4_Playing_harmonica.mp4: Both the long- and short-duration instances of Playing Harmonica are detected correctly by our approach (0:24s to 1:31s and 1:54s to 2:00s). However, a false detection occurs from 2:05s to 2:08s, when the performer is on stage but not playing.
