Supplementary Material for Paper: STAViS: Spatio-Temporal AudioVisual Saliency Network

All videos are encoded with the x264 codec: https://www.videolan.org/ https://www.videolan.org/developers/x264.html

Each video corresponds to a figure in the main manuscript:

Figure_1.mp4: Example frames, with their eye-tracking data, from a video of a bell tolling. The second row shows the
              saliency maps produced by our visual-only saliency network; the third row shows the output of our
              proposed STAViS network, which captures human attention better.

Figure_4.mp4: Sample frames from the Coutrot1 database with their eye-tracking data, and the corresponding ground truth,
              spatio-temporal visual, and audiovisual saliency maps as produced by STAViS (visual-only and audiovisual).
              The NSS curve over time is also shown for the visual and audiovisual approaches.

Figure_5.mp4: Sample frames from the Coutrot2 database with their eye-tracking data, and the corresponding ground truth,
              spatio-temporal visual, and audiovisual saliency maps as produced by STAViS (visual-only and audiovisual).
              The NSS curve is also shown for the visual and audiovisual approaches.

Figure_6.mp4: Sample frames from the ETMD, AVAD, and DIEM databases with their eye-tracking data, and the corresponding
              ground truth, STAViS, and other state-of-the-art spatio-temporal visual saliency maps for comparison.
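
Note on the NSS curves in Figure_4.mp4 and Figure_5.mp4: NSS (Normalized Scanpath Saliency) is commonly computed by
normalizing a saliency map to zero mean and unit standard deviation and then averaging the normalized values at the
human fixation locations. The sketch below illustrates that common definition and how a per-frame curve could be
obtained. It is not the evaluation code used for the paper: the function names nss and nss_curve, the list-based
input format, the use of the population standard deviation, and the score of 0 for a flat map or a frame without
fixations are all assumptions made for illustration.

```python
from statistics import fmean, pstdev


def nss(saliency, fixations):
    """Normalized Scanpath Saliency for a single frame (illustrative sketch).

    saliency:  2D list of floats, indexed as saliency[row][col]
    fixations: list of (row, col) fixation positions for that frame
    """
    values = [v for row in saliency for v in row]
    mean = fmean(values)
    std = pstdev(values)  # assumption: population standard deviation
    if std == 0 or not fixations:
        # Assumption: a flat map or a frame without fixations scores 0.
        return 0.0
    # Average the z-scored saliency values at the fixated pixels.
    return fmean([(saliency[r][c] - mean) / std for r, c in fixations])


def nss_curve(saliency_frames, fixation_frames):
    """Per-frame NSS values, i.e. an NSS curve over time."""
    return [nss(s, f) for s, f in zip(saliency_frames, fixation_frames)]
```

Calling nss_curve once with the visual-only maps and once with the audiovisual maps, using the same fixation data,
would give two curves of the kind compared in the videos. Higher values mean the map assigns more saliency to the
locations people actually looked at.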