The folder contains:
    - A file with supplementary text and additional results (in SupplemetaryMaterial.pdf) -- references are numbered according to the main text;
    - The novel dataset of speech segments for ActivityNet Captions (in asr_en.csv);
    - The novel dataset of categories per video for ActivityNet Captions (in vid2cat.json);
    - The list of available videos at the time of collection (in available_mp4.txt);
    - The generated captions for validation sets 1 and 2 (in results_val_*_e30.json);

Notes:
    The links to the used videos (obtained from ActivityNet dataset):
        - Figure 1: https://www.youtube.com/embed/PLqTX6ij52U?rel=0
        - Figure 4: https://www.youtube.com/embed/xs5imfBbWmw?rel=0
        - Figure 7: https://www.youtube.com/embed/EGrXaq213Oc?rel=0
    asr_en.csv contains five columns:
        - video_id: YouTube video id;
        - sub: the recognized speech;
        - start: starting point of the recognized speech;
        - end: ending point of the original speech track;
        - end_adj (*): adjusted (more accurate) ending point of the speech track.

    (*) While somebody is speaking, YouTube displays the previous speech segment on screen
        while the new one appears below it. When the current line is finished, it replaces
        the previous one and the next one starts to appear, and so on. Therefore, the
        starting point of a segment is quite accurate, while the ending point is not.
        Since the previous speech segment effectively ends when the next one starts, we
        may adjust the ending point to be the start of the next segment within one video.
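    The adjustment described above can be sketched as follows (a minimal illustration, not
    part of the release; the toy values and column handling are assumptions based on the
    column description of asr_en.csv):

    ```python
    import pandas as pd

    # Toy rows mimicking the asr_en.csv columns (values are illustrative only).
    df = pd.DataFrame({
        "video_id": ["v1", "v1", "v1", "v2"],
        "sub": ["hello", "world", "bye", "hi"],
        "start": [0.0, 2.5, 6.0, 1.0],
        "end": [3.0, 7.0, 9.0, 4.0],
    })

    # Sort segments chronologically within each video.
    df = df.sort_values(["video_id", "start"]).reset_index(drop=True)

    # end_adj: the start of the next segment within the same video;
    # the last segment of a video keeps its original ending point.
    df["end_adj"] = df.groupby("video_id")["start"].shift(-1).fillna(df["end"])
    ```

    Here the first segment of "v1" gets end_adj = 2.5 (the start of the second segment),
    while the last segments of "v1" and "v2" keep their original ending points.
    
    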
