- [pdf] [supp]
Boosting Positive Segments for Weakly-Supervised Audio-Visual Video Parsing
In this paper, we address the problem of weakly supervised Audio-Visual Video Parsing (AVVP), where the goal is to temporally localize events that are audible or visible and simultaneously classify them into known event categories. This is a challenging task, as we only have access to video-level event labels during training but need to predict event labels at the segment level during evaluation. Existing multiple-instance learning (MIL) based methods use a form of attentive pooling over segment-level predictions. These methods optimize only for the most discriminative subset of segments that satisfies the weak-supervision constraints, and therefore fail to identify many positive segments. To address this, we focus on improving the proportion of positive segments detected in a video. To this end, we model the number of positive segments in a video as a latent variable and show that it follows a Poisson binomial distribution over the segment-level predictions, which can be computed exactly. Given the absence of fine-grained supervision, we propose an Expectation-Maximization approach that learns the model parameters by maximizing the evidence lower bound (ELBO). We iteratively estimate the minimum number of positive segments in a video and refine the estimates to capture more positive segments. We conducted extensive experiments on AVVP tasks to evaluate the effectiveness of the proposed approach, and the results clearly demonstrate that it captures more positive segments than existing methods. Additionally, our experiments on Temporal Action Localization (TAL) demonstrate the potential of our method to generalize to similar MIL tasks.
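The abstract notes that the distribution of the number of positive segments is a Poisson binomial distribution over the segment-level predictions, and that it can be computed exactly. As a minimal sketch of what "computed exactly" means here (the function name and example probabilities are illustrative, not from the paper), the full probability mass function can be obtained by a simple dynamic-programming convolution over the per-segment probabilities:

```python
import numpy as np

def poisson_binomial_pmf(probs):
    """Exact PMF of the count of positive segments, where segment i is
    positive independently with probability probs[i].
    Built by convolving one Bernoulli PMF at a time; O(n^2) overall."""
    pmf = np.array([1.0])  # PMF of the count over zero segments
    for p in probs:
        # Convolving with [1-p, p] adds one Bernoulli(p) to the sum.
        pmf = np.convolve(pmf, [1.0 - p, p])
    return pmf

# Example: three segments with different positive probabilities.
pmf = poisson_binomial_pmf([0.9, 0.5, 0.2])
# pmf[k] = P(exactly k positive segments); entries sum to 1.
```

This exactness is what allows the latent count of positive segments to be used directly inside the ELBO objective, rather than approximated by sampling.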