@InProceedings{Shi_2026_CVPR,
    author    = {Shi, Junqi and Cong, Wuyang and Lu, Ming and Xu, Bowei and Ma, Zhan},
    title     = {Beyond Pixel Loss: Video-INRs Prefer Perceptual Optimization},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Findings},
    month     = {June},
    year      = {2026},
    pages     = {4843-4854}
}
Beyond Pixel Loss: Video-INRs Prefer Perceptual Optimization
Abstract
Implicit neural representations (INRs) have recently emerged as a powerful paradigm for video modeling, representing videos as continuous functions parameterized by network weights rather than storing raw pixels or latent codes. However, most existing video-INR methods still rely on pixel-wise supervision (MSE or ℓ1), which, through the lens of variational inference, implicitly assumes Gaussian or Laplacian reconstruction noise. We show that such assumptions are statistically misaligned with per-video characteristics, where reconstruction errors are highly structured and temporally correlated in real-world videos. We argue that INRs, by their sequence-specific nature, are inherently better suited to perceptual rather than pixel alignment. To validate this perspective, we propose POVI (Perceptually Optimized Video Implicit representation), a perceptually aligned learning framework that shifts INR supervision into multi-level visual feature domains. POVI integrates two complementary perceptual objectives: Multi-Vision Feature Similarity (MVFS) for spatial fidelity and Vision Subject Similarity (VSS) for temporal coherence. Even with a lightweight INR backbone using simple cascaded upsampling, POVI achieves superior perceptual quality compared to state-of-the-art VAE- and diffusion-based codecs, while maintaining real-time decoding at ~125 FPS on 1080p videos. Our findings reveal that perceptual optimization is not merely a heuristic improvement, but a principled objective shift essential for advancing video-INR representation and reconstruction.
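The link the abstract draws between pixel losses and noise assumptions can be made concrete with the standard textbook correspondence between negative log-likelihood and pixel-wise loss. The sketch below is not taken from the paper. The symbols are assumptions introduced here for illustration: x_i is a ground-truth pixel, \hat{x}_{\theta,i} is the INR output for that pixel, N is the number of pixels, and \sigma and b are fixed noise scales.

```latex
% Gaussian noise assumption: independent residuals with fixed variance \sigma^2
p(x \mid \theta) = \prod_{i=1}^{N} \frac{1}{\sigma\sqrt{2\pi}}
    \exp\!\left( -\frac{(x_i - \hat{x}_{\theta,i})^2}{2\sigma^2} \right)
\;\Longrightarrow\;
-\log p(x \mid \theta) = \frac{1}{2\sigma^2} \sum_{i=1}^{N} (x_i - \hat{x}_{\theta,i})^2
    + N \log\!\left( \sigma\sqrt{2\pi} \right)

% Laplacian noise assumption: independent residuals with fixed scale b
p(x \mid \theta) = \prod_{i=1}^{N} \frac{1}{2b}
    \exp\!\left( -\frac{|x_i - \hat{x}_{\theta,i}|}{b} \right)
\;\Longrightarrow\;
-\log p(x \mid \theta) = \frac{1}{b} \sum_{i=1}^{N} |x_i - \hat{x}_{\theta,i}|
    + N \log(2b)
```

With \sigma and b held fixed, minimizing the first expression over \theta is equivalent to minimizing MSE, and minimizing the second is equivalent to minimizing the ℓ1 loss. Both factorizations treat residuals as independent and identically distributed across pixels, which is the assumption the abstract describes as misaligned with structured, temporally correlated reconstruction errors.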

