Beyond Pixel Loss: Video-INRs Prefer Perceptual Optimization

Shi, Junqi; Cong, Wuyang; Lu, Ming; Xu, Bowei; Ma, Zhan

Junqi Shi, Wuyang Cong, Ming Lu, Bowei Xu, Zhan Ma; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Findings, 2026, pp. 4843-4854

Abstract

Implicit neural representations (INRs) have recently emerged as a powerful paradigm for video modeling, representing videos as continuous functions parameterized by network weights rather than storing raw pixels or latent codes. However, most existing video-INR methods still rely on pixel-wise supervision (MSE or L1), which, through the lens of variational inference, implicitly assumes Gaussian or Laplacian reconstruction noise. We show that such assumptions are statistically misaligned with per-video characteristics: in real-world videos, reconstruction errors are highly structured and temporally correlated. We argue that INRs, by their sequence-specific nature, are inherently better suited to perceptual rather than pixel alignment. To validate this perspective, we propose POVI (Perceptually Optimized Video Implicit representation), a perceptually aligned learning framework that shifts INR supervision into multi-level visual feature domains. POVI integrates two complementary perceptual objectives: Multi-Vision Feature Similarity (MVFS) for spatial fidelity and Vision Subject Similarity (VSS) for temporal coherence. Even with a lightweight INR backbone using simple cascaded upsampling, POVI achieves superior perceptual quality compared to state-of-the-art VAE- and diffusion-based codecs, while maintaining real-time decoding at ~125 FPS on 1080p videos. Our findings reveal that perceptual optimization is not merely a heuristic improvement, but a principled objective shift essential for advancing video-INR representation and reconstruction.
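
The link between pixel losses and noise assumptions mentioned in the abstract can be made explicit with the standard negative log-likelihood argument. The sketch below is a textbook derivation, not taken from the paper; the symbols x (ground-truth pixels), \hat{x}_\theta (INR reconstruction), N (number of pixels), \sigma and b (noise scales) are notation assumed here for illustration.

```latex
% Gaussian noise, i.i.d. per pixel: minimizing the NLL is minimizing MSE.
p(x \mid \hat{x}_\theta) = \mathcal{N}\!\left(x;\, \hat{x}_\theta,\, \sigma^2 I\right)
\;\Longrightarrow\;
-\log p(x \mid \hat{x}_\theta)
  = \frac{1}{2\sigma^2}\,\lVert x - \hat{x}_\theta \rVert_2^2
  + \frac{N}{2}\log\!\left(2\pi\sigma^2\right)

% Laplacian noise, i.i.d. per pixel: minimizing the NLL is minimizing the L1 loss.
p(x \mid \hat{x}_\theta) = \prod_{i=1}^{N} \frac{1}{2b}
  \exp\!\left(-\frac{\lvert x_i - \hat{x}_{\theta,i} \rvert}{b}\right)
\;\Longrightarrow\;
-\log p(x \mid \hat{x}_\theta)
  = \frac{1}{b}\,\lVert x - \hat{x}_\theta \rVert_1
  + N\log(2b)
```

In both cases the likelihood factorizes over pixels, so the loss treats every residual as independent of its spatial and temporal neighbors. That independence is the assumption the abstract describes as misaligned with structured, temporally correlated reconstruction errors.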

Related Material

[bibtex]

@InProceedings{Shi_2026_CVPR,
    author    = {Shi, Junqi and Cong, Wuyang and Lu, Ming and Xu, Bowei and Ma, Zhan},
    title     = {Beyond Pixel Loss: Video-INRs Prefer Perceptual Optimization},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Findings},
    month     = {June},
    year      = {2026},
    pages     = {4843-4854}
}