QuickSRNet: Plain Single-Image Super-Resolution Architecture for Faster Inference on Mobile Platforms

Guillaume Berger* Manik Dhingra* Antoine Mercier Yashesh Savani Sunny Panchal Fatih Porikli Qualcomm AI Research†
{guilberg, manidhin, amercier, ysavani, sunnpanc, fporikli}@qti.qualcomm.com

Abstract

In this work, we present QuickSRNet, an efficient super-resolution architecture for real-time applications on mobile platforms. Super-resolution clarifies, sharpens, and up-scales an image to higher resolution. Applications such as gaming and video playback along with the ever-improving display capabilities of TVs, smartphones, and VR headsets are driving the need for efficient upscaling solutions. While existing deep learning-based super-resolution approaches achieve impressive results in terms of visual quality, enabling real-time DL-based super-resolution on mobile devices with compute, thermal, and power constraints is challenging. To address these challenges, we propose QuickSRNet, a simple yet effective architecture that provides better accuracy-to-latency trade-offs than existing neural architectures for single-image super-resolution. We present training tricks to speed up existing residual-based super-resolution architectures while maintaining robustness to quantization. Our proposed architecture produces 1080p outputs via 2× upscaling in 2.2 ms on a modern smartphone, making it ideal for high-fps real-time applications.

1. Introduction

Single-image super-resolution (SR) refers to a family of techniques that recover a high-resolution (HR) image $I_{HR}$ from its low-resolution (LR) counterpart $I_{LR}$. In recent years, deep learning (DL) based approaches have become increasingly popular in the field [6, 10, 11, 20, 24, 27, 28, 34, 35], producing impressive results compared to interpolation-based techniques and hand-engineered heuristics (see Fig. 2). However, most existing DL-based super-resolution solutions are computationally intensive and not suitable for real-time applications requiring interactive frame rates, such as mobile gaming. While DL-based super-resolution has been successfully applied to gaming on high-end GPU desktops [7, 26], neural approaches are still impractical for mobile gaming due to their high latency and computational costs. For example, a DL-based architecture such as EDSR [24] takes 75 ms to upscale a 540p image to 1080p on a state-of-the-art mobile AI accelerator. This has driven the need for efficient DL-based super-resolution solutions [2, 5, 12, 13, 38] that can be used in real-time applications such as video gaming, where responsiveness and higher frame rates are essential.

*Contributed equally.
†Qualcomm AI Research is an initiative of Qualcomm Technologies, Inc.

In this work, rather than trying to achieve the state-of-the-art PSNR or SSIM scores on standard super-resolution benchmarks, we aim to develop efficient architectures that are suitable for high-fps real-time applications on mobile devices. To this end, we propose QuickSRNet, a simple single-image super-resolution neural network that obtains better accuracy-to-latency trade-offs than existing efficient SR architectures. In particular, we make the following key contributions:

• We streamline the network architecture, reduce the impact of residual connection removal, and ultimately demonstrate the effectiveness of simpler designs in achieving high levels of accuracy and on-device performance.

• We compare a wide variety of architectures in terms of on-device latency, measured on a device with Snapdragon® 8 Gen 1 Mobile Platform, rather than FLOP count, which is not a reliable indicator of on-device performance [18].

• We measure accuracy after 8-bit quantization, a necessary step for better efficiency on mobile platforms, and describe architectural tricks that improve robustness to quantization.

• We apply our proposed architecture to a real-world use-case (video gaming) and compare its visual quality against that of a well-known industrial non-ML based approach (AMD’s FidelityFX Super Resolution (FSR1.0) algorithm [14]).

• We describe an approach to perform $1.5 \times$ upscaling, a setting that is occasionally used in gaming and XR use-cases but not trivially supported by SR architectures whose upscaling step is based on a sub-pixel convolution [35].

2. Related work

Several efficient SR architectures have been proposed recently. Overall, these architectures share many characteristics with the earlier FSRCNN [11] and ESPCN [35] models: they are usually fully convolutional and use a relatively small number of layers and channels; all layers run at the input resolution, and the final output is mapped to higher resolution using a subpixel convolution$^{1}$.

$^{1}$ In the rest of the paper, we will use the term “depth-to-space” operation. In practice, a subpixel convolution amounts to performing a regular convolution producing $3 \times S^2$ low-resolution channels, where $S$ is the scaling factor, followed by a “depth-to-space” operation to map to higher resolution.
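The depth-to-space rearrangement described in this footnote can be sketched in a few lines of NumPy. This is an illustration only: the function name `depth_to_space`, the channel-first `(C, H, W)` layout, and the channel ordering (the common pixel-shuffle convention) are assumptions, not details taken from the paper.

```python
import numpy as np

def depth_to_space(x: np.ndarray, s: int) -> np.ndarray:
    """Rearrange a (C*s*s, H, W) array into (C, H*s, W*s).

    Input channel c*s*s + i*s + j lands at output position
    (c, h*s + i, w*s + j), the usual pixel-shuffle convention
    (assumed here; other orderings exist).
    """
    cs2, h, w = x.shape
    assert cs2 % (s * s) == 0, "channel count must be divisible by s*s"
    c = cs2 // (s * s)
    x = x.reshape(c, s, s, h, w)      # axes: (C, i, j, H, W)
    x = x.transpose(0, 3, 1, 4, 2)    # axes: (C, H, i, W, j)
    return x.reshape(c, h * s, w * s)

# For 2x upscaling of an RGB image, the last conv emits 3 * 2**2 = 12 channels.
lr_features = np.arange(12 * 4 * 5, dtype=np.float32).reshape(12, 4, 5)
hr = depth_to_space(lr_features, s=2)
print(hr.shape)  # (3, 8, 10)
```

No arithmetic is performed: the operation only moves values from the channel axis to the spatial axes, which is why all convolutions can run at the input resolution.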

Compared to these baselines, more recent approaches have incorporated the following changes:

XLSR [2] uses grouped convolutions to reduce the computational footprint of the architecture and “clipped” ReLU activations to improve robustness to quantization.

ABPN [13] employs a VGG-like convnet [37] (i.e. consisting of only $3 \times 3$ Conv-ReLU blocks) with an “anchor-based” input-to-output residual connection. This “anchor-based” connection adds a channel-wise nearest-neighbor upscaled version of the input to the output before the final depth-to-space operation. We confirmed that this channel-wise implementation runs faster on our profiling device than the more common approach of adding the spatially-upsampled input directly to the output. Thus, we follow the same strategy to implement input-to-output residual connections in all our experiments.

SESR [5] leverages linear over-parameterized residual modules which are collapsed into regular convolutions during inference for improved on-device performance. Other modifications include the use of long residual connections.

RepSR [38] investigates training VGG-like super-resolution architectures. Like ABPN, their convnet is equipped with a nearest-neighbor upsampling-based input-to-output connection. Similar to SESR, they find that using over-parameterized networks during training can boost accuracy. They propose a training scheme for using Batch Normalization (BN) layers [22] without introducing artifacts in flat regions of the image, a typical side effect of BN when employed for super-resolution. At test time, the over-parameterized, BN-equipped network is collapsed into a simpler, more efficient network.

Figure 3. QuickSRNet architecture. We use the convention $\text{QuickSRNet-}f^X-m^Y$ to refer to the architecture variant that has $Y$ intermediate conv layers and $X$ feature channels. We use dotted lines to illustrate that the conv layers are initialized using an identity initialization scheme. In practice, these skip connections are incorporated into the weights of the corresponding conv module. $p$ and $ri$ stand for "partial" and "repeat-interleaving" respectively (see Sec. 3.2 for more details).

<table>
<thead>
<tr>
<th>Architecture</th>
<th>FP16 PSNR (dB)</th>
<th>INT8 PSNR (dB)</th>
<th>Latency (ms)</th>
</tr>
</thead>
<tbody>
<tr>
<td>ABPN</td>
<td>31.84 (baseline)</td>
<td>31.80 (baseline)</td>
<td></td>
</tr>
<tr>
<td>Res.-free ABPN</td>
<td>31.75 ($-0.09$)</td>
<td>31.50 ($-0.30$)</td>
<td></td>
</tr>
<tr>
<td>QuickSRNet-Medium</td>
<td>31.82 ($-0.02$)</td>
<td>31.77 ($-0.03$)</td>
<td></td>
</tr>
</tbody>
</table>

Table 1. On the accuracy and latency impacts of removing the input-to-output residual connection from the ABPN architecture. We report PSNR numbers obtained on BSD100 via $2\times$ upsampling. Latency numbers were obtained on a device with Snapdragon 8 Gen 1 using an input resolution of $512 \times 512$.

**Residual learning for super-resolution** Many SR architectures utilize a long skip connection which adds an upsampled version of the input $U(I_{LR})$ directly to the output. Efficient architectures (like [13, 38]) will often implement $U$ as nearest-neighbour interpolation. During training, SR architectures equipped with this technique are implicitly optimized to produce a residual $R = I_{HR} - U(I_{LR})$. One benefit is that the network produces reasonable outputs right after initialization which stabilizes training. Additionally, the input-to-output connection makes the architecture significantly more robust to quantization, as discussed in the next section.
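As a small illustration of the residual formulation, the sketch below builds $U(I_{LR})$ with nearest-neighbour upsampling and computes the target residual $R = I_{HR} - U(I_{LR})$. The helper name `nearest_upsample` and the toy data are hypothetical, and the LR image is obtained by plain subsampling purely for demonstration.

```python
import numpy as np

def nearest_upsample(img: np.ndarray, s: int) -> np.ndarray:
    """Nearest-neighbour upsampling U of a (C, H, W) image by an integer factor s."""
    return img.repeat(s, axis=1).repeat(s, axis=2)

rng = np.random.default_rng(0)
s = 2
i_hr = rng.random((3, 8, 8))   # toy HR image with values in [0, 1)
i_lr = i_hr[:, ::s, ::s]       # toy LR image (plain subsampling, illustration only)

u = nearest_upsample(i_lr, s)  # U(I_LR)
residual = i_hr - u            # R = I_HR - U(I_LR), the quantity the conv stack must predict

# Adding the residual back onto U(I_LR) recovers the HR image.
assert np.allclose(u + residual, i_hr)
```

If the convolutional branch outputs values close to zero, the network output is close to $U(I_{LR})$, which is already a coarse but sensible estimate of the HR image.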

3. Methodology

This section contains a detailed description of QuickSRNet as well as implementation details. The process for developing our proposed SR architecture began with preliminary experiments, which we present in the next paragraph.

3.1. On the impact of removing the input-to-output residual connection

VGG-style architectures such as ABPN [13] or RepSR [38] are already well-optimized, so it is unclear how much faster they can be made on mobile AI accelerators. Intuitively, reducing the number of layers and channels, or replacing $3 \times 3$ kernels with $1 \times 1$ kernels, can improve speed at the cost of accuracy. Instead, our experiments investigate how to effectively remove the input-to-output residual connection without affecting accuracy.

As observed by [2, 13], long residual connections can have a large impact on the efficiency of super-resolution architectures, particularly on memory-limited platforms such as smartphones or VR headsets. To confirm this, we trained and profiled a residual-free ABPN variant and found that removing the input-to-output residual connection improves latency by $35\%$. However, this modification resulted in a marginally lower accuracy and more importantly, reduced robustness to quantization, as can be seen in Tab. 1. A similar trend is evident in the results of the Mobile AI 2022 Challenge [21], where the fastest approaches do not use input-to-output residual connections at the cost of accuracy. To address this, we propose QuickSRNet, a residual-free architecture which is robust to quantization.

3.2. QuickSRNet

Our architecture, QuickSRNet, follows a VGG-like structure with no input-to-output residual connection (see Fig. 3). The architecture is parameterized by $m$, the number of intermediate convolutional blocks, and $f$, the number of feature channels in those intermediate layers. To increase robustness to quantization, we use a residual learning-motivated initialization scheme along with clipped ReLU activations:

**Identity initialization** We utilize an intuitive initialization technique where each intermediate convolutional layer simulates a localized skip connection:

$$y = W \odot x + x \tag{1}$$

where $\odot$ is the discrete convolution operator and $W$ refers to the kernel weights. In practice, we collapse the skip connection into the kernel weights:

\[ y = (W + I) \odot x = \hat{W} \odot x, \]

where \( \hat{W} \) are the modified weights after collapse and \( I \) is the identity of discrete convolution operators. In this case, collapsing amounts to adding a diagonal of ones to the spatial center: \( \forall i, \hat{W}[i, i, c_x, c_y] = W[i, i, c_x, c_y] + 1 \) (with \( c_x = c_y = 1 \) assuming a \( 3 \times 3 \) kernel). This approach is akin to identity initialization \([3, 16, 41, 42]\) and related to the over-parameterized networks used in \([5, 9, 38, 39]\), except we collapse before training, during initialization.

Figure 4. Visualization of 4× super-resolved images from Urban100 produced by our models and existing baselines. Our models match the quality of existing architectures while being significantly faster.

Equation (1) only works if \( x \) and \( y \) have the same dimensions and is therefore not directly applicable to the first and last layer of the architecture, as these layers respectively change the number of channels from \( 3 \) to \( f \) and \( f \) to \( 3 \times S^2 \), where \( S \) is the scaling factor. For these layers, we modify the initialization scheme as follows:

- **Partial identity initialization**: the 3-channel input to the first convolutional module is added to the first 3 output channels and the other \( f - 3 \) output channels are left unchanged.

\[ y_i = \begin{cases} (W \odot x)_i + x_i, & \text{if } 0 \leq i < 3 \\ (W \odot x)_i, & \text{otherwise} \end{cases} \tag{2} \]

- **Repeat-interleaving identity initialization**: the first 3 input channels to the final convolutional module are repeat-interleaved \( S^2 \) times and added to the output.

\[ y_i = (W \odot x)_i + x_{\lfloor i / S^2 \rfloor} \tag{3} \]

Similar to Eq. (1), the skip connections described in Eqs. (2) and (3) are incorporated into the corresponding convolutional module by adding ones to the kernel weights at the appropriate location. Intuitively, this initialization technique makes the input image propagate well throughout the entire network. The repeat-interleaving scheme used to initialize the final layer mimics the nearest-neighbour up-scaling typically performed in the input-to-output connection of existing residual architectures.
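The three initialization variants can be sketched as follows. This is a minimal NumPy illustration, not the released implementation: the function names `identity_init` and `conv2d_same`, the `(C_out, C_in, k, k)` weight layout, and the use of all-zero base weights (chosen only to isolate the folded-in skip path) are assumptions. The sketch also assumes a ReLU1 after every conv except the last.

```python
import numpy as np

def identity_init(w: np.ndarray, mode: str = "full", s: int = 1) -> np.ndarray:
    """Return a copy of w (C_out, C_in, k, k) with the skip connection folded in.

    mode="full":              Eq. (1), intermediate convs (C_out == C_in)
    mode="partial":           Eq. (2), first conv (3 -> f channels)
    mode="repeat_interleave": Eq. (3), last conv (f -> 3 * s * s channels)
    """
    w_hat = w.copy()
    c_out, c_in, k, _ = w.shape
    c = k // 2  # spatial centre of the kernel
    if mode == "full":
        assert c_out == c_in
        for i in range(c_out):
            w_hat[i, i, c, c] += 1.0
    elif mode == "partial":
        assert c_out >= c_in
        for i in range(c_in):
            w_hat[i, i, c, c] += 1.0
    elif mode == "repeat_interleave":
        assert c_out == 3 * s * s
        for i in range(c_out):
            w_hat[i, i // (s * s), c, c] += 1.0
    else:
        raise ValueError(f"unknown mode: {mode}")
    return w_hat

def conv2d_same(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Naive stride-1 'same' convolution; x is (C_in, H, W), w is (C_out, C_in, k, k)."""
    c_out, _, k, _ = w.shape
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    h, wd = x.shape[1:]
    y = np.zeros((c_out, h, wd))
    for dy in range(k):
        for dx in range(k):
            # Contribution of one kernel tap, summed over input channels.
            y += np.einsum("oc,chw->ohw", w[:, :, dy, dx], xp[:, dy:dy + h, dx:dx + wd])
    return y

def relu1(t: np.ndarray) -> np.ndarray:
    return np.clip(t, 0.0, 1.0)

f, s = 8, 2
x = np.random.default_rng(0).random((3, 4, 4))  # pixels already scaled to [0, 1]

# W = 0 is used here only to make the skip path visible in isolation.
w_first = identity_init(np.zeros((f, 3, 3, 3)), "partial")
w_mid = identity_init(np.zeros((f, f, 3, 3)), "full")
w_last = identity_init(np.zeros((3 * s * s, f, 3, 3)), "repeat_interleave", s)

h = relu1(conv2d_same(x, w_first))
h = relu1(conv2d_same(h, w_mid))
y = conv2d_same(h, w_last)

# Output channel i carries input channel i // s**2, so a depth-to-space
# step on y would yield the nearest-neighbour upsampled input.
assert all(np.allclose(y[i], x[i // (s * s)]) for i in range(3 * s * s))
```

With zero base weights the whole stack reduces to nearest-neighbour upscaling, which is the behaviour the input-to-output residual connection would otherwise provide.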

**ReLU1** In addition to identity initialization, we found that clipping ReLU activations between 0 and 1 improves robustness to quantization. Compared to XLSR \([2]\), we use ReLU1s throughout the entire network as opposed to just the final layer. Note that for this approach to work well with our id-initialized architecture, it is important to scale input pixels between 0 and 1 (centering around 0 would cause roughly half the pixels propagated by the first id-initialized conv to be zeroed out).
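The remark about input scaling can be checked numerically. In the sketch below (the function name `relu1` and the sample pixel values are illustrative), pixels scaled to [0, 1] pass through a ReLU1 unchanged, whereas pixels centred around 0 lose their negative half.

```python
import numpy as np

def relu1(x: np.ndarray) -> np.ndarray:
    """ReLU clipped to the range [0, 1]."""
    return np.clip(x, 0.0, 1.0)

pixels_8bit = np.array([0, 64, 128, 192, 255], dtype=np.float64)

scaled = pixels_8bit / 255.0   # values in [0, 1]
centred = scaled - 0.5         # values in [-0.5, 0.5], centred around 0

print(relu1(scaled))    # unchanged: the identity path keeps every pixel
print(relu1(centred))   # negative values are clamped to 0 and lost
```

A bounded activation range also means the quantizer never has to cover outliers: with activations confined to [0, 1], an unsigned 8-bit grid has a fixed step of 1/255.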

Our experimental results (Sec. 4) show that combining identity initialization and ReLU1 activations significantly improves robustness to quantization.

3.3. Implementation details

**Baselines** We compare QuickSRNet against the following architectures: FSRCNN \([11]\), ESPCN \([35]\), XLSR \([2]\), SESR \([5]\), ABPN \([13]\), ERFDN \([27]\) and EDSR \([24]\). Note that, rather than reporting PSNR and SSIM scores from the original papers, we re-implemented, trained and quantized all existing baselines from scratch. As a result, all models shared most hyper-parameters (batch-size, losses, optimizer, etc.), including the data loading/augmentation pipeline. We did however tweak the learning rate for each architecture independently. In some cases, our re-implementation deviates slightly from the original architecture when it includes operations that are not supported on the device used for profiling. For example, we replaced the parametric ReLUs [17] used in SESR and FSRCNN with regular ReLUs. Despite these minor modifications, we were usually able to reproduce PSNR and SSIM scores reported in the original papers.

**Training details** For most experiments, we train the models on the 800 training images from the DIV2K dataset [1] and evaluate them on standard SR test sets: Set5 [4], Set14 [40], BSD100 [29], and Urban100 [19]. We preprocess input and target images by scaling RGB values between 0 and 1. For data augmentation, we use random cropping, flipping and rotation. The models are trained for 1 million iterations with a batch size of 32. We use an L1 loss and the Adam optimizer [23] with hyper-parameters $\epsilon = 10^{-8}$ and $(\beta_1, \beta_2) = (0.9, 0.999)$. For the learning rate, we found that using an initial value of $5 \times 10^{-4}$ and decaying it by a factor of 0.5 every 200K iterations is a strategy that works well for most architectures.
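The learning-rate schedule described above amounts to a step decay. A minimal sketch follows, assuming the decay is applied exactly at multiples of 200K iterations; the function name `learning_rate` and its parameter names are illustrative.

```python
def learning_rate(iteration: int,
                  base_lr: float = 5e-4,
                  gamma: float = 0.5,
                  step: int = 200_000) -> float:
    """Step decay: multiply the base rate by gamma once every `step` iterations."""
    return base_lr * gamma ** (iteration // step)

for it in (0, 199_999, 200_000, 400_000, 999_999):
    print(f"iteration {it:>7}: lr = {learning_rate(it):.3e}")
```

Over 1 million iterations the rate is halved four times, so training ends at $5 \times 10^{-4} \times 0.5^4 = 3.125 \times 10^{-5}$.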

**8-bit quantization** We use the AIMET library [36] to perform model quantization [32] and compute post-quantization accuracy metrics\(^2\). Both weights and activations are quantized to 8-bit integers (W8A8 setup). We experimented with both Post-Training Quantization (PTQ) techniques and Quantization Aware Training (QAT). When we use QAT, we re-initialize the optimizer with a very small learning rate (usually $4 \times 10^{-6}$).

\(^2\)Additionally, we confirmed accuracy numbers on target for a subset of the models and typically found the simulated numbers produced by AIMET to be within a 0.02 range of the actual numbers obtained on target.

**On-device profiling** We profile the models on the Hexagon Processor of a device with Snapdragon 8 Gen 1 and report the average latency obtained on 100 inputs of spatial resolution $512 \times 512$. Before profiling, the model is converted from PyTorch [33] to ONNX. Please see the appendix for more details about the model conversion steps.

4. Experimental results

In this section, we compare QuickSRNet against existing SR architectures in terms of accuracy-to-latency trade-offs and demonstrate the effectiveness of our training tricks to improve robustness to quantization through ablation studies.

**Scaling laws of QuickSRNet** We experimented with several architecture specifications, varying the number of conv modules $m$ and the number of feature channels $f$. PSNR and SSIM scores on the BSD100 dataset obtained with each specification and a scaling factor of 2 can be found.

Table 2. PSNRs (dB) and latencies (ms) of various QuickSRNet configurations (\(f\) : number of feature channels, \(m\) : number of convolutional blocks in the network). We report PSNR numbers obtained before and after quantization. We also report latency measurements on a 512 × 512 input, obtained on a device with Snapdragon 8 Gen 1, and gains introduced by not using an input-to-output residual connection.

Table 3. PSNRs (dB) and latencies (ms) of existing SISR solutions on BSD100. Please note that we re-implemented, trained, and quantized all architectures from scratch. Latency numbers were measured on a device with Snapdragon 8 Gen 1, using a 512 × 512 input.

Figure 7. Ablation study comparing the post-quantization PSNR drop from FP16 when removing identity initialization and/or ReLU1 activations from the architecture design.

<table>
<thead>
<tr>
<th>Specification</th>
<th>Unquantized</th>
<th>No optimizations</th>
<th>QAT</th>
<th>Per-channel QAT</th>
<th>Per-channel AdaRound (post-training)</th>
</tr>
</thead>
<tbody>
<tr>
<td>QuickSRNet-Small</td>
<td>31.61</td>
<td>30.81 (-0.80)</td>
<td>31.34 (-0.27)</td>
<td>31.57 (-0.04)</td>
<td>31.56 (-0.05)</td>
</tr>
<tr>
<td>QuickSRNet-Medium</td>
<td>31.82</td>
<td>30.74 (-1.08)</td>
<td>31.61 (-0.21)</td>
<td>31.75 (-0.07)</td>
<td>31.77 (-0.05)</td>
</tr>
<tr>
<td>QuickSRNet-Large</td>
<td>32.07</td>
<td>31.37 (-0.70)</td>
<td>31.90 (-0.10)</td>
<td>31.97 (-0.10)</td>
<td>31.99 (-0.08)</td>
</tr>
</tbody>
</table>

Table 4. Impact of various quantization techniques on accuracy. Activations are always quantized to 8-bit integers using per-tensor quantization. For weights, we tried both per-tensor and per-channel quantization and found the latter to work significantly better.

Figure 8. SISR ($2 \times$) for Gaming: (a) Low-resolution, (b) Bicubic interpolation, (c) FSR1.0 [14], and (d) QuickSRNet-Small (ours)

<table>
<thead>
<tr>
<th>Method</th>
<th>PSNR</th>
<th>SSIM</th>
</tr>
</thead>
<tbody>
<tr>
<td>Bicubic</td>
<td>28.88</td>
<td>0.8683</td>
</tr>
<tr>
<td>FSR1.0</td>
<td>29.01</td>
<td>0.8707</td>
</tr>
<tr>
<td>QuickSRNet-Small</td>
<td>29.71</td>
<td>0.8806</td>
</tr>
</tbody>
</table>

Table 5. PSNR/SSIM scores for different $2 \times$ single-image super-resolution solutions for gaming.

Table 6. QuickSRNet-Small latency (ms) running at different target resolutions on a device with Snapdragon 8 Gen 1.

**W8A8 quantization** For all our experiments, we quantize both model weights and activations to 8-bit integers. Without any optimizations, we observe a significant drop in accuracy after quantization (see Tab. 4). While finetuning the per-tensor quantized weights via QAT can recover some of this drop, we found per-channel weight quantization to be important.
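To illustrate why per-channel weight quantization can matter, the sketch below compares the rounding error of per-tensor and per-channel 8-bit quantization on toy weights whose output channels have very different ranges. It uses a generic symmetric quantize-dequantize scheme; the function name `fake_quantize` and the toy weights are assumptions, and this is not the AIMET API or necessarily the exact scheme used in the experiments.

```python
import numpy as np

def fake_quantize(w: np.ndarray, axis=None, bits: int = 8) -> np.ndarray:
    """Symmetric uniform quantize-dequantize.

    axis=None: a single scale for the whole tensor (per-tensor).
    axis=0:    one scale per output channel (per-channel).
    """
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits
    if axis is None:
        max_abs = np.abs(w).max()
    else:
        reduce_axes = tuple(a for a in range(w.ndim) if a != axis)
        max_abs = np.abs(w).max(axis=reduce_axes, keepdims=True)
    scale = max_abs / qmax
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

rng = np.random.default_rng(0)
# Toy conv weights (C_out, C_in, k, k); each output channel has a different magnitude.
magnitudes = np.array([0.01, 0.1, 1.0, 10.0]).reshape(4, 1, 1, 1)
w = rng.normal(size=(4, 8, 3, 3)) * magnitudes

# Mean absolute rounding error, reported per output channel.
err_tensor = np.abs(fake_quantize(w) - w).mean(axis=(1, 2, 3))
err_channel = np.abs(fake_quantize(w, axis=0) - w).mean(axis=(1, 2, 3))
print("per-tensor :", err_tensor)
print("per-channel:", err_channel)
```

With a single scale, the step size is dictated by the largest channel, so small-magnitude channels collapse onto a handful of grid points. Per-channel scales give each output channel its own grid.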

Furthermore, we experimented with several post-training quantization methods, including: cross-layer equalization (CLE) [31], bias correction (BC) [31], and adaptive rounding (AdaRound) [30], and found AdaRound to obtain comparable performance to per-channel QAT, outperforming the other PTQ approaches. CLE did not work well in our experiments, most likely because it skews the activation values outside the ReLU1 range. In an attempt to further improve post-quantization accuracy, we tried finetuning the per-channel Adarounded weights using QAT but this did not improve post-quantization accuracy.

**Robustness to quantization** Overall, QuickSRNet quantizes well to W8A8. As can be seen in Fig. 5, images produced by the quantized model are indistinguishable from their full-precision counterparts. In Fig. 7, we visualize the drop in PSNR post quantization and show the benefits of combining identity initialization and ReLU1 activations. Regardless of the model size, removing one or both of these ingredients from the model design results in a significantly worse accuracy after quantization.

**Less prone to block artifacts** Our experiments show that architectures with a nearest-neighbour upsampling skip connection tend to produce outputs with block-like artifacts of size $S \times S$. Interestingly, our residual-free architecture seems less prone to this issue and produces more perceptually pleasing results. A visual comparison of such artifacts can be seen in Fig. 6.

5. DL-based SISR for mobile gaming

A real-world application of efficient super-resolution is video gaming. While DL-based super-resolution (or supersampling) has already been commercialized on high-end gaming desktops [7, 26], these solutions are not supported on mobile platforms yet. One specificity of gaming content is that synthetically rendered images are significantly more aliased than natural images. Nevertheless, we find QuickSRNet-Small to work well on this domain, with no changes needed apart from re-training it on gaming data. Figure 8 shows some results obtained by QuickSRNet-Small when applied to gaming content. We compare our results against non-ML based single-frame upscaling approaches, including an FSR1.0 baseline [14] which was specifically designed for this use case. Overall, we find that QuickSRNet-Small produces better-looking images compared to the other baselines. The visual benefits also translate into PSNR and SSIM gains, as can be seen in Tab. 5. In terms of latency, Tab. 6 shows QuickSRNet-Small latency measurements at various target resolutions, from 540p to 4k. In the future, we would like to extend our architecture to the multi-frame case which has become the de facto standard for video gaming (e.g. FSR 2.0 [15], DLSS 2.0 [25], XeSS [8]).

Figure 9. Two different architecture modifications to implement 1.5× upscaling: (a) Naïve approach, where we repurpose a 3× architecture by adding an average pooling layer on top, (b) Our approach, where we halve the resolution inside the network and map to target resolution using a 3× subpixel conv.

<table>
<thead>
<tr>
<th>Specification</th>
<th>Bicubic</th>
<th>Naïve Baseline</th>
<th>Proposed Approach</th>
</tr>
</thead>
<tbody>
<tr>
<td>Small</td>
<td rowspan="3">32.47</td>
<td>34.71</td>
<td>34.89</td>
</tr>
<tr>
<td>Medium</td>
<td>34.87</td>
<td>35.13</td>
</tr>
<tr>
<td>Large</td>
<td>35.18</td>
<td>35.47</td>
</tr>
</tbody>
</table>

Table 7. PSNRs (dB) evaluated after quantization on BSD100 dataset via 1.5× upscaling

5.1. QuickSRNet 1.5×

Standard super-resolution datasets are usually limited to 2×, 3× or 4× upscaling and non-integer scaling factors are rarely explored. On the other hand, 1.5× upscaling is often proposed in VR and gaming applications\(^3\). In this section, we describe an approach to perform 1.5× upscaling, a setting that is not trivially supported by most efficient SR architectures as non-integer scaling factors are not compatible with the final sub-pixel convolution.

**3× upscaling followed by 2× downscaling baseline** A naïve approach to 1.5× upscaling is to downscale the output of a 3× SR model by a factor of 2. This can be achieved by adding a 2 × 2 average pooling layer at the end of the architecture.

**Proposed 1.5× upscaling approach** Instead, we propose to halve the resolution inside the network using a space-to-depth operation with a block size of 2, and then map the result to the target resolution using a 3× subpixel convolution. To compensate for the 4× increase in channels caused by the space-to-depth operation, we implement the subpixel convolution using a 1 × 1 kernel.

Figure 9 shows the two considered 1.5× architecture heads. As can be seen in Tab. 7, the proposed approach significantly outperforms the naïve 3× upscaling followed by 2× downscaling baseline.
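The shape bookkeeping of the two 1.5× heads can be traced with a short NumPy sketch. All function names, the channel orderings, the random weights, and the toy sizes are assumptions made for illustration. The naïve head also uses a 1 × 1 kernel here for brevity, since only the shapes are of interest.

```python
import numpy as np

def space_to_depth(x: np.ndarray, b: int) -> np.ndarray:
    """(C, H, W) -> (C*b*b, H/b, W/b)."""
    c, h, w = x.shape
    x = x.reshape(c, h // b, b, w // b, b)   # axes: (C, H/b, i, W/b, j)
    x = x.transpose(0, 2, 4, 1, 3)           # axes: (C, i, j, H/b, W/b)
    return x.reshape(c * b * b, h // b, w // b)

def depth_to_space(x: np.ndarray, s: int) -> np.ndarray:
    """(C*s*s, H, W) -> (C, H*s, W*s)."""
    cs2, h, w = x.shape
    c = cs2 // (s * s)
    x = x.reshape(c, s, s, h, w)             # axes: (C, i, j, H, W)
    x = x.transpose(0, 3, 1, 4, 2)           # axes: (C, H, i, W, j)
    return x.reshape(c, h * s, w * s)

def conv1x1(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """1x1 convolution; x is (C_in, H, W), w is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", w, x)

def avg_pool2x2(x: np.ndarray) -> np.ndarray:
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

rng = np.random.default_rng(0)
f, h, w = 8, 6, 10
feats = rng.random((f, h, w))  # last feature map, at input resolution

# (a) Naive head: 3x subpixel conv (27 = 3 * 3**2 channels), then 2x2 average pooling.
naive = avg_pool2x2(depth_to_space(conv1x1(feats, rng.normal(size=(27, f))), 3))

# (b) Proposed head: space-to-depth halves H and W and quadruples the channels,
#     then a 1x1 conv to 27 channels and a 3x depth-to-space.
x = space_to_depth(feats, 2)                                    # (4f, H/2, W/2)
proposed = depth_to_space(conv1x1(x, rng.normal(size=(27, 4 * f))), 3)

print(naive.shape, proposed.shape)  # (3, 9, 15) (3, 9, 15)
```

Both heads reach the 1.5× target, but the proposed head never materializes the 3× image: its convolution runs on a feature map with a quarter of the spatial positions.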

6. Conclusion

In this study, we propose QuickSRNet, an efficient super-resolution architecture for mobile platforms. We have thoroughly analyzed the performance of our models and existing ones, systematically checking accuracy after quantization and profiling latency on a mobile device. Our experiments have shown that QuickSRNet is well suited for real-time applications on mobile devices due to its high speed and good accuracy. We have also demonstrated the effectiveness of our solution on a real-world use case (mobile gaming) and believe that our training tricks to improve robustness to quantization are applicable to other works. We have released the implementation and pretrained weights (including quantized weights) of QuickSRNet models as part of the AIMET model zoo\(^4\). We believe that QuickSRNet provides a practical solution for applications that require real-time super-resolution capabilities.

\(^3\)Both DLSS and FSR support 1.5× via their “Quality” mode.

\(^4\)For QuickSRNet-Large, the released version of the model includes the input-to-output residual connection, as this leads to slightly higher accuracy and the latency improvement from removing it (-7%) is minimal for larger architectures.

References

[28] Zhi-Song Liu, Li-Wen Wang, Chu-Tak Li, and Wan-Chi Siu. Hierarchical back projection network for image super-resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 0–0, 2019.


