edge–SR: Super–Resolution For The Masses

Pablo Navarrete Michelini, Yunhua Lu, Xingqun Jiang
BOE Technology Group Co., Ltd.

Abstract

Classic image scaling (e.g. bicubic) can be seen as one convolutional layer and a single upscaling filter. Its implementation is ubiquitous in all display devices and image processing software. In the last decade deep learning systems have been introduced for the task of image super–resolution (SR), using several convolutional layers and numerous filters. These methods have taken over the benchmarks of image quality for upscaling tasks. Would it be possible to replace classic upscalers with deep learning architectures on edge devices such as display panels, tablets, laptop computers, etc.? On one hand, the current trend in Edge–AI chips shows a promising future in this direction, with rapid development of hardware that can run deep–learning tasks efficiently. On the other hand, in image SR only a few architectures have pushed the limit to extremely small sizes that can actually run on edge devices in real time. We explore possible solutions to this problem with the aim of filling the gap between classic upscalers and small deep learning configurations. As a transition from classic to deep–learning upscaling we propose edge–SR (eSR), a set of one–layer architectures that use interpretable mechanisms to upscale images. Certainly, a one–layer architecture cannot reach the quality of deep learning systems. Nevertheless, we find that for high speed requirements, eSR offers a better trade–off between image quality and runtime performance. Filling the gap between classic and deep–learning architectures for image upsampling is critical for massive adoption of this technology. It is equally important to have an interpretable system that can reveal the inner strategies to solve this problem and guide us to future improvements and better understanding of larger networks.

1. Introduction

A market is growing rapidly and steadily to provide so–called Edge–AI chips that will be able to spread the success of deep–learning systems to edge devices [13, 17, 4]. This is a massive market that includes phones, tablets and high resolution TV displays, among others. For some applications the success is guaranteed, such as image classification or object detection, where input images are relatively small (e.g. 256 × 256) and the output data is low dimensional (e.g. labels or bounding boxes). For other applications such as recovering a high–resolution image from a low–resolution image, also known as image super–resolution (SR), the future is less certain since both input and output images can contain a large amount of data. Consider, for example, upscaling images from Full–HD to 4K resolution in TV displays. The input layer needs to handle 2 megapixels and the output layer needs to deliver 8 megapixels at a rate of at least 24 frames per second. Interestingly, upsampling with small factors (e.g. 2×) is both the easiest problem for networks to solve, typically requiring fewer parameters to learn, and at the same time the most difficult solution to deploy. The latter is due to the fact that display devices have a fixed output resolution. For small upscaling factors the input images are still large and demand higher input throughput compared to higher upsampling factors, where input images get smaller and smaller. Small upsampling factors are also of primary concern in applications since they are the most critical technology for transitions between current and new standards (e.g. FHD to 4K, 4K to 8K, etc.). Thus, the problem of image SR becomes both more interesting and more challenging given extreme performance constraints.
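The throughput figures in the Full–HD to 4K example can be checked with a short calculation. This is only a sanity check, assuming the usual frame sizes of 1920 × 1080 for Full–HD and 3840 × 2160 for 4K (the latter is an assumption, since the text does not state it) and the minimum rate of 24 frames per second:

```latex
\begin{aligned}
\text{input frame:}  \quad & 1920 \times 1080 = 2{,}073{,}600 \approx 2 \text{ megapixels},\\
\text{output frame:} \quad & 3840 \times 2160 = 8{,}294{,}400 \approx 8 \text{ megapixels},\\
\text{output rate:}  \quad & 8{,}294{,}400 \times 24 = 199{,}065{,}600 \approx 2 \times 10^{8} \text{ pixels per second}.
\end{aligned}
```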

**History of SR.** Standard upscaler algorithms, such as linear or bicubic upscalers, apply a low-pass filter on a high resolution image created by inserting zeros between adjacent pixels of the low resolution image [24, 23]. Modern tensor processing frameworks (e.g. PyTorch, TensorFlow, etc.) implement this process using a so-called **strided transposed convolutional layer** with a single filter per input channel. More advanced upscalers have followed geometric principles to improve image quality. For example, *edge-directed interpolation* uses adaptive filters to improve edge smoothness [2, 18], and *bandlet* methods use both adaptive upsampling and filtering [24]. Later on, machine learning methods used examples of pristine high-resolution images to learn a mapping from low-resolution [60]. The rise of deep–learning and convolutional networks in image classification tasks [15] was quickly followed by a series of important improvements in image SR. Many of these improvements followed the progress in network architectures for image classification, as seen for example with CNNs applied in SRCNN [5], ResNets [18] applied in EDSR [20], DenseNets [10] applied in RDN [43], attention [9] applied in RCAN [41], non–local attention [36] applied in RNAN [42], and swin transformers [22] applied in SwinIR [19].

**Real–time SR.** The first deep learning system proposed for image SR, namely SRCNN [5], used a relatively small number of parameters (60k) and became a suitable candidate for edge devices. Soon after, FSRCNN [6] showed that significant improvements in quality and performance can be achieved by performing computations at low resolution. Its authors proposed a **short** configuration using 4k parameters in a sequence of 4 convolutional layers, plus a final strided transposed convolution to perform upscaling, reaching real–time performance for small resolutions. The next major progress towards real–time applications was made by ESPCN [33], which popularized the use of pixel–shuffle layers, multiplexing several network channels to form higher resolution outputs [29, 27]. Its authors proposed a configuration using 20k parameters and 3 convolutional layers with all computations performed at low resolution. Both FSRCNN and ESPCN left a strong mark on subsequent image SR research, which very often performs computations at low resolution and uses pixel–shuffle layers. Nevertheless, the research clearly shifted to networks of larger sizes that can achieve much better quality. But large networks that contain several million parameters, for example EDSR [20] (combining ResNets and pixel–shuffle), are currently unable to reach the throughput needed for real–time applications on edge devices. Several so–called lightweight networks have been proposed for middle ground applications [40, 16, 21, 37, 3, 12]. Typical lightweight networks use hundreds of thousands of parameters and are still beyond the capabilities of real–time applications on edge devices.

**The Problem.** Despite the promising advances in technology, the challenge of image SR for edge devices remains largely unresolved. One might expect Edge–AI chips to get faster and cheaper, but standards also evolve to make problems more difficult (e.g. BT.2020 [34]) with more pixels, higher bit depths, higher framerate, etc. Thus, the success of AI chips in deploying image SR technologies and reaching massive markets strongly depends on better algorithmic solutions. The major challenge is how to simplify network structures all the way down to performance levels comparable to those of classic non–adaptive upscalers. A classic 2× bicubic upscaler, doubling the horizontal and vertical resolution, can be implemented using a transposed convolutional layer with a single filter using 121 parameters. We can think of this as the simplest possible network configuration for image SR, one that is interpretable in the sense that we understand what the interpolation filter values represent. Our main task here is to explore the landscape between classic upscaling on one hand, and small deep–learning systems on the other hand, in order to provide practical solutions for the current state of applications in edge devices.

**Towards a solution.** Exploring different configurations for existing networks, such as FSRCNN and ESPCN, is a straightforward and necessary task to undertake. But we propose to move a step further, introducing a minimal set of architectures, *edge–SR (eSR)*, that can perform image SR even with a single convolutional layer. We explore both a straightforward 1–layer Maxout network (eSR-MAX) and self–attention strategies (eSR-TM and eSR-TR) that provide a semi–classical interpretation. The latter approaches use a single layer both to detect local patterns (e.g. edges or textures) and to generate candidate upscale solutions. Generally speaking, the detection mechanism estimates the probability of each candidate being the best upscale solution, and these probabilities are used to compute a weighted average of the candidate output images that gives the final output. We will show how to implement this solution efficiently using standard deep learning modules that can run on AI chips.

**Contributions.** Our major contributions include:

- The **proposal** of several one–layer architectures that strive for simplicity to fill the gap between classic and deep learning upscalers.

- An **exhaustive search** among 1,185 network models, including different configurations of eSR, FSRCNN, and ESPCN. Each architecture was trained under identical conditions and tested for speed, power consumption and image quality. The results allow us to visualize the trade–off between image quality and runtime performance that is critical for our purpose. Figure 1 shows the general pattern observed in our results. We found that different architectures strike very different balances in the trade–off between speed and image quality. Multi–layer networks (deep learning) show a strong advantage at low speed and high quality, and our proposed one–layer solutions show a clear advantage at high speed requirements.

- The **interpretation and analysis** of strategies learned by self–attention in one–layer architectures. We provide a novel interpretation of the self–attention mechanism based on the simple principles of template matching and classic upscaling. Here, training results indicate that one–layer networks do not use smooth upscaling kernels and rely mostly on independent sub–pixel solutions.

These results may bring about the following future impact: 1) the possibility of image SR systems that can be massively deployed on edge devices, 2) a better understanding of the internal learning mechanisms of small network architectures, and 3) a better appreciation of the trade–off between image quality and runtime performance for future applications and research.

Figure 2. Classic $s \times s$ image upscaling is performed by a transposed convolutional layer. An efficient implementation splits the filter into $s^2$ smaller filters that work at LR. The final output is obtained by multiplexing the $s^2$ channels using a pixel–shuffle layer.

2. Super–Resolution for Edge Devices

Classical. Image upscaling and downscaling refer to the conversion of low resolution (LR) images to high resolution (HR) and vice versa. These two processes are closely related. The simplest way to downscale an image from HR to LR is known as pooling or downsampling, which uniformly drops pixels in both horizontal and vertical directions. The problem with such downscaling is that groups of high and low frequency components of the HR image can end up in the same low frequency component at LR, leading to well known aliasing artifacts [32, 23]. To avoid this problem, a classic linear downscaler first removes high frequencies using an anti–aliasing low–pass filter and then downsamples the image. This process is implemented in tensor processing frameworks with strided convolutional layers where the kernel or weight parameters correspond to the low–pass filter coefficients. Classic linear upscaling corresponds to the transpose of the downscaling linear transformation and is illustrated in Figure 2. The transposition turns the ‘filter–then–downsample’ operation into an ‘upsample–then–filter’ operation, where the upsampling increases the resolution of an image by inserting zeros between LR pixels. The upsampling introduces high frequencies that are removed by a so–called interpolation filter with coefficients $w$. The interpolation filter is the transpose of the anti–aliasing filter, and the two are typically identical because most upscaling filters are symmetric. Tensor processing frameworks implement this process using strided transposed convolutional layers.

The upscaling definition in Figure 2 is clearly inefficient, as the upsampling introduces many zeros that waste resources when multiplied by filter coefficients. A very well known optimization, widely used in practical implementations of classic upscalers, is to split or demultiplex the interpolation filter of size $sk \times sk$ in Figure 2 into $s^2$ so-called efficient filters of size $k \times k$ working at LR [32, 23]. The outputs of the $s^2$ filters are then multiplexed by a pixel–shuffle operation to obtain the upsampled image, as illustrated in Figure 2. Let $\tilde{w}_i \in \mathbb{R}^{k \times k}$, with $i = 1, \ldots, s^2$, be the coefficients of the efficient filters. The interpolation filter can then be recovered by multiplexing the efficient coefficients back to their original place. That is,

$$w = \text{Pixel–Shuffle}_{s \times s}(\tilde{w}_i,\ i = 1, \ldots, s^2). \tag{1}$$

In our experiments we will compare different architectures including a bicubic upscaler. In order to remove implementation advantages we implemented the upscaler using the efficient implementation in Figure 2. We used standard bicubic interpolation filter coefficients and verified that we obtain the same outputs as other software implementations up to floating point precision.
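The equivalence between the 'upsample–then–filter' definition and the efficient implementation can be illustrated with a minimal one–dimensional sketch. It is a 1-D analogue written for illustration, not the implementation used in the experiments; the function names, the test signal and the choice of a zero–padded linear interpolation filter are all assumptions.

```python
def conv_full(x, w):
    """Full 1-D convolution of signal x with filter w."""
    out = [0.0] * (len(x) + len(w) - 1)
    for n, xn in enumerate(x):
        for m, wm in enumerate(w):
            out[n + m] += xn * wm
    return out


def upsample_then_filter(y, w, s):
    """Classic definition: insert s-1 zeros after each LR sample, then filter at HR."""
    up = [0.0] * (len(y) * s)
    up[::s] = y
    return conv_full(up, w)


def efficient_upscale(y, w, s):
    """Efficient version: the s filters w[i::s] run at LR and their outputs
    are interleaved, which is the 1-D counterpart of pixel-shuffle."""
    phases = [conv_full(y, w[i::s]) for i in range(s)]
    return [phases[i][n] for n in range(len(phases[0])) for i in range(s)]


# Illustrative values (assumptions): 2x linear interpolation filter,
# zero-padded so that len(w) == s * k with k = 2.
y = [1.0, 3.0, 2.0]
w = [0.5, 1.0, 0.5, 0.0]
s = 2
direct = upsample_then_filter(y, w, s)
fast = efficient_upscale(y, w, s)
print(direct[:len(fast)] == fast)  # the two outputs agree sample by sample
```

The direct version is slightly longer only because of the trailing zeros introduced by the upsampling; those extra samples are all zero. The efficient version never multiplies filter coefficients by inserted zeros, which is where the savings come from.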

Maxout. Our first proposal is edge–SR Maximum (eSR–MAX). This is an attempt to obtain the fastest solution from a single convolutional layer that outputs several upsampled candidates. A quick decision is made by choosing the maximum value across all channels as shown in Figure 3. This

Algorithm 1.

**eSR–MAX**($y, C, k, s$):

Parameters: integers $C > 1$, $k > 1$, $s > 1$.

\begin{enumerate}
  \item $Y = \max_{1 \rightarrow C} \text{Pixel–Shuffle}_{s \times s} (\text{Conv}_{k \times k} (y))$
\end{enumerate}

**eSR–TR**($y, C, k, s$):

Parameters: integers $C > 1$, $k > 1$, $s > 1$.

\begin{enumerate}
  \item $f = \text{Pixel–Shuffle}_{s \times s} (\text{Conv}_{k \times k} (y))$
  \item $p = \text{SoftMax}(f_{1 \rightarrow C} \odot f_{C+1 \rightarrow 2C})$
  \item $Y = \sum_{1 \rightarrow C} (f_{2C+1 \rightarrow 3C} \odot p)$
\end{enumerate}

**eSR–CNN**($y, C, D, S, s$):

Parameters: integers $C > 1$, $D > 1$, $S > 1$, $s > 1$.

\begin{enumerate}
  \item $f = \text{Pixel–Shuffle}_{s \times s} \circ \text{Conv}_{S \times S} \circ \text{Tanh} \circ \text{Conv}_{3 \times 3} \circ \text{Tanh} \circ \text{Conv}_{3 \times 3} (y)$
  \item $Y = \sum_{1 \rightarrow C} (f_{C+1 \rightarrow 2C} \odot \text{SoftMax}(f_{1 \rightarrow C}))$
\end{enumerate}
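The eSR–MAX step can be sketched in plain Python to make the data movement explicit. The convolution is taken as given, and the sketch only shows the pixel–shuffle followed by the maximum across the $C$ channels. The channel layout (the same as the one documented for `torch.nn.PixelShuffle`), the function names and the toy values are assumptions for illustration.

```python
def pixel_shuffle(x, s):
    """Rearrange C*s*s channels of size HxW into C channels of size (H*s)x(W*s).

    Assumed layout: input channel c*s*s + i*s + j supplies output pixel
    (h*s + i, w*s + j) of output channel c.
    """
    C, H, W = len(x) // (s * s), len(x[0]), len(x[0][0])
    out = [[[0.0] * (W * s) for _ in range(H * s)] for _ in range(C)]
    for c in range(C):
        for i in range(s):
            for j in range(s):
                ch = x[c * s * s + i * s + j]
                for h in range(H):
                    for w in range(W):
                        out[c][h * s + i][w * s + j] = ch[h][w]
    return out


def esr_max_head(conv_out, s):
    """Pixel-shuffle the convolution output, then take the max over channels."""
    f = pixel_shuffle(conv_out, s)
    H, W = len(f[0]), len(f[0][0])
    return [[max(ch[h][w] for ch in f) for w in range(W)] for h in range(H)]


# Hypothetical convolution output: C = 2 candidates, s = 2, one LR pixel.
conv_out = [[[1.0]], [[2.0]], [[3.0]], [[4.0]],   # candidate 0
            [[5.0]], [[0.0]], [[7.0]], [[1.0]]]   # candidate 1
Y = esr_max_head(conv_out, 2)
print(Y)  # [[5.0, 2.0], [7.0, 4.0]]
```

Each HR pixel simply keeps the largest of the $C$ candidate values, so no normalization or multiplication is needed after the convolution.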

Self–Attention. Our second proposal is edge–SR Template Matching (eSR–TM), which follows a semi–classical strategy. The basic idea is explained in Figure 5. First, a template matching module detects patterns (e.g. edge directions) and gives us the probability of each pattern. This is achieved in two steps: first, matching filters with coefficients that resemble each pattern are applied, and second, pixel values are normalized across channels to represent the probability of each template. A set of upscaled images, one for each pattern, is computed at the same time. Since both the matching and the upscaling filters follow the same patterns, we expect the filter coefficients to look similar, as displayed in Figure 5 for the case of edge patterns. Thus, we can verify whether an eSR–TM configuration learned to perform template matching by checking the correlations between filter coefficients. The optimal prediction for the output image is the expected value over all templates. Thus, the probabilities are used to compute the expected value by weighting the solutions of the different upscalers, which when combined give the final output.

Figure 4a shows the diagram of the efficient implementation of this idea using $C \in \mathbb{N}^+$ templates. In this efficient implementation of a transposed convolution, the $C$ matching filters $K$ are split into $Cs^2$ efficient filters $\hat{K}$ before multiplexing with pixel–shuffle. We can always get the interpolation filters $K$ from $\hat{K}$ using equation (1). The outputs of the matching filters are normalized across channels with a softmax to give the probability of each template:

\begin{equation}
  p_i = \frac{e^{K_i * (y \uparrow s)}}{\sum_{j=1}^{C} e^{K_j * (y \uparrow s)}},
\end{equation}

where $i = 1, \ldots, C$, $*$ is the convolution operator, and $\uparrow$ refers to the upsampling operation defined in Figure 2. The same convolutional layer in Figure 4a runs $Cs^2$ efficient filters $V$ to get $C$ high resolution candidates after pixel–shuffle. The final luminance HR output image $Y$ is given by:

$$Y = E[V_i * (y \uparrow s)] = \sum_{i=1}^{C} p_i \odot (V_i * (y \uparrow s)),$$

where $\odot$ represents a Hadamard (or pixel–wise) product.
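The softmax and expected value above act independently on every HR pixel, so they can be illustrated for a single pixel. In this minimal sketch the matching and upscaling filter outputs are taken as given numbers; the function names and values are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of numbers."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def esr_tm_pixel(match, cand):
    """Expected value over C templates for one HR pixel.

    match[i]: output of matching filter i at this pixel (after pixel-shuffle).
    cand[i]:  output of upscaling filter i at the same pixel.
    """
    p = softmax(match)
    return sum(pi * vi for pi, vi in zip(p, cand))


# Equal matching scores: the output is the plain average of the candidates.
print(esr_tm_pixel([0.0, 0.0], [1.0, 3.0]))  # 2.0
# One dominant template: the output is close to that template's candidate.
print(esr_tm_pixel([20.0, 0.0], [1.0, 3.0]))
```

When one template clearly wins, the module behaves like a hard selection among upscalers. When the scores are similar, it blends the candidates.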

The eSR–TM system is essentially a self–attention module, except for the pixel–shuffle layer and the sum over all channels in this last stage. These two differences are significant since: first, they embed the upsampling process within the attention module, and second, they make explicit use of probabilities to compute an expected value thus providing a clear interpretation of this module.

Our third proposal is edge–SR TRansformer (eSR–TR) that uses the popular transformer self–attention module from [35]. Figure 4b shows the efficient implementation of this system. Here, the matching filters from eSR–TM are replaced by two sets of query ($Q$) and key ($K$) filters to estimate the probabilities. This changes the template matching interpretation of eSR–TM, using a rank–1 quadratic form with $Q$ and $K$ filters instead of a single template matching filter. The purpose of this architecture is to test any advantage that this change could bring given the increasing popularity and success of this module in recent research.

The code for all eSR systems is given in Algorithm 1.

Deep–Learning. We consider FSRCNN [6] and ESPCN [33] as candidate deep learning architectures for image SR on edge devices. Figure 6 shows the detailed structure of the FSRCNN and ESPCN network architectures. In comparison, FSRCNN uses more layers (at least 5) and a smaller number of channels per layer than ESPCN. Another difference is the upsampling strategy, with FSRCNN using a strided transposed convolution and ESPCN using pixel–shuffle. According to classic interpolation theory these two approaches are equivalent, as shown in Figure 2 (see also [32, 23]), but implementations can be different. Tensor processing frameworks typically implement transposed convolution using the gradient of a convolutional layer [31], based on the vector calculus property for gradients of linear transformations: $\nabla_x \left[ (Ax + b)^T y \right] = A^T y$. This very different approach might lead to differences in performance.
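The gradient property used for transposed convolutions follows from a one–line component–wise calculation. Here $A$ stands for the matrix of the strided convolution and $y$ for the signal being propagated back through it. The index notation is introduced only for this illustration:

```latex
\frac{\partial}{\partial x_j} \left[ (Ax + b)^T y \right]
  = \frac{\partial}{\partial x_j} \sum_{i} \Big( \sum_{k} A_{ik} x_k + b_i \Big) y_i
  = \sum_{i} A_{ij}\, y_i
  = (A^T y)_j .
```

In other words, computing the gradient of a strided convolutional layer with respect to its input applies the transposed operator $A^T$ to $y$, which is exactly the 'upsample–then–filter' operation of Figure 2.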

Finally, we also propose the edge–SR CNN (eSR–CNN) architecture in Figure 4c and Algorithm 1. This is simply an extension of the single convolutional layer in eSR–TM into a multi–layer structure identical to ESPCN. Here, the purpose is to test whether ESPCN, which achieves better results than FSRCNN in our tests, can be improved by using a self–attention module to upscale.

3. Experiments

**Models.** Candidate models for test evaluations include bicubic, FSRCNN, ESPCN and eSR. Of these, the classic bicubic upscaler is the only one with a fixed configuration: it has no hyper–parameters and does not require training. For the other architectures we need to train a model for each set of hyper–parameters. Table 1 shows the list of hyper–parameters chosen for our experiments. These include the default settings of FSRCNN and ESPCN as well as configurations with a very small number of parameters. Our model pool includes a total of 1,185 models to evaluate.

**Training.** We need to train a total of 1,185 models that include different scaling factors, network architectures and model hyper–parameters. We trained all these models independently using an identical procedure. We used the General–100 dataset [6] combined with the 91–image dataset [38] to extract training patches. For each image in the dataset we randomly cut an HR patch of size $78 \times 78$ for the $2 \times$ and $3 \times$ upsampling factors, and $76 \times 76$ for the $4 \times$ factor. The images were converted to grayscale using the BT.609 color matrix and downscaled using a standard bicubic algorithm. We used a minibatch size of 16 and trained each model for 25,000 epochs using a standard mean–square–error (MSE) loss. We started with a learning rate of $10^{-4}$ and halved it once every 3,000 epochs. We used the Adam optimizer [14] with $\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\epsilon = 10^{-8}$. We used seven Tesla M40 GPUs for training, with the whole process completed in about two months.
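The learning rate schedule described above can be written as a small step function. This is a sketch under the assumption that epochs are counted from zero and that the rate drops exactly at multiples of 3,000; the text does not specify either detail, and the function name is hypothetical.

```python
def learning_rate(epoch, base_lr=1e-4, halve_every=3000):
    """Step schedule: start at base_lr and halve once every `halve_every` epochs."""
    return base_lr * 0.5 ** (epoch // halve_every)


for epoch in (0, 2999, 3000, 24999):
    print(epoch, learning_rate(epoch))
```

Under these assumptions the rate is halved 8 times over the 25,000 epochs, ending at $10^{-4} / 2^8 \approx 3.9 \times 10^{-7}$.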

**Measurements.** To test our final models we considered two inference devices: 1) Nvidia Jetson AGX Xavier, an embedded system–on–module (SoM) from the Nvidia AGX Systems family, including an integrated Volta GPU with tensor cores, and 2) a Raspberry Pi 400, an embedded device featuring a quad-core 1.8 GHz, 64-bit ARM Cortex CPU. The power consumption of the Jetson AGX is set to a 30 Watt profile, while the nominal consumption of the Raspberry Pi 400 is 15 Watt.

**Table 1.** Set of hyper–parameters used to create a pool of 1,185 models that were trained and tested in our experiments.

<table>
<thead>
<tr>
<th>Model</th>
<th>Total</th>
<th>Notation</th>
</tr>
</thead>
<tbody>
<tr>
<td>Bicubic</td>
<td>1</td>
<td>1 model per scale factor.</td>
</tr>
<tr>
<td>eSR</td>
<td>144</td>
<td>C: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16. k: 3, 5, 7. Type: Maximum (MAX), Template Matching (TM), Transformer (TR). Total: 144 models per scale factor.</td>
</tr>
<tr>
<td>eSR-CNN</td>
<td>100</td>
<td></td>
</tr>
<tr>
<td>FSRCNN</td>
<td>50</td>
<td>D: 6, 19, 32, 44, 56. S: 1, 3, 6, 9, 12. M: 1, 4.</td>
</tr>
<tr>
<td>ESPCN</td>
<td>100</td>
<td></td>
</tr>
<tr>
<td>Factors</td>
<td>3</td>
<td>Factor: 2x, 3x, 4x.</td>
</tr>
<tr>
<td>Total</td>
<td>1,185</td>
<td></td>
</tr>
</tbody>
</table>

Figure 7. Scatter plot to compare speed, in number of Full-HD pixels per second, with respect to quality, measured as PSNR for the BSDS-100 dataset. A total of 1,185 models were identically trained considering different upscaling factors (2×, 3× and 4×) and architectures (eSR, ESPCN and FSRCNN). We ran all models on edge devices: Jetson AGX Xavier (GPU with 16-bit floating point precision) and Raspberry Pi 400 (CPU with 32-bit floating point precision). Magnified plots with model annotations are provided in the Appendix.

We run each model to output a set of 14 Full-HD images, downscaling appropriately from randomly selected images of the DIV2K dataset [1]. We use 16-bit floating point precision during inference. For each image we run the model 10 times to avoid warm-up effects, measuring the minimum CPU and GPU processing time from the profiler data. We computed the speed of a model as the total number of pixels processed (considering only one run per image) divided by the processing time (using, for each image, the minimum time over the 10 runs). To make the measurement of speed easier to read we use units of [FHD/s], that is, the number of Full-HD images (1920 × 1080 pixels) processed per second.
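The speed measurement can be sketched as follows. This is one reading of the procedure, assuming that the processing time is the sum over images of the fastest run for each image; the function name and the timings are hypothetical.

```python
FHD_PIXELS = 1920 * 1080


def speed_fhd_per_second(run_times, pixels_per_image=FHD_PIXELS):
    """run_times[i] holds the measured times (in seconds) of the repeated runs for image i."""
    total_pixels = pixels_per_image * len(run_times)      # one run per image
    total_time = sum(min(times) for times in run_times)   # fastest run per image
    return total_pixels / total_time / FHD_PIXELS


# Hypothetical timings: 2 images with 3 runs each
# (the experiments use 14 images with 10 runs each).
timings = [[0.09, 0.05, 0.06], [0.08, 0.05, 0.05]]
print(speed_fhd_per_second(timings))  # about 20 FHD/s
```

Taking the minimum over repeated runs discards slow warm-up runs, so the reported speed reflects steady-state performance.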

Image quality was measured separately using the standard datasets: Set-5, Set-14 [39], BSDS-100 [25], Urban-100 [11] and Manga-109 [26]. We also measured maximum power consumption for the Jetson AGX and CPU usage for the Raspberry Pi that does not include power sensors.

Results. Figure 7 shows scatter plots to compare speed with respect to image quality, measured as PSNR for the BSDS-100 dataset. Results for other datasets, metrics (SSIM) and devices (GTX 1080 Max-Q) are shown in the Appendix with similar conclusions. The size of each circle is proportional to the power consumption for the AGX device and to the CPU usage for the Raspberry Pi device. Finally, Table 2 shows detailed results per dataset for a subset of the models selected according to different criteria.
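For reference, PSNR is computed from the mean squared error between the reference HR image and the upscaled output. The sketch below uses the standard definition and assumes 8-bit grayscale images with a peak value of 255. The exact evaluation protocol (e.g. border cropping) is not stated in the text and is not reproduced here.

```python
import math


def psnr(reference, output, peak=255.0):
    """PSNR in dB between two equally sized grayscale images given as nested lists."""
    diffs = [(r - o) ** 2
             for ref_row, out_row in zip(reference, output)
             for r, o in zip(ref_row, out_row)]
    mse = sum(diffs) / len(diffs)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / mse)


# Hypothetical 1x2 images that differ by one gray level at every pixel.
print(psnr([[10, 20]], [[11, 21]]))
```

Because the scale is logarithmic, an improvement of 1 dB corresponds to a reduction of the MSE by a factor of about 1.26.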

4. Analysis

Trade-off. The results displayed in Figure 7 allow us to fully appreciate the trade-off between image quality and runtime performance. The bicubic upscaler sets the target, as we know that it can be deployed in display devices at massive scale. Between the bicubic upscaler and deep-learning configurations using FSRCNN, ESPCN or eSR-CNN we observe a large empty region. Our proposed edge-SR (eSR) architectures succeed in filling this gap on edge GPU devices (AGX and also GTX 1080 Max-Q, available in the Appendix) and improve on the bicubic upscaler in both speed and image quality. On the Raspberry Pi CPU device, edge-SR partially succeeds in filling this gap for the 2× and 3× upscaling factors and fails at the 4× factor, where bicubic reaches better performance. The best results of edge-SR are observed for the 2× upscaling factor. The distribution of scatter points in Figure 7 for 2× upscaling factor shows that deep-learning meth-

Table 2. Image quality and performance metrics for selected methods among all 1,185 models trained in our experiments. Values of speed, measured in number of Full–HD pixels per second, and power, in units of milliwatts, are specific to a Jetson AGX Xavier GPU. Methods are selected based on best speed, PSNR in BSDS–100 dataset, and default configurations. Best results are shown in bold (ignoring bicubic).

| Algorithm | s | Selection | Configuration | Speed [FHD/s] | Power [mWatts] | Set-5 PSNR | Set-5 SSIM | Set-14 PSNR | Set-14 SSIM | BSDS-100 PSNR | BSDS-100 SSIM | Urban-100 PSNR | Urban-100 SSIM | Manga-109 PSNR | Manga-109 SSIM |
|-----------|---|-----------|---------------|---------------|----------------|------|------|------|------|------|------|------|------|------|------|
| Bicubic   | 2 | PSNR   | CNN: C = 6, D = 3, S = 15 | 19 | 1550 | 33.73 | 0.928 | 30.29 | 0.669 | 29.57 | 0.842 | 26.89 | 0.841 | 30.85 | 0.934 |
| eSR       | 2 | speed   | MAX: k = 3, C = 1 | 34 | 1859 | 33.15 | 0.928 | 30.16 | 0.882 | 29.67 | 0.862 | 26.94 | 0.857 | 30.46 | 0.937 |
| ESPCN     | 2 | default | D = 64, S = 32 | 6 | 6800 | 36.64 | 0.953 | 32.46 | 0.907 | 31.32 | 0.887 | 29.37 | 0.893 | 35.76 | 0.967 |
| ESPCN     | 2 | PSNR   | D = 22, S = 32 | 8 | 4945 | 36.70 | 0.953 | 32.47 | 0.907 | 31.35 | 0.887 | 29.44 | 0.894 | 35.79 | 0.967 |
| ESPCN     | 2 | speed   | D = 0, S = 3 | 17 | 2324 | 29.76 | 0.919 | 28.96 | 0.881 | 28.69 | 0.865 | 26.38 | 0.853 | 27.67 | 0.938 |
| FSRCNN    | 2 | default | D = 32, S = 6, M = 1 | 4 | 4793 | 36.29 | 0.951 | 32.20 | 0.904 | 31.10 | 0.884 | 28.91 | 0.886 | 35.03 | 0.963 |
| FSRCNN    | 2 | PSNR   | D = 56, S = 12, M = 4 | 2 | 5566 | 36.74 | 0.954 | 32.45 | 0.907 | 31.34 | 0.887 | 29.42 | 0.895 | 35.87 | 0.967 |
| FSRCNN    | 2 | speed   | D = 6, S = 3, M = 1 | 5 | 3560 | 35.36 | 0.943 | 31.52 | 0.898 | 30.64 | 0.878 | 28.01 | 0.870 | 33.13 | 0.951 |

The bold values in Table 2 highlight the best metrics for different columns, ignoring bicubic. edge–SR systems reach the best speed and lowest power consumption, except for 4× where ESPCN does better. They also improve on bicubic’s image quality for small upscaling factors.

The filters in Figure 10 display the step by step processing of 2× upscaling using eSR–TM with kernel size k = 7 and C = 4 matching/upscaling filters. Here, we used equation (1) to reconstruct the 4 matching/upsampling filters from the efficient implementation containing $4 \times 2^2 = 16$ filters. In addition to the filter coefficients we also display the FFT computed using a Kaiser–Bessel window for better frequency visualization [32]. The output for this particular image is about 1.5 dB better than the bicubic output and it is displayed next to the outputs of ESPCN and FSRCNN models with similar image quality. Here, eSR–TM achieves roughly the same speed as the bicubic upscaler.

The efficient filters use kernel size k × k, and after multiplexing them with a pixel–shuffle layer we can recover the original filters of size sk × sk. Thus, the filter sizes of eSR models grow with the upscaling factor, as seen in Figure 9. The filter coefficients in frequency domain show that each model is performing template matching, with upscaling and matching filters that resemble a common template.

Figure 8. Correlations between upscaling and matching filters in eSR–TM k = 7, C = 16. Higher correlations along the diagonal mean that the model is performing template matching, with upscaling and matching filters that resemble a common template.

Figure 9. Matching and upscaling filters obtained after training a one–layer architecture eSR–TM with kernel size $k = 7$ and $C = 18$ number of filters for $2 \times$, $3 \times$ and $4 \times$ upscaling factors. Filters are displayed in the original spatial format as well as in frequency domain by using FFT visualization. The filters do not change smoothly within a single filter but show diverse directionality among different filters.

Figure 10. Inspection of all intermediate outputs and filter coefficients for the eSR–TM $2 \times$ architecture with kernel size $k = 7$ and $C = 4$ number of matching/upscaling filters. The diagram follows the interpretation in Figure 5. Filters are displayed both in the original spatial format as well as in frequency domain by using FFT visualization. Each of the 4 branches is focusing on a particular sub–pixel array.

Now, moving one step inside the network from the output in Figure 10, we observe that the 4 components of the sum are clearly focusing on different sub–pixel images. This pattern is also visible in the outputs of upscaling filters and template matching modules. Neither the matching nor the upscaling filters are smooth, and both show signs of different sub–pixel processing with some degree of directionality. This indicates that the different branches of the single convolutional layer used in eSR–TM are solving the upscaling problem independently for each sub–pixel image. This contrasts with the smooth scaling filters used in classical edge–directed interpolation and with the smooth directional filters observed in interpretations of CNN super–resolution. Next, in Figure 8, we compute the Pearson correlation between upscaling and matching filters for eSR–TM with $k = 7$ and $C = 16$. The results show dominant correlations along the diagonal, stronger for the $2 \times$ factor and weakening towards the $4 \times$ factor. Strong correlations along the diagonal indicate a template matching strategy where upscaling and matching filters are similar for the same pattern and different from those of other patterns (see Figure 5). Thus, we confirm that the training process has a tendency to converge towards a template matching strategy that is particularly strong for small upscaling factors.
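The correlation analysis of Figure 8 can be sketched as follows: each matching filter is correlated with each upscaling filter, and a dominant diagonal indicates template matching. The code is a minimal illustration with hypothetical three–tap filters flattened into lists, not the trained filters from the experiments.

```python
import math


def pearson(a, b):
    """Pearson correlation between two filters given as flat lists of coefficients."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    da = [x - mean_a for x in a]
    db = [x - mean_b for x in b]
    cov = sum(x * y for x, y in zip(da, db))
    return cov / math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))


def correlation_matrix(matching, upscaling):
    """Entry [i][j] correlates matching filter i with upscaling filter j."""
    return [[pearson(m, u) for u in upscaling] for m in matching]


# Hypothetical filters: each upscaling filter is a scaled copy of its matching filter.
matching = [[1.0, 0.0, -1.0], [1.0, -2.0, 1.0]]
upscaling = [[2.0, 0.0, -2.0], [0.5, -1.0, 0.5]]
R = correlation_matrix(matching, upscaling)
print(R)
```

In this toy case the diagonal entries are 1 and the off–diagonal entries are 0, which is the idealized version of the pattern described for small upscaling factors.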

5. Conclusions

The current trend in Edge–AI chips offers the chance to deploy efficient AI solutions at massive scale. But there is a vast range of performance requirements for which these solutions are unavailable for image SR. We propose the edge–SR architectures with the aim of filling the gap between classic and deep learning upscalers. We performed an exhaustive search among more than a thousand different models identically trained, revealing the gap between classic upscalers and deep–learning solutions. Our edge–SR configurations using a single convolutional layer showed promising results to fill this gap for small upscaling factors. The simplicity of the model also makes it interpretable and allows us to visualize and understand all the intermediate steps of the process.

References


