Image Quality Assessment for Realistic Zoom Photos

New CMOS imaging sensor (CIS) techniques in smartphones have helped user-generated content dominate our lives over traditional DSLRs. However, tiny sensor sizes and fixed focal lengths also lead to grainier details, especially in zoom photos. Moreover, multi-frame stacking and post-sharpening algorithms can produce zigzag textures and over-sharpened appearances, whose quality traditional image-quality metrics may over-estimate. To solve this problem, we first construct a real-world zoom photo database, which includes 900 tele-photos from 20 different mobile sensors and ISPs. We then propose a novel no-reference zoom quality metric that incorporates the traditional estimation of sharpness and the concept of image naturalness. More specifically, to measure image sharpness, we are the first to combine the total energy of the predicted gradient image with the entropy of the residual term under the framework of free-energy theory. To further compensate for the influence of over-sharpening and other artifacts, a set of model parameters of mean subtracted contrast normalized (MSCN) coefficients is used to represent natural scene statistics. Finally, these two measures are combined linearly. Experimental results on the zoom photo database demonstrate that our quality metric achieves SROCC and PLCC over 0.91, while the sharpness or naturalness index alone reaches around 0.85. Moreover, compared with the best tested general-purpose and sharpness models, our zoom metric outperforms them by 0.072 and 0.064 in SROCC, respectively.


Introduction
With the rapid development of mobile CMOS imaging sensor techniques such as 2-layer transistor pixels and dual vertical gates [1][2][3][4][5], smartphone picture quality has improved considerably, and user-generated images and videos have become the mainstream of social media and entertainment. Among this content, zoom photos account for a large portion because they highlight subjects and make composition easy. For professional photographers, zoom capability (or the choice of focal length), with its shallow depth of field, is one of the determining factors in taking great pictures, while for average consumers, magnifying an image without losing clarity is also very meaningful. However, how to assess the quality of realistic zoom images remains unexplored.

Difference between Zoom Photos and Gaussian Blur Images
Although many phones are equipped with one or more zoom lenses, their focal lengths are fixed, and digital zoom is still the universal way of magnifying images. The straightforward implementation is to crop a small portion of the image and then interpolate pixel values by the nearest-neighbor, bilinear or bicubic method [6], but this leads to pixelated or blurry details. To alleviate blurry appearances, some phone makers post-process images through unsharp masking. Other manufacturers use learning-based super-resolution [7] or multi-frame stacking algorithms to add fine details in delicate areas such as cement roads and rocks. Representative techniques are Deep Fusion in the iPhone and Super Res Zoom in the Google Pixel [8]. However, both post-processing and synthesis-based approaches can introduce over-sharpening effects, unnatural details and ringing into the image, which traditional sharpness metrics over-estimate. In fact, this is where zoom photos differ from purely Gaussian blurred images. To better visualize these artifacts, Figure 1b-d show the over-sharpening effect, unnatural details and ringing artifacts, respectively. It can be seen that the details of Figure 1b look grainier and noisier than those of Figure 1a because of over-sharpening; the lines of the buildings, windows and walls in Figure 1c are zigzag; and ringing appears around the edges in Figure 1d due to JP2K compression and over-sharpening. To the best of our knowledge, this paper is the first attempt to deal with realistic zoom photos in the task of IQA.
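As a minimal sketch of why crop-and-interpolate zoom followed by unsharp masking over-shoots at edges, the following toy example may help. It is not any manufacturer's pipeline: the function names, the nearest-neighbor upscaling, the box blur standing in for a Gaussian blur, and the `amount` value are all assumptions made for illustration.

```python
import numpy as np

def digital_zoom_nearest(img, factor):
    """Crop the central 1/factor region and upscale it by pixel repetition."""
    h, w = img.shape
    ch, cw = h // factor, w // factor
    top, left = (h - ch) // 2, (w - cw) // 2
    crop = img[top:top + ch, left:left + cw]
    return np.repeat(np.repeat(crop, factor, axis=0), factor, axis=1)

def box_blur(img, radius=1):
    """Mean filter with edge replication (a simple stand-in for a Gaussian blur)."""
    size = 2 * radius + 1
    padded = np.pad(img, radius, mode="edge")
    out = np.zeros(img.shape, dtype=float)
    for dy in range(size):
        for dx in range(size):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (size * size)

def unsharp_mask(img, amount=1.5, radius=1):
    """Add back `amount` times the high-pass residual (img - blurred)."""
    img = img.astype(float)
    return img + amount * (img - box_blur(img, radius))

# Toy scene: a vertical step edge (left half 50, right half 200).
scene = np.full((16, 16), 50.0)
scene[:, 8:] = 200.0

zoomed = digital_zoom_nearest(scene, 2)   # 2x digital zoom, same output size
sharpened = unsharp_mask(zoomed)          # values now leave the [50, 200] range
print(zoomed.min(), zoomed.max(), sharpened.min(), sharpened.max())
```

In this toy case the sharpened image exceeds the original intensity range on both sides of the edge, which is the over-shoot that a gradient-based sharpness measure would reward even though it looks unnatural.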

Laboratory Measurements of Targets vs. Perceptual Metrics on Natural Photos
In industry, edge blur is often measured through the SFR or MTF, which can be calculated from a slanted checkerboard or from the more complex Sine Siemens Star and Log-F contrast targets. To overcome the texture loss caused by edge-preserving filters, DxO Lab proposed the dead leaves model in [9], and the IE group came up with the Gaussian white noise target for the same purpose [10]. Although the correlation between measurements on both targets and subjective scores of photographic images has been confirmed in [11], human-made targets are clearly not as realistic and complex as natural photos. The measurement of the energy fall-off by DxO or of the kurtosis changes of the Gaussian noise image by IE is also simpler than what current objective IQA algorithms model. Moreover, the over-sharpening phenomenon, which affects both edges and textures of the images, cannot be handled by these texture-blur measures. Therefore, in this paper, we address the zoom photo quality assessment problem on realistic photos, instead of using laboratory targets and environments.

Limitations of Sharpness, Super-Resolution, and General-Purpose Metrics
Since sharpness/blurriness is the quality factor most related to realistic zoom images and can naturally be used to assess zoom quality, we review sharpness metrics next in more detail.
In [12], the author first detected edge points with the Sobel operator and then measured sharpness through the edge width. This method works well for images with the same content, but cannot handle heterogeneous content. Ferzli et al. [13] proposed the probability of just noticeable blur (JNB), which takes the contrast masking effect into account. The edge width and local contrast are first computed in edge blocks and then fed into a probabilistic summation model to estimate the blur of the whole image. To deal with images with different portions of background blur, a saliency map was incorporated into JNB [14]. Furthermore, by using the cumulative probability, the same pooling effect is achieved under a unified framework [15]. Inspired by the triangle model of the gradient profile, Yan et al. [16] defined edge sharpness as the triangle height divided by its width. Vu et al. [17] combined the slope of the spectrum magnitude with the total variation of pixel values in the spatial domain into a sharpness value, S3. The spectral term is responsible for fine-detailed textures, while the spatial term accounts for high-contrast edges. Thereafter, the same authors proposed a fast sharpness index called FISH, in which the energies of different high-frequency sub-bands are accumulated in the wavelet domain [18]. Observing that local phase maps at different scales align at strong image structures, Hassen et al. [19] measured image sharpness through local phase coherence (LPC). Besides image blur, LPC can respond effectively to noise contamination, which distinguishes it from other sharpness metrics. Based on a re-blurring process, sharpness was defined in [20] as the percentage decrease of the fourth-order moments of the re-blurred version relative to the test image. More recently, Li et al. [21] decomposed gradient image blocks over a novel set of Tchebichef bases. The squared Tchebichef moments were normalized by the variance to compensate for the influence of image content, and further weighted by a saliency model. Besides the spatial and spectral domains, a few metrics make use of high-level features. Gu et al. [22] proposed a sharpness metric in which autoregressive (AR) parameters are estimated, and the differential energy and contrast between AR values are then linearly combined. Li et al. [23] and Han et al. [24] both measured blurriness through sparse representation. The latter needs partial information from the original image and thus belongs to the reduced-reference category. Liu et al. [25] developed a sharpness metric for realistic out-of-focus images in which phase congruency and gradient magnitude maps are merged by max pooling and further weighted by a saliency map to form the sharpness value.
Despite the success of sharpness indices on Gaussian blurred images, they generally over-estimate the sharpened details and artifacts in zoom photos. In [26,27], this over-shoot problem was raised, but only in simulated scenarios. Moreover, S3-III [27] performs only moderately on our database. Another relevant line of work is quality assessment for super-resolution [28][29][30]. These metrics often assume the existence of an "original low-resolution image". In contrast, our zoom quality metric relies only on the tested zoom photos and is thus no-reference. The distortion types are also more subtle and complex. It may be argued that general-purpose IQA metrics [31][32][33][34][35] can assess these artifacts correctly, but we will demonstrate in Section 4 that they are not as effective as sharpness metrics for zoom photos.

The Contribution of This Article
To tackle the above problem, we propose a novel zoom quality metric which incorporates image sharpness and naturalness. For zoom photos, sharpness is the determining quality factor. To estimate it, we resort to the free-energy theory, which models the brain's interpretation process by decomposing an image (scene) into a predicted and a residual version [36]. Specifically, the gradient image is first encoded by sparse representation. Then the squared sparse coefficients are summed for the reconstructed gradient image, and the entropy is computed for the residual part. The sharper a zoom photo is, the higher the energy of the reconstructed image and the entropy of the residual image. Thus, we form the sharpness index by adding these two quantities. As mentioned above, zoom photos differ from Gaussian blurred images in their over-sharpened details and unnatural looks. To compensate for such perceptually harmful effects, a set of parameters of the MSCN distribution is first extracted from zoom photos and then compared with that of natural images. Finally, this NSS score is combined linearly with the sharpness index. In summary, our contributions are as follows:
• First, we construct a realistic zoom database by collecting 20 phone cameras and shooting with them in 45 texture-complex scenes. Mean expert opinion scores and several analyses are also provided. Compared to existing Gaussian blur databases, our database contains the most authentically distorted photos and scenes. The image resolution is also the highest.
• We are the first to derive the whole formulation of the free energy under the sparse representation model, according to which a new sharpness index is proposed.
• A novel zoom quality metric is proposed which incorporates image sharpness and naturalness. Natural scene statistics are included as a mechanism to prevent over-sharpening and penalize unnatural appearances.
• The zoom quality metric is tested on both unsharp masking simulations and the zoom photo database, where it outperforms existing sharpness and general-purpose QA models.
The rest of this paper is organized as follows: Section 2 describes the construction of a realistic zoom image database. Section 3 explains two components of our sharpness metric and the important NSS score. Extensive experimental results are given in Section 4. Finally, Section 5 concludes the paper.

The Construction Process and Comparison with Other Databases
A zoom photo refers to an image shot at a focal length several times longer than that of the main camera (e.g., 23-28 mm), which includes both optically and digitally zoomed photos. Their quality differences fall into three cases. First, between optically zoomed photos, the balance between texture blur and noise is the key quality issue, which is determined by the imaging sensor size. A smaller sensor has a lower signal-to-noise ratio (SNR). Thus, its rendered texture will be either grainy or muddy, depending on whether and how much noise reduction is applied [37]. Photos from a bigger sensor are often clearer in detail. Second, between optically and digitally zoomed images, the blur itself is the biggest difference, which can be modeled by a global point spread function (PSF). Third, between digitally zoomed images, the post-processing algorithm plays a vital role. The ideal result is to add details without introducing annoying artifacts such as ringing or over-sharpened appearances.
However, current image quality databases mainly simulate the second scenario, i.e., the original image is convolved with a Gaussian PSF. The LIVE database [38] collects 29 original images and filters them through a circular-symmetric 2-D Gaussian kernel whose standard deviation ranges from 0.42 to 15 pixels. The TID2008 and TID2013 databases [39,40] contain 25 reference images and four or five levels of blur distortion. The LIVE MD database [41] improves on LIVE and TID by partly simulating the first quality difference scenario: independently and identically distributed Gaussian noise is added to the Gaussian blurred image to mimic realistic ISP output.
The BID database [42] is the first attempt to consider realistic blur distortions. In total, there are 585 images in five classes: unblurred, out-of-focus, simple motion, complex motion and other types. Following BID, Liu et al. [25] investigated the out-of-focus type more deeply. The 150 out-of-focus photos were created by manually focusing a DSLR elsewhere (e.g., on background objects).
Although motion blur and out-of-focus distortion are more realistic than Gaussian blur, both rarely happen in the image acquisition process unless intended. For motion blur, the shutter speed is often faster than 1/100 s in daylight, while at night, optical image stabilization (OIS) helps to compensate for hand shake. For phone units without OIS, a safe shutter speed (e.g., 1/15 s) is often enforced by manufacturers to avoid motion blur. As for the out-of-focus problem, first, the depth of field of mobile cameras is very deep due to their small sensors; that is, the out-of-focus or bokeh phenomenon is not as prominent as in DSLRs. Second, with advanced dual PDAF and laser autofocus techniques, fewer and fewer photos suffer from focusing issues these days. Practically, low-light environments and moving objects are the two most probable scenarios in which motion blur occurs or the autofocus system fails, but neither BID [42] nor the out-of-focus database [25] has considered them.
In comparison with motion blur and out-of-focus blur, blur induced by zooming is much more common on smartphones, since no current smartphone camera possesses the continuous zooming capability of DSLRs. To build such a zoom photo database, we first select 20 mobile cameras from the market, which cover a wide range of brands and span from mid-range to high-end models. A large majority of them have at least one zoom lens, while the others do not. Then 45 test scenes are carefully chosen, with distant scenery, buildings, characters and portraits as main subjects. During the shooting process, we only zoom in at 2, 3 or 5 times according to the photo composition rule [43]. Grid lines are turned on to align image perspectives across different cameras. It is worth noting that no external lenses are mounted on the smartphones, and all photos are straight out of the camera without filters or retouching. Finally, each scene is simultaneously captured by all cameras, producing a total of 900 images. Example images of this Zoom PHoto Database (which we name ZPHD) are shown in Figure 2.

Subjective Quality-Evaluation Study and Analysis
In our pilot subjective study, we found that naïve observers often misjudge unnatural textures and over-rate them. Thus, only 10 experts who have experience in phone camera quality testing are involved. Since we mainly care about the quality of zoomed details, subjects are asked to rate photos by their clarity, ignoring brightness variations and color differences [44]. According to the recommendations of ITU-R BT.500-12 [45], the test image should be fully shown and the viewing distance fixed at three times the image height. However, first, our photos are too large to be displayed in full on the screen; second, to differentiate subtle texture differences, higher visual acuity (or a larger visual angle/spatial frequency) is required [46]. Thus, we allow observers to magnify the test image themselves for pixel-peeping purposes. It may be questioned that the compared region and magnification ratio vary across images and subjects. However, except for a few cases where images from different lenses are stitched together, the clarity of a zoom image does not change dramatically across its patches. And although the magnification factor is not controlled, subjects tend to compare image details at the same scale. The ambient light is kept low to reduce fatigue.
Moreover, photos from the same scene are grouped together and rated in one session, instead of in completely random order. The subject first views the whole set (e.g., 20 photos) to form a general idea of the scene content and quality range. Then he/she gives an opinion score based on the single-stimulus method [45]. A score of 100 means a sharp and clear image, while 0 represents a heavily pixelated or blurry one. This is an improvement over our previous work, where only quality rank orders were provided [47]. Table 1 lists major information about the test conditions. Many items have been updated from traditional setups to adapt to our zoom quality application, and are highlighted in boldface.

Table 1. Major information about the test conditions.
Item                   Setting
Evaluation method      single-stimulus within a scene
Evaluation scales      continuous quality scales from 0 to 100
Evaluation standard    image clarity instead of color or overall quality
Image number           20 mobile cameras × 45 scenes
Image encoder          JPEG and HEIF
Image resolution       mostly 4000 × 3000
Subjects               ten experienced experts
Viewing angle          can be adjusted by the subject
Room illuminance       low

After obtaining the individual scores, we check the inter-subject consistency using the Spearman correlation coefficient and Cronbach's alpha reliability coefficient [48]. Both results (r = 0.943, α = 0.957) indicate high correlation and reliability of the subjective scores. Therefore, no second alignment study is performed. Finally, the outlier screening procedure is conducted as in [45]. No scores are found to be abnormal, so we average all ten scores to obtain the final MOS. The MOS histograms of the 2×, 3× and 5× photos are shown in Figure 3. First, we can observe that the MOS distributions of both the 2× and 5× photos are shorter-tailed than that of the 3× photos, i.e., more MOSs are concentrated toward one end of the score scale. The reason is that, for most 2× zoom photos, the image quality is still very acceptable (>40) even with the digital zoom of the main camera, not to mention those taken with 2× optical lenses. For the 5× zoom photos, in contrast, except for a few phone units with a 5× periscope, the image quality of the digital zoom by the main camera degrades heavily. Hence, the 5× zoom MOS histogram is more concentrated at the lower end. Second, the overall MOS distribution shifts toward lower values as the zoom factor becomes larger, because, for a single main camera, the 5× zoom photo quality is definitely worse than that of the 2× zoom. We also analyze the correlations of the MOS distributions between different scenes (at the same zoom factor). The result r > 0.9 indicates that the influence of scene content is minor, but cameras with edge-preserving smoothing operations [49,50] indeed obtain higher MOSs in scenes of characters and architecture than in scenes of textures. The overall scores are also computed for each camera and are included in the database.
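The reliability check described above can be sketched as follows. This is only an illustration of the standard Cronbach's alpha formula and of MOS averaging; the function names and the toy score matrix (four images, three subjects) are assumptions and are not data from the study.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a score matrix of shape (n_images, n_subjects)."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                                # number of subjects
    subject_var = scores.var(axis=0, ddof=1).sum()     # sum of per-subject variances
    total_var = scores.sum(axis=1).var(ddof=1)         # variance of the summed scores
    return k / (k - 1) * (1.0 - subject_var / total_var)

def mean_opinion_score(scores):
    """Average the individual scores of each image over all subjects."""
    return np.asarray(scores, dtype=float).mean(axis=1)

# Hypothetical ratings on the 0-100 scale: rows are images, columns are subjects.
toy_scores = [
    [82, 78, 85],
    [40, 45, 38],
    [60, 58, 65],
    [20, 25, 22],
]
alpha = cronbach_alpha(toy_scores)
mos = mean_opinion_score(toy_scores)
print(round(alpha, 3), mos)
```

Values of alpha close to 1 indicate that the subjects rank and scale the images consistently, which is the situation reported for the expert panel.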
Finally, we compare the characteristics of this zoom photo database with those of other representative databases in Table 2. It can be seen that our database contains the largest number of scenes and blur-related images. The image resolution is also the highest.

The Image Sharpness Measurement
As the theoretical foundation, the free-energy principle is first introduced in this section. Then the brain's interpretation process is approximated using sparse representation. Based on these two models, we derive the whole formulation of the free energy and propose a novel sharpness metric, which considers both the energy of the predicted image and the entropy of the residual image. Details are given below.

Formulation of Free-Energy Principle
A basic premise of the free-energy-based brain theory [52] is that the cognitive process is governed by an internal generative model. With this internal model, the brain is able to generate predictions for the visual scenes it encounters and direct our actions accordingly. For operational amenability, let G denote the brain's internal model, and let θ represent the model parameter vector, which can be adjusted by G to explain perceived scenes. Given an image I, we define its 'surprise' or entropy as the negative logarithm of the joint distribution P(I, θ) integrated over the space of the parameter vector θ:

$$-\log P(I) = -\log \int P(I, \theta)\, d\theta. \tag{1}$$

Since P(I, θ) is too complicated to write analytically, we introduce an auxiliary term Q(θ|I) into both the denominator and numerator of the right-hand side of Equation (1) and have:

$$-\log P(I) = -\log \int Q(\theta|I)\, \frac{P(I, \theta)}{Q(\theta|I)}\, d\theta. \tag{2}$$

Here Q(θ|I) is the posterior distribution of the model parameters given image I, which can be thought of as an approximation to the true posterior of the model parameters P(θ|I) calculated by the brain. When perceiving the input image I, the brain intends to minimize the discrepancy between the approximate posterior Q(θ|I) and the true posterior P(θ|I). In fact, this approximation technique has been used in ensemble learning and in the variational Bayesian estimation framework. Please see [53] for more details.
By using Jensen's inequality, we can move the logarithm inside the integral, so Equation (2) becomes:

$$-\log P(I) \le -\int Q(\theta|I) \log \frac{P(I, \theta)}{Q(\theta|I)}\, d\theta. \tag{3}$$

According to statistical physics and thermodynamics [54], the right side of Equation (3) is defined as "free energy" as follows:

$$F(\theta) = -\int Q(\theta|I) \log \frac{P(I, \theta)}{Q(\theta|I)}\, d\theta. \tag{4}$$

Obviously, F(θ) defines an upper bound of 'surprise' for image I. In practice, the integration over the joint distribution P(I, θ) can be intractably complex. By decomposing P(I, θ) = P(θ|I)P(I), we obtain:

$$F(\theta) = \int Q(\theta|I) \log \frac{Q(\theta|I)}{P(\theta|I)\,P(I)}\, d\theta = -\log P(I) + KL\big(Q(\theta|I)\,\|\,P(\theta|I)\big), \tag{5}$$

where KL(·) refers to the Kullback-Leibler divergence between the approximate posterior and the true posterior distributions, which is nonnegative. It is clearly seen that the free energy F(θ) is greater than or equal to the image 'surprise', − log P(I). In visual perception, the brain tries to minimize the divergence KL(Q(θ|I) ‖ P(θ|I)) between the approximate posterior and the true posterior when perceiving image I. Alternatively, noticing that P(I, θ) = P(I|θ)P(θ), (4) can be rewritten as:

$$F(\theta) = \int Q(\theta|I) \log \frac{Q(\theta|I)}{P(I|\theta)\,P(\theta)}\, d\theta = KL\big(Q(\theta|I)\,\|\,P(\theta)\big) + E_Q\left[\log \frac{1}{P(I|\theta)}\right]. \tag{6}$$

In the next two subsections, we will explain how to calculate the two parts on the right-hand side of Equation (5) or (6), and leverage both of them for sharpness estimation.
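The two factorizations of the free energy referred to as Equations (5) and (6) can be checked numerically on a toy discrete model. The sketch below replaces the integral over θ with a sum over three parameter values; the probabilities are arbitrary assumptions chosen only to make the identities visible.

```python
import math

def kl(q, p):
    """Kullback-Leibler divergence KL(q || p) for discrete distributions."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

# Toy model with three parameter values theta (all numbers are illustrative).
prior = [0.2, 0.5, 0.3]            # P(theta)
likelihood = [0.5, 0.1, 0.4]       # P(I | theta) for one fixed image I
joint = [p * l for p, l in zip(prior, likelihood)]   # P(I, theta)
evidence = sum(joint)                                # P(I)
posterior = [j / evidence for j in joint]            # P(theta | I)
q = [0.6, 0.3, 0.1]                # approximate posterior Q(theta | I)

# Free energy from its definition: -sum_theta Q log(P(I, theta) / Q).
free_energy = -sum(qi * math.log(ji / qi) for qi, ji in zip(q, joint))

surprise = -math.log(evidence)
# Surprise plus divergence from the true posterior.
form5 = surprise + kl(q, posterior)
# Divergence from the prior plus expected negative log-likelihood.
form6 = kl(q, prior) + sum(qi * math.log(1.0 / li) for qi, li in zip(q, likelihood))

print(free_energy, form5, form6, surprise)
```

All three free-energy values coincide and are no smaller than the surprise; the gap closes exactly when Q equals the true posterior.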

Approximation of the Brain Generative Model
To apply the free-energy theory to image quality assessment, the concrete form of G needs to be specified first. In [55], the receptive fields of simple cells in the mammalian primary visual cortex are characterized as spatially localized, oriented and bandpass. Sparse representation mimics this biological process by assuming that an image or its patch can be modeled by a linear combination of a few atoms from a predefined or trained dictionary [56]. Such atoms have been shown to resemble neural response characteristics well, and the superiority of sparse representation for approximating the internal model has been verified in [57]. Therefore, we use sparse coding as the proxy for model G.
Mathematically, given an image I, a patch $x_k \in \mathbb{R}^B$ of size $\sqrt{B} \times \sqrt{B}$ is extracted from I by:

$$x_k = R_k(I), \tag{7}$$

where $R_k(\cdot)$ simply copies the pixel values from image I at location k into $x_k$, k = 1, 2, 3, ..., N, and N is the total number of image patches. For a specific extracted patch $x_k$, its sparse representation over a dictionary $D \in \mathbb{R}^{B \times d}$ (d > B) refers to finding a sparse vector $\alpha_k \in \mathbb{R}^d$ (i.e., most of the elements in $\alpha_k$ are zero or close to zero) that satisfies:

$$x_k = D\alpha_k. \tag{8}$$

If small approximation errors are permitted, the exact equality in (8) can be relaxed as:

$$\|x_k - D\alpha_k\|_p \le \xi, \tag{9}$$

where $\|\cdot\|_p$ refers to the $\ell_p$-norm and ξ is a positive number. Because our objective is the sparsity of the coefficients, the constrained problem can be formulated as:

$$\alpha_k^* = \arg\min_{\alpha_k} \|\alpha_k\|_p \quad \text{s.t.} \quad \|x_k - D\alpha_k\|_2 \le \xi. \tag{10}$$

Alternatively, the dual problem of (10) could be considered:

$$\alpha_k^* = \arg\min_{\alpha_k} \|x_k - D\alpha_k\|_2 \quad \text{s.t.} \quad \|\alpha_k\|_p \le \delta, \tag{11}$$

where δ is the threshold of sparsity. Both (10) and (11) can be transformed into an unconstrained optimization problem:

$$\alpha_k^* = \arg\min_{\alpha_k} \frac{1}{2}\|x_k - D\alpha_k\|_2^2 + \lambda\|\alpha_k\|_p, \tag{12}$$

where the first term is the reconstruction fidelity constraint and the second term penalizes non-sparse coefficient vectors. λ is a positive constant that weighs the importance of these two terms. When p = 0, the sparsity of $\alpha_k$ is controlled by the $\ell_0$-norm. An alternative is to replace the $\ell_0$-norm with the $\ell_1$-norm, which is convex and can be solved by the iterative shrinkage/thresholding algorithm [58]. We employ off-the-shelf 2-D DCT bases as our predefined dictionary D, which is illustrated in Figure 4. More implementation details can be found in Section 4. After obtaining the sparse coefficient vector $\alpha_k^*$ for the image patch $x_k$, we substitute $x_k$ with $D\alpha_k^*$ and then copy it back into its original position by:

$$\tilde{I} = \sum_{k=1}^{N} R_k^T(D\alpha_k^*), \tag{13}$$

where $R_k^T$ is the reverse operator of $R_k$, and $\tilde{I}$ refers to the sparse representation of the image I, or the brain's prediction result in the free-energy theory.
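The ℓ1-relaxed sparse coding step can be sketched with a plain iterative shrinkage/thresholding (ISTA) loop. This is a minimal illustration, not the authors' implementation: the overcomplete DCT construction (8 × 8 patches, 11 atoms per dimension), the value of λ, the iteration count and the synthetic test patch are all assumptions.

```python
import numpy as np

def overcomplete_dct(patch_side=8, atoms_per_dim=11):
    """Separable overcomplete 2-D DCT dictionary with unit-norm columns."""
    n = np.arange(patch_side)[:, None]
    k = np.arange(atoms_per_dim)[None, :]
    d1 = np.cos(np.pi * n * k / atoms_per_dim)
    d1[:, 1:] -= d1[:, 1:].mean(axis=0)      # remove DC from the non-constant atoms
    d1 /= np.linalg.norm(d1, axis=0)
    return np.kron(d1, d1)                   # shape (B, d) = (64, 121), so d > B

def ista(x, D, lam=0.05, n_iter=200):
    """Minimize 0.5 * ||x - D a||_2^2 + lam * ||a||_1 by ISTA."""
    step = 1.0 / np.linalg.norm(D, 2) ** 2   # 1 / Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = a - step * (D.T @ (D @ a - x))   # gradient step on the fidelity term
        a = np.sign(z) * np.maximum(np.abs(z) - lam * step, 0.0)  # soft threshold
    return a

D = overcomplete_dct()
x = 3.0 * D[:, 5] - 2.0 * D[:, 40]           # synthetic patch built from two atoms
alpha = ista(x, D)
x_hat = D @ alpha                            # predicted patch, D times alpha
residual = x - x_hat                         # what the sparse model leaves unexplained
print(np.count_nonzero(alpha), np.linalg.norm(residual))
```

The reconstructed patch `x_hat` plays the role of the prediction and `residual` the role of the unexplained part; in a full pipeline each patch would be coded this way and written back to its original position.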

The Sharpness Index
In [36], the mathematical expression of the free energy was derived for the AR model and applied to RR-IQA tasks. In this paper, the sparse representation coefficients α become the model parameter θ, and the first term of (6), KL(Q(α|I) ‖ P(α)), measures the distance between the recognition density and the true prior density of the model parameters. However, although the support (i.e., the non-zero positions) of α and the distribution P(α) have been discussed in structured compressed sensing [59,60], their exact forms have not yet been determined. To simplify computation, we choose a Gaussian distribution for the prior density P(α). This Gaussian prior corresponds exactly to an $\ell_2$-norm relaxation of the $\ell_0$-norm in (12), and can be transformed into an inverse Bayesian inference problem [56]. To model the recognition density Q(α|I), i.e., the posterior distribution of the 2-D DCT sparse coefficients, we use a Gaussian function as well.
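Specifically, let $P(\alpha) = \mathcal{N}(\alpha_p; \mu_p, \Sigma_p)$ and $Q(\alpha|I) = \mathcal{N}(\alpha_q; \mu_q, \Sigma_q)$; the KL-divergence then becomes:

$$KL\big(Q(\alpha|I)\,\|\,P(\alpha)\big) = \frac{1}{2}\left[\log\frac{|\Sigma_p|}{|\Sigma_q|} - d + \mathrm{tr}\big(\Sigma_p^{-1}\Sigma_q\big) + (\mu_q - \mu_p)^T \Sigma_p^{-1} (\mu_q - \mu_p)\right], \tag{14}$$

where d is the dimension of the variable α. If we further assume that the elements of the sparse coefficient vector are independent, so that $\Sigma_p = \mathrm{diag}(\lambda_1, ..., \lambda_d)$, (14) can be simplified as:

$$KL\big(Q(\alpha|I)\,\|\,P(\alpha)\big) = \frac{1}{2}\sum_{i=1}^{d} \frac{(\mu_{q,i} - \mu_{p,i})^2}{\lambda_i} + \mathrm{const}, \tag{15}$$

where $(\lambda_1, ..., \lambda_d)$ are the variances of $\alpha_p$ and $\mathrm{const} = \frac{1}{2}\big(\log\frac{|\Sigma_p|}{|\Sigma_q|} - d + \mathrm{tr}(\Sigma_p^{-1}\Sigma_q)\big)$. This equation shows that the KL-divergence can be calculated through the quadratic sum of the sparse coefficients, divided by the variance components of α. The second term of (6), $E_Q[\log \frac{1}{P(I|\alpha)}]$, measures the average likelihood of P(I|α) under Q(α|I). If we approximate Q(α|I) with P(I|α), then By combining (15) and (16), the free-energy quantity can be written as : which means that the free energy equals the model approximation error plus the image 'surprise'. This factorization can also be seen through (5), where the first term − log P(I) is the negative log-evidence of the image, i.e., its 'surprise'; and the second term KL(Q(θ|I) ‖ P(θ|I)) measures the KL-distance between the approximate model density and the true posterior density.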
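The closed-form KL-divergence between two Gaussian densities, and its reduction to a variance-normalized quadratic sum of coefficients, can be verified with a short numerical sketch. The function name and the toy values of the coefficient vector and prior variances are assumptions for illustration only.

```python
import numpy as np

def kl_gaussians(mu_q, cov_q, mu_p, cov_p):
    """KL( N(mu_q, cov_q) || N(mu_p, cov_p) ) for full covariance matrices."""
    d = mu_q.shape[0]
    inv_p = np.linalg.inv(cov_p)
    diff = mu_q - mu_p
    _, logdet_p = np.linalg.slogdet(cov_p)
    _, logdet_q = np.linalg.slogdet(cov_q)
    return 0.5 * (logdet_p - logdet_q - d
                  + np.trace(inv_p @ cov_q)
                  + diff @ inv_p @ diff)

# Toy case: zero-mean prior with diagonal covariance diag(lam), recognition
# density centred on a sparse coefficient vector with the same covariance.
lam = np.array([1.0, 4.0, 2.0])          # prior variances (illustrative)
coeffs = np.array([1.0, 2.0, 0.0])       # sparse coefficients (illustrative)
kl_full = kl_gaussians(coeffs, np.diag(lam), np.zeros(3), np.diag(lam))

# With equal covariances the constant vanishes and only the quadratic sum remains.
kl_quadratic = 0.5 * np.sum(coeffs ** 2 / lam)
print(kl_full, kl_quadratic)
```

When the two covariances differ, the extra log-determinant and trace terms no longer cancel and appear as the additive constant.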
In implementation, we set $\mu_p = 0$ (smooth prior) and $\mu_q = \alpha_k^*$, and further compress the variance vector into a single scalar λ, which is computed through the variance $\sigma_k^2$ of the k-th image patch. The variance $\sigma_k^2$ also compensates for the contrast masking effect induced by image content and has proved effective in previous works [13,23]. The entropy is calculated in the residual domain. To balance the different scales of the model error term and the entropy, we also introduce a weighting coefficient $k_1$. Thus, (17) is simplified as: where $\tilde{I}$ is the image reconstructed by sparse coding. Since the human visual system is more sensitive to sharper regions, it is effective to pool only those regions [15]. To capture image structures and details efficiently, the gradient domain is used. Let ∇I denote the gradient magnitude of image I computed by the Sobel edge detector, let S be the binary operator which assigns 1s to the top l% sharpest patches (0s otherwise), and let $N_s = l\% \times N$ denote the number of these sharpest blocks, where N is the total number of image patches. We define the sharpness index as: Looking at (18,19), our work is the first to give the complete formulation of the free energy under the sparse representation model and leverage its full power for sharpness assessment. In contrast, Wu et al. [61] interpreted the predicted and residual images as the ordered and disordered portions. The latter is regarded as useless for high-level inference and mainly responsible for distortions such as noise and compression artifacts. Gu et al. [22] decomposed the blurred image through the autoregressive model. The AR-predicted result is a re-blurred version of the test image and is used for sharpness estimation afterwards, but the residual part is not considered. Although the residual part is used in [62], the reconstructed image is ignored. In our paper, the reconstructed part represents the prominent structures/edges of the image, and a decent amount of fine-grained detail is left in the residual image, which is crucial for differentiating small quality gaps between zoom photos. Figure 5a,d compare two 3× zoom photo crops. The MOS of Figure 5a is higher than Figure 5d because of hardware advantage. In their sparse reconstructed gradient images, the edge intensity of the eaves and decorative patterns in Figure 5e is much stronger than in Figure 5b, which is reflected in the KL-divergence ($KL_b$ = 4.2008, $KL_e$ = 6.0919). Meanwhile, the residuals in Figure 5f are also more obvious and uneven than in Figure 5c (see the decorative patterns), leading to a higher entropy as well. Therefore, both the energy of the predicted portion and the entropy of the residual portion can represent the sharpness changes.
Moreover, we can observe that Figure 5b,e primarily capture the intensity of edges (i.e., acutance), while Figure 5c,f depict the subtle and fine-grained textures (i.e., resolution). Acutance and resolution are two complementary factors in the perception of sharpness [17], both of which form an integral part of our sharpness model.

The Image Naturalness Measurement
Although the sharpness metric can distinguish quality differences between optical and digital lenses, it easily over-estimates the over-sharpening effect and spurious artifacts present in zoom photos due to the post-processing algorithms. Image naturalness, as its name suggests, can measure such loss of naturalness and degradation of perceptual quality. In the literature, different forms of natural scene statistics have been used: the power law of the spectrum energy is measured in [9,17]; the non-Gaussian distribution of the gradient components and of the DCT or Fourier coefficients is used in [63,64] for noise estimation and quality assessment; the Rayleigh or Weibull distribution of the variance or gradient magnitude is leveraged in [33,65]. Here, we utilize the distribution of mean subtracted contrast normalized (MSCN) coefficients as in [31,32].
Specifically, let x and y denote the pixel coordinates, and let I(x, y), µ(x, y) and σ(x, y) refer to the original image and to the mean and standard deviation of the local image patch centered at (x, y), respectively. The MSCN coefficient $\hat{I}(x, y)$ at (x, y) is then defined as:

$$\hat{I}(x, y) = \frac{I(x, y) - \mu(x, y)}{\sigma(x, y) + 1}.$$

According to [66], the image structure transitions are reduced by this local nonlinear divisive normalization, and for pristine natural images, the highly leptokurtic and long-tailed characteristics are transformed into a unit normal Gaussian distribution. However, for over-sharpened photos, the abrupt details change the behavior of both the peak and the tails of the empirical coefficient distribution, which can be well modeled by a generalized Gaussian distribution (GGD):

$$f(x; \alpha, \beta) = \frac{\alpha}{2\beta\,\Gamma(1/\alpha)} \exp\left(-\left(\frac{|x|}{\beta}\right)^{\alpha}\right),$$

where Γ(·) refers to the gamma function, which is defined as:

$$\Gamma(a) = \int_0^{\infty} t^{a-1} e^{-t}\, dt, \quad a > 0,$$

and α and β are the GGD parameters, which can be estimated by the moment-matching-based method [67].
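A minimal sketch of MSCN computation and of a moment-matching GGD fit is given below. It is an illustration under stated assumptions rather than the authors' code: the 7 × 7 Gaussian window with standard deviation 7/6, the stabilizing constant `c = 1`, the search grid for the shape parameter, and all function names are choices made here for the example.

```python
import math
import numpy as np

def gaussian_kernel(size=7, sigma=7 / 6):
    """Normalized 1-D Gaussian window (window size and sigma are assumptions)."""
    ax = np.arange(size) - size // 2
    k = np.exp(-0.5 * (ax / sigma) ** 2)
    return k / k.sum()

def local_filter(img, kernel):
    """Separable local weighted average with edge replication."""
    r = len(kernel) // 2
    padded = np.pad(img, r, mode="edge")
    rows = np.apply_along_axis(lambda v: np.convolve(v, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda v: np.convolve(v, kernel, mode="valid"), 0, rows)

def mscn(img, c=1.0):
    """Mean subtracted contrast normalized coefficients of a grayscale image."""
    img = np.asarray(img, dtype=float)
    k = gaussian_kernel()
    mu = local_filter(img, k)
    var = local_filter(img * img, k) - mu * mu
    sigma = np.sqrt(np.maximum(var, 0.0))
    return (img - mu) / (sigma + c)

def estimate_ggd(x):
    """Moment-matching estimate of the GGD shape (alpha) and scale (beta)."""
    x = np.asarray(x, dtype=float).ravel()
    ratio = np.mean(x ** 2) / np.mean(np.abs(x)) ** 2
    shapes = np.arange(0.2, 10.0, 0.001)
    rho = np.array([math.gamma(1 / a) * math.gamma(3 / a) / math.gamma(2 / a) ** 2
                    for a in shapes])
    alpha = shapes[np.argmin((rho - ratio) ** 2)]
    beta = math.sqrt(np.mean(x ** 2) * math.gamma(1 / alpha) / math.gamma(3 / alpha))
    return alpha, beta

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, 200000)       # Gaussian samples: shape should be near 2
alpha_hat, beta_hat = estimate_ggd(samples)
coeffs = mscn(rng.uniform(0, 255, (64, 64)))
print(alpha_hat, beta_hat, coeffs.shape)
```

A shape estimate near 2 corresponds to a Gaussian-like histogram and a value near 1 to a Laplacian-like one, so the fitted parameters summarize how the peak and tails of the MSCN histogram deviate from the natural case.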
In order to better visualize the effects of over-sharpening and zoom blur, Figure 6a shows a natural-looking photo crop from our 5× zoom database, while Figure 6b,c are two other photos of the same scene that suffer from over-sharpening and slight blur, respectively. Figure 6d plots the corresponding empirical distributions of Figure 6a-c. It is observed that the MSCN coefficients of Figure 6a follow a unit Gaussian distribution, while Figure 6b reduces the weight of the tail of the histogram and Figure 6c exhibits a more Laplacian appearance. Instead of calculating sample statistics such as the variance, skewness or kurtosis [10,68], we directly use α and β to encompass a wider range of distortion changes.

Aside from the MSCN coefficients, the products of adjacent MSCN coefficient pairs are also powerful in characterizing the image quality. Figure 6e shows the empirical distributions of the horizontal product MSCN coefficients of Figure 6a-c. As we can see, the histogram of Figure 6c is more peaked and leptokurtic than that of Figure 6a, while the distribution of Figure 6b looks more flat-topped. In this paper, we calculate the products of adjacent MSCN coefficients along four directions, i.e., horizontal, vertical, main-diagonal and second-diagonal, as in [31]. Each of these products can be modeled with the zero-mode asymmetric GGD (AGGD):

$$f(x; \gamma, \beta_l, \beta_r) = \begin{cases} \dfrac{\gamma}{(\beta_l + \beta_r)\,\Gamma(1/\gamma)} \exp\left(-\left(\dfrac{-x}{\beta_l}\right)^{\gamma}\right), & x < 0 \\ \dfrac{\gamma}{(\beta_l + \beta_r)\,\Gamma(1/\gamma)} \exp\left(-\left(\dfrac{x}{\beta_r}\right)^{\gamma}\right), & x \geq 0 \end{cases}$$

Unlike the MSCN coefficients, the mean of the product MSCN distributions also differs for Figure 6a-c, indicating changes of zoom quality. Thus, we compute the mean as:

$$\eta = (\beta_r - \beta_l)\,\frac{\Gamma(2/\gamma)}{\Gamma(1/\gamma)}$$

The informative model parameters (γ, β_l, β_r, η) of the AGGD are estimated and introduced into our quality-aware NSS features. As research in quality assessment has demonstrated that incorporating multi-scale information correlates better with human perception [69,70], we extract the above features at two scales (the second obtained by low-pass filtering and downsampling by 2).
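The four directional products and the AGGD parameters (γ, β_l, β_r, η) can be sketched as below. The estimator follows the moment-matching recipe commonly used in BRISQUE-style code, which is assumed here rather than taken from the text; the function names and the search grid for γ are hypothetical choices for illustration.

```python
import numpy as np
from math import gamma


def paired_products(m):
    """Products of adjacent MSCN coefficients along four directions."""
    return {
        "horizontal": m[:, :-1] * m[:, 1:],
        "vertical": m[:-1, :] * m[1:, :],
        "main_diagonal": m[:-1, :-1] * m[1:, 1:],
        "second_diagonal": m[:-1, 1:] * m[1:, :-1],
    }


def fit_aggd(x):
    """Moment-matching AGGD fit; returns (gamma, beta_l, beta_r, eta)."""
    x = np.ravel(np.asarray(x, dtype=float))
    gammas = np.arange(0.2, 10.0, 0.001)
    r_gam = np.array(
        [gamma(2 / g) ** 2 / (gamma(1 / g) * gamma(3 / g)) for g in gammas]
    )
    # Separate root-mean-square spreads of the negative and positive sides.
    sigma_l = np.sqrt(np.mean(x[x < 0] ** 2))
    sigma_r = np.sqrt(np.mean(x[x > 0] ** 2))
    ratio = sigma_l / sigma_r
    r_hat = np.mean(np.abs(x)) ** 2 / np.mean(x ** 2)
    r_norm = r_hat * (ratio ** 3 + 1) * (ratio + 1) / (ratio ** 2 + 1) ** 2
    g = gammas[np.argmin((r_gam - r_norm) ** 2)]
    scale = np.sqrt(gamma(1 / g) / gamma(3 / g))
    beta_l, beta_r = sigma_l * scale, sigma_r * scale
    eta = (beta_r - beta_l) * gamma(2 / g) / gamma(1 / g)  # AGGD mean
    return g, beta_l, beta_r, eta
```

With two GGD parameters for the MSCN coefficients and four AGGD parameters for each of the four directions, this yields 18 features per scale.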
Instead of calculating feature vectors on the whole image, we partition the photo into non-overlapping patches and perform feature extraction on each of them, leading to a 36-dimensional vector for each patch (2 GGD and 4 × 4 AGGD parameters at each of the two scales). Then we stack all the feature vectors together and fit them with a multivariate Gaussian (MVG) density:

$$f_X(\mathbf{x}) = \frac{1}{(2\pi)^{k/2}\,|\Sigma_x|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \mu_x)^{T}\,\Sigma_x^{-1}\,(\mathbf{x} - \mu_x)\right)$$

where x refers to the quality feature vector, k refers to the dimension of x, and µ_x and Σ_x are the mean vector and covariance matrix of the fitted model. To learn a model that serves as the pristine anchor for the NSS features, we select one hundred pristine images from the Berkeley image segmentation database [71], then model their patch-based feature vectors using an MVG as well:

$$f_Y(\mathbf{y}) = \frac{1}{(2\pi)^{k/2}\,|\Sigma_y|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{y} - \mu_y)^{T}\,\Sigma_y^{-1}\,(\mathbf{y} - \mu_y)\right)$$

A common measure of the distance between two distributions is the Kullback-Leibler divergence (KLD). However, the KLD is asymmetric. In this paper, we therefore use the square root of the symmetric Jensen-Shannon (JS) divergence [72] between the two MVG models to define our unnaturalness score NS. NS measures the distance between the tested zoom photo and the pristine images, thereby representing the inverse of image naturalness. The smaller NS is, the more natural a zoom photo appears.
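To make the asymmetry of the KLD concrete, the sketch below fits an MVG to stacked feature vectors and evaluates the standard closed-form KL divergence between two Gaussians in both directions. It only illustrates why a symmetric measure is preferable; it does not reproduce the NS score itself, and the function names are hypothetical.

```python
import numpy as np


def fit_mvg(features):
    """Fit an MVG to row-stacked feature vectors; returns (mean, covariance)."""
    f = np.asarray(features, dtype=float)
    return f.mean(axis=0), np.atleast_2d(np.cov(f, rowvar=False))


def kl_gaussian(mu0, cov0, mu1, cov1):
    """Closed-form KL(N(mu0, cov0) || N(mu1, cov1))."""
    k = mu0.size
    inv1 = np.linalg.inv(cov1)
    d = mu1 - mu0
    _, logdet0 = np.linalg.slogdet(cov0)
    _, logdet1 = np.linalg.slogdet(cov1)
    return 0.5 * (np.trace(inv1 @ cov0) + d @ inv1 @ d - k + logdet1 - logdet0)


# Toy two-dimensional example (illustrative values, not data from the paper).
mu_x, cov_x = np.array([0.0, 0.0]), np.eye(2)
mu_y, cov_y = np.array([1.0, 0.0]), np.diag([4.0, 1.0])
forward = kl_gaussian(mu_x, cov_x, mu_y, cov_y)
backward = kl_gaussian(mu_y, cov_y, mu_x, cov_x)
```

Here `forward` and `backward` differ, so swapping the roles of the test photo and the pristine model would change the score, which a symmetric divergence avoids.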

The Final Zoom Quality Metric
After calculating the sharpness and image naturalness, we merge them into a single zoom quality index. We found that a simple linear combination is enough to yield good results. Other weighting strategies such as geometric weighting, Boltzmann machines and SVM regression can achieve similar results, but they are less interpretable than a linear combination and may suffer from over-fitting. Thus, the final zoom quality metric Q is defined as:

$$Q = SS + w \cdot NS$$

where w is a negative constant that determines the relative importance of SS and NS. We will discuss it in more depth in Section 4. For an intuitive understanding of the proposed zoom quality metric, we show its flowchart in Figure 7.

Implementation Details
In implementation, the zoom photo is divided into 8 × 8 non-overlapping patches. We then reshape each square patch into a 64 × 1 column, which constitutes x_k in Equation (7). To construct the 2D-DCT dictionary, we first create a 1D-DCT matrix A_1D of size 8 × 12, where the k-th atom (k = 1, 2, ..., 12) is given by a_k = cos((i − 1)(k − 1)π/12), i = 1, 2, ..., 8. Then all the atoms except the first constant one are processed by removing their mean. The final 64 × 144 over-complete dictionary D is obtained by the Kronecker product D = A_1D ⊗ A_1D. The non-convex ℓ0-minimization problem in Equations (10) and (11) is solved using the orthogonal matching pursuit (OMP) algorithm [73]. We set the sparsity degree to 6 experimentally. The parameters l = 60%, k_1 = 0.5 and w = −0.7 are tuned to achieve the best result on the zoom photo database.
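The dictionary construction and the sparse coding step can be sketched as below. This is a minimal illustration under stated assumptions: the unit-norm scaling of the atoms is not mentioned in the text and is added here because greedy pursuit is usually run on normalized atoms, and `omp` is a plain textbook OMP rather than the specific solver of [73]. All function names are hypothetical.

```python
import numpy as np


def overcomplete_dct_dictionary(patch=8, atoms=12, normalize=True):
    """Build the 64 x 144 dictionary D = A_1D (Kronecker product) A_1D."""
    i = np.arange(patch)[:, None]   # zero-based, so i * k equals (i - 1)(k - 1)
    k = np.arange(atoms)[None, :]
    a = np.cos(i * k * np.pi / atoms)
    a[:, 1:] -= a[:, 1:].mean(axis=0, keepdims=True)  # keep the constant atom
    if normalize:                   # assumption: unit-norm atoms
        a /= np.linalg.norm(a, axis=0, keepdims=True)
    return np.kron(a, a)


def omp(D, x, sparsity=6):
    """Greedy orthogonal matching pursuit with at most `sparsity` atoms."""
    x = np.asarray(x, dtype=float)
    residual = x.copy()
    support = []
    sol = np.zeros(0)
    for _ in range(sparsity):
        idx = int(np.argmax(np.abs(D.T @ residual)))
        if idx in support:          # no new atom improves the fit
            break
        support.append(idx)
        # Re-fit all selected atoms jointly by least squares.
        sol, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ sol
        if np.linalg.norm(residual) < 1e-10:
            break
    coef = np.zeros(D.shape[1])
    coef[support] = sol
    return coef


D = overcomplete_dct_dictionary()
rng = np.random.default_rng(0)
patch_vector = rng.random(64)       # stand-in for a vectorized 8 x 8 patch
alpha_k = omp(D, patch_vector, sparsity=6)
prediction = D @ alpha_k            # sparse prediction of the patch
```

The difference `patch_vector - prediction` is the residual term whose statistics the sharpness model uses alongside the predicted part.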
To make the process of this zoom quality metric easy to follow, we show its pseudo-code in Algorithm 1 below.

Algorithm 1: Pseudo-code of the proposed zoom quality metric
Input: Zoom photo I, over-complete DCT dictionary D, mean µ_y and covariance Σ_y of the pristine MVG model, weighting parameters k_1, w and l.
1 Initialization:
2 Compute the photo gradient ∇I using the Sobel operator;
3 Partition the photo I into non-overlapping 96 × 96 patches x_k;
4 The measurement of sharpness:
5 foreach k = 1, 2, ..., N do
6 Solve the sparse coding coefficients α*_k using the OMP algorithm [73];
7 Calculate the patch variance σ²_k;
8 Sort and select the top l% patches according to the variance σ²_k;
9 end
10 Compute mean KL-divergence or energy: ∑_{k=1}^{N_s} …;
11 Compute the residual gradient image: ∇I − ∇Î;
12 Compute entropy of the residual: E(log …);
… (µ_x − µ_y);
Output: Zoom quality metric Q = SS + w · NS
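The partitioning and patch-selection steps of Algorithm 1 can be sketched as follows. This is an illustration under assumptions: the patch size is left as a parameter, border pixels that do not fill a whole patch are dropped, and the function names are hypothetical.

```python
import numpy as np


def image_to_patches(img, p=8):
    """Split an image into non-overlapping p x p patches, one column per patch."""
    img = np.asarray(img, dtype=float)
    h, w = img.shape
    h, w = h - h % p, w - w % p     # drop incomplete border patches (assumption)
    blocks = img[:h, :w].reshape(h // p, p, w // p, p).swapaxes(1, 2)
    return blocks.reshape(-1, p * p).T


def select_top_variance(patches, l=60.0):
    """Indices of the top l% patches ranked by variance, highest first."""
    var = patches.var(axis=0)
    n_keep = max(1, int(round(patches.shape[1] * l / 100.0)))
    return np.argsort(var)[::-1][:n_keep], var


rng = np.random.default_rng(0)
photo = rng.random((64, 48))        # stand-in for a grayscale zoom photo
patches = image_to_patches(photo, p=8)
selected, variances = select_top_variance(patches, l=60.0)
```

Ranking by variance keeps the textured patches, where differences in zoom sharpness are most visible, and discards flat regions such as sky.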

Illustrative Results
Before presenting quantitative results, let us examine three scenarios in Figure 8 where our zoom quality metric succeeds while SS or NS alone fails to predict the image quality:
1. In the top row, the quality order is A1 > A2 > A3. Specifically, A1 is a 5× zoom photo taken with an optical camera, while A2 comes from a smaller sensor and looks more grainy. A3 is interpolated from a 3× zoom camera and suffers from zoom blur. From Figure 9, we can observe that the sharpness score (i.e., SS) wrongly judges A2 > A1 > A3, and the naturalness score (i.e., −NS) mistakenly gives A1 > A3 > A2. This reveals the drawbacks of SS and NS: SS over-estimates the sharpening effect (A2 > A1), while NS over-emphasizes image naturalness or smoothness (A3 > A2). In contrast, our zoom quality metric, indicated by the level sets of straight lines in Figure 9, successfully gives the order A1 > A2 > A3;
2. Similarly, in the second row of Figure 8, B1, B2 and B3 are all 5× zoom photos interpolated from three 2× optical lenses but with different ISPs and post-processing algorithms. The quality order is B1 > B2 > B3. However, the SS of B2 is larger than that of B1 because of annoying artifacts. Although NS correctly predicts B1 > B2, it over-rates the softness (zoom blur) present in B3, thus wrongly judging B3 > B2. By combining SS with NS, our metric Q achieves the correct order B1 > B2 > B3;
3. In the bottom row, C1 is the original "hestain" image, and C2 and C3 are created using the Matlab imsharpen function with different amounts of unsharp masking. The quality order is C2 > C1 > C3, as it is well known that a moderate amount of sharpening can improve an image's perceptual quality, while excessive sharpening leads to a more unnatural appearance, thereby degrading the naturalness. However, from Figure 9, we can see that the SS scores increase monotonically with the sharpening amount, that is, C3 > C2 > C1, while NS penalizes C2 too much, leading to C1 > C2 > C3. In contrast, our metric Q evaluates them more appropriately (C2 > C1 > C3). This implies that our zoom quality metric can be used to control the parameters of sharpening algorithms.

Performance Comparison
(Figure 9: the line passing through C2 almost overlaps with that of A3 and is thus omitted for a better view.)

In this subsection, we compare the proposed quality metric with seventeen state-of-the-art NR-IQA algorithms on the zoom photo database, which can be classified into three categories: sharpness-specific, unsupervised general-purpose and supervised general-purpose ones. The sharpness metrics include JNB [13], CPBD [15], S3 [17], FISH [18], LPC [19], SPARISH [23] and S3-III [27]. The unsupervised or opinion-free algorithms are NIQE [32], SNP-NIQE [33], IL-NIQE [65], NPQI [74], LPSI [75] and QAC [76]. The supervised models are BIQI [77], BRISQUE [31], BLIINDS-II [64] and M3 [78]. Except for S3-III [27], the source code of all these algorithms is obtained from the original authors or their public websites. We implement the S3-III algorithm [27] ourselves. The SROCC, KROCC and PLCC are calculated using the protocol suggested by VQEG [79]. Table 3 lists the performance on our zoom photo database. The best-performing method is marked in boldface. It can be seen that the general-purpose NR methods assess the quality of the zoom-blurred images only moderately well, owing to their general QA ability for distorted images. Compared with the general-purpose methods, the sharpness-specific methods achieve better prediction results. This can be verified by the observation that most of the SROCC values of the sharpness metrics are higher than 0.75. Moreover, S3-III [27] does not improve on S3 [17] by a large margin on our database. Last but not least, our proposed zoom quality metric achieves better prediction performance than all of the competing methods and outperforms them remarkably.
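The three correlation criteria can be computed as sketched below. This is a plain NumPy illustration with hypothetical function names; the nonlinear mapping that evaluation protocols typically apply before computing PLCC is omitted, and the Kendall coefficient is the simple tau-a variant without tie correction, both of which are assumptions of this sketch.

```python
import numpy as np


def average_ranks(a):
    """Ranks starting at 1, with tied values sharing their average rank."""
    a = np.asarray(a, dtype=float)
    ranks = np.empty(len(a))
    ranks[np.argsort(a, kind="mergesort")] = np.arange(1, len(a) + 1)
    for v in np.unique(a):
        mask = a == v
        ranks[mask] = ranks[mask].mean()
    return ranks


def plcc(pred, mos):
    """Pearson linear correlation coefficient."""
    return float(np.corrcoef(pred, mos)[0, 1])


def srocc(pred, mos):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    return plcc(average_ranks(pred), average_ranks(mos))


def krocc(pred, mos):
    """Kendall rank-order correlation (tau-a) from pairwise concordance."""
    x, y = np.asarray(pred, dtype=float), np.asarray(mos, dtype=float)
    n = len(x)
    sx = np.sign(x[:, None] - x[None, :])
    sy = np.sign(y[:, None] - y[None, :])
    upper = np.triu_indices(n, k=1)
    return float((sx * sy)[upper].sum() / (n * (n - 1) / 2))


# Toy scores: a monotonic but nonlinear relation (illustrative values only).
scores = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
opinions = scores ** 3
```

For a monotonic but nonlinear relation such as this toy pair, the rank-based SROCC and KROCC are perfect while PLCC is not, which is why rank correlations are reported alongside PLCC.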

The Discussion of w: The Tradeoff between Sharpness and Naturalness
The special case w = 0 corresponds to the pure sharpness measure, while −w → ∞ corresponds to the pure naturalness measure. Figure 10 plots the SROCC value versus different values of w. It can be seen that the SROCC increases steeply when −w is in (0, 0.3), then reaches a plateau in (0.4, 1) and eventually drops as −w increases toward ∞. This is because a small −w (−w < 0.3) only emphasizes sharpness and ignores photo smoothness or naturalness. This is also the drawback of existing sharpness indices, which are not closed-loop. At the other end, a very large −w (−w > 1) gives more weight to image softness, which contradicts the common sense that an appropriate sharpening operation can improve perceptual quality. By choosing an appropriate −w, we can obtain a trade-off between sharpness and naturalness. In this paper, we choose −w = 0.7, as it achieves the best result in Figure 10. However, it is worth noting that all values of −w in (0.4, 1) are reasonable choices, and the choice depends on personal preference. People who prefer a smoother photo may choose a larger −w, and vice versa. Some phone models, such as the Galaxy S23 series, offer such a softness adjustment option.
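A sweep over −w of the kind plotted in Figure 10 can be sketched as follows. The data here are synthetic and generated so that the ground truth equals SS − 0.7 · NS exactly; they are not the scores of the zoom photo database, and the helper names are hypothetical. The rank computation assumes scores without ties.

```python
import numpy as np


def srocc_no_ties(a, b):
    """Spearman correlation for tie-free data: Pearson correlation of the ranks."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    return float(np.corrcoef(ra, rb)[0, 1])


def sweep_neg_w(ss, ns, mos, neg_w_grid):
    """SROCC of Q = SS + w * NS against MOS for each candidate value of -w."""
    return np.array([srocc_no_ties(ss - nw * ns, mos) for nw in neg_w_grid])


rng = np.random.default_rng(0)
ss = rng.normal(size=50)            # synthetic sharpness scores
ns = rng.normal(size=50)            # synthetic unnaturalness scores
mos = ss - 0.7 * ns                 # synthetic ground truth (assumption)

neg_w_grid = np.round(np.arange(0.0, 2.01, 0.1), 1)
curve = sweep_neg_w(ss, ns, mos, neg_w_grid)
best_neg_w = neg_w_grid[int(np.argmax(curve))]
```

Because SROCC depends only on the ordering of the scores, neighboring values of −w can yield identical correlations, which is consistent with the plateau described above.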

Limitations of the Current Work
Although the proposed metric achieves the best result on the zoom quality database, several limitations remain: (1) our metric does not consider the influence of color differences and exposure variations. Although detail rendering is perhaps the most decisive factor in zoom photo quality, taking other quality aspects into account is also necessary, since the HDR capability and color rendering of zoom lenses are not always as consistent or as good as those of the main camera [47]; (2) the constant w could be generalized to a function w(c, q). As we mentioned in Section 2.2, image content such as characters can tolerate more sharpening than textures, animal fur and human skin. Compared to high-quality photos, images suffering from heavy zoom blur could also benefit from more sharpening. In these two cases, the magnitude of w could be lowered; (3) besides using w(c, q), another way to improve the metric performance is to utilize machine learning [80], which we will look into in the near future; (4) there has been a trend of using AI-restoration techniques, especially for long-range zoom photos. These AI-generated textures may improve perceptual quality for characters, but the fake, wrinkle-like details can make a photo look dirty and weird. Currently, our algorithm cannot handle the quality degradation caused by AI-generated textures very effectively.

Conclusions
Zoom photos differ from Gaussian-blurred images in their over-sharpened appearance and harmful artifacts. To assess them properly, we first build a zoom photo database which covers 20 mobile units and 45 texture-complex scenes. Then we propose a novel zoom quality metric that considers both sharpness and naturalness. To evaluate the sharpness, we are the first to give the whole formulation of free-energy theory under sparse coding, and we leverage both the energy and entropy of the predicted and residual images.
To measure the naturalness, we extract a set of model parameters of the MSCN coefficients, and then compare them with those of pristine images under the multivariate Gaussian model. In the experiments, the drawbacks of using sharpness or naturalness alone are revealed, and the effectiveness of their weighted summation is illustrated by three scenarios. We also discuss different choices of the linear combination coefficient. Finally, the SROCC, KROCC and PLCC on the zoom photo database demonstrate the superiority of our metric over traditional sharpness and general-purpose methods.