Visual Saliency Guided Foveated Video Compression

Video compression has become increasingly crucial as video resolution and bitrate have surged in recent years. However, most widely applied video compression methods do not fully exploit the characteristics of the Human Visual System (HVS) to reduce perceptual redundancy in videos. In this paper, we propose a novel video compression method that integrates visual saliency information with foveation to reduce perceptual redundancy. We present a new approach to subsample and restore the input image using saliency data, which allocates more space for salient regions and less for non-salient ones. We analyze the information entropy in video frames before and after applying our algorithm and demonstrate that the proposed method reduces redundancy. Through subjective and objective evaluations, we show that our method produces videos with superior perceptual visual quality. Moreover, our approach can be added to most existing video compression standards without altering their bitstream format.


I. INTRODUCTION
Widely applied video compression methods, e.g., AVC [1], HEVC [2], VP8 [3], VP9 [4] and AV1 [5], use block-based algorithms to reduce spatial and temporal redundancy. A video frame is first divided into several blocks, then the encoder performs intra-frame (for spatial redundancy) or inter-frame (for temporal redundancy) predictions according to the frame type. Blocks might be partitioned into smaller ones in this process. The encoder calculates the errors and transmission costs of different prediction modes and partitioning patterns, and records the best performing combination for transmission. Next, a block-wise transform is applied to the prediction errors, resulting in coefficients in another domain. Discrete cosine transform (DCT) and discrete sine transform (DST) are commonly used as block-wise transforms. The coefficients are then quantized and encoded into a bitstream.
Various compression methods can effectively eliminate spatial and temporal redundancies. However, the spatially-varying sensing characteristics of the Human Visual System (HVS) are often not considered. Humans have two types of photoreceptors in the eye, namely rods and cones [6]. Rods and cones are unevenly distributed across the human retina. Cones have the highest density in the fovea, the center of the retina, while rods are almost absent in the fovea and reach their highest density in a 10 to 20 degree periphery of the fovea. Rods and cones also have different sensitivity to light. Rods support vision under low illumination levels, while cones support vision under normal and higher brightness. Even though rods have higher sensitivity, their visual acuity under low illumination is extremely poor compared to visual acuity under photopic conditions. The reason for this is that signals from many rods converge onto a single neuron within the retina. This improves sensitivity in exchange for spatial resolution. On the other hand, every cone is connected to multiple neurons, and cones have a high density in the fovea. This means that the fovea has a higher spatial resolution than the periphery. As a result, the HVS encodes more information from the center of the receptive field and less from the periphery.
Existing compression methods treat all parts of a video frame equally, encoding all blocks at the same resolution. Resolution scales uniformly as the target video resolution changes. This introduces perceptual redundancy, since information in the periphery is sampled at a lower spatial resolution due to the characteristics of the HVS. To eliminate this redundancy, different blocks in the video need to be encoded at different resolutions depending on their locations. Blocks in the periphery should be encoded at a lower resolution, while blocks in the fovea should be encoded at a higher resolution. In this paper, we propose a novel video compression method which incorporates the non-uniform spatial resolution of the HVS to reduce perceptual redundancy. The proposed method has the following novel features:
• A foveation process based on per-quad image warping is used to preserve image quality in salient regions, achieving non-uniform subsampling based on saliency level.
• The saliency data is incorporated at a lower granularity, providing more precise quality control of salient regions.
• Our method is independent of traditional encoding processes, making it applicable to improve most existing compression methods.

II. RELATED WORK
A. VISUAL SALIENCY
Visual saliency data gives us a description of visual fixation points and relative saliency levels in image and video frames. Saliency information can be obtained by using eye-trackers to track eye movements while viewing images and videos. However, gathering such data requires specific hardware, a proper experimental setup, and many participants for subjective evaluations. Thus, researchers have proposed many visual saliency models using biological/psychological knowledge and machine learning methods. Visual saliency detection methods can be categorized into bottom-up and top-down models [7]. Before deep learning was widely applied in this field, most of the early methods were bottom-up models. These early methods usually built on biological and psychological research about the visual attention mechanism. Furthermore, these two approaches match common beliefs about the biological process of human vision. In general, these models try to establish links between visual saliency and low-level image features, such as color, contrast, and brightness [7], [8].
Differing from the above approaches, top-down models try to find factors that have the most impact on visual saliency. These models use visual saliency datasets, which contain images and their saliency annotations, for a data-driven analysis. In recent years, deep learning has been introduced into this area and has boosted the performance of saliency prediction [9], [10], [11], [12], [13], [14], [15], [16].

B. FOVEATED COMPRESSION
An early line of foveated compression work is based on the variable resolution (VR) transformation [17], which relocates pixels according to their distance from the fixation point. Foveal priority dithering was introduced to improve the performance of video transmission under the asynchronous transfer mode (ATM) protocol [18]. VR was later extended for improvement of the MPEG algorithm based on the available network bandwidth [19], transmission of 3D mesh and texture [20], and improvement of the HEVC algorithm [21]. The distinct advantage of VR-based methods is that the quality of an image changes smoothly and continuously. This prevents creating hard edges or artifacts around region boundaries.
Other research approaches usually make improvements based on existing video compression methods like JPEG2000, AVC, and HEVC. Sanchez et al. used a Gaussian distribution to assign different priority levels to data packets according to their distance to the region of interest (ROI) [22]. Pohl et al. used an eye-tracker to get real-time fixation information. They divided the video into several fixed tiles, then compressed different tiles at different resolutions based on the fixation information. Another approach for foveated compression is to set different quantization parameters (QPs) for different regions in a video frame [23], [24], [25], [26], [27]. QP controls the step length in the quantization process of coefficients. A higher QP results in larger quantization steps, which causes the decoded image quality to decrease and the compression ratio to increase. These foveated compression methods also used eye-trackers to acquire real-time saliency information and assign higher QPs to regions with higher visual saliency. Polakovič et al. blurred the blocks in the visual periphery to remove details in those areas and consequently remove high frequency components in the transformed coefficients [28].
In conclusion, existing foveated compression methods can be classified into two main categories: VR-based methods and ROI-optimized methods based on existing video encoders. VR-based methods use pixel relocation to achieve foveation based on the distance from the fixation point. However, this approach enlarges salient areas and is less effective in handling multiple fixation points. In contrast, the proposed method addresses these limitations by using a per-quad image warping process for foveation. In addition, ROI-optimized methods based on AVC/HEVC are limited by block-based compression, which necessitates the encoding and transmission of all pixels regardless of their saliency. However, the proposed method overcomes this constraint through a novel saliency-based image warping process, enabling the removal of unimportant pixels before encoding and transmission. This property also makes the proposed method compatible with most existing video compression methods.

III. PROPOSED METHOD
A. OVERVIEW
Our method aims to reduce perceptual redundancy, which is usually not handled by widely used video compression methods. As mentioned in the introduction, the HVS encodes more information from the center than from the periphery of the receptive field. Thus, humans are more sensitive to quality degradation around the fixation point. Furthermore, the salient area in a single video frame, i.e., the area that needs to retain relatively high quality after compression, is usually a very small part of the frame given the limited viewing time for each frame. Based on these two factors, we design an algorithm that subsamples different regions of a video frame at different sampling rates according to their saliency. Pixels in salient areas are sampled at a higher sampling rate to preserve the image quality in those areas, while other pixels are sampled at a much lower sampling rate to effectively reduce perceptual redundancy. The overall pipeline of our approach is illustrated in Fig. 4.

B. SALIENCY ENCODING
The saliency map is a grayscale image. This map needs to be transmitted to the decoder to provide the necessary information for image reconstruction. However, transmitting it as an image significantly increases the data size. As an alternative, we use a few parameters to describe the saliency map and only transmit these parameters for reconstruction on the decoder end. Such a simplification is possible because the saliency maps are formed by a combination of several Gaussian distributions.
The saliency datasets we use are collected using eye-trackers or similar technologies. The direct output of these devices is fixation points rather than saliency maps. For every frame, a collection of fixation points is generated:

P = {(x_i, y_i) | i = 1, 2, . . . , n},

where (x_i, y_i) are the coordinates of a single fixation point i, and n is the total number of fixation points. Then, the saliency map is generated using Eq. 1.
Function f(x_j, y_j) gives the saliency value at location (x_j, y_j):

f(x_j, y_j) = A · exp(−(a(x_j − x_0)² + 2b(x_j − x_0)(y_j − y_0) + c(y_j − y_0)²)),   (1)

where A is the amplitude of the distribution and (x_0, y_0) is the center of the distribution. a, b, c are the other three parameters that define the distribution:

a = cos²θ / (2σ_X²) + sin²θ / (2σ_Y²),
b = −sin(2θ) / (4σ_X²) + sin(2θ) / (4σ_Y²),   (2)
c = sin²θ / (2σ_X²) + cos²θ / (2σ_Y²),

where θ is the angle of the long axis of the distribution blob, and σ_X and σ_Y are the standard deviations along the X and Y axes, respectively. Let D be the set of parameters:

D = {A, x_0, y_0, σ_X, σ_Y, θ}.

Then, the parameterization of the saliency map can be formulated as an optimization problem; namely, finding the D that minimizes the difference between the actual saliency values and the fitted values generated using D in Eq. 1:

D_min = argmin_D Σ_{j=1}^{m} (S_{x_j, y_j} − f_D(x_j, y_j))²,

where D_min is the optimal parameter set, m is the total number of pixels, S_{x_j, y_j} is the saliency value in the ground-truth saliency map at location (x_j, y_j), and f_D(x_j, y_j) is the fitted saliency value at (x_j, y_j) calculated using f with the parameter set D. This equation can be solved using a non-linear least-squares solver.
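To make the parameterization concrete, the sketch below fits a single rotated 2D Gaussian (Eqs. 1-2) to a ground-truth saliency map with a non-linear least-squares solver. It is a minimal illustration under our own assumptions, not the paper's implementation: the function names and the single-component fit are ours, whereas a real saliency map would typically be modeled with one component per fixation point.

```python
import numpy as np
from scipy.optimize import least_squares

def gaussian_2d(params, xs, ys):
    # Rotated 2D Gaussian as in Eqs. 1-2; params = (A, x0, y0, sigma_x, sigma_y, theta).
    A, x0, y0, sx, sy, theta = params
    a = np.cos(theta) ** 2 / (2 * sx ** 2) + np.sin(theta) ** 2 / (2 * sy ** 2)
    b = -np.sin(2 * theta) / (4 * sx ** 2) + np.sin(2 * theta) / (4 * sy ** 2)
    c = np.sin(theta) ** 2 / (2 * sx ** 2) + np.cos(theta) ** 2 / (2 * sy ** 2)
    dx, dy = xs - x0, ys - y0
    return A * np.exp(-(a * dx ** 2 + 2 * b * dx * dy + c * dy ** 2))

def fit_saliency_map(sal_map, init):
    # sal_map: HxW ground-truth saliency map; init: initial guess, e.g. centered on a fixation point.
    ys, xs = np.mgrid[0:sal_map.shape[0], 0:sal_map.shape[1]]
    xs, ys, target = xs.ravel(), ys.ravel(), sal_map.ravel()
    residual = lambda p: gaussian_2d(p, xs, ys) - target
    return least_squares(residual, init).x  # optimal parameter set D_min
```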

C. FOVEATION USING IMAGE WARPING
In foveated compression, the quality of the salient areas needs to be preserved, and this restriction gradually relaxes as the distance from the salient areas increases. Variable resolution transformation [17] introduces one way to approach this problem, by relocating pixels to new positions based on their distances to the fixation point. After VR transformation, the area around the fixation point is enlarged and the other areas are squeezed. However, this is not optimal, since the salient area takes up more space than in the original image. It also reduces the space available for other areas and impacts the overall image quality. Furthermore, when dealing with multiple fixation points, two methods can be used with VR, namely collaborative foveae and competing foveae. However, which method is better for compression cannot be determined without applying both and assessing the resulting image quality. To address these issues, we propose a subsampling strategy inspired by feature-aware texturing [29].

1) PROBLEM FORMULATION
Our goal is to find an image warping function that can reduce the total number of pixels in an image by sub-sampling, while maintaining the image quality of salient regions. Specifically, the function W : R² → R² defines a mapping of pixel locations from the original image to the warped image:

W(x_i, y_i) = (x_i', y_i').   (4)

In Eq. 4, (x_i, y_i) is the coordinate of a pixel in the original image, with 0 ≤ x_i ≤ w and 0 ≤ y_i ≤ h, where h and w are the height and width of the original image, respectively. (x_i', y_i') is the coordinate of the corresponding pixel in the warped image, with 0 ≤ x_i' ≤ w' and 0 ≤ y_i' ≤ h', where h' and w' are the height and width of the warped image, respectively. Since we are reducing the total number of pixels, we have hw > h'w'. To preserve image quality in salient areas, W needs to sample the salient regions at a higher sampling rate, and other regions at a lower sampling rate. We divide the original image into a rectangular grid and denote the resulting mesh as M = (V, E, F), where V = {v_1, v_2, . . . , v_n} is the set of vertices of the mesh, E is the set of edges between adjacent vertices, and F is the set of faces formed by vertices and edges. We denote the set of quads formed by four adjacent vertices as Q = {Q_ij}, where Q_ij is the quad located at the i-th row (r rows in total) and j-th column (c columns in total), and v_ij1, v_ij2, v_ij3, v_ij4 are its four vertices. The quads are illustrated in Fig. 1.
The saliency map associated with the input image specifies visual saliency at the pixel level. We divide this saliency map using the same mesh and obtain the set of faces F_s = {S_ij}, where S_ij is the face corresponding to Q_ij. We define the salient areas as the set of quads Q_s whose average saliency level exceeds the threshold s_t. In general, a smaller saliency threshold will result in a larger area being categorized as salient, and vice versa. In cases where saliency predictions may not be precise, a smaller threshold value is recommended, as it increases the probability of capturing the actual salient regions by expanding the labeled salient regions.
Formally,

Q_s = { Q_ij | (1 / |S_ij|) Σ_{p ∈ S_ij} m_p > s_t },   (5)

where, in Eq. 5, m_p is the pixel value in the saliency map and |S_ij| is the number of pixels in S_ij. Now, we can formulate the problem as finding the warping function that transforms all quads in Q to reduce the image size while maintaining the size of any Q_ij ∈ Q_s.
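As a small illustration of Eq. 5, the following sketch labels the quads of a regular mesh as salient or not by thresholding their mean saliency. The grid layout, function name, and the assumption that the image divides evenly into quads are ours, for illustration only.

```python
import numpy as np

def salient_quads(sal_map, rows, cols, s_t):
    """Label quads whose mean saliency exceeds the threshold s_t (cf. Eq. 5)."""
    h, w = sal_map.shape
    qh, qw = h // rows, w // cols          # quad size in pixels (assumes exact division)
    mask = np.zeros((rows, cols), dtype=bool)
    for i in range(rows):
        for j in range(cols):
            S_ij = sal_map[i * qh:(i + 1) * qh, j * qw:(j + 1) * qw]
            mask[i, j] = S_ij.mean() > s_t
    return mask                            # True where Q_ij belongs to Q_s
```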

2) FEATURE PRESERVING MESH TRANSFORMATION
To preserve the quality of salient areas, a bigger portion of pixels is sampled in any Q_ij ∈ Q_s than in other quads. As a result, any Q_ij ∈ Q_s contains more pixels than other quads and thus has a larger area. The variation in size makes it hard to describe the whole transform as a single warping function. Thus, we carry out the transformation on a per-quad basis. Similar to the VR transformation, our method results in salient regions being enlarged and other regions being squeezed. As a consequence, the relative offsets of quads from the origin might change after the transformation, and it is not trivial to calculate the new offsets. We therefore leave the translation of each quad as a free parameter, as long as the other restrictions are met. The transformation of a quad can be expressed as the transformation of its four edges. We denote the four edges as vectors e_1, e_2, e_3, e_4, where

e_1 = v_ij2 − v_ij1,  e_2 = v_ij3 − v_ij2,  e_3 = v_ij4 − v_ij3,  e_4 = v_ij1 − v_ij4.

The transformation can be performed as a matrix multiplication:

ẽ' = T ẽ,

where ẽ' and ẽ are the transformed and original homogeneous coordinates of the edge, in the form ẽ = [e_x, e_y, 0]^T. Then, the target transformed edge can be calculated as:

ẽ' = ṽ'_(k+1) − ṽ'_k = T (ṽ_(k+1) − ṽ_k),   (8)

where ṽ_k represents the homogeneous coordinates of the vertex v_k. This equation defines the relationship between the transformed vertices and the original ones. For any Q_ij ∈ Q_s, only translation is allowed, since we want to maintain the original size. Thus, their transformation matrices have the form:

T = [ 1  0  t_x ]
    [ 0  1  t_y ]
    [ 0  0  1   ],

where t_x and t_y are translation parameters. Using Eq. 8 enables t_x and t_y to be free. A total of 4N_s linear equations can be obtained from Eq. 8, where N_s is the number of Q_ij ∈ Q_s.
For all other quads, we want them to scale with the entire image using the same scaling ratio. Thus, their transformation matrices have the form:

T = [ s_x  0   0 ]
    [ 0   s_y  0 ]
    [ 0    0   1 ],

where s_x and s_y are the scaling ratios along the x and y axes, respectively. A total of 4N_ns linear equations can be obtained from Eq. 8, where N_ns is the number of Q_ij ∉ Q_s. Furthermore, vertices on the boundaries of the original image should stay on the boundaries after the transformation. Thus, for these vertices, Eq. 9 is used.
In total, we have 4(N_s + N_ns) equations, and they form a system of linear equations. This system is overdetermined, so we solve it in the least-squares sense. Solving this system for v' gives us the transformed homogeneous coordinates with their squared errors to the desired coordinates minimized.
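The following sketch shows one way the per-edge equations and boundary conditions could be assembled into an overdetermined linear system and solved with ordinary least squares. It is a simplified stand-in for the paper's formulation, under our own assumptions: vertices are solved for directly in Cartesian rather than homogeneous coordinates, boundary conditions are approximated by heavily weighted equations, and the per-quad weights described in the next subsection are passed in as an array. All names are hypothetical.

```python
import numpy as np

def warp_mesh(verts, quads, salient, s_x, s_y, w_out, h_out, weights):
    """Solve for warped vertex positions in the least-squares sense.

    verts:   (N, 2) original vertex coordinates
    quads:   (M, 4) vertex indices per quad, ordered around the quad
    salient: (M,) bool, True for quads in Q_s
    weights: (M,) per-quad weights (e.g. average saliency, >= 1)
    """
    N = len(verts)
    rows_A, rows_b = [], []

    def add_eq(coeffs, rhs, w):
        row = np.zeros(2 * N)
        for idx, c in coeffs:
            row[idx] = c
        rows_A.append(w * row)
        rows_b.append(w * rhs)

    for quad, sal, wq in zip(quads, salient, weights):
        for k in range(4):
            a, b = quad[k], quad[(k + 1) % 4]
            ex, ey = verts[b] - verts[a]
            # Salient quads: edges are preserved (translation only); others are scaled.
            tx, ty = (ex, ey) if sal else (s_x * ex, s_y * ey)
            add_eq([(2 * b, 1.0), (2 * a, -1.0)], tx, wq)          # x-component of the edge
            add_eq([(2 * b + 1, 1.0), (2 * a + 1, -1.0)], ty, wq)  # y-component of the edge

    # Boundary vertices stay on the new image boundary (large weight ~ hard constraint).
    w_in, h_in = verts[:, 0].max(), verts[:, 1].max()
    for i, (x, y) in enumerate(verts):
        if x == 0:    add_eq([(2 * i, 1.0)], 0.0, 10.0)
        if x == w_in: add_eq([(2 * i, 1.0)], w_out, 10.0)
        if y == 0:    add_eq([(2 * i + 1, 1.0)], 0.0, 10.0)
        if y == h_in: add_eq([(2 * i + 1, 1.0)], h_out, 10.0)

    sol, *_ = np.linalg.lstsq(np.array(rows_A), np.array(rows_b), rcond=None)
    return sol.reshape(N, 2)  # warped vertex coordinates v'
```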

3) SALIENCY GUIDED WEIGHTING
In the system of linear equations, for any single equation, multiplying both the left- and right-hand sides by the same weight parameter w does not break the equality. However, when solving this system in the least-squares sense, adding the weight w causes the squared residual to be multiplied by w². Consequently, the transformed locations of vertices with bigger weights will be closer to their locations estimated by the given transformation. This means the shapes and sizes of quads containing those vertices are better preserved. Thus, we apply the average saliency level of a quad as the weight w to all four edges in the quad. The corresponding equations are changed to:

w (ṽ'_(k+1) − ṽ'_k) = w T (ṽ_(k+1) − ṽ_k).

We use the value 1 as the minimum of the saliency level.
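The effect of such weights on a least-squares solution can be seen in a two-equation toy system: multiplying one equation (both sides) by w makes its squared residual count w² times as much, pulling the solution toward that equation. The snippet below is only a numerical illustration of this behavior, not part of the proposed pipeline.

```python
import numpy as np

# Two inconsistent equations for one unknown x: x = 0 and x = 1.
# Up-weighting the second equation pulls the least-squares solution toward 1.
A = np.array([[1.0], [1.0]])
b = np.array([0.0, 1.0])
for w in (1.0, 3.0):
    Aw, bw = A.copy(), b.copy()
    Aw[1] *= w; bw[1] *= w                       # multiply both sides by the weight
    x, *_ = np.linalg.lstsq(Aw, bw, rcond=None)
    print(w, x[0])                               # w=1 -> 0.5, w=3 -> 0.9
```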

D. SALIENT AREA SCALING
It is not always possible to keep the size of the salient areas unchanged. If the target compression scale is too small, the salient areas also need to be scaled to make sure they do not fall outside the compressed image. Thus, we calculate the maximum possible scale for the salient area using Eq. 11:

s_rsx = (max x_ij − min x_ij) / w + 2 s_margin,
s_rsy = (max y_ij − min y_ij) / h + 2 s_margin,   (11)

where (x_ij, y_ij) is the coordinate of a point in S_ij, the maxima and minima are taken over all salient faces, and s_margin is a parameter that controls the space reserved for peripheral regions, as shown in Fig. 2. The required salient area scales, s_rsx and s_rsy, are thus defined as the maximum ratio of the area occupied along the x and y axes plus the two margins. Then, the maximum possible scales are calculated by dividing the target scales by the required salient area scales. If the result is larger than 1, 1 is used as the scale.
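One possible reading of the salient-area scaling rule described above is sketched below. The helper name, the bounding-box input, and the default margin value are assumptions made for illustration.

```python
def max_salient_scale(salient_bbox, w, h, s_target_x, s_target_y, s_margin=0.05):
    """Cap the scale applied to salient areas, following the rule described above.

    salient_bbox: (x_min, x_max, y_min, y_max) over all points in the salient faces S_ij.
    Returns the scales actually applied to the salient quads.
    """
    x_min, x_max, y_min, y_max = salient_bbox
    s_rsx = (x_max - x_min) / w + 2 * s_margin   # required horizontal share of the output
    s_rsy = (y_max - y_min) / h + 2 * s_margin   # required vertical share of the output
    s_sx = min(s_target_x / s_rsx, 1.0)          # never enlarge salient areas beyond 1:1
    s_sy = min(s_target_y / s_rsy, 1.0)
    return s_sx, s_sy
```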

1) PERIPHERAL IMAGE QUALITY CONSTRAINTS
The non-salient parts of the image might suffer from a loss of quality because of the sudden change in the transformation method at the boundaries of salient regions. The salient quads are only allowed to translate, while the non-salient quads are allowed to translate and transform perspectively. This results in a relatively large deformation at the boundaries, as shown in Fig. 3. To address this problem, we develop a smoothing weighting method and introduce a uniform constraint on non-salient quads.
The weights of non-salient quads are defined in Eq. 12.
where S' is the new saliency map generated using the parameters discussed in Section III-B. Specifically, 1.5σ_X and 1.5σ_Y are used to increase the saliency level at the boundaries, consequently increasing the weight of quads in that region. This also ensures that the saliency level changes smoothly from the fixation centers to the peripheral regions, which prevents generating artifacts due to sudden changes in weight. Furthermore, we introduce the uniform constraint as a set of additional linear equations to further alleviate this problem. To make sure pixels are uniformly sampled in the non-salient quads, one intuitive approach is to make sure quads in the same row (column) have the same width (height).
Thus, the linear equations can be formulated as Eq. 13, where ṽ_0 denotes the vertices of the first quad in this row (column), and w is calculated using Eq. 12.
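Continuing the mesh-warping sketch given earlier, the uniform constraint could be added as extra weighted rows that ask every non-salient quad in a row to match the width of the first quad in that row (the column/height case is symmetric). This is an illustrative sketch under the same assumed data layout (quad vertex order, row-major quad indexing), not the paper's exact formulation of Eq. 13.

```python
def uniform_row_constraints(quads, salient, rows, cols, weights, add_eq):
    """Extra equations asking non-salient quads in a row to share the first quad's width.
    add_eq is the helper from the mesh-warping sketch above; weights come from the
    smoothed saliency map S' of Eq. 12."""
    for i in range(rows):
        a0, b0 = quads[i * cols][0], quads[i * cols][1]   # top-left/top-right of first quad
        for j in range(1, cols):
            idx = i * cols + j
            if salient[idx]:
                continue
            a, b = quads[idx][0], quads[idx][1]
            # (x'_b - x'_a) - (x'_b0 - x'_a0) = 0, scaled by the quad's weight
            add_eq([(2 * b, 1.0), (2 * a, -1.0), (2 * b0, -1.0), (2 * a0, 1.0)],
                   0.0, weights[idx])
```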
As illustrated in Fig. 3, when transforming the mesh without any constraints, the salient areas near the center overlap with each other, and the quads near the edge of the image are squeezed into a very small area. Because of this, for some quads in those areas, no pixel is sampled, and the information in those quads is totally lost. After applying the smoothing weight method, the extent of deformation on the boundaries of salient areas is reduced, as shown in Fig. 3 (c). Finally, after applying the uniform constraint, the salient areas do not overlap anymore, and the deformation is further reduced, allowing the pixels in non-salient areas to be sampled more uniformly.

E. EFFECTIVENESS OF FOVEATION IN REDUCING REDUNDANCY
The foveation process reduces the total number of pixels that need to be encoded. We assume that this can reduce redundancy in the video frames. To verify this assumption, we calculate the average information entropy and total information entropy of all video frames in the dataset before and after applying foveation. Information entropy measures the average level of information that a random variable contains [30]. For a discrete random variable X with n possible values x_1, x_2, . . . , x_n, the information entropy H(X) is defined as

H(X) = − Σ_{i=1}^{n} p_i log_2 p_i,   (14)

where p_i is the probability of x_i.
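For reference, Eq. 14 applied to the intensity histogram of an 8-bit grayscale image can be computed as in the sketch below. Note that delentropy [31], which we use in the measurements reported here, instead builds the distribution from image gradients; this snippet is only the plain intensity-histogram version and the function name is ours.

```python
import numpy as np

def shannon_entropy(img, bins=256):
    """Average information entropy (Eq. 14) of an 8-bit grayscale image, in bits/pixel."""
    hist, _ = np.histogram(img, bins=bins, range=(0, bins))
    p = hist / hist.sum()
    p = p[p > 0]                      # drop empty bins; 0 * log(0) is taken as 0
    return -np.sum(p * np.log2(p))
```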
The entropy can be seen as a lower bound on the average number of bits needed to encode a random variable. However, it is not trivial to extend Shannon's original information entropy to higher dimensions, such as images. We use delentropy to measure the average information entropy of an image in our experiment, as it compares favorably with the conventional intensity-based histogram entropy and with the compressed data rates of a lossless image encoder [31]. The results are shown in Tables 1 and 2, and they confirm our assumption. As shown in Table 1, the average entropy increases from 4.0907 bits per pixel (bpp) to 4.4182 bpp after the foveation process. According to [31], images with simple patterns, such as a pure black image, have a lower average entropy than images with complex patterns, such as a natural scene. Therefore, the increase in the average entropy indicates that the foveation process has removed some redundancy in the video frames. It also means that each pixel carries more meaningful information than before.
The total entropy for a frame decreases from 1.29E6 bits to 6.95E5 bits, which means that the total amount of information in the video frames has been reduced. Since this is the lower bound on the average number of bits needed to encode a frame without loss, it shows that our method can effectively reduce redundancy in the video frames, if the perceived quality remains similar. Thus, adding the foveation process can help increase the compression ratio. Table 2 shows the results of all categories, and we can draw similar conclusions.

IV. EXPERIMENTS AND DISCUSSION
We test our compression algorithm on the UCF Sports dataset [32]. The UCF Sports dataset contains 150 videos in 12 categories. The resolution of these videos is 720 × 480. We implement the test pipeline using the GStreamer framework, as shown in Fig. 4. First, the input video sequence is decoded into a YUV sequence and then converted to the RGB color space. Then, the saliency map of the corresponding frame is read and decomposed into a combination of several Gaussian distributions. The parameters of these Gaussian distributions are then saved for reproducing the saliency maps on the decoder end. To be specific, the parameters are converted to 16-bit floating-point numbers and saved as bytes in a raw text subtitle track in the MKV container. Next, the warped mesh is generated based on the saliency map reconstructed from the Gaussian parameters. We then compute a pixel location mapping from the original frame to the compressed frame from the warped mesh parameters. Finally, we construct the compressed RGB frame using the mapping and encode the resulting frame using a video encoder, such as H.264 or H.265.
This process is repeated for all frames in a video. Assuming the eye movement is relatively small during a short interval of time, we only generate a new warped mesh every 5 frames (about 167ms in a 30 frame/second video) to reduce the computational complexity.
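A minimal sketch of how the per-frame Gaussian parameters might be serialized to 16-bit floats before being written into the subtitle track, and recovered on the decoder side, is given below. The function names, the parameter order, and the use of NumPy are our assumptions; the actual container muxing is handled by GStreamer elements and is not shown.

```python
import numpy as np

def pack_params(gaussians):
    # gaussians: list of (A, x0, y0, sigma_x, sigma_y, theta) tuples for one frame
    return np.asarray(gaussians, dtype=np.float16).tobytes()

def unpack_params(blob, n_params=6):
    # Decoder side: rebuild the parameter array from the raw bytes in the subtitle track
    arr = np.frombuffer(blob, dtype=np.float16).astype(np.float32)
    return arr.reshape(-1, n_params)
```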
The decoding process is the reverse of the encoding process. First, the compressed video is decoded by a video decoder to produce YUV frames, which are then converted to the RGB color space. Then, the saved Gaussian parameters are extracted from the raw text subtitle track in the MKV container and used to reconstruct the saliency map. Next, the saliency map is used to compute the warped mesh and a pixel location mapping from the compressed frame to the original frame. Finally, the original frame is reconstructed using the compressed frame and the mapping. Figs. 9 and 10 show some results from the proposed method, H.264, and H.265 with details magnified. It can be seen that the proposed method retains details in salient areas, while H.264 and H.265 produce blurry blocks and color artifacts.
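Once the pixel location mapping has been computed from the reconstructed mesh, the frame restoration step can be expressed as a standard remapping operation. The sketch below uses OpenCV's remap as one possible backend; it is an illustration of the idea, not the exact implementation used in our pipeline.

```python
import cv2

def restore_frame(warped_frame, map_x, map_y):
    """map_x/map_y: float32 arrays of the original frame's size; for every output pixel
    they give its source location in the warped (compressed) frame, derived from the mesh."""
    return cv2.remap(warped_frame, map_x, map_y, interpolation=cv2.INTER_LINEAR)
```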
For the subjective and objective image quality tests, when compared with H.264, the target bitrate settings are 0.054 bpp, 0.026 bpp, 0.014 bpp, and 0.008 bpp for the high, medium, low, and very low settings, respectively. When compared with H.265, the target bitrate settings are 0.06 bpp, 0.026 bpp, 0.01 bpp, and 0.006 bpp for the high, medium, low, and very low settings, respectively.

A. SUBJECTIVE IMAGE QUALITY ASSESSMENT
For comparison, we compress the original video sequences using H.264 and the proposed method. We use 4 different quality settings for the proposed method, and use the x264 encoder to produce compressed videos that have the same bitrates. We conduct a subjective quality assessment using the double-stimulus impairment scale (DSIS) method specified in Recommendation ITU-R BT.500-14 [33]. A total of 20 videos are randomly selected from the UCF Sports dataset, with at least one from each category.
Twelve test subjects are asked to compare a video produced by either the proposed method or x264, and give a response on the five-grade impairment scale. The mean scores and 95% confidence intervals (CI) for every quality setting are summarized in Table 3. The results show that the perceptual video quality of our method is better than that of H.264 for all quality settings. This is because, compared to H.264, our method uses more bits to store information in the salient areas.

B. OBJECTIVE IMAGE QUALITY ASSESSMENT
The proposed method optimizes perceptual video quality using saliency information. This introduces more distortion in non-salient areas than traditional video compression methods like H.264 and H.265. Video quality metrics based on signal processing techniques, such as peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) [34], are not suitable for evaluating perceptual video quality in this case, because the relatively large distortion in non-salient areas causes a large decrease in overall PSNR and SSIM. These metrics might give results that do not align with human perception. Thus, we conduct an objective image quality assessment using perceptual quality metrics, specifically the Eye-tracking Weighted PSNR (EWPSNR) [35], Learned Perceptual Image Patch Similarity (LPIPS) [36], and Video Multi-Method Assessment Fusion (VMAF) [37] metrics.
The EWPSNR metric is a perceptual objective quality metric incorporating saliency information when calculating the PSNR score.
The LPIPS metric uses deep features trained on supervised, self-supervised, and unsupervised objectives alike to model low-level perceptual similarity. The results in [36] show that LPIPS can outperform traditional metrics like ℓ2 and SSIM.
VMAF is a perceptual video quality metric that tries to approximate human perception of video quality. It is formulated by Netflix to correlate strongly with subjective mean opinion scores using machine learning techniques. Figure 5 shows the rate-distortion curves. These test results align with our subjective test results at medium and lower bitrates. Compared with H.264, the proposed method performs better at low and very low bitrate settings according to most metrics we use. Compared with H.265, the proposed method performs better at medium and lower bitrate settings. We calculate the BD-EWPSNR, BD-VMAF, and BD-LPIPS scores at medium and lower bitrates, as well as the corresponding BD-Rate values, for the proposed method and H.264/H.265 [38]. The results are shown in Table 4. Overall, at medium and lower bitrates, the proposed method achieves better perceptual video quality than H.264 and H.265.
We also observe that the scores vary for different video categories. We summarize all test results for the different categories in Figs. 6, 7, and 8. It can be seen that in the "Diving", "Riding", and "Run" categories, the proposed method is better at almost all four bitrate settings. In "Kicking", "Lifting", "SkateBoarding", and "Swing" videos, the proposed method is better at the medium and low settings. However, in "Golf" videos, H.264 and H.265 perform better at all bitrate settings. These results show that the proposed method is suitable for videos with fast motion. When watching such videos, people tend to focus only on the main subject and are less likely to notice image quality degradation in the background. However, when watching videos with less motion, like videos from the "Golf" category, the main subject cannot draw enough attention, so people are more likely to notice the quality difference between salient and non-salient areas. Thus, the proposed method achieves better results in categories with fast motion.
One problem we notice in the objective test is that the EWPSNR and VMAF scores are relatively low for the proposed method at medium and high bitrates, and they do not align well with the LPIPS and subjective test scores. Our assumption is that the two full-reference image quality metrics are essentially based on the pixel-to-pixel difference between the original and compressed images. However, the proposed method might shift the pixels in the salient regions by a small distance from their original location, as shown in Fig. 11, because of the warping transform process. This might cause the pixel-to-pixel difference between the original and compressed images to be larger than the actual difference perceived by humans. Thus, the proposed method might achieve better perceptual quality than the full-reference metrics suggest, as the subjective test result indicates.

C. IMPACT OF SALIENCY PREDICTION ACCURACY
The accuracy of saliency prediction is an important factor that affects the performance of the proposed method. Thus, we conducted several additional experiments to study the impact of saliency prediction accuracy on the proposed method. We used the first frame of the Diving-Side-005 video as the input for these experiments. Saliency predictions of different accuracy were generated using the following procedure:
Step 1: Data on fixation points for this image frame is obtained from the dataset.
Step 2: We shift all the fixation points by a random distance between 0 and 50 pixels.
Step 3: 2D Gaussian distributions with standard deviations σ_x = 20 and σ_y = 20 are placed on a grid, with the center of each distribution being a shifted fixation point.
Step 4: The values in the grid are then normalized to have a minimum of 0 and a maximum of 255. This forms the saliency map.
Step 5: We calculate the Normalized Scanpath Saliency (NSS) score defined in Eq. 15 for the generated saliency map. If the NSS score is not within the desirable range, we repeat the process from Step 2.
NSS = (1 / N) Σ_i F_i · T_i,  where  T = (S − S̄) / σ_S.   (15)

In Eq. 15, S is the saliency map, S̄ is the mean of S, σ_S is the standard deviation of S, F is the fixation map with only 0 and 1 as pixel values, T is the standardized saliency map, T_i is the pixel value of T at location i, and N is the total number of non-zero pixels in F. A higher NSS score indicates a higher saliency prediction accuracy. The original image and the generated saliency maps are then used as the input for the proposed method. The target scales in the experiments are s_x = 0.5 and s_y = 0.5. We chose these aggressive scaling factors to make the proposed method more sensitive to the saliency prediction accuracy. This will also cause the EWPSNR scores to be relatively low.
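Eq. 15 corresponds to the standard NSS computation, which can be written compactly as below; the function and variable names are ours.

```python
import numpy as np

def nss(sal_map, fix_map):
    """Normalized Scanpath Saliency (Eq. 15): mean of the standardized saliency
    map over fixated pixels. fix_map is binary, with 1 at fixation locations."""
    t = (sal_map - sal_map.mean()) / sal_map.std()
    return t[fix_map > 0].mean()
```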
Three saliency maps are used, with NSS scores of 8.3277, 2.082, and 0.5184, respectively. An NSS score of 8.3277 is very high, and the corresponding saliency map can be seen as the ground truth. An NSS score of 2.082 indicates a moderate saliency prediction accuracy. An NSS score of 0.5184 indicates a poor saliency prediction accuracy. We also experiment with three different saliency threshold (s_t) settings of 1, 50, and 100.
We summarize the results in Fig. 12.
It can be seen that as the saliency prediction accuracy decreases, the proposed method produces images with lower EWPSNR scores. This is expected, because poor saliency prediction results in regions being incorrectly labeled as salient. Consequently, more bits are assigned to non-salient regions, causing a quality decrease in the actual salient areas. However, the results demonstrate that this problem can be alleviated by using a lower saliency threshold s_t. A lower s_t results in more regions being labeled as salient, and those regions have a better chance of covering the actual salient regions. The resulting warped meshes for different saliency thresholds s_t are shown in Fig. 13. Thus, as long as the saliency prediction accuracy is not too low, the proposed method can still produce images that preserve the quality of the actual salient regions.

D. APPLICATIONS AND LIMITATIONS
The proposed method can serve as a pre-processing step for any existing video compression method. Experimental results indicate that the proposed method is particularly effective when compressing videos with restricted bitrates, such as those streamed on mobile devices using cellular data. This is because the proposed method has a bigger advantage at medium and low bitrates. Additionally, the proposed method shows potential for compressing videos featuring fast motion, such as sports videos. In this case, peripheral details receive less attention, making the proposed method highly suitable.
The proposed method has two limitations. Firstly, it may introduce minor pixel displacements in salient regions compared to their original locations. This might result in a decrease in PSNR scores. However, it is not expected to significantly impact the perceptual quality of the video. Secondly, the proposed method may not perform optimally for videos with slow motion, as quality degradation in the background is more likely to be perceptible in such instances.

V. CONCLUSION
We presented a novel approach to video compression, taking into account the characteristics of the human visual system and leveraging foveation to allocate bits to different regions in a video based on their visual saliency. This was achieved through a feature-aware image warping technique that preserves image quality in salient areas. One of the main advantages of the proposed method is that it can be easily integrated with existing video compression standards without requiring modifications to the bitstream format. Our subjective evaluations show that the proposed method outperforms H.264 and H.265 in terms of perceptual quality, and objective evaluations confirm these findings at medium and low bitrates. These results suggest that our approach has the potential to improve the compression ratio while maintaining the perceptual quality.
SHUPEI ZHANG received the bachelor's and master's degrees in engineering from Beihang University, Beijing, China, in 2016 and 2019, respectively. He is currently pursuing the Ph.D. degree in computing science with the University of Alberta (UofA), Edmonton, AB, Canada.
In 2016 and 2019, he was a Research Assistant with Beihang University. Currently, he is a Research Assistant with UofA. He has published one journal article and two conference papers. His research interests include visual saliency, video compression, and HDR reconstruction using single exposure.
ANUP BASU (Senior Member, IEEE) received the Ph.D. degree in CS from the University of Maryland, College Park, USA. He originated the use of foveation for image, video, stereo, and graphics communication, in 1990, and an approach that is now widely used in industrial standards. He pioneered the active camera calibration method emulating the way the human eyes work and showed that this method is far superior to any other camera calibration method. He pioneered a single camera panoramic stereo and several new approaches merging foveation and stereo with application to 3D TV visualization and better depth estimation. He has been a Professor with the CS Department, University of Alberta (UofA), since 1999. He has also held the following positions: a Visiting Professor with the University of California, Riverside, from 2003 to 2004; a Guest Professor with the Technical University of Austria, Graz, in 1996; and the Director of the Hewlett-Packard Imaging Systems Instructional Laboratory, UofA, from 1997 to 2000. His current research interests include multi-dimensional image processing and visualization for medical, consumer and remote sensing applications, multimedia in education and games, and robust wireless 3D multimedia transmission. He has also been a NSERC, iCORE, and Castle Rock Research Chair. He is also a fellow of the American Neurological Association.