Region of Interest Coding Based on Convolutional Neural Network

Traditional region-of-interest coding methods mainly use low-level features to detect the region of interest (ROI). The ROI detected in this way has poor stability and is easily disturbed by noise. In this paper, ROI detection is performed on the image with a deep convolutional network, which yields a stable ROI based on high-level image features. The discrete cosine transform (DCT) is then applied to the image, which is divided into coding units. The quantization matrix used to encode each coding unit is chosen according to whether the unit belongs to the ROI: fine quantization is used for coding units that belong to the ROI, and coarse quantization for non-ROI coding units. In this way, the compression rate can be greatly increased without affecting the subjective perception of the image. Experiments show that the compression rate of this method can reach about 84%, and that the weighted peak signal-to-noise ratio is improved by about 0.99 dB on average compared with JPEG encoding.


Introduction
With the rapid development of image acquisition and display technology and of the Internet, the penetration rate of high-definition video has greatly increased, 4K and 8K have also begun to spread, and the pressure on image transmission and storage is increasing. Although image coding technology has developed over recent decades and the image compression rate has been greatly improved, image resolution keeps getting higher, so the amount of compressed video data is still very large and a more advanced and efficient coding method is needed. According to the characteristics of the Human Visual System (HVS) [1], when humans observe natural scenes they often focus on only part of a scene, or consciously observe only a specific part of it. We call this focused or specific area the region of interest. That is, the purpose of region-of-interest extraction is to locate the most sensitive and eye-catching region in the image. This paper proposes a coding method based on the region of interest. For the region of interest, fine compression with a low compression ratio is adopted, that is, more resources are allocated to the region of interest, sometimes even without compression, in order to obtain a better image after decompression. The background area of the image uses lossy compression with a higher compression ratio, that is, fewer resources are allocated to the background area, and sometimes the background area is not even transmitted. The purpose is to give priority to the quality of important information at a larger compression ratio. This can greatly increase the compression rate without degrading the subjective experience.

Region of interest extraction
Region of interest extraction is the prerequisite of region of interest coding, and the quality of the extracted ROI directly affects the subjective experience of the decoded image. Traditional methods usually use hand-crafted low-level features, such as color, texture, and contrast, to extract regions of interest. However, in complex scenes these methods have poor stability and are extremely susceptible to interference from image noise, so the actual effect is not good. In recent years, with the introduction and development of convolutional neural networks (CNN), high-level, multi-scale semantic information can be extracted directly from original images, and many visual tasks have made great progress. As a result, region-of-interest extraction based on high-level features has improved considerably over extraction based on low-level features. There are many current region-of-interest detection algorithms based on deep learning. The most commonly used methods include DSS [2], Amulet [3], BDMP [4], PiCANet [5], etc. Although all of them have greatly improved the detection accuracy, most of them apply multi-level convolutional features directly and indiscriminately, and because of the interference of redundant details their results are not very stable. In contrast, PAGR [6] uses a progressive attention-guided recurrent network to selectively integrate contextual information from multi-level features, which can alleviate background interference and generate powerful attentive features. By introducing multi-path recurrent connections, it uses global semantic information to guide the learning of shallower features, which essentially improves the entire network and improves the stability and accuracy of region-of-interest detection.

Two-dimensional DCT transform
The discrete cosine transform (DCT) [7] is mainly used to compress data or images. It converts the signal from the spatial domain to the frequency domain, so that components that are correlated in the spatial domain are separated in the frequency domain and the required frequency-domain information can be retained selectively. The DCT itself is lossless, that is, the original data can be restored exactly by the inverse DCT. For the two-dimensional DCT used for images, the mathematical formula is as follows:

$$F(u,v) = c(u)\,c(v) \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} f(i,j) \cos\left[\frac{(2i+1)u\pi}{2N}\right] \cos\left[\frac{(2j+1)v\pi}{2N}\right] \qquad (1)$$

where $c(u) = \sqrt{1/N}$ for $u = 0$ and $c(u) = \sqrt{2/N}$ otherwise. From the formula, we can see that if the two-dimensional image data is a square matrix, the formula satisfies:

$$F = A f A^{T} \qquad (2)$$

where $A(i,j) = c(i) \cos\left[\frac{(2j+1)i\pi}{2N}\right]$. Therefore, in practical applications, if the data is not a square matrix, it is generally padded and then transformed. After reconstruction, the padded part can be removed to obtain the original image information.

Figure 2 DCT transform

The DCT does not compress the image directly, but it concentrates the energy of the image into a small number of coefficients, which lays the foundation for compression.
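The matrix form of the transform and its energy-concentration property can be illustrated with a short sketch. It assumes the orthonormal DCT-II normalization (scale factor sqrt(1/N) for the zero-frequency row and sqrt(2/N) for the others); the function names `dct_matrix`, `dct2`, and `idct2` and the test block are hypothetical and chosen only for illustration.

```python
import numpy as np

def dct_matrix(n: int = 8) -> np.ndarray:
    """Orthonormal DCT-II basis matrix A (assumed normalization)."""
    k = np.arange(n).reshape(-1, 1)  # frequency index, one per row
    x = np.arange(n).reshape(1, -1)  # sample index, one per column
    a = np.sqrt(2.0 / n) * np.cos((2 * x + 1) * k * np.pi / (2 * n))
    a[0, :] = np.sqrt(1.0 / n)       # zero-frequency row uses sqrt(1/N)
    return a

def dct2(block: np.ndarray) -> np.ndarray:
    """2-D DCT of a square block in matrix form: F = A f A^T."""
    a = dct_matrix(block.shape[0])
    return a @ block @ a.T

def idct2(coeffs: np.ndarray) -> np.ndarray:
    """2-D inverse DCT of a square block: f = A^T F A."""
    a = dct_matrix(coeffs.shape[0])
    return a.T @ coeffs @ a

# A smooth 8x8 block (a gentle gradient), typical of natural image content.
i, j = np.meshgrid(np.arange(8), np.arange(8), indexing="ij")
block = 100.0 + 5.0 * i + 3.0 * j

coeffs = dct2(block)
restored = idct2(coeffs)

# Share of the total energy held by the four lowest-frequency coefficients.
energy = coeffs ** 2
low_freq_share = energy[:2, :2].sum() / energy.sum()
print("round trip exact:", np.allclose(restored, block))
print("energy in lowest 2x2 coefficients:", low_freq_share)
```

For a smooth block like this one, almost all of the energy ends up in the top-left corner of the coefficient matrix, which is why discarding or coarsely quantizing the remaining coefficients costs little visible quality.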

Two-dimensional inverse transform IDCT
The inverse discrete cosine transform (IDCT) is the inverse process of the DCT: the DCT transforms spatial-domain information into frequency-domain information, and the IDCT transforms frequency-domain information back into the spatial domain. The DCT has a very wide range of applications in image compression. The commonly used JPEG still-image coding standard and the MJPEG and MPEG video coding standards all use the DCT. The mathematical formula is as follows:

$$f(i,j) = \sum_{u=0}^{N-1} \sum_{v=0}^{N-1} c(u)\,c(v)\,F(u,v) \cos\left[\frac{(2i+1)u\pi}{2N}\right] \cos\left[\frac{(2j+1)v\pi}{2N}\right] \qquad (3)$$

where $c(u) = \sqrt{1/N}$ for $u = 0$ and $c(u) = \sqrt{2/N}$ otherwise. Similarly, we can see from the formula that if the two-dimensional image data is a square matrix, the formula satisfies:

$$f = A^{T} F A \qquad (4)$$

where $A(i,j) = c(i) \cos\left[\frac{(2j+1)i\pi}{2N}\right]$.
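Why the matrix form of the inverse recovers the original block can be shown in a few lines. The derivation below is a sketch that assumes A is the orthonormal DCT-II matrix, with entries A(u,i) = c(u) cos[(2i+1)uπ/(2N)], c(0) = sqrt(1/N) and c(u) = sqrt(2/N) for u > 0; this normalization is an assumption, since other scalings of the DCT are also in use.

```latex
\begin{align*}
A A^{T} &= A^{T} A = I
  && \text{(the rows of } A \text{ are orthonormal)} \\
F &= A f A^{T}
  && \text{(forward transform)} \\
A^{T} F A &= A^{T} \left( A f A^{T} \right) A
  = \left( A^{T} A \right) f \left( A^{T} A \right) = f
  && \text{(inverse transform)}
\end{align*}
```

Under this assumption no information is lost by the transform pair itself; any loss in the codec comes from the quantization of F.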

ROI coding algorithm
Based on the combination of the DCT and Huffman coding, we propose region-of-interest coding. The idea is to divide the image into a region of interest and a background region (non-interest region). For the region of interest we retain more frequency-domain information, and for the background region we retain only a small amount of frequency-domain information; quantization and coding are then performed. In this way, the information we need is retained as much as possible, and the subjective quality of the decoded picture is greatly improved. The specific steps are as follows:
(1) Divide: Divide the picture into 8×8 sub-areas.
(2) Transform: Apply the two-dimensional DCT to each 8×8 sub-area.
(3) Coincidence: According to the degree of coincidence between the sub-region and the region of interest, calculate the degree of retention of frequency-domain information. By discarding high-frequency coefficients in each block, the purpose of image compression is achieved. Regions with a coincidence degree greater than 1/2 retain more high-frequency coefficients, and for regions with a coincidence degree less than 1/2, more high-frequency coefficients are discarded.
(4) Quantization: The quantization process divides the DCT coefficients by a quantization step size. Different quantization precisions are used for the 64 DCT coefficients in an 8×8 DCT block, so that the information at each DCT spatial frequency is preserved as far as possible while the quantization accuracy does not exceed what is needed. Among the DCT coefficients, low-frequency coefficients are more important to visual perception, so they are assigned finer quantization; high-frequency coefficients are less important to visual perception, so they are assigned coarser quantization. For different sub-regions, different quantization matrices are used. That is, γ*Q is used to quantize the image, where the choice of γ depends on the result of step 3. The γ value used for the non-interest area is generally less than 1, and the γ value for the interest area is generally greater than 1. For example, when γ=1, the quantization matrix Q itself is used to quantize the DCT result, and the resulting coded sequence is, for example:

91 CF FE A5 7F D1 BF CF FA 45

Generally speaking, when γ<1 the coded sequence is shorter than when γ=1, and when γ>1 it is longer than when γ=1.
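The division, coincidence, and quantization steps can be sketched as follows. This is a minimal illustration under several assumptions, not the authors' implementation: the matrix Q is not given in the text, so the standard JPEG luminance table is used as a stand-in; the values γ=2.0 (ROI) and γ=0.5 (background) are hypothetical; the 128 level shift follows JPEG convention; image dimensions are assumed to be multiples of 8; and entropy (Huffman) coding is omitted. The text says both that γ*Q is the quantization matrix and that γ>1 (ROI) gives a longer sequence, so the sketch treats γ as a fidelity factor and divides coefficients by Q/γ, which reproduces the stated behaviour. If γ*Q is meant literally as the step size, the roles of γ<1 and γ>1 swap.

```python
import numpy as np

BLOCK = 8

# Standard JPEG luminance quantization table, used here only as an
# assumed stand-in for the matrix Q, which the text does not list.
Q = np.array([
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99],
], dtype=np.float64)

def dct_matrix(n: int = BLOCK) -> np.ndarray:
    """Orthonormal DCT-II basis matrix A, so that F = A f A^T."""
    k = np.arange(n).reshape(-1, 1)
    x = np.arange(n).reshape(1, -1)
    a = np.sqrt(2.0 / n) * np.cos((2 * x + 1) * k * np.pi / (2 * n))
    a[0, :] = np.sqrt(1.0 / n)
    return a

A = dct_matrix()

def choose_gamma(mask_block, gamma_roi=2.0, gamma_bg=0.5):
    """Step 3: pick gamma from the block's overlap with the ROI mask."""
    overlap = float(np.mean(mask_block))  # fraction of ROI pixels in block
    return gamma_roi if overlap > 0.5 else gamma_bg

def roi_quantize(image, roi_mask, gamma_roi=2.0, gamma_bg=0.5):
    """Steps 1-4: split into 8x8 blocks, DCT, choose gamma, quantize."""
    h, w = image.shape
    assert h % BLOCK == 0 and w % BLOCK == 0, "pad the image first"
    quantized = np.zeros((h, w), dtype=np.int32)
    gammas = np.zeros((h // BLOCK, w // BLOCK))
    for r in range(0, h, BLOCK):
        for c in range(0, w, BLOCK):
            block = image[r:r + BLOCK, c:c + BLOCK].astype(np.float64) - 128.0
            coeffs = A @ block @ A.T
            gamma = choose_gamma(roi_mask[r:r + BLOCK, c:c + BLOCK],
                                 gamma_roi, gamma_bg)
            # Assumed reading: larger gamma means a smaller step (finer).
            step = Q / gamma
            quantized[r:r + BLOCK, c:c + BLOCK] = np.round(coeffs / step).astype(np.int32)
            gammas[r // BLOCK, c // BLOCK] = gamma
    return quantized, gammas

def roi_dequantize(quantized, gammas):
    """Decoder side: rescale the coefficients and apply the inverse DCT."""
    h, w = quantized.shape
    image = np.zeros((h, w))
    for r in range(0, h, BLOCK):
        for c in range(0, w, BLOCK):
            step = Q / gammas[r // BLOCK, c // BLOCK]
            coeffs = quantized[r:r + BLOCK, c:c + BLOCK] * step
            image[r:r + BLOCK, c:c + BLOCK] = A.T @ coeffs @ A + 128.0
    return np.clip(image, 0, 255)

# Toy example: a noisy 16x32 image whose left half is marked as ROI.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(16, 32)).astype(np.float64)
roi_mask = np.zeros((16, 32), dtype=bool)
roi_mask[:, :16] = True

quantized, gammas = roi_quantize(image, roi_mask)
restored = roi_dequantize(quantized, gammas)
nonzero_roi = int(np.count_nonzero(quantized[:, :16]))
nonzero_bg = int(np.count_nonzero(quantized[:, 16:]))
mse_roi = float(np.mean((image[:, :16] - restored[:, :16]) ** 2))
mse_bg = float(np.mean((image[:, 16:] - restored[:, 16:]) ** 2))
print("nonzero coefficients (ROI, background):", nonzero_roi, nonzero_bg)
print("MSE (ROI, background):", mse_roi, mse_bg)
```

In this toy run the ROI half keeps more nonzero coefficients and reconstructs with a lower error than the background half, which is the trade-off the algorithm is designed to make. The per-block γ values must be available to the decoder, so they are returned alongside the quantized coefficients.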

Evaluation method
Image quality evaluation is generally divided into subjective evaluation and objective evaluation. Subjective evaluation relies on people's subjective impressions to judge quality, which is direct and convenient and matches people's most intuitive visual perception. Objective evaluation establishes mathematical models based on physical quantities of the image and evaluates image quality according to objective statistical data [8]. For example, the weighted peak signal-to-noise ratio of the image can be a good indicator of the quality of the image.

Peak signal-to-noise ratio (PSNR)
The peak signal-to-noise ratio (PSNR) [9] is a common parameter used to measure image quality. The larger the PSNR value, the better the image quality. The expressions are as follows:

$$\mathrm{MSE} = \frac{1}{H W} \sum_{i=1}^{H} \sum_{j=1}^{W} \left[ g(i,j) - \hat{g}(i,j) \right]^{2}, \qquad \mathrm{PSNR} = 10 \log_{10} \frac{(2^{n} - 1)^{2}}{\mathrm{MSE}} \qquad (5)$$

Among them, H and W are the height and width of the image respectively; n is the number of bits per pixel, generally taken as 8, that is, the number of pixel gray levels is 256; the unit of PSNR is dB, and the larger the value, the smaller the distortion; $g(i,j)$ is the image before encoding, and $\hat{g}(i,j)$ is the image after encoding.
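As a worked illustration, the PSNR can be computed as below. This is a minimal sketch that assumes the usual definition, 10·log10 of the squared peak value (2^n − 1)² over the mean squared error between the images before and after encoding; the function name `psnr` and the test images are hypothetical.

```python
import math
import numpy as np

def psnr(original, decoded, bits: int = 8) -> float:
    """PSNR in dB between two images of equal shape (assumed definition)."""
    original = np.asarray(original, dtype=np.float64)
    decoded = np.asarray(decoded, dtype=np.float64)
    mse = float(np.mean((original - decoded) ** 2))
    if mse == 0.0:
        return math.inf  # identical images: no distortion
    peak = 2 ** bits - 1  # 255 for 8-bit images
    return 10.0 * math.log10(peak ** 2 / mse)

# An image whose every pixel is off by exactly one gray level has MSE = 1.
reference = np.full((4, 4), 100.0)
off_by_one = reference + 1.0
value = psnr(reference, off_by_one)
print("PSNR for a uniform error of 1 gray level (dB):", value)
```

With an MSE of 1 the result reduces to 20·log10(255), which is a convenient sanity check for any implementation.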

Weighted peak signal-to-noise ratio (SPSNR)
In order to better evaluate the quality of the coded image in line with the characteristics of the human eye, we note that in an image the clarity of the area we care about is more important than the clarity of the area we do not care about. Therefore, the peak signal-to-noise ratio is modified to obtain the saliency region-weighted PSNR (SPSNR):

$$\mathrm{SPSNR} = \alpha \cdot \mathrm{PSNR}_{\mathrm{ROI}} + (1 - \alpha) \cdot \mathrm{PSNR}_{\mathrm{non\text{-}ROI}} \qquad (6)$$

Among them, $\mathrm{PSNR}_{\mathrm{ROI}}$ is the PSNR value of the region of interest, $\mathrm{PSNR}_{\mathrm{non\text{-}ROI}}$ is the PSNR value of the non-interest region, and the parameter α is the weighting coefficient of the saliency region, with a value range of 0.7 to 0.9.
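The weighted measure can be sketched in a few lines: the PSNR is computed separately over the ROI pixels and the non-ROI pixels, and the two values are combined as α·PSNR_ROI + (1 − α)·PSNR_non-ROI. The function names, the binary mask representation, and the default α = 0.8 (a value inside the stated 0.7 to 0.9 range) are assumptions made for illustration.

```python
import math
import numpy as np

def region_psnr(original, decoded, region, bits: int = 8) -> float:
    """PSNR in dB restricted to the pixels where `region` is True."""
    if not np.any(region):
        raise ValueError("region contains no pixels")
    diff = original[region].astype(np.float64) - decoded[region].astype(np.float64)
    mse = float(np.mean(diff ** 2))
    if mse == 0.0:
        return math.inf
    peak = 2 ** bits - 1
    return 10.0 * math.log10(peak ** 2 / mse)

def spsnr(original, decoded, roi_mask, alpha: float = 0.8, bits: int = 8) -> float:
    """Saliency region-weighted PSNR: alpha * ROI + (1 - alpha) * non-ROI."""
    roi_mask = np.asarray(roi_mask, dtype=bool)
    psnr_roi = region_psnr(original, decoded, roi_mask, bits)
    psnr_bg = region_psnr(original, decoded, ~roi_mask, bits)
    return alpha * psnr_roi + (1.0 - alpha) * psnr_bg

# Toy example: small error inside the ROI, larger error in the background.
original = np.full((8, 8), 100.0)
roi_mask = np.zeros((8, 8), dtype=bool)
roi_mask[:, :4] = True
decoded = original.copy()
decoded[roi_mask] += 1.0    # MSE = 1 inside the ROI
decoded[~roi_mask] += 10.0  # MSE = 100 in the background
score = spsnr(original, decoded, roi_mask, alpha=0.8)
print("SPSNR (dB):", score)
```

Because the background PSNR in this example is exactly 20 dB below the ROI PSNR, the weighted score with α = 0.8 sits 4 dB below the ROI value, showing how strongly the ROI dominates the measure.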

Experimental results and analysis
In this chapter, we choose the ECSSD [10] dataset for experiments, and select some typical pictures from it to illustrate our experimental results.

Subjective evaluation and analysis
As shown in the figure, we first use the method in Chapter 4 to process image a) and obtain the saliency image (region of interest), which is image b). Then, using the method of this chapter, ROI coding and decoding are carried out to obtain figure c), while figure d) is the image obtained by JPEG coding and decoding [11]. Comparing figure c) and figure d), at a glance the overall effect of figure c) looks no different from figure d). When we look closely, we can see that the clarity of the horse and that of the background in figure c) are inconsistent: the horse's clarity is higher and the background's clarity is lower. In figure d), the clarity of the horse is the same as that of the background; the background is clearer than in figure c), while the horse is less clear than in figure c). In order to observe more clearly, we show details e) and f) of figure c) and figure d), respectively. Observation shows that figure e) contains more detail than figure f). For example, figure e) shows more detail in the horse's eyes than figure f). This is in line with the characteristics of human observation, that is, when observing an image, viewers want more detail in the part they care about.

Figure 10 Detailed comparison of ROI encoding (right) and JPEG encoding (left)

Objective evaluation and analysis
We select three pictures with successively decreasing proportions of the region of interest and apply both ROI encoding and decoding and JPEG encoding and decoding to them for comparison. It can be seen from the table that the ROI coding method proposed in this chapter is better than JPEG coding in terms of both compression ratio and weighted peak signal-to-noise ratio. Moreover, the smaller the proportion of the region of interest, the higher the compression ratio of the ROI encoding, which is consistent with our expectation.

Conclusion
This paper uses region-of-interest detection based on a convolutional neural network to realize region-of-interest extraction and coding. Given the region of interest obtained by the convolutional neural network, we apply different quantization to each image block according to whether it belongs to the region of interest. This ensures that the coding quality is consistent with the visual characteristics of the human eye: without affecting the overall impression, the details of the area that people pay attention to are more prominent. In addition, since most of an image is non-interest region, the region-of-interest coding proposed in this paper greatly improves the compression ratio. Of course, there are still some problems with the method in this paper, such as the speed of region-of-interest extraction and the fact that only regions of interest that people generally agree on can be detected. We will study and improve these points in follow-up work.