A Novel Steganography Method for Infrared Image Based on Smooth Wavelet Transform and Convolutional Neural Network

Infrared images have been widely used in many research areas, such as target detection and scene monitoring; therefore, the copyright protection of infrared images is very important. To accomplish this goal, a large number of image-steganography algorithms have been studied over the last two decades. Most existing image-steganography algorithms hide information based on the prediction error of pixels; consequently, reducing this prediction error is very important for steganography algorithms. In this paper, we propose a novel framework, SSCNNP: a Convolutional Neural-Network Predictor (CNNP) based on the Smooth-Wavelet Transform (SWT) and Squeeze-and-Excitation (SE) attention for infrared-image prediction, which combines a Convolutional Neural Network (CNN) with the SWT. Firstly, the Super-Resolution Convolutional Neural Network (SRCNN) and the SWT are used to pre-process half of the input infrared image. Then, CNNP is applied to predict the other half of the infrared image. To improve the prediction accuracy of CNNP, an attention mechanism is added to the proposed model. The experimental results demonstrate that the proposed algorithm reduces the pixel-prediction error by fully utilizing the features around each pixel in both the spatial and the frequency domains. Moreover, the proposed model requires neither expensive equipment nor a large amount of storage space during training. Experimental results also show that the proposed algorithm performs well in terms of imperceptibility and watermark capacity compared with advanced steganography algorithms, improving the PSNR by 0.17 on average at the same watermark capacity.


Introduction
Recently, infrared images have been widely used in many real-time applications, such as pedestrian segmentation [1], salient object detection [2], object fusion tracking [3], and invisible clothing [4]. Unlike visible images, infrared images also capture the temperature information of scenes and human bodies, and they provide good visual effects in dark environments. Consequently, infrared images can complement visible imagery in many different research fields.
Therefore, many researchers have proposed image-processing algorithms based on infrared images. Most of these algorithms fuse infrared images with visible images. Infrared images usually suffer from low texture detail and more noise, which are the strengths of visible images; thus, fusing visible and infrared images is required to reconstruct a synthesized image containing both prominent targets and abundant texture details. Many researchers have used multi-scale transforms to extract and match the features of both images. For example, Chen et al. [5] applied Laplace's pyramid to infrared and visible images to obtain high- and low-frequency components.
Another line of work hides information based on Prediction-Error Expansion (PEE). The PVO algorithm performs pixel prediction by ordering the pixels of an image block according to their values. However, it fails to provide sufficient embedding positions because only one bin is used as the inner region. Ou et al. [22] extended the PVO method [21] and proposed PVO-k, in which the maximum-value and minimum-value pixels of a block are taken as a unit to hide the watermark and are modified together to keep the pixel-value ordering invariant. These PVO-based methods [21,22] perform splendidly, especially at low embedding capacities. However, a smaller block size must be applied to obtain a high embedding capacity, which causes more degradation in the quality of the watermarked image. Therefore, Wang et al. [23] improved the PVO-based methods with a reversible data-hiding method that divides the cover image into blocks of dynamic size instead of same-sized blocks. Weng et al. [24] observed that most improved PVO-based algorithms predict pixels that occupy only a small fraction of the image block; thus, they proposed a Dynamic Improved PVO (DIPVO) algorithm based on graded local complexity to hide information dynamically. Zhang et al. [25] proposed the Location-based PVO (LPVO) algorithm to address the irregular prediction-error distribution of PVO-based algorithms. The LPVO algorithm simultaneously considers the sequential locations of pixel values, which makes more pixels available for image steganography.
In this paper, the infrared image is first divided into two sets: the Dot set and the Cross set. Secondly, we use Super Resolution based on a deep CNN (SRCNN) [26] to pre-process half of the original image; this step follows the image-super-resolution algorithm based on signal pre-processing proposed by Huang et al. [27]. Then, we apply the Smooth Wavelet Transform (SWT) to extract the high- and low-frequency sub-bands of the image. Thirdly, the CNN-based Predictor (CNNP) [28] is employed with a few attention-network layers to improve the accuracy of pixel prediction. Finally, to make the algorithm reversible, we propose training models that can recover the image and extract the information simultaneously. The main contributions of this paper are as follows:
1. Pixels in the same position have better prediction accuracy. Therefore, we employ SRCNN to pre-process the Cross set of the infrared image before predicting the Dot set of the original image, which reduces pixel-prediction errors.
2. We apply SWT to extract image-frequency-domain features, improving the prediction performance of the spatial CNN models. Furthermore, we employ the improved CNNP model on both the image and its frequency-domain features. Using features in both the spatial and frequency domains improves the accuracy of the training stage.
3. An attention mechanism is added to the CNNP model. This allows the extracted features to be ranked according to their importance. The attention mechanism also speeds up model training and improves performance in the testing stage.
The rest of this paper is organized as follows: Section 2 summarizes the related works, including the basic idea behind the SRCNN model and the CNNP model. Section 3 presents the proposed method in detail. The experimental results and performance analysis of the proposed algorithm compared with other algorithms are shown in Section 4. Finally, the conclusion and some future directions can be found in Section 5.

Related Works
Recently, CNN-based image-steganography algorithms have become a popular research direction. Luo et al. [29] used DenseNet to extract high-level semantic features of each block instead of low-level features and then hid the information using the DCT. Sharma et al. [30] combined steganography with cryptography and used a CNN for image steganography and image recovery. Similarly, Hu et al. [28] proposed a CNN-based Predictor (CNNP); they applied the same CNN structure for image steganography and image recovery, but with different training sets. Hu et al. [31] improved the structure of the model and introduced the concept of chunking to improve the accuracy of pixel prediction. The CNN-based models presented by Hu et al. [28,31] were very small, and the training process could be run on a CPU. Furthermore, Panchikkil et al. [32] applied Arnold-transform-based data hiding to improve the performance of image recovery, where each block undergoes a scrambling process determined by the watermark information to be embedded in the selected block. Their method has two segments: a watermark-embedding segment that hides confidential information in the cover image, and a watermark-extraction segment that recovers the original cover image and extracts the watermark information [32].
Overall, CNN-based algorithms perform excellently for image steganography in the spatial domain, but their combination with traditional frequency-domain transforms should not be neglected. Spatial-domain algorithms directly process the pixels of the image, whereas the frequency domain consists of the high- and low-frequency sub-bands created by an image transform. With this in mind, Liu et al. [33] combined U-Net with the Haar Wavelet Transform (HWT): HWT was applied to extract the high- and low-frequency features of the image, and the U-Net was then used to hide information in the original image. In this paper, the model of Super Resolution based on a deep CNN (SRCNN) [26] is applied to pre-process the infrared image, which is then combined with its corresponding SWT sub-bands. Furthermore, we apply the CNNP model [28] to the image and its frequency-domain features. Using features in both the spatial and frequency domains improves the accuracy of the training step and hence the prediction accuracy of the proposed algorithm.

SRCNN Model
Dong et al. [26] presented SRCNN, the first machine-learning framework for single-image Super Resolution (SR). In the pre-processing step, they applied bilinear interpolation to upsample the original image I co to the desired size; bilinear interpolation enables fast and relatively accurate upsampling of images. The interpolated image is denoted I bi. The general structure of the SRCNN model is shown in Figure 1.
SRCNN aims to recover from the interpolated image I bi an output F(I bi) that is as similar as possible to the high-definition (HD) image I sr. The SRCNN algorithm can be divided into three main steps: patch extraction, nonlinear mapping, and image reconstruction. Dense patch extraction is a very popular strategy for image recovery. Moreover, extracting and representing the features of the patches in a reasonable way plays a key role in image-SR algorithms. Thus, the first layer of the SRCNN model is represented in Equation (1) as follows:

F1(I_bi) = max(0, W1 * I_bi + B1), (1)

where W1 is a filter for extracting features, B1 is a parameter for adjusting the bias, and * represents the convolution operation. First, W1 extracts the features from the interpolated image I bi. Then, the ReLU activation function rectifies the extracted features. Hence, plenty of features can be obtained in the patch-extraction step, whereas in the nonlinear-mapping step, more features are further extracted and then classified. The calculation process is shown in Equation (2) as follows:

F2(I_bi) = max(0, W2 * F1(I_bi) + B2), (2)

where W2 is a filter for extracting features, and B2 is a parameter for adjusting the bias. Theoretically, more convolutional layers could be added in this step to improve the performance of the model, but the complexity of the model and the training cost would also increase.
In transform-based frequency-domain algorithms, the predicted overlapping blocks of SR pixels are usually averaged to generate a good image recovery. To improve the image-prediction accuracy of the model, the number of convolutional layers can be increased to enlarge the receptive field of the convolutional kernels. This reduces the error of the model output and also benefits the subsequent information-hiding step. Inspired by traditional algorithms, the SRCNN model incorporates a reconstruction layer. Equation (3) shows how this reconstruction layer is computed:

F(I_bi) = W3 * F2(I_bi) + B3, (3)

where W3 is the reconstruction filter and B3 is a parameter for adjusting the bias.
Let the parameters to be learned by the SRCNN model be θ = {W1, W2, W3, B1, B2, B3}. Then, for the set of HD images I sr and the corresponding interpolated images I bi, the loss function is defined as the Mean-Square-Error (MSE) function as follows:

L(θ) = (1/n) Σ_i ||F(I_bi(i); θ) − I_sr(i)||², (4)

where n is the number of training images. The MSE loss function tends to produce high-quality images. The SRCNN model is obtained when the value of the loss function meets the requirements of model training.
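The three-layer pipeline of Equations (1)-(3) can be sketched in plain NumPy. This is a toy configuration: the 9-1-5 filter sizes follow the structure described above, but the filters here are random and untrained, and only a handful of feature maps are used instead of the full model's channel counts.

```python
import numpy as np

def conv2d(x, w, b):
    """'Same' zero-padded 2-D convolution of a (C_in, H, W) tensor with a
    (C_out, C_in, k, k) filter bank, plus a per-output-channel bias."""
    c_out, c_in, k, _ = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    _, h, wd = x.shape
    out = np.zeros((c_out, h, wd))
    for o in range(c_out):
        for i in range(c_in):
            for r in range(h):
                for c in range(wd):
                    out[o, r, c] += np.sum(xp[i, r:r + k, c:c + k] * w[o, i])
        out[o] += b[o]
    return out

def srcnn(y, params):
    """Eqs. (1)-(3): patch extraction -> nonlinear mapping -> reconstruction."""
    (w1, b1), (w2, b2), (w3, b3) = params
    f1 = np.maximum(0.0, conv2d(y, w1, b1))   # Eq. (1): ReLU(W1 * I_bi + B1)
    f2 = np.maximum(0.0, conv2d(f1, w2, b2))  # Eq. (2): ReLU(W2 * F1 + B2)
    return conv2d(f2, w3, b3)                 # Eq. (3): W3 * F2 + B3 (linear)

rng = np.random.default_rng(0)
params = [
    (rng.normal(0, 0.1, (4, 1, 9, 9)), np.zeros(4)),  # 9x9 feature extraction
    (rng.normal(0, 0.1, (4, 4, 1, 1)), np.zeros(4)),  # 1x1 nonlinear mapping
    (rng.normal(0, 0.1, (1, 4, 5, 5)), np.zeros(1)),  # 5x5 reconstruction
]
y = rng.random((1, 16, 16))  # interpolated single-channel input I_bi
out = srcnn(y, params)
print(out.shape)             # same spatial size as the input
```

Because every convolution uses 'same' padding, the output keeps the spatial size of the interpolated input, as the reconstruction step requires.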

CNNP Model
The CNNP model [28] is a CNN-based framework for image steganography. The structure of the model is very simple and does not require a GPU server during the model-training phase. Furthermore, it makes full use of the excellent ability of CNNs to extract global features, which allows its watermark capacity to exceed that of many traditional image-steganography algorithms. The CNNP model first divides the image equally into two groups, as shown in Figure 2, named the Cross collection and the Dot collection. The algorithm then predicts the Dot collection I D from the Cross collection I C. Next, the predicted Dot collection is input into the CNNP model to predict the Cross collection I C. Finally, the predicted Cross collection and the predicted Dot collection are combined to obtain the full predicted image. The main structure of the CNNP model is shown in Figure 3.

The model consists of several convolutional blocks. Each convolutional block consists of K × K convolutional layers, a LeakyReLU activation function, and 3 × 3 convolutional layers. The algorithm applies a two-stage information-hiding approach. Firstly, the error between the Dot collection I D and the predicted Dot collection is calculated, and then the appropriate prediction error (usually 0) is selected for image steganography. The information-hiding process for the Cross collection I C is the same as that for the Dot collection I D. Furthermore, the algorithm flow of CNNP is reversible: in image recovery, the CNNP model is applied first to recover the Cross collection I C, and then the recovered Cross collection is input into the model to recover the Dot collection I D.
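The Cross/Dot partition that CNNP operates on can be sketched as follows. This is a minimal NumPy sketch; which checkerboard parity is called "Dot" is an assumption, since the paper fixes the convention via Figure 2.

```python
import numpy as np

def split_dot_cross(img):
    """Split an image into the Dot and Cross collections along a
    checkerboard pattern. Each collection keeps the original size,
    with the other collection's positions zeroed out."""
    rows, cols = np.indices(img.shape)
    dot_mask = (rows + cols) % 2 == 0   # assumed parity convention
    dot = np.where(dot_mask, img, 0)
    cross = np.where(~dot_mask, img, 0)
    return dot, cross, dot_mask

img = np.arange(16, dtype=np.int64).reshape(4, 4)
dot, cross, mask = split_dot_cross(img)
print((dot + cross == img).all())  # recombining the two sets restores the image
```

The split is lossless by construction: every pixel lands in exactly one of the two collections, which is what makes the alternating predict-and-embed scheme reversible.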

Proposed Algorithm
The proposed method contains three main aspects: image steganography, image recovery, and information extraction. For the information-hiding process, we first divide the input infrared image into two sets: the Cross collection I C and the Dot collection I D, as shown in Figure 2. Then, the watermarked Dot collection I D,W is used to predict the Cross collection I C. The prediction is performed by SRCNN and SWT for the Cross collection I C and the Dot collection I D, respectively. Attention layers are also incorporated into the CNNP model for predicting infrared images to improve the prediction accuracy. After the full predicted image I W is obtained, the watermark information is embedded into the infrared image. Furthermore, the information-extraction and image-recovery processes are then performed simultaneously. The watermarked Cross collection I C,W is first applied to predict the Dot collection I D, and the watermark is extracted based on the prediction error. The Cross collection I C is recovered and its information extracted in the same way.

Image Steganography
In the image-steganography step, two CNN-based models are employed: the SRCNN model and the CNNP model. Moreover, SWT is employed to extract the frequency-domain features of the infrared image. The overall structure of the proposed steganography algorithm is shown in Algorithm 1 and Figure 4a. Figure 4b shows the process of SSCNNP (SRCNN followed by CNNP based on the SE layer).

Image steganography can be divided into four main steps: initial image prediction, SWT decomposition, CNNP-based image prediction, and difference-histogram translation. These steps are described in detail in the following subsections.

Initial Image Prediction
Inspired by the SRCNN model for image SR, the initial prediction stage of the proposed method aims to use as few resources as possible while accurately predicting the information of neighboring pixels. Empirically, a CNN model trains better when the input and output images have the same pixel positions. Therefore, if we apply the Cross/Dot collections I C /I D to predict each other, the prediction error can be reduced; to this end, the original Cross/Dot collections I C,co /I D,co must be accurately pre-processed before predicting the images. Since the SRCNN model is applied for the pre-processing, the resulting Dot/Cross-set images are denoted by I D,sr /I C,sr, respectively. As shown in Figure 1, the SRCNN model receives the original Cross/Dot-set image I C,co /I D,co and outputs the SR Dot/Cross collections I D,sr /I C,sr, respectively. Generally, a 9 × 9 convolutional layer first transforms the original Cross/Dot images I C,co /I D,co into 64 feature maps. Then, the model is made applicable to more nonlinear scenes by adding nonlinear factors through the ReLU activation function. Finally, two 5 × 5 convolutional layers extract the features from the image and recover it as a single-channel infrared image. Moreover, the SRCNN model updates the learning parameters θ sr to reduce the error between the SR Dot/Cross images (I D,sr /I C,sr) and the original Dot/Cross images (I D,co /I C,co). Finally, the classical MSE function is applied as the loss function L sr as follows:

L_sr(θ_sr) = (1/n) Σ_i ||I_D,sr(i) − I_D,co(i)||², (5)

where n is the number of training samples; the Cross-set loss is defined analogously.

SWT Decomposition
CNN-based algorithms usually extract features from images in the spatial domain. In this paper, we propose combining an infrared image with its four SWT low-/high-frequency sub-bands inside one tensor and then feeding that tensor into the CNNP model for image prediction. In contrast to other wavelet transforms, SWT is a non-downsampling transform, i.e., all sub-bands are equal in size to the original image [34]. The low-pass and high-pass SWT filters keep all low- and high-frequency components at the same size by padding the coefficients with 0 during coefficient processing. Taking the SR Dot-set image I D,sr as an example, the SWT decomposition can be expressed as in Equations (6)-(9):

LL = h_c * (h_r * I_D,sr), (6)
HL = h_c * (g_r * I_D,sr), (7)
LH = g_c * (h_r * I_D,sr), (8)
HH = g_c * (g_r * I_D,sr), (9)

where h and g are the low-pass and high-pass filters of the SWT, the subscripts r and c denote filtering along the rows and columns, respectively, LL is the low-frequency sub-band, and HL, LH, and HH are the high-frequency sub-bands generated by the SWT decomposition. In this paper, the images in the spatial domain and the low-/high-frequency features in the frequency domain were jointly used as inputs to the CNNP model to improve its convergence speed and prediction accuracy.
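A one-level stationary (non-downsampling) decomposition can be sketched as follows. The paper does not name its wavelet filters, so the averaging filter h and differencing filter g here are a Haar-style simplifying assumption; the sketch illustrates the key property that all four sub-bands keep the size of the input.

```python
import numpy as np

def swt2_haar(x):
    """One-level undecimated Haar-style wavelet transform.
    All four sub-bands keep the size of the input (no downsampling).
    h = low-pass averaging filter, g = high-pass differencing filter."""
    lo = lambda a, ax: (a + np.roll(a, -1, axis=ax)) / 2  # filter h
    hi = lambda a, ax: (a - np.roll(a, -1, axis=ax)) / 2  # filter g
    L, H = lo(x, 0), hi(x, 0)        # filter along the rows
    LL, LH = lo(L, 1), hi(L, 1)      # then along the columns
    HL, HH = lo(H, 1), hi(H, 1)
    return LL, LH, HL, HH

x = np.random.default_rng(1).random((8, 8))
LL, LH, HL, HH = swt2_haar(x)
print(all(b.shape == x.shape for b in (LL, LH, HL, HH)))  # non-downsampling
```

With these particular filters the four bands even sum back to the original image (LL + LH + HL + HH = x), which makes the decomposition easy to sanity-check.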

CNNP-Based Image Prediction
In this paper, we employed a Squeeze-and-Excitation Network (SENet) [35] to improve the accuracy of the CNNP model. Compared with many CNN-based models that modify the spatial dimension to improve the image-prediction accuracy, SENet instead increases the performance by modeling the relationship of feature channels. Usually, an SE layer takes up few resources and largely improves the convergence speed and performance of the model. In this paper, we modified the CNNP model by applying the SE layer to improve the prediction accuracy of the CNNP model. The structure of the modified CNNP model is shown in Figure 5.
The input tensor consists of an infrared image and its four SWT frequency components (LL, HL, LH, HH). The SWT sub-bands need to be modified according to the type of infrared image: if the infrared image is the SR Dot collection I D,sr, the Cross-set positions of all SWT sub-bands are set to 0; conversely, the Dot-set positions are set to 0 if the infrared image is the SR Cross collection I C,sr. We first extract the features of the input tensor by applying 3 × 3, 5 × 5, and 7 × 7 convolutional blocks. Each convolutional block consists of two K × K convolutional layers and a LeakyReLU activation function, as shown in Figure 5b. The LeakyReLU function is an improved version of the classical ReLU activation function, with a small slope for negative inputs. Since its derivative is essentially non-zero, the number of silent neurons is reduced substantially, which enables gradient-based learning; LeakyReLU thus avoids the problem that neurons stop learning when the ReLU input is negative, although the training time increases. We then sum the extracted features and record the importance of each feature layer with the SE layer, which quantifies the importance of the features and applies it to them. Finally, we predict the final infrared image using two 3 × 3 convolutional blocks to prevent the CNNP model from gradient vanishing or explosion; one of the convolutional blocks takes the form of a residual block connected to the SE layer. By using the SR Dot/Cross collections (I D,sr /I C,sr) as inputs to the model for training, the loss function L CP can be defined as in Equation (10):

L_CP(θ_cp) = (1/n) Σ_i ||I_pre(i) − I_co(i)||², (10)

where θ_cp denotes the parameters of the CNNP model, I_pre is the collection predicted by the model, and I_co is the corresponding original collection. Similar to Section 3.1.1, the CNNP model also uses the MSE function as its loss function. When the loss L CP stays at a very low value for a long time, the parameters of the model have reached the best case.
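The squeeze-and-excitation operation described above can be sketched in NumPy as follows. The bottleneck ratio r = 2 and the weight shapes are illustrative assumptions; the point is the squeeze (global average pooling per channel), the two-layer excitation, and the channel-wise rescaling.

```python
import numpy as np

def se_layer(x, w1, w2):
    """Squeeze-and-Excitation: squeeze each channel by global average
    pooling, excite through a two-layer bottleneck, and rescale the
    channels. Weight shapes: w1 is (C, C//r), w2 is (C//r, C)."""
    z = x.mean(axis=(1, 2))              # squeeze: one scalar per channel
    s = np.maximum(0.0, z @ w1)          # ReLU bottleneck
    s = 1.0 / (1.0 + np.exp(-(s @ w2)))  # sigmoid gate in (0, 1) per channel
    return x * s[:, None, None]          # channel-wise importance rescaling

rng = np.random.default_rng(2)
x = rng.random((4, 8, 8))                # (channels, H, W) feature maps
w1 = rng.normal(0, 0.1, (4, 2))          # squeeze to C//r = 2
w2 = rng.normal(0, 0.1, (2, 4))          # excite back to C = 4
y = se_layer(x, w1, w2)
print(y.shape)                           # same shape, channels re-weighted
```

Because the gate is a per-channel scalar in (0, 1), the layer only re-weights feature maps by importance; it never changes the tensor shape, which is why it can be dropped into CNNP cheaply.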

Histogram Shifting
Every digital image, including an infrared image, has its own histogram, which visualizes the number of pixels at each pixel value. Due to its small computational cost and its invariance to translation, scaling, and rotation, the image histogram is widely applied in many fields of digital-image processing. In image steganography, histogram shifting is one of the classical reversible information-hiding algorithms [36-38]. In this paper, we propose a histogram-shifting algorithm for reversible image steganography based on the complete final image predicted by the CNNP model. Different from traditional histogram-shifting-based algorithms, we build the histogram from the prediction errors of the infrared image. Figure 6a shows a histogram of the prediction error for an infrared image randomly selected from the test set.
From Figure 6a, we can see that the proposed SSCNNP model has good prediction accuracy: the prediction-error histogram has a very high peak at position 0. This means that we do not need to remember the peak point of the histogram, as in the traditional histogram-shifting algorithm; the peak must be located at 0 because the prediction errors of the algorithm concentrate there. The information-hiding rule differs slightly depending on the location of the zero point nearest to the peak. For example, in Figure 6a, the zero point on the right side of the histogram is closest to the peak (the pixel count between the two points is the lowest), so the proposed image steganography is performed as in Equation (11):

P_w(i) = P_co(i) + w(j), if P_co(i) = 0;
P_w(i) = P_co(i) + 1, if P_co(i) > 0; (11)
P_w(i) = P_co(i), if P_co(i) < 0,

where P co and P w are the original and watermarked prediction errors, respectively. The prediction error is the difference between the pixel predicted by the model and the original pixel, and it is inversely proportional to the information-hiding capacity. i is the index of the pixel, w is the watermark information to be hidden, and j is the watermark index. Conversely, if the nearest zero point is located to the left of the peak, the proposed image steganography is determined as in Equation (12):

P_w(i) = P_co(i) − w(j), if P_co(i) = 0;
P_w(i) = P_co(i) − 1, if P_co(i) < 0; (12)
P_w(i) = P_co(i), if P_co(i) > 0.

Briefly, the algorithm strives to change as few pixels as possible for an equal number of watermark bits. Furthermore, the positions of the zero points do not need to be remembered. For example, in Figure 6b, since the nearest zero point was located to the right of the peak, the watermarked histogram has more pixels to the right of 0; therefore, the right side is selected as the direction of the histogram translation.
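The Equation (11) case (nearest zero point to the right of the peak) can be sketched as follows; the array of prediction errors and the watermark bits are illustrative.

```python
import numpy as np

def embed(pred_err, bits):
    """Prediction-error histogram shifting with the peak at 0 and the
    nearest zero point to the RIGHT of the peak (the Eq. (11) case).
    Errors > 0 are shifted by +1 to free the bin at 1; errors equal
    to 0 each carry one watermark bit; negative errors are untouched."""
    out = pred_err.copy()
    bit_stream = iter(bits)
    for i, p in enumerate(pred_err.flat):
        if p > 0:
            out.flat[i] = p + 1              # shift, no payload
        elif p == 0:
            out.flat[i] = p + next(bit_stream)  # embed one bit (0 or 1)
    return out

pe = np.array([0, 2, -1, 0, 3, 0])   # original prediction errors
wm = [1, 0, 1]                       # watermark bits, one per zero error
pw = embed(pe, wm)
print(list(pw))                      # -> [1, 3, -1, 0, 4, 1]
```

Note that the capacity equals the number of zero prediction errors, which is why a sharper prediction-error peak at 0 directly increases the watermark capacity.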

Image Recovery and Information Extraction
Image recovery and information extraction are performed simultaneously, and the information-extraction process mirrors the information-hiding steps. The SRCNN model first makes an initial prediction of the infrared image. Then, the low-/high-frequency sub-bands of the SWT decomposition are fed into the improved CNNP model along with the infrared image itself. For information extraction, the proposed method first uses the watermarked collection of Dot I D,W to predict the collection of Cross I C. The prediction error between the prediction result and the watermarked collection of Cross I C,W is then computed to extract the information. While the information is extracted, the collection of Cross I C is also recovered. The flow of the proposed image recovery and information extraction is shown in Figure 7.

SRCNN and SWT Pre-Processing
Firstly, SRCNN is applied to pre-process the watermarked collection of Dot/Cross (I D,W /I C,W ). The output of the model is recorded as the watermarked collection of SR Cross/Dot (I C,srw /I D,srw , respectively). Secondly, SWT decomposes the watermarked collections of SR Cross/Dot I C,srw /I D,srw , and hence one low-frequency sub-band (LL) and three high-frequency sub-bands (HL, LH, and HH) are obtained. The SWT sub-bands together with the watermarked collections of SR Cross/Dot I C,srw /I D,srw are fed into the CNNP model to improve the convergence speed and prediction performance of the model.

CNNP Recovery and Extraction
Firstly, the watermarked SR Cross/Dot collections (I C,srw /I D,srw) are fed into the improved CNNP model. The SE layer added in the embedding stage allows the CNNP model to obtain more accurate predicted Cross/Dot collections (I C,pre /I D,pre). Secondly, the prediction-error difference between the predicted Cross/Dot collections (I C,pre /I D,pre) and the watermarked Cross/Dot collections (I C,W /I D,W) is calculated. Equations (13) and (14) give the rules for image recovery and watermark extraction when the pixel-prediction error is ≥ 0 and ≤ 0, respectively. Based on the difference-histogram translation described in Section 3.1.4, the formulas for extracting the information in the two cases are defined as follows:

w(j) = P_w(i) and P_co(i) = 0, if P_w(i) ∈ {0, 1};
P_co(i) = P_w(i) − 1, if P_w(i) > 1, (13)

w(j) = −P_w(i) and P_co(i) = 0, if P_w(i) ∈ {0, −1};
P_co(i) = P_w(i) + 1, if P_w(i) < −1. (14)

Next, the same process is applied to predict the collection of Dot I D by using the watermarked collection of Cross I C,W. While the watermark information is extracted from the watermarked collection of Dot I D,W, the collection of Cross I C is also recovered. Finally, the algorithm obtains both the extracted watermark information w and the original infrared image I.
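The non-negative case of the recovery rules can be sketched as follows: a minimal sketch of Equation (13)-style extraction, assuming the embedding used the peak at 0 with the zero point on the right, as in Equation (11). The input array of watermarked errors is illustrative.

```python
import numpy as np

def extract_and_recover(pw):
    """Sketch of the Eq. (13) case (pixel-prediction error >= 0):
    watermarked errors in {0, 1} each yield one watermark bit and
    recover to 0; errors greater than 1 are shifted back by -1;
    negative errors were untouched by embedding and pass through."""
    bits, pco = [], pw.copy()
    for i, p in enumerate(pw.flat):
        if p in (0, 1):
            bits.append(int(p))      # extract the watermark bit
            pco.flat[i] = 0          # recover the original error
        elif p > 1:
            pco.flat[i] = p - 1      # undo the histogram shift
    return bits, pco

pw = np.array([1, 3, -1, 0, 4, 1])   # watermarked prediction errors
bits, pco = extract_and_recover(pw)
print(bits)       # -> [1, 0, 1]  (the embedded watermark)
print(list(pco))  # -> [0, 2, -1, 0, 3, 0]  (original errors recovered)
```

Extraction and recovery happen in one pass over the errors, which matches the paper's claim that the watermark is extracted while the cover image is being restored.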

Experimental Configuration
We conducted our experiments on 113 infrared images of wind turbines from Zhejiang Shangfeng Group Co., Ltd., Shaoxing, Zhejiang, China. The size of each infrared image was 256 × 256. The dataset was partitioned into 90 images in the training set and 23 images in the testing set. They were all infrared images captured from all angles in normal or faulty conditions. The shape of the wind turbine differed greatly from angle to angle, resulting in some variation in the images of the dataset. This effectively alleviated the problem of model overfitting. Researchers who would like to obtain this dataset should contact Zhejiang Shangfeng Group Co., Ltd. Additionally, to demonstrate the generality of the proposed algorithm on infrared images, a substation-power-equipment dataset was purchased online and used for the experiments. The substation-power-equipment dataset contained 100 infrared images, from which 80 infrared images were selected as a training set and 20 infrared images as a testing set. Figure 8 shows four infrared images from the substation-power-equipment dataset. The proposed algorithm was trained on a Dell OptiPlex 7070 with an Intel(R) Core(TM) i7-9700 CPU. The SRCNN and SSCNNP models used are very simple, so the CPU was sufficient to train and test the models. The code was implemented in Matlab 2019a and PyCharm 2020.1.2.

Prediction Accuracy
To evaluate the prediction-accuracy performance of the proposed algorithm, we used the mean absolute value, variance, and MSE of the image-prediction errors for the proposed SSCNNP model, the CNNP model, and three state-of-the-art algorithms: BIP [39], MEDP [40], and GAP [41]. For the three evaluation indicators shown in Table 1, the lower the values, the more accurate the prediction. As we can see from Table 1, the absolute value of the mean prediction error of the proposed SSCNNP model was only 2.04, which was lower than that of the CNNP model. Table 1 indicates that the proposed SSCNNP model outperformed the others on image-prediction accuracy. Figure 9 shows the prediction errors of the proposed SSCNNP, CNNP, BIP [39], MEDP [40], and GAP [41] for the four infrared images shown in Figure 8. The higher a model's curve at the position where the prediction error is 0, the better that model performed. For better visualization, only the prediction errors within the range [-10, 10] are shown. Figure 9 shows that the prediction-error histogram of the proposed SSCNNP model outperformed those of the other methods; in particular, the SSCNNP model had more accurately predicted pixels in the prediction-error range [-1, 1]. The BIP, MEDP, and GAP algorithms use a single convolution kernel for feature extraction and pixel prediction, whereas the CNNP model applies three convolution kernels, 3 × 3, 5 × 5, and 7 × 7, simultaneously to extract features more comprehensively. Beyond this, the proposed SSCNNP model applies SRCNN to align the positions of the predicting carrier pixels with the pixels to be predicted, and SWT further facilitates training by extracting the low- and high-frequency sub-bands of the image.
Additionally, the SE layer improves the training accuracy of the CNNP model. Therefore, the experimental results shown in Figure 9 demonstrate that the proposed SSCNNP model had better prediction performance compared with the other algorithms. This provided a good basis for the subsequent reversible steganography.
Figure 9. The prediction errors for the four infrared images shown in Figure 8. (a-d) correspond to the four subfigures in Figure 8, respectively.
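The three Table 1 indicators can be computed directly from the prediction-error image; a small sketch:

```python
import numpy as np

def prediction_error_stats(original, predicted):
    """Mean absolute value, variance, and MSE of the pixel-prediction
    errors (the three Table 1 indicators; lower is better)."""
    e = original.astype(np.float64) - predicted.astype(np.float64)
    return np.mean(np.abs(e)), np.var(e), np.mean(e ** 2)
```

Note that MSE and variance coincide only when the mean error is zero; reporting both therefore also reveals any systematic bias in the predictor.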

Information-Hiding Performance
In this section, we present a performance analysis of the proposed SSCNNP model compared with CNNP, LPVO [42], IPPVO [25], and BLTM [43] for information hiding with different watermark capacities on the two test datasets, as listed in Tables 2 and 3.
The IPPVO algorithm adaptively selects the size of the convolution kernel according to the texture complexity of the current pixel block. The LPVO algorithm can quickly and accurately select multiple histograms for shifting to hide information. Both IPPVO and LPVO are closely related to the proposed algorithm, making them suitable baselines for comparison. Figure 10a-d show the PSNR values of the four infrared images shown in Figure 8 with various watermarking capacities on the wind-turbine dataset. The higher the PSNR value, the closer the watermarked image I w was to the original image I, i.e., the better the imperceptibility of the image-steganography algorithm. The watermarking capacity ranges from 2000 to 30,000 bits in steps of 2000 bits.
The PSNR is defined in Equation (15):

PSNR = 10 \log_{10} \frac{255^2 \cdot M \cdot N}{\sum_{i=1}^{M} \sum_{j=1}^{N} \left( I(i,j) - I_W(i,j) \right)^2} \quad (15)

In Equation (15), M and N are the height and width of the original image I, respectively, and in this paper M = N = 256. The higher the PSNR value, the closer the watermarked image was to the original image I. The experimental results shown in Figure 10 demonstrate that the watermarked images produced by the proposed algorithm had higher PSNR values than those of the other algorithms on the wind-turbine dataset. Table 2 shows the PSNR values of the proposed algorithm compared with CNNP [28], LPVO [42], IPPVO [25], and BLTM [43] when the watermarking capacity was 10,000 bits and 20,000 bits. When the watermark capacity was 10,000 bits, the average PSNR value of the proposed SSCNNP-based image-steganography algorithm was 59.27 dB, which was better than that of the image-steganography algorithms based on BLTM (58.88 dB), IPPVO (58.91 dB), LPVO (58.93 dB), and CNNP (59.14 dB). Furthermore, when the watermarking capacity was increased to 20,000 bits, the average PSNR value of our algorithm was still the highest. The better imperceptibility of the proposed SSCNNP model could be directly observed by averaging the PSNR values of the four images.
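The PSNR comparison above can be reproduced with a few lines of NumPy; a sketch of Equation (15) for 8-bit images:

```python
import numpy as np

def psnr(original, watermarked):
    """PSNR (dB) between the original image I and watermarked image I_W,
    per Equation (15), for 8-bit M x N images."""
    diff = original.astype(np.float64) - watermarked.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(255.0 ** 2 / mse)
```

A PSNR near 59 dB corresponds to an average per-pixel distortion well below one gray level, which is why the watermark is imperceptible at these capacities.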

Performance on Substation-Power-Equipment Dataset
The substation-power-equipment dataset contained many different types of infrared images, and these different kinds of infrared images may have different characteristics. Therefore, we demonstrated the performance of the proposed algorithm on the infrared images of the substation-power-equipment dataset through comparison experiments, using the four infrared images of substation power equipment shown in Figure 8. The variation of the PSNR values of the algorithms with different watermarking capacities is shown in Figure 11. The experimental results demonstrate that the proposed SSCNNP model had the highest PSNR values for all watermarking capacities. This also shows that the proposed algorithm had good imperceptibility compared with the other state-of-the-art algorithms. Because the watermark capacity of individual images was less than 20,000 bits, Table 3 shows the PSNR values of the four algorithms when the watermarking capacity was 5000 bits and 10,000 bits.
When the watermark capacity was 5000 bits, the average PSNR value of the proposed algorithm was 60.08 dB, which was higher than that of the BLTM (59.34 dB), IPPVO (59.34 dB), LPVO (59.52 dB), and CNNP (59.86 dB) image-steganography algorithms. The image-steganography algorithm based on the proposed SSCNNP model still had the highest average PSNR value when the watermarking capacity was 10,000 bits. Overall, the experimental results demonstrate that the proposed algorithm had better performance in terms of imperceptibility compared with the state-of-the-art algorithms.

Conclusions
In this paper, a novel steganography algorithm for infrared images is proposed based on the SRCNN and CNNP models. The SRCNN pre-processing model enables image pixels to be predicted more accurately. Furthermore, the proposed model combines the frequency domain of SWT with the spatial domain of the CNNP model to reduce the prediction error. Experiments show that the proposed SSCNNP model had better prediction performance than both the classical prediction-based algorithms and the novel CNNP-based model. Therefore, the imperceptibility and watermark capacity were also improved for infrared images. Moreover, the SRCNN and CNNP models are both lightweight CNN models, and they can be combined with each other in a harmonious way. In future work, we will try to introduce more lightweight CNN models and wavelet transforms in the frequency domain to reduce hardware requirements and improve the performance of the algorithm at the same time. Moreover, the experimental images should be more abundant and closer to real-time images to ensure the practicality of the algorithm.
However, the infrared images used in this paper covered only equipment, so in the future we hope to extend the study to a wider range of infrared images. In addition to image steganography, robust image-watermarking algorithms are also well worth exploring.