Resampling detection of recompressed images via dual-stream convolutional neural network

Resampling detection plays an important role in identifying image tampering, such as image splicing. Currently, resampling detection is still difficult in recompressed images, which are produced by applying resampling and post-JPEG compression to primary JPEG images. Although low quality primary compression benefits the detection, the task remains rather challenging due to the widespread use of middle/high quality compression in imaging devices. In this paper, we propose a novel deep learning approach to learn resampling features directly from the recompressed images. To this end, a noise extraction layer based on low-order high pass filters is deployed to yield the image noise residual domain, which is more suitable for extracting manipulation trace features. A dual-stream convolutional neural network (CNN) is presented to capture the resampling traces along different directions, where the horizontal and vertical streams are interleaved and concatenated. Lastly, the learned features are fed into a Sigmoid/Softmax layer, which is used as a binary/multi-class classifier for the blind detection or parameter estimation of resampling operations, respectively. Extensive experimental results demonstrate that the proposed method detects resampling effectively in recompressed images and outperforms the state-of-the-art detectors.


INTRODUCTION
Digital images play an important part in spreading information, shaping public opinion and providing evidence in legal proceedings and criminal investigations. Advances in photo editing and manipulation tools have made it significantly easier than ever before to create fake images without leaving any perceptible artifacts. Malicious image manipulation may lead to serious moral, ethical and legal problems. Therefore, it is important to detect image manipulation blindly [1]. Image manipulations typically leave behind unique traces that can be used to detect the type of image editing. Researchers extract features related to these traces and develop associated algorithms to determine the type of processing operation. In the last decade, numerous methods have been developed to detect various types of image manipulations, such as JPEG compression [2,3], contrast enhancement [4,5,6], sharpening [7], median filtering [8,9], image splicing [10,11,12], etc.
Forensic detection of image resampling has also drawn wide attention. Existing image resampling detection methods operate by identifying operation traces in the spatial [13,14,15,16,17,18] and frequency domains [19,20]. A. C. Popescu et al. propose to detect the statistical correlation of interpolated pixels with the expectation maximization (EM) algorithm [13]. Mahdian and Saic propose to detect pixel interpolation traces by capturing the statistical changes in the signal covariance structure [17]. In [15], subspace decomposition and random matrix theory are used to identify resampling traces in upscaled images. Feng et al. exploit normalized energy density to derive a 19-dimensional feature vector, which is fed into an SVM classifier [19]. In addition, Kirchner [20] models the detectable resampling artifacts in the spatial and frequency domains by means of the variance of the prediction residue.
However, such prior resampling detection methods are rather fragile under post-JPEG compression, even for quality factors (QF) of about 95 [12,25]. The periodic pattern introduced by resampling is disturbed by JPEG blocking artifacts. In order to detect resampling in JPEG images, the periodicity in the second derivative of interpolated images is employed as evidence of resampling [21]. Furthermore, Nataraj et al. [22] propose to suppress JPEG periodic patterns by adding Gaussian noise to post-compressed images. However, such prior methods are ineffective in low quality JPEG images. Kirchner and Gloe [23] demonstrated that JPEG compression applied before resampling can be exploited to detect resampling in recompressed images. Although such an approach can detect resampling in the cases of low and moderate quality primary compression, it fails in the high quality case. In [24], the operation chain including JPEG com- Recently, convolutional neural networks (CNNs) have been exploited to detect resampling in recompressed images [25,26]. In order to prevent the CNN from learning features that represent image content, a constrained convolutional layer is used to suppress image content and learn manipulation features adaptively. Such a method achieves the best performance in resampling detection compared with previous methods.
In this paper, we propose a new CNN architecture to detect resampling in recompressed images. A noise extraction layer consisting of high pass filters is inserted into the CNN architecture. These filters convert an input image into noise residuals, which suppresses the influence of image content. Meanwhile, the dual-stream CNN is employed to extract high-level features of resampling traces. We also note that resampling is often applied locally to adjust region dimensions when creating realistic spliced images. Accordingly, the proposed dual-stream CNN method is also applied to the detection of such resampling-involved image splicing. Extensive experimental results verify that the proposed resampling detection method achieves state-of-the-art performance.
The rest of the paper is organized as follows. Section 2 describes the proposed resampling detection scheme in detail. Section 3 presents experimental results and discussions. Finally, concluding remarks are given in Section 4.

PROPOSED METHOD
In this section, the proposed image resampling forensics scheme is presented in detail.

Overview of proposed model
We address resampling detection as a pattern classification problem, which is solved with a deep learning methodology. Figure 1 depicts the overall design of our proposed dual-stream CNN architecture for resampling detection. It consists of three main components. First, because resampling features exist in the horizontal and vertical directions independently, the noise is extracted from adjacent pixel differences in both directions by the noise extraction layer. Second, high-level representations of the image manipulation features are generated by the horizontal and vertical streams. In order to capture the correlation between both directions, the horizontal and vertical streams are interleaved and concatenated. Finally, the concatenated features of the horizontal and vertical streams are fed into the sigmoid/softmax layer. Note that the sigmoid layer is used for binary classification, and the softmax layer is used for multi-class classification. Below we present a detailed introduction of these three components and the different layers used in our CNN architecture.

Noise extraction
Resampling detection is typically disturbed by image content, which should be suppressed. It has been found that resampling traces generally exist in the redundant domain of images and are irrelevant to image content [27,28,11]. In our approach, noise is modeled by the residual between a pixel and its estimate obtained by interpolating only neighboring pixels. To accomplish this, a special convolution layer, namely the noise extraction layer, is set as the first layer of the network. As shown in Figure 2, two low-order high pass filters are chosen as the convolution kernels of the noise extraction layer. Such a layer requires neither training nor a bias. The adjacent pixel differences in the horizontal and vertical directions are captured by the filters shown in Figures 2(a) and (b), respectively. More specifically, a patch of 256 × 256 pixels from a grayscale input image is first convolved with the 1 × 3 and 3 × 1 filters with stride 1 and padding 1. Such filters capture the prediction error between the estimated central pixel and its local neighbors. As a result, each noise extraction layer yields a noise map of prediction residuals with dimension 256 × 256 × 1.
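The fixed, untrained filtering step described above can be illustrated with a minimal sketch. The kernel coefficients of Figure 2 are not given in the text, so the first-order difference kernel (-1, 1, 0) used here is an assumption, as are the function names; the operation is written as cross-correlation, which is what CNN frameworks usually call convolution, with one zero-padded pixel on each side so that the output keeps the input size.

```python
def horizontal_residual(img, kernel=(-1.0, 1.0, 0.0)):
    """Slide a 1x3 kernel along each row (stride 1, zero padding 1).

    The kernel values are hypothetical; with (-1, 1, 0) each output is
    the pixel minus its left neighbour.
    """
    out = []
    for row in img:
        padded = [0.0] + list(row) + [0.0]
        out.append([sum(k * padded[j + t] for t, k in enumerate(kernel))
                    for j in range(len(row))])
    return out


def vertical_residual(img, kernel=(-1.0, 1.0, 0.0)):
    """Apply the same kernel as a 3x1 filter by transposing the image."""
    transposed = [list(col) for col in zip(*img)]
    filtered = horizontal_residual(transposed, kernel)
    return [list(col) for col in zip(*filtered)]


# A horizontal ramp: content varies only from left to right.
patch = [[float(j) for j in range(4)] for _ in range(3)]
h_noise = horizontal_residual(patch)
v_noise = vertical_residual(patch)
```

Both residual maps have the same height and width as the input patch, which matches the 256 × 256 × 1 output size stated above for a 256 × 256 input.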

Horizontal and vertical streams
Noise in the horizontal and vertical directions is extracted by the noise extraction layer. High-level representations of the resampling traces are then extracted from this noise. Experimental observations show that part of the evidence is lost and the overall resampling detection performance degrades if only a single direction of the noise gradient is used instead of both. Therefore, two identical pipelines (the horizontal stream and the vertical stream) are designed to extract features independently from the two directions, and the weights of the two pipelines are not shared. As shown in Figure 1, the horizontal and vertical streams are each composed of five similar groups. Each group consists of four layers, namely a convolution layer, a batch normalization layer, an activation layer and a pooling layer. The fifth group receives additional features from the interleaved stream. Finally, the features generated by the horizontal/vertical and interleaved streams are concatenated.

Interleaved stream
In order to determine the resampling behavior, the two directional features need to be fused so that a decision can be made jointly. We therefore propose a feature fusion strategy, namely the interleaved stream, to combine the two sets of features. The interleaved stream consists of four similar groups. The first group consists of a convolutional layer, a batch normalization layer, and an activation layer. The remaining groups have an additional pooling layer. The features from the first group of the horizontal and vertical streams are concatenated and then processed by 1 × 1 convolutional kernels with stride 1. This type of convolution kernel linearly weights each position of the two feature maps for fusion. The remaining three groups then extract higher-level representations of the fused features. Finally, the feature map output by the interleaved stream is fed back into the horizontal and vertical streams. Through this cooperative learning, the horizontal and vertical streams are aware of each other without affecting the feature extraction in each direction.
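The fusion step can be sketched as follows: concatenating the two sets of feature maps along the channel axis and applying a 1 × 1 kernel amounts to a per-position weighted sum across channels. This is only an illustration under assumed names and shapes (`fuse_1x1`, feature maps stored as nested lists of shape [channels][height][width]); the number of channels and the learned weights of the actual network are not specified in the text.

```python
def fuse_1x1(h_maps, v_maps, weights, bias=None):
    """Concatenate two stacks of feature maps and apply 1x1 kernels.

    h_maps, v_maps: lists of channels, each a 2-D list [H][W].
    weights: one row of per-channel weights for every output channel.
    """
    stacked = h_maps + v_maps  # channel-wise concatenation
    height, width = len(stacked[0]), len(stacked[0][0])
    bias = bias or [0.0] * len(weights)
    fused = []
    for w_row, b in zip(weights, bias):
        # Each output pixel is a linear combination of the same pixel
        # position in every input channel.
        fused.append([[b + sum(w * ch[i][j] for w, ch in zip(w_row, stacked))
                       for j in range(width)]
                      for i in range(height)])
    return fused


h_feat = [[[1.0, 2.0], [3.0, 4.0]]]        # one horizontal-stream channel
v_feat = [[[10.0, 20.0], [30.0, 40.0]]]    # one vertical-stream channel
fused = fuse_1x1(h_feat, v_feat, weights=[[0.5, 0.5]])
```

With equal weights of 0.5, the single output channel is simply the average of the two directional maps at every position.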

Classification
The final features learned from the previous layers are fed into the classifier, which is a fully connected layer using the softmax or sigmoid function. Through such a classifier, the probability that the feature belongs to each category is obtained, and the most probable category is the result of the classifier. The two involved functions are

$$P(y = 1 \mid x) = \frac{1}{1 + e^{-z}} \tag{2.1}$$

$$P(y = j \mid x) = \frac{e^{z_j}}{\sum_{k} e^{z_k}} \tag{2.2}$$

Eq. (2.1) is the sigmoid function used in binary classification, where z is the value of the neuron in the fully connected layer and P(y = 1|x) is the probability that x belongs to the positive category. Eq. (2.2) is the softmax function used in multi-class classification, where z_j is the value of the j-th neuron in the fully connected layer and P(y = j|x) is the probability that x belongs to the j-th category.
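As a small numerical illustration of the two classifier functions, the sketch below evaluates the standard sigmoid and softmax definitions on hand-picked neuron values. The function names and the example inputs are illustrative assumptions; subtracting the maximum inside the softmax is a common numerical-stability trick and does not change the result.

```python
import math


def sigmoid(z):
    """Probability of the positive class for a single neuron value z."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(zs):
    """Class probabilities for a list of neuron values z_j."""
    shift = max(zs)  # shifting by the maximum avoids overflow in exp
    exps = [math.exp(z - shift) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


p_binary = sigmoid(0.0)             # a neuron value of 0 gives probability 0.5
p_multi = softmax([1.0, 2.0, 3.0])  # the largest z_j gets the largest probability
predicted = p_multi.index(max(p_multi))
```

The predicted category is the index of the largest probability, which corresponds to taking the most probable category as the classifier output.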

Convolution layers
The convolution layer in a CNN is used to extract features. The convolutional operation between the input feature maps and a convolutional layer within the CNN architecture is defined by Eq. (2.3), where * indicates a 2D convolution operation.
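For reference, a convolutional layer is commonly written in the following form. This is a sketch in assumed notation, not necessarily the exact expression used by the authors: h_j^{(n)} is taken to be the j-th output feature map of layer n, h_k^{(n-1)} the k-th input feature map, w_{kj}^{(n)} the kernel connecting them, b_j^{(n)} a bias term, and K the number of input feature maps.

```latex
h_j^{(n)} = \sum_{k=1}^{K} h_k^{(n-1)} * w_{kj}^{(n)} + b_j^{(n)}
```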

Batch normalization layers
In order to mitigate internal covariate shift, which is the change of the input data distribution of each layer during the training of a CNN, we need to normalize the feature maps generated by the convolutional layers. To do this, a batch normalization layer is used between the convolution layer and the activation layer. The batch normalization operation within the CNN architecture is given by Eq. (2.4) to (2.7).
First, the mean and variance of all data in a batch are calculated as shown in Eq. (2.4) and (2.5),

$$\mu = \frac{1}{m} \sum_{i=1}^{m} x_i \tag{2.4}$$

$$\sigma^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu)^2 \tag{2.5}$$

where m denotes the number of data in a batch, x_i signifies the i-th data in a batch, and µ and σ² indicate the mean and variance, respectively.
Then each data is normalized to generate a new data x̂_i with a mean of 0 and a variance of 1, as shown in Eq. (2.6),

$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} \tag{2.6}$$

where ε is a small floating-point number greater than 0 that prevents division by zero. Finally, all data is scaled and shifted,

$$y_i = \gamma \hat{x}_i + \beta \tag{2.7}$$

where y_i indicates the i-th output of the batch normalization layer, and γ and β are parameters learned by the network. The newly generated data can make better use of the nonlinearity of the activation function. The batch normalization layer also prevents small changes in the parameters of earlier layers from causing large changes in later layers.
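The four batch normalization steps (batch mean, batch variance, normalization, scale and shift) can be traced with a short sketch on a one-dimensional batch. The function name, the default ε of 1e-5 and the example values of γ and β are assumptions chosen for illustration; in the network, γ and β are learned.

```python
def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a batch of scalars, then scale by gamma and shift by beta."""
    m = len(xs)
    mu = sum(xs) / m                                      # batch mean
    var = sum((x - mu) ** 2 for x in xs) / m              # batch variance
    x_hat = [(x - mu) / (var + eps) ** 0.5 for x in xs]   # zero mean, unit variance
    return [gamma * xh + beta for xh in x_hat]            # scale and shift


batch = [1.0, 2.0, 3.0, 4.0]
normalized = batch_norm(batch)
shifted = batch_norm(batch, gamma=2.0, beta=3.0)
```

With γ = 1 and β = 0 the output has approximately zero mean and unit variance; with γ = 2 and β = 3 the mean moves to about 3 and the standard deviation to about 2.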

Activation layers
In order to compensate for the limited expressiveness of linear models, the convolution layer is generally followed by an activation layer containing a non-linear function called the activation function. The features generated by the convolution layer are transformed into another space by the activation function, in which the data can be better classified.
TanH, Sigmoid, and ReLU are widely applied activation functions. The TanH function squashes the data to [-1,1]. TanH performs well when features differ significantly, because it continuously amplifies the effect of the features. Besides, TanH is preferred to Sigmoid in practical applications because its mean value is 0. Although training with the ReLU function is faster than with TanH, ReLU is fragile during training and cannot be used with a large learning rate. Hence, the activation layers in our network are equipped with the TanH function.

Pooling layers
The pooling layer aims to reduce the dimensionality of the feature maps. This reduces the computational cost of training and decreases the chance of over-fitting. In our CNN, two different types of pooling are applied, i.e., max pooling and global average pooling. Max pooling, which retains the maximum value within the local neighborhood of the sliding window, is applied in all pooling layers except for the fifth group of the horizontal and vertical streams. Max pooling layers use kernels of size 3 × 3 and a stride of 2; 3 × 3 is the smallest kernel size that captures the notion of left/right, up/down and center. Global average pooling reduces each feature map to a single value by averaging, and it can replace a fully connected layer to reduce the number of model parameters. Note that global average pooling is only used in the last pooling layer of the horizontal and vertical streams.
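The two pooling operations can be illustrated on a single small feature map. This is a minimal sketch with assumed function names; it uses no padding at the borders, which is an assumption since the padding used by the pooling layers is not stated.

```python
def max_pool(fmap, size=3, stride=2):
    """Keep the maximum of each size x size window, moving by `stride`."""
    h, w = len(fmap), len(fmap[0])
    return [[max(fmap[i + di][j + dj]
                 for di in range(size) for dj in range(size))
             for j in range(0, w - size + 1, stride)]
            for i in range(0, h - size + 1, stride)]


def global_average_pool(fmap):
    """Reduce a whole feature map to one value, its mean."""
    values = [v for row in fmap for v in row]
    return sum(values) / len(values)


# 5 x 5 feature map holding the values 0..24 in row-major order.
fmap = [[float(5 * i + j) for j in range(5)] for i in range(5)]
pooled = max_pool(fmap)            # 2 x 2 output for a 5 x 5 input
summary = global_average_pool(fmap)
```

Under these assumptions a 5 × 5 map shrinks to 2 × 2 after max pooling, while global average pooling returns one number per feature map regardless of its size.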

EXPERIMENTAL RESULTS
In this section, we first describe the image dataset used in our experiments and the detail setups of the proposed model, and then present extensive experiments to show the effectiveness of the proposed model.

Performance measure and experimental details
In order to thoroughly evaluate our model and the baselines, accuracy is employed as the evaluation metric. For each image, we generate the highest ranked label and compare it to the ground truth label. The accuracy is the number of correctly predicted labels divided by the number of generated labels.
All models in our experiments are trained and tested using Keras on a single NVIDIA Titan GPU. The network is trained from scratch using the annotated training images. The weights of the network are initialized by Xavier initialization. We train the CNN model using stochastic gradient descent with a batch size of 32 images, a momentum [29] of 0.9, and a weight decay of 1e-5. The learning rate is initialized at 0.01 and is gradually reduced during training with a step decay: it is divided by 10 whenever the validation loss stops improving for three epochs. The model is trained for 30 epochs.
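The learning-rate rule described above (start at 0.01, divide by 10 once the validation loss has not improved for three epochs) can be simulated with a few lines of plain Python. The function name and the example loss values are made up for illustration, and details such as whether the stall counter resets after a reduction are assumptions.

```python
def plateau_schedule(val_losses, initial_lr=0.01, factor=0.1, patience=3):
    """Return the learning rate used in each epoch.

    The rate is multiplied by `factor` after `patience` consecutive
    epochs without a new best validation loss.
    """
    lr = initial_lr
    best = float("inf")
    stale = 0
    history = []
    for loss in val_losses:
        history.append(lr)       # rate in effect during this epoch
        if loss < best:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                lr *= factor
                stale = 0        # assumed: counter restarts after a reduction
    return history


# Hypothetical validation losses: no improvement during epochs 3 to 5.
rates = plateau_schedule([1.0, 0.8, 0.9, 0.85, 0.81, 0.7])
```

In this made-up run the first five epochs use 0.01 and the sixth uses a rate ten times smaller.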

Resampling detection with recompression
In this section, we set up two different sets of experiments on the detection of resampling in recompressed images. First, we evaluate the performance of our model with fixed resampling parameters and JPEG quality factors. Second, we address a more difficult task in which the resampling parameters and JPEG quality factors are chosen randomly. The performance of the proposed method on images of different sizes is also evaluated.

On fixed parameters
In our first set of experiments, we use our proposed CNN architecture to detect resampling operation in the recompressed images with different scaling factors and JPEG compression quality factors.
In order to conduct these experiments, we collect 5,000 RAW images of size 3154 × 5286. All operations are performed on RGB color images for a more realistic simulation of the image processing. Each image is divided into subimages of 512 × 512 pixels, and seven central subimages are retained. Then JPEG compression is performed using a quality factor in the 95-97 range in order to simulate the built-in compression parameters of cameras. This eventually yields 35000 unaltered image blocks. Next, we create several sets of resampled and recompressed images. We consider image resampling using bilinear interpolation with three different scaling factors, i.e., 50%, 120% and 150%. Each unaltered block is first rescaled using each of the three scaling factors, and then the central 256 × 256 block of the scaled block is retained to form the final database. After this, each resampled image and its corresponding unaltered version are compressed with different quality factors, i.e., 50, 60, 70, 80 and 90.
Finally, we create 16 different databases, each of which consists of 70000 images (35000 resampled images and 35000 unresampled images). These images are converted to grayscale by retaining only the green color layer of each image. Each database is divided into three parts: 50000 images are used for training, 10000 images for validation, and the rest for testing.
In order to evaluate the effectiveness of the proposed model, it is compared with the constrained convolutional neural network [25] using the same training and testing datasets. The comparison results are given in Table 1. The proposed CNN is superior to the constrained convolutional neural network. Specifically, our CNN achieves better identification rates for lower upsampling factors and lower quality factors. It can be noted from Table 1 that our method performs similarly to Bayar's method [25] under high quality post-compression. However, under post-compression with medium and low quality factors, our method shows a clear advantage. In the case of a QF of 50 and an upsampling factor of 120%, our method improves performance by nearly 4% compared to Bayar's method [25]; when the downsampling factor is 50%, the performance of our method is also improved by nearly 4%.
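The rescale-then-crop step of the database construction can be sketched in plain Python. This is an illustration only: the interpolation convention (pixel centres aligned, borders clamped) and the function names are assumptions, and the JPEG compression stages are omitted because they require an image library.

```python
def bilinear_resize(img, scale):
    """Rescale a 2-D list of pixel values with bilinear interpolation."""
    h, w = len(img), len(img[0])
    new_h, new_w = max(1, round(h * scale)), max(1, round(w * scale))
    out = []
    for i in range(new_h):
        # Map the output pixel centre back to input coordinates (clamped).
        y = min(max((i + 0.5) / scale - 0.5, 0.0), h - 1)
        y0 = int(y)
        y1 = min(y0 + 1, h - 1)
        fy = y - y0
        row = []
        for j in range(new_w):
            x = min(max((j + 0.5) / scale - 0.5, 0.0), w - 1)
            x0 = int(x)
            x1 = min(x0 + 1, w - 1)
            fx = x - x0
            top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
            bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


def center_crop(img, size):
    """Keep the central size x size block."""
    h, w = len(img), len(img[0])
    top, left = (h - size) // 2, (w - size) // 2
    return [row[left:left + size] for row in img[top:top + size]]


# Toy stand-in for a 512 x 512 block: an 8 x 8 constant patch.
block = [[7.0] * 8 for _ in range(8)]
upsampled = bilinear_resize(block, 1.5)   # 150% scaling gives 12 x 12
downsampled = bilinear_resize(block, 0.5) # 50% scaling gives 4 x 4
cropped = center_crop(upsampled, 4)       # stand-in for the 256 x 256 crop
```

Scaling a 512 × 512 block by 50% gives exactly 256 × 256 pixels, so under this sketch the central crop leaves such a block unchanged, while the 120% and 150% versions are trimmed to their central region.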

On random parameters and different image sizes
In the previous experiments, each database includes only one scaling factor and one quality factor. Next, we assess the performance of our approach in a more complex scenario where the quality factors and the resampling parameters are chosen arbitrarily. Additionally, the effectiveness of our method is evaluated under different image sizes. To accomplish this, we create training and testing datasets using the same unaltered 512 × 512 image blocks that we collected in the previous experiment. The images are first resampled using bilinear interpolation with an upsampling or downsampling scaling factor chosen uniformly at random from the range of possible values. The upsampling and downsampling factors range from 110% to 200% and from 50% to 90%, respectively, at intervals of 10%. Then we retain the central blocks of the resampled images and their corresponding Finally, three databases with different dimensions are created, where each database comprises three sub-databases, i.e., upsampling-jpeg, downsampling-jpeg and unaltered-jpeg. Each sub-database consists of 35,000 images, of which 25,000 images are used for training, 5,000 images for validation, and the rest for testing. We compare the results of our proposed method and Bayar's method [25] on images with different dimensions, as shown in Table 2. The experimental results demonstrate that even in this more challenging scenario, our CNN is still able to accurately identify image resampling. The overall classification rate of our approach ranges from 88.46% to 99.07%. Although the lowest detection accuracy of our approach is 88.46%, it is still very high compared with Bayar's method. We also notice that the performance of Bayar's method degrades severely when the image dimension is reduced.

Resampling detection with post-processing
In order to evaluate the performance of our method in another challenging scenario, we use our CNN to detect image resampling when an additional manipulation is applied to the resampled images, followed by JPEG recompression. To this end, we conduct another experiment where each image patch is first resampled, then edited by one manipulation, and finally JPEG compressed using different quality factors. The five editing operations and their parameters are listed in Table 4; they are Gamma correction (GC), Mean filtering (MeanF), Gaussian filtering (GF), Median filtering (MedF) and Wiener filtering (WF). We create the data for this experiment from the 35,000 unaltered 512 × 512 image blocks collected in the first experiment (Section 3.2.1). The image blocks are first resampled using bilinear interpolation with a randomly chosen upsampling or downsampling scaling factor. The upsampling and downsampling factors range from 110% to 200% and from 50% to 90%, respectively, at intervals of 10%. Then the resampled image blocks and their corresponding unaltered versions are modified with a specific manipulation, whose parameter is chosen uniformly at random from the set of possible values. After this, each manipulated image block is compressed using a quality factor selected arbitrarily in the range of 50-100 at intervals of 10, and then the central 256 × 256 block is retained. Finally, five databases are built, namely GC, MeanF, GF, MedF and WF. Table 3 reports the detection rates of our proposed method and Bayar's method [25] when one manipulation is applied to the resampled images. The performance of both methods degrades to different degrees under different post-processing operations. Noticeably, our proposed method is superior to Bayar's method. For upsampled images, our proposed CNN achieves a detection rate of at least 87.90% (MedF) for all types of manipulations, and up to 97.85% (GC). Both values are significantly higher than the 81.89% and 96.50% of Bayar's method. Although the influence of post-processing operations on downsampling is greater than that on upsampling, the results of our proposed method are much better than those of Bayar's method. The detection rates of our approach for each type of manipulation are typically greater than 77%, except for images processed by the MeanF operation, which are detected with an accuracy of 73.88%. These results demonstrate that even in this more challenging scenario, our CNN is still able to accurately detect image manipulations.

Estimation of resampling parameters
In the previous experiments, the proposed CNN is shown to be effective at detecting resampling features in recompressed images with different scaling factors and JPEG compression quality factors. In this part, in order to evaluate the performance of the proposed approach in resampling parameter estimation, two multi-class classifiers are trained for the parameter estimation of upsampling and downsampling, respectively. Similarly to the previous set of experiments, the 35000 unaltered image blocks of size 512 × 512 from the first experiment are used to build the database. The images are resampled using bilinear interpolation with fifteen different scaling factors, i.e., 50%, 60%, 70%, 80%, 110%, 120%, 130%, 140%, 150%, 160%, 170%, 180%, 190% and 200%. Then the central 256 × 256 block of each resampled image is retained to form the final database. Next, the generated blocks of size 256 × 256 pixels are compressed using a quality factor randomly selected in the range of 50-100 at intervals of 10. Eventually, 15 different databases are created. In each database, we convert the images to grayscale by retaining only the green color layer of each image, and we use 25,000 images for training, 5,000 images for validation and 5,000 images for testing.

Table 6. Average detection accuracy for upsampling and downsampling on image blocks with size of 64 × 64 pixels.
            Upsampling    Downsampling
Accuracy    89%           77%

Table 5 shows the confusion matrix achieved by the proposed CNN. Actual labels are listed along the side of the confusion matrix and predictions across the top. The accuracies of the proposed CNN for the different scaling factors are greater than 91%, and the scaling factor of 200% yields the best result of 99.58%. Some images with neighboring scaling factors, such as 60% and 70%, appear to be difficult for the classifier to distinguish from the scaling factor of 50%.
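A confusion matrix of this kind, with true labels on the rows, predicted labels on the columns and entries in percent, can be computed as in the following sketch. The function name and the toy labels are illustrative assumptions and are unrelated to the reported results.

```python
def confusion_percent(true_labels, predicted_labels, classes):
    """Row-normalized confusion matrix: rows are true classes, in percent."""
    index = {c: k for k, c in enumerate(classes)}
    counts = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        counts[index[t]][index[p]] += 1
    return [[100.0 * n / sum(row) if sum(row) else 0.0 for n in row]
            for row in counts]


# Toy example with two scaling-factor classes (values are made up).
truth = [50, 50, 60, 60]
detected = [50, 60, 60, 60]
matrix = confusion_percent(truth, detected, classes=[50, 60])
```

Each diagonal entry is the per-class accuracy, so the off-diagonal entries of a row show which other scaling factors a given factor is confused with.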

Image splicing detection with resampling
We also assess the performance of our approach on image splicing detection. In this experiment, 64 × 64 is chosen as a small block size for detecting resampling. In order to train the variant CNN, we create a dataset using the same unaltered 512 × 512 pixel image blocks that we collected in the first experiment. These image blocks are first cut into 64 × 64 pixel subblocks, and then 300,000 subblocks are randomly selected, followed by resampling and JPEG recompression. Scaling factors and quality factors are randomly chosen in the range of 0.5 to 2 at intervals of 0.01 and in the range of 50 to 100 at intervals of 1, respectively. Finally, we create a database which consists of 900,000 images of size 64 × 64 pixels (300,000 upsampled images, 300,000 downsampled images and 300,000 unresampled images). The dataset is divided into three parts: 80% of the images are used for training, 10% for validation, and the rest for testing. Two binary classifiers are trained to deal with upsampling and downsampling, respectively.
The resampling detection results are shown in Table 6. From Table 6, we can observe that the accuracy of upsampling classification reaches 89%, which is already an acceptable result. However, the accuracy of downsampling classification is only 77%.
The Wild forensics dataset [30,10] is used to verify the effectiveness of the proposed approach in image splicing detection. The tampered areas of the test images are resampled with unknown parameters and compressed by JPEG. As illustrated in Figure 3, our method successfully detects and locates the resampled region in some images, which indicates that the resampling feature is helpful for image splicing detection. However, such a method can only detect the resampled part of a tampered image. If the spliced part has not been resampled, or the image has undergone a variety of post-processing operations such as sharpening and filtering, our method will not work.

Effectiveness validation of model architecture
The structural design of a CNN's architecture has a large impact on its final accuracy. We conduct several sets of additional experiments related to the structural design of our CNN's architecture. In this subsection, we present experimental results to further validate the rationality of the proposed model. Three parts of the proposed model are considered, namely the noise extraction layers, the horizontal and vertical streams, and the activation functions. The sub-database (Gamma correction, downsampling) that we generated in the previous experiment in Section 3.4 is used for this experiment.

Effectiveness of noise extraction layers
As described in Section 2.2, two low-order high pass filters are used to extract noise maps in our proposed method. A set of experiments is conducted in this section to verify the effectiveness of the noise extraction layers. We consider three different cases for the noise extraction layer: using the two high-order high pass filters from Figure 4 (denoted as "high-order"), using the two low-order high pass filters adopted in our method ("low-order"), and removing the noise extraction layer ("no noise layer").
The experimental results are shown in Figure 5. We can observe that the low-order high pass filters achieve the best performance, with an accuracy distinctly higher than that of the other settings. The accuracy of the high-order high pass filters is slightly higher than that obtained when the noise extraction layer is removed.

Performance of horizontal/vertical and interleaved streams
In order to assess the contribution of the horizontal and vertical streams, four cases are studied: using only the horizontal stream, using only the vertical stream, using the horizontal and vertical streams without the interleaved stream, and using the horizontal and vertical streams with the interleaved stream. Table 7 shows that our full model is superior to the other models, with an accuracy of 97.37%. The accuracies of the horizontal-only and vertical-only models are 95.52% and 96.25%, respectively. Both are lower than the accuracies of the other two models. The primary reason is that only one directional feature is considered. The accuracy is typically greater than 97% when the horizontal and vertical features are considered simultaneously. Noticeably, the model with the interleaved stream performs better than the model without it. This is because the correlation between the two directions is further strengthened. Thus, both the proposed dual-stream design and the interleaved stream improve the performance of the model.

Selection of activation function
The selection of the activation function is an important part of the design of convolutional neural networks, and different activation functions have a great influence on network performance. In order to evaluate the effectiveness of the proposed model with different activation functions, two commonly used activation functions, TanH and ReLU, are chosen. From Figure 6, we can see that TanH is more suitable for our model. The performance of the network with TanH is better than that of the network with ReLU. Furthermore, TanH shows better stability and faster convergence than ReLU.

CONCLUSION
In this paper, we presented a new deep learning-based method to detect resampling in recompressed images. A noise extraction layer is used to extract the noise residual, and a dual-stream CNN is proposed to extract resampling features from the noise maps in different directions. The experimental results show that the proposed method can not only detect resampling in recompressed images, but also achieves good robustness against some additional post-processing operations. Moreover, we apply the global resampling detection method to resampling parameter estimation and image splicing detection. It should be noted that the proposed method may fail to detect resampling traces if the image has undergone complex post-processing or anti-forensic manipulations. In the future, we will extend our work to identify more types of resampling in the presence of complex operation chains and anti-forensic attacks.

Fig. 2. Two low-order high pass filters used in the (a) horizontal and (b) vertical streams of noise extraction layer.

Fig. 5. Comparison of different settings in the noise extraction layers. Here, low/high-order denote the filters illustrated in Figures 2 and 4, respectively.

Table 2. Resampling detection accuracy of different methods in different image sizes.

Table 3. Resampling detection accuracy of different methods against different post-processing.

Table 5. Confusion matrix for identifying resampling parameters using the proposed method. The relative numbers of correct and incorrect classifications are shown in percent. Tru and Det represent the true label and the detected label, respectively.

Table 7. Classification accuracy for upsampling and downsampling in four models.