Color image splicing localization algorithm by quaternion fully convolutional networks and superpixel-enhanced pairwise conditional random field

Recently, fully convolutional networks (FCNs) have been successfully used to locate spliced regions in synthesized images. However, all existing FCN-based algorithms use real-valued FCNs that process each color channel separately. As a consequence, they fail to capture the inherent correlation between color channels and the integrity of the three channels. In this paper, the quaternion fully convolutional network (QFCN) is proposed to generalize the FCN to the quaternion domain by replacing the real-valued convolutional blocks in the FCN with quaternion convolutional blocks. In addition, a new color image splicing localization algorithm is proposed by combining QFCNs with a superpixel (SP)-enhanced pairwise conditional random field (CRF). Three versions of the QFCN (QFCN32, QFCN16, and QFCN8) with different up-sampling layers are considered. The SP-enhanced pairwise CRF is used to refine the results of the QFCNs. Experimental results on three publicly available datasets demonstrate that the proposed algorithm outperforms existing algorithms, including some conventional algorithms and some deep learning-based algorithms.

applied two post-processing operations to finalize pixel-wise forged region localization. Salloum et al. [21] used a multi-task fully convolutional network (MFCN) to learn the surface labels and the edges of the spliced regions, respectively. Liu et al. [22] utilized three different fully convolutional networks (FCNs) to locate spliced regions separately and then fused the predictions of the three FCNs by a CRF to obtain the final location map. Chen et al. [23] proposed an improved splicing localization algorithm that turns the work of Liu et al. [22] into an end-to-end learning system. They also utilized a region proposal network (RPN) to enhance the learning ability in object areas, because forgery usually happens in object areas. Bappy et al. [24] proposed a two-stream network which exploits features in both the frequency domain and the spatial domain to locate forged regions by incorporating an encoder and an LSTM network. However, for color forged images, all these deep learning-based methods use real-valued CNNs to process each channel separately [25]. As a consequence, they fail to capture the inherent correlation between color channels and the integrity of the three channels [26].
The quaternion is an extension of the complex number. During the past two decades, it has been regarded as a tool for color image processing, in which the three channels of a color image are encoded into the imaginary parts of the quaternion representation (QR) [2,25-27]. The two main advantages of the QR are that: (a) it helps capture the inherent correlation between color channels; (b) it treats a color image as a vector field. So, using the QR and quaternion algebra, many classical tools developed for gray-scale images have been successfully extended to color image processing, such as the Fourier transform [26,27], neural networks [28], principal component analysis [29], kernel quaternion principal component analysis [30], the fractional Fourier transform [31], the fractional cosine transform [2], and the discrete fractional random transform [32]. Recently, the CNN, as a powerful feature representation method, has achieved fine performance in almost all vision tasks [33-37]. So, some researchers have also investigated extensions of the CNN to the quaternion domain and proposed the quaternion CNN (QCNN) model [25,38,39]. The QCNN model has been shown to achieve better results than the traditional CNN model in both the color image classification task [38] and the color image segmentation task [39].

Some preliminaries
This section recalls QR and some layers in QFCN.

Quaternion number and quaternion color representation
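Quaternion numbers are a generalization of complex numbers. A quaternion number has one real part and three imaginary parts:
$$q = a + b\,i + c\,j + d\,k, \tag{1}$$
where a, b, c, and d are four real numbers, and i, j, k are three imaginary units obeying the following rules:
$$i^2 = j^2 = k^2 = -1, \quad ij = -ji = k, \quad jk = -kj = i, \quad ki = -ik = j. \tag{2}$$
When the real part a = 0, q is called a pure quaternion. The conjugate and modulus of a quaternion number are respectively defined as
$$q^{*} = a - b\,i - c\,j - d\,k, \tag{3}$$
$$|q| = \sqrt{a^2 + b^2 + c^2 + d^2}. \tag{4}$$
Let f(u, v) be an RGB image function. Each pixel can be represented as a pure quaternion by the QR
$$f(u, v) = f_R(u, v)\,i + f_G(u, v)\,j + f_B(u, v)\,k, \tag{5}$$
where $f_R(u, v)$, $f_G(u, v)$ and $f_B(u, v)$ are respectively the red, green and blue components of the pixel (u, v).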
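As an illustration of the quaternion rules and the pure-quaternion pixel encoding described above, the following is a minimal NumPy sketch. The function names (`hamilton`, `conjugate`, `modulus`, `rgb_to_quaternion`) and the storage convention (components ordered as real, i, j, k in the last axis) are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def hamilton(p, q):
    """Hamilton product of two quaternions stored as (a, b, c, d)."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return np.array([
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,  # real part
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,  # i part
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,  # j part
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,  # k part
    ], dtype=float)

def conjugate(q):
    """Negate the three imaginary parts."""
    return np.array([q[0], -q[1], -q[2], -q[3]], dtype=float)

def modulus(q):
    """Euclidean norm of the four components."""
    return float(np.sqrt(np.sum(np.square(q))))

def rgb_to_quaternion(image):
    """Encode an (H, W, 3) RGB image as (H, W, 4) pure quaternions.

    The real part is zero; R, G, B go to the i, j, k parts.
    """
    image = np.asarray(image, dtype=float)
    zeros = np.zeros(image.shape[:2] + (1,))
    return np.concatenate([zeros, image], axis=-1)

# Unit quaternions used to check the multiplication rules.
I = np.array([0.0, 1.0, 0.0, 0.0])
J = np.array([0.0, 0.0, 1.0, 0.0])
K = np.array([0.0, 0.0, 0.0, 1.0])
```

With this convention, `hamilton(I, J)` gives `K` and `hamilton(J, I)` gives `-K`, which reflects the non-commutativity of quaternion multiplication.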

Quaternion convolutional layer
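In the quaternion convolutional layer, convolution is performed by convolving a quaternion filter matrix with a quaternion input vector. Let $W = W_0 + W_1 i + W_2 j + W_3 k$ be a quaternion filter matrix and $x = x_0 + x_1 i + x_2 j + x_3 k$ be a quaternion input vector. The quaternion convolution between W and x is given by
$$\begin{aligned} W \circledast x ={} & (W_0 * x_0 - W_1 * x_1 - W_2 * x_2 - W_3 * x_3) \\ & + (W_0 * x_1 + W_1 * x_0 + W_2 * x_3 - W_3 * x_2)\,i \\ & + (W_0 * x_2 - W_1 * x_3 + W_2 * x_0 + W_3 * x_1)\,j \\ & + (W_0 * x_3 + W_1 * x_2 - W_2 * x_1 + W_3 * x_0)\,k, \end{aligned} \tag{6}$$
where $*$ is the real-valued convolution.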
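The quaternion convolution can be assembled from sixteen real-valued convolutions combined according to the Hamilton product. The sketch below illustrates this for a single 2-D filter. It assumes the "valid" cross-correlation commonly used in CNN layers as the real-valued convolution, and the names `real_conv2d` and `quaternion_conv2d` are hypothetical, not from the paper.

```python
import numpy as np

def real_conv2d(signal, kernel):
    """Valid 2-D cross-correlation of a real image with a real kernel."""
    windows = np.lib.stride_tricks.sliding_window_view(signal, kernel.shape)
    return np.einsum("ijkl,kl->ij", windows, kernel)

def quaternion_conv2d(kernel, signal):
    """Quaternion convolution built from real-valued convolutions.

    kernel: array (4, kh, kw) holding W0, W1, W2, W3.
    signal: array (4, H, W) holding x0, x1, x2, x3.
    Returns an array (4, H - kh + 1, W - kw + 1).
    """
    w0, w1, w2, w3 = kernel
    x0, x1, x2, x3 = signal
    c = real_conv2d
    return np.stack([
        c(x0, w0) - c(x1, w1) - c(x2, w2) - c(x3, w3),  # real part
        c(x1, w0) + c(x0, w1) + c(x3, w2) - c(x2, w3),  # i part
        c(x2, w0) - c(x3, w1) + c(x0, w2) + c(x1, w3),  # j part
        c(x3, w0) + c(x2, w1) - c(x1, w2) + c(x0, w3),  # k part
    ])
```

For a 1 × 1 kernel and a 1 × 1 input, the result reduces to the Hamilton product of two quaternions, which gives a quick sanity check of the signs.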

Quaternion batch normalization layer
A batch normalization layer is usually used to speed up training in a real-valued CNN [40]. In some cases, batch normalization is essential to train a model. So, the quaternion batch normalization (QBN) proposed in [39] is also utilized for the QCNN. The QBN consists of the following two steps.
Firstly, a whitening approach is used to normalize the input data x. The whitening approach is
$$\tilde{x} = W\,(x - E[x]), \tag{7}$$
where E[x] is the mean of x and W is the whitening matrix obtained from the covariance matrix of x. Secondly, two learnable parameters are introduced to make sure that the transformation inserted in this layer can represent the identity transform. The two learnable parameters γ and β scale and shift the normalized value as follows:
$$\mathrm{QBN}(x) = \gamma\,\tilde{x} + \beta. \tag{8}$$
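The two steps of quaternion batch normalization (whitening, then a learnable scale and shift) can be sketched as follows. The text does not state how the whitening matrix is computed or what shape γ takes, so this sketch assumes a symmetric inverse square root of the 4 × 4 covariance matrix and a 4 × 4 matrix γ. The function name and the `eps` stabilizer are also assumptions.

```python
import numpy as np

def quaternion_batch_norm(x, gamma, beta, eps=1e-5):
    """Whiten a batch of quaternions, then scale and shift.

    x:     array (N, 4), one quaternion (real, i, j, k) per row.
    gamma: array (4, 4), learnable scaling (assumed to be a matrix).
    beta:  array (4,), learnable shift.
    """
    centered = x - x.mean(axis=0)
    # 4 x 4 covariance between the quaternion components.
    cov = centered.T @ centered / x.shape[0] + eps * np.eye(4)
    # Symmetric inverse square root of the covariance (assumed choice).
    eigval, eigvec = np.linalg.eigh(cov)
    whitener = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T
    whitened = centered @ whitener.T
    return whitened @ gamma.T + beta
```

With γ set to the identity matrix and β set to zero, the output has zero mean and approximately identity covariance, so the four components are decorrelated and have unit variance.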

Other layers
Other layers in QCNN models, such as quaternion activation layers, quaternion pooling layers and quaternion dropout layers, are obtained from the corresponding real-valued layers by a so-called split approach [25]. Taking the quaternion ReLU activation function as an example, the quaternion ReLU activation function $\mathrm{QReLU}$ is obtained by applying a separate real-valued ReLU to all four parts of a quaternion $q = a + b\,i + c\,j + d\,k$:
$$\mathrm{QReLU}(q) = \mathrm{ReLU}(a) + \mathrm{ReLU}(b)\,i + \mathrm{ReLU}(c)\,j + \mathrm{ReLU}(d)\,k, \tag{9}$$
where $\mathrm{ReLU}$ is the real-valued ReLU function.
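The split approach amounts to applying the real-valued operation to each of the four components independently. A minimal sketch for the quaternion ReLU is given below, assuming quaternions are stored with their four components in the last axis; the function name is hypothetical.

```python
import numpy as np

def quaternion_relu(q):
    """Split ReLU: real-valued ReLU applied to each quaternion component.

    q: array (..., 4) holding the real, i, j and k parts.
    """
    return np.maximum(np.asarray(q, dtype=float), 0.0)
```

Because the operation is element-wise, the same function works on a single quaternion or on a whole feature map of quaternions.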

Proposed algorithm
In this section, the QFCN is first proposed to generalize the real-valued FCN to the quaternion domain. Then, the SP-enhanced pairwise CRF used in the proposed color image splicing localization algorithm is described. Finally, the main architecture of the proposed algorithm is presented.

Quaternion fully convolutional neural networks (QFCNs)
The FCN [41] is a special type of CNN with only convolutional layers. There are three commonly used versions of the FCN (FCN32, FCN16, FCN8), cast from VGG 16 [33] with different up-sampling layers. Taking FCN32 as an example, the input image is processed by seven convolutional blocks to generate feature maps. Then, a 1 × 1 convolutional kernel is used to predict scores for each class. Finally, a deconvolutional layer is used to up-sample the coarse outputs to pixel-dense predictions.
The feature maps generated by the seven convolutional blocks greatly affect the final results. So, in this paper, the quaternion convolutional blocks given in subsection 2.2.1 are used to replace the real-valued convolutional blocks in the FCN. In addition, the input image is represented by the QR. The architectures of the original FCN32 and the proposed QFCN32 are shown in Figure 1. QFCN16 and QFCN8 are built in the same way as QFCN32.

SP-enhanced pairwise CRF
In [22,23], the pairwise CRF is used to refine the results of the FCN. The CRF is a probabilistic graphical model which formulates the label assignment problem as a probabilistic inference problem. It assigns the same label to similar pixels by capturing the consistency between pixels. However, the pairwise CRF used in [22,23] only considers unary and pairwise potentials. It is not expressive enough to model higher-level consistency, such as region-level consistency, co-occurrence of objects or detector-based cues [42,43].
In order to capture region-level consistency, Sulimowicz et al. [44] first introduced SP-enhanced pairwise potentials, and then proposed the SP-enhanced pairwise CRF by combining the conventional potentials used in the pairwise CRF with their SP-enhanced pairwise potentials. The SP-enhanced pairwise potentials incorporate superpixel-based higher-order cues by conditioning on a superpixel segmentation image, which is obtained by an unsupervised segmentation algorithm, namely the mean-shift algorithm. The mean-shift algorithm works by clustering pixels on the basis of low-level image features [45]. For different images, the numbers of superpixels obtained by this algorithm are usually different. Furthermore, Sulimowicz et al. [44] proved theoretically that the sum of the SP-enhanced pairwise potentials inside each superpixel is equal to the robust superpixel-based CRF model proposed in [45]. Therefore, the SP-enhanced pairwise CRF is also a robust superpixel-based model. Experimental results presented in [44] show that the SP-enhanced pairwise CRF achieves better performance than the pairwise CRF. So, in this paper, the SP-enhanced pairwise CRF is also used to refine the results of the QFCN.

Main architecture of the proposed algorithm
The main architecture of the proposed algorithm is shown in Figure 2.
Figure 2. Main architecture of the proposed algorithm.
The proposed algorithm uses the QFCN to predict the splicing location map and considers three QFCNs with different up-sampling layers: QFCN32, QFCN16 and QFCN8. The reason for considering three networks is that one network can be specialized in handling one aspect of the whole problem, while the fusion of three networks can deal with different scales of image content [22]. The SP-enhanced pairwise CRF is applied to all three networks to improve the results obtained from the QFCNs. In addition, quaternion batch normalization is used to make the networks converge more easily. Finally, the final location map is obtained by merging the predictions of the three networks. For each pixel, let $y_{32}$, $y_{16}$, and $y_8$ denote the predictions of the three networks, each taking the value 0 (unforged) or 1 (forged). The final prediction m of this pixel is given by
$$m = \begin{cases} 1, & y_8 + y_{16} + y_{32} \ge 2, \\ 0, & \text{otherwise}. \end{cases} \tag{10}$$
Eq. (10) shows that a pixel is labeled as forged if at least two networks predict this pixel as forged.
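The fusion rule, under which a pixel is forged when at least two of the three networks mark it as forged, is a per-pixel majority vote. A minimal NumPy sketch is shown below; the function name and the convention of passing binary masks as arrays are assumptions for illustration.

```python
import numpy as np

def fuse_predictions(y8, y16, y32):
    """Majority vote over three binary masks (0 = unforged, 1 = forged)."""
    # Cast to integers so that boolean masks are summed, not OR-ed.
    votes = (np.asarray(y8, dtype=np.int64)
             + np.asarray(y16, dtype=np.int64)
             + np.asarray(y32, dtype=np.int64))
    return (votes >= 2).astype(np.uint8)
```

For example, the masks `[0, 1, 1, 0]`, `[0, 0, 1, 1]` and `[1, 0, 1, 1]` receive 1, 1, 3 and 2 votes per pixel, so only the last two pixels are labeled as forged.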
The details of the proposed algorithm are as follows:
(a) Taking QFCN32 as an example, an input image is first represented by the QR in Eq. (5). Then, seven quaternion convolutional blocks process the quaternionic input to generate feature maps. Each of the first two blocks consists of two 3 × 3 quaternion convolutional layers, each of which is followed by a quaternion ReLU layer and a quaternion batch normalization layer. The following three blocks are similar to the first two but have three 3 × 3 quaternion convolutional layers. Moreover, each of the first five blocks is followed by a quaternion pooling layer. Each of the last two blocks has a quaternion convolutional layer followed by a quaternion ReLU layer and a quaternion dropout layer. The kernel sizes of the last two blocks are 7 × 7 and 1 × 1, respectively. In addition, the numbers of kernels in the seven blocks are 64, 128, 256, 512, 512, 4096, and 4096, respectively.
(b) The generated maps are fed into a 1 × 1 convolutional layer to predict the score for each class. Then a deconvolutional layer is used to up-sample the coarse outputs to pixel-dense predictions. Finally, an SP-enhanced pairwise CRF layer is used to refine the result.
(c) QFCN8 and QFCN16 are similar to QFCN32. The difference is that they have different up-sampling layers. QFCN16 fuses the results of the fifth and fourth quaternion pooling layers before the deconvolutional layer, while QFCN8 combines the results of the fifth, fourth and third quaternion pooling layers.
(d) The final predictions are obtained from the three predictions of QFCN8, QFCN16 and QFCN32 by Eq. (10).
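The block layout described in step (a) can be written down as a small configuration table. The sketch below records the number of convolutional layers, the kernel size, the number of kernels and whether a pooling layer follows each block, and then derives the overall down-sampling factor. The pooling stride of 2 is an assumption carried over from VGG 16 and is not stated in the text; the names `QFCN32_BLOCKS` and `downsampling_factor` are hypothetical.

```python
# One tuple per block: (conv layers, kernel size, number of kernels, pooled).
QFCN32_BLOCKS = [
    (2, 3, 64, True),
    (2, 3, 128, True),
    (3, 3, 256, True),
    (3, 3, 512, True),
    (3, 3, 512, True),
    (1, 7, 4096, False),
    (1, 1, 4096, False),
]

def downsampling_factor(blocks, pool_stride=2):
    """Spatial reduction caused by the pooling layers (stride is assumed)."""
    factor = 1
    for _layers, _kernel, _filters, pooled in blocks:
        if pooled:
            factor *= pool_stride
    return factor

def total_conv_layers(blocks):
    """Number of quaternion convolutional layers across all blocks."""
    return sum(layers for layers, _kernel, _filters, _pooled in blocks)
```

Under the stride-2 assumption, the five pooling layers reduce the resolution by a factor of 32, which is the factor that the deconvolutional layer of QFCN32 has to recover.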

Experimental results and analysis
In this section, we compare the proposed color image splicing localization algorithm with some state-of-the-art algorithms on three publicly available datasets. All the deep learning-based algorithms are implemented in Keras and run on a machine with an 11 GB GeForce GTX 1080 Ti GPU, a 3.20 GHz i7-6900K CPU, and 65 GB of RAM. The conventional algorithms are implemented in Matlab.

Experimental datasets
In order to evaluate the performance of the proposed algorithm, three datasets are considered in this paper: CASIA v1.0, CASIA v2.0 [46], and Columbia color DVMM [47]. CASIA v1.0 is a forgery dataset which focuses on color image splicing. The tampered regions are carefully selected and some post-processing operations are also applied. This dataset is composed of 1721 images with a resolution of 384 × 256 or 256 × 384. The numbers of authentic and forged images are 800 and 921, respectively. CASIA v2.0 is an extended version of CASIA v1.0 with more forged images and more post-processing operations. It contains 7491 authentic images and 5123 spliced images with resolutions from 240 × 160 to 900 × 600. DVMM is the first publicly available color image dataset for image forgery detection and localization without editing or post-processing. This dataset contains 183 authentic images and 180 spliced images with resolutions from 757 × 568 to 1152 × 768. Notice that CASIA v1.0 and CASIA v2.0 do not provide ground truth masks. So, we use Adobe Photoshop to generate the ground truth masks from the corresponding host images. Some forged images and their corresponding ground truth masks are illustrated in Figure 3.

Evaluation metric
The accuracy of splicing localization is evaluated by the following per-pixel metric, the F-measure:
$$F = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}. \tag{11}$$
Here Precision means the probability that a detected forgery is truly forged, while Recall represents the probability that a forgery is detected. They are defined by
$$\mathrm{Precision} = \frac{T_P}{T_P + F_P}, \qquad \mathrm{Recall} = \frac{T_P}{T_P + F_N}, \tag{12}$$
where $T_P$ is the number of correctly detected forged pixels, $F_N$ is the number of missed forged pixels, and $F_P$ is the number of pixels erroneously detected as forged.
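The per-pixel F-measure can be computed directly from a predicted mask and a ground truth mask. The following NumPy sketch illustrates the computation. Returning 0 when a denominator is zero is an assumed convention that the text does not specify, and the function name is hypothetical.

```python
import numpy as np

def f_measure(pred, truth):
    """Per-pixel F-measure between two binary masks (1 = forged)."""
    pred = np.asarray(pred).astype(bool)
    truth = np.asarray(truth).astype(bool)
    tp = int(np.sum(pred & truth))    # correctly detected forged pixels
    fp = int(np.sum(pred & ~truth))   # pixels wrongly detected as forged
    fn = int(np.sum(~pred & truth))   # missed forged pixels
    # Assumed convention: undefined ratios are treated as 0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For instance, with a prediction `[1, 1, 0, 0]` and a ground truth `[1, 0, 1, 0]`, there is one true positive, one false positive and one false negative, so Precision and Recall are both 0.5 and the F-measure is 0.5.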

Experimental results and analysis
In order to evaluate the effectiveness of the proposed QFCNs relative to the conventional real-valued FCNs, the first experiment directly trains three QFCNs (QFCN32, QFCN16 and QFCN8) and three FCNs (FCN32, FCN16 and FCN8) for splicing localization on the CASIA v2.0 dataset. Notice that the SP-enhanced pairwise CRF is not considered in this experiment because we want to compare QFCNs and FCNs directly. In the experiment, we randomly select 5/6 of the spliced images to train the models and use the remaining images to evaluate them. The average F-measure values of the three QFCN-based algorithms and the three FCN-based ones are given in Figure 4. It can be observed from Figure 4 that every QFCN-based algorithm is superior to the corresponding FCN-based algorithm with the same up-sampling layers. This is because the quaternion convolution can obtain more representative features than the conventional real-valued convolution [38].
The second experiment tests the influence of the parameters of the mean-shift algorithm in the SP-enhanced pairwise CRF. In detail, the kernel bandwidth parameter h = (h_s, h_r) of the mean-shift algorithm is considered. Here, h_s is the kernel bandwidth for the spatial domain, and h_r is the kernel bandwidth for the range domain. The same experimental dataset as in the previous experiment is used. The average F-measure values of the proposed algorithm with different parameters are given in Table 1.
The results in Table 1 show that the kernel bandwidth parameter has little influence on the proposed algorithm and that the optimal parameter among the ten compared is (8, 8). So, the parameter (8, 8) is used in the following experiments.
The third experiment evaluates the localization ability of the proposed algorithm based on QFCNs and the SP-enhanced pairwise CRF. The same experimental dataset as in the first experiment is used. We compare the proposed algorithm with some existing algorithms, including eight conventional algorithms and five deep learning-based algorithms. The eight conventional algorithms are NOI1 [4], NOI2 [5], CFA1 [6], CFA2 [7], ADQ [9], NADQ [10], DCT [11] and BLK [12]. They are implemented through a publicly available Matlab toolbox written by Zampoglou et al. and Xiao et al. [48,49]. The five deep learning-based algorithms are FCN+CRF [22], MFCN [21], QFCN+CRF, FCN+ResNet+CRF and LSTM+EnDec [24]. QFCN+CRF uses the QFCN to replace the FCN in FCN+CRF [22]. The main objective of the comparison between FCN+CRF and QFCN+CRF is to show the improvement obtained by using the QFCN. FCN+ResNet+CRF combines ResNet [37] with FCN+CRF [22]. The average F-measure values of these fourteen algorithms are compared in Figure 5. It can be seen from Figure 5 that: (a) all the deep learning-based algorithms outperform all the conventional algorithms, owing to the fact that the deep learning-based algorithms learn the features automatically; (b) among the six deep learning-based algorithms, the proposed algorithm achieves the best performance. It is better than QFCN+CRF because the SP-enhanced pairwise CRF used in the proposed algorithm is more effective than the CRF in enforcing long-range pixel-wise consistency [44]. The proposed algorithm performs better than FCN+CRF [22] due to the use of the quaternion-based method and the SP-enhanced pairwise CRF.
MFCN [21] is superior to FCN+CRF [22] and QFCN+CRF because it is also trained on the boundaries of the spliced regions.
The fourth experiment evaluates the generalization ability of the proposed algorithm. In this experiment, all spliced images in the CASIA v2.0 dataset are used for training, while the CASIA v1.0 dataset and the DVMM dataset are used for testing. Figure 6 shows the average F-measure values of all the algorithms. It can be observed from Figure 6 that the proposed algorithm and FCN+ResNet+CRF achieve the best performance on both datasets. In addition, some conventional algorithms also perform well on the DVMM dataset. The main reason is that the images in the DVMM dataset are not post-processed after tampering and do not contain small spliced regions.
The last experiment evaluates the robustness against JPEG compression, Gaussian blur and Gaussian noise. As in the previous experiment, all spliced images in the CASIA v2.0 dataset are utilized for training, while all spliced images in the CASIA v1.0 dataset are used for testing. For JPEG compression, the quality factor is set to two levels: QF = 50 and QF = 70. For Gaussian blur, the standard deviation σ of the Gaussian smoothing kernel varies from 0.5 to 2.0 with a step size of 0.5. For Gaussian noise addition, the SNR value varies from 25 dB to 15 dB with a step size of −5 dB. The average F-measure values for the three attacks are compared in Table 2. The results in Table 2 show that, similar to the previous two experiments, the proposed algorithm performs best among the fourteen compared algorithms for all three types of attacks at all levels. In addition, the performance of all algorithms decreases as the attack intensity increases. In order to better show the superior performance of the proposed algorithm, Figure 7 presents the visual results and the corresponding F-measure values of the proposed algorithm and the other five deep learning-based algorithms.
These results correspond to the forged images given in Figure 3. It can be observed from Figure 7 that: (a) the visual comparison is basically consistent with the F-measure comparison presented in Figure 5, Figure 6 and Table 2; (b) the proposed algorithm locates the forged regions more accurately, especially for the CASIA v1.0 and CASIA v2.0 datasets. For example, the proposed algorithm can detect the legs of animals accurately in the forged images.
Figure 7. Localization results for the forged images in Figure 3 (a)-(f). Columns, from left to right: spliced image, ground truth mask, and localization results of FCN+CRF [22], QFCN+CRF, MFCN [21], FCN+ResNet+CRF, LSTM+EnDec [24] and the proposed algorithm.

Conclusions
In this paper, QFCNs are proposed to extend real-valued FCNs to the quaternion domain. In addition, a color image splicing localization algorithm based on QFCNs and the SP-enhanced pairwise CRF is proposed. The proposed algorithm is superior to some existing algorithms for the following reasons: (a) compared with the conventional algorithms without deep learning, the proposed algorithm is deep learning-based. It integrates feature extraction and localization map generation into one network for end-to-end training, and it learns effective features automatically during training; (b) compared with other deep learning-based algorithms, the proposed algorithm uses the quaternion-based color image processing method to capture the integrity of the three channels and the inherent correlation between color channels; (c) the SP-enhanced pairwise CRF is used to refine the results obtained by the QFCNs. In future work, we will try to construct a network that models sensor noise well and then use it for image splicing localization.