Deep Learning for Distinguishing Computer Generated Images and Natural Images: A Survey

Abstract: With the development of computer graphics, realistic computer-generated (CG) images have become more and more common in our field of vision. Such rendered images are often indistinguishable from real photographs to the naked eye. How to effectively distinguish CG images from natural images (NI) has therefore become a new issue in the field of digital forensics. In recent years, a series of deep learning network frameworks have shown great advantages in the image field, offering a promising way to solve this problem. This paper aims to track the latest developments and applications of deep learning in CG and NI forensics in a timely manner. It first introduces the background of deep learning and the fundamentals of convolutional neural networks, so that the reader understands the basic model structures used in image applications, and then outlines the mainstream frameworks; it next reviews the application of deep learning to CG and NI forensics, and finally points out the open problems in this field and the prospects for future work.


Introduction
Natural images (NI) reflect real-world scenes, while computer graphics tools can now generate virtual but visually convincing images. The recognition of NI and computer-generated (CG) images has therefore received increasing attention. This problem is difficult, however, because the ultimate goal of computer graphics is to make CG images as photorealistic as NI. The emergence of ultra-realistic CG images has revolutionized the multimedia industry, making it easy to create realistic animations and images. Such technical support has brought new possibilities to games, movies and other industries: numerous realistic games and films have appeared, bringing a new visual experience to players and audiences, who can break free of reality and give full play to their imagination in virtual worlds. At the same time, if CG images are misused in areas such as law, government, and news [1], they pose a serious security threat to the public. For example, using fake CG images to tamper with evidence, mislead viewers, or even frame others will affect normal analysis and judgment in the judicial realm. Therefore, distinguishing CG from NI has become an important topic in the field of image forensics [2] and has attracted widespread attention over the past decade.
Feature representation is the key to image processing. Traditional features are designed by hand, which is laborious and demands considerable expertise from the designer; automatic feature design has therefore become an urgent requirement for efficient image processing. Deep learning is an emerging field of machine learning research that studies how to extract multi-level feature representations from data automatically. Its core idea is to extract multi-level, multi-angle features from raw data through a series of nonlinear transformations in a data-driven manner, so that the acquired features have greater generalization and expressive ability, which is exactly what efficient image processing needs. To meet the varied demands of image processing problems, deep learning methods represented by the convolutional neural network have made breakthroughs. Based on the basic principles of deep learning, this paper summarizes the evolution and innovation of its algorithms, models and methods in the field of CG and NI image forensics, with the aim of tracking the latest developments and applications of deep learning in CG and NI forensics.

Related Background
Back in the 1950s, neural networks were studied as a toy project to explore their core ideas, but for a long time there was no way to train large networks. This changed in the mid-1980s, when many researchers independently discovered the backpropagation algorithm, a method of training chains of parametric operations using gradient-descent optimization, and began to apply it to neural networks. In 1989, Bell Labs achieved the first practical application of neural networks: Yann LeCun combined the idea of convolutional neural networks with backpropagation and applied it to handwritten digit classification, which led to LeNet [3,4]; the network was adopted by the US Postal Service in the 1990s to automatically read postal codes on envelopes. Then, due to the limitations of hardware and data, the development of neural networks again hit a bottleneck. What followed was the rapid rise of a new family of machine learning methods, kernel methods, the best known of which is the support vector machine (SVM).
From 1990 to 2010, the speed of off-the-shelf CPUs increased by about 5,000 times; in the first decade of the 2000s, companies such as NVIDIA and AMD invested billions of dollars in developing fast, massively parallel chips (GPUs). On the data side, beyond the exponential growth of storage hardware over the past 20 years, the biggest change came from the rise of the Internet, which made it possible to collect and distribute very large datasets for machine learning. The ImageNet dataset [5] spurred the rise of deep learning; it contains 1.4 million images that have been manually assigned to 1,000 image categories (one category per image). What makes ImageNet special is not only its sheer volume but also the annual competition associated with it: researchers compete on a common benchmark, which has greatly promoted the recent rise of deep learning. Beyond hardware and data, until the end of the first decade of the 21st century we lacked a reliable way to train very deep neural networks. Networks therefore remained shallow, using only one or two representation layers, and could not surpass more refined shallow methods such as SVMs and random forests. The key issue was gradient propagation through many stacked layers: as the number of layers increases, the feedback signal used to train the network gradually vanishes. This changed around 2009-2010, when several simple but important algorithmic improvements appeared: 1) better activation functions; 2) better weight-initialization schemes; 3) better optimization schemes such as RMSProp and Adam. These improvements made it possible to train models with more than 10 layers, and deep learning began to receive widespread attention.
In the past few years, deep learning has demonstrated its powerful performance in various speech, image and video classification and recognition tasks. In the field of multimedia forensics, the performance of median-filter detection [6], pattern recognition [7][8][9], forgery detection [10] and steganalysis [11][12][13][14] has also been greatly improved by the use of CNNs.

The Basic Structure of the Neural Network
Neurons. Deep learning mainly relies on neural network technology, whose basic unit is the neuron, as shown in Fig. 1.

Figure 1: Neuron

In Fig. 1, x1 and x2 represent the input vector and w1 and w2 the weights; each input is assigned its own weight. b is the bias, g(z) is the activation function, and a is the output.
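The neuron of Fig. 1 can be sketched in a few lines of NumPy; the names (x, w, b, g, a) follow the notation above, and the sigmoid is used here only as one example of g:

```python
import numpy as np

def sigmoid(z):
    """One common choice for the activation function g(z)."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b, g=sigmoid):
    """Compute a = g(w . x + b): weighted sum of inputs plus bias, then activation."""
    z = np.dot(w, x) + b
    return g(z)

# Example: two inputs x1, x2 with weights w1, w2, as in the figure.
a = neuron(x=np.array([1.0, 2.0]), w=np.array([0.5, -0.25]), b=0.1)
```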
Convolution. The image and the filter matrix are combined by an inner product: corresponding elements are multiplied and the products are summed. This multiply-and-sum operation is the so-called "convolution", which gives the convolutional neural network its name.

Figure 2: Convolution
The left part of Fig. 2 is the original input data, the middle part is the filter, and the right part is the new two-dimensional output. The filter and each data window take an inner product to produce the output value at that position.

Activation function. Commonly used nonlinear activation functions are sigmoid [15], tanh, ReLU, etc. The first two, sigmoid and tanh, are more common in fully connected layers, while ReLU is common in convolutional layers. During gradient descent, the sigmoid saturates easily, which halts gradient propagation, and its output is not zero-centered. The ReLU function does not have these defects.

Pooling. In the left part of Fig. 4, the largest value in the upper-left 2 × 2 block is 6, the largest in the upper-right 2 × 2 block is 8, the largest in the lower-left 2 × 2 block is 3, and the largest value in the lower-right 2 × 2 block is likewise kept, yielding the result shown in the right part of the figure.
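The three operations described above (convolution as a sliding inner product, ReLU, and 2 × 2 max pooling) can be sketched in NumPy. The example input below is illustrative, constructed so that the block maxima echo the values 6, 8 and 3 mentioned for Fig. 4:

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' convolution: slide the filter over the image and take the
    inner product (multiply corresponding elements, then sum) at each position."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def relu(x):
    """ReLU activation: zero for negative inputs, identity for positive ones."""
    return np.maximum(x, 0.0)

def max_pool2x2(x):
    """Non-overlapping 2x2 max pooling with stride 2: keep each block's maximum."""
    h, w = x.shape
    return x[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# Illustrative 4x4 input: the four 2x2 block maxima are 6, 8, 3 and 4.
x = np.array([[1., 6., 2., 8.],
              [0., 5., 7., 4.],
              [3., 1., 0., 2.],
              [2., 0., 1., 4.]])
pooled = max_pool2x2(x)   # [[6, 8], [3, 4]]
```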

AlexNet
The literature [16] is considered to be the origin of modern deep learning. The article, "ImageNet Classification with Deep Convolutional Neural Networks", is widely regarded as having far-reaching influence in the field. Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton created a "large-scale, deep convolutional neural network" and used it to win the 2012 ILSVRC (ImageNet Large-Scale Visual Recognition Challenge). As the Olympics of machine vision, ILSVRC attracts research teams from all over the world every year, who bring their own machine vision models and algorithms to compete on image classification, localization and detection. In 2012, when CNNs first entered the stage, this network achieved a top-5 test error rate of 15.4%; the runner-up scored 26.2%. The gap showed that CNNs had a shocking advantage over other methods and caused a huge stir in the machine vision field. It can be said that since then CNNs have become a household name in the industry.
The literature [16] mainly discusses the implementation of a network architecture now known as AlexNet. Compared with current architectures, the layout discussed in [16] is relatively simple, mainly comprising five convolutional layers, max-pooling layers, dropout layers, and three fully connected layers. The network outputs a prediction over 1,000 possible image categories.

Figure 5: AlexNet
The main points of the literature [16]: 1) the ImageNet database was used for network training; it contains 15 million labeled images in 22,000 categories; 2) the nonlinear ReLU activation was used (with ReLU, training runs several times faster than with the traditional hyperbolic tangent); 3) data augmentation was employed, including image translations, horizontal reflections, and patch extraction; 4) a dropout layer [17] was introduced to address over-fitting on the training set; 5) training used mini-batch stochastic gradient descent with specified values for momentum and weight decay.
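Point 4 above, the dropout mechanism [17], is simple enough to sketch directly. The snippet below implements "inverted" dropout, the variant used in most modern implementations (the original paper instead rescaled at test time); the keep probability 0.5 matches the rate AlexNet used in its fully connected layers:

```python
import numpy as np

def dropout(x, keep_prob=0.5, training=True, rng=None):
    """Inverted dropout: at training time each activation survives with
    probability keep_prob and is scaled by 1/keep_prob, so the expected
    activation is unchanged and test time needs no rescaling."""
    if not training:
        return x                      # identity at test time
    if rng is None:
        rng = np.random.default_rng(0)
    mask = rng.random(x.shape) < keep_prob
    return x * mask / keep_prob

activations = np.ones(1000)
dropped = dropout(activations, keep_prob=0.5)   # entries are either 0.0 or 2.0
```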

VGGNet
One of the 2014 ILSVRC models achieved a 7.3% error rate by relying on a simple increase in depth (though it was not that year's champion): VGGNet [18]. Oxford's Karen Simonyan and Andrew Zisserman created a 19-layer CNN that uses only 3 × 3 filters, with stride 1 and padding, together with 2 × 2 max-pooling layers with stride 2.
The main points of the literature [18]: 1) only 3 × 3 filters are used, quite different from AlexNet's 11 × 11 first-layer filter and ZF Net's 7 × 7 filter. The authors' reasoning is that two stacked 3 × 3 convolutional layers produce an effective 5 × 5 receptive field. A small filter can therefore match the coverage of a large one while keeping the advantages of small size. One advantage is a reduction in parameters; another is that an extra ReLU layer can be inserted between the two convolutions (more ReLUs mean more nonlinearity in the system).
2) Three stacked 3 × 3 convolutional layers yield an effective 7 × 7 receptive field. 3) The spatial size of the input shrinks as the number of layers increases (through the convolution or pooling of each layer), while its depth grows as the number of filters increases.
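The arithmetic behind these receptive-field and parameter claims is easy to verify. The short calculation below (bias terms ignored for simplicity; C input and output channels assumed equal) shows that two 3 × 3 layers cover a 5 × 5 region with 18C² weights instead of the 25C² a single 5 × 5 filter would need:

```python
def conv_params(kernel, channels):
    """Weights in one conv layer with `channels` input and output channels."""
    return kernel * kernel * channels * channels

def stacked_receptive_field(kernel, layers):
    """Effective receptive field of `layers` stacked stride-1 convolutions:
    each extra layer widens the field by (kernel - 1)."""
    return layers * (kernel - 1) + 1

C = 64                             # illustrative channel count
two_3x3 = 2 * conv_params(3, C)    # 18 * C^2 weights
one_5x5 = conv_params(5, C)        # 25 * C^2 weights
```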

GoogLeNet
With its Inception module [19][20][21][22], Google abandoned the principle of keeping the network hierarchy simple. GoogLeNet [23] is a 22-layer CNN that won the 2014 ILSVRC championship with a 6.7% error rate. It was the first CNN architecture to depart from the traditional approach of simply stacking convolutional and pooling layers into a sequential structure. The authors also emphasize that the new model pays special attention to memory and computation budgets (previously neglected concerns: stacking many layers and using large numbers of filters consumes substantial computational and storage resources and raises the chance of over-fitting). The Inception structure is shown in Fig. 6.

Figure 6: Inception
The main points of the literature [23]: 1) a total of 9 Inception modules are used in the model, with a total depth of about 100 layers. 2) Instead of a fully connected layer, global average pooling [24] is used, reducing a 7 × 7 × 1024 feature map to 1 × 1 × 1024. This greatly reduces the number of parameters.
3) The model has 12 times fewer parameters than AlexNet. 4) At test time, multiple crops of the same input image are fed to the network, and their softmax outputs are averaged to obtain the final result. 5) The model draws on the concept of the region-based convolutional network (R-CNN) [25]. 6) The Inception module has been continually updated since.
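Point 2 above, the global-average-pooling step that replaces the fully connected layer, can be sketched in one line of NumPy: each of the 1024 channels of a 7 × 7 feature map is reduced to its mean, and no learned weights are involved at all.

```python
import numpy as np

def global_avg_pool(feature_map):
    """feature_map has shape (H, W, C); average each channel over its
    spatial positions, returning shape (1, 1, C). No parameters needed."""
    return feature_map.mean(axis=(0, 1), keepdims=True)

# A 7x7x1024 map, as at the top of GoogLeNet, becomes 1x1x1024.
features = np.random.default_rng(0).random((7, 7, 1024))
pooled = global_avg_pool(features)
```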

Residual Network
ResNet [26] is a 152-layer network architecture that combines classification, detection and localization capabilities. ResNet broke the ILSVRC 2015 record, reaching an incredible 3.6% error rate (humans typically achieve only a 5 to 10% error rate). The main points of the literature [26]: 1) the residual block proposed in the article is designed as follows: the result of passing the input x through a convolution - ReLU - convolution series is denoted F(x), and it is added to the original input x to obtain H(x) = F(x) + x. In a traditional CNN, only H(x) = F(x) is computed; in ResNet, the convolution result F(x) is added back to the input x. The submodule in Fig. 7 thus computes a small change "delta" for the input x, so that the output H(x) is the superposition of x and this delta (in a traditional CNN, the output F(x) is an entirely new representation that retains no direct information about the input x). The authors argue that this residual mapping is easier to optimize than the original, unreferenced mapping. 2) The authors report that if layers are simply added to plain networks, both training computation and error rate increase. 3) They also tried a 1202-layer architecture, whose accuracy dropped; the presumed reason is over-fitting.
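The identity H(x) = F(x) + x can be made concrete with a minimal numeric sketch. Here the convolution - ReLU - convolution branch F(x) is stood in for by two small linear maps W1 and W2, an assumed simplification that keeps the skip connection easy to verify:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, W1, W2):
    """H(x) = F(x) + x, where the residual branch F is a
    (linear map - ReLU - linear map) stack standing in for
    (convolution - ReLU - convolution)."""
    Fx = W2 @ relu(W1 @ x)   # the residual branch F(x)
    return Fx + x            # skip connection adds the input back

# If both weight matrices are zero, F(x) = 0 and the block is the identity;
# this ease of representing "do nothing" is what makes deep residual nets
# easier to optimize than plain stacks.
x = np.array([1.0, -2.0, 3.0])
Z = np.zeros((3, 3))
```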

Relevant Status
The starting point of CG and NI forensics is to extract relationships between local pixels. This differs from image recognition, which distinguishes image content and extracts high-level semantic features. CG and NI forensics focuses mainly on low-level image properties: rather than letting the neural network extract high-level semantics, it feeds simple statistical features into the classifier. Generic deep learning models are therefore not well suited to CG and NI forensics out of the box. The task is usually divided into three steps: feature extraction, feature transformation, and classification. Existing deep learning-based CG and NI forensics methods likewise center on feature extraction, feature transformation and transfer learning. The literature on deep learning-based CG and NI forensics is not rich. We believe the main reason is the lack of datasets and the difficulty of building CG image datasets. Before Nicolas Rahmouni et al. [27] presented their own dataset in 2017, most methods were based on the original Columbia dataset [28]. The Columbia dataset was proposed in 2004, and its CG images are unconvincing next to today's CG, lacking persuasiveness for the NI forensics task. Since the new dataset appeared, the most advanced methods have experimented on Nicolas Rahmouni's dataset, but it too has a significant problem of sample size: the CG and NI classes each contain only 1,800 images, and the fidelity of the CG images needs improvement. Building a CG image dataset oneself is very difficult, because channels for acquiring CG images are few; video game screenshots are the most convenient source, but few games meet the requirements of realism, and capturing screenshots at scale is itself a great deal of work.
These problems have limited the research and development of CG and NI forensics.

Application of Deep Learning in Forensics of CG and NI
Method based on image preprocessing. Yao et al. [29] proposed a method for distinguishing CG and NI based on sensor pattern noise (SPN) and deep learning. Images (CG and NI) are cropped into patches before being input to a convolutional neural network (CNN) based model. In addition, three high-pass filters (HPFs) are applied to remove the low-frequency signals that represent image content; these filters expose the residual signals, including the SPN introduced by the digital camera device. Unlike traditional methods for distinguishing CG from NI, this method uses a five-layer CNN to classify the input patches, and a majority-voting scheme over the patch classifications yields the result for the full-size image. The method achieves nearly 100% verification accuracy on the dataset contributed by Nicolas Rahmouni et al.
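The preprocessing pipeline just described can be sketched as below. The exact filter taps used in [29] are not reproduced here; the second-order high-pass kernel is an assumed stand-in, chosen because its coefficients sum to zero and thus suppress flat, low-frequency content:

```python
import numpy as np

# Assumed illustrative high-pass kernel (taps sum to zero, so constant
# regions map to zero and only residual detail survives).
HPF = np.array([[-1.,  2., -1.],
                [ 2., -4.,  2.],
                [-1.,  2., -1.]])

def high_pass(image):
    """'Valid' filtering of the image with the high-pass kernel, exposing
    residual signals while removing low-frequency image content."""
    h, w = image.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = np.sum(image[i:i+3, j:j+3] * HPF)
    return out

def majority_vote(patch_labels):
    """Full-image decision: the label predicted for the most patches wins."""
    labels, counts = np.unique(patch_labels, return_counts=True)
    return labels[np.argmax(counts)]

flat = np.full((5, 5), 3.0)          # a featureless region
residual = high_pass(flat)           # all zeros: content removed
decision = majority_vote(np.array([0, 1, 1, 1, 0]))
```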
Method based on feature extraction. Quan et al. [30] made some changes to the deep learning model, as shown in Fig. 8: a convFilter layer composed of 32 3D convolution kernels is added in front of the traditional network structure. This network outperforms traditional methods, not only on the Google and PRCG datasets but also on simpler datasets; the authors also considered CNN fine-tuning, structure, activation-function design, flexibility, visualization, and interpretability. After voting over 240 × 240 patches, it achieved 93.2% accuracy on full-size images. Chawla et al. [31] proposed an improved convolutional layer called the New Conv Layer, shown in Fig. 9. The idea behind it is that the structural relationships among some regions in an image have nothing to do with the image content but exist in the pixel relationships, and the correlations between pixels in a photographic image differ from those in a computer-generated image; the classifier should therefore learn the relationships between pixels. The difference between this layer and an ordinary convolutional layer is that a constraint relationship must be satisfied during training.

Method based on feature conversion. Nicolas Rahmouni et al. [27] pointed out that the commonly used Columbia dataset [28] was created in 2004, when CG images were incomparable with current ones, so they collected and contributed a dataset of their own. Its CG images are drawn from the Level-Design reference database [32], which consists of more than 60,000 high-resolution video game screenshots, from which 1,800 CG images meeting their criteria were selected. Another 1,800 NI images come from the RAISE dataset [33].
Their method adds a feature-conversion layer after two convolutional layers of feature extraction, in order to compute simple statistical properties including the mean, variance, maximum and minimum, which are then passed to the classifier. This method achieved 93.2% accuracy on their own dataset. HuyH.Nguyen et al. [34] then combined the idea of transfer learning with Nicolas Rahmouni's approach, using a pre-trained VGG-19 network for feature extraction and applying feature conversion to the output of each convolutional layer. The transformed features of each layer are finally passed to the classifier for training, achieving nearly 100% accuracy on the new dataset contributed by Nicolas Rahmouni et al.
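The feature-conversion step itself reduces to a few statistics per feature map. The sketch below (network details omitted; channel and map sizes are illustrative) shows how each map is summarized by its mean, variance, maximum and minimum, giving a fixed-length vector for the classifier regardless of spatial size:

```python
import numpy as np

def stat_features(feature_maps):
    """feature_maps: shape (C, H, W). Summarize each of the C maps by four
    statistics (mean, variance, max, min), yielding a vector of length 4*C
    that does not depend on the spatial size H x W."""
    flat = feature_maps.reshape(feature_maps.shape[0], -1)
    stats = [flat.mean(axis=1), flat.var(axis=1),
             flat.max(axis=1), flat.min(axis=1)]
    return np.concatenate(stats)

maps = np.random.default_rng(1).random((32, 60, 60))  # illustrative sizes
features = stat_features(maps)                        # length 4 * 32 = 128
```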
Method based on transfer learning. Edmar R. S. de Rezende, Guilherme C. S. Ruppert et al. [35] proposed a new CG image detection method based on the ResNet-50 deep convolutional neural network model and the concept of transfer learning. After simple pre-processing, each image in the dataset is input into the deep CNN model, whose convolutional layers produce a 2048-dimensional feature vector called the bottleneck feature. These vectors are used to train a machine learning classifier to detect whether an image was generated by computer graphics. Applying the t-SNE dimensionality-reduction technique to visualize the high-dimensional features shows that the bottleneck features produced by ResNet-50 are more separable than the raw image inputs, which makes the classification task easier. The method achieved 94.1% accuracy on a public dataset containing 9,700 images of different scenes.
Tiago Carvalho, Edmar R. S. de Rezende et al. proposed a CG image detection method based on preserving the important information in the human eye region. The method decides whether an image is CG by determining whether its eye regions were generated by CG techniques. When the highlight is removed from a previously detected and cropped eye region, the resulting image shows more artifacts for a CG image than for a photographic (PG) image. Since the human eye is one of the hardest parts to generate in CG images, the authors believe these artifacts stem from defects in the computer graphics techniques used to render the eye. Once each eye in the image is located and its highlights removed, they apply transfer learning: the fully connected layers are removed from a VGG19 architecture pre-trained on ImageNet, and the remainder is used as a feature extractor to create a set of bottleneck features. Finally, the features extracted from each eye are fed to a classifier to decide whether that eye was generated by CG, and hence whether the image is CG.

Conclusion
Deep learning in the image field is mainly based on convolutional neural networks, but training CNNs requires large-scale databases and powerful computing resources, and obtaining labeled samples demands considerable manpower and material resources; in some directions, such as forensics, the resources to label samples are lacking and specialized expertise is needed. The datasets needed for CG and NI forensics are difficult to construct: at present, the related datasets are updated slowly, their sample sizes are small, and their diversity is poor. As a result, trained models generalize poorly and are prone to over-fitting. In future work, building a complete dataset is an urgent problem to be solved, and further improving the models and enhancing their generalization are the main objectives.
Funding Statement: This work is supported by National Natural Science Foundation of China (62072250).

Conflicts of Interest:
We declare that we do not have any commercial or associative interest that represents a conflict of interest in connection with the work submitted.