An Exploratory Analysis on Visual Counterfeits Using Conv-LSTM Hybrid Architecture

In recent years, advances in deep learning have made it easy to synthetically generate face swaps with GANs and other tools. These swaps are highly realistic, leaving few traces, and those traces are unclassifiable by human eyes. Such fakes are known as ‘DeepFakes’, and most of them are anchored in video formats. Realistic fake videos and images are used to create a ruckus and degrade the quality of public discourse on sensitive issues: defaming a person’s profile, political distress, blackmailing and many other forms of fake cyber terrorism are envisioned. This work proposes a microscopic-level comparison of video frames. The temporal-detection pipeline compares very minute visual traces on the faces of real and fake frames using a Convolutional Neural Network (CNN) and stores the abnormal features for training. A total of 512 facial landmarks were extracted and compared. Parameters such as eye blinking, lip-synch, and eyebrow movement and position are the main deciding factors for classifying visual data as real or counterfeit. A Recurrent Neural Network (RNN) pipeline learns from these extracted features and then evaluates the visual data. The model was trained on a collection of real and fake videos gathered from multiple websites. The proposed algorithm and network set a new benchmark for detecting visual counterfeits and show how this system can achieve competitive results on fake generated videos or images.


I. INTRODUCTION
Image and video manipulation [1], [2] has been carried out since photography and videography were born. Editing platforms and powerful tools have long played a crucial role in image/video manipulation and animation [3], [4]. However, such visual artifacts are easily detectable and distinguishable by human vision, and hence had limited scope. What started with animations and commercial elements for promotion and entertainment has, with rapid advancements in technology and image processing techniques, grown into large-scale production of face warping [5], [6], transferring a target face onto a source for a realistic appearance in recent months [7], [8]. Now, fake multimedia has become a central problem, especially after the advent of the so-called DeepFakes, which began circulating on social media platforms defaming famous personalities [7], [9]. (The associate editor coordinating the review of this manuscript and approving it for publication was Yongming Li.)
DeepFakes have become a new form of altering reality that is spreading faster than expected [10]. DeepFakes are a machine learning technology that fabricates or manipulates video and audio recordings to show people doing and saying things that they never did or said. DeepFakes appear authentic and realistic, but they are not [5], [8], [10]-[12]. A DeepFake could superimpose a face onto a body, so it looks like someone is doing something they have never done [10]. There are two methods of altering a person's identity with DeepFakes: one is counterfeiting images and videos, and the other is counterfeiting audio. Visual DeepFakes are much more complicated to identify than audio DeepFakes. Most existing DeepFakes are vision-based and are easily manipulated for a realistic look [5], [7]. Initially, DeepFakes were limited to making actors and politicians appear to say funny things. As accuracy improved, they spread to circulating fake news and creating a ruckus in society by disturbing peace and harmony. Recent troubling cases in the media include swaying opinions during an election or implicating a person in a crime [13]. These manipulations of the facial features of a historical figure, politician, sportsman, or CEO, synthesizing a re-enacted face swapped onto the movements and actions of another person, are so realistic that human vision cannot distinguish a real video from a fake one [8]. Today, with recent advancements in deep learning, it has become very easy to build such a model using the existing photos and videos of a person [12], performing face detection on each of these images using various real-time face detection algorithms. (VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/)
The backbone of a DeepFake is a deep neural network trained on faces that transfers the facial expressions of the source to the target after proper postprocessing, resulting in a high level of realism. One existing method is to use an autoencoder [2], [8], a convolutional network that tries to reconstruct the input image, letting it learn a lower-dimensional representation of the input which is later used to swap the faces. Specifically, DeepFakes use one encoder and two decoders. During training, two networks are trained; both share a common encoder but have different decoders. The encoder transforms the input image into a base vector, a set of numbers that identifies the face, and the decoder transforms that vector back into an image. An error function measures how well that transformation works, and the model learns to lower that training error. After the network training is done, the network is fed a video, which is a collection of image frames, one frame at a time. Finally, all processed frames are concatenated together and the resulting video is displayed.
If this kind of passive information is used, just the photos and videos that are already out there, that is the key to scaling to anyone, as shown in Figure 1a. A sample dataset of DeepFakes created from YouTube footage of actors was released by Google AI, as shown in Figure 1b. These examples describe how realistic the images look, with a perfect warping of target faces onto source faces using GANs.

A. DeepFakes
DeepFakes came out at the end of 2017 and the start of 2018. This is a deep learning approach to swapping faces in videos. Tracking the progress of GANs [12], in four years the world has upgraded from blurry to photorealistic faces. To create a DeepFake you need only a few thousand frames of training data for both the source face and the target face [8], [14]. After training, you have a trained face-swap network. You start with the trained network and the video you wish to modify, and the process may take minutes to hours depending on the length of the video [5], [10], [13], [15], [16].
One popular DeepFakes approach uses a machine learning technique known as an autoencoder, which has three parts [2]: the encoder, the encoding (or bottleneck), and the decoder. It is the job of the encoder to take a high-dimensional input image and compress it down into a semantically meaningful representation in the encoding, and the job of the decoder to reconstruct that encoding into an approximation of the input image. This gives an unsupervised way to learn a semantically meaningful representation of the data, and comparing the differences between the input and the output provides an error signal that is used to update the weights of the network during training. Here, the encoder is common to both faces, so it shares weights and learns a shared representation, while the two decoders are trained individually on the two people. At inference time, person A is encoded, then the decoder for person B is used to reconstruct the input image, which has the effect of swapping the faces across the two videos [2], [8], [17], [18].
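The shared-encoder, two-decoder scheme described above can be sketched with a toy linear autoencoder. This is an illustrative NumPy sketch only; the dimensions, learning rate and single-sample training loop are assumptions, not the implementation used by actual DeepFake tools.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear autoencoder: one shared encoder, two person-specific decoders.
# Sizes are illustrative only.
D, H = 64, 8                                  # flattened face patch, bottleneck

E = rng.normal(scale=0.1, size=(H, D))        # shared encoder weights
dec_A = rng.normal(scale=0.1, size=(D, H))    # decoder for person A
dec_B = rng.normal(scale=0.1, size=(D, H))    # decoder for person B

def encode(x):
    return E @ x                              # compress face to a latent code

def decode(z, W):
    return W @ z                              # reconstruct a face from the code

def train_step(x, W, lr=0.05):
    # One gradient step on the reconstruction error ||decode(encode(x)) - x||^2.
    # Only the decoder is updated here for brevity; real training updates E too.
    z = encode(x)
    err = decode(z, W) - x
    W -= lr * np.outer(err, z)
    return float(np.mean(err ** 2))

face_A = rng.normal(size=D)
losses = [train_step(face_A, dec_A) for _ in range(200)]

# Face swap: encode person A, reconstruct with person B's decoder.
swapped = decode(encode(face_A), dec_B)
```

Swapping is then just routing person A's latent code through person B's decoder; a real DeepFake pipeline uses deep convolutional encoders and decoders trained on thousands of aligned face crops.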

B. DIGITAL MEDIA FORENSICS
Digital media forensics is the process of identifying, preserving, analyzing and presenting digital evidence [9], [10]. Since the 1970s the field has evolved to keep up with the widespread adoption of technology; the use of computers for financial crimes in the 1980s helped shape digital forensics methods into what they are today. Large datasets such as ImageNet [19], which contains millions of manually labeled examples across thousands of classes, were created to further research in the computer vision community on tasks such as object detection and image classification. Then, in 2012, AlexNet [20] achieved a breakthrough in image classification performance. It used a deep learning technique for the large-scale visual recognition challenge based on ImageNet and provided such a significant gain in performance that it essentially reshaped the computer vision community. Two years after AlexNet [20] was proposed, Ian Goodfellow [12] published a paper on generative adversarial networks. GANs are broadly applicable because they give new ways to train deep neural networks and to generate new synthetic data for training; what is more important is their ability to generate synthetic faces [5], [9], [21]. In 2016 an approach called Face2Face was published, which uses real-time tracking to monitor both a source and a target face and to transfer expressions across them [18]. The techniques used there are simple image processing techniques, the accuracy levels are very low, and classifying content as 'real' or 'fake' this way is nearly impossible. Recent work includes tracing the IP address and history of the source from which a video is playing [22]; however, tracking and tracing each suspected IP address is challenging and time-consuming.

C. VISUAL SYNTHESIS
The availability of a huge number of DeepFakes datasets means they can now be used to train a simple classification model, or even object detection models such as RPN or segmentation models. Transfer learning models such as InceptionV3 and ResNet can be used for advanced training [10], [23]. A CNN-based system with a global pooling layer computes statistics such as the mean, variance, maximum and minimum of the pooled features, using pre-trained weights from the ImageNet [19] database to skip common features and reduce computational load. This method can be applied with different transfer learning architectures such as ResNet, InceptionV3, XceptionNet, ShuffleNet and BubbleNet.
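The statistical global pooling described above can be sketched as follows. This is a NumPy sketch; the 7x7x512 feature-map shape is an assumption standing in for the output of an ImageNet-pretrained backbone such as ResNet.

```python
import numpy as np

def statistical_pool(feature_map):
    """Global pooling that summarizes each channel of a CNN feature map
    with its mean, variance, maximum and minimum, as described above.
    feature_map: array of shape (H, W, C) from a pretrained backbone."""
    flat = feature_map.reshape(-1, feature_map.shape[-1])   # (H*W, C)
    stats = [flat.mean(0), flat.var(0), flat.max(0), flat.min(0)]
    return np.concatenate(stats)                            # (4*C,)

# Stand-in for a backbone's final feature map (shape assumed for illustration):
fmap = np.random.default_rng(1).normal(size=(7, 7, 512))
vec = statistical_pool(fmap)
print(vec.shape)   # (2048,)
```

The resulting fixed-length vector can then be fed to a small classifier head, independently of the input resolution.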
Most DeepFake datasets are videos, and existing state-of-the-art methods use LSTMs to classify videos into 'real' or 'fake' [6], [10], [18], [23], [24]. Video classification methods have evolved from hand-crafted features to 2D and, recently, 3D ConvNets [25]-[27]. Recurrent neural networks have also been utilized to model video sequences, and Siamese networks have been used in a supervised way in a few cases for few-shot and one-shot learning, where the training video dataset is taken as one episode and trained. Vinyals et al. [28] proposed metric learning for few-shot classification by measuring the distance between sequential vectors from an attention kernel. Zhu and Yang [29] proposed compound memory networks for few-shot learning. This network learns a fixed-size matrix for a variable-length video sequence, in an architecture that is a variant of a key-value memory network: visual features are stored in the key part and their corresponding labels in the value part. This model, however, is not enough for training at the micro-level, since it uses distances to separate the class branches, and such differentiation clearly cannot be achieved with a few shots of learning; it requires in-depth, frame-wise micro-level learning. The differences are very minute, thanks to GANs, and a micro difference is what decides whether a video is 'real' or 'fake'.
Some methods train neural networks for video analysis with 3D ConvNets [26] to extract temporal information, but these are very time-consuming. Xu et al. [30] aggregate frame-wise ConvNet features using VLAD pooling for video classification, which has shown clear advantages over conventional average pooling. Sutskever et al. [31] adopt a more sophisticated sequence-to-sequence approach using up-convolution and down-convolution, in which a multi-layered RNN encodes the input frame into a hidden state and another RNN layer takes this encoded state as input and decodes it into a sequence of outputs. More recently, encoder-decoder LSTM autoencoders have been used to cope with very large-scale datasets, owing to their training efficiency and decent performance compared with the rest [32], [33]. These methods are also used in image captioning applications, where an LSTM is used to understand the scenario behind the frames: the LSTM acts as an RNN language model that generates a sequence of words frame-wise, based on visual features extracted frame-wise, in an encoder-decoder network. This is also called vision-language modeling, a combination of visual features and language encoded frame-wise using a natural language unit (NLU). However, to the best of our knowledge, the above-mentioned models are heavyweight, memory-consuming models that require high-end GPU systems to train on large-scale datasets. One-shot and few-shot learning are generally used for small-scale datasets, whereas deepfakes require much more in-depth training to capture features at the micro-level. Therefore, CNN and LSTM with a large amount of training are preferred over a quick-training methodology. An RNN alone would extract features from the video frames as a whole, where the features get diluted because the whole image frame is given for extraction.
This overloads memory space and computational resources. The overall accuracy is only somewhat satisfactory, i.e., around 85% mAP. Datasets may be available in large numbers, but visual forgery occurs at a microscopic scale, and specific parameters work precisely where global feature scaling does not.

D. ARTIFACTS FROM GANs
A reverse-engineering technique is useful for reading the patterns in GAN-generated DeepFakes [5]. While a GAN network [12] is being trained to generate DeepFakes [21], the features employed to generate realistic fake images/videos need to be stored under the 'Fake' class. This feature vector can differentiate between 'real' and 'fake' videos. The CNN feature extractor [34] used here stores the features in the form of an array containing all the essential image-warping features used to generate a DeepFake. The features of the original real videos follow the same procedure, and both feature sets are trained under their respective labels using a CNN architecture, which can then distinguish 'real' from 'fake' video based on the extracted features. One of the best GAN networks [12] for extracting facial attributes during face warping is StarGAN [35], which is used by FaceApp for swapping users' faces onto famous personalities' faces [16], [36].
The computational power this takes is much larger than for all other state-of-the-art methods. Often, the features at the end, when the fake is merged into the real, are mostly similar. Hence, based on the density of the videos, a trade-off must be chosen for where to stop storing features from a particular hidden layer. The overall accuracy decreases and is not reliable, because the black box does not reveal what kind of features are extracted in each hidden layer. The patterns can be visualized during training, but the exact depth is unknown.
Zhang et al. [37] find that classifiers generalize poorly between GAN models. They propose the AutoGAN model for generating images, consisting of a generator and discriminator architecture. Other work includes detecting GAN images using hand-crafted co-occurrence features, and anomaly detection models on faces that isolate faces using face detector models. Much recent work covers both the generation of images and videos with GANs and the detection of counterfeits using GANs themselves. Bau et al. [38] concluded that GANs have a limited role and capacity in generation, and found that a pre-trained GAN model is unable to grasp the image structures in a given dataset. Given these limitations in generation, it is clear that a GAN model alone is not reliable for detecting deepfakes. Recent works include state-of-the-art architectures, namely ProGAN [14], StyleGAN [39] and BigGAN [40], trained on the ImageNet [19] dataset. As mentioned earlier, training for deepfake classification must operate on micro-level features; to perform micro-level feature extraction, the network needs to extract high-frequency details frame-wise [41]. StarGAN injects large per-pixel noise in each frame to introduce high frequencies, while BigGAN uses self-attention layers [42], [43] on very large-scale datasets. Recent works also include image-to-image translation approaches such as GauGAN [44], StarGAN [39] and CycleGAN [36]. Wang et al. [45] apply frequency analysis to each dataset across different GAN approaches. They performed CNN image synthesis with a simple form of high-pass filtering, subtracting a median-blurred version frame-wise, and a Fourier transform to extract informative visualizations; the real-image spectra and their distinct patterns become visible for CNN-generated models. Agarwal et al. [46] proposed a novel approach that matches lip-synch movement and mouth shape.
Their main focus is on visemes, which are occasionally inconsistent with the spoken phoneme, primarily the sounds of M (mama), B (baba) and P (papa), for which the mouth must completely close.
This research takes a similar approach, but grasps almost all facial movements. Generally, deepfakes are face swaps, and the mouth movements are micro variations; mouth shape alone cannot accurately detect deepfakes, and outside of these visemes such a model would not give an accurate result. Hence, mouth shape alone is not sufficient for classifying deepfakes.
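The high-pass residual analysis of Wang et al. [45] described above can be sketched as follows. This is a NumPy stand-in for the median-blur subtraction and Fourier step; the kernel size and frame size are illustrative assumptions.

```python
import numpy as np

def highpass_residual(frame, k=3):
    """Subtract a median-blurred copy of the frame to keep only
    high-frequency content, then return the residual and its
    log-magnitude spectrum (a stand-in for medianBlur + FFT)."""
    h, w = frame.shape
    pad = k // 2
    padded = np.pad(frame, pad, mode="edge")
    blurred = np.empty_like(frame)
    for i in range(h):                       # naive k x k median filter
        for j in range(w):
            blurred[i, j] = np.median(padded[i:i + k, j:j + k])
    residual = frame - blurred
    spectrum = np.log1p(np.abs(np.fft.fftshift(np.fft.fft2(residual))))
    return residual, spectrum

frame = np.random.default_rng(2).normal(size=(32, 32))
res, spec = highpass_residual(frame)
print(res.shape, spec.shape)
```

Periodic artifacts left by generator up-sampling would show up as distinctive peaks in such a spectrum, which is what makes this simple filter informative.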

E. CNN-BASED VISUAL MANIPULATIONS
Recent works have addressed CNN-based deepfake detection using basic ConvNet training on real and fake images. These models are not reliable, since deepfake variations occur at the micro-level and need to be compared frame-wise to detect the anomaly. Rossler et al. [47] evaluated methods for detecting facial manipulations; they showed that simple CNN classifiers can classify deepfakes by using the same model that generates them, but these classifiers fail to generalize across deepfake datasets. Marra et al. [48] showed with a similar approach that image classifiers can both create and detect deepfakes, but the same problem remains and cross-model transfer is not considered. Cozzolino et al. [49] concluded that forensic classifiers transfer very poorly between models and achieve near-chance performance in classifying deepfakes; they propose a new representation learning model based on autoencoders in zero- and low-shot training regimes. This work is similar to Wang et al. [45], though the former takes an orthogonal approach: both try to improve transfer performance, but the latter studies the performance of simple baseline classifiers under various training and testing conditions for CNN-based image generation. Researchers have shown that common CNN models contain artifacts that reduce their representational power, with work focused mainly on the up-sampling and down-sampling stages. Durall et al. [50] proposed spectral regularization, based on spectral effects in CNN-generated adversarial networks. The power spectrum of the input data is used to generate similarly realistic data with a slight tweak in the power spectrum, and even low-resolution data can be used in this approach for generation and detection. This approach is similar to the proposed method; the difference is that the former plots the spectrum while the latter plots motion frame-wise. Both works promise high accuracy levels. Mittal et al. [51] proposed a Siamese network on audio-visual and visual datasets, including the audio and video of moving lip-synch in videos; it feeds modality embeddings and emotion embeddings to a triplet loss.
This research is purely based on facial analysis at the micro-level. Each extracted parameter is examined in depth and trained with a high-frequency notion [41]. Alongside the facial landmark patterns, motion is trained on a decent bandwidth scale frame-wise, which makes this work different from other state-of-the-art methods.

F. ATTENTION NETWORKS
The best architecture used for image captioning, to the best of our knowledge, is the attention network. Xu et al. [52] adopt convolutional feature extraction in the encoder section: frame-wise features are extracted from every raw image, and for each frame the model generates a caption encoded as a sequence of encoded words. At the decoder end, an LSTM network produces the caption by generating each word from a context vector. The architecture proposed by the authors is inspired by attention networks; the main reason for choosing this family is the 'forget' mechanism, which saves memory and does not become a burden on training and resource usage. However, here the convolutional features are computed on a selected part, the face isolated by a face detector, rather than the whole frame. This extracts the micro-level features which cannot be detected by the landmarks alone; combining these conv-features with the landmark movements is what makes this work unique.

III. TECHNICAL APPROACH
This paper combines two novel architectural components, i.e., facial landmark movement and a convolutional long short-term memory (Conv-LSTM) network. The complete layout of the architecture is shown in Figure 2. The pipeline gives the essence of the stepwise framework and the steps used to train the model from scratch up to the prediction layer.
Using a convolutional drift neural network architecture, features are extracted from the frames without saving the frames, easing memory constraints, and requiring minimal training to achieve competitive performance on spatio-temporal tasks. An LSTM alone, as a feature extractor handling temporal data, tends to be expensive to train. Hence, combining a CNN for feature extraction with an LSTM for storing the feature vectors of frames [53] is efficient for training such a humongous amount of data without memory expense. LSTMs and GRUs are the most common recurrent neural networks used for temporal sequence problems. The two have different data flows but share a common component called the memory state. Mangal et al. [54] concluded that GRUs take less time than LSTMs and bidirectional RNNs when training on limited datasets; however, an LSTM outperforms GRUs in training and descends in loss within just a few epochs [54]. The main reason for using an LSTM here is memory constraints. An LSTM has three gates (input, output and forget gates), whereas a GRU has two (reset and update gates). Since accuracy is critical in this work, an LSTM is used. The forget gate in an LSTM determines which part of the previous cell state to retain, and the input gate determines the amount of new memory to be added. These two gates are independent of each other, meaning that the amount of new information added through the input gate is completely independent of the information retained through the forget gate [54], [55].
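The gate computation discussed above, with independent input and forget gates, can be sketched as a single LSTM step. This is a NumPy sketch with illustrative sizes, not the trained network used in this work.

```python
import numpy as np

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step, written out gate by gate. W, U, b hold the stacked
    parameters for the input (i), forget (f), output (o) and candidate (g)
    gates. Note that the forget and input gates are computed independently."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    n = h_prev.size
    z = W @ x + U @ h_prev + b              # (4n,) pre-activations
    i = sigmoid(z[0:n])                     # input gate: how much new memory to add
    f = sigmoid(z[n:2 * n])                 # forget gate: how much old state to keep
    o = sigmoid(z[2 * n:3 * n])             # output gate
    g = np.tanh(z[3 * n:4 * n])             # candidate memory
    c = f * c_prev + i * g                  # new cell state
    h = o * np.tanh(c)                      # new hidden state
    return h, c

rng = np.random.default_rng(3)
d, n = 16, 8                                # feature size, hidden size (illustrative)
W = rng.normal(scale=0.1, size=(4 * n, d))
U = rng.normal(scale=0.1, size=(4 * n, n))
b = np.zeros(4 * n)
h, c = np.zeros(n), np.zeros(n)
for _ in range(5):                          # run over 5 frame feature vectors
    h, c = lstm_step(rng.normal(size=d), h, c, W, U, b)
print(h.shape)   # (8,)
```

A GRU would merge the input and forget roles into a single update gate, which is precisely the coupling the text argues against for this task.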

A. FACIAL LANDMARKS
Given the input dataset, i.e., video frames, the network runs a face detector to locate the face region in frame k. The face parametric model, denoted S^(0), represents the mean face shape in the frame. The landmark vertices of S^(0) are projected onto the frame to determine the initial 2D landmark locations as well as their visibility. The alignment of 2D landmarks on the frame is done via 128 landmark vertices, whose initial 2D locations are denoted X^(0) ∈ R^136 [10]. In addition, a dense set of landmark vertices is used to evaluate dense face feature descriptors, capturing more global and deeper information about the face and producing more robust results on facial movements. A dense sample of the face mesh is used to obtain 512 landmark vertices, which include the previous 128 alignments. The dense landmark locations are denoted U^(0) ∈ R^320 and their visibility in the frame is indicated by a binary vector V^(0) ∈ R^160. During training, the landmark update at the k-th iteration evaluates the dense face feature descriptors of the dense landmarks using their current locations U^(k) and visibility V^(k). The dense face feature descriptors are concatenated into a feature vector F^(k) ∈ R^5120. For invisible landmarks, such as those at extreme side angles or in dark pixels, the corresponding components of F^(k) are set to zero. The target locations, defined by X̂^(k+1), which improve the accuracy of forgery detection, are customized since these are the key parts in detecting swaps from real faces. The camera parameters, i.e., the angle at which the target face is located, are given by w^(k+1). The proposed solution adapts the state-of-the-art [10], [23] approach to determine w^(k+1) and X̂^(k+1) by computing their displacements as linear functions of the dense face feature descriptor F^(k):

X̂^(k+1) − X^(k) = R_X^(k) F^(k) + b_X^(k)
w^(k+1) − w^(k) = R_w^(k) F^(k) + b_w^(k)

where the matrix R_X^(k) and vector b_X^(k) represent a generic descent direction that improves the accuracy of the landmark locations and overcomes dark and extreme side angles. The network learns from the training set, where the face is detected in each frame and the landmarks are aligned in 2D space from the components, the matrix R_w^(k) and vector b_w^(k). The 512 landmarks define the movement of each part of the face. This is compared against the fake video, where the actual movement in different scenarios of the real video is compared, using the landmarks, with the movement of each part detected in the shammed one.
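The cascaded landmark update can be sketched as follows. This is a NumPy sketch; the reduced sizes and random regressors are illustrative assumptions (the text uses 512 landmarks and a 5120-D descriptor), meant only to show the shape of the linear update.

```python
import numpy as np

rng = np.random.default_rng(4)

# Reduced illustrative sizes: 68 landmarks -> 136-D location vector,
# 512-D dense descriptor, 6 camera parameters.
n_X, n_F, n_w = 136, 512, 6

def refine(X, w, F, R_X, b_X, R_w, b_w):
    """One cascaded-regression update: the displacements of the landmark
    locations X and camera parameters w are linear functions of the dense
    feature descriptor F, following the update equations above."""
    return X + R_X @ F + b_X, w + R_w @ F + b_w

X = rng.normal(size=n_X)              # current 2D landmark locations
w = rng.normal(size=n_w)              # current camera parameters
for k in range(3):                    # a few cascade iterations
    F = rng.normal(size=n_F)          # stand-in for the dense descriptor F^(k)
    R_X = rng.normal(scale=0.01, size=(n_X, n_F)); b_X = np.zeros(n_X)
    R_w = rng.normal(scale=0.01, size=(n_w, n_F)); b_w = np.zeros(n_w)
    X, w = refine(X, w, F, R_X, b_X, R_w, b_w)
print(X.shape, w.shape)
```

In a trained system the regressors R and offsets b are learned from annotated faces, and F is recomputed from the image at the updated landmark positions on each iteration.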

B. CNN FEATURE EXTRACTOR
Generally, features are extracted by the network and trained on directly. To reduce spatial complexity, specific locations in the frame are given to the network, from which the features are extracted and trained. The CNN feature extractor is built on a transfer learning approach using a pre-trained model, allowing the network to leverage the power of the pre-trained model combined with a custom-trained model for video analysis. After identifying the layer in the source CNN to use as the feature-extraction point, the layers beyond it are removed, leaving behind a feature extractor network whose output is passed on to the LSTM network. The video data is split into frames, but the frames are not saved, to reduce memory complexity. At the frame level, these images yield raw abstract face features, with high-dimensional visual information embedded as a feature vector per frame, u_n^(v), where n is the frame index and m is the output dimension of the deep CNN feature extractor. The hidden-layer visualization of the patterns learned by the model, trained on a ResNet model, is shown in Figure 3. Given v as the unique identifier of a video in the training set, the features are represented as the set U^(v) of all frame feature vectors u_n^(v), n = 1, ..., T, used for predicting the sequence, where T is the number of frames in video v.
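Truncating a pre-trained network at a chosen feature-extraction point, as described above, can be sketched as follows. This is a pure-NumPy sketch; the toy "layers" are assumptions standing in for a real pre-trained model such as ResNet.

```python
import numpy as np

rng = np.random.default_rng(5)

# Stand-in "pretrained" network as an ordered list of layers; in practice
# this would be a model such as ResNet loaded with ImageNet weights.
def conv(x):
    return np.maximum(x @ rng.normal(scale=0.1, size=(x.shape[-1], 32)), 0)

def pool(x):
    return x.mean(axis=(0, 1))            # global average pooling

def classifier(x):
    return x @ rng.normal(size=(32, 1000))

layers = [conv, pool, classifier]

def make_extractor(layers, cut):
    """Drop every layer beyond the chosen feature-extraction point,
    leaving a per-frame feature extractor to feed the LSTM."""
    kept = layers[:cut]
    def extract(frame):
        out = frame
        for layer in kept:
            out = layer(out)
        return out
    return extract

extractor = make_extractor(layers, cut=2)  # keep conv + global pooling
frame = rng.normal(size=(224, 224, 3))
features = extractor(frame)
print(features.shape)   # (32,)
```

Running this extractor over each decoded frame, and discarding the frame immediately afterwards, yields the sequence u_1, ..., u_T without ever storing the frames themselves.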

C. LSTM PIPELINE
In the LSTM pipeline, a mean pooling layer extracts frame-level visual features without saving them, holding them temporarily in its memory unit, and the resulting video-level feature vectors [4], [34] are fed as inputs to the LSTM cells. The feature vectors come from the CNN feature extractor applied to the face embedding. For each video frame, the LSTM memorizes the pattern at the current time t; the pattern is then erased from the cell and the successive patterns of the queued frames are computed. The previous patterns and correlations are carried in the hidden state h_t, the internal memory cell state c_t and three gates i_t, o_t, f_t, where g_t is a candidate memory cell state bridging the current input and the previously stored patterns. The computation is given by:

i_t = σ(W_i x_t + U_i h_{t−1} + b_i)
f_t = σ(W_f x_t + U_f h_{t−1} + b_f)
o_t = σ(W_o x_t + U_o h_{t−1} + b_o)
g_t = tanh(W_g x_t + U_g h_{t−1} + b_g)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ g_t
h_t = o_t ⊙ tanh(c_t)

where σ is the sigmoid function, ⊕ denotes the vector concatenation operator (used below to combine frame features with embedding weights), ⊙ is the element-wise multiplication between two vectors, the W's are the weights from inputs to hidden states and the U's are the weights from hidden to hidden. All weight matrices and biases b represent the value of each feature and need to be trained. The LSTM pipeline is divided into two parts: 1) the first part is the comprehensive feature that comes from the feature vector of the CNN model, representing all frames of a given video v; the mean-pooled frame feature is set as the input of the model for each frame. 2) the second part divides each video into equal instances of frames and takes their feature vectors under the target vector/label T. In this case there are two classes, real and fake, and the target label is encoded as one-hot vectors (y_1, y_2). An embedding layer is added to squeeze the high-dimensional sparse vectors into lower-dimensional vectors of weights (w_1, ..., w_n) for each feature vector. The averaged frame feature x is then duplicated and concatenated with the weight vectors w_t, and finally input to the LSTM model at each time step as (x_1 ⊕ w_1, ..., x_T ⊕ w_T). The intermediate hidden states (h_1, ..., h_t) are the outputs of the LSTM cells, whose memory cells converge toward the final goal state. The conditional probability for each video v is then:

p(y | x_1 ⊕ w_1, ..., x_T ⊕ w_T) = softmax(W_y h_T + b_y)

where h_T is the final hidden state and W_y, b_y project it onto the two classes.

IV. RESULTS AND DISCUSSION
The proposed approach has a 3-phase architectural design, intended to reduce the memory constraints on physical GPU servers. AI-generated DeepFake videos are given for training, and their reflection is given for testing.

A. NETWORK CONSTRUCTION
1) FACIAL LANDMARKS EXTRACTION
The first step is facial landmark extraction from each frame v: 128-D and 512-D landmarks are extracted.

2) CNN FEATURE EXTRACTOR
The next step is feature extraction from the facial landmarks and their locations. In-depth insights are extracted using CNN feature extractors alongside the facial landmark movements, for better and more precise accuracy and to avoid false positives and false negatives. CNN layers [11], [56] are used to extract features and pass the feature vector to the LSTM cell as input. The image size is set to (224, 224). The network is loaded with a face detection model based on the Multi-task Cascaded Convolutional Network (MTCNN) architecture. The bottleneck feature-extraction method is initialized in the CNN layers with GlobalAveragePooling of the features extracted from each video frame v. The features are stored as an Nd-array and combined into a frame-level feature vector for each frame. The CNN network starts with the Sequential model; the first and second layers consist of a randomly chosen number of Conv2D filters of size 3×3, each followed by a ReLU activation function, with the first layer having padding of '1'. Based on the number of frames, filters are taken in the hidden layers from here, i.e., Conv2D filters of size 3×3 with padding '1' followed by ReLU activation. MaxPooling2D is then applied with a pool size of 2×2. A Dropout function is applied to remove unnecessary features and ease memory constraints. The final output layers Flatten the extracted features, which are fed to a Dense node of size 256 with a ReLU activation function. A final Dropout is applied to clean the features one last time. This output is fed to the LSTM cell unit.
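The layer stack described above can be sanity-checked by tracing tensor shapes through it. This is a small sketch; the filter count of 32 is an illustrative assumption, since the text leaves the number of filters open.

```python
# Trace tensor shapes through the described layers:
# Conv2D 3x3 (padding 1) -> ReLU -> Conv2D 3x3 (padding 1) -> ReLU
# -> MaxPooling2D 2x2 -> Dropout -> Flatten -> Dense(256, ReLU) -> Dropout.
def conv2d(shape, filters, k=3, p=1):
    h, w, _ = shape                     # stride-1 convolution with padding p
    return (h - k + 2 * p + 1, w - k + 2 * p + 1, filters)

def maxpool(shape, pool=2):
    h, w, c = shape
    return (h // pool, w // pool, c)

def flatten(shape):
    n = 1
    for d in shape:
        n *= d
    return (n,)

shape = (224, 224, 3)                   # input frame size from the text
shape = conv2d(shape, 32)               # first Conv2D (32 filters assumed)
shape = conv2d(shape, 32)               # second Conv2D
shape = maxpool(shape)                  # MaxPooling2D, pool size 2x2
shape = flatten(shape)                  # Flatten; Dropout does not change shape
dense_out = 256                         # Dense(256) + ReLU
print(shape, dense_out)
```

With 3×3 kernels and padding 1 the spatial size is preserved through the convolutions, so only the pooling layer halves the resolution before flattening.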

3) LSTM UNIT
With heavy, large datasets, extracting frames from the videos and feeding them to the network consumes a great deal of memory and requires very high computational power. An LSTM network works best for video classification, and it allows storing the patterns or features without saving the frames on the physical server. The pipeline extracts frames from the videos and extracts the facial landmarks and features, but never saves a frame physically. The LSTM memory cell stores each recorded pattern and learns from it until the next pattern arrives, at which point the current one is erased from the cell [4], [7], [18]. These patterns are weighted and stored in the LSTM memory units, and the loop continues throughout the video frames v. The facial extraction of faces in frames is shown in Figure 4, which tracks the movement of essential parts such as the eyes, eyebrows, nose, mouth, and head using 512 facial embeddings. The LSTM network is built using recurrent neural network functions. Under the Sequential model, LSTM nodes are initialized based on the input frames. The first LSTM node, with return sequences and a Recurrent Dropout of '0.2', is added, followed by a Dropout function. This loop continues throughout the total number of layers in the network. Following the loop of initial layers is an LSTM node with a Recurrent Dropout of '0.2', followed by a Dropout function. This is given as input to a Dense layer of size 256 with a ReLU activation function. The final layer is a Dense layer with the number of classes, i.e., '2', and a Sigmoid activation function, since the task is binary classification. To apply more complex convolutions in the CNN-and-LSTM prediction pipeline, the authors concatenate the features coming from the CNN feature extractor and the LSTM for training. 1D-Conv layers with a ReLU activation function are initialized for the embedded sequence, each followed by MaxPooling × 5 for the 1D-Conv. These outputs are appended and concatenated with a 1D-Conv layer of filter size '5' and a ReLU activation function.
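As a rough illustration of how the memory cell carries a pattern forward without storing any frame, a single LSTM step over a frame-level feature vector can be written as follows. The weight shapes, initialization, and dimensions are hypothetical, not those of the trained network:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step: the gates decide what to forget from the
    cell state c, what to write in from the new frame features x, and
    what to expose as the hidden state h.
    Shapes: W (4H, D), U (4H, H), b (4H,)."""
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b       # all four gate pre-activations
    f = sigmoid(z[0:H])              # forget gate
    i = sigmoid(z[H:2*H])            # input gate
    o = sigmoid(z[2*H:3*H])          # output gate
    g = np.tanh(z[3*H:4*H])          # candidate cell content
    c = f * c_prev + i * g           # old pattern fades, new one enters
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
D, Hs = 128, 64                      # feature size, hidden size (assumed)
W = rng.normal(size=(4 * Hs, D)) * 0.01
U = rng.normal(size=(4 * Hs, Hs)) * 0.01
b = np.zeros(4 * Hs)
h, c = np.zeros(Hs), np.zeros(Hs)
for x in rng.normal(size=(16, D)):   # 16 frame-level feature vectors
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape)  # (64,)
```

Only h and c persist between frames, which is why the frames themselves never need to be written to disk.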

B. TRAINING ON DFDC DATASET
The dataset [57] is collected from the Kaggle DeepFake Detection Challenge. It consists of 470 GB of AI-generated videos of multiple persons. The whole dataset is used to train the Conv-LSTM network on two classes: 'Real' and 'Fake'.

C. RESULTS AND ANALYSIS
Some of the major deepfake datasets have been tested, such as the Google AI DeepFake raw dataset [58], the Kaggle DFDC testing dataset [57], YouTube raw videos from the Google AI deepfake dataset [58], and the Celeb-DF dataset, as shown in Figures 5 and 6.
The training lasted almost 7 days on the humongous data (the Kaggle DFDC training set). The trained model was then tested on different datasets to measure its accuracy; some standard deepfake datasets are mentioned above. Figure 7 shows the prediction results on the Google AI deepfake dataset [58], which is humongous compared with the other testing datasets selected by the authors. As visualized in Figure 7, landmarks play a pivotal role in identifying the movements of facial parts, which can be visualized in a graph as shown in Figure 8. The authors have plotted the variations in each frame with respect to each parameter, such as eyebrow movement, nose movement, and mouth movement consistency, as shown in Figure 6; the plot clearly shows the spikes in variation from frame to frame. Since the original face is warped with another, the facial parameters are not consistent with the original target's facial movements. A normal original video shows fairly consistent variation curves with a few modest spikes, unlike the heavy spikes shown in Figure 8 for a visual deepfake video. The variation in each frame is thus the deciding factor for classifying the video as 'real' or 'fake'.
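The frame-wise variation logic described above can be sketched as a small NumPy routine: compute per-frame landmark displacement, then flag a video whose displacement curve shows heavy spikes relative to its own median. The spike factor, spike count, landmark count (68), and the synthetic data are all illustrative assumptions, not the paper's trained decision rule:

```python
import numpy as np

def frame_variations(landmark_seq):
    """Per-frame motion magnitude: mean L2 displacement of the
    landmark points between consecutive frames.
    landmark_seq: (T, P, 2) array of P landmark (x, y) positions."""
    diffs = np.diff(landmark_seq, axis=0)            # (T-1, P, 2)
    return np.linalg.norm(diffs, axis=2).mean(axis=1)

def classify_by_spikes(variations, spike_factor=3.0, max_spikes=2):
    """Flag 'fake' when heavy spikes dominate: count frames whose
    variation exceeds spike_factor times the median variation."""
    spikes = variations > spike_factor * np.median(variations)
    return 'fake' if spikes.sum() > max_spikes else 'real'

rng = np.random.default_rng(1)
# A consistent face motion: a smooth random walk of 68 landmarks.
smooth = np.cumsum(rng.normal(0, 0.01, size=(50, 68, 2)), axis=0)
# A 'warped' version: heavy landmark jumps injected every 7th frame.
warped = smooth.copy()
warped[10::7] += rng.normal(0, 1.0, size=warped[10::7].shape)

print(classify_by_spikes(frame_variations(smooth)))  # real
print(classify_by_spikes(frame_variations(warped)))  # fake
```

The real pipeline combines this motion evidence with the CNN features rather than thresholding it alone.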
As the subjective results in Table 2 show, videos with only side facial movements fail to classify correctly, since the model needs to analyze the whole face to compare movements frame-wise. In some deepfake videos, only one side of the face is forged while the other half, such as an eye, eyebrow, or cheek, is kept normal. In these extreme cases of forgery, the model gets stuck at low accuracy levels; hence, a proper full-face video is necessary to obtain proper results. Figure 9 summarizes the performance of the network trained on the Kaggle DFDC dataset [57]. The dataset is very large, weighing around 470 GB. Overall, 50 training videos, each ∼10 GB in size, were used for training. The training patch accuracy is noted at 95.12% and the validation patch accuracy at 89.01%. With the Adam optimizer, the loss curve showed decreasing behavior. The network took 1024 × 1024 image patches to train the model, and the batch size was set to 1 to prevent memory allocation errors. Overall, the model was trained for 50K epochs, which took over a week. The learning rate started at 0.100 and was halved every further 10K steps; the loss did not decrease past the 46K-th epoch. The widely used model evaluation metrics are average precision and F1 score. The face detector compares the predicted bounding boxes and the landmarks inside them according to the intersection over union (IOU), and its parameters are updated at each epoch. The F1 score is used to summarize precision and recall. Precision is the ratio of correct predictions to all predictions made, and recall is the ratio of correct predictions to the total ground truths. However, neither alone is sufficient to measure the performance of the network; the F1 score is computed from precision and recall as dependent parameters to evaluate the network on the data.
The F1 score is defined in terms of true positives (TP, correct detections), false negatives (FN, missed detections), and false positives (FP, incorrect detections). The mathematical computations for the above-mentioned metrics are:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × (Precision × Recall) / (Precision + Recall)

The trained and saved model is heavyweight and computationally complex. Training was performed on an NVIDIA 1080 Ti GPU for 7 days without disruption. Inference speed is around ∼0.5 ms on GPU systems and ∼10 ms on CPU systems.
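The evaluation metrics above, plus the IOU box matching used by the face detector, are straightforward to compute; the example counts and boxes below are made up for illustration:

```python
def precision(tp, fp):
    """Fraction of predictions that are correct."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of ground truths that are detected."""
    return tp / (tp + fn)

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes, as used
    to match predicted face boxes against ground-truth boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(round(f1_score(tp=90, fp=10, fn=30), 4))      # 0.8182
print(round(iou((0, 0, 10, 10), (5, 5, 15, 15)), 4))  # 0.1429
```

Note how F1 penalizes the imbalance: precision here is 0.90 but recall only 0.75, so F1 sits between them, closer to the lower value.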

D. PERFORMANCE METRICS
As shown in Figure 10, for the labeled DFDC dataset used in training, the frame-wise landmark motion is plotted and visualized. Unusually massive spikes can be found in the fake videos. These frame-wise landmark-motion results are key but not sufficient on their own to classify a video; hence, they are combined with the CNN feature results, i.e., the outputs of the network given the landmark patterns, to classify the video as 'real' or 'fake'.
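One simple way to picture this combination is a weighted fusion of the two evidence scores into a single decision; the weights, threshold, and scores below are purely hypothetical, standing in for whatever the trained network learns internally:

```python
def fuse_scores(landmark_score, cnn_score,
                w_landmark=0.4, w_cnn=0.6, threshold=0.5):
    """Combine frame-wise landmark-motion evidence with the CNN
    feature score into one 'fake' probability. Both inputs are
    assumed to be in [0, 1]; weights and threshold are illustrative."""
    fused = w_landmark * landmark_score + w_cnn * cnn_score
    label = 'fake' if fused >= threshold else 'real'
    return label, fused

# A video whose landmarks spike heavily and whose CNN score is middling:
label, score = fuse_scores(landmark_score=0.9, cnn_score=0.55)
print(label, round(score, 2))  # fake 0.69
```

The point is only that neither signal decides alone: a spiky landmark trace with a low CNN score, or vice versa, lands near the threshold instead of being flagged outright.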

V. CONCLUSION
In this paper, the authors have designed a benchmark-setting Conv-LSTM architecture that uses facial landmarks and convolutional features to automatically detect visual forgery in videos and images. Visual forgery in videos and images can be detected automatically and with high accuracy using the proposed approach. As the architecture was built and trained on a humongous amount of data, visual forgery can be detected easily, with the best accuracy compared to other state-of-the-art approaches. The proposed approach is simple to use, although the black-box training inside neural networks is complex; memory constraints are addressed, since frames are not saved from the videos during the training process.
This research has addressed detecting visual forgery through convolutional methods. The further scope of this research includes reverse engineering AI-generated visual forgery by saving the patterns and features produced while the forgery is generated and training on them directly under the forgery class. This would be more complex than expected, but it opens the gate for lightweight model approaches.
However, the primary limitation of this work is the availability of real videos of the subject. Although the model works well without any real video of the subject, by comparing the variations from frame to frame as shown in Figure 8, classifying 'real' or 'fake' precisely without a real reference video remains a challenge; a real video of the subject acts as the backbone for the predicted results. The next major limitation is the position of the subject in the video: zoomed, frontal angles are preferred over side angles, since accuracy levels are low for side angles. The model is also heavyweight in terms of computational complexity. Finally, the video must be at least 10 seconds long to obtain decent results; 15-30 seconds is better for higher accuracy.

VI. FUTURE SCOPE AND FUTURE RESEARCH
The authors performed various computational training runs with several architectures on just the videos of the 'real' and 'fake' classes, resulting in high training accuracy but very low prediction accuracy. This was due to a mismatch in the process of identifying DeepFake patterns in videos. Having obtained proper pattern recognition, i.e., the facial-landmark-pattern approach and CNN feature extraction on the specific main facial parts, these features can further be applied in a Reinforcement Learning architecture, where the rules (here, the features of 'real' and 'fake') are given to the system and a humongous amount of data is used for training and testing. This has future scope in successfully nullifying any kind of DeepFake video generated in the future. The authors will contribute by extending this approach using Reinforcement Learning in future work.