DEEP-STA: Deep Learning-Based Detection and Localization of Various Types of Inter-Frame Video Tampering Using Spatiotemporal Analysis

Abstract: Inter-frame tampering in surveillance videos undermines the integrity of video evidence, potentially influencing law enforcement investigations and court decisions. It is the most common type of video tampering and is often imperceptible to the human eye. Various algorithms based on handcrafted features have been proposed to identify such tampering. However, automatically detecting and localizing the tampering and determining its type, while maintaining accuracy and processing speed, is still a challenge. We propose a novel method for detecting inter-frame tampering that (i) exploits a 2D convolutional neural network (2D-CNN) and the fusion of spatiotemporal information for deep automatic feature extraction; (ii) employs an autoencoder to significantly reduce the computational overhead by reducing the dimensionality of the feature space; (iii) analyzes long-range dependencies within video frames using long short-term memory (LSTM) and gated recurrent units (GRU), which helps to detect tampering traces; and (iv) adds a fully connected (FC) layer with softmax activation for classification. The structural similarity index measure (SSIM) is utilized to localize tampering. We perform extensive experiments on datasets comprising challenging videos with different complexity levels. The results demonstrate that the proposed method can identify and pinpoint tampering regions with more


Introduction
Currently, surveillance cameras are used everywhere, and their recorded videos are often used as electronic evidence to strengthen certain claims in forensic statements or criminal investigations, enhancing people's understanding of the described incident. However, this evidence remains valid only if the content represented in the digital video is genuine. It is now easy to edit video content due to the accessibility of video editing delivering educational content and aiding in skill acquisition. Its authenticity is crucial, particularly when the recorded footage serves as primary evidence in legal investigations, criminal trials, and judicial proceedings. With simple video editing tools, an attacker can easily delete events, alter the chronological order of events, or duplicate frames to create a fake version that seems genuine. These tampered videos can be used to mislead the police and the decisions of the courts. The research community has proposed novel techniques to address these emerging challenges. Since such tampering is often imperceptible to the human eye, verifying the authenticity and integrity of multimedia data such as graphics, audio, and videos appearing on different social networking sites has become a major challenge for researchers, scientists, and investigative agencies.
Video tampering detection techniques are divided into (1) active techniques and (2) passive techniques (also known as blind methods). Active techniques rely on known traces, such as digital signatures or watermarks, that are embedded into the content during the acquisition phase when the video is being recorded, or later during the transmission of data. Any change in this embedded information indicates tampering. This technique may fail when the alteration is performed before inserting a digital signature or watermark [1][2][3]. Passive techniques are further categorized into inter-frame and intra-frame tampering detection techniques [4][5][6][7]. Both types of tampering involve manipulating video content but target different domains. Intra-frame tampering (spatial tampering) involves manipulating the individual frames of a video, which can be detected by using image forensics algorithms. Common types of intra-frame forgeries include copy-move and region splicing [7]. Inter-frame forgeries (temporal tampering), on the other hand, involve manipulating the video content between frames in a video sequence. In our research, we focus on inter-frame tampering, which is commonly applied to surveillance videos because it is easy to execute and almost imperceptible. Frame duplication, deletion, and insertion are the most commonly used semantic-focused operations in surveillance videos, as illustrated in Figure 1. In frame insertion, frames from a different video are introduced to add a fake event. In frame deletion, a sequence of frames is removed to hide an event. In frame duplication, a sequence of frames is repeated so that an event appears again. Such tampered videos can mislead investigators, especially in criminal investigations [8,9]. Therefore, there is an urgent need to verify that the content is genuine and that it accurately represents reality. This problem requires the development of a robust tampering detection system to combat malicious video tampering.
Various video tampering detection techniques have been proposed in the literature to detect inter-frame tampering. Most of these techniques are based on handcrafted features, such as statistical features [10][11][12], pixel and texture characteristics [13][14][15], and motion residual and optical flow [16][17][18], and a few are based on deep learning [8], [19][20][21]. Handcrafted features are sensitive to post-processing operations like blurring, brightness changes, noise, and compression. Additionally, most existing approaches examine traces to detect only one type of inter-frame tampering, such as frame cloning [5], frame deletion [16,22], frame shuffling, frame insertion, or frame duplication [15,23,24], and thus cannot simultaneously detect all kinds of inter-frame tampering and identify their types. These limitations hinder their performance in real-world applications.
Despite various proposed solutions for detecting inter-frame tampering, four major challenges remain. First, there is limited applicability; many video tampering detection techniques are restricted by factors like the number of tampered frames, frame rate, and video format, which limits their practical use [25,26]. For example, the deep learning-based method proposed in Ref. [21] can only detect inter-frame tampering if the tampered frames exist in multiples of 10, and it fails if there are fewer than 25 tampered frames. Similarly, the method proposed by Bakas and Naskar in [27] cannot detect frame duplication of more than 20 frames.
Second, there is poor generalization; benchmark datasets are crucial for evaluating video tampering detection algorithms [28]. Many researchers have developed their own datasets [16,17,22] to perform experiments on inter-frame tampering detection, but these datasets are often small and are not made available to the public or the research community. Due to the unavailability of benchmark datasets, cross-dataset validation has not been performed; thus, the generalization capability of existing methods cannot be ensured [14,25].
Third, there is the challenge of forgery localization; some methods detect deletion only at specific positions within a video shot, such as the method in Ref. [29], which detects frame deletion forgery only at the center of a 16-frame video shot; however, frames are not always deleted in the middle portion of a video.
Fourth, there is high computational complexity; many state-of-the-art approaches are computationally intensive because they rely on pixel-based [30,31], spatial and/or temporal correlation-based [32][33][34], or high-dimensional feature-based methods [35][36][37], making it time-consuming to analyze high-resolution or lengthy videos.
Due to these challenges, there is a dire need for a video tampering detection system that meets these basic requirements: high accuracy, strong applicability, and high generalization capability, with good robustness. To address these drawbacks, we propose a novel forensic system that is capable of detecting and localizing multiple types of inter-frame tampering using spatiotemporal analysis based on deep learning. It selects robust features using a state-of-the-art 2D convolutional neural network (2D-CNN) and the fusion of spatiotemporal information. To deal with high-dimensional features and computational complexity, an autoencoder is utilized to reduce the dimensionality of the feature space. Moreover, special types of recurrent neural networks (RNN), namely long short-term memory (LSTM) and gated recurrent units (GRU), are used because they handle long-term dependencies and input sequences of variable length and perform well with time series data. Finally, a fully connected (FC) layer with softmax activation is added to the LSTM/GRU, yielding the posterior probabilities of the classes. The major highlights of the proposed work are presented as follows:

• We propose a robust video tampering detection method, which first extracts discriminative features using a CNN model and then takes into account the interdependencies of frames to detect tampering traces in videos due to frame deletion and insertion. It detects deletion and insertion simultaneously, unlike the state-of-the-art methods [8,29,38], which detect only one type of video tampering. Moreover, the proposed technique can detect the insertion and deletion of as few as ten frames, along with the type of tampering. In contrast, the method in Ref. [21] detects tampering only if the tampered frames exist in multiples of 10 and cannot detect tampering of fewer than 25 frames.

• We introduce an efficient feature extraction method that first applies spatiotemporal average pooling (STP) to overlapping video clips and then employs a pre-trained CNN model, such as VGG-16, as a feature extractor. This approach harnesses the hierarchical structure of the CNN model to extract rich and deep features. Our method demonstrates superior performance compared to the state-of-the-art techniques.

• The dimension of the features is very high, which causes computational difficulties. To overcome this issue, we propose the use of an autoencoder to reduce the dimensionality of the feature space, which significantly lessens the computational overhead of the proposed method.

• We analyze the long-range dependencies among the video frames using LSTM/GRU to detect tampering traces; this leads to high accuracy in detecting tampering in videos, irrespective of their frame rates, video formats, number of tampered frames, and compression quality factor.

The rest of the paper is outlined as follows: Section 2 reviews the related work and identifies the gaps in the existing research in this field. Sections 3, 4, and 5 present the proposed method, the dataset description, and the experiments, along with the results, respectively. Finally, Section 6 presents the conclusions, along with future directions.

Literature Review
In the realm of multimedia forensics, the challenge of detecting video tampering remains at a nascent stage. There is a lack of robust techniques that can detect and localize video inter-frame tampering [5,39]. Several significant approaches have been introduced in digital multimedia forensics, which can be broadly divided according to their feature extraction methods: handcrafted-based methods and deep learning-based methods.
In handcrafted-based methods, researchers have proposed numerous techniques that use both the temporal and spatial correlations of overlapped video clips and measure their similarity to detect frame duplication tampering [14,40,41]. Lin et al. [34] presented the idea of detecting frame duplication by comparing graphs of frames, and similarity was checked by comparing only the spatial correlation of the original and forged clips. All these similarity detection techniques access the surveillance footage from a stored database, resulting in significant computation time to process each video frame. Features such as correlation [6,10,13], optical flow [17,42,43], prediction residual [22,44,45], the bag-of-words (BoW) model [23], the standard deviation of residual frames [40], motion vector and motion residual [36,46], and noise residue [32] have been utilized in the literature to identify inconsistencies introduced by inserted, deleted, or duplicated frames in videos. Some methods, like those of Wang et al. [11] and Huang et al. [47], use statistical features like the consistency of correlation coefficients of gray values (CCCoGV) and triangular polarity feature classification (TPFC) to detect inter-frame tampering. These algorithms, although based on statistical features, may struggle to detect tampering in the presence of different compression types. Motion residual, optical flow, and/or prediction residual-based features are employed by Jia et al. [17], Kingra et al. [43], Shanableh et al. [48], Chao et al. [49], and Feng et al. [50] to detect video inter-frame tampering. Tampering with a video also disturbs the texture of its frames, and this change in texture provides clues to detect forgery.
Recently, Shehnaz and Kaur detected and localized multiple inter-frame forgeries in a video by employing a histogram of oriented gradients (HoG) and local binary pattern (LBP). However, this method cannot localize frame duplication and frame shuffling attacks. Many other authors, such as Zhang et al. [13], Liao and Huang [51], Zhao et al. [52], Bakas et al. [53], Kharat et al. [15], and Shehnaz and Kaur [54], utilized texture features to detect inter-frame tampering in a video. These techniques yield good results; however, they are computationally expensive due to their high-dimensional features.
Among the deep learning-based methods, Long et al. [29] used a C3D network to detect and localize frame dropping from a single video clip comprising 16 frames by checking the center of the clip, i.e., between the 8th and 9th frames. They defined the confidence score with a peak detection trick and a scale term based on the output score curves to reduce false alarms. They also proposed a coarse-to-fine deep learning approach [8] for the detection and localization of frame duplication at the video and frame levels. Each video was split into 64-frame clips, with an overlap of 16 frames, and deep spatiotemporal features were extracted using a pre-trained I3D network. Additionally, a Siamese network based on ResNet was utilized to verify frame duplication at the frame level. The location of tampering is determined with an I3D-based inconsistency detector.
Bakas et al. [27] proposed a deep learning technique based on 3D-CNN to detect inter-frame forgeries within a single video. They introduced a difference layer (pixel-wise difference layer) at the beginning of C3D, which extracts the temporal information suitable for the detection of inter-frame anomalies in a video.
For the detection of inter-frame tampering, Fadl et al. [21] used a pre-trained 2D-CNN model for automatic feature extraction. They computed the spatiotemporal average of every 10 non-overlapped frames of a video before passing them to the 2D-CNN. Then, the structural similarity index (SSIM) among the features is computed and fed to an MSVM for classification. The method detects frame insertion, deletion, and duplication with average accuracies of 99.9%, 98.7%, and 98.5%, respectively, but it cannot detect tampering involving fewer than 25 frames. It is effective when the number of frames selected for insertion, deletion, or duplication is a multiple of 10. This method works with the assumption that frames should only be inserted at the static portions of the video when performing frame duplication tampering. The localization of the tampered region is not precise. Additionally, the method has shown good performance on the authors' own dataset, but it has not been validated across different datasets, potentially limiting its generalizability.
Considering video tampering detection as an anomaly detection task, integrating deep learning techniques with prior information could potentially enhance the efficacy of video tampering detection [55]. Kumar and Gaur [56] proposed a method that extracts deep features using a CNN model. This method establishes the relationship between consecutive frames by calculating the inter-frame correlation coefficient. The inter-frame correlation distance is then computed, and a dual threshold is applied to identify the forgery and its type. A detection accuracy of 86.5% is achieved on the VIFFD dataset. However, the VIFFD dataset used in this approach lacks realistic representation, i.e., it only incorporates certain scenarios: in the frame insertion forgery, frames are only inserted at the start or end of the video; if frames are removed at the beginning of the video, the deleted frames are replaced by black frames, which is not a practical approach.
Deep learning techniques rely heavily on large datasets to automatically extract the high-dimensional features essential for video tampering detection. Numerous researchers have carried out experiments on their own datasets [17,21,25,29,40,43,52,54,57], yielding commendable detection accuracies; however, these datasets are not accessible to other researchers. A thorough analysis reveals that most of the existing inter-frame tampering detection methods are based on handcrafted features, which are sensitive to post-processing operations like blurring, noise, and compression. Most of them cannot detect and localize all kinds of inter-frame tampering. The method of Long et al. [29] can detect frame deletion tampering only at the center of a 16-frame video shot. In Ref. [27], the authors construct individually trained models for each type of inter-frame tampering within a single video; this method cannot identify frame duplication involving more than 20 frames. On the other hand, the method in Ref. [21] cannot detect tampering if the number of tampered frames is fewer than 25.
Table 1 provides comprehensive details of the relevant literature regarding inter-frame forgery detection techniques for digital video. Notably, most of the state-of-the-art deep learning-based techniques [8,29,38] can only detect a single type of temporal tampering within a video. Furthermore, they are computationally expensive due to their high-dimensional features. Similarly, many temporal tampering detection methods exhibit robust performance on a specific set of videos but struggle to replicate such results on other, unseen video datasets.
To evaluate the effectiveness of the proposed techniques, it is necessary to test them on publicly available datasets. Unfortunately, this is a big challenge, as these datasets are not accessible to other communities or researchers. Compared to tampered image datasets, tampered video datasets are significantly less mature. In this paper, we propose a novel method for detecting inter-frame tampering with high accuracy in terms of both detection and localization. Our method remains effective even when the video exhibits variations in format, resolution, and frame rate and contains as few as 10 tampered frames. We employ a 2D-CNN for feature extraction and utilize an autoencoder to reduce the computational overhead by shrinking the dimension of the feature space. An LSTM/GRU with an FC layer is trained on all tampering types simultaneously for classification. Our system overcomes the drawbacks of the previous methods and improves performance. The details of the proposed method are presented in the next section.

Proposed Method
This research aims to develop an automated method that detects and localizes inter-frame tampering in videos and accurately determines its type. First, we formulate the problem and then present the details of the proposed method for solving it.

Problem Formulation
We are given a surveillance video $X \in \mathbb{R}^{r \times c \times t}$ comprising $t$ frames (i.e., $t$ represents the time axis), each with a resolution of $r \times c$. It is required to determine whether the video has been tampered with by inserting or deleting frames. If the video is found to be tampered with, then the inserted/deleted region is located.
Let $C = \{\text{authentic}, \text{insertion}, \text{deletion}\}$. We formulate this detection problem as a three-class classification problem and design a classifier $\Psi: \mathbb{R}^{r \times c \times t} \rightarrow C$ such that $\Psi(X; \theta) = l$, where $X \in \mathbb{R}^{r \times c \times t}$, $l \in C$, and $\theta$ represents the learnable parameters. This means that we need to design a mapping $\Psi(X; \theta)$ that takes a video $X$ as input and predicts whether it is authentic or has been tampered with by insertion or deletion.
Given a video $X \in \mathbb{R}^{r \times c \times t}$ that has already been detected as tampered with by insertion or deletion, it is required to determine the exact location of tampering. Let $\Phi$ represent the objective function for locating inter-frame tampering. It takes the tampered video $X$ as input and pinpoints the tampered region, $\Phi(X) = d$, where $d$ precisely indicates the location of tampering within the video.

Proposed Method for Detection
We design the mapping $\Psi$ as a composition of four mappings, as follows:
$$\Psi = \psi_4 \circ \psi_3 \circ \psi_2 \circ \psi_1,$$
where the mapping $\psi_1$ preprocesses $X$ to yield $U \in \mathbb{R}^{a \times b \times n}$; $\psi_2$ takes $U$ as input and extracts the temporal features $F \in \mathbb{R}^{m \times n}$, each of dimension $m$; $\psi_3$ reduces the dimension $m$ of the feature space to $d$ and yields $V \in \mathbb{R}^{d \times n}$; and finally, $\psi_4$ analyzes the temporal features $V$ to predict the label $l$ of $X$.
An overview of the proposed method is presented in Figure 2; it comprises four blocks. The first block models $\psi_1$ and is concerned with preprocessing; a video is split into overlapped clips, and spatial and temporal information is fused to generate a unified image for each clip. The second block specifies $\psi_2$, which extracts the features. The third block defines $\psi_3$, which reduces the dimensionality of the feature space and generates a consolidated feature matrix for the entire video. The final block models the mapping $\psi_4$, which infers whether $X$ is authentic or has been tampered with via deletion or insertion. Further details are provided in the subsequent sections.

Preprocessing
Tampered videos exhibit inconsistencies in pixel values between two consecutive frames, as presented in Figure 1. It is challenging to detect these small variations in pixel values, especially in the case of sophisticated video tampering, where the inconsistencies are minimal. To address this, we introduce an efficient preprocessing method. Processing the input video as a whole is time-consuming and impractical for locating the tampered regions. Therefore, first, we segment the input video $X$ into overlapping clips $\{X_1, X_2, X_3, \ldots, X_n\}$, where each clip consists of $L$ frames, i.e., $X_i = (x_{i,1}, x_{i,2}, \ldots, x_{i,L}) \in \mathbb{R}^{r \times c \times L}$. Subsequently, inspired by the approach provided in Ref. [59], we pool the spatial and temporal information corresponding to each clip, yielding spatiotemporal pooling (STP). For pooling, first, each frame of a clip $X_i$ is filtered using the 3 × 3 averaging filter, i.e.,
$$\tilde{X}_i = A(X_i),$$
where $A$ filters each frame $x_{i,j}$ of $X_i$ to generate $\tilde{X}_i = (A(x_{i,1}), A(x_{i,2}), \ldots, A(x_{i,L}))$. Then, the STP image $U_i \in \mathbb{R}^{r \times c}$ of $X_i$ is computed by averaging the frames of $\tilde{X}_i$, as follows:
$$U_i = \frac{1}{L} \sum_{j=1}^{L} A(x_{i,j}).$$
An STP (zoomed-in form) of a video clip is presented in Figure 2; it shows the movement of a red car between two consecutive frames. As we use a pre-trained CNN model for feature extraction from each STP, each STP image is resampled to the dimensions of $a \times b \times 3$ using the bicubic interpolation algorithm. Images resampled with bicubic interpolation exhibit high picture quality, smoother textures, and fewer interpolation artefacts, resulting in significantly superior outcomes [21,60,61]. In our case, $a = 224$ and $b = 224$ because we employ VGGNet for feature extraction.
Finally, the concatenation of $U_i$, $i = 1, 2, \ldots, n$, results in $U$, where
$$U = [U_1, U_2, \ldots, U_n].$$
The above operations define the mapping $\psi_1$, which takes the video $X$ and generates $U \in \mathbb{R}^{a \times b \times n}$, i.e.,
$$\psi_1(X) = \bigoplus_{i=1}^{n} U_i = U, \qquad \{X_1, X_2, \ldots, X_n\} = \Gamma(X),$$
where $\Gamma$ is the mapping that splits the input video into $n$ overlapping video clips, and $\oplus$ represents the concatenation operation.
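The preprocessing steps above (splitting into overlapping clips, 3 × 3 average filtering, and temporal averaging) can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions: the video is stored as a `(t, r, c)` grayscale array, borders are handled by edge replication, and the function names, `clip_len`, and `stride` are hypothetical choices that the text does not specify. The bicubic resampling to 224 × 224 × 3 is omitted.

```python
import numpy as np

def box_filter_3x3(frame):
    """Apply a 3x3 averaging filter to one frame (borders are edge-replicated)."""
    padded = np.pad(frame, 1, mode="edge")
    rows, cols = frame.shape
    out = np.zeros((rows, cols), dtype=np.float64)
    # Sum the nine shifted copies of the frame, then divide by 9.
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + rows, dx:dx + cols]
    return out / 9.0

def split_into_clips(video, clip_len, stride):
    """Split a (t, r, c) video into overlapping clips of clip_len frames."""
    t = video.shape[0]
    return [video[s:s + clip_len] for s in range(0, t - clip_len + 1, stride)]

def spatiotemporal_pooling(clip):
    """Filter every frame of a clip, then average over time into one STP image."""
    filtered = np.stack([box_filter_3x3(frame) for frame in clip])
    return filtered.mean(axis=0)

def preprocess(video, clip_len=10, stride=5):
    """Return an (n, r, c) stack holding one STP image per clip."""
    clips = split_into_clips(video, clip_len, stride)
    return np.stack([spatiotemporal_pooling(clip) for clip in clips])
```

With a stride smaller than `clip_len`, consecutive clips share frames, so a tampering point falls inside more than one clip.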
The number of frames $L$ in each clip plays a key role in encoding tampering sensitivity. There are two possibilities: $L$ is fixed for all clips, or $L$ varies for each clip, i.e., adaptive $L$. For adaptive $L$, the video comprising $t$ frames is segmented into clips of variable length: starting from the $i$th frame, the subsequent frames $i + 1$ to $i + k$ are added to the clip until the mean difference between the $i$th frame and a subsequent frame exceeds the specified threshold, where $i = 1, 2, 3, \ldots, t$ and $k = 1, 2, 3, \ldots, t - 1$. The next clip starts from the frame at which the mean difference exceeds the threshold, and this process continues until the end of the video. In this way, we obtain segmented clips $\{X_1, X_2, X_3, \ldots, X_n\}$ with variable lengths. After conducting numerous experiments, a threshold value of 100 is chosen. We discuss the effect of $L$ in Section 5.1.
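The adaptive selection of $L$ can be illustrated with the following minimal sketch. It assumes that the "mean difference" is the mean absolute pixel difference between the first frame of the current clip and each later frame; the text does not define the measure precisely, so both this choice and the function name `adaptive_clips` are assumptions.

```python
import numpy as np

def adaptive_clips(video, threshold):
    """Return (start, end) frame index pairs (end exclusive) of variable-length clips.

    A clip grows from its first frame until the mean absolute difference between
    that first frame and the current frame exceeds `threshold`; the next clip
    then starts at the current frame.
    """
    frames = video.astype(np.float64)
    t = frames.shape[0]
    bounds = []
    start = 0
    for j in range(1, t):
        mean_diff = np.mean(np.abs(frames[j] - frames[start]))
        if mean_diff > threshold:
            bounds.append((start, j))
            start = j
    bounds.append((start, t))  # the last clip runs to the end of the video
    return bounds
```

Each returned pair can be used to slice the video, for example `video[start:end]`, before pooling the clip into an STP image.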

Feature Extraction
Videos can vary significantly with regard to content, quality, resolution, lighting conditions, camera motion, and other factors. Extracting robust features that are unaffected by such variations is challenging and requires careful consideration of feature selection and preprocessing techniques. To address this, we introduce an efficient method that extracts discriminative features by employing advanced CNN models initially trained on ImageNet. CNNs have demonstrated effectiveness in tasks like image classification [62] and object localization [63] within images. We employ a state-of-the-art pre-trained CNN model, VGG-16, which consists of 13 convolutional layers, 5 max-pooling layers, and 3 dense layers, totaling 21 layers, of which only 16 are weight layers with learnable parameters. Following the stack of convolutional layers, three fully connected (FC) layers are utilized: the first two have 4096 channels each, and the third performs 1000-way ILSVRC classification and thus contains 1000 channels, one for each class. The final layer is the softmax layer. Since the model is trained on millions of images from ImageNet [64], it is effective for modeling vision-related problems [65,66]. The convolutional base of this pre-trained model is used to extract rich and deep features, while the remaining network layers are discarded. Additionally, we introduce a global average pooling layer after the convolutional base of VGG-16. Since the global average pooling layer has no parameters, using it instead of fully connected layers significantly reduces the model's complexity. Each STP image, 224 × 224 × 3 in size, is fed to VGG-16 to obtain a feature vector $f_i$ of size $m$, i.e.,
$$f_i = G(U_i), \qquad i = 1, 2, \ldots, n,$$
where $G$ denotes the feature extractor; stacking the vectors extracted from the resampled STP images $U_i$ yields the feature matrix $F$ of size $m \times n$, i.e., $F \in \mathbb{R}^{m \times n}$. This mapping is represented as:
$$\psi_2(U) = F,$$
where the mapping $\psi_2$ takes $U$ as input and generates a feature matrix $F \in \mathbb{R}^{m \times n}$. Since the global average pooling layer generates a feature vector of size 512 corresponding to each STP image, a feature matrix of size 512 × n is obtained for each video.
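To make the role of the global average pooling layer concrete, the sketch below shows how one convolutional feature map per STP image becomes a 512 × n feature matrix. It does not load VGG-16; it assumes the final feature maps are already available as 7 × 7 × 512 arrays (the size the VGG-16 convolutional base produces for a 224 × 224 × 3 input) and uses hypothetical function names.

```python
import numpy as np

def global_average_pool(feature_map):
    """Collapse an (h, w, channels) feature map into a channels-long vector."""
    return feature_map.mean(axis=(0, 1))

def build_feature_matrix(feature_maps):
    """Stack one pooled vector per STP image as columns, giving shape (m, n)."""
    return np.stack([global_average_pool(fm) for fm in feature_maps], axis=1)
```

Because the pooling step only averages, it adds no learnable parameters, which is the reason given above for preferring it to the fully connected layers.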

Dimensionality Reduction
High-dimensional features are prone to overfitting, require more memory, and can lead to computational difficulties. Many approaches have high computational complexity due to high-dimensional features [35][36][37], which is a significant problem. Similarly, choosing an appropriate dimensionality reduction method is also challenging. Principal component analysis (PCA) simplifies data using linear combinations, while autoencoders, a more flexible approach, capture non-linear patterns, making them powerful for complex data reduction, with enhanced representation capabilities. The nonlinear dimensionality reduction technique using autoencoders can effectively learn the nonlinear correlation among numerous variables and succeed in detecting anomalies. In contrast, PCA, which performs linear dimensionality reduction, overlooks anomalies [67]. To address this issue, we propose the use of an autoencoder to reduce the dimensionality of the feature space in an unsupervised way [68,69]. This autoencoder was developed with a single hidden layer, using ReLU as the activation function and Adam as the optimizer. The encoder takes deep features and maps them into a latent encoding space, generating a latent code; the decoder then takes the encoder's output and attempts to reconstruct the original input, as shown in Figure 3b. We adopted this simple architecture because networks with more hidden layers are difficult to train [70]. We employ an autoencoder $E: \mathbb{R}^{m} \rightarrow \mathbb{R}^{d}$ such that $E(f_i; \theta_E) = v_i$, where $f_i$ is a column of $F \in \mathbb{R}^{m \times n}$, $v_i$ is the corresponding column of $V \in \mathbb{R}^{d \times n}$, and $\theta_E$ represents the learnable parameters, i.e.,
$$v_i = E(f_i; \theta_E), \qquad i = 1, 2, \ldots, n,$$
where the autoencoder takes each feature vector $f_i$ of $F$ with dimension $m$ and maps it to a vector $v_i$ of dimension $d$, where $d \ll m$. This mapping is represented as
$$\psi_3(F) = V,$$
where $\psi_3$ takes the feature matrix $F \in \mathbb{R}^{m \times n}$ as input and generates a feature matrix $V \in \mathbb{R}^{d \times n}$ of reduced dimensionality, corresponding to a video. The reduction in the dimensionality of the feature space not only leads to notable reductions in training time, memory usage, and computational overhead, but also enhances the visibility of tampering traces. Figure 3 presents an example of frame deletion tampering in which frames 107 to 121 of a video clip have been removed. The CNN features without the autoencoder are presented in Figure 3c, and the features after applying the autoencoder are presented in Figure 3d, where the feature dimension is reduced from 512 to 128 per STP image.
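The following NumPy sketch illustrates a single-hidden-layer autoencoder with a ReLU encoder trained by Adam, as described above. Several details are assumptions because the text does not state them: the mean squared reconstruction loss, the linear decoder, full-batch training, the weight initialization, and all hyperparameter values. The function name `train_autoencoder` is hypothetical.

```python
import numpy as np

def train_autoencoder(F, d, epochs=200, lr=1e-3, seed=0):
    """Train a one-hidden-layer autoencoder on the columns of F (shape m x n).

    Returns an `encode` function and the list of per-epoch reconstruction losses.
    """
    rng = np.random.default_rng(seed)
    m, n = F.shape
    X = F.T  # (n, m): one feature vector per row
    params = {
        "W1": rng.normal(0.0, np.sqrt(2.0 / m), (m, d)), "b1": np.zeros(d),
        "W2": rng.normal(0.0, np.sqrt(1.0 / d), (d, m)), "b2": np.zeros(m),
    }
    mom1 = {k: np.zeros_like(v) for k, v in params.items()}  # Adam first moments
    mom2 = {k: np.zeros_like(v) for k, v in params.items()}  # Adam second moments
    beta1, beta2, eps = 0.9, 0.999, 1e-8
    losses = []
    for step in range(1, epochs + 1):
        # Forward pass: ReLU encoder followed by a linear decoder.
        pre = X @ params["W1"] + params["b1"]
        H = np.maximum(pre, 0.0)
        X_hat = H @ params["W2"] + params["b2"]
        err = X_hat - X
        losses.append(float(np.mean(err ** 2)))
        # Backward pass for the mean squared reconstruction error.
        g_out = 2.0 * err / err.size
        g_hid = (g_out @ params["W2"].T) * (pre > 0)
        grads = {
            "W2": H.T @ g_out, "b2": g_out.sum(axis=0),
            "W1": X.T @ g_hid, "b1": g_hid.sum(axis=0),
        }
        # Adam update with bias-corrected moment estimates.
        for k in params:
            mom1[k] = beta1 * mom1[k] + (1 - beta1) * grads[k]
            mom2[k] = beta2 * mom2[k] + (1 - beta2) * grads[k] ** 2
            m_hat = mom1[k] / (1 - beta1 ** step)
            v_hat = mom2[k] / (1 - beta2 ** step)
            params[k] -= lr * m_hat / (np.sqrt(v_hat) + eps)

    def encode(F_new):
        """Map an (m, k) feature matrix to its (d, k) latent codes."""
        return np.maximum(F_new.T @ params["W1"] + params["b1"], 0.0).T

    return encode, losses
```

For the setting reported above, `F` would have 512 rows and `d` would be 128; only the encoder half is needed once training is finished.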

Classification
Video frames often exhibit temporal dependencies, where information in one frame may be related to information in temporally distant frames. Traditional neural networks may struggle to effectively capture such long-range dependencies. Hochreiter and Schmidhuber introduced LSTM networks [71], and GRUs were proposed later [72]; both can analyze long-range dependencies among video frames and detect tampering traces. For classification, we employ LSTM/GRU with an FC layer, activated through the softmax function, to compute the probability of the video sequence being categorized as $l \in \{\text{authentic}, \text{insertion}, \text{deletion}\} = C$. This classifier is represented as $\Lambda: \mathbb{R}^{d \times n} \rightarrow C$ such that $\Lambda(V; \theta_\Lambda) = l$, where $V \in \mathbb{R}^{d \times n}$, $l \in C$, and $\theta_\Lambda$ represents the learnable parameters. This classifier operates on the reduced feature matrix $V$ and assigns it to one of three categories: authentic, insertion, or deletion.
This mapping is represented as
$$\psi_4(V) = \Lambda(V; \theta_\Lambda) = l. \tag{11}$$
Equation (11) shows that the reduced feature matrix $V$ obtained from the previous block $\psi_3$, as shown in Figure 2, is fed to the LSTM/GRU, which further analyzes the tampering traces, and a final inference $l$ corresponding to the entire video is generated. In this way, classification is conducted at the video level. The functions of LSTM and GRU are described below.
LSTMs improve performance by memorizing the relevant information for a long period of time and identifying patterns through their gates. The LSTM layers also examine the inter-frame inconsistencies in the extracted features, leading to high accuracy in detecting tampering in videos, irrespective of their frame rates, video formats, number of tampered frames, and compression quality. The standard formulation of a single LSTM cell at step $n$, with input vector $z_n$ and previous output $h_{n-1}$, is described by the following equations:
$$f_n = \sigma(W_f [h_{n-1}, z_n] + b_f),$$
$$i_n = \sigma(W_i [h_{n-1}, z_n] + b_i),$$
$$\tilde{C}_n = \tanh(W_C [h_{n-1}, z_n] + b_C),$$
$$C_n = f_n \odot C_{n-1} + i_n \odot \tilde{C}_n,$$
$$o_n = \sigma(W_o [h_{n-1}, z_n] + b_o),$$
$$h_n = o_n \odot \tanh(C_n),$$
where $\sigma$ is the sigmoid function, $\tanh$ is the hyperbolic tangent function, $\odot$ denotes element-wise multiplication, and $i$, $o$, $f$, $C$, and $\tilde{C}$ are the input gate, output gate, forget gate, memory cell content, and new memory cell content, respectively. GRUs were introduced in 2014 [72]; they are similar to LSTMs but are faster and more compact, with a simpler structure and fewer parameters [73]. The GRU formulation is given by the following equations:
$$r_n = \sigma(W_r [h_{n-1}, z_n] + b_r),$$
$$e_n = \sigma(W_e [h_{n-1}, z_n] + b_e),$$
$$\tilde{h}_n = \tanh(W_h [r_n \odot h_{n-1}, z_n] + b_h),$$
$$h_n = (1 - e_n) \odot h_{n-1} + e_n \odot \tilde{h}_n,$$
where $\sigma$ represents sigmoid activation, $W$ represents weight matrices, $b$ represents biases, $\tanh$ is the hyperbolic tangent, and $z_n$, $r_n$, $e_n$, and $h_n$ are the input vector, reset gate, update gate, and output vector, respectively. These recurrent neural networks excel at capturing temporal dependencies in sequential data, making them applicable to tasks involving video analysis and tampering detection. The two approaches use different gating schemes to fuse information from previous time steps and thereby mitigate vanishing gradients.
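As an illustration of how a GRU consumes the sequence of reduced feature vectors, the sketch below implements one common formulation of the GRU forward pass in NumPy. The parameter names, the random initialization, and the convention used to combine the previous state with the candidate state are assumptions made for illustration, and no training is shown.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-x))

def init_gru_params(d, hidden, seed=0):
    """Randomly initialize the weights and biases of a single GRU layer."""
    rng = np.random.default_rng(seed)
    shape = (hidden, hidden + d)  # each gate sees [previous state, input]
    params = {}
    for name in ("update", "reset", "cand"):
        params["W_" + name] = rng.normal(0.0, 0.1, shape)
        params["b_" + name] = np.zeros(hidden)
    return params

def gru_step(x, h_prev, p):
    """One GRU time step: gate computation, candidate state, and state update."""
    hx = np.concatenate([h_prev, x])
    update = sigmoid(p["W_update"] @ hx + p["b_update"])
    reset = sigmoid(p["W_reset"] @ hx + p["b_reset"])
    rhx = np.concatenate([reset * h_prev, x])
    candidate = np.tanh(p["W_cand"] @ rhx + p["b_cand"])
    return (1.0 - update) * h_prev + update * candidate

def run_gru(V, p):
    """Feed the columns of a (d, n) feature matrix through the GRU in order."""
    h = np.zeros(p["b_cand"].shape[0])
    for k in range(V.shape[1]):
        h = gru_step(V[:, k], h, p)
    return h  # final hidden state summarizing the whole sequence
```

The final hidden state returned by `run_gru` is the vector that an FC layer with softmax activation would map to the three class probabilities.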
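To calculate the loss of the video tampering classification model, the cross-entropy loss function is applied, as provided below:
$$\mathcal{L} = -\sum_{i=1}^{|C|} y_i \log \hat{y}_i,$$
where $\hat{y}_i$ is the predicted score of class $i$ at the softmax layer, and $y_i$ is the ground-truth indicator of class $i$ (1 for the true class and 0 otherwise).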
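A small numerical sketch of the softmax activation and the cross-entropy loss for the three classes is given below. The class ordering and the function names are assumptions made for illustration.

```python
import numpy as np

CLASSES = ("authentic", "insertion", "deletion")  # assumed ordering

def softmax(logits):
    """Convert raw FC-layer scores into class probabilities."""
    shifted = logits - np.max(logits)  # subtract the maximum for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def cross_entropy(probs, true_index):
    """Cross-entropy for a one-hot target: minus the log-probability of the true class."""
    return -float(np.log(probs[true_index]))
```

For example, `softmax(np.array([2.0, 0.5, -1.0]))` assigns the largest probability to the first class, so the loss is small if the video is authentic and large if it was tampered with by deletion.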

Proposed Method for Localization
Once a video is detected as tampered, along with its tampering type, the subsequent task is to pinpoint the tampered region, which increases the trust of the end user. For localization, the SSIM among the features of reduced dimension is first computed: the feature matrix of reduced dimensionality, as presented in Section 3.2.3, is taken as input, the SSIM between consecutive features is computed, and the resulting sequence S contains all the SSIM values of the tampered video. SSIM utilizes the luminance, contrast, and structural information of the compared images. Tampering may result in abrupt changes or inconsistencies in the SSIM values, and three scenarios are distinguished. In the first scenario, no dominant falling points are detected in the SSIM curve, indicating that there are no inconsistencies. Such videos are classified as original; examples of this scenario are illustrated in Figure 4a-c. In the second scenario, a single obvious falling point is present, and the video is classified as tampered by a frame deletion attack. This falling point denotes the precise location of frame deletion tampering, as depicted in Figure 4d-f and Figure 4g-i for feature dimensions of 512 and 128, respectively. In the third scenario, two distinct falling points are observed, and the video is classified as tampered by a frame insertion attack. These falling points represent the start and end points of frame insertion tampering, as demonstrated in Figure 4j-l and Figure 4m-o for feature dimensions of 512 and 128, respectively. In this way, SSIM effectively localizes video tampering by identifying regions where the spatiotemporal characteristics of the video have been altered.
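The SSIM sequence over consecutive feature vectors can be sketched as follows. This is a simplified illustration: it applies the global (single-window) form of SSIM to one-dimensional feature vectors, and the constants `k1`, `k2`, the `data_range` default, and the function names are assumptions rather than details given in the paper.

```python
def ssim_1d(x, y, data_range=1.0, k1=0.01, k2=0.03):
    """Global SSIM between two equal-length feature vectors (no sliding window)."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1 = (k1 * data_range) ** 2  # stabilizes the luminance term
    c2 = (k2 * data_range) ** 2  # stabilizes the contrast/structure term
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator

def ssim_sequence(features):
    """SSIM between each pair of consecutive feature vectors."""
    return [ssim_1d(features[k], features[k + 1])
            for k in range(len(features) - 1)]

# Hypothetical features: an abrupt change between the 2nd and 3rd vectors.
features = [
    [0.1, 0.2, 0.3, 0.4],
    [0.1, 0.2, 0.3, 0.4],
    [0.9, 0.1, 0.8, 0.2],
    [0.9, 0.1, 0.8, 0.2],
]
S = ssim_sequence(features)
```

In this toy input, the SSIM stays at 1 between identical neighbors and drops at the discontinuity, which mirrors the falling points described for tampered videos.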
In the case of a video with deletion tampering, the lowest value of S represents the point of deletion and is represented as:

$$
p = \min_{k} S_k,
$$

where the minimum is taken over the sequence S and yields the lowest SSIM value p, representing a falling peak; the value of k at p gives the exact position of frame deletion. In the case of a video with insertion tampering, the two lowest values of the sequence S, i.e., p and q, represent the two falling peaks and are represented as:

$$
p = \min_{k} S_k, \qquad q = \min_{k \neq k_p} S_k,
$$

where $k_p$ is the index at which p is attained. The values of k at p and q represent the tampered locations, i.e., the smaller value of k denotes the start, and the larger value of k indicates the end point of insertion within a video.
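The two localization rules can be sketched directly from this description. The function names and the example SSIM values below are hypothetical; the sketch simply takes the lowest value for deletion and the two lowest values for insertion, without any smoothing or peak-separation logic.

```python
def locate_deletion(ssim_values):
    """Index k of the lowest SSIM value (the single falling point)."""
    return min(range(len(ssim_values)), key=lambda k: ssim_values[k])

def locate_insertion(ssim_values):
    """Indices (start, end) of the two lowest SSIM values, in temporal order."""
    two_lowest = sorted(range(len(ssim_values)),
                        key=lambda k: ssim_values[k])[:2]
    return min(two_lowest), max(two_lowest)

# Hypothetical SSIM sequence with falling points at k = 2 and k = 5.
S = [0.98, 0.97, 0.41, 0.96, 0.99, 0.38, 0.97]
deletion_point = locate_deletion(S)
insertion_span = locate_insertion(S)
```

A practical implementation would probably also need to ensure that the two selected minima belong to separate dips rather than to neighboring samples of the same one; that refinement is not shown here.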

Dataset and Evaluation Protocols
Due to the absence of extensive video datasets for detecting inter-frame tampering in surveillance videos [74], we developed the COMSATS Structured Video Tampering Evaluation Dataset (CSVTED), which contains challenging videos with different complexity levels, i.e., from simple to complex backgrounds, from a single object to random movement of multiple objects, and different lighting conditions such as morning, noon, evening, night, and fog. The videos are recorded by multiple cameras of different models, capturing both moving and static views; CSVTED covers all forms of tampering, including frame deletion, insertion, duplication, copy-move, and splicing. The videos in CSVTED have frame rates ranging from 12 to 30 fps and durations spanning from 5.648 to 75 s. Moreover, the videos in CSVTED portray natural scenes post-tampering and are available in popular formats such as avi, mp4, or mov. Illustrations of some frames from CSVTED are provided in Figure 5. Extensive experiments were carried out on numerous videos captured by the static cameras, varying the number of inserted/deleted frames per test video from 10 to 545. To evaluate the performance of our system, we used 10-fold cross-validation. Further details of the developed dataset are presented in Table 2.
For evaluation, we used standard metrics, including precision rate (PR), recall rate (RR), F1 measure, sensitivity (TPR), specificity (TNR), and detection accuracy (DA), which are given by

$$
\begin{aligned}
\mathrm{PR} &= \frac{TP}{TP + FP}, &
\mathrm{RR} &= \mathrm{TPR} = \frac{TP}{TP + FN}, \\
\mathrm{F1} &= \frac{2 \cdot \mathrm{PR} \cdot \mathrm{RR}}{\mathrm{PR} + \mathrm{RR}}, &
\mathrm{TNR} &= \frac{TN}{TN + FP}, \\
\mathrm{DA} &= \frac{TP + TN}{TP + TN + FP + FN},
\end{aligned}
$$

where TP represents the count of true positives (i.e., accurately identified instances of video forgeries); FP represents the count of false positives (i.e., incorrectly identified instances of video forgeries or falsely identified forgery type); TN represents the count of true negatives (i.e., accurately identified instances of video authentication); and FN represents the count of false negatives (i.e., incorrectly identified instances of video authentication). The method is deemed satisfactory when it achieves high values of TPR, TNR, and DA.
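The metrics named above follow directly from the four confusion counts. The sketch below uses their standard definitions; the function name and the example counts are hypothetical and are not results from the paper.

```python
def _ratio(numerator, denominator):
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

def evaluation_metrics(tp, fp, tn, fn):
    """Standard detection metrics computed from confusion counts."""
    pr = _ratio(tp, tp + fp)   # precision rate
    rr = _ratio(tp, tp + fn)   # recall rate, identical to sensitivity (TPR)
    return {
        "PR": pr,
        "RR": rr,
        "F1": _ratio(2 * pr * rr, pr + rr),
        "TPR": rr,
        "TNR": _ratio(tn, tn + fp),                 # specificity
        "DA": _ratio(tp + tn, tp + tn + fp + fn),   # detection accuracy
    }

# Hypothetical counts, for illustration only.
metrics = evaluation_metrics(tp=90, fp=10, tn=85, fn=15)
```

For these example counts, the detection accuracy is 175/200 = 0.875 and the precision is 90/100 = 0.9.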

Experimental Results and Discussion
Extensive experiments are carried out on a notebook computer equipped with an RTX 2070 graphics card, a Core i7 processor, and 32 GB RAM, running Python 3.8.12. The libraries used for the experimentation are OpenCV, Keras, sklearn, TensorFlow, Pandas, and matplotlib. VGG16 with LSTM and GRU is used to perform the experiments. Cross-entropy loss, a batch size of 32, a learning rate of 0.001, the Adam optimizer, and early stopping with a patience of 50 are employed to train the neural network models. Precision, recall, F1 score, TPR, and detection accuracy metrics are used for model evaluation. A statistical paired t-test is applied to compare the means and standard deviations of the LSTM/GRU models with different configurations to determine if there is a significant difference between these models. In this test, we conducted two replications of 10-fold cross-validation. For each replication, the available data are randomly divided into ten equal-sized sets. Each learning algorithm is trained on nine sets and tested on the remaining set. The experiments were run using two models, as previously described: standard LSTM and GRU, each with four different configurations. The details of the layers, along with the number of neurons in each layer, are provided in the subsequent sections.
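To make the model comparison concrete, the following sketch computes the t statistic of a paired t-test from per-fold accuracies of two models evaluated on the same folds. The accuracy values and the function name are hypothetical; obtaining the p-value additionally requires the Student t distribution, for which a library routine such as `scipy.stats.ttest_rel` could be used.

```python
import math

def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired samples of equal length."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    # Sample variance of the differences (n - 1 in the denominator).
    var_diff = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)
    # Note: the statistic is undefined if all differences are identical.
    t = mean_diff / math.sqrt(var_diff / n)
    return t, n - 1

# Hypothetical per-fold accuracies of two configurations on the same folds.
model_a = [0.90, 0.91, 0.89, 0.92]
model_b = [0.88, 0.90, 0.89, 0.90]
t_value, dof = paired_t_statistic(model_a, model_b)
```

The test is paired because both models are scored on identical folds, so the fold-to-fold difficulty cancels out in the differences.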

The Effect of L
The choice of the number of frames L in a clip has a significant impact on the performance of the proposed method. To show its effect and choose the best L, we selected a video (Moving Fish 01.mp4), which is tampered with by deletion, and computed the corresponding extracted feature matrix. Using this feature matrix, we computed the SSIM [76] among features. Figure 6 shows the SSIM curves for different choices of L.
While a large value of L, i.e., 10, as used by Sondos Fadl, Qi Han, et al. [21], reduces the temporal redundancy due to the slight differences among consecutive frames, it results in the loss of tampering traces, as shown in Figure 6b; L = 5 is relatively better than L = 10, but it does not produce as sharp a peak as L = 2. Further, adaptive L also does not show better performance than L = 2 in pinpointing the deletion spot in the video. L = 2 better preserves the temporal coherence and detects the deletion region in the video, consistent with the findings by Yoo and Chang et al. [77].
Similarly, we selected another video (Customer Dealing.mp4), which is tampered with via insertion. After computing the corresponding feature matrix, the SSIM among features is also computed. Figure 7 shows the SSIM curves for different choices of L. L = 10 and L = 5 do not produce two separate peaks (one peak for the start point and a second for the end of the insertion). Further, adaptive L does not show better performance than L = 2 in pinpointing the start and end points of frame insertion, which L = 2 marks with two sharp peaks, as shown in Figure 7d. L = 2 identifies insertion regions more reliably. In view of these observations, we set L = 2 in our experiments.
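The role of L can be illustrated with a small sketch that averages overlapping clips of L consecutive frames. Frames are represented here as flat lists of pixel values, and a stride of one frame between clips is assumed; both choices, as well as the function name, are assumptions made for this example and may differ from the authors' preprocessing.

```python
def average_clips(frames, clip_len=2, stride=1):
    """Pixel-wise average over sliding clips of `clip_len` consecutive frames."""
    if clip_len < 1 or stride < 1:
        raise ValueError("clip_len and stride must be positive")
    averaged = []
    for start in range(0, len(frames) - clip_len + 1, stride):
        clip = frames[start:start + clip_len]
        # zip(*clip) groups the values of the same pixel across the clip
        averaged.append([sum(pixels) / clip_len for pixels in zip(*clip)])
    return averaged

# Three hypothetical two-pixel frames.
frames = [[0.0, 0.0], [2.0, 4.0], [4.0, 8.0]]
pairs = average_clips(frames, clip_len=2)    # overlapping pairs (L = 2)
triples = average_clips(frames, clip_len=3)  # a single clip with L = 3
```

A larger L averages over more frames, which smooths each clip further; this is consistent with the observation that large L values blur the tampering traces.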

The Effect of Dimensionality Reduction
Dimensionality reduction reduces the redundancy among features and focuses on the most relevant and discriminative features for tampering detection, thereby enhancing detection performance. Figures 6 and 7 show the SSIM curves for frame deletion and insertion across various choices of L. While L = 2 yields sharp peaks, employing L = 2 with reduced feature dimensions results in even stronger peaks, as shown in Figures 6e and 7e. L = 2 with reduced feature dimensions makes the tampering traces considerably more pronounced, which yields better localization at a lower computational cost. We performed extensive experiments using features with different dimensions, and a comparison is given in Section 5.8. Dimensionality reduction proves advantageous for detecting inter-frame tampering, especially in the case of large-scale video datasets.

LSTM with Varying Depths
To assess the effectiveness of the LSTM network, we investigate the impact of various LSTM depths on classification performance. After the LSTM, an FC layer with softmax activation is incorporated. The number of LSTM layers is incremented from 1 to 4. The number of neurons is kept at 15 in all LSTM layers. To avoid overfitting, early stopping is employed with a patience of 50, where patience is a hyperparameter. The number of iterations is set to 300. The findings indicate that a single-layer LSTM performs well when compared to multiple-layer LSTM configurations across all dimensions. Figure 8 illustrates that as the number of LSTM layers increases, the network performance decreases, likely due to model overfitting. Table 3 presents the results of the paired t-test, with p-values calculated at a significance level of 5%, comparing the performance of the single-layer LSTM model with that of multi-layer LSTMs. Experimental results show that the single-layer LSTM outperforms the multi-layer LSTMs.
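The early-stopping rule can be sketched as follows. The sketch assumes that the monitored quantity is a validation loss that should decrease, which the text does not specify; the function name and the loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=50, max_epochs=300):
    """Return (stop_epoch, best_epoch), both 1-based, for a loss history."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0  # epochs elapsed since the last improvement
    history = val_losses[:max_epochs]
    for epoch, loss in enumerate(history, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch  # patience exhausted
    return len(history), best_epoch  # ran to the end of the budget

# Hypothetical history: improvement for 3 epochs, then a plateau.
losses = [1.0, 0.8, 0.7] + [0.75] * 60
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=50)
```

With a patience of 50, training in this example halts 50 epochs after the last improvement, well before the 300-iteration budget is reached.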

LSTM with Varying Numbers of Neurons
In this segment, the primary objective is to optimize the number of LSTM neurons while keeping the depth fixed, since the single-layer LSTM shows better performance than multi-layer LSTMs. We further assess the performance of the single-layer LSTM by varying the number of neurons from 15 to 150. Figure 9 depicts the classification performance of the single-layer LSTM model. With 512 feature dimensions, the performance decreases when the number of neurons is increased above 15; with 256 and 128 feature dimensions, the performance increases with the number of neurons, but it remains consistent once the number of neurons exceeds 100. Although a peak accuracy of 90.53% is achieved with 15 neurons and 512 feature dimensions, the accuracy of 90.12% achieved by the same model with 100 neurons and a much smaller feature size of 128 cannot be overlooked. Thus, the structure based on a single-layer LSTM (LSTM-L1) network with 15 neurons is suitable and recommended for inter-frame tampering detection.

GRU with Varying Depths
To test and verify the effectiveness of the GRU network, we explore the effects of different GRU depths on classification performance. After the GRU, an FC layer with softmax activation is added, yielding posterior probabilities of the classes. The number of GRU layers is increased from one to four. Specifically, a one-layer GRU comprises one GRU layer with 64 neurons; a two-layer GRU comprises 8 and 64 neurons; a three-layer GRU has 8, 32, and 64 neurons; and the numbers of neurons used in a four-layer GRU are 8, 16, 32, and 64. Figure 10 illustrates how varying the number of GRU layers affects the detection accuracy of the network. When features with 512 dimensions are used, there is a minor decline in detection accuracy as GRU layers are added, likely due to model overfitting, after which the accuracy remains consistent up to four GRU layers. Figure 10 reveals a clear trend in the case of features with dimensions of 256 and 128: the accuracy improves with additional GRU layers, up to three layers. This observation suggests that a higher number of GRU layers enables the model to capture more effective features when inputs of 256 and 128 dimensions are used. However, increasing the depth beyond a certain point may lead to overfitting due to increased model complexity. The outcome indicates that the performance of GRU-L2 with a feature dimension of 128 is very close to that of GRU-L1 with a dimension of 512, thereby reducing the computational complexity and showing the effectiveness of dimensionality reduction.
We utilize a statistical paired t-test to compare GRU models with various configurations, aiming to ascertain any significant differences between these models. Table 4 presents the p-values at a 5% significance level for the comparison between single-layer and multi-layer GRUs, indicating no significant difference when GRUs with different numbers of layers are employed. However, the highest accuracy of 90.73% is achieved by a single-layer GRU with 64 neurons.

GRU with Varying Numbers of Neurons
In this segment, the primary aim is to optimize the number of GRU neurons while keeping the depth fixed. Since the highest detection accuracy is achieved by the single-layer GRU, we further analyze the performance of the single-layer GRU by varying the number of neurons from 8 to 128. The experimental results reveal that performance increases with the number of neurons and remains consistent once the number of neurons exceeds 64 for 512 and 128 feature dimensions. Figure 11 illustrates the detection accuracy of single-layer GRU models, highlighting the optimal performance with 64 neurons when employing features with 512 dimensions, achieving a peak accuracy of 90.73%. When comparing LSTMs and GRUs with various configurations, it becomes evident that LSTM-L1, along with GRUs ranging from L1 to L4 with a feature dimension of 512 and GRU-L2 with a feature dimension of 128, exhibit superior detection accuracy. LSTMs are characterized by more gates and parameters, rendering them more flexible but also entailing higher computational costs and a greater risk of overfitting. In contrast, GRUs have fewer gates, fewer parameters, and require less computational power than LSTMs, making them simpler and faster.
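The difference in parameter counts can be made concrete with a short calculation. The sketch below uses the textbook formulation, in which each gate block has one weight matrix acting on the concatenated input and hidden state plus one bias vector (four blocks for an LSTM, three for a GRU). Framework implementations may add extra bias terms, so the exact counts reported by a given library can differ slightly; the function names are illustrative.

```python
def recurrent_params(input_dim, units, gate_blocks):
    """Trainable parameters of a recurrent layer with `gate_blocks` blocks."""
    weights = units * (input_dim + units)  # acts on [h_prev, x]
    biases = units
    return gate_blocks * (weights + biases)

def lstm_params(input_dim, units):
    """LSTM: forget, input, and output gates plus the candidate cell state."""
    return recurrent_params(input_dim, units, gate_blocks=4)

def gru_params(input_dim, units):
    """GRU: update and reset gates plus the candidate hidden state."""
    return recurrent_params(input_dim, units, gate_blocks=3)

# Same layer width and input size: the GRU needs three quarters of the parameters.
lstm_512_15 = lstm_params(512, 15)
gru_512_15 = gru_params(512, 15)
# Reducing the feature dimension shrinks the input-to-hidden weights.
gru_128_64 = gru_params(128, 64)
gru_512_64 = gru_params(512, 64)
```

For an input dimension of 512 and 15 units, this gives 31,680 parameters for the LSTM and 23,760 for the GRU; for a 64-unit GRU, reducing the input from 512 to 128 dimensions lowers the count from 110,784 to 37,056.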

Localization
Table 5 shows a comparison of the detection and localization accuracies of the proposed models with the state-of-the-art method for both types of tampering. The superior results of our method show its effectiveness for the detection and precise localization of frame insertion and deletion tampering.

Discussion and Comparison with State-of-the-Art
Based on the experimental results, the proposed system has been compared with similar state-of-the-art methods based on deep learning features [8,21,29,38] and handcrafted features [43,56]. The C3D-based method in Ref. [29] only detects frame-dropping forgery at the center of a single video clip of 16 frames, whereas the deletion may not necessarily be in the center; moreover, it is unable to identify other forms of inter-frame tampering. Unlike existing methods [8,29,38] that focus on detecting only one type of video tampering, the proposed approach can detect both types (i.e., insertion and deletion) simultaneously. This comprehensive detection capability enhances the system's effectiveness in identifying tampered videos, along with the types of tampering. The CNN-based approach in Ref. [20], selected for comparison, demonstrates the highest detection accuracy among the state-of-the-art methods presented in Section 2. However, it is important to mention that this accuracy is reported on a proprietary dataset developed by the authors, which is not publicly accessible. Table 6 reports the comparison parameters (precision, recall, F1 score, and detection accuracy) of the proposed and state-of-the-art methods on CSVTED. Compared with the method in Ref. [20], which requires tampered frames to be in multiples of 10 and cannot detect tampering involving fewer than 25 frames, the proposed technique exhibits higher sensitivity and can detect tampering involving smaller numbers of frames, thus enhancing its effectiveness in detecting subtle manipulations. Furthermore, the method in Ref. [21] does not show robustness, as it has low accuracy on our challenging CSVTED, and its localization is not precise.
Figure 12 shows the TPR and DA of the proposed systems and the state-of-the-art method on CSVTED. The experimental results indicate that our deep learning approaches (CNN-GRU and CNN-LSTM) are capable of capturing various hidden features from STP images using VGG16, leading to high accuracies for detecting inter-frame tampering. VGG16 has been trained on a large dataset, ImageNet, which contains a vast number of diverse images. As a result, the convolutional layers of VGG16 have learned rich representations of various visual patterns and objects. Leveraging these pre-learned features can be highly beneficial, especially when working with limited data or resources. Proposed Model-02 (2D-CNN with GRU-L1) achieves the highest accuracy of 90.73% while using a small feature size of 512 per STP image, as compared to the state-of-the-art size of 4096, which makes the proposed approach computationally efficient. The second proposed model, Model-06, demonstrates a 90.67% accuracy despite having a significantly smaller feature size, only 3% of the size of the state-of-the-art model. This not only underscores the effectiveness of employing an autoencoder to reduce the feature space, but also contributes to improving the extraction of discriminative features for the task of detecting video tampering, as depicted in Figure 4g-i and Figure 4m-o. Figure 13 compares the accuracy of detecting and identifying tampering types at the video level. Kumar and Gaur [56] reported a detection accuracy of 47% for frame deletion and 77% for frame insertion tampering. They stated that detecting tampering becomes highly complex when only a few frames are deleted from the sequence. The close relationship between the non-deleted frames at these locations contributes to this complexity, resulting in lower accuracy for detecting frame deletion forgery. In contrast, our Model-01 exhibits detection accuracies of 93% and 98.5% for frame deletion and insertion tampering, respectively, while Model-02 achieves detection accuracies of 94% for frame deletion and 99% for frame insertion tampering, even when dealing with a small number of tampered frames. This indicates that our method can efficiently detect both insertion and deletion tampering with high accuracy, without imposing strict constraints on the number of tampered frames, and represents a significant improvement over the accuracies of existing techniques in the field. Thus, LSTM-L1 and GRU-L1, with dimensions of 512, and GRU-L2, with dimensions of 128, outperform traditional methods and are recommended to detect multiple inter-frame tampering in videos.
The following are some advantages and limitations of the proposed system. First, the proposed system handles multiple types of inter-frame tampering in one 2D-CNN model, where each STP image of the video is processed independently; this avoids the extensive memory and complex computations involved in processing entire video sequences simultaneously, as required by 3D-CNNs. Second, the spatiotemporal feature fusion, coupled with LSTM/GRU to examine long-range dependencies among video frames, enhances the robustness of the detection process. Third, because the extracted feature dimensions are very high and lead to computational challenges, we employ an autoencoder to reduce the dimensionality of the feature space. This approach notably alleviates the computational burden of our method while maintaining high detection accuracy. Last, our method uses spatiotemporal deep automatic features, which achieve superior results in detecting, localizing, and determining the type of inter-frame tampering when compared to the state-of-the-art methods. It does not impose any constraints on frame rate, video format, the type of capturing device, or the number of tampered frames; it is capable of detecting tampering involving as few as ten frames.

However, there are some limitations. First, since our focus is on surveillance videos, the proposed system is unable to detect tampering if there is a scene change in the video. Second, it is observed that increasing the depth of the LSTM/GRU causes a slight decrease in performance. This may be due to model overfitting, because additional layers increase the complexity of the model, making it harder to train and optimize effectively; this limits our method to simple model architectures. For complex models, more training data is required. Last, dimensionality reduction may result in a loss of information. It is important to balance dimensionality reduction with the preservation of discriminative information relevant to tampering detection.
Overall, the proposed method offers a robust approach to video tampering detection by leveraging CNNs for feature extraction and considering inter-frame dependencies. Its ability to detect, localize, and determine the type of multiple inter-frame tampering, without imposing strict constraints on the number of tampered frames, leads to superior performance compared to that of state-of-the-art methods. To estimate computational cost, we monitor the running time of the models, as shown in Figure 14. Both the LSTM and GRU models exhibit their shortest running time, measured in hours, when utilizing a feature dimension of 128, and in this respect they are superior to the models with higher-dimensional features. Regarding detection accuracy, however, superior results are observed when employing features with 512 and 128 dimensions. Consequently, we do not compare the accuracy achieved with 256-dimensional features with these models. In the case of the LSTM model, the utilization of 512-dimensional features results in a runtime exceeding 9 h, whereas employing features with 128 dimensions requires approximately 5.5 h. Notably, the detection accuracy values of the GRU model closely align with those of the LSTM model, while requiring a shorter runtime. Specifically, the GRU model with 128-dimensional features had the shortest runtime of 1.5 h.
This underscores the effectiveness of autoencoders in reducing data dimensionality while preserving tampering information. Such dimensionality reduction, coupled with the preservation of key features, can substantially alleviate computational complexity, making autoencoders valuable tools for various tampering detection tasks based on deep learning.

Conclusions
Inter-frame tampering in surveillance videos undermines the integrity of video evidence, potentially influencing law enforcement agencies and court decisions. This paper proposes an innovative inter-frame tampering detection system that exploits deep feature fusion techniques. The system utilizes a 2D-CNN to extract features from the spatiotemporal average pooling of overlapped video clips. Overlapped clips better preserve the temporal coherence. To address computational challenges arising from the high dimensionality of the features, an autoencoder is employed. In our work, we show that deep features with reduced dimensionality are effective for video forensics, achieving high accuracy compared to high-dimensional feature approaches, which is beneficial for real-world applications where video data is collected continuously and in large volumes. Long-range dependencies between distant frames are also captured by LSTM/GRU. The proposed system is tested on our large dataset comprising 2555 videos, which includes challenging videos with different complexity levels, i.e., from simple to complex backgrounds, from a single object to multiple objects with random movement, and diverse lighting conditions such as morning, noon, evening, night, and fog. Notably, our technique does not restrict the minimum number of frames required to be inserted or deleted. Compared with handcrafted and deep feature-based methods, our approach offers a distinct advantage in detecting, localizing, and determining the type of inter-frame tampering (i.e., frame insertion and deletion), achieving over 90% accuracy, even with as few as ten tampered frames. Based on a thorough comparison of simple LSTM and GRU models using a statistical paired t-test, we recommend single/two-layer LSTM and GRU models with appropriate parameters for inter-frame tampering detection. The experimental results validate the efficacy of the proposed method in detecting, localizing, and determining the tampering type in videos, demonstrating superiority over the state-of-the-art techniques. In the future, the frame duplication type of tampering can also be considered. Additionally, we intend to enhance the system's capability to detect multiple inter-frame tampering in a single video, thus fortifying its utility in real-world applications.

Figure 2. Workflow of the proposed method for inter-frame tampering detection in a video. First, a video is split into frames, and the average of every consecutive pair is computed. The average frames are passed to a CNN for the extraction of features, which are fed to an autoencoder to reduce dimension, and then to an LSTM. The output of the LSTM is finally passed to an FC layer, with softmax activation, that yields the posterior probabilities of the classes.

Figure 3. Effect of using an autoencoder for reducing the dimensionality of the feature space; the tampering point is indicated by a falling peak marked with a red circle (Door Opening(N).mp4). (a) Autoencoder. (b) Reconstructed features. (c) Before dimensionality reduction. (d) After dimensionality reduction.

Figure 4. Examples of tampering localization of different cases; the first row (a-c) shows SSIM of features of original videos; the second row (d-f) shows SSIM of features of frame deletion tampering, without dimensionality reduction; the third row (g-i) shows SSIM of features of frame deletion tampering with dimensionality reduction to 128; the fourth row (j-l) shows features of frame insertion tampering, without dimensionality reduction; the fifth row (m-o) shows features of frame insertion tampering with dimensionality reduction to 128.

Figure 8. Deep LSTM network performance with different numbers of layers on different feature dimensions.

Figure 9. Classification performance of LSTM for different numbers of neurons.

Figure 10. Deep GRU network performance with different numbers of layers on different feature dimensions.

Figure 11. Classification performance of GRU for different numbers of neurons.

Figure 12. Performance rates of the proposed systems and state-of-the-art model [21] regarding CSVTED.

Figure 13. Comparison with the state-of-the-art methods for inter-frame tampering detection, along with type, at the video level [21,30,44,58].

Figure 14. Running time of two models (in hours) with different feature dimensions.

Table 2. Details of datasets used in the literature for inter-frame tampering detection.

Table 3. p-value using paired t-test for different deep LSTM networks.

Table 4. p-value using paired t-test for different deep GRU networks.

Table 5. Detection and localization of forged videos (in %) (DA: detection accuracy; LA: localization accuracy).

Table 6. Detailed results of our 2D-CNN-based network with LSTM, GRU, and different feature sizes regarding CSVTED.
Highest values are represented in bold corresponding to each tampering