MINTIME: Multi-Identity Size-Invariant Video Deepfake Detection

In this paper, we introduce MINTIME, a video deepfake detection approach that captures spatial and temporal anomalies and handles instances of multiple people in the same video and variations in face sizes. Previous approaches disregard such information either by using simple a-posteriori aggregation schemes, i.e., average or max operation, or using only one identity for the inference, i.e., the largest one. On the contrary, the proposed approach builds on a Spatio-Temporal TimeSformer combined with a Convolutional Neural Network backbone to capture spatio-temporal anomalies from the face sequences of multiple identities depicted in a video. This is achieved through an Identity-aware Attention mechanism that attends to each face sequence independently based on a masking operation and facilitates video-level aggregation. In addition, two novel embeddings are employed: (i) the Temporal Coherent Positional Embedding that encodes each face sequence's temporal information and (ii) the Size Embedding that encodes the size of the faces as a ratio to the video frame size. These extensions allow our system to adapt particularly well in the wild by learning how to aggregate information of multiple identities, which is usually disregarded by other methods in the literature. It achieves state-of-the-art results on the ForgeryNet dataset with an improvement of up to 14% AUC in videos containing multiple people and demonstrates ample generalization capabilities in cross-forgery and cross-dataset settings. The code is publicly available at https://github.com/davide-coccomini/MINTIME-Multi-Identity-size-iNvariant-TIMEsformer-for-Video-Deepfake-Detection.


Introduction
We are witnessing day by day an increasingly rapid evolution in the field of synthetic multimedia content generation. With the continuous advancement of Generative Adversarial Networks (GANs) [21] capable of generating highly credible images and videos, distinguish reality from fiction is a daunting task at best. Among synthetic content, the most dangerous are those involving human faces, i.e., deepfakes. Exploiting synthetic images and videos of people or manipulating existing ones has generated several defamation campaigns against public figures, celebrities or even regular citizens whose lives have been undermined by manipulated content 1 .
In response to this sudden and uncontrollable evolution, many efforts have been made to counteract deepfakes by implementing a multitude of deepfake detection systems based on a variety of approaches. However, it is evident that being able to distinguish manipulated from pristine content introduces many challenges and there are still many open issues. In this work, we focus on those issues that we consider Figure 1. The impact of the deepfake detection strategy in a case of video containing multiple identities and variations in the faceframe area ratio. The proposed approach is the only one capable of identifying spatio-temporal anomalies and at the same time effectively handling cases of multiple identities, variations in face size and without the use of aggregations that would impact the result. crucial for detecting deepfakes 'in the wild ' and have yet to attract the research community's attention. Previous studies [5,22,49] have pointed that identifying temporal anomalies and inconsistencies between two frames of the same video, is fundamental for successful deepfake detection. Much work, however, tends to focus on spatial anomalies only, handling videos with a frame-by-frame classification approach and relying on naive aggregation schemes to extract video-level predictions [1,6,13,29]. When videos are analyzed frame-by-frame or divided into separately classified sequences, the problem of their aggregation to obtain a video-level prediction also arises. We can divide the approaches into a-posteriori aggregation, i.e. those that use static functions as average or maximum to obtain the final result, and internal aggregation, i.e. those that let the model analyze the entire video as it is and so decide internally the final video-level prediction. Methods based on the first approach are very sensitive to the choice of aggregation function as shown in [5,10].
An additional problem resides in a vulnerability that attackers can exploit to deceive a deepfake detector. In cases of videos with multiple distinct people (identities) appearing together [30], an attacker could decide to manipulate only one of them. However, if the detection is carried out en bloc for all the detected faces, the negative contribution to the final prediction made by the fake faces could be 'overshadowed' by the non-manipulated ones, thus deceiving the system. One possible approach would be to split the video into several clips, at least one for each identity, but this would require multiple forward passes of the model, and there are currently no techniques that can handle any number of identities in a single one. These approaches also rely on an aggregation policy for the distinct predictions to obtain an overall video-level prediction, which significantly affects the final result. Furthermore, in general, previous works in the literature have put minimal effort into this specific situation, which, however, is extremely frequent in real-world verification tasks.
Typically in deepfake detection systems, faces are detected from the video or image to be classified, and before being given as input to a classification model, it is resized to be uniform with all the others. This may result in an essential loss of information, namely the size of the subject's face as a ratio to the rest of the scene. Such information could be exploited to build models robust to environmental clutter, e.g., a sudden change of distance from the camera or a blur resulting from an original small image, which can distinguish better anomalies introduced by the deepfake generation process.
Finally, the creation of a deepfake also tends to introduce specific anomalies within images and videos. Hence, deepfake detectors often end up learning to recognize only the ones included in the training dataset and are therefore ineffective when dealing with unseen manipulations, demonstrating poor generalization capability [9,13,16,23,31,37,45]. Furthermore, many approaches may be ineffective in real-world cases because they are trained and validated on very constrained situations where, for example, there is only a single subject in the video or people tend to always stay at the same distance from the camera [50].
To overcome the above challenges, we present the Multi-Identity size-iNvariant TIMEsformer (MINTIME) for video deepfake detection. The main novelties presented by our approach, which are also illustrated in Figure 1, are: • Ability to identify both spatial and temporal anomalies through a combination of TimeSformer [7] and Convolutional Neural Networks (CNNs), unlike other hybrid models designed for deepfake detection [14,29] working exclusively in a frame-by-frame manner.
• Ability to handle multiple people in the same video through the introduction of a novel positional embedding technique, namely Temporal Coherent Positional Embedding inspired by the traditional one but capable of maintaining both spatial and temporal coherence and a new type of attention, namely Identity-aware attention, capable of keeping track of the identity to which each face detected in the video refers.
• Ability to handle variations in the face-frame area ratio through the introduction of size embeddings that keep track of the ratio between the detected face area and the entire frame at each instant of time.
• Being unaffected by aggregation strategies through internal aggregation obtained by analyzing the entire video in a single sequence and let it decide the videolevel prediction and being able to handle multi-identity videos even with a single forward pass. This way, the model directly returns a single prediction for the entire video without requiring additional post-processing.
The performance of the proposed system has been evaluated in a multitude of different contexts, with an improvement of up to 14% AUC in videos containing multiple identities. It has been also validated in some cross-forgery and cross-dataset scenarios, and exceeds the state-of-the-art in all contexts, with improvements in the AUC score of up to 22% in some cases, also demonstrating a high level of generalization.

Related Work
The growing interest in deepfakes has fuelled the emergence of numerous solutions to detect them in a variety of ways [47]. Generally speaking, what practically all these methods seek to achieve is to correctly classify a video as pristine or deepfake by looking for anomalies of various kinds in it. As we focus on video deepfake detection, we summarize previous works into two categories based on the type of anomalies they focus on.
Space-Only These methods often treat the video as a series of frames by performing a separate classification for each of them and then aggregating them into a final overall classification using an a-posteriori aggregation scheme such as the average or maximum function. These approaches focus primarily on identifying specific anomalies introduced by deepfake generation methods, namely those of a spatial nature. Even though they are most suited for image deepfake detection, where there is no need for detecting temporal anomalies, they are often applied to videos in a frame-by-frame manner. Most of them are based on well-established Deep Learning techniques such as CNNs [1,6,27,38] but recently also some attention-based approaches have been proposed [16,25,44]. Some previous works also propose hybrid architectures that combine Vision Transformers with various types of CNNs as in [14] exploiting an EfficientNet-B0 [42] as a features extractor to feed a Cross Vision Transformer [11]. Aggregation in this case is done by making the maximum among the predictions obtained on the individual frames. A similar approach but using an XceptionNet [39] is the proposal in [29]. However, all these methods are unable to pick up temporal anomalies that can be crucial in deepfake detection.
Spatio-Temporal These approaches aim to capture temporal anomalies and obtain a video-level classification without any kind of aggregation. Some of them focus on specific types of spatio-temporal artifacts common in the case of deepfake videos, such as anomalous lip movement [23] or inconsistent eye-blinking [33], but they are clearly limited since they do not look for additional anomalies that may be present in other parts of the face. Proposals have also been made that exploit the optical flow of video [4] or analyze the relationships between audio peaks and video content [35]. A more complete proposal was made by [22], which uses a self-supervised approach to learn effective frame representations to train an audiovisual, cross-modal, student-teacher framework used then to train a deepfake detector. The authors of [3] also proposed a method that tries to detect interframe anomalies via LSTM networks looking for temporal inconsistencies. Finally, one of the most relevant methods focusing on temporal incoherence within videos is FTCN [49], which uses a Temporal Transformer Encoder at its core. Although these methods efficiently handle videos by capturing spatio-temporal anomalies, they do not consider several important nuances of the problem. For example, they often construct the input sequence by selecting a single person from the video, even when there are more than one, and do not take into account the different face-frame area ratios that may occur or vary in them. An attempt to handle cases of multiple people in the video comes from [14] in which identities are classified separately and if even one of them is manipulated then the whole video is considered fake but this method is frame-based and relies on an ad hoc aggregation scheme (average or max) also requiring a forward pass for each frame. Incorrect handling of these situations can lead to completely ignoring a manipulated subject in favor of a perfectly pristine one or mistaking variations in the face size for anomalies. An aspect that has been partially exploited in deepfake detection is the identity of the subjects filmed in videos. Some works attempt to detect deepfake by identifying a person from temporal facial features, specific to how a person moves while talking [15] or by the usage of biometrics analysis techniques [2]. However, these latter approaches remain particularly effective mainly in the case of a major manipulation of the person's identity and have limited capability in detecting videos manipulated by other types of deepfake generation methods. They are also often very specific to an identity and are ill-suited in cases where the subject under consideration is not a famous person and there are few videos depicting him or her.

Proposed Approach
The proposed Multi-Identity size-iNvariant TIMEsformer (MINTIME) architecture is illustrated in Figure 2. It is capable of receiving as input a sequence of N faces containing one or more identities and detect whether the video has been manipulated. MINTIME is versatile and able to efficiently adapt to a multitude of real world complex settings as described in Section 1. In the following paragraphs we present in detail the proposed MINTIME novelties.

Adaptive Input Sequence Assignment
To enable the model to handle multiple identities within one video we generate a single input sequence composed by faces of all the considered identities. The identities are  . Sequence sampling when the number of allowed faces in the sequence is inherited through identities. The numbers on the right indicate, for each identity, how many faces can be used, how many are available, and how many have been inherited from or to another identity. reordered according to the size of the faces within them, and the most important identities are given a higher number of slots to be exploited in the input sequence, in order to give more importance to faces that cover a larger area and are therefore likely to be more relevant in the video, as opposed to smaller faces. The length of the input sequence, and so the available slots, is fixed in our case to 16. The number of faces for each identity is fixed and based on the number of identities in the video. If there is only one identity in the video, the faces detected by it will be used to fill the entire sequence, otherwise the sequence is filled by distributing the space between the various identities and giving higher priority to those containing faces that occupy a larger area. In case where there are fewer faces in the considered identity than necessary and no other identities are available, more empty images are added and then a mask is used to drive the calculation of attention properly, ignoring these blank images. In the case of longer sequences, however, uniform sam-pling is performed as shown in Figure 3. As a kind of input sequence diversification, this uniform sampling is performed in order to alternate various combinations of detected faces. Indeed, in this way at different epochs, the faces chosen for a given video are not necessarily always the same and the probability of catching the anomalies increases. Formally, the output of this process will be the sets of tuples X = {(x 1 , I 1 , s 1 , t 1 ), (x 2 , I 2 , s 2 , t 2 ), ..., (x N , I N , s N , t N )}, where N is the length of the input sequence. Each tuple contains the face image x i ∈ R 3×H×W , where 3 is the number of channels of RGB image and H and W are its height and width, respectively, their corresponding identity depicted in the input video I i ∈ N, their faceframe area ratio s i ∈ (0., 1.] and their timestamps t i ∈ N corresponding to the frame index. This set represents the information given as input to the system to perform the detection.

Backbone network
We use a CNN backbone to generate features as input to the TimeSformer to help it more easily grasp the spatial information of the analyzed faces during training. In our case, an EfficientNet-B0 inspired by [14] and an XceptionNet as in [29] were chosen as CNN backbones. The input images are then transformed into features by this CNN backbone, which can be described as C : R 3×H×W → R D×H ×W . The resulting features is then where H and W are the dimensions of the feature map resulting from the transformation and depend on the chosen backbone.

Temporal Coherent Positional Embedding
We evolved the positional embedding presented in the original Transformers paper [43] in order to ensure temporal consistency between frames as well as spatial consistency between tokens. Tokens are numbered in such a way that two faces, of different identities but belonging to the same frame, have the same numbering. Temporal coherence is maintained locally by having an increasing numbering sequence related to the frames from which the faces have been detected. It is also maintained globally by being generated on the basis of the global distribution of frames of all identities in the video. In this way, a more significant variation simply due to a greater sparsity of the frames considered is not interpreted as an anomaly. We called this approach Temporal Coherent Positional Embedding (TCPE). Figure  4 depicts an example case where, although the two identities have the same number of faces, they are extracted from different frames and this information is encoded in this embedding. The intervals of integer values E i ⊂ E ⊂ N to be assigned to the tokens related to the faces detected at frame i is obtained as: where i is the frame index associated with the tokens derived from the face that needs to be enumerated, N is the total number of faces detected and T is the number of tokens for each face in the input sequence. The numbering associated with tokens thus acquires both spatial meaning, i.e. incremental for tokens of the same face, and temporal meaning, i.e. incremental with respect to all faces of the video. This concept may be formally described by the function T E(·) that T E : N → R D , z i = z i + T E(t i ).

Size Embedding
As explained in Section 1, not considering the variation of the face-frame area ratio can lead to misclassification errors. To handle this situation, we further enriched each token with a Size Embedding, a numeric index generated on the basis of the face-frame area ratio. The numerical value that the size embedding assigns to tokens belonging to a where L is fixed to 20. For example, if a face has an area covering the 16% of the entire frame, the size embedding associated with its tokens will be 3 because it falls in the fourth interval [15%, 20%). Formally, this is a function

Identity-aware Attention
We use the Divided Space-Time Attention TimeSformer, which was found to be the best performing one [7]. Attention is calculated spatially between all patches in the same frame, but then it is also calculated between the corresponding patches in the next and previous frames. As far as spatial attention is concerned, no further effort is required for this to be applied to our case. Not being interested in capturing the relationships between faces of different identities, the calculation of temporal attention in our context is carried out exclusively between faces belonging to the same identity. All faces, however, influence the CLS which is global and unique for all identities. Figure 5 shows how atten-tion is calculated exclusively by tokens referring to identity 0 faces (green), ignoring those referring to identity 1 faces (red) and vice versa, while all refer to the global CLS.
Formally, let Z ∈ R (N HW +1)×D be the set of all the tokens in which Z I0 ∈ R 1×D is the global CLS token and Z Ii ∈ R N I i HW ×D is the set of all tokens related to identity I i , where N Ii is the number of faces for identity I i . Consider M Ii ∈ {0, 1} (N HW +1)×1 as the mask of identity I i indicating to the model which faces attend to an identity and which to another during the attention calculation. Then, our proposed Identity Attention (IA) can be formulated as: where the attention calculation Att(·, ·) takes place according to [7], with the first argument corresponding to queries and the second to key-values, and denotes element-wise multiplication with broadcasting.

Dataset
A study of the deepfake detection video datasets in the literature was first conducted to find one that would include temporal anomalies, various face-frame area ratios, videos with several people in the same scene at the same time, and varied in terms of deepfake generation methods and perturbations. For these reasons, ForgeryNet [26] was chosen as the most suitable dataset for our experiments. It consists of 221,247 videos, 99,630 pristine, and 121,617 manipulated videos with a frame rate between 20 and 30 fps and variable duration. Of those, 11,785 videos include multiple identities with a maximum of 23 faces per video. We analyzed in the supplementary material, the face-frame area ratio of videos in this dataset discovering that it is very variable with videos containing faces covering an area up to almost, in some rare cases, even 100% of the entire frame.

Preprocessing
All the faces of the depicted subjects are first detected, using the MTCNN face detector [48]. The detected faces are then clustered by calculating the similarity between the face embeddings extracted via an InceptionResnetV1 pretrained on FaceNet [41] and grouping them into identities, as proposed in [10]. These identities are then used for the construction of the input sequence as illustrated in Section 3.1. In order to ensure greater generalization capability, in the training phase, the sequences undergo a process of data augmentation in which many random perturbations inspired by [26] are applied. We report more details about the preprocessing phase in the supplementary material.

Training Setup
MINTIME was trained in two main versions, a) MINTIME-EF, which uses an EfficientNet-B0 [42] as feature extractor and trained on DFDC dataset [17] as in [14], and b) MINTIME-XC, which uses an XceptionNet [39] as feature extractor (inspired by [29]) and trained on ForgeryNet images as in [26]. The two versions differ also in the training modality in fact the first was trained keeping fixed the whole convolutional backbone excluding the last 2 blocks, with a batch size of 8 and on an NVIDIA RTX 3060, while MINTIME-XC was trained End2End with a batch size of 32 in parallel on four NVIDIA A100 for 30 epochs. The optimizer used is SGD with a learning rate of 0.01, which decays to 0.0001 using a cosine scheduler. The weight decay was set to 0.0001 as in [7]. All models were trained considering a maximum of two identities per input sequence, which does not limit the possibility of using more or fewer identities at inference time. The maximum number of faces we put in the input sequence was set to 16. Since the SlowFast [20] model trained in [26] was not available, we retrained it starting from the model pretrained on Kinetics 400 [28] in order to see how it behaved in certain contexts not reported in the original paper. This is the only training conducted in which a single identity per video is considered in order to emulate what has been done in [26].

Metrics
To evaluate the performance of our model, we used the accuracy and AUC metrics in most contexts. This is because they are widely used in the literature by previous methods and because they are highly indicative in a binary classification context such as this. Two additional metrics were also occasionally used in the tables below, namely the False Positive Rate (FPR) and the Maximum Accuracy Variation (MAV), calculated as F P R = F P T N +F P and M AV = max(A) − min(A), where FP and TN stand for False Positives and True Negatives respectively and A is the set of accuracy scores obtained by the model on all the classes. The FPR was chosen as a metric because it is important to have a system in the real-world that does not arise too many false detections while the MAV is used to give an idea of how much the models generalize among the various forgery methods.

Comparison with State of the Art
ForgeryNet Evaluation According to Table 1, MINTIME-XC outperforms the state-of-the-art on the ForgeryNet validation set in terms of AUC and is quite similar to what was obtained with a SlowFast in terms of accuracy, which is, however, limited to considering a single identity in the classification phase. All MINTIME models also prove to be particularly robust at analyzing videos considering a variable number of identities without a significant loss of overall accuracy. By testing the trained models considering only videos containing more than one person in the scene, MINTIME-XC significantly outperforms the trained state-of-the-art models by correctly classifying most of the videos considered as shown in Table 2.
Interestingly, SlowFast R-50 trained considering only one identity per video performs poorly on multi-identity videos as it is heavily influenced by the choice of this single identity to be analyzed. This is certainly one of the most interesting results obtained as it concretely highlights how necessary it is to create models capable of handling multiple identities in the same video and not simply rely on selection criteria for these. In our case, we consider identities containing faces that cover larger areas of the video to be more important but we report a deeper analysis of the impact of this choice in the supplementary material where we show that our approach has no loss of performance when this choice is varied. We also report our experiments highlighting the impact of Temporal Positional Embedding, Identityaware Attention, and Size Embedding on performance as a further ablation study.
Looking at the accuracy obtained on the eight ForgeryNet methods in Table 3, it becomes clear how both our models, and MINTIME-XC in particular, succeed in achieving excellent results on each forgery method. SlowFast R-50 appears to be particularly superior to the other models presented in recognizing video manipulated by method 8, namely ATVG-Net [12], but looking at the MAV it is evident that this result is probably achieved by simply learning the anomalies introduced by this specific forgery method. SlowFast R-50 seems to show more pronounced fluctuations in accuracy than MINTIME-XC, which remains more consistent across different methods and achieves a lower FPR.
Generalization analysis As pointed out in [13] the generalization capability of a deepfake detection model is crucial in order to be used in the wild. In Table 4, we report the results obtained from the proposed models by training them on a subset of forgery methods and then testing them on the remaining ones. In particular, the models were trained on the so-called Identity-Remained methods, i.e. those which manipulate the video while preserving the identity of the manipulated subject, and then tested on both videos affected by Identity-Remained and by Identity-Replaced techniques (i.e. videos in which the identity was replaced). We also repeated the process but training on Identity-Replaced methods and testing on both. According to Table 4, MINTIME-XC appears to be capable of state-of-the-art results with a considerable capacity for generalization to unseen methods, which is significantly higher than that of the baseline methods. We conducted further analysis to see how the model behaves when trained on a single forgery method and tested on all others. For this experiment, the training set is composed by selecting all pristine videos and all the videos generated with a single forgery technique. The trained models are then tested on all the methods. The results obtained are summarized in the heatmap in Figure 6, where a good generalization capability is evident in most of the contexts considered. In fact, although in all contexts the model achieves higher accuracy on training methods, it still manages to recognize many videos edited with techniques not observed during training, and this is crucial in order to be able to apply it in real-world verification tasks. Method 8 differs from the others probably because of the very specific anomalies introduced in it that struggle to be recognised by our model when trained with other forgery methods. MINTIME-XC trained on the ForgeryNet training set was then tested on the DFDC Preview test set [18]. In Table  5 we provide a comparison with other methods tested in this context. Although our model is trained on a different dataset from other previous works, it achieves an excellent level of generalization with high AUC score in a cross-dataset context.

Qualitative Evaluation
All our models have been trained to perform binary classification of the entire video but, in the case of deepfake videos, a hypothetical final user might be interested not only in knowing whether the video has been manipulated but also at what instant and if there is more than one tam-  pered person. These are typical requirements when we want to expose these systems to end users (e.g. journalists) [5]. We can retrieve this information by analyzing the attention values obtained on the various faces which compose the input sequence. Indeed, it has been empirically shown that when the video is pristine, there are no relevant alarms of detection, as shown in Figure 7a, while in the presence of a deepfake video, the model pays more attention on the faces containing the anomaly, as shown in Figure 7b. By analyzing the attention values it is easy to trace which identity has been manipulated and at what instant(s) the anomaly is present. Examples of outcomes from the model are shown in Figure 8 and it can be seen that in all cases the proposed model is able to identify the fake identity, if any, even in crowd situations like 8b. Interesting is the case in Figure 8c where there is a false face detection due to the filmed subject's t-shirt but the model still manages to realize that the manipulated face is that of the man.

Conclusions
In this research, we identified a number of challenges encountered when trying to perform video deepfake detection in the wild and proposed an approach to overcome them. The proposed model, called MINTIME, has led to state-of-the-art results on the dataset under consideration, ForgeryNet and demonstrated high generalization level also in cross-forgery and cross-dataset contexts, often outperforming previous approaches. It can effectively handle instances of videos containing multiple people without any form of predictions aggregation, faces that vary in size relative to the total frame area, and all while capturing both spatial and temporal anomalies through the use of a modified version of TimeSformer. Finally, the trained models can be easily interpreted from the observation of attention values allowing for more fine-grained predictions and extracting useful information for the final user. As future work, it would be interesting to investigate how to leverage the audio component of videos, as a complementary input to the spatial and temporal components.   Table 6. Datasets statistics on face-frame area ratios of the detected faces.

A. Additional Preprocessing Details
Face Detection and Extraction To perform deepfake detection it is necessary to first identify and extract faces from all the videos in the dataset. Our proposed model takes as input a sequence composed of faces detected by means of a state-of-the-art face detector, i.e. MTCNN [48] with an additional 30% surface area to include a portion of the background as done in the literature [5,14]. To make sure that the subsequent preprocessing steps are not polluted by false detection, the face detection threshold was set at a rather high value of 95%.
Identity Clustering Having to manage multi-identity videos and wanting to detect temporal and not just spatial anomalies, it is necessary to cluster the faces in each video on the basis of their similarity and maintaining the temporal order of their appearance in the frames. To do this, a clustering algorithm was developed which groups the faces extracted from the videos into sequences. The algorithm is based on [10] and it is structured as follows. Firstly, the features of each face are extracted via an InceptionResnetV1 pretrained on FaceNet [41]. The similarity s between each face and all the other faces identified in the video is calculated using dot product as follows: where f i and f j are the features extracted from two distinct faces and N is the total number of faces detected in a video. A graph is constructed with hard connection if the similarity is higher than a fixed threshold and a set of clusters is constructed ignoring smallest clusters. In particular, two nodes i, j are connected with an edge if their similarity is greater than a predefined threshold θ [10]. The faces inside the clusters are temporally reordered and the clusters are enumerated based on mean faces size from the largest to the smallest.
Data Augmentation In order to ensure the highest level of generalization and to make the input sequences as varied as possible, a strong data augmentation was applied based on what was also applied in the creation of the ForgeryNet dataset used in our experiments. In particular, each time that a sequence is given as input to the model, 31 randomly selected transformations are applied in a uniform way for all the video, such as image compression, various types of blur techniques, image flip and invert, color editing, random noise, cutout and others.

B. Face-frame area ratios analysis
Having to deal with the problem of varying face-frame area ratio in deepfake detection, we analysed the various available datasets to find one that contained different situations to test our model. In order to conduct this analysis, we carried out the detection of all the faces in the videos of the analyzed datasets and calculated the percentage ratio between the area occupied by the face in relation to that of the entire frame. As shown in Figure 9, ForgeryNet [26] appears to have a wide variety of face-frame area ratios clearly superior to DFDC [17], which tends to have more standard shots, making it particularly suitable for our purposes. Further statistics can be found in Table 6 where a significantly higher variance and better statistics for ForgeryNet are evident.

C. Ablation study
As a final analysis, to understand whether the new features introduced in the architecture are relevant to obtain the reported results, some variants of the models were trained by disabling them in part. Size-embedding is introduced to handle certain cases that rarely occur in a deepfake detection dataset but can be very common in the real world. Nevertheless, even in a more standard context such as a dataset created specifically for deepfake detection, as Table 7 shows, its introduction yelds better results in terms of accuracy and AUC.
Furthermore, if we consider only videos containing more than one person in the same scene, we can see in Table 8 that the proposed model performs best when Multi-Identity Attention and Temporal Coherent Positional Embedding are used, demonstrating that these mechanisms contribute to a better management of multi-identity situations.
Finally, in order to explore the impact of the identity reordering policy on the results and ensure that no bias was induced during the training phase, we carried out some tests using three different techniques. When constructing the input sequence, it is in fact necessary to decide in which order to insert the various identities, in case of more than one, inside it. In the training phase, the average of the areas of the faces associated with each identity was used as a criterion (Size-Based), giving greater importance to the most prominent identities in the scene. Therefore, an identity containing more prominent faces in the video would have been inserted earlier in the sequence and would potentially have occupied a larger portion of it. As a test, at inference time, we also tried using the number of faces associated with an identity as a criterion, thus giving more weight to the most frequent identities (Frequence-Based) as done in [26], and finally we also compared a random criterion. As can be seen from Table 9, the three different strategies do not particularly affect performance of our proposed model and the only one that has a slightly significant loss is Random-Based, probably due to the fact that in some rare cases, by reordering the identities randomly and selecting the top two, the fake identity could be discarded in favour of other identities in the scene. Quite the opposite situation on a SlowFast, which, considering a single identity per video is strongly affected by the choice of the latter. The results are therefore very different from case to case, leading to better results in random-ordering simply by chance in the choice of the fake identity.