1. Introduction
Over the past decade, social media content, including photos and videos, has seen a remarkable surge driven by the widespread availability of affordable devices like smartphones, cameras, and computers. The proliferation of social media platforms has facilitated the swift sharing of such content, resulting in exponential growth of online material and easy accessibility for users [1].
Simultaneously, there have been significant advancements in machine learning (ML) and deep learning (DL) algorithms, which are highly efficient in manipulating audiovisual content [1]. Unfortunately, this technological progress has also enabled the creation and dissemination of deepfakes, i.e., synthetic audio and video content generated using AI algorithms [2,3]. The rapid development of deepfake technology poses a serious threat [4], as it can be used to spread disinformation globally and potentially sway public opinion. In cases such as election manipulation or character defamation, the ease of spreading false information can be exploited.
As deepfake creation becomes more sophisticated, the authentication and verification of video evidence in legal disputes and criminal court cases could become increasingly challenging [5]. Ensuring the integrity and reliability of video submissions as evidence will demand significant scrutiny, particularly in the face of advanced deepfake techniques [6]. Moreover, the exponential growth of social media content and the evolution of deepfake technology raise concerns about the potential misuse and manipulation of information, demanding further attention from researchers, policymakers, and the technology community [7]. The production of high-resolution deepfake images relies on intricate algorithms commonly based on DL models such as generative adversarial networks (GANs). These complex DL techniques are crucial in creating realistic and convincing synthetic images [8].
The proliferation of deepfake technology gives rise to numerous concerns and potential dangers across various industries [9]. One significant area impacted is cybersecurity [10], where the ability to manipulate facial photos convincingly raises alarms about identity theft, deception, and unauthorized access to sensitive information. Moreover, the widespread use of deepfakes poses a substantial risk to public trust, as malicious individuals can exploit this technology to create deceitful visual cues, propagate misinformation, or tarnish the reputations of others [11]. Due to these issues, researchers and academics have been focusing on devising methods to detect and mitigate the adverse effects of deepfakes. By developing advanced approaches, they aim to safeguard individuals and organizations from the potential harms posed by this evolving technology [12]. This involves harnessing the progress made in computer vision, machine learning, and forensic analysis to detect crucial indicators of image manipulation and effectively differentiate between authentic and manipulated facial images [13].
Various approaches have been put forward to detect deepfakes, and a significant portion relies on deep learning techniques [14]. The United States Defense Advanced Research Projects Agency (DARPA) has initiated a media forensic research project to develop effective methods for detecting fake media [15]. This endeavor reflects the growing importance of addressing the challenges posed by deepfake technology in safeguarding the authenticity and credibility of digital media content [15]. Additionally, Facebook, in collaboration with Microsoft, has introduced an AI-based deepfake identification challenge. This joint effort signifies the industry’s commitment to combatting the risks associated with deepfake technology by fostering the development of advanced AI solutions for detecting and countering deceptive media content [16].
Recently, numerous prominent techniques have been put forward for identifying fake images. However, these models often exhibit limited generalization capability, leading to a drop in performance when faced with the latest deepfake or manipulation methods. Akhtar et al. [17] considered the Convolutional Neural Network (CNN)-based SqueezeNet [18], VGG16 [19], ResNet [20], DenseNet [21], and GoogleNet [22] in their study for the identification of face manipulation. The models demonstrated impressive accuracy when tested on the same manipulation type they were trained on. However, their performance declined when confronted with novel manipulations that were not part of their training dataset. To address these issues, this study adopts the Vision Transformer (ViT) model. During training, the input image is divided into patches, and each patch is treated as a separate token. The ViT employs self-attention modules to model the relationships between these embedded patches. The ViT has demonstrated exceptional performance in standard classification tasks by emphasizing important features while reducing the impact of noisy ones through its self-attention mechanism. Inspired by this perspective, this study proposes a deepfake image identification network based on the ViT. The experimental results indicate that the proposed network achieves satisfactory outcomes in deepfake image detection. This research contributes to the field in the following ways:
Our primary contribution lies in being the first to address this problem as a multi-classification task. No prior work has tackled this specific aspect, and our study represents a pioneering effort in this area. By approaching deepfake detection through the lens of multi-classification, we aim to enhance the accuracy and efficacy of identifying and categorizing deepfake content, thereby advancing the field’s understanding and capabilities in combating this evolving challenge.
We have compiled and curated our dataset specifically for multiclass deepfake identification. This dataset is carefully designed to facilitate the training and evaluation of our deepfake detection model, allowing us to explore the complexities of multiclass classification and improve the accuracy of deepfake identification.
The proposed fine-tuned ViT model exhibits superior performance to state-of-the-art deepfake identification models.
Following an extensive analysis, our research firmly establishes the remarkable robustness and generalizability of the proposed method, surpassing numerous state-of-the-art techniques. The findings validate the effectiveness and reliability of our approach in the field of deepfake detection.
The remainder of this paper is organized as follows. Section 2 surveys existing methods, emphasizing the role of the ViT. Section 3 outlines the methodology of the ViT’s application, while the experimental results showcase its effectiveness. The discussion interprets findings and outlines future implications for multimedia forensics in Section 4, and Section 5 provides the conclusion of this study.
2. Related Works
The proliferation of deepfake technology has ushered in a new era of challenges in the realm of multimedia forensics and information veracity. Prior research has underscored the need for innovative methods to detect and combat the manipulation of digital content [23]. Early efforts in deepfake detection centered around traditional signal processing and image analysis techniques. Researchers leveraged facial landmarks, inconsistencies in lighting, and unnatural facial movements as indicators of potential manipulation. However, the rapid advancement of GANs led to the creation of more convincing and challenging-to-detect deepfakes, necessitating a shift towards more sophisticated detection methods. Akhtar and Dasgupta [24] investigated the feasibility of utilizing local feature descriptors to recognize manipulated faces. Their study presented a comparative experimental analysis of ten local feature descriptors, employing the ‘DeepfakeTIMIT’ database as a testing ground.
Bekci et al. [25] presented a deepfake detection system that leverages metric learning and steganalysis-rich models to enhance performance against unseen data and manipulations. To evaluate the effectiveness of their approach, an empirical analysis was conducted using openly accessible datasets, including FaceForensics++, DeepFakeTIMIT, and CelebDF. The suggested framework demonstrated significant accuracy improvements ranging from 5% to 15% when faced with concealed modifications. Li et al. [26] investigated the differences in eye-blinking patterns between deepfake videos and those displayed by genuine human subjects. Based on their observations, they developed a novel eye-blinking detection technique tailored to identify deepfake videos specifically.
In their study, Nguyen et al. [27] used the eyebrow region as a set of features to identify deepfake videos. They applied four deep learning methods (LightCNN, ResNet, DenseNet, and SqueezeNet) for this purpose. The highest AUC (Area Under Curve) values obtained on the UADFV and Celeb-DF datasets were 0.984 and 0.712, respectively.
Patel et al. [28] introduced Trans-DF, a deepfake detection method relying on random forests. The Trans-DF model demonstrated impressive detection accuracy, achieving a high score of 0.902, highlighting its effectiveness in identifying deepfake videos. Another approach was presented by Yang and colleagues, utilizing SVM classifiers to differentiate between deepfake images and videos. Their method capitalized on variations in head poses as essential features for discrimination. Through the implementation of this technique, they created a system with a noteworthy AUROC score of 0.890, effectively detecting and distinguishing deepfake content.
Ciftci et al. [29] presented a pioneering technique to trace the origins of deepfake content by scrutinizing biological cues within residuals. This groundbreaking study marked the inaugural application of biological indicators in the detection of deepfake sources. The researchers performed experimental assessments on the FaceForensics++ dataset, incorporating numerous ablation tests to affirm the validity of their method. Notably, they attained a remarkable accuracy rate of 93.39% in source identification across four distinct deepfake generators. These results emphasize the efficacy of their proposed approach and its promising ability to accurately trace the roots of deepfake content.
In 2022, Yang et al. [30] introduced a deepfake detection model named MSTA_Net, leveraging machine learning techniques. This model specifically examined the texture properties of an image to discern abnormalities indicative of deepfake alterations. Unlike other approaches that focused solely on facial regions, the MSTA_Net model considered the entire image. By establishing connections between manipulated and unmanipulated areas within the image, the model identified irregularities in texture and signaling variations as potentially fake. Conversely, when no irregularities were detected, the image received a non-fake label, suggesting a higher likelihood of authenticity. Their proposed model facilitated the identification of genuine and manipulated images based on their overall texture characteristics. In recent studies, the prominence of multi-attentional and transformer models has grown significantly in the area of deepfake detection [31]. The multi-modal, multi-scale transformer model presented by Wang et al. [32] offers a promising approach to deepfake detection. By enabling the analysis of image patches at different spatial levels and utilizing multiple modalities, the model aims to improve accuracy and robustness in identifying deepfake content.
CNNs have demonstrated remarkable efficacy in detecting deepfake content, underscoring their importance in this field. Despite their proficiency in extracting features from small objects, CNNs may encounter challenges in precisely identifying key regions within an image. Leveraging a ViT model for deepfake identification presents an intriguing and promising alternative. ViTs were originally introduced for image classification tasks and have demonstrated strong performance on various computer vision benchmarks [33]. There are many reasons to choose ViTs for this study, of which the main ones are listed below.
Attention Mechanism: ViT models utilize self-attention mechanisms, which allow them to capture long-range dependencies within an image. This is crucial for detecting subtle inconsistencies and artifacts that might be present in deepfake images. Deepfake generation often involves stitching or blending different parts of images, and attention mechanisms can help identify these anomalies.
Global Context: Classic CNNs are great at pulling out details from specific areas, whereas ViT models take in the complete image as a sequence of patches, allowing them to grasp the global context. This difference can be beneficial for deepfake detection, as it lets the model scrutinize the overall structure and consistency of an image.
Robustness to Manipulations: ViT models might exhibit increased robustness to common manipulation techniques used in deepfake generation. Their attention mechanisms can potentially make them more resistant to simple modifications like noise addition or small alterations in pixel values.
Interpretable Attention Maps: ViT models generate attention maps that indicate which parts of an image are considered the most important for making predictions. These maps could provide insights into how the model distinguishes between real and deepfake images, aiding in understanding and improving the model’s decision-making process.
3. Proposed Methodology
This section presents the methods used and proposed to identify fake images accurately. These methods are designed to improve the precision and effectiveness of distinguishing fake content from genuine content.
3.1. Dataset
For our experiment, we utilized a dataset sourced from Kaggle [34], an online source [35], Stable Diffusion [36], and the StyleGAN2 encoding of Stable Diffusion [37]. We used the free TPU (Tensor Processing Unit) tier provided by Google Colab both to prepare the dataset and to run the research experiments.
Real Images: We used Kaggle [34] as the source of real images; due to limited computational power, we considered 10K images from this source.
Online Source: We obtained GAN-based fake images from an online source [35]. This source consistently provides new fake images with each visit, enabling us to access a diverse and up-to-date dataset for our analysis and experimentation.
Stable Diffusion: In this study, we curated a dataset focused on Stable Diffusion, specifically in the context of text-to-image conversion. Stable Diffusion text-to-image conversion involves a method for consistently generating high-quality images from textual descriptions. The primary objective is to create realistic and cohesive images that faithfully represent the provided textual descriptions. This approach utilizes advanced machine learning models and deep learning techniques to achieve this goal. The process of Stable Diffusion text-to-image conversion typically encompasses several key steps, including text encoding, image synthesis, and refinement. During text encoding, the textual descriptions are transformed into a format that the image synthesis model can process. Techniques such as word embeddings or attention mechanisms may be employed to capture the semantic meaning of the text. Following this, the image synthesis model utilizes the encoded text to produce a corresponding image, as illustrated in Figure 1. The image synthesis process is geared towards capturing the visual details and context outlined in the text description. To ensure stability and consistency in the image generation process, regularization techniques and control mechanisms may be incorporated. Stable Diffusion text-to-image conversion has various applications, including creative content generation, virtual world creation, and multimedia production. As this technology continues to advance, the generation of fake content and the potential for misuse of such tools are steadily increasing. This trend poses significant challenges and concerns in various domains, such as disinformation campaigns, image manipulation, and privacy breaches. Stable Diffusion is based on the conditional Latent Diffusion Model (LDM), and the LDM objective for conditional image pairs is given in Equation (1) [36]. In Equation (1), diffusion models can be understood as a series of equally weighted denoising autoencoders, denoted as $\epsilon_\theta(x_t, t)$ for t = 1...T. These autoencoders are trained to predict a denoised version of their input $x_t$, where $x_t$ represents a noisy version of the input x.
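For reference, the conditional LDM training objective is commonly written as shown below. This is a sketch that assumes Equation (1) follows the formulation in [36]; the symbols $\mathcal{E}$ (image encoder), $z_t$ (noisy latent at step t), $y$ (conditioning input such as a text prompt), and $\tau_\theta$ (conditioning encoder) are taken from that formulation rather than from the surrounding text.

```latex
L_{LDM} = \mathbb{E}_{\mathcal{E}(x),\, y,\, \epsilon \sim \mathcal{N}(0,1),\, t}
\left[ \left\| \epsilon - \epsilon_\theta\big(z_t,\, t,\, \tau_\theta(y)\big) \right\|_2^2 \right]
```

Under this reading, the network $\epsilon_\theta$ is trained to predict the noise $\epsilon$ that was added to the latent, which is equivalent to predicting a denoised version of its input.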
StyleGAN2 encoding of Stable Diffusion: This dataset is available on Kaggle [37] under the name Synthetic Faces High Quality (SFHQ). This dataset comprises high-quality 1024 × 1024 curated face images. It was created through a multi-step process. Firstly, a significant number of “text to image” generations were produced, primarily using Stable Diffusion v2.1, along with some from Stable Diffusion v1.4 models. Subsequently, a set of photo-realistic candidate images was generated by encoding these images into the latent space of StyleGAN2 and applying a small manipulation to enhance each image into a high-quality, photo-realistic candidate. This process ensured that the dataset contained diverse and visually appealing face images, enabling us to conduct comprehensive and accurate analyses in our research. StyleGAN2 is mathematically based on a generator network (G), a mapping network (F), a noise vector (z), a conditional vector, and a style vector (s) to produce the synthesized image; see Equation (2), which is used to synthesize the image x.
The style vector (s) is computed by the mapping network (F) with Equation (3).
In the context of StyleGAN2, the generator G and the mapping network F are trained to generate high-quality images by considering the style information (s) along with the noise (z) and conditioning inputs.
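One plausible way to write the relationships described for Equations (2) and (3) is sketched below. The symbol $c$ for the conditional vector and the exact argument lists of $G$ and $F$ are assumptions made for illustration; only $G$, $F$, $z$, $s$, and $x$ are named in the text.

```latex
x = G(s, z, c) \qquad \text{(synthesis, cf. Equation (2))}
s = F(z, c)    \qquad \text{(style mapping, cf. Equation (3))}
```

In this reading, the mapping network first turns the noise and conditioning inputs into a style vector, and the generator then combines that style with noise and conditioning to synthesize the image.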
In our research, we have ultimately focused on four distinct classes and taken the initiative to address the deepfake detection problem using a multiclass approach. By considering multiple classes (Real: 10,000, GAN_Fake: 10,000, Diffusion_Fake: 10,000, and Stable&Gan_Fake: 10,000), we aim to enhance the precision and reliability of our deepfake detection model, accommodating a broader range of deepfake variations and increasing its potential for real-world applications.
To overcome the challenge of class imbalance and potential model bias, we meticulously prepared the dataset in a balanced format. By ensuring each class has a similar representation, we aim to create a more equitable training environment for our deepfake detection model. This approach helps mitigate the impact of overrepresented or underrepresented classes, leading to a fairer and more robust model capable of accurately identifying deepfake content across all classes. Sample images from the prepared dataset can be found in Table 1.
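A balanced four-class dataset of this kind can be split so that every class keeps the same share in the training and test parts. The following is a minimal sketch of such a stratified split; the 80/20 ratio, the seed, and the function name `stratified_split` are assumptions for illustration and are not taken from the study.

```python
import random
from collections import Counter

# Class names and per-class counts as described in the text.
CLASS_NAMES = ["Real", "GAN_Fake", "Diffusion_Fake", "Stable&Gan_Fake"]
IMAGES_PER_CLASS = 10_000


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) keeping the class ratio equal in both parts.

    test_fraction and seed are hypothetical defaults, not values from the paper.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = int(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# One label per image; real image paths would be paired with these labels.
labels = [name for name in CLASS_NAMES for _ in range(IMAGES_PER_CLASS)]
train_idx, test_idx = stratified_split(labels)
train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
```

Because every class contributes the same number of images to each part, accuracy on the held-out part is not dominated by any single class.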
3.2. ViT Architecture
In this section, we introduce the ViT framework, delving into its core principles, structure, self-attention mechanism, multi-headed self-attention, and the mathematical foundations that shape its design. The ViT emerged in 2020 [38] as a groundbreaking paradigm in computer vision, revealing its potential to redefine our approach to image analysis and comprehension. Initially rooted in the Transformer architecture crafted for natural language processing, the ViT introduces a novel concept by treating images as sequences of tokens, commonly represented by image patches. With the transformer design, ViT adeptly processes these token sequences, enabling effective image analysis and understanding in a sequence-based manner.
A key strength of ViT lies in its adaptability and versatility. The foundational transformer architecture has demonstrated remarkable success across diverse tasks, including image restoration and object detection. This underscores the broad applicability and effectiveness of the ViT framework, positioning it as a potent tool in the field of computer vision with the potential to revolutionize our approach to image-related tasks [39].
Tokenization and embedding stand as crucial steps within the ViT architecture. When handling the input image, it undergoes initial division into a grid of non-overlapping patches. Subsequently, these patches are flattened and transformed into a higher-dimensional space through a linear operation, followed by normalization. This method endows the ViT model with the capability to capture both global and local information from the image, promoting comprehensive learning. It enables the model to effectively grasp the intricate features and context of the image. The synergy between tokenization and embedding plays a pivotal role in empowering ViT to excel in a variety of computer vision tasks.
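The tokenization and embedding steps can be sketched in a few lines of NumPy. This is an illustrative sketch only: the random image, the random projection matrix `w_patch`, and the function name `patchify` are hypothetical, and the normalization step mentioned above is omitted for brevity.

```python
import numpy as np


def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) -> (gh * gw, p * p * c)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)


rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))        # stand-in for a preprocessed input image
patches = patchify(image, 16)            # 196 patches, each of length 16 * 16 * 3 = 768

# Hypothetical linear projection into the embedding space (randomly initialized here).
w_patch = rng.normal(scale=0.02, size=(768, 768))
tokens = patches @ w_patch               # one embedding vector per patch
```

Each row of `tokens` corresponds to one image patch, so the image is handed to the transformer as a sequence of 196 vectors.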
The ViT architecture can be described mathematically by starting from the set of image patches extracted from the input image. Each patch is a vector representing a portion of the image. The set of patches is represented in Equation (4), where N is the number of patches.
The ViT model consists of several components, which are listed below (also see Figure 2).
Patch Embedding: The image patches are linearly projected to an embedding space by a linear transformation $W_{patch}$ (see Equation (5)).
Positional Embedding: Each patch embedding is augmented with positional information to capture spatial relationships. These positional embeddings are added to the patch embeddings (see Equation (6)).
Transformer Encoder: The transformer encoder processes the position-augmented embeddings $E_{pos}$. This encoder comprises several layers, each incorporating self-attention mechanisms and feedforward neural networks. The result of this encoding is a collection of contextualized embeddings, as depicted in Equation (7), where Z represents the output representations or embeddings produced by the Transformer encoder for each position in the input sequence.
Classification Head: The final contextualized embeddings Z are used for downstream tasks. In classification tasks, a classification head takes the average or a specific token’s embedding (e.g., classification token) from Z and passes it through one or more fully connected layers to make predictions.
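The four components above can be summarized in the following hedged sketch of Equations (4) to (7). The symbols $W_{patch}$, $E_{pos}$, and $Z$ appear in the text; $X$, $x_i$, $E_{patch}$, and $P$ are assumed names introduced here for illustration.

```latex
X = \{x_1, x_2, \ldots, x_N\}                      % set of N image patches, cf. Equation (4)
E_{patch} = X \, W_{patch}                         % linear patch projection, cf. Equation (5)
E_{pos} = E_{patch} + P                            % add positional embeddings P, cf. Equation (6)
Z = \mathrm{TransformerEncoder}(E_{pos})           % contextualized embeddings, cf. Equation (7)
```

The classification head then operates on $Z$, either on the averaged embeddings or on the embedding of a dedicated classification token.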
The ViT design centers around the Multi-head Self-Attention (MSA) mechanism, which plays a pivotal role in the model’s capabilities. MSA empowers the ViT to attend to multiple parts of the image simultaneously. It consists of distinct “heads”, with each head independently computing attention. By focusing on different regions of the image, these attention heads produce various representations, which are then concatenated to generate the final image representation. This approach enables the ViT to capture intricate interactions between input elements by attending to multiple sections simultaneously. However, this enhancement comes at the cost of increased complexity and computational requirements. The utilization of multiple attention heads and the subsequent aggregation of their outputs necessitate more computational resources. The mathematical representation of MSA can be seen in Equation (8).
In Equation (8), Q, K, and V stand for the query, key, and value matrices, respectively, and $H_1, H_2, \ldots, H_n$ represent the outputs of the individual attention heads. In the context of neural networks, particularly in transformers, a multi-head attention mechanism involves using multiple sets of attention weights (attention heads) to capture different aspects of relationships in the input data. Each $H_i$ is the output of the i-th attention head. The self-attention mechanism plays a pivotal role in transformers, serving as the foundational component for explicitly modeling interactions and relationships across all elements of a sequence in prediction tasks. Unlike CNNs, which depend on local receptive fields, the self-attention layer gathers insights and features from the entire input sequence, allowing it to capture both local and global information. This characteristic distinguishes self-attention from CNNs, as it promotes a more comprehensive interpretation and representation of information, leading to improved performance in various sequence-based tasks.
The attention mechanism involves computing the dot product between the query and key vectors, followed by normalization using SoftMax. Subsequently, it modulates the value vectors to generate an enhanced output representation, a task carried out in the CLS block.
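The dot-product, SoftMax, and value-weighting steps described above can be sketched as a small NumPy implementation of multi-head self-attention. This is a minimal sketch under stated assumptions: the weights are random, the 12 heads and 197 tokens (196 patches plus one classification token) follow the standard ViT-Base layout rather than values stated in the text, and the division by the square root of the head dimension is the usual scaled dot-product convention.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Return (output, attention_weights) for a token matrix x of shape (n, d)."""
    n, d = x.shape
    d_head = d // num_heads

    def split(t):
        # (n, d) -> (heads, n, d_head)
        return t.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # query-key dot products
    weights = softmax(scores, axis=-1)                    # each row sums to 1
    heads = weights @ v                                   # weighted sum of values
    concat = heads.transpose(1, 0, 2).reshape(n, d)       # concatenate the heads
    return concat @ w_o, weights


rng = np.random.default_rng(0)
n_tokens, dim, n_heads = 197, 768, 12                     # assumed ViT-Base layout
x = rng.normal(size=(n_tokens, dim))
w_q, w_k, w_v, w_o = (rng.normal(scale=0.02, size=(dim, dim)) for _ in range(4))
out, attn = multi_head_self_attention(x, w_q, w_k, w_v, w_o, n_heads)
```

The `attn` array holds one attention map per head, which is the quantity typically visualized when inspecting which patches the model attends to.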
Figure 2 shows the base abstract architectural diagram of the ViT model [38].
3.3. ViT Hyper-Parameters
In this study, the input images are first resized to 224 × 224 pixels during preprocessing and then divided into non-overlapping patches measuring 16 × 16 pixels. This patching step breaks each image down into smaller fixed-size patches, each 16 pixels wide and 16 pixels high.
The model employed in this study was pretrained on a large dataset known as ImageNet-21k. This dataset encompasses around 14 million images, categorized into 21,841 distinct classes, making it well suited to large-scale image classification tasks. The model’s architecture comprises 12 transformer layers, each with a hidden size of 768. Its overall capacity is reflected in its 85.8 million trainable parameters, which play a significant role in the learning process. The values and configurations of the parameters used in the ViT model are detailed in Table 2.
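The reported figure of roughly 85.8 million parameters can be cross-checked with a short calculation. The sketch below assumes the standard ViT-Base layout (MLP width of 3072, three input channels, one classification token, biases on all linear layers); of these, only the 12 layers, the hidden size of 768, the 16 × 16 patches, and the 224 × 224 input come from the text.

```python
def vit_backbone_params(image_size=224, patch_size=16, channels=3,
                        hidden=768, mlp_dim=3072, layers=12):
    """Count trainable parameters of a ViT encoder without the classification head."""
    num_patches = (image_size // patch_size) ** 2          # 14 * 14 = 196
    patch_dim = patch_size * patch_size * channels         # 16 * 16 * 3 = 768

    patch_embed = patch_dim * hidden + hidden              # linear projection + bias
    cls_token = hidden
    pos_embed = (num_patches + 1) * hidden                 # patches + classification token

    layer_norm = 2 * hidden                                # scale and shift
    attention = 4 * (hidden * hidden + hidden)             # Q, K, V, and output projections
    mlp = (hidden * mlp_dim + mlp_dim) + (mlp_dim * hidden + hidden)
    per_layer = 2 * layer_norm + attention + mlp

    final_norm = layer_norm
    return patch_embed + cls_token + pos_embed + layers * per_layer + final_norm


backbone = vit_backbone_params()
head = 768 * 4 + 4          # assumed linear head for the four classes
total = backbone + head
```

Under these assumptions the encoder alone accounts for 85,798,656 parameters, which rounds to the 85.8 million stated above; the four-class head adds only a few thousand more.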
Figure 3 showcases the abstract-level diagram illustrating the proposed methodology. This diagram provides an overview of the key components and steps involved (dataset preparation, preprocessing, splitting, model tuning, training, and evaluation) in our approach, offering a visual representation of how our method operates and achieves its objectives.
3.4. CNN Architecture-Based Pretrained Models
The primary objective of this study was to uncover and identify the most recently manipulated deepfake images, specifically those generated using Stable Diffusion and StyleGAN2. This research stands out as a pioneering effort not only in recognizing these cutting-edge manipulated fake images but also in addressing the challenge in a multiclass context.
To demonstrate the effectiveness of the patch-based approach over traditional CNNs and CNN-based pretrained models such as VGG16 and ResNet50, this study employed a fine-tuning approach. The models were preloaded with ImageNet weights through transfer learning. In this process, the network layers were frozen, and the last fully connected layers were omitted from the architectures.
To adapt these models for our purposes, a flatten layer was added in place of the removed fully connected layers, followed by a dense layer with four neurons, one per class. The activation function was set to SoftMax to handle the multiclass nature of the problem. This setup aims to show that, for manipulated deepfake image detection, the patch-based approach can outperform conventional CNNs and pretrained models. CNN-based models were selected for comparison mainly because of their local feature extraction.
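The adapted head described above (flatten, then a four-neuron dense layer with SoftMax) can be illustrated with a small NumPy sketch. The feature map here is random and merely has the shape (7, 7, 512) that a VGG16 convolutional base produces for a 224 × 224 input; the weights, the function name `classify`, and the initialization are hypothetical and do not reproduce the trained models.

```python
import numpy as np

CLASS_NAMES = ["Real", "GAN_Fake", "Diffusion_Fake", "Stable&Gan_Fake"]


def classify(feature_map, weights, bias):
    """Flatten frozen-backbone features and apply a dense SoftMax layer."""
    flat = feature_map.reshape(-1)              # flatten layer
    logits = flat @ weights + bias              # dense layer with four neurons
    logits = logits - logits.max()              # stabilize the exponentials
    exp = np.exp(logits)
    return exp / exp.sum()                      # SoftMax probabilities


rng = np.random.default_rng(0)
feature_map = rng.random((7, 7, 512))           # stand-in for frozen CNN features
weights = rng.normal(scale=0.01, size=(7 * 7 * 512, len(CLASS_NAMES)))
bias = np.zeros(len(CLASS_NAMES))

probs = classify(feature_map, weights, bias)
predicted = CLASS_NAMES[int(np.argmax(probs))]
```

Only `weights` and `bias` would be updated during training in this setup, since the convolutional base that produces `feature_map` is kept frozen.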