Speech Driven Video Editing via an Audio-Conditioned Diffusion Model

Taking inspiration from recent developments in visual generative tasks using diffusion models, we propose a method for end-to-end speech-driven video editing using a denoising diffusion model. Given a video of a talking person and a separate speech recording, the lip and jaw motions are re-synchronised without relying on intermediate structural representations such as facial landmarks or a 3D face model. We show this is possible by conditioning a denoising diffusion model on audio mel-spectral features to generate synchronised facial motion. Proof-of-concept results are demonstrated on both single-speaker and multi-speaker video editing, providing a baseline model on the GRID audiovisual data set. To the best of our knowledge, this is the first work to demonstrate and validate the feasibility of applying end-to-end denoising diffusion models to the task of audio-driven video editing.


Introduction
The idea behind audio-driven video editing is to re-synchronise the lip and jaw movements of an actor in a video in response to a new speech input signal. This new speech signal may come from the original speaker or from a different one, in order to fix mistakes captured in the original recording or to dub over the original voice. Regardless of the source of the speech, it is imperative that the performance of the actor is never diminished. No matter how the lip and jaw movements change in response to the new audio, the facial expressions and emotions portrayed by the actor should remain as close to the original performance as possible.
Achieving such seamless audio-driven video editing is an exciting prospect for the entertainment industry, one with the potential of being applied to movies, TV shows, live streaming, and even home-made content uploaded to platforms such as YouTube, TikTok, and others. Giving video content creators the option to edit their work without having to go through time-consuming and expensive re-shoots allows them to work with a greater tolerance for error during filming. Additionally, achieving true audio-driven video editing will enable its use within automatic dubbing pipelines. This will have a huge impact on the world of cinema and television, allowing for the further democratisation of video content, making it significantly easier and more cost-effective to dub English-language movies, TV shows, and videos into other languages and vice versa. With the rapid advances in deep learning and talking head generation techniques, this exciting prospect is getting closer and closer to reality.
Generally, speech-driven video generation approaches can be grouped into two distinct types: structured and unstructured generation. Structured generation refers to techniques that use the speech signal to first generate an intermediate structural representation of the face, before using said structure to aid in rendering the photo-realistic frame, an approach followed by works such as [8,31,69,88,91]. Unstructured generation techniques such as [20,30,73,90], on the other hand, utilise image reconstruction techniques to generate the photo-realistic frame directly in an end-to-end manner.
Diffusion models [60] are a relatively new class of generative model that has been gaining traction due to strong performance on image synthesis tasks, outperforming traditionally state-of-the-art GAN (Generative Adversarial Network) [22] based methods in some instances [17]. In recent months, diffusion models have been applied to image-to-image translation tasks [53,57], video generation [27,86], audio synthesis [11,35], and many others [81]. Utilising conditioning signals such as text and even images, diffusion models have shown that they can be trained and conditioned towards generating a specific desired output at inference time with relative ease [53]. Unlike GANs, they also achieve high mode coverage while maintaining high sample quality. These properties make them an ideal candidate for the task of unstructured audio-driven video editing, a task that has thus far been dominated by GAN-based approaches [9,12,73]. As part of this work we make the following contributions to the field:
• A novel unstructured end-to-end approach for audio-driven video editing based on the architecture proposed in Palette [57], a denoising U-Net model trained for image-to-image translation tasks. We instead condition the network on audio frames and train it to inpaint the lower half of the face such that the lip and jaw movements are synchronised to the conditioning audio signal. We train on single-speaker and multi-speaker versions of the GRID data set [14]. We demonstrate convincing results despite access to limited data and training hardware. All project code, and the trained single-speaker and multi-speaker models, are made available to the public.
• We introduce a simple conditioning mechanism for the task of audio driven video editing with diffusion models. We condition the network using mel-spectrogram features combined with the previously generated frame (to maintain temporal stability) to generate the next frame in the sequence. To our knowledge, this is the first attempt at using an audio signal to condition a diffusion model to generate an image (video frames in our case).

Audio Driven Video Generation
"Deep fakes", as they have commonly been referred to in public discourse, are synthetic video or image content of one or more people generated by a deep learning algorithm. Audio-driven video editing is a research topic that falls under the much broader scope of such "deep fake" research, specifically under audio-driven deep fakes, which are the main focus of this section. For a broader review of the literature surrounding deep fakes and the various techniques used to generate them, we recommend [45,70].
As touched upon earlier, audio-driven deep fakes can be categorised by whether or not they are generated by leveraging an audio-driven structural representation of the face. There have been numerous approaches over the years relating to the former, including [2,7,10,16,19,31,39,56,66,68,74,75,79,91], which generate a set of 2D facial landmark co-ordinates from audio, and [8,15,32,37,52,62,63,69,76,77,83,84,85,87], which predict expression parameters from audio to drive a 3D face model. What these approaches all have in common is that they use these intermediate structural representations as input to a separate neural rendering model, which is typically trained as an image-to-image translation task to generate the final photo-realistic image frame. As of the date of this submission, GAN-based [22] approaches such as Pix2Pix [29], CycleGAN [93], and other variations have proved immensely popular for this task; however, it would not be surprising to see diffusion models being used for this in the very near future given their success so far on traditional image-to-image tasks [57].
Non-structural (end-to-end) methods, on the other hand, utilise latent feature learning and image reconstruction techniques to generate a photo-realistic video sequence from an input speech signal and a reference image or video in an end-to-end manner. Approaches such as [9,20,30,36,43,49,65,73,89,90,92] have seen much success in recent times. Each of these approaches differs from the one used in this paper, as they are all GAN- or VAE-based (variational autoencoder [34]) methods, while ours leverages a denoising diffusion model. While current end-to-end approaches suffer from lower output resolution than structural methods, there is a lot of potential for improvement, especially by exploiting diffusion models' ability to synthesise high quality samples while maintaining good mode coverage and diversity.
Diffusion models are a class of generative probabilistic models that consist of two steps: 1) the forward diffusion process, which steadily adds small amounts of random Gaussian noise to the data over a series of time steps until the data is destroyed, and 2) the reverse diffusion process, where a learning algorithm is trained to restore structure in the data by steadily removing noise over a series of time steps. The trained model can then start from a sample of random Gaussian noise and steadily denoise it over a series of time steps to attain the desired output.
Sohl-Dickstein et al. [60] developed the first diffusion model and coined the term. Ho et al. [25] combined denoising score matching with Langevin dynamics [64] and diffusion models to synthesise images. This ignited a steady interest in diffusion models, with Nichol et al. [47] building upon the work of [25] and showing that, by making small adjustments to the diffusion process, they could sample data faster and achieve competitive log-likelihoods with minimal impact on sample quality. They also found that training diffusion models with more computational power typically led to better sample quality. Chen et al. [11] and Kong et al. [35] applied diffusion models to the task of audio synthesis, succeeding in generating high quality samples. Dhariwal and Nichol [17] demonstrated that diffusion models can beat GANs on image synthesis, and also introduced the concept of "classifier guidance" for conditional generation.
As diffusion models are trained under a single loss and do not rely on a discriminator, they are more stable during training and do not suffer from typical issues associated with training GANs, such as mode collapse and vanishing gradients. They produce high quality output samples and, unlike GANs, display high mode coverage [78]. Despite these advantages, their sampling speed is very slow due to the need to run the backwards diffusion step hundreds or thousands of times on the same sample to denoise it completely. Xiao et al. [78] and Rombach et al. [53] attempted to speed up the sampling and training times associated with diffusion models, with the former proposing to model the denoising distribution using a complex multi-modal distribution in order to facilitate larger diffusion steps, and the latter applying diffusion models in the latent space of a pretrained autoencoder to reduce the complexity. This is an ongoing focus of research in the field, and more works tackling the inference and training speed problem are likely to emerge.
The work in this paper builds upon the work presented in Palette [57], a denoising diffusion model trained specifically for the task of image2image translation, which in turn was heavily influenced by the 256x256 class conditional U-NET of [17]. We utilise the same overall architecture as discussed in Palette with a few variations in the hyperparameters. Additionally, we modify the training procedure for the task of video editing, and introduce a feature concatenation mechanism for conditioning the network using speech mel-spectrogram features, as well as information related to the previously generated frame in the sequence so that the network can generate temporally coherent frames.

Problem Formulation
We frame the problem of audio-driven video editing as an inpainting task with a few key changes. Traditionally, inpainting is an image-to-image translation task where a neural network must learn to fill in a masked-out region of the image with realistic content. For video editing, we must provide the network with additional context to help guide its generation process. As our approach works on a frame-by-frame basis, we must show the network the preceding frame in addition to the current masked frame. This ensures temporal stability between consecutive frames. Additionally, audio information related to the previous, current, and future frames must be provided as well. Future audio frames are included to ensure that lip movements produced by plosives, the sounds associated with the letters "p, t, k, b, d, g", are correctly generated. This is because the lip movements associated with plosives form before the sound is spoken. We concatenate all this information to the image channels of the current frame that is being edited, and pass the result through the network. Fig. 2 depicts an overview of this process.
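The channel-wise concatenation described above can be illustrated with a small sketch. The channel counts (3 + 3 + 1), the toy 4x4 resolution, and the helper names `zeros` and `concat_channels` are illustrative assumptions, not details taken from the released implementation.

```python
def zeros(channels, height, width):
    """Build a channels x height x width tensor of zeros as nested lists."""
    return [[[0.0] * width for _ in range(height)] for _ in range(channels)]


def concat_channels(*tensors):
    """Concatenate CHW tensors along the channel axis, checking spatial sizes match."""
    height, width = len(tensors[0][0]), len(tensors[0][0][0])
    stacked = []
    for tensor in tensors:
        for channel in tensor:
            if len(channel) != height or any(len(row) != width for row in channel):
                raise ValueError("all inputs must share the same height and width")
            stacked.append(channel)
    return stacked


# Hypothetical shapes: RGB frames plus one audio feature map, at a toy 4x4 resolution.
masked_current = zeros(3, 4, 4)  # current frame with the lower face masked out
previous_frame = zeros(3, 4, 4)  # preceding frame, provided for temporal stability
audio_features = zeros(1, 4, 4)  # mel features covering previous/current/future audio
network_input = concat_channels(masked_current, previous_frame, audio_features)
```

Under these assumed shapes the network would receive a 7-channel input; the only hard requirement in this sketch is that every conditioning signal shares the spatial size of the frame being edited.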

Dataset
We rely on the GRID [14] audio-visual speech data set to carry out the work in this paper. This is a multi speaker data set consisting of 34 speakers (18 male, 16 female), with each speaker uttering 1000 short 6-word sentences. We train three models: 1) A single speaker model trained with videos from speaker S1. 2) A multi-speaker model trained on 10% of the data set, where we keep speakers S1, S33, and S34 unseen to the network. 3) A fine tuned single speaker model trained on top of the base multi speaker model to determine whether faster convergence was possible while achieving similar results to the base single speaker.

Audio Processing
The audio from the GRID [14] data set is recorded with a sampling rate of 25 kHz. From the audio we compute mel-spectrogram features with non-overlapping windows of length 40 ms and 256 mel bands. We choose 256 so that when we concatenate the audio features to the image channels as depicted in Fig. 2, the dimensions will be the same as the target image. Alternatively, one could use a linear transformation on top of the standard spectrogram of  Fig. 1. We find that conditioning with audio via concatenation is a rather simple approach that works; however, in future work we plan to explore more complex techniques such as feeding audio embeddings through various layers of the U-Net.
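To make the windowing arithmetic concrete, the sketch below frames a signal into non-overlapping 40 ms windows at 25 kHz. The function names are hypothetical, and `hz_to_mel` uses the common HTK-style mel formula as an assumption; the text does not state which mel variant or filterbank implementation was used.

```python
import math


def samples_per_window(sample_rate_hz, window_ms):
    """Number of audio samples covered by one analysis window."""
    return sample_rate_hz * window_ms // 1000


def split_into_windows(signal, window_len):
    """Split a signal into non-overlapping windows, dropping any trailing remainder."""
    n_windows = len(signal) // window_len
    return [signal[i * window_len:(i + 1) * window_len] for i in range(n_windows)]


def hz_to_mel(freq_hz):
    """HTK-style Hz-to-mel conversion (an assumed convention, not stated in the text)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


# A roughly 3 second clip at 25 kHz yields one 40 ms window per 25 fps video frame.
window_len = samples_per_window(25_000, 40)
windows = split_into_windows([0.0] * 75_000, window_len)
```

With these values each window holds 1000 samples, and a 3 second clip produces 75 windows, one for each video frame at 25 frames per second.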

Video Processing
First, we resize all videos in the data set to 256 x 256 pixels from the default 360 x 288. This is done to decrease the training time by decreasing the number of pixels, and to ensure that the image can be accepted as input by the U-Net. For every frame in the videos we must compute the region that should be masked out. Using an off-the-shelf facial landmark extractor [40], we compute the facial landmark co-ordinates to determine the position of the jaw. We then use these co-ordinates to mask out the bottom portion of the face, starting just below the midpoint of the nose, as depicted in Fig. 1.
We apply the rectangular face mask to data samples at train time, before they are fed into the neural network.
There is a very important reason for applying such a rectangular face mask rather than a mask in the shape of a face: to hide the jaw contour. There is a very strong correlation between lip movement, jaw movement, and overall head pose. Should the jaw contour be visible to the network, the network will learn to predict the lip movements from that alone, discarding the audio conditioning signal entirely and treating it as noise. This is a real problem in audio-driven video editing that is yet to be addressed in the literature, especially when working on relatively simple data sets such as GRID [14]. We discovered that applying such a rectangular face mask reduces this problem significantly, although it does not eliminate it. When testing our single speaker model using silence as input, the lip movements are significantly diminished, though movements do occasionally still occur. This is because for every individual speaker there exists a slight correlation between head pose and lip movement that the model learns, especially when trained on a single speaker. We suspect that by training on a larger single speaker data set, with more variety in both the head pose and background of the speaker, the network will pay even more attention to the conditioning speech signal in order to drive the mouth movements.
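A rectangular mask of this kind could be derived from landmark co-ordinates roughly as follows. This is only a sketch: the landmark grouping (`nose_points`, `jaw_points`), the padding value, and the fill value are hypothetical choices, and the exact mask extents used in the paper are those shown in Fig. 1.

```python
def lower_face_mask_box(nose_points, jaw_points, pad, height, width):
    """Return (top, bottom, left, right) of a rectangle hiding the lower face.

    The top edge sits just below the vertical midpoint of the nose; the other
    edges enclose the jaw landmarks plus padding so the jaw contour is hidden.
    """
    nose_ys = [y for _, y in nose_points]
    top = (min(nose_ys) + max(nose_ys)) // 2 + 1
    left = max(0, min(x for x, _ in jaw_points) - pad)
    right = min(width, max(x for x, _ in jaw_points) + pad)
    bottom = min(height, max(y for _, y in jaw_points) + pad)
    return top, bottom, left, right


def apply_mask(image, box, fill=0):
    """Overwrite every pixel inside the box (bottom/right exclusive) with `fill`."""
    top, bottom, left, right = box
    return [
        [fill if top <= r < bottom and left <= c < right else px
         for c, px in enumerate(row)]
        for r, row in enumerate(image)
    ]


# Toy 10x10 single-channel frame with (x, y) landmarks.
box = lower_face_mask_box([(4, 2), (4, 5)], [(1, 4), (4, 8), (7, 4)], 1, 10, 10)
masked = apply_mask([[1] * 10 for _ in range(10)], box)
```

Because the box is axis-aligned and padded beyond the jaw landmarks, the masked region carries no information about the jaw outline, which is the property the paragraph above argues for.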

Audio Video Alignment
Each video frame is aligned with the 40 ms of audio preceding it and the 80 ms after it, totalling a 120 ms window of audio information that is used to condition the network when generating the lip and jaw movements for each frame. Care must be taken when choosing the audio window: if it is too large, the network will not use the most meaningful information available to it; if it is too small, there may not be enough context for the network to generate the more complex lip movements caused by plosives. We arrived at our 120 ms window through empirical tests and observations. Note that image frame 0 does not have any audio preceding it. Instead of discarding it, we use it to commence the generation process, acting as an initial "identity" frame. At inference time, each generated frame is then fed back into the network as the "previous frame" to generate the next frame in the sequence, maintaining temporal coherence. At train time, we use the previous real frame.
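The alignment rule can be written down as simple index arithmetic. The sketch below assumes 25 fps video (one frame every 40 ms) and one mel window per 40 ms of audio; the function names are hypothetical.

```python
FRAME_MS = 40  # one video frame at 25 fps, matching the 40 ms audio windows


def audio_window_ms(frame_index, past_ms=40, future_ms=80):
    """Start and end time (in ms) of the audio window conditioning a video frame."""
    frame_time = frame_index * FRAME_MS
    return frame_time - past_ms, frame_time + future_ms


def mel_window_indices(frame_index):
    """Indices of the 40 ms mel windows (previous, current, future) for a frame."""
    if frame_index < 1:
        # Frame 0 has no preceding audio and serves as the initial identity frame.
        raise ValueError("frame 0 is not generated; it seeds the sequence")
    start_ms, end_ms = audio_window_ms(frame_index)
    return list(range(start_ms // FRAME_MS, end_ms // FRAME_MS))
```

For example, frame 5 would be conditioned on mel windows 4, 5, and 6. How the final frame of a clip is handled, where the future window runs past the end of the audio, is not specified in the text and is left open here.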

Model Architecture
We follow the general U-Net [54] architecture described by [57], which in turn is based on the model proposed by [25] with modifications inspired by the works of [17,59]. For this work we use a lightweight version of the 256x256 U-Net architecture described by [17], minus the class conditioning mechanism. Like [57], we also introduce additional conditioning of the source image via the concatenation of our audio features and the previous frame. Tab. 1 displays the hyper-parameters we use to train our diffusion model for the task of audio-driven video editing. Notably, we omit the use of attention within the up/downsampling layers of the U-Net in an effort to speed up training. For all of our experiments, we train using a batch size of 10 per GPU on four 32 GB V100 GPUs in parallel. For inference, we rely on a single GPU to generate the videos, as each frame depends on the previously generated frame and generation is therefore sequential.
A diffusion model is defined as having two steps, the forward diffusion process where the data is gradually destroyed, and the learned backwards diffusion process which reconstructs the data, and is used during training and inference.

Forward diffusion process
As defined by [60], the forward diffusion process is a Markov chain that adds small amounts of noise to the data $y$ over a predefined number of time steps $T$, until the data is completely destroyed at time step $t = T$. This state is represented as $y_T$, with $y_0$ representing the data before any noise was added to it. The Markov chain is defined by:
$$q(y_{1:T} \mid y_0) = \prod_{t=1}^{T} q(y_t \mid y_{t-1}),$$
where at each step, Gaussian noise is added by:
$$q(y_t \mid y_{t-1}) = \mathcal{N}\left(y_t;\ \sqrt{\alpha_t}\, y_{t-1},\ (1 - \alpha_t) I\right),$$
with $\alpha_t := (1 - \beta_t)$ representing the hyperparameters of our fixed noise scheduler. [25] show that it is possible to sample $y_t$ at any step $t$ in closed form:
$$q(y_t \mid y_0) = \mathcal{N}\left(y_t;\ \sqrt{\bar{\alpha}_t}\, y_0,\ (1 - \bar{\alpha}_t) I\right),$$
with $\bar{\alpha}_t := \prod_{s=1}^{t} \alpha_s$. This is an important observation, as it significantly speeds up the forward diffusion process, and can be used to train the model on the fly with random noise levels at each forward step.
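A minimal sketch of closed-form forward sampling follows, drawing y_t from a Gaussian with mean sqrt(alpha_bar_t) * y_0 and variance (1 - alpha_bar_t). The linear schedule and its endpoints (1e-4 to 0.02) are assumptions borrowed from common practice; the paper's actual scheduler settings are those listed in Tab. 1. Images are represented as flat lists of floats for simplicity.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (endpoint values are assumed)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s), for every step t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def q_sample(y0, alpha_bar_t, noise=None):
    """Sample y_t directly from y_0 without iterating over intermediate steps."""
    if noise is None:
        noise = [random.gauss(0.0, 1.0) for _ in y0]
    scale, sigma = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [scale * y + sigma * e for y, e in zip(y0, noise)]


alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
```

Because `alpha_bars` shrinks towards zero as t grows, the signal term vanishes and the sample approaches pure Gaussian noise, which is the "completely destroyed" state described above.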

Backwards diffusion process
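Given a noisy image $\bar{y}$ defined as:
$$\bar{y} = \sqrt{\bar{\alpha}}\, y_0 + \sqrt{1 - \bar{\alpha}}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I),$$
the goal of the backwards diffusion process is to learn a model that can denoise the noisy image and restore the original image $y_0$. Following the approach in [57], we train a neural network $f_\theta(x, \bar{y}, \bar{\alpha})$ to predict the noise added at time $t$, optimising the $L_{simple}$ objective proposed by [25]:
$$L_{simple} = \mathbb{E}_{(x, y_0),\, \epsilon,\, \bar{\alpha}} \left\| f_\theta(x, \bar{y}, \bar{\alpha}) - \epsilon \right\|_2^2,$$
where $x$ represents the conditioning audio and previous frame input to our network, $\bar{y}$ the noisy image, and $\bar{\alpha}$ the noise level. During training, we only calculate the loss for the masked region of the face to save on compute, following the approach in [57].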
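Restricting the noise-prediction loss to the masked region can be sketched as below. Pixels are flattened into plain lists, the squared-error form is assumed from the L_simple objective of [25], and the function name is hypothetical.

```python
def masked_noise_loss(predicted_noise, true_noise, mask):
    """Mean squared error between predicted and true noise over masked pixels only."""
    total, count = 0.0, 0
    for pred, true, inside in zip(predicted_noise, true_noise, mask):
        if inside:
            total += (pred - true) ** 2
            count += 1
    if count == 0:
        raise ValueError("mask selects no pixels")
    return total / count


# Only the first and third pixels lie inside the mask, so the second is ignored.
loss = masked_noise_loss([1.0, 5.0, 3.0], [0.0, 0.0, 1.0], [1, 0, 1])
```

In this toy case the large error on the unmasked middle pixel does not contribute, mirroring the idea that only the inpainted lower-face region drives the gradient.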
Following [25], to run inference, each step of the backwards diffusion process can then be computed by:
$$y_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( y_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\, f_\theta(x, y_t, \bar{\alpha}_t) \right) + \sqrt{1 - \alpha_t}\, \epsilon_t,$$
where $\epsilon_t \sim \mathcal{N}(0, I)$. The backwards diffusion step is repeated as many times as necessary to denoise the image fully. Please see Fig. 2 for a high level view of our network architecture, and to better understand where each equation is used. For a more detailed discussion behind these equations, and how they are derived, please see [25,60,64].
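A single backwards diffusion step in the style of [25] can be sketched as follows. The noise scale sqrt(1 - alpha_t) is one of the variance choices discussed in [25] and is an assumption here; `predicted_noise` stands in for the output of the trained network, and images are flat lists of floats.

```python
import math


def reverse_step(y_t, predicted_noise, alpha_t, alpha_bar_t, z):
    """Compute y_{t-1} from y_t by removing the predicted noise and adding fresh noise z."""
    coef = (1.0 - alpha_t) / math.sqrt(1.0 - alpha_bar_t)
    sigma = math.sqrt(1.0 - alpha_t)  # assumed variance choice: sigma_t^2 = 1 - alpha_t
    return [
        (y - coef * eps) / math.sqrt(alpha_t) + sigma * n
        for y, eps, n in zip(y_t, predicted_noise, z)
    ]


# Sanity check at t = 1 (alpha_bar_1 == alpha_1): a perfect noise prediction
# with no fresh noise recovers the clean value.
alpha = 0.9
clean, eps = 2.0, 0.5
noisy = math.sqrt(alpha) * clean + math.sqrt(1.0 - alpha) * eps
recovered = reverse_step([noisy], [eps], alpha, alpha, [0.0])
```

At inference this step would be applied repeatedly from t = T down to t = 1, starting from Gaussian noise in the masked region, which is why sampling cost grows linearly with the number of diffusion steps.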

Experiments & Results
We train and evaluate three versions of our video-editing diffusion model: a single speaker model, a multi speaker model, and a single speaker model fine-tuned on top of the multi speaker one. We evaluate the videos generated by our models against the ground truth using a number of objective metrics. We compare our results to recent methods for audio-driven video generation [13,30,36,65,72]. The results we report for each of these methods are taken directly from their respective papers, with each model tested on a subset of the GRID [14] data set unless explicitly stated otherwise.

Evaluation Metrics
We use a number of objective metrics to measure the quality of our generated videos, allowing us to compare them directly to the ground truth and to other state-of-the-art audio-driven video generation methods from the literature. To make a fair comparison, we calculate the image quality metrics (SSIM, PSNR) only on the masked portion of the image, as shown in Fig. 1.

SSIM (Structural Similarity Index):
This is a perceptual metric that quantifies the degradation of image quality. A larger SSIM signifies better quality of the reconstructed image.
PSNR (Peak Signal to Noise Ratio): We compute the peak signal to noise ratio between the ground truth and the generated image. The higher the PSNR the better the quality of the reconstructed image.
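As an illustration of restricting an image quality metric to the masked region, the sketch below computes PSNR over masked pixels only. It assumes 8-bit single-channel images (peak value 255) stored as nested lists; the function name is hypothetical and the evaluation code used for the reported numbers may differ.

```python
import math


def masked_psnr(reference, generated, mask, max_value=255.0):
    """PSNR in dB between two images, computed over pixels where mask is truthy."""
    sq_err, count = 0.0, 0
    for ref_row, gen_row, mask_row in zip(reference, generated, mask):
        for ref_px, gen_px, inside in zip(ref_row, gen_row, mask_row):
            if inside:
                sq_err += (ref_px - gen_px) ** 2
                count += 1
    if count == 0:
        raise ValueError("mask selects no pixels")
    mse = sq_err / count
    if mse == 0:
        return math.inf  # identical images inside the mask
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Pixels outside the mask are copied from the input video, so including them would inflate the score; restricting the computation to the mask measures only what the model actually generated.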

CPBD (Cumulative Probability Blur Detection) [44]:
This is a perceptually based, no-reference objective image sharpness metric. Similar to [30,36,72], we use it to compare the sharpness of the generated videos.
WER (Word Error Rate): This measures how accurately a pre-trained lip-reading network transcribes the words spoken in a given video; lower is better. Similar to previous works [30,36,72], we use the LipNet [3] model, which is pre-trained on the GRID data set and achieves 95.2 percent accuracy.
Facial Action Unit (AU) [18] Recognition: Following previous works [13,65], we also evaluate our reconstructed images with respect to five facial action units (AU10: Upper Lip Raiser, AU14: Dimpler, AU20: Lip Stretcher, AU25: Lips Part, AU26: Jaw Drop). We use the Facial Behavior Analysis Toolkit [5] to detect the presence of these AUs (boolean true on activation) in each generated frame and compare them with the ground truth frames. Finally, we calculate the average F1 score and the average accuracy based on the AU recognition.
ACD (Average Content Distance) [71]: Similar to [72], we use the OpenFace face recognizer [1] to calculate the cosine (ACD-C) and Euclidean (ACD-E) distances between the generated frame and the ground truth image. The smaller the distance between two images, the more similar they are.
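For reference, word error rate is conventionally the word-level edit distance between a reference transcript and a predicted one, divided by the number of reference words. The sketch below implements that standard definition; the example sentences are made up in the style of GRID commands and are not taken from the evaluation.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)
```

In this setting the hypothesis would be the lip-reading network's transcript of a generated video, so a lower WER suggests the synthesised lip movements are more readable.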
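The per-AU scoring can be sketched as a comparison of boolean activation sequences, averaged over the action units. The function names are hypothetical, and returning an F1 of 1.0 when an AU is never active in either sequence is an assumed convention, since the text does not say how that case is handled.

```python
def f1_and_accuracy(ground_truth, predicted):
    """F1 score and accuracy for per-frame boolean activations of one action unit."""
    if len(ground_truth) != len(predicted) or not ground_truth:
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(ground_truth, predicted))
    tp = sum(1 for g, p in pairs if g and p)
    fp = sum(1 for g, p in pairs if not g and p)
    fn = sum(1 for g, p in pairs if g and not p)
    tn = len(pairs) - tp - fp - fn
    # Assumed convention: no activations anywhere counts as a perfect match.
    f1 = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 1.0
    return f1, (tp + tn) / len(pairs)


def average_over_units(per_unit):
    """Average F1 and accuracy over a dict of AU name -> (ground_truth, predicted)."""
    scores = [f1_and_accuracy(gt, pred) for gt, pred in per_unit.values()]
    return (sum(f for f, _ in scores) / len(scores),
            sum(a for _, a in scores) / len(scores))
```

Here the ground truth sequences would come from running the AU detector on the original frames and the predicted sequences from running it on the generated frames.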
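Given face-recognition embeddings for each generated and ground truth frame, the two ACD variants reduce to averaging a distance over frame pairs. In the sketch below the embeddings are plain lists of floats, cosine distance is assumed to mean one minus cosine similarity, and the function names are hypothetical.

```python
import math


def cosine_distance(a, b):
    """One minus the cosine similarity of two embedding vectors (assumed definition)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return 1.0 - dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_content_distance(generated, reference, distance):
    """Mean distance between corresponding generated and reference embeddings."""
    pairs = list(zip(generated, reference))
    if not pairs:
        raise ValueError("need at least one frame pair")
    return sum(distance(g, r) for g, r in pairs) / len(pairs)
```

Passing `cosine_distance` would give an ACD-C style score and `euclidean_distance` an ACD-E style score over the same set of frame embeddings.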

Single Speaker Model
We train our single speaker model on identity S1 using data from the GRID audio visual corpus [14]. There are 1000 videos in total, each of them roughly 3 seconds in length totalling about 50 minutes of audio-visual content for training. We train our model on 996 videos, withholding 4 of them for testing purposes. We call this the "unseen" test set. The "seen" test set consists of videos that the model has seen during training, but with different speech inputs to the originals. Our unseen test set is relatively small with respect to the size of our data set as we wanted to give the model as much information as possible about the speaker it was training on given that there were only about 50 minutes worth of audio/visual content available.
Tab. 2 shows the scores our model achieves on the unseen test set compared with other approaches in the literature. While the results we obtain are not state of the art, they demonstrate that using a denoising diffusion model for audio-driven video editing is indeed feasible and produces reasonable results. Further time spent training the model, and exposure to a larger data set, should improve these scores. Additionally, using our single speaker model, we edit a number of videos by introducing new speech inputs in place of the originals, and attach these videos to our supplementary materials, encouraging readers to have a look.

Multi-Speaker Training
We train our multi speaker model on 30 different identities using a subset of 100 random videos per speaker from the audio-visual GRID corpus [14] for 185 epochs. During training, we withhold identities S1, S33, and S34 entirely from the training set of the network so that we can use them for additional testing purposes. We use approximately 10% of the entire data set to train our model for two major reasons: 1) To train on the entire data set with our current hardware would take approximately 15 hours per epoch. We intend to incorporate the findings of [53] into our future work to significantly speed up training time.
2) We sought to train a "base" model using multiple identities but with relatively few samples per speaker, and use it to fine-tune a single-speaker model with more samples, investigating whether the fine-tuned model could be trained for less time than the single-speaker model while achieving similar performance.
When testing on identities unseen by the model, it struggles to keep the identity of the speaker consistent throughout the generation process. We speculate that since our model relies on the previously generated frame alone to generate the next frame in the sequence, information about the original identity is lost over time, as demonstrated in Fig. 3.
When testing on identities previously seen by the network during training, we report relatively poor results when compared to other methods in the literature, as shown in Tab. 2. We speculate this is due to the much reduced data set size and training time accorded to the model. As [47] state, diffusion model output quality typically scales up with additional training time and data, and we expect this to be the case here as well. Despite these issues, the results still look very promising (Fig. 4), with certain identities performing better than others. We encourage readers to view the provided multi speaker video samples for a mix of both failure cases and successful videos.

Single Speaker Fine Tuned
We fine-tune a single speaker model using identity S1 on top of the pre-trained multi speaker model discussed above. We use the same training hyper-parameters as the base single speaker model; however, we keep 20 random videos unseen by the network for testing, and train it for only 150 epochs instead of 895. We report reasonable results in Tab. 2 on the unseen test set. We show that by fine-tuning on a pre-trained multi speaker model, we achieve slightly worse results than the base single speaker model while training for significantly less time. It stands to reason that with further training, we would see even better results. We provide videos demonstrating this model's editing capabilities in the supplementary materials.

Limitations & Future Work
Training & Inference Speed: It is no secret that diffusion models are slow, both to train and to sample from. Our model is no exception, taking approximately 30 minutes per epoch to train the single speaker model, 90 minutes per epoch for the multi speaker one, and approximately 1 minute to generate 1 frame with 2000 diffusion steps on a single 32 GB V100 GPU. We plan on updating our model with the approach proposed by [53] to facilitate training in the latent space, in addition to methodically shrinking the number of parameters in our model to determine the optimum setup. We suspect our current model has too many parameters for the task at hand, and intend to reduce this number. It remains to be seen how image quality will be impacted by these changes; however, a faster model would allow for more in-depth tests and comparisons across a wider range of data sets, furthering the field.
Multi Speaker Model: While we demonstrate that our single speaker models perform reasonably well (with a lot of room for improvement), further work must be done to accomplish reliable multi-speaker performance. We want to train the model for longer, and with more data, to improve the lip synchronization. To facilitate this we need to implement the improvements mentioned above. Additionally, we noticed that our current multi speaker model struggled to keep the identity consistent throughout the generation process. To address this, we propose introducing an additional "identity" frame in the conditioning process, which the network may use as a reference for how the speaker should appear.
Dataset Size: We train our networks on such small subsets of the GRID data set due to our limited hardware capacity. Despite these limitations however, we achieve convincing results, as shown in Tab. 2. We encourage readers with access to more powerful hardware to train on the full length data set, as well as other sources such as the Obama White House single speaker data set, or the BBC Lip Reading Data set [61].
Talking head generation: The task of audio-driven video editing involves modifying a small portion of an already existing video in response to a new audio signal. We demonstrate that diffusion models can be used successfully towards this goal. The next step is extending this functionality to talking head generation, where the network must learn to synthesize full-frame videos from a driving audio signal and a single image. This is a challenging task, as the network must now also generate natural head movements, eye blinks, and facial expressions while maintaining accurate lip and jaw movements synchronised to the audio. We plan to explore this task in our future work, and to study various ways of conditioning the network to control the facial aspects mentioned above.

Conclusion
Throughout this work, we demonstrate the feasibility of applying denoising diffusion models to the task of end-to-end audio-driven video editing. Although the slow sampling and training speeds associated with diffusion models hindered our approach in the multi-speaker domain, we still show reasonable results within the single speaker context, generating high quality videos. With our work, we take a promising first step towards achieving accurate audio-driven video editing with denoising diffusion models.