Lip-Reading with Visual Form Classification using Residual Networks and Bidirectional Gated Recurrent Units

Lip-reading is a method that focuses on the observation and interpretation of lip movements to understand spoken language. Previous studies have exclusively concentrated on a single variation of residual networks (ResNets). This study primarily aimed to conduct a comparative analysis of several types of ResNets. This study additionally calculates metrics for several word structures included in the GRID dataset, encompassing verbs, colors, prepositions, letters, and numerals. This component has not been previously investigated in other studies. The proposed approach encompasses several stages, namely pre-processing, which involves face detection and mouth location, feature extraction, and classification. The architecture for feature extraction comprises a 3-dimensional convolutional neural network (3D-CNN) integrated with ResNets. The management of temporal sequences during the classification phase is accomplished through the utilization of the bidirectional gated recurrent units (Bi-GRU) model. The experimental results demonstrated a character error rate (CER) of 14.09% and a word error rate (WER) of 28.51%. The combination of 3D-CNN ResNet-34 and Bi-GRU yielded superior outcomes in comparison to ResNet-18 and ResNet-50. The correlation between increased network depth and enhanced performance in lip-reading models was not consistently observed. Nevertheless, the incorporation of additional trained parameters offers certain benefits. Moreover, it has demonstrated superior levels of precision in comparison to human professionals in the task of distinguishing diverse word structures.


Introduction
Language is the primary means by which humans communicate with each other. Verbal communication is disrupted by hearing loss or noisy environmental conditions. Alternatives for communication include sign language and lip-reading; however, both require special training and are challenging to learn. Lip-reading is a technique that relies on visual interpretation to comprehend spoken words or sentences. It becomes a necessity when ambient noise hinders auditory perception of the speaker's speech, or when dialogue in a video cannot be understood because the audio is unavailable. Moreover, it has the potential to be incorporated into biometric security systems designed for mobile devices [1].
As technology has advanced, lip-reading has been widely researched. One of the primary difficulties in lip-reading is the limited visual representation of several phonemes, resulting in potential ambiguity in word interpretation. A viseme, which stands for visual phoneme, is the lip shape that represents a specific sound. The term viseme was introduced by Fisher as the visual form of a phoneme [2]. For example, /s/ and /r/ are phonemes because they differentiate the meanings of the words "sing" and "ring". One viseme can represent more than one phoneme [3]. For instance, the phonemes /b/, /p/, and /m/ share the same viseme.
Multiple classification schemas have been developed to categorize lip movements according to their potential interpretations, such as visemes [4], phonemes [5], and ASCII characters [6]. A viseme classification schema has an advantage over other methods because it can predict words that are absent from the training phase, since visemes can be combined to match all possible spoken words. Because several languages share the same visemes, the schema can also be employed across numerous languages [7]. Lip movements in different languages follow a similar pattern owing to similarities in the development of human vocal organs, even though each language has its own grammar and pronunciation norms [8]. Lip-reading research has been carried out on various classification segments such as letters [9], numbers [10], syllables [11], words [12], and sentences [13].
The datasets used as training data are diverse. The commonly used datasets were lip-reading in the wild (LRW) [14] and lip-reading sentences 2 (LRS2) [15], both of which consist of news or event programs. Other datasets were created for audio-visual speech recognition research, namely OuluVS2 [16] and GRID [17]. Other studies used custom datasets to fit their models, as their focus was mostly on classifying short speech segments.
In their study, Lu & Li [10] constructed an in-house dataset for predicting numerical values within the range of 0 to 9. Integrating the visual geometry group (VGG) network with long short-term memory (LSTM) and attention mechanisms produced a feature extraction model that demonstrated a high level of fault tolerance in image recognition. The VGG network was also used for other micro-content [9]; that dataset comprises 2,700 recordings of letters pronounced by 11 different people. Short speech segments with syllable-level models were developed to recognize novel words not included in the training phase [11]. The model architecture was built with a 3-dimensional convolutional neural network (3D-CNN) and tested on a dataset of self-recorded videos containing Indonesian phrases. These studies highlight the advancements in lip-reading for classifying short speech segments. However, there is still a need to extend the scope of lip-reading to accurately recognize and classify words with different structures, such as verbs, colors, and prepositions. To address these challenges, we investigated computing accurate measurements for different word structures, including verbs, colors, prepositions, letters, and numbers present within the dataset.
Fenghour et al. [4] presented a viseme prediction model using a 3D-CNN with residual networks (ResNets) architecture, followed by sentence prediction using generative pre-trained transformers (GPT). Even though the model achieved high accuracy in classifying visemes, there was a significant decrease in word classification accuracy after the conversion. During the perplexity calculation stage, misclassifications frequently occurred due to local optima in the local beam search implementation; these local optima pose a challenge at each iteration of the viseme sequence, leading to incorrect classifications. Defining a viseme is challenging: because visemes are short in duration, there is not enough temporal information to distinguish between the various classes [4], and more background information is required to detect small variations [18].
A word-level lip-reading method was tested on the LRW dataset using a two-stream network combining a 3D-CNN and bidirectional long short-term memory (Bi-LSTM). Using optical flow and grayscale video as inputs can further improve performance, and the results demonstrated the two-stream network's effectiveness in lip-reading [19]. Another study on the same dataset [20] used a 3D-CNN ResNet-18 followed by a temporal convolutional network (TCN). The efficacy of the initial TCN designs was improved through densely connected TCN (DC-TCN) [21], whose model employed a squeeze-and-excitation block to capture more comprehensive attributes at higher temporal resolutions. Wang et al. [12] utilized a 3D convolutional vision transformer (3DCvT), which combines the strengths of vision transformers and 3D convolutions to extract spatiotemporal features from continuous images. By leveraging the properties of convolutions and transformers, it can effectively capture local and global information from these images. The extracted features are subsequently fed into a bidirectional gated recurrent unit (Bi-GRU) for sequence modeling to better capture the overall correlation among feature sequences and accurately identify crucial information.
Sentence-level lip-reading models have received more attention from researchers. At the sentence level, a hybrid lip-reading network (HLR-Net) was developed [13]. The model consists of two distinct phases, an encoder and a decoder. The encoder is constructed from three inception layers that make up the spatiotemporal CNN (StCNN), gradient, and Bi-GRU layers. The decoder uses the connectionist temporal classification (CTC) loss function and is built from an attention layer, fully connected layers, and activation functions. Jeon et al. [22] developed a sentence-level lip-reading model with multiple visual feature extraction methods, combining a 3D-CNN, 3D DenseNet, and multi-layer feature fusion (MLFF) 3D-CNN. However, performing automatic speech recognition solely from visual information remains challenging because spoken language relies on both acoustic and visual cues. Visual recognition of mouth movements is a crucial aspect of creating a lip-reading model, and a temporal sequence modeling component is also necessary; this component often involves training a language model capable of disambiguating distinct lip forms. One notable limitation of lip-reading models in practical settings is their considerable size and insufficient computational efficiency.
Prior research has primarily focused on a single variant of ResNets. The primary objective of this work was to perform a comparative examination of various ResNet architectures. This research evaluates different ResNet variants, encompassing variations in the number of trained parameters and layers, to assess the efficacy of the model. The proposed methodology has multiple stages: pre-processing, which incorporates face detection and mouth localization; feature extraction; and classification. The feature extraction architecture consists of a 3D-CNN combined with ResNets. The classification phase handles temporal sequences using the Bi-GRU model. Furthermore, this work computes metrics for several word structures present in the GRID dataset, consisting of verbs, colors, prepositions, letters, and numerals, a component that has not been investigated in previous studies. The findings offer valuable insights into the assessment of precision for individual word structures.

Data
The dataset used was GRID [17], an audio-visual sentence corpus for research purposes that consists of color videos and audio in English recorded from 33 speakers (18 men and 15 women) at a resolution of 360 x 288. Each speaker utters 1,000 short sentences with a six-word sequence. All videos are 75 frames (about 3 seconds) in length. The spoken sentence structure is command + color + preposition + letter + number + adverb. Each sentence was chosen randomly from the combination of words listed in Table 1. The GRID dataset contains 51 unique words, including 25 letters (excluding the letter 'W' since it is multi-syllable), 10 numbers (zero to nine), and 4 words each for command, color, preposition, and adverb. The color, letter, and number are the keywords, and all speakers utter all of them. An example of a spoken sentence is "Bin Blue By Z Eight Now". The dataset includes metadata files that list the frame time for each word; an example of a metadata file can be seen in Table 2. The word "sil" at the beginning and end denotes silence. The training and validation datasets are divided according to speaker numbers. In total, 32,747 of 33,000 videos were used; 253 videos were corrupted. The evaluation data consisted of 2 males and 2 females, speaker numbers 1, 2, 20, and 22. The remaining videos were used for training.
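The per-word timings in these metadata files are straightforward to parse. The sketch below assumes the GRID .align convention of one `start end word` triple per line, with timings kept in the files' native units; the sample alignment and its timing values are illustrative, not copied from the corpus.

```python
def parse_alignment(text):
    """Parse a GRID-style .align file: one 'start end word' triple per line.

    Returns (word, start, end) tuples, skipping the 'sil' silence markers
    that bracket each utterance.
    """
    entries = []
    for line in text.strip().splitlines():
        start, end, word = line.split()
        if word != "sil":  # drop leading/trailing silence
            entries.append((word, int(start), int(end)))
    return entries


# Hypothetical alignment for the sentence "Bin Blue By Z Eight Now":
sample = """0 23750 sil
23750 29500 bin
29500 34000 blue
34000 35500 by
35500 41000 z
41000 47250 eight
47250 53000 now
53000 74500 sil"""

words = parse_alignment(sample)
print([w for w, _, _ in words])  # → ['bin', 'blue', 'by', 'z', 'eight', 'now']
```

Each tuple then maps directly onto a frame range of the 75-frame clip, which is how per-word-structure metrics can be computed later.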

Proposed Model

Figure 1. Overall lip-reading process
The Dlib and OpenCV libraries were used to perform mouth extraction. A shape predictor identifies features in an input image, such as the mouth, nose, and eyes. Facial landmark detection first locates the face and then detects the mouth within that region. The frames were cropped around the mouth area and downsized to 100 x 50. An illustration of mouth area localization is shown in Figure 2.
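The crop step can be sketched as a small pure function. Assuming dlib's 68-point shape predictor (where points 48-67 cover the mouth), the bounding box is padded and clamped to the frame before the resize to 100 x 50; the landmark coordinates and padding value below are hypothetical.

```python
import numpy as np

def mouth_bbox(landmarks, pad=10, frame_shape=(288, 360)):
    """Given (x, y) mouth landmarks (points 48-67 in dlib's 68-point model),
    return a padded crop box (x0, y0, x1, y1) clamped to the frame."""
    pts = np.asarray(landmarks)
    x0, y0 = pts.min(axis=0) - pad
    x1, y1 = pts.max(axis=0) + pad
    h, w = frame_shape
    # Clamp so the padded box never leaves the 360 x 288 frame.
    return (max(int(x0), 0), max(int(y0), 0), min(int(x1), w), min(int(y1), h))


# Hypothetical mouth landmarks for one frame:
box = mouth_bbox([(150, 200), (210, 200), (180, 230)])
print(box)  # → (140, 190, 220, 240)
# The crop frame[y0:y1, x0:x1] would then be resized to 100 x 50,
# e.g. with cv2.resize.
```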

Figure 2. Mouth area localization
This model includes techniques for data augmentation: random horizontal flipping is applied with a probability of 0.5, and frames are duplicated or deleted with a probability of 0.05 per frame. Next is data normalization. This step ensures that each input has a similar data distribution, which speeds up convergence during training. The data is normalized to the interval [0, 1] by dividing each pixel value by 255.
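A minimal NumPy sketch of this augmentation and normalization pipeline. Two details are assumptions: the text does not say how the 0.05 per-frame probability is split between duplication and deletion, so the sketch divides it evenly, and the flip is applied to the whole clip at once (a common choice for video) rather than per frame.

```python
import numpy as np

def augment_and_normalize(frames, rng):
    """frames: (T, H, W) uint8 video clip.

    Applies a horizontal flip to the whole clip (p = 0.5), per-frame
    duplicate/delete jitter (p = 0.05, split evenly), then scales
    pixel values to [0, 1].
    """
    frames = list(frames)
    if rng.random() < 0.5:               # flip the whole clip horizontally
        frames = [f[:, ::-1] for f in frames]
    out = []
    for f in frames:
        r = rng.random()
        if r < 0.025:                    # delete this frame
            continue
        out.append(f)
        if 0.025 <= r < 0.05:            # duplicate this frame
            out.append(f)
    return np.stack(out).astype(np.float32) / 255.0


clip = np.full((4, 2, 3), 255, dtype=np.uint8)
aug = augment_and_normalize(clip, np.random.default_rng(0))
```

Note the sketch does not guard against the (vanishingly unlikely) case where every frame of a clip is deleted; a production pipeline would.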
Figure 3 shows the proposed architecture of the lip-reading model. It uses a 3D-CNN with ResNets to extract spatial and temporal features and obtain fixed-length feature vectors. After the convolution and pooling layers, the architecture has a Bi-GRU layer, followed by a fully connected layer that uses softmax activation and the CTC loss function for classification. At the end of the sequence, it generates class probabilities to predict future input characters based on previous ones. Several convolutional network models have been developed to extract image features, but their applicability to video analysis is limited owing to the absence of motion modeling. The current study utilized a feature extraction model known as 3D-CNN with ResNets. 3D-CNN has shown superior performance in video analysis: it has an additional dimension that retains temporal information from input to output volume, and it efficiently summarizes object, scene, and action-related data within videos, offering versatility without requiring model fine-tuning for each task [23].

Residual Neural Networks (ResNets)
ResNets are highly popular for image recognition and classification because they can overcome degradation by skipping layers (skip connections). A skip connection mitigates the vanishing gradient problem by letting gradients take a shortcut past unnecessary paths. ResNets are easier to train due to their simple topology and short interconnections among layers. Additionally, they exhibit proficiency in detecting features within lower-dimensional data representations, enabling them to learn from small datasets [24]. ResNets come in several variants distinguished by the number of layers, such as 18, 34, 50, 101, or 152. The architecture of the ResNet variants [25] is shown in Figure 4. For instance, ResNet-18 is built from 18 neural network layers: the first layer uses a 7x7 kernel, followed by max pooling (stride 2) and four convolution stages. Each stage consists of two residual blocks of two layers each with a skip connection. The kernel size in the convolution layers is 3x3 except for the first layer (7x7), and the number of filters in each stage is 64, 128, 256, and 512. The max pooling layer uses stride 2.

The recurrent neural network (RNN) is a neural network architecture frequently employed in several domains, including speech recognition, language modeling, and translation. Its main functionality lies in predicting the next word or character within a sequence. To address some limitations faced by RNNs, such as vanishing gradients during training, an improved version called the GRU was introduced by Cho et al. [26]. The goal behind the GRU was to enhance information flow throughout the network. GRUs are often compared to LSTM networks due to their similar design and similarly promising results. However, one notable distinction lies in how they handle gating: while the LSTM employs separate forget and input gates to control memory-cell access at each time step, the GRU combines these into a single update gate. This enables efficient determination of which information to propagate toward output predictions.
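The identity shortcut that lets ResNets skip unnecessary transformations can be shown in miniature. The fully connected residual block below is a conceptual sketch, not the paper's 3x3 convolutional block: when the residual branch contributes nothing, the block reduces to the identity, which is exactly the property that eases gradient flow in deep networks.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """A minimal fully connected residual block: y = ReLU(x + F(x)),
    where F is two weight layers. The identity shortcut (the `+ x`)
    lets gradients bypass F entirely."""
    out = relu(x @ w1)   # residual branch, layer 1
    out = out @ w2       # residual branch, layer 2
    return relu(out + x) # skip connection adds the input back

x = np.ones(4)
w1 = np.zeros((4, 4))
w2 = np.zeros((4, 4))
# With F(x) = 0 the block is the identity mapping:
print(residual_block(x, w1, w2))  # → [1. 1. 1. 1.]
```

This is why adding residual blocks cannot easily make a network worse: a block can always learn to pass its input through unchanged.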
In the backend process, we used a two-layer Bi-GRU. The Bi-GRU had the best prediction accuracy and the quickest learning convergence time compared to the unidirectional models, GRU and LSTM [18]. Bi-GRU provides information to two independent neural network topologies connected to the same output layer in both forward and reverse flows. Both networks receive complete input information, unlike the standard GRU deployment. The output of the 3D-CNN ResNets is delivered successively to the Bi-GRU layers, which produce characters as output.
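The forward-and-reverse flow can be illustrated with a toy recurrent cell (a stand-in, not a real GRU): each time step's output pairs a left-to-right pass with a right-to-left pass, so every step sees the full input context.

```python
import numpy as np

def toy_rnn(seq, w=0.5):
    """A stand-in recurrent cell: h_t = tanh(w * h_{t-1} + x_t)."""
    h, out = 0.0, []
    for x in seq:
        h = np.tanh(w * h + x)
        out.append(h)
    return out

def bidirectional(seq):
    """The Bi-GRU idea in miniature: one pass over the sequence
    left-to-right, one right-to-left, outputs paired per time step."""
    fwd = toy_rnn(seq)
    bwd = toy_rnn(seq[::-1])[::-1]  # reverse back so indices line up
    return list(zip(fwd, bwd))

states = bidirectional([0.5, -0.2, 0.1])
```

A real Bi-GRU works the same way structurally, with the toy cell replaced by gated GRU cells and the paired states concatenated before the softmax layer.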

CTC Loss Function
The CTC loss function obviates the necessity of pre-alignment between the input and output sequences and enables independent prediction of labels at each time step. The vocabulary comprises tokens, including a 'blank' character represented by '-', which aids in encoding repeated characters. For instance, in the CTC configuration, 'hel-lo' is a correct representation of 'hello', where 'l' is duplicated. The CTC loss function accepts a model output matrix of scores assigned to each token at every time step, alongside the ground truth sequence [27].
During training, the objective is to maximize the probability of all possible paths leading to the ground truth label, i.e., to minimize the sum of negative log probabilities. During evaluation, a selection is made at each step using either beam search or greedy decoding to identify characters; the final recognized sequence is then generated by removing redundant and blank characters. The CTC loss function can be applied at various levels, such as phonemes, visemes, or individual characters.
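The collapse rule applied after greedy decoding can be sketched directly: merge runs of repeated tokens, then delete blanks. The blank is what preserves genuinely doubled letters.

```python
def ctc_collapse(tokens, blank="-"):
    """Collapse a CTC path: merge consecutive repeats, then drop blanks.

    A blank between two identical letters keeps them distinct, so
    'hel-lo' decodes to 'hello', while 'hello' with no blank between
    the l's would collapse to 'helo'.
    """
    out, prev = [], None
    for t in tokens:
        if t != prev and t != blank:
            out.append(t)
        prev = t
    return "".join(out)

print(ctc_collapse("hel-lo"))    # → hello
print(ctc_collapse("hheelloo"))  # → helo  (repeats merged, no blank)
```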

Performance Measurement
Model performance is measured using standard evaluation metrics for sentence-level automatic speech recognition: character error rate (CER) and word error rate (WER). CER measures how close the predicted character sequence is to the target character sequence, while WER does the same for words. The lower the CER or WER, the better the prediction accuracy. All models were evaluated to compare computational performance and efficiency. The CER and WER equations are given in Equations 1 and 2, respectively:

CER = (S + D + I) / N (1)

WER = (S + D + I) / N (2)

In Equations 1 and 2, N represents the total count of characters (for CER) or words (for WER) in the ground truth, S signifies substitutions (incorrect classifications), D denotes deletions (characters not decoded), and I indicates insertions (decoded characters that should not be present).
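Both metrics reduce to an edit distance divided by the reference length. A minimal sketch (standard Levenshtein distance, not tied to any particular decoder; the function name is ours):

```python
def edit_ops_rate(ref, hyp):
    """(S + D + I) / N via Levenshtein distance, where N = len(ref).

    Pass strings for CER, token lists for WER.
    """
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # all deletions
    for j in range(n + 1):
        d[0][j] = j                      # all insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution/match
    return d[m][n] / m

print(edit_ops_rate("bin blue", "bin blow"))                   # CER → 0.25
print(edit_ops_rate("bin blue".split(), "bin blow".split()))   # WER → 0.5
```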

Results and Discussion
The proposed models were evaluated using Google Colab Pro+ on a T4 GPU with an allocation of 15 GB of GPU RAM and 51 GB of system RAM. A TensorFlow CTC decoder was used to calculate the error rate scores for all experimental models, which were developed using Keras with a TensorFlow backend. ResNets have five variants: 18, 34, 50, 101, and 152 layers; we experimented with three, namely 18, 34, and 50. We employed the Adam optimizer [29] to train all our models. The learning rate was set at 10^-4 for a total of 250 epochs, and the batch size was 16.

CER and WER
To assess the computational effectiveness of the models, we examined the error rate in relation to the trained parameters and the duration of training. The evaluation was conducted with unseen speakers, and the results are summarized in Table 4. The 3D-CNN ResNet-34 and Bi-GRU combination gave the best result, with a CER of 14.09% and a WER of 28.51%. However, 3D-CNN ResNet-18 and Bi-GRU had a shorter training time of 205.45 hours (about 8.5 days). The 3D-CNN ResNet-50 and Bi-GRU model had the highest CER and WER values and the longest training time. The number of trainable parameters in the 3D-CNN ResNet-34 and Bi-GRU model is approximately 65.1 million, which contributes to its superior sentence prediction performance compared to the other two models. Figure 5 shows the CER comparison for each model, and Figure 6 shows the WER comparison. During the initial 100 epochs, the model using 3D-CNN ResNet-50 exhibits a larger disparity in CER values compared to the other two models; as training progresses, the discrepancy becomes slight. On the other hand, there is only a slight difference in WER between the 3D-CNN ResNet-18 and ResNet-34 models, while the deeper 3D-CNN ResNet-50 displays more significant variation than its counterparts. These observations imply that a deeper network does not necessarily guarantee improved lip-reading performance, and further investigation is needed to understand this. The more layers in a ResNet, the longer the required training time. Increasing the number of trained parameters may be beneficial in ResNet models, but this approach may not suit other neural networks, which highlights the significance of selecting relevant features rather than relying solely on parameter quantity to enhance model quality and performance. Training a neural network with numerous parameters poses a considerable computational challenge. The complexity increases with the ResNet models due to the extensive memory required to store and maintain parameters and weight values, resulting in time-consuming training. Consequently, given the memory limitations of our testing environment, it was not feasible to train ResNet-101 and ResNet-152. These findings also demonstrate that deeper networks such as ResNets are computationally expensive without necessarily leading to enhanced lip-reading performance.
Despite the challenges and limitations associated with training deep neural networks like ResNets for lip-reading, there are still promising opportunities for improving accuracy and performance in this field. One potential avenue is the exploration of ensemble learning methods in lip-reading models: by combining multiple models, more accurate predictions can be made. Another approach to the computational challenges of training deep networks is model compression. Model compression techniques, such as sparsity via regularization, weight quantization, and network pruning, have shown promise in reducing the memory usage and computation requirements of deep networks. For instance, weight quantization replaces trained network weights with lower-precision values or bit-wise operations. These techniques can be applied to lip-reading models based on deep convolutional neural networks such as MobileNet, VGG16, and AlexNet [30]. However, the challenges of training deep neural networks, particularly models like ResNets with their memory requirements, highlight the need for alternative approaches to improve computational efficiency and performance.

Confusion Matrix
To visualize the results, the proposed approach utilized phoneme-to-viseme mapping [2]. Figure 7 presents the confusion matrix for viseme prediction. The results revealed that while the proposed model successfully differentiated most visemes, several misclassifications were observed. The viseme "ch", which maps the phoneme group {/jh/, /ch/, /zh/, /sh/}, had a high frequency of incorrect classification; examples of data included in viseme "ch" are the letters g, h, and j. These findings are consistent with previous research [4] that has also highlighted the challenges of mapping multiple phonemes to a single viseme.
We selected 10 sample sentences to illustrate the prediction results. Table 5 displays example sentences from the GRID dataset along with their corresponding predicted sentences; inaccuracies in the predicted sentences are marked in bold and underlined. Some sentences align perfectly with the predictions, while others contain one or more incorrect words. This is to be expected, since lip-reading depends primarily on the visible articulators, which include the lips, the tongue, and, to some extent, the teeth.

Prediction Sentence Structure
Table 6 displays the accuracy attained in predicting different word structures. The accuracy for letter prediction was 35.79%, considerably lower than for other components of the test. On the other hand, command words exhibited the highest prediction rate, with an accuracy of 88.65%. Several factors explain this outcome. The observed discrepancy can partly be accounted for by the temporal duration of letter sounds being shorter than 0.3 seconds. Furthermore, distinguishing visually indistinguishable visemes, such as the phonemes "p" and "b" or "f" and "v", is a significant challenge due to their closely related visual characteristics. For instance, letters such as b, d, c, and e require similar oral articulatory gestures for pronunciation. This phenomenon impacts word prediction and poses challenges for the visual system's acquisition of novel information, revealing an additional technological limitation of visual speech recognition systems. The accuracy rates of human interpreters were used as a baseline in an audio speech recognition study [31]. To evaluate the effectiveness of our proposed model, we conducted a comparative analysis between its lip-reading performance and that of humans. The findings, provided in Table 7, demonstrate that the proposed model outperformed humans across all sentence structures. This achievement can be attributed to significant advancements in deep learning methodologies, which have yielded notable progress in lip-reading and enabled successful utilization in many real-life scenarios.

Conclusion
This paper proposes a deep learning-based lip-reading system using 3D-CNN ResNets and Bi-GRU. Different ResNet architectures were compared to determine which best predicts full sentences from a series of images of the lip area. The most accurate model combined 3D-CNN ResNet-34 with Bi-GRU, obtaining a CER of 14.09% and a WER of 28.51% on unseen speakers in experiments on the GRID dataset. We demonstrated that our model outperformed humans across all sentence structures, showcasing its effectiveness in real-world scenarios and highlighting the impact of recent developments in deep learning.
Increasing the number of layers in ResNets results in longer training durations. Although introducing more trained parameters may have its advantages, this approach may not be compatible with all neural network architectures. Instead of relying solely on parameter count to enhance model quality and performance, it is crucial to focus on selecting relevant features. Training a neural network with many parameters poses computational challenges due to the extensive memory required to store and update all the weights, and these training processes become time-consuming. The findings also suggest that while deeper networks such as ResNets are computationally demanding, they do not necessarily translate into improved lip-reading capabilities. Despite these obstacles and constraints, there are favorable prospects for enhancing accuracy and performance. A potential approach is investigating ensemble learning methods in lip-reading models: ensemble learning fuses multiple models to generate more precise predictions, leveraging diverse viewpoints and amplifying overall performance.

Data Availability Statement
The data presented in this study are openly available in Zenodo at https://doi.org/10.5281/zenodo.3625687 [32].

Funding
The authors received no financial support for the research, authorship, and/or publication of this article.

Acknowledgements
We would like to thank Bina Nusantara University, especially the Binus Graduate Program, Master of Computer Science, for giving us the chance to conduct the research.

Institutional Review Board Statement
Not applicable.

Informed Consent Statement
Not applicable.