Improving the Recognition Performance of Lip Reading Using the Concatenated Three Sequence Keyframe Image Technique

This paper proposes a lip reading method based on convolutional neural networks applied to the Concatenated Three Sequence Keyframe Image (C3-SKI), consisting of (a) the Start-Lip Image (SLI), taken at the start of a syllable's pronunciation, (b) the Middle-Lip Image (MLI), and (c) the End-Lip Image (ELI), taken at the end of the pronunciation of that syllable. The lip area's image dimensions were reduced to 32×32 pixels per frame, and the three keyframes concatenated together, with a dimension of 96×32 pixels, were used to represent one syllable for visual speech recognition. The three concatenated keyframes representing a syllable are selected based on the relative maxima and relative minima of the open lip's width and height. The evaluation of the model's effectiveness showed accuracy, validation accuracy, loss, and validation loss values of 95.06%, 86.03%, 4.61%, and 9.04% respectively for the THDigits dataset. The C3-SKI technique was also applied to the AVDigits dataset, showing 85.62% accuracy. In conclusion, the C3-SKI technique can be applied to perform lip reading recognition.

Keywords: concatenated frame images; convolutional neural network; keyframe reduction; keyframe sequence; lip reading


I. INTRODUCTION
Deep learning applications, especially Convolutional Neural Network (CNN) applications, have recently achieved impressive success in diverse object detection and recognition tasks [1]. However, CNNs face some challenges, particularly in video recognition. A video may be incomplete, with sound lacking during certain parts. If the audio at a crucial moment is missing, the video's contents may be misunderstood [2]. Such videos would be more useful if they could be edited and the missing words or messages recovered. Most of the proposed solutions rely on lip reading, i.e. observing the moving lips, tongue, and face, to help transcribe the right words.
Moreover, transcribing or translating speech through lip reading is a skill that requires learning and practice to become proficient at recognizing the lip movements or lip patterns related to the pronunciation of each syllable.
In general, there are two popular multimodal supervised learning methods for lip reading: Visual Speech Recognition (VSR) and Audio-Visual Speech Recognition (AVSR). VSR teaches the machine with visual-only information from a video, without using speech or audio for training [3]. On the other hand, AVSR trains the machine by applying images combined with audio data from a video to achieve greater accuracy [4]. Authors in [5] found that the use of Visual-Only (VO) data gives better classification accuracy than Audio-Visual (AV) and Audio-Only (AO) data. There are currently two groups of research studies on lip reading recognition: the first group uses images in VO [2, 3, 5-14], while the other uses audio and video together [4, 15-25]. Both groups have a model for extracting features with a combination of techniques. Nevertheless, the latter group differs in that audio features have to be combined with visual features during machine learning. AV speech recognition is used commercially in various software or systems, but its recognition quality is reduced by environmental noise. This situation is not the same as using data from a quiet image, which gives lip reading a pivotal role in Automatic Speech Recognition (ASR) in harsh audio environments [6].
Neural networks, such as CNNs [6, 11, 24] and Long Short-Term Memory (LSTM) networks [3, 10, 21], are commonly used to extract features and recognize lip reading patterns. In order to segment the images of the video frames, the lip movement is delimited by a fixed number of frames, for example 40 frames per word [7], by a fixed duration such as 0.2 seconds [8, 16] or 1 second, or by using every frame of a video that contains only one word and has a short duration. The methods mentioned above use multiple frames in machine learning processing, which consumes more resources and processing time. Real-world problems also arise, e.g. when a speaker speaks slowly, the video file becomes longer while the words are spread out. Therefore, fixed limits on the number of frames or on the duration may not match the words or messages conveyed. However, representing the images by keyframes reduces the number of images and the cost of machine learning.
Most lip reading studies are applied to English datasets of digits, alphabets, words, phrases, and sentences. AVDigits [26] is a popular dataset for testing and improving the performance of a model. It contains a total of 540 videos with a resolution of 1920×1080 pixels (px), in which six speakers speak the numbers 0 to 9 in English, each repeating every number 9 times. Greek [12], Myanmar [14], Spanish [27], and Czech [28] datasets have also been studied, but no Thai language dataset has been created for lip reading. The Thai language is difficult because some words have similar lip patterns but different meanings: since the vowels are pronounced similarly, the patterns of lip movements are similar. In addition, Thai intonation results from a combination of tones, producing five tonal sounds. Therefore, this research aims to create a dataset in the Thai language and to reduce the image dimensions and the number of frames by finding keyframes that represent syllables or words for Thai lip reading recognition using CNNs.
II. RESEARCH METHODOLOGY
The research methodology for improving the recognition performance of lip reading using C3-SKI consists of: 1) dataset preparation, 2) face detection and lip localization, 3) C3-SKI creation, 4) model development, and 5) model effectiveness evaluation.

A. Dataset Preparation
The Thai digit dataset, called THDigits, was created as Thai video files containing the numbers 0 to 9, each repeated three times at three different speaking speeds (slow, regular, and fast) by 100 mixed-gender speakers. A total of 3,000 video files with lengths between 1 and 4 seconds were constructed. These videos have four different resolutions, 1920×1080, 1280×720, 960×540, and 720×404 px, and were recorded with smartphones regardless of model and brand. The Thai numbering is shown in Table I.

B. Face Detection and Lip Localization
Before face detection and lip localization, each video of the Thai digit dataset was separated into individual frames, ordered from the first to the last frame. After that, the face was detected in each frame using the Viola-Jones technique [29, 30] based on Haar-like features. In the classifier, T is the number of weak hypotheses, h_t is the t-th weak hypothesis (a distinguishing characteristic), and β_t is its classification error rate. The classifier C(x) is given as (1) [29]:

$$C(x) = \begin{cases} 1, & \sum_{t=1}^{T} \log\frac{1}{\beta_t}\, h_t(x) \geq \frac{1}{2} \sum_{t=1}^{T} \log\frac{1}{\beta_t} \\ 0, & \text{otherwise} \end{cases} \quad (1)$$

A Haar-like feature compares differences of pixel sums on an image using a black-and-white filter, where I(x, y) is the pixel intensity and P(x, y) ∈ {1, −1} is the filter polarity over an N×N image, as in (2) [30]:

$$f = \sum_{x=1}^{N} \sum_{y=1}^{N} P(x, y)\, I(x, y) \quad (2)$$

After face detection comes lip localization based on the 68 facial landmarks [31]. Lip localization determines the lip area or Region Of Interest (ROI), leaving only the desired feature area by cropping each frame to the lip area only, as illustrated in Figure 1.

C. C3-SKI Creation
Other studies concatenate the frames of a word into a single image, e.g. in 5 rows with 5 frames per row, for a total of 25 frames per image [6]. As mentioned above, if the speaker speaks slowly or at length, the number of frames increases although only one word or syllable is conveyed. In this study, only the keyframes with pronounced lip movement were selected, relying on measuring the outer lip dimensions, i.e. the mouth's height (h) and width (w), for each frame x, as shown in Figure 2. The lip dimension of each frame was calculated by (3):

$$f(x) = w(x) \times h(x) \quad (3)$$

With the help of the increasing intervals, decreasing intervals, relative maxima, and relative minima of the graph of the lip dimension, the keyframes were extracted and selected. If c is a value of x at which the graph has a relative maximum or relative minimum, then the slope at c is zero, as represented in (4):

$$f'(c) = 0 \quad (4)$$
Here f has a relative maximum or minimum at c, with f(x) ≤ f(c) for a relative maximum and f(x) ≥ f(c) for a relative minimum. Each video can produce several relative maxima and relative minima as keyframes. In this experiment, three keyframes were assigned to a single syllable. The first keyframe (SLI) represents the frame where the mouth starts to pronounce the syllable. The second keyframe (MLI) is the frame where the mouth reaches the maximum opening or movement of the syllable. The last keyframe (ELI) is taken at the end of the pronunciation of that syllable. For example, the word 'one' in English is pronounced 'nueng' (หนึ่ง) in the Thai language. Fifty-one frames were split from a video of the dataset and the lip area was cropped, as shown in Figure 3. Frames number 9, 18, and 32 were selected as the three keyframes representing the single syllable, based on the relative maximum and relative minima, as shown in Figure 4. After the three keyframes were selected, each keyframe image was scaled down to 32×32 pixels and the three were merged into a single RGB color image with a dimension of 96×32 pixels, as shown in Figure 5, whereas other studies use image dimensions of 80×60 or 224×224 pixels ([7] and [6] respectively). Thus, each input color image used to build the model contains only 3,072 pixels, fewer than in the images used in other studies [6, 7].
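As an illustration of the above pipeline, the following is a minimal sketch that extracts the lip ROI per frame, tracks the lip dimension, picks the SLI, MLI, and ELI at relative extrema, and concatenates them into a 96×32 image. It assumes OpenCV (whose bundled Haar cascade performs Viola-Jones detection), dlib with the separately downloaded shape_predictor_68_face_landmarks.dat model, NumPy, and SciPy; the measure f(x) = w(x)×h(x), the extrema smoothing order, and the first-minimum/highest-maximum/last-minimum selection rule are illustrative assumptions, not the paper's exact procedure.

import cv2
import dlib
import numpy as np
from scipy.signal import argrelextrema

# Viola-Jones face detector (Haar cascade bundled with OpenCV) and the
# dlib 68-landmark predictor (model file downloaded separately).
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def lip_roi(frame):
    """Detect the face, locate the 68 landmarks, and crop the outer lip
    region; returns the ROI with its width w and height h, or None."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, 1.3, 5)
    if len(faces) == 0:
        return None
    fx, fy, fw, fh = faces[0]
    rect = dlib.rectangle(int(fx), int(fy), int(fx + fw), int(fy + fh))
    shape = predictor(gray, rect)
    # Outer lip landmarks are points 48-59 (0-indexed).
    pts = np.array([(shape.part(i).x, shape.part(i).y)
                    for i in range(48, 60)], dtype=np.int32)
    x, y, w, h = cv2.boundingRect(pts)
    return frame[y:y + h, x:x + w], w, h

def c3_ski(video_path):
    """Build a 96x32 concatenated three-keyframe (SLI, MLI, ELI) image."""
    cap = cv2.VideoCapture(video_path)
    rois, dims = [], []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        result = lip_roi(frame)
        if result is not None:
            roi, w, h = result
            rois.append(roi)
            dims.append(w * h)        # assumed lip dimension f(x) = w*h, (3)
    cap.release()
    f = np.asarray(dims, dtype=float)
    maxima = argrelextrema(f, np.greater, order=2)[0]   # peaks, f'(c) = 0
    minima = argrelextrema(f, np.less, order=2)[0]      # valleys, f'(c) = 0
    # Assumed selection rule: SLI = first minimum, MLI = highest maximum,
    # ELI = last minimum, with fallbacks for short or flat curves.
    sli = minima[0] if len(minima) else 0
    eli = minima[-1] if len(minima) else len(f) - 1
    mli = maxima[np.argmax(f[maxima])] if len(maxima) else int(np.argmax(f))
    keyframes = [cv2.resize(rois[i], (32, 32)) for i in (sli, mli, eli)]
    return np.hstack(keyframes)       # single 96x32 RGB image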

D. Model Development
The model was designed based on CNNs with a total of 13 layers. It consists of 1 normalization layer, 6 convolutional (Conv) layers with a Rectified Linear Unit (ReLU) activation function for each convolutional layer, 3 max-pooling layers, and 3 Fully-Connected (FC) layers, which include a flatten layer and two dense layers. There is a total of 846,890 parameters for training. The model architecture is shown in Figure 6. All images of the Thai digit dataset were used for training and validation of the designed model, with 80% of the dataset used for training and 20% for validation. Training was set to a total of 200 epochs with a mini-batch gradient descent size of 32. The model was built on an Intel Core i7-7700HQ PC at 2.80GHz, with 16GB of memory, a 512GB Samsung Solid State Drive, and a 3GB NVIDIA GeForce GTX 1060, running Python version 3.8.3 on Windows 10 x64.
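A minimal Keras sketch of this 13-layer design and training setup is given below, including the early stopping criterion described in Section II.E; the filter counts, kernel sizes, dense width, and optimizer are assumptions, since only the layer types and the 846,890-parameter total are reported, so the parameter count of this sketch will differ.

import tensorflow as tf
from tensorflow.keras import layers, models

# 13 layers: 1 normalization, 6 Conv+ReLU, 3 max-pooling, flatten + 2 dense.
model = models.Sequential([
    layers.Rescaling(1.0 / 255, input_shape=(32, 96, 3)),  # normalization
    layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
    layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
    layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (3, 3), padding="same", activation="relu"),
    layers.Conv2D(128, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(10, activation="softmax"),   # ten digit classes
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# x and y are assumed to hold the 3,000 C3-SKI images and one-hot labels.
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5)
history = model.fit(x, y,
                    validation_split=0.2,     # 80% training / 20% validation
                    epochs=200,
                    batch_size=32,
                    callbacks=[early_stop])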

E. Model's Effectiveness Evaluation
The developed model was validated by the accuracy rate and the loss value. The accuracy was calculated by (5), and the loss value was computed with the cross-entropy loss function, which is widely used for classification and is defined in (6) [32-38]:

$$\text{Accuracy} = \frac{C}{N} \times 100\% \quad (5)$$

where C refers to the total number of samples recognized correctly and N refers to the total number of samples.
$$\text{Loss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{i,j} \log(p_{i,j}) \quad (6)$$

where N refers to the total number of samples, M refers to the total number of classes, y_{i,j} refers to the true value indicating whether sample i belongs to class j, and p_{i,j} refers to the probability predicted by the model that sample i belongs to class j.
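As a small numerical illustration of (5) and (6), using made-up predictions for N = 2 samples and M = 3 classes:

import numpy as np

y_true = np.array([[1, 0, 0],        # one-hot labels, N=2 samples, M=3 classes
                   [0, 1, 0]])
y_pred = np.array([[0.7, 0.2, 0.1],  # predicted class probabilities
                   [0.1, 0.8, 0.1]])

# (5): accuracy = C / N, with C the correctly recognized samples.
accuracy = np.mean(np.argmax(y_pred, axis=1) == np.argmax(y_true, axis=1))

# (6): cross-entropy loss averaged over the N samples.
loss = -np.mean(np.sum(y_true * np.log(y_pred), axis=1))
print(accuracy, loss)   # 1.0 and about 0.290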
The model stops training automatically using the 'EarlyStopping' feature supported by the Keras library through a callback function. The patience value was set to 5, monitoring the validation loss, so the model stopped training when there was no improvement in the validation loss for 5 consecutive epochs.

III. RESULTS
The minimum validation loss was reached at 37 epochs of training. Moreover, this experiment compared the accuracy of the proposed input image reduction technique with other studies using the AVDigits dataset. The accuracies of the lip reading models built with MDBN [39], MDAE [40], RTMRBM [26], and am-LSTM are compared with that of the proposed technique in Table II and Figure 10.

IV. CONCLUSION AND DISCUSSION
This paper proposes the application of the C3-SKI to a CNN for lip reading. The C3-SKI, consisting of the SLI, MLI, and ELI, was tested in lip reading recognition on the THDigits and AVDigits datasets. Its primary input is images, as in other deep learning techniques, but the number of images is reduced by finding keyframes based on the relative maximum and relative minimum values. In this paper, three sequential 32×32 pixel keyframes were assigned to represent any syllable of a digit between zero and nine, and these keyframes were concatenated to produce a new 96×32 px image as input to the neural network. These new input images have smaller dimensions and fewer pixels than those of most state-of-the-art methods. There were 3,000 keyframe input images in 10 classes, divided into 80% for training and 20% for validation. The developed CNN model included 1 normalization layer, 6 convolutional layers, 3 max-pooling layers, and 3 fully-connected layers. Training took a total of 47 epochs and was finalized when the validation loss reached its minimum and the model did not improve any further. As a result, the model had accuracy, validation accuracy, loss, and validation loss values of 95.06%, 86.03%, 4.61%, and 9.04% respectively.
The model's accuracy was 85.62% when applied to the AVDigits dataset. Therefore, a reduced number of keyframes based on relative maxima and relative minima can be applied in conjunction with a CNN for lip reading. The results of this study conform to the conclusions of [6, 7], in which the CNN method with reduced concatenated frame images was applied to lip reading with a high level of effectiveness. These concatenated frame images have smaller dimensions than the traditional images usually used for image classification with CNNs (224×224 px).
In future work, the researchers plan to create a dataset of Thai sentences and test the C3-SKI technique on it. Scaled-down keyframes could be compared across different numbers of keyframes representing each syllable in continuous speech or sentences, such as 5, 7, or 9 sequential keyframe images. This will require developing a function or equation to find the common transition point of the lips from one syllable to the following syllable. Besides, corpus-based word prediction methods could be combined with image processing to increase the accuracy of deep learning lip reading in real-time processing.