Vowel Recognition for Rehabilitation Assessment of Speech Disorder Patients via Multi-source Frequency Spectrum Images

Communication impairments have a broad range of medical causes, such as speech disorders, hearing loss, brain injury, stroke, and physical impairments. As a result, communication disorders can affect social development and interpersonal relationships. Speech impairments can benefit from early speech treatment; however, the majority of rehabilitation facilities worldwide still carry out this process manually. A wide range of studies has been conducted on speech processing for various human languages, and machine learning and deep learning have been applied in the medical and healthcare industry to enhance rehabilitation. This study analyzed the classification accuracy of the designed network and several pre-trained models (VGG-Net, AlexNet, and Inception) and performed a complete comparative analysis of their classification accuracy. The sound is converted to an image, a newly proposed representation named image-profiled data, so that it can be processed by a neural network. These image-profiled datasets, built from spectrograms and Mel-frequency cepstral coefficients (MFCC), produced this study's best results and accuracy. This project aims to develop a new neural network that can successfully distinguish the vowels in the voices of normal people, patients with speech disorders, and a mix of the two groups, using six and twelve classes of Malay vowels. The designed network model, trained with a batch size of 6, 20 epochs, and ADAM as the optimizer, achieved the maximum accuracy values for both class configurations of image-profiled audio data in the analyses conducted.


Introduction
Published Online First: July, 2024. https://doi.org/10.21123/bsj.2024.9202. P-ISSN: 2078-8665, E-ISSN: 2411-7986. Baghdad Science Journal.

Communication is essential for every human in every possible way. If a disabled person can regain ways to communicate with their family, this may help to re-establish emotional bonds and support roles, and it may help to prevent frustration. A man who used to be self-sufficient may become enraged when he needs help and has difficulty requesting it for simple tasks. Some of the frustration might be relieved by introducing efficient communication strategies. Sound production is necessary, and the form of words will be produced to communicate. Sounds are produced by the vibration of the vocal cords in a human's mouth. There are numerous studies on sound production and vibration. Ladefoged and Johnstone's research 1 in 2015 is among the works that proposed the acoustic phonetics and formant frequencies of humans. The cavities above the larynx produce sound; these cavities resonate at specific frequencies, while others do not. The frequencies that resonate are called formant frequencies.
Formants, or formant frequencies, are the peaks or maximal points in the sound spectrum. The spectrum is the acoustic component that can identify a complex sound wave 2. Thus, this study uses two types of visual representation of speech: the spectrogram and the Mel-frequency cepstral coefficient (MFCC). Visual representations, or image profiles, help to analyse and interpret the dataset images and to understand sounds with objective numbers. Vowels are letters that stand for speech sounds in which the air leaves the mouth unimpeded by the tongue, lips, or throat 3. Speech sounds can be classified into two categories: those where the air is blocked by the lips, tongue, or throat before leaving the mouth, and those where it is not. Vowels signify unblocked sounds, while consonants represent blocked sounds. Crystal et al. state that there are three formant frequencies, and every vowel, which is categorized into the front, central, and back of the human mouth, has its own frequencies 4.
Aphasia is a disorder resulting from damage to the portions of the brain responsible for language, which for most people are on the left side of the brain 5. Aphasia usually occurs suddenly, often following a stroke or a head injury, but it may also develop slowly due to a brain tumour or progressive neurological disease. It is therefore essential to note that aphasia is not a disease in itself but rather a constellation of symptoms concerning difficulties with expressing or comprehending language, which can have multiple different underlying causes. Even though speech-language pathologists cover a wide range of treatments, two separate areas must be distinguished when diagnosing someone with a speech problem 6.
One is called apraxia of speech, and the other is called dysarthria. According to one definition, dysarthria is a group of speech disorders caused by problems in the muscles that control speech, as one or more of the motor systems essential to speech production are impaired 7. According to this definition, "dysarthria" only refers to speech disorders with a neurogenic origin. A motoric deficit, or a fundamental disturbance of movement, of the muscles in the speech production process results in dysarthria 8. Apraxia of speech is a motor programming disorder characterized primarily by articulatory disturbances with associated compensatory prosodic disturbances 9. Unlike dysarthria, apraxia of speech is not related to significant slowness, weakness, incoordination, or paralysis of the muscles of the speech production mechanism.
A spectrogram is a three-dimensional visual representation of speech. It shows time horizontally while frequency is vertical. Initially it is two-dimensional, but the added dimension comes from the intensity of the information in the spectrogram 10. The third dimension, represented by colors, depicts the signal strength (or loudness) of a given frequency at a given point. The color scheme may vary between spectrograms, but the essential idea behind each remains constant 11. MFCC is widely used in speech recognition as a visual aid. It is popularly extracted from speech signals for use in recognition tasks. MFCC can represent the filter (vocal tract) in the source-filter speech model 12. The frequency response of the vocal tract is relatively smooth, whereas the source of voiced speech can be modeled as an impulse train. MFCC uses a logarithm, which transforms the convolution between the excitation source and the vocal tract (filter) into an addition in the log-spectral domain 13. The term 'cepstral' in MFCC is an anagram of 'spectral'.
The conversion from the WAV audio file to the spectrogram image profile is pre-processed using the LibROSA library in Python. During this step, the signal was downsampled from a 44100 Hz to a 16000 Hz sampling rate. The length of the window used is 1024 samples, and the signal was then converted into a time-frequency representation using the short-time Fourier transform (STFT), which is calculated with the fast Fourier transform (FFT) 14. This paper focuses on vowel recognition for normal persons and stroke patients with speech defects. Two groups, normal people and disorder patients, were used to acquire the sound recording data. A few studies have so far classified the pronunciation of Thai, Japanese, and Arabic vowels using intelligent methods. Therefore, this research aims to design and develop a system using a deep learning structure for Malay vowel speech recognition. The experiments in this study are carried out with five network models. This system was created to address the issue of pronunciation problems in speech disorders, especially in stroke patients.

Related Works
Speech is the production of individual sounds that are put together to form words, while language is a combination or system used to convey ideas and understand others. Speech and language disorders may occur separately or together. It is important to note that these symptoms may be mistakenly identified as poor listening, inattention, selective hearing, or bad behavior. Vowel recognition is applied to vowels worldwide. A recognition approach for Arabic phonemes also uses a similar deep learning approach. For that study, the crucial part of the experiments is the pronunciation of Arabic phonemes. Incorrect pronunciation of Arabic short vowels can completely alter a sentence's meaning. For this reason, both students and teachers of Classical Arabic (CA) must practice more while correcting students' pronunciation of Arabic short vowels. That study developed a model that can classify Arabic vowels using Deep Neural Networks (DNN). Similar to this study, they created and designed a new Arabic audio dataset, developed a neural network architecture, and achieved the best classification accuracy. Their proposed model reached a testing accuracy of 95.77% 16.
Javanese vowels are unique to pronounce, as the language has vowels, semi-vowels, and consonants, which can complicate pronunciation for new learners and beginners. That study aimed to improve a model's accuracy by using weight initialization methods and weight functions; three weight initializations and activation functions were used inside a CNN model. The study used MFCC and MFSC features to conduct a multinomial logistic regression model, and it proved that the optimal CNN model could be achieved by combining Xavier weight initialization and ReLU activation 17. The classification accuracy gained from that study is 99.60%.

Materials and Methods
This study's proposed method aims to improve training and validation accuracy. A large collection of images, including images of normal people and speech disorder patients, is required to construct a new intelligent classification system for the speech disorders of stroke patients. The recording, converting, and cropping processes produce data for normal people and stroke patients. A convolutional neural network is then trained on the entire dataset.

Data Collection
The steps for recording, gathering, and preparing the dataset are described in this section. To ensure the data's accuracy and quality, every action taken to prepare the dataset was crucial. There are no publicly accessible datasets of Malay vowels suitable for this study's goal; therefore, the dataset was created for this study's purpose. The dataset was gathered from two groups: a normal person group and a group of stroke patients with a speech impairment. All tests were run on this dataset. For the normal and healthy people, the speech dataset was collected from Malay speakers who speak the standard Malay dialect (10 males and 10 females between 18 and 27 years old). The speech data was then gathered from 9 stroke patients from the Perkeso rehabilitation centre in Ayer Keroh, Melaka, Malaysia, who also spoke the standard Malay dialect (6 males and 3 females).
The dataset contains standard speech data recorded at 44,100 Hz with a REMAX RP1 8GB digital audio voice recorder. Everyone recorded in the same way, with a 15 cm space between their mouth and the voice recorder. For the normal person group, every vowel had to be pronounced in three distinct segments: short, middle, and long. The lengths of the recordings for the short-period, middle-period, and long-period signals are 1, 2, and 3 seconds, respectively. This research divided the recording time into three periods for the normal person group to match the patients' pronunciation capabilities: stroke patients may have difficulties pronouncing some vowel classes and typically pronounce a vowel for one to three seconds.
This study aims to evaluate how expanding the classification groups of the dataset affects performance accuracy. There are six classes of vowels and twelve classes of vowels in this study's classification. Every vowel class in the first six classes combines male and female individuals, whereas the twelve classes separate male from female members. This study observes whether adding more classification groups improves performance or whether combining males and females alters training performance. Fig. 1 shows the spectrogram images for the two groups, normal persons and speech disorder patients. Each vowel class in each group sample is represented in Fig. 1. Vowel classes are listed from left to right in the following order: /a/, /e/, /E/, /i/, /o/, /u/. After the conversion from audio to images is complete, each vowel's image is cropped to a size of 240 by 55 pixels with a bit depth of 32 bits.
Mel Frequency Cepstral Coefficient (MFCC)

During the conversion from audio to MFCC, the default acoustic features used are 20 Mel bands. The features of the speech sample are extracted after some basic processing. Pre-emphasis, frame blocking, and windowing are the three feature extraction stages used in MFCC 19. The pre-emphasis stage passes the speech signal x(n) through a high-pass filter, y(n) = x(n) - a * x(n - 1), where y(n) is the output signal and the value of "a" is typically between 0.9 and 1.0.

Convolutional Neural Networks
The Convolutional Neural Network (CNN) is a widely used model that comprises one or more convolutional layers, which can be followed by pooling or fully connected layers, and is based on a variant of the multilayer perceptron. These convolutional layers create feature maps that record a region of the image, which is ultimately broken into rectangles and sent on for nonlinear processing. The advantages of CNNs are that they offer very high accuracy in image recognition problems and are capable of automatically detecting important features without any human supervision, compared with the ANN and RNN 20. An ANN has no specific rule for determining the network structure, and an appropriate structure is achieved through experience and trial and error. Besides, the RNN model cannot process very long sequences when tanh or ReLU is used as the activation function, and the training process of an RNN is challenging. In conclusion, the CNN is considered more powerful than the other models, as it has high accuracy in image recognition problems and uses weight sharing 21.
The structure of the designed network is described in this section.Cross-validation was used to refine these hyperparameters, ensuring they were stable and not overfit to specific data subsets.Some analysis was conducted to examine how different hyperparameters affected the model's performance and enable further optimization if needed.The final hyperparameters were then selected based on this analysis.
The convolutional layers applied convolutional filters to the input before a nonlinear activation function, with Convolution Layer 1 being the first. Each layer had a kernel of size three, batch sizes of 32 and 64 were tried, and two distinct dropout values were used. The CNN model was created through a trial-and-error learning process and refined by gradually raising testing accuracy, which reached 90.0% 22. The architecture of the CNN was selected with both simplicity and complexity in mind. Higher accuracy may be possible with more complicated models, but such models frequently have larger processing costs and are more susceptible to overfitting. Conversely, simpler models may lack the capacity to fully capture complex patterns in the data. The complexity of the dataset and the intended level of abstraction served as the main factors in deciding the number of convolutional layers, pooling layers, and so on. The CNN architecture was chosen after taking into account the dataset's size and the processing resources available. The selected design attempts to achieve a good trade-off between accuracy and processing speed by finding a balance between simplicity and complexity.
When it comes to learning rates, the training procedure usually includes adaptive learning rate algorithms, such as Adam or RMSprop; these methods dynamically adjust the learning rate during training to optimize convergence. For this study, a learning rate of 0.0001 was employed. As for loss functions, the choice depends on the specific task: common choices include mean squared error (MSE) for regression problems and categorical cross-entropy for classification tasks. Early stopping criteria are often used to prevent overfitting and involve monitoring a validation metric such as accuracy or loss. This study did not use an early stopping class to prevent overfitting.
Instead, the study relied on a small number of epochs to ensure that the model didn't train for too long and potentially over-fit the data.This approach helps strike a balance between training the model adequately and avoiding over-fitting.

Results and Discussion
In this section, the performance of the proposed CNN is compared with the other current network models, using a confusion matrix to measure performance accuracy in all analysis studies. The results were split into two analytical studies: (a) the comparison of pre-trained models using the spectrogram image profile, and (b) the comparison of pre-trained models using the MFCC image profile. A ratio of 80:10:10 was used to divide the vowel dataset into training, evaluation, and testing sets for the analysis experiments in (a) and (b). This study aims to validate the efficacy of the models (the designed model, VGG-16, VGG-19, AlexNet, and Inception) using spectrogram images of 55 x 240 and MFCC images of 35 x 200 dimensions, both with 32-bit depth. The approach was implemented in Python with the Keras (TensorFlow) neural network computation framework.
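The 80:10:10 split can be sketched as a simple shuffled index partition; the function name is illustrative, and 10800 is the normal-person spectrogram image count reported in the next subsection:

```python
import numpy as np

def split_80_10_10(n_items, seed=0):
    """Partition dataset indices into 80% training, 10% validation, 10% testing."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_items)
    n_train = int(0.8 * n_items)
    n_val = int(0.1 * n_items)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

tr, va, te = split_80_10_10(10800)
print(len(tr), len(va), len(te))  # 8640 1080 1080
```

In practice each index would map to one cropped spectrogram or MFCC image and its vowel-class label.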

The Comparison of Pre-Trained Models by Using Spectrogram Image-Profiled
The first study analysis includes 20 normal persons, ten male and ten female; a total of 10800 spectrogram images were collected to conduct this experiment. The second analysis consists of an unequal number of speakers with speech disorders, with totals of 1620 and 1080 images for six and twelve classes, respectively. The third and last analysis, called the mixed analysis, combines the normal person and speech disorder patient datasets; totals of 12420 images for six classes and 11880 images for twelve classes were collected. The epoch and batch size were set to 20 and 6, respectively. The dataset for the normal person group using the spectrogram and MFCC image profiles is balanced, with the same number of images collected for each; similarly, for the post-stroke patient group, the two image profiles contain balanced and equal amounts of data. However, the comparison of image counts between the groups is unbalanced because of the constraints post-stroke patients face in pronouncing the vowel classes during the recording sessions, so the amount collected in the post-stroke patient group is smaller than in the normal group. To mitigate this, the dataset for the group of normal people was varied so that the system can learn more about each vowel class from the datasets of both normal people and post-stroke patients.
Five different network models (the designed model, VGG16, VGG19, Inception, and AlexNet) are compared comprehensively with one another, and this study reports the experimental findings for each network. For a batch size of 6 and 20 epochs, the designed network model in Figs. 3 and 4 achieves classification accuracies of 92.96% and 93.70% for the 6 and 12 classes of vowels, respectively; the 12-class designed model has higher validation accuracy than the 6-class model. Table 4 reports these results. From the performance results above, the classification performance for six vowel classes is lower than for 12 classes for the designed and VGG networks. Based on this observation, this study may infer that increasing the training groups of the dataset will improve classification performance and accuracy. Besides that, the six-class AlexNet and Inception networks perform better than their 12-class counterparts, consistent with the idea that a larger training dataset per class yields higher accuracy. In the analysis of the speech disorder patient dataset, the validation accuracy for the six vowel classes of the designed network was lower than in the normal person analysis shown in Table 2. Because of their difficulties pronouncing vowels during the recording procedure, speech disorder patients' validation accuracy is lower than normal people's; some of them struggle with certain vowel groups, and vowels with similar sounds, like /e/ and /i/, can make the datasets challenging to train. However, the mixed analysis, which combines the datasets from patients with speech disorders and normal people, shows a rise for the 12 classes of vowels.
Table 6 displays the CNN model's precision, recall, and F1 scores for categorising each vowel in the six classes of vowels. Throughout the experiments across all analyses of the six vowel classes using the spectrogram image profile, the dataset's lowest F1 score was obtained by class /e/ of the normal person analysis, with a score of 66%. Conversely, class /a/ has the highest F1 score, 98% and 100% for the normal person analysis and the speech disorder patient analysis, respectively. For the 12 classes of vowels, the confusion matrix can be evaluated using the F1 score findings. The lowest score was gained by the /e/ class of the normal person analysis with 21%, while the highest score on the dataset is 100% for the /o/ class. The designed model's confusion matrices are displayed in Figs. 5 and 6 for the error analysis. The confusion matrix shows the most mispredictions for the /e/ vowel class. Vowel pairs involving sounds such as /e/ and /u/ are the most imprecise in the confusion matrix; medical theory explains this because these sounds have similar characteristics, and in turn the system's model recognition is less precise for them. Dark red tones symbolize accurate predictions, whereas light red tones symbolize inaccurate predictions 26. For the MFCC image profile, the highest F1 score was attained by class /a/ of the normal person analysis, with a score of 98%, while the lowest score was achieved by class /u/ in the speech disorder patient analysis, with 68%. Overall, class /a/ has the highest F1 score in all analyses. Fig. 9 and Fig. 10 show the confusion matrices for the six-class normal person analysis using the MFCC image profile.

Figure 1. Spectrogram images for vowels: (a)-(f) show vowels /a/, /e/, /E/, /i/, /o/ and /u/ for a normal person, and (g)-(l) show vowels /a/, /e/, /E/, /i/, /o/ and /u/ for a speech disorder patient.

Fig. 2
Fig. 2 displays the cropped MFCC images of each class for the normal person and speech disorder groups. Each image has a size of 200 by 35 pixels and a bit depth of 32 bits. To ensure that essential information was retained during the conversion process, careful attention was given to the image quality and resolution of both the spectrogram and MFCC images. The resolution was set at a level that allowed clear visualization of important features while avoiding excessive noise or distortion. The pre-emphasis filter applied before feature extraction is y(n) = x(n) - a * x(n - 1).

Figure 2. MFCC images for vowels: (a)-(f) show vowels /a/, /e/, /E/, /i/, /o/ and /u/ for a normal person, and (g)-(l) show vowels /a/, /e/, /E/, /i/, /o/ and /u/ for a speech disorder patient.

The reasoning behind selecting specific image sizes for the spectrogram (240 x 55) and MFCC (200 x 35) is the trade-off between resolution and computational efficiency. The selected sizes balance memory and processing-power consumption while capturing sufficient frequency and time information, and they are effective in various audio analysis applications.
To address the vowel recognition problem, this study developed a designed network model. A CNN with input image sizes of 240 x 55 and 200 x 35 was used to build the model, with ADAM as the optimizer and SoftMax as the activation function of the final layer. The final layer transforms the output into a probability distribution across classes, while the ADAM optimizer minimizes the loss function. These decisions improve the CNN's performance and accuracy in image classification tasks. The CNN architecture includes convolutional layers, activation functions, pooling layers, dropout layers, and fully connected layers. Padding techniques maintain feature-map size while boosting efficiency. CNN architectures use convolutional layers to extract features, pooling layers to reduce spectral variability, and dropout layers to mitigate overfitting. A fully connected layer predicts the image class from the convolution output. The CNN hyperparameters were established through a methodical process of testing and analysis. A literature review and prior knowledge were used to identify a range of values for each hyperparameter, followed by a grid search or random search strategy to investigate various combinations. The CNN model's performance was evaluated using metrics such as accuracy or the loss function, and the hyperparameters with the best results were chosen.
This study presented a CNN model with six convolutional layers, three max-pooling layers, one flatten layer, and two fully connected layers as the optimum CNN model for Malay vowels. The model's specifications are as follows:
- The first convolutional layer consists of 32 filters (3 x 3) with a ReLU activation function, producing feature maps of 238 by 53 pixels.
- The second convolutional layer consists of 32 filters (3 x 3), a ReLU activation function, max-pooling (2 x 2) yielding images of 118 x 25, and dropout (0.25).
- The third convolutional layer consists of 32 filters (3 x 3) with a ReLU activation function and an output resolution of 116 x 23.
- The fourth convolutional layer consists of 32 filters (3 x 3), a ReLU activation function, max-pooling (2 x 2) yielding a size of 57 x 10, and dropout (0.25).
- The fifth convolutional layer consists of 64 filters (3 x 3) with a ReLU activation function, producing feature maps of 55 by 8 pixels.
- The sixth convolutional layer consists of 64 filters (3 x 3), a ReLU activation function, max-pooling (2 x 2) yielding a layer size of 26 by 3, and dropout (0.25).
- Finally, a dense layer of 1024 units is combined with a dense SoftMax layer of 6 units.
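A minimal Keras sketch of the layer stack above, consistent with the stated feature-map sizes for a 240 x 55 input; the 3-channel input is an assumption (the paper stores 32-bit images, so the channel count may differ), and `build_vowel_cnn` is an illustrative name:

```python
from tensorflow.keras import layers, models, optimizers

def build_vowel_cnn(input_shape=(240, 55, 3), n_classes=6):
    """Sketch of the six-convolutional-layer CNN described in the text."""
    m = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, 3, activation="relu"),   # -> 238 x 53
        layers.Conv2D(32, 3, activation="relu"),   # -> 236 x 51
        layers.MaxPooling2D(2),                    # -> 118 x 25
        layers.Dropout(0.25),
        layers.Conv2D(32, 3, activation="relu"),   # -> 116 x 23
        layers.Conv2D(32, 3, activation="relu"),   # -> 114 x 21
        layers.MaxPooling2D(2),                    # -> 57 x 10
        layers.Dropout(0.25),
        layers.Conv2D(64, 3, activation="relu"),   # -> 55 x 8
        layers.Conv2D(64, 3, activation="relu"),   # -> 53 x 6
        layers.MaxPooling2D(2),                    # -> 26 x 3
        layers.Dropout(0.25),
        layers.Flatten(),
        layers.Dense(1024, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])
    # ADAM optimizer with the study's learning rate of 0.0001
    m.compile(optimizer=optimizers.Adam(learning_rate=1e-4),
              loss="categorical_crossentropy", metrics=["accuracy"])
    return m

model = build_vowel_cnn()
print(model.output_shape)  # (None, 6)
```

Note that the intermediate sizes in the comments match the specifications listed above exactly when unpadded ("valid") 3 x 3 convolutions and 2 x 2 pooling are used.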

When the convolutional neural network (CNN) tests a particular dataset, the confusion matrix is utilised to identify which classes the CNN predicts properly and which it does not. The entries of the confusion matrix count the true positives, true negatives, false positives, and false negatives 23. The equations derived from the confusion matrix describe crucial predictive characteristics, including recall, specificity, accuracy, and precision. In addition to accuracy, this study used precision, recall, and F1 scores as performance metrics to evaluate the model. Precision measures the proportion of correctly predicted positive instances out of all instances predicted as positive. Recall calculates the proportion of correctly predicted positive instances out of all actual positive instances. The F1 score is the harmonic mean of precision and recall, providing a balanced measure between the two metrics. These additional metrics give a comprehensive understanding of the model's performance beyond accuracy alone.
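The metrics above follow directly from the confusion matrix; a small NumPy sketch (the 2 x 2 matrix is illustrative, not from the study's results):

```python
import numpy as np

def precision_recall_f1(cm):
    """Per-class precision, recall, and F1 from a confusion matrix
    (rows: true classes, columns: predicted classes)."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)                 # correctly predicted instances per class
    precision = tp / cm.sum(axis=0)  # TP / (TP + FP)
    recall = tp / cm.sum(axis=1)     # TP / (TP + FN)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

cm = [[9, 1],   # class 0: 9 correct, 1 confused with class 1
      [2, 8]]   # class 1: 8 correct, 2 confused with class 0
p, r, f1 = precision_recall_f1(cm)
```

For the vowel task the matrix would be 6 x 6 or 12 x 12, and the per-class F1 values correspond to the scores reported in Table 6.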

Figs. 3
Figs. 3 and 4 show the line graphs of the accuracy and loss of the designed CNN models using the spectrogram image profile for the 6 and 12 classes of the normal person group. The visualization displays a line graph from 0 to 20 epochs comparing the accuracy and loss of the training and validation models. The designed model with 12 classes of vowels outperformed the designed model with 6 classes and achieved the best accuracy of 93.70%, as shown in Fig. 4. Overall, the classification accuracy results for the designed network were above 85% in the normal, patient, and mixed analyses. Compared with the other pre-trained models, the designed network reached the highest accuracy in all performances conducted.

Figure 3 .Figure 4 .
Figure 3. Model Loss and Model Accuracy of Spectrogram Image-profiled for 6 Class of Normal Person

Figure 5 .Figure 6 .
Figure 5. Confusion Matrix of Spectrogram Image-profiled for 6 Class of Normal Person

Figure 8 .
Figure 8. Model Loss and Model Accuracy of MFCC Image-profiled for 12 Classes of Normal Person

Most neural network models depend on many hyperparameters, which makes VGG16 distinctive: instead of having a large number of hyperparameters, it focuses on a uniform structure. The 16 in VGG refers to the 16 layers that have weights. This network is extensive and has about 138 million parameters, so it usually takes a long time to train; because the architecture is large, smaller network architectures are often preferable to learn. Referring to the results gained in this study, VGG16 achieved an average result compared with the designed network model across all the experiments conducted. VGG19 is more complex than VGG16 and performs worse in accuracy and loss. The VGG19 network is trained on more than a million images from the ImageNet database, an image database of 14 million images organized according to the WordNet hierarchy. This network is 19 layers deep and can classify images into a thousand object categories, such as animals or objects.

Figure 9 .Figure 10 .
Figure 9. Confusion Matrix of MFCC Image-profiled for 6 Classes of Normal Person

Table 1. Distribution of Group Vowel Classes from Collected Datasets

Quantity of Classes | Class of Vowels
6  | /a/, /e/, /E/, /i/, /o/, /u/
12 | /a/ (Male), /e/ (Male), /E/ (Male), /i/ (Male), /o/ (Male), /u/ (Male), /a/ (Female), /e/ (Female), /E/ (Female), /i/ (Female), /o/ (Female), /u/ (Female)

Spectrogram

Instead of a real-time image, this study employed a spectrogram image profile, since a spectrogram is a graphic depiction of a signal's strength over time at the various frequencies that make up a waveform. The vertical axis of a spectrogram shows data according to frequency: the lowest frequency is shown at the bottom and the highest at the top.

Table 5. The Comparison of Result Accuracy of the Designed Network Models

Table 5 shows the training and validation accuracy for both class configurations using five different networks. In the training and testing performed by Hashim et al. 24 in 2022, batch size six and epoch 20 had the highest accuracy, only 81%, among the network models evaluated in the same environment. Of the network models examined in this study, the newly proposed model executed the maximum validation accuracy for epoch 20 and batch size 6, with 94.54%. This study has thus successfully improved upon Hashim's proposed network with a more accurate network for this investigation.