A Deep Learning Approach for Identifying and Discriminating Spoken Arabic Among Other Languages

Spoken Language Identification (SLID) is an important step in speech-to-speech translation systems and multi-lingual automatic speech recognition. In recent research, deep learning mechanisms have been the prevailing approaches for spoken language identification. This paper studies, detects, and analyzes spoken languages that resemble Arabic in the pronunciation of certain words, and then proposes a deep learning architecture, specifically the Bidirectional Long Short-Term Memory (BLSTM), for spoken Arabic language identification and for discrimination among these similar languages, namely German, Spanish, French, and Russian, all taken from the Mozilla speech corpus. Additionally, our work includes a linguistic study of the considered languages. A total of ten thousand speakers are chosen across the five languages, and the BLSTM architecture is designed and implemented using acoustic signal features and applied in five experiments. The results show precisions of 98.97%, 98.73%, 98.47%, and 99.75% for identifying spoken Arabic separately against German, Spanish, French, and Russian, respectively. Additionally, we achieved an average accuracy of 95.15% for discriminating among all five considered languages in terms of word pronunciation. Our findings confirm that a BLSTM architecture is able to distinguish between observably similar pronunciations of words in the considered languages.


I. INTRODUCTION
Spoken Language Identification (SLID) is the task of automatically recognizing the language of spoken speech [1]. SLID plays an essential role in many Natural Language Processing (NLP) applications [2], such as speech recognition, voice assistants and chatbots, dialog systems, sentiment analysis, information retrieval, social media analytics, question answering, email classification and filtering, and machine translation. SLID is also frequently used as a preprocessing technique [3], [4]. Although different approaches to spoken language identification have been proposed, the most common issue is discriminating between possibly similar languages, mainly applied to short texts; this remains a challenging task in NLP.
(The associate editor coordinating the review of this manuscript and approving it for publication was Okyay Kaynak.)
Arabic is the official language of the following countries: Saudi Arabia, Algeria, Bahrain, Egypt, Kuwait, Iraq, United Arab Emirates, Qatar, Oman, Chad, Comoros, Djibouti, Eritrea, Jordan, Lebanon, Syria, Morocco, Palestine, Somalia, Sudan, Tanzania, Mauritania, Tunisia, and Yemen. In addition to that, Arabic is a recognized minority language in five sovereign states: Niger, Turkey, Mali, Senegal, and Iran. There are more than four hundred million Arabic speakers [5]. In addition, it is recognized as the fourth most used language on the internet [6].
This paper focuses on giving the Arabic language substantial attention in digital speech and language processing with regard to spoken language identification. We detect the languages of several industrialized countries that are similar to Arabic in the pronunciation of certain words, and we build a deep learning approach both for discriminating between these languages with similar pronunciations and for identifying the spoken language among five languages, specifically Arabic, based on speech signals and using a large number of audio speech files.
The rest of this paper is arranged as follows. Section II briefly reviews the related literature. The chosen speech corpora are presented in Section III. In Section IV, we present the data preparation and experimental setup. A detailed discussion of the proposed methodology and our experiments are presented in Sections V and VI, respectively. Section VII presents the experimental results and discussions. Section VIII displays the work contributions. Finally, Section IX contains our conclusions and future work.

II. LITERATURE REVIEW
Currently, many works present various methods to identify the language of spoken speech [7], [8], [9]. These methods provide accurate results. However, such studies are usually limited to spoken language classification and the identification of specific languages [10], [11], [12], [13]. No technique has yet emerged as a gold standard for discriminating between different languages. Additionally, the study of possible similarities and dissimilarities between Arabic and other languages is urgently needed to improve spoken language identification [14].
Distinguishing between languages is a hurdle for SLID systems, especially when the languages contain similar spoken words. This area has been explored recently in the context of South-Slavic languages [15], Mandarin variants in China, Singapore, and Taiwan [16], the Indonesian and Malay languages [17], Brazilian and European Portuguese [18], and Dari and Persian [19]. Inspired by such studies, many shared-task competitions emphasize these tasks.
Arabic language identification for NLP is an important research effort, but few studies have been done in this area for either audio speech [20], [21], [22], [23], [24] or textual forms [25], [26]. The Mel Frequency Cepstral Coefficient (MFCC) is commonly used to solve this problem [10], [11], [27], [28], [29]. Heracleous et al. [1] presented experiments for SLID using Convolutional Neural Networks (CNN) and Deep Neural Networks (DNN). Additionally, they investigated merging the two approaches. The proposed methods were evaluated on the National Institute of Standards and Technology (NIST 2015) i-vector Machine Learning Challenge task to recognize fifty languages, with three hundred training vectors for each target language. Using CNN, the Equal Error Rate (EER) was 3.48%, and the EER using DNN was 3.55%. When the CNN and DNN systems were fused, the EER was 3.30%. The results show the effectiveness of using CNN and i-vectors in SLID. Additionally, these methods demonstrated significantly superior performance compared to the Support Vector Machines (SVM) method.
Draghici et al. [7] evaluated a previously proposed method for SLID based on Convolutional Recurrent Neural Networks (CRNN) and CNN, altering the training technique to guarantee a fair distribution of classes and effective memory utilization. They used a revised set of six languages: English, French, German, Greek, Italian, and Spanish. The methods achieved good accuracy values of 71% and 83% for CNN and CRNN, respectively, confirming that both CRNN and CNN can learn language-specific patterns in Mel spectrogram representations of speech recordings. However, the study was restricted to Germanic languages, Romance languages, and Greek.
Singh et al. [9] applied CNN to SLID. Their study involved many languages from different corpora: the spoken language identification corpus, the language identification corpus, the Kaggle corpus, and the Mozilla speech corpus. The spoken language identification corpus consists of three languages: English, Spanish, and German. Sisodia et al. [29] assessed several ensemble learning methods for SLID using MFCC and Delta Mel Frequency Cepstral Coefficient (DFCC) features. They recorded speech audio files in five languages: French, Dutch, English, German, and Portuguese. Ensemble learners were designed using AdaBoost, Bagging, Extra Trees, Gradient Boosting, and Random Forests, and their performance was compared using accuracy, precision, recall, and F-measure metrics. The Extra Trees classifier gave the best results: up to 85.71% accuracy, 84.00% precision, 87.50% recall, and 83.58% F-measure. However, the audio corpus used for this experiment has several inherent restrictions; therefore, the results should not be generalized. Additionally, the study did not consider the problem of discriminating between similar languages.
Prasad et al. [30] created two unique BLSTM models using Large Vocabulary Continuous Speech Recognition (LVCSR) based on lexical features and fixed-length, four-hundred-dimensional per-speech bottleneck features generated by the i-vector framework. They evaluated their proposed approach on the VarDial 2017 corpora for Arabic and German dialect identification, with the Egyptian, North African, Levantine, Gulf, and MSA dialects for Arabic and the Bern, Zurich, Lucerne, and Basel dialects for German. Additionally, they created an LSTM architecture to discriminate between similar languages, such as Bosnian, Croatian, and Serbian. The BLSTM architecture achieved an accuracy of 24.60% on lexical features and 57.70% on the bottleneck features of the i-vector framework. In contrast, performance was poor for discriminating between similar languages using LSTM, with an accuracy of 19.50%.
Malmasi et al. [31] presented the results of the third edition of the Discriminating between Similar Languages shared task, held at COLING 2016 as part of the VarDial 2016 workshop. The challenge had two subtasks: the first focused on identifying similar languages and language varieties in newswire texts written in thirteen languages divided into the following groups: (Czech, Slovak), (Indonesian, Malay), (Bosnian, Serbian, Croatian), (Brazilian Portuguese, European Portuguese), (Argentinian Spanish, Peninsular Spanish), and (American English, British English). The second subtask dealt with Arabic Dialect Identification (ADI) in speech transcripts. Thirty-seven teams registered for the task, twenty-four submitted test results, and twenty also wrote system description papers. The most successful feature was high-order character n-grams. The study also included SVM and CNN classifiers; in contrast, the deep learning approaches did not perform well. Ionescu and Butnaru [32] presented a machine learning method for the ADI and German Dialect Identification (GDI) closed shared tasks of the DSL 2017 Challenge. The proposed method combines several kernels using multiple kernel learning. While most kernels are based on character p-grams (also known as n-grams), a low-dimensional representation of the audio recordings is extracted from speech transcripts, and a kernel based on i-vectors is provided only for the Arabic data. Furthermore, they independently employed Kernel Discriminant Analysis (KDA) and Kernel Ridge Regression (KRR) in the learning stage. The method was simple and shallow. They ranked first in the ADI shared task with an accuracy of 76.27% and fifth in the GDI shared task with an accuracy of 66.36%. However, the study considered only two languages.
The results of these studies show that SLID performance has improved. However, research that includes Arabic remains limited, and these studies did not compare Arabic to other languages. Table 1 presents a summary of the considered literature. Initial exploratory experiments were executed to uncover the suitability of different deep learning architectures, and BLSTM was found to outperform all others [33], [34]. Consequently, in this paper we propose using the BLSTM architecture for Arabic language identification.

III. SELECTED SPEECH CORPUS
We use the Mozilla speech corpus [35] to train and evaluate the proposed architecture. The Mozilla speech corpus is available to download for research and study purposes and contains speech files of varying quality in a multi-lingual setup. In addition, the corpus includes samples from adult male and female speakers but none from children. For all languages, the audio speech files recorded in the Mozilla corpus come from different dialects, accents, and continents.
Researchers can use this corpus to train, validate, and test speech-enabled systems such as speech recognition systems and other digital speech processing tasks. Additionally, it may benefit different fields, such as gender classification [36], [37], speaker identification, speech recognition, or environment recognition and classification. Each row in the metadata file (.TSV) represents a single speech file and contains the ID, sentence, gender, accent, age, and path. The Mozilla speech corpus project employs crowdsourcing to collect data [38]. The original Mozilla speech datasets comprise training, validated, invalidated, and testing sets. In this work, each of the considered languages has been split into three subsets: training, testing, and validation, balanced with regard to gender and other variables. Our splitting strategy treats each speech file as an indivisible unit; no file is split across two different data subsets.
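A minimal Python sketch of this file-level splitting strategy is given below. The helper name `split_files` and the use of the standard-library `random` module are our own illustrative choices, not the paper's code; the sketch only reproduces the stated policy that each file belongs to exactly one subset and that each gender group is split separately so the subsets stay gender-balanced.

```python
import random

def split_files(files, train=0.72, val=0.08, seed=0):
    """Split whole speech files into train/val/test subsets.

    `files` is a list of (path, gender) rows taken from the .TSV index.
    Each gender group is shuffled and split separately, so the resulting
    subsets keep the corpus's gender balance, and a file never appears
    in more than one subset.
    """
    by_gender = {}
    for row in files:
        by_gender.setdefault(row[1], []).append(row)
    rng = random.Random(seed)
    subsets = {"train": [], "val": [], "test": []}
    for rows in by_gender.values():
        rng.shuffle(rows)
        n_tr, n_va = int(train * len(rows)), int(val * len(rows))
        subsets["train"] += rows[:n_tr]
        subsets["val"] += rows[n_tr:n_tr + n_va]
        subsets["test"] += rows[n_tr + n_va:]
    return subsets
```

With the 72%/8%/20% proportions used in the experiments, applying this to the 2,000 files of one language (1,000 male, 1,000 female) would yield 1,440 training, 160 validation, and 400 testing files.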
The latest release contains 96 languages, and over 500,000 individuals have participated. The audio speech files in this corpus have different lengths for each language, ranging from one second to ten seconds. Additionally, this corpus is considered the most extensive multi-lingual speech corpus that includes Arabic in the public domain for speech classification and identification.

IV. DATA PREPARATION AND EXPERIMENTAL SETUP

A. DATA PREPARATION
In this section, we describe the subsets of the Mozilla speech corpus chosen for five of the twelve most spoken languages in the world [39]: Arabic, German, Spanish, French, and Russian. Each language contains two thousand audio speech files balanced by gender (1,000 adult males and 1,000 adult females). Figure 1 presents the number of speech files in the training, validation, and testing folders and how the corpora were divided and organized.

B. EXPERIMENTAL SETUP
In this subsection, we study, detect, and analyze languages in comparison to Arabic in the pronunciation of certain words [40], [41]. It is important for readers to understand why similar spoken language units exist across more than one spoken language. Such common language units, typically words, may be caused by the sharing of religious, name-related, mathematical, and/or scientific terms among interacting cultures and societies.
VOLUME 11, 2023

1) INVESTIGATING PHONETIC SIMILARITIES AMONG ARABIC AND FOUR LANGUAGES IN SLID PROBLEM
Here we outline the similarities in pronunciation of words between five languages, including Arabic. For example, the pronunciation of the Arabic word '' '' is similar to ''Alkohole'' in German, ''Alcohol'' in Spanish, ''Alcool'' in French, and '' '' in Russian. Columns 1, 2, 3, 4, and 6 of Table 2 show the similarity between these five languages in pronouncing certain words.

2) INVESTIGATING PHONETIC SIMILARITIES AMONG ARABIC AND GERMAN LANGUAGES IN SLID PROBLEM
Here we outline the similarities in pronunciation of words between Arabic and German languages. For example, the pronunciation of the Arabic word '' '' is similar to ''Tarif'' in German [42]. Columns 4 and 6 of Table 2 show the similarity between these two languages in pronouncing certain words.

3) INVESTIGATING PHONETIC SIMILARITIES AMONG ARABIC AND SPANISH LANGUAGES IN SLID PROBLEM
Here we outline the similarities in pronunciation of words between Arabic and Spanish languages. For example, the pronunciation of the Arabic word '' '' is similar to ''Blusa'' in Spanish [43]. Columns 3 and 6 of Table 2 show the similarity between these two languages in pronouncing certain words.

4) INVESTIGATING PHONETIC SIMILARITIES AMONG ARABIC AND FRENCH LANGUAGES IN SLID PROBLEM
Here we outline the similarities in pronunciation of words between the Arabic and French languages. For example, the pronunciation of the Arabic word '' '' is similar to ''Zaouïa'' in French [44]. Columns 2 and 6 of Table 2 show the similarity between these two languages in pronouncing certain words.
Table 2. The pronunciation similarities of certain words between Arabic and four languages [46], [47], [48], [49], [50], [51], [52].

5) INVESTIGATING PHONETIC SIMILARITIES AMONG ARABIC AND RUSSIAN LANGUAGES IN SLID PROBLEM
Here we outline the similarities in pronunciation of words between Arabic and Russian languages. For example, the pronunciation of the Arabic word '' '' is similar to '' '' in Russian [45]. Columns 1 and 6 of Table 2 show the similarity between these two languages in pronouncing certain words.

V. PROPOSED METHODOLOGY
The design and implementation of the proposed methodology is discussed here. It has six steps: input, preprocessing, extracting features, framing, classification method, and output, as shown in Figure 2.
A. INPUT
Figure 2 shows a block diagram of speech processing for discriminating between similar languages in the pronunciation of words and for spoken language identification. For the input data, (T1) denotes training, (V1) denotes validation, and (T2) denotes testing. The input audio speech files contain all the similar words studied in Subsection B of Section IV, and the sentences of the audio speech files in our data subsets include the spoken words shared among our five languages.

B. PREPROCESSING
The sample rate of the selected Mozilla speech corpus is 48 kHz for each input audio speech file. In the preprocessing step, we segment each audio speech file along the boundaries of the speech signal and ignore the silent parts, based on the sample rate and the window duration, using the detectSpeech function; the silent portions of some files are very long and could dominate the actual signal information. We then replicate the label of the file across its segments. Figure 3 shows the details of the preprocessing step.
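The paper performs this segmentation with MATLAB's detectSpeech. As an illustration of the underlying idea only, the following is a simplified, energy-based stand-in in Python; the helper name `remove_silence`, the 20 ms window, and the energy-ratio threshold are assumptions, not the paper's settings.

```python
import numpy as np

def remove_silence(signal, fs=48_000, win_ms=20, energy_ratio=0.1):
    """Drop low-energy windows from a mono signal.

    The signal is cut into non-overlapping windows of `win_ms`
    milliseconds; a window is kept only if its short-time energy
    exceeds a fraction (`energy_ratio`) of the mean window energy.
    """
    win = int(fs * win_ms / 1000)
    n = len(signal) // win
    frames = signal[:n * win].reshape(n, win)
    energy = (frames ** 2).sum(axis=1)
    keep = energy > energy_ratio * energy.mean()
    return frames[keep].ravel()
```

A real voice-activity detector (such as detectSpeech) additionally smooths the decision over time and tracks an adaptive noise floor; this sketch only shows why long silent stretches are discarded before feature extraction.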

C. EXTRACTING FEATURES
Specifying and extracting features is an essential step in spoken language identification. A good feature representation simplifies the problem and enables higher accuracy with less computational complexity, whereas a poor representation exacerbates the problem and necessitates a more complicated classifier. In this paper, we use two features with our proposed method, namely MFCC and GTCC, after initially tuning our system with a larger set of features, from which these two proved the best to adopt. These features were extracted from each preprocessed audio segment using the audioFeatureExtractor function in MATLAB [48].

1) THE MEL FREQUENCY CEPSTRAL COEFFICIENT
The Mel Frequency Cepstral Coefficient (MFCC) is a well-known signal feature used extensively in spoken language identification [27], [29]. MFCC is based on the human peripheral auditory system: humans do not perceive the frequency content of speech signals on a linear scale [49]. Figure 4 shows the block diagram of MFCC [50].

2) THE GAMMATONE CEPSTRAL COEFFICIENT
The Gammatone Cepstral Coefficient (GTCC) is the other feature extraction technique used. It depends on a Gammatone filter bank, which models the human auditory system as a series of overlapping band-pass filters and mimics it more accurately than the triangular filters employed in MFCC [51]. In addition, GTCC uses an analytically designed frequency scale and the Equivalent Rectangular Bandwidth (ERB) rate scale. Figure 5 shows the block diagram of GTCC [49].
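To make this step concrete, the following is a minimal from-scratch MFCC sketch in Python with NumPy/SciPy. The paper itself uses MATLAB's feature extraction; the FFT size, hop length, and filter-bank size below are illustrative assumptions, and `mfcc` is a hypothetical helper. A GTCC implementation would replace the triangular mel filter bank with a gammatone filter bank on the ERB scale.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, fs=48_000, n_fft=1024, hop=512, n_mels=32, n_coeffs=13):
    # Frame the signal and apply a Hamming window
    n_frames = 1 + (len(signal) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(n_fft)
    # Power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # Triangular mel filter bank, equally spaced on the mel scale
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # Log filter-bank energies, then a DCT gives the cepstral coefficients
    logmel = np.log(power @ fbank.T + 1e-10)
    return dct(logmel, type=2, axis=1, norm='ortho')[:, :n_coeffs]
```

With these assumed settings, a one-second 48 kHz segment yields 13 coefficients per frame; concatenating them with 13 GTCCs gives the 26-dimensional frame features used in the framing step.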

D. FRAMING
The features in this paper have 26 dimensions: 13 for GTCC and 13 for MFCC. Because the segments have feature sequences of different lengths depending on the speech file duration, we need the framing step to buffer the feature vectors into fixed sequences of 50 frames, with an overlap of 47 frames and 26 features per frame. To explain with the help of Figure 6, the box in the upper left contains the extracted features of all files in one of the three subsets (training, validation, or testing). Each line in that box holds the features of all frames of one input file, whose count depends on the file's original recorded duration. For example, the first file contains 139 frames with 26 features extracted from each, as explained above and in the previous subsection. One of the files from the designated input group (i.e., one line from the upper-left box of Figure 6) is displayed in the upper-right box of Figure 6, framed into 30 sequences.
In addition, within a given speech file, we group the frames in sequence blocks; each sequence block contains features of the 50 frames with a step size of three frames (i.e., the overlap of 47 frames) as explained in the middle bottom of our figure. Furthermore, each block has one targeted tag; thus, each speech file consists of sequences, 50 frames each, as depicted  in the upper right box. Depending on this configuration, BLSTM gets a group of data blocks containing 26 features times 50 frames (i.e., 1,300 total values), but BLSTM is fed with the sequence of data groups of 26 values at a time (i.e., frame by frame). In other words, the BLSTM has 26 input lines and one target for each block containing 1,300 values.
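The buffering described above can be sketched as follows in NumPy (`frame_sequences` is a hypothetical helper name). With a sequence length of 50 frames and a step of 3 frames (i.e., a 47-frame overlap), a 139-frame file yields exactly the 30 sequences mentioned above.

```python
import numpy as np

def frame_sequences(features, seq_len=50, hop=3):
    """Buffer an (n_frames, 26) feature matrix into fixed blocks.

    Returns an array of shape (n_seq, seq_len, 26), where consecutive
    blocks start `hop` frames apart, so adjacent blocks overlap by
    seq_len - hop frames (47 with the defaults).
    """
    n = features.shape[0]
    n_seq = (n - seq_len) // hop + 1
    return np.stack([features[i * hop : i * hop + seq_len]
                     for i in range(n_seq)])
```

The file-level label is then replicated once per block, e.g. `labels = [label] * len(seqs)`, so every 50-frame sequence carries the language tag of its source file.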

E. CLASSIFICATION METHOD
A multi-class BLSTM classifier [52] was designed and implemented in this paper to identify the considered spoken languages and to discriminate spoken Arabic from the other languages. Figure 7 shows the structure of our proposed classifier. First, we set the BLSTM input size equal to the feature dimension, which is 26. The BLSTM1 layer has an output size of 256 hidden units and outputs a sequence, followed by a dropout1 layer with a 0.4 dropout rate. The BLSTM2 layer has an output size of 128 hidden units and outputs a sequence, followed by a dropout2 layer with a 0.4 dropout rate. The BLSTM3 layer has an output size of 256 hidden units and outputs the last element of the sequence, followed by a dropout3 layer with a 0.4 dropout rate, a fully connected layer of five classes, a softmax layer, and finally a classification layer to detect the language class. Table 3 shows the analysis details of the BLSTM architecture used in Figure 7.
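As an illustrative sketch (not the authors' MATLAB implementation), the layer stack can be written in PyTorch as below. One assumption to note: we take each layer's stated size as the hidden units per direction, so a bidirectional layer outputs twice that many features; the softmax is folded into the cross-entropy loss, as is idiomatic in PyTorch.

```python
import torch
import torch.nn as nn

class BLSTMClassifier(nn.Module):
    """Three stacked bidirectional LSTM layers with dropout,
    mirroring the BLSTM1/2/3 + dropout + fully-connected structure
    described for Figure 7 (sizes per direction are an assumption)."""

    def __init__(self, n_features=26, n_classes=5):
        super().__init__()
        self.blstm1 = nn.LSTM(n_features, 256, batch_first=True, bidirectional=True)
        self.drop1 = nn.Dropout(0.4)
        self.blstm2 = nn.LSTM(512, 128, batch_first=True, bidirectional=True)
        self.drop2 = nn.Dropout(0.4)
        self.blstm3 = nn.LSTM(256, 256, batch_first=True, bidirectional=True)
        self.drop3 = nn.Dropout(0.4)
        self.fc = nn.Linear(512, n_classes)

    def forward(self, x):                 # x: (batch, 50 frames, 26 features)
        x, _ = self.blstm1(x)             # sequence output
        x = self.drop1(x)
        x, _ = self.blstm2(x)             # sequence output
        x = self.drop2(x)
        x, _ = self.blstm3(x)
        x = self.drop3(x[:, -1, :])       # last element of the sequence only
        return self.fc(x)                 # logits; softmax lives in the loss
```

Feeding a batch of 50-frame, 26-feature sequences produces one five-way logit vector per sequence, matching the one-target-per-block setup of the framing step.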

F. OUTPUT
The last step identifies the spoken language and evaluates the system output. We evaluated our proposed method on the testing files. Each audio speech file contains a different number of sequences; the metrics are calculated for each sequence and then aggregated to compute the metrics for the whole audio speech file [53]. The evaluation metrics are as follows [9], [54]. For all metrics, Tp denotes true positives, Tn true negatives, Fp false positives, and Fn false negatives.

1) ACCURACY
It measures the proportion of accurately predicted samples to all samples, as given in Eq. (1).

Accuracy = (Tp + Tn) / (Tp + Tn + Fp + Fn) (1)

2) PRECISION
It measures the proportion of accurately predicted samples to all positively predicted samples, as given in Eq. (2).

Precision = Tp / (Tp + Fp) (2)

3) RECALL
It measures the proportion of accurately predicted samples to all positively actual samples, as given in Eq. (3).

Recall = Tp / (Tp + Fn) (3)

4) F-SCORE
It is a weighted average of recall and precision, as given in Eq. (4).

F-score = 2 × (Precision × Recall) / (Precision + Recall) (4)
We also present the confusion matrix in the results section.
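Eqs. (1)–(4) can be computed directly from a confusion matrix. The following NumPy helper (`metrics_from_confusion` is a hypothetical name) returns the overall accuracy and the per-class precision, recall, and F-score:

```python
import numpy as np

def metrics_from_confusion(cm):
    """Compute accuracy and per-class precision/recall/F-score from a
    confusion matrix cm[i, j] = count of class-i samples predicted as j."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)                       # correct predictions per class
    fp = cm.sum(axis=0) - tp               # predicted as the class, but wrong
    fn = cm.sum(axis=1) - tp               # belong to the class, but missed
    precision = tp / np.maximum(tp + fp, 1)
    recall = tp / np.maximum(tp + fn, 1)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    accuracy = tp.sum() / cm.sum()
    return accuracy, precision, recall, f1
```

Averaging the per-class values gives the macro-averaged figures reported in the results tables.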

VI. EXPERIMENTS
We report five experiments in this paper: 1) discriminating among five similar languages: Arabic, German, Spanish, French, and Russian; 2) discriminating Arabic from German speech; 3) discriminating Arabic from Spanish speech; 4) discriminating Arabic from French speech; and 5) discriminating Arabic from Russian speech. All these experiments are based on the study presented in Subsection B of Section IV. The work was carried out as follows for all experiments. First, we partitioned each speech file into segments by considering the cleaned speech signal after removing silence and non-speech portions.
Additionally, we extracted the MFCC and GTCC features of each segment and then framed the segments into fixed sequences. Finally, we carried out five runs for each experiment using the BLSTM architecture, with the same training parameters for all experiments.
We split the data into 72% for training, 8% for validation, and 20% for testing, as shown in Figure 1. Table 4 lists the BLSTM training parameters used. An ablation study was performed in an earlier phase of the research to tune the system to its best parameters; in this study, we concentrated on our main aim, which is comparing the considered spoken languages. After experimenting with different maximum epoch counts (15, 20, and 50), we found 15 to be the best, because the accuracy curve does not increase beyond it and longer training risks overfitting in the loss; this means the network makes fifteen passes through the training data. The mini-batch size is 128, so the network looks at 128 training signals at a time. To prevent the gradients from exploding, we set the gradient threshold to 1. We set the verbose parameter to true to depict the table corresponding to the data shown in the plot, and the shuffle parameter to every-epoch to shuffle the training sequences at the beginning of each epoch. Furthermore, we set the learn-rate schedule parameter to piecewise to decrease the learning rate by a specified factor (0.0009) each time a certain number of epochs has passed. Finally, we used the adaptive moment estimation (Adam) optimizer. Figure 8 shows the training progress for run number two of the discrimination among five languages experiment. Figure 9 shows the training progress for run number two of identifying spoken Arabic with respect to Russian.
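These training settings map naturally onto standard deep learning tooling. The following PyTorch sketch is a hedged illustration only: the initial learning rate, the scheduler step size, and the stand-in model and loss are assumptions (Table 4 holds the actual values); only the Adam optimizer, the gradient threshold of 1, and the piecewise factor of 0.0009 come from the text.

```python
import torch

# Stand-in model; the real network is the three-layer BLSTM of Figure 7.
model = torch.nn.LSTM(26, 64, batch_first=True)
# Adam optimizer; the initial learning rate here is an assumption.
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
# Piecewise schedule: multiply the learning rate by a fixed factor
# (the paper reports 0.0009) after a set number of epochs (assumed here).
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=10, gamma=0.0009)

def train_step(batch):
    """One mini-batch update; batches of 128 sequences are drawn from
    data shuffled at the start of every epoch."""
    opt.zero_grad()
    out, _ = model(batch)
    loss = out[:, -1].pow(2).mean()  # placeholder loss, not the real objective
    loss.backward()
    # Clip gradients to the threshold of 1 to prevent them from exploding
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    opt.step()
    return loss.item()
```

Running `train_step` over all batches, then calling `sched.step()` once per epoch, reproduces the epoch-wise piecewise learning-rate decay described above.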

VII. RESULTS AND DISCUSSION
As outlined in Subsection B of Section IV, the pronunciation checks were carried out at the word level. The results were calculated as the average of five runs of each experiment.

A. EXPERIMENT 1: ALL LANGUAGES IDENTIFICATION
In this experiment, we try to uncover the probable degrees of similarity and dissimilarity between the five languages, namely Arabic, French, German, Russian, and Spanish. For example, speakers and listeners can readily find commonly spoken words that exist in all five languages.
For the automatic classification, the average training accuracy is 99.60%, the average validation accuracy is 96.48%, and the average testing accuracy is 95.15%. Validation accuracies ranged from 96.00% to 96.88%. Testing accuracy ranged from 94.90% for the worst run to 95.40% for the best run. Table 5 shows the results of this experiment, and Table 6 shows the confusion matrix of the best test run in Table 5.
By analyzing the system errors depicted in Table 6, we can learn some beneficial outcomes regarding the similarities and dissimilarities of the five considered languages. Regarding Arabic: two Arabic test tokens are identified as French, nine as German, two as Russian, and one as Spanish; this implies that Arabic is closest to German. Regarding French: eight French test tokens are identified as Arabic, seventeen as German, one as Russian, and three as Spanish; from these errors, we can infer that French is also similar to German. For German: three test tokens are identified as Arabic, one as French, and two as Spanish; this language achieved the highest accuracy, and no other language strongly confounds it, although German confuses most of the other languages. For Russian: three test tokens are identified as French, eight as German, and two as Spanish; the closest language to Russian is German. For Spanish: three test tokens are identified as Arabic, ten as French, thirteen as German, and four as Russian; thus, French and German confuse Spanish with a considerable count, as in [55]. Additionally, Table 7 shows the analysis results of some speech sentences of test files from the confusion matrix given in Table 6. As shown in rows three, seven, nine, ten, and fourteen of Table 7, the system was able to discriminate among similar languages in pronouncing certain words.
Examples include the following words: '' '' / /, '' '' /lajmu:n/, ''Alkohole'' / /, ''haschich'' / /, and ''pantalón'' / /, as detected in Table 2. Table 7 is taken from the actual test speech and linked directly to the errors produced by the system and displayed in the above confusion matrix; this can be considered a justification that the degree of similarity between Arabic and the four languages in pronunciation is caused by certain common words found in the training and testing datasets. In other words, the system errs by identifying the wrong language due to phonetically common words between the correct language and the erroneously chosen one.
For example, in the first row of Table 7, the system reports the spoken language as German although the utterance is Arabic, and we can see the word '' '' / / in the spoken sentence, which exists in both Arabic and German. As the sample in the table shows, German is misidentified only once, as Arabic, whereas the other mistakes occur for Arabic, French, and Spanish. From all the confusion matrices, we can conclude that the German language strongly confuses the other languages, but not vice versa. Table 7 indicates that similarity in word pronunciation affects spoken language identification.

B. EXPERIMENT 2: BINARY IDENTIFICATION AMONG ARABIC AND GERMAN
In this experiment, we identify spoken Arabic with respect to German. The average training accuracy is 99.86%, the average validation accuracy is 98.56%, and the average testing accuracy is 97.30%. Validation accuracies ranged from 97.81% to 99.06%. Testing accuracy ranged from 97.13% for the worst run to 97.50% for the best run. Table 8 shows the results of this experiment, and Table 9 shows the confusion matrix of the best test run in Table 8.

C. EXPERIMENT 3: BINARY IDENTIFICATION AMONG ARABIC AND SPANISH
In this experiment, we identify spoken Arabic with respect to Spanish. The average training accuracy is 99.69%, the average validation accuracy is 97.63%, and the average testing accuracy is 97.78%. Validation accuracies ranged from 97.50% to 97.81%. Testing accuracy ranged from 97.50% for the worst run to 98.00% for the best run. Table 10 shows the results of this experiment, and Table 11 shows the confusion matrix of the best test run in Table 10.

D. EXPERIMENT 4: BINARY IDENTIFICATION AMONG ARABIC AND FRENCH
In this experiment, we identify spoken Arabic with respect to French. The average training accuracy is 99.44%, the average validation accuracy is 98.69%, and the average testing accuracy is 97.20%. Validation accuracies ranged from 98.44% to 98.75%. Testing accuracy ranged from 96.88% for the worst run to 97.38% for the best run. Table 12 shows the results of this experiment, and Table 13 shows the confusion matrix of the best test run in Table 12.

E. EXPERIMENT 5: BINARY IDENTIFICATION AMONG ARABIC AND RUSSIAN
In this experiment, we identify spoken Arabic with respect to Russian. The average training accuracy is 100%, the average validation accuracy is 98.75%, and the average testing accuracy is 98.78%. Validation accuracies ranged from 98.44% to 99.06%. Testing accuracy ranged from 98.63% for the worst run to 99.00% for the best run. Table 14 shows the results of this experiment, and Table 15 shows the confusion matrix of the best test run in Table 14. Table 16 compares our proposed methodology with other methods used to identify spoken Arabic and discriminate languages similar to Arabic; it lists the languages used in each study and the proposed classifiers and features, giving an overview of the available studies in this area. As shown in Table 16, our proposed methodology obtained a 95.15% average accuracy for discriminating between five similar languages in the pronunciation of words; 97.30% average accuracy for differentiating between Arabic and German and 98.97% precision for identifying spoken Arabic from German; 97.78% average accuracy for discriminating between Arabic and Spanish and 98.73% precision for identifying spoken Arabic from Spanish; 97.20% average accuracy for differentiating between Arabic and French and 98.47% precision for identifying spoken Arabic from French; and 98.78% average accuracy for discriminating between Arabic and Russian and 99.75% precision for identifying spoken Arabic from Russian. These are excellent classification accuracies for such similar languages, obtained using GTCC and MFCC features. Additionally, we can conclude from all the results that the French language has the highest similarity to Arabic.
This provides important feedback about the performance of the automatic language identification system: the similarity of languages does not depend on only one or two words in spoken sentences, but is shaped by the multilingual spoken language identification problem as a whole, which includes the levels of uttered words, syllables, phonemes, rhythm, and every other hidden commonality and distinctiveness of the languages. In effect, the deep learning system considers all aspects of similarity and dissimilarity incorporated in the spoken utterances.
Our methodology outperformed the method presented in [32] by more than 20% in terms of accuracy, and the method illustrated in [30] by more than 35%.

VIII. WORK CONTRIBUTIONS
The most significant contributions of this paper are summarized as follows:

1) Detecting four languages similar to Arabic in pronouncing certain words.
2) Building a novel deep learning approach for discriminating between languages with similar pronunciation.
3) Implementing an efficient MATLAB system capable of identifying spoken languages automatically in record time.
4) Identifying the spoken Arabic language, which is missing in previous studies.
5) Creating and organizing a corpus of subsets of the Mozilla speech corpus for five languages.
6) Improving targeted applications based on multi-lingual speech, such as automatic machine translation and spoken document retrieval. SLID is also commonly used in emergency call routing, where it directs a call to a fluent native operator.

IX. CONCLUSION
In this paper, we studied five Mozilla speech corpus languages: Arabic, German, Spanish, French, and Russian. We first phonetically established that they have similar spoken words, especially when compared with the Arabic language, as in the case of pronouncing ''Alcohol''. Then, we proposed the BLSTM architecture for spoken Arabic language identification and for discriminating between these similar languages in speech. Finally, we presented a training, validation, and testing strategy with a balanced corpus and short speech recordings as a second contribution for other speech recognition tasks.
We carried out extensive experiments, and our methodology produced state-of-the-art results. Our experimental results show that, for both the two-language and five-language experiments, the BLSTM architecture with MFCC and GTCC features extracted from the speech is a suitable methodology for SLID. Additionally, there is no large difference between the best and worst results of any experiment, which indicates that the proposed system classifies consistently.
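To make the classification step concrete, the forward pass of such a bidirectional LSTM classifier can be sketched in a few lines: one LSTM pass over the feature sequence in each direction, with the two final hidden states concatenated and fed to a softmax layer over the candidate languages. This is a hedged NumPy illustration of the general BLSTM mechanism, not the authors' MATLAB implementation; all names, dimensions, and the random parameters are assumptions for illustration.

```python
import numpy as np

def lstm_pass(x, Wx, Wh, b):
    """Run one LSTM direction over x of shape (T, D); return the final hidden state."""
    H = Wh.shape[0]
    h, c = np.zeros(H), np.zeros(H)
    sigm = lambda z: 1.0 / (1.0 + np.exp(-z))
    for t in range(x.shape[0]):
        z = x[t] @ Wx + h @ Wh + b          # all four gate pre-activations at once
        i, f, g, o = np.split(z, 4)
        i, f, o = sigm(i), sigm(f), sigm(o)
        c = f * c + i * np.tanh(g)          # cell state update
        h = o * np.tanh(c)                  # hidden state
    return h

def blstm_classify(feats, params):
    """Classify a (T, D) feature sequence (e.g. MFCC/GTCC frames) over languages."""
    h_fwd = lstm_pass(feats, *params["fwd"])         # left-to-right pass
    h_bwd = lstm_pass(feats[::-1], *params["bwd"])   # right-to-left pass
    h = np.concatenate([h_fwd, h_bwd])               # (2H,) bidirectional summary
    logits = h @ params["W_out"] + params["b_out"]
    e = np.exp(logits - logits.max())
    return e / e.sum()                               # softmax over the languages

# Illustrative usage with random weights: 13-dim frames, 32 hidden units, 5 languages
rng = np.random.default_rng(0)
H, D, C = 32, 13, 5
def rand_lstm():
    return (rng.normal(0, 0.1, (D, 4 * H)),
            rng.normal(0, 0.1, (H, 4 * H)),
            np.zeros(4 * H))
params = {"fwd": rand_lstm(), "bwd": rand_lstm(),
          "W_out": rng.normal(0, 0.1, (2 * H, C)), "b_out": np.zeros(C)}
probs = blstm_classify(rng.normal(size=(97, D)), params)  # one utterance's frames
```

In practice the weights are learned by backpropagation through time; concatenating the two directions' final states is what lets the classifier exploit both preceding and following context within an utterance when discriminating similar languages.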
Our investigations have also shown our method's robustness in discriminating among similar languages in all the experiments. For example, the method exhibits a precision of 99.75% for identifying the spoken Arabic language from Russian and an average accuracy of 98.78% for discriminating between Arabic and Russian. Our method also obtained a 95.15% average accuracy for discriminating between the five similar languages in the pronunciation of words. However, it is important to emphasize that our outcomes cannot be precisely compared with past efforts due to differences in the mix of considered languages, datasets, parameters, and/or methodologies.
The data preparation and training of the selected corpus of the five Mozilla speech corpus languages across five experiments were important and challenging limitations of this research, which also underlines the contribution of this paper.
As future work, this study and the proposed methodology could contribute to developing an automatic translation system or improving a multi-lingual speech recognition system, since language identification is the first step in building such systems. In addition, other deep learning architectures, such as LSTM, CNN, attention-based BLSTM, and transformer-based models, will be investigated and compared. More languages from the Afro-Asiatic family, including other Semitic languages, are also a target of future research.

DECLARATION OF INTEREST
The authors have no relevant conflicts of interest to disclose.