Interpretation of Bahasa Isyarat Malaysia (BIM) Using SSD-MobileNet-V2 FPNLite and COCO mAP

This research proposes a study on two-way communication between deaf/mute and normal people using an Android application. Despite advancements in technology, there is still a lack of mobile applications that facilitate two-way communication between deaf/mute and normal people, especially using Bahasa Isyarat Malaysia (BIM). This project consists of three parts. The first part is BIM letters, which enables the recognition of individual BIM letters and of combined letters that form a word; a pre-trained MobileNet model is used to train on a total of 87,000 images across 29 classes, with a 10% test size and a 90% training size. The second part is BIM word hand gestures, which consists of five classes trained with the SSD-MobileNet-V2 FPNLite 320 × 320 pre-trained model (inference speed of 22 ms per frame, COCO mAP of 22.2), using a total of 500 images for all five classes, with the first training run set to 2000 steps and the second and third runs set to 2500 steps. The third part is Android application development using Android Studio, which contains the BIM letter and BIM word hand gesture features, with the trained models converted into TensorFlow Lite; this part also includes speech-to-text conversion through the Android application. BIM letters obtained 99.75% accuracy after training the models, while BIM word hand gestures obtained 61.60% accuracy. The suggested system is validated as a result of these simulations and tests.


Related Work
Bahasa Isyarat Malaysia (BIM), also known as Malaysian Sign Language (MSL), was initially developed in 1998, shortly after the Malaysian Federation of the Deaf was founded. This paper aims to create a mobile application that will bridge the communication gap between hearing people and the deaf-mute community by assisting the community in learning BIM.
In [14], a survey of potential users was conducted as its methodology. The target populations were University of Tenaga Nasional (UNITEN) students and Bahasa Isyarat Malaysia Facebook Group (BIMMFD) members. Multiple-choice, open-ended, and dichotomous items were included in the surveys. The research indicates that such software is considered helpful for society and suggests creating a more user-friendly and accessible way to learn and communicate in BIM through the app.
The current state of the art, with modern and more efficient gesture recognition methods, has been discussed in several papers. In [26], the authors introduced two deep-neural-network-based models: one for audio-visual speech recognition (AVSR) using the Lip Reading in the Wild (LRW) dataset and one for gesture recognition using the Ankara University Turkish Sign Language Dataset (AUTSL). That work uses both visual and acoustic features with fusion approaches, achieving 98.56% accuracy and demonstrating the possibility of recognizing speech and gestures on mobile devices. The authors of [27] trained models on datasets from different sign languages (Word-Level American Sign Language (WLASL), AUTSL, and Russian Sign Language (RSL)) to improve sign recognition quality and demonstrated the possibility of real-time sign language recognition without using GPUs, with the potential to benefit speech- or hearing-impaired individuals, using the Video Swin transformer and MViT. In contrast, this paper focuses on the development of BIM letter and word recognition using SSD-MobileNet-V2 FPNLite and COCO mAP.

SSD-MobileNet-V2 FPNLite
SSD-MobileNet-V2 can recognise multiple objects in a single image or frame. The model detects the position of each object in the image, producing the object's name and a bounding box. The pre-trained SSD-MobileNet model can classify ninety different object classes.
Because bounding box proposals are eliminated, Single-Shot Multibox Detector (SSD) models run faster than R-CNN models. The processing speed of detection and the model size were the deciding factors in the choice of the SSD-MobileNet-V2 model. As shown in Table 1, the model takes input images of 320 × 320 pixels and detects objects and their locations in 19 milliseconds, whereas other models require more time; for example, SSD-MobileNet-V1-COCO, the second-fastest model, takes about 30 milliseconds to categorise objects in a picture, followed by SSD-MobileNet-V2-COCO, the third-fastest model, and so on. SSD-MobileNet-V2 320 × 320 is the most recent MobileNet model for Single-Shot Multibox detection; it is optimised for speed at a very low cost, giving up only 0.8 mean average precision (mAP) compared to the second-fastest model, SSD-MobileNet-V1-COCO [28].

TensorFlow Lite Object Detection
TensorFlow Lite is an open-source deep learning framework created for devices with limited resources, such as mobile devices and Raspberry Pi modules. TensorFlow Lite enables the use of TensorFlow models on mobile, embedded, and Internet of Things (IoT) devices. It allows for on-device machine learning inference with low latency and a compact binary size. As a result, latency is reduced and power consumption is decreased [28].
For edge-based machine learning, TensorFlow Lite was explicitly created. It enables us to use various resource-constrained edge devices, such as smartphones, micro-controllers, and other circuits, to perform multiple lightweight algorithms [29].
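As a concrete illustration of this on-device inference (not taken from the paper), the following minimal Python sketch loads an already converted TensorFlow Lite model with the Interpreter API and runs one prediction; the file name is a placeholder:

```python
# A minimal on-device inference sketch (illustrative only). It assumes an
# already converted file named "model.tflite"; the name is a placeholder.
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Dummy input shaped and typed like the model's expected input tensor.
dummy = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], dummy)
interpreter.invoke()

prediction = interpreter.get_tensor(output_details[0]["index"])
print(prediction.shape)
```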
An open-source machine learning tool called TensorFlow Object Detection API is utilised in many different applications and has recently grown in popularity. When installing the TensorFlow Object Detection API, an implicit assumption is that it can be provided with noise-free or benign datasets. This open-source software is now being used in many object detection applications. However, in the real world, the datasets could contain inaccurate information due to noise, naturally occurring adversarial objects, adversarial tactics, and other flaws. Therefore, for the API to handle datasets from the real world, it needs to undergo thorough testing to increase its robustness and capabilities [30].
Another paper describes object detection, a computer technology linked to computer vision and image processing, as the detection of instances of semantic object classes (such as people, buildings, or cars) in digital photos and videos. The study areas for target detection include pedestrian and face detection.
Many computer vision applications require object detection, such as image retrieval and video surveillance, and applying this method on an edge device enables tasks such as autonomous driving [29].

MobileNets Architecture and Working Principle
Efficiency in deep learning is the key to designing or creating a helpful tool that is feasible to use with as little computation as possible. There are other ways or methods to solve efficiency issues in deep learning programming, and MobileNet is one of the approaches for said problem. MobileNets reduce the computation by factorising the convolutions. The architecture of MobileNets is primarily from depth-wise separable filters. MobileNets factorise a standard convolution into a depth-wise convolution and a 1 × 1 convolution (pointwise convolution) [31]. A standard convolution filters and combines inputs into a new set of outputs in one step. In contrast, depth-wise separable convolution splits the information into the filtering layer and the combining layer, decreasing the computation power and model size drastically.
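To make this saving concrete, the following illustrative Keras sketch (not the authors' code) builds a standard 3 × 3 convolution and its depth-wise separable counterpart and compares their parameter counts:

```python
# An illustrative Keras sketch (not from the paper) contrasting a standard
# convolution with the depth-wise separable factorisation used by MobileNet.
import tensorflow as tf

inputs = tf.keras.Input(shape=(200, 200, 3))

# Standard convolution: filtering and combining happen in a single step.
standard = tf.keras.layers.Conv2D(64, kernel_size=3, padding="same")(inputs)

# Depth-wise separable convolution: a per-channel 3x3 (depth-wise) filter
# followed by a 1x1 point-wise convolution that combines the channels.
depthwise = tf.keras.layers.DepthwiseConv2D(kernel_size=3, padding="same")(inputs)
pointwise = tf.keras.layers.Conv2D(64, kernel_size=1, padding="same")(depthwise)

standard_model = tf.keras.Model(inputs, standard)
separable_model = tf.keras.Model(inputs, pointwise)

# The separable version produces the same output shape with far fewer parameters.
print(standard_model.count_params(), separable_model.count_params())
```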

Android Speech-to-Text API
Google Voice Recognition (GVR) is a tool with an open API that converts the user's speech into readable text. GVR usually requires an internet connection from the user to the GVR server. GVR uses neural network algorithms to convert raw audio speech to text and works for several languages [32]. The tool uses two-thread communication: the first thread receives the user's audio speech and sends it to Google Cloud to be converted into text and stored as strings, while the second thread, which resides on the user's workstation, reads the strings and exchanges them with the server.
Google Cloud Speech-to-Text, or the Cloud Speech API, is another tool for the speech-to-text feature. It has far more features than the standard Google Speech API; for example, it offers 30+ voices available in multiple languages and variants. However, it is not just a tool but a product made by Google, and users need to subscribe and pay a fee to use it. Table 2 lists the advantages and disadvantages of these tools.

Table 2. Advantages and disadvantages of Google Cloud API and Android Speech-to-Text API.

Google Cloud API
  Advantages: Supports 80 different languages; can recognise audio uploaded in the request; returns text results in real time; accurate in noisy environments; works with apps across any device and platform.
  Disadvantages: Not free; requires higher-performance hardware.

Android Speech-to-Text API
  Advantages: Free to use; easy to use; does not require high-performance hardware; easy to develop.
  Disadvantages: The local language must be passed to convert speech to text; not all devices support offline speech input; it cannot pass an audio file to be recognised; it only works with Android phones.
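For comparison, the following is an illustrative Python sketch of calling the Google Cloud Speech-to-Text API listed in Table 2; it assumes the google-cloud-speech package, a billing-enabled project, and a 16 kHz mono WAV file, whereas the application developed here uses the on-device Android SpeechRecognizer instead:

```python
# An illustrative sketch of the Google Cloud Speech-to-Text client compared in
# Table 2 (not used by the app itself, which relies on the on-device Android
# SpeechRecognizer). It assumes the google-cloud-speech package, a billing-enabled
# project, and a 16 kHz mono WAV file; the file name is a placeholder.
from google.cloud import speech

client = speech.SpeechClient()

with open("utterance.wav", "rb") as f:
    audio = speech.RecognitionAudio(content=f.read())

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="ms-MY",  # Malay; other supported language codes may be used
)

response = client.recognize(config=config, audio=audio)
for result in response.results:
    print(result.alternatives[0].transcript)
```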

Materials and Methods
This project includes three main categories: BIM letters, BIM word hand gestures, and Android application development. These three main categories are divided into the database acquisition phase, the system's design phase, and the system's testing phase. The BIM sign language implemented uses static hand gestures, which only involve capturing a single image at the classifier's input.

BIM Letters
The first category, BIM letters, had three phases: the database acquisition phase, the system's design phase, and the system's testing phase. Phase 1: in the database acquisition phase, datasets were obtained from deaf/mute teacher datasets, Kaggle, and self-generated datasets. BIM datasets on Kaggle are limited; thus, ASL letters were used, with the letters G and T replaced by self-generated images. Phase 2: in the system's design phase, TensorFlow/Keras was used as the deep learning framework to train the dataset. Phase 3: in the system's testing phase, a confusion matrix was generated to ensure the functionality was executed correctly.
Collected data were processed for classification using the CNN model, in this case MobileNet, and the datasets were trained with 10% of the data used for testing and 90% for training. Once the result was obtained, the model was converted to TensorFlow Lite to be imported into Android Studio for application building. The flow process is shown in Figure 1.

There are 29 letters in the datasets, including delete, nothing, and space, which is beneficial for real-time applications. The data, collected as 3000 images per class for a total of 87,000 images, were resized to 200 px × 200 px before being provided as input, because smaller images allow faster training.
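The following minimal training sketch illustrates this setup. It is not the authors' code; it assumes the images are organised as one folder per class under a directory such as bim_letters/ and mirrors the 90/10 split, 200 × 200 input size, and MobileNet transfer learning described above:

```python
# A minimal, illustrative sketch (not the authors' code). It assumes the 87,000
# images are stored as one sub-folder per class under "bim_letters/" and mirrors
# the described 90/10 split, 200 x 200 input size, and MobileNet transfer learning.
import tensorflow as tf

IMG_SIZE = (200, 200)

train_ds = tf.keras.utils.image_dataset_from_directory(
    "bim_letters", validation_split=0.1, subset="training",
    seed=42, image_size=IMG_SIZE, batch_size=32)
test_ds = tf.keras.utils.image_dataset_from_directory(
    "bim_letters", validation_split=0.1, subset="validation",
    seed=42, image_size=IMG_SIZE, batch_size=32)

# Pre-trained MobileNet backbone; the classification head is replaced with a
# 29-way softmax for the BIM letter classes.
base = tf.keras.applications.MobileNet(
    input_shape=IMG_SIZE + (3,), include_top=False,
    weights="imagenet", pooling="avg")
base.trainable = False  # keep the pre-trained features frozen

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1.0),  # MobileNet expects [-1, 1]
    base,
    tf.keras.layers.Dense(29, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(train_ds, validation_data=test_ds, epochs=5)
```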
The system was tested to ensure its operation was executed effectively using a confusion matrix, as seen in Figure 2.
The confusion matrix consists of True Positive, True Negative, False Positive, and False Negative counts, where label 0 means false (class 0) and label 1 means true (class 1). True Positives and True Negatives are samples the model classified correctly, whereas False Positives and False Negatives are misclassifications; the diagonal of the matrix therefore shows, for each class, the number of samples the model predicted correctly.
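A short sketch of how such a normalised confusion matrix can be produced, assuming the model and test_ds objects from the training sketch above and scikit-learn, is:

```python
# A short sketch producing a normalised confusion matrix, assuming the
# "model" and "test_ds" objects from the training sketch above and scikit-learn.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true, y_pred = [], []
for images, labels in test_ds:
    probs = model.predict(images, verbose=0)
    y_true.extend(labels.numpy())
    y_pred.extend(np.argmax(probs, axis=1))

cm = confusion_matrix(y_true, y_pred)
# Normalise each row so the diagonal shows per-class accuracy.
cm_normalised = cm / cm.sum(axis=1, keepdims=True)
print(np.round(cm_normalised.diagonal(), 3))
```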



BIM Word Hand Gestures
The dataset includes five classes: three are from family (keluarga) and contain the words brother (abang), father (bapa), and mother (emak); one is from feelings (perasaan), which is love (sayang); and one is from pronouns (ganti nama), which is I (saya). Data were gathered and processed to be classified using the CNN model. A pre-trained model from the TensorFlow 2 Model Zoo was used to ensure that it achieved the best accuracy. This process includes setting the split ratios, which are 25% for testing and 75% for training. The model was converted to TensorFlow Lite and imported into Android Studio to build the application. Database acquisition, system design, and system testing are the three steps that make up this category. The flow process of BIM word hand gestures is shown in Figure 1.
The datasets were self-generated: 100 images per class, or 500 images in total, were collected at a size of 512 px × 290 px. The images were captured with different hand positions and light intensities, including varying distance from the camera and brightness, and the pictures were also mirrored to increase the variety of images. To label the images, labelImg was downloaded and used. This software generates an XML file for each labelled image so it can be used with the TensorFlow Object Detection API. Figure 3 shows an example of selecting and labelling the hand gesture for brother (abang) using the labelImg software.
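The following illustrative sketch (not the authors' code) reads one such annotation, assuming labelImg's default Pascal VOC XML output; the file name is a placeholder:

```python
# An illustrative sketch (not the authors' code) for reading one labelImg
# annotation, assuming labelImg's default Pascal VOC XML output; the file
# name is a placeholder.
import xml.etree.ElementTree as ET

def read_labelimg_xml(path):
    root = ET.parse(path).getroot()
    size = root.find("size")
    width = int(size.find("width").text)
    height = int(size.find("height").text)
    boxes = []
    for obj in root.findall("object"):
        name = obj.find("name").text  # e.g. "abang"
        bb = obj.find("bndbox")
        boxes.append((name,
                      int(bb.find("xmin").text), int(bb.find("ymin").text),
                      int(bb.find("xmax").text), int(bb.find("ymax").text)))
    return width, height, boxes

print(read_labelimg_xml("abang_001.xml"))
```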

The pre-processed datasets were then classified using the TensorFlow 2 Detection Model Zoo: the SSD-MobileNet-V2 FPNLite 320 × 320 model, with an inference speed of 22 ms per frame and a COCO mAP of 22.2, was used to determine the model's accuracy before being converted into TensorFlow Lite and exported to Android Studio. The collection of 500 images (512 px × 290 px) was split so that 25%, or 125 images, was used for testing and 75%, or 375 images, was used for training.
Once the training process was completed, the hand gestures were detected in real time using the TensorFlow Object Detection API. The SSD-MobileNet-V2 FPNLite 320 × 320 model from the TensorFlow 2 Detection Model Zoo, with an inference speed of 22 ms per frame and a COCO mAP of 22.2, was used as the pre-trained model, because a pre-trained model has already been trained on a large dataset, which saves much more time than creating a model from scratch. COCO is an extensive dataset for object identification, segmentation, and captioning; since a larger COCO mAP is advised, other models may also be employed to recognise the objects correctly. TensorFlow Records (TFRecords), a binary file format for storing data, can be used to help speed up training for custom object detection, in this case hand gestures. The model was trained three times, in which the number of steps was changed from 2000 steps to 2500 steps to evaluate the model's accuracy.
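As an illustration of this step, the simplified sketch below serialises one labelled image into a TFRecord. It is not the authors' pipeline: it reuses the XML reader from the previous sketch, assumes the label-to-ID mapping shown, and omits several fields the TensorFlow Object Detection API normally expects, such as image width, height, and format:

```python
# A simplified, illustrative sketch (not the authors' pipeline) serialising one
# labelled image into a TFRecord; it reuses read_labelimg_xml() from the sketch
# above, assumes the label-to-ID mapping shown, and omits several fields the
# TensorFlow Object Detection API normally expects (e.g. image width, height, format).
import tensorflow as tf

LABEL_MAP = {"abang": 1, "bapa": 2, "emak": 3, "saya": 4, "sayang": 5}  # assumed IDs

def to_example(image_path, xml_path):
    width, height, boxes = read_labelimg_xml(xml_path)
    with open(image_path, "rb") as f:
        encoded = f.read()
    feature = {
        "image/encoded": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[encoded])),
        "image/object/class/label": tf.train.Feature(
            int64_list=tf.train.Int64List(value=[LABEL_MAP[b[0]] for b in boxes])),
        "image/object/bbox/xmin": tf.train.Feature(
            float_list=tf.train.FloatList(value=[b[1] / width for b in boxes])),
        "image/object/bbox/ymin": tf.train.Feature(
            float_list=tf.train.FloatList(value=[b[2] / height for b in boxes])),
        "image/object/bbox/xmax": tf.train.Feature(
            float_list=tf.train.FloatList(value=[b[3] / width for b in boxes])),
        "image/object/bbox/ymax": tf.train.Feature(
            float_list=tf.train.FloatList(value=[b[4] / height for b in boxes])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

with tf.io.TFRecordWriter("train.record") as writer:  # placeholder output name
    writer.write(to_example("abang_001.jpg", "abang_001.xml").SerializeToString())
```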

Android Application
This application has features for converting speech to text, converting BIM letter hand gestures that can form words, and converting BIM word hand gestures into text. The BIM letter and word hand gesture recognition features are provided by the trained models, which were converted into TensorFlow Lite. Android Studio was used to build the Android application. Users need to sign up and log in to the application to gain access to its features. To ensure the system functions properly, it was tested against the objectives of this project; this, in turn, allows the developer to improve the developed application. The flow process of the Android application for this BIM recognition is presented in Figure 4.

For this phase of developing the Android application, two files from the BIM letter and BIM word hand gesture models were included. To acquire these files, the trained models of BIM letter and BIM word hand gestures were converted into TensorFlow Lite files and used for application building.
The BIM letters, BIM word hand gestures, and Android speech-to-text features were developed using Android Studio. To enable real-time hand gesture detection in the application, the trained models of BIM letters and BIM word hand gestures were converted to TensorFlow Lite and imported into Android Studio. The speech-to-text capability can also be accomplished in Android Studio by importing the SpeechRecognizer class, which gives access to the speech recognition service. This API's implementation involves sending audio to remote servers for speech recognition, such as converting microphone input to text.
In this project, the trained models were created with TensorFlow and converted into the TensorFlow Lite format. The converted models were then used to develop an Android app that analyses a live video stream and identifies objects using a machine learning model; in this case, it analyses the BIM letters and BIM word hand gestures. This machine learning model detects objects, here BIM hand gestures, by evaluating visual data in a prescribed manner to categorise components in the image as belonging to one of a set of recognised classes it was trained to identify. The time a model takes to recognise a known item (also known as object prediction or inference) is frequently measured in milliseconds. In practice, the amount of data being processed, the size of the machine learning model, and the hardware hosting the model all affect how quickly inferences are made.
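A minimal sketch of this conversion step, assuming an exported SavedModel directory (paths and file names are placeholders), is:

```python
# A minimal sketch of the conversion step described above, assuming an exported
# SavedModel directory; the directory and output file names are placeholders.
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("exported_model/saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # optional size/latency optimisation
tflite_model = converter.convert()

with open("bim_gestures.tflite", "wb") as f:
    f.write(tflite_model)
```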
For the user's Android application, there are a few stages and features that need to be fulfilled by the user:
1. The user needs to turn on the internet connection.
2. The user needs to download and install the app on their smartphone.
3. The user needs to register in the app if they are a first-time user (input name, email address, and password).
4. The user needs to log in with their successfully registered account (input name and password).
5. The user must allow the app to use the camera and record audio.

Results
The implementation of the Android application that allows two-way communication between deaf/mute and normal people, integrating Bahasa Isyarat Malaysia (BIM), consists of four main buttons that let users choose between speech-to-text conversion, BIM letter to text conversion, BIM letters combined to create words, and BIM word hand gesture to text conversion.
Three main categories make the application fully functional: BIM letters, BIM word hand gestures, and the development of the Android application itself. For BIM letters, the trained model achieved the highest accuracy of 99.78% by utilising the MobileNet pre-trained model with a 10% test size and a 90% training size; the result was evaluated using a normalised confusion matrix. As for BIM word hand gestures, by implementing the TensorFlow 2 Detection Model Zoo with SSD-MobileNet-V2 FPNLite 320 × 320, the average precision was 61.60% after training three times with 2000 and 2500 steps. Lastly, for the development of the Android application, '2 CUBE' is the name of the application, which stands for '2 Cara Untuk BErkomunikasi dalam Bahasa Isyarat Malaysia' (two ways to communicate in Malaysian Sign Language). Furthermore, the application includes a speech-to-text conversion feature, and the trained models of BIM letters and BIM word hand gestures were converted to TensorFlow Lite so they could be used for real-time hand gesture detection.

BIM Letters
Using the MobileNet pre-trained model, 29 BIM letters were trained and evaluated. Figure 5 displays a normalised confusion matrix for the trained model with the 10% test size and 90% training size. The diagonal elements of the normalised confusion matrix represent the proportion of correct predictions for each class. The result demonstrates that the model accurately predicted all classes with a value of about 99 per cent.

Figure 5. Normalised confusion matrix of the MobileNet model with 99.78% accuracy on the test set.

BIM Word Hand Gestures
The training results of BIM words using hand gestures, obtained using TensorBoard as explained in Section 3.2, present the loss, learning rate, and steps per second. The first training run was set to 2000 steps, while the second and third runs were set to 2500 steps. Figure 6 shows the training results of classification loss via TensorBoard, while Table 3 shows the training results of loss, learning rate, and steps per second.
Figure 6. Results of loss, learning rate, and steps per second for the trained model (classification loss) over the three training runs.

For the evaluation result, this model obtained 0.616, which is 61.60% average precision (AP), with intersection over union (IoU) between 0.50 and 0.95 over all datasets with a maximum detection of 100. The precision is not high because the collected dataset was small: the laptop used for this project had limited capacity, and training on a CPU instead of a GPU required a lot of time. In addition, the hand gestures for father (bapa), mother (emak), and I (saya) are very similar; hence, they were sometimes detected as the same class. For the average recall (AR), the model obtained a value of 0.670, or 67%, with IoU between 0.50 and 0.95 over all datasets with a maximum detection of one. The evaluation results can be seen in Figure 7.

The model accuracy was estimated from images as a percentage for each class, with brother (abang) at 86%, father (bapa) at 88%, mother (emak) at 92%, I (saya) at 97%, and love (sayang) at 98%, while the results of using a live webcam to detect the hand gestures in real time show accuracies of 83% for saya, 94% for sayang, and 93% for emak.

Development of Android Application
Figure 8a shows the launcher icon for the application, a graphic representing the mobile application. This icon appears on the user's home screen once the user downloads the application. The main page of the application is shown in Figure 8b, where users need to register before they can use the application. If the user already has a registered account, they can log in with their successfully registered account.
Figure 9a shows the user's registration page for the application. Users need to input their name, email address, and password before clicking on the register button, and Figure 9b shows the user's login page. The user must enter their successfully registered email and password to log in to the application by clicking on the login button. Figure 9c shows the home page of the application after the user successfully logs in, where there are four clickable buttons with different functions to choose from: BIM letter hand gestures, BIM letter hand gestures to create a word, BIM word hand gestures to text conversion, and, lastly, speech-to-text conversion.
Figure 10 shows the page after clicking on the BIM letters recognition button. Figure 10a shows that users need to click on the start camera (mulakan kamera) recognition button before using this feature, and Figure 10b shows that first-time users need to allow the app to take pictures and record videos before proceeding.
Figure 11 shows the BIM letters page once camera recognition has been allowed. Figure 11a shows the camera detecting the letter 'D' when the BIM hand gesture is shown, while Figure 11b shows the camera detecting the letter 'I' when the BIM hand gesture is directed at the camera. As for Figure 11c, when the camera does not recognise the hand gesture shown, the app displays Tidak dapat dikesan, meaning it cannot be detected.
Figure 12a shows the sidebar menu, where the user can see their name and registered email address. In addition, the sidebar menu includes four buttons with different features that they can click on, and users can also sign out from the application when they no longer want to use it. Figure 12b shows the BIM combined letter page, where the user needs to click on the start camera recognition button; this page also has add and delete buttons for when the user wants to combine the hand gestures they show or erase a letter.
Figure 13 shows the BIM combined letter page in use. Figure 13a shows the hand gesture of the letter 'B', which is added in the app by clicking on the add (Tambah) button. Figure 13b shows the hand gesture of the letter 'C' being added, resulting in the word 'bilc', which is incorrect; therefore, the user clicks on the delete (Padam) button to delete the letter 'C'. Lastly, Figure 13c shows the hand gesture of the letter 'A' after deleting the letter 'C'; hence, the resulting word is 'bila'.
Figure 14 shows the BIM word hand gesture page, where the user can detect BIM word hand gestures. Users need to click on the start camera recognition button to start detecting the hand gesture they show. Figure 14a shows the BIM hand gesture being translated into the text brother (abang), Figure 14b shows mother (emak) being translated when the user displays the mother (emak) hand gesture, and Figure 14c shows the word I (saya) being translated from the hand gesture shown by the user.
Figure 15 shows the speech-to-text page. Figure 15a shows the main page once the user clicks the speech-to-text button. After that, the microphone can be clicked, and a first-time user of the app needs to grant access to record audio, as shown in Figure 15b. Finally, Figure 15c shows the Semua kebenaran dibenarkan message, meaning the user has granted all access.
Figure 16 shows the speech-to-text feature in use. In Figure 16a, the user clicks on the microphone icon and the Google speech recogniser pops up; the user can then talk, and the captured speech is detected and converted to text, as shown in Figure 16b. Users need to click the change (Tukar) button for the next speech-to-text process.


Analysis of Android Application
By selecting their preferred button in the BIM Android application (speech-to-text, BIM letter recognition, BIM letters to construct a word, or BIM word hand gestures), deaf/mute and normal people can communicate with one another.
This test was conducted by repeating the hand gesture of each BIM letter in front of the phone camera ten times, and the accuracy results are tabulated in Table 4. The letters 'B', 'D', 'I', 'M', and 'V' have the highest accuracy over the ten trials at 100%, while the lowest is the letter 'E', with 50% accuracy. The remaining letters have accuracies above 50%.

Table 4. Analysis of BIM letters from A to Z by using the app.

Letter:        A    B    C    D    E    F    G    H    I    J    K    L    M
Accuracy (%):  60   100  70   100  50   90   70   90   100  80   70   90   100

Letter:        N    O    P    Q    R    S    T    U    V    W    X    Y    Z
Accuracy (%):  70   80   80   70   60   90   70   60   100  90   70   90   60

A speech-to-text analysis was conducted, and the accuracy results are presented in Table 5. The test aims to determine whether or not the application accurately recognises the speech. For example, the words 'abang' and 'sayang' have an accuracy of 100%, 'bapa' has an accuracy of 90%, and 'emak' and 'saya' have an accuracy of 80%.

Conclusions
In summary, the Bahasa Isyarat Malaysia (BIM) Android application was successfully developed, and all of this project's goals were completed. This success can be seen in the findings for the BIM letters, which achieved 99.75% accuracy after training the models. The app was built successfully for testing and analysis to determine the effectiveness of the whole system, where the test analysis reveals that, after ten trials, the average accuracy of the letter hand gestures was greater than 50%. The same may be said for speech-to-text, where an acceptable accuracy of more than 80% was attained. Briefly, this application can help deaf/mute and normal people communicate with ease. The project can also eliminate the need for a human translator, making it significantly more cost-effective while enabling faster and more engaging interaction.
Additionally, there are a number of potential areas for future research that can be taken into account: (i) to increase the accuracy of speech recognition, audio-visual speech recognition with lip-reading will be introduced and (ii) to increase the performance of hand gesture recognition, attention models that enable the system to concentrate on the most instructive portion of a sign video sequence can be used.

Data Availability Statement:
The alphabet dataset used in this study is partially and openly available in Kaggle, except for G and T (https://www.kaggle.com/datasets/grassknoted/asl-alphabet accessed on 7 March 2023).