A Dynamic Gesture Recognition Interface for Smart Home Control based on Croatian Sign Language

Abstract: Deaf and hard-of-hearing people face many challenges in everyday life. Their communication is based on sign language, and the extent to which the cultural and social environment understands that language determines how accessible it is to them. Technology is a key factor with the potential to provide solutions that achieve higher accessibility and therefore improve the quality of life of deaf and hard-of-hearing people. In this paper, we introduce a smart home automation system specifically designed to provide real-time sign language recognition. The contribution of this paper comprises several elements. A novel hierarchical architecture is presented, including resource- and time-aware modules: a wake-up module and a high-performance sign recognition module based on a Conv3D network. To achieve high-performance classification, multi-modal fusion of the RGB and depth modalities was used with temporal alignment. In addition, a small Croatian sign language database containing 25 different language signs for use in a smart home environment was created in collaboration with the deaf community. The system was deployed on an Nvidia Jetson TX2 embedded system with a StereoLabs ZED M stereo camera for online testing. The obtained results demonstrate that the proposed practical solution is a viable approach for real-time smart home control.


Introduction
Most people, at some point in life and especially in older age, are likely to experience either temporary or permanent disability, or to face increasing difficulties in functioning [1,2]. Considering the type of impairment we focus on in this paper: in 2019, around 466 million people in the world had disabling hearing loss, of whom 34 million were children. Indicators predict that by 2050, 900 million people will face the consequences of unequal communication in daily life [3]. Beyond these concerning quantitative indicators, the unquestionable truth is that such persons must use specific communication procedures to integrate. The communication of deaf and speech-impaired people is based on sign language, and knowledge of it allows integration only to a certain extent. Disability, as an inability to integrate, is a condition and a direct result of the inaccessible and complex environment surrounding those with a health impairment [1]. The environment either disables people or supports their inclusion and participation in the general population [1].
Technology plays a crucial part in accessibility initiatives and solutions [1,2], with particular emphasis placed on research and development of new methods in human-computer interaction (HCI) oriented toward natural user interfaces (NUI). Responding to the demand for familiar, natural interaction, in this paper we propose a system for real-time sign language dynamic gesture recognition with application in the context of a smart home environment. The contributions of our work rest on several elements. First, real-time dynamic gesture recognition is achieved, with online recognition deployed and realized on an NVIDIA Jetson TX2 embedded system combined with a StereoLabs ZED M stereo camera. Second, a hierarchical architecture is used to achieve more effective use of resources (memory and processing demand), with the wake-up module as an activation network and the fusion of two 3DCNN networks as a high-performance classifier operating on multimodal RGB and depth inputs. Third, a specific set of custom sign language gesture control commands for interacting with the smart home environment is defined. Lastly, the proposed system is trained and evaluated on Croatian Sign Language gestures/commands, forming a dataset with a specific application in the smart home environment.
The rest of the paper is organized as follows. Section 2 reviews relevant research. Section 3 describes the proposed sign language interface and its component modules. The performance evaluation is presented in Section 4, and Section 5 concludes the paper.

Related Work
Hand gesture recognition is one of the most prominent fields of human-computer interaction. There are many related studies of hand gesture recognition using wearable [9] and non-wearable sensors [8]. Considering the different sensors, the research involves the use of specialized hardware such as Microsoft Kinect [14][15][16], stereo cameras [17], and sensor gloves [18][19][20], as well as non-specialized hardware like mobile phone cameras [21,22], web cams [23], etc. The use of specialized hardware for hand gesture acquisition primarily bypasses certain steps in the process that would otherwise have to be taken into account, such as hand segmentation, hand detection, hand orientation, finger isolation, etc. Traditional approaches to gesture classification were based on hidden Markov models (HMMs) [24], support vector machines (SVMs) [25], conditional random fields (CRFs) [26], and multi-layer perceptrons (MLPs) [27]. In recent years, research interest has shifted from a sensor-based approach to a vision-based approach, thanks to rapid advancement in the field of deep learning-based computer vision. The most crucial challenge in deep learning-based gesture recognition is how to deal with the temporal dimension. More recent work involves models based on deep convolutional neural networks (CNNs), long short-term memory (LSTM), and derivative architectures. Earlier, 2DCNNs were shown to provide highly accurate results on images, so they were applied to videos in combination with different approaches. Video frames were used as multiple inputs to a 2DCNN in [28,29]. A combination of a 2DCNN and an LSTM was proposed in [30], where features were extracted with a 2DCNN and then fed to an LSTM network to cover the temporal component. In [30], spatial features were first extracted from frames with a long-term recurrent convolutional network (LRCN), and temporal features were then extracted with a recurrent neural network (RNN).
A two-stream convolutional network (TSCN) was used in [29] to extract spatial and temporal features. In [31], a convolutional LSTM (VideoLSTM) was used to learn spatio-temporal features from previously extracted spatial features. In [32], the proposed model is a combination of a three-dimensional convolutional neural network (3DCNN) and long short-term memory (LSTM), used to extract spatio-temporal features from a dataset containing RGB and depth images. In [33], spatio-temporal features were extracted in parallel utilizing a 3DCNN. In [34], 3DCNNs were used for spatio-temporal feature extraction with 3D convolution and pooling. The extraction and quality of spatial features in the recognition process are highly influenced by factors such as background complexity, hand position, hand-to-scene size, hand/finger overlapping, etc. [8,35]. In such circumstances, spatial features can easily be overwhelmed by those factors and become non-discriminative in the process. Therefore, the temporal information provided by the sequence of scenes/frames becomes the key factor [36]. This information is of high importance especially for the real-time gesture recognition process: considering the stream of video frames, learning the spatio-temporal features simultaneously is more likely to provide quality results than learning them separately or in sequence [4,36].
Considering the modality, in [37] RGBD data of the hand and upper body were combined and used for sign language recognition. To detect hand gestures, in [38], YCbCr and SkinMask segmented images were the CNN's two-channel inputs. In [39], a method for fingertip detection and real-time hand gesture recognition based on the RGBD modality and a 3DCNN network was proposed. In [40], the best performance was reported using RGBD data and a histogram of oriented gradients (HOG) with an SVM as the classifier. Further, in [41], a dynamic time warping (DTW) method was applied to the HOG and the histogram of optical flow (HOF) to recognize gestures.
Advancement in the field of automatic sign recognition profoundly depends on the availability of relevant sign language databases, which are either specific to a language area [42] or have a limited vocabulary for the area of application [43]. Given the practical applications of automatic sign language recognition, most studies focus on methods oriented toward the translation of sign language gestures into textual information. One interesting solution for improving the social interaction of sign language users is proposed in [38], where the authors introduced a sign language translation model using 20 common sign words. In [42], a deep learning translation system is proposed for 105 sentences that can be used in emergency situations. Considering that hand gestures are the most natural and thus the most commonly used modality in HCI communication, it is vital to consider sign language as the primary medium for NUI. An example of a practical system for the automatic recognition of the American Sign Language finger-spelling alphabet, which assists people living with speech or hearing impairments, is presented in [44]. The system was based on the use of Leap Motion and Intel RealSense hardware with SVM classification. In [45], a wearable wrist-worn camera (WwwCam) was proposed for real-time hand gesture recognition to enable services such as controlling mopping robots, mobile manipulators, or appliances in a smart home scenario.

Proposed Method-Sign Language Command Interface
In this section, we describe the proposed system for human-computer interaction based on Croatian Sign Language. The proposed method is envisaged as a dynamic gesture-based control module for interaction with the smart environment. Our goal was to implement a touchless control interface customized for sign language users. By selecting a limited vocabulary tailored to smart home automation commands, the proposed solution can manage household appliances such as lights, thermostats, door locks, and domestic robots. The infrastructure of the proposed sign language control module was designed to meet certain requirements for real-time online applications, such as reasonable classification performance and hardware efficiency with a swift reaction time. The proposed infrastructure consists of three main parts: the wake-up module (see Section 3.1), the sign recognition module (see Section 3.2), and the sign command interface (see Section 3.3). Figure 1 illustrates the pipeline of the proposed sign language command interface. In the defined workflow, the proposed system continuously receives RGB-D data, placing it into two separate input queues. The wake-up module is subscribed to the data queue containing the sequence of depth images. In each operating cycle, the wake-up module performs gesture detection based on N consecutive frames. If the start of a gesture is successfully detected, the command interface is placed in an attention state, and the sign language recognition module becomes active. It performs hierarchical gesture classification on two modality sequences of maximum size M, starting from the beginning of the input queues. The result of sign classification is passed as a one-hot encoded vector to the command parser, which maps the recognized sign to a predefined vocabulary word used to build the automation command in JSON format.
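The workflow above can be sketched in code. This is a minimal illustration, not the paper's actual implementation: the queue sizes, module interfaces, class names, and vocabulary below are all hypothetical.

```python
import json
from collections import deque

N = 8    # frames per wake-up window (illustrative value)
M = 105  # maximum input queue length for the recognition module (illustrative)

class SignCommandPipeline:
    """Minimal sketch of the described workflow; all names are hypothetical."""

    def __init__(self, wake_up, recognizer, vocabulary):
        self.wake_up = wake_up        # depth-based binary gesture detector
        self.recognizer = recognizer  # RGB-D sign classifier
        self.vocabulary = vocabulary  # class index -> vocabulary word
        self.rgb_queue = deque(maxlen=M)
        self.depth_queue = deque(maxlen=M)
        self.attention = False

    def push(self, rgb_frame, depth_frame):
        """Feed one RGB-D frame pair; return a recognized word or None."""
        self.rgb_queue.append(rgb_frame)
        self.depth_queue.append(depth_frame)
        # wake-up detection on the last N depth frames
        if not self.attention and len(self.depth_queue) >= N:
            if self.wake_up(list(self.depth_queue)[-N:]):
                self.attention = True  # start of a gesture detected
        if self.attention:
            idx = self.recognizer(list(self.rgb_queue), list(self.depth_queue))
            return self.vocabulary[idx]
        return None

def build_command(signs):
    """Map a recognized sign sequence onto the action/device/location
    command structure and serialize it as JSON."""
    action, device, location = (list(signs) + [None] * 3)[:3]
    return json.dumps({"action": action, "device": device, "location": location})
```

In a real deployment, `wake_up` and `recognizer` would be the trained networks described in the following subsections; here they are stand-in callables.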

Wake-Up Gesture Detector
The primary purpose of the wake-up module is to lower power consumption and to deliver better hardware performance and memory resource efficiency in the implementation of the proposed sign language command interface. Given that the control interface continuously receives an incoming video stream of RGB-D data in which, for most of the time, no gesture is present, the main task of the proposed wake-up module is to act as an initiator for the sign language recognition module. This component performs a binary classification (gesture, no gesture) on the first N frames of the depth data queue; thus, it reduces the number of undesired activations of the proposed command module. To adequately address the real-life scenario in which different Croatian sign language signs have different durations, the proposed detector is also employed to manage the queue length M of the sign recognition module. While sign recognition is active, the detector module keeps operating as a sliding window with a stride of 1. When the number of consecutive "no gesture" predictions reaches a threshold, the wake-up module performs a temporal deformation by resampling the input sequences of size M to the fixed size m on which the sign language classification operates. In this work, resampling was done by selecting m input frames linearly distributed between 1 and M. Since the wake-up module runs continuously, it was implemented as a lightweight 3DCNN architecture, as shown in Figure 2, left.
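The temporal deformation step described above can be sketched as follows; this is a minimal illustration of linear index selection, with frame contents left abstract.

```python
import numpy as np

def resample_sequence(frames, m):
    """Temporal deformation: resample a variable-length sequence of M frames
    to a fixed length m by picking m indices linearly distributed over the
    sequence, as described above."""
    M = len(frames)
    indices = np.linspace(0, M - 1, num=m).round().astype(int)
    return [frames[i] for i in indices]
```

The same mechanism also handles sequences shorter than m, in which case some frames are repeated.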
The network consists of four convolutional blocks, followed by two fully connected layers. Each convolution layer is interspersed with a batch normalization layer, a rectified linear unit (ReLU), and a max-pooling layer. The three convolutional layers had 32, 32, and 64 filters, respectively. The kernel size of each Conv3D layer was 3 × 3 × 3, with a stride of 1 in each direction. Drop-out was utilized during training after the last convolutional layer, in addition to L2-regularization on all weight layers. To decrease the probability of false-positive predictions, the model was trained with a binary cross-entropy loss using Adam optimization with a mini-batch size of 16. The proposed gesture wake-up architecture presents a fast and robust method suitable for real-time applications.
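A lightweight 3DCNN of this kind might be sketched in Keras as follows, using the stated 32/32/64 filter progression, 3 × 3 × 3 kernels, batch normalization, ReLU, max-pooling, drop-out, L2-regularization, and a sigmoid output trained with binary cross-entropy. The input resolution, dense-layer width, drop-out rate, and regularization factor are assumptions not given in the text.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def build_wake_up_model(frames=8, height=112, width=112):
    """Sketch of the lightweight wake-up 3DCNN described above.
    Input resolution and head sizes are illustrative assumptions."""
    reg = regularizers.l2(1e-4)  # L2-regularization on weight layers (factor assumed)
    inputs = tf.keras.Input(shape=(frames, height, width, 1))  # depth modality only
    x = inputs
    for filters in (32, 32, 64):
        x = layers.Conv3D(filters, 3, strides=1, padding="same",
                          kernel_regularizer=reg)(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
        x = layers.MaxPool3D(pool_size=2)(x)
    x = layers.Flatten()(x)
    x = layers.Dropout(0.2)(x)  # drop-out after the last convolutional block
    x = layers.Dense(64, activation="relu", kernel_regularizer=reg)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)  # gesture / no gesture
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(),
                  loss="binary_crossentropy")
    return model
```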

Sign Recognition Module
This module represents the core functionality of the proposed smart home system. Its task is to recognize users' gestures and decode them into sign language words. Given the practical application of our system, it was necessary to choose a deep learning architecture suitable for the available memory capacity and power budget. It also needs to provide a good trade-off between classification performance, fast reaction time, and robustness to environmental conditions such as varying lighting and complex backgrounds. Motivated by the recent success of adopting multi-modal data for robust dynamic gesture recognition, in this work we propose a multi-input network classifier. The proposed solution consists of two identical parallel sub-networks, each operating on a different data modality. Both modality streams, RGB and depth, were spatially and temporally aligned, which facilitates the subnetworks having the same understanding of an input sign gesture. By opting for two different modalities, our architecture ensures that each subnetwork learns relevant spatiotemporal features of its input modality. In our approach, Conv3D was employed to extract spatiotemporal information from the input video data. The network architecture of the proposed sign language classifier is illustrated in Figure 3 (Sign language classifier network architecture).
In our approach, each network was trained separately with the appropriate data type, each maintaining its auxiliary loss on the last fully connected layer. For modality fusion, both subnetworks were concatenated with a new dense layer. Training was performed by optimizing the weights of the dense layers while freezing the weights of the ConvNet blocks. Each unimodal architecture consists of blocks numbered 1 to 4, followed by two fully connected layers and the output Softmax layer. In each block, a Conv3D layer is succeeded by batch normalization and a MaxPool3D layer, except for block 4, which contains two Conv3D layers. The numbers of feature maps per Conv3D layer are 32, 64, 128, 256, and 256, respectively. All Conv3D layers use a filter size of 3 × 3 × 3, while padding and stride dimensions are set to 1 × 1 × 1. Further, in the first block only spatial down-sampling was performed, while for the rest of the blocks, the pool size and stride of the MaxPool3D layer were set to 2 × 2 × 2. A drop-out of 0.2 was employed during training before each fully connected layer. Training was performed using the Adam optimizer with a mini-batch size of 16. A detailed structure of the unimodal RGB model is shown in Figure 2, right.
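The fusion step might be sketched as follows: the convolutional blocks of the two pre-trained unimodal networks are frozen, their outputs concatenated, and a new dense head is trained. The head widths here are assumptions, and the function accepts any two Keras models for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_fusion_model(rgb_net, depth_net, num_classes=25):
    """Sketch of the modality-fusion stage: freeze the convolutional blocks
    of the two pre-trained subnetworks, concatenate their outputs, and train
    a new dense head. Layer widths are illustrative assumptions."""
    for net in (rgb_net, depth_net):
        for layer in net.layers:
            # freeze the ConvNet block weights, as described above
            if isinstance(layer, (layers.Conv3D, layers.BatchNormalization,
                                  layers.MaxPool3D)):
                layer.trainable = False
    merged = layers.Concatenate()([rgb_net.output, depth_net.output])
    x = layers.Dropout(0.2)(merged)
    x = layers.Dense(256, activation="relu")(x)  # new dense fusion layer
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    model = tf.keras.Model([rgb_net.input, depth_net.input], outputs)
    model.compile(optimizer="adam", loss="categorical_crossentropy")
    return model
```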
To properly train a 3DCNN, much more training data is required than for its 2D counterpart, since the number of learnable parameters is much higher. To prevent overfitting during training, in this work we performed online spatiotemporal data augmentation. Spatial augmentation included translation, rotation, and scaling, where the transformation parameters were fixed for a particular data batch. Besides affine transformations, we also applied Gaussian blur to the RGB data. Since we trained our network from scratch, we adopted lower transformation values for these steps, which helps the model converge faster. For temporal augmentation, we randomly selected successive frames within a range of the defined input size. In the case of fusion training, the data augmentation parameters were fixed to maintain the spatial and temporal alignment between the two input modalities. All of these operations were applied randomly inside the mini-batches fed into the model.
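The batch-level parameter sharing can be illustrated as follows; the transformation set here (a horizontal pixel shift plus a temporal crop) is a simplified stand-in for the paper's affine transforms and blur, but it shows how one draw of the parameters keeps the two modalities aligned.

```python
import numpy as np

def augment_batch(rgb, depth, rng=None):
    """Sketch of batch-level spatiotemporal augmentation: transformation
    parameters are drawn once per batch and applied identically to the RGB
    and depth arrays so the two modalities stay aligned.
    Arrays have shape (batch, frames, height, width, channels)."""
    if rng is None:
        rng = np.random.default_rng(0)
    shift = int(rng.integers(-4, 5))                 # horizontal translation (pixels)
    t0 = int(rng.integers(0, rgb.shape[1] - 8 + 1))  # start of an 8-frame temporal crop

    def apply(x):
        x = np.roll(x, shift, axis=3)  # same spatial shift for both modalities
        return x[:, t0:t0 + 8]         # same temporal window for both modalities

    return apply(rgb), apply(depth)
```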

Sign Command Interface
After the wake-up module concludes that there are no more gestures present in the input data stream, i.e., the number of consecutive "no gesture" predictions reaches the defined threshold, it signals the sign language recognition module to perform sign classification on the current RGB-D queues. The result of sign classification is given through the Softmax layer as class-membership probabilities. These probabilities are then passed to the Sign command interface in the form of a one-hot encoded vector.
The main task of the Sign command interface is to create structured input from the sequence of recognized language signs. This sequence can later be used by the smart home automation to realize user commands. In this work, we established a sign language vocabulary suitable for a specific range of commands within the home automation context. We also defined a grammar file with all possible command patterns. For a given smart home automation scenario, we distinguished between two types of devices: an actuator, where the user can set the state, and a sensor, which allows the acquisition of information about the state of the sensor/environment parameter.
Further, we designed command patterns to control or read the state of an individual device or a group of devices. The command pattern is formatted as follows: an action part, a device name/device group part, and an optional part of the command, which is the location descriptor (e.g., room, kitchen). An example of a user command is shown in Figure 4. Another functional contribution of the command interface is to improve the recognition of users' sign language commands. Relying on the established command pattern/format, which is seen as the sequence structure, the proposed sign interface additionally performs classification refinement. If the prediction at the current sequence step does not fit the expected language sign subclass, it selects the next prediction with the highest probability. This refinement can be recursively repeated for the next two steps. For example, if the system recognizes a sign related to a household device while it expects a command action, it can reject the prediction and look for the next sign command with the highest probability score, which could also be the expected one.
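The refinement step can be sketched as follows. The grouping of class indices into command slots is hypothetical (chosen only to match the stated 13/8/4 split), and the retry limit mirrors the "next two steps" rule above.

```python
import numpy as np

# Hypothetical grouping of the 25 sign classes into the three command slots.
SIGN_CLASS = {i: "action" for i in range(13)}
SIGN_CLASS.update({i: "device" for i in range(13, 21)})
SIGN_CLASS.update({i: "location" for i in range(21, 25)})

def refine_prediction(probabilities, expected, max_retries=2):
    """Refinement sketch: walk the predictions from highest to lowest
    probability, rejecting signs that do not belong to the command slot
    expected at this sequence position; give up after max_retries
    rejections and return the best remaining candidate."""
    order = np.argsort(probabilities)[::-1]  # class indices, best first
    for rank, class_index in enumerate(order):
        if SIGN_CLASS[int(class_index)] == expected or rank > max_retries:
            return int(class_index)
    return int(order[0])
```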

Croatian Sign Language Dataset
Sign language is a visual language of communication used by deaf and hard-of-hearing people. According to [46], almost all EU nations have some form of recognition of their sign language. The task of automatic sign language recognition largely shares the constraints relevant to dynamic gesture classification. In this context, it is common to use a collection of labeled video clips in which users perform predefined hand gestures. Although there is a considerable number of available dynamic hand gesture datasets, such as 20BN-jester [11], EgoHands [47], and Nvidia Dynamic Hand Gesture [48], relatively few sign language corpora exist [49]. Since sign language is unique to a particular region and there is no available Croatian dataset, in this work we created a new small dataset based on Croatian sign language: SHSL. Considering that sign languages come with a broad range of factors, we restricted ourselves to the topic of smart home automation commands. First, through collaboration with the deaf community, the vocabulary for home automation was defined. The proposed sign language corpus contains the 25 language signs needed to construct a smart home command. The previously mentioned command pattern groups the SHSL signs into three categories: actions, household items, and house locations, where the categories contain 13, 8, and 4 signs, respectively. For the production of the SHSL dataset, 40 volunteers were selected to perform each of the 25 sign gestures twice, which resulted in a total collection of 2000 sign videos. The average gesture duration is 3.5 s. Data acquisition was performed with a ZED M camera, which facilitates the collection of RGB-D data. The video was recorded at 1920 × 1080 resolution with 30 fps. The camera stand was placed 1.3 m from the signer, and each signer was instructed to wear a dark shirt.

Experiments
In this section, an experimental evaluation of the proposed system is made by analyzing each component separately. Given that the network architectures of the lightweight wake-up module and the high-performance sign recognition module are based on Conv3D, which generally has a considerable number of learnable parameters, as illustrated in Figure 2, a substantial amount of data was required to minimize the potential for overfitting. In this work, apart from applying data augmentation methods, we also pre-trained our models with the Nvidia hand gesture dataset to obtain a proper model initialization. The Nvidia hand gesture dataset contains 1532 clips distributed over 25 dynamic hand gesture classes. In total, 20 subjects were recorded with multiple sensors installed at several locations inside a car simulator. To initialize the models, we used only the provided RGB and depth modalities, randomly split with a 5:1 ratio into training and validation sets. Each component of the system was trained on two Nvidia RTX 2080 Ti GPUs using the TensorFlow mirrored distribution strategy. In the testing phase, the modules were integrated and exported to a power-efficient embedded AI computing device, the Nvidia Jetson TX2, to compute forward predictions. The Jetson TX2 is a CUDA-compatible device, which is required by our ZED M camera to compute the depth modality in real-time.
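The random 5:1 split used for initialization can be sketched as follows; the seed and function name are illustrative.

```python
import random

def split_dataset(samples, ratio=(5, 1), seed=42):
    """Random train/validation split at the given ratio (5:1 as stated
    above); the seed is illustrative."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = len(shuffled) * ratio[0] // sum(ratio)
    return shuffled[:cut], shuffled[cut:]
```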

Performance Evaluation of the Wake-Up Module
In this work, the wake-up module had the role of a real-time hand gesture spotter which, in addition to distinguishing between "gesture" and "no gesture", also manages the length of the input sequence M used by the sign recognition module. Considering that the efficiency of the Sign language command interface highly depends on the performance of the Wake-up module, we analyzed its accuracy with respect to the size of the input sequence N on which the gesture detector operates. Using a longer input sequence N directly affects the time it takes to recognize a gesture and thus to execute the user's command. The performance of the proposed wake-up module is reported as binary classification accuracy for three input sizes: 4, 8, and 16 frames. For the sake of real-time execution, the lightweight architecture was trained using only depth modality data; the corresponding results are shown in Table 1 (Wake-up binary classification accuracy). The results show that the best accuracy was obtained using 16 frames. Concerning the practical real-time requirements explained in Section 3, the proposed Sign language command interface integrates the 8-frame Wake-up module. This decision was based on the parameter reduction with a negligible decrease in performance; a smaller frame size also provides better wake-up resolution. Figure 5 shows the classification accuracy of the proposed model on the training and validation sets using 8 frames, from which it is visible that the model fits the problem well. Training was performed with a class ratio of 5:3 for "no gesture" and "gesture", as our experiments showed that this proportion was sufficient to obtain model precision and recall of 93.8% and 93.1%, respectively.

Performance Evaluation of the Sign Recognition Module
In this work, the sign recognition module is realized as a hierarchical architecture of two identical parallel subnetworks, each operating on a different input modality. For the training process, we first initialized each of the two subnetworks on the Nvidia hand gesture dataset using the respective RGB and depth data. For model initialization, a total of five training-validation runs were performed. The pre-trained subnetworks were selected based on the maximum reported validation accuracy of 62.3% and 65.6% for the RGB and depth modality, respectively. After obtaining the validation accuracy, the model was updated with the remaining 306 validation samples to finish the initialization process. The subsequent fine-tuning on the SHSL dataset started with a learning rate of 0.005, controlled by a scheduler that reduced the learning rate by a factor of 5 if the cost function did not improve by 10%. For this experiment, we compared the performance of each single-modality subnetwork and the modality-fusion model with respect to the number of input frames M, to determine the influence of a particular input modality on the recognition result. The sign recognition module was evaluated using a leave-one-subject-out (LOSO) cross-validation scheme on the SHSL dataset. As explained in Section 3.4, the SHSL dataset contains video data of 40 users, where every user was recorded performing each of the 25 sign language gestures twice. In the proposed validation scheme, the analyzed networks were trained with video data collected from 39 subjects and tested on the remaining user. The average classification results over the 40 users for LOSO are shown in Table 2. The results show that network accuracy depends on the length of the input sequence M, where using a longer input achieves higher performance.
We can also observe that the network employing the depth modality performs better than the RGB one, while a significant performance improvement was reported when modality fusion was introduced (Table 2: Sign language recognition, leave-one-subject-out (LOSO)). To improve recognition for personalized use, we implemented an additional fine-tuning process in which we retrained the dense layers with the first version of the user's trials containing each of the 25 gestures. Similarly to the LOSO scheme, performance evaluation was made by training each model with the data of 39 users (1950 videos) together with the first 25 corresponding videos of the current signer. The results in Table 3 show the average accuracy over 40 users, from which it is evident that the user adaptation technique can additionally increase performance. Performance per user is given in Figure 6, where it is visible that signer #5 achieved the lowest accuracy of 58.8%, while signer #2 achieved the best accuracy of 80.9%. Figure 6 also shows the accuracy improvement per user: the most notable improvement of 7.8% was achieved for signer #21, while there are a few cases where this procedure decreased the results, as with signer #12 (−3.3%); on average, subject adaptation delivers an improvement of 2.1%.
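The LOSO scheme used above can be expressed as a simple split generator over clip-level subject labels:

```python
def loso_splits(subject_ids):
    """Leave-one-subject-out cross-validation: for each subject, yield the
    indices of the training clips (all other subjects) and the test clips
    (the held-out subject only)."""
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test
```

For the SHSL dataset, `subject_ids` would contain 2000 entries (40 subjects, 50 clips each), yielding 40 folds of 1950 training and 50 test clips.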

Performance Evaluation of the Sign Command Interface
The purpose of the proposed sign command interface is to interpret a sequence of recognized language signs as a command format consisting of three parts: an action part (A), a device part (D), and a location part (L). Given the application domain, the proposed sign language corpus contains 25 language signs grouped according to the command format as 13, 8, and 4 signs for the action, device, and location part, respectively. The performance of each command group is given in Tables 4-6. The confusion matrix in Table 4 presents the misclassification between action gestures (1-13) in terms of precision and recall per gesture, with an average classification accuracy of 73.3%. Likewise, the confusion matrix in Table 5 refers to device gestures (14-21), with an average classification accuracy of 70%, and Table 6 refers to the location vocabulary (22-25), which accomplishes 71.9% accuracy. The confusion matrices show that knowing the type of gesture in advance can minimize the errors; that is, by rejecting all gestures that do not belong to the observed set, precision can reach 90.2%, 81.6%, and 73.1% for A, D, and L, respectively.
To analyze the possible usability of the proposed system in real-life applications, we tested our solution with regard to the sentence error rate (SER), and we analyzed the execution time of the proposed wake-up and sign recognition modules. In this work, SER was calculated as the percentage of language sign sequences that do not exactly match the reference. Based on directions from deaf people, for this experiment we defined 15 different combinations of sign language sequences (commands), so that every sign is performed at least once.
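The SER metric, as defined above, reduces to an exact-match count over command sequences:

```python
def sentence_error_rate(predicted, reference):
    """SER: the percentage of recognized sign sequences that do not exactly
    match the corresponding reference command sequence."""
    errors = sum(p != r for p, r in zip(predicted, reference))
    return 100.0 * errors / len(reference)
```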
The reported SER results follow the conclusion reached from Figure 6: the worst performance (practically unusable) was obtained for signer #5, and the best score was reported for user #2, who achieved an SER of 40% for LOSO and 33% for user adaptation. Additionally, a refinement procedure following the established command format was introduced. This method achieves a performance improvement, reporting an SER of 20% for both LOSO and subject adaptation. To meet the design requirements concerning real-time application, we based our practical solution on the Nvidia TX2 platform employing the TensorRT framework to achieve low-latency, high-throughput runtime for deep learning applications. TensorRT-based solutions deliver up to 40× better performance than CPU counterparts [50], which enables our system to perform an efficient subject adaptation procedure in real-time. To analyze the effectiveness of the proposed solution, we measured the execution time of each component in the prediction phase. Using a data batch size of 10, the recorded execution times of 180 ms, 290 ms, and 510 ms for the wake-up module, the unimodal subnetwork, and the modality fusion, respectively, demonstrate that the proposed solution can maintain the real-time processing criterion on a power-efficient embedded AI computing device.

Conclusions
Sign language is a primary form of communication for deaf and hard-of-hearing people. This visual language has its own vocabulary and syntax, which poses a serious challenge for the deaf community when integrating with social and work environments. To assist the interaction of deaf and hard-of-hearing people, we introduced an efficient smart home automation system. The proposed system integrates a touchless interface tailored for sign language users. In collaboration with deaf people, a small Croatian sign language database was created, containing 25 different language signs within a vocabulary specifically intended for smart home automation. For developing a real-time application, we presented a novel hierarchical architecture that consists of a lightweight wake-up module and a high-performance sign recognition module. The proposed models employ Conv3D layers to extract spatiotemporal features. To obtain high-performance classification, we performed a multi-modal fusion with temporal alignment between the two input modalities, RGB and depth. Moreover, effective spatiotemporal data augmentation was applied to obtain better accuracy and to prevent overfitting of the model. The evaluation results demonstrate that the proposed sign classifier can reach reasonable accuracy in recognizing individual language signs in a LOSO scheme. We also demonstrated the improvement of the results when subject adaptation was performed. The performance of the whole system was given in terms of the sentence error rate. From the results presented in Section 4, we conclude that the proposed Sign command interface can be efficiently used within a smart home environment. As a practical realization, the online phase of the proposed real-time system was implemented and tested on the Nvidia Jetson TX2 embedded system with a Stereolabs ZED M stereo camera.
In future work, research efforts will be directed towards the continuous recognition of sign language in a smart home environment. We plan to expand the vocabulary of our Croatian sign language database, followed by developing a more complex grammar and language model. Improvements to the deep learning framework will also be made, including additional user modalities such as facial expressions and head and body postures.

Funding: The work of doctoral student Luka Kraljević has been fully supported by the "Young researchers' career development project-training of doctoral students" of the Croatian Science Foundation.