Hand gestures for emergency situations: A video dataset based on words from Indian sign language

Automatic sign language recognition provides better services to the deaf by bridging the communication gap between them and the rest of society. Hand gestures, the primary mode of sign language communication, play a key role in improving sign language recognition. This article presents a video dataset of the hand gestures of Indian Sign Language (ISL) words used in emergency situations. Videos of eight ISL words were collected from 26 individuals (12 males and 14 females) in the age group of 22 to 26 years, with two samples from each individual, in an indoor environment under normal lighting conditions. Such a video dataset is greatly needed for the automatic recognition of emergency situations from sign language for the benefit of the deaf. The dataset is useful for researchers working on vision-based sign language recognition (SLR) as well as hand gesture recognition (HGR). Moreover, support vector machine based and deep learning based classification of the emergency gestures has been carried out, and the baseline classification performance shows that the dataset can serve as a benchmark for developing novel and improved techniques for recognizing the hand gestures of emergency words in Indian Sign Language.


Value of the data
The lack of publicly available datasets is a major challenge that hinders progress in automatic SLR. The dataset presented in this article is the first publicly available dataset of hand gestures of emergency ISL words. The data will help researchers develop novel techniques for improving the automatic recognition of ISL gestures [1,2]. Improvement in this field greatly benefits society, as it provides a communication platform for the deaf to convey urgent messages to the authorities. This dataset can act as a basic benchmarking database for a set of hand gestures of emergency ISL words. It can also serve as a starting point for expanding the dataset by replicating the samples or adding new samples of the gestures in different views and background conditions to further develop and improve SLR and HGR techniques [3].

Data description
The dataset contains RGB videos of hand gestures of eight ISL words, namely 'accident', 'call', 'doctor', 'help', 'hot', 'lose', 'pain' and 'thief', which are commonly used to convey messages or seek support during emergency situations. All the words in this dataset except 'doctor' are dynamic hand gestures. The videos were captured from 26 adult individuals (12 males and 14 females) in the age group of 22 to 26 years. The subjects who participated in the data collection are not representative of a particular region but represent the whole of India.
For dynamic gestures, hand gesture recognition depends primarily on motion features, that is, on silhouettes, shapes or edges and their variation over time due to hand movement, rather than on skin color features. Even though skin color plays only a minor role, extreme care was taken during data collection to include participants with maximum skin color variation, so that the dependency of gesture recognition performance on human skin color can be studied.
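As a rough illustration of the kind of motion feature referred to here, the sketch below (NumPy only; the toy frames and sizes are ours, for illustration, not part of the dataset) computes a simple frame-difference motion energy over a short grayscale sequence:

```python
import numpy as np

def motion_energy(frames: np.ndarray) -> np.ndarray:
    """Per-step motion energy of a (T, H, W) grayscale sequence.

    Each entry is the mean absolute difference between consecutive
    frames; static sequences give zeros, moving hands give peaks.
    """
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0))
    return diffs.mean(axis=(1, 2))

# Toy sequence: a bright block shifting one column to the right per frame.
T, H, W = 4, 8, 8
frames = np.zeros((T, H, W), dtype=np.uint8)
for t in range(T):
    frames[t, 3:5, t:t + 2] = 255
energy = motion_energy(frames)
print(energy)  # nonzero at every step: the gesture is moving
```

Real motion descriptors used in SLR are richer (optical flow, spatio-temporal gradients), but they build on the same temporal-variation idea.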
At times the skin color may closely resemble the background color (including the person's clothing), which can severely affect the classification rate. Hence, all the videos in this dataset were collected against a black background under normal lighting conditions in an indoor environment. Such a black background can be constructed easily and at very low cost with a board painted black and placed in front of the camera. As these words relate to emergency situations in which the deaf need to communicate with the world, high recognition rates with few false positives and few false negatives are essential. The plain black background in the videos helps increase hand gesture recognition performance with less computational overhead.
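A plain dark background makes hand segmentation nearly trivial. The following is a minimal sketch, assuming NumPy and a hypothetical intensity threshold of 40 on 8-bit grayscale frames (the threshold would in practice be tuned to the recording conditions):

```python
import numpy as np

def segment_foreground(gray_frame: np.ndarray, threshold: int = 40) -> np.ndarray:
    """Return a boolean mask of pixels brighter than the dark background.

    gray_frame: 2-D uint8 array (a grayscale video frame).
    threshold:  illustrative value, not taken from the article.
    """
    return gray_frame > threshold

# Toy frame: dark background (value 10) with a bright 'hand' region (value 200).
frame = np.full((6, 6), 10, dtype=np.uint8)
frame[2:4, 2:5] = 200
mask = segment_foreground(frame)
print(mask.sum())  # number of foreground pixels
```

This is why the black background lowers computational overhead: a single threshold replaces more expensive skin-color or background-model segmentation.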
The dataset is built with the objective of providing a benchmark for emergency hand gesture recognition, with the corresponding classification results serving as a reference for further improvements in ISL recognition. The dataset is organized in two folders, namely 'Raw_Data' and 'Cropped_Data'. The folder 'Raw_Data' contains the original ISL videos of size 1280x720 pixels. The folder 'Cropped_Data' contains the video sequences obtained after cropping out the excessive background and rescaling the frames to a uniform size of 500x600 pixels. Fig. 1 shows a set of keyframe sequences from sample videos of all eight hand gestures in the 'Cropped_Data' set.
The dataset contains a total of 824 sample videos in .avi format. The raw videos are labelled using the format ISLword_XXX_YY, where:
• ISLword is one of the words 'accident', 'call', 'doctor', 'help', 'hot', 'lose', 'pain' and 'thief'.
• XXX is an identifier of the participant, in the range 001 to 026.
• YY is 01 or 02 and identifies the sample number for each subject.
For example, the file named accident_003_02 is the video sequence of the second sample of the ISL gesture of the word 'accident' presented by the 3rd participant.
The cropped videos are labelled using the format ISLword_crop_XXX_YY, where:
• ISLword is one of the words 'accident', 'call', 'doctor', 'help', 'hot', 'lose', 'pain' and 'thief'.
• XXX is an identifier of the participant, in the range 001 to 026.
• YY is 01 or 02 and identifies the sample number for each subject.
For example, the file named accident_crop_003_02 is the video sequence of the second sample of the ISL gesture of the word 'accident' presented by the 3rd participant, obtained after cropping and downsampling to 500x600 pixels. Table 1 and Table 2 show the file and folder organization of the raw and cropped data sets respectively.
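The naming convention above can be parsed programmatically. A minimal sketch (the regular expression and helper name are ours, not part of the dataset) is:

```python
import re

# Matches both raw ('accident_003_02') and cropped ('accident_crop_003_02') names.
NAME_RE = re.compile(
    r"^(?P<word>accident|call|doctor|help|hot|lose|pain|thief)"
    r"(?P<cropped>_crop)?_(?P<subject>\d{3})_(?P<sample>\d{2})$"
)

def parse_name(stem: str) -> dict:
    """Split a file name (without extension) into its labelled parts."""
    m = NAME_RE.match(stem)
    if m is None:
        raise ValueError(f"unexpected file name: {stem}")
    return {
        "word": m.group("word"),
        "cropped": m.group("cropped") is not None,
        "subject": int(m.group("subject")),   # 1..26
        "sample": int(m.group("sample")),     # 1 or 2
    }

print(parse_name("accident_crop_003_02"))
```

Such a parser makes it easy to build train/test splits by subject rather than by file order.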

Experimental design
The hand gestures in this dataset follow the style and movements specified in the ISL dictionary published by the Ramakrishna Mission Vivekananda University, Coimbatore, Tamil Nadu, India [4,5]. The videos of the ISL words and their descriptions were shown to the participants to ensure effective presentation of the gestures. A Sony Cyber-shot DSC-W810 digital camera with a frame size of 1280x720 pixels was used for data collection. The data collection process received ethical clearance from the Institutional Human Ethics Committee (IHEC) of the Central University of Kerala, India. All the individuals read the detailed informed consent form and signed their consent for voluntary participation.
The videos in this dataset were collected by asking the participants to stand comfortably in front of a black colored board. The participants presented the eight hand gestures one by one, and the procedure was repeated twice to capture two sample videos of each gesture. A single frame of the video of the word 'accident' in original (raw) form and after cropping is shown in Fig. 2(a) and Fig. 2(b) respectively. All the videos were taken with the camera fixed at the same distance from the black board. Human hands are highly flexible, and the style and speed of hand movements showed great variation across individuals while presenting the gestures. No restriction was imposed on the speed or duration while capturing the video samples. Hence the duration of the videos varies from one to three seconds depending on the speed of the gesture presentation.

Data analysis
The ISL gestures in the cropped dataset have been analysed by classifying them with a conventional feature-driven approach using a multiclass support vector machine (SVM) [6] as well as with the recently evolved data-driven approach using a deep learning model. In both cases, 50% of the dataset is used for training and the remaining 50% for testing. In the feature-based approach, a set of keyframes is extracted from the video sequences through a fast and efficient method based on image entropy and density clustering, as proposed in [7]. The keyframe extraction eliminates redundant information and gives all videos an equal number of frames. Appearance features, namely three-dimensional wavelet transform descriptors [8], are extracted from the keyframes and used for training and testing the SVM classifier. SVM is a supervised machine learning approach for binary and multiclass pattern recognition. Its memory-efficient operation through a subset of data points called support vectors and the availability of versatile kernel functions make it a widely adopted choice for image/video feature classification as well. It is extensively used in classification problems with comparatively few training samples and has shown strong performance. The multiclass SVM used in this work obtained an average classification accuracy of 90%.
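The keyframe method of [7] combines image entropy with density clustering. As a rough illustration of the entropy scoring alone (the top-k selection below is a simplification of ours, not the method of [7]), frames can be ranked by the Shannon entropy of their intensity histograms:

```python
import numpy as np

def frame_entropy(gray: np.ndarray, bins: int = 32) -> float:
    """Shannon entropy (bits) of a grayscale frame's intensity histogram."""
    hist, _ = np.histogram(gray, bins=bins, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def top_k_keyframes(frames: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k frames with the highest entropy, in temporal order."""
    scores = np.array([frame_entropy(f) for f in frames])
    return np.sort(np.argsort(scores)[-k:])

# Toy video: mostly flat frames plus two textured (high-entropy) frames.
rng = np.random.default_rng(0)
frames = np.full((5, 16, 16), 128, dtype=np.uint8)
frames[1] = rng.integers(0, 256, (16, 16), dtype=np.uint8)
frames[3] = rng.integers(0, 256, (16, 16), dtype=np.uint8)
print(top_k_keyframes(frames, k=2))
```

Flat frames score zero entropy, so the textured frames (indices 1 and 3) are selected; picking a fixed k per video is one simple way to give every sequence an equal number of frames.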
In the deep learning approach, a pre-trained convolutional neural network (CNN), namely GoogLeNet [9], is combined with a long short-term memory (LSTM) network [10] for gesture classification. The videos are converted into sequences of feature vectors by taking the activations of the last pooling layer of the GoogLeNet network. The LSTM classification model is built with a sequence input layer followed by a bidirectional LSTM layer with 2000 hidden units and a dropout layer. The output of the dropout layer is transformed by a fully connected layer into the size required for classification by a softmax layer and a final classification layer. The network is trained on the sequences of feature vectors for 20 epochs with the adaptive moment estimation (Adam) optimizer, a batch size of 16 and an initial learning rate of 0.0001, with 10% of the training data used for validation. The classification model evaluated on the test videos achieved an average classification accuracy of 96.25%.
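As a rough illustration of how an LSTM turns a sequence of per-frame feature vectors into a class decision, the sketch below runs a single unidirectional LSTM cell forward pass in NumPy with random weights and tiny stand-in dimensions; the actual model uses a trained bidirectional layer with 2000 hidden units and GoogLeNet features, which are omitted here for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_forward(seq, Wx, Wh, b):
    """Run one LSTM cell over seq (T, D); return the last hidden state.

    Wx: (D, 4H), Wh: (H, 4H), b: (4H,) -- gates packed as [i, f, g, o].
    """
    H = Wh.shape[0]
    h = np.zeros(H)
    c = np.zeros(H)
    for x in seq:
        z = x @ Wx + h @ Wh + b
        i, f, g, o = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        c = f * c + i * np.tanh(g)       # update cell state
        h = o * np.tanh(c)               # emit hidden state
    return h

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Tiny stand-in sizes: D=8 features per frame, H=4 hidden units, 8 gesture classes.
T, D, H, C = 10, 8, 4, 8
seq = rng.normal(size=(T, D))                  # stand-in for CNN feature vectors
Wx = rng.normal(scale=0.1, size=(D, 4 * H))
Wh = rng.normal(scale=0.1, size=(H, 4 * H))
b = np.zeros(4 * H)
Wc = rng.normal(scale=0.1, size=(H, C))        # final fully connected layer
probs = softmax(lstm_forward(seq, Wx, Wh, b) @ Wc)
print(probs.sum())  # probabilities over the eight gesture classes sum to 1
```

The trained model replaces the random weights with learned ones and adds the bidirectional pass, dropout and softmax/classification layers described above.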
The classification performance of both methods has also been evaluated using the precision, recall and F-score values for each gesture class, as shown in Table 3. The reported average classification accuracies together with the precision, recall and F-score values can be considered baseline performance measures that future work on this dataset may improve upon.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships which have, or could be perceived to have, influenced the work reported in this article.