Attributes’ Importance for Zero-Shot Pose-Classification Based on Wearable Sensors

This paper presents a simple yet effective method for improving the performance of zero-shot learning (ZSL). ZSL classifies instances of unseen classes, for which no training data are available, by utilizing the attributes of the classes. Conventional ZSL methods treat all available attributes equally, which sometimes causes misclassification, because an attribute that is effective for classifying instances of one class is not always effective for another class. In that case, the metric used to classify the latter class can be undesirably influenced by the irrelevant attribute. This paper solves this problem by taking the importance of each attribute for each class into account when calculating the metric. In addition to proposing this method, this paper contributes a dataset for pose classification based on wearable sensors, named HDPoseDS. It contains 22 classes of poses performed by 10 subjects wearing 31 IMU sensors across the full body. To the best of our knowledge, it is the richest wearable-sensor dataset, especially in terms of sensor density, and thus it is suitable for studying zero-shot pose/action recognition. The presented method was evaluated on HDPoseDS and achieved a relative improvement of 5.9% over the best baseline method.


Introduction
Human-action recognition (HAR) has a wide range of applications such as life logging, healthcare, video surveillance, and worker assistance. The recent advances in deep neural networks (DNN) have drastically enhanced the performance of HAR both in terms of recognition accuracy and coverage of the recognized actions [1,2]. DNN-based methods, however, sometimes face difficulty in practical deployment: a system user sometimes wants to change or add target actions to be recognized, but this is not trivial for DNN-based methods since they require a large amount of training data for the new target actions.
Zero-shot learning (ZSL) has great potential to overcome this dependence on training data when recognizing a new target class [3][4][5]. In the normal supervised-learning setting, the set of classes contained in the test data is exactly the same as that in the training data, but this is not the case in ZSL: the test data include "unseen" classes, whose instances are not contained in the training data. In other words, if $Y_{\mathrm{train}}$ is the set of class labels in the training data and $Y_{\mathrm{test}}$ is that in the test data, then $Y_{\mathrm{train}} = Y_{\mathrm{test}}$ in the normal supervised-learning framework, while $Y_{\mathrm{train}} \neq Y_{\mathrm{test}}$ in the ZSL framework (more specifically, $Y_{\mathrm{train}} \cap Y_{\mathrm{test}} = \emptyset$ in some cases, and $Y_{\mathrm{train}} \subset Y_{\mathrm{test}}$ in other cases). Unseen classes are classified using attributes together with a description of each class based on the attributes, which is usually given on the basis of external knowledge. Most typically it is manually given by humans [6,7]. An attribute represents a semantic property of the class. A classifier to judge the presence of the attribute (or the probability of its presence) is learnt using training data. For example, the attribute "striped" can be learnt using the data of "striped shirts", while the attribute "four-legged" can be learnt using the data of "lion". Then an unseen class "zebra" can be recognized, without any training data of zebras, by using these attribute classifiers together with the description that zebras are striped and four-legged.
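The zebra example can be sketched in a few lines. This is only a toy illustration: the attribute scores, the class table, and the function name `classify` are hypothetical, and the L1 nearest-description rule is one simple choice among several.

```python
# Toy attribute-based zero-shot classifier (illustrative values only).
ATTRIBUTES = ["striped", "four_legged"]

# Class descriptions: 1 if the class has the attribute, 0 otherwise.
CLASS_DESCRIPTIONS = {
    "striped_shirt": [1, 0],  # seen class
    "lion": [0, 1],           # seen class
    "zebra": [1, 1],          # unseen class: description only, no training data
}


def classify(attribute_scores):
    """Return the class whose description is closest (L1) to the attribute scores."""
    def l1(description):
        return sum(abs(s - d) for s, d in zip(attribute_scores, description))
    return min(CLASS_DESCRIPTIONS, key=lambda name: l1(CLASS_DESCRIPTIONS[name]))


# Hypothetical outputs of the "striped" and "four-legged" classifiers for a zebra image.
print(classify([0.9, 0.8]))
```

The attribute classifiers are trained only on seen classes, yet the unseen class can still be selected because its description lives in the same attribute space.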
The idea of ZSL has also been applied to human action recognition [8][9][10][11][12]. These studies successfully demonstrated the capability of recognizing unseen actions, but the attributes they used are relatively task-specific and not fundamental enough to recognize a wider variety of human actions. The potential for recognizing a truly wide variety of actions becomes substantially bigger if a more fundamental and general set of attributes is utilized. To this end, we believe the status of each human-body joint is an appropriate attribute, since any kind of human action can be represented using the set of body-joint statuses.
There are sophisticated vision-based methods such as [13][14][15] for estimating the status of body joints, but the problem of occlusion is essentially inevitable for those approaches. Moreover, they are not suitable for applications in which the target person moves around beyond the range of the camera view. Thus we utilize wearable sensors, which are free from the occlusion problem, to estimate the statuses of all the major human body joints. We aim at developing a method that flexibly recognizes a wide range of human actions with ZSL. This study especially focuses on the classification of static actions, or poses, as the first step toward that goal (some of the poses are sometimes referred to as "actions" in prior works, but we use the term "pose" in this study).
The biggest challenge in zero-shot pose recognition is the intra-class variation of the poses. The difficulty of intra-class variation in general action recognition was discussed in [8]. For example, when "folding arm", one person may clench his/her fists while another may not. The authors introduced a method to deal with the intra-class variation by regarding attributes as latent variables. However, it was designed for normal supervised learning, and their implementation in the ZSL scenario was a naive nearest-neighbor-based method that does not address this problem. The intra-class variation becomes an even more severe problem in ZSL, especially when fine-grained attributes like each body joint's status are utilized. This is because the values of all the attributes must be specified in ZSL even though some of the attributes may actually take arbitrary values. It is difficult to uniquely define the status of hands for "folding arm", but the attribute "hands" cannot be omitted because it is necessary for recognizing other poses such as "pointing". Conventional ZSL methods have treated all the attributes equally even though some of them are actually not important for some of the classes. This sometimes causes misclassification because a metric (e.g., likelihood, distance) representing how well a given sample matches a class can be undesirably influenced by irrelevant attributes. This paper solves the problem by taking the importance of each attribute for each class into account when calculating the metric.
The effectiveness of the method is demonstrated on a human pose dataset that we collected, named the Hitachi-DFKI pose dataset, or HDPoseDS for short. HDPoseDS contains 22 classes of poses performed by 10 subjects wearing 31 inertial measurement unit (IMU) sensors across the full body. To the best of our knowledge, this is the richest dataset, especially in terms of sensor density, for human pose classification based on wearable sensors. Owing to its sensor density, it allows us to extract a fundamental set of attributes for human poses, namely the status of body joints. Therefore, it is the first dataset suitable for studying wearable-based zero-shot learning in which a wide variety of full-body poses are involved. We make this dataset publicly available to encourage further research in this direction by the community. It is available at http://projects.dfki.uni-kl.de/zsl/data/.
The main contribution of this study is twofold. (1) We present a simple yet effective method to enhance the performance of ZSL by taking the importance of each attribute for each class into account. We experimentally show the effectiveness of our method in comparison to baseline methods.
(2) We provide HDPoseDS, a rich dataset for human pose classification suitable especially for studying wearable-based zero-shot learning. In addition to these major contributions, we also present a practical design for estimating the status of each body joint; while conventional ZSL methods formulate attribute-detection problem as 2-class classification (whether the attribute is present or not), we estimate it under the scheme of either multiclass classification or regression depending upon the characteristics of each body joint.

Related Work
We review three types of prior works in this section, namely ZSL, wearable-based action and pose recognition, and wearable-based zero-shot action and pose recognition.

Zero-Shot Learning
The idea of ZSL was first presented in [3], followed by [6] and [16]. The major input sources have been images and videos, but there have been some studies based on wearable sensors, as reviewed later in this section. The most fundamental framework established in the early days is as follows. First, a function f : X → A is learnt using labeled training data, where X denotes an input (feature) space and A denotes an attribute space. The definition of each unseen class is given manually, and it is represented as a vector in the attribute space A. Then a function g : A → Y is learnt using the vectors in A and their labels. Here Y denotes a label space. When test data are given, their labels are estimated by applying the learnt functions f and g in succession. In the early days, SVM was frequently used for learning f, and it has been replaced by DNN-based methods these days. One of the most common methods for learning g has been nearest neighbors [8,16,17,18]. This study also uses a nearest-neighbor-based method.
Extensive efforts have been made to improve ZSL methods from various viewpoints. Socher et al. [19] invented a method that does not need manual definition of unseen classes by utilizing natural language corpora (word2vec). Jayaraman and Grauman [20] took the unreliability of attribute estimation into account by a random-forest based method. Semantic representations were effectively enriched by using synonyms in [21], and by using textual descriptions as well as relevant still images in [22]. Qin et al. [23] extended the semantic attributes to latent attributes in order to obtain more discriminative representation as well as more balanced attributes. Tong et al. [24] were the first to introduce generative adversarial network (GAN) [25] in ZSL. A problem of domain shift that is common in ZSL was effectively dealt with in [10] and [12]. Liu et al. [26] studied cross-modal ZSL between tactile data and visual data. Our idea of incorporating each attribute's importance for each class was inspired by [20] as their idea of incorporating attributes' unreliability is similar in terms of dealing with the characteristics of attributes.

Wearable-Based Action and Pose Recognition
As reviewed in [27][28][29], the mainstream action recognition methods before DNN-based methods became popular consisted of a two-stage approach: first, a sliding window is applied to the time-series data and time-domain features such as the mean and standard deviation as well as frequency-domain features such as FFT coefficients are extracted; second, various machine-learning methods are applied, such as the hidden Markov model (HMM) [30], support vector machine (SVM) [31], conditional random field (CRF) [32], and an ensemble method [33].
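The first stage of this classical pipeline can be sketched as follows. This is a minimal illustration under assumptions of our own: the function names, the toy signal, and the window parameters are hypothetical, and a naive discrete Fourier transform stands in for an FFT library so that the example stays dependency-free.

```python
import cmath
import statistics


def sliding_windows(signal, size, step):
    """Yield consecutive windows of `size` samples, advancing by `step` samples."""
    for start in range(0, len(signal) - size + 1, step):
        yield signal[start:start + size]


def window_features(window, n_freq=3):
    """Return [mean, population std, |DFT_1|, ..., |DFT_n_freq|] for one window."""
    n = len(window)
    magnitudes = [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(window)))
        for k in range(1, n_freq + 1)
    ]
    return [statistics.fmean(window), statistics.pstdev(window)] + magnitudes


# Toy one-channel sensor signal: 16 samples of a periodic pattern.
signal = [0.0, 1.0, 0.0, -1.0] * 4
features = [window_features(w) for w in sliding_windows(signal, size=8, step=4)]
print(len(features), features[0])
```

The resulting feature vectors would then be passed to a classifier such as an SVM in the second stage.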
In recent years, DNN-based approaches have become increasingly popular as they have shown overwhelming results [2]. The authors of [34][35][36] introduced a way to employ convolutional neural networks (CNN) to automatically extract efficient features from time-series data. Ordóñez et al. [37] proposed a method to deal more explicitly with the temporal dependencies of human actions by utilizing long short-term memory (LSTM). Hammerla et al. [38] also introduced an LSTM-based method and gave a performance comparison among DNN, CNN, and LSTM as well as the influence of the network parameters in each method. Following the findings from these studies, we also utilize a CNN for estimating the status of each body joint. The details of the implementation will be given in the following section.

Wearable-Based Zero-Shot Action and Pose Recognition
One of the earliest attempts to apply the idea of ZSL to human activity recognition based on wearable sensors is [17] and their subsequent work [9]. They firstly used nearest-neighbor-based approach to recognize activities using attributes, and later enhanced the method to incorporate temporal dependency by using CRF. Wang et al. [39] proposed a nonlinear compatibility based method, where they first project the features extracted from sensor readings to a hidden space by a nonlinear function, and then calculate the compatibility score based on the features in the hidden space and prototypes in semantic space. Al-Naser et al. [40] introduced a ZSL model for recognizing complex activities by using simpler actions and surrounding objects as attributes.
These prior works successfully showed a great potential of realizing zero-shot action recognition based on wearable sensors. However, on one hand the attributes used in those studies are neither fine-grained nor fundamental enough so as to represent truly wide variety of human actions or poses. On the other hand, the methods used in those studies do not take the attributes' importance into account, which matters more especially when using fine-grained attributes to represent diverse poses.

Sensor
Our goal is to use all the major human-body joints as attributes to represent full-body poses. Thus, a very dense sensor set across the full body is required. Perception Neuron from Noitom Ltd. is ideal for this purpose (https://neuronmocap.com/). It has 31 IMU sensors across the full body: 1 on the head, 2 on the shoulders, 2 on the upper arms, 2 on the lower arms, 2 on the hands, 14 on the fingers, 1 on the spine, 1 on the hip, 2 on the upper legs, 2 on the lower legs, and 2 on the feet (Figure 1). Each IMU is composed of a 3-axis accelerometer, a 3-axis gyroscope, and a 3-axis magnetometer. We use 10-dimensional data from each IMU, consisting of 3 acceleration values, 3 gyro values, and 4 quaternion values. This rich set of sensors is especially helpful in applications where detailed full-body pose recognition is desired. For example, in workers' training, novice workers can learn to avoid undesirable poses that can cause safety or quality issues with the help of a pose recognition system.
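As a quick consistency check on the sensor layout, the placement counts listed above can be tallied. The dictionary keys and variable names below are our own labels; the 310-channel total is simply 31 IMUs times 10 values and assumes all IMUs are read at once, which the per-joint networks described later do not require.

```python
# Number of IMUs per body location, as listed in the text.
IMU_PLACEMENT = {
    "head": 1, "shoulders": 2, "upper arms": 2, "lower arms": 2,
    "hands": 2, "fingers": 14, "spine": 1, "hip": 1,
    "upper legs": 2, "lower legs": 2, "feet": 2,
}

# 3 acceleration + 3 gyro + 4 quaternion values are used from each IMU.
CHANNELS_PER_IMU = 3 + 3 + 4

total_imus = sum(IMU_PLACEMENT.values())
total_channels = total_imus * CHANNELS_PER_IMU
print(total_imus, total_channels)
```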

Target Poses
In order to test the generalization capability of zero-shot models, we constructed a human pose dataset named HDPoseDS using Perception Neuron. We built a new dataset because existing wearable-sensor datasets were not collected with sensors attached densely enough to extract a fundamental set of attributes for human poses, namely the status of each body joint. We defined 22 poses such that various body parts are involved and thus the generalization capability of the developed method in the zero-shot scenario can be appropriately tested (see Figure 2 for appearance and Table 1 for names). We had 10 subjects, and each subject performed each of the 22 poses for about 30 s. All of the 10 subjects were male, and they came from 4 different countries. The body heights of the subjects ranged from 160 cm to 185 cm. The ages ranged from 23 to 37 years. To the best of our knowledge, this is the richest dataset, especially in terms of sensor density (31 IMUs across the full body). Therefore, it is the first dataset suitable for studying wearable-based zero-shot learning in which a wide variety of full-body poses are involved. We make this dataset publicly available to encourage further research in this direction by the community. It is available at http://projects.dfki.uni-kl.de/zsl/data/.
During the data collection, only a brief explanation of each pose was given, and thus we observed some intra-class variation in the dataset, as summarized in Table 1.

Proposed Method
Following the standard scheme of ZSL, our approach consists of two stages: attribute estimation based on sensor readings, and class label estimation using the estimated attributes. In this section, we first explain the sensor used in our study, and then describe the two stages in detail.

Attribute Estimation
We use 14 major human-body joints as attributes to represent various poses, as summarized in the first column of Table 2. Unlike conventional ZSL methods, where 2-class classification is always used (whether an attribute is present or not), we use either multiclass classification or regression depending upon the characteristics of each body joint, as shown in Table 2. For the joints that have only one degree of freedom, like knees (or at least whose major movement is restricted to one dimension), it is more suitable and beneficial to use regression to estimate the status. This allows us to represent an intermediate status of the joint by just specifying an intermediate value, which makes it possible to describe the detailed status of the joint, rather than just "straight" and "curl", and thus to represent more complicated poses in the future. For the joints that have more than 2 degrees of freedom, like the head, we use multiclass classification. It is indeed possible to replace this by 2-class classification on each status, but it is more natural to formulate this as a multiclass classification problem since the statuses are mutually exclusive (e.g., if the head is "up", then it cannot be "down" at the same time).
We use a CNN to deal with multivariate time-series data. Previous studies [34,37,38] first dealt with different modalities individually by applying convolution only in the temporal direction (a kernel size of k × 1), and integrated the output from all the modalities in fully connected layers that appear right before the classification layer. In contrast, as shown in Figure 3, we integrate the readings from different sensor modalities at an earlier stage using the CNN (a kernel size of $k_1 \times k_2$), since we experimentally found that this gives better performance. We construct one network per joint, resulting in 14 networks to estimate the status of all the joints. The sliding window size in this study is 60, which corresponds to 1 s. The number of modalities (referred to as M in Figure 3) is s × 10, where s is the number of IMUs used for each joint. We use 4 convolution layers with 25, 20, 15, and 10 channels. One fully connected layer with 100 nodes is inserted before the final classification or regression layer. The activation function is leaky ReLU throughout the network, except that the regression layer has a sigmoid activation to squash the values to [0, 1]. We use cross-entropy loss for multiclass classification, and mean absolute error loss for regression. They are optimized using MomentumSGD with a momentum value of 0.9. Batch normalization and dropout are used for regularization. The kernel size in the convolution layers is 3 × 10 for hands and 3 × 3 for all the rest. We use the wider kernel for hands because the number of IMUs used for classifying the hands' status is significantly larger than that for other joints.
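To get a feel for the tensor sizes implied by this architecture, the following sketch traces the feature-map shape through the four convolution layers. It rests on assumptions that the text does not state: stride 1, no padding, and no pooling. The count of 8 IMUs for a hand is likewise a hypothetical value chosen only for illustration, as are all function names.

```python
WINDOW = 60                  # samples per sliding window (1 s)
CHANNELS = [25, 20, 15, 10]  # output channels of the 4 convolution layers


def conv_output_shape(height, width, kernel, n_layers):
    """Shape after `n_layers` stride-1, unpadded convolutions with a (kh, kw) kernel."""
    kh, kw = kernel
    for _ in range(n_layers):
        height, width = height - kh + 1, width - kw + 1
        if height < 1 or width < 1:
            raise ValueError("input too small for this kernel and depth")
    return height, width


def flattened_features(n_imus, kernel):
    """Number of values fed to the fully connected layer under the assumptions above."""
    modalities = n_imus * 10  # M = s x 10
    h, w = conv_output_shape(WINDOW, modalities, kernel, len(CHANNELS))
    return CHANNELS[-1] * h * w


print(flattened_features(1, (3, 3)))   # a joint observed by a single IMU
print(flattened_features(8, (3, 10)))  # hypothetical hand observed by 8 IMUs
```

Under these assumptions, the wider 3 × 10 kernel shrinks the modality axis much faster per layer, which is one plausible reason for using it on joints with many IMUs.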

Naive Formulation
We use a nearest-neighbor-based method for zero-shot pose classification. The input to pose classification is the output from the attribute estimation explained in Section 4.1. The output dimension of the network for each joint is the number of classes if the joint's status is estimated by multiclass classification, and 1 if it is estimated by regression, resulting in a 33-dimensional vector in total.
Let $a^{(n)} = \{a^{(n)}_1, \cdots, a^{(n)}_D\}$ be the attribute vector of the n-th sample (D = 33 in this study). Please note that $a^{(n)}_d \in [0, 1]$ for all d, since we squash the values by the softmax and sigmoid functions for multiclass classification and regression, respectively. For ZSL, we need extra information for estimating pose labels using vectors in the attribute space. Following some of the previous works [6,7,9,17], we simply use a manually defined table for this, as shown in Table 3, and convert its entries to the corresponding 33-dimensional vectors (a one-of-K representation is used for the multiclass classification part).
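The conversion from a pose definition to an attribute vector can be sketched as below. The joint names and status vocabularies are hypothetical stand-ins (the actual ones are given in Tables 2 and 3), so the resulting vector has 7 dimensions here rather than 33.

```python
# Hypothetical joint specification: a list of statuses marks a multiclass joint,
# None marks a regression joint whose status is a value in [0, 1].
JOINT_SPEC = {
    "head": ["up", "down", "left", "right", "front"],
    "knee_L": None,
    "knee_R": None,
}


def encode_pose(definition):
    """Convert {joint: status} into a flat attribute vector (one-of-K for multiclass)."""
    vector = []
    for joint, statuses in JOINT_SPEC.items():
        value = definition[joint]
        if statuses is None:
            vector.append(float(value))
        else:
            if value not in statuses:
                raise ValueError(f"unknown status {value!r} for joint {joint!r}")
            vector.extend(1.0 if s == value else 0.0 for s in statuses)
    return vector


squatting = encode_pose({"head": "front", "knee_L": 1, "knee_R": 1})
print(squatting)
```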
For the normal nearest-neighbor-based method, given test data x are first converted to an attribute vector a using the learnt attribute estimation networks, and then the distance between a and the i-th training data $v^{(i)}$ is calculated as follows.

$$ d(a, v^{(i)}) = \left( \sum_{d=1}^{D} \left| a_d - v^{(i)}_d \right|^p \right)^{1/p} \tag{1} $$

where D is the dimension of the attribute space, and $x_d$ denotes the d-th element of a vector x. For seen classes, $v^{(i)} \in \mathbb{R}^D$ is a training sample mapped to the attribute space using the learnt attribute estimation networks, while for unseen classes it is a vector created based on the pose definition table (Table 3). p is usually 1 or 2. Then the given data x are classified to the following class.

$$ C\left( \arg\min_{i} \, d(a, v^{(i)}) \right) \tag{2} $$

where C(i) gives the class to which the i-th training data belong.
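The naive classifier can be sketched as follows, assuming the distance is the usual Minkowski distance of order p over all D attributes (consistent with "p is usually 1 or 2"). The function names, labels, and vectors are illustrative only.

```python
def minkowski(a, v, p=1):
    """Order-p Minkowski distance between two attribute vectors of equal length."""
    return sum(abs(x - y) ** p for x, y in zip(a, v)) ** (1.0 / p)


def nearest_class(a, references, p=1):
    """Return the label of the reference vector closest to `a`.

    `references` is a list of (class_label, vector) pairs: mapped training data
    for seen classes and definition-table vectors for unseen classes.
    """
    label, _ = min(references, key=lambda item: minkowski(a, item[1], p))
    return label


references = [
    ("standing", [0.0, 0.0, 0.0]),
    ("squatting", [1.0, 1.0, 0.0]),
]
print(nearest_class([0.9, 0.8, 1.0], references))
```

Note that every attribute contributes equally to the distance here, including those that are irrelevant to a given class.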

Incorporating Attributes' Importance
As summarized in Table 1, there is sometimes intra-class variation in the poses. In the normal supervised-learning setting, this intra-class variation can be naturally learnt as the training data cover various instances of the given class. However, this is not possible in ZSL since there is no training data other than the definition table of the unseen classes. In this situation, it is not appropriate to treat all the attributes equally because not all the attributes are equally important for classifying a particular class. For example, for the "squatting" class, the statuses of the hip joints and knees are important, and their values are expected to always be 1 (bent). On the other hand, the statuses of the elbows are not important since the pose is still "squatting" regardless of the status of the elbows; the values may be 1 (bent in order to hold on to something or to put hands on knees) or 0 (straight down to the floor). In other words, the attribute values of the elbows do not matter when telling whether given test data belong to the squatting class or not. Please note that we cannot simply omit the attribute "elbow" because it is indeed necessary for other poses such as "folding arm". Therefore, we need to design a new distance metric that incorporates each attribute's importance for each class.
We formulate this as follows.

$$ d(a, v^{(i)}) = \frac{1}{W_{ai}} \left( \sum_{d=1}^{D} w_{ai}(d, i) \, w_{rc}(d) \left| a_d - v^{(i)}_d \right|^p \right)^{1/p} + \frac{\lambda}{W_{ai}} \tag{3} $$

$$ W_{ai} = \sum_{j} w_{ai}(j, i) \tag{4} $$

$$ w_{rc}(d) = \begin{cases} 1 & \text{if the } d\text{-th attribute is given by regression} \\ 0.5 & \text{if the } d\text{-th attribute is given by multiclass classification} \end{cases} \tag{5} $$

where $w_{ai}(j, i)$ denotes the importance of joint j for the class C(i). It is given manually in this study, as shown in Table 4. Manually defining the attribute importance is indeed extra work, but it does not require much extra time because the attribute table (Table 3) has to be manually defined anyway, as was the case in many previous works. Nor is it difficult, because it is natural to assume that the person, or the system user, who defines the attribute table (Table 3) has enough knowledge not only of the definition of the target poses but also of which attribute (body-joint status) is important for each pose. It may also be possible to infer the attributes' importance either from training data or from external resources (e.g., word embeddings) rather than manually defining it, but this lies beyond the scope of this study. We use binary values for the attributes' importance for simplicity, but it is easy to extend them to continuous numbers. $w_{ai}(d, i)$ is the importance of attribute d for the class C(i), and its value is copied from $w_{ai}(j, i)$, where the d-th attribute comes from joint j. Please note that it depends not only on d but also on C(i). $w_{rc}(d)$ is 0.5 if the d-th attribute is calculated using multiclass classification because the total distance with regard to the joint j from which the d-th attribute comes, $\sum_{d \in \mathrm{joint}\, j} |a_d - v^{(i)}_d|$, ranges from 0 to 2, whereas the distance with regard to a joint whose status is estimated using regression ranges from 0 to 1. $W_{ai}$ can be interpreted as the total number of "valid" joints to be used for the classification of class C(i). Therefore, the first term on the right-hand side of Equation (3) can be interpreted as the average distance between a and $v^{(i)}$ over the "valid" attributes that come from "valid" joints. The second term is introduced to penalize a class that uses too few attributes. If test data have the same distance to two classes that have different numbers of important attributes, this term encourages classifying the data to the class that has the larger number of important attributes, which indicates a more detailed definition of the pose. We use λ = 0.1 and p = 1 in this study. Note that the increase in computational cost compared to the naive formulation (Equation (1)) is trivial because we just multiply by constant numbers when calculating the distances.

Table 4. Attributes' importance.
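A minimal sketch of the importance-weighted distance is given below. It assumes the following form, which should be read as our interpretation of the description rather than a verified formula: the importance-weighted and $w_{rc}$-scaled sum of attribute differences, divided by the number of valid joints W, plus a penalty λ / W. The joint names, the "crouching" class, and all values are hypothetical.

```python
def importance_weighted_distance(a, v, joint_of_attr, joint_importance,
                                 is_regression, lam=0.1, p=1):
    """Distance between attribute vector `a` and class vector `v` for one class.

    joint_of_attr[d]   : name of the joint that attribute d comes from
    joint_importance[j]: 0 or 1, importance of joint j for this class
    is_regression[j]   : True if joint j is estimated by regression
    The penalty term lam / W is an assumed form (see the lead-in).
    """
    n_valid = sum(joint_importance.values())  # W: number of "valid" joints
    if n_valid == 0:
        raise ValueError("a class needs at least one important joint")
    total = 0.0
    for a_d, v_d, joint in zip(a, v, joint_of_attr):
        w_ai = joint_importance[joint]
        w_rc = 1.0 if is_regression[joint] else 0.5
        total += w_ai * w_rc * abs(a_d - v_d) ** p
    return total ** (1.0 / p) / n_valid + lam / n_valid


JOINT_OF_ATTR = ["hip", "knee", "elbow"]
IS_REGRESSION = {"hip": True, "knee": True, "elbow": True}
CLASSES = {
    # class: (definition vector, per-joint importance)
    "squatting": ([1.0, 1.0, 0.0], {"hip": 1, "knee": 1, "elbow": 0}),
    "crouching": ([1.0, 0.5, 1.0], {"hip": 1, "knee": 1, "elbow": 1}),
}

sample = [0.9, 0.9, 1.0]  # squatting with bent elbows
dist = {
    name: importance_weighted_distance(sample, v, JOINT_OF_ATTR, imp, IS_REGRESSION)
    for name, (v, imp) in CLASSES.items()
}
best = min(dist, key=dist.get)
print(best, dist)
```

In this toy case an unweighted L1 distance would prefer "crouching" (0.5 versus 1.2) because of the elbow mismatch, whereas ignoring the elbow for "squatting" yields the intended label.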

Evaluation Scheme
We use HDPoseDS for evaluation. The evaluation procedure is as follows.
(1) All the input data are converted to attribute vectors using the neural networks explained in Section 4.1. The sliding window size is 60, which corresponds to 1 s, and it is shifted by 30 (0.5 s). This results in roughly 590 (= (30/0.5 − 1) × 10) attribute vectors per pose, since HDPoseDS contains data from 10 subjects and each subject performed each pose for roughly 30 s.
(2) For each class c, we construct a set of training data by combining the data from all the classes other than c with the pose definition of c based on attributes (Table 3). We use class c's data as test data.
(3) The labels of the test data are estimated using the method explained in Section 4.2.
(4) We repeat this for all the 22 classes.
(5) We calculate the F-measure for each class based on the precision and recall.
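The window count in step (1) and the F-measure in step (5) can be checked with a short sketch. The 60 Hz sampling rate follows from a 60-sample window corresponding to 1 s; the function names are our own, and the F-measure is taken to be the standard harmonic mean of precision and recall.

```python
def n_windows(n_samples, size, step):
    """Number of full sliding windows that fit into a recording."""
    if n_samples < size:
        return 0
    return (n_samples - size) // step + 1


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


RATE_HZ = 60                                     # 60 samples correspond to 1 s
per_subject = n_windows(30 * RATE_HZ, size=60, step=30)
per_pose = per_subject * 10                      # 10 subjects
print(per_subject, per_pose)
```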
Please note that we do not assume that the possible output classes are only unseen (test) classes; we assume that the seen (training) classes are also potential output classes during testing. Since we do not use instances of seen (training) classes in testing, this evaluation scheme is not exactly the same as the generalized zero-shot learning (G-ZSL) [41,42], in which the instances of seen classes are also used in testing. It is, however, more similar to G-ZSL than normal ZSL in a sense that the target classes in testing include not only unseen (test) classes but also seen (training) classes.
In addition to this, we also investigate how the proposed method works in a few-shot learning scenario, where only a small number of training data are available. The evaluation procedure is the same as in the ZSL case except for step (2): instead of including the attribute definition of c in the training data, we include k samples from class c's data in the k-shot learning scenario, and all the other data of class c are used for testing. To choose the k samples, we first randomly permute class c's data. Then we use the l-th ($l = 1, 2, \ldots, \lfloor N_c / k \rfloor$) group of k samples for training and the remaining ($N_c − k$) samples for testing, where $N_c$ is the number of class c's data. Then we proceed to step (3). The estimation results for class c are averaged, and then we proceed to step (4).
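The k-shot split can be sketched as follows. The function name and the fixed seed are our own choices, and the number of splits is taken to be the integer part of N_c / k, which is our reading of the index range.

```python
import random


def k_shot_splits(samples, k, seed=0):
    """Yield (train, test) pairs: the l-th block of k shuffled samples vs. the rest."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # random permutation of class c's data
    for l in range(len(shuffled) // k):
        train = shuffled[l * k:(l + 1) * k]
        test = shuffled[:l * k] + shuffled[(l + 1) * k:]
        yield train, test


splits = list(k_shot_splits(range(10), k=3))
for train, test in splits:
    print(len(train), len(test))
```

The per-split results for class c would then be averaged before moving on to the next class.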
The performance of the proposed method is compared with three baseline methods. The first one is one of the most frequently used methods in ZSL studies, called "direct attribute prediction (DAP)", introduced in [7]. Please note that we did not compare with indirect attribute prediction (IAP), which is also introduced in [7]. This is because, as the authors of [7] stated, IAP is not appropriate for the case where training classes are also potential output classes during testing. The other two baseline methods are nearest-neighbor-based, which is also common in ZSL studies. The proposed method is also based on a nearest-neighbor method. The first nearest-neighbor-based baseline is a naive nearest-neighbor-based method, in which the distance between samples is calculated using Equation (1) with the weights $w_{rc}(d)$. The second nearest-neighbor-based baseline is the one that uses random attributes' importance. We randomly generate either 0 or 1 for each $w_{ai}(j, i)$ in Equation (4). For this baseline method, we test 1000 times using different random weights and report the average F-measure of the 1000 tests. For both the proposed and the baseline methods, we use a prototype representation (mean vector) of each class, introduced in [43], rather than all the training data themselves, in order to deal with the severe imbalance in the number of training samples. We tested all the methods using a single desktop PC with an Intel Core i7-8700K CPU and an NVIDIA GeForce GTX 1080 GPU.
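The prototype representation can be sketched as a per-class mean of the attribute vectors. The function name and the toy data are illustrative; for an unseen class, the single vector from the pose definition table would serve as its prototype.

```python
def class_prototypes(vectors, labels):
    """Return {label: mean attribute vector} over all samples of each class."""
    sums, counts = {}, {}
    for vec, label in zip(vectors, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + x for s, x in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


vectors = [[1.0, 0.0], [0.5, 0.0], [0.0, 1.0]]
labels = ["a", "a", "b"]
prototypes = class_prototypes(vectors, labels)
print(prototypes)
```

With one prototype per class, a seen class with hundreds of samples and an unseen class with a single definition vector each contribute exactly one reference point to the nearest-neighbor search.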

Results and Discussion
The result of ZSL is summarized in Table 5. The details are given in Appendix A. As shown in the table, our proposed method outperformed all the baseline methods in average F-measure. In addition, the proposed method could run at about 20 Hz, which is near real-time. The comparison with the naive nearest-neighbor-based method (without attributes' importance) shows the effectiveness of the attributes' importance. The performance of the baseline that uses random attributes' importance shows that the attributes' importance should be carefully designed. In other words, our method enables users to incorporate appropriate domain knowledge on the target classes so that the performance of the model is enhanced. Compared to DAP [7], the proposed method showed more stable performance on different poses.
The improvement compared to the best baseline (nearest neighbor without attributes' importance) was 4.55 points, which corresponds to a 5.91% relative improvement. In addition, the proposed method achieved higher scores in the majority of the poses compared to this baseline. An especially big improvement was observed in the "Stretching calf(L)" and "Stretching calf(R)" poses. This was because a non-negligible number of subjects faced down when performing these poses though they were supposed to face forward according to the definition of the pose given in Table 3. Our method could successfully deal with this intra-class variation simply by ignoring the status of the head and focusing more on the other important attributes.
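The relation between the absolute and the relative improvement can be written out as a rough consistency check. This is only a sketch: it assumes that the F-measures are reported on a 0 to 100 scale and that both quoted figures are rounded, and the symbols $F_{\mathrm{base}}$ and $F_{\mathrm{prop}}$ are introduced here purely for illustration.

```latex
\frac{F_{\mathrm{prop}} - F_{\mathrm{base}}}{F_{\mathrm{base}}} = 0.0591,
\qquad
F_{\mathrm{prop}} - F_{\mathrm{base}} = 4.55
\;\Longrightarrow\;
F_{\mathrm{base}} \approx \frac{4.55}{0.0591} \approx 77.0,
\qquad
F_{\mathrm{prop}} \approx 77.0 + 4.55 \approx 81.5
```

These implied averages should be compared against the values actually listed in Table 5.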
On the other hand, there are some poses whose F-measure dropped when the attributes' importance was incorporated. Among those, the biggest drop was observed in the pose "Folding arm". This was caused by the low estimation accuracy of the important attributes for the folding-arm pose; the statuses of the shoulders in the folding-arm pose were sometimes estimated as "front" while they had to be "down", probably because the arms were slightly pulled forward to make space for the hands at the underarms. Incorporating attributes' importance means focusing more on the important attributes for each pose and ignoring the other attributes. Therefore, when the attribute estimation accuracy is not good for those important attributes, the pose classification relies too much on unreliable attributes. This problem may be addressed by integrating the attributes' unreliability that was introduced in [20].

Table 5. The evaluation result (F-measure) of ZSL. Abbreviations are as follows. DAP: direct attribute prediction, NN: nearest neighbor, AI: attributes' importance. The bold numbers represent the best score or the one close to the best (the difference is less than 0.01) for each pose.

The result of few-shot learning is shown in Figure 4. Here we only compared the proposed method with the best-performing baseline, which is the nearest-neighbor-based method without attributes' importance. It shows that incorporating attributes' importance consistently improves the performance in the few-shot learning scenario as well. The improvement is especially large when the number of shots (training data) is small, and the impact of attributes' importance becomes smaller as the number of available training data increases. This is because the intra-class variation is reflected more in the training data as the number of available training data increases, and the classifier can naturally learn which attributes are actually important.
Another interesting observation is that the F-measure in the ZSL scenario was better than that in the one-shot learning scenario, both with and without attributes' importance. This implies that in a situation where only an extremely limited number of training data is available, human knowledge (the pose definition table) can give a better compromise than just relying on the small number of training data.

Conclusions
This paper has presented a simple yet effective method for improving the performance of ZSL. In contrast to the conventional ZSL methods, the proposed method takes the importance of each attribute for each class into account, which becomes more critical when using a set of fine-grained attributes in order to represent a wide variety of human poses and actions. The experimental results on our dataset HDPoseDS have shown that the proposed method is effective not only in the ZSL scenario but also in the few-shot learning scenario. The results as well as the provided dataset are expected to promote further research toward the practical development of human-action-recognition technology under the condition of limited training data.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Detailed Evaluation Results
We show the confusion matrices of the 4 methods mentioned in Section 5.

Table A1. The confusion matrix of DAP [7]. The numbers in the first row and the first column correspond to the pose IDs shown in Table 1. T denotes the total number in rows and columns. P and R denote precision and recall, respectively. The number in the bottom right is the accuracy (sum of diagonal elements divided by the total number). Please note that this is different from the F-measure.

Table A2. The confusion matrix of the nearest-neighbor-based baseline without attributes' importance.

Table A3. The confusion matrix of the nearest-neighbor-based baseline with random attributes' importance. The numbers are the average of 1000 runs. The precision and recall were calculated based on these averaged numbers.

Table A4. The confusion matrix of the proposed method.