Augmenting Activity Recognition with Commonsense Knowledge and Satellite Imagery

Activity recognition has gained relevance because of its applications in a variety of fields. Despite significant improvements, classifiers are still inaccurate in several real-world circumstances or require excessively time-consuming training routines. In this paper we show how satellite imagery and commonsense knowledge can be used to improve user activity recognition performed on a mobile device. More specifically, we make use of a personal device that provides a list of candidate user activities instead of only the most probable one. Then, from the GPS location of the user, we (i) extract a list of neighboring commercial activities using a reverse geocoding service and (ii) classify the satellite imagery of the area with state-of-the-art techniques. The proposed approach uses the ConceptNet network to rank the list of candidate activities using both sources of additional information. Results show an improvement in activity recognition accuracy.


Introduction
Pervasive systems, constantly analyzing different facets of our world, frequently cooperate to provide services with a coherent representation of the environment. However, although many facets of our life are tightly coupled in practice (e.g., a user who is running is likely to be in a suitable location such as a park or a gym), it is difficult to exploit their correlation using traditional learning techniques (e.g., bagging, boosting) [1]. On the other hand, treating each facet as an independent variable might lead to unrealistic results. For instance, locations and activities are strictly correlated.
In this paper we tackle the problem of enabling situation-recognition capabilities by fusing different sensor contributions. Specifically, we propose to extract well-known correlations among different facets of everyday life from a commonsense knowledge base. The approach is general and can be applied to a number of cases involving commonsense for the sake of: (i) ranking classification labels produced by different classifiers on a commonsense basis (e.g., the action classifier detects that the user is running with a high confidence and the place classifier outputs two possible labels: "park" and "swimming pool"; in this case, using commonsense, it is possible to infer that the user is more likely to be in a park than in a swimming pool); (ii) predicting missing labels (e.g., if a user is running but the location data is missing, it is possible to propose "park" as a likely location). In more detail, the paper contains three main contributions: (i) it describes a greedy search algorithm to measure the semantic proximity of two concepts within the ConceptNet network [2]; (ii) it proposes a novel technique to extract contextual localization data from satellite imagery; and (iii) it shows how to improve activity recognition accuracy by making use of two different localization sensors and commonsense knowledge.
Accordingly, the rest of the paper is organized as follows: Section II formally defines the problem of commonsense sensor fusion and describes the proposed algorithm. Section III describes the experimental testbed we implemented to validate our proposal. Section IV details experimental results under different configurations. Section V discusses related work. Finally, Section VI concludes the paper.

Sensor Fusion with Commonsense Knowledge
The proposed approach is based on the assumption that commonsense knowledge can be used to measure the semantic proximity among concepts. The more proximate two concepts are, the more likely it is that they have been recognized within the same context [3]. In this section we formally introduce the approach.

A. Problem Definition
Let us consider a set of $n$ classifiers $C_1, \ldots, C_n$, each one delegated to recognize a specific facet of the environment. Each classifier $C_x$ is able to deal with uncertainties by producing (at every time step $t$) $m$ labels $l_1(C_x, t), \ldots, l_m(C_x, t)$ for each data sample. Given that, the overall perception of the environment can be represented as a tuple $((l_1(C_1, t), \ldots, l_m(C_1, t)), \ldots, (l_1(C_n, t), \ldots, l_m(C_n, t)))$.
In this paper, we tackle the problem of ranking all the possible tuples provided by n classifiers on a commonsense basis.
The general problem of commonsense tuple ranking can be expressed, without loss of generality, in this way: given two tuples, both composed of commonsense concepts, $(l_1(C_1, t), l_1(C_2, t))$ and $(l_2(C_1, t), l_2(C_2, t))$, is it possible to establish which tuple contains the most proximate concepts on a commonsense basis?
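The ranking problem stated above can be illustrated with a minimal sketch. It assumes a `proximity(a, b)` function is already available; here it is a hypothetical lookup table with made-up scores, not values taken from ConceptNet. The function name `rank_tuples` and the choice of summing pairwise proximities are also assumptions made for illustration.

```python
from itertools import product

def rank_tuples(label_sets, proximity):
    """Rank every combination of one label per classifier.

    label_sets: one list of candidate labels per classifier.
    proximity:  callable returning a commonsense proximity score for two
                concepts (assumed to be given; higher means closer).
    """
    scored = []
    for combo in product(*label_sets):
        # Score a tuple by summing the proximity of every pair of labels in it.
        score = sum(
            proximity(a, b)
            for i, a in enumerate(combo)
            for b in combo[i + 1:]
        )
        scored.append((score, combo))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [combo for _, combo in scored]

# Hypothetical proximity scores, for illustration only.
toy_scores = {("run", "park"): 60.0, ("run", "swimming pool"): 5.0}

def toy_proximity(a, b):
    return toy_scores.get((a, b), 0.0)

ranking = rank_tuples([["run"], ["park", "swimming pool"]], toy_proximity)
```

With these toy scores the tuple ("run", "park") is ranked first, mirroring the running example of the Introduction.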
Measuring commonsense proximity requires two key conditions to be met: (i) a knowledge base containing both a vocabulary covering a wide scope of topics and semantic relations that are hard to discover automatically; and (ii) an algorithm for computing semantic proximity.
The first condition is best addressed by ConceptNet, a semantic network designed for commonsense contextual reasoning. It was automatically built from a collection of 700,000 sentences, a corpus resulting from the collaboration of some 14,000 people. It provides commonsense contextual associations not offered by any other knowledge base. ConceptNet is organized as a massive directed and labeled graph. It is made of about 300,000 nodes and 1.6 million edges, corresponding to words or phrases and to the relations between them, respectively. Most nodes represent common actions or chores given as phrases (e.g., "drive a car" or "buy food"). Its structure is uneven, with a group of highly connected nodes; "person" is the most connected, with an in-degree of about 30,000 and an out-degree of over 50,000. There are over 86,000 leaf nodes and approximately 25,000 root nodes. The average degree of the network is approximately 4.7.
To meet the second requirement, we started from a preliminary round of experiments with ConceptNet that led us to the following principles:
1) Proximity increases with the number of unique paths. However, this is not a reliable indicator on its own, given that even completely unrelated concepts might be connected through long paths or highly connected nodes.
2) Proximity decreases with the length of the shortest path; nodes connected directly or through a few niche edges are at a short distance, hence they are proximate.
3) Connections going through highly connected nodes increase ambiguity; therefore, proximity should be inversely proportional to the degrees of visited nodes.
4) ConceptNet has been created from natural-language assertions. Thus, errors are frequent and algorithms have to be noise-tolerant.
Majewski et al. recently proposed an interesting algorithm for commonsense text categorization inspired by similar observations [9]. Despite having been conceived for a different problem, it can be applied to localization as well. The algorithm is based on the assumption that the proximity among concepts is proportional to the amount of some substance s that reaches the destination node v as a result of an injection into node u. The procedure is built around two key biological paradigms, diffusion and evaporation, and works as follows:
1) A given amount of substance s is injected into a node u.
2) At every node, a fraction of the substance evaporates and leaves the node.
3) At every node, the substance diffuses into smaller flows according to the out-degree of the node.
4) Nodes never overflow. If multiple paths visit the same node, the previous amount of substance s can be incremented.
5) Target nodes are ranked according to the amount of substance s received.
Figure 1 exemplifies the algorithm in action. A certain amount (i.e., 256 units) of substance s is injected into a node (i.e., Run). Then, the substance diffuses over the graph and halves by evaporation at each node it visits. The amounts of s that reach nodes Park and Road are 60 and 16, respectively. Park is therefore considered more proximate than Road to Run.
It is worth noting that this approach can easily handle the fact that different classifiers might produce the same set of labels (i.e., classifiers observing the same facets of reality). In fact, if a label appears multiple times, it is sufficient to multiply the amount of substance injected into the corresponding nodes. Furthermore, this approach makes it straightforward to assign different weights to different classifiers.
Finally, it is interesting to note how this algorithm matches the principles we deduced from our preliminary studies on ConceptNet. In fact: (i) the evaporation process ensures that short paths imply high proximity; while (ii) the diffusion process takes into account the total number of connections between two concepts while diminishing the relevance of paths through highly connected nodes.
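The diffusion and evaporation procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of [9]: the graph is a plain dictionary of successors, the substance is split equally among outgoing edges, half of it evaporates at every node, and propagation stops after a fixed number of hops (`max_hops` is a hypothetical parameter). The toy graph is invented and is not the graph of Figure 1.

```python
from collections import defaultdict

def diffuse(graph, source, amount=256.0, evaporation=0.5, max_hops=4):
    """Spread `amount` of substance from `source` and return what each node receives.

    graph: dict mapping a node to the list of its successors (directed edges).
    """
    received = defaultdict(float)
    frontier = {source: amount}
    for _ in range(max_hops):
        next_frontier = defaultdict(float)
        for node, substance in frontier.items():
            successors = graph.get(node, [])
            if not successors:
                continue
            # Evaporation removes a fraction; diffusion splits the rest
            # equally among the outgoing edges of the node.
            share = substance * (1.0 - evaporation) / len(successors)
            for succ in successors:
                received[succ] += share       # amounts from several paths add up
                next_frontier[succ] += share
        frontier = next_frontier
    return dict(received)

# Toy graph, for illustration only.
toy_graph = {
    "run": ["park", "exercise"],
    "exercise": ["park", "road"],
}
amounts = diffuse(toy_graph, "run")
closest = max(("park", "road"), key=lambda node: amounts.get(node, 0.0))
```

In this toy graph, Park receives substance both directly and through an intermediate node, so it accumulates more than Road. Injecting a larger `amount` for a given source is one simple way to weight a classifier or a repeated label.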
In the following, we apply the described technique to fuse information contributions coming from sensors analyzing different facets of the same situation.

Improving Activity Recognition
To assess the relevance of our ideas, we used a specific instance of the general problem. We prototyped a system able to improve activity recognition accuracy by making use of two different localization sensors. Activities are classified from accelerometer data, while locations are classified from GPS traces. All three modules have been configured to produce multiple labels when needed to deal with uncertainties. In these cases, commonsense reasoning is applied.

A. Activity Recognition
To classify the user's activities we implemented a sensor based on [4]. It collects data from 3-axis accelerometers, sampling at 10 Hz, positioned in 3 body locations (i.e., wrist, hip, ankle) and classifies activities (i.e., dance, use stairs, drive, walk, run, stand still, drink) using instance-based algorithms. Furthermore, considering that human activities have a minimum duration, it aggregates classification results over a sliding window and performs majority voting on that window. Each window is associated with the most frequent label. For the sake of the experimentation, we modified it to deal with uncertainties. Instead of producing a single label for each sensor sample, we implemented a mechanism to produce multiple labels, each associated with a degree of confidence. Specifically, for each sample to be classified, the k nearest neighbours (associated with q classes, k = 64, q <= k) are identified. The sample is then associated with all the classes (at most 3) that are associated with at least k/2q training samples. Table 1 reports a realistic confusion matrix for this sensor.
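The multi-label mechanism and the sliding-window vote can be sketched as below. This is a minimal illustration, assuming the threshold is read as k/(2q) and assuming the degree of confidence is the fraction of neighbours voting for a class; neither detail is confirmed by the text. The function names are hypothetical.

```python
from collections import Counter

def candidate_labels(neighbour_labels, max_labels=3):
    """Return up to `max_labels` (label, confidence) pairs for one sample.

    neighbour_labels: class labels of the k nearest training samples.
    """
    k = len(neighbour_labels)
    counts = Counter(neighbour_labels)
    q = len(counts)                    # number of distinct classes among the neighbours
    threshold = k / (2 * q)            # assumed reading of "k/2q"
    kept = [
        (label, votes / k)             # confidence as vote fraction (an assumption)
        for label, votes in counts.most_common()
        if votes >= threshold
    ]
    return kept[:max_labels]

def majority_vote(window_labels):
    """Label a sliding window with its most frequent per-sample label."""
    return Counter(window_labels).most_common(1)[0][0]

# Example with k = 64 neighbours spread over q = 3 classes.
neighbours = ["run"] * 40 + ["walk"] * 20 + ["dance"] * 4
candidates = candidate_labels(neighbours)
window_label = majority_vote(["run", "run", "walk", "run"])
```

Here the threshold is 64/6, so "dance" (4 votes) is discarded while "run" and "walk" are both kept as candidates for the commonsense ranking step.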

produce numerous false positives. To reduce them, while keeping an acceptable level of false negatives, we implemented 3 filters acting on the GPS signal. The first acts on the assumption that each class is more likely to be visited during defined portions of the week. The second acts on the assumption that each class of locations is fairly well characterized by the duration of the visit. This duration is usually related to a GPS signal interruption. Finally, the third one filters out each label that is not compatible with the measured speed.
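The three filters can be sketched as a single pass over the candidate labels. This is only an illustration: the profile structure, the field names and every numeric value below are hypothetical, since the text does not report the actual schedules, durations or speed limits used.

```python
def filter_labels(labels, profiles, weekday, hour, visit_minutes, speed_kmh):
    """Keep only the location labels compatible with time, duration and speed.

    profiles: dict mapping a label to a hypothetical profile with the keys
              "weekdays", "hours", "visit_minutes" (min, max) and "max_speed_kmh".
    """
    kept = []
    for label in labels:
        profile = profiles[label]
        # Filter 1: the class must be plausible in this portion of the week.
        in_schedule = weekday in profile["weekdays"] and hour in profile["hours"]
        # Filter 2: the visit duration must fit the class.
        shortest, longest = profile["visit_minutes"]
        plausible_duration = shortest <= visit_minutes <= longest
        # Filter 3: the measured speed must be compatible with the class.
        plausible_speed = speed_kmh <= profile["max_speed_kmh"]
        if in_schedule and plausible_duration and plausible_speed:
            kept.append(label)
    return kept

# Hypothetical profiles, for illustration only.
profiles = {
    "cinema": {"weekdays": range(7), "hours": range(15, 24),
               "visit_minutes": (60, 240), "max_speed_kmh": 6},
    "road": {"weekdays": range(7), "hours": range(24),
             "visit_minutes": (0, 600), "max_speed_kmh": 130},
}
remaining = filter_labels(["cinema", "road"], profiles,
                          weekday=2, hour=9, visit_minutes=5, speed_kmh=40)
```

In this example "cinema" is discarded by all three checks, while "road" survives.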

Experimental Evaluation
To assess the feasibility of our idea, we used the system described in Section III to collect a dataset comprising a full day of a single user.
The activity recognition module has been trained to classify 8 activities (i.e., climb, use stairs, drive, walk, read, run, use computer, stand still, drink). For each class, 300 training samples have been selected. The location module implementing reverse geocoding, instead, sampled GPS coordinates every 30 seconds. GPS coordinates have been labeled with 5 different categories (i.e., street, university, bar, park and library).
We first discuss the performance of the recognition modules, considered independently. Figure 3(a)(b)(c) summarizes the results. The reverse geocoding localization sensor is the least precise of the three. It correctly recognizes only 20% of locations because multiple commercial activities are usually located within its search radius. On the other hand, the satellite-based sensor correctly classifies around 80% of the samples. Finally, the activity recognition module correctly classifies around 65% of the samples.
When both location and activity labels are combined using ConceptNet, 4 cases can occur: (i) both are available, (ii) only activity is available, (iii) only location is available, (iv) no data is available. The first case allows applying commonsense sensor fusion. In both the second and the third case, instead, commonsense can be used to identify a possible place or activity to complete the (activity, place) tuple. Figure 3(d)(e) shows the results obtained by combining activity labels with the labels of each location sensor. A significant improvement has been achieved. It is worth noting that the Undefined (i.e., multiple labels available) category is lowered to zero, meaning that ConceptNet is always capable of providing a ranking of action-place couples. Furthermore, the No Classification data category is lowered to zero; in fact, one of the advantages of using ConceptNet is that it can provide missing data. Note that in our experiment we never experienced the concurrent lack of both kinds of sensor data, which would have called for different strategies similar to activity and location prediction, such as Bayesian networks [7]. It is also worth noting that the fusion process with the satellite-based sensor produced better results because its initial performance was better than that of the reverse geocoding one.
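The four cases can be sketched as a small dispatch function. This is a hedged illustration, assuming a `proximity` function is given and assuming that a missing label is predicted by searching the whole vocabulary of the missing facet; the function name `fuse` and the toy scores are hypothetical.

```python
def fuse(activities, locations, proximity, known_activities, known_locations):
    """Return the most proximate (activity, location) pair, or None.

    activities, locations: candidate labels from the sensors (possibly empty).
    known_activities, known_locations: full vocabularies, used to propose a
    label when one facet is missing.
    """
    if not activities and not locations:
        return None                                  # case (iv): nothing to fuse
    activity_pool = activities or known_activities   # case (iii): propose an activity
    location_pool = locations or known_locations     # case (ii): propose a location
    pairs = ((a, l) for a in activity_pool for l in location_pool)
    return max(pairs, key=lambda pair: proximity(*pair))   # case (i) when both given

# Hypothetical proximity scores, for illustration only.
toy_scores = {
    ("run", "park"): 80.0, ("run", "library"): 2.0,
    ("read", "library"): 70.0, ("read", "park"): 10.0,
}

def toy_proximity(activity, location):
    return toy_scores.get((activity, location), 0.0)

all_activities = ["run", "read"]
all_locations = ["park", "library"]
both = fuse(["run", "read"], ["library"], toy_proximity, all_activities, all_locations)
only_activity = fuse(["run"], [], toy_proximity, all_activities, all_locations)
only_location = fuse([], ["library"], toy_proximity, all_activities, all_locations)
nothing = fuse([], [], toy_proximity, all_activities, all_locations)
```

Returning `None` in the last case leaves room for a different strategy, as the text suggests for the concurrent lack of both inputs.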

B. Location Recognition via Satellite Imagery
The location sensor based on satellite images takes GPS data as input and classifies the user's location by making use of satellite imagery. Specifically, given the GPS coordinates, it uses the Google Maps API to retrieve the corresponding image tile. Then, it classifies the tile against a set of 5 categories (i.e., green, harbour, parking, rail, residential) (Figure 2). To implement this sensor we used an approach based on the bag-of-features image classification technique [5]. In computer vision, the bag-of-words model (BoW model) can be applied to image classification by treating image features as words. In document classification, a bag of words is a sparse vector of occurrence counts of words; that is, a sparse histogram over the vocabulary. In computer vision, a bag of visual words is a vector of occurrence counts of a vocabulary of local image features.
A dataset comprising 200 tiles, evenly distributed among the 5 categories, has been collected from Google Maps and manually annotated. SURF features, chosen because of their robustness to scaling and rotation, have been extracted [5].
SURF features have been organized in bags of words representing each of the categories and used to train separate one-class SVM classifiers. During the testing stage, instead, the tile covering the user location is downloaded and tested against each classifier. The tile is assigned to the category associated with the classifier with the highest likelihood. Table 2 shows the confusion matrix. All the classes are recognized with an accuracy between 78% and 95%.
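The bag-of-visual-words step can be sketched in plain Python. This is a minimal illustration under stated assumptions: descriptors are short numeric tuples standing in for SURF descriptors, the vocabulary is given rather than learned, and each per-category classifier is replaced by a hypothetical scoring callable instead of a trained one-class SVM.

```python
def nearest_word(descriptor, vocabulary):
    """Index of the visual word closest to the descriptor (squared Euclidean distance)."""
    def squared_distance(word):
        return sum((d - w) ** 2 for d, w in zip(descriptor, word))
    return min(range(len(vocabulary)), key=lambda i: squared_distance(vocabulary[i]))

def bow_histogram(descriptors, vocabulary):
    """Normalized histogram of visual-word occurrences for one image tile."""
    counts = [0] * len(vocabulary)
    for descriptor in descriptors:
        counts[nearest_word(descriptor, vocabulary)] += 1
    total = sum(counts) or 1          # avoid dividing by zero for an empty tile
    return [count / total for count in counts]

def classify_tile(histogram, scorers):
    """Assign the tile to the category whose scorer returns the highest value."""
    return max(scorers, key=lambda category: scorers[category](histogram))

# Toy 2-D descriptors and a two-word vocabulary, for illustration only.
vocabulary = [(0.0, 0.0), (1.0, 1.0)]
descriptors = [(0.1, 0.0), (0.9, 1.1), (1.0, 0.8), (0.8, 0.9)]
histogram = bow_histogram(descriptors, vocabulary)

# Hypothetical stand-ins for the per-category classifiers.
scorers = {
    "green": lambda h: h[0],
    "residential": lambda h: h[1],
}
category = classify_tile(histogram, scorers)
```

Keeping one scorer per category mirrors the one-classifier-per-category design described above: the tile is tested against each of them and the best score wins.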

C. Location Recognition via Reverse Geocoding
The location sensor based on reverse geocoding samples GPS coordinates and classifies the user's location by querying the reverse geocoding Google Maps API [6]. Specifically, this API takes as input the GPS coordinates and a search radius and returns a list of points of interest, each associated with a label coming from a predefined set (i.e., road, square, park, shop, cinema, mall, restaurant, gym). Unfortunately, several practical drawbacks affect this process. The Google Maps database, for example, is not perfect. Although we do not have accurate statistics, we noticed that a portion of locations is still missing. Furthermore, locations' coordinates are not always precise. Finally, Google Maps does not provide information about locations' geometry. Because of this, especially for large-sized instances (e.g., parks, squares), locations can be misclassified. For example, a user running close to the border of a park is likely to be associated with the shops she is facing instead of with the park itself.
To mitigate these problems and avoid false negatives, the system has been set up to use a search radius of 250 m. Clearly, the number of reverse geocoded locations grows with the search radius. Because of this, especially in densely populated areas, the system might

Related Work
Many works focus on data fusion at different levels, either for acquiring and making accessible diverse contextual aspects or for reasoning about them. The traditional approach makes use of probabilistic models. Proact [8] combines data coming from RFIDs and an accelerometer mounted on the RFID glove in order to identify activities. RFID tags are used to restrict the number of possible actions by considering the manipulated object. In [9], a system for multi-modal sensor fusion specifically designed for smartphones is proposed. The system exploits data coming from the microphone and inertial sensors on the mobile device to infer high-level activities with lightweight Bayesian learning algorithms.
Few works make use of commonsense for situation recognition. An interesting approach is presented in [10]. It uses RFID to trace a set of everyday objects and infers user activities by making use of Google searches. Alternatively, [11] applies commonsense to localization. It uses Cyc to improve automatic place identification on the basis of user historical data. However, both these approaches limit the use of commonsense to improving a single contextual aspect. In this paper, instead, we use commonsense to integrate multiple aspects.
To the best of our knowledge, there are few works using commonsense to integrate different context sources. Pentland et al. [12] presented a user-centric situation recognition system able to overhear users' conversations and use ConceptNet as a reasoning system. Bicocchi et al. [13], instead, presented a workflow to classify a situation using a stream of images collected from ego-vision devices. Images are independently classified using k-NN search and then combined together using commonsense. However, neither of them makes use of commonsense knowledge to fuse different contributions.

Conclusion
In this paper we presented the preliminary results obtained with a novel approach that combines an activity classifier and a location classifier using satellite imagery with the ConceptNet knowledge base. Different classifiers are fused together on a commonsense basis for both (i) improving classification accuracy and (ii) dealing with missing labels. The approach has been discussed through a realistic case study focused on the recognition of both the locations visited and the activities performed by a user. Results are encouraging and indicate that our approach could be applied to different scenarios.

Figure 2: Four snapshots taken from the location sensor using satellite imagery. Residential and harbour areas are correctly classified. The red strings superimposed on the map tiles are the actual classification labels produced by the sensor.
Figure 3: The reverse geocoding sensor correctly recognizes only 20% of locations (a). On the other hand, the satellite-based sensor correctly classifies around 80% (b). Finally, the activity recognition module correctly classifies around 65% of the samples (c). Figures (d) and (e) show the results obtained by combining activities with locations coming from each of the two localization sensors.