POLIMI-ITW-S: A large-scale dataset for human activity recognition in the wild

Human activity recognition is attracting increasing research attention. Many activity recognition datasets have been created to support the development and evaluation of new algorithms. Given the lack of datasets collected in real environments (In The Wild) to support human activity recognition in public spaces, we introduce a large-scale video dataset for activity recognition In The Wild: POLIMI-ITW-S. The fully labeled dataset consists of 22,161 RGB video clips (about 46 h) including 37 activity classes performed by 50 K+ subjects in real shopping malls. We evaluated the state-of-the-art models on this dataset and get relatively low accuracy. We release the dataset including the annotations composed by person tracking bounding boxes, 2-D skeleton, and activity labels for research use at: https://airlab.deib.polimi.it/polimi-itw-s-a-shopping-mall-dataset-in-the-wild.


Specifications
Computer Vision and Pattern Recognition Specific subject area Human activity recognition Type of data 2-D RGB video Annotation composed by person tracking bounding boxes, 2-D skeleton, and activity labels in JSON format How the data were acquired This dataset was taken from RGB cameras of two smartphones with resolution 1920 × 1080 pixels, 30 fps. The models of the two smartphones are VIVO S7 5G and Honor 30 S . Data format Raw Description of data collection We collected the dataset in shopping malls in the Hubei province of China. The shopping malls have multiple floors providing different services, e.g., main hall with grocery stores on the ground floor; clothes shops on the second and third floor; restaurants and drink bars on the fourth floor; cinema with a waiting room on the fifth floor; supermarket on the underground floor. The diverse settings guarantee the desired variety of subjects and situations. The subjects are clients and staff in the shopping malls having different genders and ages, including men and women, babies, children, teenagers, adults and elderly people. So we have different subjects for almost every recorded video clip. The cameras were held by hands at about 90 cm from the floor. The recorders imitated the mobile robot, keeping moving or staying still by looking around to capture persons that are performing actions. We did not mount the cameras on a robot in order to avoid uncommon situations that the presence of a robot may trigger. We did not use 3-D stereo/depth cameras as recording tools since this type of camera suffers for the moving issue, which is not suitable for data collection for mobile robots. Moreover, they may not perform well when subjects are far away from the camera. Data

Value of the Data
• The data are useful for those working in the area of human activity recognition from skeletal data or from RGB videos; • It will be possible to develop new algorithms to classify a more detailed set of activities based on semantic meanings, thus improving the performance of the applications; • It could be used to support the development and evaluation of the models of human activity recognition from mobile robots operating in public environments; • Basing on these data, it will be possible to classify actions performed In The Wild, thus opening a wide sort of applications for Social Robotics and other disciplines; • Except for the topic of human activity recognition, the dataset could be useful for other human subjects related tasks. Since the clips include many crowded scenes, this dataset is interesting for investigating opening problems like person tracking, pose tracking, person reidentification, body/head orientation. • The data may also contribute to the autonomous robotic research field to develop new path planning, and obstacle avoidance methods. • Other researchers will become interested in problems and algorithms arising when operating in the wild.

Data Description
Human activity recognition (HAR) involves skeleton representations of human bodies instead of raw RGB videos. Due to its strong adaptability and highly abstract characteristics, many significant models were developed based on skeletal data [1][2][3][4][5][6][7] . Compared to the RGB video representation, the greatest benefits of the skeletal data are that they are free of dynamic environment noise and robust against complicated backgrounds (lighting conditions, color of clothing, object obstruction, etc.). It is important for service robots to recognize the actions of people in the real world to further enhance their capabilities to offer services.
We analyzed some relevant skeleton-based HAR models in the last three years to check how public datasets were used to train and evaluate models in the community. As shown in Table 1 , the most commonly used datasets are (in descending order): NTU RGB + D 60 [8] , NTU RGB + D 120 [9] , Kinetics [10] , Northwestern-UCLA Multiview Action 3D [11] , SYSU 3D Human-Object Interaction [12] datasets. Among those, only Kinetics was not collected from a constrained environment but from online streaming resources by using the crowd-sourcing method instead, while all the other datasets were collected in the respective laboratories.
The state-of-the-art models got about 90% accuracy on the datasets collected in laboratory environments as shown in Table 2 , 3 , 5 and 6 . Nevertheless, they got only less than 40% accuracy on the Kinetics dataset which was collected from online streaming resources by crowd-sourcing methods as shown in Table 4 . It hints that the state-of-the-art models could perform well on the datasets collected in constrained environments, but they may meet challenges when recognizing actions from unconstrained, natural environments.
Because the main datasets for evaluating new HAR models are collected in the specific laboratories and the accuracy is about 90%, we think that there is little optimizing space for the models trained on such a type of datasets. Meanwhile, we argue that reliable HAR models to support the production of mobile service robots should not only be evaluated on the datasets Table 1 Datasets used for recent human recognition models.

Table 4
The state-of-the-art methods on Kinetics-Skeleton dataset in accuracy (%).   Table 6 The state-of-the-art method on SYSU dataset in accuracy (%).
collected in controlled environments but also on datasets collected in the final, public environments, situation defined in the community as "In The Wild" (ITW). Due to the well-known issues (like having unbalanced taxonomies, unnatural scenes, label noise and invalid websites links) of the crowd-sourcing methods, the dataset like Kinetics collected from online streaming resources by crowd-sourcing methods may not satisfy the needs of developing robust models which are able to perform well in the real world.
To fill this gap, we propose the POLIMI-ITW-S dataset to develop reliable skeleton-based human activity recognition models that could be deployed on mobile service robots to recognize actions that happen in the real world.
We propose that a reliable ITW dataset of clips useful for robotic applications should have the following characteristics: 1. viewpoint similar to the one of the robot, including subjects viewed both in full figure when the robot is far from the subject and only in part, when the robot is close, taken from a  [16] 2015 203 - [8] 2016 60 40 56,880 [17] 2018 80 - [9] 2019 120 106 114,480 [18] 2019 31 18 16,115 [19] 2020 55 100 112,620 [20] 2020 530 camera having characteristics typical of the ones that are mounted on commercial, mobile, service robots; 2. video clips recorded from free moving viewpoints like the ones of mobile robots; 3. clips representative of the common actions in the selected environment, evenly distributed among the classes; 4. a large number of different subjects performing the same action; 5. different genders of subjects with a large range of ages, from babies to elderly people; 6. real-life background, possibly including people and objects that could typically be in the context; 7. presence of crowded scenes with large quantities of persons, including subjects occluded by person(s) or object(s). 8. unscripted, natural actions: there are no "actors", people are recorded without knowing in advance that they are, so they are supposed to perform naturally; 9. real-life sequencing of actions, so that it is possible to consider typical sequences of actions from realistic clips; 10. possible inclusion of human-object and multi-person interactive actions (such as "calling", "talking", "drinking", "eating", "holding baby in arms", etc.); 11. large-scale dataset.
As shown in Table 7 , the available datasets are usually collected in controlled contexts, such as laboratory, home, or conveniently extracted from streaming sources produced for other purposes.
The main advantages of these datasets are that they are easy to obtain and with relatively low labor cost comparing to the datasets collected in public spaces. In addition, datasets creators could take advantages of the limited environments to deploy RGB-D cameras to get 3-D human skeletal data. Since the actors know what they should do in advance and there are rarely considered crowded scenes, they did not need to dedicate a lot of resources for annotation.
However, most datasets do not consider the viewpoint of mobile robots in public spaces. Furthermore, there is no large-scale visual dataset that deals with real, daily behavior of people. In most cases, actions are performed upon request, often by actors, usually separated from each other. Results obtained starting from such constrained conditions may not completely hold in a real-world scenario, as we have verified on our dataset for state of art models. There is a lack of adequate dataset to train models that could be used by robots to recognize common human activities in public spaces. The absence of datasets for human activity recognition in the wild is a serious impediment to computer vision and robot intelligence research. Different from the state of art datasets, our dataset satisfies all the requirements mentioned above for a reliable ITW dataset.
We have collected 22,161 video clips with more than 15.4 million frames. The average duration of each video clip is about 7 s. The total length of the dataset is about 45.97 h. The dataset was collected in Hubei province of China. According to the population statistics published by the local government [24] , the gender distribution of the city is 51.48% (male) and 48.52% (female). The age distribution is 18.7% (0-14), 62.28% (15-59), 19.03% (over 60). We believe the distribution of gender and age in the dataset matches the distribution of the population of the city. Individuals were anonymized by blurring faces using RetinaFace [25] . We used OpenPifPaf [26] to extract person tracking bounding boxes and 2-D skeleton data.
Before starting the annotation work, we analyzed a subset of the collected video clips and used the proposed detailed labeling mode to define 37 activity classes. Actually, except for the defined activity classes, there are also other activities that occurred in videos such as "falling down", "fighting", "kicking", "throwing trash", etc. We didn't add these activities to the dataset since they have a relatively small number of clips, which would have dramatically affected the learning performance.
The defined classes were distributed on three levels: The labels of the general level are used for describing single actions. We have defined "standing", "walking", "sitting", "crouching", "cleaning", "jumping", "laying", "riding", "running" and "scooter" for this level. The modifier level labels are "walkingTogether", "sittingTogether", "standingTogether", etc , which refer to multiple persons or a group of people walking, sitting, or standing together, etc. The aggregate level detailed labels aim at describing multiple actions in a single label, such as "standingWhileCalling", "stand-ingWhileLookingAtShop", "walkingWhileWatchingPhone", "sittingWhileHoldingBabyInArms", etc. The complete list of the defined labels is shown in Table 8 .
We have also defined a rule for the labels containing the keyword "together". It is only used for the groups of persons performing social activities. For example, if two or more persons are standing closely, but are not involved in any social activities, their activities will not be considered as done "together".
The dataset was fully labeled by HAVPTAT [27] . We provide RGB videos, persons' tracking bounding boxes, 2-D skeleton data (17 body keyjoints) and labeled activities' classes in JSON format. To build a high-quality dataset that offers correct annotations, we adopt a series of approaches including: training annotators with tutorial slides and video demos, pre-testing the annotators rigorously before formal annotation, and cross-validating across annotators.
For the reader's convenience, Table 9 shows the COCO body 17 keypoint arrangement [28,29] adopted in our dataset. We notice that the entire dataset misses joints 4 (left ear) and 5 (right ear) of the COCO's 17 keypoints.
The skeletons of the data collected from the real world are often incomplete. A frame containing few joints could lead the ambiguity for the learning model. For instance, a frame with only 2 and 3 joints may reduce the possibility for a system to identify the corresponding activity  class. To reduce such a type of learning error, we also temporal-linearly interpolated the missing joints and hold the pose as valid only with more than a given number (nine) joints. The threshold was fixed by nine since only partial bodies could be captured by a camera in some cases. For example, as shown in Fig. 1 , when the camera is close to a person, only the upper part of the body is present, but the keypoints of the lower part of the body (14 left knee, 15 right knee, 16 left ankle, and 17 right ankle) miss. The left picture of Fig. 1 is the original pose extracted by OpenPifPaf [26] . Right elbow (9), left (10) and right (11) wrist were not detected. After having been processed by the interpolation operation provided by the Python's Pandas library [30] , the three missing keypoints were reconstructed on the right picture. This approach was applied to the entire dataset. We call it the "original version" dataset in the experimental phase. From Table 10 and Fig. 2 a, we could see that the imbalance issue is present in the original dataset. The most frequent activities are "walking", "standing", and "walkingTogether", which occur in about 55% of the dataset.

Experimental Design, Materials and Methods
The structure of the annotated JSON format file and the data information of the fields are shown in Fig. 4 . An annotated file is composed by all the information of frames ordered by temporal sequence. Every "frame" contains the main entry "prediction" which includes the detailed data: "keypoints" are composed by 17 tuples of ( X , Y , confident_score) with X , Y coordinates and a confident score of each joint (in total [ 17 × 3 ] dimensional data shape); "bbox" is composed by upper left X , Y coordinates, width, height of the bounding box ([4] dimensional data shape), "score" is the confident score of the bounding box ([1] dimensional data shape), "category_id" is the constant 1 inferring a person subject following the convention of COCO annotation ( [1] dimensional data shape), "id_" is the ID of a tracked person ([1] dimensional data shape), "action" is the ground truth label. A piece of an annotation file is shown in Fig. 5 . The example is composed by two frames of a clip and each frame includes two persons with "sittingWhile-WatchingPhone" and "standing" actions.   We used PyHAPT to pre-process the data [31] . [1][2] After the annotation work done by HAVP-TAT [27] , we obtained the annotated files like the example shown in Fig. 5 . We thus represent each joint with a couple of pairs ( X , Y ) corresponding to its extremes so that a skeleton frame is recorded as an array of 17 couples with data shape (17,2). Based on the field of "id_" which is the person tracking ID, we could facilitate composing the keypoints of the same person in different T temporal frames to get ( T , 17, 2) data shape. For the multi-person cases, we take all the detected persons in each clip into account. We consider each person performing the same action in a single video clip as a valid action sequence. If the same person performs multiple actions in a single video clip, we consider them as different action sequences performed by the same person. Since every action sequence includes only a person's data, so we extend the previous data shape to ( T , 17, 2, 1) for convenience of implementation. For the whole dataset, the script reshapes and gets the array of ( N , 2, T , 17, 1) dimensions by concatenating the single action sequences of persons with N action samples. We summarize the meaning of each element in the tuple: the script generates N samples of action sequences in total; two dimensions ( X , Y ) skeletal data; an action sequence lasts T frames; 17 keypoints of a human body; 1 person data in each tuple. All action skeleton sequences are padded to T = 300 frames by replaying the actions as also done by other skeletal datasets. The training set and test set split ratio is 70% and 30%. The number of padded frames and the training-test set split ratio can be both customized by users.
From Table 1 , we evaluated four most relevant state-of-the-art human activity recognition models in the last three years (Efficient GCN [14] , CTR-GCN [7] , MS-G3D [1] , 2S-AGCN [3] ) on the new POLIMI-ITW-S dataset. We believe the results are representative for the current mainstream human activity recognition algorithms.
To have a fair comparison, we decided to use joint data for training and test. As a result, from Table 11 , we observe that the accuracy is only less than 50%. We infer that the state-of-the-art activity recognition models could not perform well on real-life data.
When we were in the data collection phase, we already noticed that most of the actions occurred are "walking" and "standing" without involving any other additional actions in the shopping malls. This leads to an imbalance issue: some highly frequent actions appear more often than others. For instance, "walking" and "standing" have 80 K+ and 50 K+ sequences, while "walk-ingWhileEating" and "sittingWhileDrinking" have only 1256 and 325 sequences.
To evaluate whether the imbalance classes could cause the bad performance, we tried to take only 30 0 0 sequences of some classes with huge numbers to build a "cropped" version dataset as shown in Fig. 2 b. We evaluated the models also on the "cropped" version dataset. Unfortunately, the accuracy was even lower than the one with the original version dataset. We could say that the imbalanced classes might not directly affect the performance of the models.