EatSense: A Human-Centric Action Recognition and Localization Dataset for Understanding Eating Behaviors and Quality of Motion Assessment



Introduction
There are many extensive publicly available datasets for action recognition, temporal action localization and monitoring the daily activities of people [1]. These datasets contain various action classes for recognition and temporal segments for localization, and provide performance benchmarks for several commonly used algorithms [2]. Although the current datasets are extensive in terms of total recording hours and cover many different and difficult scenarios, they still lack the capability to model a specific behavior or to detect a change in motion that models decay in the motor movement of the subjects. This raises the question of why it is important to model behavior and identify minor changes. Modeling a person's behavior, such as their eating habits, can provide a more comprehensive understanding of their routine. Moreover, the ability to detect minor changes in motion can be incredibly valuable in situations where long-term monitoring is required, particularly for older individuals or for assessing changes in athletic performance [3], [4].
Most publicly available datasets have shortcomings: firstly, they contain only trimmed individual clips or sparse annotations in untrimmed videos instead of dense action labels; secondly, they do not provide sub-action (atomic action) level annotations, instead offering only high-level action labels, e.g., eating or drinking.
In this paper, we present a new densely annotated dataset named EatSense that is recorded while a person eats at a dining table in a real-world, uncontrolled environment. EatSense is an unobtrusive, human-centric, upper-body-focused dataset that provides the capability to model eating behavior along with the ability to study change in motion/motor deterioration. The change in motion is simulated by attaching weights to the wrists of the subjects while they eat.¹ Adding wrist weights to human movement can simulate increased muscle stiffness, leading to changes in movement patterns and kinematics. The concept of using weights to imitate upper-body decay has been confirmed in [5]. Likewise, in [6], a similar idea is employed to exhibit various gait abnormalities. In the past, different methods to temporarily induce palm stiffness and limit fine motor control of the fingers have been explored [7], [8]. EatSense contains 27 subjects from 13 nationalities, hence introducing diversity in eating styles, tools used (fork, chopsticks, etc.) and food selection. The contributions of this paper are:
-A new untrimmed dataset named EatSense for action recognition, temporal action localization and quality of motion assessment is presented.
-We provide frame-wise, dense labels (≈ 114.1 actions per video sequence) with three levels of abstraction (see section 4.2).
-We provide detailed comparisons with other publicly available datasets where, unlike many datasets, EatSense contributes to both the computer vision and health-care communities (see table 4).
-We provide experimental test benchmarks using recent approaches for action recognition and temporal action localization (see section 5).
-The full dataset including the RGBD and skeleton data is publicly available.
-From an explainability point of view, hand-crafted features from the upper-body poses of the subject were extracted using domain knowledge and human-object interactions, to demonstrate that the dataset can be used for tasks where explainability is key, such as healthcare applications where information about individual joints is vital to understand, diagnose, track or predict a problem (see section 6).
-We demonstrate effective modeling of eating sub-action recognition using EatSense, with reasonable accuracy from interpretable features (see section 6.2).
-We demonstrate the application capability of EatSense for quality of motion assessment (see section 6.3).
Figure 1: The EatSense dataset is an eating sub-action analysis dataset, comprising multiple modalities, dense annotations and multiple levels of label abstraction.

Literature Review
A review of publicly available datasets for the domains of action recognition, action localization and activities of daily living is presented. This section also includes a brief overview of some commonly used action recognition and localization approaches.

Public Datasets
There are several publicly available action classification datasets. Some of the large-scale datasets can be divided into four categories according to their targeted application: datasets for action recognition (trimmed video datasets), datasets for temporal action localization (untrimmed video datasets), activities of daily living datasets, and quality of motion assessment datasets.

Trimmed Video Datasets
Datasets that contain one action per video sequence are classified as action recognition datasets, as they do not present the possibility of temporally localizing the ongoing activity. NTU-RGB-D 120 [9] is one of the most extensive action-recognition benchmark datasets. It contains 114,480 video sequences of 106 subjects and 120 action classes, which include numerous daily routine actions, group actions and medical conditions. It was collected via multiple sensors such as RGB, depth and infrared. Kinetics-700 [10] is another large-scale dataset that contains 700 action classes and 650,317 video clips collected from YouTube, with at least 450 clips for each action class. The action labels include a variety of actions involved in daily lifestyle, such as 'digging', 'pouring milk' and 'drinking'. Goyal et al. presented Something-Something v2 [11], an ego-centric dataset focused on human hand gestures such as putting something on something or turning something upside down. It contains 220,847 videos recorded at 12 frames per second (fps), with 174 classes. This dataset was recorded in a setting where a person strictly performs a pre-defined set of actions with daily-use objects.
HMDB51 [12] is another dataset brought together from a collection of YouTube videos and digitized movies, with at least 101 videos per class and a total of 7,000 manually annotated videos. HMDB51 contains 51 action labels which can generally be divided into 5 types: general facial actions, facial actions with object manipulation, general body movements, body movements with object interaction and body movements for human interaction. Jhuang et al. presented J-HMDB [13], a subset of the HMDB51 dataset with 21 classes in which the authors provide annotations for human joints. These joint positions are further utilized to estimate ground-truth segmentation and optical flow.
As can be observed from this section, for many large-scale datasets (with hundreds of classes) the targeted application is computer-vision-based action recognition with a single clip-wide label. Thus, these datasets have very limited capability for healthcare or behavior understanding/modeling applications. On the other hand, EatSense contains full-length eating sessions of individuals and also introduces a new healthcare monitoring/motor function decay assessment dataset.

Untrimmed Video Datasets
There are various publicly available datasets that can be used to investigate the action localization problem, each with different label settings. These include: 1) video clips labeled with a single action, 2) videos marked with sparse labels, where there are long periods of inactivity between two actions, and 3) videos with dense labels covering the entire video, meaning there are no unmarked sections, or videos containing overlapping labels at any given time.
ActivityNet-1.3 [14] contains 203 actions that people perform on a regular basis, for example 'shovelling snow' or 'cleaning shoes'. These actions are broadly classified into vehicles, housework, animals, interior maintenance and exterior maintenance. ActivityNet is a large-scale dataset with an average of 137 untrimmed videos per action class, totalling 849 hours of video. It contains classes with sparse ground-truth labels, with 1.54 actions per clip of 1.9 minutes in length. FineGym [15] contains 530 sub-actions, for example 'vault' and 'floor exercise', in untrimmed videos. This is a human-centric dataset, with a single subject in the field of view, collected from videos available on YouTube in which subjects (regardless of who they are) perform various gymnastics routines.
PKU-MMD [16] is an extensive video dataset designed for action recognition and multi-modality action analysis. It is divided into two phases with 51 and 49 action labels, respectively. The action labels can be categorized into two groups, consisting of 41 daily actions and 10 interaction actions. The dataset comprises a total of 1,076 long videos (approximately 4 minutes each) and 2,000 short videos (approximately 2 minutes each), all recorded from multiple perspectives.
There are several datasets available that contain dense labels. Two such datasets are AVA [17] and Sphere-H130 [18]. AVA contains 80 actions in 430 clips, each 15 minutes in length, cropped from various films. Hence, this dataset includes instances with multiple subjects interacting with the environment or with each other. The Sphere-H130 action dataset contains 130 sequences of 13 actions, about 70 minutes in total, performed by 5 subjects in a home setting. However, in this dataset the subjects strictly perform a specific set of actions, hence it lacks real-world diversity.
UCF-101-24 [19] is another extensive action recognition dataset with realistic RGB videos downloaded from YouTube. It contains 101 action categories with 13,320 videos (27 hours) in total. These 101 categories can be broadly classified into five types: 1) Human-Object Interaction, 2) Body-Motion Only, 3) Human-Human Interaction, 4) Playing Musical Instruments, and 5) Sports. Epic-Kitchens [20], unlike Something-Something v2, is a non-scripted dataset in which the subjects were instructed to do tasks in a kitchen however they liked. Epic-Kitchens contains 4,053 classes over 100 hours of high-definition kitchen recording sessions.
In conclusion, the datasets for action localization tasks have extensive sets of untrimmed videos and action labels, but many of them still lack dense temporal labels, and others are short of a consistent set of sub-actions involved in one larger action; hence, they are rarely used for long-term behavior modeling of an individual. EatSense aims to fill these gaps, as it contains dense labels for the videos and 16 sub-actions involved in the eating action.

Eating Related Datasets
In a recent study, Tang et al. [21] introduced a dataset for intake gestures during meals, which is part of the Clemson Cafeteria Database (CCD) [22]. The video data was captured in a university cafeteria, with 276 participants consuming a total of 374 different foods and beverages. The dataset is composed of three different gesture classes: bite, drink, and others. In another research work, Shengjie et al. [23] recorded an ego-centric dataset using a head-mounted camera in a free-living environment, where they formulated a binary classification problem to distinguish between eating and non-eating activities. Finally, for those interested in gesture detection, Neves et al. [24] presented a comprehensive review of approaches used in eating gesture detection.
OREBA (Objectively Recognizing Eating Behavior and Associated Intake) [25] is a dataset designed to provide researchers interested in detecting intake gestures (single gestures denoted as: Intake, Intake-Eat, Right, Spoon) with a large amount of data collected from multiple sensors during communal meals. The dataset includes video recorded via a 360-degree camera positioned at the front, as well as a sensor box containing a gyroscope, an IMU and an accelerometer attached to both hands. Other research studies have also presented small-scale datasets, such as Accelerometer and audio-based Calorie Estimation (ACE) [26], Clemson [27] and Food Intake Cycle (FIC) [28], that primarily focus on characteristics of intake gestures, such as chews and swallowing behaviors.
Men et al. introduced a dataset [29] aimed at differentiating between high-level actions related to the type of food being consumed, such as 'eat a steak' or 'drink from a plastic bottle'. The primary goal of this dataset was to estimate the frequency of self-feeding and to gain insights into eating/drinking behavior. They utilized a Microsoft Kinect to capture skeleton motion. Mobiserv-AIIA [30] was designed for the evaluation of specialized meal intake to prevent undernourishment or malnutrition. The dataset contains videos recorded in a controlled laboratory environment with multiple cameras set up at various angles. Mobiserv-AIIA does not contain atomic actions; rather, it focuses on high-level actions such as 'eat', 'drink' and 'slice' for different meals (breakfast, lunch and fast food) using various tools (spoon, fork, glass of water, etc.).
To summarize, previous studies have presented various datasets and conducted research on recognizing actions/gestures such as eating, drinking and swallowing, but these are limited in the sense that they do not explicitly highlight the most common sub-actions involved in the eating process. Onofri et al. [31] explain that activity-recognition-based behavior analysis algorithms require two categories of knowledge, namely contextual knowledge and prior knowledge. Most past (vision-based) datasets lack prior knowledge, as they do not contain sub-action information. Hence, they fail to provide the capability to explore the complete behavior (based on the many sub-actions involved while eating) of individual subjects. EatSense addresses this gap, as it contains the 16 most common sub-actions of the whole eating process.

Activities of Daily Living Datasets
The MSR-Action3D dataset [32] includes the 3D locations of 20 joints in each frame. The dataset contains 20 actions such as 'tennis serve' and 'golf swing'. The MSR-DailyActivity dataset [33] was designed to model daily actions performed by a person while sitting on a couch. It contains 320 samples of 16 daily activities such as 'play guitar' and 'eat'. Some trimmed and untrimmed video-based datasets described in previous sections, such as Sphere-H130 [18], ActivityNet-1.3 [14] and Something-Something v2 [11], contain actions performed in a daily routine.

Quality of Motion Assessment Datasets
Many datasets not only focus on the ongoing activity but also quantify the quality of motion of the subject. Sphere-Walking [34] was designed for motion quality assessment via gait analysis. In this dataset, 6 subjects were recorded while climbing a flight of stairs, and each was scored by health professionals. Init Gait DB [35] is a benchmark dataset for gait impairment research, recorded in a controlled laboratory environment. Eight different gait styles were simulated in which the movement of the limbs and the posture of the human body were altered. It was recorded from multiple view angles using RGB cameras.
The walking gait dataset [6] is also a gait-analysis-based dataset that simulates 9 walking gait patterns. These were simulated by adding a thick sole to one shoe or tying weights at the ankle. It was recorded via a Microsoft Kinect while the subject walked on a treadmill with two flat mirrors behind them. Sphere and the other datasets rely on whole-body or lower-body gait analysis. Moreover, research such as [36], [37] and [38] presents a comprehensive overview of publicly available gait analysis databases. [36], [39] and [40] discuss challenges and solutions of gait analysis techniques in depth.
To the best of our knowledge, none of the current datasets concentrates specifically on the evaluation of human motion quality in relation to action-based eating behaviors, with a particular emphasis on the movement of the upper body joints.

Action Classification and Localization
Vision-based frameworks for action detection are classified into action recognition and temporal action localization.

Action Classification
In general, vision-based action classification/recognition² frameworks can be divided into two categories based on modality: video-based and skeleton-based action recognition. For skeleton-based action recognition, many researchers have explored Recurrent Neural Network (RNN) or Long Short-Term Memory (LSTM) based approaches [41], [42], [43] with hand-crafted spatial features. These, however, ignore the spatial connectivity of the human body.
To incorporate human joint connectivity, some researchers proposed using Graph Convolutional Networks (GCNs) or heatmaps for action recognition. In [44], Duan et al. proposed PoseConv3D (a.k.a. PoseC3D), which uses a 3D heatmap volume as the input to a 3D-CNN, making it less prone to joint estimation noise and thus more robust for action recognition. Yan et al. proposed the Spatio-Temporal Graph Convolutional Network (ST-GCN) [45], which establishes both spatial and temporal graph connections. Adaptive Graph Convolutional Network (AGCN) based approaches exploit the hierarchical structure of GCNs, where different layers contain multi-level semantic information, and thus incorporate long-range dependencies of the joints for action recognition [46], [47], [48]. Recently, Chen et al. [49] proposed a feature aggregation topology (channel-wise topology refinement graph convolution, CTR-GC) that effectively aggregates joint features in various channels and dynamically learns different topologies.
Here, we only discuss a few RGB (2D/3D CNN) based action classifiers. Temporal Segment Networks (TSN) [50] first divide the video into snippets, uniformly and sparsely sample frames, and then average-pool the samples to merge per-frame predictions. The Temporal Pyramid Network (TPN) [51] introduced spatial and temporal modulation blocks to align semantics and adjust the tempo among multiple levels of features extracted from the backbone. The Temporal Adaptive Network (TANet) [52] presented a temporal adaptive module that generates temporal kernels to capture global context information which, when used alongside a 2D CNN, produces a powerful action recognition framework.
There have been several studies on action recognition methods for healthcare research. In [53], Gul et al. proposed a YOLO-based action classifier, along with a dataset collected via a camera set up in a live environment, to identify eight abnormalities such as 'backward fall' and 'chest pain'. In [54], Woznowski et al. present a complete activities-of-daily-living hierarchical ontology with two video annotation strategies based on granularity, i.e., atomic labels and high-level annotations, for human action recognition in healthcare.
On action recognition for eating behaviors, Sharma et al. [55] presented a convolutional neural network (CNN)-based method for detecting hand-to-mouth gestures over extended periods, ranging from 0.5 to 15 minutes, to identify eating periods. The researchers utilized prior knowledge of other gestures to improve the detection of eating episodes. The Clemson all-day dataset was used, which contains data collected using IMU sensors placed around the subjects' wrists. Okamoto et al. [56] presented a system for recognizing eating and drinking actions, which also categorizes the food items consumed. The system detects the mouth region to extract relevant information about nutritional intake.
However, healthcare-related techniques primarily distinguish between different abnormalities instead of detecting minor changes in a before-and-after scenario. Furthermore, most previous techniques are limited in that they employ deep features to differentiate between two abnormalities rather than using explainable features. Explainable features could be more advantageous for healthcare professionals in understanding the underlying causes of abnormalities. Additionally, explainable features could still be somewhat dependable in cases where the algorithm struggles to differentiate between two abnormalities.

Temporal Action Localization
The Boundary-Sensitive Network (BSN) [57] predicts the probability of an action at any time instant, along with the probabilities of the start and end of that particular action. It generates flexible proposals by keeping the temporal positions where the start and end scores are high. However, these proposals are evaluated separately, which completely ignores the global context of the video. The Boundary-Matching Network (BMN) [58], on the other hand, aggregates the features of all proposals and evaluates them simultaneously, hence keeping the global context of the video intact. Most past algorithms, including BSN and BMN, used an external classifier to predict action categories from video proposals and relied heavily on anchor windows. Recently, ActionFormer [59] was presented, which combines a transformer with a temporal feature pyramid network to obtain multi-scale features and recognizes action categories without explicitly generating action proposals (hence no external classifier) or pre-defined anchor windows.

The EatSense Dataset
The motivation for selecting eating for performance monitoring is that we wanted an activity performed on a regular basis, so that we could explore the behavioral and upper-body movement changes that a healthy person goes through in their daily life. Eating is one of the most common and frequent actions in one's daily routine, compared to any other action that can change over time. Moreover, individuals tend to persist with their eating habits despite encountering minor physical impediments that may affect their mobility. Lastly, we believe eating can be exploited to acquire and evaluate changes in motor movement. As people grow older, their motor movement gradually becomes restricted, which also affects their ability to eat properly.
The main focus of this paper is to introduce a new dataset that addresses several research gaps related to human eating behavior and healthcare applications. The dataset offers a detailed labeling system with up to 16 action subclasses, including short-time actions, and involves the localization of sub-actions in videos, with an average of 114.1 sub-actions in 11.5-minute segments. Additionally, the dataset emphasizes human-centric behavior understanding, particularly hand gestures and posture during eating. Finally, the dataset allows recognition of decay in motor movement, which is simulated through wrist weights creating small changes in upper-body movements. The dataset can be found at the link provided in footnote 3. This study focuses only on the research questions mentioned earlier, but there may be other possible uses of the dataset that are not thoroughly discussed in this paper and could be investigated further for understanding human eating behavior and healthcare applications.

Data Collection
An RGB-D Intel RealSense D415 camera was deployed in a dining room environment. This is an inexpensive depth camera that provides good 3D depth estimation in an indoor environment [60]. The depth maps were used to translate the 2D poses into a 3D frame of reference. The camera was mounted at an oblique angular view facing the dining table, with the restriction that there is only one person in each frame. The recording was done in multiple locations with varying camera-to-subject distances and backgrounds. Frames were discarded if someone other than the person eating walked by or entered the camera's field of view. The subjects were allowed to eat as they pleased, without any interference or input from the recording team. Fig. 1 shows the setting of the camera system in the dining room environment. It also shows one sample from the dataset both with and without wrist weights. Special wrist/ankle weights with velcro stitches were used. The weights were always attached to the wrists of the subjects, which are denoted as joints 5 and 8 in Fig. 1. The placement is also shown in the bottom-left thumbnail of the figure.
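The 2D-to-3D lifting of joints via the depth map follows the standard pinhole back-projection model. A minimal sketch is shown below; the intrinsic values are placeholders for illustration, not the actual calibration of the D415 used for EatSense.

```python
def deproject(u, v, depth_m, fx, fy, cx, cy):
    """Back-project pixel (u, v) with metric depth Z into a 3D camera-frame
    point using the pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)

# Placeholder intrinsics for a 640x480 stream (illustrative only).
fx = fy = 600.0
cx, cy = 320.0, 240.0

# Example: a wrist joint detected at pixel (400, 300) with a 1.2 m depth reading.
wrist_3d = deproject(400, 300, 1.2, fx, fy, cx, cy)
```

In practice the intrinsics would come from the camera's factory calibration (e.g., via the RealSense SDK), and the depth value would be read from the aligned depth map at the detected 2D joint location.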

Data Labelling
The dataset is labelled with four levels of abstraction.

Data Labelling: Poses
For the first level of abstraction, the skeleton of the upper-body pose was estimated. The 3D locations of 8 joints are represented as: nose (1), chest (2), right shoulder (3), right elbow (4), right wrist (5), left shoulder (6), left elbow (7) and left wrist (8). Sometimes both left and right limbs are denoted as one entity, e.g., the right and left wrists collectively are represented as w. HigherHRNet [61] is used to estimate the locations of the 2D joints, which were then projected into 3D space using the depth map measurements. The choice of HigherHRNet was made empirically from a pool of commonly used pose estimators for ground-truthing the data.
As pose estimation is a crucial step in skeleton-based action recognition, an experiment to find the most suitable algorithm for the proposed dataset was conducted. First, a total of 100 images were sampled uniformly from the set of videos. Second, the 2D poses in these images were carefully labeled by hand. Third, the images were used as input to deep learning-based pose estimation algorithms, i.e., OpenPose [62], DarkPose [63], DeepPose [64], HigherHRNet [61] and ViPNAS [65]. To evaluate the results, two metrics were utilized: mean average precision (mAP) in 2D space and mean squared error (MSE) in 3D space. For mAP, the intersection over union (IoU), also known as object keypoint similarity (OKS) in the case of keypoints, was calculated by measuring the distance between the predicted and ground-truth keypoints using a Gaussian kernel. The dataset contains scenarios where the arms cross each other or the subject puts their hand on their lap (under the table), so body joints are sometimes occluded. The experiments showed that when a joint was not visible to the camera, OpenPose was unable to detect the pose of the individual.
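The OKS metric described above has the standard form of a Gaussian falloff over per-joint distances, averaged over visible joints. A minimal sketch follows; the object scale and per-joint constants used here are illustrative assumptions, not the values used in the paper's evaluation.

```python
import math

def oks(pred, gt, visible, scale, k):
    """Object keypoint similarity: per-joint Gaussian of squared pixel
    distance, normalized by the object scale and a per-joint constant k,
    averaged over the visible joints only."""
    sims = []
    for (px, py), (gx, gy), vis, ki in zip(pred, gt, visible, k):
        if vis:
            d2 = (px - gx) ** 2 + (py - gy) ** 2
            sims.append(math.exp(-d2 / (2.0 * scale**2 * ki**2)))
    return sum(sims) / len(sims)

# Toy example: 3 joints, the third occluded (excluded from the average).
gt = [(100.0, 120.0), (140.0, 200.0), (90.0, 210.0)]
pred = [(100.0, 120.0), (145.0, 200.0), (0.0, 0.0)]
visible = [True, True, False]
score = oks(pred, gt, visible, scale=50.0, k=[0.1, 0.1, 0.1])
```

Thresholding this similarity (e.g., at 0.5) plays the same role for keypoints that an IoU threshold plays for bounding boxes when computing mAP.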
Both the 2D joint locations predicted by each of the estimators and the manually labeled 2D joint locations were then projected into 3D space using the depth map. To quantify error on common grounds, the MSE (3D) was calculated using only the visible joints. Table 1 shows the mAP with an IoU threshold in [0.5, 1]. It also shows the MSE for each of the pose estimators. In the experiments that follow, we chose to use HigherHRNet because its MSE was considerably lower (9.7×10⁻³ m) than the alternatives for approximately the same mAP.

In the third level of abstraction, each of these categories is further divided into sub-actions, which include several atomic actions (lasting a second or less). Our dataset approximately follows Zipf's law, as can be seen in Fig. 3 (left). Experiments on actions with few examples can be unreliable, so we include and present experiments only on actions that have at least 40 instances. The levels of abstraction for actions are shown in Fig. 2. Ground-truthing was done manually with the help of the VGG Image Annotator (VIA) [66]. For sub-actions where two labels were simultaneously correct, preference was given to actions being performed by the hand. For example, for the simultaneous actions 'chewing' and 'food in hand at table', 'food in hand at table' was given preference. Similarly, 'chewing' plus 'move hand away from mouth' was marked as 'move hand away from mouth'. Moreover, when a person was talking, using a mobile phone or doing any other activity not included in the list, the frames were marked as 'others', irrespective of any eating action going on simultaneously. For example, 'talking' plus 'chewing' or 'food in hand at table' was marked as 'others'. Also, whenever everything was at rest, frames were marked as 'no action'.
To imitate limited motor movements, at least two sets (up to 4 sets) of videos were captured for each participant. In the first set, a weight was tied to each of the participant's wrists to restrict their movements,⁴ while in the second set the participant was not weighed down and could move normally. The videos were labeled as 'Y' or 'N' depending on whether the weights were added or not, respectively.
Numerous volunteers actively participated in the labeling process, which presented a significant challenge in maintaining consistency across labels. To address this concern, we devised a two-step quality control system aimed at achieving reasonably accurate labeling. First, we instructed the volunteers to label eating actions using a comprehensive guide that outlined naming conventions for actions and provided detailed explanations of the appropriate usage of each specific label. Second, one of the authors reviewed the labeled videos to ensure consistent label quality throughout.

Data Statistics
The dataset contains 135 video sequences (53 without weights and 82 with weights) of 27 subjects with different cultural backgrounds, ensuring diversity in age, ethnicity, body size, gender and eating behaviors. These were recorded at a resolution of 640×480 and 15 frames per second (fps). The actions performed in each individual video are shown at the top of Fig. 4. Table 2 summarizes the average time taken by an instance of each action and the total number of instances of each action. Moreover, for quality of motion assessment, the ratio of occurrence of non-weight 'N' is 36.6% and weight 'Y' is 63.4%. The percentage distribution of the different levels of label abstraction is shown in Fig. 3.
Eating behaviors and food choices vary across different cultures and regions. Chopsticks are commonly used for eating in East Asian countries, while eating by hand with no tool is prevalent in South Asian regions, where people prefer to eat rice or flatbreads directly with their hands. The subjects were chosen specifically to maximize diversity and generalizability. The EatSense dataset includes subjects from thirteen different nationalities, and Table 3 provides information on their ethnicity by region, age group and the tools they used for eating. As the choice of tools differs between subjects, the actions performed are also bound to be subjective. The dataset conforms to the above-mentioned convention, as can be seen at the bottom of Fig. 4.

Dataset Properties
EatSense has several attractive properties that distinguish it from other existing datasets.
Each of these videos is densely labeled, which means there are no unlabelled temporal patches, unlike most existing large-scale datasets. Also, the two-stage quality control of labels ensures clean and consistent labels across all of the videos. The current state of the dataset can easily be extended by recognizing tools and foods and checking for human-object interactions, to detect what types of food a person eats for a complete nutritional analysis. Unlike other existing datasets, where the background/environment plays a significant role, EatSense is focused on the human subject.

As discussed before, EatSense contains ground truth for multiple levels of abstraction. Each of the actions is chosen by a tree-like labeling strategy, based on Fig. 2. Frame-level labels of actions and poses are provided. When a person eats while sitting at a dining table, their lower body is occluded and does not play any significant part in detecting eating activities. Hence, EatSense is an upper-body-posture-focused dataset. EatSense is also rich in data that could be used for human-health analysis. For example, it contains a layer of labels that simulates decay/decline in the motor movement of a person over time. Continuous monitoring of eating actions and looking for decay in motor movement could potentially help identify a serious health situation.
Table 4 shows a brief comparison of EatSense with other related datasets in both the computer vision and AI healthcare communities. The table includes commonly used action recognition/localization datasets, such as Thumos14 [67], FineGym, NTU-RGB-D and AVA, alongside commonly used healthcare-based datasets, for example MSRDailyActivity3D, Init Gait DB and OREBA. The table presents various characteristics of the EatSense dataset that offer multiple possibilities for research. These include the ability to develop models for action recognition, action localization with dense labels, eating behavior analysis, decay in motor movement assessment, upper-body-focused models, etc.

Experimental Evaluation on Baseline Approaches
We evaluate EatSense using two action understanding methodologies, i.e., action recognition and temporal action localization. The PyTorch implementation was utilized for each of the algorithms listed below. Each algorithm was trained for 150 epochs, beginning with a learning rate of $1 \times 10^{-2}$; at every 30th epoch, the learning rate was multiplied by a factor of $\frac{1}{10}$. Unless specified otherwise, the rest of the training protocols for the techniques used were consistent with those in the original papers.
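As an illustration, the step schedule above can be sketched in a few lines (a hypothetical helper for exposition; it is equivalent to PyTorch's `StepLR` with `step_size=30`, `gamma=0.1`):

```python
# Step decay used in the experiments: start at 1e-2 and multiply the
# learning rate by 1/10 every 30 epochs.
def learning_rate(epoch, base_lr=1e-2, step=30, gamma=0.1):
    """Return the learning rate used at a given (0-indexed) epoch."""
    return base_lr * gamma ** (epoch // step)
```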

Trimmed Video Analysis
The analysis of trimmed videos can be divided into two stages: single-modality experiments, which examine individual aspects such as skeleton, flow, and RGB, and multi-modality experiments, which combine RGB and flow. In trimmed video analysis, it is assumed that the videos have already been segmented into separate clips for each action. These techniques explore intra-action relationships independently of past or future action occurrences, hence only recognizing the ongoing action. Trimmed video analysis avoids the problem of deciding when one action finishes and the next starts.

Dataset Splits
To generate a mix of actions for classification analysis, first, the data was divided into clips of individual activities. Second, stratified sampling (without replacement) was performed on the action clips. Third, using these sampled clips, five stratified splits were generated. Of these five splits, three were used for training and one each for validation and testing in this recognition task. The splits were permuted for five-fold cross-validation. For evaluation of graph-based networks that take poses as input, the pose sets (m × 8 × 3 vectors, where m is the number of frames) were also shuffled, with the constraint that the same set of actions is selected as in the original five splits.
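The stratified sampling step can be sketched as follows (an illustrative implementation, not the authors' exact code): clips are grouped by action label and dealt round-robin across the splits, so each split keeps roughly the same class distribution.

```python
import random
from collections import defaultdict

def stratified_splits(clips, labels, n_splits=5, seed=0):
    """clips: list of clip ids; labels: parallel list of action labels.
    Returns n_splits lists of clip ids with near-equal class balance."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for clip, lab in zip(clips, labels):
        by_label[lab].append(clip)
    splits = [[] for _ in range(n_splits)]
    for lab, items in by_label.items():
        rng.shuffle(items)
        # Deal each class's clips round-robin across the splits.
        for i, clip in enumerate(items):
            splits[i % n_splits].append(clip)
    return splits
```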

Classifiers
Various deep learning-based networks with different input modalities were evaluated for the classification of the trimmed videos. For skeleton-based classifiers, graph-based approaches with deep features, such as graph-convolutional networks (e.g., CTR-GCN [49], 2s-AGCN [48], ST-GCN [45]) and 3D heatmap volumes (e.g., PoseConv3D [44]), were utilized for evaluation of both 2D and 3D joint information. We also demonstrate recognition from RGB, optical flow (motion features) and combined RGB+Flow modalities by using TANet [52], TPN [51] and TSN [50]. Furthermore, a comparison was conducted between fine-tuning pre-trained algorithms and training the same algorithms from scratch.

Results
To measure the performance of the models on EatSense, we compute the Top-1 and macro (mean class) accuracy over all 16 classes. Table 5 shows the top-1 (clip) and macro (class) accuracies of the networks with pre-trained models, and table 6 demonstrates the performance when trained from scratch.
Table 5: The table displays the Top-1 and Macro accuracies achieved by deep networks with three modalities as input on the trimmed videos. The 'Pre-train dataset' column indicates the dataset on which the particular algorithm was pre-trained. NTU-60 and Kin-400 refer to NTU-RGB-D-60 and Kinetics-400, respectively. The row 'Average' shows the average accuracy achieved by all the tested algorithms on their respective modalities.

Discussion
On trimmed action clips, as shown in table 5 with pose as the modality, CTR-GCN performs well compared to the others because it utilizes a channel-wise topology, i.e., it dynamically learns and effectively aggregates features. ST-GCN 2D does not perform well, as 2D poses have lower-quality motion features compared to 3D. Claiming the first position, ST-GCN 3D performs significantly better in terms of both top-1 and macro accuracy. Overall, almost all deep learning-based graph convolutional networks (GCNs) tend to achieve an accuracy of over 70% and a class-wise accuracy of over 50%. Moreover, when training from scratch (as shown in table 6), ST-GCN 3D still performs better than the others.
For RGB as a modality, both trained from scratch and pre-trained on Kinetics-400, TANet and TPN achieve nearly the same top-1 and macro accuracy, with only a slight difference in performance. However, TANet (when trained from scratch) achieves the best class-wise accuracy (macro) of 72.6% and a top-1 of 83.6%, because it specializes in capturing long-term temporal dependencies, unlike TSN and TPN. TANet, however, does not perform as well with optical flow as input. This may be related to how optical flow encodes motion information in the video. Optical flow is a dense representation of motion, whereas the encoding of motion information in RGB frames is sparser and may be more difficult to extract.
On the other hand, TSN is designed to explicitly model temporal information by dividing the video into segments and aggregating features from each segment. This makes it well-suited for processing dense motion information like optical flow. Hence, TSN provides good performance for optical flow input compared to TANet or TPN. Additionally, TSN uses a multi-scale temporal sampling strategy that allows it to capture temporal information at multiple scales. This may be particularly beneficial for processing optical flow, which encodes motion information at multiple spatial and temporal scales. Lastly, as shown in table 5, for the mixed modality (RGB+Flow) the top-1 accuracy achieved by all three algorithms is nearly the same. Overall, TSN performs better in terms of top-1 accuracy and macro accuracy because it has the capacity to effectively capture the temporal evolution of actions by dividing the clip into short segments and sampling frames from each of the segments.

Table 6: The table displays the Top-1 and Macro accuracies achieved by deep networks with three modalities as input on the trimmed videos. 'None' represents the baseline algorithms that were trained from scratch. The row 'Average' shows the average accuracy achieved by all the tested algorithms on their respective modalities.
Table 6 shows the performance of the above-mentioned algorithms when trained from scratch. In the table, TANet demonstrates superiority in the RGB modality, while TSN excels in optical flow. Overall, the experimental findings indicate that the baseline algorithms exhibit similar accuracy levels, even when trained from scratch. However, as can be observed from the averages reported in tables 5 and 6, utilizing pre-trained models enhances the accuracies achieved by the baseline algorithms for all the modalities.

Untrimmed Video Analysis
One of the main contributions of EatSense is its dense labeling, i.e., there are no unlabelled patches in the 11.5-minute (on average) long videos. This gives the opportunity to develop skeleton-based (pose-based) action localization frameworks. As there are no off-the-shelf implementations of such a technique at present, we provide evaluation benchmarks on RGB images. We present the evaluation of EatSense on untrimmed videos using temporal action localization algorithms that exploit images to study inter-action temporal relationships.

Dataset Splits
For untrimmed videos, the videos were randomly divided into three splits: training on 96, validation on 24 and testing on 15 videos.

Localizers
In action localization frameworks, videos (usually as non-overlapping snippets) are input into a (pre-trained) visual encoder that represents the video as a high-level feature set, which is further processed for action recognition and localization. For temporal localization (deciding when the video stream changes from one activity to another), we extracted visually encoded features using the TSN [50] and TSP [68] pipelines. For EatSense, we used overlapping snippets to get a dense high-level feature set for detecting atomic actions. These high-level features were then used with BMN [58] and ActionFormer [59], respectively, to generate segment proposals for evaluation on the EatSense dataset. In the end, the proposals generated by BMN were classified using TSN (trained and evaluated as in section 5.1.2), whereas ActionFormer implicitly categorizes the generated proposals into action classes.
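The overlapping-snippet idea above can be sketched as follows (illustrative only; the window and stride values are our assumptions, not values reported by the paper):

```python
# Generate overlapping snippet windows over a video of n_frames frames,
# so that features are extracted densely enough to capture short atomic actions.
def snippet_windows(n_frames, window=16, stride=4):
    """Return (start, end) frame indices of overlapping snippets."""
    starts = range(0, max(n_frames - window + 1, 1), stride)
    return [(s, min(s + window, n_frames)) for s in starts]
```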

Results
For the performance evaluation on EatSense, mean average precision is used as the metric for the action localization task. The results are shown in table 7, where @0.1, @0.3 and @0.5 denote the temporal IoU (tIoU) threshold levels.
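The tIoU threshold used in these columns can be made concrete with a small helper: a predicted segment counts as a match at threshold t if its temporal IoU with a ground-truth segment is at least t.

```python
# Temporal IoU between two segments given as (start, end) times in seconds.
def tiou(seg_a, seg_b):
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0
```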
Table 7: The table shows the mean average precision for the action localization task using deep networks on untrimmed videos.

Discussion
Untrimmed videos in EatSense are particularly challenging for current action localization networks, as can be seen from table 7. The performance of the localization algorithms decays at higher tIoU thresholds. This is because EatSense contains actions of lengths varying in the range of [0.62, 9] seconds, i.e., the smallest atomic action lasts for 9.3 frames on average. On the other hand, we only provide benchmarks for action localization on RGB data, as (to the best of our knowledge) there are no specialized off-the-shelf skeleton-based temporal action localization frameworks available.
EatSense also supports applications such as assessing decay in the motor movement of the upper body of a person. In this section, we explore some major application scenarios of EatSense to demonstrate the flexibility of the dataset in understanding the role of individual joints for action recognition and deterioration assessment. For this purpose, we use hand-crafted features derived from the 8 upper-body joints, which are discussed in the following sub-sections.

Descriptive Features
After 2D human pose estimation using HigherHRNet, depth maps recorded by the RGBD camera were used to project the 2D poses into 3D space. The estimated 2D poses and projected 3D points are absolute positions, and they change whenever the environment or the position of the camera changes. To avoid this issue, we set the origin of the camera coordinate system to the 3D location of the subject's chest joint. We calculated the relative position of each joint by subtracting the 3D position of the chest from the absolute 3D position of the joint, as given in eq. 1, where $p^{abs}_j$ refers to the absolute 3D position of joint $j$, with $j \in \{1, \dots, 8\}$, and $p^{abs}_c$ refers to the absolute position of the chest. The relative joint positions are then used as features for the classification tasks.
$$p^{rel}_j = p^{abs}_j - p^{abs}_c \qquad (1)$$
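Eq. 1 is straightforward to express in code. The sketch below assumes a (hypothetical) joint ordering in which the chest is the first of the 8 upper-body joints; the actual index in the dataset may differ.

```python
import numpy as np

CHEST = 0  # assumed index of the chest joint among the 8 upper-body joints

def relative_positions(p_abs):
    """p_abs: (8, 3) array of absolute 3D joint positions.
    Returns the (8, 3) positions relative to the chest (eq. 1),
    making the features invariant to camera placement."""
    return p_abs - p_abs[CHEST]
```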

Additional Spatial Features
Motion of the joints while performing the various eating sub-actions is highly correlated. To exploit the relations between both arms, the Euclidean distances between the two wrists and between the two elbows were estimated as given in eq. 2, where i represents either the elbow or the wrist joint and r or l denotes the right or the left one, respectively.

$$d_{a,i} = \left\| p^{rel}_{r,i} - p^{rel}_{l,i} \right\|_2 \qquad (2)$$

For a more meaningful representation of a 3D point in space, the joints' relative positions (note that these give the distance of the joint from the chest) are converted into a polar coordinate system as given in eq. 3, whose radial component is

$$r_j = \sqrt{(p^{rel}_{x,j})^2 + (p^{rel}_{y,j})^2 + (p^{rel}_{z,j})^2}. \qquad (3)$$
To explore the inherent long-range dependencies of the joints, i.e., the stretch and contraction of the arms, the product of the polar coordinates of the joints was calculated as given in eq. 4, where $p^{polar}_j$ represents the relative polar position of joint $j$, excluding the chest joint (the skeleton origin):

$$p^{prod} = \prod_{j \in \{1,\dots,7\},\ j \neq c} p^{polar}_j \qquad (4)$$
To get the orientation of the arms at any time instance, joint triplet angles at the elbow (e) and shoulder (s) sockets were calculated (w and c represent the wrists and the body centre in the equations).
The elbow angle, for example, is the inverse cosine of the normalized dot product of the wrist-elbow and shoulder-elbow vectors:

$$p^{\theta}_e = \cos^{-1}\!\left(\frac{(p_w - p_e)\cdot(p_s - p_e)}{\|p_w - p_e\|\,\|p_s - p_e\|}\right) \qquad (5)$$

Lastly, to exploit the interactions of the human posture with stationary objects in the scene, we find the distance of the joints from the plane of the table. The plane of the table was estimated using a least-squares method from 3D table locations. These pixel locations were marked manually on the table in the first frame of each video and then propagated through the video, under the assumption that the table is fixed and does not change position during one eating session.
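The table-plane distance and the joint-triplet angle can both be sketched with numpy. This is our own illustrative formulation (fitting z = a·x + b·y + d by least squares), not the authors' code; the paper does not specify the plane parameterization.

```python
import numpy as np

def fit_plane(points):
    """Least-squares fit of z = a*x + b*y + d to (n, 3) points.
    Returns a unit normal n and offset c such that n . p + c = 0 on the plane."""
    A = np.c_[points[:, :2], np.ones(len(points))]
    (a, b, d), *_ = np.linalg.lstsq(A, points[:, 2], rcond=None)
    n = np.array([a, b, -1.0])
    scale = np.linalg.norm(n)
    return n / scale, d / scale

def plane_distance(joint, normal, offset):
    """Signed distance of a 3D joint position to the fitted table plane."""
    return joint @ normal + offset

def triplet_angle(p1, p2, p3):
    """Angle (radians) at p2 formed by the triplet p1-p2-p3, e.g. the
    wrist-elbow-shoulder angle described in the text."""
    u, v = p1 - p2, p3 - p2
    cosang = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.arccos(np.clip(cosang, -1.0, 1.0)))
```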

Additional Temporal Features
Temporal features are imperative for robust action recognition, but they become pivotal when motion quantification is needed. In the temporal domain, the velocities and accelerations of the joints were estimated, as given in eqs. 6 and 7, respectively. Subscripts t+k represent joint (j) positions at frame t+k, where k controls the number of frames in the temporal estimation window. Due to the real-world recording environment and the lack of control over the subject, the performance or the order of actions, these measurements were found to be particularly noisy. To accommodate this, features with varying window sizes, i.e., using k from 0 to 5 for both accelerations and velocities, were included in the feature space.
To put emphasis on the causality of the current posture, it is important to look at the immediate past values. For this purpose, we also use three lagged relative positions (i.e., $p^{rel}_{t-1,j}$, $p^{rel}_{t-2,j}$, $p^{rel}_{t-3,j}$). Furthermore, a weighted sum over the three lags was also included as a feature. The weights over the past three values were set to put more stress on recent past values. The weights and the moving-window equation are given in eq. 8, where movw stands for moving window and the subscript shows the size of that window.
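The temporal features above can be sketched as follows. This is our own formulation: the exact forms of eqs. 6-8 and the lag weights are not reproduced here, so the window arithmetic and the weight values (0.5, 0.3, 0.2) are assumptions for illustration.

```python
import numpy as np

def velocity(positions, k=1):
    """Finite-difference velocity of one joint over a (k+1)-frame window.
    positions: (T, 3) trajectory across T frames -> (T-k-1, 3)."""
    return (positions[k + 1:] - positions[:-(k + 1)]) / (k + 1)

def acceleration(positions, k=1):
    """Second finite difference: velocity of the velocity."""
    return velocity(velocity(positions, k), k)

def weighted_lags(positions, weights=(0.5, 0.3, 0.2)):
    """Weighted sum of the three lagged positions at each frame t >= 3,
    with more weight on the most recent lag."""
    w1, w2, w3 = weights
    return w1 * positions[2:-1] + w2 * positions[1:-2] + w3 * positions[:-3]
```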

Classification with Hand Crafted Features
This section investigates eating sub-action recognition with explainable (hand-crafted) features in a frame-by-frame setting. The features discussed in the previous paragraphs represent both the spatial and temporal domains of the skeleton. There are many features, and not all are useful, so a feature selection process was applied (described in Section 6.2.3). The selected features were represented as vectors, which formed the input to the various classifiers described below. The features were then used to classify the 16 labeled actions in EatSense.

Dataset Splits
For classification using hand-crafted features, we used the same strategy of stratified sampling of the data as mentioned in subsection 5.1.1. Clips from the whole dataset were sampled in a stratified manner (without replacement) into 5 data splits while maintaining nearly the same percentage of occurrence of each individual action in each subset. These clips were then expanded into individual frames for the frame-by-frame analysis.

Classifiers
For classification using videos and hand-crafted features, various machine/deep learning methods were explored, including the Light Gradient Boosting Method (LightGBM) [69], AdaBoost, Multi-Layer Perceptron (MLP), K-Nearest Neighbour (KNN) and Quadratic Discriminant Analysis (QDA). Additionally, as our data has an imbalanced set of classes, LightGBM with a focal-loss objective function was also tested. Default values of the hyper-parameters of the techniques mentioned above were used unless stated otherwise.

Experiments
To deal with the numerous features of the data, we employed a forward sequential feature selection search to identify the most influential features, using stratified splits (all five splits) with five-fold cross-validation. The top 30 features from the feature selector are presented in Fig. 5. From the results, we discovered that 30 features are the most helpful for classification and decided to use them for further experiments.
For the frame-by-frame analysis, two experiments were carried out: (i) training and testing on stratified splits (5-fold cross-validation); (ii) training on stratified splits and testing on unseen full videos.

Results
Table 8 shows the results of both experiments mentioned above. Its first three columns show the mean top-1 accuracy, macro accuracy and standard deviation achieved by each of the algorithms with 5-fold cross-validation. The second experiment (shown in the last two columns) reports the accuracy of the trained models on the unseen videos. It is clear that LightGBM outperforms the others in both experiments. As the action ground truth is labelled manually, the start and end frame times of the actions are prone to be noisy.
To account for human variation and to overcome the temporal offset between the classifier outputs and the labels, a windowed search with sizes ranging from ±1 to ±45 frames (3 seconds) was applied to look for the correct label within that range of frames between the ground truth and the predictions. Table 8 shows the accuracy achieved with window size ±1 (which corresponds to ±0.2 seconds).
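The windowed search can be sketched as follows (an illustrative reading of the procedure, not the authors' implementation): a frame's prediction counts as correct if the predicted label appears anywhere in the ground truth within ±w frames, absorbing small annotation offsets at action boundaries.

```python
def windowed_accuracy(pred, truth, w=1):
    """pred, truth: equal-length per-frame label sequences.
    Returns the fraction of frames whose predicted label occurs in the
    ground truth within a +/- w frame window."""
    hits = 0
    for t, lab in enumerate(pred):
        lo, hi = max(0, t - w), min(len(truth), t + w + 1)
        if lab in truth[lo:hi]:
            hits += 1
    return hits / len(pred)
```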

Quality of Motion Assessment
As an initial proof of concept for detecting minor changes in motion, we investigate a binary classification problem. EatSense provides the labels 'Y' and 'N' for classifying people eating with weights ('Y') and without weights ('N') on the wrist, respectively. This could potentially simulate decay in the motor movement of the elderly over time.

Dataset
On stratified splits (all five splits), a forward sequential feature selector with five-fold cross-validation was used to identify the most contributing features for this classification. The feature selection plot for the top 15 features is shown in Fig. 5. The top 10 features were found to be of the greatest importance.

Classifier and Experiments
As LightGBM outperforms the other algorithms in frame-by-frame action classification with hand-crafted features, LightGBM (LGBM) was used to distinguish between the with- and without-weight cases. The experiments on the assessment of the quality of motion using the 10 selected features were divided into two sub-experiments: (i) combined subjects: training and testing on all data in stratified splits (with 5-fold cross-validation); (ii) cross-subjects: feature analysis to check how well the best 10 features generalize with respect to individual subjects, in terms of separability of the 'Y'/'N' classes, using the t-SNE plot shown in Fig. 6.

Results
Table 9 shows the overall and subject-wise mean accuracy along with the standard deviation (experiment (i)). The column 'Mean Acc.' shows the average of the accuracies achieved during 5-fold cross-validation.

Discussion
For the frame-by-frame analysis with explainable features, Table 8 shows that LightGBM without the focal-loss objective function achieves a high frame-wise accuracy of 67.5% for unseen video (44.6% if the ±1 temporal window analysis is applied to account for human variation in ground-truth labelling) and an overall class-wise (macro) accuracy of 47.6%. However, the use of the focal-loss objective function with LightGBM increases the class-wise (macro) accuracy by 1.2%. For the quality assessment of motion with combined subjects, table 9 shows that LightGBM works very well, achieving an overall 97.4% mean accuracy with five-fold cross-validation. Moreover, to validate the generalization of the features across multiple subjects, the mapping of the 10 features selected by the forward sequential feature selector to a 2D plane (Fig. 6) using t-SNE clearly shows that the two classes are separable from each other, where green and red (0 and 1) represent 'N' (no weights) and 'Y' (with weights), respectively. This shows that a change in motion is certainly detectable with low-level features.

Conclusions
This paper presents the new benchmark dataset EatSense, which includes atomic actions, dense frame-level labels at multiple abstraction levels, and with/without-weight cases to simulate deterioration in motor movement. EatSense can be used as a generic training benchmark dataset for action recognition tasks specifically designed for the eating process. Furthermore, EatSense can also be used as a generic test benchmark suite for temporal action localization and action recognition.
We provide a systematic analysis of the performance of many deep learning frameworks on the dataset across multiple modalities for trimmed videos. We also discuss the performance of temporal action localization on the untrimmed videos in EatSense. However, the performance of current temporal action localization algorithms is not very good, and EatSense proves to be more challenging than the other publicly available datasets. This highlights the need for developing new approaches for untrimmed video understanding. Furthermore, we demonstrate the application capability of EatSense even where a low-level understanding of individual joints or hand-crafted features is required.
Future research includes extending the action recognition classes to include actions such as 'wipe mouth' and 'mix'. The deterioration classification currently has two classes, {with weights, without weights}, i.e., decayed and normal. We plan to extend the deterioration classification to a more fine-grained scale.
Ethics approval was obtained for data collection and distribution.

Figure 2 :
Figure 2: The levels of abstraction used in the dataset for labelling each of the 16 actions.

Figure 3 :
Figure 3: Distribution (in percentages) of various labels according to their occurrence in the dataset. Left) distribution of the 16 individual sub-actions. Middle) distribution of actions based on abstraction level 1 of the labels. Right) occurrence percentage of videos with weights ('Y') and without weights ('N').

Figure 4 :
Figure 4: Top) Actions performed in each of the individual videos. The vertical axis shows the name of each individual video, which has the format '{date}_{unix-time}', collectively marked by the keyword 'Project' in the dataset. Bottom) Actions performed by individual subjects. The variations in color indicate the frequency of occurrence of each action. It has subject IDs (Pid) on the vertical axis and actions on the horizontal axis. Vectorized image, best viewed zoomed in.

Figure 5 :
Figure 5: The left and right figures are the forward sequential feature selection plots for the action recognition classification and the motion quality assessment, respectively. The vertical axis shows the corresponding accuracy as features are added to the data. The horizontal axis lists the features as they are added to the classification process. The shaded region shows the error bounds of accuracy determined by cross-validation on different sets.

Figure 6 :
Figure 6: t-SNE plots for individual subjects with the 10 selected features mapped to 2D space after feature selection with all 27 subjects. Green is for no weights ('N'), and red is for weights ('Y') attached to the wrists of the subjects. The first row, from left to right, depicts subjects S1 to S5, and the remaining rows are arranged in a similar fashion.

Table 1 :
mAP (@IoU=0.50) and MSE (3D) of the skeleton estimation compared to hand-labelled ground-truth skeletons. For the second level of abstraction, the eating actions were broadly divided into five categories based on joint location and motion. The categories are hands-based, motion-based, head-position-based, body-posture-based and others.

Table 2 :
Average time in seconds taken by an instance of each action, and the total number of instances of each action, in the EatSense dataset.

Table 3 :
The table shows the diversity of subjects divided into five age groups, along with the tools, foods, and ethnic origins of all 27 subjects involved in the dataset. Pid refers to person IDs.

Table 4 :
Brief comparison of the proposed EatSense dataset against publicly available datasets used in action recognition, localization and healthcare research. C# stands for the number of classes (action classes; anomaly classes in the case of Init Gait DB), BUC stands for human behavior understanding capability, HCC stands for healthcare capability, UV stands for untrimmed videos, S# stands for the number of subjects, which is marked multiple (M) for datasets lacking specific numbers, Lbs indicates the type of labels, single (S),

Table 8 :
Raw frame-wise activity classification accuracy when testing on stratified splits and on unseen full videos. The stratified-splits results show the mean and standard deviation over the 5 test splits. Top-1, Macro and Std. show the frame-wise mean classification accuracy, the mean class-wise balanced accuracy and the mean standard deviation (after 5-fold CV). UV and FL stand for unseen videos and focal loss, respectively.