1 Introduction

Human activity recognition (HAR) has been studied and reviewed extensively in previous works [13, 15, 24, 47, 85, 105, 132, 160, 175, 190, 216]. It has remained in the limelight of video analysis research owing to rising needs in key areas such as entertainment, human-computer interaction (HCI), surveillance, and healthcare. In surveillance, activity recognition enables the detection of abnormal or dangerous behaviors so that the relevant authorities can be alerted. In entertainment, it can improve HCI, for example by embedding augmented reality (AR) features in real-time applications [27]. In healthcare, it can automatically recognize a patient's actions to support rehabilitation. Applications in human interaction [61], pedestrian traffic [93], abnormal home activity [63], human gestures [33], ballet [29], tennis [108, 203], other sports [133, 163], simple actions [1, 25, 60, 91, 113, 125, 145, 166,167,168, 187], and healthcare [2, 20, 64, 71,72,73,74, 77, 78, 81, 114, 118, 120, 124, 129, 131] are a few examples of HAR.

Indoor-HAR, one of the emerging sub-areas of HAR, deals with the recognition of activities limited to indoor environments such as homes, gyms, sports clubs, hospitals, schools, corridors, and parking lots. Indoor-HAR differs from general HAR in both its challenges and its applications. For example, activities like walking, mopping, and wandering are typical of a school corridor, whereas jumping, weight lifting, and running are typical gym activities. Activities inside a house, hospital, sports complex, gym, or any other closed environment tend to be more limited than outdoor activities. They can be basic ones such as walking, sitting, sleeping, studying, cooking, mopping, or drinking water, or complex ones such as “taking medication”, which involves a series of simple activities like “opening the pill box” and “drinking water”. Similarly, the complex activity “praying” can involve “sitting” or “standing”, “holding a book (studying)”, “ringing a bell”, or “lighting a candle” or “lamp”.

Some activities are simple in themselves, but the pose and orientation of the person performing them can make them difficult for a machine to recognize. An activity like “drinking water” while sitting, standing, or walking may yield different recognition rates. Further, activities that share similar movements can be confused with one another; for example, jumping and dancing or doing aerobics involve similar motions that may be ambiguous to the system.

Conventional deep learning methods cannot simply be reused to recognize activities in a given indoor environment, because each environment poses different challenges. Since the activity set in indoor HAR differs from one environment to another, one model cannot fit every scenario.

In other words, a HAR model built for general settings may not be suitable for a specific indoor environment. Indoor HAR is therefore a distinct track that requires the activity recognition problem to be framed according to its application area. Through this survey, we introduce Indoor-HAR as a new and emerging area of research. Indoor HAR can strengthen concepts like smart homes, Activities of Daily Living (ADL), automated patient care, and elderly care, all of which hold considerable research potential. In this survey, we discuss recent developments in indoor HAR, the different approaches and datasets, the use of indoor HAR for human behavior analysis, and its real-time applications.

The main contributions of this work are as follows:

  • Outlines a comparison of previous surveys in indoor HAR, including their merits and shortcomings.

  • Deliberates on the latest important developments, covering both general aspects of HAR and vision-based indoor-HAR systems in particular.

  • Elaborates the scope and wide application area of indoor HAR, discussing existing and proposed research works that can help in real-life situations.

  • Proposes a taxonomy for indoor-HAR based on input methodology, datasets, and challenges, exploring the role of each of these parameters and how they connect to HAR and indoor-HAR.

  • Highlights HAR and, specifically, indoor HAR datasets, including technical details such as dimensionality, fps, and number of activities.

  • Highlights the methodology used for indoor HAR, covering both handcrafted features and automatic feature learning.

  • Summarises the findings of various state-of-the-art works and the challenges and hurdles involved at different levels of processing and different stages of research for any indoor-HAR system.

The steps involved in HAR are shown in Fig. 1. Depending on the nature of the input data, HAR can be vision-based or sensor-based.

Fig. 1

Steps for HAR i) data collection ii) data pre-processing iii) training iv) recognition
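The four stages above can be sketched end to end as a toy pipeline. This is pure Python, and all stage implementations and the toy data are illustrative placeholders, not methods from any cited work:

```python
# Minimal sketch of the four HAR stages of Fig. 1; every stage body
# is an illustrative placeholder, not an actual published method.

def collect(frames):
    """i) Data collection: here, simply gather the raw frames."""
    return list(frames)

def preprocess(frames):
    """ii) Pre-processing: min-max normalise each frame vector."""
    out = []
    for f in frames:
        lo, hi = min(f), max(f)
        rng = (hi - lo) or 1.0
        out.append([(x - lo) / rng for x in f])
    return out

def train(samples, labels):
    """iii) Training: a nearest-centroid 'model' as a stand-in classifier."""
    groups = {}
    for s, y in zip(samples, labels):
        groups.setdefault(y, []).append(s)
    return {y: [sum(col) / len(ss) for col in zip(*ss)]
            for y, ss in groups.items()}

def recognise(model, sample):
    """iv) Recognition: return the label of the closest centroid."""
    dist = lambda a, b: sum((p - q) ** 2 for p, q in zip(a, b))
    return min(model, key=lambda y: dist(model[y], sample))

# Toy usage: 'walk' frames rise over time, 'run' frames fall.
data = preprocess(collect([[0, 1, 2], [0, 2, 4], [4, 2, 0], [2, 1, 0]]))
model = train(data, ["walk", "walk", "run", "run"])
```

Any real system replaces each placeholder with a substantial component (camera capture, denoising, a deep network, etc.), but the data flow between the four stages remains the same.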

Previously, there have been some surveys [6, 24, 34, 84, 141] in the field of HAR which have been summarised below in Table 1.

Table 1 Recent surveys in the field of HAR

The ultimate aim of this work is to bridge the gap between existing parameters and the real-time applications of indoor HAR. With this survey, we wish to enhance the usability of existing works for more practical and real-time scenarios.

The area of indoor-HAR is vast and offers wide scope for future work. Some of its application areas are discussed here.

The paper is organised as follows: Section 2 gives the literature review, and Section 3 discusses the input methodologies, which are RFID/sensor-based and vision-based. Section 4 discusses datasets; Section 5 discusses the approaches and human body representation. Section 6 discusses the challenges, and Section 7 gives conclusions and future scope.

2 Related work

The area of HAR has been popular for many years, and much has been achieved in it. This section discusses various applications of Indoor-HAR along with related and recent research in each area, involving both vision-based and sensor-based input.

2.1 Elderly care system

There is a growing need for extra care for the elderly and people with special needs, as their number is increasing while care staff are in short supply. During the COVID-19 pandemic, there were many instances where family members could not look after the elderly, and some died in isolation; some countries also faced a dire shortage of nurses and caretakers. Monitoring is particularly difficult when an elderly person lives alone. Wearable sensors are not very effective for older people, who tend to forget to wear them, and the discomfort of wearing them makes it hard to depend on such sensors entirely for activity monitoring. Moreover, sensor statistics are not as informative as actually watching a person perform an activity. In such cases, vision-based HAR, which deploys a camera to recognise a person's activities, can address the challenges and problems of wearable sensors. Related research in this field includes [69, 106]. Recent work includes [153], which proposes unsupervised HAR with skeletal Graph Laplacian Invariance.

Daily activities can be detected and a pattern observed, which helps in understanding the needs of the elderly better. Beyond regular monitoring, special alert systems can be designed for situations like falls, slipping, painful walking or limping, and other activity that is abnormal with respect to the environment. The risk of falling is one of the most prevalent problems faced by elderly individuals. A study published by the World Health Organization [198] estimates that between 28% and 35% of people over 65 years old suffer at least one fall each year, a figure that rises to 42% for people over 70. According to the same organization, falls account for more than 50% of elderly hospitalizations and approximately 40% of non-natural mortalities in this segment of the population. A model trained to detect a “fall” [72, 176, 206, 211] can be used to alert neighbors or other members of the household. The survey [212] discussed vision-based fall detection systems, dividing the approaches into four categories: individual single RGB cameras, infrared cameras, depth cameras, and 3D-based methods using camera arrays. Other works on fall detection are [67, 87, 149, 207]. Figure 2 gives an idea of how a vision-based system can be used for monitoring the elderly [35].

Fig. 2

Elderly monitoring vision-based system [35]
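For illustration only, a trained “fall” model of the kind cited above can be caricatured by a simple threshold rule on the tracked height of a person's centroid. The window size and drop ratio here are arbitrary assumptions, not taken from [72, 176, 206, 211], which use learned models rather than fixed thresholds:

```python
def detect_fall(heights, drop_ratio=0.4, window=3):
    """Flag a fall when the tracked centroid height drops below
    `drop_ratio` of its recent average over `window` samples.
    Heuristic sketch only; real fall detectors are learned models.
    Returns the index where the fall is detected, or None."""
    for i in range(window, len(heights)):
        baseline = sum(heights[i - window:i]) / window
        if baseline > 0 and heights[i] < drop_ratio * baseline:
            return i
    return None

# Standing (~1.7 m centroid height) followed by a sudden collapse:
trace = [1.7, 1.7, 1.68, 1.69, 0.4, 0.35]
```

A vision pipeline would feed this from a person detector/tracker; an alert to neighbours or caretakers would be raised when a non-`None` index is returned.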

2.2 Patient care system

Vision sensors (cameras) can be deployed to monitor a patient who needs constant observation, is undergoing treatment, or requires post-recovery care. Beyond the vital signs routinely monitored for such patients, a vision-based system can help in case of abnormalities or medical emergencies. Capturing visual cues, facial gestures, or the motion and activity of such patients can alert the care team, which can be significant for the patient's recovery.

2.3 Physical fitness

Intelligent systems designed to support physical fitness have been built over the years. Wearable fitness trackers have flooded the market; their sensors record our data and guide us toward better workouts, balanced eating, and sensible living. Research on pose estimation leads naturally to human activity recognition [48, 157]. Smartphones have also reshaped the fitness model: their ubiquity, together with their ever-growing computing, networking, and sensing power, has been changing the landscape of daily life [76, 191].

Wearable sensors and smartphones have opened a pool of research options in the fitness domain, but these methods can suffer from problems such as interference, data privacy, the discomfort of wearing sensors, and skin irritation [197]. Vision-based systems can be effective because they offer an unobtrusive solution for monitoring and diagnosis. Google's on-device, real-time body pose tracking with MediaPipe BlazePose is a notable example that uses the human body pose and can be used to build fitness and yoga trackers [82]. Figure 3 shows screenshots of real-time results captured from it.

Fig. 3

Google’s real-time pose estimation in fitness domain [82]
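Pose-based fitness tracking of this kind largely reduces to geometry on the detected keypoints. Independent of BlazePose's actual API, the angle at a joint, say the elbow during a bicep curl, can be computed from three 2D keypoints as follows (a generic sketch, not Google's implementation):

```python
import math

def joint_angle(a, b, c):
    """Angle at point b (in degrees) formed by segments b->a and b->c,
    e.g. a = shoulder, b = elbow, c = wrist for an elbow angle.
    Points are (x, y) keypoints from any pose estimator."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    cos = dot / (math.hypot(*v1) * math.hypot(*v2))
    cos = max(-1.0, min(1.0, cos))  # guard against rounding
    return math.degrees(math.acos(cos))
```

A rep counter could then threshold this angle frame by frame, e.g. counting one curl when the elbow angle passes below roughly 60 degrees and back above roughly 150 degrees (these thresholds are assumptions for illustration).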

2.4 Abnormal human activity recognition

An abnormal activity is any irregularity in the set of activities belonging to an environment. In an indoor environment, it means any activity that is a matter of concern or requires immediate attention [17]. We come across cases of domestic violence, child sexual and physical abuse, and even physical abuse of the elderly at home, and researchers have worked on providing solutions to this problem [185]. As per the reports [49], worldwide approximately 20% of women and 5-10% of men are sexually abused in their childhood; in India, 2 out of every 3 children are physically abused and every second child is a victim of emotional abuse. These malpractices are never carried out in the open, so protecting such children is an essential step toward a healthy and happy future. Indoor activity recognition may be vision-based or Wi-Fi-signal-based. Systems have been built for accurately detecting abnormal activities on commercial off-the-shelf (COTS) IEEE 802.11 devices [55, 215]; such systems make use of IoT and Wi-Fi signals for abnormal indoor activity recognition.

2.5 Human abnormal behavior analysis

Our society suffers from a lack of awareness about mental health, and talking about mental health issues remains taboo. Another useful application of indoor HAR is analyzing human behavior by studying the pattern of indoor activities. A sick person may show a different activity pattern than a healthy one: less active, or not performing routine activities as on usual days. This can also help in detecting mental illnesses like depression, where a person may repeat an activity or skip daily activities, and may be helpful for detecting situations like the onset of depression.

Recent work uses smartphones to collect sensory sequences from people performing their daily tasks; a cycle detection algorithm then segments the data sequence into activities [41].
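The cycle-detection idea can be illustrated by segmenting a periodic sensor trace at its local peaks; this is a simplified stand-in for, not a reproduction of, the actual algorithm of [41]:

```python
def segment_cycles(signal, min_gap=2):
    """Split a 1-D sensory sequence into cycles at its local maxima.
    `min_gap` suppresses peaks that are implausibly close together.
    Each consecutive pair of peaks bounds one movement cycle."""
    peaks = []
    for i in range(1, len(signal) - 1):
        if signal[i] > signal[i - 1] and signal[i] >= signal[i + 1]:
            if not peaks or i - peaks[-1] >= min_gap:
                peaks.append(i)
    return [signal[peaks[k]:peaks[k + 1]] for k in range(len(peaks) - 1)]

# A toy periodic "accelerometer" trace with three peaks:
trace = [0, 2, 0, 1, 3, 1, 0, 2, 0]
```

Each returned segment can then be classified independently, turning a continuous daily-activity stream into a sequence of labelled repetitions.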

Indoor-HAR taxonomy

In the present era, human activity recognition [12, 34, 56, 65, 188, 189] in videos has become a prominent research area in computer vision. Videos are used in daily living applications such as patient monitoring, object tracking, threat detection, security, and surveillance. In general, HAR algorithms are categorized along various aspects, from how data acquisition is performed to which activities are considered. Figure 4 shows the different facets of indoor HAR.

Fig. 4

Taxonomy of Indoor-HAR

The taxonomy is based on parameters such as the approach, data type, datasets, and 2D or 3D representation of the input. This survey also discusses the challenges and problems at different levels of HAR recognition for indoor activities. It covers the parameters shown in the figure and examines them one by one with respect to the existing work in each field and its characteristics.

3 Input methods: vision-based and sensor-based

Based on the type of data generated, indoor HAR methods are classified into two main groups: sensor/RFID-based HAR and vision-based HAR. [97, 202] discuss another category, multimodal HAR, in which sensor data as well as visual data are used to detect human activities.

Sensor technology has advanced in multiple respects, including computational power, size, accuracy, and manufacturing cost. This has widened the range of portable devices, such as wristbands and smartphones, that use the technology to record data for activity recognition. The growth of sensor technology has also fostered techniques for pervasive human tracking, silhouette tracking, detection of uncertain events [90, 98, 208], human motion observation, and emotion recognition in real environments [3, 99, 192].

Human tracking and activity recognition require feature extraction and pattern recognition techniques tailored to the input data from the sensors involved (i.e., motion sensors and video cameras) [4, 5, 100, 152, 204]. Motion-sensor-based activity recognition classifies sensory data from one or more sensor devices. [39] presented a comprehensive review of state-of-the-art activity classification methods using data from one or more accelerometers. In [102], the classification approach is based on random forest (RF) features, recognising five daily routine activities from a Bluetooth accelerometer placed on the chest using a 319-dimensional feature vector. In [22], fast Fourier transform (FFT) features and a decision tree classifier are used to detect physical activity from biaxial accelerometers attached to different parts of the body. However, motion-sensor-based approaches are often impractical because users are uncomfortable wearing electronic sensors in daily life, and combining multiple sensors to improve recognition performance incurs a high computational load.
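The FFT-plus-classifier recipe of [22] can be sketched as follows, with a naive DFT and a nearest-centroid classifier standing in for the FFT and the decision tree used in that work (a hedged illustration, not the cited method):

```python
import math

def dft_magnitudes(window, k_max=4):
    """First k_max DFT magnitude coefficients of one accelerometer
    window; a naive O(n*k) stand-in for an FFT."""
    n = len(window)
    feats = []
    for k in range(1, k_max + 1):
        re = sum(x * math.cos(2 * math.pi * k * t / n)
                 for t, x in enumerate(window))
        im = sum(-x * math.sin(2 * math.pi * k * t / n)
                 for t, x in enumerate(window))
        feats.append(math.hypot(re, im) / n)
    return feats

def classify(centroids, feats):
    """Nearest-centroid stand-in for the decision tree of [22]."""
    dist = lambda a, b: sum((p - q) ** 2 for p, q in zip(a, b))
    return min(centroids, key=lambda y: dist(centroids[y], feats))

# Toy signals: a slow oscillation ("walking") vs a fast one ("running").
slow = [math.sin(2 * math.pi * 1 * t / 16) for t in range(16)]
fast = [math.sin(2 * math.pi * 4 * t / 16) for t in range(16)]
centroids = {"walking": dft_magnitudes(slow),
             "running": dft_magnitudes(fast)}
```

The frequency features separate the two movements cleanly because walking and running concentrate energy at different dominant frequencies.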

Recently, applications using closed-circuit television (CCTV) for monitoring and security have increased tremendously, driven by CCTV technology that offers better video quality, simpler setup, lower cost, and secure communication. Although each type of sensor targets specific services and applications, sensors generally collect raw data ubiquitously, and general knowledge is acquired by analyzing the collected data. HAR allows machines to analyze and comprehend human activities from input sources such as sensors and multimedia content. HAR is applied in surveillance systems, whether for the home [21, 147], for mass crowd monitoring [38, 95], or for detecting humans under distress using UAVs [134], as well as in behavior analysis, gesture recognition, patient monitoring, ambient assisted living (AAL), and a variety of healthcare systems involving direct or indirect interaction between humans and smart devices. HAR for elderly care has been a key area of work [37, 101, 109]. These sensor technologies can improve the robustness of the data through which human activities are detected and can provide services based on information sensed from real-time environments, such as cyber-physical-social systems [205]; magnetic sensors embedded in a smartphone can even track positioning [19]. These points are summarised in Fig. 5, and Table 2 lists recent works with sensor-based and vision-based input along with the techniques used.

Fig. 5

Disadvantages of wearable sensors/RFIDs

Table 2 Recent work with sensor-based and vision-based inputs

3.1 Vision-based input

With the ever-developing field of HAR, there is a need for a more practical way of capturing data. This is where vision-based HAR proves more efficient than sensor-based HAR: sensor-based HAR requires placing sensors according to the activity, which is difficult to plan, whereas vision-based HAR suits real-time videos, the most common source of input given the proliferation of CCTVs and cameras. Vision-based HAR is an important research area of computer vision that identifies the activity performed in a video or image. The task is cumbersome due to challenges such as cluttered backgrounds, shape variation of the objects and people involved, illumination differences, camera placement and viewpoint, and stationary versus moving cameras. These challenges further depend on the category of activity considered. There are four major categories of activities, gestures, actions, interactions, and group activities, represented in Fig. 6; indoor HAR can include activities from all of these categories.

Fig. 6

Categories of Activities

[93] discusses components of an automated “smart video” system to track pedestrians and detect situations where people may be in peril, as well as suspicious motion or activities at or near critical transportation assets. Gait-based activity recognition identifies an activity by the manner in which a person walks. Identifying human activities in a video, such as a person walking, running, jumping, or jogging, is important in video surveillance. Gupta et al. [86] contribute a model-based approach for activity recognition using the movement of the legs only.

Researchers have also worked on efficient depth-video-based HAR systems that monitor the activities of elderly people around the clock and provide them with an intelligent living space that makes life at home more comfortable.

HAR is also influenced by people's culture. This was established in [139], which proposed that daily life activities such as eating and sleeping are deeply influenced by a person's culture, generating differences in the way the same activity is performed by individuals from different cultures, and that taking cultural information into account can improve the performance of automated human activity recognition systems. The authors proposed four different solutions and used a Naive Bayes model to associate cultural information with semantic information extracted from still images, evaluated on a dataset of individuals lying on the floor, sleeping on a futon, and sleeping on a bed. Vision-based HAR involves further processing steps, which are discussed in later sections.

4 Datasets: RGB, RGB-D

A great deal of research over the past decades has been based on RGB images [11, 50, 94], which are closely associated with computer vision. RGB images capture the appearance of objects in a frame, but colour alone provides limited information: with it, partitioning a foreground and background of similar colours and textures is a cumbersome task. Additionally, the object appearance described by RGB images is not robust against common variations, such as changing illumination, which hinders the use of RGB-based vision algorithms in real-world settings. While work continues on more sophisticated algorithms, research is also seeking new representations that better capture scene information. Because depth information complements visual (RGB) information, RGB-D images and videos are an emerging data representation that can help solve these fundamental problems. It has been shown that combining RGB and depth information in high-level tasks (e.g., image/video classification) can dramatically improve classification accuracy [179, 180]. Table 3 shows examples of RGB datasets for activity recognition, and Table 4 lists datasets available for indoor activity recognition.

Table 3 RGB DATASETS
Table 4 Indoor Activities Datasets

These datasets are divided into two categories based on the modality in which they are recorded: RGB and RGB-D. Before 2010, a large number of RGB video datasets were available [40]. With the advent of low-cost depth sensors such as the Microsoft Kinect, there has been a drastic increase in 3D and multi-modal video datasets [13, 119, 132]. Thanks to low-cost, lightweight sensors, datasets are now recorded in multiple modalities such as depth frames, accelerometer readings, IR sensor frames, acoustic data, and skeleton data. The multiple modalities of RGB-D datasets reduce the loss of information in 3D videos compared to traditional RGB datasets, at the cost of increased complexity. Table 3 shows a few of the existing datasets for activity recognition, including multi-modal examples.

The challenges in performing HAR on such complex datasets include occlusion, cluttered backgrounds, localization of activity, and noise removal. Using a depth sensor like the Microsoft Kinect to create RGB-D datasets also poses practical problems such as sensor placement and orientation. The Kinect was initially introduced as an Xbox accessory, enabling players to interact with the Xbox 360 through body language or voice instead of an intermediary device such as a controller. Later, owing to its ability to provide accurate depth information at relatively low cost, its usage extended beyond gaming into the computer vision field. Equipped with intelligent algorithms, the device contributes to various applications such as 3D simultaneous localization and mapping (SLAM) [89, 117], people tracking [151], object recognition [30], and human activity analysis [42, 128].

5 Approaches of HAR

Activity recognition systems classify and recognise the activities for which they have been trained. Since more, and more varied, data is always better for such systems, the features extracted from this data are a crucial part of the HAR process. This leads to two approaches to feature extraction: handcrafted features (machine learning) and automated feature learning (deep learning). Traditionally, researchers manually extracted features from the data (handcrafted features) [31, 116] to train machine learning models; with the development of deep neural networks [112], models tend to learn features automatically. Following the great success of CNNs on image classification, CNNs were applied to recognize human activities in videos; well-known CNN-based approaches for action recognition in RGB videos are presented in [107, 123, 171, 194, 196]. With the introduction of inexpensive RGB-D sensors, researchers shifted their attention toward other vision cues such as depth and skeletal data alongside RGB data. One advantage of depth and skeletal data over traditional RGB data is that they are less sensitive to changes in lighting conditions. Furthermore, the availability of well-known and diverse RGB-D datasets such as MSR Action 3D [58], UTD-MHAD [43], Berkeley MHAD [150], CAD-60 [178], SBU Kinect Interaction [150], and many more has encouraged extensive research on human activity recognition using multiple vision cues. Both approaches are efficient in their own respects; Table 5 lists a few recent works using each.

Table 5 Handcrafted and automated feature in HAR

The results of HAR models also vary with how the human body is represented. Human body representation can be classified into 2D and 3D methods. 2D methods take into account spatial information, contour, shape, and skeletal information; 3D methods describe skeletal joints in terms of depth together with other characteristics such as the spatial and temporal components of the frame.

The skeleton is a high-level representation that describes human activity very precisely and is well adapted to activity analysis, pose estimation, and action recognition. Skeleton data comprise the coordinates of the human body's key joints over time, an important factor in representing the motion of each action. Body pose is an important indicator of human actions: it is a high-level semantic feature that has proven effective in recognizing actions through the discriminative geometric relations among body joints. Recent research focuses on either 2D or 3D modeling of these joints. Most of the latest research in automated HAR is in fact based on skeleton data and uses depth devices such as the Kinect to obtain three-dimensional (3D) skeleton information directly from the camera. Although such research achieves high accuracy, it is strictly device-dependent and cannot be used for videos from other cameras.
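Whatever the capture device, a common first preprocessing step for such joint-coordinate sequences is to express each frame relative to a reference joint such as the hip centre, removing global translation so the features describe pose rather than position in the room. A generic sketch, not tied to any specific cited method:

```python
def normalise_skeleton(frames, root=0):
    """Express every joint of every frame relative to the root joint
    (e.g. the hip centre). `frames` is a list of frames, each frame a
    list of (x, y, z) joint coordinates; returns the translated copy."""
    out = []
    for joints in frames:
        rx, ry, rz = joints[root]
        out.append([(x - rx, y - ry, z - rz) for (x, y, z) in joints])
    return out

# Two frames of a toy 3-joint skeleton, translated between frames
# (the person walked across the room but held the same pose):
seq = [[(1, 1, 0), (1, 2, 0), (1, 0, 0)],
       [(5, 1, 0), (5, 2, 0), (5, 0, 0)]]
```

After normalisation, the two frames become identical, which is exactly what a pose-based recogniser wants: the same pose should map to the same feature regardless of where in the room it occurs.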

The 2D representation approach is effective for applications where precise pose recovery is not needed or not possible [154]; low-resolution images or a single viewpoint, as with typical surveillance camera positioning, can also motivate 2D methods. 3D approaches suit scenarios requiring a high level of discrimination between various unconstrained and complex human movements (Table 6).

Table 6 2D and 3D representation approach

6 Challenges in indoor HAR

Human activity recognition is an active and pervasive field for researchers not only from Computer Science, Electrical Engineering, and Informatics but also from many interdisciplinary backgrounds, because the field involves many challenges. From capturing data to the recognition process, every step requires a thoughtful procedure for recognizing the targeted activities, and the environment for regular activities can be constrained and noisy.

The hardware involved has to be carefully thought out. One has to take into consideration real-world hardware implementation, efficient and resource-constraint computation, novel physics-based sensing, and socio-economical and human factor-related challenges.

The activities for which the model is to be built also help decide the approach. Indoor activities performed under controlled, static background conditions may not require complex pre-processing, whereas localization and denoising become integral when activities are performed against complex backgrounds with multiple objects in a single frame. Choosing features is again a challenge in building a good model: machine learning traditionally used handcrafted features, but with the successful adoption of convolutional neural networks, the deep learning approach has gained momentum and become the preferred choice. Broadly, the challenges in the HAR pipeline can be categorized as application-level, data-acquisition-level, approach-level, algorithm-level, and implementation-level challenges, represented in Fig. 7 below.

Fig. 7

Various levels of challenges and factors involved in Indoor HAR

6.1 Application-level challenges

The application area of indoor HAR is a major factor in studying the challenges involved. The general application areas of HAR are surveillance, elderly care [176, 210], assisted living [185], smart homes, abnormal activity recognition [115, 131, 215], sports [164], etc. The challenges vary with the area of application, and the intended application helps decide the subsequent steps for building a system that addresses the challenges of that research problem.

6.2 Data acquisition level challenges

As previously discussed, the data for indoor HAR, a video or an image from a video, can be captured using either vision-based or sensor-based methods. Some studies group images/videos under the sensor category for simplicity. This survey focuses on vision-based input but also discusses current research on both sensor-based and vision-based input.

6.2.1 Real-world data gathering

Data captured from any medium may contain noise from the environment in which it was recorded. For example, speech data from a kitchen activity will have background noise; similarly, images or video captured from cameras placed indoors, such as in a living room, can suffer from dust particles, low visibility [122, 155], etc. Such noise can hurt the model's accuracy.

6.2.2 Number and type of devices: single camera or multiple cameras

Another problem is the choice of device for capturing the data. Data can be captured from a static camera installed in the region of the activity, or collected using various types of sensors (wearable, ambient) as discussed in Section 2. In both cases, the number and placement of the sensors or cameras matter: the data loses its worth if it misses the critical features needed to recognize a particular activity, so capturing devices should be thoughtfully placed. The intended application matters as well, together with the choice of sensor. For example, monitoring the elderly with a wearable sensor may need only one sensor on the wrist, whereas a camera-based setup may need two or more cameras to detect and capture the whole activity.

6.2.3 Power consumption, storage, computation, and privacy

With the growing range of activity-monitoring applications, CCTVs have gained popularity in various segments, including ambient assisted living (AAL) [54] and elderly care [165]. The challenge is the increasing demand for power, storage, and computation, which are essential pillars of a successful activity recognition model. These factors cannot be overlooked, and reducing any of them can be a game changer for the model.

Apart from storage and power consumption, the privacy of the data and of the person being observed is also very important. An elderly person may feel uncomfortable having their daily activities monitored by a camera, while wearable sensors raise other concerns related to networking.

6.3 Approach level challenges

Recognition and analysis by a model require in-depth knowledge of which features will prove efficient. Understanding feature dimensionality and interpretability is also key to creating a powerful yet computationally light model.

6.3.1 Handcrafted vs. deep learning

The impressive classification performance of deep learning methods has made them popular among researchers, although they typically require large training sets to achieve that accuracy. Meanwhile, handcrafted (HC) features have been used for decades and still serve as a powerful tool when combined with machine learning classifiers. Researchers continue to explore which technique best suits their problem [16, 127]. Table 5 lists some of the latest research works using either or both of these approaches.

6.3.2 Dimension reduction

Another important challenge is reducing the size of the feature vector so that it is cheaper to process; features that do not contribute to recognition should be excluded from the pipeline. Researchers have been working on this aspect to find better solutions to the problem. Table 7 lists some recent works on dimensionality reduction for deep learning and machine learning techniques.
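As an illustration of this idea, the sketch below applies principal component analysis (PCA) via a singular value decomposition to a hypothetical matrix of activity descriptors, keeping only enough components to explain 95% of the variance. The feature dimensions and data here are placeholders, not values from any surveyed work.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 512))  # hypothetical: 200 clips, 512-dim descriptors

# Center the data, then project onto the top principal components.
X = features - features.mean(axis=0)
U, S, Vt = np.linalg.svd(X, full_matrices=False)
var_ratio = (S ** 2) / (S ** 2).sum()
k = int(np.searchsorted(np.cumsum(var_ratio), 0.95)) + 1  # components covering 95% variance
reduced = X @ Vt[:k].T

print(reduced.shape)  # (200, k) with k well below 512
```

In practice, the retained variance threshold trades off compactness against the risk of discarding discriminative information.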

Table 7 Few of the recent works for dimensionality reduction

6.3.3 Combination of automatically-calculated and handcrafted features

Researchers have been trying to combine handcrafted features (machine learning) with automatically calculated (deep learning) features to design systems that are both accurate and robust. Table 5 lists a few recent works that combine both types of features in HAR models. Choosing the appropriate techniques and comparing them against other models remains a challenging task.
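A common fusion strategy, shown as a minimal sketch below, is simple concatenation of the two feature vectors per sample before classification. The dimensions (a 64-dim handcrafted descriptor and a 256-dim deep feature vector) are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
hc_features = rng.random((50, 64))     # e.g., histogram-based handcrafted descriptors
deep_features = rng.random((50, 256))  # e.g., CNN penultimate-layer activations

# Early fusion: concatenate both representations per clip.
fused = np.concatenate([hc_features, deep_features], axis=1)

print(fused.shape)  # (50, 320)
```

Because the two feature types can have very different scales, per-feature normalization before concatenation is usually advisable.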

6.4 Algorithm-level challenges

These challenges concern which pre-processing steps to include, how much data to use, and learning with limited labels; they are correlated and overlap with the issues discussed above. For better HAR performance, data pre-processing is a crucial step in both machine learning and deep learning pipelines. Data pre-processing includes data cleaning, normalization, transformation, feature extraction, and selection [111].
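The first few of those steps can be sketched on a toy feature matrix as follows; the function below is an illustrative assumption, not a pipeline from any cited work. It drops rows with missing values (cleaning), z-score normalizes each feature, and removes near-constant features (a crude form of selection).

```python
import numpy as np

def preprocess(X):
    """Toy pre-processing: cleaning, normalization, and a crude feature selection."""
    X = X[~np.isnan(X).any(axis=1)]            # cleaning: drop rows with missing values
    mean, std = X.mean(axis=0), X.std(axis=0)
    keep = std > 1e-8                          # selection: drop near-constant features
    return (X[:, keep] - mean[keep]) / std[keep]  # normalization: z-score

raw = np.array([[1.0, 2.0, 5.0],
                [2.0, np.nan, 5.0],
                [3.0, 4.0, 5.0]])
clean = preprocess(raw)
print(clean.shape)  # (2, 2): one incomplete row and one constant column removed
```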

An important issue in HAR is whether to use training data from a general population (subject-independent) or personalized training data from the target user (subject-dependent). Past research has shown better results with personalized data, but collecting end-user data for training is often impractical, so the subject-independent approach is more common. The authors of [164] introduce a novel approach that uses nearest-neighbor similarity to identify the examples from a subject-independent training set that are most similar to sample data obtained from the target user, and then uses these examples to generate a personalized model for the user.
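The selection step of such a scheme might look like the sketch below: for each of the target user's samples, take the k nearest (Euclidean) examples from the subject-independent set and train on their union. This is a hedged illustration of the general idea, not the exact method of [164]; all data, dimensions, and the choice of distance are assumptions.

```python
import numpy as np

def personalize(train_X, train_y, user_X, k=5):
    """Collect the k subject-independent examples nearest to each user sample."""
    selected = set()
    for x in user_X:
        dists = np.linalg.norm(train_X - x, axis=1)  # Euclidean distance to every example
        selected.update(np.argsort(dists)[:k].tolist())
    idx = sorted(selected)
    return train_X[idx], train_y[idx]  # personalized training subset

rng = np.random.default_rng(1)
train_X = rng.normal(size=(100, 8))          # subject-independent pool
train_y = rng.integers(0, 4, size=100)       # hypothetical activity labels
user_X = rng.normal(size=(3, 8))             # a few unlabeled target-user samples

pX, py = personalize(train_X, train_y, user_X, k=5)
print(len(pX) <= 15)  # at most k examples per user sample (duplicates merged)
```

A classifier trained on `(pX, py)` would then serve as the personalized model.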

6.5 Implementation-level challenges

Implementation of the model finally shows how the system behaves in real time, where challenges such as low lighting conditions and occlusions arise; overall, the requirement is for a robust and reliable system. These challenges also include network issues, which may cause real-time glitches in HAR. Both accuracy and execution time are important for a real-time system. If the system takes too long to process, the lag undermines the real-time effect, making the system inefficient; if the system processes quickly but with low accuracy, it is unreliable. Both concerns should therefore be addressed equally.

7 Conclusion and future work

In this survey, we have reviewed recent trends in the field of indoor human activity recognition and proposed a taxonomy for indoor HAR. The review covers state-of-the-art HAR input methodologies, approaches, datasets, and human body representation methods. The datasets used in various recent research works have also been listed, and the survey outlines the challenges associated with indoor HAR at various levels. Indoor-HAR can provide smart solutions for elderly care, patient care systems, physical fitness systems, abnormal human behavior detection and analysis, etc. We can monitor our indoor activities to observe changes in activity patterns before the onset of any disease or major illness. This can help in situations where such changes go unnoticed because we tend to ignore certain changes in the body; such systems can generate data and help us observe physical changes. Cases of child abuse increased during the pandemic, and such systems can also help in these situations by detecting abnormal behavior, falls, domestic violence, child abuse, etc. Table 8 summarises the survey in terms of the individual components of the proposed taxonomy. Indoor-HAR is a vast area with many key and challenging applications.

Table 8 Summary of Proposed Taxonomy in terms of Components