Age estimation algorithm based on deep learning and its application in fall detection

With the continuous development of society, age estimation based on deep learning has gradually become a key link in human-computer interaction and is widely combined with other application fields. This paper grades human fall behavior according to the estimated age of the person so that key populations are detected with priority. A phased single aggregation backbone network, VoVNetv4, is proposed for feature extraction. A regional single aggregation module, the ROSA module, is constructed to encapsulate the feature modules by region, and an adaptive stage module is used for feature smoothing. The CORAL framework is used as the classifier: it divides the task into binary subtasks and makes consistent predictions for each of them. In addition, a gradient two-node fall detection framework combined with age estimation is designed, in which detection is divided into a primary node and a secondary node. In the primary node, the age estimation algorithm based on VoVNetv4 classifies people into age groups, and a face tracking algorithm is constructed by combining the human body key point matrix produced by OpenPose with the center coordinates of the face. In the secondary node, the age gradient information is used to detect human falls with the AT-MLP model. The experimental results show that, compared with Resnet-34, the MAE value of the proposed method decreased by 0.41. Compared with curriculum learning and the CORAL-CNN method, the MAE value decreased by 0.17 relative to the RMSE value. Compared with other methods, the error of the method in this paper was significantly lower, with a largest drop of 0.51.


Research background and significance
Artificial intelligence has become a dominant topic in computer science and has attracted increasing attention because it is a necessary means of developing social intelligence. It simulates human thinking through machine learning. Major breakthroughs in artificial intelligence technology have led to its wide use in production and learning, and in every area of everyday life it drives the development of society [1]. Computer vision is a prominent area of machine learning: it gives computers the ability to recognize the content of pictures and videos. Computers acquire information in a way similar to people, working through cameras, massive data, and related algorithms [2]. With the continuous development of society, computer vision has gradually been applied to various industries. It performs well in image classification, detection, and segmentation, though there is still plenty of room for expansion.
With the evolution of society and the ubiquity of video, the volume of data has exploded. Image-based visual detection and classification has gradually been replaced by video-based methods that obtain information from video surveillance. Analysis based on large sets of historical information has gradually become an inevitable trend of social development [3]. Population aging has become a major concern of society, and an increasing number of elderly people live alone. Compared with the detection of other abnormal behaviors, monitoring falls in the elderly is receiving more and more attention and is significant for promoting human health.
The aging of the population is clearly visible in the social structure and is still increasing. Compared with the previous decade, the number of individuals over 60 years old has increased significantly [4]. According to a survey of the causes of accidental injuries suffered by the elderly, injuries from falls account for a large proportion of deaths. Since the damage caused by a fall is irreversible, fall detection for the elderly has become a major concern of society [5]. More and more elderly people now live alone, so monitoring the physical health of the elderly is particularly necessary in old-age care, including for those who live in nursing homes. Monitoring human falls from video and visual information has become an inevitable trend of social development.
In human fall detection, the fall itself is usually taken as the focus of detection, and the age of the person who fell is ignored. Falls in different age groups tend to have different consequences; therefore, the measures taken after a fall also differ, and focused, age-based monitoring of falls is especially necessary. This paper studies human fall behavior and analyzes the specific problems it faces. One is the data explosion that occurs when too many videos are used as input, which greatly affects the real-time performance of fall detection. Another is that fall detection for the elderly needs to be graded in the form of gradients. To address these problems, this paper constructs a fall prediction mechanism for people of different ages.

Research status at home and abroad
Human fall detection is a classification problem based on human posture information and age.
Information about the human face and biological characteristics of the human body needs to be gathered to make predictions about how old people are through the use of an age estimator. In recent years, researchers at home and abroad have performed numerous experiments on age estimation.
Age is among the most important characteristics of a person. In human-computer interaction, age estimation has important application value in video surveillance and other areas, but it has always been difficult. Depending on how much facial information is available, the estimate often contains some error, and facial expressions and skull shape also affect the results. Many scholars and researchers have made improvements to age estimation algorithms.
Li et al. [6] proposed a refined network, Local Response Normalization (LRN), which simultaneously uses packaging label distribution and a regression essence, with a relaxation regression refining field for age discrimination. Yu et al. [7] addressed the problems of insufficient data volume and differences in data distribution: based on a fine-tuned network, they composed a classification network of two types of Convolutional Neural Networks (CNN) for age estimation. Li et al. [8] provided novel ideas and prospects for the application of CNNs, giving an overview of various convolutions and outlining rules of thumb for function and hyperparameter selection. Guehairia et al. [9] used the gcForest algorithm to perform age estimation experiments on images; the algorithm has the advantage of a cascading structure that allows interactions between trees. Badr et al. [10] proposed a cascaded model system in which a classification model learns the division of age labels and the knowledge it learns is used as the auxiliary input of a regression model to achieve age estimation. Zhang and Bao [11] used a regression forest to estimate age from face images together with head posture. Pramanik and Dahlan [12] combined a convolutional neural network framework with the Resnet50 architecture to propose a fast method for age estimation from face images. Chang et al. [13] proposed the ordinal hyperplane ranking algorithm OHRank, which determines age based on the relative order information of the age labels in the database. Many researchers have conducted experiments on the Asian face age dataset (AFAD). For example, Wang et al. [14] combined curriculum learning with age estimation to improve the training efficiency of the neural network and enhance its ability to discriminate age. Santos et al. [15] proposed a fall detection system based on commodity mmWave sensors together with a body-feature estimation algorithm to overcome the low-resolution deficiency. They compared several body-feature estimation algorithms, selected the best one, and illustrated the potential of such algorithms via a case study of a threshold-based fall detection system. Niu et al. [16] broke with the traditional classification and regression approaches to age estimation by adopting sequential regression to carry out feature learning and regression modeling at the same time. Using deep learning, end-to-end learning of convolutional neural networks was applied to the regression problem, and the age estimation effect was effectively improved. Age estimation is a key part of human-computer interaction and can be widely combined with other application fields. It has a profound impact on the development of society.

VoVNetv4 overall network analysis
Low-level features carry details that high-level features lack, so they provide richer detail; however, their semantics are poor and they contain strong noise. High-level features are usually less sensitive to fine details and usually have a lower resolution than low-level features, but they carry much more robust semantic information. The goal of feature aggregation is to efficiently integrate low-level features with high-level features, and it is an important means of improving model performance. VoVNetv4 adopts module splicing to pass early features forward, so the original feature form is preserved; it aggregates multiple receptive fields to capture various kinds of visual information across dimensions and extracts features with a diversified representation. The VoVNetv4 network is composed of four ROSA modules and adaptive stage modules. All features are connected only once in each ROSA module. First, a single aggregation is performed at the module output. Then, the output of each ROSA module is processed by the adaptive stage module. Finally, a one-time aggregation is performed. The VoVNetv4 network can effectively avoid feature redundancy and improve the detection accuracy of the network. The overall architecture of the network is shown in Figure 1.

ROSA module
In object detection, most networks use Resnet [17] and DenseNet [18] as backbone networks. DenseNet builds on Resnet by densely connecting the convolution layers to ensure the flow of information between layers. Dense connections bring feature enhancement, but they also have the disadvantage that the number of channels grows linearly with depth. To address this shortcoming, the VoVNet [19] network makes a lightweight improvement: it replaces the dense connections with a single aggregation of all features in the last layer. Although VoVNet effectively reduces the complexity and memory access cost of the original network, its recognition accuracy is not significantly improved. The VoVNetv4 proposed in this paper draws on the improvement idea of VoVNet and introduces the ROSA module, which performs a single aggregation in phases. The aggregation calculation of VoVNet is shown in Figure 2. In the ROSA module, the computation is divided into regions, where each region represents one stage of operation. The ROSA module consists of four regions. Each region contains 5 identical ConvBlocks with the same input and output channels. A ConvBlock consists of a 3 × 3 convolution layer, a BN layer, and a ReLU layer, and the convolution has a stride of 1. During the aggregation operation, region 1 and region 2 are looped only once, region 3 is looped four times, and region 4 is looped three times. Each time the next stage of the ROSA module is entered, the feature map is downsampled by a 1 × 1 convolution with stride 1, a 3 × 3 convolution layer, and a max pooling layer with stride 2. Because high-level semantic information is more important than low-level semantic information in target detection tasks, the ROSA module adds high-level features through the different regional operations and thereby increases the ratio of high-level features to low-level features.
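The channel-growth argument above can be made concrete with a small counting sketch. It compares the number of input channels seen by each layer under dense connections with the number seen when features are aggregated only once at the end of a block. The function names and the channel widths (64 input channels, 32 channels per layer) are illustrative assumptions and are not the VoVNetv4 configuration; only the block depth of 5 follows the text, and the aggregation is assumed to concatenate the block input with every layer output.

```python
def dense_input_channels(c_in, growth, num_layers):
    """Input channels of each layer when every layer sees all earlier outputs."""
    return [c_in + i * growth for i in range(num_layers)]


def one_shot_input_channels(c_in, width, num_layers):
    """Input channels of each layer when features are aggregated only once, at the end."""
    # The first layer sees the block input; later layers see only the previous layer.
    return [c_in] + [width] * (num_layers - 1)


def one_shot_aggregate_channels(c_in, width, num_layers):
    """Channels entering the final aggregation (assumed: block input plus every layer output)."""
    return c_in + num_layers * width


if __name__ == "__main__":
    print(dense_input_channels(64, 32, 5))         # grows with depth
    print(one_shot_input_channels(64, 32, 5))      # constant after the first layer
    print(one_shot_aggregate_channels(64, 32, 5))  # single wide concatenation
```

With dense connections the per-layer input grows linearly with depth, whereas with a single aggregation the intermediate layers keep a constant input width and the cost of the wide concatenation is paid once.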

CORAL framework classifier
The age estimation scheme proposed in this paper includes three stages: image preprocessing, feature extraction, and age classification. The CORAL framework is an ordered classifier used in the age classification stage. In a ranking task, it is important to reduce the differences within a class relative to the variation between different classes; this determines whether good classification results can be achieved. The CORAL framework converts ordinal targets into binary classification subtasks, which can effectively improve the classification effect. The derivation of the CORAL framework starts from Eq (1):
$$D = \{x_i, y_i\}_{i=1}^{N} \tag{1}$$
where $D$ is the data set, $x_i$ is the i-th picture, $N$ is the number of samples, and $y_i$ represents the corresponding rank value. The rank is the age rank of a human body within a certain range, and $Y$ denotes the set that the rank belongs to. The relation between $y_i$ and $Y$ is shown in Eq (2):
$$y_i \in Y = \{r_1, r_2, \ldots, r_K\} \tag{2}$$
The elements contained in set $Y$ are arranged in order, as shown in Eq (3):
$$r_K \succ r_{K-1} \succ \cdots \succ r_1 \tag{3}$$
The main purpose of the ordinal regression task is to find a ranking rule $h$, that is, a mapping from age pictures to age ranks, that minimizes a loss function. Let $C$ be a $K \times K$ cost matrix, where $C_{y, r_k}$ represents the loss incurred when a sample $(x, y)$ is predicted as rank $r_k$.
When $y = r_k$, $C_{y, r_k} = 0$: the age of the picture corresponds exactly to the predicted age rank, and the network model makes a perfect prediction. When $y \neq r_k$, $C_{y, r_k} > 0$. In ordinal regression, a V-shaped cost matrix is more conducive to feature learning and classification, but in actual computation it is hard to obtain a cost matrix in V-shaped form. The CORAL framework produces consistent predictions for each binary task and avoids the disadvantages of a V-shaped matrix. The rank $y_i$ is extended into $K-1$ binary labels, as shown in Eq (4):
$$y_i^{(1)}, y_i^{(2)}, \ldots, y_i^{(K-1)} \tag{4}$$
where $y_i^{(k)} \in \{0, 1\}$ indicates whether $y_i$ exceeds rank $r_k$. For example, $y_i^{(k)} = \mathbb{1}\{y_i > r_k\}$: if the inner condition is true, the indicator function $\mathbb{1}\{\cdot\}$ is 1; otherwise, it is 0. Given the responses of the binary tasks, the predicted rank label of input $x_i$ is obtained by $h(x_i) = r_q$. The rank index $q$ is obtained by $q = 1 + \sum_{k=1}^{K-1} f_k(x_i)$, where $f_k(x_i) \in \{0, 1\}$ is the prediction of the k-th binary classifier in the output layer.
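The label extension and rank decoding described above can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the authors' implementation: the function names are hypothetical, ranks are represented by their 1-based index, and the 0.5 decision threshold used to turn each binary classifier's probability into a 0/1 prediction is an assumption.

```python
def extend_labels(rank_index, num_ranks):
    """Binary labels y^(k) = 1{y > r_k} for k = 1..K-1 (rank_index is 1-based)."""
    return [1 if rank_index > k else 0 for k in range(1, num_ranks)]


def predict_rank_index(probabilities, threshold=0.5):
    """Rank index q = 1 + sum_k f_k(x), with f_k(x) = 1 when the k-th probability exceeds the threshold."""
    return 1 + sum(1 for p in probabilities if p > threshold)


if __name__ == "__main__":
    # With K = 5 ranks, a sample of rank 3 exceeds r_1 and r_2 but not r_3 or r_4.
    print(extend_labels(3, 5))
    # Two of the four binary classifiers fire, so the decoded rank index is 3.
    print(predict_rank_index([0.9, 0.8, 0.3, 0.1]))
```

Decoding by counting positive binary predictions is only meaningful when the binary predictions are consistent with one another, which is the property the CORAL framework is designed to provide.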

Loss function
The weight parameters of the neural network, excluding the bias units of the last layer, are denoted by $W$. The output of the penultimate layer is expressed as $g(x_i, W)$, and its weight is shared by all nodes in the final output layer. $K-1$ independent bias units $b_k$ are then added to $g(x_i, W)$, so that the input to the corresponding binary classifier in the last layer is $g(x_i, W) + b_k$. The activation function is shown in Eq (5):
$$\sigma(z) = \frac{1}{1 + \exp(-z)} \tag{5}$$
The empirical predicted probability of task $k$ is defined in Eq (6):
$$\hat{P}\left(y_i^{(k)} = 1\right) = \sigma\left(g(x_i, W) + b_k\right) \tag{6}$$
where $\sigma$ is the activation function and $y_i^{(k)}$ is the k-th extended binary label of sample $i$. For model training, the loss function $L(W, b)$ is minimized, as shown in Eq (7):
$$L(W, b) = -\sum_{i=1}^{N} \sum_{k=1}^{K-1} \lambda^{(k)} \left[ \log\left(\sigma(g(x_i, W) + b_k)\right) y_i^{(k)} + \log\left(1 - \sigma(g(x_i, W) + b_k)\right) \left(1 - y_i^{(k)}\right) \right] \tag{7}$$
where $\lambda^{(k)}$ is the loss weight associated with the k-th classifier and is assumed to be greater than zero. The model is trained with the extended binary labels, and the $K-1$ binary classifiers are assigned to the output layer so that a single convolutional neural network is trained.
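As a worked illustration of the loss above, the following sketch evaluates a weighted binary cross-entropy over K-1 tasks that share one penultimate-layer score per sample and differ only in their bias. It is a plain-Python sketch under stated assumptions, not the training code used in the experiments: the function names are hypothetical, the shared score g is passed in as a precomputed number, and the loss weights default to 1.

```python
import math


def sigmoid(z):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-z))


def coral_loss(g_values, biases, labels, weights=None):
    """Weighted binary cross-entropy summed over samples and over the K-1 binary tasks.

    g_values: one shared score g(x_i, W) per sample.
    biases:   the K-1 independent bias units b_k.
    labels:   per sample, the K-1 extended binary labels y_i^(k).
    weights:  optional positive loss weights lambda^(k); defaults to 1 for every task.
    """
    if weights is None:
        weights = [1.0] * len(biases)
    total = 0.0
    for g, y_ext in zip(g_values, labels):
        for b, y, lam in zip(biases, y_ext, weights):
            p = sigmoid(g + b)
            total -= lam * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total


if __name__ == "__main__":
    biases = [2.0, -2.0]
    # A large score agrees with labels [1, 1]; a very negative score does not.
    print(coral_loss([5.0], biases, [[1, 1]]))
    print(coral_loss([-5.0], biases, [[1, 1]]))
```

Because every task adds its own bias to the same score, the K-1 predicted probabilities of a sample are ordered in the same way as the biases, which is what makes the decoded rank well defined.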

Experimental environment and parameters
This section evaluates the proposed age estimation method through several experimental procedures to verify its validity and compares it with other methods. The experiments were conducted on a server with an Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20 GHz (2 processors) and an NVIDIA TITAN X (Pascal) with 12 GB of video memory. The operating system is Windows 10 64-bit. The software used was Python (version 3.8) and PyTorch (version 1.9).

Experimental data set
The Asian face age dataset (AFAD) [20] is the largest age estimation dataset so far, containing 164,432 images of human faces. Each image has a corresponding age and gender label. The AFAD data set was collected on Renren. It comprises 100,752 images of men and 63,680 images of women. The data set focuses on age estimation for Asian faces and contains different backgrounds and lighting conditions.

Evaluation index
In this paper, to test the effectiveness of the proposed backbone network, the performance of VoVNetv4 was evaluated on two evaluation indexes: Mean Absolute Error (MAE) and Root Mean Square Error (RMSE). MAE is the average absolute difference between a person's predicted age and their actual age, that is, the mean of the absolute residuals. The smaller the MAE value, the higher the accuracy of age estimation, as shown in Eq (8):
$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - h(x_i) \right| \tag{8}$$
RMSE is also used to measure the difference between the predicted age and the actual age. Its calculation is similar to that of the standard deviation, as shown in Eq (9):
$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( y_i - h(x_i) \right)^2} \tag{9}$$
where $h(x_i)$ is the age predicted by the network model, $y_i$ is the real age value of the predicted sample, and $N$ is the total number of samples.
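The two evaluation indexes can be computed directly from paired lists of true and predicted ages. The sketch below uses only the standard library; the function names and the three-sample example are illustrative and are not taken from the AFAD experiments.

```python
import math


def mean_absolute_error(true_ages, predicted_ages):
    """Average absolute difference between true and predicted ages."""
    n = len(true_ages)
    return sum(abs(y - h) for y, h in zip(true_ages, predicted_ages)) / n


def root_mean_square_error(true_ages, predicted_ages):
    """Square root of the average squared difference between true and predicted ages."""
    n = len(true_ages)
    return math.sqrt(sum((y - h) ** 2 for y, h in zip(true_ages, predicted_ages)) / n)


if __name__ == "__main__":
    true_ages = [20, 30, 40]
    predicted_ages = [22, 30, 37]
    print(mean_absolute_error(true_ages, predicted_ages))     # (2 + 0 + 3) / 3
    print(root_mean_square_error(true_ages, predicted_ages))  # sqrt((4 + 0 + 9) / 3)
```

RMSE is never smaller than MAE on the same data, and the gap between the two widens when a few samples have large errors, which is why both indexes are reported.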

i. Experimental results
To make a fair comparison, the same random seed was used for all training runs. The experiment was divided into four parts: the first part takes the standard Resnet-34 classification network as the performance baseline; the second part integrates the standard Resnet-34 classification network with the CORAL framework to verify the effectiveness of the CORAL framework; the third part fuses VoVNet with the CORAL framework to verify the performance of VoVNet; and the fourth part improves VoVNet and fuses the resulting VoVNetv4 with the CORAL framework. To verify the validity of the proposed method, the random seed is set to 1, the learning rate to 0.0005, the number of epochs to 200, and the batch size to 128.
1) To verify the effectiveness of the CORAL framework, Resnet-34 was selected as the feature extraction network. First, the Resnet-34 network is used as a classification network for age estimation. The convergence of the network is analyzed by examining the trend of the loss value. The network training loss is shown in Figure 4. The network gradually converges at approximately round 37.
2) Building on the first part of the experiment, the CORAL framework classifier is fused with Resnet-34 to verify the effectiveness of the CORAL framework. In this experiment, the cost function is used for network optimization analysis. The network convergence is shown in Figure 5. The model gradually converges at approximately round 25. The experiment shows that the CORAL framework classifier can effectively improve the generalization ability of the network.
3) In this experiment, VoVNet is used as the feature extraction network and the CORAL framework is used as the classifier. The network convergence is shown in Figure 6. The experimental results show that the convergence of the network is significantly improved by the 25th round. Compared with Resnet-34 as the feature extraction network, both the convergence speed and the final effect are significantly improved.
Figure 6. VoVNet + CORAL framework cost function curve.
4) The third part of the experiment verified the effectiveness of using VoVNet as a feature extraction network. In the fourth part, the VoVNet network was improved and the backbone network VoVNetv4 was proposed. To verify the performance of the proposed network, VoVNetv4 is used as the feature extraction network and the CORAL framework is used as the classification network. The training cost function curve is shown in Figure 7. The experimental results show that the VoVNetv4 network has a higher convergence ability and speed.

ii. Result analysis
Other research methods were replicated in this section on the same AFAD data set with 200 training rounds, and MAE and RMSE were used to analyze performance. The experimental results show that compared with curriculum learning and the CORAL-CNN method, the MAE value decreased by 0.17. Compared with the OR-CNN method, relative to the value of RMSE, the MAE value decreased by 0.21. The errors of the method used in this paper are lower than those of all the comparison methods, to different degrees. The comparative experimental results are shown in Table 1.

iii. Ablation experiment
The results of the ablation experiment are shown in Table 2. VoVNetv4 in the table is the phased single aggregation backbone network that uses the ROSA module proposed in this paper for regional encapsulation. CORAL in the table denotes the experiment in which the CORAL framework is substituted as the classifier. An evaluation index with a downward arrow indicates that the lower its value, the better the performance of the algorithm. The number below each evaluation index is the decrease achieved by the optimized method relative to the original algorithm. As shown in Table 2, MAE and RMSE both decline when either the proposed VoVNetv4 is used as the backbone network or the CORAL framework is used as the classifier alone, and the combination of the two produces a further improvement. Therefore, VoVNetv4 is finally chosen as the feature extraction network and the CORAL framework as the classifier.

Application of age detection algorithm in fall detection
A gradient-style two-node fall detection framework combined with age estimation is designed. To give detection priority to human falls in different age groups, the framework adopts a dual-node method that divides detection into a primary node and a secondary node. In the primary node, the age estimation algorithm based on VoVNetv4 is used to classify people into age groups. At the same time, a face tracking algorithm is constructed by combining the human body key point matrix produced by OpenPose with the center coordinates of the face. In the secondary node, human fall detection based on the AT-MLP model is performed according to the age gradient information of the human body in order to improve the real-time performance of the fall detection process [21].
Figure 8. First level node frame diagram (labels: human key point matrix, human tracking, primary node, secondary node, priority gradient, secondary gradient, AT-MLP-based fall detection algorithm, results).

Primary node
The framework adopts a two-node approach for task analysis. The primary node is divided into two stages: information processing and priority allocation. The information processing stage extracts the relevant information from the video. First, the video is processed by the VoVNetv4 age estimation algorithm to obtain the age of each person and the center position of the face. The age information is classified into stages by age group: people aged 60 or older are classified as elderly, and those under 60 are divided into young adults and teenagers. This classification is used for task deployment in the next stage. At the same time, the video is passed into the OpenPose network to extract the key point coordinates of the human body, that is, to output the human key point matrix. The center position of the face obtained by the age estimation algorithm is then combined with the key point matrix output by OpenPose to construct a face center point tracking algorithm, which associates each estimated age with the key points of the corresponding body; in this way the tracking process is realized. The distance between the face center coordinates obtained by the age estimation algorithm and the face center coordinates recognized by OpenPose is denoted $D$ and is calculated as shown in Eq (10):
$$D = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2} \tag{10}$$
where $(x_1, y_1)$ are the key point coordinates of the face center recognized by OpenPose and $(x_2, y_2)$ are the key point coordinates of the face center recognized by the age estimation algorithm. According to $D$, the correspondence between the target and the key points is determined. The priority allocation stage prioritizes the tracked human body information: information for people aged 60 or older is classified as a high-level resource, and the rest is passed to the next node as low-level resources.
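The matching and priority steps of the primary node can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' tracking algorithm: it assumes D is the Euclidean distance between the two face centers, pairs each estimated age with the nearest OpenPose face center, and uses hypothetical data layouts and function names. Treating a missing age (for example an occluded face) as high priority follows the behavior described for the visualization results later in the paper.

```python
import math

ELDERLY_AGE = 60  # age threshold stated in the text


def match_faces_to_skeletons(age_faces, pose_faces):
    """Pair each age-estimator face with the nearest OpenPose face center.

    age_faces:  list of (age, (x, y)) from the age estimator (hypothetical layout).
    pose_faces: list of (skeleton_id, (x, y)) from OpenPose (hypothetical layout).
    Returns a dict {skeleton_id: age}.
    """
    matches = {}
    for age, (xa, ya) in age_faces:
        if not pose_faces:
            break
        # Distance D between the two face centers, assumed Euclidean.
        skeleton_id, _ = min(
            pose_faces,
            key=lambda item: math.hypot(item[1][0] - xa, item[1][1] - ya),
        )
        matches[skeleton_id] = age
    return matches


def priority_for(age):
    """High priority for age >= 60 or unknown age; low priority otherwise."""
    if age is None or age >= ELDERLY_AGE:
        return "high"
    return "low"


if __name__ == "__main__":
    age_faces = [(67, (100.0, 50.0)), (25, (300.0, 60.0))]
    pose_faces = [("p0", (305.0, 58.0)), ("p1", (98.0, 52.0))]
    matches = match_faces_to_skeletons(age_faces, pose_faces)
    for skeleton_id, age in matches.items():
        print(skeleton_id, age, priority_for(age))
```

This greedy nearest-neighbor pairing can assign two faces to the same skeleton in crowded scenes; a one-to-one assignment would be a natural refinement, but the text does not say which strategy is used.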

Secondary node
The secondary node includes two parts, resource allocation and human fall detection, and further processes the information from the first node. The gradient two-node fall detection framework combined with age estimation allocates the human information of different age groups in two directions: the priority gradient and the secondary gradient. Suppose there are n videos to process. The n videos are passed into the primary node for processing. After the face center is tracked, the correspondence between the age of each person and the key point matrix of the human skeleton is obtained, as expressed in Eq (11), where m is the information passed into the gradient sequence. A high-level resource is the skeleton key point matrix information of a person aged 60 or older, and a low-level resource is the skeleton key point matrix information of a person younger than 60. k and j are the thresholds on the number of video sequences passed into the priority and secondary gradient stacks, respectively. Suppose that the threshold for the skeleton key point information of the elderly is set to 5 and the threshold for others is set to 1. In the resource allocation phase, five fall detections are then performed for the elderly, followed by one fall detection for a young adult or teenager, and the loop detection process continues in this way.
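The looped allocation with thresholds of 5 and 1 can be read as a weighted round-robin over two queues. The sketch below is one possible interpretation and is not taken from the authors' code: the function name, the use of queues, and the behavior when one queue runs empty are assumptions made for illustration.

```python
from collections import deque


def interleave_by_gradient(priority_items, secondary_items, k=5, j=1):
    """Return items in processing order: up to k from the priority gradient,
    then up to j from the secondary gradient, repeated until both are empty."""
    priority = deque(priority_items)
    secondary = deque(secondary_items)
    order = []
    while priority or secondary:
        for _ in range(k):
            if not priority:
                break
            order.append(priority.popleft())
        for _ in range(j):
            if not secondary:
                break
            order.append(secondary.popleft())
    return order


if __name__ == "__main__":
    elderly = ["e1", "e2", "e3", "e4", "e5", "e6", "e7"]
    others = ["y1", "y2", "y3"]
    # Five elderly sequences are handled first, then one from the other group, and so on.
    print(interleave_by_gradient(elderly, others))
```

With this scheme low-priority sequences are never starved, since one of them is processed after every batch of at most five high-priority sequences.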

Visualization of detection effect
To show the detection results and practical effect of the two-node fall detection framework combined with age estimation, this section gives a demonstration on the LFDD data set. The LFDD data set contains multiple fall scenarios with different lighting and color conditions and with occlusion. A human fall video is fed into the two-node fall detection framework. Through the information processing and priority allocation of the first node, the age label of the human body is obtained. The information then enters the second-level node according to the age gradient, where the human body is marked for falls. Part of the model's detection results are visualized in Figure 9. These results include the elderly, young people, and young people with special conditions.
Figure 9(a) presents a group of images showing fall detection for a human subject aged 65. The left image shows the person before falling, annotated with the person's age, whether a fall occurred, and the assigned priority channel. Both the left and right images are of high priority. The right image shows the status of the elderly person after falling, including the special case of occlusion. Figure 9(b) presents a group of images showing falls among young people, all of which are assigned low priority. Figure 9(c) presents a group of images in which the human face is not visible. No age label is shown in this case because the face is occluded, and the framework treats such cases as high priority. The right image shows a person squatting in a posture that resembles a fall; the model correctly identifies whether the human body is in a falling state.
Since there are no multi-person scenarios in the LFDD data set, other multi-person scenarios are experimentally verified to demonstrate the applicability of the two-node fall detection model combined with age estimation. The visualization results are shown in Figure 10.

Conclusions
This paper introduces the proposed human age estimation algorithm in detail. First, the backbone network VoVNetv4, which consists of the ROSA module and the adaptive stage module, is proposed and analyzed as a whole. Second, the two modules are analyzed concretely. The ROSA module encapsulates the feature modules by region to achieve a phased single aggregation of features; its mode of aggregation calculation and its architecture are explained. The adaptive stage module consists of a convolutional layer, a BN layer, and a Rectified Linear Unit (ReLU) activation function, each of which is described in detail, and is used to smooth the feature information of the ROSA module. The classifier and loss function used in age estimation are then introduced: the principle and calculation process of the CORAL framework classifier are shown, and the use of the loss function is explained and analyzed. Finally, the age estimation algorithm based on VoVNetv4 is analyzed experimentally. The data set is introduced, the experiment is divided into four parts, and the effectiveness of the proposed method is verified by a series of comparative experiments.
In summary, this paper proposes VoVNetv4, a new backbone network for object detection. VoVNetv4 uses regional splicing to pass early features forward, captures various kinds of visual information across dimensions, and realizes a single feature aggregation in stages. It also builds an adaptive stage module, composed of a Conv layer, a BN layer, and a ReLU activation function, for feature smoothing. The classifier uses the CORAL framework to divide the task into binary subtasks and make consistent predictions for each of them. The experimental verification is carried out on the Asian face age dataset (AFAD). The effectiveness of the proposed method is verified through four sets of ablation experiments and compared with other detection methods. The experimental results show that the mean absolute error is 3.30 and the root mean square error is 4.64, a clear improvement over the compared methods. The gradient two-node framework realizes the hierarchical detection of the elderly and young people, effectively improves the real-time monitoring ability, avoids the data explosion caused by too many input videos, and realizes the gradient detection of human falls.

Use of AI tools declaration
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.