EVALUATION OF CROWD COUNTING MODELS IN TERMS OF PREDICTION PERFORMANCE AND COMPUTATIONAL REQUIREMENTS

ABSTRACT
With the growth of the human population and the development of technology, crowd counting models are needed to estimate the number of people in a given area. This paper compares the prediction performance and computational requirements of four state-of-the-art crowd counting models: M-SFANet, DM-Count, Context-Aware Crowd Counting (ECAN), and Supervised Spatial Divide-and-Conquer Network (SS-DCNet).


INTRODUCTION
Crowd counting is a branch of crowd analysis that can be used for monitoring and surveillance from video, providing estimates useful for designing an area, monitoring traffic, and other applications [1]. Using crowd counting technology, the total mass and mass density can be estimated [2]. Research on crowd counting is highly important for security purposes. A recent crowd-related tragedy is the Seoul crowd crush, in which at least 153 people were killed. Many crowd counting models are being developed today, but most of them focus only on prediction performance. Several crowd counting models that are at, or close to, the state of the art in prediction performance are M-SFANet, DM-Count, Context-Aware Crowd Counting (ECAN), and Supervised Spatial Divide-and-Conquer Network (SS-DCNet). When evaluating a crowd counting model, the focus is frequently on minimizing prediction error, while the computational requirement is ignored. The computational requirement is equally important to discuss in order to understand how heavy a model is and whether it can feasibly be deployed. The high demand for lightweight models driven by the development of the Internet of Things (IoT) is another reason why model efficiency matters. Therefore, in this research, the computational requirements and prediction performance of the four models mentioned above are analyzed and evaluated, and the results are compared.

RELATED WORKS
An early approach to crowd counting is to detect individuals in the input picture and then count the detected individuals. Detection is done with bounding boxes that slide over the image to find the desired object, in this case persons (i.e., an object detection approach). This approach was later found to be inefficient and computationally heavy, since it requires the model to accurately detect each individual in a crowded scene, where individuals often overlap, making it hard for the model to detect everyone correctly. In an attempt to reduce the computational weight of detection, the "object" to be detected was reduced (in features) to just the head, but this was still not enough. One attempt at crowd counting by detection was proposed by Marsden et al. [3], who built a model based on the ResNet-18 network of He et al. [4], trained on the ImageNet dataset. A feature-map average pooling step was used to reduce the number of parameters, allowing multi-task crowd analysis to be applied and reducing the memory needed during training. The model was also tasked with detecting violent behavior, which justified the detection approach, since it can both detect violent behavior and count the crowd. Another attempt is by Xing et al. [5], who proposed crowd counting based on detection flows. By tackling crowd counting with detection flows, they could reduce false alarms, cope better with data noise, and give more specific descriptions of crowds.
The trend then shifted to regression-based methods. This approach deals, to some degree, with dense crowd situations and high background clutter. It is inspired by the human ability to estimate the density of a crowd at first glance without counting individuals one by one. Regression-based crowd counting determines crowd density from low-level imagery features. First, global features (such as texture and edges) and local features (such as Scale-Invariant Feature Transform (SIFT), Local Binary Patterns (LBP), Histogram of Oriented Gradients (HOG), and Gray-Level Co-occurrence Matrix (GLCM)) are extracted from the image [6]. After feature extraction, regression models are trained to predict the number of people in the crowd. Local features were found to be the best features for regression-based crowd counting compared to holistic features and histogram features, and Gaussian process regression was found to be the best regressor compared to linear regression, k-nearest neighbors, and neural networks [7].
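The two-step pipeline above (extract low-level features, then regress a count) can be sketched minimally as follows. This is an illustrative sketch with synthetic data and plain least-squares regression standing in for the Gaussian process regression of the cited work; the feature values are assumptions, not from the paper.

```python
import numpy as np

# Hypothetical handcrafted feature vectors (e.g., edge density, texture
# statistics) for 50 training frames, with counts that depend on them.
rng = np.random.default_rng(0)
true_w = np.array([3.0, 1.5, 0.5])
X_train = rng.random((50, 3))            # 50 frames, 3 low-level features each
y_train = X_train @ true_w               # synthetic counts, linear in features

# Fit a linear regressor by least squares (a simple stand-in for the
# Gaussian process regression found best in [7]).
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

x_new = rng.random(3)                    # features of an unseen frame
estimate = x_new @ w                     # predicted crowd count
print(estimate >= 0.0)
```

In a real system the feature extractor (SIFT, LBP, HOG, GLCM) replaces the random vectors, and the regressor is trained on annotated frames.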
At a later stage, a new approach arose alongside the breakthrough of Deep Learning models, which excel at tackling Computer Vision problems [8]. The proposed method utilizes a Deep Learning model such as a CNN for crowd counting, combined with the feature-extraction concept of the regression approach; instead of a regression model, the Deep Learning model predicts the crowd counts. This kind of approach far surpasses past approaches.
In addition to the Deep Learning approach, the current trend is to derive a density map from the image annotations and use it as the target the model learns to predict. This reduces the computational weight and strips away less useful features: the model no longer needs to recognize complex features, only the simpler structure encoded in the density map.
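The density-map construction can be sketched as follows: each annotated head position becomes a unit impulse, and smoothing with a Gaussian kernel yields a map that integrates to the crowd count. The head positions and kernel width here are arbitrary example values, not from any of the compared models.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# Assumed head annotations (row, col) in a 64x64 frame.
heads = [(16, 20), (30, 40), (50, 12)]

# Place a unit impulse at each annotated head position.
annotation = np.zeros((64, 64))
for r, c in heads:
    annotation[r, c] = 1.0

# Smooth with a Gaussian kernel; the filter preserves total mass, so the
# density map still integrates to the number of people.
density_map = gaussian_filter(annotation, sigma=3)

print(round(density_map.sum()))   # -> 3, the number of annotated heads
```

A model trained on such maps predicts a density map for an unseen image, and summing the prediction gives the estimated count.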

M-SFANET
The first model is M-SFANet. This model consists of four main components: a VGG16-bn encoder, which progressively reduces the feature-map size and captures high-level semantic information; a Context-Aware module (CAN), connected to the 10th layer of VGG16-bn; Atrous Spatial Pyramid Pooling (ASPP), connected to the 13th layer of VGG16-bn (the CAN and ASPP together form the multi-scale-aware modules); and a dual-path multi-scale fusing decoder consisting of a density path and an attention path [9]. Processing starts when the input image is fed through the encoder to produce a feature map; the feature map is then fed into the multi-scale-aware modules (CAN and ASPP), and finally the decoder fuses the multi-scale features into a density map and an attention map [10].

DM-Count
The second model is DM-Count, which uses VGG-19 as its backbone. DM-Count performs distribution matching for crowd counting: Optimal Transport (OT) is used to measure the similarity between the predicted density map and the ground truth, and a Total Variation (TV) term is added to stabilize the OT. DM-Count does not use Gaussian smoothing, because it hurts the generalization performance of crowd counting [11].
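The TV term can be illustrated with a toy example: treating two density maps as probability distributions, the total variation distance is half their L1 difference. This is a simplified sketch of one ingredient only; DM-Count combines it with a counting loss and the OT loss, and the maps below are made-up values.

```python
import numpy as np

def tv_distance(pred, gt):
    """Total variation between two density maps viewed as distributions."""
    p = pred / pred.sum()              # normalize each map to sum to 1
    q = gt / gt.sum()
    return 0.5 * np.abs(p - q).sum()  # half the L1 difference

pred = np.array([[0.2, 0.8], [1.0, 0.0]])   # toy predicted density map
gt = np.array([[0.5, 0.5], [0.5, 0.5]])     # toy ground-truth map
print(tv_distance(pred, gt))
```

Identical maps give a distance of 0; the value grows toward 1 as the mass of the two maps is placed in disjoint regions.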

Context Aware Crowd Counting (ECAN)
The third model is Context-Aware Crowd Counting (ECAN). ECAN is a deep network architecture that adaptively encodes multi-level contextual information into its features and is designed to overcome the large scale variations that appear in images. ECAN uses the first ten layers of pre-trained VGG-16 as its backbone and computes scale-aware features with Spatial Pyramid Pooling, which extracts multi-scale context from the VGG features. Scene geometry is then exploited to account for how scale varies across the image. Ground-truth density maps are obtained by marking each head position in the scene and convolving the resulting annotation map with a kernel. To minimize the loss, Stochastic Gradient Descent (SGD) and the Adam algorithm are used [12].

Supervised Spatial Divide-and-Conquer Network (SS-DCNet)
The fourth model is SS-DCNet. This model uses pre-trained VGG-16 as the encoder [13] and U-Net [14] as the decoder to obtain the feature map. The first stage of Spatial Divide-and-Conquer (S-DC) is then executed to fuse the feature map, and SS-DCNet can also execute multi-stage S-DC through further decoding. SS-DCNet uses multiple loss functions: a counter loss, a merging loss, a division loss, an upsampling loss, and a division-consistency loss [15].

Evaluation of Prediction Performance
The experiment evaluates prediction performance based on the Mean Squared Error (MSE),

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,$$

where $y_i$ is the $i$-th observed value, $\hat{y}_i$ is the $i$-th predicted value, and $n$ is the number of observations, and the Mean Absolute Error (MAE),

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|,$$

with the same notation. The experiment uses the test data of each dataset, as shown in Table 1, running each pre-trained model to obtain the MSE and MAE results.
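The two metrics can be computed directly; the per-image counts below are made-up example values, not results from the experiment.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between observed and predicted counts."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    """Mean absolute error between observed and predicted counts."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean(np.abs(y_true - y_pred))

# Hypothetical per-image crowd counts on a small test set.
observed = [120, 85, 310]
predicted = [110, 90, 300]
print(mse(observed, predicted))   # -> 75.0  ((100 + 25 + 100) / 3)
print(mae(observed, predicted))   # 25 / 3, about 8.33
```

Because MSE squares the errors, it penalizes large miscounts more heavily than MAE does, which is why both are reported.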

Evaluation of Computational Requirement.
The experiment evaluates the computational requirement based on the average CPU usage, average RAM usage, average runtime per image in a given dataset, and average runtime per dataset. CPU and RAM usage are observed in the task manager, while the per-image and per-dataset runtimes are collected from the execution time on each dataset. To obtain the averages, the pre-trained models are run on the test data of each dataset five times.
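The runtime measurement can be sketched as follows: time the full pass over a dataset, divide by the number of images, and average over several repetitions. The dummy `model_fn` and "images" are placeholders standing in for a pre-trained model's inference call and real test data.

```python
import time

def average_runtime(model_fn, images, runs=5):
    """Average wall-clock inference time per image, averaged over runs."""
    per_run = []
    for _ in range(runs):
        start = time.perf_counter()
        for img in images:
            model_fn(img)
        per_run.append((time.perf_counter() - start) / len(images))
    return sum(per_run) / len(per_run)

# Dummy stand-ins so the sketch is runnable without a real model.
images = list(range(10))
avg = average_runtime(lambda img: img * 2, images)
print(avg >= 0.0)
```

`time.perf_counter` is used rather than `time.time` because it is a monotonic, high-resolution clock suited to interval measurement; CPU and RAM usage are observed separately, as described above.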

Overall Performance Evaluation.
The last step is to determine, from the experiment results, which model performs best overall (both highly accurate and fast) at crowd counting on the hardware stated in Table 2. To find the best-performing model, each model is ranked by prediction performance (evaluation scores: MAE and MSE) and by computational requirement (mainly runtime); the two ranks are then added, and the model with the lowest sum wins. In case of a tie, the most efficient model is the one with the more balanced result between the accuracy rank and the computational-requirement rank.
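The rank-aggregation step can be sketched as follows; the rank values assigned to the four models here are hypothetical placeholders, not the experiment's findings.

```python
def overall_rank(accuracy_rank, runtime_rank):
    """Sum each model's prediction-performance rank and runtime rank."""
    return {m: accuracy_rank[m] + runtime_rank[m] for m in accuracy_rank}

# Hypothetical ranks (1 = best) for the four compared models.
accuracy_rank = {"M-SFANet": 1, "DM-Count": 2, "ECAN": 3, "SS-DCNet": 4}
runtime_rank = {"M-SFANet": 4, "DM-Count": 1, "ECAN": 2, "SS-DCNet": 3}

totals = overall_rank(accuracy_rank, runtime_rank)
best = min(totals, key=totals.get)   # lowest combined rank wins
print(best)                          # -> DM-Count (1 + 2 = 3 in this toy data)
```

Ties in the combined rank would then be broken in favor of the model whose two component ranks are more balanced, per the procedure above.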

TABLE 1. Used dataset quantities. Each model is evaluated on the four datasets listed in Table 1.

Table 3. Evaluation of Prediction Performance on Test Dataset ShanghaiTech Part A.

Table 4. Evaluation of Prediction Performance on Test Dataset ShanghaiTech Part B.