
Implementations of person detection in tracking and counting systems tend towards processing of orthogonally captured images on edge computing devices. The ellipse-like shape of heads in orthogonally captured images inspired us to predict head centroids to determine the positions of persons in images. We predict the centroids using a fully convolutional network (FCN). We combine the FCN with simple image processing operations to ensure fast inference of the detector. We experiment with the size of the FCN output to further decrease the inference time. We compare the proposed centroid-based detector with bounding box-based detectors on the head detection task in terms of inference time and detection performance. We propose a performance measure which allows quantitative comparison of the two detection approaches. For the training and evaluation of the detectors, we form original datasets of 8000 annotated images, which are characterized by high variability in terms of lighting conditions, background, image quality, and elevation profile of scenes. We propose an approach which allows simultaneous annotation of the images for both bounding box-based and centroid-based detection. The centroid-based detector shows the best detection performance while meeting edge computing constraints.


Introduction
Automatic tracking and counting of persons using computer vision systems is an important task in surveillance of public and private places [1], and specifically in public transport [2]. Various imaging technologies including radar sensors [3], laser scanners [4], 3D laser scanners [5], infra-red sensors [6], and cameras operating in the visible spectrum of light [2] can be used for this purpose. Selection of the imaging technology is mainly driven by the economic interests of manufacturers (low price of a final solution). Cameras operating in the visible spectrum of light are preferred in practice. However, their utilization often faces legal obstacles which must be reflected in final solutions. For example, orthogonal scanning of a scene (camera placed above a scene, looking directly down on the scene) is preferred at public places as it prevents unwanted identification of individuals from their faces (Fig. 1) [2].
One of the key operations performed within person tracking and counting systems is detection of persons in images [2]. The person detection is a variant of a computer vision task known as object detection. It comprises localization and class recognition of objects (individuals) within the images. The localization is the task of determining the positions of the objects within the images.

P. Dolezel et al.

Fig. 1. Examples of orthogonally acquired images for person detection. An image with a small number of persons where the persons are sharply differentiated from the background (a) promises a higher probability of correct person detection than a crowded image (b). Incomplete heads and variable distances of the heads from a camera lens (b) also complicate the head detection. In extreme cases, a head can be very close to the lens and cover a large part of the image area (c). If the colour of the head blends significantly with other objects in the scene and the lack of light makes the overall image unclear (c), the head detection becomes challenging.

Convolutional networks (ConvNets) typically produce a desired output by implementing a fully connected layer at the end of a network. A network which does not contain any fully connected layer is referred to as a fully convolutional network (FCN). As FCNs consist purely of convolutional and pooling layers, they are faster than ConvNets with fully connected layers. This property of FCNs is advantageous for dense pixelwise prediction tasks such as instance [10] and semantic segmentation [12], where outputs of the networks are generally three-dimensional cubes.
The bounding figure-based object detection methods can either predict bounding figures directly (one-stage methods) [7,8,13-15], or they can generate regions of interest at the first stage, which are then sent to the second stage for object classification and bounding figure regression (two-stage methods) [16-20]. Two-stage methods typically reach higher accuracy rates but are slower than one-stage methods. The well-established one-stage methods such as YOLO [7] or the single shot multibox detector (SSD) [13] use fixed-size anchor boxes as region candidates. The main drawbacks of the anchor-based methods are the need for ad-hoc heuristics (determining the number and dimensions of anchor boxes) and the large set of anchor boxes to be evaluated, which slows down the training of models [14]. These drawbacks are overcome by key-point based methods such as ExtremeNet [8], CornerNet [14], and CenterNet [15]. CornerNet predicts two key-points for each object of interest: the top-left and the bottom-right corners of a rectangular bounding box. CenterNet improves the CornerNet idea by adding a centre of gravity of the object to the prediction of the bounding box coordinates. ExtremeNet predicts coordinates of four extreme points and of one centre point for each object. These five points determine an irregular octagon which delimits the position of the detected object in the image.
Dense pixelwise prediction FCNs transform an input image into a map (segmentation map [10], saliency map [21], optical flow map [22], etc.). They consist of an encoder module (contracting path) and a decoder module (expansive path). Convolutional and pooling layers of the encoder module gradually reduce the resolution of feature maps (pooling layers) while learning semantic information (convolutional layers). The down-sampling ensured by the pooling layers increases the local receptive fields of neurons in deeper convolutional layers, thus allowing the learning of more complex features. When dilated (atrous) convolution layers are used instead of pooling layers, the receptive fields are increased without decreasing the resolution. However, the processing of high-resolution feature maps results in high time and space complexities of such modified networks [23]. The decoder module, which consists of inverse layers (up-convolution and up-sampling layers), ensures the recovery of output spatial dimensions. The decoder can be designed either as a mirror of the encoder [24] or it can be asymmetric to the encoder [12].
The state-of-the-art dense pixelwise prediction FCNs are directed acyclic graphs with skip connections to transfer pooling indices (SegNet [25]) or feature maps from the encoder to the decoder. The combination of feature maps from the encoder with feature maps produced by corresponding up-convolution or up-sampling layers can be ensured, for example, by concatenative skip connections (U-Net [26]), by attention gates (attention U-Net [27]), by adding an extra full-resolution stream (full-resolution residual networks (FRRNs) [28]), or by enriching skip connections with additional or more complicated convolutional units (squeeze U-Net [29], gated feedback refinement network (G-FRNet) [30], global convolutional network (GCN) [31]).
The bounding figure-based object detectors, like dense pixelwise prediction FCNs, are trained end-to-end on a set of annotated images. Annotation of a sufficiently representative dataset is always time consuming. In the case of the bounding box-based object detection, an annotator (a domain expert) draws a rectangular boundary around each object of interest in each image in the dataset, and assigns them class labels [7]. The annotation of images for dense pixelwise prediction tasks is even more challenging. For example, in the case of the instance segmentation, the annotator creates a segmentation map for each image in the dataset. He/she must assign to each element of the map a value corresponding to the class of the object which is associated with the element.
Performance measures such as intersection over union and generalized intersection over union are commonly used for evaluation of bounding box-based object detectors [32]. These measures take into account areas defined by ground truth bounding boxes and bounding boxes predicted by an evaluated detector.
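As an illustration of such an area-based measure, the following is a minimal sketch of the IoU computation for axis-aligned boxes represented as (x, y, w, h) tuples (top-left corner plus size); the function and representation are our illustration, not the paper's code:

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes a, b = (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    # widths/heights of the overlapping region (zero if the boxes are disjoint)
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0
```

The generalized IoU additionally penalizes the empty area of the smallest enclosing box, which keeps the measure informative for non-overlapping boxes.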
The practical implementation of tracking and counting systems tends towards data processing on edge computing devices. Their limited computing power requires the use of time-efficient and yet accurate object detectors. Herein, we propose an approach ensuring fast and precise detection of persons in orthogonally captured images. We assume that the position of an object centroid is sufficient for person counting and tracking. Instead of bounding figure prediction, we predict object centroids using localization maps. We expect that the generation of smaller localization maps will result in a smaller time complexity of the centroid-based detector, while keeping the detection performance at the level of its full resolution variant. To confirm our hypothesis, we propose two dense pixelwise prediction FCNs: the first generates localization maps of spatial resolution equal to the spatial resolution of input images, and the second produces maps of quarter resolution. We compare the proposed centroid-based person detector for both map resolutions with YOLO and CenterNet detectors in terms of inference time and detection performance. We choose YOLO as it is a generally accepted baseline for real-time bounding box-based object detection, and CenterNet as it is a promising key-point-based detector with object centroid coordinates as one of its outputs. As our centroid-based object detector predicts only points (coordinates of centroids), the commonly used performance measures cannot be used for its comparison with bounding box-based person detectors. We propose a total localization error which allows quantitative assessment of the detection performance of both the centroid-based and the bounding box-based detectors. For the training and evaluation of the detectors, we form new datasets which consist of images with orthogonally captured persons in various scenes and with variable head-lens distances.
While the annotation of the datasets for bounding box-based detection requires only delineation of bounding boxes, pixel-wise annotation is needed in the case of the proposed detector. To simplify the annotation, we propose an approach which allows simultaneous annotation of the images for both bounding box-based and centroid-based detection.
The key contributions of this article are as follows:
• A centroid-based person detection technique in visual data is proposed.
• The technique aims at orthogonal scanning of the scene with variable head-lens distances.

Bounding box-based object detection
Let an image I contain n objects of m recognized classes, where n ∈ Z0+ and m ∈ Z+. A rectangle boundary with edges parallel to the edges of the image, tightly enveloping the j-th recognized object, delimits the position and dimensions of the object within I. The ground truth bounding box of the j-th object is a 5-tuple b_j = (ẋ_j, ẏ_j, w_j, h_j, c_j) (1), where ẋ_j and ẏ_j are the x and y coordinates of the top-left rectangle corner, respectively, w_j and h_j are the width and height of the rectangle, respectively, c_j is the class of the object, and j ∈ {1, …, n}.

Centroid-based object detection
In the case of the centroid-based object detection, the position of the j-th object in I is given by the x and y coordinates of its centroid, x̄_j and ȳ_j, respectively. Let the ground truth centroid of the j-th object be the 3-tuple g_j = (x̄_j, ȳ_j, c_j), where c_j is the class of the object and j ∈ {1, …, n}.

Proposed centroid-based object detector
Dense pixelwise prediction FCNs are theoretically capable of transforming an image I of width w and height h into a three-dimensional centroid map C of width w, height h and depth m (the depth is given by the number of recognized classes). An element c(x, y, k) of the map equals 1 if a centroid of an object of the k-th class lies at the coordinates (x, y), and 0 otherwise, where x ∈ {1, …, w}, y ∈ {1, …, h}, and k ∈ {1, …, m}. The centroid maps are theoretically the ideal source for a centroid-based object detector. In such a detector, a dense pixelwise prediction FCN acts as a map generator. The generator must be complemented by a localization module which searches for centroid predictions in map predictions Ĉ. We implement the search as a search for the positions of maxima in Ĉ; the localization module returns, for the image I, the set Ĝ of the coordinates of these maxima (4). Due to the limited approximation capability of real FCN models, values of the elements of Ĉ can be assigned incorrectly, which can result in false positive and false negative detections, and in incorrect localizations. We expect that the localization performance of the proposed approach can be improved once the mass of the objects is considered. Rather than training a pixelwise prediction FCN to predict centroid maps C, we train it to predict three-dimensional localization maps L of width w_L, height h_L and depth m. Elements of L take real values from the interval [0, 1], where l(x, y, k) = 1 indicates the presence of the centroid of an object of the k-th class at the location (x, y), and the values decrease towards 0 with increasing distance of the elements from their centroids. To take advantage of the localization maps L, we must transform predictions L̂ of the localization maps into predictions Ĉ of the centroid maps. Thus, an object detector based on the localization maps must consist of a localization map generator (a pixelwise prediction FCN), a centroid counterpoint module, and the localization module (4) (Fig. 2), where the centroid counterpoint module ensures the transformation of L̂ into Ĉ.

Centroid counterpoint module
The module emphasizes centroids and suppresses false detections by a series of operations. At first, we process each layer of L̂ using a maximum filter with a kernel of size h_f × w_f. We get a map M̂ whose element m̂(x, y, k) is the maximum of l̂ over a rectangular sub-window of size h_f × w_f centred at the point (x, y) (5). We compare M̂ and L̂ to highlight local maxima. The result of this operation is a binary map B̂1 whose element b̂1(x, y, k) equals 1 if l̂(x, y, k) = m̂(x, y, k), and 0 otherwise (6). The map B̂1 contains the centroids as well as local maxima caused by the noise of the background; in flat zero-valued regions of L̂, every element trivially equals its local maximum. To suppress these irrelevant regions in the map B̂1, we form a mask Ŝ whose element ŝ(x, y, k) equals 1 if m̂(x, y, k) = 0, and 0 otherwise. To suppress artefacts in the map B̂1 caused by the local maximum filter (6), we erode each layer of the mask Ŝ using a rectangular structuring element of size h_f × w_f with the origin in its centre. We pad Ŝ by ones to keep the dimensions of the eroded mask Ŝ⊖ the same as the dimensions of B̂1. Application of the eroded mask Ŝ⊖ on B̂1 results in a map B̂ with elements b̂(x, y, k) = b̂1(x, y, k) ⊕ ŝ⊖(x, y, k), where ⊕ denotes exclusive disjunction. We identify centroids among the maxima highlighted in the map B̂ by considering their values in the localization map prediction L̂. Each element of the predicted localization map associated with a centroid must be greater than or equal to a threshold value T, where T ∈ (0, 1]. This operation results in the centroid map prediction Ĉ whose element ĉ(x, y, k) equals 1 if b̂(x, y, k) = 1 and l̂(x, y, k) ≥ T, and 0 otherwise. The setting of T is problem dependent and must reflect the quality of the localization map predictions.
We summarize the pipeline of the centroid counterpoint module in Fig. 3.
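The pipeline can be sketched in a few lines of Python. The sketch below uses SciPy's `maximum_filter` and `binary_erosion` as the maximum filter and erosion operations; the background-mask definition (flat zero-valued regions of the map) and the function name are our assumptions:

```python
import numpy as np
from scipy.ndimage import maximum_filter, binary_erosion

def centroid_counterpoint(loc_map, kernel=(10, 10), threshold=0.65):
    """Transform one class layer of a predicted localization map into centroids."""
    # maximum filter: each element becomes the maximum of its local sub-window
    m = maximum_filter(loc_map, size=kernel, mode="nearest")
    # highlight local maxima (an element that equals its local maximum)
    b1 = loc_map == m
    # mask of flat zero-valued background, where every element is a trivial maximum
    s = m == 0
    # erode the mask; padding with ones keeps the eroded mask the same size
    s_eroded = binary_erosion(s, structure=np.ones(kernel, dtype=bool), border_value=1)
    # exclusive disjunction removes the background artefacts from the maxima map
    b = np.logical_xor(b1, s_eroded)
    # keep only maxima whose localization values reach the threshold
    centroid_map = b & (loc_map >= threshold)
    ys, xs = np.nonzero(centroid_map)
    return [(int(x), int(y)) for x, y in zip(xs, ys)]
```

The defaults mirror the experimental setting reported later (threshold 0.65, 10 × 10 px filter); for a multi-class map, the function is applied per layer.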

Detection of persons in orthogonally captured images
Orthogonal scanning of a scene results in images where individuals are represented by their heads and shoulders. Their sizes are subject to the focal length of the camera lens and to the distances of the individuals from the pupil of the lens. The distances depend, for example, on the heights of the individuals, the elevation profile of the scene (Fig. 4), and the camera altitude.
The detection of persons in orthogonally captured images can be treated as detection of heads and shoulders [33] or as detection of heads [2]. Detection of heads has proven to be specifically accurate on datasets with high variability in head sizes [2]. In both cases, any bounding figure-based object detector can be used for person detection.
In tracking and counting applications, individuals are usually considered to be members of the same class (m = 1). Great emphasis is put on correct localization of individuals in images, with a minimum of overlooked individuals and a minimum of false detections. Dimensions of persons are not essential unless they predetermine the accuracy of the localization. Considering these facts, we propose using the centroid-based object detector for the detection of persons. To ensure high robustness of the person detector, we view the person detection as a problem of determining the positions of heads.

Centroid-based person detection
Our goal is to develop a person detector allowing real-time detection on devices with a single-board computer architecture. To reduce the time complexity of the detector, we consider 288 × 288 pixel (px) images to be the inputs of the detector. Given that heads are defined by shape and brightness gradient rather than by colour, we use greyscale images for processing. The only component of the proposed detector which allows optimization of its inference time is the localization map generator. Considering this fact, we propose two FCN topologies to be implemented in the person detector as a generator. We base them on the U-Net architecture [26]. The first topology (full resolution U-Net) generates localization maps of spatial resolution equal to the spatial resolution of input images (288 × 288). The second one (reduced U-Net) aims at generation of quarter size localization maps (72 × 72).
The U-Net is a symmetric dense pixelwise prediction FCN (the decoder is the mirror of the encoder). U-Net modules (UMs) ensure feature extraction at four levels of the network. A UM consists of five consecutive operations: convolution (Conv), rectified linear unit (ReLU), dropout (DO), Conv, and ReLU. Using a short notation, a UM can be written as Conv(h_c × w_c, n_f, s) - ReLU - DO(p) - Conv(h_c × w_c, n_f, s) - ReLU, where p is the probability of dropout, s is the stride of the convolutional filters, n_f is the number of the filters, and h_c and w_c are their height and width, respectively. Each UM in the encoder module is followed by max-pooling (MP) with pools of height h_p and width w_p and stride s (shortly MP(h_p × w_p, s)). Feature maps produced by UMs in the encoder module are concatenated with feature maps produced in the decoder module. The transfer of the maps from the encoder to the decoder is ensured by skip connections.
We summarize the topologies of the full resolution and the reduced U-Nets in Table 1 and Table 2, respectively. The columns outline operations performed within the encoder and decoder modules. The operations are arranged in rows with respect to skip connections. The data flow is symbolized using arrows, where their orientations indicate data flow directions. Simple arrows denote the main flow of data, and double ones symbolize skip connections (skip connections are numbered for clarity). We denote a concatenation of two feature maps as [⋅, ⋅]. For all UMs in both topologies, we use the following setting: h_c = 3, w_c = 3, s = 1 and p = 0.2. As the only changing parameter is the number of filters n_f, we use the notation UM(n_f) for the description of the topologies. In the full resolution U-Net, up-sampling (US) precedes each UM in the decoder module. In the reduced U-Net, we remove the last two USs. We implement US as the Kronecker product of each input feature map with an h_u × w_u matrix of all ones (US(h_u × w_u)), where h_u and w_u are the height and width of the matrix, respectively. In the full resolution U-Net, feature maps are directly transferred between corresponding parts of the encoder and decoder. In the reduced U-Net, feature maps produced by the first and the second UMs in the encoder are reduced using max-pooling to a quarter and a half of their size, respectively, before their concatenations with feature maps produced within the decoder module. In both variants, the networks are closed by Conv(1 × 1, m, 1) followed by a sigmoid activation function (sig), where m is the number of recognized classes. To achieve the required output dimensions (288 × 288 or 72 × 72 px), we zero-pad the inputs of operations, if necessary.
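The Kronecker-product up-sampling can be sketched directly in NumPy; the function name is ours:

```python
import numpy as np

def us(feature_map, hu=2, wu=2):
    """US(hu x wu): Kronecker product of a feature map with an all-ones matrix.

    Each element of the feature map is replicated into an hu x wu block,
    so a 72 x 72 map becomes 144 x 144 for hu = wu = 2.
    """
    return np.kron(feature_map, np.ones((hu, wu), dtype=feature_map.dtype))
```

This is equivalent to nearest-neighbour up-sampling (e.g. Keras' `UpSampling2D` with its default interpolation), which is how the operation would typically be realized inside the network.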

YOLO person detection
We compare the proposed centroid-based person detector with a detector based on the YOLOv2 architecture [34]. The YOLO-based person detector expects 288 × 288 px greyscale images at its input, and it returns, for each image, a set B̂ of bounding box predictions b̂, where the predictions are 5-tuples (1). As our aim is to compare the centroid-based person detector with a similar competitor in terms of inference time, we consider GoogLeNet [35], MobileNet-v2 [36], and SqueezeNet [37] as backbone models of the YOLO-based person detector. All the networks have proven to be successful in various time-critical computer vision applications.

Table 1. Topology of the full resolution U-Net.

CenterNet person detection
The second competitor of the proposed detector is CenterNet. The inputs of the CenterNet-based detector are 288 × 288 px greyscale images. The outputs are sets B̂ of bounding box predictions b̂, where the predictions are 7-tuples (2). To remain consistent with the original paper [15], we use ResNet-101 as the backbone. We also consider EfficientDet-D0 [38], which promises high computational efficiency.

Data acquisition
Quality of datasets predetermines the performance of deep ConvNet-based computer vision systems. To ensure robustness of the person detectors in the intended setting, we collect data in diverse environments which include staircases, corridors, and entries into means of transport. We capture video streams with a RealSense D435 camera orthogonally placed above walking persons at eight different locations. The walking persons are adults with and without headgear. The head-lens distance varies between 25 and 100 cm depending on the environment and the situation in the scene. This setting results in a large variance in the size of the heads and their sharpness. The lighting conditions differ among the experiments.

Dataset creation
We extract frames from the captured videos to create a set of 8-bit RGB images. From the frames of seven locations, we cut out 7000 square images with up to nine persons. We randomly split the set of images in the ratio 6:1 to create training and test datasets, respectively. From the eighth location, we form a blind test dataset of 1000 square image crops (Fig. 5). As these images are captured under different lighting conditions at a different place from the previous locations, the blind test dataset allows testing the generalization capabilities of the detectors. We resize the images in all datasets to 288 × 288 px.

Data analysis
The images capture persons of various heights at locations of various elevation profiles. Both these aspects contribute to a high variability in the sizes of heads in the images. The shape of heads is elliptical (Fig. 5). The smallest width and height of the ellipses are about 20 px. The largest ellipse dimension reaches 200 px. The number of heads in a scene varies from 0 to 10. Some images contain incomplete heads (persons near the edges of the images in Fig. 5). As the camera has a fixed focus, some of the heads are blurred.

Image annotation
Training and evaluation of the proposed centroid-based person detector requires extension of the image datasets with localization maps and ground truth centroids. We must create a localization map for each image, and we must assign a real value from the interval [0, 1] to each pixel of each localization map, where the non-zero values must be associated with objects of interest (heads).
To simplify the annotation process, it is reasonable to approximate the positions of objects in the maps using an appropriate geometric shape. Considering the elliptic shape of heads, we approximate the heads by gradient ellipses. We consider ellipse centroids to be identical with head centroids, and ellipse circumferences to be borders between heads and background. We draw rectangles tightly enveloping complete heads within an image I. To ensure correct approximation of protruding heads, we estimate the shapes of rectangles so that they include both visible and invisible parts of the heads (Fig. 6). In such a way, we create for I a set R of rectangles r. The j-th rectangle is an ordered 4-tuple r_j = (x_j, y_j, w_j, h_j), where x_j and y_j are the x and y coordinates of the top-left rectangle corner, respectively, and w_j and h_j are the width and height of the rectangle, respectively. The ellipse defining the border of the j-th head in the image is the ellipse inscribed in the rectangle r_j, i.e. the ellipse with its centre at the rectangle centre and with semi-axes w_j/2 and h_j/2 (13). We use the canonical ellipse equation (13) to define a piecewise linear function which assigns real values from the interval (0, 1] to the elements of the localization map associated with the areas of the ellipses: the values equal 1 at the ellipse centroids and decrease linearly towards 0 at the ellipse circumferences. The ground truth centroid of the j-th head, g_j = (x̄_j, ȳ_j, c_j), is given by the centre of the rectangle r_j, i.e. x̄_j = x_j + w_j/2 and ȳ_j = y_j + h_j/2. We form the set of ground truth bounding boxes for the image I using the set of rectangles R; the j-th ground truth bounding box b_j = (ẋ_j, ẏ_j, w_j, h_j, c_j) is given directly by the rectangle r_j and the single head class. We make the annotated datasets freely available at Kaggle [39].
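The construction of a ground-truth localization map from the annotated rectangles can be sketched as follows. The linear decay of the values from the ellipse centroid to the circumference is our reading of the piecewise linear function; the function name is ours:

```python
import numpy as np

def localization_map(rects, h, w):
    """Ground-truth localization map from head-enveloping rectangles.

    Each rectangle (x, y, w_r, h_r) defines a gradient ellipse inscribed in it:
    value 1 at the ellipse centroid, decreasing linearly to 0 at the
    ellipse circumference, 0 outside the ellipse.
    """
    ys, xs = np.mgrid[0:h, 0:w]
    lmap = np.zeros((h, w))
    for (x, y, wr, hr) in rects:
        cx, cy = x + wr / 2.0, y + hr / 2.0   # ellipse centroid = rectangle centre
        # normalized elliptic distance: 0 at the centroid, 1 on the circumference
        d = np.sqrt(((xs - cx) / (wr / 2.0)) ** 2 + ((ys - cy) / (hr / 2.0)) ** 2)
        lmap = np.maximum(lmap, np.clip(1.0 - d, 0.0, None))
    return lmap
```

Taking the element-wise maximum keeps overlapping heads from cancelling each other out.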

Total localization error
Positions of objects in images are determined by the coordinates of object centroids. Evaluation of the detection performance of centroid-based object detectors must consider the distances of centroid predictions to the nearest ground truth centroids. Therefore, each prediction must be associated with exactly one ground truth label, and each ground truth label must be associated with exactly one prediction. If the number of predictions does not match the number of ground truth labels, we add a corresponding number of virtual predictions or ground truth labels to ensure equality of their numbers. Herein, we expect the coordinates of the virtual predictions and virtual ground truth labels to be at infinity. Using the above stated principle, we define the total localization error (a single-class detection performance measure) as follows.
Let a dataset consist of annotated images I_i. The i-th image is associated with a set G_i of n_i ground truth centroids, where n_i ∈ Z0+. Let the detector predict for I_i a set Ĝ_i of n̂_i object centroids, where n̂_i ∈ Z0+. The total localization error E_Σ of the detector on the dataset is given as the sum (17) of the localization errors E_i of the detector on the individual images I_i.
Let the localization error E_i of the i-th image be given as the sum of the smallest relative distances between the ground truth coordinates and their closest predictions, where each prediction is associated with exactly one ground truth label and, simultaneously, each ground truth label is associated with exactly one prediction (18). The ground truth label-prediction pairs are formed iteratively: in each step, the pair of a remaining ground truth label and a remaining prediction with the smallest relative distance is selected, and both its members are removed from further consideration (19)-(22). At the first step, the considered multisets contain, with multiplicity one, all elements of the sets G_i of ground truth centroids and Ĝ_i of centroid predictions, respectively, and the virtual element (∞, ∞, 0) with multiplicity max{0, n̂_i − n_i} and max{0, n_i − n̂_i}, respectively, so that both multisets have the cardinality max{n_i, n̂_i}.
The relative distance between a ground truth centroid and a centroid prediction in the i-th image is given as their Euclidean distance normalized with respect to the width w_i and height h_i of the image I_i (23), so that the highest possible relative distance within the image equals 1; the relative distance of any pair containing a virtual element at infinity is therefore 1.
In other words, if the number of predictions n̂_i is equal to the number of ground truth labels n_i for the i-th image (Fig. 7(a)), the error (18) can be directly calculated as the sum of the smallest relative distances of the ground truth label-prediction pairs designed according to (19)-(22). If n̂_i > n_i (Fig. 7(b)), we add (n̂_i − n_i) virtual ground truth labels (∞, ∞, 0) to allow the calculation of the error (18). If n̂_i < n_i (Fig. 7(c)), we add (n_i − n̂_i) virtual predictions (∞, ∞, 0).
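A sketch of the per-image error with greedy nearest-pair matching and virtual points at infinity. We assume the relative distance is the Euclidean distance normalized by the image diagonal (so its maximum within the image is 1, matching the virtual-pair contribution); the function name is ours:

```python
import math

def localization_error(gts, preds, w, h):
    """Localization error of one image; gts and preds are lists of (x, y) centroids.

    Pairs are formed greedily by repeatedly matching and removing the globally
    closest (ground truth, prediction) pair; every unmatched ground truth or
    prediction is paired with a virtual point at infinity and contributes 1.
    """
    diag = math.hypot(w, h)
    gts, preds = list(gts), list(preds)
    err = float(abs(len(gts) - len(preds)))   # virtual pairs at infinity
    for _ in range(min(len(gts), len(preds))):
        # find the globally closest remaining pair
        d, (i, j) = min(
            ((math.hypot(g[0] - p[0], g[1] - p[1]) / diag, (gi, pj))
             for gi, g in enumerate(gts) for pj, p in enumerate(preds)),
            key=lambda t: t[0])
        err += d
        gts.pop(i)
        preds.pop(j)
    return err
```

The total localization error of a detector on a dataset is then simply the sum of these per-image errors.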

Relative inference time
Let the relative inference time of a detector be given as the ratio (24) of the total inference time of the detector on a set of images to the total inference time of a baseline detector on this set.
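The measure is a simple ratio; a minimal sketch, together with the frame rate used later in the evaluation (function names are ours):

```python
def relative_inference_time(t_detector, t_baseline):
    # ratio of a detector's total inference time to the baseline's, on the same image set
    return t_detector / t_baseline

def frame_rate(n_images, t_total):
    # frames processed per second, f = n / t
    return n_images / t_total
```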

Experiment conditions
We implement the centroid-based detector in Python 3.6 with TensorFlow 2.0. We train the full resolution U-Net and the reduced U-Net map generators from scratch, minimizing a binary cross-entropy function. We use normal distribution initialization with the mean and standard deviation set to 0 and 0.05, respectively. When training the reduced U-Net, we resize the localization maps to 72 × 72 px (the dimensions of the localization maps produced by the reduced U-Net). For both variants of U-Nets, we rescale the values of the images and of the localization maps to the range [0, 1]. In the centroid counterpoint module, we set the threshold value T and the size of the maximum filter h_f × w_f to 0.65 and 10 × 10 px, respectively. The threshold value was estimated experimentally. The filter size is set with respect to the most common size of heads in the images.
Since the Computer Vision Toolbox in MATLAB offers a YOLO implementation wrapped in a very user-friendly interface, including the possibility of deployment to Jetson NANO, we implement the YOLO detector (the bounding box-based detector) in MATLAB instead of the original Darknet framework. We use models pre-trained on ImageNet for training on the person detection task. We replace the layers after 'out_relu', 'inception_5b-output', and 'relu_conv10' (naming according to MATLAB) in GoogLeNet, MobileNet-v2, and SqueezeNet, respectively, with the last YOLOv2 layers. Moreover, since these backbone architectures expect a three-channel RGB image as input, we add a convolutional layer with three trainable 3 × 3 filters in front of these base models to transform the single-channel input into a three-channel input. We empirically find that an overlap threshold of 0.75 in non-maximum suppression gives good results, and that 7 anchor boxes are a good balance between detection performance and processing time. We estimate the widths and heights of the anchor boxes on the dataset using a k-means clustering algorithm and the IoU distance metric [34].
We use the PyTorch implementation of CenterNet from GitHub [40] with its initial setting of all parameters. As backbones, we use ResNet-101 and EfficientDet-D0 models pre-trained on the COCO dataset for training on the person detection task. We modify the backbones to process 288 × 288 px images analogously to the YOLO detector, and we adjust the CenterNet outputs for the one-class problem.
For both the centroid-based and the bounding box-based detectors, we convert the input images into greyscale. We carry out 5 training sessions. Within each session, we train the localization map generators and all variants of the bounding box-based detectors on an identical training subset. For each training session, we randomly split the dataset in the ratio 17:3 into training and validation subsets, respectively. We train the generators and the bounding box-based detectors with mini-batches of 8 samples for 300 and 30 epochs, respectively. We save the models and validate them on the validation subset in every epoch. We shuffle the samples in every epoch.
We use the Adam optimizer for the training of the map generators and of the bounding box-based person detectors. We set the exponential decay rates for the first and second moment estimates to 0.9 and 0.999, respectively. For the generators and for the CenterNet detectors, we use an initial learning rate of 10⁻³. In the case of the CenterNet detectors, we multiply the learning rate by a factor of 0.96 every 10 epochs. For the YOLO detectors, we set the learning rate of the last (the YOLOv2) layers to 10⁻³; the preceding layers are not modified during learning. These settings are adapted from the sources of the individual architectures, i.e., the Computer Vision Toolbox in MATLAB for the YOLO detector, and the GitHub source [40] for CenterNet.

Fig. 7. Calculation of the localization error for the i-th image (black solid lines) with two ground truth centroids (black crosses) in the case of (a) two, (b) three, and (c) zero centroid predictions (red circles). In the case of two predictions (a), the number of ground truth and the number of predicted coordinates are equal, and the localization error is simply the sum of the smallest distances between the ground truths and the predictions (dashed black lines), i.e. the localization error of the i-th image equals 0.20. In the case of three predictions (b), one of the predictions is redundant. When calculating the localization error, we first find for each ground truth centroid the nearest prediction and calculate the distance between the prediction and the ground truth. The remaining prediction is the redundant one, and we consider its ground truth to be at infinity, which corresponds to the highest possible distance in the image (i.e. a relative distance of 1). Thus, the localization error equals 1.15 in this case. In the case of zero predictions (c), two predictions are missing. We expect the predictions to be at infinity, which results in a localization error of 2.00.
We use data augmentation to avoid overfitting during the training of the map generators. Specifically, we use random rotation (range of the rotation angle: ±20 degrees), random horizontal and vertical flipping with probability 0.5, random horizontal and vertical translation (up to ±20 % of the image height and width, respectively), random rescaling (zoom range 0.2), and random horizontal and vertical shear (shear intensity 0.2). We perform the augmentation with the filling mode set to 'nearest'.
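The listed ranges correspond to a Keras-style augmentation configuration; a minimal pure-Python sketch of drawing one random set of augmentation parameters within the stated ranges (an illustration only, not the training pipeline):

```python
import random

def draw_augmentation(rng):
    """Draw one random set of augmentation parameters
    within the ranges used for training the map generators."""
    return {
        "rotation_deg": rng.uniform(-20.0, 20.0),   # +/-20 degrees
        "flip_horizontal": rng.random() < 0.5,      # probability 0.5
        "flip_vertical": rng.random() < 0.5,
        "shift_x": rng.uniform(-0.2, 0.2),          # up to +/-20 %
        "shift_y": rng.uniform(-0.2, 0.2),
        "zoom": rng.uniform(0.8, 1.2),              # zoom range 0.2
        "shear": rng.uniform(-0.2, 0.2),            # shear intensity 0.2
    }

params = draw_augmentation(random.Random(0))
```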
We evaluate the performance of the person detectors on the test and blind datasets. For both variants of the centroid-based and all five variants of the bounding box-based detectors, we select the best performing model (the model with the smallest value of the loss function over a validation subset obtained within the five training sessions). We calculate the total localization error (17) of the models on both datasets. For each image in a dataset, we convert the set of YOLO bounding box predictions into a set of centroid predictions; the coordinate of the j-th centroid prediction is the centre of the j-th predicted bounding box, i.e., the point (x_j + w_j/2, y_j + h_j/2) for a box with top-left corner (x_j, y_j), width w_j, and height h_j. In the case of CenterNet, we use the centroid coordinate predictions directly in its evaluation. For each person detector, we detect persons in one identical image a thousand times while measuring the total inference time of the detector. From these measurements, we calculate the frame rate of each detector as the reciprocal of its mean per-image inference time, and the relative inference time (24), where the total inference time of the centroid-based detector with the reduced U-Net map generator is the baseline. We train and evaluate the detection performance of the detectors on a personal computer with an Intel Core i5-8600K (3.6 GHz) CPU, 16 GB DDR4 (2666 MHz) internal memory, and an NVIDIA PNY Quadro P5000 video card with 16 GB GDDR5 (PCIe 3.0). For the evaluation of the inference time, we use an NVIDIA Jetson Nano single-board computer with a quad-core ARM A57 1.43 GHz CPU and 4 GB RAM. To allow an unbiased comparison of the inference times, we export all detectors into TensorRT, NVIDIA's CUDA-based inference engine.
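The bounding box to centroid conversion amounts to taking box centres; a minimal sketch (assuming boxes in (x, y, w, h) form with (x, y) the top-left corner, which is an assumption about the YOLO output layout here):

```python
def boxes_to_centroids(boxes):
    """Convert bounding boxes (x, y, w, h) with top-left corner (x, y)
    into centroid predictions (the box centres)."""
    return [(x + w / 2.0, y + h / 2.0) for x, y, w, h in boxes]

# e.g. a 40x20 box anchored at (10, 30) maps to the centroid (30.0, 40.0)
centroids = boxes_to_centroids([(10, 30, 40, 20)])
```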

Results
We train the centroid-based detectors as well as the competitive bounding box-based detectors on the datasets described in Section 2.5 according to the procedure addressed in Section 2.7. To show the performance capabilities of the detectors, we summarize the resulting values of the evaluation measures described in Section 2.6 in Table 3.
Additionally, we demonstrate the absolute frequencies of differences between the numbers of ground truth labels and the numbers of predictions of the detectors for the test dataset (Table 4) and for the blind dataset (Table 5). A graphical representation of these differences is depicted in Fig. 8.

Table 3. Total localization errors, relative inference times, and frame rates (first row) of the best performing models (third to ninth rows) on the test and blind datasets (second row). The inference times were measured on one identical image processed a thousand times, and FPS stands for frames per second. The best result for each measure is in bold.

Table 4. Absolute frequencies of differences between the numbers of ground truth labels and the numbers of predictions for the test dataset (1000 samples). The frequencies of multiple detections (columns for negative numbers in the first row), overlooks (columns for positive numbers in the first row), and correct detections (the column for zero in the first row) are arranged with respect to the detectors (the first column). The highest value for the correct detections indicates the best performing detector (in bold).

Discussion
The evaluation results presented in Table 3 speak in favour of centroid-based person detection. On both the test dataset and the blind dataset, the localization errors of both variants of the centroid-based detector are less than half of the localization errors of the best performing bounding box-based detector (YOLOv2 with the MobileNetv2 backbone). The centroid-based detector with the reduced U-Net map generator shows even 2.6- and 2.9-times smaller errors on the test and blind datasets, respectively, compared to the YOLOv2 with the MobileNetv2 backbone.
The high localization errors of the YOLO detectors (Table 3: YOLO-GoogleNet and YOLO-SqueezeNet on both datasets, and YOLO-MobileNetv2 on the blind dataset) indicate predispositions of the detectors to multiple detections or to marginalization of persons in the images. The results summarized in Tables 4-5 and in Fig. 8 confirm this suspicion. The YOLO-SqueezeNet, which has the highest errors on both datasets (Table 3), shows a clear tendency to overlook persons (Fig. 8). The YOLO-GoogleNet, with the second highest errors on both datasets, inclines to false detections on the blind dataset; on the test dataset, it leans rather to overlooking persons (Fig. 8). The high error of YOLO-MobileNetv2 on the blind dataset and its small error on the test dataset indicate a low generalization capability of the detector. The decline in the number of images with the correct number of detections confirms this assumption (compare the histograms for YOLO-MobileNetv2 on the two datasets in Fig. 8, or the results in Tables 4-5).
The tendency towards multiple detections and marginalization of persons is even greater in the case of the CenterNet detectors. This is most apparent in the results in Tables 4-5; however, this property of the detectors naturally emerges in the localization errors too. The CenterNet detectors with the EfficientDet-D0 and ResNet101 backbones show the highest and second highest localization errors among the tested detectors on both the test and the blind datasets.
The low values of the error for both variants of the centroid-based detector point to low numbers of false and missed detections, which the histograms shown in Fig. 8 also confirm (see Full resolution U-Net and Reduced U-Net). The values of the error on the blind dataset (Table 3: Full resolution U-Net and Reduced U-Net) indicate a good generalization capability of the centroid-based detector. The detector with the reduced U-Net map generator has the highest number of images with the correct number of detections on both datasets (Tables 4-5).

Table 5. Absolute frequencies of differences between the numbers of ground truth labels and the numbers of predictions for the blind dataset.
Despite the low quality of some images in the datasets, the number of detections matches the number of persons for 88.5 % and 71.9 % of the images in the test and blind datasets, respectively. The utilization of quarter-size localization maps instead of the full resolution ones results in a slight improvement in detections, which is apparent from the localization errors (Table 3: Full resolution U-Net and Reduced U-Net) as well as from the numbers of correct detections (Tables 4-5). Decreasing the map resolution does not change the distribution of the false and missed-detection frequencies (Fig. 8: Full resolution U-Net and Reduced U-Net on the two datasets). The results indicate that the reduction of the map resolution contributes to a better generalization of the network within the training phase.
The simplification of the U-Net topology, which we have made within the development of the reduced U-Net, allows us to reach an inference time comparable to the fastest bounding box-based detector (compare the values in Table 3 for Reduced U-Net and YOLOv2-SqueezeNet). It is worth mentioning in this context that the YOLOv2 with the SqueezeNet backbone shows one of the highest localization errors among the evaluated detectors (14.4-times and 4.7-times higher errors on the test and blind datasets, respectively, compared to the centroid-based detector with the reduced U-Net map generator). When compared with the best performing bounding box-based detector, the centroid-based detector with the reduced U-Net map generator is about 40 % faster than the YOLOv2 with the MobileNetv2 backbone (Table 3). When using the full resolution U-Net, the inference times of the centroid-based detector and of the YOLOv2 with the MobileNetv2 backbone are almost identical.
The presented results confirm our expectations with respect to the advantages of centroid-based detection for the localization of persons in orthogonally captured images. The low localization errors of both versions of the centroid-based detector support previously published results, which point to the superiority of centroid-based object detection over bounding box-based object detection for counting small objects [41]. The centroid-based detector also meets the edge computing standards, especially when using the reduced U-Net as the map generator.

Conclusion
We proved that the determination of head centroid positions, using a fully convolutional network (U-Net) in combination with the presented sequence of simple image processing operations (the centroid counterpoint module), is an efficient way for the fast and precise detection of persons in orthogonally captured images. The presented centroid-based person detector meets the edge computing standards, has a good generalization capability, and shows a small localization error even on low quality images. It efficiently operates in diverse environments, including environments with high variability in elevation profiles. The utilization of quarter-size localization maps instead of the full resolution ones allowed us to reduce the inference time of the detector by 40 %. A side effect of the reduction is a slight improvement in the detection performance. Considering all these facts and the low price of visible spectrum cameras (compared to depth cameras, 3D laser scanners, etc.), we conclude that the centroid-based detector allows the development of low-cost and powerful commercial solutions, particularly aimed at automatic tracking and counting of persons in public transport. The presented localization error allowed us to quantitatively compare the detection performance of both centroid-based and bounding box-based detectors. For the annotation of the datasets, we used the bounding box inspired annotation. Such an approach allowed simultaneous annotation of the images for both bounding box-based and centroid-based detection, which considerably simplified the annotation process.

Declaration of competing interest
The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: Pavel Kryda is currently affiliated with a commercial company, Mikroelektronika spol. s r. o. He provided expertise in dataset acquisition and supervision in this article. This does not alter the authors' adherence to Journal of Computational Science policies on sharing data and materials.

Petr Dolezel received his Ph.D. degree from the University of Pardubice, Czech Republic, in 2009. In 2017, he defended his habilitation thesis at Tomas Bata University. He works as an associate professor and vice-dean for research and development at the Faculty of Electrical Engineering and Informatics, University of Pardubice. His research interests include neural and evolutionary computation in process control and in image processing. He is the author of more than 100 scientific contributions, including 20 journal papers and lectures at CORE-ranked conferences. He has been a leader or member of research teams for a dozen research and development projects.

Dominik Stursa is a Ph.D. student at the University of Pardubice, Czech Republic. His topic is image processing applications with deep neural networks. Since 2019, he has been a lecturer at the Process Control Department, University of Pardubice. He is a key member of a research group led by Petr Dolezel. His research aims at robotics, signal and image processing, and neural networks. He is an author of 5 journal articles and more than 10 conference papers. He is a member of the IEEE Robotics and Automation Society and the IEEE Signal Processing Society.
Bruno Baruque Zanon has held an associate professor position at the University of Burgos, Spain, since 2018. He obtained his Ph.D. degree in Computer Science (Artificial Intelligence) in 2009 at that same university. He is an active member of the GICAP research group at the University of Burgos and the author of more than 80 research publications in indexed journals and conferences. His research interests focus on the data analysis and automated learning field, with special emphasis on artificial neural networks. He collaborates as a guest editor and reviewer for several international journals and numerous international conferences related to the artificial intelligence knowledge area.
Hector Cogollos Adrian is a graduate in computer engineering (2019) and holds a master's degree in computer engineering (2021). He currently works as an assistant teacher at the Computer Science Department while pursuing his Ph.D. degree at the University of Burgos. His research interest focuses on the study of artificial intelligence techniques for the analysis and improvement of mobility.
Pavel Kryda received the M.S. degree in electrical engineering from the University of Pardubice, Czech Republic, in 2012. He worked as a HW research specialist at Mikroelektronika spol. s r. o., Vysoke Myto, Czech Republic. In 2019, he was promoted to lead product manager. His previous work experience includes embedded systems programming, design of electrical devices, and sensor systems.