The Impact of Partial Occlusion on Pedestrian Detectability

Abstract: Robust detection of vulnerable road users is a safety-critical requirement for the deployment of autonomous vehicles in heterogeneous traffic. One of the most complex outstanding challenges is that of partial occlusion, where a target object is only partially available to the sensor due to obstruction by another foreground object. A number of leading pedestrian detection benchmarks provide annotation for partial occlusion; however, each benchmark varies greatly in its definition of the occurrence and severity of occlusion. Recent research demonstrates that a high degree of subjectivity is used to classify occlusion level in these cases, and occlusion is typically categorized into 2-3 broad categories such as "partially" and "heavily" occluded. In addition, many pedestrian instances are impacted by multiple inhibiting factors which contribute to non-detection, such as object scale, distance from camera, lighting variations and adverse weather. This can lead to inaccurate or inconsistent reporting of detection performance for partially occluded pedestrians depending on which benchmark is used. This research introduces a novel, objective benchmark for partially occluded pedestrian detection to facilitate the objective characterization of pedestrian detection models. Characterization is carried out on seven popular pedestrian detection models for a range of occlusion levels from 0-99%, in order to demonstrate the efficacy and increased analysis capabilities of the proposed characterization method. Results demonstrate that pedestrian detection performance degrades, and the number of false negative detections increases, as pedestrian occlusion level increases. Of the seven popular pedestrian detection routines characterized, CenterNet has the greatest overall performance, followed by SSDLite. RetinaNet has the lowest overall detection performance across the range of occlusion levels.

I. INTRODUCTION
Accurate and robust pedestrian detection systems are an essential requirement for the safe navigation of autonomous vehicles in heterogeneous traffic. The SAE J3016 standard [1] defines levels of driving automation ranging from Level 0, where the vehicle contains zero automation and the human driver is in complete control, to Level 5, where the vehicle is solely responsible for all perception and driving tasks in all scenarios. The progression from automation Levels 3 to 5 requires a significant increase in the assumption of responsibility by the vehicle, placing progressively increasing demands on the performance of pedestrian detection systems to inform efficient path planning and to ensure the safety of vulnerable road users. Despite recent improvements in pedestrian detection systems, many challenges remain before we reach the object detection capabilities required for safe autonomous driving. One of the most complex and persistent challenges is that of partial occlusion, where a target object is only partially available to the sensor due to obstruction by another foreground object. Occlusion in the automotive environment is frequent and varied, as pedestrians navigate between vehicles, buildings, traffic infrastructure and other road users. Pedestrians can be occluded by static or dynamic objects, may inter-occlude (occlude one another), such as in crowds, and may self-occlude, where parts of a pedestrian overlap. Leading pedestrian detection systems claim a detection performance of approximately 65%-75% of partially and heavily occluded pedestrians respectively using current benchmarks [10] [11] [12] [13]. However, recent research [14] demonstrates that the definition of the occurrence and severity of occlusion varies greatly, and a high degree of subjectivity is used to categorize pedestrian occlusion level in each benchmark.
Occlusion is typically split into 2-3 broad, loosely defined categories such as "partially" or "heavily" occluded, Table ?? [14]. In addition, many pedestrian instances are impacted by multiple inhibiting factors which contribute to non-detection, such as object scale, distance from camera, lighting variations and adverse weather. This makes it difficult to determine whether the severity of occlusion alone is the primary factor for non-detection, and can lead to inaccurate or inconsistent reporting of detection performance for partially occluded pedestrians depending on which benchmark is used. A knowledge gap exists for objective, detailed occlusion level analysis for pedestrian detection across the complete spectrum of occlusion levels. Use of an objective, fine-grained, occlusion-specific benchmark will result in more objective, consistent and detailed analysis of pedestrian detection algorithms for partially occluded pedestrians.
This research proposes a novel, objective benchmark for partially occluded pedestrian detection to facilitate the objective characterization of pedestrian detection models. Objective characterization of occluded pedestrian detection performance is carried out for seven popular pedestrian detection routines for a range of occlusion levels from 0-99%. The contributions of this research are:
1. A novel, objective test benchmark for partially occluded pedestrian detection is presented.
2. Objective characterization of pedestrian detection performance is carried out for seven popular pedestrian detection routines.

II. RELATED WORK
A number of popular pedestrian detection benchmarks provide annotation of pedestrian occlusion level to determine the relative detection performance for partially occluded pedestrians. Dollar et al. [25] provide analysis on occluded pedestrians based on the Caltech Pedestrian Dataset [5]. Caltech Pedestrian estimates the occlusion ratio of pedestrians by annotating two bounding boxes, one for the visible pedestrian area and one for the annotator's estimate of the total pedestrian area. Pedestrians are categorized into two occlusion categories: "partially occluded", defined as 1-35% occluded, and "heavily occluded", defined as 35-80% occluded. Any pedestrians suspected to be more than 80% occluded are labelled as fully occluded. Analysis of the frequency of occlusion on the Caltech Pedestrian Dataset demonstrated that over 70% of pedestrians were occluded in at least one frame, highlighting how frequently pedestrian occlusion occurs in the automotive environment. The EuroCity Persons Dataset [2] categorizes pedestrians according to three occlusion levels: low occlusion (10%-40%), moderate occlusion (40%-80%), and strong occlusion (larger than 80%). Classification is carried out by human annotators in a similar manner to the Caltech Pedestrian Dataset. The full extent of the occluded pedestrian is estimated, and the approximate level of occlusion is then estimated to be within one of the three defined categories. CityPersons [3] calculates occlusion levels by drawing a line from the top of the head to the middle of the two feet of the occluded pedestrian. Human annotators are required to estimate the location of the head and feet if these are not visible. A bounding box is then generated for the estimated full pedestrian area using a fixed aspect ratio of 0.41 (width/height). This is then compared to the visible area bounding box to denote occlusion level.
These estimates of occlusion level are then categorized into two levels, "reasonable" (<=35% occluded) and "heavy occlusion" (35%-75%). Similar approaches are taken in [8] [26] [27] [28] [29]. The KITTI Vision Benchmark [4] and Multispectral Pedestrian Dataset [6] tasked human annotators with marking each pedestrian bounding box as "visible", "semi-occluded" or "fully occluded". Although these methods are useful for the relative comparison of detection performance on specific datasets, the occlusion categories used are broad (usually 2 to 3 categories), are inconsistent from benchmark to benchmark, and involve a high degree of subjectivity by the human annotator, Table ?? [14] [30]. A knowledge gap exists for a detailed, objective benchmark to compare pedestrian detection performance for partially occluded pedestrians. Many pedestrian detection analysis papers [25] [40] highlight the outstanding challenges posed by occluded pedestrians; however, no known objective characterization of pedestrian detection performance spanning the spectrum of occlusion levels has been carried out to date.
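The inconsistency between benchmark definitions can be illustrated with a minimal Python sketch that assigns the same occlusion ratio to the categories quoted above and builds a CityPersons-style full-body box. The function names, the handling of values on a category boundary, the label used below the lowest quoted range, and the horizontal centring of the box on the head-to-feet line are assumptions made for illustration; they are not taken from the benchmark toolkits.

```python
def caltech_label(occ):
    """Caltech Pedestrian categories as quoted in the text (occ in [0, 1])."""
    if occ < 0.01:
        return "not occluded"
    if occ <= 0.35:
        return "partially occluded"
    if occ <= 0.80:
        return "heavily occluded"
    return "fully occluded"


def eurocity_label(occ):
    """EuroCity Persons categories as quoted in the text."""
    if occ < 0.10:
        return "not occluded"  # assumption: below the lowest quoted range
    if occ <= 0.40:
        return "low occlusion"
    if occ <= 0.80:
        return "moderate occlusion"
    return "strong occlusion"


def citypersons_label(occ):
    """CityPersons categories as quoted in the text."""
    if occ <= 0.35:
        return "reasonable"
    if occ <= 0.75:
        return "heavy occlusion"
    return "outside quoted ranges"  # assumption: no category is quoted above 75%


def citypersons_full_box(head_top, feet_mid, aspect=0.41):
    """Full-body box (x1, y1, x2, y2) from the head-to-feet line.

    The width is aspect * height; centring the box on the midpoint of the
    line is an assumption of this sketch.
    """
    (hx, hy), (fx, fy) = head_top, feet_mid
    height = fy - hy
    width = aspect * height
    cx = (hx + fx) / 2.0
    return (cx - width / 2.0, hy, cx + width / 2.0, fy)


def box_occlusion(full_box, visible_box):
    """Occlusion ratio as 1 - visible box area / full box area."""
    def area(b):
        return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    return 1.0 - area(visible_box) / area(full_box)


if __name__ == "__main__":
    occ = 0.38
    print(caltech_label(occ), "|", eurocity_label(occ), "|", citypersons_label(occ))
```

Under these assumptions, a pedestrian who is 38% occluded is "heavily occluded" in Caltech terms but only "low occlusion" in EuroCity Persons terms, which is the kind of mismatch that makes cross-benchmark comparison difficult.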
Gilroy et al. [30] describe an objective method of occlusion level annotation and visible body surface area estimation of partially occluded pedestrians. Keypoint detection is applied to identify semantic body parts, and findings are cross-referenced with a visibility score and the pedestrian mask in order to confirm the presence or occlusion of each semantic part. A novel method of 2D body surface area estimation based on the "Wallace rule of Nines" [14] [41] is then used to quantify the total occlusion level of pedestrians. Experimental results demonstrate that the proposed objective occlusion level classifier outperforms prior works and more closely matches the pixel-wise occlusion level of pedestrians in images than the previous state of the art.
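The idea of weighting semantic parts by body surface area can be sketched in a few lines of Python. The weights below are the classic Wallace rule of nines percentages with the front and back of the trunk merged; the 2D adaptation used in [30] may assign different proportions, so the table and the function name are illustrative assumptions only.

```python
# Classic rule of nines percentages (trunk front and back merged).
# Assumption: used here as stand-in weights; the 2D variant in [30] may differ.
RULE_OF_NINES = {
    "head": 9.0,
    "torso": 36.0,
    "left_arm": 9.0,
    "right_arm": 9.0,
    "left_leg": 18.0,
    "right_leg": 18.0,
    "groin": 1.0,
}


def occlusion_level(visible_fraction):
    """Return the occlusion level in percent.

    visible_fraction maps a part name to the fraction of that part that is
    visible (0.0 to 1.0); parts missing from the mapping count as occluded.
    """
    total = sum(RULE_OF_NINES.values())
    visible = 0.0
    for part, weight in RULE_OF_NINES.items():
        frac = min(max(visible_fraction.get(part, 0.0), 0.0), 1.0)
        visible += weight * frac
    return 100.0 * (1.0 - visible / total)


if __name__ == "__main__":
    # Only the head and torso are visible, e.g. a pedestrian behind a parked car.
    print(round(occlusion_level({"head": 1.0, "torso": 1.0}), 1))
```

With these stand-in weights, a pedestrian whose head and torso are visible but whose limbs are hidden is scored as 55% occluded.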

III. METHODOLOGY
A novel occluded pedestrian test dataset, containing 820 person instances in 724 images, has been created in order to characterize pedestrian detection performance across a range of occlusion levels from 0 to 99% occluded. A diverse mix of images is used to ensure that a wide variety of target pedestrians, pedestrian poses, backgrounds, and occluding objects are represented. The dataset contains both natural and superimposed occlusions in order to facilitate pedestrian detection characterization for the complete spectrum of occlusion levels from 0-99%.
The dataset is sourced from three main categories of images: 1) the "occluded body" subset of the partial re-identification dataset "Partial ReID" provided by Zheng et al. [42], 2) the Partial ReID "whole body" subset [42] with custom superimposed occlusions, and 3) images collated from publicly available sources including [2] [3] [14] [43]. All images are annotated using the objective occlusion level classification method described in [30]. Occlusion level classification consists of the following steps:
1. Keypoint detection is applied to the input image in order to identify the presence and visibility of specific semantic parts for each pedestrian instance.
2. A visibility threshold is applied to identify occluded keypoints.
3. MaskRCNN is applied to define the pedestrian mask area and results are cross-referenced with detected keypoints to confirm which keypoints are occluded within the image.
4. Visible keypoints are then grouped into larger semantic parts and the total visible surface area is calculated using the 2D body surface area estimation method displayed in Figure 3 [14] [30].
Complex cases at very high occlusion rates were manually verified using the method of 2D body surface area estimation presented in Figure 3. Each occlusion level contains a minimum of 55 pedestrian instances. A sample of the test dataset and dataset statistics by occlusion level can be seen in Figure 1 and Figure 2 respectively.

TABLE II (excerpt). Pedestrian detection models.
Model | Backbone | Training data | Model zoo | mAP
SSDLite [21] [22] | MobileNetV3 Large | COCO | Torchvision [20] | 0.464
RetinaNet [23] | ResNet-50 FPN | COCO | Voxel51 [16] | 0.361
CenterNet [24] | Hourglass-104 | COCO | Voxel51 [16] | 0.533
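The annotation steps listed above can be sketched as follows. This is a simplified illustration rather than the implementation from [30]: the visibility threshold of 0.5, the grouping of COCO keypoint names into semantic parts, and all function names are assumptions, and the keypoints and mask would in practice come from a keypoint detector and MaskRCNN rather than being passed in by hand.

```python
VISIBILITY_THRESHOLD = 0.5  # hypothetical value, not taken from the paper

# Hypothetical grouping of COCO keypoint names into larger semantic parts.
PART_KEYPOINTS = {
    "head": ["nose", "left_eye", "right_eye", "left_ear", "right_ear"],
    "torso": ["left_shoulder", "right_shoulder", "left_hip", "right_hip"],
    "left_arm": ["left_elbow", "left_wrist"],
    "right_arm": ["right_elbow", "right_wrist"],
    "left_leg": ["left_knee", "left_ankle"],
    "right_leg": ["right_knee", "right_ankle"],
}


def visible_keypoints(keypoints, mask, threshold=VISIBILITY_THRESHOLD):
    """Steps 2 and 3: keep keypoints that pass the score threshold AND lie
    on the pedestrian mask.

    keypoints: dict of name -> (x, y, score)
    mask: list of rows of booleans, True where a pixel belongs to the pedestrian
    """
    visible = set()
    for name, (x, y, score) in keypoints.items():
        row, col = int(round(y)), int(round(x))
        inside = 0 <= row < len(mask) and 0 <= col < len(mask[row]) and mask[row][col]
        if score >= threshold and inside:
            visible.add(name)
    return visible


def part_visibility(visible):
    """Step 4 (first half): fraction of each semantic part's keypoints visible."""
    return {
        part: sum(name in visible for name in names) / len(names)
        for part, names in PART_KEYPOINTS.items()
    }


def occlusion_bin(percent):
    """Map an occlusion level in percent to a 10% range such as '30-39%'."""
    lower = min(int(percent) // 10, 9) * 10
    return f"{lower}-{lower + 9}%"
```

The per-part fractions would then be weighted by body surface area to give a single occlusion percentage, which `occlusion_bin` assigns to one of the ten ranges from 0-9% to 90-99% used for characterization.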

A. Pedestrian Detection Models
Performance characterization was carried out on seven popular pedestrian detection models. All models use publicly available pretrained weights from two popular model zoos [16] [20] and are trained using the COCO "train 2017" dataset [44]. An overview of the pedestrian detection models can be seen in Table II. The pedestrian detection models chosen for characterization can be divided into three categories: two-stage frameworks, one-stage frameworks and keypoint estimation. Two-stage frameworks such as FasterRCNN [15], MaskRCNN [17] and R-FCN [18] apply two separate networks to perform classification. One network is used to propose regions of interest and a dedicated second network performs object detection [45]. One-stage frameworks such as RetinaNet [23], SSD [19] and SSDLite [21] [22] attempt to reduce computation and increase speed by performing object detection using a single feed-forward convolutional network that does not interact with a region proposal module. RetinaNet also implements a novel "focal loss", which is used to reduce the imbalance between foreground and background classes during training with a view to increasing detection precision. CenterNet [24] takes an alternative approach based on keypoint estimation. Objects are represented as a single point at their bounding box center, identified by a heat map generated using a fully convolutional network. Other object features such as object size, orientation and pose are then regressed directly from the image features at the center location. CenterNet has been shown to outperform a number of state-of-the-art one-stage and two-stage algorithms in terms of the speed-accuracy trade-off by maintaining an efficient network architecture [24].
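To make the focal loss mentioned above concrete, the following is a minimal per-example sketch of the binary form, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t). The defaults alpha = 0.25 and gamma = 2 follow the values commonly reported for RetinaNet [23]; the function name and the clamping constant are assumptions of this sketch, and a training implementation would operate on tensors rather than scalars.

```python
import math


def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-12):
    """Binary focal loss for one prediction.

    p: predicted probability of the foreground class (0..1)
    y: ground-truth label, 1 for foreground and 0 for background
    """
    p = min(max(p, eps), 1.0 - eps)        # clamp to avoid log(0)
    p_t = p if y == 1 else 1.0 - p          # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # (1 - p_t) ** gamma shrinks the loss of well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    print(focal_loss(0.9, 1), focal_loss(0.1, 1))
```

With gamma = 0 the expression reduces to alpha-weighted cross entropy, so the modulating factor is what down-weights the large number of easy background examples.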

B. Experiments
Detection performance is analyzed for the complete test dataset, and for each occlusion range from 0-9% to 90-99%, for all seven pedestrian detection models to assess the impact of progressive levels of occlusion on the detectability of pedestrians. Analysis is carried out using Voxel51 [46] and the COCO-style evaluation metric Mean Average Precision (mAP). Mean Average Precision is a popular and rigorous metric for object detection that calculates the Average Precision (AP) for a range of Intersection over Union (IoU) values from 0.5 to 0.95 with a step size of 0.05 and produces the mean value [44]. A summary of the results is shown in Figure 4, Figure 5 and Figure 6. All models are also characterized using the KITTI Vision Benchmark [4] in order to compare and demonstrate the advanced analysis capabilities provided by the proposed benchmark. Results on the KITTI Vision Benchmark are shown in Figure 8.
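The COCO-style metric can be illustrated with a short sketch: IoU between two boxes, the ten IoU thresholds from 0.5 to 0.95 in steps of 0.05, and the final averaging of per-threshold AP values. The function names are assumptions, and the computation of each AP value from a precision-recall curve is deliberately left out; in the experiments this is handled by the evaluation tooling.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Ten thresholds: 0.50, 0.55, ..., 0.95
COCO_IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def mean_average_precision(ap_per_threshold):
    """Mean of the AP values computed at each IoU threshold."""
    return sum(ap_per_threshold) / len(ap_per_threshold)


if __name__ == "__main__":
    ground_truth = (0, 0, 10, 10)
    detection = (0, 0, 10, 6)  # covers only the upper part of the pedestrian
    overlap = iou(ground_truth, detection)
    passed = [t for t in COCO_IOU_THRESHOLDS if overlap >= t]
    print(overlap, passed)
```

A detection that covers only the visible portion of an occluded pedestrian can pass the loose thresholds and fail the strict ones, which is one reason mAP falls as occlusion increases even when the pedestrian is nominally found.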

IV. RESULTS AND ANALYSIS
Results demonstrate that pedestrian detection performance (mAP) declines as the level of pedestrian occlusion increases, Figure 4. The number of false negative detections increases as occlusion level increases, Figure 6(c), and in general, the number of true positive detections begins to decrease significantly for pedestrians more than 50% occluded, Figure 6(a). As shown in Figure 5, of the seven popular pedestrian detection models analyzed, CenterNet [24] has the greatest overall detection performance for partially occluded pedestrians with an overall mAP of 0.533, followed by SSDLite [21] [22] with a total dataset mAP of 0.464.
The strategy employed by CenterNet of first identifying the bounding box centre using a keypoint heatmap and then predicting object size and bounding box dimensions relative to the centre point has demonstrated the highest precision bounding boxes for both fully visible pedestrians and for pedestrians up to 80% occluded, Figure 4. MaskRCNN [17] has the greatest detection performance for pedestrians occluded more than 80%, Figure 4. RetinaNet [23] is the lowest performing model overall on the test data with a mAP of 0.361. RetinaNet's true positive detections begin to degrade significantly when pedestrians are more than 30% occluded, and this model has the highest number of false negatives for pedestrians more than 30% occluded, Figure 6(a) and 6(c). The single shot detectors, SSD [19] and SSDLite [21] [22], have the highest number of true positive detections at high levels of occlusion, Figure 6(a), and maintain a very high level of true positive detections up to 60% occlusion; however, their false positive rate is in the region of 100 times larger than that of popular two-stage detectors such as FasterRCNN [15] and R-FCN [18] and approximately 16 times larger than that of MaskRCNN [17], Figure 6(b). Unlike false negatives, the number of false positives per image does not appear to be significantly impacted by the occlusion level, as these are not typically related to the target pedestrian in an image. SSDLite [21] [22] outperforms SSD [19] for almost all levels of occlusion despite having a higher number of false positive detections. MaskRCNN [17] has a higher percentage of true positives than FasterRCNN [15] for pedestrians over 40% occluded; however, it has around 4 times more false positive detections for the same data, Figure 6. MaskRCNN, R-FCN and SSD all have similar overall performance on the test dataset; however, MaskRCNN and R-FCN have a higher detection performance than SSD for pedestrians that are more than 60% occluded, Figure 4.
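The true positive, false positive and false negative counts discussed above depend on how detections are matched to ground truth. A minimal sketch of greedy, confidence-ordered matching at a single IoU threshold is given below; it is a simplification of COCO-style matching (no crowd regions or per-category handling), and the function names and the threshold of 0.5 are assumptions.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def count_detections(gt_boxes, detections, iou_threshold=0.5):
    """Return (true_positives, false_positives, false_negatives).

    detections: list of (box, confidence). Each ground-truth box can be
    matched at most once; detections are processed from highest to lowest
    confidence and take the best remaining ground-truth box.
    """
    unmatched = list(range(len(gt_boxes)))
    tp = fp = 0
    for box, _score in sorted(detections, key=lambda d: d[1], reverse=True):
        best, best_iou = None, iou_threshold
        for g in unmatched:
            overlap = iou(box, gt_boxes[g])
            if overlap >= best_iou:
                best, best_iou = g, overlap
        if best is None:
            fp += 1          # no ground truth left that overlaps enough
        else:
            tp += 1
            unmatched.remove(best)
    return tp, fp, len(unmatched)  # unmatched ground truth = false negatives
```

Because false positives are counted per detection rather than per pedestrian, a model that emits many low-confidence boxes can keep a high true positive count while accumulating a large false positive count, which matches the pattern reported for the single shot detectors.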
Figure 7 compares the output from a two-stage detector, FasterRCNN, with a one-stage detector, SSD, for an occluded pedestrian. Two-stage detectors first generate key regions of interest before applying object detection, whereas one-stage detectors apply object detection directly to the entire image. Figure 7 demonstrates that for the same image, FasterRCNN produces 4 detection outputs (1 true positive with 88% confidence and 3 false positives), Figures 7(b) and 7(c), whereas SSD produces . This indicates that detection models must not all be treated equally in the design of a pedestrian detection system. The characteristics and weaknesses of each detection model, identified through robust performance characterization, must be taken into account further downstream in the object detection system, as some model outputs may be less reliable than others for safety-critical systems.

A. Benchmark Comparison
Although a number of datasets contain occlusion labels to indicate the level of occlusion, current benchmarks are not designed for thorough characterization of partially occluded pedestrian detection performance. Each benchmark varies greatly in its definition of the occurrence and severity of occlusion, and each benchmark uses different but highly subjective methods of occlusion level annotation, Table ?? [30]. In addition, many pedestrian instances are impacted by multiple additional inhibiting factors, making it difficult to determine if the contributing factor to non-detection is occlusion level alone. Algorithm performance can still be compared using the current state of the art; however, users are unable to determine with any certainty if any non-detection is the result of occlusion or one of many other inhibiting factors such as object scale, distance from camera, adverse weather and lighting variations. This also makes it very difficult to accurately compare algorithm performance across multiple benchmarks.
Take the popular KITTI Vision Benchmark as an example. Images are annotated for three levels of occlusion: "Fully Visible", "Partially Occluded" and "Difficult to See". Images are captured using a wide-angle lens and contain many contributing factors to non-detection in addition to occlusion, as shown in Figure 8(b). The dataset is split into three test subsets in order to characterize pedestrian detection models by occlusion label: 1) images that only contain pedestrians tagged as "Fully Visible" (1669 instances in 1242 images); 2) images that only contain pedestrians tagged as "Partially Occluded" (236 instances in 216 images); and 3) images that only contain pedestrians tagged as "Difficult to See" (208 instances in 158 images). Sitting persons and persons on bicycles are included for test purposes in cases where they have a suitable occlusion label. Pedestrian detection performance is then assessed on each of the three subsets, as shown in Figure 8(a). Results demonstrate that performance degrades for each broad, more complex data subset, and MaskRCNN [17] has the greatest overall performance on the KITTI Vision Benchmark data. However, partial occlusion cannot be concluded to be the only contributing factor to non-detection, as many pedestrian instances have a number of additional inhibiting factors such as object scale, distance from camera and lighting variations.
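The subset split described above can be sketched against the KITTI object label format, in which each line starts with the object type followed by the truncation value and an integer occlusion state. The mapping of state 2 to "Difficult to See", the choice of object types treated as persons, and the function names are assumptions of this sketch; state 3 (unknown) is deliberately left unmapped.

```python
# Object types treated as persons (assumption: matches the inclusion of
# sitting persons and persons on bicycles described in the text).
PERSON_TYPES = {"Pedestrian", "Person_sitting", "Cyclist"}

# KITTI occlusion states; the name given to state 2 is an assumption.
SUBSET_NAMES = {0: "Fully Visible", 1: "Partially Occluded", 2: "Difficult to See"}


def person_occlusion_states(label_text):
    """Return the occlusion state of every person in one KITTI label file."""
    states = []
    for line in label_text.splitlines():
        fields = line.split()
        # A full object label line has 15 fields: type, truncated, occluded, ...
        if len(fields) >= 15 and fields[0] in PERSON_TYPES:
            states.append(int(fields[2]))
    return states


def subset_for_image(label_text):
    """Return the subset name if ALL persons in the image share one occlusion
    state, otherwise None (the image is excluded from every subset)."""
    states = set(person_occlusion_states(label_text))
    if len(states) != 1:
        return None
    return SUBSET_NAMES.get(states.pop())
```

Requiring every person in an image to share the same label keeps the three subsets disjoint, at the cost of discarding images with mixed occlusion states.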
In contrast to this, the proposed benchmark facilitates detailed, objective and repeatable characterization of pedestrian detection performance specifically for partially occluded pedestrians across the complete range of occlusion levels from 0-99%, Figure 4.

B. Key Semantic Parts
Further analysis has been carried out to determine the impact that the visibility of a pedestrian's head has on the detection of occluded pedestrians. The dataset was split into two subsets: 1) only images where the target pedestrian's head is visible and 2) only images where the target pedestrian's head is occluded. Of the 820 pedestrian instances, the target pedestrian's head is visible in 582 instances and is occluded in 252 instances. Figure 9(a) displays the percentage of pedestrian instances with their head visible across each of the occlusion levels. Three pedestrian detection models, FasterRCNN, RetinaNet and SSD, were then tested on both data subsets across the occlusion range. Experiments demonstrate that, regardless of whether a pedestrian's head is visible, a distinct declining profile in detection performance is observed as pedestrian occlusion level increases, Figures 9(b), 9(c) and 9(d). Results indicate that the detection models under test are not biased towards head visibility for the classification of partially occluded pedestrians.

V. CONCLUSION
Detection of partially occluded pedestrians remains a persistent challenge for driver assistance systems and autonomous vehicles. Current methods of characterizing detection performance for partially occluded pedestrians have been broad, subjective, and inconsistent in their definition of the level of occlusion. This research presents a novel test benchmark for the detailed, objective analysis of pedestrian detection models for partially occluded pedestrians. Detection performance is characterized for seven popular pedestrian detection models across a range of occlusion levels from 0-99%. The proposed benchmark focuses specifically on the complex issue of partial occlusion and facilitates more objective, repeatable and fine-grained analysis than the current state of the art. Results demonstrate that pedestrian detection performance is negatively correlated with occlusion level: performance falls as the visibility of a pedestrian is incrementally reduced. An increase in the number of false negative detections is observed as occlusion level increases, and the percentage of true positive detections degrades significantly for pedestrians who are more than 50% occluded. Further analysis demonstrates that not all pedestrian detection models should be treated equally within an object detection system. The speed vs. accuracy trade-off, encouraged by the near real-time requirements of autonomous vehicles, can result in high levels of false positive detections and lower detection confidence at progressive levels of pedestrian occlusion, particularly when using single-stage detection models. Thorough objective characterization of pedestrian detection models at the design stage will improve the performance of object detection systems by calibrating the priority of detections in scenarios where known weaknesses can occur.
System improvements may be gained through the use of an occlusion-aware step in the object detection pipeline to inform the priority of camera-based detections in sensor fusion networks for SAE level 4 and level 5 autonomous vehicles. In this manner, any reduction in performance at high occlusion levels can be mitigated in the design of the overall system to increase the safety of vulnerable road users and improve the efficiency of path planning based on environment detection. Widespread use of the proposed benchmark will result in more objective, consistent and detailed analysis of pedestrian detection models for partially occluded pedestrians.