Sustainable Cities and Society

Since the start of the COVID-19 pandemic


Preliminary
In December 2019, China officially announced the discovery of a new coronavirus disease, namely COVID-19, which is caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and has been the source of a global epidemic (Ghaemi, Amiri, Bajuri, Yuhana, & Ferrara, 2021). As of January 12, 2022, there have been 312,173,462 confirmed cases of COVID-19, including 5,501,000 deaths, reported to the World Health Organization (WHO) (Who coronavirus (covid-19) dashboard, 2022). To that end, an increasing effort has been made by the research community to put intelligent tools and measures to use in order to reduce or slow down the spread of COVID-19. In this respect, various studies have investigated some of the main open challenges of the pandemic, such as those related to (i) predicting COVID-19 risk in public environments using IoT and machine learning (ML) (Elbasi, Topcu, & Mathew, 2021; Ramchandani, Fan, & Mostafavi, 2020; Tang, Feng, Chiheb, & Fan, 2021), (ii) monitoring social distancing (SD) and detecting violations (Ar et al., 2020; Prabakaran, Kumar, Kiran,

Y. Himeur et al.
Fig. 1. Smart DL-based VSDM system for smart cities: the most important steps are explained, including (i) data collection, (ii) data storage, (iii) pedestrian detection, (iv) distance measurement, and (v) violation detection.

body temperatures have attracted significant attention (Farooqi & Usman, 2021; Gad, ElBary, Alkhedher, & Ghazal, 2020). Indeed, these works can provide the public with information about where the risk of COVID-19 transmission may be high.
SD plays a major role in slowing down the spread of the COVID-19 virus. Although the distance to be preserved between people is country-specific, most studies have defined SD as maintaining a distance of at least two meters (six feet) from other persons to prevent potential contacts (Agarwal et al., 2021; Kumar, John, Vighnesh, & Jagannath, 2022). Typically, while the WHO has recommended one meter of physical distance, as adopted in France, Singapore, Hong Kong, Denmark, and China, other countries, including India, the UK, Qatar, and many others, have been maintaining a distance of two meters. The importance of SD also comes from its substantial economic benefits, as it has long-run recovery effects on economic development (Pooranam, Sushma, Sruthi, & Sri, 2021). The COVID-19 pandemic may not end in the near future, and automated systems able to monitor and analyze whether people are respecting SD norms can significantly benefit society. Besides, recent improvements in ML and DL have made object detection techniques quite efficient, which has enabled researchers to measure and monitor SD among pedestrians in public areas by analyzing videos recorded by fixed surveillance (e.g., CCTV cameras) or drone-based surveillance (Elharrouss et al., 2021; Haq, Du, & Jan, 2022). Typically, vision-based IoT systems already installed in public areas can be augmented with a people detection capability, which is a sub-task of the generic object detection process (Gaisie, Oppong-Yeboah, & Cobbinah, 2022; Manzira, Charly, & Caulfield, 2022). Adequate measures can then be initiated to measure the physical distances between detected pedestrians. Fig. 1 illustrates the overall architecture of a DL-based VSDM system for smart city applications.
Because monitoring, managing, and preventing the spread of the COVID-19 virus require innovative and intelligent solutions and pathbreaking tools, ML models, and more particularly deep learning (DL) models, play a crucial role in humanity's battle against the pandemic. Typically, computer vision (CV), which is part of artificial intelligence (AI), can teach computers to comprehend visual scenes and analyze dense crowds (Mohamed & Abdel Samee, 2022). In this regard, machines have become able to (i) identify and track objects, (ii) measure the distance between them, and (iii) respond to observed scenes using cameras, smartphones, and DL tools (Nagrath et al., 2021). Similarly, CV combined with DL has recently been used to capture the average amount of human activity, monitor SD behaviors, and detect violations of face mask-wearing rules in major cities. Typically, face mask detection is considered a complementary task to SD monitoring to decrease the risk of contamination with COVID-19. Drones, or unmanned aerial vehicles (UAVs), have also been utilized to fight the COVID-19 virus in open areas (e.g., the perimeters of sports facilities and stadiums) (Conte et al., 2021) by (i) collecting biomedical data of individuals and (ii) monitoring SD and recording vital-sign parameters (e.g., respiratory rate, body temperature, heart rate, etc.). This has been efficient for analyzing individuals' health status and limiting the spread of the virus.
This review is introduced to shed light on the progress made by the scientific community in developing DL-based tools for VSDM since the pandemic's start. Specifically, a well-designed taxonomy is introduced to better overview existing frameworks from various perspectives, including the surveillance type (i.e., fixed or mobile), methodology (hand-crafted-based or CNN-based), nature of pedestrian detectors (single-stage or two-stage), and complexity of CNN models (i.e., complex or lightweight), etc. Moreover, a comparative study is conducted to assess the competency of DL-based VSDM solutions, primarily based on CNN models. Thereafter, insightful observations are made to identify solved challenges and those that remain unresolved, such as pedestrian overlapping, real-time implementation, camera calibration, lack of annotated datasets, security and privacy concerns, etc. Additionally, future directions that can help improve the performance of VSDM and promote its implementation are highlighted. Overall, the main contributions of this paper can be summarized as follows:
• Presenting, to the best of the authors' knowledge, the first review of the deep VSDM literature.
• Presenting the background of the VSDM concept and explaining its main steps.
• Summarizing datasets used for validating VSDM frameworks and discussing their characteristics and limitations.
• Systematically reviewing existing DL-based VSDM techniques and identifying their pros and cons.
• Analyzing and discussing the performance of existing DL-based VSDM solutions and presenting a comparative study of relevant works.
• Highlighting the open issues where the actual research effort is heading and providing insights about the future directions that can attract considerable interest in the near future.

Survey methodology
VSDM studies have been surveyed by searching academic databases, including Scopus, Elsevier, IEEE Xplore, Springer, Web of Science, etc. In doing so, the following keywords have been considered: "visual social distancing monitoring", "social distancing detection using deep learning", "social distancing analysis using computer vision", and "social distancing monitoring using CNN", with the "document title, abstract and keywords" option set in the advanced search. Hundreds of peer-reviewed articles have been obtained, but not all of them were related to the topic of the review. To that end, a careful filtering process has been conducted as follows: (i) all related journal papers have been included in this review as they present a detailed analysis and description, (ii) conference papers written in languages other than English and those not presenting sufficient quantitative results and experiments have been filtered out, (iii) conference papers lacking visual detection results have been excluded, and (iv) studies validated on small image datasets have not been considered. Moreover, some studies present very similar approaches, and only the datasets used to validate them are different. In this regard, only the frameworks validated on sufficient benchmarking data with a well-defined validation process have been included in this review. Overall, more than 75 VSDM studies have been considered, covering peer-reviewed journal articles, conference proceedings articles, book chapters, and preprints.
The rest of this paper is organized as follows. Section 2 provides the background of VSDM systems, where the overall methodology is explained, and the types of adopted surveillance are described. Section 3 summarizes existing datasets used to validate VSDM techniques. Moving on, the limitations and drawbacks of non-visual SD monitoring (NVSDM) frameworks are briefly discussed in Section 4. Next, a thorough overview conducted based on a well-defined taxonomy of VSDM studies is presented in Section 5. After that, the important findings following this comprehensive review are identified in Section 6, where critical analysis is performed, and open challenges are highlighted. Lastly, future directions are derived in Section 7 before concluding this paper in Section 8.

Background
VSDM systems are based on detecting pedestrians, measuring the distance between them, and then quantifying the risk of COVID-19 transmission among the monitored people. Fig. 2 explains how the risk level can vary when monitored pedestrians are close to each other. Specifically, the closer or denser a crowd is, the riskier it is considered. The least risky level is on the top left while the riskiest is on the top right. Usually, a scene is defined at time $t$ as a three-tuple $S = (I, A_0, d_c)$, where $I \in \mathbb{R}^{H \times W \times 3}$ refers to the RGB frame, and $H$ and $W$ represent its height and width, respectively. $A_0 \in \mathbb{R}$ represents the area of the region of interest (ROI) on the real-world ground plane, and $d_c \in \mathbb{R}$ stands for the physical distance threshold required to maintain a safe environment.

Image to world mapping
At this stage, image points are mapped into the real world, where the second mapping function $h : p' \mapsto p$ is obtained. Typically, $h$ represents an inverse perspective transformation function, which maps $p'$ in image coordinates to $p \in \mathbb{R}^2$ in real-world coordinates. $p$ is represented in 2D bird's eye view (BEV) coordinates, where the ground plane $z = 0$ is assumed. Specifically, the inverse homography transformation (Forsyth & Ponce, 2011) can be used to perform this:

$$p^{bev} = M^{-1} p^{im},$$

where $M \in \mathbb{R}^{3 \times 3}$ represents a transformation matrix that describes the translation and rotation from world to image coordinates. In this respect, $p^{im} = [p'_x, p'_y, 1]$ refers to the homogeneous representation of $p' = [p'_x, p'_y]$ in image coordinates, and $p^{bev} = [p^{bev}_x, p^{bev}_y, 1]$ constitutes the homogeneous representation of the mapped pose vector. Moving on, the real-world pose vector $p$ can be obtained from $p^{bev}$. This operation is essential since it facilitates the measurement of the real physical distances between each pedestrian pair.
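The image-to-world mapping described above can be illustrated with a minimal NumPy sketch. The function name `image_to_bev`, the example matrix, and its scale of 100 pixels per meter are hypothetical values chosen for illustration; in practice the matrix would come from camera calibration. The division by the last homogeneous component is the usual normalization for a general homography and is an assumption here, not a step stated in the text.

```python
import numpy as np

def image_to_bev(points_img, M):
    """Map pixel coordinates to ground-plane (BEV) coordinates using M^{-1}."""
    pts = np.asarray(points_img, dtype=float).reshape(-1, 2)
    ones = np.ones((pts.shape[0], 1))
    p_im = np.hstack([pts, ones])            # homogeneous image points, shape (n, 3)
    p_bev = p_im @ np.linalg.inv(M).T        # applies M^{-1} to every row
    return p_bev[:, :2] / p_bev[:, 2:3]      # normalise by the last component

# Hypothetical world-to-image matrix: 100 px per meter plus a pixel offset.
M = np.array([[100.0, 0.0, 320.0],
              [0.0, 100.0, 240.0],
              [0.0, 0.0, 1.0]])

ground = image_to_bev([[320.0, 240.0], [520.0, 240.0]], M)
```

With this illustrative matrix, two pixels that are 200 px apart map to ground points that are 2 m apart, which is the quantity the distance measurement step needs.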

Pedestrian detection
First, any VSDM system aims to detect individuals (or pedestrians) in the video frames collected using fixed or drone-based monocular or stereo cameras and insert a collection of bounding boxes (BBs) (Li, Varble, Turkbey, Xu, & Wood, 2022). Typically, an ML-based object detector is applied on the frame $I$: $f : I \to \{T_i\}$ maps a frame into $n$ tuples $T_i = (l_i, bb_i, s_i)$, $\forall i \in \{1, 2, \ldots, n\}$, where $n$ represents the number of detected objects. $l_i \in L$ is the object class label among the overall object label set $L$. $bb_i = (bb_{i,1}, bb_{i,2}, bb_{i,3}, bb_{i,4})$ represents the corresponding BB with four corners. $bb_{i,j} = (x_{i,j}, y_{i,j})$ provides pixel indices in the image domain, where $j$ represents the corners at "top-left", "top-right", "bottom-left", and "bottom-right", respectively. Lastly, $s_i$ indicates the corresponding detection score. VSDM systems attempt to only detect the case of $l_i$ = "person".

Fig. 2. The risk level of contaminating COVID-19 between monitored people: the closer or denser a crowd is, the more risky it is considered. The least risky level is on the top left while the more risky is on the top right.
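The detector output described above (a label, a four-corner BB, and a score per object) can be sketched as a small data structure. The names `Detection`, `keep_persons`, and `centroid`, the score cut-off of 0.5, and the sample detections are all hypothetical and only serve to illustrate how the "person" class is isolated before distances are measured.

```python
from typing import List, NamedTuple, Tuple

Point = Tuple[float, float]

class Detection(NamedTuple):
    label: str                                   # object class label
    corners: Tuple[Point, Point, Point, Point]   # top-left, top-right, bottom-left, bottom-right
    score: float                                 # detection score

def keep_persons(detections: List[Detection], min_score: float = 0.5) -> List[Detection]:
    """Keep only 'person' detections above an (assumed) score cut-off."""
    return [d for d in detections if d.label == "person" and d.score >= min_score]

def centroid(det: Detection) -> Point:
    """Centre of the bounding box, computed as the mean of its four corners."""
    xs = [x for x, _ in det.corners]
    ys = [y for _, y in det.corners]
    return (sum(xs) / 4.0, sum(ys) / 4.0)

# Made-up detector output for one frame.
detections = [
    Detection("person", ((10, 20), (50, 20), (10, 120), (50, 120)), 0.91),
    Detection("bicycle", ((60, 80), (140, 80), (60, 130), (140, 130)), 0.88),
    Detection("person", ((200, 30), (240, 30), (200, 125), (240, 125)), 0.32),
]
persons = keep_persons(detections)
centres = [centroid(d) for d in persons]
```

Only the first detection survives here: the bicycle is discarded by its label and the second person by its low score.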

Social distancing (SD) detection
After detecting all the BBs $B = (bb_1, bb_2, \ldots, bb_n)$ in real coordinates, the corresponding list of interpersonal distances is calculated between their centroids. Typically, the distance $d_{1,2}$ for the individuals detected by the BBs $bb_1$ and $bb_2$ is estimated using the Euclidean distance between their centroids $c_1$ and $c_2$:

$$d_{1,2} = \lVert c_1 - c_2 \rVert_2.$$

The overall number of SD violations $v$ in a scene is computed as follows:

$$v = \sum_{i=1}^{n} \sum_{\substack{j=1 \\ j \neq i}}^{n} \mathbb{1}\left(d_{i,j} < d_c\right),$$

where $d_{i,j}$ is the interpersonal distance between individuals $i$ and $j$, $d_c$ is the physical distance threshold, and $\mathbb{1}(\cdot)$ equals 1 when its argument holds and 0 otherwise. It is worth noting that the obtained violations can be filtered by (i) imposing thresholds on the time contact patterns and/or the number of contacts, and (ii) considering family/non-family classification. For example, in Pouw, Toschi, van Schadewijk, and Corbetta (2020), a minimum contact time threshold equal to 5 is defined to tag SD offenders.
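The distance and violation computation can be sketched as follows, assuming the centroids are already expressed in real-world (BEV) coordinates in meters. The function name `find_violations` and the sample coordinates are hypothetical. The sketch reports each violating pair once; counting ordered pairs instead simply doubles the figure.

```python
import numpy as np

def find_violations(centroids, d_c):
    """Return the pairwise distance matrix and the pairs closer than d_c."""
    c = np.asarray(centroids, dtype=float).reshape(-1, 2)
    diff = c[:, None, :] - c[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)     # Euclidean distance between all centroids
    i, j = np.triu_indices(len(c), k=1)      # every unordered pair (i < j) once
    close = dist[i, j] < d_c
    pairs = list(zip(i[close].tolist(), j[close].tolist()))
    return dist, pairs

# Three made-up ground-plane positions (meters) and a 2 m threshold.
dist, pairs = find_violations([(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)], d_c=2.0)
num_violating_pairs = len(pairs)
```

Here only the first two pedestrians, who stand 1 m apart, are flagged.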

Tracking and reporting
Once two pedestrians are detected to be close to each other and the distance value violates the minimum SD norm, the color of the bounding box is changed to red. Moreover, the BB information is saved in a violation database and transmitted to a surveillance and monitoring center for reporting purposes and for sending alarms to the concerned offenders. On the other hand, centroid tracking algorithms can be deployed to track the people breaching the SD norm. For instance, Yang, Sun, et al. (2021) use the simple online and real-time tracking (SORT) algorithm (Bewley, Ge, Ott, Ramos, & Upcroft, 2016) to track pedestrians detected with YOLOv4 due to its simplicity and quick inference. Similarly, DeepSort (Wojke, Bewley, & Paulus, 2017), which is one of the most widely used tracking algorithms, is utilized in Punn, Sonbhadra, Agarwal, and Rai (2020) to track pedestrians detected with YOLOv3. The tracking has been performed using the BBs and assigned IDs of people violating the SD norm. Other variants of DeepSort can also be utilized, such as StrongSort (Du, Song, Yang, & Zhao, 2022). Moreover, multi-object tracking (MOT) algorithms have also been considered to track detected pedestrians. This is the case of Al-Sa'd et al. (2022), where the global nearest neighbor (GNN) tracking technique has been used. Fig. 3 explains the main steps of performing VSDM based on CNN.
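To make the idea of centroid tracking concrete, the following is a deliberately simple greedy nearest-centroid tracker. It is not SORT, DeepSort, StrongSort, or GNN tracking: it has no motion model and no appearance features, and it drops a track as soon as it is not matched. The class name `CentroidTracker` and the `max_dist` gate are hypothetical choices made for this sketch.

```python
import math
from itertools import count

class CentroidTracker:
    """Greedy nearest-centroid association between consecutive frames."""

    def __init__(self, max_dist=1.0):
        self.max_dist = max_dist      # largest jump allowed between two frames
        self.tracks = {}              # track id -> last known centroid
        self._ids = count()           # source of new track ids

    def update(self, centroids):
        assigned = {}
        free = dict(self.tracks)      # tracks not yet matched in this frame
        for c in centroids:
            best = min(free, key=lambda t: math.dist(free[t], c), default=None)
            if best is not None and math.dist(free[best], c) <= self.max_dist:
                tid = best            # continue an existing track
                del free[best]
            else:
                tid = next(self._ids) # start a new track
            assigned[tid] = tuple(c)
        self.tracks = assigned        # unmatched tracks are dropped
        return assigned

tracker = CentroidTracker(max_dist=1.0)
frame1 = tracker.update([(0.0, 0.0), (5.0, 5.0)])
frame2 = tracker.update([(5.2, 5.1), (0.1, 0.0)])
frame3 = tracker.update([(9.0, 9.0)])
```

Persistent IDs of this kind are what allow a contact-time threshold to be applied, since a violation can then be followed over consecutive frames for the same pair of people.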

Evaluation metrics
To quantify the performance of existing VSDM frameworks and inform the state-of-the-art, we perform a comparative analysis showing their original results on their own datasets. Accordingly, we first briefly present the evaluation metrics commonly used in VSDM studies, including accuracy, F1 score, average precision (AP), and mean average precision (mAP).

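Accuracy:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

F1 score:

$$F1 = \frac{2 \times P \times R}{P + R},$$

where $P = \frac{TP}{TP + FP}$ and $R = \frac{TP}{TP + FN}$ denote precision and recall. Additionally, $TP$ and $TN$ represent the true positives and true negatives, respectively, while $FP$ and $FN$ refer to false positives and false negatives, respectively.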
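As a worked illustration of accuracy and F1 score, the sketch below computes them from raw counts of true/false positives and negatives. The function name `classification_metrics` and the counts in the example are made up, and returning 0.0 for an empty denominator is a convention chosen here.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up counts: 8 correct detections, 5 correct rejections,
# 2 false alarms and 1 missed pedestrian.
metrics = classification_metrics(tp=8, tn=5, fp=2, fn=1)
```

For these counts the accuracy is 13/16 and the F1 score is 16/19, showing how F1 ignores the true negatives that inflate accuracy.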

Mean average precision (mAP):

$$\text{mAP} = \frac{1}{N} \sum_{i=1}^{N} AP_i,$$

where $AP_1$ to $AP_N$ refer to the average precision of each class $i$, and $N$ is the number of classes. Overall, AP is computed for each class, where class refers to the object classes, e.g., "pedestrian" and "non-pedestrian" or "people respecting SD" and "people violating SD", etc.
Intersection over union (IoU): when pedestrians are detected, the model can generate multiple BBs for a single pedestrian. Thus, an IoU-based filter is used, where the IoU is calculated for the areas of two BBs $bb_1$ and $bb_2$ as follows:

$$\text{IoU}(bb_1, bb_2) = \frac{\text{area}(bb_1 \cap bb_2)}{\text{area}(bb_1 \cup bb_2)}.$$

IoU provides the similarity rate between the ground truth BB and the predicted BB as a measurement of the quality of the prediction; its value varies from 0 to 1.
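A minimal sketch of the IoU computation and of an IoU-based duplicate filter is given below. Boxes are assumed to be axis-aligned and given as `(x1, y1, x2, y2)`. The filter shown is greedy non-maximum suppression, which is one common way to realise such a filter and is an assumption here rather than the specific procedure of any reviewed work; the 0.5 threshold and the sample boxes are illustrative.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, thr=0.5):
    """Greedy filter: keep the best-scoring box, drop boxes overlapping it too much."""
    order = sorted(range(len(boxes)), key=lambda k: scores[k], reverse=True)
    keep = []
    for k in order:
        if all(iou(boxes[k], boxes[m]) <= thr for m in keep):
            keep.append(k)
    return keep

overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))
boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
kept = nms(boxes, scores=[0.9, 0.8, 0.7])
```

In the example, the second box overlaps the first with an IoU of 0.81 and is removed as a duplicate, while the distant third box is kept.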

Fixed surveillance
As the IP surveillance industry enters the era of AI, security network cameras (IP cameras) and closed-circuit television (CCTV) cameras have seen significant advances through the application of AI and deep learning technologies. These next-generation cameras are equipped with video analytics and high-performance computing power, allowing users to feed real-time video frames into big-data analysis. VSDM based on fixed surveillance refers to using existing CCTV and/or IP cameras combined with ML and computer vision capabilities to detect whether pedestrians are respecting the SD norms (Pandiyan et al., 2022). When two or more pedestrians are detected in close contact using object detectors and distance measurement algorithms, an alarm is produced to alert the people present in the monitored environment. AI alerts are also sent to the concerned authorities or guards, who can ask people to maintain distance. Fixed surveillance is mainly used in indoor environments, such as shopping areas, airports, sports facilities (e.g., stadiums), etc. (Al-Sa'd et al., 2022).

Drone-based surveillance
Conventional VSDM techniques rely on fixed surveillance using monocular and intelligent cameras, which can only monitor a specific area. In contrast, drone-based surveillance is flexible, convenient, and broad in coverage. Drone-based VSDM analysis is a better option, as it can help monitor scenes from different points of view (Kadam, Seshapalli, Nayak, & Shaikh, 2021; Kumar et al., 2021). However, drone images show complex backgrounds because of varying scenarios, altitudes, diversity, and illumination. These complex backgrounds interfere significantly with VSDM. Typically, detecting individuals quickly and accurately is challenging in such conditions. More attention can be paid to the targets in complex backgrounds, and recent studies have shown that the spatial attention mechanism can achieve this goal. Specifically, it has been demonstrated that spatial attention enhances the features of interest and ignores unimportant characteristics. Using drones for VSDM and other monitoring applications (e.g., face mask detection) is receiving increasing attention because of their flexibility, although their computing power and memory capabilities are limited during timely distance monitoring. In this regard, performing real-time drone-based VSDM is a major issue. The Landing AI Company (Social distancing detector, 2022) developed a real-time VSDM solution that (i) detects pedestrians in video streams recorded with drones and (ii) uses the BEV of frames to measure physical distances between individuals. In another example, Ramadass, Arunachalam, and Sagayasree (2020) use a drone for VSDM and face mask detection of people in a public place. If violations are detected, the drone sends alarms to the nearby police station and provides the public with alerts. It can also carry and drop face masks to individuals. Similarly, in Kadam et al. (2021) and Shao et al. (2021), autonomous drones are used for VSDM.

Datasets
To validate VSDM algorithms in crowded areas, various publicly available video surveillance datasets have been used. Typically, most of these datasets have already been employed to validate different video surveillance tasks, such as pedestrian detection, motion detection, crowd management, abnormal event detection, etc. For instance, Shorfuzzaman, Hossain, and Alhamid (2021) use the Oxford town center (OTC) dataset (Benfold & Reid, 2011), which has been released by Oxford University. It encompasses one video sequence recorded in a semi-crowded urban street at a sampling rate of 25 frames per second (fps) and with a resolution of 1920 × 1080. The ground truth BBs of the pedestrians are also provided for all the frames. In Su et al. (2021), in addition to introducing a new VSDM dataset, namely SCU-VSD, two other datasets are considered, i.e., Market1501 (Zheng et al., 2015) and MOT16 (Milan, Leal-Taixé, Reid, Roth, & Schindler, 2016). Typically, SCU-VSD is a data repository including 8 video sequences recorded on a pedestrian street. They have a sampling rate of 25 fps, a duration of 60 s, and a resolution of 1920 × 1080 with numerous scenes and perspective views. Market1501 includes images recorded in front of a supermarket (Tsinghua University) and encompasses 12,936 images for training and 3,368 images for testing. MOT16 includes 7 videos employed for training and verification and another 7 for testing, with resolutions of 1920 × 1080 and 640 × 480. It also contains top-view scenes recorded with a surveillance camera and front-view scenes collected with a moving camera. The varying illumination, number of pedestrians, and complex scenes make this dataset very challenging for VSDM applications. Shrestha et al. (2020) train their VSDM system on the PASCAL visual object classes challenge (VOC) 2007 (Everingham et al., 2008) and VOC 2012 (Everingham & Winn, 2011) datasets. Next, the system is tested on the PASCAL VOC 2007 test set. Specifically, VOC 2007 and VOC 2012 include 9,963 and 11,540 images, respectively, with objects from 20 different classes. In this case, the system performance has been reported only for the person class. Shao et al. (2021) validate their real-time drone-based VSDM system using a merge-head dataset. It includes 18,767 video frames recorded at a resolution of 1920 × 1080 and divided into a training set (15,940 frames), a validation set (1,340 frames), and a test set (1,487 frames).
In Al-Sa'd et al. (2022), the EPFL-MPV (Fleuret, Berclaz, Lengagne, & Fua, 2007), EPFL-Wildtrack (Chavdarova et al., 2018), and OTC (Benfold & Reid, 2011) datasets are used to evaluate DL models. EPFL-MPV comprises four video sequences of six individuals freely moving in a room. Different scenes are collected from different points of view. Each video includes 2954 frames recorded at a sampling rate of 25 fps with a resolution of 920 × 1080. Besides, EPFL-Wildtrack comprises 7 video sequences of 400 frames each, describing the movement of 20 pedestrians outside the principal building of the ETH university (Switzerland). Pedestrian scenes have been collected using different cameras installed at different points of view. At the same time, OTC includes one video sequence collected on a pedestrian street, which has 4501 frames recorded using a single camera at 25 fps. In Madane and Chitre (2021), the INRIA person dataset (Dalal & Triggs, 2005), which contains training and testing data and their corresponding annotations, is utilized for training the DL models. For inference or testing, the OTC dataset is considered along with the performance evaluation of tracking and surveillance (PETS 2009) dataset, both of which contain numerous crowd activities. Specifically, PETS contains video frames for different purposes, such as people tracking, crowd density estimation, flow analysis, etc.
In Rahim, Maqbool, and Rana (2021), Rahim et al. use the exclusively dark (ExDark) image dataset (Loh & Chan, 2019) for validating their VSDM solution. ExDark includes 12 different classes of objects with annotations, although only the pedestrian class has been considered for training the developed algorithms. Moreover, it encompasses various indoor and outdoor low-light images. In Pi, Nath, Sampathkumar, and Behzadan (2021), to avoid overfitting, the YOLO-based VSDM solution is trained on multiple datasets, including PennFudanPed (Wang, Shi, Song, Shen, et al., 2007), which is a small dataset comprising one video showing 170 pedestrians walking on a street. Because it is not sufficient to train DL models, other large-scale datasets are utilized, e.g., VOC 2010 (Everingham, Van Gool, Williams, Winn, & Zisserman, 2010) and Microsoft common objects in context (MS-COCO) (Lin et al., 2014). Typically, the VOC 2010 dataset includes 20 different object classes; however, only the pedestrian class is considered to validate VSDM systems. More specifically, only images that illustrate individuals riding bikes, running, and/or walking are used.
In Yang, Yurtsever, Renganathan, Redmill, and Özgüner (2021), three pedestrian crowd datasets are employed to evaluate an SSD-based VSDM system, i.e., OTC, the Mall Dataset (Mall-D) (Chen, Loy, Gong, & Xiang, 2012), and the train station dataset (TSD) (Zhou, Wang, & Tang, 2012). Mall-D is a dataset of 2000 video frames with a resolution of 320 × 240, which was originally proposed for crowd counting. TSD is a one-video dataset comprising 50,010 frames, which is collected at a rate of 25 fps and a resolution of 480 × 720. In Rezaei and Azarmi (2020), Rezaei et al. validate their VSDM system using multi-object annotated datasets, i.e., VOC 2010 (Everingham et al., 2010), COCO, ImageNet (Russakovsky et al., 2015), and Google Open Images (GOI) V6+ (Kuznetsova et al., 2020). The last one has 16 million ground-truth BBs from 600 classes. Only the classes corresponding to human detection and identification are used. It is also a labeled dataset, in which every image comes with BB labels and the corresponding coordinates of every label. In Shareef, Yannawar, Abdul-Qawy, and Ahmed (2022), in addition to Mall-D, PETS 2009, and OTC, Shareef et al. use VIRAT (Oh et al., 2011), which is a natural, realistic, and challenging video surveillance dataset.
Overall, it is worth noting that most existing VSDM frameworks have been validated on already existing datasets that were proposed for validating different video surveillance tasks, such as object detection, human action recognition, multi-object tracking, etc. This is mainly due to the similarities between those tasks and the VSDM task and the open challenges presented by these comprehensive and public datasets. On the other hand, very few datasets have been launched specifically to validate VSDM algorithms, such as SCU-VSD (Su et al., 2021).
For instance, in Bian, Zhou, Bello, and Lukowicz (2020), a wearable, oscillating magnetic field-based proximity sensing system is proposed for monitoring SD. It can track an individual's SD in real time and offers better reliability than Bluetooth RSSI signal-based SD tracking solutions. In Oransirikul and Takada (2020), SD warnings are generated based on separating passing individuals from waiting individuals. Precisely, the activity of Wi-Fi signals from mobile devices has been passively monitored to check whether the number of individuals in a specific area has exceeded the allowable density. If so, individuals are provided with warnings to keep SD. In Chandel, Banerjee, and Ghose (2020), a mobile-based platform for monitoring SD in enterprise scenarios named "ProxiTrak" is proposed. It aids in tracking the path of potential COVID-19 transmission among an ensemble of individuals. Additionally, it helps guide individuals to follow SD rules by providing real-time alerts on their mobile phones once they violate SD norms or are exposed to a person who has tested positive. In this regard, a classification algorithm is devised for making proximity decisions on the mobile phone itself using received signal strength indicator (RSSI) data of the on-board Bluetooth low energy (BLE) module. Besides, in Li, Sharma, Mishra, Batista, and Seneviratne (2021), Li et al. address the SD problem by developing a non-intrusive approach that monitors physical distances within a given space based on channel state information (CSI) from passive WiFi sensing. In this context, the frequency selective behavior of CSI is exploited by a support vector machine (SVM) classifier to improve the accuracy of SD detection and crowd counting.

Fig. 4. Taxonomy of existing VSDM techniques proposed in the last two years with reference to the type of CNN models (complex or lightweight), transfer learning approaches, pedestrian detectors, data recording technique, and overall methodology.
Also, it is worth noting that many countries have used the global positioning system (GPS) to record the activities of people who tested positive. This helps track their traces and estimate the probability of their contact with healthy people. For example, the EHTERAZ app has been used by the Qatar government to monitor and track suspected or infected people and guarantee that they comply with the COVID-19 precautions (El-Haddadeh, Fadlalla, & Hindi, 2021). In India, the government has utilized the Aarogya Setu app, which employs Bluetooth and GPS to localize and monitor COVID-19 patients in public areas (Sharan, Chanu, Jena, Arunachalam, & Choudhary, 2020). However, most of these apps are only appropriate for indoor environments, and their accuracy drops significantly in dynamic environments. Moreover, they have significant privacy issues and scalability problems (Borra, 2020).
Although NVSDM systems can help accurately detect physical distances between pedestrians, they are usually based on sensors that are handed out to achieve such a task. This can act as a medium for spreading the virus.

Visual social distancing monitoring (VSDM)
Since the outbreak of the COVID-19 pandemic, a large number of AI-based frameworks have been proposed to help fight against the virus. The literature on VSDM is growing worldwide. Various journal special issues and many international conferences have been organized in the last two years, with many solutions introduced for resolving the VSDM problem. This section sheds light on the state-of-the-art VSDM techniques. Typically, VSDM frameworks can be classified with reference to different aspects, such as the adopted model (conventional ML or DL), feature extraction (hand-crafted or neural network), data recording methodology (fixed or drone-based), complexity of object detectors (complex or lightweight), object detection stages (single-stage or multi-stage), etc. Fig. 4 illustrates the proposed taxonomy.

Hand-crafted feature-based methods
In Cristani, Del Bue, Murino, Setti, and Vinciarelli (2020), a VSDM approach that relies on body pose estimation is introduced, where a body pose detector has been utilized for detecting visible pedestrians. Then, after converting the video frames into a top-view (BEV) representation, every detected person is considered the center of a circle whose radius represents the safe distance. In this regard, the VSDM task has been transformed into a sphere collision problem. In Al-Sa'd et al. (2022), a VSDM and crowd management system is introduced, which is based on (i) detecting pedestrians using global nearest neighbor (GNN) tracking, which is a real-time lightweight MOT approach based on allocating detection/prediction annotations to tracks and preserving their track records, (ii) filtering the region of interest (ROI), (iii) transforming video frames into a top view, (iv) tracking and smoothing, (v) estimating parameters, and (vi) detecting SD violations. In Aghaei et al. (2021), a semi-automatic VSDM approach is proposed for approximating the homography matrices between the image plane and the scene ground. Using the measured homography, an off-the-shelf pose detector is then leveraged for detecting body poses in images and reasoning about their interpersonal distances using the length of their body parts. Moving on, interpersonal distances are examined to identify potential SD violations.
In Ziran and Dahnoun (2021), Ziran et al. propose a contactless and real-time solution to monitor SD using stereo cameras, where pedestrians are first detected using a histogram of oriented gradients (HOG) in the reduced ROIs of each frame. Moving on, a disparity map is generated for the regions of the image with detected people before calculating the distances between detected persons using the Pythagorean theorem. In Jayatilaka et al. (2021), an end-to-end VSDM method is developed based on graph theory. Typically, a temporal graph representation structurally stores the information extracted by the object detector. Specifically, individuals are represented by nodes with time-varying properties for their location and behavior. The edges between people represent the interactions and social groups. Next, the graphs are interpreted, and the threat levels in each are quantified based on primary and secondary threat parameters, including proximity and group dynamics extracted from the graph representation and individuals' behavior.
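To illustrate how a disparity map can lead to an interpersonal distance, the sketch below uses the standard rectified pinhole stereo relation (depth equals focal length times baseline divided by disparity) and then applies the Pythagorean theorem on the ground plane. This is a generic textbook formulation, not the exact procedure of Ziran and Dahnoun (2021); the function names and all camera parameters (focal length, baseline, principal point) are hypothetical.

```python
import math

def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth Z = f * B / d for a rectified stereo pair (pinhole model)."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px

def ground_distance(u1, disp1, u2, disp2, focal_px, baseline_m, cx):
    """Distance between two people from their image column and disparity."""
    z1 = depth_from_disparity(disp1, focal_px, baseline_m)
    z2 = depth_from_disparity(disp2, focal_px, baseline_m)
    x1 = (u1 - cx) * z1 / focal_px    # lateral offset of person 1 (meters)
    x2 = (u2 - cx) * z2 / focal_px    # lateral offset of person 2 (meters)
    return math.hypot(x1 - x2, z1 - z2)   # hypotenuse of the lateral and depth gaps

# Hypothetical camera: 700 px focal length, 12 cm baseline, principal point at column 640.
separation = ground_distance(u1=640, disp1=21, u2=990, disp2=21,
                             focal_px=700, baseline_m=0.12, cx=640)
```

Both people in the example are 4 m from the camera and 350 px apart in the image, which corresponds to a separation of about 2 m.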

CNN-based VSDM
CNNs have recently been considered a major player in different research topics, such as feature extraction, object detection, image segmentation, and human detection. Moreover, the development of larger memory capacities and faster CPUs and GPUs has enabled the computer vision community to create powerful and robust pedestrian detectors that significantly outperform traditional ML algorithms. Despite that, many challenges still persist, including detection accuracy, detection speed, and computational training cost. These challenges also apply to the VSDM problem and need to be resolved to develop efficient and real-time VSDM systems. For VSDM, the pedestrian detection stage is the most critical part, and distance measurement accuracy depends mainly on it. To that end, most of the contributions have focused on developing accurate people detectors using CNNs. The latter can be divided into single-stage CNN-based detectors, two-stage CNN-based detectors, and multi-object detectors.

Single-stage CNN-based pedestrian detectors
A one-stage CNN-based detector relies on a single pass through the CNN model to predict all the BBs in one go. Because it is fast, it is well suited to implementation on mobile devices, such as drones. The most famous examples of one-stage CNN-based detectors are SSD, YOLO, RetinaNet, DetectNet and SqueezeDet (Faragallah et al., 2022).

YOLOv1: it frames pedestrian detection as a regression problem, spatially separating BBs and associated class probabilities. Few frameworks have been developed using this architecture. For instance, in Mercaldo, Martinelli, and Santone (2021), a YOLOv1 object detector is employed to detect people before using the Euclidean distance between people's centroids to quantify the physical distances between them. Similarly, in Anitha Kumari, Purusothaman, Dharani, and Padmashani (2021), a VSDM approach based on YOLOv1 is developed and implemented on a Jetson Nano computing board.
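The centroid-based measurement used in these YOLOv1 frameworks can be sketched as follows. This is only an illustration under stated assumptions: the boxes are taken to be (x1, y1, x2, y2) pixel coordinates already produced by some detector, and the pixel threshold is a hypothetical value rather than one reported in the cited studies.

```python
import math
from itertools import combinations


def centroid(box):
    """Centre of an (x1, y1, x2, y2) bounding box in pixels."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def find_violations(boxes, min_pixel_distance):
    """Return (i, j, distance) for every pair of boxes closer than the threshold."""
    centers = [centroid(b) for b in boxes]
    violations = []
    for i, j in combinations(range(len(centers)), 2):
        d = math.dist(centers[i], centers[j])
        if d < min_pixel_distance:
            violations.append((i, j, d))
    return violations


# Hypothetical detector output: three people, two of them close together.
detected_boxes = [(0, 0, 20, 60), (30, 0, 50, 60), (200, 0, 220, 60)]
close_pairs = find_violations(detected_boxes, min_pixel_distance=50.0)
print(close_pairs)
```

A pixel threshold of this kind is only meaningful for a fixed camera and a roughly constant scale, which is why several of the works discussed later first map detections to a bird's-eye view or to real-world units.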
YOLOv2: it is built upon the DarkNet-19, which is the model backbone. Compared to YOLOv1, YOLOv2 relies on removing fully connected layers and using anchor boxes for predicting BBs. Saponara, Elhanashi, and Gagliardi (2021) develop a VSDM scheme based on YOLOv2, which is applied to video streaming from thermal cameras. This approach enables tracking people, detecting SD violations, and monitoring body temperature. Moreover, the developed solution has been implemented on a Jetson Nano, which includes a fixed camera before testing it in a distributed surveillance system for visualizing individuals from multiple cameras in a centralized manner.
YOLOv3: it utilizes the complex DarkNet-53 as the model architecture. In Ramadass et al. (2020), an autonomous drone-based VSDM is proposed, in which the YOLOv3-based pedestrian detector has been trained on a dataset that includes images of side and frontal views for a large number of people. This study has also been extended to detect face masks. The developed algorithm was then implemented on a surveillance drone with a camera to detect the physical distances between pedestrians from the frontal and side views. Similarly, the authors in Sathyamoorthy, Patel, Savle, Paul, and Manocha (2020) develop a pedestrian detection approach for VSDM in crowded areas using the YOLOv3-based detector designed in Wojke et al. (2017). Typically, a robot equipped with an RGB depth camera and a 2D lidar is deployed in crowd gatherings, where it performs collision-free navigation. Moving on, YOLOv3 is utilized in Yang, Yurtsever, et al. (2021) to detect pedestrians in video sequences and identify SD violations. Specifically, the BEV coordinates have been adopted to estimate the distances between pedestrians. Additionally, the density of crowd gatherings has been estimated to alert for critically dense areas.
In Magoo, Singh, Jindal, Hooda, and Rana (2021), a BEV VSDM scheme based on the YOLOv3 object detection model is introduced. Typically, key feature patterns are detected using a key point regressor. Moreover, once a massive crowd is detected, BBs are used for detecting the individuals violating the SD norms. In Ahmed, Ahmad, Rodrigues, Jeon, and Din (2021), an SD tracking system is developed based on the YOLOv3 object recognition paradigm, which helps (i) detect humans in video streams, and (ii) identify SD violations between people by approximating physical distances in pixels and setting an empirical threshold. Additionally, a transfer learning scheme is utilized to overcome the problem of data scarcity and improve the model's accuracy. Using the same approach, the authors in Shalini, Margret, Niraimathi, and Subashree (2021) and Widiatmoko, Berchmans, and Setiawan (2021) calibrate videos in the BEV plane before feeding them as inputs to the pre-trained YOLOv3 model. However, neither study provides sufficient assessment results.
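The BEV-based measurement mentioned in several of these YOLOv3 frameworks can be illustrated with a short sketch. It assumes that a 3 × 3 image-to-ground homography has already been obtained from a calibration step, and that the bottom-centre of each BB approximates the point where a pedestrian touches the ground; the matrix values and function names below are hypothetical and do not come from any of the cited works.

```python
import math


def apply_homography(h, point):
    """Map an image point (x, y) through a 3x3 homography given as nested lists."""
    x, y = point
    xh = h[0][0] * x + h[0][1] * y + h[0][2]
    yh = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    if abs(w) < 1e-12:
        raise ValueError("point maps to infinity under this homography")
    return (xh / w, yh / w)


def foot_point(box):
    """Bottom-centre of an (x1, y1, x2, y2) box, used as the ground contact point."""
    x1, _, x2, y2 = box
    return ((x1 + x2) / 2.0, y2)


def bev_distance(h, box_a, box_b):
    """Distance between two pedestrians after projection to the BEV plane."""
    return math.dist(apply_homography(h, foot_point(box_a)),
                     apply_homography(h, foot_point(box_b)))


def violates_sd(h, box_a, box_b, threshold):
    """True when the BEV distance falls below the chosen SD threshold."""
    return bev_distance(h, box_a, box_b) < threshold


# Hypothetical homography mapping pixels to metres on the ground plane.
H_IMAGE_TO_GROUND = [[0.01, 0.0, 0.0], [0.0, 0.02, 0.0], [0.0, 0.0, 1.0]]
box_a, box_b = (90, 100, 110, 200), (390, 100, 410, 200)
print(bev_distance(H_IMAGE_TO_GROUND, box_a, box_b))
print(violates_sd(H_IMAGE_TO_GROUND, box_a, box_b, threshold=2.0))
```

Projecting foot points rather than BB centroids is a common choice because only points on the ground plane are mapped correctly by a ground homography.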
In Pi et al. (2021), the study focuses on contact tracing using a CNN to generate quantifiable metrics. Typically, a YOLOv3 network has been trained on a labeled video dataset that includes pedestrians. Afterward, the trained architecture is validated on real-world crosswalk video sequences collected at the start of the pandemic in Xiamen, China. Next, identified pedestrians are projected onto an orthogonal map to trace contacts by (i) tracking movement trajectories and (ii) simulating the spread of droplets among the healthy population. Non-maximum suppression and network pruning have been used to optimize model performance, resulting in an average precision of 69.41%.
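For readers unfamiliar with non-maximum suppression, the sketch below shows the standard greedy, IoU-based variant. It is a generic illustration and not the specific configuration used in Pi et al. (2021); the 0.5 IoU threshold and the sample boxes are assumptions.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring boxes, dropping any box that overlaps a kept one."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Two overlapping detections of the same person and one separate detection.
candidate_boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
candidate_scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(candidate_boxes, candidate_scores)
print(kept)
```

Duplicate detections matter for VSDM in particular, because two boxes on the same person would otherwise be reported as a pair at near-zero distance.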
In Hou, Baharuddin, Yussof, and Dzulkifly (2020), the pre-trained YOLOv3 is utilized for pedestrian detection in video sequences. Then, video frames are transformed into a top-down view to measure the distances between detected pedestrians.
YOLOv4: in Rodriguez, Luque, La Rosa, Esenarro, and Pandey (2020), a DL-based crowd counting solution is developed for capacity control in commercial buildings during the COVID-19 pandemic. It is based on YOLOv4 and has been validated on the MS-COCO dataset. Moreover, it can (i) determine whether a person leaves or enters using the route and direction information, (ii) count the remaining people inside a commercial building, and (iii) detect violations by comparing the result with a pre-defined threshold. However, the main drawback of this study is the lack of significant assessment. In Rahim et al. (2021), Rahim et al. propose a DL-based VSDM detection scheme based on the YOLOv4 object detection model. A single fixed, motionless time of flight (ToF) camera is used to record video data. After people detection, the Euclidean distance metric is used to measure the physical distances between detected BBs, which are then mapped to real-world distance units. Empirical evaluation has shown an mAP score of 7.84%, and the mean absolute error (MAE) between actual and measured social distance values has reached 1.01 cm. Similarly, in Ismail, Najeeb, Anzar, Aditya, and Poorna (2022), a YOLOv4-based VSDM is proposed to detect pedestrians and then measure the distance between them using the Euclidean distance. This is to guarantee that people are properly following the SD norms.
In Ghasemi, Kostic, Ghaderi, and Zussman (2021), an accurate VSDM pipeline for automating video-based SD analysis, namely Auto-SDA, is designed, whose performance is insensitive to scene dynamics and the camera's viewpoint. This method uses (i) a YOLOv4-based object detector and (ii) a people tracking approach based on the Nvidia DCF-based tracker (NvDCF) for extracting pedestrian trajectories. The latter are then used for computing the proximity duration of every two unaffiliated pedestrians separately. In Shareef et al. (2022), a YOLOv4-based VSDM solution is developed that first detects pedestrians in video scenes before applying a predefined SD threshold and a violation index to detect SD violations. Moving on, warnings are produced to prompt immediate awareness actions. Because low-light environments can result in spreading COVID-19, developing efficient VSDM schemes that address this issue is of utmost importance. To that end, Rahim, Maqbool, Mirza, Afzal, and Asghar (2022) introduce DepTSol, which is a CSP-ized YOLOv4-based VSDM system operating under different light conditions. It also enables the monitoring of pedestrians at varying camera distances.
SSD: in Khel et al. (2021), the authors employ a lightweight CNN-based MobilenetV2 architecture as a framework for the classifier to detect face masks and monitor SD. Additionally, an SSD is used to extract relevant features, while a spatial pyramid pooling (SPP) is deployed for integrating the collected features and improving the model's accuracy. Similarly, in Qin and Xu (2021), an SSD300-based VSDM that is built upon a feed-forward convolutional network (FFCN) is proposed. It produces a fixed-size collection of BBs and scores for the presence of pedestrians, then estimates the distance between them using the Euclidean function. In Gopal and Ganesan (2022), an SSD-based VSDM scheme is introduced, where an overhead-position dataset and a model pre-trained on the MS-COCO dataset have been used to train the pedestrian detector. Additionally, a transfer learning scheme has been employed to enhance the performance of the pre-trained model. A new layer has been integrated over the existing architecture to train on the overhead dataset. Moving on, a centroid chasing algorithm, working on the concept of the fixed distance threshold, is deployed to identify people violating the SD norms.
RetinaNet: in Chaudhary (2020), a VSDM approach is developed and installed in the hardware of CCTV cameras for contact tracing. RetinaNet has been deployed to detect and track pedestrians before using the law of similar triangles to calculate the distance between them. Accordingly, a 30% accuracy improvement has been achieved when the law of cosines is considered. Additionally, a multi-task cascaded CNN-based face detection has been utilized to identify people violating the SD norms. Besides, in Zuo et al. (2021), a VSDM approach is proposed based on obtaining pedestrian density and distance between each pedestrian pair. It uses three pre-trained CNN-based object detection architectures, i.e., RetinaNet, YOLOv3, and Mask RCNN, as backbone models. Mask RCNN and RetinaNet utilize ResNet-101 as the network architecture, and these models are pre-trained using the MS-COCO dataset (Lin et al., 2014). Real-time video sequences gathered in New York City (NYC) have been employed to validate this framework. However, the performance has been quantified using the average pedestrian density (APD) and SD adherence rate (SDAR), which cannot reflect the efficiency of the VSDM system.
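The two geometric rules named for the RetinaNet-based approach can be combined as in the sketch below. It is an illustration under several assumptions that are not stated in the source: a pinhole camera, a hypothetical average person height of 1.7 m, and the simplification that the similar-triangle estimate is treated as the range along each viewing ray.

```python
import math

ASSUMED_PERSON_HEIGHT_M = 1.7  # hypothetical average height, not from the source


def range_from_height(pixel_height, focal_px, real_height_m=ASSUMED_PERSON_HEIGHT_M):
    """Similar triangles: real_height / range = pixel_height / focal_length."""
    if pixel_height <= 0:
        raise ValueError("pixel height must be positive")
    return focal_px * real_height_m / pixel_height


def bearing(u, cx, focal_px):
    """Horizontal viewing angle (radians) of image column u."""
    return math.atan2(u - cx, focal_px)


def separation(range_a, range_b, angle_between):
    """Law of cosines: third side of the camera-person-person triangle."""
    squared = (range_a ** 2 + range_b ** 2
               - 2.0 * range_a * range_b * math.cos(angle_between))
    return math.sqrt(max(0.0, squared))  # guard against tiny negative round-off


# Hypothetical camera (focal length 400 px, principal column 640) and two BBs.
r_a = range_from_height(pixel_height=170, focal_px=400.0)
r_b = range_from_height(pixel_height=136, focal_px=400.0)
theta = abs(bearing(640, 640, 400.0) - bearing(1040, 640, 400.0))
print(f"{separation(r_a, r_b, theta):.2f} m")
```

Using the law of cosines accounts for the angle between the two viewing rays, whereas comparing the two ranges alone ignores lateral separation.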
To highlight the performance of one-stage pedestrian detectors for VSDM under different light conditions, seven detectors are evaluated on the ExDARK dataset to assess the accuracy and speed of their models, as explained in Rahim et al. (2022). Fig. 5 presents (a) the mAP performance (i) at various IoU thresholds (mAP(IoU=0.5) and mAP(IoU=0.75)), and (ii) for different object dimensions (mAP(small), mAP(medium), and mAP(large)); and (b) the mAR performance with reference to (i) the number of detections per image (mAR(max=1), mAR(max=10), and mAR(max=100)), and (ii) the scale variation (mAR(small), mAR(medium), and mAR(large)). The CSP-ized YOLOv4 has achieved the best performance in terms of both the mAP and mAR scores compared to the six other one-stage detectors. For instance, up to 99.7% mAP has been reached by CSP-ized YOLOv4 under mAP(IoU=0.5). For the computational cost, the number of frames processed per second by each model has been assessed on a Tesla T4 GPU with a 512 × 512 network input size, as portrayed in Fig. 6. It has been shown that the best performance has been reached by the CSP-ized YOLOv4, where 51.2 fps has been attained. Overall, one-stage pedestrian detectors have received increasing attention for VSDM due to their computational efficiency and competitive detection performance.
RCNN: in Degadwala et al. (2020), different DL architectures are used to address the VSDM problem, including RCNN, Faster-RCNN, SSD, YOLOv1, YOLOv2, and YOLOv3. After detecting people in the video frames from the MS-COCO (Lin et al., 2014) and PASCAL-VOC (Everingham & Winn, 2011) datasets, the Euclidean distance has been considered to quantify the distance between them.
Fast-RCNN: it improves some of the problems of RCNN and provides a faster architecture for pedestrian detection. In Saponara et al. (2021), Fast-RCNN has been implemented to perform a VSDM. Its performance has been compared with YOLOv2 and YOLOv4-tiny. The latter one has the best performance in terms of pedestrian detection accuracy and computation efficiency.
Faster-RCNN: it is built on RCNN and Fast-RCNN by using a region proposal network (RPN) for sharing complete images' convolutional characteristics with a detection network, which helps generate almost cost-free region proposals. In Ahmed, Ahmad, and Jeon (2021), a transfer-learning-based Faster-RCNN is introduced to detect persons in video frames using BBs, which have been recorded in top view environments. Typically, a pre-trained model has been combined with a new trained layer. Moving on, Euclidean distance is considered to estimate the distances between detected individuals. After estimating the central point of a BB, a distance to pixel threshold is set to determine whether individuals respect SD or not.
In Sahraoui et al. (2020), a DL-based VSDM based on the social internet of vehicles (SIoV) named DeepDist is proposed to detect SD violations in real-time. Typically, the Faster-RCNN model is utilized for detecting physical distancing violations between objects in video sequences recorded with vehicles equipped with thermal and vision imaging systems. The performance of this approach is evaluated on the Stanford vehicles' dataset (SVD), network simulator (NS-3), and the simulation of urban mobility (SUMO). Similarly, in Shah, Chandaliya, Bhuta, and Kanani (2021), a pre-trained Faster-RCNN is selected to perform VSDM from videos recorded using CCTV Cameras. In Tanwar et al. (2021), the VSDM task is performed using Faster-RCNN and YOLOv2 to analyze videos recorded using drone-based and CCTV cameras. The Euclidean distance has been utilized to calculate the distance between pedestrians. More importantly, the developed VSDM solution is augmented with a privacy preservation module based on blockchain, which helps ensure trusted and secure data exchange between different entities and the surveillance center at the physical layer. Additionally, blockchain currencies are utilized to pay fines if individuals violate SD norms.
Mask-RCNN: this architecture helps extend and improve Faster-RCNN (i) by using RoI align instead of RoI pooling to address the location misalignment problem existing in RoI pooling, and (ii) through the addition of a mask branch. However, few VSDM frameworks have been designed based on Mask-RCNN. For instance, Gupta, Kapil, Kanahasabai, Joshi, and Joshi (2020) develop a Mask-RCNN-based VSDM by (i) detecting pedestrians in each video frame, (ii) splitting the input proposals from the region proposal network (RPN) into "bins" using bilinear interpolation, and (iii) applying a pairwise distance measurement to detect whether the SD requirements are respected.
EfficientDet: in Madane and Chitre (2021), three pre-trained object detectors, namely EfficientDet-D0, EfficientDet-D5, and DETR, having ResNet-50 as a backbone, are used to detect pedestrians in public areas. Moving on, the fine-tuned models have been evaluated on the OTC (Davis & Sharma, 2007) and PETS (Ferryman & Shahrokni, 2009) people tracking datasets. In this respect, the developed VSDM system has been built upon DEtection TRansformer (DETR) with the aid of a perspective transform and camera calibration. This makes the distancing monitoring approach independent of the camera angle or position.
Other models: it is worth mentioning that there are other frameworks that have used other object detectors, such as Ghodgaonkar et al. (2020), where Cascade-HRNet is deployed to detect pedestrians after being trained on the crowd human dataset (Shao et al., 2018). In Dai et al. (2021), Dai et al. introduce BEV-Net, a multi-branch network that localizes pedestrians in real coordinates and identifies high-risk areas of SD violation. Typically, this network aggregates camera pose estimation, feet, and head location detection, a differentiable homography scheme for mapping images into BEV coordinates, and uses geometric reasoning for producing BEV maps of individuals' locations in the s

Lightweight CNN models
In contrast to most of the studies that have focused on a front or side perspective for social distance tracking, a BEV is adopted to track SD in Karaman et al. (2021), where a lightweight CNN model, i.e., MobileNet (with SSDv3), and other complex CNN models, i.e., Faster-RCNN (with ResNet-50) and Faster-RCNN (with Inception-v2), are deployed to detect people in video sequences. A prototype has also been developed by implementing the Faster-RCNN-based image analysis algorithm on an embedded Jetson Nano platform, including a Raspberry Pi camera. Moreover, the system has been tested in various public spaces, where audible and light warnings have been used to signal social distance violations. Another VSDM scheme is introduced in Khandelwal et al. (2020) using the MobileNetv2 network as a lightweight person detector to alleviate the computational cost, at the expense of lower accuracy in comparison with other common models. The Euclidean distance between detected people has been measured using a symmetric distance matrix and a 3D projected image of each frame. Moreover, this approach only focuses on distance measurement in an indoor manufacturing setup and does not provide any statistical assessment of the virus spread. In Ansari et al. (2021), a VSDM using a compact CNN-based sequential model is proposed to first detect pedestrians in video frames collected using CCTV cameras. In doing so, a sliding window concept has been adopted as a region proposal when detecting pedestrians in each frame. Next, Euclidean distance has been used to measure the physical distance between detected persons. In Valencia et al. (2021), Tiny-YOLOv4 and DeepSORT models are deployed for crowd counting and SD monitoring in a top-view camera perspective. This system processes, in real time, video streams recorded with CCTV or surveillance cameras, counts the number of detected persons and analyzes the distance between them.
Subsequently, it generates alerts to indicate the number of detected people per unit of time and to identify the individuals violating the SD protocols. In Keniya and Mehendale (2020), a DL-based VSDM detection system, SocialdistancingNet-19, is developed to detect individuals in video frames and display labels marked as safe or unsafe based on the monitored distance. SocialdistancingNet-19 includes two subnetworks used for feature extraction and detection: CNN and MobileNet-V2 models. Moreover, its performance has been compared to reduced ResNet-50 and ResNet-18 architectures, where an accuracy of 92.8% has been reached by SocialdistancingNet-19. In Shao et al. (2021), the lightweight PeleeNet model is used as a backbone for a pedestrian detection module implemented on drones. This enables detecting pedestrians in real time based on human head detection on UAV images. Typically, spatial attention and multi-scale features are incorporated to enhance the features of small objects, such as human heads. After that, SD is measured between pedestrians using a calibration approach. Moving forward, an end-to-end VSDM system that can support real-time implementation on edge devices has also been developed. In doing so, the PoseNet model, a lightweight version of GoogleNet for real-time pedestrian pose estimation, is used. Moreover, physical distances between pedestrians are measured by synchronizing their positions in cameras to a 2D map. Table 1 summarizes the most pertinent VSDM frameworks based on CNN and their characteristics in terms of the ML model used, description of the methodology adopted, datasets used for training/test, best performance, and advantage or limitation.
Table 1 (excerpt): ML model: Faster-RCNN, YOLOv2; description: secure and privacy-preserving VSDM using blockchain; dataset: COCO; best performance: AUC = 73%; limitation: although a secure and privacy-preserving VSDM framework is presented, the detection accuracy needs further improvement.
Most existing VSDM techniques are based on frame-by-frame human detection. They focus on resolving the VSDM problem from local and static perspectives. By contrast, Su et al. (2021) introduce an online multi-pedestrian detection and tracking scheme. It relies on (i) using hierarchical data association for deriving the trajectories of pedestrians in public spaces, (ii) applying spatio-temporal trajectories to implement the VSDM approach, and (iii) using the Euclidean distance between tracked objects frame-by-frame and considering the discrete Fréchet distance between trajectories to efficiently measure distance in both static and dynamic, local and holistic scenarios. The Average Ratio of Pedestrians with Unsafe SD (ARP-USD) has been used to evaluate the performance of this technique.
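Since the discrete Fréchet distance may be less familiar than the Euclidean distance, a minimal sketch of its standard dynamic-programming computation is given below. The trajectories are assumed to be lists of (x, y) positions; the sample values are hypothetical, and the code illustrates the general measure rather than the exact implementation of Su et al. (2021).

```python
import math


def discrete_frechet(traj_p, traj_q):
    """Discrete Fréchet distance between two non-empty point sequences."""
    n, m = len(traj_p), len(traj_q)
    if n == 0 or m == 0:
        raise ValueError("trajectories must be non-empty")
    # ca[i][j] holds the coupling distance for the prefixes ending at i and j.
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = math.dist(traj_p[i], traj_q[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                ca[i][j] = max(min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1]), d)
    return ca[n - 1][m - 1]


# Two hypothetical pedestrian trajectories walking in parallel, 1 m apart.
walker_a = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
walker_b = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
print(discrete_frechet(walker_a, walker_b))
```

Unlike a single frame-wise Euclidean distance, this measure respects the ordering of points along both paths, which is why it suits the dynamic, holistic comparison described above.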

Multi-object tracking (MOT)
Besides, IMPERSONAL is introduced in Giuliano et al. (2021) to detect and track SD and alert users in case of gatherings. The process is conducted in three steps: (i) object detection, (ii) multi-object tracking (MOT), and (iii) distance estimation. This system is built upon Fair-MOT (Zhang, Wang, Wang, Zeng, & Liu, 2021), an MOT scheme that is in turn based on a ResNet-34 backbone network. Moving forward, the retrieved information is sent to an IoT subnetwork to (i) identify the anonymous IDs of people belonging to a gathering and (ii) provide them with alert messages. This framework has been validated on the PETS2006 dataset (PETS2006 database, 2006) and other real-world video data recorded from outdoor live cameras in Odessa and Mykolaiv (Ukraine). In Rezaei and Azarmi (2020), a YOLOv4-based VSDM in the crowd using CCTV cameras is presented, which can be applied in either outdoor or indoor environments. Specifically, an adapted inverse perspective mapping (IPM) approach has been integrated into the VSDM system along with a simple online and real-time tracking (SORT) technique. This has resulted in efficient pedestrian detection and SD analysis. The overall system has been trained on the MS-COCO and GOI datasets and validated on the OTC dataset and real-world scenarios with challenging conditions, e.g., different lighting conditions, occlusion, and partial visibility. Concretely, a 99.8% mAP and a processing speed of 24.1 fps have been achieved. Moving on, statistical analysis has been used to assess online infection risks using SD violations and spatio-temporal information from pedestrian movement trajectories. Fig. 7 presents the flowchart of the YOLOv4-based VSDM proposed in Rezaei and Azarmi (2020).

Transfer-learning-based VSDM
TL consists of training a model on a specific domain (or task) and then transferring the acquired knowledge to a new, similar environment (or task). For example, let us consider pedestrian detection, where a DL algorithm can be pre-trained on the large-scale ImageNet dataset to generate optimal model parameters. Next, a part of the model is re-trained (i.e., fine-tuning), and the validation process is performed on a new video target dataset collected from a real-world scenario (Ahmed, Jeon, Chehri, & Hassan, 2021;Loey, Manogaran, Taha, & Khalifa, 2021). Additionally, DL models can be pre-trained to perform a specific task like generic object detection on large-scale datasets, such as ImageNet, and fine-tuned to conduct a different but related task, such as pedestrian detection in VSDM. Fig. 8 explains the difference between conventional ML and TL techniques.

Fine-tuning
Most VSDM-based TL techniques are based on fine-tuning a pre-trained DL model when the source and target domains are almost similar. In Shin and Moon (2021), Shin et al. first detect pedestrians in CCTV images using a YOLOv4-based TL object detector. After that, DeepSORT-based MOT is utilized for assigning IDs and tracking objects. Moving forward, the weights of the transformation matrix are derived to extract the object coordinates using image warping of the initial frames. The center points of the BBs for the pedestrians are transformed to fit the shapes of the transformed frames using the extracted transform matrix weights. Subsequently, actual distances are calculated using the Euclidean distance function. Punn et al. (2020) combine a fine-tuned YOLOv3-based VSDM approach for detecting pedestrians with the DeepSORT technique (Wojke et al., 2017), which tracks detected persons using assigned IDs and BBs. To fine-tune the pedestrian detector, an open image dataset (OID) has been considered, while the validation has been conducted on the OTC dataset. Moving forward, the empirical results have been compared with SSD and Faster-RCNN. However, no discussion of the validity of SD measurements is provided, and the statistical analysis of the obtained results is missing.
Using the same process, in Ahmed, Ahmad, Rodrigues, et al. (2021), SD tracking is performed by detecting people in video sequences using a YOLOv3 object recognition system. Also, a TL scheme is considered to reduce the computational cost and improve detection accuracy. Fig. 9 illustrates the flowchart of the TL-based pedestrian detection system using overhead video frames, which has been employed to measure the physical distances between pedestrians. Typically, fine-tuning is adopted by freezing all the layers of the pre-trained YOLOv3 architecture, and only one new layer is trained on the real-world video training set.
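The idea of freezing a pre-trained network and training only one new layer can be illustrated with a deliberately tiny, dependency-free sketch. Everything here is a stand-in chosen for illustration: the "backbone" is a fixed two-unit ReLU layer with hypothetical weights rather than YOLOv3, the new layer is a logistic-regression head, and the four training samples are synthetic.

```python
import math

# Hypothetical "pre-trained" weights; they are never updated during fine-tuning.
FROZEN_BACKBONE = [[1.0, -1.0], [0.5, 0.5]]


def backbone(x, weights=FROZEN_BACKBONE):
    """Frozen feature extractor: linear layer followed by ReLU."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]


def head(features, w, b):
    """New trainable layer: logistic regression on the frozen features."""
    z = sum(wi * fi for wi, fi in zip(w, features)) + b
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(data, w, b, eps=1e-12):
    """Mean binary cross-entropy of the head on the given samples."""
    total = 0.0
    for x, y in data:
        p = head(backbone(x), w, b)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps)
    return total / len(data)


def fine_tune_head(data, epochs=200, lr=0.5):
    """Gradient descent on the head only; the backbone is left untouched."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for x, y in data:
            f = backbone(x)
            err = head(f, w, b) - y  # derivative of the log-loss w.r.t. the logit
            grad_w = [g + err * fk for g, fk in zip(grad_w, f)]
            grad_b += err
        w = [wk - lr * g / len(data) for wk, g in zip(w, grad_w)]
        b -= lr * grad_b / len(data)
    return w, b


# Synthetic target-domain samples: (input, label).
target_data = [((2.0, 0.0), 1), ((3.0, 1.0), 1), ((0.0, 2.0), 0), ((1.0, 3.0), 0)]
initial_loss = log_loss(target_data, [0.0, 0.0], 0.0)
head_w, head_b = fine_tune_head(target_data)
final_loss = log_loss(target_data, head_w, head_b)
print(initial_loss, final_loss)
```

Training only the head keeps the number of updated parameters small, which is the source of the reduced computational cost mentioned above.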
In Ahmed, Ahmad, and Jeon (2021), a transfer-learning-based Faster-RCNN is introduced to detect persons, using BBs, in video frames recorded in top-view environments. Typically, a pre-trained model has been combined with a new trained layer. Moving on, Euclidean distance is considered to estimate the distance between detected individuals. After locating the central point of a BB, a distance-to-pixel threshold is set to determine whether individuals respect the SD norms or not. In Bouhlel, Mliki, and Hammami (2021), a VSDM scheme using drone-based surveillance is proposed, which relies on crowd behavior analysis. Typically, crowd density is first estimated by categorizing the drone video frame patches into four classes: none, medium, sparse and dense. Next, pedestrians are detected and tracked before calculating their physical distances. A TL approach is adopted for crowd density estimation, where the pre-trained AlexNet is utilized. Typically, fine-tuning is adopted by substituting the classification layer with a novel softmax layer to classify the crowd patches into the classes mentioned above. Three datasets have been used to validate this approach, including Meynberg's dataset (Meynberg & Kuschk, 2013), Mliki's dataset (Hazar, Arous, & Hammami, 2019) and UCF-ARG (Nagendran, Harper, & Shah, 2021).
Fig. 9. The TL-based pedestrian detection framework proposed in Ahmed, Ahmad, Rodrigues, et al. (2021), which is built using YOLOv3 and real-world overhead video frames.

Domain adaptation (DA)
DA refers to the possibility of applying a DL algorithm trained on a specific domain (source domain) to another distinct but related domain (target domain). This research topic has received increasing interest in the last decade as it helps in reducing the complexity of DL-based computer vision solutions (Khan & Alamin, 2021). Despite the importance of DA, few VSDM frameworks have been built on it. For instance, the authors in Di Benedetto et al. (2022) propose a VSDM scheme to monitor compliance with SD norms in indoor and outdoor environments. In doing so, the DA-based VSDM strategy consists of (i) launching a new real-world crowd counting and monitoring dataset, namely CrowdVisorPisa; (ii) training a Faster-RCNN model on a synthetic dataset, namely Virtual Pedestrian Dataset (ViPeD) (Amato, Ciampi, Falchi, Gennaro, & Messina, 2019), to detect pedestrians; (iii) fine-tuning this model on real-world data by employing the balanced gradient contribution (BGC) method, which mixes synthetic and real-world data during training to boost the performance; and (iv) measuring the physical distances between detected pedestrians using a pre-calibration strategy and a geometrical transformation. Table 2 presents a summary of TL-based VSDM frameworks and their features concerning the adopted ML model, method description, datasets used for validation, best performance, and advantage/limitation. It can be seen that the best performance has been achieved by Bouhlel et al. (2021), where a TL-based AlexNet approach is adopted to perform VSDM in drone images. Typically, an accuracy of 99.58% has been reached.
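The mixing of synthetic and real-world data during training can be pictured as a batch sampler that draws a fixed share of each batch from every domain. The sketch below is written in the spirit of BGC only: the one-third real fraction, the function name, and the placeholder sample identifiers are assumptions, and the exact procedure in Di Benedetto et al. (2022) may differ.

```python
import random


def mixed_batches(synthetic, real, batch_size, real_fraction, num_batches, seed=0):
    """Yield batches containing a fixed proportion of real and synthetic samples."""
    if batch_size < 2:
        raise ValueError("batch_size must allow at least one sample per domain")
    if not 0.0 < real_fraction < 1.0:
        raise ValueError("real_fraction must lie strictly between 0 and 1")
    rng = random.Random(seed)
    # Keep at least one sample from each domain in every batch.
    n_real = min(batch_size - 1, max(1, round(batch_size * real_fraction)))
    n_synthetic = batch_size - n_real
    for _ in range(num_batches):
        # Sampling with replacement, so small real datasets can still be used.
        batch = rng.choices(real, k=n_real) + rng.choices(synthetic, k=n_synthetic)
        rng.shuffle(batch)
        yield batch


# Placeholder sample identifiers standing in for images and annotations.
synthetic_pool = [("synthetic", i) for i in range(100)]
real_pool = [("real", i) for i in range(10)]
batches = list(mixed_batches(synthetic_pool, real_pool, batch_size=6,
                             real_fraction=1 / 3, num_batches=5))
print(batches[0])
```

Fixing the share per batch, instead of concatenating both datasets, prevents the much larger synthetic set from dominating every gradient step.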

3D-based VSDM
The COVID-19 pandemic has shown, more than ever, the need to perceive people in 3D when using visual intelligence systems. In this context, efficiently monitoring SD requires going beyond a measure of distance to also perceive people's orientations and relative positions. Put differently, people talking to each other raise the risk of contagion more than people walking apart. To that end, Bertoni et al. (2021) develop a VSDM solution that analyzes SD based on both 3D localization and social cues. Specifically, a DL-based VSDM method is proposed to detect people's 3D locations and their body orientations from monocular cameras. This approach is built upon an improved version of MonoLoco (Bertoni, Kreiss, & Alahi, 2019), based on a deep fully-connected network (DFCN). Similarly, in Niu et al. (2021), Niu et al. introduce a 3D-based VSDM that enables detecting and localizing pedestrians in 3D using a combination of terrestrial point clouds and monocular images. Moreover, the correspondence between 2D image points and 3D world points has been used to calibrate the camera. Typically, point clouds have been utilized to extract the vertical coordinates of the ground plane (where the pedestrians stand). Moving on, the 3D coordinates of the pedestrian's head and feet have then been estimated iteratively using collinear equations, assuming that the pedestrians are perpendicular to the ground. Therefore, this helps localize pedestrians in 3D based on data from monocular cameras, which are broadly installed in smart cities.
Table 2 (excerpt): datasets: ViPeD, CrowdVisorPisa; best performance: mAP = 83.6%; limitation: the performance needs further improvement, and privacy concerns have not been addressed.

Detection of free-standing conversation groups (FCGs) and social groups (SGs)
Seeking to prevent the formation of free-standing conversation groups (FCGs) in social gatherings, a convolutional variational autoencoder (CVAE) model is employed in Varghese and Thampi (2021) to develop a VSDM by integrating data from various sensor modalities. SD violations are detected, considering the spatial characteristics required for managing illumination variations and occlusions of video data. If SGs are detected as graphs using the pre-trained CVAE and connected components in graph theory, violation alerts are generated. Moreover, an SG graph clustering is performed using a cost function to identify FCGs based on a socio-psychological theory of Friends-formation. On the other hand, blind and visually impaired (BVI) people have some issues when practicing SD because of their low vision, which impedes them from maintaining a safe physical distance from other persons. To that end, the authors in Shrestha et al. (2020) introduce a smartphone-based VSDM that relies on CNN crowd detection before conveying the associated risks to BVI users via directive audio alerts on their mobile phones. Typically, pedestrians are first detected, and their distances from the mobile phone's monocular camera feed are estimated. Moving on, pedestrians are clustered into crowds to calculate distance and density maps from the crowd centers. Lastly, the system tracks each detection in previous frames to create motion maps that help (i) predict the crowds' motion information and (ii) produce corresponding audio alerts. Active Crowd Analysis is designed for real-time smartphone use, utilizing the phone's native hardware to ensure the BVI can safely maintain SD (Shrestha et al., 2020).
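The use of connected components to obtain social groups can be sketched as follows. The example assumes that pedestrian positions are already available in metric ground coordinates and links any two people closer than a chosen gap; the 1.5 m gap and the positions are hypothetical, and the CVAE stage of Varghese and Thampi (2021) is not reproduced here.

```python
import math
from itertools import combinations


def group_by_proximity(positions, max_gap):
    """Connected components of the graph linking people closer than max_gap."""
    parent = list(range(len(positions)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(positions)), 2):
        if math.dist(positions[i], positions[j]) <= max_gap:
            parent[find(i)] = find(j)  # merge the two components

    groups = {}
    for i in range(len(positions)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical ground-plane positions in metres.
people = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 10.0), (10.5, 10.0), (30.0, 30.0)]
social_groups = group_by_proximity(people, max_gap=1.5)
print(social_groups)
```

Note that persons 0 and 2 end up in the same group even though they are 2 m apart, because both are linked to person 1; this chaining is inherent to the connected-component definition.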
Moving on, in Usman et al. (2020), Usman et al. develop a VSDM for shopping malls using a crowd-based simulator. It is based on clustering consumers' behavior into three levels and using agent control. The SD index (SDI) is introduced as an evaluation metric, which is estimated to indicate the tendency of consumers to maintain a safe distance during their shopping experience. Concretely, SDI represents the occupancy throughput and the number of detected SD violations. This simulated VSDM has been tested on different scenarios by varying navigational guidelines, occupancy rate, and agent behavior.
It is worth noting that apart from the aforementioned studies, various VSDM solutions have also been proposed to analyze SD between pedestrians during the pandemic. For instance, the ones developed by Trident (Face mask detection system using artificila intelligence, 2022) and Landing AI (Social distancing detector, 2022) use AI-based algorithms to measure the physical distances between pedestrians using surveillance cameras. In addition, some solutions utilize visual data recorded from LiDAR cameras (Social Distance Monitoring, 2022) and 3D cameras (Using 3d cameras to monitor social distancing, 2022) to control SD. Moreover, visual intelligence is also used for real-time face mask detection in public, such as DatakaLab (Datakalab | Analyse de l'image par ordinateur, 2022), Trident (Face mask detection system using artificila intelligence, 2022) and Deloitte (Protected-your ai-solution for face mask detection in public places, 2022). These solutions provide an instant output, helping organizations meet public health guidelines.

Pedestrian localization error
When developing VSDM systems, it is essential to assess the pedestrian localization errors that can occur due to different reasons, e.g., occlusions (as found in the Mall dataset), small sizes of pedestrians (as seen in TSD), noise, etc. However, most existing VSDM frameworks claimed that they had achieved a limited number of missed detections, which has slightly affected the monitoring of SD violations, as explained in Yang, Yurtsever, et al. (2021).

Indoor environment
A typical example of VSDM systems has been proposed in Niu et al. (2021), which enables the localization and detection of pedestrians in video frames recorded using monocular cameras before measuring the physical distances between them. Fig. 10 illustrates an example of an indoor scene at the CUMTB-Campus, where four pedestrians have been detected using YOLOv1. The pedestrians' localization and height errors at different distances from the camera in this scene, together with the SD errors between adjacent pedestrians, are evaluated, and the results are reported in Table 3. Overall, the largest localization error reaches 0.32 m (pedestrian 1), while the largest height error reaches 0.229 m (pedestrian 3). However, these errors have only a slight effect on the SD monitoring errors, where the largest error is 0.082 m (pedestrians 3~4). If an SD norm of two meters is adopted, an average SD monitoring accuracy of 99.1% is reached.
Fig. 10. Example of an indoor video scene with four detected pedestrians recorded at the CUMTB-Campus to evaluate the VSDM system developed in Niu et al. (2021).

Table 3
Evaluation of pedestrian localization and height errors of an indoor video scene recorded at the CUMTB-Campus.

Table 4
Evaluation of pedestrian localization and height errors in an outdoor scene recorded at the CUMTB-Campus.

Outdoor environment
The second example assesses the pedestrian localization error in an outdoor scene recorded at the CUMTB campus. It includes eight pedestrians detected using YOLOv1, who are in different positions, including overlapping people, as portrayed in Fig. 11. Table 4 presents the localization and height errors of the pedestrians detected in this scene. Additionally, the SD monitoring errors between adjacent pedestrians are listed. It can be seen from the obtained results that the largest localization error occurs for pedestrian 5, who is more than 50 m away from the camera. The second-largest error is obtained for pedestrian 3 (yellow box), mainly due to occlusion. However, it is worth noting that the maximum relative error of SD monitoring is 0.207 m. Keeping in mind that an SD norm of two meters has been considered in this study, an average SD accuracy of 94.5% has been achieved.
Besides, in Shao et al. (2021), the error of pedestrian localization and accuracy of the VSDM system based on PeleeNet are evaluated under different scenes with a multitude of pedestrian position patterns. Fig. 12 portrays a typical scene (recorded with a drone-based camera) used to assess the VSDM system performance with eight pedestrian positions. Typically, detected social distances are compared with the ground truth before calculating each pedestrian pair's absolute errors and SD accuracy. Table 5 depicts obtained results regarding the absolute errors (in m) and SD accuracy. Overall, an average error of 0.109 m has been achieved along with an SD accuracy of 0.945.

Critical discussion
The comprehensive overview conducted in this paper has shown that a significant number of studies have been proposed to develop efficient VSDM systems and help slow down the spread of COVID-19. Most of them are based on analyzing video sequences, detecting pedestrians in each frame, and quantifying the distances between detected people. From another point of view, most prototypes have focused on side and frontal camera perspectives, such as Ramadass et al. (2020) and Punn et al. (2020). Moreover, many frameworks have demonstrated that visual data can be used to monitor SD effectively by (i) accurately estimating physical distances, (ii) detecting crowd gatherings, and (iii) counting the number of people in each crowd. However, it is of utmost importance to mention that most existing VSDM solutions are based on a frame-by-frame SD analysis rather than on SD monitoring over time. In what follows, we summarize the main findings derived from this study.
Datasets: Validating VSDM frameworks necessitates at least one benchmark dataset, which includes a large number of images or video clips, different SD scenarios, different environments and scenes (outdoor, indoor (shopping malls, sports facilities, transport facilities, etc.)), and a proper ratio between virtual data and real data. However, the datasets used for evaluating some existing frameworks have a set of limitations, which can be discussed as follows:
• Some datasets include a small number of images, e.g., several hundred, which can limit the performance of DL algorithms trained on them and result in overfitting problems. Indeed, when DL models are used, a larger dataset often helps develop a more accurate DL model.
• Some datasets rely on simulated images/videos generated using virtual reality. DL algorithms are trained on these kinds of data, while in the real world, they should be validated on real images/videos. In this respect, their performance can drop due to the significant difference between the source and target domains.
• In some datasets, images/videos are gathered from simple scenes, which can easily bias them toward a specific scene. In this regard, DL models trained on these datasets may be inefficient for new scenes.
• It has been demonstrated in the literature that training DL models on the VOC dataset for object detection (pedestrians) can improve models' performance. Typically, while the mAP of a DL model can reach 35.9%-46.5% with the MS-COCO dataset, it can attain 57.9%-74.9% with the VOC dataset (Redmon & Farhadi, 2017). Therefore, this dataset has been used to pre-train different VSDM algorithms (Ahmed, Ahmad, Rodrigues, et al., 2021;Pi et al., 2021).

Table 5
Evaluation of pedestrian localization errors and SD accuracy in an outdoor scene from the Merge-Head dataset (Shao et al., 2021).

On the other hand, although most developed systems are advantageous, they still have limitations and can be improved in different manners, including the (i) estimation of the bodies' orientations for relaxing the assumption of vertically oriented subjects; (ii) fusion of pedestrian detection samples and distance measurements from multi-view cameras for assessing the environment state instead of the particular camera scenery; (iii) development of online automatic training processes to track algorithms' parameters; (iv) integration of regression models for estimating crowd density maps; (v) detection of other abnormalities that can be related or not to the COVID-19 pandemic, e.g., smoke, fire or unattended objects in public areas, and any other abnormal events corresponding to crowd gatherings.
Additionally, even though some existing VSDM approaches have addressed both detection and tracking (e.g., Ahmed, Ahmad, Rodrigues, et al., 2021;Sathyamoorthy et al., 2020), the tracking schemes of these frameworks have been utilized to track detected pedestrians and then associate them with assigned IDs instead of using trajectory-based VSDM. Typically, these techniques based on frame-by-frame analysis pertain to the detection-based VSDM group. At the same time, only the work in Pouw et al. (2020) belongs to the trajectory-based VSDM category, which is based on quantifying the distances between spatiotemporal trajectories to address the SD problem in a dynamic manner. Put simply, detection-based VSDM techniques aim to detect and calibrate individuals' positions and then analyze the frames one by one to measure the distances between detected individuals in the BEV. By contrast, the trajectory-based VSDM approaches track people and calibrate the trajectories. Thereafter, the corresponding calibrated trajectories in the 3D spatiotemporal coordinates (in addition to the time axis) are used to determine distances between detected pedestrians. For better monitoring of the SD, continuous measurement and analysis over time is more appropriate than analysis at a specific moment. Therefore, more research focus should be put on investigating the VSDM problem based on analyzing spatiotemporal trajectories over time.
Besides, most existing VSDM systems have reached excellent performance in terms of the accuracy of detecting pedestrians and measuring the distance between them in low- and medium-density scenes. This is due to the capability of cameras to easily track moving objects and measure physical distances in such environments. However, performing these tasks in crowded and dense places remains challenging. Indeed, some pedestrians can be hidden behind one another, and hence they become invisible in crowded gatherings, even for human observers. Consequently, it is quite difficult to assign BBs to all pedestrians in dense scenes (Sahraoui et al., 2020).

Pedestrian overlapping and sensor noise
Pedestrian overlapping and occlusion is a serious problem that can considerably bias the results of VSDM systems. Also, the distance calculation can be inaccurate in some indoor applications due to the limited height and space. Typically, the need for video data from multiple cameras is significant. While this option can be achieved in both indoor and outdoor scenarios by installing numerous cameras and collecting different views, adopting drone-based surveillance that has the flexibility to move and monitor pedestrians can be another option for outdoor application scenarios.
Moreover, as most reviewed VSDM frameworks have relied on using ML and DL tools, the probability that a detected violation is a false alarm (due to the sensor noise or other reasons) has been assessed using different ML metrics, e.g., confusion matrix, false alarm rate (FAR), etc. For instance, in Rahim et al. (2021), a YOLOv4-based VSDM solution under different low light conditions is proposed, demonstrating good reliability to light changes (which can be considered as sensor noise). Typically, no single false-positive (FP) has been detected.
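As a brief illustration of the metrics mentioned above, the following minimal sketch derives confusion-matrix counts and the FAR from per-pair violation decisions. It uses the standard definition FAR = FP / (FP + TN); the function names and the boolean input format are assumptions made for illustration and are not taken from any of the reviewed frameworks.

```python
def confusion_counts(predicted, actual):
    """Count TP, FP, TN, FN for boolean violation decisions (hypothetical input format)."""
    tp = fp = tn = fn = 0
    for pred, truth in zip(predicted, actual):
        if pred and truth:
            tp += 1
        elif pred and not truth:
            fp += 1
        elif not pred and not truth:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def false_alarm_rate(false_positives, true_negatives):
    """FAR = FP / (FP + TN); returns 0.0 when there are no negative samples."""
    negatives = false_positives + true_negatives
    return false_positives / negatives if negatives else 0.0


# Illustrative data: one false alarm among three non-violating pairs.
predicted = [True, False, True, False]
actual = [True, False, False, False]
tp, fp, tn, fn = confusion_counts(predicted, actual)
far = false_alarm_rate(fp, tn)
```

A system with no false positives, as reported for the low-light experiments above, would yield a FAR of zero under this definition.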

Computational complexity
Based on the literature review, some studies have successfully achieved real-time SD monitoring (including pedestrian detection, calculation of interpersonal distances between pedestrians, violation detection and generation of alerts) in moderately dense crowds, such as Chandel et al. (2020), Nakano and Nishimura (2021), Pouw et al. (2020), Sahraoui et al. (2020), Saponara et al. (2021), Saponara, Elhanashi, and Zheng (2022), Shao et al. (2021). Besides, other commercial solutions have also been developed for real-time monitoring of SD, e.g., dRISK (drisk: Real-time monitoring of social distancing , 2022), based on predicting distances between individuals using a single monocular CCTV camera. Similarly, the live SD monitoring (LSDM) solution (LSDM: live social-distancing monitoring solution, 2022), developed by Intel, achieves real-time tracking and monitoring of pedestrians using distributed computing, AI models, and radar sensors. Moreover, it enables the representation of detected pedestrians as live, contextual insights and reporting on web-based dashboards.
However, the complexity of VSDM systems (e.g., detecting all mutual distances) increases with the density of monitored crowds (i.e., with the number of observed people). For instance, the computational complexity of the VSDM system introduced by Al-Sa'd et al. (2022) has been measured by its frame rate (the number of processed video frames per second) and processing rate (i.e., the amount of processing time per frame). This VSDM system includes (i) person detection and localization, (ii) top-view transformation, (iii) smoothing/tracking (smoothing noisy top-view positions and compensating for missing data due to occlusion with tracking), (iv) distance measurement and (v) violation detection. The evaluation has been carried out for two cases, i.e., without and with the smoothing/tracking stage, on a desktop equipped with 2 Intel Xeon E5-2697V2 x64-based processors and 192 GB of memory. Fig. 13 portrays the computational complexity analysis results obtained for both scenarios. The average results demonstrate the capability of the system to run in real-time, although the smoothing/tracking stage adds computational complexity. In this regard, the VSDM system can run at 106.5 fps (9.9 ms/frame) without the smoothing/tracking stage, while this decreases to 33.6 fps (44.5 ms/frame) when accommodating the smoothing/tracking algorithm. On the other hand, the reported results also confirm that increasing the number of tracked people can significantly increase the computational cost. This leads to lower frame and processing rates in both scenarios (i.e., without and with the smoothing/tracking stage).
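To make the growth in cost concrete, the sketch below counts the mutual distances to be evaluated when every pair of detected people is compared, and measures frame rate and per-frame processing time for an arbitrary processing function. It is only an illustration under these assumptions; `process_frame` is a hypothetical placeholder and does not correspond to the pipeline of Al-Sa'd et al. (2022).

```python
import time


def pair_count(num_people):
    """Number of mutual distances when all pairs are compared: n(n-1)/2."""
    return num_people * (num_people - 1) // 2


def measure_throughput(process_frame, frames):
    """Return (frames per second, milliseconds per frame) averaged over all frames."""
    start = time.perf_counter()
    count = 0
    for frame in frames:
        process_frame(frame)  # hypothetical per-frame VSDM pipeline
        count += 1
    elapsed = time.perf_counter() - start
    if count == 0 or elapsed == 0:
        return 0.0, 0.0
    return count / elapsed, 1000.0 * elapsed / count


# Ten people require 45 distance checks, one hundred people require 4950.
checks_small = pair_count(10)
checks_large = pair_count(100)
```

The quadratic growth of `pair_count` is one reason why frame rates drop as the number of tracked people rises.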

Camera calibration
Some VSDM frameworks have different limitations, among them camera calibration, which is performed manually (Nakano & Nishimura, 2021). Even worse, for some datasets, the floor plan or the transformation matrix is missing. Thus, the authors need to estimate the size of a reference object in video frames by comparing it with the width of detected pedestrians and then utilize the key points of the reference object to measure the perspective transformation. In this respect, a transformation can be produced and used for camera calibration. To overcome the problem of camera calibration, Nakano and Nishimura (2021) introduce a two-stage automatic VSDM, which is based on (i) offline camera auto-calibration using human joints to determine the 3D position and rotation of the camera, and (ii) pedestrian detection using pose detection, followed by detection of the pedestrians' 3D locations using the estimated calibration data and distance measurement in the BEV.
Most drone-based and CCTV cameras collect tilted images, which makes their transformation to real-world coordinates challenging. However, as discussed in studies such as Dubrofsky (2009), a homography exists between video frames recorded with the same camera for the same area at different positions or angles. In particular, a homography exists between two planes of the same area that correspond to tilted and vertical images, respectively. Solving this problem therefore requires two steps: transforming the tilted images to vertical images using a homography matrix, and then transforming the vertical images to real-world coordinates using the pixel-to-meter ratio. Fig. 14 portrays an example of calibrating tilted images, where H is a homography matrix as defined in Dubrofsky (2009) and a scale factor gives the ratio of pixel to meter.
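The two-step mapping described above can be sketched as follows: a point in the tilted image is first mapped to the vertical image with a 3x3 homography applied in homogeneous coordinates, and the result is then scaled by the pixel-to-meter ratio. This is a minimal sketch under the assumption that H and the ratio are already known; the function names and the example values are hypothetical.

```python
def apply_homography(H, point):
    """Map a pixel (x, y) through a 3x3 homography H given as nested lists."""
    x, y = point
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if w == 0:
        raise ValueError("point maps to infinity under this homography")
    # Divide by the homogeneous coordinate to return to pixel coordinates.
    return (u / w, v / w)


def to_meters(point_px, pixels_per_meter):
    """Convert vertical-image pixel coordinates to meters (assumed uniform scale)."""
    return (point_px[0] / pixels_per_meter, point_px[1] / pixels_per_meter)


# Illustrative values only: an identity homography and 50 pixels per meter.
H_identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
ground_point = to_meters(apply_homography(H_identity, (100, 50)), 50)
```

In practice, H would be estimated from point correspondences between the tilted and vertical views rather than assumed.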

Lack of annotated datasets
Because of the privacy concerns and lockdown measures set in various countries, producing large-scale datasets for validating VSDM monitoring solutions was challenging. To close this gap, virtual reality (VR) is used in Mukhopadhyay, Reddy, Ghosh, LRD, and Biswas (2021), Mukhopadhyay, Reddy, Saluja, et al. (2021), to generate customized datasets and validate DL-based VSDM algorithms. Typically, VR can provide capabilities of interaction between individuals in a shared 3D environment. This opens the doors for various shared activities and experiences that could not be possible with other remote communication modalities.
In this regard, VR has been adopted to model a digital twin of an office space, which is then utilized to produce a comprehensive dataset of users in various locations, dresses and postures. Besides, in Mukhopadhyay, Reddy, Saluja, et al. (2021), a CNN-based VSDM system is implemented to detect individuals in a limited-sized dataset of real humans, which has been augmented with a simulated dataset of humanoid figures. Typically, the VR environment has been improved using an interactive dashboard, which shows information gathered from physical sensors and the latest statistics on COVID-19. On the other hand, YOLOv3 has been utilized to detect people in VR environments. Moving on, in Priyan, Johar, Alkawaz, and Helmi (2021), VR, smartphones, and IoT devices are used to monitor the compliance of pedestrians with SD norms. Specifically, an augmented reality app visually enables people to control their distances in the real world using their mobile cameras.

Security and privacy concerns
VSDM techniques are based on mass surveillance of crowds and individuals in public areas; thus, it is imperative to consider their potential impacts on the surrounding environments. Typically, full adherence to safety guidelines is not ensured, as the VSDM technology is susceptible to human error and exposed to different privacy breaches. Specifically, exchanging images/videos that include information about detected individuals with data centers and responsible authorities to penalize SD violators can represent a serious privacy issue (Sugianto, Tjondronegoro, Stockdale, & Yuwono, 2021). Additionally, numerous complaints have been raised about increased panic and anxiety among the individuals receiving repetitive alerts.
To that end, developing systems that automate the SD monitoring procedure with high security and privacy preservation levels is becoming an urgent need. More recently, a few studies have been proposed to overcome these issues. For instance, in Al-Sa'd et al. (2022), a privacy-preserving VSDM method for CCTV cameras is proposed. Typically, a person localization method is developed based on pose estimation. Next, a privacy-preserving adaptive smoothing and tracking approach is built for (i) mitigating noisy/missing measurements and occlusions, and (ii) computing distances between pedestrians (in the real-world coordinates), detecting SD violations, and identifying overcrowded areas in scenes. Moving on, CNN models and the blockchain technology have been leveraged in Tanwar et al. (2021) to monitor SD. If SD violations are detected, the surveillance center is alerted via blockchain and necessary actions are then taken. Another solution to alleviate the privacy issues relies on adopting BEV cameras. In this regard, because of the privacy concerns raised when deploying street-level cameras to record videos, Ghasemi, Yang, et al. (2021) develop a BEV-based SD analyzer (B-SDA), which helps preserve pedestrians' privacy by using BEV cameras.

Detection of family-groups and safe social groups (SSGs)
In most existing VSDM frameworks, SD violations are defined as instances where mutual distances between pairs of individuals become lower than a predefined threshold. Typically, these are considered as violations without any exceptions. However, this must not be the case for family-groups, as they are allowed to stay closer and no alerts should be triggered. To that end, it is important to discriminate between "safe social groups (SSGs)" and random pedestrians in close proximity to each other. An SSG can be defined as an ensemble of individuals supposed to reside together, e.g., a family.
Despite the importance of this point, few studies have investigated the detection of family-groups or safe social groups while monitoring the distance between pedestrians. For instance, in Pouw et al. (2020), the authors focus on real-time trajectory detection and individual group analysis by imposing thresholds on the distance-time contact patterns. Typically, mutual distances and contact times have been considered along with statistical observables such as the radial distribution functions (RDFs). They have conveniently been utilized for quantifying average exposure times. Therefore, the automated definition of family-groups and the characterization of statistical distributions of violations have been enabled. In this respect, family members have been identified as the persons that persistently remain closer than a specific threshold distance for an adequately long time. In turn, this helps define SD violations as the distance violations of individuals who only inconsistently (i.e., occasionally) produce events infringing the minimal distance rules.
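The distance-time thresholding idea can be illustrated with a minimal sketch that separates persistent close contacts (treated as family-groups) from occasional ones (treated as SD violations). It assumes frame-aligned ground-plane trajectories in meters; the threshold values, the frame rate and the function names are hypothetical and are not the parameters used in Pouw et al. (2020).

```python
from itertools import combinations
from math import dist


def close_contact_seconds(track_a, track_b, threshold_m, fps):
    """Total time during which two frame-aligned trajectories are closer than threshold_m."""
    close_frames = sum(1 for a, b in zip(track_a, track_b) if dist(a, b) < threshold_m)
    return close_frames / fps


def split_pairs(tracks, threshold_m=1.5, min_group_seconds=60.0, fps=10.0):
    """Label each pair as a family-group (persistent contact) or a violation (occasional contact)."""
    family, violations = [], []
    for (id_a, track_a), (id_b, track_b) in combinations(sorted(tracks.items()), 2):
        seconds = close_contact_seconds(track_a, track_b, threshold_m, fps)
        if seconds >= min_group_seconds:
            family.append((id_a, id_b))
        elif seconds > 0:
            violations.append((id_a, id_b))
    return family, violations


# Illustrative trajectories: A and B stay together, C only passes by once.
tracks = {
    "A": [(0.0, 0.0)] * 4,
    "B": [(0.5, 0.0)] * 4,
    "C": [(10.0, 0.0), (10.0, 0.0), (1.0, 0.0), (10.0, 0.0)],
}
family, violations = split_pairs(tracks, threshold_m=1.5, min_group_seconds=3.0, fps=1.0)
```

Only the accumulated contact time is used here; a fuller treatment would also consider the statistical observables mentioned above.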
On the other hand, if children and parents walk side-by-side, the rule on SD must be ignored even if the physical distance between them is less than the SD norm. To account for this particular scenario, an exclusion approach for child/parent pedestrians, defined based on pedestrians' height in China, is proposed in Niu et al. (2021). The authors adopt the standard of free admission for children in most public places (e.g., malls, amusement parks, cinemas, tourist attractions, etc.) and select a reference height of 1.2 m for children. Meanwhile, based on the pertinent statistical data in Visscher (2008), the adult height reference has been selected as the average height of 1.715 m. In this context, pedestrians walking side-by-side with a height difference of more than 51.5 cm (i.e., 1.715 m minus 1.2 m) could be regarded as family members, and hence, the SD violations would be bypassed.
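The height-based exclusion can be expressed as a simple rule, sketched below with the reference heights quoted above. The function names are hypothetical, the side-by-side check is omitted for brevity, and the two-meter SD norm is only a default assumption.

```python
CHILD_HEIGHT_M = 1.2    # reference child height quoted in the text
ADULT_HEIGHT_M = 1.715  # reference adult height quoted in the text
MIN_HEIGHT_DIFFERENCE_M = 0.515  # 1.715 m - 1.2 m


def is_child_parent_pair(height_a_m, height_b_m):
    """True when the estimated height difference exceeds the 51.5 cm threshold."""
    return abs(height_a_m - height_b_m) > MIN_HEIGHT_DIFFERENCE_M


def is_sd_violation(distance_m, height_a_m, height_b_m, sd_norm_m=2.0):
    """Flag a violation unless the pair is treated as child/parent (side-by-side check omitted)."""
    if is_child_parent_pair(height_a_m, height_b_m):
        return False
    return distance_m < sd_norm_m
```

For example, two pedestrians estimated at 1.80 m and 1.10 m standing 1 m apart would not be flagged, whereas two pedestrians of 1.75 m and 1.70 m at the same distance would be.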

Future directions
We highlight in this section the future research perspectives. Although YOLO-based methods have been shown to reach excellent performance in terms of accuracy and reliability, especially in simple scenes, there are still some performance issues with complex scenes, in addition to other problems that are mainly related to privacy preservation, lack of annotated datasets, camera calibration, etc. We present in what follows the future directions that can overcome these issues:

VSDM on the edge
CNN-based VSDM systems provide excellent accuracy for monitoring SD and can be deployed in different kinds of public spaces, such as shopping areas, airports, parks, industrial areas, etc., to slow the spread of the virus. However, these application scenarios present serious challenges to the underlying computing platforms. Specifically, small, low-cost, and energy-efficient computing boards must be used to promote their implementation and enable the use of mobile surveillance while maintaining sufficient computing power and memory to run robust CNN algorithms at a low latency. Moreover, preserving the privacy of pedestrians detected in VSDM systems requires processing data on edge devices, without transmitting it to cloud data centers (Fasfous et al., 2021). In this regard, exploring light-weight CNN algorithms and deploying them on edge and mobile devices can be a good option. This helps avoid privacy concerns, as person-specific data is processed on the edge/mobile device closer to the monitoring entity. This also aids in implementing VSDM systems on drones to benefit from their flexibility. Although light-weight CNN models have a fast inference, their main challenge is their low pedestrian detection accuracy, especially in dense crowds (Quiñonez & Torres, 2022;Restás, 2022).
As presented in Section 5.2.3, various lightweight CNN-based models that are appropriate for mobile platforms have been introduced, such as ShuffleNet, SqueezeNet, and MobileNet (Kong et al., 2021). However, these models depend considerably on depthwise separable convolution and lack effective implementations in some DL frameworks. To that end, Shao et al. (2021) use PeleeNet, a light-weight CNN model, to perform real-time VSDM on images recorded from drones. The implementation of PeleeNet is completed using conventional convolution, and features are extracted with fewer parameters. Similarly, a VSDM solution is developed by eInfochips (AI Vision Based Social Distancing Detection, 2022), which is powered by NVIDIA Jetson AGX Xavier (Jetson agx xavier developer kit, 2022). Typically, after decoding and preprocessing the recorded video, pre-trained DL algorithms detect pedestrians. Moving forward, insights about individual density/SD are extracted in real-time. Extracted information is locally saved on edge devices and then moved to a cloud platform, which is only accessible to security managers or concerned authorities for (i) reviewing SD, and (ii) taking appropriate actions in case of any violations. Besides, in Ramadass et al. (2020), a YOLOv3-based VSDM is embedded in a drone's camera, which runs the YOLOv3 algorithm and detects whether the SD is respected and whether people are wearing masks.

Federated learning (FL)
VSDM, as a computer vision-based DL technology, requires saving video data on cloud platforms for centralized training (especially for pedestrian detection and tracking). However, this may not be the best methodology because of the high cost of transmitting video data and privacy concerns. Indeed, as presented in Table 1, many frameworks included in this review have failed to address the privacy concerns. FL has recently been introduced to decouple the requirement of powerful DL from the need to store large-scale datasets in the cloud. Specifically, FL is a distributed ML technique that relies on the storage and computing capacity of the devices themselves (e.g., cameras) to co-build DL models without transferring data to the cloud, and hence without adversely affecting the privacy of individuals (Zhu, Yin, Xiong, Tang, & Yin, 2021).
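As an illustration of the FL principle, the following minimal sketch applies federated averaging to a toy one-dimensional linear model: each device trains on its own data, and only the resulting model weights are averaged by the server, weighted by local dataset size. The model, learning rate, and data are assumptions chosen for brevity; a real VSDM deployment would train a pedestrian detector instead, and none of the reviewed works is claimed to use this exact procedure.

```python
def local_update(weights, samples, lr=0.1, epochs=20):
    """Run SGD for y = w * x + b on data that never leaves the device."""
    w, b = weights
    for _ in range(epochs):
        for x, y in samples:
            error = (w * x + b) - y
            w -= lr * error * x
            b -= lr * error
    return (w, b)


def federated_average(client_weights, client_sizes):
    """Average client weights, weighting each client by its number of samples."""
    total = sum(client_sizes)
    dims = len(client_weights[0])
    return tuple(
        sum(weights[d] * size for weights, size in zip(client_weights, client_sizes)) / total
        for d in range(dims)
    )


def federated_round(global_weights, client_datasets):
    """One communication round: local training on each device, then server-side averaging."""
    updates = [local_update(global_weights, data) for data in client_datasets]
    sizes = [len(data) for data in client_datasets]
    return federated_average(updates, sizes)


# Two hypothetical devices holding private samples of y = 2x + 1.
clients = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0), (0.5, 2.0)]]
weights = (0.0, 0.0)
for _ in range(50):
    weights = federated_round(weights, clients)
```

Only the weight tuples are exchanged in this sketch, which mirrors the privacy argument above: the raw samples stay on the devices.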

Deep transfer learning (DTL) for better generalization
Developing DTL- and deep domain adaptation (DDA)-based VSDM schemes will help increase the generalization of these algorithms on datasets with entirely different characteristics. Typically, YOLOv3, YOLOv4, and Faster-RCNN have successfully been applied to different simple datasets; however, using them on other complex datasets is still challenging. Also, processing different datasets with distinct image resolutions is still challenging, as some DL models only process image inputs of a fixed size. Thus, resizing these images is required, although this generally results in information loss and object distortion, which can be a possible restriction. Accordingly, applying DDA or DTL for processing numerous image resolutions is considered a promising research direction to automate the VSDM task.
Up to now, all existing methods have used DTL for the pedestrian detection task. However, for better SD monitoring, it is worth applying DTL and DA to predict the physical distances among pedestrians. This is doable by first developing automatic and labeled SD monitoring datasets based on a rendering engine simulation (Di Benedetto et al., 2022).

Real-time VSDM
Implementing a real-time VSDM system requires optimizing two primary parameters, i.e., the accuracy of pedestrian detection and computation cost. Typically, the first parameter is usually represented by the mAP while the second one refers to computation time or the number of processed frames per second (fps). In this respect, the performance of the VSDM system will be increased if the computation cost is low and the mAP is high. Fig. 15 portrays a scatter plot of the mAP vs. the GPU computation times for different CNN-based object detectors (meta-architectures) and CNN-based feature extractors (Huang et al., 2017).
To enable the real-time operation of VSDM systems, all the stages involved in the SD monitoring should be run in real-time, including pedestrian detection and tracking, interpersonal distance estimation, detection of violations, and alert generation. For pedestrian detection and tracking, a significant number of studies have targeted real-time implementation, since this topic has attracted substantial research in the last decade. For interpersonal distance estimation, most of the studies addressing this issue have been proposed following the COVID-19 pandemic. Most of them have deployed the Euclidean distance to measure the distance between the centroids of detected pedestrians' BBs, such as Ahmed, Ahmad, and Jeon (2021), Gonzalez-Trejo, Mercado-Ravell, and Jaramillo-Avila (2022), Lisi, Scattolin, Fusaro, and Aglioti (2021), Meivel et al. (2022), Shin and Moon (2021). Naturally, the complexity increases with the number of detected pedestrians. Nevertheless, many frameworks have already claimed to be able to measure the physical distances between all the pedestrian pairs in real-time, especially with moderately dense crowds, such as Pouw et al. (2020), Sahraoui et al. (2020), Saponara et al. (2021), Shao et al. (2021), Teboulbi et al. (2021). To that end, as explained in Yang, Sun, et al. (2021), a simple algorithm that calculates the Euclidean distance matrix for all detected pedestrians can easily be implemented to detect potential SD violation pairs in each video scene. Moreover, as another example, in Fitwi, Chen, Sun, and Harrod (2021), an interpersonal distance measurement algorithm based on triangle similarity is introduced to monitor the SD of crowds in real-time. This work relies on edge CCTV cameras, which capture crowds on video frames and leverage a YOLOv3 model to detect pedestrians.
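The Euclidean distance matrix approach mentioned above can be sketched in a few lines. The example assumes that BB centroids have already been mapped to ground-plane coordinates in meters; the function names and the two-meter default norm are illustrative assumptions rather than the implementation of any cited work.

```python
from itertools import combinations
from math import dist


def bbox_centroid(box):
    """Centroid of a bounding box given as (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def distance_matrix(points):
    """Symmetric matrix of Euclidean distances between all detected pedestrians."""
    return [[dist(p, q) for q in points] for p in points]


def violation_pairs(points, sd_norm_m=2.0):
    """Index pairs (i, j), i < j, whose mutual distance is below the SD norm."""
    return [
        (i, j)
        for i, j in combinations(range(len(points)), 2)
        if dist(points[i], points[j]) < sd_norm_m
    ]


# Three pedestrians on the ground plane (meters): only the first two are too close.
positions = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
pairs = violation_pairs(positions)
```

Since every pair is visited once, the cost grows quadratically with the number of detected pedestrians, which is consistent with the complexity remark above.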
Additionally, real-time running of VSDM systems can be enabled by implementing them on different types of graphics processing units (GPUs) or powerful central processing units (CPUs). For instance, the solution developed in Rezaei and Azarmi (2020) performs real-time monitoring using either a 10th generation multi-core/multi-thread CPU platform (or higher) or a basic GPU platform. Moving forward, in Fitwi et al. (2021), a powerful Predator Triton 700-A laptop equipped with a GPU card has been used to process more than 20 FPS. Moving on, the YOLOv4-based VSDM system presented in Rahim et al. (2021) has been implemented on a Tesla T4 GPU with 16 GB memory. Additionally, it is worth noting that using multiple GPUs can overcome the computational complexity issue that occurs due to (i) the increasing crowd density or (ii) the use of complex pedestrian detectors with large batch sizes.

Conclusion
This paper presented, to the best of the authors' knowledge, the first comprehensive review of recent advances in the field of VSDM. In doing so, we first introduced the background of the VSDM problem after describing the survey methodology and explaining the article selection approach. Thereafter, evaluation metrics used in the overviewed articles are briefly presented. Next, the surveillance methodology employed to perform the VSDM, including fixed and drone-based, is explained.
Moving forward, existing VSDM contributions were discussed after categorizing them into two groups: techniques based on hand-crafted features and CNN-based methods. CNN-based methods have been classified into two categories with reference to the number of processing stages: single-stage and two-stage schemes. Also, these approaches have been classified into two main categories corresponding to the complexity of the CNN models used in each framework: complex and lightweight models. Additionally, the results of representative techniques were summarized according to the original literature, and their pros and cons were identified. Overall, YOLOv3-based methods were the mainstream and promising techniques, as up to 99.8% accuracy has been reached. However, the performance of existing methods is relative since they have been validated on different datasets. Thus, it is still challenging to conduct a fair comparison. Moreover, most existing VSDM frameworks have been tested in low- or medium-density scenes. By contrast, different areas in smart cities suffer from dense and crowded gatherings, particularly at peak periods, which makes monitoring SD between pedestrians a challenge. While VSDM techniques can smoothly detect and track pedestrians and calculate physical distances in low or medium crowded scenes, they have some difficulties performing well in highly dense areas. Concretely, some pedestrians can be hidden behind one another and become invisible in crowded gatherings, even to human observers.
All in all, despite the intense attention paid by the research community to developing VSDM solutions in the hope of combating the COVID-19 pandemic, the critical analysis conducted has identified various open challenges, such as pedestrian overlapping, camera calibration, lack of annotated datasets, and security and privacy concerns. Therefore, future directions that help overcome these issues and are expected to attract considerable research and development in the near future have been highlighted, including moving the VSDM algorithms to edge and mobile devices, using federated learning to promote privacy preservation, and adopting DTL for a better generalization of existing algorithms.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability
Data will be made available on request.