Automated Traffic Surveillance Using Existing Cameras on Transit Buses

Millions of commuters face congestion as part of their daily routines. Mitigating traffic congestion requires effective transportation planning, design, and management. Accurate traffic data are needed for informed decision making. As such, operating agencies deploy fixed-location and often temporary detectors on public roads to count passing vehicles. This traffic flow measurement is key to estimating demand throughout the network. However, fixed-location detectors are spatially sparse and do not cover the entirety of the road network, and temporary detectors are temporally sparse, often providing only a few days of measurements every few years. Against this backdrop, previous studies proposed that public transit bus fleets could be used as surveillance agents if additional sensors were installed, and the viability and accuracy of this methodology were established by manually processing video imagery recorded by cameras mounted on transit buses. In this paper, we propose to operationalize this traffic surveillance methodology for practical applications, leveraging the perception and localization sensors already deployed on these vehicles. We present an automatic, vision-based vehicle counting method applied to the video imagery recorded by cameras mounted on transit buses. First, a state-of-the-art 2D deep learning model detects objects frame by frame. Then, detected objects are tracked with the commonly used SORT method. The proposed counting logic converts the tracking results into vehicle counts and real-world bird's-eye-view trajectories. Using multiple hours of real-world video imagery obtained from in-service transit buses, we demonstrate that the proposed system can detect and track vehicles, distinguish parked vehicles from traffic participants, and count vehicles bidirectionally. Through an exhaustive ablation study and analysis under various weather conditions, it is shown that the proposed method achieves high-accuracy vehicle counts.


Introduction
Ever-increasing demand and congestion on existing road networks necessitate effective transportation planning, design, and management. This informed decision making requires accurate traffic data. Most medium- to long-term traffic planning and modeling is focused on traffic demand, as quantified by traffic counts or flow. Traffic flow is generally measured by fixed-location counting, which can be accomplished using permanent detectors or, in many cases, temporary detectors or human observers. Many traditional traffic studies use temporary pneumatic tube or inductive loop technologies that are deployed for only a few days at each location, with each location revisited only every three to five years.

The contributions of this work are as follows:

• An automatic vision-based vehicle counting and trajectory extraction method using video imagery recorded by cameras mounted on transit buses.
• A real-world demonstration of the automated system and its validation using video from in-service buses.
• An extensive ablation study to determine the best components of the pipeline, including comparisons of state-of-the-art CNN object detectors, Kalman filtering, deep-learning-based trackers, and region-of-interest (ROI) strategies.

The main challenges in this endeavor are estimating vehicle trajectories from a limited observation window, detecting and tracking vehicles robustly while moving, localizing the sensor platform, and differentiating parked vehicles from traffic participants.

Figure 2 shows an overview of the processing pipeline. First, images obtained from a monocular camera stream are processed with the automatic vehicle counter developed in this study. A Yolo v4 [10] deep CNN trained on the MS COCO dataset [19] detects vehicles, while SORT [20], a Kalman filter and Hungarian algorithm based tracker that solves an optimal assignment problem, generates a unique tracking ID for each detected vehicle. Next, using subsequent frames, the trajectory of each detected vehicle is projected onto the ground plane with a homographic perspective transformation. Then, each trajectory inside a predefined, geo-referenced region-of-interest (ROI) is counted with a direction indicator. The automatic vehicle counts are compared against human-annotated ground-truth counts for validation. In addition, an exhaustive ablation study is conducted to determine the best selection of 2D detectors, trackers, and ROI strategies.

The objective is twofold: obtaining total bidirectional vehicle counts and extracting trajectories in bird's-eye-view (BEV) coordinates. Our method uses streams of images, GNSS measurements, and predefined ROI information on a 2D map as inputs. The detection and tracking branch does not require geo-referencing, but geo-referencing is necessary for counting and for projecting trajectories into BEV world coordinates. The homography calibration needs to be performed only once and requires only four point correspondences.

Related Work

Traffic Surveillance with Stationary Cameras
Automating vehicle detection and counting is crucial for increasing the efficiency of image-based surveillance technologies. A popular approach has been to decompose the problem into subproblems: detecting, tracking, and counting [2].
Detection: Conventional methods employ background-foreground difference to detect vehicles. Once a background frame, such as an empty road stretch, is recorded, the background can be subtracted from the query frames to detect vehicles on the road [3].
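The background-difference idea in [3] can be illustrated with a minimal per-pixel threshold on grayscale frames (the function name and threshold value here are illustrative, not taken from the cited work):

```python
def foreground_mask(background, frame, threshold=30):
    """Mark pixels that differ from the empty-road background frame by more
    than a threshold as foreground (candidate vehicle pixels)."""
    return [[abs(f - b) > threshold for f, b in zip(f_row, b_row)]
            for f_row, b_row in zip(frame, background)]
```

Practical systems additionally apply morphological filtering and adaptive background models, which is precisely the maintenance burden that deep detectors remove.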
More recently, state-of-the-art deep-learning-based 2D object detectors such as Mask-RCNN [8], Yolo v3 [9], and Yolo v4 [10] have achieved remarkable scores on challenging object detection benchmarks [19,21]. Deep CNNs completely remove the background frame requirement. As such, pretrained, off-the-shelf deep CNNs can be deployed directly for vehicle detection and, if needed, vehicle classification tasks [6,7]. However, most vehicle detectors work on a single image frame, hence the need for tracking.
Tracking: Tracking is a critical step for associating detected objects across multiple frames and assigning a unique tracking ID to each vehicle. This step is essential for counting, as the same vehicle across multiple image frames should not be counted more than once. Once the vehicle is detected, a common approach is to track the bounding box with a Kalman filter [22].
SORT [20] is a more recent Kalman filter and Hungarian algorithm based tracker. Deep SORT [23] is an extension of SORT with a deep association metric, using a CNN-based feature extractor to associate objects with similar features across frames. Deep-SORT-based vehicle detectors [24] can achieve state-of-the-art performance.
Counting: The final component of the processing pipeline is counting. Usually, lane semantics with ROI reasoning [4] are used to define a counting area. This area is then used to decide which vehicles to count as traffic participants.

Traffic Surveillance with UAVs
Automated vehicle detection has progressed significantly over the years. However, stationary cameras cannot cover the entirety of the road network. The shortcomings of fixed-location, stationary cameras can be circumvented by using a UAV as a sensor platform [12][13][14][15]. Operating regulations have somewhat relaxed, and the cost of UAVs has fallen significantly over the past decade, allowing them to become a more cost-efficient solution for airborne surveillance.
UAV-based surveillance systems suffer from lower resolutions and smaller observed vehicle sizes. In addition, since the viewing angle is top-down, vehicles' distinguishing features cannot be observed. As such, off-the-shelf object detectors cannot be deployed directly. The deep learning models must be trained with a top-down object detection dataset [30].
UAV-based traffic surveillance is promising. However, adverse weather affects flight dynamics negatively [15], and safety concerns are a limiting factor for the usability of airborne drones.

Traffic Surveillance with Probe Ground Vehicles
The general concept of using moving vehicles to determine traffic variables is not new. It is common practice to use the floating car method to measure travel times and delays for arterial traffic studies [31,32] and to use probe vehicles to measure travel times in real time [33][34][35][36][37][38][39][40]. However, these approaches focus only on measuring the probe vehicle's travel time.
This study focuses on the estimation of traffic flow rather than travel times. Traffic flow estimation requires counting other vehicles. Although the theoretical foundations of such a framework were laid decades ago [41], an operational fully automated system has not yet been realized and deployed in a practical large-scale setting. While there are recent efforts to extend cellphone tracking to estimate traffic volumes [42,43], these approaches are limited by the fact that there is not a one-to-one match between cellphones and vehicles.

Traffic Flow Using Moving Observers
The main idea behind the moving observer method is to estimate hourly traffic volume on a finite road section while the observer is traversing it. Recent studies have made progress in this regard. Specifically, one study used an instrumented vehicle equipped with lidar sensors to emulate a transit bus [16,17]. Lidar sensors and data were used because they are readily processed, with limited computing power, to automatically detect vehicles. The detected vehicles were then converted to counts, which in turn were transformed into estimates of traffic flows. However, modern transit buses are already equipped with video cameras, installed for purposes other than traffic flow estimation, rather than with lidar sensors. Therefore, a follow-on study [18] used counts manually extracted with a GUI from video imagery recorded by cameras mounted on transit buses to establish the viability and validity of estimating accurate traffic flows from such imagery.
Another recent study focused on applying moving observer techniques to estimate traffic flow using automated vehicles and a data-driven approach [44]. However, a practical large-scale implementation of such a moving observer surveillance system is not possible without a method that can entirely automatically extract vehicle trajectories and count vehicles from a limited observation window.
We aim to fill this gap in the literature with the proposed automatic vehicle trajectory extraction and counting methodology. In principle, the methodology developed herein for counting vehicles from existing cameras on transit buses could be transferred to existing cameras on public service, maintenance, or safety vehicles and even private automobiles equipped with cameras already being used for driver assist, adaptive cruise control, or vehicle automation, e.g., [44].

Method
As mentioned previously, our work focuses solely on automatically obtaining vehicle counts and extracting trajectories using computer vision. What follows are detailed descriptions of the problem and the various steps of the developed methodology.

Problem Formulation
Given an observation at time t as O_t = (I_t, p^bus_t), where I ∈ Z^(H×W×3) is an image captured by a monocular camera mounted on the bus and p^bus ∈ R^2 is the position of the bus on a top-down 2D map, the first objective is to find f_1 : S → (n_1, n_2), a function that maps a sequence of observations S = (O_1, ..., O_N) to the total number of vehicles counted traveling in each direction (n_1, n_2). The second objective is to find f_2 : S → {T_i}_m, which provides the m tuples of detected vehicles' trajectories. The trajectory of detected vehicle i is given by T_i = (p_{i,1}, ..., p_{i,J}), where J is the total number of frames in which vehicle i was observed by the bus and p_{i,j} is the bird's-eye-view real-world position vector of vehicle i in frame j.
The overview of the solution is shown in Figure 2. The proposed algorithm, Detect-Track-Project-Count (DTPC), is given in Algorithm 1. The implementation details of the system are presented in Section 4.

Two-Dimensional Detection
Each detection is a tuple D = (b, l, c), where b = (u_1, v_1, u_2, v_2) contains the bounding box corners in pixel coordinates, l is the class label (e.g., car, bus, or truck), and c is the confidence of the detection. A 2D CNN f_CNN : I → {D}_q maps each image I captured by the bus camera to q tuples of detections D. This processing is performed on individual image frames. The object classification information is used to avoid counting pedestrians and bicycles but could also be used in other studies.
The networks commonly known as Mask-RCNN [8], Yolo v3 [9], and Yolo v4 [10], pretrained on the MS COCO dataset [19], are employed for object detection. Based on the ablation study presented subsequently, the best-performing detector was identified as Yolo v4.
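Since detectors pretrained on MS COCO also report pedestrians, bicycles, and many other classes, their raw output must be filtered before tracking. A minimal sketch of this post-processing step, assuming detections are tuples (u1, v1, u2, v2, label, confidence) as in the formulation above (the confidence threshold is illustrative):

```python
# Only these classes are treated as countable vehicles.
VEHICLE_CLASSES = {"car", "bus", "truck"}

def filter_vehicle_detections(detections, conf_threshold=0.5):
    """Keep detections whose class is a vehicle type and whose confidence
    exceeds the threshold; pedestrians, bicycles, etc. are discarded."""
    return [d for d in detections
            if d[4] in VEHICLE_CLASSES and d[5] >= conf_threshold]
```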

Tracking
The goal of tracking is to associate independent frame-by-frame detection results across time. This step is essential for trajectory extraction, which is needed in order to avoid counting the same vehicle more than once.
SORT [20], a fast and reliable algorithm that uses Kalman filtering and the Hungarian algorithm to solve the assignment problem, is employed for the tracking task. First, define the state vector of a tracked bounding box as

x = (u, v, s, r, u̇, v̇, ṡ)^T,

where (u, v) is the center of the bounding box in pixel coordinates, s is its scale (area), r is its aspect ratio, and u̇, v̇, and ṡ are the corresponding first derivatives with respect to time. The next step consists of associating each detection with already existing target boxes with unique tracking IDs. This subproblem can be formulated as an optimal assignment problem in which the matching cost is the Intersection-over-Union (IoU) value between a detection box d and a target box t, i.e.,

IoU(d, t) = |d ∩ t| / |d ∪ t|.

This problem can be solved with the Hungarian algorithm [45]. After each detection is assigned to a target, the target state is updated with a Kalman filter [46]. The Kalman filter assumes the following dynamical model:

x_k = F_k x_{k−1} + B_k u_k + w_k,

where F_k is the state transition matrix, B_k is the control input matrix, u_k is the control vector, and w_k is normally distributed process noise. The Kalman filter recursively predicts the current state from the previous state estimate as follows:

x̂_{k|k−1} = F_k x̂_{k−1|k−1} + B_k u_k,
P_{k|k−1} = F_k P_{k−1|k−1} F_k^T + Q_k,

where P is the predicted covariance matrix and Q is the covariance of the multivariate normal noise distribution. Details of the update step can be found in the original Kalman filter paper [46]. Deep SORT [23] was also considered as an alternative to SORT [20]; however, SORT gave better results. An example of a detected, tracked, and counted vehicle is shown in Figure 3.
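The IoU-based association step can be sketched as follows. For brevity, this sketch uses greedy IoU matching in place of the optimal Hungarian assignment that SORT actually solves (e.g., via scipy.optimize.linear_sum_assignment); the threshold and names are illustrative:

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def associate(detections, targets, iou_min=0.3):
    """Greedy IoU matching of detections to existing targets.
    detections: list of boxes; targets: {tracking_id: box}.
    Returns {detection_index: tracking_id}; unmatched detections spawn
    new tracks in the full algorithm."""
    pairs = sorted(((iou(d, t), i, tid)
                    for i, d in enumerate(detections)
                    for tid, t in targets.items()), reverse=True)
    matches, used_d, used_t = {}, set(), set()
    for score, i, tid in pairs:
        if score < iou_min:
            break  # remaining pairs overlap too little to match
        if i not in used_d and tid not in used_t:
            matches[i] = tid
            used_d.add(i)
            used_t.add(tid)
    return matches
```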

Geo-Referencing and Homography
The proposed method utilizes geo-referencing and homographic calibration to count vehicles in the desired ROI and transforms the detected vehicles' trajectories from pixel coordinates to real-world coordinates.
GNSS measurement data are used to localize the bus with a pose vector p^bus_t at time t. A predefined database contains geo-referenced map information about the road network and divides each road into road segments with a predefined geometry, including the lane widths for each segment. Road segments tend to run from one intersection to another but can also be divided at points where the road topology changes significantly, for example, when a lane divides. This information is used to build the ROI in BEV real-world coordinates for each road segment. The pixel coordinates of each detection can be transformed into real-world coordinates with planar homography using

λ (x, y, 1)^T = H (u, v, 1)^T,

since planar homography is defined up to a scaling factor λ. H, the homography matrix, has eight degrees of freedom. Hence, H can be determined from four real-world-to-image point correspondences [47]. Finding four point correspondences is fairly straightforward for urban road scenes. Standard road markings and the width of a lane are used to estimate H. The homographic calibration process is shown in Figure 4.
Once H is obtained, the inverse projection can be easily achieved with the inverse homography matrix H −1 . Inverse homography is used to convert a BEV real-world ROI to a perspective ROI for the image plane. After obtaining the perspective ROI, vehicles on the corresponding regions can be counted.
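A minimal, dependency-free sketch of this calibration: solving the 8-unknown direct linear transform (DLT) system from four pixel-to-world correspondences. The coordinates below are illustrative (a 3.6 m lane observed over a 30 m stretch); in practice, OpenCV's cv2.getPerspectiveTransform performs the same estimation:

```python
def solve_linear(A, b):
    """Gauss-Jordan elimination with partial pivoting (for the 8x8 DLT system)."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col and M[r][col] != 0.0:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def homography_from_4pts(pixel_pts, world_pts):
    """Estimate the 3x3 homography H mapping pixel -> BEV world coordinates
    from exactly four correspondences, fixing h33 = 1 (8 degrees of freedom)."""
    A, b = [], []
    for (u, v), (x, y) in zip(pixel_pts, world_pts):
        A.append([u, v, 1, 0, 0, 0, -u * x, -v * x]); b.append(x)
        A.append([0, 0, 0, u, v, 1, -u * y, -v * y]); b.append(y)
    h = solve_linear(A, b) + [1.0]
    return [h[0:3], h[3:6], h[6:9]]

def project(H, u, v):
    """Apply homography H to pixel (u, v), normalizing the homogeneous scale."""
    x = H[0][0] * u + H[0][1] * v + H[0][2]
    y = H[1][0] * u + H[1][1] * v + H[1][2]
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    return x / w, y / w
```

Swapping the two point sets in the call yields the inverse mapping used to back-project the BEV ROI onto the image plane.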

Counting
The objective of counting is to count each unique tracked vehicle once if it is a traffic participant and not a parked vehicle, travels in the corresponding travel direction, and is within the ROI.
The 2D object detector's output is considered only if the inferred class l ∈ {car, bus, truck} and the bounding box center is within the ROI. Thus, at each counting time t, using u_{c,i} and v_{c,i}, the bounding box center coordinates of each tracked vehicle i, the count in the top-to-bottom direction (i.e., opposite the direction of travel of the bus) is updated incrementally, n_1 ← n_1 + 1, if the one-step sequence (u_{c,i,t−1}, v_{c,i,t−1}), (u_{c,i,t}, v_{c,i,t}) of the image-domain trajectory of vehicle i satisfies sgn(u_{c,i,t} − u_{c,i,t−1}) = 1 and vehicle i was not counted before. In a similar fashion, the count in the bottom-to-top direction (i.e., in the direction of travel of the bus) is updated incrementally, n_2 ← n_2 + 1, if sgn(u_{c,i,t} − u_{c,i,t−1}) = −1 and vehicle i was not counted before. The ROI alignment for counting purposes is shown in steps 3 and 4 of Figure 4. Distinguishing parked vehicles from traffic flow participants is achieved through the ROI. This distinction, and the difference between detecting and counting vehicles, is illustrated in Figure 5.
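The incremental counting rule can be sketched as follows. This is a simplified sketch: the names are illustrative, and the ROI predicate stands in for the geo-referenced perspective ROI described above:

```python
def update_counts(tracks, in_roi, counted, n1=0, n2=0):
    """One counting step. tracks maps each tracking ID to the previous and
    current bounding-box centers ((u_prev, v_prev), (u_cur, v_cur)); in_roi
    is a predicate on (u, v); counted holds IDs counted once already."""
    for tid, ((u_prev, _), (u_cur, v_cur)) in tracks.items():
        if tid in counted or not in_roi(u_cur, v_cur):
            continue
        if u_cur > u_prev:    # sgn(+1): opposite the bus's direction of travel
            n1 += 1
            counted.add(tid)
        elif u_cur < u_prev:  # sgn(-1): same direction as the bus
            n2 += 1
            counted.add(tid)
    return n1, n2
```

Because counted IDs are remembered, repeated calls over subsequent frames never count the same tracked vehicle twice, and stationary (parked) vehicles, whose centers do not move, are never counted at all.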

Trajectory Extraction in BEV Real-World Coordinates
Finally, for each tracked unique vehicle, a trajectory in BEV real-world coordinates relative to the bus is built by transforming the image-domain coordinates using the inverse homography projection.
The image-domain trajectory of vehicle i, T_i = {(u_{c,i,t}, v_{c,i,t}) | t ∈ (t_1, t_2, ..., t_K)}, is transformed with the inverse homography, (x_{c,i,t}, y_{c,i,t}, 1)^T ∝ H^{−1} (u_{c,i,t}, v_{c,i,t}, 1)^T. An example of extracted trajectories is illustrated in Figure 6.
Trajectory extraction in BEV real-world coordinates is not strictly necessary for the purpose of counting unique vehicles, which could be achieved without such a transformation. However, it is quite valuable for understanding the continuous behavior of traffic participants, including their speed, and therefore could be used as input to other studies.

Transit Bus
The Ohio State University owns and operates the Campus Area Bus Service (CABS) system, consisting of a fleet of about fifty 40-foot transit buses that operate on and in the vicinity of the OSU campus, serving around five million passengers annually (pre-COVID-19). As is the case for many transit systems, the buses are equipped with an Automatic Vehicle Location (AVL) system that includes GNSS sensors for operational and real-time information purposes, as well as several interior- and exterior-facing monocular cameras that were installed for liability, safety, and security purposes. The proposed method depends on having an external view angle wide enough to capture the motion pattern of the surrounding vehicles. Figures 3-5 show sample image frames recorded by these cameras.
Video imagery collected from CABS buses while in service is used to implement and test the developed method. We chose the left side-view camera for this study because it is mounted higher than the front-view camera, which both reduces the potential for occlusions caused by other vehicles and improves the view of multiple traffic lanes to the side of the bus, particularly on wider multilane roads or divided roads with a median. Moreover, video from these cameras is captured at a lower resolution, which significantly reduces the size of the video files that need to be offloaded, transferred, and stored. The video footage was recorded at 10 frames per second with a resolution of 696 × 478 pixels.

Implementation Details
The proposed algorithm was implemented in Python using OpenCV, Tensorflow, and Pytorch libraries. For 2D detectors, the original implementations of Mask-RCNN [8], Yolo v3 [9], and Yolo v4 [10] were used. All models were pretrained on the MS COCO dataset [19]. Experiments were conducted with an NVIDIA RTX 2080 GPU.
High-speed real-time applications are not within the scope of this study. For surface road traffic surveillance purposes, low frame rates are sufficient. The video files used in this study were downloaded from the buses after they completed their trips. However, it would be possible to perform the required calculations using edge computing hardware installed on the bus or even integrated with the camera. This is a topic for future study.

Data Collection and Annotation
A total of 3.5 hours of video footage, collected during October 2019, was used for the experimental evaluation of the ablation study. An additional 3 hours of video footage, collected during March 2022, was used to evaluate the impacts of adverse weather. The West Campus CABS route on which the footage was recorded traverses four-lane and two-lane public roads. The route is shown in Figure 7. As described above, this route was divided into road segments, which tend to run from one intersection to another but may be divided at other points where the road topology changes significantly. This can include points at which the number of lanes changes or, in the case of this route, areas in which a tree-lined median strip obscures the camera view. Using this map database, sections with a tree-lined median strip and occlusions were excluded from the study. For each road segment, an ROI was defined with respect to road topology and lane semantics to specify the counting area. As a result, a total of 55 bus pass and roadway segment combinations were used in the ablation study evaluation analysis. We note that should the bus not follow its prescribed route, or should the route change, the algorithm can detect and report this using the map database, as well as remove or ignore those portions of the traveled road that are off-route or for which the road geometry information needed to form an ROI is not available.
The video footage was processed by human annotators in order to extract ground-truth vehicle counts. Annotators used a GUI to incrementally increase the total vehicle count for each oncoming unique vehicle and to capture the corresponding video frame number. Annotators counted vehicles once and only if the vehicle passed a virtual horizontal line drawn on the screen of the GUI while traveling in the direction opposite to that of the bus.
After annotation, the video frames, counts, and GNSS coordinates were synchronized for comparison with the results of the developed image-processing-based fully automated method.

Evaluation Metrics
First, for each of the 55 bus pass and segment combinations, the ground-truth counts and the counts inferred using the developed automatic counting method were compared with one another. In addition to examining a scatter plot that depicts this comparison, four metrics were considered, namely the difference c_gt − c_i, the absolute difference |c_gt − c_i|, the relative difference (c_gt − c_i)/c_gt, and the absolute relative difference |c_gt − c_i|/c_gt, where c_gt is the ground-truth count and c_i is the inferred count.
The sample mean, sample median, and empirical Cumulative Distribution Function (eCDF) were calculated for the proposed and alternative automated baseline methods.
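For concreteness, the four per-segment metrics can be computed as below (a sketch; it assumes nonzero ground-truth counts, and the function name is illustrative):

```python
def count_metrics(c_gt, c_i):
    """The four evaluation metrics over paired lists of ground-truth
    and inferred per-segment counts."""
    diff = [g - i for g, i in zip(c_gt, c_i)]
    return {
        "difference": diff,
        "absolute difference": [abs(d) for d in diff],
        "relative difference": [d / g for d, g in zip(diff, c_gt)],
        "absolute relative difference": [abs(d) / g for d, g in zip(diff, c_gt)],
    }
```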

Ablation Study
An extensive ablation study was conducted to identify the best alternative among multiple methods that could be applied to each phase of the counting system. All combinations of modules from the following list were compared with one another, using the sample mean and sample median of the evaluation metrics, to find the better performing combinations:

• 2D object detector: Mask-RCNN [8], Yolo v3 [9], or Yolo v4 [10];
• Tracker: SORT [20] or Deep SORT [23];
• ROI strategy: No ROI, Generic ROI, or Dynamic (homography-derived) ROI.

The Generic ROI was defined based on a standard US two-lane road in perspective view. Each combination is denoted by the sequence of uppercase first letters of the names of its modules. For example, Y4SGR stands for the Yolo v4 detector, SORT tracker, and Generic ROI combination. The proposed method is denoted by "Proposed" and consists of the Yolo v4 detector, SORT tracker, and Dynamic ROI combination (Y4SDR).

Ablation Study Results
Considering all combinations of tools, the proposed method, consisting of the Yolo v4 object detector, SORT tracker, and Dynamic Homography ROI, was found to be the best based on the sample mean, sample median, and eCDF of the four evaluation metrics. Figure 8 shows pairs of inferred counts plotted against ground-truth counts for each of the 55 road segment passes for select combinations of modules (including the best-performing one among all combinations) that reflect a wide range of performance. For a perfect automatic counting system, the scatter plot of inferred counts versus ground truth should be on the identity line (y = x). A regression line was estimated for each combination. The estimated lines are shown in Figure 8, along with their confidence limits. Clearly, the proposed Y4SDR outperforms the other four combinations shown in Figure 8 by a substantial margin.
As expected, the proposed homography-derived ROI performs better than the No ROI and Generic ROI modules. The latter two ROI modules lead to overcounts because of their limitations in distinguishing traffic participants from irrelevant vehicles. In contrast, the dynamic ROI allows for counting vehicles only in the pertinent parts of the road network, thus omitting irrelevant vehicles, which in this study consisted mostly of vehicles parked on the side of the road or in adjacent parking lots. This result validates the developed bird's-eye-view inverse projection approach to defining dynamic ROIs suitable to each roadway segment. Figure 9 shows the eCDF plots of the four evaluation metrics for the compared tracker combinations. In addition to confirming the overall superiority of the Y4SDR combination, these plots also indicate that this combination is highly reliable, as is evident from the thin tails of the distributions.
Summary results for all combinations of modules are shown in Table 1. Specifically, the sample mean and sample median of the absolute differences and absolute relative differences are given for each considered detector, tracker, and ROI combination. Across all combinations of detector and ROI options, SORT consistently outperformed Deep SORT. This result suggests that the deep association metric used by Deep SORT would require additional, domain-specific training to be effective. In addition, across all combinations of tracker and ROI options, the detectors Yolo v3 and Yolo v4 performed similarly, and both performed better than Mask-RCNN. This result suggests that two-stage detectors, such as Mask-RCNN, are more prone than single-stage detectors to overfitting the training data. Moreover, for all detector and tracker combinations, the proposed dynamic ROI option outperforms the other two ROI options by a large margin, as one might expect. Finally, from the results in Table 1, the Y4SDR combination is seen to perform the best among all combinations.

Impacts of Adverse Weather and Lighting
Having determined the best algorithm to implement from the results of the ablation study, one may also consider the performance of both the image processing and the overall counting system in the presence of inclement weather, including rain or post-rain conditions with puddles and irregular lighting effects. Inclement weather exposes several issues with an image-processing-based system: darker and more varied lighting conditions; reflective puddles that can be misidentified as vehicles (Figure 10a); roadway markings misidentified as vehicles (Figure 10b); water droplets and smudges, formed from dust or dirt and water, partially obscuring portions of the image (Figure 10c); and, finally, blowing heavy rain, or possibly snow, with falling drops visible in the images (Figure 10d). We compared human-extracted ground truth with the automated extraction and counting results from videos of several loops of the West Campus route, both on dry, cloudless, sunny days and on days with rainfall varying from a light drizzle to heavy thunderstorms, which also included post-rain periods with breaking clouds that provided both darker and occasionally sunlit conditions. All the videos were acquired from actual in-service transit vehicles during the middle of the day.
Qualitatively, rainfall and wet conditions caused more transient inaccurate detection events, lasting only one to two frames, to occur in the base image processing portion of the algorithm, including vehicles not detected, misclassified vehicles and other objects, and double detections in which a vehicle is split into two overlapping objects, along with a greater number of background elements (e.g., puddles and road markings) incorrectly identified as vehicles or objects. However, these transient events generally make little difference in the final vehicle counts, as the overall algorithm employs both filtering to eliminate unreasonable image processing results and tracking within the region of interest, such that a vehicle is declared present and counted only after a fairly complete track is established from the point it enters to the point it departs the camera's field of view. This approach, in general terms, imposes continuity requirements, causing transient random events to be discarded unless they are so severe as to make it impossible to match and track a vehicle as it passes through the images, thereby improving the robustness of the approach.
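The continuity requirement described above can be sketched as a minimum-track-length filter (the threshold and names are illustrative and are not the exact values used in our implementation):

```python
def confirmed_vehicles(track_lengths, min_frames=5):
    """Declare a tracked object a countable vehicle only if it was matched
    in at least min_frames frames, so that one- or two-frame transient
    detections (e.g., puddle reflections) are discarded."""
    return {tid for tid, n in track_lengths.items() if n >= min_frames}
```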
More problematic, however, are persistent artifacts such as water droplets or smudges formed by wet dust and dirt, which often appear on the camera lens or enclosure cover after the rain stops and the lens begins to dry. For example, in one five-minute period of one loop, there was a large smudge on the left side at the vertical center of the lens, which obscured the region of the image where vehicles several lanes to the left of the transit bus tended to cross through and leave the image frame, causing them to not be counted due to incomplete tracking.
We note that these are general problems that can affect most video and image processing systems: if the lens is occluded, nothing in that region of the image can be seen. It would be possible in future work to implement a method to dynamically detect when the lens is occluded and temporarily exclude those regions from counts while the occlusion persists. As a final note, heavy rain was also observed to clean the lens.
Quantitatively, we present the results of the automated extraction and counting experiments in Table 2 for both the dry, clear, sunny loops and the rainy and post-rain loops. The table presents the percentage of correctly counted vehicles, the percentage of vehicles missed due to not being detected at all or detected too infrequently to build a sufficient track, the percentage of vehicles not detected due to being substantially occluded by another vehicle (this is not actually a weather-related event but is included for completeness), and the percentage of vehicles not counted due to a smudge or water droplet covering part of the camera lens. The final columns of Table 2 indicate the percentage of double-counted vehicles and the percentage of false detections or identifications that persisted long enough to be tracked and incorrectly counted. These are two impacts of transient image processing failures that are not always detected, at present, by our filtering and tracking algorithms.
As can be seen in Table 2, the overall effects of poor weather conditions result in only a minor increase in the errors committed by this system.

Conclusions
This paper introduced and evaluated a fully automatic vision-based method for counting and tracking vehicles captured in video imagery from cameras mounted on buses for the purpose of estimating traffic flows on roadway segments using a previously developed moving observer methodology. The proposed method was implemented and tested using imagery from in-service transit buses, and its feasibility and accuracy was shown through experimental validation. Ablation studies were conducted to identify the best selection of alternative modules for the automated method.
The proposed method can be directly integrated into existing and future ground-vehicle-based traffic surveillance approaches. Furthermore, since cameras are ubiquitous, the proposed method can be utilized for different applications.
Reimagining public transit buses as data collection platforms has great promise. With widespread deployment of the previously developed moving observer methodology, facilitated by the full automation of vehicle counting proposed in this paper, a new dimension can be added to intelligent traffic surveillance. Combined with more conventional methods, such as fixed-location detection, and the emerging possibilities of UAV-based surveillance, spatial and temporal coverage of roadway networks can be increased and made more comprehensive. This three-pronged approach has the potential of achieving close to full-coverage traffic surveillance in the future.
Future work could focus on further comprehensive evaluation of the method presented here under more varied conditions, subsequent refinements, and the use of edge computing technologies to perform the image processing and automatic counting onboard the buses in real time. Another potential extension would involve coordinated tracking of vehicles across multiple buses, although this raises certain social and political privacy issues that would need to be addressed. Finally, there could be significant uses and value in vehicle motion and classification information, potential extensions to include tracking and counting bicycles, motorcycles, and pedestrians, and the eventual integration into smart city infrastructure deployments.

Data Availability Statement:
The original video data used in this study as well as the manually extracted ground truth records are available in the Zenodo repository at https://zenodo.org/record/7955464.

Acknowledgments:
The authors are grateful to The Ohio State University's Transportation and Traffic Management and the Campus Area Bus Service, notably Sean Roberts for providing video footage and AVL data and former Director Beth Snoke and current Director Tom Holman for their administrative support. The authors also wish to acknowledge the efforts of Diego Galdino of the OSU Transit Lab in identifying the video clips used in this study.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript: