
1 Introduction

The special focus of the Connecting Austria project on road infrastructure and innovative C-ITS services for level 1 truck platooning, as outlined in [20], made it necessary to develop a video-based traffic flow estimation system for assessing the traffic efficiency and safety of platooning in urban areas. The scenario-based evaluation of use case 4, a truck platoon crossing an intersection, is carried out comprehensively for realistic traffic situations at the three-way intersection on the Perner Island in the city of Hallein, Austria (see Fig. 10.1). This intersection was selected because its mixed traffic frequently leads to conflicting situations between cars, trucks and vulnerable road users such as pedestrians and cyclists.

Fig. 10.1

© OpenStreetMap contributors under the CC-BY-SA license, https://www.openstreetmap.org/copyright

Three-way intersection in Hallein where the traffic monitoring system was installed. The red arrows show the direction of view for each camera. Base map and map data from OpenStreetMap

A dynamic management of traffic flows [16], including platoons, requires not only precise recognition of the current traffic situation but also representative long-term statistics of the real traffic in order to optimise for efficiency. This includes the meaningful aggregation of trajectories and the automatic identification of flow patterns at the intersection. In cooperation with the project partner SWARCO FUTURIT Verkehrssignalsysteme GmbH, we therefore developed a video-based traffic measurement system that is able to locate all traffic participants on the intersection and to aggregate this information to reveal the traffic flow patterns of different road users in great detail. We installed our system at the three-way intersection on the Perner Island in Hallein (see again Fig. 10.1) and tracked six different classes of road users for two weeks in order to get a clear understanding of the real traffic situations and flow patterns at the intersection.

Cameras are widely used for traffic monitoring at intersections and provide rich visual information about road users. A typical set-up consists of a single camera mounted high on a nearby building such that the full crossing is observable [6, 24]. This set-up has the advantage that occlusions of road users by other road users are minimised due to the elevated viewpoint, and the configuration process is simplified because no stitching and calibration of different views are necessary. However, the installation of such a system can become cumbersome, since mounting it on a nearby building involves the building owner in the installation process, or it is not possible at all if the crossing is in a rural area. We therefore follow a different approach and construct self-contained recognition units that can easily be attached to any traffic light system.

In addition, we were also looking for a more sustainable set-up in terms of energy consumption. There is growing concern that the strong increase in the energy consumption of the IT infrastructure, with the application of deep learning methods as one of its key drivers, is not sustainable at this pace [22]. A currently widespread set-up for street monitoring cameras is to stream the video to the cloud and to process the video stream there. However, transmitting high-resolution video data is energy intensive [5] and also requires an Internet connection with high bandwidth in order not to risk processing instabilities. In the Connecting Austria project, we took a different approach and considered the energy consumption of the whole system from the beginning of the design process. We designed our recognition units as edge computing [26] devices that are able to process the video stream in real time thanks to a dedicated low energy hardware accelerator for neural networks. Similar solutions have recently been proposed in [2, 17]. In contrast to these approaches, we go one step further and process and analyse the behaviour of the road users in terms of real-world coordinates, which allows us to map the extracted traffic patterns to a digital twin of the intersection.

In the following, we describe the developed traffic estimation system, which collects comprehensive information about the traffic situation in real time and estimates the traffic density and flows of cars and trucks with high precision.

2 Low Energy Internet of Things Traffic Monitoring System

Our project aim was the development of a low energy camera-based traffic monitoring system that is able to recognise six different types of road users in real time. The system should be easy to deploy on existing traffic light installations and should be able to send the precise location of the road users at the intersection to an operator or to a back-end solution that can use this information to initiate further actions such as warning a driver.

Figure 10.2 shows our measurement set-up at the intersection, where we installed one recognition unit for each arm of the three-way intersection. Every recognition unit consists of a camera that is connected to a processing unit with a dedicated hardware accelerator for artificial intelligence applications. To preserve the privacy of the road users, the processing of the camera stream is done locally on the attached AI processing unit. Therefore, potentially sensitive information never leaves the device, and only the object class and its position in the image are sent to our cloud server. As an information broker, we use a Kafka server,Footnote 1 an open-source distributed event streaming platform often used in Internet of Things (IoT) scenarios. At our cloud server, the final processing comprises three steps: first, the integration of each camera view into one common world view; second, the tracking of the objects over time; and third, the estimation of traffic flow according to our flow graph of the street crossing. Figure 10.3 shows exemplary intermediate results of our processing pipeline. All processing steps are detailed in the following sections.
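To make the edge-to-cloud interface more concrete, the following minimal sketch shows how a recognition unit could publish an anonymised detection event to a Kafka topic using the kafka-python client. The broker address, topic name and message fields are illustrative assumptions, not the actual schema used in the project.

```python
# Minimal sketch of the edge-side event publishing. Broker address,
# topic name and message schema are illustrative assumptions.
import json
import time

from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="broker.example.com:9092",  # hypothetical broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_detection(camera_id, obj_class, x, y, score):
    """Publish one anonymised detection: only the object class and its
    image position leave the device, never the video frame itself."""
    producer.send("detections", value={
        "camera": camera_id,
        "class": obj_class,   # one of the six road user classes
        "x": x, "y": y,       # position in image coordinates
        "score": score,       # detector confidence
        "ts": time.time(),    # capture timestamp
    })

publish_detection("cam-1", "car", 412, 227, 0.91)
producer.flush()  # ensure the event is actually transmitted
```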

Fig. 10.2

a Measurement set-up at the three-way intersection in Hallein. b The processing pipeline. Object detection is done with a specialised AI processing unit attached to each of the three cameras. The detection results are transferred via a mobile Internet connection to the cloud infrastructure where the flow estimation is performed

2.1 Real-Time Object Detection

Recent years have shown tremendous progress in the field of object detection. One key driver was the development of new methods and tools that can be efficiently computed on modern graphics cards. The now standard models such as Faster R-CNN [19] or Mask R-CNN [8] achieve high accuracy but also need powerful server solutions for the processing of live video streams. Therefore, lightweight models such as YOLO [18] or MobileNet [10] have been developed for use on smartphones or embedded devices. Our system utilises an advanced architecture [25] derived from MobileNet that can be efficiently evaluated on the Coral boardFootnote 2 that we use to process the live stream of the cameras.

The Coral board is an ARM-based single-board computer with an on-board Edge TPU co-processor to perform fast machine learning (ML) inferencing. Although the board comes with an object detector, its performance for our setting is rather poor because of the limited image size of 300\(\,\times \,\)300 pixels. In order to increase the detection performance of our system, we increased the image size to 960\(\,\times \,\)384 pixels and trained a new model using a selection of images from the COCO data set [14] and images that we collected at the crossing. The final training data set contained approximately 20k images from the COCO data set and 10k images from the crossing in Hallein that we automatically annotated using the consensus estimate of two state-of-the-art networks [19, 23]. The final model was able to process images of size 960\(\,\times \,\)384 at 15 frames per second (fps).

Several models were trained with the TensorFlow object detection toolkitFootnote 3 and quantised to 8 bit for use on the Coral device. The final model had a mean average precision of 82.6% on a manually labelled holdout data set consisting of 50 images from each of the three cameras at the crossing.
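As a rough illustration of how such a quantised model is run on the Edge TPU, the following sketch follows the standard detection example of Google's PyCoral API. The model file name and the score threshold are assumptions, not our exact deployment code.

```python
# Sketch of Edge TPU inference with the PyCoral API. The model file
# name and the threshold are illustrative assumptions.
from PIL import Image
from pycoral.adapters import common, detect
from pycoral.utils.edgetpu import make_interpreter

# Load the 8-bit quantised detection model compiled for the Edge TPU.
interpreter = make_interpreter("traffic_detector_edgetpu.tflite")
interpreter.allocate_tensors()

# Resize the camera frame to the model's input size (960x384 in our
# case) and keep the scale to map boxes back to the frame resolution.
image = Image.open("frame.jpg")
_, scale = common.set_resized_input(
    interpreter, image.size, lambda size: image.resize(size, Image.LANCZOS))

interpreter.invoke()

# Keep only reasonably confident detections.
for obj in detect.get_objects(interpreter, score_threshold=0.5, image_scale=scale):
    print(obj.id, obj.score, obj.bbox)
```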

Fig. 10.3

a Object detection on camera images at 15 fps, where we extract only the bounding box of objects. b The detected objects of each camera are transformed to lat/long world coordinates and integrated into object trajectories via Kalman filtering. c In a final step, the car trajectories are used to automatically infer the major traffic flows on the three-way intersection. In this case, six major traffic flows F-1 to F-6 are identified and depicted in different colours

2.2 Sensor Fusion and Object Tracking

One important step in the configuration of our system is to project the image positions of recognised objects to the coordinate system of the crossing and to combine the projections of each camera into a common world view. Several techniques for roadside camera calibration are available [11], with the overconstrained approaches usually performing best. An accurate camera calibration facilitates the fusion of the projected camera positions and thus the object tracking as a whole. The final sensor fusion and tracking pipeline was implemented as follows.

First, we calculated a class-specific reference point \(P_\mathrm {ref}\) from the bounding box of every object detection, as shown in the following equation

$$\begin{aligned} P_\mathrm {ref}(x, y) = \left( x_1 + f_x\cdot (x_2 - x_1), \, y_1 + f_y\cdot (y_2 - y_1) \right) , \end{aligned}$$
(10.1)

where \(x_1, y_1\) are the top left and \(x_2, y_2\) the bottom right coordinates of the bounding box, respectively. The factors \(f_x\) and \(f_y\) take values in \([0, 1]\) and were optimised such that the distance between the projected world coordinates of the same object seen from different camera views becomes minimal. The derived values for \(f_x\) and \(f_y\) are shown in Table 10.1.

Table 10.1 Class-specific relative position of the reference point in regard to the detection’s bounding box
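Equation (10.1) translates directly into code. The following sketch uses placeholder values for the class-specific factors; the actual optimised values are those listed in Table 10.1.

```python
# Reference point from a bounding box according to Eq. (10.1). The
# per-class factors are placeholders; see Table 10.1 for the
# optimised values.
REF_FACTORS = {
    "car":   (0.5, 0.8),  # hypothetical (f_x, f_y)
    "truck": (0.5, 0.9),
}

def reference_point(obj_class, x1, y1, x2, y2):
    """Class-specific reference point inside the bounding box, where
    (x1, y1) is the top left and (x2, y2) the bottom right corner."""
    f_x, f_y = REF_FACTORS[obj_class]
    return (x1 + f_x * (x2 - x1), y1 + f_y * (y2 - y1))

print(reference_point("car", 100, 50, 180, 110))  # -> (140.0, 98.0)
```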

Second, with the help of the camera calibration toolbox from the OpenCVFootnote 4 library, we projected the image coordinates of the reference point to our world coordinate system. Each camera was calibrated manually using the visible subset of 20 carefully selected and mapped reference points on the intersection. Third, the projected world coordinates are then assigned to the predicted positions of tracked objects. For every tracked object, there can be at most one assigned detection per camera. We used the Hungarian algorithm [12] to calculate an assignment with minimal distance between detections and tracked objects. For the Kalman filter update, we used the average position of the assigned detections. The Kalman filter was initialised with the discrete constant white noise kinetic model [4], which we parameterised individually for every object class.
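A condensed sketch of this fusion step could look as follows, using OpenCV's perspective transform for the image-to-world projection and SciPy's implementation of the Hungarian algorithm. The homography matrix and the gating threshold are illustrative placeholders; in practice, the homography would be estimated from the surveyed reference points, e.g. with cv2.findHomography.

```python
# Sketch of the fusion step: project image points to world coordinates
# with a per-camera homography and assign detections to tracks via the
# Hungarian algorithm. Homography values and gating distance are
# illustrative placeholders.
import cv2
import numpy as np
from scipy.optimize import linear_sum_assignment

# Placeholder 3x3 homography; estimated once per camera from the
# mapped reference points (e.g. with cv2.findHomography).
H = np.array([[1.2, 0.1, 5.0],
              [0.0, 1.1, 2.0],
              [0.0, 0.0, 1.0]])

def to_world(points_px):
    """Project Nx2 image points to world coordinates."""
    pts = np.asarray(points_px, dtype=np.float32).reshape(-1, 1, 2)
    return cv2.perspectiveTransform(pts, H).reshape(-1, 2)

def assign(detections_w, track_predictions_w, gate=3.0):
    """Minimal-cost assignment of projected detections to the
    Kalman-predicted track positions; pairs farther apart than the
    gating distance (in metres) are rejected."""
    cost = np.linalg.norm(
        detections_w[:, None, :] - track_predictions_w[None, :, :], axis=2)
    det_idx, trk_idx = linear_sum_assignment(cost)
    return [(d, t) for d, t in zip(det_idx, trk_idx) if cost[d, t] <= gate]
```

The discrete constant white noise kinetic model of [4] is available, for example, as Q_discrete_white_noise in the FilterPy library and can be used to build the per-class process noise matrices of the Kalman filter.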

2.3 Traffic Flow Estimation

In order to get a better understanding of the vehicle flows on the observed intersection, we used trajectory clustering [3] to group similar trajectories together and to automatically learn the possible traffic patterns at the intersection using a graph-based approach for traffic flow extraction [9]. The method first uses all trajectories to build a flow graph and then extracts flow patterns based on the maximum flow between two nodes of the graph. The major traffic flows of vehicles on the three-way intersection are depicted in Fig. 10.3c. The measured vehicle trajectories are mapped to one of the six paths based on the minimal average distance between trajectory points and flow path. Because of tracking errors due to object occlusions or alignment failures, a significant number of tracks could only be observed partly and thus had to be removed from the analysis in order to get a good estimate of the vehicle counts. This was done with an additional calibration step that excluded short tracks caused by tracking problems. We used two one-hour recordings with ground truth car and truck counts to adapt the parameters of the counting method and to measure the flow estimation quality in a cross-validation setting.
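The mapping of measured trajectories to the extracted flow paths, including the exclusion of short tracks, can be sketched as a nearest-path lookup. The distance function and the minimum track length below are illustrative assumptions, not the tuned parameters of our calibration step.

```python
# Sketch: map a measured trajectory to the flow pattern with the
# minimal average point-to-path distance. The minimum track length
# is an illustrative stand-in for the calibration step.
import numpy as np

def avg_distance(trajectory, path):
    """Average distance from each trajectory point to its closest
    point on a densely sampled flow path (both Nx2 arrays)."""
    d = np.linalg.norm(trajectory[:, None, :] - path[None, :, :], axis=2)
    return d.min(axis=1).mean()

def assign_flow(trajectory, flow_paths, min_points=10):
    """Return the id of the best matching flow (F-1 to F-6), or None
    for short tracks that the calibration step would exclude."""
    if len(trajectory) < min_points:
        return None
    return min(flow_paths, key=lambda f: avg_distance(trajectory, flow_paths[f]))
```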

3 Traffic Flow Measurement Results

First, we evaluated our traffic flow estimation method against manually obtained car and truck counts at the three-way intersection. Table 10.2 summarises the true and the estimated counts of cars and trucks over two one-hour observation periods. For every observation period, we used the other one to perform a hyper-parameter optimisation of our estimation method. The measured deviation from the true count, averaged over the two observation periods, was 3.9% for cars and 7.4% for trucks, respectively. Second, we also evaluated the precision of the object count individually for the six observed traffic flow patterns at the intersection (see Fig. 10.3c). The average error of the count estimate increased to 10.8% for cars and to 26.4% for trucks. The strong increase of the error for the truck class is a result of the generally low frequency of trucks for flows F-3 to F-6, which makes a precise estimation of the traffic flow more unreliable.

Table 10.2 Evaluation result of the traffic flow estimation

Furthermore, we investigated the traffic flow at the intersection over a two-week observation period. Figure 10.4 shows the estimated flow of cars and trucks over a 12 h time span on an hourly basis. The measurement was done from 18 September till 1 October 2020. The estimates are calculated separately for workdays (orange bars) and weekends (blue bars), since the amount of truck traffic changes considerably during weekends. Whereas the truck traffic has its peak during the morning hours and decays considerably in the afternoon and evening hours, the car traffic stays almost constant and decreases only in the evening. The average percentage of trucks on the intersection during this measurement period was 6.0%. During weekends, the average percentage of trucks decreased to 1.8% due to a considerably lower number of trucks passing the intersection (Fig. 10.4b). For cars, on the other hand, we find a strong decrease in the count only in the morning hours, which is mainly attributable to Sundays.

Fig. 10.4

Object count estimates of cars (a) and trucks (b) from 7am to 7pm on an hourly basis. The bars show the mean count of the object class per hour, and the black line indicates the variation of the count estimate. Orange bars are averaged over workdays, whereas the blue bars show the average over weekends

We also investigated how the car and truck traffic is distributed over the six traffic flow patterns. Figure 10.5 shows that most traffic runs along the flow patterns F-1 and F-2 (see Fig. 10.3c for the definition of the flow patterns), which correspond to a higher-ranked street that bypasses the old town. F-3 to F-6 are distributor roads from and into the old town that carry less traffic (also because of their smaller time share in the traffic light switching schedule). It can be clearly seen that the truck traffic from and into the old town is very low, and thus also the percentage of trucks on the intersection, as Table 10.3 shows. The only exception is the distributor road from the old town (traffic pattern F-4), for which we observed the highest variation in the percentage of trucks on the intersection. Although the number of cars and trucks along this flow pattern is generally low, which leads to a high variation in the estimate, Fig. 10.5b also shows that there is significant truck traffic along this flow pattern, which explains the strong increase in the percentage of trucks on the intersection.

Table 10.3 Percentage of trucks on the intersection during weekdays partitioned by traffic flow patterns
Fig. 10.5

Object count statistics of cars (a) and trucks (b) partitioned by flow patterns as defined in Fig. 10.3c. Orange box plots show the summary statistics for workdays, whereas the blue box plots show those for weekends

4 Discussion

The evaluation measurement was done on a sunny day with very similar weather conditions between the two observation periods. It is well known that video-based tracking systems are sensitive to weather conditions such as fog or snow [6], and we expect an increase in error under such conditions. Simulating weather conditions via style transfer as in [15] could in principle help to generate a more precise evaluation of the system. However, in our case it was not possible to run an additional deep learning model on the embedded board due to limited memory and processing capacity.

One difficulty we observed during the execution of this study was the classification of vehicles into separate car and truck classes. During evaluation, we found a gradual transition from the car via the van to the truck class, where it was not always easy to draw a clear border between these classes based on vehicle features. Furthermore, our vehicle classification model produced a rather coarse separation of vehicle classes due to the use of pretrained models for the automatic annotation procedure, which are aimed at a more general recognition task. Therefore, to get a more standardised vehicle classification as outlined in [7], it would be necessary to extend the training set generator with a specialised classification network as described in [21].

A key feature of our solution is its simple installation procedure on the traffic light itself. The developed device is self-contained, with a built-in mobile connection to our cloud service, and thus needs no wired Internet connection, which is usually not available at intersections. Although this gives more flexibility in positioning the device at the traffic light, we also observed that the smaller distance to the road leads to more occlusions of cars and other road users by large vehicles such as trucks and buses. Because of these occlusions, a substantial number of vehicles could be tracked only partly, leading to more than one track per vehicle. Another factor contributing to this problem is the difficulty of calculating the correct position of vehicles that are only partly visible in the video. In this case, the reference point calculation as described in Sect. 10.2.2 leads to an offset in the position estimate that makes the prolongation of trajectories between camera views more error-prone. To mitigate this issue, we had to carefully tune the hyper-parameters of the trajectory selection process. We also observed a higher variation in the apparent object size of vehicles, which made their recognition more difficult. Since our units were positioned at the roadside, we expect that a more central arrangement above the road could bring advantages.

One specific goal of this investigation was to build a low energy recognition system. Deep learning algorithms are energy hungry [13] and contribute significantly to the increasing energy consumption of the IT infrastructure [22]. Therefore, a sustainable traffic monitoring solution needs to take the energy consumption of the system into account, since a nationwide rollout would mean the installation of thousands of devices. The presented solution is based on the Coral Edge TPU, which provides an energy-efficient way to perform object detection. The average power used by the device for the video stream processing was approximately 4.9 W (2.4 W in idle mode). Thus, by combining our technology stack with newly proposed methods for low energy communication in 5G networks [1], an energy-efficient traffic monitoring platform is feasible.

5 Conclusion and Outlook

In this chapter, we presented a modern traffic measurement system that has four key advantages over conventional systems: (1) low energy consumption due to edge computing, (2) distributed logic between edge and cloud results in a cost-efficient solution, (3) local processing grants a high level of privacy, and (4) the self-contained field device supports easy on-site installation.

We demonstrated that the system is able to measure the traffic flow of cars and trucks at the three-way intersection in Hallein with high precision and that we are able to partition the vehicle flow into one of the six automatically extracted flow patterns. Our analysis gives more insight into the spatial and temporal distribution of the car and truck traffic at the intersection and provides a basis for a more detailed scenario-based simulation approach in the Connecting Austria project.

In this work, we focused solely on the traffic flow measurement of vehicles. The described measurement system can also be used to track and analyse the behaviour of vulnerable road users at an urban intersection. The precise location of these road users could be used to generate C-ITS messages that warn approaching vehicles of potentially dangerous situations such as a "person on the road". For such a use case, it is critical that the necessary information is provided within a short time frame. Although the measurement frequency of our system of 15 frames per second would in principle allow for such fast processing, the current design is not favourable for this use case, since the communication with the cloud introduces significant delays with traditional communication networks. Thus, the latency of such a system is a key factor which we will consider in the future development of our system.