Occluded Street Objects Perception Algorithm of Intelligent Vehicles Based on 3D Projection Model

We present a super perception system for intelligent vehicles that perceives occluded street objects by collecting images from the V2V (vehicle-to-vehicle) video streams of neighboring front vehicles, based on a 3D projection model. This capability can help avoid serious accidents in driver-assistance or automated driving systems that can only detect visible objects. Our street perception system can “see through” the front vehicles and detect occluded street objects using only the image pairs received from the front and host (back) vehicles. Based on the 3D projection model built from the image pairs, the system uses an affine transformation to realize an augmented reality method that widens the field of view of the driver system. Experimental results on different datasets validate our approach. An evaluation method is also introduced into this kind of perception system for the first time.


Introduction
According to the World Health Organization's 2016 global report on road traffic injuries, about 1.25 million people die each year as a result of road traffic crashes. Between 20 and 50 million more people suffer nonfatal injuries [1]. Statistics from NHTSA show that 31% of traffic accidents are rear-end collisions, which happen when front vehicles suddenly slow down or stop in a chain reaction to road objects ahead of them [2]. The main cause is that the driver is unaware of the situation ahead of the front vehicles or cannot perceive street objects occluded by them, a problem that is especially serious on urban streets. Occlusion can also lead to other kinds of accidents. For example, a car or a pedestrian may suddenly enter the street in front of other vehicles, leaving the driver too little reaction time to avoid a serious accident. Another example is a passing maneuver around trucks or buses, which can cause one of the most severe types of road crash when a vehicle shifts into an opposing traffic lane and collides head-on with an oncoming vehicle.
The WHO report also mentions that improving the safety features of vehicles is an efficient way to prevent road traffic injuries [1]. An analysis of the driving aid technologies of advanced vehicles suggests that an enhanced forward road perception system is an efficient way to reduce the traffic incidents mentioned above. Most forward street perception systems can only perceive what the vehicle can "see" and fail when street objects are partly or fully occluded. As human beings tend to believe what they see with their own eyes, drivers can gain enough reaction time if they can see through the front vehicles and understand the situation ahead of them. In this way, accidents on urban streets could be reduced sharply.
The idea of V2V in our paper is similar to truck platoons [3][4][5]. The difference is that a truck platoon links trucks or cars in a train-like line, which can save fuel, fit more cars on the road, and potentially improve safety. In our system, however, vehicles can drive in the same lane or in different lanes and connect to each other to improve safety by sharing video data. Using V2V communication based on DSRC, the video stream of a vehicle's forward-looking camera can be transmitted between vehicles with low delay [6]. The perception system of the host vehicle (vehicle A in Figure 1) can be augmented by combining the video information from the front vehicle (vehicle B in Figure 1). Little work has been done on front street perception based on V2V video streams. Most forward perception systems cannot be deployed as see-through windows [7][8][9][10][11]. These systems can measure their distance to other vehicles but can only perceive visible street objects via sensors such as laser, radar, and camera. Among collaborative methods, [12,13] periodically exchange vehicle location information to prevent potential danger in advance. Similarly, [14,15] adopt route data from a conventional navigation system. However, drivers are not sensitive to digital information; human beings tend to believe what they see with their own eyes. The works [6,16] propose a transparent-vehicle method based on V2V video streams to deal with passing maneuvers. Their method needs accurate distance information gathered from a radar sensor to project objects between the two images. The work [17] proposes a vehicle blind spot elimination system based on videos captured from other vehicles, but its linear projection fusion causes severe deformation when the front and back vehicles are in different lanes.
Great progress has recently been made on object detectors based on convolutional neural networks (CNNs), such as Fast-RCNN [18], R-FCN [19], Multibox [20], SSD [18], and YOLO [19]. These detectors are good enough to detect most objects on the road but still fail to perceive invisible objects (fully occluded objects or objects that are more than 80% occluded). Our street perception system can "see through" the front vehicles and detect "invisible" objects (such as vehicles, pedestrians, and motorcycles) by analyzing the images received from the front and back vehicles. We name it a super perception system because of its ability to see through vehicles, as shown in Figure 1. The perception system turns on its "super ability" only when street objects are detected in the front vehicle's windshield images. The system then combines the images from the front vehicle and projects the occluded objects onto the windshield image of the host vehicle. There is no restriction on the relative location of the front and back vehicles. The main idea is inspired by [6,17]. The main contributions of this paper are as follows.
(1) We propose a novel super perception system for perceiving street objects that are partly or fully occluded by front vehicles. The system provides an architecture for a new generation of driving aid systems based on V2V video streams.
(2) The system also drafts a simple communication protocol between two vehicles.
(3) We improve the results of the augmented reality method for projecting objects from the front vehicle image onto the host vehicle image.
(4) We propose an object-based fusion method based on affine transformation. Fusion happens only when street objects are detected in the images of the front vehicle, and only the object areas are fused into the host vehicle image to augment the perception of the intelligent vehicle.
(5) A performance evaluation method is proposed for the first time to measure the accuracy of the projection.

Architecture of Super Perception System
The super perception system of the host vehicle (vehicle A in Figure 1) is enhanced by combining the video information from the front vehicle (vehicle B in Figure 1). Vehicle B runs a CNN-based detector on its forward images and, when street objects are detected, periodically sends beacon signals to surrounding vehicles through its DSRC equipment. When A receives a beacon signal from B, it begins to receive the video stream from B, and the perception system on A starts the fusion process based on the 3D projection model. The parameters of the 3D projection model are computed from the synchronized images from A and B. The flowchart in Figure 2 describes the processing procedure of the perception system.
The system consists of four function blocks, boxed in blue dashed lines: the communication protocol block, the object detection block, the 3D projection model block, and the object-based fusion block. These blocks are described in detail in the following sections.

DSRC-Based V2V Communication
Many advanced driver-assistance applications are built on the V2V communication standard DSRC [21]. V2V communication plays a decisive role in several of these cooperative approaches. Figure 3 shows a flowchart describing the processing flow of each vehicle (in black) and the communication protocol between two vehicles (in blue).
Every vehicle periodically sends beacon signals to nearby vehicles after street objects are detected in its forward images. In Figure 3, B detects cars or pedestrians in its image and then sends beacon signals to vehicle A. When A receives the beacon signal from B, its perception system is activated. The cooperative protocol between the two vehicles is then initiated: vehicle A requests the video stream and the camera intrinsic parameters from vehicle B, and B sends those data to A. The transmission delay is lower than 100 ms per image frame, and the experiments in [3] show that a delay below 100 ms can be ignored in perception systems. When no objects are detected any more, vehicle B stops sending video and sends a stop signal to A, which terminates the communication between A and B.
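The message exchange described above can be sketched as a pair of small state machines. This is only an illustration under stated assumptions: the message names (`BEACON`, `REQUEST_STREAM`, `FRAME`, `STOP`), the classes, and the `simulate` helper are hypothetical and not part of the paper's protocol definition, and the DSRC transport and the camera intrinsics payload are omitted.

```python
from enum import Enum, auto


class Msg(Enum):
    BEACON = auto()          # B -> A: street objects detected ahead of B
    REQUEST_STREAM = auto()  # A -> B: ask for video and camera intrinsics
    FRAME = auto()           # B -> A: one video frame
    STOP = auto()            # B -> A: no objects any more, end the session


class FrontVehicle:
    """Vehicle B: detects objects and serves its video stream on request."""

    def __init__(self):
        self.streaming = False

    def on_detection(self, n_objects):
        """Called once per camera frame; returns the message for A, or None."""
        if n_objects == 0:
            if self.streaming:
                self.streaming = False
                return Msg.STOP
            return None
        return Msg.FRAME if self.streaming else Msg.BEACON

    def on_message(self, msg):
        if msg is Msg.REQUEST_STREAM:
            self.streaming = True


class HostVehicle:
    """Vehicle A: activates its perception system when a beacon arrives."""

    def __init__(self):
        self.active = False
        self.frames_received = 0

    def on_message(self, msg):
        """Returns the reply for B, or None."""
        if msg is Msg.BEACON and not self.active:
            self.active = True
            return Msg.REQUEST_STREAM
        if msg is Msg.FRAME and self.active:
            self.frames_received += 1
        elif msg is Msg.STOP:
            self.active = False
        return None


def simulate(detections):
    """Run both vehicles over a list of per-frame object counts seen by B."""
    a, b = HostVehicle(), FrontVehicle()
    log = []
    for n in detections:
        msg = b.on_detection(n)
        if msg is None:
            continue
        log.append(msg.name)
        reply = a.on_message(msg)
        if reply is not None:
            log.append(reply.name)
            b.on_message(reply)
    return log, a
```

For example, `simulate([0, 2, 1, 3, 0, 0])` walks through one complete session: a beacon, a stream request, two frames, and a stop signal.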

SSD Based Street Objects Detection
Before the perception system is activated, the image of vehicle B must be analyzed. Detection must be applied before projection, as projection only happens when there are occluded street objects (that is, street objects in front of B). We adopt SSD [22], an end-to-end single deep neural network, as the detection algorithm. The architecture of the SSD network is shown in Figure 4. VGG16 forms the early network layers (truncated before any classification layer). Five convolutional feature layers are added to the truncated VGG16; they progressively decrease in size and produce feature maps at different scales. A small set of default boxes is evaluated over several feature maps. For each default box, the shape offsets and the confidence scores are computed for all street object categories (here we have three categories: vehicle, pedestrian, and bicycle). The objective loss function consists of the confidence loss (conf) and the localization loss (loc):

$$L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right) \quad (1)$$

where $N$ is the number of matched default boxes, $c$ denotes the class confidences, $l$ is the location of the predicted box, $g$ is the ground-truth box, $\alpha$ weights the localization loss, and $x \in \{1, 0\}$ indicates whether a default box is matched to a ground-truth box. A set of default boxes is then associated with each feature map cell at the top of the network. Finally, the model applies nonmaximum suppression to keep the most confident boxes according to score and region. This algorithm achieves 90.27% mAP on our street test dataset at 49 FPS on a Nvidia Titan GTX 1080i GPU.
SSD performs well in both accuracy and speed, which meets our requirements. Although SSD performs poorly on small objects, small objects are far from the vehicle's perception system and are therefore unlikely to cause an accident in this situation.
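As a rough illustration of how the SSD objective combines its two terms, the sketch below computes a confidence loss (softmax cross-entropy) plus a weighted localization loss (Smooth L1) and normalizes by the number of matched default boxes. It is a simplified sketch, not the training code used in this work: the function names and the input layout are assumptions, class 0 is assumed to be background, and hard negative mining and box encoding are left out.

```python
import math


def smooth_l1(x):
    """Smooth L1 penalty applied to one box offset difference."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def softmax_cross_entropy(logits, label):
    """Negative log-probability of `label` under a softmax over `logits`."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[label]


def ssd_loss(matches, negatives, alpha=1.0):
    """Confidence loss plus alpha times localization loss, divided by N.

    matches:   list of (class_logits, label, pred_offsets, gt_offsets),
               one entry per default box matched to a ground-truth box.
    negatives: list of class_logits for unmatched default boxes
               (assumed to be trained towards background, class 0).
    """
    n = len(matches)
    if n == 0:
        return 0.0  # no matched boxes: the loss is defined as zero
    conf = sum(softmax_cross_entropy(lg, lab) for lg, lab, _, _ in matches)
    conf += sum(softmax_cross_entropy(lg, 0) for lg in negatives)
    loc = sum(
        smooth_l1(p - g)
        for _, _, pred, gt in matches
        for p, g in zip(pred, gt)
    )
    return (conf + alpha * loc) / n


# One matched box with uniform logits over 4 classes and two offset errors.
example = ssd_loss(
    matches=[([0.0, 0.0, 0.0, 0.0], 1, [0.5, 0.0, 0.0, 2.0], [0.0, 0.0, 0.0, 0.0])],
    negatives=[],
)
```

In this toy case the confidence term is ln 4 and the localization term is 0.125 + 1.5, which makes it easy to check the normalization by hand.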

3D Front-and-Back Projection Model
If street objects are detected in the image of vehicle B, the two images from the front and back vehicles are used to construct the 3D projection model, shown in Figure 5. Unlike the projection model proposed by Yair [23][24][25], our 3D projection model is based on epipolar geometry, which depends only on the cameras' internal parameters and their relative pose and is independent of the scene structure. As the two cameras are at different locations, the same object has a different size, location, and deformation after projection into each camera. Finding the transformation function between the two images is the key to projection, and the 3D projection model is built from feature pairs in both images. The processing can be divided into three parts: (1) feature pair selection; (2) camera epipolar geometry estimation; (3) object-based fusion. More details are discussed below.

Feature Pair Selection.
To provide a representative description of an object, we extract characteristic feature points on the object in an image and use their descriptions to find the corresponding points of the same object in the two images. For reliable point matching, the descriptions must be invariant to changes in scale, noise, illumination, and deformation. Feature pair selection consists of feature detection and feature matching.
(1) Feature Detection. We adopt Lowe's SIFT method [26] for feature detection and description. SIFT uses a 128-element feature vector to characterize the gradient pattern in a properly oriented neighborhood surrounding a SIFT feature. These features are (semi-)invariant to incidental environmental changes in lighting, viewpoint, and scale. We used a public-domain SIFT implementation (http://www.vlfeat.org/).
(2) Feature Matching. Because the relative pose of the two cameras is unknown, matching SIFT features in the front and back images amounts to searching for similar descriptors. A brute-force algorithm is adopted to match feature pairs in the front and back images. The Euclidean distance between feature vectors is computed as the matching score. A selected matching pair needs to satisfy equation (2) [26].
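A minimal sketch of brute-force matching with a best-versus-second-best test is given below. It follows the distance-ratio form of Lowe's test, in which a match is kept only if the nearest descriptor is clearly closer than the second nearest. The ratio value 0.8 and the function name are assumptions for illustration; the threshold used in this work is not stated here, and real SIFT descriptors have 128 elements rather than the 2 used in the toy data.

```python
import math


def match_features(desc_a, desc_b, ratio=0.8):
    """Brute-force matching of descriptors from image A against image B.

    Returns a list of (index_in_a, index_in_b) pairs that pass the ratio test.
    """
    matches = []
    for i, da in enumerate(desc_a):
        # Euclidean distance from this descriptor to every descriptor in B.
        dists = sorted((math.dist(da, db), j) for j, db in enumerate(desc_b))
        if len(dists) < 2:
            continue  # the ratio test needs a second-best candidate
        (d1, j1), (d2, _) = dists[0], dists[1]
        if d1 < ratio * d2:
            matches.append((i, j1))
    return matches


# Toy 2-D descriptors: the third descriptor in A is equally close to two
# descriptors in B, so it is ambiguous and gets rejected.
desc_a = [(0.0, 0.0), (10.0, 10.0), (5.0, 5.0)]
desc_b = [(0.1, 0.0), (10.0, 10.2), (4.0, 5.0), (6.0, 5.0)]
matches = match_features(desc_a, desc_b)
```

The ratio test removes ambiguous matches but, as the results show, it cannot remove every erroneous pair, which is why a geometric check is still needed afterwards.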
In equation (2), $\max(\cdot)$ denotes the score of the best matching pair and $\max_{sec}(\cdot)$ that of the second-best one, both evaluated over candidate pairs $(p_a, p_b)$, where $p_a$ and $p_b$ represent feature points from the images of vehicles A and B. Figure 7(a) depicts the matching results between the front and back images. The results reveal that matching based on similarity alone produces erroneous pairs. In the following section, geometrical constraints are introduced to filter out the erroneous matches.

Camera Epipolar Geometry
Let $K_a$ and $K_b$ be the intrinsic matrices of the two cameras, and let R and T denote the motion between the two cameras (Figure 5). If the cameras of the two vehicles have noncoincident centers, then the fundamental matrix F is the unique 3 × 3 rank-2 homogeneous matrix which satisfies [27]

$$p_a^T F p_b = 0$$

where $(p_a, p_b)$ is any pair of corresponding points in the two images from vehicles A and B. This enables F to be computed from image correspondences alone. Here, we used the five-point algorithm [28]. The Normalized Eight-Point Algorithm [29] can also be used to improve the performance at the cost of efficiency.
Based on the fundamental matrix F, we can calculate the rotation R and the direction of translation T, which describe the relative pose between the two cameras.
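To make the epipolar constraint concrete, the following sketch builds F from a known relative pose and evaluates the residual $|p_a^T F p_b|$ for a true and a false correspondence. Several things here are assumptions made for illustration only: the pose convention $X_a = R X_b + T$ (a point in camera B coordinates mapped into camera A coordinates), which gives $F = K_a^{-T} [T]_\times R K_b^{-1}$; a simple pinhole model with focal length `f` and principal point `(cx, cy)`; and all numeric values and helper names.

```python
def matmul(A, B):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def skew(t):
    """Cross-product matrix [t]_x, so that [t]_x v equals t x v."""
    x, y, z = t
    return [[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]]


def inv_intrinsics(f, cx, cy):
    """Inverse of K = [[f, 0, cx], [0, f, cy], [0, 0, 1]]."""
    return [[1.0 / f, 0.0, -cx / f], [0.0, 1.0 / f, -cy / f], [0.0, 0.0, 1.0]]


def fundamental_from_pose(Kinv_a, Kinv_b, R, T):
    """F = K_a^-T [T]_x R K_b^-1 under the assumed convention X_a = R X_b + T."""
    E = matmul(skew(T), R)
    return matmul(matmul(transpose(Kinv_a), E), Kinv_b)


def residual(F, p_a, p_b):
    """Epipolar residual |p_a^T F p_b| for homogeneous pixel coordinates."""
    Fp = [sum(F[i][j] * p_b[j] for j in range(3)) for i in range(3)]
    return abs(sum(p_a[i] * Fp[i] for i in range(3)))


def project(f, cx, cy, X):
    """Pinhole projection of a 3D point to homogeneous pixel coordinates."""
    return [f * X[0] / X[2] + cx, f * X[1] / X[2] + cy, 1.0]


# Hypothetical setup: B drives 10 m ahead of A and 1.5 m to the side,
# both cameras share the same intrinsics and orientation.
f, cx, cy = 700.0, 320.0, 240.0
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
T = [1.5, 0.0, 10.0]
Kinv = inv_intrinsics(f, cx, cy)
F = fundamental_from_pose(Kinv, Kinv, R, T)

X_b = [2.0, -1.0, 20.0]                   # object in camera B coordinates
X_a = [X_b[i] + T[i] for i in range(3)]   # same object in A (R is identity)
p_b = project(f, cx, cy, X_b)
p_a = project(f, cx, cy, X_a)
p_wrong = project(f, cx, cy, [-3.0, 0.5, 15.0])  # a different object seen by B

good = residual(F, p_a, p_b)      # true correspondence: close to zero
bad = residual(F, p_a, p_wrong)   # false correspondence: clearly nonzero
```

The gap between the two residuals is what allows erroneous matches to be filtered with a geometric test.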

Feature Pair Optimization.
Generally, five or eight pairs of points are enough to compute the fundamental matrix F. In practice, we usually match many more features than that, so we can use them iteratively to obtain an optimal result. The RANSAC algorithm is employed to improve the robustness of the camera motion inference. We randomly select n small subsets ("seeds") of matching point pairs and compute the fundamental matrix F n times, once per seed. The value $|p_a^T F p_b|$ is called the residual error and is ideally zero. An F computed from an outlier-free seed produces small residual errors for most inlier matching pairs. We keep the seeds that produce the minimum median residual error. After filtering out the erroneous point pairs, the five-point algorithm is performed again to compute precise values of F, R, and T from all remaining pairs in a least-squares sense. The epipoles $e_a$ and $e_b$ can also be obtained from the standard Singular Value Decomposition (SVD) of F [27].

Object-Based Data Fusion
(1) Estimation of Transformation Parameters. To fuse objects from image B into image A, we need the size, shape, and location of the fusion region for each detected object. Hence, the mapping parameters between the two images need to be estimated. The work [17] used polar coordinates to approximate the mapping between the two images in a global, linear form to obtain a visually appealing color fusion result. This method performs well for vehicles in the same lane (shown in Figure 1(a)) but fails for vehicles in different lanes (shown in Figure 1(b)). As an affine transformation is a nonsingular linear transformation followed by a translation, we regard the mapping between the two images as an affine transformation. It has the matrix representation

$$P_A = H P_B$$

where $P_B$ and $P_A$ are the matrices of matching point pairs in the two images and H is the parameter matrix of the affine transformation. In homogeneous form, for a matching pair $(x_B, y_B)$ and $(x_A, y_A)$:

$$\begin{pmatrix} x_A \\ y_A \\ 1 \end{pmatrix} = \begin{pmatrix} a_{11} & a_{12} & t_1 \\ a_{21} & a_{22} & t_2 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} x_B \\ y_B \\ 1 \end{pmatrix}.$$
Here $a_{11}$, $a_{12}$, $a_{21}$, $a_{22}$, $t_1$, and $t_2$ are the six parameters of the matrix H. Five pairs of points are enough to calculate the parameters of the affine transformation (three noncollinear pairs are the minimum). However, we have more feature pairs than that, so RANSAC is again used to optimize the parameters of the transformation. The affine transformation performs better than the linear method in [17], but the projection of objects from image B to image A is still not exact, because the matching points used to estimate the affine parameters come from one object G, while the parameters are applied to the points of another object F. This rough method is therefore suitable only for visualization, which helps drivers "see" occluded objects and perceive the approximate situation ahead of the front vehicles. To get precise results, more information should be used, such as the depth of every pixel, or the affine transformation could be replaced by another transformation.
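The estimation of the six affine parameters with RANSAC can be sketched as follows. This is an illustrative sketch rather than the implementation used in the experiments: it fits H by least squares through the normal equations, samples three pairs per RANSAC iteration, and counts inliers within a pixel tolerance. The function names, the iteration count, the 2-pixel tolerance, and the synthetic data are all assumptions.

```python
import math
import random


def det3(m):
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def solve3(m, b):
    """Solve a 3x3 linear system by Cramer's rule; None if singular."""
    d = det3(m)
    if abs(d) < 1e-12:
        return None
    cols = []
    for k in range(3):
        mk = [[b[i] if j == k else m[i][j] for j in range(3)] for i in range(3)]
        cols.append(det3(mk) / d)
    return cols


def fit_affine(pairs):
    """Least-squares affine H mapping image B points to image A points.

    pairs: list of ((x_b, y_b), (x_a, y_a)). Returns a 3x3 nested list or None.
    """
    m = [[0.0] * 3 for _ in range(3)]
    bx, by = [0.0] * 3, [0.0] * 3
    for (xb, yb), (xa, ya) in pairs:
        v = (float(xb), float(yb), 1.0)
        for i in range(3):
            bx[i] += v[i] * xa
            by[i] += v[i] * ya
            for j in range(3):
                m[i][j] += v[i] * v[j]
    row1, row2 = solve3(m, bx), solve3(m, by)
    if row1 is None or row2 is None:
        return None  # degenerate sample, e.g. collinear points
    return [row1, row2, [0.0, 0.0, 1.0]]


def apply_affine(H, pt):
    x, y = pt
    return (H[0][0] * x + H[0][1] * y + H[0][2],
            H[1][0] * x + H[1][1] * y + H[1][2])


def ransac_affine(pairs, iters=200, tol=2.0, seed=0):
    """Keep the largest inlier set over random 3-pair samples, then refit."""
    rng = random.Random(seed)
    best = []
    for _ in range(iters):
        H = fit_affine(rng.sample(pairs, 3))
        if H is None:
            continue
        inliers = [p for p in pairs
                   if math.dist(apply_affine(H, p[0]), p[1]) <= tol]
        if len(inliers) > len(best):
            best = inliers
    if len(best) < 3:
        return None, []
    return fit_affine(best), best


# Synthetic check: 12 pairs follow a known H, 4 pairs are gross mismatches.
true_H = [[0.8, 0.1, 30.0], [-0.05, 0.9, -12.0], [0.0, 0.0, 1.0]]
grid = [(x, y) for x in (100, 250, 400, 550) for y in (80, 200, 320)]
pairs = [(p, apply_affine(true_H, p)) for p in grid]
pairs += [((120, 90), (500.0, 10.0)), ((300, 300), (20.0, 400.0)),
          ((480, 150), (90.0, 90.0)), ((60, 350), (610.0, 300.0))]
H, inliers = ransac_affine(pairs)
```

Once H is estimated, `apply_affine(H, corner)` maps the corners of a detected bounding box from image B into image A, which gives the location and size of the fusion region.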
(2) Image Fusion. The fusion region, where the fusion process is applied, is a circular area in the image of vehicle A. The center and radius of the circle depend on the location and size of the detected object region. The epipoles $e_a$ and $e_b$ can be used to eliminate objects that are not occluded by vehicle B. The blending method is similar to [14]. The blending weight is adjusted to use more color from the front image B close to the fusion center and more color from the back image A toward the edge of the circle. The transparency parameter controls the mixture of the two images.
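The circular blending can be sketched as below. The exact weight profile is not specified in the text, so this sketch assumes a linear falloff from the center to the edge of the circle, scaled by the transparency parameter `alpha`. It also assumes that image B has already been warped into the coordinates of image A and that both images are grayscale nested lists of the same size; the function name is hypothetical.

```python
import math


def blend_circle(img_a, img_b_warped, center, radius, alpha=1.0):
    """Blend image B into image A inside a circle.

    The weight of B is alpha at the center and falls linearly to 0 at the
    edge (an assumed profile); pixels outside the circle keep image A.
    """
    cx, cy = center
    out = [row[:] for row in img_a]
    for y, row in enumerate(img_a):
        for x, a in enumerate(row):
            d = math.hypot(x - cx, y - cy)
            if d < radius:
                w = alpha * (1.0 - d / radius)
                out[y][x] = w * img_b_warped[y][x] + (1.0 - w) * a
    return out


# Toy 5x5 example: A is black (0), warped B is uniform gray (100).
img_a = [[0.0] * 5 for _ in range(5)]
img_b = [[100.0] * 5 for _ in range(5)]
fused = blend_circle(img_a, img_b, center=(2, 2), radius=2.0, alpha=1.0)
```

With `alpha=1.0` the center pixel is taken entirely from image B, and the contribution of B fades out toward the border of the circle.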

Experiment Results
Our proposed system runs on a server with a Nvidia Titan GTX 1080i GPU.
Datasets. Experiments were performed on the dataset from [17], the Karlsruhe dataset, the KITTI dataset, and our own dataset. The four datasets are shown in Figure 6. Frames separated by an interval (△t) in a video are used to simulate the front and back vehicle images. We used the Karlsruhe dataset and the KITTI dataset [31] to evaluate the accuracy of the projection results. It is hard to evaluate the results on the dataset in [17] and on our own dataset because the objects in the back images are occluded by the front vehicles.

Feature Matching and Optimization Results
Feature matching and optimized matching results are shown in Figure 7. Column (a) shows the matches produced by the brute-force algorithm, and column (b) shows the matches after optimization with the geometrical constraint of the projection model. The optimized results show that the erroneous matches have been removed.

Affine Transformation Results
In our system, the objects in the two images are assumed to be related by an affine transformation. The affine transformation results are shown in Figure 8, and the quantitative evaluation is performed on the KITTI and Karlsruhe datasets. The (b) images are taken as the ground-truth images, and the affine results (c) are compared with them. The evaluation results are presented in Figure 9, and Table 1 shows the improvement of our method over [17], especially when the locations of the front and back vehicles do not follow a linear relationship. The method in [17] assumes that the front and back vehicles satisfy a linear model, but in practice most situations do not meet this hypothesis.
IoU (Intersection over Union) is an evaluation metric used to measure the accuracy of an object detector.
We use IoU here to evaluate the affine results, which are key to the accuracy of perception. In Figure 9, the red box is the ground-truth bounding box, the yellow box is the result of the method in [14], and the green box is the result of our method.
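For reference, IoU between a projected box and its ground-truth box, and the fraction of projections that reach a given IoU threshold, can be computed as in the sketch below. The box format `(x1, y1, x2, y2)` and the helper name `accuracy_at` are assumptions for illustration; the numbers are toy values, not results from the experiments.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_at(projected, ground_truth, threshold):
    """Fraction of projected boxes whose IoU with ground truth >= threshold."""
    hits = sum(1 for p, g in zip(projected, ground_truth) if iou(p, g) >= threshold)
    return hits / len(ground_truth)


# Toy boxes: a perfect projection, a slightly shifted one, and a miss.
gt = [(0, 0, 10, 10), (0, 0, 10, 10), (0, 0, 10, 10)]
proj = [(0, 0, 10, 10), (2, 0, 12, 10), (20, 20, 30, 30)]
acc_50 = accuracy_at(proj, gt, 0.5)
acc_70 = accuracy_at(proj, gt, 0.7)
```

Raising the threshold makes the criterion stricter, so the same set of projections scores lower at 0.7 than at 0.5.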
Figure 10 shows the projection results of different fusion methods. Result (a) uses the method of [17], which assumes that the projection between the front and back vehicle images satisfies a polar linear relationship, whereas result (b) uses the affine transformation as the hypothesized relationship between the two images. The comparison shows an obvious improvement. However, the affine transformation results are still not fully accurate, because the feature pairs are chosen from different objects (most of them in the background) at different depths, and these pairs are used to calculate the affine parameters of a single object, which causes the deviation. A more sophisticated model is therefore needed to describe the relationship between the objects.

Object Fusion Results
The relationship between the front and back vehicle images is assumed to satisfy an affine transformation, so the augmented fusion can be performed with the affine parameters calculated above.

Conclusion
In this paper, we introduce a super perception system which can "see through" vehicles and detect fully occluded street objects. Our perception system is a good example of an advanced driver-assistance system (ADAS) that collects information from sensors in neighboring vehicles. Our future research will focus on improving the accuracy of the projection model, that is, obtaining the correct location and size of the occluded objects after projection from the front image to the back image. We will also develop a performance evaluation method to evaluate the projection results of the system.

Figure 2: Flowchart describing the processing procedure of the perception system.

Figure 3: Flowchart showing the communication protocol between two vehicles and processing flow of each vehicle.
In Figure 5, a point P in 3-space is imaged as $p_a$ in the view of vehicle A and as $p_b$ in the view of vehicle B. $O_a$ and $O_b$ denote the optical centers of the two cameras, and $\Pi_a$ and $\Pi_b$ are the corresponding image planes of image A and image B. Geometrically, the points $O_a$, $O_b$, $p_a$, $p_b$, and P all lie on the same epipolar plane, which gives rise to the epipolar constraint.

Figure 6: Image dataset used in this paper.

Figure 7: Matching pairs features between front and back vehicles.

Figure 8: Affine transformation results based on front and back vehicle images.

Figure 9: IoU of two methods compared with ground-truth.

Figure 11 reveals the final fusion results. If the front vehicle (vehicle B) detects an object on the street, it sends its image B to the back vehicle (vehicle A), and the detected objects are fused into image A. The yellow rectangle marks the detected objects that are occluded by front vehicle B. The fusion process blends the pixel colors in image A with the corresponding pixels of the object areas in image B. The fusion region is a circle centered at the center of the rectangle, and we set the transparency parameter to 1. The blending weight is adjusted to use more color from the front image close to the center and more color from the back image away from the center.

Table 1: Results on the KITTI and Karlsruhe datasets for different IoU thresholds.