Application of machine learning methods for traffic signs recognition

This paper addresses a relevant and pressing safety issue on intercity roads. Two approaches to the problem of traffic sign recognition were considered; both involve neural networks that analyze images obtained from a camera in real time. The first approach is based on sequential image processing. At the initial stage, color filters and morphological operations (dilation and erosion) are used to locate the image area containing the traffic sign; the selected and scaled fragment of the image is then analyzed by a feedforward neural network to determine the meaning of the found sign. In this approach, the network is trained with the backpropagation method. The second approach uses convolutional neural networks at both stages, i.e. when searching for and selecting the image area containing the traffic sign, and when determining its meaning. In the second approach, the network is trained using the intersection-over-union function as the loss function. To train the neural networks and test the proposed algorithms, a series of dash-cam videos shot under various weather and illumination conditions was used. As a result, the proposed approaches to traffic sign recognition were analyzed and compared on key indicators such as recognition rate and the complexity of the training process.


Introduction
Ensuring road safety is one of the most pressing issues in modern society. This issue is directly related to the 'human factor': driver and pedestrian inattention is one of the main causes of road traffic accidents. Among the key tasks for both advanced driver assistance systems (ADAS) and control systems for autonomous vehicles, the analysis of road markings and traffic signs occupies an important place. Numerous computer vision systems for traffic sign analysis have been developed recently, but the characteristics of existing algorithms [1-5] (recognition accuracy, number of errors, robustness to atmospheric changes) are still not good enough to rule out a human operator.
The main tasks for computer vision systems used for traffic sign analysis are detecting the area of a given frame that contains a road sign (the localization task) and determining the meaning of the sign(s) in the selected area of the frame (the classification task). The localization task can be solved in the following ways: using a deep neural network to predict the location of the object's bounding rectangle [6]; or using methods that determine the parts of objects and the relations between them [7-9].
Classification can be performed using the following methods: the sliding window method [10-16]; gradient boosting over decision trees [17].
The purpose of this work is to carry out a comparative analysis of two methods of traffic sign recognition, identify their application features, and formulate possible modifications of these methods to achieve the best recognition quality.

Materials and methods
This study takes a modular approach to creating a computer vision system, allowing each module to be optimized and adjusted separately. The system under consideration is a collection of two modules whose tasks are distributed as follows: module I selects areas that may contain a sign; module II detects and classifies signs in the selected areas.
A schematic representation of the sequence of actions (the interaction of the system modules) is shown in figure 1. The first approach uses a three-layer perceptron together with a system that selects regions of interest (ROIs) based on color filtering. ROI selection consists of the following stages: 1) filtering noise using a Gaussian filter; 2) converting the image from the RGB color space to the HSV space to facilitate searching under different illumination conditions; 3) applying color filters to select areas containing the basic colors of the signs in question; 4) applying morphological operations to filter the noise of the color filter (erosion) and to restore the shape of the sign by filling in areas of non-basic colors (dilation); 5) removing areas that do not match in size and aspect ratio. Areas found through this procedure are scaled to 30x30 pixels. For each such area, a descriptor vector of 63 values is calculated: the area is converted to a grayscale image and the average brightness of all its pixels is computed. The first 30 components of the vector are the number of pixels in each image row whose brightness is above the average value; the next 30 components are the number of pixels in each image column whose brightness is above the average value; the remaining 3 components are the average values of the color components of the area in the HSV color space.
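The descriptor computation described above can be sketched in NumPy as follows. Here the V channel of the HSV region stands in for the grayscale brightness, which is an assumption on our part; the text does not specify the grayscale conversion.

```python
import numpy as np

def descriptor(region_hsv):
    """Compute the 63-element descriptor for a 30x30 ROI.

    region_hsv: (30, 30, 3) array in HSV color space.
    The V channel is used as the grayscale brightness (an assumption;
    the paper does not name the conversion).
    """
    gray = region_hsv[..., 2].astype(float)      # brightness image
    mean = gray.mean()                           # average brightness
    above = gray > mean                          # pixels brighter than average
    rows = above.sum(axis=1)                     # 30 per-row counts
    cols = above.sum(axis=0)                     # 30 per-column counts
    hsv_means = region_hsv.reshape(-1, 3).mean(axis=0)  # mean H, S, V
    return np.concatenate([rows, cols, hsv_means])
```

The vector is then normalized before being fed to the perceptron.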
Classification of the detected areas is carried out with the help of a three-layer perceptron having 63 neurons in the first layer, 80 neurons in the hidden layer, and as many neurons in the output layer as there are signs under consideration. A sigmoid is used as the activation function of the network. The normalized descriptor vector of each region is fed to the input of the network.
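The forward pass of this classifier can be sketched as follows. The random weight initialization is illustrative only; in the paper the weights are learned with backpropagation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class SignPerceptron:
    """Three-layer perceptron: 63 inputs, 80 hidden neurons, one output
    neuron per sign class. Initialization is illustrative; the paper
    trains the weights with backpropagation."""

    def __init__(self, n_signs, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (63, 80))
        self.b1 = np.zeros(80)
        self.W2 = rng.normal(0.0, 0.1, (80, n_signs))
        self.b2 = np.zeros(n_signs)

    def forward(self, x):
        hidden = sigmoid(x @ self.W1 + self.b1)     # 80 hidden activations
        return sigmoid(hidden @ self.W2 + self.b2)  # one score per sign
```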
In the second approach, a convolutional neural network with a small number of filters is used to determine the ROI, which increases performance. The network structure is shown in figure 2.
The network input is the high-resolution image from the DVR, resized to 1200x600 pixels. The network output is a 200x100 array of values ranging from 0 to 1, representing the probability of finding a sign in the region of the original image corresponding to each cell.
The intersection-over-union function (1) was used as the loss function. The main optimization method was the Adam stochastic optimization algorithm [21].
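A soft (differentiable) form of the intersection-over-union loss of equation (1) can be sketched as follows, assuming IoU is computed directly on the predicted and target probability maps; the paper does not give the exact formulation.

```python
import numpy as np

def iou_loss(pred, target, eps=1e-7):
    """Soft intersection-over-union loss for the 200x100 probability map.

    pred, target: arrays of values in [0, 1]. Returns 1 - IoU, so a
    perfect overlap gives a loss near 0 and no overlap gives 1.
    """
    inter = (pred * target).sum()
    union = pred.sum() + target.sum() - inter
    return 1.0 - inter / (union + eps)
```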
After the probability array is obtained, the ROIs are cut from the original image (to preserve the high resolution) and transferred to the input of module II.
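Mapping cells of the probability map back to image coordinates can be sketched as follows. The assumption that each of the 200x100 cells covers a 6x6-pixel patch of the 1200x600 input follows from the ratio of the two resolutions; the paper does not state the mapping explicitly.

```python
import numpy as np

def roi_boxes(prob_map, threshold=0.5, scale=6):
    """Return (x1, y1, x2, y2) pixel boxes in the 1200x600 input for
    every cell of the 200x100 probability map above `threshold`.

    scale=6 is the assumed downscaling factor (1200/200 = 600/100 = 6).
    """
    ys, xs = np.nonzero(prob_map >= threshold)
    return [(x * scale, y * scale, (x + 1) * scale, (y + 1) * scale)
            for y, x in zip(ys, xs)]
```

In practice adjacent above-threshold cells would be merged into a single ROI before cropping.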
To implement module II in this approach, a fully connected convolutional neural network with indirect connections between convolution layers was developed. The network structure is shown in figure 3. Since the network must both detect and classify an image, the training set and the network output were represented in a multi-label form. Such a representation allows an image to be annotated with an output vector whose elements are independent and can take a high or a low value independently of the other outputs. To implement this representation, a list of signs to be classified was drawn up: a stop sign, a bus stop sign, a pedestrian crossing sign and a main road sign.
Thus, the network output vector consists of four neurons. A sigmoid was chosen as the output activation function: a smooth nonlinear function with limiting values of 0 and 1, whose output corresponds to the probability that the corresponding sign is present in the image.
The binary cross-entropy function (2) was chosen as the estimation function.
Such a function is often used in logistic regression problems. Adagrad was chosen as the optimization algorithm [22].
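In its usual form, the binary cross-entropy of equation (2), averaged over the four output neurons, can be sketched as:

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Binary cross-entropy averaged over the output neurons.

    y_true: binary labels (one per sign), y_pred: sigmoid activations.
    Predictions are clipped to avoid log(0).
    """
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred)
                    + (1.0 - y_true) * np.log(1.0 - y_pred))
```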

Results
To compare the considered approaches, we use the basic metrics applied in the evaluation of classification problems: recall (3), precision (4) and F1 (5).

recall = TP / (TP + FN), (3)
precision = TP / (TP + FP), (4)
F1 = 2 · precision · recall / (precision + recall), (5)
where TP is the number of correct classifications of a sign's presence on the frame, FN is the number of erroneous classifications of a sign's absence on the frame, and FP is the number of erroneous classifications of a sign's presence on the frame.
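Metrics (3)-(5) can be computed directly from the counts defined above:

```python
def classification_metrics(tp, fp, fn):
    """Recall (3), precision (4) and F1 (5) from TP/FP/FN counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1
```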
When implementing the first approach, a training sample of 255 regions was created, containing whole signs or parts of signs, as well as areas that do not contain signs but have a similar color. The neural network was trained using the backpropagation algorithm. The results of testing the network on a test set are presented in Table 1. In the second approach, a training set of 500 annotated dash-cam frames was created to train the convolutional neural network of module I. 20% of the images were randomly picked from the set for network validation. Also, to reduce the probability of overfitting, the training set was augmented by applying basic image processing operations: translation, rotation, stretching and compression.
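One of the basic augmentation operations mentioned above, translation, might look like the sketch below; rotation and scaling would typically be done with an image processing library, and the zero-fill of the vacated border is our assumption.

```python
import numpy as np

def shift(img, dy, dx):
    """Translate an image by (dy, dx) pixels for data augmentation,
    filling the vacated border with zeros."""
    out = np.roll(img, (dy, dx), axis=(0, 1))
    if dy > 0:
        out[:dy] = 0
    elif dy < 0:
        out[dy:] = 0
    if dx > 0:
        out[:, :dx] = 0
    elif dx < 0:
        out[:, dx:] = 0
    return out
```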
As a result of training, the estimation function reached a value of 0.63, while the recall of the network was 0.86, which was indeed the main goal of training module I, i.e. predicting which frame areas contain a sign. Precision, however, was not evaluated at this stage.
An example of image annotation using the convolutional neural network is shown in figure 4. To train and configure the convolutional neural network of module II, a set of 400 image regions with signs depicted on the frames was selected and annotated. A further set of 40 images was created for testing.
The neural network of module II was also trained using data augmentation. The validation set comprised 20% of the training set.
According to the training results, the error function reached 0.001 on the training set and 0.01 on the validation set. The results of testing the network on the test set are presented in Table 2. The metric values were chosen on the basis of the largest F1 value obtained while varying the sign-labeling threshold. The sign-labeling threshold is the smallest activation value of an output neuron at which the label of the sign corresponding to that neuron is considered active. A value of 0.6 was chosen as the threshold.
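The threshold-selection procedure described above can be sketched as follows; the candidate thresholds and toy data here are illustrative, and in practice the search would run over the test-set activations.

```python
import numpy as np

def best_threshold(y_true, scores, thresholds):
    """Pick the labeling threshold that maximizes F1.

    y_true: binary ground-truth labels; scores: output activations.
    Returns (threshold, F1). Thresholds with no positive predictions
    are skipped to avoid division by zero.
    """
    best_t, best_f1 = thresholds[0], -1.0
    for t in thresholds:
        pred = scores >= t
        tp = np.sum(pred & (y_true == 1))
        fp = np.sum(pred & (y_true == 0))
        fn = np.sum(~pred & (y_true == 1))
        if tp == 0:
            continue
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```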

Conclusion
Comparing the two approaches to implementing the sign classification module revealed the following features:
- the approach using image preprocessing is sensitive to changes in environmental conditions;
- both approaches are highly robust when classifying noisy signs;
- the approach using a convolutional neural network requires a larger training set to annotate values correctly without overfitting, whereas in the color-filtering approach the corresponding effort falls on adding and modifying algorithms;
- the color-filtering approach has a higher potential for scaling and deployment than convolutional networks due to its lower computational requirements;
- convolutional networks can be extended by increasing the number of filter layers.
The following modifications are planned for future work:
- using the properties of recurrent networks in module I, in view of the work with a continuous video stream;
- using a sliding window inside the region containing a sign to localize the sign;
- prioritizing signs for the driver depending on their position in the image.

IASF-2017, IOP Conf. Series: Materials Science and Engineering 315 (2018) 012008, doi:10.1088/1757-899X/315/1/012008