DetReco: Object-Text Detection and Recognition Based on Deep Neural Network

Deep learning-based object detection methods have been applied in various fields, such as intelligent transportation systems (ITS) and autonomous driving systems (ADS). Meanwhile, text detection and recognition in different scenes have also attracted much attention and research effort. In this article, we propose a new object-text detection and recognition method termed “DetReco” to detect objects and texts and recognize the text contents. The proposed method is composed of an object-text detection network and a text recognition network. YOLOv3 is used as the algorithm for the object-text detection task and CRNN is employed for the text recognition task. We combine datasets of general objects and texts to train the networks. At test time, the detection network detects various objects in an image. Then, the text images are passed to the text recognition network to derive the text contents. The experiments show that the proposed method achieves 78.3 mAP (mean Average Precision) for general objects and 72.8 AP (Average Precision) for texts in regard to detection performance. Furthermore, the proposed method is able to robustly detect and recognize affine-transformed or occluded texts. In addition, for the texts detected around general objects, the text contents can be used as the identifier to distinguish the object.


Introduction
Object detection [1,2], as one of the most fundamental and challenging problems in computer vision, has received great attention in recent years. In the context of computer vision, object detection deals with the task of detecting instances of visual objects of specific classes such as humans, animals, and cars in digital images. It combines the cutting-edge technologies in many fields such as image processing, pattern recognition, automatic control, and artificial intelligence. Object detection is widely used in many fields including intelligent transportation systems [3,4], advanced driver assistance systems (ADAS), and autonomous driving systems.
In intelligent traffic surveillance systems [5], vehicle detection and recognition is a vital task. The automatic monitoring digital cameras take snapshots of passing vehicles and other moving objects to provide valuable clues, including the license plate number, the vehicle type, and the driver's facial image, for authorities and other security departments. In recent years, autonomous cars and driverless vehicles have significantly changed the manner of transportation. Computer vision systems are used effectively in the development of ADAS. Sakhare et al. [6] present a detailed study of vehicle detection in dynamic conditions. Yudin et al. [7] study vehicle detection in difficult areas with various architectures of deep neural networks [8].
In automated driving, detection and recognition of pedestrians, vehicles, traffic lights, and traffic signs [9] help avoid accidents and achieve safe driving. Collision avoidance systems are required to help the driver handle emergencies. Detecting pedestrians is essential for autonomous driving [10]. Zhang and Kim [11] propose a pedestrian detector which combines skip pooling from multiscale feature maps and recurrent convolutional layers to detect pedestrians of different scales. Reliable traffic light detection and classification in urban environments are also crucial for automated driving [12,13]. Kim et al. [12] develop a two-step method to detect traffic lights with the SSD architecture. Lu et al. [14] utilize a visual attention model to detect traffic signals, which is effective for the detection of small objects.
Object detection, which is the core of various intelligent transportation systems, has been a research hotspot in recent years. Meanwhile, the rapid development of deep learning has accelerated the development of object detection. Many deep learning-based object detection techniques have led to major breakthroughs and remarkable performance. Object detection methods can be divided into one-stage methods and two-stage methods. Two-stage methods usually involve two steps. Firstly, region proposals are obtained from the original image. Secondly, classification and regression networks such as the R-CNN [15] (Regional Convolutional Neural Network) series are applied to the region proposals. One-stage methods need only one step: they accomplish the classification and bounding box regression tasks directly without finding the region proposals separately. Typical one-stage algorithms include SSD [16] (Single Shot Multibox Detector) and YOLO [17] (You Only Look Once). R-CNN, proposed by Ross B. Girshick, uses the selective search [18] method to perform ROI (Region of Interest) scaling and feature extraction on target images. Because R-CNN requires forward calculation for a large number of region candidates which may overlap each other, training and detection are very slow. Fast R-CNN [19] uses a feature extractor to extract the features of the entire image once instead of extracting features separately for each region proposal. Because Fast R-CNN does not extract features repeatedly, the processing time is significantly reduced. Faster R-CNN [20] uses a design similar to Fast R-CNN. Faster R-CNN replaces the selective search method with an RPN (region proposal network), which solves the problem of excessive time overhead in generating ROIs. Faster R-CNN achieves high accuracy and improved detection speed, but it still cannot meet the real-time requirement.
Compared with Faster R-CNN, SSD has a significant advantage in detection speed. The network generates multiple feature maps at different scales. Then the classification and bounding box regression tasks are done simultaneously on the multiscale feature maps. SSD is able to detect large objects effectively. YOLO is another one-stage method. It predicts bounding boxes and class probabilities of multiple objects simultaneously. However, different from the SSD algorithm, YOLO does not use multiscale feature maps for detection. Its generalization capability for objects with large scale variations is poorer than that of SSD, which leads to missed detections and low recognition accuracy. The YOLOv2 [21] algorithm uses an anchor mechanism which utilizes convolutional layers, instead of the fully connected layers of YOLO, to predict the bounding boxes. The disadvantage of using a fully connected layer to predict bounding boxes is that the spatial information of the feature map is lost. In contrast, the anchor mechanism directly predicts the bounding boxes on the feature map with convolutional layers, so the spatial information of the feature map is well preserved. Each feature point of the feature map corresponds to one grid cell of the original image. YOLOv2 improves the detection accuracy. The YOLOv3 [22] algorithm adopts multiscale feature maps to predict bounding boxes. YOLOv3 uses the FPN (Feature Pyramid Networks) concept, which merges the output of the middle layers with the output of the later layers. The high-level features are passed to the low layers, so that small objects on low-level feature maps can be better detected. YOLOv3 greatly improves detection speed and accuracy. The majority of recent work related to deep neural networks has been devoted to detection or classification of object categories [23]. On top of that, another problem in computer vision that plays a vital role in intelligent transportation systems is image-based text recognition.
Text recognition aims to decode a sequence of labels from cropped text images. The conventional methods recognize the text contents at character level. The characters of the text are segmented from the cropped text image. Then the segmented character regions are preprocessed and recognized. Different from the character-level recognition methods, recent text recognition methods do not require character segmentation in advance. One well-known method is the multidigit number classification proposed by Goodfellow et al. [24], which is based on a DCNN (deep convolutional neural network). The method requires selecting the maximum predictable sequence length in advance. This limits it to recognizing house numbers or license plate numbers, whose text length is known beforehand. Another commonly used method is the RNN (recurrent neural network) with CTC [25] (connectionist temporal classification). Shi et al. [26] and He et al. [27] propose RNN models to encode the features from the CNN and adopt CTC to decode the encoded sequence. The advantage of this method is that it can generate texts of any length. Furthermore, the recurrent nature of the RNN enables the model to learn the temporal relationships within the text sequence. Another type of method that does not require character segmentation is the attention mechanism. Lee and Osindero [28] use an attention-based sequence-to-sequence structure to automatically focus on certain extracted CNN features and directly use text images to perform word string learning. This method implicitly learns character-level language models embodied in the RNN. It is able to perform text recognition in unconstrained natural scenes.
Scene text recognition [29] in intelligent transportation systems has many applications, such as vehicle license plate recognition and road sign recognition. As an important part of intelligent transportation systems, vehicle license plate recognition is widely used in intelligent monitoring systems and parking systems. Automatic license plate recognition (ALPR) refers to the extraction of vehicle license plate information from an image or a sequence of images [30]. Chai and Zuo [31] propose an automatic vehicle license plate recognition method which adopts an edge detection algorithm for plate extraction, character segmentation, and recognition. Chang et al. [32] use license plate recognition technology to track vehicles on the road in complex traffic conditions.
Object detection in applications refers to detection under specific application scenarios, such as pedestrian detection, vehicle detection, and scene text detection. Text recognition in specific application scenarios can extract more information from the objects on which the applications focus. In this paper, we propose a model which combines object-text detection and text recognition. The model is able to detect both texts and general objects simultaneously, and it recognizes the contents of the detected texts. In addition, for the texts detected around general objects, the contents can be used as the identifier to distinguish the object. The proposed method offers comprehensive detection and recognition capabilities and can be applied to a wide range of applications in intelligent transportation systems.
The contributions are summarized as follows:
(1) We propose an object-text detection model for multiple objects which can simultaneously detect texts and general objects.
(2) We propose a text recognition framework that effectively combines text detection and recognition.
(3) The method we propose can detect multiple types of objects and instantiate the identities of the detected objects based on the identified text labels. The recognized text label is used as a valid identity of the object.

Materials and Methods
The network structure in this paper consists of two parts: the object-text detection network and the text recognition network. We use the YOLOv3 architecture, which adopts a fully convolutional neural network [33], to detect objects and texts in real-scene images. The convolutional network is used to extract features from the image as feature maps at multiple scales. The classification and bounding box regression networks directly output the objectness score, the class of the object, and the coordinate offsets of the object at multiple feature maps. We use NMS [34] (nonmaximum suppression) to remove the redundant bounding boxes which have large overlap on the same object. We adopt a successful scene text recognition algorithm, CRNN [26] (Convolutional Recurrent Neural Network), in conjunction with object-text detection. According to the coordinates of the text-type boxes output by the object-text detection network, the text regions are cropped from the original image. A convolutional neural network is used to extract features from the text regions. The extracted feature maps need to be scaled to a uniform height with a fixed aspect ratio. We use the recurrent model to encode the feature sequences from the feature maps and CTC to decode the encoded sequence. The network structure we propose is shown in Figure 1.

Architecture of the Object-Text Detection Network.
The backbone network adopts Darknet-53, of which the first 52 layers are used, without the fully connected layer. The feature extraction network is a fully convolutional network. It is mainly composed of 3 × 3 and 1 × 1 convolution kernels and a large number of shortcut links with residual units [35][36][37]. The structure of the feature extraction network is shown in Figure 2. The network uses strided convolution kernels instead of pooling layers to reduce the negative gradient effects brought by pooling. We also adopt extensive data augmentation and batch normalization to avoid overfitting. In order to enhance the accuracy of the algorithm for small object detection, the network adopts upsampling and fusion methods similar to FPN [38] to implement the multiscale feature maps.
As shown in Figure 3, we assume the size of the input image to be 416 × 416. We extract three feature maps of different scales from the 26th, 43rd, and 52nd layers of the feature extraction network in Figure 2. The scales of the extracted feature maps are 52 × 52, 26 × 26, and 13 × 13, respectively. The feature fusion network outputs three feature maps of different scales with upsampling and fusion. The top layer with a size of 13 × 13 is concatenated with the 26 × 26 feature map after upsampling once. Then it is concatenated with the 52 × 52 feature map after upsampling twice. In this way, the high-level features from the top layer are passed to the low layers, which makes the model better at detecting small objects on low-level feature maps. Finally, the network generates three feature maps of different scales which are 1/8, 1/16, and 1/32 of the original image. The output layers at the 3 scales are also convolutional. In our experiments, our dataset has twenty-one classes, comprising twenty general categories and one text category, and we predict 3 bounding boxes of different sizes at the feature maps of each scale. The shape of the output tensor is N × N × [3 × (4 + 1 + 21)], where N is the scale of the feature map, 3 is the number of anchor boxes per grid cell at each scale, 4 is the number of coordinate offsets of the bounding box, 1 is the objectness confidence prediction, and 21 is the number of object classes. The network adopts the anchor-based mechanism. Each grid cell of the feature maps predicts 3 bounding boxes according to the 3 anchor boxes assigned to that scale. There are in total 9 anchor boxes of different scales, which are generated by k-means clustering. The 9 clusters are computed on the COCO dataset [39]. The anchor boxes in the feature maps of different scales are shown in Figure 4. The object-text detection network simultaneously predicts bounding boxes of texts and general objects conditioned on its input feature maps.
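To make the output tensor shape concrete, the following minimal Python sketch computes the shape of each detection head from the input size and the number of classes. The function name `head_output_shapes` and its parameters are assumptions of this sketch; the strides 32, 16, and 8 correspond to the 1/32, 1/16, and 1/8 feature maps described above.

```python
def head_output_shapes(input_size=416, num_classes=21, anchors_per_scale=3,
                       strides=(32, 16, 8)):
    """Return (N, N, channels) for each detection scale.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    # Per anchor: 4 box offsets + 1 objectness score + one score per class.
    channels = anchors_per_scale * (4 + 1 + num_classes)
    return [(input_size // s, input_size // s, channels) for s in strides]


for shape in head_output_shapes():
    print(shape)
```

With the default arguments, each grid cell carries 3 × (4 + 1 + 21) = 78 values, on grids of 13 × 13, 26 × 26, and 52 × 52 cells.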
At each grid cell of the associated feature map, it outputs the objectness confidence, classification score, and coordinate offsets relative to its associated anchor boxes in a convolutional manner. The object-text detection network adopts logistic regression to predict the bounding boxes and the objectness score on each anchor. Only the anchor with the highest objectness score is calculated. Each object can be detected by only one anchor. This step is performed before prediction, which removes unnecessary anchors and reduces the amount of calculation. In bounding box regression, the network outputs the coordinate offsets. The formula that converts offsets to bounding box coordinates is defined as

$$b_x = \sigma(t_x) + c_x, \quad b_y = \sigma(t_y) + c_y, \quad b_w = p_w e^{t_w}, \quad b_h = p_h e^{t_h}, \tag{1}$$

where $b_x$, $b_y$, $b_w$, and $b_h$ are the coordinates of the bounding box, $t_x$, $t_y$, $t_w$, and $t_h$ are the offsets output by the network, $c_x$, $c_y$, $p_w$, and $p_h$ are the coordinates of the anchor box, and $\sigma(\cdot)$ represents the sigmoid function.
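The offset-to-box conversion described above can be sketched in Python using the standard YOLOv3 parameterization, in which a sigmoid is applied to the center offsets and an exponential to the size offsets. The raw offset names `t_x`, `t_y`, `t_w`, `t_h` and the function name `decode_box` are assumptions of this sketch, and the coordinates are expressed in grid-cell units.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(offsets, cell, anchor):
    """Convert raw network offsets to a box (b_x, b_y, b_w, b_h).

    offsets: (t_x, t_y, t_w, t_h) predicted by the network (assumed names)
    cell:    (c_x, c_y) top-left corner of the grid cell
    anchor:  (p_w, p_h) width and height of the anchor box
    """
    t_x, t_y, t_w, t_h = offsets
    c_x, c_y = cell
    p_w, p_h = anchor
    b_x = sigmoid(t_x) + c_x   # center stays inside the predicting cell
    b_y = sigmoid(t_y) + c_y
    b_w = p_w * math.exp(t_w)  # anchor size rescaled by a positive factor
    b_h = p_h * math.exp(t_h)
    return b_x, b_y, b_w, b_h
```

Zero offsets place the box center in the middle of the cell and reproduce the anchor size, which is a convenient sanity check when wiring up a decoder.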

Loss Function of the Object-Text Detection Network.
Objectness confidence is the predicted probability that an object-text exists in the anchor box. The objectness confidence loss adopts binary cross entropy. The objectness confidence loss function is defined as

$$L_{conf}(o, c) = -\sum_{i} \left[ o_i \ln c_i + (1 - o_i) \ln (1 - c_i) \right],$$

where $o_i \in \{0, 1\}$ represents the existence of the object-text in anchor box $i$ and $c_i$ represents the sigmoid probability of the existence of the bounding box.
Object-text classification score is the probability of the class to which the object-text belongs. The object-text class loss function is defined as

$$L_{cla}(O, C) = -\sum_{i} \sum_{j} \left[ O_{ij} \ln C_{ij} + (1 - O_{ij}) \ln (1 - C_{ij}) \right],$$

where $O_{ij} \in \{0, 1\}$ represents the existence of the object-text class $j$ in anchor box $i$ and $C_{ij}$ represents the sigmoid probability of class $j$ for bounding box $i$.
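As a minimal sketch of the two classification-type losses, the Python code below assumes that both are summed binary cross entropies over sigmoid outputs. The source states this for the objectness confidence loss; applying the same form to the class loss, the clipping constant `eps`, and the function names are assumptions of this sketch.

```python
import math


def binary_cross_entropy(targets, probs, eps=1e-7):
    """Summed binary cross entropy between 0/1 targets and probabilities."""
    total = 0.0
    for o, c in zip(targets, probs):
        c = min(max(c, eps), 1.0 - eps)  # clip to keep the logarithms finite
        total -= o * math.log(c) + (1 - o) * math.log(1.0 - c)
    return total


def class_loss(target_rows, prob_rows):
    """Sum the per-class cross entropy over anchor boxes.

    target_rows[i][j] is 1 if class j is present in anchor box i, and
    prob_rows[i][j] is the predicted sigmoid probability of that class.
    """
    return sum(binary_cross_entropy(t, p)
               for t, p in zip(target_rows, prob_rows))
```

Using an independent sigmoid per class, rather than a softmax over classes, lets a box carry more than one label, which is the usual motivation for this form in YOLOv3.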
The object-text detection model predicts the coordinate offsets between anchor boxes and bounding boxes. Equation (1) is used to convert the offsets to the coordinates of the bounding box. The object-text location loss adopts the GIoU [40] (Generalized Intersection over Union) method to calculate the error between the bounding box and the ground truth. The GIoU loss algorithm is defined as in Algorithm 1.
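We use $L_{loc} = \mathrm{GIoU\_Loss}$ as the object-text location loss. The total loss function can be represented as

$$L = \alpha L_{conf} + \beta L_{cla} + \gamma L_{loc},$$

where $L_{conf}$ is the objectness confidence loss, $L_{cla}$ is the object-text class loss, and $\alpha$, $\beta$, and $\gamma$ are the weights of each loss. We empirically set $\alpha = \beta = \gamma = 1$.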
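For illustration, a GIoU loss for two axis-aligned boxes can be sketched as follows. The sketch follows the published definition of GIoU [40], with the loss taken as 1 − GIoU; the box format (x_min, y_min, x_max, y_max) and the function name are assumptions of this sketch.

```python
def giou_loss(box_a, box_b):
    """GIoU loss (1 - GIoU) for two boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    if area_a <= 0 or area_b <= 0:
        raise ValueError("boxes must have positive area")

    # Intersection rectangle (empty when the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = area_a + area_b - inter

    # Smallest axis-aligned region enclosing both boxes.
    enclose = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))

    iou = inter / union
    giou = iou - (enclose - union) / enclose
    return 1.0 - giou
```

Unlike a plain IoU loss, this value keeps changing when the boxes are disjoint, because the enclosing-region term still depends on how far apart they are.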

NMS Module.
The NMS module is applied to remove the redundant object-text bounding boxes detected from the same object. We apply NMS to the object-text bounding boxes after object-text detection.
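A greedy form of NMS can be sketched as below. The IoU threshold of 0.5, the per-class suppression rule, and the detection tuple layout are assumptions of this sketch, since the text does not specify them.

```python
def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.5):
    """Greedy NMS over a list of (box, score, class_id) tuples.

    Returns the kept detections, highest score first. Boxes of different
    classes never suppress each other (an assumption of this sketch).
    """
    remaining = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [d for d in remaining
                     if d[2] != best[2] or iou(d[0], best[0]) < iou_threshold]
    return kept
```

Keeping suppression within a class means a text box lying on a vehicle is not removed merely because it overlaps the vehicle box.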

Text Recognition.
After the object positions are detected by the object-text detection network, we pick out the text-type bounding boxes based on the text class. Firstly, the text extractor extracts the text regions corresponding to the coordinates of the text bounding boxes produced by the object-text detection module. Then the text recognition module preprocesses the extracted text regions by resizing them before they are fed into the convolutional neural network. We scale the text regions to (32, 100, 3) with a fixed aspect ratio, where 32 is the fixed height, 100 is the maximum length, and 3 represents the number of image channels. Finally, we use the scaled text region as the input of the convolutional layers.
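The size arithmetic of this preprocessing step can be sketched as follows. The sketch assumes that a region is scaled to height 32 with its aspect ratio preserved, that the width is capped at 100 (which squeezes very wide regions), and that narrower regions are padded on the right up to 100; the capping and padding policy and the function name are assumptions, as the text only gives the target shape.

```python
def text_region_layout(width, height, target_h=32, max_w=100):
    """Return (new_width, right_padding) for a cropped text region.

    Hypothetical helper: scales to `target_h` keeping the aspect ratio,
    caps the width at `max_w`, and pads the remainder on the right.
    """
    if width <= 0 or height <= 0:
        raise ValueError("region must have positive size")
    scaled_w = max(1, round(width * target_h / height))
    new_w = min(scaled_w, max_w)
    return new_w, max_w - new_w
```

For example, a 64 × 64 crop becomes 32 pixels wide with 68 columns of padding, whereas a 200 × 64 crop fills the full width of 100.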
We adopt the CRNN model as our text recognizer. Firstly, the convolutional layers extract the feature maps from the preprocessed text region. A sequence of feature vectors is extracted from left to right from the feature maps. Then each frame of the sequence, which represents a vertical region of the original text image, becomes the input of the recurrent layers. The recurrent layers adopt the deep bidirectional LSTM [41] (long short-term memory) to encode the sequence of feature vectors. Finally, we adopt CTC to predict the text label corresponding to the sequences from the recurrent layers.
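To illustrate how CTC turns per-frame outputs into a text label, the sketch below implements greedy (best-path) decoding: take the most likely class in each frame, collapse consecutive repeats, and drop the blank symbol. Greedy decoding, the blank index 0, and the function name are assumptions of this sketch; the text states only that CTC is used.

```python
def ctc_greedy_decode(frame_scores, alphabet, blank=0):
    """Best-path CTC decoding.

    frame_scores: one list of class scores per frame; index `blank` is the
                  CTC blank and index k > 0 maps to alphabet[k - 1].
    """
    best_path = [max(range(len(f)), key=f.__getitem__) for f in frame_scores]
    chars = []
    previous = blank
    for index in best_path:
        # Emit a character only when it is not blank and not a repeat.
        if index != blank and index != previous:
            chars.append(alphabet[index - 1])
        previous = index
    return "".join(chars)
```

A blank between two identical classes is what allows doubled letters: the frame path a, a, blank, a decodes to "aa", whereas a, a, a decodes to "a".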

Experiment Setup.
The object-text detection network is trained on the training images using Adam (adaptive moment estimation) [42]. We initialize the model with weights pretrained on the COCO dataset. We divide the training process into two stages. In the first stage, we freeze the backbone network and train only the classification and regression network. In the second stage, we train the whole network.
When we train the object-text detection network, in the first two epochs we gradually increase the learning rate from low to high, which is called the "warmup stage" method. The network converges quickly with the large learning rate. After that, we need to stabilize the network with a low learning rate to avoid gradient oscillation. We adopt the cosine annealing strategy proposed by Loshchilov et al. [43]. At the i-th training step, the learning rate decays with cosine annealing as follows:

$$\eta_i = \eta_{end} + \frac{1}{2} \left( \eta_{init} - \eta_{end} \right) \left( 1 + \cos \left( \frac{T_{cur} - T_{warm}}{T_{train} - T_{warm}} \pi \right) \right),$$

where $\eta_{init}$ is the initial value of the learning rate, which is set to $10^{-4}$, $\eta_{end}$ is the end value, which we set to $10^{-6}$, and $T_{cur}$ is the number of steps that have been performed. $T_{train}$ is the total number of steps during training. $T_{warm}$ represents the number of warmup steps in the first two epochs. The learning rate curve is shown in Figure 5. The training algorithm of the object-text detection model is summarized in Algorithm 2.
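The learning rate schedule can be sketched in Python as a warmup phase followed by cosine annealing between the initial and end values given above. The linear shape of the warmup and the function name are assumptions of this sketch; the text states only that the rate is increased gradually during the first two epochs.

```python
import math


def learning_rate(t_cur, t_warm, t_train, lr_init=1e-4, lr_end=1e-6):
    """Learning rate at step `t_cur` (warmup, then cosine annealing)."""
    if t_cur < t_warm:
        # Assumed linear warmup from 0 up to lr_init.
        return lr_init * t_cur / t_warm
    progress = (t_cur - t_warm) / (t_train - t_warm)
    return lr_end + 0.5 * (lr_init - lr_end) * (1.0 + math.cos(math.pi * progress))


# Example: 100 warmup steps out of 1000 training steps.
for step in (0, 50, 100, 550, 1000):
    print(step, learning_rate(step, t_warm=100, t_train=1000))
```

The rate peaks at the end of the warmup and then falls smoothly to the end value at the last training step.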
We use the CRNN model proposed by Shi et al. [26] as the text recognition network. The experiment uses a model pretrained on the synth90k dataset [44] to initialize the parameters of the text recognition model. We use the NEOCR [45] dataset and the SCUT FORU dataset to fine-tune the pretrained model. We set the training parameters as follows: The model training runs for 2000000 epochs. The batch size is 32. The initial learning rate is 0.01 with exponential decay of 0.1 every 500000 epochs. The experiment adopts gradient descent with momentum [46] to train the text recognition network. We set the momentum parameter to 0.9. The training algorithm of the text recognition model is summarized in Algorithm 3.
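As a small illustration of the recognizer's learning rate schedule, the sketch below assumes a staircase decay, in which the rate is multiplied by 0.1 after every 500000 training units. The staircase form (as opposed to a continuous exponential decay) and the function name are assumptions of this sketch.

```python
def recognizer_learning_rate(step, lr_init=0.01, decay_rate=0.1,
                             decay_steps=500_000):
    """Staircase exponential decay: multiply by `decay_rate` every `decay_steps`."""
    return lr_init * decay_rate ** (step // decay_steps)
```

Under this assumption the rate takes four values over the 2000000 training units, from 0.01 down to roughly 0.00001 in the final quarter.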

Dataset.
We evaluate the proposed method on four datasets: VOC2007 [47], VOC2012 [48], ICDAR 2013 [49], and SCUT FORU DB. VOC2007 and VOC2012 are object detection datasets. ICDAR 2013 and SCUT FORU DB are text detection datasets. We integrate them into a comprehensive dataset for detecting text-type objects and general objects simultaneously.
VOC2007 is a challenge to recognize objects from a number of visual object classes in realistic scenes. The database contains a total of 9963 annotated images. We use 5011 images as the training set and 4952 images as the testing set. There are twenty object classes in the dataset. VOC2012 is the same challenge as VOC2007 with a larger training set. There are 17125 training images in total. The testing set has not been released yet. The COCO dataset is a large-scale dataset for object detection, segmentation, and captioning. It contains more than 330K images, of which more than 200K are labeled. The COCO dataset has 80 object categories in total.
In the experiment, we integrate the datasets into a comprehensive dataset of 29265 images in total. There are 23565 training images and 5700 testing images. Since these datasets have different annotation formats, we need to convert them into a unified annotation format. The coordinate format of the annotation is defined as (x_min, y_min, x_max, y_max). We shuffle the combined dataset before feeding it into the model. Text recognition is performed with the CRNN model proposed by Shi et al. We use a model pretrained on the synth90k dataset and use the NEOCR dataset and the SCUT FORU dataset to fine-tune the pretrained model. The annotations in the NEOCR dataset contain characters that are not in the English alphabet. We have modified the annotations by replacing the special characters with English letters that look similar. The text images in the SCUT FORU dataset are cropped from the original images according to the coordinates in the annotations. The text images are resized to 32 × 100 before they are fed into the text recognition network.
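Converting annotations to the unified (x_min, y_min, x_max, y_max) format can be sketched as below. The two source layouts shown here (top-left corner plus size, and center plus size) are hypothetical examples of differing formats, not a statement about the layout of any particular dataset used in this work; the function names are likewise assumptions.

```python
def from_xywh(x, y, w, h):
    """Top-left corner plus width/height -> (x_min, y_min, x_max, y_max)."""
    return (x, y, x + w, y + h)


def from_center(cx, cy, w, h):
    """Center plus width/height -> (x_min, y_min, x_max, y_max)."""
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def clip_box(box, img_w, img_h):
    """Clamp a unified-format box to the image bounds."""
    x_min, y_min, x_max, y_max = box
    return (max(0, x_min), max(0, y_min), min(img_w, x_max), min(img_h, y_max))
```

Clamping after conversion guards against annotations that extend slightly past the image border, which would otherwise produce invalid crops for the text recognizer.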

Evaluation Metrics.
We use mAP (mean Average Precision) as the measurement to evaluate the detection model performance. The mAP calculation is based on the following metrics [50]:

Recall. Recall is the percentage of true positives detected among all relevant ground truths. The recall is defined as

$$\mathrm{Recall} = \frac{TP}{TP + FN},$$

where $TP$ is the number of true positives and $FN$ is the number of false negatives.

PR (Precision-Recall) Curve. The PR curve is a good way to evaluate the performance of an object detector. The precision and recall values of the detected objects are plotted to get a PR curve. The area under the PR curve is called AP (Average Precision). The AP calculation is defined as

$$AP = \int_{0}^{1} P(R) \, dR,$$

where $P(R)$ is the measured precision against recall.

mAP. The mAP is the average of the AP over all categories.
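The metrics above can be sketched in Python as follows. The area under the PR curve is computed here with all-point interpolation over a monotonically decreasing precision envelope; this interpolation choice, the standard definition of precision as TP / (TP + FP), and the function names are assumptions of this sketch, since the text defines AP only as the area under the curve.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true positive, false positive, false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def average_precision(recalls, precisions):
    """Area under the PR curve; `recalls` must be sorted in ascending order."""
    mrec = [0.0] + list(recalls) + [1.0]
    mpre = [0.0] + list(precisions) + [0.0]
    # Replace each precision by the maximum precision at any higher recall.
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    return sum((mrec[i] - mrec[i - 1]) * mpre[i] for i in range(1, len(mrec)))


def mean_average_precision(ap_per_class):
    """Average the per-class AP values (dict: class name -> AP)."""
    return sum(ap_per_class.values()) / len(ap_per_class)
```

The PR points are obtained by sweeping the confidence threshold over the ranked detections of one class and recording precision and recall at each step.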

Analysis of Experimental Results.
In order to verify the choice of YOLOv3 as the detection network in the proposed method, we compare the detection performance of different detection frameworks, namely, Fast R-CNN, Faster R-CNN, SSD, YOLO, YOLOv2, and YOLOv3. We also compare different input sizes for YOLOv3. The results are shown in Table 1. All the detection frameworks being compared are trained on the VOC2007 and VOC2012 training datasets, and the mAP is tested on the VOC2007 testing dataset. As can be seen in Table 1, the YOLOv3 framework with a network input size of 416 × 416 achieves the highest mAP among the frameworks being tested. Further, the later YOLO versions, YOLOv2 and YOLOv3, generally achieve higher mAP than the other frameworks. We therefore conclude that YOLOv3 is a well-justified choice for the detection network in the proposed method.
After we have confirmed the performance of YOLOv3 in object detection, we further train it on the comprehensive dataset, which is composed of the general object detection datasets VOC2007 and VOC2012 and the text detection datasets SCUT FORU and ICDAR2013. Then we test the performance of the frameworks on different testing datasets. The general object detection testing dataset VOC2007 and the text detection datasets SCUT FORU and ICDAR2013 are used. We compare the performance of different detection frameworks on 3 categories out of the total 20 categories in the PASCAL VOC 2007 dataset. As shown in Table 2, the performance of YOLOv3 on the 3 categories is much better than that of the other detection frameworks. This verifies that YOLOv3 has excellent performance on object detection. Our model achieves 70.0 mAP in the text detection task. We do not list the text detection performance of the other methods because they do not feature text detection and recognition.

Mathematical Problems in Engineering
One may notice that the mAP of YOLOv3 in Table 2 is lower than that in Table 1. This is because we further train the YOLOv3 network on the text detection datasets. The detection of text objects reduces the mAP to a certain extent. In addition, the text objects in the VOC2007 and VOC2012 datasets are not marked in the annotations. The texts detected in VOC2007 and VOC2012 are therefore counted as 'False Positives', and thus the mAP decreases.
Because the comprehensive dataset, which consists of general object datasets and text datasets, is imperfectly annotated, we improve its annotation information. We label the text objects in VOC2007 and VOC2012 and the general objects in ICDAR2013 and SCUT. This makes the comprehensive dataset more accurate for object detection, reduces the false positive rate of the detection model in training and testing, and improves the detection accuracy as a whole. As can be seen from Table 2, the detection model used in the experiment has the highest detection accuracy on the YOLOv3 framework with the size of 544 × 544. Table 3 compares the detection results on the comprehensive dataset before and after the modification for the YOLOv3 framework with the size of 544 × 544. As shown in Table 3, the detection network trained on the modified comprehensive dataset has higher accuracy on person and text objects than the one trained on the original dataset. The detection accuracy of the text object is significantly improved. The mAP on the modified comprehensive dataset has also improved.

Performance on Object-Text Detection and Recognition.
The model we propose performs two tasks: object-text detection and text recognition. The object-text detection network can detect general objects and text objects simultaneously. The text contents of the detected text regions from the detection network are recognized by the text recognition network. This section shows the detection and recognition results of test images in the experiment.
Algorithm 1: GIoU loss algorithm.
Input: The regions of the GT and BB (GT, BB ⊆ S ∈ R^n), where S is the input image size.
Output: GIoU loss.
Step 1. Calculate the smallest enclosing region C, C ⊆ S ∈ R^n;
Step 2. IoU = |GT ∩ BB| / |GT ∪ BB|;

As shown in Figure 6, we mainly show some detection results of test images in transportation. The detection model can detect multiple objects in one image. It has good performance on both small objects and large objects. The text detection dataset contains billboards, signboards, road signs, etc. Some texts exist in complex environments and they might be occluded. As shown in Figure 7, the detection model can detect the text in complex scenes. However, some text bounding boxes in images are not accurate enough, which may cause wrong recognition of the texts. The object-text detection model we propose can simultaneously detect the text and general objects. Some detection examples are demonstrated in Figure 8.

The text recognition model can recognize the text contents of the text regions detected by the detection model. Figure 9 demonstrates some examples of the text recognition model on road signs. As shown in Figure 10, the text recognition model can recognize not only horizontal text but also affine-distorted text. Affine-distorted texts are common due to variations of the camera view. Yet the proposed model not only locates these texts but also recovers their contents. Figure 11 gives a more application-specific demonstration of the proposed object detection and text recognition model. In this scenario, information extracted by the text recognition module identifies the detected object. We use some car images with plates as the proof of concept. The object-text detection model we have proposed can simultaneously detect the car and the plates on the car. Then the text recognition model recognizes the text contents on the plates.
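One simple way to use a recognized text as the identifier of a detected object, as in the car and plate scenario, is sketched below. The association rule used here (a text is attached to the smallest object box that contains the center of the text box) is purely an assumption of this sketch; the text does not specify how texts are matched to objects.

```python
def center(box):
    return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)


def contains(box, point):
    return box[0] <= point[0] <= box[2] and box[1] <= point[1] <= box[3]


def area(box):
    return (box[2] - box[0]) * (box[3] - box[1])


def attach_text_to_objects(objects, texts):
    """Attach recognized strings to detected objects.

    objects: list of (label, box); texts: list of (string, box), with boxes
    given as (x_min, y_min, x_max, y_max).
    Returns a list of (label, [strings]) in the order of `objects`.
    """
    attached = [[] for _ in objects]
    for string, text_box in texts:
        point = center(text_box)
        candidates = [i for i, (_, obj_box) in enumerate(objects)
                      if contains(obj_box, point)]
        if candidates:
            # Prefer the tightest enclosing object (hypothetical rule).
            owner = min(candidates, key=lambda i: area(objects[i][1]))
            attached[owner].append(string)
    return [(label, strings) for (label, _), strings in zip(objects, attached)]
```

Texts whose centers fall outside every object box are left unattached, which keeps free-standing signs from being assigned to a nearby vehicle.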

Conclusions
We present an object-text detection and recognition model in this article. The model not only detects texts and general objects simultaneously but also recognizes the text contents inside the detected text bounding boxes. The proposed method thus combines object detection and text recognition. In some application scenarios, the recognized text contents around the general objects can be used as the identifier to distinguish the object. The proposed method has potential in extensive applications, such as intelligent transportation systems and autonomous driving. Possible directions for future research include the following:
(1) Improving the dataset: this refers to adding more samples which contain both text and general objects to train the network.
(2) Improving the detection network on text detection: for example, anchor boxes which are suitable for the text size can be used. We can use k-means to cluster on the dataset containing text objects to make the sizes of the generated anchor boxes more suitable for text.
(3) Optimizing the connection between the detection network and the recognition network: in our proposed model, the connection between the detection and recognition networks is the text region which is cropped from the original image according to the coordinates of the detected text boxes. In order to optimize the connection, we can extract the feature map from the detection network as the input of the recognition network. An affine transformation is applied to the feature map extracted from the detection network to fit the input size of the recognition network. Thus, during backpropagation, the gradients can flow from the recognition network back to the detection network. The detection and recognition model can then be regarded as an end-to-end model.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.