Dense Residual Network: Enhancing global dense feature flow for character recognition
Introduction
Deep CNNs with multiple layers have made significant progress and achieved great success in many vision tasks, such as image recognition, speech recognition and video person recognition, by learning deeper representations and hierarchical information (Angelov and Soares, 2020, Bacciu et al., 2020, Gao, et al., 2020, Mhaskar and Poggio, 2020, Xie, et al., 2020, Xie et al., 2017). This success has also been demonstrated in optical character recognition (OCR), which reads scene text in images and predicts a sequence of characters from machine-generated text (Ahmed et al., 2020, Bai et al., 2016, Caulfield and Maloney, 1969, He, et al., 2020, Liao, et al., 2019, Wang and Hu, 2017, Zhang, et al., 2019, Zhang, Tang, Zhang, Wang, Qin, and Wang, 2020). OCR has been widely applied in various applications, e.g., road sign recognition, identification, license plate recognition and assistive services for the blind.
For the task of OCR, two crucial sub-tasks are text line detection and text recognition (Ahmed et al., 2020, He, et al., 2020, Zhang, Tang, Zhang, Wang, Qin, and Wang, 2020). The first extracts the text regions from images and the second recognizes the textual contents of the identified regions. In this paper, we mainly discuss the task of text recognition rather than text line detection. Since images can have complicated backgrounds and complex contents, OCR remains a challenging task. To tackle it, many OCR models have been proposed, e.g., the arbitrary orientation network (AON) (Cheng, et al., 2018), the end-to-end trainable scene text recognition system (ESIR) (Zhan & Lu, 2019) and the convolutional recurrent neural network (CRNN) (Li, Cao, Zhao, & Cui, 2013). AON recognizes arbitrarily oriented text and achieved impressive results on both irregular and regular text images. ESIR designs a novel line-fitting transformation to estimate the pose of text lines in scenes and develops an iterative rectification framework for scene text recognition. CRNN combines two prominent neural network families, i.e., CNNs and Recurrent Neural Networks (RNNs) (Cho, et al., 2014, Graves et al., 2013, McBride-Chang et al., 2003). More specifically, CNNs extract features from images (Jia, Zhang, Zhang, & Liu, 2021), RNNs predict the label distribution of each frame, and a Connectionist Temporal Classification (CTC) module (Graves, Fernández, Gomez, & Schmidhuber, 2006) transforms these frame-wise predictions into the final label sequence. Note that recent work revealed that even without the recurrent layers, the simplified models can still achieve promising results with higher efficiency (Gao et al., 2017, Tang, et al., 2020). As such, the framework of CNNs plus CTC is a feasible and efficient solution.
To extract features in convolutional layers, many existing convolution networks can be used, e.g., Dense Convolutional Network (DenseNet) (Huang, Liu, Van Der Maaten, & Weinberger, 2017), Residual Network (ResNet) (He, Zhang, Ren, & Sun, 2016) and Residual Dense Network (RDN) (Zhang, Tian, Kong, Zhong, & Fu, 2018).
It is noteworthy that the texts in images have different scales, angles of view and aspect ratios (Zhang et al., 2018). Although the hierarchical deep features extracted by a deep network could provide more clues for recognition, most existing CNN-based models either neglect hierarchical features for recognition or focus only on learning local hierarchical features. For example, the dense block of DenseNet connects its inside layers tightly, so it has a strong ability to learn local features due to its intrinsic structure. However, like most existing CNN models, DenseNet simply stacks dense blocks (Huang et al., 2017, Zhang, et al., 2020) and transition blocks, which neglects the global properties of features. In addition, combining features by concatenation in DenseNet causes a sharp increase in the number of input channels at each layer and heavy computation as the dense block gets deeper, which restricts the depth of networks built from dense blocks. RDN proposed a residual dense block (RDB) (Zhang et al., 2018) for image super-resolution. RDB has some clear advantages over the residual block of ResNet and the dense block. Specifically, the structure of RDB contains densely connected layers, local feature fusion (LFF) and local residual learning (LRL) modules, which can fully capture local hierarchical features and learn local residual dense features. However, RDB uses the standard dense block, so it inherits the same disadvantage of high computational cost.
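The channel growth caused by concatenation mentioned above can be made explicit with a small arithmetic sketch. Following DenseNet's convention, layer l of a dense block receives c0 + l·k input channels, where c0 is the block's input width and k is the growth rate; the concrete values below are illustrative.

```python
# Sketch of how concatenation widens each layer's input in a dense block:
# layer l sees c0 + l * k channels (c0 = block input width, k = growth rate),
# so the input width grows linearly with depth. Values are illustrative.

def dense_block_input_channels(c0, growth_rate, num_layers):
    """Input channel count seen by each of the num_layers conv layers."""
    return [c0 + l * growth_rate for l in range(num_layers)]

widths = dense_block_input_channels(c0=64, growth_rate=32, num_layers=6)
print(widths)  # → [64, 96, 128, 160, 192, 224]
```

Since the cost of a convolution scales with its input width, this linear growth is exactly what makes deep stacks of standard dense blocks expensive.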
In this paper, we propose a deep convolutional network model with CTC, which can fully use and enhance the local and global hierarchical features of text images while reducing the computational cost. In summary, the main contributions of this paper are as follows:
- (1)
Technically, we propose a new effective deep representation learning and character recognition network, i.e., the dense residual network (DRN). The proposed DRN can fully use all the local and global hierarchical features and, moreover, enhance the global dense feature flow, whereas global features are usually ignored in existing models. That is, DRN enhances both local and global feature learning by fully exploiting hierarchical features from all convolutional layers.
- (2)
We also propose an RDB-based refined residual dense block (r-RDB), which retains the local feature fusion and local residual learning abilities of the original RDB while reducing the computational cost of the inner layers by refining the structure of the representation block.
- (3)
After learning multi-level local residual dense features with r-RDBs, we use the sum operation and several r-RDBs to construct a new global dense block (GDB). GDB is designed by analogy to the dense block, but it can adaptively learn global dense residual features by tightly connecting the features of all r-RDBs in the form of dense connections in a holistic way. Note that discovering global deep features is the major innovation of this paper, yet it has been ignored by the vast majority of existing deep networks.
- (4)
We also design a down-sampling block with two convolutional layers of stride 2 to reduce the size of the global features and, thanks to the larger number of kernel channels in this block, extract more informative deeper global features. This block also avoids the loss of important feature information and makes the parameters of the whole framework learnable. To further reduce the computational cost and improve efficiency, we replace the traditional convolution with depth-wise separable convolution.
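The saving from the depth-wise separable convolution mentioned in contribution (4) can be sketched with a parameter count, as popularized by MobileNets: a standard K×K convolution is replaced by a depth-wise K×K convolution plus a point-wise 1×1 convolution. The channel and kernel sizes below are illustrative, and biases are omitted.

```python
# Parameter-count comparison: standard KxK convolution vs. its
# depthwise-separable replacement (depthwise KxK + pointwise 1x1).
# Sizes are illustrative; bias terms are omitted for simplicity.

def standard_conv_params(c_in, c_out, k):
    return c_in * c_out * k * k

def separable_conv_params(c_in, c_out, k):
    depthwise = c_in * k * k   # one KxK filter per input channel
    pointwise = c_in * c_out   # 1x1 convolution mixing channels
    return depthwise + pointwise

std = standard_conv_params(c_in=128, c_out=128, k=3)
sep = separable_conv_params(c_in=128, c_out=128, k=3)
print(std, sep)  # → 147456 17536
```

For a 3×3 kernel with 128 input and output channels this is roughly an 8× parameter reduction, which is why the substitution lowers the computational cost of the whole framework.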
The paper is outlined as follows. Section 2 briefly reviews the related models. In Section 3, we introduce the r-RDB and the global dense block (GDB). We present the deep framework of our fast dense residual network (DRN) in Section 4. Section 5 shows the experimental settings and results. In Section 6, we conclude and discuss future work.
Related work
In this section, we briefly introduce the residual block, dense block and residual dense block, which are closely related to our proposed global dense block and deep network model.
Residual block in ResNet. ResNet mainly solves the problem of network degradation (He et al., 2016). Fig. 1(a) shows the structure of one residual block. Let x, F(x) and H(x) denote the input features and the output features without and with the short connection, respectively; then H(x) = F(x) + x.
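The identity form of the short connection can be illustrated with a toy sketch: the block only has to learn the residual mapping F(x), and the identity path adds x back element-wise. The stand-in mapping below is illustrative, not an actual convolutional branch.

```python
# Toy illustration of the residual (short) connection H(x) = F(x) + x:
# the block learns only the residual F(x); the identity path adds x back.

def residual_block(x, f):
    # Element-wise sum of the learned residual F(x) and the identity x.
    fx = f(x)
    return [a + b for a, b in zip(fx, x)]

x = [1, 2, 3]
f = lambda v: [2 * a for a in v]  # illustrative stand-in for F
print(residual_block(x, f))  # → [3, 6, 9]
```

If F collapses toward the zero mapping, H degenerates to the identity, which is what lets very deep stacks of such blocks avoid the degradation problem.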
Dense Residual Network (DRN) for character recognition
We first describe the network architecture of our DRN in Fig. 2; it consists of three major parts, i.e., a global dense block (GDB), a down-sampling block and a transcription layer. Traditional CNN models usually neglect hierarchical features for image representation or focus only on local hierarchical features. It should be noted that all the convolution operations in our DRN refer to an operation group including batch normalization (BN) (Ioffe & Szegedy, 2015).
Refined residual dense block (r-RDB) and Global Dense Block (GDB)
We mainly introduce the refined residual dense block (r-RDB) and the global dense block (GDB). In what follows, we give their definitions and illustrate their structures.
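The global dense connection idea described in the contributions (fusing the outputs of all preceding r-RDBs with a sum instead of concatenation) can be sketched as follows. The per-block transforms are illustrative stand-ins for r-RDBs, not the paper's actual blocks.

```python
# Sketch of sum-based global dense connections: each block receives the
# element-wise sum of the input and all previous block outputs, so global
# features are reused without the channel growth caused by concatenation.
# The "blocks" below are illustrative stand-ins for r-RDBs.

def global_dense_forward(x, blocks):
    outputs = [x]  # features available for global reuse
    for block in blocks:
        # Dense fusion by element-wise sum keeps the width constant.
        fused = [sum(vals) for vals in zip(*outputs)]
        outputs.append(block(fused))
    return outputs[-1]

blocks = [lambda v: [a + 1 for a in v],  # stand-in block 1
          lambda v: [2 * a for a in v]]  # stand-in block 2
print(global_dense_forward([1, 2], blocks))  # → [6, 10]
```

Because the fusion is a sum, every block sees the same feature width regardless of how many earlier blocks feed into it, which is the key difference from concatenation-based dense blocks.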
Experimental results and analysis
We evaluate the performance of our DRN for text image representation and recognition. In this study, we mainly consider two recognition tasks: (1) recognizing character strings in images; (2) recognizing handwritten characters in images. For the first task, we compare the results of our DRN with those of several related deep network models, where the CPU and GPU used for all evaluated methods in the experiments are a Xeon E3 1230 and a 1080 Ti, respectively.
Conclusion and future work
In this paper, we investigated the representation learning problem of extending local residual dense learning to global dense residual learning. Technically, we proposed a new dense residual network (DRN) for text image representation and recognition. The refined residual dense block (r-RDB) and the global dense block serve as the basic modules of our DRN, where r-RDB not only retains the advantages of the residual dense block, i.e., local feature fusion and residual learning, but also refines its inner structure to reduce the computational cost.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work is partially supported by the National Natural Science Foundation of China (NSFC 62072151, 61822701, 62036010, 61806035 and U1936217), Anhui Provincial Natural Science Fund, China for Distinguished Young Scholars (2008085J30) and the Fundamental Research Funds for Central Universities of China (JZ2019HGPA0102). Both Yang Wang and Choujun Zhan are the co-corresponding authors of this paper.
References
- Towards explainable deep neural networks (xDNN). Neural Networks (2020)
- A gentle introduction to deep learning for graphs. Neural Networks (2020)
- Self-attention driven adversarial similarity learning network. Pattern Recognition (2020)
- Realtime multi-scale scene text detection with scale-based region proposal network. Pattern Recognition (2020)
- An analysis of training and generalization errors in shallow and deep networks. Neural Networks (2020)
- Mining the displacement of max-pooling for text recognition. Pattern Recognition (2019)
- Using visualization of t-distributed stochastic neighbor embedding to identify immune cell subsets in mouse tumors. Journal of Immunology (2017)
- Cursive script text recognition in natural scene images - Arabic text complexities (2020)
- Strokelets: A learned multi-scale mid-level representation for scene text recognition. IEEE Transactions on Image Processing (2016)
- Links between Markov models and multilayer perceptrons. IEEE Transactions on Pattern Analysis and Machine Intelligence (1990)
- Random forests. Machine Learning
- Improved discrimination in optical character recognition. Applied Optics
- PCANet: A simple deep learning baseline for image classification? IEEE Transactions on Image Processing
- Denoising prior driven deep neural network for image restoration. IEEE Transactions on Pattern Analysis and Machine Intelligence
- Reading scene text with attention convolutional sequence modeling. Neurocomputing
- MobileNets: Efficient convolutional neural networks for mobile vision applications