Neural Networks

Volume 139, July 2021, Pages 77-85

Dense Residual Network: Enhancing global dense feature flow for character recognition

https://doi.org/10.1016/j.neunet.2021.02.005

Abstract

Deep Convolutional Neural Networks (CNNs), such as the Dense Convolutional Network (DenseNet), have achieved great success in image representation learning by capturing deep hierarchical features. However, most existing architectures simply stack convolutional layers and thus fail to fully discover the local and global feature information between layers. In this paper, we mainly investigate how to enhance the local and global feature learning abilities of DenseNet by fully exploiting the hierarchical features from all convolutional layers. Technically, we propose an effective convolutional deep model termed Dense Residual Network (DRN) for the task of optical character recognition. To define DRN, we propose a refined residual dense block (r-RDB) that retains the local feature fusion and local residual learning abilities of the original RDB while reducing the computational cost of its inner layers. After fully capturing local residual dense features, we utilize the sum operation and several r-RDBs to construct a new block termed global dense block (GDB), which imitates the construction of dense blocks to adaptively learn global dense residual features in a holistic way. Finally, we use two convolutional layers to design a down-sampling block that reduces the global feature size and extracts more informative deeper features. Extensive results show that our DRN delivers enhanced results compared with other related deep models.

Introduction

Deep CNNs with multiple layers have made significant progress and achieved great success in many vision tasks, such as image recognition, speech recognition and video person recognition, by learning deeper representations and hierarchical information (Angelov and Soares, 2020, Bacciu et al., 2020, Gao et al., 2020, Mhaskar and Poggio, 2020, Xie et al., 2020, Xie et al., 2017). This success has also been demonstrated in optical character recognition (OCR), which reads scene text in images and predicts a sequence of characters from machine-generated text (Ahmed et al., 2020, Bai et al., 2016, Caulfield and Maloney, 1969, He et al., 2020, Liao et al., 2019, Wang and Hu, 2017, Zhang et al., 2019, Zhang, Tang, Zhang, Wang, Qin, and Wang, 2020). OCR has been widely applied in various applications, e.g., road sign recognition, identification, license plate recognition and assistive services for the blind.

For the task of OCR, two crucial sub-tasks are text line detection and text recognition (Ahmed et al., 2020, He et al., 2020, Zhang, Tang, Zhang, Wang, Qin, and Wang, 2020). The first extracts the text regions from images and the second recognizes the textual contents of the identified regions. In this paper, we mainly discuss text recognition rather than text line detection. Since different images have complicated backgrounds and complex contents, OCR remains a challenging task. To tackle it, many OCR models have been proposed, e.g., the arbitrary orientation network (AON) (Cheng et al., 2018), the end-to-end trainable scene text recognition system (ESIR) (Zhan & Lu, 2019) and the convolutional recurrent neural network (CRNN) (Li, Cao, Zhao, & Cui, 2013). AON recognizes arbitrarily oriented text and achieved impressive results on both irregular and regular text images. ESIR designs a novel line-fitting transformation to estimate the pose of text lines in scenes and develops an iterative rectification framework for scene text recognition. CRNN combines two prominent neural networks, i.e., CNNs and Recurrent Neural Networks (RNNs) (Cho et al., 2014, Graves et al., 2013, McBride-Chang et al., 2003). More specifically, CNNs extract image features (Jia, Zhang, Zhang, & Liu, 2021), RNNs predict the label distribution of each frame, and a Connectionist Temporal Classification (CTC) module (Graves, Fernández, Gomez, & Schmidhuber, 2006) transforms these per-frame predictions into the final label sequence. Note that recent work also revealed that, even without recurrent layers, simplified models can still achieve promising results with higher efficiency (Gao et al., 2017, Tang et al., 2020). As such, the framework of CNNs plus CTC is a feasible and efficient solution. To extract features in the convolutional layers, many existing convolutional networks can be used, e.g., the Dense Convolutional Network (DenseNet) (Huang, Liu, Van Der Maaten, & Weinberger, 2017), the Residual Network (ResNet) (He, Zhang, Ren, & Sun, 2016) and the Residual Dense Network (RDN) (Zhang, Tian, Kong, Zhong, & Fu, 2018).
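
To make the CNN-plus-CTC pipeline concrete, the following PyTorch sketch shows a minimal recognizer: a small CNN collapses the image height into a feature sequence along the width, a linear layer produces per-frame character distributions, and CTC aligns them with the label sequence. The layer sizes, alphabet size and pooling scheme are illustrative assumptions, not the configuration used in this paper.

import torch
import torch.nn as nn

class TinyCTCRecognizer(nn.Module):
    """Minimal CNN + CTC text recognizer (illustrative, not the paper's DRN)."""
    def __init__(self, num_classes=37):  # e.g., 26 letters + 10 digits + CTC blank
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2, 2),                      # halve height and width
            nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.MaxPool2d((2, 1)),                    # halve height only
            nn.AdaptiveAvgPool2d((1, None)),         # collapse height to 1
        )
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x):                  # x: (N, 1, H, W)
        f = self.features(x)               # (N, C, 1, W')
        f = f.squeeze(2).permute(2, 0, 1)  # (W', N, C): a sequence of frames
        return self.classifier(f).log_softmax(-1)  # (T, N, classes) for CTCLoss

model = TinyCTCRecognizer()
log_probs = model(torch.randn(4, 1, 32, 128))   # here T = 64 frames
targets = torch.randint(1, 37, (4, 10))         # dummy label sequences
input_lens = torch.full((4,), log_probs.size(0), dtype=torch.long)
target_lens = torch.full((4,), 10, dtype=torch.long)
loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lens, target_lens)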

It is noteworthy that texts in images have different scales, viewing angles and aspect ratios (Zhang et al., 2018). Although the hierarchical deep features extracted by a deep network could provide more clues for recognition, most existing CNN-based models either neglect hierarchical features altogether or focus only on local hierarchical features. For example, the dense block of DenseNet connects its inside layers tightly, so it has a strong ability to learn local features due to its intrinsic structure. However, like most existing CNN models, DenseNet simply stacks dense blocks (Huang et al., 2017, Zhang et al., 2020) and transition blocks, which neglects the global properties of features. In addition, combining features by concatenation, as DenseNet does, sharply increases the number of input channels of each layer and incurs a heavy computational cost as the dense block gets deeper, which restricts the depth of networks built from dense blocks. RDN proposed a residual dense block (RDB) (Zhang et al., 2018) for image super-resolution. RDB has some obvious advantages over the residual block of ResNet and the dense block. Specifically, the structure of RDB contains densely connected layers, a local feature fusion (LFF) module and a local residual learning (LRL) module, which can fully capture local hierarchical features and learn local residual dense features. However, RDB builds on the standard dense block, so it inherits the same disadvantage of high computational cost.
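
The channel-growth issue and the LFF/LRL structure can be seen in a minimal RDB-style sketch; the growth rate, depth and channel counts here are assumptions for illustration, not the settings of RDN or of our DRN.

import torch
import torch.nn as nn

class SimpleRDB(nn.Module):
    """Sketch of a residual dense block: densely connected conv layers,
    1x1 local feature fusion (LFF), then local residual learning (LRL)."""
    def __init__(self, channels=64, growth=32, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        c = channels
        for _ in range(num_layers):
            # Each layer sees the concatenation of all previous outputs,
            # so its input width grows by `growth` per layer.
            self.layers.append(nn.Sequential(
                nn.Conv2d(c, growth, 3, padding=1), nn.ReLU()))
            c += growth
        self.lff = nn.Conv2d(c, channels, 1)  # fuse back to `channels`

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return x + self.lff(torch.cat(feats, dim=1))  # LRL: residual add

rdb = SimpleRDB()
y = rdb(torch.randn(1, 64, 16, 16))  # output shape matches input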

In this paper, we propose a deep convolutional network model with CTC, which can fully use and enhance the local and global hierarchical features of text images while reducing the computational cost. In summary, the main contributions of this paper are as follows:

  • (1)

    Technically, we propose a new effective deep representation learning and character recognition network, i.e., the dense residual network (DRN). The proposed DRN can fully use all the local and global hierarchical features and, moreover, enhance the global dense feature flow, whereas global features are usually ignored in existing models. That is, DRN enhances both local and global feature learning by fully exploiting hierarchical features from all convolutional layers.

  • (2)

    We also propose an RDB-based refined residual dense block (r-RDB), which not only retains the local feature fusion and local residual learning abilities of the original RDB, but also reduces the computational cost of the inner layers by refining the structure of the representation block.

  • (3)

    After learning multi-level local residual dense features with r-RDBs, we use the sum operation and several r-RDBs to construct a new global dense block (GDB). GDB is designed by imitating the dense block, but it adaptively learns global dense residual features by densely connecting the features of all r-RDBs in a holistic way. Note that discovering global deep features is the major innovation of this paper and has been ignored by the vast majority of existing deep networks.

  • (4)

    We also design a down-sampling block with two convolutional layers of stride 2 to reduce the size of the global features and extract more informative deeper global features, owing to the larger number of kernel channels in this block. This block also avoids the loss of important feature information and makes the parameters of the whole framework learnable. To further reduce the computational cost and improve efficiency, we replace the traditional convolution with the depth-wise separable convolution (see the sketch after this list).
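
As a concrete illustration of the last point, a depth-wise separable convolution factorizes a standard convolution into a per-channel (depthwise) convolution followed by a 1x1 (pointwise) convolution. The following generic, MobileNet-style sketch (channel counts are illustrative) shows the construction and the resulting parameter saving.

import torch.nn as nn

# Depthwise separable convolution: a per-channel 3x3 convolution (groups=c_in)
# followed by a 1x1 pointwise convolution that mixes channels. For
# c_in = c_out = 256 this needs 256*3*3 + 256*256 weights instead of
# 256*256*3*3 for a standard 3x3 convolution (ignoring biases),
# roughly a 8.7x reduction.
def depthwise_separable(c_in, c_out, stride=1):
    return nn.Sequential(
        nn.Conv2d(c_in, c_in, 3, stride=stride, padding=1, groups=c_in),
        nn.BatchNorm2d(c_in), nn.ReLU(),
        nn.Conv2d(c_in, c_out, 1),
        nn.BatchNorm2d(c_out), nn.ReLU(),
    )

# A stride-2 instance, in the spirit of the down-sampling block above.
down = depthwise_separable(256, 512, stride=2)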

The paper is outlined as follows. Section 2 briefly reviews the related models. In Section 3, we introduce the r-RDB and the global dense block (GDB). We present the deep framework of our dense residual network (DRN) in Section 4. Section 5 presents the experimental settings and results. In Section 6, we conclude the paper and discuss future work.

Related work

In this section, we briefly introduce the residual block, dense block and residual dense block, which are closely related to our proposed global dense block and deep network model.

Residual block in ResNet. ResNet mainly solves the problem of network degradation (He et al., 2016). Fig. 1(a) shows the structure of one residual block. Let x, F(x) and H(x) denote the input features and the output features without and with the shortcut connection, respectively. We can then write the residual block as F(x) = W_2 ReLU(W_1 x) and H(x) = F(x) + x, where W_1 and W_2 denote the weights of the two convolutional layers.
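
A minimal PyTorch rendering of one such residual block, assuming two 3x3 convolutional layers of equal width, reads:

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """F(x) = W_2 ReLU(W_1 x); H(x) = F(x) + x via the shortcut connection."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        f = self.conv2(self.relu(self.conv1(x)))  # F(x)
        return self.relu(f + x)                   # H(x) = F(x) + x

block = ResidualBlock()
out = block(torch.randn(1, 64, 8, 8))  # same shape as input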

Dense Residual Network (DRN) for character recognition

We first describe the network architecture of our DRN in Fig. 2; it consists of three major parts, i.e., a global dense block (GDB), a down-sampling block and a transcription layer. Traditional CNN models usually neglect hierarchical features for image representation or focus only on local hierarchical features. Note that all the convolution operations in our DRN refer to an operation group including batch normalization (BN) (Ioffe & Szegedy, 2015) and the ReLU activation (Hara, Saito, & Shouno, 2015) together with the convolution itself.
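
For instance, such a "convolution" could be wrapped as follows; the exact ordering of the three operations is our assumption based on the text.

import torch.nn as nn

# "Convolution" in DRN denotes an operation group; conv -> BN -> ReLU is
# one common ordering, assumed here for illustration.
def conv_group(c_in, c_out, k=3, stride=1):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, stride=stride, padding=k // 2),
        nn.BatchNorm2d(c_out),
        nn.ReLU(),
    )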

Refined residual dense block (r-RDB) and global dense block (GDB)

We mainly introduce the refined residual dense block (r-RDB) and the global dense block (GDB). In what follows, we give their definitions and illustrate their structures.
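
As one plausible reading of this construction (a hedged sketch, not the authors' released implementation), each r-RDB in a GDB receives the element-wise sum of the block input and all preceding r-RDB outputs, so that dense connectivity is realized without the channel growth of concatenation. A plain convolutional layer stands in for the r-RDB here.

import torch
import torch.nn as nn

class GlobalDenseBlock(nn.Module):
    """Sketch of a GDB: dense connectivity among sub-blocks realized via
    element-wise summation, which keeps the channel count fixed."""
    def __init__(self, channels=64, num_blocks=3, make_block=None):
        super().__init__()
        # `make_block` would be the r-RDB constructor; a conv group stands in.
        make_block = make_block or (lambda c: nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1), nn.ReLU()))
        self.blocks = nn.ModuleList(make_block(channels) for _ in range(num_blocks))

    def forward(self, x):
        outputs = [x]
        for block in self.blocks:
            # Each sub-block sees the sum of the input and all earlier outputs.
            outputs.append(block(torch.stack(outputs).sum(dim=0)))
        return sum(outputs[1:]) + x  # holistic fusion with a global residual

gdb = GlobalDenseBlock()
y = gdb(torch.randn(1, 64, 16, 16))  # output shape matches input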

Experimental results and analysis

We evaluate the performance of our DRN for text image representation and recognition. In this study, we mainly consider two recognition tasks: (1) recognizing character strings in images; (2) recognizing handwritten characters in images. For the first task, we compare the results of our DRN with those of several related deep network models; all evaluated methods run on a Xeon E3 1230 CPU and a 1080 Ti GPU.

Conclusion and future work

In this paper, we investigated the representation learning problem of extending local residual dense learning to global dense residual learning. Technically, we proposed a new dense residual network (DRN) for text image representation and recognition. The refined residual dense block (r-RDB) and the global dense block serve as the basic modules of our DRN, where r-RDB not only retains the advantages of the residual dense block, i.e., local feature fusion and local residual learning, but also refines the inner layers to reduce the computational cost.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work is partially supported by the National Natural Science Foundation of China (NSFC 62072151, 61822701, 62036010, 61806035 and U1936217), the Anhui Provincial Natural Science Fund for Distinguished Young Scholars, China (2008085J30), and the Fundamental Research Funds for the Central Universities of China (JZ2019HGPA0102). Yang Wang and Choujun Zhan are the co-corresponding authors of this paper.

References (52)

  • Breiman, L. (2001). Random forests. Machine Learning.
  • Caulfield, H., et al. (1969). Improved discrimination in optical character recognition. Applied Optics.
  • Chan, T. H., et al. (2015). PCANet: A simple deep learning baseline for image classification? IEEE Transactions on Image Processing.
  • Cheng, Z., Xu, Z., Bai, F., Niu, Y., Pu, S., & Zhou, S. (2018). Aon: Towards arbitrarily-oriented text recognition. In...
  • Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., & Schwenk, H., et al. (2014). Learning phrase...
  • Courbariaux, M., Bengio, Y., & David, J. (2015). Binaryconnect: Training deep neural networks with binary weights...
  • Dong, W., et al. (2019). Denoising prior driven deep neural network for image restoration. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • Duan, K., Keerthi, S., Chu, W., Shevade, S., & Poo, A. (2003). Multi-category classification by soft-max combination of...
  • Gao, Y., et al. (2017). Reading scene text with attention convolutional sequence modeling. Neurocomputing.
  • Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., & Bengio, Y. (2013). Maxout networks. In ICML. Atlanta, GA,...
  • Graves, A., Fernández, S., Gomez, F., & Schmidhuber, J. (2006). Connectionist temporal classification: labelling...
  • Graves, A., Mohamed, A. R., & Hinton, G. (2013). Speech recognition with deep recurrent neural networks. In ICASSP (pp....
  • Gupta, A., Vedaldi, A., & Zisserman, A. (2016). Synthetic data for text localisation in natural images. In CVPR (pp....
  • Hara, K., Saito, D., & Shouno, H. (2015). Analysis of function of rectified linear unit used in deep learning. In...
  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In CVPR (pp. 770–778). Las...
  • Howard, A. G., et al. (2017). Mobilenets: Efficient convolutional neural networks for mobile vision applications.