Neural Networks

Volume 139, July 2021, Pages 77-85

Dense Residual Network: Enhancing global dense feature flow for character recognition

https://doi.org/10.1016/j.neunet.2021.02.005

Abstract

Deep Convolutional Neural Networks (CNNs), such as the Dense Convolutional Network (DenseNet), have achieved great success in image representation learning by capturing deep hierarchical features. However, most existing architectures simply stack convolutional layers and thus fail to fully discover the local and global feature information between layers. In this paper, we mainly investigate how to enhance the local and global feature learning abilities of DenseNet by fully exploiting the hierarchical features from all convolutional layers. Technically, we propose an effective convolutional deep model termed Dense Residual Network (DRN) for the task of optical character recognition. To define DRN, we propose a refined residual dense block (r-RDB) that retains the local feature fusion and local residual learning abilities of the original RDB while reducing the computational cost of its inner layers. After fully capturing local residual dense features, we utilize the sum operation and several r-RDBs to construct a new block termed global dense block (GDB), which imitates the construction of dense blocks to adaptively learn global dense residual features in a holistic way. Finally, we use two convolutional layers to design a down-sampling block that reduces the global feature size and extracts more informative deeper features. Extensive results show that our DRN delivers enhanced results compared with other related deep models.

Introduction

Deep CNNs with multiple layers have made significant progress and achieved great success in many vision tasks, such as image recognition, speech recognition and video person recognition, by learning deeper representations and hierarchical information (Angelov and Soares, 2020, Bacciu et al., 2020, Gao et al., 2020, Mhaskar and Poggio, 2020, Xie et al., 2020, Xie et al., 2017). This success has also been demonstrated in optical character recognition (OCR), which reads scene text in images and predicts a sequence of characters from machine-generated text (Ahmed et al., 2020, Bai et al., 2016, Caulfield and Maloney, 1969, He et al., 2020, Liao et al., 2019, Wang and Hu, 2017, Zhang et al., 2019, Zhang, Tang, Zhang, Wang, Qin, and Wang, 2020). OCR has been widely applied in various applications, e.g., road sign recognition, identification, license plate recognition and assistive services for the blind.

For the task of OCR, two crucial sub-tasks are text line detection and text recognition (Ahmed et al., 2020, He et al., 2020, Zhang, Tang, Zhang, Wang, Qin, and Wang, 2020). The first extracts the text regions from images and the second recognizes the textual contents of the identified regions. In this paper, we mainly discuss text recognition rather than text line detection. Since different images have complicated backgrounds and complex contents, OCR remains a challenging task. To tackle it, many OCR models have been proposed, e.g., the arbitrary orientation network (AON) (Cheng et al., 2018), the end-to-end trainable scene text recognition system (ESIR) (Zhan & Lu, 2019) and the convolutional recurrent neural network (CRNN) (Li, Cao, Zhao, & Cui, 2013). AON recognizes arbitrarily oriented text and achieved impressive results on both irregular and regular text images. ESIR designs a novel line-fitting transformation to estimate the pose of text lines in scenes and develops an iterative rectification framework for scene text recognition. CRNN combines two prominent neural networks, i.e., CNNs and Recurrent Neural Networks (RNNs) (Cho et al., 2014, Graves et al., 2013, McBride-Chang et al., 2003). More specifically, CNNs extract image features (Jia, Zhang, Zhang, & Liu, 2021), RNNs predict the label distribution of each frame, and a Connectionist Temporal Classification (CTC) module (Graves, Fernández, Gomez, & Schmidhuber, 2006) transforms these per-frame predictions into the final label sequence. Note that recent work also revealed that, even without recurrent layers, simplified models can still achieve promising results with higher efficiency (Gao et al., 2017, Tang et al., 2020). As such, the framework of CNNs plus CTC is a feasible and efficient solution. To extract features in the convolutional layers, many existing convolutional networks can be used, e.g., the Dense Convolutional Network (DenseNet) (Huang, Liu, Van Der Maaten, & Weinberger, 2017), the Residual Network (ResNet) (He, Zhang, Ren, & Sun, 2016) and the Residual Dense Network (RDN) (Zhang, Tian, Kong, Zhong, & Fu, 2018).
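
To make the CNN-plus-CTC pipeline concrete, the following PyTorch sketch shows a minimal recognizer: a small CNN collapses the image height into a feature sequence along the width, a linear layer produces per-frame character distributions, and CTC aligns them with the label sequence. The layer sizes, alphabet size and pooling scheme are illustrative assumptions, not the configuration used in this paper.

import torch
import torch.nn as nn

class TinyCTCRecognizer(nn.Module):
    """Minimal CNN + CTC text recognizer (illustrative, not the paper's DRN)."""
    def __init__(self, num_classes=37):  # e.g., 26 letters + 10 digits + CTC blank
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2, 2),                      # halve height and width
            nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.MaxPool2d((2, 1)),                    # halve height only
            nn.AdaptiveAvgPool2d((1, None)),         # collapse height to 1
        )
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x):                  # x: (N, 1, H, W)
        f = self.features(x)               # (N, C, 1, W')
        f = f.squeeze(2).permute(2, 0, 1)  # (W', N, C): a sequence of frames
        return self.classifier(f).log_softmax(-1)  # (T, N, classes) for CTCLoss

model = TinyCTCRecognizer()
log_probs = model(torch.randn(4, 1, 32, 128))   # here T = 64 frames
targets = torch.randint(1, 37, (4, 10))         # dummy label sequences
input_lens = torch.full((4,), log_probs.size(0), dtype=torch.long)
target_lens = torch.full((4,), 10, dtype=torch.long)
loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lens, target_lens)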

It is noteworthy that texts in images have different scales, viewing angles and aspect ratios (Zhang et al., 2018). Although the hierarchical deep features extracted by a deep network could provide more clues for recognition, most existing CNN-based models either neglect hierarchical features altogether or focus only on local hierarchical features. For example, the dense block of DenseNet connects its inside layers tightly, so it has a strong ability to learn local features due to its intrinsic structure. However, like most existing CNN models, DenseNet simply stacks dense blocks (Huang et al., 2017, Zhang et al., 2020) and transition blocks, which neglects the global properties of features. In addition, combining features by concatenation, as DenseNet does, sharply increases the number of input channels of each layer and incurs a heavy computational cost as the dense block gets deeper, which restricts the depth of networks built from dense blocks. RDN proposed a residual dense block (RDB) (Zhang et al., 2018) for image super-resolution. RDB has some obvious advantages over the residual block of ResNet and the dense block. Specifically, the structure of RDB contains densely connected layers, a local feature fusion (LFF) module and a local residual learning (LRL) module, which can fully capture local hierarchical features and learn local residual dense features. However, RDB builds on the standard dense block, so it inherits the same disadvantage of high computational cost.
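
The channel-growth issue and the LFF/LRL structure can be seen in a minimal RDB-style sketch; the growth rate, depth and channel counts here are assumptions for illustration, not the settings of RDN or of our DRN.

import torch
import torch.nn as nn

class SimpleRDB(nn.Module):
    """Sketch of a residual dense block: densely connected conv layers,
    1x1 local feature fusion (LFF), then local residual learning (LRL)."""
    def __init__(self, channels=64, growth=32, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        c = channels
        for _ in range(num_layers):
            # Each layer sees the concatenation of all previous outputs,
            # so its input width grows by `growth` per layer.
            self.layers.append(nn.Sequential(
                nn.Conv2d(c, growth, 3, padding=1), nn.ReLU()))
            c += growth
        self.lff = nn.Conv2d(c, channels, 1)  # fuse back to `channels`

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return x + self.lff(torch.cat(feats, dim=1))  # LRL: residual add

rdb = SimpleRDB()
y = rdb(torch.randn(1, 64, 16, 16))  # output shape matches input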

In this paper, we propose a deep convolutional network model with CTC, which can fully use and enhance the local and global hierarchical features of text images while reducing the computational cost. In summary, the main contributions of this paper are as follows:

  • (1)

    Technically, we propose a new effective deep representation learning and character recognition network, i.e., the dense residual network (DRN). The proposed DRN can fully use all the local and global hierarchical features and, moreover, enhance the global dense feature flow, whereas global features are usually ignored in existing models. That is, DRN enhances both local and global feature learning by fully exploiting hierarchical features from all convolutional layers.

  • (2)

    We also propose an RDB-based refined residual dense block (r-RDB), which not only retains the local feature fusion and local residual learning abilities of the original RDB, but also reduces the computational cost of the inner layers by refining the structure of the representation block.

  • (3)

    After learning multi-level local residual dense features with r-RDBs, we use the sum operation and several r-RDBs to construct a new global dense block (GDB). GDB is designed by imitating the dense block, but it adaptively learns global dense residual features by densely connecting the features of all r-RDBs in a holistic way. Note that discovering global deep features is the major innovation of this paper and has been ignored by the vast majority of existing deep networks.

  • (4)

    We also design a down-sampling block with two convolutional layers of stride 2 to reduce the size of the global features and extract more informative deeper global features, owing to the larger number of kernel channels in this block. This block also avoids the loss of important feature information and makes the parameters of the whole framework learnable. To further reduce the computational cost and improve efficiency, we replace the traditional convolution with the depth-wise separable convolution (see the sketch after this list).
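
As a concrete illustration of the last point, a depth-wise separable convolution factorizes a standard convolution into a per-channel (depthwise) convolution followed by a 1x1 (pointwise) convolution. The following generic, MobileNet-style sketch (channel counts are illustrative) shows the construction and the resulting parameter saving.

import torch.nn as nn

# Depthwise separable convolution: a per-channel 3x3 convolution (groups=c_in)
# followed by a 1x1 pointwise convolution that mixes channels. For
# c_in = c_out = 256 this needs 256*3*3 + 256*256 weights instead of
# 256*256*3*3 for a standard 3x3 convolution (ignoring biases),
# roughly a 8.7x reduction.
def depthwise_separable(c_in, c_out, stride=1):
    return nn.Sequential(
        nn.Conv2d(c_in, c_in, 3, stride=stride, padding=1, groups=c_in),
        nn.BatchNorm2d(c_in), nn.ReLU(),
        nn.Conv2d(c_in, c_out, 1),
        nn.BatchNorm2d(c_out), nn.ReLU(),
    )

# A stride-2 instance, in the spirit of the down-sampling block above.
down = depthwise_separable(256, 512, stride=2)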

The paper is outlined as follows. Section 2 briefly reviews the related models. In Section 3, we introduce the r-RDB and the global dense block (GDB). We present the deep framework of our dense residual network (DRN) in Section 4. Section 5 presents the experimental settings and results. In Section 6, we conclude the paper and discuss future work.

Related work

In this section, we briefly introduce the residual block, dense block and residual dense block, which are closely related to our proposed global dense block and deep network model.

Residual block in ResNet. ResNet mainly solves the problem of network degradation (He et al., 2016). Fig. 1(a) shows the structure of one residual block. Let x, F(x) and H(x) denote the input features and the output features without and with the shortcut connection, respectively. We can then write the residual block as F(x) = W_2 ReLU(W_1 x) and H(x) = F(x) + x, where W_1 and W_2 denote the weights of the two convolutional layers.
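
A minimal PyTorch rendering of one such residual block, assuming two 3x3 convolutional layers of equal width, reads:

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """F(x) = W_2 ReLU(W_1 x); H(x) = F(x) + x via the shortcut connection."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        f = self.conv2(self.relu(self.conv1(x)))  # F(x)
        return self.relu(f + x)                   # H(x) = F(x) + x

block = ResidualBlock()
out = block(torch.randn(1, 64, 8, 8))  # same shape as input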

Dense Residual Network (DRN) for character recognition

We first describe the network architecture of our DRN in Fig. 2; it consists of three major parts, i.e., a global dense block (GDB), a down-sampling block and a transcription layer. Traditional CNN models usually neglect hierarchical features for image representation or focus only on local hierarchical features. Note that all the convolution operations in our DRN refer to an operation group including batch normalization (BN) (Ioffe & Szegedy, 2015) and the ReLU activation (Hara, Saito, & Shouno, 2015) together with the convolution itself.
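
For instance, such a "convolution" could be wrapped as follows; the exact ordering of the three operations is our assumption based on the text.

import torch.nn as nn

# "Convolution" in DRN denotes an operation group; conv -> BN -> ReLU is
# one common ordering, assumed here for illustration.
def conv_group(c_in, c_out, k=3, stride=1):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, stride=stride, padding=k // 2),
        nn.BatchNorm2d(c_out),
        nn.ReLU(),
    )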

Refined residual dense block (r-RDB) and global dense block (GDB)

We mainly introduce the refined residual dense block (r-RDB) and the global dense block (GDB). In what follows, we give their definitions and illustrate their structures.
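
As one plausible reading of this construction (a hedged sketch, not the authors' released implementation), each r-RDB in a GDB receives the element-wise sum of the block input and all preceding r-RDB outputs, so that dense connectivity is realized without the channel growth of concatenation. A plain convolutional layer stands in for the r-RDB here.

import torch
import torch.nn as nn

class GlobalDenseBlock(nn.Module):
    """Sketch of a GDB: dense connectivity among sub-blocks realized via
    element-wise summation, which keeps the channel count fixed."""
    def __init__(self, channels=64, num_blocks=3, make_block=None):
        super().__init__()
        # `make_block` would be the r-RDB constructor; a conv group stands in.
        make_block = make_block or (lambda c: nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1), nn.ReLU()))
        self.blocks = nn.ModuleList(make_block(channels) for _ in range(num_blocks))

    def forward(self, x):
        outputs = [x]
        for block in self.blocks:
            # Each sub-block sees the sum of the input and all earlier outputs.
            outputs.append(block(torch.stack(outputs).sum(dim=0)))
        return sum(outputs[1:]) + x  # holistic fusion with a global residual

gdb = GlobalDenseBlock()
y = gdb(torch.randn(1, 64, 16, 16))  # output shape matches input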

Experimental results and analysis

We evaluate the performance of our DRN for text image representation and recognition. In this study, we mainly consider two recognition tasks: (1) recognizing character strings in images; (2) recognizing handwritten characters in images. For the first task, we compare the results of our DRN with those of several related deep network models; all evaluated methods run on a Xeon E3 1230 CPU and a 1080 Ti GPU.

Conclusion and future work

In this paper, we investigated the representation learning problem of extending local residual dense learning to global dense residual learning. Technically, we proposed a new dense residual network (DRN) for text image representation and recognition. The refined residual dense block (r-RDB) and the global dense block serve as the basic modules of our DRN, where r-RDB not only retains the advantages of the residual dense block, i.e., local feature fusion and local residual learning, but also refines the inner layers to reduce the computational cost.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work is partially supported by the National Natural Science Foundation of China (NSFC 62072151, 61822701, 62036010, 61806035 and U1936217), the Anhui Provincial Natural Science Fund for Distinguished Young Scholars, China (2008085J30), and the Fundamental Research Funds for the Central Universities of China (JZ2019HGPA0102). Yang Wang and Choujun Zhan are the co-corresponding authors of this paper.

References (52)

  • Breiman, L. (2001). Random forests. Machine Learning.
  • Caulfield, H., et al. (1969). Improved discrimination in optical character recognition. Applied Optics.
  • Chan, T. H., et al. (2015). PCANet: A simple deep learning baseline for image classification? IEEE Transactions on Image Processing.
  • Cheng, Z., Xu, Z., Bai, F., Niu, Y., Pu, S., & Zhou, S. (2018). Aon: Towards arbitrarily-oriented text recognition. In...
  • Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., & Schwenk, H., et al. (2014). Learning phrase...
  • Courbariaux, M., Bengio, Y., & David, J. (2015). Binaryconnect: Training deep neural networks with binary weights...
  • Dong, W., et al. (2019). Denoising prior driven deep neural network for image restoration. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • Duan, K., Keerthi, S., Chu, W., Shevade, S., & Poo, A. (2003). Multi-category classification by soft-max combination of...
  • Gao, Y., et al. (2017). Reading scene text with attention convolutional sequence modeling. Neurocomputing.
  • Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., & Bengio, Y. (2013). Maxout networks. In ICML. Atlanta, GA,...
  • Graves, A., Fernández, S., Gomez, F., & Schmidhuber, J. (2006). Connectionist temporal classification: labelling...
  • Graves, A., Mohamed, A. R., & Hinton, G. (2013). Speech recognition with deep recurrent neural networks. In ICASSP (pp....
  • Gupta, A., Vedaldi, A., & Zisserman, A. (2016). Synthetic data for text localisation in natural images. In CVPR (pp....
  • Hara, K., Saito, D., & Shouno, H. (2015). Analysis of function of rectified linear unit used in deep learning. In...
  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In CVPR (pp. 770–778). Las...
  • Howard, A. G., et al. (2017). Mobilenets: Efficient convolutional neural networks for mobile vision applications.