Natural Scene Text Detection and Segmentation Using Phase-Based Regions and Character Retrieval

Multioriented text detection and recognition in natural scene images are still challenges in the document analysis and computer vision communities. In particular, character segmentation plays an important role in the performance of a complete end-to-end recognition system. In this work, a robust multioriented text detection and segmentation method based on a biological visual system model is proposed. The proposed method exploits the local energy model instead of a common approach based on variations of local image pixel intensities. Features such as lines and edges are obtained by searching for the maximum local energy utilizing the scale-space monogenic signal framework. The candidate text components are extracted from maximally stable extremal regions of the local phase information of the image. The candidate regions are filtered by their phase congruency and classified as text and nontext components by an AdaBoost classifier. Finally, misclassified characters are restored, and all final characters are grouped into words. Experimental results show that the proposed text detection and segmentation method is invariant to scale and rotation changes and robust to perspective distortions, blurring, low resolution, and illumination variations (low contrast, high brightness, shadows, and nonuniform illumination). Besides, the proposed method often achieves a better performance compared with state-of-the-art methods on typical natural scene datasets.


Introduction
Nowadays, imagery has become an indispensable source of human communication and interaction. Millions of images are shared every day, and new content-based image applications have been developed. In particular, digital images with textual content provide useful information for tasks related to document classification, multimedia retrieval, language translation, text-to-voice conversion, robotic navigation, and augmented reality, to name a few [1,2]. The analysis of this textual information basically involves three stages: text detection, character segmentation, and word recognition. The fundamental goal of text detection is to determine whether there is text in a given image, while character segmentation considers the extraction and localization of characters from background pixels. Word recognition considers character grouping and error correction in order to recognize the final words.
Since the text localization, character segmentation, and word recognition stages are not necessarily applied in a fixed order, performing character segmentation first could improve the performance of the subsequent processes. However, text localization and character segmentation are still challenges in the document analysis and computer vision communities (http://rrc.cvc.uab.es/?com=introduction). Natural scene texts contain different types of fonts, symbols, colors, scales, and character orientations, which make text detection a complicated task. Moreover, natural scenes are commonly captured under uncontrolled conditions (illumination changes, partial occlusion, low resolution, sensor noise, blur, and alignment) and could contain complex backgrounds (people, buildings, fences, bricks, grass, trees, and cars) [1][2][3].
In the last decades, several techniques have been explored to solve the text detection and segmentation problem. These methods can be broadly divided into four categories: sliding window-based, connected component-based, deep learning-based, and hybrid methods [1]. Sliding window-based methods, also called texture-based methods, slide a window over the whole image at different scales to identify text regions. Fourier-statistical features (FSF) [4], discrete cosine transform (DCT) [5], spatial filters [6], and wavelet coefficients [7] are commonly used as textural properties. Nevertheless, sliding window methods are sensitive to scale and rotation variations, and they are computationally expensive. Connected component-based methods consider connected component properties such as color, stroke width, aspect ratio, and size to distinguish between character and noncharacter regions. Usually, connected components are obtained by color clustering [8,9], image binarization [10,11], edge detection [12], stroke width transform (SWT) computation [13], and maximally stable extremal region (MSER) extraction [14,15].
In the last years, the MSER and SWT techniques have become the most used techniques for the text detection process due to their invariance to scale and rotation transformations. Besides, not only the MSER but also all extremal regions (ERs) are used for text segmentation [16][17][18][19][20]. However, ER-based methods need to process multiple repeated regions to obtain correct character segmentation, generating classification errors and a high computational cost. Furthermore, SWT-based techniques depend on accurate edge detection, which is not feasible in many cases.
Recently, deep learning-based techniques have become popular for pattern recognition. In particular, for the multioriented text detection task, different neural networks (NNs) and configurations have been proposed [21][22][23][24]. However, NNs need to be pretrained using thousands of images in order to achieve a good performance, and in many cases, final fine-tuning is performed with the training images of the dataset to be evaluated. Moreover, it has been shown that this kind of approach can be easily fooled by modifying some values of the image pixels [25].
Lastly, hybrid methods combine sliding window techniques, connected components, and neural network-based methods [26][27][28][29][30]. Until now, most of the proposed methods related to natural scene text detection are based on the pixel intensity values. As a consequence, method performance is affected by the presence of nonuniform illumination, low contrast, blur, or noise degradations. In contrast, we propose a robust multioriented text detection and segmentation method based on a biological visual system model. Psychophysical evidence suggests that the human visual system decomposes visual information into border and line components by using phase information. Furthermore, it is known that different groups of cells in V1 extract particular image features such as frequency, orientation, and phase [31].
In this work, a new multioriented text detection and segmentation method based on the biological energy model is suggested.
This paper is an extended version of the conference papers [32,33]. Unlike the previous works, we utilize the phase-based MSER approach and the AdaBoost classifier instead of applying only heuristic rules for the character filtering, retrieval, and grouping stages. The main contributions of this work are as follows. First, the proposed character segmentation method is based on a biologically inspired model rather than on local intensities. Thus, the proposed text segmentation is robust to variations of the image pixel values (nonuniform illumination, low contrast, and shadows), and it is invariant to slight scale and rotation changes. Second, the phase congruency approach for character filtering and noise control is utilized, which significantly reduces the number of generated components, keeps a low number of regions, and preserves the most relevant regions. Third, AdaBoost classifiers are used rather than heuristic rules at the character filtering, retrieval, and grouping stages. Finally, the computational complexity of the proposed system at the training stage is much lower compared with that of deep learning techniques, while the performance of the system with a small training set is competitive and, in some cases, better than that of the state-of-the-art algorithms. The paper is organized as follows. In Section 2, a brief description of the related works is presented. In Section 3, the proposed text detection and segmentation method is described. In Section 4, experimental results are presented and discussed. Section 5 summarizes our conclusions.

Related Work
Until now, there are two representative connected component-based techniques used for text segmentation, that is, the SWT [13] and the MSER [14]. The local SWT operator computes the character stroke width for each edge map pixel. Therefore, strokes that have constant width values can be considered as characters, and those components which have similar stroke width values can be grouped into words. Since the original SWT is invariant to rotation and scale variations, several SWT-based methods have been developed. In [34,35], an SWT-based method is proposed for multioriented text detection. The Canny edge detector is used to calculate the SWT map from the image. The image pixels are associated considering the SWT ratio and grouped into connected components. The obtained components are classified into character and noncharacter elements using a two-layer filtering scheme. A set of heuristic rules is considered, and a trained random forest (RF) classifier is applied. Finally, the character candidates are aggregated into text chains satisfying a certain set of rules. In [36], an extended version of the SWT, called stroke feature transform (SFT), is proposed. In addition to stroke width constraints, the SFT considers color uniformity and local relationships of edge pixels during ray tracking. Then, two text covariance descriptors are defined for component-level and text-line RF classifier training. In [37], an efficient stroke width value computation is proposed. The obtained stroke width value is used together with a perceptual divergence cue and an edge histogram of oriented gradients (HOG) descriptor to measure the properties of characters under a Bayesian framework.

Mathematical Problems in Engineering
On the contrary, the MSER method basically extracts image regions that remain stable under a certain number of thresholds, which are considered as potential character candidates. The MSER technique was first introduced by Matas and Zimmermann [15] for character detection and was recently extended for text detection and recognition [18]. In [16], an MSER-based text segmentation method is proposed. The character candidates are extracted using the MSER algorithm. The candidates are grouped using orientation, morphology, and projection clustering via adaptive hierarchical clustering. Then, the text candidates are classified into text and nontext components. In [17], a subpath division of the ER tree is done. Multiple subpaths are created according to the size and position similarities of ER regions. Then, an AdaBoost classifier is trained using mean local binary patterns (MLBP) for text and nontext classification. Finally, heuristic rules are used for misclassified character filtering. In [20], the character candidates are extracted from low-variation ERs and classified using a support vector machine (SVM) and geometrical features. The obtained characters are grouped into text lines using heuristic rules, and a final restoration stage is considered if adjacent regions satisfy a set of predefined conditions. In [19], a similar ER-based method is proposed, but instead of geometrical features, the HOG and local binary pattern (LBP) features are selected for character classification and recognition. Then, characters are grouped into text lines, and a CNN model is used to verify text lines, removing noncharacter components. In [28], a multichannel and multiresolution (MC-MR) strategy is proposed. The text candidates are extracted using the MSER technique in the RGB and YUV color spaces under different resolutions. Then, candidates are filtered by a coarse-to-fine strategy and classified as text and nontext components by an NN classifier.

Proposed Text Detection and Segmentation Method
In this section, the methodology for the proposed text detection and segmentation method is described. Connected components are obtained from the local image phase information. In order to extract the local phase-based image features, the scale-space monogenic signal framework [38,39] is utilized. Basically, connected component regions are extracted from the local phase image using the MSER approach. Then, the obtained connected components are filtered considering geometrical properties, and the remaining components are considered as character candidates. Using an AdaBoost classifier, the character candidates are predicted as character or noncharacter components. Finally, a second AdaBoost classifier is applied to restore misclassified characters. Figure 1 shows a block diagram of the proposed method.

Image Preprocessing. Morrone and Owens [40,41] proposed a local energy model. This model argues that the biological visual system can locate features of interest by searching for maximum local energy and can identify the feature type (shadow, edge, or line) by evaluating the argument at that point. That is, edges, lines, and shadows can be obtained at points where the Fourier components of the signal are maximally in phase, a condition called phase congruency. Continuing with this approach, in [42], a dimensionless measure of phase congruency PC(x) is proposed as follows:

PC(x) = \frac{\sum_n W(x) \lfloor A_n(x)\,\Delta\Phi_n(x) - T \rfloor}{\sum_n A_n(x) + \varepsilon},   (1)

where A_n(x) and \Delta\Phi_n(x) denote the amplitude and the phase deviation of the n-th frequency component, and \lfloor \cdot \rfloor indicates that the enclosed quantity equals itself when positive and zero otherwise. W(x) is a weight for the frequency spread; ε is a small constant to avoid division by zero; and T is a noise threshold parameter. PC(x) goes from 0 to 1. The PC(x) value indicates the significance of the current feature: unity means the most significant feature, and zero indicates the lowest significance. We refer to papers [42,43] for more details.
In practice, local frequency information is obtained via banks of oriented 2D filters, which are computationally expensive. Instead, we use the scale-space monogenic signal framework to compute the local phase information of the image.
Let f(x, y) be an image and F(u, v) its Fourier transform. The Riesz-transformed image components are obtained as

(f_{Rx}, f_{Ry})(x, y) = \mathcal{F}^{-1}\{R(u, v)\, F(u, v)\},   (2)

where R = (R_x, R_y) is the transfer function of the first-order Riesz transform in the frequency domain:

R(u, v) = (R_x, R_y) = \frac{-i\,(u, v)}{\sqrt{u^2 + v^2}},   (3)

and filtered by the band-pass filter (a difference of Poisson kernels):

B_k(u, v) = \exp\!\left(-2\pi \sqrt{u^2 + v^2}\, s_0 \lambda^{k+1}\right) - \exp\!\left(-2\pi \sqrt{u^2 + v^2}\, s_0 \lambda^{k}\right),   (4)

where λ ∈ (0, 1) indicates the relative bandwidth, s_0 indicates the coarsest scale, and k ∈ N indicates the band-pass number. Figure 2 shows a block diagram of the scale-space monogenic signal framework. Then, denoting by f_b the band-passed image, the local amplitude A(x, y), local orientation θ(x, y), and local phase φ(x, y) (note that the function a tan 2(|y|/x) = sign(y) · tan^{-1}(|y|/x), where the factor sign(y) indicates the direction of rotation) can be computed as follows:

A(x, y) = \sqrt{f_b^2 + f_{Rx}^2 + f_{Ry}^2},   (5)

θ(x, y) = \tan^{-1}\!\left(\frac{f_{Ry}}{f_{Rx}}\right),   (6)

φ(x, y) = \operatorname{atan2}\!\left(\sqrt{f_{Rx}^2 + f_{Ry}^2},\, f_b\right).   (7)
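The monogenic quantities above can be sketched numerically as follows. The difference-of-Poisson band-pass and the default scale parameters are assumptions for illustration; the exact filter of the framework in [38,39] may differ.

```python
import numpy as np

def monogenic_signal(img, s0=8.0, lam=0.5, k=0):
    """Riesz-transform-based local amplitude, phase, and orientation of an
    image, using an assumed difference-of-Poisson band-pass filter."""
    h, w = img.shape
    u = np.fft.fftfreq(w)[None, :]
    v = np.fft.fftfreq(h)[:, None]
    rad = np.sqrt(u ** 2 + v ** 2)
    rad[0, 0] = 1.0  # avoid division by zero at DC

    F = np.fft.fft2(img)
    # band-pass between scales s0*lam^(k+1) (finer) and s0*lam^k (coarser)
    bp = (np.exp(-2 * np.pi * rad * s0 * lam ** (k + 1))
          - np.exp(-2 * np.pi * rad * s0 * lam ** k))
    bp[0, 0] = 0.0

    f_b = np.real(np.fft.ifft2(F * bp))                    # even part
    f_rx = np.real(np.fft.ifft2(F * bp * (-1j * u / rad)))  # odd, x
    f_ry = np.real(np.fft.ifft2(F * bp * (-1j * v / rad)))  # odd, y

    amplitude = np.sqrt(f_b ** 2 + f_rx ** 2 + f_ry ** 2)
    orientation = np.arctan2(f_ry, f_rx)
    phase = np.arctan2(np.hypot(f_rx, f_ry), f_b)  # in [0, pi]
    return amplitude, phase, orientation
```

At a step edge, the odd (Riesz) responses dominate, so the phase passes through π/2 across the edge, while the amplitude peaks there, matching the feature interpretation given in the text.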

Phase-Based Character Candidate Generation.
As we mentioned earlier, the local image phase φ(x, y) describes the image structural information, while the local amplitude gives us an intensity measure of the structure. Furthermore, the local phase allows us to distinguish between edge, edge-line, and line features. A phase value of 0 indicates an upward going step, π/2 a bright line feature, π a downward going step, and 3π/2 a dark line feature [43]. However, we are not interested in distinguishing between dark and bright lines but in finding upward and downward going step features for region detection. For this reason, we consider the range from 0 to π, mapping the angles greater than π back into this range.
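The mapping described above can be written as a reflection about π: values above π are folded back, which merges the bright-line (π/2) and dark-line (3π/2) responses while keeping the two step types (0 and π) distinct. A minimal sketch:

```python
import numpy as np

def fold_phase(phi):
    """Fold phase angles from [0, 2*pi) into [0, pi] by reflection about pi,
    so that bright and dark lines map to the same value."""
    phi = np.mod(phi, 2 * np.pi)
    return np.where(phi > np.pi, 2 * np.pi - phi, phi)
```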
On the contrary, the MSER method [14] was first introduced for grayscale images, but it can be applied to any type of image as long as the following two conditions hold: the pixel values form a totally ordered set, and an adjacency relation exists.
Thus, the proposed phase-MSER method is described as follows.

Let I be a grayscale image and ϕ its local phase (equation (7)). The binary image I^{(t)}_{bin} is defined as

I^{(t)}_{bin}(x, y) = 1 if ϕ(x, y) ≤ t, and 0 otherwise,   (8)

where t denotes a threshold value. An extremal region R_t with threshold t is defined as a maximal connected component of

\{(x, y) : I^{(t)}_{bin}(x, y) = 1\}.   (9)

The extremal region R_{i*} is maximally stable if and only if the stability function

q(i) = \frac{|R_{i+\Delta} \setminus R_{i-\Delta}|}{|R_i|}   (10)

has a local minimum at i*, with |·| denoting cardinality, and Δ is a parameter that considers the stability of the region under a certain number of thresholds. The obtained regions are called character candidates (CC). Figure 3 shows an example of the MSER technique and the proposed phase-MSER method. It is important to note that the local phase information is scale- and rotation-invariant. Moreover, due to the invariance-equivariance property, local phase information is independent of the local intensity; therefore, it is robust to contrast and illumination variations.
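The stability criterion can be illustrated on a synthetic image. This sketch tracks a single seeded region across thresholds instead of building the full extremal-region component tree, so it shows the criterion, not a production MSER implementation.

```python
from collections import deque
import numpy as np

def region_size(mask, seed):
    """Size of the 4-connected True-region containing `seed` (0 if seed is False)."""
    if not mask[seed]:
        return 0
    h, w = mask.shape
    seen = np.zeros_like(mask, dtype=bool)
    q = deque([seed])
    seen[seed] = True
    n = 0
    while q:
        y, x = q.popleft()
        n += 1
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                seen[ny, nx] = True
                q.append((ny, nx))
    return n

def stability_profile(gray, seed, thresholds, delta=2):
    """Track the extremal region containing `seed` across thresholds and
    return its sizes plus the MSER stability values
    q(i) = (|R_{i+delta}| - |R_{i-delta}|) / |R_i| (nested regions assumed)."""
    sizes = [region_size(gray <= t, seed) for t in thresholds]
    q = [(sizes[i + delta] - sizes[i - delta]) / max(sizes[i], 1)
         for i in range(delta, len(sizes) - delta)]
    return sizes, q
```

A dark "glyph" on a bright background keeps a constant size over a wide threshold range, so its stability values hit a flat minimum there, which is exactly what marks it as maximally stable.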

Character Candidate Feature Computation.
Once the character candidate generation stage is done, a morphological closing operation is applied to each candidate in order to eliminate small holes. The size of the structural element was experimentally defined as \sqrt{CC_{area}}. Next, for each candidate, geometrical connected component properties are computed. Table 1 summarizes the computed properties. Then, the obtained properties are used to compute the suggested candidate features: (1) The mean phase congruency value (PC_mean) is computed to consider the phase congruency value of the candidate. As mentioned above, the PC(x) value indicates the significance of the current feature. Thus, one means the most significant edge component, and zero indicates the lowest significance. PC_mean is computed as follows:

PC_{mean} = \frac{1}{|CC_{contour}|} \sum_{pt_i \in CC_{contour}} PC(pt_i),   (11)

where pt_i ∈ CC_contour and |·| denotes cardinality. (2) The phase congruency ratio (PC_ratio) is computed to consider the contribution of the edge pixels of the candidate. One means a complete contribution from all the edge pixels, and zero indicates the lowest contribution. PC_ratio is obtained as

PC_{ratio} = \frac{|CC_{PC}|}{|CC_{contour}|},   (12)

where

CC_{PC} = \{pt_i \in CC_{contour} : PC(pt_i) > PC_{thresh}\},   (13)

and PC_thresh is a threshold from 0 to 1. (3) The filled convex hull ratio is computed to consider the convexity of the candidate:

\frac{CC_{filledArea}}{CC_{hullArea}}.   (14)

(4) The approximated area ratio considers the stroke uniformity of the candidate. One means a complete uniformity of the candidate stroke, and zero indicates the lowest uniformity. The approximated area ratio is computed as

\frac{CC_{area}}{CC_{approx}},   (15)

where CC_approx = CC_stroke · length(CC_skel). (5) The contour length ratio considers the difference between the external and internal candidate contours; this accounts for the complexity of the candidate edge. The contour length ratio is computed as

\frac{\left|\operatorname{length}(CC_{contour}) - \operatorname{length}(CC_{contourExt})\right|}{\operatorname{length}(CC_{contourExt})},   (16)

where CC_contourExt represents the external contour of the candidate.
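As an example of the underlying region properties, the filled area (used by the filled-area-based ratios above) can be obtained by flood-filling the background from the image border; everything not reached is a hole and gets filled. This is a minimal sketch; in practice, such properties come from a connected-components library.

```python
from collections import deque
import numpy as np

def fill_holes(mask):
    """Fill interior holes of a binary mask: flood the background from the
    border; pixels never reached are holes and become part of the region."""
    h, w = mask.shape
    outside = np.zeros((h, w), dtype=bool)
    q = deque((y, x) for y in range(h) for x in range(w)
              if (y in (0, h - 1) or x in (0, w - 1)) and not mask[y, x])
    for y, x in q:
        outside[y, x] = True
    while q:
        y, x = q.popleft()
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if (0 <= ny < h and 0 <= nx < w
                    and not mask[ny, nx] and not outside[ny, nx]):
                outside[ny, nx] = True
                q.append((ny, nx))
    return mask | ~outside

def filled_area_ratio(mask):
    """CC_area / CC_filledArea: 1 for hole-free components, < 1 otherwise."""
    return mask.sum() / fill_holes(mask).sum()
```

For a ring-shaped component (like the letter "O"), the ratio drops below one, which is what makes it a useful shape cue.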
In addition, the features used in [37,44] are also considered: (1) the filled area ratio, CC_{area}/CC_{filledArea}; (2) the solidity, CC_{area}/CC_{hullArea}.
All the described features are used to train an AdaBoost classifier that classifies character candidates into text and nontext components. The text-component AdaBoost classifier was trained using the ICDAR2013 training dataset (299 images).

Character Candidate Classification.
In this stage, the character candidate classification is performed. As a first step, coarse candidate filtering is applied taking into account the following noncharacter properties: (1) The candidate area: to eliminate noncharacter candidates that are either larger or smaller than predefined fractions of the image area I_area. (2) The aspect ratio: to eliminate noncharacter candidates that are too narrow or wide; CC_ratio < 0.10 was considered. (3) The phase congruency value: to eliminate candidates with a low phase congruency value. If PC_mean (equation (11)) is lower than a predefined threshold (PC_thresh), then the candidate is discarded. Figure 4 shows an example of the phase-based candidates under different PC_thresh values.
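The coarse filter above can be sketched as a simple predicate. The area bounds and the phase congruency threshold below are illustrative values, not the paper's settings; only the 0.10 aspect ratio limit comes from the text.

```python
def passes_coarse_filter(cc_area, img_area, cc_width, cc_height,
                         pc_mean, pc_thresh=0.3,
                         min_frac=1e-5, max_frac=0.25, min_aspect=0.10):
    """Coarse noncharacter rejection: area within fractions of the image
    area, aspect ratio not too extreme, and sufficient mean phase congruency.
    min_frac, max_frac, and pc_thresh are illustrative, assumed values."""
    aspect = min(cc_width, cc_height) / max(cc_width, cc_height)
    return (min_frac * img_area <= cc_area <= max_frac * img_area
            and aspect >= min_aspect
            and pc_mean >= pc_thresh)
```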
After the filtering stage, the remaining candidates are classified as text and nontext components using the already trained AdaBoost classifier. A candidate is considered as a text character (Char) if the sum of votes of the classifier is positive. The remaining candidates with a negative vote sum are considered as candidate neighbors (CN) and are used in the next stage for character retrieval.

Character Retrieval.
During the classifier training stage, some characters were purposely mislabelled as noncharacters ("I," "i," "L," and "1") to reduce classification errors since these characters are usually similar to noncharacter structures in the image. The retrieval stage seeks to recover these characters and others that have been misclassified. The character retrieval method is described as follows.
For each Char, a neighborhood of radius R = 4 · max(Char_height, Char_width) is defined. All the CNs inside the radius R are considered as character neighbors. If a Char has no possible CNs, then the character is excluded from the retrieval stage but is kept as a final character; that is, isolated characters are not discarded.
Next, each CN is evaluated to determine whether it is a misclassified character. For this, a second AdaBoost classifier is applied. The classifier is trained using the following features between a Char and its CN: (1) the area difference,

\frac{\left|Char_{area} - CN_{area}\right|}{\max(Char_{area}, CN_{area})}.
The character retrieval AdaBoost classifier was also trained using the ICDAR2013 training dataset.
Once the character retrieval AdaBoost classifier is trained, it is used to retrieve a CN as a Char if the classifier vote sum is positive. Then, the retrieved neighbors are considered as characters, and they are also used to retrieve their own candidate neighbors recursively. The method stops when no new neighbor component is classified as a new character.
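The recursive retrieval loop can be sketched as a breadth-first propagation. Here `is_char_pair` stands in for the second AdaBoost classifier, and the dictionaries are a hypothetical candidate representation (center coordinates plus width and height), not the paper's data structures.

```python
from collections import deque

def retrieve_characters(chars, candidates, is_char_pair):
    """Recover misclassified characters: a rejected candidate within radius
    R = 4 * max(height, width) of an accepted character is re-tested with a
    pairwise classifier; newly accepted candidates propagate the search."""
    accepted = list(chars)
    pending = set(range(len(candidates)))
    queue = deque(accepted)
    while queue:
        ch = queue.popleft()
        radius = 4 * max(ch["h"], ch["w"])
        for i in sorted(pending):
            cn = candidates[i]
            dist = ((ch["x"] - cn["x"]) ** 2 + (ch["y"] - cn["y"]) ** 2) ** 0.5
            if dist <= radius and is_char_pair(ch, cn):
                pending.discard(i)
                accepted.append(cn)   # retrieved neighbor becomes a character
                queue.append(cn)      # ...and seeds further retrieval
    return accepted
```

Note how a chain of close candidates can be recovered step by step even when only the first link is within reach of an originally accepted character, which matches the recursive behavior described in the text.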
Note that no alignment feature is computed, as is done in many related works. Considering horizontal alignment helps to avoid character misclassification but restricts a method to horizontal text only. Thus, the proposed method can be applied to nonhorizontal text images.

Character Grouping.
Since most of the state-of-the-art text detection methods evaluate word localization instead of character segmentation, a character grouping stage for text detection is considered. Similar closest characters are grouped together and considered as candidate words. Then, the Hough transform is applied to obtain the final candidate word lines. The character grouping method is described as follows. First, for each character, the distance between the character and all its neighbors within a radius R = 4 · max(Char_height, Char_width) is computed. The distance is obtained as the minimum Euclidean distance between the convex hull of the character and those of its neighbors. All the characters are grouped into pairs, and a minimum region containing both components is created. The region is expanded to the minimum distance between characters.
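The pairwise distance used above reduces to the minimum Euclidean distance between two point sets; in this sketch the sets are sampled boundary points standing in for the convex hull vertices.

```python
import numpy as np

def min_pair_distance(pts_a, pts_b):
    """Minimum Euclidean distance between two (N, 2) point sets, a stand-in
    for the hull-to-hull distance used in the grouping stage."""
    diff = pts_a[:, None, :] - pts_b[None, :, :]   # all pairwise differences
    return float(np.sqrt((diff ** 2).sum(axis=-1)).min())
```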
All intersecting regions are considered as candidate words. Then, the Hough transform is applied to obtain the candidate word lines. Each of these lines is processed individually to verify whether all the selected characters belong to a single word. This is done by applying the AdaBoost classifier used in the retrieval stage. All the characters from the candidate word are compared with each other. Those characters that are classified as nonword characters with respect to all other characters form a new word, and so on. The method stops when no new word is created. At the end, those final words that have only one element and whose AdaBoost vote sum is lower than zero are eliminated. Figure 5 shows a character grouping example.

Evaluation Protocol.
The performance evaluation of the proposed method was carried out using the following metrics. Two evaluation types are selected: text segmentation and text localization. For text segmentation, the character-level recall-similarity rate [17] and the pixel- and atom-based measures are utilized [45].
For character candidate generation evaluation, the recall-similarity rate is utilized. The recall-similarity is defined as the ratio between the total number of correctly detected candidate regions and the number of ground-truth characters. A region is considered as a character candidate if the similarity value is at least 50%. The similarity value is defined as follows [17]:

similarity(D, GT) = \frac{\operatorname{area}(D \cap GT)}{\operatorname{area}(D \cup GT)},

where D and GT represent the detected and ground-truth bounding boxes, respectively. For pixel-level segmentation evaluation, the pixel- and atom-based measures are utilized. Pixel- and atom-based measures not only consider pixel-level accuracy but also take into account the morphological properties of characters. In [45], the minimal and maximal coverage criteria are introduced, which measure the degree of overlap between the ground-truth area and the obtained segmented component. The minimal coverage criterion is fulfilled if at least T_min = 90% of the ground-truth skeleton pixels are covered by the segmented component. Similarly, for the maximal criterion, the pixel distance to the ground-truth edge pixels should not exceed a maximum threshold T_max = min(5, 0.5 · G), where G is the maximum stroke width of the character.
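Assuming the similarity measure of [17] has the usual intersection-over-union form for axis-aligned boxes, it can be computed as follows (boxes given as (x1, y1, x2, y2)):

```python
def box_similarity(d, gt):
    """Intersection-over-union similarity between two axis-aligned boxes;
    1.0 for identical boxes, 0.0 for disjoint ones."""
    ix = max(0.0, min(d[2], gt[2]) - max(d[0], gt[0]))  # overlap width
    iy = max(0.0, min(d[3], gt[3]) - max(d[1], gt[1]))  # overlap height
    inter = ix * iy
    area_d = (d[2] - d[0]) * (d[3] - d[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_d + area_gt - inter
    return inter / union if union > 0 else 0.0
```

With the 50% rule from the text, a detection would count as a character candidate whenever `box_similarity(d, gt) >= 0.5`.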
On the contrary, although the proposed method is designed specifically for the text segmentation task, a text localization evaluation is carried out to compare its performance with that of the state-of-the-art methods. The recall (R), precision (P), and F-measure (F) are defined as follows [46]:

R = \frac{\sum_i \operatorname{Match}_G(G_i, D, t_r, t_p)}{|G|}, \quad P = \frac{\sum_j \operatorname{Match}_D(D_j, G, t_r, t_p)}{|D|}, \quad F = \frac{2PR}{P + R},

where G and D represent the ground-truth rectangle set and the detection rectangle set, respectively; Match_G and Match_D are the matching functions of [46]; and t_r ∈ [0, 1] and t_p ∈ [0, 1] are the recall and precision constraints, respectively. For more details, we refer to Wolf and Jolion [46].
For the MSER algorithm, the simulations were carried out using the reported MSER parameters [20], that is, Δ = 4, maximum variation v = 0.5, and minimum diversity d = 0.1.

Computer Simulations.
First, to analyze the tolerance of the proposed segmentation method to low contrast, high brightness, shadows, and nonuniform illumination degradations, computer simulations using synthetic images were performed. For the experiments, ten representative images from the ICDAR2013 dataset were selected. The selected images contain different symbols, font types, colors, sizes, and backgrounds. Each image was scaled, rotated, and synthetically degraded, obtaining 1000 synthetic images per degradation (see Figure 6). Table 2 shows the obtained results compared with the MSER method in terms of the recall-similarity measure. The proposed method shows a high candidate generation performance. The recall-similarity measure was above 90% in most of the cases, except for the brightness degradations. That is because brightness variations caused the loss of regions with low contrast (see Figure 6, second row, fifth column). Besides, the proposed segmentation method shows performance gains of up to 30% for nonuniform illumination and shadow degradations and of up to 10% for brightness and contrast variations compared with the MSER technique.

Text Segmentation Evaluation.
Since text segmentation depends on the quality of connected component generation, the proposed phase-based character candidate generation method is evaluated. Table 3 shows the obtained results in terms of the recall-similarity measure and the mean number of candidate regions. The results show that the proposed method obtains fewer character candidates with a higher similarity rate than the other methods. Our method outperforms the results obtained in [8,17], even though those methods utilize grayscale, RGB, Cb, and Cr channels. Although the recent methods [19,28] report good similarity results for the given dataset, their mean number of candidates per image is too high, almost 30 and 15 times larger, respectively, than that of the proposed method. It is important to note that there exists a trade-off between candidate region generation and computational complexity.
For the text segmentation evaluation, the precision and recall metrics were computed, as well as the F-measure. Table 4 shows the results of the proposed method on the ICDAR2013 dataset. The proposed method outperforms the methods of [20,48], which utilize grayscale images for character candidate extraction.
Both results, character candidate generation and text segmentation, show that the proposed method obtains fewer candidate regions with a more accurate pixel-level segmentation result. Now, we provide the performance of the proposed method at its different stages. Table 5 presents character-level results in terms of recall, precision, and F-measure. We can observe that, after classification of candidates, the precision improves by 58%, while recall decreases by almost 24%. This is because, at the classifier training stage, some characters were purposely mislabelled as noncharacters. As expected, the retrieval stage recovers some characters that were misclassified; however, nontext components are also restored. Finally, the grouping stage discards noncharacters that were recovered at the retrieval stage, as well as some correct characters.

Text Localization Evaluation.
Since most of the existing methods present text localization evaluation instead of character segmentation, we also carry out this evaluation. Table 6 shows the text localization performance of the MSER-based techniques on the ICDAR2013 dataset. It can be seen that the proposed method shows better F-measure results than most other methods, except the techniques [17,28], in which multiple image channels are used. However, the method [17] is designed for horizontal text only, which decreases its performance for multioriented text, while the method [28] yields a lower F-measure than the proposed method when only grayscale images are used. Besides, the proposed method outperforms the latter on the multioriented USTB-SV1K dataset (see Table 7). Next, the performance of the proposed method and the state-of-the-art algorithms [16, 20, 24, 28-30, 34, 37] on four datasets is evaluated using the protocol given in [34]. The results are shown in Table 7. One can observe that the proposed technique, using only 299 training images, outperforms the state-of-the-art methods on the USTB and OSTD multioriented datasets. The performance of the methods [28,29] drops by almost 30% compared with the performance of these methods on the ICDAR2013 dataset containing horizontally aligned texts. Since the MSRA dataset contains Chinese characters, we perform two evaluations of the proposed method: over the entire MSRA dataset and over the English text images of the dataset. Note that the classifiers used in our method were trained using Latin-based characters only. For a fair comparison with other methods on this dataset, the proposed technique would need additional training with Chinese characters. It is of interest to note that the proposed method can detect parts of Chinese texts (see Figure 7).
Although the deep learning-based method [30] outperforms the proposed method (for the complete test set), its authors report a decrease of 20% in F-measure when using only the MSRA training set (300 images), thereby obtaining a lower F-measure than the proposed method. Figures 8 and 9 show examples of correct text detection and common errors of the proposed method on the USTB dataset, respectively. Three types of errors were found: the Google logo error (first row), where the proposed method recognized the Google watermark in the images; the unmarked text error (second row), where the proposed method recognized text that was not considered as text by the dataset ground truth; and the false positive and false negative errors (third row).
Finally, the average processing time of the proposed method was estimated using the ICDAR2013 dataset on a 2.8 GHz Intel Xeon E5-1603 PC with 16 GB of RAM. Table 8 summarizes the running times of all tested algorithms, as well as the hardware used.
One can observe that the methods [28,30] achieved the best recognition runtimes since a GPU was utilized for their implementation. The methods [18,48] work only for horizontal text, which reduces the computational complexity (runtime) of these methods. Note that all deep learning algorithms require significantly longer training time compared with the proposed method, which is reasonably fast for detection and segmentation even using a conventional computer without a graphics processor. Further optimization of the method implementation, as well as the use of GPU technology, can definitely reduce the overall processing time of our method.

Conclusion
In this paper, a novel multioriented text detection and segmentation method inspired by the human vision system was proposed. The method is based on the local energy model and the scale-space monogenic signal framework to extract essential local phase information.
The proposed method consists of phase-based text segmentation, character retrieval, and character grouping stages. The phase-based candidate regions are extracted by applying the MSER algorithm to the local phase image; meanwhile, character retrieval and grouping are done by applying AdaBoost classifiers to avoid the use of heuristic rules. The proposed method proved to be robust to geometric distortions, font variations, complex backgrounds, low contrast, high brightness, shadows, and illumination changes. The method achieves a high character segmentation performance while possessing low computational complexity (number of extracted components). The method outperforms the state-of-the-art algorithms on typical databases in terms of character segmentation, text localization, and the number of candidate regions. Besides, our method is not restricted to horizontal texts, as most of the existing methods are, but also handles multioriented texts.
Finally, the proposed method can be used for text detection in different languages or handwritten texts.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.