Lexicon reduction using dots for off-line Farsi/Arabic handwritten word recognition

doi:10.1016/j.patrec.2007.11.009

Pattern Recognition Letters

Volume 29, Issue 6, 15 April 2008, Pages 724-734

https://doi.org/10.1016/j.patrec.2007.11.009

Abstract

In contrast to many other languages, 18 out of 32 Farsi characters have dots appearing in groups of one, two or three. Some of these letters share common primary shapes, differing only in the number of dots and whether the dots are above or below the primary shape. In this paper, a new concept of using dots in a cursively handwritten Farsi/Arabic word is introduced for lexicon reduction, and a fast method for extracting dots is presented. The technique extracts and represents the number and position of dots in off-line handwritten words in order to eliminate unlikely candidates. Experimental results on a set of 12,000 handwritten word images yield a lexicon reduction of 93% with an accuracy of 85%. The proposed lexicon reduction algorithm achieves a speedup factor of 2 as well as a 13% improvement in recognition rate.

Introduction

During the past decade, remarkable progress has been made in the field of handwritten word recognition, and many applications have emerged, such as automatic reading of postal addresses, bank checks and forms. However, most published work deals with the recognition of Latin and Chinese scripts. Farsi/Arabic script recognition has progressed slowly, mainly due to the special characteristics of these languages. For more details on the state of the art in Arabic character and word recognition, the reader is referred to Amin (1998), Märgner et al. (2005) and Lorigo and Govindaraju (2006).

Because of ambiguity and the large diversity of writing styles, recognition systems are generally based on a set of possible words called a lexicon. Depending on the application, the size of the lexicon can vary from 20–30 words when reading check amounts (Kaufmann et al., 1996) to 10,000–60,000 words for English text recognition (Korich et al., 2003). The problem with such large lexicons is the number of times that the input image has to be compared with the words in the lexicon. Recognition with large lexicons may benefit from first eliminating lexicon entries that are unlikely to match the given image. This process, lexicon reduction, has desirable effects not only on recognition time but also on recognition accuracy (Madhvanath and Govindaraju, 1995). Usually, when the lexicon is small, recognition accuracy is more important than recognition time. On the other hand, recognition speed is a critical issue for large lexicons.

Lexicon reduction can draw on several basic sources of information, such as knowledge of the application environment, characteristics of the input pattern, and clustering of similar lexicon entries (Korich et al., 2003). Usually, the application environment is the main source of information for limiting the lexicon size in handwriting recognition. For example, in postal applications, most of the proposed approaches attempt to recognize the ZIP code first rather than reading other parts of the address. Depending on how reliably the ZIP code is recognized, the pruning system can reduce a lexicon of thousands of entries to a few hundred words (Lee and Leedham, 2000).

However, when recognizing words or sentences, the application environment has little influence on lexicon reduction. Here, linguistic knowledge plays an important role instead. By using language models based on grammars, both lexicon pruning and improved recognition accuracy can be achieved (Marti and Bunke, 2000). This source of information, however, is more suitable for sentence recognition than for isolated word recognition.

For a lexicon of isolated words, some characteristics of the input pattern, such as word length and shape, can be used for reduction. The length of the input image is a very simple criterion for lexicon pruning. Long words can be easily distinguished from short words by comparing only their lengths. The length of a word can be estimated from the length of the observation sequence extracted from the input image. In this method, the reduction system is directly based on the feature vectors used as the input for HMMs, so little additional work is required for lexicon reduction (Kaufmann et al., 1997). Other approaches rely on topological features to estimate the number of characters as a measure of word length. Guillevic et al. (2000) estimated the number of characters by counting the stroke crossings within the main body of a word. Word shape and writing style provide another basis for lexicon pruning. The presence or counts of topological features such as ascenders and descenders (Madhvanath et al., 2001) or t-crossings and i-dots (Carbonnel and Anquetil, 2003) have frequently been used to limit the number of candidates in a Latin lexicon. For more details on lexicon reduction methods, see Korich et al. (2003).
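The length criterion described above can be illustrated with a minimal sketch. It assumes that an estimate of the number of characters is already available (for example from the observation sequence), and the function name `prune_by_length`, the `tolerance` parameter and the romanized city names are hypothetical choices made for this illustration, not part of any cited system.

```python
def prune_by_length(lexicon, estimated_length, tolerance=2):
    """Keep entries whose character count is within `tolerance` of the estimate."""
    return [
        word
        for word in lexicon
        if abs(len(word) - estimated_length) <= tolerance
    ]


# Romanized city names, used here only to keep the example readable.
lexicon = ["qom", "yazd", "tehran", "shiraz", "kermanshah"]

# An estimate of 5 characters with a tolerance of 1 keeps words of length 4 to 6.
print(prune_by_length(lexicon, estimated_length=5, tolerance=1))
# -> ['yazd', 'tehran', 'shiraz']
```

A wider tolerance keeps more candidates and lowers the risk of discarding the correct word, at the cost of a smaller reduction.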

Dots play a much larger role in Farsi than in many other languages. English has only two dotted characters, whereas 18 out of 32 Farsi characters have dots appearing above or below the baseline, in groups of one, two or three. Some of these letters share common primary shapes, differing only in the number of dots and whether the dots are above or below the primary shape. In order to differentiate such letters, structural features can be used to capture dot information explicitly. In the systems developed by Khorsheed (2003) and Amin (2003), dots were used alongside other structural features to recognize handwritten Arabic manuscripts. Recognition systems usually utilize two different types of features: structural and statistical. Structural features tolerate variations and distortions in handwritten words well, but extracting them from images is not always easy. Because dotted characters are so frequent, Farsi and Arabic scripts are sensitive to speckle noise. In addition, fragments of broken characters caused by the binarization method, as well as some small characters, resemble dots. Extracting dot information as a structural feature is therefore a difficult task in Farsi/Arabic word recognition. A dot is the smallest and most variable part of Farsi/Arabic writing. From a statistical point of view, therefore, dots do not play a crucial role and may even increase recognition error. Nevertheless, the importance of dots in Farsi/Arabic languages is undeniable.
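To make the role of dots concrete, the following sketch maps each dotted Farsi letter to a token giving the number of dots and their position, and derives an idealized dot sequence for a typed word. The token format (`"2A"` for two dots above, `"1B"` for one dot below) and the function name `dot_descriptor` are assumptions made for this illustration; the encoding used by the authors is not given in this excerpt. The sketch also ignores handwriting effects such as merged or missing dots.

```python
# Dotted Farsi letters: token = number of dots + A (above) or B (below).
DOTS = {
    "\u0628": "1B",  # beh
    "\u067e": "3B",  # peh
    "\u062a": "2A",  # teh
    "\u062b": "3A",  # theh
    "\u062c": "1B",  # jeem
    "\u0686": "3B",  # tcheh
    "\u062e": "1A",  # khah
    "\u0630": "1A",  # thal
    "\u0632": "1A",  # zain
    "\u0698": "3A",  # jeh
    "\u0634": "3A",  # sheen
    "\u0636": "1A",  # dad
    "\u0638": "1A",  # zah
    "\u063a": "1A",  # ghain
    "\u0641": "1A",  # feh
    "\u0642": "2A",  # qaf
    "\u0646": "1A",  # noon
}
# Yeh (Farsi and Arabic code points): in Farsi it carries two dots below
# only in its initial and medial forms, i.e. when another letter follows.
YEH = {"\u06cc", "\u064a"}


def dot_descriptor(word):
    """Return the idealized dot tokens of a typed word, in writing order."""
    tokens = []
    for index, letter in enumerate(word):
        if letter in YEH:
            if index < len(word) - 1:
                tokens.append("2B")
        elif letter in DOTS:
            tokens.append(DOTS[letter])
    return tokens


tabriz = "\u062a\u0628\u0631\u06cc\u0632"  # the city name Tabriz
print(dot_descriptor(tabriz))
# -> ['2A', '1B', '2B', '1A']
```

Counting yeh, the table covers the 18 dotted letters mentioned above; letters without dots contribute nothing to the sequence.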

Unlike previous work, which used dots for word recognition, this paper introduces a new concept: using the presence and number of dots and their position with respect to the baseline for off-line handwritten Farsi/Arabic lexicon reduction. With the recent collection and release of large Arabic data sets, the importance of Farsi/Arabic lexicon reduction techniques has grown significantly.

Section snippets

Farsi handwriting characteristics

Since the characteristics of Farsi (Arabic) handwriting differ from those of Latin handwriting and some readers may be unfamiliar with Farsi script, a brief description of the important aspects is presented.

Farsi text is inherently cursive both in handwritten and printed forms and is written horizontally from right to left. Farsi writing is very similar to Arabic in terms of strokes and structure. Therefore, a Farsi word recognizer can also be used for recognition of Arabic words. The only

Extraction of dots from off-line handwritten Farsi/Arabic words

Extraction of dots from off-line handwritten Farsi and Arabic words is an interesting area for research. However, most previous work on structural feature extraction has made simplifying assumptions regarding dots, and no algorithm has been published for dot extraction. In this paper we present a new approach to lexicon reduction in which we extract dots from the input image.
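The extraction algorithm itself is not included in this excerpt. As a rough illustration of the problem only, the sketch below uses a generic connected-component size filter, which is not the authors' method: small ink components are treated as dot candidates and labeled by their position relative to a baseline row. The baseline row, the `max_area` threshold and the toy image are all assumptions.

```python
from collections import deque


def connected_components(image):
    """Return the 8-connected ink components as lists of (row, col) pixels."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != "#" or seen[r][c]:
                continue
            seen[r][c] = True
            queue, pixels = deque([(r, c)]), []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        inside = 0 <= ny < rows and 0 <= nx < cols
                        if inside and image[ny][nx] == "#" and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def dot_candidates(image, baseline_row, max_area=4):
    """Label each small component 'above' or 'below' the assumed baseline."""
    found = []
    for pixels in connected_components(image):
        if len(pixels) > max_area:
            continue  # too large to be a dot group: part of the main stroke
        centroid_row = sum(y for y, _ in pixels) / len(pixels)
        found.append("above" if centroid_row < baseline_row else "below")
    return found


# Toy binary image: '#' is ink, row 4 is taken as the baseline.
image = [
    "..##......",
    "..##......",
    "..........",
    "#.......#.",
    "#########.",
    "..........",
    ".....#....",
]
print(dot_candidates(image, baseline_row=4))
# -> ['above', 'below']
```

A pure size filter of this kind cannot tell speckle noise or fragments of broken characters from real dots, and it does not separate a merged group of two or three dots, which is part of why dot extraction is described above as a difficult task.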

The word recognition system

The proposed system is designed for reading city names from postal address fields. Since the lexicon of this application is limited to 200 city names and segmentation of handwritten Farsi/Arabic words is itself a difficult problem, a holistic approach based on a model-discriminant discrete HMM is chosen for recognition. Furthermore, lexical pruning has been developed on the basis of the number and position of the dots in the input word image to optimize the recognition rate, the recognition
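Pruning on the number and position of dots, as described in this snippet, could be sketched as follows. The matching rule is an assumption: the excerpt does not state how descriptors are compared, so this sketch uses a plain edit distance between token sequences with a hypothetical threshold `max_distance`. The token format (`"2A"` for two dots above, `"1B"` for one dot below) and the four-entry lexicon are likewise illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def prune_lexicon(observed, lexicon, max_distance=1):
    """Keep the entries whose dot descriptor is close to the observed one."""
    return [
        name
        for name, descriptor in lexicon.items()
        if edit_distance(observed, descriptor) <= max_distance
    ]


# Hypothetical descriptors for four city names.
lexicon = {
    "Tabriz": ["2A", "1B", "2B", "1A"],
    "Shiraz": ["3A", "2B", "1A"],
    "Qom": ["2A"],
    "Yazd": ["2B", "1A"],
}

# Dots found in the input image: three above, two below, one above.
observed = ["3A", "2B", "1A"]
print(prune_lexicon(observed, lexicon))
# -> ['Shiraz', 'Yazd']
```

Only the surviving entries would then be passed to the word recognizer, which is where a reduction in recognition time would come from.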

Experimental results

The proposed system for Farsi/Arabic word recognition consists of two main parts: lexicon reduction and word recognition. The former uses dot information, while the latter uses the word's main stroke. This section presents the results for lexicon reduction and for overall system performance separately. A database consisting of about 17,000 images of 200 city names of Iran was used. The descriptor string for each class was manually generated for all 200 classes.
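Lexicon reduction results are commonly reported with two figures: how much the lexicon shrinks, and how often the correct word survives the pruning. The sketch below computes both under commonly used definitions, which are assumed here because the excerpt does not define them; the function name `reduction_metrics` and the sample data are hypothetical.

```python
def reduction_metrics(results, lexicon_size):
    """Compute (accuracy of reduction, degree of reduction).

    `results` holds one (true_label, reduced_lexicon) pair per test image.
    """
    kept = sum(1 for truth, reduced in results if truth in reduced)
    mean_size = sum(len(reduced) for _, reduced in results) / len(results)
    accuracy = kept / len(results)          # correct word survives pruning
    degree = 1 - mean_size / lexicon_size   # average fraction removed
    return accuracy, degree


# Four test images against a 200-word lexicon, with classes numbered 0..199.
results = [
    (0, set(range(10))),   # correct class kept, 10 candidates remain
    (5, set(range(20))),   # kept, 20 remain
    (50, set(range(30))),  # correct class pruned away, 30 remain
    (3, set(range(40))),   # kept, 40 remain
]
print(reduction_metrics(results, lexicon_size=200))
# -> (0.75, 0.875)
```

The two figures trade off against each other: a stricter pruning rule raises the degree of reduction but lowers the chance that the correct word remains in the reduced lexicon.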

In the first

Conclusion

Due to specific characteristics of Farsi and Arabic scripts, such as cursiveness and the dependence of Farsi words on dots, recognition of handwritten words in these languages is significantly more difficult than recognition of English words. We explored the use of dots in lexicon reduction for Farsi/Arabic handwritten word recognition. Dot information in a word can be extracted without performing word recognition or contextual analysis. The string descriptor for each word image is based on

References (25)

Amin, A., 1998. Offline Arabic character recognition: The state of art. Pattern Recogn.
Amin, A., 2003. Recognition of hand-printed characters based on structural description and inductive logic programming. Pattern Recognition Lett.
Azmi, R., et al., 2001. A new segmentation technique for omnifont Farsi text. Pattern Recognition Lett.
Dehghan, M., et al., 2001. Handwritten Farsi(Arabic) word recognition: A holistic approach using discrete HMM. Pattern Recogn.
Dehghan, M., et al., 2001. Unconstrained Farsi handwritten word recognition using fuzzy vector quantization and hidden Markov models. Pattern Recognition Lett.
Khorsheed, M.S., 2003. Recognising handwritten Arabic manuscripts using a single hidden Markov model. Pattern Recognition Lett.
Madhvanath, S., et al., 2001. Syntactic methodology of pruning large lexicons in cursive script recognition. Pattern Recogn.
Carbonnel, S., Anquetil, E., 2003. Lexical post-processing optimization for handwritten word recognition. In: Proc. 7th...
Chen, M.Y., et al., 1995. Variable duration hidden Markov and morphological segmentation for handwritten word recognition. IEEE Trans. Image Process.
Damerau, F.J., 1964. A technique for computer detection and correction of spelling errors. Comm. ACM
Farooq, F., Govindaraju, V., Perrone, M., 2005. Pre-processing methods for handwritten Arabic documents. In: Proc. 8th...
Guillevic, D., Nishiwaki, D., Yamada, K., 2000. Word lexicon reduction by character spotting. In: Proc. 7th...

