Elsevier

Pattern Recognition Letters

Volume 29, Issue 6, 15 April 2008, Pages 724-734

Lexicon reduction using dots for off-line Farsi/Arabic handwritten word recognition

https://doi.org/10.1016/j.patrec.2007.11.009

Abstract

Unlike in many other languages, 18 out of 32 Farsi characters have dots appearing in groups of one, two or three. Some of these letters share common primary shapes, differing only in the number of dots and whether the dots are above or below the primary shape. In this paper, a new concept of using dots in a cursively handwritten Farsi/Arabic word is introduced for lexicon reduction, and a fast method for extracting dots is presented. The technique involves extracting and representing the number and position of dots in off-line handwritten words to eliminate unlikely candidates. Experimental results on a set of 12,000 handwritten word images yield a lexicon reduction of 93% with an accuracy of 85%. The proposed lexicon reduction algorithm achieves a speedup factor of 2 as well as a 13% improvement in recognition rate.

Introduction

During the past decade, remarkable progress has been achieved in the field of handwritten word recognition, and many applications, such as automatic reading of postal addresses, bank checks and forms, have emerged. However, most of the published work deals with the recognition of Latin and Chinese scripts. Farsi/Arabic script recognition has progressed slowly, mainly due to the special characteristics of these languages. The reader is referred to Amin (1998), Märgner et al. (2005), and Lorigo and Govindaraju (2006) for more details on the state of the art in Arabic character and word recognition.

Due to ambiguity and the large diversity of writing styles, recognition systems are generally based on a set of possible words called a lexicon. Depending on the application type, the size of the lexicon can vary from 20–30 words, in reading check amounts (Kaufmann et al., 1996), to 10,000–60,000 words, for English text recognition (Korich et al., 2003). The problem with such large lexicons is the number of times that the input image has to be compared with the words in the lexicon. Recognition with large lexicons may benefit from first eliminating lexicon entries that are unlikely to match the given image. This process, lexicon reduction, has desirable effects not only on the recognition time but also on the recognition accuracy (Madhvanath and Govindaraju, 1995). Usually, when the lexicon is small, recognition accuracy is more important than recognition time. On the other hand, recognition speed is a critical issue for large lexicons.

Lexicon reduction can draw on three basic sources of information: knowledge of the application environment, the characteristics of the input pattern, and clustering of similar lexicon entries (Korich et al., 2003). Usually, the application environment is the main source of information for limiting the lexicon size in handwriting recognition. For example, in postal applications, most of the proposed approaches attempt to recognize the ZIP code first, instead of reading other parts of the address. Depending on the reliability of ZIP code recognition, the pruning system can reduce a lexicon of thousands of entries to a few hundred words (Lee and Leedham, 2000).

However, when recognizing words or sentences, the application environment has little influence on lexicon reduction. Here, linguistic knowledge plays an important role instead. By using language models based on grammars, both lexicon pruning and improved recognition accuracy can be achieved (Marti and Bunke, 2000). This source of information, however, is more suitable for sentence recognition than for isolated word recognition.

For a lexicon of isolated words, some characteristics of the input pattern, such as the word length and shape, can be used for reduction. The length of the input image is a very simple criterion for lexicon pruning. Long words can be easily distinguished from short words by comparing only their lengths. The length of a word can be estimated from the length of the observation sequence extracted from the input image. In this method, the reduction system is directly based on the feature vectors used as the input for HMMs. Therefore, little additional work is required for lexicon reduction (Kaufmann et al., 1997). Other approaches rely on topological features to estimate the number of characters as a measure of word length. Guillevic et al. (2000) estimated the number of characters by counting the stroke crossings within the main body of a word. The word shape and writing style are another basis for lexicon pruning. The presence or counts of topological features such as ascenders and descenders (Madhvanath et al., 2001) and t-crossings and i-dots (Carbonnel and Anquetil, 2003) have frequently been used to limit the number of candidates in a Latin lexicon. For more details on lexicon reduction methods, see Korich et al. (2003).
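The length criterion described above can be sketched in a few lines. This is only an illustration, not the method of any of the cited papers: the function name `prune_by_length`, the `tolerance` parameter, and the use of the character count as the length measure are all assumptions made for the example.

```python
def prune_by_length(lexicon, estimated_length, tolerance=2):
    """Keep entries whose character count is within `tolerance` of the estimate.

    `estimated_length` stands in for a length estimate obtained from the image
    (hypothetical here; a real system would derive it from image features).
    """
    return [word for word in lexicon
            if abs(len(word) - estimated_length) <= tolerance]


# Toy Latin lexicon; the input image is assumed to look about 8 characters long.
lexicon = ["ten", "eleven", "seventeen", "one", "thousand"]
candidates = prune_by_length(lexicon, estimated_length=8, tolerance=1)
print(candidates)
```

A larger `tolerance` keeps more candidates and lowers the risk of discarding the correct word, at the cost of a weaker reduction.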

Dots are far more common in Farsi than in many other languages: English has only two dotted characters, whereas 18 out of 32 Farsi characters have dots appearing above or below the baseline, in groups of one, two or three. Some of these letters share common primary shapes, differing only in the number of dots and whether the dots are above or below the primary shape. In order to differentiate such letters, structural features can be used to capture dot information explicitly. In the systems developed by Khorsheed (2003) and Amin (2003), dots were used alongside other structural features to recognize handwritten Arabic manuscripts. Usually, recognition systems utilize two different types of features: structural and statistical. Structural features are highly tolerant of variations and distortions in handwritten words, but extracting them from images is not always easy. Due to the frequency of dotted characters, Farsi and Arabic scripts are sensitive to speckle noise. In addition, parts of broken characters, caused by the binarization method, and some small characters are also similar to dots. Extracting dot information as a structural feature is therefore a difficult task in Farsi/Arabic word recognition. A dot is the smallest and most variable component of Farsi/Arabic script. From a statistical point of view, therefore, dots do not play a crucial role and may even increase the recognition error. Nevertheless, the importance of dots in Farsi/Arabic languages is undeniable.

Unlike previous work, which used dots for word recognition, this paper introduces a new concept of using the presence and number of dots and their position with respect to the baseline for off-line handwritten Farsi/Arabic lexicon reduction. With the recent collection and release of large Arabic data sets, the importance of Farsi/Arabic lexicon reduction techniques has grown significantly.
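One way to picture dot-based lexicon reduction is the sketch below. It is a hedged illustration under several assumptions and is not the paper's algorithm: the token encoding (`A`/`B` for above/below the baseline, followed by the dot count), the use of edit distance as the matching rule, the `max_distance` threshold, and the function names are all hypothetical. The descriptors for the romanized city names were written by hand for this example.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of dot tokens."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def reduce_lexicon(lexicon, observed, max_distance=1):
    """Keep words whose dot descriptor is close to the observed descriptor."""
    return [word for word, descriptor in lexicon.items()
            if edit_distance(descriptor, observed) <= max_distance]


# Hypothetical descriptors in writing order: "A2" = two dots above the baseline.
LEXICON = {
    "Tehran": ("A2", "A1"),
    "Shiraz": ("A3", "B2", "A1"),
    "Tabriz": ("A2", "B1", "B2", "A1"),
    "Yazd": ("B2", "A1"),
    "Qom": ("A2",),
    "Ahvaz": ("A1",),
}

# Descriptor extracted from an image of "Tabriz" in which one dot was missed.
observed = ("A2", "B2", "A1")
print(reduce_lexicon(LEXICON, observed, max_distance=1))
```

Allowing a nonzero distance tolerates dots that are missed or spurious in the extracted descriptor, but it also lets more lexicon entries through; an exact match (`max_distance=0`) prunes harder and is less robust.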


Farsi handwriting characteristics

Since the characteristics of Farsi (Arabic) handwriting differ from those of Latin handwriting and some readers may be unfamiliar with Farsi script, a brief description of the important aspects is presented.

Farsi text is inherently cursive both in handwritten and printed forms and is written horizontally from right to left. Farsi writing is very similar to Arabic in terms of strokes and structure. Therefore, a Farsi word recognizer can also be used for recognition of Arabic words. The only

Extraction of dots from off-line handwritten Farsi/Arabic words

Extraction of dots from off-line handwritten Farsi and Arabic words is an interesting area for research. However, most previous work on structural feature extraction has relied on simplifying assumptions regarding dots, and no algorithm has been published for dot extraction. In this paper we present a new approach to lexicon reduction in which we extract dots from the input image.
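To make the task concrete, a naive baseline for finding dot candidates is sketched below. It is explicitly not the extraction method proposed in the paper, whose details are not given in this excerpt: it assumes a binarized image stored as rows of 0/1, a known baseline row, and a hypothetical `max_area` threshold, and it treats every small 8-connected component as a dot candidate.

```python
from collections import deque


def find_dot_candidates(image, baseline_row, max_area=4):
    """Return (position, area) for each small 8-connected component.

    position is "above" if the component's mean row index is smaller than
    baseline_row (row 0 is the top of the image), otherwise "below".
    """
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    dots = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first search collects all pixels of one component.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            if len(pixels) <= max_area:  # large components are main strokes
                mean_row = sum(y for y, _ in pixels) / len(pixels)
                position = "above" if mean_row < baseline_row else "below"
                dots.append((position, len(pixels)))
    return dots


# Toy image: one pixel above a horizontal stroke, a two-pixel blob below it.
IMAGE = [
    [0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 0, 0],
]
print(find_dot_candidates(IMAGE, baseline_row=2))
```

A size threshold alone cannot tell true dots from speckle noise, fragments of broken characters, or small characters, which is part of why dot extraction is difficult in practice.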

The word recognition system

The proposed system is designed for reading city names from postal address fields. Since the lexicon of this application is limited to 200 city names and segmentation of handwritten Farsi/Arabic words is also a crucial problem, a holistic approach, based on model-discriminant discrete HMMs, is chosen for recognition. Furthermore, lexical pruning has been developed on the basis of the number and position of the dots in the input word image to optimize the recognition rate, the recognition

Experimental results

The proposed system for Farsi/Arabic word recognition consists of two main parts: lexicon reduction and word recognition. The former uses dot information, while the latter uses the word’s main stroke. This section presents the results for lexicon reduction and for overall system performance separately. A database consisting of about 17,000 images of 200 city names of Iran was used. The descriptor string for each of the 200 classes was generated manually.
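Lexicon reduction is commonly scored with two quantities, and the sketch below shows how they could be computed. The definitions are assumptions based on common usage, since this excerpt does not state the paper's exact formulas: accuracy of reduction is taken as the fraction of test images whose true word survives pruning, and degree of reduction as the average fraction of the lexicon that is removed. The function name and the toy data are hypothetical.

```python
def reduction_metrics(results, lexicon_size):
    """Compute (accuracy, degree of reduction) over a test set.

    results: list of (true_word, reduced_lexicon) pairs, one per test image.
    """
    n = len(results)
    accuracy = sum(true in reduced for true, reduced in results) / n
    reduction = sum(1 - len(reduced) / lexicon_size
                    for _, reduced in results) / n
    return accuracy, reduction


# Toy data: word identifiers "w0", "w1", ... stand in for lexicon entries.
results = [
    ("w3", [f"w{i}" for i in range(10)]),   # true word kept
    ("w15", [f"w{i}" for i in range(20)]),  # true word kept
    ("w50", [f"w{i}" for i in range(10)]),  # true word pruned away
    ("w0", [f"w{i}" for i in range(20)]),   # true word kept
]
accuracy, reduction = reduction_metrics(results, lexicon_size=200)
print(f"accuracy={accuracy:.3f}, reduction={reduction:.3f}")
```

Under these assumed definitions, a 93% reduction of a 200-word lexicon would leave about 200 × 0.07 = 14 candidate words per image on average.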

In the first

Conclusion

Due to specific characteristics of Farsi and Arabic scripts, such as cursiveness and the dependence of Farsi words on dots, recognition of handwritten words in these languages is significantly more difficult than the recognition of English words. We explored the use of dots in lexicon reduction for Farsi/Arabic handwritten word recognition. Dot information in a word can be extracted without performing word recognition or contextual analysis. The string descriptor for each word image is based on

References (25)

  • Farooq, F., Govindaraju, V., Perrone, M., 2005. Pre-processing methods for handwritten Arabic documents. In: Proc. 8th...
  • Guillevic, D., Nishiwaki, D., Yamada, K., 2000. Word lexicon reduction by character spotting. In: Proc. 7th...