Recognition of Off-line Printed Arabic Text Using Hidden Markov Models

In this paper, we introduce a method for recognizing printed Arabic text. Recognition of printed text is very important in information technology applications. Arabic belongs to a group of languages written with connected characters, along with Urdu, Kurdish, Persian, and old (Ottoman) Turkish. Connected letters are difficult to recognize because each letter takes several forms: one shape at the beginning of a word, another in the middle, and another at the end. In languages whose characters are not connected, by contrast, a letter has a single shape wherever it occurs in a word, and ready-made recognition programs for such text have been available for a long time. We present an off-line system that recognizes printed Arabic text using a Hidden Markov Model, aided by an algorithm that segments the text line into sections and then into characters.


Introduction
Artificial intelligence is one of the most important applied fields of computer science. It has many applications, including natural language processing, automatic translation, and pattern recognition. In recent years, research on automatic character recognition has expanded significantly, both directly and indirectly. The term "understanding" is not meant here in its usual sense: for text read by a computer, it means trying to decode the message being conveyed.[2] The first successful effort in this regard was made by the Russian scientist Tyurin in 1900, followed by Fournier d'Albe's attempt in 1912 to build a reading machine that rendered letters as sound, and by the prosthetic aid built by Thomas in 1926.[3] Character recognition is used in processing administrative files such as contracts, birth certificates, questionnaires, bank files, and postal addresses. It has made great strides for foreign languages, but its applications to Arabic, despite the commendable efforts of some competent bodies, remain below the required level.
One of the applications of artificial intelligence software is pattern recognition.[4] Pattern recognition is the study of how machines can observe the environment, learn to distinguish patterns of interest, and make reasonable decisions about the categories of those patterns.[5] A pattern has been defined as the opposite of chaos: an entity, vaguely defined, that can be given a name.[6] Despite gradual improvements in pattern recognition applications in the late twentieth and early twenty-first centuries, character recognition remains one of the most important problems in pattern recognition.[7] Its applications include reading the mailing address on an envelope, archiving and retrieving text, and digitizing libraries.
Visual pattern recognition passes through several stages, the last of which is classification, which can be carried out in several ways. In this research we use the Hidden Markov Model (HMM) to recognize printed Arabic text. The HMM is one of the models used in speech and language processing.[8] It is known as a doubly stochastic model, in which the hidden states can be observed only through certain observations.[9]

A model for character recognition system
A character recognition system generally consists of four basic stages, illustrated in Figure (1). It begins with the input of the document containing the text to be recognized and ends with the recognition of the characters in that document.

Figure (1): A model for character recognition
A pattern recognition system need not include all of these stages. Some stages can be omitted without affecting the recognition process; for example, a system can classify without a feature extraction stage by using template matching instead.

Markov chains
Mathematical models may be deterministic or stochastic. Many real-life phenomena are stochastic, meaning that their future behavior is not completely determined and cannot be predicted with certainty.[11] For such phenomena a stochastic model is the most appropriate representation. The system shown in Figure (2) can be described at any given time as being in one of a set of N discrete states (S1, S2, ..., SN).
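The idea of a system that moves among N discrete states can be illustrated with a minimal sketch. The three-state transition matrix and the function name `simulate_chain` below are hypothetical and chosen only for illustration; they are not taken from Figure (2).

```python
import random

def simulate_chain(transition, start, steps, rng):
    """Walk a first-order Markov chain for `steps` transitions."""
    state = start
    path = [state]
    for _ in range(steps):
        # The next state is drawn using only the current state's row.
        state = rng.choices(range(len(transition)), weights=transition[state])[0]
        path.append(state)
    return path

# Hypothetical 3-state chain (S1, S2, S3); each row sums to 1.
TRANSITION = [
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
]

path = simulate_chain(TRANSITION, start=0, steps=10, rng=random.Random(0))
```

Because each row of this illustrative matrix only allows staying in place or moving forward, the generated path never returns to an earlier state.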

The Hidden Markov Model
The hidden Markov model is a finite-state machine in which the state at time t depends only on the state at time t-1, and which generates observations as it moves between states. The sequence of states that produces a given observation sequence is unknown.[12] The states are not visible, hence the name hidden Markov model. Transitions between states are governed by a set of probabilities called transition probabilities, and in any given state an outcome or observation is produced according to the probability distribution associated with that state.
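One standard way to compute the probability that such a model produced a given observation sequence is the forward algorithm. The sketch below assumes a discrete-symbol model; the two-state parameters and the name `forward_probability` are hypothetical and are not the values used in this paper.

```python
def forward_probability(initial, transition, emission, observations):
    """Return P(observations | model) using the forward algorithm."""
    n_states = len(initial)
    # Initialisation: start in state i and emit the first symbol.
    alpha = [initial[i] * emission[i][observations[0]] for i in range(n_states)]
    for symbol in observations[1:]:
        # Induction: sum over every predecessor state, then emit `symbol`.
        alpha = [
            sum(alpha[i] * transition[i][j] for i in range(n_states)) * emission[j][symbol]
            for j in range(n_states)
        ]
    return sum(alpha)

# Hypothetical 2-state, 2-symbol model (illustrative values only).
INITIAL = [1.0, 0.0]
TRANSITION = [[0.5, 0.5], [0.0, 1.0]]
EMISSION = [[0.9, 0.1], [0.2, 0.8]]

likelihood = forward_probability(INITIAL, TRANSITION, EMISSION, [0, 1])
```

In a recognizer, this likelihood would be computed once per candidate model, and the model with the highest value would give the recognized character.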

Image input
In this step, 28 lines of text were printed, each line containing one particular character in all its forms (initial, medial, and final). The text was then stored as a binary image in a BMP file. The image data are stored in binary format (0, 1): a black pixel that is part of the pattern is represented by the value 0, and a white pixel by the value 1. We did not perform noise reduction, because the image was not acquired through an optical device, such as a scanner or light pen, that introduces noise.
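The pixel convention above (0 for black, 1 for white) can be shown with a short sketch. The paper stores its images directly in binary form, so the grey-level input, the threshold of 128, and the name `to_binary` are assumptions made only to illustrate the convention.

```python
def to_binary(gray_rows, threshold=128):
    """Map grey levels (0-255) to the convention in the text: 0 = black, 1 = white."""
    return [[0 if value < threshold else 1 for value in row] for row in gray_rows]

# Hypothetical 2 x 3 grey-level patch.
GRAY = [
    [255, 30, 255],
    [12, 0, 200],
]
binary = to_binary(GRAY)
```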

Segmentation stage
Segmentation is an important stage in a system for recognizing Arabic text, because the nature of Arabic writing requires the character patterns to be separated from one another. Automatic segmentation is performed in two steps. First, the text line is cut into words and/or sections using the vertical histogram. Each word and/or section is then cut into its constituent characters. Characters are delimited by determining the start and end of each character, together with finding: 1-Base line: the row that has the largest number of black pixels.
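The first segmentation step and the base line can be sketched as follows, assuming the binary image is a list of rows with 0 for black and 1 for white. The function names and the tiny test image are hypothetical, and the sketch splits only on completely empty columns; cutting connected characters inside a section needs the additional rules described in the text.

```python
def vertical_histogram(image):
    """Count black pixels (value 0) in each column of a binary image."""
    return [sum(1 for row in image if row[x] == 0) for x in range(len(image[0]))]

def split_on_gaps(histogram):
    """Return (start, end) column ranges separated by empty columns; end is exclusive."""
    sections, start = [], None
    for x, count in enumerate(histogram):
        if count > 0 and start is None:
            start = x  # first inked column of a new section
        elif count == 0 and start is not None:
            sections.append((start, x))
            start = None
    if start is not None:
        sections.append((start, len(histogram)))
    return sections

def baseline_row(image):
    """Index of the row holding the largest number of black pixels."""
    counts = [row.count(0) for row in image]
    return counts.index(max(counts))

# Hypothetical 4 x 7 text line containing two sections.
LINE = [
    [1, 1, 1, 1, 1, 1, 1],
    [1, 0, 1, 1, 1, 0, 1],
    [0, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1],
]
histogram = vertical_histogram(LINE)
sections = split_on_gaps(histogram)
baseline = baseline_row(LINE)
```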

Features extraction
In the previous stage, all the characters were located, along with the start and end of each character and the space it occupies. In this stage, features are extracted in order to generate a sequence of observations. The hidden Markov model designed for the character's position is then invoked, the probability of the character's observation sequence is calculated, and the recognized character is output. These steps are repeated for the remaining characters in sequence.

Features vector
The feature vector consists of (8
First: the initial state distribution is the probability of being in state Mi, for i = 1, 2, …, 9, at time (t). It is placed in a vector (A) of dimensions 1 x N, where N is the number of states, N = 9: A = [1.0 0 0 0 0 0 0 0 0]. Second: the matrix of transition probabilities between the states in the vector (A) has size N x N; for the proposed model its size is 9 x 9.
The table shows the values of the vector (A). In the same way, separate hidden Markov models are designed for recognizing the medial and final character forms, and their model elements are calculated. First: calculate the initial state probability distribution.
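The shapes of these two model elements can be sketched as follows. The initial vector is the one given in the text; the transition values are placeholders, and the left-to-right structure (stay or advance by one state) is an assumption suggested by the initial vector, not something the paper states.

```python
N_STATES = 9  # number of states stated in the text

# Initial distribution from the text: the model always starts in the first state.
initial = [1.0] + [0.0] * (N_STATES - 1)

def left_to_right_transitions(n_states, stay=0.5):
    """Placeholder N x N matrix: stay with probability `stay`, else advance."""
    matrix = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states - 1):
        matrix[i][i] = stay
        matrix[i][i + 1] = 1.0 - stay
    matrix[-1][-1] = 1.0  # the final state has nowhere further to go
    return matrix

transition = left_to_right_transitions(N_STATES)
```

Whatever values are used in practice, each row of the transition matrix must sum to 1, as must the initial vector.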
Figure (3): Proposed scheme of the recognition system
2-Top line: for each column in the section.
3-Bottom line: for each column in the section.
4-Threshold: the most frequently repeated value in the histogram of each column produced in the previous segmentation step.
5-Number of vertical transitions (0, 1) and (1, 0).
The end column must meet several conditions. The start column of a character has a histogram value larger than the threshold.
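Counting the vertical transitions in a column can be sketched as below, assuming the image is a list of rows with 0 for black and 1 for white. The function name and the sample column are hypothetical.

```python
def vertical_transitions(image, column):
    """Count 0->1 and 1->0 changes going down one column of a binary image."""
    values = [row[column] for row in image]
    return sum(1 for a, b in zip(values, values[1:]) if a != b)

# Hypothetical single-column image: white, black, black, white, black.
COLUMN_IMAGE = [[1], [0], [0], [1], [0]]
transitions = vertical_transitions(COLUMN_IMAGE, 0)
```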

Figure (5): Feature vector and its elements
The probability distribution of the observation symbols is calculated, and likewise for the same characters in their other positions. It is stored in an observation matrix called B. This matrix has size N x M, with N = 9 and M = 17, where N is the number of states of the designed hidden Markov model (HMM) and M is the number of possible observation symbols in each state, as shown in Table (2) below.
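One common way to fill an N x M observation matrix is to count how often each symbol is seen in each state and normalise every row. The paper does not say how B was obtained, so the counting scheme, the function name `estimate_emissions`, and the sample counts below are all assumptions for illustration.

```python
N_STATES, N_SYMBOLS = 9, 17  # sizes stated in the text

def estimate_emissions(counts):
    """Turn an N x M table of symbol counts into row-normalised probabilities."""
    matrix = []
    for row in counts:
        total = sum(row)
        # A state with no samples falls back to a uniform distribution.
        matrix.append([c / total for c in row] if total else [1.0 / len(row)] * len(row))
    return matrix

# Hypothetical counts: state 0 saw symbol 3 twice and symbol 5 twice.
counts = [[0] * N_SYMBOLS for _ in range(N_STATES)]
counts[0][3] = 2
counts[0][5] = 2
B = estimate_emissions(counts)
```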