Text Detection In Indonesian Identity Card Based On Maximally Stable Extremal Regions

Most Indonesian organizations, whether governmental or non-governmental, sometimes require their members to provide their identity card (E-KTP) as a legal document to be stored in their database. These image collections are usually used for manual verification. Because each person acquires the document image with their own device, the images vary in the angle at which they were taken. This creates problems for text recognition by OCR software, especially in the text detection stage, since orientation and noise affect its accuracy. These cases make text detection more complex, and they cannot be solved by a simple vertical projection profile of black pixels. This research proposes a method to improve text detection in identity documents by first fixing the orientation and then using MSER regions to form text regions. We fix the orientation using the line found by the Progressive Probabilistic Hough Transform. We then use MSER to obtain all candidate regions, and Horizontal RLSA acts as a connector between those candidates. The orientation fixing strategy reaches an average margin of error of 0.377° (on a 360° scale), and the text detection method reaches 84.49% accuracy in the best condition.


INTRODUCTION
Information on identity cards, in this case the Indonesian identity card (E-KTP), is important to all legal organizations, whether government-related or not. The data on an E-KTP can be saved in an organization's internal database for future use, including verification or statistical purposes. In many cases, organizations collect photos or scans of E-KTPs from their members as legal documents. These image files contain text information that can be extracted through an OCR process.
Many OCR software packages have been developed for this process. Their accuracy can be degraded by noise or unwanted objects in the image, whether in the foreground or the background of the text document [1]. The quality of the input affects them; therefore, clear text in images without unwanted objects increases the accuracy of such OCR software [2].
Identity cards usually have a neat format with text at fixed locations. This characteristic makes it possible to detect the text location in a simple way, such as a vertical projection of black pixels, which has been done before in E-KTP recognition research [3] and by researchers from Bangladesh for the same task [4]. They used the vertical projection of black pixels because there were enough gaps between the text areas in the images. This approach can indeed be used when the photo contains only text and was taken by a camera at close range.
The problem emerges when the photo is taken by its owner at varying distances between the camera and the card. There is also the problem of the camera orientation not being fully straight. In these cases, the document images contain unwanted objects or noise, so the vertical projection approach cannot be used.
The orientation of an image can affect OCR accuracy [5]; therefore, fixing the image orientation is also required. This paper proposes to fix the orientation of images first, before detecting text regions with one of the available region detection methods. It proposes using the line formed by the text region in the image as the reference for determining the image orientation. This approach has been used before to determine the orientation of Devanagari script (used in India and Nepal) [6]. That work used the Standard Hough Transform to detect the line formed by the middle stroke of the letters, which is the main characteristic of Devanagari script.
The Standard Hough Transform used in the Devanagari research is computationally expensive because of its voting mechanism. Therefore, in this paper we propose using PPHT (Progressive Probabilistic Hough Transform) to detect the line. PPHT has lower complexity than the standard version because it uses only a random subset of edge pixels [7]. PPHT also produces line length information, which can be used for line selection.
After fixing the image orientation, the process continues to text detection. The approach of this paper is to detect groups of letter regions that can form text regions in images, using MSER (Maximally Stable Extremal Regions) as the main method to detect letter regions. This method is commonly used for surveillance purposes, e.g. vehicle license plates [8] or street signs [9], and it is robust for images that contain text. In a comparative study of region detectors, MSER outperformed the other methods in most cases, especially for regions with homogeneous characteristics and clear boundaries [10]. These characteristics suit E-KTP documents well.
This paper aims to contribute to text detection research, especially for identity card documents with a neat format such as the E-KTP. A good text detection process is required to obtain clean input for OCR software, which we hope will improve OCR accuracy later.

Image Orientation Fixing
This process is required because, as explained previously, a study indicated that image orientation can affect OCR accuracy [5]. The orientation of an image is determined under the assumption that the text region forms the longest possible line in the image; this line becomes the reference for the image orientation. The assumption that the text region forms this line holds only if the E-KTP photo was taken at close range, so that few unwanted objects can form a false line. This base assumption is tested later. The orientation fixing steps are as follows.
1. Image retrieval: the first step, obtaining the original image from a scan, photo, or other sensor in JPEG or PNG format.
2. Grayscaling: converting the three-channel image into a one-channel intensity image.
3. Linear Contrast Stretching: this step can be very important for separating the text region from unwanted objects in the background, which we expect to increase accuracy later.
4. Canny Edge Detection: producing an edge image of binary values 1 and 0, which is fed to PPHT (Progressive Probabilistic Hough Transform).
5. Horizontal RLSA [11]: before the edge image produced by Canny is fed into PPHT, this paper proposes RLSA as a connector between connected components, with the aim of connecting groups of letters into one line.
6. Finding the longest possible line with PPHT: the PPHT implementation can produce the line length because it checks the gaps between pixels. Another advantage, as explained previously, is its lower complexity and better computational efficiency compared with the standard version. The overall PPHT process is shown in Figure 2.
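Step 5 (Horizontal RLSA) can be illustrated with a minimal plain-Python sketch on a 0/1 image stored as a list of rows. The function name `horizontal_rlsa` and the `threshold` parameter are our own, and the paper does not report the RLSA threshold it used. This variant fills only background runs bounded by foreground pixels on both sides, which is one common reading of RLSA [11]; runs touching the image border are left unchanged.

```python
def horizontal_rlsa(image, threshold):
    """Horizontal run-length smoothing on a binary image.

    image: list of rows, each a list of 0 (background) and 1 (foreground).
    threshold: longest background run (in pixels) that is filled with 1s.
    Only runs lying between two foreground pixels of the same row are filled.
    """
    smoothed = []
    for row in image:
        out = list(row)
        last_fg = None  # column of the most recent foreground pixel in this row
        for x, value in enumerate(row):
            if value:
                if last_fg is not None and 0 < x - last_fg - 1 <= threshold:
                    for k in range(last_fg + 1, x):  # fill the short gap
                        out[k] = 1
                last_fg = x
        smoothed.append(out)
    return smoothed


# Hypothetical edge row: a 2-pixel gap (filled) and a 4-pixel gap (kept).
print(horizontal_rlsa([[1, 0, 0, 1, 0, 0, 0, 0, 1]], threshold=2))
# [[1, 1, 1, 1, 0, 0, 0, 0, 1]]
```

A larger `threshold` merges neighboring letters into one horizontal run, which is what allows a text line to appear to PPHT as a single long line.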
The PPHT process starts by accepting the binary edge image as input. PPHT then collects all non-zero pixels and picks them in random order, one per iteration. Each iteration starts with a voter check: if the picked pixel is not already a voter of a particular accumulator cell, the process continues. The pixel then votes in the accumulator (ρ, θ) cells, as shown in Figure 1. These cells record the number of votes and information about the voters. As the process runs, the cells eventually form hills and valleys when viewed in 3D, and every hilltop represents a line.
Figure 1 Accumulator (ρ, θ) [12]
Compared with the Standard Hough Transform, which requires all edge pixels to make a valid line, PPHT uses only a random subset, because its verification step ensures there is no redundancy and its voting mechanism includes a thresholding step. The threshold decides whether there are enough voters to consider a line valid, which reduces the number of later iterations.
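The voting stage described above can be sketched in plain Python. This is a simplified illustration, not the full PPHT of [7]: the hypothetical `progressive_vote` shuffles the edge pixels, lets each one vote for every discretized θ in a (ρ, θ) accumulator, and stops as soon as one cell reaches `threshold` votes. The later PPHT stages (walking along the line, gap checking, and withdrawing the votes of pixels assigned to a detected line) are omitted, and the one-degree θ resolution and integer ρ bins are our assumptions.

```python
import math
import random


def progressive_vote(edge_points, threshold, theta_steps=180, seed=0):
    """Vote edge pixels, in random order, into a (rho, theta) accumulator.

    Returns (rho, theta_degrees) of the first cell whose count reaches
    `threshold`, or None if no cell does. theta is the angle of the line's
    normal, so theta = 90 degrees corresponds to a horizontal line.
    """
    points = list(edge_points)
    random.Random(seed).shuffle(points)  # random pick order
    accumulator = {}  # (rho, theta_index) -> number of votes
    for x, y in points:
        for t in range(theta_steps):
            theta = math.pi * t / theta_steps
            rho = round(x * math.cos(theta) + y * math.sin(theta))
            cell = (rho, t)
            accumulator[cell] = accumulator.get(cell, 0) + 1
            if accumulator[cell] >= threshold:  # enough voters: line found
                return rho, 180.0 * t / theta_steps
    return None


# Hypothetical horizontal stroke of 30 pixels at y = 5.
print(progressive_vote([(x, 5) for x in range(30)], threshold=30))
# (5, 90.0)
```

In this toy example the threshold equals the number of pixels, so only the exact cell (ρ = 5, θ = 90°) can reach it; with a lower threshold, voting stops before every pixel has been processed, which is where the progressive scheme saves work.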
The process then checks whether the line is a good line by extracting all the voters in the corresponding accumulator cell. These voter pixels are examined: if the gaps between them exceed the threshold, the line is eliminated; otherwise, it is added to the set of "good lines". This gap checking produces the start and end points of a line, which help determine its length and orientation. Finally, the image is rotated by the calculated angle.
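The gap check and the angle calculation can be sketched as below. `segments_from_voters` and `longest_segment_angle` are hypothetical helpers, not the paper's code: the first splits the voter pixels of one candidate line wherever consecutive pixels are more than `max_gap` apart (a value the paper does not report), and the second returns the angle of the longest resulting segment, which serves as the reference line. Sorting pixels by (x, y) assumes they can be ordered along the line that way.

```python
import math


def segments_from_voters(voters, max_gap):
    """Split the voter pixels of one line at gaps larger than max_gap.

    voters: (x, y) pixels assumed to lie on one candidate line.
    Returns a list of (start_point, end_point) tuples, one per segment.
    """
    if not voters:
        return []
    points = sorted(voters)  # order the pixels along the line
    segments = []
    start = prev = points[0]
    for p in points[1:]:
        if math.dist(prev, p) > max_gap:  # gap too large: close the segment
            segments.append((start, prev))
            start = p
        prev = p
    segments.append((start, prev))
    return segments


def longest_segment_angle(segments):
    """Angle in degrees, measured from the x-axis, of the longest segment."""
    (x1, y1), (x2, y2) = max(segments, key=lambda s: math.dist(s[0], s[1]))
    return math.degrees(math.atan2(y2 - y1, x2 - x1))


# Hypothetical voters: a 4-pixel run and a separate 2-pixel run on one line.
voters = [(0, 0), (1, 1), (2, 2), (3, 3), (10, 10), (11, 11)]
segments = segments_from_voters(voters, max_gap=2)
print(segments)                                   # [((0, 0), (3, 3)), ((10, 10), (11, 11))]
print(round(longest_segment_angle(segments), 6))  # 45.0
```

Because image y coordinates grow downward, whether the correcting rotation uses this angle or its negative depends on the coordinate convention of the image library that performs the rotation.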

Text Detection based on MSER
After fixing the image orientation, the process continues to detect text regions in the E-KTP image. This paper proposes an idea similar to the Enhanced MSER Algorithm [13], with some modifications. The idea is to find groups of neighboring letter regions separated by small gaps. These letter regions are detected by MSER and can later form text regions. This paper also proposes RLSA instead of the morphological dilation used by Jaswanth [13]. The proposed process consists of the following steps.
1. Grayscaling: converting the three-channel image into a one-channel intensity image.
2. Linear Contrast Stretching: this step is the same as in the orientation fixing stage, and the same image is reused for efficiency.
3. Region detection by MSER: the overall process is shown in Figure 4. The process starts by accepting the intensity image as input and initializing the following parameters:
a) Delta: determines how many iterations the process takes.
b) MaxArea: the maximum acceptable size of a connected component.
c) MinArea: the minimum acceptable size of a connected component.
d) MaxVariation: the maximum variation that is acceptable for a region to be considered stable.
Every region hierarchy consists of extremal regions, each with its own variation value. This variation value is calculated by Equation (1), which represents the variation of extremal region i at every step. Figure 3 illustrates how MSER works by thresholding the image at every iteration step, from white to black: a threshold value of 0 produces a white image, and black pixels gradually emerge as the iteration proceeds. Every delta step forms a hierarchy between connected components, which is treated as a parent-child structure.
4. Region Filling: this step is also inspired by Jaswanth [13], as shown in Figure 5b, and fills the polygon areas formed by MSER. Every MSER region is filled with a color contrasting with the background, as shown in Figure 5a (Figure 5 Region Filling).
5. Canny Edge Detection: producing a binary edge image from the region-filled image.
6. Horizontal RLSA: acts as a connector between connected components in the edge image, with the aim of connecting groups of letters into one text region.
7. Bounding boxes: retrieving all regions from the previous step.
8. Region filter: using the minimum and maximum width and height, as well as the width-to-height ratio, as filters for accepting a region as valid.
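Steps 7 and 8 can be sketched as follows. `bounding_box` and `is_text_region` are hypothetical helper names, and all limits in the example are placeholder values, since the paper does not list the width, height, and ratio thresholds it used.

```python
def bounding_box(pixels):
    """Axis-aligned box (x, y, width, height) enclosing a set of (x, y) pixels."""
    xs = [x for x, _ in pixels]
    ys = [y for _, y in pixels]
    return min(xs), min(ys), max(xs) - min(xs) + 1, max(ys) - min(ys) + 1


def is_text_region(box, min_w, max_w, min_h, max_h, min_ratio, max_ratio):
    """Accept a box whose width, height, and width/height ratio are within limits."""
    _, _, w, h = box
    ratio = w / h
    return (min_w <= w <= max_w
            and min_h <= h <= max_h
            and min_ratio <= ratio <= max_ratio)


# Placeholder limits (hypothetical): text lines are wider than they are tall.
limits = dict(min_w=5, max_w=400, min_h=2, max_h=40, min_ratio=1.5, max_ratio=30.0)
word = bounding_box([(2, 3), (10, 3), (5, 6)])
print(word, is_text_region(word, **limits))     # (2, 3, 9, 4) True
print(is_text_region((0, 0, 3, 30), **limits))  # False
```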

Linear Contrast Stretching Influence to Delta Finding of MSER and Orientation Fixing
The Delta parameter in MSER determines how many iterations are taken. The larger the delta, the fewer iterations are taken, which affects the complexity of the overall process. Finding the best delta value is therefore important: it should be neither too large nor too small while maintaining overall performance. The experiment to find this value starts with a delta of 5 and doubles it at each step. This test shows how Linear Contrast Stretching affects the search for the best delta. Text regions have intensities close to black, and Linear Contrast Stretching attempts to separate those regions from the background. The results in Table 3 indicate that Linear Contrast Stretching also affects orientation fixing performance: the average margin of error (on a 360° scale) decreases from the first experiment to the fifth. The margin of error is calculated as the difference between the actual orientation of the image (checked manually) and the orientation predicted by the proposed algorithm. The experiments progressively stretch the pixels toward white or black from the first setting to the fifth, with the aim of separating the text regions, which are closer to black. The first experiment reached 7.427751403°, the second reduced it to 3.817900892°, and the third to 2.60932669°; the fourth experiment showed an anomaly, but the fifth attempt achieved the best result. This test indicates that separating text regions from the background with Linear Contrast Stretching also affects orientation fixing performance.
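A minimal sketch of linear contrast stretching, in plain Python on a list-of-rows grayscale image. `linear_contrast_stretch` is a hypothetical name, and `low`/`high` are placeholder bounds: the paper does not give the exact settings of its five experiments, only that each one pushes more pixels toward black or white. Narrowing the interval [low, high] does exactly that, since everything below `low` becomes 0 and everything above `high` becomes 255.

```python
def linear_contrast_stretch(gray, low, high):
    """Map intensities in [low, high] linearly onto [0, 255], clipping the rest.

    gray: list of rows of integer intensities in 0..255.
    """
    if high <= low:
        raise ValueError("high must be greater than low")
    scale = 255.0 / (high - low)
    return [[min(255, max(0, round((p - low) * scale))) for p in row] for row in gray]


# Hypothetical row: dark text pixel, two mid-grey pixels, bright background.
print(linear_contrast_stretch([[20, 60, 100, 140]], low=40, high=120))
# [[0, 64, 191, 255]]
```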

Orientation Fixing and Text Detection Performances against Image Resolution
The proposed orientation fixing and text detection algorithms were also tested at low, medium, and high image resolutions. Today, smartphones and other gadgets usually have high-resolution cameras, so resizing is needed to lower the image resolution [15]. Tables 4 and 5 indicate that high-resolution photos tend to give poor accuracy because of their high detail and quality: noise and unwanted objects become more visible and affect the region detector. These high-quality photos carry a higher risk of false positive regions, which results in the worst accuracy, precision, and recall among the three resolutions.

Reference Line Prediction From Text Area
As assumed previously (Section 2.1), the image orientation is determined by the longest possible line in the image formed by the text area. In some cases in practice, this assumption held. In other cases, the reference line was formed by a non-text area while still producing the correct result, e.g. the line was formed by the top or bottom boundary of the E-KTP or by the boundary of the profile photo area.
This test, shown in Table 6, indicates that the assumption holds with an accuracy between 68% and 78% across all resolutions. The reference line is formed by the text area when the E-KTP photo is taken at close range; in that case the E-KTP fills almost the whole photo, making the text area dominant compared with other objects. In general, high-resolution photos caused the most problems because their fine detail makes noise more visible than at other resolutions, resulting in the worst accuracy among them.

Proposed Algorithm Performance against Camera Angles
Tables 7 and 8 show the results of experiments assessing how stable both algorithms are across camera points of view. The E-KTP photos were taken from 9 different camera angles, as shown in Figure 6, at various distances. In general, both algorithms performed poorly when the camera was more than 15 cm from the card, because the images contained too many unwanted objects that created many false positive regions. Both algorithms performed quite well when the distance between the camera and the card was between 10 cm and 15 cm: the proposed text detection process reached 62% to 72% accuracy, and the proposed orientation fixing algorithm reached an average margin of error (MoE) between 0.23° and 0.36° (on a 360° scale) at close range. As explained previously, the margin of error is the difference between the actual orientation of the image (checked manually) and the orientation predicted by the proposed algorithm.
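The average margin of error can be computed with the hypothetical helpers below. The paper defines it as the difference between the manually checked and the predicted orientation; handling wrap-around at 0°/360° (so that 359.5° versus 0.5° counts as 1°, not 359°) is our own assumption, and the sample angles are invented.

```python
def angular_error(actual, predicted):
    """Absolute difference between two orientations on a 360-degree circle."""
    d = abs(actual - predicted) % 360.0
    return min(d, 360.0 - d)


def average_margin_of_error(actual_angles, predicted_angles):
    """Mean angular error over paired actual/predicted orientations (degrees)."""
    if len(actual_angles) != len(predicted_angles) or not actual_angles:
        raise ValueError("need two non-empty lists of equal length")
    errors = [angular_error(a, p) for a, p in zip(actual_angles, predicted_angles)]
    return sum(errors) / len(errors)


# Invented example: two images, actual vs. predicted orientation in degrees.
print(average_margin_of_error([10.0, 359.5], [10.5, 0.5]))  # 0.75
```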

Proposed Algorithm Performance in Best Condition
Table 9 shows the general performance of the proposed text detection algorithm in the best condition: E-KTP photos taken from the front camera angle at a distance of 10 cm to 15 cm, resized to medium resolution, with the best Linear Contrast Stretching setting. Under these conditions, the text detection algorithm reaches 84.49% accuracy, 96.3% precision, and 90% recall.
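For reference, the standard definitions of precision, recall, and accuracy from confusion-matrix counts are sketched below. The paper does not state in this text which accuracy definition produced Table 9 or list the raw counts, so `detection_metrics` and its sample counts are hypothetical and are not expected to reproduce the reported 84.49% / 96.3% / 90%.

```python
def detection_metrics(tp, fp, fn, tn=0):
    """Return (accuracy, precision, recall) from confusion-matrix counts.

    For region detection there is often no meaningful true-negative count,
    so tn defaults to 0, in which case accuracy = tp / (tp + fp + fn).
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return accuracy, precision, recall


# Hypothetical counts: 90 correct regions, 10 false alarms, 10 missed regions.
acc, prec, rec = detection_metrics(tp=90, fp=10, fn=10)
print(round(acc, 4), prec, rec)  # 0.8182 0.9 0.9
```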

CONCLUSIONS
This study of text detection in E-KTP images aims to contribute to text detection research, especially as an effort to segment text areas in images, because, as shown before in some experiments, this affects the performance of OCR software such as Tesseract and free-ocr.com. Linear Contrast Stretching had a large influence on separating the text area from the background, and thus affected both proposed algorithms: orientation fixing and MSER-based text detection. Linear Contrast Stretching also had a significant influence on finding the best delta value for MSER, i.e. an ideal delta that is not too computationally expensive while maintaining performance.
For the orientation problem, the Progressive Probabilistic Hough Transform predicted the reference line formed by the text region quite well, and the proposed orientation fixing algorithm reached an average margin of error of 0.377° (on a 360° scale) in the best condition. Overall, the proposed algorithm performed quite well, reaching 84.5% accuracy, 96.3% precision, and 90% recall, also in the best condition: medium-resolution photos, front camera angle, camera-to-card distance of 10 cm to 15 cm, and the best MSER and Linear Contrast Stretching settings.
This research still leaves much room for improvement. Future work includes addressing the weaknesses of the proposed algorithm, e.g. skewed photos or perspective correction, and the risk of false positives caused by lighting problems. Light reflections from the glossy card surface and uneven light distribution in E-KTP photos also increase the risk of false positives.

Figure 6 Camera's Point of View (POV) from top

IJCCS
ISSN (print): 1978-1520, ISSN (online): 2460-7258  Text Detection in Indonesian Identity Card based on MSER (Angga Maulana Purba) 187

Tables 1 and 2 show how Linear Contrast Stretching affects the search for the best delta. The test shows a significant difference between searching for the delta with and without Linear Contrast Stretching. The best delta, 40, reached 0.845252052 accuracy, 0.962616822 precision, and 0.90125 recall. With this delta, the number of iterations is at most 6, since the maximum intensity of 255 divided by 40 gives 6 (rounded down). This number of iterations is small while still maintaining performance.
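The iteration count quoted above, and the role of delta in MSER stability, can be sketched as follows. `threshold_iterations` follows the paper's reading that delta is the threshold step size. `stability` is the variation measure from the original MSER formulation by Matas et al., q(i) = (|Q(i+Δ)| − |Q(i−Δ)|) / |Q(i)| for nested regions, with Δ counted in threshold levels; we assume, but cannot confirm from this text, that the paper's Equation (1) has this form. A region would then be kept when q is locally minimal and below MaxVariation. The areas in the example are invented.

```python
def threshold_iterations(delta, max_intensity=255):
    """Number of delta-sized threshold steps across the intensity range."""
    return max_intensity // delta


def stability(areas, i, delta):
    """MSER variation q(i) for one nested extremal region.

    areas[t] is the region's area (in pixels) at threshold level t; the region
    grows as t increases, so areas is non-decreasing. delta is in level units.
    q(i) = (areas[i + delta] - areas[i - delta]) / areas[i]
    """
    return (areas[i + delta] - areas[i - delta]) / areas[i]


print(threshold_iterations(40))  # 6, as in the text (255 // 40)
print(threshold_iterations(5))   # 51

# Invented areas: the region barely changes until level 4, then merges
# with its surroundings and jumps in size.
areas = [100, 100, 102, 104, 300]
print(stability(areas, 2, 1) < stability(areas, 3, 1))  # True: level 2 is more stable
```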

Table 3 Linear Contrast Stretching Influence on Orientation Fixing

Table 7 Orientation Fixing Performance against Camera Angles

Table 9 Text Detection Confusion Matrix