Abstract
This manuscript presents a novel method for language identification based on texture analysis of the script. The method maps each letter of the text to a script type according to the position of the letter in the baseline area. To extract features, the co-occurrence matrix is computed, and the texture features are then calculated from it. The extracted measures differ meaningfully because of dissimilarities in script and language characteristics, and these differences form the basis of the decision-making process for language identification. Feature classification is performed by an extension of a state-of-the-art method, called genetic algorithms image clustering for document analysis. The proposed method is tested on documents written in English, French, Slovenian and Serbian, and compared with other well-known classification methods and feature representations from the state of the art. The experimental results show that the proposed approach outperforms the compared methods.
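The feature-extraction steps summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the letter sets, the four numeric codes, the one-step neighbour offset, and the choice of four Haralick-style measures are all assumptions made for illustration, and accented lowercase letters are simply treated as base letters.

```python
import math

# Assumed script-type codes (illustrative only, not taken from the paper):
# 0 = base, 1 = ascender, 2 = descender, 3 = full.
ASCENDER = set("bdfhiklt")
DESCENDER = set("gpqy")
FULL = set("j")


def script_type(ch):
    """Map one letter to an assumed script-type code."""
    if ch.isupper() or ch in ASCENDER:
        return 1
    if ch in DESCENDER:
        return 2
    if ch in FULL:
        return 3
    return 0


def encode(text):
    """Encode the letters of a text as script-type codes, skipping non-letters."""
    return [script_type(ch) for ch in text if ch.isalpha()]


def cooccurrence(codes, levels=4):
    """Normalized co-occurrence matrix of each code with its right neighbour."""
    matrix = [[0.0] * levels for _ in range(levels)]
    pairs = list(zip(codes, codes[1:]))
    for a, b in pairs:
        matrix[a][b] += 1.0
    if pairs:
        matrix = [[v / len(pairs) for v in row] for row in matrix]
    return matrix


def texture_features(p):
    """Four common co-occurrence texture measures (assumed selection)."""
    feats = {"energy": 0.0, "entropy": 0.0, "contrast": 0.0, "homogeneity": 0.0}
    for i, row in enumerate(p):
        for j, v in enumerate(row):
            if v > 0.0:
                feats["energy"] += v * v
                feats["entropy"] -= v * math.log(v)
                feats["contrast"] += (i - j) ** 2 * v
                feats["homogeneity"] += v / (1.0 + (i - j) ** 2)
    return feats


codes = encode("bag")
features = texture_features(cooccurrence(codes))
```

Because non-letters are skipped, neighbouring pairs in this sketch also span word boundaries; a more faithful version would need the exact mapping and neighbourhood definition from the full paper.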
Abbreviations
- ASR: Automatic speech recognition
- CM: Confusion matrix
- DE: Differential evolution
- GA-IC: Genetic algorithms image clustering
- GA-ICDA: Genetic algorithms image clustering for document analysis
- GLCM: Gray-level co-occurrence matrix
- GMM: Gaussian mixture model
- HF: Hash function
- IR: Information retrieval
- LI: Language identification
- NLP: Natural language processing
- NMI: Normalized mutual information
- OCR: Optical character recognition
- PDF: Portable document format
- PKC: Public key cryptography
- PSO: Particle swarm optimization
- SKC: Secret key cryptography
- SOM: Self-organizing map
- WOI: Window of interest
Acknowledgments
This study was partially funded by a grant from the Ministry of Education, Science and Technological Development of the Republic of Serbia, as part of Project TR33037 within the framework of the Technological Development program. The recipient of the funding is Dr. Darko Brodić.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
About this article
Cite this article
Brodić, D., Amelio, A. & Milivojević, Z.N. Language discrimination by texture analysis of the image corresponding to the text. Neural Comput & Applic 29, 151–172 (2018). https://doi.org/10.1007/s00521-016-2527-x
DOI: https://doi.org/10.1007/s00521-016-2527-x