Language discrimination by texture analysis of the image corresponding to the text

Brodić, Darko; Amelio, Alessia; Milivojević, Zoran N.

doi:10.1007/s00521-016-2527-x

Original Article
Published: 19 August 2016

Neural Computing and Applications, Volume 29, pages 151–172 (2018)

Abstract

This manuscript presents a novel method for language identification based on texture analysis of the script. The method maps each letter of the text to a script type, determined by the position of the letter relative to the baseline area. To extract features, a co-occurrence matrix is computed, and texture features are then calculated from it. The extracted measures show meaningful differences that stem from dissimilarities in script and language characteristics, and these differences form the basis of the decision-making process for language identification. Feature classification is performed by an extension of a state-of-the-art method called genetic algorithms image clustering for document analysis. The proposed method is tested on documents written in English, French, Slovenian and Serbian, and is compared with other well-known classification methods and feature representations from the state of the art. The experimental results show the superiority of the proposed approach.
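
The feature-extraction steps named in the abstract (letter-to-script-type mapping, co-occurrence matrix, texture features) can be illustrated with a minimal sketch. It is not the authors' implementation: the three script-type classes, the letter sets `ASCENDERS` and `DESCENDERS`, the distance-1 pairing of neighbouring letters, the four chosen features, and all function names are assumptions made for illustration only. The abstract does not specify any of them, and the clustering step (GA-ICDA) is not covered.

```python
import math

# Assumed letter classes by vertical extent relative to the baseline area.
# Simplification: accented lowercase letters fall into the base class.
ASCENDERS = set("bdfhklt")
DESCENDERS = set("gjpqy")


def script_type(ch):
    """Return an assumed script-type code for a character, or None to skip it."""
    if not ch.isalpha():
        return None
    if ch.isupper() or ch in ASCENDERS:
        return 1  # extends above the middle zone
    if ch in DESCENDERS:
        return 2  # extends below the baseline
    return 0  # stays within the middle zone


def cooccurrence(codes, levels=3):
    """Count neighbouring code pairs (i, j) and normalise to probabilities."""
    counts = [[0] * levels for _ in range(levels)]
    for a, b in zip(codes, codes[1:]):
        counts[a][b] += 1
    total = sum(map(sum, counts))
    if total == 0:
        raise ValueError("need at least two coded letters")
    return [[c / total for c in row] for row in counts]


def texture_features(p):
    """Compute a few common co-occurrence texture measures from matrix p."""
    cells = [(i, j, v) for i, row in enumerate(p) for j, v in enumerate(row)]
    return {
        "energy": sum(v * v for _, _, v in cells),
        "entropy": -sum(v * math.log2(v) for _, _, v in cells if v > 0),
        "contrast": sum((i - j) ** 2 * v for i, j, v in cells),
        "homogeneity": sum(v / (1 + abs(i - j)) for i, j, v in cells),
    }


def text_features(text):
    """Map text to script-type codes, then to a texture feature dictionary."""
    codes = [c for c in map(script_type, text) if c is not None]
    return texture_features(cooccurrence(codes))
```

Calling `text_features` on one passage per document yields a small feature vector for each; in this sketch, vectors from texts in different languages would then be passed to whatever classifier or clustering method is used.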

Abbreviations

ASR: Automatic speech recognition
CM: Confusion matrix
DE: Differential evolution
GA-IC: Genetic algorithms image clustering
GA-ICDA: Genetic algorithms image clustering for document analysis
GLCM: Gray-level co-occurrence matrix
GMM: Gaussian mixture model
HF: Hash function
IR: Information retrieval
LI: Language identification
NLP: Natural language processing
NMI: Normalized mutual information
OCR: Optical character recognition
PDF: Portable document format
PKC: Public key cryptography
PSO: Particle swarm optimization
SKC: Secret key cryptography
SOM: Self-organizing map
WOI: Window of interest

Acknowledgments

This study was partially funded by a grant from the Ministry of Education, Science and Technological Development of the Republic of Serbia, as part of Project TR33037 within the framework of the Technological Development program. The recipient of the funding is Dr. Darko Brodić.

Author information

Authors and Affiliations

Technical Faculty in Bor, University of Belgrade, Vojske Jugoslavije 12, Bor, 19210, Serbia
Darko Brodić
DIMES University of Calabria, Via P. Bucci Cube 44, 87036, Rende, CS, Italy
Alessia Amelio
College of Applied Technical Sciences, Aleksandra Medvedeva 20, Nis, 18000, Serbia
Zoran N. Milivojević

Corresponding author

Correspondence to Darko Brodić.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

About this article

Cite this article

Brodić, D., Amelio, A. & Milivojević, Z.N. Language discrimination by texture analysis of the image corresponding to the text. Neural Comput & Applic 29, 151–172 (2018). https://doi.org/10.1007/s00521-016-2527-x

Received: 07 March 2016
Accepted: 08 August 2016
Published: 19 August 2016
Issue Date: March 2018
DOI: https://doi.org/10.1007/s00521-016-2527-x
