Language discrimination by texture analysis of the image corresponding to the text

  • Original Article
  • Neural Computing and Applications

Abstract

This manuscript presents a novel method for language identification based on texture analysis of the script. The method maps each letter of the text to a script type, which is assigned according to the position of the letter in the baseline area. To extract features, a co-occurrence matrix is computed, and texture features are then calculated from it. The extracted measures differ meaningfully between languages because of dissimilarities in script and language characteristics, and these differences form the basis of the decision-making process in language identification. Feature classification is performed by an extension of a state-of-the-art method called genetic algorithms image clustering for document analysis. The proposed method is tested on documents written in English, French, Slovenian and Serbian, and is compared with other well-known classification methods and feature representations from the state of the art. The experimental results show the superiority of the proposed approach.
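
The pipeline described in the abstract (letter-to-script-type mapping, co-occurrence matrix, texture features) can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: the four script-type codes, the small letter lookup tables, the use of a one-dimensional sequence of codes instead of an image, and the choice of four Haralick-style measures (energy, entropy, contrast, homogeneity). The clustering step (GA-ICDA) is not shown.

```python
import math

# Hypothetical script-type codes (an illustrative assumption, not the paper's table):
# 0 = base letter, 1 = ascender, 2 = descender, 3 = full letter (above and below).
ASCENDERS = set("bdfhklt")
DESCENDERS = set("gpqy")
FULL = set("j")


def script_type(ch):
    """Map one letter to a script-type code using the crude tables above."""
    if ch.isupper():
        return 1  # assumption: capitals occupy the upper zone like ascenders
    if ch in FULL:
        return 3
    if ch in ASCENDERS:
        return 1
    if ch in DESCENDERS:
        return 2
    return 0  # simplification: every other letter is treated as a base letter


def encode(text):
    """Turn a text into a sequence of script-type codes, skipping non-letters."""
    return [script_type(ch) for ch in text if ch.isalpha()]


def cooccurrence(codes, levels=4):
    """Normalized co-occurrence matrix of horizontally adjacent codes."""
    counts = [[0] * levels for _ in range(levels)]
    for a, b in zip(codes, codes[1:]):
        counts[a][b] += 1
    total = sum(map(sum, counts))
    if total == 0:
        return [[0.0] * levels for _ in range(levels)]
    return [[c / total for c in row] for row in counts]


def texture_features(p):
    """Four Haralick-style texture measures of a normalized co-occurrence matrix."""
    energy = entropy = contrast = homogeneity = 0.0
    for i, row in enumerate(p):
        for j, v in enumerate(row):
            if v == 0:
                continue  # zero entries contribute nothing to any measure
            d2 = (i - j) ** 2
            energy += v * v
            entropy -= v * math.log(v)
            contrast += d2 * v
            homogeneity += v / (1 + d2)
    return {"energy": energy, "entropy": entropy,
            "contrast": contrast, "homogeneity": homogeneity}


if __name__ == "__main__":
    sample = "Language discrimination by texture analysis"
    print(texture_features(cooccurrence(encode(sample))))
```

Under these assumptions, each document is reduced to a short feature vector, and vectors computed from texts in different languages could then be passed to any clustering or classification method for comparison.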

Abbreviations

ASR: Automatic speech recognition
CM: Confusion matrix
DE: Differential evolution
GA-IC: Genetic algorithms image clustering
GA-ICDA: Genetic algorithms image clustering for document analysis
GLCM: Gray-level co-occurrence matrix
GMM: Gaussian mixture model
HF: Hash function
IR: Information retrieval
LI: Language identification
NLP: Natural language processing
NMI: Normalized mutual information
OCR: Optical character recognition
PDF: Portable document format
PKC: Public key cryptography
PSO: Particle swarm optimization
SKC: Secret key cryptography
SOM: Self-organizing map
WOI: Window of interest

Acknowledgments

This study was partially funded by a grant from the Ministry of Education, Science and Technological Development of the Republic of Serbia, as part of Project TR33037 within the framework of the Technological Development program. The recipient of the funding is Dr. Darko Brodić.

Author information

Corresponding author

Correspondence to Darko Brodić.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

About this article

Cite this article

Brodić, D., Amelio, A. & Milivojević, Z.N. Language discrimination by texture analysis of the image corresponding to the text. Neural Comput & Applic 29, 151–172 (2018). https://doi.org/10.1007/s00521-016-2527-x

  • DOI: https://doi.org/10.1007/s00521-016-2527-x
