Abstract
This paper proposes a new method for characterizing and distinguishing closely related languages, using Serbian and Croatian as an example. In the first step, the method transforms text in different languages into uniformly coded text. The coding is based on the position of each character of the script in the text line and on its height. The coded text, treated as a 1-D image, is then subjected to texture analysis, which yields a feature vector of 28 elements. These elements are extracted by co-occurrence texture analysis and adjacent local binary pattern analysis. The feature vector is the input to classification by an extension of a state-of-the-art method called GA-ICDA. As a result, the closely related languages are correctly distinguished. The method is tested on a database of documents in Serbian and Croatian, and the experiments give promising results.
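The coding and co-occurrence steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the four height classes, the letter-to-class mapping, the function names (`encode_text`, `cooccurrence_1d`, `texture_features`) and the three statistics shown are all assumptions chosen for illustration, and the sketch does not reproduce the paper's 28 features, its adjacent local binary pattern analysis, or GA-ICDA.

```python
import numpy as np

# Hypothetical height classes (an assumption, not taken from the paper):
# 0 = short letter, 1 = ascender, 2 = descender, 3 = full height.
ASCENDERS = set("bdfhiklt" + "šđžčć")
DESCENDERS = set("gpqy")
FULL = set("j")


def encode_text(text):
    """Map each letter to an assumed height class; non-letters are skipped."""
    codes = []
    for ch in text:
        if not ch.isalpha():
            continue
        if ch in FULL:
            codes.append(3)
        elif ch in DESCENDERS:
            codes.append(2)
        elif ch in ASCENDERS or ch.isupper():
            codes.append(1)
        else:
            codes.append(0)
    return np.array(codes, dtype=np.int64)


def cooccurrence_1d(codes, levels=4, distance=1):
    """Normalized co-occurrence matrix of a 1-D coded sequence."""
    matrix = np.zeros((levels, levels), dtype=np.float64)
    if len(codes) <= distance:
        return matrix
    # Count every pair (codes[k], codes[k + distance]).
    np.add.at(matrix, (codes[:-distance], codes[distance:]), 1.0)
    return matrix / matrix.sum()


def texture_features(p):
    """Three common co-occurrence statistics (illustrative subset only)."""
    i, j = np.indices(p.shape)
    nonzero = p[p > 0]
    return {
        "energy": float((p ** 2).sum()),
        "entropy": float(-(nonzero * np.log2(nonzero)).sum()),
        "contrast": float((((i - j) ** 2) * p).sum()),
    }


if __name__ == "__main__":
    sample = "Svi ljudi rađaju se slobodni"
    features = texture_features(cooccurrence_1d(encode_text(sample)))
    print(features)
```

Because the coding depends only on letter shape classes, the same sketch can be applied to text in either language, and the resulting statistics could then be passed to any classifier.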
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Brodić, D., Amelio, A., Milivojević, Z.N. (2015). Characterization and Distinction Between Closely Related South Slavic Languages on the Example of Serbian and Croatian. In: Azzopardi, G., Petkov, N. (eds) Computer Analysis of Images and Patterns. CAIP 2015. Lecture Notes in Computer Science, vol 9256. Springer, Cham. https://doi.org/10.1007/978-3-319-23192-1_55
DOI: https://doi.org/10.1007/978-3-319-23192-1_55
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-23191-4
Online ISBN: 978-3-319-23192-1
eBook Packages: Computer Science (R0)