Abstract
This paper outlines computational issues in the design of a Digital Library (DL) for Indic languages. The complex character structure of Indic scripts calls for novel OCR analysis techniques and user interface (UI) designs. We describe a multi-tier software architecture that provides text and image processing tools as independent, reusable entities, and we outline techniques for measuring and evaluating the different stages of an Indic script recognition engine.
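As a small illustration of the character structure mentioned in the abstract, the sketch below uses Python's standard `unicodedata` module to show that a single Devanagari conjunct, which renders as one visual unit, is stored as a sequence of several Unicode code points. The helper name `code_points` and the choice of example conjunct are assumptions made for illustration; they are not taken from the paper.

```python
import unicodedata


def code_points(text):
    """Return (code point, Unicode name) pairs for every code point in text.

    Hypothetical helper for illustration only; not part of the paper's system.
    """
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


# The Devanagari conjunct "ksha" is drawn as one glyph but is encoded as
# consonant + virama + consonant, i.e. three separate code points.
conjunct = "\u0915\u094D\u0937"

for cp, name in code_points(conjunct):
    print(cp, name)
```

An OCR engine or text editor for such scripts has to map between the visual unit on the page and the underlying code point sequence, which is one reason segmentation and UI design are harder here than for scripts with a one-to-one glyph-to-character mapping.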
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kompalli, S., Setlur, S., Govindaraju, V. (2004). DL Architecture for Indic Scripts. In: Marinai, S., Dengel, A.R. (eds) Document Analysis Systems VI. DAS 2004. Lecture Notes in Computer Science, vol 3163. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-28640-0_3
DOI: https://doi.org/10.1007/978-3-540-28640-0_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23060-1
Online ISBN: 978-3-540-28640-0
eBook Packages: Springer Book Archive