There is a newer version of the record available.

Published March 10, 2022 | Version v1
Dataset Open

MDIW-13: New Database and Benchmark for Script Identification

  • 1. Instituto Universitario para el Desarrollo Tecnológico y la Innovación en Comunicaciones, Universidad de Las Palmas de Gran Canaria, Campus de Tafira, Las Palmas de Gran Canaria, Spain
  • 2. Griffith University, Gold Coast, Queensland, Australia and Information Sciences Institute, University of Southern California, USA
  • 3. Instituto Universitario para el Desarrollo Tecnológico y la Innovación en Comunicaciones, Universidad de Las Palmas de Gran Canaria, Campus de Tafira, Las Palmas de Gran Canaria
  • 4. Universidad Autonoma de Madrid, Spain
  • 5. Computer Vision and Pattern Recognition Unit, Indian Statistical Institute, Kolkata, India

Description

Script identification is a necessary step in some document analysis applications that operate in a multi-script, multi-language environment. This paper provides a new database for benchmarking script identification algorithms. It contains both printed and handwritten documents collected from a wide variety of scripts, including Arabic, Bengali (Bangla), Gujarati, Gurmukhi, Devanagari, Japanese, Kannada, Malayalam, Oriya, Roman, Tamil, Telugu, and Thai. The dataset consists of 1,135 documents: scans of local newspapers, and handwritten letters and notes from different native writers. These documents are further segmented into 13,979 lines and 86,655 words. Baseline benchmarks are provided using handcrafted and deep learning methods. The benchmark includes results at the document, line, and word levels for printed and handwritten documents. Results for script identification independent of the document/line/word level and independent of the printed/handwritten distinction are also given.
 

https://www.dropbox.com/s/vtmy0l4gjxun0oe/Multiscript_SIW_Database_Feb25_acceptedPaper.zip?dl=0

 

Please cite our work if you find the database useful:

  • M. A. Ferrer, A. Das, M. Diaz, A. Morales, C. Carmona-Duarte, U. Pal (2022), "MDIW-13: New Database and Benchmark for Script Identification", Multimedia Tools and Applications, Pages 1-14. Accepted
  • A. Das, M. A. Ferrer, A. Morales, M. Diaz, U. Pal, et al. "SIW 2021: ICDAR Competition on Script Identification in the Wild". 16th International Conference on Document Analysis and Recognition (ICDAR 2021). Lecture Notes in Computer Science, vol 12824. Springer. Sep. 5-10, 2021, Lausanne, Switzerland, pp. 738-753. doi: 10.1007/978-3-030-86337-1_49

Files

RG_MTAP2022.pdf

Files (1.8 MB)

RG_MTAP2022.pdf, 1.8 MB
md5:4273610962464a1b3d6dbf3c30f5d5c4
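
The md5 value listed for RG_MTAP2022.pdf can be used to check that a downloaded copy is intact. The following is a minimal sketch in Python, not part of the record itself: the helper names `md5_of_file` and `matches_expected` are hypothetical, and the local path assumes the file was saved under its original name in the current directory.

```python
import hashlib
from pathlib import Path

# Checksum as listed on this record for RG_MTAP2022.pdf.
EXPECTED_MD5 = "4273610962464a1b3d6dbf3c30f5d5c4"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_MD5):
    """Compare a file's MD5 digest with the expected value, ignoring case."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Assumed download location; adjust to wherever the file was saved.
    local_copy = Path("RG_MTAP2022.pdf")
    if local_copy.exists():
        print("checksum OK" if matches_expected(local_copy) else "checksum mismatch")
    else:
        print(f"{local_copy} not found; download it first")
```

A mismatch usually indicates an incomplete or corrupted download, in which case fetching the file again is the simplest remedy.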