CMSOF: a structured data organization framework for scanned Chinese medicine books in digital libraries

Journal of Zhejiang University SCIENCE C

Abstract

Organizing unstructured information from books into a well-defined structure is a significant challenge in digital libraries. Most digital libraries offer search only at the granularity of whole books, and few allow access at the granularity of chapters, because manually constructing directory information for books is time-consuming. Extracting structured data from scanned books therefore remains an urgent and important task. In this paper, we propose a novel structured data organization framework called CMSOF that organizes scanned data automatically, and we apply it to a Chinese medicine digital library. In the framework, the image blocks and text blocks on each scanned page are first separated using either the gray histogram projection method or a hybrid method that combines region growing with an AdaBoost classifier. The text structure is then obtained from the text blocks by recognizing text size and font type. Finally, image blocks and the structured OCRed text are correlated at the semantic level. Integrating the structured data into a Chinese medicine information system (CMIS) lets us organize the Chinese medicine books well and gives users flexible access to them, which indicates that CMSOF is an efficient framework for organizing books that mix images and text.
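The abstract names gray histogram projection as one way to separate blocks on a scanned page but gives no details, so the following is only a minimal sketch of the general projection-profile idea, not the authors' implementation. It counts dark pixels per row, splits the page into horizontal bands wherever the count drops to zero, and labels each band with a height heuristic. The function names, the `ink_threshold` and `max_text_height` parameters, and the height rule itself are all assumptions introduced for illustration; the paper's actual separation criteria may differ.

```python
from typing import List, Tuple

# A page is a list of rows of gray levels: 0 = black ink, 255 = white paper.
Page = List[List[int]]


def row_projection(page: Page, ink_threshold: int = 128) -> List[int]:
    """Count the pixels darker than ink_threshold in every row."""
    return [sum(1 for px in row if px < ink_threshold) for row in page]


def split_runs(profile: List[int], min_ink: int = 1) -> List[Tuple[int, int]]:
    """Return half-open (start, end) row ranges whose profile contains ink."""
    runs: List[Tuple[int, int]] = []
    start = None
    for i, count in enumerate(profile):
        if count >= min_ink and start is None:
            start = i                      # a band of ink begins
        elif count < min_ink and start is not None:
            runs.append((start, i))        # a blank row closes the band
            start = None
    if start is not None:                  # band that reaches the page bottom
        runs.append((start, len(profile)))
    return runs


def label_blocks(page: Page, max_text_height: int = 3) -> List[Tuple[str, int, int]]:
    """Hypothetical heuristic: short bands are text lines, tall bands are images."""
    blocks = []
    for top, bottom in split_runs(row_projection(page)):
        kind = "text" if bottom - top <= max_text_height else "image"
        blocks.append((kind, top, bottom))
    return blocks


# Toy 12x8 page: a two-row "text line" and a six-row "image".
WHITE, BLACK = 255, 0
page = [[WHITE] * 8 for _ in range(12)]
for r in (1, 2):
    page[r][1:7] = [BLACK] * 6
for r in range(5, 11):
    page[r][2:6] = [BLACK] * 4

print(label_blocks(page))  # [('text', 1, 3), ('image', 5, 11)]
```

On a real scan the same projection would usually be applied along columns as well, and the thresholds would need tuning to the scan resolution; a purely height-based rule is a stand-in for whatever classifier decides between text and image blocks.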

Author information

Corresponding author

Correspondence to Bao-gang Wei.

Additional information

Project supported by the China Academic Digital Associative Library (CADAL)

About this article

Cite this article

Yuan, J., Wei, Bg., Wang, Ld. et al. CMSOF: a structured data organization framework for scanned Chinese medicine books in digital libraries. J. Zhejiang Univ. - Sci. C 11, 882–892 (2010). https://doi.org/10.1631/jzus.C1001007

  • DOI: https://doi.org/10.1631/jzus.C1001007
