Statistical Analysis of Bibliographic Strings for Constructing an Integrated Document Space

  • Conference paper
Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2458)

Abstract

Retrospective documents are an important resource when constructing a large digital library. This paper proposes a method for analyzing bibliographic strings obtained by character recognition, using an extended hidden Markov model. The method can analyze erroneous bibliographic strings, which makes it possible to integrate documents accumulated as printed articles into a citation index. It has two advantages. First, it provides a robust bibliographic matching function by combining a statistical description of the syntax of bibliographic strings, a language model, and an Optical Character Recognition (OCR) error model. Second, it reduces the cost of preparing training data for parameter estimation by using records in the bibliographic database.
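To make the idea of matching under an OCR error model concrete, the following is a minimal sketch and not the paper's method: it replaces the extended hidden Markov model with a plain weighted edit distance, in which substitutions between visually similar characters are cheaper than other edits. The confusion pairs, their costs, the function names, and the sample records are all assumptions chosen for illustration.

```python
# Hypothetical OCR confusion pairs and costs (illustrative values, not from the paper).
CONFUSION_COST = {
    frozenset("l1"): 0.2,  # lowercase L read for digit one
    frozenset("lI"): 0.2,  # lowercase L read for capital I
    frozenset("O0"): 0.2,  # capital O read for zero
    frozenset("S5"): 0.3,  # capital S read for five
}


def sub_cost(a, b):
    """Cost of reading character b as character a (0 if identical)."""
    if a == b:
        return 0.0
    return CONFUSION_COST.get(frozenset((a, b)), 1.0)


def ocr_distance(observed, candidate):
    """Weighted edit distance between an OCR string and a clean record string."""
    # prev[j] holds the distance between the processed prefix of `observed`
    # and the first j characters of `candidate`.
    prev = [float(j) for j in range(len(candidate) + 1)]
    for i, a in enumerate(observed, start=1):
        curr = [float(i)]
        for j, b in enumerate(candidate, start=1):
            curr.append(min(
                prev[j] + 1.0,                 # drop a spurious OCR character
                curr[j - 1] + 1.0,             # restore a character the OCR missed
                prev[j - 1] + sub_cost(a, b),  # match or confusable substitution
            ))
        prev = curr
    return prev[-1]


def best_match(observed, records):
    """Return the database record closest to the OCR string."""
    return min(records, key=lambda record: ocr_distance(observed, record))


if __name__ == "__main__":
    # Sample records are illustrative stand-ins for a bibliographic database.
    records = [
        "Information Systems, 12(3):239-242, 1987",
        "IEEE Computer, 32(6):67-71, 1999",
    ]
    noisy = "lnformation Systems, l2(3):239-242, l987"
    print(best_match(noisy, records))
```

A probabilistic model such as the one the abstract describes would instead estimate error probabilities from data and combine them with field syntax and a language model; the fixed costs here only stand in for that learned error model.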

© 2002 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Takasu, A. (2002). Statistical Analysis of Bibliographic Strings for Constructing an Integrated Document Space. In: Agosti, M., Thanos, C. (eds) Research and Advanced Technology for Digital Libraries. ECDL 2002. Lecture Notes in Computer Science, vol 2458. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45747-X_6

  • DOI: https://doi.org/10.1007/3-540-45747-X_6

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-44178-6

  • Online ISBN: 978-3-540-45747-3

  • eBook Packages: Springer Book Archive
