Unsupervised information extraction from unstructured, ungrammatical data sources on the World Wide Web

Michelson, Matthew; Knoblock, Craig A.

doi:10.1007/s10032-007-0052-2

Original Paper
Published: 16 October 2007

Volume 10, pages 211–226 (2007)
International Journal of Document Analysis and Recognition (IJDAR)

Matthew Michelson¹ & Craig A. Knoblock¹

Abstract

Information extraction from unstructured, ungrammatical data such as classified listings is difficult because traditional structural and grammatical extraction methods do not apply. Previous work has exploited reference sets to aid such extraction, but it did so using supervised machine learning. In this paper, we present an unsupervised approach that both selects the relevant reference set(s) automatically and then uses it for unsupervised extraction. We validate our approach with experimental results that show our unsupervised extraction is competitive with supervised machine learning approaches, including the previous supervised approach that exploits reference sets.

Author information

Authors and Affiliations

Information Sciences Institute, University of Southern California, Marina del Rey, CA, USA
Matthew Michelson & Craig A. Knoblock

Corresponding author

Correspondence to Matthew Michelson.

About this article

Cite this article

Michelson, M., Knoblock, C.A. Unsupervised information extraction from unstructured, ungrammatical data sources on the World Wide Web. IJDAR 10, 211–226 (2007). https://doi.org/10.1007/s10032-007-0052-2

Received: 13 March 2007
Revised: 15 June 2007
Accepted: 20 August 2007
Published: 16 October 2007
Issue Date: December 2007
DOI: https://doi.org/10.1007/s10032-007-0052-2
