Document Categorization and Query Generation on the World Wide Web Using WebACE

Boley, Daniel; Gini, Maria; Gross, Robert; Han, Eui-Hong (Sam); Hastings, Kyle; Karypis, George; Kumar, Vipin; Mobasher, Bamshad; Moore, Jerome

doi:10.1023/A:1006592405320


Published: December 1999

Volume 13, pages 365–391, (1999)

Artificial Intelligence Review

Daniel Boley¹,
Maria Gini¹,
Robert Gross¹,
Eui-Hong (Sam) Han¹,
Kyle Hastings¹,
George Karypis¹,
Vipin Kumar¹,
Bamshad Mobasher¹ &
Jerome Moore¹


Abstract

We present WebACE, an agent for exploring and categorizing documents on the World Wide Web based on a user profile. The heart of the agent is an unsupervised categorization of a set of documents, combined with a process for generating new queries that is used to search for new related documents and for filtering the resulting documents to extract the ones most closely related to the starting set. The document categories are not given a priori. We present the overall architecture and describe two novel algorithms which provide significant improvement over Hierarchical Agglomeration Clustering and AutoClass algorithms and form the basis for the query generation and search component of the agent. We report on the results of our experiments comparing these new algorithms with more traditional clustering algorithms and we show that our algorithms are fast and scalable.
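
The query generation and filtering loop described in the abstract can be illustrated with a small sketch. This is not the paper's method: the abstract does not specify the two clustering algorithms or the query construction, so the code below assumes a cluster of documents is already given, substitutes a simple term-frequency score for picking query terms, and filters candidates by cosine similarity to the cluster centroid. All function names, the scoring rule, and the threshold are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term counts for one document (lowercased, whitespace split)."""
    return Counter(text.lower().split())


def centroid(vectors):
    """Summed term counts over a set of documents (scale does not affect cosine)."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(x * x for x in a.values()))
    nb = math.sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def generate_query(cluster_docs, other_docs, k=3):
    """Assumed heuristic: pick k terms frequent in the cluster and rare elsewhere."""
    inside = centroid(term_vector(d) for d in cluster_docs)
    outside = centroid(term_vector(d) for d in other_docs)
    score = {t: c / (1 + outside[t]) for t, c in inside.items()}
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(score, key=lambda t: (-score[t], t))[:k]


def filter_results(cluster_docs, candidates, threshold=0.3):
    """Keep candidates whose similarity to the cluster centroid meets the threshold."""
    center = centroid(term_vector(d) for d in cluster_docs)
    return [c for c in candidates if cosine(term_vector(c), center) >= threshold]
```

In use, the terms returned by `generate_query` would be sent to a search engine, and the retrieved pages passed through `filter_results` to keep only those close to the starting cluster. A real system would also apply stop-word removal and stemming before counting terms.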



Author information

Authors and Affiliations

Department of Computer Science and Engineering, University of Minnesota, 4-192 EE/CSci Building, 200 Union Street SE Minneapolis, MN, 55455, USA
Daniel Boley, Maria Gini, Robert Gross, Eui-Hong (Sam) Han, Kyle Hastings, George Karypis, Vipin Kumar, Bamshad Mobasher & Jerome Moore


About this article

Cite this article

Boley, D., Gini, M., Gross, R. et al. Document Categorization and Query Generation on the World Wide Web Using WebACE. Artificial Intelligence Review 13, 365–391 (1999). https://doi.org/10.1023/A:1006592405320


Issue Date: December 1999
DOI: https://doi.org/10.1023/A:1006592405320
