Document Categorization and Query Generation on the World Wide Web Using WebACE

Artificial Intelligence Review

Abstract

We present WebACE, an agent for exploring and categorizing documents on the World Wide Web based on a user profile. The heart of the agent is an unsupervised categorization of a set of documents, combined with a process for generating new queries that is used to search for new related documents and for filtering the resulting documents to extract the ones most closely related to the starting set. The document categories are not given a priori. We present the overall architecture and describe two novel algorithms which provide significant improvement over Hierarchical Agglomeration Clustering and AutoClass algorithms and form the basis for the query generation and search component of the agent. We report on the results of our experiments comparing these new algorithms with more traditional clustering algorithms and we show that our algorithms are fast and scalable.

Cite this article

Boley, D., Gini, M., Gross, R. et al. Document Categorization and Query Generation on the World Wide Web Using WebACE. Artificial Intelligence Review 13, 365–391 (1999). https://doi.org/10.1023/A:1006592405320

  • DOI: https://doi.org/10.1023/A:1006592405320
