Abstract
We present Document Explorer, a data mining system that searches for patterns in document collections. These patterns provide knowledge about the application domain represented by the collection. A pattern can also be seen as a query that retrieves a set of documents, so the data mining tools can be used to identify interesting queries for browsing the collection. The main pattern types the system can search for are frequent sets of concepts, association rules, concept distributions, and concept graphs. To let the user specify an explicit bias, the system provides several types of constraints for searching the vast implicit spaces of patterns that exist in the collection. Patterns that have been verified as interesting are structured and presented in a visual user interface, where the user can operate on the results to refine and redirect search tasks or to access the associated documents. The system also offers preprocessing tools to construct or refine a knowledge base of domain concepts and to create an internal representation of the document collection that is used by all subsequent data mining operations. In this paper, we give an overview of the Document Explorer system and summarize our methodological approaches and solutions for the special requirements of document mining.
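The idea that a pattern doubles as a query can be illustrated with a minimal sketch. It is not the Document Explorer implementation: the toy collection, the concept names, and the functions `retrieve`, `frequent_sets`, and `confidence` are all hypothetical, and the brute-force enumeration stands in for whatever search strategy the system actually uses.

```python
from itertools import combinations

# Hypothetical toy collection: each document is annotated with a set of concepts.
DOCS = {
    "d1": {"iran", "oil", "opec"},
    "d2": {"iran", "oil"},
    "d3": {"oil", "opec"},
    "d4": {"iran", "oil", "opec"},
    "d5": {"gold"},
}


def retrieve(pattern, docs=DOCS):
    """Use a concept set as a query: ids of documents containing every concept."""
    pattern = frozenset(pattern)
    return sorted(doc_id for doc_id, concepts in docs.items() if pattern <= concepts)


def frequent_sets(docs=DOCS, min_support=2, max_size=3):
    """Brute-force search for concept sets matched by at least min_support documents.

    min_support and max_size play the role of simple user-supplied constraints
    (an assumption made for this sketch).
    """
    vocabulary = sorted(set().union(*docs.values()))
    result = {}
    for size in range(1, max_size + 1):
        for combo in combinations(vocabulary, size):
            support = len(retrieve(combo, docs))
            if support >= min_support:
                result[frozenset(combo)] = support
    return result


def confidence(lhs, rhs, docs=DOCS):
    """Confidence of the association rule lhs => rhs over the collection."""
    covered = len(retrieve(lhs, docs))
    both = len(retrieve(set(lhs) | set(rhs), docs))
    return both / covered if covered else 0.0


if __name__ == "__main__":
    for pattern, support in frequent_sets().items():
        print(sorted(pattern), support, retrieve(pattern))
    print(confidence({"iran"}, {"opec"}))
```

Each frequent set found this way can be handed back to `retrieve`, which is the sense in which a discovered pattern lets the user browse to the documents behind it.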
Copyright information
© 1997 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Feldman, R., Klösgen, W., Ben-Yehuda, Y., Kedar, G., Reznikov, V. (1997). Pattern based browsing in document collections. In: Komorowski, J., Zytkow, J. (eds) Principles of Data Mining and Knowledge Discovery. PKDD 1997. Lecture Notes in Computer Science, vol 1263. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-63223-9_111
DOI: https://doi.org/10.1007/3-540-63223-9_111
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-63223-8
Online ISBN: 978-3-540-69236-2
eBook Packages: Springer Book Archive