Abstract
We present a knowledge discovery and data mining process developed as part of the Columbia/Con Edison project on manhole event prediction. This process can assist with real-world prioritization problems that involve raw data in the form of noisy documents requiring significant amounts of pre-processing. The documents are linked to a set of instances to be ranked according to prediction criteria. In the case of manhole event prediction, which is a new application for machine learning, the goal is to rank the electrical grid structures in Manhattan (manholes and service boxes) according to their vulnerability to serious manhole events such as fires, explosions and smoking manholes. Our ranking results are currently being used to help prioritize repair work on the Manhattan electrical grid.
Additional information
Editor: Carla Brodley.
This work was done while Cynthia Rudin was at the Center for Computational Learning Systems at Columbia University.
About this article
Cite this article
Rudin, C., Passonneau, R.J., Radeva, A. et al. A process for predicting manhole events in Manhattan. Mach Learn 80, 1–31 (2010). https://doi.org/10.1007/s10994-009-5166-y
DOI: https://doi.org/10.1007/s10994-009-5166-y