Machine-learning classifiers for imbalanced tornado data

  • Original Paper
Computational Management Science

Abstract

Learning from imbalanced data, where one class has significantly more observations than the other, has gained considerable attention in the machine learning community. Assuming each class is similarly difficult to predict, most standard classifiers tend to predict the majority class well at the expense of the minority class. This study uses tornado data, which are highly imbalanced because tornadoes are rare events. In the severe weather data used here, thunderstorm circulations (mesocyclones) produce tornadoes in approximately 6.7% of the observations. Because tornadoes are high-impact weather events, however, it is important to predict the minority class with high accuracy. We apply support vector machines (SVMs) and logistic regression, each with and without a midpoint threshold adjustment on the probabilistic outputs, as well as random forest and rotation forest, to tornado prediction. Feature selection with SVM-recursive feature elimination (SVM-RFE) was also performed to identify the most important features or variables for predicting tornadoes. The results showed that SVMs with the threshold adjustment outperformed the other classifiers.
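To make the idea of a threshold adjustment concrete, the following minimal Python sketch contrasts the default 0.5 cutoff with an adjusted cutoff on a classifier's probabilistic outputs. It is an illustration only, not the authors' procedure: the abstract does not define the midpoint rule, so the sketch assumes the threshold is the midpoint between the mean predicted probabilities of the two classes. The toy scores, the function names, and the choice of probability of detection (POD) and critical success index (CSI) as summary measures are likewise assumptions made for this example.

```python
from statistics import mean


def midpoint_threshold(probs, labels):
    """ASSUMED rule: halfway between the mean score of each class."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    return (mean(pos) + mean(neg)) / 2.0


def contingency(probs, labels, threshold):
    """Return (hits, misses, false_alarms, correct_nulls) at a cutoff."""
    hits = misses = false_alarms = correct_nulls = 0
    for p, y in zip(probs, labels):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 1:
            hits += 1
        elif predicted == 0 and y == 1:
            misses += 1
        elif predicted == 1 and y == 0:
            false_alarms += 1
        else:
            correct_nulls += 1
    return hits, misses, false_alarms, correct_nulls


def pod(hits, misses):
    """Probability of detection: fraction of events that were predicted."""
    return hits / (hits + misses) if (hits + misses) else 0.0


def csi(hits, misses, false_alarms):
    """Critical success index: hits over all events and false alarms."""
    total = hits + misses + false_alarms
    return hits / total if total else 0.0


# Hypothetical toy scores: 2 events in 30 cases (about 6.7% positives).
labels = [1, 1] + [0] * 28
probs = [0.45, 0.30] + [0.05] * 20 + [0.10] * 6 + [0.20, 0.35]

for name, t in [("default", 0.5),
                ("midpoint", midpoint_threshold(probs, labels))]:
    h, m, fa, cn = contingency(probs, labels, t)
    accuracy = (h + cn) / len(labels)
    print(f"{name}: threshold={t:.3f} accuracy={accuracy:.3f} "
          f"POD={pod(h, m):.2f} CSI={csi(h, m, fa):.2f}")
```

On these toy scores the default cutoff labels every case as non-tornadic, so accuracy looks high while no event is detected. Lowering the cutoff recovers both events at the cost of one false alarm. In practice the threshold would be chosen on training or validation data rather than on the cases being scored.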

Acknowledgments

Funding for this research was provided under the National Science Foundation Grants AGS0831359 and EIA-0205628.

Author information

Corresponding author

Correspondence to Theodore B. Trafalis.

About this article

Cite this article

Trafalis, T.B., Adrianto, I., Richman, M.B. et al. Machine-learning classifiers for imbalanced tornado data. Comput Manag Sci 11, 403–418 (2014). https://doi.org/10.1007/s10287-013-0174-6

  • DOI: https://doi.org/10.1007/s10287-013-0174-6
