
The Effects of Data Sampling with Deep Learning and Highly Imbalanced Big Data


Abstract

Training predictive models with class-imbalanced data has proven to be a difficult task. This problem is well studied, but the era of big data is producing more extreme levels of imbalance that are increasingly difficult to model. We use three data sets of varying complexity to evaluate data sampling strategies for treating high class imbalance with deep neural networks and big data. Sampling rates are varied to create training distributions with positive class sizes ranging from 0.025% to 90%. The area under the receiver operating characteristic curve is used to compare performance, and output thresholding is used to maximize class-wise performance. Random over-sampling (ROS) consistently outperforms random under-sampling (RUS) and baseline methods. The majority class proves susceptible to misrepresentation when using RUS, and results suggest that each data set is uniquely sensitive to imbalance and sample size. The hybrid ROS-RUS maximizes performance and efficiency, and is our preferred method for treating high imbalance within big data problems.
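The three sampling strategies compared in the abstract can be sketched as follows. This is an illustrative NumPy implementation, not the authors' code: the `pos_frac` and `pos_mult` parameters are assumed knobs for setting the target positive-class fraction, and the geometric mean of TPR and TNR is one common thresholding criterion, not necessarily the exact one used in the paper.

```python
import numpy as np

def random_oversample(X, y, pos_frac=0.5, rng=None):
    """ROS: duplicate minority (positive) examples until they form
    pos_frac of the training distribution."""
    if rng is None:
        rng = np.random.default_rng(0)
    pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
    # choose a positive count so that n_pos / (n_pos + n_neg) == pos_frac
    n_target = int(pos_frac * len(neg) / (1 - pos_frac))
    extra = rng.choice(pos, size=max(n_target - len(pos), 0), replace=True)
    idx = np.concatenate([neg, pos, extra])
    rng.shuffle(idx)
    return X[idx], y[idx]

def random_undersample(X, y, pos_frac=0.5, rng=None):
    """RUS: discard majority (negative) examples until positives form
    pos_frac of the training distribution."""
    if rng is None:
        rng = np.random.default_rng(0)
    pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
    n_neg = int(len(pos) * (1 - pos_frac) / pos_frac)
    keep = rng.choice(neg, size=min(n_neg, len(neg)), replace=False)
    idx = np.concatenate([pos, keep])
    rng.shuffle(idx)
    return X[idx], y[idx]

def hybrid_ros_rus(X, y, pos_mult=2.0, pos_frac=0.5, rng=None):
    """Hybrid ROS-RUS: modestly oversample positives (pos_mult x),
    then undersample negatives down to pos_frac. The result is far
    smaller than full ROS while retaining more majority-class
    information than full RUS."""
    if rng is None:
        rng = np.random.default_rng(0)
    pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
    pos_idx = rng.choice(pos, size=int(len(pos) * pos_mult), replace=True)
    n_neg = int(len(pos_idx) * (1 - pos_frac) / pos_frac)
    neg_idx = rng.choice(neg, size=min(n_neg, len(neg)), replace=False)
    idx = np.concatenate([pos_idx, neg_idx])
    rng.shuffle(idx)
    return X[idx], y[idx]

def best_threshold(y_true, scores):
    """Output thresholding: sweep candidate cut-offs on the model's
    scores and keep the one maximizing the geometric mean of the
    true-positive and true-negative rates (an assumed criterion for
    balancing class-wise performance)."""
    best_t, best_g = 0.5, -1.0
    for t in np.unique(scores):
        pred = (scores >= t).astype(int)
        tpr = (pred[y_true == 1] == 1).mean()
        tnr = (pred[y_true == 0] == 0).mean()
        g = np.sqrt(tpr * tnr)
        if g > best_g:
            best_t, best_g = t, g
    return best_t
```

For example, with 10 positives among 1,000 examples, the hybrid at `pos_mult=2.0` and `pos_frac=0.5` produces a 40-example training set (20 duplicated positives, 20 sampled negatives), illustrating why the hybrid is both balanced and efficient compared with full ROS, which here would grow the set to 1,980 examples.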




Acknowledgements

The authors would like to thank the various members of the Data Mining and Machine Learning Laboratory at Florida Atlantic University for assistance with reviews.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Justin M. Johnson.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Johnson, J.M., Khoshgoftaar, T.M. The Effects of Data Sampling with Deep Learning and Highly Imbalanced Big Data. Inf Syst Front 22, 1113–1131 (2020). https://doi.org/10.1007/s10796-020-10022-7

