skip to main content
10.1145/3589335.3648303acmconferencesArticle/Chapter ViewAbstractPublication PageswwwConference Proceedingsconference-collections
research-article
Free Access

SOIL: Score Conditioned Diffusion Model for Imbalanced Cloud Failure Prediction

Published:13 May 2024Publication History

ABSTRACT

Cloud failure prediction (e.g., disk failure prediction, memory failure prediction, node failure prediction, etc.) is a crucial task for ensuring the reliability and performance of cloud systems.However, the problem of class imbalance poses a huge challenge for accurate prediction as the number of healthy components (majority class) in a cloud system is much larger than the number of failed components (minority class). The consequences of this class imbalance include biased model performance and insufficient learning, as the model may lack adequate information to learn the characteristics associated with cloud failure effectively. Moreover, current methods for addressing the class imbalance problem, such as SMOTE and its variants, exhibit certain drawbacks, such as generating noisy samples and struggling to maintain sample diversity, which limit their effectiveness in addressing the challenges presented by the class imbalance in cloud failure prediction. In this paper, we propose a novel oversampling method for imbalanced classification, named SOIL (Score cOnditioned dIffusion modeL), which employs a score-conditioned diffusion model to generate high-quality synthetic samples for the minority class, more accurately representing real-world cloud failure patterns. By incorporating classification probabilities as conditional scores, SOIL offers supervision to the generation process, effectively limiting noise production while maintaining sample diversity. Through extensive experiments on various public and industrial datasets , upon adopting our method, the cloud failure prediction model's F1-score is improved by an average of 5.39% and consistently outperforms state-of-the-art competitors in addressing the class imbalance problem, which confirm the effectiveness and robustness of SOIL. In addition, SOIL has been successfully applied to a global large-scale cloud platform serving billions of customers, demonstrating its practicability.

Skip Supplemental Material Section

Supplemental Material

ip1380.mp4

Supplemental video

mp4

36.6 MB

References

  1. Danilo Ardagna, Barbara Panicucci, and Mauro Passacantando. 2011. A game theoretic formulation of the service provisioning problem in cloud systems. In Proceedings of the 20th international conference on World wide web. 177--186.Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Mirela Madalina Botezatu, Ioana Giurgiu, Jasmina Bogojeska, and Dorothea Wiesmann. 2016. Predicting disk replacement towards reliable data centers. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 39--48.Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. 2002. SMOTE: synthetic minority over-sampling technique. Journal of artificial intelligence research , Vol. 16 (2002), 321--357.Google ScholarGoogle ScholarCross RefCross Ref
  4. Supratim Deb, Zihui Ge, Sastry Isukapalli, Sarat Puthenpura, Shobha Venkataraman, He Yan, and Jennifer Yates. 2017. Aesop: Automatic policy learning for predicting and mitigating network service impairments. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1783--1792.Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Jiechao Gao, Haoyu Wang, and Haiying Shen. 2020. Task failure prediction in cloud data centers using deep learning. IEEE transactions on services computing (2020).Google ScholarGoogle ScholarCross RefCross Ref
  6. Jiazhen Gu, Chuan Luo, Si Qin, Bo Qiao, Qingwei Lin, Hongyu Zhang, Ze Li, Yingnong Dang, Shaowei Cai, Wei Wu, et al. 2020a. Efficient incident identification from multi-dimensional issue reports via meta-heuristic search. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 292--303.Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Jiazhen Gu, Jiaqi Wen, Zijian Wang, Pu Zhao, Chuan Luo, Yu Kang, Yangfan Zhou, Li Yang, Jeffrey Sun, Zhangwei Xu, et al. 2020b. Efficient customer incident triage via linking with system incidents. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 1296--1307.Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Haibo He, Yang Bai, Edwardo A Garcia, and Shutao Li. 2008. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In 2008 IEEE international joint conference on neural networks (IEEE world congress on computational intelligence). IEEE, 1322--1328.Google ScholarGoogle Scholar
  9. Haibo He and Edwardo A Garcia. 2009. Learning from imbalanced data. IEEE Transactions on knowledge and data engineering, Vol. 21, 9 (2009), 1263--1284.Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems , Vol. 33 (2020), 6840--6851.Google ScholarGoogle Scholar
  11. Nathalie Japkowicz and Shaju Stephen. 2002. The class imbalance problem: A systematic study. Intelligent data analysis, Vol. 6, 5 (2002), 429--449.Google ScholarGoogle Scholar
  12. Sebastien Levy, Randolph Yao, Youjiang Wu, Yingnong Dang, Peng Huang, Zheng Mu, Pu Zhao, Tarun Ramani, Naga Govindaraju, Xukun Li, et al. 2020. Predictive and Adaptive Failure Mitigation to Avert Production Cloud $$VM$$ Interruptions. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). 1155--1170.Google ScholarGoogle Scholar
  13. Jing Li, Xinpu Ji, Yuhan Jia, Bingpeng Zhu, Gang Wang, Zhongwei Li, and Xiaoguang Liu. 2014. Hard drive failure prediction using classification and regression trees. In 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks. IEEE, 383--394.Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Xihui Liu, Dong Huk Park, Samaneh Azadi, Gong Zhang, Arman Chopikyan, Yuxiao Hu, Humphrey Shi, Anna Rohrbach, and Trevor Darrell. 2023. More control for free! image synthesis with semantic diffusion guidance. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 289--299.Google ScholarGoogle ScholarCross RefCross Ref
  15. Yudong Liu, Hailan Yang, Pu Zhao, Minghua Ma, Chengwu Wen, Hongyu Zhang, Chuan Luo, Qingwei Lin, Chang Yi, Jiaojian Wang, et al. 2022. Multi-task Hierarchical Classification for Disk Failure Prediction in Online Service Systems. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 3438--3446.Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Sidi Lu, Bing Luo, Tirthak Patel, Yongtao Yao, Devesh Tiwari, and Weisong Shi. 2020. Making disk failure predictions smarter!. In FAST. 151--167.Google ScholarGoogle Scholar
  17. Chuan Luo, Bo Qiao, Xin Chen, Pu Zhao, Randolph Yao, Hongyu Zhang, Wei Wu, Andrew Zhou, and Qingwei Lin. 2020. Intelligent Virtual Machine Provisioning in Cloud Computing.. In IJCAI. 1495--1502.Google ScholarGoogle Scholar
  18. Chuan Luo, Pu Zhao, Bo Qiao, Youjiang Wu, Hongyu Zhang, Wei Wu, Weihai Lu, Yingnong Dang, Saravanakumar Rajmohan, Qingwei Lin, and Dongmei Zhang. 2021. NTAM: Neighborhood-Temporal Attention Model for Disk Failure Prediction in Cloud Platforms. In Proceedings of WWW 2021, Jure Leskovec, Marko Grobelnik, Marc Najork, Jie Tang, and Leila Zia (Eds.). ACM / IW3C2, 1181--1191.Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Minghua Ma, Yudong Liu, Yuang Tong, Haozhe Li, Pu Zhao, Yong Xu, Hongyu Zhang, Shilin He, Lu Wang, Yingnong Dang, Saravanakumar Rajmohan, and Qingwei Lin. 2022. An empirical investigation of missing data handling in cloud node failure prediction. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2022, Singapore, Singapore, November 14--18, 2022, Abhik Roychoudhury, Cristian Cadar, and Miryung Kim (Eds.). ACM, 1453--1464.Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Justin Meza, Qiang Wu, Sanjev Kumar, and Onur Mutlu. 2015. A large-scale study of flash memory failures in the field. ACM SIGMETRICS Performance Evaluation Review, Vol. 43, 1 (2015), 177--190.Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Nashid Shahriar, Reaz Ahmed, Shihabur Rahman Chowdhury, Aimal Khan, Raouf Boutaba, and Jeebak Mitra. 2017. Generalized recovery from node failure in virtual network embedding. IEEE Transactions on Network and Service Management, Vol. 14, 2 (2017), 261--274.Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. 2015. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning. PMLR, 2256--2265.Google ScholarGoogle Scholar
  23. Yusuke Tashiro, Jiaming Song, Yang Song, and Stefano Ermon. 2021. CSDI: Conditional score-based diffusion models for probabilistic time series imputation. Advances in Neural Information Processing Systems , Vol. 34 (2021), 24804--24816.Google ScholarGoogle Scholar
  24. Yilin Yan, Min Chen, Mei-Ling Shyu, and Shu-Ching Chen. 2015. Deep learning for imbalanced multimedia data classification. In 2015 IEEE international symposium on multimedia (ISM). IEEE, 483--488.Google ScholarGoogle ScholarCross RefCross Ref
  25. Pu Zhao, Chuan Luo, Bo Qiao, Lu Wang, Saravan Rajmohan, Qingwei Lin, and Dongmei Zhang. 2022. T-SMOTE: temporal-oriented synthetic minority oversampling technique for imbalanced time series classification. In Proceedings of IJCAI.Google ScholarGoogle ScholarCross RefCross Ref

Index Terms

  1. SOIL: Score Conditioned Diffusion Model for Imbalanced Cloud Failure Prediction

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Conferences
      WWW '24: Companion Proceedings of the ACM on Web Conference 2024
      May 2024
      1928 pages
      ISBN:9798400701726
      DOI:10.1145/3589335

      Copyright © 2024 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 13 May 2024

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • research-article

      Acceptance Rates

      Overall Acceptance Rate1,899of8,196submissions,23%
    • Article Metrics

      • Downloads (Last 12 months)46
      • Downloads (Last 6 weeks)46

      Other Metrics

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader