ABSTRACT
Cloud failure prediction (e.g., disk failure prediction, memory failure prediction, and node failure prediction) is a crucial task for ensuring the reliability and performance of cloud systems. However, class imbalance poses a major challenge for accurate prediction, because the number of healthy components (majority class) in a cloud system is much larger than the number of failed components (minority class). The consequences of this imbalance include biased model performance and insufficient learning, as the model may lack adequate information to learn the characteristics associated with cloud failure effectively. Moreover, current methods for addressing class imbalance, such as SMOTE and its variants, have drawbacks, including generating noisy samples and struggling to maintain sample diversity, which limit their effectiveness for cloud failure prediction. In this paper, we propose a novel oversampling method for imbalanced classification, named SOIL (Score cOnditioned dIffusion modeL), which employs a score-conditioned diffusion model to generate high-quality synthetic samples for the minority class that more accurately represent real-world cloud failure patterns. By incorporating classification probabilities as conditional scores, SOIL supervises the generation process, effectively limiting noise production while maintaining sample diversity. Extensive experiments on various public and industrial datasets show that adopting SOIL improves the F1-score of the cloud failure prediction model by an average of 5.39% and that SOIL consistently outperforms state-of-the-art competitors in addressing the class imbalance problem, which confirms its effectiveness and robustness. In addition, SOIL has been successfully applied to a global large-scale cloud platform serving billions of customers, demonstrating its practicability.
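To make the "biased model performance" point concrete, the following minimal sketch shows why accuracy is misleading under class imbalance and why the F1-score is the metric reported here. The fleet size (990 healthy, 10 failed components) and the trivial "always healthy" predictor are hypothetical values chosen for illustration; they do not come from the paper's datasets, and the code is not part of SOIL.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """F1-score for the positive (failure) class; 0.0 when undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical fleet: 990 healthy components (0) and 10 failed ones (1).
y_true = [0] * 990 + [1] * 10

# A degenerate model that never predicts a failure.
always_healthy = [0] * len(y_true)

acc = accuracy(y_true, always_healthy)
f1 = f1_score(y_true, always_healthy)
print(f"accuracy={acc:.2f}, F1={f1:.2f}")
```

The degenerate predictor is right about almost every component yet detects no failures, so its F1-score on the failure class collapses. Oversampling the minority class is one way to give a model enough failure examples to avoid this behavior.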
Index Terms
- SOIL: Score Conditioned Diffusion Model for Imbalanced Cloud Failure Prediction