Abstract
High reliability must be ensured in distributed storage systems (DSSs) to maintain the stability of warehouse-scale computing and high-performance computing (HPC) systems. For system-level reliability, a repair operation using redundant storage nodes can be used in conjunction with erasure coding (EC), which can also affect system performance. Existing EC designs have mainly focused on minimizing the bandwidth required for repair and the storage overhead. However, the computing performance of EC should also be considered so that the repair bandwidth is high enough to exploit the capacity of back-end network links with heterogeneous, high-speed interconnects over 10 Gbps Ethernet. In this study, a new computing acceleration method for the repair operation in EC is proposed that uses multiple repair paths and modifies the computation kernel on the graphics processing unit (GPU) device. For Cauchy Reed–Solomon (CRS) codes, the proposed scheme is observed to achieve a repair bandwidth that is sufficient relative to the theoretical bound or that exceeds the current maximum Ethernet link bandwidth.
Data availability
No datasets were generated or analysed during the current study.
Acknowledgements
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (Nos. 2021R1G1A1091369 and RS-2023-00281635). In addition, the research was supported by the “Research Base Construction Fund Support Program” funded by Jeonbuk National University in 2023.
Funding
This study was supported by the National Research Foundation of Korea (Grant Nos. 2021R1G1A1091369 and RS-2023-00281635) and by Jeonbuk National University (Research Base Construction Fund Support Program).
Author information
Contributions
All authors wrote and reviewed the manuscript.
Ethics declarations
Conflict of interest
The authors declare no conflicts of interest.
About this article
Cite this article
Kim, C., Chon, KW. Accelerating erasure coding by exploiting multiple repair paths in distributed storage systems. Cluster Comput (2024). https://doi.org/10.1007/s10586-024-04438-y
DOI: https://doi.org/10.1007/s10586-024-04438-y