I/O separation scheme on Lustre metadata server based on multi-stream SSD

Cluster Computing

Abstract

As the price of NAND flash storage decreases, large-scale backend distributed file systems are increasingly built as all-flash storage without HDDs. However, the performance of an SSD can drop sharply because of internal garbage collection overhead and the accompanying write amplification. The Lustre distributed file system provides the Data-on-MDT (DoM) feature, which stores small files directly on the Metadata Server (MDS) instead of on an Object Storage Server. Despite its benefit in reducing communication traffic, DoM fills the Metadata Target (MDT) much faster, which triggers garbage collection with write amplification and drastically reduces MDT performance. DoM I/O also consumes I/O bandwidth, starving other metadata I/O jobs on the MDS. We therefore propose two I/O separation schemes: data separation to address write amplification, and I/O bandwidth separation to address bandwidth starvation. We separate the physical placement of DoM data, normal metadata, and journaling data using a multi-stream SSD. We also virtually isolate the I/O resources of DoM I/O and metadata I/O by limiting the bandwidth of DoM I/O with Linux cgroups. Our schemes improve MDT I/O throughput by 70% and IOPS by 81% by preventing write amplification, and provide stable metadata I/O performance on the MDS.
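The two separation ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the stream ID values, the helper names, and the example device numbers are assumptions made for illustration only. The throttle line follows the `major:minor bytes_per_second` format used by the cgroup v1 `blkio.throttle.*_bps_device` files, and the write amplification factor uses the common definition of flash bytes written divided by host bytes written.

```python
from enum import IntEnum


class Stream(IntEnum):
    """Hypothetical stream IDs, one per I/O class named in the abstract.

    The actual ID assignment used in the paper is not stated here.
    """
    JOURNAL = 1
    METADATA = 2
    DOM_DATA = 3


def throttle_rule(major: int, minor: int, bytes_per_sec: int) -> str:
    """Format one line for a cgroup v1 blkio.throttle.*_bps_device file.

    The kernel expects "<major>:<minor> <rate_bytes_per_second>".
    This helper only builds the string; writing it requires root and a
    mounted blkio controller, which is outside the scope of this sketch.
    """
    if bytes_per_sec < 0:
        raise ValueError("bandwidth limit must be non-negative")
    return f"{major}:{minor} {bytes_per_sec}"


def write_amplification(flash_bytes: int, host_bytes: int) -> float:
    """Write amplification factor: bytes written to flash per host byte."""
    if host_bytes <= 0:
        raise ValueError("host_bytes must be positive")
    return flash_bytes / host_bytes


if __name__ == "__main__":
    # Assumed example: cap DoM writes at 100 MiB/s on device 259:0.
    print(throttle_rule(259, 0, 100 * 1024 * 1024))
    # Assumed example: 300 GB written to flash for 100 GB of host writes.
    print(write_amplification(300, 100))
    print(Stream.DOM_DATA.name, int(Stream.DOM_DATA))
```

In practice the formatted rule would be written to the throttle file of the cgroup that holds the DoM I/O threads, while each write would be tagged with the stream ID of its class so the SSD can place the three kinds of data in separate erase blocks.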

Data availability

Available upon request.

Funding

This work was supported by the Korea Institute of Science and Technology Information (K-22-L02-C06-S01, K-22-L02-C01), the Basic Science Research Program (NRF-2020R1F1A1072696, NRF-2021R1F1A1063438) through the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT, BK21 FOUR Intelligence Computing (Dept. of Computer Science and Engineering, SNU) funded by the National Research Foundation of Korea (NRF) (4199990214639), the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2022-2018-0-01423) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation), and the GRRC program of Gyeong-gi province (No. GRRC-KAU-2017-B01, “Study on the Video and Space Convergence Platform for 360VR Services”).

Author information

Contributions

CL: Software, Writing (original draft), Review, Validation. JL: Conceptualization, Supervision, Writing (original draft), Funding acquisition, Project administration. CK: Writing (original draft), Validation, Methodology. JB: Methodology, Data curation. E-KB: Supervision, Resources, Funding acquisition. HE: Project administration, Funding acquisition, Conceptualization.

Corresponding author

Correspondence to Jaehwan Lee.

Ethics declarations

Competing interests

The authors have no relevant financial or non-financial interests to disclose.

About this article

Cite this article

Lee, C., Lee, J., Kim, C. et al. I/O separation scheme on Lustre metadata server based on multi-stream SSD. Cluster Comput 26, 2883–2896 (2023). https://doi.org/10.1007/s10586-022-03801-1

  • DOI: https://doi.org/10.1007/s10586-022-03801-1
