ABSTRACT
Network File System (NFS) is commonly used in cloud environments as a cost-effective file storage solution that is easy to set up. However, the multi-tenant nature of cloud infrastructures makes distributed file systems prone to unstable and unpredictable performance. These performance issues can be very harmful to both Cloud Service Providers (CSPs) and tenants. CSPs and their customers therefore increasingly require real-time, fine-grained metrics (per-file, high-frequency) to optimize data placement and resource usage dynamically, to ensure file access performance, to provision resources cost-effectively, and to support billing and rapid troubleshooting. In this paper, we propose TrackIops, a novel NFS tracer that provides these metrics with little effort and at low cost. TrackIops is an eBPF-based, client-side, request-oriented tracing solution. The main contribution of this paper is a kernel-level solution that reconstructs NFS request and response threads and analyses them online without requiring server instrumentation. TrackIops provides a real-time per-tenant, per-file, per-second NFS metrics extractor that is easy to integrate into any optimization or troubleshooting solution, with an overhead lower than 3.5% on the client in a worst-case scenario.
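To make the notion of per-tenant, per-file, per-second metrics concrete, the following Python sketch pairs request and response events by a transaction identifier and aggregates them into one-second buckets. It is an illustrative user-space model only, not the TrackIops implementation, which runs in the kernel with eBPF. The event fields (`ts`, `kind`, `xid`, `tenant`, `file`, `size`) and the names `Bucket` and `aggregate` are assumptions made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Bucket:
    """Metrics accumulated for one (tenant, file, second) key."""
    ops: int = 0
    bytes: int = 0
    latency_sum: float = 0.0


def aggregate(events):
    """Pair requests with responses and bucket them per second.

    `events` is an iterable of dicts in timestamp order. This layout is
    hypothetical: requests carry ts, kind, xid, tenant, file and size;
    responses carry ts, kind and xid.
    """
    pending = {}                    # xid -> request still awaiting its response
    buckets = defaultdict(Bucket)   # (tenant, file, second) -> Bucket
    for ev in events:
        if ev["kind"] == "request":
            pending[ev["xid"]] = ev
        elif ev["kind"] == "response":
            req = pending.pop(ev["xid"], None)
            if req is None:
                continue            # response without a traced request: skip it
            # Attribute the operation to the second in which it completed.
            key = (req["tenant"], req["file"], int(ev["ts"]))
            bucket = buckets[key]
            bucket.ops += 1
            bucket.bytes += req["size"]
            bucket.latency_sum += ev["ts"] - req["ts"]
    return dict(buckets)
```

Dividing `latency_sum` by `ops` gives the mean latency for a bucket. Keying on the completion second is one possible design choice; a tracer could equally attribute an operation to the second in which it was issued.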