Design and Implementation of Burst Buffer Over-Subscription Scheme for HPC Storage Systems

Burst Buffer is widely used in supercomputer centers to bridge the performance gap between computational power and high-performance I/O systems. The primary role of Burst Buffer is to temporarily absorb bursty I/O and reduce heavy access to the Parallel File System (PFS). However, job resource managers on High-Performance Computing (HPC) systems typically use a dedicated Burst Buffer allocation approach, which leaves the Burst Buffer resource severely underutilized. To improve the efficiency of the expensive Burst Buffer resource, we analyze the I/O patterns on Burst Buffer in depth. We propose a Burst Buffer over-subscription allocation method, BBOS, which improves Burst Buffer utilization by allowing each job to access Burst Buffer only during its I/O phases so that jobs can overlap each other. Furthermore, we develop a new I/O congestion-aware scheduler and a transparent data management system between Burst Buffer and PFS. Our approach also reduces the memory overhead and improves the data persistence of the data management system by adopting persistent memory. With the proposed approach, Burst Buffer utilization is improved and HPC applications achieve high I/O performance by exploiting the powerful Burst Buffer hardware. Experimental results show that BBOS can improve Burst Buffer utilization by up to 120% while providing more stable and higher checkpoint performance, even under high I/O loads, than other state-of-the-art schedulers. In addition, our approach can improve the hit ratio of restart requests by up to 96.4% and provides up to 210% higher restart throughput on Burst Buffer.


I. INTRODUCTION
The associate editor coordinating the review of this manuscript and approving it for publication was Alberto Cano.

As computational capability has grown beyond one petaflop, a large number of system components have been deployed in HPC systems, thereby increasing overall system failures [1], [2], [3]. For fail-safety, HPC applications tend to aggressively use the checkpoint and restart strategy, which is the most common fault tolerance mechanism. The checkpoint and restart mechanism inevitably generates bursty I/O, which accounts for 75% to 80% of the total I/O traffic of the overall HPC system [2], [4]. Since PFS consists of cheap HDDs or low-end flash SSDs, the heavy I/O accesses generated by checkpoint and restart operations are difficult for PFS to handle and eventually lead to low I/O performance. To alleviate the overhead on PFS and speed up I/O, Burst Buffer composed of high-end flash SSDs (e.g., 3D XPoint SSD and NVMe SSD) [5], [6] has been introduced as a new storage tier located between compute nodes and PFS. Due to the substantial performance difference between Burst Buffer and PFS, HPC users prefer to allocate a dedicated Burst Buffer resource for the whole lifetime of their submitted jobs.
However, the current dedicated Burst Buffer allocation mechanism leaves the Burst Buffer resource severely underutilized for the following two reasons. First, HPC users tend to request an overabundant amount of Burst Buffer resources (e.g., up to six times [3]). According to real-world log data collected from the NERSC (National Energy Research Scientific Computing Center) Cori supercomputer system, only 5% of the overall Burst Buffer resource is actively used. The reasons for over-requesting the Burst Buffer resource include preventing possible I/O errors, improving I/O performance, and avoiding complicated data movement between Burst Buffer and PFS. Second, since HPC applications use Burst Buffer only during their I/O phases, Burst Buffer stays idle for the rest of the time. Although checkpointing dominates the I/O traffic of the HPC system, a long interval exists between two successive checkpoint operations. As a result, the allocated Burst Buffer resource is wasted for most of the application lifetime.
In this paper, we propose an efficient HPC storage management approach using a Burst Buffer over-subscription allocation method, called BBOS (Burst Buffer Over-Subscription). To support over-subscription in HPC storage systems, we transparently manage data movement between Burst Buffer and PFS when scheduling I/O jobs. The key idea behind BBOS is to exploit the characteristics of checkpoint and restart operations, which account for most of the I/O traffic in HPC storage systems. Because the checkpoint and restart mechanism has specific I/O characteristics, generic data management approaches, such as kernel data management between the memory and storage layers or the approaches commonly used for storage tiers within PFS, can result in low performance. In this work, we introduce a new data placement scheduling policy between Burst Buffer and PFS that considers the characteristics of checkpoint and restart operations. To show the improved Burst Buffer utilization and checkpoint/restart performance of BBOS, we compare our approach with Cray DataWarp [7], the current representative HPC scheduler that uses the dedicated Burst Buffer allocation method, and Harmonia [8], a Burst Buffer-based dynamic I/O scheduler that takes Burst Buffer over-subscription into account. Compared to DataWarp, the BBOS framework improves Burst Buffer utilization by up to 120% while maintaining stable and high checkpoint performance. Besides, our approach can provide high checkpoint performance and improve restart performance by up to 96.4% on Burst Buffer by utilizing the characteristics of checkpoint and restart operations.
In our previous work [9], we implemented the Burst Buffer over-subscription framework to solve the Burst Buffer underutilization problem observed in most HPC systems. The BBOS framework consists of an I/O engine, a data management engine, and an in-memory key-value store to efficiently handle the checkpoint and restart operations of HPC applications. We extend the original work by using persistent memory in the BBOS framework. We implement an improved version of the framework using NVDIMM with the Redis in-memory key-value database. Specifically, the memory capacity can be increased at low cost and data persistence is guaranteed even when a power failure occurs. We show that NVDIMM-applied BBOS, which stores most of its data in NVDIMM, incurs no I/O performance degradation. We further extend the original work by evaluating the NVDIMM-applied BBOS design, presenting the implementation details, and explaining the workflow of checkpoint, restart, and demotion operations when applying the BBOS framework on Burst Buffer.
Our contributions are as follows:
• We adopt the over-subscription Burst Buffer allocation method to efficiently utilize the Burst Buffer resource.
• We analyze the characteristics of checkpoint and restart operations of HPC applications in detail. We find that the existing data management approach does not consider checkpoint and restart characteristics of HPC applications, which results in low I/O performance.
• We propose BBOS, a novel HPC data management approach that provides high Burst Buffer utilization as well as stable and high checkpoint and restart performance. BBOS schedules I/O jobs, adaptively adjusts the demotion threshold and the I/O bandwidth of checkpoint and demotion, and manages the data placement policy between Burst Buffer and PFS.
• We implement a BBOS prototype by adding multiple modules to GlusterFS. We utilize persistent memory to lower the DRAM overhead and add data persistence when adopting the BBOS framework on Burst Buffer.

II. BACKGROUND AND MOTIVATION
A. UNDERUTILIZED BURST BUFFER
Burst Buffer is located in the intermediate layer between computational nodes and storage systems to absorb bursty I/O in HPC systems [10], [11]. Each Burst Buffer node consists of expensive hardware resources, such as high-speed storage media and a high-speed network. Most supercomputers, including the Cori supercomputer [12] at NERSC and the Summit supercomputer at ORNL (Oak Ridge National Laboratory), allocate Burst Buffer resources using a dedicated Burst Buffer allocation method. The users specify the desired capacity or the desired nodes to be used for their applications, and the specified space is provided by an HPC scheduler [13], [14] during the whole lifetime of the applications. However, the dedicated allocation method leads to severe underutilization of Burst Buffer, as HPC users normally reserve more Burst Buffer space than they actually need, for the following reasons. Application jobs fail with an I/O error when there is not enough Burst Buffer capacity to handle the I/O. To avoid such failures, supercomputer providers recommend that users request a surplus amount of Burst Buffer capacity [3]. Beyond avoiding failure, users may also request generous capacity in the expectation of higher performance. Another reason for overabundant requests is the complicated data management in multi-tier HPC storage systems (i.e., local storage of a compute node, Burst Buffer, and PFS). Since current supercomputers manage Burst Buffer and PFS separately, users face redundant and complicated management. For instance, if users reserve only enough Burst Buffer capacity to write one checkpoint, they have to copy data manually from Burst Buffer to PFS at the end of every checkpoint phase to make Burst Buffer space for the next phase [15].

VOLUME 11, 2023
Reserving Burst Buffer without considering the characteristics of checkpoint and restart operations is another critical cause of the resource underutilization problem. HPC applications perform checkpoint operations over a fixed amount of time [16], [17], [18], called the checkpoint period, by periodically repeating a compute phase and an I/O phase. However, as the checkpoint period lasts from tens of minutes to tens of hours, expensive Burst Buffer resources stay idle during compute phases. Moreover, Burst Buffer needs to be reserved for at least twice the capacity of the checkpoint data, since the old checkpoint file should be kept until a new checkpoint file has been completely and safely written. If HPC users decide to store multiple versions of checkpoint files in Burst Buffer to increase data durability, Burst Buffer becomes severely underutilized, as all old-version files except the latest one are rarely accessed. These problems caused by the dedicated Burst Buffer allocation method motivate our over-subscription-based HPC storage management approach.

B. CHARACTERISTICS OF CHECKPOINT AND RESTART OPERATIONS
Unlike common applications, HPC applications have checkpoint and restart-related characteristics. To apply the Burst Buffer over-subscription method on the HPC storage system, a novel data management scheme needs to be developed considering the following five checkpoint and restart characteristics.
First, most HPC applications solve computationally intensive problems and perform checkpoint operations at a particular cycle. We observe that the total amount of checkpoint data written to Burst Buffer in a certain period, called Data Written Per Period (DWPP) in this paper, stays quite steady. Therefore, it is possible to predict the future DWPP of a job from its previous DWPP values.
Second, each application has a specific checkpoint period and an intermediate time interval between two checkpoint operations. Thus, each application accesses the Burst Buffer only during a specific checkpoint period. For instance, HPC applications with short checkpoint periods access Burst Buffer more frequently than ones with long checkpoint periods.
Third, HPC applications tend to keep multiple versions of checkpoint files to increase data durability [19]. HPC users prefer to keep the old versions of checkpoint files without deleting them, even though only the latest version of the checkpoint file is required in the restart process. Since users demand different degrees of data reliability when they run their jobs, each application job maintains a different number of checkpoint versions.
Fourth, HPC applications have different failure rates. Failures are caused by individual components, such as processors, disks, memory, power supplies, network, cooling systems, and the physical connections between them [20]. The large number of components unavoidably leads to frequent failures [21], [22]. Prior studies show that the Mean Time Between Failure (MTBF) of a single node is thousands of hours, while the MTBF of a large-scale cluster with hundreds of nodes is dozens of hours. In other words, failure rates increase linearly with the number of nodes used by HPC applications [23], [24].
Lastly, there is no data locality across the checkpoint files of HPC applications. Temporal locality does not exist across checkpoint files, because a checkpoint file is accessed only when a failure occurs. Spatial locality does not exist either: a checkpoint file will not be accessed unless a failure occurs, even if it is stored near another checkpoint file that was requested.
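The first and fourth characteristics lend themselves to a small numerical illustration. The following is a minimal sketch, not the estimator used in this work: it assumes that the next DWPP can be approximated by the mean of the most recent observations, and that node failures are independent so that the MTBF of a job scales inversely with its node count. The function names, the `window` parameter, and the sample values are hypothetical.

```python
from statistics import mean


def predict_dwpp(history, window=3):
    """Predict the next DWPP as the mean of the last `window` observations.

    Assumption: DWPP is steady enough that a short moving average is a
    reasonable estimate. The source does not specify the predictor.
    """
    if not history:
        raise ValueError("need at least one observed DWPP value")
    return mean(history[-window:])


def cluster_mtbf(node_mtbf_hours, num_nodes):
    """MTBF of a job spanning `num_nodes` nodes.

    Assumption: independent, identical node failures, so the failure rate
    grows linearly with the node count and the MTBF shrinks accordingly.
    """
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return node_mtbf_hours / num_nodes


# Hypothetical values: DWPP in TB per period, node MTBF in hours.
next_dwpp = predict_dwpp([10.0, 12.0, 14.0], window=2)
job_mtbf = cluster_mtbf(5000.0, 100)
```

With a node MTBF of thousands of hours and hundreds of nodes, the second function yields a job-level MTBF of dozens of hours, which matches the orders of magnitude quoted above.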

C. PROBLEM ANALYSIS
Different from the dedicated Burst Buffer allocation method, the Burst Buffer over-subscription method allocates more space to the applications than the actual capacity. To make this possible, the applications are allowed to access Burst Buffer only during the I/O phases. Applications in the computation phase should yield Burst Buffer to other applications in the I/O phase by moving data from Burst Buffer to PFS. Therefore, an efficient data management approach between Burst Buffer and PFS that does not degrade the overall performance is required. There are several previous works that propose efficient data management policy in the multi-tiered system [3], [8], [25], [26]. However, these approaches are not suitable for the HPC storage system where checkpoint dominates most of the I/O traffic for the following reasons.
The previous works use a static demotion threshold without considering the amount of data to be moved between storage tiers. With the prior approaches, demotion is performed only when Burst Buffer is idle. When the total used capacity of Burst Buffer reaches the threshold, demotion has to run concurrently with checkpoint operations. With the over-subscription method, the number of jobs accessing Burst Buffer increases and Burst Buffer fills up quickly with checkpoint data. As a result, demotion operations interrupt the checkpoint operations more aggressively. Figure 1 shows checkpoint performance with different DWPPs after setting the demotion threshold to 90% of the total Burst Buffer capacity. With S denoting the total capacity of Burst Buffer, 1.3S, 1.6S, and 1.9S mean that 1.3 times, 1.6 times, and 1.9 times the size of Burst Buffer is written in a certain period, respectively. Each dot in the figure represents an application job that runs on Burst Buffer as time goes by. We only consider the over-subscription scenarios where the total size of checkpoint files is larger than the Burst Buffer size (DWPP > S), which can improve the Burst Buffer utilization. When only a small number of jobs are reserved to use Burst Buffer (DWPP < S), Burst Buffer utilization remains low while each job has high I/O performance.
When DWPP is 1.3S, the performance of the jobs decreases slightly after 1000 seconds since the number of concurrently running jobs increases. When DWPP increases to 1.6S, Burst Buffer becomes full in the middle of checkpoint I/O operations and the performance begins to drop over time. In the 1.9S case, almost half of the jobs get four times lower performance than the others, as checkpoint operations have to stop and wait for demotion to free space in Burst Buffer.
Another limitation of the previous works is that the arrival pattern of checkpoint operations is not considered in the data management policy. For instance, HPC jobs issue checkpoint operations with different periods. When the inter-arrival time is long enough, there is a sufficient amount of Burst Buffer idle time between the checkpoint operations. The files can then be demoted to PFS, making space in Burst Buffer for the next I/O jobs. However, if the checkpoint operations are issued with a small inter-arrival time, the lack of Burst Buffer idle time makes it difficult to finish data migration before the next checkpoint operation. This inevitably leads to Burst Buffer capacity depletion. Figure 2 shows that checkpoint performance is highly related to the I/O job congestion rate under the same DWPP. The three I/O job congestion patterns, Low, Med, and High, represent how densely I/O jobs arrive; in all three, the jobs are allowed to use 1.9 times the size of Burst Buffer.
Naively using data eviction algorithms such as FIFO, LRU, and LFU can lead to low Burst Buffer utilization. HPC applications have specific checkpoint periods and keep different numbers of checkpoint files to be used for data recovery. When the FIFO algorithm is used, the latest checkpoint file of an application with a long checkpoint period is considered cold data, while an old-version checkpoint file of an application with a short period is considered hot. As a result, the application with the long checkpoint period experiences low recovery performance, which makes Burst Buffer inefficient. Also, since checkpoint files have neither temporal nor spatial locality, LRU, LFU, and other hotness-aware algorithms are not suitable as data eviction policies. To better decide which checkpoint files to evict, the failure rates need to be considered. Without taking the failure rates into account, checkpoint files with high failure rates might be chosen as cold data, instead of checkpoint files with low failure rates.

III. DATA MANAGEMENT IN BURST BUFFER
Applications may suffer from severe performance degradation when the characteristics of checkpoint and restart operations are not fully considered. In this paper, we set the demotion threshold on Burst Buffer and adjust the speed of checkpoint and demotion operations in advance to avoid this performance degradation. We also develop a novel data placement policy that manages the data movement between Burst Buffer and PFS.

A. ADAPTIVE DEMOTION ADJUSTMENT
In order to make free space in Burst Buffer under the over-subscription method, we determine a demotion threshold by considering both DWPP and the I/O job congestion rate, both of which can be retrieved from the log history of application jobs. As shown in Figure 1, DWPP affects the amount of data to be demoted in a certain period. A large DWPP means that a large amount of I/O is to be written to Burst Buffer. In this case, the data needs to be demoted promptly to make free space in Burst Buffer. As shown in Figure 2, Burst Buffer may also fill up quickly depending on the I/O job congestion rate. The checkpoint and demotion throughput is another factor to consider when configuring the demotion threshold. When checkpoint and demotion operations are performed together, there is an inverse relationship between write and read bandwidth within the Burst Buffer I/O capability. As a result, the minimum demotion throughput (Brmin: minimum read throughput provided by Burst Buffer) is determined by the maximum checkpoint throughput (Bwmax: maximum write throughput provided by Burst Buffer). Likewise, the minimum checkpoint throughput (Bwmin: minimum write throughput provided by Burst Buffer) is determined by the maximum demotion throughput (Brmax: maximum read throughput provided by Burst Buffer), as shown in equation (1). Note that the read throughput provided by Burst Buffer is affected by the write throughput provided by PFS. Also, the values of m and b are determined by the device I/O capability. To avoid the worst case in which Burst Buffer runs out of space, our policy adjusts the checkpoint throughput from Bwmax to Bwmin and the demotion throughput from Brmin to Brmax after the used Burst Buffer space reaches the demotion threshold.
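The coupling between checkpoint and demotion throughput can be illustrated with a short sketch. The exact form of the relationship is not reproduced here, so the code below assumes a linear trade-off, read = m * write + b with a negative slope m, purely for illustration. The linear ramp over `progress` and the numeric values are likewise assumptions; only the names Bwmax, Bwmin, Brmin, Brmax, m, and b come from the text.

```python
def demotion_throughput(bw, m, b):
    """Read (demotion) throughput available at write throughput `bw`.

    Assumption: a linear trade-off br = m * bw + b with m < 0, where m and b
    depend on the device I/O capability.
    """
    return m * bw + b


def ramp(progress, bw_max, bw_min, m, b):
    """Return (checkpoint, demotion) throughput at `progress` in [0, 1].

    progress = 0 is the moment the demotion threshold is reached
    (Bwmax, Brmin); progress = 1 is the fully throttled state (Bwmin, Brmax).
    The linear interpolation between the two is an illustrative choice.
    """
    progress = min(max(progress, 0.0), 1.0)
    bw = bw_max - progress * (bw_max - bw_min)
    return bw, demotion_throughput(bw, m, b)


# Hypothetical device: throughputs in GB/s, m = -0.5, b = 6.0.
start = ramp(0.0, 8.0, 4.0, -0.5, 6.0)   # (Bwmax, Brmin)
middle = ramp(0.5, 8.0, 4.0, -0.5, 6.0)
end = ramp(1.0, 8.0, 4.0, -0.5, 6.0)     # (Bwmin, Brmax)
```

Under this assumed model, lowering the checkpoint throughput from Bwmax to Bwmin raises the demotion throughput from Brmin to Brmax, which is the adjustment the policy performs once the threshold is reached.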
Figure 3 shows the aggregated Burst Buffer write bandwidth under different DWPP values. The overall goal is to sustain the maximum possible checkpoint bandwidth while handling the DWPP amount of data written to Burst Buffer within the checkpoint period. With S being the capacity of Burst Buffer, we refer to Data Written So Far (DWSF) as the amount of data written so far within the period. Within one period, t_c is the time available to execute checkpoint operations at Bwmax without any demotion, while t_d is the time required to demote an amount C of data while checkpoint operations execute concurrently. t_d is composed of t_dd and t_ds: t_dd is the time taken for the demotion throughput to gradually increase from Brmin to Brmax, and t_ds is the time during which the demotion throughput stays fixed at Brmax. We group the I/O patterns of the demotion operations into three categories to decide the demotion threshold.

1) PATTERN 1: DEMOTION IS ONLY PERFORMED WHEN BURST BUFFER IS IDLE
When DWPP is less than 1.0S, the checkpoint can be executed at the bandwidth of Bwmax without any demotion, as shown in Figure 3. When the checkpoint period is finished, demotion can be performed during the Burst Buffer idle time.

2) PATTERN 2: DEMOTION IS PERFORMED TOGETHER WITH CHECKPOINT FOR SOME RANGE OF TIME
When DWPP exceeds 1.0S, some of the data in Burst Buffer needs to be demoted concurrently with checkpoint operations. The time at which demotion starts is calculated from DWPP. For instance, when DWPP is 1.2S, the demotion threshold is calculated as a DWSF of 0.7S. In other words, demotion should start while the checkpoint is still being executed, once 70% of the total Burst Buffer space is used. The checkpoint throughput is adjusted in the range between Bwmax and Bwmin to make room for demotion, and the demotion throughput changes in the range between Brmin and Brmax accordingly.

3) PATTERN 3: DEMOTION IS ALWAYS PERFORMED CONCURRENTLY WITH CHECKPOINT
When DWPP exceeds a certain point, demotion needs to run concurrently with checkpoint operations all the time. When DWPP exceeds 1.35S in Figure 3, demotion has to begin at the very beginning of the checkpoint period. In this case, the demotion operations are executed at the maximum throughput, Brmax, while checkpoint operations are executed at the minimum throughput, Bwmin, when more than 60% of the total Burst Buffer space is used.
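The three patterns can be summarized as a simple classification on the ratio DWPP / S. The sketch below is illustrative only: the boundary of 1.35 is the value read from Figure 3 for the evaluated device and would differ on other hardware, and the function name and the handling of the exact boundary values are assumptions.

```python
def demotion_pattern(dwpp, capacity, always_ratio=1.35):
    """Classify the demotion pattern (1, 2, or 3) for a given DWPP.

    `capacity` is the Burst Buffer capacity S. `always_ratio` is the
    device-specific point beyond which demotion must run for the whole
    checkpoint period; 1.35 is the example value from Figure 3.
    """
    ratio = dwpp / capacity
    if ratio <= 1.0:
        return 1  # demote only while Burst Buffer is idle
    if ratio <= always_ratio:
        return 2  # demote concurrently for part of the period
    return 3      # demote concurrently for the whole period


# Hypothetical capacity of 100 TB.
patterns = [demotion_pattern(d, 100.0) for d in (80.0, 120.0, 150.0)]
```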
Using the following equation (3) and DWPP value, the threshold capacity of Burst Buffer to start demotion and corresponding demotion bandwidth are calculated.
Since all the data on Burst Buffer needs to be demoted in order to avoid interference with checkpoint operations on the next checkpoint period, the minimum required idle time between two checkpoint periods is calculated using the following equation (4).
In this work, we develop a novel data placement policy that takes into account the characteristics of checkpoint and restart operations. The new policy keeps the latest checkpoint file on Burst Buffer as long as possible so that there is no need to prefetch data from PFS to Burst Buffer. Specifically, the hotness of the data is determined by the version of the checkpoint file and the failure rate of the HPC application. Old-version checkpoint files have the highest priority to be considered cold data, since the latest-version checkpoint data is also located in Burst Buffer. If there are no old-version checkpoint files left in Burst Buffer, we identify coldness based on the failure rates of the applications that wrote the checkpoint files. As HPC users tend to reserve a large number of Burst Buffer nodes to avoid failure, we consider the failure rate of an application to be proportional to the number of Burst Buffer nodes it uses.
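The two-level coldness rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the BBOS implementation: the `CheckpointFile` record and `pick_victim` function are hypothetical, the node count stands in for the failure rate, the application with the lowest assumed failure rate is treated as coldest (following the argument in Section II-C), and ties among old-version files are broken by how far a file lags behind its application's latest version.

```python
from dataclasses import dataclass


@dataclass
class CheckpointFile:
    app_id: str
    version: int  # higher means newer
    nodes: int    # Burst Buffer nodes used by the app (failure-rate proxy)


def pick_victim(files):
    """Return the next file to demote, or None if `files` is empty."""
    if not files:
        return None
    # Latest version currently held in Burst Buffer for each application.
    latest = {}
    for f in files:
        latest[f.app_id] = max(latest.get(f.app_id, f.version), f.version)
    # Level 1: old-version files are always colder than any latest version.
    old = [f for f in files if f.version < latest[f.app_id]]
    if old:
        return min(old, key=lambda f: f.version - latest[f.app_id])
    # Level 2: only latest versions remain; the app with the lowest assumed
    # failure rate (fewest nodes) is least likely to restart soon.
    return min(files, key=lambda f: f.nodes)


a1 = CheckpointFile("A", 1, 8)
a2 = CheckpointFile("A", 2, 8)
b5 = CheckpointFile("B", 5, 2)
first = pick_victim([a1, a2, b5])
second = pick_victim([a2, b5])
```

In this example the old version of application A is chosen first; once only latest versions remain, the file of application B, which uses fewer nodes, is chosen.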

C. DIRECT CHECKPOINT ON PFS
The I/O capability of PFS can also be exploited to further improve Burst Buffer efficiency. Since cold data in Burst Buffer is destined for PFS, that data does not need to be written to Burst Buffer first. For this reason, we add a Burst Buffer bypass option to the data management policy, which applies once incoming data is considered cold compared to the data already stored on Burst Buffer. This is possible because the failure rates of all incoming checkpoint data can be known in advance from the log history. Incoming checkpoint data is always classified as hot or cold by comparing its failure rate with those of the other checkpoint files on Burst Buffer. If the incoming checkpoint data is determined to be cold, the checkpoint is written directly to PFS. This reduces the amount of data that is written to Burst Buffer only to be demoted later, which also reduces the concurrent execution of checkpoint and demotion.
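The bypass decision can be expressed as a single comparison. The sketch below assumes one possible reading of "cold compared to the other data": an incoming checkpoint bypasses Burst Buffer only if its failure rate is lower than that of every checkpoint file currently resident. The function name and this strict-minimum rule are assumptions; the exact criterion may differ.

```python
def should_bypass(incoming_failure_rate, resident_failure_rates):
    """Return True if the incoming checkpoint should go directly to PFS.

    Assumption: the checkpoint is cold when its failure rate is lower than
    that of every checkpoint file already on Burst Buffer. An empty Burst
    Buffer never triggers a bypass.
    """
    if not resident_failure_rates:
        return False
    return incoming_failure_rate < min(resident_failure_rates)


# Hypothetical failure rates (failures per hour) taken from the log history.
cold = should_bypass(0.01, [0.02, 0.05])
hot = should_bypass(0.03, [0.02, 0.05])
```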

IV. DESIGN AND IMPLEMENTATION
We propose the Burst Buffer Over-Subscription scheme (BBOS), a novel HPC data management approach that both improves Burst Buffer utilization and maintains high checkpoint and restart performance. Figure 4 shows the overall architecture of the BBOS framework. BBOS is composed of two engines, an I/O engine and a data management engine, and an in-memory key-value store that supports the engines.
A. I/O ENGINE
checkpoint files stored in Burst Buffer are the latest version data, Demoters start to demote the files starting from the one having the highest failure rate. DWPP and DWSF are also stored in the database to decide the demotion threshold and are constantly tracked to adjust the checkpoint and demotion bandwidth. The nine key-value pairs used in the BBOS in-memory key-value store are shown in Table 1. Each pair is explained in detail in Section IV-F.

D. OPTIMIZED DESIGN FOR STABLE CHECKPOINT AND DEMOTION PERFORMANCE
To provide stable checkpoint, restart, and demotion performance, the data management policy is optimized using several techniques. First, checkpoint and demotion bandwidth are adjusted dynamically by the BBOS engines. However, it is difficult to control I/O bandwidth accurately in real-world HPC systems. As the number of I/O operations per second requested by each application job varies, the checkpoint bandwidth may differ from what the data management policy expects. Also, the system may not be able to provide stable checkpoint performance if it cannot demote as much data as it should. For these reasons, we use the blkio [30] controller of cgroups provided by the Linux kernel to throttle the speed of checkpoint and restart operations precisely. Second, we utilize the sendfile() system call [31] to maintain stable demotion performance. In the demotion process, data must be read from Burst Buffer and written to PFS. This process incurs context switch and data copy overhead between user and kernel level, which leads to low and unstable demotion performance. Since the sendfile() system call supports zero-copy transfer, this overhead can be avoided. Lastly, checkpoint and restart performance may occasionally be degraded by garbage collection. To avoid the garbage collection overhead, we periodically issue the TRIM command after deleting files. The TRIM throughput is also controlled with the blkio controller in order to minimize the performance degradation.
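The zero-copy demotion step can be illustrated with Python's `os.sendfile`, which wraps the Linux sendfile system call. This is a minimal sketch rather than the GlusterFS code path: `demote_copy`, the chunk size, and the file paths are hypothetical, and the buffered fallback is included only so that the example also runs on platforms without file-to-file sendfile.

```python
import os
import shutil


def demote_copy(src_path, dst_path, chunk=1 << 20):
    """Copy a file with os.sendfile and return the number of bytes copied.

    The kernel moves the data between the two descriptors, so the payload
    never passes through a user-space buffer.
    """
    copied = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        size = os.fstat(src.fileno()).st_size
        while copied < size:
            try:
                sent = os.sendfile(dst.fileno(), src.fileno(), copied,
                                   min(chunk, size - copied))
            except (AttributeError, OSError):
                # No file-to-file sendfile here: finish with a buffered copy.
                src.seek(copied)
                dst.seek(copied)
                shutil.copyfileobj(src, dst)
                return size
            if sent == 0:  # unexpected end of input
                break
            copied += sent
    return copied
```

A call such as `demote_copy("/bb/ckpt_v1", "/pfs/ckpt_v1")` (hypothetical paths) would move one checkpoint file from a Burst Buffer mount to a PFS mount. Bandwidth throttling through the blkio controller is configured separately and is not shown here.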

E. OPTIMIZED DESIGN USING PERSISTENT MEMORY
We further improve BBOS using persistent memory. A non-volatile dual in-line memory module (NVDIMM) is a new type of memory module that combines DRAM and storage in a DIMM socket. HPC supercomputer systems can benefit from NVDIMM as it provides memory-speed I/O performance at a lower cost. BBOS uses the Redis in-memory key-value database to track I/O-related metadata. Whenever I/O accesses Burst Buffer, Redis updates multiple key-value pairs and uses that information to determine where to place the checkpoint file or how to adjust the read and write bandwidth. As a result, there are many memory accesses during I/O operations. In order to ease the memory overhead while improving data persistence, we take advantage of NVDIMM with Redis. We utilize pmem-redis [32], a Redis version that supports persistent memory, to provide both high performance and persistence. Among the features that pmem-redis provides, we apply two to the BBOS framework. The overall architecture of the BBOS in-memory key-value store using NVDIMM is shown in Figure 5.
First, considering NVDIMM as low-cost memory, we store key-value pairs on both DRAM and NVDIMM with a data placement strategy. Most HPC applications are memory-intensive workloads and require large amounts of memory capacity when accessing Burst Buffer resources. When BBOS manages the Redis database, memory usage increases because the data is served from memory. Consequently, Redis has to limit its memory consumption so as not to interfere with the I/O bandwidth of the HPC applications. To increase the memory capacity that Redis can utilize, pmem-redis provides a feature that stores large values in NVDIMM. This is because, relative to DRAM, NVDIMM performs better on large, sequential accesses than on small, random accesses. In this way, the DRAM usage of Redis can be reduced while still providing DRAM-like access to large data. Specifically, by default all values larger than 64B are stored in NVDIMM, and the rest, including keys and small values, are stored in DRAM. Although NVDIMM has higher latency than DRAM, our experiment shows negligible overhead with pmem-redis when applications issue I/O operations on Burst Buffer.
Second, NVDIMM resembles a disk in that data in persistent memory still exists after a power failure and restart. The information stored in Redis is essential for the Burst Buffer scheduler to work properly. Also, whenever applications request checkpoint or restart operations, the key-value pairs in Redis help locate the proper file while ensuring that other I/O requests get reasonable I/O bandwidth. Default Redis offers two types of data persistence in order to keep the data in the database safe. RDB persistence periodically writes all the data stored in memory to disk, while AOF persistence logs to disk every insert/modify/delete command issued to the servers. These persistence methods have a major drawback in that data has to be written to a slow hard disk. This is also a problem after a power outage, when the data has to be read from a slow disk during the restart process. To improve Redis persistence performance, pmem-redis writes the persistence file to a reserved space in NVDIMM. When the persistence data size exceeds the reserved space, the data is evicted using an LRU policy. As a result, periodic persistence gets improved write bandwidth, and persistence files can be read with DRAM-like bandwidth whenever a power failure occurs.
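The size-based placement described for the first feature can be modeled in a few lines. The sketch below is a toy model of the rule only and does not use or reproduce any pmem-redis interface: the function name and tier labels are hypothetical, and the 64-byte threshold is the default quoted above.

```python
NVDIMM_THRESHOLD_BYTES = 64  # default threshold quoted in the text


def place(key: bytes, value: bytes, threshold: int = NVDIMM_THRESHOLD_BYTES):
    """Return (key_tier, value_tier) for one key-value pair.

    Keys and small values stay in DRAM; values larger than `threshold`
    bytes go to NVDIMM.
    """
    value_tier = "nvdimm" if len(value) > threshold else "dram"
    return "dram", value_tier


small = place(b"DWSF", b"1024")
large = place(b"ckpt_v1", b"/bb/brick0/app1/" + b"x" * 80)
```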

F. IMPLEMENTATION
In this section, we describe in detail the process flow of each engine in the BBOS framework using the Redis in-memory key-value store. The BBOS framework is implemented by modifying the Gluster file system (GlusterFS), a highly scalable distributed file system. The I/O engine and the data management engine are added to GlusterFS so that the engines operate on the critical path of the I/O workflow. GlusterFS also interacts with the Redis server to collect the information used in the Burst Buffer scheduling policy.
There are a total of nine kinds of key-value pairs stored in the Redis in-memory key-value store, as shown in Table 1. Redis records all the metadata of the files written to and read from Burst Buffer. First, Redis stores the file path of every file written to Burst Buffer and PFS so that the data management engine has fast access to the files that need to be demoted or read. Every checkpoint file written by an HPC application has a specific application ID used by GlusterFS in the I/O flow. We refer to this application-specific metadata as App ID. In order to identify victim checkpoint files to be demoted to PFS when there is not enough space on Burst Buffer, we manage a Sorted Set with the key name "VICTIM". The Sorted Set records App IDs in MTBF order, which represents the failure rate of each application. The "CLEAN" key holds a list of applications that have demotion-finished checkpoint files. In this case, the files stored in Burst Buffer can be erased. The "APP" key manages a list of applications that have more than two different versions of checkpoint files stored in Burst Buffer. When there is not enough space in Burst Buffer, the old-version checkpoint files of those applications have to be erased. Also, the "DWSF" key records the amount of data written so far within the checkpoint I/O phase. The checkpoint and demotion bandwidth can be calculated using the current DWSF value. The "REPLICA" key manages the list of files that need to be replicated on the remote storage nodes. The AppID+"restart" key records the restart time and the new MTBF calculated accordingly whenever the checkpoint file is read.
AppID+deviceID key records the version and

Algorithm 3 (excerpt, lines 7-13):
    7: if (flush == TRUE)
    8:     demote file from Burst Buffer to PFS
    9:     put('CLEAN', AppID)
    10: else
    11:     demote old version file from Burst Buffer to PFS
    12:     delete file on Burst Buffer
    13: update('FileName', 'path')

In order to figure out the outdated files, the engine checks the Redis database if there is a key of the application in pair #9. If the key exists, the engine inserts the application to pair #4 so that Deleters can handle outdated files later on (line 3-4). The engine enlists checkpoint file names of each application and the device ID they are written to on pair #8 (line 5). Also, the engine saves the file path for each file name in pair #1 (line 6). After checkpoint operations are completed, the engine updates pair #5 with the current file size (line 7-8). The reason for continuously recording the DWSF value in the database is that the capacity of Burst Buffer will always remain full as our system keeps demotion-finished data in Burst Buffer until free space is actually necessary. BBOS would not know the actual amounts of data written if DWSF is not tracked during the checkpoint phases.
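The checkpoint bookkeeping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: plain Python containers stand in for the Redis keys, the pair numbers follow the way the text refers to them, and the names `new_store` and `record_checkpoint` are hypothetical. The sketch also assumes that the per-application key of pair #9 is created on the first checkpoint, which the text does not state.

```python
from collections import defaultdict


def new_store():
    """Plain-Python stand-in for the Redis keys used here (not the real schema)."""
    return {
        "path": {},                  # pair #1: file name -> file path
        "APP": [],                   # pair #4: apps that hold outdated versions
        "DWSF": 0,                   # pair #5: data written so far (GB)
        "files": defaultdict(list),  # pair #8: (app, device) -> file names
        "apps": set(),               # pair #9: apps that already have a key
    }


def record_checkpoint(store, app_id, device_id, file_name, path, size_gb):
    """Record one checkpoint write, following the steps described in the text."""
    # An existing key in pair #9 means an older version is already stored,
    # so the application is queued in pair #4 for later cleanup.
    if app_id in store["apps"] and app_id not in store["APP"]:
        store["APP"].append(app_id)
    # ASSUMPTION: the pair #9 key is created on the first checkpoint.
    store["apps"].add(app_id)
    # Pair #8: remember which files the application wrote to which device.
    store["files"][(app_id, device_id)].append(file_name)
    # Pair #1: remember where the file lives.
    store["path"][file_name] = path
    # Pair #5: keep DWSF current so the written volume is known even though
    # Burst Buffer itself always appears full.
    store["DWSF"] += size_gb
```

Writing two versions for the same application leaves that application queued in the stand-in for pair #4 and raises DWSF by the size of both files.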
The process flow of the restart operation is shown in Algorithm 2. When the system fails and a restart operation is requested, two values are updated to let Demoters choose the victims. First, the MTBF of the application that needs the restart operation and its latest restart time are read from pair #7 (line 1). Then the module calculates the new MTBF and updates pair #7 (line 2-3). At the same time, the engine uses pair #9 to check whether the checkpoint file of the application exists on Burst Buffer. If the checkpoint file to be read is stored on Burst Buffer, pair #2 is updated with the new MTBF (line 4-5). After all of these steps are done, the engine reads the checkpoint files of the application using pair #1 (line 6-7).
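A minimal sketch of this restart flow follows, again with Python dictionaries standing in for the Redis keys. The text does not say how the new MTBF is computed, so the running mean of failure intervals used here is an assumption made only for illustration, as is the failure counter stored next to pair #7. The function name `handle_restart` is hypothetical.

```python
def handle_restart(store, app_id, now_min):
    """Update MTBF bookkeeping for a restart and return the files to read."""
    # Pair #7: last restart time and current MTBF (plus a failure counter,
    # which is an ASSUMPTION needed by the running mean below).
    last_restart, mtbf, failures = store["restart"][app_id]
    interval = now_min - last_restart
    # ASSUMPTION: new MTBF is the running mean of observed failure intervals.
    new_mtbf = (mtbf * failures + interval) / (failures + 1)
    store["restart"][app_id] = (now_min, new_mtbf, failures + 1)
    # Pair #9: is a checkpoint of this application on Burst Buffer?
    on_burst_buffer = app_id in store["apps"]
    if on_burst_buffer:
        # Pair #2: reorder the victim set with the new MTBF.
        store["VICTIM"][app_id] = new_mtbf
    # Pair #1: resolve the paths of the checkpoint files to read.
    paths = [store["path"][name] for name in store["files"].get(app_id, [])]
    return new_mtbf, on_burst_buffer, paths
```

With a real Redis server, the update of pair #2 would correspond to re-adding the member to the sorted set with a new score.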
While the I/O engine schedules the I/O jobs accessing Burst Buffer, the data management engine manages an efficient demotion process between Burst Buffer and PFS using four modules. First, Throttler regulates the bandwidth of the checkpoint and restart operations by monitoring DWSF. Throttler obtains DWSF from pair #5 and decides whether to start the demotion. When DWSF exceeds the demotion threshold, Throttler regulates the checkpoint and restart bandwidth to the reconfigured bandwidth. Second, Demoters receive a signal from Throttler about which device in Burst Buffer needs the demotion. Then, Demoters collect information from the in-memory store to execute the demotion. The pseudo-code for the demotion process is described in Algorithm 3. Demoters first check for victim checkpoint files in pair #4, since the oldest version of a checkpoint file should be demoted first (line 1). If there is no such victim, the victim file is retrieved from pair #2, which is ordered by MTBF (line 2-3). In this case, the victim file has to be demoted even though it is the latest checkpoint file of a certain application. If the victim is found from pair #2, the victim file is not deleted from Burst Buffer right after the demotion is finished. The file has to be stored in both Burst Buffer and PFS to preserve restart performance (line 7-8). However, it is necessary to mark that victim file in pair #3 in order to erase the file when Burst Buffer needs available capacity (line 9). If the victim is retrieved from pair #4, it means that the application has an old version of the checkpoint file. Since the old-version file does not need to stay in Burst Buffer, the file can be deleted (line 10-11). Finally, Demoters update pair #1 (line 12) and put the name of the file in pair #6 for Replicators to handle the replications (line 13). Third, Deleters erase demotion-finished files after receiving a signal from I/O workers.
Specifically, Deleters first pop the information of the application inserted in pair #3 and delete the files from Burst Buffer using pairs #1 and #8. Lastly, Replicators replicate checkpoint files from the local storage device to the remote devices within the same replication group. Each storage node has a mount point of PFS, which consists of the storage nodes in the same replication group except itself. A PFS-only low-speed network is additionally installed between the storage nodes. Thus, Replicators transfer the demoted data to the mount point by using pair #6 without hindering Burst Buffer performance.
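The victim selection and demotion steps performed by Demoters can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: dictionaries and sets stand in for the Redis keys, `demote_one` and the `/pfs` prefix are hypothetical, and the sketch assumes that each application's file list is ordered from oldest to newest and that the victim with the largest MTBF is chosen first (the evaluation states that the checkpoint file with the highest MTBF is the one demoted). Removing a demoted application from the victim set is also an assumption.

```python
def demote_one(store, pfs_root="/pfs"):
    """Demote one checkpoint file; return (file_name, kept_on_burst_buffer)."""
    if store["APP"]:
        # Pair #4: an application with an outdated version goes first.
        app_id = store["APP"].pop(0)
        file_name = store["files"][app_id].pop(0)  # oldest version
        keep_on_bb = False
    elif store["VICTIM"]:
        # Pair #2: otherwise take the application with the largest MTBF.
        app_id = max(store["VICTIM"], key=store["VICTIM"].get)
        del store["VICTIM"][app_id]  # ASSUMPTION: demoted apps leave the set
        file_name = store["files"][app_id][-1]  # latest version
        keep_on_bb = True
    else:
        return None
    store["on_pfs"].add(file_name)  # the demotion itself
    if keep_on_bb:
        # Latest version stays on Burst Buffer for fast restart and is
        # marked in pair #3 so that Deleters can erase it when needed.
        store["CLEAN"].append(app_id)
    else:
        # Outdated version is removed from Burst Buffer right away.
        store["on_bb"].discard(file_name)
    # Pair #1: simplification, only the PFS path is recorded after demotion.
    store["path"][file_name] = f"{pfs_root}/{file_name}"
    # Pair #6: hand the file over to Replicators.
    store["REPLICA"].append(file_name)
    return file_name, keep_on_bb
```

Calling the function repeatedly drains the outdated versions first and only then touches the latest checkpoint files, which mirrors the ordering described in the text.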

V. EVALUATION

A. EXPERIMENTAL ENVIRONMENT
We evaluate the BBOS HPC storage management scheme on a small-scale testbed environment that consists of eight compute nodes and a single storage node. Burst Buffer and PFS are configured together in the storage node. Four of the compute nodes are equipped with an Intel Xeon Phi CPU 7290 processor with 72 physical cores, and the others with an Intel Xeon Phi CPU 7250 with 68 physical cores. The storage node has dual 12-core Intel Xeon Silver CPU 4115 processors and 32 GB of memory. Burst Buffer is configured using four 800 GB FADU NVMe SSDs provided by a semiconductor start-up company [33], with sequential write and read performance of up to 920 MB/s and 3,200 MB/s, respectively. Also, a 16 GB Dell NVDIMM-N is deployed in the storage node in order to increase memory capacity and improve data persistence for the Redis in-memory database. PFS, on the same storage node as Burst Buffer, is composed of four 4 TB Samsung 860 EVO SATA SSDs. The compute nodes and the storage node are connected with a 100 GbE Mellanox SN2100 switch.
We use GlusterFS [34] version 5.6, configured separately for Burst Buffer and PFS, and the file system configurations are tuned for high performance. GlusterFS is modified by adding multiple modules for the BBOS scheme. Each variable of the BBOS framework is configured as follows, considering the I/O bandwidth that the storage devices can provide: Bwmax as 3.56 GB/s, Bwmin as 3 GB/s, Brmax as 1.6 GB/s, Brmin as 0.08 GB/s, and period as 3800 seconds. For the experiments, we execute large sequential write I/O to simulate checkpoint operations by using the FIO microbenchmark [35]. Since failure rate and MTBF have an inverse relationship [36], MTBF is used to represent the failure rates of the applications in this evaluation.
We compare BBOS with DataWarp, one of the currently deployed HPC schedulers that use the dedicated Burst Buffer allocation, and with two scheduling policies presented in Harmonia [26], the state-of-the-art scheduler that uses the Burst Buffer over-subscription method. Since Harmonia is not open source, we build an emulated scheduler based on the paper. DataWarp does not perform I/O scheduling, while Harmonia schedules I/O jobs to prevent them from overlapping each other. MaxEff, one of Harmonia's policies, optimizes the Burst Buffer system efficiency by maximizing the Burst Buffer utilization. As the policy aims to maintain a large amount of free Burst Buffer space, it always demotes data at full speed (Brmax), even when a checkpoint is performed concurrently. On the other hand, MaxBW, another policy introduced in Harmonia, aims to provide maximum checkpoint bandwidth to applications. The checkpoint and the demotion cannot be performed at the same time with the MaxBW scheduling policy. The demotion threshold of MaxEff is 0S of DWSF, while the threshold of MaxBW is 1S of DWSF in Figure 3.
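The difference between the two Harmonia thresholds can be illustrated with a small sketch. It assumes that S denotes the Burst Buffer capacity and takes S = 3200 GB from the four 800 GB SSDs of the testbed, which is a simplification. The function name `demotion_active` is hypothetical, and the comparison uses "greater than or equal" so that a threshold of 0S means that demotion is always allowed.

```python
CAPACITY_GB = 4 * 800  # ASSUMPTION: S equals the raw capacity of the four SSDs


def demotion_active(dwsf_gb, threshold_s, capacity_gb=CAPACITY_GB):
    """Return True when DWSF has reached the demotion threshold."""
    # threshold_s is expressed in units of S, as in "0S" and "1S" in the text.
    return dwsf_gb >= threshold_s * capacity_gb


# MaxEff (threshold 0S) demotes as soon as any data has been written.
maxeff_demotes = demotion_active(dwsf_gb=100, threshold_s=0.0)
# MaxBW (threshold 1S) waits until a full Burst Buffer worth of data exists.
maxbw_demotes = demotion_active(dwsf_gb=100, threshold_s=1.0)
```

BBOS adjusts its threshold dynamically, so it would sit between these two fixed settings; that dynamic rule is not modeled here.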

B. BURST BUFFER UTILIZATION
In this section, we evaluate the Burst Buffer utilization under four scheduling policies. We assume that each application requests to write an 80 GB checkpoint file once per period. The Burst Buffer utilization is determined by the number of applications that finish writing their checkpoint file within the period, which also indicates the maximum DWPP each scheduler can provide. The Burst Buffer utilization of the four scheduling policies is shown in Figure 6. DataWarp shows 0∼100% Burst Buffer utilization since it uses a dedicated allocation method that allocates as much Burst Buffer capacity as the users demand. In the best case, the total Burst Buffer capacity is fully used within the checkpoint period when every user is allocated as much Burst Buffer as requested, which results in 100% Burst Buffer utilization. In most cases, however, Burst Buffer utilization remains low because users request more capacity than they need. On the other hand, Harmonia and BBOS allow Burst Buffer to accommodate more I/O requests within the period since they use an over-subscription Burst Buffer allocation method. MaxBW does not allow demotion to be performed together with the checkpoint in order to ensure maximum checkpoint throughput for the applications. As a result, 190% Burst Buffer utilization can be achieved using the MaxBW scheduling policy. MaxEff shows 210% Burst Buffer utilization because demotion is always performed at maximum demotion throughput, at the risk of low checkpoint performance. BBOS is similar to MaxEff in that demotion is performed whenever possible without being blocked by checkpoint operations. Hence, Burst Buffer can be utilized by up to 210% with BBOS.
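The utilization metric used here can be written as a one-line calculation. The sketch below assumes that utilization is the data written within one period divided by the Burst Buffer capacity, and that the capacity is the raw 3200 GB of the four 800 GB SSDs. The application counts in the example are back-calculated from the reported percentages under these assumptions; they are not numbers reported in the evaluation.

```python
def utilization_percent(finished_apps, checkpoint_gb, capacity_gb):
    """Data written within one period as a percentage of Burst Buffer capacity."""
    return 100.0 * finished_apps * checkpoint_gb / capacity_gb


# ASSUMPTION: 3200 GB capacity; each application writes one 80 GB checkpoint.
dedicated_best_case = utilization_percent(40, 80, 3200)  # capacity filled once
over_subscribed = utilization_percent(84, 80, 3200)      # 84 apps in one period
```

Under these assumptions, 40 finished applications fill the capacity exactly once, and any count above 40 is only possible when demotion frees space during the period.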

C. CHECKPOINT PERFORMANCE
To evaluate the checkpoint performance of the BBOS framework, we conduct experiments under various I/O scenarios with different I/O job congestion rates and DWPPs. Since the maximum DWPP of DataWarp is equal to the total capacity of Burst Buffer, we evaluate DataWarp with DWPP at 1S and the others with DWPP at 1.3S, 1.6S, and 1.9S. We make different I/O job congestion patterns on the following three scenarios:   demotion and checkpoint, high checkpoint throughput can be ensured all the time. However, some of the I/O jobs still have to stall in order to wait for available Burst Buffer capacity before the execution. Under the Low congestion rate scenario, none of the applications have to wait to avoid I/O interference or to make space in Burst Buffer, as there is sufficient idle time between the I/O jobs. As the I/O jobs arrive in crowds and DWPP increases, some of the applications begin to experience high latency. Specifically, the checkpoint latency starts to increase with Harmonia and BBOS under the Med I/O job congestion rate and DWPP over 1.6S. A large DWPP means that there is not much idle time between the jobs, and the checkpoint latency increases as DWPP increases. When the I/O job congestion rate is High, jobs have to wait for the longest time, which results in the highest checkpoint latency.
In addition, MaxBW shows extreme performance variance. Figure 8 shows the wait time of the first 45 I/O jobs under the Med I/O congestion rate at a DWPP of 1.9S. The I/O jobs that arrive later have to wait for a long time, resulting in severe performance fluctuation. On the other hand, BBOS makes sure that data is demoted in advance so that Burst Buffer always reserves free space for the incoming jobs. The wait time under the BBOS scheme increases gradually from the beginning, which prevents a sudden burst in the wait time in any case. In summary, MaxBW provides higher performance than BBOS when there is no wait time. When there is not enough Burst Buffer idle time per period or idle time between I/O jobs, MaxBW shows higher latency and higher performance variance than BBOS. This is because BBOS always prepares for the worst case and adjusts the checkpoint performance to reserve free space in Burst Buffer in advance.
Both MaxEff and BBOS perform demotion in advance so that Burst Buffer does not overflow. MaxEff shows the lowest checkpoint throughput because the data is always demoted at the maximum demotion speed. In this way, a relatively large amount of free Burst Buffer capacity can be maintained. Consequently, MaxEff provides lower checkpoint latency than MaxBW. BBOS adjusts checkpoint throughput within the range from Bwmax to Bwmin depending on DWPP. The smaller the DWPP, the higher the checkpoint throughput that can be achieved by avoiding unnecessary concurrent execution of checkpoint and demotion. When there is enough time to free Burst Buffer capacity, only the checkpoint throughput affects the latency. Hence, BBOS shows lower checkpoint latency than MaxEff when DWPP is small. MaxEff also shows higher checkpoint latency than BBOS when DWPP is large, even though MaxEff performs demotion more aggressively than BBOS does. This is because MaxEff always demotes data at full demotion speed. In order to demote data at the maximum speed, the checkpoint throughput has to be decreased. As a result, checkpoint I/O jobs need to wait longer to be scheduled. In our experiments, the difference in latency between MaxEff and BBOS is small (about tens of seconds) because the difference between Bwmax and Bwmin is not large. If the difference becomes greater, we expect significantly lower I/O latency with BBOS compared to MaxEff.
Overall, BBOS is a novel approach that combines the advantages of MaxBW and MaxEff while compensating for their shortcomings. By dynamically adjusting the checkpoint and demotion speed depending on DWPP and the I/O job congestion rate, BBOS always provides relatively high checkpoint throughput and low latency compared to the other approaches. Furthermore, our results show no I/O overhead from managing the BBOS framework on Burst Buffer. In other words, applications can get the expected I/O performance as BBOS throttles the speed of checkpoint and restart operations depending on the decisions made by the BBOS I/O scheduler. As a result, BBOS can adjust the checkpoint throughput of jobs in real-world HPC systems with thousands of storage nodes in the same way as on a single Burst Buffer node system. Since BBOS runs with pre-determined system configuration settings, it always provides stable checkpoint performance within a configured range under hardware limits.

D. DIRECT CHECKPOINT ON PFS
When the difference between the I/O bandwidth provided by PFS and by Burst Buffer is not large, bypassing Burst Buffer and directly accessing PFS can eliminate unnecessary demotion overhead. We conduct experiments with three different DWPPs: 1.3S, 1.6S, and 1.9S. Each application requests an 80 GB checkpoint during one hour, and the MTBF of each application is set randomly from 0 to 100 minutes. We optimize the BBOS framework by checking the MTBF of the applications that issue checkpoint requests. Before serving a request, the I/O engine first checks whether the Burst Buffer capacity is fully used. Only when demotion is needed in order to make free space in Burst Buffer does the engine next check whether the checkpoint file to be written is cold data, by comparing the failure rates of the applications and the version numbers among the checkpoint files of the same application. An application with a large MTBF is considered to write cold data since it is less likely to fail. Also, when there are multiple checkpoint files with different version numbers, only the latest file is considered to be hot. When cold data is to be written while there is not enough Burst Buffer space, the optimized BBOS bypasses Burst Buffer and writes directly to PFS. Figure 9 shows the normalized demotion data size with the optimized BBOS under different DWPP scenarios. In the case of DWPP being 1.9S, a large number of checkpoint files that are considered to be cold data are written directly to PFS, which decreases the amount of demoted data by up to 38%. Since less data is demoted concurrently with checkpoint operations, more applications can experience higher checkpoint throughput.
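The bypass decision described above can be sketched as a small function. The sketch is illustrative only: the name `choose_target` and the cold-data cutoff `cold_mtbf_min` are hypothetical, since the text does not give the value that separates hot from cold applications, and the version comparison among files of the same application is omitted for brevity.

```python
def choose_target(free_gb, size_gb, mtbf_min, cold_mtbf_min=50.0):
    """Pick the tier for an incoming checkpoint write."""
    # Enough free space: always use Burst Buffer, no demotion is needed.
    if free_gb >= size_gb:
        return "burst_buffer"
    # Burst Buffer is full. ASSUMPTION: an application counts as cold when
    # its MTBF reaches a fixed cutoff, because it is less likely to restart.
    if mtbf_min >= cold_mtbf_min:
        return "pfs"  # bypass Burst Buffer and avoid a later demotion
    # Hot data still goes to Burst Buffer, after demotion frees space.
    return "burst_buffer"
```

For example, with 10 GB free, an 80 GB checkpoint from an application with a 90-minute MTBF is sent to PFS, while the same request from an application with a 5-minute MTBF still targets Burst Buffer.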

E. RESTART PERFORMANCE
We evaluate restart performance on Burst Buffer by comparing the hit ratio under different scheduling policies: LRU, FIFO, and BBOS. DWPP is configured as 1.5S, 1.7S, 1.9S, and 2.1S, and we randomly set the MTBF of the applications within the following ranges: 0 to 20 minutes (Low), 0 to 50 minutes (Med), and 0 to 100 minutes (High). The applications that need a restart are selected based on the expected MTBF, as the failure rate is inversely related to MTBF. All the checkpoint periods are fixed to be equal, and the checkpoint size of each application is set to 80 GB. Figure 10 shows the hit ratio under different configurations. In every case, BBOS shows the highest hit ratio on Burst Buffer. Since the checkpoint files are more likely to be in Burst Buffer with low DWPP, the hit ratio increases with low DWPP under all three scheduling policies. In the case of the LRU and FIFO algorithms, however, cold data is chosen based on the order in which the data was written. As a result, the variance of the hit ratio across experiments is high and the result is unrelated to the variance of MTBF. In contrast, BBOS shows an increased hit ratio as the MTBF variance gets higher. With low MTBF variance, the advantage of our system over the other policies is relatively small since the failure rates of the applications are similar. With high MTBF variance, on the other hand, the checkpoint files are well distributed across Burst Buffer and PFS, sorted by failure rate. As a result, BBOS provides up to 3.4 times higher hit ratio of restart requests on Burst Buffer compared to the others.

F. VERSION-AWARE DATA PLACEMENT
In order to keep the hit ratio high on Burst Buffer, BBOS uses the version-aware data placement method, which identifies outdated checkpoint files as cold data. To demonstrate the effectiveness of the version-aware data placement method, we choose three checkpoint periods for applications as follows: 60 minutes, 30 minutes, and 20 minutes. Each user requests an 80 GB checkpoint, and the MTBFs of the applications are decided randomly from 0 to 100. We assume that the applications maintain three or more versions of checkpoint files. Thus, the applications with a 60-minute period have one checkpoint version within a one-hour period, two versions for a 30-minute period, and three versions for a 20-minute period. We also set the ratio of the number of applications with each of the three periods to 1:1:1 and 5:2:1, and the DWPP is fixed to 1.9S. Figure 11 shows the hit ratio of restart requests on Burst Buffer with and without the version-aware method. When the numbers of applications with different checkpoint periods are the same, all the latest checkpoint files can be stored in Burst Buffer with the version-aware method, which results in a 96.4% hit ratio in the ideal case. Our result shows a slight decrease in hit ratio because the checkpoint file with the highest MTBF has to be demoted, even though it is the latest one, whenever free Burst Buffer space is needed for incoming I/O requests. When the version-aware data placement is not used, cold data is decided only based on the MTBF of the applications, and the latest checkpoint files with high MTBF may be stored on PFS while old-version files with low MTBF stay on Burst Buffer. As a result, only an 80.1% hit ratio is shown in our evaluation. In the case of the 5:2:1 ratio, there is a large number of applications with a low checkpoint period and the checkpoint files cannot all be stored in Burst Buffer. Thus, 92.5% of the restart requests can be handled in Burst Buffer with the version-aware placement policy, and only 71.7% of the requests can be handled without the policy.

G. PERFORMANCE OF BBOS USING NVDIMM
The BBOS framework is further improved by using NVDIMM for the Redis in-memory database. We show two evaluation results in this section: the performance of the NVDIMM-aware Redis database and the I/O performance of Burst Buffer with NVDIMM-applied BBOS. First, to focus on the performance of Redis with persistent memory, we use the memtier benchmark [37] and compare the performance of default Redis and the pmem-redis version of Redis. We set the range of the requested data size to be between 64B and 256B, which is chosen empirically, as data of the same size is read from and written to Redis in the BBOS framework when Burst Buffer serves the checkpoint and restart operations. We set the ratio of GET operations to SET operations as 10:0, 0:10, and 5:5, while the number of requests is set to 5,000, 10,000, and 20,000. Figure 12 shows the transaction rates when using default Redis and Redis with NVDIMM, respectively. Every key is kept in DRAM, while values larger than 64B are written to NVDIMM. As every value size is larger than 64B in our configuration, every value is kept in NVDIMM. Although DRAM space is saved using this approach, an additional data copy is required in order to move value data from DRAM to NVDIMM. Also, data in NVDIMM takes longer to read or write than data in DRAM. As a result, the pmem-redis version of Redis provides 0.4% to 15.7% lower performance compared to default Redis.
Second, we evaluate the I/O performance of Burst Buffer with NVDIMM-applied BBOS. Figure 13 shows the aggregated disk I/O rate over time on a single Burst Buffer node when BBOS runs using Redis with NVDIMM. The red dashed line in the graph refers to the average I/O performance of Burst Buffer without NVDIMM. We do not throttle the I/O performance in this evaluation. The result shows that even when Redis stores most of the data in NVDIMM, the disk I/O rate does not decrease under heavy I/O load. In other words, the optimized version of Redis using persistent memory rarely harms the bandwidth of the I/O workloads. To sum up, we consider the lower performance of the pmem-redis version of Redis acceptable, as the experimental result shows that the I/O bandwidth of Burst Buffer still reaches the maximum performance limit.
When bursty I/O arrives in crowds in a Burst Buffer system with the BBOS framework, memory usage is increased not only by the data movement among multiple tiers but also by data replication and the file system processes that manage I/O operations. In a real-world HPC system environment that consists of thousands of compute nodes and storage nodes, the memory consumption would increase dramatically and eventually harm the overall system performance. As such, it is advantageous to maintain as much free memory as possible. Considering that exploiting NVDIMM can easily increase memory capacity at lower cost, our solution is capable of maintaining a sufficient amount of memory space with no I/O performance degradation.

VI. RELATED WORK
Burst Buffer has been widely deployed in HPC storage systems in the past few years to improve I/O performance. A large number of scientific HPC applications can benefit from the high I/O performance provided by Burst Buffer [10], [38]. Depending on the architecture design, Burst Buffer can be either located within the compute nodes or independently located as dedicated Burst Buffer nodes [11], [39], [40]. A common Burst Buffer design used in HPC systems is the shared Burst Buffer organization, which shows higher I/O performance compared to the local Burst Buffer design [41]. To further optimize the Burst Buffer system, numerous studies have been conducted in the HPC community with different approaches.
As Burst Buffer works as a cache layer in HPC storage systems, several studies have proposed novel optimization techniques for the Burst Buffer framework. Khetawat et al. [42] designed a simulation framework that can accurately find the best Burst Buffer configuration considering the I/O characteristics of real-world HPC workloads. Aupy et al. [43] minimized I/O contention by sizing and partitioning Burst Buffer using polynomial-time algorithms. As Burst Buffer has become widely used as a high-performance cache layer, researchers have also focused on improving checkpoint and restart performance using Burst Buffer. One of the approaches is to write checkpoint files on multiple layers including compute nodes, Burst Buffer, and PFS [44]. To reduce checkpoint overhead on PFS, Moody et al. [25] developed a multi-level checkpointing mechanism that considers the different degrees of reliability and the checkpoint cost of each tier in the HPC storage system. Data Elevator, implemented by Dong et al. [45], offloads I/O access from Burst Buffer to PFS to reduce the contention on Burst Buffer. Unlike Data Elevator, which needs users to specify the final destination of the data, BBOS dynamically manages direct checkpointing on PFS when there is not enough free space in Burst Buffer. Since multi-level checkpointing can lead to high failure rates in large-scale HPC environments, Sato et al. [46] combined multi-level checkpointing with a non-blocking mechanism so that data can be transferred asynchronously during checkpoint operations. Similar to the previous studies, our work also focuses on improving checkpoint and restart performance on Burst Buffer in multi-layered HPC systems.
Burst Buffer can be fully utilized with the help of proper I/O scheduling policies, similar to the policies that exist for PFS [27], [47], [48]. Han et al. [28] observed that the I/O capability of Burst Buffer cannot be fully used when multiple HPC users simultaneously use Burst Buffer. To address the problem, they proposed Burst Buffer with multi-stream SSDs, assigning each user a separate I/O stream to remove the I/O interference. Koo et al. [49] further improved the I/O separation scheme on Burst Buffer by proposing a stream-aware scheduling policy on Burst Buffer I/O pools. Thapaliya et al. [47] also reported the I/O interference problem in shared Burst Buffer systems, and Gainaru et al. [50] attempted to dynamically schedule the I/O jobs based on the past I/O patterns of the jobs. TRIO is a Burst Buffer I/O scheduling policy that efficiently transfers the I/O traffic from Burst Buffer to PFS [2]. Similar to the above works, BBOS introduces a novel I/O scheduling policy that efficiently handles data between Burst Buffer and PFS considering the checkpoint characteristics.
Several works claimed that HPC applications have frequently accessed data, including checkpoint files [3], [51]. By placing hot data on Burst Buffer, I/O-intensive applications can benefit from using Burst Buffer. Shin et al. [15] automatically placed data on an HPC multi-tiered storage system using a goal-driven data management scheme, while Shi et al. [52] regulated I/O traffic on Burst Buffer and PFS by using the write access patterns of the applications. Our work also considers the I/O patterns and characteristics of checkpoint operations when making data placement decisions.

BBOS, the new I/O scheduling framework for Burst Buffer-based HPC storage systems, uses the over-subscription scheduling method, allocating Burst Buffer only during I/O phases to improve Burst Buffer utilization. In order to mitigate performance degradation, the Burst Buffer-aware I/O scheduler and the data management module are implemented in BBOS. We analyzed and utilized the characteristics of checkpoint and restart operations to design the BBOS modules. Based on these characteristics, data is transferred from Burst Buffer to PFS transparently by dynamically adjusting the thresholds and the speed of the demotion. We also identified the cold data considering the different versions and failure rates of the checkpoint files. All the metadata related to the BBOS framework is handled in the Redis in-memory database, which is improved by using persistent memory. As a result, we improved Burst Buffer utilization by up to 120% compared to the default dedicated Burst Buffer allocation method and guaranteed higher checkpoint throughput without sudden performance reduction. Also, with the BBOS framework, 96.4% of restart requests can be handled in Burst Buffer, and restart performance is up to 3.1 times higher.