Weight-Aware Cache for Application-Level Proportional I/O Sharing

Virtualization technology has enabled server consolidation, where multiple servers are co-located on a single physical machine to improve resource utilization. In such systems, proportional I/O sharing is critical to meet the SLOs (Service-Level Objectives) of the applications running in each virtual instance. However, previous studies focus on block-level I/O proportionality without considering the upper-layer I/O caches, which handle I/O requests on behalf of the underlying storage devices, and thereby fail to achieve application-level proportional I/O sharing. To overcome this limitation, we propose a new cache management scheme, Weight-aware Cache (WaC), which reflects I/O weights in cache allocation and reclamation. Specifically, WaC prioritizes higher-weighted applications in the lock acquisition process of cache allocation by re-ordering the lock waiting queue based on I/O weight. Additionally, WaC keeps the number of cache entries of each application proportional to its I/O weight through weight-aware cache reclamation. To verify the efficacy of our scheme, we implement and evaluate WaC on both the page cache and bcache. The experimental results demonstrate that our scheme improves I/O proportionality with negligible overhead in various cases.


INTRODUCTION
Virtualization has become an essential building block for modern server systems [1], [2], [3], [4], [5]. One of its fascinating benefits is the consolidation of multiple servers into a single physical machine, thereby maximizing resource utilization. However, server consolidation inherently forces multiple servers to share the underlying system resources, making it difficult to meet the SLO (Service-Level Objective) of each application [6], [7], [8], [9]. To achieve SLO guarantees, the majority of virtualization techniques, such as KVM and Docker, rely on cgroup for controlling the host kernel resources [10], [11], [12], [13].
Cgroup [11] is a Linux kernel feature that limits and controls various kinds of system resources, including CPU, memory, and block I/O. In particular, cgroup collaborates with I/O schedulers to proportionally distribute I/O resources using I/O weights. However, cgroup is designed solely to achieve block-level I/O proportionality, rather than application-level I/O proportionality. Accordingly, it regulates I/O activities only in the block layer without considering the upper layers of the system software stack, such as caching layers. As a result, when an application utilizes I/O caches, such as bcache and the page cache, application-level I/O proportionality can be distorted. In other words, even though the bandwidths of the applications are proportional to their I/O weights at the block layer, the applications actually experience disproportional I/O bandwidth from a holistic point of view.
The concept of caching is widely utilized to bridge the performance gap between fast and slow media. For example, the Linux kernel assigns a certain amount of main memory as the page cache to remedy relatively slow access to storage devices. The page cache handles I/O requests on behalf of the underlying storage devices [14], thereby achieving enormous performance improvements. Similarly, bcache [15] has been developed to hide the high latency of slow storage devices in the block layer by utilizing comparatively faster storage devices. For example, the poor random I/O performance of hard disk drives (HDDs) can be alleviated by using flash storage devices as bcache. However, the design principles of conventional I/O cache management schemes do not include application-level I/O proportionality, and their ability to cooperate with cgroup is insufficient to realize application-level proportional I/O sharing. Therefore, the adoption of such I/O caches can cause disproportionality of I/O performance.
Conventional I/O cache management consists of two phases: cache allocation and reclamation. Cache allocation handles incoming I/O operations by allocating a new cache entry and temporarily storing the data inside the cache. Since the cache is shared by multiple threads, cache allocation is often protected by synchronization techniques, such as spinlocks. For example, the page cache and bcache utilize qspinlock, which manages multiple lock acquisition requests in a lock waiting queue. However, qspinlock is implemented as a FIFO queue, where the oldest element of the queue is serviced first [16], [17]. Thus, regardless of I/O weight, the lock acquisition order follows the order of enqueueing. Accordingly, conventional cache management schemes cannot reflect the I/O weight in cache allocation.
Cache reclamation secures free cache entries by evicting existing ones. Since cache reclamation decides which data the cache will keep, it significantly affects the performance of read operations. Conventional cache management schemes often adopt an LRU policy, which considers the recency of page references without considering the I/O weight. Thus, conventional cache reclamation can reclaim pages used by higher-weighted applications ahead of those used by lower-weighted ones, even when the pages have similar reference characteristics. In this way, the use of I/O caches can distort application-level I/O proportionality.
To realize application-level proportional I/O sharing, we introduce a new cache management scheme, called WaC (Weight-aware Cache). WaC reflects the I/O weight in the process of cache allocation and reclamation. The cache allocation scheme of WaC prioritizes higher-weighted applications in the lock acquisition process of cache allocation. When the lock for cache allocation is available, WaC traverses the entire lock waiting queue and detects the element with the highest I/O weight. Afterward, WaC re-orders the lock waiting queue based on the I/O weights so that the element with the highest I/O weight can possess the lock in the next turn. The reordering job can degrade scalability in the case when the next lock holder is located in a different NUMA node. To prevent this, WaC also considers the NUMA topology when deciding the next lock holder. Through this process, WaC can achieve proportional I/O sharing according to the I/O weight of each application.
Re-ordering the lock waiting queue based on I/O weight can incur a problem called starvation, where an application fails to acquire the lock for a very long time. When there are many applications with high I/O weights, the lower-weighted applications must keep yielding their chances for lock acquisition, thereby experiencing starvation. To solve this problem, we adopt a well-known technique for mitigating starvation, called aging. WaC continuously increments the I/O weights of the applications that yield their turns of lock acquisition at each re-ordering phase. Therefore, lower-weighted applications can acquire the lock in a finite time, avoiding the starvation problem.
The cache reclamation scheme of WaC prioritizes cache entries that are used by higher-weighted applications while considering the recency of cache references. WaC keeps track of the owner application of each cache entry and the number of cache entries owned by each application. During cache reclamation, WaC selects cache entries for eviction by considering the I/O weights of the owner applications and their current number of cache entries. By doing so, WaC makes the number of cache entries of each application proportional to its I/O weight. Consequently, higher-weighted applications can possess more cache entries in the I/O cache, which in turn improves the read performance of such applications compared with lower-weighted ones.
To verify the effectiveness of our scheme, we implement WaC in two I/O cache systems: the kernel page cache and a generic block-layer caching kernel module (bcache). In our experiments with Docker virtualization, we measured the application-level I/O proportionality and the performance of WaC, comparing them with those of conventional cache management schemes. Evaluation results with real-world benchmarks indicate that WaC provides up to 36.9% better I/O proportionality than the conventional scheme with at most 3.9% overhead.
The rest of this paper is organized as follows. Section 2 elaborates on the background, and Section 3 demonstrates the motivation of our work. Section 4 explains the design of WaC in detail. The implementation of WaC on the page cache and bcache is presented in Section 5. Experimental results are provided in Section 6. Section 7 discusses the related work. We conclude this paper in Section 8.
Overall, this paper makes the following contributions:
- Experimental demonstration of the problem of conventional cache management with respect to I/O proportionality (Section 3)
- A weight-aware locking mechanism for reflecting I/O weights in cache allocation (Section 4.1)
- A weight-aware page reclamation scheme for keeping data generated by higher-weighted applications (Section 4.2)

BACKGROUND
In this section, we describe cgroup and application-level I/O proportionality. Afterward, we explain the mechanisms of conventional I/O cache management schemes along with their limitations in terms of I/O proportionality.

Cgroup and Proportional I/O Sharing
Cgroup is widely adopted to control and limit the allocation of system resources, such as CPU, memory, and block I/O. Cgroup manages those resources in the form of subsystems, each of which has various parameters for controlling resource consumption [11]. In particular, the blkio subsystem provides an I/O weight parameter, and I/O schedulers refer to the I/O weights [11], [20] when they decide how many I/Os will be dispatched from the request queues in the block layer [18]. Specifically, the CFQ scheduler gives higher-weighted applications more time for processing I/O so that block-level I/O proportionality can be achieved according to their I/O weights [21]. Note that, in this paper, I/O proportionality [22] refers to allocating the bandwidth of applications in proportion to their I/O weights. For example, as shown in Fig. 1, suppose that the system manager creates three resource groups with I/O weights 100, 400, and 500. Then, for I/O proportionality, they should have an I/O bandwidth ratio of 0.1:0.4:0.5, when the total amount of I/O resources available in the system is 1.0.
In practice, however, application-level I/O proportionality cannot be guaranteed with I/O caches because I/O caches [21], [23], [24] may exclude the I/O scheduler from the critical path except when uncached data are requested. For example, when an application tries to access data already stored in the page cache, the application retrieves the data directly from the page cache, skipping the layers beneath it. This characteristic of the page cache avoids relatively slow access to the underlying storage device and helps to achieve high I/O bandwidth and low latency. However, since such I/O requests do not traverse the block layer, the I/O weights of cgroup cannot be applied. Additionally, the blkio subsystem does not directly control I/O caches, and the I/O caches cannot achieve the required I/O proportionality by themselves.

I/O Cache
A cache is a component that temporarily stores data on a faster medium in lieu of a slower one. Modern computer systems adopt various kinds of caches to bridge the performance gap between two different hardware devices. In particular, I/O caches effectively improve I/O performance by directly servicing I/O requests on behalf of the slow storage device when the corresponding data reside in the cache [25]. For example, the Linux kernel utilizes unused space of the main memory as an I/O cache, called the page cache. Similarly, the Linux kernel provides another type of I/O cache in the block layer, called bcache, to remedy the low performance of a slow storage device by means of a faster one.

Cache Allocation
I/O cache management mainly performs two tasks: cache allocation and cache reclamation. Cache allocation refers to the task of allocating a new cache entry to store incoming data. For example, when an application tries to write uncached data, the cache management allocates a new cache entry to the application and stores the corresponding data inside the entry. Since the cache resource is shared by multiple CPUs, cache allocation should be mutually exclusive [24]. To achieve this, I/O cache management often adopts a locking mechanism to protect the critical section. For example, both the page cache and bcache protect cache allocation using qspinlock in Linux systems.
An overview of the qspinlock mechanism is shown in Fig. 2. The mechanism consists of one qspinlock structure and multiple per-CPU qnodes, each of which is an element of the lock waiting queue [16]. In this locking mechanism, for cache allocation, the conventional cache management selects the next lock holder from the lock waiting queue in a FIFO manner [16], [17]. Therefore, conventional cache allocation cannot prioritize applications with higher I/O weights. For example, in Fig. 2, when CPU3 tries to acquire the qspinlock, it is inserted at the tail of the lock waiting queue regardless of its I/O weight. Afterward, when CPU1 releases the qspinlock, CPU2 stops busy-waiting and acquires the lock in a FIFO manner without considering I/O weight. Therefore, although the applications on CPU3 and CPU4 have higher I/O weights than the one on CPU2, CPU2 acquires the lock ahead of CPU3 and CPU4. In this way, the conventional cache management handles cache allocation using a FIFO-based queue that does not consider I/O weights, thereby distorting the I/O proportionality.
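To make the FIFO behavior concrete, the following userspace C sketch models the enqueue and handoff steps. It is a simplified stand-in for the kernel's qspinlock and per-CPU qnodes, not the actual implementation.

    #include <stdbool.h>
    #include <stddef.h>

    /* Simplified model of a qspinlock-style lock waiting queue. */
    struct qnode {
        struct qnode *next;  /* successor in the lock waiting queue */
        bool locked;         /* the waiter spins while this stays true */
    };

    /* Enqueue at the tail: arrival order alone decides the position. */
    static void enqueue(struct qnode **tail, struct qnode *qn)
    {
        qn->next = NULL;
        qn->locked = true;
        if (*tail)
            (*tail)->next = qn;
        *tail = qn;
    }

    /* FIFO handoff: the oldest waiter is always released first,
     * regardless of the I/O weight of the application behind it. */
    static void fifo_unlock(struct qnode *head)
    {
        if (head->next)
            head->next->locked = false;
    }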
Cache allocation and its lock contention are critical to I/O performance [24], and their impact grows as the number of applications co-running in a system increases. For example, in a multi-container environment where multiple servers co-run on a single physical system, lock contention increases due to the large number of simultaneous cache allocation requests. Thus, I/O requests from applications have to wait a significant amount of time to acquire the lock. Accordingly, the order of lock acquisition becomes a critical factor in I/O performance in such systems.
We demonstrate this problem in Fig. 3 by measuring the lock contention count using lockstat [26] while running multiple random write workloads on tmpfs. Here, lock contention denotes the case in which a process attempts to acquire a lock that is already held by another process. As shown in Fig. 3, the lock contention count increases along with the number of running threads, mainly due to the high contention of cache allocation requests from multiple threads. When the number of threads reaches 128, the number of lock contentions reaches 92,687. In terms of performance, the average latency of the workloads significantly increases as the number of concurrent threads increases. Thus, the more threads run in a system, the more important the order of lock acquisition becomes. Moreover, this problem is exacerbated when the number of available cache entries is low and cache allocation induces cache reclamation [24]. In the case of the page cache, it is known that allocating free pages can take more than 200 ms in such cases because dirty pages should be evicted in advance to create free pages [21], [24].

Cache Reclamation
Since the cache size is usually smaller than the entire data size, the cache management should reclaim existing cache entries to secure free entries for new data. In this process, cache replacement algorithms such as LRU (Least Recently Used) and FIFO decide which cache entries to evict. Various cache replacement algorithms have been proposed to maximize the cache hit ratio. For example, the LRU policy evicts cache entries that have not been used for a long time, under the assumption that recently accessed data will be accessed again soon.
The Linux page cache utilizes a variant of LRU, called 2Q-LRU, which maintains two LRU queues: an active list to keep frequently accessed pages, and an inactive list for the other pages [27]. When a page is accessed for the first time, it is placed at the head of the inactive list. Afterward, the page can be promoted to the active list when it is accessed again. Pages in the active list are demoted to the inactive list when they are considered unlikely to be accessed again. Finally, pages at the tail of the inactive list are reclaimed when the amount of free memory falls below a predefined threshold.
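This two-list flow can be summarized with the following simplified C sketch; the names are ours, and the kernel's real lists track substantially more state.

    #include <stdbool.h>
    #include <stddef.h>

    struct cpage {
        struct cpage *next;
        bool active;          /* true once promoted to the active list */
    };

    struct lru_list { struct cpage *head; };

    static void lru_add_head(struct lru_list *l, struct cpage *p)
    {
        p->next = l->head;
        l->head = p;
    }

    /* First access: the page enters the head of the inactive list. */
    static void on_first_access(struct lru_list *inactive, struct cpage *p)
    {
        p->active = false;
        lru_add_head(inactive, p);
    }

    /* Repeated access: the page is promoted to the active list
     * (the caller unlinks it from the inactive list first). */
    static void promote(struct lru_list *active, struct cpage *p)
    {
        p->active = true;
        lru_add_head(active, p);
    }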
Bcache provides LRU, FIFO, and random replacement as configuration options. It assigns each cache entry a priority value, which decrements over time, and utilizes this value in cache reclamation. For example, in the case of LRU, bcache restores the priority value of a cache entry to a predefined value upon each access and evicts the cache entries that have the lowest priority values.
Cache reclamation is crucial to I/O performance, especially that of read I/Os, because it decides the contents of the cache. Unfortunately, neither the page cache nor bcache considers I/O weight in the process of cache reclamation. Therefore, these cache systems lack the ability to reflect I/O weights in cache reclamation, which in turn prevents application-level proportional I/O sharing.

MOTIVATION
To experimentally demonstrate the distortion of I/O proportionality caused by cache allocation, we conducted an experiment with the Fileserver workload of Filebench. We ran this workload in each of four containers with different weights (i.e., 100, 200, 400, 800). Each of the workloads runs for around 300 seconds and generates around 6 GB of data. The experiment is performed with two different types of I/O: (1) direct I/O, whereby I/O requests bypass the page cache layer, and (2) buffered I/O, whereby I/O requests use the page cache for buffering and caching. In this experiment, we do not adopt bcache, and thus all direct I/Os visit the I/O scheduler. The detailed experimental setup is described in Table 1 of Section 6.

To show the distortion of I/O proportionality caused by cache reclamation, we conducted an experiment with the Re-read workload of FIO. Like the previous motivational experiment of Fig. 5, we ran this workload in each of four containers with different weights (i.e., 100, 200, 400, 800). Each container creates one 3 GB test file to cache all the data of the file in the page cache. Then, the host writes a 4 GB dummy file to trigger excessive page reclamation. Lastly, we executed 1 GB re-read operations in each container and measured the I/O bandwidth. This experiment is also performed with the two types of I/O, direct I/O and buffered I/O. As shown in Fig. 6, buffered I/O shows poorer I/O proportionality than direct I/O. A large number of re-read requests are serviced from the page cache in the case of buffered I/O. Unfortunately, when the conventional page cache management evicts cache entries after the dummy writes from the host, it does not consider I/O weights. Therefore, even though some cache entries are used by higher-weighted containers, these entries are evicted before other entries used by lower-weighted ones, resulting in poor I/O proportionality.
The aforementioned analyses on the conventional I/O cache management lead us to conclude that strictly keeping the FIFO or LRU policy without considering the I/O weights is harmful to the I/O proportionality when running weighted applications. Based on the motivational analyses above, we suggest a new cache management scheme that resolves the aforementioned problems.

DESIGN
In this section, we present a new cache management scheme, called WaC (Weight-aware Cache), to achieve application-level proportional I/O sharing. The overview of WaC is presented in Fig. 7. WaC has the following goals:
- WaL (Weight-aware Lock): Prioritizing higher-weighted applications in the locking mechanism for cache allocation.
- WaR (Weight-aware Reclamation): Prioritizing the cache entries used by higher-weighted applications in cache reclamation.

Weight-Aware Lock (WaL)
WaL is a new locking mechanism that reflects I/O weight when deciding the next lock holder for cache allocation, in order to achieve proportional I/O sharing. The structure of WaL is similar to that of the conventional qspinlock in that WaL utilizes the lock waiting queue and qspinlock. However, WaL stores the I/O weight of each application inside its qnode and utilizes these weights when deciding the next lock holder. As a result, while the conventional qspinlock chooses the next lock holder in a FIFO manner, WaL traverses the lock waiting queue and chooses the qnode that has the highest I/O weight.
However, simply reordering the lock waiting queue based on I/O weight can incur two problems. First, the starvation problem can occur. Since lower-weighted applications must yield their turns to higher-weighted ones, such applications may take a long time to acquire the lock. In particular, when the majority of applications have high I/O weights, the lower-weighted ones might be constantly denied the lock. Preventing such starvation is necessary to build a robust system because starvation can produce long-term unfairness and even system failures.
To solve this problem, we adopt a well-known technique for mitigating starvation, called aging, which gradually increments the priority of a task over time. WaL increments the I/O weights of waiting qnodes by a certain value whenever reordering occurs. Consequently, WaL considers not only I/O weight but also waiting time in deciding the next lock holder.
The second problem is a scalability issue derived from NUMA-blindness. As previously reported [29], [30], [31], [32], [33], NUMA-blind lock management can result in performance degradation on NUMA systems. Specifically, frequent change in the NUMA nodes of the lock holders can induce noticeable overheads, because the memory bandwidth between NUMA sockets is finite and remote access is more expensive than local access [30]. To mitigate such overheads, WaL additionally stores the NUMA node ID of each qnode and endeavors to maintain the lock holder on the same NUMA node.
The pseudo-code of WaL is presented in Algorithm 1. When the current lock holder releases the qspinlock, WaL traverses the lock waiting queue to find the qnode that has the highest I/O weight (search phase). We call this qnode maxNode. WaL examines not only the I/O weight but also the NUMA node ID to minimize performance overheads on a NUMA system. Therefore, when multiple qnodes have the same highest I/O weight, WaL chooses the one on the same NUMA node as the head node. Afterward, WaL reorders the lock waiting queue so that the maxNode is located right after the head node and can acquire the qspinlock in the next turn. Finally, WaL increments the I/O weights of the other qnodes to prevent the starvation problem.
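The following userspace C sketch gives a rough rendering of the selection and aging steps of Algorithm 1, using the increment value of 100 discussed below; the real implementation operates on qspinlock internals.

    #include <stddef.h>

    #define AGING_STEP 100       /* increment value used in our setup */

    struct qnode {
        struct qnode *next;
        int weight;              /* I/O weight copied from the cgroup */
        int nid;                 /* NUMA node ID of the waiting CPU */
    };

    /* Pick the waiter with the highest weight; on a tie, prefer one on
     * the same NUMA node as the head. Shown as two passes for clarity,
     * as in Algorithm 1; the real search folds aging into one pass. */
    static struct qnode *wal_pick_next(struct qnode *head)
    {
        struct qnode *max = head->next, *n;

        if (!max)
            return NULL;
        for (n = max->next; n; n = n->next) {
            if (n->weight > max->weight ||
                (n->weight == max->weight &&
                 n->nid == head->nid && max->nid != head->nid))
                max = n;
        }
        for (n = head->next; n; n = n->next)
            if (n != max)
                n->weight += AGING_STEP;  /* aging prevents starvation */
        /* The caller splices max to the position right after the head
         * so that it acquires the lock on the next handoff. */
        return max;
    }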
In practice, the aging phase is embedded in the search phase to minimize traversal overheads; in other words, WaL needs to traverse the lock waiting queue only once. For ease of understanding, however, we decouple the search phase and the aging phase in Algorithm 1. The maximum I/O weight that can be manually set by cgroup is 1,000, while the adjusted I/O weights can exceed 1,000. Accordingly, regardless of its initial I/O weight, any application can eventually become the maxNode due to the aging technique and thus acquire the lock. In our experiments of Section 6, we set the increment value to 100 for the following reason. We set the weight values of cgroups in units of 100 (i.e., 100, 200, 400, and 800). I/O weight denotes a proportionality relationship in that a process with weight 800 is supposed to show 8 times higher I/O performance than one with weight 100. Therefore, setting the increment value to 100 follows the concept of I/O weight, since a process with weight 100 can acquire the lock after a process with weight 800 acquires the lock around 7-8 times if the lock is contended. In other words, the process with weight 100 reaches weight 800 after yielding 7 times when the increment value is 100. If a system sets the weight values in units of 50 (e.g., 50, 100, 150, and 200), the increment value should be 50.

An example of WaL is illustrated in Fig. 8. In the initial state, the qspinlock is held by CPU1, where APP1 is running, and two qnodes are in the lock waiting queue. CPU4 is busy-waiting for the qspinlock because it is the head node. The I/O weight of the CPU2 qnode has been increased from 200 to 400 after two reorderings. When APP3 tries to acquire the qspinlock, WaL creates a qnode structure for APP3 at the tail of the lock waiting queue and stores its I/O weight and NUMA node ID inside the qnode (A). When CPU1 releases the qspinlock, WaL traverses the lock waiting queue to find the maxNode. Here, since the head node (CPU4) has been busy-waiting, WaL does not reorder the head node and lets it acquire the lock in this turn. In the example, the qnodes of CPU2 and CPU3 have identical I/O weights (400). Therefore, WaL additionally inspects the NUMA node information and picks the CPU3 qnode as maxNode, because it is located on the same NUMA node as the head node (CPU4). Afterward, WaL reorders the lock waiting queue so that the maxNode (CPU3) is located next to the head node (CPU4) (B). Finally, as the head node (CPU4) takes the qspinlock, the maxNode (CPU3) becomes the new head of the queue and will acquire the qspinlock in the next turn. Meanwhile, the aging technique is applied, and thus the I/O weights of the qnodes remaining in the queue are incremented (C). Consequently, with WaL, higher-weighted applications can obtain the lock for cache allocation faster than applications with lower weights, thereby achieving application-level proportional I/O sharing.

Weight-Aware Reclamation (WaR)
The conventional cache reclamation usually utilizes FIFO and LRU variants. However, such methods cannot prioritize applications with high I/O weights. To overcome this limitation, we propose WaR, which considers both reference recency and I/O weight. The goal of WaR is to keep the number of allocated cache entries for each application proportional to its I/O weight. To achieve this, WaR calculates the I/O proportion of each application and derives the threshold number of cache entries that the application should have. During cache reclamation, WaR compares the number of cache entries of each application (p.g. # entries in Fig. 9) with its threshold value (TH) and evicts the cache entries whose owner applications have more cache entries than their thresholds. Note that when multiple applications share the same page, WaC designates the highest-weighted application as the owner of the page.
For instance, as shown in Fig. 9, the cache entries of APP1 and APP3 will be evicted because their p.g. # entries values are higher than their TH values, while WaR keeps the cache entries of APP4. Afterward, WaR decreases the total # entries from 10 to 8, and the TH values are adjusted accordingly. Finally, WaR decrements the p.g. # entries of APP1 and APP3 by 1 because of the cache eviction. By performing this process repeatedly, WaR keeps the number of pages of each application proportional to its I/O weight.
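A minimal C sketch of this accounting is shown below, with our own names; it assumes the threshold (TH) is each group's weight share of the total number of cache entries, matching the proportional goal above.

    struct wac_group {
        int  weight;        /* I/O weight of the cgroup */
        long nr_entries;    /* p.g. # entries owned by this group */
    };

    /* TH = total # entries * (weight / sum of all weights) */
    static long war_threshold(const struct wac_group *g,
                              long total_entries, long total_weight)
    {
        return total_entries * g->weight / total_weight;
    }

    /* Evict from groups holding more entries than their threshold. */
    static int war_should_evict(const struct wac_group *g,
                                long total_entries, long total_weight)
    {
        return g->nr_entries > war_threshold(g, total_entries, total_weight);
    }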
Cache reclamation is closely related to read performance in that it decides the contents of the cache. Read operations can be serviced by the cache, depending on the existence of the corresponding data in the cache. WaR tries to keep more data of higher-weighted applications in the cache, which in turn increases the probability that their read requests are processed by the cache without accessing the underlying storage device. As a consequence, WaR can prioritize higher-weighted applications, thereby enhancing I/O proportionality.

IMPLEMENTATION
We implemented WaC in two places in the Linux kernel: the page cache and bcache. The operating system utilizes unused space of the main memory as the page cache to accelerate I/O requests. Similarly, bcache employs a discrete, faster storage device to cache I/O requests at the block layer. Since they have different internal mechanisms, we explain the implementation details of each in this section.

Page Cache
The page cache manages cache entries in units of pages and maintains them in two LRU lists: the active list and the inactive list. The conventional page cache protects cache allocation using the FIFO-based qspinlock. To implement WaL, we add weight and nid (NUMA node ID) variables to the qnode structure and assign the corresponding values when an application creates a new qnode. We utilize the numa_node_id() function, provided by the Linux kernel, to obtain the NUMA node ID of the current application.
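A kernel-style sketch of the extended qnode follows; numa_node_id() is the existing kernel helper mentioned above, while the remaining names mirror our description rather than the exact kernel source.

    struct qnode {
        struct qnode *next;
        int locked;
        int weight;   /* added: I/O weight of the allocating application */
        int nid;      /* added: NUMA node ID of the waiting CPU */
    };

    static void wal_init_qnode(struct qnode *qn, int io_weight)
    {
        qn->next   = NULL;
        qn->locked = 1;
        qn->weight = io_weight;       /* taken from the application's cgroup */
        qn->nid    = numa_node_id();  /* kernel helper for the current node */
    }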
To implement WaR in the page cache, we performed the following tasks. First, we add two new variables, for the I/O proportion and the number of cache entries per group, to the cgroup structure. Whenever the cgroup hierarchy changes due to an event, such as the creation of a new cgroup, WaR re-calculates the I/O proportions. Upon cache allocation, WaR links the new cache entry to the corresponding cgroup node in order to clarify the ownership of the cache entry, and increments the number of cache entries of that group. In this way, WaR can refer to the I/O proportion and the number of cache entries per group during cache reclamation. Finally, to obtain the total number of cache entries, WaR utilizes the existing NR_FILE_PAGES variable, which is the number of file-backed page cache entries that the kernel keeps track of. When page reclamation is triggered, WaC can evict pages from the tail of the inactive list according to the policy of WaR.
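The per-cgroup bookkeeping can be sketched as follows, with illustrative field names; only the charging and uncharging paths are shown.

    struct wac_cgroup {
        unsigned long io_proportion;     /* weight / sum of active weights */
        unsigned long nr_cache_entries;  /* cache entries owned by this group */
    };

    /* On page cache allocation: record ownership and count the entry
     * (NR_FILE_PAGES already provides the global total). */
    static void wac_charge(struct wac_cgroup *cg)
    {
        cg->nr_cache_entries++;
    }

    /* When kswapd evicts one of the group's pages under WaR. */
    static void wac_uncharge(struct wac_cgroup *cg)
    {
        cg->nr_cache_entries--;
    }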
We apply WaR only to the inactive list of the page cache because the active list significantly contributes to the high cache hit ratio. Therefore, moving pages from the active list to the inactive list is performed with LRU, as in the conventional scheme, to preserve the high hit ratio of the page cache. Meanwhile, WaC performs the conventional cache reclamation in the case of direct reclamation (foreground reclaim), in which the system lacks free pages, in order to prevent the OOM (Out-Of-Memory) problem and minimize performance degradation. WaR requires additional CPU time for comparing the number of cache entries of each application with its threshold. Therefore, WaC performs WaR only during background reclaim (kswapd), thereby taking the computation off the critical path and minimizing performance degradation.

Bcache
Bcache manages cache entries in units of buckets and utilizes priority values to implement cache reclamation algorithms such as LRU and FIFO. Bcache also utilizes qspinlock to protect cache allocation, as in the case of the page cache. Therefore, the WaL implementation for the page cache applies directly to bcache. To implement WaR in bcache, in addition to the modification of the cgroup structure, we add a cgroup pointer to the bucket structure to access the I/O proportion and the number of cache entries per group during cache reclamation. Upon cache allocation, WaC links the corresponding cgroup pointer to the allocated cache entry and increments the total number of cache entries and the number of cache entries of that group. When cache reclamation is needed, WaR calculates the thresholds and decides which cache entries to evict. After cache reclamation, WaC adjusts the number of cache entries per group and the total number of cache entries. The implementation of WaR does not require any modification to the priority-control mechanism. Therefore, the conventional cache reclamation algorithm remains valid among the cache entries within the same cgroup.
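The bcache-side hook can be sketched in the same style, reusing the per-group counters from the page cache sketch above; the bucket fields shown are illustrative.

    struct bucket {
        unsigned short prio;        /* existing priority used by LRU/FIFO */
        struct wac_cgroup *owner;   /* added: owning group, set on allocation */
    };

    /* On bucket allocation: link the bucket to its group and charge it. */
    static void wac_bucket_alloc(struct bucket *b, struct wac_cgroup *cg)
    {
        b->owner = cg;
        wac_charge(cg);
    }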

Evaluation Setup and Test Settings
To verify the efficacy of WaC, we performed various experiments on the two machines described in Table 1. Machine A is utilized to evaluate the page cache version of WaC. To test the scalability of WaC and the performance of the bcache version, we utilize machine B, which is equipped with 64 physical cores (hyper-threading disabled) and a high-performance storage device for bcache. All the benchmarks in the experiments are containerized with Docker v18.09.4-CE and run five times, unless otherwise specified. To quantitatively measure the I/O proportionality, we adopt a metric called proportionality variation (PV), introduced in [34]. PV is calculated via the following equation:

PV = (1/N) * Σ_{APP_i ∈ APPs} |Ideal_i − Actual_i|

Here, APPs are the applications, N is the number of applications, Ideal_i is the ideal normalized performance of APP_i, and Actual_i is its actual normalized performance obtained from experiments. The lower the value, the closer the proportionality is to the ideal.
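For instance, with four containers whose ideal normalized bandwidths are 1:2:4:8, measured bandwidths of 1:1.09:1.35:1.62 (as reported for the conventional bcache in the TPC-C experiment of Section 6) yield PV = (0 + 0.91 + 2.65 + 6.38)/4 ≈ 2.48.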

Fileserver Workload
To examine WaL on the page cache in terms of application-level I/O proportionality, we ran eight Fileserver workloads in eight differently weighted containers, each of which runs for around 300 seconds and generates around 3 GB of data. The Fileserver workload performs a large amount of buffered writes and thus incurs frequent page cache allocation. Fig. 10a shows the normalized I/O bandwidth of the eight containers in the Fileserver experiment. As shown on the x-axis of Fig. 10a, we assign different weights to the containers, from 100 to 800. The bandwidths of the containers are normalized to the bandwidth of the container with weight 100. In the figure, Ideal represents a page cache management with the ideal I/O proportionality that the containers are expected to achieve. As shown in the figure, WaC achieves better application-level I/O proportionality than the conventional scheme because it prioritizes higher-weighted containers in the lock acquisition process during cache allocation. As a result, higher-weighted containers can quickly finish their write operations and resume the next write operations, thereby showing higher I/O bandwidth. In terms of PV, WaC outperforms the conventional scheme by around 36.9%.

We also measured the total I/O bandwidth while varying the number of containers from two to eight, in order to analyze the overheads of the WaL component of WaC. As in the previous experiment, we ran Fileserver workloads with differently weighted containers. As shown in Fig. 10b, the total I/O bandwidth decreases by only 3.9% at most when WaC is applied.

Re-Read Workload
To evaluate WaR of WaC, we performed the same Re-read experiments as the motivational experiments presented in Section 3. In the Re-read experiment, four containers with different I/O weights create their own files, and then the host creates a dummy file to contaminate the page cache. Afterward, the containers read their files again to examine how many pages of each container reside in the page cache. We also ran the workload with direct I/O as well as buffered I/O for better comparison. All the performance results are normalized to the case of weight 100. As shown in Fig. 11, the conventional scheme exhibits poor I/O proportionality because the conventional cache reclamation does not consider I/O weight at all. Therefore, pages of higher-weighted containers can be evicted before those of lower-weighted containers, even when their reference counts are the same. As a result, the conventional scheme shows a PV of 1.4, while WaC shows a PV of 0.33. This result stems from the fact that WaC balances the number of allocated cache entries of the containers according to their I/O weights by keeping cache entries of higher-weighted containers longer in the page cache. This result is even superior to that of direct I/O, which shows a PV of 0.61. This is because, in the case of WaC, block-level I/O proportionality is also guaranteed by the underlying I/O scheduler, in addition to the cache-level I/O proportionality.

Overhead of WaC
To investigate the overheads caused by the additional processing of WaC, we measured the hit ratio while executing four containers with different I/O weights, each of which runs the Webserver (read-intensive) workload. In addition, we measured the total bandwidth and the average latency while performing the Fileserver and Re-read workloads in four containers with the same I/O weight. As shown in Fig. 12a, the overall hit ratio drops by only about 0.8% on average, indicating that WaC keeps a satisfactory hit ratio even while deprioritizing the pages of lower-weighted containers. This result comes from the fact that WaC reflects I/O weight only in the inactive list, leaving the active list as it is. Fig. 12b shows the write bandwidth of the Fileserver workload and the read bandwidth of the Re-read workload. They are performed in the same configurations as in Figs. 10b and 11, except that all the containers have the same I/O weight in this experiment. Compared with the conventional scheme, WaC exhibits 2.7% and 3.7% drops in the total bandwidth on the Fileserver and Re-read workloads, respectively. In addition, the average latencies of the Fileserver and Re-read workloads increase by 1.5 ms and 0.6 ms on average with WaC, respectively. These overheads originate from searching for the container with the highest I/O weight during cache allocation and from keeping track of the number of allocated cache entries per application for cache reclamation. However, considering the satisfactory results of I/O proportionality, we believe WaC is a practical way to improve I/O proportionality with imperceptible overheads.

Comparison With Memory Cgroup
One might assume that a combination of the memory and blkio cgroups could be an alternative to WaC. The memory subsystem of cgroup provides the ability to control the maximum memory usage of each application. However, when the memory usage of a resource group exceeds its limit, the memory cgroup reclaims not only its file-backed pages but also its anonymous pages. Therefore, it cannot selectively limit the page cache usage of each application. As shown in Fig. 13b of the Re-read workload experiment, from weights 100 to 800, the I/O proportionality of WaC is 1:2.02:3.91:8.46, whereas that of memory cgroup is 1:1.15:1.60:4.53. The PVs of WaC and memory cgroup are 0.14 and 1.67, respectively. We believe that these results come from the fact that memory cgroup cannot control file-backed pages (page cache) separately from anonymous pages.

Aging Technique
To prevent the starvation problem, we added the aging technique to WaL. To verify that our scheme is robust in extreme cases, we performed the Fileserver experiment with eight containers. Here, the I/O weight of one container (C1) is 100, and those of the others (C2-C8) are 1000. As shown in Fig. 14, since WaC increments the I/O weight of C1 whenever it yields its turn for lock acquisition, it shows better I/O proportionality than the case of WaC without the aging technique. Therefore, even though there are multiple higher-weighted containers, our scheme is still able to guarantee the performance of the lower-weighted containers according to their I/O weights.

Scalability
To analyze the effect of the NUMA-awareness of WaL, we performed a scalability experiment using FIO on machine B shown in Table 1. We ran 4 KB buffered writes via four differently weighted containers, each of which executes multiple threads for the write operations. In this experiment, we spread the threads of the same container across different nodes as much as possible to simulate the worst case, where the NUMA node of the lock holder frequently changes. As shown in Fig. 15, in all cases, the total IOPS decreases after the total number of threads reaches 64, which is the number of physical cores. The peak IOPS of WaC without NUMA-awareness is around 5% lower than that of the conventional scheme. This is because WaC without NUMA-awareness chooses the next lock holder without considering the NUMA topology and thus frequently changes the NUMA node of the lock holder. On the other hand, WaC outperforms even the conventional scheme by around 7% due to its NUMA-awareness.

Bcache
In addition to the page cache, we also evaluate WaC on bcache. We utilized machine B to perform experiments with a high-performance storage device (Optane SSD) for bcache and ran benchmarks using four differently weighted containers, from 100 to 800.

Random Write Test
To test WaL, we ran FIO with 4 KB random write workloads using four containers. The random write workload performs direct I/O, which bypasses the page cache and utilizes bcache. As in the previous experiments, the containers are differently weighted from 100 to 800 so as to inspect proportional I/O sharing. As shown in Fig. 16, the conventional bcache cannot effectively prioritize higher-weighted applications, resulting in a PV of 2.59. On the other hand, WaC shows a PV of only 0.94 by gracefully overcoming the limitation of the conventional bcache through WaL. The computation overhead of WaL is imperceptible in the case of bcache, because bcache is a storage device that is much slower than the CPU and main memory. Therefore, in this experiment, we did not observe any noticeable performance overhead.

Random Read Test
To examine the I/O proportionality of read performance, we ran random read workloads in four differently weighted containers. Prior to measuring the performance, we ran the benchmark once to warm up the cache. As shown in Fig. 17, the conventional bcache shows a PV of 3.01, while WaC exhibits a PV of 1.45. Moreover, on the conventional bcache, the lower-weighted containers show higher read bandwidth because their data are cached more in the warm-up phase. The random read workload is not time-based, and thus the higher-weighted containers finished earlier than the others in the warm-up phase. Therefore, the data of lower-weighted containers evict the data of higher-weighted ones, according to the recency policy of LRU. On the other hand, WaR of WaC considers not only recency but also I/O weight. Therefore, bcache keeps more cache entries of higher-weighted containers, thereby showing better I/O proportionality. The total read bandwidths of the conventional scheme and our scheme are 2,047 MB/s and 2,188 MB/s, respectively. In detail, the raw bandwidths (MB/s) of the conventional bcache from weight 100 to weight 800 are 700, 643, 375, and 329, whereas those of WaC are 238, 371, 588, and 991.
WaR induces additional CPU computation compared with the conventional reclamation. Therefore, we additionally measured the CPU utilization of cache reclamation during the evaluation using the perf profiler. Here, a CPU utilization of 1 means that a thread occupies one CPU core during the given period. In the case of the conventional bcache, cache reclamation requires 0.009 CPU utilization during the evaluation, whereas WaR requires 0.014. Although WaR shows around 1.56 times higher CPU utilization, the CPU utilization is still extremely low. Therefore, we believe the additional computation of WaR is negligible.

TPC-C
To further verify the efficacy of our scheme, we ran the TPC-C workload with MySQL (InnoDB). The TPC-C workload generates 92% read/write transactions and 8% read-only transactions [35]. In terms of I/O requests, the workload issues small random read/write operations, and the read/write ratio is around 1.9:1 [35]. The TPC-C workload generates around 40 GB of data in total. The performance of TPC-C is measured in transactions per minute (tpmC). As shown in Fig. 18, the conventional bcache cannot effectively differentiate the performance of differently weighted containers. As a result, the PV of the conventional bcache is 2.48, and its detailed I/O proportionality is 1:1.09:1.35:1.62. In contrast, WaC prioritizes containers according to their I/O weights during both cache allocation and reclamation. Consequently, WaC shows a PV of 0.63 (1:2.63:4.78:6.85). Moreover, the total tpmC of WaC is 18% higher than that of the conventional scheme. With WaC, higher-weighted containers can utilize a sufficient amount of cache resources, thereby greatly improving their performance. As a result, the tpmC of the container with weight 800 is around 66% higher with WaC than with the conventional scheme.

Heterogeneous Workloads
Finally, we ran two heterogeneous workloads in a time-varying manner to further verify the effectiveness of our scheme. In this experiment, we utilize two different workloads: a random write workload of FIO as a malicious application and a Fileserver workload of Filebench as a high-priority application. We ran the FIO and Fileserver workloads for 240 and 300 seconds, respectively. To minimize the performance interference from the malicious application, we set the I/O weight of FIO to 100 and that of the Fileserver to 500. First, we run the FIO benchmark prior to the Fileserver in order to fill cache entries with data generated by FIO. After 30 seconds, we execute the Fileserver workload and measure the I/O throughput using iotop [36]. Fig. 19 shows the I/O throughput of the workloads with the conventional cache management and WaC. As shown in the figure, the Fileserver workload with the conventional scheme achieves around 83% lower I/O throughput than with WaC on average. This significant performance difference stems from the fact that the conventional cache management fails to prioritize the Fileserver workload, although the Fileserver workload has a 5 times higher I/O weight than the FIO workload. In particular, since we execute the FIO workload ahead of the Fileserver workload, the data from the FIO workload occupy most of the cache entries. Therefore, the Fileserver workload cannot effectively utilize the cache resources. Additionally, the conventional cache allocation is weight-oblivious, thereby failing to prioritize the Fileserver workload.
On the other hand, WaC preferentially allocates cache entries to the Fileserver workload over the FIO workload due to WaL. Additionally, WaR tries to keep more data of the Fileserver workload than of the FIO workload. As a result, the read and write throughputs of the Fileserver with WaC are 291 MB/s and 281 MB/s, respectively, while those with the conventional scheme are 66.5 MB/s and 65.9 MB/s. Note that the I/O throughput of FIO with the conventional scheme is 805 MB/s, whereas that with WaC is 663 MB/s.

DISCUSSION
Since proportional I/O sharing is an essential requirement for constructing cloud computing environments, there have been several studies to improve I/O proportionality. For example, J. Kim et al. [34], [37] proposed A+CFQ and H+BFQ, which extend Linux I/O schedulers by considering the internals of SSDs. Additionally, S. Ahn et al. [38] proposed a new scheme that predicts the future I/O demands of each container and collaboratively manages read/write operations. Similarly, Woo et al. [39] suggested a lightweight fair-queueing scheme to provide fairness while minimizing CPU consumption. However, such schemes still do not consider the cache layers in the I/O path, which are introduced to improve I/O performance in almost all types of systems. WaC, on the other hand, is a new cache management scheme that improves application-level I/O proportionality even in the presence of cache layers.
Cgroup v2 [40], the next version of cgroup, provides features to control writeback by setting the dirty page ratio. However, cgroup v2 still cannot control the page cache with the I/O weight, although the I/O weight is a straightforward and user-friendly method. Additionally, cgroup v2 cannot solve the problems incurred by the locking mechanism in page allocation. Finally, cgroup v2 has no ability to control bcache for the sake of application-level I/O proportionality. P. Sharma et al. [41] proposed a per-VM page cache partitioning scheme. The main contribution of that work is to increase the hit ratio with a small amount of memory by isolating VMs in the page cache layer. Most recently, S. Kashyap et al. [29] proposed a shuffling mechanism that reorders the lock waiting queue based on a certain policy. They mainly focused on solving conventional lock problems, such as memory footprint, with NUMA-awareness. In contrast, the contribution of this paper is a new cache management design that achieves application-level I/O proportionality.
In this paper, we address the I/O proportionality problem of virtualized environments, particularly with Docker virtualization, because proportional I/O sharing is a critical factor in such environments. However, since the page cache and bcache mechanisms are identical in systems without virtualization, WaC is also applicable to non-virtualized environments.
We extended the previous version of our research [12], [42] in two ways. First, WaC extends WaL to consider the NUMA topology when deciding the next lock holder. The NUMA architecture has been widely adopted in many servers due to its superior performance; therefore, a NUMA-aware lock design is necessary to build a high-performance system. Second, there are various I/O caches in current computer systems, but the previous version of our work focused only on the page cache. In this paper, we re-designed our scheme to be general so that it can be adapted to various I/O caches, and we implemented and evaluated our idea on both the page cache and bcache.

CONCLUSION
In this paper, we proposed a new I/O cache management scheme, called WaC (Weight-aware Cache), to achieve application-level proportional I/O sharing. WaC prioritizes higher-weighted applications in both cache allocation and reclamation. For cache allocation, WaC utilizes WaL, which considers I/O weight and the NUMA topology when deciding the next lock holder. For cache reclamation, WaR of WaC keeps the number of cache entries of each application proportional to its I/O weight. We implemented and evaluated our idea on both the page cache and bcache, and the experimental results demonstrate that WaC effectively improves application-level I/O proportionality compared with conventional cache management.