ABSTRACT
A primary use of chip-multiprocessor (CMP) systems is to speed up a single application by exploiting thread-level parallelism. In such systems, threads may slow each other down by issuing memory requests that interfere in the shared memory subsystem. This inter-thread memory system interference can significantly degrade parallel application performance. Better memory request scheduling may mitigate such performance degradation. However, previously proposed memory scheduling algorithms for CMPs are designed for multi-programmed workloads where each core runs an independent application, and thus do not take into account the inter-dependent nature of threads in a parallel application.
In this paper, we propose a memory scheduling algorithm designed specifically for parallel applications. Our approach has two main components, targeting the two common synchronization primitives that cause inter-dependence of threads: locks and barriers. First, the runtime system estimates the set of limiter threads, the threads holding the locks that cause the most serialization, and the memory scheduler prioritizes their requests. Second, the memory scheduler shuffles thread priorities to reduce the time threads take to reach the barrier. We show that our memory scheduler speeds up a set of memory-intensive parallel applications by 12.6% compared to the best previous memory scheduling technique.
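The two mechanisms described above, limiter-thread estimation from lock serialization and barrier-oriented priority shuffling, can be sketched as follows. This is a minimal software illustration, not the paper's actual runtime/hardware interface: the class names, the wait-time ranking, the `top_k` cutoff, and the per-interval round-robin rotation are all illustrative assumptions.

```python
from collections import defaultdict

class LimiterEstimator:
    """Estimates the set of limiter threads: the holders of the locks that
    cause the most serialization, measured here as accumulated wait time.
    The counters and the top_k cutoff are illustrative assumptions."""

    def __init__(self, top_k=1):
        self.wait_time = defaultdict(float)  # lock id -> total cycles spent waiting
        self.holder = {}                     # lock id -> thread currently holding it
        self.top_k = top_k

    def record_wait(self, lock_id, cycles):
        self.wait_time[lock_id] += cycles

    def record_acquire(self, lock_id, thread_id):
        self.holder[lock_id] = thread_id

    def limiter_threads(self):
        # Rank locks by the serialization they cause; the holders of the
        # top_k most contended locks form the limiter set.
        ranked = sorted(self.wait_time, key=self.wait_time.get, reverse=True)
        return {self.holder[l] for l in ranked[:self.top_k] if l in self.holder}


class PrioritySchedule:
    """Memory-scheduler priority ordering: limiter threads come first, and
    the remaining threads follow in a rotated order that changes every
    scheduling interval, so no single non-limiter thread consistently lags
    behind on its way to the next barrier."""

    def __init__(self, threads, estimator):
        self.threads = list(threads)
        self.estimator = estimator
        self.shift = 0  # advances once per scheduling interval

    def priority_order(self):
        limiters = self.estimator.limiter_threads()
        others = [t for t in self.threads if t not in limiters]
        if others:
            k = self.shift % len(others)
            others = others[k:] + others[:k]  # round-robin shuffle
        self.shift += 1
        return [t for t in self.threads if t in limiters] + others
```

For example, if thread 2 holds the lock with the most accumulated wait time, successive scheduling intervals yield orderings like `[2, 0, 1, 3]` then `[2, 1, 3, 0]`: thread 2 stays prioritized while the others take turns at the head of the remaining order.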