DOI: 10.1145/2155620.2155663
Research article

Parallel application memory scheduling

Published: 03 December 2011

ABSTRACT

A primary use of chip-multiprocessor (CMP) systems is to speed up a single application by exploiting thread-level parallelism. In such systems, threads may slow each other down by issuing memory requests that interfere in the shared memory subsystem. This inter-thread memory system interference can significantly degrade parallel application performance. Better memory request scheduling may mitigate such performance degradation. However, previously proposed memory scheduling algorithms for CMPs are designed for multi-programmed workloads where each core runs an independent application, and thus do not take into account the inter-dependent nature of threads in a parallel application.

In this paper, we propose a memory scheduling algorithm designed specifically for parallel applications. Our approach has two main components, targeting two common synchronization primitives that cause inter-dependence of threads: locks and barriers. First, the runtime system estimates threads holding the locks that cause the most serialization as the set of limiter threads, which are prioritized by the memory scheduler. Second, the memory scheduler shuffles thread priorities to reduce the time threads take to reach the barrier. We show that our memory scheduler speeds up a set of memory-intensive parallel applications by 12.6% compared to the best previous memory scheduling technique.
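The following is a minimal sketch, in Python, of the two mechanisms the abstract describes: (1) prioritize limiter threads, i.e. threads holding the locks that cause the most serialization, and (2) periodically shuffle the priorities of the remaining threads so they reach the barrier sooner. Names such as Request, note_lock_wait, SHUFFLE_INTERVAL, and the top_k cutoff are illustrative assumptions, not interfaces from the paper.

    # Sketch only: not the authors' implementation; all interfaces are assumed.
    import random
    from collections import defaultdict
    from dataclasses import dataclass

    SHUFFLE_INTERVAL = 800  # cycles between priority shuffles (assumed value)

    @dataclass
    class Request:
        thread_id: int
        row_hit: bool  # would this request hit the currently open DRAM row?
        age: int       # cycles the request has waited in the controller

    class ParallelAppMemoryScheduler:
        def __init__(self, num_threads):
            self.num_threads = num_threads
            self.lock_wait_cycles = defaultdict(int)  # lock id -> cycles threads spent waiting on it
            self.lock_holder = {}                     # lock id -> thread currently holding it
            self.limiters = set()
            self.shuffle_order = list(range(num_threads))
            self.cycles_since_shuffle = 0

        # Runtime-system side: track lock contention to estimate limiter threads.
        def note_lock_wait(self, lock, cycles):
            self.lock_wait_cycles[lock] += cycles

        def note_lock_acquire(self, lock, thread_id):
            self.lock_holder[lock] = thread_id

        def estimate_limiter_threads(self, top_k=2):
            """Mark holders of the most serializing locks as limiter threads."""
            hottest = sorted(self.lock_wait_cycles,
                             key=self.lock_wait_cycles.get, reverse=True)
            self.limiters = {self.lock_holder[l]
                             for l in hottest[:top_k] if l in self.lock_holder}

        # Memory-controller side: advance time and pick the next request.
        def tick(self):
            self.cycles_since_shuffle += 1
            if self.cycles_since_shuffle >= SHUFFLE_INTERVAL:
                random.shuffle(self.shuffle_order)  # re-rank non-limiter threads
                self.cycles_since_shuffle = 0

        def pick_next(self, pending):
            """Prefer limiter threads, then shuffled rank, row-buffer hits, and age."""
            def rank(req):
                is_limiter = req.thread_id in self.limiters
                shuffled_rank = self.num_threads - self.shuffle_order.index(req.thread_id)
                return (is_limiter, shuffled_rank, req.row_hit, req.age)
            return max(pending, key=rank) if pending else None

The proposed mechanism operates in the memory controller hardware with support from the runtime system and involves more detail than shown; the sketch only mirrors the control flow implied by the abstract.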


Published in

MICRO-44: Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
December 2011, 519 pages
ISBN: 9781450310536
DOI: 10.1145/2155620
Conference Chair: Carlo Galuzzi
General Chair: Luigi Carro
Program Chairs: Andreas Moshovos, Milos Prvulovic
      Copyright © 2011 ACM


Publisher: Association for Computing Machinery, New York, NY, United States

