
Locality-Aware Task Scheduling and Data Distribution on NUMA Systems

Conference paper · OpenMP in the Era of Low Power Devices and Accelerators (IWOMP 2013)

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 8122)

Abstract

Modern parallel computer systems exhibit Non-Uniform Memory Access (NUMA) behavior. For best performance, a parallel program therefore has to match its data allocation and its scheduling of computations to the memory architecture of the machine. Done manually, this is a tedious process, and because each individual system has its own peculiarities, it also yields programs that are not performance-portable.

We propose a data distribution scheme in which NUMA hardware peculiarities are abstracted away from the programmer and data distribution is delegated to a runtime system that is generated once for each machine. In addition, we propose using the task data-dependence information made available by the OpenMP 4.0 RC2 proposal to guide the scheduling of OpenMP tasks and thereby further reduce data stall times.

We demonstrate the viability and performance of our proposals on a four-socket AMD Opteron machine with eight NUMA nodes. Both data distribution and locality-aware task scheduling improve performance compared to the default policies, while still providing an architecture-oblivious approach for the programmer.





Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Muddukrishna, A., Jonsson, P.A., Vlassov, V., Brorsson, M. (2013). Locality-Aware Task Scheduling and Data Distribution on NUMA Systems. In: Rendell, A.P., Chapman, B.M., Müller, M.S. (eds) OpenMP in the Era of Low Power Devices and Accelerators. IWOMP 2013. Lecture Notes in Computer Science, vol 8122. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40698-0_12

Download citation

  • DOI: https://doi.org/10.1007/978-3-642-40698-0_12

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-40697-3

  • Online ISBN: 978-3-642-40698-0

  • eBook Packages: Computer Science, Computer Science (R0)
