KART – A Runtime Compilation Library for Improving HPC Application Performance

  • Conference paper
  • Published in: High Performance Computing (ISC High Performance 2017)
  • Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10524)

Abstract

The effectiveness of ahead-of-time compiler optimization heavily depends on the amount of information available at compile time. Input-specific information that is only available at runtime cannot be used, although it often determines loop counts, branching predicates and paths, as well as memory-access patterns. It can also be crucial for generating efficient SIMD-vectorized code. This is especially relevant for the many-core architectures paving the way to exascale computing, which are more sensitive to code optimization. We explore the design space for using input-specific information at compile time and present KART, a library solution that allows developers to compile, link, and execute code (e.g., C, C++, Fortran) at application runtime. Besides mere runtime compilation of performance-critical code, KART can be used to instantiate the same code multiple times using different inputs, compilers, and options. Other techniques like auto-tuning and code generation can be integrated into a KART-enabled application instead of being scripted around it. We evaluate runtimes and compilation costs for different synthetic kernels, and show the effectiveness for two real-world applications, HEOM and a WSM6 proxy.
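To make the idea concrete, the following is a minimal sketch of the general technique that such runtime compilation automates, not KART's actual API: a kernel source with an input-specific loop count is generated at runtime, compiled into a shared object with the system compiler, and loaded via dlopen. The file names, the scale kernel, and the compiler invocation are illustrative assumptions.

// Illustrative sketch only (not KART's API): compile a kernel at runtime
// with an input-specific loop count baked in as a compile-time constant.
// Build with, e.g.: c++ -std=c++11 example.cpp -ldl
#include <cstdio>
#include <cstdlib>
#include <fstream>
#include <vector>
#include <dlfcn.h>

int main(int argc, char** argv) {
    // Input-specific value that an ahead-of-time compiler cannot know.
    const int n = (argc > 1) ? std::atoi(argv[1]) : 1024;

    // Generate the kernel source with the runtime value as a constant,
    // enabling full unrolling and SIMD vectorization of the loop.
    std::ofstream src("kernel.c");
    src << "void scale(double* x, double a) {\n"
        << "  for (int i = 0; i < " << n << "; ++i) x[i] *= a;\n"
        << "}\n";
    src.close();

    // Compile to a shared object at application runtime; the compiler and
    // its options could be varied per instantiation.
    if (std::system("cc -O3 -fPIC -shared kernel.c -o kernel.so") != 0)
        return 1;

    // Load the freshly built library and resolve the kernel symbol.
    void* lib = dlopen("./kernel.so", RTLD_NOW);
    if (!lib) return 1;
    auto scale = reinterpret_cast<void (*)(double*, double)>(dlsym(lib, "scale"));

    std::vector<double> data(n, 1.0);
    scale(data.data(), 2.0);
    std::printf("data[0] = %f\n", data[0]);

    dlclose(lib);
    return 0;
}

KART wraps this compile-load-call cycle in a library interface, so that the same kernel can be instantiated repeatedly with different inputs, compilers, and options without hand-written glue code like the above.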


Acknowledgments

This work is partially supported by Intel Corporation within the “Research Center for Many-core High-Performance Computing” (Intel PCC) at ZIB. We thank the North-German Supercomputing Alliance (HLRN) for providing us with access to the HLRN-III production system ‘Konrad’ and to the Cray TDS system with Intel KNL nodes.

Intel, Xeon, and Xeon Phi are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

* Other names and brands are the property of their respective owners. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance.

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Author information

Correspondence to Matthias Noack.



Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Noack, M., Wende, F., Zitzlsberger, G., Klemm, M., Steinke, T. (2017). KART – A Runtime Compilation Library for Improving HPC Application Performance. In: Kunkel, J., Yokota, R., Taufer, M., Shalf, J. (eds.) High Performance Computing. ISC High Performance 2017. Lecture Notes in Computer Science, vol 10524. Springer, Cham. https://doi.org/10.1007/978-3-319-67630-2_29

  • DOI: https://doi.org/10.1007/978-3-319-67630-2_29

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-67629-6

  • Online ISBN: 978-3-319-67630-2
