Abstract
Vector computing can be read as equipping a processing unit with replicated ALUs. To benefit from this hardware concurrency, we have to phrase our calculations as sequences of operations over small vectors. We discuss what loops have to look like to exploit vector units, the relation between vectorisation and partial loop unrolling, and the difference between horizontal and vertical vectorisation. These insights allow us to study how vectorisation is realised on the chip: Are wide vector registers used, or do we introduce hardware threads with lockstepping? How do masking and blending facilitate slight thread divergence? And why does non-continuous data access require us to gather and scatter register content?
Notes
- 1.
In return, vendors sell you a temporary increase of the clock speed as turbo boost, which sounds nice but actually means that a floating-point-heavy code will not benefit from it.
- 2.
Two different things are called CUDA: there is a programming language (a C dialect) that we call CUDA, and there is a hardware architecture paradigm called CUDA. We do not discuss CUDA programming here.
Copyright information
© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this chapter
Weinzierl, T. (2021). SIMD Vector Crunching. In: Principles of Parallel Scientific Computing. Undergraduate Topics in Computer Science. Springer, Cham. https://doi.org/10.1007/978-3-030-76194-3_7
DOI: https://doi.org/10.1007/978-3-030-76194-3_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-76193-6
Online ISBN: 978-3-030-76194-3
eBook Packages: Computer Science, Computer Science (R0)