VkFFT-A Performant, Cross-Platform and Open-Source GPU FFT Library

The Fast Fourier Transform is an essential algorithm of modern computational science. The highly parallel structure of the FFT allows for its efficient implementation on graphics processing units (GPUs), which are now widely used for general-purpose computing. This paper presents VkFFT, an efficient GPU-accelerated multidimensional Fast Fourier Transform library for Vulkan, CUDA, HIP, OpenCL, Level Zero and Metal projects. VkFFT aims to provide the community with a cross-platform open-source alternative to vendor-specific solutions while achieving comparable or better performance. This paper also discusses the optimizations implemented in VkFFT and verifies its performance and precision on modern high-performance computing GPUs. VkFFT is released under an MIT license.


I. INTRODUCTION
The recent advances in graphics processing units (GPUs) in terms of computational power, data transfer speed and available memory size make them extremely appealing for general-purpose use, expanding beyond their original rendering and visualization target tasks. The single instruction, multiple threads (SIMT) design approach used in GPUs has proven extremely effective for data-parallel tasks. In this approach, threads are grouped in warps, which are then operated by a scheduler unit so that all threads from a warp perform the same operation on thread-local registers. A common number of threads in a warp is 32 or 64 for modern GPU architectures [1]. One such highly parallel task is the multidimensional Fast Fourier Transform (FFT) [2]. The FFT algorithm is widely used in signal-processing applications to extract the frequency properties of a given input signal. Another major use case of the FFT is computing large convolutions in convolutional neural networks (CNNs) and image analysis workloads using the convolution theorem. The FFT is also used as a base algorithm for all other polynomial transforms, like the Discrete Cosine Transform (DCT) or the Spherical Harmonics Transform (SHT). The major speedup is achieved through the reduction of the complexity from O(N^2) to O(N log N), which results in a substantial decrease in computational effort [3].
The associate editor coordinating the review of this manuscript and approving it for publication was Tomas F. Pena.
There are multiple application programming interfaces (APIs) available for GPU computing and corresponding FFT implementations. The most known and used for GPU compute tasks, Nvidia Compute Unified Device Architecture (CUDA) platform, has been in development for over a decade [4]. It has a sophisticated ecosystem and is refined to achieve the best results on Nvidia GPUs. Nvidia's Math library collection has an FFT implementation -cuFFT, which has high-performance and is widely used for building commercial and research applications [5]. However, the Nvidia CUDA platform can be only used on Nvidia GPUs, which limits the portability of the code. Another drawback of using Nvidia's Math library collection is that it is closed-source and cannot be optimized at the code level for specific tasks.
AMD has its own API, the Heterogeneous Interface for Portability (HIP), and a related software stack called Radeon Open Compute (ROCm), which is modeled to be as close to CUDA as possible so that developers can port their code more easily [6]. However, it is less mature than CUDA, resulting in worse functionality and performance. For example, AMD's GPU FFT library, rocFFT, is significantly slower than cuFFT on a similar level of HPC GPUs. For open-source and cross-platform computations, the Khronos Group maintains the OpenCL framework (originally developed by Apple), which aims to unify programming on heterogeneous platforms, including GPUs, CPUs and FPGAs [7]. It also has its own FFT implementation, clFFT, whose development stopped in 2016. OpenCL is undermined by inferior driver support, as all the vendors are mainly interested in the development of their own APIs.
VOLUME 11, 2023. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Intel has also released the Level Zero API for their upcoming GPUs and is also developing a GPU version of their CPU Math Kernel Library -oneMKL, aimed at their GPUs [8].
Apple is developing its own proprietary API called Metal, which targets GPUs on its operating systems, like Apple's M-series systems on chip (SoC) [9]. At the moment of writing this paper, Apple's Accelerate Framework only had a CPU implementation of the FFT algorithm, with no GPU support plans announced.
The Vulkan API is a low-overhead, cross-platform 3D graphics and computing API, initially released in 2016 with major updates in 2018 and 2020 [10], [11]. Vulkan allows for better and lower-level control of the GPU than its predecessor, OpenGL. Unlike OpenCL, Vulkan is backed by the large gaming industry, which results in good driver support from all vendors: Vulkan code can be launched on Nvidia, AMD, Intel and mobile GPUs (and, through a compatibility layer, on other platforms). Vulkan is released as open-source, so everybody can contribute to its development. However, a good math library collection for Vulkan has not yet been created. This paper proposes an implementation of the FFT algorithms, VkFFT, that can work with all the aforementioned programming interfaces. The VkFFT source code can be found on GitHub: https://github.com/DTolm/VkFFT. The paper structure is as follows. Section II introduces background knowledge on the Discrete Fourier Transform and the algorithms implemented in VkFFT. Section III describes the GPU architecture and what needs to be considered when designing performant GPU libraries. Section IV contains the VkFFT structure details and describes a novel approach to GPU runtime kernel optimization. Section V is dedicated to algorithms and optimizations implemented in VkFFT. Section VI gives the benchmark results and performance analysis of VkFFT compared to cuFFT and rocFFT. Section VII contains the precision verification details of VkFFT, also comparing it to cuFFT and rocFFT. Section VIII concludes the paper.

II. FAST FOURIER TRANSFORM ALGORITHMS OVERVIEW
A. DISCRETE FOURIER TRANSFORMS AND COOLEY-TUKEY ALGORITHM
The Fourier transform of a discrete sequence is called the Discrete Fourier Transform (DFT). It is defined by the following formula:

X_k = \sum_{n=0}^{N-1} x_n e^{-2\pi i nk/N},

where x_n is the input sequence, N is the length of the input sequence and k ∈ [0, N − 1], k ∈ Z is the output index, corresponding to the frequency in Fourier space. Correspondingly, the inverse DFT is defined as:

x_n = \frac{1}{N} \sum_{k=0}^{N-1} X_k e^{2\pi i nk/N}.

The fastest known method of obtaining the DFT of a sequence of arbitrary dimensionality is the Fast Fourier Transform algorithm. The most commonly used FFT strategy is called the Cooley-Tukey algorithm [2]. It is based on a divide-and-conquer strategy that breaks the DFT into a sequence of smaller-size DFTs. The idea behind it is to reformulate the DFT of a sequence with composite size N = N_1 · N_2 as a combination of smaller-sized DFTs:
1) Perform a digit-reversal permutation to reorder data. If performed before stage one, this is Decimation in Time (DIT); if performed after all stages, Decimation in Frequency (DIF).
2) Perform N_1 DFTs of size N_2. Here we assume N_1 to be smaller than N_2.
3) Perform O(N) multiplications by twiddle factors, the complex roots of unity defined by the radix. This stage is needed to combine the results of the smaller DFTs into the DFT of the input sequence. It can also be decomposed and merged with stage four.
4) Perform N_2 DFTs of size N_1. Here N_1 is a small factor and is designated as the radix of the current transformation.
The N_1-FFT stages and a twiddle multiplication require a small number of arithmetic operations (if N_1 is sufficiently small, we can write the DFT algorithm explicitly) and can be performed a large number of times (if the size of the input sequence N is large). These two stages will be called a step of the FFT in this paper. Applying these steps recursively to the N_2-DFT and further will decompose the size-N DFT into a combination of smaller DFTs. For the simplest scenario, where N is a power of two and N_1 = 2, this decomposition results in an algorithm containing only N/2 length-2 DFTs (called radix-2 butterflies) and twiddle multiplications performed log_2 N times, plus a single digit-reversal permutation to reorder data. This results in a total complexity of O(N log N). A similar approach can be applied to other radix sizes and their combinations. To compute an inverse FFT, we have to calculate twiddle factors with negative complex roots of unity and normalize the result; otherwise the algorithm remains unchanged. Multidimensional FFTs can be performed for each axis separately as a set of 1D transforms.
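The power-of-two case above can be sketched in a few lines of illustrative Python (this is not VkFFT's generated GPU code; the function name is ours): a digit-reversal (bit-reversal for radix 2) permutation followed by log_2 N stages of radix-2 butterflies merged with twiddle multiplications.

```python
import cmath

def fft_radix2(x):
    """Iterative radix-2 DIT FFT; len(x) must be a power of two."""
    N = len(x)
    x = list(x)
    # 1) digit-reversal (bit-reversal for radix 2) permutation
    j = 0
    for i in range(1, N):
        bit = N >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            x[i], x[j] = x[j], x[i]
    # 2) log2(N) butterfly stages with twiddle multiplications
    m = 2
    while m <= N:
        wm = cmath.exp(-2j * cmath.pi / m)   # principal root of unity for this stage
        for k in range(0, N, m):
            w = 1.0 + 0j
            for t in range(m // 2):
                u = x[k + t]
                v = x[k + t + m // 2] * w
                x[k + t] = u + v             # radix-2 butterfly
                x[k + t + m // 2] = u - v
                w *= wm
        m *= 2
    return x
```

As described above, reversing the sign of the twiddle exponents and normalizing by N yields the inverse transform.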

B. STOCKHAM ALGORITHM
In the Cooley-Tukey algorithm, digit-reversal is performed once -before or after performing all steps of the recursive algorithm. Another option is to have intermediate digit-reversals after each small radix FFT. This approach is called the Stockham version of the Cooley-Tukey algorithm and is commonly used in GPU FFT libraries due to its better cache locality at each step [12].
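A minimal radix-2 Stockham sketch (illustrative Python following the well-known out-of-place formulation, not VkFFT's generated code): each stage performs its butterflies and writes the results already permuted, ping-ponging between the data array and a work array, so no separate digit-reversal pass is needed.

```python
import cmath

def stockham_step(n, s, eo, x, y):
    """One radix-2 Stockham stage: reads x, writes y; arrays swap each level."""
    if n == 1:
        if eo:                      # result ended up in the work array: copy back
            for q in range(s):
                y[q] = x[q]
        return
    m = n // 2
    for p in range(m):
        wp = cmath.exp(-2j * cmath.pi * p / n)       # twiddle factor
        for q in range(s):
            a = x[q + s * p]
            b = x[q + s * (p + m)]
            y[q + s * (2 * p)]     = a + b           # butterfly, written out in
            y[q + s * (2 * p + 1)] = (a - b) * wp    # digit-reversed order
    stockham_step(m, 2 * s, not eo, y, x)

def stockham_fft(x):
    """Out-of-place radix-2 Stockham FFT; len(x) must be a power of two."""
    x = list(x)
    work = [0j] * len(x)
    stockham_step(len(x), 1, False, x, work)
    return x
```

Note how each stage reads contiguous blocks and writes them back interleaved, which is the cache-locality property mentioned above.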

C. THE FOUR-STEP FFT ALGORITHM
The Cooley-Tukey algorithm described above can also be used to represent one big 1D FFT sequence as a 2D FFT of smaller sizes [13]. This method is called the Four-step FFT and is used when the full FFT sequence is too big to be stored in the L1 cache of the GPU or to split the workload between multiple GPUs. Firstly, we compute FFTs along the columns. Then we multiply the sequence by twiddle factors and perform FFTs along the rows. The digit-reversal stage can be seen as an N 2 × N 1 transposition in this case and is usually done after the row FFTs to preserve locality.
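The four steps can be sketched as follows (illustrative Python with a naive O(N^2) DFT helper standing in for the on-chip FFTs; the indexing convention n = n_1·N_2 + n_2, k = k_2·N_1 + k_1 is one possible choice).

```python
import cmath

def dft(x):
    """Naive DFT used in place of an on-chip FFT for clarity."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def four_step_fft(x, N1, N2):
    """FFT of length N = N1*N2: column FFTs, twiddles, row FFTs, transposition."""
    N = N1 * N2
    # step 1: N2 column FFTs of length N1 (x viewed as an N1 x N2 row-major matrix)
    cols = [dft([x[n1 * N2 + n2] for n1 in range(N1)]) for n2 in range(N2)]
    # step 2: multiplication by twiddle factors e^{-2*pi*i*n2*k1/N}
    rows = [[cols[n2][k1] * cmath.exp(-2j * cmath.pi * n2 * k1 / N)
             for n2 in range(N2)] for k1 in range(N1)]
    # step 3: N1 row FFTs of length N2
    rows = [dft(r) for r in rows]
    # step 4: transposition -- the digit-reversal of this decomposition
    return [rows[k % N1][k // N1] for k in range(N)]
```

In the library setting, each column/row FFT is small enough to fit on-chip, and the final transposition is what maps to the N_2 × N_1 reordering described above.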

D. RADER's AND BLUESTEIN's FFT ALGORITHMS
The parallelization of the Cooley-Tukey and Stockham algorithms is usually done with a single thread (either CPU or GPU) responsible for a single N_1 radix kernel. Then we can have N_2 = N/N_1 threads working at the same time. To achieve good occupancy on GPUs, N_2 has to be big (at least the number of available cores). However, this is often not the case when N_1 is a big prime (for example, when the full sequence length N is a prime). Another issue is that the generation of optimized radix kernels for big primes is a non-trivial task in itself. Luckily, there exist two FFT algorithms that solve this issue: Rader's FFT algorithm and Bluestein's FFT algorithm [14], [15].
Rader's FFT algorithm computes the FFT of a prime length P as a cyclic convolution of length P − 1 (Fig. 1). The main idea behind the algorithm is that the residuals [1, P − 1] form a multiplicative group of integers modulo P with a generator g. Below, all generator powers are calculated modulo P. Rader then reformulates the FFT in question (reorder steps in Fig. 1) as:

X_0 = \sum_{n=0}^{P-1} x_n, \quad X_{g^{-m}} = x_0 + \sum_{q=0}^{P-2} x_{g^q} e^{-2\pi i g^{q-m}/P}, \quad m = 0, \ldots, P − 2.

FIGURE 2. Bluestein's FFT algorithm step-by-step description. This figure shows how an FFT of length N can be computed as a convolution of length N* ≥ 2N − 1. The convolution is performed with the help of the convolution theorem (light green), which can be done efficiently with the Stockham algorithm for primes from 2 to 13.
The second summation is a cyclic convolution of length P − 1. It can be calculated in two ways: as a direct multiplication with a precomputed kernel, or by using the convolution theorem (P − 1 FFT to IFFT steps in Fig. 1) for the following two sequences:

a_q = x_{g^q}, \quad b_q = e^{-2\pi i g^{-q}/P}.

The b_q sequence and its DFT can be precomputed. The direct multiplication approach does not scale well with increasing P, as it has O(N^2) complexity. The convolution theorem version of Rader's algorithm has another limitation: we rely on being able to compute the FFT of a sequence of length P − 1 with a regular Stockham algorithm approach. This is not the case for Sophie Germain safe primes (numbers like 47, 59, 83) and sequences divisible by them. For these numbers, another general algorithm can be used: Bluestein's algorithm (Fig. 2). This algorithm works for an arbitrary sequence size N through a reformulation of the DFT definition based on the identity nk = (n^2 + k^2 − (k − n)^2)/2:

X_k = e^{-\pi i k^2/N} \sum_{n=0}^{N-1} (x_n e^{-\pi i n^2/N}) e^{\pi i (k-n)^2/N}.

To get the DFT of an input sequence, we then have to calculate the convolution between the following two sequences (N* FFT to IFFT steps in Fig. 2):

a_n = x_n e^{-\pi i n^2/N}, \quad b_n = e^{\pi i n^2/N}.

If we pad them both with zeros to a sequence size higher than 2N − 1 (the FFT of which can be done with a Stockham algorithm), we can remove the circular part of the convolution and compute the convolution by using the convolution theorem. The b_n sequence and its DFT can be precomputed. This algorithm covers all the remaining primes and composite sequences. As Bluestein's algorithm performs convolutions of length N* ≥ 2N − 1 as opposed to Rader's P − 1, it is expected to have lower performance. While usually true, this is at times not the case, as Bluestein's algorithm's length can be chosen to have only small primes, like 2 and 3, allowing for better parallelization.
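Bluestein's reformulation can be sketched as follows (illustrative Python, with a simple recursive power-of-two FFT standing in for the Stockham FFT used by VkFFT; the padded length is chosen here as the next power of two above 2N − 1).

```python
import cmath

def fft_pow2(x, sign=-1):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    N = len(x)
    if N == 1:
        return list(x)
    even = fft_pow2(x[0::2], sign)
    odd = fft_pow2(x[1::2], sign)
    out = [0j] * N
    for k in range(N // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / N) * odd[k]
        out[k] = even[k] + t
        out[k + N // 2] = even[k] - t
    return out

def bluestein_dft(x):
    """DFT of arbitrary length N via a zero-padded convolution of length M >= 2N-1."""
    N = len(x)
    b = [cmath.exp(1j * cmath.pi * (n * n % (2 * N)) / N) for n in range(N)]
    a = [x[n] * b[n].conjugate() for n in range(N)]   # a_n = x_n * e^{-i*pi*n^2/N}
    M = 1
    while M < 2 * N - 1:                              # pad to a power of two >= 2N-1
        M *= 2
    fa = a + [0j] * (M - N)
    fb = b + [0j] * (M - 2 * N + 1) + b[1:][::-1]     # wrap-around: b_{-n} = b_n
    # convolution theorem: FFT both, multiply pointwise, inverse FFT
    FA, FB = fft_pow2(fa), fft_pow2(fb)
    conv = fft_pow2([FA[i] * FB[i] for i in range(M)], sign=+1)
    return [b[k].conjugate() * conv[k] / M for k in range(N)]
```

The zero padding turns the circular convolution into the required linear one, exactly as the text describes; b and its FFT could be precomputed once per size.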

E. REAL TO COMPLEX FFTs
If there is prior knowledge of the input structure, some optimizations to the regular algorithms can be made. Real-to-complex (R2C) and complex-to-real (C2R) transforms can be seen as complex-to-complex (C2C) transforms with the imaginary part set to zero [16]. They exploit the Hermitian symmetry of the result: X_k = X*_{N−k}. This reduces the memory required to store the complex result: we may store only floor(N_x/2) + 1 complex numbers instead of N_x. Computational complexity can also be reduced: two real sequences can be packed as one complex sequence, or a real sequence of even length can be calculated as a complex sequence of half the length. Overall, these two optimizations result in a 2x speedup over the regular C2C transformation.
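The packing trick for two real sequences can be sketched like this (illustrative Python; a naive DFT stands in for the C2C FFT): transform z = a + i·b once, then separate the two spectra using the Hermitian symmetry X_k = X*_{N−k}.

```python
import cmath

def dft(x):
    """Naive DFT used in place of a C2C FFT for clarity."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def packed_r2c(a, b):
    """DFTs of two real sequences from one complex DFT of z = a + i*b."""
    N = len(a)
    Z = dft([complex(ar, br) for ar, br in zip(a, b)])
    A, B = [], []
    for k in range(N // 2 + 1):           # only floor(N/2)+1 outputs are stored
        zk, zmk = Z[k], Z[-k % N].conjugate()
        A.append((zk + zmk) / 2)          # Hermitian-even part -> DFT of a
        B.append((zk - zmk) / 2j)         # Hermitian-odd part  -> DFT of b
    return A, B
```

One complex transform thus yields both real spectra, which is the source of the 2x saving mentioned above.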

F. REAL TO REAL FFTs
Real-to-real transforms in VkFFT are implemented in the form of Discrete Cosine Transforms of types I, II, III and IV. Their definitions and inverses match the FFTW implementation [17], [18], [19], [20]:
1) DCT-I: X_k = x_0 + (-1)^k x_{N-1} + 2 \sum_{n=1}^{N-2} x_n \cos(\pi n k/(N-1)); the inverse of DCT-I is DCT-I (itself).
2) DCT-II: X_k = 2 \sum_{n=0}^{N-1} x_n \cos(\pi (n + 1/2) k/N); the inverse of DCT-II is DCT-III.
3) DCT-III: X_k = x_0 + 2 \sum_{n=1}^{N-1} x_n \cos(\pi n (k + 1/2)/N); the inverse of DCT-III is DCT-II.
4) DCT-IV: X_k = 2 \sum_{n=0}^{N-1} x_n \cos(\pi (n + 1/2)(k + 1/2)/N); the inverse of DCT-IV is DCT-IV (itself).
R2R transforms are performed by redefining them as C2C transforms (the internal C2C sequence length can be different from the input R2R sequence length).
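The DCT-II/DCT-III inverse pair above can be checked directly (illustrative Python with naive O(N^2) sums; in FFTW's unnormalized convention, applying DCT-III after DCT-II scales the input by 2N).

```python
import math

def dct2(x):
    """Unnormalized DCT-II (FFTW's REDFT10)."""
    N = len(x)
    return [2 * sum(x[n] * math.cos(math.pi * (n + 0.5) * k / N) for n in range(N))
            for k in range(N)]

def dct3(x):
    """Unnormalized DCT-III (FFTW's REDFT01); inverts DCT-II up to a 2N factor."""
    N = len(x)
    return [x[0] + 2 * sum(x[n] * math.cos(math.pi * n * (k + 0.5) / N)
                           for n in range(1, N))
            for k in range(N)]
```

VkFFT evaluates these transforms via internal C2C FFTs rather than direct sums; the direct sums here only serve to pin down the conventions.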

III. GPU ARCHITECTURE IN-DEPTH ANALYSIS
Before discussing VkFFT GPU-specific design choices, it is important to specify which types of hardware are accessible by the GPU and what limitations they have. We will focus on the two most important types of circuitry: compute-related and memory-related.

A. ON-CHIP COMPUTE CIRCUITRY
Different vendors have different GPU compute unit designs, however, as all of them are based on the SIMT architecture, there are plenty of similarities. The main outcome of this design approach is that there is only one scheduler and dispatch unit that manages a relatively large group of cores. This saves space and allows the manufacturers to pack thousands of cores inside a single GPU. We will take Nvidia's compute unit (streaming multiprocessor) as an example hereon [21], [22], [23]. We will omit the discussion of all graphics-related circuitry in this section.
We start with a short description of the scheduler and the dispatch unit. Their job is to provide work assignments, in the form of warps, to the hardware cores. They can manage multiple kernels on a single GPU and take care of all resources available in a compute unit.
The source of compute power inside GPUs is their numerous cores. They have a pipelined structure: a dispatch port, an operand collector, an arithmetic unit and a result queue [24]. The pipelined structure means that a core can queue new instructions while previous ones are still active or being outputted. The arithmetic unit is often specialized: for integer, single-precision or double-precision floating point operations. Different operations require different numbers of clock cycles to complete. If a warp encounters a bank conflict, it will be stalled, increasing the number of cycles needed to maintain the convergence of the lanes inside a warp.
Compute units often contain specialized circuitry for specific operations that is useful but too big to fit in each core. Examples of this are special function units (SFUs) and tensor cores: the former allow for fast calculation of single-precision transcendental functions and the latter perform matrix operations [25], [26], [27].
The last type of non-memory circuitry present in compute units is the load/store unit. Its main job is handling memory transfers to the cores from shared memory, caches and global memory. Load/store units are designed to feed multiple cores at the same time (in accordance with the SIMT architecture), which results in strict memory coalescing requirements. Nvidia GPUs have 8 load/store units that can process data in 32-byte chunks. The load/store units open up another section of discussion: the different types of memory available on a GPU.

B. GPU MEMORY TYPES
The main principle of the memory hierarchy is that the closer a memory is located to the cores, the faster the cores' access to it and the smaller its size. The overall ranking of memory speeds can be seen in Fig. 3 [23].

1) REGISTER FILE
The fastest available memory on the GPU is located on-chip. It has been limited to 256KB for multiple generations of GPUs, as this size corresponds to the best balance achievable at the current technology level between register file latency (which increases with size) and the number of resources that warps can access (which also increases with size) [23]. Depending on the architecture of the GPU, the register file is split into multiple banks, each with a different number of ports capable of handling a single 32-bit memory request per clock cycle. It is possible to exchange data between registers of threads from a single warp via specialized subgroup/warp operations; otherwise, registers are thread-local and can only be accessed by the corresponding thread. If threads try to use more registers than the register file can provide, register spilling occurs and data is cached to the global memory at a much lower bandwidth.

2) L0 INSTRUCTION CACHE
This is the cache that stores instructions to be issued by the scheduler and the dispatch unit. Usually, there is no need for interaction from the developer side with this cache, other than avoiding large conditional code jumps inside the kernel; in that case, some of the code may not be prefetched to the cache, resulting in additional stalls [23].

3) SHARED MEMORY/L1 CACHE
The design combining shared memory and the L1 cache into a single entity with a configurable partition has proven effective and has been used by Nvidia (and other vendors) since the Volta architecture. This data cache also resides on-chip and offers high bandwidth and low latency: 28 cycles for an L1 cache hit and 19 cycles for shared memory without bank conflicts on Nvidia's V100, according to [23]. Unlike the register file, all of this memory can be seen and accessed by all threads in a block, which is useful for thread communication purposes. This is achieved by having a larger number of banks, so that each of them can process memory requests in parallel: the latest Nvidia GPUs have 32 banks with a 4-byte bank width. If the number of banks is equal to the number of threads and each thread accesses a different bank (or the same word from one bank), the data are transferred in one request and the bandwidth is similar to the register file bandwidth. In the case of a bank conflict, where multiple threads require different data from a single bank, requests become serialized, which can lead to a big bandwidth decrease, up to the worst case when all threads access one bank. Shared memory has to be explicitly synchronized to avoid race conditions. In Vulkan, static shared memory allocations are limited to 48KB per thread block on Nvidia GPUs, 64KB on AMD GPUs and 32KB-64KB on Intel GPUs [23], [26], [28]. The CUDA and HIP APIs, however, allow dynamic shared memory allocations, increasing this size up to 96KB for V100, 192KB for A100 and 256KB for H100 GPUs, which is extremely helpful for processing more data in a single upload to the chip (at the cost of a lower number of active warps per compute unit/streaming multiprocessor). VkFFT is optimized to use all shared memory available per thread block. The L1 cache of most vendors uses a 32-byte cache line with a 128-byte update granularity, so that every thread from a 32-sized warp gets one 32-bit word per request [23].

4) L2 CACHE (AND L3 CACHE)
This is an intermediate cache between the L1 cache and global memory. It has a bigger size than the L1 cache (usually 1-5MB; 6MB on V100, 40MB on A100, 50MB on H100) and higher speed and lower latency than global memory: 193 cycles per hit on V100 [23], [25]. It is located on-chip but is not compute unit specific: all compute units are connected to it. The L2 cache is used as an intermediate buffer: data requested or written by the GPU goes to global memory through the L2 cache, until new information is written over it. On recent Nvidia and AMD GPUs, the L1-L2 cache line is 64 bytes wide. On a cache miss, each sector can be filled with a data transaction from global memory, served in 32-byte granularity. This allows coalescing requirements to be less strict and also eases the strided access performance impact, as the L2 cache will still be able to combine separate requests to fully utilize on-chip cache lines with 128-byte update granularity. On Intel integrated GPUs, all transactions are performed with 64-byte granularity [28]. It is worth noting that the modern GPU generations, such as Nvidia's Ampere and AMD's RDNA2, made a move towards bigger-than-before intermediate cache sizes (40MB of L2 cache for the Nvidia DGX A100 and 128MB of L3 cache for the AMD Radeon RX 6000 series) and allow for better control of them.

5) GRAPHICS CARD GLOBAL MEMORY
Many GPUs have access to big amounts of high-bandwidth dedicated memory; however, it is located off-chip and poses a big latency hit (> 200 cycles per request). Typical values of the bandwidth global memory provides are 1.5TB/s for HBM2 memory, 500GB/s for GDDR6 and 1TB/s for GDDR6X memory (specification values) [23]. Integrated GPUs often use memory reserved from the system's random access memory (RAM) and are dependent on its bandwidth: the latest Apple M1 Pro systems use LPDDR5 technology and can achieve up to 200GB/s of bandwidth per chip. Data transactions from global memory are served in 32/64-byte granularity, so to fully utilize bandwidth, it is important to avoid strided accesses from threads to global memory. If all threads from a warp request data from one global memory page and these requests correspond to multiple consecutive 32/64-byte ranges in memory, each of these transfers can be performed in one transaction and then combined into a single 128-byte transaction in the L1 cache line update. This technique is called memory coalescing and it is essential for memory-bound problems. An important note is that accesses to different VRAM pages often cannot be combined by the L2 cache into a single 128-byte transaction in the L1 cache line. In this case, memory accesses have to be requested with a 128-byte granularity from the beginning.
Another source of memory problems related to global memory is L2 cache port serialization, and it is especially relevant to AMD GPUs [29]. If the memory accesses have very large distances in global memory between them (more than 256KB-64MB), even if they are locally coalesced, they will use only a limited number of cache ports and have lower total bandwidth. There are multiple ways to combat port serialization, all of which try to increase the locality of data accesses as much as possible. First, it can be reduced by increasing the number of bytes coalesced. Second, avoiding reordering in the Four-step FFT algorithm achieves similar results. Overall, this problem has not yet been fully solved in VkFFT.
On Nvidia, distant coalesced accesses can result in a 2x reduction of L2 bandwidth if they are not 32-byte aligned, for example, in the case of odd-length Four-step FFTs. In this case, the memory controller often requires two instructions per memory request.

6) SYSTEM MEMORY
This is the computer's primary storage, which is connected through PCI-E lanes to a discrete GPU or through a memory controller if the GPU is integrated. The PCI-E bandwidth is usually at least an order of magnitude lower than the dedicated memory bandwidth, so it is best to issue as few requests to system memory during execution as possible.

C. GPU ALGORITHM BOTTLENECKS
There are multiple bottlenecks that can be encountered in GPU programming, which correspond directly to the raw computational power of the GPU and the data transfer speeds. The first type is the compute-bound bottleneck, when a GPU is limited by how many operations it can perform. Usually, in this case, the bus load is low and GPU core utilization is high. The main approaches to alleviating this bottleneck are to reduce the number of performed operations via algorithm refinement, get a more powerful GPU or switch to a multiple-GPU setup. However, recent HPC GPUs have tens of thousands of cores and usually can perform 10-100 mathematical operations per single global memory request.
This brings us to the second type of GPU problem: global memory bandwidth-bound problems. They happen when a GPU is limited by how much information it can transfer from its main VRAM pool per second. In this case, the L2-global bus load will stay high (often close to 100% of the theoretical peak) throughout the whole execution time, and the time taken by data transfers alone will constitute the largest portion of the overall time spent. Algorithm refinement can still be helpful in this case; however, it has to be performed from the standpoint of reducing memory transfers, even at the cost of increasing the number of operations. A more powerful GPU will only gain performance scaling with the global memory bandwidth, in this case often independent of the number of compute units.
Moving up the bandwidth hierarchy, it is worth noting another, intermediate type of problem: shared memory bandwidth-bound bottlenecks. They occur when an algorithm is optimized to split the workload efficiently between compute units without requiring much communication between them, but requires a large number of data transfers between threads within a compute unit. These algorithms are often much more complex than the ones in the previous two categories, so the best solution is to improve shared memory communication patterns. They also scale almost linearly with the number of compute units, so getting a more powerful GPU (the same solution as for compute-bound problems) works in this case as well.
Some FFT-related algorithms that illustrate the three mentioned bottlenecks are given below:
1) Pure compute-bound problems: polynomial expansion sin/cos calculations, FP64 calculations on systems with a low FP64:FP32 core ratio.
2) Pure global memory bandwidth-bound problems: Stockham FFT with small primes.
3) Shared memory bandwidth-bound problems: Rader's algorithm (especially its direct multiplication version), Bluestein's algorithm.
Section V will go into detail on how VkFFT tackles these problems.

IV. THE STRUCTURE OF VkFFT
This section describes the implementation of the VkFFT library, focusing on the runtime kernel generation platform it is based on and how it allows combining and optimizing all of the implemented GPU algorithms.

A. GENERAL VkFFT DESCRIPTION
The VkFFT library is released under an MIT license as a header-only interface library written in C. It generates device-optimized GPU code at run-time, with an option to reuse the generated code.
VkFFT adopts the usual multidimensional memory layout: data is stored in the following order (sorted by increase in strides): the width, the height, the depth, the coordinate (the number of feature maps) and the batch number.
Similarly to other GPU FFT libraries, VkFFT uses the Stockham algorithm to avoid a separate digit-reversal stage. VkFFT uses a radix-decomposition approach for sequences decomposable as a product of an arbitrary number of the following primes: 2/3/5/7/11/13. These sequences have performance comparable to that of powers of 2.
VkFFT uses the convolution theorem version of Rader's FFT algorithm for primes from 17 up to the maximum shared memory length (approximately 10000). VkFFT uses the direct multiplication Rader FFT algorithm for small Sophie Germain safe primes: 47, 59 and 83. Both versions are inlined and can be viewed as an extension to radix kernels, computed with multiple threads per prime. Rader's algorithm works without additional memory transfers, except for the look-up tables (LUT) containing Rader's kernel values.
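The direct multiplication variant of Rader's algorithm can be sketched as follows (illustrative Python, not VkFFT's inlined GPU kernels; the P − 1 cyclic convolution is evaluated as a direct O(P^2) product, which is what makes this version shared-memory bandwidth-bound):

```python
import cmath

def find_generator(p):
    """Smallest generator of the multiplicative group modulo the prime p."""
    for g in range(2, p):
        if len({pow(g, q, p) for q in range(p - 1)}) == p - 1:
            return g

def rader_dft(x):
    """DFT of prime length P via a (P-1)-length cyclic convolution (direct version)."""
    p = len(x)
    g = find_generator(p)
    ginv = pow(g, p - 2, p)                          # modular inverse of g
    a = [x[pow(g, q, p)] for q in range(p - 1)]      # input reordered by powers of g
    b = [cmath.exp(-2j * cmath.pi * pow(ginv, q, p) / p) for q in range(p - 1)]
    # direct cyclic convolution; the b-kernel is what VkFFT stores in a LUT
    conv = [sum(a[q] * b[(m - q) % (p - 1)] for q in range(p - 1))
            for m in range(p - 1)]
    X = [complex(sum(x))] + [0j] * (p - 1)
    for m in range(p - 1):
        X[pow(ginv, m, p)] = x[0] + conv[m]          # X at index g^{-m} mod p
    return X
```

Replacing the direct convolution with an FFT-based one (via the convolution theorem) gives the other Rader variant described above.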
Bluestein's FFT algorithm is used for all other sequences. It is optimized to have as few memory transfers as possible by using zero-padding and merged convolution support of VkFFT. VkFFT also allows choosing which sequence to pad to in Bluestein's algorithm.
VkFFT supports single, double and half precision (the latter only used for data storage, with computations performed in single precision). VkFFT has two modes of operation: it can calculate the sines and cosines used in twiddle factors directly on the GPU (using special function units (SFUs), if the GPU has them, or a polynomial approximation of the transcendental functions), or use values precomputed on the CPU and stored in a LUT. The CPU precomputation can be done in higher precision, for example, FP128. The LUT approach allows GPUs to achieve better precision (it is the default mode of operation in FP64) and allows GPUs with a low double-precision core count to reduce the number of performed arithmetic operations.
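A small illustration of why a precomputed twiddle LUT can help precision (hedged sketch; VkFFT's actual LUT is generated on the CPU, optionally in higher precision than shown here): generating twiddle factors on the fly with a multiplicative recurrence accumulates rounding error relative to direct evaluation, while a LUT avoids recomputing transcendental functions in the kernel altogether.

```python
import cmath

def twiddle_lut(N):
    """Host-side LUT of twiddle factors e^{-2*pi*i*k/N}, evaluated directly."""
    return [cmath.exp(-2j * cmath.pi * k / N) for k in range(N)]

def twiddles_by_recurrence(N):
    """Cheap on-the-fly generation w_{k+1} = w_k * w_1; rounding error accumulates."""
    w, wm, out = 1.0 + 0j, cmath.exp(-2j * cmath.pi / N), []
    for _ in range(N):
        out.append(w)
        w *= wm
    return out

N = 4096
err = max(abs(a - b) for a, b in zip(twiddle_lut(N), twiddles_by_recurrence(N)))
# err is tiny in FP64 but grows with N; in FP32 kernels the LUT matters more
```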
Real-to-complex, complex-to-real and real-to-real transforms are optimized in the multidimensional case by packing two consecutive sequences as the real and imaginary parts of a complex sequence. The packing and unpacking do not require additional memory transfers and reduce the number of performed transforms by a factor of two. If an FFT cannot be performed in one upload to the chip, the Four-step FFT algorithm is used to split the sequence into a combination of smaller FFTs. VkFFT performs the FFT in-place, with an option to use an additional buffer to do an out-of-place transform or to perform the transposition in the Four-step FFT algorithm.
VkFFT has the option to optimize generated kernels for convolution calculation and for system zero-padding with enhanced performance. VkFFT supports 1 × 1, 2 × 2 and 3 × 3 matrix convolutions with a symmetric or non-symmetric kernel, and multiple feature/batch convolutions: one input, multiple kernels. VkFFT supports native zero-padding to model open systems. One can specify the range of sequences filled with zeros and the direction in which zero-padding is applied (at the read or write stage).
VkFFT works on Nvidia, AMD, Intel and Apple GPUs. VkFFT supports all GPUs, ranging from low-power mobile GPUs up to HPC GPUs. VkFFT works on Windows, Linux and macOS.

B. VkFFT KERNEL GENERATION PLATFORM
VkFFT has a hierarchical structure design: Application -> Plan -> Code. This allows the creation of code optimizations for the target device architecture at runtime. Below a more detailed description of the VkFFT platform structure is given, starting from the level that is closest to the user application stage.

1) APPLICATION STAGE
At this stage, the VkFFT platform performs all interactions with the user and resource management. The user interaction covers:
1) Parsing the input configuration given by the user, where all FFT system parameters and hints to the generator are given.
2) Application initialization, which consists of calls to the functions from the next Plan stage, corresponding to the provided configuration.
3) Application update, which can be used to update some of the parameters after the plan creation calls, for example, providing different input/output buffers.
4) Application dispatch, which appends the corresponding plan dispatch command to the user's queue of GPU actions.
5) Application deletion, which frees resources allocated by VkFFT.
6) Binaries caching, which allows reusing code generated with the same input configuration.

2) PLAN STAGE
This is the internal configuration stage that constructs the FFT plan: the intermediate representation of the FFT problem. It can be seen as a preparation and post-handling step for code generation. Overall, the processes included in this stage are: 1) The plan stage is called by the application and is provided with a local copy of the user configuration (with some values being configured automatically if they were not specified). 2) Decision making: the main algorithm selection step.
This step completely defines the structure of the generated kernels: the algorithms used and the number of threads, registers and shared memory. This step also determines the sequence of primes used by the Stockham FFT algorithm. 3) Resource allocation: LUTs and intermediate transposition buffers. 4) Plan initialization, which consists of calls to the functions of the next Code generation stage. 5) Kernel compilation, or loading of provided binaries. 6) Plan update handles, called by the application update handles. 7) Plan dispatch handles, called by the application dispatch handles. 8) Plan deletion, which is also called during application deletion.

3) CODE GENERATION STAGE
This is the lowest stage in the VkFFT platform. Its main goal is to generate a string holding GPU code for a particular API, which can later be compiled and used. The code generation stage is further subdivided into three intermediate levels to facilitate code reuse: 1) Level 2 kernels: a clear description of the problem via a sequence of calls to lower levels and the kernel layout configuration. 2) Level 1 kernels: simple routines such as matrix-vector multiplication, FFT, pre- and post-processing, and R2C/R2R mappings. 3) Level 0 kernels: memory management, inlining of basic math functions and API-dependent definitions.

V. VkFFT ALGORITHMS AND MEMORY TRANSFER OPTIMIZATIONS
The FFT algorithm on most modern GPUs is a heavily memory-bound problem. For this reason, the VkFFT library has been designed to perform as few global memory to on-chip memory data transfers as possible and to use shared memory efficiently. In the best-case scenario, the memory layout should follow these rules:
• No CPU-GPU transfers during execution, except for asynchronous downloads from the GPU.
• Maximize on-chip memory usage while avoiding register spilling and shared memory bank conflicts, and maintain occupancy: the ratio of the average number of active warps on a compute unit to the maximum number of active warps supported by the compute unit. Low occupancy may result in stalling of either compute or memory resources.

A. VkFFT GLOBAL MEMORY TRANSFERS LAYOUT
One of the main challenges in implementing a GPU FFT library is managing memory coalescing. Strided accesses occur in the multidimensional case, when accessing data for axes with a non-unit stride, and in the Four-step FFT algorithm, when accessing a column set of sequences. The typical approach, implemented in many other FFT libraries, is to perform a matrix transposition aligning the FFT sequence to the unit-stride direction. The transposition routine achieves non-strided access by splitting the input multidimensional system into rectangular tiles and performing the transposition in shared memory, which does not have a strided-access problem if the tile is padded so that every column resides in a different bank. Each of these transpositions, however, requires an additional upload and download of the system from global memory.
VkFFT incorporates a different way of memory coalescing that does not require a separate transposition and can be used in other algorithms operating on multidimensional data as well. The main idea is to implement a tiled upload approach, similar to the one used in matrix transposition, and group the consecutive sequences spanning along the non-unit-stride axis, so that each global memory transaction is filled from neighboring elements with unit stride. To fill a 32-byte data transaction, we have to group at least four sequences in single and two sequences in double precision. This limits the maximum sequence size that can be stored in shared memory more than the non-strided access pattern does. As the memory uploads are always performed along the unit-stride axis, it is guaranteed that reads and writes to any position along the non-unit-stride axis will be fully coalesced. This property also allows performing a transposition afterward to return data to the correct order required by the Four-step FFT algorithm. This transposition is merged with the store phase of the last stage of the Four-step algorithm and needs no additional memory transfers. However, it requires an additional memory buffer of the size of the input system, as these writes to global memory are not done in place. An example of a task that does not require transposition is a convolution performed via the convolution theorem, which will be discussed later. If we assume that the thread block can use 32KB of shared memory (for larger values, sequence sizes scale accordingly), then in the case of unit-stride axis access, up to a sequence size of 4096 single-precision complex floats (exactly 32KB of data), the full sequence can be stored in shared memory, so the FFT can be performed in one upload.
For bigger sequences, the Four-step FFT algorithm follows the same logic as above, as it uses a representation of a 1D sequence as a multidimensional matrix, with the difference that one stage of it has a unit stride. During this stage, if we omit the transposition at the end, FFTs can fill the full 32KB of shared memory, as the memory accesses are non-strided. However, if transposition is required, the sequence size is limited to 1024 to coalesce column writes after the last FFT step along the rows. VkFFT currently supports Four-step splits in up to three uploads, so for a system with 32KB of shared memory the maximum sequence length that can be done is 2^30. This length is further limited if the number decomposes as a product of big primes, like 11^7. In this case, if we split it as 11^7 = 11^3 · 11^2 · 11^2, one of the factors (11^3 = 1331) will be outside the 1024 range. These cases can be handled by coalescing less data or by implementing schemes with more than three uploads, which will be done in a future release.
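The decomposition behind the Four-step algorithm can be sketched in a few lines of Python (a naive O(N^2) DFT stands in for the optimized on-chip FFTs; this is an algorithmic illustration, not VkFFT code). A length N = N1·N2 transform becomes column FFTs, a twiddle multiplication, row FFTs and a final transposition:

```python
import cmath

def dft(x):
    """Naive O(N^2) DFT; stands in for an optimized on-chip FFT."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def four_step_fft(x, n1, n2):
    """Length n1*n2 DFT via the Four-step decomposition."""
    n = n1 * n2
    # Step 1: n1 DFTs of length n2 over the "columns" x[j1 + n1*j2].
    cols = [dft([x[j1 + n1 * j2] for j2 in range(n2)]) for j1 in range(n1)]
    # Step 2: multiply by the twiddle factors exp(-2*pi*i*j1*k2/n).
    for j1 in range(n1):
        for k2 in range(n2):
            cols[j1][k2] *= cmath.exp(-2j * cmath.pi * j1 * k2 / n)
    # Step 3: n2 DFTs of length n1 along the other axis.
    rows = [dft([cols[j1][k2] for j1 in range(n1)]) for k2 in range(n2)]
    # Step 4: transpose back to the natural order X[k2 + n2*k1].
    return [rows[k2][k1] for k1 in range(n1) for k2 in range(n2)]

x = [complex(i, 0) for i in range(12)]
out = four_step_fft(x, 3, 4)
```

Step 4 is the transposition that VkFFT merges with the store phase of the last stage; omitting it (as in the convolution path) leaves the data in a permuted but internally consistent order.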
One issue worth noting is related to the distant coalesced accesses described before. It occurs in the three-upload scheme during the transposition write (if the transposition is omitted, the performance is restored). VkFFT performs a single transposition in this case, swapping the logical axis of upload 0 and the axis of upload 2 as the last write operation. VkFFT attempts to solve the issue by coalescing at least 128 bytes of data; however, this only partially helps with the additional L2-cache instructions problem. It is possible that a different transposition pattern is required to solve this issue.
VkFFT allows FFT reads/writes to use multiple buffers, with the FFT system stored in multiple distinct memory chunks. This is helpful on systems with a limit on the memory allowed per single allocation (usually 4GB) and allows for better reuse of buffers. For example, VkFFT can reuse multiple small buffers left over from temporary calculations in other parts of the code as the temporary transposition buffer of the Four-step FFT algorithm. It also opens up the possibility of a multi-GPU implementation in the future.

B. VkFFT SHARED MEMORY OPTIMIZATIONS
Efficient use of shared memory is an essential part of having optimal FFT performance. If the global memory transfer number is reduced to the minimum, shared memory bandwidth will be the next limiting factor. Shared memory in VkFFT is used as a communication buffer, where a copy of the FFT system from registers is loaded/stored after each step of the FFT algorithm.
To optimize shared memory usage it is important to avoid bank conflicts. The first stages of the non-strided Stockham DIT algorithm for small sequences often cause problems here, as their writes to shared memory are also non-strided. VkFFT solves this by performing multiple non-strided FFTs with a single thread block and by performing a transposition in shared memory before calculating the FFT. This way, the number of combined sequences is a lower bound on the number of memory banks used. VkFFT also uses padding (adding free space) between sequences or inside sequences in shared memory if it determines that the access pattern improves by doing so.
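The effect of padding on bank conflicts can be demonstrated with a small index calculation (assuming 32 four-byte-wide banks, a typical but hardware-dependent figure; the helper name is ours):

```python
BANKS = 32  # number of 4-byte-wide shared memory banks (typical value)

def banks_touched(width, column, rows=32):
    """Set of banks hit when `rows` threads each read one element of a
    column of a row-major tile stored in shared memory."""
    return {(r * width + column) % BANKS for r in range(rows)}

# A 32x32 tile of floats: every element of a column lands in the same bank,
# so a column read serializes into a 32-way conflict.
conflicted = banks_touched(32, 5)
# Padding each row with one extra element shifts consecutive rows to
# consecutive banks, making the same column read conflict-free.
conflict_free = banks_touched(33, 5)
```

The one-element pad wastes a column of shared memory but removes the serialization, which is exactly the trade-off behind VkFFT's padding heuristic.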
For large sequences that have no occupancy problems, it is beneficial to merge the small radix kernels (2, 3 and others) used by the Stockham algorithm. This reduces the possible number of threads per block, increases the number of registers per thread and reduces the number of shared memory communications. VkFFT, at the moment of publication, employs additional kernels for the composite lengths 4, 6, 8, 9, 10, 12, 14, 15, 16 and 32. During the read/write stages, VkFFT prioritizes coalesced access patterns to global memory. However, when possible, VkFFT optimizes data transfers directly to/from registers, omitting the shared memory transfers.

C. VkFFT OCCUPANCY OPTIMIZATIONS
In cases where a kernel has to combine small FFT kernels of different prime lengths, the number of active threads changes between stages. For example, for an FFT of size 77 = 7 · 11, the radix-7 stage can be launched with 11 threads, while the radix-11 stage can be launched with 7 threads. The solution is to launch 11 threads with 11 · 2 registers each (the values are complex) and make 4 threads inactive during the radix-11 stage. The decision process becomes more subtle when more primes are involved. VkFFT has a handcrafted selection of the number of registers used for many combinations of the primes from 2 to 13. This selection follows two main design decisions: minimize the number of inactive threads while keeping enough threads to utilize the GPU (i.e., try not to overuse registers per thread).
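The thread/register trade-off described above can be captured in a toy model (a deliberate simplification of VkFFT's handcrafted tables; the function name and the cost model are ours):

```python
from math import prod

def kernel_shape(radices):
    """Toy model of thread/register selection for one mixed-radix sequence.

    Simplifying assumption: a stage of radix r over a length-n sequence
    uses n // r threads, each holding r complex values (2*r scalars).
    """
    n = prod(radices)
    threads = max(n // r for r in radices)         # widest stage sets the launch size
    registers = 2 * max(radices)                   # tallest stage sets registers/thread
    idle = {r: threads - n // r for r in radices}  # inactive threads per stage
    return threads, registers, idle

# 77 = 7 * 11: launch 11 threads with 22 registers each;
# 4 threads sit idle during the radix-11 stage.
threads, registers, idle = kernel_shape([7, 11])
```

The real selection also weighs occupancy across the whole block, which is why VkFFT tabulates these choices by hand rather than using a single formula.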

D. VkFFT REGISTER FILE OPTIMIZATIONS
Another way to reduce the number of data transfers is to use the register file to store the data instead of shared memory. This is mostly useful for GPUs with small amounts of allocatable shared memory; for example, Nvidia GPUs in Vulkan or OpenCL only allow the use of 48KB per thread block. In this case, using 128KB of the register file to store the sequence data and another 128KB for operations increases the maximum unit-stride sequence length that can be done in one upload to 2^14 in single precision. Each thread stores its own part of the input sequence in its registers and uses shared memory only as a communication buffer. This approach significantly reduces the number of warps a compute unit can maintain at the same time and increases the number of operations per FFT stage due to the additional communication step, with explicit synchronization between consecutive radix stages. However, it proves effective in bridging the gap at the transition from the one-upload to the two-upload Four-step FFT, at least for the simplest case of powers of 2. This approach can also be used in non-unit-stride FFT computations, effectively increasing the maximum FFT sequence length there by a factor of four.

E. VkFFT CONVOLUTION OPTIMIZATIONS
The convolution theorem states that the convolution of a system with a kernel can be computed as an inverse FFT of the pointwise multiplication of the system FFT and the kernel FFT. This allows convolutions to be performed with the same complexity as the FFT, instead of the initial O(n^2). The number of memory transfers can be reduced further if we combine the last stage of the FFT, the kernel multiplication and the first stage of the inverse FFT in one program, eliminating the system download from the chip after the FFT, the system upload and download for the multiplication and the system upload for the inverse FFT: a total of four memory transfers. Each of these transfers moves the full system at global memory bandwidth, so it is possible to estimate how much less memory is transferred after this optimization. For example, a 1D batched FFT without the Four-step algorithm (one system upload/download per FFT) drops from three uploads/downloads to only one (plus the kernel transfer to the GPU). In the case of a 1D batched FFT with the Four-step algorithm, we can omit the transposition step, as the inverse FFT will return the data to the original layout. This removes the additional transposition buffer and reduces global memory usage by a factor of two. It also improves locality and helps to reduce L2 cache serialization, as the last stage of the Four-step algorithm writes the output with unit stride.
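The fused FFT-multiply-IFFT path rests on the convolution theorem itself, which a few lines of Python can verify (naive DFTs stand in for the GPU kernels; function names are ours):

```python
import cmath

def dft(x, sign=-1):
    """Naive DFT; sign=+1 gives the unnormalized inverse transform."""
    n = len(x)
    return [sum(x[j] * cmath.exp(sign * 2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def circular_convolution_fft(sys, ker):
    """Inverse FFT of the pointwise product of the two forward FFTs."""
    n = len(sys)
    prod_spec = [u * v for u, v in zip(dft(sys), dft(ker))]
    return [v / n for v in dft(prod_spec, sign=1)]

def circular_convolution_direct(sys, ker):
    """O(n^2) definition, for reference."""
    n = len(sys)
    return [sum(sys[j] * ker[(m - j) % n] for j in range(n)) for m in range(n)]

sys = [1.0, 2.0, 3.0, 4.0]
ker = [0.5, 0.25, 0.0, 0.0]
via_fft = circular_convolution_fft(sys, ker)
direct = circular_convolution_direct(sys, ker)
```

Because the pointwise product is elementwise, it can be fused into the surrounding FFT stages on-chip, which is the source of the transfer savings described above.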

F. VkFFT ZERO-PADDING OPTIMIZATIONS
The calculation of a convolution as a multiplication in the frequency domain (the convolution theorem) has one important property: it yields a circular convolution, which means that it assumes the input data to be periodic. While this holds for modeling periodic systems, it can produce incorrect results for open systems, uninfluenced by anything outside the system. For this specific case the zero-padding technique is used: the system is padded along each dimension to double its size, and all the extra values are filled with zeros. While not changing the circular structure of the performed convolution, this trick sets all the influence from the periodic images to zero. The memory layout can be significantly improved in this case. First, since we know which axes are padded with zeros, we can upload only half of the data to the GPU and logically set the other half to zero on-chip. Second, with multidimensional zero-padding there exist sequences consisting entirely of zeros, which can be omitted from the calculation altogether, as their FFT is again a sequence of zeros. This allows up to a 2x speed increase for multidimensional FFTs. The decrease in transferred memory also improves the hit rates of all caches. VkFFT also allows frequency-domain zero-padding and a specific selection of the zero-padding range placement in the system. This allows zero-padding of high frequencies, which is useful for upscaling applications.
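A quick sketch shows why doubling the size with zeros turns the circular convolution into an open (linear) one; the periodic images now fall on zeros and contribute nothing (illustrative Python, not VkFFT code):

```python
import cmath

def dft(x, sign=-1):
    """Naive DFT; sign=+1 gives the unnormalized inverse transform."""
    n = len(x)
    return [sum(x[j] * cmath.exp(sign * 2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def conv_via_fft(a, b):
    """Circular convolution through the frequency domain."""
    n = len(a)
    spec = [u * v for u, v in zip(dft(a), dft(b))]
    return [v / n for v in dft(spec, sign=1)]

def linear_convolution(a, b):
    """Open-system (aperiodic) reference convolution."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

a = [1.0, 2.0, 3.0, 4.0]
b = [1.0, -1.0, 0.5, 0.0]
n = len(a)
# Zero-pad both inputs to double length: the wrap-around now lands on zeros,
# so the circular result coincides with the open-system one.
open_result = conv_via_fft(a + [0.0] * n, b + [0.0] * n)[:2 * n - 1]
```

In VkFFT the padded halves are never uploaded; they are materialized as zeros on-chip, which is where the bandwidth saving comes from.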

VI. VkFFT BENCHMARKS
To measure the performance of VkFFT, we compare it to Nvidia's cuFFT and AMD's rocFFT libraries, vendor solutions developed mainly for HPC use cases. We use Nvidia A100 and AMD MI250 GPUs for the tests, the latest publicly released HPC GPUs available at the time of writing. Both GPUs were measured to use approximately 250W of power. The Nvidia A100 has 40GB of HBM2 memory and 192KB of shared memory, while the AMD MI250 has 64GB of HBM2 memory (we use a single chip of it) and 64KB of shared memory [25], [26]. The GPUs use CUDA 11.7 and ROCm 5.2, respectively. We use VkFFT version 1.2.31.
The first test performs an FFT of multiple batched 1D sequences of all lengths in the 2 to 4096 range, followed by a corresponding inverse FFT. The number of sequences is chosen based on the sequence length to keep a constant system size of 512MB-1GB in memory, so that all compute units of the GPU are utilized. To mitigate the dispatch call overhead of submitting compute tasks to the GPU and to remove random noise, the test performs the FFT and inverse FFT 100 times in a row per dispatch and then averages the time taken. Each test is performed three times to verify the consistency of the results. For memory-bound problems, a good performance representation is the theoretical algorithm bandwidth, defined as the total amount of data the algorithm has to transfer over global memory (the system size multiplied by the number of uploads and downloads to the chip) divided by the execution time. The peak global memory bandwidth of the A100 and MI250 (obtained by a simple large-data memory copy test) is 1.3TB/s.
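As a worked example of this metric (with hypothetical numbers, not measured results; the function name is ours):

```python
def algorithm_bandwidth_gbs(system_bytes, transfers, seconds):
    """Theoretical algorithm bandwidth: data the FFT must move over global
    memory (system size times the number of uploads/downloads to the chip)
    divided by the execution time, in GB/s."""
    return system_bytes * transfers / seconds / 1e9

# Hypothetical: a 512 MiB system processed in one upload plus one download
# (a single-stage FFT) within 1 ms lands near the 1.3 TB/s copy peak.
bw = algorithm_bandwidth_gbs(512 * 2**20, 2, 1e-3)  # ~1074 GB/s
```

A multi-upload FFT multiplies `transfers` accordingly, which is why the single-upload designs discussed above sit closest to the copy-test peak.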

A. DOUBLE-PRECISION, ALL SIZES UP TO 4096
The pattern seen in the benchmark plots of Fig. 4 and Fig. 5 is similar to the ones seen in all other benchmarks. It is closely related to which algorithm is required for a particular FFT length. VkFFT incorporates three main FFT algorithms that cover the whole range of numbers: the Stockham autosort algorithm, Rader's algorithm (two versions) and Bluestein's algorithm. For cuFFT and rocFFT, the coloring is an educated guess made by analyzing profiling results. The Stockham autosort algorithm is designated as radix(2-13) on the performance plots (red and black colors). In double-precision (DP), all three libraries use it for sequences representable as a product of an arbitrary number of primes up to 13. VkFFT is designed to have as few global memory transfers as possible, so it performs all these sequences in a single upload to the chip, thus achieving close to peak GPU bandwidth for both the A100 (Fig. 4) and the MI250 (Fig. 5). This is mainly possible because VkFFT generates kernels at runtime and does not ship binaries for all sequences. cuFFT, on the other hand, provides only precompiled PTX files and switches to multiple uploads for some of the sequences, which can be detected as black triangles far below the 1.3TB/s peak bandwidth. rocFFT is developing a runtime kernel generator as well; however, there are still many sequences that fall back to a less optimized implementation, which can be seen in the abundance of black-triangle results below 300GB/s (Fig. 5).
For sequences decomposable as a product of primes up to 4096 whose P−1 FFT can be done with the Stockham autosort algorithm, VkFFT uses the FFT-convolution version of Rader's algorithm, shown as cyan circles. Each prime handled by the Rader-FFT algorithm requires 2x more shared memory transfers than the radix 2-13 kernels, plus an additional global memory LUT access for Rader's kernel (which becomes noticeable for big primes). For some small primes (47, 59 and 83) that have non-radix-2-13 primes in their Rader decomposition (for example, 47 − 1 = 46 is divisible by 23; 23 is called a Sophie Germain prime and 47 a safe prime), it is still beneficial to use the direct multiplication version of Rader's algorithm. However, its performance is inferior to Bluestein's algorithm for primes after 83 due to its high number of shared memory data transfers: this version is shared memory bandwidth-limited. The cuFFT library appears to implement only the direct multiplication version of Rader's algorithm. For all other sequences (those divisible by Sophie Germain safe primes or non-decomposable by the previous two algorithms) in VkFFT, for primes after 128 in cuFFT and for primes after 17 in rocFFT (approximately), Bluestein's algorithm is used. It is shown as magenta circles for VkFFT and green triangles for the other libraries in Fig. 4 and Fig. 5. In VkFFT, the padded length (at least 2N − 1) has to be chosen as a balance between high decomposability (like a power of 2) and the amount of memory transferred. For example, 1031 is a prime non-decomposable with Rader's algorithm (1030 is divisible by 103, which is not a radix 2-13 prime), so we have to pad to at least 2061. However, the next power of 2 is 4096, almost another 2x increase in size. By extensive testing of all radix 2-13 decomposable numbers from 2061 to 4096, the best performance is achieved by padding to 2187 = 3^7.
This prior testing, the convolution support in VkFFT (its convolutions are optimized to have as few memory transfers as possible) and the zero-padding optimizations make VkFFT's Bluestein implementation faster than both the cuFFT and rocFFT implementations. It is also worth noting that the 64KB of shared memory of the MI250 can only hold FP64 complex sequences up to length 4096, so, taking Bluestein's padding into account, all sequences that have to be padded to a number bigger than 4096 must be done in two uploads in VkFFT; this can be seen as a drop in performance after 2048 for the magenta circles.
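The candidate search for the Bluestein padded length can be reproduced with a short script (the enumeration is ours; VkFFT additionally benchmarks the candidates rather than simply picking one):

```python
def is_smooth(n, primes=(2, 3, 5, 7, 11, 13)):
    """True if n factors entirely into the Stockham radix primes 2..13."""
    for p in primes:
        while n % p == 0:
            n //= p
    return n == 1

def bluestein_candidates(seq_len, upper):
    """All radix-decomposable padded lengths in [2*seq_len - 1, upper]."""
    lo = 2 * seq_len - 1
    return [n for n in range(lo, upper + 1) if is_smooth(n)]

# N = 1031 is prime, and 1030 contains the non-radix prime 103, so Rader
# does not apply; Bluestein needs a decomposable length of at least 2061.
cand = bluestein_candidates(1031, 4096)
# 2187 = 3**7 is one of the candidates; per the text, it wins the benchmark.
```

The smallest smooth candidate is not necessarily the fastest one, which is why the selection is driven by measurements rather than by minimizing the padded size alone.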

B. SINGLE-PRECISION, ALL SIZES UP TO 4096
In single precision (Fig. 6 and Fig. 7), VkFFT uses the same algorithm configuration as in double precision (radix+Rader+Bluestein). The main difference comes from the usage of the GPU's special function units for the calculation of sines and cosines, so less memory is transferred for bigger sequences. For the MI250 (Fig. 7), all sequences from the range (including Bluestein's padded ones) fit into 64KB of shared memory and are done in a single upload, so there is no performance drop at 2048 (it happens after 4096). VkFFT also uses the packed math instructions available on the MI250: this GPU can perform two FP32 instructions at the same time. cuFFT, for some reason, does not use Rader's algorithm in single precision and switches to Bluestein's algorithm for primes after 17 (Fig. 6). Rader's algorithm implementation in VkFFT works just as well in FP32 as in FP64.
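To make the reindexing behind Rader's algorithm concrete, here is a minimal Python sketch of the direct-multiplication variant discussed above (illustrative only; the helper names are invented for the demo, and VkFFT's generated GPU kernels look nothing like this). It maps a prime-length DFT onto a cyclic convolution of length p−1 using a primitive root:

```python
import cmath

def dft(x):
    """Naive O(p^2) reference DFT."""
    p = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / p) for j in range(p))
            for k in range(p)]

def primitive_root(p):
    """Smallest generator of the multiplicative group modulo prime p."""
    for g in range(2, p):
        if len({pow(g, e, p) for e in range(p - 1)}) == p - 1:
            return g

def rader_dft(x):
    """Prime-length DFT via a length p-1 circular convolution (Rader),
    with the convolution evaluated by direct multiplication."""
    p = len(x)
    g = primitive_root(p)
    ginv = pow(g, p - 2, p)                          # modular inverse of g
    a = [x[pow(ginv, q, p)] for q in range(p - 1)]   # input permuted by g^-q
    b = [cmath.exp(-2j * cmath.pi * pow(g, m, p) / p) for m in range(p - 1)]
    out = [sum(x)] + [0j] * (p - 1)                  # X[0] is the plain sum
    for m in range(p - 1):
        conv = sum(a[q] * b[(m - q) % (p - 1)] for q in range(p - 1))
        out[pow(g, m, p)] = x[0] + conv              # X[g^m mod p]
    return out
```

In the FFT-convolution variant, the inner cyclic convolution is itself computed with FFTs of length p−1, which is why p−1 must decompose into radix 2-13 primes.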
C. DOUBLE-PRECISION, RADIX 2-7, ALL SIZES UP TO 2^30
To test the performance of VkFFT on the whole supported range of transforms, we select all sequences decomposable as a product of an arbitrary number of 2s, 3s, 5s and 7s (Fig. 8 and Fig. 9). The main consideration behind this choice is that these are the most commonly used sizes and that testing all 2^30 numbers is unfeasible for a developing library. For big sequences, all uploads have to be coalesced: the sequence is represented as a 2D or 3D array. Under a 32-byte cache-line assumption, this means that we have to upload at least two 16-byte complex numbers at the same time. For the non-strided (last) axis, there are two options. If we need to restore the correct FFT layout, we have to perform the coalesced transposition there, which means that this axis is essentially also treated as strided, as the final write will be strided. However, if we do not have to restore the correct layout (for example, in a convolution calculation the inverse FFT restores the correct layout anyway), the transposition can be omitted and bigger sequences can be done in fewer memory transfers. This is used in VkFFT's Bluestein FFT implementation for big sequences.
The switch to a two-upload scheme happens at approximately 10500 for the A100 (Fig. 8) and at 4096 for the MI250 (Fig. 9) due to the shared memory limitations. The switch to the three-upload scheme happens at 2^22 for the A100 and 2^19 for the MI250. The Four-step algorithm requires a single twiddle factor multiplication for each of the uploads except the first. For high double-precision core count GPUs, it turns out to be beneficial to use a slow polynomial approximation of sines and cosines and pay the price of additional FMA instructions rather than transfer additional data from the LUT. For consumer-level Nvidia GPUs, this is not the case, so VkFFT has both options available.
For the three-upload scheme, multiple other problems have been encountered, all closely related to the final transposition (they are not present if the transposition to restore the layout is not required). The first issue is related to the size of the systems becoming quite big (1GB and more): the L2 cache is often not able to combine four 32-byte transactions to fill a 128-byte cache line. This can happen if the final global memory requests target different memory banks, so VkFFT coalesces to 128 bytes in the three-upload scheme if possible. The second issue is also related to L2 cache performance: if the stride is odd, the access pattern cannot be aligned to 128-byte granularity, so the L2 cache has to serve twice the number of memory requests, limiting its maximum bandwidth by a factor of up to 2x. A possible solution is to reimplement the single transposition as one with multiple intermediate transpositions, but this has not been tested in VkFFT yet. The last issue mainly concerns AMD GPUs: for coalesced accesses with big strides they can experience memory pin serialization, where all memory transactions to global memory happen through a single memory pin. This issue has been known since the Polaris generation of GPUs [29] and there has been no simple workaround for it. Single-precision results (Fig. 10 and Fig. 11) resemble the double-precision results, as both the A100 (Fig. 10) and the MI250 (Fig. 11) have a high FP32 to FP64 core ratio. The main difference is that coalescing to 32 bytes requires four sequences in FP32 instead of two in FP64. This moves the switch from a single upload to multiple uploads from 10500 to 21000 for the A100 and from 4096 to 8192 for the MI250. All the issues related to the final transposition are present here as well.

The next test (Fig. 12 and Fig. 13) evaluates VkFFT performance on a wide range of multidimensional systems, from small ones that cannot saturate a single compute unit (like 2^3) to excessively big ones filling the full memory of a single GPU (like 1024^3). The bandwidth is multiplied by 3 compared to the 1D cases, as all libraries are expected to use separate uploads for each of the axes; this makes it easier to compare the results to the peak theoretical bandwidth of the GPU. Red circles correspond to VkFFT results and black triangles to competitor library results. The performance plots also show the zero-padding performance (cyan circles), which is described in the next subsection. VkFFT uses Rader's and Bluestein's algorithms in the multidimensional case as well, and all performance considerations from the previous sections also hold here.
System sizes taking up to 256KB (cubes up to 2^5) experience a high L1 hit rate but have the lowest number of available warps, which results in low GPU utilization, as many compute units are idle or do not have enough workload. A100 and MI250 performance in this range is mostly limited by the dispatch latencies.
System sizes of 512KB-8MB (cubes up to 2^7) experience a high L2 hit rate and have a large number of available warps, active compute units and higher occupancy compared to smaller systems. The nature of the test (it performs multiple FFTs and inverse FFTs) exploits this high L2 hit rate, resulting in higher than peak global memory bandwidth. The L2 cache of the A100 is much faster than that of the MI250, achieving 2.4TB/s on a 128^3 system. System sizes bigger than the L2 cache (cubes up to 2^10) have a performance determined mainly by the global memory bandwidth. The memory pin serialization of AMD GPUs is most noticeable here for the sizes divisible by high powers of 2.

FIGURE 12. Benchmark of the Nvidia A100 GPU with VkFFT (red circles) and cuFFT (black triangles) in 3D single-precision FFT+IFFT cube computations. This plot also shows the zero-padding optimization performance of VkFFT (cyan circles).

F. ZERO-PADDING PERFORMANCE
Fig. 12 and Fig. 13 also demonstrate the zero-padding optimization performance in the multidimensional case (cyan circles). In the 3D case, the reduced memory transfers and increased cache hit rate of zero-padding result in a 100% performance increase (on system sizes not affected by low occupancy, as the additional branch instructions executed there can make the code slower). The performance increase is most noticeable for system sizes bigger than the L2 cache, whose performance is mainly determined by the global memory bandwidth.

G. DISCRETE COSINE TRANSFORMS (R2R) PERFORMANCE
The performance plots of Fig. 14-16 present double-precision performance results for DCTs of types I-IV. The test configuration is the same as for the C2C tests in double precision.

FIGURE 17. FP64 precision comparison of VkFFT (black squares for AMD MI250, cyan triangles for Nvidia A100), cuFFT (green diamonds for Nvidia A100) and rocFFT (red circles for AMD MI250) against the FP128 version of FFTW for a single forward FFT [17].
We compare the performance of an AMD EPYC 7742 (64-core) CPU running multithreaded FFTW with the Nvidia A100 and AMD MI250 GPUs running VkFFT, as no other GPU vendor library supports DCTs [17]. The high bandwidth of GPU memory allows the GPUs to greatly outperform the CPU implementation in FFTW.

VII. VkFFT PRECISION
VkFFT precision is verified by comparing its results with the FP128 version of FFTW. We test all FFT lengths from the [2, 100000] range. We perform tests in double (Fig. 17) and single (Fig. 18) precision on random input data from the [−1, 1] range. The test results are compared to the corresponding results of Nvidia's cuFFT and AMD's rocFFT libraries.

FIGURE 18. FP32 precision comparison of VkFFT (black squares for AMD MI250, cyan triangles for Nvidia A100), cuFFT (green diamonds for Nvidia A100) and rocFFT (red circles for AMD MI250) against the FP128 version of FFTW for a single forward FFT.
For both precisions, all tested libraries exhibit logarithmic error scaling. The main source of error is imprecise twiddle factor computation: the sines and cosines used by the FFT algorithms. For FP64 (Fig. 17), they are calculated on the CPU, either in FP128 or in FP64, and stored in look-up tables. With FP128 precomputation, VkFFT is more precise than cuFFT and rocFFT: its L2 error stays below 10^-15 over the full test range.
For FP32 (Fig. 18), twiddle factors can be calculated on the fly in FP32 or precomputed in FP64/FP32. With FP32 twiddle factors, VkFFT is slightly less precise in Bluestein's and Rader's algorithms. If needed, this can be solved with FP64 precomputation.
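The gap between FP32 and FP64 twiddle factors is easy to quantify by rounding double-precision values to single precision (a host-side simulation with invented helper names, not GPU code):

```python
import math
import struct

def to_fp32(v):
    """Round a Python float (FP64) to the nearest IEEE-754 single."""
    return struct.unpack('f', struct.pack('f', v))[0]

def twiddle(j, n, fp32=False):
    """Twiddle factor exp(-2*pi*i*j/n), optionally rounded to FP32."""
    ang = -2.0 * math.pi * j / n
    c, s = math.cos(ang), math.sin(ang)
    if fp32:
        c, s = to_fp32(c), to_fp32(s)
    return complex(c, s)

n = 4096
# Worst-case rounding error that FP32 twiddles inject before the FFT starts;
# it sits near the FP32 epsilon (~1e-7), versus ~1e-16 for FP64 tables.
err32 = max(abs(twiddle(j, n, fp32=True) - twiddle(j, n)) for j in range(n))
```

This per-factor error then accumulates logarithmically over the FFT stages, matching the error scaling observed in Fig. 17 and Fig. 18.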

VIII. CONCLUSION
This paper presents a cross-platform, open-source and performant FFT library for GPU accelerators: VkFFT. Several GPU memory management and optimization techniques are described in detail. A number of novel solutions to the problems limiting GPU performance are presented and tested. By applying the described memory layout and the developed solutions to the FFT problem, VkFFT is able to show better performance than already established GPU FFT libraries, while being scalable and tunable to a much wider range of GPUs present on the market. Full open-source access to the source code of VkFFT allows for the implementation of user-specific optimizations, demonstrated here with the example of native zero-padding, in which case the FFT experiences up to a 2x speed increase (Fig. 12 and Fig. 13). The results of this paper also present a platform for multi-API, cross-platform GPU code generation.