Analysis on the AES Implementation with Various Granularities on Different GPU Architectures

The Advanced Encryption Standard (AES) is one of the most popular symmetric block ciphers due to its efficiency and security. The AES is a computation-intensive algorithm, especially for massive transactions. The Graphics Processing Unit (GPU), with its massive parallel processing power, is a well-suited platform for accelerating the AES. Traditional GPU-based implementations of the AES use 16 Bytes/thread as the default granularity. In this paper, the AES-128 algorithm (ECB mode) is implemented on three different GPU architectures with different granularity values (32, 64, and 128 Bytes/thread). Our results show that the throughput reaches 277 Gbps, 201 Gbps, and 78 Gbps on the NVIDIA GTX 1080 (Pascal), the NVIDIA GTX TITAN X (Maxwell), and the GTX 780 (Kepler) GPU architectures, respectively.


Introduction
Nowadays, the demand for fast and secure communication networks is constantly growing. As the size of data sets increases, so does the demand for a fast encryption process. One of the most widely used block ciphers is the AES [1], which consists of many intensive computations [2], [3], [4] and [5]. Such computations can be performed on GPUs, which were originally developed to accelerate graphical and video applications. GPUs are now also designed to handle non-graphical computations (i.e., General-Purpose computing on Graphics Processing Units (GPGPU)).
In this paper, we implement the AES algorithm on the GPU, taking into consideration different granularity values of 32, 64, and 128 Bytes/thread, seeking to increase the AES performance (i.e., throughput). The implementations of the AES are performed on three different GPU architectures: Kepler (NVIDIA GTX 780), Maxwell (NVIDIA GTX TITAN X), and Pascal (NVIDIA GTX 1080), using various input block sizes with random plain text. The overall structure of the AES algorithm [1] is overviewed in the next section.

The AES Algorithm Mechanism
In this section, the overall structure of the AES encryption algorithm is overviewed, with its three cipher variants: AES-128, AES-192, and AES-256, as shown in Fig. 1. All of these variants use a 128-bit input block, but with a key size of 128 bits, 192 bits, and 256 bits, respectively. The AES encryption algorithm consists of 10 rounds for the 128-bit key, 12 rounds for the 192-bit key, and 14 rounds for the 256-bit key. Each of these rounds uses a different 128-bit round key, which is derived from the original AES key using the key expansion technique.

Key Expansion Step
The AES key expansion step is used to expand the cipher key from 128, 192, or 256 bits to 1280 bits (10 round-keys), 1472 bits (12 round-keys), or 1664 bits (14 round-keys), respectively. The round keys are used in the AddRoundKey transformation [1]. It consists of three tasks (illustrated in the sketch after this list):
• Substitute Word (SubWord): takes a 32-bit (four-Byte) word as input and makes a substitution based on the S-Box for each of the four Bytes to produce an output word.
• Rotate Word (RotWord): takes a 32-bit word and makes one cyclic left rotation to produce an output word.
• Round constant addition (Rcon): XORs the word with a round-dependent constant so that each round key is distinct.
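The following host-side sketch combines these three tasks for AES-128, assuming a standard AES S-Box table (SBOX) defined elsewhere; the function and array names are illustrative, not taken from the paper's code.

```cuda
#include <stdint.h>

extern const uint8_t SBOX[256];  /* standard AES S-Box, assumed defined elsewhere */

static const uint8_t RCON[10] = {0x01, 0x02, 0x04, 0x08, 0x10,
                                 0x20, 0x40, 0x80, 0x1B, 0x36};

/* Expands a 16-Byte AES-128 cipher key into 44 32-bit words
 * (the initial key plus the round keys used by the 10 rounds). */
void aes128_key_expansion(const uint8_t key[16], uint32_t w[44])
{
    for (int i = 0; i < 4; i++)
        w[i] = ((uint32_t)key[4*i]     << 24) | ((uint32_t)key[4*i + 1] << 16) |
               ((uint32_t)key[4*i + 2] <<  8) |  (uint32_t)key[4*i + 3];

    for (int i = 4; i < 44; i++) {
        uint32_t t = w[i - 1];
        if (i % 4 == 0) {
            t = (t << 8) | (t >> 24);                        /* RotWord */
            t = ((uint32_t)SBOX[(t >> 24) & 0xFF] << 24) |   /* SubWord */
                ((uint32_t)SBOX[(t >> 16) & 0xFF] << 16) |
                ((uint32_t)SBOX[(t >>  8) & 0xFF] <<  8) |
                 (uint32_t)SBOX[t & 0xFF];
            t ^= (uint32_t)RCON[i/4 - 1] << 24;              /* Rcon XOR */
        }
        w[i] = w[i - 4] ^ t;
    }
}
```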

Round Transformations Step
The input plain text data are divided into 16-Byte blocks, each arranged in a 4×4 column-major array called the state. The AES algorithm has four different kinds of layers.
• Add round key layer: is a bitwise XOR operation between each column in the state and the corresponding round key from the key expansion.
• Byte substitution layer (using the S-Box): is a non-linear substitution, where each Byte in the state is independently substituted according to a given substitution table (i.e., the S-Box). The first four bits of the Byte are used to index the S-Box rows, while the last four bits are used to index the S-Box columns.
• ShiftRows layer: rotates the last three rows in the state to the left (by one, two, and three Bytes, respectively). The result is a new matrix consisting of the same 16 Bytes, but shifted with respect to each other.
• MixColumns layer: operates on each column in the state using Galois Field (GF) arithmetic, treating every column as a four-term polynomial over GF(2^8). The result is a new matrix consisting of 16 new Bytes. A sketch of this layer is given below.
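As an illustration of the GF(2^8) arithmetic, the following minimal sketch applies MixColumns to a single state column using the well-known xtime trick (multiplication by 2 in GF(2^8)); the function names are illustrative.

```cuda
#include <stdint.h>

/* Multiplication by 2 in GF(2^8), reducing modulo x^8 + x^4 + x^3 + x + 1. */
static inline uint8_t xtime(uint8_t x)
{
    return (uint8_t)((x << 1) ^ ((x >> 7) * 0x1B));
}

/* MixColumns applied to one 4-Byte state column, in place. */
void mix_one_column(uint8_t c[4])
{
    uint8_t a0 = c[0], a1 = c[1], a2 = c[2], a3 = c[3];
    uint8_t t  = a0 ^ a1 ^ a2 ^ a3;
    c[0] ^= t ^ xtime(a0 ^ a1);   /* 2*a0 + 3*a1 +   a2 +   a3 */
    c[1] ^= t ^ xtime(a1 ^ a2);   /*   a0 + 2*a1 + 3*a2 +   a3 */
    c[2] ^= t ^ xtime(a2 ^ a3);   /*   a0 +   a1 + 2*a2 + 3*a3 */
    c[3] ^= t ^ xtime(a3 ^ a0);   /* 3*a0 +   a1 +   a2 + 2*a3 */
}
```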

AES Modes of Operation
The AES modes of operation are summarized in Tab. 1. In this paper, we implemented a CUDA-based AES algorithm using the ECB mode, as it can be implemented in a parallel manner: given the ECB mode, the message is divided into blocks, and each block can be encrypted separately.

The Fast AES Implementation Algorithm
In the fast AES algorithm, the round transformation steps can be replaced with a lookup-table solution [7] and [8]. In this fast algorithm, four lookup tables are defined: T_0, T_1, T_2, and T_3. Each table accepts one Byte of input and outputs a 32-bit column vector. The operations of each round transformation can then be defined as:

e_j = T_0[a_{0,j}] ⊕ T_1[a_{1,(j+1) mod 4}] ⊕ T_2[a_{2,(j+2) mod 4}] ⊕ T_3[a_{3,(j+3) mod 4}] ⊕ K_j,   j = 0, 1, 2, 3,

where a_{i,j} denotes Byte i of column j of the round input, K_j is one column of the stage key, and e_j denotes one column of the round output. T_0, T_1, T_2, and T_3 refer to the lookup tables, each of which has 256 entries of 32-bit words. Each table thus needs 1 KByte of storage space.
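A minimal sketch of one such table-based round for a single 16-Byte block (four 32-bit columns) is given below, assuming that Byte 0 of each column sits in the most significant Byte; the function name and signature are illustrative.

```cuda
// One table-based AES round: T0..T3 are the precomputed 256-entry tables,
// in/out hold the four state columns, and rk is one 4-word round key.
__device__ void aes_table_round(const uint32_t *T0, const uint32_t *T1,
                                const uint32_t *T2, const uint32_t *T3,
                                const uint32_t in[4], const uint32_t rk[4],
                                uint32_t out[4])
{
    for (int j = 0; j < 4; j++) {
        out[j] = T0[(in[j]           >> 24) & 0xFF] ^
                 T1[(in[(j + 1) & 3] >> 16) & 0xFF] ^
                 T2[(in[(j + 2) & 3] >>  8) & 0xFF] ^
                 T3[ in[(j + 3) & 3]        & 0xFF] ^
                 rk[j];
    }
}
```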

Implementation Techniques on GPU Platform
This section presents the traditional implementations of the AES algorithm on GPU platforms. The GPU-based implementations of the AES algorithm can be divided into three major categories: memory optimization, parallel granularity, and GPU platform-specific optimization [9], [10] and [11].

Memory Optimization
In this subsection, we deal with two parameters: the lookup tables and the encryption key storage.
• Lookup Tables: need 4 KB of storage space. The best choice is to store them (the T-Boxes) in the shared memory, as it has a high access speed. Shared memory is allocated per thread block, so all threads in the block have access to the same shared memory [9].
• Encryption Key Storage: the AES performance can be further enhanced if the encryption keys are first computed on the CPU and then stored in the GPU global memory.
When the GPU kernel is launched, each thread in a warp copies key value(s) from the global memory and stores them in two registers. This process is known as warp shuffle [9]. There are 32 threads in a warp, so 64 key words can be held in total. Using this strategy, all encryption keys can be kept in registers, which have a higher access speed; the access pattern is sketched below.
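The following device-side sketch shows this access pattern, assuming lane i of a warp has preloaded key words i and i + 32 into registers k0 and k1. It uses the modern __shfl_sync intrinsic (older toolkits, such as the CUDA 7.5 release used in this paper, expose the same feature as __shfl); the helper name is illustrative.

```cuda
// Fetch expanded-key word n from the registers of the warp. Both shuffles
// are issued unconditionally so every lane in the warp participates.
__device__ uint32_t get_key_word(uint32_t k0, uint32_t k1, int n)
{
    uint32_t lo = __shfl_sync(0xFFFFFFFFu, k0, n & 31);  // word n from lane n mod 32
    uint32_t hi = __shfl_sync(0xFFFFFFFFu, k1, n & 31);  // word n + 32
    return (n < 32) ? lo : hi;
}
```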

Parallel Granularity
In the parallel processing concept, many threads can be assigned to one task to speed it up, thus reducing the execution time. In the AES algorithm, the default number of blocks to be encrypted per thread (i.e., the granularity) is one block (i.e., 16 Bytes), as shown in Tab. 3 [9], [12], [13], [14], [15], [16] and [17]. With the parallel processing concept embedded into the AES algorithm, the number of data blocks encrypted by one thread can be further increased to 2, 4, or 8 blocks. This is called the parallel granularity of the AES algorithm. In this paper, we examine new parallel granularities that have not been used in the literature: 32 Bytes/thread (i.e., two data blocks), 64 Bytes/thread (i.e., four data blocks), and 128 Bytes/thread (i.e., eight data blocks), as sketched below.
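A minimal kernel sketch of this parameter follows, assuming an aes_encrypt_block() device function (e.g., ten applications of the table-based round sketched earlier) and buffers holding nblocks 16-Byte blocks as 32-bit words; the names are illustrative.

```cuda
// BLOCKS_PER_THREAD = 1, 2, 4, or 8 selects the 16, 32, 64, or
// 128 Bytes/thread granularities studied in this paper.
#define BLOCKS_PER_THREAD 2   // 32 Bytes/thread

__global__ void aes_ecb_kernel(const uint32_t *in, uint32_t *out,
                               const uint32_t *rk, size_t nblocks)
{
    size_t base = ((size_t)blockIdx.x * blockDim.x + threadIdx.x)
                  * BLOCKS_PER_THREAD;
    for (int b = 0; b < BLOCKS_PER_THREAD; b++) {
        size_t blk = base + b;
        if (blk < nblocks)                    // guard the tail of the buffer
            aes_encrypt_block(&in[4 * blk], &out[4 * blk], rk);
    }
}
```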

GPU Platform Specific Optimization
Each new GPU architecture has its own hardware specifications that differ in design from previous architectures in order to enhance the overall performance.
The GPU occupancy is a measure of thread parallelism in a CUDA program; generally, the higher the occupancy, the higher the performance (i.e., throughput). The thread block size affects the GPU occupancy, and its optimal value depends on the GPU architecture (e.g., on Maxwell GPUs of compute capability 5.x).
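The occupancy achieved by a candidate thread block size can be checked with the CUDA occupancy API, as in the following sketch (which assumes the aes_ecb_kernel from the granularity sketch above; the helper name is illustrative).

```cuda
#include <cstdio>

// Report the theoretical occupancy of aes_ecb_kernel at a given block size.
void report_occupancy(int block_size)
{
    int max_blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &max_blocks_per_sm, aes_ecb_kernel, block_size, /*dynamicSmem=*/0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    float occupancy = (float)(max_blocks_per_sm * block_size) /
                      (float)prop.maxThreadsPerMultiProcessor;
    printf("Occupancy at block size %d: %.0f %%\n",
           block_size, occupancy * 100.0f);
}
```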

Related Work
Several studies have implemented the AES algorithm on the GPU architectures mentioned above using the CUDA language. Zola et al. [16] proposed a speculative AES-CTR scheme that encrypts the counter blocks on the GPU and CPU together, achieving a throughput of 72.0 Gbps using the 16 Bytes/thread granularity.
Li et al. [13] stored the T-Boxes in the on-chip shared memory. Moreover, they adopted the granularity where one thread handles a 16-Byte AES block, yielding a throughput of 60 Gbps on the NVIDIA Tesla C2050 GPU. Nishikawa et al. [15] presented implementations of block ciphers on NVIDIA and AMD GPUs based on the GeForce GTX 680 architecture, with a throughput of 68.6 Gbps.
Wai-Kong Lee et al. [9] presented an implementation of the AES on the NVIDIA GTX 980 with the Maxwell architecture and utilized the advanced warp shuffle feature to further accelerate the performance. Although all the aforementioned approaches have used different GPU architectures to speed up the execution time, the granularity is still limited to 16 Bytes/thread (i.e., one block per thread). In this paper, we examine all available granularities on various GPU architectures to achieve the optimized settings for each architecture.

Proposed AES Algorithm
We implemented the AES algorithm with the two optimization techniques mentioned in Sec. 3. The proposed parallel granularities and the encryption key storage are explained in Subsec. 5.1. and Subsec. 5.2., respectively.

Parallel Granularity
There is a trade-off between the number of data blocks to be encrypted by one thread (i.e., the parallel granularity) and the processing load of each thread, which negatively affects the AES performance depending on the specifications of the GPU used. We propose larger granularities (32, 64, and 128 Bytes/thread) in order to encrypt more than one data block per thread, as shown in Tab. 4, compared to the conventional granularity of 16 Bytes/thread.

Encrypted Keys Storage
We used three different techniques for storing the AES encryption keys inside the GPU in order to show their effects under different parallel granularities (a shared-memory sketch follows this list).
• Shared memory storage: the encryption keys are stored in the shared memory inside the GPU.
• Warp shuffle storage: the encryption keys are stored in the GPU registers and accessed using the warp shuffle feature.
• Global memory storage: the encryption keys are stored in the global memory inside the GPU.
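A minimal sketch of the shared memory storage variant is given below, assuming the 44 expanded round-key words of AES-128 and blockDim.x >= 44; the per-thread encryption loop is elided (see the granularity kernel sketched earlier), and the names are illustrative.

```cuda
// Each thread block cooperatively stages the round keys in shared memory
// once, then all of its threads read them at shared-memory speed.
__global__ void aes_ecb_shared_keys(const uint32_t *in, uint32_t *out,
                                    const uint32_t *g_rk, size_t nblocks)
{
    __shared__ uint32_t s_rk[44];
    if (threadIdx.x < 44)
        s_rk[threadIdx.x] = g_rk[threadIdx.x];  // one copy per thread block
    __syncthreads();

    // ... encrypt this thread's assigned blocks using s_rk ...
}
```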

Implementation Setup and Performance Evaluation
• GPU platforms: we use three different platforms supporting three different GPU architectures (Kepler, Maxwell, and Pascal), as shown in Tab. 5.
• Implementation: all AES implementations used in our experiments are performed using the CUDA Toolkit 7.5 on Linux (i.e., Ubuntu 14.04).
The proposed approach is performed on the Pascal GPU (i.e., NVIDIA GTX 1080).
• Competing approaches: for a fair comparison, the proposed approach is compared to the approach of Nishikawa et al. [15] on the Kepler GPU (i.e., NVIDIA GTX 780), and to that of Lee et al. [9] on the Maxwell GPU (i.e., NVIDIA GTX TITAN X).
• Parallel granularity values: The granularity is set to different values (i.e., 16, 32, 64, and 128) in all competing approaches using three different GPU architectures as shown in Tab. 5 for a thorough analysis.
• Performance evaluation: the performance of all competing approaches is evaluated using the throughput metric, in Gigabits per second (Gbps), on the basis of the higher the better.
Tab. 5: Configuration of the three GPU platforms.

Experiments and Results
In our experiments, we focus on the kernel time in the GPU, which excludes the data transfer time between the CPU and GPU (measured as sketched below). We repeated each experiment with a specific setting on a particular GPU 30 times and took the average throughput. Four different charts are presented to show our results for each GPU architecture. In all charts, as mentioned in Tab. 6, we used different granularities with a specific encryption key storage.
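The following sketch shows this measurement method with CUDA events, which bracket only the kernel launch so that host-device copy time is excluded; the kernel, device buffers, and BLOCKS_PER_THREAD come from the earlier sketches (assumed, not the paper's exact benchmark code).

```cuda
#include <cstdio>

// Time one kernel launch and report the resulting encryption throughput.
float time_kernel(const uint32_t *d_in, uint32_t *d_out,
                  const uint32_t *d_rk, size_t nblocks)
{
    dim3 block(256);
    size_t nthreads = (nblocks + BLOCKS_PER_THREAD - 1) / BLOCKS_PER_THREAD;
    dim3 grid((unsigned)((nthreads + block.x - 1) / block.x));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    aes_ecb_kernel<<<grid, block>>>(d_in, d_out, d_rk, nblocks);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Throughput: %.1f Gbps\n",
           (double)nblocks * 16.0 * 8.0 / ((double)ms * 1.0e6));  // bits / ns
    return ms;
}
```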

Experimental Results on Kepler Platform
We evaluated the average throughput for different input block sizes (16 MBytes to 512 MBytes) as follows:
• Shared memory-based chart, in Fig. 2(a), shows that the default granularity (i.e., 16 Bytes/thread) provides a higher average throughput compared to other granularities.
• Warp Shuffle chart, in Fig. 2(b), shows that the default granularity (i.e., 16 Bytes/thread) provides a higher average throughput compared to other granularities.
• Global memory-based chart, in Fig. 2(c), shows that the parallel granularity (i.e., 32 Bytes/thread) provides a higher average throughput compared to other granularities.
• All-in-one chart, in Fig. 3, shows that the default granularity (i.e., 16 Bytes/thread) with the shared memory key storage provides a higher average throughput of 78 Gbps.

Experimental Results on Maxwell Platform
We evaluated the average throughput for different input block sizes (16 MBytes to 512 MBytes) as follows:
• Shared memory-based chart, in Fig. 4(a), shows that the parallel granularity (i.e., 32 Bytes/thread) provides a higher average throughput compared to other granularities.
• Warp shuffle chart, in Fig. 4(b), shows that the parallel granularity (i.e., 32 Bytes/thread) provides a higher average throughput compared to other granularities.
• Global memory-based chart, in Fig. 4(c), shows that the parallel granularity (i.e., 64 Bytes/thread) provides a higher average throughput compared to other granularities at most input file sizes. However, when the input file size is set to 512 MBytes, the new parallel granularity (i.e., 128 Bytes/thread) provides a higher average throughput.
• All-in-one chart, in Fig. 5, shows that the parallel granularity (i.e., 32 Bytes/thread) with the shared memory key storage provides a higher average throughput of 201 Gbps. However, when the input file size is set to 256 MBytes, the parallel granularity (i.e., 32 Bytes/thread) with the warp shuffle storage provides a higher average throughput.

Experimental Results on Pascal Platform
We evaluated the average throughput for different input block sizes (16 MBytes to 512 MBytes) as follows:
• Shared memory-based chart, in Fig. 6(a), shows that the parallel granularity (i.e., 32 Bytes/thread) provides a higher average throughput compared to other granularities.
• Warp shuffle chart, in Fig. 6(b), shows that the parallel granularity (i.e., 32 Bytes/thread) provides a higher average throughput compared to other granularities.
• Global memory-based chart, in Fig. 6(c), shows that the parallel granularity (i.e., 64 Bytes/thread) provides a higher average throughput compared to other granularities.
• All-in-one chart, in Fig. 7, shows that the parallel granularity (i.e., 32 Bytes/thread) with the shared memory key storage provides a higher average throughput of 276 Gbps.

Comparison Between GPU-Based and CPU-Based Implementations
The speedup factor of the GPU architectures over the CPU also needs to be determined. In the CPU case, the AES algorithm is implemented on an 8-core dual processor. In the GPU case, the three different GPU architectures are exploited. The speedup factor is determined as 40 on the Kepler GPU, 100 on the Maxwell GPU, and 130 on the Pascal GPU at a specific input file size. Note that the speedup factor increases as the input block size increases. Tab. 7 shows the speedup factors obtained on the three GPU architectures over the CPU implementation with different input file sizes.

Analysis and Discussion
We presented four parallel AES charts using the parallel granularity and the round key storage to achieve a higher performance (i.e., throughput). In this section, we analyze the experimental results mentioned above.

Parallel Granularity
The new 32, 64, and 128 Bytes/thread granularities affect the AES performance depending on the GPU architecture used, as follows:

1) The Kepler GPU
• The default granularity (i.e., 16 Bytes/thread) provides a higher AES throughput compared to other granularities in the shared memory-based, warp shuffle-based, and all-in-one charts.
• The 32 Bytes/thread granularity outperforms other granularities with the global memory-based storage. Nevertheless, it did not achieve the best overall results for the AES. It can be a good alternative when optimizing another algorithm on the Kepler GPU.

2) Maxwell and Pascal GPUs
• The 32 Bytes/thread granularity provides a higher throughput of the AES algorithm compared to the default granularity (i.e., 16 Bytes/thread) with the shared memory-based, warp shuffle-based, and All-in-one charts.
• The 64 Bytes/thread granularity outperforms other granularities with the global memory-based storage. Nevertheless, it did not achieve the best overall results for the AES. It can also be a good alternative when optimizing another algorithm.
As aforementioned in Sec. 5, there is a trade-off between the number of data blocks to be encrypted by one thread and the processing load of each thread, according to the specifications of the GPU used. In recent GPU architectures, such as Maxwell and Pascal, both the number of Streaming Multiprocessors (SMs) and the number of active blocks per multiprocessor are largely increased compared to those of the Kepler GPU. In turn, reducing the number of threads by using higher granularities overcomes the issue of the increased processing load of each thread, thus increasing the AES performance.

Round Keys Storage
Although the registers are faster than the shared memory inside the GPU, using the latter provides a higher AES throughput compared to the warp shuffle-based storage on all GPU platforms used (i.e., Kepler, Maxwell, and Pascal). This is because storing the round keys in two registers and accessing them from the 32 threads inside a warp results in higher access contention, which negatively impacts the AES performance.

Conclusions
We implemented the AES in the CUDA language, considering the parallel granularity with different round key storage techniques to eventually increase the AES performance (i.e., throughput). The AES-128 is implemented with different parallel granularities on different GPU architectures and compared to a CPU implementation. The proposed approach achieves throughputs of 277 Gbps, 201 Gbps, and 78 Gbps on the Pascal GPU, the Maxwell GPU, and the Kepler GPU with granularity values of 32, 32, and 16 Bytes/thread, respectively. In addition, the speedup factors of implementing the AES algorithm on those GPUs are 130.992, 98.517, and 37.456, respectively, at a 512-MBytes input file size with the best granularity values mentioned above.
Fig. 1: The overall structure of the AES algorithm [1].

Tab. 3: Summary of AES implementations on different GPU architectures.
Tab. 6: Four different charts depending on two techniques of optimization.