The design of scalar AES Instruction Set Extensions for RISC-V

Secure, efficient execution of AES is an essential requirement on most computing platforms. Dedicated Instruction Set Extensions (ISEs) are often included for this purpose. RISC-V is a (relatively) new ISA that lacks such a standardised ISE. We survey the state-of-the-art industrial and academic ISEs for AES, implement and evaluate five different ISEs, one of which is novel. We recommend separate ISEs for 32 and 64-bit base architectures, with measured performance improvements for an AES-128 block encryption of 4× and 10× with a hardware cost of 1.1K and 8.2K gates respectively, when compared to a software-only implementation based on use of T-tables. We also explore how the proposed standard bit-manipulation extension to RISC-V can be harnessed for efficient implementation of AES-GCM. Our work supports the ongoing RISC-V cryptography extension standardisation process.


Introduction
Implementing the Advanced Encryption Standard (AES). Compared to more general workloads, cryptographic algorithms like AES present a significant implementation challenge. They involve computationally intensive and specialised functionality, are used in a wide range of contexts, and form a central target in a complex attack surface. The demand for efficiency (however measured) is an example of this challenge in two ways. First, cryptography often represents an enabling technology vs. a feature and is often viewed as an overhead from a user's perspective. Addressing this is complicated by constraints associated with the context, e.g., a demand for high-volume, low-latency, high-throughput, low-footprint, and/or low-power implementations. Second, although efficiency is a goal in itself, it also acts as an enabler for security. This is because one should not compromise security to meet efficiency requirements. Hence, a more efficient implementation leaves greater margin to deliver countermeasures against an attack.
AES is an interesting case-study wrt. secure, efficient implementation. For example, per the request for candidates announcement, the AES process was instrumental in popularising a model in which both "security" (e.g., resilience against cryptanalytic attack) and "algorithm and implementation characteristics" form important quality metrics for the design, in order to facilitate techniques for higher quality implementations of it. Additionally, the design and implementations of AES are long-lived. The importance of AES has led to special emphasis on related research and development effort before, during, and, most significantly, after the AES process. The 20+ years since standardisation have forced an evolution of implementation techniques, to match changes in the technology and attack landscape. For example, [NBB+01, Section 3.6] covers implementation (e.g., side-channel) attacks: this field has become richer, and the associated threat more dangerous, during said period.

Support via Instruction Set Extensions (ISEs).
A large number of implementation styles often exist for a given cryptographic algorithm. Techniques can be algorithm-agnostic or algorithm-specific, and based on the use of hardware only, software only, or a hybrid approach using ISEs [GB11, BGM09, RI16]. For the ISE case, the aim is to identify, through benchmarking, instances of algorithm-specific functionality which are inefficiently represented in the base ISA. Said functions are then implemented in hardware, and exposed to the programmer via one or more new instructions.
ISEs are an effective option for both high-end, performance-oriented and low-end, constrained platforms. They are particularly effective for the latter where resource constraints are tightest. For example, an ISE can be smaller and faster than a pure software implementation, and more efficient in terms of performance gain per additional logic gate than a hardware-only option.
Abstractly, an ISE design constitutes an interface to domain-specific functionality through the addition of instructions to a base ISA. As a fundamental and long-lived computer systems interface, the design and extension of an ISA demands careful consideration (cf. [Gue09, Section 4]) and the production of a concrete ISE design is not trivial. It must deliver a quantified improvement to the workload in question and consider numerous design goals including but not limited to:
• Limiting the number and complexity of changes and interactions with the parent ISA.
• Avoiding the addition of too many instructions, or the need for large additional hardware modules to implement them: either will damage commercial adoption.
• Adhering to the design constraints and philosophies of the base ISA.
• Maximising the utility of the additional functionality, i.e., favouring general-purpose over special-purpose functionality. Special-purpose functions can be justified in terms of how frequently the workload is required. For example, though an AES ISE might only be useful for AES, a webserver might execute AES millions of times per day.
The x86 architecture provides many examples of ISE design, having been extended numerous times by Intel and AMD. Various generations of non-cryptographic Multi-Media eXtensions (MMX), Streaming SIMD Extensions (SSE), and Advanced Vector Extensions (AVX) support numerical algorithms via vector (or SIMD) vs. scalar computation. Likewise, the cryptographic Advanced Encryption Standard New Instructions (AES-NI) [Gue09, DGvK19] ISE supports AES: it significantly improves latency and throughput (see, e.g., [FHLdO18]), and represents a useful case-study in the design goals above. It adds just 6 additional (vs. 1500+ total) instructions, reduces overhead by sharing the pre-existing XMM register file, and facilitates compatibility via the CPUID [X8618a, Chapter 20] feature identification mechanism. It is also (sometimes unexpectedly) useful beyond AES: the Grøstl hash function [GKM+11] uses the S-box, and the YAES [BV14] authenticated encryption scheme uses a full round. It can even be used to accelerate the Chinese SM4 block cipher.
On the one hand, RISC-V represents an excellent target for such work: the ISA is extensible by design and its open nature makes exploration of extensions easier through the availability of (often open-source) implementations. Increased commercial deployment of such implementations suggests that work on RISC-V is timely and potentially of high impact. On the other hand, RISC-V also presents unique challenges vs. previous work. For example, RISC-V could in fact be viewed as three related base ISAs, RV32I [RV:19a, Section 2], RV64I [RV:19a, Section 5], and RV128I [RV:19a, Section 6], that each support a different word size: designing ISEs that are applicable (or scale) across these options is a complicating factor. We hope this work supports RISC-V in becoming the first widely implemented ISA to support AES acceleration across all implementation profiles, from embedded IoT devices to application and server class processors.

AES specification
Syntax. As a block cipher, AES defines two algorithms such that m = Dec(k, c = Enc(k, m)). That is, given a plaintext m and cipher key k, Enc encrypts m under k; given the same k, Dec will invert Enc and so the same m can be recovered from the associated ciphertext c. In addition, it defines an algorithm KeyExp that expands [FIP01, Section 5.2] the cipher key into a sequence of round keys then used by Enc or Dec; where appropriate, said algorithm is specialised to suit Enc and Dec respectively.
Parameterisation. An AES parameter set [FIP01, Figure 4] is a triple (N_k, N_b, N_r) where N_k dictates the number of 32-bit words in k, N_b dictates the number of 32-bit words in m or c (i.e., a block), and N_r dictates the number of rounds. The standard AES parameter sets are
AES-128 → (4, 4, 10)
AES-192 → (6, 4, 12)
AES-256 → (8, 4, 14)
such that the number of bits in a plaintext (resp. ciphertext) block is fixed to 8 · 4 · N_b = 128. From here on, we focus wlog. on encryption using AES-128 (other parameter sets are catered for naturally, and decryption with minor differences) so use the terms AES and AES-128 synonymously.
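Design. The mathematics underpinning AES are described in [FIP01, Section 4]. In particular, it can be defined in terms of operations in the finite field F_{2^8} constructed as F_2[x] / (x^8 + x^4 + x^3 + x + 1). The state and round key are (4 × N_b)-element matrices over F_{2^8}: in round r these are denoted s^(r) and rk^(r), with the elements in row i and column j denoted s^(r)_{i,j} and rk^(r)_{i,j} respectively, with super- and/or subscripts omitted whenever irrelevant.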
AES is an iterative block cipher, based on a substitution-permutation network. This means encryption using AES can be described [FIP01, Section 5.2] as follows: 1) the input plaintext is pre-whitened to yield s^(0) = m ⊕ rk^(0) = m ⊕ k, 2) each r-th round, for 1 ≤ r ≤ N_r, demands computation of s^(r) = P-layer(S-layer(s^(r−1))) ⊕ rk^(r), and therefore use of round key rk^(r), 3) the output ciphertext is c = s^(N_r). Note that an alternative round definition, namely s^(r+1) = P-layer(S-layer(s^(r) ⊕ rk^(r))) for 0 ≤ r < N_r with s^(0) = m, is plausible: this shifts the pre-whitening step before 2) into an analogous post-whitening step after 2), i.e., c = s^(N_r) ⊕ rk^(N_r), to yield an equivalent result. At a low(er) level, the computation of each round is specified via four round functions (each of which has an inverse, to support decryption). Writing s for the input and s′ for the output of a round function:
• SubBytes [FIP01, Section 5.1.1] operates element-wise, computing s′_{i,j} = S-box(s_{i,j}) via application of the S-box: given an element x, this component can be described as S-box(x) = f(g(x)), where g is an inversion in F_{2^8} (with g(0) = 0), and f is a specially selected affine transformation. Where appropriate, we overload SubBytes by allowing it to denote application of the S-box to any collection, e.g., a row, column, or, more generally, a sequence, of elements.
• ShiftRows [FIP01, Section 5.1.2] operates row-wise, rotating each i-th row of s left by i elements to form the associated row of s′, i.e., s′_{i,j} = s_{i,(j+i) mod N_b}.
• MixColumns [FIP01, Section 5.1.3] operates column-wise, multiplying each j-th column of s by a fixed matrix M over F_{2^8} to form the associated column of s′, where row 0 of M is (02, 03, 01, 01) and each subsequent row is the previous row rotated right by one element.
• AddRoundKey [FIP01, Section 5.1.4] operates element-wise, computing s′_{i,j} = s_{i,j} ⊕ rk_{i,j} and thereby mixing a round key into the state.
Note that S-layer = SubBytes, and
P-layer = MixColumns ∘ ShiftRows in rounds 1 ≤ r < N_r
P-layer = ShiftRows in round N_r
i.e., the last, N_r-th round differs from the initial N_r − 1 rounds. As such, a round as defined above is constructed via AddRoundKey ∘ MixColumns ∘ ShiftRows ∘ SubBytes or AddRoundKey ∘ ShiftRows ∘ SubBytes respectively, where, because ShiftRows and SubBytes commute, the order they are applied in can be selected to suit.
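As a minimal sketch of the round structure (not code from the implementations evaluated in this work), the following Python model implements the four round functions on a 4 × 4 state and composes them into AES-128 encryption with pre-whitening. The function names, the list-of-rows state layout, and the choice to compute the S-box at start-up are illustrative assumptions.

```python
def gf_mul(a, b):
    """Multiply two elements of F_{2^8} modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a = (a << 1) ^ (0x11B if a & 0x80 else 0)
        b >>= 1
    return r

def sbox(x):
    """S-box as inversion (x^254, so 0 maps to 0) followed by the affine map."""
    y = 1
    for _ in range(254):
        y = gf_mul(y, x)
    rot = lambda v, k: ((v << k) | (v >> (8 - k))) & 0xFF
    return y ^ rot(y, 1) ^ rot(y, 2) ^ rot(y, 3) ^ rot(y, 4) ^ 0x63

SBOX = [sbox(x) for x in range(256)]
MIX = [[2, 3, 1, 1], [1, 2, 3, 1], [1, 1, 2, 3], [3, 1, 1, 2]]

def sub_bytes(s):
    return [[SBOX[x] for x in row] for row in s]

def shift_rows(s):
    # Row i is rotated left by i elements.
    return [[s[i][(j + i) % 4] for j in range(4)] for i in range(4)]

def mix_columns(s):
    return [[gf_mul(MIX[i][0], s[0][j]) ^ gf_mul(MIX[i][1], s[1][j]) ^
             gf_mul(MIX[i][2], s[2][j]) ^ gf_mul(MIX[i][3], s[3][j])
             for j in range(4)] for i in range(4)]

def add_round_key(s, rk):
    return [[s[i][j] ^ rk[i][j] for j in range(4)] for i in range(4)]

def to_state(b):
    # Byte i + 4j of a block is the element in row i, column j.
    return [[b[i + 4 * j] for j in range(4)] for i in range(4)]

def from_state(s):
    return bytes(s[i][j] for j in range(4) for i in range(4))

def key_exp(k):
    """AES-128 key expansion: 11 round keys, each a 4 x 4 matrix."""
    w = [list(k[4 * i:4 * i + 4]) for i in range(4)]
    rcon = 1
    for i in range(4, 44):
        t = list(w[i - 1])
        if i % 4 == 0:
            t = [SBOX[x] for x in t[1:] + t[:1]]
            t[0] ^= rcon
            rcon = gf_mul(rcon, 2)
        w.append([a ^ b for a, b in zip(w[i - 4], t)])
    return [to_state(bytes(sum(w[4 * r:4 * r + 4], []))) for r in range(11)]

def encrypt(k, m):
    rk = key_exp(k)
    s = add_round_key(to_state(m), rk[0])      # pre-whitening
    for r in range(1, 11):
        s = shift_rows(sub_bytes(s))           # S-layer, then ShiftRows
        if r < 10:
            s = mix_columns(s)                 # omitted in the last round
        s = add_round_key(s, rk[r])
    return from_state(s)
```

Because SubBytes acts element-wise and ShiftRows only moves elements, swapping the two calls inside the loop leaves the result unchanged.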

AES implementation

Representation
A field element in F_{2^8} can be represented by an 8-bit byte x, where the i-th bit of x for 0 ≤ i < 8 represents the i-th polynomial coefficient. Beyond this, the state and round key matrices can be represented in several ways. The most direct option would be termed array-based (or unpacked): the matrix is represented as a 16-element array of 8-bit bytes, each representing a field element. We use R to refer to the register width of a target platform. For RISC-V, R = XLEN where we consider XLEN ∈ {32, 64}. Where R ≥ 32, an entire row or column of the AES state matrix can be packed into each register: we term these "row-packed" and "column-packed" representations respectively. Where R ≥ 128, it is plausible to pack an entire AES state matrix into a single register: we term this a "fully-packed" representation.
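To make these representations concrete, the sketch below packs one 16-byte block in each style. The byte order within a word (element 0 in the least-significant byte), the mapping of block byte i + 4j to row i and column j, and the helper names are assumptions chosen for illustration.

```python
def array_based(block):
    """16-element array of bytes: element (i, j) is stored at index i + 4j."""
    return list(block)

def column_packed(block):
    """Four 32-bit words; word j holds column j, row 0 in the low byte."""
    return [int.from_bytes(block[4 * j:4 * j + 4], "little") for j in range(4)]

def row_packed(block):
    """Four 32-bit words; word i holds row i, column 0 in the low byte."""
    return [int.from_bytes(bytes(block[i + 4 * j] for j in range(4)), "little")
            for i in range(4)]

def fully_packed(block):
    """One 128-bit value holding the whole state (requires R >= 128)."""
    return int.from_bytes(block, "little")

block = bytes(range(16))
cols = column_packed(block)
rows = row_packed(block)
```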

Hardware-only implementations
In a hardware-only implementation, execution of AES is performed by a dedicated hardware module (e.g., a memory-mapped co-processor). A large design space exists for hardware implementations of AES. Gaj and Chodowiec [GC00, Section 3.3] give an overview, detailing iterative, combinatorial (unrolled), and pipelined architectures. Similarly, [PMDW04,GB05,GC09] survey concrete implementations on a variety of fabrics including FPGAs and ASICs.
Although hardware-only designs are not our focus, the associated techniques can guide ISE-related design choices. First, they guide the ISE interface. For example, some ISEs can be characterised as offering an interface to hardware constituting one round (i.e., aligned with an iterative hardware implementation). Second, they guide the ISE implementation. For example, a significant body of work focuses on efficient hardware implementation of the S-box: [Can05, BP12, RMTA18].

Software-only implementations
In a software-only implementation, execution of AES and the associated application program is performed by a general-purpose processor core, using only instructions in the base ISA. Since we only consider use of the RISC-V scalar base ISA, we exclude work on the use of vector-like extensions [Ham09].
Software-only techniques are important because many ISEs are evaluated against baseline ISA implementations. Work such as that of Bernstein and Schwabe [BS08], Osvik et al. [OBSC10], and Schwabe and Stoffelen [SS16] presents and compares multiple techniques across a range of platforms, but, for completeness, we present a (limited) survey in what follows.

Compute-oriented.
A compute-oriented implementation of AES favours online computation, thus reducing memory footprint at the cost of increased latency. Following [DR02, Section 4.1], for example, the idea is to simply 1) adopt an array-based representation of state and round key matrices, then 2) construct a round implementation by following the algorithmic description of each round function in a direct manner. Addition in F_{2^8} can be implemented with a base ISA XOR instruction. Base ISA support is rarely present for multiplication and inversion in F_{2^8}, however. Hence it is common to pre-compute the S-box and/or xtime (i.e., multiplication by x) functions. This requires pre-computation and storage of a 256 B look-up table per function, but significantly reduces execution latency.
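A minimal sketch of such pre-computation, written as a Python model rather than RISC-V assembly: xtime is evaluated once per byte value and stored in a 256-entry table, after which MixColumns on one column needs only look-ups and XORs. The names xtime, XTIME, and mix_column are illustrative.

```python
def xtime(a):
    """Multiply a in F_{2^8} by x, reducing modulo x^8 + x^4 + x^3 + x + 1."""
    a <<= 1
    return (a ^ 0x11B) if a & 0x100 else a

# Pre-computed 256 B look-up table, one byte per entry.
XTIME = bytes(xtime(a) for a in range(256))

def mix_column(col):
    """Apply MixColumns to one column (a0, a1, a2, a3) using XOR and XTIME only.

    Uses 02*a_i ^ 03*a_{i+1} ^ a_{i+2} ^ a_{i+3}
       = a_i ^ t ^ xtime(a_i ^ a_{i+1}), where t = a0 ^ a1 ^ a2 ^ a3.
    """
    t = col[0] ^ col[1] ^ col[2] ^ col[3]
    return [col[i] ^ t ^ XTIME[col[i] ^ col[(i + 1) % 4]] for i in range(4)]
```

The S-box can be tabulated in the same way; only the table contents differ.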
On platforms where R = 32, Bertoni et al. [BBF+02] improve execution latency by exploiting the wider data-path. They adopt a row-packed representation of state and round key matrices, implementing ShiftRows using native rotation instructions to act on the packed rows. MixColumns is implemented using the SIMD Within A Register (SWAR) paradigm: applying xtime across a packed row in parallel.
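The following sketch models the row-packed approach on 32-bit words. It assumes column j of each row sits in byte j of the word (least-significant first), so rotating a row left by i elements is a right-rotation of the word by 8i bits. This illustrates the SWAR idea only; it is not the code of Bertoni et al., and the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF

def ror32(w, n):
    """Rotate a 32-bit word right by n bits."""
    n %= 32
    return ((w >> n) | (w << (32 - n))) & MASK32

def shift_rows(rows):
    """ShiftRows on row-packed words: row i rotates left by i elements.
    Assumption: column 0 is held in the least-significant byte."""
    return [ror32(rows[i], 8 * i) for i in range(4)]

def xtime_swar(w):
    """Apply xtime to each of the four bytes packed in w, in parallel."""
    hi = w & 0x80808080                 # top bit of every byte
    lo = (w & 0x7F7F7F7F) << 1          # shift without crossing byte lanes
    return lo ^ ((hi >> 7) * 0x1B)      # conditionally reduce each lane

def mix_columns(rows):
    """MixColumns on row-packed words: all four columns processed at once."""
    t = rows[0] ^ rows[1] ^ rows[2] ^ rows[3]
    return [rows[i] ^ t ^ xtime_swar(rows[i] ^ rows[(i + 1) % 4])
            for i in range(4)]
```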
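Table-oriented. A table-oriented implementation instead favours pre-computation, reducing latency at the cost of increased memory footprint. Following the T-tables technique [DR02, Section 4.2] with a column-packed representation, the j-th column of the next state is computed as
T_0[s^(r)_{0,j}] ⊕ T_1[s^(r)_{1,j+1}] ⊕ T_2[s^(r)_{2,j+2}] ⊕ T_3[s^(r)_{3,j+3}] ⊕ (j-th column of rk^(r)), with column indices taken modulo 4,
where T_i[x] is the packed column obtained by applying MixColumns to a column holding S-box(x) in row i and zero elsewhere, extraction of elements caters for ShiftRows, then XOR'ing the j-th column of rk^(r) caters for AddRoundKey.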
As such, each round becomes a sequence of look-ups into T_i, plus XORs to combine their result. Doing so demands pre-computation and storage of a 256 · 4 B = 1 kB look-up table per T_i. The overhead related to extraction of each element from packed columns representing s^(r) (to form look-up table offsets) can be significant: Fiskiran and Lee [FL01] analyse the impact of different addressing modes on this issue, with Stoffelen [Sto19, Section 3.1] concluding that RISC-V is ill-equipped to reduce said overhead, due to the provision of a sparse set of addressing modes. Further, in systems with data caches, T-table based implementations are susceptible to timing attacks [Ber05].
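The sketch below illustrates a T-table round under stated assumptions: columns are packed with row 0 in the least-significant byte, T[0][x] packs MixColumns applied to the column (S-box(x), 0, 0, 0), and T[i] is T[0] rotated left by i bytes. The names T and ttable_round are hypothetical, and the S-box is generated at start-up only to keep the example self-contained.

```python
def gf_mul(a, b):
    """Multiply two elements of F_{2^8} modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a = (a << 1) ^ (0x11B if a & 0x80 else 0)
        b >>= 1
    return r

def sbox(x):
    y = 1
    for _ in range(254):            # inversion as x^254
        y = gf_mul(y, x)
    rot = lambda v, k: ((v << k) | (v >> (8 - k))) & 0xFF
    return y ^ rot(y, 1) ^ rot(y, 2) ^ rot(y, 3) ^ rot(y, 4) ^ 0x63

def rotl32(w, n):
    return ((w << n) | (w >> (32 - n))) & 0xFFFFFFFF

def t0(x):
    s = sbox(x)
    # Column (02*s, 01*s, 01*s, 03*s); assumption: row 0 in the low byte.
    return gf_mul(s, 2) | (s << 8) | (s << 16) | (gf_mul(s, 3) << 24)

T0 = [t0(x) for x in range(256)]
# Four tables of 256 entries x 4 B = 1 kB each.
T = [[rotl32(v, 8 * i) for v in T0] for i in range(4)]

def ttable_round(cols, rk):
    """One full round: SubBytes, ShiftRows, MixColumns, AddRoundKey."""
    out = []
    for j in range(4):
        acc = rk[j]
        for i in range(4):
            # Element in row i of column (j + i) mod 4: this realises ShiftRows.
            x = (cols[(j + i) % 4] >> (8 * i)) & 0xFF
            acc ^= T[i][x]
        out.append(acc)
    return out
```

The shift-and-mask on each look-up is the element extraction overhead discussed above; on a real target it costs explicit instructions per table access.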
Bit-sliced. Bit-slicing is an implementation technique due to Biham [Bih97], which constitutes 1. a non-standard representation of data where each R-bit word x is transformed into x̂, i.e., R slices, say x̂[i] for 0 ≤ i < R, where x̂[i]_j = x_i for some j, and 2. a non-standard implementation of operations: each operation f used as r = f(x) must be transformed into a "software circuit" f̂, i.e., a sequence of Boolean instructions acting on the slices s.t. r̂ = f̂(x̂).
Bit-slicing introduces some overhead related to conversion of x into x̂ and r̂ into r, plus the (relative) inefficiency of f̂ vs. f wrt. latency and footprint. However, if each slice is itself an R-bit word, then it is possible to compute R instances of f̂ in parallel on suitably packed x̂. A common analogy is that of transforming the R-bit, 1-way scalar processor into a 1-bit, R-way SIMD processor, thus giving (or recouping) up to an R-fold improvement in latency.
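As a minimal sketch of both steps, the Python model below converts up to R = 32 byte values into 8 slices and evaluates xtime as a "software circuit" on those slices, thereby computing up to 32 instances in parallel. The helper names are hypothetical, and the choice of xtime as the operation f is purely illustrative.

```python
R = 32  # word size: each slice holds one bit from each of up to R inputs

def to_slices(xs, width=8):
    """Bit-slice: bit j of slices[i] equals bit i of xs[j]."""
    return [sum(((x >> i) & 1) << j for j, x in enumerate(xs))
            for i in range(width)]

def from_slices(slices, count):
    """Inverse conversion back to `count` ordinary values."""
    return [sum(((s >> j) & 1) << i for i, s in enumerate(slices))
            for j in range(count)]

def xtime_sliced(b):
    """xtime as a circuit on slices: a renaming of slices plus three XORs
    with the top slice, since the reduction constant 0x1B has bits 0, 1, 3, 4
    set and the bit shifted into position 0 is always zero."""
    return [b[7], b[0] ^ b[7], b[1], b[2] ^ b[7], b[3] ^ b[7], b[4], b[5], b[6]]
```

For example, from_slices(xtime_sliced(to_slices(xs)), len(xs)) applies xtime to every element of xs using the same three XORs regardless of how many inputs (up to R) are packed.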
As evidenced by [MN07,K08] and [KS09], the application of bit-slicing to AES can be very effective; Stoffelen [Sto19, Section 3.1] specifically investigates this fact within the context of RISC-V.

Existing AES ISEs
Here, we survey AES-related ISE designs split into 1) industry-specified ISEs, which are standard extensions, and 2) academia-specified ISEs, which are non-standard extensions, wrt. a given base ISA. Each ISE is classified as either workload-specific, if it is only useful for AES, or workload-agnostic, if it is useful for AES and other workloads. Note that we exclude work where an ISE for another workload can be applied to AES but was not designed for AES (see, e.g., Tillich and Großschädl [TG04] who apply an ISE intended for ECC to AES).

Standard, industry-specified ISEs
Intel introduced support for AES in x86 per [X8618a, Section 12.13]. Instructions use a destructive 2-address (1 source, 1 source/destination) or non-destructive 3-address (2 source, 1 destination) format depending on the variant (e.g., XMM- vs. AVX-based), and operate on data housed in the pre-existing vector register file, implying R = 128. AES is implemented by 1) adopting a fully-packed representation of state and round key matrices, then 2) using AESENC [X8618b, Page 3-54] to construct a round implementation as
AESENC → AddRoundKey ∘ MixColumns ∘ SubBytes ∘ ShiftRows
IBM introduced support for AES in POWER per [POW18, Section 6.11.1]. Instructions use a non-destructive 3-register (2 source, 1 destination) format, and operate on data housed in the pre-existing vector register file, implying R = 128. AES is implemented by 1) adopting a fully-packed representation of state and round key matrices, then 2) using vcipher [POW18, Page 304] to construct a round implementation as
vcipher → AddRoundKey ∘ MixColumns ∘ ShiftRows ∘ SubBytes
ARM introduced support for AES in ARMv8-A [ARM20]. Instructions use a destructive 2-address (1 source, 1 source/destination) format, and operate on data housed in the pre-existing vector register file, implying R = 128. AES is implemented by 1) adopting a fully-packed representation of state and round key matrices, then 2) using AESE [ARM20, Section C7.2.8] and AESMC [ARM20, Section C7.2.10] to construct a round implementation as
(AESE; AESMC) → MixColumns ∘ ShiftRows ∘ SubBytes ∘ AddRoundKey
Oracle introduced support for AES in SPARC per [SPA16, Sections 7.3+7.4]. Instructions use a non-destructive 4-address (3 source, 1 destination) format, and operate on data housed in the pre-existing general-purpose register file, implying R = 64. AES is implemented by 1) using a column-packed representation of state and round key matrices, then 2) using AES_EROUND01 [SPA16, Page 109] and AES_EROUND23 [SPA16, Page 109] to construct a round implementation as
(AES_EROUND01; AES_EROUND23) → AddRoundKey ∘ MixColumns ∘ ShiftRows ∘ SubBytes
in two steps: the first step processes columns 0 and 1 via AES_EROUND01 whereas the second step processes columns 2 and 3 via AES_EROUND23.

Non-standard, academia-specified ISEs
Burke et al. [BMA00] propose a workload-agnostic ISE based on workload characterisation for the DEC Alpha architecture [C+14]. Per [BMA00], pertinent examples for AES include a) ROL and ROR, which perform left- and right-rotate, and b) SBOX, which extracts elements to form look-up table offsets. In one configuration, the resulting memory accesses are supported by a set of special-purpose "S-box caches".
Fiskiran and Lee [FL05] propose a workload-agnostic ISE that employs a so-called Parallel Table Lookup Module (PTLU) for a "RISC like" instruction set. For AES, this accelerates implementations based on T-tables by affording an addressing mode that a) integrates extraction of elements to form look-up table offsets, and b) performs the associated table look-ups in parallel, supported by a dedicated scratch-pad memory.
Biham et al. [BAK98,Page 232] propose (in theory) and Grabher et al. [GGP08] explore (in practice) a workload-agnostic ISE that supports bit-sliced implementations for their custom CRISP ("RISC like") architecture. The ISE allows computation using configurable 4-input, 2-output Boolean functions, vs. fixed 2-input, 1-output alternatives such as NOT, AND, OR, and XOR. Sequences of native Boolean instructions, which dominate bit-sliced implementations, can thereby be "compressed" into use of the ISE. Doing so improves both latency and footprint. [GGP08, Section 4] details the application to AES.
Nadehara et al. [NIK04] propose a workload-specific ISE that could be described as "hardware-assisted T-tables": observing that, ∀x, each entry T_i[x] can be derived from S-box(x) alone, they support on-the-fly computation (vs. via look-up) of T-table entries. The ISE constitutes a single instruction AESENC → T_i, supported by a dedicated hardware module (see [NIK04, Figure 6]). Instances of AESENC 1) extract an input element from a packed input column, 2) use the input to compute an output element equivalent to a look-up from the T-table, and 3) store the output element into a packed output column. This approach was reapplied by Saarinen [Saa20] within the context of RISC-V.
Tillich et al. [TGS05] propose a workload-specific ISE that could be described as "hardware-assisted S-box" for the SPARC V8 architecture. The ISE constitutes a single instruction sbox → SubBytes, supported by a dedicated hardware module (see [TGS05, Figure 1]). Instances of sbox 1) extract an input element from a packed input row or column, 2) use the input to compute an output element equivalent to a look-up from the S-box, and 3) insert the output element into a packed output row or column. Using insert vs. overwrite semantics allows ShiftRows to be computed for free.
Tillich and Großschädl [TG06] propose a workload-specific ISE that could be described as "hardware-assisted round functions" for the SPARC V8 architecture.

Security
While the security of AES against a cryptanalytic attack is defined by the design, and so is out of scope, implementation attacks are of central importance. An implementation attack focuses on the concrete instance of a construct rather than the abstract specification. Countermeasures against such attacks must therefore be considered alongside the implementations they relate to. Since AES is an important target, a significant body of literature exists around implementation attacks on it, including both active (e.g., fault injection) and passive (i.e., side-channel monitoring) attack techniques. The latter can be sub-divided into those dependent on analogue (power-based [MOP07]) or discrete (time-based [KQ99]) leakage.
Use of ISEs can provide some inherent protection against certain attacks. For example, ISEs typically yield constant time execution, preventing some classes of timing or micro-architectural attack techniques (see [Sze19, Section 4] and [GYCH18, Section 4]). Unfortunately, use of ISEs also presents some unique challenges. For example, Saab et al. [SRH16] discuss power-based attacks on AES-NI; concluding that naive use of AES-NI yields exploitable information leakage. Mitigation of such leakage demands the ISE address instances where the leakage stems from "inside" the ISE, and work with appropriate countermeasures (e.g., hiding [MOP07, Chapter 7] or masking [MOP07, Chapter 10]). Tillich et al. [THM07] consider this problem to an extent, including an ISE-based option in their investigation of hardened AES implementations. However, the challenge of developing suitable ISEs is under-studied in general.

Exploring AES ISEs for RISC-V
Section 2.3 outlined a range of ISE designs, demonstrating a large design space of options that we could consider. To narrow the design space into those we do consider, we use the requirements outlined below:
Requirement 1. The ISE must support 1) AES encryption and decryption, and 2) all parameter sets, i.e., AES-128, AES-192, and AES-256. Support for auxiliary operations, e.g., key schedule, is an advantage but not a requirement.
Requirement 2. The ISE must align with the wider RISC-V design principles. This means it should favour simple building-block operations, and use instruction encodings with at most 2 source registers and 1 destination register. This avoids the cost of a general-purpose register file with more than 2 read ports or 1 write port.
Requirement 3. The ISE must use the RISC-V general-purpose scalar register file to store operands and results, rather than any vector register file. This requirement excludes the majority of standard ISEs outlined in Section 2.3.
Requirement 4. The ISE must not introduce special-purpose architectural state, nor rely on special-purpose micro-architectural state (e.g., caches or scratch-pad memory).
Requirement 5. The ISE must enable data-oblivious execution of AES, preventing timing attacks based on execution latency (e.g., stemming from accesses to a pre-computed S-box).
Requirement 6. The ISE must be efficient, in terms of improvement in execution latency per area required: this balances the value in both metrics vs. an exclusive preference for one or the other. Efficiency wrt. auxiliary metrics, e.g., memory footprint or instruction encoding points, is an advantage but not a requirement.
Overall, the requirements combine to intentionally target the ISE at low(er)-end, resource-constrained (e.g., embedded) platforms. We view such a focus as reasonable, because existing work on adding cryptographic support to the standard vector extension [RV:19a, Section 21] already caters for high(er)-end alternatives.
We arrive at five ISE variants using the requirements, the description of which is split into an intuitive description in the following Sections and a technical description (e.g., a list of instructions and their semantics) in an associated Appendix.

Variant 1 (V_1): SubBytes + MixColumn + explicit ShiftRows
By reproducing [TG06, Section 4.2], V_1 assumes XLEN = 32 and adopts a column-packed representation of state and round key matrices. As detailed in Figure 2, V_1 adds 4 instructions (2 for encryption, 2 for decryption). For example, saes.v1.encs applies SubBytes to elements in a packed column, and saes.v1.encm applies MixColumn to a packed column; the instruction format for saes.v1.encs and saes.v1.encm specifies 1 source and 1 destination register. Since saes.v1.encs requires 4 applications of the S-box, a trade-off between latency and area is possible s.t. n physical S-box instances are (re)used in 4/n cycles (e.g., 1 instance in 4 cycles, or 4 instances in 1 cycle). Figure 7 demonstrates that use of V_1 to implement AES encryption requires 47 instructions per round: 4 lw instructions to load the round key, 4 xor instructions to apply AddRoundKey, 4 saes.v1.encs instructions to apply SubBytes, 31 instructions to apply ShiftRows, and 4 saes.v1.encm instructions to apply MixColumns.
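The data flow of such a round can be modelled as follows. This is a behavioural sketch in Python, not the instruction semantics of Figure 2: encs and encm stand in for saes.v1.encs and saes.v1.encm, the byte order within a packed column (row 0 in the least-significant byte) is assumed, and shift_rows_sw models the explicit software ShiftRows step without reproducing its 31-instruction sequence.

```python
def gf_mul(a, b):
    """Multiply two elements of F_{2^8} modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a = (a << 1) ^ (0x11B if a & 0x80 else 0)
        b >>= 1
    return r

def sbox(x):
    y = 1
    for _ in range(254):            # inversion as x^254
        y = gf_mul(y, x)
    rot = lambda v, k: ((v << k) | (v >> (8 - k))) & 0xFF
    return y ^ rot(y, 1) ^ rot(y, 2) ^ rot(y, 3) ^ rot(y, 4) ^ 0x63

SBOX = [sbox(x) for x in range(256)]

def encs(col):
    """Model of saes.v1.encs: SubBytes on each element of a packed column."""
    return sum(SBOX[(col >> (8 * i)) & 0xFF] << (8 * i) for i in range(4))

def encm(col):
    """Model of saes.v1.encm: MixColumn on a packed column."""
    a = [(col >> (8 * i)) & 0xFF for i in range(4)]
    b = [gf_mul(a[i], 2) ^ gf_mul(a[(i + 1) % 4], 3) ^ a[(i + 2) % 4] ^ a[(i + 3) % 4]
         for i in range(4)]
    return sum(b[i] << (8 * i) for i in range(4))

def shift_rows_sw(cols):
    """Explicit ShiftRows: row i of new column j comes from column (j + i) mod 4."""
    return [sum(cols[(j + i) % 4] & (0xFF << (8 * i)) for i in range(4))
            for j in range(4)]

def v1_round(cols, rk):
    cols = [c ^ k for c, k in zip(cols, rk)]   # 4 x xor: AddRoundKey
    cols = [encs(c) for c in cols]             # 4 x saes.v1.encs: SubBytes
    cols = shift_rows_sw(cols)                 # explicit ShiftRows in software
    return [encm(c) for c in cols]             # 4 x saes.v1.encm: MixColumns
```

The model makes the cost structure visible: the two ISE steps are single calls per column, whereas ShiftRows must gather bytes from all four columns using base ISA operations.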

Variant 2 (V_2): SubBytes + MixColumn + implicit ShiftRows
By reproducing [TG06, Section 4.3], V_2 assumes XLEN = 32 and adopts a column-packed representation of state and round key matrices. As detailed in Figure 3, V_2 adds 4 instructions (2 for encryption, 2 for decryption). For example, saes.v2.encs applies SubBytes to elements in a packed column, and saes.v2.encm applies MixColumn to a packed column; the instruction format for saes.v2.encs and saes.v2.encm specifies 2 source and 1 destination register. V_2 improves V_1 by applying ShiftRows implicitly: this is possible by careful indexing of elements in source and destination columns during application of SubBytes and MixColumns, and also permits saes.v2.encs to be used within the key schedule. The same trade-off is possible as in V_1, whereby n physical S-box instances are (re)used in 4/n cycles (e.g., 1 instance in 4 cycles, or 4 instances in 1 cycle). Figure 8 demonstrates that use of V_2 to implement AES encryption requires 16 instructions per round: 4 lw instructions to load the round key, 4 xor instructions to apply AddRoundKey, 4 saes.v2.encs instructions to apply SubBytes, and 4 saes.v2.encm instructions to apply MixColumns. In the N_r-th round, which omits MixColumns, ShiftRows must be applied explicitly using an additional 12 instructions.

Variant 3 (V_3): hardware-assisted T-tables
V_3 follows [NIK04, BBFR06, Saa20]; it assumes XLEN = 32 and adopts a column-packed representation of state and round key matrices. As detailed in Figure 4, V_3 adds 4 instructions (2 for encryption, 2 for decryption). The basic idea is to support an implementation strategy aligned with use of T-tables [DR02, Section 4.2], but compute entries in hardware vs. storing the look-up entries in memory. For example, saes.v3.encsm extracts an element from a packed column, applies SubBytes to the element, expands the element into a packed column, applies MixColumn, then applies AddRoundKey. The inclusion of AddRoundKey follows [Saa20], which improves on [NIK04, BBFR06]; as a result of this, the instruction format for saes.v3.encsm specifies 2 source and 1 destination register. The requirement for 1 application of the S-box allows for a more efficient functional unit than V_1 or V_2, for example, either wrt. latency or area. Figure 9 demonstrates that use of V_3 to implement AES encryption requires 20 instructions per round: 4 lw instructions to load the round key, and 16 saes.v3.encsm instructions to apply SubBytes, ShiftRows, MixColumns, and AddRoundKey. In the N_r-th round, which omits MixColumns, saes.v3.encsm is replaced by saes.v3.encs.
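A behavioural sketch of this idea, under stated assumptions: encsm below stands in for saes.v3.encsm, taking an accumulator (initially a round key word), a packed source column, and a byte-select index bs. Treating bs as an instruction immediate and packing row 0 into the least-significant byte are assumptions made for illustration; the authoritative semantics are those of Figure 4.

```python
def gf_mul(a, b):
    """Multiply two elements of F_{2^8} modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a = (a << 1) ^ (0x11B if a & 0x80 else 0)
        b >>= 1
    return r

def sbox(x):
    y = 1
    for _ in range(254):            # inversion as x^254
        y = gf_mul(y, x)
    rot = lambda v, k: ((v << k) | (v >> (8 - k))) & 0xFF
    return y ^ rot(y, 1) ^ rot(y, 2) ^ rot(y, 3) ^ rot(y, 4) ^ 0x63

SBOX = [sbox(x) for x in range(256)]

def rotl32(w, n):
    return ((w << n) | (w >> (32 - n))) & 0xFFFFFFFF

def encsm(rs1, rs2, bs):
    """Hypothetical model of saes.v3.encsm (bs assumed to be an immediate)."""
    s = SBOX[(rs2 >> (8 * bs)) & 0xFF]                 # extract + SubBytes
    col = gf_mul(s, 2) | (s << 8) | (s << 16) | (gf_mul(s, 3) << 24)  # expand + MixColumn
    return rs1 ^ rotl32(col, 8 * bs)                   # align to row bs, accumulate

def encs(rs1, rs2, bs):
    """Hypothetical model of saes.v3.encs: as encsm, but without MixColumn."""
    s = SBOX[(rs2 >> (8 * bs)) & 0xFF]
    return rs1 ^ (s << (8 * bs))

def v3_round(cols, rk, last=False):
    op = encs if last else encsm
    out = []
    for j in range(4):
        acc = rk[j]                                    # lw: round key word
        for i in range(4):
            # Sourcing column (j + i) mod 4 for row i realises ShiftRows.
            acc = op(acc, cols[(j + i) % 4], i)
        out.append(acc)
    return out
```

Each output column takes 4 calls, giving the 16 per round counted above; starting the accumulator at the round key word is what folds AddRoundKey into the same instructions.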

Variant 4 (V_4): 64-bit data-path
V_4 requires XLEN = 64 and adopts a double column-packed representation of state and round key matrices, i.e., two columns (or 8 elements) are packed into a 64-bit word. It is similar in principle to the SPARC [SPA16, Page 109] ISE, but improves on it by adhering to the 2 source and 1 destination register format. By sourcing two 64-bit registers, and writing a single 64-bit register, a single instruction can accept all of the current round state as input and produce half of the next round state as output.
SPARC [SPA16, Page 109] adds 9 instructions (4 for encryption, 4 for decryption, and 1 auxiliary). For example, AES_EROUND01 and AES_EROUND23 produce columns 0 and 1 and columns 2 and 3 respectively. Each instruction sources three 64-bit registers, and writes a single 64-bit register. As shown in Figure 5, V_4 improves this by adding only 7 instructions (2 for encryption, 2 for decryption, and 3 auxiliary). This is realised by utilising the Equivalent Inverse Cipher representation detailed in [FIP01, Section 5.3.5]. This enables all of the round transformations to be applied in the same order for both encryption and decryption. The AddRoundKey step can then be lifted out of the round function instructions (where otherwise it would appear in the middle of the decryption round), and implemented using a base ISA xor instruction. The round key then no longer needs to be an input to the instruction, meaning it only needs 2 source register operands. We then note that the nature of ShiftRows means we do not need separate instructions to compute the next values of columns (0,1) or columns (2,3) as the SPARC instructions do. Instead, we can simply reverse the order of the source register operands, and get the same effect. This is detailed in Figure 5, and an example round function is shown in Figure 10.
For example, saes.v4.encsm rd, rA, rB applies SubBytes, ShiftRows, and MixColumns to elements in the packed columns and produces the next round values for packed columns (0,1). Executing saes.v4.encsm rd, rB, rA, with no change in values of rA or rB, will produce the next round state values for packed columns (2,3). Figure 10 demonstrates that use of V_4 to implement AES encryption requires 6 instructions per round: 2 ld instructions to load the round key, 2 xor instructions to apply AddRoundKey, and 2 saes.v4.encsm instructions to apply SubBytes, ShiftRows, and MixColumns. In the N_r-th round, which omits MixColumns, saes.v4.encsm is replaced by saes.v4.encs. Note that use of the Equivalent Inverse Cipher representation necessitates inclusion of the saes.v4.imix instruction, in order to efficiently apply the inverse MixColumn step to words of the key schedule.
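The operand-swapping trick can be checked with a small behavioural model. In the sketch below, encsm stands in for saes.v4.encsm; the packing (columns 0 and 1 in rA, columns 2 and 3 in rB, with the lower-numbered column and row in the less-significant bytes) and the placement of the xor after encsm in v4_round are assumptions made for illustration, with the authoritative semantics being those of Figure 5 and Figure 10.

```python
def gf_mul(a, b):
    """Multiply two elements of F_{2^8} modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a = (a << 1) ^ (0x11B if a & 0x80 else 0)
        b >>= 1
    return r

def sbox(x):
    y = 1
    for _ in range(254):            # inversion as x^254
        y = gf_mul(y, x)
    rot = lambda v, k: ((v << k) | (v >> (8 - k))) & 0xFF
    return y ^ rot(y, 1) ^ rot(y, 2) ^ rot(y, 3) ^ rot(y, 4) ^ 0x63

SBOX = [sbox(x) for x in range(256)]
M32 = 0xFFFFFFFF

def mix_column(a):
    return [gf_mul(a[i], 2) ^ gf_mul(a[(i + 1) % 4], 3) ^ a[(i + 2) % 4] ^ a[(i + 3) % 4]
            for i in range(4)]

def encsm(ra, rb):
    """Hypothetical model of saes.v4.encsm: two columns of the next state."""
    cols = [ra & M32, ra >> 32, rb & M32, rb >> 32]
    out = 0
    for j in range(2):
        # ShiftRows + SubBytes: row i of output column j comes from
        # input column (j + i) mod 4, counted from the first operand.
        a = [SBOX[(cols[(j + i) % 4] >> (8 * i)) & 0xFF] for i in range(4)]
        b = mix_column(a)
        out |= sum(b[i] << (8 * i) for i in range(4)) << (32 * j)
    return out

def v4_round(ra, rb, rk01, rk23):
    # Swapping the operands relabels columns (2,3) as (0,1): because ShiftRows
    # rotates by whole columns, the same instruction yields the other half.
    return encsm(ra, rb) ^ rk01, encsm(rb, ra) ^ rk23
```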

Variant 5 (V 5 ): quadrant-packed
V 5 assumes XLEN = 32 and adopts a novel, quadrant-packed representation of state and round key matrices as shown in Figure 1. This means that each quadrant of the standard 4 × 4 byte AES state representation is packed into a single 32-bit register word. This allows either two complete rows (to perform ShiftRows) or two complete columns (to perform MixColumns) of the state can be accessed by accessing two quadrants. Based on this, such a representation can 1) afford advantages of both row-and column-packed alternatives, and 2) allow an instruction format that meets the 2 source and 1 destination register address constraint of a RISC-V pipeline. However, it also requires conversion of any input and output data between quadrant-packed and standard column-packed representation.
Although such conversion is amortised by N r rounds of computation, it still represents an overhead vs. other variants.
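To make the representation concrete, the sketch below converts a column-major state into four 2 × 2 quadrants and back. It is a minimal illustration under stated assumptions: the quadrant numbering (0 = top-left, 1 = top-right, 2 = bottom-left, 3 = bottom-right) and the byte order inside each quadrant are hypothetical choices, and the layout actually used by V 5 is the one defined in Figure 1.

```python
# Quadrant packing sketch. State is 16 bytes, column-major: state[r + 4*c].
# Quadrant order and intra-quadrant byte order are assumptions for illustration.

def to_quadrants(state):
    """Split a 16-byte column-major state into four 4-byte quadrants."""
    quads = []
    for qr in range(2):          # quadrant row: 0 = rows 0-1, 1 = rows 2-3
        for qc in range(2):      # quadrant column: 0 = cols 0-1, 1 = cols 2-3
            quads.append(bytes(
                state[r + 4 * c]
                for c in (2 * qc, 2 * qc + 1)
                for r in (2 * qr, 2 * qr + 1)
            ))
    return quads

def from_quadrants(quads):
    """Inverse of to_quadrants: rebuild the column-major state."""
    state = bytearray(16)
    for q, quad in enumerate(quads):
        qr, qc = divmod(q, 2)
        i = 0
        for c in (2 * qc, 2 * qc + 1):
            for r in (2 * qr, 2 * qr + 1):
                state[r + 4 * c] = quad[i]
                i += 1
    return bytes(state)
```

With this layout, quadrants 0 and 1 together contain rows 0 and 1 in full, while quadrants 0 and 2 together contain columns 0 and 1 in full, which is the property the two-source-operand instruction format relies on.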
As detailed in Figure 6, V 5 adds 7 instructions (3 for encryption, 3 for decryption, and 1 auxiliary). Taking encryption as an example, we define two instructions to perform the ShiftRows and SubBytes steps. saes.v5.esrsub.lo performs ShiftRows and SubBytes on the two bottom quadrants, and saes.v5.esrsub.hi does the same for the two top quadrants. The two instructions are necessary to account for the different rotation amounts applied to the top and bottom rows as part of ShiftRows. A single instruction saes.v5.emix applies the MixColumns transformation to two columns. The instruction can source two entire columns owing to the quadrant-packed representation, but can only write a single quadrant back. Hence, two executions of the same instruction are needed to apply the entire MixColumns step to each pair of quadrants. Figure 11 demonstrates that use of V 5 to implement AES encryption requires 16 instructions per round: 4 lw instructions to load the round key, 4 xor instructions to apply AddRoundKey, 4 saes.v5.esrsub.[lo|hi] instructions to apply SubBytes and ShiftRows, and 4 saes.v5.emix instructions to apply MixColumns. Note that conversion into (resp. from) quadrant-packed representation requires a further 12 instructions; this can be reduced to 4 pack[h] instructions using the standard bit-manipulation extension [RV:19a, Section 17].
V 5 instructions may be implemented with between 1 and 4 SBox instances, with a corresponding tradeoff between area and latency. As with V 1 and V 2 however, additional storage elements are required if fewer than 4 SBoxes are instanced in order to store intermediate results. The auxiliary saes.v5.sub instruction is used during the Key-Schedule, and can act simply as an interface to the SBoxes already required by the round instructions.

Implementation
The evaluation of each ISE considers two different RISC-V compliant base micro-architectures, which constitute two different host cores:
• The SCARV 4 core supports the RV32IMC instruction set, i.e., the 32-bit [RV:19a, Section 2] base integer ISA plus standard Multiplication [RV:19a, Section 7] and Compressed [RV:19a, Section 16] extensions. Per the block diagram shown in Figure 12, the core executes instructions using a 5-stage, in-order pipeline. No branch prediction is supported. There are two memory interfaces for instruction fetch and data memory accesses. No instruction or data caches are supported. The core implements various performance counters, and elements of the RISC-V Privileged Resource Architecture (PRA) [RV:19b, Chapter 3] related to exception and interrupt handling.
• The Rocket [AAB + 16] core executes instructions using a 5-stage, in-order pipeline which is highly configurable. We take advantage of this, considering two variants whose exact configuration is outlined in Figure 13 and Figure 14: the variants represent single 32-bit and 64-bit cores respectively, and so support the RV32IMC (resp. RV64IMC) instruction set, i.e., the 32-bit [...] Rocket Custom Coprocessor (RoCC) [AAB + 16, Section 4] interface.
Since Requirement 2 (each instruction uses at most 2 source and 1 destination register) is fulfilled, neither micro-architecture required further structural alteration. A synthesis-time parameter was used to switch between different ISEs.

Evaluation
Hardware. Each ISE variant was integrated into the two host cores described in Section 3.6. The variants which assume XLEN = 32 (V 1 , V 2 , V 3 , and V 5 ) were evaluated on both the 32-bit SCARV core and the 32-bit Rocket core; the variant which assumes XLEN = 64 (V 4 ) was evaluated on only the 64-bit Rocket core. For V 1 , V 2 and V 5 a trade-off between latency and area exists. Each such case is considered through two optimisation goals: the (A)rea goal instantiates 1 S-box and has an n-cycle execution latency, whereas the (L)atency goal instantiates 4 S-boxes and has a 1-cycle execution latency. We focus on ASIC implementations (rather than FPGA implementations) because this is the more relevant metric to the industrial (rather than academic) RISC-V community. Table 1 shows the separated cost of the standalone ISE logic and the combined cost of the core and integrated ISE. Numbers highlighted in bold are the best result for each metric. The Baseline rows indicate the metrics for the unmodified host CPU cores.
We use the open source Yosys [Wol] synthesis tool (v0.9+1706) with default settings to provide post-synthesis (as opposed to post-layout) circuit area in the form of NAND2 gate equivalents (ISE Area, Tables 1 and 8) and circuit depths in the form of gate delays (ISE Latency, Tables 1 and 8). While more abstract than providing exact area and frequency results for a particular ASIC standard cell library, this approach is much easier to reproduce 5 while still providing meaningful results. This methodology has also been used for other RISC-V standard extension proposals, namely the bit-manipulation extension [ris, Section 3.1, Page 54]. We found that none of the ISEs affected the critical gate delay path of either the SCARV or Rocket core. These were 97 for the 32-bit SCARV core and 231 and 167 for the 32 and 64-bit Rocket core respectively 6 . Considering each ISE as implemented on the Rocket core, we note the overhead wrt. area is marginal: this stems from the fact that the baseline area of Rocket includes the data and instruction caches.
In Table 8 we consider the hardware costs when only encryption instructions are implemented. This is relevant to systems which only care about certain block cipher modes of operation, such as Galois/Counter Mode. We discuss this further in Section 4.

Software. We evaluated each ISE variant by implementing AES-128 Enc and Dec, plus Enc-KeyExp and Dec-KeyExp. We use our own non-ISE, T-table based implementation as a baseline. The variants which assume XLEN = 32 (V 1 , V 2 , V 3 , and V 5 ) used a rolled strategy wrt. loops: V 1 , V 2 , and V 5 used 1 round per iteration, whereas V 3 used 2 rounds per iteration to avoid needless register move operations. The variant which assumes XLEN = 64 (V 4 ) used an unrolled strategy. In all cases the state is naturally aligned, 7 meaning any input (resp. output) can be loaded (resp. stored) using 4 lw instructions on a 32-bit core or 2 ld instructions on a 64-bit core.
Table 2 records the memory footprint (i.e., code footprint and static data footprint) of each software implementation. Again, numbers highlighted in bold are the best result for each metric. Where an entry for Dec-KeyExp is zero, this implies that Enc-KeyExp = Dec-KeyExp so there is no overhead. Where an entry for Dec-KeyExp is non-zero, this implies that Enc-KeyExp ≠ Dec-KeyExp, and the equivalent inverse cipher construction [FIP01, Section 5.3.5] is used. This allows Dec-KeyExp to call Enc-KeyExp, then perform some additional post-processing, with the quoted footprint therefore reflecting the latter only. Table 3 and Table 4 record instruction (i.e., iret) and cycle counts of each implementation, as executed on the SCARV and Rocket cores respectively.
Discussion. Table 1 demonstrates that all ISE variants imply a modest area overhead relative to their host core. For the RV32 Rocket the area overhead of a synthesised Rocket Tile with caches was less than 1% in all cases. For the SCARV core, the area overhead ranged between 13% (V 5 (L)) and 3% (V 3 ). Table 2 shows all ISE variants having similarly small memory footprints in terms of both instruction code and data. Beyond this, and per Section 3, the primary metric of interest is efficiency in terms of the latency-area product. This metric draws on data from Table 1 plus either Table 3 or Table 4 for the SCARV or Rocket core respectively. We note the small difference in instruction count in some cases between the cores. This is due to slightly different compiler behaviour at the measured function call sites in each core: the Rocket core saves an extra register to the stack. We deliberately omit the area of the host core from this calculation, as this fixed overhead dominates the final value and detracts from the comparison between ISEs themselves. Table 5 captures the results for the Rocket core, although the same conclusion can be drawn for the SCARV core. Qualitatively, we place more of a weight on Encryption (Enc) and Decryption (Dec) vs. Encryption Key Expansion (Enc-KeyExp) and Decryption Key Expansion (Dec-KeyExp), because typically many Enc or Dec operations are performed per KeyExp.
For a 32-bit core, our conclusion is that V 3 is the best option. Despite not being the fastest (by a small margin), it is the most efficient, and simplest to implement. The area optimised V 2 implementation sometimes comes close in efficiency, but requires a more complex multi-cycle implementation in this case. We note that V 3 has relatively poor performance for the decryption key schedule. This is because it uses the Equivalent Inverse Cipher representation, and must first create an encryption oriented key schedule, before applying the Inverse MixColumns transform to each word in the key schedule. Each word requires 8 instructions to apply only the Inverse MixColumns transform. We believe this is reasonable, as one typically performs many block decryptions per key schedule operation. We also note that for the common AES-GCM use case, decryption functionality is not necessary. We discuss this further in Section 4. Compared to past work, our implementation of V 3 is slightly smaller than its original description in [Saa20]: 1149 vs. 1240 gates. [Saa20] estimates a 5× performance improvement, which is slightly better than our measured 4× improvement, though this is dependent on relative memory access latencies. We would expect this improvement to increase in systems which store T-tables in (relatively) high latency flash memory. V 3 performs considerably better than [TGS05], which achieves only a 2× speedup in the best case. We note that despite needing the same number of instructions per round as V 2 (based on [TGS05]), our V 5 design suffers in terms of performance. This is due to the conversion between quadrant-packed and column-packed representations.
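The decryption key schedule post-processing mentioned above follows the Equivalent Inverse Cipher of [FIP01, Section 5.3.5]: the encryption round keys are computed first, then Inverse MixColumns is applied to every word except those of the first and last round keys. The Python sketch below models that step in plain software; the function names are hypothetical, and it does not model how V 3 or V 4 realise the transform with ISE instructions.

```python
# Sketch of the Equivalent Inverse Cipher key schedule post-processing.

def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r

def inv_mix_column(col):
    """Apply the InvMixColumns matrix (0e 0b 0d 09) to one 4-byte word."""
    a0, a1, a2, a3 = col
    return bytes([
        gf_mul(a0, 14) ^ gf_mul(a1, 11) ^ gf_mul(a2, 13) ^ gf_mul(a3, 9),
        gf_mul(a0, 9) ^ gf_mul(a1, 14) ^ gf_mul(a2, 11) ^ gf_mul(a3, 13),
        gf_mul(a0, 13) ^ gf_mul(a1, 9) ^ gf_mul(a2, 14) ^ gf_mul(a3, 11),
        gf_mul(a0, 11) ^ gf_mul(a1, 13) ^ gf_mul(a2, 9) ^ gf_mul(a3, 14),
    ])

def dec_round_keys(enc_words, nr):
    """Derive decryption round-key words from the encryption key schedule.

    enc_words: list of 4*(nr+1) four-byte words (nr = 10 for AES-128).
    Words belonging to rounds 1..nr-1 are transformed; the first and last
    round keys are copied unchanged.
    """
    dec = list(enc_words)
    for i in range(4, 4 * nr):
        dec[i] = inv_mix_column(enc_words[i])
    return dec
```

For AES-128 this touches 36 of the 44 key schedule words, which is why the per-word cost of Inverse MixColumns dominates Dec-KeyExp.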
For a 64-bit core, V 4 is the best option, which is somewhat obvious because it specifically makes use of the wider data-path. It is 10× faster to perform a block encryption than a baseline T-table implementation targeting a 64-bit base RISC-V architecture. With reference to Table 4, note that the number of cycles per instruction executed is relatively high. This fact stems from use of the RoCC interface, in that forwarding of the result from an ISE instruction (that uses the RoCC) incurs an overhead vs. a non-ISE instruction; fine-grained integration of the AES-FU could therefore incrementally improve the results.
We believe it is sensible to standardise different ISEs for the RV32 and RV64 base ISAs. This allows each ISE design to better suit the constraints of each base ISA. In the RV32 case, this acknowledges that such cores will most often appear in resource-constrained, embedded or IoT class devices. Hence, the most efficient ISE design is appropriate. For necessarily larger RV64-based designs, it makes sense to take advantage of the wider data-path, and acknowledge that these are more likely to be application class cores. Hence, they will place a higher value on performance than area-efficiency.

Using ISEs to implement AES-GCM
The Galois/Counter Mode (GCM) [NIS07] is a block cipher mode of operation which supports authenticated encryption. AES-GCM refers to an instantiation using AES as the underlying block cipher, which is the only case mandated by TLS 1.3 [Res18, Section 9.1]; the importance of this construction means GCM and AES are frequently considered together from an implementation and evaluation perspective. The computational core of AES-GCM is formed from two components. GCTR [NIS07, Section 6.5] is responsible for encryption using AES, and GHASH [NIS07, Section 6.4] is responsible for authentication.
Having dealt with efficient implementation of AES and hence GCTR in Section 3, we turn our attention to GHASH. Rather than further embellish the ISE for AES, we instead focus on re-use of the proposed standard bit-manipulation extension [RV:19a, Section 17] (at the time of writing, the draft extension proposal is found in [ris]). This approach is attractive for two reasons. AES-GCM is a very common construction, but AES is not the only block cipher which can be used with GCM. Likewise, AES may not always be used with GCM, so separation of the two constructs from an instruction set point of view is prudent.
Implementation. GHASH [NIS07, Section 6.4] is a universal hash defined over the finite field $\mathbb{F}_{2^{128}}$ constructed as $\mathbb{F}_2[x]/(x^{128} + x^7 + x^2 + x + 1)$. Conversion of the input into the correct endianness can be realised using the grev (or generalised reverse) instruction, which can reverse the bits in each byte of an input word: 4 (resp. 2) grev instructions are therefore required on RV32IB (resp. RV64IB). Beyond this, operations in $\mathbb{F}_{2^{128}}$ dominate. Addition in $\mathbb{F}_{2^{128}}$ is equivalent to XOR: thus 4 (resp. 2) xor instructions are required on RV32IB (resp. RV64IB). Multiplication in $\mathbb{F}_{2^{128}}$ can be split into two steps: a (128 × 128)-bit polynomial multiplication, followed by a reduction of the 256-bit result modulo $x^{128} + x^7 + x^2 + x + 1$. The multiplication step can be realised using pairs of "carry-less" multiplication instructions clmul and clmulh. These compute the least-significant (resp. most-significant) half of a carry-less product (i.e., product over $\mathbb{F}_2$). Pairs of clmul and clmulh should be scheduled adjacently, allowing capable micro-architectures to fuse them. Use of a schoolbook approach requires 16 (resp. 4) pairs on RV32IB (resp. RV64IB). Optimisation using the Karatsuba method requires 9 (resp. 3) such pairs on RV32IB (resp. RV64IB), plus some additional xor instructions.
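The pair counts for the RV64IB case can be illustrated with a software model of the two instructions. The sketch below is a functional model only: the helper names (pair, polymul128_schoolbook, polymul128_karatsuba) are hypothetical, and the bit-serial Python loop says nothing about the latency or data-obliviousness of a hardware clmul.

```python
# Functional model of clmul/clmulh for XLEN = 64 and two 128 x 128-bit
# carry-less multiplication strategies built from them.

XLEN = 64
MASK = (1 << XLEN) - 1

def _clmul_full(a, b):
    """Full 2*XLEN-bit carry-less product of two XLEN-bit operands."""
    r = 0
    for i in range(XLEN):
        if (b >> i) & 1:
            r ^= a << i
    return r

def clmul(a, b):
    """Low XLEN bits of the carry-less product."""
    return _clmul_full(a, b) & MASK

def clmulh(a, b):
    """High XLEN bits of the carry-less product."""
    return _clmul_full(a, b) >> XLEN

def pair(a, b):
    """One clmul/clmulh pair, giving the full 2*XLEN-bit product."""
    return (clmulh(a, b) << XLEN) | clmul(a, b)

def polymul128_schoolbook(a, b):
    """Schoolbook method: 4 clmul/clmulh pairs."""
    a0, a1 = a & MASK, a >> XLEN
    b0, b1 = b & MASK, b >> XLEN
    mid = pair(a0, b1) ^ pair(a1, b0)
    return pair(a0, b0) ^ (mid << XLEN) ^ (pair(a1, b1) << (2 * XLEN))

def polymul128_karatsuba(a, b):
    """Karatsuba method: 3 clmul/clmulh pairs plus extra XORs."""
    a0, a1 = a & MASK, a >> XLEN
    b0, b1 = b & MASK, b >> XLEN
    lo = pair(a0, b0)
    hi = pair(a1, b1)
    mid = pair(a0 ^ a1, b0 ^ b1) ^ lo ^ hi
    return lo ^ (mid << XLEN) ^ (hi << (2 * XLEN))
```

Both functions assume 128-bit inputs and return the unreduced product of up to 255 bits; Karatsuba trades one pair for a handful of XORs, which is attractive when clmul[h] is slower than xor.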
The reduction step can be implemented in two ways: a shift-based reduction, made possible by the low Hamming weight of the primitive polynomial, or a multiplication-based reduction, analogous to the Montgomery or Barrett methods. The most efficient approach depends on the relative execution latency of clmul[h] vs. xor and s[lr]li. Note that the entire GHASH operation, including clmul[h], must exhibit data-oblivious execution latency (e.g., avoid data-dependent optimisations like early-termination) to avoid associated side-channel attacks (cf. [GOPT09]). Table 6 lists instruction counts for multiplication in $\mathbb{F}_{2^{128}}$, implemented using combinations of the base ISA, and approaches for the polynomial multiplication and reduction steps. Table 7 then models the execution latency (measured in cycles) [...]
We recommend the carry-less multiply instructions specified in the proposed RISC-V bit-manipulation extension also be included in the RISC-V cryptography extension. Implementers would otherwise need to implement (a subset of) the B extension, potentially adding functionality and cost that is not necessary.
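The shift-based reduction can be sketched as follows. This is a simplified illustration under stated assumptions: it works on Python integers in a plain polynomial basis (bit i is the coefficient of x^i) and ignores the bit-reflected convention that GHASH uses, and the function name is hypothetical. On a real core each shift and XOR below would expand into several XLEN-bit instructions.

```python
# Shift-based reduction modulo x^128 + x^7 + x^2 + x + 1.

MASK128 = (1 << 128) - 1

def reduce_shift(p):
    """Reduce a carry-less product of up to 256 bits to 128 bits.

    Uses x^128 = x^7 + x^2 + x + 1 (mod the field polynomial): the high half
    is folded into the low half with four shifted XORs. The first fold can
    leave up to 7 bits above bit 127, so a second fold finishes the job.
    """
    for _ in range(2):
        hi = p >> 128
        p = (p & MASK128) ^ hi ^ (hi << 1) ^ (hi << 2) ^ (hi << 7)
    return p
```

The fixed two-pass structure has no data-dependent control flow, which fits the data-oblivious requirement noted above, although the constant-time behaviour of an actual implementation depends on the instructions it is compiled to.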

Discussion. An important consideration for the GCTR component of GCM is that it only requires the encryption function of the block cipher. Given this, we re-evaluate the hardware costs of each ISE, assuming that only the encryption instructions are implemented. These results are shown in Table 8. Compared to the results in Table 1, where both encryption and decryption are implemented, the area overhead for all ISE variants is approximately halved, and there is a small reduction in circuit depth. For our recommended variants, V 3 and V 4 , the area savings when only encryption instructions are implemented are 0.46× and 0.54× respectively. For very constrained devices which have exact functionality requirements, we believe that making implementation of the decryption instructions optional could be beneficial. If these systems do require AES decryption, it could still be implemented in software, with a performance and code size similar to the baseline implementations in Table 3 and Table 4.
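The reason GCTR needs only the encryption direction can be seen in a short sketch of the mode as described in [NIS07, Section 6.5]: the block cipher is applied to a counter, and the data is only ever XORed with the result. In the code below, toy_encrypt_block is a hypothetical stand-in built from SHA-256 so that the example is self-contained; it is not AES and must not be used as a cipher.

```python
# Sketch of GCTR: the same function encrypts and decrypts, and only the
# forward direction of the block cipher is ever called.
import hashlib

def inc32(block):
    """Increment the rightmost 32 bits of a 16-byte counter block, mod 2^32."""
    ctr = (int.from_bytes(block[12:], "big") + 1) & 0xFFFFFFFF
    return block[:12] + ctr.to_bytes(4, "big")

def gctr(encrypt_block, icb, data):
    """Counter-mode transform of data, starting from initial counter block icb."""
    out = bytearray()
    cb = icb
    for off in range(0, len(data), 16):
        keystream = encrypt_block(cb)
        chunk = data[off:off + 16]
        # zip truncates to the chunk length, so a final partial block uses
        # only the leading bytes of the keystream block.
        out += bytes(x ^ k for x, k in zip(chunk, keystream))
        cb = inc32(cb)
    return bytes(out)

def toy_encrypt_block(block, key=b"demo key"):
    # Stand-in only, NOT AES: any 16-byte keyed function works for the demo.
    return hashlib.sha256(key + block).digest()[:16]
```

Calling gctr a second time with the same counter block recovers the plaintext, so a GCM-only device never needs the decryption instructions.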

Conclusion
Motivated by ongoing efforts to standardise support for AES in RISC-V, we have implemented and evaluated five ISE designs on two different RISC-V compliant base microarchitectures. Our conclusion is that 1) V 3 is the best option for AES on 32-bit cores, 2) V 4 is the best option for AES on 64-bit cores, and 3) the standard B [RV:19a, Section 17] extension can combine with either option to support AES-GCM.
Our evaluations of the different ISEs have focused primarily on performance, code size and hardware cost metrics. Because the ISEs we consider depart from historic AES ISEs in being designed to suit small, embedded CPU cores, power and EM side-channel security will likely be a consideration for their implementations. We consider side-channel secure ISE design to be an open problem, particularly in terms of making the same code portably side-channel secure across multiple implementations of the same ISE. Future efforts would be well spent in studying this problem, perhaps looking at creating custom extensions based on the recommendations here to support side-channel security.