Minimalism of Software Implementation

Matsui, Mitsuru; Murakami, Yumiko

doi:10.1007/978-3-662-43933-3_20

Extensive Performance Analysis of Symmetric Primitives on the RL78 Microcontroller

Mitsuru Matsui & Yumiko Murakami

Conference paper
First Online: 01 January 2014

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 8424)

Abstract

This paper studies state-of-the-art software implementations of lightweight symmetric primitives from an embedded system programmer’s standpoint. In embedded environments, because many ROM/RAM-size combinations are possible, it is not always easy to obtain a complete performance picture of a given primitive or to create a fair benchmark from top speed records alone.

In this study we classify these size combinations into several categories and optimize operation speed in each category. On Renesas’ RL78 microcontroller, a typical CISC embedded processor, we implemented four block ciphers and seven hash functions with various combinations of ROM and RAM sizes to make the performance characteristics of these primitives clearer. We also discuss how to create an interface and how to measure the size and speed of a given primitive from a practical point of view.

As a result, our AES encryption codes run as fast as 3,855 cycles/block in the ROM-1KB RAM-64B category and 6,622 cycles/block in the ROM-512B RAM-128B category. As further examples aimed at minimizing ROM size, we have achieved 453-byte Keccak, 396-byte Skein-256 and 210-byte PRESENT encryption codes on this processor.

1 Introduction

Lightweight crypto has become one of the hot topics in cryptography, against a background of increasing market requirements for embedded security. In the SHA-3 project, suitability for embedded applications was regarded as an important selection metric, and ISO/IEC 29192 is standardizing lightweight cipher primitives. Lightweight crypto has more often been discussed in hardware contexts, such as low area and low power consumption, but some recent studies focus on software implementation on low-resource processors. We believe this is equally important, since it is rather common in embedded systems that encryption is carried out in hardware but decryption is done in software.

One such activity is the ECRYPT II block cipher and hash function projects [1, 2], which have published performance evaluation results for many symmetric primitives on the ATtiny45 processor. All codes were written in assembly language, aiming at low-cost implementation. These works are effectively the first extensive benchmarks on a low-end microprocessor.

This paper also deals with assembly language programming of symmetric primitives on a low-end embedded processor, but takes different approaches. First of all, our target processor RL78 has an accumulator-based CISC architecture with 8 registers and read-modify instructions, while the ATtiny is a RISC processor with 32 registers and a fixed instruction length. Comparing implementations of the same algorithm on different processor architectures is of independent interest.

Secondly, we aim to demonstrate various ROM/RAM-size and speed trade-offs for each primitive, rather than only pursuing pinpoint top speed records. Embedded system programmers often treat a crypto routine as an almost black box and want to know beforehand whether a given size and speed can be achieved on a target processor. One of our purposes is to tell them which ROM/RAM-size combinations are possible or impossible to implement on this processor. To do this, we first classify the size combinations into several categories and optimize each primitive in each category. Additionally, we show one code aiming at the fastest speed and another focusing on the smallest ROM size, accepting (very) slow computation speed.

We also discuss interface and metric issues of symmetric primitives for embedded applications. In particular, we point out that there is currently no consensus on how to count the RAM size of a given program. Here again we take the embedded programmer’s viewpoint: what they are interested in is the amount of resources that they must allocate for a primitive. In this regard, we count as RAM bytes the entire temporary area used internally by the primitive, such as the argument area and stack consumption, including callee-save register storage, under a standard subroutine interface.
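As a minimal sketch of this counting rule, the snippet below adds up the components named above. The function name, parameter names and byte counts are hypothetical placeholders chosen for illustration; they are not measurements reported in this paper.

```python
def ram_bytes(argument_area: int, stack_usage: int, callee_saved: int) -> int:
    """Total RAM a caller must set aside for one primitive, in bytes.

    argument_area: bytes used to pass arguments to the subroutine
    stack_usage:   bytes of stack consumed inside the subroutine
    callee_saved:  bytes of stack used to save and restore callee-save registers
    """
    for value in (argument_area, stack_usage, callee_saved):
        if value < 0:
            raise ValueError("byte counts must be non-negative")
    return argument_area + stack_usage + callee_saved


# Hypothetical figures, for illustration only.
example_total = ram_bytes(argument_area=6, stack_usage=10, callee_saved=8)
```

The point of the sketch is that callee-save storage is counted together with the rest of the stack, so the total reflects what the caller has to allocate rather than only the working buffer of the algorithm.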

Our target block ciphers are AES [3], Camellia [4] and Clefia [5] with a 128-bit key, and Present [6] with an 80-bit key. Note that AES and Camellia are included in ISO/IEC 18033-3, and Clefia and Present have recently been adopted in ISO/IEC 29192-2, a standard for lightweight block ciphers. For hash functions, our choices are SHA-256, SHA-512 [7], Keccak-256 [8], Skein-256, Skein-512 [9], Grøstl-256 and Grøstl-512 [10], where Keccak-256, Skein-256 and Skein-512 denote Keccak[r = 1088,c = 512], Skein-256-256 and Skein-512-512, respectively.
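To make the Keccak[r = 1088,c = 512] notation concrete, the short sketch below converts the rate r and capacity c into byte sizes. It relies only on the fact that the sponge state width is r + c bits; the function name is a hypothetical helper, not part of any Keccak reference code.

```python
def sponge_sizes(rate_bits: int, capacity_bits: int) -> dict:
    """Return the state size and per-block input size of a sponge, in bytes."""
    width_bits = rate_bits + capacity_bits  # total permutation width
    if rate_bits % 8 or capacity_bits % 8:
        raise ValueError("rate and capacity are expected to be whole bytes")
    return {
        "state_bytes": width_bits // 8,  # memory needed to hold the state
        "rate_bytes": rate_bits // 8,    # message bytes absorbed per permutation call
    }


keccak_256 = sponge_sizes(rate_bits=1088, capacity_bits=512)
```

For these parameters the state is 1600 bits, i.e. 200 bytes, and each permutation call absorbs 136 bytes of message, which is worth keeping in mind when reading the RAM figures for Keccak.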

As a result, it is shown that AES achieves excellent size-speed balances for all ROM/RAM combinations on this processor. It runs at 3,855 cycles/block in the ROM-1KB RAM-64B category, and its ROM size could be reduced to 486 bytes. Camellia outperforms AES in decryption. It is also demonstrated that the key scheduling of Clefia is a bottleneck for minimizing code size, and that Present is slow due to its hardware-oriented nature, although its simple structure contributes to a very small program; we were able to write its encryption code in 210 ROM bytes.

For hash functions, it is shown that SHA-256 and SHA-512 are still good choices from a performance point of view. Among 256-bit hash functions, SHA-256 is the fastest if 1 KB or more of ROM is available, and among 512-bit hash functions, Skein-512 is the only option if only 256 bytes of RAM are available. It is also demonstrated that Keccak and Skein can be implemented very compactly; our smallest codes for Keccak-256 and Skein-256 take 453 and 396 ROM bytes, respectively.

2 The RL78 Microcontroller

RL78 is Renesas Electronics’ next-generation low-power microcontroller family, combining advanced features from both the 78K and R8C families [11], which have been widely used in embedded applications such as in-vehicle control and mobile communication systems. It supports a wide range of pin, package and memory size combinations, currently covering Flash-ROM/RAM sizes from a low-end 2KB/256B up to 512KB/32KB.

RL78 has a typical CISC architecture with an 8-bit accumulator-based instruction set that includes a small number of 16-bit instructions. It has eight 8-bit general registers a,x,b,c,d,e,h,l, which can also be used as register pairs ax,bc,de,hl. Most instructions allow only register a as a destination register, and only register pair hl as a general address pointer. For instance, xor a,[hl] is a valid instruction, but xor b,[hl] and xor a,[de] are not. This often causes size and speed penalties in programming symmetric primitives.

On the other hand, an advantage of this architecture is that it supports read-modify instructions and its average instruction length is short. Most RL78 instructions used in the small model, i.e. where all segments are within 64 KB, are one to three bytes long. For instance, xor a,[hl] is a one-cycle read-modify instruction whose length is one byte.

As for the memory access speed, reading from internal RAM takes only one cycle, but reading from ROM takes four cycles. Moreover when an address register is modified in the preceding instruction, an additional one-cycle delay happens due to an address generation interlock stall. Hence a table lookup can be costly on this processor.
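The cost figures above can be turned into a rough estimate of what a table lookup costs. The sketch below is a simplified model, not a cycle-accurate simulator: it counts only the memory read and the interlock stall, ignores the instructions that compute the index, and assumes that every lookup modifies the address register immediately before the read. The function names and the choice of 16 lookups are hypothetical.

```python
RAM_READ_CYCLES = 1    # reading from internal RAM
ROM_READ_CYCLES = 4    # reading from ROM
INTERLOCK_STALL = 1    # address register modified by the preceding instruction


def read_cycles(from_rom: bool, address_just_modified: bool) -> int:
    """Cycles for a single memory read under the simplified model."""
    cycles = ROM_READ_CYCLES if from_rom else RAM_READ_CYCLES
    if address_just_modified:
        cycles += INTERLOCK_STALL
    return cycles


def table_lookup_cycles(lookups: int, from_rom: bool) -> int:
    """Read cost of `lookups` lookups, each one stalling on a fresh index."""
    return lookups * read_cycles(from_rom, address_just_modified=True)


rom_cost = table_lookup_cycles(16, from_rom=True)
ram_cost = table_lookup_cycles(16, from_rom=False)
```

Under these assumptions, 16 lookups cost 80 cycles with the table in ROM and 32 cycles with it in RAM, which illustrates why the placement of a lookup table is a real ROM/RAM/speed trade-off on this processor.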

Table 1 shows some of the instructions essential in our programming:

Table 1. Key instructions on RL78 in symmetric programming.
