High-speed devices for modular reduction with minimal hardware costs

Abstract: Asymmetric cryptosystems have an important advantage over symmetric systems, since only the public key is transmitted. However, asymmetric cryptographic algorithms are slower than symmetric ones. Encryption and decryption in asymmetric algorithms rely on complex and cumbersome procedures for raising very large numbers to a power modulo a number (modular exponentiation). The most resource-consuming operation in this case is modular reduction. One way to improve performance is to develop high-speed circuit solutions for modular reduction, whose main task is to obtain the remainder of the division of a reducible number by the modulus. The structure of a high-speed former of partial remainders based on one binary adder and three comparison circuits is proposed, which can significantly decrease the hardware costs of devices for reducing multi-bit numbers modulo a number. Based on the proposed former of partial remainders, a block diagram of a high-speed sequential-action device for modular reduction was developed. Using the same principle, a structural block diagram of a matrix-type device is developed. Based on the matrix circuit, a pipelined matrix circuit for modular reduction is designed to process a data stream. A formula is given for estimating the gain in time when processing data streams. Algorithmic validation and verification of the high-speed sequential-action devices for modular reduction with minimal hardware costs was carried out on a field-programmable gate array (FPGA). For this, the Nexys 4 board based on the Artix-7 FPGA from Xilinx was chosen. Verilog HDL is used to describe the circuit for modular reduction. The results of a timing simulation of the device are presented as timing diagrams for given 8-bit and 16-bit numbers, confirming the correct operation of the device.

ABOUT THE AUTHOR
S. Tynymbayev has been working as a Professor in the Department of Information Security Systems of Almaty University of Power Engineering and Telecommunication. He has more than 48 years of scientific and pedagogical experience. He is interested in computer science, cryptographic hardware and embedded systems, and physical and numerical modeling of operating units for digital devices. More than 30 papers were published by Dr. Tynymbayev. The results of this article are important for creating efficient hardware implementations and improving public key cryptosystems.

PUBLIC INTEREST STATEMENT
Public-key cryptographic algorithms differ in that two keys are used to encrypt and decrypt information: a private (secret) key and a public key. Unlike symmetric cryptosystems, they do not need to transfer secret keys or ensure their authenticity. However, public-key cryptosystems require more time to encrypt and decrypt because they work with multibit numbers. Moreover, public-key cryptosystems are based on mathematically irreversible (and complex) transformations. Hardware implementation makes it possible to encrypt data much faster and more safely than software implementation. Therefore, the development of high-speed operating units of hardware cryptoprocessors for public-key cryptographic algorithms is an important task, despite their high cost.

Introduction
The wide use of asymmetric cryptosystems with high security in comparison with symmetric cryptosystems is constrained by their low speed, as encryption and decryption procedures use complex and cumbersome mathematical calculations over very large numbers.
Hardware encryption has a number of significant advantages over software encryption, one of which is high speed. A hardware implementation of a cryptographic algorithm guarantees its integrity: encryption and key storage are performed in the encoder board itself rather than in the computer's RAM. This ensures the security of the implementation of the algorithm, which is also an important advantage. Therefore, when designing hardware and software-hardware cryptosystems with a public key, the task of developing circuit solutions for one of the basic operations, modular reduction, becomes relevant (Aitkhozhayeva, 2014; Al-Haija, Smadi, Al-Ja'fari, & Al-Shua'ibi, 2014; Ismail & Nuray, 2014).
There are many different methods of calculating the remainder of division by the modulus P (Erdem & Serdar, 2018; Hars, 2004; Safiullah, Khalid, & Yasir, 2018; Tengfei Wang, Wei Guo, & Jizeng Wei, 2019; Yu, Bai, & Hao, 2015). In (Petrenko, Sidorchuk, & Kuz'minov, 2009), a device for reducing a 2n-bit number modulo an n-bit modulus in n/2 steps was considered. In it, eight binary adders were required to form the next partial remainder $r_i$, which increases the complexity of the device.
In (Tynymbayev, Gnatyuk, Aitkhozhayeva, Berdibayev, & Namazbayev, 2019; Tynymbayev, Shaikulova, Imanbaev, & Ziro, 2017), devices for modular reduction are considered in which the formers of partial remainders are built on three binary adders. Since the hardware cost of an n-bit adder is more than three times that of an n-bit comparator (Harris & Harris, 2012), replacing two n-bit binary adders with three n-bit comparators can significantly reduce the hardware costs of the formers of partial remainders. The saving is especially noticeable when the device for modular reduction is built as a matrix or pipelined circuit.

Materials and methods
Structure of the former of partial remainders (FPR)
Figure 1 shows a functional diagram of the FPR, which consists of one binary adder ADD and three comparators Com1, Com2 and Com3.
The inputs of the FPR receive, from the corresponding registers, the tripled modulus in true representation and in one's complement, $3P$ and $\overline{3P}$, and the modulus in true representation and in one's complement, $P$ and $\overline{P}$. The values $2P$ and $\overline{2P}$ are formed by shifting $P$ and $\overline{P}$ to the left by one bit, respectively. In addition, the previous remainder shifted two bits to the left, $4r_{i-1}$, is fed to the input of the FPR. At the output of the adder, the partial remainder $r_i$ is formed as the result of one of three additions: $4r_{i-1} + \overline{3P} + 1$, $4r_{i-1} + \overline{2P} + 1$ or $4r_{i-1} + \overline{P} + 1$.
The previous partial remainder multiplied by four, $4r_{i-1}$, is fed to the first inputs of the adder ADD and to the first inputs of the comparators Com1, Com2 and Com3. The other inputs of Com1 receive $P$ and its one's complement $\overline{P}$, which simplifies the structure of Com1. Similarly, $2P$ and $\overline{2P}$ are fed to the other inputs of Com2, and $3P$ and $\overline{3P}$ to the other inputs of Com3.
The comparator Com1 compares the codes $4r_{i-1}$ and $P$. If $4r_{i-1} \ge P$, a "1" signal is generated at its output 2. Conversely, if $4r_{i-1} < P$, a unit pulse is generated at output 1.
Com2 compares $4r_{i-1}$ with the doubled modulus $2P$. If $4r_{i-1} < 2P$, output 1 of this circuit is set to "1" and output 2 to "0". If $4r_{i-1} \ge 2P$, output 1 is set to "0" and output 2 to "1".
Com3 compares the codes $4r_{i-1}$ and $3P$. If $4r_{i-1} < 3P$, output 1 of this circuit is set to "1" and output 2 to "0". If $4r_{i-1} \ge 3P$, output 1 is set to "0" and output 2 to "1". Table 1 shows the operations performed depending on the ratio of $4r_{i-1}$ to the values $P$, $2P$ and $3P$.
Table 1. Types of operations at different ratios of $4r_{i-1}$ to $P$, $2P$ and $3P$

Ratios | Operations
$4r_{i-1} < P$ | $r_i = 4r_{i-1}$
$P \le 4r_{i-1} < 2P$ | $r_i = 4r_{i-1} + \overline{P} + 1$
$2P \le 4r_{i-1} < 3P$ | $r_i = 4r_{i-1} + \overline{2P} + 1$
$4r_{i-1} \ge 3P$ | $r_i = 4r_{i-1} + \overline{3P} + 1$

For the ratios $P \le 4r_{i-1} < 2P$, a unit pulse is generated at the output of the AND1 gate and is fed simultaneously to the inputs of the OR2 gate and the AND3 gates, whose second inputs receive the bits of the one's complement of the modulus, $\overline{P}$. The outputs of the AND3 gates are fed to the right inputs of the adder ADD via the OR1 gates. The left inputs of ADD receive the code of $4r_{i-1}$, and the signal "+1" is fed through OR2 to the input of the lowest-order bit position of the adder, so the operation $r_i = 4r_{i-1} + \overline{P} + 1$ is performed.
For the ratios $4r_{i-1} \ge 2P$ and $4r_{i-1} < 3P$, a unit pulse is generated at the output of the AND2 gate and is fed to the input of the OR2 gate and to the block of AND4 gates. The second data inputs of the AND4 gates receive the bits of the one's complement of the doubled modulus. The value $\overline{2P}$ passes through the block of OR1 gates to the right inputs of the adder ADD, the code "+1" is supplied to the input of the lowest-order bit position, and the operation $r_i = 4r_{i-1} + \overline{2P} + 1$ is performed in the adder.
For the ratio $4r_{i-1} \ge 3P$, a unit pulse from the second output of the comparator Com3 is applied to the input of the block of AND5 gates. The data inputs of the AND5 gates receive the bits of the one's complement of the tripled modulus. The code $\overline{3P}$ is transmitted through the block of OR1 gates to the right inputs of ADD. In this case, the operation $r_i = 4r_{i-1} + \overline{3P} + 1$ is performed in the adder.
In sequential-action devices for modular reduction, when $4r_{i-1} < P$ the previous remainder is kept in the remainder register.
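The selection logic described above can be sketched in software. The following Python model is only an illustration, not the authors' circuit: the function name `fpr_step`, its parameters and the explicit `width` argument are assumptions introduced here. It mirrors the three comparators and the single adder that adds the one's complement of the selected multiple of P plus a carry-in of 1.

```python
def fpr_step(x: int, p: int, width: int) -> int:
    """Model one former-of-partial-remainders (FPR) step.

    x     : shifted previous remainder with two new bits appended, 0 <= x < 4*p
    p     : modulus P
    width : adder width in bits (hypothetical parameter), wide enough for 4*p - 1
    """
    mask = (1 << width) - 1

    # Three comparators; each flag corresponds to "output 2" (x >= k*P) in the text.
    ge_p = x >= p
    ge_2p = x >= 2 * p
    ge_3p = x >= 3 * p

    if not ge_p:
        # x < P: the adder result is blocked and x is passed on unchanged.
        return x

    # Exactly one multiple of P is gated to the right inputs of the adder.
    if ge_3p:
        multiple = 3 * p
    elif ge_2p:
        multiple = 2 * p
    else:
        multiple = p

    # Single adder: x + one's complement of the multiple + 1 (carry-in),
    # i.e. a two's-complement subtraction truncated to the adder width.
    ones_complement = ~multiple & mask
    return (x + ones_complement + 1) & mask
```

With P = 14 and a 6-bit adder, `fpr_step(46, 14, 6)` selects 3P = 42 and returns 4, while `fpr_step(19, 14, 6)` selects P and returns 5.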
In matrix circuits, when $4r_{i-1} < P$, the value $4r_{i-1}$ is transmitted to the next FPR through the AND0 gate and the block of OR3 gates.

Structure of devices of sequential action
Figure 2 shows the functional diagram of the sequential-action device for modular reduction, which consists of:
- a (2n+2)-bit register RgA, in which the number A is shifted two bit positions to the left;
- registers Rg3P and RgP, which receive the tripled modulus 3P and the modulus P, respectively, before the operations begin;
- the former of partial remainders FPR;
- a control block, which includes a subtracting counter.
Figure 2. Functional diagram of a high-speed sequential-action device for modular reduction.
The highest-order bit positions of the register RgA are connected to the FPR through the block of AND9 gates. Under the control of the clock pulse CP from the controller, the value $4r_{i-1}$ is transmitted from the register RgA through the AND9 gates. The inputs of the FPR receive the values $3P$, $\overline{3P}$, $2P$, $\overline{2P}$ and $P$, $\overline{P}$. From the output of the FPR, the next partial remainder is fed to the inputs of RgA via the OR4 gates. On the "End of Operation" signal, the result of the calculation is output through the block of AND10 gates. The inputs of the controller receive the "Start" signal, the clock pulses CP and the number of shifts n/2 needed to calculate R = A mod P.
The device works as follows. On the "Start" signal, the operands A, 3P and P are loaded into the registers RgA, Rg3P and RgP, respectively. The same signal loads the number of shifts n/2 into the counter of clock pulses CCP. At each step of modular reduction, after the operands have been received, the controller issues a clock pulse that shifts the contents of RgA two bits to the left. After the shift, the CP delayed by the delay line DL opens the block of AND9 gates, and the value of the highest bits of RgA (at the first step, $4r_0$ with the next two bits of A appended) is transferred to the inputs of the adder of the FPR, at the output of which the partial remainder $r_i$ is formed. This remainder is written into RgA via the OR4 gates. The clock pulse CP also decrements the counter by one. The next CP then enters the circuit and generates the next partial remainder in the FPR, which is sent to RgA, and so on.
After the (n/2)-th clock pulse is applied, the (n/2)-th partial remainder is generated at the output of the FPR and stored in RgA. With this clock pulse, the CCP reaches zero and produces the "End of Operation" signal. On this signal, the result is output from the highest-order bit positions of RgA through the AND10 gates.
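The sequence of steps above can be modeled behaviorally as a loop that consumes two bits of A per clock pulse. The Python sketch below is an illustration under stated assumptions, not the authors' Verilog description: the name `reduce_sequential` is hypothetical, ordinary subtraction stands in for the adder with complemented inputs, and it is assumed that the initial remainder $r_0$ (the top n bits of A) is already smaller than P, as in the numerical examples of the paper.

```python
def reduce_sequential(a: int, p: int, n: int) -> int:
    """Reduce a 2n-bit number a modulo an n-bit modulus p in n/2 steps.

    Assumption (not stated explicitly in the text): r_0 = a >> n is less than p,
    so that every FPR input stays below 4*p.
    """
    if n <= 0 or n % 2:
        raise ValueError("n must be a positive even number")
    if not 0 < p < (1 << n):
        raise ValueError("p must be a nonzero n-bit number")
    if not 0 <= a < (1 << (2 * n)):
        raise ValueError("a must fit in 2n bits")

    r = a >> n                   # r_0: highest n bits of A
    low = a & ((1 << n) - 1)     # bits of A that have not yet entered the operation
    if r >= p:
        raise ValueError("this sketch assumes r_0 < p")

    counter = n // 2             # models the subtracting counter CCP
    while counter > 0:
        # Shift RgA two bits to the left: 4*r plus the next two bits of A.
        shift = 2 * (counter - 1)
        x = 4 * r + ((low >> shift) & 0b11)

        # FPR: the three comparators decide which multiple of P is subtracted.
        if x >= 3 * p:
            r = x - 3 * p
        elif x >= 2 * p:
            r = x - 2 * p
        elif x >= p:
            r = x - p
        else:
            r = x                # negative difference blocked, x is kept

        counter -= 1             # each clock pulse decrements the counter

    return r                     # counter reached zero: "End of Operation"
```

For example, `reduce_sequential(187, 14, 4)` takes two steps and returns 5. Consuming two bits per step is what halves the number of iterations compared with a bit-serial restoring division.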

Matrix and pipelined circuits for modular reduction
Now consider the matrix circuit for modular reduction, the functional diagram of which is shown in Figure 3. The circuit is constructed for the numbers $A = a_{11} a_{10} \ldots a_1 a_0$ and $P = a_7 a_6 \ldots a_1 a_0$ and consists of the registers RgA, Rg3P and RgP and the formers of partial remainders FPR1, FPR2 and FPR3. The inputs of these formers receive the values $3P$, $\overline{3P}$, $2P$, $\overline{2P}$ and $P$, $\overline{P}$.
At the output of FPR1, the value $r_1$ is formed. Shifted two bit positions to the left, L(2), it gives $4r_1$, which is fed to the inputs of FPR2 with the bits $a_3$ and $a_2$ of the number A appended.
At the output of FPR2, the partial remainder $r_2$ is formed. The inputs of FPR3 receive the partial remainder $r_2$ shifted two bit positions to the left, $4r_2$, with the bits $a_1$ and $a_0$ appended. At the output of FPR3, the remainder $R = r_3$ is formed. The delay time of DL is determined by $\tau_{DL} = 3\tau_{FPR}$.
With pipelining, the whole process is divided into a sequence of completed steps. Each step of the division procedure is performed at its own pipeline stage, and all stages run in parallel. The results calculated at the i-th stage are passed for further processing to the (i + 1)-th stage. Information is transferred from stage to stage through a buffer register placed between them. A stage that has completed its operation puts the result into the buffer register and can start processing the next portion of data, while the next stage of the pipeline uses the data stored in the buffer registers at its inputs as its initial data. The pipeline is synchronized by clock pulses (CP), whose period τ is determined by the slowest stage of the pipeline, $\tau_i$, and the delay in the buffer register.
In a pipelined device for modular reduction with K stages, the input data to be reduced can be fed to the input at an interval K times shorter than in the case of ordinary modular reduction. The results appear at the output of the device at the same rate.
Constructing such a pipelined device requires registers for the number to be reduced, registers for the partial remainders, a group of AND gates, formers of partial remainders, modulo-2 adders and a result register. Figure 4 shows the pipelined matrix circuit for modular reduction, built on the basis of the matrix circuit (Figure 3). The pipeline consists of three stages. Each stage consists of an FPR, buffer registers for the partial remainder and for the bits of the number A that have not yet entered the operation, and the registers Rg3P and RgP. The pipeline is controlled by the clock pulses CP. Once the pipeline is filled, each CP applied to the input produces the result for one pair $A_i$ and $P_i$.
As can be seen from Figure 4, the pipeline is synchronous and linear. The performance of a synchronous pipeline depends mainly on the correct choice of the clock period $T_p$. The minimum allowable value of $T_p$ is the sum of the longest processing time of a single pipeline stage, $T_{max}$, and the time needed to write the results of the calculation into the buffer registers of the stage, $T_{Rg}$:

$$T_p = T_{max} + T_{Rg}$$

In the considered pipeline, $T_{max}$ is the processing time of the former of partial remainders. It is determined by the modulo-2 addition time $T_{m2}$, the time delay on the multiplexer $T_{ms}$ (the time to switch the doubled partial remainder, or the result of adding the doubled remainder to the modulus P, to the outputs of the multiplexer) and the time $T_{com}$ of comparing $4r_{i-1}$ with P, 2P and 3P.
The processing time of a stream of N input data items on a pipeline with K stages and clock period $T_p$ can be determined by the formula (Orlov & Tsilker, 2010):

$$T_N = (K + N - 1)\,T_p$$

Without pipelining, the same stream takes $N K\,T_p$, so the gain in time is

$$C = N K\,T_p - (K + N - 1)\,T_p$$

Example. Let N = 20 and K = 3; then $C = (20 \cdot 3)\,T_p - 22\,T_p = 38\,T_p$. The example shows that the pipeline reduces the processing time by 38 clock periods, from $60\,T_p$ to $22\,T_p$.
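The behavior of the buffered stages can be illustrated with a small cycle-level simulation. The Python sketch below is a simplified model, not the circuit of Figure 4: the names `fpr` and `run_pipeline` are hypothetical, each stage is reduced to one FPR step followed by a buffer register holding (partial remainder, remaining bits of A, P), and the top n bits of every A are assumed to be smaller than the corresponding P. It counts the clock periods needed to process a stream of operand pairs.

```python
from typing import List, Optional, Tuple

Buffer = Optional[Tuple[int, int, int]]  # (partial remainder, low bits of A, P)


def fpr(x: int, p: int) -> int:
    """One FPR step: subtract the largest of 3P, 2P, P that does not exceed x."""
    for k in (3, 2, 1):
        if x >= k * p:
            return x - k * p
    return x


def run_pipeline(pairs: List[Tuple[int, int]], n: int) -> Tuple[List[int], int]:
    """Feed (A, P) pairs through a pipeline with K = n/2 stages.

    Returns the list of remainders and the number of clock periods used.
    """
    k_stages = n // 2
    buffers: List[Buffer] = [None] * k_stages  # buffer register after each stage
    results: List[int] = []
    cycles = 0
    next_input = 0

    while len(results) < len(pairs):
        cycles += 1
        new_buffers: List[Buffer] = [None] * k_stages
        for s in range(k_stages):
            if s == 0:
                # The first stage takes a fresh operand pair, if any is left.
                if next_input < len(pairs):
                    a, p = pairs[next_input]
                    next_input += 1
                    stage_in: Buffer = (a >> n, a & ((1 << n) - 1), p)
                else:
                    stage_in = None
            else:
                # Later stages read what the previous stage stored last cycle.
                stage_in = buffers[s - 1]
            if stage_in is not None:
                r, low, p = stage_in
                shift = n - 2 * (s + 1)
                x = 4 * r + ((low >> shift) & 0b11)
                new_buffers[s] = (fpr(x, p), low, p)
        buffers = new_buffers  # all buffer registers update on the clock edge
        if buffers[-1] is not None:
            results.append(buffers[-1][0])

    return results, cycles
```

In this model, a stream of 20 pairs with n = 6 (three stages) finishes in 22 clock periods, which matches K + N - 1, whereas processing each pair through all three stages before admitting the next would take 60.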

Implementation on FPGA
Algorithmic validation of the high-speed sequential-action devices for modular reduction with minimal hardware costs was carried out. For this, the Nexys 4 board based on the Artix-7 Field Programmable Gate Array (FPGA) from Xilinx was selected (Figure 5). The hardware description language Verilog was chosen to describe the circuit for modular reduction (Digilent Nexys 4 Artix-7 FPGA Trainer Board, 2018; IEEE Standard for Verilog Hardware Description Language, 2018; Navabi, 2007). Table 2 shows the number of main resources of the Artix-7 FPGA (XC7A100T-1CSG324C). To input the data and visualize the intermediate results, the FPGA board is equipped with all necessary ports and peripherals, the main ones being 16 switches, 16 LEDs, a USB-UART bridge and 128 MB of DDR2 memory.
Figure 6 shows the timing diagram for the formation of the partial remainders when reducing the number $A = a_7 \ldots a_0 = 187_{10} = 10111011_2$ with $P = 14_{10} = 1110_2$, $2P = 28_{10} = 11100_2$ and $3P = 42_{10} = 101010_2$; the four highest bits of A represent $r_0 = 11_{10} = 1011_2$.
In Figure 6, on the rising edge of the clock pulse CP1, the contents of register RgA are shifted two bit positions to the left, and the six highest bit positions of this register hold $4r_0 + a_3 a_2$, which corresponds to the binary code $101110_2 = 46_{10}$. At the outputs of the AND9 gates (Figure 2), the number 46 is compared with $P = 14_{10}$, $2P = 28_{10}$ and $3P = 42_{10}$, and signal "1" is generated at output 2 of Com3, which leads to the operation 46 - 3P = 46 - 42 = 4, so $r_1 = 4$. On the rising edge of the clock pulse CP2, register RgA is shifted two bit positions to the left and holds $4r_1 + a_1 a_0 = 16 + 3 = 19_{10}$. The number 19 is compared with P, 2P and 3P, and signal "1" is generated at output 2 of Com1, which leads to the operation 19 - P = 19 - 14 = 5, as can be seen in Figure 6; that is, R = 5 is the result of the operation 187 mod 14 = 5.
Figure 7 shows the timing diagram for the formation of partial remainders for the numbers $A = a_{15} \ldots a_0 = 27317_{10}$, $P = 209_{10}$, $2P = 418_{10}$ and $3P = 627_{10}$. When calculating $r_2$ and $r_3$, negative differences are obtained at the outputs of the adder ADD; they were blocked, keeping the "old" remainders in RgA.
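The two numerical examples can be traced step by step in software. The short Python sketch below is an illustration only: the function name `trace_partial_remainders` is hypothetical, and the values it produces are computed from the algorithm as described, not read from the timing diagrams in Figures 6 and 7. For each step it records the FPR input, the multiple of P that was subtracted (0 means the subtraction was blocked) and the resulting partial remainder.

```python
def trace_partial_remainders(a: int, p: int, n: int):
    """Return a list of (fpr_input, subtracted_multiple, partial_remainder)."""
    r = a >> n                      # r_0: highest n bits of A
    low = a & ((1 << n) - 1)        # remaining n bits, consumed two at a time
    trace = []
    for step in range(1, n // 2 + 1):
        pair = (low >> (n - 2 * step)) & 0b11
        x = 4 * r + pair            # shifted remainder with two new bits appended
        # Count how many of the comparisons x >= P, x >= 2P, x >= 3P hold.
        k = sum(x >= m * p for m in (1, 2, 3))
        r = x - k * p
        trace.append((x, k, r))
    return trace


for a, p, n in [(187, 14, 4), (27317, 209, 8)]:
    print(a, p, trace_partial_remainders(a, p, n))
```

For A = 187 and P = 14 the trace is (46, 3, 4), (19, 1, 5). For A = 27317 and P = 209 it is (426, 2, 8), (35, 0, 35), (141, 0, 141), (565, 2, 147): the second and third steps subtract nothing, which corresponds to the blocked negative differences, and the final remainder is 147.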

Conclusion
In the former of partial remainders of devices for modular reduction, two binary adders are replaced with three comparators, which minimizes the structure of a high-speed device for modular reduction. The presented device speeds up the reduction of a 2n-bit number A modulo P by a factor of two. Not all calculations go beyond the bit grid of the module.
When a large amount of data is processed by the same algorithm, pipeline structures are the most productive. When encrypting data, the modular reduction operation is performed for a large number of different numbers. Therefore, to increase the speed, it is advisable to use pipeline structures. On the basis of these modifications, the pipelined matrix circuit for modular reduction can be implemented on an FPGA.

Funding
The work was carried out by the authors within the framework of program-targeted financing of the Science Committee of the Ministry of Education and Science of the Republic of Kazakhstan (BR053236757 "Development of software and hardware and software for cryptographic protection of information during its transmission and storage in info-communication systems and general purpose networks").