Research on EDAC Schemes for Memory in Space Applications

Abstract: Memory used for storing the configuration bitstream of field programmable gate arrays in space applications often encounters single event upset problems, which may disrupt the integrity of data in memory and lead to unpredictable failures. For commercial memories used in low Earth orbit (LEO), single-bit errors and double-byte errors account for a large proportion. Meanwhile, error detection and correction (EDAC) schemes, e.g., triple modular redundancy, linear block codes, memory scrubbing, and the combination of these schemes, are very popular in LEO missions. To further these works, a novel EDAC scheme with cascaded “Bose–Chaudhuri–Hocquenghem and cyclic redundancy check” codes and a proper scrubbing method is presented in this paper. The performance of the proposed design is measured and compared with state-of-the-art EDAC schemes in terms of hardware overhead, time overhead, and error correction and detection capabilities. It is concluded that the proposed EDAC scheme is better suited for memory in space applications.


Introduction
It is well known that data stored in memory chips suffer from single event upsets (SEUs) in space applications. Bit-flips are induced naturally by cosmic radiation, extreme temperature, electromagnetic radiation, etc. A well-known experiment observed SEUs in commercial memories on board the Alsat-1 satellite in low Earth orbit (LEO). Alsat-1 was launched in November 2002 and the experiment lasted over eight years. Table 1 summarizes the SEU observations for a 32 MByte Ramdisk [1]. Single-bit errors are the most probable (98.59%), followed by double-byte errors (1.223%), which are dominated by double-bit errors within a single byte.
Error detection and correction (EDAC) technologies, which can detect and correct errors to a certain degree, are widely used to improve the reliability of memory devices. For the transmission of secure data between a device and its local memory, triple modular redundancy (TMR) is widely used for anti-SEU design, but its probabilistic error correction ability can cause cumulative errors [2,3]. Another disadvantage of TMR is its large memory overhead (200%) [4]. Linear block codes are also well known and widely used. However, well-known linear block codes such as Hamming, Parity and Berger codes cannot meet the error correction requirements in Table 1 because their error-correcting capabilities are too limited [5,6]. Conversely, linear block codes with excessive error-correcting capabilities (e.g., LDPC and Reed-Solomon codes) are unnecessary because they increase design complexity and cause large hardware overhead [7,8]. Scrubbing methods are also common in space applications, but rewriting the entire memory is too time-consuming [9]. In addition, some researchers have proposed a transistor-level EDAC method with small hardware overhead, which helps in designing radiation-hardened memory cell libraries but is not suitable for correcting the errors in Table 1 [10][11][12].
Figure 1 shows the block diagram of an SRAM-based field programmable gate array (FPGA) in space applications. Once an SEU occurs, it flips data in the memory chip, which affects the logical state of the FPGA and may even threaten the safety of the FPGA, resulting in irreparable failures [13]. An in-orbit controller acts as a bridge between the FPGA and its configuration memories. Programmable read-only memory (PROM), which is immune to SEUs, can be programmed only once. The bitstream stored in the PROM is usually used for the initialization of the FPGA, while the bitstream stored in the memory array is usually used for the functional reconfiguration of the FPGA.
In addition, the bitstream stored in the memory array can be maintained and updated in orbit through a universal asynchronous receiver/transmitter (UART). To improve the reliability of data in the memory array, the in-orbit controller is usually designed to detect and correct errors in the configuration bitstream that may be caused by SEUs. The corresponding steps are as follows: (1) Data encoding: The configuration data are received through the UART, and then redundant encoding is performed by the EDAC method. (2) Periodic readback checking: The data stored in the memory array are checked periodically through the FPGA readback function. (3) Periodic scrubbing: According to the scrub interval set in advance, the memory array is rewritten unconditionally by continuous scrubbing.
This paper focuses on hardware implemented EDAC schemes for memory in space applications. Section 2 presents the state-of-the-art EDAC schemes for memory in space applications. Then, a novel proposed EDAC scheme is given in Section 3. Section 4 analyzes experimental results. Finally, conclusions are drawn in Section 5.

Present State-of-the-Art EDAC Schemes
EDAC works by adding redundant bits to data in a specific way. When the data are corrupted, the redundant bits help to identify and eliminate the corruption within a certain range. There are many EDAC schemes that meet the design purpose of this paper [14][15][16]. At present, the EDAC schemes that have been applied to space applications include TMR, linear block codes, memory scrubbing, and the combination of these schemes.

TMR
TMR is widely used for anti-SEU design. The implementation consists of three identical memories storing the same data and a voter [2,3], as shown in Figure 2. The voter outputs the majority of the data from the three memories, so that when an SEU occurs in one of the memories, the output remains valid. Since the probability of errors occurring simultaneously in two memories is very small, TMR improves the error correction ability of the circuit. However, there is still a certain probability of erroneous decoding, which results in cumulative errors after subsequent propagation. Furthermore, as can be seen from Figure 2, memory overhead is a major drawback of TMR: it requires two additional copies of the memory.
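The voter behaviour described above can be illustrated with a minimal software sketch. The function name `tmr_vote` and the 8-bit example words are illustrative assumptions and do not come from the paper's hardware design.

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority vote over three copies of the same memory word."""
    return (a & b) | (a & c) | (b & c)


word = 0b1011_0010                 # value written to all three memories
upset = word ^ 0b0000_1000         # one copy hit by a single bit-flip

# An upset in a single copy is outvoted by the two intact copies.
assert tmr_vote(word, upset, word) == word

# If the same bit flips in two copies, the voter outputs the wrong value,
# which is the "erroneous decoding" case mentioned in the text.
assert tmr_vote(upset, upset, word) == upset
```

The second assertion shows why TMR correction is only probabilistic: the voter cannot tell a double upset at the same bit position from valid data.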

Linear Block Code Schemes
The EDAC circuit for a memory array, which includes an encoder and a decoder, is shown in Figure 3. The encoder generates redundant data bits based on the input data, while the decoder uses the redundant data bits to detect and eliminate errors within a certain range. For a linear block code (n, k), the numbers of codeword bits, encoded data bits and redundant data bits are n, k and c (c = n − k), respectively. Therefore, the encoder has a k-bit input and an n-bit output, while the decoder has an n-bit input and a k-bit output. For the data bus of the memory array, a DDR ×8 chip with a bit width of eight is one of the most common options. Typically, four parameters are used to measure the performance of linear block codes, i.e., error correction capacity ($t_c$), error detection capacity ($t_d$), code rate and bit overhead. They are expressed by $t_c = \lfloor (d - 1)/2 \rfloor$, $t_d = d - 1$, code rate $= k/n$ and bit overhead $= c/k$, where d is the minimum Hamming distance. In addition, time and space complexity need to be considered to measure the efficiency of coding schemes. According to the data in Table 1, it is necessary to select a coding scheme with a proper Hamming distance so that $t_c$ is equal to two or slightly higher. Therefore, some schemes, such as Hamming code, Parity code and Berger code, do not meet this requirement [5,6]. In addition, Reed-Solomon codes, which operate on non-binary symbols, are not considered in this paper [17]. Table 2 shows linear block code schemes with proper error correction capabilities, where p × q denotes the size of the memory block. As can be seen from Table 2, Hadamard code has the lowest time and space complexity, but its ratio of code rate to bit overhead (C2B) is also the lowest, which means that the coding overhead is too high. Compared to Hadamard code, Golay codes have better coding overhead, but they are still not good enough.
As the length of the Bose-Chaudhuri-Hocquenghem (BCH) code increases, the size of the corresponding memory block increases and the C2B ratio improves, but the time complexity increases exponentially [18]. 4D Parity codes have a good C2B ratio and good time and space complexity. A major problem of 4D Parity codes is that they operate on the whole memory block, so errors can be detected and corrected only after the whole block has been processed, which sacrifices error correction and detection time to a certain degree [19,20].
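As a worked illustration of the four performance parameters and the C2B ratio, the sketch below evaluates them for a few (n, k, d) triples. It assumes the usual definitions $t_c = \lfloor (d-1)/2 \rfloor$, $t_d = d - 1$, code rate $k/n$ and bit overhead $(n - k)/k$; the class name `BlockCode` and the choice of example codes are illustrative and are not taken from Table 2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockCode:
    n: int  # codeword length in bits
    k: int  # data bits per codeword
    d: int  # minimum Hamming distance

    @property
    def t_c(self) -> int:
        """Error correction capacity (assumed floor((d - 1) / 2))."""
        return (self.d - 1) // 2

    @property
    def t_d(self) -> int:
        """Error detection capacity (assumed d - 1)."""
        return self.d - 1

    @property
    def code_rate(self) -> float:
        return self.k / self.n

    @property
    def bit_overhead(self) -> float:
        """Redundant bits per data bit, c / k with c = n - k."""
        return (self.n - self.k) / self.k

    @property
    def c2b(self) -> float:
        """Ratio of code rate to bit overhead."""
        return self.code_rate / self.bit_overhead


for name, code in [("Hamming(7,4,3)", BlockCode(7, 4, 3)),
                   ("Golay(23,12,7)", BlockCode(23, 12, 7)),
                   ("BCH(63,51,5)", BlockCode(63, 51, 5))]:
    print(f"{name}: t_c={code.t_c} t_d={code.t_d} "
          f"rate={code.code_rate:.3f} overhead={code.bit_overhead:.3f} "
          f"C2B={code.c2b:.2f}")
```

Under these definitions, BCH(63, 51, 5) has $t_c = 2$ and $t_d = 4$, which matches the requirement that $t_c$ be equal to two or slightly higher.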

TMR Based EDAC
To further improve the reliability of data in the memory array, the combination of TMR and linear block codes is very effective [4,21,22], as shown in Figures 4 and 5. Both circuits inherit the error correction and detection function of the linear block code and the redundancy of TMR technology. However, the circuit shown in Figure 4 requires an additional two times the required memory, while the circuit shown in Figure 5 requires an additional eight times the required EDAC module and one time the required memory.

Memory Scrubbing
There are two kinds of scrubbing operation: scrubbing based on the UART at a fixed interval, and scrubbing based on the EDAC circuit [23,24]. For the former, the memory is rewritten unconditionally and the scrubbing interval is set according to the SEU rate. For the latter, SEUs are scrubbed out sequentially by decoding and rewriting operations. Since the EDAC circuit has inherent error correction and detection capabilities, potential errors can be automatically corrected and overwritten. Memory scrubbing is also an effective EDAC solution, but rewriting all data in memory is extremely time-consuming.

A New Proposed EDAC Scheme
For the purpose of reducing hardware and time overhead while meeting the anti-SEU requirements shown in Table 1, this section presents a new EDAC scheme by designing a cascaded code scheme and an improved scrubbing method.

Cascaded Code Scheme
Taking into account the time and space complexity and the C2B ratio in Table 2, BCH and 4D parity codes are suitable for data protection of memory in space applications. However, 4D parity codes have greater time overhead than BCH codes in response to errors, which can adversely affect memory scrubbing. Accordingly, this paper prefers BCH codes with C2B ratios similar to those of 4D parity codes. Figure 6 illustrates the effect of increasing code length on the C2B ratio of BCH and 4D parity codes. Greater code length means greater space complexity and greater hardware overhead. Therefore, the mid-length BCH(63, 51, 5), which has a C2B ratio similar to that of 4D parity codes, is more appropriate for single-bit and double-bit error correction. For the other, less probable error cases in Table 1, this paper uses the cyclic redundancy check 32 (CRC32) code, which has high error detection ability, for error checking.
Figure 7 shows the block diagram of the proposed EDAC system. Because the data width of the memory array in this paper is 8 bits (one byte), the calculation parallelism of the encoders (i.e., CRC_EN and BCH_EN) and decoders (i.e., CRC_DE and BCH_DE) designed below is eight bits.
where $m_{55}, m_{54}, \ldots, m_{51}$ are zeros. Since the number of encoded data bits is not divisible by eight, the formula for the first eight iterations of the encoder differs from the formula for its last iteration. Accordingly, the schematic of the 8-bit parallel BCH(63, 51, 5) encoder is shown in Figure 8. The codeword vector is then given by $(c_{62}, c_{61}, \cdots, c_1, c_0) = (m_{50}, m_{49}, \cdots, m_1, m_0, p_{11}, p_{10}, \cdots, p_1, p_0)$.
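The systematic codeword layout above (51 message bits followed by 12 parity bits) can be sketched in software as a polynomial division over GF(2). The generator polynomial is not reproduced in the text, so this sketch assumes the commonly tabulated generator for a two-error-correcting BCH(63, 51) code, $g(x) = x^{12} + x^{10} + x^{8} + x^{5} + x^{4} + x^{3} + 1$. The function names are hypothetical, and the byte-wise loop only mimics the idea of consuming eight bits per step; it is not the paper's hardware iteration formula.

```python
# Assumed generator polynomial of BCH(63, 51), t = 2:
# x^12 + x^10 + x^8 + x^5 + x^4 + x^3 + 1
G_BCH = 0b1010100111001


def gf2_mod(dividend: int, divisor: int) -> int:
    """Remainder of polynomial division over GF(2); bit i is the x^i term."""
    dlen = divisor.bit_length()
    while dividend.bit_length() >= dlen:
        dividend ^= divisor << (dividend.bit_length() - dlen)
    return dividend


def bch_encode(message: int) -> int:
    """Systematic encoding: 51 message bits followed by 12 parity bits."""
    assert 0 <= message < (1 << 51)
    parity = gf2_mod(message << 12, G_BCH)
    return (message << 12) | parity


def bch_parity_bytewise(message: int) -> int:
    """Same parity, consuming the zero-padded 56-bit message a byte at a time."""
    rem = 0
    for shift in range(48, -8, -8):            # 7 bytes, most significant first
        byte = (message >> shift) & 0xFF
        rem = gf2_mod((rem << 8) ^ (byte << 12), G_BCH)
    return rem


msg = 0x5A5A5A5A5A5A5 & ((1 << 51) - 1)
codeword = bch_encode(msg)
assert gf2_mod(codeword, G_BCH) == 0           # valid codewords divide by g(x)
assert bch_parity_bytewise(msg) == codeword & 0xFFF
```

The five zero bits $m_{55}, \ldots, m_{51}$ appear here as the zero padding that extends the 51-bit message to seven whole bytes.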

Decoding
The decoding process of BCH includes three steps:
(1) Computing the syndrome polynomial s(x) based on the input r(x) to the decoder.
(2) Calculating the error location polynomial Λ(x) by the key equation.
(3) Calculating the roots (e(x)) of Λ(x) by the Chien search algorithm, and correcting errors based on the roots.
The flow of BCH decoding is shown in Figure 9.

(2) Key equation
The error location polynomial Λ(x) for the BCH(63, 51, 5) is defined as $\Lambda(x) = \lambda_0 + \lambda_1 x + \lambda_2 x^2$ and can be calculated by the SiBM algorithm through the key equation $\Lambda(x)\, s(x) \equiv \Omega(x) \bmod x^{2t}$, where Ω(x) is the error value polynomial and $t = 2$ is the error correction capacity. The SiBM algorithm and its formula derivation process are described in detail in Ref. [25]. Figures 11 and 12 show the implementation of the SiBM algorithm for the BCH(63, 51, 5) and its basic processing element (PE), respectively. After two iterations, the outputs of the registers ($R_0 \sim R_2$) shown in Figure 11 are the coefficients ($\lambda_0 \sim \lambda_2$) of the error location polynomial Λ(x).

(3) Chien search
The Chien search algorithm searches for error locations by checking whether $\Lambda(\alpha^i)$ is zero. $\Lambda(\alpha^i)$ can be written as $\Lambda(\alpha^i) = \lambda_0 + \lambda_1 \alpha^i + \lambda_2 \alpha^{2i}$. If $\Lambda(\alpha^i)$ is equal to zero, the input data bit $r_i$ has an error; otherwise there is no error. For 8-bit parallel computing, eight of these evaluations are calculated simultaneously, as shown in Figure 13. After the Chien search, the roots (e(x)) of Λ(x) are known, and the errors in the decoded codeword $\hat{c}(x)$ can be corrected by Formula (14).
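The three decoding steps can be sketched in software as follows. This is a minimal illustration under stated assumptions, not the paper's hardware: GF(2^6) is assumed to be built from the primitive polynomial $x^6 + x + 1$, the error location polynomial is obtained with Peterson's direct solution for $t = 2$ instead of the SiBM algorithm, and roots are tested at $\alpha^{-i}$, which may differ from the indexing convention of Figure 13. All names are hypothetical.

```python
PRIM = 0b1000011                      # assumed primitive polynomial x^6 + x + 1
EXP = [0] * 126                       # EXP[i] = alpha^i (table doubled in length)
LOG = [0] * 64
_v = 1
for _i in range(63):
    EXP[_i] = EXP[_i + 63] = _v
    LOG[_v] = _i
    _v <<= 1
    if _v & 0x40:
        _v ^= PRIM


def gf_mul(a: int, b: int) -> int:
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]


def gf_div(a: int, b: int) -> int:    # caller guarantees b != 0
    return 0 if a == 0 else EXP[(LOG[a] - LOG[b]) % 63]


def syndrome(r: int, j: int) -> int:
    """S_j = r(alpha^j); bit i of r is the coefficient of x^i."""
    s = 0
    for i in range(63):
        if (r >> i) & 1:
            s ^= EXP[(i * j) % 63]
    return s


def bch_decode(r: int):
    """Return (word, n_corrected), or (r, None) if more than two errors are seen."""
    s1, s3 = syndrome(r, 1), syndrome(r, 3)
    if s1 == 0 and s3 == 0:
        return r, 0
    if s1 == 0:                       # impossible for one or two bit errors
        return r, None
    # Peterson's solution: Lambda(x) = 1 + S1 x + ((S3 + S1^3) / S1) x^2
    lam1 = s1
    lam2 = gf_div(s3 ^ gf_mul(s1, gf_mul(s1, s1)), s1)
    error, count = 0, 0
    for i in range(63):               # Chien search over all bit positions
        inv = EXP[(63 - i) % 63]      # alpha^(-i)
        if 1 ^ gf_mul(lam1, inv) ^ gf_mul(lam2, gf_mul(inv, inv)) == 0:
            error |= 1 << i
            count += 1
    if count != (2 if lam2 else 1):   # root count must equal deg(Lambda)
        return r, None
    return r ^ error, count


codeword = 0b1010100111001            # g(x) itself is a valid codeword
assert bch_decode(codeword ^ (1 << 40) ^ (1 << 5)) == (codeword, 2)
```

When the number of roots found does not match the degree of Λ(x), the sketch reports the word as uncorrectable, which corresponds to the multiple-bit error case that the CRC stage is meant to catch. Some patterns of three or more errors can still be miscorrected to a different codeword, which is one reason an outer check is useful.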

CRC
Since the BCH(63, 51, 5) cannot handle errors of more than two bits, the CRC code can be used as a supplement to detect the other error cases shown in Table 1. The probability of missed detection of the CRC32 is $2^{-32}$, which can cover almost all error cases in Table 1. The CRC code is mainly used for binary data. The data polynomial m(x) and the generator polynomial g(x) of the CRC32 are defined as
$m(x) = m_0 + m_1 x + \cdots + m_{63} x^{63}$ (20)
$g(x) = x^{32} + x^{26} + x^{23} + x^{22} + x^{16} + x^{12} + x^{11} + x^{10} + x^{8} + x^{7} + x^{5} + x^{4} + x^{2} + x + 1$
where $m_{63}$ is zero.

Encoding
The encoding process of the CRC32 is similar to that of the BCH(63, 51, 5). The codeword polynomial c(x) is given by $c(x) = x^{32} m(x) + p(x)$, where the remainder is $p(x) = x^{32} m(x) \bmod g(x)$. Iterating this division eight bits at a time gives the 8-bit parallel CRC32 encoder whose schematic is shown in Figure 14. After 13 iterations, the codeword vector is given by $(c_{95}, c_{94}, \cdots, c_1, c_0) = (m_{63}, m_{62}, \cdots, m_1, m_0, p_{31}, p_{30}, \cdots, p_1, p_0)$.

Decoding
The decoding process of the CRC32 is similar to its encoding process. A new remainder $\hat{p}(x)$ is computed from r(x), the received input data from memory. The schematic of the 8-bit parallel CRC32 decoder is shown in Figure 15. After 13 iterations, if the bitwise-OR of the bits of $\hat{p}(x)$ is equal to zero, the input r(x) has no error; otherwise an error is detected.
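The CRC32 encoding and checking steps can be sketched as a plain polynomial division over GF(2). The sketch assumes the standard CRC-32 generator polynomial (0x104C11DB7), whose leading terms match g(x) above, and it uses no bit reflection, initial value or final XOR, so its output is not expected to match library routines such as `zlib.crc32`. The function names are hypothetical.

```python
# Assumed standard CRC-32 generator polynomial, including the x^32 term.
G_CRC32 = 0x104C11DB7


def gf2_mod(dividend: int, divisor: int) -> int:
    """Remainder of polynomial division over GF(2); bit i is the x^i term."""
    dlen = divisor.bit_length()
    while dividend.bit_length() >= dlen:
        dividend ^= divisor << (dividend.bit_length() - dlen)
    return dividend


def crc_encode(data64: int) -> int:
    """Append the 32-bit remainder to a 64-bit data word (96-bit codeword)."""
    assert 0 <= data64 < (1 << 64)
    return (data64 << 32) | gf2_mod(data64 << 32, G_CRC32)


def crc_check(word96: int) -> bool:
    """True if the recomputed remainder is all zeros, i.e. no error detected."""
    return gf2_mod(word96, G_CRC32) == 0


stored = crc_encode(0x0123456789ABCDEF)
assert crc_check(stored)                        # intact block passes
assert not crc_check(stored ^ (0b111 << 20))    # 3-bit burst is detected
```

Checking that the remainder of the whole 96-bit word is zero is equivalent to recomputing the 32 check bits from the data and comparing them with the stored ones.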

Proposed EDAC Process
The proposed EDAC process includes data encoding, periodic readback checking and scrubbing.
(1) Data encoding: Each 51 bits of data received through the UART are encoded by the BCH_EN to 63 bits. After a zero bit is added, the new 64 bits of data are encoded by the CRC_EN to 96 bits. The codeword vector stored in the memory array is $(c_{95}, c_{94}, \cdots, c_1, c_0) = (0, m_{50}, m_{49}, \cdots, m_1, m_0, p_{b11}, p_{b10}, \cdots, p_{b1}, p_{b0}, p_{c31}, p_{c30}, \cdots, p_{c1}, p_{c0})$, where $p_{bi}$ and $p_{ci}$ denote the remainder bits of the BCH_EN and the CRC_EN, respectively. In addition, as each 51 bits of data are encoded to 96 bits, the memory overhead of the proposed scheme increases by 88.24%.
(2) Periodic readback checking: The data stored in the memory array are checked periodically through the FPGA readback function. The errors in Table 1 are divided into three cases, i.e., no errors, single-bit or double-bit errors, and multiple-bit errors (more than two erroneous bits). The corresponding decoding processes are:

• No errors: Both the BCH_DE and the CRC_DE indicate that there are no errors.
• Single-bit or double-bit errors: If the BCH_DE reports an error, the decoded 51 bits of data are encoded again by the BCH_EN and written to the blank area of the memory array. Using the newly encoded data and the old CRC redundant bits, the CRC_DE can identify whether the error has been corrected by the BCH_DE. If so, the original data in the memory array are overwritten with the newly encoded data. Otherwise, there are multiple-bit errors and the corresponding memory address should be marked.
• Multiple-bit errors: Apart from the multiple-bit errors identified above, if the BCH_DE reports no error but the CRC_DE reports an error, there are also multiple-bit errors and the corresponding memory address should be marked.
(3) Scrubbing: According to the marked memory addresses, memory array can be partially rewritten through the UART.
Obviously, the probability of the multiple-bit errors in Table 1 is relatively small, which means that the scrubbing step is rarely implemented. Therefore, the proposed memory scrubbing method is very time-saving.
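The readback decision logic and the memory overhead figure above can be summarized in a short sketch. The enum, the function name and its two boolean inputs are hypothetical abstractions of the BCH_DE and CRC_DE results; the sketch models only the decision, not the decoders themselves.

```python
from enum import Enum


class Action(Enum):
    NONE = "no errors, leave block unchanged"
    OVERWRITE = "corrected by BCH, overwrite block with re-encoded data"
    MARK = "multiple-bit error, mark address for UART scrubbing"


def readback_action(bch_reports_error: bool, crc_ok: bool) -> Action:
    """Decide what to do with one 96-bit block after readback.

    crc_ok is the CRC_DE result: computed on the stored block when the
    BCH_DE reports no error, or on the re-encoded data plus the old CRC
    bits when the BCH_DE has attempted a correction.
    """
    if not bch_reports_error:
        return Action.NONE if crc_ok else Action.MARK
    return Action.OVERWRITE if crc_ok else Action.MARK


# Memory overhead: 51 data bits are stored as 96 bits.
DATA_BITS, STORED_BITS = 51, 96
overhead_percent = 100 * (STORED_BITS - DATA_BITS) / DATA_BITS
print(f"memory overhead: {overhead_percent:.2f}%")   # 88.24%

# Only MARK results require a partial rewrite through the UART.
results = [readback_action(False, True), readback_action(True, True),
           readback_action(True, False), readback_action(False, False)]
marked = [i for i, action in enumerate(results) if action is Action.MARK]
```

Only blocks that end in `Action.MARK` need the UART, which is why the scrubbing step is expected to run rarely when multiple-bit errors are uncommon.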

Experimental Results
The proposed EDAC scheme was implemented in an in-orbit controller chip (Bsv5cbrh) of Beijing Microelectronics Technology Institute, as described in Section 3. The EDAC system shown in Figure 7 was verified in both function and timing. In addition, a PCB-level fault injection system based on Figure 1, including Bsv5cbrh, Xq5vsx95t, Xcf32p, etc., was developed to validate the proposed scheme, as shown in Figure 16. The proposed approach of the in-orbit controller was implemented based on the process described in Section 3.4. The relationship between EDAC capability and the use of EDAC resources is shown in Table 3.
Following the description in Section 3.4, the absence of errors was indicated by the results of the BCH_DE and the CRC_DE. For single-bit errors, all locations in the memory were simulated to ensure 100% fault coverage. The result of the CRC_DE should be correct in this case. The BCH_EN was used to encode the decoded data from the BCH_DE, and the BCH_DE was used twice to check whether the error could be corrected. For double-bit errors in each 63-bit codeword, a pseudo-random sequence (i.e., a pseudo-noise code) was used to generate the fault injection addresses, in which case the corresponding 32-bit data checked by the CRC_DE should also be correct. The use of EDAC resources in this case was the same as for single-bit errors. For multiple-bit error detection, there was no need to rewrite the whole memory or the memory at the marked address; the errors could be identified from the combined results of the BCH_DE and the CRC_DE. Accordingly, the use of EDAC resources was similar to the two cases described earlier (i.e., no errors and single-bit errors). For multiple-bit error correction, in addition to the previous case, the memory scrubbing function was triggered to ensure 100% error correction, in which case all EDAC resources were used.
To measure the performance of the proposed EDAC, a comparison between the state-of-the-art EDAC schemes and the proposed EDAC scheme is shown in Table 4. The overall memory overhead is an 88.24% increase in stored data, which is smaller than with the TMR or EDAC-TMR methods. The overall EDAC overhead is a 100% increase in the EDAC module, which is significantly smaller than with the EDAC-TMR method shown in Ref. [26]. The time overhead of the proposed EDAC is greater than that of TMR but less than that of the EDAC-TMR methods shown in Table 4. Furthermore, since it inherited the error correction and detection capability of the cascaded "BCH(63, 51, 5) and CRC32" codes, the proposed EDAC scheme could correct single-bit and double-bit errors and detect multiple-bit errors. With the memory scrubbing circuit, the proposed EDAC scheme could also correct multiple-bit errors. Hence, the proposed EDAC scheme had the highest error detection and correction capability among the schemes shown in Table 4. With the proposed EDAC scheme, the encoding delay of the cascaded "BCH_EN and CRC_EN" was only 21 clock cycles, while the decoding delays of the BCH_DE and the CRC_DE were 20 clock cycles and 13 clock cycles, respectively. In addition, instead of rewriting all data in memory at a fixed interval, the proposed EDAC scheme automatically scrubbed out single-bit and double-bit errors through the proposed EDAC circuit, and overwrote the data through the UART at the marked memory addresses. For memory scrubbing based on the EDAC circuit, the proposed design could correct errors after 96 bits of data had been checked, whereas a design based on 4D parity codes could correct errors only after a whole memory block had been checked.
For memory scrubbing through the UART, the memory addresses were marked only if there were multiple-bit errors in each 96-bit block, severe errors, and hardware errors shown in Table 1, which meant that the probability of memory scrubbing through the UART was less than 1.408%. Therefore, compared to the error-correcting time based on the EDAC circuit using 4D parity codes and the rewriting time for the whole memory, the proposed scrubbing method was relatively time-saving.

Conclusions
This paper has presented a novel EDAC scheme, implemented in the in-orbit controller chip (Bsv5cbrh), for memory in space applications. The EDAC scheme is based on the combination of the cascaded "BCH(63, 51, 5) and CRC32" codes and an improved scrubbing method. This scheme is sufficient to handle the typical SEU rate in the LEO environment. The design is verified in both function and timing. The overall system cost is significantly smaller than that of the present state-of-the-art EDAC schemes. Experiments to simulate the error cases in Table 1 have shown that 100% of the errors can be detected and corrected.