A New Countermeasure against Brute-Force Attacks That Use High Performance Computers for Big Data Analysis

Using high performance parallel and distributed computing systems, we can collect, generate, handle, and transmit ever-increasing amounts of data. However, these technical advancements also allow malicious individuals to obtain high computational power to attack cryptosystems. Traditional cryptosystem countermeasures have been somewhat passive in response to this change, because they simply increase computational costs by increasing key lengths. Cryptosystems that use the conventional countermeasures cannot preserve secrecy against various cryptanalysis approaches, including side channel analysis and brute-force attacks. Therefore, two new countermeasures have recently been introduced to address this problem: honey encryption and the structural code scheme. Both methods look different; however, they have similar security goals and they both feature distribution transforming encoders based on statistical schemes. We unify them into a statistical code scheme in this study. After explaining the statistical code scheme, we describe the structural code scheme, which has not been adopted as widely as the honey encryption.


Introduction
Currently, we can collect, process, analyze, and transmit massive amounts of data using high performance parallel and distributed computing systems. For instance, we can process and analyze a substantial amount of data in real time using the graphics processing units (GPUs) or the field-programmable gate arrays (FPGAs). These are specialized electronic circuits designed to rapidly manipulate vast amounts of data (commonly referred to as big data) and to accelerate their computation. However, this technology also allows malicious individuals to access high computational power to attack the cryptosystems. In other words, if malicious individuals can crack cryptosystems with GPUs or FPGAs, they can also find secret keys within a reasonable amount of time. Typically, attackers do not use sophisticated techniques to crack the cryptosystem but rather simple techniques such as brute-force attacks.
The brute-force attack is one of the simplest and strongest cryptanalysis methods. This attack searches for only one key, which returns meaningful plain text by computing all keys in the key space. From ancient Egypt to the present day, this cryptanalysis method has threatened many cryptosystems [1]. If attackers have sufficient time and computational power, they would be able to crack most cryptosystems by bruteforce attack [2]. It is known that one-time pad (OTP) is the only secure cryptosystem against the brute-force attack.
Accordingly, modern cryptosystems have been developed with two different countermeasures. The first countermeasure is to increase computational complexity. If computational complexity is high, an attacker cannot crack the cryptosystem within a polynomial time. Therefore, cryptosystems protect secrecy by increasing the key length or the entropy using a pseudorandom number generator (PRNG). The second countermeasure, which has been recently invented [3,4], protects secrecy by using statistical features. Two techniques from this countermeasure are utilized in this study: the honey encryption and the structural code scheme [4]. In this paper, we unify both approaches as the statistical code scheme, because both are based on statistical properties.
While conventional codes schemes, including the American Standard Code for Information Interchange (ASCII) and Unicode, use deterministic encoders and decoders, the statistical code scheme uses statistical features to encode and decode words. While Juels and Ristenpart published a seminal article about the honey encryption in 2014 [4], a patent for a similar concept, the structural code schemes [5], was applied for that in 2013; this scheme was originally designed for statistical deniable encoders. Articles describing the structural code scheme were also published in the academic literature in 2014 [3].
As shown in Table 1, the statistical code scheme is divided into the honey encryption and the structural code scheme. Both countermeasures use statistical features against bruteforce attacks; however, they have different designs. While honey encryption focuses on generating a similar distribution, the structural code scheme's focus is to build a similar structure from the learning system using Markov processes. Both schemes provide semantic security; however, the structural code scheme also provides syntactic security. In addition, while the honey encryption does not support natural language processing (NLP), the structural code scheme can handle natural languages using language models. However, unlike the honey encryption, the formal mathematical proof has not yet been provided for the structural code scheme. In this paper, we focus on the structural code scheme, which is relatively less popular than honey encryption in the fields of cryptography and security.
The key contributions of this paper can be summarized as follows: (i) we claim that conventional cryptosystems can be cracked by brute-force attacks utilizing high performance computers used for big data analysis; (ii) we explain why new countermeasures against bruteforce attacks must be developed; (iii) we propose that the structural code scheme and the honey encryption can be unified in the statistical code scheme; (iv) we compare the characteristics of the honey encryption and the structural code scheme; (v) we describe the algorithms of the structural code scheme in detail; (vi) we evaluate and analyze the performance of structural code scheme.
The paper is organized as follows. In the following section, we explain the information interchange code, probabilistic language model, and Markov process. We will describe the details of the structural code scheme in Section 4. Next, we highlight the differences of the proposed scheme through a comparative analysis with ASCII and evaluate the performance in Sections 5 and 6. We then describe our conclusions and provide suggestions for future research.

Related Works
We mentioned two different types of countermeasures against the brute-force attacks in the introduction. Increasing the computational complexity using key length or PRNG is a strategy that has been used and studied for years; however, it shows limitations and weaknesses against brute-force attacks.
For example, PSNR's lack of entropy decreases key space; therefore, the cryptosystem's key is revealed by a brute-force attack [6]. Goldberg's study suggests that the predictability of Netscape's seed value is related to low PRNG performance [7]. Debian OpenSSL, RSA Public key factoring, and Bitcoin's Elliptic Curve Digital Signature Algorithm also show vulnerability against brute-force attacks caused by low PRNG performance [8][9][10]. The same problem was also discovered in a fuzzy fingerprint vault system that protects fingerprints for user authentication [11]. Another factor that may lead to brute-force attacks is hardware with chips such as GPU and field-programmable gate arrays (FPGAs). Hardware with such chips can compute large amounts of data at once because they have advanced computing power through parallel processing. Therefore, although extending the key length increases computational complexity, hardware can solve it if sufficient time and efficient CPUs are available. In practice, data encryption standard (DES) was decrypted by the parallel processing capabilities of the DES cracking machine built by Electronic Frontier Foundation (EFF) [12].
Given these factors, simply increasing the computational complexity is not sufficient to maintain secrecy. Hence, the statistical code scheme is presented as a new countermeasure. In 2013 the related work "Honeywords" appeared, and Honey encryption was presented in January 2014 [4,13].

Code for Information
Interchange. Information interchange code represents various computer data in electronic communication devices and accommodates the language and symbols of many countries. Although a variety of schemes already exists, the most representative code scheme is ASCII. As users encode and decode data, they use an information interchange code table that includes bit and sign information.
Bit information corresponds to sign information, and this relation is a one-to-one function. Therefore, every user (a) 1st order Markov chain obtains the same result from the code table by deterministic features.

Probabilistic Language Model.
A probabilistic language model represents a probability of a sequence of words or strings given a stochastic process S = 1 2 3 , . . . , . Each character of the words is realized (generated) from the stochastic process: 0 ≤ ( ) ≤ 1, = 1, . . . , . A discrete random process is a set of random variables that can be expressed in a discrete time order. The stochastic random process sp(⋅) has values in a state space A = { 1 , 2 , . . . , }, which is a countable set. For instance, if we consider only integer numbers, then A = {0, 1, 2, . . . , 9}. In a similar manner, A = { , , , . . . , , , , . . . , } for characters. We can locate a natural language rule from the special features of sets that contain random variables per discrete time. For the probabilistic language model, we define the discrete time as a discrete section. A set of the word's probabilities in the discrete section is an -gram model, and it indicates the probability of the subsequent words. For example, a 1-gram model shows the probability of one character such as " " or " " and a 2-gram model indicates the probability of the subsequent two characters such as " " or " . "

Markov Process.
A Markov process is a process that satisfies a Markov property. This property indicates that the system can predict a future state depending on the current state. It is divided into a continuous-time Markov process and a discrete-time Markov process. In the Markov chain, discrete value 1 , 2 , . . . , is a state and a set of states is named state space . A Markov chain is defined by initial probability ( 1 ) and transition probability = ( = | = ). In the Markov chain, a chain will be connected to the front and the rear chains. Figure 1 shows the 1st order Markov process (a) and the 2nd order Markov process (b), respectively.

Structural Code Scheme
The structural code scheme is a new information interchange code designed to prevent brute-force attacks. The main purpose of this scheme is to create false plain text to protect the original plain text. While the false plain text consists of plain text decrypted with the wrong cryptography key, the correct Select text data Generate a corpus Encoding or decoding plain text is plain text decrypted with the decryption key.
Here, the false plain text is different from the correct plain text. This scheme produces semantically meaningful false plain text, so an attacker cannot determine whether the correct key was used. A probabilistic language model and a Markov process are used in the structural code scheme. Figure 2 shows the practical steps of the structural code scheme. First, a system user trains text data to obtain the word frequency, given an assumption that plain text follows the Markov process discussed in Section 3.3. The next step is to generate a corpus, which is a database for an encoding and decoding process [14]. The final step is to encode or decode user data with the corpus. We will explain the details of each step in the following paragraph.

Select Text Data.
In the structural code scheme, we obtain the frequency of each character from the text data and use this frequency to create a probabilistic language model. Thus, the probabilistic language model reflects the properties of the text data. As we know, many studies on the frequency of characters in alphabets have already been conducted [15]. Interestingly, they revealed that the probabilities of " " and " " are high in English sentences. That is to say, when we generate a probabilistic language model from general text data, we will have high probabilities for " " and " . " If we generate a language model using the play "Romeo and Juliet, " the highest probability can be changed from " " to Juliet's " . " In other words, if we locate " ℎ" in the language model, it is highly probable that the next character will be " " in a language model that includes the common word " ℎ . " In short, the probability depends on the properties of the text data. We always create a different probabilistic language model with this property when different data are used. After this step, we apply the th order Markov process to create a corpus.

Generate a Corpus.
The second step in the structural code scheme is to create a corpus for encoding and decoding. A corpus is a structured set (or database) of natural language data for a specific study. First, we apply the th order Markov process to a probabilistic language model. The probabilistic model in a countable set has a Markov chain characterized by adopting Markov process. The th order Markov process predicts a future state, based on the number of previous states, such that the ( − 1)-gram language model can be expressed as a th order Markov process. The full joint probability of the -order Markov process is defined as  . For instance, ( 3 = " " | 2 = "ℎ", 1 = " ") is the 2nd order Markov probability that " " follows " ℎ. " Probabilistic values of (1) are the foundation of a corpus that will be used as an encoder and a decoder. Next, we introduce a function corpus that maintains the corpus structure: .
The corpus is a multiplication of a cumulative mass function (CMF) value and a constant value . In (2), corpus is a CMF value of a random variable, which is 0 ≤ ( ) ≤ 1. When multiplying the constant value by CMF, the corpus's scope will be changed as 0 ≤ corpus ( ) ≤ . There are two reasons for multiplying by constant value . The first reason is to represent all characteristics in the probabilistic language model. Using CMF, we assign a large amount of cumulative mass to high probability words and assign a smaller amount of cumulative mass to low probability words. Thus, large and small probabilities are reflected to the corpus directly. The second reason for multiplying by constant value is a data preprocessing step for lossless decoding. While changing to the corpus, if the probability is extremely small or zero, then it could be missed in the original model of (2). That is, every single random variable should correspond to a symbol value; therefore, we modify (2) as follows: where can be estimated using corpus ( = | x − : −1 ) = .
Equation (3) specifies that the corpus will have a range from 1 to , 1 ≤ corpus ( = | x − : −1 ) ≤ . When the probability value is smaller than 1/ , we can maintain lossless encoding and decoding of data by setting a minimum value using (3). We prepared Figure 3 as an example of these steps. Figure 3 shows the process of creating a corpus [15]. Figure 3(a) represents the 0th order Markov process probabilistic language model and its corresponding CMF value (when the first character is " , " ( 1 = " ")). Figure 3(b) represents the CMF value of the 2nd order Markov process (when the third character is " " under the joint probability ( 3 = " " | 2 = "ℎ", 1 = " ")) and the corpus that results from multiplying by constant value . As shown in Figure 3, the corpus has bit strings for direct use in encoding and decoding. In Figure 3 the probability of character " " is highest when the first character is " " and the second one is "ℎ. "

Encoding and Decoding.
In the encoding step, we select a character and search for the corresponding bit string in False key Key used by the attacker the corpus. Every CMF value has at least one bit string; therefore, we can represent every character. When we find several corresponding bit strings for one character such as " " in Figure 3, we select the first or last bit string for convenience. In Figure 3, we assume the prior probabilities of characters " " and "ℎ" and compute the conditional probability of " " as the third character. Conditional probability values are also saved in our corpus as a cumulative mass function. Therefore, the prior probability affects the next character's probability in the corpus, and it plays a main role for maintaining syntactic functionality. We encode " " with the corresponding bit string from the corpus. Next, we create another corpus, which is constructed of conditional probability values from when the previous character was " . " The decoding step locates the corresponding bit string from the corpus. The random bit string is likely to correspond to a character sequence having many low probabilities in the corpus. In other words, a character having a high probability will be a first letter, and conditional probability will create meaningful words. As a result, the attacker cannot distinguish this word (false plain text) from real plain text and therefore secrecy will increase.

Scenario Comparison with ASCII System.
Difference between ASCII and the structural code scheme will be clarified by means of a comparison scenario. Before describing the scenario, we explain its terms in Table 2.
The scenario assumes a secure communication that includes a sender, receiver, and an attacker. There are two scenarios labeled (a) and (b). Both scenarios assume the use of a cryptosystem. A sender and a receiver will exchange the correct plain text with the correct key during communication, and an attacker will obtain false plain text with a false key. Figure 4 demonstrates two scenarios: an ASCII code scheme (a) and the structural code scheme (b). Figure 4(a) shows a secure communication process that uses the ASCII code scheme. The sender encodes correct plain text "secret" to a coded message "73 65 63 72 65 74" by using the ASCII encoder. The sender then encrypts the coded message with the correct key enc . Therefore, the ciphertext "88 76 2A F8 63 FD" will be sent in the communication process. The receiver obtains the ciphertext through the communication. Decrypting with shared key enc and decoding with the ASCII table will show the correct plain text to the receiver.   However, if there is an attacker in the middle of the communication, ASCII reveals a vulnerability. Attackers who can intercept the data in the communication will obtain ciphertext . They decrypt the ciphertext with false key wrong and will obtain "29 71 25 6C 43 57" as a result. We can get ")q%lCW" as the result of ASCII decoding. This word is not syntactical; therefore, the attackers will realize that wrong is not correct for decryption. Thus, the ASCII code scheme is used as a measuring stick for success or failure.
Scenario (b) resolves this problem by using the structural code scheme. The communication between sender and receiver is the same; however, the attackers obtain different results from this scheme. This scheme returns the word "simple" as a result of false key wrong using the statistical code table. Although the attackers used false key wrong , they obtain false word wrong , which has syntactical meaning so they cannot distinguish correct plain text from false plain text and secrecy will be increased.

Experiment Environment.
We analyzed the probability of obtaining a false plain text in repeated trials to test the performance of the natural language format scheme. The main purpose of this experiment was to identify a final convergence value by increasing the number of repeated trials. Through this experiment, we can estimate the number of repeated trials that are necessary to obtain meaningful false plain texts.
First, we collected text data to set the corpus. In this study, we used the text of the play "Romeo and Juliet" (248 KB) [16]. We built a language model based on the text data  and converted the language model to a corpus for encoding and decoding. We used MATLAB 2013a to conduct the experiment. The code used in the experiment was composed of a training portion, computation portion, and an encoding, decoding portion.
In the experiment, we analyzed how many repeated trials were required to build syntactically meaningful false plain text. The number of repeated trials used in the experiment was 5, 10, 20, and 100. There were two standards for setting the repeated trial number. First, the repeated trial number had to be larger than a minimum number, which indicates the number of repeated trials required to obtain meaningful false plain text. Second, we set the maximum number of repeats to 100 because an experimenter had to check the grammatical results manually. With these standards, we selected random numbers between 5 and 100. Table 3 shows the results of the experiment. All ten plain texts correct were selected from words related to a keyword "Cipher. " The left column shows ten plain text encryption techniques used in the test. We counted the number of syntactically meaningful false plain texts. The right column shows the results of the experiments. It shows the frequency rate of false but meaningful plain texts from the repeated trials. In the experiment, we only counted words that were composed of meaningful dictionary characters. For example, if the result "ah, thou shall" is composed of meaningless words, we conclude that it is not the false plain text. However, if the result "romeo and the" is composed of only dictionary words, we judged that it is the false plain text. The last line of Table 3 shows the convergence value of the probabilities. This value indicates the minimum number of repeated trials that are required to obtain meaningful false plain text. The result 0.38 shows that at least four repetitions were required. Therefore, users must perform at least four repeated trials to obtain false plain text wrong , which will be used for confronting the attack. These false plain texts are based on training data processed by a probabilistic language model. Therefore, if the user trained a movie script, the false plain text will look like a line from a movie script. The attackers will be unable to differentiate between false plain text and real plain text, thus enabling the user to maintain secrecy against brute-force attacks.

Discussion
We have described and tested the new countermeasure. However, this study focused on text data only. As a result, readers may conclude that they can only preserve secrecy when they communicate using text data. However, we found that this scheme can be applied to various formats. We will expand training targets from text data to other data formats in subsequent studies. With this goal, we will determine methods to secure various data formats in addition to text data.
We compared our proposed scheme with American Standard Code for Information Interchange (ASCII) scheme because it is the most well-known information interchange scheme in modern computers. As one knows, there are many other encoding schemes such as Unicode and Chinese Character Code for Information Interchange (CCCII). Note that our proposed structural code scheme is generic in all information interchanges because it uses a frequencybased corpus which can be adaptively and systematically constructed in each information interchange.

Conclusion
To date, increasing computational complexity has been the only countermeasure against brute-force attacks. However, simply increasing the computational complexity does not guarantee the security of cryptosystems in an era of everincreasing threats. Vulnerabilities are caused by the weakness of PRNGs and by attackers' ability to access advanced hardware. Therefore, we explored improved methods to confront brute-force attack.
In this paper, we presented a statistical code system and analyzed it as a new countermeasure. It comprises two categories: the honey encryption and the structural code scheme. Our study focused on the structural code scheme; thus, we explained it in detail. We described the details of the procedure and tested the performance of this recommended method. Unlike the honey encryption, this scheme uses the syntactical features of natural language to exchange information and generates false plain text to confuse the attacker. In other words, if we use this countermeasure to encode and decode our secret data, we can protect our secrecy more effectively, even in natural language.