Improving Security and Reliability in Merkle Tree-Based Online Data Authentication with Leakage Resilience

With the successful proliferation of data outsourcing services, security and privacy issues have drawn significant attention. Data authentication in particular plays an essential role in storing outsourced digital content and keeping it safe from modification by inside or outside adversaries. In this paper, we focus on online data authentication using a Merkle (hash) tree to guarantee data integrity. By conducting in-depth diagnostics of the side channels of the Merkle tree-based approach, we explore novel solutions to improve the security and reliability of the maintenance of outsourced data. Based on a thorough review of previous solutions, we present a new method of inserting auxiliary random sources into the integrity verification proof on the prover side. This prevents the exposure of partial information within the tree structure and consequently lifts restrictions on the number of verification executions, while maintaining the desired security and reliability of authentication over the long run. Based on a rigorous proof, we show that the proposed scheme maintains consistent reliability without being affected by the continuous information leakage caused by repetitions of the authentication process. In addition, experimental results comparing the proposed scheme with other state-of-the-art studies demonstrate its efficiency and practicality.


Introduction
In accordance with the dramatic increase in data volume, advances in information and communication technology (ICT) have facilitated the move from local data management to remote data outsourcing services. Although data outsourcing has several benefits in terms of its low cost, agility, scalability, and ease of maintenance, it also has potential problems that users may overlook. Outsourcing data to third-party storage means that control of the data is delegated to the authority managing the remote repository. Unintended data breaches or losses are possible because third-party storage services may be less vigilant than the data owner. Data breaches and losses may lead to serious financial damage as well as wasted effort, and can happen for various reasons such as negligent
• We analyze potential information leakage during the online verification process. This includes partial information about the Merkle tree and size information, both of which weaken the security and reliability of authentication (Section 3).
• We propose a leakage-resilient integrity verification protocol (Section 4). Through a rigorous security proof, we illustrate its effectiveness regardless of the number of executions, without requiring an additional trusted third party (Section 5).
• We evaluate the efficiency of the proposed scheme by implementing it in a real-world application. The evaluation shows that our approach can be flexibly adjusted to the required system resources with minimal overhead, while still supporting leakage resilience that was not guaranteed in previous research (Section 6).
This paper is organized as follows. Merkle tree-based authentication is described in Section 2. Possible information leakage in Merkle tree-based authentication and the resulting vulnerabilities of previous schemes are analyzed in Section 3. In Section 4, a leakage-resilient online data integrity verification protocol is proposed. The security and efficiency of the proposed scheme are then analyzed in Sections 5 and 6, respectively. Finally, the paper concludes in Section 7.

Merkle Tree-Based Authentication
A Merkle tree [20] is constructed from a series of data blocks, where the value of an internal node is the hash value of its children, while the value of a leaf node is the direct hash value of the corresponding data block (Figure 1). In the tree construction procedure, the hash function satisfies the preimage-resistance property, which implies that it is computationally infeasible to find the preimage of a given hash value. Also, since this forms a binary tree, the maximum depth from leaf to root is at most $\lceil \log_2 n \rceil$ for n data blocks. Thus, the Merkle tree acts as an authenticated data structure for efficient verification of the online content.
In Merkle tree-based online authentication, there are two entities, prover P and verifier V:
• Prover P is an entity who attempts to convince the other party (i.e., the verifier V) that it owns all of the data. To conserve network bandwidth, the prover sends a small piece of verifiable information instead of all of the content.
• Verifier V is another entity who tries to determine whether prover P's claim is correct or not.
To reduce storage requirements, the verifier usually stores only the value of the root node of the Merkle tree instead of all nodes of the tree.
It is notable that the Merkle tree-based online authentication is a protocol that verifies that the prover and verifier own the same data. Unlike public verification, therefore, it assumes that the verifier has some secret (i.e., not publicly available) information about the data to be validated. This issue is dealt with in detail in Section 5.
Based on the hardness assumption that it is infeasible to find a preimage of a given hash value within a computationally reasonable time [21], it can be guaranteed that only entities possessing the same data can obtain the same Merkle tree. In brief, the security of Merkle tree-based authentication rests on the security of the hash function in use. Therefore, the verifier V only stores the value of the root node of the tree and removes the rest of the metadata once the tree is constructed. On the other hand, the prover P is required to generate, in each authentication cycle, a series of (different) hash values leading to a root node value that is identical to the one held by the verifier.
In the example shown in Figure 1, the verifier V chooses a random block index (e.g., 1) as a challenge. The prover P then constructs a Merkle tree from its local data and sends the corresponding unique sibling path from the leaf to the root node (i.e., $(H_1, H_2, H_{3-4})$) to the verifier. Upon receiving the proof response, the verifier V derives the root value of the Merkle tree (i.e., $H(H(H_1, H_2), H_{3-4})$) and determines whether the result is identical to the value of the root node held in local storage. In the above protocol, adversaries may not be able to uncover the underlying plain data from the communication as long as a secure (preimage-resistant) hash function is used. However, we observed that the protocol is vulnerable to a side-channel attack, which allows an adversary to deduce meaningful information from communications during the authentication process and narrow down the scope of the attack vector, thereby weakening the reliability of the authentication and possibly nullifying its effectiveness completely. Thus, we first analyze the weakness of the Merkle tree-based authentication method to side-channel attacks on the same data in Section 3. After this, we present a simple method for improving the security and reliability of Merkle tree-based authentication with minimal overhead in Section 4.
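The challenge-response exchange described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the use of SHA-256, the helper names (`H`, `build_levels`, `sibling_path`, `verify`), zero-based leaf indices, and the restriction to a power-of-two number of blocks are all assumptions made for brevity.

```python
import hashlib

def H(*parts: bytes) -> bytes:
    """SHA-256 over the concatenated inputs (the hash choice is an assumption)."""
    return hashlib.sha256(b"".join(parts)).digest()

def build_levels(blocks):
    """Return all tree levels, leaves first; assumes a power-of-two block count."""
    level = [H(b) for b in blocks]
    levels = [level]
    while len(level) > 1:
        level = [H(level[i], level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def sibling_path(levels, index):
    """Leaf hash followed by the sibling hash at each level below the root."""
    path = [levels[0][index]]
    for level in levels[:-1]:
        path.append(level[index ^ 1])   # sibling of the current node
        index //= 2                     # move up to the parent
    return path

def verify(path, index, root):
    """Recompute the root from a sibling path and compare it with the stored root."""
    node = path[0]
    for sib in path[1:]:
        node = H(node, sib) if index % 2 == 0 else H(sib, node)
        index //= 2
    return node == root

# Four-block example in the spirit of Figure 1 (block contents are placeholders).
blocks = [b"B1", b"B2", b"B3", b"B4"]
levels = build_levels(blocks)
root = levels[-1][0]                    # the only value the verifier keeps
proof = sibling_path(levels, 0)         # corresponds to (H_1, H_2, H_{3-4})
```

The verifier only needs `root` and the challenged index to call `verify(proof, 0, root)`; everything in `proof` travels in the clear, which is the starting point for the leakage analysis in Section 3.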

Information Leakage Analysis of Merkle Tree-Based Authentication Schemes
In this section, we investigate the vulnerabilities of the Merkle tree-based authentication method to side-channel attacks (Section 3.1). Then, we demonstrate that the previous authentication schemes are not secure against the side-channel attacks we found (Section 3.2). For the rest of this paper, we assume that there is an adversary eavesdropping on the communication between the prover P and verifier V.
Exposure of structural information in Merkle tree-based authentication and its potential risks were previously analyzed by Kundu et al. [22] and Buldas and Laur [23]. Kundu et al. [22], especially, developed a notion called secure name to prevent information leakage about the correlation between nodes in the tree and the graph. However, to the best of our knowledge, detailed diagnosis of information leakage has not yet been conducted in the research.

Analysis of Merkle Tree-Based Authentication
Prior to authentication, the prover and verifier need to agree on the hash function to be used in Merkle tree construction, the size of the data blocks, and the rules for identifying specific data. In the authentication process, the prover P first sends an identifier of the data to the verifier V and proves complete possession of the data in question.

Leakage of Data Size Information
Looking at the communication between P and V, an eavesdropper can figure out the approximate size of the underlying data from a single authentication proof. Specifically, the adversary can determine the length of sibling path(s) from the knowledge of the hash function in use, data block size, and the size of the proof transmitted by P. The minimum and maximum number of leaf nodes can be easily determined from the height of the tree (i.e., the length of the sibling path −1) when the Merkle tree is constructed in a left-to-right and bottom-up manner.
Let us assume that the size of a single data block is |B|, the size of a hash value is |H|, and the size of the proof is |P|. The length of the sibling path L can then be derived as L = |P|/|H|. Once the length of the sibling path is known, (L − 1) is the height of the constructed Merkle tree, and the total size of the target data S can be approximated as

$$|B| \cdot 2^{L-2} < S \le |B| \cdot 2^{L-1}, \qquad (1)$$

where the right-hand side of the inequality corresponds to the full and complete binary tree [24]. This size information obtained by eavesdropping can be used to narrow the attack space to a range of target sizes and filter out unnecessary data. It gives the attacker the powerful option of selecting target data of an appropriate size. Therefore, it is desirable for an authentication method to hide the size information of the data.
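As a worked illustration of this size inference, the following sketch computes the range an eavesdropper could derive from a single proof. It assumes a tree built left-to-right and bottom-up, so that a tree of height L − 1 has more than 2^(L−2) and at most 2^(L−1) leaves; the function name `estimate_size` and the byte-based units are hypothetical.

```python
def estimate_size(proof_bytes: int, hash_bytes: int, block_bytes: int):
    """Bound the data size from one observed proof.

    Returns (L, lower, upper): the data size S satisfies lower < S <= upper.
    Assumes the proof consists only of hash values and has at least two elements.
    """
    L = proof_bytes // hash_bytes              # L = |P| / |H|
    height = L - 1                             # height of the Merkle tree
    lower = block_bytes * 2 ** (height - 1)    # exclusive lower bound
    upper = block_bytes * 2 ** height          # full and complete binary tree
    return L, lower, upper

# A 96-byte proof with 32-byte hashes and 4096-byte blocks.
L, lower, upper = estimate_size(96, 32, 4096)
print(f"L = {L}: {lower} < S <= {upper} bytes")
```

With these example values the observer learns that the data occupies between two and four blocks, without seeing any block content.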

Leakage of Merkle Tree Hash Values
Typically, data authentication is expected to operate reliably regardless of the number of times it occurs. Contrary to this expectation, however, the maximum number of effective authentications is bounded by the size of the Merkle tree because the hash values of the tree leak. Specifically, authentication can be conducted only as many times as the logarithm of the number of leaf nodes in the tree, because the hash values sent in each response are exposed to adversaries. After this limited number of authentications, the attacker can construct the entire Merkle tree through eavesdropping even though it has no information about the underlying content.
Observing Figure 1, the two sibling paths that serve as proofs for the challenged leaf nodes (1 or 2) and (3 or 4) provide all of the information required to construct the entire Merkle tree (i.e., $\{H_1, H_2, (H_{1-2}), H_3, H_4, (H_{3-4}), (H_{1-4})\}$, where the values inside the parentheses can be derived from the other values given as part of the proof).
Therefore, it can be seen that the authentication range of the data is reduced at an exponential rate in the presence of an adversary exploiting information about the tree accumulated during repeated authentication attempts. Once the prover P passes the authentication proof for a challenged block (e.g., $B_1$), an eavesdropper can obtain information about the subtree rooted at the child of the root node (e.g., $(H_1, H_2, H_{1-2})$ as a subtree rooted at $H_{1-2}$ in Figure 1). The subsequent authentication attempt guarantees the integrity of at most half of the entire data (e.g., $\{B_3, B_4\}$ in Figure 1). Otherwise, the attacker can reuse the other half of the tree already known from the previous authentication attempt. Therefore, either the authentication coverage shrinks further, or the adversary, by exploiting the obtained hash values, becomes able to bypass the verification process with overwhelming probability even though it does not know the corresponding data.
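The accumulation of leaked hash values can be made concrete with a small simulation of the four-block tree. This is an illustrative sketch only: SHA-256, the placeholder block contents, the zero-based indices, and the names `known`, `forge`, and `verify` are assumptions, and the eavesdropper is modeled as seeing exactly two honest proofs.

```python
import hashlib

def H(*parts: bytes) -> bytes:
    return hashlib.sha256(b"".join(parts)).digest()

# Honest prover's tree over four blocks (the eavesdropper never sees the blocks).
leaves = [H(b) for b in (b"B1", b"B2", b"B3", b"B4")]
h12, h34 = H(leaves[0], leaves[1]), H(leaves[2], leaves[3])
root = H(h12, h34)

# Two transcripts captured on the wire: proofs for the first and the third block.
seen_first = (leaves[0], leaves[1], h34)
seen_third = (leaves[2], leaves[3], h12)

# The eavesdropper pools every observed hash by its position in the tree.
known = {"H1": seen_first[0], "H2": seen_first[1], "H3-4": seen_first[2],
         "H3": seen_third[0], "H4": seen_third[1], "H1-2": seen_third[2]}

def forge(index):
    """Answer a challenge for leaf `index` using only eavesdropped hashes."""
    names = [("H1", "H2", "H3-4"), ("H2", "H1", "H3-4"),
             ("H3", "H4", "H1-2"), ("H4", "H3", "H1-2")][index]
    return tuple(known[n] for n in names)

def verify(path, index, root):
    """Standard sibling-path check performed by the verifier."""
    node = path[0]
    for sib in path[1:]:
        node = H(node, sib) if index % 2 == 0 else H(sib, node)
        index //= 2
    return node == root

# Every possible challenge is now answerable without any data block.
print(all(verify(forge(i), i, root) for i in range(4)))
```

After only two observed proofs, the pooled hashes suffice to answer all four challenges, which matches the reconstruction argument made for Figure 1.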
The typical coverage pattern is illustrated in Figure 2. When the data is composed of $2^i$ blocks, the maximum authentication coverage C(·) of the data M at the j-th execution attempt can be defined as

$$C(M, j) = \frac{|M|}{2^{\,j-1}}, \quad 1 \le j \le i + 1.$$

As the number of sibling paths demanded for challenged leaf nodes in a single verification increases, the number of allowable verification attempts decreases sharply. (According to Ateniese et al. [25], data composed of 10,000 blocks requires 460 samples of leaf nodes to be verified in order to achieve 99% confidence. When it comes to Merkle tree-based authentication [16], effectiveness is only guaranteed for 21 executions. After this, the entire Merkle tree can be reconstructed via eavesdropping so that the attacker can successfully pass authentication.)

Previous Schemes and Their Vulnerabilities
Taking the aforementioned side channels into account, we analyze the weaknesses of the previous authentication schemes.

Generic Merkle Tree-Based Authentication
In generic Merkle tree-based authentication [12], the original data block is used together with the tree, rather than using the tree only, which was further adopted to proof-of-ownership process in the cloud data deduplication literature [16].
As shown in Figure 3, the proof in authentication includes the content of the challenged block along with its sibling path. In this example, the challenged data block $B_1$ and partial information about the Merkle tree rooted at $H_{1-4}$ (i.e., $\{H_1, H_2, H_{3-4}\}$) can be exposed to the public after the first authentication request. Therefore, in the second authentication attempt, the challenged block might be randomly chosen from $\{B_5, B_6, B_7, B_8\}$ for maximal authentication coverage, and this covers only half of the entire data set (because it excludes the blocks $\{B_1, B_2, B_3, B_4\}$). The next challenged block can be chosen from $\{B_3, B_4\}$ or $\{B_5, B_6\}$, covering a quarter of the data in a similar way. Thus, the generic Merkle tree-based authentication method does not guarantee consistent authentication coverage as authentication is repeated, and the maximum number of challenge-responses is limited to the number of data blocks.
However, the adversary is still able to guess the size of the authenticated data by using Equation (1), given publicly accessible data block size |B|, proof size |P|, hash value size |H|, and the size of sibling path L = (|P| − |B|)/|H| + 1.

Authentication without a Merkle Tree
Zhao and Chow [26] pointed out the possibility of a replay attack on Merkle tree-based data authentication and proposed a probabilistic protocol inserting randomness based on hardcore function to achieve resilience against the replay attack (the protocol is summarized in Algorithm 1).
Unfortunately, in their scheme, the size of data to be verified is known to the public (because the hash function G in Algorithm 1 specifies the size of the output dependent on the data). Due to this assumption, this approach does not prevent the leakage of size information.
The protocol requires the verifier to perform the same computation as the prover unless the same random seed s is used repeatedly. If the same seed value is used repeatedly, the proof is always the same, so an adversary can bypass authentication for the entire data after eavesdropping on a single proof. Consequently, as long as a newly chosen seed is used for every authentication attempt, the efficiency on the verifier side is reduced because the verifier uses all of the data in the verification process and pre-computation cannot occur.
In addition, according to their analysis based on the Goldreich-Levin theorem, the probability that an adversary deceives the verifier is at most 1/|M|, which is not negligible in the data size. Because the bitstring r (Step 2 in Algorithm 1) can become known to the adversary, the adversary can extract at most log(|M|) bits from the transferred proof. To overcome this problem and prevent further information leakage, they recommended encrypting all traffic using a session key generated by additional protocols for secure connection establishment (e.g., SSL/TLS, IPsec), which requires non-negligible overhead in practice. This is dealt with in Sections 3.2.3 and 3.2.4.
There are also other approaches that do not rely on Merkle tree structure in data authentication. For example, Atallah et al. [27] proposed a technique for efficient integrity verification of 2-dimensional range data such as image and GIS data and suggested a method to maintain the communication overhead constant. Atallah et al. [28] and Benjamin and Atallah [29] also proposed several novel integrity verification techniques without Merkle tree.

Merkle Tree-Based Authentication of Encrypted Data
Bellare et al. [30] and Xu et al. [31] independently investigated data authentication of encrypted data in the context of proof-of-ownership (PoW). Their strategy is to encrypt the data first, then to perform authentication over the encrypted data based on a Merkle tree to guarantee data confidentiality and verify the complete possession of the underlying plain data. In combination with a secure encryption algorithm, the underlying plain content can be hidden from unauthorized users, including malicious service providers.
Although their schemes preserve data confidentiality by encrypting the data itself, they remain vulnerable to the side-channel attacks we found because of the inherent properties of the Merkle tree. In other words, even if the plaintext corresponding to the encrypted data is not known, information about the Merkle tree generated from the ciphertext can be obtained by eavesdropping, so the authentication coverage falls with repeated authentication attempts. As with the Merkle tree-based authentication of unencrypted data, the size of the underlying (encrypted) data can still be inferred.

Merkle Tree-Based Authentication with Transmission in Encrypted Form
Li et al. [32] employed a Merkle tree-based approach for online authentication in a smart grid system. To avoid information leakage, they combined it with a secure encryption algorithm. Specifically, the prover (i.e., a smart meter) and verifier (i.e., a neighborhood gateway) engage in a Diffie-Hellman key agreement protocol, followed by AES encryption to preserve the privacy of the authentication proof generated by the prover. (Inspired by Li et al.'s work [32], we can employ secure encryption algorithms to obfuscate communication between the prover and verifier. In [32], a Merkle tree is used for sender identification instead of data authentication, in which the Merkle tree is constructed from random elements chosen by P and then V validates the origin of the received power measurement (in a way that guarantees only one who can construct the valid Merkle tree is P). However, in their research, the main purpose of exploiting secure key agreement and encryption algorithms was to minimize side effects caused by side channels during Merkle tree-based online authentication.) The overall data authentication process for Merkle tree-based data authentication using encrypted channels is described in Algorithm 2. (This algorithm requires key agreement and encryption of communications. Compared to adopting full SSL/TLS, it is more efficient since it requires Diffie-Hellman key agreement and efficient encryption algorithms such as AES, which will be analyzed in Section 6.)

Algorithm 2 Merkle tree-based online authentication with encrypted communication
Public parameters:
Multiplicative cyclic group G of large prime order p with generator g
Key size κ of the symmetric-key encryption/decryption algorithm according to security parameter λ
Cryptographic hash function H
0: Check the integrity of M through
Choose uniform random exponent $k_v \in \mathbb{Z}_p$
4. Compute P's partial key $K_p = g^{k_p} \in G$
5-2. Establish the agreed session key $K = H(g^{k_v \cdot k_p}) = H(K_v^{k_p})$
Possessing data M, construct Merkle tree

Although the exact proof becomes indistinguishable when the transmitted data is encrypted, size information for the underlying data can be deduced from the size of the transferred ciphertext. One approach to prevent size information leakage is to dynamically change the size of the data blocks and the hash function used for each authentication attempt (Step 6 in Algorithm 2). However, this approach requires the verifier to construct a new Merkle tree for every authentication request, rendering pre-computation of the Merkle tree and its reuse on the verifier side impossible. Therefore, dynamically changing the size of the data blocks and hash functions for each authentication attempt significantly reduces efficiency from the verifier's perspective. Another approach is to insert dummy data into the ciphertext. While this can obfuscate information by increasing the size of the data, it also increases the computational and communication overhead for both sides.
With regard to efficiency, it requires higher computation cost for data encryption and decryption during data verification than in Merkle tree-based authentication, which is another practical drawback of this approach. (Detailed analysis can be found in Section 6.)

Randomized Online Authentication
In this section, we present a probabilistic authentication protocol that exploits a Merkle tree without a reduction in verification coverage. Before describing the proposed protocol, the adversarial model and its goals are summarized in the following subsections.

Adversarial Model
We consider adversaries who are able to collect valid proofs of data authentication from public channels. This adversary can be either (1) a passive attacker eavesdropping on communications between valid prover P and verifier V without intervention, or (2) an adaptive online adversary. In the latter case, the adversary acts as a more active attacker by passing a set of random challenges of its choosing to an oracle and collecting a valid proof set for the upload-requested data. In other words, valid prover P can be an oracle for the target data and the adversary attempts to circumvent the authentication process by manipulating the obtained proofs.
Without loss of generality, we assume that the adversary has no prior knowledge about the data to be challenged (proved). Specifically, we assume that the adversary is unable to extract size information from eavesdropping on interactions during the authentication process, except for initial upload.
The goal of these adversaries is to weaken the reliability of the authentication process by exploiting information gathered through wiretapping.

Goal
In order to minimize information leakage when a Merkle tree is used for online data authentication, the proposed scheme needs to satisfy the following requirements:
• Prevention of size information leakage: The authentication mechanism should block the outflow of information about the size of the target data, which can be used by adversaries to select and predict the required number of authentication proofs.
• Prevention of replay attacks: The protocol should not allow adversaries to launch replay attacks, in which a collected valid set of authentication proofs is used in subsequent authentication requests. In other words, the adversary cannot learn any information from what is disclosed via public channels during the authentication process.
• Minimal reductions in efficiency: The effective handling of side channels should be achieved with acceptable computation and communication overhead, maintaining the advantages of the Merkle tree-based approach.
• Compatibility: Given that the Merkle tree-based approach is widely deployed in industry and academia due to its intuitive nature and ease of utilization, the proposed approach should be applicable to existing uses. This includes adaptability to lightweight devices with limited resources and restrictions on the installation of additional libraries depending on the system architecture, such as IoT terminal devices and sensors.

Construction
To be resistant against information leakage regardless of the number of authentication attempts, the transmitted authentication proof (i.e., sibling paths) needs to be randomized so that an eavesdropping adversary cannot gather any valuable information from the transcript. In this section, we present a simple amendment that inserts random inputs and significantly increases the reliability of online authentication. The overall process is illustrated in Algorithm 3, with example data composed of four blocks. Associated notations are summarized in Table 1.
Make prf look random by invoking
where n is the number of blocks that make up D
Construct Merkle tree $MT_D$ with D as leaf nodes
Put the values for the siblings of the nodes that lie on the path from the chal-th leaf node to the root node in $MT_D$ (including the root) into prf
while prf > L:
Set vt to be the first element in prf
Compute verification term vt by evaluating hash function H of vt and an element in prf, from the second element to the last one, in a recursive manner
if vt = $MT_D^{root}$: res ← True
else:

Authentication Initiation
First of all, the verifier V constructs a Merkle tree for data D to be verified following the generic Merkle tree construction process (Step 1). Notice that V needs to construct this tree only once (usually before verification) to store the number of leaf nodes and the root node value in the tree, and then discards all remaining information about the tree. As for specifying the data to be verified, the verifier and the prover can use H(H(D)) as an identifier. Due to the collision-resistant property of the cryptographic hash function, we assume that the probability that different data files produce the same hash tree is negligible.

Randomized Challenge Generation
In this phase, the verifier V generates a random challenge to test the claim that prover P manages the data properly, as V desires. Unlike previous Merkle tree-based authentication approaches in which the challenge is selected from within a limited range (i.e., {1, 2, . . . , n}), the verifier V selects a random integer without restriction. V also specifies the length of the proof to be received and sends this value and the challenge to the prover (Step 2). The length of the proof can be any number greater than 2. In our approach, the proof length does not depend on the Merkle tree structure, unlike the original Merkle tree-based verification process. Specifically, the values of some sibling nodes in the Merkle tree may not be used in the proposed scheme when the requested proof length is shorter than the length of the sibling path from the leaf (challenged) node to the root. For a detailed description, see Section 4.3.4.

Original Challenge Restoration
Upon receipt of the challenge and proof length specification, the prover P restores the intended challenge index chal (Step 3). The prover P can specify the data block to which the challenge points, on the assumption that the prover P and the verifier V have common knowledge about the number of data blocks constituting the data D. The index of the challenged block becomes the remainder after dividing the challenge by the total number of data blocks.
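The restoration step is a single modular reduction, sketched below. The function names `make_challenge` and `restore_challenge`, the 128-bit challenge width, and the zero-based block index are assumptions made for illustration.

```python
import secrets

def make_challenge(bits: int = 128) -> int:
    """Verifier side: draw an unrestricted random integer as the challenge."""
    return secrets.randbits(bits)

def restore_challenge(challenge: int, n_blocks: int) -> int:
    """Prover side: map the challenge to a block index (zero-based here)."""
    return challenge % n_blocks

c = make_challenge()
chal = restore_challenge(c, 4)   # index into a four-block data set
```

Because the transmitted integer is not confined to {1, ..., n}, the challenge itself does not reveal the number of blocks to an observer.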

Proof Generation
Using the restored challenge and proof length specification, the prover P generates the corresponding proof prf. Unlike the typical Merkle tree-based approach, the proposed protocol requires the value of the root node to be appended to the end of the proof (Step 4).
It is worth noting that the proposed protocol does not depend on the Merkle tree structure for concealing size information. In other words, a proof that is shorter than the length of the sibling nodes is possible, thus making the data look smaller than its actual size. (In this case, the values of the nodes closest to the leaf node in the certificate must be removed in order, but the value remaining at the end after this removal (corresponding to the leaf node in the certificate) must be derived from the removed values.) In addition, a proof longer than the length of the sibling nodes is also possible, making the data look larger than its actual size, as described in the following subsection.

Proof Obfuscation
In this phase, the prover P obfuscates the proof, prepends a random bitstring to the proof if necessary for the purpose of concealing the size information (making the data appear larger than the actual size), and then passes the resulting proof to the verifier V (Step 5).
Based on the algorithm ObfuscateProof, the prover P first selects a random bitstring s with a length equal to the hash value. The bit string s is then masked iteratively by applying a bitwise XOR operation to each element of the sibling path sib.
To bring the bit-length of the resulting proof to L · |H| when the requested proof length is longer than the obfuscated proof, a randomly selected bitstring R of length (L − h − 2) · |H| is prepended to the proof prf (when L > h + 2). Note that, before generating the proof, the prover P can derive the height h of the Merkle tree that is to be constructed because P knows the total number of leaf nodes (i.e., data blocks). Thus, P calculates the number of hash-sized values to be prepended as L − h − 2 (h for the number of siblings on the path from the challenged node to the root, 1 for the challenged node, and another 1 for the root node). From this calculation, prover P generates an arbitrary bitstring R with a length equal to (L − h − 2) hash values.
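The length arithmetic for the random prefix R can be checked with a short sketch. It assumes that the tree height is h = ⌈log2 n⌉ for n data blocks and that hash values are 32 bytes long; the function name `padding_for` is hypothetical, and the case L < h + 2 (a shortened proof) is deliberately not handled.

```python
import secrets

def padding_for(n_blocks: int, L: int, hash_bytes: int = 32) -> bytes:
    """Random prefix R of (L - h - 2) hash-sized values for a requested length L."""
    h = (n_blocks - 1).bit_length()   # ceil(log2(n_blocks)), the assumed tree height
    extra = L - h - 2                 # h siblings + challenged node + root are real
    if extra < 0:
        raise ValueError("shortened proofs (L < h + 2) are outside this sketch")
    return secrets.token_bytes(extra * hash_bytes)

# Four blocks give h = 2, so a real proof has 4 elements; L = 6 needs 2 extra.
R = padding_for(4, 6)
print(len(R))
```

For four blocks and a requested length of six hash values, the prover prepends two hash-sized random strings, so the transmitted proof has the same size as one for a larger data set.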
The key to this phase is allowing individual provers to insert random sources into the verification proof in a non-deterministic manner.

(Original) Proof Restoration
When the verifier V receives the masked authentication proof from the prover P on the challenge chal with a bit-length equal to L · |H|, V restores it to the generic form of a sibling path (Step 6).
First, the unnecessary leading (L − h − 2) · |H|-bit bitstring is removed from the obfuscated proof prf. The last element in prf, corresponding to the masked value of the root node $MT_D^{root}$, is then XORed with V's value of $MT_D^{root}$ to obtain the masking factor mask. For the remaining elements, mask is recursively XORed with each one in reverse order.

Proof Verification
In this phase, the restored proof prf is validated by the verifier V in the same way as in the typical approach (Step 7).
The hash value corresponding to the root node of the tree is obtained by repeating the process of re-hashing two neighboring hash values from the first hash value of the proof. If the calculated hash value is the same as the value stored by the verifier V, the authentication succeeds. Otherwise, validation is considered a failure.
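The phases above can be tied together in an end-to-end sketch. It is not the authors' ObfuscateProof algorithm: for simplicity it assumes that one fresh random mask s is XORed into every proof element, which is only one possible reading of the masking step, and it makes no security claim. SHA-256, a power-of-two block count, L ≥ h + 2, and all function names (`prove`, `check`, `build_levels`, `xor`) are likewise assumptions.

```python
import hashlib
import secrets

HASH_LEN = 32  # SHA-256 output size in bytes (hash choice is an assumption)

def H(*parts: bytes) -> bytes:
    return hashlib.sha256(b"".join(parts)).digest()

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def build_levels(blocks):
    """All Merkle tree levels, leaves first; assumes a power-of-two block count."""
    levels = [[H(b) for b in blocks]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([H(prev[i], prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels

def prove(blocks, challenge, L):
    """Prover: restore the index, build the proof, mask it, pad it to L elements."""
    levels = build_levels(blocks)
    index = challenge % len(blocks)            # original challenge restoration
    prf = [levels[0][index]]                   # challenged leaf
    i = index
    for level in levels[:-1]:
        prf.append(level[i ^ 1])               # sibling at this level
        i //= 2
    prf.append(levels[-1][0])                  # root appended at the end
    s = secrets.token_bytes(HASH_LEN)          # fresh mask (assumed masking rule)
    masked = [xor(e, s) for e in prf]
    pad = [secrets.token_bytes(HASH_LEN) for _ in range(L - len(prf))]
    return pad + masked                        # random prefix hides the real length

def check(obfuscated, challenge, n_blocks, root):
    """Verifier: strip padding, recover the mask from the root slot, verify."""
    h = (n_blocks - 1).bit_length()            # tree height
    prf = obfuscated[len(obfuscated) - (h + 2):]   # drop the leading random elements
    mask = xor(prf[-1], root)                  # masked root XOR stored root
    path = [xor(e, mask) for e in prf[:-1]]
    index, node = challenge % n_blocks, path[0]
    for sib in path[1:]:
        node = H(node, sib) if index % 2 == 0 else H(sib, node)
        index //= 2
    return node == root

blocks = [b"B1", b"B2", b"B3", b"B4"]
root = build_levels(blocks)[-1][0]             # stored by the verifier with n = 4
c = secrets.randbits(64)
proof = prove(blocks, c, L=6)
print(check(proof, c, len(blocks), root))
```

In this sketch the verifier keeps only the root value and the number of blocks, as in the initiation phase, and two runs on the same challenge yield different transcripts because both the mask and the prefix are drawn afresh.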

Security Analysis
In this section, the security of the proposed scheme is analyzed in detail. First, the security of Merkle tree-based online authentication, which is assumed to be conducted only once, is discussed. Using this as a baseline, the security of the proposed method and its ability to improve reliability are then examined.

Security of Merkle Tree-based Authentication
The primitive used to construct a Merkle tree is a cryptographic hash function that satisfies preimage resistance, second preimage resistance, and collision resistance properties.

Definition 1. (Preimage-resistant hash function)
Given image y of a hash function h, for all pre-defined outputs, the function is preimage-resistant if it is computationally infeasible to find any preimage x such that y = h(x) [33].
In the verification of data integrity stored in remote storage, the Merkle tree-based approach begins with the assumption that the verifier and the prover share the same information (i.e., the value of the root node and the number of leaf nodes in the tree). Otherwise, the verifier can neither generate a valid challenge nor validate the correctness of the proof. Under this assumption, when online authentication is performed only once, its security can be summarized as Theorem 1.

Theorem 1. (Security of Merkle tree-based authentication)
Given a randomly chosen leaf index, the probability that an adversary without knowledge of the entire tree (data) can forge a valid sibling path is negligible if a cryptographic (specifically, preimage-resistant) hash function is used to construct the tree.

Proof.
Suppose that there is an adversary who knows the number of leaf nodes and the value of the root node in the Merkle tree generated from the target data. For the adversary to pass validation, it has to find a preimage of the root node with a bit-length twice that of the hash value. Regarding each half of the discovered preimage as a child of the target node, the adversary has to repeatedly search for a preimage of each half until the preimages correspond to the leaves (i.e., log n times, where n is the number of leaf nodes in the tree). However, this contradicts the assumption that each hash value is an output of the cryptographic hash function. Therefore, the probability of the adversary forging a valid proof is negligible as long as a cryptographic (specifically, preimage-resistant) hash function is used to build the Merkle tree. A formal security model and proof of unforgeability for Merkle tree-based authentication can be found in [20,34-36].

Security of the Proposed Scheme
The data authentication process can be completely bypassed if eavesdropping is performed on the initial transmission of the underlying data. As noted in [35], the reliability of online authentication is weakened through extra information gathered by eavesdropping unless the Merkle tree is combined with private keys. In short, one-time online authentication is reliable only when a Merkle tree is used without modification. However, since this data transmission is performed at most once and subsequent data authentication can be conducted several times, the general assumption that the initial data is transmitted through a secure channel if necessary is reasonable.

Security of One-time Secret Delivery
One simple way to improve security and reliability is to have the prover P and the verifier V agree on an extra shared secret additional to the Merkle tree itself. To achieve this, we devise a one-way secret delivery mechanism following Definition 2.

Definition 2. (One-way secret delivery)
Let two parties, say A and B, share secret information shared of bit-length λ. A can send another secret value toShare to B by embedding toShare in shared such that transmitted = shared ⊕ toShare, where ⊕ represents a bitwise exclusive-or (XOR) operation and transmitted is the data sent via a public channel. The recipient B can then recover the secret value as toShare = transmitted ⊕ shared.
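Definition 2 can be illustrated with a short sketch. It assumes 48-byte (384-bit) values to match the SHA-3 digest size used later in the experiments, and it draws both secrets from Python's secrets module; the helper xor_bytes and the variable names are hypothetical.

```python
import secrets


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bitwise XOR of two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("operands must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


LAMBDA_BYTES = 48  # assumed bit-length of 384, i.e. one SHA3-384 digest

shared = secrets.token_bytes(LAMBDA_BYTES)    # known to both A and B beforehand
to_share = secrets.token_bytes(LAMBDA_BYTES)  # fresh secret chosen by A

transmitted = xor_bytes(shared, to_share)     # sent over the public channel
recovered = xor_bytes(transmitted, shared)    # computed by B

print(recovered == to_share)
```

Recovery works because XORing with the same value twice cancels out, so B needs nothing beyond the previously shared value.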
Specifically, every time the prover P tries to convince the verifier V, P can choose a uniform random mask and securely send it to V by exploiting the one-way secret delivery mechanism in the proposed scheme. This can be achieved by embedding the mask in the shared value, which is the value of the root node in the Merkle tree, such that transmitted = mask ⊕ MT^{root}_D, where mask is the mask randomly chosen by P and MT^{root}_D is the value of the root node shared between P and V. The one-way secret delivery mechanism can be thought of as a one-directional password-authenticated key exchange (PAKE), in which the previously shared information is considered to be a password [37]. Prior to examining the security of the proposed scheme, notice that the result of a bitwise exclusive-or (XOR) operation with a uniformly random value is also uniformly random regardless of the other operand.

Lemma 1. Let X, Y ∈ {0, 1} be random variables, where Pr[X = 0] = Pr[X = 1] = 1/2 and Y is drawn from any distribution. The distribution of X ⊕ Y is also uniformly random as long as Y is independent of X, that is, Pr[X = x ∧ Y = y] = Pr[X = x] · Pr[Y = y] for any fixed bits x, y ∈ {0, 1}.

Proof. Let b ∈ {0, 1} be a fixed bit. Then,

$$\Pr[X \oplus Y = b] = \sum_{y \in \{0,1\}} \Pr[X = b \oplus y] \cdot \Pr[Y = y] = \frac{1}{2} \sum_{y \in \{0,1\}} \Pr[Y = y] = \frac{1}{2}.$$

Therefore, the result of XORing a certain bit with a random bit is also random.

Lemma 2. Let X = (X_1, X_2, . . . , X_r) ∈ {0, 1}^r be a random variable where Pr[X_i = 0] = Pr[X_i = 1] = 1/2 and X_i and X_j are independent of each other for any positive integer r, 1 ≤ i, j ≤ r, and i ≠ j. The distribution of X ⊕ Y is also uniformly random as long as Y is independent of X, regardless of the distribution Y is drawn from.

Proof. Let each bit of X and Y be X_i and Y_i for 1 ≤ i ≤ r, respectively. The probability that the XOR result of X_i and Y_i becomes any of {0, 1} is thus 1/2 according to Lemma 1. Because X_i and X_j for 1 ≤ i, j ≤ r and i ≠ j are independent variables, the probability Pr[X ⊕ Y = bs] is (1/2)^r for any fixed bitstring bs ∈ {0, 1}^r. Therefore, the result of XORing a certain bitstring with a random bitstring is also random.
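The statement of Lemma 2 can be checked numerically on a small case. The sketch below computes the exact distribution of X ⊕ Y with rational arithmetic for a uniform 3-bit X and an independent, deliberately skewed Y. The distribution chosen for Y and the function name xor_distribution are hypothetical and serve only as an illustration.

```python
from fractions import Fraction
from itertools import product


def xor_distribution(y_dist, r):
    """Exact distribution of X xor Y for a uniform r-bit X and an independent Y."""
    p_x = Fraction(1, 2 ** r)  # every r-bit value of X is equally likely
    out = {}
    for x, (y, p_y) in product(range(2 ** r), y_dist.items()):
        out[x ^ y] = out.get(x ^ y, Fraction(0)) + p_x * p_y
    return out


# A skewed 3-bit distribution for Y (illustrative values that sum to 1).
y_dist = {0b000: Fraction(7, 10), 0b101: Fraction(2, 10), 0b111: Fraction(1, 10)}

dist = xor_distribution(y_dist, r=3)
print(len(dist), all(p == Fraction(1, 8) for p in dist.values()))
```

Every one of the 2^3 outcomes receives probability exactly (1/2)^3, however uneven the distribution of Y is.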
Using Lemma 2, the one-time security of the one-way secret delivery mechanism can be proven.

Theorem 2. (One-time Security of One-way Secret Delivery)
The one-way secret delivery protocol in the proposed scheme is one-time secure against adversaries as long as the mask value is drawn independently and uniformly at random.

Proof.
Following the definition of entropy [38], the random mask has maximum uncertainty because it is chosen independently and uniformly at random from {0, 1}^λ, where λ is the bit-length of the hash value. According to Lemma 2, the XORed value with this random mask is also unpredictable (i.e., indistinguishable from other random bitstrings).

Security of the Proposed Scheme
In the proposed approach, the size of the proof generated by the prover P can be either shorter than, exactly equal to, or longer than that of the typical Merkle tree-based approach. Typical types of proof according to proof size are presented in Figure 4.
First, consider a passive adversary who does not affect the designated protocol but collects proof information leaked by eavesdropping on a public channel. (Cases in which the prover and the verifier collude are not considered in this paper because this action invalidates the effectiveness of the authentication process and is beyond the scope of our discussion.) Because this kind of adversary has no knowledge of the underlying data used to construct the Merkle tree, it can be assumed not to have all of the necessary information in advance before the attack.

[Figure 4. Types of proof according to proof size. Surviving panel labels: 2) Requested length = tree level + 1; 3) Requested length > tree level + 1.]

Proof. The proposed protocol allows the verifier to randomly select the proof length regardless of the Merkle tree structure. In the proposed scheme, the verifier sends a uniform random value as part of the challenge and the prover obtains the index value of the leaf node by taking the remainder after dividing the value by the number of blocks already known. This prevents the adversary from inferring the upper bound of the leaf node index during a repeated verification process as long as the verifier chooses different uniform random challenges with each execution, unlike typical Merkle tree-based authentication. Further, the sibling path generated using the Merkle tree is reduced or enlarged according to the requested proof length, which is also chosen uniformly at random by the verifier as another component of the challenge. As a result, there is no relationship between the size of the final proof and the size of the actual Merkle tree, making it impossible for the adversary to infer the size of the underlying data by calculating how the proof size corresponds to the number of leaf nodes in the tree.
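The two size-hiding steps in this argument, deriving the leaf index by a modulo operation and fitting the sibling path to the requested length, can be sketched as follows. The layout is an assumption made purely for illustration: the text does not fix which elements are dropped on truncation or where the random filler is placed, so the choice below (keep the last elements, prepend filler) is hypothetical, as are the names leaf_index and fit_to_length.

```python
import secrets

HASH_BYTES = 48  # SHA3-384 digest size


def leaf_index(challenge_value: int, num_blocks: int) -> int:
    """Map the verifier's uniform random value to a leaf index."""
    return challenge_value % num_blocks


def fit_to_length(elements, requested_len):
    """Return exactly requested_len hash-sized elements.

    Assumed layout (hypothetical): keep the last requested_len elements when
    the path is too long, and prepend random filler when it is too short.
    """
    if requested_len < 1:
        raise ValueError("requested_len must be positive")
    kept = list(elements[-requested_len:])
    filler = [secrets.token_bytes(HASH_BYTES)
              for _ in range(requested_len - len(kept))]
    return filler + kept


num_blocks = 4096                        # e.g. 1 MB of data in 256-byte blocks
challenge_value = secrets.randbits(384)  # uniform random part of the challenge
index = leaf_index(challenge_value, num_blocks)

# Stand-in for a real (masked) sibling path of 13 elements.
path = [secrets.token_bytes(HASH_BYTES) for _ in range(13)]
for requested_len in (5, 13, 30):
    proof = fit_to_length(path, requested_len)
    print(requested_len, len(proof) * HASH_BYTES)  # bytes on the wire
```

Whatever the exact layout, the transmitted size depends only on the requested length, not on the number of leaves.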
Upon closer inspection, the proof generated in the proposed scheme includes the value of the root node in the tree in masked form, which differs from typical Merkle tree-based authentication. Because of this feature, it may seem possible to uncover the mask when the number of data blocks is known to the adversary. In this case, the adversary can identify the starting position of the meaningful portion (i.e., the location of MT^{root}_D in Figure 4b) of the proof sent by the prover to the verifier. All the adversary then needs to do is uncover this mask value.

Proof. Even if it were possible to track the location of the beginning portion (excluding bitstrings filled with random values) of the meaningful proof from the total number of blocks and the requested proof length, the adversary cannot recover the value of the root node in the tree from the masked proof, by Theorem 2. Consequently, the probability of recovering the value of the root node is identical to that of finding the mask value, which is (1/2)^λ, where λ is the bit-length of the security parameter (or the hash value). It is notable that the prover chooses a different random mask value on every proof generation. Therefore, the reliability of the challenge-response protocol remains the same as long as the adversaries cannot uncover the value of the root node.
In this context, the increased security of the proposed scheme exploiting the Merkle tree depends on the secrecy of the value of the root node of the tree. We define an experiment to show the formal security of the proposed scheme in the presence of an eavesdropping adversary.

Definition 3.
The experiment is defined for the proposed Merkle tree-based authentication Π for the security parameter λ and an adaptive adversary A who only receives oracle access to the prover. The oracle access to the prover is further divided into access to the proof, O^{proof}_P, and access to the corresponding data, O^{data}_P. The oracle randomly creates and stores data D_i locally, generates obfuscated proof proof_i (by performing

Proof. Note that the random mask used in Step 2 of Definition 3 is selected uniformly at random for each proof generation. Although the adversary A can verify the validity of the received proof (by performing proof_i ← Π.RestoreProof(proof_i, h_i, MT^{root}_{D_i}) and res_i ← Π.VerifyProof(proof_i, MT^{root}_{D_i}) successively) in Step 3, A cannot distinguish whether proof_b received in Step 4 is a valid proof or not, by Theorem 3. Furthermore, A cannot know the mask value used to generate the proof proof_b, by Theorem 2, so the prior knowledge does not help to break the proposed authentication mechanism. In the same context, Steps 5 and 6 also give at most a negligible advantage in determining the validity of the proof proof_b given by the challenge. This implies the leakage resilience of the proposed authentication scheme even in the presence of an adaptive adversary.

Now, we consider another adversary who has knowledge of both the number of blocks in the tree and the value of the root node in the Merkle tree. However, it makes sense to assume that this kind of adversary has no knowledge of the underlying data without loss of generality (e.g., when an adversary is delegated to audit data integrity held in remote storage by a valid data owner, while the actual content is kept private). Nevertheless, even though the above two pieces of information are known to the adversary, the proposed scheme provides security that is as strong as that of typical Merkle tree-based authentication (Section 5.1).

Efficiency Analysis
In this section, the efficiency of the proposed scheme is evaluated based on the experimental implementation of related schemes.

Experimental Environment
Following the Commercial National Security Algorithm (CNSA) Suite [39] recommended by the National Security Agency (NSA), we used SHA-3 with a 384-bit digest as the cryptographic hash function, Diffie-Hellman key agreement with a 3072-bit modulus, and AES-256 as the encryption algorithm when implementing the comparison schemes.
All experiments were performed on a single machine with a 3.5 GHz CPU (Intel i7-7800X) and 64 GB RAM (3600 MHz 4 × 16 GB) running Windows 10. Each algorithm was implemented as a single-threaded 32-bit Python [41] program, using the Python cryptography toolkit (pycrypto version 2.6.1) [40] for AES and Diffie-Hellman key agreement and the SHA-3 wrapper for Python (Pysha3 version 1.0.2) [42] for SHA-3, respectively. In addition, the data was split into 256-byte blocks when the Merkle tree was constructed to allow for a consistent comparison.
To minimize errors caused by outliers, each experiment was repeated 1000 times in the same environment, and the average and the standard deviation were then calculated and reported. It is worth noting that there is room for additional performance improvement because the specified libraries were used without further optimization.

Computation Overhead
The computation time for each experiment was measured based on CPU time. The performance of each algorithm for varying data sizes is analyzed and the time overhead is compared.
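A measurement loop of the kind described above might look like the following sketch. It is not the authors' benchmarking code: it assumes Python's time.process_time as the CPU-time source (its resolution is platform dependent) and uses hashlib's SHA3-384 on a single 256-byte block as a stand-in workload. The function name cpu_time_stats is hypothetical.

```python
import hashlib
import statistics
import time


def cpu_time_stats(func, repetitions=1000):
    """Run func repeatedly and return (mean, standard deviation) of CPU time in seconds."""
    samples = []
    for _ in range(repetitions):
        start = time.process_time()
        func()
        samples.append(time.process_time() - start)
    return statistics.mean(samples), statistics.stdev(samples)


block = b"\x00" * 256  # one 256-byte data block
mean_s, std_s = cpu_time_stats(lambda: hashlib.sha3_384(block).digest())
print(f"mean = {mean_s:.3e} s, std = {std_s:.3e} s")
```

For operations as short as a single hash, the timer resolution can dominate the result, so timing a batch of calls per sample is a common refinement.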

Authentication Based on Merkle Tree
Conventional online authentication applying Merkle tree guarantees neither consistent reliability nor protection from information leakage, but it was added to the experiment as a baseline indicator for efficiency. The size of the data block was fixed for the system initialization but could be varied according to the system configuration. To allow for a consistent comparison, the block size was set to 256 bytes in this and following experiments.
A comparison of the computation time required for Merkle tree-based authentication for different data sizes is presented in Figure 5. The prover constructs a Merkle tree for the possessed data and generates a proof by finding sibling nodes in the tree, while the verifier selects a random index for the leaf node (corresponding to the index for the data block) and validates the proof received from the prover by repeatedly applying a hash function for each element in the proof. The required computation time increases as the size of the data becomes larger. Merkle tree construction has a linear relationship with the number of blocks (i.e., leaf nodes) because the number of nodes in the tree can be at most 2n − 1 for n blocks of data. Meanwhile, the computation time for proof generation and verification is logarithmically proportional to the number of data blocks because the number of elements is related to the tree height.
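The linear and logarithmic relationships can be made concrete with a short calculation. The sketch assumes a binary tree over 256-byte blocks and takes 1 MB and 1 GB to mean 2^20 and 2^30 bytes; the function name tree_stats is hypothetical.

```python
BLOCK_BYTES = 256  # block size used in the experiments


def tree_stats(data_bytes: int):
    """Return (number of leaves n, node bound 2n - 1, tree height)."""
    n = -(-data_bytes // BLOCK_BYTES)  # ceiling division
    max_nodes = 2 * n - 1              # upper bound on the number of nodes
    height = (n - 1).bit_length()      # ceil(log2(n)) for n >= 1
    return n, max_nodes, height


for label, size in (("1 MB", 2 ** 20), ("1 GB", 2 ** 30)):
    print(label, tree_stats(size))
# 1 MB -> (4096, 8191, 12)
# 1 GB -> (4194304, 8388607, 22)
```

Growing the data by a factor of 1024 multiplies the construction work by roughly the same factor, while the height, and hence the proof handling cost, increases by only 10 levels.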
In the Merkle tree-based approach, the computation time between the prover and the verifier is not equal. This is because the prover has to construct the entire Merkle tree from the underlying data but the verifier does not. The verifier only has to apply the hash function using the received proof and compare the bitstrings of the result with locally stored information. For a data file of 1 MB, Merkle tree generation by the prover accounts for 99.9% of the computational time, which is 581.5 times longer than the verification time required for the verifier. Detailed results are summarized in Table 2.

Authentication Based on the Hardcore Function
A comparison of the computation time required for hardcore function-based authentication for different data sizes is displayed in Figure 6, in which Verifier and Prover indicate the computation time required by the verifier and the prover, respectively. The verifier selects a random seed, generates a pseudorandom bitstring based on the selected seed, generates a proof using the generated bitstring, and validates the proof received from the prover by comparing it with the locally generated proof. On the other hand, the prover generates a pseudorandom bitstring based on the seed received from the verifier and generates a proof using the independently generated bitstring. The required computation time increases as the volume of data becomes larger. Seed generation time is almost constant because the size of the seed does not change. However, pseudorandom bitstring generation and verification are proportional to the logarithmic size of the data because they are closely related to the size of the proof. Proof generation is linearly proportional to the size of the data.
There is little difference in the computation time between the prover and the verifier. This is because both the verifier and the prover generate their own proofs based on the self-generated pseudorandom bitstring, which requires the most computation time, although the verifier additionally conducts the initial seed generation and proof validation through a simple comparison, which requires relatively little time. For 1 MB of data, proof generation accounts for 98.2% of computation time for both the verifier and the prover. The detailed experiment results are summarized in Table 3.

Authentication Based on Merkle Tree with Transmission in Encrypted Form
This approach requires the encryption and decryption of data transmitted between the prover and the verifier using a key agreed upon by both entities in addition to the typical Merkle tree-based authentication. Therefore, there is an additional need for a trusted authority in order to set the parameters to create an environment for key agreement. In this experiment, a Diffie-Hellman key agreement mechanism was adopted with a modulus of 3072 bits. (If communication parties require data authentication and continuous communication, the computation time for key agreement may be excluded from computation overhead. However, in this paper, we experimented with parameter setting for the same security level on the assumption that only communication for online data authentication is done.) Additionally, the data to be transmitted to the other party is encrypted with the agreed key and decrypted on the recipient's side, leaving the rest of the process the same as in the typical Merkle tree-based approach. In other words, the verifier encrypts and transmits a random challenge and decrypts the proof received from the prover. The prover decrypts the challenge received from the verifier to generate the proof, and then encrypts that proof.
If the parties communicate and perform data authentication continuously, the computation time for key agreement may be excluded from the computation overhead. In this case, however, an adversary may be able to bypass authentication by eavesdropping on repeated authentication processes, and size information is still leaked. In practice, since most authentication is performed by a large number of independent users, individual users need to establish a new session (using a new session key) and then perform authentication. Therefore, in this paper, we only measured communication overhead for online data authentication in encrypted form after a key agreement on the same security level.
A comparison of the computation time required for Merkle tree-based authentication for different data sizes is presented in Figure 7, in which Public setting refers to the time required for parameter generation by the trusted authority for Diffie-Hellman key agreement. As this approach is also based on a Merkle tree, the computation time increases as the volume of data increases. In addition, computational load for the prover and the verifier is also similar to that of the Merkle tree-based approach. However, public setting in the initial stage requires a relatively high computation time even for 1 GB of data (although it is executed only once and requires a constant amount of time). Encryption, decryption, and key agreement also increase the computation time for both the verifier and the prover for exchanges of challenges and proofs, respectively.
The detailed experiment results are summarized in Table 4. For 1 MB of data, key agreement accounts for 46.9% and 99.3% of the computation time for the prover and the verifier, respectively, while encryption and decryption use only 0.2% and 0.5%, respectively.

Authentication Based on the Proposed Approach
Similar to Merkle tree-based authentication after encryption, the proposed mechanism obfuscates the transmitted data (i.e., challenges from the verifier and proofs from the prover). Note, however, that unlike the other scheme, ours does not require an additional trusted authority to generate public parameters for key agreement as a preprocessing stage before the challenge-response process. In addition, the proposed scheme hides the size information by randomizing the proof size regardless of the Merkle tree structure. Specifically, the verifier requests an arbitrary proof length in terms of the number of hash values, and the prover generates and obfuscates (and truncates if necessary) the sibling path according to the requested proof length, followed by padding with a random bitstring when the generated proof is shorter than the specified length. A comparison of the computation time required for the proposed approach by data size is illustrated in Figure 8. The detailed experiment results are summarized in Table 5. The computation time increases as the volume of the data (and consequently the size of the Merkle tree) increases, and most of the time is used to construct the tree. For 1 MB of data, Merkle tree generation requires 92.1% of the time, while the time used by the prover to obfuscate the sibling path and to add a random bitstring is just 1.6% and 6.4%, respectively. On the verifier side, mask removal is additionally performed, taking a similar amount of time as the verification. However, it is logarithmically proportional to the number of data blocks and accounts for a negligible amount of time.

Analysis of Computation Overhead
Based on analyses of individual algorithms, the computation overhead for the prover and the verifier is summarized in Figures 9 and 10, respectively. Although Merkle tree-based authentication (the first bar in the figures) does not consider information leakage, its computation overhead is used as a reference for ideal computation efficiency.
On the prover side, the operation of authentication based on a hardcore function (the second bar) is performed on the entire data while repeating the log of the data bit-length, leading to a computation overhead that is linearly proportional to data size. All of the other algorithms only require all of the data when constructing a Merkle tree, and the proof generation process has a relatively low overhead because it deals only with the logarithm of the data in bit-length. For data smaller than 10 MB in size, the proposed scheme demonstrates the most efficient computation (next to the one adopting only a Merkle tree). For data over 10 MB in size, the computation overhead is very similar for the three algorithms exploiting Merkle trees. This indicates that the overhead generated by encryption/decryption and random masking in the proposed scheme is negligible.
On the verifier side, there is a relatively clear difference between the algorithms because the computation required is lower than that of the prover. Other than authentication based on Merkle tree, the proposed scheme exhibits the greatest efficiency, followed by authentication based on Merkle tree with encrypted transmission, with hardcore function-based authentication demonstrating the lowest efficiency. The majority of the overhead is due to the key agreement stage in the algorithm requiring encrypted communication and to sibling path obfuscation in the proposed scheme. However, this difference does not exceed 1 ms regardless of the data size in the experimental results. Furthermore, the proposed scheme might be able to further narrow the gap by optimizing the bitwise exclusive-or (XOR) operation, which is not natively supported in Python.

Considering the features of the related schemes summarized in Table 6, the verifier in authentication based on a hardcore function allows anyone to know the size of the underlying data (because the bit-length of the transmitted seed is logarithmically proportional to the data volume) even though the transmitted data is randomized. Authentication based on Merkle tree with encrypted communication (the third bar) is resilient to replay attacks that let the adversary reuse a previously successful validation, but is still susceptible to size information leakage. In short, none of the comparison algorithms are able to reduce information leakage to the same extent as the proposed scheme.
Nevertheless, the proposed scheme requires the least computation overhead for both the prover and the verifier (except for Merkle tree-based authentication, which does not consider information leakage).

Communication Overhead
For all of the compared schemes, the proof is generated using all of the data, but the final proof transmitted to the verifier is proportional to the log of the data bit-length. Looking closely at the amount of data for each entity, however, there are noticeable differences between approaches.
In the transmission from the prover to the verifier, authentication based on a hardcore function generates and sends a proof of bit-length (|M| + log(M) − 1) for data M. Therefore, the size of the generated proof becomes very small. Specifically, the proof size is only 1 Byte when the data is 100 Bytes in size, 2 Bytes for data between 1 KB and 10 KB in size, 3 Bytes for 100 KB-10 MB of data, and 4 Bytes for 1 GB of data. On the other hand, the other approaches generate and send a proof that corresponds to a series of hash values and is logarithmically proportional to the number of data blocks, where the size of each hash value is 384 bits (i.e., 48 Bytes). Authentication based on Merkle tree with encrypted transmission requires the additional transmission of a partial key generated by the prover that is 3072 bits (i.e., 384 Bytes) in size. The comparison of the data transmission from the prover to the verifier is presented in Figure 11.
Recall that, in the proposed scheme, the size of a proof, which is embedded in the challenge, is determined by the verifier. Therefore, the transmitted proof size is independent of the actual data size. As specified in Table 5, the average requested proof length (which is proportional to the number of hash values) of 51 is much longer than the sibling path in the Merkle tree approach. For example, 1 MB of data has a sibling path length of 13 and 1 GB of data has a sibling path length of 23. The communication overhead when the requested proof length is fixed at 25 is also illustrated as the last bar in Figure 11. In this case, the communication overhead is almost the same as that of 100 MB of data in conventional authentication based on Merkle tree even for 1 GB of data. This characteristic of the proposed scheme is positive in that it provides flexibility for the verifier in setting the proof length regardless of the actual data size. On the other hand, in the transmission from the verifier to the prover, only a constant amount of transmission is required regardless of the data size, because only the challenge is transmitted in all schemes except authentication based on a hardcore function. The comparison of the data transmission from the verifier to the prover is presented in Figure 12.
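The proof sizes quoted above follow from simple arithmetic, sketched below. The calculation assumes 256-byte blocks, 48-byte hash values, 1 MB = 2^20 bytes and 1 GB = 2^30 bytes, and a sibling path length equal to the tree level plus one, which reproduces the lengths of 13 and 23 given in the text. The function names are hypothetical.

```python
HASH_BYTES = 48    # 384-bit hash value
BLOCK_BYTES = 256  # block size used in the experiments


def sibling_path_elements(data_bytes: int) -> int:
    """Sibling path length, assumed to be tree level + 1."""
    n = -(-data_bytes // BLOCK_BYTES)  # number of leaves (ceiling division)
    return (n - 1).bit_length() + 1


def proof_bytes(num_elements: int) -> int:
    """Bytes sent from the prover to the verifier for a proof of this length."""
    return num_elements * HASH_BYTES


print(sibling_path_elements(2 ** 20), proof_bytes(sibling_path_elements(2 ** 20)))
print(sibling_path_elements(2 ** 30), proof_bytes(sibling_path_elements(2 ** 30)))
print(proof_bytes(25), proof_bytes(51))  # fixed length 25, average length 51
```

Under these assumptions a conventional proof occupies 624 bytes for 1 MB and 1104 bytes for 1 GB, while a requested length of 25 yields 1200 bytes and the average requested length of 51 yields 2448 bytes, regardless of the data size.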
In terms of storage, there is no additional overhead because the random sources can be removed from the local storage immediately after the hash evaluations.

Conclusions
At the present time, when data storage and maintenance costs can be reduced due to advances in information and communication technologies, it is easy to overlook whether data is correctly and legitimately managed when outsourced to remote repositories. In this paper, we investigated the types of information leakage that can occur when data integrity is validated between physically separate entities and reviewed representative approaches to handling this issue. A simple but efficient approach is presented to improve the security and reliability of data integrity validations, something which has been neglected in previous research. Providing rigorous security analysis, the effectiveness of the proposed scheme is examined in terms of resilience against the leakage of size information and replay attacks. Performance analysis shows that our method provides the highest efficiency in terms of computation load and improves security and reliability.