
OptORAMa: Optimal Oblivious RAM

Published: 19 December 2022


Abstract

Oblivious RAM (ORAM), first introduced in the ground-breaking work of Goldreich and Ostrovsky (STOC ’87 and J. ACM ’96), is a technique for provably obfuscating programs’ access patterns, such that the access patterns leak no information about the programs’ secret inputs. To compile a general program to an oblivious counterpart, it is well-known that Ω(log N) amortized blowup in memory accesses is necessary, where N is the size of the logical memory. This was shown in Goldreich and Ostrovsky’s original ORAM work for statistical security and in a somewhat restricted model (the so-called balls-and-bins model), and recently by Larsen and Nielsen (CRYPTO ’18) for computational security.

A long-standing open question is whether there exists an optimal ORAM construction that matches the aforementioned logarithmic lower bounds (without making large memory word assumptions, and assuming a constant number of CPU registers). In this article, we resolve this problem and present the first secure ORAM with O(log N) amortized blowup, assuming one-way functions. Our result is inspired by and non-trivially improves on the recent beautiful work of Patel et al. (FOCS ’18) who gave a construction with O(log N ⋅ log log N) amortized blowup, assuming one-way functions.

One of our building blocks of independent interest is a linear-time deterministic oblivious algorithm for tight compaction: Given an array of n elements where some elements are marked, we permute the elements in the array so that all marked elements end up in the front of the array. Our O(n) algorithm improves the previously best-known deterministic or randomized algorithms whose running time is O(n ⋅ log n) or O(n ⋅ log log n), respectively.


1 INTRODUCTION

Oblivious RAM (ORAM), first proposed by Goldreich and Ostrovsky [29, 31], is a technique to compile any program into a functionally equivalent one, but whose memory access patterns are independent of the program’s secret inputs. The overhead of an ORAM is defined as the (multiplicative) blowup in the runtime of the compiled program. Since Goldreich and Ostrovsky’s seminal work, ORAM has received much attention due to its applications in cloud computing, secure processor design, multi-party computation, and theoretical cryptography (for example, [6, 25, 26, 28, 45, 46, 47, 51, 57, 59, 60, 64, 67, 68]).

For more than three decades, the biggest open question in this line of work has been the optimal overhead of ORAM. Goldreich and Ostrovsky’s original work [29, 31] showed a construction with \(O(\log ^3 N)\) blowup in runtime, assuming the existence of one-way functions, where \(N\) denotes the memory size consumed by the original non-oblivious program. On the other hand, they proved that any ORAM scheme must incur at least \(\Omega (\log N)\) overhead in memory accesses, but their lower bound is restricted to schemes that treat the contents of each memory word as “indivisible” (see Boyle and Naor [7]) and make no cryptographic assumptions. In a recent work, Larsen and Nielsen [41] showed that \(\Omega (\log N)\) overhead in memory accesses is necessary for all online ORAM schemes,1 even ones that use cryptographic assumptions and might perform non-trivial encodings on the contents of the memory. Since Goldreich and Ostrovsky’s work, a long line of research has been dedicated to improving the asymptotic efficiency of ORAM [12, 33, 40, 58, 61, 63]. Prior to our work, the best known scheme allowing computational assumptions was the elegant work by Patel et al. [53]: they showed the existence of an ORAM with \(O(\log N \cdot \log \log N)\) overhead, assuming one-way functions. In comparison with Goldreich and Ostrovsky’s original \(O(\log ^3 N)\) result, Patel et al. seemed tantalizingly close to matching the lower bound, but unfortunately, we were still not there and the construction of an optimal ORAM continued to elude us even after more than 30 years.

1.1 Our Results: Optimal Oblivious RAM

We resolve this long-standing problem by showing a matching upper bound to Larsen and Nielsen’s [41] lower bound: an ORAM scheme with \(O(\log N)\) overhead whose security failure probability is negligible in \(\lambda\), where \(N\) is the size of the memory and \(\lambda\) is the security parameter, assuming one-way functions. More concretely, we show2:

Theorem 1.1.

Assume that there is a PRF family that is secure against any \(\texttt{PPT}\) adversary except with a negligibly small probability in \(\lambda\). Assume that \(\lambda \le N \le T \le \mathsf {poly}(\lambda)\) for any fixed polynomial \(\mathsf {poly}(\cdot)\), where \(T\) is the number of accesses. Then, there is an ORAM scheme with \(O(\log N)\) amortized overhead (over \(T\ge N\) operations) whose security failure probability is upper bounded by a suitable negligible function in \(\lambda\).

An ORAM consists of a client (a.k.a. CPU) and a remote RAM. The client has a small trusted memory. Throughout this article, we shall assume a standard word-RAM where each memory word has at least \(w=\log N\) bits, i.e., large enough to store its own logical address. We assume that word-level addition and boolean operations can be done in unit cost. We assume that the CPU has a constant number of private registers. For our ORAM construction, we additionally assume that a single evaluation of a pseudorandom function (PRF), resulting in at least a word-size number of pseudo-random bits, can be done in unit cost.3 Note that all earlier computationally secure ORAM schemes, starting with the work of Goldreich and Ostrovsky [29, 31], make the same set of assumptions. Additionally, we remark that our result can be made statistically secure if one assumes a private random oracle to replace the PRF (the known logarithmic ORAM lower bound [29, 31, 41] still holds in this setting). In our model, the memory is completely passive and does not perform any computation beyond what it is instructed to do. Finally, we note that our construction suffers from huge constants due to the use of certain expander graphs; improving the concrete constant is left for future work.

In Appendix A, we provide a comparison with previous works, where we make the comparison more accurate and meaningful by explicitly stating the dependence on the error probability (which was assumed to be some negligible function in previous works).

1.2 Our Results: Optimal Oblivious Tight Compaction

Closing the remaining \(\log \log N\) gap for ORAM turns out to be highly challenging. Along the way, we construct an important building block of independent interest: a deterministic, linear-time, oblivious tight compaction algorithm. This result is an important contribution on its own and has intimate connections to classical algorithmic questions, as we explain below.

Tight compaction is the following task: given an input array of size \(n\) containing either real or dummy elements, output a permutation of the input array where all real elements appear in the front. Tight compaction can be considered as a restricted form of sorting, where each element in the input array receives a 1-bit key, indicating whether it is real or dummy. One naïve solution for tight compaction, therefore, is to rely on oblivious sorting to sort the input array [1, 32]; unfortunately, due to recent lower bounds [22, 44], we know that any oblivious sorting scheme must incur \(\Omega (n \cdot \log n)\) time on a word-RAM, either assuming that the algorithm treats each element as “indivisible” [44] or assuming that the famous Li-Li network coding conjecture [43] is true [22].
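To make the baseline concrete, here is a minimal Python sketch (function name ours) of the naïve sorting-based solution, using an odd-even transposition network for simplicity; this network uses \(O(n^2)\) compare-swaps rather than the \(O(n \cdot \log n)\) of the oblivious sorts cited above, but it illustrates the key point: the schedule of compared positions is fixed in advance, independent of the data, which is what makes the procedure oblivious.

```python
def oblivious_tight_compact_naive(arr, is_real):
    """Naive tight compaction via a sorting network.  Real elements
    (1-bit key 0) are moved to the front of the array.  The sequence
    of compared positions is fixed and data-independent, so the access
    pattern leaks nothing about which elements are real."""
    n = len(arr)
    # key 0 = real, key 1 = dummy; ascending sort puts reals first
    items = [(0 if r else 1, v) for r, v in zip(is_real, arr)]
    for _ in range(n):                      # odd-even transposition sort
        for phase in (0, 1):
            for i in range(phase, n - 1, 2):
                # compare-swap at the fixed position pair (i, i+1)
                if items[i][0] > items[i + 1][0]:
                    items[i], items[i + 1] = items[i + 1], items[i]
    return [v for _, v in items]
```

Since only the 1-bit keys are compared, this is exactly the restricted sorting problem described above; the lower bounds cited rule out doing asymptotically better with any oblivious sort that treats elements as indivisible.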

A natural question, therefore, is whether we can do asymptotically better than just naïvely sorting the input. It turns out that this question is related to a line of work in the classical algorithms literature, that is, the design of switching networks and routing on such networks [1, 4, 5, 23, 55, 56]. First, a line of combinatorial works showed the existence of linear-sized super-concentrators [54, 55, 62], i.e., switching networks with \(n\) inputs and \(n\) outputs such that vertex-disjoint paths exist from any \(k\) elements in the inputs to any \(k\) positions in the outputs. One could leverage a linear-sized super-concentrator construction to obliviously route all the real elements in the input to the front of the output array deterministically and in linear time (by routing elements along the routes), but it is not clear yet how to find routes (i.e., a set of vertex-disjoint paths) from the real input positions to the front of the output array.

In an elegant work in 1996, Pippenger [56] showed a deterministic, linear-time algorithm for route-finding but unfortunately, the algorithm is not oblivious. Shortly afterward, Leighton et al. [42] showed a probabilistic algorithm that tightly compacts \(n\) elements in \(O(n \cdot \log \log \lambda)\) time with \(1-{\sf negl}(\lambda)\) probability—their algorithm is almost oblivious except for leaking the number of reals and dummies. After Leighton et al. [42], this line of work remained somewhat stagnant for almost two decades. Only recently did we see some new results: Mitchell and Zimmerman [49] as well as Lin et al. [44] showed how to achieve the same asymptotics as Leighton et al. [42] while making the algorithm fully oblivious.

In this article, we give an explicit construction of a deterministic, oblivious algorithm that tightly compacts any input array of \(n\) elements in linear time, as stated in the following theorem:

Theorem 1.2 (Linear-time Oblivious Tight Compaction).

There is a deterministic, oblivious tight compaction algorithm that compacts \(n\) elements in \(O(\lceil D/w \rceil \cdot n)\) time on a word-RAM where \(D\) is the bit-width for encoding each element and \(w \ge \log n\) is the word size.

Our algorithm is not comparison-based and not stable and this is inherent. Specifically, Lin et al. [44] recently showed that any stable, oblivious tight compaction algorithm (that treats elements as indivisible) must incur \(\Omega (n \cdot \log n)\) runtime, where stability requires that the real elements in the output must appear in the same order as the input. Further, due to the well-known 0-1 principle [18, 66], any comparison-based tight compaction algorithm must incur at least \(\Omega (n\cdot \log n)\) runtime as well.4

Not only does the above compaction algorithm play a key role at several points in our ORAM construction, but it is also a useful primitive in its own right. For example, we use our compaction algorithm to give a perfectly oblivious algorithm that randomly permutes arrays of \(n\) elements in (worst-case) \(O(n\cdot \log n)\) time. All previously known such constructions have some probability of failure.

1.3 Related Works

Goldreich and Ostrovsky [29, 31] introduced the problem of Oblivious RAM, and also introduced the hierarchical framework. The works of [12, 53] (see also [33, 40]) showed how to reduce the overhead of Goldreich and Ostrovsky’s construction. We refer to Appendix A for a detailed analysis of the different overheads. Besides the hierarchical paradigm, Shi et al. [58] propose the tree-based paradigm for constructing ORAMs. Subsequent works [15, 16, 61, 63] have improved tree-based constructions, culminating in the works of Circuit ORAM [63] and Circuit OPRAM [15].


2 TECHNICAL ROADMAP

We give a high-level overview of our results. In Section 2.1 we provide a high-level overview of our ORAM construction which uses an oblivious tight compaction algorithm. In Section 2.2 we give a high-level overview of the techniques underlying our tight compaction algorithm.

2.1 Oblivious RAM

In this section, we present a high-level description of the main ideas and techniques underlying our ORAM construction. Full details are given later in the corresponding technical sections.

Hierarchical ORAM. The hierarchical ORAM framework, introduced by Goldreich and Ostrovsky [29, 31] and improved in subsequent works (e.g., [12, 33, 40]), works as follows. For a logical memory of \(N\) blocks, we construct a hierarchy of hash tables, henceforth denoted \(T_1,\ldots ,T_L\) where \(L = \log N\). Each \(T_i\) stores \(2^i\) memory blocks. We refer to table \(T_i\) as the \(i\)th level. In addition, we store next to each table a flag indicating whether the table is full or empty. When receiving an access request to \({\mathsf {read}}/{\mathsf {write}}\) some logical memory address \({\mathsf {addr}}\), the ORAM proceeds as follows:

Read phase. Access each non-empty level \(T_1,\ldots ,T_L\) in order and perform \({\sf Lookup}\) for \({\mathsf {addr}}\). If the item is found in some level \(T_i\), then the lookups in all subsequent non-empty levels \(T_{i+1},\ldots ,T_L\) are performed for a dummy element.

Write back. If the operation is \({\mathsf {read}}\), then return the data found in the read phase and write that value back to \(T_1\). If the operation is \({\mathsf {write}}\), then ignore the data found in the read phase and write the value provided in the access instruction to \(T_1\).

Rebuild. Find the first empty level \(\ell\). If no such level exists, set \(\ell ~{\sf :=}~ L\). Merge all \(\lbrace T_j\rbrace _{1 \le j\le \ell }\) into \(T_\ell\). Mark all levels \(T_1,\ldots ,T_{\ell -1}\) as empty and \(T_\ell\) as full.

For each access, we perform \(\log N\) lookups, one per hash table. Moreover, after \(t\) accesses, we rebuild the \(i\)th table \(\lceil t/2^{i} \rceil\) times. When implementing the hash table using the best known oblivious hash table (e.g., oblivious Cuckoo hashing [12, 33, 40]), building a level with \(2^k\) items obliviously requires \(O(2^k \cdot \log (2^k))=O(2^k \cdot k)\) time. This building algorithm is based on oblivious sorting, and its time overhead is inherited from the time overhead of the oblivious sort procedure (specifically, the best known algorithm for obliviously sorting \(n\) elements takes \(O(n\cdot \log n)\) time [1, 32]). Thus, summing over all levels (and ignoring the \(\log N\) lookup operations across different levels for each access), \(t\) accesses require \(\sum _{i=1}^{\log N} \lceil \frac{t}{2^{i}}\rceil \cdot O(2^{i} \cdot i) = O(t \cdot \log ^2 N)\) time. On the other hand, lookup takes essentially constant time per level (ignoring searching in stashes, which introduces an additive factor), and thus costs \(O(\log N)\) per access. Hence, there is an asymmetry between build time and lookup time, and the main overhead is the build.
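The read/write-back/rebuild logic above can be sketched as follows — a minimal, non-oblivious Python simulation (class and method names ours) of the level bookkeeping only; a real construction implements each level as an oblivious hash table and performs dummy lookups in the levels below a hit.

```python
class HierarchicalORAM:
    """Non-oblivious simulation of the hierarchical ORAM access logic.
    Python dicts stand in for the oblivious hash tables T_1..T_L."""

    def __init__(self, log_n):
        self.L = log_n
        self.levels = [None] * (log_n + 1)  # levels[i] holds up to 2^i blocks

    def access(self, op, addr, data=None):
        found = None
        # Read phase: scan every non-empty level in order; once addr is
        # found, a real ORAM would look up dummies in the levels below.
        for tbl in self.levels:
            if tbl is not None and found is None and addr in tbl:
                found = tbl.pop(addr)
        # Write back: the (possibly updated) block goes to the smallest level.
        value = data if op == "write" else found
        self._rebuild({addr: value})
        return found

    def _rebuild(self, new_blocks):
        # Merge T_1..T_ell into the first empty level T_ell (or T_L if
        # none is empty); smaller levels hold fresher copies of a block,
        # so they take precedence in the merge.
        merged = dict(new_blocks)
        ell = self.L
        for i in range(1, self.L + 1):
            if self.levels[i] is None:
                ell = i
                break
            for k, v in self.levels[i].items():
                merged.setdefault(k, v)
            self.levels[i] = None
        self.levels[ell] = merged
```

Running the simulation shows the geometric rebuild schedule: level \(i\) is rebuilt every \(2^i\) accesses, which is exactly the source of the \(O(t \cdot \log^2 N)\) total cost computed above when each rebuild costs \(O(2^i \cdot i)\).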

The work of Patel et al. [53]. Classically (e.g., [12, 29, 31, 33, 40]), oblivious hash tables were built to support (and be secure for) every input array. This required expensive oblivious sorting, causing the extra logarithmic factor. The key idea of Patel et al. [53] is to modify the hierarchical ORAM framework to realize ORAM from a weaker primitive: an oblivious hash table that works only for randomly shuffled input arrays. Patel et al. describe a novel oblivious hash table such that building a hash table containing \(n\) elements can be accomplished without oblivious sorting and consumes only \(O(n \cdot \log \log \lambda)\) total time5 and lookup consumes \(O(\log \log n)\) total time. Patel et al. argue that their hash table construction retains security not necessarily for every input, but when the input array is randomly permuted, and moreover the input permutation is unknown to the adversary.

To be able to leverage this relaxed hash table in hierarchical ORAM, a remaining question is the following: whenever a level is being rebuilt in the ORAM (i.e., a new hash table is being constructed), how do we make sure that the input array is randomly and secretly shuffled? A naïve answer is to employ an oblivious random permutation to permute the input, but known oblivious random permutation constructions require oblivious sorting, which brings us back to our starting point. Patel et al. solve this problem and show that there is no need to completely shuffle the input array. Recall that when building some level \(T_\ell\), the input array consists of only the unvisited elements in tables \(T_0,\ldots ,T_{\ell -1}\) (and \(T_\ell\) too if \(\ell\) is the largest level). Patel et al. argue that the unvisited elements in tables \(T_0,\ldots ,T_{\ell -1}\) are already randomly permuted within each table and the permutation is unknown to the adversary. Then, they present a new algorithm, called multi-array shuffle, that combines these arrays into a shuffled array within \(O(n \cdot \log \log \lambda)\) time, where \(n = |T_0|+|T_1|+\cdots +|T_{\ell -1}|\).6 The algorithm is somewhat involved, randomized, and has a negligible probability of failure.

The blueprint. Our construction builds upon and simplifies the construction of Patel et al. To get better asymptotic overhead, we improve their construction in two different aspects:

(1)

We show how to implement our variant of multi-array shuffle (called intersperse) in \(O(n)\) time. Specifically, we show a new reduction from intersperse to tight compaction.

(2)

We develop a hash table that supports build in \(O(n)\) time assuming that the input array is randomly shuffled. The lookup is \(O(1)\), ignoring time spent searching in stashes. Achieving this is rather non-trivial: first, we use a “packing” style trick to construct oblivious Cuckoo hash tables for small sizes where \(n \le \mathsf {poly}\log \lambda\), achieving linear-time build and constant-time lookup. Relying on the advantage we gain for problems of small sizes, we then show how to solve problems of medium and large sizes, again relying on oblivious tight compaction as a building block. The bootstrapping step from medium to large is inspired by Patel et al. [53] at a very high level, but our concrete construction differs from Patel et al. [53] in many technical details.

We describe the core ideas behind these improvements next. In Section 2.1.1, we present our multi-array shuffle algorithm. In Section 2.1.2, we show how to construct a hash table for shuffled inputs achieving linear build time and constant lookup.

2.1.1 Interspersing Randomly Shuffled Arrays.

Given two arrays, \({\bf I}_1\) and \({\bf I}_2\), of sizes \(n_1\) and \(n_2\), respectively, where each array is randomly shuffled, our goal is to output a single array that contains all elements from \({\bf I}_1\) and \({\bf I}_2\) in a randomly shuffled order. Ignoring obliviousness, we could first initialize an output array of size \(n=n_1+n_2\), mark exactly \(n_1\) random locations in the output array, and place the elements from \({\bf I}_1\) arbitrarily in these locations. The elements from \({\bf I}_2\) are placed in the unmarked locations.7 The challenge is how to perform this placement obliviously, without revealing the mapping from the input array to the output array.

We observe that this routing problem is exactly the “reverse” problem of oblivious tight compaction, where one is given an input array of size \(n\) containing keys that are 1-bit and the goal is to sort the array such that all elements with key 0 appear before all elements with key 1. Intuitively, by running this algorithm “in reverse”, we obtain a linear time algorithm for obliviously routing marked elements to an array with marked positions (that are not necessarily at the front). Since we believe that this procedure is useful in its own right, we formalize it independently and call it oblivious distribution. The full details appear in Section 6.
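The non-oblivious placement described above can be sketched in a few lines of Python (function name ours); the oblivious construction replaces this explicit routing with oblivious distribution, i.e., tight compaction run "in reverse".

```python
import random

def intersperse_nonoblivious(I1, I2):
    """Non-oblivious baseline for intersperse: choose |I1| random output
    positions, place I1 there and I2 in the remaining positions.  If I1
    and I2 are each randomly shuffled, the output is a uniform shuffle
    of their union.  The access pattern here leaks the routing, which
    is exactly what oblivious distribution avoids."""
    n = len(I1) + len(I2)
    marked = set(random.sample(range(n), len(I1)))  # n_1 marked slots
    out, it1, it2 = [None] * n, iter(I1), iter(I2)
    for pos in range(n):
        out[pos] = next(it1) if pos in marked else next(it2)
    return out
```

Note that the marked positions play the role of the 1-bit keys in tight compaction: compaction moves key-0 elements to the front, while distribution routes elements to an arbitrary marked set of positions.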

2.1.2 An Optimal Hash Table for Shuffled Inputs.

In this section, we first describe a warmup construction that can be used to build a hash table in \(O(n \cdot \mathsf {poly}\log \log {\lambda })\) time and supports lookups in \(O(\mathsf {poly}\log \log {\lambda })\) time. We will then get rid of the additional \(\mathsf {poly}\log \log {\lambda }\) factor in both the build and lookup phases.

Warmup: oblivious hash table with \(\mathsf {poly}\log \log {\lambda }\) slack. Intuitively, to build a hash table, the idea is to randomly distribute the \(n\) elements in the input into \(B ~{\sf :=}~ n / \mathsf {poly}\log {\lambda }\) bins of size \(\mathsf {poly}\log {\lambda }\) in the clear. The distribution is done according to a pseudorandom function with some secret key \(K\), where an element with address \({\mathsf {addr}}\) is placed in the bin with index \({\sf PRF} _K({\mathsf {addr}})\). Whenever we perform lookup for a real element \({\mathsf {addr}}^{\prime }\), we access the bin \({\sf PRF} _K({\mathsf {addr}}^{\prime })\); in which case, we might either find the element there (if it was originally one of the \(n\) elements in the input) or we might not find it in the accessed bin (in the case where the element is not part of the input array). Whenever we perform a dummy lookup, we just access a random bin.

Since we assume that the \(n\) balls are secretly and randomly permuted to begin with, the build procedure does not reveal the mapping from original elements to bins. However, a problem arises in the lookup phase. Since the total number of elements in each bin is revealed, looking up all the real keys of the input array would produce an access pattern identical to that of the build process, whereas looking up \(n\) dummy elements results in a fresh, independent balls-into-bins process of \(n\) balls into \(B\) bins.

To this end, we first throw the \(n\) balls into the \(B\) bins as before, revealing loads \(n_1,\ldots ,n_B\). Then, we sample new secret loads \(L_1,\ldots ,L_B\) corresponding to an independent process of throwing \(n^{\prime } ~{\sf :=}~ n\cdot (1-1/\mathsf {poly}\log \lambda)\) balls into \(B\) bins. By a Chernoff bound, with overwhelming probability \(L_i \lt n_i\) for every \(i\in [B]\). We extract from each bin arbitrary \(n_i- L_i\) elements obliviously and move them to an overflow pile (without revealing the \(L_i\)’s). The overflow pile contains only \(m ~{\sf :=}~ n/\mathsf {poly}\log \lambda\) elements, so we use a standard Cuckoo hashing scheme that can be built in \(O(m\cdot \log m) = O(n)\) time and supports lookups effectively in \(O(1)\) time (ignoring the stash).8 The crux of the security proof is showing that since the secret loads \(L_1,\ldots ,L_B\) are never revealed, they are large enough to mask the access pattern in the lookup phase so that it looks independent of the one leaked in the build phase.
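The build step above can be sketched as follows — a toy, non-oblivious Python model (function names, the SHA-256 PRF stand-in, and the parameter choices are ours): route each element to its PRF bin in the clear, sample fresh secret loads from an independent balls-into-bins process, and move the surplus of each bin to the overflow pile. The Chernoff-based guarantee \(L_i \lt n_i\) and the oblivious extraction are not modeled.

```python
import hashlib
import random

def prf_bin(key, addr, num_bins):
    # PRF stand-in: keyed hash of the address, reduced mod the bin count
    h = hashlib.sha256(key + addr.to_bytes(8, "big")).digest()
    return int.from_bytes(h[:8], "big") % num_bins

def build_warmup_table(elements, key, num_bins, epsilon=0.1):
    """Toy warmup build: bin each (addr, val) by PRF_K(addr) in the
    clear, then truncate bin i to a secret load L_i sampled from an
    independent throw of (1 - epsilon) * n balls, moving the surplus
    to an overflow pile.  In the real construction the truncation is
    done with oblivious tight compaction, hiding the L_i's."""
    bins = [[] for _ in range(num_bins)]
    for addr, val in elements:
        bins[prf_bin(key, addr, num_bins)].append((addr, val))
    # secret loads: an independent balls-into-bins process of n' balls
    loads = [0] * num_bins
    for _ in range(int(len(elements) * (1 - epsilon))):
        loads[random.randrange(num_bins)] += 1
    overflow = []
    for i in range(num_bins):
        while len(bins[i]) > loads[i]:
            overflow.append(bins[i].pop())
    return bins, overflow
```

In the real scheme the bins are of size \(\mathsf{poly}\log\lambda\) and the overflow pile is handled by a standard Cuckoo hash table, as described above.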

We glossed over many technical details, the most important ones being how the bin sizes are truncated to the secret loads \(L_1,\ldots ,L_B\) and how each bin is implemented. For the second question, since the bins are of \(O(\mathsf {poly}\log {\lambda })\) size, we support lookups using a perfectly secure ORAM construction that can be built in \(O(\mathsf {poly}\log {\lambda }\cdot \mathsf {poly}\log \log {\lambda })\) time and looked up in \(O(\mathsf {poly}\log \log {\lambda })\) time [14, 19] (this is essentially where our \(\mathsf {poly}\log \log\) factor comes from in this warmup). The first question is slightly more tricky, and here we employ our linear-time tight compaction algorithm to extract the number of elements we want from each bin.

The full details of the construction appear in Section 7.

Remark 2.1 (Comparison of the Warmup Construction with Patel et al. [53]).

Our warmup construction borrows the idea of revealing loads and then sampling new secret loads from Patel et al. However, our concrete instantiation is different and this difference is crucial for the next step where we get an optimal hash table. Particularly, the construction of Patel et al. has \(\log \log {\lambda }\) layers of hash tables of decreasing sizes, and one has to look for an element in each one of these hash tables, i.e., searching within \(\log \log {\lambda }\) bins. In our solution, by tightening the analysis (that is, the Chernoff bound), we show that a single layer of hash tables suffices; thus, lookup accesses only a single bin. This allows us to focus on optimizing the implementation of a bin towards the optimal construction.

Oblivious hash table with linear build time and constant lookup time. In the warmup construction (ignoring the lookup time in the stash of the overflow pile9), the only super-linear operation that we have is the use of a perfectly secure ORAM, which we employ for bins of size \(O(\mathsf {poly}\log {\lambda })\). In this step, we replace this with a data structure with linear-time build and constant-time lookup: a Cuckoo hash table for lists of polylogarithmic size.

Recall that in a Cuckoo hash table each element receives two random bin choices (e.g., determined by a PRF) among a total of \({\sf c}_{\sf cuckoo} \cdot n\) bins where \({\sf c}_{\sf cuckoo} \gt 1\) is a suitable constant. During build-time, the goal is for all elements to choose one of the two assigned bins, such that every bin receives at most one element. At this moment it is not clear how to accomplish this build process, but supposing we could obliviously build such a Cuckoo hash table in linear time, then the problem would be solved. Specifically, once we have built such a Cuckoo hash table, lookup can be accomplished in constant time by examining both bin choices made by the element (ignoring the issue of the stash for now). Since the bin choices are (pseudo-)random, the lookup process retains security as long as each element is looked up at most once. At the end of the lookups, we can extract the unvisited elements through oblivious tight compaction in linear time—it is not hard to see that if the input array is randomly shuffled, the extracted unvisited elements appear in a random order too.
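To make the two-bin-choices interface concrete, here is a toy, non-oblivious Python sketch of standard Cuckoo hashing (names and the SHA-256 PRF stand-in are ours); the paper's actual build computes the entire bin assignment obliviously and offline, whereas this sketch uses the classical online eviction strategy.

```python
import hashlib

def cuckoo_bins(key, addr, num_bins):
    """Two pseudo-random bin choices per element (PRF stand-in)."""
    h = hashlib.sha256(key + addr.to_bytes(8, "big")).digest()
    return (int.from_bytes(h[:8], "big") % num_bins,
            int.from_bytes(h[8:16], "big") % num_bins)

def cuckoo_insert(table, key, addr, val, num_bins, max_kicks=50):
    # Classical (non-oblivious) insertion with eviction; a failed
    # insertion would go to the stash in a full construction.
    for _ in range(max_kicks):
        b0, b1 = cuckoo_bins(key, addr, num_bins)
        for b in (b0, b1):
            if table[b] is None:
                table[b] = (addr, val)
                return True
        evicted, table[b0] = table[b0], (addr, val)
        addr, val = evicted
    return False

def cuckoo_lookup(table, key, addr, num_bins):
    """Constant-time lookup: probe both bin choices (stash ignored).
    Each probe touches a fixed pseudo-random location, so the access
    pattern is safe as long as addr is looked up at most once."""
    for b in cuckoo_bins(key, addr, num_bins):
        if table[b] is not None and table[b][0] == addr:
            return table[b][1]
    return None
```

The lookup touches exactly two pseudo-random bins regardless of whether the element is present, which is the property exploited above: lookups of distinct keys are indistinguishable from dummy lookups.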

Therefore the crux is how to build the Cuckoo hash table for polylogarithmically-sized, randomly shuffled input arrays. Our observation is that classical oblivious Cuckoo hash table constructions can be split into three steps: (1) assigning two possible bin choices per element, (2) assigning either one of the bins or the stash for every element, and (3) routing the elements according to the Cuckoo assignment. We delicately handle each step separately:

(1)

For step (1) the \(n = \mathsf {poly}\log \lambda\) elements in the input array can each evaluate the PRF on its associated key, and write down its two bin choices (this takes linear time).

(2)

Implementing step (2) in linear time is harder as this step is dominated by a sequence of oblivious sorts. To overcome this, we use the fact that the problem size \(n\) is of size \(\mathsf {poly}\log \lambda\). As a result, the index of each item and its two bin choices can be expressed using \(O(\log \log \lambda)\) bits which means that a single memory word (which is \(\log \lambda\) bits long) can hold \(O(\frac{\log \lambda }{\log \log \lambda })\) many elements’ metadata. We can now apply a “packed sorting” type of idea [2, 13, 17, 36] where we use the RAM’s word-level instructions to perform SIMD-style operations. Through this packing trick, we show that oblivious sorting and oblivious random permutation (of the elements’ metadata) can be accomplished in \(O(n)\) time!

(3)

Step (3) is classically implemented using oblivious bin distribution which again uses oblivious sorts. Here, we cannot use the packing trick since we operate on the elements themselves, so we use the fact that the input array is randomly shuffled and just route the elements in the clear.

There are many technical issues we glossed over, especially related to the fact that the Cuckoo hash tables consist of \({\sf c_{\sf cuckoo}}\cdot n\) bins, where \({\sf c_{\sf cuckoo}}\gt 1\). This requires us to pad the input array with dummies and later use them to fill the empty slots in the Cuckoo assignment. Additionally, we also need to get rid of these dummies when extracting the set of unvisited elements. All of these require several additional (packed) oblivious sorts or invocations of our oblivious tight compaction.
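The packing idea from step (2) can be illustrated with a toy Python sketch (function names ours; Python's unbounded integers stand in for \(\log\lambda\)-bit machine words): many small \(O(\log\log\lambda)\)-bit metadata items are packed into one word, and a single word-level operation then acts on all of them at once, SIMD-style.

```python
def pack(values, b):
    """Pack small b-bit values into one integer 'memory word'.  With
    n = polylog(lambda) elements, each index fits in O(log log lambda)
    bits, so many items fit per log(lambda)-bit word."""
    word = 0
    for i, v in enumerate(values):
        assert 0 <= v < (1 << b)
        word |= v << (i * b)
    return word

def unpack(word, b, count):
    mask = (1 << b) - 1
    return [(word >> (i * b)) & mask for i in range(count)]

def packed_add(w1, w2):
    # SIMD-style lane-wise addition: if every lane value fits in b-1
    # bits, one word addition adds all lanes at once, since no carry
    # can cross a lane boundary.
    return w1 + w2
```

The actual construction builds packed oblivious sorting and packed oblivious random permutation out of such word-level operations, which is how step (2) runs in \(O(n)\) time for \(n = \mathsf{poly}\log\lambda\).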

We refer the reader to Section 8 for the full details of the construction.

2.1.3 Additional Technicalities.

The above description, of course, glossed over many technical details. To obtain our final ORAM construction, there are still a few concerns that have not been addressed. First, recall that we need to make sure that the unvisited elements in a hash table appear in a (pseudo-)random order such that we can make use of this residual randomness to re-initialize new hash tables faster. To guarantee this for the Cuckoo hash table that we employ for \(\mathsf {poly}\log \lambda\)-sized bins, we need that the underlying Cuckoo hash scheme we employ satisfies an additional property called the “indiscriminating bin assignment” property: specifically, we need that the two pseudo-random Cuckoo-bin choices for each element do not depend on the order in which they are added, their keys, or their positions in the input array. In our technical sections later, this property will allow us to do a coupling argument and prove that the residual unvisited elements in the Cuckoo hash table appear in random order.

Additionally, some technicalities remain in how we treat the smallest level of the ORAM and the stashes. The smallest level in the ORAM construction cannot use the hash table construction described earlier. This is because elements are added to the smallest level as soon as they are accessed and our hash table does not support such an insertion. We address this by using an oblivious dictionary built atop a perfectly secure ORAM for the smallest level of the ORAM. This incurs an additive \(O(\mathsf {poly}\log \log {\lambda })\) blowup. Finally, the stashes for each of the Cuckoo hash tables (at every level and every bin within the level) incur \(O(\log {\lambda })\) time. We leverage the techniques from Kushilevitz et al. [40] to merge all stashes into a common stash of size \(O(\log ^2 {\lambda })\), which is added to the smallest level when it is rebuilt.

On deamortization. As the overhead of our ORAM is amortized over several accesses, it is natural to ask whether we can deamortize the construction to achieve the same overhead in the worst case, per access. Historically, Ostrovsky and Shoup [51] deamortized the hierarchical ORAM of Goldreich and Ostrovsky [31], and related techniques were later applied to other hierarchical ORAM schemes [12, 34, 40]. Unfortunately, the technique fails for our ORAM, as we explain below (it fails for Patel et al. [53] as well, for the same reason).

Recall that in the hierarchical ORAM, the \(i\)th level hash table stores \(2^i\) keys and is rebuilt every \(2^i\) accesses. The core idea of existing deamortization techniques is to spread the rebuilding work over the next sequence of \(2^i\) ORAM accesses. That is, copy the \(2^i\) keys (to be rebuilt) to another working space while performing lookups on the same level \(i\) to fulfill the next \(2^i\) accesses. However, by plugging such copy-while-accessing into our ORAM, an adversary can access a key in level \(i\) right after the same level is fully copied (as the copying had no way to foresee future accesses). Then, in the adversary's eyes, the copied keys are no longer randomly shuffled, which breaks the security of the hash table (which assumes that its input is shuffled). Indeed, in previous works, where hash tables were secure for every input, such deamortization works. Deamortizing our construction is left as an open problem.

2.2 Tight Compaction

Recall that tight compaction can be considered as a restricted form of sorting, where each element in the input array receives a 1-bit key, indicating whether it is real or dummy. The goal is to move all the real elements in the array to the front obliviously, and without leaking how many elements are reals. We show a deterministic algorithm for this task.

Reduction to loose compaction. Pippenger’s self-routing super-concentrator construction [56] proposes a technique that reduces the task of tight compaction to that of loose compaction. Informally speaking, loose compaction receives as input a sparse array, containing a few real elements and many dummy elements. The output is a compressed output array, containing all real elements, but the procedure does not necessarily remove all the dummy elements. More concretely, we care about a specific form of loose compactor (parametrized by \(n\)): consider a suitable bipartite expander graph that has \(n\) vertices on the left and \(n/2\) vertices on the right, where each node has a constant degree. At most a \(1/128\) fraction of the vertices on the left will receive a real element, and we would like to route all real elements over vertex-disjoint paths to the right side such that every right vertex receives at most 1 element. The crux is to find a set of satisfying routes in linear time and obliviously. Once a set of feasible routes has been identified, it is easy to see that performing the actual routing can be done obliviously in linear time (and for obliviousness we need to route a dummy element over an edge that bears 0 load). During this process, we effectively compress the sparse input array (represented by vertices on the left) by \(1/2\) without losing any element.
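To pin down the interface of one loose-compaction step, here is a plain functional specification in Python (name ours, and deliberately not the oblivious expander-based algorithm): an input of size \(n\) carrying at most an \(1/128\) fraction of real elements is compressed to an output of size \(n/2\) that contains every real element, possibly alongside surviving dummies.

```python
def loose_compact_spec(arr, is_real):
    """Functional spec of one loose-compaction step.  Precondition:
    at most n/128 of the n input slots are real.  Postcondition: the
    output has n/2 slots and contains every real element; dummies
    (None) need not all be removed.  The oblivious construction
    achieves this by routing over a constant-degree bipartite expander
    with n left vertices and n/2 right vertices."""
    n = len(arr)
    reals = [arr[i] for i in range(n) if is_real[i]]
    assert len(reals) <= n // 128, "input too dense for loose compaction"
    return reals + [None] * (n // 2 - len(reals))
```

Iterating such halving steps (together with the extra bookkeeping in the claim below) is what yields full tight compaction.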

Using Pippenger’s techniques [56] and with a little extra work, we can derive the following claim—at this point we simply state the claim while deferring algorithmic details to subsequent technical sections. Below \(D\) denotes the number of bits it takes to encode an element and \(w\) denotes the word size:

  • Claim: There exist appropriate constants \(C, C^{\prime } \gt 6\) such that the following holds: if we can solve the aforementioned loose compaction problem obliviously in time \(T(n)\) for all \(n \le n_0\), then we can construct an oblivious algorithm that tightly compacts \(n\) elements in time \(C \cdot T(n) + C^{\prime } \cdot \lceil D/w \rceil \cdot n\) for all \(n \le n_0\).

As mentioned, the crux is to find satisfying routes for such a “loose compactor” bipartite graph obliviously and in linear time. Achieving this is non-trivial: for example, the recent work of Chan et al. [14] attempted to do this but their route-finding algorithm requires \(O(n \log n)\) runtime—thus Chan et al. [14]’s work also implies a loose compaction algorithm that runs in time \(O(n \log n + \lceil D/w \rceil \cdot n)\). To remove the extra \(\log n\) factor, we introduce two new ideas, packing and decomposition—in fact, both ideas are remotely reminiscent of a line of works in the core algorithms literature on (non-comparison-based, non-oblivious) integer sorting on RAMs [2, 17, 36], but naturally we apply these techniques to a different context.

Packing: linear-time compaction for small instances.. We observe that the offline route-finding phase operates only on metadata. Specifically, the route-finding phase receives the following as input: an array of \(n\) bits where the \(i\)th bit indicates whether the \(i\)th input position is real or dummy. If the problem size \(n\) is small, specifically, if \(n \le w/\log w\) where \(w\) denotes the width of a memory word, we can pack the entire problem into a single memory word (since each element’s index can be described in \(\log n\) bits). In our technical sections we will show how to rely on word-level addition and boolean operations to solve such small problem instances in \(O(n)\) time. At a high level, we follow the slow route-finding algorithm by Chan et al. [14], but now within a single memory word, we can effectively perform SIMD-style operations and we exploit this to speed up Chan et al. [14]’s algorithm by a logarithmic factor for small instances.

Relying on the above Claim that allows us to go from loose to tight, we now have an \(O(n)\)-time oblivious tight compaction algorithm for small instances where \(n \le w / \log w\); specifically, if the loose compaction algorithm takes \(C_0 \cdot n\) time for \(C_0\ge 1\), then the runtime of the tight compaction would be upper bounded by \(C \cdot C_0 \cdot n + C^{\prime } \cdot \lceil D/w \rceil \cdot n \le C \cdot C_0 \cdot C^{\prime } \cdot \lceil D/w \rceil \cdot n\).
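To illustrate the packing idea concretely, the sketch below (Python integers standing in for a \(w\)-bit word) packs \(n\) small fields into one word and updates all of them with a single word-level addition. The field width and the broadcast trick are illustrative assumptions of this sketch, not the paper's exact encoding.

```python
# Illustrative sketch of word packing: n fields of `bits` bits each live in
# one machine word (modeled by a Python int), so one word-level operation
# acts on all of them at once, SIMD-style.
def pack(fields, bits):
    word = 0
    for i, v in enumerate(fields):
        word |= v << (i * bits)
    return word

def unpack(word, n, bits):
    mask = (1 << bits) - 1
    return [(word >> (i * bits)) & mask for i in range(n)]

def add_to_all(word, n, bits, c):
    # One addition updates every field, provided each field has enough
    # headroom that no carry crosses a field boundary.
    broadcast = sum(c << (i * bits) for i in range(n))
    return word + broadcast
```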

Decomposition: bootstrapping larger instances of compaction.. With this logarithmic advantage we gain in small instances, our hope is to bootstrap larger instances by decomposing larger instances into smaller ones.

Our bootstrapping is done in two steps—as we calculate below, each time we bootstrap, the constant hidden inside the \(O(n)\) runtime blows up by a constant factor; thus it is important that the bootstrapping is done for only \(O(1)\) times.

(1)

Medium instances: \(n \le (w/\log w)^2\). For medium instances, our idea is to divide the input array into \(\sqrt {n}\) segments, each of size \(B := \sqrt {n}\). As long as the input array has only \(n/128\) or fewer real elements, at most \(\sqrt {n}/4\) segments can be dense, i.e., contain more than \(\sqrt {n}/4\) real elements (the fraction 1/4 is loose but sufficient). We rely on tight compaction for small instances to move the dense segments in front of the sparse ones. For each of the remaining \(3\sqrt {n}/4\) segments, we next compress away \(3/4\) of the space using tight compaction for small instances. Clearly, the above procedure is a loose compaction and consumes at most \(2 \cdot C \cdot C^{\prime } \cdot C_0 \cdot \lceil D/w \rceil \cdot n + 6 \lceil D/w \rceil \cdot n \le 2.5 \cdot C \cdot C^{\prime } \cdot C_0 \cdot \lceil D/w \rceil \cdot n\) runtime.

So far we have constructed a loose compaction algorithm for medium instances. Using the aforementioned Claim, we can in turn construct an algorithm that obliviously and tightly compacts a medium-sized instance of size \(n \le (w/\log w)^2\) in time at most \(3 C^2 \cdot C^{\prime } \cdot C_0 \cdot \lceil D/w \rceil \cdot n\).

(2)

Large instances: arbitrary \(n\). We can now bootstrap to arbitrary choices of \(n\) by dividing the problem into \(m := n/(\frac{w}{\log w})^2\) segments, where each segment contains at most \((\frac{w}{\log w})^2\) elements. Similar to the medium case, at most a \(1/4\) fraction of the segments can have real density exceeding \(1/4\)—we call such segments dense. As before, we would like to move the dense segments to the front and the sparse ones to the end. Recall that Chan et al. [14]’s algorithm solves loose compaction for problems of arbitrary size \(m\) in time \(C_1 \cdot (m \log m + \lceil D/w \rceil m)\). Thus, due to the above Claim, we can solve tight compaction for problems of any size \(m\) in time \(C \cdot C_1 \cdot (m \log m + \lceil D/w \rceil \cdot m) + C^{\prime } \cdot \lceil D/w \rceil \cdot m\). Thus, in \(O(\lceil D/w \rceil \cdot n)\) time we can move all the dense segments to the front and the sparse segments to the end. Finally, by invoking medium instances of tight compaction, we can compact within each segment in time that is linear in the size of the segment. This allows us to compress away \(3/4\) of the space from the last \(3/4\) of the segments, which are guaranteed to be sparse. This gives us loose compaction for large instances in \(O(\lceil D/w \rceil \cdot n)\) time—from here we can construct oblivious tight compaction for large instances using the above Claim.
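The decomposition for medium instances can be sketched as follows. This is a non-oblivious reference that fixes only the structure: it assumes \(n = B^2\) is a perfect square with at most \(n/128\) real elements, and, as a simplification, it compresses every sparse segment rather than exactly the trailing \(3\sqrt{n}/4\) of them.

```python
import math

# Non-oblivious reference of the medium-instance decomposition (structure
# only).  Assumes n = B*B is a perfect square and at most n/128 elements
# are real; the oblivious version replaces the list comprehensions by
# small-instance tight compaction.
def medium_loose_compact(arr, real, dummy=None):
    n = len(arr)
    B = math.isqrt(n)                      # segment size sqrt(n)
    segments = [(arr[i*B:(i+1)*B], real[i*B:(i+1)*B]) for i in range(B)]
    dense  = [sf for sf in segments if sum(sf[1]) >  B // 4]
    sparse = [sf for sf in segments if sum(sf[1]) <= B // 4]
    out = []
    for seg, _ in dense:                   # dense segments kept whole, in front
        out.extend(seg)
    for seg, flags in sparse:              # sparse segments: keep only B/4 slots
        kept = [x for x, m in zip(seg, flags) if m]
        out.extend(kept + [dummy] * (B // 4 - len(kept)))
    return out                             # all real elements, at most n/2 slots
```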

Remark 2.2.

In our formal technical sections later, we in fact directly use loose compaction for smaller problem sizes to bootstrap loose compaction for larger problem sizes (whereas in the above version we use tight compaction for smaller problems to bootstrap loose compaction for larger problems). The detailed algorithm is similar to the one described above: it requires slightly more complicated parameter calculation but results in better constants than the above more intuitive version.


3 PRELIMINARIES

Throughout this work, the security parameter is denoted \(\lambda\), and it is given as input to algorithms in unary (i.e., as \(1^\lambda\)). A function \({\sf negl}:\mathbb {N}\rightarrow \mathbb {R}^+\) is negligible if for every constant \(c \gt 0\) there exists an integer \(N_c\) such that \({\sf negl}(\lambda) \lt \lambda ^{-c}\) for all \(\lambda \gt N_c\). Two sequences of random variables \(X = \lbrace X_\lambda \rbrace _{\lambda \in \mathbb {N}}\) and \(Y = \lbrace Y_\lambda \rbrace _{\lambda \in \mathbb {N}}\) are computationally indistinguishable if for any probabilistic polynomial-time algorithm \(\mathcal {A}\), there exists a negligible function \({\sf negl}(\cdot)\) such that \(| \Pr [\mathcal {A}(1^{\lambda }, X_\lambda) = 1] - \Pr [\mathcal {A}(1^{\lambda },Y_\lambda) = 1] | \le {\sf negl}(\lambda)\) for all \(\lambda \in \mathbb {N}\). We say that \(X \equiv Y\) for two such sequences if they define identical random variables for every \(\lambda \in \mathbb {N}\). The statistical distance between two random variables \(X\) and \(Y\) over a finite domain \(\Omega\) is defined by \({\sf SD}(X,Y) \triangleq \frac{1}{2}\cdot \sum _{x\in \Omega }^{} \left|\Pr [X=x] - \Pr [Y=x]\right|\). For an integer \(n \in \mathbb {N}\) we denote by \([n]\) the set \(\lbrace 1,\ldots , n\rbrace\). By \(\Vert\) we denote the operation of string concatenation.
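The statistical distance formula above translates directly to code; a small Python helper for finite distributions given as `{outcome: probability}` dictionaries:

```python
# Direct transcription of SD(X, Y) = (1/2) * sum_x |Pr[X=x] - Pr[Y=x]|
# for finite distributions represented as {outcome: probability} dicts.
def statistical_distance(p, q):
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)
```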

Definition 3.1

(Pseudorandom Functions (PRFs)).

Let \(\mathsf {PRF}\) be an efficiently computable function family indexed by keys \(\texttt{sk} \in \lbrace 0,1\rbrace ^\lambda\), where each \(\mathsf {PRF}_{\texttt{sk} }\) takes as input a value \(x\in \lbrace 0,1\rbrace ^{n(\lambda)}\) and outputs a value \(y\in \lbrace 0,1\rbrace ^{m(\lambda)}\), where \(n\) and \(m\) are polynomials. A function family \(\mathsf {PRF}\) is \(\delta\)-secure if for every (non-uniform) probabilistic polynomial-time algorithm \({\mathcal {A}}\), it holds that \(\begin{align*} \left| \Pr _{\texttt{sk} \leftarrow \lbrace 0,1\rbrace ^\lambda } \left[{\mathcal {A}} ^{\mathsf {PRF}_\texttt{sk} (\cdot)}(1^\lambda) = 1 \right] - \Pr _{f\leftarrow F_\lambda }\left[{\mathcal {A}} ^{f(\cdot)}(1^\lambda) = 1\right] \right| \le \delta (\lambda), \end{align*}\) for all large enough \(\lambda \in \mathbb {N}\), where \(F_\lambda\) is the set of all functions that map \(\lbrace 0,1\rbrace ^{n(\lambda)}\) into \(\lbrace 0,1\rbrace ^{m(\lambda)}\).

It is known that one-way functions are existentially equivalent to PRFs for any polynomial \(n(\cdot)\) and \(m(\cdot)\) and negligible \(\delta (\cdot)\) [37, 50]. Our construction will employ PRFs in several places and we present each part modularly with its own PRF, but note that the whole ORAM construction can be implemented with a single PRF from which we can implicitly derive all other PRFs.
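As a concrete (purely heuristic) illustration of deriving all sub-PRFs from a single PRF, the sketch below uses HMAC-SHA256 as a stand-in PRF and domain-separates sub-keys by a per-module label; the `b"derive:"` prefix is a convention of this sketch, not the paper's construction.

```python
import hmac
import hashlib

# Heuristic illustration only: HMAC-SHA256 stands in for the PRF, and each
# sub-module gets its own PRF via domain separation on a label.
def derive_prf(master_key: bytes, label: bytes):
    sub_key = hmac.new(master_key, b"derive:" + label, hashlib.sha256).digest()
    def prf(x: bytes) -> bytes:
        return hmac.new(sub_key, x, hashlib.sha256).digest()
    return prf
```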

3.1 Oblivious Machines

We define an oblivious simulation of (possibly randomized) functionalities. We provide a unified framework that enables us to adopt composition theorems from secure computation literature (see, for example, Canetti and Goldreich [9, 10, 30]), and to prove constructions in a modular fashion.

Random-access machines.. A RAM is an interactive Turing machine that consists of a memory and a CPU. The memory is denoted as \({\mathsf {mem}} [{N},{w}]\), and is indexed by the logical address space \([N] = \lbrace 1,2,\ldots ,N\rbrace\). We refer to each memory word also as a block and we use \({{w}}\) to denote the bit-length of each block. The CPU has an internal state that consists of \(O(1)\) words. The memory supports read/write instructions \((\mathsf {op},{\mathsf {addr}}, {\mathsf {data}})\), where \(\mathsf {op} \in \lbrace {\mathsf {read}},{\mathsf {write}} \rbrace\), \({\mathsf {addr}} \in [N]\) and \({\mathsf {data}} \in \lbrace 0,1\rbrace ^{{w}}\cup \lbrace \bot \rbrace\). If \(\mathsf {op} = {\mathsf {read}}\), then \({\mathsf {data}} =\bot\) and the returned value is the content of the block located in logical address \({\mathsf {addr}}\) in the memory. If \(\mathsf {op} ={\mathsf {write}}\), then the memory data in logical address \({\mathsf {addr}}\) is updated to \({\mathsf {data}}\). We use the standard setting that \({w}= \Theta (\log N)\) (so a word can store an address). We follow the convention that the CPU performs one word-level operation per unit time, i.e., arithmetic operations (addition or subtraction), bitwise operations (AND, OR, NOT, or shift), memory accesses (read or write), or evaluating a pseudorandom function [12, 31, 33, 40, 41, 53].
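A plain (non-oblivious) Python model of \({\mathsf {mem}} [{N},{w}]\) and its \((\mathsf {op},{\mathsf {addr}},{\mathsf {data}})\) interface, only to pin down the conventions above:

```python
# Plain model of mem[N, w]: N blocks of w bits, 1-indexed logical addresses,
# read/write instructions of the form (op, addr, data).
class Memory:
    def __init__(self, N, w):
        self.N, self.w = N, w
        self.cells = [0] * N               # logical addresses 1..N

    def access(self, op, addr, data=None):
        assert op in ("read", "write") and 1 <= addr <= self.N
        if op == "read":
            assert data is None            # read carries data = bot
            return self.cells[addr - 1]
        assert 0 <= data < (1 << self.w)   # a block is a w-bit word
        self.cells[addr - 1] = data
```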

Oblivious simulation of a (non-reactive) functionality.. We consider machines that interact with the memory via \({\mathsf {read}}/{\mathsf {write}}\) operations. We are interested in defining sub-functionalities such as oblivious sorting, oblivious shuffling of memory contents, and more, and then define more complex primitives by composing the above. For simplicity, we assume for now that the adversary cannot see memory contents, and does not see the \({\mathsf {data}}\) field in each operation \((\mathsf {op},{\mathsf {addr}},{\mathsf {data}})\) that the memory receives. That is, the adversary only observes \((\mathsf {op},{\mathsf {addr}})\). One can extend the constructions for the case where the adversary can also observe \({\mathsf {data}}\) using symmetric encryption in a straightforward way.

We define oblivious simulation of a RAM program. Let \(f:\lbrace 0,1\rbrace ^* \rightarrow \lbrace 0,1\rbrace ^*\) be a (possibly randomized) functionality in the RAM model. We denote the output of \(f\) on input \(x\) to be \(f(x) = y\). Oblivious simulation of \(f\) is a RAM machine \(M_f\) that interacts with the memory, has the same input/output behavior, but its access pattern to the memory can be simulated. More precisely, we let \(({\sf out},{\sf Addrs})\leftarrow M_f(x)\) be a pair of random variables where \({\sf out}\) corresponds to the output of \(M_f\) on input \(x\) and where \({\sf Addrs}\) define the sequence of memory accesses during the execution. We say that the machine \(M_f\) implements the functionality \(f\) if it holds that for every input \(x\), the distribution \(f(x)\) is identical to the distribution \({\sf out}\), where \(({\sf out},\cdot) \leftarrow M_f(x)\). In terms of security, we require oblivious simulation which we formalize by requiring the existence of a simulator that simulates the distribution of \({\sf Addrs}\) without knowing \(x\).

Definition 3.2

(Oblivious Simulation).

Let \(f:\lbrace 0,1\rbrace ^*\rightarrow \lbrace 0,1\rbrace ^*\) be a functionality, and let \(M_f\) be a machine that interacts with the memory. We say that \(M_f\) obliviously simulates the functionality \(f\), if there exists a probabilistic polynomial time simulator \({\mathsf {Sim}}\) such that for every input \(x \in \lbrace 0, 1\rbrace ^*\), the following holds: \(\begin{align*} \left\lbrace ({\sf out},{\sf Addrs}) : ({\sf out},{\sf Addrs})\leftarrow M_f(1^\lambda , x) \right\rbrace _{\lambda } \approx \left\lbrace \left(f(x),{\mathsf {Sim}} (1^\lambda , 1^{|x|})\right)\right\rbrace _{\lambda }. \end{align*}\) Depending on whether \(\approx\) refers to computational, statistical, or perfectly indistinguishability, we say \(M_f\) is computationally, statistically, or perfectly oblivious, respectively.

Later in our theorem statements, we often wish to explicitly characterize the security failure probability. Thus we also say that \(M_f\)\((1-\delta)\)-obliviously simulates the functionality \(f\), iff no non-uniform \(\texttt{PPT}\)\({\mathcal {A}} (1^\lambda)\) can distinguish the above joint distributions with probability more than \(\delta (\lambda)\)—note that the failure probability \(\delta\) is allowed to depend on the adversary \({\mathcal {A}}\)’s algorithm and running time. Additionally, a 1-oblivious algorithm is also called perfectly-oblivious.

Intuitively, the above definition requires indistinguishability of the joint distribution of the output of the computation and the access pattern, similarly to the standard definition of secure computation in which the joint distribution of the output of the function and the view of the adversary is considered (see the relevant discussions in Canetti and Goldreich [9, 10, 30]). Note that here we handle correctness and obliviousness in a single definition. As an example, consider an algorithm that randomly permutes some array in the memory, while leaking only the size of the array. Such a task should also hide the chosen permutation. As such, our definition requires that the simulation would output an access pattern that is independent of the output permutation itself.
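To make Definition 3.2 concrete, consider a toy example: a linear-scan maximum whose access pattern depends only on the input length, so a simulator given just \(|x|\) reproduces \({\sf Addrs}\) exactly, i.e., the machine is perfectly oblivious. The function names below are illustrative, not from the paper.

```python
# Toy instance of oblivious simulation: the trace of a linear scan is a
# fixed function of the input length, so a simulator that sees only |x|
# outputs the exact same Addrs.
def max_with_trace(mem):
    addrs, best = [], None
    for i in range(len(mem)):
        addrs.append(("read", i))          # access pattern: one read per cell
        if best is None or mem[i] > best:
            best = mem[i]
    return best, addrs

def simulate_addrs(n):
    return [("read", i) for i in range(n)]
```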

Parametrized functionalities.. In our definition, the simulator receives no input, except the security parameter and the length of the input. While this is very restricting, the simulator knows the description of the functionality and therefore also its “public” parameters. We sometimes define functionalities with explicit public inputs and refer to them as “parameters”. For instance, the access pattern of a procedure for sorting of an array depends on the size of the array; a functionality that sorts an array will be parameterized by the size of the array, and this size will also be known by the simulator.

Modeling reactive functionalities.. We consider functionalities that are reactive, i.e., proceed in stages, where the functionality preserves an internal state between stages. Such a reactive functionality can be described as a sequence of functions, where each function also receives as input a state, updates it, and outputs an updated state for the next function. We extend Definition 3.2 to deal with such functionalities.

We consider a reactive functionality \({\mathcal {F}}\) as a reactive machine that receives commands of the form \(({\sf command}_i,{\sf inp}_i)\) and produces an output \({\sf out}_i\), while maintaining some (secret) internal state. An implementation of the functionality \({\mathcal {F}}\) is defined analogously, as an interactive machine \(M_{\mathcal {F}}\) that receives commands of the same form \(({\sf command}_i,{\sf inp}_i)\) and produces outputs \({\sf out}_i\). We say that \(M_{\mathcal {F}}\) is oblivious if there exists a simulator \({\mathsf {Sim}}\) that can simulate the access pattern produced by \(M_{\mathcal {F}}\) while receiving only \({\sf command}_i\) but not \({\sf inp}_i\). Our simulator \({\mathsf {Sim}}\) is also a reactive machine that might maintain a state between executions.

In more detail, we consider an adversary \(\mathcal {A}\) (i.e., the distinguisher or the “environment”) that participates in either a real execution or an ideal one, and we require that its view in both executions be indistinguishable. The adversary \(\mathcal {A}\) adaptively chooses the next command \(({\sf command}_i,{\sf inp}_i)\) in each stage. In the ideal execution, the functionality \({\mathcal {F}}\) receives \(({\sf command}_i,{\sf inp}_i)\) and computes \({\sf out}_i\) while maintaining its secret state. The simulator is then executed on input \({\sf command}_i\) and produces an access pattern \({\sf Addrs}_i\). The adversary receives \(({\sf out}_i,{\sf Addrs}_i)\). In the real execution, the machine \(M\) receives \(({\sf command}_i,{\sf inp}_i)\) and has to produce \({\sf out}_i\) while the adversary observes the access pattern. We let \(({\sf out}_i, {\sf Addrs}_i) \leftarrow M_f({\sf command}_i,{\sf inp}_i)\) denote the joint distribution of the output and the memory access pattern produced by \(M\) upon receiving \(({\sf command}_i,{\sf inp}_i)\) as input. The adversary can then choose the next command, as well as the next input, adaptively according to the output and access pattern it received.

Definition 3.3

(Oblivious Simulation of a Reactive Functionality).

We say that a reactive machine \(M_{\mathcal {F}}\) is an oblivious implementation of the reactive functionality \({\mathcal {F}}\) if there exists a PPT simulator \({\sf Sim}\), such that for any non-uniform PPT (stateful) adversary \(\mathcal {A}\), the view of the adversary \(\mathcal {A}\) in the following two experiments \({\sf Expt}^{{\rm real}, {M}}_{\mathcal {A}}(1^\lambda)\) and \({\sf Expt}^{{\rm ideal}, {\mathcal {F}}}_{\mathcal {A},{\mathsf {Sim}}}(1^\lambda)\) is computationally indistinguishable:

Definition 3.3 can be extended in a natural way to the cases of statistical security (in which \(\mathcal {A}\) is unbounded and its view in both worlds is statistically close), or perfect security (\(\mathcal {A}\) is unbounded and its view is identical).

To allow our theorem statements to explicitly characterize the security failure probability, we also say that \(M_f\)\((1 -\delta)\)-obliviously simulates the reactive functionality \(\mathcal {F}\), iff no non-uniform probabilistic polynomial-time \({\mathcal {A}} (1^\lambda)\) can distinguish the above joint distributions with probability more than \(\delta (\lambda)\)—note that the failure probability \(\delta\) is allowed to depend on the adversary \({\mathcal {A}}\)’s algorithm and running time.

An example: ORAM. An example of a reactive functionality is an ordinary ORAM, implementing logical memory. Functionality 3.4 is a reactive functionality in which the adversary can choose the next command (i.e., either \({\mathsf {read}}\) or \({\mathsf {write}}\)) as well as the address and data according to the access pattern it has observed so far.

Definition 3.3 requires the existence of a simulator that on each \({\sf Access}\) command only knows that such a command occurred, and successfully simulates the access pattern produced by the real implementation. This is a strong notion of security since the adversary is adaptive and can choose the next command according to what it has seen so far.

Hybrid model and composition.. We sometimes describe executions in a hybrid model. In this case, a machine \(M\) interacts with the memory via \({\mathsf {read}}/{\mathsf {write}}\)-instructions and in addition, can also send \({\mathcal {F}}\)-instructions to the memory. We denote this model as \(M^{\mathcal {F}}\). When invoking a functionality \({\mathcal {F}}\), we assume that it only affects the address space on which it is instructed to operate; this is achieved by first copying the relevant memory locations to a temporary position, running \({\mathcal {F}}\) there, and finally copying the result back. This is the same whether \({\mathcal {F}}\) is reactive or not. Definition 3.3 is then modified such that the access pattern \({\sf Addrs}_i\) also includes the commands sent to \({\mathcal {F}}\) (but not the inputs to the command). When a machine \(M^{\mathcal {F}}\) obliviously implements a functionality \({\mathcal {G}}\) in the \({\mathcal {F}}\)-hybrid model, we require the existence of a simulator \({\mathsf {Sim}}\) that produces the access pattern exactly as in Definition 3.3, where here the access pattern might also contain \({\mathcal {F}}\)-commands.

Concurrent composition follows from [10], since our simulations are universal and straight-line. Thus, if (1) some machine \(M\) obliviously simulates some functionality \({\mathcal {G}}\) in the \({\mathcal {F}}\)-hybrid model, and (2) there exists a machine \(M_{\mathcal {F}}\) that obliviously simulates \({\mathcal {F}}\) in the plain model, then there exists a machine \(M^{\prime }\) that obliviously simulates \({\mathcal {G}}\) in the plain model.

Input assumptions.. In some algorithms, we assume that the input satisfies some assumptions. For instance, we might assume that the input array for some procedure is randomly shuffled or that it is sorted according to some key. We can model the input assumption \({\mathcal {X}}\) as an ideal functionality \({\mathcal {F}}_{\mathcal {X}}\) that receives the input and “rearranges” it according to the assumption \(\mathcal {X}\). Since the mapping between an assumption \(\mathcal {X}\) and the functionality \({\mathcal {F}}_{\mathcal {X}}\) is usually trivial, and can be deduced from context, we do not always describe it explicitly.

We then prove statements of the form: “The algorithm \(A\) with input satisfying assumption \(\mathcal {X}\) obliviously implements a functionality \({\mathcal {F}}\)”. This should be interpreted as an algorithm that receives \(x\) as input, invokes \({\mathcal {F}}_{\mathcal {X}}(x)\) and then invokes \(A\) on the resulting input. We require that this modified algorithm implements \({\mathcal {F}}\) in the \({\mathcal {F}}_{\mathcal {X}}\)-hybrid model.


4 OBLIVIOUS BUILDING BLOCKS

Our ORAM construction uses many building blocks, some of which are new to this work and some of which are known from the literature. The building blocks are listed next. We advise the reader to use this section as a reference and skip it during the first read.

Oblivious Sorting Algorithms (Section 4.1): We state the classical sorting network of Ajtai et al. [1] and present a new oblivious sorting algorithm that is more efficient in settings where each memory word can hold multiple elements.

Oblivious Random Permutations (Section 4.2): We show how to perform efficient oblivious random permutations in settings where each memory word can hold multiple elements.

Oblivious Bin Placement (Section 4.3): We state the known results for oblivious bin placement of Chan et al. [12, 15].

Oblivious Hashing (Section 4.4): We present the formal functionality of a hash table that is used throughout our work. We also state the resulting parameters of a simple oblivious hash table that is achieved by compiling a non-oblivious hash table inside an existing ORAM construction.

Oblivious Cuckoo Hashing (Section 4.5): We present an overview of the state-of-the-art constructions of oblivious Cuckoo hash tables. We state their complexities and also make minor modifications that will be useful to us later.

Oblivious Dictionary (Section 4.6): We present and analyze a simple construction of a dictionary that is achieved by compiling a non-oblivious dictionary (e.g., a red-black tree) inside an existing ORAM construction.

Oblivious Balls-into-Bins Sampling (Section 4.7): We present an oblivious procedure for sampling the approximate bin loads obtained by independently throwing \(n\) balls into \(m\) bins, which uses the binomial sampling of Bringmann et al. [8].

4.1 Oblivious Sorting Algorithms

The elegant work of Ajtai et al. [1] shows that there is a comparator-based circuit with \(O\left(n\cdot \log n\right)\) comparators that can sort any array of length \(n\).

Theorem 4.1 (Ajtai et al. [1]).

There is a deterministic oblivious sorting algorithm that sorts \(n\) elements in \(O(\lceil D / w \rceil \cdot n \cdot \log n)\) time where \(D\) denotes the number of bits it takes to encode a single element and \(w\) denotes the length of a word.

Packed oblivious sort.. We consider a variant of the oblivious sorting problem on a RAM, which is useful when each memory word can hold up to \(B \gt 1\) elements. The following theorem assumes that the RAM can perform only word-level addition, subtraction, and bitwise operations at unit cost (as defined in Section 3.1).

Theorem 4.2 (Packed Oblivious Sort).

There is a deterministic packed oblivious sorting algorithm that sorts \(n\) elements in \(O(\frac{n}{B} \cdot \log ^2 n)\) time, where \(B\) denotes the number of elements each memory word can pack.

Proof.

We use a variant of bitonic sort, introduced by Batcher [5]. It is well-known that, given a list of \(n\) elements, bitonic sort runs in \(O(n \cdot \log ^2 n)\) time. The algorithm, viewed as a sorting network, proceeds in \(O(\log ^2 n)\) iterations, where each iteration consists of \(\frac{n}{2}\) comparators (see Figure 1). In each iteration, the comparators are fully parallelizable, and our goal is to perform each iteration in \(O(\frac{n}{B})\) standard word-level operations. The intuition is to pack \(O(B)\) elements sequentially into each word and then apply single-instruction-multiple-data (SIMD) comparators, where a SIMD comparator emulates \(O(B)\) standard comparators in constant time. We show the following facts: (1) each iteration runs in \(O(\frac{n}{B})\) SIMD comparators and \(O(\frac{n}{B})\) time, and (2) each SIMD comparator can be instantiated by a constant number of word-level subtraction and bitwise operations.

Fig. 1.

Fig. 1. A bitonic sorting network for 8 inputs. Each horizontal line denotes an input at the left end and an output at the right end. Each vertical arrow denotes a comparator that compares two elements and swaps the greater one to the pointed end. Each dashed box denotes an iteration of the algorithm. The figure is adapted from [65].

To show fact (1), we first assume without loss of generality that \(n\) and \(B\) are powers of 2. We refer to the packed array as the array of \(\frac{n}{B}\) words, where each word stores \(B\) elements. Then, for each iteration, we want a procedure that takes as input the packed array from the previous iteration, and outputs the packed array processed by the comparators prescribed in the standard bitonic sort. To use SIMD comparators efficiently and correctly, for each comparator, the input pair of elements has to be aligned within the pair of two words. We say that two packed arrays are aligned if and only if the offset between each pair of words is the same. Hence, it suffices to show that it takes \(O(1)\) time to align \(O(B)\) pairs of elements. By the definition of bitonic sort, in the same iteration, the offset between any compared pair is the same power of 2 (see Figure 1). Since \(B\) is also a power of 2, one of the following two cases holds:

(a)

All comparators consider two elements from two distinct words, and elements are always aligned in the input.

(b)

All comparators consider two elements from the same word, but the offset \(t\) between any compared pair is the same power of 2.

In case (a), the required alignment follows immediately. In case (b), it suffices to do the following:

(1)

Split one word into two words such that elements of the offset \(t\) are interleaved, where the two words are called odd and even, and then

(2)

Shift the even word by \(t\) elements so the comparators are aligned to the odd word.

The above procedure takes \(O(1)\) time per pair of words. The splitting doubles the number of words to which comparators are applied, and thus blows up the cost by a factor of 2. Thus, the algorithm for one iteration aligns the elements, applies SIMD comparators, and then reverses the alignment. Every iteration runs \(O\left(\frac{n}{B}\right)\) SIMD comparators plus \(O\left(\frac{n}{B}\right)\) additional time.
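For reference, the comparator schedule discussed above is the classical iterative bitonic network; a textbook (unpacked) Python rendering for power-of-two \(n\), not the paper's packed variant:

```python
# Textbook iterative bitonic sorting network for power-of-two n: O(log^2 n)
# rounds, each consisting of n/2 data-independent compare-exchange steps.
# The packed variant in the proof executes each round in O(n/B) word ops.
def bitonic_sort(a):
    n = len(a)                     # assumed to be a power of two
    k = 2
    while k <= n:
        j = k // 2
        while j >= 1:
            for i in range(n):     # one round of compare-exchanges
                l = i ^ j
                if l > i:
                    ascending = (i & k) == 0
                    if (a[i] > a[l]) == ascending:
                        a[i], a[l] = a[l], a[i]
            j //= 2
        k *= 2
    return a
```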

For fact (2), note that to compare \(k\)-bit strings it suffices to perform \((k+1)\)-bit subtraction (and then use the sign bit to select one string). Hence, the intuition to instantiate the SIMD comparator is to use “SIMD” subtraction, which is the standard word subtraction but the packed elements are augmented by the sign bit. The procedure is as follows. Let \(k\) be the bit-length of an element such that \(B\cdot k\) bits fit into one memory word. We write the \(B\) elements stored in a word as a vector \(\vec{a} = (a_1, \ldots , a_B)\in (\lbrace 0,1\rbrace ^{k})^B\). It suffices to show that for any \(\vec{a} = (a_1, \ldots , a_B)\) and \(\vec{b} = (b_1, \ldots , b_B)\) stored in two words, it is possible to compute the mask word \(\vec{m} = (m_1, \ldots , m_B)\) such that \(\begin{align*} m_i = {\left\lbrace \begin{array}{ll} 1^k & \mbox{if } a_i \ge b_i\\ 0^k & \mbox{otherwise.} \end{array}\right.} \end{align*}\) For binary strings \(x\) and \(y\), let \(xy\) be the concatenation of \(x\) and \(y\). Let \(*\) be a wild-card bit. Assume additionally that the elements are packed with additional sign bits, i.e., \(\vec{a} = \left(*a_1, *a_2, \dots , *a_B\right)\). This can be done by simply splitting one word into two. Consider two input words \(\vec{a} = \left(1 a_1, \right.\left. 1 a_2, \dots , 1 a_B\right)\) and \(\vec{b} = \left(0 b_1, 0 b_2, \dots , 0 b_B\right)\) such that \(a_i, b_i \in \lbrace 0,1\rbrace ^k\). The procedure runs as follows:

(1)

Let \(\vec{s^{\prime }} = \vec{a} - \vec{b}\), which has the format \((s_1 *^k, s_2 *^k, \dots , s_B *^k)\), where \(s_i \in \lbrace 0,1\rbrace\) is the sign bit such that \(s_i = 1\) iff \(a_i \ge b_i\). Keep only sign bits and let \(\vec{s} = (s_1 0^k, \dots , s_B 0^k)\).

(2)

Shift \(\vec{s}\) and get \(\vec{m^{\prime }} = (0^k s_1, \dots , 0^k s_B)\). Then, the mask is \(\vec{m} = \vec{s} - \vec{m^{\prime }} = (0s_1^k, \dots , 0s_B^k)\).

The above takes \(O(1)\) subtraction and bitwise operations. This concludes the proof.□
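The two steps above can be sketched directly; Python integers model a word holding \(B\) slots of \(k+1\) bits each (one sign bit plus \(k\) data bits):

```python
# Sketch of the SIMD >= comparator: B packed k-bit elements compared with a
# single word-level subtraction plus shifts and masks (ints model a word).
# Slot i occupies bits [i*(k+1), (i+1)*(k+1)): k data bits plus a sign bit.
def simd_ge_mask(a_elems, b_elems, k):
    B = len(a_elems)
    slot = k + 1
    a = sum(((1 << k) | ai) << (i * slot) for i, ai in enumerate(a_elems))
    b = sum(bi << (i * slot) for i, bi in enumerate(b_elems))
    s_prime = a - b                       # sign bit of slot i is 1 iff a_i >= b_i
    sign = sum(1 << (i * slot + k) for i in range(B))
    s = s_prime & sign                    # (s_i 0^k) in every slot
    m = s - (s >> k)                      # (0 s_i^k) in every slot
    return [(m >> (i * slot)) & ((1 << k) - 1) for i in range(B)]
```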

4.2 Oblivious Random Permutations

We say that an algorithm \({\sf ORP}\) is a statistically secure oblivious random permutation iff ORP statistically obliviously simulates the functionality \(\mathcal {F}_{\rm perm}\), which, upon receiving an input array of \(n\) elements, chooses a random permutation \(\pi\) from the space of all \(n!\) permutations on \(n\) elements, uses \(\pi\) to permute the input array, and outputs the result. Note that this definition implies not only that ORP outputs an almost-random permutation of the input array, but also that the access patterns of ORP are statistically close for all input arrays and all permutations. As before, we use the notation \((1-\delta)\)-oblivious random permutation to explicitly denote the algorithm’s failure probability \(\delta\).

Theorem 4.3.

Let \(n \gt 100\) and let \(D\) denote the number of bits it takes to encode an element. There exists a \((1-e^{-\sqrt {n}})\)-oblivious random permutation for arrays of size \(n\). It runs in time \(O(T_{\rm sort}^{D + \log n}(n) + n)\), where \(T_{\rm sort}^\ell (n)\) is an upper bound on the time it takes to sort \(n\) elements each of size \(\ell\) bits.

Later, in our ORAM construction, this version of ORP will be applied to arrays of size \(n \ge \log ^3\lambda\), where \(\lambda\) is a security parameter, and thus the failure probability is bounded by a negligible function in \(\lambda\).

Proof of Theorem 4.3

We apply an algorithm similar to that of Chan et al. [11, Figure 2 and Lemma 10], except with different parameters:

(1)

Assign each element an \(8\log n\)-bit random label drawn uniformly from \(\lbrace 0, 1\rbrace ^{8\log n}\). Obliviously sort all elements based on their random labels, resulting in the array \({\bf R}\). This step takes \(O(T_{\rm sort}^{D + \log n}(n) + n)\) time.

(2)

In one linear scan, write down two arrays: an array \({\bf I}\) containing the indices of all elements that have collisions, and an array \({\bf X}\) containing the colliding elements themselves. This can be accomplished in \(O(n)\) time assuming that we can leak the indices of the colliding elements.

(3)

If the number of elements that collide is greater than \(\sqrt {n}\), simply abort throwing an Overflow exception. Otherwise, use a naïve quadratic oblivious random permutation algorithm to obliviously and randomly permute the array \({\bf X}\), and let \({\bf Y}\) be the outcome. This step can be completed in \(O(n)\) time, where the quadratic oblivious random permutation performs the following: for each \(i \in \lbrace 1, 2, \ldots , n\rbrace\), sample a random index \(r\) from \(\lbrace 1, 2, \ldots , n-i+1\rbrace\), and write the \(i\)th element of the input to the \(r\)th unoccupied position of the output through a linear scan of the output array.

(4)

Finally, for each \(j \in \lbrace 1, \ldots , |{\bf I}|\rbrace\), write back the element \({\bf Y}[j]\) to the position \({\bf R}[{\bf I}[j]]\) and output the resulting \({\bf R}\).
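Functionally, the above steps can be sketched in plain (non-oblivious) Python as follows; in the actual algorithm, the sort is an oblivious sort and the shuffle of the colliding elements is the naïve quadratic oblivious random permutation:

```python
import random
from collections import Counter

def permute_by_labels(arr):
    """Non-oblivious sketch of the label-sort-and-patch permutation."""
    n = len(arr)
    labels = [random.randrange(n ** 8) for _ in range(n)]  # 8 log n random bits
    # Step (1): sort elements by their random labels.
    order = sorted(range(n), key=lambda i: labels[i])
    R = [arr[i] for i in order]
    sorted_labels = [labels[i] for i in order]
    # Step (2): record the indices I and values X of colliding elements.
    counts = Counter(sorted_labels)
    I = [i for i in range(n) if counts[sorted_labels[i]] > 1]
    # Step (3): abort on too many collisions, else randomly permute X.
    if len(I) > int(n ** 0.5):
        raise OverflowError("Overflow")  # happens w.p. at most exp(-sqrt(n))
    X = [R[i] for i in I]
    random.shuffle(X)  # stand-in for the quadratic oblivious permutation
    # Step (4): write the permuted colliding elements back.
    for j, i in enumerate(I):
        R[i] = X[j]
    return R
```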

To bound the probability of Overflow, we first prove the following claim:

Claim 4.4.

Let \(n \gt 100\). Fix a subset \(S \subseteq \lbrace 1, 2, \ldots , n\rbrace\) of size \(\alpha \ge 2\). Throw elements \(\lbrace 1, 2, \ldots , n\rbrace\) into \(n^8\) bins independently and uniformly at random. The probability that every element in \(S\) collides with some other element is upper bounded by \(\alpha !/n^{2\alpha }\).

Proof.

Fix a sample path \(\psi\), determined by the bin choices of all elements. If all elements in \(S\) see collisions in \(\psi\), then the following event \(G^S\) must hold for \(\psi\): there is a permutation \(S^{\prime }\) of \(S\) such that for every \(i \in \lbrace \lceil \alpha /2 \rceil , \ldots , \alpha \rbrace\), the element \(S^{\prime }[i]\) either collides with some element in \(S^{\prime }\) whose index \(j \lt i\) (i.e., with an element before itself) or with an element outside of \(S\) (i.e., from \([n] \setminus S\)).

Therefore, the fraction of sample paths for which all elements of a fixed subset \(S\) of size \(\alpha\) have collisions is upper bounded by the fraction of sample paths over which the event \(G^S\) holds, which in turn is upper bounded by \(\alpha ! \cdot (n/n^8)^{\lfloor \alpha /2 \rfloor } \le \alpha !/n^{2\alpha }\).□

We proceed with the proof of Theorem 4.3. The probability that there exist at least \(\alpha\) colliding elements is upper bounded by the following expression, since there are at most \(n \choose \alpha\) possible choices for such a subset \(S\): \(\begin{align*} &\quad {n \choose \alpha } \cdot \frac{\alpha !}{n^{2\alpha }} = \frac{n!}{(n-\alpha)! \alpha !} \cdot \frac{\alpha !}{n^{2\alpha }} \le \frac{e \sqrt {n} (n/e)^n}{ \sqrt {2\pi (n-\alpha)} ((n-\alpha)/e)^{n-\alpha } \cdot \sqrt {2\pi \alpha } (\alpha /e)^{\alpha }} \cdot \frac{\alpha !}{n^{2\alpha }}\\ &\le \frac{e\sqrt {n}}{2\pi }\cdot \frac{n^n}{(n-\alpha)^{n-\alpha } \cdot \alpha ^\alpha } \cdot \frac{\alpha !}{n^{2\alpha }} = \frac{\alpha ! \cdot e\sqrt {n}}{2\pi }\cdot \left(\frac{n}{n-\alpha }\right)^{n-\alpha } \cdot \frac{1}{\alpha ^\alpha } \cdot \frac{1}{n^\alpha } \\ &\le \frac{\alpha ! \cdot e\sqrt {n}}{2\pi }\cdot \left(1 + \frac{\alpha }{n-\alpha }\right)^n \cdot \frac{1}{\alpha ^\alpha } \cdot \frac{1}{n^\alpha }. \end{align*}\) Plugging in \(\alpha = \sqrt {n}\), we can upper bound the above expression as follows for \(n \gt 100\): \(\begin{align*} &\quad \frac{\sqrt {n}! \cdot e\sqrt {n}}{2\pi }\cdot \left(1 + \frac{\sqrt {n}}{n-\sqrt {n}}\right)^{\sqrt {n} \cdot \sqrt {n}} \cdot \frac{1}{(n\sqrt {n})^{\sqrt {n}}} \le \frac{\sqrt {n}! \cdot e\sqrt {n}}{2\pi }\cdot \left(1 + \frac{1}{0.5\sqrt {n}}\right)^{0.5\sqrt {n} \cdot 2\cdot \sqrt {n}} \cdot \frac{1}{(n\sqrt {n})^{\sqrt {n}}} \\ &\le \frac{\sqrt {n}! \cdot e\sqrt {n}}{2\pi }\cdot \exp (2\sqrt {n}) \cdot \frac{1}{(n\sqrt {n})^{\sqrt {n}}} \le \exp (-\sqrt {n}). \end{align*}\)
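As a numerical sanity check (not part of the proof), one can verify with exact rational arithmetic that the bound \(\binom{n}{\alpha} \cdot \alpha!/n^{2\alpha}\) with \(\alpha = \sqrt{n}\) is indeed far below \(e^{-\sqrt{n}}\) for a concrete value of \(n\):

```python
from fractions import Fraction
from math import comb, factorial, isqrt

def collision_bound(n):
    """Exact value of C(n, a) * a! / n^(2a) for a = sqrt(n)."""
    a = isqrt(n)
    return Fraction(comb(n, a) * factorial(a), n ** (2 * a))

# For n = 10000 the bound is below 10^(-400), while exp(-sqrt(n)) ~ 3.7e-44.
```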

Having bounded the Overflow probability, the obliviousness proof can be completed in a manner identical to that of Lemma 10 in Chan et al. [11], since our algorithm is essentially theirs with different parameters. We stress that the algorithm is oblivious even though the positions of the colliding elements are revealed.

Packed oblivious random permutation. The following version of oblivious random permutation performs well when each memory word is large enough to store many of the elements to be permuted, each tagged with its own index. The algorithm follows directly by plugging our packed oblivious sort (Theorem 4.2) into the oblivious random permutation algorithm (Theorem 4.3).

Theorem 4.5 (Packed Oblivious Random Permutation).

Let \(n \gt 100\) and let \(D\) denote the number of bits it takes to encode an element. Let \(B = \lfloor w/(\log n + D) \rfloor\) be the element capacity of each memory word and assume that \(B \gt 1\). Then, there exists a \((1-e^{-\sqrt {n}})\)-oblivious random permutation algorithm that permutes the input array in time \(O(\frac{n}{B} \cdot \log ^2 n + n)\).

Perfect oblivious random permutation. Note that the permutation of Theorem 4.3 runs in time \(O(n\cdot \log n)\) but may fail with probability \(e^{-\sqrt {n}}\). In this article, we also construct a perfectly oblivious random permutation; this scheme comes as a by-product of the tight compaction and intersperse algorithms that we construct later in Sections 5 and 6.4.

Theorem 4.6 (Perfectly Oblivious Random Permutation).

For any \(n\) and any \(m \in [n]\), suppose that sampling an integer uniformly at random from \([m]\) takes unit time. Then, there exists a perfectly oblivious random permutation that permutes an input array of size \(n\) in \(O(n \cdot \log n)\) time.

Proof.

We will prove this theorem in Section 6.4.□

4.3 Oblivious Bin Placement

Let \({\bf I}\) be an input array containing real and dummy elements. Each element has a tag from \(\lbrace 1,\ldots ,|{\bf I}|\rbrace \cup \lbrace \bot \rbrace\). It is guaranteed that all the dummy elements are tagged with \(\bot\) and all real elements are tagged with distinct values from \(\lbrace 1,\ldots ,|{\bf I}|\rbrace\). The goal of oblivious bin placement is to create a new array \({\bf I}^{\prime }\) of size \(|{\bf I}|\) such that a real element that is tagged with the value \(i\) will appear in the \(i\)th cell of \({\bf I}^{\prime }\). If no element was tagged with a value \(i\), then \({\bf I}^{\prime }[i]=\bot\). The values in the tags of real elements can be thought of as “bin assignments” where the elements want to go to and the goal of the bin placement algorithm is to route them to the right location obliviously.

Oblivious bin placement can be accomplished with \(O(1)\) number of oblivious sorts (Section 4.1), where each oblivious sort operates over \(O(|{\bf I}|)\) elements [12, 15]. In fact, these works [12, 15] describe a more general oblivious bin placement algorithm where the tags may not be distinct, but we only need the special case where each tag appears at most once.
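The sort-based approach can be illustrated by the following Python sketch, in which ordinary sorts stand in for oblivious sorts, \(\bot\) is represented by None, and the helper name is ours rather than from [12, 15]:

```python
def oblivious_bin_placement(I):
    """Route each real element (tag, elem) to position tag of the output;
    dummies carry tag None. Ordinary sorts stand in for oblivious sorts."""
    n = len(I)
    # Pair each real element with its tag; add one "filler" per destination.
    entries = [(tag, 0, elem) for (tag, elem) in I if tag is not None]
    entries += [(i, 1, None) for i in range(1, n + 1)]
    entries.sort()  # sort #1: each real lands right before its own filler
    keep = []
    for idx, (tag, is_filler, elem) in enumerate(entries):
        prev = entries[idx - 1] if idx > 0 else None
        redundant = is_filler and prev is not None and prev[0] == tag and prev[1] == 0
        if not redundant:
            keep.append((tag, elem))
    keep.sort()     # sort #2: order the surviving entries by destination
    return [elem for (_, elem) in keep]
```

In the oblivious version, the filtering of dummies and redundant fillers is likewise realized with sorting and linear scans rather than data-dependent branching.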

4.4 Oblivious Hashing

An oblivious (static) hashing scheme is a data structure that supports three operations \({\sf Build}\), \({\sf Lookup}\), and \({\sf Extract}\) that realizes the following (ideal) reactive functionality. The \({\sf Build}\) procedure is the constructor and it creates an in-memory data structure from an input array \({\bf I}\) containing real and dummy elements where each real element is a (key, value) pair. It is assumed that all real elements in \({\bf I}\) have distinct keys. The \({\sf Lookup}\) procedure allows a requestor to look up the value of a key. A special symbol \(\bot\) is returned if the key is not found or if \(\bot\) is the requested key. We say a (key, value) pair is visited if the key was searched for and found before. We assume non-recurrent lookups, namely, no real key is searched more than once in the lifetime of the hash table. Finally, \({\sf Extract}\) is the destructor and it returns a list containing unvisited elements padded with dummies to the same length as the input array \({\bf I}\).

An important property that our construction relies on is that if the input array \({\bf I}\) is randomly shuffled to begin with (with a secret permutation), then the outcome of \({\sf Extract}\) is also randomly shuffled (in the eyes of the adversary). In addition, we need obliviousness to hold only when the \({\sf Lookup}\) sequence is non-recurrent, i.e., the same real key is never requested twice (though dummy keys may be looked up multiple times). The functionality is formally given next.
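Ignoring obliviousness entirely, the reactive functionality can be captured by the following plain Python reference (class and method names are illustrative); the real \({\sf Extract}\) additionally outputs its result in randomly shuffled order:

```python
class FHT:
    """Plain reference for the oblivious hashing functionality (no security):
    Build from (key, value) pairs plus dummies (None, None), non-recurrent
    Lookup, and Extract of unvisited elements padded with dummies."""

    def build(self, I):
        self.n = len(I)
        self.table = {k: v for (k, v) in I if k is not None}
        self.visited = set()

    def lookup(self, k):
        # Returns None (playing the role of ⊥) for dummy or missing keys.
        if k is None or k not in self.table or k in self.visited:
            return None
        self.visited.add(k)
        return self.table[k]

    def extract(self):
        rest = [(k, v) for k, v in self.table.items() if k not in self.visited]
        return rest + [(None, None)] * (self.n - len(rest))
```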

Construction of naïveHT. A naïve, perfectly secure oblivious hashing scheme can be obtained directly from a perfectly secure ORAM construction [14, 19]. Both schemes [14, 19] are Las Vegas algorithms: for any capacity \(n\), it almost always takes \(O(\log ^3 n)\) time to serve a request; however, with probability negligible in \(n\), it may take longer. We stress that although the runtime may sometimes exceed the stated bound, there is never any security or correctness failure in the known perfectly secure ORAM constructions [14, 19]. We observe that the scheme of Chan et al. [14] is a Las Vegas algorithm only because the oblivious random permutation it employs is a Las Vegas algorithm. In this article, we construct a perfect oblivious random permutation that runs in \(O(n \cdot \log n)\) time with probability 1 (Theorem 4.6), so we can replace the oblivious random permutation in Chan et al. [14] with ours. Interestingly, this results in the first non-trivial perfectly oblivious RAM that is not a Las Vegas algorithm.

Theorem 4.8 (Perfect ORAM (using [14] + Theorem 4.6)).

For any capacity \(n\in \mathbb {N}\), there is a perfect ORAM scheme that consumes space \(O(n)\) and worst-case time overhead \(O(\log ^3 n)\) per request.

To construct \({{\sf naïveHT}}\) using a perfectly secure ORAM scheme, we use Theorem 4.8 to compile a standard, balanced binary search tree data structure (e.g., a red-black tree). Finally, Extract can be performed efficiently if we adopt the perfect ORAM of Theorem 4.8, which incurs only a constant space blowup. In more detail, we flatten the entire in-memory data structure into a single array and apply oblivious tight compaction (Theorem 1.2) to the array, moving all the real elements to the front. We then truncate the array at length \(|{\bf I}|\), apply a perfectly oblivious random permutation to the truncated array, and output the result. This gives the following construction.
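The Extract flow just described (flatten, compact reals to the front, truncate, permute) can be sketched as follows, with a plain partition standing in for oblivious tight compaction and a library shuffle standing in for the perfect oblivious random permutation:

```python
import random

def extract_from_flattened(flattened, n):
    """Sketch of Extract: compact reals to the front, truncate to the input
    length n, then apply a (stand-in for the perfectly oblivious) shuffle."""
    reals = [x for x in flattened if x is not None]   # tight compaction stand-in
    arr = (reals + [None] * len(flattened))[:n]       # pad with dummies, truncate
    random.shuffle(arr)                               # perfect ORP stand-in
    return arr
```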

Theorem 4.9 (naïveHT).

Assume that each memory word is large enough to store at least \(\Theta (\log n)\) bits where \(n\) is an upper bound on the total number of elements that exist in the data structure. There exists a perfectly secure, oblivious hashing scheme that consumes \(O(n)\) space; further,

Build and Extract each consumes \(n \cdot \mathsf {poly}\log n\) time;

Each Lookup request consumes \(\mathsf {poly}\log n\) time.

Later in our article, whenever we need an oblivious hashing scheme for a small (\(\mathsf {poly}\log (\lambda)\)-sized) bin, we will adopt naïveHT since it is perfectly secure. In comparison, schemes whose failure probability is negligible in the problem size (\(\mathsf {poly}\log (\lambda)\) in this case) may not yield \({\sf negl}(\lambda)\) failure probability. Indeed, almost all known computationally secure [29, 31, 33, 40] or statistically secure [58, 61, 63] ORAM schemes have a (statistical) failure probability that is negligible in the problem’s size and are thus unsuited for small, poly-logarithmically sized bins. In a similar vein, earlier works also employed perfectly secure ORAM schemes to treat poly-logarithmic size inputs [58].

4.5 Oblivious Cuckoo Hashing

A Cuckoo hashing scheme [52] is a hashing method with a constant lookup cost (ignoring the stash). Imagine that we wish to hash \(n\) balls into a table of size \({\sf c_{\sf cuckoo}}\cdot n\), where \({\sf c_{\sf cuckoo}}\gt 1\) is an appropriate fixed constant. Additionally, there is a stash denoted \({\sf S}\) of size \(s\) for holding a small number of overflowing balls. We also refer to each position of the table as a bin, and a bin can hold exactly one ball. Each ball receives two independent bin choices. During the build phase, we execute a Cuckoo assignment algorithm that picks either a bin-choice for each ball among its two specified choices, or assigns the ball to some position in the stash. It must hold that no two balls are assigned to the same location either in the main table or in the stash. Kirsch et al. [39] showed an assignment algorithm that succeeds with probability \(1 - n^{-\Omega (s)}\) over the random bin choices, where \(s\) denotes the stash size.
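For intuition, and ignoring obliviousness, such an assignment can be computed by the classic random-walk insertion sketched below (the eviction bound and parameter names are illustrative):

```python
def cuckoo_assign(choices, stash_size, max_kicks=100):
    """Assign each ball to one of its two bin choices or to the stash.
    choices[i] holds the two bin choices (u_i, v_i) of ball i."""
    table = {}                                     # bin index -> ball index
    stash = []
    for ball, (u, _) in enumerate(choices):
        cur, pos = ball, u
        for _ in range(max_kicks):
            if pos not in table:
                table[pos], cur = cur, None
                break
            cur, table[pos] = table[pos], cur      # evict the occupant
            u2, v2 = choices[cur]
            pos = v2 if pos == u2 else u2          # send it to its other bin
        if cur is not None:
            if len(stash) >= stash_size:
                raise OverflowError("stash overflow")
            stash.append(cur)
    return table, stash
```

This data-dependent eviction walk is exactly what leaks information through the access pattern, motivating the oblivious build algorithms discussed next.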

Without privacy, it is known that such an assignment can be computed in \(O(n)\) time. However, it is also known that the standard procedure for building a Cuckoo hash table leaks information through the algorithm's access patterns [12, 33, 60]. Goodrich and Mitzenmacher [33] (see also the recent work of Chan et al. [12]) showed that a Cuckoo hash table can be built obliviously in \(O\left(n \cdot \log n\right)\) total time. In our ORAM construction, we will need to apply Chan et al. [12]'s oblivious Cuckoo hashing techniques in a non-blackbox fashion to enable asymptotically more efficient hashing schemes for randomly shuffled input arrays. Below, we present the necessary preliminaries.

4.5.1 Build Phase: Oblivious Cuckoo Assignment.

To obliviously build a Cuckoo hash table given an input array, we have two phases: (1) a metadata phase in which we select a bin among the two bin choices made by each input ball or alternatively assign the ball to a position in the stash; and (2) the actual (oblivious) routing of the balls into their destined location in the resulting hash-table data structure. The problem solved by the first phase (i.e., the metadata step), is called the Cuckoo assignment problem, formally defined as below.

Oblivious Cuckoo assignment. Let \(n\) be the number of balls to be put into the Cuckoo hash table, and let \({\bf I}= \left((u_1, v_1), \ldots , (u_n, v_n)\right)\) be the array of the two bin choices made by each of the \(n\) balls, where \(u_i, v_i \in [{\sf c_{\sf cuckoo}}\cdot n]\) for \(i \in [n]\). In the Cuckoo assignment problem, given such an input array \({\bf I}\), the goal is to output an array \({\bf A} = (a_1, \ldots , a_n)\), where \(a_i \in \lbrace {\tt bin}(u_i), {\tt bin}(v_i), {\tt stash}(j)\rbrace\) denotes that the \(i\)th ball is assigned either to bin \(u_i\), to bin \(v_i\), or to the \(j\)th position in the stash. We say that a Cuckoo assignment \({\bf A}\) is correct iff (i) each bin and each position in the stash receives at most one ball, and (ii) the number of balls in the stash is bounded by a parameter \(s\).

Given a correct assignment \({\bf A}\), a Cuckoo hash table can be built by obliviously placing each ball into the position it is assigned to. A straightforward way to accomplish this is through a standard oblivious bin placement algorithm (Section 4.3).

Theorem 4.10 (Oblivious Cuckoo Assignment [12, 33]).

Let \({\sf c}_{\sf cuckoo} \gt 1\) be a suitable constant, \(\delta \gt 0\), \(n \in \mathbb {N}\), the stash size \(s \ge \log (1/\delta)/\log n\), and let \(n_{\sf cuckoo} ~{\sf :=}~ {\sf c}_{\sf cuckoo} \cdot n + s\) and \(\ell ~{\sf :=}~ 8 \log _2 (n_{\sf cuckoo})\). Then, there is a deterministic oblivious algorithm denoted \({\sf cuckooAssign}\) that successfully solves the Cuckoo assignment problem with probability \(1-O(\delta)-\exp (-\Omega (\frac{n^{5/7}}{\log ^4 (1/\delta)}))\) over the choice of the random bins \({\bf I}\). It runs in time \(O(n_{\sf cuckoo} + T_{\rm sort}^\ell (n_{\sf cuckoo}) +\log \frac{1}{\delta } \cdot T_{\rm sort}^\ell (n^{6/7}))\), where \(T_{\rm sort}^\ell (m)\) is the time bound for obliviously sorting \(m\) elements each of length \(\ell\).

As a special case, suppose that \(n \ge \log ^8(1/\delta)\), \(s = \log (1/\delta)/ \log n\), and the word size \(w = \Omega (\log n)\). Then, using the AKS oblivious sorting algorithm (Theorem 4.1), the algorithm runs in time \(O(n\cdot \log n)\) with success probability \(1 - O(\delta)\).

In fact, to obtain the above theorem, we cannot directly apply Chan et al. [12]'s Cuckoo hash-table build algorithm, but need to make some minor modifications. We refer the reader to Remark 4.13 and Appendix B for more details.

A variant of the Cuckoo assignment problem. Later, we will need a variant of the above Cuckoo assignment problem. Imagine that the input array now has size exactly equal to the total size consumed by the Cuckoo hash-table, i.e., \(n_{\sf cuckoo} ~{\sf :=}~ {\sf c}_{\sf cuckoo} \cdot n + s\) where \(s\) denotes the stash size; importantly, at most \(n\) balls in the input array are real and the remaining are dummy. Therefore, we may imagine that the input array to the Cuckoo assignment problem is the following metadata array containing either real or dummy bin choices: \({\bf I}= \lbrace (u_i, v_i)\rbrace _{i \in [n_{\sf cuckoo}]}\), where \(u_i, v_i \in [n_{\sf cuckoo}]\) if \(i\) corresponds to a real ball and \(u_i = v_i = \bot\) if \(i\) corresponds to a dummy ball.

We would like to compute a correct Cuckoo assignment for all real balls in the input. This variant can easily be solved as follows: (1) obliviously sort the input array so that the bin choices of the up to \(n\) real balls are in the front; let \({\bf X}\) denote the outcome; (2) apply the aforementioned \({\sf cuckooAssign}\) algorithm to \({\bf X}[1:n]\), resulting in an output assignment array denoted \({\bf A}\) of length \(n\); (3) pad \({\bf A}\) to length \(n_{\sf cuckoo}\) by adding \(n_{\sf cuckoo} - n\) dummy assignment labels of appropriate length, resulting in the array \({\bf A}^{\prime }\); and (4) reverse route \({\bf A}^{\prime }\) back to the input array, which can be accomplished if in step (1) we remember the per-gate routing decisions in the sorting network. Henceforth, we refer to this slight variant as \(\overline{\sf cuckooAssign}\). Note that steps (1) and (4) of the above algorithm require oblivious sorting for elements of at most \(\ell ~{\sf :=}~ 8 \log _2 ({\sf c}_{\sf cuckoo} \cdot n)\) bits.
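Steps (1) through (4) can be sketched as follows, where assign stands for any Cuckoo-assignment subroutine and the remembered sorting permutation plays the role of the per-gate routing decisions:

```python
def cuckoo_assign_variant(I, n, assign):
    """Compute an assignment when I (of length n_cuckoo) holds the bin choices
    of at most n real balls and None for dummies; `assign` is any
    Cuckoo-assignment subroutine for an array of n real bin choices."""
    m = len(I)  # n_cuckoo
    # (1) Stable-sort the real bin choices to the front, remembering positions.
    order = sorted(range(m), key=lambda i: I[i] is None)
    X = [I[i] for i in order]
    # (2) Assign the first n entries.
    A = assign(X[:n])
    # (3) Pad with dummy assignment labels.
    A += [None] * (m - n)
    # (4) Reverse-route the labels back to the original positions.
    out = [None] * m
    for dst, src in enumerate(order):
        out[src] = A[dst]
    return out
```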

Corollary 4.11 (Oblivious Cuckoo Assignment Variant).

Suppose that \(\delta \gt 0\), \(n \ge \log ^8(1/\delta)\), and the stash size \(s \ge \log (1/\delta)/\log n\). Then, there is a deterministic, oblivious algorithm denoted \(\overline{\sf cuckooAssign}\) that successfully finds a Cuckoo assignment for the above variant of the problem with probability \(1-O(\delta)\), consuming the same asymptotic runtime as Theorem 4.10.

Later, when we apply Corollary 4.11, if a single memory word can pack \(B \gt 1\) elements of length \(\ell ~{\sf :=}~ 8 \log _2 ({\sf c}_{\sf cuckoo} \cdot n)\)—this happens when the Cuckoo hash-table’s size is small—we may use packed oblivious sorting to instantiate the sorting algorithm. This gives rise to the following corollary:

Corollary 4.12 (Packed Oblivious Cuckoo Assignment).

Suppose that \(\delta \gt 0\), \(n \ge \log ^8(1/\delta)\), the stash size \(s \ge \log (1/\delta)/\log n\), and let \(\ell ~{\sf :=}~ 8 \log _2 ({\sf c}_{\sf cuckoo} \cdot n + s)\). Then, there is a deterministic oblivious algorithm running in time \(O(n_{\sf cuckoo} + (n_{\sf cuckoo}/{w}) \cdot \log ^3 n_{\sf cuckoo})\) that successfully finds a Cuckoo assignment (for both of the above variants) with probability \(1-O(\delta)\) over the choice of the random bins, where \(w\) denotes the word size.

Although we state the above corollary for general \(n\), we only gain in efficiency when we apply it to the case of large word size \(w\) and small \(n\).

Proof.

Let \(n_{\sf cuckoo} = {\sf c}_{\sf cuckoo} \cdot n + s\). If \(\ell = 8 \log _2 n_{\sf cuckoo} \lt w\), we use packed oblivious sorting to instantiate all the oblivious sorts in the \({\sf cuckooAssign}\) and \(\overline{\sf cuckooAssign}\) algorithms. By Theorem 4.2, the running time is upper bounded by \(O(n_{\sf cuckoo} + (n_{\sf cuckoo}/(w/\log n_{\sf cuckoo})) \cdot \log ^2 n_{\sf cuckoo} + \log (1/\delta) \cdot (n_{\sf cuckoo}^{6/7}/(w/\log n_{\sf cuckoo}))\cdot \log ^2 n_{\sf cuckoo}) \le O(n_{\sf cuckoo} + (n_{\sf cuckoo}/{w}) \cdot \log ^3 n_{\sf cuckoo})\), where the last inequality relies on the fact that \(n \ge \log ^8(1/\delta)\). Finally, it is not hard to see that the corollary also holds for \(\ell \ge w\): in this case, we can use the standard AKS sorting network to instantiate the oblivious sorting.□

Remark 4.13

(Indiscriminate Hashing).

As mentioned earlier, to get Theorem 4.10, Corollary 4.11, and Corollary 4.12, we cannot directly apply Chan et al. [12]’s oblivious Cuckoo hash-table building algorithm. In particular, their algorithm does not explicitly separate the metadata phase from the actual ball-routing phase; consequently, in their algorithm, the final assignment computed may depend on the element’s key, and not just the bin-choice metadata array \({\bf I}\). In our article, we need an extra indiscrimination property: after the hash-table is built, the location of any real ball is fully determined by its relative index in the input array as well as the bin-choice metadata array \({\bf I}\). This property will be needed to prove an extra property for our oblivious hashing schemes, that is, if the input array is randomly shuffled, then all unvisited elements in the hash-table data structure must appear in random order. Note that the way we formulated the Cuckoo assignment problem automatically ensures this indiscriminate property. See Appendix B for details, where we describe a variant of the algorithm of Chan et al. [12] that satisfies our needs.

4.5.2 Oblivious Cuckoo Hashing.

We get an oblivious Cuckoo hashing scheme as follows:

To perform Build, use the aforementioned cuckooAssign algorithm, determining the two bin choices of each element \((k, v)\) by evaluating a pseudorandom function \({\sf PRF}_{\texttt{sk} }(k)\), where \(\texttt{sk}\) is a secret key sampled freshly for this hash-table instance and stored inside the CPU.

For Lookup of key \(k\), we evaluate the element's two bin choices using the PRF and look up the corresponding two bins in the hash-table. Besides these two bins, we also scan through the stash (regardless of whether the element is found in one of the two bins). After an element has been looked up, it is marked as removed.

Finally, to realize Extract we obliviously shuffle all unvisited elements using a perfect oblivious random permutation (Theorem 4.6) and output the resulting array.
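The PRF-based bin derivation and the Lookup procedure can be sketched as follows; HMAC-SHA256 stands in for the PRF here as an illustrative assumption, and the helper names are ours:

```python
import hashlib
import hmac

def bin_choices(sk, key, num_bins):
    """Derive an element's two bin choices from PRF_sk(key); HMAC-SHA256 is
    an illustrative stand-in for the PRF keyed by the CPU-held secret sk."""
    d = hmac.new(sk, repr(key).encode(), hashlib.sha256).digest()
    return (int.from_bytes(d[:8], "big") % num_bins,
            int.from_bytes(d[8:16], "big") % num_bins)

def lookup(table, stash, sk, key):
    """Probe the two PRF-derived bins, always scan the whole stash, and mark
    a found element as removed (lookups are assumed non-recurrent)."""
    u, v = bin_choices(sk, key, len(table))
    found = None
    for pos in (u, v):
        if table[pos] is not None and table[pos][0] == key:
            found = table[pos][1]
            table[pos] = None  # mark as removed
    for i, entry in enumerate(stash):
        if entry is not None and entry[0] == key:
            found = entry[1]
            stash[i] = None
    return found
```

Note that both bins and the entire stash are always touched, so the physical access pattern of a lookup is independent of whether and where the key is found.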

We have the following theorem:

Theorem 4.14 (cuckooHT).

Assume a \(\delta _{{\sf PRF}}\)-secure PRF. For any \(\delta \gt 0\) and \(n \ge \log ^{8}(1/\delta)\), there is an oblivious hashing scheme denoted \({\sf cuckooHT}^{\delta ,n} = ({\sf Build}, {\sf Lookup}, {\sf Extract})\) that \((1- O(\delta) - \delta _{{\sf PRF}})\)-obliviously simulates \({{\mathcal {F}}_{\sf HT}^{n}}\). Moreover, the algorithm satisfies the following properties:

Build takes as input \({\bf I}\) of length \(n\), and outputs a Cuckoo table \({\mathcal {T}}\) of size \(O(n)\) and a stash \({\sf S}\) of size \(O(\log (1/\delta)/\log n)\). It requires \(O\left(n\cdot \log n\right)\) time.

Lookup requires looking up only \(O(1)\) positions in the table \({\mathcal {T}}\) which takes \(O(1)\) time, and making a linear scan of the stash \({\sf S}\) consuming \(O(\log (1/\delta)/\log n)\) time.

Extract performs a perfect oblivious random permutation, consuming \(O\left(n \cdot \log n\right)\) time.

Remark 4.15.

The above scheme differs from that of Chan et al. [12] in three aspects: (1) we explicitly separate the assignment phase from the ball-routing phase during Build; (2) we satisfy the indiscriminate property mentioned in Remark 4.13; and (3) we additionally support Extract (whereas Chan et al. do not).

4.6 Oblivious Dictionary

As opposed to the oblivious hash table from Section 4.4, which is a static data structure, an oblivious dictionary is an extension of oblivious hashing that allows adding elements one at a time using an algorithm Insert, where Insert is called at most \(n\) times for a pre-determined capacity \(n\). The dictionary also supports the Lookup and Extract procedures as described for oblivious hashing. Note that there is no specific order in which Insert and Lookup requests have to be made; they may be interleaved arbitrarily. Another difference between our hashing notion and the dictionary notion is that the Extract operation outputs all elements, including "visited" elements (while Extract of oblivious hashing outputs only "unvisited" elements). In summary, an oblivious dictionary realizes Functionality 4.16 described below.

Corollary 4.17 (Perfectly Secure Oblivious Dictionary).

For any capacity \(n \in \mathbb {N}\), there exists a perfectly oblivious dictionary \(({\sf Init}, {\sf Insert}, {\sf Lookup}, {\sf Extract})\) such that \({\sf Init}\) and \({\sf Extract}\) each take \(O(n\cdot \log ^3 n)\) time, and \({\sf Insert}\) and \({\sf Lookup}\) each take \(O(\log ^4 n)\) time.

Proof.

The realization of the oblivious dictionary is very similar to that of naïveHT. Without security, the functionalities can be realized in \(O(n)\) or \(O(\log n)\) time using a standard, balanced binary search tree data structure (e.g., a red-black tree) and the standard linear-time Fisher-Yates shuffle [24]. To achieve obliviousness, it suffices to compile the algorithms and the data structure using the perfect ORAM of Theorem 4.8, which is perfectly oblivious and incurs \(O(\log ^3 n)\) overhead per access.□

4.7 Oblivious Balls-into-Bins Sampling

Consider the ideal functionality \({\mathcal {F}}^{\mathsf {throw\text{-}balls}}_{n,m}\) that throws \(n\) balls into \(m\) bins uniformly at random and outputs the bin loads. A non-oblivious algorithm for this functionality throws each ball independently at random and runs in time \(O(n)\). To achieve obliviousness, we need to be able to sample binomials.

Let \({\sf Binomial}(n,p)\) be the binomial distribution parameterized by \(n\) independent trials with success probability \(p\). Let \({\mathcal {F}}^{\mathsf {binomial}}\) be an ideal functionality that samples from \({\sf Binomial}(n,1/2)\) and outputs the result. The standard way to implement \({\mathcal {F}}^{\mathsf {binomial}}\) is to toss \(n\) independent coins, but this takes \(O(n)\) time. Since this is too expensive for our purposes, we settle for an approximation using an algorithm of Bringmann et al. [8] (see also [21]).

Theorem 4.18 (Sampling Binomial Variables [8, Theorem 5]).

Assume that word RAM arithmetic operations, logical operations, and sampling a uniformly random word each take \(O(1)\) time. For any \(n = 2^{O(w)}\), there is a \((1-n\cdot \delta)\)-oblivious RAM algorithm \({\sf SampleApproxBinomial}_\delta\) that implements the functionality \({\mathcal {F}}^{\mathsf {binomial}}\) in time \(O(\log ^5 (1/\delta))\).

Here is our implementation of \({\mathcal {F}}^{\mathsf {throw\text{-}balls}}\) using \({\sf SampleApproxBinomial}_\delta\).
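The recursive-halving idea can be sketched as follows, with exact \({\sf Binomial}(n, 1/2)\) coin tossing standing in for \({\sf SampleApproxBinomial}_\delta\) (i.e., the \(\delta = 0\) case, at the cost of \(O(n)\) time per binomial sample):

```python
import random

def sample_bin_loads(n, m):
    """Throw n balls into m bins (m a power of two) by recursive halving:
    sample how many balls go to the left half via Binomial(n, 1/2), recurse."""
    if m == 1:
        return [n]
    left = sum(random.getrandbits(1) for _ in range(n))  # exact Binomial(n, 1/2)
    return sample_bin_loads(left, m // 2) + sample_bin_loads(n - left, m // 2)
```

Replacing the exact coin tossing with the approximate sampler of Theorem 4.18 yields the \(O(m \cdot \log^5(1/\delta))\) bound, since the recursion makes \(O(m)\) binomial samples in total.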

If we use \(\delta =0\) in the above algorithm, then \({\sf SampleBinLoad}\) perfectly and obliviously implements \({\mathcal {F}}^{\mathsf {throw\text{-}balls}}\). Using the efficient algorithm for sampling approximated binomials (Theorem 4.18), we get the following theorem.

Theorem 4.20.

For any integer \(n = 2^{O(w)}\) and any \(m\) that is a power of 2, \({\sf SampleBinLoad}_{m,\delta }\) \((1 - m\cdot n \cdot \delta)\)-obliviously implements the functionality \({\mathcal {F}}^{\mathsf {throw\text{-}balls}}_{n,m}\) in time \(O(m \cdot \log ^5 (1/\delta))\).

Skip 5OBLIVIOUS TIGHT COMPACTION Section

5 OBLIVIOUS TIGHT COMPACTION

In this section, we describe a deterministic linear-time procedure (in the balls and bins model) that solves the tight compaction problem: given an input array containing \(n\) balls, each of which is marked with a 1-bit label that is either 0 or 1, output a permutation of the input array such that all the 0 balls are moved to the front of the array.

Theorem 5.1 (Restatement of Theorem 1.2).

There exists a deterministic oblivious tight compaction algorithm that takes \(O\left(\lceil D/w \rceil \cdot n\right)\) time to compact any input array of \(n\) elements, each of which can be encoded using \(D\) bits, where \(w\) is the word size.

Our approach extends to oblivious distribution: Given an array containing \(n\) balls and an assignment array \(A\) of \(n\) bits such that each ball is marked with a 1-bit label that is either 0 or 1 and the number of 0-balls equals the number of 0-bits in \(A\), output a permutation of the input balls such that all the 0-balls are moved to the positions of 0-bits in \(A\). See Section 5.3 for details.

A bipartite expander. Our construction relies on bipartite expander graphs whose entire edge set can be computed in time linear in the number of nodes.

Theorem 5.2.

For any constant \(\epsilon \in (0,1)\), there exists a family of bipartite graphs \(\lbrace G_{\epsilon ,n}\rbrace _{n\in \mathbb {N}}\) and a constant \(d_\epsilon \in \mathbb {N}\), such that for every \(n\in \mathbb {N}\) that is a power of 2, \(G_{\epsilon ,n}=(L,R,E)\) has \(|L|=|R|=n\) vertices on each side, is \(d_\epsilon\)-regular, and for all sets \(S\subseteq L\) and \(T\subseteq R\), it holds that \(\begin{align*} \left|e(S,T) - \frac{d_\epsilon }{n}\cdot |S|\cdot |T| \right| \le \epsilon \cdot d_\epsilon \cdot \sqrt {|S|\cdot |T|}, \end{align*}\) where \(e(S,T)\) is the number of edges \((s,t)\in E\) such that \(s\in S\) and \(t\in T\).

Furthermore, there exists a (uniform) linear-time algorithm that on input \(1^{n}\) outputs the entire edge set of \(G_{\epsilon ,n}\).

Such graphs are well known (cf. Margulis [48] and Pippenger [56]), and we provide a proof in Appendix C.1 for completeness. Note that the property that the entire edge set can be computed in linear time is crucial for us (but, to the best of our knowledge, has not been exploited before).

Organization. The construction contains several sub-procedures. The following table depicts the relationship between the implementations of our different algorithms.

5.1 Reducing Tight Compaction to Loose Compaction

We first reduce the problem of tight compaction in linear time to loose compaction in linear time. A loose compaction algorithm is parametrized by a sufficiently large constant \(\ell \gt 2\) (which will be chosen in Section 5.2), and the input is an array \({\bf I}\) of size \(n\) that has real and dummy balls. It is guaranteed that the number of reals is at most \(n/\ell\). The expected output of the procedure is an array of size \(n/2\) that contains all the real balls.

From SwapMisplaced to TightCompaction. The first observation behind our tight compaction algorithm is that some balls already reside in the correct place, and only the others have to be moved. Specifically, assume that there are \(c\) balls marked 0; then every 1-ball in the subarray \({\bf I}[1,\ldots ,c]\) is misplaced, and every 0-ball in \({\bf I}[c+1,\ldots ,n]\) is misplaced. The number of misplaced 0-balls equals exactly the number of misplaced 1-balls, so tight compaction reduces to swapping the misplaced 0-balls with the misplaced 1-balls. The misplaced items will be labeled with red/blue colors. This reduction is described as Algorithm 5.3.
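The coloring step of this reduction can be sketched in a few lines. This is a plain (non-oblivious) reference sketch; the helper name and list representation are ours, for illustration only:

```python
# Non-oblivious reference sketch of the reduction: count the 0-balls,
# then color the misplaced balls red/blue so that a swap procedure can
# finish the job. The function name `mark_misplaced` is illustrative.
def mark_misplaced(labels):
    """labels[i] in {0, 1}; returns colors in {'red', 'blue', None}."""
    c = labels.count(0)  # the 0-balls belong in positions 0..c-1
    colors = [None] * len(labels)
    for i, b in enumerate(labels):
        if i < c and b == 1:
            colors[i] = 'red'   # a 1-ball sitting in the 0-region
        elif i >= c and b == 0:
            colors[i] = 'blue'  # a 0-ball sitting in the 1-region
    return colors
```

By construction, the number of red marks always equals the number of blue marks, which is exactly the input guarantee of SwapMisplaced.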

From LooseSwapMisplaced and LooseCompaction to SwapMisplaced. In SwapMisplaced, we are given \(n\) balls, each labeled \({\sf red}\), \({\sf blue}\), or \(\bot\). It is guaranteed that the number of blue balls equals the number of red balls. Our goal is to obliviously swap the locations of the blue balls with those of the red balls. To implement SwapMisplaced, we use two subroutines, \({\sf LooseCompaction}_\ell\) and \({\sf LooseSwapMisplaced}_\ell\), parametrized by a number \(\ell \gt 2\), with the following input-output guarantees:

The algorithm \({\sf LooseCompaction}_\ell\) receives as input an array \({\bf I}\) consisting of \(n\) balls, where at most \(1/\ell\) fraction are real and the rest are dummies. The output is an array of size \(n/2\) that contains all the real balls. We implement this procedure in Section 5.2.

The algorithm \({\sf LooseSwapMisplaced}_\ell\) receives the same input as SwapMisplaced: \(n\) balls, each labeled \({\sf red}\), \({\sf blue}\), or \(\bot\), where the number of \({\sf blue}\)s equals the number of \({\sf red}\)s. This procedure swaps the locations of all the red-blue pairs except for at most a \(1/\ell\) fraction; all swapped balls are relabeled \(\bot\). We implement this procedure below in this subsection.

Using these two procedures, SwapMisplaced works by first running \({\sf LooseSwapMisplaced}_\ell\), which makes all the necessary swaps except for at most a \(1/\ell\) fraction. We then perform \({\sf LooseCompaction}_\ell\) on the resulting array, moving all the remaining \({\sf red}\) and \({\sf blue}\) balls to the first half of the array, and continue recursively with \({\sf SwapMisplaced}\) on that half. To facilitate the recursion, we record the original placement of the balls and their movements, and revert them in the end. Given linear-time algorithms for \({\sf LooseCompaction}_\ell\) and \({\sf LooseSwapMisplaced}_\ell\) (which we achieve below), the running time satisfies the recurrence \(T(n)=T(n/2)+O(n)\) and is therefore linear. The description of \({\sf SwapMisplaced}\) is given in Algorithm 5.5, and we have the following claim.
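The recursive structure can be sketched as follows. This is a non-oblivious skeleton: the two loose subroutines are simplistic stand-ins (resolving at least half of the red/blue pairs and gathering the leftovers), not the expander-based implementations of the paper, but the recursion and the write-back of remembered positions mirror the description above:

```python
def swap_misplaced(arr, colors):
    """Swap each 'red' element with some 'blue' element, in place."""
    reds = [i for i, c in enumerate(colors) if c == 'red']
    blues = [i for i, c in enumerate(colors) if c == 'blue']
    pairs = list(zip(reds, blues))
    if not pairs:
        return
    # stand-in LooseSwapMisplaced: resolve (at least) half of the pairs
    for i, j in pairs[:max(1, len(pairs) // 2)]:
        arr[i], arr[j] = arr[j], arr[i]
        colors[i] = colors[j] = None
    # stand-in LooseCompaction: gather leftovers, recurse, write back
    rest = [i for i, c in enumerate(colors) if c is not None]
    sub = [arr[i] for i in rest]
    subcol = [colors[i] for i in rest]
    swap_misplaced(sub, subcol)
    for t, i in enumerate(rest):
        arr[i], colors[i] = sub[t], subcol[t]
```

Since each level clears at least half of the remaining pairs in linear time, the skeleton follows the same \(T(n)=T(n/2)+O(n)\) recurrence.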

Claim 5.4.

Let \({\bf I}\) be any input where the number of balls marked \({\sf red}\) equals the number of balls marked \({\sf blue}\), and let \({\bf O}={\sf SwapMisplaced}({\bf I})\). Then, \({\bf O}\) is a permutation of the balls in \({\bf I}\), where each \({\sf red}\) ball in \({\bf I}\) swaps its position with one \({\sf blue}\) ball in \({\bf I}\). Moreover, the runtime of \({\sf SwapMisplaced}({\bf I})\) is linear in \(|{\bf I}|\).

Implementing \({\sf LooseSwapMisplaced}_\ell\). The access pattern of the algorithm is determined by a (deterministically generated) expander graph \(G_{\epsilon ,n} = (L,R,E)\), where the allowed swaps are between vertices at distance 2. That is, we interpret \(L = R= [n]\); if two vertices \(i,k \in R\) have a common neighbor \(j \in L\), and \({\bf I}[i],{\bf I}[k]\) are marked with different colors, then we swap them and change their marks to \(\bot\). Choosing the expansion parameters of the graph appropriately guarantees that after performing these swaps, at most \(n/\ell\) balls remain misplaced. As the graph is \(d\)-regular, there are at most \({d \choose 2}\cdot n\) pairs of vertices at distance 2, and since \(d=O(1)\), the total running time is \(O(n)\).
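The distance-2 scan can be sketched as follows, under the assumption that the graph is given as an adjacency list `adj` mapping each left vertex to its right neighbors (a hypothetical representation chosen for the sketch):

```python
from itertools import combinations

# Sketch of LooseSwapMisplaced: for every left vertex we scan its
# O(d^2) pairs of right neighbors and swap any red/blue pair we find,
# marking both as resolved. With d = O(1) this is O(n) total work.
def loose_swap(arr, colors, adj):
    for nbrs in adj:
        for i, k in combinations(nbrs, 2):
            if {colors[i], colors[k]} == {'red', 'blue'}:
                arr[i], arr[k] = arr[k], arr[i]
                colors[i] = colors[k] = None
```

Note that a real oblivious implementation touches every pair regardless of colors; here only the swap logic is illustrated, not the access pattern.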

Claim 5.7.

Let \({\bf I}\) be an input array in which the number of balls marked \({\sf blue}\) equals the number of balls marked \({\sf red}\). Denote as \({\bf O}\) the output array of \({\sf LooseSwapMisplaced}\). Then, \({\bf O}\) is a swap of the input array:

There exist pairs of indices \((i_1,j_1),\ldots ,(i_k,j_k)\) all distinct such that the following holds: For every \(\ell \in [k]\), \({\bf I}[i_\ell ],{\bf I}[j_\ell ]\) are marked with different colors (\(({\sf red},{\sf blue})\) or \(({\sf blue},{\sf red})\)), and \({\bf O}[i_\ell ] = {\bf I}[j_\ell ]\), \({\bf O}[j_\ell ] = {\bf I}[i_\ell ]\) and both \({\bf O}[i_\ell ],{\bf O}[j_\ell ]\) are marked \(\bot\).

For every \(i \not\in \lbrace i_1,\ldots ,i_k, j_1,\ldots ,j_k\rbrace\), it holds that \({\bf O}[i] = {\bf I}[i]\) and both have the same mark.

In \({\bf O}\), the number of balls marked \({\sf red}\) equals the number of balls marked \({\sf blue}\), and at most a \(1/\ell\) fraction of the balls are marked.

Proof.

The algorithm only performs swaps between \({\sf red}\) and \({\sf blue}\) balls, and therefore \({\bf O}\) is a permutation of \({\bf I}\), the three conditions hold, and the number of balls marked \({\sf red}\) equals the number of balls marked \({\sf blue}\). It remains to show that the number of \({\sf red}/{\sf blue}\) balls in \({\bf O}\) is at most \(n/\ell\). At the end of the execution of the algorithm, let \(R_{{\sf red}}\) be the set of all vertices in \(R\) that are marked \({\sf red}\), and let \(R_{{\sf blue}}\) be the set of vertices in \(R\) that are marked \({\sf blue}\). Then, it must be that \(\Gamma (R_{{\sf red}}) \cap \Gamma (R_{{\sf blue}}) = \emptyset\), as otherwise the algorithm would have swapped an element in \(R_{{\sf red}}\) with an element in \(R_{\sf blue}\). Since the number of balls in \(R_{\sf red}\) and in \(R_{\sf blue}\) is equal, it suffices to show that every subset \(R^{\prime } \subset R\) of size greater than \(n/(2\ell)\) satisfies \(|\Gamma (R^{\prime })| \gt n/2\). This implies that \(|R_{\sf red}|=|R_{\sf blue}| \le n/(2\ell)\), as otherwise \(\Gamma (R_{{\sf red}}) \cap \Gamma (R_{{\sf blue}}) \ne \emptyset\). The fact that every set of vertices is expanding follows generically from the equivalence between spectral expansion (the definition of expanders we use) and vertex expansion; we give a direct proof below.

Let \(R^{\prime } \subset R\) with \(|R^{\prime }| \gt n/(2\ell)\) and let \(L^{\prime } = \Gamma (R^{\prime })\) be its set of neighbors. Since the graph is \(d_\epsilon\)-regular for some \(d_\epsilon \in O(1)\), it holds that \(e(L^{\prime },R^{\prime }) = d_\epsilon \cdot |R^{\prime }|\). Thus, by the guarantee on the expander graph (Theorem 5.2) and by \(\epsilon \le \frac{1}{2\sqrt {\ell }}\), it holds that \(\begin{eqnarray*} d_\epsilon \cdot |R^{\prime }| = e(L^{\prime },R^{\prime }) \le \frac{d_\epsilon |L^{\prime }||R^{\prime }|}{n}+\frac{d_\epsilon }{2\sqrt {\ell }}\cdot \sqrt {|L^{\prime }||R^{\prime }|}. \end{eqnarray*}\) Dividing by \(d_\epsilon \cdot |R^{\prime }|\) and rearranging, we get \(\begin{align*} 1 - \frac{|L^{\prime }|}{n} \le \sqrt {\frac{|L^{\prime }|}{4\ell \cdot |R^{\prime }|}} \; . \end{align*}\) Since \(|R^{\prime }| \gt n/(2\ell)\), we have \(\begin{equation*} 1 - \frac{|L^{\prime }|}{n} \lt \sqrt {\frac{|L^{\prime }|}{2n}}. \end{equation*}\) Solving the above by squaring and rearranging, \((1 - \frac{|L^{\prime }|}{n})^2 - \frac{|L^{\prime }|}{2n} \lt 0\), we have \(|L^{\prime }| \gt n/2\).□

5.2 Loose Compaction

In Section 5.2.1, we describe the algorithm \({\sf CompactionFromMatching}\)—compacting an array given the required matching (via “folding”). In Section 5.2.2, we show how to compute the matching, both for the case where \(m\) is “big” (\(\mathsf {SlowMatch}\)) and the case where \(m\) is “small” (\(\mathsf {FastMatch}\)). In Section 5.2.3, we present the full loose compaction algorithm.

5.2.1 Compaction from Matching.

We show that with the appropriate notion of matching (given below), one can “fold” an array \(A\), with a density of real balls being small enough, such that all the real balls reside in the output array of size \(n/2\).

Definition 5.8 ((B, B/4)-Matching).

Let \(G = (L, R, E)\) be a bipartite graph, and let \(S \subseteq L\) and \(M \subseteq E\). Given any vertex \(u \in L \cup R\), define \(\Gamma _M(u) := \lbrace v \in L \cup R \mid (u,v) \in M\rbrace\) as the subset of neighboring vertices in \(M\). We say that \(M\) is a \((B,B/4)\)-matching for \(S\), iff (i) for every \(u \in S\), \(|\Gamma _M(u)| \ge B\), and; (ii) for every \(v \in R\), \(|\Gamma _M(v)| \le B/4\).
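The two conditions of the definition translate directly into a degree check on the matched edge set. A minimal checker sketch (the function name and edge-list representation are ours):

```python
from collections import Counter

# Check Definition 5.8 on an explicit edge set M (a list of (u, v)
# pairs): every u in S must keep at least B matched edges, and no right
# vertex may receive more than B/4 of them.
def is_matching(M, S, B):
    left_deg, right_deg = Counter(), Counter()
    for u, v in M:
        left_deg[u] += 1
        right_deg[v] += 1
    return (all(left_deg[u] >= B for u in S)
            and all(d <= B // 4 for d in right_deg.values()))
```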

In Algorithm 5.9, we show how to compact an array given a \((B,B/4)\)-matching for the set of all dense bins \(S\), where a bin is said to be dense if it contains more than \(B/4\) real balls. We assume that the matching itself is given to us via an algorithm \({\sf ComputeMatching}_{G}(S)\). The implementation of \({\sf ComputeMatching}_{G}(\cdot)\) is given in Section 5.2.2. Note that for any problem size \(m \lt 2B\), it suffices to perform oblivious sorting (e.g., Theorem 4.1) instead of the following algorithm as \(B\) is a constant.

Given that \(|E|=O(m)\) and \(B\) is a constant, the running time is linear in \(\lceil D/w\rceil \cdot m\). For correctness, first note that there are at most \(\frac{m}{B} \cdot \frac{1}{32}\) dense bins, since the total number of real balls is at most \(\frac{m}{128}\) (i.e., \(|S| \le \frac{m}{32B}\)); hence \(M\) is indeed a \((B,B/4)\)-matching output by \({\sf ComputeMatching}\), as we will show later in Claim 5.16. The \((B,B/4)\)-matching \(M\) guarantees that every right vertex of \(G\) receives at most \(B/4\) real elements. As a result, after the distribute phase, every bin in the entire graph contains at most \(B/4\) real elements, and we can fold the array without any overflow. The following claim is immediate.
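The reason the fold cannot overflow can be sketched concretely. Assuming (for illustration) that each bin is a length-\(B\) list padded with `None` dummies and that the distribute phase has already bounded every bin by \(B/4\) reals, merging bin \(i\) with bin \(i + m/2\) fits in a bin of capacity \(B/2\):

```python
# Minimal sketch of the "fold" at the end of CompactionFromMatching:
# once every bin holds at most B/4 real balls, merging bin i with bin
# i + half into a bin of capacity B/2 can never overflow.
def fold(bins, B):
    half = len(bins) // 2
    out = []
    for i in range(half):
        reals = [x for x in bins[i] + bins[i + half] if x is not None]
        assert len(reals) <= B // 2  # guaranteed by the B/4 bound per bin
        out.append(reals + [None] * (B // 2 - len(reals)))
    return out
```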

Claim 5.10.

Let \({\bf I}\) be an array of \(m\) balls, where each ball is of size \(D\) bits, and where at most \(m /128\) balls are marked real. Then, the output of \({\sf CompactionFromMatching}_{D}({\bf I})\) is an array of \(m/2\) balls that contains all real balls in \({\bf I}\). The running time of the algorithm is \(O(\lceil D/w\rceil \cdot m)\) plus the running time of \({\sf ComputeMatching}_{G_{\epsilon ,m/B}}\).

5.2.2 Computing the Matching.

To compute the matching, we have two cases to consider, depending on the size of the input, and each algorithm results in a different running time.

Case I: \(\mathsf {SlowMatch}\)\((\frac{m}{B} \gt \frac{w}{\log w})\). We transform the non-oblivious algorithm described in the overview, which runs in time \(O(m)\), into an oblivious algorithm by performing fake accesses. This results in an algorithm that requires \(O(m \cdot \log m)\) time [14]. The bolded instructions in \(\mathsf {SlowMatch}_{G_{\epsilon ,m/B}}(S)\) are the ones where we pay the extra \(\log m\) factor in efficiency; these accesses will be avoided in Case II (\(\mathsf {FastMatch}\)).

In the following claim, we show that the size of \(L^{\prime }\) decreases by a constant factor in every iteration which implies that the algorithm finishes after \(\log (m/B)\) iterations. This means that \(\mathsf {SlowMatch}\) outputs a correct \((B,B/4)\)-matching for \(S\) in time \(O(m \cdot \log m)\) (see Claim 5.14).
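The request/reply loop can be sketched non-obliviously as follows. This is a simplified paraphrase, not Algorithm 5.12 itself: an unsatisfied left vertex asks its neighbors, a right vertex grants a request only while its total load stays within \(B/4\), and a left vertex that collects \(B\) grants keeps those edges and leaves \(L'\):

```python
# Non-oblivious sketch of the SlowMatch request/reply loop with a
# simplified reply rule. adj maps left vertices to right neighbors.
def slow_match(adj, S, B):
    M, load = [], {}
    L = set(S)
    while L:
        newly_satisfied = set()
        for u in sorted(L):
            granted = []
            for v in adj[u]:
                if load.get(v, 0) < B // 4:
                    load[v] = load.get(v, 0) + 1
                    granted.append(v)
                if len(granted) == B:
                    break
            if len(granted) == B:
                M += [(u, v) for v in granted]
                newly_satisfied.add(u)
            else:
                for v in granted:  # release grants of a failed attempt
                    load[v] -= 1
        if not newly_satisfied:
            raise RuntimeError("graph does not expand enough")
        L -= newly_satisfied
    return M
```

Claim 5.13 is what rules out the error branch for the actual expander-based instantiation: at least half of \(L'\) becomes satisfied in every round.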

Claim 5.13.

Let \(S \subset L\) such that \(|S| \le m/(32B)\), and let \(\epsilon = \frac{1}{64}\). In each iteration of Algorithm 5.12, the number of unsatisfied vertices \(|L^{\prime }|\) decreases by a factor of 2.

Proof.

Let \(L^{\prime }\) be the set of unsatisfied vertices at the beginning of any given round, let \(R^{\prime }_{\sf neg} \subseteq R^{\prime } \subseteq \Gamma (L^{\prime })\) be the set of neighbors that reply negatively, and let \(m^{\prime } = m/B\). Then, \(e(L^{\prime }, R^{\prime }_{\sf neg}) \gt |R^{\prime }_{\sf neg} |\cdot B/4\). From the expansion property in Theorem 5.2, we obtain \(|R^{\prime }_{\sf neg}|\cdot B/4 \lt e(L^{\prime },R^{\prime }_{\sf neg}) \le d_\epsilon |L^{\prime }||R^{\prime }_{\sf neg}|/m^{\prime }+\epsilon d_\epsilon \sqrt {|L^{\prime }||R^{\prime }_{\sf neg}|}\); dividing by \(|R^{\prime }_{\sf neg}|d_\epsilon\) and rearranging, this becomes \(\epsilon \sqrt {|L^{\prime }|/|R^{\prime }_{\sf neg}|} \gt B/(4d_\epsilon)-|L^{\prime }|/m^{\prime }\). We chose \(B\) as the largest power of 2 that is no larger than \(d_\epsilon /2\), and so \(B /d_\epsilon \gt 1/4\). Since \(|L^{\prime }|/m^{\prime }\le 1/32\) (recall that \(L^{\prime }\) is initially \(S\), and the number of dense vertices, \(|S|\), is at most \(m/(32B)\)), and since \(\epsilon = \frac{1}{64}\), we have that \(\begin{align*} \sqrt {|L^{\prime }|/|R^{\prime }_{\sf neg}|} \gt \frac{1}{\epsilon }\cdot \frac{B}{4d_\epsilon }-\frac{1}{\epsilon }\cdot \frac{|L^{\prime }|}{m^{\prime }} \gt \frac{1}{\epsilon }\cdot \left(\frac{1}{16} - \frac{1}{32}\right)\!, \end{align*}\) so \(\sqrt {|L^{\prime }|/|R^{\prime }_{\sf neg}|} \gt 64/16 - 64/32 = 2\), i.e., \(|{R^{\prime }_{\sf neg}}| \lt |L^{\prime }|/4\).

We conclude that the number of vertices in \(R^{\prime }\) that reply negatively is at most \(|L^{\prime }|/4\). As \(L^{\prime }\) has \(d_\epsilon |L^{\prime }|\) outgoing edges, and \(R^{\prime }_{\sf neg}\) has at most \(d_\epsilon |R^{\prime }_{\sf neg}| \le d_\epsilon |L^{\prime }|/4\) incoming edges, at most one quarter of the edges leaving \(L^{\prime }\) lead to \(R^{\prime }_{\sf neg}\) and yield a negative reply. Since \(B=d_\epsilon /2\), every vertex in \(L^{\prime }\) sends \(d_\epsilon\) requests, and all negative replies come from \(R^{\prime }_{\sf neg}\), there are at most \(d_\epsilon |L^{\prime }|/4\) negative replies, and therefore at most \(|L^{\prime }|/2\) nodes in \(L^{\prime }\) get more than \(B=d_\epsilon /2\) negative replies. We conclude that at least \(|L^{\prime }|/2\) nodes become satisfied.□

Claim 5.14.

Let \(S \subset L\) such that \(|S| \le m/(32B)\), and let \(\epsilon = \frac{1}{64}\). Then, \(\mathsf {SlowMatch}_{ G_{\epsilon , m/B}}\) takes as input \(S\), runs obliviously in time \(O(m \cdot \log m)\), and outputs a \((B, B/4)\)-matching for \(S\).

Proof.

The runtime follows since there are \(\log \frac{m}{B}\) iterations and each iteration takes \(O(m)\) time. Obliviousness follows since the access pattern is a deterministic function of the graph \(G_{\epsilon , m/B}\), which depends only on the parameter \(m\) (but not on the input \(S\)). We argue correctness next. By Step 3(c)ii, every vertex \(u\) removed from \(L^{\prime }\) has at least \(B\) edges in the output \(M\). Also, every edge added to \(M\) must have a vertex in \(R^{\prime }\) that has received at most \(B/4\) requests at Step 3(b)ii. Observing that the set of received requests at Step 3(b)ii is non-increasing over iterations, it follows that for every \(v \in R\), \(|\Gamma _M(v)| \le B/4\). By Claim 5.13, after \(\log (m/B)\) iterations, we have \(L^{\prime } = \emptyset\), and hence every \(u \in S\) has at least \(B\) edges in \(M\).□

Case II: \(\mathsf {FastMatch}\)\((\frac{m}{B} \le \frac{w}{\log w})\). Here, we improve the running time by relying on the fact that for instances where \(m\) is this small, the whole graph can be encoded in a constant number of words. While obliviousness is again obtained by accessing the whole graph, this time it is much cheaper, as we can read the whole graph while accessing only a small number of words. The algorithm is described in Algorithm 5.15. Note that whenever we write, e.g., “access \({\sf Reply}[v]\)”, we actually access the whole array \({\sf Reply}\), as it occupies \(O(1)\) words, and thus do not reveal which vertex we access. Each iteration therefore takes \(O(|L^{\prime }|+|R^{\prime }|)\) time, and since \(|L^{\prime }|\) is reduced by a factor of 2 in each iteration (and \(|R^{\prime }|\le d_{\epsilon }\cdot |L^{\prime }|\)), the total time is \(O(m)\). Additionally, note that we store the set \(L^{\prime }\) as a list and not as a bit vector, as we cannot afford the \(O(m)\) time required to scan all balls of \(L\) and test membership in \(L^{\prime }\).
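A toy illustration of the packed-word trick: when the instance fits in a machine word, a whole indicator array can live in one integer, so “access \({\sf Reply}[v]\)” reads and rewrites the entire packed word without revealing \(v\). Here a Python int stands in for the machine word, with one bit per entry (names are ours):

```python
# Pack a bit-array into a single integer; every access touches the
# whole word, so the accessed index is never revealed.
def read_bit(word, v):
    return (word >> v) & 1

def write_bit(word, v, bit):
    return (word & ~(1 << v)) | (bit << v)
```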

Claim 5.16.

Given the public \(d_\epsilon\)-regular bipartite graph \(G_{\epsilon ,m/B} = (L,R,E)\) such that \(B = d_\epsilon / 2\), let \(S \subseteq L\) be a set such that \(|S| \le \frac{m}{32 B}\). Then, \({\sf ComputeMatching}_{G_{\epsilon ,m/B}}(S)\) outputs a \((B,B/4)\)-matching for \(S\), where the running time is \(O(m)\) if \(\frac{m}{B} \le \frac{w}{\log w}\), and \(O(m \cdot \log m)\) otherwise.

Proof.

The correctness follows from Claim 5.14, and the time bound follows from the running times of \(\mathsf {FastMatch}\) and \(\mathsf {SlowMatch}\) in the respective cases.□

5.2.3 Oblivious Loose Compaction.

By combining Claims 5.16 and 5.10 we get the following corollary.

Corollary 5.17.

The total running time of \({\sf CompactionFromMatching}_{D}\) on an array of \(m\) elements each of size \(D\) bits is \(O(\lceil D/w\rceil \cdot m)\) if \(m \le \frac{w}{\log w}\), and \(O(\lceil D/w\rceil \cdot m + m \cdot \log m)\) otherwise.

The next algorithm shows how to compute \({\sf LooseCompaction}_\ell ({\bf I})\) for any input array \({\bf I}\) in which at most \(| {\bf I}| /\ell\) elements are marked as real.

Theorem 5.19.

Let \(\ell = 2^{38}\). For any input array \({\bf I}\) with \(n\) balls such that each ball consists of \(D\) bits and at most \(n/\ell\) balls are real, the procedure \({\sf LooseCompaction}_{\ell }({\bf I})\) outputs an array of size \(n/2\) consisting of all real balls in \({\bf I}\), and runs in time \(O(\lceil D/w \rceil \cdot n)\).

Proof.

To show the correctness, we check that in all cases, both \({\sf CompactionFromMatching}\) and \({\sf LooseCompaction}_\ell\) are called with an input array \({\bf I}\) such that at most \(|{\bf I}| / 128\) balls are real, so that the correctness is implied by Claim 5.10 (henceforth referred to as the 128-condition). We proceed by checking the 128-condition for each case in \({\sf LooseCompaction}_\ell\). Note that \(2^{38} = (2 \cdot (2 \cdot 256)^2)^2\).

Case I: It is always called with \(\ell = 2^{38}\). At Step 2, note that the number of dense blocks is at most \(\frac{n}{{\mu }}\cdot \frac{1}{\sqrt {\ell }}\), as the total number of real balls is at most \(\frac{n}{\ell }\). Hence, the two \({\sf CompactionFromMatching}\) invocations take as input at most \((\frac{n}{{\mu }}) / \sqrt {\ell }\) and then \((\frac{n}{2{\mu }}) / (\sqrt {\ell } / 2)\) dense blocks, and the 128-condition holds as \(\sqrt {\ell } / 2 = 2^{18} \gt 128\). The two \({\sf LooseCompaction}_{\sqrt {\ell } / 2}\) invocations take at most \({\mu }/ \sqrt {\ell }\) and \((\frac{{\mu }}{2}) / (\sqrt {\ell } / 2)\) real balls, as each block is not dense, and the 128-condition holds for \(\sqrt {\ell }\) and \(\sqrt {\ell } / 2\) similarly.

Case II: It can be called either directly with \(\ell = 2^{38}\), or indirectly from Case I with \(\ell = \sqrt {2^{38}} / 2 = 2^{18}\). Hence, the number of real balls is at most \(n / 2^{18}\). By the same calculation as in Case I, the two \({\sf CompactionFromMatching}\) and two \({\sf LooseCompaction}_{\sqrt {\ell } / 2}\) invocations take input arrays whose sparsity is at least \(\sqrt {2^{18}} / 2 = 256 \gt 128\), and the 128-condition holds.

Case III: Similar to Case II, it can be called directly, indirectly from Case I, or indirectly from Case II, with \(\ell = 2^{38}\), \(2^{18}\), or \(256\), respectively. Hence, the given sparsity \(\ell\) is at least 256, and the 128-condition holds directly for \({\sf CompactionFromMatching}\).

To show the time complexity, observe that, except for \({\sf CompactionFromMatching}\), all other procedures run in time \(O(n)\). By Corollary 5.17, running \({\sf CompactionFromMatching}\) on \(m\) items of size \(D\) bits takes \(O(\lceil D/w\rceil \cdot m)\) time if \(m \le \frac{w}{\log w}\), or \(O(\lceil D/w\rceil \cdot m + m \cdot \log m)\) otherwise. For any \(n\), the depth of the recursive calls to \({\sf LooseCompaction}\) is at most 2. Hence, it suffices to show that in each case, every \({\sf CompactionFromMatching}\) runs in \(O(\lceil D / w\rceil \cdot n)\) time. We proceed from Case III back to Case I.

Case III: Since \(n \lt \frac{w}{\log w}\), the invocation \({\sf CompactionFromMatching}_D\) takes an input of size \(m = n \lt \frac{w}{\log w}\) and runs in \(O(\lceil D/w\rceil \cdot n)\) time by Corollary 5.17.

Case II: Given that \(n \lt (\frac{w}{\log w})^2\), the subsequent \({\sf CompactionFromMatching}_{D\cdot p}\) takes input size \(m = \frac{n}{p} \lt \frac{w}{\log w}\). Hence, its running time is \(O(\lceil D\cdot p/w\rceil \cdot m) = O(\lceil D/w\rceil \cdot n)\), as in Case III.

Case I: For arbitrary \(n\), the subsequent invocation of \({\sf CompactionFromMatching}_{D\cdot p^2}\), in Steps 3 and 4, takes an input size \(m = n / p^2\) and then \(m / 2\). By Corollary 5.17, in both cases, the procedure runs in time \(O(\lceil D \cdot p^2 / w\rceil \cdot m + m \cdot \log m) = O(\lceil D / w\rceil \cdot n + (n/p^2) \cdot \log n)\). Since Case I is the starting point of the algorithm, by the standard RAM model, we have that \(w = \Omega (\log n)\), which implies that \(n/p^2 = O(n / \log n)\) as \(p = w/\log w\). Thus, the total time is bounded by \(O(\lceil D / w\rceil \cdot n)\).

Plugging \({\sf LooseCompaction}_{2^{38}}\) into \({\sf SwapMisplaced}\), by Claim 5.4 and Theorem 5.19, we have the linear-time tight compaction claimed in Theorem 5.1.

5.3 Oblivious Distribution

In oblivious distribution, the input is an array \({\bf I}\) of \(n\) balls and a set \(A \subseteq [n]\) such that each ball in \({\bf I}\) is labeled 0 or 1 and the number of 0-balls equals \(|A|\). The output is a permutation of \({\bf I}\) such that for each \(i \in [n]\), the \(i\)th location holds a 0-ball if and only if \(i \in A\). By marking the balls \({\sf red}, {\sf blue},\) or \(\bot\) accordingly, \({\sf SwapMisplaced}\) achieves oblivious distribution, as elaborated in Algorithm 5.20, where the set \(A\) is represented as an array of \(n\) indicators such that \(i \in A\) iff \(A[i] = 0\).
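The functionality itself (ignoring obliviousness) is easy to state as a reference implementation, which may help when checking Algorithm 5.20; the function name and list encoding are ours:

```python
# Non-oblivious reference for the Distribution functionality: place the
# 0-balls exactly at the positions i with A[i] == 0, keeping the
# 1-balls in the remaining positions (relative order preserved).
def distribute(balls, labels, A):
    zeros = [b for b, l in zip(balls, labels) if l == 0]
    ones = [b for b, l in zip(balls, labels) if l == 1]
    return [zeros.pop(0) if bit == 0 else ones.pop(0) for bit in A]
```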

The correctness, security, and time complexity of \({\sf Distribution}\) follow directly from those of \({\sf SwapMisplaced}\), and this gives the following theorem.

Theorem 5.21 (Oblivious Distribution).

There exists a deterministic oblivious distribution algorithm that takes \(O(n)\) time on input arrays of size \(n\).

Skip 6INTERSPERSING RANDOMLY SHUFFLED ARRAYS Section

6 INTERSPERSING RANDOMLY SHUFFLED ARRAYS

In this section, we present the following variants of shuffling an array of \(n\) elements. We suppose that for any \(m \in [n]\), sampling an integer uniformly at random from the set \([m]\) takes unit time.

6.1 Interspersing Two Arrays

We first describe a building block called \({\sf Intersperse}\) that allows us to randomly merge two randomly shuffled arrays. Informally, we would like to realize the following abstraction:

Input: An array \({\bf I} ~{\sf :=}~ {\bf I}_0\Vert {\bf I}_1\) of size \(n\) and two numbers \(n_0\) and \(n_1\) such that \(|{\bf I}_0| = n_0\) and \(|{\bf I}_1| = n_1\) and \(n=n_0+n_1\). We assume that each element in the input array fits in \(O(1)\) memory words.

Output: An array \({\bf B}\) of size \(n\) that contains all elements of \({\bf I}_0\) and \({\bf I}_1\). Each position in \({\bf B}\) will hold an element from either \({\bf I}_0\) or \({\bf I}_1\), chosen uniformly at random and the choices are concealed from the adversary.

Looking ahead, we will invoke the procedure \({\sf Intersperse}\) with arrays \({\bf I}_0\) and \({\bf I}_1\) that are already randomly and independently shuffled (each with a hidden permutation). So, when we apply \({\sf Intersperse}\) on such arrays the output array \({\bf B}\) is guaranteed to be a random permutation of the array \({\bf I} ~{\sf :=}~ {\bf I}_0 \Vert {\bf I}_1\) in the eyes of an adversary.

The intersperse algorithm. The idea is to first generate a random auxiliary array of 0’s and 1’s, denoted \({\sf Aux}\), such that the number of 0’s in the array is exactly \(n_0\) and the number of 1’s is exactly \(n_1\). This can be done obliviously by sampling each bit sequentially, conditioned on the number of 0’s sampled so far (see Algorithm 6.1). \({\sf Aux}\) is used to decide the following: if \({\sf Aux}[i]=0\), then the \(i\)th position in the output will pick up an element from \({\bf I}_0\), and otherwise, from \({\bf I}_1\).

Next, to obliviously route elements from \({\bf I}_0\) (and \({\bf I}_1\), respectively) to the \(i\)th position such that \({\sf Aux}[i]=0\) (and \({\sf Aux}[i]=1\), respectively), it is performed using the deterministic oblivious distribution given in Section 5.3—mark every element in \({\bf I}_0\) as 0-balls, mark every element in \({\bf I}_1\) as 1-balls, and then run \({\sf Distribution}\) (Algorithm 5.20) on the marked array \({\bf I}= {\bf I}_0 \Vert {\bf I}_1\) and the auxiliary array \({\sf Aux}\).
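The two steps above can be sketched as follows. The \({\sf Aux}\) sampling matches the description (bit \(i\) is 0 with probability proportional to the number of unassigned 0-slots, making \({\sf Aux}\) a uniformly random balanced bit string), while the routing here is plain list surgery standing in for the oblivious Distribution procedure:

```python
import random

# Sketch of Intersperse: sample Aux bit by bit, then route. The real
# algorithm routes with the oblivious Distribution procedure; here an
# ordinary scan over Aux stands in for it.
def intersperse(I0, I1):
    n0, n1 = len(I0), len(I1)
    aux, zeros_left = [], n0
    for i in range(n0 + n1):
        rem = n0 + n1 - i  # positions still unassigned
        bit = 0 if random.randrange(rem) < zeros_left else 1
        zeros_left -= (bit == 0)
        aux.append(bit)
    it0, it1 = iter(I0), iter(I1)
    return [next(it0) if b == 0 else next(it1) for b in aux]
```

Since each prefix of \({\sf Aux}\) is sampled with the correct conditional probability, every balanced bit string is equally likely, as the functionality requires.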

The formal description of the algorithm for interspersing two arrays is given in Algorithm 6.1. The functionality that it implements (assuming that the two input arrays are randomly shuffled) is given in Functionality 6.2 and the proof that the algorithm implements the functionality is given in Claim 6.3.

Claim 6.3.

Let \({\bf I}_0\) and \({\bf I}_1\) be two arrays of size \(n_0\) and \(n_1\), respectively, that satisfy the input assumption in the description of Algorithm 6.1. The Algorithm \({\sf Intersperse}_n({\bf I}_0\Vert {\bf I}_1,n_0,n_1)\) obliviously implements the functionality \({{\mathcal {F}}_{\sf Shuffle}^{n}}({\bf I}_0 \Vert {\bf I}_1)\). The implementation runs in \(O(n)\) time.

The proof of this claim is deferred to Appendix C.2.

6.2 Interspersing Multiple Arrays

We generalize the Intersperse algorithm to work with \(k\in \mathbb {N}\) arrays as input. The algorithm is called \({\sf Intersperse}^{(k)}\) and it implements the following abstraction:

Input: An array \({\bf I}_1 \Vert \ldots \Vert {\bf I}_k\) consisting of \(k\) different arrays of lengths \(n_1,\ldots ,n_k\), respectively. The parameters \(n_1,\ldots ,n_k\) are public.

Output: An array \({\bf B}\) of size \(\sum _{i=1}^k n_i\) that contains all elements of \({\bf I}_1,\ldots ,{\bf I}_k\). Each position in \({\bf B}\) will hold an element from one of the arrays, chosen uniformly at random and the choices are concealed from the adversary.

As in the case of \(k=2\), we will invoke the procedure \({\sf Intersperse}^{(k)}\) with arrays \({\bf I}_1,\ldots ,{\bf I}_k\) that are already randomly and independently shuffled (with \(k\) hidden permutations). So, when we apply \({\sf Intersperse}^{(k)}\) on such arrays the output array \({\bf B}\) is guaranteed to be a random permutation of the array \({\bf I} ~{\sf :=}~ {\bf I}_1 \Vert \ldots \Vert {\bf I}_k\) in the eyes of an adversary.

The algorithm. To intersperse \(k\) arrays \({\bf I}_1,\ldots ,{\bf I}_k\), we intersperse the first two arrays using \({\sf Intersperse}_{n_1+n_2}\), then intersperse the result with the third array, and so on. The precise description is given in Algorithm 6.4.
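The iterated structure is a simple fold over the arrays. In this sketch a plain random interleaving stands in for the two-array Intersperse of Algorithm 6.1 (names are ours):

```python
import random

# Stand-in for Algorithm 6.1: randomly interleave two arrays while
# preserving the relative order within each.
def intersperse2(a, b):
    bits = [0] * len(a) + [1] * len(b)
    random.shuffle(bits)
    ia, ib = iter(a), iter(b)
    return [next(ia) if t == 0 else next(ib) for t in bits]

# Sketch of Intersperse^(k): fold the two-array merge over the list,
# merging the accumulated prefix with the next array each time.
def intersperse_k(arrays):
    acc = arrays[0]
    for nxt in arrays[1:]:
        acc = intersperse2(acc, nxt)
    return acc
```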

We prove that this algorithm obliviously implements a uniformly random shuffle.

Claim 6.5.

Let \(k\in \mathbb {N}\) and let \({\bf I}_1,\ldots ,{\bf I}_k\) be \(k\) arrays of \(n_1,\ldots ,n_k\) elements, respectively, that satisfy the input assumption as in the description of Algorithm 6.4. The Algorithm \({\sf Intersperse}^{(k)}_{n_1,\ldots ,n_k}({\bf I}_1 \Vert \ldots \Vert {\bf I}_k)\) obliviously implements the functionality \({{\mathcal {F}}_{\sf Shuffle}^{n}}({\bf I})\). The implementation requires \(O(n + \sum _{i=1}^{k-1}(k-i)\cdot n_i)\) time.

The proof of this claim is deferred to Appendix C.2.

6.3 Interspersing Reals and Dummies

We describe a related algorithm, called \({\sf IntersperseRD}\), which will also serve as a useful building block. Here, the abstraction we implement is the following:

Input: An array \({\bf I}\) of \(n\) elements, where each element is tagged as either real or dummy. The real elements are distinct. We assume that if we extract the subset of all real elements in the array, then these elements appear in random order. However, there is no guarantee of the relative positions of the real elements with respect to the dummy ones.

Output: An array \({\bf B}\) of size \(n\) containing all real elements in \({\bf I}\) and the same number of dummy elements, where all elements in the array are randomly permuted. The only leakage is \(n\), the number of elements in the array.

In other words, the real elements are randomly permuted, but there is no guarantee regarding their order in the array with respect to the dummy elements. In particular, the dummy elements can appear in arbitrary (known to the adversary) positions in the input, e.g., appear all in the front, all at the end, or appearing in all the odd positions. The output will be an array where all the real and dummy elements are randomly permuted, and the random permutation is hidden from the adversary.

The implementation of \({\sf IntersperseRD}\) first runs the deterministic tight compaction procedure on the input array so that all the real balls appear before the dummy ones. Next, we count the number of real elements in this array and run the Intersperse procedure from Algorithm 6.1 on it with the computed sizes. The formal implementation appears as Algorithm 6.6.
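The compact-count-intersperse pipeline can be sketched as follows. Both subroutines are plain stand-ins (a stable partition for oblivious tight compaction, and a random interleaving for Algorithm 6.1); only the pipeline structure mirrors Algorithm 6.6:

```python
import random

# Sketch of IntersperseRD: move reals to the front, count them, then
# intersperse the real prefix with the dummy suffix.
def intersperse_rd(arr, is_real):
    reals = [x for x in arr if is_real(x)]        # stand-in compaction
    dummies = [x for x in arr if not is_real(x)]
    bits = [0] * len(reals) + [1] * len(dummies)  # stand-in intersperse
    random.shuffle(bits)
    ir, idm = iter(reals), iter(dummies)
    return [next(ir) if b == 0 else next(idm) for b in bits]
```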

We prove that this algorithm obliviously implements a uniformly random shuffle.

Claim 6.7.

Let \({\bf I}\) be an array of \(n\) elements that satisfies the input assumption in the description of Algorithm 6.6. The Algorithm \({\sf IntersperseRD}_n({\bf I})\) obliviously implements the functionality \({{\mathcal {F}}_{\sf Shuffle}^{n}}({\bf I})\). The implementation runs in \(O(n)\) time.

The proof of this claim is deferred to Appendix C.2.

6.4 Perfect Oblivious Random Permutation (Proof of Theorem 4.6)

Recall that an oblivious random permutation shuffles an input array of \(n\) elements using a secret permutation \(\pi :[n] \rightarrow [n]\) uniformly at random (Section 4.2). The following (perfect) oblivious random permutation, \({\sf PerfectORP}\), is constructed with a standard divide-and-conquer technique using \({\sf Intersperse}\) for merging.

We argue that \({\sf PerfectORP}\) runs in \(O(n\cdot \log n)\) time and permutes \({\bf I}\) uniformly at random. The time bound follows since \({\sf Intersperse}\) runs in \(O(n)\) time (Claim 6.3) and the recursion consists of two subproblems, each of half the size. The fact that the permutation is uniformly random follows by induction and the fact that \({\sf Intersperse}\) perfectly-obliviously implements \({{\mathcal {F}}_{\sf Shuffle}^{n}}\) (Claim 6.3).
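The divide-and-conquer shape can be sketched directly; a plain random interleaving stands in for the two-array Intersperse, and the recurrence \(T(n)=2T(n/2)+O(n)\) gives the claimed \(O(n\log n)\) time:

```python
import random

# Sketch of PerfectORP: permute each half recursively, then merge the
# halves with a (stand-in) intersperse.
def perfect_orp(arr):
    if len(arr) <= 1:
        return list(arr)
    mid = len(arr) // 2
    left = perfect_orp(arr[:mid])
    right = perfect_orp(arr[mid:])
    bits = [0] * len(left) + [1] * len(right)
    random.shuffle(bits)
    il, ir = iter(left), iter(right)
    return [next(il) if b == 0 else next(ir) for b in bits]
```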

Skip 7BigHT: Oblivious Hashing for Non-Recurrent Lookups Section

7 BigHT: Oblivious Hashing for Non-Recurrent Lookups

The hash table construction we describe in this section suffers from a \(\mathsf {poly}\log \log \lambda\) extra multiplicative factor in \({\sf Build}\) and \({\sf Lookup}\) (which leads to a similar overhead in the implied ORAM construction). Nevertheless, this hash table serves as a first step, and we will get rid of the extra factor in Section 8. Hence, the parameter of expected bin load \(\mu = \log ^9 \lambda\) is seemingly loose in this section but is necessary later in Section 8 (to apply Cuckoo hashing). Additionally, note that this hash table captures and simplifies many of the ideas in the oblivious hash table of Patel et al. [53] and can be used to obtain an ORAM with a similar overhead to theirs.

CONSTRUCTION 7.1: Hash Table for Shuffled Inputs

Procedure \({\sf BigHT}.{\sf Build}({\bf I})\):

Input: An array \({\bf I}=(a_1,\ldots , a_n)\) containing \(n\) elements, where each \(a_i\) is either dummy or a (key, value) pair denoted \((k_i,v_i)\), where both the key \(k\) and the value \(v\) are \(D\)-bit strings where \(D := O(1) \cdot w\).

Input assumption: The elements in the array are uniformly shuffled.

The algorithm:

(1)

Let \(\mu ~{\sf :=}~ \log ^9 \lambda\), \(\epsilon ~{\sf :=}~ \frac{1}{\log ^2 \lambda }\), \(\delta ~{\sf :=}~ e^{- \log \lambda \cdot \log \log \lambda }\), and \(B ~{\sf :=}~ ⌈ {n / \mu }⌉ \).

(2)

Sample PRF key. Sample a random PRF secret key \({\sf sk}\).

(3)

Directly hash into major bins. Throw each real \(a_i=(k_i,v_i)\) into one of \(B\) bins using \({\sf PRF}_\texttt{sk} (k_i)\). If \(a_i={\sf dummy}\), throw it into a uniformly random bin. Let \({\sf Bin}_1,\ldots ,{\sf Bin}_B\) be the resulting bins.

(4)

Sample independent smaller loads. Execute Algorithm 4.19 to obtain

\(({{L}} _1,\ldots ,{{L}} _B) \leftarrow {\sf SampleBinLoad}_{B,\delta }(n^{\prime })\), where \(n^{\prime } = n\cdot \left(1- \epsilon \right)\). If there exists \(i\in [B]\) such that \(| |{\sf Bin}_i| - \mu | \gt 0.5 \cdot \epsilon \mu\) or \(| {{L}} _i - \frac{n^{\prime }}{B}| \gt 0.5 \cdot \epsilon \mu\), then \({\sf abort}\).

(5)

Create major bins. Allocate new arrays \(({\sf Bin}_1^{\prime },\ldots ,{\sf Bin}_B^{\prime })\), each of size \(\mu\). For every \(i\), iterate in parallel on both \({\sf Bin}_i\) and \({\sf Bin}_i^{\prime }\), and copy the first \({{L}} _i\) elements of \({\sf Bin}_i\) to \({\sf Bin}_i^{\prime }\). Fill the empty slots in \({\sf Bin}_i^{\prime }\) with \({\sf dummy}\). (\({{L}} _i\) is not revealed during this process, since we continue to iterate over \({\sf Bin}_i\) after crossing the threshold \({{L}} _i\).) (This step succeeds, except with small probability, since each bin is large enough.)

(6)

Create overflow pile. Obliviously merge all of the last \(|{\sf Bin}_i| - {{L}} _i\) elements in each bin \({\sf Bin}_1,\ldots ,{\sf Bin}_B\) into an overflow pile:

For each \(i \in [B]\), replace the first \({{L}} _i\) positions with dummy.

Concatenate all of the resulting bins and perform oblivious tight compaction on the resulting array such that the real balls appear in the front. Truncate the outcome to be of length \(\epsilon n\).

(7)

Prepare an oblivious hash table for elements in the overflow pile by calling the Build algorithm of the \((1- O(\delta) - \delta _{{\sf PRF}})\)-oblivious Cuckoo hashing scheme (Theorem 4.14) parameterized by \(\delta\) (recall that \(\delta = e^{-\Omega (\log \lambda \cdot \log \log \lambda)}\)) and the stash size \(\log (1/\delta) / \log n\). Let \({\sf OF}=({\sf OF}_{\sf T},{\sf OF}_{\sf S})\) denote the outcome data structure. Henceforth, we use \({\sf OF}.{\sf Lookup}\) to denote a lookup operation to this oblivious Cuckoo hashing scheme.

(8)

Prepare data structure for efficient lookup. For \(i=1,\ldots ,B\), call \({{\sf naïveHT}}.{\sf Build}({\sf Bin}_i^{\prime })\) on each major bin to construct an oblivious hash table, and let \({\sf OBin}_i\) denote the outcome for the \(i\)th bin.

Output: The algorithm stores in the memory a state that consists of \(({\sf OBin}_1,\ldots ,{\sf OBin}_B, {\sf OF}, \texttt{sk})\).
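As a rough, non-oblivious Python sketch of Steps 3–6 (illustration only: `prf` is our stand-in for \({\sf PRF}_{\texttt{sk}}\), `loads` models the independently sampled loads \(L_i\), and the truncation plus filtering models tight compaction into the overflow pile):

```python
import hashlib
import random

def prf(sk, k, B):
    # Stand-in for PRF_sk(k): a keyed hash mapped into [B].
    h = hashlib.sha256(sk + str(k).encode()).digest()
    return int.from_bytes(h[:8], 'big') % B

def bight_build(items, sk, B, loads):
    # items: list of (key, value) pairs, or None for a dummy.
    # loads[i] is the independently sampled target load L_i of bin i:
    # the first L_i elements stay in the major bin, the remaining real
    # elements are moved to the overflow pile.
    bins = [[] for _ in range(B)]
    for a in items:
        i = prf(sk, a[0], B) if a is not None else random.randrange(B)
        bins[i].append(a)
    major = [b[:L] for b, L in zip(bins, loads)]
    overflow = [a for b, L in zip(bins, loads) for a in b[L:] if a is not None]
    return major, overflow
```

In the actual construction the truncation is done obliviously (replacing prefixes with dummies and compacting), and the overflow pile is then stored in an oblivious Cuckoo hash table.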

Procedure \({\sf BigHT}.{\sf Lookup}(k)\):

Input: The secret state \(({\sf OBin}_1,\ldots ,{\sf OBin}_B,{\sf OF},\texttt{sk})\), and a key \(k\) to look for (that may be \(\bot\), i.e., dummy).

The algorithm:

(1)

Call \(v \leftarrow {\sf OF}.{\sf Lookup}(k)\).

(2)

If \(k = \bot\), choose a random bin \(i {\overset{\$}{\leftarrow }} [B]\) and call \({\sf OBin}_i.{\sf Lookup}(\bot)\).

(3)

If \(k \ne \bot\) and \(v \ne \bot\) (i.e., \(v\) was found in \({\sf OF}\)), choose a random bin \(i {\overset{\$}{\leftarrow }} [B]\) and call \({\sf OBin}_i.{\sf Lookup}(\bot)\).

(4)

If \(k \ne \bot\) and \(v = \bot\) (i.e., \(v\) was not found in \({\sf OF}\)), let \(i: = {\sf PRF} _{\texttt{sk} }(k)\) and call \(v \leftarrow {\sf OBin}_i.{\sf Lookup}(k)\).

Output: The value \(v\).
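The case analysis of \({\sf BigHT}.{\sf Lookup}\) can be sketched non-obliviously as follows (our illustration: Python dictionaries stand in for the overflow pile and the per-bin hash tables, and `prf` is a hypothetical stand-in for \({\sf PRF}_{\texttt{sk}}\)). The key invariant is that every lookup touches the overflow pile and exactly one major bin, regardless of where (or whether) the key is found:

```python
import hashlib
import random

def prf(sk, k, B):
    # Stand-in for PRF_sk(k) mapping keys into [B].
    h = hashlib.sha256(sk + str(k).encode()).digest()
    return int.from_bytes(h[:8], 'big') % B

def bight_lookup(k, obins, of, sk):
    # `of` stands in for OF, obins[i] for OBin_i; k = None means dummy.
    B = len(obins)
    v = of.get(k)                       # step 1: OF.Lookup(k)
    if k is None or v is not None:      # steps 2-3: dummy access to a
        i = random.randrange(B)         # uniformly random major bin
        obins[i].get(None)
    else:                               # step 4: real lookup in the
        i = prf(sk, k, B)               # PRF-designated bin
        v = obins[i].get(k)
    return v
```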

Procedure \({\sf BigHT}.{\sf Extract}()\):

Input: The secret state \(({\sf OBin}_1,\ldots ,{\sf OBin}_B, {\sf OF}, \texttt{sk})\).

The algorithm:

(1)

Let \(T = {\sf OBin}_1.{\sf Extract}() \Vert {\sf OBin}_2.{\sf Extract}() \Vert \ldots \Vert {\sf OBin}_B.{\sf Extract}() \Vert {\sf OF}.{\sf Extract}()\).

(2)

Perform oblivious tight compaction on \(T\), moving all the real balls to the front. Truncate the resulting array at length \(n\). Let \({\bf X}\) be the outcome of this step.

(3)

Call \({\bf X}^{\prime } \leftarrow {\sf IntersperseRD}_n({\bf X})\) (Algorithm 6.6).

Output: \({\bf X}^{\prime }\).

We prove that our construction obliviously implements Functionality 4.7 for every sequence of instructions with non-recurrent lookups between two \({\sf Build}\) operations and as long as the input array to \({\sf Build}\) is randomly and secretly shuffled.

Theorem 7.2.

Assume a \(\delta _{{\sf PRF}}\)-secure PRF. Then, Construction 7.1 \((1- n^2 \cdot e^{-\Omega (\log \lambda \cdot \log \log \lambda)} - \delta _{{\sf PRF}})\)-obliviously implements Functionality 4.7 for all \(n \ge \log ^{11}\lambda\), assuming that the input array (of size \(n\)) for \({\sf Build}\) is randomly shuffled. Moreover,

\({\sf Build}\) and \({\sf Extract}\) each take \(O(n \cdot \mathsf {poly}\log \log \lambda + n \cdot \frac{\log n}{\log ^2 \lambda })\) time; and

\({\sf Lookup}\) takes \(O(\mathsf {poly}\log \log \lambda)\) time in addition to linearly scanning a stash of size \(O(\log \lambda)\).

In particular, if \(\log ^{11}\lambda \le n \le \mathsf {poly}(\lambda)\), then the hash table is \((1- e^{-\Omega (\log \lambda \cdot \log \log \lambda)} - \delta _{{\sf PRF}})\)-oblivious and consumes \(O(n\cdot \mathsf {poly}\log \log \lambda)\) time for the \({\sf Build}\) and \({\sf Extract}\) phases; and \({\sf Lookup}\) consumes \(O(\mathsf {poly}\log \log \lambda)\) time in addition to linearly scanning a stash of size \(O(\log \lambda)\).

Proof.

The proof of security is given in Appendix C.3. We give the efficiency analysis here. In Construction 7.1, there are \(n / \log ^9\lambda\) major bins and each is of size \(O(\log ^9\lambda)\). The subroutine \({\sf SampleBinLoad}_{B,\delta }(n^{\prime })\) runs in time \(O(B \cdot \log ^5 (1/\delta)) \le O(⌈ {n/\log ^9 \lambda } ⌉ \cdot \log ^6 \lambda) = O(n/\log ^3 \lambda)\) by Theorem 4.20 (recall that \(\delta = e^{-\log \lambda \cdot \log \log \lambda }\)) and since \(n \ge \log ^{11} \lambda\). We employed the hash table \({{\sf naïveHT}}\) (Theorem 4.9) for each major bin, and thus their initialization takes time \(\begin{align*} \frac{n}{\mu } \cdot O\left(\mu \cdot \mathsf {poly}\log \mu \right) \le O(n\cdot \mathsf {poly}\log \log \lambda). \end{align*}\)

The overflow pile consists of \(\epsilon n \ge \log ^{9} \lambda \ge \log ^{8}(1/\delta)\) elements as \(n \ge \log ^{11} \lambda\), and it is implemented via an oblivious Cuckoo hashing scheme (Theorem 4.14) so its initialization takes time \(O(\epsilon n \cdot \log (\epsilon n)) \le O(n \cdot \frac{\log n}{\log ^2 \lambda })\), where the stash size is \(O(\frac{\log (1/\delta)}{\log \epsilon n}) \le O(\log \lambda)\). Each \({\sf Lookup}\) incurs \(O(\mathsf {poly}\log \log \lambda)\) time from the major bins and \(O(\log \lambda)\) time from the linear scan of \({\sf OF}_{\sf S}\), the stash of the overflow pile (searching in \({\sf OF}_{\sf T}\) incurs \(O(1)\) time). The overhead of \({\sf Extract}\) depends on the overhead of \({\sf Extract}\) for each major bin and \({\sf Extract}\) from the overflow pile. The former is again bounded by \(O(n\cdot \mathsf {poly}\log \log \lambda)\) and the latter is bounded by \(O(\epsilon n\cdot \log (\epsilon n))\le O(n \cdot \frac{\log n}{\log ^2 \lambda })\).□

Finally, observe that it is not difficult to adjust the constants in our construction and analysis to show the following more general corollary:

Corollary 7.3.

Assume a \(\delta _{{\sf PRF}}\)-secure PRF. Then, for any constant \(c \ge 2\), there exists an algorithm that \((1- n^2 \cdot e^{-\Omega (\log \lambda \cdot \log \log \lambda)} - \delta _{{\sf PRF}})\)-obliviously implements Functionality 4.7 for all \(n \ge \log ^{9+c}\lambda\), assuming that the input array (of size \(n\)) for \({\sf Build}\) is randomly shuffled. Moreover,

\({\sf Build}\) and \({\sf Extract}\) each take \(O(n \cdot \mathsf {poly}\log \log \lambda + n \cdot \frac{\log n}{\log ^{c} \lambda })\) time; and

\({\sf Lookup}\) takes \(O(\mathsf {poly}\log \log \lambda)\) time in addition to linearly scanning a stash of size \(O(\log \lambda)\).

Proof.

We can let \(\epsilon = \frac{1}{\log ^c \lambda }\) in Construction 7.1. The analysis follows in a similar fashion.□

Remark 7.4.

As we mentioned, Construction 7.1 is only the first step towards the final oblivious hash table that we use in the final ORAM construction. We make significant optimizations in Section 8. We show how to improve upon the \({\sf Build}\) and \({\sf Extract}\) procedures from \(O(n\cdot \mathsf {poly}\log \log \lambda)\) to \(O(n)\) by replacing the \({{\sf naïveHT}}\) hash table with an optimized version (called \({\sf SmallHT}\)) that is more efficient for small lists. Additionally, while it may now seem that the \(O(\log \lambda)\)-stash overhead of \({\sf Lookup}\) is problematic, we will “merge” the stashes for different hash tables in our final ORAM construction and store them again in an oblivious hash table.

Skip 8SMALLHT: OBLIVIOUS HASHING FOR SMALL BINS Section

8 SMALLHT: OBLIVIOUS HASHING FOR SMALL BINS

In Section 7, we constructed an oblivious hashing scheme for randomly shuffled inputs, where \({\sf Build}\) and \({\sf Extract}\) consume \(n \cdot \mathsf {poly}\log \log \lambda\) time and \({\sf Lookup}\) consumes \(\mathsf {poly}\log \log \lambda\) time. The extra \(\mathsf {poly}\log \log \lambda\) factors arise from the oblivious hashing scheme (denoted \({{\sf naïveHT}}\)) that we use for each major bin of size \(\approx \log ^9\lambda\). To get rid of these extra factors, in this section we construct a new oblivious hashing scheme for randomly shuffled \(\mathsf {poly}\log \lambda\)-sized arrays. In our new construction, \({\sf Build}\) and \({\sf Extract}\) take linear time and \({\sf Lookup}\) takes constant time (ignoring the stash, which we treat separately later).

As mentioned in Section 2.1, the key idea is to rely on packed operations so that the metadata phase of \({\sf Build}\) (i.e., the cuckoo assignment problem) takes only linear time—this is possible because the problem size \(n = \mathsf {poly}\log \lambda\) is small. The trickier step is routing the actual balls into their destined locations in the hash-table. We cannot rely on standard oblivious sorting to perform this routing, since it would incur a logarithmic extra overhead. Instead, we devise a method to directly place the balls into their destined locations in the hash-table in the clear—this is safe as long as the input array has been padded with dummies to the output length and randomly shuffled; in this way, only a random permutation is revealed. A technicality arises in realizing this idea: after figuring out the assigned destinations for real elements, we need to expand this assignment to include dummy elements too, and the dummy elements must be assigned at random to the locations unoccupied by the reals. At a high level, this is accomplished through a combination of packed oblivious random permutation and packed oblivious sorting over metadata.

We first describe two helpful procedures (mentioned in Section 2.1.2) in Sections 8.1 and 8.2. Then, in Section 8.3, we give the full description of the \({\sf Build}\), \({\sf Lookup}\), and \({\sf Extract}\) procedures (Construction 8.5). Throughout this section, we assume for simplicity that \(n = \log ^9\lambda\) (while in reality \(n \in \log ^9\lambda \pm \log ^7\lambda\)).

8.1 Step 1 – Add Dummies and Shuffle

We are given a randomly shuffled array \({\bf I}\) of length \(n\) that contains real and dummy elements. In Algorithm 8.1, we pad the input array with dummies to match the size of the hash-table to be built. Each dummy receives a unique index label, and we rely on packed oblivious random permutation to permute the labeled dummies. Finally, we rely on Intersperse to mix the real balls with the dummies, so that all elements, reals and dummies alike, are randomly shuffled.

More formally, the output of Algorithm 8.1 is an array of size \({n_{\sf cuckoo}}= {\sf c_{\sf cuckoo}}\cdot n + \log \lambda\), where \({\sf c_{\sf cuckoo}}\) is the constant required for Cuckoo hashing, which contains all the real elements from \({\bf I}\) and the rest are dummies. Furthermore, each dummy receives a distinct random index from \(\lbrace 1,\ldots ,{n_{\sf cuckoo}}- n_R\rbrace\), where \(n_R\) is the number of real elements in \({\bf I}\). Assuming that the real elements in \({\bf I}\) are a-priori uniformly shuffled, then the output array is randomly shuffled.
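The labeling step can be pictured with the following non-oblivious Python sketch (our illustration: `random.shuffle` on the labels stands in for the packed oblivious random permutation on metadata, and the final shuffle stands in for Intersperse; `None` marks an unlabeled dummy in the input):

```python
import random

def add_labeled_dummies(I, n_cuckoo):
    # Pad the shuffled input I (reals plus unlabeled dummies, None here)
    # to length n_cuckoo. Each dummy gets a distinct index label from
    # {1, ..., n_cuckoo - n_R}, assigned in a random order.
    reals = [a for a in I if a is not None]
    labels = list(range(1, n_cuckoo - len(reals) + 1))
    random.shuffle(labels)              # stand-in for the packed ORP
    out = reals + [('dummy', t) for t in labels]
    random.shuffle(out)                 # stand-in for Intersperse
    return out
```

Assuming the real elements of \({\bf I}\) were a-priori uniformly shuffled, the output is a uniformly shuffled array of reals and distinctly labeled dummies, matching the description above.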

Claim 8.2.

Algorithm 8.1 fails with probability at most \(e^{-\Omega (\sqrt {n})}\) and completes in \(O(n + \frac{n}{w} \cdot \log ^3 n)\) time. Specifically, for \(n = \log ^9 \lambda\) and \(w \ge \log ^3 \log \lambda\), the algorithm completes in \(O(n)\) time and fails with probability \(e^{-\Omega (\log ^{9/2}\lambda)}\).

Proof.

All steps except the oblivious random permutation in Step 3 incur \(O(n)\) time and are perfectly correct by construction. Each element of \({\bf MD}\) can be expressed with \(O(\log n)\) bits, so the packed oblivious random permutation (Theorem 4.5) incurs \(O((n\cdot \log ^3 n) /w)\) time and has failure probability at most \(e^{-\Omega (\sqrt {n})}\).□

8.2 Step 2 – Evaluate Assignment with Metadata Only

We obliviously emulate the Cuckoo hashing procedure. Running it directly on the input array is too expensive (as it incurs oblivious sorting inside), so we run it on metadata only (which is short, since there are few elements), using the packed version of oblivious sort (Theorem 4.2). At the end of this step, every element in the input array should learn which bin (either in the main table or the stash) it is destined for. Recall that the Cuckoo hashing consists of a main table of \({\sf c_{\sf cuckoo}}\cdot n\) bins and a stash of \(\log \lambda\) bins.

Our input for this step is an array \({\bf MD}_{\bf X}\) of length \({n_{\sf cuckoo}}:= {\sf c_{\sf cuckoo}}\cdot n + \log \lambda\) which consists of pairs of bin choices \(({\sf choice}_1, {\sf choice}_2)\), where each choice is an element of \([{\sf c_{\sf cuckoo}}\cdot n] \cup \lbrace \bot \rbrace\). The real elements have choices in \([{\sf c_{\sf cuckoo}}\cdot n]\), while the dummies have \(\bot\). This array corresponds to the bin choices (computed using a PRF) of the elements of \({\bf X}\), which is the original array \({\bf I}\) after adding enough dummies and randomly shuffling.

To compute the bin assignments we start with obliviously assigning the bin choices of the real elements in \({\bf MD}_{\bf X}\). Next, we obliviously assign the remaining dummy elements to the remaining available locations. We do so by a sequence of oblivious sort algorithms. See Algorithm 8.3.
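The dummy-assignment step—the part that Algorithm 8.3 realizes with a sequence of packed oblivious sorts—amounts to the following non-oblivious Python sketch (our illustration; `assign` holds the cuckoo destinations already computed for the real elements, with `None` for dummies):

```python
import random

def assign_dummies(assign, n_slots):
    # assign[i] is the destination slot of element i (None for dummies).
    # Give each dummy a uniformly random unoccupied slot, so that reals
    # and dummies together occupy every slot exactly once.
    occupied = set(a for a in assign if a is not None)
    free = [s for s in range(n_slots) if s not in occupied]
    random.shuffle(free)
    it = iter(free)
    return [a if a is not None else next(it) for a in assign]
```

Here `n_slots` plays the role of \({n_{\sf cuckoo}}\), the total number of bins in the main table plus the stash; the oblivious realization achieves the same mapping without revealing which slots are occupied by reals.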

Claim 8.4.

For \(n \ge \log ^9 \lambda\), Algorithm 8.3 fails with probability at most \(e^{-\Omega (\log \lambda \cdot \log \log \lambda)}\) and completes in \(O(n \cdot (1 + \frac{\log ^3 n}{w}))\) time. Specifically, for \(n = \log ^9 \lambda\) and \(w \ge \log ^3\log \lambda\), Algorithm 8.3 completes in \(O(n)\) time.

Proof.

The input arrays are of size \({n_{\sf cuckoo}}= {\sf c_{\sf cuckoo}}\cdot n + \log \lambda\) and the arrays \({\bf MD}_{\bf X}\), \({\bf Assign}_{\bf X}\), \({\bf Occupied}\), \(\widetilde{{\bf Occupied}}\), \(\widetilde{{\bf Assign}}\) are all of length at most \({n_{\sf cuckoo}}\) and consist of elements that need \(O(\log {n_{\sf cuckoo}})\) bits to describe. Thus, the cost of packed oblivious sort (Theorem 4.2) is \(O(({n_{\sf cuckoo}}/w) \cdot \log ^3 {n_{\sf cuckoo}}) \le O((n\cdot \log ^3 n) / w)\). The linear scans take time \(O({n_{\sf cuckoo}}) = O(n)\). The cost of the \(\overline{{\sf cuckooAssign}}\) (see Corollary 4.12) from Step 1 has failure probability \(e^{-\Omega (\log \lambda \cdot \log \log \lambda)}\) and it takes time \(O(({n_{\sf cuckoo}}/w) \cdot \log ^3 {n_{\sf cuckoo}}) \le O((n\cdot \log ^3 n) / w)\).□

8.3 SmallHT Construction

The full description of the construction is given next. It invokes Algorithms 8.1 and 8.3.

CONSTRUCTION 8.5.

\({\sf SmallHT}\) – Hash table for Small Bins

Procedure \({\sf SmallHT}.{\sf Build}({\bf I})\):

Input: An input array \({\bf I}\) of length \(n\) consisting of real and dummy elements. Each real element is of the form \((k, v)\) where both the key \(k\) and the value \(v\) are \(D\)-bit strings where \(D := O(1) \cdot w\).

Input Assumption: The real elements among \({\bf I}\) are randomly shuffled.

The algorithm:

(1)

Run Algorithm 8.1 (prepare real and dummy elements) on input \({\bf I}\), and receive back an array \({\bf X}\).

(2)

Choose a PRF key \(\texttt{sk}\) where PRF maps \(\lbrace 0,1\rbrace ^{D} \rightarrow [{\sf c_{\sf cuckoo}}\cdot n]\).

(3)

Create a new metadata array \({\bf MD}_{\bf X}\) of length \({n_{\sf cuckoo}}\). Iterate over the array \({\bf X}\), and for each real element \({\bf X}[i]=(k_i,v_i)\) compute two values \(({\sf choice}_{i,1},{\sf choice}_{i,2}) \leftarrow {\sf PRF} _{\texttt{sk} }(k_i)\), and write \(({\sf choice}_{i,1},{\sf choice}_{i,2})\) in the \(i\)th location of \({\bf MD}_{\bf X}\). If \({\bf X}[i]\) is dummy, write \((\bot ,\bot)\) in the \(i\)th location of \({\bf MD}_{\bf X}\).

(4)

Run Algorithm 8.3 on \({\bf MD}_{\bf X}\) to compute the assignment for every element in \({\bf X}\). The output of this algorithm, denoted \({\bf Assign}_{\bf X}\), is an array of length \({n_{\sf cuckoo}}\), where the \(i\)th position holds the destination location of element \({\bf X}[i]\).

(5)

Route the elements of \({\bf X}\), in the clear, according to \({\bf Assign}_{\bf X}\), into an array \({\bf Y}\) of size \({\sf c_{\sf cuckoo}}\cdot n\) and into a stash \({\sf S}\).

Output: The algorithm stores in memory a secret state consisting of the array \({\bf Y}\), the stash \({\sf S}\), and the secret key \(\texttt{sk}\).

Procedure \({\sf SmallHT}.{\sf Lookup}(k)\):

Input: A key \(k\) that might be dummy \(\bot\). It receives a secret state that consists of an array \({\bf Y}\), a stash \({\sf S}\), and a key \(\texttt{sk}\).

The algorithm:

(1)

If \(k \ne \bot\):

(a)

Evaluate \(({\sf choice}_1,{\sf choice}_2) \leftarrow {\sf PRF} _{\texttt{sk} }(k)\).

(b)

Visit \({\bf Y}_{{\sf choice}_1}, {\bf Y}_{{\sf choice}_2}\) and the stash \({\sf S}\) to look for the key \(k\). If found, remove the element by overwriting \(\bot\). Let \(v^*\) be the corresponding value (if not found, set \(v^* ~{\sf :=}~ \bot\)).

(2)

Otherwise:

(a)

Choose \(({\sf choice}_1,{\sf choice}_2)\) independently and uniformly at random from \([{\sf c_{\sf cuckoo}}\cdot n]\).

(b)

Visit \({\bf Y}_{{\sf choice}_1}, {\bf Y}_{{\sf choice}_2}\) and the stash \({\sf S}\) and look for the key \(k\). Set \(v^* ~{\sf :=}~ \bot\).

Output: Return \(v^*\).

Procedure \({\sf SmallHT}.{\sf Extract}()\):

Input: The algorithm has no input; it receives the secret state that consists of an array \({\bf Y}\), a stash \({\sf S}\), and a key \(\texttt{sk}\).

The algorithm:

(1)

Perform oblivious tight compaction (Theorem 5.1) on \({\bf Y}\Vert {\sf S}\), moving all the real elements to the front. Truncate the resulting array at length \(n\). Let \({\bf X}\) be the outcome of this step.

(2)

Call \({\bf X}^{\prime } \leftarrow {\sf IntersperseRD}_n({\bf X})\) (Algorithm 6.6).

Output: The array \({\bf X}^{\prime }\).

We prove that our construction obliviously implements Functionality 4.7 for every sequence of instructions with non-recurrent lookups between two \({\sf Build}\) operations, assuming that the input array for \({\sf Build}\) is randomly shuffled.

Theorem 8.6.

Assume a \(\delta _{{\sf PRF}}\)-secure PRF. Suppose that \(n = \log ^9\lambda\) and \(w \ge \log ^3 \log \lambda\). Then, Construction 8.5 \((1- n \cdot e^{-\Omega (\log \lambda \cdot \log \log \lambda)} - \delta _{{\sf PRF}})\)-obliviously implements Functionality 4.7, assuming that the input for \({\sf Build}\) (of size \(n\)) is randomly shuffled. Moreover, \({\sf Build}\) and \({\sf Extract}\) incur \(O(n)\) time, and \({\sf Lookup}\) takes constant time in addition to linearly scanning a stash of size \(O(\log \lambda)\).

Proof.

The proof of security is given in Appendix C.4. We proceed with the efficiency analysis. The \({\sf Build}\) operation executes Algorithm 8.1, which consumes \(O(n)\) time (by Claim 8.2), performs \(O(n)\) additional work, executes Algorithm 8.3, which consumes \(O(n)\) time (by Claim 8.4), and finally performs \(O(n)\) additional work. Thus, the total time is \(O(n)\). \({\sf Lookup}\), by construction, incurs \(O(1)\) time in addition to linearly scanning the stash \({\sf S}\), which is of size \(O(\log \lambda)\). The time of \({\sf Extract}\) is \(O(n)\) by construction.□

8.4 CombHT: Combining BigHT with SmallHT

We use \({\sf SmallHT}\) in place of \({{\sf naïveHT}}\) for each of the major bins in the \({\sf BigHT}\) construction from Section 7. Since the load of each major bin in the \({\sf BigHT}\) construction is indeed \(n=\log ^{9}\lambda\), this modification is valid. Note that we still assume that the number of elements in the input to \({\sf CombHT}\) is at least \(\log ^{11} \lambda\) (as in Theorem 7.2).

However, we make one additional modification that will be useful for us later in the construction of the ORAM scheme (Section 9). Recall that each instance of \({\sf SmallHT}\) has a stash \({\sf S}\) of size \(O(\log \lambda)\) and so \({\sf Lookup}\) will require, not only searching an element in the (super-constant size) stash \({\sf OF}_{\sf S}\) of the overflow pile from \({\sf BigHT}\), but also linearly scanning the super-constant size stash of the corresponding major bin. To this end, we merge the different stashes of the major bins and store the merged list in an oblivious Cuckoo hash (Section 4.5). (A similar idea has also been applied in several prior works [15, 33, 35, 40].) This results in a new hash table scheme we call \({\sf CombHT}\).

CONSTRUCTION 8.7: \({\sf CombHT}\): combining \({\sf BigHT}\) with \({\sf SmallHT}\)

Procedure \({\sf CombHT}.{\sf Build}({\bf I})\): Run Steps 1–7 of Procedure \({\sf BigHT}.{\sf Build}\) in Construction 7.1, where in Step 7 we let \({\sf OF}=({\sf OF}_{\sf T},{\sf OF}_{\sf S})\) denote the outcome structure of the overflow pile. Then, perform:

(8)

Prepare data structure for efficient lookup. For \(i=1,\ldots ,B\), call \({\sf SmallHT}.{\sf Build}({\sf Bin}_i)\) on each major bin to construct an oblivious hash table, and let \(\lbrace ({\sf OBin}_i,{\sf S}_i)\rbrace _{i \in [B]}\) denote the outcome bins and the stash.

(9)

Concatenate the stashes \({\sf S}_1,\ldots ,{\sf S}_B\) (each of size \(O(\log \lambda)\)) from all small hash tables together. Pad the concatenated stash (of size \(O(n / \log ^7 \lambda)\)) to the size \(O(n / \log ^2 \lambda)\). Call the \({\sf Build}\) algorithm of an oblivious Cuckoo hashing scheme on the combined set (Section 4.5), and let \({\sf Comb}{\sf S}=({\sf Comb}{\sf S}_{\sf T},{\sf Comb}{\sf S}_{\sf S})\) denote the output data structure, where \({\sf Comb}{\sf S}_{\sf T}\) is the main table and \({\sf Comb}{\sf S}_{\sf S}\) is the stash.

Output: Output \(({\sf OBin}_1,\ldots ,{\sf OBin}_B,{\sf OF},{\sf Comb}{\sf S},\texttt{sk})\).

Procedure \({\sf Lookup}(k_i)\): The procedure is the same as in Construction 7.1, except that whenever visiting some bin \({\sf OBin}_j\) for searching for a key \(k_i\), instead of visiting the stash of \({\sf OBin}_j\) to look for \(k_i\), we visit \({\sf Comb}{\sf S}\).

Procedure \({\sf Extract}()\). The procedure \({\sf Extract}\) is the same as in Construction 7.1, except that \(T = {\sf OBin}_1.{\sf Extract}() \Vert \ldots \Vert {\sf OBin}_B.{\sf Extract}() \Vert {\sf OF}.{\sf Extract}() \Vert {\sf Comb}{\sf S}.{\sf Extract}()\).

Theorem 8.8.

Assume a \(\delta _{{\sf PRF}}\)-secure PRF. Suppose that the input \({\bf I}\) of the \({\sf Build}\) algorithm has length \(n\ge \log ^{11}\lambda\). Then, Construction 8.7 \((1- n^2\cdot e^{-\Omega (\log \lambda \cdot \log \log \lambda)} - \delta _{{\sf PRF}})\)-obliviously implements Functionality 4.7, assuming that the input for \({\sf CombHT}.{\sf Build}\) is randomly shuffled. Moreover,

\({\sf Build}\) and \({\sf Extract}\) each take \(O(n + n \cdot \frac{\log n}{\log ^2 \lambda })\) time; and

\({\sf Lookup}\) takes \(O(1)\) time in addition to linearly scanning a stash of size \(O(\log \lambda)\).

In particular, if \(\log ^{11}\lambda \le n \le \mathsf {poly}(\lambda)\), the hash table is \((1- e^{-\Omega (\log \lambda \cdot \log \log \lambda)} - \delta _{{\sf PRF}})\)-oblivious and consumes \(O(n)\) time for the \({\sf Build}\) and \({\sf Extract}\) phases; and \({\sf Lookup}\) consumes \(O(1)\) time in addition to linearly scanning a stash of size \(O(\log \lambda)\).

Proof.

The proof of security is given in Appendix C.5. We proceed with the efficiency analysis. Since each stash \({\sf S}_i\) is of size \(O(\log \lambda)\) and there are \(n/\log ^9 \lambda\) major bins, the merged and padded stash \({\sf Comb}{\sf S}\) has size \(O(n/\log ^2\lambda)\). The size of the overflow pile \({\sf OF}\) is \(O(n/\log ^2 \lambda)\). Thus, we can store each of them using oblivious Cuckoo hashing, which requires \(O(n/\log ^2 \lambda)\) space for the main tables (resulting in \({\sf OF}_{\sf T}\) and \({\sf Comb}{\sf S}_{\sf T}\)) plus an additional stash of size \(O(\log \lambda)\) (resulting in \({\sf OF}_{\sf S}\) and \({\sf Comb}{\sf S}_{\sf S}\)).

Thus, by Theorems 8.6 and 7.2, \({\sf CombHT}.{\sf Build}({\bf I})\) and \({\sf CombHT}.{\sf Extract}\) run in \(O(n + n \cdot \frac{\log n}{\log ^2 \lambda })\) time. As for \({\sf CombHT}.{\sf Lookup}\), it performs a linear scan of the two stashes (\({\sf OF}_{\sf S}\) and \({\sf Comb}{\sf S}_{\sf S}\)), each of size \(O(\log \lambda)\), plus constant-time searches of the main Cuckoo hash tables (\({\sf OF}_{\sf T}\) and \({\sf Comb}{\sf S}_{\sf T}\)).□

If we use the earlier Corollary 7.3 to instantiate the \({\sf BigHT}\), we can generalize the above theorem to the following corollary:

Corollary 8.9.

Assume a \(\delta _{{\sf PRF}}\)-secure PRF and \(c \ge 2\). Then, there exists an algorithm that \((1- n^2\cdot e^{-\Omega (\log \lambda \log \log \lambda)} - \delta _{{\sf PRF}})\)-obliviously implements Functionality 4.7 assuming that the input array is of length at least \(n\ge \log ^{9+c}\lambda\) and moreover the input is randomly shuffled. Furthermore, the algorithm achieves the following performance:

\({\sf Build}\) and \({\sf Extract}\) each take \(O(n + n \cdot \frac{\log n}{\log ^{c} \lambda })\) time; and

\({\sf Lookup}\) takes \(O(1)\) time in addition to linearly scanning a stash of size \(O(\log \lambda)\).

Remark 8.10.

In our ORAM construction, we will have \(O(\log N)\) levels, where each (non-empty) level has a merged stash and also a stash from the overflow pile \({\sf OF}_{\sf S}\), both of size \(O(\log \lambda)\). We will employ the "merged stash" trick once again, merging the stashes of every level in the ORAM into a single one, resulting in a total stash size of \(O(\log N \cdot \log \lambda)\). We will store this merged stash in an oblivious dictionary (see Section 4.6), and accessing it will cost \(O(\log ^4 (\log N + \log \lambda))\) total time.

Skip 9OBLIVIOUS RAM Section

9 OBLIVIOUS RAM

In this section, we utilize \({\sf CombHT}\) in the hierarchical framework of Goldreich and Ostrovsky [31] to construct our ORAM scheme. We denote by \(\lambda\) the security parameter. For simplicity, we assume that \(N\), the size of the logical memory, is a power of 2. Additionally, we assume that the word size \(w\) is \(\Theta (\log N)\).

ORAM Initialization. Our structure consists of one dictionary \(D\) (see Section 4.6) and \(O(\log N)\) levels numbered \(\ell +1,\ldots ,L\), where \(\ell = \lceil 11\log \log \lambda \rceil\) and \(L=\lceil \log N \rceil\) is the maximal level.

The dictionary \(D\) is an oblivious dictionary storing \(2^{\ell +1}+\log N \log {\lambda }\) elements. Every element in \(D\) is of the form \(({\sf levelIndex}, {\sf whichStash}, {\mathsf {data}})\), where \({\sf levelIndex} \in \lbrace \ell ,\ldots ,L\rbrace\), \({\sf whichStash} \in \lbrace {\sf overflow},{\sf stashes},\bot \rbrace\) and \({\mathsf {data}} \in \lbrace 0,1\rbrace ^{w}\).

Each level \(i\in \lbrace \ell +1,\ldots ,L\rbrace\) consists of an instance, called \(T_i\), of the oblivious hash table \({\sf CombHT}\) from Section 8.4 that has capacity \(2^i\).

Additionally, each level is associated with an additional bit \({\sf full} _i\), where 1 stands for full and 0 stands for available. Available means that this level is currently empty and does not contain any blocks, and thus one can rebuild into this level. Full means that this level currently contains blocks, and therefore an attempt to rebuild into this level will effectively cause a cascading merge. In addition, there is a global counter \({\sf ctr}\) that is initialized to 0.
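The \({\sf full}_i\) bits and the counter \({\sf ctr}\) induce a binary-counter rebuild schedule: each rebuild epoch targets the smallest available level, emptying all smaller levels into it, so each level is rebuilt roughly half as often as the one below it. The following Python sketch (our own illustration, abstracting away the hash-table contents) captures just this schedule:

```python
def target_level(full):
    # full[j] = 1 iff level j currently holds blocks. The rebuild target
    # is the smallest available level; if every level is full, we cascade
    # into the largest level L.
    for j, f in enumerate(full):
        if f == 0:
            return j
    return len(full) - 1

def simulate(num_levels, epochs):
    # Count how often each level is rebuilt over `epochs` rebuild epochs
    # (one epoch per 2^ell accesses in the construction).
    full = [0] * num_levels
    counts = [0] * num_levels
    for _ in range(epochs):
        j = target_level(full)
        counts[j] += 1
        full[:j] = [0] * j      # levels below j are emptied into level j
        full[j] = 1
    return counts
```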

CONSTRUCTION 9.1.

Oblivious RAM \({\sf Access}(\mathsf {op},{\mathsf {addr}},{\mathsf {data}})\).

Input: \(\mathsf {op} \in \lbrace {\mathsf {read}},{\mathsf {write}} \rbrace\), \({\mathsf {addr}} \in [N]\) and \({\mathsf {data}} \in \lbrace 0,1\rbrace ^{w}\).

Secret state: The dictionary \(D\), levels \(T_{\ell +1},\ldots ,T_L\), the bits \({\sf full} _{\ell +1},\ldots , {\sf full} _{L}\) and counter \({\sf ctr}\).

The algorithm:

(1)

Initialize \(\mathsf {found} ~{\sf :=}~ \mathsf {false}\), \({\mathsf {data}} ^* ~{\sf :=}~ \bot\), \({\sf levelIndex} ~{\sf :=}~ \bot\) and \({\sf whichStash} ~{\sf :=}~ \bot\).

(2)

Perform \(\mathsf {fetched} ~{\sf :=}~ D.{\sf Lookup}({\mathsf {addr}})\). If \(\mathsf {fetched} \ne \bot\):

(a)

Interpret \(\mathsf {fetched}\) as \(({\sf levelIndex},{\sf whichStash},\mathsf {data^*})\).

(b)

If \({\sf levelIndex} = \ell\), then set \(\mathsf {found} ~{\sf :=}~ \mathsf {true}\).

(3)

For each \(i \in \lbrace \ell +1,\ldots ,L\rbrace\) in increasing order, do:

(a)

If \(\mathsf {found} = \mathsf {false}\):

(i)

Run \(\mathsf {fetched} ~{\sf :=}~ T_i.{\sf Lookup}({\mathsf {addr}})\) with the following modifications:

Instead of visiting the stash of \({\sf OF}\), namely \({\sf OF}_{\sf S}\), in Construction 8.7, check whether \({\sf levelIndex}=i\) and \({\sf whichStash} = {\sf overflow}\). If so, treat the value \({\sf data^*}\) as if it is fetched from \({\sf OF}_{\sf S}\).

Instead of visiting the stash of \({\sf Comb}{\sf S}\), namely \({\sf Comb}{\sf S}_{\sf S}\), check whether \({\sf levelIndex} = i\) and \({\sf whichStash} = {\sf stashes}\), and if so, treat the value \({\sf data^*}\) as if it is fetched from \({\sf Comb}{\sf S}_{\sf S}\).

(ii)

If \(\mathsf {fetched} \ne \bot\), let \(\mathsf {found} ~{\sf :=}~ \mathsf {true}\) and \(\mathsf {data^*} ~{\sf :=}~ \mathsf {fetched}\).

(b)

Else, \(T_i.{\sf Lookup}(\bot)\).

(4)

If \(\mathsf {found} = \mathsf {false}\), i.e., this is the first time \({\mathsf {addr}}\) is being accessed, set \(\mathsf {data^*} = 0\).

(5)

Let \((k,v) ~{\sf :=}~ ({\mathsf {addr}}, {\mathsf {data}} ^*)\) if this is a \({\mathsf {read}}\) operation; else let \((k,v) ~{\sf :=}~ ({\mathsf {addr}}, {\mathsf {data}})\). Insert \((k,(\ell ,\bot ,v))\) into oblivious dictionary \(D\) using \(D.{\sf Insert}(k, (\ell ,\bot , v))\).

(6)

Increment \({\sf ctr}\) by 1. If \({\sf ctr} \equiv 0 \mod {2}^\ell\), perform the following.

(a)

Let \(j\) be the smallest level index such that \({\sf full} _j = 0\) (i.e., available). If all levels are marked full, then \(j ~{\sf :=}~ L\). In other words, \(j\) is the target level to be rebuilt.

(b)

Let \(\widetilde{D} := D.{\sf Extract}()\). Let \(D_1\) be a copy of \(\widetilde{D}\) preserving only elements whose \({\sf levelIndex}\) is at most \(j-1\), and all other elements are marked dummy. Let \(D_2\) be a copy of \(\widetilde{D}\) preserving only elements whose \({\sf levelIndex}\) is greater than \(j-1\), and all other elements are marked as dummy.

(c)

Let \({\bf U} ~{\sf :=}~ D_1 \Vert T_{\ell +1}.{\sf Extract}() \Vert \ldots \Vert T_{j-1}.{\sf Extract}()\) and set \(j^* ~{\sf :=}~ j-1\). If all levels are marked full, then additionally let \({\bf U} ~{\sf :=}~ {\bf U}\Vert T_L.{\sf Extract}()\) and set \(j^* ~{\sf :=}~ L\). (Here, we also modify Construction 8.7 so that \({\sf CombHT}.{\sf Extract}()\) does not extract the elements in the stashes; those elements are already kept in \(D\).)

(d)

Run \({\sf Intersperse}^{(j^*-\ell)}_{2^{\ell +1},2^{\ell +1},2^{\ell +2},\ldots ,2^{j^*}}({\bf U})\) (Algorithm 6.4) to randomly shuffle \({\bf U}\). Denote the output by \(\tilde{\bf U}\). If \(j = L\), then additionally do the following to shrink \(\tilde{\bf U}\) to size \(N=2^L\):

(i)

Run the tight compaction on \(\tilde{\bf U}\) moving all real elements to the front. Truncate \(\tilde{\bf U}\) to length \(N\).

(ii)

Run \(\tilde{\bf U} \leftarrow {\sf IntersperseRD}_{N}(\tilde{\bf U})\) (Algorithm 6.6).

(e)

Rebuild the \(j\)th hash table with the \(2^j\) elements from \(\tilde{\bf U}\) via \(T_j ~{\sf :=}~ {\sf CombHT}.{\sf Build}(\tilde{\bf U})\) (Construction 8.7) and let \({\sf OF}_{\sf S}, {\sf Comb}{\sf S}_{\sf S}\) be the associated stashes of that level (of size \(O(\log \lambda)\) each). Mark \({\sf full} _j ~{\sf :=}~ 1\).

(i)

Initialize a new oblivious dictionary \(D\) which will hold at most \(2^{\ell +1} + \log N \log {\lambda }\) elements.

(ii)

For each element \((k,v)\) in the stash \({\sf OF}_{\sf S}\), run \(D.{{\sf Insert}}(k, (j,{\sf overflow}, v))\).

(iii)

For each element \((k,v)\) in the stash \({\sf Comb}{\sf S}_{\sf S}\), run \(D.{{\sf Insert}}(k, (j,{\sf stashes}, v))\).

(iv)

For each tuple \(e \in D_2\), run \(D.{{\sf Insert}}(e)\).

(f)

For \(i\in \lbrace \ell +1,\ldots ,j-1\rbrace\), reset \(T_i\) to be an empty structure and set \({\sf full} _i ~{\sf :=}~ 0\).

Output: Return \({\mathsf {data}} ^*\).
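The level-selection rule in Step 6 behaves like a binary counter over the \({\sf full}\) bits, so level \(j\) ends up being rebuilt once every \(2^j\) accesses. The following sketch (function name and small parameters are ours for illustration; in the construction \(\ell = \lceil 11\log \log \lambda \rceil\) and \(L=\lceil \log N\rceil\)) simulates only the level-selection logic of Steps 6(a), 6(e), and 6(f):

```python
def rebuild_targets(num_events, ell, L):
    """Simulate the target-level choice of Step 6.

    One rebuild event fires every 2^ell accesses.  Returns the target
    level j chosen at each event.
    """
    full = {i: 0 for i in range(ell + 1, L + 1)}
    targets = []
    for _ in range(num_events):
        free = [i for i in range(ell + 1, L + 1) if full[i] == 0]
        j = free[0] if free else L       # Step 6(a): smallest non-full level, else L
        full[j] = 1                      # Step 6(e): level j is now full
        for i in range(ell + 1, j):      # Step 6(f): smaller levels are emptied
            full[i] = 0
        targets.append(j)
    return targets
```

With \(\ell =1\) and \(L=4\), the chosen targets cycle through \(2,3,2,4,2,3,2,4,\ldots\), so level \(j\) is rebuilt every \(2^{j-\ell }\) events, i.e., every \(2^j\) accesses.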

We prove that our construction obliviously implements the ORAM functionality (Functionality 3.4) and analyze its amortized overhead.

Theorem 9.2.

Let \(N \in \mathbb {N}\) be the capacity of ORAM and \(\lambda \in \mathbb {N}\) be a security parameter. Assume a \(\delta _{{\sf PRF}}\)-secure PRF. For any number of queries \(T = T(N,\lambda) \ge N\), Construction 9.1 \((1-T\cdot N^2 \cdot e^{-\Omega (\log \lambda \cdot \log \log \lambda)} - \delta _{{\sf PRF}})\)-obliviously implements the ORAM functionality. Moreover, the construction has \(O(\log N \cdot (1 + \frac{\log N}{\log ^2 \lambda }) + \log ^9 \log \lambda)\) amortized time overhead.

Proof.

The proof of security is given in Appendix C.6 and we give the efficiency analysis next.

We consider two cases. If \(w\lt \log ^3 \log \lambda\), which implies that \(N \lt 2^{O(\log ^3 \log \lambda)}\), we can use a perfect ORAM on all \(N\) elements, which yields an ORAM scheme with \(O(\log ^9 \log \lambda)\) overhead (see Theorem 4.8). Henceforth, it therefore suffices to consider the case \(w\ge \log ^3 \log \lambda\).

In each of the \(T\) \({\sf Access}\) operations, Steps 1–5 perform a single \({\sf Lookup}\) and \({\sf Insert}\) operation on the oblivious dictionary, and one \({\sf Lookup}\) on each \(T_\ell , \ldots , T_L\). These operations require \(O(\log ^4 \log \lambda) + O(\log N)\) time. In Step 6, for every \(2^\ell\) requests of \({\sf Access}\), one \({\sf Extract}\) and at most \(O(2^\ell)\) \({\sf Insert}\) operations are performed on the oblivious dictionary \(D\), and at most one \({\sf CombHT}.{\sf Build}\) on \(T_{\ell +1}\). These require \(O(2^\ell \cdot \log ^3 (2^{\ell +1}) + 2^\ell \cdot \log ^4 (2^{\ell +1}) + 2^{\ell +1}) = O(2^\ell \cdot \log ^4\log \lambda)\) time. In addition, for each \(j \in \lbrace \ell +1,\ldots ,L\rbrace\), for every \(2^j\) requests of \({\sf Access}\), at most one \({\sf Extract}\) is performed on \(T_j\), one \({\sf Build}\) on \(T_{j+1}\), one \({\sf Intersperse}^{(j-\ell)}_{2^{\ell +1}, 2^{\ell +1}, 2^{\ell +2}, \ldots , 2^j}\), one \({\sf IntersperseRD}_N\), and one tight compaction, all of which require linear time and thus the total time is \(O(2^j + 2^j \cdot \frac{j}{\log ^2 \lambda })\) by \({\sf Build}\) and \({\sf Extract}\) of \({\sf CombHT}\) (Theorem 8.8). Hence, over \(T\) requests, the amortized time is \(\begin{equation*} \frac{1}{T} \left[\frac{T}{2^\ell } \cdot O\left(2^\ell \cdot \log ^4\log \lambda \right) + \sum _{j=\ell +1}^{L} \frac{T}{2^j} \cdot O\left(2^j \cdot \left(1+\frac{j}{\log ^2 \lambda }\right)\right) \right] = O\left(\log ^4\log \lambda + \log N \cdot \left(1 + \frac{\log N}{\log ^2 \lambda }\right) \right). \end{equation*}\)
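The geometric charging in this analysis can be checked numerically: attributing \(\Theta (2^j)\) work to each rebuild of level \(j\), which occurs once every \(2^j\) accesses, yields an amortized cost proportional to the number of levels, i.e., \(\Theta (\log N)\). A sketch with all constants set to 1 and the \(\mathsf {poly}(\log \log \lambda)\) dictionary terms ignored (function name is ours):

```python
def amortized_overhead(T, ell, L):
    # Each level j in {ell, ..., L} is rebuilt every 2^j accesses at
    # cost 2^j (constants suppressed).  Returns total work divided by T.
    total = sum((T // 2 ** j) * 2 ** j for j in range(ell, L + 1))
    return total / T
```

When \(T\) is divisible by \(2^L\), the amortized cost is exactly \(L-\ell +1\), matching the \(O(\log N)\) term of the theorem.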

As a corollary, we can now state a more general version:

Corollary 9.3 (Precise Statement of Theorem 1.1).

Let \(N \in \mathbb {N}\) be the capacity of ORAM and \(\lambda \in \mathbb {N}\) be a security parameter. Assume a \(\delta _{{\sf PRF}}\)-secure PRF. For any number of queries \(T = T(N,\lambda) \ge N\) and any constant \(c\gt 1\), there is a \((1-T\cdot N^2 \cdot \delta - \delta _{{\sf PRF}})\)-oblivious construction of an ORAM. Moreover, the construction has \(O(\log N \cdot (1 + \frac{\log N}{\log ^c (1/\delta)}) + \mathsf {poly}(\log \log (1/\delta)))\) amortized time overhead.

Proof.

There are two differences to the statement of Theorem 9.2. First, we replaced the term \(e^{-\Omega (\log \lambda \cdot \log \log \lambda)}\) with \(\delta\). This means that \(\Omega ((1/\delta)^{1/\log \log (1/\delta)})\le \lambda \le O(\log (1/\delta))\), and so we can replace the \(O(\log N/\log ^2\lambda)\) term from Theorem 9.2 with \(O(\log N\cdot \log ^2\log (1/\delta)/\log ^2 (1/\delta))\) and the \(\mathsf {poly}(\log \log \lambda)\) term with \(\mathsf {poly}(\log \log (1/\delta))\). Second, we generalize the exponent of the \(\log ^2 \lambda\) term to any constant \(c\gt 1\) (and this absorbs the \(\log ^c\log (1/\delta)\) factor). This works by relying on Corollary 8.9 to instantiate our ORAM’s \({\sf CombHT}\) (which will make the ORAM’s smallest level larger but still upper bounded by \(\mathsf {poly}\log (\lambda)\)).□

Remark 9.4

(Using More CPU Registers).

Our construction can be slightly modified to obtain optimal amortized time overhead (up to constant factors) for any number of CPU registers, as given by the lower bound of Larsen and Nielsen [41]. Specifically, if the number of CPU registers is \(m\), then we can achieve a scheme with \(O(\log (N/m))\) amortized time overhead.

If \(m\le N^{1-\epsilon }\) for some constant \(\epsilon \gt 0\), then the lower bound still says that \(\Omega (\log N)\) amortized time overhead is required, so we can use Construction 9.1 without any change (and only utilize a constant number of CPU registers). For larger values of \(m\) (e.g., \(m=O(N/\log N)\)), we slightly modify Construction 9.1 as follows. Instead of storing levels \(\ell =\lceil 11\log \log \lambda \rceil\) through \(L=\lceil \log N\rceil\) in the memory, we utilize the extra space in the CPU to store levels \(\ell\) through \(\ell _m\triangleq \lfloor \log m\rfloor\) while the rest of the levels (i.e., \(\ell _m+1\) through \(L\)) are stored in the memory, as in the above construction. The number of levels that we store in the memory is \(O\left(\log N - \log m\right)=O\left(\log (N/m)\right)\), which is the dominant factor in the overhead analysis (as the amortized time overhead per level is \(O(1)\)).

Appendices

A EXTENDED PARAMETER ANALYSIS

In the introduction, we have presented the overhead of our construction as a function of \(N\) and assuming that \(\lambda \le N \le T \le \mathsf {poly}(\lambda)\) for any fixed polynomial \(\mathsf {poly}(\cdot)\), where \(N\) is the size of the memory, and \(T\) is a bound on the number of accesses. In order to have a more meaningful comparison with previous works, we restate our result as well as some prior results using two parameters, \(N\) and \(\delta\), as above, and no longer assume that \(\lambda \le N \le T \le \mathsf {poly}(\lambda)\). Our construction achieves:

Theorem A.1.

Assume the existence of a PRF family. Let \(N\) denote the maximum number of blocks stored by the ORAM.

For any \(\delta \gt 0\), and any constant \(c \ge 1\), there exists an ORAM scheme with \(\begin{align*} O\left(\log N \cdot \left(1+ \frac{\log N}{\log ^c (1/\delta)}\right)+ \mathsf {poly}\log \log \left(\frac{1}{\delta }\right)\right) \end{align*}\) amortized overhead, and for every \(\texttt{PPT}\) adversary \({\mathcal {A}}\) adaptively issuing at most \(T\) requests, the ORAM’s failure probability is at most \(T^3 \cdot \delta + \delta _{\rm prf}^{{\mathcal {A}}}\) where \(\delta _{\rm prf}^{{\mathcal {A}}}\) denotes the probability that \({\mathcal {A}}\) breaks PRF security.

Previous works. We state other results with the same parameters \(N, \delta\), to the best of our understanding. In all results stated below we assume \(O(1)\) CPU private registers and use \(\delta\) to denote the ORAM’s statistical failure probability; however, keep in mind that all computationally secure results below have an additional additive failure probability related to the PRF’s security.

Goldreich and Ostrovsky [29, 31] showed a computationally secure ORAM construction with \(O(\log ^2 N \cdot \log (1/\delta))\) overhead. Their work proposed the elegant hierarchical paradigm for constructing ORAMs and subsequent works have improved the hierarchical paradigm leading to asymptotically better results. The work of [12] (see also [33, 40]) showed how to improve Goldreich and Ostrovsky’s construction to \(O(\log ^2 N/\log \log N + \log (1/\delta) \cdot \log \log (1/\delta))\) overhead. The work of Patel et al. [53] achieved \(O(\log N \cdot (\log \log (1/\delta) +\frac{\log N}{\log (1/\delta)}) +\log (1/\delta) \cdot \log \log (1/\delta))\) overhead (to the best of our understanding).

Besides the hierarchical paradigm, Shi et al. [58] propose the tree-based paradigm for constructing ORAMs. Subsequent works [15, 16, 61, 63] have improved tree-based constructions, culminating in the works of Circuit ORAM [63] and Circuit OPRAM [15]. Circuit ORAM [63] achieves \(O(\log N \cdot (\log N + \log (1/\delta)))\) overhead and statistical security. Subsequently, the Circuit OPRAM work [15] showed (as a by-product of their main result on OPRAM) that we can construct a statistically secure ORAM with \(O(\log ^2 N + \log (1/\delta))\) overhead (by merging the stashes of all recursion depths in Circuit ORAM) and a computationally secure ORAM with \(O(\log ^2 N/\log \log N + \log (1/\delta))\) overhead (by additionally adopting a metadata compression trick originally proposed by Fletcher et al. [26]).

B DETAILS ON OBLIVIOUS CUCKOO ASSIGNMENT

Recall that the input of the Cuckoo assignment is the array of the two choices, \({\bf I}= ((u_1, v_1), \ldots , (u_n, v_n))\), and the output is an array \({\bf A} = \lbrace a_1, \ldots, a_n\rbrace\), where \(a_i \in \lbrace u_i, v_i, {\tt stash}\rbrace\) denotes that the \(i\)th ball \(k_i\) is assigned to bin \(u_i\), to bin \(v_i\), or to the secondary stash array. We say that a Cuckoo assignment \({\bf A}\) is correct iff (i) each bin is assigned at most one ball, and (ii) the number of balls in the stash is minimized.

To compute \({\bf A}\), the array of choices \({\bf I}\) is viewed as a bipartite multi-graph \(G = (U \cup V, E)\), where \(U = \lbrace u_i\rbrace _{i\in [n]}\), \(V = \lbrace v_i\rbrace _{i\in [n]}\), \(E\) is the multi-set \(\lbrace (u_i, v_i)\rbrace _{i\in [n]}\), and the ranges of \(u_i\) and \(v_i\) are disjoint. Given \(G\), the Cuckoo assignment algorithm performs an oblivious breadth-first search (BFS) that traverses a spanning tree of each connected component in \(G\). In addition, the BFS performs the following for each edge \(e \in E\): \(e\) is marked as either a tree edge or a cycle edge, \(e\) is tagged with the root \(r \in U \cup V\) of the connected component of \(e\), and, if \(e\) is a tree edge, \(e\) is additionally marked as pointing toward either the root or a leaf. Note that all three properties can be obtained during the standard tree traversal. Given this marking, the idea for computing \({\bf A}\) is to assign each tree edge \(e = (u_i, v_i)\) toward its leaf side, and there are three cases for any connected component:

(1)

If there is no cycle edge in the connected component, perform the following. If \(e = (u_i, v_i)\) points toward a leaf, then assign \(a_i = v_i\); otherwise, assign \(a_i = u_i\).

(2)

If there is exactly one cycle edge in the connected component, traverse from the cycle edge up to the root using another BFS, reverse the pointing of every edge on the path from the cycle edge to the root, and then apply the assignment of (1).

(3)

If there are two or more cycle edges in the connected component, throw extra cycle edges to the stash by assigning \(a_i = {\tt stash}\), and then apply the assignment of (2).

The above operations take a constant number of passes of sorting and BFS, and hence it remains to implement an oblivious BFS efficiently.
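For reference, here is a non-oblivious sketch of the assignment itself: by conditions (i) and (ii), a correct assignment is exactly a maximum matching between balls and bins, which the sketch below computes with Kuhn's augmenting-path algorithm (function names are ours); the point of the oblivious BFS described above is to compute the same assignment with an input-independent access pattern.

```python
def cuckoo_assign(choices):
    """Non-oblivious reference for the cuckoo assignment.

    choices: list of (u_i, v_i) bin pairs, one per ball.  Returns a list
    whose i-th entry is the bin assigned to ball i, or 'stash'.
    """
    holder = {}  # bin -> index of the ball currently assigned to it

    def try_place(i, visited):
        for b in choices[i]:
            if b in visited:
                continue
            visited.add(b)
            # Take bin b if it is free, or evict its holder along an
            # augmenting path.
            if b not in holder or try_place(holder[b], visited):
                holder[b] = i
                return True
        return False

    for i in range(len(choices)):
        try_place(i, set())

    out = ['stash'] * len(choices)
    for b, i in holder.items():
        out[i] = b
    return out
```

Three balls sharing the two bins \(\lbrace 0,1\rbrace\) force exactly one ball into the stash, while a tree-shaped instance stashes nothing.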

Recall that in a standard BFS, we start with a root node \(r\) and expand to nodes at depth 1. Then, iteratively, we expand all nodes at depth \(i\) to nodes at depth \(i+1\) until all nodes are expanded. A cycle edge is detected when two or more nodes expand to the same node (because any cycle in a bipartite graph must consist of an even number of edges). We say the nodes at depth \(i\) are the \(i\)th front and the expansion is the \(i\)th iteration. To do this obliviously, the oblivious BFS performs the maximum number of iterations, and, in the \(i\)th iteration, it touches all nodes, yet only the \(i\)th front is actually expanded. Each iteration is accomplished by sorting and grouping adjacent edges and then updating the marking within each group. Note that the oblivious BFS does not need to know the connected components in advance. It simply expands all nodes in the beginning; then, a front “includes” the nodes of another front when the two meet, where the surviving front is the one whose root node precedes the other root. Such a BFS is not efficient as the maximum number of iterations is \(n\), and each iteration takes several sorts on \(O(n)\) elements.
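The "fronts merge toward the preceding root" rule can be illustrated (non-obliviously) by min-label propagation: every iteration scans the entire edge list in a fixed order, so the sequence of touched cells depends only on the sizes and the number of rounds, while each vertex gradually learns the smallest vertex identifier, playing the role of the root, in its connected component. This is only a flavor of the idea; the actual construction additionally hides the label updates behind oblivious sorting. Function name is ours.

```python
def component_roots(num_vertices, edges, rounds=None):
    # Every round scans ALL edges in the same order, so the access
    # pattern is input-independent; only edges on a live front make
    # progress.  num_vertices rounds always suffice.
    label = list(range(num_vertices))
    rounds = num_vertices if rounds is None else rounds
    for _ in range(rounds):
        for (u, v) in edges:
            m = min(label[u], label[v])
            label[u] = label[v] = m
    return label
```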

To achieve efficiency, the intuition is that in the random bipartite graph \(G\), with overwhelming probability, (i) the largest connected component in \(G\) is small, and (ii) there are many small connected components, so the BFS finishes in a few iterations. The intuition is informally stated by the following two tail bounds, where \(\gamma \lt 1\) and \(\beta \lt 1\) are constants that depend only on the Cuckoo constant, \({\sf c_{\sf cuckoo}}\).

(1)

For every integer \(s\), the size of the largest connected component of \(G\) is greater than \(s\) with probability \(O(\gamma ^{s})\).

(2)

Conditioned on the largest component having size at most \(s\), the following holds. For any integer \(t\), let \(C_t\) be the total number of edges over all components whose size is at least \(t\). Let \(c = 1/7\). If \(t\) satisfies \(n \beta ^t \ge \Theta (n^{1-c})\), then \(C_t = O(n \beta ^t)\) holds except with probability \(2^{-\Omega (\frac{n^{1-2c}}{s^4})}\).

Using these tail bounds, the BFS is pre-programmed in the following way. The second tail bound says that after the \(t\)th iteration of the BFS, we can safely eliminate the \((1- \beta)n\) edges that are (with high probability) in components of size at most \(t\), until there are \(\Theta (n^{1-c})\) edges remaining. Then, using the first tail bound, it suffices to run the BFS for \(s\) additional iterations on the remaining edges to figure out the remaining assignment (with overwhelming probability). To achieve failure probability \(\delta\), a standard choice is \(s = \log \frac{1}{\delta }\) [12, 33, 40].

Therefore, the access pattern of such an oblivious BFS is pre-determined by the constants \(\gamma , \beta , \delta\) and does not depend on the input \({\bf I}\). The tail bounds incur a failure in correctness for a \(\delta\) fraction of all inputs \({\bf I}\), which is then fixed by checking the output and applying a perfectly correct but non-oblivious algorithm, incurring a loss in obliviousness. This concludes the construction of \({\sf cuckooAssign}\) at a very high level, and we have the following lemma.

Lemma B.1.

Let \(n\) be the input size. Let the input of two choices be sampled uniformly at random. Then, the cuckoo assignment runs in time \(O(T(n) + \log \frac{1}{\delta } \cdot T(n^{6/7}))\) except with probability \(\delta + \log n \cdot 2^{-\Omega (\frac{n^{5/7}}{\log ^4 1/\delta })}\), where \(T(m)\) is the time to sort \(m\) integers each of \(O(\log m)\) bits.

Assuming \(n = \Omega (\log ^{8} \frac{1}{\delta })\) and plugging in \(T(m) = O(m \log m)\), we have the standard statement of \(O(n \log n)\) time and \(\delta\) failure probability [12, 33, 40].

C DEFERRED PROOFS

Skip C.1Proof of Theorem 5.2 Section

C.1 Proof of Theorem 5.2

To prove Theorem 5.2 we need several standard definitions and results. Given \(G=(V,E)\) and a set of vertices \(U\subset V\), we let \(\Gamma (U)\) be the set of all vertices in \(V\) which are adjacent to a vertex in \(U\) (namely, \(\Gamma (U) = \lbrace v\in V \mid \exists u\in U, (u,v)\in E\rbrace\)).

Definition C.1 (The Parameter \(\lambda (G)\) [3, Definition 21.2]) Given a \(d\)-regular graph \(G\) on \(n\) vertices, we let \(A = A(G)\) be the matrix such that for every two vertices \(u\) and \(v\) of \(G\), it holds that \(A_{u,v}\) is equal to the number of edges between \(u\) and \(v\) divided by \(d\). (In other words, \(A\) is the adjacency matrix of \(G\) multiplied by \(1/d\).) The parameter \(\lambda (A)\), denoted also as \(\lambda (G)\), is \(\begin{align*} \lambda (A) = \max _{{\mathbf {v}}\in \mathbf {1}^\bot , \Vert {\mathbf {v}}\Vert _2=1 } \Vert A{\mathbf {v}}\Vert _2, \end{align*}\) where \(\mathbf {1}^\bot = \lbrace \mathbf {v} \mid \sum \mathbf {v}_i = 0\rbrace\).

Lemma C.2 (Expander Mixing Lemma [3, Definition 21.11]).

Let \(G=(V,E)\) be a \(d\)-regular \(n\)-vertex graph. Then, for all sets \(S,T\subseteq V\), it holds that \(\begin{align*} \left|e(S,T) - \frac{d}{n}\cdot |S|\cdot |T| \right| \le \lambda (G)\cdot d\cdot \sqrt {|S|\cdot |T|}, \end{align*}\) where \(e(S,T)\) denotes the number of edges between \(S\) and \(T\), counted over ordered pairs (so an edge with both endpoints in \(S\cap T\) is counted twice).
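As a sanity check, \(\lambda (G)\) can be computed directly from Definition C.1 by power iteration restricted to \(\mathbf {1}^\bot\), and the mixing inequality can then be verified exhaustively on a small \(d\)-regular graph; below we use the 5-cycle, for which \(\lambda = \cos (\pi /5) \approx 0.809\). Helper names are ours, and \(e(S,T)\) is counted over ordered pairs.

```python
from itertools import chain, combinations

def lam(adj, d, iters=2000):
    # lambda(G) of Definition C.1: max ||Av|| over unit vectors v
    # orthogonal to all-ones, estimated by projected power iteration.
    n = len(adj)
    v = [((37 * i + 11) % 97) / 97.0 for i in range(n)]   # fixed generic start
    for _ in range(iters):
        mean = sum(v) / n
        v = [x - mean for x in v]                          # project onto 1-perp
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
        v = [sum(adj[i][j] * v[j] for j in range(n)) / d for i in range(n)]
    return sum(x * x for x in v) ** 0.5

def check_mixing(adj, d):
    # Verify |e(S,T) - d|S||T|/n| <= lambda(G) * d * sqrt(|S||T|) for all S, T.
    n = len(adj)
    l = lam(adj, d)
    subsets = list(chain.from_iterable(combinations(range(n), r)
                                       for r in range(n + 1)))
    for S in subsets:
        for T in subsets:
            e = sum(adj[u][v] for u in S for v in T)       # ordered pairs
            if abs(e - d * len(S) * len(T) / n) > l * d * (len(S) * len(T)) ** 0.5 + 1e-9:
                return False
    return True

# 5-cycle: 2-regular, non-bipartite, lambda = cos(pi/5).
C5 = [[1 if (i - j) % 5 in (1, 4) else 0 for j in range(5)] for i in range(5)]
```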

Proof of Theorem 5.2

Recall the graph of Margulis [48]. Fix a positive \(i\in \mathbb {N}\). The vertex set is \(V=[i]\times [i]\). A node \((x,y) \in V\) is connected to \((x,y),(x,x+y),(x,x+y+1),(x+y,y),(x+y+1, y)\) (all arithmetic is modulo \(i\)). We let \(H_N\) be the resulting graph, which has \(N=i^2\) vertices. Notice that \(H_N\) is nonbipartite, but we will construct the bipartite \(G_{\epsilon ,N}\) from \(H_N\).
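Each vertex's neighbor list involves only \(O(1)\) modular additions, which is why the whole edge set can be generated in linear time; a direct transcription of the definition (function name is ours):

```python
def margulis_neighbors(x, y, i):
    # The 5 neighbors of vertex (x, y) in H_{i^2}; all arithmetic mod i.
    # The first entry is the self-loop from the definition.
    return [(x, y),
            (x, (x + y) % i),
            (x, (x + y + 1) % i),
            ((x + y) % i, y),
            ((x + y + 1) % i, y)]
```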

It is known (Margulis [48], Gabber and Galil [27], and Jimbo and Maruoka [38]) that for every \(n\) which is a square (i.e., of the form \(n=i^2\) for some \(i\in \mathbb {N}\)), \(H_n\) is 5-regular and \(\lambda (H_n)\) is bounded by a constant smaller than 1, uniformly in \(n\). Also, the neighbors of each vertex can be computed via \(O(1)\) elementary operations, so the entire edge set of \(H_{n}\) can be computed in time linear in \(n\). To reduce \(\lambda (H_n)\) so as to satisfy \(\lambda \lt \epsilon\), we consider the \(k\)th power of \(H_n\). This yields a \(5^k\)-regular graph \(H_n^k\) on \(n\) vertices whose \(\lambda\) parameter is \(\lambda (H_n)^k\). By choosing \(k\) to be a sufficiently large constant such that \(\lambda (H_n)^k \lt \epsilon\), we obtain the resulting \(5^k\)-regular graph \(H^k_n\). The entire edge set of this graph can also be computed in \(O(n)\) time since moving from the \((i-1)\)th to the \(i\)th power requires \(O(1)\) operations for each of the 5 edges that replace an edge, so the total time per vertex is \(\sum _{i=1}^k O(5^i)= O(5^k) = O(1)\).

This completes the required construction for sizes \(2^{2}, 2^{4}, 2^{6},\ldots\) (i.e., for even powers of 2). We can fill in the odd powers by padding. Given the graph \(H^k_{n}\), we can obtain the required graph for \(2n\) vertices by considering 2 copies of \(H^k_{n}\) and connecting all pairs of vertices between them, that is, performing direct product of \(H^k_{n}\) with the complete graph on two vertices. The resulting graph \(\tilde{H}^k_{2n}\) has \(2n\) vertices on each side, it is \((2\cdot 5^k)\)-regular and \(\lambda (\tilde{H}^k_{2n}) = \lambda (H^k_{n}) \lt \epsilon\).

Finally, the desired bipartite graph \(G_{\epsilon ,n}=(L,R,E)\) is obtained as follows: let \(L\) and \(R\) each be a copy of the vertex set \([n]\), and for each pair \((u,v) \in L \times R\) add the edge \((u,v)\) to \(E\) if and only if \((u,v)\) is an edge of the nonbipartite graph \(H^k_n\) (or \(\tilde{H}^k_n\) if \(n\) is an odd power of 2), where the vertices of \(H^k_n\) (or \(\tilde{H}^k_n\), respectively) are also numbered by \([n]\). The inequality, and thus the theorem, now follows by applying the Expander Mixing Lemma (Lemma C.2).□

Skip C.2Deferred Proofs from Section 6 Section

C.2 Deferred Proofs from Section 6

Proof of Claim 6.3

We build a simulator that receives only \(n ~{\sf :=}~ n_0 + n_1\) and simulates the access pattern of Algorithm 6.1. Simulating the generation of the array \({\sf Aux}\) is straightforward: it consists of updating two counters (that can be stored at the client side) and a sequential write of the array \({\sf Aux}\). The rest of the algorithm is deterministic and its access pattern is completely determined by the size \(n\). Thus, it is straightforward to simulate the algorithm deterministically.

We next prove that the output distribution of the algorithm is identical to that of the ideal functionality. In the ideal execution, the functionality simply outputs an array \({\bf B}\), where \({\bf B}[i]=({\bf I}_0 \Vert {\bf I}_1)[\pi (i)]\) and \(\pi\) is a uniformly random permutation on \(n\) elements. In the real execution, we assume that the two arrays were first randomly permuted, and let \(\pi _0\) and \(\pi _1\) be the two permutations. Let \({\bf I}^{\prime }\) be an array defined as \({\bf I}^{\prime } ~{\sf :=}~ \pi _0({\bf I}_0) \Vert \pi _1({\bf I}_1)\). The algorithm then runs \({\sf Distribution}\) on \({\bf I}^{\prime }\) and \({\sf Aux}\), where \({\sf Aux}\) is a uniformly random binary array of size \(n\) that has \(n_0\) 0’s and \(n_1\) 1’s, and ends up with the output array \({\bf B}\) such that for all positions \(i\), the label of the element \({\bf B}[i]\) is \({\sf Aux}[i]\). Note that \({\sf Distribution}\) is not stable, so this defines some arbitrary mapping \(\rho :[n] \rightarrow [n]\). Hence, the algorithm outputs an array \({\bf B}\) such that \({\bf B}[i] = {\bf I}^{\prime }[\rho ^{-1}(i)]\). We show that if we sample \({\sf Aux}\), \(\pi _0\), and \(\pi _1\), as above, the resulting permutation is a uniform one.

To this end, we show that (1) \({\sf Aux}\) is distributed according to the distribution above, (2) the total number of different choices for \(({\sf Aux}, \pi _0, \pi _1)\) is \(n!\) (exactly as for a uniform permutation), and (3) any two choices \(({\sf Aux},\pi _0,\pi _1)\ne ({\sf Aux}^{\prime },\pi _0^{\prime },\pi _1^{\prime })\) result in different permutations. This completes our proof.

For (1) we show that the implementation of the sampling of the array in Step 1 in Algorithm 6.1 is equivalent to uniformly sampling an array of size \(n_0+n_1\) among all arrays of size \(n_0+n_1\) with \(n_0\) 0’s and \(n_1\) 1’s. Fix any array \(X\in \lbrace 0,1\rbrace ^n\) that consists of \(n_0\) 0’s followed by \(n_1\) 1’s. It is enough to show that \(\begin{eqnarray*} \Pr \left[\forall i\in [n]:{\sf Aux}[i] = X[i]\right] = \frac{1}{{n \choose n_0}}. \end{eqnarray*}\) This equality holds since the probability to get the bit \(b=X[i]\) in \({\sf Aux}[i]\) only depends on \(i\) and on the number of \(b\)’s that happened before iteration \(i\). Concretely, \(\begin{eqnarray*} \Pr \left[\forall i\in [n]:{\sf Aux}[i] = X[i]\right] &=& \left(\frac{n_0!}{n\cdot \ldots \cdot (n-n_0)} \right) \cdot \left(\frac{n_1!}{(n - n_0-1) \cdot \ldots \cdot 1 }\right) \\ &=& \frac{n_0!\cdot n_1!}{n!} = \frac{1}{{n \choose n_0}}. \end{eqnarray*}\) For (2), the number of possible choices of \(({\sf Aux},\pi _0,\pi _1)\) is \(\begin{eqnarray*} {n \choose n_0} \cdot n_0! \cdot n_1! = \frac{(n_0+n_1)!}{n_0!\cdot n_1!} \cdot n_0! \cdot n_1! = n! \ . \end{eqnarray*}\) For (3), consider two different triples \(({\sf Aux},\pi _0,\pi _1)\) and \(({\sf Aux}^{\prime },\pi ^{\prime }_0,\pi ^{\prime }_1)\) that result with two permutations \(\psi\) and \(\psi ^{\prime }\), respectively. If \({\sf Aux}(i)\ne {\sf Aux}^{\prime }(i)\) for some \(i\in [n]\) and without loss of generality \({\sf Aux}(i)=0\), then \(\psi (i)\in \lbrace 1,\ldots ,n_0\rbrace\) while \(\psi ^{\prime }(i)\in \lbrace n_0+1,\ldots ,n\rbrace\). Otherwise, if \({\sf Aux}(i)={\sf Aux}^{\prime }(i)\) for every \(i\in [n]\), then there exist \(b\in \lbrace 0,1\rbrace\) and \(j\in [n_b]\) such that \(\pi _b(j) \ne \pi ^{\prime }_b(j)\). Since the tight compaction circuit \(C_n\) is fixed given \({\sf Aux}\), the \(j\)th input in \({\bf I}_b\) is mapped in both cases to the same location of the bit \(b\) in \({\sf Aux}\). 
Denote the index of this location by \(j^{\prime }\). Thus, \(\psi (j^{\prime }) = \pi _b(j)\) while \(\psi ^{\prime }(j^{\prime }) = \pi _b^{\prime }(j)\), which means that \(\psi \ne \psi ^{\prime }\), as needed.

The implementation takes \(O(n)\) time since there are three main steps and each can be implemented in \(O(n)\) time. Step 1 takes \(O(n)\) time since there are \(n\) coin flips and each can be done in \(O(1)\) time (by just reading a word from the random tape). Step 2 only marks \(n\) elements, and Step 3 can be implemented in \(O(n)\) time by Theorem 5.1.□
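The coin-flip sampling of Step 1 keeps a bias that tracks the remaining number of 0’s and 1’s; the sketch below (helper names are ours) also recomputes the probability of any fixed array, matching the \(1/{n \choose n_0}\) equality shown in the proof.

```python
import random

def sample_aux(n0, n1, coin=random.random):
    # Flip a coin whose bias is the fraction of 0's still to be placed.
    aux, r0, r1 = [], n0, n1
    while r0 + r1 > 0:
        if coin() < r0 / (r0 + r1):
            aux.append(0); r0 -= 1
        else:
            aux.append(1); r1 -= 1
    return aux

def prob_of(aux, n0, n1):
    # Probability that sample_aux produces exactly this array.
    p, r0, r1 = 1.0, n0, n1
    for b in aux:
        p *= (r0 if b == 0 else r1) / (r0 + r1)
        if b == 0:
            r0 -= 1
        else:
            r1 -= 1
    return p
```

For \(n_0=n_1=2\), each of the \({4 \choose 2} = 6\) balanced arrays gets probability exactly \(1/6\).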

Proof of Claim 6.5

The simulator that receives \(n_1,\ldots ,n_k\) runs the simulator of \({\sf Intersperse}\) \(k-1\) times with the appropriate lengths, as in the description of the algorithm. Indistinguishability follows immediately from the indistinguishability of \({\sf Intersperse}\). For functionality, note that whenever \({\sf Intersperse}\) is applied, both of its inputs are randomly shuffled, which means that the input assumption of \({\sf Intersperse}\) holds. Thus, the final array \({\bf B}\) is a uniform permutation of \({\bf I}_1\Vert \ldots \Vert {\bf I}_k\).

Since the time of \({\sf Intersperse}_n\) is linear in \(n\), the time required in the \(i\)th iteration of \({\sf Intersperse}^{(k)}_{n_1,\ldots ,n_k}\) is \(O(\sum _{j=1}^{i} n_j)\). Namely, we pay \(O(n_1)\) in \(k-1\) iterations, \(O(n_2)\) in \(k-2\) iterations, and so on. Overall, the time is bounded by \(O(\sum _{i=1}^{k}(k-i+1)\cdot n_i)\), as required.□
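The accounting in this proof can be mirrored directly: the \(i\)th application of \({\sf Intersperse}\) merges the current prefix with the next array at linear cost, so the array \({\bf I}_i\) is paid for once in every later merge. A sketch of the cost model (function name is ours):

```python
def intersperse_multi_cost(ns):
    # Cost model for Intersperse^{(k)}: fold left over the k arrays,
    # each merge costing the size of the merged prefix so far.
    cost, acc = 0, ns[0]
    for n in ns[1:]:
        acc += n
        cost += acc
    return cost
```

For sizes \(1,2,4,8\) the fold costs \(3+7+15=25\), within the claimed bound \(\sum _{i=1}^{k}(k-i+1)\cdot n_i = 26\).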

Proof of Claim 6.7

We build a simulator that receives only the size of \({\bf I}\) and simulates the access pattern of Algorithm 6.6. The simulation of the first and second steps is immediate since they are completely deterministic. The simulation of the execution of \({\sf Intersperse}_n\) is implied by Claim 6.3.

The proof that the output distribution of the algorithm is identical to that of the ideal functionality follows immediately from Claim 6.3. Indeed, after compaction and counting the number of real elements, we execute \({\sf Intersperse}_n\) with two arrays \({\bf I}^{\prime }_R\) and \({\bf I}^{\prime }_D\) of total size \(n\), where \({\bf I}^{\prime }_R\) consists of all the \(n_R\) real elements and \({\bf I}^{\prime }_D\) consists of all the dummy elements. The array \({\bf I}^{\prime }_R\) is uniformly shuffled to begin with by the input assumption, and the array \({\bf I}^{\prime }_D\) consists of identical elements, so we can think of them as if they were randomly permuted. So, the input assumption of \({\sf Intersperse}_n\) (see Algorithm 6.1) holds and thus the output is guaranteed to be randomly shuffled (by Claim 6.3).

The implementation runs in \(O(n)\) time since the first two steps take \(O(n)\) time (by Theorem 5.1) and \({\sf Intersperse}_n\) itself runs in \(O(n)\) time (by Claim 6.3).□

Skip C.3Proof of Security of BigHT (Theorem 7.2) Section

C.3 Proof of Security of BigHT (Theorem 7.2)

We view our construction in a hybrid model, in which we have ideal implementations of the underlying building blocks: an oblivious hash table for each bin (implemented via naïveHT, as in Section 4.4), an oblivious Cuckoo hashing scheme (Section 4.5), an oblivious tight compaction algorithm (Section 5), and an algorithm for sampling bin loads (Section 4.7). Since the underlying Cuckoo hash scheme (on \(\epsilon n=n/\log \lambda\) elements with a stash of size \(O(\log \lambda)\)) is ideal, we have to take into account the probability that it fails: at most \(O(\delta) + \delta _{\rm prf}^{{\mathcal {A}}} = e^{-\Omega (\log \lambda \cdot \log \log \lambda)} + \delta _{\rm prf}^{{\mathcal {A}}}\), where \(\delta =e^{-\log \lambda \cdot \log \log \lambda }\). The failure probability of sampling the bin loads is \(nB\cdot \delta \le n^2 \cdot e^{-\log \lambda \cdot \log \log \lambda }\). These two terms are added to the error probability of the scheme. Note that our tight compaction and naïveHT are perfectly oblivious.

We describe a simulator \({\mathsf {Sim}}\) that simulates the access patterns of the \({\sf Build}\), \({\sf Lookup}\), and \({\sf Extract}\) operations of \({\sf BigHT}\):

Simulating \({\sf Build}\). Upon receiving an instruction to simulate \({\sf Build}\) with security parameter \(1^\lambda\) and a list of size \(n\), the simulator runs the real algorithm \({\sf Build}\) on input \(1^\lambda\) and a list that consists of \(n\) dummy elements. It outputs the access pattern of this algorithm. Let \(({\sf OBin}_1,\ldots ,{\sf OBin}_B,{\sf OF},\texttt{sk})\) be the output state. The simulator stores this state.

Simulating \({\sf Lookup}\). When the adversary submits a \({\sf Lookup}\) command with a key \(k\), the simulator simulates an execution of the algorithm \({\sf Lookup}\) on input \(\bot\) (i.e., a dummy element) with the state \(({\sf OBin}_1,\ldots ,{\sf OBin}_B,{\sf OF},\texttt{sk})\) (which was generated while simulating the \({\sf Build}\) operation).

Simulating \({\sf Extract}\). When the adversary submits an \({\sf Extract}\) command, the simulator executes the real algorithm with its stored internal state \(({\sf OBin}_1,\ldots ,{\sf OBin}_B,{\sf OF},\texttt{sk})\).

We prove that no adversary can distinguish between the real and ideal executions. Recall that in the ideal execution, with each command that the adversary outputs, it receives back the output of the functionality and the access pattern of the simulator, where the latter is simulating the access pattern of the execution of the command on dummy elements. On the other hand, in the real execution, the adversary sees the access pattern and the output of the algorithm that implements the functionality. The proof is via a sequence of hybrid experiments.

Experiment \({\sf Hyb}_0(\lambda)\). This is the real execution. With each command that the adversary submits to the experiment, the real algorithm is being executed, and the adversary receives the output of the execution together with the access pattern as determined by the execution of the algorithm.

Experiment \({\sf Hyb}_1(\lambda)\).. This experiment is the same as \({\sf Hyb}_0\), except that instead of choosing a PRF key \(\texttt{sk}\), we use a truly random function \({\mathcal {O}}\). That is, instead of calling to \({\sf PRF} _{\texttt{sk} }(\cdot)\) in Step 3 of \({\sf Build}\) and Step 4 of the function \({\sf Lookup}\), we call \({\mathcal {O}}(\texttt{sk} \Vert \cdot)\).

The following claim states that due to the \(\delta _{{\sf PRF}}^{{\mathcal {A}}}\)-security of the PRF, experiments \({\sf Hyb}_0\) and \({\sf Hyb}_1\) are computationally indistinguishable. The proof of this claim is standard.

Claim C.3.

For any \(\texttt{PPT}\)adversary \(\mathcal {A}\), it holds that \(\begin{align*} | \Pr [{\sf Hyb}_0(\lambda) = 1] - \Pr [{\sf Hyb}_1(\lambda) = 1] | \le \delta _{{\sf PRF}}^{{\mathcal {A}}}(\lambda). \end{align*}\)
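The switch from \({\sf Hyb}_0\) to \({\sf Hyb}_1\) replaces a concrete PRF by a lazily sampled truly random function. A minimal sketch, using HMAC-SHA256 as an illustrative PRF instantiation (not the one in the construction):

```python
import hashlib
import hmac
import os

def prf(sk: bytes, x: bytes) -> bytes:
    """Illustrative PRF instantiation (HMAC-SHA256): in Hyb_0 every
    pseudorandom value is derived deterministically as PRF_sk(x)."""
    return hmac.new(sk, x, hashlib.sha256).digest()

class LazyRandomFunction:
    """In Hyb_1 the PRF is replaced by a truly random function O(sk || .),
    sampled lazily: a fresh input gets fresh randomness, while a repeated
    input returns the previously sampled answer."""
    def __init__(self):
        self.table = {}

    def __call__(self, x: bytes) -> bytes:
        if x not in self.table:
            self.table[x] = os.urandom(32)
        return self.table[x]
```

Any adversary distinguishing the two hybrids distinguishes `prf` from `LazyRandomFunction`, which is exactly the PRF security game; this is why the gap is bounded by \(\delta _{{\sf PRF}}^{{\mathcal {A}}}(\lambda)\).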

Experiment \({\sf Hyb}_2(\lambda)\). This experiment is the same as \({\sf Hyb}_1(\lambda)\), except that with each command that the adversary submits to the experiment, both the real algorithm and the functionality are executed. The adversary receives the access pattern of the execution of the algorithm, yet the output comes from the functionality.

In the following claim, we show that the initial secret permutation and the random oracle guarantee that experiments \({\sf Hyb}_1\) and \({\sf Hyb}_2\) are identical.

Claim C.4.

\(\Pr \left[{\sf Hyb}_1(\lambda) = 1 \right] = \Pr \left[{\sf Hyb}_2(\lambda) = 1 \right]\).

Proof.

Recall that we assume that the lookup queries of the adversary are non-recurring. Our goal is to show that the output distribution of the \({\sf Extract}\) procedure is a uniform permutation of the unvisited items even given the access pattern of the previous \({\sf Build}\) and \({\sf Lookup}\) operations. By doing so, we can replace the \({\sf Extract}\) procedure with the ideal \({{\mathcal {F}}_{\sf HT}^{n}}.{\sf Extract}\) functionality, which is exactly the difference between \({\sf Hyb}_1(\lambda)\) and \({\sf Hyb}_2(\lambda)\).

Consider a sequence of operations that the adversary makes. Let us denote by \({\bf I}\) the set of elements with which it invokes \({\sf Build}\) and by \(k^*_1,\ldots ,k^*_m\) the set of keys with which it invokes \({\sf Lookup}\). Finally, it invokes \({\sf Extract}\). We first argue that the output of \({{\mathcal {F}}_{\sf HT}^{n}}.{\sf Extract}\) consists of the same elements as that of \({\sf Extract}\). Indeed, both \({{\mathcal {F}}_{\sf HT}^{n}}.{\sf Lookup}\) and \({\sf Lookup}\) mark every visited item, so when we execute \({\sf Extract}\), the same set of elements will be in the output.

We need to argue that the distribution of the permutation of unvisited items in the input of \({\sf Extract}\) is uniformly random. This is enough since \({\sf Extract}\) performs \({\sf IntersperseRD}\) which shuffles the reals and dummies to obtain a uniformly random permutation overall (given that the reals were randomly shuffled to begin with). Fix an access pattern observed during the execution of \({\sf Build}\) and \({\sf Lookup}\). We show, by programming the random oracle and the initial permutation appropriately (while not changing the access pattern), that the permutation of the unvisited elements is uniformly distributed.

Consider tuples of the form \((\pi _{\sf in},{\mathcal {O}},R,{\mathcal {T}},\pi _{\sf out})\), where (1) \(\pi _{\sf in}\) is the permutation performed on \({\bf I}\) by the input assumption (prior to \({\sf Build}\)), (2) \({\mathcal {O}}\) is the random oracle, (3) \(R\) is the internal randomness of all intermediate functionalities and of the balls-into-bins choices of the dummy elements; (4) \({\mathcal {T}}\) is the access pattern of the entire sequence of commands \(({\sf Build}({\bf I}),{\sf Lookup}(k^*_1),\ldots ,{\sf Lookup}(k^*_m))\), and (5) \(\pi _{\sf out}\) is the permutation on \({\bf I}^{\prime } = \lbrace (k,v) \in {\bf I}\mid k \notin \lbrace k^*_1,\ldots ,k^*_m\rbrace \rbrace\) which is the input to \({\sf Extract}\). The algorithm defines a deterministic mapping \(\psi _R(\pi _{\sf in},{\mathcal {O}}) \rightarrow ({\mathcal {T}},\pi _{\sf out})\).

To gain intuition, consider arbitrary \(R\), \(\pi _{\sf in}\), and \({\mathcal {O}}\) such that \(\psi _R(\pi _{\sf in},{\mathcal {O}})\rightarrow ({\mathcal {T}},\pi _{\sf out})\) and two distinct existing keys \(k_i\) and \(k_j\) that are not queried during the Lookup stage (i.e., \(k_i,k_j \notin \lbrace k^*_1,\ldots ,k^*_m\rbrace\)). We argue that from the point of view of the adversary, having seen the access pattern and all query results, it cannot distinguish whether \(\pi _{\sf out}(i) \lt \pi _{\sf out}(j)\) or \(\pi _{\sf out}(i) \gt \pi _{\sf out}(j)\). The argument will naturally generalize to arbitrary unqueried keys and an arbitrary ordering.

To this end, we show that there is \(\pi _{\sf in}^{\prime }\) and \({\mathcal {O}}^{\prime }\) such that \(\psi _R(\pi ^{\prime }_{\sf in},{\mathcal {O}}^{\prime })\rightarrow ({\mathcal {T}},\pi ^{\prime }_{\sf out})\), where \(\pi ^{\prime }_{\sf out}(\ell) = \pi _{\sf out}(\ell)\) for every \(\ell \notin \lbrace i,j\rbrace\), and \(\pi ^{\prime }_{\sf out}(i) = \pi _{\sf out}(j)\) and \(\pi ^{\prime }_{\sf out}(j) = \pi _{\sf out}(i)\). That is, the access pattern is exactly the same and the output permutation switches the mappings of \(k_i\) and \(k_j\). The permutation \(\pi ^{\prime }_{\sf in}\) is the same as \(\pi _{\sf in}\) except that \(\pi ^{\prime }_{\sf in}(i) = \pi _{\sf in}(j)\) and \(\pi ^{\prime }_{\sf in}(j) = \pi _{\sf in}(i)\), and \({\mathcal {O}}^{\prime }\) is the same as \({\mathcal {O}}\) except that \({\mathcal {O}}^{\prime }(k_i) = {\mathcal {O}}(k_j)\) and \({\mathcal {O}}^{\prime }(k_j) = {\mathcal {O}}(k_i)\). This definition of \(\pi ^{\prime }_{\sf in}\) together with \({\mathcal {O}}^{\prime }\) ensures, by our construction, that the observed access pattern remains exactly the same. The mapping is also reversible, so by symmetry all permutations have the same number of configurations of \(\pi _{\sf in}\) and \({\mathcal {O}}\).

For the general case, one can switch from any \(\pi _{\sf out}\) to any (legal) \(\pi ^{\prime }_{\sf out}\) by changing only \(\pi _{\sf in}\) and \({\mathcal {O}}\) at locations that correspond to unvisited items. We define \(\begin{align*} \pi _{\sf in}^{\prime }(i) = \pi _{\sf in}({\pi _{\sf out}}^{-1}({\pi _{\sf out}^{\prime }}(i))) \quad \text{ and } \quad {\mathcal {O}}^{\prime }(k_i) = {\mathcal {O}}(k_{\pi _{\sf in}({\pi _{\sf out}}^{-1}({\pi _{\sf out}^{\prime }}(i)))}). \end{align*}\) This choice of \(\pi _{\sf in}^{\prime }\) and \({\mathcal {O}}^{\prime }\) does not change the observed access pattern and results in the output permutation \(\pi ^{\prime }_{\sf out}\), as required. By symmetry, the resulting mapping between different \((\pi _{\sf in}^{\prime },{\mathcal {O}}^{\prime })\) and \(\pi _{\sf out}^{\prime }\) is regular (i.e., each output permutation can be reached in the same number of ways), which completes the proof.□
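The counting argument above rests on the observation that, for fixed \(\pi _{\sf in}\) and \(\pi _{\sf out}\), the map \(\pi ^{\prime }_{\sf out} \mapsto \pi _{\sf in}^{\prime } = \pi _{\sf in} \circ \pi _{\sf out}^{-1} \circ \pi ^{\prime }_{\sf out}\) is a bijection on permutations, so every output permutation is reached by the same number of \((\pi _{\sf in}^{\prime },{\mathcal {O}}^{\prime })\) configurations. A brute-force sanity check on a toy instance (permutations as 0-indexed tuples; helper names are ours):

```python
from itertools import permutations

def compose(p, q):
    """(p o q)(i) = p(q(i)), with permutations as tuples of images."""
    return tuple(p[q[i]] for i in range(len(p)))

def invert(p):
    """Inverse permutation: invert(p)[p[i]] = i."""
    inv = [0] * len(p)
    for i, pi in enumerate(p):
        inv[pi] = i
    return tuple(inv)

n = 4
pi_in = (2, 0, 3, 1)   # some fixed input permutation
pi_out = (1, 3, 0, 2)  # the output permutation it induces
# For every target pi_out', reprogram pi_in' = pi_in o pi_out^{-1} o pi_out'
# and check that distinct targets yield distinct pi_in' (a bijection).
images = {compose(pi_in, compose(invert(pi_out), p))
          for p in permutations(range(n))}
assert len(images) == 24  # all 4! targets reached by distinct pi_in'
```

Left-composition with the fixed permutation \(\pi _{\sf in} \circ \pi _{\sf out}^{-1}\) is a bijection on the symmetric group, which is exactly the regularity used in the proof.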

Experiment \({\sf Hyb}_3(\lambda)\). This experiment is the same as \({\sf Hyb}_2(\lambda)\), except that we modify the definition of \({\sf Extract}\) to output a list of dummy elements. This is implemented by modifying each \({\sf OBin}_i.{\sf Extract}()\) to return a list of dummy elements (for each \(i\in [B]\)), as well as \({\sf OF}.{\sf Extract}()\). We also stop marking elements that were searched for during \({\sf Lookup}\).

Recall that in this hybrid experiment the output of \({\sf Extract}\) is given to the adversary by the functionality, and not by the algorithm. Thus, the change we made does not affect the view of the adversary, which means that experiments \({\sf Hyb}_2\) and \({\sf Hyb}_3\) are identical.

Claim C.5.

\(\Pr \left[{\sf Hyb}_2(\lambda) = 1 \right] = \Pr \left[{\sf Hyb}_3(\lambda) = 1 \right]\).

Experiment \({\sf Hyb}_4(\lambda)\). This experiment is identical to experiment \({\sf Hyb}_3(\lambda)\), except that when the adversary submits the command \({\sf Lookup}(k)\) with key \(k\), we run \({\sf Lookup}(\bot)\) instead of \({\sf Lookup}(k)\).

Recall that the output of the procedure is determined by the functionality and not the algorithm. In the following claim we show that the access pattern observed by the adversary in this experiment is statistically close to the one observed in \({\sf Hyb}_3(\lambda)\).

Claim C.6.

For any (unbounded) adversary \(\mathcal {A}\), there is a negligible function \({\sf negl}(\cdot)\) such that \(\begin{align*} |\Pr [{\sf Hyb}_3(\lambda) = 1] - \Pr [{\sf Hyb}_4(\lambda) = 1 ]| \le n \cdot e^{-\Omega (\log ^5 \lambda)}. \end{align*}\)

Proof.

Consider a sequence of operations that the adversary makes. Let us denote by \({\bf I}= \lbrace k_1, k_2, \ldots , k_n : k_1 \lt k_2 \lt \ldots \lt k_n\rbrace\) the set of elements with which it invokes \({\sf Build}\), by \(\pi\) the secret input permutation such that the \(i\)th element \({\bf I}[i] = k_{\pi (i)}\), and by \(Q = \lbrace k^*_1, \ldots , k^*_m\rbrace\) the set of keys with which it invokes \({\sf Lookup}\). We first claim that it suffices to consider only the joint distribution of the access pattern of Step 3 in \({\sf Build}({\bf I})\), followed by the access pattern of \({\sf Lookup}(k^*_i)\) for all \(k^*_i \in {\bf I}\). In particular, in both hybrids, the outputs are determined by the functionality, and the access pattern of \({\sf Extract}()\) is identically distributed. Moreover, the access pattern in Steps 4 through 6 in \({\sf Build}\) is deterministic and is a function of the access pattern of Step 3. In addition, in both executions, \({\sf Lookup}(k)\) for keys \(k \not\in {\bf I}\), as well as \({\sf Lookup}(\bot)\), causes a linear scan of the overflow pile followed by an independent visit of a random bin (even when conditioning on the access pattern of \({\sf Build}\)); hence, we can ignore such queries. Finally, even though the adversary is adaptive, we prove below that the entire view of the adversary is statistically close in both experiments, and therefore the order in which the view is obtained cannot help in distinguishing.

Let \(X \leftarrow {{\sf BallsIntoBins}}(n,B)\) denote a sample of the access pattern obtained by throwing \(n\) balls into \(B\) bins. It is convenient to view \(X\) as a bipartite graph \(X=(V_n\cup V_B,E_X)\), where \(V_n\) are the \(n\) vertices representing the balls, \(V_B\) are the \(B\) vertices representing the bins, and \(E_X\) represents the access pattern. Note that the out-degree of each ball is 1, whereas the degree of each bin is its load, and the expectation of the latter is \(n/B\). For two graphs that share the same bins \(V_B\), \(X=(V_{n_1} \cup V_B,E_X)\) and \(Y=(V_{n_2} \cup V_B,E_Y)\), we define the union of the two graphs, denoted \(X \cup Y\), by \(X \cup Y = (V_{n_1} \cup V_{n_2} \cup V_B, E_X \cup E_Y)\). Consider the following two distributions.
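The bipartite-graph view can be made concrete as an edge list, one edge per ball; a sketch with hypothetical helper names:

```python
import random

def balls_into_bins(n, B, rng=random):
    """Sample the access-pattern graph of throwing n balls into B bins,
    returned as the edge list E_X: one edge (ball, bin) per ball, so every
    ball has out-degree 1 and the degree of a bin is its load."""
    return [(ball, rng.randrange(B)) for ball in range(n)]

def loads(edges, B):
    """Per-bin loads, i.e., the degrees of the bin vertices."""
    L = [0] * B
    for _, b in edges:
        L[b] += 1
    return L

n, B = 1000, 10
X = balls_into_bins(n, B, random.Random(0))
L = loads(X, B)
assert sum(L) == n  # every ball lands in exactly one bin
# each load concentrates around its expectation n/B = 100
```

The union of two such graphs sharing the bin vertices is simply the concatenation of their edge lists, matching the definition of \(X \cup Y\) above.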

Distribution \({\sf AccessPtrn}_3(\lambda)\): In \({\sf Hyb}_3(\lambda)\), the joint distribution of the access pattern of Step 3 in \({\sf Build}({\bf I})\), followed by the access pattern of \({\sf Lookup}(k_i)\) for all \(k_i \in {\bf I}\), can be described by the following process:

(1) Sample \(X \leftarrow {{\sf BallsIntoBins}}(n,B)\). Let \((n_1,\ldots ,n_B)\) be the loads obtained in the process and \(\mu = \frac{n}{B}\) be the expectation of \(n_i\) for all \(i \in [B]\).

(2) Sample independent bin loads \((L_1,\ldots ,L_B) \leftarrow {\mathcal {F}}^{{\mathsf {throw\text{-}balls}}}_{n^{\prime },B}\), where \(n^{\prime } = n\cdot \left(1- \epsilon \right)\). Let \(\mu ^{\prime } = \frac{n^{\prime }}{B}\) be the expectation of \(L_i\) for all \(i \in [B]\).

(3) \(\underline{{\sf Overflow}}\): If for some \(i\in [B]\) we have that \(\left| n_i - \mu \right| \gt 0.5 \epsilon \mu\) or \(\left| {{L}} _i - \mu ^{\prime } \right| \gt 0.5 \epsilon \mu\), then \({\sf abort}\) the process.

(4) Consider the graph \(X = (V_n \cup V_B, E_X)\), and for every bin \(i \in [B]\), remove from \(E_X\) exactly \(n_i - L_i\) edges arbitrarily (these correspond to the elements that are stored in the overflow pile). Let \(X^{\prime } = (V_n \cup V_B, E^{\prime }_X)\) be the resulting graph. Note that \(X^{\prime }\) has \(n^{\prime }\) edges, each bin \(i \in [B]\) has exactly \(L_i\) edges, and \(n-n^{\prime }\) vertices in \(V_n\) have no output edges.

(5) Recall that \(\pi\) is the input permutation on \({\bf I}\). Let \(\tilde{E}^{\prime }_X = \lbrace (\pi (i), v_i) : (i, v_i) \in E^{\prime }_X\rbrace\) be the set of permuted edges, \(\tilde{V}_{n^{\prime }} \subset V_n\) be the set of nodes that have an edge in \(\tilde{E}^{\prime }_X\), and \(\tilde{X}^{\prime } = (\tilde{V}_{n^{\prime }} \cup V_B, \tilde{E}^{\prime }_X)\). Note that there are \(n^{\prime }\) vertices in \(\tilde{V}_{n^{\prime }}\).

(6) For the \(\epsilon n\) remaining vertices in \(V_n\) but not in \(\tilde{V}_{n^{\prime }}\) that have no output edges (i.e., the balls in the overflow pile), sample new and independent output edges, where each edge is obtained by choosing an independent bin \(i \leftarrow [B]\). Let \(Z^{\prime }\) be the resulting graph (corresponding to the access pattern of \({\sf Lookup}(k_i)\) for all \(k_i\) that appear in \({\sf OF}\) and not in the major bins). Let \(Y = \tilde{X}^{\prime } \cup Z^{\prime }\). (The graph \(Y\) contains edges that correspond to the “real” elements placed in the major bins which were obtained from the graph \(\tilde{X}^{\prime }\), together with fresh “noisy” edges corresponding to the elements stored in the overflow pile.)

(7) Output \((X,Y)\).

Distribution \({{\sf AccessPtrn}}_4(\lambda)\): In \({\sf Hyb}_4(\lambda)\), the joint distribution of the access pattern of Step 3 in \({\sf Build}({\bf I})\), followed by the access pattern of \({\sf Lookup}(\bot)\) for all \(k_i \in {\bf I}\), is described by the following (simpler) process:

(1) Sample \(X \leftarrow {\sf BallsIntoBins}(n,B)\). Let \((n_1,\ldots ,n_B)\) be the loads obtained in the process and \(\mu = \frac{n}{B}\) be the expectation of \(n_i\) for all \(i \in [B]\).

(2) Sample independent bin loads \((L_1,\ldots ,L_B) \leftarrow {\mathcal {F}}^{\mathsf {throw\text{-}balls}}_{n^{\prime },B}\), where \(n^{\prime } = n\cdot \left(1- \epsilon \right)\). Let \(\mu ^{\prime } = \frac{n^{\prime }}{B}\) be the expectation of \(L_i\) for all \(i \in [B]\).

(3) \(\underline{{\sf Overflow}}\): If for some \(i\in [B]\) we have that \(\left| n_i - \mu \right| \gt 0.5 \epsilon \mu\) or \(\left| {{L}} _i - \mu ^{\prime } \right| \gt 0.5 \epsilon \mu\), then \({\sf abort}\) the process.

(4) Sample an independent \(Y \leftarrow {\sf BallsIntoBins}(n,B)\). (Corresponding to the access pattern of \({\sf Lookup}(\bot)\) for every command \({\sf Lookup}(k)\).)

(5) Output \((X,Y)\).

By the definition of our distributions and hybrid experiments, we need to show \(\begin{align*} |\Pr \left[{\sf Hyb}_3(\lambda) = 1 \right] - \Pr \left[{\sf Hyb}_4(\lambda) = 1 \right]| \le {\sf SD}\left({\sf AccessPtrn}_3(\lambda),{\sf AccessPtrn}_4(\lambda)\right) \le n \cdot e^{-\Omega (\log ^5 \lambda)}. \end{align*}\)

Toward this goal, first observe that by a Chernoff bound per bin, a union bound over the bins, and the settings \(\mu = \log ^9 \lambda\) and \(\epsilon = \frac{1}{\log ^2 \lambda }\), it holds that \(\begin{align*} \Pr _{{\sf AccessPtrn}_3}\left[{\sf Overflow}\right] = \Pr _{{\sf AccessPtrn}_4}\left[{\sf Overflow}\right] \le B \cdot 2\exp (-\mu (0.5\epsilon)^2/2) \le n \cdot e^{-\Omega (\log ^5 \lambda)}. \end{align*}\) We condition on \({\sf Overflow}\) not occurring and show that both distributions output two independent graphs, i.e., two independent samples of \({\sf BallsIntoBins}(n,B)\), and thus they are equivalent.
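The overflow bound can be evaluated numerically: with \(\mu = \log^9\lambda\) and \(\epsilon = \log^{-2}\lambda\), the exponent \(-\mu(0.5\epsilon)^2/2\) simplifies to \(-\log^5\lambda/8\). A sketch (illustrative log base and parameter values; the constants hidden in the \(\Omega\) are ignored):

```python
import math

def overflow_bound(lam, n):
    """Evaluate the per-bin Chernoff + union bound
    B * 2 * exp(-mu * (0.5*eps)^2 / 2) with mu = log^9(lam),
    eps = 1/log^2(lam), and B = n/mu bins.  The exponent simplifies
    to -log^5(lam) / 8."""
    log_lam = math.log2(lam)  # illustrative choice of log base
    mu = log_lam ** 9
    eps = 1 / log_lam ** 2
    B = n / mu
    return B * 2 * math.exp(-mu * (0.5 * eps) ** 2 / 2)

# e.g., lam = 2^20 (so log lam = 20) and n = 2^50:
bound = overflow_bound(2 ** 20, 2 ** 50)
# exponent is -20^5 / 8 = -400000, so the bound is negligible
assert bound < 1e-100
```

Even for modest parameters, the exponent \(-\log^5\lambda/8\) makes the overflow probability vanish far faster than any polynomial in \(\lambda\).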

This holds in \({\sf AccessPtrn}_4\) directly by definition. As for \({\sf AccessPtrn}_3\), consider the joint distribution of \((X, \tilde{X}^{\prime })\) conditioning on \({\sf Overflow}\) not happening. For any graph \(G = (V_{n^{\prime }} \cup V_B, E_G)\) that corresponds to a sample of \({\sf BallsIntoBins}(n^{\prime }, B)\), we have that \(\tilde{X}^{\prime } = G\) if and only if (i) the loads of \(\tilde{X}^{\prime }\) equal the loads of \(G\) and (ii) \(\tilde{E}^{\prime }_X = E_G\), where the loads of \(G\) are defined as the degrees of the nodes \(v \in V_B\). Observe that, by definition, the loads of \(\tilde{X}^{\prime }\) are exactly \((L_1, \ldots , L_B)\), and hence the loads of \(\tilde{X}^{\prime }\) are independent of \(X\). Also, conditioned on \({\sf Overflow}\) not happening, on \(X\), and on event (i), the probability of \(\tilde{E}^{\prime }_X = E_G\) is exactly the probability of the \(n^{\prime }\) vertices in \(\tilde{X}^{\prime }\) matching those in \(G\), which is \(\frac{1}{n^{\prime }!}\) by the uniform input permutation \(\pi\). It follows that \(\begin{align*} \Pr [\tilde{X}^{\prime } = G \mid X \wedge \lnot {\sf Overflow}] &= \Pr [\text{loads of $G$} = (L_1, \ldots , L_B) \mid \lnot {\sf Overflow}] \cdot \frac{1}{n^{\prime }!} \\ &= \Pr [Z = G \mid \lnot {\sf Overflow}] \end{align*}\) for all \(G\), where \(Z \leftarrow {{\sf BallsIntoBins}}(n^{\prime },B)\), which implies that \(\tilde{X}^{\prime }\) is independent of \(X\). Moreover, in Step 6, we sample a new graph \(Z^{\prime } \leftarrow {\sf BallsIntoBins}(n-n^{\prime },B)\), and output \(Y\) as \(\tilde{X}^{\prime }\) augmented by \(Z^{\prime }\). In other words, we sample \(Y\) as follows: we sample two independent graphs \(Z \leftarrow {{\sf BallsIntoBins}}(n^{\prime },B)\) and \(Z^{\prime } \leftarrow {\sf BallsIntoBins}(n-n^{\prime },B)\), and output the joint graph \(Z \cup Z^{\prime }\). This has exactly the same distribution as an independent instance of \({\sf BallsIntoBins}(n,B)\).
We therefore conclude that \(\begin{align*} {\sf SD}\left({\sf AccessPtrn}_3(\lambda) \mid \lnot {\sf Overflow}, {\sf AccessPtrn}_4(\lambda) \mid \lnot {\sf Overflow}\right)=0. \end{align*}\) Thus, by a fact on statistical distance,21 \(\begin{align*} &{\sf SD}\left({\sf AccessPtrn}_3(\lambda),{\sf AccessPtrn}_4(\lambda)\right)\\ &\!\le {\sf SD}\left({\sf AccessPtrn}_3(\lambda)\, \vert \, \lnot {\sf Overflow},{\sf AccessPtrn}_4(\lambda)\, \vert \, \lnot {\sf Overflow}\right) + \Pr \left[{\sf Overflow}\right] \le n \cdot e^{-\Omega (\log ^5 \lambda)}\!. \end{align*}\) Namely, the access patterns are statistically close. The above analysis assumes that all \(n\) elements in the input \({\bf I}\) are real and the \(m\) \({\sf Lookup}\)s visit all real keys in \({\bf I}\). If the number of real elements is less than \(n\) (or even less than \(n^{\prime }\)), then the construction and the analysis go through similarly; the only difference is that the \({\sf Lookup}\)s reveal a smaller number of edges in \(\tilde{X}^{\prime }\), and thus the distributions are still statistically close. The same argument follows if the \(m\) \({\sf Lookup}\)s visit only a subset of real keys in \({\bf I}\). Also note that, fixing any set \(Q = \lbrace k^*_1, \ldots , k^*_m\rbrace\) of \({\sf Lookup}\) queries, every ordering of \(Q\) reveals the same access pattern \(\tilde{X}^{\prime }\), since \(\tilde{X}^{\prime }\) is determined only by \({\bf I}, \pi , X, Z\); thus the view is identical for every ordering. This completes the proof of Claim C.6.□
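The statistical-distance fact used in the last step is that \({\sf SD}(A,B) \le {\sf SD}(A \mid \lnot E, B \mid \lnot E) + \Pr[E]\) whenever both distributions assign the same probability to the bad event \(E\). A toy check with exact rationals, mirroring the \({\sf Overflow}\) conditioning (the distributions and helper `sd` are ours):

```python
from fractions import Fraction as F

def sd(p, q):
    """Statistical (total variation) distance between finite distributions,
    given as dicts mapping outcomes to probabilities."""
    support = set(p) | set(q)
    return sum(abs(p.get(x, F(0)) - q.get(x, F(0))) for x in support) / 2

# Two distributions that assign the same probability to a bad event and are
# identical conditioned on the bad event not happening -- mirroring
# AccessPtrn_3 vs AccessPtrn_4 conditioned on no Overflow.
p_bad = F(1, 100)
A = {"good1": F(1, 2) * (1 - p_bad), "good2": F(1, 2) * (1 - p_bad),
     "bad_a": p_bad}
B = {"good1": F(1, 2) * (1 - p_bad), "good2": F(1, 2) * (1 - p_bad),
     "bad_b": p_bad}
# Conditioned distributions are identical, so SD(A, B) <= Pr[bad event]:
assert sd(A, B) <= p_bad
```

Here the bound is tight: the entire distance is contributed by the bad event, exactly as \(\Pr[{\sf Overflow}]\) accounts for the whole gap between the two access-pattern distributions.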

Experiment \({\sf Hyb}_5\). This experiment is the same as \({\sf Hyb}_4\), except that we run \({\sf Build}\) on input \({\bf I}\) that consists of only \({\sf dummy}\) values.

Recall that in this hybrid experiment the output of \({\sf Extract}\) and \({\sf Lookup}\) is given to the adversary by the functionality, and not by the algorithm. Moreover, regarding the access pattern of \({\sf Build}\): due to the random function, each \({\mathcal {O}}(\texttt{sk} || k_i)\) value is distributed uniformly at random, and therefore the random choices made for the real elements are distributed identically to those made for dummy elements. We conclude that the view of the adversary in \({\sf Hyb}_4(\lambda)\) and \({\sf Hyb}_5(\lambda)\) is identical.

Claim C.7.

\(\Pr [{\sf Hyb}_4(\lambda) = 1] = \Pr [{\sf Hyb}_5(\lambda) = 1]\).

Experiment \({\sf Hyb}_6\). This experiment is the same as \({\sf Hyb}_5\), except that we replace the random oracle \({\mathcal {O}}(\texttt{sk} \Vert \cdot)\) back with \({\sf PRF} _{\texttt{sk} }(\cdot)\) for a PRF key \(\texttt{sk}\).

Observe that this experiment is identical to the ideal execution. Indeed, in the ideal execution the simulator runs the real \({\sf Build}\) operation on input that consists only of dummy elements and has an embedded PRF key. However, this PRF key is never used since we input only dummy elements, and thus the two experiments are identical.

Claim C.8.

\(\Pr [{\sf Hyb}_5(\lambda) = 1 ] = \Pr [{\sf Hyb}_6(\lambda) = 1]\).

By combining Claims C.3–C.8, we have that \({\sf BigHT}\) is \((1 - n^2\cdot e^{-\Omega (\log \lambda \cdot \log \log \lambda)} - \delta _{{\sf PRF}}^{\mathcal {A}})\)-oblivious, which concludes the proof of Theorem 7.2.


C.4 Proof of Security of SmallHT (Theorem 8.6)

We view our construction in a hybrid model, in which we have ideal implementations of the underlying building blocks: an oblivious random permutation (see Section 4.2 and Claim 8.2) and an oblivious Cuckoo assignment (see Section 4.5 and Claim 8.4). Since the two building blocks are ideal, we take into account the failure probability of the Cuckoo assignment, \(e^{-\Omega (\log \lambda \cdot \log \log \lambda)}\), and the failure probability of the random permutation, \(e^{-\Omega (\sqrt {n})} \le e^{-\Omega (\log \lambda \log \log \lambda)}\) since \(n = \log ^9 \lambda\).
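As a sanity check of the inequality \(e^{-\Omega (\sqrt {n})} \le e^{-\Omega (\log \lambda \log \log \lambda)}\) for \(n = \log ^9 \lambda\): it suffices that \(\sqrt{n} = \log^{4.5}\lambda\) dominates \(\log\lambda \cdot \log\log\lambda\). A quick numeric check (base-2 logs, constants inside the \(\Omega\) ignored):

```python
import math

# For n = log^9(lam), sqrt(n) = log^{4.5}(lam).  Check that it dominates
# log(lam) * log log(lam) across a range of values of log(lam).
for log_lam in [8, 16, 32, 64, 128]:
    sqrt_n = log_lam ** 4.5
    rhs = log_lam * math.log2(log_lam)
    assert sqrt_n >= rhs  # the random-permutation failure term is dominated
```

So the random-permutation failure probability is absorbed into the \(e^{-\Omega (\log \lambda \cdot \log \log \lambda)}\) term of the Cuckoo assignment.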

We present a simulator \({\mathsf {Sim}}\) that simulates \({\sf Build},{\sf Lookup}\) and \({\sf Extract}\) procedures of \({\sf SmallHT}\).

Simulating \({\sf Build}\). Upon receiving an instruction to simulate \({\sf Build}\) with security parameter \(1^\lambda\) and a list of size \(n\), the simulator \({\mathsf {Sim}}\) runs the real \({\sf SmallHT}.{\sf Build}\) algorithm on input \(1^\lambda\) and a list that consists of \(n\) dummy elements. It outputs the access pattern of this algorithm. Let \(({\bf Y},{\sf S},\texttt{sk})\) be the output state, where \({\bf Y}\) is an array of size \({\sf c_{\sf cuckoo}}\cdot n\), \({\sf S}\) is a stash of size \(O(\log \lambda)\), and \(\texttt{sk}\) is a secret key used to generate pseudorandom values. The simulator stores this state.

Simulating \({\sf Lookup}\). When the adversary submits a \({\sf Lookup}\) command with a key \(k\), the simulator \({\mathsf {Sim}}\) simulates an execution of the algorithm \({\sf SmallHT}.{\sf Lookup}\) on input \(\bot\) (i.e., a dummy element) with the state \(({\bf Y},{\sf S},\texttt{sk})\) (which was generated while simulating the \({\sf Build}\) operation).

Simulating \({\sf Extract}\). When the adversary submits an \({\sf Extract}\) command, the simulator \({\mathsf {Sim}}\) executes the real \({\sf SmallHT}.{\sf Extract}\) algorithm with its stored internal state \(({\bf Y},{\sf S},\texttt{sk})\).

We proceed to show that no adversary can distinguish between the real and ideal executions. Recall that in the ideal execution, with each command that the adversary outputs, it receives back the output of the functionality and the access pattern of the simulator, where the latter is simulating the access pattern of the execution of the command on dummy elements. On the other hand, in the real execution, the adversary sees the access pattern and the output of the algorithm that implements the functionality. The proof is via a sequence of hybrid experiments.

Experiment \({\sf Hyb}_0(\lambda)\). This is the real execution. With each command that the adversary submits to the experiment, the real algorithm is executed, and the adversary receives the output of the execution together with the access pattern as determined by the execution of the algorithm.

Experiment \({\sf Hyb}_1(\lambda)\). This experiment is the same as \({\sf Hyb}_0\), except that instead of choosing a PRF key \(\texttt{sk}\), we use a truly random function \({\mathcal {O}}\). That is, instead of calling \({\sf PRF} _{\texttt{sk} }(\cdot)\) in Step 3 of \({\sf Build}\) and Step 4 of the function \({\sf Lookup}\), we call \({\mathcal {O}}(\texttt{sk} \Vert \cdot)\).

The following claim states that due to the security of the PRF, experiments \({\sf Hyb}_0\) and \({\sf Hyb}_1\) are computationally indistinguishable. The proof of this claim is standard.

Claim C.9.

For any \(\texttt{PPT}\)adversary \(\mathcal {A}\), it holds that \(\begin{align*} | \Pr [{\sf Hyb}_0(\lambda) = 1] - \Pr [{\sf Hyb}_1(\lambda) = 1] | \le \delta _{{\sf PRF}}^{{\mathcal {A}}}(\lambda). \end{align*}\)

Experiment \({\sf Hyb}_2(\lambda)\). This experiment is the same as \({\sf Hyb}_1(\lambda)\), except that with each command that the adversary submits to the experiment, both the real algorithm and the functionality are executed. The adversary receives the access pattern of the execution of the algorithm, yet the output comes from the functionality.

In the following claim, we show that the initial secret permutation and the random oracle guarantee that experiments \({\sf Hyb}_1\) and \({\sf Hyb}_2\) are identical.

Claim C.10.

\(\Pr \left[{\sf Hyb}_1(\lambda) = 1 \right] = \Pr \left[{\sf Hyb}_2(\lambda) = 1 \right]\).

Proof.

Recall that we assume that the lookup queries of the adversary are non-recurring. Our goal is to show that the output distribution of the \({\sf Extract}\) procedure is a uniform permutation of the unvisited items even given the access pattern of the previous \({\sf Build}\) and \({\sf Lookup}\) operations. By doing so, we can replace the \({\sf Extract}\) procedure with the ideal \({{\mathcal {F}}_{\sf HT}^{n}}.{\sf Extract}\) functionality, which is exactly the difference between \({\sf Hyb}_1(\lambda)\) and \({\sf Hyb}_2(\lambda)\).

Consider a sequence of operations that the adversary makes. Let us denote by \({\bf I}\) the set of elements with which it invokes \({\sf Build}\) and by \(k^*_1,\ldots ,k^*_m\) the set of keys with which it invokes \({\sf Lookup}\). Finally, it invokes \({\sf Extract}\). We first argue that the output of \({{\mathcal {F}}_{\sf HT}^{n}}.{\sf Extract}\) consists of the same elements as that of \({\sf Extract}\). Indeed, both \({{\mathcal {F}}_{\sf HT}^{n}}.{\sf Lookup}\) and \({\sf SmallHT}.{\sf Lookup}\) remove every visited item, so when we execute \({\sf Extract}\), the same set of elements will be in the output.

We need to argue that the distribution of the permutation of unvisited items in the input of \({\sf Extract}\) is uniformly random. This is enough since \({\sf Extract}\) performs \({\sf IntersperseRD}\) which shuffles the reals and dummies to obtain a uniformly random permutation overall (given that the reals were randomly shuffled to begin with). Fix an access pattern observed during the execution of \({\sf Build}\) and \({\sf Lookup}\). We show, by programming the random oracle and the initial permutation appropriately (while not changing the access pattern), that the permutation of the unvisited elements is uniformly distributed.

Consider tuples of the form \((\pi _{\sf in},{\mathcal {O}},R,{\mathcal {T}},\pi _{\sf out})\), where (1) \(\pi _{\sf in}\) is the permutation performed on \({\bf I}\) by the input assumption (prior to \({\sf Build}\)), (2) \({\mathcal {O}}\) is the random oracle, (3) \(R\) is the internal randomness of all intermediate procedures (such as \({\sf IntersperseRD}\), Algorithms 8.1 and 8.3, etc); (4) \({\mathcal {T}}\) is the access pattern of the entire sequence of commands \(({\sf Build}({\bf I}),{\sf Lookup}(k^*_1),\ldots ,{\sf Lookup}(k^*_m))\), and (5) \(\pi _{\sf out}\) is the permutation on \({\bf I}^{\prime } = \lbrace (k,v) \in {\bf I}\mid k \notin \lbrace k^*_1,\ldots ,k^*_m\rbrace \rbrace\) which is the input to \({\sf Extract}\). The algorithm defines a deterministic mapping \(\psi _R(\pi _{\sf in},{\mathcal {O}}) \rightarrow ({\mathcal {T}},\pi _{\sf out})\).

To gain intuition, consider arbitrary \(R\), \(\pi _{\sf in}\), and \({\mathcal {O}}\) such that \(\psi _R(\pi _{\sf in},{\mathcal {O}})\rightarrow ({\mathcal {T}},\pi _{\sf out})\) and two distinct existing keys \(k_i\) and \(k_j\) that are not queried during the Lookup stage (i.e., \(k_i,k_j \notin \lbrace k^*_1,\ldots ,k^*_m\rbrace\)). We argue that from the point of view of the adversary, having seen the access pattern and all query results, it cannot distinguish whether \(\pi _{\sf out}(i) \lt \pi _{\sf out}(j)\) or \(\pi _{\sf out}(i) \gt \pi _{\sf out}(j)\). The argument will naturally generalize to arbitrary unqueried keys and an arbitrary ordering.

To this end, we show that there is \(\pi _{\sf in}^{\prime }\) and \({\mathcal {O}}^{\prime }\) such that \(\psi _R(\pi ^{\prime }_{\sf in},{\mathcal {O}}^{\prime })\rightarrow ({\mathcal {T}},\pi ^{\prime }_{\sf out})\), where \(\pi ^{\prime }_{\sf out}(\ell) = \pi _{\sf out}(\ell)\) for every \(\ell \notin \lbrace i,j\rbrace\), and \(\pi ^{\prime }_{\sf out}(i) = \pi _{\sf out}(j)\) and \(\pi ^{\prime }_{\sf out}(j) = \pi _{\sf out}(i)\). The permutation \(\pi ^{\prime }_{\sf in}\) is the same as \(\pi _{\sf in}\) except that \(\pi ^{\prime }_{\sf in}(i) = \pi _{\sf in}(j)\) and \(\pi ^{\prime }_{\sf in}(j) = \pi _{\sf in}(i)\), and \({\mathcal {O}}^{\prime }\) is the same as \({\mathcal {O}}\) except that \({\mathcal {O}}^{\prime }(k_i) = {\mathcal {O}}(k_j)\) and \({\mathcal {O}}^{\prime }(k_j) = {\mathcal {O}}(k_i)\). The fact that the access pattern after this modification remains the same stems from the indiscriminate property of the hash table construction procedure which says that the Cuckoo hash assignments are a fixed function of the two choices of all elements (i.e., \({\bf MD}_{\bf X}\)), independently of their real key value (i.e., the procedure does not discriminate elements based on their keys in the input array). Note that the mapping is also reversible, so by symmetry all permutations have the same number of configurations of \(\pi _{\sf in}\) and \({\mathcal {O}}\).

For the general case, one can switch from any \(\pi _{\sf out}\) to any (legal) \(\pi ^{\prime }_{\sf out}\) by changing only \(\pi _{\sf in}\) and \({\mathcal {O}}\) at locations that correspond to unvisited items. We define \(\begin{align*} \pi _{\sf in}^{\prime }(i) = \pi _{\sf in}({\pi _{\sf out}}^{-1}({\pi _{\sf out}^{\prime }}(i))) \quad \text{ and } \quad {\mathcal {O}}^{\prime }(k_i) = {\mathcal {O}}(k_{\pi _{\sf in}({\pi _{\sf out}}^{-1}({\pi _{\sf out}^{\prime }}(i)))}). \end{align*}\) Due to the indiscriminate property, this choice of \(\pi _{\sf in}^{\prime }\) and \({\mathcal {O}}^{\prime }\) does not change the observed access pattern and results in the output permutation \(\pi ^{\prime }_{\sf out}\), as required. By symmetry, the resulting mapping between different \((\pi _{\sf in}^{\prime },{\mathcal {O}}^{\prime })\) and \(\pi _{\sf out}^{\prime }\) is regular (i.e., each output permutation can be reached in the same number of ways), which completes the proof.□

Experiment \({\sf Hyb}_3(\lambda)\). This experiment is the same as \({\sf Hyb}_2(\lambda)\), except that we modify the definition of \({\sf Extract}\) to output a list of \(n\) dummy elements. We also stop marking elements that were searched for during \({\sf Lookup}\).

Recall that in this hybrid experiment the output of \({\sf Extract}\) is given to the adversary by the functionality, and not by the algorithm. Thus, the change we made does not affect the view of the adversary, which means that experiments \({\sf Hyb}_2\) and \({\sf Hyb}_3\) are identical.

Claim C.11.

\(\Pr [{\sf Hyb}_2(\lambda) = 1] = \Pr [{\sf Hyb}_3(\lambda) = 1]\).

Experiment \({\sf Hyb}_4(\lambda)\). This experiment is identical to experiment \({\sf Hyb}_3(\lambda)\), except that when the adversary submits the command \({\sf Lookup}(k)\) with key \(k\), we ignore \(k\) and run \({\sf Lookup}(\bot)\).

Recall that the output of the procedure is determined by the functionality and not the algorithm. By construction, the access pattern observed by the adversary in this experiment is identical to the one observed in \({\sf Hyb}_3(\lambda)\) (recall that we have already switched the PRF to a truly random function).

Claim C.12.

\(\Pr [{\sf Hyb}_3(\lambda) = 1] = \Pr [{\sf Hyb}_4(\lambda) = 1]\).

Experiment \({\sf Hyb}_5\).. This experiment is the same as \({\sf Hyb}_4\), except that we run \({\sf Build}\) on input \({\bf I}\), which consists of only \({\sf dummy}\) values.

Recall that in this hybrid experiment the output of \({\sf Extract}\) and \({\sf Lookup}\) is given to the adversary by the functionality, and not by the algorithm. Moreover, due to the random oracle and the obliviousness of all the underlying building blocks of \({\sf Build}\) (oblivious Cuckoo hash, oblivious random permutation, oblivious tight compaction, \({\sf IntersperseRD}\), oblivious bin assignment, and oblivious sorting), the access pattern of \({\sf Build}\) is simulatable, and the view of the adversary in \({\sf Hyb}_4(\lambda)\) and \({\sf Hyb}_5(\lambda)\) is identical.

Claim C.13.

\(\Pr [{\sf Hyb}_4(\lambda) = 1] = \Pr [{\sf Hyb}_5(\lambda) = 1]\).

Experiment \({\sf Hyb}_6\).. This experiment is the same as \({\sf Hyb}_5\), except that we replace the random oracle \({\mathcal {O}}(\texttt{sk} \Vert \cdot)\) with a PRF key \(\texttt{sk}\).

Observe that this experiment is identical to the ideal execution. Indeed, in the ideal execution, the simulator runs the real \({\sf Build}\) operation on input that consists only of dummy elements and has an embedded PRF key. However, this PRF key is never used since we input only dummy elements, and thus the two experiments are identical.

Claim C.14.

\(\Pr [{\sf Hyb}_5(\lambda) = 1] = \Pr [{\sf Hyb}_6(\lambda) = 1]\).

By combining Claims C.9 through C.14, \({\sf SmallHT}\) is \((1-e^{-\Omega (\log \lambda \cdot \log \log \lambda)} - \delta _{{\sf PRF}}^{{\mathcal {A}}})\)-oblivious, which concludes the proof of Theorem 8.6.


C.5 Proof of Security of CombHT (Theorem 8.8)

We present a sequence of hybrid constructions (where we ignore the failure probability of the primitives we use) and show that each one of them obliviously implements Functionality 4.7. Afterwards, we will account for the security loss of each modification and/or primitive we use.

Construction I: This construction is the same as Construction 7.1, except that we replace each \({{\sf naïveHT}}\) with \({\sf SmallHT}\). Let \({\sf S}_1,\ldots ,{\sf S}_B\) denote the small stashes of the bins \({\sf OBin}_1,\ldots ,{\sf OBin}_B\), respectively.

Construction II: This construction is the same as Construction I, except for the following (inefficient) modification. Instead of searching for the key \(k_i\) in the small stash \({\sf S}_i\) of one of the bins of \({\sf SmallHT}\), we search for \(k_i\) in all small stashes \({\sf S}_1,\ldots ,{\sf S}_B\) in order.

Construction III: This construction is the same as Construction II, except that we modify \({\sf Build}\) as follows. We merge all the small stashes \({\sf S}_1,\ldots ,{\sf S}_B\) into one long list. As in Construction II, when we have to access one of the stashes and look for a key \(k_i\), we perform a linear scan in this list, searching for \(k_i\).

Construction IV: This construction is the same as Construction III, except that we make the search in the merged set of stashes more efficient. In \({\sf CombHT}.{\sf Build}\) we construct an oblivious Cuckoo hashing scheme as in Theorem 4.14 on the elements in the combined set of stashes. The resulting structure is called \({\sf Comb}{\sf S}= ({\sf Comb}{\sf S}_{\sf T},{\sf Comb}{\sf S}_{\sf S})\) and it is composed of a main table and a stash. Observe that this construction is identical to Construction 8.7.

Construction II is the same as Construction I, except that there is a blowup in the access pattern of each \({\sf Lookup}\), caused by performing a linear scan of all elements in all stashes. In terms of functionality, by construction the input/output behavior of the two constructions is exactly the same. For obliviousness, one can simulate the linear scan of all the stashes by performing a fake linear scan. More formally, there exists a 1-to-1 mapping between the access pattern produced by Construction I and the access pattern produced by Construction II. Thus, there is no security loss from Construction I to Construction II.
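As a toy illustration of why the fake linear scan is a perfect simulation, the following Python sketch (hypothetical names; a lookup for \(\bot\), modeled as `None`, plays the role of the fake scan) shows that the sequence of probed indices is the same whether or not a real key is searched for.

```python
def stash_lookup(stash, key, trace):
    """Linear-scan lookup over a merged stash. The probed indices
    (appended to `trace`, modeling the observable access pattern)
    depend only on len(stash) and never on `key`."""
    found = None
    for i, (k, v) in enumerate(stash):
        trace.append(i)                    # every slot is touched, always
        if key is not None and k == key:   # secret comparison, done in-CPU
            found = v
    return found

stash = [(10, 'a'), (11, 'b'), (12, 'c')]
real_trace, fake_trace = [], []
assert stash_lookup(stash, 11, real_trace) == 'b'
assert stash_lookup(stash, None, fake_trace) is None  # fake (dummy) scan
assert real_trace == fake_trace                       # identical patterns
```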

Construction III and Construction II are the same, except for a cosmetic modification in the locations where we put the elements from the stashes. Thus, there is no security loss from Construction II to Construction III.

Construction IV is the same as Construction III, except that we apply an oblivious Cuckoo hash on the merged set of stashes \(\cup _{i\in [B]} {\sf S}_i\) to improve the \({\sf Lookup}\) time (compared to a linear scan).

Lastly, we analyze the total security loss. We first implemented all the major bins with \({\sf SmallHT}\). Since there are \(n/\mu \lt n\) bins, by Theorem 8.6, this amounts to a total security loss of \(n\cdot e^{-\Omega (\log \lambda \log \log \lambda)} + \delta _{{\sf PRF}}^{\mathcal {A}}\). We lose an additional \(e^{-\Omega (\log \lambda \cdot \log \log \lambda)} + \delta _{{\sf PRF}}^{\mathcal {A}}\) additive term by implementing the merged stashes using an oblivious Cuckoo hash. Also, recall that the original security loss of \({\sf BigHT}\) was already \(n^2 \cdot e^{-\Omega (\log \lambda \cdot \log \log \lambda)} + \delta _{{\sf PRF}}^{\mathcal {A}}\). In conclusion, the final construction \((1- n^2 \cdot e^{-\Omega (\log \lambda \cdot \log \log \lambda)} - \delta _{{\sf PRF}}^{\mathcal {A}})\)-obliviously implements Functionality 4.7.


C.6 Proof of Security of ORAM (Theorem 9.2)

We show the existence of a simulator for which, for any sequence of operations \({\sf Access}(\mathsf {op} _1,{\mathsf {addr}} _1,{\mathsf {data}} _1),\ldots ,{\sf Access}(\mathsf {op} _n,{\mathsf {addr}} _n,{\mathsf {data}} _n)\), the joint distribution of the output of the simulator and the output of the functionality is indistinguishable from the access pattern of the construction and the output of the construction. We show this by a sequence of intermediate constructions.

Construction 1.. Our starting point is a construction in the \(({{\mathcal {F}}_{\sf HT}},{\mathcal {F}}_{\sf Dict},{{\mathcal {F}}_{\sf Shuffle}^{}},{{\mathcal {F}}_{\sf compaction}})\)-hybrid model which is slightly different from Construction 9.1. In this construction, each level \(T_{\ell +1},\ldots ,T_L\) is implemented using the ideal functionality \({{\mathcal {F}}_{\sf HT}}\) (of the respective size). The dictionary \(D\) is implemented using the ideal functionality \({\mathcal {F}}_{\sf Dict}\). Steps 6d and 6(d)ii are implemented using \({{\mathcal {F}}_{\sf Shuffle}^{}}\) of the respective size, and the compaction in Step 6(d)i is implemented using \({{\mathcal {F}}_{\sf compaction}}\). Note that in this construction, Step 6(e)iv is invalid, as the \({{\mathcal {F}}_{\sf HT}}\) functionality is not necessarily implemented using stashes. This construction boils down to the construction of Goldreich and Ostrovsky using ideal implementations of \(({{\mathcal {F}}_{\sf HT}},{\mathcal {F}}_{\sf Dict},{{\mathcal {F}}_{\sf Shuffle}^{}},{{\mathcal {F}}_{\sf compaction}})\). For completeness, we provide a full description:

The construction: Let \(\ell =11\log \log \lambda\) and \(L=\log N\). The internal state includes a handle \(D\) to \({\mathcal {F}}_{\sf Dict}^{2^\ell }\), handles \(T_{\ell +1},\ldots ,T_L\) to \({{\mathcal {F}}_{\sf HT}}^{2^{\ell +1},N},\ldots ,{{\mathcal {F}}_{\sf HT}}^{2^L,N}\), respectively, a counter \({\sf ctr}\), and flags \({\sf full} _{\ell +1},\ldots , {\sf full} _{L}\). Upon receiving a command \({\sf Access}(\mathsf {op},{\mathsf {addr}},{\mathsf {data}})\):

(1)

Initialize \(\mathsf {found} ~{\sf :=}~ \mathsf {false}\), \({\mathsf {data}} ^* ~{\sf :=}~ \bot\), \({\sf levelIndex} ~{\sf :=}~ \bot\) and \({\sf whichStash} ~{\sf :=}~ \bot\).

(2)

Perform \(\mathsf {fetched} ~{\sf :=}~ D.{\sf Lookup}({\mathsf {addr}})\). If \(\mathsf {fetched} \ne \bot\):

(a)

Interpret \(\mathsf {fetched}\) as \(({\sf levelIndex},{\sf whichStash},\mathsf {data^*})\).

(b)

If \({\sf levelIndex} = \ell\), then set \(\mathsf {found} ~{\sf :=}~ \mathsf {true}\).

(3)

For each \(i \in \lbrace \ell +1,\ldots ,L\rbrace\) in increasing order, do:

(a)

If \(\mathsf {found} = \mathsf {false}\), run \(\mathsf {fetched} ~{\sf :=}~ T_i.{\sf Lookup}({\mathsf {addr}})\). If \(\mathsf {fetched} \ne \bot\), set \(\mathsf {found} ~{\sf :=}~ \mathsf {true}\) and \(\mathsf {data^*} ~{\sf :=}~ \mathsf {fetched}\).

(b)

Else, \(T_i.{\sf Lookup}(\bot)\).

(4)

If \(\mathsf {found} = \mathsf {false}\), i.e., this is the first time \({\mathsf {addr}}\) is being accessed, set \(\mathsf {data^*} = 0\).

(5)

Let \((k,v) ~{\sf :=}~ ({\mathsf {addr}}, {\mathsf {data}} ^*)\) if \(\mathsf {op} ={\mathsf {read}}\); else let \((k,v) ~{\sf :=}~ ({\mathsf {addr}}, {\mathsf {data}})\). Insert \((k,(\ell ,\bot ,v))\) into the oblivious dictionary \(D\) using \(D.{{\sf Insert}}(k, (\ell ,\bot , v))\).

(6)

Increment \({\sf ctr}\) by 1. If \({\sf ctr} \equiv 0 \mod {2}^\ell\), perform the following.

(a)

Let \(j\) be the smallest level index such that \({\sf full} _j = 0\) (i.e., empty). If all levels are marked full, then \(j ~{\sf :=}~ L\). In other words, \(j\) is the target level to be rebuilt.

(b)

Let \(\widetilde{D} := D.{\sf Extract}()\). Let \(D_1\) be a copy of \(\widetilde{D}\) preserving only elements whose \({\sf levelIndex}\) is at most \(j-1\), and all other elements are marked dummy. Let \(D_2\) be a copy of \(\widetilde{D}\) preserving only elements whose \({\sf levelIndex}\) is greater than \(j-1\), and all other elements are marked as dummy.22

(c)

Let \({\bf U} ~{\sf :=}~ D_1 \Vert T_{\ell +1}.{\sf Extract}() \Vert \ldots \Vert T_{j-1}.{\sf Extract}()\) and set \(j^* ~{\sf :=}~ j-1\). If all levels are marked full, then additionally let \({\bf U} ~{\sf :=}~ {\bf U}\Vert T_L.{\sf Extract}()\) and set \(j^* ~{\sf :=}~ L\).

(d)

Run \({{\mathcal {F}}_{\sf Shuffle}^{}}({\bf U})\). Denote the output by \(\tilde{\bf U}\). If \(j = L\), then additionally do the following to shrink \(\tilde{\bf U}\) to size \(N=2^L\):

(i)

Run \({{\mathcal {F}}_{\sf compaction}}(\tilde{\bf U})\) moving all real elements to the front. Truncate \(\tilde{\bf U}\) to length \(N\).

(ii)

Run \(\tilde{\bf U} \leftarrow {{\mathcal {F}}_{\sf Shuffle}^{N}}(\tilde{\bf U})\).

(e)

Rebuild the \(j\)th hash table with the \(2^j\) elements from \(\tilde{\bf U}\) by calling \({{\mathcal {F}}_{\sf HT}}.{\sf Build}(\tilde{\bf U})\).

(f)

Initialize a new dictionary \(D\) and for each tuple \(e \in D_2\), run \(D.{\sf Insert}(e)\). Mark \({\sf full} _j ~{\sf :=}~ 1\).

(g)

For \(i\in \lbrace \ell +1,\ldots ,j-1\rbrace\), reset \(T_i\) to an empty structure and set \({\sf full} _i ~{\sf :=}~ 0\).

(7)

Output \({\mathsf {data}} ^*\).
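To make the control flow of the steps above concrete, here is a minimal Python model (all class and variable names are ours, purely illustrative). The ideal functionalities \({\mathcal {F}}_{\sf Dict}\) and \({{\mathcal {F}}_{\sf HT}}\) are modeled as plain Python dictionaries, so the sketch captures only the functionality and the rebuild schedule driven by \({\sf ctr}\), not obliviousness; the shuffle and compaction of Step 6d are elided, and a found element is simply removed from its level rather than left as a stale copy.

```python
# Toy functional model of Construction 1 (hypothetical names; no
# obliviousness is modeled, only Access control flow and rebuilds).
class HierarchicalORAM:
    def __init__(self, ell, L):
        self.ell, self.L = ell, L
        self.D = {}                                       # F_Dict (capacity 2^ell)
        self.T = {i: {} for i in range(ell + 1, L + 1)}   # F_HT per level
        self.full = {i: 0 for i in range(ell + 1, L + 1)}
        self.ctr = 0

    def access(self, op, addr, data=None):
        found, data_star = False, None
        fetched = self.D.get(addr)                        # Step 2
        if fetched is not None:
            level_index, _which_stash, data_star = fetched
            found = (level_index == self.ell)             # always holds here
        for i in range(self.ell + 1, self.L + 1):         # Step 3: visit every
            if not found and addr in self.T[i]:           # level, found or not
                found, data_star = True, self.T[i].pop(addr)
            # else: T_i.Lookup(bot), a dummy probe in the real construction
        if not found:
            data_star = 0                                 # Step 4: first access
        v = data_star if op == 'read' else data
        self.D[addr] = (self.ell, None, v)                # Step 5
        self.ctr += 1                                     # Step 6
        if self.ctr % (1 << self.ell) == 0:
            self._rebuild()
        return data_star                                  # Step 7

    def _rebuild(self):
        levels = range(self.ell + 1, self.L + 1)
        all_full = all(self.full[i] for i in levels)
        j = self.L if all_full else next(i for i in levels if not self.full[i])
        # Steps 6b-6c: merge D with all levels below j (plus T_L if everything
        # was full); the shuffle/compaction of Step 6d is elided in this model.
        U = {addr: v for addr, (_li, _ws, v) in self.D.items()}
        for i in levels:
            if i < j or (all_full and i == self.L):
                U.update(self.T[i])
                self.T[i], self.full[i] = {}, 0
        self.T[j], self.full[j] = U, 1                    # Step 6e: F_HT.Build
        self.D = {}                                       # Step 6f: D_2 is empty

oram = HierarchicalORAM(ell=1, L=3)
oram.access('write', addr=7, data='x')
assert oram.access('read', addr=7) == 'x'
```

Note that every `access` visits all levels (Step 3), and a rebuild is triggered exactly when `ctr` is a multiple of \(2^\ell\), mirroring the deterministic schedule exploited by the simulator in the proof of Claim C.15.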

Claim C.15.

Construction 1 is perfectly oblivious and implements Functionality 3.4 (\({\mathcal {F}}_{\sf ORAM}\)).

Proof.

Since the functionality \({\mathcal {F}}_{\sf ORAM}\) is deterministic, it suffices to show that the construction is correct (i.e., it computes the same output as the ideal functionality), and to present a simulator that produces an access pattern that is identically distributed to the one produced by the real construction.

Correctness is straightforward, as all sub-algorithms are ideal implementations. Note that all input assumptions are guaranteed; e.g., we never look up the same element more than once until the table is rebuilt.

The simulator \({\mathsf {Sim}}\) runs the algorithm \({\sf Access}\) on dummy values. In more detail, it maintains an internal secret state that consists of handles to ideal implementations of the dictionary \(D\), the hash tables \(T_{\ell +1},\ldots ,T_L\), bits \({\sf full} _{\ell +1},\ldots , {\sf full} _{L}\), and a counter \({\sf ctr}\), exactly as in the real construction. Upon receiving a command \({\sf Access}(\bot ,\bot ,\bot)\), the simulator runs Construction 1 on input \((\bot ,\bot ,\bot)\). By definition of the algorithm, the access pattern (in particular, which ideal functionalities are being invoked with each \({\sf Access}\)) is completely determined by the internal state \({\sf ctr}, {\sf full} _{\ell +1},\ldots , {\sf full} _{L}\). Moreover, the change of these counters is deterministic and is the same in both the real and ideal executions. As a result, the real algorithm and the simulator perform the exact same calls to the internal ideal functionalities with each \({\sf Access}\). In particular, it is important to note that \({\sf Lookup}\) is invoked on all levels regardless of the level in which the element was found, and the level that is being rebuilt is completely determined by the value of \({\sf ctr}\). Moreover, the construction preserves the restriction of the functionality \({{\mathcal {F}}_{\sf HT}}\) that any key is searched for only once between two calls to \({\sf Build}\).□

Given that Construction 1 obliviously implements Functionality 3.4, we proceed with a sequence of constructions and show that each of them also implements Functionality 3.4.

Construction 2. This is the same as Construction 1, where we instantiate \({{\mathcal {F}}_{\sf Shuffle}^{}}\) and \({{\mathcal {F}}_{\sf compaction}}\) with the real implementations. Explicitly, we instantiate \({{\mathcal {F}}_{\sf Shuffle}^{}}\) in Step 6d of Construction 1 with \({\sf Intersperse}^{(j^*-\ell)}\) (Algorithm 6.4), instantiate \({{\mathcal {F}}_{\sf Shuffle}^{}}\) in Step 6(d)ii with \({\sf IntersperseRD}\) (Algorithm 6.6), and instantiate \({{\mathcal {F}}_{\sf compaction}}\) with the algorithm for tight compaction (Theorem 5.1). Note that at this point, the hash tables \(T_{\ell +1},\ldots ,T_L\) are still implemented using the ideal functionality \({{\mathcal {F}}_{\sf HT}}\), as is \(D\), which uses \({\mathcal {F}}_{\sf Dict}\).

Construction 3. In this construction, we follow Construction 2 but instantiate \({{\mathcal {F}}_{\sf HT}}\) with Construction 8.7 (i.e., \({\sf CombHT}\) from Theorem 8.8). Note that we do not combine the stashes yet. That is, we simply replace Step 6e (as \({\sf Build}\)), Step 3 (as \({\sf Lookup}\)) and Step 6c (as \({\sf Extract}()\)) in Construction 1 with the implementation of Construction 8.7 instead of the ideal functionality \({{\mathcal {F}}_{\sf HT}}\).

Construction 4. In this construction, we follow Construction 3 but change Step 6e (in Construction 1) as the corresponding step in Construction 9.1: We add all elements in \(D_2\), \({\sf OF}_{\sf S}\), and \({\sf Comb}{\sf S}_{\sf S}\) into a newly initialized \(D\), marked with their level index and what stash they are coming from (where \({\sf whichStash}={\sf overflow}\) in case that the element comes from \({\sf OF}_{\sf S}\), and \({\sf whichStash}={\sf stashes}\) in case that the element comes from \({\sf Comb}{\sf S}_{\sf S}\)). Moreover, we change the construction of \({\sf CombHT}.{\sf Extract}()\) to ignore the stashes, as in Step 6c in Construction 9.1.

Note that here, except for the smallest level \(\ell\), we are not yet using \(D\) for lookups in any other level, and the stashes \({\sf OF}_{\sf S}\) and \({\sf Comb}{\sf S}_{\sf S}\) are still being used. Moreover, note that now each element in the stash of some level \(i\) has two copies: one in \(D\), and one in the stash of level \(i\). In \({\sf Lookup}\), the element will be found in \(D\), but we ignore it except for the smallest level \(\ell\); we will find it again when we visit level \(i\). When rebuilding the level, we ignore the copy in the stash of level \(i\), and use the copy that is in \(D\).

Construction 5. In this construction, we follow Construction 4, but make the following change. In Step 3 of Construction 1, we modify the \({\sf Lookup}\) procedure to be that of Construction 8.7. That is, instead of accessing \({\sf OF}_{\sf S}\) and \({\sf Comb}{\sf S}_{\sf S}\), we perform the lookup using the stored values \({\sf levelIndex}\) and \({\sf whichStash}\). Basically, in Construction 5, we are no longer using \({\sf OF}_{\sf S}\) and \({\sf Comb}{\sf S}_{\sf S}\) (when they are built, we copy their content to \(D\); in \({\sf Lookup}\), we first look in \(D\) and, if found, we continue to look in all levels until the level to which the element belongs; we do not access the stashes of each level).

Construction 6. This is the same as Construction 5, where we replace the ideal implementation \({\mathcal {F}}_{\sf Dict}\) of the dictionary \(D\) with the real perfect oblivious dictionary (Corollary 4.17). Note that this is exactly Construction 9.1.

The theorem is obtained using a sequence of simple claims, given that Construction 1 implements \({\mathcal {F}}_{\sf ORAM}\).

Claim C.16.

Construction 2 perfectly-obliviously implements Functionality 3.4.

Proof.

This follows by composing Claims 6.7, 6.5 and Theorem 5.1. It is important to note that the input assumptions are preserved, and therefore we can replace the functionality with the respective algorithm:

We invoke \({\sf Intersperse}^{(j^*-\ell)}\) (in Step 6d instead of \({{\mathcal {F}}_{\sf Shuffle}^{}}\)) on arrays that are output of \({\sf Extract}\) and therefore are randomly shuffled, maintaining the input assumption of Algorithm 6.4.

We invoke \({\sf IntersperseRD}\) (in Step 6(d)ii) on an array in which the real elements are randomly shuffled, as this is an output of compaction on a randomly shuffled array. Therefore, this maintains the input assumption of Algorithm 6.6.

Claim C.17.

Construction 3 \((1-T\cdot N^2\cdot e^{-\Omega (\log \lambda \log \log \lambda)} - \delta ^{{\mathcal {A}}}_{{\sf PRF}})\)-obliviously implements Functionality 3.4.

Proof.

For any \(T \in \mathbb {N}\) accesses, Construction 3 instantiates \(O(T \cdot \log N)\) instances of \({\sf CombHT}\). The input to \({\sf CombHT}.{\sf Build}\) in Step 6e is always randomly permuted, as it is an output of \({\sf IntersperseRD}\). By Theorem 8.8, each instantiation of \({\sf CombHT}\) incurs failure probability \(N^2\cdot e^{-\Omega (\log \lambda \log \log \lambda)} + \delta ^{{\mathcal {A}}}_{{\sf PRF}}\). Moreover, Construction 3 instantiates \(O(N)\) instances of \({\sf CombHT}\) per \(N\) requests. Hence, using composition and taking a union bound over the \(T\) requests, Construction 3 is \((1-T\cdot N^2 \cdot e^{-\Omega (\log \lambda \log \log \lambda)} - \delta ^{{\mathcal {A}}}_{{\sf PRF}})\)-oblivious.□

Claim C.18.

Construction 4 \((1-T\cdot N^2\cdot e^{-\Omega (\log \lambda \log \log \lambda)} - \delta ^{{\mathcal {A}}}_{{\sf PRF}})\)-obliviously implements Functionality 3.4.

Proof.

The only difference from Construction 3 is that more elements are added into \(D\); however, except with probability \(T\cdot N^2\cdot e^{-\Omega (\log \lambda \log \log \lambda)}\), the size of \(D\) never exceeds its capacity \(\log ^{11}{\lambda }+ \log N \log {\lambda }\). This is because there are \(O(\log N)\) levels, each with its own stash of size \(O(\log {\lambda })\). The probability that a level containing \(n\) elements overflows its stash of size \(s\) is bounded by \(n^{-\Theta (s)}\); since the smallest level is of size \(2(\log {\lambda })^{11}\), each stash remains of size \(O(\log {\lambda })\) except with probability \(e^{-\Omega (\log {\lambda }\cdot \log \log {\lambda })}\). By a union bound over all \(O(\log N)\) levels and all \(T\) requests of the ORAM, the number of elements in all stashes combined does not exceed \(\log N \log {\lambda }\) except with probability \(T \cdot N^2 \cdot e^{-\Omega (\log {\lambda }\cdot \log \log {\lambda })}\). Moreover, \(D\) contains at most \(\log ^{11}{\lambda }\) elements from level \(\ell\).

Moreover, note that we do not consider these added elements to \(D\) when we perform lookups (the construction will find them in \(D\), but they will be ignored and found again in the relevant level). In fact, we claim that the stash of each level appears twice: once in that level, and once in \(D\). To see this, observe that whenever we rebuild level \(j\), we take \(D\) and split it into two: \(D_1\) contains all the stashes of levels \(\lt j\), and \(D_2\) contains all the stashes of levels \(\gt j\). The dictionary \(D_1\) is used for building level \(j\), whereas all elements in \(D_2\) are copied into a new dictionary \(D\). Thus, when we rebuild level \(j\), all the elements and all stashes of levels \(\ell ,\ldots ,j-1\) appear in level \(j\) after building the level, whereas all stashes of levels \(j+1,\ldots ,L\) remain in \(D\). Note that when we rebuild a level, \({\sf Extract}\) ignores the elements in the stashes, and therefore we do not maintain multiple copies of each element. An element copied into \(D\) when rebuilding level \(j\) will be found during lookup in level \(j\), and the copy in level \(j\) will be destroyed when the level is rebuilt.

In terms of functionality, we compute exactly the same input/output behavior as in Construction 3. As for the access pattern, the change consists only of adding more accesses to \(D\) when rebuilding a level, and omitting accesses to the stashes when rebuilding a level. These are deterministic changes to the access pattern, since \(D\) is realized by \(\mathcal {F}_{\sf Dict}\) at this point. We therefore conclude that Construction 4 obliviously implements Functionality 3.4.□

Claim C.19.

Construction 5 \((1-T\cdot N^2\cdot e^{-\Omega (\log \lambda \log \log \lambda)} - \delta ^{{\mathcal {A}}}_{{\sf PRF}})\)-obliviously implements Functionality 3.4.

Proof.

The construction is the same as Construction 4, except that instead of searching each level for the elements in the stashes \({\sf OF}_{\sf S}\) and \({\sf Comb}{\sf S}_{\sf S}\), we look at the stored values \({\sf levelIndex}\), \({\sf whichStash}\), and \({\mathsf {data}} ^*\). If one of the elements appears in one of the stashes, it also appears in the dictionary \(D\), as we copy all those elements into \(D\) when building the level. The stashes of each level are never accessed after their elements have been copied into \(D\) during the rebuild.

In terms of functionality, the construction has the exact same input/output behavior as Construction 4. In terms of the access pattern, we skip visiting the stashes in each level, which merely omits a deterministic and well-defined part of the access pattern. Note that until reaching level \({\sf levelIndex}\), we look for the correct key and pretend that the element has not been found yet. At level \({\sf levelIndex}\), we simulate exactly the case where the element is found in that level, in the correct stash of that level. Thus, \(D\) behaves as a logical extension of the stashes of each level. The change in the access pattern is deterministic, as we simply do not visit the stashes in each level, and therefore the construction implements Functionality 3.4.□

Claim C.20.

Construction 6 \((1-T\cdot N^2\cdot e^{-\Omega (\log \lambda \log \log \lambda)} - \delta ^{{\mathcal {A}}}_{{\sf PRF}})\)-obliviously implements Functionality 3.4.

Proof.

As Construction 5 is a construction in the \({\mathcal {F}}_{\sf Dict}\)-hybrid model, substituting \({\mathcal {F}}_{\sf Dict}\) with the perfect oblivious dictionary (Corollary 4.17) incurs no security loss. Hence, the claim follows by using composition.□

This completes the proof of obliviousness (Theorem 9.2).

ACKNOWLEDGMENTS

We are grateful to Hubert Chan, Kai-Min Chung, Yue Guo, and Rafael Pass for helpful discussions. We thank Brett Hemenway Falk, Daniel Noble, and Rafail Ostrovsky for pointing out a gap in a previous version of this work.

Footnotes

1. An ORAM scheme is online if it supports accesses arriving in an online manner, one by one. Almost all known schemes have this property.

2. Note that for the (sub-)exponential security regime, e.g., failure probability of \(2^{-\lambda }\) or \(2^{-\lambda ^\epsilon }\) for some \(\epsilon \in (0, 1)\), perfectly secure ORAM schemes [14, 19] asymptotically outperform known statistically or computationally secure constructions assuming that \(N = \mathsf {poly}(\lambda)\).

3. Alternatively, if we use the number of IOs as an overhead metric, we only need to assume that the CPU can evaluate a PRF internally without writing to memory, but the evaluation need not be unit cost.

4. Although the algorithm of Leighton et al. [42] appears to be comparison-based, it is in fact not, since the algorithm must tally the number of reals/dummies and make use of this number.

5. \(\lambda\) denotes the security parameter. Since the size of the hash table \(n\) may be small, here we separate the security parameter from the hash table's size.

6. The time overhead is a bit more complicated to state, and the above expression is for the case where \(|T_{i}|=2|T_{i-1}|\) for every \(i\) (which is the case in a hierarchical ORAM construction).

7. Note that the number of such assignments is \({n \choose {n_1, n_2}}\). Assuming that each array is already permuted, the number of possible outputs is \({n \choose {n_1, n_2}}\cdot n_1! n_2! = n!\).

8. We refer to Section 4.5 for background information on Cuckoo hashing.

9. For the time being, the reader need not worry about how to perform lookup in the stash. Later, when we use our oblivious Cuckoo hashing scheme in the bigger hash table construction, we will merge the stashes of all Cuckoo hash tables into a single one and treat the merged stash specially.

10. We omit the concrete parameter calculation in the last couple of steps, but from the calculations so far, it should be obvious that there is at most a constant blowup in the constants hidden inside the big-O notation.

11. Chan et al. [12] is a re-exposition and slight rectification of the elegant ideas of Goodrich and Mitzenmacher [33]; also note that the Cuckoo hashing appears only in the full version of Chan et al., http://eprint.iacr.org/2017/924.

12. The description here differs slightly from previous works [12]. In previous works, the Cuckoo assignment \({\bf A}\) was allowed to depend not only on the two bin choices \({\bf I}\), but also on the balls and keys themselves. In our work, the fact that the Cuckoo assignment is only a function of \({\bf I}\) is crucial; see Remark 4.13 for a discussion on this property, which we call indiscrimination.

13. In practice, a dummy swap consists of downloading the two encrypted items, re-encrypting both of them with fresh randomness, and then uploading them to their original positions.

14. For readability, we recurse with \(\sqrt {\ell }\) instead of optimizing the constants (e.g., \(2^{38}\)). Compared to the non-oblivious routing of Pippenger [56], our constant is much larger, and it is an open problem to resolve this issue or prove that it is inherent.

15. In the standard RAM model, we assume only a memory word (i.e., fair random bits) can be sampled uniformly at random in unit time. We note that to represent the probability of the shuffling exactly, infinitely many random bits are necessary, which implies that no shuffling can finish in worst-case finite time. One may approximate the stronger sampling using standard random words and repeating, and then bound the total number of repetitions to get a high-probability running time (the total number of repetitions is a sum of geometric random variables, which has a very sharp tail due to a Chernoff bound). We adopt the stronger sampling for simplicity.

16. Steps 6b, 6c, and 6(e)iv are due to a subtlety pointed out by Falk et al. [20].

17. Below, we assume that \(w\ge \log ^3 \log \lambda\). Indeed, otherwise, we can directly use a perfect ORAM on \(N \lt 2^{O(\log ^3 \log \lambda)}\), which yields an ORAM scheme with \(O(\log ^9 \log \lambda)\) overhead. See the proof below for details.

18. Note that if we use the oblivious Cuckoo hash table construction of Chan et al. [12], even if the block is found in the \({\sf Comb}{\sf S}_{\sf S}\) or \({\sf OF}_{\sf S}\) of level \(i\), we will still visit real addresses (computed with a PRF over the logical addresses) in the main Cuckoo hash tables.

19. If there is a tie in the sorting of edges, we resolve it by the ordering of edges in \({\bf I}\). This resolution was arbitrary in Chan et al. [12], which is insufficient in our case. Here, we want it to be decided based on the original ordering, as it implies that the assignment \({\bf A}\) is determined given (only) the input \({\bf I}\). We call this the indiscrimination property in Remark 4.13.

20. Recall that according to our definition, we translate an "input assumption" to a protocol in a hybrid model where the protocol first invokes a functionality that guarantees that the input assumption holds. In our case, the functionality receives the input array \({\bf I}_0 \Vert {\bf I}_1\) and the parameters \(n_0,n_1\), chooses two random permutations \(\pi _0,\pi _1\), and permutes the two arrays \({\bf I}_0,{\bf I}_1\).

21. The fact is that for every two random variables \(X\) and \(Y\) over a finite domain, and any event \(E\) such that \(\Pr _{X}[E] = \Pr _{Y}[E]\), it holds that \({\sf SD}(X,Y) \le {\sf SD}(X \mid E , Y \mid E) + \Pr _{X}[\lnot E]\). This fact can be verified by a direct expansion.

22. Note that in Construction 1, all elements in \(D\) have \(({\sf levelIndex},{\sf whichStash}) = (\ell ,\bot)\). At the end of this step, \(D_1\) is always \(D\), and \(D_2\) contains only dummies.

REFERENCES

[1] Miklós Ajtai, János Komlós, and Endre Szemerédi. 1983. An \(O(n \log n)\) sorting network. In Proceedings of the 15th Annual ACM Symposium on Theory of Computing. 1–9.
[2] Arne Andersson, Torben Hagerup, Stefan Nilsson, and Rajeev Raman. 1995. Sorting in linear time? In Proceedings of the 27th Annual ACM Symposium on Theory of Computing. 427–436.
[3] Sanjeev Arora and Boaz Barak. 2009. Computational Complexity - A Modern Approach. Cambridge University Press. Retrieved from http://www.cambridge.org/catalogue/catalogue.asp?isbn=9780521424264.
[4] Sanjeev Arora, Frank Thomson Leighton, and Bruce M. Maggs. 1990. On-line algorithms for path selection in a nonblocking network (extended abstract). In Proceedings of the ACM STOC.
[5] Kenneth E. Batcher. 1968. Sorting networks and their applications. In Proceedings of the American Federation of Information Processing Societies: AFIPS Conference Proceedings. Thomson Book Company, Washington D.C., 307–314.
[6] Vincent Bindschaedler, Muhammad Naveed, Xiaorui Pan, XiaoFeng Wang, and Yan Huang. 2015. Practicing oblivious access on cloud storage: The gap, the fallacy, and the new way forward. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security. 837–849.
[7] Elette Boyle and Moni Naor. 2016. Is there an oblivious RAM lower bound? In Proceedings of the 2016 ACM Conference on Innovations in Theoretical Computer Science. 357–368.
[8] Karl Bringmann, Fabian Kuhn, Konstantinos Panagiotou, Ueli Peter, and Henning Thomas. 2014. Internal DLA: Efficient simulation of a physical growth model. In Proceedings of the International Colloquium on Automata, Languages, and Programming. 247–258. Retrieved April 3, 2019 from http://people.mpi-inf.mpg.de/kbringma/paper/2014ICALP.pdf.
[9] Ran Canetti. 2000. Security and composition of multiparty cryptographic protocols. Journal of Cryptology 13, 1 (2000), 143–202.
[10] Ran Canetti. 2001. Universally composable security: A new paradigm for cryptographic protocols. In Proceedings of the 42nd IEEE Symposium on Foundations of Computer Science. 136–145.
[11] Hubert Chan, Kai-Min Chung, and Elaine Shi. 2017. On the depth of oblivious parallel ORAM. In Proceedings of the International Conference on the Theory and Application of Cryptology and Information Security.
[12] T.-H. Hubert Chan, Yue Guo, Wei-Kai Lin, and Elaine Shi. 2017. Oblivious hashing revisited, and applications to asymptotically efficient ORAM and OPRAM. In Proceedings of the International Conference on the Theory and Application of Cryptology and Information Security. 660–690.
[13] T.-H. Hubert Chan, Yue Guo, Wei-Kai Lin, and Elaine Shi. 2018. Cache-oblivious and data-oblivious sorting and applications. In Proceedings of the 29th Annual ACM-SIAM Symposium on Discrete Algorithms. 2201–2220.
[14] T.-H. Hubert Chan, Kartik Nayak, and Elaine Shi. 2018. Perfectly secure oblivious parallel RAM. In Proceedings of the Theory of Cryptography Conference. 636–668.
[15] T.-H. Hubert Chan and Elaine Shi. 2017. Circuit OPRAM: Unifying statistically and computationally secure ORAMs and OPRAMs. In Proceedings of the Theory of Cryptography Conference. 72–107.
[16] Kai-Min Chung, Zhenming Liu, and Rafael Pass. 2014. Statistically-secure ORAM with \(\tilde{O}(\log ^2n)\) overhead. In Proceedings of the International Conference on the Theory and Application of Cryptology and Information Security.
[17] Jean Claude Paul and Wilhelm Simon. 1980. Decision trees and random access machines. Logic and Algorithmic 30 (1980), 331–340.
  18. [18] Cormen T. H., Leiserson C. E., Rivest R. L., and Stein C.. 2009. Introduction to Algorithms (3rd ed.). MIT Press, 428436.Google ScholarGoogle Scholar
  19. [19] Damgård Ivan, Meldgaard Sigurd, and Nielsen Jesper Buus. 2011. Perfectly secure oblivious RAM without random oracles. In Proceedings of the Theory of Cryptography Conference. 144163.Google ScholarGoogle ScholarCross RefCross Ref
  20. [20] Falk Brett Hemenway, Noble Daniel, and Ostrovsky Rafail. 2020. Alibi: A Flaw in Cuckoo-Hashing based Hierarchical ORAM Schemes and a Solution. Cryptology ePrint Archive, Report 2020/997. (2020). Retrieved from https://eprint.iacr.org/2020/997.Google ScholarGoogle Scholar
  21. [21] Farach-Colton Martín and Tsai Meng-Tsung. 2015. Exact sublinear binomial sampling. Algorithmica 73, 4(2015), 637651. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. [22] Farhadi Alireza, Hajiaghayi MohammadTaghi, Larsen Kasper Green, and Shi Elaine. 2019. Lower bounds for external memory integer sorting via network coding. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing.Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. [23] Feldman P., Friedman J., and Pippenger N.. 1986. Non-blocking networks. In Proceedings of the ACM STOC. 247254.Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. [24] Fisher R. A. and Yates F.. 1975. Statistical Tables for Biological, Agricultural and Medical Research. Oliver and Boyd. Retrieved October 1, 2020 from https://books.google.com/books?id=KWBLPgAACAAJ.Google ScholarGoogle Scholar
  25. [25] Fletcher Christopher W., Dijk Marten van, and Devadas Srinivas. 2012. A secure processor architecture for encrypted computation on untrusted programs. In Proceedings of the 7th ACM Workshop on Scalable Trusted Computing. ACM, 38.Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. [26] Fletcher Christopher W., Ren Ling, Kwon Albert, Dijk Marten van, and Devadas Srinivas. 2015. Freecursive ORAM: [Nearly] free recursion and integrity verification for position-based oblivious RAM. In Proceedings of the 20th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS. ACM, 103116.Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. [27] Gabber Ofer and Galil Zvi. 1981. Explicit constructions of linear-sized superconcentrators. Journal of Computer and System Sciences 22, 3 (1981), 407420.Google ScholarGoogle ScholarCross RefCross Ref
  28. [28] Gentry Craig, Halevi Shai, Jutla Charanjit, and Raykova Mariana. 2015. Private database access with he-over-oram architecture. In Proceedings of the International Conference on Applied Cryptography and Network Security. Springer, 172191.Google ScholarGoogle ScholarCross RefCross Ref
  29. [29] Goldreich Oded. 1987. Towards a theory of software protection and simulation by oblivious RAMs. In Proceedings of the 19th Annual ACM Symposium on Theory of Computing. 182194.Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. [30] Goldreich Oded. 2004. The Foundations of Cryptography - Volume 2, Basic Applications. Cambridge University Press.Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. [31] Goldreich Oded and Ostrovsky Rafail. 1996. Software protection and simulation on oblivious RAMs. Journal of the ACM 43, 3(1996), 431473.Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. [32] Goodrich Michael T.. 2014. Zig-zag Sort: A simple deterministic data-oblivious sorting algorithm running in O(N Log N) time. In Proceedings of the 46th annual ACM Symposium on Theory of Computing. 684693.Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. [33] Goodrich Michael T. and Mitzenmacher Michael. 2011. Privacy-preserving access of outsourced data via oblivious RAM simulation. In International Colloquium on Automata, Languages, and Programming. 576587.Google ScholarGoogle Scholar
  34. [34] Goodrich Michael T., Mitzenmacher Michael, Ohrimenko Olga, and Tamassia Roberto. 2011. Oblivious RAM simulation with efficient worst-case access overhead. In Proceedings of the 3rd ACM Workshop on Cloud Computing Security Workshop. 95100.Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. [35] Goodrich Michael T., Mitzenmacher Michael, Ohrimenko Olga, and Tamassia Roberto. 2012. Privacy-preserving group data access via stateless oblivious RAM simulation. In Proceedings of the 23rd Annual ACM-SIAM Symposium on Discrete Algorithms. 157167.Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. [36] Hagerup Torben and Shen Hong. 1990. Improved nonconservative sequential and parallel integer sorting. Information Processing Letters 36, 2 (1990), 5763.Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. [37] Håstad Johan, Impagliazzo Russell, Levin Leonid A., and Luby Michael. 1999. A pseudorandom generator from any one-way function. SIAM Journal on Computing 28, 4 (1999), 13641396.Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. [38] Jimbo Shuji and Maruoka Akira. 1987. Expanders obtained from affine transformations. Combinatorica 7, 4 (1987), 343355.Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. [39] Kirsch Adam, Mitzenmacher Michael, and Wieder Udi. 2009. More robust hashing: Cuckoo hashing with a stash. SIAM Journal on Computing 39, 4 (2009), 15431561.Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. [40] Kushilevitz Eyal, Lu Steve, and Ostrovsky Rafail. 2012. On the (in)security of hash-based oblivious RAM and a new balancing scheme. In Proceedings of the 23rd annual ACM-SIAM symposium on Discrete Algorithms. 143156.Google ScholarGoogle ScholarCross RefCross Ref
  41. [41] Larsen Kasper Green and Nielsen Jesper Buus. 2018. Yes, there is an oblivious RAM lower bound!. In Proceedings of the Annual International Cryptology Conference. 523542.Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. [42] Leighton Frank Thomson, Ma Yuan, and Suel Torsten. 1997. On probabilistic networks for selection, merging, and sorting. Theory of Computing Systems 30, 6 (1997), 559582.Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. [43] Li Zongpeng and Li Baochun. 2004. Network coding: The case of multiple unicast sessions. Allerton Conference on Communications 16, 8 (2004),Google ScholarGoogle Scholar
  44. [44] Lin Wei-Kai, Shi Elaine, and Xie Tiancheng. 2019. Can we overcome the \(n \log n\) barrier for oblivious sorting? In Proceedings of the 13th Annual ACM-SIAM Symposium on Discrete Algorithms.Google ScholarGoogle ScholarCross RefCross Ref
  45. [45] Liu Chang, Wang Xiao Shaun, Nayak Kartik, Huang Yan, and Shi Elaine. 2015. ObliVM: A programming framework for secure computation. In Proceedings of the 2015 IEEE Symposium on Security and Privacy.Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. [46] Lu Steve and Ostrovsky Rafail. 2013. Distributed oblivious RAM for secure two-party computation. In Proceedings of the Theory of Cryptography Conference. 377396.Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. [47] Maas Martin, Love Eric, Stefanov Emil, Tiwari Mohit, Shi Elaine, Asanovic Krste, Kubiatowicz John, and Song Dawn. 2013. PHANTOM: Practical oblivious computation in a secure processor. In Proceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security. 311324.Google ScholarGoogle ScholarDigital LibraryDigital Library
  48. [48] Margulis Grigorii Aleksandrovich. 1973. Explicit constructions of concentrators. Problemy Peredachi Informatsii 9, 4 (1973), 7180.Google ScholarGoogle Scholar
  49. [49] Mitchell John C. and Zimmerman Joe. 2014. Data-oblivious data structures. In Proceedings of the 31st International Symposium on Theoretical Aspects of Computer Science STACS. 554565.Google ScholarGoogle Scholar
  50. [50] Naor Moni. 1989. Bit commitment using pseudo-randomness. In Proceedings of the Conference on the Theory and Application of Cryptology. 128136.Google ScholarGoogle Scholar
  51. [51] Ostrovsky Rafail and Shoup Victor. 1997. Private information storage. In Proceedings of the 29th Annual ACM Symposium on Theory of Computing. 294303.Google ScholarGoogle Scholar
  52. [52] Pagh Rasmus and Rodler Flemming Friche. 2004. Cuckoo hashing. Journal of Algorithms 51, 2(2004), 122144.Google ScholarGoogle ScholarDigital LibraryDigital Library
  53. [53] Patel Sarvar, Persiano Giuseppe, Raykova Mariana, and Yeo Kevin. 2018. PanORAMa: Oblivious RAM with logarithmic overhead. In Proceedings of the IEEE 59th Annual Symposium on Foundations of Computer Science.Google ScholarGoogle ScholarCross RefCross Ref
  54. [54] Pinsker Mark S.. 1973. On the complexity of a concentrator. In Proceedings of the 7th International Teletraffic Conference.Google ScholarGoogle Scholar
  55. [55] Pippenger Nicholas. 1977. Superconcentrators. SIAM Journal on Computing 6, 2 (1977), 298304.Google ScholarGoogle ScholarDigital LibraryDigital Library
  56. [56] Pippenger Nicholas. 1996. Self-routing superconcentrators. Journal of Computer and System Sciences 52, 1 (1996), 5360.Google ScholarGoogle ScholarDigital LibraryDigital Library
  57. [57] Ren Ling, Yu Xiangyao, Fletcher Christopher W., Dijk Marten van, and Devadas Srinivas. 2013. Design space exploration and optimization of path oblivious RAM in secure processors. In Proceedings of the 40th Annual International Symposium on Computer Architecture. 571582.Google ScholarGoogle ScholarDigital LibraryDigital Library
  58. [58] Shi Elaine, Chan T.-H. Hubert, Stefanov Emil, and Li Mingfei. 2011. Oblivious RAM with \(O((\log N)^3)\) worst-case cost. In Proceedings of the International Conference on The Theory and Application of Cryptology and Information Security. 197214.Google ScholarGoogle Scholar
  59. [59] Stefanov Emil and Shi Elaine. 2013. Oblivistore: High performance oblivious cloud storage. In Proceedings of the 2013 IEEE Symposium on Security and Privacy. 253267.Google ScholarGoogle ScholarDigital LibraryDigital Library
  60. [60] Stefanov Emil, Shi Elaine, and Song Dawn Xiaodong. 2012. Towards practical oblivious RAM. In Proceedings of the 19th Annual Network and Distributed System Security Symposium.Google ScholarGoogle Scholar
  61. [61] Stefanov Emil, Dijk Marten van, Shi Elaine, Fletcher Christopher W., Ren Ling, Yu Xiangyao, and Devadas Srinivas. 2013. Path ORAM: An extremely simple oblivious RAM protocol. In Proceedings of the ACM CCS. 299310.Google ScholarGoogle ScholarDigital LibraryDigital Library
  62. [62] Valiant Leslie G.. 1976. Graph-theoretic properties in computational complexity. Journal of Computer and System Sciences 13, 3(1976), 278285. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  63. [63] Wang Xiao, Chan T.-H. Hubert, and Shi Elaine. 2015. Circuit ORAM: On tightness of the goldreich-ostrovsky lower bound. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security. 850861.Google ScholarGoogle ScholarDigital LibraryDigital Library
  64. [64] Wang Xiao Shaun, Huang Yan, Chan T.-H. Hubert, Shelat Abhi, and Shi Elaine. 2014. SCORAM: Oblivious RAM for secure computation. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security. 191202.Google ScholarGoogle ScholarDigital LibraryDigital Library
  65. [65] Wikipedia. Bitonic sorter. (n. d.). Retrieved September 22, 2019 from https://en.wikipedia.org/w/index.php?title=Bitonic_sorter.Google ScholarGoogle Scholar
  66. [66] Wikipedia. Sorting network. (n. d.). Retrieved February 14, 2020 from https://en.wikipedia.org/wiki/Sorting_network.Google ScholarGoogle Scholar
  67. [67] Williams Peter, Sion Radu, and Tomescu Alin. 2012. PrivateFS: A parallel oblivious file system. In Proceedings of the 2012 ACM Conference on Computer and Communications Security.Google ScholarGoogle ScholarDigital LibraryDigital Library
  68. [68] Zahur Samee, Wang Xiao Shaun, Raykova Mariana, Gascón Adria, Doerner Jack, Evans David, and Katz Jonathan. 2016. Revisiting square-root ORAM: Efficient random access in multi-party computation. In Proceedings of the IEEE Symposium on Security and Privacy. 218234.Google ScholarGoogle ScholarCross RefCross Ref

      • Published in

        Journal of the ACM, Volume 70, Issue 1 (February 2023), 405 pages.
        ISSN: 0004-5411
        EISSN: 1557-735X
        DOI: 10.1145/3572730

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery, New York, NY, United States

      Publication History

      • Received: 11 March 2020
      • Revised: 2 August 2022
      • Accepted: 6 September 2022
      • Online AM: 6 October 2022
      • Published: 19 December 2022

      Qualifiers

      • research-article
      • Refereed
