Linear Regression with Sparsely Permuted Data

In regression analysis of multivariate data, it is tacitly assumed that response and predictor variables in each observed response-predictor pair correspond to the same entity or unit. In this paper, we consider the situation of"permuted data"in which this basic correspondence has been lost. Several recent papers have considered this situation without further assumptions on the underlying permutation. In applications, the latter is often to known to have additional structure that can be leveraged. Specifically, we herein consider the common scenario of"sparsely permuted data"in which only a small fraction of the data is affected by a mismatch between response and predictors. However, an adverse effect already observed for sparsely permuted data is that the least squares estimator as well as other estimators not accounting for such partial mismatch are inconsistent. One approach studied in detail herein is to treat permuted data as outliers which motivates the use of robust regression formulations to estimate the regression parameter. The resulting estimate can subsequently be used to recover the permutation. A notable benefit of the proposed approach is its computational simplicity given the general lack of procedures for the above problem that are both statistically sound and computationally appealing.


Introduction
A largely unquestioned assumption in regression analysis with response-predictor pairs {(y i , x i )} n i=1 is that each y i corresponds to the same statistical unit as x i . In this paper, we consider the situation where the identifiers of the predictors (or equivalently those of the responses) are subject to an unknown permutation so that correspondence between y i and x i may not be be taken for granted. We refer to this situation as "permuted data" respectively "sparsely permuted data" when the permutation only affects a small fraction of all response-predictor pairs. Restoring the original correspondence between responses and predictors may not be achievable in practice for both computational and statistical reasons, but may also not be required for consistent estimation of regression parameters. Conversely, if the primary interest concerns recovering the permutation itself, which is the case in entity resolution, the regression model can facilitate that task.

Background and Motivation
Large organizations that own or have access to multiple data sources regularly rely on data integration for conducting large-scale scientific projects. Available datasets are gathered or produced at different points of time and independently of one another. The main reason for combining datasets is that no single dataset contains all variables of interest that pertain the research questions. Data collected from a single source are often limited and do not have all the variables needed for statistical analysis. Limited budget, time and resources prevent each particular agency from collecting a comprehensive dataset. In  Record linkage, or entity resolution, is an essential task in data integration. The task is to identify which records in different datasets belong to the same entity. In this context, the term "entity" is to be understood in a broad sense, and may refer to customers, tax payers or patients, etc. Record linkage has a long history of uses in large enterprises, government agencies and health care institutions. A business register consisting of names, addresses, and other identifying information such as total financial receipts might be constructed from tax and employment data bases; a survey of retail establishments or agricultural establishments might combine results from an area frame and a list frame. In this case, units from the area frame would need to be identified in the list frame [52].
Generally, due to the lack of unique entity identifiers in data files, record matching is based on methods of probabilistic record linkage [24] that score similarity between common quasi-identifiers. For example, when the data files contain health information of individuals, the quasi-identifiers can be first and last names, addresses, dates of birth. Spelling errors, typographical variation and missing values in records inhibit exact matching and lead to matching errors: mismatches and missed-matches. Even low matching error rates can lead to selection bias and pervasive outliers, particularly when the record linkage process includes or excludes particular types of entities and their attributes [10]. Selection bias and outliers contaminate subsequent statistical analysis. To reduce bias, it is of interest to develop statistical methods that can alleviate the adverse effects of matching errors.
In this paper, we focus on the problem of linear regression with sparsely permuted data as it arises frequently when linking survey records to external data. The canonical example is regression estimation of a study variable y on auxiliary variables x that reside in another data source such as master data or administrative records. The matching error, determined by the fraction of mismatches and missed-matches, is expected to be only a small fraction of the sample size n [52]. The paper [47] considers two such situations. To model the (Π * y, X) observed data data with true correspondence Figure 2: Regression with lost correspondence between response and predictors.
energy economy properly, an economist might need company-specific data on the fuel and feedstocks used by companies that are only available from Agency A and corresponding microdata on the goods produced for companies that is only available from Agency B. To model the health of individuals in society, a demographer or health science policy worker might need individual-specific information on those receiving social benefits from Agencies B1, B2, and B3, corresponding income information from Agency I, and information on health services from Agencies H1 and H2. Regression with permuted data arises in other applications as well. The papers [3,39] provide extensive lists including multi-target tracking in radar systems [43] and pose and correspondence estimation in computer vision [19].

Problem statement
To setup the problem more concretely, suppose the data (y 1 , x 1 ), . . . , (y n , x n ) is obtained from matching two data files A and B, where for each i, y i ∈ R resides in file A and x i ∈ R d resides in file B. Since the process of record linkage is not error-free, some x i may have been paired up with a non-correspondent y i . If the number of such mismatches is known to be less than or equal to k ≤ n, then there is an unknown permutation ϕ on [n] := {1, . . . , n} so that: ϕ moves at most k of the indices, and (y 1 , x ϕ(1) ), . . . , (y n , x ϕ(n) ) are independent realizations from the classical linear regression model Let Π * and Π * be the matrix representations of ϕ and its inverse, respectively, and let X = x 1 · · · x d . Eq. (1) readily implies that y = Π * Xβ * + , where y = (y 1 , . . . , y n ) and = ( 1 , . . . , n ) For simplicity, since the distribution of Π * and is the same, we do not distinguish between Π * and in the sequel and write in place of Π * .
The main question to be studied here is how the parameters Π * and β * can be accurately and efficiently estimated from the partially mismatched pairs (y i , x i ). Efficiency here refers to computational complexity of the procedure that outputs the desired estimates. Let us consider the number of mismatches which can be expressed as d H (Π * , I n ) := |{i : Π * ii = 0}|, the Hamming distance between Π * and the identity matrix I n . We can naturally estimate Π * and β * by a least-squares approach as: where Π and β run over P n , the set of all permutation matrices in R n×n , and R d , respectively. A permutation respectively its matrix representation Π is said to be k-sparse if d H (Π, I n ) ≤ k. We note that [39] have shown that unless d = 1 the optimization problem (3) is NP-hard for k = n. The same is true for any k that is defined as a fraction of n.
Notation. We here gather some notation frequently used in the present paper. For a positive integer m, I m denotes the m × m identity matrix, and S m−1 denotes the unit sphere in R m . We write |S| for the cardinality of a set S. The complement of S with respect to a given base set depending on the context is denoted by S c . For a matrix A, A 2 = σ max (A) denotes its spectral norm respectively maximum singular value, and range(A) denotes the column space of A. For an index set I ⊆ [m] := {1, . . . , m} and v ∈ R m , v I denotes the subvector of v corresponding to I. We write a ∨ b = max{a, b}. Positive constants are denoted by C, c, c 1 etc. We make use of the usual Big-O notation in terms of O, o, Ω and Θ.

Prior work
Linear regression with linked data files is a common scenario in which traditional methods of statistical data analyses are error-prone. Early work in [37] recognizes the adverse effect of matching error on the regression analysis of linked datasets. They show that as a consequence of matching error, the ordinary least squares estimator β ols is generally not an unbiased estimator of β * . In the paper [37], the process of matching is regarded as random. Letting z i denote the response that is actually in correspondence to x i , and letting y i denote the response that has been matched to x i , the model in [37] assumes that for i = 1, . . . , n, Assuming that these probabilities can be estimated [34,35,37,46,47] focus on constructing an estimator that is unbiased or less biased than β ols . In fact, in [35] the matrix Q = (q ij ) is used to construct an unbiased estimator of β * . The papers [29,30,49] propose to estimate β * using a Bayesian procedure. The main shortcoming of these approaches are that they rely on the assumption that the doubly stochastic matrix Q is known or can be accurately estimated from the data. In addition, while achieving reduction in bias, the proposed estimators may still be inconsistent and may have large mean squared errors. The classical papers [20,21,22] and the later note [53] consider the situation of permuted data under the term "broken sample". A broken sample is a sample of (y i , x i )-pairs that up to some permutation of the {x i } (or equivalently the {y i }) are generated from their joint distribution. In other words, each component of (y i , x i ) is observed separately (as if it were generated from its marginal distribution), with possibly different orders for each component. Assuming that the {(y i , x i )} are generated from a bi-variate normal distribution up to a permutation, these papers discuss recovering the permutation or estimating the correlation parameter of the underlying bivariate normal distribution. In [5,15], the authors discuss whether that parameter can be consistently estimated from a broken sample.
In recent years, computer science and engineering have witnessed a surge of interest in regression analysis of permuted data resulting from problems in multi-target tracking, statistical seriation [25], and unlabeled compressed sensing [50], to mention just a few. The papers [39,40,50] are particularly important as they provide rigorous results on fundamental questions associated with the problem. We shall refer to some of these results in the subsequent sections in more detail. While the paper [17] is not concerned with a regression setting, it provides a detailed analysis of the problem of finding correct matches between two sets of objects in the presence of noise, which bears some relation to the problem discussed in §2.4 below.

Summary of Contributions
The analysis in this paper concerns estimators of Π * and β * given model (1) under the , as also assumed in the recent work [39]. It is not hard to extend the results to the case where x i We first provide a negative result concerning the least squares approach (3) if k = n, i.e., if no constraint is imposed on the permutation Π. It is shown that optimizing over all possible choices of Π leads to overfitting in the sense that if β * = 0, one still has β 2 2 = Ω(1) with high probability. This result complements another one of a similar spirit in the recent paper [3] who show that for d = 1, the least squares estimator (3) converges almost surely to a limit different from β * . Our result also aligns well with a minimax lower bound in [40].
Altogether, these negative results provide additional justification to consider the regime of sparse permutations with k n. We demonstrate a bound on the estimation error β − β * 2 with β as in (3). Specifically, the bound implies that the error vanishes as both d/n and k log(n/k)/n → 0. In view of the fact that the optimization problem (3) is not computationally tractable, we consider a convex relaxation that takes the form of a robust regression estimation problem as it has been considered before in different contexts [26,38,48], with the permuted observations here being in correspondence to gross errors. The robust regression formulation can be reduced to a specific sparse regression problem in an underdetermined setting with n − d samples and n parameters, one for each observation in a given sample that is affected by partial mismatches between the {x i } n i=1 and the {y i } n i=1 . Our analysis of the robust regression problem applies generically beyond the specific setting of sparsely permuted data. Prior works [26,38] have considered a more general version of the problem in which β * is assumed to be sparse as well, thereby being able to cover the regime n < d, but it is not clear whether the results in [26,38] can be specialized to match those of the present paper for the traditional n > d case.
While the robust regression formulation gives rise to an error bound for estimating β * that is comparable to that of the computationally hard formulation (3), it does not immediately yield an estimator for the permutation Π * . We address this issue by adopting a two-stage approach that uses an accurate estimator θ of β * to match the fitted values . This reduces to simple sorting operations, thereby avoiding the computational challenges associated with problem (3). We show that our approach recovers the underlying permutation under qualitatively the same condition as in [39] which is considerably more stringent in terms of required signal-to-noise ratio than what is required for accurate estimation of the regression coefficients β * . We complement our result with a comparable lower bound on the signal-to-noise ratio that is required for permutation recovery even if β * itself is known.
As already pointed out in [39], the Hamming ball constraint in (3) does not substantially change the fundamental statistical limits of permutation recovery. However, that constraint does help in that it gives rise to a computationally efficient estimator of Π * , whereas the statistical achievability result in [39] is for the computationally hard estimator (3). Lemma 6 (D) Figure 3: Schematic overview on targets in the setting of linear regression with sparsely permuted data and pointers to corresponding results in the paper. The directions of indicate that if the object on the left hand side is known or can be accurately estimated, it immediately yields or simplifies estimation of the object on the right hand side.

Main results
This section is structured as follows. We start by showing that the estimator β in (3) lacks statistical consistency if k = n. This sets the stage for considering the scenario k n. We first focus on recovery of β * , treating Π * as a nuisance parameter. We then present and analyze a post-processing step for recovering Π * given an accurate estimate of β * .
In the sequel, unless stated otherwise, we assume Gaussian design for the {x i } n i=1 : Our results generalize to the case x i , for a symmetric positive definite matrix Σ ∈ R d×d in a straightforward manner by defining a new regression parameter Σ 1/2 β * . An estimator of that parameter (as discussed in the sequel) and an estimator of Σ can then be combined to form an estimator of β * . Note that estimation of Σ is not affected by the presence of an unknown permutation.
2.1 Least squares estimation of (Π * , β * ) without additional constraints: a negative result Let us consider problem (3) for k = n, i.e., no further constraints are imposed on the solution ( Π, β). It turns out that in this case, we cannot hope for β to be a good estimator of β * . As can be seen from the proof of the following proposition, complete freedom in choosing Π to fit the data results into over-adaptation to noise even if β * is low-dimensional -in fact, the phenomenon already manifests itself for d = 1.
Proposition 1. Let β * = 0 and Π * = I n , i.e., y = and consider the estimator (3) with k = n. Then there exist constants c, C > 0 such that with probability at least 1−C exp(−cn) In other words, in a pure noise setting, β 2 will be bounded away from zero by a constant (assuming d = O(n)). A result of a similar flavor is shown in [3]. For d = 1, they show that the estimator β converges almost surely to a limit different from β * as n → ∞, and they derive an explicit expression for that limit. Our result here is more of a qualitative nature, but it is non-asymptotic and not limited to d = 1.
Both the result in [3] as well as Proposition 1 raise the question whether there exist alternative estimators that do significantly better in a regime where Π * can be an arbitrary element of P n . Theorem 1 in [40] indicates that the answer is negative: they show that for any estimator ( Π, β), one has While the above lower bound concerns estimation of Π * Xβ * rather than β * , accurate insample prediction (or synonymously denoising), i.e., recovery of Π * Xβ * , is typically easier than recovery of β * . Along this direction, another negative result is shown in [32]. They demonstrate that a lower bound on the signal-to-noise ratio (SNR) of the order Ω(d/ log log n) is required for any estimator of β * to achieve a non-trivial expected relative estimation error 1 . As shown in [39], the condition SNR > n c for c large enough is sufficient for the solution Π of (3) to recover Π * , in which case β − β * 2 2 scales as O(d/n) as in the usual regression setting in the absence of an unknown permutation.

Least squares estimation of
In summary, the previous section points to the fact that we cannot hope for accurate estimation of β * without additional assumptions on Π * and/or the SNR of the problem. As motivated in the introduction, we henceforth turn our attention to the case of Π * being k-sparse, with k "significantly smaller" than n. The allowable range of k is addressed in our analysis presented below. We start by fixing notation. For 0 ≤ k ≤ n, let us introduce the shorthand for the constraint set of Π in problem (3). Moreover, for a compact and symmetric 2 set K ⊂ R n , its Gaussian width is defined by While originating in geometric analysis, the Gaussian width is a measure of complexity that has been increasingly adopted in the analysis of high-dimensional linear inverse problems [13,16,42] in connection with Gordon's "Escape Through a Mesh Theorem" [27], which is the key component in the proof of Theorem 1 below as well. In our setting, we use the Gaussian width (6) in conjunction with the set where we recall that Π * ∈ P n,k is the inverse of Π * . Let · 0 denote the 0 -"norm", i.e., the number of non-zero entries of a vector. A simple observation is that for any Π ∈ P n,k and any v ∈ R n , it holds that 1 Non-trivial here refers to a relative estimation error of lower order than that of the estimator θ ≡ 0.
As a result, T ⊆ B 0 (2k; n) ∩ S n−1 , where for 0 ≤ r ≤ n, the set B 0 (r; n) = {v ∈ R n : v 0 ≤ r} denotes the 0 -"ball" of radius r in R n . By well-known results (cf. [41], Lemma 2.3), we hence have that Moreover, it is not hard to show that w(T ) ≥ w(B 0 (n − k; k) ∩ S n−1 ), i.e., there is a lower bound on w(T ) of the same order. After these preparations, we are in position to state the following result.
If furthermore n > 9 ∨ 4d, then with probability We start our interpretation of Theorem 1 by inspecting condition (9) which imposes a limit on how large d and k can be in relation to n. Roughly speaking, the requirements are n = Ω(d) and n = Ω(w 2 (T )). In light of (8), the latter condition becomes n = Ω(k log(n/k)). Specifically, let us fix ε = 1/4 and d = αn for α ∈ (0, 1 4 ), then (9) essentially evaluates as The error bound (10) consists of two parts. The first part equals the error one would have if the permutation Π * were known in advance and is thus inevitable. The second term is a bound on the excess error incurred for not knowing Π * . That term is well controlled as long as long as the fraction of permuted observations is small relative to n. A crucial intermediate step in the proof of Theorem 1 is a bound on Π * y − Πy 2 . Under an additional condition on the SNR, we may deduce from that bound that Π identifies the "support" S * = {i : Π * ii = 1} of Π * . We may then re-fit with the corresponding observations left out, to achieve a smaller error in estimating β * . Corollary 1. For any δ ∈ (0, 1), under the conditions of Theorem 1, if it additionally holds that the following events hold with probability at least 1 − δ − 2 exp(− 1 2 (d ∨ log n)): where β denotes the ordinary least squares estimator based on data {(y i , Under the conditions of the corollary, as long as n − k = Ω(n), the error rate in estimating β * is of the same order as if Π * were known in advance; the second term in (10) gets eliminated. It is important to note that support recovery (i.e., { S = S * }) is a weaker result than permutation recovery (i.e., {Π * = Π}). As discussed in more detail in §2.4 below, the latter requires a considerably more stringent condition on the SNR than what is required in Corollary 1.

Convex relaxation
While the approach of the previous section has appealing statistical properties, its computational hardness asks for computationally efficient alternatives with similar guarantees. As long as Π * is treated as a nuisance parameter, we may eliminate it in the following way. Introducing f * = Π * Xβ * − Xβ * = (Π * − I n )Xβ * , model (2) can be re-expressed as This prompts the optimization problem A first relaxation is given by We note in passing that one could additionally impose the constraint n i=1 f i = 1. However, it turns out that its addition does not yield significant statistical benefits, and it is thus omitted. Relaxation (11) is still not convex, but following the standard approach of replacing the 0 -norm by the 1 -norm, we end up with the convex optimization problem Since it tends to be difficult to choose b appropriately in practice, it is more convenient to work with the Lagrangian form of (12). After re-parameterizing e = f / √ n 3 min β∈R d ,e∈R n 1 n y − Xβ − √ ne 2 2 + λ e 1 , λ > 0.
Formulation (13) and variants thereof have been used in robust regression and signal recovery with gross corruptions (e.g., [26,36,38,48]). In fact, (13) is equivalent to employing Huber's loss [33] instead of squared loss, cf. [48] and the references therein. The connection to robust regression comes naturally as observations with mismatch between x and y are likely to induce severe errors in model fitting beyond the usual noise, and hence take the role of outliers. Indeed, this reasoning could have been used to motivate (13) right away instead of via the sequence of relaxations stated above. Formulation (13) is related to least absolute deviation regression min in that (14) is obtained from (13) in the limit λ → 0 and the additional constraint y = Xβ + √ ne. In that sense, (14) can be seen as the counterpart to (13) in a noiseless setup. Problem (14) has been analyzed under assumption (G) in the landmark papers of [14,45] on the classical error correcting problem in coding theory.
Theorem 2 below provides an upper bound on the 2 -error of the estimator of β * resulting as the optimal β for (13). Theorem 2. Let ( β, e) be a minimizer of (13) with λ = 4(1 + M )σ 2 log(n)/n for some M > 0. There exist constants c 1 , c 2 , ε > 0 so that if Comparing the upper bounds (10) and (15) of the original problem and its relaxation, respectively, we observe a close agreement given that w(T ) = Θ(k log(n/k)). Apart from the slight change of the order in the second term, only constants differ. Similarly to Corollary 1, we may consider a two-stage procedure in order to further improve upon (15): given e and a suitable threshold t, we let S t = {i : | e i | ≥ t} and obtain a plain least squares With k assumed to be known, we may take t as the k-th largest largest entry of e in absolute magnitude. The formal analysis is similar to Corollary 1 and is hence omitted.

Estimation of Π * given an estimator of β *
Having discussed estimation of β * , we now come back to the problem of estimating Π * . As indicated above, the latter problem turns out to be significantly more challenging than the former problem. In the sequel, we study the question of how the availability of an accurate estimate of β * can be leveraged to construct an estimator of Π * that is computationally feasible including the case d > 1. As mentioned earlier, except for d = 1, the optimization problem in Eq.
For k = n, maximizing each term inside the curly brackets is a specific instance of a linear assignment problem [12], a class of problems that can be solved in polynomial time despite their combinatorial nature. In fact, it easy to see that where X (i) and y (i) denote the i-th order statistic of X and y, respectively, i ∈ [n]. Hence, for k = n, problem (16) reduces to two vanilla sorting operations. At this point, it is not well understood yet whether the Hamming constraint d H (Π, I n ) ≤ k for general k causes problem (16) to be NP-hard again. So far, we are not aware of any computationally efficient algorithm for general k.
Note that when β * is known, computing the least squares estimator of Π * reduces to solving precisely one of the two optimization problem encountered in (16) for d = 1: At this point, a natural idea is to replace β * by an accurate and computationally feasible estimator like the one discussed in §2. 3. From a computational point of view, this already constitutes a simplification as (18) reduces to an integer linear program; while problems in this class are still NP-hard in general, problem (18) can be considered as much more benign than the original problem (3) which belongs to the class of quadratic assignment problems notorious for their computational hardness. Due to recent advances in integer programming that have meanwhile been taken advantage of for a series of other statistical problems [7,8], it turns out that problem (18) is practically feasible at least for n in the order of a few thousands. For large n, we instead recommend estimating the support of Π * and then solve the unconstrained problem (17) restricted to observations in the estimated support. Formally, denote by S an estimator of the support of Π * and let y S and X S be the sub-vector of y and the row submatrix of X corresponding to observations in S, respectively. We then estimate Π * by Π defined by where We now turn to the statistical limits of permutation recovery, including a lower bound on the SNR in the idealized case with known β * . For simplicity, the theorem stated below is for the case k = n, but it generalizes to k < n conditional on having { S = S * } in that all expressions in n below would get replaced by k.
If β * is known, specializing part (a) to the case ∆ = 0 asserts that SNR = Ω(n 4 ) is a sufficient condition for permutation recovery. This is to be compared to a result in [39] stating that the estimator Π in (3) recovers Π * if SNR = Ω(n c ), where the constant c > 0 is not specified. Next, let us now turn to the case where β * is substituted by an estimator θ. Using standard concentration arguments, one shows that holds with high probability. Specifically, considering the estimator β in §2.3, Theorem 2 then yields X β − Xβ * ∞ ≤ Cσ √ n for k small enough, with high probability. Inserting ∆ = C √ n into part (a) then results into the requirement SNR = Ω(n 5 ). We stress again that as opposed to Π, the estimator Π( β) is computationally appealing as it is obtained from a quadratic program (13) and subsequent sorting.
Finally, part (b) provides evidence that the condition SNR = Ω(n 2 ) is necessary for permutation recovery. While the result in (b) concerns a specific estimator, there does not seem to be much indication for the existence of a substantially better estimator. In [39], it is shown that SNR = Ω(n 8/9 ) is necessary for any estimator.

Simulated data
Below, we present the results of two synthetic data experiments that are meant to serve as illustration of the developments in the previous section. In the first set of experiments, we generate n = 200 observations from model (2) {.01, .02, .05, .1, .15, .2, . . . , .5}, where the support of Π * is selected uniformly at random. The parameter β * is generated from the uniform distribution on S d−1 . For each pair (σ, k/n), we conduct 100 independent replications.
We compare the following estimators: (i) The naive estimator (ordinary least squares estimator) of β * that ignores the fact that a fraction of the data is mismatched; this corresponds to the choice k = 0 in (3). The estimator ( β, e) is computed in CVX [28]. The estimator β in §2.2 is compared to β in a second set of simulations only for the case d = 1. In that case, computation of β reduces to (16) where each of the two inner optimization problems can be cast as an integer linear program with n 2 variables and 2n+1 linear constraints. It turns out that the general purpose routine cplexbilp in IBM CPLEX [1] is able to solve such problems in only a few seconds for n = 200. Again, such reduction is limited to the case d = 1. Inspection of Figure 4 shows that the approach (ii) improves dramatically over the naive estimator (i) as long as the SNR is not too small (for σ = 1 and σ = 0.5, there is no longer a visible improvement). The results look promising in that the tolerable fraction of permuted observations can be as much as 0.5 as long as the noise level is small; "tolerable" here refers to the fact that the estimation error increases gently with the fraction of permuted observations as opposed to (i) with a sharp increase in error as k/n moves away from zero. Approach (iii) appears to yield further improvements over (ii) for large SNR. For small SNR, (ii) and (iii) are not distinguishable. This observation aligns well with Corollary 1.

Real data
The El Niño data set in the UCI Machine Learning Repository [2] contains oceanographic and surface meteorological readings taken from a series of buoys positioned throughout the equatorial Pacific. The data set consists of n = 93, 935 records with the following attributes: buoy identifier, date, location (latitude and longitude), zonal and meridional wind speeds (zon and mer), relative humidity (humidity), air temperature (air.temp), sea surface temperature and subsurface temperatures down to a depth of 500 meters (s.s.temp). The following linear regression model is considered: air.temp = β 0 + β zon · zon + β mer · mer+ + β humidity · humidity + β s.s.temp · s.s.temp + .
The results of least squares regression indicate an excellent fit of this model with an Rsquared of about 0.9. With the goal to mimic the situation of data merging based on nonunique identifiers, the data set is divided into two data sets A and B, with A containing the response variable and B containing the predictors. The variables latitude and longitude are maintained in both data sets, and used as quasi-identifiers for record linkage of A and B with the R package fastLink [23]. Since (latitude, longitude)-pairs are generally associated with multiple observations, the linkage process is not free of mismatches. The merged data set in turn follows the format depicted in Figure 2 with a permutation affecting  Table 1: Results of the real data analysis. Here, β oracle denotes the least squares estimator based on the original data set, β ols denotes the naive least squares estimator, and β denotes the estimator from formulation (13). The columns labeled zon, . . ., s.s.temp contain the regression parameter estimates for the respective variable; interc refers to the intercept. The three rightmost columns contain the root mean square errors (RMSEs) on the original ( β oracle ) respectively the merged data set ( β ols , β), the 2 -distance to β oracle , and the average hold-out errors (standard errors in parentheses), respectively.
the correspondence between responses and predictors. As shown in Table 1, least squares regression with the thus merged data set leads to an increase of the residual sum of squares by 27%. In addition, the estimates of the regression parameters of the linear model (20) change noticeably. As an alternative, we consider the approach (13) studied in detail herein, where the parameter λ is chosen as λ = 2c σ/ √ n with c = 1.345 and σ being a robust estimator of the residual standard error. More specifically, we use the function rlm (short for "robust linear model fitting") in the R package MASS [44] with its default arguments which is in exact correspondence to the above choice of λ, as follows from the connection between linear regression with the Huber loss and formulation (13) (cf., e.g., Proposition 3.1 in [48]). The figures in Table 1 indicate that this approach provides some remedy relative to the naive use of the least squares estimator based on the merged data set. The estimates of the regression parameters from (13) reduce the gap to those obtained with the original data set by about a factor of one half in terms of 2 -distance. Moreover, approach (13) also yields a better fit in terms of the mean squared prediction error on a collection of twenty hold-out sets obtained from leaving out different random subsets of 10% of the observations; in each of those twenty runs, the hold-in data set is split and subsequently re-linked in the same manner as described above. The reduction in excess risk (relative to the least squares estimator) achieved by (13) in comparison to the naive least squares solution is again a factor of roughly one half.

Conclusion
Regression problems in which the correspondence between responses and predictors has been lost can be traced back to classical work in statistics and naturally arises in the context of record linkage. Despite its long history, the problem has seen revived interest lately and has been analyzed through the lens of modern non-asymptotic statistical theory. An additional layer of complexity to be dealt with is the computational hardness of (quadratic) assignment problems. In the present paper, both aspects are taken into account under the assumption of sparsely permuted data. One central insight of our work is that the availability of an accurate estimator of β * may be helpful in circumventing the computational barrier associated with permutation recovery. On the other hand, preliminary numerical results for d = 1 indicate a gap in statistical performance between a presumably hard formulation capturing precisely the notion of a sparse permutation and a computationally convenient relaxation that introduces a loss of information. This observation deserves further investigation.
[53] Y. N. Wu, A note on broken sample problem, tech. rep., Department of Statistics, University of Michigan, 1998.

A Proof of Proposition 1
Consider the least squares problem (3) with y = : Omitting the term not depending on β, an equivalent problem is given by Re-parameterizing Xβ = u · r, with r ≥ 0 for u ∈ range(X) ∩ S n−1 , we obtain the optimization problem min Π,u∈range(X)∩S n−1 ,r≥0 −2 Π , u r + r 2 .
Observe that for any u and Π, the optimal value of r is found as r(u) = max{ Π , u , 0}.

Let further
n + = min{n ,+ , n X 1 ,+ }, n − = min{n ,− , n X 1 ,− }, n ± = n − n + − n − . Furthermore, let in increasing order. We then construct Π ± such that All terms inside the first two sums are i.i.d. from the same distribution as |g||h|, where g and h are independent N (0, 1)-random variables, and accordingly the terms inside the last summand are i.i.d. from the same distribution as −|g||h|. Conditional on n ± , hence Π ± , X 1 follows the same distribution as where {g i } n i=1 and {h i } n i=1 are two independent sequences of n i.i.d. random variables from the N (0, 1)-distribution. The expectation of the above expression (conditional on n ± ) is given by From Proposition 5.16 in [51], we have the following concentration inequality: Hence, conditional on n ± , with probability at least 1 − exp(−nc), Consider the definition of n + and n − Observe that n ,+ , n X 1 ,+ , n ,− and n X 1 ,− are each binomial random variables with n trials and probability of success 1/2. By repeated application of Hoeffding's inequality, we hence have P(n + ≤ 3n/8) ≤ 2 exp(−n/32), P(n − ≤ 3n/8) ≤ 2 exp(−n/32).

B Proof of Theorem 1
We start with a basic, yet important observation that allows us to decouple Π and β. Let P X and P ⊥ X denote the orthogonal projection on range(X) respectively its orthogonal complement.
The second part can always be made equal to zero by choosing β such that Xβ = P X Πy. It hence suffices to minimize the first part w.r.t. Π and to back-substitute the result into the second part. Under model (G), X is non-singular with probability one as long as n ≥ d which yields the second expression in (22).
In the next lemma, we bound the 2 -distance between Π * y and Π y.
Under the conditions of Theorem 1, the following holds with probability at least Proof. In view of Lemma 1, by definition of Π and the fact that Π * is a feasible solution, we have Since P ⊥ X Π * y = P ⊥ X (Xβ * + ) = P ⊥ X , we have the implication that Dividing both sides by P ⊥ X ( Π − Π * )y 2 , we obtain that i) Lower bounding the left hand side of (23).
Observe that ∆ ∈ T as defined in (7). Moreover, note that under (G), the random subspace range(X) ⊆ R n follows the uniform distribution on the Grassmannian G(n, d).
ii) Upper bounding the right hand side of (23). Consider t > 0 arbitrary, to be chosen later. We have Now note that and P ⊥ X are independent. As a result, when conditioning on P ⊥ X , we have from Lemma 14 that The second equality is a consequence of the Sudakov-Fernique comparison inequality (Theorem 2.2.3 in [4]) and the fact that P ⊥ X 2 ≤ 1. Choosing t = (1 + √ 2)σ{w(T ) ∨ log n}, combining this result with i), and putting together the pieces in (23), the assertion of the lemma follows.
In order to complete the proof of Theorem 1, we need one last lemma whose proof is deferred to Appendix G.
Lemma 3. Let σ min (·) denote the minimum singular value of a matrix. If n ≥ 9 ∨ 4d, it holds that with the stated probability by combining Lemmas 2 and 3.

C Proof of Corollary 1
Observe that the event { S = S * } is implied by the event which is in turn implied by We now have Note that for any i = j, y i − y j ∼ N (0, 2( β * 2 2 + σ 2 )). Invoking Lemma 19, we thus obtain that for any i = j P (|y i − y j | ≤ γ) ≤ γ 2( β * 2 2 + σ 2 ) .
It hence follows from the previous two displays that for any δ > 0, Substituting γ by the bound from Lemma 2, the above condition becomes and the first part of the corollary follows from Lemma 2. Turning to the second part, consider the least squares estimator using the reduced data set {(x i , y i ), i ∈ [n] \ S * } and denote by β(S * ) the corresponding least squares estimator. We then have for any δ > 0 It thus remains to control the first probability on the left hand side, which can be done with arguments similar to those used in the last part of the proof of Theorem 1.

D Proof of Theorem 2
We start with a Lemma that parallels Lemma 1. The reasoning is analogous, and the proof is hence omitted.
For our analysis of the optimization problem of Lemma 4, we need the following crucial lemma whose proof is provided in Appendix F.
In correspondence with Lemma 2, we first bound e−e * 2 , where e * = (Π * Xβ * −Xβ * )/ √ n. Proof. We first decompose According to Lemma 4, since e is a minimizer, we have that Expanding the square on the left-hand side and re-arranging terms, we obtain that which implies λ e 1 ≤ 2 e * − e 1 / √ n ∞ + λ e * 1 , and in turn Re-arranging, we arrive at Equipped with this is intermediate result, we go back to (25). We first obtain by rearranging that √ k e S * − e * S * 2 , where we have used (26) in the second inequality. In order to lower bound the left hand side, note that (26) can be written as e − e * ∈ C(S * , 3) according to (31). Conditional on the event in Lemma 5, Combining this with the previous display, we find that In view of Lemma 18, the choice λ = 4(1 + M )σ 2 log(n)/n for M ≥ 0 guarantees that the event Λ defined above (26) occurs with probability at least 1 − 2n −M 2 . This completes the proof of the Lemma.
Using Lemmas 2 and 4, the proof of the theorem is completed along the lines of the last part of the proof of Theorem 1. For brevity, we omit those steps here.

E Proof of Theorem 3
The proof of part (a) of the theorem relies on two lemmas, which are stated and proved first.
The following lemma is of interest in its own when the design matrix is fixed and both β * and σ 2 are known or accurate estimates of these quantities are available. Condition (28) can then be evaluated explicitly, at least after substituting β * and σ 2 by their respective estimates. The result is close in spirit to those in the framework in [17].
Lemma 8. Conditional on X, we have P Π(β * ) = Π * | X < δ if Proof. Without loss of generality, we may assume that Π * = I n . By construction of Π(β * ), it is clear that Π(β * ) = Π * whenever there exists i = j such that (y i −y j )(x i β * −x j β * ) < 0, that is, the order between x i β * and x j β * is different than that of y i , y j . First, conditioning on the random design matrix X, we have Since y i − y j | x i β * − x j β * ∼ N (x i β * − x j β * , 2σ 2 ), using the usual tail bound for the Gaussian distribution for each term in the above sum, we obtain Hence, for any δ > 0, P( Π(β * ) = Π * | X) < δ if min i<j x i β * − x j β * 2 > 4σ 2 log n(n − 1) δ .
The proof of part (b) in Theorem 3 requires another lemma.
We finally turn to the proof of part (b) of the theorem. Define indices i 0 , j 0 ⊂ [n] by the relation x i 0 β * − x j 0 β * = min i<j |x i β * − x j β * |. For given γ > 0, we have