
The matrix mechanism: optimizing linear counting queries under differential privacy

  • Regular Paper
  • The VLDB Journal

Abstract

Differential privacy is a robust privacy standard that has been successfully applied to a range of data analysis tasks. We describe the matrix mechanism, an algorithm for answering a workload of linear counting queries that adapts the noise distribution to properties of the provided queries. Given a workload, the mechanism uses a different set of queries, called a query strategy, which are answered using a standard Laplace or Gaussian mechanism. Noisy answers to the workload queries are then derived from the noisy answers to the strategy queries. This two-stage process can result in a more complex, correlated noise distribution that preserves differential privacy but increases accuracy. We provide a formal analysis of the error of query answers produced by the mechanism and investigate the problem of computing the optimal query strategy in support of a given workload. We show that this problem can be formulated as a rank-constrained semidefinite program. We analyze two seemingly distinct techniques proposed in the literature, whose similar behavior is explained by viewing them as instances of the matrix mechanism. We also describe an extension of the mechanism in which nonnegativity constraints are included in the derivation process and provide experimental evidence of its efficacy.

References

  1. Ács, G., Castelluccia, C., Chen, R.: Differentially private histogram publishing through lossy compression. In: ICDM, pp. 1–10 (2012)

  2. Barak, B., Chaudhuri, K., Dwork, C., Kale, S., McSherry, F., Talwar, K.: Privacy, accuracy, and consistency too: a holistic solution to contingency table release. In: PODS (2007)

  3. Ben-Israel, A., Greville, T.: Generalized Inverses: Theory and Applications, vol. 15. Springer, Berlin (2003)

  4. Cormode, G., Procopiuc, C.M., Shen, E., Srivastava, D., Yu, T.: Differentially private spatial decompositions. In: ICDE (2012)

  5. Dattorro, J.: Convex Optimization & Euclidean Distance Geometry. Meboo Publishing, USA (2005)

  6. Ding, B., Winslett, M., Han, J., Li, Z.: Differentially private data cubes: optimizing noise sources and consistency. In: SIGMOD, pp. 217–228 (2011)

  7. Dwork, C.: Differential privacy: a survey of results. In: TAMC (2008)

  8. Dwork, C.: The differential privacy frontier. In: TCC (2009)

  9. Dwork, C.: A firm foundation for private data analysis. Commun. ACM 54(1), 86–95 (2011)

  10. Dwork, C., Kenthapadi, K., McSherry, F., Mironov, I., Naor, M.: Our data, ourselves: privacy via distributed noise generation. In: EUROCRYPT, pp. 486–503 (2006)

  11. Dwork, C., Naor, M., Reingold, O., Rothblum, G., Vadhan, S.: On the complexity of differentially private data release: efficient algorithms and hardness results. In: STOC, pp. 381–390 (2009)

  12. Dwork, C., McSherry, F., Nissim, K., Smith, A.: Calibrating noise to sensitivity in private data analysis. In: TCC (2006)

  13. Dwork, C., Rothblum, G.N., Vadhan, S.P.: Boosting and differential privacy. In: FOCS, pp. 51–60 (2010)

  14. Ghosh, A., Roughgarden, T., Sundararajan, M.: Universally utility-maximizing privacy mechanisms. In: STOC (2009)

  15. Gupta, A., Roth, A., Ullman, J.: Iterative constructions and private data release. In: TCC, pp. 339–356 (2012)

  16. Hardt, M., Ligett, K., McSherry, F.: A simple and practical algorithm for differentially private data release. In: NIPS, pp. 2348–2356 (2012)

  17. Hardt, M., Rothblum, G.: A multiplicative weights mechanism for privacy-preserving data analysis. In: FOCS, pp. 61–70 (2010)

  18. Hardt, M., Talwar, K.: On the geometry of differential privacy. In: STOC, pp. 705–714 (2010)

  19. Hay, M., Rastogi, V., Miklau, G., Suciu, D.: Boosting the accuracy of differentially-private histograms through consistency. PVLDB 3(1–2), 1021–1032 (2010)

  20. Li, C., Hay, M., Rastogi, V., Miklau, G., McGregor, A.: Optimizing linear counting queries under differential privacy. In: PODS, pp. 123–134 (2010)

  21. Li, C., Miklau, G.: An adaptive mechanism for accurate query answering under differential privacy. PVLDB 5(6), 514–525 (2012)

  22. McSherry, F., Mironov, I.: Differentially private recommender systems: building privacy into the Netflix prize contenders. In: SIGKDD (2009)

  23. McSherry, F.D.: Privacy integrated queries: an extensible platform for privacy-preserving data analysis. In: SIGMOD, pp. 19–30 (2009)

  24. Nikolov, A., Talwar, K., Zhang, L.: The geometry of differential privacy: the sparse and approximate cases. In: STOC (2013)

  25. Nissim, K., Raskhodnikova, S., Smith, A.: Smooth sensitivity and sampling in private data analysis. In: STOC, pp. 75–84 (2007)

  26. Qardaji, W.H., Yang, W., Li, N.: Understanding hierarchical methods for differentially private histograms. PVLDB 6(14), 1954–1965 (2013)

  27. Roth, A., Roughgarden, T.: Interactive privacy via the median mechanism. In: STOC, pp. 765–774 (2010)

  28. Xiao, X., Wang, G., Gehrke, J.: Differential privacy via wavelet transforms. In: ICDE, pp. 225–236 (2010)

  29. Xiao, Y., Gardner, J.J., Xiong, L.: DPCube: releasing differentially private data cubes for health information. In: ICDE, pp. 1305–1308 (2012)

  30. Xu, J., Zhang, Z., Xiao, X., Yang, Y., Yu, G., Winslett, M.: Differentially private histogram publication. VLDB J 22(6), 797–822 (2013)

  31. Yaroslavtsev, G., Cormode, G., Procopiuc, C.M., Srivastava, D.: Accurate and efficient private release of datacubes and contingency tables. In: ICDE (2013)

  32. Yuan, G., Zhang, Z., Winslett, M., Xiao, X., Yang, Y., Hao, Z.: Low-rank mechanism: optimizing batch queries under differential privacy. In: VLDB (2012)

Acknowledgments

We appreciate the comments of each of the anonymous reviewers. Hay, Li, and Miklau were supported by the National Science Foundation under NSF Award Numbers IIS-0964094, CNS-1012748 and CNS-1409143. McGregor was supported by CCF-0953754. Any opinions, findings, conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect those of the National Science Foundation.

Author information

Corresponding author

Correspondence to Chao Li.

Appendices

Appendix 1: Linear algebra fundamentals

In this section, we summarize the concepts and results in linear algebra and matrix analysis that are used throughout the paper. We use \(\text{ diag }(d_1, \ldots , d_n)\) to indicate the \(n \times n\) diagonal matrix with scalars \(d_i\) on the diagonal and \(\mathbf {0}^{m \times n}\) to indicate a matrix of zeroes with m rows and n columns. Recall that for a matrix \(\mathbf {A},\,\mathbf {A}^T\) is its transpose and \({\mathbf {A}}^{-1}\) is its inverse. We say \(\mathbf {A}\) is symmetric if \(\mathbf {A}^T=\mathbf {A}\) and orthogonal if \(\mathbf {A}^T={\mathbf {A}}^{-1}\). The rank of a matrix \(\mathbf {A},\,\mathrm{rank}(\mathbf {A})\), is defined as the size of the largest set of linearly independent rows (or equivalently columns) of \(\mathbf {A}\). We say a matrix is full row (column) rank if its rank is equal to the number of its rows (columns). In particular, \({\mathbf {A}}^{-1}\) exists if and only if \(\mathbf {A}\) is a square matrix with full rank.

If matrix \(\mathbf {A}\) is a square matrix, the trace of \(\mathbf {A}\), denoted as \(\text{ trace }(\mathbf {A})\), is the sum of the entries on the main diagonal of \(\mathbf {A}\). The trace has an important property: it is invariant under cyclic permutations; i.e., if the product \(\mathbf {A}_1\mathbf {A}_2\mathbf {A}_3\) is defined, matrix \(\mathbf {A}_1\) has m rows and matrix \(\mathbf {A}_3\) has m columns,

$$\begin{aligned} \text{ trace }(\mathbf {A}_1\mathbf {A}_2\mathbf {A}_3)=\text{ trace }(\mathbf {A}_3\mathbf {A}_1\mathbf {A}_2) \end{aligned}$$

Another concept that is related to the trace is the Frobenius norm. The Frobenius norm of \(\mathbf {A}\) is denoted as \(||\mathbf {A}||_F\) and defined as \(\sqrt{\text{ trace }(\mathbf {A}^T\mathbf {A})}\), or, equivalently, the square root of the sum of the squares of all entries in \(\mathbf {A}\).
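As a small numerical illustration of the two facts above, the following sketch checks the cyclic property of the trace and the two equivalent forms of the Frobenius norm on hand-picked matrices. The helper names (`matmul`, `transpose`, `trace`) and the example matrices are illustrative assumptions, not part of the paper; the shapes are chosen so that both products are defined.

```python
import math

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*X)]

def trace(X):
    """Sum of the main-diagonal entries of a square matrix."""
    return sum(X[i][i] for i in range(len(X)))

# Illustrative matrices: A1 is 2x3, A2 is 3x2, A3 is 2x2.
A1 = [[1, 2, 0], [0, 1, 3]]
A2 = [[1, 0], [2, 1], [0, 4]]
A3 = [[2, 1], [1, 1]]

lhs = trace(matmul(matmul(A1, A2), A3))   # trace(A1 A2 A3)
rhs = trace(matmul(matmul(A3, A1), A2))   # trace(A3 A1 A2)

# Frobenius norm of A1 computed two ways.
frob_via_trace = math.sqrt(trace(matmul(transpose(A1), A1)))
frob_via_entries = math.sqrt(sum(v * v for row in A1 for v in row))
```

Both traces agree, and both Frobenius computations give the square root of 15 for this choice of `A1`.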

Matrix decomposition is extensively used in the paper. We focus on two decompositions: eigenvalue decomposition and singular value decomposition. Given a matrix \(\mathbf {A}\), the eigenvalue decomposition of \(\mathbf {A}\) always exists when \(\mathbf {A}\) is symmetric. It can be written as \(\mathbf {A}=\mathbf {Q}\mathbf {D}\mathbf {Q}^T\) where \(\mathbf {Q}\) is an orthogonal matrix whose columns are eigenvectors of \(\mathbf {A}\) and \(\mathbf {D}\) is a diagonal matrix whose diagonal entries are eigenvalues of \(\mathbf {A}\). The singular value decomposition of \(\mathbf {A}\) always exists and has the form \(\mathbf {A}=\mathbf {Q}\mathbf {D}\mathbf {P}^T\) where \(\mathbf {Q}\) and \(\mathbf {P}\) are orthogonal matrices and \(\mathbf {D}\) is a diagonal matrix padded with columns or rows of 0s.

We will also rely on the notion of a positive semidefinite matrix. A symmetric square matrix \(\mathbf {A}\) is called positive semidefinite, denoted as \(\mathbf {A}\succeq 0\), if for any vector \(\mathbf {x},\,\mathbf {x}^T\mathbf {A}\mathbf {x}\ge 0\). In particular, for any matrix \(\mathbf {A},\,\mathbf {A}^T\mathbf {A}\succeq 0\). Here, we present two conditions that are equivalent to positive semidefiniteness.

Proposition 15

Given an \(n\times n\) symmetric matrix \(\mathbf {A}\), each of the following conditions is equivalent to \(\mathbf {A}\succeq 0\).

  (i) All the eigenvalues of \(\mathbf {A}\) are nonnegative.

  (ii) For any \(1\le i_1<\cdots < i_k\le n\), the determinant of the submatrix formed by the intersection of the \(i_1^{th},\ldots , i_k^{th}\) rows and \(i_1^{th},\ldots , i_k^{th}\) columns of \(\mathbf {A}\) is nonnegative.
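The two conditions can be compared numerically. The sketch below is only an illustration under assumptions: it uses NumPy, the function names are hypothetical, and a small tolerance `tol` absorbs floating-point error. It tests a matrix of the form \(\mathbf {B}^T\mathbf {B}\), which is positive semidefinite, and a symmetric matrix with a negative eigenvalue.

```python
from itertools import combinations

import numpy as np

def is_psd_by_eigenvalues(A, tol=1e-9):
    """Condition (i): all eigenvalues of the symmetric matrix A are nonnegative."""
    return bool(np.all(np.linalg.eigvalsh(A) >= -tol))

def is_psd_by_principal_minors(A, tol=1e-9):
    """Condition (ii): every principal minor of A is nonnegative."""
    n = A.shape[0]
    for k in range(1, n + 1):
        for idx in combinations(range(n), k):
            # Submatrix on rows idx and columns idx.
            minor = np.linalg.det(A[np.ix_(idx, idx)])
            if minor < -tol:
                return False
    return True

B = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 1.0]])
G = B.T @ B                               # of the form B^T B, hence PSD
S = np.array([[1.0, 2.0],
              [2.0, 1.0]])                # symmetric with eigenvalues 3 and -1
```

Enumerating all principal minors takes exponential time in n, so this check is only practical for very small matrices; the eigenvalue test is the usual choice in practice.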

In addition, we consider a generalization of the matrix inverse, called the Moore–Penrose pseudoinverse, which is defined as follows:

Definition 17

(Moore–Penrose Pseudoinverse [3]) Given an \(m\times n\) matrix \(\mathbf {A}\), a matrix \(\mathbf {A}^{+}\) is the Moore–Penrose pseudoinverse of \(\mathbf {A}\) if it satisfies each of the following:

$$\begin{aligned} \mathbf {A}\mathbf {A}^{+}\mathbf {A}=\mathbf {A},\;\mathbf {A}^{+}\mathbf {A}\mathbf {A}^{{+}} =\mathbf {A}^{{+}},\;(\mathbf {A}\mathbf {A}^{+})^T =\mathbf {A}\mathbf {A}^{+}, \; (\mathbf {A}^{+}\mathbf {A})^T=\mathbf {A}^{+}\mathbf {A}. \end{aligned}$$

The Moore–Penrose pseudoinverse is unique and can be computed using the singular value decomposition of a matrix.

Proposition 16

([3]) Given an \(n\times n\) diagonal matrix \(\mathbf {D}_0=\{d_{ij}\}\), its pseudoinverse \(\mathbf {D}_0^{+}=\{d'_{ij}\}\) is an \(n\times n\) diagonal matrix such that

$$\begin{aligned} d'_{ij}=\left\{ \begin{array}{cl}0 &{}\quad d_{ij}=0 \\ \frac{1}{d_{ij}} &{}\quad d_{ij}\ne 0\end{array}\right. \end{aligned}$$

For an \(m\times n\) matrix \(\mathbf {D}\) consisting of a diagonal matrix \(\mathbf {D}_0\) padded with columns (rows) of 0s, \(\mathbf {D}^{+}\) is an \(n\times m\) matrix consisting of the diagonal matrix \(\mathbf {D}_0^{+}\) padded with rows (columns) of 0s. Given a matrix \(\mathbf {A}\) with singular value decomposition \(\mathbf {A}=\mathbf {Q}\mathbf {D}\mathbf {P}^T,\,\mathbf {A}^{+}=\mathbf {P}\mathbf {D}^{+}\mathbf {Q}^T\).
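The construction above can be sketched with NumPy as follows. This is a minimal illustration, not the authors' implementation: the function name `pinv_via_svd`, the threshold `tol` used to decide which singular values count as zero, and the rank-deficient example matrix are all assumptions made here.

```python
import numpy as np

def pinv_via_svd(A, tol=1e-12):
    """Compute A^+ = P D^+ Q^T from the SVD A = Q D P^T."""
    Q, s, Pt = np.linalg.svd(A, full_matrices=True)
    m, n = A.shape
    # D^+ is n x m: invert the nonzero singular values, keep zeros as zeros.
    D_plus = np.zeros((n, m))
    for i, sigma in enumerate(s):
        if sigma > tol:
            D_plus[i, i] = 1.0 / sigma
    return Pt.T @ D_plus @ Q.T

# A 4x3 matrix of rank 2 (row 3 = row 1 + row 2, row 4 = 2 * row 1).
A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 2.0, 1.0],
              [2.0, 2.0, 0.0]])
A_plus = pinv_via_svd(A)

# The four defining conditions of the Moore-Penrose pseudoinverse.
penrose = [
    np.allclose(A @ A_plus @ A, A),
    np.allclose(A_plus @ A @ A_plus, A_plus),
    np.allclose((A @ A_plus).T, A @ A_plus),
    np.allclose((A_plus @ A).T, A_plus @ A),
]
```

In practice `numpy.linalg.pinv` performs the same computation; the explicit version is shown only to mirror the proposition.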

When \(\mathbf {A}\) has full column rank, \(\mathbf {A}^{+}=(\mathbf {A}^T\mathbf {A})^{-1}\mathbf {A}^T\). We include some important properties of the Moore–Penrose pseudoinverse in the following proposition.

Proposition 17

([3]) The Moore–Penrose pseudoinverse satisfies the following properties:

  1. Given any matrix \(\mathbf {A}\), there exists a unique matrix that is the Moore–Penrose pseudoinverse of \(\mathbf {A}\).

  2. Given a vector \(\mathbf {y}\), we have \(||\mathbf {y}-\mathbf {A}\mathbf {x}||_2\ge ||\mathbf {y}-\mathbf {A}\mathbf {A}^{+}\mathbf {y}||_2\) for any vector \(\mathbf {x}\).

  3. For any satisfiable linear system \(\mathbf {B}\mathbf {A}=\mathbf {W},\,\mathbf {W}\mathbf {A}^{+}\) is a solution to the linear system and \(||\mathbf {W}\mathbf {A}^{+}||_F\le ||\mathbf {B}||_F\) for any solution \(\mathbf {B}\) to the linear system.
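The least-squares property and the minimum-norm property can be checked on a small example. The sketch below assumes NumPy; the matrices, the random seed, and the number of random trial vectors are arbitrary illustrative choices, and a random search is of course only a sanity check, not a proof.

```python
import numpy as np

rng = np.random.default_rng(0)

A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0]])                 # 3x2, full column rank
A_plus = np.linalg.pinv(A)

# Least-squares property: A A^+ y is the closest point to y in the range of A.
y = np.array([1.0, 2.0, 4.0])
best = np.linalg.norm(y - A @ A_plus @ y)
others = [np.linalg.norm(y - A @ rng.normal(size=2)) for _ in range(1000)]
least_squares_ok = all(r >= best - 1e-12 for r in others)

# Minimum-norm property: build a satisfiable system B A = W from an arbitrary B,
# then compare B with the particular solution W A^+.
B = rng.normal(size=(2, 3))
W = B @ A
B_min = W @ A_plus
is_solution = np.allclose(B_min @ A, W)
norm_ok = np.linalg.norm(B_min) <= np.linalg.norm(B) + 1e-12
```

Here `np.linalg.norm` on a matrix returns the Frobenius norm by default, which matches the norm used in the proposition.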

Appendix 2: Proofs

1.1 Proofs from Section 4

Proposition 6 The matrix mechanism \(\mathcal {M}_{\mathcal {K}, \mathbf {A}}\) inherits the privacy guarantee of \(\mathcal {K}\) and is unbiased if \(\mathcal {K}\) is unbiased.

Proof

According to Eq. (1), the matrix mechanism can be considered to post-process the output of \(\mathcal {K}(\mathbf {A}, \mathbf {x})\), without using \(\mathbf {x}\), and hence shares the same privacy guarantee with \(\mathcal {K}(\mathbf {A}, \mathbf {x})\). In addition, since \(\mathbf {W}\mathbf {A}^{+}\mathbf {A}=\mathbf {W}\), we have:

$$\begin{aligned}\mathbb {E}[\mathcal {M}_{\mathcal {K},\mathbf {A}}(\mathbf {W},\mathbf {x})]&=\mathbb {E}[\mathbf {W}\mathbf {A}^{+}\mathcal {K}(\mathbf {A}, \mathbf {x})]\\&=\mathbf {W}\mathbf {A}^{+}\mathbb {E}[\mathcal {K}(\mathbf {A}, \mathbf {x})]=\mathbf {W}\mathbf {A}^{+}\mathbf {A}\mathbf {x}=\mathbf {W}\mathbf {x}. \end{aligned}$$

\(\square \)
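The argument in this proof can be illustrated by simulation. The sketch below is a hedged example, not the paper's implementation: it assumes NumPy, takes \(\mathcal {K}\) to be the standard Laplace mechanism with noise scaled to the largest column L1 norm of the strategy, and uses a small hierarchical strategy, a two-query workload, and a data vector that are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

x = np.array([3.0, 1.0, 4.0, 1.0])           # illustrative data vector (cell counts)
A = np.array([[1, 1, 1, 1],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [1, 0, 0, 0],
              [0, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)     # strategy: total, halves, single cells
W = np.array([[1, 1, 1, 0],
              [0, 1, 1, 1]], dtype=float)     # workload: two range queries

A_plus = np.linalg.pinv(A)
supports = np.allclose(W @ A_plus @ A, W)     # W A^+ A = W

def laplace_K(A, x, epsilon, rng, trials):
    """Assumed K: Laplace mechanism answering A on x, repeated `trials` times."""
    sensitivity = np.abs(A).sum(axis=0).max()  # largest column L1 norm of A
    noise = rng.laplace(scale=sensitivity / epsilon, size=(trials, A.shape[0]))
    return A @ x + noise                       # one noisy strategy answer per row

noisy_strategy = laplace_K(A, x, epsilon=1.0, rng=rng, trials=20000)
# Post-processing step of the matrix mechanism: apply W A^+ to each noisy answer.
workload_answers = noisy_strategy @ (W @ A_plus).T
empirical_mean = workload_answers.mean(axis=0)
true_answers = W @ x
```

The post-processing step never touches `x` directly, which is why the privacy guarantee of the assumed `laplace_K` carries over, and the empirical mean of the derived answers settles near `W @ x`.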

1.2 Proofs from Section 5

Corollary 1 Given two query strategies \(\mathbf {A}_1\) and \(\mathbf {A}_2\) that are profile equivalent, for any workload \(\mathbf {W},\,\mathbf {A}_1\) supports \(\mathbf {W}\) if and only if \(\mathbf {A}_2\) supports \(\mathbf {W}\). Furthermore, there exists a nonzero constant c such that given a differentially private algorithm \(\mathcal {K}\), for any workload \(\mathbf {W}\) that \(\mathbf {A}_1\) and \(\mathbf {A}_2\) support, \({\textsc {TotalError}}_{\mathcal {K}, \mathbf {A}_1}( \mathbf {W} )=c\cdot {\textsc {TotalError}}_{\mathcal {K}, \mathbf {A}_2}( \mathbf {W} )\).

Proof

Given a query workload \(\mathbf {W}\), if \(\mathbf {A}_1\) supports \(\mathbf {W}\), there exists a matrix \(\mathbf {X}\) such that \(\mathbf {W}=\mathbf {X}\mathbf {A}_1\). According to Proposition 11(iii), \(\mathbf {A}_1\) and \(\mathbf {A}_2\) are profile equivalent if and only if there exist a nonzero constant c and an orthogonal matrix \(\mathbf {Q}\) such that \(\mathbf {A}_1=c\cdot \mathbf {Q}\mathbf {A}_2\). Then, \(c\cdot \mathbf {X}\mathbf {Q}\) satisfies \(\mathbf {W}=c\cdot \mathbf {X}\mathbf {Q}\mathbf {A}_2\), and therefore, \(\mathbf {A}_2\) supports \(\mathbf {W}\) as well. The converse follows by exchanging the roles of \(\mathbf {A}_1\) and \(\mathbf {A}_2\), since \(\mathbf {A}_2=c^{-1}\cdot \mathbf {Q}^T\mathbf {A}_1\).

The definition of profile equivalence indicates that there is a constant \(c'\) such that \((\mathbf {A}_1^T\mathbf {A}_1)^{+}= c'\cdot (\mathbf {A}_2^T\mathbf {A}_2)^{+}\). Thus, for any query workload that \(\mathbf {A}_1\) supports:

$$\begin{aligned} \frac{{\textsc {TotalError}}_{\mathcal {K}, \mathbf {A}_1}( \mathbf {W} )}{{\textsc {TotalError}}_{\mathcal {K}, \mathbf {A}_2}( \mathbf {W} )}&=\frac{||\mathbf {A}_1||^2||\mathbf {W}\mathbf {A}_1^{+}||^2_F}{||\mathbf {A}_2||^2||\mathbf {W}\mathbf {A}_2^{+}||^2_F}\\&=\frac{||\mathbf {A}_1||^2\text{ trace }(\mathbf {W}(\mathbf {A}_1^T\mathbf {A}_1)^{+}\mathbf {W}^T)}{||\mathbf {A}_2||^2\text{ trace }(\mathbf {W}(\mathbf {A}_2^T\mathbf {A}_2)^{+}\mathbf {W}^T)}\\&=c'\frac{||\mathbf {A}_1||^2}{||\mathbf {A}_2||^2}, \end{aligned}$$

which is independent of \(\mathbf {W}\). \(\square \)
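The key step of this proof, that the trace terms for two profile-equivalent strategies differ by a factor that does not depend on the workload, can be checked numerically. The sketch below assumes NumPy and builds \(\mathbf {A}_1=c\cdot \mathbf {Q}\mathbf {A}_2\) from a random \(\mathbf {A}_2\), a random orthogonal \(\mathbf {Q}\) obtained by QR factorization, and an arbitrary scale c = 3; all of these values and the helper name `trace_term` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

A2 = rng.normal(size=(5, 4))                     # illustrative strategy
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))     # random orthogonal matrix
c = 3.0
A1 = c * Q @ A2                                  # A1 = c * Q * A2

def trace_term(W, A):
    """trace(W (A^T A)^+ W^T), the workload-dependent part of the error."""
    return np.trace(W @ np.linalg.pinv(A.T @ A) @ W.T)

# The ratio of trace terms for several unrelated random workloads.
ratios = []
for _ in range(5):
    W = rng.normal(size=(3, 4))
    ratios.append(trace_term(W, A1) / trace_term(W, A2))
```

Because \(\mathbf {A}_1^T\mathbf {A}_1=c^2\cdot \mathbf {A}_2^T\mathbf {A}_2\), every entry of `ratios` equals 1/c², which plays the role of the constant \(c'\) in this example.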

1.3 Proofs from Section 6

Corollary 2 For any linear counting query \(\mathbf {w}\) and differentially private mechanism \(\mathcal {K}\),

$$\begin{aligned} \frac{1}{2}{\textsc {Error}}_{\mathcal {K}, \mathbf {Y}}( \mathbf {w} )\le {\textsc {Error}}_{\mathcal {K}, \mathbf {H}}( \mathbf {w} )\le 2{\textsc {Error}}_{\mathcal {K}, \mathbf {Y}}( \mathbf {w} ). \end{aligned}$$

Proof

According to Theorem 1, let \(\mathbf {H}'_n\) be the matrix that results from removing the row of all 1s from matrix \({\left[ \begin{array}{c}\mathbf {H}_n \\ \mathbf {I}_n\end{array}\right] }\). Since \(\mathbf {H}'_n\) and \(\mathbf {Y}_n\) are equivalent strategies under both \(\epsilon \)- and \((\epsilon ,\delta )\)-differentially private mechanisms, it is sufficient to prove that for any linear counting query \(\mathbf {w}\),

$$\begin{aligned} \frac{1}{2}{\textsc {Error}}_{\mathcal {K}, \mathbf {H}'_n}( \mathbf {w} )\le {\textsc {Error}}_{\mathcal {K}, \mathbf {H}_n}( \mathbf {w} )\le 2{\textsc {Error}}_{\mathcal {K}, \mathbf {H}'_n}( \mathbf {w} ). \end{aligned}$$

Let \(\mathbf {v}=\mathbf {w}\mathbf {H}_n^{+}\) and \(\mathbf {v}'\) be a vector such that

$$\begin{aligned} v'_i=\left\{ \begin{array}{ll}v_1 &{}\quad 1\le i \le 2 \\ v_{i-1} &{}\quad 3 \le i \le 2n\\ 0 &{}\quad 2n+1\le i\le 3n \end{array}\right. . \end{aligned}$$

One can verify that \(\mathbf {v}'\mathbf {H}'_n=\mathbf {v}\mathbf {H}_n=\mathbf {w}\). Since \(||\mathbf {v}'||_F^2=||\mathbf {v}||_F^2+v_1^2\le 2||\mathbf {v}||_F^2\), we have

$$\begin{aligned} ||\mathbf {w}{\mathbf {H}'_n}^{+}||_F\le ||\mathbf {v}'||_F \le \sqrt{2}||\mathbf {v}||_F=\sqrt{2}||\mathbf {w}\mathbf {H}_n^{+}||_F. \end{aligned}$$

Noticing that \(||\mathbf {H}_n||_1=||\mathbf {H}_n'||_1\) and \(||\mathbf {H}_n||_2=||\mathbf {H}_n'||_2\), it follows that \({\textsc {Error}}_{\mathcal {K}, \mathbf {H}'_n}( \mathbf {w} )\le 2{\textsc {Error}}_{\mathcal {K}, \mathbf {H}_n}( \mathbf {w} )\).

On the other hand, \(\mathbf {H}'_n\) contains two copies of the queries in \(\mathbf {I}_n\), which is equivalent to reducing the error on those queries by a factor of 2. Noticing that all other queries in \(\mathbf {H}'_n\) are contained in \(\mathbf {H}_n\), we have \({\textsc {Error}}_{\mathcal {K}, \mathbf {H}_n}( \mathbf {w} )\le 2{\textsc {Error}}_{\mathcal {K}, \mathbf {H}'_n}( \mathbf {w} )\). \(\square \)

Corollary 3 (Cell condition simplification) Given a workload \(\mathbf {W}_1\subseteq \mathbf {W}_R\) with m queries and its corresponding list of cell conditions, \(\varPhi _1\), there exists a workload \(\mathbf {W}_2\) and a list of cell conditions, \(\varPhi _2\), such that \((\mathbf {W}_1, \varPhi _1)\equiv (\mathbf {W}_2, \varPhi _2)\) and \(|\varPhi _2| \le 2m-1\).

Proof

Since \(\mathbf {W}_1\subseteq \mathbf {W}_R\), let \(b_1, \ldots , b_{2m}\) be the boundaries of queries in \(\mathbf {W}_1\) such that \(b_1\le \cdots \le b_{2m}\). Any query \(q\in \mathbf {W}_1\) contains either all cells between \(b_i\) and \(b_{i+1}\) or none of them. Therefore, columns \(b_i,\ldots , b_{i+1}-1\) are exactly the same. According to Theorem 7, removing columns \(b_i+1,\ldots , b_{i+1}-1\) and merging the cell conditions of \(b_i,\ldots , b_{i+1}-1\) results in a workload and a list of cell conditions that are equivalent to \((\mathbf {W}_1, \varPhi _1)\). Thus, merging cells between \(b_i, \ldots , b_{i+1}-1\) for \(i=1,\ldots , 2m-1\) and discarding the rest of the cells gives a workload \(\mathbf {W}_2\) and its corresponding list of cell conditions \(\varPhi _2\), such that \((\mathbf {W}_2, \varPhi _2)\equiv (\mathbf {W}_1, \varPhi _1)\) and \(|\varPhi _2| \le 2m-1\). \(\square \)
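The merging step in this proof can be sketched in a few lines. The code below is an illustration under assumptions: the workload is a list of 0/1 rows, one per range query; each merged cell condition is represented simply by the list of original cell indices it covers (standing in for the disjunction of their conditions); and the function name `simplify_cells` is hypothetical.

```python
def simplify_cells(W):
    """Merge runs of adjacent identical columns and drop all-zero columns.

    W is a list of 0/1 rows, one per range query. Returns the reduced
    workload and, for each kept column, the original cell indices merged into it.
    """
    n = len(W[0])
    columns = [tuple(row[j] for row in W) for j in range(n)]
    groups = []
    for j, col in enumerate(columns):
        if groups and columns[groups[-1][0]] == col:
            groups[-1].append(j)      # same column as the current run: merge
        else:
            groups.append([j])        # a query boundary starts a new run
    # Cells covered by no query contribute nothing and are discarded.
    kept = [g for g in groups if any(columns[g[0]])]
    W2 = [[row[g[0]] for g in kept] for row in W]
    return W2, kept

# Two range queries (m = 2) over 8 cells.
W1 = [
    [0, 1, 1, 1, 1, 0, 0, 0],   # cells 1..4
    [0, 0, 0, 1, 1, 1, 1, 0],   # cells 3..6
]
W2, merged = simplify_cells(W1)
```

For this example the eight original cells collapse into three merged cells, which meets the bound 2m − 1 = 3.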

About this article

Cite this article

Li, C., Miklau, G., Hay, M. et al. The matrix mechanism: optimizing linear counting queries under differential privacy. The VLDB Journal 24, 757–781 (2015). https://doi.org/10.1007/s00778-015-0398-x

  • DOI: https://doi.org/10.1007/s00778-015-0398-x
