The PRIMPING routine—Tiling through proximal alternating linearized minimization

Abstract

Mining and exploring databases should provide users with knowledge and new insights. Tiles of data strive to unveil true underlying structure and distinguish valuable information from various kinds of noise. We propose a novel Boolean matrix factorization algorithm to solve the tiling problem, based on recent results from optimization theory. In contrast to existing work, the new algorithm minimizes the description length of the resulting factorization. This approach is well known for model selection and data compression, but not for finding suitable factorizations via numerical optimization. We demonstrate the superior robustness of the new approach in the presence of several kinds of noise and types of underlying structure. Moreover, our general framework can work with any cost measure having a suitable real-valued relaxation. Thereby, no convexity assumptions have to be met. The experimental results on synthetic data and image data show that the new method identifies interpretable patterns which almost always explain the data better than those found by the competing algorithms.

Notes

  1. \({{\mathrm{dom}}}(\phi )\) is the domain of \(\phi \)

  2. http://hpc.isti.cnr.it/~claudio/web/archives/20131113/index.html

  3. http://people.mpi-inf.mpg.de/~skaraev/

  4. http://sfb876.tu-dortmund.de/primp

  5. http://grouplens.org/datasets/movielens/10m/

  6. http://grouplens.org/datasets/movielens/1m/

References

  • Bauckhage C (2015) k-means clustering is matrix factorization. arXiv preprint arXiv:1512.07548

  • Bolte J, Sabach S, Teboulle M (2014) Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Math Program 146(1–2):459–494

  • Cover T, Thomas J (2006) Elements of information theory. Wiley-Interscience, Hoboken

  • De Bie T (2011) Maximum entropy models and subjective interestingness: an application to tiles in binary databases. Data Min Knowl Discov 23(3):407–446

  • Ding C, Li T, Peng W, Park H (2006) Orthogonal nonnegative matrix t-factorizations for clustering. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 126–135

  • Ding CH, He X, Simon HD (2005) On the equivalence of nonnegative matrix factorization and spectral clustering. In: Proceedings of the SIAM international conference on data mining (SDM), pp 606–610

  • Geerts F, Goethals B, Mielikäinen T (2004) Tiling databases. In: International conference on discovery science (DS), pp 278–289

  • Grünwald P (2007) The minimum description length principle. MIT Press, Cambridge

  • Hess S, Piatkowski N, Morik K (2014) Shrimp: descriptive patterns in a tree. In: Proceedings of the LWA workshops: KDML, IR and FGWM, pp 181–192

  • Jarrett K, Kavukcuoglu K, Ranzato M, LeCun Y (2009) What is the best multi-stage architecture for object recognition? In: IEEE international conference on computer vision (ICCV), pp 2146–2153

  • Karaev S, Miettinen P, Vreeken J (2015) Getting to know the unknown unknowns: destructive-noise resistant Boolean matrix factorization. In: Proceedings of the SIAM international conference on data mining (SDM), pp 325–333

  • Kontonasios KN, De Bie T (2010) An information-theoretic approach to finding informative noisy tiles in binary databases. In: Proceedings of the SIAM international conference on data mining (SDM), pp 153–164

  • Kuhn HW (1955) The Hungarian method for the assignment problem. Naval Res Logist Q 2(1–2):83–97

  • Lee DD, Seung HS (1999) Learning the parts of objects by non-negative matrix factorization. Nature 401(6755):788–791

  • Lee DD, Seung HS (2001) Algorithms for non-negative matrix factorization. In: Advances in neural information processing systems (NIPS), pp 556–562

  • Li M, Vitányi P (1997) An introduction to Kolmogorov complexity and its applications. Springer, Berlin

  • Li T (2005) A general model for clustering binary data. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery in data mining (KDD), pp 188–197

  • Li T, Ding C (2006) The relationships among various nonnegative matrix factorization methods for clustering. In: International conference on data mining (ICDM), pp 362–371

  • Lucchese C, Orlando S, Perego R (2010) Mining top-k patterns from binary datasets in presence of noise. In: Proceedings of the SIAM international conference on data mining (SDM), pp 165–176

  • Lucchese C, Orlando S, Perego R (2014) A unifying framework for mining approximate top-k binary patterns. Trans Knowl Data Eng 26(12):2900–2913

  • Miettinen P (2015) Generalized matrix factorizations as a unifying framework for pattern set mining: complexity beyond blocks. In: European conference on machine learning and principles and practice of knowledge discovery in databases (ECMLPKDD), pp 36–52

  • Miettinen P, Vreeken J (2014) Mdl4bmf: minimum description length for Boolean matrix factorization. Trans Knowl Discov Data 8(4):18:1–18:31

  • Miettinen P, Mielikainen T, Gionis A, Das G, Mannila H (2008) The discrete basis problem. Trans Knowl Data Eng 20(10):1348–1362

  • Paatero P, Tapper U (1994) Positive matrix factorization: a non-negative factor model with optimal utilization of error estimates of data values. Environmetrics 5(2):111–126

  • Parikh N, Boyd S (2014) Proximal algorithms. Found Trends Optim 1(3):127–239

  • Rissanen J (1978) Modeling by shortest data description. Automatica 14:465–471

  • Siebes A, Kersten R (2011) A structure function for transaction data. In: Proceedings of the SIAM international conference on data mining (SDM), pp 558–569

  • Siebes A, Vreeken J, van Leeuwen M (2006) Item sets that compress. In: Proceedings of the SIAM international conference on data mining (SDM), pp 393–404

  • Smets K, Vreeken J (2012) Slim: directly mining descriptive patterns. In: Proceedings of the SIAM international conference on data mining (SDM), pp 236–247

  • Tatti N, Vreeken J (2012) Comparing apples and oranges: measuring differences between exploratory data mining results. Data Min Knowl Discov 25(2):173–207

  • van Leeuwen M, Siebes A (2008) StreamKrimp: detecting change in data streams. In: European conference on machine learning and principles and practice of knowledge discovery in databases (ECMLPKDD), pp 672–687

  • Vreeken J, van Leeuwen M, Siebes A (2011) Krimp: mining itemsets that compress. Data Min Knowl Discov 23(1):169–214

  • Wang YX, Zhang YJ (2013) Nonnegative matrix factorization: a comprehensive review. Trans Knowl Data Eng 25(6):1336–1353

  • Xiang Y, Jin R, Fuhry D, Dragan FF (2011) Summarizing transactional databases with overlapped hyperrectangles. Data Min Knowl Discov 23(2):215–251

  • Zhang Z, Ding C, Li T, Zhang X (2007) Binary matrix factorization with applications. In: International conference on data mining (ICDM), pp 391–400

  • Zimek A, Vreeken J (2013) The blind men and the elephant: on meeting the problem of multiple truths in data from clustering and pattern mining perspectives. Mach Learn 98(1–2):121–155

Acknowledgements

Part of the work on this paper has been supported by Deutsche Forschungsgemeinschaft (DFG) within the Collaborative Research Center SFB 876 “Providing Information by Resource-Constrained Analysis”, projects A1 and C1 (http://sfb876.tu-dortmund.de). Furthermore, we thank Jilles Vreeken and Sanjar Karaev for their support in the execution of experiments and useful remarks.

Author information

Corresponding author

Correspondence to Sibylle Hess.

Additional information

Responsible editor: Johannes Fürnkranz.

Appendices

Appendix 1: Derivation of the proximal operator

Theorem 1 Let \(\alpha >0\) and \(\phi (X)=\sum _{i,j}\varLambda (X_{ij})\) for \(X\in {\mathbb {R}}^{m\times n}\). The proximal operator of \(\alpha \phi \) maps the matrix X to the matrix \({{\mathrm{prox}}}_{\alpha \phi }(X)=A\in [0,1]^{m\times n}\) defined by \(A_{ji}={{\mathrm{prox}}}_{\alpha \varLambda }(X_{ji})\), where for \(x\in {\mathbb {R}}\) it holds that

$$\begin{aligned} {{\mathrm{prox}}}_{\alpha \varLambda }(x)= {\left\{ \begin{array}{ll} \max \{0,x-2\alpha \} &{}x\le 0.5\\ \min \{1,x+2\alpha \} &{}x>0.5. \end{array}\right. } \end{aligned}$$
(9)

Proof

Let \(\alpha >0\), \(X\in {\mathbb {R}}^{m\times n}\) for some \(m,n\in \mathbb {N}\) and \(A={{\mathrm{prox}}}_{\alpha \phi }(X)\). The function \(\phi \) is fully separable across all matrix entries. In this case, the proximal operator can be applied entry-wise to the composing scalar functions (Parikh and Boyd 2014), i.e., \(A_{ji}={{\mathrm{prox}}}_{\alpha \varLambda }(X_{ji})\). It remains to derive the proximal mapping of \(\varLambda \) as stated in Eq. (9).

The proximal operator reduces to Euclidean projection if the argument lies outside of the function’s domain (Parikh and Boyd 2014) and it follows that

$$\begin{aligned} {{\mathrm{prox}}}_{\alpha \varLambda }(x)=\theta (x) \text { if } x\notin [0,1]. \end{aligned}$$

For \(x\in [0,1]\), it holds that \(\varLambda (x)=-|1-2x|+1\) and

$$\begin{aligned} {{\mathrm{prox}}}_{\alpha \varLambda }(x)&= {\mathop {{{\mathrm{arg\,min}}}}\limits _{x^\star \in {\mathbb {R}}}} \left\{ \frac{1}{2}(x-x^\star )^2-\alpha |1-2x^\star | +\alpha \right\} \\&= {\mathop {{{\mathrm{arg\,min}}}}\limits _{x^\star \in {\mathbb {R}}}} \left\{ \underbrace{(x-x^\star )^2-2\alpha |1-2x^\star | +(2\alpha )^2}_{=g(x^\star ;x,\alpha )}\right\} , \end{aligned}$$

where g is obtained from the objective by multiplication with and addition of constants, so that the minimum can easily be found by completing the square.

$$\begin{aligned} g(x^\star ;x,\alpha )&={\left\{ \begin{array}{ll} (x-x^\star )^2 -2\alpha (1-2 x^\star ) +(2\alpha )^2 &{} x^\star \le 0.5\\ (x-x^\star )^2 +2\alpha (1-2 x^\star ) +(2\alpha )^2 &{} x^\star> 0.5 \end{array}\right. }\\&= {\left\{ \begin{array}{ll} (x^\star -(x-2\alpha ))^2 -2\alpha ( 1-2x)&{} x^\star \le 0.5\\ (x^\star -(x+2\alpha ))^2 +2\alpha ( 1-2x) &{} x^\star > 0.5 \end{array}\right. }. \end{aligned}$$

The function g is a continuous piecewise quadratic function which attains its global minimum at the minimum of one of the two quadratic functions, i.e.,

$$\begin{aligned} {\mathop {{{\mathrm{arg\,min}}}}\limits _{x^\star \in {\mathbb {R}}}}\,g(x^\star ;x,\alpha ) \in \{x-2\alpha \mid x\le 0.5+2\alpha \}\cup \{x+2\alpha \mid x> 0.5-2\alpha \}. \end{aligned}$$

A function value comparison in the intersecting domain \(x\in (0.5-2\alpha ,0.5+2\alpha ]\) yields that

$$\begin{aligned} g(x-2\alpha ;x,\alpha )=-2\alpha (1-2x)\le g(x+2\alpha ;x,\alpha ) =2\alpha (1-2x) \Leftrightarrow x\le 0.5, \end{aligned}$$

which establishes the case distinction in Eq. (9).

\(\square \)
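
For illustration, the entry-wise rule of Eq. (9) is straightforward to implement. The following minimal NumPy sketch (hypothetical function name, not the authors' published code) applies it to a whole matrix at once, pulling entries towards 0 or 1 depending on which side of 0.5 they lie on.

```python
import numpy as np

def prox_lambda(X, alpha):
    """Entry-wise proximal operator of alpha * Lambda as in Eq. (9):
    entries at most 0.5 are shifted towards 0, entries above 0.5 towards 1,
    and the result is kept inside [0, 1]."""
    X = np.asarray(X, dtype=float)
    low = np.maximum(0.0, X - 2.0 * alpha)    # branch x <= 0.5
    high = np.minimum(1.0, X + 2.0 * alpha)   # branch x > 0.5
    return np.where(X <= 0.5, low, high)

# usage: shrink a relaxed factor matrix towards binary values
A = prox_lambda(np.array([[0.2, 0.7], [-0.3, 1.4]]), alpha=0.1)
print(A)  # [[0.  0.9]
          #  [0.  1. ]]
```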

Appendix 2: Krimp’s encoding as matrix factorization

Lemma 1 Let D be a data matrix. For any code table CT and its cover function there exists a Boolean matrix factorization \(D=\theta (YX^T)+N\) such that non-singleton patterns in CT are mirrored in X and the cover function is reflected by Y. The description lengths correspond to each other, such that

$$\begin{aligned} L_{\mathsf {CT}}(D,CT)=f_{\mathsf {CT}}(X,Y,D)=f_{\mathsf {CT}}^D(X,Y,D)+f_{\mathsf {CT}}^M(X,Y,D), \end{aligned}$$

where the functions returning the model and the data description size are given as

$$\begin{aligned} f_{\mathsf {CT}}^D(X,Y,D)&=-\sum _{s=1}^r |Y_{\cdot s}| \cdot \log (p_s) -\sum _{i=1}^n |N_{\cdot i}| \cdot \log (p_{r+i})\\&=L^D_{\mathsf {CT}}(D,CT)\\ f_{\mathsf {CT}}^M(X,Y,D)&=\sum _{s:|Y_{\cdot s}|> 0}\left( X_{\cdot s}^Tc-\log (p_s)\right) +\sum _{i:|N_{\cdot i}|> 0}\left( c_i-\log (p_{r+i})\right) \\&=L_{\mathsf {CT}}^M(CT). \end{aligned}$$

The probabilities \(p_s\) and \(p_{r+i}\) indicate the relative usage of non-singleton patterns \(X_{\cdot s}\) and singletons \(\{i\}\),

$$\begin{aligned} p_s = \frac{|Y_{\cdot s}|}{|Y|+|N|},\ p_{r+i} = \frac{|N_{\cdot i}|}{|Y|+|N|}. \end{aligned}$$

We denote by \(c\in {\mathbb {R}}_+^n\) the vector of standard code lengths for each item, i.e.,

$$\begin{aligned} c_i=-\log \left( \frac{|D_{\cdot i}|}{|D|}\right) . \end{aligned}$$

Proof

Let D be a data matrix, \(CT=\{( X_\sigma ,C_\sigma )|1\le \sigma \le \tau \}\) a \(\tau \)-element code table, and let cover denote the cover function. Let r be the number of non-singleton patterns in CT and assume w.l.o.g. that CT is indexed such that these non-singleton patterns have an index \(1\le \sigma \le r\). We construct the pattern matrix \(X\in \{0,1\}^{n\times r}\) and usage matrix \(Y\in \{0,1\}^{m\times r}\) such that for \(1\le \sigma \le r\) it holds that

$$\begin{aligned} X_{i\sigma }=1&\Leftrightarrow i\in X_\sigma \\ Y_{j\sigma }=1&\Leftrightarrow X_\sigma \in cover(CT,D_{j\cdot }). \end{aligned}$$

The Boolean product \(\theta (YX^T)\) indicates the entries of D which are covered by non-singleton patterns of CT. Consequently, the ones in the noise matrix \(N=D-\theta (YX^T)\) are covered by singletons; it holds that

$$\begin{aligned} N_{ji}\ne 0\Leftrightarrow \{i\}\in cover(CT,D_{j\cdot }). \end{aligned}$$

The usage of a non-singleton pattern \(X_\sigma \) is then computed as

$$\begin{aligned} usage(X_\sigma )&=|\{X_\sigma \in cover(CT,D_{j\cdot })|j\in {\mathcal {T}}\}|\\&=|\{Y_{j\sigma }=1|j\in {\mathcal {T}}\}|\\&=|Y_{\cdot \sigma }|, \end{aligned}$$

and correspondingly it follows that \(usage(\{i\})=|N_{\cdot i}|\). The probabilities \(p_\sigma \) for \(1\le \sigma \le r+n\) are directly obtained by inserting these usage counts into the definition of the code usage probabilities in Eq. (2). The functions \(f_{\mathsf {CT}}^M\) and \(f_{\mathsf {CT}}^D\) likewise follow from the definition of the description sizes \(L_{\mathsf {CT}}^M\) and \(L_{\mathsf {CT}}^D\). \(\square \)
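
To make the correspondence concrete, the following minimal NumPy sketch (a hypothetical helper, not the authors' published code) evaluates \(f_{\mathsf {CT}}^D\) and \(f_{\mathsf {CT}}^M\) for given binary matrices X, Y and data D, with \(N=D-\theta (YX^T)\) as above. It assumes that \(\theta (YX^T)\) covers only ones of D, as in the construction of the proof, and uses the natural logarithm for simplicity (code lengths are customarily measured in bits).

```python
import numpy as np

def krimp_description_length(X, Y, D):
    """Sketch of f_CT^D and f_CT^M from the lemma above.
    X: n x r binary pattern matrix, Y: m x r binary usage matrix,
    D: m x n binary data matrix; assumes theta(Y X^T) covers only ones of D."""
    log = np.log
    cover = (Y @ X.T >= 1).astype(int)    # theta(Y X^T)
    N = D - cover                         # uncovered ones (noise matrix)
    use_pat = Y.sum(axis=0)               # |Y_{.s}| for each pattern
    use_sing = N.sum(axis=0)              # |N_{.i}| for each singleton item
    total = Y.sum() + N.sum()             # |Y| + |N|
    c = -log(D.sum(axis=0) / D.sum())     # standard code lengths c_i

    p_pat, p_sing = use_pat / total, use_sing / total
    s, i = use_pat > 0, use_sing > 0      # only used codes contribute

    f_D = -(use_pat[s] * log(p_pat[s])).sum() - (use_sing[i] * log(p_sing[i])).sum()
    f_M = (X[:, s].T @ c - log(p_pat[s])).sum() + (c[i] - log(p_sing[i])).sum()
    return f_D, f_M
```

By the lemma, applying this to the matrices X and Y constructed from a code table reproduces \(L_{\mathsf {CT}}(D,CT)=f_{\mathsf {CT}}^D+f_{\mathsf {CT}}^M\) up to the choice of logarithm base.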

Appendix 3: Bounding the description length of code tables

Lemma 1

Let \((a_s)\) be a finite sequence of r non-negative scalars such that \(S_r=\sum _{s=1}^ra_s>0\). The function \(g:[0,\infty )\rightarrow [0,\infty )\) defined by

$$\begin{aligned} g(x;a_1,\ldots ,a_r,S_r)=-\sum _{s=1}^r(a_s+x)\log \left( \frac{a_s+x}{S_r+rx}\right) \end{aligned}$$

is monotonically increasing in x.

Proof

W.l.o.g., let \(a_1,\ldots ,a_{r_0}>0\) and \(a_{r_0+1},\ldots ,a_r=0\) for some \(r_0\in \mathbb {N}\). We rewrite the function g as

$$\begin{aligned} g(x;a_1,\ldots ,a_r,S_r)=g(x;a_1,\ldots ,a_{r_0},S_r)+g(x;a_{r_0+1},\ldots ,a_r,S_r) \end{aligned}$$

and show that each of the subfunctions is monotonically increasing. The first subfunction is differentiable and its derivative is non-negative

$$\begin{aligned} \frac{d}{dx}g(x;a_1,\ldots ,a_{r_0},S_r)&= -\sum _{s=1}^r\left( \log \left( \frac{a_s+x}{S_r+rx}\right) \right. \\&\quad \left. +(a_s+x)\frac{S_r+rx}{a_s+x}\frac{S_r+rx-r(a_s+x)}{(S_r+rx)^2}\right) \\&= -\sum _{s=1}^r\log \left( \frac{a_s+x}{S_r+rx}\right) +\sum _{s=1}^r\frac{S_r-ra_s}{S_r+rx}\\&= -\sum _{s=1}^r\log \left( \frac{a_s+x}{S_r+rx}\right) \ge 0. \end{aligned}$$

The second subfunction is monotonically increasing, since for \(a_s=0\) and all \(x\ge 0\) it holds that

$$\begin{aligned} a_s\log \left( \frac{a_s}{S_r}\right) =0\le -(a_s+x)\log \left( \frac{a_s+x}{S_r+rx}\right) . \end{aligned}$$

\(\square \)
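
A quick numerical sanity check of the lemma (illustrative code only, not part of the paper) evaluates g on a grid for a sequence that contains zeros, exercising both subfunctions of the proof, and verifies that the values are non-decreasing.

```python
import numpy as np

def g(x, a):
    """g(x; a_1,...,a_r, S_r) from Lemma 1 with S_r = sum(a);
    terms with a_s + x = 0 are read as 0 (the limit of t*log(t) at 0)."""
    a = np.asarray(a, dtype=float)
    r, S = a.size, a.sum()
    with np.errstate(divide="ignore", invalid="ignore"):
        t = (a + x) * np.log((a + x) / (S + r * x))
    return -np.nansum(t)

# monotonicity check on a sequence containing zeros, as in the proof
rng = np.random.default_rng(0)
a = np.concatenate([rng.uniform(0.1, 5.0, size=4), np.zeros(2)])
xs = np.linspace(0.0, 10.0, 201)
vals = np.array([g(x, a) for x in xs])
print(np.all(np.diff(vals) >= -1e-9))   # True: g is monotonically increasing
```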

Theorem 2 Given binary matrices X and Y and \(\mu = 1+\log (n)\), it holds that

$$\begin{aligned} f^D_{\mathsf {CT}}(X,Y,D)&\le \mu \Vert D-YX^T\Vert ^2-\sum _{s=1}^r(|Y_{\cdot s}|+1)\log \left( \frac{|Y_{\cdot s}|+1}{|Y|+r}\right) +|Y| \end{aligned}$$
(10)

Proof

We recall that the description size of the data is computed by

$$\begin{aligned} f_{\mathsf {CT}}^D(X,Y,D)=\underbrace{-\sum _{s=1}^r |Y_{\cdot s}| \cdot \log \left( \frac{|Y_{\cdot s}|}{|Y|+|N|}\right) }_{=f_1(X,Y,D)} \underbrace{-\sum _{i=1}^n |N_{\cdot i}| \cdot \log \left( \frac{|N_{\cdot i}|}{|N|+|Y|}\right) }_{=f_2(X,Y,D)}. \end{aligned}$$

Applying properties of the logarithm, we rewrite the first sum as

$$\begin{aligned} f_1(X,Y,D)&= -\sum _{s=1}^r|Y_{\cdot s}|\log \left( \frac{|Y_{\cdot s}|}{|Y|}\frac{|Y|}{|Y|+|N|}\right) \\&= -\sum _{s=1}^r|Y_{\cdot s}|\log \left( \frac{|Y_{\cdot s}|}{|Y|}\right) +\sum _{s=1}^r|Y_{\cdot s}|\log \left( \frac{|Y|+|N|}{|Y|}\right) \\&= g(0;|Y_{\cdot 1}|,\ldots ,|Y_{\cdot r}|,|Y|) +|Y|\log \left( 1+\frac{|N|}{|Y|}\right) . \end{aligned}$$

It follows from the monotonicity of g (Lemma 1) and the logarithm inequality (\(\log (1+x)\le x, \forall x\ge 0\)) that \(f_1\) is upper bounded by

$$\begin{aligned} f_1(X,Y,D)\le -\sum _{s=1}^r(|Y_{\cdot s}|+1)\log \left( \frac{|Y_{\cdot s}|+1}{|Y|+r}\right) +|N|. \end{aligned}$$

The second term \(f_2\) can be transformed into

$$\begin{aligned} f_2(X,Y,D)&=-\sum _{i=1}^n |N_{\cdot i}| \cdot \log \left( |N_{\cdot i}|\right) +\sum _{i=1}^n |N_{\cdot i}| \cdot \log \left( |N|+|Y|\right) \\&= \sum _{i=1}^n|N_{\cdot i}|\log \frac{1}{|N_{\cdot i}|} +|N|\log (|N|+|Y|). \end{aligned}$$

Subsequently, we show \(f_2(X,Y,D)\le |N|\log (n) +|Y|\). This inequality trivially holds if \(|N|=0\). Otherwise, we apply Jensen’s inequality to the concave logarithm function

$$\begin{aligned} |N|\sum _{i=1}^n\frac{|N_{\cdot i}|}{|N|}\log \frac{1}{|N_{\cdot i}|}\le |N|\log \left( \frac{n}{|N|}\right) \end{aligned}$$

and obtain

$$\begin{aligned} f_2(X,Y,D)&\le |N|\log \left( \frac{n}{|N|}\right) +|N|\log (|N|+|Y|)\\&= |N|\log (n) +|N|\log \left( 1+\frac{|Y|}{|N|}\right) \\&\le |N|\log (n) +|Y|, \end{aligned}$$

where the last inequality again follows from the logarithm inequality. We derive the final inequality by

$$\begin{aligned} f_{\mathsf {CT}}^D(X,Y,D)&= f_1(X,Y,D)+f_2(X,Y,D)\\&\le (1+\log (n))|N|-\sum _{s=1}^r(|Y_{\cdot s}|+1)\log \left( \frac{|Y_{\cdot s}|+1}{|Y|+r}\right) +|Y| \end{aligned}$$

\(\square \)
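
As an illustrative numerical check of the bound in Eq. (10) (a hypothetical sketch with made-up dimensions, not the paper's code), one can plant a Boolean factorization, add extra ones as noise so that \(N=D-\theta (YX^T)\) is non-negative as in Appendix 2, and compare both sides. The natural logarithm is used, matching the inequality \(\log (1+x)\le x\) invoked in the proof.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, r = 40, 30, 3
log = np.log

# planted Boolean factorization plus additive noise (extra ones only),
# so that N = D - theta(Y X^T) is non-negative as in Appendix 2
Y = (rng.random((m, r)) < 0.3).astype(int)
X = (rng.random((n, r)) < 0.3).astype(int)
cover = (Y @ X.T >= 1).astype(int)
D = np.maximum(cover, (rng.random((m, n)) < 0.05).astype(int))
N = D - cover

uY, uN = Y.sum(axis=0), N.sum(axis=0)     # |Y_{.s}| and |N_{.i}|
tot = Y.sum() + N.sum()                   # |Y| + |N|

# left-hand side: f_CT^D (only used codes contribute, 0*log(0) read as 0)
lhs = -(uY[uY > 0] * log(uY[uY > 0] / tot)).sum() \
      - (uN[uN > 0] * log(uN[uN > 0] / tot)).sum()

# right-hand side of Eq. (10) with mu = 1 + log(n)
mu = 1 + log(n)
rhs = mu * ((D - Y @ X.T) ** 2).sum() \
      - ((uY + 1) * log((uY + 1) / (Y.sum() + r))).sum() + Y.sum()

print(lhs <= rhs)   # True
```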

Appendix 4: Calculating the Lipschitz moduli of \(\textsc {Primp}\)

We study the partial gradients of the regularization term used in Primp (Sect. 3.4)

$$\begin{aligned} \nabla _X G(X,Y)&=c(1)_s^T\\ \nabla _Y G(X,Y)&=-\left( \log \left( \frac{|Y_{\cdot s}|+1}{|Y|+r}\right) \right) _{js}+(1)_{js}. \end{aligned}$$

The partial gradient with respect to X is constant and has a Lipschitz constant of zero. The partial gradient with respect to Y can be written as the sum

$$\begin{aligned} \nabla _YG(X,Y)=-(\underbrace{(\log (|Y_{\cdot s}|+1))_{js}}_{=A(Y)}-\underbrace{(\log (|Y|+r))_{js}}_{=B(Y)})+(1)_{js}. \end{aligned}$$

It follows from the triangle inequality that the gradient with respect to Y is Lipschitz continuous with modulus \(M_{\nabla _YG}(X)=M_A+M_B\), provided that the functions A and B are Lipschitz continuous with moduli \(M_A\) and \(M_B\):

$$\begin{aligned} \Vert \nabla _YG(X,Y)-\nabla _VG(X,V)\Vert&=\Vert A(Y)-A(V)+B(Y)-B(V)\Vert \\&\le \Vert A(Y)-A(V)\Vert +\Vert B(Y)-B(V)\Vert \\&\le (M_A+M_B)\Vert Y-V\Vert . \end{aligned}$$

For any \(\delta >0\), the one-dimensional function \(x\mapsto \log (x+\delta )\), \(x\in {\mathbb {R}}_+\), is Lipschitz continuous with modulus \(\delta ^{-1}\). This follows from the mean value theorem and the bound

$$\begin{aligned} \frac{d}{dx}\log (x+\delta )=\frac{1}{x+\delta }\le \frac{1}{\delta } \end{aligned}$$

for all \(x\ge 0\). The following equations show that \(M_A=M_B=m\). For improved readability, we work with the squared Lipschitz inequality, i.e.,

$$\begin{aligned} \Vert A(Y)-A(V)\Vert ^2&=\sum _{s,j}(\log (|Y_{\cdot s}|+1)-\log (|V_{\cdot s}|+1))^2\nonumber \\&=m\sum _{s=1}^r(\log (|Y_{\cdot s}|+1)-\log (|V_{\cdot s}|+1))^2\nonumber \\&\le m\sum _{s=1}^r(|Y_{\cdot s}|-|V_{\cdot s}|)^2 \end{aligned}$$
(12)
$$\begin{aligned}&= m\sum _{s=1}^r\left( \sum _{j=1}^m(Y_{j s}-V_{j s})\right) ^2\nonumber \\&\le m^2\sum _{s,j}(Y_{j s}-V_{j s})^2= m^2\Vert Y-V\Vert ^2, \end{aligned}$$
(13)

where Eq. (12) follows from the Lipschitz continuity of the logarithmic function as discussed above for \(\delta =1\) and Eq. (13) follows from the Cauchy-Schwarz inequality. Similar steps yield the Lipschitz modulus of B,

$$\begin{aligned} \Vert B(Y)-B(V)\Vert ^2&=\sum _{s,j}(\log (|Y|+r)-\log (|V|+r))^2\\&=mr(\log (|Y|+r)-\log (|V|+r))^2\\&\le \frac{mr}{r^2}(|Y|-|V|)^2\\&= \frac{m}{r}\left( \sum _{s,j}(Y_{j s}-V_{j s})\right) ^2\\&\le m^2\sum _{s,j}(Y_{j s}-V_{j s})^2. \end{aligned}$$

We conclude that the Lipschitz moduli of the gradients are given as

$$\begin{aligned} M_{\nabla _X G}(Y)=0 \quad M_{\nabla _YG}(X)=2m. \end{aligned}$$
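
The following small NumPy sketch (illustrative, with assumed dimensions, not part of the published implementation) evaluates \(\nabla _YG\) as given above and empirically checks that the ratio \(\Vert \nabla _YG(X,Y)-\nabla _YG(X,V)\Vert /\Vert Y-V\Vert \) stays below the modulus 2m for random matrices from the relaxed domain.

```python
import numpy as np

def grad_Y_G(Y, r):
    """Partial gradient of the regularizer w.r.t. Y: the (j,s) entry is
    -log((|Y_{.s}| + 1) / (|Y| + r)) + 1, independent of j (and of X)."""
    m = Y.shape[0]
    col = -np.log((Y.sum(axis=0) + 1.0) / (Y.sum() + r)) + 1.0
    return np.tile(col, (m, 1))

# empirical check of the Lipschitz bound ||grad(Y) - grad(V)|| <= 2m ||Y - V||
rng = np.random.default_rng(2)
m, r = 25, 4
ratios = []
for _ in range(500):
    Y, V = rng.random((m, r)), rng.random((m, r))   # relaxed matrices in [0,1)
    num = np.linalg.norm(grad_Y_G(Y, r) - grad_Y_G(V, r))
    den = np.linalg.norm(Y - V)
    ratios.append(num / den)
print(max(ratios) <= 2 * m)   # True
```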

About this article

Cite this article

Hess, S., Morik, K. & Piatkowski, N. The PRIMPING routine—Tiling through proximal alternating linearized minimization. Data Min Knowl Disc 31, 1090–1131 (2017). https://doi.org/10.1007/s10618-017-0508-z
