The PRIMPING routine—Tiling through proximal alternating linearized minimization

Abstract

Mining and exploring databases should provide users with knowledge and new insights. Tiles of data strive to unveil true underlying structure and distinguish valuable information from various kinds of noise. We propose a novel Boolean matrix factorization algorithm to solve the tiling problem, based on recent results from optimization theory. In contrast to existing work, the new algorithm minimizes the description length of the resulting factorization. This approach is well known for model selection and data compression, but not for finding suitable factorizations via numerical optimization. We demonstrate the superior robustness of the new approach in the presence of several kinds of noise and types of underlying structure. Moreover, our general framework can work with any cost measure having a suitable real-valued relaxation. Thereby, no convexity assumptions have to be met. The experimental results on synthetic data and image data show that the new method identifies interpretable patterns which almost always explain the data better than those found by the competing algorithms.

Notes

  1. \({{\mathrm{dom}}}(\phi )\) is the domain of \(\phi \)

  2. http://hpc.isti.cnr.it/~claudio/web/archives/20131113/index.html

  3. http://people.mpi-inf.mpg.de/~skaraev/

  4. http://sfb876.tu-dortmund.de/primp

  5. http://grouplens.org/datasets/movielens/10m/

  6. http://grouplens.org/datasets/movielens/1m/

References

  • Bauckhage C (2015) k-means clustering is matrix factorization. arXiv preprint arXiv:1512.07548

  • Bolte J, Sabach S, Teboulle M (2014) Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Math Program 146(1–2):459–494

  • Cover T, Thomas J (2006) Elements of information theory. Wiley-Interscience, Hoboken

  • De Bie T (2011) Maximum entropy models and subjective interestingness: an application to tiles in binary databases. Data Min Knowl Discov 23(3):407–446

  • Ding C, Li T, Peng W, Park H (2006) Orthogonal nonnegative matrix t-factorizations for clustering. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 126–135

  • Ding CH, He X, Simon HD (2005) On the equivalence of nonnegative matrix factorization and spectral clustering. In: Proceedings of the SIAM international conference on data mining (SDM), pp 606–610

  • Geerts F, Goethals B, Mielikäinen T (2004) Tiling databases. In: International conference on discovery science (DS), pp 278–289

  • Grünwald P (2007) The minimum description length principle. MIT Press, Cambridge

  • Hess S, Piatkowski N, Morik K (2014) Shrimp: descriptive patterns in a tree. In: Proceedings of the LWA workshops: KDML, IR and FGWM, pp 181–192

  • Jarrett K, Kavukcuoglu K, Ranzato M, LeCun Y (2009) What is the best multi-stage architecture for object recognition? In: IEEE international conference on computer vision (ICCV), pp 2146–2153

  • Karaev S, Miettinen P, Vreeken J (2015) Getting to know the unknown unknowns: destructive-noise resistant Boolean matrix factorization. In: Proceedings of the SIAM international conference on data mining (SDM), pp 325–333

  • Kontonasios KN, De Bie T (2010) An information-theoretic approach to finding informative noisy tiles in binary databases. In: Proceedings of the SIAM international conference on data mining (SDM), pp 153–164

  • Kuhn HW (1955) The Hungarian method for the assignment problem. Naval Res Logist Q 2(1–2):83–97

  • Lee DD, Seung HS (1999) Learning the parts of objects by non-negative matrix factorization. Nature 401(6755):788–791

  • Lee DD, Seung HS (2001) Algorithms for non-negative matrix factorization. In: Advances in neural information processing systems (NIPS), pp 556–562

  • Li M, Vitányi P (1997) An introduction to Kolmogorov complexity and its applications. Springer, Berlin

  • Li T (2005) A general model for clustering binary data. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery in data mining (KDD), pp 188–197

  • Li T, Ding C (2006) The relationships among various nonnegative matrix factorization methods for clustering. In: International conference on data mining (ICDM), pp 362–371

  • Lucchese C, Orlando S, Perego R (2010) Mining top-k patterns from binary datasets in presence of noise. In: Proceedings of the SIAM international conference on data mining (SDM), pp 165–176

  • Lucchese C, Orlando S, Perego R (2014) A unifying framework for mining approximate top-k binary patterns. Trans Knowl Data Eng 26(12):2900–2913

  • Miettinen P (2015) Generalized matrix factorizations as a unifying framework for pattern set mining: complexity beyond blocks. In: European conference on machine learning and principles and practice of knowledge discovery in databases (ECMLPKDD), pp 36–52

  • Miettinen P, Vreeken J (2014) Mdl4bmf: minimum description length for Boolean matrix factorization. Trans Knowl Discov Data 8(4):18:1–18:31

  • Miettinen P, Mielikainen T, Gionis A, Das G, Mannila H (2008) The discrete basis problem. Trans Knowl Data Eng 20(10):1348–1362

  • Paatero P, Tapper U (1994) Positive matrix factorization: a non-negative factor model with optimal utilization of error estimates of data values. Environmetrics 5(2):111–126

  • Parikh N, Boyd S (2014) Proximal algorithms. Found Trends Optim 1(3):127–239

  • Rissanen J (1978) Modeling by shortest data description. Automatica 14:465–471

  • Siebes A, Kersten R (2011) A structure function for transaction data. In: Proceedings of the SIAM international conference on data mining (SDM), pp 558–569

  • Siebes A, Vreeken J, van Leeuwen M (2006) Item sets that compress. In: Proceedings of the SIAM international conference on data mining (SDM), pp 393–404

  • Smets K, Vreeken J (2012) Slim: directly mining descriptive patterns. In: Proceedings of the SIAM international conference on data mining (SDM), pp 236–247

  • Tatti N, Vreeken J (2012) Comparing apples and oranges: measuring differences between exploratory data mining results. Data Min Knowl Discov 25(2):173–207

  • van Leeuwen M, Siebes A (2008) StreamKrimp: detecting change in data streams. In: European conference on machine learning and principles and practice of knowledge discovery in databases (ECMLPKDD), pp 672–687

  • Vreeken J, van Leeuwen M, Siebes A (2011) Krimp: mining itemsets that compress. Data Min Knowl Discov 23(1):169–214

  • Wang YX, Zhang YJ (2013) Nonnegative matrix factorization: a comprehensive review. Trans Knowl Data Eng 25(6):1336–1353

  • Xiang Y, Jin R, Fuhry D, Dragan FF (2011) Summarizing transactional databases with overlapped hyperrectangles. Data Min Knowl Discov 23(2):215–251

  • Zhang Z, Ding C, Li T, Zhang X (2007) Binary matrix factorization with applications. In: International conference on data mining (ICDM), pp 391–400

  • Zimek A, Vreeken J (2013) The blind men and the elephant: on meeting the problem of multiple truths in data from clustering and pattern mining perspectives. Mach Learn 98(1–2):121–155

Acknowledgements

Part of the work on this paper has been supported by Deutsche Forschungsgemeinschaft (DFG) within the Collaborative Research Center SFB 876 “Providing Information by Resource-Constrained Analysis”, projects A1 and C1 (http://sfb876.tu-dortmund.de). Furthermore, we thank Jilles Vreeken and Sanjar Karaev for their support in the execution of experiments and useful remarks.

Author information

Corresponding author

Correspondence to Sibylle Hess.

Additional information

Responsible editor: Johannes Fürnkranz.

Appendices

Appendix 1: Derivation of the proximal operator

Theorem 1 Let \(\alpha >0\) and \(\phi (X)=\sum _{i,j}\varLambda (X_{ij})\) for \(X\in {\mathbb {R}}^{m\times n}\). The proximal operator of \(\alpha \phi \) maps the matrix X to the matrix \({{\mathrm{prox}}}_{\alpha \phi }(X)=A\in [0,1]^{m\times n}\) defined by \(A_{ji}={{\mathrm{prox}}}_{\alpha \varLambda }(X_{ji})\), where for \(x\in {\mathbb {R}}\) it holds that

$$\begin{aligned} {{\mathrm{prox}}}_{\alpha \varLambda }(x)= {\left\{ \begin{array}{ll} \max \{0,x-2\alpha \} &{}x\le 0.5\\ \min \{1,x+2\alpha \} &{}x>0.5. \end{array}\right. } \end{aligned}$$
(9)

Proof

Let \(\alpha >0\), \(X\in {\mathbb {R}}^{m\times n}\) for some \(m,n\in \mathbb {N}\) and \(A={{\mathrm{prox}}}_{\alpha \phi }(X)\). The function \(\phi \) is fully separable across all matrix entries. In this case, the proximal operator can be applied entry-wise to the composing scalar functions (Parikh and Boyd 2014), i.e., \(A_{ji}={{\mathrm{prox}}}_{\alpha \varLambda }(X_{ji})\). It remains to derive the proximal mapping of \(\varLambda \) as stated in Eq. (9).

The proximal operator reduces to Euclidean projection if the argument lies outside of the function’s domain (Parikh and Boyd 2014) and it follows that

$$\begin{aligned} {{\mathrm{prox}}}_{\alpha \varLambda }(x)=\theta (x) \text { if } x\notin [0,1]. \end{aligned}$$

For \(x\in [0,1]\), it holds that \(\varLambda (x)=-|1-2x|+1\) and

$$\begin{aligned} {{\mathrm{prox}}}_{\alpha \varLambda }(x)&= {\mathop {{{\mathrm{arg\,min}}}}\limits _{x^\star \in {\mathbb {R}}}} \left\{ \frac{1}{2}(x-x^\star )^2-\alpha |1-2x^\star | +\alpha \right\} \\&= {\mathop {{{\mathrm{arg\,min}}}}\limits _{x^\star \in {\mathbb {R}}}} \left\{ \underbrace{(x-x^\star )^2-2\alpha |1-2x^\star | +(2\alpha )^2}_{=g(x^\star ;x,\alpha )}\right\} , \end{aligned}$$

where g is obtained from the objective by multiplication with and addition of constants, so that the minimum can easily be found by completing the square.

$$\begin{aligned} g(x^\star ;x,\alpha )&={\left\{ \begin{array}{ll} (x-x^\star )^2 -2\alpha (1-2 x^\star ) +(2\alpha )^2 &{} x^\star \le 0.5\\ (x-x^\star )^2 +2\alpha (1-2 x^\star ) +(2\alpha )^2 &{} x^\star> 0.5 \end{array}\right. }\\&= {\left\{ \begin{array}{ll} (x^\star -(x-2\alpha ))^2 -2\alpha ( 1-2x)&{} x^\star \le 0.5\\ (x^\star -(x+2\alpha ))^2 +2\alpha ( 1-2x) &{} x^\star > 0.5 \end{array}\right. }. \end{aligned}$$

The function g is a continuous piecewise quadratic function which attains its global minimum at the minimum of one of the two quadratic functions, i.e.,

$$\begin{aligned} {\mathop {{{\mathrm{arg\,min}}}}\limits _{x^\star \in {\mathbb {R}}}}\,g(x^\star ;x,\alpha ) \in \{x-2\alpha \mid x\le 0.5+2\alpha \}\cup \{x+2\alpha \mid x> 0.5-2\alpha \}. \end{aligned}$$

A function value comparison in the intersecting domain \(x\in (0.5-2\alpha ,0.5+2\alpha ]\) yields that

$$\begin{aligned} g(x-2\alpha ;x,\alpha )=-2\alpha (1-2x)\le g(x+2\alpha ;x,\alpha ) =2\alpha (1-2x) \Leftrightarrow x\le 0.5, \end{aligned}$$

which establishes the case distinction in Eq. (9).

\(\square \)
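
For illustration, the entry-wise rule of Eq. (9) is straightforward to implement. The following minimal NumPy sketch (hypothetical function name, not the authors' published code) applies it to a whole matrix at once, pulling entries towards 0 or 1 depending on which side of 0.5 they lie on.

```python
import numpy as np

def prox_lambda(X, alpha):
    """Entry-wise proximal operator of alpha * Lambda as in Eq. (9):
    entries at most 0.5 are shifted towards 0, entries above 0.5 towards 1,
    and the result is kept inside [0, 1]."""
    X = np.asarray(X, dtype=float)
    low = np.maximum(0.0, X - 2.0 * alpha)    # branch x <= 0.5
    high = np.minimum(1.0, X + 2.0 * alpha)   # branch x > 0.5
    return np.where(X <= 0.5, low, high)

# usage: shrink a relaxed factor matrix towards binary values
A = prox_lambda(np.array([[0.2, 0.7], [-0.3, 1.4]]), alpha=0.1)
print(A)  # [[0.  0.9]
          #  [0.  1. ]]
```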

Appendix 2: Krimp’s encoding as matrix factorization

Lemma 1 Let D be a data matrix. For any code table CT and its cover function there exists a Boolean matrix factorization \(D=\theta (YX^T)+N\) such that non-singleton patterns in CT are mirrored in X and the cover function is reflected by Y. The description lengths correspond to each other, such that

$$\begin{aligned} L_{\mathsf {CT}}(D,CT)=f_{\mathsf {CT}}(X,Y,D)=f_{\mathsf {CT}}^D(X,Y,D)+f_{\mathsf {CT}}^M(X,Y,D), \end{aligned}$$

where the functions returning the model and the data description size are given as

$$\begin{aligned} f_{\mathsf {CT}}^D(X,Y,D)&=-\sum _{s=1}^r |Y_{\cdot s}| \cdot \log (p_s) -\sum _{i=1}^n |N_{\cdot i}| \cdot \log (p_{r+i})\\&=L^D_{\mathsf {CT}}(D,CT)\\ f_{\mathsf {CT}}^M(X,Y,D)&=\sum _{s:|Y_{\cdot s}|> 0}\left( X_{\cdot s}^Tc-\log (p_s)\right) +\sum _{i:|N_{\cdot i}|> 0}\left( c_i-\log (p_{r+i})\right) \\&=L_{\mathsf {CT}}^M(CT). \end{aligned}$$

The probabilities \(p_s\) and \(p_{r+i}\) indicate the relative usage of non-singleton patterns \(X_{\cdot s}\) and singletons \(\{i\}\),

$$\begin{aligned} p_s = \frac{|Y_{\cdot s}|}{|Y|+|N|},\ p_{r+i} = \frac{|N_{\cdot i}|}{|Y|+|N|}. \end{aligned}$$

We denote by \(c\in {\mathbb {R}}_+^n\) the vector of standard code lengths for each item, i.e.,

$$\begin{aligned} c_i=-\log \left( \frac{|D_{\cdot i}|}{|D|}\right) . \end{aligned}$$

Proof

Let D be a data matrix, \(CT=\{( X_\sigma ,C_\sigma )|1\le \sigma \le \tau \}\) a \(\tau \)-element code table, and let cover denote the cover function. Let r be the number of non-singleton patterns in CT and assume w.l.o.g. that CT is indexed such that these non-singleton patterns have an index \(1\le \sigma \le r\). We construct the pattern matrix \(X\in \{0,1\}^{n\times r}\) and usage matrix \(Y\in \{0,1\}^{m\times r}\) such that for \(1\le \sigma \le r\) it holds that

$$\begin{aligned} X_{i\sigma }=1&\Leftrightarrow i\in X_\sigma \\ Y_{j\sigma }=1&\Leftrightarrow X_\sigma \in cover(CT,D_{j\cdot }). \end{aligned}$$

The Boolean product \(\theta (YX^T)\) indicates the entries of D which are covered by non-singleton patterns of CT. Consequently, the ones in the noise matrix \(N=D-\theta (YX^T)\) are covered by singletons; it holds that

$$\begin{aligned} N_{ji}\ne 0\Leftrightarrow \{i\}\in cover(CT,D_{j\cdot }). \end{aligned}$$

The usage of a non-singleton pattern \(X_\sigma \) is then computed as

$$\begin{aligned} usage(X_\sigma )&=|\{X_\sigma \in cover(CT,D_{j\cdot })|j\in {\mathcal {T}}\}|\\&=|\{Y_{j\sigma }=1|j\in {\mathcal {T}}\}|\\&=|Y_{\cdot \sigma }|, \end{aligned}$$

and correspondingly it follows that \(usage(\{i\})=|N_{\cdot i}|\). The probabilities \(p_\sigma \) for \(1\le \sigma \le r+n\) are directly obtained by inserting these usage counts into the definition of the code usage probabilities in Eq. (2). The functions \(f_{\mathsf {CT}}^M\) and \(f_{\mathsf {CT}}^D\) likewise follow from the definition of the description sizes \(L_{\mathsf {CT}}^M\) and \(L_{\mathsf {CT}}^D\). \(\square \)
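
To make the correspondence concrete, the following minimal NumPy sketch (a hypothetical helper, not the authors' published code) evaluates \(f_{\mathsf {CT}}^D\) and \(f_{\mathsf {CT}}^M\) for given binary matrices X, Y and data D, with \(N=D-\theta (YX^T)\) as above. It assumes that \(\theta (YX^T)\) covers only ones of D, as in the construction of the proof, and uses the natural logarithm for simplicity (code lengths are customarily measured in bits).

```python
import numpy as np

def krimp_description_length(X, Y, D):
    """Sketch of f_CT^D and f_CT^M from the lemma above.
    X: n x r binary pattern matrix, Y: m x r binary usage matrix,
    D: m x n binary data matrix; assumes theta(Y X^T) covers only ones of D."""
    log = np.log
    cover = (Y @ X.T >= 1).astype(int)    # theta(Y X^T)
    N = D - cover                         # uncovered ones (noise matrix)
    use_pat = Y.sum(axis=0)               # |Y_{.s}| for each pattern
    use_sing = N.sum(axis=0)              # |N_{.i}| for each singleton item
    total = Y.sum() + N.sum()             # |Y| + |N|
    c = -log(D.sum(axis=0) / D.sum())     # standard code lengths c_i

    p_pat, p_sing = use_pat / total, use_sing / total
    s, i = use_pat > 0, use_sing > 0      # only used codes contribute

    f_D = -(use_pat[s] * log(p_pat[s])).sum() - (use_sing[i] * log(p_sing[i])).sum()
    f_M = (X[:, s].T @ c - log(p_pat[s])).sum() + (c[i] - log(p_sing[i])).sum()
    return f_D, f_M
```

By the lemma, applying this to the matrices X and Y constructed from a code table reproduces \(L_{\mathsf {CT}}(D,CT)=f_{\mathsf {CT}}^D+f_{\mathsf {CT}}^M\) up to the choice of logarithm base.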

Appendix 3: Bounding the description length of code tables

Lemma 1

Let \((a_s)\) be a finite sequence of r non-negative scalars such that \(S_r=\sum _{s=1}^ra_s>0\). The function \(g:[0,\infty )\rightarrow [0,\infty )\) defined by

$$\begin{aligned} g(x;a_1,\ldots ,a_r,S_r)=-\sum _{s=1}^r(a_s+x)\log \left( \frac{a_s+x}{S_r+rx}\right) \end{aligned}$$

is monotonically increasing in x.

Proof

W.l.o.g., let \(a_1,\ldots ,a_{r_0}>0\) and \(a_{r_0+1},\ldots ,a_r=0\) for some \(r_0\in \mathbb {N}\). We rewrite the function g as

$$\begin{aligned} g(x;a_1,\ldots ,a_r,S_r)=g(x;a_1,\ldots ,a_{r_0},S_r)+g(x;a_{r_0+1},\ldots ,a_r,S_r) \end{aligned}$$

and show that each of the subfunctions is monotonically increasing. The first subfunction is differentiable and its derivative is non-negative

$$\begin{aligned} \frac{d}{dx}g(x;a_1,\ldots ,a_{r_0},S_r)&= -\sum _{s=1}^r\left( \log \left( \frac{a_s+x}{S_r+rx}\right) \right. \\&\quad \left. +(a_s+x)\frac{S_r+rx}{a_s+x}\frac{S_r+rx-r(a_s+x)}{(S_r+rx)^2}\right) \\&= -\sum _{s=1}^r\log \left( \frac{a_s+x}{S_r+rx}\right) +\sum _{s=1}^r\frac{S_r-ra_s}{S_r+rx}\\&= -\sum _{s=1}^r\log \left( \frac{a_s+x}{S_r+rx}\right) \ge 0. \end{aligned}$$

The second subfunction is monotonically increasing, since for \(a_s=0\) and all \(x\ge 0\) it holds that

$$\begin{aligned} a_s\log \left( \frac{a_s}{S_r}\right) =0\le -(a_s+x)\log \left( \frac{a_s+x}{S_r+rx}\right) . \end{aligned}$$

\(\square \)
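
A quick numerical sanity check of the lemma (illustrative code only, not part of the paper) evaluates g on a grid for a sequence that contains zeros, exercising both subfunctions of the proof, and verifies that the values are non-decreasing.

```python
import numpy as np

def g(x, a):
    """g(x; a_1,...,a_r, S_r) from Lemma 1 with S_r = sum(a);
    terms with a_s + x = 0 are read as 0 (the limit of t*log(t) at 0)."""
    a = np.asarray(a, dtype=float)
    r, S = a.size, a.sum()
    with np.errstate(divide="ignore", invalid="ignore"):
        t = (a + x) * np.log((a + x) / (S + r * x))
    return -np.nansum(t)

# monotonicity check on a sequence containing zeros, as in the proof
rng = np.random.default_rng(0)
a = np.concatenate([rng.uniform(0.1, 5.0, size=4), np.zeros(2)])
xs = np.linspace(0.0, 10.0, 201)
vals = np.array([g(x, a) for x in xs])
print(np.all(np.diff(vals) >= -1e-9))   # True: g is monotonically increasing
```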

Theorem 2 Given binary matrices X and Y and \(\mu = 1+\log (n)\), it holds that

$$\begin{aligned} f^D_{\mathsf {CT}}(X,Y,D)&\le \mu \Vert D-YX^T\Vert ^2-\sum _{s=1}^r(|Y_{\cdot s}|+1)\log \left( \frac{|Y_{\cdot s}|+1}{|Y|+r}\right) +|Y| \end{aligned}$$
(10)

Proof

We recall that the description size of the data is computed by

$$\begin{aligned} f_{\mathsf {CT}}^D(X,Y,D)=\underbrace{-\sum _{s=1}^r |Y_{\cdot s}| \cdot \log \left( \frac{|Y_{\cdot s}|}{|Y|+|N|}\right) }_{=f_1(X,Y,D)} \underbrace{-\sum _{i=1}^n |N_{\cdot i}| \cdot \log \left( \frac{|N_{\cdot i}|}{|N|+|Y|}\right) }_{=f_2(X,Y,D)}. \end{aligned}$$

Applying properties of the logarithm, we rewrite the first sum as

$$\begin{aligned} f_1(X,Y,D)&= -\sum _{s=1}^r|Y_{\cdot s}|\log \left( \frac{|Y_{\cdot s}|}{|Y|}\frac{|Y|}{|Y|+|N|}\right) \\&= -\sum _{s=1}^r|Y_{\cdot s}|\log \left( \frac{|Y_{\cdot s}|}{|Y|}\right) +\sum _{s=1}^r|Y_{\cdot s}|\log \left( \frac{|Y|+|N|}{|Y|}\right) \\&= g(0;|Y_{\cdot 1}|,\ldots ,|Y_{\cdot r}|,|Y|) +|Y|\log \left( 1+\frac{|N|}{|Y|}\right) . \end{aligned}$$

It follows from the monotonicity of g (Lemma 1) and the logarithm inequality (\(\log (1+x)\le x, \forall x\ge 0\)) that \(f_1\) is upper bounded by

$$\begin{aligned} f_1(X,Y,D)\le -\sum _{s=1}^r(|Y_{\cdot s}|+1)\log \left( \frac{|Y_{\cdot s}|+1}{|Y|+r}\right) +|N|. \end{aligned}$$

The second term \(f_2\) can be transformed into

$$\begin{aligned} f_2(X,Y,D)&=-\sum _{i=1}^n |N_{\cdot i}| \cdot \log \left( |N_{\cdot i}|\right) +\sum _{i=1}^n |N_{\cdot i}| \cdot \log \left( |N|+|Y|\right) \\&= \sum _{i=1}^n|N_{\cdot i}|\log \frac{1}{|N_{\cdot i}|} +|N|\log (|N|+|Y|). \end{aligned}$$

Subsequently, we show \(f_2(X,Y,D)\le |N|\log (n) +|Y|\). This inequality trivially holds if \(|N|=0\). Otherwise, we apply Jensen’s inequality to the concave logarithm function

$$\begin{aligned} |N|\sum _{i=1}^n\frac{|N_{\cdot i}|}{|N|}\log \frac{1}{|N_{\cdot i}|}\le |N|\log \left( \frac{n}{|N|}\right) \end{aligned}$$

and obtain

$$\begin{aligned} f_2(X,Y,D)&\le |N|\log \left( \frac{n}{|N|}\right) +|N|\log (|N|+|Y|)\\&= |N|\log (n) +|N|\log \left( 1+\frac{|Y|}{|N|}\right) \\&\le |N|\log (n) +|Y|, \end{aligned}$$

where the last inequality again follows from the logarithm inequality. We derive the final inequality by

$$\begin{aligned} f_{\mathsf {CT}}^D(X,Y,D)&= f_1(X,Y,D)+f_2(X,Y,D)\\&\le (1+\log (n))|N|-\sum _{s=1}^r(|Y_{\cdot s}|+1)\log \left( \frac{|Y_{\cdot s}|+1}{|Y|+r}\right) +|Y| \end{aligned}$$

\(\square \)
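
As an illustrative numerical check of the bound in Eq. (10) (a hypothetical sketch with made-up dimensions, not the paper's code), one can plant a Boolean factorization, add extra ones as noise so that \(N=D-\theta (YX^T)\) is non-negative as in Appendix 2, and compare both sides. The natural logarithm is used, matching the inequality \(\log (1+x)\le x\) invoked in the proof.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, r = 40, 30, 3
log = np.log

# planted Boolean factorization plus additive noise (extra ones only),
# so that N = D - theta(Y X^T) is non-negative as in Appendix 2
Y = (rng.random((m, r)) < 0.3).astype(int)
X = (rng.random((n, r)) < 0.3).astype(int)
cover = (Y @ X.T >= 1).astype(int)
D = np.maximum(cover, (rng.random((m, n)) < 0.05).astype(int))
N = D - cover

uY, uN = Y.sum(axis=0), N.sum(axis=0)     # |Y_{.s}| and |N_{.i}|
tot = Y.sum() + N.sum()                   # |Y| + |N|

# left-hand side: f_CT^D (only used codes contribute, 0*log(0) read as 0)
lhs = -(uY[uY > 0] * log(uY[uY > 0] / tot)).sum() \
      - (uN[uN > 0] * log(uN[uN > 0] / tot)).sum()

# right-hand side of Eq. (10) with mu = 1 + log(n)
mu = 1 + log(n)
rhs = mu * ((D - Y @ X.T) ** 2).sum() \
      - ((uY + 1) * log((uY + 1) / (Y.sum() + r))).sum() + Y.sum()

print(lhs <= rhs)   # True
```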

Appendix 4: Calculating the Lipschitz moduli of \(\textsc {Primp}\)

We study the partial gradients of the regularization term used in Primp (Sect. 3.4)

$$\begin{aligned} \nabla _X G(X,Y)&=c(1)_s^T\\ \nabla _Y G(X,Y)&=-\left( \log \left( \frac{|Y_{\cdot s}|+1}{|Y|+r}\right) \right) _{js}+(1)_{js}. \end{aligned}$$

The partial gradient with respect to X is constant and has a Lipschitz constant of zero. The partial gradient with respect to Y can be written as the sum

$$\begin{aligned} \nabla _YG(X,Y)=-(\underbrace{(\log (|Y_{\cdot s}|+1))_{js}}_{=A(Y)}-\underbrace{(\log (|Y|+r))_{js}}_{=B(Y)})+(1)_{js}. \end{aligned}$$

It follows from the triangle inequality that the gradient with respect to Y is Lipschitz continuous with modulus \(M_{\nabla _YG}(X)=M_A+M_B\), provided that the functions A and B are Lipschitz continuous with moduli \(M_A\) and \(M_B\):

$$\begin{aligned} \Vert \nabla _YG(X,Y)-\nabla _VG(X,V)\Vert&=\Vert A(Y)-A(V)+B(Y)-B(V)\Vert \\&\le \Vert A(Y)-A(V)\Vert +\Vert B(Y)-B(V)\Vert \\&\le (M_A+M_B)\Vert Y-V\Vert . \end{aligned}$$

For any \(\delta >0\), the one-dimensional function \(x\mapsto \log (x+\delta )\), \(x\in {\mathbb {R}}_+\), is Lipschitz continuous with modulus \(\delta ^{-1}\). This follows from the mean value theorem and the bound

$$\begin{aligned} \frac{d}{dx}\log (x+\delta )=\frac{1}{x+\delta }\le \frac{1}{\delta } \end{aligned}$$

for all \(x\ge 0\). The following equations show that \(M_A=M_B=m\). For improved readability, we work with the squared Lipschitz inequality, i.e.,

$$\begin{aligned} \Vert A(Y)-A(V)\Vert ^2&=\sum _{s,j}(\log (|Y_{\cdot s}|+1)-\log (|V_{\cdot s}|+1))^2\nonumber \\&=m\sum _{s=1}^r(\log (|Y_{\cdot s}|+1)-\log (|V_{\cdot s}|+1))^2\nonumber \\&\le m\sum _{s=1}^r(|Y_{\cdot s}|-|V_{\cdot s}|)^2 \end{aligned}$$
(12)
$$\begin{aligned}&= m\sum _{s=1}^r\left( \sum _{j=1}^m(Y_{j s}-V_{j s})\right) ^2\nonumber \\&\le m^2\sum _{s,j}(Y_{j s}-V_{j s})^2= m^2\Vert Y-V\Vert ^2, \end{aligned}$$
(13)

where Eq. (12) follows from the Lipschitz continuity of the logarithmic function as discussed above for \(\delta =1\) and Eq. (13) follows from the Cauchy-Schwarz inequality. Similar steps yield the Lipschitz modulus of B,

$$\begin{aligned} \Vert B(Y)-B(V)\Vert ^2&=\sum _{s,j}(\log (|Y|+r)-\log (|V|+r))^2\\&=mr(\log (|Y|+r)-\log (|V|+r))^2\\&\le \frac{mr}{r^2}(|Y|-|V|)^2\\&= \frac{m}{r}\left( \sum _{s,j}(Y_{j s}-V_{j s})\right) ^2\\&\le m^2\sum _{s,j}(Y_{j s}-V_{j s})^2. \end{aligned}$$

We conclude that the Lipschitz moduli of the gradients are given as

$$\begin{aligned} M_{\nabla _X G}(Y)=0 \quad M_{\nabla _YG}(X)=2m. \end{aligned}$$
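
The following small NumPy sketch (illustrative, with assumed dimensions, not part of the published implementation) evaluates \(\nabla _YG\) as given above and empirically checks that the ratio \(\Vert \nabla _YG(X,Y)-\nabla _YG(X,V)\Vert /\Vert Y-V\Vert \) stays below the modulus 2m for random matrices from the relaxed domain.

```python
import numpy as np

def grad_Y_G(Y, r):
    """Partial gradient of the regularizer w.r.t. Y: the (j,s) entry is
    -log((|Y_{.s}| + 1) / (|Y| + r)) + 1, independent of j (and of X)."""
    m = Y.shape[0]
    col = -np.log((Y.sum(axis=0) + 1.0) / (Y.sum() + r)) + 1.0
    return np.tile(col, (m, 1))

# empirical check of the Lipschitz bound ||grad(Y) - grad(V)|| <= 2m ||Y - V||
rng = np.random.default_rng(2)
m, r = 25, 4
ratios = []
for _ in range(500):
    Y, V = rng.random((m, r)), rng.random((m, r))   # relaxed matrices in [0,1)
    num = np.linalg.norm(grad_Y_G(Y, r) - grad_Y_G(V, r))
    den = np.linalg.norm(Y - V)
    ratios.append(num / den)
print(max(ratios) <= 2 * m)   # True
```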

About this article

Cite this article

Hess, S., Morik, K. & Piatkowski, N. The PRIMPING routine—Tiling through proximal alternating linearized minimization. Data Min Knowl Disc 31, 1090–1131 (2017). https://doi.org/10.1007/s10618-017-0508-z
