
Nonconvex and Nonsmooth Optimization with Generalized Orthogonality Constraints: An Approximate Augmented Lagrangian Method


Abstract

Nonconvex and nonsmooth optimization problems with linear equation and generalized orthogonality constraints have wide applications. These problems are difficult to solve due to the nonsmooth objective function and nonconvex constraints. In this paper, by introducing an extended proximal alternating linearized minimization (EPALM) method, we propose a framework based on the augmented Lagrangian scheme (EPALMAL). We also show that the EPALMAL method has global convergence in the sense that every bounded sequence generated by the EPALMAL method has at least one convergent subsequence that converges to a Karush–Kuhn–Tucker point of the original problem. Experiments on a variety of applications, including compressed modes and multivariate data analysis, demonstrate that the proposed method is noticeably efficient and achieves performance comparable with existing methods.



Notes

  1. We did not compare with the ADMM method, although its convergence for compact constraints has been established (e.g., [48]), since the only difference between the SOC method and the ADMM method is that the augmented penalty parameters in the latter are generally the same for different constraints.

  2. SULDA_admm and SULDA\(\_{\ell _1}\) can be found at http://web.bii.a-star.edu.sg/~zhangxw/Publications.html, SDA and GLOSS can be found at http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=5671 and http://www.hds.utc.fr/~grandval/dokuwiki/doku.php?id=en:code, respectively.

  3. Brain, Lymphoma and Srbct are obtained from http://stat.ethz.ch/~dettling/bagboost.html.

  4. ORL\(_{64\times 64}\) is from http://www.cl.cam.ac.uk/Reasearch/DTG/attarchive:pub/data/attfaces/tar.Z, and Palmprint is from http://www4.comp.polyu.edu.hk/~biometrics/.

  5. All data sets were downloaded from http://glaros.dtc.umn.edu/gkhome/cluto/cluto/download.

  6. The MATLAB implementation of SCCA with acceleration is available at http://web.bii.a-star.edu.sg/~zhangxw/Publications.html.

  7. http://www.isi.edu/natural-language/download/hansard/.

  8. http://www.statmt.org/europarl/.

References

1. Abrudan, T., Eriksson, J., Koivunen, V.: Steepest descent algorithms for optimization under unitary matrix constraint. IEEE Trans. Signal Process. 56(3), 1134–1147 (2008)

2. Abrudan, T., Eriksson, J., Koivunen, V.: Conjugate gradient algorithm for optimization under unitary matrix constraint. Signal Process. 89(9), 1704–1714 (2009)

3. Absil, P.A., Mahony, R., Sepulchre, R.: Optimization Algorithms on Matrix Manifolds. Princeton University Press, Princeton (2009)

4. Andreani, R., Birgin, E.G., Martínez, J.M., Schuverdt, M.L.: On augmented Lagrangian methods with general lower-level constraints. SIAM J. Optim. 18(4), 1286–1309 (2007)

5. Attouch, H., Bolte, J., Redont, P., Soubeyran, A.: Proximal alternating minimization and projection methods for nonconvex problems: an approach based on the Kurdyka–Łojasiewicz inequality. Math. Oper. Res. 35(2), 438–457 (2010)

6. Attouch, H., Bolte, J., Svaiter, B.F.: Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward–backward splitting, and regularized Gauss–Seidel methods. Math. Program. 137(1–2), 91–129 (2013)

7. Bertsekas, D.P.: Constrained Optimization and Lagrange Multiplier Methods. Academic Press, London (1982)

8. Bertsekas, D.P.: Nonlinear Programming. Athena Scientific, Belmont (1999)

9. Bolte, J., Sabach, S., Teboulle, M.: Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Math. Program. 146(1–2), 459–494 (2014)

10. Bradley, A.P.: The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit. 30(7), 1145–1159 (1997)

11. Chen, W.: Wavelet frames on the sphere, high angular resolution diffusion imaging and \(l_1\)-regularized optimization on Stiefel manifolds. Ph.D. thesis, The National University of Singapore (2015)

12. Chen, W., Ji, H., You, Y.: An augmented Lagrangian method for \(\ell _1\)-regularized optimization problems with orthogonality constraints. SIAM J. Sci. Comput. 38(4), B570–B592 (2016)

13. Chu, D., Liao, L.Z., Ng, M.K., Zhang, X.: Sparse canonical correlation analysis: new formulation and algorithm. IEEE Trans. Pattern Anal. Mach. Intell. 35(12), 3050–3065 (2013)

14. Chu, M.T., Trendafilov, N.T.: The orthogonally constrained regression revisited. J. Comput. Graph. Stat. 10(4), 746–771 (2001)

15. Clarke, F.H., Ledyaev, Y.S., Stern, R.J., Wolenski, P.R.: Nonsmooth Analysis and Control Theory, vol. 178. Springer, Berlin (2008)

16. Clemmensen, L., Hastie, T., Witten, D., Ersbøll, B.: Sparse discriminant analysis. Technometrics 53(4), 406–413 (2011)

17. Duda, R., Hart, P., Stork, D.: Pattern Classification. Wiley, New York (2000)

18. Edelman, A., Arias, T.A., Smith, S.T.: The geometry of algorithms with orthogonality constraints. SIAM J. Matrix Anal. Appl. 20(2), 303–353 (1998)

19. Fawcett, T.: An introduction to ROC analysis. Pattern Recognit. Lett. 27(8), 861–874 (2006)

20. Francisco, J.B., Martínez, J.M., Martínez, L., Pisnitchenko, F.: Inexact restoration method for minimization problems arising in electronic structure calculations. Comput. Optim. Appl. 50(3), 555–590 (2011)

21. Grubišić, I., Pietersz, R.: Efficient rank reduction of correlation matrices. Linear Algebra Appl. 422(2), 629–653 (2007)

22. Hardoon, D.R., Shawe-Taylor, J.: Sparse canonical correlation analysis. Mach. Learn. 83(3), 331–353 (2011)

23. Hestenes, M.R.: Multiplier and gradient methods. J. Optim. Theory Appl. 4(5), 303–320 (1969)

24. Hotelling, H.: Relations between two sets of variates. Biometrika 28(3/4), 321–377 (1936)

25. Howland, P., Jeon, M., Park, H.: Structure preserving dimension reduction for clustered text data based on the generalized singular value decomposition. SIAM J. Matrix Anal. Appl. 25, 165–179 (2003)

26. Jiang, B., Dai, Y.H.: A framework of constraint preserving update schemes for optimization on Stiefel manifold. Math. Program. 153(2), 535–575 (2015)

27. Koehn, P.: Europarl: a parallel corpus for statistical machine translation. In: MT Summit, vol. 5, pp. 79–86. Citeseer (2005)

28. Kokiopoulou, E., Chen, J., Saad, Y.: Trace optimization and eigenproblems in dimension reduction methods. Numer. Linear Algebra Appl. 18(3), 565–602 (2011)

29. Kovnatsky, A., Glashoff, K., Bronstein, M.M.: MADMM: a generic algorithm for non-smooth optimization on manifolds. arXiv preprint arXiv:1505.07676 (2015)

30. Kurdyka, K.: On gradients of functions definable in o-minimal structures. Annales de l’institut Fourier 48(3), 769–783 (1998)

31. Lai, R., Osher, S.: A splitting method for orthogonality constrained problems. J. Sci. Comput. 58(2), 431–449 (2014)

32. Li, G., Pong, T.K.: Global convergence of splitting methods for nonconvex composite optimization. SIAM J. Optim. 25(4), 2434–2460 (2015)

33. Lojasiewicz, S.: Une propriété topologique des sous-ensembles analytiques réels. Les équations aux dérivées partielles, pp. 87–89 (1963)

34. Lu, Z., Zhang, Y.: An augmented Lagrangian approach for sparse principal component analysis. Math. Program. 135(1–2), 149–193 (2012)

35. Merchante, L., Grandvalet, Y., Govaert, G.: An efficient approach to sparse linear discriminant analysis. In: Proceedings of the 29th International Conference on Machine Learning (2012)

36. Mordukhovich, B.S.: Variational Analysis and Generalized Differentiation I: Basic Theory, vol. 330. Springer, Berlin (2006)

37. Mordukhovich, B.S., Shao, Y.: On nonconvex subdifferential calculus in Banach spaces. J. Convex Anal. 2(1/2), 211–227 (1995)

38. Moreau, J.J.: Proximité et dualité dans un espace hilbertien. Bulletin de la Société Mathématique de France 93, 273–299 (1965)

39. Nishimori, Y., Akaho, S.: Learning algorithms utilizing quasi-geodesic flows on the Stiefel manifold. Neurocomputing 67, 106–135 (2005)

40. Ozoliņš, V., Lai, R., Caflisch, R., Osher, S.: Compressed modes for variational problems in mathematics and physics. Proc. Natl. Acad. Sci. 110(46), 18368–18373 (2013)

41. Powell, M.J.: A method for non-linear constraints in minimization problems. UKAEA (1967)

42. Rockafellar, R.T.: Augmented Lagrange multiplier functions and duality in nonconvex programming. SIAM J. Control 12(2), 268–285 (1974)

43. Rockafellar, R.T., Wets, R.J.B.: Variational Analysis, vol. 317. Springer, Berlin (2009)

44. Savas, B., Lim, L.H.: Quasi-Newton methods on Grassmannians and multilinear approximations of tensors. SIAM J. Sci. Comput. 32(6), 3352–3393 (2010)

45. Sriperumbudur, B.K., Torres, D.A., Lanckriet, G.R.: A majorization-minimization approach to the sparse generalized eigenvalue problem. Mach. Learn. 85(1–2), 3–39 (2011)

46. Vinokourov, A., Cristianini, N., Shawe-Taylor, J.S.: Inferring a semantic representation of text via cross-language correlation analysis. In: Advances in Neural Information Processing Systems, pp. 1473–1480 (2002)

47. Voorhees, E.M.: The sixth text retrieval conference (TREC-6). Inf. Process. Manag. 36(1), 1–2 (2000)

48. Wang, Y., Yin, W., Zeng, J.: Global convergence of ADMM in nonconvex nonsmooth optimization. arXiv preprint arXiv:1511.06324 (2015)

49. Wen, Z., Yang, C., Liu, X., Zhang, Y.: Trace-penalty minimization for large-scale eigenspace computation. J. Sci. Comput. 66, 1175–1203 (2016)

50. Wen, Z., Yin, W.: A feasible method for optimization with orthogonality constraints. Math. Program. 142(1–2), 397–434 (2013)

51. Witten, D.M., Tibshirani, R., Hastie, T.: A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics 10(3), 515–534 (2009)

52. Yang, C., Meza, J.C., Wang, L.W.: A trust region direct constrained minimization algorithm for the Kohn–Sham equation. SIAM J. Sci. Comput. 29(5), 1854–1875 (2007)

53. Yang, K., Cai, Z., Li, J., Lin, G.: A stable gene selection in microarray data analysis. BMC Bioinform. 7(1), 228 (2006)

54. Ye, J.: Characterization of a family of algorithms for generalized discriminant analysis on undersampled problems. J. Mach. Learn. Res. 6(4), 483–502 (2005)

55. Zhang, L., Li, R.: Maximization of the sum of the trace ratio on the Stiefel manifold, I: theory. Sci. China Math. 57(12), 2495–2508 (2014)

56. Zhang, L., Li, R.: Maximization of the sum of the trace ratio on the Stiefel manifold, II: computation. Sci. China Math. 58(7), 1549–1566 (2015)

57. Zhang, X.: Sparse dimensionality reduction methods: algorithms and applications. Ph.D. thesis, The National University of Singapore (2013)

58. Zhang, X., Chu, D.: Sparse uncorrelated linear discriminant analysis. In: Proceedings of the 30th International Conference on Machine Learning (ICML-13), pp. 45–52 (2013)

59. Zhang, X., Chu, D., Tan, R.C.: Sparse uncorrelated linear discriminant analysis for undersampled problems. IEEE Trans. Neural Netw. Learn. Syst. 27(7), 1469–1485 (2015)

60. Zhao, Y., Karypis, G.: Empirical and theoretical comparisons of selected criterion functions for document clustering. Mach. Learn. 55(3), 311–331 (2004)


Acknowledgements

The authors would like to thank the two anonymous referees for their valuable comments and suggestions. The work of L.-Z. Liao was supported in part by grants from Hong Kong Baptist University (FRG) and General Research Fund (GRF) of Hong Kong.


Corresponding author

Correspondence to Li-Zhi Liao.

Appendices

Appendix 1: Proof of Lemma 2

To prove Lemma 2, we need a few preliminary results regarding the limiting subdifferential of indicator functions. For any closed set \({\mathscr {X}}\), it is well known [37] that the limiting subdifferential of the indicator function \(\delta _{{\mathscr {X}}}\) at x is given by the normal cone to \({\mathscr {X}}\) at x, denoted by \(N_{{\mathscr {X}}}(x)\), that is,

$$\begin{aligned} \partial \delta _{{\mathscr {X}}}(x) = N_{{\mathscr {X}}}(x), \quad \forall x \in {\mathscr {X}}. \end{aligned}$$

Next, we consider the normal cone of two specific closed sets. Let \(\ell (X) = {\mathscr {A}}X : {\mathbb {R}}^{m\times n}\rightarrow {\mathbb {R}}^{\gamma \times n}\) be a linear mapping with \({\mathscr {A}} \in {\mathbb {R}}^{\gamma \times m}\), and define the closed set \({\mathscr {X}}:= \{X \in {\mathbb {R}}^{m\times n} \ | \ {\mathscr {A}}X = 0\}\). It is easy to show that

$$\begin{aligned} N_{{\mathscr {X}}}(X)=\{{\mathscr {A}}^T {\varGamma } ~|~ {\varGamma } \in {\mathbb {R}}^{\gamma \times n}\}. \end{aligned}$$
(32)

Let \({\mathscr {M}} = \{X\in {\mathbb {R}}^{n\times q} \ |\ X^TMX = I_q, M\in S_+^n\}\) be the set of matrices satisfying the generalized orthogonality constraints. It follows that, for any curve \(Y(t)\in {\mathscr {M}}\) with \(Y(0) = X\), \((Y'(t))^TMY(t) + Y(t)^TMY'(t) = 0\). Setting \(t = 0\) and noticing that \(Y'(0)\in T_{{\mathscr {M}}}(X)\), we have

$$\begin{aligned} \eta ^TMX + X^TM\eta = 0\quad \forall \ \eta \in T_{{\mathscr {M}}}(X). \end{aligned}$$

Let \(MX = {\bar{U}}{\bar{{\varSigma }}} {\bar{V}}^T\) be the reduced SVD of MX, and let \((MX)_\perp \) be a column orthogonal matrix such that \([{\bar{U}} \ (MX)_\perp ]\) is an orthogonal matrix. Then \(\eta \) can be written as

$$\begin{aligned} \eta = {\bar{U}}\eta _1 + (MX)_\perp \eta _2, \end{aligned}$$

where \(\eta _1^T{\bar{{\varSigma }}}{\bar{V}}^T + {\bar{V}}{\bar{{\varSigma }}}\eta _1 = 0\), which means \({\bar{V}}{\bar{{\varSigma }}}\eta _1\) is skew-symmetric. Moreover,

$$\begin{aligned} \eta= & {} {\bar{U}}\eta _1 + (MX)_\perp \eta _2\\= & {} {\bar{U}}{\bar{{\varSigma }}}^{-1}{\bar{V}}^T({\bar{V}}{\bar{{\varSigma }}}\eta _1) + (MX)_\perp \eta _2\\= & {} (X^TM)^\dagger ({\bar{V}}{\bar{{\varSigma }}}\eta _1) + (MX)_\perp \eta _2, \end{aligned}$$

where \((X^TM)^\dagger \) is the pseudo-inverse of \(X^TM\). Therefore, the tangent space of \({\mathscr {M}}\) is given by

$$\begin{aligned} T_{{\mathscr {M}}}(X) = \{(X^TM)^\dagger {\varOmega } + (MX)_\perp K \ | \ {\varOmega }^T + {\varOmega } = 0\}. \end{aligned}$$

In addition, any \(\zeta \in N_{{{\mathscr {M}}}}(X)\) can be written as

$$\begin{aligned} \zeta = {\bar{U}}\zeta _1 + (MX)_\perp \zeta _2 = (MX){\bar{V}}{\bar{{\varSigma }}}^{-1}\zeta _1 + (MX)_\perp \zeta _2. \end{aligned}$$

Since \(\left\langle \zeta , \eta \right\rangle = 0\) for any \(\eta \in T_{{\mathscr {M}}}(X)\), we have

$$\begin{aligned} \left\langle \eta _1, \zeta _1\right\rangle = 0\quad \text{ and } \quad \zeta _2 = 0. \end{aligned}$$

The first equality implies that

$$\begin{aligned} 0 = \left\langle \eta _1, \zeta _1\right\rangle = \left\langle {\bar{V}}{\bar{{\varSigma }}}\eta _1, {\bar{V}}{\bar{{\varSigma }}}^{-1}\zeta _1\right\rangle , \end{aligned}$$

and thus \({\bar{V}}{\bar{{\varSigma }}}^{-1}\zeta _1\) must be symmetric. Hence, the normal cone of \({\mathscr {M}}\) at X is given by

$$\begin{aligned} N_{{\mathscr {M}}}(X) = \{(MX)S\ |\ S\in S^{q}\}. \end{aligned}$$
(33)
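As a quick sanity check (not part of the original paper), the characterizations of \(T_{{\mathscr {M}}}(X)\) and \(N_{{\mathscr {M}}}(X)\) can be verified numerically. The following minimal Python/NumPy sketch, with illustrative variable names of our own choosing, builds a feasible pair \((M, X)\) with \(X^TMX = I_q\) and checks that a vector of the form \(MXS\) with \(S\) symmetric is orthogonal to a tangent vector of the form \((X^TM)^\dagger {\varOmega } + (MX)_\perp K\) with \({\varOmega }\) skew-symmetric.

```python
import numpy as np

rng = np.random.default_rng(0)
n, q = 8, 3

# Random symmetric positive definite M and a feasible X with X^T M X = I_q.
B = rng.standard_normal((n, n))
M = B @ B.T + n * np.eye(n)
A = rng.standard_normal((n, q))
L = np.linalg.cholesky(A.T @ M @ A)        # A^T M A = L L^T
X = A @ np.linalg.inv(L).T                 # then X^T M X = I_q

MX = M @ X
Ufull, _, _ = np.linalg.svd(MX, full_matrices=True)
MX_perp = Ufull[:, q:]                     # basis of the complement of range(MX)

# Tangent vector: eta = (X^T M)^+ Omega + (MX)_perp K, with Omega skew-symmetric.
W = rng.standard_normal((q, q)); Omega = W - W.T
K = rng.standard_normal((n - q, q))
eta = np.linalg.pinv(X.T @ M) @ Omega + MX_perp @ K

# Normal vector: zeta = (MX) S with S symmetric.
S = rng.standard_normal((q, q)); S = S + S.T
zeta = MX @ S

print(np.linalg.norm(X.T @ M @ X - np.eye(q)))   # ~ 0: feasibility
print(abs(np.trace(zeta.T @ eta)))               # ~ 0: <zeta, eta> = 0
```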

Now, we are ready to prove Lemma 2.

Proof

Equalities

$$\begin{aligned} AX^* + BY^* - C = 0, \ X^* - G^* = 0\ \text{ and } \ (G^*)^TMG^* = I_q \end{aligned}$$

hold since \((X^*, Y^*, G^*)\) is feasible as a local minimizer. For the convenience of analysis, we denote \(W := \begin{pmatrix} X\\ Y\\ G \end{pmatrix}\in {\mathbb {R}}^{(2n+m)\times q}\) and define \(g_1: {\mathbb {R}}^{(2n+m)\times q} \rightarrow {\mathbb {R}} ^{(l+ n)\times q}\) as

$$\begin{aligned} g_1(W) = \begin{pmatrix} A &{}\quad B &{}\quad 0\\ I &{}\quad 0 &{}\quad -I \end{pmatrix} \begin{pmatrix} X\\ Y\\ G \end{pmatrix}. \end{aligned}$$

Let \({\varOmega } = \{W\in {\mathbb {R}}^{(2n+m)\times q}\ | \ g_1(W) = 0\}\), then problem (9) is equivalent to

$$\begin{aligned} \min _W \ f(X) + g(Y) + h(X, Y)+ \delta _{{\mathscr {M}}}(G) + \delta _{{\varOmega }}(W). \end{aligned}$$

Since \((X^*, Y^*, G^*)\) is a local minimizer, by the generalized Fermat’s rule and subdifferentiability property [15, 43], we have

$$\begin{aligned} 0\in \begin{pmatrix} \partial f(X^*) + \partial _X h(X^*, Y^*)\\ \partial g(Y^*) + \partial _Y h(X^*, Y^*)\\ \partial \delta _{{\mathscr {M}}}(G^*) \end{pmatrix} + \partial \delta _{{\varOmega }}(W^*). \end{aligned}$$

As described at the beginning of this “Appendix 1”, the subdifferentials of the indicator functions \(\delta _{{\mathscr {M}}}\) and \(\delta _{{\varOmega }}\) are given by the normal cones (33) and (32), respectively. In particular,

$$\begin{aligned} \partial \delta _{{\mathscr {M}}}(G^*) = N_{{\mathscr {M}}}(G^*) = \{MG^*S\ | \ S\in S^q\}, \end{aligned}$$

and

$$\begin{aligned} \partial \delta _{{\varOmega }}(W^*) = N_{{\varOmega }}(W^*) = \left\{ \begin{pmatrix} A^T &{}\quad I\\ B^T &{}\quad 0\\ 0 &{}\quad -I \end{pmatrix} \begin{pmatrix} {\varLambda }_1\\ {\varLambda }_2 \end{pmatrix} | \ {\varLambda }_1\in {\mathbb {R}}^{l\times q}, \ {\varLambda }_2\in {\mathbb {R}}^{n\times q} \right\} . \end{aligned}$$

Therefore, there exist \(v^*\in \partial f(X^*), \ w^*\in \partial g(Y^*), \ {\varLambda }_1^*\in {\mathbb {R}}^{l\times q}, \ {\varLambda }_2^*\in {\mathbb {R}}^{n\times q}, \ {\varLambda }_3^*\in S^q\) such that

$$\begin{aligned} 0= & {} \begin{pmatrix} v^* + \partial _X h(X^*, Y^*)\\ w^* + \partial _Y h(X^*, Y^*)\\ 2MG^*{\varLambda }_3^* \end{pmatrix} + \begin{pmatrix} A^T &{}\quad I\\ B^T &{}\quad 0\\ 0 &{} -I \end{pmatrix} \begin{pmatrix} {\varLambda }_1^*\\ {\varLambda }_2^* \end{pmatrix} = \begin{pmatrix} v^* + \partial _X h(X^*, Y^*)\\ w^* + \partial _Y h(X^*, Y^*)\\ 0 \end{pmatrix}\\&+ \begin{pmatrix} A^T &{}\quad I &{}\quad 0\\ B^T &{}\quad 0 &{}\quad 0\\ 0 &{} -I &{} 2MG^* \end{pmatrix} \begin{pmatrix} {\varLambda }_1^* \\ {\varLambda }_2^* \\ {\varLambda }_3^* \end{pmatrix}, \end{aligned}$$

which proves equality (12). Moreover, it yields \({\varLambda }_2^* = 2MG^*{\varLambda }_3^*\). Substituting this into (12) and eliminating \(G^*\), we obtain equality (13). This completes the proof. \(\square \)

Appendix 2: Proof of Theorem 1

Proof

For any limit point \((X^*, Y^*, G^*)\) of the bounded sequence \(\{(X^k, Y^k, G^k)\}_{k\in {\mathbb {N}}}\), there exists an index set \({\mathscr {K}}\subset {\mathbb {N}}\) such that \(\{(X^k, Y^k, G^k)\}_{k\in {\mathscr {K}}}\) converges to \((X^*, Y^*, G^*)\). To prove that \((X^*, Y^*, G^*)\) is a KKT point, we first show that it is a feasible point. The equality \((G^*)^TMG^* = I_q\) is trivial to check since \((G^k)^TMG^k = I_q\) holds for any \(k\in {\mathbb {N}}\). If \(\{\rho _k\}\) is bounded, then by the updating rule of \(\rho _k\) in Algorithm 2, there exists a \(k_0\in {\mathbb {N}}\) such that

$$\begin{aligned} \Vert R_j^k\Vert _\infty \le \tau \Vert R_j^{k - 1}\Vert _\infty \quad \forall k\ge k_0, \ j = 1, 2. \end{aligned}$$

By the definition of \(R_j^k, j = 1, 2\), it holds that

$$\begin{aligned} \left\{ \begin{array}{ll} \Vert AX^k + BY^k - C\Vert _\infty \le \tau \Vert AX^{k - 1} + BY^{k - 1} - C\Vert _\infty , \\ \Vert X^k - G^k\Vert _\infty \le \tau \Vert X^{k - 1} - G^{k - 1}\Vert _\infty , \end{array} \right. \end{aligned}$$

for any \(k\ge k_0\). Thus

$$\begin{aligned} \left\{ \begin{array}{ll} AX^* + BY^* - C = 0, \\ X^* - G^* = 0. \end{array} \right. \end{aligned}$$

If \(\{\rho _k\}\) is unbounded, by the generalized Fermat rule, finding a solution satisfying the constraint (11) is equivalent to calculating a point \((X^k, Y^k, G^k)\) such that

$$\begin{aligned}&\left\| \begin{pmatrix} v^k + \partial _X h(X^k, Y^k)\\ w^k + \partial _Y h(X^k, Y^k)\\ 0 \end{pmatrix}/\rho _{k-1} + \begin{pmatrix} A^T &{}\quad I &{}\quad 0\\ B^T &{}\quad 0 &{}\quad 0\\ 0 &{}\quad -I &{}\quad 2MG^k \end{pmatrix} \begin{pmatrix} \frac{{\bar{{\varLambda }}}_1^{k-1}}{\rho _{k-1}} + (AX^k + BY^k -C) \\ \frac{{\bar{{\varLambda }}}_2^{k-1}}{\rho _{k-1}} + (X^k - G^k) \\ {\varLambda }_3^k/\rho _{k-1} \end{pmatrix} \right\| _\infty \nonumber \\&\quad \le \frac{\epsilon _{k-1}}{\rho _{k-1}}, \end{aligned}$$
(34)

for some \(v^k\in \partial f(X^k), w^k\in \partial g(Y^k), {\varLambda }_3^k \in {\mathscr {S}}^q\), and \(\epsilon _{k}\downarrow 0\) as \(k\rightarrow \infty \). Notice that \(\{{\bar{{\varLambda }}}_1^k\}\) and \(\{{\bar{{\varLambda }}}_2^k\}\) are bounded, and that \(\{v^k\}_{k\in {\mathscr {K}}}, \{w^k\}_{k\in {\mathscr {K}}}, \{\partial _X h(X^k, Y^k)\}\) and \(\{\partial _Y h(X^k, Y^k)\}\) are bounded under Assumption 1 (iii)–(iv). Letting \(k\in {\mathscr {K}}\) go to infinity, Eq. (34) implies that

$$\begin{aligned} \begin{pmatrix} A^T &{}\quad I\\ B^T &{}\quad 0 \end{pmatrix} \begin{pmatrix} AX^* + BY^* - C \\ X^* - G^* \end{pmatrix} =0. \end{aligned}$$

Recalling that B has full row rank, we get \(AX^* + BY^* - C = 0\) and \(X^* - G^* = 0\). Therefore, in both cases, \((X^*, Y^*, G^*)\) is a feasible point.

Next, we show that there exist \({\varLambda }_1^*\in {\mathbb {R}}^{l\times q}, {\varLambda }_2^*\in {\mathbb {R}}^{n\times q}\) and \({\varLambda }_3^*\in {\mathscr {S}}^q\) such that \((X^*, Y^*, G^*; {\varLambda }_1^*, {\varLambda }_2^*, {\varLambda }_3^*)\) satisfies (12). Since \(\{(X^k, Y^k, G^k)\}_{k\in {\mathbb {N}}}\) is bounded, there exists an index set \({\mathscr {K}}\subseteq {\mathbb {N}}\) such that \(\lim _{k\in {\mathscr {K}}}(X^k, Y^k, G^k) = (X^*, Y^*, G^*)\). Since \(\{v^k\}_{k\in {\mathscr {K}}}\) is bounded, there exists a subsequence \({\mathscr {K}}_2\subseteq {\mathscr {K}}\) such that \(\lim _{k\in {\mathscr {K}}_2}v^k = v^*\). Moreover, by the closedness property of the limiting subdifferential, we get

$$\begin{aligned} v^*\in \partial f(X^*). \end{aligned}$$

Similarly, there exists a subsequence \({\mathscr {K}}_3\subseteq {\mathscr {K}}_2\) such that \(\lim _{k\in {\mathscr {K}}_3}w^k = w^*\), and

$$\begin{aligned} w^*\in \partial g(Y^*). \end{aligned}$$

Combining (34) with the updating formulas of \({\bar{{\varLambda }}}_1^k\) and \({\bar{{\varLambda }}}_2^k\) in Step 2 of Algorithm 2, there exists a \(\xi ^k\) with \(\Vert \xi ^k\Vert _\infty \le \frac{\epsilon _{k-1}}{\rho _{k-1}}\) such that

$$\begin{aligned} \begin{pmatrix} A^T &{}\quad I &{}\quad 0\\ B^T &{}\quad 0 &{}\quad 0\\ 0 &{}\quad -I &{}\quad 2MG^k \end{pmatrix} \begin{pmatrix} {{\varLambda }}_1^{k} \\ {{\varLambda }}_2^{k} \\ {\varLambda }_3^k \end{pmatrix}/ \rho _{k-1} = \xi ^k - \begin{pmatrix} v^k + \partial _X h(X^k, Y^k)\\ w^k + \partial _Y h(X^k, Y^k)\\ 0 \end{pmatrix}/\rho _{k-1}. \end{aligned}$$
(35)

Define

$$\begin{aligned} {\varXi }^k = \begin{pmatrix} A^T &{}\quad I &{}\quad 0\\ B^T &{}\quad 0 &{}\quad 0\\ 0 &{}\quad -I &{}\quad 2MG^k \end{pmatrix} \in {\mathbb {R}}^{(2n+m)\times (l+n+q)}, \quad {\varUpsilon }^k = \begin{pmatrix} {\varLambda }_1^{k}\\ {\varLambda }_2^{k}\\ {\varLambda }_3^{k} \end{pmatrix}, \end{aligned}$$

we can rewrite (35) as

$$\begin{aligned} {\varXi }^k{\varUpsilon }^k = \rho _{k-1} \xi ^{k} - \begin{pmatrix} v^k + \partial _X h(X^k, Y^k)\\ w^k + \partial _Y h(X^k, Y^k)\\ 0 \end{pmatrix}. \end{aligned}$$

Since the columns of \({\varXi }^k\) are linearly independent, \(({\varXi }^k)^T({\varXi }^k)\) is nonsingular. Thus

$$\begin{aligned} {\varUpsilon }^k = (({\varXi }^k)^T{\varXi }^k)^{-1}({\varXi }^k)^T \left[ \rho _{k-1} \xi ^k - \begin{pmatrix} v^k + \partial _X h(X^k, Y^k)\\ w^k + \partial _Y h(X^k, Y^k)\\ 0 \end{pmatrix} \right] . \end{aligned}$$
(36)

Taking the limit in (36) as \(k\in {\mathscr {K}}_3\) goes to infinity, and noticing that \(\Vert \xi ^k\Vert _{\infty } \le \frac{\epsilon _{k-1}}{\rho _{k-1}}\) with \(\epsilon _k\downarrow 0\) as \(k\rightarrow \infty \), we have

$$\begin{aligned} \lim \limits _{k\in {\mathscr {K}}_3, k\rightarrow \infty }{\varUpsilon }^k = {\varUpsilon }^* := - (({\varXi }^*)^T{\varXi }^*)^{-1}({\varXi }^*)^T \begin{pmatrix} v^* + \partial _X h(X^*, Y^*)\\ w^* + \partial _Y h(X^*, Y^*)\\ 0 \end{pmatrix}, \end{aligned}$$

where

$$\begin{aligned} {\varXi }^* = \begin{pmatrix} A^T &{}\quad I &{}\quad 0 \\ B^T &{}\quad 0 &{}\quad 0 \\ 0 &{}\quad -I &{}\quad 2MG^* \end{pmatrix} \end{aligned}$$

has full column rank. From the definition of \({\varUpsilon }^k\), taking the limit as \(k\in {\mathscr {K}}_3\) goes to infinity on both sides of (35) yields

$$\begin{aligned} \begin{pmatrix} v^* + \partial _X h(X^*, Y^*)\\ w^* + \partial _Y h(X^*, Y^*)\\ 0 \end{pmatrix} + \begin{pmatrix} A^T &{}\quad I &{}\quad 0 \\ B^T &{}\quad 0 &{}\quad 0 \\ 0 &{}\quad -I &{}\quad 2MG^* \end{pmatrix} \begin{pmatrix} {\varLambda }_1^* \\ {\varLambda }_2^* \\ {\varLambda }_3^* \end{pmatrix}=0, \end{aligned}$$

where \({\varLambda }_3^*\in {\mathscr {S}}^q\) since \({\varLambda }_3^k\in {\mathscr {S}}^q\) for any \(k\in {\mathbb {N}}\). According to Lemma 2, \((X^*, Y^*, G^*)\) is a KKT point of problem (9). Moreover, \((X^*, Y^*)\) is a KKT point of problem (1). \(\square \)

Appendix 3: Proof of Proposition 1

Proof

In this proof, we assume k is fixed and, for simplicity of notation, we let \(W := \begin{pmatrix} X\\ Y\\ G \end{pmatrix}\) and \(L_k(W): = L_{\rho _{k-1}}(X, Y, G; {\bar{{\varLambda }}}^{k-1})\).

1) By the first-order optimality conditions of three subproblems in (15), there exist \(v^{k,j}\in \partial f_1^k(X^{k, j}), w^{k,j}\in \partial f_2^k(Y^{k, j}), \nu ^{k,j}\in \partial f_3^k(G^{k,j})\) such that

$$\begin{aligned} \left\{ \begin{array}{ll} 0 = v^{k,j} + \nabla _{X} H_k(X^{k,j-1}, Y^{k,j-1}, G^{k,j-1}) + B_1^{k,j-1}(X^{k,j}- X^{k,j-1}), \\ 0 = w^{k,j} + \nabla _{Y} H_k(X^{k,j}, Y^{k,j-1}, G^{k,j-1}) + B_2^{k,j-1}(Y^{k,j}- Y^{k,j-1}), \\ 0 = \nu ^{k,j} - {\bar{{\varLambda }}}_1^{k-1}/\rho _{k-1} + (G^{k, j} - X^{k, j}) + B_3^k(G^{k, j} - G^{k, j-1}). \end{array} \right. \end{aligned}$$

Combining the above relations with simple calculations, \(A^{k,j}\) defined in (19) has the following properties:

$$\begin{aligned} A_{X}^{k,j}= & {} \nabla _X H_k(W^{k,j}) - \nabla _X H_k(W^{k,j-1}) + B_1^{k,j-1}(X^{k, j-1} - X^{k,j})\\= & {} v^{k,j} +\nabla _{X} H_k(W^{k,j}) \in \partial _{X} L_k(W^{k,j}),\\ A_{Y}^{k,j}= & {} \nabla _Y H_k(W^{k,j}) - \nabla _Y H_k(X^{k,j}, Y^{k,j-1}, G^{k,j-1}) + B_2^{k,j-1}(Y^{k, j-1} - Y^{k,j}) \\= & {} w^{k,j} + \nabla _{Y} H_k(W^{k,j}) \in \partial _{Y} L_k(W^{k,j}),\\ A_G^{k,j}= & {} B_3^k(G^{k, j-1} - G^{k, j}) = \nu ^{k,j} - {\bar{{\varLambda }}}_1^{k-1}/\rho _{k-1} + (G^{k,j} - X^{k,j}) \\= & {} \nu ^{k,j} + \nabla _G H_k(W^{k,j}) \in \partial _G L_k(W^{k,j}). \end{aligned}$$

Since \(H_k(W)\) is continuously differentiable, by subdifferentiability property [5, 43],

$$\begin{aligned} \partial L_k(W) = \partial _{X} L_k(W) \times \partial _{Y} L_k(W) \times \partial _G L_k(W), \end{aligned}$$

which implies

$$\begin{aligned} A^{k,j} \in \partial L_k(W^{k,j}) = \partial L_{\rho _{k-1}}(X^{k,j}, Y^{k,j}, G^{k,j}; {\bar{{\varLambda }}}^{k-1})\quad \forall j\in {\mathbb {N}}. \end{aligned}$$

To show \(\Vert A^{k,j}\Vert _\infty \rightarrow 0\), it suffices to show that \(L_k(W)\) satisfies the conditions in Assumption 2 and apply Lemma 1 c).

We first show that the sequence \(\{W^{k,j}\}_{j\in {\mathbb {N}}}\) generated by Algorithm 3 is bounded; we argue by contradiction. Notice that \({\tilde{L}}_k(W) = \rho _{k-1}L_k(W)\) is a coercive function, since \(\phi (X, Y) + \frac{\rho _0}{2}\Vert AX + BY - C\Vert _F^2\) is coercive by assumption and \(\{\rho _k\}_{k\in {\mathbb {N}}}\) is non-decreasing, \(\delta _{{\mathscr {M}}}(G)\) is coercive, and \(\frac{\rho _{k-1}}{2}\Vert X - G + {\bar{{\varLambda }}}_2^{k-1}\Vert _F^2 - \frac{1}{2\rho _{k-1}}\Vert {\bar{{\varLambda }}}^{k-1}\Vert _F^2\) is bounded from below. Suppose \(\lim _{j\rightarrow \infty }\Vert W^{k,j}\Vert _\infty = +\infty \); then there must hold

$$\begin{aligned} \lim _{j\rightarrow \infty } {\tilde{L}}_{k}(W^{k,j}) = +\infty . \end{aligned}$$

On the other hand, we know from Lemma 1 a) that \(\{L_{k}(W^{k,j})\}_{j\in {\mathbb {N}}}\) is a decreasing sequence, thus \(\{{\tilde{L}}_{k}(W^{k,j})\}_{j\in {\mathbb {N}}}\) is non-increasing, which implies

$$\begin{aligned} \lim _{j\rightarrow \infty } {\tilde{L}}_{k}(W^{k,j}) \le {\tilde{L}}_{k}(W^{k,0}) < +\infty . \end{aligned}$$

This is a contradiction; hence \(\{W^{k,j}\}_{j\in {\mathbb {N}}}\) is bounded.

Now, we verify that \(L_k(W)\) satisfies the conditions in Assumption 2. By the definitions of \(L_k(W), H_k(W), f_i^k, i = 1, 2, 3\) and Assumption 1 (i), it is easy to see that for any given \({\bar{{\varLambda }}}^{k-1}\) and \(\rho _{k-1}\), the following results hold:

  1. (a)

\(f_i^k, i = 1, 2, 3\), are proper and lower semicontinuous functions satisfying \(\inf f_i^k > -\infty \); \(H_k\) is a \(C^1\) function, and

    $$\begin{aligned} H_k(W)&= \frac{1}{\rho _{k-1}} h(X,Y) + \frac{1}{2}\Vert AX + BY - C + {\bar{{\varLambda }}}_1^{k-1} / \rho _{k-1}\Vert _F^2 \nonumber \\&\quad + \frac{1}{2}\Vert X - G + {\bar{{\varLambda }}}_2^{k-1}/\rho _{k-1}\Vert _F^2 - \frac{1}{2\rho _{k-1}^2}\Vert {\bar{{\varLambda }}}^{k-1}\Vert _F^2. \end{aligned}$$
    (37)

    Thus \(\inf _{W} H_k(W) \ge \frac{1}{\rho _{k-1}}\min _{X,Y}h(X,Y) - \frac{1}{2(\rho _{k-1})^2}\Vert {\bar{{\varLambda }}}^{k-1}\Vert _F^2 > -\infty \).

  2. (b)

Since \(H_k\) is a quadratic function with respect to \(G\), \(\nabla _G H_k\) is obviously Lipschitz continuous. Regarding the Lipschitz continuity of the partial derivatives of \(H_k\) with respect to X and Y, we have

    $$\begin{aligned} \Vert \nabla _{X} H_k(X, Y^{k,j-1}, G^{k,j-1}) - \nabla _{X} H_k({\tilde{X}}, Y^{k,j-1}, G^{k,j-1})\Vert \le L_1^{k,j-1} \Vert X - {\tilde{X}}\Vert , \end{aligned}$$

    where \(L_1^{k,j-1} = \frac{L_1(Y^{k,j-1})}{\rho _{k-1}} + \Vert A^TA\Vert + 1\), and

    $$\begin{aligned} \Vert \nabla _{Y} H_k(X^{k,j}, Y, G^{k,j-1}) - \nabla _{Y} H_k(X^{k,j}, {\tilde{Y}}, G^{k,j-1})\Vert \le L_2^{k,j-1}\Vert Y - {\tilde{Y}}\Vert , \end{aligned}$$

    where \(L_2^{k,j-1} = \frac{L_2(X^{k,j})}{\rho _{k-1}} + \Vert B^TB\Vert + 1\). In addition, let \({\bar{L}}_1 = \sup _{Y}L_1(Y)\) and \({\bar{L}}_2 = \sup _XL_2(X)\), then the boundedness of \(\{W^{k,j}\}_{j\in {\mathbb {N}}}\) and Assumption 1 (iii) imply that \({\bar{L}}_1 < \infty \) and \({\bar{L}}_2 < \infty \), and we have

    $$\begin{aligned}&\Vert A^TA\Vert + 1 \le L_1^{k, j-1} \le {\bar{L}}_1/\rho _0 + \Vert A^TA\Vert + 1,\quad \\&\quad \Vert B^TB\Vert + 1 \le L_2^{k, j-1} \le {\bar{L}}_2/\rho _0 + \Vert B^TB\Vert + 1, \end{aligned}$$

    which imply

    $$\begin{aligned} \inf _j\{L_1^{k, j-1}\} \ge \Vert A^TA\Vert + 1> -\infty , \quad \inf _j\{L_2^{k, j-1}\} \ge \Vert B^TB\Vert + 1 > -\infty , \end{aligned}$$

    and

    $$\begin{aligned}&\sup _j\left\{ L_1^{k, j-1}\right\} \le {\bar{L}}_1/\rho _0 + \Vert A^TA\Vert + 1< +\infty , \quad \sup _j\left\{ L_2^{k, j-1}\right\} \\&\quad \le {\bar{L}}_2/\rho _0 + \Vert B^TB\Vert + 1 < +\infty . \end{aligned}$$

    Moreover, Assumption 2 (iii) holds by the definition of \(B_i^{k, j-1}, i = 1, 2\) and \(B_3^k\). Thus, Assumption 2 (i)–(iii) hold.

  3. (c)

Assumption 2 (iv) holds since h(X, Y) satisfies Assumption 1 (iii).

2) From the proof of 1), we know that \(\{W^{k,j}\}_{j\in {\mathbb {N}}}\) is bounded. It then remains to show that \(L_k(W)\) is a K–L function and to apply Lemma 1 d). Notice that

$$\begin{aligned} L_k(W)&= \frac{1}{\rho _{k-1}} \phi (X,Y) + \frac{1}{\rho _{k-1}}\delta _{{\mathscr {M}}}(G) + \frac{1}{2}\Vert AX + BY - C + {\bar{{\varLambda }}}_1^{k-1} / \rho _{k-1}\Vert _F^2 \nonumber \\&\quad + \frac{1}{2}\Vert X - G + {\bar{{\varLambda }}}_2^{k-1}/\rho _{k-1}\Vert _F^2 - \frac{1}{2\rho _{k-1}^2}\Vert {\bar{{\varLambda }}}^{k-1}\Vert _F^2, \end{aligned}$$
(38)

i.e., \(L_k(W)\) satisfies the K–L property as a finite sum of functions satisfying the K–L property, so the result holds directly. \(\square \)

Appendix 4: Outline of Algorithms for CMs, Sparse ULDA and Sparse CCA

In this section, we give the detailed derivations of the algorithms for CMs, sparse ULDA and sparse CCA used in Sect. 5.

1.1 Compressed Modes

The scaled augmented Lagrangian function associated with (21) is

$$\begin{aligned} L_{\rho _{k-1}}({\varPsi }, X; {\bar{{\varLambda }}}^{k-1}) = \frac{1}{\kappa \rho _{k-1}}\Vert {\varPsi }\Vert _1 + \frac{1}{\rho _{k-1}}\delta _{{\mathscr {X}}}(X) + H_k({\varPsi }, X), \end{aligned}$$

where \({\mathscr {X}} = St(n,N)\) and

$$\begin{aligned} H_k({\varPsi }, X) = \frac{1}{\rho _{k-1}}tr({\varPsi }^TH{\varPsi }) + \langle {\bar{{\varLambda }}}^{k-1}/\rho _{k-1}, {\varPsi } - X\rangle + \frac{1}{2}\Vert {\varPsi } - X\Vert _F^2. \end{aligned}$$

Applying Algorithm 3 with \(B_1^{k,j-1} = \gamma _1^k I\) and \(B_2^{k,j-1} = \gamma _2 I\), we get the following updates of \(({\varPsi }^{k, j}, X^{k, j})\) for any fixed \(k\in {\mathbb {N}}\):

$$\begin{aligned} {\varPsi }^{k,j}&=\mathbf{shrink }\left( {\varPsi }^{k, j-1} - \frac{1}{\gamma _1}\left( \frac{2}{\rho _{k-1}}H{\varPsi }^{k,j-1} + \frac{{\bar{{\varLambda }}}^{k-1}}{\rho _{k-1}} + {\varPsi }^{k,j-1} - X^{k,j-1}\right) ,~\frac{1}{\gamma _1^k \kappa \rho _{k-1}}\right) , \end{aligned}$$
(39)
$$\begin{aligned} X^{k, j}&=\arg \max _{X^TX = I} \left\langle X, ~ {\varPsi }^{k,j} + \frac{{\bar{{\varLambda }}}^{k-1}}{\rho _{k-1}} + (\gamma _2 - 1) X^{k,j-1} \right\rangle = {\tilde{U}}{\tilde{V}}^T, \end{aligned}$$
(40)

where \(\mathbf{shrink }(x, \eta ) = \text{ sign }(x)\odot \max \{|x| - \eta , 0\}\) is the soft-shrinkage operator, \(\odot \) denotes the component-wise product, and \({\tilde{U}}{\tilde{{\varSigma }}}{\tilde{V}}^T = {\varPsi }^{k,j} + \frac{{\bar{{\varLambda }}}^{k-1}}{\rho _{k-1}} + (\gamma _2 - 1) X^{k,j-1}\) is the reduced SVD. The SVD form of the maximizer in (40) follows from the fact that, for any X with \(X^TX = I\), \(\langle X, {\tilde{U}}{\tilde{{\varSigma }}}{\tilde{V}}^T\rangle = tr({\tilde{{\varSigma }}}{\tilde{U}}^TX{\tilde{V}})\le tr({\tilde{{\varSigma }}})\), with equality at \(X = {\tilde{U}}{\tilde{V}}^T\). An outline of the algorithm is given in Algorithm 4.
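For concreteness, a minimal Python/NumPy sketch of one inner iteration (39)–(40) is given below. It is not the authors' implementation; the function and variable names (shrink, cm_inner_step, Psi, X, H, Lam_bar, rho, kappa, gamma1, gamma2) are our own illustrative assumptions, with H standing for the discretized Hamiltonian in (21) and Lam_bar for the current multiplier estimate.

```python
import numpy as np

def shrink(x, eta):
    """Soft-shrinkage operator: sign(x) * max(|x| - eta, 0), applied entrywise."""
    return np.sign(x) * np.maximum(np.abs(x) - eta, 0.0)

def cm_inner_step(Psi, X, H, Lam_bar, rho, kappa, gamma1, gamma2):
    """One inner iteration for compressed modes, following (39)-(40)."""
    # Psi update (39): linearized gradient step on H_k followed by soft-shrinkage.
    grad_Psi = (2.0 / rho) * (H @ Psi) + Lam_bar / rho + (Psi - X)
    Psi_new = shrink(Psi - grad_Psi / gamma1, 1.0 / (gamma1 * kappa * rho))
    # X update (40): projection onto the Stiefel manifold via the reduced SVD.
    T = Psi_new + Lam_bar / rho + (gamma2 - 1.0) * X
    U, _, Vt = np.linalg.svd(T, full_matrices=False)
    return Psi_new, U @ Vt
```

Iterating this step until the change in \(({\varPsi }^{k,j}, X^{k,j})\) is small corresponds to the inner loop of Algorithm 4 for a fixed k.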


1.2 Sparse ULDA


By introducing the auxiliary variable \(Z = X\), the scaled augmented Lagrangian function associated with (25) is

$$\begin{aligned} L_{\rho _{k-1}}(X, G, Z; {\bar{{\varLambda }}}^{k-1}) = \frac{1}{\rho _{k-1}}\Vert G\Vert _1 + \frac{1}{\rho _{k-1}}\delta _{{\mathscr {O}}}(Z) + H_k(X, G, Z), \end{aligned}$$

where \({\mathscr {O}}\) denotes the set of orthogonal matrices and

$$\begin{aligned} H_k(X, G, Z)= & {} \left\langle {\bar{{\varLambda }}}_1^{k-1}/\rho _{k-1},~{\mathscr {A}}X + {\mathscr {B}}G \right\rangle + \frac{1}{2}\Vert {\mathscr {A}}X + {\mathscr {B}}G\Vert _F^2 \\&+\, \left\langle {\bar{{\varLambda }}}_2^{k-1}/\rho _{k-1}, X - Z\right\rangle + \frac{1}{2}\Vert X - Z\Vert _F^2. \end{aligned}$$

Applying Algorithm 3 with \(B_1^{k,j-1} = \gamma _1 I, B_2^{k,j-1} = \gamma _2 I\) and \(B_3^{k} = \gamma _3 I\), we get the following updates of \((X^{k, j}, G^{k, j}, Z^{k, j})\) for any fixed \(k\in {\mathbb {N}}\):

$$\begin{aligned} X^{k, j}&=X^{k, j-1} - \frac{1}{\gamma _1}\left[ {\mathscr {A}}^T\left( {\mathscr {A}}X^{k, j-1} + {\mathscr {B}}G^{k, j-1} + \frac{{\bar{{\varLambda }}}_1^{k-1}}{\rho _{k-1}}\right) + X^{k, j-1} - Z^{k, j-1} + \frac{{\bar{{\varLambda }}}_2^{k-1}}{\rho _{k-1}}\right] , \end{aligned}$$
(41)
$$\begin{aligned} G^{k,j}&=\mathbf{shrink }\left( G^{k, j-1} - \frac{1}{\gamma _2}{\mathscr {B}}^T({\mathscr {A}}X^{k, j} + {\mathscr {B}}G^{k, j-1} + \frac{{\bar{{\varLambda }}}_1^{k-1}}{\rho _{k-1}}),~\frac{1}{\gamma _2\rho _{k-1}}\right) , \end{aligned}$$
(42)
$$\begin{aligned} Z^{k, j}&=\arg \max _{Z^TZ = I} \left\langle Z, ~ X^{k,j} + \gamma _3 Z^{k,j-1} + \frac{{\bar{{\varLambda }}}_2^{k-1}}{\rho _{k-1}}\right\rangle = {\tilde{U}}{\tilde{V}}^T, \end{aligned}$$
(43)

where \({\tilde{U}}{\tilde{{\varSigma }}}{\tilde{V}}^T = X^{k,j} + \gamma _3 Z^{k,j-1} + {\bar{{\varLambda }}}_2^{k-1} /\rho _{k-1}\) is the reduced SVD.

Applying Algorithm 2 to problems (25) and (26) is the same except that in the latter case we compute \(G^{k,j}\) as follows: let \({\varDelta }_G^{k,j-1}:=G^{k, j-1} - \frac{1}{\gamma _2}{\mathscr {B}}^T({\mathscr {A}}X^{k, j} + {\mathscr {B}}G^{k, j-1} + \frac{{\bar{{\varLambda }}}_1^{k-1}}{\rho _{k-1}})\); then

$$\begin{aligned} G^{k,j}_{i,:} = \max \left\{ 0,~ 1 - 1/\left( \gamma _2 \rho _{k-1}\Vert ({\varDelta }_G^{k,j-1})_{i,:}\Vert _2\right) \right\} ({\varDelta }_G^{k,j-1})_{i,:}, \end{aligned}$$
(44)

where the subscript \(i,:\) denotes the i-th row.
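As a hedged illustration, the row-wise group soft-thresholding in (44) can be sketched in Python/NumPy as follows; the function name group_shrink_rows and the small guard against zero rows are our own assumptions.

```python
import numpy as np

def group_shrink_rows(Delta, tau):
    """Row-wise group soft-thresholding as in (44):
    row i is scaled by max(0, 1 - tau / ||row i||_2)."""
    norms = np.linalg.norm(Delta, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-16))  # guard against zero rows
    return scale * Delta

# For (44), one would call this with tau = 1 / (gamma2 * rho_{k-1})
# and Delta = Delta_G^{k, j-1}.
```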

Summarizing the above procedure leads to the sparse ULDA algorithm SULDAAL and its group sparse variant, both outlined in Algorithm 5.

1.3 Sparse CCA


By introducing auxiliary variables \(P = W_x, Q = W_y\), and denoting \({\mathscr {X}} = \{P\ |\ P^TX^TXP = I\}, {\mathscr {Y}} = \{Q\ |\ Q^TY^TYQ = I\}\), we can reformulate (28) as

$$\begin{aligned}&\min _{W_x,~W_y,~P,~Q} \{-tr(W_x^TX^TYW_y) + \lambda _1\Vert W_x\Vert _1 + \lambda _2\Vert W_y\Vert _1 + \delta _{{\mathscr {X}}}(P) \\&\,\quad + \delta _{{\mathscr {Y}}}(Q) \ |\ W_x - P = 0, \ W_y - Q = 0\}. \end{aligned}$$

The scaled augmented Lagrangian function associated with (28) is

$$\begin{aligned} L_{\rho _{k-1}}(W_x, W_y, P, Q; {\bar{{\varLambda }}}^{k-1})&= \frac{\lambda _1}{\rho _{k-1}}\Vert W_x\Vert _1 + \frac{\lambda _2}{\rho _{k-1}}\Vert W_y\Vert _1 + \frac{1}{\rho _{k-1}}\delta _{{\mathscr {X}}}(P) + \frac{1}{\rho _{k-1}}\delta _{{\mathscr {Y}}}(Q) \\&\quad + H_k(W_x, W_y, P, Q), \end{aligned}$$

where

$$\begin{aligned} H_k(W_x, W_y, P, Q)&= -\frac{1}{\rho _{k-1}}tr(W_x^TX^TYW_y) + \left\langle \frac{{\bar{{\varLambda }}}_1^{k-1}}{\rho _{k-1}}, W_x - P\right\rangle + \left\langle \frac{{\bar{{\varLambda }}}_2^{k-1}}{\rho _{k-1}}, W_y - Q\right\rangle \\&\quad + \frac{1}{2}\Vert W_x - P\Vert _F^2 + \frac{1}{2}\Vert W_y - Q\Vert _F^2. \end{aligned}$$

Applying Algorithm 3 with \(B_1^{k,j-1} = \gamma _1 I, B_2^{k,j-1} = \gamma _2 I, B_3^{k,j} = \alpha ^kX^TX - I_{d_1} + \alpha ^k(I_{d_1} - U_1U_1^T)\) and \(B_4^{k,j} = \beta ^kY^TY - I_{d_2} + \beta ^k(I_{d_2} - V_1V_1^T)\), where \(U_1\) and \(V_1\) are obtained from the reduced SVD of X and Y, respectively, as in Eq. (29), we get the following updates of \((W_x^{k, j}, W_y^{k,j})\) for any fixed \(k\in {\mathbb {N}}\):

$$\begin{aligned} W_x^{k,j}&=\mathbf{shrink }\left( W_x^{k,j-1} - \frac{1}{\gamma _1}(-\frac{X^TYW_y^{k,j-1}}{\rho _{k-1}} + \frac{{\bar{{\varLambda }}}_1^{k-1}}{\rho _{k-1}} + W_x^{k,j-1} - P^{k,j-1}),~\frac{\lambda _1}{\gamma _1\rho _{k-1}}\right) , \end{aligned}$$
(45)
$$\begin{aligned} W_y^{k,j}&=\mathbf{shrink }\left( W_y^{k,j-1} - \frac{1}{\gamma _2}(-\frac{Y^TXW_x^{k,j-1}}{\rho _{k-1}} + \frac{{\bar{{\varLambda }}}_2^{k-1}}{\rho _{k-1}} + W_y^{k,j-1} - Q^{k,j-1}),~\frac{\lambda _2}{\gamma _2\rho _{k-1}}\right) . \end{aligned}$$
(46)

The resulting algorithm SCCAALN is outlined in Algorithm 6.


For problem (31), the associated scaled augmented Lagrangian function is

$$\begin{aligned} L_{\rho _{k-1}}(W_x, W_y, W; {\bar{{\varLambda }}}^{k-1}) = \frac{1}{\rho _{k-1}}\Vert W_x\Vert _1 + \frac{\lambda }{\rho _{k-1}}\Vert W_y\Vert _1 + \frac{1}{\rho _{k-1}} \delta _{{\mathscr {O}}}(W) + H_k(W_x, W_y, W), \end{aligned}$$

where

$$\begin{aligned} H_k(W_x, W_y, W)&= \left\langle \frac{{\bar{{\varLambda }}}_1^{k-1}}{\rho _{k-1}}, U_1^TW_x - {\varSigma }_1^{-1}P_1W\right\rangle + \left\langle \frac{{\bar{{\varLambda }}}_2^{k-1}}{\rho _{k-1}}, V_1^TW_y - {\varSigma }_2^{-1}P_2W\right\rangle \\&\quad + \frac{1}{2}\Vert U_1^TW_x - {\varSigma }_1^{-1}P_1W\Vert _F^2 + \frac{1}{2}\Vert V_1^TW_y - {\varSigma }_2^{-1}P_2W\Vert _F^2. \end{aligned}$$

Applying Algorithm 3 with \(B_1^{k,j-1} = \gamma _1 I, B_2^{k,j-1} = \gamma _2 I\) and \(B_3^{k,j} = \gamma _3 I\), we get the following updates of \((W_x^{k, j}, W_y^{k,j}, W^{k, j})\) for any fixed \(k\in {\mathbb {N}}\):

$$\begin{aligned} W_x^{k,j}&=\mathbf{shrink }\left( W_x^{k,j-1} - \frac{1}{\gamma _1}U_1 \left( \frac{{\bar{{\varLambda }}}_1^{k-1}}{\rho _{k-1}} + U_1^TW_x^{k,j-1} - {\varSigma }_1^{-1}P_1W^{k,j-1}\right) ,~\frac{1}{\gamma _1\rho _{k-1}}\right) , \end{aligned}$$
(47)
$$\begin{aligned} W_y^{k,j}&=\mathbf{shrink }\left( W_y^{k,j-1} - \frac{1}{\gamma _1}V_1 \left( \frac{{\bar{{\varLambda }}}_2^{k-1}}{\rho _{k-1}} + V_1^TW_y^{k,j-1} - {\varSigma }_2^{-1}P_2W^{k,j-1}\right) ,~\frac{\lambda }{\gamma _1\rho _{k-1}}\right) , \end{aligned}$$
(48)
$$\begin{aligned} W^{k, j}&=\arg \max _{W^TW = I} \bigg \langle W, ~ W^{k,j-1} + \frac{1}{\gamma _3}\left( P_1^T{\varSigma }_1^{-1}{\varDelta }_x^{k,j} + P_2^T{\varSigma }_2^{-1}{\varDelta }_y^{k,j}\right) \bigg \rangle = {\tilde{U}}{\tilde{V}}^T, \end{aligned}$$
(49)

where \({\varDelta }_x^{k,j} = \frac{{\bar{{\varLambda }}}_1^{k-1}}{\rho _{k-1}} + U_1^TW_x^{k, j}, {\varDelta }_y^{k,j} = \frac{{\bar{{\varLambda }}}_2^{k-1}}{\rho _{k-1}} + V_1^TW_y^{k, j}\) and \({\tilde{U}}{\tilde{{\varSigma }}}{\tilde{V}}^T = W^{k,j-1} + \frac{1}{\gamma _3}(P_1^T{\varSigma }_1^{-1}{\varDelta }_x^{k,j} + P_2^T{\varSigma }_2^{-1}{\varDelta }_y^{k,j})\) is the reduced SVD. The resulting algorithm WSCCAAL is outlined in Algorithm 7.


Cite this article

Zhu, H., Zhang, X., Chu, D. et al. Nonconvex and Nonsmooth Optimization with Generalized Orthogonality Constraints: An Approximate Augmented Lagrangian Method. J Sci Comput 72, 331–372 (2017). https://doi.org/10.1007/s10915-017-0359-1
