
Nonconvex and Nonsmooth Optimization with Generalized Orthogonality Constraints: An Approximate Augmented Lagrangian Method


Abstract

Nonconvex and nonsmooth optimization problems with linear equation and generalized orthogonality constraints have wide applications. These problems are difficult to solve due to the nonsmooth objective function and nonconvex constraints. In this paper, by introducing an extended proximal alternating linearized minimization (EPALM) method, we propose a framework based on the augmented Lagrangian scheme (EPALMAL). We also show that the EPALMAL method has global convergence in the sense that every bounded sequence generated by the EPALMAL method has at least one convergent subsequence that converges to a Karush–Kuhn–Tucker point of the original problem. Experiments on a variety of applications, including compressed modes and multivariate data analysis, demonstrate that the proposed method is noticeably efficient and achieves performance comparable with existing methods.



Notes

  1. We did not compare with the ADMM method, although its convergence for compact constraints has been established (e.g., [48]), since the only difference between the SOC method and the ADMM method is that the augmented penalty parameters in the latter are generally the same for different constraints.

  2. SULDA_admm and SULDA\(\_{\ell _1}\) can be found at http://web.bii.a-star.edu.sg/~zhangxw/Publications.html, SDA and GLOSS can be found at http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=5671 and http://www.hds.utc.fr/~grandval/dokuwiki/doku.php?id=en:code, respectively.

  3. Brain, Lymphoma and Srbct are obtained from http://stat.ethz.ch/~dettling/bagboost.html.

  4. ORL\(_{64\times 64}\) is from http://www.cl.cam.ac.uk/Reasearch/DTG/attarchive:pub/data/attfaces/tar.Z, and Palmprint is from http://www4.comp.polyu.edu.hk/~biometrics/.

  5. All data sets were downloaded from http://glaros.dtc.umn.edu/gkhome/cluto/cluto/download.

  6. The MATLAB implementation of SCCA with acceleration is available at http://web.bii.a-star.edu.sg/~zhangxw/Publications.html.

  7. http://www.isi.edu/natural-language/download/hansard/.

  8. http://www.statmt.org/europarl/.

References

1. Abrudan, T., Eriksson, J., Koivunen, V.: Steepest descent algorithms for optimization under unitary matrix constraint. IEEE Trans. Signal Process. 56(3), 1134–1147 (2008)

2. Abrudan, T., Eriksson, J., Koivunen, V.: Conjugate gradient algorithm for optimization under unitary matrix constraint. Signal Process. 89(9), 1704–1714 (2009)

3. Absil, P.A., Mahony, R., Sepulchre, R.: Optimization Algorithms on Matrix Manifolds. Princeton University Press, Princeton (2009)

4. Andreani, R., Birgin, E.G., Martínez, J.M., Schuverdt, M.L.: On augmented Lagrangian methods with general lower-level constraints. SIAM J. Optim. 18(4), 1286–1309 (2007)

5. Attouch, H., Bolte, J., Redont, P., Soubeyran, A.: Proximal alternating minimization and projection methods for nonconvex problems: an approach based on the Kurdyka–Łojasiewicz inequality. Math. Oper. Res. 35(2), 438–457 (2010)

6. Attouch, H., Bolte, J., Svaiter, B.F.: Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward–backward splitting, and regularized Gauss–Seidel methods. Math. Program. 137(1–2), 91–129 (2013)

7. Bertsekas, D.P.: Constrained Optimization and Lagrange Multiplier Methods. Academic Press, London (1982)

8. Bertsekas, D.P.: Nonlinear Programming. Athena Scientific, Belmont (1999)

9. Bolte, J., Sabach, S., Teboulle, M.: Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Math. Program. 146(1–2), 459–494 (2014)

10. Bradley, A.P.: The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit. 30(7), 1145–1159 (1997)

11. Chen, W.: Wavelet frames on the sphere, high angular resolution diffusion imaging and \(l_1\)-regularized optimization on Stiefel manifolds. Ph.D. thesis, The National University of Singapore (2015)

12. Chen, W., Ji, H., You, Y.: An augmented Lagrangian method for \(\ell _1\)-regularized optimization problems with orthogonality constraints. SIAM J. Sci. Comput. 38(4), B570–B592 (2016)

13. Chu, D., Liao, L.Z., Ng, M.K., Zhang, X.: Sparse canonical correlation analysis: new formulation and algorithm. IEEE Trans. Pattern Anal. Mach. Intell. 35(12), 3050–3065 (2013)

14. Chu, M.T., Trendafilov, N.T.: The orthogonally constrained regression revisited. J. Comput. Graph. Stat. 10(4), 746–771 (2001)

15. Clarke, F.H., Ledyaev, Y.S., Stern, R.J., Wolenski, P.R.: Nonsmooth Analysis and Control Theory, vol. 178. Springer, Berlin (2008)

16. Clemmensen, L., Hastie, T., Witten, D., Ersbøll, B.: Sparse discriminant analysis. Technometrics 53(4), 406–413 (2011)

17. Duda, R., Hart, P., Stork, D.: Pattern Classification. Wiley, New York (2000)

18. Edelman, A., Arias, T.A., Smith, S.T.: The geometry of algorithms with orthogonality constraints. SIAM J. Matrix Anal. Appl. 20(2), 303–353 (1998)

19. Fawcett, T.: An introduction to ROC analysis. Pattern Recognit. Lett. 27(8), 861–874 (2006)

20. Francisco, J.B., Martínez, J.M., Martínez, L., Pisnitchenko, F.: Inexact restoration method for minimization problems arising in electronic structure calculations. Comput. Optim. Appl. 50(3), 555–590 (2011)

21. Grubišić, I., Pietersz, R.: Efficient rank reduction of correlation matrices. Linear Algebra Appl. 422(2), 629–653 (2007)

22. Hardoon, D.R., Shawe-Taylor, J.: Sparse canonical correlation analysis. Mach. Learn. 83(3), 331–353 (2011)

23. Hestenes, M.R.: Multiplier and gradient methods. J. Optim. Theory Appl. 4(5), 303–320 (1969)

24. Hotelling, H.: Relations between two sets of variates. Biometrika 28(3/4), 321–377 (1936)

25. Howland, P., Jeon, M., Park, H.: Structure preserving dimension reduction for clustered text data based on the generalized singular value decomposition. SIAM J. Matrix Anal. Appl. 25, 165–179 (2003)

26. Jiang, B., Dai, Y.H.: A framework of constraint preserving update schemes for optimization on Stiefel manifold. Math. Program. 153(2), 535–575 (2015)

27. Koehn, P.: Europarl: a parallel corpus for statistical machine translation. In: MT Summit, vol. 5, pp. 79–86. Citeseer (2005)

28. Kokiopoulou, E., Chen, J., Saad, Y.: Trace optimization and eigenproblems in dimension reduction methods. Numer. Linear Algebra Appl. 18(3), 565–602 (2011)

29. Kovnatsky, A., Glashoff, K., Bronstein, M.M.: MADMM: a generic algorithm for non-smooth optimization on manifolds. arXiv preprint arXiv:1505.07676 (2015)

30. Kurdyka, K.: On gradients of functions definable in o-minimal structures. Annales de l’institut Fourier 48(3), 769–783 (1998)

31. Lai, R., Osher, S.: A splitting method for orthogonality constrained problems. J. Sci. Comput. 58(2), 431–449 (2014)

32. Li, G., Pong, T.K.: Global convergence of splitting methods for nonconvex composite optimization. SIAM J. Optim. 25(4), 2434–2460 (2015)

33. Lojasiewicz, S.: Une propriété topologique des sous-ensembles analytiques réels. Les équations aux dérivées partielles, pp. 87–89 (1963)

34. Lu, Z., Zhang, Y.: An augmented Lagrangian approach for sparse principal component analysis. Math. Program. 135(1–2), 149–193 (2012)

35. Merchante, L., Grandvalet, Y., Govaert, G.: An efficient approach to sparse linear discriminant analysis. In: Proceedings of the 29th International Conference on Machine Learning (2012)

36. Mordukhovich, B.S.: Variational Analysis and Generalized Differentiation I: Basic Theory, vol. 330. Springer, Berlin (2006)

37. Mordukhovich, B.S., Shao, Y.: On nonconvex subdifferential calculus in Banach spaces. J. Convex Anal. 2(1/2), 211–227 (1995)

38. Moreau, J.J.: Proximité et dualité dans un espace hilbertien. Bulletin de la Société Mathématique de France 93, 273–299 (1965)

39. Nishimori, Y., Akaho, S.: Learning algorithms utilizing quasi-geodesic flows on the Stiefel manifold. Neurocomputing 67, 106–135 (2005)

40. Ozoliņš, V., Lai, R., Caflisch, R., Osher, S.: Compressed modes for variational problems in mathematics and physics. Proc. Natl. Acad. Sci. 110(46), 18368–18373 (2013)

41. Powell, M.J.: A method for non-linear constraints in minimization problems. UKAEA (1967)

42. Rockafellar, R.T.: Augmented Lagrange multiplier functions and duality in nonconvex programming. SIAM J. Control 12(2), 268–285 (1974)

43. Rockafellar, R.T., Wets, R.J.B.: Variational Analysis, vol. 317. Springer, Berlin (2009)

44. Savas, B., Lim, L.H.: Quasi-Newton methods on Grassmannians and multilinear approximations of tensors. SIAM J. Sci. Comput. 32(6), 3352–3393 (2010)

45. Sriperumbudur, B.K., Torres, D.A., Lanckriet, G.R.: A majorization-minimization approach to the sparse generalized eigenvalue problem. Mach. Learn. 85(1–2), 3–39 (2011)

46. Vinokourov, A., Cristianini, N., Shawe-Taylor, J.S.: Inferring a semantic representation of text via cross-language correlation analysis. In: Advances in Neural Information Processing Systems, pp. 1473–1480 (2002)

47. Voorhees, E.M.: The sixth text retrieval conference (TREC-6). Inf. Process. Manag. 36(1), 1–2 (2000)

48. Wang, Y., Yin, W., Zeng, J.: Global convergence of ADMM in nonconvex nonsmooth optimization. arXiv preprint arXiv:1511.06324 (2015)

49. Wen, Z., Yang, C., Liu, X., Zhang, Y.: Trace-penalty minimization for large-scale eigenspace computation. J. Sci. Comput. 66, 1175–1203 (2016)

50. Wen, Z., Yin, W.: A feasible method for optimization with orthogonality constraints. Math. Program. 142(1–2), 397–434 (2013)

51. Witten, D.M., Tibshirani, R., Hastie, T.: A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics 10(3), 515–534 (2009)

52. Yang, C., Meza, J.C., Wang, L.W.: A trust region direct constrained minimization algorithm for the Kohn–Sham equation. SIAM J. Sci. Comput. 29(5), 1854–1875 (2007)

53. Yang, K., Cai, Z., Li, J., Lin, G.: A stable gene selection in microarray data analysis. BMC Bioinform. 7(1), 228 (2006)

54. Ye, J.: Characterization of a family of algorithms for generalized discriminant analysis on undersampled problems. J. Mach. Learn. Res. 6(4), 483–502 (2005)

55. Zhang, L., Li, R.: Maximization of the sum of the trace ratio on the Stiefel manifold, I: theory. Sci. China Math. 57(12), 2495–2508 (2014)

56. Zhang, L., Li, R.: Maximization of the sum of the trace ratio on the Stiefel manifold, II: computation. Sci. China Math. 58(7), 1549–1566 (2015)

57. Zhang, X.: Sparse dimensionality reduction methods: algorithms and applications. Ph.D. thesis, The National University of Singapore (2013)

58. Zhang, X., Chu, D.: Sparse uncorrelated linear discriminant analysis. In: Proceedings of the 30th International Conference on Machine Learning (ICML-13), pp. 45–52 (2013)

59. Zhang, X., Chu, D., Tan, R.C.: Sparse uncorrelated linear discriminant analysis for undersampled problems. IEEE Trans. Neural Netw. Learn. Syst. 27(7), 1469–1485 (2015)

60. Zhao, Y., Karypis, G.: Empirical and theoretical comparisons of selected criterion functions for document clustering. Mach. Learn. 55(3), 311–331 (2004)


Acknowledgements

The authors would like to thank the two anonymous referees for their valuable comments and suggestions. The work of L.-Z. Liao was supported in part by grants from Hong Kong Baptist University (FRG) and General Research Fund (GRF) of Hong Kong.


Corresponding author

Correspondence to Li-Zhi Liao.

Appendices

Appendix 1: Proof of Lemma 2

To prove Lemma 2, we need a few preliminary results regarding the limiting subdifferential of indicator functions. For any closed set \({\mathscr {X}}\), it is well known [37] that the limiting subdifferential of the indicator function \(\delta _{{\mathscr {X}}}\) at x is given by the normal cone to \({\mathscr {X}}\) at x, denoted by \(N_{{\mathscr {X}}}(x)\), that is,

$$\begin{aligned} \partial \delta _{{\mathscr {X}}}(x) = N_{{\mathscr {X}}}(x), \quad \forall x \in {\mathscr {X}}. \end{aligned}$$

Next, we consider the normal cone of two specific closed sets. Let \(\ell (X) = {\mathscr {A}}X : {\mathbb {R}}^{m\times n}\rightarrow {\mathbb {R}}^{\gamma \times n}\) be a linear mapping with \({\mathscr {A}} \in {\mathbb {R}}^{\gamma \times m}\), and define the closed set \({\mathscr {X}}:= \{X \in {\mathbb {R}}^{m\times n} \ | \ {\mathscr {A}}X = 0\}\). It is easy to show that

$$\begin{aligned} N_{{\mathscr {X}}}(X)=\{{\mathscr {A}}^T {\varGamma } ~|~ {\varGamma } \in {\mathbb {R}}^{\gamma \times n}\}. \end{aligned}$$
(32)

Let \({\mathscr {M}} = \{X\in {\mathbb {R}}^{n\times q} \ |\ X^TMX = I_q, M\in S_+^n\}\) be the set of matrices satisfying the generalized orthogonality constraints. It follows that, for any curve \(Y(t)\in {\mathscr {M}}\) with \(Y(0) = X\), \((Y'(t))^TMY(t) + Y(t)^TMY'(t) = 0\). Setting \(t = 0\) and noticing that \(Y'(0)\in T_{{\mathscr {M}}}(X)\), we have

$$\begin{aligned} \eta ^TMX + X^TM\eta = 0\quad \forall \ \eta \in T_{{\mathscr {M}}}(X). \end{aligned}$$

Let \(MX = {\bar{U}}{\bar{{\varSigma }}} {\bar{V}}^T\) be the reduced SVD of MX, and let \((MX)_\perp \) be a column orthogonal matrix such that \([{\bar{U}} \ (MX)_\perp ]\) is an orthogonal matrix. Then \(\eta \) can be written as

$$\begin{aligned} \eta = {\bar{U}}\eta _1 + (MX)_\perp \eta _2, \end{aligned}$$

where \(\eta _1^T{\bar{{\varSigma }}}{\bar{V}}^T + {\bar{V}}{\bar{{\varSigma }}}\eta _1 = 0\), which means \({\bar{V}}{\bar{{\varSigma }}}\eta _1\) is skew-symmetric. Moreover,

$$\begin{aligned} \eta= & {} {\bar{U}}\eta _1 + (MX)_\perp \eta _2\\= & {} {\bar{U}}{\bar{{\varSigma }}}^{-1}{\bar{V}}^T({\bar{V}}{\bar{{\varSigma }}}\eta _1) + (MX)_\perp \eta _2\\= & {} (X^TM)^\dagger ({\bar{V}}{\bar{{\varSigma }}}\eta _1) + (MX)_\perp \eta _2, \end{aligned}$$

where \((X^TM)^\dagger \) is the pseudo-inverse of \(X^TM\). Therefore, the tangent space of \({\mathscr {M}}\) is given by

$$\begin{aligned} T_{{\mathscr {M}}}(X) = \{(X^TM)^\dagger {\varOmega } + (MX)_\perp K \ | \ {\varOmega }^T + {\varOmega } = 0\}. \end{aligned}$$

In addition, any \(\zeta \in N_{{{\mathscr {M}}}}(X)\) can be written as

$$\begin{aligned} \zeta = {\bar{U}}\zeta _1 + (MX)_\perp \zeta _2 = (MX){\bar{V}}{\bar{{\varSigma }}}^{-1}\zeta _1 + (MX)_\perp \zeta _2. \end{aligned}$$

Since \(\left\langle \zeta , \eta \right\rangle = 0\) for any \(\eta \in T_{{\mathscr {M}}}(X)\), we have

$$\begin{aligned} \left\langle \eta _1, \zeta _1\right\rangle = 0\quad \text{ and } \quad \zeta _2 = 0. \end{aligned}$$

The first equality implies that

$$\begin{aligned} 0 = \left\langle \eta _1, \zeta _1\right\rangle = \left\langle {\bar{V}}{\bar{{\varSigma }}}\eta _1, {\bar{V}}{\bar{{\varSigma }}}^{-1}\zeta _1\right\rangle , \end{aligned}$$

and thus \({\bar{V}}{\bar{{\varSigma }}}^{-1}\zeta _1\) must be symmetric. Hence, the normal cone of \({\mathscr {M}}\) at X is given by

$$\begin{aligned} N_{{\mathscr {M}}}(X) = \{(MX)S\ |\ S\in S^{q}\}. \end{aligned}$$
(33)
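As a quick sanity check (not part of the original paper), the characterizations of \(T_{{\mathscr {M}}}(X)\) and \(N_{{\mathscr {M}}}(X)\) can be verified numerically. The following minimal Python/NumPy sketch, with illustrative variable names of our own choosing, builds a feasible pair \((M, X)\) with \(X^TMX = I_q\) and checks that a vector of the form \(MXS\) with \(S\) symmetric is orthogonal to a tangent vector of the form \((X^TM)^\dagger {\varOmega } + (MX)_\perp K\) with \({\varOmega }\) skew-symmetric.

```python
import numpy as np

rng = np.random.default_rng(0)
n, q = 8, 3

# Random symmetric positive definite M and a feasible X with X^T M X = I_q.
B = rng.standard_normal((n, n))
M = B @ B.T + n * np.eye(n)
A = rng.standard_normal((n, q))
L = np.linalg.cholesky(A.T @ M @ A)        # A^T M A = L L^T
X = A @ np.linalg.inv(L).T                 # then X^T M X = I_q

MX = M @ X
Ufull, _, _ = np.linalg.svd(MX, full_matrices=True)
MX_perp = Ufull[:, q:]                     # basis of the complement of range(MX)

# Tangent vector: eta = (X^T M)^+ Omega + (MX)_perp K, with Omega skew-symmetric.
W = rng.standard_normal((q, q)); Omega = W - W.T
K = rng.standard_normal((n - q, q))
eta = np.linalg.pinv(X.T @ M) @ Omega + MX_perp @ K

# Normal vector: zeta = (MX) S with S symmetric.
S = rng.standard_normal((q, q)); S = S + S.T
zeta = MX @ S

print(np.linalg.norm(X.T @ M @ X - np.eye(q)))   # ~ 0: feasibility
print(abs(np.trace(zeta.T @ eta)))               # ~ 0: <zeta, eta> = 0
```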

Now, we are ready to prove Lemma 2.

Proof

Equalities

$$\begin{aligned} AX^* + BY^* - C = 0, \ X^* - G^* = 0\ \text{ and } \ (G^*)^TMG^* = I_q \end{aligned}$$

hold since \((X^*, Y^*, G^*)\) is feasible as a local minimizer. For the convenience of analysis, we denote \(W := \begin{pmatrix} X\\ Y\\ G \end{pmatrix}\in {\mathbb {R}}^{(2n+m)\times q}\) and define \(g_1: {\mathbb {R}}^{(2n+m)\times q} \rightarrow {\mathbb {R}} ^{(l+ n)\times q}\) as

$$\begin{aligned} g_1(W) = \begin{pmatrix} A &{}\quad B &{}\quad 0\\ I &{}\quad 0 &{}\quad -I \end{pmatrix} \begin{pmatrix} X\\ Y\\ G \end{pmatrix}. \end{aligned}$$

Let \({\varOmega } = \{W\in {\mathbb {R}}^{(2n+m)\times q}\ | \ g_1(W) = 0\}\), then problem (9) is equivalent to

$$\begin{aligned} \min _W \ f(X) + g(Y) + h(X, Y)+ \delta _{{\mathscr {M}}}(G) + \delta _{{\varOmega }}(W). \end{aligned}$$

Since \((X^*, Y^*, G^*)\) is a local minimizer, by the generalized Fermat’s rule and subdifferentiability property [15, 43], we have

$$\begin{aligned} 0\in \begin{pmatrix} \partial f(X^*) + \partial _X h(X^*, Y^*)\\ \partial g(Y^*) + \partial _Y h(X^*, Y^*)\\ \partial \delta _{{\mathscr {M}}}(G^*) \end{pmatrix} + \partial \delta _{{\varOmega }}(W^*). \end{aligned}$$

As described at the beginning of this “Appendix 1”, the subdifferentials of the indicator functions \(\delta _{{\mathscr {M}}}\) and \(\delta _{{\varOmega }}\) are given by the normal cones (33) and (32), respectively. In particular,

$$\begin{aligned} \partial \delta _{{\mathscr {M}}}(G^*) = N_{{\mathscr {M}}}(G^*) = \{MG^*S\ | \ S\in S^q\}, \end{aligned}$$

and

$$\begin{aligned} \partial \delta _{{\varOmega }}(W^*) = N_{{\varOmega }}(W^*) = \left\{ \begin{pmatrix} A^T &{}\quad I\\ B^T &{}\quad 0\\ 0 &{}\quad -I \end{pmatrix} \begin{pmatrix} {\varLambda }_1\\ {\varLambda }_2 \end{pmatrix} | \ {\varLambda }_1\in {\mathbb {R}}^{l\times q}, \ {\varLambda }_2\in {\mathbb {R}}^{n\times q} \right\} . \end{aligned}$$

Therefore, there exist \(v^*\in \partial f(X^*), \ w^*\in \partial g(Y^*), \ {\varLambda }_1^*\in {\mathbb {R}}^{l\times q}, \ {\varLambda }_2^*\in {\mathbb {R}}^{n\times q}, \ {\varLambda }_3^*\in S^q\) such that

$$\begin{aligned} 0= & {} \begin{pmatrix} v^* + \partial _X h(X^*, Y^*)\\ w^* + \partial _Y h(X^*, Y^*)\\ 2MG^*{\varLambda }_3^* \end{pmatrix} + \begin{pmatrix} A^T &{}\quad I\\ B^T &{}\quad 0\\ 0 &{} -I \end{pmatrix} \begin{pmatrix} {\varLambda }_1^*\\ {\varLambda }_2^* \end{pmatrix} = \begin{pmatrix} v^* + \partial _X h(X^*, Y^*)\\ w^* + \partial _Y h(X^*, Y^*)\\ 0 \end{pmatrix}\\&+ \begin{pmatrix} A^T &{}\quad I &{}\quad 0\\ B^T &{}\quad 0 &{}\quad 0\\ 0 &{} -I &{} 2MG^* \end{pmatrix} \begin{pmatrix} {\varLambda }_1^* \\ {\varLambda }_2^* \\ {\varLambda }_3^* \end{pmatrix}, \end{aligned}$$

which proves equality (12). Moreover, it yields \({\varLambda }_2^* = 2MG^*{\varLambda }_3^*\). Substituting this into (12) and eliminating \(G^*\), we obtain equality (13). This completes the proof. \(\square \)

Appendix 2: Proof of Theorem 1

Proof

For any limit point \((X^*, Y^*, G^*)\) of the bounded sequence \(\{(X^k, Y^k, G^k)\}_{k\in {\mathbb {N}}}\), there exists an index set \({\mathscr {K}}\subset {\mathbb {N}}\) such that \(\{(X^k, Y^k, G^k)\}_{k\in {\mathscr {K}}}\) converges to \((X^*, Y^*, G^*)\). To prove that \((X^*, Y^*, G^*)\) is a KKT point, we first show that it is a feasible point. The equality \((G^*)^TMG^* = I_q\) is trivial to check since \((G^k)^TMG^k = I_q\) holds for any \(k\in {\mathbb {N}}\). If \(\{\rho _k\}\) is bounded, then by the updating rule of \(\rho _k\) in Algorithm 2, there exists a \(k_0\in {\mathbb {N}}\) such that

$$\begin{aligned} \Vert R_j^k\Vert _\infty \le \tau \Vert R_j^{k - 1}\Vert _\infty \quad \forall k\ge k_0, \ j = 1, 2. \end{aligned}$$

By the definition of \(R_j^k, j = 1, 2\), it holds that

$$\begin{aligned} \left\{ \begin{array}{ll} \Vert AX^k + BY^k - C\Vert _\infty \le \tau \Vert AX^{k - 1} + BY^{k - 1} - C\Vert _\infty , \\ \Vert X^k - G^k\Vert _\infty \le \tau \Vert X^{k - 1} - G^{k - 1}\Vert _\infty , \end{array} \right. \end{aligned}$$

for any \(k\ge k_0\). Thus

$$\begin{aligned} \left\{ \begin{array}{ll} AX^* + BY^* - C = 0, \\ X^* - G^* = 0. \end{array} \right. \end{aligned}$$

If \(\{\rho _k\}\) is unbounded, by the generalized Fermat rule, finding a solution satisfying the constraint (11) is equivalent to calculating a point \((X^k, Y^k, G^k)\) such that

$$\begin{aligned}&\left\| \begin{pmatrix} v^k + \partial _X h(X^k, Y^k)\\ w^k + \partial _Y h(X^k, Y^k)\\ 0 \end{pmatrix}/\rho _{k-1} + \begin{pmatrix} A^T &{}\quad I &{}\quad 0\\ B^T &{}\quad 0 &{}\quad 0\\ 0 &{}\quad -I &{}\quad 2MG^k \end{pmatrix} \begin{pmatrix} \frac{{\bar{{\varLambda }}}_1^{k-1}}{\rho _{k-1}} + (AX^k + BY^k -C) \\ \frac{{\bar{{\varLambda }}}_2^{k-1}}{\rho _{k-1}} + (X^k - G^k) \\ {\varLambda }_3^k/\rho _{k-1} \end{pmatrix} \right\| _\infty \nonumber \\&\quad \le \frac{\epsilon _{k-1}}{\rho _{k-1}}, \end{aligned}$$
(34)

for some \(v^k\in \partial f(X^k), w^k\in \partial g(Y^k), {\varLambda }_3^k \in {\mathscr {S}}^q\), and \(\epsilon _{k}\downarrow 0\) as \(k\rightarrow \infty \). Notice that \(\{{\bar{{\varLambda }}}_1^k\}\) and \(\{{\bar{{\varLambda }}}_2^k\}\) are bounded, and that \(\{v^k\}_{k\in {\mathscr {K}}}, \{w^k\}_{k\in {\mathscr {K}}}, \{\partial _X h(X^k, Y^k)\}\) and \(\{\partial _Y h(X^k, Y^k)\}\) are bounded under Assumption 1 (iii)–(iv). Letting \(k\in {\mathscr {K}}\) go to infinity, Eq. (34) implies that

$$\begin{aligned} \begin{pmatrix} A^T &{}\quad I\\ B^T &{}\quad 0 \end{pmatrix} \begin{pmatrix} AX^* + BY^* - C \\ X^* - G^* \end{pmatrix} =0. \end{aligned}$$

Recalling that B has full row rank, we get \(AX^* + BY^* - C = 0\) and \(X^* - G^* = 0\). Therefore, in both cases, \((X^*, Y^*, G^*)\) is a feasible point.

Next, we show that there exist \({\varLambda }_1^*\in {\mathbb {R}}^{l\times q}, {\varLambda }_2^*\in {\mathbb {R}}^{n\times q}\) and \({\varLambda }_3^*\in {\mathscr {S}}^q\) such that \((X^*, Y^*, G^*; {\varLambda }_1^*, {\varLambda }_2^*, {\varLambda }_3^*)\) satisfies (12). Since \(\{(X^k, Y^k, G^k)\}_{k\in {\mathbb {N}}}\) is bounded, there exists an index set \({\mathscr {K}}\subseteq {\mathbb {N}}\) such that \(\lim _{k\in {\mathscr {K}}}(X^k, Y^k, G^k) = (X^*, Y^*, G^*)\). Since \(\{v^k\}_{k\in {\mathscr {K}}}\) is bounded, there exists a subsequence \({\mathscr {K}}_2\subseteq {\mathscr {K}}\) such that \(\lim _{k\in {\mathscr {K}}_2}v^k = v^*\). Moreover, by the closedness property of the limiting subdifferential, we get

$$\begin{aligned} v^*\in \partial f(X^*). \end{aligned}$$

Similarly, there exists a subsequence \({\mathscr {K}}_3\subseteq {\mathscr {K}}_2\) such that \(\lim _{k\in {\mathscr {K}}_3}w^k = w^*\), and

$$\begin{aligned} w^*\in \partial g(Y^*). \end{aligned}$$

Combining (34) with the updating formulas of \({\bar{{\varLambda }}}_1^k\) and \({\bar{{\varLambda }}}_2^k\) in Step 2 of Algorithm 2, there exists a \(\xi ^k\) with \(\Vert \xi ^k\Vert _\infty \le \frac{\epsilon _{k-1}}{\rho _{k-1}}\) such that

$$\begin{aligned} \begin{pmatrix} A^T &{}\quad I &{}\quad 0\\ B^T &{}\quad 0 &{}\quad 0\\ 0 &{}\quad -I &{}\quad 2MG^k \end{pmatrix} \begin{pmatrix} {{\varLambda }}_1^{k} \\ {{\varLambda }}_2^{k} \\ {\varLambda }_3^k \end{pmatrix}/ \rho _{k-1} = \xi ^k - \begin{pmatrix} v^k + \partial _X h(X^k, Y^k)\\ w^k + \partial _Y h(X^k, Y^k)\\ 0 \end{pmatrix}/\rho _{k-1}. \end{aligned}$$
(35)

Define

$$\begin{aligned} {\varXi }^k = \begin{pmatrix} A^T &{}\quad I &{}\quad 0\\ B^T &{}\quad 0 &{}\quad 0\\ 0 &{}\quad -I &{}\quad 2MG^k \end{pmatrix} \in {\mathbb {R}}^{(2n+m)\times (l+n+q)}, \quad {\varUpsilon }^k = \begin{pmatrix} {\varLambda }_1^{k}\\ {\varLambda }_2^{k}\\ {\varLambda }_3^{k} \end{pmatrix}, \end{aligned}$$

we can rewrite (35) as

$$\begin{aligned} {\varXi }^k{\varUpsilon }^k = \rho _{k-1} \xi ^{k} - \begin{pmatrix} v^k + \partial _X h(X^k, Y^k)\\ w^k + \partial _Y h(X^k, Y^k)\\ 0 \end{pmatrix}. \end{aligned}$$

Since the columns of \({\varXi }^k\) are linearly independent, \(({\varXi }^k)^T({\varXi }^k)\) is nonsingular. Thus

$$\begin{aligned} {\varUpsilon }^k = (({\varXi }^k)^T{\varXi }^k)^{-1}({\varXi }^k)^T \left[ \rho _{k-1} \xi ^k - \begin{pmatrix} v^k + \partial _X h(X^k, Y^k)\\ w^k + \partial _Y h(X^k, Y^k)\\ 0 \end{pmatrix} \right] . \end{aligned}$$
(36)

Taking the limit in (36) as \(k\in {\mathscr {K}}_3\) goes to infinity, and noticing that \(\Vert \xi ^k\Vert _{\infty } \le \frac{\epsilon _{k-1}}{\rho _{k-1}}\) with \(\epsilon _k\downarrow 0\) as \(k\rightarrow \infty \), we have

$$\begin{aligned} \lim \limits _{k\in {\mathscr {K}}_3, k\rightarrow \infty }{\varUpsilon }^k = {\varUpsilon }^* := - (({\varXi }^*)^T{\varXi }^*)^{-1}({\varXi }^*)^T \begin{pmatrix} v^* + \partial _X h(X^*, Y^*)\\ w^* + \partial _Y h(X^*, Y^*)\\ 0 \end{pmatrix}, \end{aligned}$$

where

$$\begin{aligned} {\varXi }^* = \begin{pmatrix} A^T &{}\quad I &{}\quad 0 \\ B^T &{}\quad 0 &{}\quad 0 \\ 0 &{}\quad -I &{}\quad 2MG^* \end{pmatrix} \end{aligned}$$

has full column rank. From the definition of \({\varUpsilon }^k\), taking the limit as \(k\in {\mathscr {K}}_3\) goes to infinity on both sides of (35) yields

$$\begin{aligned} \begin{pmatrix} v^* + \partial _X h(X^*, Y^*)\\ w^* + \partial _Y h(X^*, Y^*)\\ 0 \end{pmatrix} + \begin{pmatrix} A^T &{}\quad I &{}\quad 0 \\ B^T &{}\quad 0 &{}\quad 0 \\ 0 &{}\quad -I &{}\quad 2MG^* \end{pmatrix} \begin{pmatrix} {\varLambda }_1^* \\ {\varLambda }_2^* \\ {\varLambda }_3^* \end{pmatrix}=0, \end{aligned}$$

where \({\varLambda }_3^*\in {\mathscr {S}}^q\) since \({\varLambda }_3^k\in {\mathscr {S}}^q\) for any \(k\in {\mathbb {N}}\). According to Lemma 2, \((X^*, Y^*, G^*)\) is a KKT point of problem (9). Moreover, \((X^*, Y^*)\) is a KKT point of problem (1). \(\square \)

Appendix 3: Proof of Proposition 1

Proof

In this proof, we assume k is fixed and, for simplicity of notation, we let \(W := \begin{pmatrix} X\\ Y\\ G \end{pmatrix}\) and \(L_k(W): = L_{\rho _{k-1}}(X, Y, G; {\bar{{\varLambda }}}^{k-1})\).

1) By the first-order optimality conditions of three subproblems in (15), there exist \(v^{k,j}\in \partial f_1^k(X^{k, j}), w^{k,j}\in \partial f_2^k(Y^{k, j}), \nu ^{k,j}\in \partial f_3^k(G^{k,j})\) such that

$$\begin{aligned} \left\{ \begin{array}{ll} 0 = v^{k,j} + \nabla _{X} H_k(X^{k,j-1}, Y^{k,j-1}, G^{k,j-1}) + B_1^{k,j-1}(X^{k,j}- X^{k,j-1}), \\ 0 = w^{k,j} + \nabla _{Y} H_k(X^{k,j}, Y^{k,j-1}, G^{k,j-1}) + B_2^{k,j-1}(Y^{k,j}- Y^{k,j-1}), \\ 0 = \nu ^{k,j} - {\bar{{\varLambda }}}_1^{k-1}/\rho _{k-1} + (G^{k, j} - X^{k, j}) + B_3^k(G^{k, j} - G^{k, j-1}). \end{array} \right. \end{aligned}$$

Combining the above relations with simple calculations, \(A^{k,j}\) defined in (19) has the following properties:

$$\begin{aligned} A_{X}^{k,j}= & {} \nabla _X H_k(W^{k,j}) - \nabla _X H_k(W^{k,j-1}) + B_1^{k,j-1}(X^{k, j-1} - X^{k,j})\\= & {} v^{k,j} +\nabla _{X} H_k(W^{k,j}) \in \partial _{X} L_k(W^{k,j}),\\ A_{Y}^{k,j}= & {} \nabla _Y H_k(W^{k,j}) - \nabla _Y H_k(X^{k,j}, Y^{k,j-1}, G^{k,j-1}) + B_2^{k,j-1}(Y^{k, j-1} - Y^{k,j}) \\= & {} w^{k,j} + \nabla _{Y} H_k(W^{k,j}) \in \partial _{Y} L_k(W^{k,j}),\\ A_G^{k,j}= & {} B_3^k(G^{k, j-1} - G^{k, j}) = \nu ^{k,j} - {\bar{{\varLambda }}}_1^{k-1}/\rho _{k-1} + (G^{k,j} - X^{k,j}) \\= & {} \nu ^{k,j} + \nabla _G H_k(W^{k,j}) \in \partial _G L_k(W^{k,j}). \end{aligned}$$

Since \(H_k(W)\) is continuously differentiable, by subdifferentiability property [5, 43],

$$\begin{aligned} \partial L_k(W) = \partial _{X} L_k(W) \times \partial _{Y} L_k(W) \times \partial _G L_k(W), \end{aligned}$$

which implies

$$\begin{aligned} A^{k,j} \in \partial L_k(W^{k,j}) = \partial L_{\rho _{k-1}}(X^{k,j}, Y^{k,j}, G^{k,j}; {\bar{{\varLambda }}}^{k-1})\quad \forall j\in {\mathbb {N}}. \end{aligned}$$

To show \(\Vert A^{k,j}\Vert _\infty \rightarrow 0\), it suffices to show that \(L_k(W)\) satisfies the conditions in Assumption 2 and apply Lemma 1 c).

We first show that the sequence \(\{W^{k,j}\}_{j\in {\mathbb {N}}}\) generated by Algorithm 3 is bounded; we argue by contradiction. Notice that \({\tilde{L}}_k(W) = \rho _{k-1}L_k(W)\) is a coercive function, since \(\phi (X, Y) + \frac{\rho _0}{2}\Vert AX + BY - C\Vert _F^2\) is coercive by assumption and \(\{\rho _k\}_{k\in {\mathbb {N}}}\) is non-decreasing, \(\delta _{{\mathscr {M}}}(G)\) is coercive, and \(\frac{\rho _{k-1}}{2}\Vert X - G + {\bar{{\varLambda }}}_2^{k-1}\Vert _F^2 - \frac{1}{2\rho _{k-1}}\Vert {\bar{{\varLambda }}}^{k-1}\Vert _F^2\) is bounded from below. Suppose \(\lim _{j\rightarrow \infty }\Vert W^{k,j}\Vert _\infty = +\infty \); then there must hold

$$\begin{aligned} \lim _{j\rightarrow \infty } {\tilde{L}}_{k}(W^{k,j}) = +\infty . \end{aligned}$$

On the other hand, we know from Lemma 1 a) that \(\{L_{k}(W^{k,j})\}_{j\in {\mathbb {N}}}\) is a decreasing sequence, thus \(\{{\tilde{L}}_{k}(W^{k,j})\}_{j\in {\mathbb {N}}}\) is non-increasing, which implies

$$\begin{aligned} \lim _{j\rightarrow \infty } {\tilde{L}}_{k}(W^{k,j}) \le {\tilde{L}}_{k}(W^{k,0}) < +\infty . \end{aligned}$$

This is a contradiction; hence \(\{W^{k,j}\}_{j\in {\mathbb {N}}}\) is bounded.

Now, we verify that \(L_k(W)\) satisfies the conditions in Assumption 2. By the definitions of \(L_k(W), H_k(W), f_i^k, i = 1, 2, 3\) and Assumption 1 (i), it is easy to see that for any given \({\bar{{\varLambda }}}^{k-1}\) and \(\rho _{k-1}\), the following results hold:

  1. (a)

\(f_i^k, i = 1, 2, 3\), are proper and lower semicontinuous functions satisfying \(\inf f_i^k > -\infty \); \(H_k\) is a \(C^1\) function, and

    $$\begin{aligned} H_k(W)&= \frac{1}{\rho _{k-1}} h(X,Y) + \frac{1}{2}\Vert AX + BY - C + {\bar{{\varLambda }}}_1^{k-1} / \rho _{k-1}\Vert _F^2 \nonumber \\&\quad + \frac{1}{2}\Vert X - G + {\bar{{\varLambda }}}_2^{k-1}/\rho _{k-1}\Vert _F^2 - \frac{1}{2\rho _{k-1}^2}\Vert {\bar{{\varLambda }}}^{k-1}\Vert _F^2. \end{aligned}$$
    (37)

    Thus \(\inf _{W} H_k(W) \ge \frac{1}{\rho _{k-1}}\min _{X,Y}h(X,Y) - \frac{1}{2(\rho _{k-1})^2}\Vert {\bar{{\varLambda }}}^{k-1}\Vert _F^2 > -\infty \).

  2. (b)

Since \(H_k\) is a quadratic function with respect to \(G\), \(\nabla _G H_k\) is obviously Lipschitz continuous. Regarding the Lipschitz continuity of the partial derivatives of \(H_k\) with respect to X and Y, we have

    $$\begin{aligned} \Vert \nabla _{X} H_k(X, Y^{k,j-1}, G^{k,j-1}) - \nabla _{X} H_k({\tilde{X}}, Y^{k,j-1}, G^{k,j-1})\Vert \le L_1^{k,j-1} \Vert X - {\tilde{X}}\Vert , \end{aligned}$$

    where \(L_1^{k,j-1} = \frac{L_1(Y^{k,j-1})}{\rho _{k-1}} + \Vert A^TA\Vert + 1\), and

    $$\begin{aligned} \Vert \nabla _{Y} H_k(X^{k,j}, Y, G^{k,j-1}) - \nabla _{Y} H_k(X^{k,j}, {\tilde{Y}}, G^{k,j-1})\Vert \le L_2^{k,j-1}\Vert Y - {\tilde{Y}}\Vert , \end{aligned}$$

    where \(L_2^{k,j-1} = \frac{L_2(X^{k,j})}{\rho _{k-1}} + \Vert B^TB\Vert + 1\). In addition, let \({\bar{L}}_1 = \sup _{Y}L_1(Y)\) and \({\bar{L}}_2 = \sup _XL_2(X)\), then the boundedness of \(\{W^{k,j}\}_{j\in {\mathbb {N}}}\) and Assumption 1 (iii) imply that \({\bar{L}}_1 < \infty \) and \({\bar{L}}_2 < \infty \), and we have

    $$\begin{aligned}&\Vert A^TA\Vert + 1 \le L_1^{k, j-1} \le {\bar{L}}_1/\rho _0 + \Vert A^TA\Vert + 1,\quad \\&\quad \Vert B^TB\Vert + 1 \le L_2^{k, j-1} \le {\bar{L}}_2/\rho _0 + \Vert B^TB\Vert + 1, \end{aligned}$$

    which imply

    $$\begin{aligned} \inf _j\{L_1^{k, j-1}\} \ge \Vert A^TA\Vert + 1> -\infty , \quad \inf _j\{L_2^{k, j-1}\} \ge \Vert B^TB\Vert + 1 > -\infty , \end{aligned}$$

    and

    $$\begin{aligned}&\sup _j\left\{ L_1^{k, j-1}\right\} \le {\bar{L}}_1/\rho _0 + \Vert A^TA\Vert + 1< +\infty , \quad \sup _j\left\{ L_2^{k, j-1}\right\} \\&\quad \le {\bar{L}}_2/\rho _0 + \Vert B^TB\Vert + 1 < +\infty . \end{aligned}$$

    Moreover, Assumption 2 (iii) holds by the definition of \(B_i^{k, j-1}, i = 1, 2\) and \(B_3^k\). Thus, Assumption 2 (i)–(iii) hold.

  3. (c)

Assumption 2 (iv) holds since h(X, Y) satisfies Assumption 1 (iii).

2) From the proof of 1), we know that \(\{W^{k,j}\}_{j\in {\mathbb {N}}}\) is bounded. It then remains to show that \(L_k(W)\) is a K–L function and to apply Lemma 1 d). Notice that

$$\begin{aligned} L_k(W)&= \frac{1}{\rho _{k-1}} \phi (X,Y) + \frac{1}{\rho _{k-1}}\delta _{{\mathscr {M}}}(G) + \frac{1}{2}\Vert AX + BY - C + {\bar{{\varLambda }}}_1^{k-1} / \rho _{k-1}\Vert _F^2 \nonumber \\&\quad + \frac{1}{2}\Vert X - G + {\bar{{\varLambda }}}_2^{k-1}/\rho _{k-1}\Vert _F^2 - \frac{1}{2\rho _{k-1}^2}\Vert {\bar{{\varLambda }}}^{k-1}\Vert _F^2, \end{aligned}$$
(38)

i.e., \(L_k(W)\) satisfies the K–L property as a finite sum of functions satisfying the K–L property, so the result holds directly. \(\square \)

Appendix 4: Outline of Algorithms for CMs, Sparse ULDA and Sparse CCA

In this section, we give the detailed derivations of the algorithms for CMs, sparse ULDA and sparse CCA used in Sect. 5.

1.1 Compressed Modes

The scaled augmented Lagrangian function associated with (21) is

$$\begin{aligned} L_{\rho _{k-1}}({\varPsi }, X; {\bar{{\varLambda }}}^{k-1}) = \frac{1}{\kappa \rho _{k-1}}\Vert {\varPsi }\Vert _1 + \frac{1}{\rho _{k-1}}\delta _{{\mathscr {X}}}(X) + H_k({\varPsi }, X), \end{aligned}$$

where \({\mathscr {X}} = St(n,N)\) and

$$\begin{aligned} H_k({\varPsi }, X) = \frac{1}{\rho _{k-1}}tr({\varPsi }^TH{\varPsi }) + \langle {\bar{{\varLambda }}}^{k-1}/\rho _{k-1}, {\varPsi } - X\rangle + \frac{1}{2}\Vert {\varPsi } - X\Vert _F^2. \end{aligned}$$

Applying Algorithm 3 with \(B_1^{k,j-1} = \gamma _1^k I\) and \(B_2^{k,j-1} = \gamma _2 I\), we get the following updates of \(({\varPsi }^{k, j}, X^{k, j})\) for any fixed \(k\in {\mathbb {N}}\):

$$\begin{aligned} {\varPsi }^{k,j}&=\mathbf{shrink }\left( {\varPsi }^{k, j-1} - \frac{1}{\gamma _1}\left( \frac{2}{\rho _{k-1}}H{\varPsi }^{k,j-1} + \frac{{\bar{{\varLambda }}}^{k-1}}{\rho _{k-1}} + {\varPsi }^{k,j-1} - X^{k,j-1}\right) ,~\frac{1}{\gamma _1^k \kappa \rho _{k-1}}\right) , \end{aligned}$$
(39)
$$\begin{aligned} X^{k, j}&=\arg \max _{X^TX = I} \left\langle X, ~ {\varPsi }^{k,j} + \frac{{\bar{{\varLambda }}}^{k-1}}{\rho _{k-1}} + (\gamma _2 - 1) X^{k,j-1} \right\rangle = {\tilde{U}}{\tilde{V}}^T, \end{aligned}$$
(40)

where \(\mathbf{shrink }(x, \eta ) = \text{ sign }(x)\odot \max \{|x| - \eta , 0\}\) is the soft-shrinkage operator, \(\odot \) denotes the component-wise product, and \({\tilde{U}}{\tilde{{\varSigma }}}{\tilde{V}}^T = {\varPsi }^{k,j} + \frac{{\bar{{\varLambda }}}^{k-1}}{\rho _{k-1}} + (\gamma _2 - 1) X^{k,j-1}\) is the reduced SVD. The SVD form of the maximizer in (40) follows from the fact that, for any X with \(X^TX = I\), \(\langle X, {\tilde{U}}{\tilde{{\varSigma }}}{\tilde{V}}^T\rangle = tr({\tilde{{\varSigma }}}{\tilde{U}}^TX{\tilde{V}})\le tr({\tilde{{\varSigma }}})\), with equality at \(X = {\tilde{U}}{\tilde{V}}^T\). An outline of the algorithm is given in Algorithm 4.
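For concreteness, a minimal Python/NumPy sketch of one inner iteration (39)–(40) is given below. It is not the authors' implementation; the function and variable names (shrink, cm_inner_step, Psi, X, H, Lam_bar, rho, kappa, gamma1, gamma2) are our own illustrative assumptions, with H standing for the discretized Hamiltonian in (21) and Lam_bar for the current multiplier estimate.

```python
import numpy as np

def shrink(x, eta):
    """Soft-shrinkage operator: sign(x) * max(|x| - eta, 0), applied entrywise."""
    return np.sign(x) * np.maximum(np.abs(x) - eta, 0.0)

def cm_inner_step(Psi, X, H, Lam_bar, rho, kappa, gamma1, gamma2):
    """One inner iteration for compressed modes, following (39)-(40)."""
    # Psi update (39): linearized gradient step on H_k followed by soft-shrinkage.
    grad_Psi = (2.0 / rho) * (H @ Psi) + Lam_bar / rho + (Psi - X)
    Psi_new = shrink(Psi - grad_Psi / gamma1, 1.0 / (gamma1 * kappa * rho))
    # X update (40): projection onto the Stiefel manifold via the reduced SVD.
    T = Psi_new + Lam_bar / rho + (gamma2 - 1.0) * X
    U, _, Vt = np.linalg.svd(T, full_matrices=False)
    return Psi_new, U @ Vt
```

Iterating this step until the change in \(({\varPsi }^{k,j}, X^{k,j})\) is small corresponds to the inner loop of Algorithm 4 for a fixed k.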


1.2 Sparse ULDA


By introducing the auxiliary variable \(Z = X\), the scaled augmented Lagrangian function associated with (25) is

$$\begin{aligned} L_{\rho _{k-1}}(X, G, Z; {\bar{{\varLambda }}}^{k-1}) = \frac{1}{\rho _{k-1}}\Vert G\Vert _1 + \frac{1}{\rho _{k-1}}\delta _{{\mathscr {O}}}(Z) + H_k(X, G, Z), \end{aligned}$$

where \({\mathscr {O}}\) denotes the set of orthogonal matrices and

$$\begin{aligned} H_k(X, G, Z)= & {} \left\langle {\bar{{\varLambda }}}_1^{k-1}/\rho _{k-1},~{\mathscr {A}}X + {\mathscr {B}}G \right\rangle + \frac{1}{2}\Vert {\mathscr {A}}X + {\mathscr {B}}G\Vert _F^2 \\&+\, \left\langle {\bar{{\varLambda }}}_2^{k-1}/\rho _{k-1}, X - Z\right\rangle + \frac{1}{2}\Vert X - Z\Vert _F^2. \end{aligned}$$

Applying Algorithm 3 with \(B_1^{k,j-1} = \gamma _1 I, B_2^{k,j-1} = \gamma _2 I\) and \(B_3^{k} = \gamma _3 I\), we get the following updates of \((X^{k, j}, G^{k, j}, Z^{k, j})\) for any fixed \(k\in {\mathbb {N}}\):

$$\begin{aligned} X^{k, j}&=X^{k, j-1} - \frac{1}{\gamma _1}\left[ {\mathscr {A}}^T\left( {\mathscr {A}}X^{k, j-1} + {\mathscr {B}}G^{k, j-1} + \frac{{\bar{{\varLambda }}}_1^{k-1}}{\rho _{k-1}}\right) + X^{k, j-1} - Z^{k, j-1} + \frac{{\bar{{\varLambda }}}_2^{k-1}}{\rho _{k-1}}\right] , \end{aligned}$$
(41)
$$\begin{aligned} G^{k,j}&=\mathbf{shrink }\left( G^{k, j-1} - \frac{1}{\gamma _2}{\mathscr {B}}^T({\mathscr {A}}X^{k, j} + {\mathscr {B}}G^{k, j-1} + \frac{{\bar{{\varLambda }}}_1^{k-1}}{\rho _{k-1}}),~\frac{1}{\gamma _2\rho _{k-1}}\right) , \end{aligned}$$
(42)
$$\begin{aligned} Z^{k, j}&=\arg \max _{Z^TZ = I} \left\langle Z, ~ X^{k,j} + \gamma _3 Z^{k,j-1} + \frac{{\bar{{\varLambda }}}_2^{k-1}}{\rho _{k-1}}\right\rangle = {\tilde{U}}{\tilde{V}}^T, \end{aligned}$$
(43)

where \({\tilde{U}}{\tilde{{\varSigma }}}{\tilde{V}}^T = X^{k,j} + \gamma _3 Z^{k,j-1} + {\bar{{\varLambda }}}_2^{k-1} /\rho _{k-1}\) is the reduced SVD.

Applying Algorithm 2 to problems (25) and (26) is the same except that in the latter case we compute \(G^{k,j}\) as follows: let \({\varDelta }_G^{k,j-1}:=G^{k, j-1} - \frac{1}{\gamma _2}{\mathscr {B}}^T({\mathscr {A}}X^{k, j} + {\mathscr {B}}G^{k, j-1} + \frac{{\bar{{\varLambda }}}_1^{k-1}}{\rho _{k-1}})\); then

$$\begin{aligned} G^{k,j}_{i,:} = \max \left\{ 0,~ 1 - 1/\left( \gamma _2 \rho _{k-1}\Vert ({\varDelta }_G^{k,j-1})_{i,:}\Vert _2\right) \right\} ({\varDelta }_G^{k,j-1})_{i,:}, \end{aligned}$$
(44)

where the subscript \(i,:\) denotes the i-th row.
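As a hedged illustration, the row-wise group soft-thresholding in (44) can be sketched in Python/NumPy as follows; the function name group_shrink_rows and the small guard against zero rows are our own assumptions.

```python
import numpy as np

def group_shrink_rows(Delta, tau):
    """Row-wise group soft-thresholding as in (44):
    row i is scaled by max(0, 1 - tau / ||row i||_2)."""
    norms = np.linalg.norm(Delta, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-16))  # guard against zero rows
    return scale * Delta

# For (44), one would call this with tau = 1 / (gamma2 * rho_{k-1})
# and Delta = Delta_G^{k, j-1}.
```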

Summarizing the above procedure leads to the sparse ULDA algorithm SULDAAL and its group sparse variant, both outlined in Algorithm 5.

1.3 Sparse CCA


By introducing auxiliary variables \(P = W_x, Q = W_y\), and denoting \({\mathscr {X}} = \{P\ |\ P^TX^TXP = I\}, {\mathscr {Y}} = \{Q\ |\ Q^TY^TYQ = I\}\), we can reformulate (28) as

$$\begin{aligned}&\min _{W_x,~W_y,~P,~Q} \{-tr(W_x^TX^TYW_y) + \lambda _1\Vert W_x\Vert _1 + \lambda _2\Vert W_y\Vert _1 + \delta _{{\mathscr {X}}}(P) \\&\,\quad + \delta _{{\mathscr {Y}}}(Q) \ |\ W_x - P = 0, \ W_y - Q = 0\}. \end{aligned}$$

The scaled augmented Lagrangian function associated with (28) is

$$\begin{aligned} L_{\rho _{k-1}}(W_x, W_y, P, Q; {\bar{{\varLambda }}}^{k-1})&= \frac{\lambda _1}{\rho _{k-1}}\Vert W_x\Vert _1 + \frac{\lambda _2}{\rho _{k-1}}\Vert W_y\Vert _1 + \frac{1}{\rho _{k-1}}\delta _{{\mathscr {X}}}(P) + \frac{1}{\rho _{k-1}}\delta _{{\mathscr {Y}}}(Q) \\&\quad + H_k(W_x, W_y, P, Q), \end{aligned}$$

where

$$\begin{aligned} H_k(W_x, W_y, P, Q)&= -\frac{1}{\rho _{k-1}}tr(W_x^TX^TYW_y) + \left\langle \frac{{\bar{{\varLambda }}}_1^{k-1}}{\rho _{k-1}}, W_x - P\right\rangle + \left\langle \frac{{\bar{{\varLambda }}}_2^{k-1}}{\rho _{k-1}}, W_y - Q\right\rangle \\&\quad + \frac{1}{2}\Vert W_x - P\Vert _F^2 + \frac{1}{2}\Vert W_y - Q\Vert _F^2. \end{aligned}$$

Applying Algorithm 3 with \(B_1^{k,j-1} = \gamma _1 I, B_2^{k,j-1} = \gamma _2 I, B_3^{k,j} = \alpha ^kX^TX - I_{d_1} + \alpha ^k(I_{d_1} - U_1U_1^T)\) and \(B_4^{k,j} = \beta ^kY^TY - I_{d_2} + \beta ^k(I_{d_2} - V_1V_1^T)\), where \(U_1\) and \(V_1\) are obtained from the reduced SVD of X and Y, respectively, as in Eq. (29), we get the following updates of \((W_x^{k, j}, W_y^{k,j})\) for any fixed \(k\in {\mathbb {N}}\):

$$\begin{aligned} W_x^{k,j}&=\mathbf{shrink }\left( W_x^{k,j-1} - \frac{1}{\gamma _1}(-\frac{X^TYW_y^{k,j-1}}{\rho _{k-1}} + \frac{{\bar{{\varLambda }}}_1^{k-1}}{\rho _{k-1}} + W_x^{k,j-1} - P^{k,j-1}),~\frac{\lambda _1}{\gamma _1\rho _{k-1}}\right) , \end{aligned}$$
(45)
$$\begin{aligned} W_y^{k,j}&=\mathbf{shrink }\left( W_y^{k,j-1} - \frac{1}{\gamma _2}(-\frac{Y^TXW_x^{k,j-1}}{\rho _{k-1}} + \frac{{\bar{{\varLambda }}}_2^{k-1}}{\rho _{k-1}} + W_y^{k,j-1} - Q^{k,j-1}),~\frac{\lambda _2}{\gamma _2\rho _{k-1}}\right) . \end{aligned}$$
(46)

The resulting algorithm SCCAALN is outlined in Algorithm 6.


For problem (31), the associated scaled augmented Lagrangian function is

$$\begin{aligned} L_{\rho _{k-1}}(W_x, W_y, W; {\bar{{\varLambda }}}^{k-1}) = \frac{1}{\rho _{k-1}}\Vert W_x\Vert _1 + \frac{\lambda }{\rho _{k-1}}\Vert W_y\Vert _1 + \frac{1}{\rho _{k-1}} \delta _{{\mathscr {O}}}(W) + H_k(W_x, W_y, W), \end{aligned}$$

where

$$\begin{aligned} H_k(W_x, W_y, W)&= \left\langle \frac{{\bar{{\varLambda }}}_1^{k-1}}{\rho _{k-1}}, U_1^TW_x - {\varSigma }_1^{-1}P_1W\right\rangle + \left\langle \frac{{\bar{{\varLambda }}}_2^{k-1}}{\rho _{k-1}}, V_1^TW_y - {\varSigma }_2^{-1}P_2W\right\rangle \\&\quad + \frac{1}{2}\Vert U_1^TW_x - {\varSigma }_1^{-1}P_1W\Vert _F^2 + \frac{1}{2}\Vert V_1^TW_y - {\varSigma }_2^{-1}P_2W\Vert _F^2. \end{aligned}$$

Applying Algorithm 3 with \(B_1^{k,j-1} = \gamma _1 I, B_2^{k,j-1} = \gamma _2 I\) and \(B_3^{k,j} = \gamma _3 I\), we get the following updates of \((W_x^{k, j}, W_y^{k,j}, W^{k, j})\) for any fixed \(k\in {\mathbb {N}}\):

$$\begin{aligned} W_x^{k,j}&=\mathbf{shrink }\left( W_x^{k,j-1} - \frac{1}{\gamma _1}U_1 \left( \frac{{\bar{{\varLambda }}}_1^{k-1}}{\rho _{k-1}} + U_1^TW_x^{k,j-1} - {\varSigma }_1^{-1}P_1W^{k,j-1}\right) ,~\frac{1}{\gamma _1\rho _{k-1}}\right) , \end{aligned}$$
(47)
$$\begin{aligned} W_y^{k,j}&=\mathbf{shrink }\left( W_y^{k,j-1} - \frac{1}{\gamma _1}V_1 \left( \frac{{\bar{{\varLambda }}}_2^{k-1}}{\rho _{k-1}} + V_1^TW_y^{k,j-1} - {\varSigma }_2^{-1}P_2W^{k,j-1}\right) ,~\frac{\lambda }{\gamma _1\rho _{k-1}}\right) , \end{aligned}$$
(48)
$$\begin{aligned} W^{k, j}&=\arg \max _{W^TW = I} \bigg \langle W, ~ W^{k,j-1} + \frac{1}{\gamma _3}\left( P_1^T{\varSigma }_1^{-1}{\varDelta }_x^{k,j} + P_2^T{\varSigma }_2^{-1}{\varDelta }_y^{k,j}\right) \bigg \rangle = {\tilde{U}}{\tilde{V}}^T, \end{aligned}$$
(49)

where \({\varDelta }_x^{k,j} = \frac{{\bar{{\varLambda }}}_1^{k-1}}{\rho _{k-1}} + U_1^TW_x^{k, j}, {\varDelta }_y^{k,j} = \frac{{\bar{{\varLambda }}}_2^{k-1}}{\rho _{k-1}} + V_1^TW_y^{k, j}\) and \({\tilde{U}}{\tilde{{\varSigma }}}{\tilde{V}}^T = W^{k,j-1} + \frac{1}{\gamma _3}(P_1^T{\varSigma }_1^{-1}{\varDelta }_x^{k,j} + P_2^T{\varSigma }_2^{-1}{\varDelta }_y^{k,j})\) is the reduced SVD. The resulting algorithm WSCCAAL is outlined in Algorithm 7.


Cite this article

Zhu, H., Zhang, X., Chu, D. et al. Nonconvex and Nonsmooth Optimization with Generalized Orthogonality Constraints: An Approximate Augmented Lagrangian Method. J Sci Comput 72, 331–372 (2017). https://doi.org/10.1007/s10915-017-0359-1
