
A framework of regularized low-rank matrix models for regression and classification

  • Original Paper
  • Statistics and Computing

Abstract

While matrix-covariate regression models have been studied in many existing works, classical statistical and computational methods for estimating the regression coefficients are severely challenged by high-dimensional matrix-valued covariates. To address these issues, this paper proposes a framework of matrix-covariate regression models based on a low-rank constraint and an additional regularization term for structured signals, covering both continuous and binary responses. We propose an efficient Riemannian-steepest-descent algorithm for estimating the regression coefficients. We prove that the proposed estimator is consistent with an estimation error of order \(O(\sqrt{r(q+m)+p}/\sqrt{n})\), where r is the rank, \(q\times m\) is the dimension of the coefficient matrix, and p is the dimension of the coefficient vector. When the rank r is small, this rate improves over \(O(\sqrt{qm+p}/\sqrt{n})\), the rate of the existing work (Li et al. in Electron J Stat 15:1909-1950, 2021) that does not apply a rank constraint. In addition, we prove that all accumulation points of the iterates have similar estimation errors asymptotically, substantially attaining the minimax rate. We validate the proposed method through a simulated dataset of two-dimensional shape images and two real datasets of brain signals and microscopic leucorrhea images.


References

  • Absil, P.A., Mahony, R., Sepulchre, R.: Optimization Algorithms on Matrix Manifolds. Princeton University Press, Princeton (2009)


  • Absil, P.-A., Oseledets, I.V.: Low-rank retractions: a survey and new results. Comput. Optim. Appl. 62(1), 5–29 (2015). https://doi.org/10.1007/s10589-014-9714-4


  • Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imag. Sci. 2(1), 183–202 (2009)


  • Boumal, N.: On intrinsic Cramér–Rao bounds for Riemannian submanifolds and quotient manifolds. IEEE Trans. Signal Process. 61(7), 1809–1821 (2013). https://doi.org/10.1109/TSP.2013.2242068



  • Boumal, N., Mishra, B., Absil, P.-A., Sepulchre, R.: Manopt, a Matlab toolbox for optimization on manifolds. J. Mach. Learn. Res. 15(42), 1455–1459 (2014)


  • Campbell, N.A.: Robust procedures in multivariate analysis I: robust covariance estimation. J. Roy. Stat. Soc.: Ser. C (Appl. Stat.) 29(3), 231–237 (1980)


  • Candès, E.J., Recht, B.: Exact matrix completion via convex optimization. Found. Comput. Math. 9(6), 717–772 (2009)


  • Candes, E.J., Tao, T.: Decoding by linear programming. IEEE Trans. Inf. Theory 51(12), 4203–4215 (2005)


  • Caruana, R.: Multitask learning. Mach. Learn. 28(1), 41–75 (1997)


  • Chen, H., Guo, Y., He, Y., Ji, J., Liu, L., Shi, Y., Wang, Y., Yu, L., Zhang, X., Initiative, A.D.N., et al.: Simultaneous differential network analysis and classification for matrix-variate data with application to brain connectivity. Biostatistics 23(3), 967–89 (2021)


  • Choi, Y., Taylor, J., Tibshirani, R.: Selecting the number of principal components: estimation of the true rank of a noisy matrix. Ann. Stat. 45(6), 2590–2617 (2017)


  • Donoho, D.L.: Compressed sensing. IEEE Trans. Inf. Theory 52(4), 1289–1306 (2006)


  • Elsener, A., Geer, S.: Robust low-rank matrix estimation. Ann. Stat. 46(6B), 3481–3509 (2018)


  • Epstein, C.: American Clinical Neurophysiology Society Guideline 5: guidelines for standard electrode position nomenclature. J. Clin. Neurophysiol. 23(2), 107–110 (2006)


  • Fan, J., Wang, W., Zhu, Z.: A shrinkage principle for heavy-tailed data: high-dimensional robust low-rank matrix recovery. Ann. Stat. 49(3), 1239–1266 (2021). https://doi.org/10.1214/20-AOS1980


  • Hao, R., Wang, X., Zhang, J., Liu, J., Du, X., Liu, L.: Automatic detection of fungi in microscopic leucorrhea images based on convolutional neural network and morphological method. In: 2019 IEEE 3rd Information Technology, Networking, Electronic and Automation Control Conference (ITNEC), IEEE, pp. 2491–2494, (2019)

  • Huang, H.-H., Zhang, T.: Robust discriminant analysis using multi-directional projection pursuit. Pattern Recogn. Lett. 138, 651–656 (2020)


  • Huber, P.J.: Robust estimation of a location parameter. Ann. Math. Stat. 35(1), 73–101 (1964). https://doi.org/10.1214/aoms/1177703732


  • Hung, H., Jou, Z.-Y.: A low rank-based estimation-testing procedure for matrix-covariate regression. Stat. Sin. 29(2), 1025–1046 (2019)


  • Hung, H., Wang, C.-C.: Matrix variate logistic regression model with application to EEG data. Biostatistics 14(1), 189–202 (2012). https://doi.org/10.1093/biostatistics/kxs023


  • Koltchinskii, V., Lounici, K., Tsybakov, A.B.: Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. Ann. Stat. 39(5), 2302–2329 (2011)


  • Le Cam, L.: Maximum likelihood: an introduction. Int. Stat. Rev. 58(2), 153–171 (1990)


  • Li, M., Kong, L., Su, Z.: Double fused lasso regularized regression with both matrix and vector valued predictors. Electron. J. Stat. 15(1), 1909–1950 (2021)


  • Lu, Z., Monteiro, R.D., Yuan, M.: Convex optimization methods for dimension reduction and coefficient estimation in multivariate linear regression. Math. Program. 131(1–2), 163–194 (2012)


  • Luo, Y., Huang, W., Li, X., Zhang, A.R.: Recursive importance sketching for rank constrained least squares: algorithms and high-order convergence. arXiv preprint arXiv:2011.08360 (2020)

  • Maronna, R.A., Martin, R.D., Yohai, V.J., Salibián-Barrera, M.: Robust Statistics: Theory and Methods (with R). Wiley Series in Probability and Statistics. Wiley, Hoboken (2018)


  • Maurer, A., Pontil, M.: Concentration inequalities under sub-gaussian and sub-exponential conditions. In: Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W. (eds.) Advances in Neural Information Processing Systems (2021). https://openreview.net/forum?id=WJPAqX5M-2

  • Negahban, S., Wainwright, M.J.: Estimation of (near) low-rank matrices with noise and high-dimensional scaling. Ann. Stat. 39(2), 1069–1097 (2011). https://doi.org/10.1214/10-AOS850


  • Recht, B., Fazel, M., Parrilo, P.A.: Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Rev. 52(3), 471–501 (2010). https://doi.org/10.1137/070697835


  • Rohde, A., Tsybakov, A.B.: Estimation of high-dimensional low-rank matrices. Ann. Stat. 39(2), 887–930 (2011). https://doi.org/10.1214/10-AOS860


  • She, Y., Chen, K.: Robust reduced-rank regression. Biometrika 104(3), 633–647 (2017)


  • She, Y., Wang, Z., Jin, J.: Analysis of generalized Bregman surrogate algorithms for nonsmooth nonconvex statistical learning. Ann. Stat. 49(6), 3434–3459 (2021)


  • Tibshirani, R.: Regression shrinkage and selection via the lasso. J. Roy. Stat. Soc.: Ser. B (Methodol.) 58(1), 267–288 (1996)


  • Tsybakov, A.B.: Introduction to Nonparametric Estimation. Springer Series in Statistics. Springer, New York (2008)


  • Vandereycken, B.: Low-rank matrix completion by Riemannian optimization. SIAM J. Optim. 23(2), 1214–1236 (2013)


  • Wainwright, M.J.: High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge (2019). https://doi.org/10.1017/9781108627771

  • Wang, X., Zhu, H., Initiative, A.D.N.: Generalized scalar-on-image regression models via total variation. J. Am. Stat. Assoc. 112(519), 1156–1168 (2017)


  • Zhang, T., Yang, Y.: Robust PCA by manifold optimization. J. Mach. Learn. Res. 19(1), 3101–3139 (2018)

  • Zhang, X.L., Begleiter, H., Porjesz, B., Wang, W., Litke, A.: Event related potentials during object recognition tasks. Brain Res. Bull. 38(6), 531–538 (1995)


  • Zhou, H., Li, L.: Regularized matrix regression. J. Roy. Stat. Soc.: Ser. B (Stat. Methodol.) 76(2), 463–483 (2014)



Acknowledgements

The authors would like to thank the Editor, the Associate Editor and the reviewers for their constructive and insightful comments that greatly improved the manuscript. This work was partially supported by NSF grants (DMS-1924792, DMS-2318925 and CNS-1818500).

Author information


Contributions

H-HH: conceptualization, methodology, formal analysis, investigation, writing-original draft preparation, writing-review, supervision. FY: conceptualization, methodology, formal analysis, investigation, writing review, programming and numerical results. XF: programming and numerical results. TZ: conceptualization, methodology, formal analysis, investigation, writing-review, supervision. All authors have read and agreed to the published version of the manuscript.

Corresponding author

Correspondence to Teng Zhang.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A: Technical proofs

A.1 Proof of Proposition 1

Proof

(a) By the line search rule, we have that \(F(\textbf{C}^{(\textrm{iter}+1)}, \gamma ^{(\textrm{iter}+1)})\le F(\textbf{C}^{(\textrm{iter})},\gamma ^{(\textrm{iter})})\) for all \(\textrm{iter}\ge 1\). Since F is bounded below, the limit \(\lim _{\textrm{iter}\rightarrow \infty }F(\textbf{C}^{(\textrm{iter})},\gamma ^{(\textrm{iter})})\) exists. Suppose \((\tilde{\textbf{C}},\tilde{\gamma })\) is an accumulation point of the sequence \((\textbf{C}^{(\textrm{iter})},\gamma ^{(\textrm{iter})})\); then the line search rule implies that \(\frac{\partial }{\partial \gamma } F(\tilde{\textbf{C}},\tilde{\gamma })=0\) and \(\frac{\partial }{\partial \textbf{C}} F(\tilde{\textbf{C}},\tilde{\gamma })=0\).

(b) The proof follows (Absil et al. 2009, Theorem 4.3.1). \(\square \)
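For concreteness, here is a minimal sketch of one monotone line-search step of the kind analyzed above, using a truncated-SVD retraction onto the fixed-rank constraint; the loss and gradient callables, the step-size constants, and the plain (unprojected) gradient move are illustrative assumptions rather than the paper's Algorithm 1.

```python
import numpy as np

def svd_retract(A, r):
    """Retract a matrix onto the rank-r set by keeping its r leading singular triplets."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

def descent_step(C, gamma, loss, grad, r, alpha0=1.0, beta=0.5, max_backtracks=30):
    """One backtracking step for F(C, gamma); `loss` and `grad` are user-supplied,
    with grad(C, gamma) returning the Euclidean gradients (G_C, g_gamma)."""
    G_C, g_gamma = grad(C, gamma)
    F0 = loss(C, gamma)
    alpha = alpha0
    for _ in range(max_backtracks):
        C_new = svd_retract(C - alpha * G_C, r)   # gradient move followed by retraction
        gamma_new = gamma - alpha * g_gamma       # the vector part is unconstrained
        if loss(C_new, gamma_new) <= F0:          # accept only non-increasing objective values,
            return C_new, gamma_new               # matching the monotonicity used in part (a)
        alpha *= beta                             # otherwise shrink the step and retry
    return C, gamma                               # no acceptable step found; keep the iterate
```

Accepted steps never increase the objective, which is exactly the property invoked in part (a) to guarantee that \(F(\textbf{C}^{(\textrm{iter})},\gamma ^{(\textrm{iter})})\) converges.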

A.2 Proof of Lemma 5

Proof

For any \(\textbf{x}\) in a neighborhood of \(\textbf{x}^*\), decompose \(\textbf{x}-\textbf{x}^*\) uniquely as \(\textbf{x}-\textbf{x}^*=\textbf{x}^{(1)}+\textbf{x}^{(2)}\) with \(\textbf{x}^{(1)}\in T_{\textbf{x}^*}({\mathcal M})\) and \(\textbf{x}^{(2)}\in T_{\textbf{x}^*,\perp }({\mathcal M})\). Let \(b=\Vert \textbf{x}-\textbf{x}^*\Vert \). If \(b\le c_0\), then \(\Vert \textbf{x}^{(1)}\Vert \le b\) and, by the curvedness of \({\mathcal M}\) at \(\textbf{x}^*\), \(\Vert \textbf{x}^{(2)}\Vert \le C_Tb^2\).

Let \(\textbf{v}=\frac{\textbf{x}-\textbf{x}^*}{\Vert \textbf{x}-\textbf{x}^*\Vert }\) be the direction from \(\textbf{x}^*\) to \(\textbf{x}\), then

$$\begin{aligned} f(\textbf{x})-f(\textbf{x}^*)&=\int _{\textbf{x}^*}^{\textbf{x}}\langle \textbf{v},\nabla f(\textbf{t})\rangle {\mathrm d}\textbf{t}\\&=\langle \textbf{x}-\textbf{x}^*, \nabla f(\textbf{x}^*) \rangle +\int _{\textbf{x}^*}^{\textbf{x}}\langle \textbf{v},\nabla f(\textbf{t})-\nabla f(\textbf{x}^*)\rangle {\mathrm d}\textbf{t}, \end{aligned}$$

where the first term can be bounded in absolute value by

$$\begin{aligned}&|\langle \textbf{x}-\textbf{x}^*, \nabla f(\textbf{x}^*) \rangle | =|\langle \textbf{x}^{(1)},\Pi _{T_{\textbf{x}^*}({\mathcal M})}\nabla f(\textbf{x}^*) \rangle +\langle \textbf{x}^{(2)},\Pi _{T_{\textbf{x}^*}({\mathcal M}),\perp }\nabla f(\textbf{x}^*) \rangle |\\&\quad \le b \Vert \Pi _{T_{\textbf{x}^*}({\mathcal M})}\nabla f(\textbf{x}^*)\Vert +C_Tb^2\Vert \Pi _{T_{\textbf{x}^*}({\mathcal M}),\perp }\nabla f(\textbf{x}^*)\Vert . \end{aligned}$$

On the other hand, the second term can be bounded by

$$\begin{aligned} \int _{\textbf{x}^*}^{\textbf{x}}\langle \textbf{v},\nabla f(\textbf{t})-\nabla f(\textbf{x}^*)\rangle {\mathrm d}\textbf{t}\ge \frac{1}{2}b^2C_{H,1}. \end{aligned}$$

Combining these two inequalities, the lemma is proved.

\(\square \)

A.3 Technical Assumptions

We first present a few conditions for establishing the model consistency of the proposed estimator in (2.4).

Assumption 1

There exists a positive constant \(C_1>0\) such that

$$\begin{aligned} \frac{1}{n}\Big \Vert \sum _{i=1}^n \textrm{vec}(\textbf{X}_i,\textbf{z}_i)\textrm{vec}(\textbf{X}_i,\textbf{z}_i)^T\Big \Vert \le C_1, \end{aligned}$$

where \(\textrm{vec}(\textbf{X}_i,\textbf{z}_i)\in \mathbb {R}^{qm+p}\) is a vector consisting of the elements in \(\textbf{X}_i\in \mathbb {R}^{m\times q}\) and \(\textbf{z}_i\in \mathbb {R}^p\).
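As a quick numerical illustration (not from the paper), the constant \(C_1\) can be estimated for a given design by forming the matrix in Assumption 1 directly; the Gaussian design and the problem sizes below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, q, m, p = 500, 8, 10, 5

# Each column is vec(X_i, z_i): the entries of X_i stacked with z_i, length qm + p.
V = np.column_stack([
    np.concatenate([rng.standard_normal((q, m)).ravel(), rng.standard_normal(p)])
    for _ in range(n)
])

S = (V @ V.T) / n                # (1/n) * sum_i vec(X_i, z_i) vec(X_i, z_i)^T
C1_hat = np.linalg.norm(S, 2)    # spectral norm, which Assumption 1 requires to be bounded
print(C1_hat)
```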

Assumption 2

For \(\textbf{H}(\textbf{C},\gamma )\in \mathbb {R}^{(qm+p)\times (qm+p)}\) defined by

$$\begin{aligned} \textbf{H}(\textbf{C},\gamma )=\sum _{i=1}^nw_{2,i}\textrm{vec}(\textbf{X}_i,\textbf{z}_i)\textrm{vec}(\textbf{X}_i,\textbf{z}_i)^T, \end{aligned}$$
(A1)

where

$$\begin{aligned} w_{2,i}={\left\{ \begin{array}{ll}2,&{}\text {for the ordinary matrix-covariate regression model,}\\ \frac{e^{\langle \textbf{X}_i,\textbf{C}\rangle +\textbf{z}_i^T\gamma }}{(1+e^{\langle \textbf{X}_i,\textbf{C}\rangle +\textbf{z}_i^T\gamma })^2}, &{}\text {for the logistic matrix-covariate regression model},\end{array}\right. } \end{aligned}$$

and \(\frac{1}{n}\textbf{H}(\textbf{C},\gamma )\) is positive definite with eigenvalues bounded from below for all \((\textbf{C},\gamma )\) in a neighborhood of \((\textbf{C}^*,\gamma ^*)\). Specifically, there exist positive constants \(C_2>0\) and \(c_0\le \sigma _r(\textbf{C}^*)/2\) such that \( \frac{1}{n}\lambda _{\min }\Big (\textbf{H}(\textbf{C},\gamma )\Big )\ge C_2 \) for all \((\textbf{C},\gamma )\) such that

$$\begin{aligned} \textrm{dist}((\textbf{C},\gamma ),(\textbf{C}^*,\gamma ^*))=\sqrt{\Vert \textbf{C}-\textbf{C}^*\Vert _F^2+\Vert \gamma -\gamma ^*\Vert ^2}\le c_0. \end{aligned}$$

Assumptions 1 and 2 can be considered generalized versions of the restricted isometry property (RIP) (Candes and Tao 2005) used in Recht et al. (2010), and are comparable to (Li et al. 2021, Conditions 1, 2, 5, 6). In particular, Assumption 1 ensures that the sensing vectors \(\textrm{vec}(\textbf{X}_i,\textbf{z}_i)\in \mathbb {R}^{qm+p}\) are not identical or nearly aligned; it is satisfied if the sensing vectors are sampled from a distribution that is spread relatively uniformly over all directions. Assumption 2 ensures that \(\textbf{H}\), the Hessian matrix of \(\sum _{i=1}^n l(y_i,\langle \textbf{X}_i,\textbf{C}\rangle +\gamma ^T\textbf{z}_i)\), is nonsingular.
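The matrix \(\textbf{H}(\textbf{C},\gamma )\) in (A1) can be formed explicitly to check Assumption 2 numerically; the sketch below uses the logistic weights \(w_{2,i}\) with illustrative random data (the sizes and the current point are assumptions, not the paper's settings).

```python
import numpy as np

rng = np.random.default_rng(1)
n, q, m, p = 400, 6, 7, 4
X = rng.standard_normal((n, q, m))                 # matrix covariates X_i
Z = rng.standard_normal((n, p))                    # vector covariates z_i
C = rng.standard_normal((q, m)) / np.sqrt(q * m)   # an illustrative current (C, gamma)
gamma = rng.standard_normal(p) / np.sqrt(p)

eta = np.einsum('iab,ab->i', X, C) + Z @ gamma     # <X_i, C> + z_i^T gamma
w2 = np.exp(eta) / (1.0 + np.exp(eta)) ** 2        # logistic weights w_{2,i} from (A1)

V = np.hstack([X.reshape(n, -1), Z])               # row i is vec(X_i, z_i)^T
H = (V * w2[:, None]).T @ V                        # H(C, gamma) = sum_i w_{2,i} vec_i vec_i^T
print(np.linalg.eigvalsh(H / n).min())             # Assumption 2 asks this to stay >= C_2 > 0
```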

Remark 2

Assumptions 1 and 2 are reasonable when \(n>O(qm+p)\) and \((\textbf{X}_i,\textbf{z}_i)\) are sampled from a distribution that does not concentrate around certain directions. For example, if \(\textrm{vec}(\textbf{X}_i,\textbf{z}_i)\) are i.i.d. sampled from \(N(0,\textbf{I})\), then standard concentration of measure results (Wainwright 2019) imply that Assumption 1 holds with high probability when \(n=O(qm+p)\); since \(\textbf{H}=2\sum _{i=1}^n \textrm{vec}(\textbf{X}_i,\textbf{z}_i)\textrm{vec}(\textbf{X}_i,\textbf{z}_i)^T\) for the matrix-covariate regression model, Assumption 2 holds for that model as well. For the logistic regression model, Assumption 2 holds as long as \(a_i=\langle \textbf{X}_i,\textbf{C}\rangle +\textbf{z}_i^T\gamma \) is bounded above for most indices i, since \(w_{2,i}\ge 1/(2e^{a_i})\).

Assumption 3

(Assumption on the noise for the matrix-covariate regression model.) For the matrix-covariate regression model, the errors \(\epsilon _i\) in Eq. (2.1) are independent and identically distributed (i.i.d.) sub-Gaussian with zero mean and unit variance, i.e., \(\textrm{E}(\epsilon _i )= 0\) and \(\textrm{Var}(\epsilon _i )=1\). This limits the influence of outliers, since the model is not designed to handle them.

Assumption 4

There exists a positive constant \(C_3>0\) such that \(\frac{1}{n}\Big \Vert \textbf{H}(\textbf{C},\gamma )\Big \Vert \le C_3\) for all \((\textbf{C},\gamma )\) with \(\textrm{dist}((\textbf{C},\gamma ),(\textbf{C}^*,\gamma ^*))=\sqrt{\Vert \textbf{C}-\textbf{C}^*\Vert _F^2+\Vert \gamma -\gamma ^*\Vert ^2}\le c_0\). Combined with Assumption 2, this implies that \(\textbf{H}\) is well conditioned. Hence, Assumption 4 can also be considered a generalized version of the restricted isometry condition in Recht et al. (2010), comparable to Conditions 2 and 6 of Li et al. (2021).

Assumption 4 is reasonable when \(n>O(qm+p)\) and \((\textbf{X}_i,\textbf{z}_i)\) are sampled from a distribution that does not concentrate around certain directions. For example, if \(\textrm{vec}(\textbf{X}_i,\textbf{z}_i)\) are i.i.d. sampled from \(N(0,\textbf{I})\), then standard concentration of measure results (Wainwright 2019) imply that Assumption 4 holds with high probability for all of the models considered, since the weights \(w_{2,i}\) are bounded above. With Assumption 4, we obtain the convergence result for Algorithm 1: with a good initialization, all accumulation points have estimation errors converging to zero as \(n\rightarrow \infty \).

We will need the following assumptions:

Assumption 5

There exists \(C_{upper}>0\) such that for all \((\textbf{C},\gamma )\in \mathcal {A}(r,a)\),

$$\begin{aligned} \sum _{i=1}^n\Big (\langle \textbf{X}_i,\textbf{C}\rangle +\gamma ^T\textbf{z}_i\Big )^2\le C_{upper}n\Vert \textrm{vec}(\textbf{C},\gamma )\Vert ^2. \end{aligned}$$

Assumption 6

There exists \(c_{\epsilon }>0\) such that for all \(x\in \mathbb {R}\), \( KL(P_{\epsilon ,0},P_{\epsilon ,x})\le c_{\epsilon }x^2, \) where \(P_{\epsilon ,x}\) represents the distribution of \(\epsilon _i+x\) for the matrix-covariate regression model with continuous responses, and \(P_{\epsilon ,x}\) represents the Bernoulli distribution with parameter x for the logistic regression model with binary responses. Here KL denotes the Kullback–Leibler divergence: for distributions P and Q of a continuous random variable with probability density functions p(x) and q(x), it is defined as the integral \(KL(P,Q)=\int _{x}p(x)\log (p(x)/q(x)){\mathrm d}x\).

Assumption 5 is less restrictive than Assumption 1, as it only needs to hold for \((\textbf{C},\gamma )\in \mathcal {A}(r,a)\). Assumption 6 holds under both the matrix-covariate logistic regression model for binary responses and the matrix-covariate regression model for continuous responses; in particular, it holds for zero-mean, symmetric distributions with tails decaying no faster than Gaussian, including the Gaussian, exponential, Cauchy, Bernoulli, and Student's t distributions.
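As a concrete instance of Assumption 6 (a special case worked out here for illustration, not a general argument), take the continuous-response model with standard Gaussian noise, so that \(P_{\epsilon ,0}=N(0,1)\) and \(P_{\epsilon ,x}=N(x,1)\) with density \(\phi \):

$$\begin{aligned} KL(P_{\epsilon ,0},P_{\epsilon ,x})=\int \phi (t)\log \frac{\phi (t)}{\phi (t-x)}\,{\mathrm d}t =\int \phi (t)\,\frac{(t-x)^2-t^2}{2}\,{\mathrm d}t =\frac{x^2}{2}-x\,\textrm{E}(t)=\frac{x^2}{2}, \end{aligned}$$

since \(\textrm{E}(t)=0\) under \(N(0,1)\); hence Assumption 6 holds with \(c_{\epsilon }=1/2\) in this special case.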

Fig. 6: A visualization of the manifold \({\mathcal M}\), two points \(\textbf{x},\textbf{x}^*\in {\mathcal M}\), the tangent space \(T_{\textbf{x}^*}({\mathcal M})\), and the projectors \(\Pi _{T_{\textbf{x}^*}({\mathcal M})}\) and \(\Pi _{T_{\textbf{x}^*}({\mathcal M}),\perp }\)

A.4 Sketch of the Proof of Theorem 2

We start with the intuition of the proof for a function \(f: \mathbb {R}^p\rightarrow \mathbb {R}\). To show that f has a local minimizer around \(\textbf{x}^*\), it is sufficient to show that the gradient \(\nabla f(\textbf{x}^*)\approx 0\) and that the Hessian matrix of \(f(\textbf{x})\), \(\textbf{H}(\textbf{x})\), is positive definite with eigenvalues strictly larger than some constant \(c>0\). The intuition follows from the Taylor expansion

$$\begin{aligned} f(\textbf{x})&\approx f(\textbf{x}^*)+(\textbf{x}-\textbf{x}^*)^T\nabla f(\textbf{x}^*) + \frac{1}{2}(\textbf{x}-\textbf{x}^*)^T \textbf{H}(\textbf{x}^*)(\textbf{x}-\textbf{x}^*)\nonumber \\&\ge f(\textbf{x}^*)+(\textbf{x}-\textbf{x}^*)^T\nabla f(\textbf{x}^*) + \frac{c}{2}\Vert \textbf{x}-\textbf{x}^*\Vert ^2. \end{aligned}$$
(A2)

As a result, there is a local minimizer in the neighborhood of \(\textbf{x}^*\) with radius \(\frac{\Vert \nabla f(\textbf{x}^*)\Vert }{c}\), i.e., in \(B\left( \textbf{x}^*,\frac{\Vert \nabla f(\textbf{x}^*)\Vert }{c}\right) \). To extend this argument to (2.4), the main obstacle is the nonlinear constraint in the optimization problem. To address this issue, we treat the constraint set in (2.4) as a manifold and generalize the "second-order Taylor expansion" in (A2) to functions defined on a manifold. With this generalized Taylor expansion, a similar strategy can be applied to prove that the minimizer of (2.4) is close to \((\textbf{C}^*,\gamma ^*)\).
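As a concrete check of this radius (a worked special case added here for illustration), consider the exactly quadratic function \(f(\textbf{x})=f(\textbf{x}^*)+(\textbf{x}-\textbf{x}^*)^T\textbf{g}+\frac{c}{2}\Vert \textbf{x}-\textbf{x}^*\Vert ^2\) with \(\textbf{g}=\nabla f(\textbf{x}^*)\). Setting the gradient to zero gives

$$\begin{aligned} \nabla f(\hat{\textbf{x}})=\textbf{g}+c(\hat{\textbf{x}}-\textbf{x}^*)=0 \quad \Longrightarrow \quad \hat{\textbf{x}}=\textbf{x}^*-\textbf{g}/c, \qquad \Vert \hat{\textbf{x}}-\textbf{x}^*\Vert =\frac{\Vert \nabla f(\textbf{x}^*)\Vert }{c}, \end{aligned}$$

so the minimizer lies at distance exactly \(\Vert \nabla f(\textbf{x}^*)\Vert /c\) from \(\textbf{x}^*\), consistent with the radius claimed above.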

To analyze functions defined on manifolds, we introduce a few additional notations. We assume a manifold \({\mathcal M}\subset \mathbb {R}^p\) and a function \(f:\mathbb {R}^p\rightarrow \mathbb {R}\), and investigate \(f(\textbf{x})\) for \(\textbf{x}\in B(\textbf{x}^*,r)\cap {\mathcal M}\), i.e., a local neighborhood of \(\textbf{x}^*\) on the manifold \({\mathcal M}\). We denote the first and second derivatives of \(f(\textbf{x})\) by \(\nabla f(\textbf{x})\in \mathbb {R}^{p}\) and \(\textbf{H}(\textbf{x})\in \mathbb {R}^{p\times p}\) respectively, the tangent plane of \({\mathcal M}\) at \(\textbf{x}^*\) by \(T_{\textbf{x}^*}({\mathcal M})\), and let \(\Pi _{T_{\textbf{x}^*}({\mathcal M})}\) and \(\Pi _{T_{\textbf{x}^*}({\mathcal M}),\perp }\) be the projectors to \(T_{\textbf{x}^*}({\mathcal M})\) and its orthogonal subspace respectively. These definitions are visualized in Fig. 6.

Then, we say that a manifold \({\mathcal M}\) is curved with parameter \((c_0,C_T)\) at \(\textbf{x}^*\in {\mathcal M}\) if for any \(\textbf{x}\in B(\textbf{x}^*,c_0)\cap {\mathcal M}\), we have \(\Vert \Pi _{T_{\textbf{x}^*}({\mathcal M}),\perp }(\textbf{x}-\textbf{x}^*)\Vert \le C_T\Vert \Pi _{T_{\textbf{x}^*}({\mathcal M})}(\textbf{x}-\textbf{x}^*)\Vert ^2\). Intuitively, this means that the projection of \(\textbf{x}-\textbf{x}^*\) onto the tangent space \(T_{\textbf{x}^*}({\mathcal M})\) has a larger magnitude than the projection onto the orthogonal complement of the tangent space (see Fig. 6). We remark that a larger \(C_T\) means that the manifold \({\mathcal M}\) is more "curved" around \(\textbf{x}^*\). Lemma 5 then establishes a lower bound on f based on local properties such as the first and second derivatives of f at \(\textbf{x}^*\), the tangent space of \({\mathcal M}\) around \(\textbf{x}^*\), and the curvature parameters \((c_0,C_T)\).

Lemma 5

Consider a d-dimensional manifold \({\mathcal M}\subset \mathbb {R}^p\) and a function \(f: \mathbb {R}^p\rightarrow \mathbb {R}\). Define \(C_{H,1}= \min _{\textbf{x}\in B(\textbf{x}^*,c_0)} \lambda _{\min }(\textbf{H}(\textbf{x}))\), and assume that \({\mathcal M}\) is curved with parameter \((c_0,C_T)\) at \(\textbf{x}^*\) and \(4C_{H,1}\ge C_T\Vert \Pi _{T_{\textbf{x}^*}({\mathcal M}),\perp }\nabla f(\textbf{x}^*)\Vert \). Then we have the following lower bound for any \(\textbf{x}\in B(\textbf{x}^*,c_0)\cap {\mathcal M}\):

$$\begin{aligned} f(\textbf{x})-f(\textbf{x}^*)\ge \frac{1}{2}b^2C_{H,1}- b \Vert \Pi _{T_{\textbf{x}^*}({\mathcal M})}\nabla f(\textbf{x}^*)\Vert -C_Tb^2\Vert \Pi _{T_{\textbf{x}^*}({\mathcal M}),\perp }\nabla f(\textbf{x}^*)\Vert , \end{aligned}$$

where \(b=\Vert \textbf{x}-\textbf{x}^*\Vert \).

Lemma 5 can be viewed as a generalization of Inequality (A2): when \({\mathcal M}=\mathbb {R}^p\), we have \(C_T=0\) and \(\Pi _{T_{\textbf{x}^*}({\mathcal M}),\perp }=0\), so that \(\Vert \Pi _{T_{\textbf{x}^*}({\mathcal M}),\perp }\nabla f(\textbf{x}^*)\Vert =0\). To apply Lemma 5 to our problem, we need to estimate the parameters \(c_0, C_T, C_{H,1},\) \(\Vert \Pi _{T_{\textbf{x}^*}({\mathcal M})}\nabla f(\textbf{x}^*)\Vert \), and \(\Vert \Pi _{T_{\textbf{x}^*}({\mathcal M}),\perp }\nabla f(\textbf{x}^*)\Vert \) in the statement of Lemma 5. In particular, \((c_0,C_T)\) depends on the manifold \({\mathcal M}\) used in the optimization problem (2.4), which is

$$\begin{aligned} {\mathcal M}=\{(\textbf{C},\gamma )\in \mathbb {R}^{q\times m}\times \mathbb {R}^p: \textrm{rank}(\textbf{C})=r\}. \end{aligned}$$
(A3)

By treating \({\mathcal M}\) as the product space of \(\mathbb {R}^p\) and the manifold of low-rank matrices \(\{\textbf{C}\in \mathbb {R}^{q\times m}: \textrm{rank}(\textbf{C})=r\}\) and following the tangent space of the set of low-rank matrices in the literature (Absil and Oseledets 2015; Zhang and Yang 2018), we obtain the following lemma of “curvedness” parameters \((c_0,C_T)\) of \({\mathcal M}\) at \((\textbf{C}^*,\gamma ^*)\).

Lemma 6

The manifold \({\mathcal M}\) defined in (A3) is curved with parameter \((c_0,C_T)\) at \((\textbf{C}^*,\gamma ^*)\), for any \(c_0\le \sigma _{r}(\textbf{C}^*)/2\) and \(C_T=2/\sigma _{r}(\textbf{C}^*)\), where \(\sigma _{r}(\textbf{C}^*)\) represents the r-th (i.e., the smallest) singular value of \(\textbf{C}^*\).

In addition, the parameter \(C_{H,1}\) in Lemma 5 can be estimated from Assumption 2, and the derivatives \(\Vert \Pi _{T_{\textbf{x}^*}({\mathcal M})}\nabla f(\textbf{x}^*)\Vert \) and \(\Vert \Pi _{T_{\textbf{x}^*}({\mathcal M}),\perp }\nabla f(\textbf{x}^*)\Vert \) in Lemma 5 can be estimated from Assumptions 1 and 3. The proof of Theorem 2 then follows from Lemma 5 and the intuition introduced at the beginning of this section, with technical details given in Section A.6.

A.5 Proof of Lemma 6

Proof

By following the tangent space of the set of low-rank matrices in the literature (Absil and Oseledets 2015; Zhang and Yang 2018), we have the following explicit expression for the tangent space of \({\mathcal M}\) at \((\textbf{C}^*,\gamma ^*)\):

$$\begin{aligned} T_{(\textbf{C}^*,\gamma ^*),{\mathcal M}}=\{(\textbf{A}\textbf{V}^*\textbf{V}^{*T}+\textbf{U}^*\textbf{U}^{*T}\textbf{B},\textbf{y}): \textbf{A},\textbf{B}\in \mathbb {R}^{q\times m}, \textbf{y}\in \mathbb {R}^p \}, \end{aligned}$$

where \(\textbf{U}^*\in \mathbb {R}^{q\times r}\) and \(\textbf{V}^*\in \mathbb {R}^{m\times r}\) are obtained from the singular value decomposition of \(\textbf{C}^*\) such that \(\textbf{C}^*=\textbf{U}^*\Sigma \textbf{V}^{*T}\). The projection operators in Lemma 5 are given by

$$\begin{aligned} \Pi _{T_{(\textbf{C}^*,\gamma ^*),{\mathcal M}}}(\textbf{D},\textbf{y})&=(\textbf{U}^*\textbf{U}^{*T}\textbf{D}+\textbf{D}\textbf{V}^*\textbf{V}^{*T}-\textbf{U}^*\textbf{U}^{*T}\textbf{D}\textbf{V}^*\textbf{V}^{*T},\textbf{y}),\\ \Pi _{T_{(\textbf{C}^*,\gamma ^*),{\mathcal M}},\perp }(\textbf{D},\textbf{y})&=((\textbf{I}-\textbf{U}^*\textbf{U}^{*T})\textbf{D}(\textbf{I}-\textbf{V}^*\textbf{V}^{*T}),0). \end{aligned}$$

Choose \(\textbf{U}^{*\perp }\in \mathbb {R}^{q\times (q-r)}\) such that \([\textbf{U}^{*\perp },\textbf{U}^*]\in \mathbb {R}^{q\times q}\) is orthogonal and \(\textbf{V}^{*\perp }\in \mathbb {R}^{m\times (m-r)}\) such that \([\textbf{V}^{*\perp },\textbf{V}^*]\in \mathbb {R}^{m\times m}\) is orthogonal. Then, for any \((\textbf{C},\gamma )\) close to \((\textbf{C}^*,\gamma ^*)\), we may write \(\textbf{C}-\textbf{C}^*=\textbf{U}^*\textbf{D}_1\textbf{V}^{*T}+\textbf{U}^{*\perp }\textbf{D}_2\textbf{V}^{*T}+\textbf{U}^{*}\textbf{D}_3\textbf{V}^{*\perp T}+\textbf{U}^{*\perp }\textbf{D}_4\textbf{V}^{*\perp T}\), so that

$$\begin{aligned} \Vert \Pi _{T_{(\textbf{C}^*,\gamma ^*),{\mathcal M}},\perp }(\textbf{C}-\textbf{C}^*,\gamma -\gamma ^*)\Vert =\Vert (\textbf{U}^{*\perp }\textbf{D}_4\textbf{V}^{*\perp T},0)\Vert =\Vert \textbf{D}_4\Vert _F, \end{aligned}$$

and

$$\begin{aligned} \Vert \Pi _{T_{(\textbf{C}^*,\gamma ^*),{\mathcal M}}}(\textbf{C}-\textbf{C}^*,\gamma -\gamma ^*)\Vert =\sqrt{\Vert \textbf{D}_1\Vert _F^2 +\Vert \textbf{D}_2\Vert _F^2+\Vert \textbf{D}_3\Vert _F^2+\Vert \gamma -\gamma ^*\Vert ^2}. \end{aligned}$$

Since \(\textrm{rank}(\textbf{C})=r\), we have \(\textbf{D}_4=\textbf{D}_2(\textbf{D}_1+\textbf{U}^{*T}\textbf{C}^*\textbf{V}^{*})^{-1}\textbf{D}_3\). Thus, when \(\Vert \textbf{C}-\textbf{C}^*\Vert _F\le \sigma _r(\textbf{C}^*)/2\),

$$\begin{aligned}&\Vert \Pi _{T_{(\textbf{C}^*,\gamma ^*),{\mathcal M}},\perp }(\textbf{C}-\textbf{C}^*,\gamma -\gamma ^*)\Vert \\&\quad \le \frac{\Vert \textbf{D}_2\Vert _F\Vert \textbf{D}_3\Vert _F}{\sigma _r(\textbf{D}_1)-\Vert \textbf{C}-\textbf{C}^*\Vert _F}\\&\quad \le \frac{2\Vert \Pi _{T_{(\textbf{C}^*,\gamma ^*),{\mathcal M}}}(\textbf{C}-\textbf{C}^*,\gamma -\gamma ^*)\Vert ^2}{\sigma _r(\textbf{C}^*)}, \end{aligned}$$

and Lemma 6 is proved. \(\square \)
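The explicit projectors above make the bound in Lemma 6 easy to probe numerically; the following sketch (random data and sizes are illustrative assumptions) checks that \(\Vert \Pi _{T,\perp }(\textbf{C}-\textbf{C}^*,\gamma -\gamma ^*)\Vert \le \frac{2}{\sigma _r(\textbf{C}^*)}\Vert \Pi _{T}(\textbf{C}-\textbf{C}^*,\gamma -\gamma ^*)\Vert ^2\) for random rank-r matrices \(\textbf{C}\) near \(\textbf{C}^*\).

```python
import numpy as np

rng = np.random.default_rng(2)
q, m, p, r = 12, 9, 5, 3

def trunc_svd(A, r):
    """Best rank-r approximation of A (truncated SVD)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

C_star = trunc_svd(rng.standard_normal((q, m)), r)
U, s, Vt = np.linalg.svd(C_star, full_matrices=False)
PU = U[:, :r] @ U[:, :r].T          # projector onto the column space of C_star
PV = Vt[:r, :].T @ Vt[:r, :]        # projector onto the row space of C_star
C_T = 2.0 / s[r - 1]                # curvature constant 2 / sigma_r(C_star) from Lemma 6

ok = True
for _ in range(200):
    C = trunc_svd(C_star + 1e-2 * rng.standard_normal((q, m)), r)  # nearby rank-r point
    dgam = 1e-2 * rng.standard_normal(p)
    D = C - C_star                                                 # well within ||.||_F <= sigma_r / 2
    D_perp = (np.eye(q) - PU) @ D @ (np.eye(m) - PV)               # Pi_{T,perp} component
    tangent_sq = np.sum(D**2) + np.sum(dgam**2) - np.sum(D_perp**2)  # ||Pi_T(...)||^2
    ok = ok and np.linalg.norm(D_perp) <= C_T * tangent_sq + 1e-12
print(ok)
```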

A.6 Proof of Theorem 2

Proof

In the proof, we mainly work with

$$\begin{aligned} f(\textbf{C},\gamma )= \sum _{i=1}^n l(y_i,\langle \textbf{X}_i,\textbf{C}\rangle +\gamma ^T\textbf{z}_i), \end{aligned}$$

and it is sufficient to show that

$$\begin{aligned} f(\textbf{C},\gamma )-f(\textbf{C}^*,\gamma ^*)\ge \lambda P(\textbf{C}^*,\gamma ^*) \end{aligned}$$

for all \((\textbf{C},\gamma )\in {\mathcal M}\) such that \(\sqrt{\Vert \textbf{C}-\textbf{C}^*\Vert _F^2+\Vert \gamma -\gamma ^*\Vert ^2}\ge C_{error,1}\).

To prove this, we first calculate the constants and operators in Lemma 5 as follows. For all of the models considered, the curvature constant of \({\mathcal M}\) is the same; hence, we may choose \(C_T=2/\sigma _{\min }(\textbf{C}^*)\). In addition, as discussed in the proof of Lemma 6, the projectors \(\Pi _{T}\) and \(\Pi _{T,\perp }\) at \((\textbf{C}^*,\gamma ^*)\) are given by

$$\begin{aligned} \Pi _{T_{(\textbf{C}^*,\gamma ^*)}({\mathcal M})}(\textbf{C},\gamma )&=(\textbf{C}-\Pi _{\textbf{U}^*,\perp }\textbf{C}\Pi _{\textbf{V}^*,\perp },\gamma ), \nonumber \\ \Pi _{T_{(\textbf{C}^*,\gamma ^*)}({\mathcal M}),\perp }(\textbf{C},\gamma )&=(\Pi _{\textbf{U}^*,\perp }\textbf{C}\Pi _{\textbf{V}^*,\perp },0), \end{aligned}$$
(A4)

where \(\textbf{U}^*\in \mathbb {R}^{q\times r}\) and \(\textbf{V}^*\in \mathbb {R}^{m\times r}\) contain the left and right singular vectors of \(\textbf{C}^*\), \(\Pi _{\textbf{U}^*}=\textbf{U}^*\textbf{U}^{*T}\), \(\Pi _{\textbf{U}^*,\perp }=\textbf{I}-\Pi _{\textbf{U}^*}\), \(\Pi _{\textbf{V}^*}=\textbf{V}^*\textbf{V}^{*T}\), and \(\Pi _{\textbf{V}^*,\perp }=\textbf{I}-\Pi _{\textbf{V}^*}\). As for the first derivative, we have

$$\begin{aligned} \nabla f(\textbf{C}^*,\gamma ^*)=\sum _{i=1}^nw_{1,i}\textrm{vec}(\textbf{X}_i,\textbf{z}_i), \end{aligned}$$

where

$$\begin{aligned} w_{1,i}={\left\{ \begin{array}{ll}2\epsilon _i,\,\,\text {for the matrix variate regression model}; \\ \epsilon _i,\,\,\text {for the logistic matrix variate regression model}.\end{array}\right. } \end{aligned}$$
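As a side note (not part of the proof), the Euclidean gradient \(\sum _{i=1}^n w_{1,i}\textrm{vec}(\textbf{X}_i,\textbf{z}_i)\) is easy to assemble in code. The sketch below is illustrative: it writes the continuous-model residual as \(y_i-\eta _i\) and the logistic residual as \(y_i-1/(1+e^{-\eta _i})\), which is how we interpret \(\epsilon _i\) here; only the norm of the gradient enters the bounds, so the overall sign convention of \(w_{1,i}\) is immaterial.

```python
import numpy as np

def grad_f(X, Z, y, C, gamma, model='continuous'):
    """Euclidean gradient of f(C, gamma) = sum_i l(y_i, <X_i, C> + gamma^T z_i).
    X has shape (n, q, m), Z has shape (n, p), y has shape (n,)."""
    eta = np.einsum('iab,ab->i', X, C) + Z @ gamma
    if model == 'continuous':                     # squared loss l(y, eta) = (y - eta)^2
        w1 = -2.0 * (y - eta)                     # residual y_i - eta_i plays the role of eps_i
    else:                                         # logistic loss
        w1 = -(y - 1.0 / (1.0 + np.exp(-eta)))    # y_i minus the fitted probability
    G_C = np.einsum('i,iab->ab', w1, X)           # sum_i w_{1,i} X_i   (matrix block)
    g_gamma = Z.T @ w1                            # sum_i w_{1,i} z_i   (vector block)
    return G_C, g_gamma
```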

Combining it with (A4),

$$\begin{aligned} \Pi _{T_{(\textbf{C}^*,\gamma ^*)}({\mathcal M})}\nabla f(\textbf{C}^*,\gamma ^*)&=(\Pi _{\textbf{U}}\textbf{X}_i+\textbf{X}_i\Pi _{\textbf{V}}-\Pi _{\textbf{U}}\textbf{X}_i\Pi _{\textbf{V}},\textbf{z}_i),\\ \Pi _{T_{(\textbf{C}^*,\gamma ^*)}({\mathcal M}),\perp }\nabla f(\textbf{C}^*,\gamma ^*)&=(\textbf{X}_i-\Pi _{\textbf{U}}\textbf{X}_i-\textbf{X}_i\Pi _{\textbf{V}}+\Pi _{\textbf{U}}\textbf{X}_i\Pi _{\textbf{V}},0). \end{aligned}$$

Now let us introduce a lemma as follows.

Lemma 7

For any matrix \(\textbf{U}\in \mathbb {R}^{n\times d}\) with orthonormal columns and any random vector \(\textbf{x}\in \mathbb {R}^n\) whose entries are i.i.d. samples from a sub-Gaussian distribution with parameter \(\sigma _0\), we have, for \(t\ge 2\),

$$\begin{aligned} \Pr (\Vert \textbf{x}^T\textbf{U}\Vert \ge t\sigma _0\sqrt{d})\le C\exp (-Ct). \end{aligned}$$

Proof

This lemma follows from McDiarmid's inequality (Maurer and Pontil 2021, Theorem 3). In particular, we have that

$$\begin{aligned} \textrm{E}\Vert \textbf{x}^T\textbf{U}\Vert \le \sqrt{\textrm{E}[\textbf{x}^T\textbf{U}\textbf{U}^T\textbf{x}]}=\sqrt{\textrm{E}\left[ \sum _{i=1}^n\textbf{x}_i^2\sum _{j=1}^d\textbf{U}_{ij}^2\right] }\le \sigma _0\sqrt{d}, \end{aligned}$$

and let \(\textbf{x}^{(i)}\in \mathbb {R}^n\) be defined such that \(\textbf{x}^{(i)}_j=\textbf{x}_j\) if \(j\ne i\) and \(\textbf{x}^{(i)}_i=0\); then \(|\Vert \textbf{x}^T\textbf{U}\Vert -\Vert \textbf{x}^{(i)T}\textbf{U}\Vert |\le |\textbf{x}_i|\Vert \textbf{U}(i,:)\Vert \), where \(\Vert \textbf{U}(i,:)\Vert \) denotes the norm of the i-th row of \(\textbf{U}\). As a result, \(\Vert \textbf{x}^T\textbf{U}\Vert -\Vert \textbf{x}^{(i)T}\textbf{U}\Vert \) is sub-Gaussian with parameter \(\sigma _0\Vert \textbf{U}(i,:)\Vert \). Combining this with the fact that \(\sum _{i=1}^n\Vert \textbf{U}(i,:)\Vert ^2=d\) and the sub-Gaussian version of McDiarmid's inequality (Maurer and Pontil 2021, Theorem 3), the lemma is proved. \(\square \)
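A quick Monte Carlo illustration of Lemma 7 (added here as a sanity check, with Gaussian entries playing the role of the sub-Gaussian distribution with parameter \(\sigma _0\)):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, sigma0, reps, t = 2000, 20, 1.0, 5000, 2.0

# U with orthonormal columns, so that sum_i ||U(i,:)||^2 = d, as used in the proof.
U, _ = np.linalg.qr(rng.standard_normal((n, d)))

norms = np.array([np.linalg.norm((sigma0 * rng.standard_normal(n)) @ U)
                  for _ in range(reps)])
print(norms.mean(), sigma0 * np.sqrt(d))            # ||x^T U|| concentrates near sigma0 * sqrt(d)
print(np.mean(norms >= t * sigma0 * np.sqrt(d)))    # empirical tail probability; close to zero
```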

Assumption 1 and Lemma 7 imply that with a probability of at least \(1-C\exp (-Cn)-C\exp (-Ct(r(q+m)+p))\),

$$\begin{aligned} \Big \Vert \sum _{i=1}^nw_{1,i} \textrm{vec}\Big (\Pi _{T_{(\textbf{C}^*,\gamma ^*)}({\mathcal M})}\nabla f(\textbf{C}^*,\gamma ^*)\Big )\Big \Vert&\le C_1t\sigma _0\sqrt{n(r(q+m)+p)}, \\ \Big \Vert \sum _{i=1}^nw_{1,i} \textrm{vec}\Big (\Pi _{T_{(\textbf{C}^*,\gamma ^*)}({\mathcal M}),\perp }\nabla f(\textbf{C}^*,\gamma ^*)\Big )\Big \Vert&\le C_1t\sigma _0\sqrt{n(qm-r(q+m)+p)}, \end{aligned}$$
(A5)

where

$$\begin{aligned} \sigma _0={\left\{ \begin{array}{ll}\sigma ,&{}\text {for the matrix-variate regression model;} \\ 1,&{}\text {for the logistic matrix-variate regression model}.\end{array}\right. } \end{aligned}$$

The Hessian matrix is as defined in (A1); as a result, we have \(C_{H,1}=C_2n\). Plugging these into Lemma 5, for \(b=\sqrt{\Vert \textbf{C}-\textbf{C}^*\Vert _F^2+\Vert \gamma -\gamma ^*\Vert ^2}\) we have

$$\begin{aligned}&f(\textbf{C},\gamma )-f(\textbf{C}^*,\gamma ^*)\ge \frac{1}{2}b^2C_{2} - b \Vert \Pi _{T_{\textbf{x}^*}({\mathcal M})}\nabla f(\textbf{x}^*)\Vert \\&\quad -C_Tb^2\Vert \Pi _{T_{\textbf{x}^*}({\mathcal M}),\perp }\nabla f(\textbf{x}^*)\Vert \end{aligned}$$

which is larger than \(\lambda P\) if

$$\begin{aligned} \frac{b^2C_{2}}{6}\ge \max \left( \lambda P, b \Vert \Pi _{T_{\textbf{x}^*}({\mathcal M})}\nabla f(\textbf{x}^*)\Vert , C_Tb^2\Vert \Pi _{T_{\textbf{x}^*}({\mathcal M}),\perp }\nabla f(\textbf{x}^*)\Vert \right) , \end{aligned}$$

or equivalently, if \(C_2\sqrt{n}\ge 6 C_1t\sigma _0\sqrt{(qm+p)} \), and

$$\begin{aligned} b\ge \max \left( \frac{6C_1t\sigma _0\sqrt{n(r(q+m)+p)} }{C_2},\sqrt{\frac{6\lambda P(\textbf{C}^*,\gamma ^*)}{C_2}}\right) . \end{aligned}$$
(A6)

As a result, we have

$$\begin{aligned} f(\textbf{C},\gamma )-f(\textbf{C}^*,\gamma ^*)\ge \lambda P(\textbf{C}^*,\gamma ^*) \quad \text {for all } \{(\textbf{C},\gamma )\in {\mathcal M}: \textrm{dist}((\textbf{C},\gamma ),(\textbf{C}^*,\gamma ^*))\in \mathcal {I}\}, \end{aligned}$$
(A7)

where

$$\begin{aligned} \mathcal {I}=\Bigg [\max \left( \frac{6C_1t\sigma _0\sqrt{n(r(q+m)+p)} }{C_2},\sqrt{\frac{6\lambda P(\textbf{C}^*,\gamma ^*)}{C_2}}\right) ,c_0\Bigg ]. \end{aligned}$$

Next, for all \((\textbf{C},\gamma )\) such that \(\textrm{dist}((\textbf{C},\gamma ),(\textbf{C}^*,\gamma ^*))=\sqrt{\Vert {\textbf{C}}-\textbf{C}^*\Vert _F^2+\Vert {\gamma }-\gamma ^*\Vert ^2}=b\) where \(b\le c_0\), we have

$$\begin{aligned} f(\textbf{C},\gamma )-f(\textbf{C}^*,\gamma ^*)&\ge \frac{1}{2}C_2n b^2- b \Vert \nabla f(\textbf{C}^*,\gamma ^*)\Vert \\&\ge \frac{1}{2}C_2n b^2-bC_1t\sigma _0\sqrt{n(qm+p)} . \end{aligned}$$

That is, \(f(\textbf{C},\gamma )-f(\textbf{C}^*,\gamma ^*)\ge \lambda P(\textbf{C}^*,\gamma ^*)\) when \(b\ge \frac{4C_1t\sigma _0\sqrt{n(qm+p)} }{C_2 n}\) and \(\frac{1}{4}C_2n b^2\ge \lambda P(\textbf{C}^*,\gamma ^*)\). By (3.1), such a choice of b exists, and we have \(f(\textbf{C},\gamma )-f(\textbf{C}^*,\gamma ^*)\ge \lambda P(\textbf{C}^*,\gamma ^*)\) for all \(\{(\textbf{C},\gamma ):\textrm{dist}((\textbf{C},\gamma ),(\textbf{C}^*,\gamma ^*))=b\}\). Since f is convex, the same holds for all \(\{(\textbf{C},\gamma ):\textrm{dist}((\textbf{C},\gamma ),(\textbf{C}^*,\gamma ^*))\ge b\}\). Combining this with (A7), we have that for all \((\textbf{C},\gamma )\in {\mathcal M}\) such that \(\sqrt{\Vert \textbf{C}-\textbf{C}^*\Vert _F^2+\Vert \gamma -\gamma ^*\Vert ^2}\ge C_{error,1}\), \(f(\textbf{C},\gamma )-f(\textbf{C}^*,\gamma ^*)\ge \lambda P(\textbf{C}^*,\gamma ^*)\), and the theorem is proved. \(\square \)

A.7 Proof of Theorem 4

Proof

First, by Assumption 2, for all \((\textbf{C},\gamma )\) with \(\sqrt{\Vert \textbf{C}-\textbf{C}^*\Vert _F^2+\Vert \gamma -\gamma ^*\Vert ^2}=c_0'\),

$$\begin{aligned} F(\textbf{C},\gamma )\ge f(\textbf{C},\gamma ) > \frac{C_2n}{2}c_0'^2 + \sum _{i=1}^n l(y_i,\langle \textbf{X}_i,\textbf{C}^*\rangle +\gamma ^{*T}\textbf{z}_i). \end{aligned}$$

Since \(F(\textbf{C}^{(\textrm{iter})}, \gamma ^{(\textrm{iter})})\) is nonincreasing, there exists \(\alpha _0>0\) such that if the initial step size \(\alpha \) in the line search satisfies \(\alpha <\alpha _0\), then \((\textbf{C}^{(\textrm{iter})}, \gamma ^{(\textrm{iter})})\in \mathcal {B}\) for all \(\textrm{iter}\ge 1\).

By the proof of (A5), we have

$$\begin{aligned} \Vert \Pi _{T_{(\textbf{C}^*,\gamma ^*),{\mathcal M}}}\nabla f(\textbf{C}^*,\gamma ^*)\Vert&\le C_1t\sigma _0\sqrt{n(r(q+m)+p)}, \\ \Vert \Pi _{T_{(\textbf{C}^*,\gamma ^*),{\mathcal M}},\perp }\nabla f(\textbf{C}^*,\gamma ^*)\Vert&\le C_1t\sigma _0\sqrt{n(qm-r(q+m)+p)}. \end{aligned}$$

As a result, for \((\textbf{C},\gamma )\) with \(\textrm{dist}((\textbf{C},\gamma ),(\textbf{C}^*,\gamma ^*))=b\), we have

$$\begin{aligned} \Vert \Pi _{T_{(\textbf{C},\gamma ),{\mathcal M}}}\nabla f(\textbf{C},\gamma )\Vert&\ge nC_2b-C_1t\sigma _0\sqrt{n(r(q+m)+p)}, \\ \Vert \Pi _{T_{(\textbf{C},\gamma ),{\mathcal M}},\perp }\nabla f(\textbf{C},\gamma )\Vert&\le nC_3b+C_1t\sigma _0\sqrt{n(qm-r(q+m)+p)}. \end{aligned}$$

That is, if

$$\begin{aligned}&nC_2b-C_1t\sigma _0\sqrt{n(r(q+m)+p)} >\\&C_Tb(nC_3b+C_1t\sigma _0\sqrt{n(qm-r(q+m)+p)} )+\lambda C_{partial}, \end{aligned}$$

then the Riemannian gradient of F at \((\textbf{C},\gamma )\) does not vanish, i.e., \(\Vert \Pi _{T_{(\textbf{C},\gamma ),{\mathcal M}}}\nabla F(\textbf{C},\gamma )\Vert \ne 0\). This is satisfied if

$$\begin{aligned} \frac{1}{4}nC_2b>\max \biggl \{C_1t\sigma _0\sqrt{n(r(q+m)+p)} ,C_TC_3nb^2,\\ C_TC_1t\sigma _0\sqrt{n(qm-r(q+m)+p)} b,\lambda C_{partial}\biggr \}, \end{aligned}$$

i.e., when \(\sqrt{n}>\frac{4C_TC_1t\sigma _0\sqrt{(qm-r(q+m)+p)} }{C_2}\) and

$$\begin{aligned} \frac{4C_1t\sigma _0\sqrt{(r(q+m)+p)} }{C_2\sqrt{n}}+\frac{4\lambda C_{partial}}{nC_2}<b<\frac{C_2}{4C_TC_3}. \end{aligned}$$

By assumptions, this is satisfied with initialization \(b=\textrm{dist}((\textbf{C}^{(0)},\gamma ^{(0)}),(\textbf{C}^*,\gamma ^*))\), so

$$\begin{aligned} (\textbf{C}^{(\textrm{iter})}, \gamma ^{(\textrm{iter})})\in \mathcal {B}\;\forall \;\textrm{iter}\ge 1. \end{aligned}$$

It remains to prove (3.3), whose proof is similar to that of (3.2). \(\square \)

A.8 Proof of Theorem 3

Proof of Theorem 3

Without loss of generality, assume that \(m\ge q\). Let \(\theta =(\textbf{C},\gamma )\), and define

$$\begin{aligned} \mathcal {C}=\Bigg \{(\textbf{C},\gamma ):\textbf{C}=[\textbf{C}',0]\ \text {where}\ \textbf{C}'\in \mathbb {R}^{m\times r}\ \text {and}\ 0\in \mathbb {R}^{m\times (q-r)},\ \textbf{C}'_{ij}\in \{s,-s\},\ \gamma _k\in \{s,-s\}\Bigg \}, \end{aligned}$$

where \(s=c_0\sqrt{\frac{r(m+q)+p}{n}}\). Then \(|\mathcal {C}|=2^{rm+p}\), and for any distinct \((\textbf{C}_1,\gamma _1), (\textbf{C}_2,\gamma _2)\in \mathcal {C}\), \(\textrm{dist}((\textbf{C}_1,\gamma _1), (\textbf{C}_2,\gamma _2))\ge 2s\). In addition, letting \(P_0\) denote the model with \(\textbf{C}=0\) and \(\gamma =0\), for any \((\textbf{C},\gamma )\in \mathcal {C}\) we have

$$\begin{aligned} K(P_{0},P_{(\textbf{C},\gamma )})&\le c_{\epsilon }\sum _{i=1}^n\Big (\langle \textbf{X}_i,\textbf{C}\rangle +\gamma ^T\textbf{z}_i\Big )^2\\&\le c_{\epsilon }C_{upper}(\Vert \textbf{C}\Vert _F^2+\Vert \gamma \Vert ^2)\\&\le c_{\epsilon }C_{upper} (r(m+q)+p)s^2. \end{aligned}$$

Applying (Tsybakov 2008, Theorem 2.5) and noting that \(\log |\mathcal {C}|=\log (2^{rm+p})=(rm+p)\log 2\), we may choose \( \alpha =2c_{\epsilon }C_{upper}s^2/\log 2\), and \(\alpha \) can be made sufficiently small by choosing \(c_0\) to be small. The rest of the proof follows by applying (Tsybakov 2008, Theorem 2.5). \(\square \)
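The packing set \(\mathcal {C}\) can be enumerated for tiny problem sizes to confirm its cardinality and separation (a sanity check with illustrative sizes, not part of the proof):

```python
import numpy as np
from itertools import product

q, m, p, r, n, c0 = 3, 4, 2, 1, 200, 1.0              # tiny sizes so 2^(rm+p) is enumerable
s = c0 * np.sqrt((r * (m + q) + p) / n)

points = []
for signs in product([s, -s], repeat=r * m + p):
    Cp = np.array(signs[:r * m]).reshape(m, r)        # C' with entries in {s, -s}
    C = np.hstack([Cp, np.zeros((m, q - r))])         # C = [C', 0]
    gamma = np.array(signs[r * m:])                   # gamma entries in {s, -s}
    points.append(np.concatenate([C.ravel(), gamma]))

P = np.array(points)
print(len(P), 2 ** (r * m + p))                       # |C| = 2^(rm + p)
D = np.sqrt(((P[:, None, :] - P[None, :, :]) ** 2).sum(-1))
print(D[D > 0].min(), 2 * s)                          # minimum pairwise distance equals 2s
```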

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Huang, HH., Yu, F., Fan, X. et al. A framework of regularized low-rank matrix models for regression and classification. Stat Comput 34, 10 (2024). https://doi.org/10.1007/s11222-023-10318-z
