1 Introduction

In the area of data mining for high-dimensional data with ‘m’ units and ‘n’ variables, where ‘m’ and ‘n’ are very large, there is a great deal of interest in reducing the dimensions on both sides. For reducing a large number of variables to a smaller number, techniques based on Principal Component Analysis (PCA) are normally used, and the interpretation of the resulting components is usually improved by penalty constraints such as the LASSO (which induces sparseness) or the ridge. The reduction of a large number of units, on the other hand, is handled by clustering techniques or through finite mixture models. In the first case the focus is on the correlation between the variables, and in the second on the distance between units, with appropriate standardization of the variables. In this paper, we study the relationship between two sets of variables \(Y_t\) and \(X_t\), of dimensions ‘m’ and ‘n’, collected over ‘T’ time periods via a multivariate regression model, and we want to reduce the dimension of the (\(m \times n\)) coefficient matrix on both the unit and the variable side. Biclustering methods generally focus on a single data matrix; here we focus on the estimated coefficient matrix that relates \(Y_t\) to \(X_t\) and thus represents both data matrices Y and X, but it has to be properly weighted to account for the estimation errors.

In Sect. 2, we introduce the model and review the methodology for clustering regression models. The following section covers the estimation methods and some approximate solutions for biclustering. In Sect. 4, we define the problem of biclustering of regression models and evaluate the procedures using both simulated and real data. The final section outlines future work.

2 Clustering of Regression Models

In microarray experiments, the objectives are twofold: to detect differentially expressed genes and to group genes with similar expression. It is believed that genes with different expression can provide disease-specific markers and that co-expressed genes can contribute to understanding the regulatory network aspects of gene expression (see Qin and Self [5]). In finance, there is a great deal of interest in assessing the commonality in returns among stocks and how this commonality can be related to commonality in order flow (see Hasbrouck and Seppi [3]). This research has led to fundamental questions related to diversification, a central tenet of Markowitz’s portfolio theory. The basic regression model can be written as:

$$\begin{aligned} \underset{T \times 1}{Y_i}= \underset{T \times n}{X_i} \cdot \underset{n \times 1}{c_i} + e_i, i=1,\ldots ,m, \end{aligned}$$
(1)

where the time series of responses (returns) for unit i is given in \(Y_i\) and the corresponding (order flow) variables related to the ith response are given in \(X_i\).

If the \(e_i\)’s are assumed to be correlated over ‘i’, then the above model set-up is called seemingly unrelated regression equations (SURE). The focus of the clustering methods is on grouping the ‘\(c_i\)’ into gene-based groups or, in the finance context, into trading-pattern-based or industry groups.

Finite Mixture Regression Models: We will assume that there are ‘K’ groups and that each model in (1) can come from one of these groups. The random group indicator \(S_i\) is distributed with unknown probability distribution \(\eta = (\eta _1,\ldots ,\eta _K)\). The unknown parameters are collected in \(\theta = (\eta _1,\ldots ,\eta _K, c_1,\ldots ,c_K,\) \(\varSigma _1,\ldots ,\varSigma _K)\). The marginal distribution of \(Y_i\) given \(X_i\) and \(\theta \) can be written as:

$$\begin{aligned} \begin{aligned} f(y_i \;|\; X_i,\theta )&= \sum _{k=1}^K f(y_i \;|\; X_i,S_i,\theta ) f(S_i=k \;|\; \theta ) \\&= \sum _{k=1}^K \eta _k f(y_i, X_i c_k, \varSigma _k), \end{aligned} \end{aligned}$$
(2)

as stated in Frühwirth-Schnatter [2, p. 243]. It is important to note that the regression coefficients are identifiable if and only if the number of clusters is less than the number of distinct (\(n-1\))-dimensional hyperplanes generated by the covariates. If the covariates show little variability, there may be identifiability problems, which indicates that the groups may not be that distinctive.

Because the cluster memberships are not known a priori, they are typically treated as missing information and the model is fit using the EM algorithm. The initial cluster centers are formed from the least squares estimates of the c’s, grouped via empirical clustering procedures such as K-means clustering. The number of clusters ‘K’ is chosen through some criterion of reproducibility over two or more clustering samples. The following BIC criterion is suggested for the selection of ‘K’:

$$\begin{aligned} \text {BIC}_K= 2 \text {loglikelihood} - n \ln T. \end{aligned}$$
(3)

Qin and Self [5] also suggest a measure for the stability of cluster centers:

$$\begin{aligned} \text {BMV}_K= \max _{k=1 \ldots ,K} (\text {volume}(\hat{\varSigma }_k)), \end{aligned}$$
(4)

where BMV is the bootstrapped maximum volume and the volume is measured by the largest eigenvalue. The joint use of BIC and BMV leads to a selection of ‘K’ that has a large BIC and a small BMV.

The clustering of regression models is fit by applying the EM algorithm as follows. Stack the data as \(Y= (Y_1', \ldots ,Y_m')'\) and \(X= (X_1', \ldots ,X_m')'\), with \(S= (S_1,\ldots ,S_m)'\), and let \(\theta = (c_1', \ldots ,c_K', \eta _1,\ldots ,\eta _K)\) be the stacked parameter vector. The E-step finds the expectation of the unknown variables, such as the cluster memberships, and the M-step maximizes the likelihood to obtain the parameter estimates. These result in the following steps:

$$\begin{aligned} \begin{aligned} \hat{s}_{ik}&= \dfrac{\eta _k \text {Pr}(y_i \;|\; s_{ik}=1; \theta )}{\sum _{k=1}^K \eta _k \text {Pr}(y_i \;|\; s_{ik}=1; \theta )} \\ \hat{\eta }_k&= \frac{1}{m}\sum _{i=1}^m \hat{s}_{ik} \\ \hat{c}_k&= \left( \sum _{i=1}^m \hat{s}_{ik} X_i' X_i \right) ^{-1} \sum _{i=1}^m \hat{s}_{ik} X_i' Y_i \\ \hat{\sigma }_k^2&= \sum _{i=1}^m \hat{s}_{ik} (Y_i - X_i \hat{c}_k)' (Y_i - X_i \hat{c}_k) / \sum _{i=1}^m T \hat{s}_{ik}. \end{aligned} \end{aligned}$$
(5)

The convergence of these estimates depends on the choice of initial values.
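As an illustration, the following is a minimal sketch of the updates in (5) in Python, assuming a scalar error variance per cluster and, for simplicity, initializing the cluster coefficients from the least squares fits of K randomly chosen units rather than from K-means; the function and variable names are ours and the code is not tuned for numerical robustness.

```python
import numpy as np
from scipy.stats import norm

def em_mixture_regression(Y, X, K, n_iter=100, seed=0):
    """Minimal EM for a K-component mixture of regressions, following (5).

    Y : list of m response vectors, each of length T
    X : list of m design matrices, each T x n
    Returns mixing weights eta, cluster coefficients c (K x n),
    error variances sigma2 and posterior membership probabilities s (m x K).
    """
    rng = np.random.default_rng(seed)
    m, T = len(Y), len(Y[0])
    # initialize cluster coefficients from the least squares fits of K random units
    c_tilde = np.array([np.linalg.lstsq(X[i], Y[i], rcond=None)[0] for i in range(m)])
    c = c_tilde[rng.choice(m, size=K, replace=False)].copy()
    eta, sigma2 = np.full(K, 1.0 / K), np.ones(K)

    for _ in range(n_iter):
        # E-step: posterior probability s_ik that unit i belongs to cluster k
        log_f = np.array([[norm.logpdf(Y[i] - X[i] @ c[k],
                                       scale=np.sqrt(sigma2[k])).sum()
                           for k in range(K)] for i in range(m)])
        log_post = np.log(eta) + log_f
        log_post -= log_post.max(axis=1, keepdims=True)
        s = np.exp(log_post)
        s /= s.sum(axis=1, keepdims=True)

        # M-step: the weighted updates in (5)
        eta = s.mean(axis=0)
        for k in range(K):
            XtX = sum(s[i, k] * X[i].T @ X[i] for i in range(m))
            XtY = sum(s[i, k] * X[i].T @ Y[i] for i in range(m))
            c[k] = np.linalg.solve(XtX, XtY)
            rss = sum(s[i, k] * ((Y[i] - X[i] @ c[k]) ** 2).sum() for i in range(m))
            sigma2[k] = rss / (T * s[:, k].sum())
    return eta, c, sigma2, s
```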

From the structure of the estimates of the ‘c’ coefficients in (5), it is useful to note that the cluster coefficients ‘\(c_{k}\)’ are weighted averages of the individual model estimates, with weights given by the estimated probabilities of cluster membership. Note

$$\begin{aligned} \hat{c}_{k}=(\sum _{i=1}^{m}w_{i})^{-1}(\sum _{i=1}^{m}w_{i}\tilde{c}_{i}) \end{aligned}$$
(6)

where \(w_{i}=\hat{s}_{ik}X^{'}_{i}X_{i}\) and \(\tilde{c}_{i}=(X^{'}_{i}X_{i})^{-1}X^{'}_{i}Y_{i}\) is the least squares estimate. If the design matrices \(X_{i}\) are the same, then the above simplifies to

$$\begin{aligned} \hat{c}_{k}=\left( \sum _{i=1}^{m}\hat{s}_{ik}\right) ^{-1}\sum _{i=1}^{m}\hat{s}_{ik}\tilde{c}_{i} \end{aligned}$$
(7)

which implies that the likelihood equations based on the full data can be replaced by likelihood equations based on the least squares estimates and their distributions.

3 Biclustering of Observations and Variables

In the last section the focus was on clustering subjects, but clustering of variables is also very useful in practice. The similarity among variables is captured through PCA. The concept of biclustering was discussed earlier by Rao [6], who advocated computing the singular value decomposition (SVD) of the \(m \times n\) data matrix Y, where ‘m’ is the number of subjects and ‘n’ is the number of variables, along with the SVD of \(Y{'}\). A biclustering procedure simultaneously clusters both the subjects and the variables, thus providing a way to be selective about both. This is quite useful when investments in follow-up studies are expensive.

The sparse singular value decomposition (SSVD) is proposed as an exploratory tool for biclustering by Lee, Shen, Huang and Marron [4]. For an ‘\(m \times n\)’ data matrix Y, the method minimizes the following penalized sum-of-squares criterion:

$$\begin{aligned} \left\| Y-suv' \right\| ^{2}_{F}+\lambda _uP_1(su)+\lambda _vP_2(sv) \end{aligned}$$
(8)

where the first term is the squared Frobenius norm and \(P_{1}\) and \(P_{2}\) are sparsity-inducing penalty terms. The penalty terms considered are LASSO-based and the solutions follow from a soft-thresholding rule. Because of our interest in biclustering of regression models, we want to discuss the regression approach to solving (8), as given in Lee et al. [4]. For fixed ‘u’, minimization of (8) with respect to (sv) is equivalent to minimizing (8) with respect to \(\tilde{v}=sv\), since

$$\begin{aligned} \left\| Y-u\tilde{v}' \right\| ^{2}_{F}+\lambda _vP_2(\tilde{v})=\left\| vec(Y')-(I_{n}\otimes u) \tilde{v}\right\| ^{2}_{F}+\lambda _vP_2(\tilde{v}) \end{aligned}$$
(9)

where \(\otimes \) denotes the Kronecker product. Note that for a given ‘u’, \(I_{n}\otimes u\) acts as the design matrix and \(\tilde{v}\) is the regression coefficient vector, and \(P_2(\tilde{v})=\sum _{j=1}^{n}\left| \tilde{v}_{j} \right| \) is the LASSO penalty.

Similarly, to solve for ‘u’, fix ‘v’ and set \(\tilde{u}=su\); then

$$\begin{aligned} \left\| Y-\tilde{u}v' \right\| ^{2}_{F}+\lambda _uP_1(\tilde{u})=\left\| vec(Y)-(I_{m}\otimes v) \tilde{u}\right\| ^{2}_{F}+\lambda _uP_1(\tilde{u}) \end{aligned}$$
(10)

with the LASSO penalty \(P_1(\tilde{u})=\sum _{i=1}^{m}\left| \tilde{u}_{i} \right| \). Interestingly, this regression approach is inherent in the solution of the classical Eckart-Young theorem, which is applied to arrive at the components of the singular value decomposition.
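To make the alternating-regression view of (9) and (10) concrete, here is a rough rank-one sketch in Python. Because \(u\) (respectively \(v\)) is rescaled to unit norm, each LASSO step has an orthonormal design and reduces to componentwise soft-thresholding; we use fixed penalties rather than the adaptive, criterion-selected weights of Lee et al. [4], so this is illustrative only and all names are ours.

```python
import numpy as np

def soft_threshold(z, lam):
    """Componentwise soft-thresholding, the LASSO solution under an orthonormal design."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def rank_one_ssvd(Y, lam_u=0.1, lam_v=0.1, n_iter=100, tol=1e-6):
    """Alternate the penalized regressions in (9) and (10) for one sparse SVD layer."""
    U, _, Vt = np.linalg.svd(Y, full_matrices=False)
    u, v = U[:, 0], Vt[0, :]                       # start from the leading singular vectors
    for _ in range(n_iter):
        v_new = soft_threshold(Y.T @ u, lam_v)     # update (sv) for fixed u, as in (9)
        if np.linalg.norm(v_new) > 0:
            v_new /= np.linalg.norm(v_new)
        u_new = soft_threshold(Y @ v_new, lam_u)   # update (su) for fixed v, as in (10)
        if np.linalg.norm(u_new) > 0:
            u_new /= np.linalg.norm(u_new)
        if np.linalg.norm(u_new - u) < tol and np.linalg.norm(v_new - v) < tol:
            u, v = u_new, v_new
            break
        u, v = u_new, v_new
    s = u @ Y @ v                                  # scale of the rank-one layer
    return u, s, v
```

Subsequent sparse layers can be obtained, as in Lee et al. [4], by applying the same routine to the residual matrix \(Y - suv'\).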

4 Parsimonious Regression Models

The main interest in biclustering is to discover some desirable unit-variable association. In the analysis of microarray data, the goal is to identify biologically relevant genes that are significantly expressed for certain cancer types. Initially, biclustering is used as an unsupervised learning tool; the measurements are expression levels for thousands of genes (‘m’) over a small number of subjects (‘n’). The information on cancer types is used a posteriori to interpret and evaluate the performance of SSVD (see Lee et al. [4]). Visually, the low-rank approximations can reveal a ‘checkerboard’ structure resulting from the grouping of genes and subjects. This feature can be appreciated in many other settings and scientific areas as well.

It is well understood that supervised learning tools fare better, as they use additional information which also supports validation of the results. Thus, in the context of microarray data, the use of information on the presence or absence of cancer should lead to better clustering; if so, where and how do we look for the ‘checkerboard’ structure? In the regression context the covariances between the response and the predictor variables play an important role, and thus the form of the regression coefficient matrix matters. We suggest two approaches. First, stack the ‘c’ coefficients in a matrix and use the SVD on it; we have the \(m \times n\) matrix,

$$\begin{aligned} C\equiv (c_1, \ldots ,c_m)'=U\Lambda V'=A\cdot B \end{aligned}$$
(11)

where the rows of U refer to the ‘m’ subjects and the rows of \(V'\) refer to combinations of the predictors, the ‘X’ variables in the model. When the design matrices \(X_i\) are identical, the solution to (11) can be related to reduced-rank regression, but when they are different, the calculation of \(A(m \times r)\) and \(B(r \times n)\) is not straightforward (see Chapter 7 in Reinsel and Velu [7]). We provide some essential details here.

With the model as stated in (1) and with the condition on the stacked coefficient matrix C as stated in (11), we want to estimate C under the condition

$$\begin{aligned} Rank(C)=r\le min(m,n) \end{aligned}$$
(12)

which implies that A and B are full-rank matrices. Notice that the model (1) can be rewritten as

$$\begin{aligned} Y_{i}=X_{i}B'a_{i}+e_{i} \end{aligned}$$
(13)

where each \(a_{i}\) is of dimension \(r \times 1\). Also notice that \(BX'_{i}\) provides the most useful linear combinations of the predictors, and the distinctions among the regression equations in (1) are reflected in the coefficients \(a_{i}\). The dimension reduction through reduced rank seems reasonable because of the ’similarity’ of the predictor variable sets (\(x_{i}\)) among the m different regression equations in (1). To move forward, we need to impose some normalization conditions (as in the normalization required on U and V in the SVD):

$$\begin{aligned} A'\varGamma A=I_{r}, B\hat{\varSigma }_{xx} B'=\varLambda ^2_{r} \end{aligned}$$
(14)

where \(\varGamma =\hat{\varSigma }^{-1}_{ee}\) and \(\hat{\varSigma }_{xx}=\frac{1}{mT}\sum _{i=1}^m X_{i}'X_{i}\). To set up the estimation criterion and the resulting estimates, we closely follow the details given in Reinsel and Velu [7].

Observe that the model (1) can be expressed in the vector form as

$$\begin{aligned} \mathbf {y}=\bar{X}\mathbf {c}+\mathbf {e} \end{aligned}$$
(15)

where \(\mathbf {y}= (Y_1', \ldots ,Y_m')'\), \(\bar{X} =Diag (X_1, \ldots ,X_m)\) and \(\mathbf {c}= vec(C')\) with \(\mathbf {e}= (e_1', \ldots ,e_m')'\), \(Cov(\mathbf {e})=\varSigma _{ee} \otimes I_{T}\). The generalized least squares (GLS) estimator is given as

$$\begin{aligned} \mathbf {\hat{c}}=[\bar{X}'( \varSigma ^{-1}_{ee} \otimes I_{T}) \bar{X} ]^{-1} \bar{X} '(\varSigma ^{-1}_{ee} \otimes I_{T} )\mathbf {y} \end{aligned}$$
(16)

The covariance matrix of \(\hat{c}\), which is useful for making inferences on c, is given as

$$\begin{aligned} Cov(\mathbf {\hat{c}})=[\bar{X}'( \varSigma ^{-1}_{ee} \otimes I_{T}) \bar{X} ]^{-1} \end{aligned}$$
(17)

The error covariance matrix is estimated by stacking the residuals \(\hat{e_{i}}=Y_{i}-X_{i}\hat{c_{i}}\) as \(\hat{e}=(\hat{e_{1}}, \ldots ,\hat{e_{m}})'\) and setting \(\hat{\varSigma }_{ee}=\frac{1}{T} \hat{e} \hat{e}'\).
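For moderate dimensions, (16) and (17) can be computed directly by forming the block-diagonal design; a dense sketch follows, with function and variable names of our own choosing (for large mT one would exploit the block structure rather than build the Kronecker product explicitly).

```python
import numpy as np
from scipy.linalg import block_diag

def gls_stacked(Y_list, X_list, Sigma_ee):
    """Full-rank GLS estimator (16) and its covariance (17) for model (15).

    Y_list : m response vectors of length T;  X_list : m design matrices (T x n)
    Sigma_ee : m x m contemporaneous error covariance matrix.
    """
    T = len(Y_list[0])
    y = np.concatenate(Y_list)                            # stacked response (length mT)
    X_bar = block_diag(*X_list)                           # Diag(X_1, ..., X_m)
    W = np.kron(np.linalg.inv(Sigma_ee), np.eye(T))       # Sigma_ee^{-1} (x) I_T
    G = X_bar.T @ W @ X_bar
    c_hat = np.linalg.solve(G, X_bar.T @ W @ y)           # equation (16)
    cov_c_hat = np.linalg.inv(G)                          # equation (17)
    return c_hat, cov_c_hat
```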

To obtain the estimates of A and B, we need to resort to iterative procedures, as there is no one-step solution as in the SVD. Denote \(\alpha =vec(A')\), \(\beta =vec(B')\) and \(\theta =(\alpha ',\beta ')'\). The criterion to be minimized is

$$\begin{aligned} S_{T}(\theta )=\frac{1}{2T} \cdot \mathbf {e}'(\varSigma ^{-1}_{ee} \otimes I_{T}) \mathbf {e} \end{aligned}$$
(18)

subject to the normalizing constraints in (14). Note that \(\mathbf {c}=vec(C')=(A \otimes I_{n}) vec(B')=(I_{m} \otimes B')vec(A')\), and so \(\mathbf {e}=\mathbf {y}-\bar{X}(A \otimes I_{n})\beta =\mathbf {y}-\bar{X}(I_{m} \otimes B')\alpha \). The first-order equations that result from minimizing (18) lead to the following iterative solutions: given \(\alpha \), solve for \(\beta \), and vice versa.

$$\begin{aligned} \begin{aligned} \hat{\alpha }=[\bar{X}(B)'( \varSigma ^{-1}_{ee} \otimes I_{T}) \bar{X}(B) ]^{-1} \bar{X}(B)'(\varSigma ^{-1}_{ee} \otimes I_{T} )\mathbf {y}\\ \hat{\beta }=[\bar{X}(A)'( \varSigma ^{-1}_{ee} \otimes I_{T}) \bar{X}(A) ]^{-1} \bar{X}(A)'(\varSigma ^{-1}_{ee} \otimes I_{T} )\mathbf {y} \end{aligned} \end{aligned}$$
(19)

In the above, \(\bar{X}(B)=\bar{X}(I_m \otimes B')\) and \(\bar{X}(A)=\bar{X}(A \otimes I_n)\).
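A rough sketch of the alternation in (19) is given below; it reuses the stacked quantities of (15), initializes A and B from the SVD of the unrestricted least squares coefficients, and for simplicity does not impose the normalization (14), which can be applied to the final A and B. All names are ours and the dense Kronecker products limit this sketch to small problems.

```python
import numpy as np
from scipy.linalg import block_diag

def alternate_A_B(Y_list, X_list, r, Sigma_ee, n_iter=25):
    """Alternate the two GLS solves in (19): update alpha = vec(A') given B,
    then beta = vec(B') given A.  The normalization (14) is not imposed here."""
    m, n = len(Y_list), X_list[0].shape[1]
    T = len(Y_list[0])
    y = np.concatenate(Y_list)
    X_bar = block_diag(*X_list)                                # mT x mn
    W = np.kron(np.linalg.inv(Sigma_ee), np.eye(T))            # Sigma_ee^{-1} (x) I_T

    # initialize from the SVD of the stacked least squares coefficient matrix C
    C0 = np.array([np.linalg.lstsq(X_list[i], Y_list[i], rcond=None)[0] for i in range(m)])
    U, S, Vt = np.linalg.svd(C0, full_matrices=False)
    A, B = U[:, :r] * S[:r], Vt[:r, :]                         # C0 is approximately A B

    def gls(Z):
        G = Z.T @ W @ Z
        return np.linalg.solve(G, Z.T @ W @ y)

    for _ in range(n_iter):
        alpha = gls(X_bar @ np.kron(np.eye(m), B.T))           # X_bar(B) in (19)
        A = alpha.reshape(m, r)                                # rows of A are the a_i
        beta = gls(X_bar @ np.kron(A, np.eye(n)))              # X_bar(A) in (19)
        B = beta.reshape(r, n)                                 # vec(B') stacks the rows of B
    return A, B
```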

If the design matrices \(X_{i}\) are the same, the solution is rather straightforward. We will comment on this model later.

Some observations are in order. Because \(c_{i}=B'\alpha _{i}\), from the estimation point of view the relation

$$\begin{aligned} X'_{i}Y_{i}=(X'_{i}X_{i})c_{i}=(X'_{i}X_{i})B'\alpha _i \end{aligned}$$
(20)

leads to some simplifications in the cluster ‘\(\theta \)’ estimates if we follow the mixture-model approach taken in Sect. 2. Before formulating the problem as finite mixtures with constraints on the regression coefficient matrix, we briefly discuss an approach for introducing sparseness into the estimated A and B matrices.

The first approach we discuss here is similar to Lee et al. [4], where a sparseness structure on the A and B matrices is introduced and the simplified structure is used for identifying both clusters of units and clusters of variables. While sparseness studies in the context of reduced-rank regression are of recent origin, many use norms other than the Frobenius norm. In order to keep the focus and for continuity, we will consider the Frobenius norm, and in that setting we discuss the decomposition of the C matrix with sparseness constraints. Chen and Huang [1] discuss simultaneous dimension reduction and variable selection. The penalized regression version imposes a group-LASSO type penalty that treats each ‘\(c_{k}\)’ as a group. Note that the range of ‘\(c_{k}\)’ over the various ‘k’ is not a linear space and has a certain manifold structure. The methodology provided in this paper is quite straightforward and is easy to implement.

The optimization methods related to reduced-rank regression exploit the bilinear nature of the rank decomposition in (11). Note that the coefficient matrix ‘C’ is bilinear in the component matrices ‘A’ and ‘B’, because given either one, the ‘C’ matrix is a linear function of the other. This leads to the simplified iterative solutions given in (19). Before we formulate the penalized version of the problem, observe from (15) that

$$\begin{aligned} \mathbf {y}=\bar{X} vec(C')+\mathbf {e}=\bar{X} (A \otimes I_{n})\beta +\mathbf {e} \end{aligned}$$
(21)

Using (21), the criterion \(S_T(\theta )\) in (18) can be restated in terms of approximating the full-rank estimate of the ‘C’ matrix by a matrix of reduced rank.

With (16), (17) and (21), the penalized version of the reduced-rank regression model can be stated as,

$$\begin{aligned} \underset{A,B}{Min} S_{T}(\theta ) + \lambda _{A} P_{1}(A)+ \lambda _{B}P_{2}(B) \end{aligned}$$
(22)

similar to what is given in (8) for the SSVD of the data matrix Y for one set of variables. As argued in Chen and Huang [1], we reduce the problem in (22) to minimizing over ‘B’ for a given ‘A’, thus resulting in some simplification. First note that minimizing the criterion \(S_{T}(\theta )\) in (18) under full rank for the ‘C’ matrix is equivalent to

$$\begin{aligned} \underset{\theta }{Min} S^{*}_{T}(\theta )=\frac{1}{2T} (\mathbf {\hat{c}}-\mathbf {c})'[\bar{X} '(\varSigma ^{-1}_{ee} \otimes I_{T} ) \bar{X}](\mathbf {\hat{c}}-\mathbf {c}) \end{aligned}$$
(23)

and thus for a given ‘A’, it is the same as,

$$\begin{aligned} \underset{B}{Min} S^{*}_{T}(\theta )+\sum _{i=1}^{n}\lambda _{i}\left\| \beta ^{i} \right\| \end{aligned}$$
(24)

where ‘\(\beta ^i\)’ denotes the \(i^{th}\) row of the ‘\(B'\)’ matrix written as a vector and the \(\lambda _i\)’s are positive penalty factors. The penalty is of the group-LASSO type, which implies that the \(i^{th}\) predictor can be dropped from the regression framework altogether. The formulation in (24) is more direct and shows how, even in the extended set-up, the problem simplifies to calculating the SVD of an appropriately weighted full-rank regression coefficient matrix. Because \(\mathbf {\hat{c}}-\mathbf {c}=\mathbf {\hat{c}}-(A \otimes I) \beta \), for a given A, (24) reduces to

$$\begin{aligned} \begin{aligned} \underset{\beta }{Min} [(A'\varSigma ^{-1}_{ee} \otimes I)\mathbf {\hat{c}}- \beta ]'[\bar{X} '(\varSigma ^{-1}_{ee} \otimes I_{T} ) \bar{X}]\\ [(A'\varSigma ^{-1}_{ee} \otimes I)\mathbf {\hat{c}}- \beta ]+\sum _{i=1}^{n}\lambda _{i}\left\| \beta ^{i} \right\| \end{aligned} \end{aligned}$$
(25)

Here \(\beta ^{i}\) is the transpose of \(B^{i}\) and denotes the \(i\)-th subvector of the ’\(\beta \)’ vector.

Solving (25) requires iterative procedures. Before we describe and apply these methods, we want to note certain simplifications that occur when the design matrices are the same, \(X_{i}=X\), as in the commonly used multivariate regression model. With \(\bar{X}=I_{m} \otimes X\), the optimization in (25) reduces to

$$\begin{aligned} \begin{aligned} \underset{\beta }{Min} [(A'\varSigma ^{-1}_{ee} \otimes I)\mathbf {\hat{c}}- \beta ]'[\varSigma ^{-1}_{ee} \otimes X'X]\\ [(A'\varSigma ^{-1}_{ee} \otimes I)\mathbf {\hat{c}}- \beta ]+\sum _{i=1}^{n}\lambda _{i}\left\| \beta ^{i} \right\| , \end{aligned} \end{aligned}$$
(26)

the criterion given in Chen and Huang [1]. The solution is easier to obtain because of the simplified structure of

$$\begin{aligned} \hat{c}=(\varSigma ^{-1}_{ee} \otimes X'X)^{-1}(\varSigma ^{-1}_{ee} \otimes X'Y)=I_m \otimes (X'X)^{-1}X'Y \end{aligned}$$
(27)

Although our model appears to be more complex, it is solvable by numerical routines.

The subgradient solution we suggest here follows Yuan and Lin [8]. The subgradient equations are:

$$\begin{aligned}{}[(A'\varSigma ^{-1}_{ee} \otimes I)\mathbf {\hat{c}}- \beta ]_{l}+\lambda _{l} s_{l}=0 \end{aligned}$$
(28)

where \([\cdot ]_{l}\) denotes the \(l^{th}\) subvector of \(\beta \), \(l=1,2,\ldots ,n\).

Here \(s_{l}=\beta ^{l}/\left\| \beta ^{l} \right\| \) if \(\beta ^{l}\ne 0\), and otherwise \(s_{l}\) is a vector with \(\left\| s_{l} \right\| _{2}<1\). When \(\left\| \beta ^{l} \right\| =0\), to obtain the subgradient equations, compute ‘\(\beta \)’ without the ‘\(l^{th}\)’ variable and go through the same computations; the soft-thresholded estimator is then given as:

$$\begin{aligned} \hat{\beta ^{l}}=\left( 1-\frac{\lambda _{l}}{\left\| s_{l} \right\| }\right) _{+}s_{l} \end{aligned}$$
(29)

In our experience, the subgradient algorithm converges in a few iterations (fewer than five) and is not computationally costly even for large datasets.
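The groupwise update can be illustrated with a small self-contained sketch in the spirit of Yuan and Lin [8]: blockwise coordinate descent in which each group’s coefficients are obtained by the soft-thresholding rule of (29). We state it for a generic least squares criterion with orthonormal within-group designs rather than for the weighted criterion (25); the helper names are ours.

```python
import numpy as np

def group_soft_threshold(z, lam):
    """Groupwise soft-thresholding as in (29): shrink the whole block toward zero."""
    nz = np.linalg.norm(z)
    return np.zeros_like(z) if nz <= lam else (1.0 - lam / nz) * z

def group_lasso_blockwise(y, X_blocks, lam, n_iter=50):
    """Blockwise descent for 0.5 * ||y - sum_l X_l b_l||^2 + lam * sum_l ||b_l||,
    assuming each X_l has orthonormal columns (X_l' X_l = I)."""
    betas = [np.zeros(X.shape[1]) for X in X_blocks]
    for _ in range(n_iter):
        for l, X_l in enumerate(X_blocks):
            # partial residual with the l-th group removed, then the update (29)
            r_l = y - sum(X_blocks[j] @ betas[j]
                          for j in range(len(X_blocks)) if j != l)
            betas[l] = group_soft_threshold(X_l.T @ r_l, lam)
    return betas
```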

There may be other ways to obtain this estimator, and research is underway to explore these methods.

5 Parsimonious Finite Mixtures Biclustering

The parsimonious modeling proposed in (22), where both the ‘A’ and ‘B’ matrices are shrunk through LASSO-type penalties, may be appropriate for reducing the number of coefficients, but it is not clear how it can be used for grouping. The finite mixture model has a natural appeal, as it can be used systematically to decide the number of clusters as well as the probabilities of each unit belonging to the various clusters. Recall that in the decomposition of the regression coefficient matrix ‘C’, the part ‘A’ represents the units’ side and the part ‘B’ represents the variables’ side. While the parsimonious or sparse representation may be appropriate, it is not clear how sparseness in ‘A’ can easily lead to clustering of observations. So we begin with model (13), where the ‘r’-dimensional ‘\(a_i\)’ corresponds to unit ‘i’. Assume that there are ‘K’ groups; for a given ‘B’, and thus given \(X^*_i=X_iB'\), we postulate that the \(a_i\)’s come from one of these groups. Thus, from (2),

$$\begin{aligned} \begin{aligned} f(y_i \;|\; X^*_i,\theta )&= \sum _{k=1}^K f(y_i \;|\; X^*_i,S_i,\theta ) f(S_i=k \;|\; \theta ) \\&= \sum _{k=1}^K \eta ^*_k f(y_i, X^*_i a_k, \varSigma _k),\\ \text {where} \quad \theta&=(\eta _1,\ldots ,\eta _K,a_1,\ldots ,a_K,\varSigma _1,\ldots ,\varSigma _K). \end{aligned} \end{aligned}$$
(30)

This implies that although ‘A’ has ‘m’ rows, based on the distances between the rows they can be reduced to ‘K’ distinct rows.

The EM algorithm (5) can be applied to estimate \(\theta \), conditionally on ‘B’; thus replace ‘\(X_i\)’ in (5) by \(X^*_i\). Now, given ‘A’, estimate ‘B’ via the second equation in (19). This process can be iterated back and forth between (5) and the estimation of ‘\(\beta \)’.

Sparseness: To introduce further sparseness in ‘B’, we can follow the logic given after equation (21). For a given ‘A’, use equation (25) and the subgradient method ((28) and (29)) to solve for ‘\(\beta \)’. With this, the joint modeling of clustering and dimension reduction are both achieved.
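The back-and-forth just described can be outlined in code as follows. For readability this self-contained sketch uses hard cluster assignments in place of the full EM posterior, an unpenalized update of B, and equal error variances; the EM-based and group-LASSO-based versions sketched earlier can be substituted step by step. All names are our own.

```python
import numpy as np

def mixture_bicluster(Y_list, X_list, r, K, n_iter=20, seed=0):
    """Alternate (i) clustering the unit-level coefficients a_i for fixed B and
    (ii) updating B for fixed cluster coefficients, as described in Sect. 5."""
    rng = np.random.default_rng(seed)
    m, n = len(Y_list), X_list[0].shape[1]

    # initialize B from the SVD of the stacked least squares coefficients
    C0 = np.array([np.linalg.lstsq(X_list[i], Y_list[i], rcond=None)[0] for i in range(m)])
    B = np.linalg.svd(C0, full_matrices=False)[2][:r, :]       # r x n
    labels = rng.integers(K, size=m)

    for _ in range(n_iter):
        # step (i): given B, form X*_i = X_i B' and fit a_i per unit, then cluster the a_i
        a = np.array([np.linalg.lstsq(X_list[i] @ B.T, Y_list[i], rcond=None)[0]
                      for i in range(m)])                      # m x r
        centers = np.array([a[labels == k].mean(axis=0) if np.any(labels == k)
                            else a[rng.integers(m)] for k in range(K)])
        labels = np.argmin(((a[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        A = centers[labels]                                    # m x r with K distinct rows

        # step (ii): given A, update beta = vec(B') by least squares;
        # the penalized step of (25), (28) and (29) could be used here instead
        Z = np.vstack([np.kron(A[i][None, :], X_list[i]) for i in range(m)])
        y = np.concatenate(Y_list)
        beta = np.linalg.lstsq(Z, y, rcond=None)[0]
        B = beta.reshape(r, n)
    return A, B, labels
```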

As one can see, the appropriate model selection is going to depend upon a number of quantities: the rank ‘r’ of the ‘C’ matrix, the number of clusters ‘K’ (the row rank of the ‘A’ matrix) and the other associated parameters. We will use a grid search via:

$$\begin{aligned} \begin{aligned} \text {BIC}(r,K)&= 2 \text {loglikelihood} - n \ln T \\ \text {BMV}(r,K)&= \max _{r,K}(\text {volume}(\hat{\varSigma }_{r,K})) \end{aligned} \end{aligned}$$
(31)

This approach is novel and combines both the clustering via finite mixtures and the dimension reduction via SVD. Its performance and its properties are to be studied in depth.
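A grid search over (r, K) using (31) can be organized as in the following skeleton. It assumes a fitting routine (here the hypothetical fit_model) returning the maximized log-likelihood and the cluster covariance estimates, and a user-supplied penalty multiplier for the BIC term (written as n in (31)); a pair (r, K) with a large BIC and a small BMV is then preferred.

```python
import numpy as np

def select_r_K(fit_model, r_grid, K_grid, n_pen, T):
    """Grid search over (r, K) using the criteria in (31).

    fit_model(r, K) is assumed to return (loglik, Sigma_hat_list), where
    Sigma_hat_list holds the estimated cluster covariance matrices;
    n_pen is the multiplier of ln T in the BIC formula."""
    results = {}
    for r in r_grid:
        for K in K_grid:
            loglik, Sigma_hats = fit_model(r, K)
            bic = 2.0 * loglik - n_pen * np.log(T)
            # volume measured by the largest eigenvalue, as in (4)
            bmv = max(np.linalg.eigvalsh(S).max() for S in Sigma_hats)
            results[(r, K)] = (bic, bmv)
    return results
```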

6 Numerical Illustration

The illustration given here is not meant to draw any serious implications for economic theory; our goal is eventually to relate the methodology to useful inferences about trading behavior in major stocks. Since the early 2000s, both academics and practitioners have paid more attention to the magnitudes of cross-sectional interactions among stocks. However, the study of commonality in short-horizon returns, order flows, and liquidity is still of interest in the microstructure analysis of equity markets. Hasbrouck and Seppi [3] note that both short-horizon returns and order flows are characterized by common factors. With this, two research questions have emerged.

Liquidity commonality can easily arise even when trading activity runs in different directions for different stocks, because sizable buy- or sell-motivated trading can strain liquidity. Commonality in returns, however, can arise from factors that are less firm-specific and more market-wide, for instance public information flows and correlation in order imbalances across the market, which may affect all stocks. Furthermore, commonality in order flows may be influenced by the differential liquidity of individual stocks as well as by other factors such as asymmetric information, idiosyncratic risks, transaction costs and other forms of market imperfections.

If commonality in stocks’ order flows accounts for the covariance structure of short-term returns, how should we characterize relationships involving commonality in both returns and order flows? Microstructure research focuses on how an individual asset price adjusts to new idiosyncratic information. If the market is efficient, new information would be disseminated and interpreted immediately by all market participants, and prices would quickly adjust to a new equilibrium value determined by the content of the information. But in practice, the price adjustment does not seem to be processed at the same speed for all stocks. Therefore, price discovery and order flow dynamics have a more complex relationship when we consider multiple assets at the same time.

We chose the Dow stocks as our sample because, first, the rapid pace of trading provides frequently updated prices and allows us to construct some high-frequency trading measures. Of the 30 firms in the index, we excluded Kraft Foods, for which data are not available for part of the period. Second, these 29 stocks, considered large-capitalization stocks, are normally categorized in the same style and traded mainly by institutional traders; that is, we can expect more correlated trading on these stocks, such as index arbitrage, dynamic hedging strategies and naive momentum trading.

Our sample covers 252 trading days in 2015. We establish a standard time frame for the data series using 15-min intervals covering 9:30–9:45, 9:45–10:00, ..., 15:45–16:00 for a total of 26 intervals per trading day. The 15-min time resolution represents a compromise between, on the one hand, needing to look at correlations in contemporaneous order flows across stocks at very short intervals and, on the other hand, requiring enough time for feedback effects from prices into subsequent order submissions. Such data smoothing is essential in handling high frequency data.

We define the stock return as the difference between the log end-of-interval quote midpoint and the log begin-of-interval quote midpoint. We also consider the following eight order flow measures: (1) total number of trades; (2) total share volume; (3) total dollar volume, using the log price as the stock price; (4) the square root of the dollar volume, defined as the sum of the square root of each trade’s dollar volume using the regular stock price; (5) signed trades, defined as the difference between buy trades and sell trades; (6) signed share volume, defined as the difference between buying share volume and selling share volume; (7) signed dollar volume, defined as the difference between buying dollar volume and selling dollar volume using the log price as the stock price; (8) the square root of the signed dollar volume, defined as the difference between the sum of the square root of each buy trade’s dollar volume and the same measure for each sell trade using the regular stock price. These measures are computed over the 15-min trading intervals and, as the stocks considered are high-capitalization stocks, the time intervals always have active observations. These measures generally reflect two dimensions of order flow (see Fig. 1). Table 1 presents the two corresponding eigenvectors and shows that the first eigenvector is dominated by “aggregate measures” and the second by “signed measures”. Thus the first component represents a ‘sum’ measure and the second component a ‘contrast’ measure, the contrast being between the buy and sell sides.

Fig. 1. Scree plot of the eight standardized order flow measures

Table 1. Eigenvectors corresponding to the two largest eigenvalues of the eight standardized order flow measures
Fig. 2. Histograms of the mean return and the means of the eight order flow measures for the 29 stocks

Figure 2 presents the histograms of the average returns and the averages of the eight order flow measures for the 29 stocks in the sample. Because these variables have significantly different orders of magnitude, we standardize them to have unit variance and to remove time-of-day effects. For a representative variable “z”, let \(z_{i,d,k}\) denote the observation from firm i on the k-th 15-minute subperiod of day d. Then the standardized variable becomes \(z_{i,d,k}^*=(z_{i,d,k}-\mu _{i,k})/\sigma _{i,k}\), where \(\mu _{i,k}\) and \(\sigma _{i,k}\) are the mean and standard deviation for firm i and subperiod k, estimated across days.
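This standardization can be carried out, for example, with pandas as sketched below, assuming a long-format data frame with columns for the firm, the day and the 15-minute interval; the column and function names are ours.

```python
import pandas as pd

def standardize_intraday(df, var, firm="firm", interval="interval"):
    """Remove time-of-day effects: for each firm and 15-minute interval,
    subtract the across-day mean and divide by the across-day standard deviation."""
    grouped = df.groupby([firm, interval])[var]
    return (df[var] - grouped.transform("mean")) / grouped.transform("std")

# example usage (hypothetical column names):
# df["ret_std"] = standardize_intraday(df, "ret")
```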

We run the following regression for each stock i and set a coefficient equal to 0 if it is insignificant at the 0.1 level; the resulting \(29 \times 8\) coefficient matrix is shown in Table 5:

$$\begin{aligned} r_{i,t}=\sum _{k=1}^{8} c_{k}x_{k,i,t}+e_{i} \end{aligned}$$
(32)

where \(r_{i,t}\) is the return for stock i at time t and \(x_{k,i,t}\) is the k-th order flow measure for stock i at time t. The coefficient matrix clearly indicates that not all order flow variables are significant. This is an ad hoc calculation that does not account for any commonality across the stocks; the methodology developed in this paper provides a more comprehensive framework.
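The construction of the coefficient matrix reported in Table 5 can be reproduced along the following lines, using statsmodels for the per-stock OLS fits; the returns and measures are assumed to be arranged as arrays, the function name is ours, and the zeroing threshold of 0.1 follows the text.

```python
import numpy as np
import statsmodels.api as sm

def coefficient_matrix(returns, flows, alpha=0.1):
    """Fit (32) stock by stock and zero out coefficients insignificant at level alpha.

    returns : array of shape (n_stocks, n_obs)
    flows   : array of shape (n_stocks, n_obs, 8) of standardized order flow measures
    """
    n_stocks, _, n_measures = flows.shape
    C = np.zeros((n_stocks, n_measures))
    for i in range(n_stocks):
        res = sm.OLS(returns[i], flows[i]).fit()      # no intercept, as in (32)
        coef = res.params.copy()
        coef[res.pvalues >= alpha] = 0.0              # keep only significant coefficients
        C[i] = coef
    return C
```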

Table 2. Two centroids generated by the K-means clustering (\(K=2\))
Fig. 3. Histograms of the coefficients with two clusters

First, financial theory suggests that there are two types of traders in the market, noise traders and informed traders; therefore, we simply use K-means clustering (\(K=2\)) to group the stocks based on the coefficient matrix. The two centroids are shown in Table 2. We observe that signed share volume and signed dollar volume differ substantially between the two centroids, which implies that these two variables represent important features for classifying stocks with different trading behaviors. Figure 3 supports this observation, because only the histograms of signed share volume and signed dollar volume have no overlap.
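The grouping itself is a short scikit-learn call once the coefficient matrix C (29 x 8) is available; a sketch, with names of our own:

```python
from sklearn.cluster import KMeans

def cluster_stocks(C, K=2, seed=0):
    """K-means on the rows of the coefficient matrix; returns labels and centroids."""
    km = KMeans(n_clusters=K, n_init=10, random_state=seed).fit(C)
    return km.labels_, km.cluster_centers_
```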

Table 3. Rank two approximation of coefficient matrix using SVD and SSVD algorithm (U matrix)

Furthermore, we compute the singular value decomposition (SVD) and the sparse singular value decomposition (SSVD) of the regression coefficient matrix as in (8), (9) and (10). Table 3 shows the results for the U matrix using the SVD and SSVD algorithms. When we use K-means clustering (\(K=2\)) to cluster the stocks based on the U matrix obtained from the SVD, we obtain the same clustering result as the one based on the whole coefficient matrix. This implies that the rank-two approximation may have captured most of the information in the structure of the coefficient matrix. From the elements of \(U^{(1)}\) from the SSVD, we can infer that there may be three groups.
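The rank-two SVD step and the clustering on the left singular vectors can be sketched as follows (the sparse variant would apply the SSVD routine discussed in Sect. 3 layer by layer); the names are ours.

```python
import numpy as np
from sklearn.cluster import KMeans

def svd_cluster(C, rank=2, K=2, seed=0):
    """Low-rank SVD of the coefficient matrix, then K-means on the rows of U."""
    U, S, Vt = np.linalg.svd(C, full_matrices=False)
    U_r, V_r = U[:, :rank], Vt[:rank, :].T                    # U and V of the low-rank layer
    labels = KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(U_r)
    return labels, U_r, S[:rank], V_r
```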

Table 4. Rank two approximation of coefficient matrix using SVD and SSVD algorithm (V matrix)

Table 4 presents the ‘V’ matrix using the SVD and SSVD algorithms; we notice that “signed share volume” and “signed dollar volume” dominate all the other variables in both the SVD and SSVD cases. Furthermore, it can be observed that more weight is assigned to these two variables when we use the SSVD algorithm.

Table 5. Estimated coefficients of the regressions (32) for all 29 stocks