Network-based diagnostic probability estimation from resting-state functional magnetic resonance imaging

: Brain functional connectivity is a useful biomarker for diagnosing brain disorders. Connectivity is measured using resting-state functional magnetic resonance imaging (rs-fMRI). Previous studies have used a sequential application of the graphical model for network estimation and machine learning to construct predictive formulas for determining outcomes (e.g., disease or health) from the estimated network. However, the resulting network had limited utility for diagnosis because it was estimated independent of the outcome. In this study, we proposed a regression method with scores from rs-fMRI based on supervised sparse hierarchical components analysis (SSHCA). SSHCA has a hierarchical structure that consists of a network model (block scores at the individual level) and a scoring model (super scores at the population level). A regression model, such as the multiple logistic regression model with super scores as the predictor, was used to estimate diagnostic probabilities. An advantage of the proposed method was that the outcome-related (supervised) network connections and multiple scores corresponding to the sub-network estimation were helpful for interpreting the results. Our results in the simulation study and application to real data show that it is possible to predict diseases with high accuracy using the constructed model.


Introduction
The evaluation of brain functional networks and network connectivity is an important approach for the study of brain disorders [1].Functional connectivity is measured using resting-state functional magnetic resonance imaging (rs-fMRI), which has a spatial-temporal 4-dimensional data structure with a voxel size of 64 × 64 × 49 for spatial and 128 time points in typical cases.Different patterns of functional connectivity can be used as biomarkers for disease diagnosis and assessment [2].Recent studies have employed statistical models and machine learning techniques to assess individual networks of anatomical brain areas in the context of several brain disorders, such as Alzheimer's disease (AD) [3], mild cognitive impairment (MCI) [4], a review of research on AD and MCI [5,6], autism spectrum disorder [7], schizophrenia [8], major depressive disorder [9], chronic insomnia disorder [10], and attention deficit hyperactivity disorder [11].
Prediction models have been an important topic in current research [12,13].[14] established a data-driven framework of connectome-based predictive modeling, which was utilized in the protocol proposed by [15].In recent studies related to the framework of connectome-based modeling, many found it useful to use neural networks.In [16], a deep 3D convolutional neural network (3DCNN) was trained on a large cohort of healthy subjects of a wide age range to produce a map representing the probability that a voxel belongs to a particular brain network.[17] developed an attention-based graphical neural network (GNN) framework to detect accelerated brain aging in AD patients.First, graph data were constructed from Pearson correlation matrices computed from rs-fMRI, and then GNN models were trained using the graph data to predict the brain age of HC, MCI patients and AD patients.Although there have been many studies on rs-fMRI, our study focuses on a different target, diagnostic aids and proposes a different analysis approach.In general, the input for predictive modeling is not the observed data itself, but the features, whose construction and selection is an important process in the analysis.There are several types of features, such as a mean time series of the regions of interest (ROI) [18], a graph of theoretical indicators such as small-worldness [19,20], and connectivity strength (edges) estimated from the mean time series of the ROI.In this study, edges were targeted for ease of interpretation and usefulness of prediction because of direct output from the networks.Several approaches have been used to estimate brain networks from rs-fMRI data, including pairwise Pearson's correlation analyses [21,22], partial correlation analyses [23,24], independent component analysis [25] and sparse regressions [26].Partial correlation coefficients are easily implemented and have been, often selected for this type of analysis.The graphical least absolute shrinkage and selection operator (glasso) [27,28] provided the sparse estimation of the partial correlation coefficients.This method is based on the inverse covariance matrix and allows for the choice of connections between regions and is useful for computational cost and interpretation of results.
Previous studies have employed the sequential application of outcome-independent network scores and machine learning to construct a prediction formula for outcomes (e.g., disease or healthy) from the estimated network.In the first step, the average time series within the ROI was computed from a multiple voxel time series within an anatomically defined region, for example, by anatomical automatic labeling.However, this approach provided a less informative network for diagnosis because the network was estimated independent of the outcome.Thus, we consider the sequential approach, in which network estimation is followed by regression analysis, to be potentially inconvenient.Such network estimation methods describe the relationships between brain regions based on a class of correlation coefficients, which do not necessarily include the relationship with the disease.This is unlikely to lead directly to increased accuracy in the diagnostic or predictive models targeted by this study.If the network estimation includes not only the association between brain regions but also the association with disease, the accuracy of diagnosis or prediction could be improved.
We aimed to develop a supervised network estimation method for use in a prediction formula.Additionally, we compared the proposed method with the existing sequential approach.Network estimation is performed by extending the supervised scoring method proposed by [29].While [29] uses simple linear combination scoring, network estimation in this framework requires hierarchical (multiple levels) scoring.At the lower level, network estimation is performed for each individual.In a data matrix consisting of multiple node time series (time points in rows, brain region node in columns), the regression model takes the equation form with one node as the objective variable (output values) and the remaining node time series as the explanatory variables (input values).The scores obtained in the lower level are further scored as a subject group in the upper level with information about the diagnosis given as a supervisor.Performing these processes in a single algorithm, it is expected that the network score would also include information about the outcome, and the resulting scores would be useful in improving the accuracy of the diagnostic or predictive model.
In this paper, the scoring methods and algorithms are introduced in Section 2.1.The evaluation of the diagnostic or predictive model accuracy is then planned in Sections 2.2 or 2.3, and the results are presented in Sections 3.1 or 3.2 for the simulation study and the real data application.

Methods
In the regression model of this study, the outcome was the response and the regional time series from rs-fMRI was the predictor.We proposed a regression method with rs-fMRI scores based on supervised sparse hierarchical components analysis (SSHCA).The SSHCA had a hierarchical structure consisting of a network model (block scores at the individual level) and a scoring model (super scores at the population level).Multiple super scores (components) were subsequently computed.A regression model with super scores as predictors was used to estimate the diagnostic probability.The methods in this study were implemented in the R programming environment using the latest version of the msma package.The mand package [30] was used to handle the display and other aspects of the brain imaging data.The proposed method was compared with existing methods through simulation studies, and its usefulness was investigated through real data analysis.

SSHCA
This section describes the score structure and the estimation method of the weights.For the estimation method, we first define the objective function and introduce the algorithm for obtaining its solution.The reasonableness of the algorithm is provided in the Appendix.Notation is given for all beginnings.Considering n subjects, T is the number of time points, and M is the number of nodes (ROIs).Subjects also have a univariate outcome, and the n-dimensional outcome vector is denoted by Z.
Our basic model was a hierarchical (multiblock) score structure divided into two parts: a population level and an individual level.The individual level could be further divided into two levels, individual bottom and individual top.The network was estimated at the individual bottom level, and the resulting scores were obtained at higher levels.We formulated the following score representation.First, consider the population level score s with the following multiblock (hierarchical) structures: where w 2 = (w 2,1 , . . ., w 2,M ) ⊤ is the weight vector with length M for n × M matrix S 2 with the m-th column s 2,m and the i-th (individual level) element is given as where w 3,i = (w 3,i (1), . . ., w 3,i (T )) ⊤ is the weight vector with length T for the m-th sub-block of i-th subject score s 3,i,m given by where w 4,i,m is the weight vector with length M − 1 for the m-th sub-blocks X i,(−m) = (x i,1 , . . ., x i,m−1 , x i,m+1 , . .., x i,M ), which is the data matrix X i except for the m-th column.The t-th element of s 3,i,m is also given as s 3,i,m (t) = X ⊤ i,(−m) (t)w 4,i,m .The optimal value of the weight w 4,i,m is obtained by maximizing n i=1 M m=1 cov(s 3,i,m , x im ) = n i=1 M m=1 cov(X i,(−m) w 4,i,m , x im ), which will be discussed in more detail later in the algorithm for finding the weights.This is an original way to create the network developed in this paper.The method is based on a regression model in which one node is removed from a data matrix consisting of a multi-node time series (rows: time points, columns: nodes), and it is used as the objective variable (output values) and the remaining node time series as explanatory variables (input values).It is similar to multiple regression analysis, which analyzes one-to-many relationships between nodes rather than one-to-one relationships like the correlation coefficient.Note that every node is considered to be a node that is the objective variable.
Figure 1 shows a graphical representation of score relationships.The network diagram on the left side of this figure is illustrated using simplified symbols with M=3 nodes.The data matrix can be written as X i = (x i,1 , x i,2 , x i,3 ).The upper network is a model in which (x i,1 is the objective variable and the remaining X i,(−1) = (x i,2 , x i,3 ) are explanatory variables.The lower network is a model in which (x i,2 is the objective variable and the remaining X i,(−2) = (x i,1 , x i,3 ) are explanatory variables.
Note that the score s could be regarded as population-level scores, and scores s 2,i were individuallevel scores.There were several types of scores with hierarchical structures, and corresponding weights had the following roles.The weights w 4,i,m represented the edge strength to the m-th node variable x m from the others, and the corresponding score s 3im represented the predictor for the node variable x m .The M × T matrix S 3,i consisting of these scores, reduced the time course by using weight w 3,i .The resulting M-dimensional vector s 2,i was used as a representative variable for individual i at the population level; then, at the population level, the score was computed again from these scores by using weight w 2 .This super score s was used in the prediction model such as the logistic regression model.
In recent years, dynamic network estimation has been widely used in brain image analysis.It is possible to extend the proposed method for such analysis.Because s 3,i,m (t) = j m x i, j (t)w 4,i,m, j , we can rewrite s 2,i,m as Thus, v i, j,m (t) = w 4,i,m, j w 3,i (t) could be interpreted as a dynamical relationship between the node j to the node m.In the scores given so far, the objective function for the weights to be estimated is given below.An objective function is defined for estimating the network and for summarizing the scores obtained from it and using them for diagnostic probability estimation.This is done in a hierarchical manner, and finally an objective function is considered that optimizes these simultaneously.When matrices X i were normalized by their column, the weight w = (w ⊤ 2 , w ⊤ 3 , w ⊤ 4 ) ⊤ , where w 3 = (w ⊤ 3,1 , . . ., w ⊤ 3,n ) ⊤ and w 4 = (w ⊤ 4,1,1 , . . ., w ⊤ 4,n,M ) ⊤ was estimated by maximizing the following function: At the population level, the scores obtained at the individual level are reduced to a single score per person by maximizing the score variance in order to include more information across subjects in the subject population and to reduce the number of nodes dimensionally.In addition to summarizing the individual-level scores in this way at the population level, the correlation between scores and outcomes is also incorporated into the objective function to make the scores useful for the prediction model.From this perspective, the sub objective function for the scoring model in the population level as follows.
where 0 ≤ µ ≤ 1 defines the proportion of the supervision.The weights were evaluated by maximizing the variance of the super scores s supervised by the outcome.In other words, it was obtained by maximizing the variance of the super scores and the covariance with the outcome with a trade-off.
At the individual level, there are two additional layers, with the upper layer summarizing within-subjects the network information obtained in the lower layer.The time series of scores per node obtained in the lower layer is reduced in the time domain by maximizing the score variance and reducing the number of time points, and the score per node is calculated within subjects.The lower layer uses an objective function for network estimation that maximizes the covariance between the linear combined score time series values of one node value and the other node values for each individual.From this perspective, the subobjective functions for the network models in the individual level were as follows.
Note that L 02 (w) was a (sub) objective function for the score with temporal (time course) reduction, and L 03 (w) was a (sub) objective function for the score to construct the network.
where P λ (x) was the penalty function (P λ (x) = 2λ|x| in this study), and λ > 0 the regularized parameter.The function P λ (x) is defined for a scalar input x, but for a vector x it is defined as The algorithm for maximizing equation 2.1 is given as follows.The rationale is provided in the Appendix.As defined above, the m-th sub-blocks X i,(−m) is the data matrix X i except for the m-th column.
3) (Deflation step) Set p 3,i,m = x ⊤ i,m s 3,i,m /s ⊤ 3,i,m s 3,i,m and p 3,i = (p 3,i,1 , . . ., p 3,i,M ), X i are deflated by X i ← X i − S 3,i p ⊤ 3,i for i = 1, . . ., n. Start again from step 1 and repeat for the given number of times (number of components).
Note that the deflation steps yield multiple components and several alternatives.Extracting multiple components in this way corresponds to multiple estimations of the network, which means that the proposed method can decompose the network.There were several derivations of the parameter update method in Step 2(a).The method that used the update formula written in Step 2(a) was called SSHCA-corde (coordinate updating).Next was the update formula in a form that did not include any weights other than w 4,i,m , and the method that used h λ 4,i,m (X ⊤ i,(−m) x i ) as the update formula was called SSHCA-corde.ind(coordinate updating with independent network estimation).The method of estimating w 4,i,m using the glasso method was called SSHCA-glasso (coordinate updating with independent glasso network estimation).All these derivations are described in the Appendix and were compared in simulation studies and real data analysis.
The larger value of the regularization parameter λ had many non-zero elements in the weight w values.Its optimal value was selected by minimizing the Bayesian information criterion (BIC).It is denoted by is the number of effective parameters and depends on the value of λ.In the following, the regularization parameters are simplified such that λ 3,i = 0 and λ 4,i,m = λ 4 to avoid redundancy in the calculation.

Simulation study
The proposed method was evaluated and compared with the sequential approach using synthetic data.The total sample size was n = 50 and 100.The true graph had 50 and 100 nodes with edges that were randomly generated with 5 and 20 difference edges between the two groups (n/2 sample size per group).Multivariable data with a time length of 100 for the individual were generated as random numbers with a correlation structure using partial correlation coefficients based on the true graph.Then, the actual indicators Z for the case or control were generated by using the binomial random number with the probability being the logistic transformation of the partial correlation coefficients.
The resulting data set was a 100 × 20 matrix X i (i = 1, 2, . . ., n).The parameters in the proposed method were set as µ = 0, 0.5 and 1.As explained in the previous section, there were three types of proposed SSHCA methods: SSHCA-corde, SSHCA-corde.indand SSHCA-glasso.The glasso method was used for network estimation in comparisons.The strength of the edge (penalized estimated partial correlation coefficient) was used as an explanatory variable for prediction using glasso.The machine learning methods: generalized linear model (glm), glmnet, support vector machine (svmRadial), random forests (rf) and neural networks (nnet) were applied for diagnostic probability estimation.This application is also reviewed in the discussion as a consideration.The hyperparameters for machine learning were chosen based on five repeated 10-fold cross-validations.Both estimated networks and diagnostic probabilities were evaluated using receiver operating characteristic (ROC) analysis.In the network estimation, the selected edges were evaluated in the case and control groups.The number of iterations for the above procedures was 50 (the number of simulated data sets).

Real data analysis
The proposed method was applied to real data from the Alzheimer's Disease Neuroimaging Initiative (http://adni.loni.usc.edu/), a collection of imaging data from 50 subjects at baseline with a mean age of 75 years for 23 healthy subjects and 72.9 years for 27 patients with early MCI (eMCI).Z was a binary variable for Normal, or eMCI.Table 1 summarizes the characteristics of the patients.The Data Processing Assistant for Resting-State fMRI (DPARSF) was used to perform rs-fMRI preprocessing, slice timing, realignment, normalization, smoothing, detrending and band path filtering.The resulting data set contained 90 ROIs and 130 time points for each subject.The estimated diagnostic probability was evaluated using ROC analysis.The sensitivity, specificity and area under the curve (AUC) were computed for 20 iterations by taking 70% of the samples randomly, training them and then predicting and evaluating the remaining 30% as a validation set.We compared the proposed method to the existing sequential approach with a glasso as the network estimation, popular machine learning methods (glm, glmnet, svmRadial, rf, nnet as in the simulation study) as the prediction model, and the unsupervised version of our method.The hyperparameters for machine learning were chosen based on five repeated 10-fold cross-validations.

Simulation study
The results in Table 2 are for the following settings: the number of subjects (nsample) is 50, the number of edges (nedge) is 100 and the nedgedif of the edges is 5 and 20.The proportions of nedgedif to nedge were 5 and 20%, respectively.The proposed SSHCA method used four components and the supervision parameters µ = 0, 0.5 and 1.The results for the SSHCA were the area under the ROC curve (pathauc), which is an evaluation index for the true graph structure, the area under the ROC curve (scorecvauc), which is an evaluation index for disease discrimination, and the average of pathauc and scorecvauc (allmean) for each µ.For scorecvauc, the results obtained for each of the five prediction methods glm, glmnet, svmRadial, rf and nnet were averaged.
When µ was changed, the values of pathauc and scorecvauc were both higher when µ = 1.The AUC values were not better when the glasso weights, the method of the previous study, were used directly for prediction.It was no better even in the case of the SSHCA method with µ = 1.The pathauc was higher when the network estimation was done independently (SSHCA-corde.indor SSHCA-glasso).Furthermore, pathauc was higher when glasso was used for network estimation (SSHCA-glasso), but there was no significant difference when SSHCA-corde.ind was used.Focusing on scorecvauc, the method that performs network and score estimation simultaneously (SSHCA-corde) had the highest value.However, the scorecvauc of SSHCA-glasso was not comparable, and the average allmean was the largest.SSHCA-corde.indoutperformed SSHCA-glasso in scorecvauc.3 shows the results for nsample = 50, nedge = 50 and nedgedif = 5 and 20.The nedge was changed from 100 (in Table 2) to 50.The proportions of nedgedif to nedge were 10 and 40%, respectively.
As the number of nedges decreased, the results were generally better; the percentage of nedgedif was not relevant; the network estimation of glasso was much better, but the predictive power was not very high.The results for the nsample = 100 case are included in the Appendix, but the pattern was the same as for these nsample = 50 cases.In addition, a comparison of the results for each regression model is illustrated in the Appendix.The results show no significant differences among the regression models.The scoring may ensure some high degree of predictive accuracy.

Real data analysis
We apply the method to real data of AD described in Section 2.3.The network was estimated using three SSHCA methods (corde, corde.indand glasso) and by the glasso.The results of predicting eMCI are shown in Table 4 as sensitivity, specificity and AUC.We used glm, glmnet, svmRadial, rf and nnet for machine learning, as in the simulation study, and the resulting values are the average among values from those machine learnings.The AUC was higher for the proposed SSHCA method than for the glasso method.For SSHCAcorde, the AUC was high, even for µ = 0.75.The highest AUC was 0.881 for SSHCA-corde.ind.Therefore, next we looked closer at the network used in the estimation for the case of SSHCA-corde.ind(µ=1), which had the highest AUC value.The three components were estimated, as shown in Figure 2. The super scores of these networks were significant in the univariate logistic regression model.
We examined which of the networks estimated by each component was closest to the networks examined in previous studies.For the reference network, we used the Yeo 17 network, which is stored in the R package brainGraph and has 17 networks.For each network, we computed the AUC at the edge of the estimated network, and the one with the highest AUC was the one that was most closely related to the network estimated for that component.
The visual network and default mode networks were selected as the significant components in the univariate logistic regression, as shown in Figure 3.As listed in [5], many studies have reported the association between the default mode network and AD, and the results of the present analysis were also reasonable.

Discussion
We aimed to characterize brain function based on data measured by fMRI at rest as a time series of voxels arranged in three spatial dimensions and presented a novel regression method based on supervised sparse hierarchical component analysis (SSHCA) with a hierarchical structure consisting of a network model (individual-level block scores) and a scoring model (population-level super score).In addition, the (supervised) network connections associated with the outcomes and the multiple scores corresponding to their subnetwork estimates facilitate data interpretation.We estimated the functional networks between brain regions of each individual and applied discriminant analysis methods such as machine learning as a biomarker to assist diagnosis.The proposed score showed good disease prediction accuracy in both numerical experiments and real data analysis, and reasonable results were produced by the real data analysis.
In this study, a SSHCA method for constructing prognostic risk scores was proposed.The method could be run on the latest version of the R package msma and it was characterized by supervised learning to improve the prediction accuracy for scoring of the estimated network.The brain time series images had a spatio-temporal structure per person, and the spatial structure was transformed into a network structure, and the scoring process had a hierarchical structure.At the lower level, brain networks are estimated for each individual, and at the upper level, they are integrated to enable group analysis.
Moreover, a method to break up the hierarchical structure and make the network estimation independent was considered.A score was created after the network was given.Because the glasso method was useful and faster for network estimation, we incorporated the glasso into our algorithm to estimate the network.The method of calculating the score while estimating the network tended to be more accurate in predicting the score than the method of calculating the score after the network estimation was completed.Because their prediction accuracies were not very different, we considered that the independent estimation method offered a more precise network estimation.Furthermore, as the score is decomposed into multiple components like principal component analysis, data-driven network decomposition is possible and networks useful for diagnosis can be selected.Thus, the score structure of the proposed method may allow for more detailed interpretation of the analysis results.
Despite these advantages, setting tuning parameters remains a challenge.Theoretically, it is possible to set many parameters to adjust sparseness, but this must be restricted if the sample size for training is small.In practice, we applied a simplification but it may be an open question as to how many parameters to set.Moreover, the same may be said for the tuning parameters used to adjust the degree of supervision.In the real data used in this study, the difference in SSHCA scores between the disease group and the healthy group was small and difficult to discriminate.This may be the reason why the tuning parameter µ = 1 was chosen in the proposed method.It will be a challenge to investigate this in various stages of AD progression.
Although there are many machine learning methods, the focus of this study was to determine if the proposed network scores were useful as features.Although limited methods were applied for this reason, the simulation study and the analysis of the real data showed that all the methods produced scores that could be predicted with a certain degree of accuracy.In view of this, it was possible to estimate disease prediction probabilities with good disease prediction accuracy using simple methods, such as a multiple logistic regression model (with variable selection) when using the proposed scores.Such a simple model has been used in many clinical studies because they make interpretation of results easier and may be very useful for interpreting results without discussing explanatory possibilities in complex models.
Neural networks have been developed in recent years, and their deep learning has become increasingly useful in neuroimaging [16].The graph neural network is specialized to perform the network analysis targeted in this study.This method takes a given network as input and requires the network to be estimated a priori.In the latest research [17], Pearson's correlation coefficient is used to estimate the network first, and then graph neural networks are applied.We attempted to improve upon the sequential approach used in existing methods, in which a regression analysis is performed after network estimation.Such a network estimation method based on correlation coefficients describes relationships between brain regions, and these relationships do not necessarily include the relationship with diseases.This is unlikely to directly improve the accuracy of the diagnostic or predictive models targeted in this study.It is expected that the accuracy will be improved if not only the network estimation but also the relationship with the disease is included in the scoring.The results of numerical experiments and applications in this paper show that scoring has a certain accuracy in diagnosis and network estimation simultaneously, and moreover, there was not much difference in terms of prediction accuracy between neural networks or other machine learning methods and simple logistic regression analysis.In general, neural networks are more likely to have complexity and still need stability in terms of interpretability.On the contrary, the scores of the proposed method have a linear structure, and the application of a linear logistic regression model to them ensures prediction accuracy, which is advantageous in that it preserves the interpretability of linearity.Nevertheless, the application of the proposed scores to graph neural networks is interesting and could be a future challenge.
The proposed method has many potential extensions, and it could be used to estimate directed networks with arrows between nodes.It could also be used for dynamic modeling, as mentioned in the methods section.To interpret the results, we needed to simplify the method and have discussions with experts, which is beyond the scope of this study.The scoring used in this method could be incorporated into the multiblock method by [29].This method enabled us to evaluate the relationship between rs-fMRI and other brain images, such as structural MRI, while scoring.This is called multimodal analysis and is one of the most important analyses in brain image analysis [31].This has not been developed within the framework of the aforementioned graph neural networks, which is another useful aspect of our method.Scoring could be considered as a dimension reduction and could contribute not only to discrimination, but also to subtype classification by clustering.[32] performed network clustering on genetic data and [33] analyzed the relationship between structural MRI and estimated networks from non-imaging data such as CSF and blood biomarkers.Thus, this is expected to be a method of analysis that can be used to develop many brain studies.
Our method is applicable to other medical data as well, and may be useful for general clinical data, education and social medicine.Since our method has an element of dimension reduction, it is useful in high-dimensional data analysis, and a representative area is genetic data, which was also the subject of analysis in [32].In recent years, with the development of measuring instruments, especially single-cell RNA-sequencing (scRNA-seq) measurements, it has become possible to make more precise measurements, identify known novel cell types and characterize gene-gene interactions within each cell type.Because of the heterogeneity revealed in scRNA-seq data among various cell types in the same tissue, cell-type-level gene networks are expected to reveal gene-gene interactions that have not been revealed in previous tissue-level gene networks.As described in [34], the method of analysis is closely related to the functional brain networks targeted in this study.Many statistically challenging issues have been pointed out, and it may be possible to develop the framework of our method toward these issues.

Conclusions
We developed a method to estimate diagnostic probability from rs-fMRI data using supervised and data-driven (sub) network estimation.The scoring method and its algorithm were introduced and simulation analysis and application to real brain imaging data revealed that the regression model with the created scores could predict diseases with higher accuracy.There are several potential extensions of this method, and future work is to apply it to various diseases and obtain new medical knowledge.Our method can assist in the construction of brain disease biomarkers from functional imaging data.top score.s 2,i,m = T t=1 s 3,i,m (t)w 3,i (t) = s w 3,i with w 3,i = (w 3,i (1), w 3,i (2), . . ., w 3,i (T )) ⊤ and the individual bottom score s 3,i,m (t) defined by s 3,i,m (t) = x ⊤ i,(−m) (t)w 4,i,m and w 4,i,m is the n-dimensional weight vector.The hierarchical structure can be seen by writing down the scores as follows.

A.1. Optimization function
First, consider Eq 2.1 as the optimization problem max L 0 (w) subject to ∥w 2 ∥ 2 = 1,∥w 3,i ∥ 2 = 1, ∥w 4,i,m ∥ 2 = 1 where L 0 (w) = {L 01 (w 2 , w 3 , w 4 ) − P 2,λ 2 (w 2 )} + {L 02 (w 3 , w 4 ) − P 3,λ 3 (w 3 )} + {L 03 (w 4 ) − P 4,λ 4 (w 4 )} The first term on the right side of L 0 (w) is given by L 01 (w 2 , w 3 , w 4 ) = (1 − µ) × cov(s, u) + µ × cov(s, Z) with 0 ≤ µ ≤ 1.Note that we actually considered the covariance of s (i.e., the variance of s), but the algorithm used an iterative calculation similar to the principal component analysis, where one score was fixed and the weight for the other score was calculated, then reversed and the weight was calculated again.For this purpose, we prepared a vector u which has the same length as s.The covariance is defined as follows: Thus, we see that L 01 (w 2 , w 3 , w 4 ) depends on w 2 , w 3 and w 4 .The function L 02 (w 3 , w 4 ) in the second term on L 0 (w) is given by where s 2,i is the M-dimensional vector with the m-th element s 2,i,m . in the first term, prepared a vector u 2,i of the same length as s 2,i for the iterative calculation.The covariance is given by cov(s 2,i , u 2,i ) = s ⊤ 2,i u 2,i = w ⊤ 3,i S ⊤ 3,i u 2,i where S 3,i is the T × M matrix with the (t, m)-element s 3,i,m (t) = x ⊤ i,(−m) (t)w 4,i,m .Thus, we see that L 02 (w 3 , w 4 ) depends on w 3 and w 4 .The function L 03 (w 4 ) in the third term on L 0 (w) is given by where s 3,i,m is the T -dimensional vector with the t-th element s 3,i,m (t) = x ⊤ i,(−m) (t)w 4,i,m , and the is covariance defined as cov(s P λ (x) = 2λ|x| and P λ (x) = 2λ j |x j | = j P λ (x j ).
A population weight w 2 for X is obtained by considering the following objective function: From the Lemma, w 2 such that L 11 (w 2 ) is maximized is as follows.
Similarly, an individual top weight w 3 for X is obtained by considering the objective function: The individual top weight w 3 depends on functions L 01 (w 2 , w 3 , w 4 ) and L 02 (w 3 , w 4 ).We write this expression for each element i.
From the Lemma, w 3,i such that L 121 (w 3,i ) is maximized is as follows.
A individual bottom weight w 4 for X is obtained by considering the following objective function: The individual bottom weight w 4 depends on functions L 01 (w 2 , w 3 , w 4 ), L 02 (w 3 , w 4 ) and L 03 (w 4 ).There are several possible optimizations for w 4 .Since the first is the so-called the coordinate descent method, we refer to it as "corde" for short.For "corde", we rewrite L 13 (w 4 ) by each elements i and m.Thus, the final estimate is obtained by normalizing ŵ4,i,m = w4,i,m /∥ w4,i,m ∥.
the expression L 01 (w 2 , w 3 , w 4 ) + L 02 (w 3 , w 4 ) + L 03 (w 4 ) of the objective function L 13 (w 4 ) for w 4 , the upper level weights w 2 and w 3 are dependent.Ignoring this dependency, we consider an objective function with only w 4 , then the alternative object function for the bottom weight w 4 is given as follows.
The "corde" was based the both functions L 131 (w 4 ) and L 132 (w 4 ).If the network estimation was to be implemented independently, then a graphical lasso solution would be available for the network estimation.The "glasso" is based the only function L 132 (w 4 ).For the variance-covariance matrix Σ i and the sample variance-covariance matrix Σi = X ⊤ i X i , the subfunction L 1323 (w) of objective function L 132 (w 4 ) is given as follows.

A.1.2. Additional simulation results
Here, we showed the results when nsample was changed to 100 using the same settings as in simulation study with results on the Tables 2 and 3.As nsample was increased, the overall value became better.
Other results of the simulation study are illustrated in Figure A2.The results for the regression models were averaged in the text, but here we have illustrated the results for each regression model when averaged over the other simulation parameters.The models were generalized linear model (glm), glmnet, support vector machine (svmRadial), random forests (rf) and neural networks (nnet).The results show that the prediction accuracy is not so different among the regression models.

Figure 1 .
Figure 1.Hierarchical score structure.The hierarchical score structure was divided into two parts: a population level and an individual level.The individual level could be further divided into two levels, individual bottom and individual top.

Figure 2 .
Figure 2.Estimated networks using the SSHCA method where p values were from the logistic regression model.

Figure 3 .
Figure 3. Related reference networks.The most closely related networks estimated for that component are displayed.

Table 1 .
characteristics for real data

Table 4 .
Real data analysis results.