Systemize the Probabilistic Discrete Event Systems with Moore-Penrose Generalized-inverse Matrix Theory for Cross-sectional Behavioral Data

Moore-Penrose (M-P) generalized inverse matrix theory provides a powerful approach to solving an admissible linear-equation system when the inverse of the coefficient matrix does not exist. M-P matrix theory has been used to solve challenging research questions in different areas, including operations research, signal processing, and system control. In this study, we report our work to systemize probabilistic discrete event systems (PDES) modeling for characterizing the progression of health risk behaviors. A novel PDES model was devised by Lin and Chen to extract and investigate longitudinal properties of multi-stage smoking behavioral progression from cross-sectional survey data. Despite its success, this PDES model requires extra exogenous equations to be solvable and practically implementable. However, exogenous equations are often difficult, if not impossible, to obtain. Even if the additional exogenous equations are derived, the data used to generate them are often error-prone. By applying M-P theory, our research demonstrates that Lin and Chen's PDES model can be solved without using exogenous equations. For practical application, we demonstrate the M-P approach using the open-source R software with real data from the 2000 National Survey on Drug Use and Health. Removing the need for extra data makes it easier for researchers to use the novel PDES method in examining human behaviors, particularly health-related behaviors for disease prevention and health promotion. The successful application of M-P matrix theory in solving the PDES model suggests the potential of this method in system modeling to solve challenging problems in other medical and health-related research.
Journal of Biometrics & Biostatistics


Background
The Moore-Penrose (M-P) generalized inverse matrix theory [1,2] provides a powerful tool for solving a linear equation system that cannot be solved by using the inverse of the coefficient matrix. Although M-P matrix theory has been used to solve challenging problems in operations research, signal processing, system control, and various other fields [3][4][5][6][7], to date this method has not been used in health and human behavior research. In this study, we report our work to solve a model, based on probabilistic discrete event systems, that characterizes cigarette smoking behavior among an adolescent population in the United States.
To extract and model the longitudinal properties of a multi-stage behavioral system, such as cigarette smoking, with cross-sectional survey data, Chen et al. [8][9][10] developed the probabilistic discrete event systems (PDES) modeling approach. In this approach, the continuous development process of a behavior (such as cigarette smoking or disease progression) is first conceptualized as a PDES with multiple states. These states describe the multiple stages of logical behavioral progression, with transition paths linking one state (stage) to another [8][9][10]. This model has been successfully used to describe the dynamics of cigarette smoking behavior [8,9] and the responses to smoking prevention intervention among adolescents in the United States [11]. Despite this success, the established PDES method has a limitation: the model cannot be determined without extra exogenous equations. Furthermore, such exogenous equations are often impractical to obtain, and even if an equation is derived, the data supporting its construction may be error-prone.
To overcome this limitation of the PDES modeling method, we propose using the M-P inverse matrix method, which can solve the established PDES model without the exogenous equation(s) otherwise needed to create a full-rank coefficient matrix. The combined approach of M-P inverse matrix theory with PDES (or "M-P Approach" for short) will increase the efficiency and utility of PDES modeling in investigating many dynamics of human behavior without fully observed data. To facilitate the use of the M-P Approach, an R program with examples and data is provided in Appendix A for interested readers to apply to their own research data.

A Review of the PDES for Smoking Behavior
To be self-contained, we use the notation of Lin and Chen [10] in this paper to describe the PDES model. According to Lin and Chen [10], in estimating the transitional probabilities with cross-sectional survey data to model multi-stage smoking behavioral progression (Figure 1), five behavioral states are defined to construct a PDES:
• NS - never-smoker, a person who has never smoked by the time of the survey.
• EX - experimenter, a person who smokes but not on a regular basis after initiation.
• SS - self-stopper, an ex-experimenter who stopped smoking for at least 12 months.
• RS - regular smoker, a smoker who smokes on a daily or regular basis.
• QU - quitter, a regular smoker who stopped smoking for at least 12 months.
The smoking dynamics shown in Figure 1 can be described using the PDES model
$G = (Q, \Sigma, \delta, q_0)$ (1)
where $Q$ is the set of discrete states; in the smoking behavior model of Figure 1, $Q = \{NS, EX, SS, RS, QU\}$. $\Sigma$ is the set of events; in Figure 1, $\Sigma = \{\sigma_1, \sigma_2, \ldots, \sigma_{11}\}$, where each $\sigma_i$ is an event describing a transition among the multiple smoking behaviors. For example, $\sigma_2$ is the event of starting smoking. $\delta: Q \times \Sigma \to Q$ is the transition function describing which event can occur at which state and the resulting new state. For example, in Figure 1, $\delta(NS, \sigma_2) = EX$. $q_0$ is the initial state; for the smoking behavior model in Figure 1, $q_0 = NS$. With a slight abuse of notation, we also use $q$ to denote the probability of the system being in state $q$ and $\sigma_i$ to denote the probability of $\sigma_i$ occurring. Therefore, $NS$ also denotes the probability of being a never-smoker and $\sigma_2$ also denotes the probability of starting smoking. If it is important to specify the age, we use $a$ to denote age. For example, $\sigma_2(a)$ denotes the event or the probability of starting smoking at age $a$.
Based on the defined PDES model shown in Figure 1, the following equation set can be defined conceptually:
$NS(a+1) = NS(a) - NS(a)\,\sigma_2(a)$ (2)
$EX(a+1) = EX(a) + NS(a)\,\sigma_2(a) + \ldots$ (3)
$\vdots$
$QU(a+1) = QU(a) + RS(a)\,\sigma_9(a) - QU(a)\,\sigma_{10}(a)$ (6)
For example, Equation (2) states that the percentage of people who are never-smokers at age a+1 is equal to the percentage of people who are never-smokers at age a, minus the percentage of never-smokers at age a times the percentage of never-smokers who start smoking at age a. The other equations can be explained similarly. Furthermore, we have the following additional equations with respect to Figure 1.
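The verbal reading of Equation (2) can be checked with a few lines of R. This is only a sketch: the two never-smoker proportions are hypothetical values chosen for illustration, not NSDUH estimates, and the helper name `estimate_sigma2` is ours.

```r
# Equation (2): NS(a+1) = NS(a) - NS(a) * sigma2(a)
# Rearranged for the unknown: sigma2(a) = 1 - NS(a+1) / NS(a)
estimate_sigma2 <- function(ns_now, ns_next) {
  1 - ns_next / ns_now
}

# Hypothetical never-smoker proportions at ages a and a+1 (illustration only).
ns_a      <- 0.60
ns_a_next <- 0.51

sigma2_a <- estimate_sigma2(ns_a, ns_a_next)

# Substituting the estimate back into Equation (2) should reproduce NS(a+1).
ns_check <- ns_a - ns_a * sigma2_a

print(sigma2_a)
print(ns_check)
```

Equation (2) involves a single unknown and can be solved on its own; the difficulty discussed next arises because the remaining state equations share several unknowns.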
The above 10 equations, Equation (2) through Equation (11), can be cast into matrix form as Equation (12), denoted by $A\sigma = b$, where $A$ is the coefficient matrix, $\sigma$ (in bold) is the solution vector, and the vector $b$ is the right-hand side of Equation (12).
It can be shown that $\mathrm{rank}(A) = 9$. Therefore, among the 10 equations, only 9 are independent. However, there are 11 transitional probabilities, $\sigma_1(a), \sigma_2(a), \ldots, \sigma_{11}(a)$, to be estimated. Therefore, the PDES equation set (12) cannot be solved uniquely, as indicated in Lin and Chen [10]. This condition restricts the application of this novel approach in research and practice.
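The rank argument can be illustrated on a small system. The sketch below uses a hypothetical 3 x 3 toy matrix (not the PDES coefficient matrix of Equation (12)) whose third row is the sum of the first two, and checks admissibility by comparing the rank of the coefficient matrix with the rank of the augmented matrix.

```r
# Toy coefficient matrix (hypothetical): row 3 = row 1 + row 2,
# so only two of the three equations are independent.
A <- rbind(c(1, 1, 0),
           c(0, 1, 1),
           c(1, 2, 1))
b_ok  <- c(1, 1, 2)   # consistent: third entry is the sum of the first two
b_bad <- c(1, 1, 3)   # inconsistent right-hand side

rank_A     <- qr(A)$rank
rank_A_ok  <- qr(cbind(A, b_ok))$rank
rank_A_bad <- qr(cbind(A, b_bad))$rank

# A system is admissible (has at least one solution) when
# rank(A) equals the rank of the augmented matrix [A | b].
admissible_ok  <- rank_A == rank_A_ok
admissible_bad <- rank_A == rank_A_bad

# The ordinary inverse does not exist, so solve() signals an error.
inverse_attempt <- tryCatch(solve(A), error = function(e) "singular")

print(c(rank_A = rank_A, unknowns = ncol(A)))
print(c(admissible_ok = admissible_ok, admissible_bad = admissible_bad))
```

With fewer independent equations than unknowns, an admissible system has infinitely many solutions, which is the situation described for Equation (12).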
To solve this challenge, Lin and Chen [10] sought to derive two more independent equations by squeezing the survey data to define two additional progression stages: (1) old self-stoppers (e.g., those who stopped smoking one year ago) and (2) old quitters (e.g., those who quit smoking one year ago). With data for these two newly defined types of smokers, two more independent equations were obtained so that the equation system has a definite solution. However, introducing these two types of smokers may also have brought in more errors from the data, because they must be derived from data recalled over a period one year longer than the other data. If this is the case, errors introduced through these two newly defined types of smokers will affect the estimated transitional probabilities that are related to self-stoppers and quitters, including $\sigma_3$, $\sigma_4$, $\sigma_5$, $\sigma_6$, $\sigma_{10}$, and $\sigma_{11}$ (Figure 1). When searching for methods that can help to solve Equation (12) without depending on the two additional equations, we found the generalized inverse matrix approach [1,12]. It is this "M-P Approach" that makes the otherwise unsolvable PDES model solvable.

Generalized-inverse Matrix for PDES
In matrix theory, a generalized inverse of a matrix $A$ with dimension $m \times n$ (i.e., $m$ rows for $m$ equations and $n$ columns for $n$ variables) is defined as any matrix $A^{-}$ that satisfies
$A A^{-} A = A$.
The purpose of introducing a generalized inverse for any matrix is to have a general solution $\sigma = A^{-} b$, analogous to the solution $\sigma = A^{-1} b$ commonly known from any elementary linear algebra course. From the definition of the generalized inverse, it can be seen that if $A$ is a full-rank square matrix, then $A^{-} = A^{-1}$; in this case, $\mathrm{rank}(A) = m = n$. As described earlier, the matrix $A$ for the PDES system (e.g., Equation (12)) is not a full-rank matrix (i.e., $\mathrm{rank}(A)$ is less than both $m$ and $n$); in other words, the system is complete but the observed data to support solving the system are incomplete. Therefore, a system without fully observed data, like the PDES model, cannot be solved using the classic matrix approach. With the introduction of the generalized inverse, for any admissible matrix equation $A\sigma = b$, including the PDES described in Equation (12),
$\sigma = A^{-} b$
is a solution to $A\sigma = b$.
The general solution to the PDES matrix equation $A\sigma = b$ can be expressed as
$\sigma = A^{-} b + (I - A^{-} A) z$,
where $A^{-}$ is any fixed generalized inverse of $A$ and $z$ is an arbitrary vector. The generalized inverse $A^{-}$ is therefore not unique, which is equivalent to saying that the PDES equation system (12) cannot be solved uniquely, as indicated in Lin and Chen [10]. To solve this challenge in practice, Lin and Chen [10] sought to derive two exogenous equations in order to solve for the 11 parameters. However, the data used to construct those exogenous equations are hard to obtain and error-prone. Inspired by generalized inverse matrix theory, particularly the work by Moore and Penrose, we introduce a mathematical approach to this problem: the M-P Approach. In his famous paper, Moore proposed three more conditions on the generalized inverse $A^{-}$ defined above. They are as follows:
$A^{-} A A^{-} = A^{-}$,
$(A A^{-})^{T} = A A^{-}$,
$(A^{-} A)^{T} = A^{-} A$.
The original definition of the generalized inverse allows any admissible linear system $A\sigma = b$ to be solved easily in matrix form, regardless of whether the inverse of the coefficient matrix exists. It extends the classical definition of the inverse, $A A^{-1} = I$ with the identity matrix $I$, which for an invertible $A$ is equivalent to $A A^{-1} A = A$: the product $A A^{-}$ is relaxed and no longer needs to be an identity matrix. With this extension, the only requirement is that $A A^{-} A = A$ (Appendix B.2). This provides a mathematical approach to overcome the challenge of solving a PDES model with a coefficient matrix that is not of full rank.
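The defining condition, the three additional conditions, and the general-solution formula can be verified numerically. The sketch below assumes the "MASS" package is installed and again uses a hypothetical rank-deficient toy matrix rather than the PDES matrix; the vector `z` is an arbitrary choice.

```r
library(MASS)  # provides ginv(), the Moore-Penrose generalized inverse

# Toy rank-deficient matrix (hypothetical): row 3 = row 1 + row 2.
A <- rbind(c(1, 1, 0),
           c(0, 1, 1),
           c(1, 2, 1))
b <- c(1, 1, 2)   # consistent (admissible) right-hand side
G <- ginv(A)

# Defining condition of a generalized inverse, plus the three extra conditions.
cond1 <- isTRUE(all.equal(A %*% G %*% A, A))
cond2 <- isTRUE(all.equal(G %*% A %*% G, G))
cond3 <- isTRUE(all.equal(t(A %*% G), A %*% G))
cond4 <- isTRUE(all.equal(t(G %*% A), G %*% A))

# Particular solution and one member of the general solution family
# sigma = G b + (I - G A) z, with z an arbitrary vector.
sigma0  <- as.vector(G %*% b)
z       <- c(1, 2, 3)
sigma_z <- as.vector(sigma0 + (diag(3) - G %*% A) %*% z)

# Both vectors solve the system; the M-P solution has the smaller Euclidean norm.
residual_0 <- max(abs(A %*% sigma0 - b))
residual_z <- max(abs(A %*% sigma_z - b))
norm_0 <- sqrt(sum(sigma0^2))
norm_z <- sqrt(sum(sigma_z^2))

print(c(cond1, cond2, cond3, cond4))
print(c(norm_0 = norm_0, norm_z = norm_z))
```

Among all solutions generated by different choices of `z`, the one returned by the M-P inverse is the minimum-norm solution, which is what makes the M-P solution a single, well-defined answer.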

Demonstration with the "MASS" Package in R
To demonstrate the M-P Approach in solving the PDES model, a linear equation system without full rank, we make use of the R library "MASS" [4]. This package includes a function named "ginv". It is devised specifically to calculate the Moore-Penrose generalized-inverse of a matrix. We used this function to calculate the Moore-Penrose generalized-inverse of the coefficient matrix A in the PDES smoking behavior model described in Equation (12).
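As a minimal illustration of how `ginv` can be applied to a PDES-type fragment, the sketch below solves the quitter equation, $QU(a+1) = QU(a) + RS(a)\sigma_9(a) - QU(a)\sigma_{10}(a)$, together with the constraint $\sigma_{10} + \sigma_{11} = 1$. That constraint is our assumption here (quitters either relapse or remain quitters), the state probabilities are hypothetical numbers rather than Table 1 values, and the full system of Equation (12) is not reproduced.

```r
library(MASS)  # provides ginv()

# Hypothetical state probabilities (illustration only, not NSDUH data).
rs_a      <- 0.20   # RS(a)
qu_a      <- 0.05   # QU(a)
qu_a_next <- 0.06   # QU(a+1)

# Unknowns, in order: sigma9, sigma10, sigma11.
# Row 1: RS(a) * sigma9 - QU(a) * sigma10 = QU(a+1) - QU(a)
# Row 2: sigma10 + sigma11 = 1   (assumed constraint)
A <- rbind(c(rs_a, -qu_a, 0),
           c(0,     1,    1))
b <- c(qu_a_next - qu_a, 1)

# Two equations and three unknowns: there is no ordinary inverse,
# but the M-P generalized inverse returns the minimum-norm solution.
sigma_hat <- as.vector(ginv(A) %*% b)
names(sigma_hat) <- c("sigma9", "sigma10", "sigma11")

residual <- max(abs(A %*% sigma_hat - b))

print(round(sigma_hat, 4))
print(residual)
```

The minimum-norm solution satisfies both equations, but nothing in the computation forces the estimates into the interval [0, 1], so in practice the results should be inspected for plausibility.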
As shown in Lin and Chen, smoking data from the 2000 National Survey on Drug Use and Health (NSDUH) were compiled for US adolescents and young adults aged 15 to 21 (Table 1). According to the PDES, the state probability for each of the seven defined types of smokers by single year of age was calculated with the NSDUH data (Table 1). The state probabilities were estimated as the percentages of subjects in the various behavioral states. Since the five smoking stages (i.e., NS, EX, SS, RS, QU) were all defined for the current year, they sum to one (i.e., 100%), whereas old self-stoppers and old quitters were defined as participants who had self-stopped or quit smoking one year earlier.
With data for the first five types of smokers in Table 1, we estimated the transitional probabilities with the M-P Approach. The results are presented in Table 2 (the R code is included in Appendix A).
For validation and comparison purposes, we also computed the transitional probabilities using data for all seven types of smokers and the original PDES method by Lin and Chen [10] using R (the code is also included in Appendix A). The results in Table 3 were almost identical to those reported in the original study by Lin and Chen. As we expected, comparing the results in Table 2 with those in Table 3, for the five transitional probabilities ($\sigma_1$, $\sigma_2$, $\sigma_7$, $\sigma_8$, $\sigma_9$) that are not directly affected by the two additionally defined stages (old self-stoppers and old quitters), the results from the "M-P Approach" are almost identical to those from the original method. On the contrary, however, the other six estimated probabilities ($\sigma_3$, $\sigma_4$, $\sigma_5$, $\sigma_6$, $\sigma_{10}$, $\sigma_{11}$) differed between the two methods. For example, compared with the original estimates by Lin and Chen, $\sigma_{10}$ (the transitional probability of relapsing to smoking) with the "M-P Approach" is higher and $\sigma_{11}$ (the transitional probability of remaining a quitter) is lower; furthermore, these two probabilities show little variation across ages compared with the originally reported results.
To the best of our understanding, the results from the "M-P Approach" are more valid for a number of reasons. (1) The M-P Approach did not use additional data from which more errors could be introduced. (2) More importantly, the results from the M-P Approach make more scientific sense than those estimated with the original method. Using $\sigma_{10}$ and $\sigma_{11}$ as examples, biologically, it has been documented that it is much harder for adolescent smokers who quit to remain quitters than to relapse and smoke again [13][14][15]. Consistent with this finding, the estimated $\sigma_{10}$ (quitters relapsing to regular smoking) was higher and $\sigma_{11}$ (quitters remaining quitters) was lower with the new method than with the original method. The results from the "M-P Approach" more accurately characterize these two steps of smoking behavior progression. Furthermore, the likelihood of relapsing or remaining a quitter is largely determined by the level of addiction to nicotine rather than chronological age [16][17][18][19][20]. Consistent with this evidence, the estimated $\sigma_{10}$ and $\sigma_{11}$ with the "M-P Approach" varied much less with age than those estimated with the original method. Similar evidence supporting a high validity of the "M-P Approach" is the difference in the estimated $\sigma_6$ (self-stoppers remaining self-stoppers) between the two methods. The probability estimated through the "M-P Approach" showed a declining trend with age, reflecting the dominant influence of peers and society rather than nicotine dependence [13,21,22]. However, no clear age trend was observed in the same probability $\sigma_6$ estimated using the original method by Lin and Chen.

Evaluation of Intervention Impact for Smoking Behaviors
As indicated in the previous section, the introduction of the "M-P Approach" will greatly facilitate the application of the PDES method in behavior research. In addition to characterizing smoking behavior and assessing the effects of exposure to prevention programs, the PDES method can be used to predict future changes in smoking behavior, supporting public health planning and decision-making [8,9,11]. Next, we apply the "M-P Approach" and the PDES model to evaluate intervention impact for smoking behaviors.
As seen from the PDES model, the multi-stage behavioral transitions provide information on the likelihood that a person will progress from never-smoking (NS) to starting smoking (EX), and further to regular smoking (RS); regular smokers can quit smoking (QU), and quitters may relapse and become regular smokers again. These transitional probabilities are influenced by the environment the person is in. Various tobacco control programs, such as tobacco taxation, restriction of smoking in public places, restriction of tobacco sales to minors, school-based programs, and media campaigns, are intended to change the environment and hence the transitional probabilities. Different tobacco control programs have different impacts on the transitional probabilities. For example, restriction of tobacco sales to minors and school-based programs have a greater impact on $\sigma_2(a)$ than on other transitional probabilities. The goal of tobacco control programs is to reduce smoking among adolescents and adults. In terms of PDES, the goal is to reduce the (state) probability RS. This PDES can be employed to qualitatively assess the impact of a tobacco control program on RS. We illustrate this evaluation of intervention impact both theoretically and numerically as follows.
Suppose a new intervention program is devised to reduce … Corresponding to Equations (2) to (6), let the multi-stage transitional probabilities at ages a and a+1 under the new transitional probabilities $\Pi'(a)$ be denoted by … respectively. Therefore, the future smoking behavior distribution at different ages can be calculated as follows: …
We use the data in Table 1 and the estimated transitional probabilities from the "M-P Approach" in Table 2 to illustrate the program impact. Suppose a tobacco intervention program is designed to decrease the probability $\sigma_2$ (i.e., "Never-Smoker (NS)" to "Experimenter (EX)") by 20%. This 20% reduction would change the estimated probabilities in Table 2 for $\sigma_2$ from (16.6%, 11.6%, 12.3%, 13.8%, 10.7%, 4.4%) to (13.3%, 9.3%, 9.8%, 11.0%, 8.6%, 3.5%) for ages 15, 16, 17, 18, 19 and 20, respectively. With this 20% reduction from the intervention program, the multi-behavioral smoking distribution can be calculated using Equation (14), as shown in Table 4. Table 4 can be compared with Table 1 to investigate the changes in the smoking population for each smoking behavior at each age. For example, we can investigate the absolute change in the smoking population using the differences between the values in Table 4 and Table 1, as well as the relative change (… 19.6%, 20.0%) relative reduction in the population of "Experimenter (EX)", as seen in Table 5 (column "EX"). Furthermore, with this 20% reduction, the "never-smoker (NS)" population would increase by about 4% to 17%; the "self-stopper (SS)" population would increase by 6.5% at age 16, rising dramatically to 113% at age 21; and the "quitter (QU)" population would change by 58% at age 16 and by 6.3% at age 21, as seen in Table 5. It is also interesting to see from Table 5 that the "regular smoker (RS)" population dropped by 31% at age 17 and by more than 60% at ages 19, 20 and 21.
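The 20% reduction in $\sigma_2$ and its effect on the never-smoker population can be traced with a short R sketch. The $\sigma_2$ values are those quoted above from Table 2; the starting never-smoker share `ns_start` is a hypothetical placeholder because Table 1 is not reproduced here, and only the never-smoker recursion, $NS(a+1) = NS(a) - NS(a)\sigma_2(a)$, is projected rather than the full system.

```r
# Estimated sigma2 (NS -> EX) for ages 15 to 20, in percent, as quoted in the text.
ages       <- 15:20
sigma2_old <- c(16.6, 11.6, 12.3, 13.8, 10.7, 4.4)

# A 20% relative reduction, as in the hypothetical intervention.
reduction  <- 0.20
sigma2_new <- sigma2_old * (1 - reduction)

# Project the never-smoker share forward one age at a time:
# NS(a+1) = NS(a) - NS(a) * sigma2(a).
project_ns <- function(ns_start, sigma2_pct) {
  ns <- numeric(length(sigma2_pct) + 1)
  ns[1] <- ns_start
  for (i in seq_along(sigma2_pct)) {
    ns[i + 1] <- ns[i] * (1 - sigma2_pct[i] / 100)
  }
  ns
}

ns_start <- 0.60  # hypothetical NS(15); not the Table 1 value
ns_old <- project_ns(ns_start, sigma2_old)
ns_new <- project_ns(ns_start, sigma2_new)

# Relative change in NS at ages 16 to 21 under the intervention, in percent.
ns_rel_change <- 100 * (ns_new[-1] - ns_old[-1]) / ns_old[-1]

print(data.frame(age = ages, sigma2_old, sigma2_new = round(sigma2_new, 1)))
print(round(ns_rel_change, 1))
```

Because the recursion is multiplicative, the relative change in NS does not depend on the starting value: the sketch gives roughly 4% at age 16 and roughly 17% at age 21, in line with the range reported for the never-smoker population.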

Discussion and Conclusions
The Moore-Penrose generalized-inverse matrix theory has significant applications in many fields, including multivariate analysis, operations research, neural network analysis, pattern recognition, system control, and graphics processing [3][4][5][6][7]. To the best of our knowledge, this is the first time this "M-P Approach" is used in solving a PDES model to describe smoking behavior progression in an adolescent population. Our study fills a methodology gap in PDES modeling. After an introduction to the "M-P Approach", we illustrate its application with the same data reported in the original study using the R software [10]. Results from the analysis using the "M-P Approach", although using less data, better reflect the dynamics of smoking behavior change in adolescents than do the results from the original analysis.
Findings of this study provide evidence that the "M-P Approach" can be used to solve a PDES model constructed to characterize complex health behaviors with cross-sectional data even if the coefficient matrix is not of full rank. Behavioral modeling, as in many other systems research fields, has frequently been challenged by the lack of "fully" observed data to quantitatively characterize a system, even when the system is constructed based on scientific theory or data. Successful application of the "M-P Approach" in solving the PDES model for smoking behavior will greatly facilitate system modeling of various other human behaviors with or without fully observed data.
According to the "M-P Approach", as long as a model is "true" (e.g., as long as it has a solution), it should be solvable even with partial observations. In our study, since the PDES smoking model has been shown to be true through previous analysis, the "M-P Approach" works. This success is not by chance. Similar to a system with extra observed data (e.g., multiple regression, with more equations than unknowns), which can be solved using the "M-P Approach" (the least-squares approach is in theory an "M-P Approach"), a system with more unknowns than independent equations (e.g., partially observed data) can also be solved, based on the minimum-norm approach with the M-P inverse matrix.
Despite the successful application of the M-P approach in solving a PDES model, more research is needed to investigate the specific conditions under which the "M-P Approach" is indicated for solving complex modeling questions with a linear-equation system but without a full-rank coefficient matrix. We are initiating a systematic simulation study to validate this new approach.
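The connection drawn above between least squares, the minimum-norm solution, and the M-P inverse can be checked on two toy systems. The sketch assumes the "MASS" package is installed; both matrices are hypothetical examples with full column rank and full row rank respectively, so the closed-form expressions used for comparison are valid.

```r
library(MASS)  # provides ginv()

# Overdetermined toy system (3 equations, 2 unknowns):
# ginv() reproduces the least-squares solution from the normal equations.
A_over <- cbind(1, c(1, 2, 3))
b_over <- c(1, 2, 2)
x_ginv <- as.vector(ginv(A_over) %*% b_over)
x_ls   <- as.vector(solve(t(A_over) %*% A_over, t(A_over) %*% b_over))

# Underdetermined toy system (2 equations, 3 unknowns):
# ginv() reproduces the minimum-norm solution A'(AA')^{-1} b.
A_under <- rbind(c(1, 1, 0),
                 c(0, 1, 1))
b_under <- c(1, 1)
y_ginv <- as.vector(ginv(A_under) %*% b_under)
y_min  <- as.vector(t(A_under) %*% solve(A_under %*% t(A_under), b_under))

print(rbind(x_ginv, x_ls))
print(rbind(y_ginv, y_min))
```

In both cases the same function call, `ginv(A) %*% b`, is used, which is the sense in which regression with extra data and a partially observed system are treated by one approach.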