1 Introduction

Since the first case was reported more than a year ago, coronavirus disease 2019 (COVID-19) has significantly changed the world. Common symptoms attributable to COVID-19 include dry cough, fever, sore throat, diarrhea, vomiting, headache, and altered smell sensation. More severe complications include respiratory failure requiring invasive airway support and death (WHO & China 2020). The broad range of symptoms and complications appears to contribute to the easy spread of the disease. Many infected individuals with mild to moderate symptoms have unknowingly spread the disease, increasing its transmissibility. At the time of writing, the pandemic is responsible for more than 2.2 million deaths, while the total number of global cases has surpassed 102 million (Jhu 2020). COVID-19 has thus had an impact of record scale on the economy, industries, health, and social well-being of societies around the world.

Consequently, countries worldwide have implemented mass population testing and extensive contact tracing to track, isolate, and treat infected individuals at the earliest time possible. The critical need for mass testing to identify at-risk and infected individuals early is evident from the successful containment strategies of China and South Korea. In both countries, mass population diagnostic testing enabled early diagnosis, identification, and isolation of infected individuals. This achievement reduced the number of new cases significantly below what was initially predicted (WHO & China 2020; Balilla 2020). Unfortunately, mass diagnostic testing requires heavy use of resources, limiting this option to countries with access to ample supplies and human resources. Therefore, a more practical alternative, population-based screening with acceptable validity, will complement the existing measures.

In response, Universiti Sultan Zainal Abidin (UniSZA) developed the COVID-19 Health-Risk Assessment and Self-Evaluation (CHaSe) System. This system, accessible via an internet browser and a mobile application, aims to provide an intelligent and comprehensive analysis of at-risk individuals at the earliest stage possible. The development of CHaSe was primarily motivated by the critical need to conduct a thorough daily risk-assessment screening among all staff and students of the university, as well as civil servants under the Ministry of Science, Technology, and Innovation (MOSTI) and various other government and non-governmental agencies. During the CHaSe development, Malaysia implemented a movement control order (MCO) policy that mandated work-from-home for all working-class and student groups. Upon lifting of the MCO policy, many travelers from areas with a high density of COVID-19 cases may return to workplaces, such as university campuses and the MOSTI headquarters, to resume study and work duties. Consequently, the magnitude of the outbreak risk associated with the influx of returning students and staff calls for more practical, robust, vigilant, and reliable population-based screening measures.

CHaSe fills this gap by serving as a platform for intelligent and comprehensive analysis of COVID-19 screening, promoting preventive measures against COVID-19, and recommending actions according to individual risk categories. Featuring a daily risk-screening assessment, CHaSe extends its role as a comprehensive surveillance tool through which individuals with an elevated risk of contracting COVID-19 may declare their status and be contacted by the surveillance team, while preserving the ethical medical principles of autonomy and confidentiality.

Hence, CHaSe enhances the conventional active case detection and contact tracing practice, which is typically initiated by health personnel. Through CHaSe, the community can access a valid population-based screening and declare these risk assessments voluntarily to public health personnel. This access significantly increases the outreach of contact tracing and active case detection, which would otherwise have been practically impossible to attain. This study outlines the development of the risk-prediction model employed within the CHaSe system to achieve this goal.

2 Methodology

The urgent development of the CHaSe system began with brainstorming sessions among UniSZA experts from the Faculty of Medicine, the Faculty of Informatics and Computer, and the Deputy Vice Chancellor of Research and Innovation. Due to the significant morbidity and mortality associated with COVID-19, the development of the CHaSe system adopted the Six Sigma approach of DMAIC (Define, Measure, Analyze, Improve, and Control) (Juahir et al. 2017) in its design.

Six Sigma is a data- and statistics-driven approach to continuous quality improvement that works by eliminating defects or problems in the manufacturing and development process. Hundreds of businesses and conglomerates worldwide, such as McDonald’s, Texas Instruments, and many Japanese companies, have adopted the Six Sigma approach. Six Sigma's emphasis on quality translates to increased productivity and speed, with fewer defects and lower utilization of resources and cost (Juahir et al. 2017; Cox et al. 2016). The systematic approach supports goal attainment and continuous improvement, given the relatively rapid development in knowledge and policies associated with this novel pathology.

3 Development of CHaSe via Six Sigma approach of DMAIC

There are five major phases of the Six Sigma DMAIC approach in the development of CHaSe (Juahir et al. 2017; Aris et al. 2010).

  1. a.

    Define:

    Led by the Vice Chancellor of UniSZA, this phase concentrated on identifying the gaps to address. The ultimate goal was to determine solutions that ensure the health and safety of all students, staff, and general members of the public within the locality of UniSZA and MOSTI. The thorough discussion identified the need for a population-based screening among students and staff that is practical, valid, robust, and timely for decision-making. Eleven medical specialists from the Faculty of Medicine, UniSZA, with diverse medical and surgical expertise, participated in a comprehensive literature review to identify established risks associated with COVID-19 infection. We also enlisted the assistance of an information specialist from the library to perform a robust search strategy with sensitive search strings across all relevant databases. Two independent medical experts screened and selected articles associated with COVID-19. All papers and reports were then uploaded into Microsoft Teams (Microsoft Inc. 2018) for real-time online collaboration during the next phase.

  2. b.

    Measure:

    The measuring phase involved substantiating the prevalence, frequency, and relevance of variables associated with COVID-19 based on evidence from the literature and the consensus of medical experts. The extensive COVID-19 literature was divided and delegated among the individual experts. As a critical-to-quality (CTQ) measure, a standardized data extraction tool was employed within Microsoft Teams. The use of cloud-based data extraction enabled secure, synchronized collaboration and co-authoring (Microsoft Inc. 2018), which brought together a total of 49 variables extracted from the literature by individual medical experts into a single master data collection (WHO & China 2020; Jhu 2020; Balilla 2020; Center for Disease Control and Prevention 2021; Ministry of Health Malaysia 2020; Islam et al. 2020; Siddiqi et al. 2021; Kluster Tawar 2020; Noor Hisham Abdulah 2020; McMichael et al. 2020; MOH 2020a, b). Based on all data collected in the database, the reported prevalence and relevance of each variable were thoroughly scrutinized, vetted, and revalidated by medical experts. The revalidated variables were then used for the simulation model calculation, where each variable was given a dedicated weightage/scoring to reflect its significance toward risk prediction of COVID-19. The model's outcome is one of three distinct risk categories, namely, Low, Medium, and High.

  3. c.

    Analyze:

    This phase encapsulated the complex process of data preprocessing, data analyses, data analytics, and data validation to generate a risk-prediction model that meets or exceeds the minimum standard of public health practice. We simulated a pilot study by distributing a questionnaire consisting of the validated variables to 200 healthy volunteers. Weightages were deliberately assigned to selected parameters. These include close contact with COVID-19 patients, overseas travel within the last 14 days, cough, fever, breathing difficulty, and critical illnesses.

  4. d.

    Improve:

    To further improve the modeling and the CHaSe system itself, we conducted the first trial of mass-screening among 5000 university students during the nation-wide movement control order (MCO). Respondents in the high-risk category were contacted by medical experts and advised to visit the nearest local health center for further assessment.

    The real mass-screening provided empirical evidence to facilitate the modification of processes as CHaSe is continuously monitored and improved. The weightage/scoring of each medical parameter was recalculated iteratively using principal component analysis (PCA). After the new weightages were recalculated, the improved model statistically categorized respondents into three groups: (a) Low Risk, (b) Moderate Risk, and (c) High Risk. This improvement phase of the DMAIC Six Sigma for the CHaSe system serves to evaluate the modification of processes to obtain a more meaningful data interpretation.

  5. e.

    Control:

As a system-driven methodology, the quality of recommendation from the CHaSe system conforms to the standard operating procedure (SOP) and guidelines by the health ministry.

  1. (a)

    Low Risk: students and civil servants should complete the daily Self-Assessment Module in CHaSe until further notice. They should also practice the recommended personal preventive measures against COVID-19 infection.

  2. (b)

    Medium Risk: students and civil servants in this category should limit contact with the community and contact UniSZA’s contact tracing team for a telephone-based clinical assessment and recommendation.

  3. (c)

    High Risk: UniSZA’s contact tracing team initiates telephone contact at the earliest practical time under the active-case-detection strategy. Respondents in this category must limit contact with the community and urgently visit the nearest health center with COVID-19 diagnostic capacity for further assessment.
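The mapping from risk category to recommended action described above could be encoded as a simple lookup. The following is a minimal sketch only: the names `Risk`, `RECOMMENDATIONS`, and `recommend` are hypothetical, the action strings paraphrase the three items above, and nothing here is taken from the actual CHaSe implementation.

```python
from enum import Enum
from typing import List


class Risk(Enum):
    LOW = "Low Risk"
    MEDIUM = "Medium Risk"
    HIGH = "High Risk"


# Hypothetical encoding of the three recommendations listed above.
# The wording is paraphrased for illustration only.
RECOMMENDATIONS = {
    Risk.LOW: [
        "Complete the daily Self-Assessment Module until further notice",
        "Practice the recommended personal preventive measures",
    ],
    Risk.MEDIUM: [
        "Limit contact with the community",
        "Contact the contact tracing team for a telephone-based clinical assessment",
    ],
    Risk.HIGH: [
        "Expect a telephone call from the contact tracing team",
        "Limit contact with the community",
        "Urgently visit the nearest health center with COVID-19 diagnostic capacity",
    ],
}


def recommend(risk: Risk) -> List[str]:
    """Return the list of recommended actions for a risk category."""
    return RECOMMENDATIONS[risk]
```

Keeping the recommendations in one table, separate from the scoring logic, would make it straightforward to update the advice when ministry guidelines change.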

4 Development of the health screening model within CHaSe

The health screening model within the CHaSe system was developed with the consideration of the following aspects:

4.1 Principal component analysis (PCA)

Principal component analysis (PCA) is a multivariate technique that uses the variance shared among the variables of the original dataset to reveal the strong factors, or factor loadings, of the dataset via the underlying latent vectors of principal components (PCs) (Ismail et al. 2018; Fahmi 2011). These vectors transform the original dataset into new uncorrelated variables, producing the best linear fit with an orthogonal basis. The new axes lie along orthogonal lines in the directions of maximum variance (Shrestha and Kazama 2007). The equation below defines the derived outcome (Cloutier et al. 2008):

$$z_{ij} = a_{i1} x_{1j} + a_{i2} x_{2j} + \ldots + a_{im} x_{mj}$$
(1)

In this equation, \(z\) denotes the component score and \(a\) represents the component loading. Meanwhile, \(x\) is the measured value of the variable, \(i\) is the component number, \(j\) is the sample number, and \(m\) is the total number of variables. At the same time, PCA maximizes the variance and defines the association between the PCs, namely the clinical aspects and the epidemiological link to COVID-19 (Epid-Link). Recycled data were important in this study. The index values from the PCA factor scores initially consisted of varying scales (internally inconsistent values, including negatives), which made it difficult for the algorithm to make comparisons or assumptions from them (Helena et al. 2000).
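As an illustration of Eq. 1, the component scores can be computed from the eigenvectors of the correlation matrix. The sketch below is a generic NumPy implementation, not the software used in this study; the function name `pca_scores` and the random demonstration data are assumptions, and the coefficients \(a\) are taken to be unit-length eigenvectors.

```python
import numpy as np


def pca_scores(X):
    """PCA on the correlation matrix of X (rows = samples, columns = variables).

    Returns eigenvalues (descending), coefficient vectors (columns), and scores,
    where scores[j, i] = sum_k a_ik * x_kj as in Eq. 1 (x standardized).
    """
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # standardize each variable
    R = np.corrcoef(Z, rowvar=False)                  # correlation matrix
    eigvals, eigvecs = np.linalg.eigh(R)              # eigh: R is symmetric
    order = np.argsort(eigvals)[::-1]                 # sort by explained variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Z @ eigvecs
    return eigvals, eigvecs, scores


# Hypothetical demonstration data (200 samples, 5 variables)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
eigvals, coeffs, scores = pca_scores(X_demo)
explained_pct = 100 * eigvals / eigvals.sum()  # percentage of variance per PC
```

The resulting score columns are mutually uncorrelated, and the variance of each column equals the corresponding eigenvalue, which is what makes the eigenvalue a natural measure of how much variability a PC explains.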

Generally, these PCs are unstable and must be subjected to a rotation known as the varimax rotation, which generates the varimax factors (VFs) (Kamaruddin et al. 2015). Varimax rotation is important to regulate the overfitting and complexity of the components by enhancing the positive and negative factor loadings within each principal component. The varimax rotation produces a more transparent quantitative measure that correlates each variable with one PC, or almost one factor only (Juahir et al. 2017; Ismail et al. 2018; Kamaruddin et al. 2015; Dominick et al. 2012; Fazillah et al. 2017). Varimax factors with an eigenvalue greater than one (1) form the critical criterion for further analysis (Juahir et al. 2017). In this study, a VF loading equal to or greater than 0.4 is considered strong. A value ranging from 0.2 to 0.39 is deemed moderate, while a value from 0.10 to 0.199 is regarded as weak (Liu et al. 2003). A negative factor loading represents a variable's insignificant loading in either the clinical aspects or the Epid-Link.
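The eigenvalue criterion, the varimax rotation, and the loading-strength thresholds described above can be sketched as follows. This is a generic NumPy version of the standard iterative (SVD-based) varimax algorithm, not the software used in the study. The function names and the synthetic data are hypothetical, and the half-open intervals in `loading_strength` are an assumption made to close the small gaps between the ranges quoted above.

```python
import numpy as np


def varimax(loadings, max_iter=100, tol=1e-6):
    """Rotate a (variables x factors) loading matrix using the varimax criterion."""
    L0 = np.asarray(loadings, dtype=float)
    p, k = L0.shape
    R = np.eye(k)  # accumulated orthogonal rotation matrix
    d = 0.0
    for _ in range(max_iter):
        L = L0 @ R
        # SVD step of the standard iterative varimax algorithm
        u, s, vt = np.linalg.svd(L0.T @ (L**3 - L @ np.diag((L**2).sum(axis=0)) / p))
        R = u @ vt
        d_new = s.sum()
        if d != 0.0 and d_new / d < 1.0 + tol:
            break
        d = d_new
    return L0 @ R


def loading_strength(value):
    """Label a factor loading using the thresholds quoted in the text (assumed half-open)."""
    if value >= 0.4:
        return "strong"
    if value >= 0.2:
        return "moderate"
    if value >= 0.1:
        return "weak"
    return "negligible"  # includes negative loadings, treated as insignificant


# Hypothetical data: six variables driven by two latent factors plus noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mix = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]], dtype=float)
X_demo = latent @ mix.T + 0.5 * rng.normal(size=(200, 6))

corr = np.corrcoef(X_demo, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
keep = eigvals > 1.0                                   # eigenvalue-greater-than-one rule
unrotated = eigvecs[:, keep] * np.sqrt(eigvals[keep])  # loadings of retained PCs
rotated = varimax(unrotated)
```

Because the rotation is orthogonal, the communality of each variable (the row sum of squared loadings) is unchanged; only the distribution of loadings across factors becomes easier to interpret.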

In this study, PCA was applied to all initial and revalidated variables in the questionnaire distributed among 200 respondents during the measurement phase and 5000 respondents in the improvement phase. Specifically, the PCA re-constructs factors (such as clinical symptoms of fever and overseas travel) from the large dataset into their respective underlying latent variables, the clinical aspects and the Epid-Link. Subsequently, the factor scores of PCA were used as a single index for every contributing factor. The weighted score was used to determine the risk level of the respondents toward the COVID-19 infection based on an algorithm expressed by the equation below:

$$\mathrm{PRI}=\sum_{i=1}^{n} F_{i} w_{i}$$
(2)

where \(n\) denotes the number of factors, \(F_{i}\) represents the score of factor \(i\), and \(w_{i}\) is the percentage of the variance that factor \(i\) explains.
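Eq. 2 is a weighted sum, which the short sketch below illustrates. The function name `pri` and the factor scores are hypothetical; the weights are the variance percentages reported in this paper for PC1 to PC4 in the 5000-respondent analysis, used here only as example inputs.

```python
import numpy as np


def pri(factor_scores, variance_pct):
    """Eq. 2: sum over factors of (factor score x percentage of variance explained)."""
    F = np.asarray(factor_scores, dtype=float)
    w = np.asarray(variance_pct, dtype=float)
    return F @ w  # works for one respondent (1-D) or many (rows = respondents)


# Variance percentages for PC1..PC4 as reported in the text
weights = [27.36, 11.794, 10.347, 8.785]

# Hypothetical factor scores for three respondents
demo_scores = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 1.0, 1.0],
    [-0.5, 0.2, 0.0, 1.0],
])
demo_pri = pri(demo_scores, weights)
```

Because factor scores can be negative, the raw index can also be negative, which is why a rescaling step is needed before respondents are compared.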

The index values from the factor scores of PCA consisted of varying scales (internally inconsistent values, including negatives), which results in algorithmic difficulty in making comparisons and developing assumptions. To achieve consistency, the scales were normalized via the rescaling method; the index values were rescaled to z values (to ensure that the variance of each variable would equal unity) using the expression below:

$$\text{Rescaling (1 to 100)} = a + \frac{\left(x_{i}-A\right)\left(b-a\right)}{B-A}$$
(3)

where \(a\) is equal to 1, \(x_{i}\) is the actual observation, \(A\) and \(B\) are the lowest and highest factor scores, and \(b\) is the constant value of 100. As the rescaling involved data normalization, univariate clustering was selected to categorize the risks, namely: Low Risk, Moderate Risk, and High Risk. The lowest value of risk indicates that the respondents have a low risk of COVID-19 infection. Likewise, a high value suggests that respondents have a high risk of contracting COVID-19, thus requiring urgent further clinical assessment and management.
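The rescaling of Eq. 3 and a subsequent split into three groups can be sketched as below. The `rescale` function follows Eq. 3 directly. The grouping step is only a stand-in: the text does not specify the univariate clustering algorithm, so a simple one-dimensional k-means with three centers is assumed here, and the function names and raw scores are hypothetical.

```python
import numpy as np


def rescale(values, a=1.0, b=100.0):
    """Eq. 3: map scores linearly so the lowest becomes a and the highest becomes b."""
    x = np.asarray(values, dtype=float)
    A, B = x.min(), x.max()
    if A == B:
        raise ValueError("all scores are identical; rescaling is undefined")
    return a + (x - A) * (b - a) / (B - A)


def three_groups(index_values, max_iter=100):
    """Assumed stand-in for univariate clustering: 1-D k-means with three centers.

    Returns labels 0 (lowest group), 1 (middle group), 2 (highest group).
    """
    x = np.asarray(index_values, dtype=float)
    centers = np.array([x.min(), np.median(x), x.max()])
    labels = np.zeros(x.size, dtype=int)
    for _ in range(max_iter):
        labels = np.abs(x[:, None] - centers[None, :]).argmin(axis=1)
        new_centers = centers.copy()
        for g in range(3):
            if np.any(labels == g):  # keep the old center if a group is empty
                new_centers[g] = x[labels == g].mean()
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


# Hypothetical raw index values, including negatives
raw_scores = [-2.0, -1.9, -1.8, 0.0, 0.1, 0.2, 1.8, 1.9, 2.0]
index = rescale(raw_scores)    # now on the 1-to-100 scale
labels = three_groups(index)   # 0 = Low, 1 = Moderate, 2 = High
```

Rescaling first means the cut points between groups are expressed on the same 1-to-100 scale for every dataset, which simplifies comparison between the pilot and the mass-screening data.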

4.2 Discriminant analysis (DA)

Discriminant analysis is a technique for developing discriminant functions from the independent variables. These functions distinguish the independent variables (predictor variables) according to the dependent variables (categorical variables). The result is a classification method that aims for perfect separation or a very low misclassification rate (Juahir et al. 2017; Ismail et al. 2018; Gazzaz et al. 2012). By applying a linear combination of the predictor variables (Lambrakis et al. 2004), obtained by maximizing the variance–covariance correlation between classes while minimizing the variance–covariance correlation within the same class, the discriminant factors (DFs) reveal the classification derived from the categorical variables (Juahir et al. 2017; Ismail et al. 2018; Gazzaz et al. 2012). The DFs for each cluster follow the equation:

$$f(G_{i}) = k_{i} + \sum_{j=1}^{n} w_{ij} p_{ij}$$
(4)

where \(i\) indexes the groups (\(G\)), \(k_{i}\) represents the constant inherent to each group, \(n\) is the number of parameters used to classify a set of data into a given group, and \(w_{ij}\) is the weight coefficient assigned by DF analysis (DFA) to a given parameter (\(p_{ij}\)) (Kannel et al. 2007).

The DFs to evaluate the categorical variables of clinical aspects and the Epid-Link were constructed via the DA technique in two distinct modes, viz. standard and forward stepwise. In the initial phase of the pilot study, the DFs were constructed for 200 respondents with 49 variables. Later, the variables were revalidated and subsequently reduced to 12 variables to serve 5000 respondents. Therefore, the data input to the discriminant analysis constituted 200 and 5000 samples of respondents. In the DA forward stepwise mode, variables were added step by step, starting from the most significant variables, until no changes were observed (Fazillah et al. 2017).
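Once the constants and weights of Eq. 4 are known, classifying a respondent amounts to evaluating one linear function per group and choosing the largest. The sketch below assumes this common decision rule and treats the parameter values as one vector per respondent shared across groups; the constants, weights, and the two parameters are made-up numbers for illustration, not coefficients from this study.

```python
import numpy as np


def classify(p, k, W):
    """Evaluate f(G_i) = k_i + sum_j w_ij * p_j for every group and pick the largest.

    p: parameter values of one respondent, shape (n,)
    k: group constants, shape (groups,)
    W: weight coefficients, shape (groups, n)
    """
    scores = np.asarray(k, dtype=float) + np.asarray(W, dtype=float) @ np.asarray(p, dtype=float)
    return int(np.argmax(scores)), scores


# Hypothetical coefficients for three groups (0 = Low, 1 = Moderate, 2 = High)
# and two parameters; real values would come from the fitted discriminant analysis.
k_demo = [0.0, -5.0, -12.0]
W_demo = [
    [0.1, 0.2],
    [1.0, 0.5],
    [2.0, 0.5],
]

group_a, _ = classify([0.0, 0.0], k_demo, W_demo)
group_b, _ = classify([6.0, 2.0], k_demo, W_demo)
group_c, _ = classify([10.0, 2.0], k_demo, W_demo)
```

In forward stepwise mode the same scoring applies, but `W` would contain only the columns for the variables that the stepwise procedure retained.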

5 Results and discussion

5.1 Define phase

The first phase of the Six Sigma DMAIC’s critical-to-quality (CTQ) approach identified several root causes of outbreaks and potential interventions to COVID-19 infection in Malaysia. The Fishbone Diagram in Fig. 1 below illustrates the dynamic relationship between crucial elements from the known COVID-19 clusters in Malaysia with the underlying mechanism of transmission.

Fig. 1 A fishbone diagram illustrating the cause and effect of COVID-19 outbreak in Malaysia

5.2 Measure phase

The initial extensive literature review identified 49 parameters associated with COVID-19, as shown in Table 1 below.

Table 1 Forty-Nine (49) parameters or variables used for pilot study based on literature reviews

5.3 Analyze phase

5.3.1 Pilot study using forty-nine (49) clinical aspects and epidemiological variables linked to COVID-19 for 200 respondents

5.3.1.1 Determination of weightage scoring using the principal component analysis (PCA)

All forty-nine (49) initial parameters of COVID-19 were scrutinized before confirming the weightage score for categorizing the index using the PCA. PCs with eigenvalues greater than 1.0 were considered significant. The PCA output revealed twelve significant PCs extracted from the variables with eigenvalues greater than 1 (Table 2). The variance explanations of the PCs were 27.36%, 11.794%, 10.347%, and 8.785% for PC1, PC2, PC3, and PC4, respectively, with a total cumulative explanation of 58.287% in data variability. Thus, the PCA's twelve factor loadings account for most of the variation in clinical symptoms and clinical history. Therefore, the risk of contracting COVID-19 infection based on clinical symptoms and Epid-Link can be predicted from these twelve factors (Niranjanamurthy et al. 2020; Cao et al. 2020) (Fig. 2).

Fig. 2 Variance explanation of the strong factor loadings of 27.36% for PC1 and 11.794% for PC2 after varimax rotation of 49 parameters for 200 respondents

Table 2 Eigenvalues from the principal component analysis illustrating variance, cumulative variance, and factor loading of clinical aspects (clinical symptoms and clinical history) and epidemiological link of COVID-19 (Epid-Link) after varimax rotation of 49 parameters for 200 respondents

Referring to PC1 in Table 2, C6 (History of joining high-risk gathering where confirmed cases had been recorded), CH11 [History of contact with confirmed cases (close contact)], and CH13 [Duration of exposure with confirmed cases (minutes)] display strong positive factor loadings of 0.705, 0.706, and 0.509, respectively. These values suggest that they are significant variables for predicting the risk of contracting COVID-19. In addition, the variables CS11 [Duration of muscle pain/ fatigue (days)] and CS15 (Headache), with factor loadings of 0.444 and 0.463, are also considered significant.

PC2 denotes underlying risks associated with age (S1) and an underlying history of hypertension (Com2), with weighted values of 0.8478 and 0.5880, respectively. In the United States of America (US), 8 out of 10 COVID-19 deaths are among adults aged 65 or older (Center for Disease Control and Prevention 2021). Similarly, those aged 60 or above account for the majority of deaths from COVID-19 in Malaysia (Ministry of Health Malaysia 2020). The relatively lower cardiovascular reserve may explain the high burden of disease and complications caused by COVID-19 in this age group. Furthermore, the older population is more likely to have limited mobility and possibly to reside in aged-care centers. These factors explain the increased risk of contracting COVID-19 due to a shared environment and close physical contact. Meanwhile, hypertension, which also steadily increases with age, is commonly treated with angiotensin-converting enzyme inhibitors (ACE-i) and angiotensin-receptor blockers (ARB). Both ACE-i and ARB are associated with COVID-19 through upregulation of the angiotensin-converting enzyme 2 (ACE2) receptor (Islam et al. 2020). ACE2 is a crucial factor in the mechanism by which COVID-19 infects the human body (Islam et al. 2020).

On the other hand, the extent of COVID-19 infection has manifested beyond localized symptoms. The third factor (PC3) of the PCA identifies fever (CS1), difficulty breathing (CS8), and vomiting (CS12) as significant variables, with loading values of 0.6456, 0.7064, and 0.6718, respectively. These three clinical symptoms may represent the bodily response to an infection that has turned systemic. There is emerging evidence that COVID-19 leads to multi-organ dysfunction through endothelial injury, a mechanism conventionally associated with vascular pathology (Siddiqi et al. 2021).

Nonetheless, it is possible that COVID-19 may cause only localized or predominantly respiratory symptoms. PC4 summarizes sore throat (CS7) and difficulty breathing (CS8) as strong factors, with loading values of 0.8513 and 0.6824, respectively. Lung tissue damage caused by COVID-19 is evident from the autopsy of a 50-year-old patient from the initial cases in Wuhan in 2019, which showed bilateral diffuse alveolar damage with cellular fibromyxoid exudates (WHO & China 2020). Difficulty breathing, however, is noticeably a significant factor for both PC3 and PC4. Clinically, both systemic and localized pathologies can cause breathing difficulty. Thus, it is conceivable that COVID-19, which may initially cause difficulty breathing primarily through its assault on the respiratory system, may later cause the same clinical symptom once it has become systemic. The mixed significant factor loading of difficulty breathing on PC3 and PC4 thus reflects the dynamics and complexity of the COVID-19 clinical presentations attributable to this symptom.

5.4 Validation of risk index for pilot study based on discriminant analysis (DA)

Discriminant Analysis (DA) provides a further statistical estimation on the index variations of clinical aspects and Epid-Link (Table 3). Based on PCA's risk indices, the DA evaluates the most significant discriminant functions of variables from the clinical aspects (clinical symptoms and clinical history) or Epid-Link. The risk indices were treated as dependent variables, and the clinical aspects (clinical symptoms and clinical history) and Epid-Link variables were treated as independent variables. The discriminant functions (DFs) and classification matrices (CMs) obtained from the DA standard stepwise mode and DA forward stepwise mode are depicted in Fig. 3a and b.

Table 3 Confusion Matrix of DA (Standard Step Wise and Forward Step Wise) validating the percentage correction of the risk indices
Fig. 3 a DA Standard step mode. b DA Forward step mode of COVID-19 for 200 respondents

The standard DA mode constructs DFs that contain all clinical aspects (clinical symptoms and clinical history) and epidemiological link of COVID-19 variables. The high-risk indices were correctly discriminated in 92.31% of cases (Table 3), with different significant clinical aspects and epidemiological variables. Table 3 shows that the discriminant analysis assigned the Low Risk and Moderate Risk categories with 100% accuracy. Three variables appear to be strong discriminants toward the high-risk category. They are the last contact with confirmed cases in days (CH14), duration of exposure with confirmed cases in minutes (CH13), and age (S1).

In standard mode, the classification matrix of the DA assigned the 200 respondents to their categories with 97.50% accuracy. In contrast, the DA forward stepwise mode worked on the Low and Moderate Risk matrices with nine (9) discriminant variables, yielding 98.31% and 95.45% correctness of the CMs, respectively. Through this study's deliberation, the clinical aspects and epidemiological variables of CS11 [Duration of muscle pain/ fatigue (days)], CS9 [Duration of difficulty breathing (days)], and S1 (Age) are attributable to Moderate Risk of COVID-19. However, the CM produced 88.00% correct assignments overall in the forward stepwise mode, which was applied to the 200 respondents with different clinical aspects and epidemiological links to COVID-19. The High Risk matrix yielded the lowest CM correctness, at 65%. These results are notably dissimilar to those obtained from the DA standard mode.
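The per-category and overall percentages of correct assignment quoted above are derived from a classification (confusion) matrix. The sketch below shows the arithmetic only; the function name `percent_correct` and the counts are hypothetical and are not the values in Table 3.

```python
import numpy as np


def percent_correct(cm):
    """Per-class and overall percentage of correct assignments.

    cm[r, c] = number of respondents whose true class is r and assigned class is c.
    """
    cm = np.asarray(cm, dtype=float)
    per_class = 100.0 * np.diag(cm) / cm.sum(axis=1)
    overall = 100.0 * np.trace(cm) / cm.sum()
    return per_class, overall


# Hypothetical counts for 200 respondents (rows: true Low, Moderate, High)
cm_demo = [
    [90, 10, 0],
    [5, 40, 5],
    [0, 10, 40],
]
per_class, overall = percent_correct(cm_demo)
```

Note that the overall percentage is weighted by class size, so a small High Risk class can be poorly classified while the overall figure remains high.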

The Wilks' Lambda test for the standard and forward modes revealed Lambda values of 0.025 (p < 0.0001) and 0.181 (p < 0.0001), respectively. The DA null hypothesis, H0, states that the mean vectors are equal among the three classes. The alternative hypothesis, Ha, states that at least one of the mean vectors is different from the others. As the computed p value is lower than the significance level alpha (0.05), this study rejects the null hypothesis, H0, and accepts the alternative hypothesis, Ha. The risk of rejecting H0 while it is true is lower than 0.05. Thus, the discriminant analysis suggests that the included variables were sufficient to discriminate the three groups and accounted for most of the index scale variations of COVID-19 infections. These variables were the last contact with confirmed cases (days) (CH14), duration of exposure with confirmed cases (minutes) (CH13), age (years) (S1), duration of muscle pain/fatigue (days) (CS11), and duration of difficulty breathing (days) (CS9).
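Wilks' Lambda is the ratio of the determinant of the within-group scatter matrix to the determinant of the total scatter matrix; values near 0 indicate well-separated group means and values near 1 indicate little separation. The sketch below computes only the statistic (the p value, which requires an F or chi-square approximation, is omitted). The function name and the synthetic data are assumptions, not the study's data.

```python
import numpy as np


def wilks_lambda(X, y):
    """Wilks' Lambda = det(W) / det(T) for data X (samples x variables) and labels y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    centered = X - X.mean(axis=0)
    T = centered.T @ centered               # total sums of squares and cross-products
    W = np.zeros_like(T)
    for g in np.unique(y):
        Xg = X[y == g]
        dg = Xg - Xg.mean(axis=0)
        W += dg.T @ dg                      # within-group scatter, pooled over groups
    return np.linalg.det(W) / np.linalg.det(T)


# Hypothetical data: three well-separated groups in two variables
rng = np.random.default_rng(0)
means = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X_demo = np.vstack([rng.normal(loc=m, scale=1.0, size=(30, 2)) for m in means])
y_demo = np.repeat([0, 1, 2], 30)
lam = wilks_lambda(X_demo, y_demo)
```

With group means this far apart relative to the within-group spread, the statistic is close to zero, which corresponds to rejecting the hypothesis of equal mean vectors.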

5.5 Improvement

5.5.1 Validation of CHaSe system using live data of 5000 respondents

A real mass-screening using the CHaSe system was conducted among 5000 university staff and students following the pilot study analysis phase. The modeling employed in the CHaSe system comprises fifteen (15) questions, as outlined in Table 4 below:

Table 4 List of Questionnaire distributed to University students and staff

After the actual responses of these 5000 respondents were considered, the system’s modeling was revalidated. The set of clinical and Epid-Link parameters was reduced to eleven (11) variables. To re-confirm the risk level of COVID-19 (Low Risk, Moderate Risk, and High Risk), we conducted a thorough iterative process. This process included an interactive machine learning algorithm (the weightage scoring), simulation of the modeling, and integration of the predictive modeling into the CHaSe System with live raw data of 5000 respondents using the matrices index of PCA (Fig. 4). The medical experts revalidated all eleven parameters of COVID-19 before re-confirming the weightage score for categorizing the index using the PCA based on 5000 respondents. Interestingly, the revalidated PCA for 5000 respondents also yielded twelve significant PCs with an eigenvalue greater than 1 (Table 3; Fig. 3), for a total explained cumulative variance of 58.288%. Importantly, the CHaSe system’s modeling underwent repeated revalidation by medical experts for knowledge accuracy of all the clinical aspects (clinical symptoms and clinical history) and the Epid-Link. After repeated revalidation by medical experts, the PCs' variance explanations are 27.36% for PC1, 11.79% for PC2, 10.347% for PC3, and 8.785% for PC4, with a cumulative explanation of 58.288% in data variability. Figure 4 indicates twelve factor loadings in the PCA accountable for 58.288% (Fig. 3) of the clinical symptoms and clinical history variations. Therefore, these results demonstrate that the initial clinical aspects and Epid-Link variables yielded from the PCA of the pilot study were validated by further analyses of the real mass-screening among 5000 respondents (Niranjanamurthy et al. 2020; Cao et al. 2020) (Figs. 5 and 6; Tables 5 and 6).

Fig. 4 Variance explanation of the strong factor loadings of 27.36% for PC1 and 11.79% for PC2 after varimax rotation of 11 parameters for 5000 respondents

Fig. 5 The development of the CHaSe System with the integration of raw data (clinical aspects), data analytics, and modeling

Fig. 6 a DA standard stepwise mode of the pilot study for 200 respondents, and b DA stepwise forward mode of the pilot study for 200 respondents

Table 5 Eigenvalues from the principal component analysis illustrating variance, cumulative variance, and factor loading of clinical aspects (clinical symptoms and clinical history) and Epid-Link after varimax rotation of 11 parameters for 5000 respondents
Table 6 Weightage Scoring of 200 respondents (pilot study) and 5000 respondents (actual data) of the development of CHaSe System

After model revalidation, the CHaSe System for COVID-19 screening was made accessible to students and civil servants via an internet browser or mobile application for self-monitoring (Fig. 7a). As part of the organizational policy to prevent COVID-19 outbreaks, it was made mandatory for civil servants and students to undergo daily screening through the CHaSe system throughout the national Movement Control Order (MCO). The CHaSe system administration monitors the updated data, and the outbreak details are illustrated in Fig. 7b.

Fig. 7 a A self-monitoring CHaSe System of COVID-19 for health closed-surveillance of individuals. b A COVID-19 closed-surveillance monitored by the administration embedded in CHaSe System

The administration of the CHaSe system prompted the mapping of risk trends covering all states of Malaysia. Figure 8 portrays the COVID-19 outbreak data, which comprise the Risk Trend over Time, the Risk Level of Malaysia as a whole, and the Risk by States. This COVID-19 mobile application is the first online monitoring tool developed in Malaysia based on the integration of computer science, data analytics (algorithms on the weightage score for index category), and medical factors based on actual respondents.

Fig. 8 Visual data of COVID-19 outbreaks, comprising risk trend over time, risk level in the whole of Malaysia, and risk by states

5.6 Control

The Six Sigma approach adopted in this study serves as a problem-solving tool to address the critical need to conduct validated mass-screening when resources are limited (Juahir et al. 2017). As discussed in the define phase, this study identifies the crucial gap in achieving a valid mass screening to mitigate the risk of cluster outbreaks upon the return of staff and students after the MCO. This objective is a form of control measure for evaluating whether the CHaSe system meets its purpose. Through the CHaSe system, this study completed a mass-screening of 5000 university students and staff. Robust active case detection and contact tracing ensure that respondents who have a high risk of contracting COVID-19 are identified. Without the advanced application of modern Internet of Things (IoT) technology and a prediction model based on data analytics (Aris et al. 2010), the traditional practice requires enormous human resources to initiate one-to-one contact for personalized risk assessment. China shared its experience of how it managed to contain COVID-19 when it first emerged in Wuhan in November 2019 (WHO & China 2020). The experience from that exercise was described as “painstaking” (WHO & China 2020) and is practically not achievable with our resources, given the pandemic's current scale of impact. Hence, by facilitating early detection for many respondents within the university's current capacity, the CHaSe system meets and exceeds public health practice standards. Therefore, the CHaSe system demonstrates the significant scale of clinical outcomes attained by harnessing the benefits of mobile technology.

However, a screening model based on self-declaration of clinical history and symptoms has its limitations. There have been reports of asymptomatic COVID-19 infection (WHO & China 2020; Kluster Tawar 2020; Noor Hisham Abdulah 2020). Unfortunately, reports on the prevalence of asymptomatic infected individuals are conflicting (WHO & China 2020; Noor Hisham Abdulah 2020; McMichael et al. 2020). Additionally, caution is needed when interpreting ‘asymptomatic infection’ from cross-sectional data, since patients may have no symptoms at the time of the diagnostic test but may subsequently develop them as the disease progresses. Thus, asymptomatic infection is defined as a lack of symptoms from exposure until complete recovery. These challenges motivate us to advocate daily screening with validated models via the CHaSe system for our students and staff to further increase screening sensitivity.

Therefore, technology-enhanced screening via the CHaSe system does more than make a one-time COVID-19 screening possible. The use of mobile technology also enables regular risk assessment as a surveillance strategy for a large population. The ability to conduct risk assessments and further personalized clinical evaluations through daily follow-up phone contact for high-risk groups serves as a robust and comprehensive screening and surveillance strategy to reduce the risk of cluster outbreaks. Nonetheless, asymptomatic individuals with no known Epid-Link were also reported among those diagnosed with COVID-19 in Malaysia (Kluster Tawar 2020; Noor Hisham Abdulah 2020; MOH 2020a, b). Hence, the CHaSe system does not replace the existing public health measures to control the pandemic. Instead, CHaSe serves to complement and enhance the current strategy for COVID-19 screening and surveillance.

On the system side, CHaSe has been developed as two main applications: a web-based system and mobile apps (Android and iOS). The technologies employed include MySQL as the Database Management System (DBMS) and the Laravel Framework (PHP) for the web-based programs. The rules extracted from experiments using various machine learning techniques (PCA for dimensionality reduction and feature selection), as well as classification and regression methods, were translated into PHP code. Pilot testing was conducted to verify the results, and the rules generated from the computed models were observed to be consistent with the results obtained from the medical experts.
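To illustrate what translating model-derived rules into application code can look like, the sketch below encodes a weighted-score rule set as a small function. It is written in Python rather than the PHP used by CHaSe, and it is not the published CHaSe rule set: the item names, the weights, and the cut-off scores are hypothetical values chosen only to show the structure.

```python
# Hypothetical weights per self-declared item (illustrative only;
# these are NOT the weights used by the CHaSe system).
WEIGHTS = {
    "fever": 2,
    "dry_cough": 2,
    "sore_throat": 1,
    "loss_of_smell": 2,
    "close_contact": 4,
    "travel_high_risk_area": 3,
}

# Hypothetical cut-offs as (minimum score, label), highest first.
THRESHOLDS = [(6, "high"), (3, "moderate")]


def risk_score(answers):
    """Sum the weights of every item the respondent declared as true."""
    return sum(
        weight for item, weight in WEIGHTS.items() if answers.get(item, False)
    )


def risk_category(answers):
    """Map a respondent's total score to a risk label."""
    score = risk_score(answers)
    for minimum, label in THRESHOLDS:
        if score >= minimum:
            return label
    return "low"
```

Keeping the weights and cut-offs in data tables rather than in nested conditionals means that revised rules from a retrained model, or corrections from medical experts, can be applied without changing the scoring logic.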

6 Conclusion

The COVID-19 outbreak rapidly escalated into a pandemic within months of its emergence. The ease of spread and the scale of the disease's impact demand that medical experts and policy-makers establish a valid screening model that can be applied effectively and efficiently to the mass population. Comprehensive conventional active case detection and contact tracing are effective in containing the outbreak. Nonetheless, these measures require large supplies and resources, which are not always immediately accessible to government and private organizations. In Malaysia, students and staff of higher educational institutions will return to campus at the end of the movement restrictions policy to invigorate the economy and educational activities. Thus, there is an urgent need for educational institutions to conduct mass screening and surveillance as part of each organization’s initiative to mitigate the risk of COVID-19 cluster outbreaks.

To address this need, we have developed a robust risk-prediction model within the CHaSe system through a collaboration between medical, data, and computing experts. The DMAIC Six Sigma approach was applied as a systematic and critical-to-quality (CTQ) approach to achieving this goal. In this study, we report the rigorous iterative process by which the risk-prediction model was comprehensively developed and then revalidated by frontline medical experts to achieve optimal accuracy, reliability, and efficiency.

Combined with mobile technology, the results demonstrate the feasibility of mass screening 5000 students and staff within the capacity of our university’s resources. Furthermore, the analysis provides the risk trend over time, the risk level of the entire country, and the risk by state. Thus, the modeling system within CHaSe serves as a valuable approach to complement and enhance the existing public health measures against COVID-19.