Unraveling educational networks: Data-driven exploration through multivariate regression, geographical clustering, and multidimensional scaling

Enhancing rates of school participation holds significant importance for a nation’s educational achievements. This research employs a comprehensive approach that combines various methodologies, including multivariate regression analysis, geographic categorization


Introduction
Education is a fundamental human right and a driving force of societal and economic progress. It is central to Sustainable Development Goal 4 (SDG 4), the global goal of providing universal access to high-quality education. As a country committed to this goal, Indonesia has made considerable progress in improving educational opportunities. Attendance rates nevertheless vary across educational levels, including SD (primary school), SMP (junior secondary school), SMA (senior secondary school), and SMK (vocational secondary school). Understanding the factors driving these rates is critical for developing policies and activities that will successfully improve educational results.
A crucial indicator of a nation's educational accessibility and quality is the school participation rate. Although the National Education System Law No. 20 of 2003 emphasizes equal and outstanding education, a number of barriers still prevent many Indonesian students from enrolling in school. A major issue is the unequal distribution of educational access among regions. Even though the law places a strong emphasis on educational equality, access to schools can be hindered, especially in remote areas, by insufficient infrastructure and economic inequality across locations.
The quality of education also has a notable influence on school enrollment. Law No. 14 of 2005 on Teachers and Lecturers sets professional standards for educators to ensure the provision of quality education. Nevertheless, certain regions face a shortage of proficient teachers and necessary resources, which can undermine students' eagerness and motivation to attend school. This directly affects school participation rates, as lower educational attainment can lead residents to send their children to school less often.
The purpose of this research is to investigate the factors that influence Indonesian students' participation rates at various educational levels. We aim to provide a full understanding of the mechanisms behind varying participation rates by using an integrated technique that combines multivariate regression, spatial clustering, and multidimensional scaling. In particular, the study seeks to identify the elements that have a substantial impact on school attendance rates and thereby shed light on their current state.
Multivariate regression is used to analyze the relationship between various factors and school attendance rates at different educational levels; it explores the individual and combined effects of these factors, providing insights into participation rates. Geographic clustering analysis helps identify spatial patterns and regional variations in school attendance rates, aiding targeted interventions and resource allocation. Multidimensional scaling analysis is used to understand relationships between variables and participation rates, highlighting key factors influencing attendance. Understanding these factors is crucial for evidence-based policies that enhance participation rates, ultimately promoting sustainable development, social mobility, and poverty reduction, achieving SDG 4, and breaking the cycle of poverty through education.
The findings of this study will aid in allocating resources to the places most in need and will encourage evidence-based decisions. By taking a holistic approach that blends quantitative research with geographical awareness, we can build targeted and effective measures to increase school enrollment rates across the country.

Multidimensional Scaling
Multidimensional Scaling (MDS) is a multivariate technique that can be used to determine the similarity between any pair of N observed objects and to plot the objects in multiple dimensions based on their proximity and similarity (Johnson & Wichern, 2007). The smaller the distance, the more similar the objects; the larger the distance, the more dissimilar they are (Rabinowitz, 1975). The analysis is used to determine relationships of interdependence between variables or data, and its result is presented as a perceptual map in multiple dimensions (Adlakha & Sharma, 2019). Based on the measurement scale of the data, MDS is divided into metric and nonmetric multidimensional scaling (Johnson & Wichern, 2007).

Nonmetric Multidimensional Scaling
The distance data used in this scaling are ordinal. Rabinowitz (1975) describes several phases of a multidimensional scaling analysis:
1. Calculate the distance matrix using the Euclidean distance. The Euclidean distance measures the proximity between the i-th object and the j-th object on the perceptual map:
$$d_{ij} = \sqrt{\sum_{k=1}^{p} (x_{ik} - x_{jk})^2}$$
where $x_{ik}$ is the value of the i-th object on the k-th variable and $p$ is the number of variables.
2. Find the eigenvalues and eigenvectors.
3. Form the object coordinates from the eigenvectors, then calculate $\hat{d}_{ij}$, the Euclidean distances formed by these coordinates.
4. Calculate the stress value:
$$\text{Stress} = \sqrt{\frac{\sum_{i<j} (d_{ij} - \hat{d}_{ij})^2}{\sum_{i<j} d_{ij}^2}}$$
The lower the stress value, the better the resulting model. The following guidelines can be used to assess the feasibility of a model from its stress value (Johnson & Wichern, 2007).

Table 1
Stress value criteria for assessing the feasibility of the model.
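The phases listed above (distance matrix, eigendecomposition, coordinates, stress) can be illustrated with a minimal NumPy sketch. It is an assumption-laden illustration rather than the procedure used in this study: it implements the classical (metric) eigendecomposition route, whereas nonmetric MDS software typically adds a monotone regression step on the ordinal dissimilarities. The function names and the toy data (five hypothetical objects with three indicators) are invented for the example.

```python
import numpy as np

def euclidean_distance_matrix(X):
    """Pairwise Euclidean distances between the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def classical_mds(D, n_components=2):
    """Object coordinates from a distance matrix via eigendecomposition."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    B = -0.5 * J @ (D ** 2) @ J               # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(B)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    lam = np.clip(eigvals[order], 0.0, None)  # guard against tiny negative values
    return eigvecs[:, order] * np.sqrt(lam)

def stress(D, D_hat):
    """Stress between original distances D and map distances D_hat."""
    iu = np.triu_indices_from(D, k=1)         # each pair i < j once
    return np.sqrt(((D[iu] - D_hat[iu]) ** 2).sum() / (D[iu] ** 2).sum())

# Toy data (hypothetical): 5 objects described by 3 indicators.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
D = euclidean_distance_matrix(X)
coords = classical_mds(D, n_components=2)
D_hat = euclidean_distance_matrix(coords)
print("stress of the 2-D map:", stress(D, D_hat))
```

Keeping as many dimensions as the data have reproduces the distances almost exactly, so the stress approaches zero; a two-dimensional map trades some stress for readability.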

Panel Regression
Panel data regression analyzes data obtained by observing several individuals, each over several consecutive periods (time units) (Bai and Kunpeng, 2014).

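Estimation Model of Panel Data Regression Analysis

Common Effects Model
The common effects model assumes that the intercept and slope values do not differ, either between individuals or over time. In general, the equation of the common effects model is as follows (Baltagi, 2008).
$$y_{it} = \alpha + \boldsymbol{\beta}' \mathbf{x}_{it} + \varepsilon_{it}$$
where $y_{it}$ is the response of individual $i$ in period $t$, $\mathbf{x}_{it}$ is the vector of explanatory variables, $\alpha$ is the common intercept, $\boldsymbol{\beta}$ is the common slope vector, and $\varepsilon_{it}$ is the error term.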

Fixed Effect Model
Panel data regression under the fixed effect model is estimated by adding dummy variables, a technique known as the Least Squares Dummy Variable (LSDV) method. There are two assumptions in the fixed effect model (Hsiao, 2003).
1. The slope is constant, but the intercept varies between individuals.
2. The slope is constant, but the intercept varies between individuals and between periods.
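The difference between the common effects model and the LSDV form of the fixed effect model can be sketched in a few lines of NumPy. This is only an illustrative sketch on synthetic data: the panel dimensions, the true slope of 2.0, the unit intercepts, and the function names are all assumptions made for the example and are unrelated to the study's dataset.

```python
import numpy as np

def pooled_ols(y, X):
    """Common effects model: one intercept and one slope vector for all units."""
    Z = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return coef, y - Z @ coef

def lsdv(y, X, unit_ids):
    """Fixed effect model via LSDV: one dummy (intercept) per unit, common slopes."""
    units = np.unique(unit_ids)
    dummies = (unit_ids[:, None] == units[None, :]).astype(float)
    Z = np.column_stack([dummies, X])
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return coef[:len(units)], coef[len(units):], y - Z @ coef

# Synthetic panel (hypothetical): 4 units, 5 periods, one regressor.
rng = np.random.default_rng(1)
n_units, n_periods = 4, 5
unit_ids = np.repeat(np.arange(n_units), n_periods)
x = rng.normal(size=n_units * n_periods)
alpha = np.array([1.0, 3.0, -2.0, 0.5])            # unit-specific intercepts
y = alpha[unit_ids] + 2.0 * x + rng.normal(scale=0.01, size=x.size)

intercepts, slopes, resid_fe = lsdv(y, x[:, None], unit_ids)
coef_ce, resid_ce = pooled_ols(y, x[:, None])
print("LSDV intercepts:", intercepts)
print("LSDV slope:", slopes, "pooled slope:", coef_ce[1:])
```

Because the LSDV design matrix contains the pooled design as a special case, its residual sum of squares can never exceed that of the pooled fit; this gap is what a test between the two models evaluates.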

Random Effect Model
The panel data regression estimation in the random effects model employs the Generalized Least Squares (GLS) method.
There are two key assumptions regarding the random effect (Hsiao, 2003).
1. The intercept and slope vary between individuals.
2. The intercept and slope vary between individuals and over time.

Selection of Panel Data Regression Model Chow Test
The Chow test is used to choose between the Fixed Effect Model and the Common Effect Model in panel data regression (Ioan et al., 2020). The test uses the following hypotheses (Binkley et al., 2018).
$H_0: \alpha_1 = \alpha_2 = \dots = \alpha_K = \alpha$ (Common Effect Model)
$H_1$: at least one $\alpha_i$ differs (Fixed Effect Model)
The basis for rejection is the F-statistic (Baltagi, 2008):
$$F = \frac{(RSS_{CE} - RSS_{FE})/(K - 1)}{RSS_{FE}/(KT - K - p)}$$
where $RSS_{CE}$ and $RSS_{FE}$ are the residual sums of squares of the Common Effect Model and the Fixed Effect Model, $K$ is the number of cross-section units, $T$ is the number of periods, and $p$ is the number of explanatory variables.
The Chow statistic follows the F distribution with $(K - 1,\; KT - K - p)$ degrees of freedom. If the Chow statistic is greater than the F-table value, or if the p-value $< \alpha$, then $H_0$ is rejected, and vice versa.
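As a worked illustration of the Chow decision rule, the sketch below computes the F-statistic from the residual sums of squares of the two models, assuming the usual form in which the restricted model is the common effects fit and the unrestricted model is the fixed effect fit. The function name, argument names, and the numbers in the example are hypothetical and are not taken from Table 4.

```python
def chow_f_statistic(rss_common, rss_fixed, n_units, n_periods, n_regressors):
    """F-statistic comparing the common effects and fixed effect models.

    rss_common: residual sum of squares of the common effects model
    rss_fixed:  residual sum of squares of the fixed effect (LSDV) model
    """
    df1 = n_units - 1                                   # restrictions tested
    df2 = n_units * n_periods - n_units - n_regressors  # residual df of LSDV
    if df2 <= 0:
        raise ValueError("not enough observations for the fixed effect model")
    f_stat = ((rss_common - rss_fixed) / df1) / (rss_fixed / df2)
    return f_stat, df1, df2

# Hypothetical values: 5 units, 4 periods, 2 explanatory variables.
f_stat, df1, df2 = chow_f_statistic(120.0, 80.0, n_units=5, n_periods=4, n_regressors=2)
print(f_stat, df1, df2)
```

The statistic is then compared with the F-table value for (df1, df2) degrees of freedom; a statistical library can supply the corresponding p-value.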

Hausman Test
This test is motivated by a trade-off: the Fixed Effect Model loses degrees of freedom because dummy variables are included, while the Random Effect Model requires care to ensure that none of the assumptions on the error components is violated (Binkley et al., 2018). The hypotheses used are:
$H_0$: the appropriate model is the Random Effect Model
$H_1$: the appropriate model is the Fixed Effect Model
The basis for rejecting $H_0$ is the Hausman statistic, formulated as follows (Greene, 2000).
$$W = (\hat{\beta}_{FE} - \hat{\beta}_{RE})' \left[\mathrm{Var}(\hat{\beta}_{FE}) - \mathrm{Var}(\hat{\beta}_{RE})\right]^{-1} (\hat{\beta}_{FE} - \hat{\beta}_{RE})$$
The statistic follows the chi-square distribution: if $W$ is greater than $\chi^2_{(\alpha,\, p)}$, where $p$ is the number of explanatory variables, or if the p-value $< \alpha$, then $H_0$ is rejected, and vice versa.
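The Hausman statistic is a quadratic form in the difference between the fixed effect and random effect slope estimates. The sketch below assumes that form; the function name and the coefficient vectors and covariance matrices in the example are hypothetical placeholders, not estimates from this study.

```python
import numpy as np

def hausman_statistic(b_fe, b_re, cov_fe, cov_re):
    """Hausman statistic and its degrees of freedom.

    b_fe, b_re:     slope estimates from the fixed and random effect models
    cov_fe, cov_re: their estimated covariance matrices
    """
    diff = np.asarray(b_fe, dtype=float) - np.asarray(b_re, dtype=float)
    cov_diff = np.asarray(cov_fe, dtype=float) - np.asarray(cov_re, dtype=float)
    stat = float(diff @ np.linalg.solve(cov_diff, diff))
    return stat, diff.size

# Hypothetical estimates for two explanatory variables.
stat, df = hausman_statistic(
    b_fe=[1.0, 2.0],
    b_re=[0.9, 2.2],
    cov_fe=np.diag([0.05, 0.08]),
    cov_re=np.diag([0.03, 0.04]),
)
print(stat, df)
```

A small statistic relative to the chi-square critical value with `df` degrees of freedom means the two sets of estimates do not differ systematically, which favors the random effect model.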

Lagrange Multiplier Test
The Lagrange multiplier (LM) test is used to find out whether the random effects model is better than the common effects model (Breusch & Pagan, 1980).
$H_0$: the appropriate model is the common effects model
$H_1$: the appropriate model is the random effects model
The test statistic for the Lagrange multiplier test is as follows (Baltagi et al., 2012).
$$LM = \frac{KT}{2(T - 1)} \left[ \frac{\sum_{i=1}^{K} \left( \sum_{t=1}^{T} e_{it} \right)^2}{\sum_{i=1}^{K} \sum_{t=1}^{T} e_{it}^2} - 1 \right]^2$$
where $K$ is the number of sectors, $T$ is the number of periods, and $e_{it}$ is the residual of the Common Effect Model. The test criteria are: if $LM > \chi^2_{(\alpha,\, 1)}$ or if the p-value $< \alpha$, then $H_0$ is rejected, and vice versa.
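The LM statistic depends only on the residuals of the common effects model, arranged by unit and period. The following sketch assumes the standard Breusch-Pagan form for a balanced panel; the function name and the residual matrix in the example are hypothetical.

```python
import numpy as np

def breusch_pagan_lm(resid):
    """LM statistic for random effects in a balanced panel.

    resid: array of shape (n_units, n_periods) holding the residuals of
           the common effects model.
    """
    e = np.asarray(resid, dtype=float)
    n_units, n_periods = e.shape
    # Ratio of squared unit totals to the total sum of squared residuals.
    ratio = (e.sum(axis=1) ** 2).sum() / (e ** 2).sum()
    return n_units * n_periods / (2.0 * (n_periods - 1)) * (ratio - 1.0) ** 2

# Hypothetical residuals: each unit's residuals share a sign, which is the
# pattern an unmodelled unit effect would leave behind.
lm = breusch_pagan_lm([[1.0, 1.0, 1.0], [-1.0, -1.0, -1.0]])
print(lm)
```

Residuals that keep the same sign within a unit inflate the squared unit totals and therefore the statistic, which is why a large LM value points toward the random effects model.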

Breusch Pagan Test
The Breusch-Pagan test was performed to see whether the fixed effect and random effect models contain an individual (cross-section) effect, a time effect, or both.

Model Selection
An optimal regression model yields unbiased linear estimates, known as the Best Linear Unbiased Estimator (BLUE). Meeting classical assumptions is crucial for this, especially in the context of combined cross-sectional and time-series data. Overcoming issues related to these assumptions, such as heteroscedasticity and autocorrelation, is vital to ensure the model is analyzable and delivers accurate results.

Cluster Analysis
Cluster analysis is a multivariate technique that groups objects/cases based on the similarity of their characteristics (Johnson & Wichern, 2007). K-Means is a non-hierarchical clustering algorithm that assigns each object to the cluster whose center is nearest (Bhattacharjee et al., 2017). One commonly used distance is the Euclidean distance:
$$d(x_i, x_j) = \sqrt{\sum_{l=1}^{p} (x_{il} - x_{jl})^2}$$
where $x_i$, $x_j$ are the two data points between which the distance is calculated and $p$ is the dimension of the data. The cluster center is determined by the following equation:
$$c_k = \frac{1}{n_k} \sum_{x_i \in C_k} x_i$$
where $C_k$ is the k-th cluster and $n_k$ is the number of data points it contains.
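The alternation between assigning points to the nearest center and recomputing the centers can be sketched as follows. This is a minimal illustration of the K-Means idea, assuming random initial centers drawn from the data and a stopping rule based on unchanged assignments; the function name, the seed, and the six toy points are assumptions of the example.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Basic K-Means with Euclidean distance; returns labels and centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Distance of every point to every center, then nearest-center assignment.
        dist = np.sqrt(((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1))
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):   # assignments unchanged: converged
            break
        labels = new_labels
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:                 # keep old center if cluster is empty
                centers[j] = members.mean(axis=0)
    return labels, centers

# Toy data (hypothetical): two well-separated groups of three points each.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centers = kmeans(X, k=2)
print(labels)
print(centers)
```

Which group receives label 0 depends on the random initialization, so only the partition, not the label values, should be interpreted.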
The quality of the resulting grouping can be checked with a homogeneity test. This test uses the silhouette coefficient and is computed once the algorithm has converged, that is, when the grouping from the last iteration is identical to that of the previous iteration (Erda et al., 2023).
The silhouette coefficient starts from $a(i)$, the average distance from the i-th data point to all other data in the same cluster. Here we assume that the i-th data point is in cluster A. The formula for $a(i)$ is written in the following equation (Struyf et al., 1997).
$$a(i) = \frac{1}{|A| - 1} \sum_{j \in A,\, j \neq i} d(i, j)$$
where $|A|$ is the amount of data in cluster A. Next, calculate the value $b(i)$, the minimum of the average distances from the i-th data point to all data in the other clusters. Let C be a cluster other than A. The average distance between the i-th data point and all data in cluster C is as follows:
$$d(i, C) = \frac{1}{|C|} \sum_{j \in C} d(i, j)$$
After computing $d(i, C)$ for all clusters $C \neq A$, select the minimum distance value as $b(i)$:
$$b(i) = \min_{C \neq A} d(i, C)$$
If cluster B attains the minimum, so that $d(i, B) = b(i)$, then B is called the neighbor of the i-th data point and is the second-best cluster for it after cluster A. Once $a(i)$ and $b(i)$ are known, the silhouette coefficient is computed as follows:
$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$
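The silhouette computation can be made concrete with a short NumPy sketch. It assumes Euclidean distances, at least two clusters, and the common convention that a point alone in its cluster receives a silhouette of zero; the function name and the four one-dimensional toy points are hypothetical.

```python
import numpy as np

def silhouette_values(X, labels):
    """Silhouette value s(i) for every data point (requires >= 2 clusters)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    clusters = np.unique(labels)
    s = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue                                   # singleton: s(i) stays 0
        a = dist[i, own].sum() / (n_own - 1)           # mean distance within own cluster
        b = min(dist[i, labels == c].mean()            # nearest other cluster
                for c in clusters if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

# Toy data (hypothetical): two tight groups on a line.
X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])
s = silhouette_values(X, labels)
print(s, s.mean())
```

Values close to 1 indicate points that sit well inside their cluster, and the mean over all points is the figure usually compared across candidate numbers of clusters.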

Findings and Discussion
This study utilized panel data collected from government educational websites and central statistical offices across diverse Indonesian regions. Its objective is to identify significant factors influencing school enrolment rates, offering a comprehensive overview of the present educational scenario. The dataset, with its longitudinal nature, enables the analysis of changes over time and variations across regions and educational aspects.

Multidimensional Scaling (MDS) Analysis
Dimensional Diagram Representation of MDS at SD Level

Dimensional Diagram Representation of MDS at SMP Level
Regarding the results in Fig. 2, the MDS visualization for the SMP level shows a grouping pattern similar to that of the SD level. In the SMP visualization, geographically close provinces show similar educational support characteristics. For example, the provinces of West Java, Central Java, and East Java appear to have a similar approach to educational development. The province of North Sumatra, which borders this group on the map, also shows some similarities in its approach to education.
The MDS analysis shows a stable grouping pattern for the SMP level from 2019 to 2023. Despite minor changes, particularly in the proximity of East Java and Central Java, the overall grouping pattern remains constant. Notably, East Nusa Tenggara shifted away from the main cluster in the 2021 MDS visualization (Fig. 3), indicating a change in educational characteristics or approach; in the subsequent year, the province returned to the main cluster. The SMP level analysis aligns with the SD level analysis, highlighting the consistent educational support characteristics shared by certain Indonesian regions. Despite dynamics and shifts, the grouping pattern generally remains stable throughout the study period.

Dimensional Diagram Representation of MDS at SMA Level
In Fig. 4, the MDS plot of SMA level educational support in Indonesia, East Java, Central Java, and West Java cluster together, with the distances between their points reflecting varying education levels among them. Additionally, Bali, East Nusa Tenggara, and North Sumatra stand isolated, indicating marked differences in educational support. In subsequent years (2020-2022), the MDS analysis consistently forms seven similar groups and maintains the isolation of Central Java, East Java, West Java, North Sumatra, East Nusa Tenggara, and Bali, emphasizing notable differences in the educational support characteristics of these regions.
The MDS analysis showed seven groups in each of the first four years; only in the last year, 2023 (Fig. 5), did six groups form. However, the differences in characteristics remained unchanged, hinting at a shift in pattern formation or in the dynamics between regions. The decline in the number of groups suggests that areas that previously formed separate clusters have merged. Despite fewer groups, the unchanged trait differences imply efforts to bridge educational support gaps across regions, indicating a shift in strategy or policy toward more uniform educational approaches, resulting in fewer but more similar groups.

Dimensional Diagram Representation of MDS at SMK Level
The MDS results in Fig. 6 illustrate the proximity between objects, helping to identify objects with similar characteristics. Closer points denote higher similarity, while more distant points signify differences. Bordering provinces like West Java, Central Java, and East Java, along with North Sumatra, DKI Jakarta, and Banten, are noticeable on the map. Conversely, provinces like Aceh, Bali, and Bengkulu exhibit relatively high affinity, indicating unique characteristics or specific educational approaches. Some provinces across Indonesia are more distant, reflecting the country's diverse education approaches. Despite this, proximity on the spatial map emphasizes similarities in educational support characteristics, especially among neighboring areas like Central Java and East Java. In subsequent years, the MDS analysis consistently portrays similar grouping patterns, with certain provinces remaining isolated from the main cluster, while the proximity between East Java and Central Java suggests increasing similarity in characteristics.
In Fig. 7, the multidimensional scaling analysis shows a noteworthy deviation in object distribution from the visualizations of 2019, 2020, and 2021. The grouping pattern shifted notably between 2021 and 2022, although the number of major clusters remained at three. Several points, including East Java, Central Java, and West Java, remained isolated from these clusters. The 2023 analysis shows a distribution pattern similar to that of 2022, implying that the grouping pattern has stabilized after the shift between 2021 and 2022. Three main clusters persist, with East Java, Central Java, and West Java remaining isolated from them. Overall, the MDS analysis at the SMK level shows greater variation in aspects of educational support. The consistent visualizations from year to year indicate an ongoing pattern, with several provinces becoming more similar over time.

According to Faguet and Sanchez (2006), government spending in the education sector has a significant influence on school participation. Similarly, Dreher (2006) argues that government spending in the education sector is a supply factor that influences the quality of education and school participation. This is in line with the MDS mappings at the elementary, middle school, high school, and vocational school levels. The plots show no significant changes over the past five years: although the government continues to increase the education budget, the MDS mappings show no substantial change in characteristics. The allocation and use of education funds must therefore be given more attention and monitored closely by the government, so that the funds are used appropriately and Indonesian education continues to progress and improve each year.

Outliers
To detect outliers in the regional clustering of educational support, we look for points that are significantly distant or isolated from the main cluster in each MDS visualization. Notably, East Java, Central Java, West Java, Bali, DKI Jakarta, Banten, East Nusa Tenggara, and North Sumatra stand out as isolated points, indicating substantial disparities in educational support. These outliers offer valuable insights into areas that need special attention.

Hausman Test
According to Table 5, the chi-square probability is > 0.05. Because the probability value exceeds 5%, H0 is not rejected, and the random effects model should be used in panel data regression modeling.

Breusch Pagan Test
There is a two-way effect. However, after testing separately for cross-section and time effects, only a cross-section effect is present. From the Hausman test and the Breusch-Pagan test, it can be concluded that the model to be estimated is a random effects model with a cross-section effect.

The Durbin-Watson test in Table 7 resulted in a DW value of 2.0585 with a corresponding p-value of 0.7544. Given the DW value's proximity to 2 and the high p-value, there is insufficient evidence to reject the null hypothesis. Thus, the test does not suggest significant serial correlation in the idiosyncratic errors of this panel regression model.
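To show why a Durbin-Watson value near 2 is read as the absence of serial correlation, the statistic can be sketched for panel residuals. The sketch assumes the pooled form in which successive differences are taken within each unit only; this may differ from the variant used by the software behind Table 7, and the function name and residual patterns are hypothetical.

```python
import numpy as np

def panel_durbin_watson(resid):
    """Durbin-Watson statistic for residuals of shape (n_units, n_periods).

    Successive differences are taken along the time axis within each unit,
    so the last period of one unit is never compared with the first period
    of the next.
    """
    e = np.asarray(resid, dtype=float)
    return float((np.diff(e, axis=1) ** 2).sum() / (e ** 2).sum())

# Hypothetical residual patterns for a single unit observed over 4 periods.
dw_alternating = panel_durbin_watson([[1.0, -1.0, 1.0, -1.0]])  # negative correlation
dw_constant = panel_durbin_watson([[1.0, 1.0, 1.0, 1.0]])       # positive correlation
print(dw_alternating, dw_constant)
```

The statistic ranges from 0 (strong positive serial correlation) to 4 (strong negative serial correlation), with values around 2 expected when successive residuals are uncorrelated.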

Homoscedasticity Acceptance Test
The homoscedasticity assumption was tested with a robust covariance estimator for heteroscedasticity. The test found no difference in the t-test results for the coefficients of the independent variables when the robust covariance matrix was used. This agreement satisfies the robust test for heteroscedasticity, that is, the assumption that the residual variance-covariance structure is the same.
The results of the estimation are displayed in Table 8.

Scoring Model
The panel regression results indicate a random effects model run on 34 individual units with 4 observation times, totaling 680 observations. The idiosyncratic component has a variance of 6.292e-05 and a standard deviation of approximately 7.932e-03. Individual component variances are zero, implying no explainable individual variation in the model.
The coefficient of determination (R-squared) is approximately 0.92223, signifying that this model explains about 92.22% of the data variability. The best formula for the panel regression equation is as follows: (22)

Cluster Analysis
Cluster analysis can help identify common characteristics between regions and support targeted interventions (Arisanti et al., 2023).

Fig. 8. Province code in the clustering visualization
The regional grouping in Fig. 9 shows two clusters at the SD level: the first contains West Java, Central Java, and East Java, while the second contains the other provinces. The grouping in Fig. 10 shows two clusters at the SMP level: the first contains West Java, Central Java, East Java, East Nusa Tenggara, South Sulawesi, West Kalimantan, and North Sumatra, while the second contains the other provinces. The grouping in Fig. 11 shows two clusters at the SMA level: the first contains West Java, Central Java, East Java, East Nusa Tenggara, and North Sumatra, while the second contains the other provinces. Finally, the grouping in Fig. 12 shows two clusters at the SMK level: the first contains West Java, Central Java, East Java, and North Sumatra, while the second contains the other provinces.

At every level, the silhouette method yields only two clusters, namely low and high levels of educational participation. This differs from the results of Aryawwan et al. (2022), who obtained five clusters: cluster 1, provinces with a high level of educational participation; cluster 2, provinces with a medium level; cluster 3, provinces with a low level; cluster 4, provinces with a very low level; and cluster 5, provinces with unknown educational participation rates. The more clusters there are, the more detailed the picture of the actual state of educational participation in Indonesia, so detailed information of this kind is expected to make it easier for the government to draw conclusions and make policies to increase educational participation in a region.

Conclusion
The Multidimensional Scaling (MDS) evaluation revealed distinct educational promotion clusters in Indonesia. Certain areas, such as East Java, Central Java, and West Java, showed significant disparities in educational support at primary and upper-secondary levels. However, regional grouping patterns varied more within the SMA and SMK educational tiers.
Regression analysis on the panel data highlighted significant correlations between educational support factors and student distribution across academic levels. Positive effects were observed for various variables promoting educational growth, while certain factors in specific regions had adverse effects on student proportions. This emphasizes the need for public investment to elevate educational standards.
Cluster analysis outcomes identified zones with similar educational support patterns and unique clusters, providing insights into shared characteristics among regions. This information serves as a basis for strategic policy interventions aiming to improve Indonesia's educational quality. With a deeper understanding of the determinants influencing education, there is an optimistic outlook for implementing effective measures to enhance educational quality and foster a brighter future for generations to come.

Fig. 1. Visualization of the use of MDS at the SD level in 2023
Fig. 2. Visualization of the use of MDS at the SMP level in 2019

Fig. 3. Visualization of the use of MDS at the SMP level in 2021
Fig. 4. Visualization of the use of MDS at the SMA level in 2019

Fig. 5. Visualization of the use of MDS at the SMA level in 2023
Fig. 6. Visualization of the use of MDS at the SMK level in 2019

Fig. 7. Visualization of the use of MDS at the SMK level in 2022

Fig. 9. Visualization clustering at SD level
Fig. 10. Visualization clustering at the SMP level

In Table 3, the partial significance test indicates the significance of the independent variables on the dependent variable in the regression model. If the p-value (Pr(>|z|)) of an independent variable is < 0.05, it suggests a significant impact. Nine variables notably influence the proportion of students across educational levels.

In Table 4, the probability of the F statistic is < 0.05. Because the probability value is below 5%, H0 is rejected, and the fixed effect model should be used in panel data regression modeling.

Table 8
Test Results of The Robust Covariance Estimator