PREDICTOR

The need to identify and assess the dynamics of albuminuria and glomerular filtration rate in patients with type 1 and type 2 diabetes (as an early marker of endothelial dysfunction and a predictor of cardiovascular complications) during pathogenetic therapy with the glycosaminoglycan sulodexide was substantiated, and the assessment was carried out. A clinically significant reduction in the urinary excretion of both total protein and albumin was found. This indicates an improvement of microcirculation in the kidney, along with an increase in glomerular filtration rate. The increase in glomerular filtration rate is most pronounced in patients with type 2 diabetes and comorbid non-alcoholic fatty liver disease.


Introduction
According to UNESCO, education, beyond being a fundamental right, entails a transformative process that makes it possible to identify whether the defined standards align with the demands imposed by society. Consequently, the educational evaluation process becomes pivotal for the progressive enhancement of education (Sanabria James et al., 2020). In this vein, the educational policy advocated by the Ministry of National Education of Colombia aims for the development of academic competencies by educational institutions to be the primary objective in basic, middle, and higher education (Palacios-Mena, 2018), conceiving competencies from the standpoint of a successful life and a developed society. One of the most significant indicators of the quality of education in Colombia is the results attained by students in the Saber tests, which gauge the extent of development of academic competencies across various disciplines (Garizabalo Dávila, 2012). These test outcomes serve as a diagnostic tool for identifying students' strengths and weaknesses, thereby enabling educational institutions to receive feedback on curricular aspects and to make strategic decisions aimed at enhancing the effectiveness of the teaching-learning process (Acero et al., 2016; Sanabria James et al., 2020). The ICFES (Colombian Institute for the Evaluation of Education) is the governmental institute tasked with designing and evaluating the Saber tests for the third and fifth grades of primary education, for the ninth and eleventh grades of secondary education, and for the university level, in which case they are referred to as Saber Pro tests (Morales-Piñero et al., 2019; Palacios-Mena & Rodríguez-Márquez, 2019).
The Ministry of National Education (MEN) uses the Saber tests across its various levels to monitor and assess the quality of education accessible to students, and the Saber 11 tests are pivotal because of the linkage between secondary and higher education (Iguarán Jiménez et al., 2023; Timarán Pereira et al., 2020). The Saber 11 tests are standardized assessments taken by grade 11 students as a requirement for admission to undergraduate programs at universities in Colombia. They evaluate five areas through closed-ended questions: Critical Reading, Mathematics, Natural Sciences, Social Sciences, and English (Alonso et al., 2012; Ruiz-Escorcia et al., 2017). The competencies assessed by the ICFES through the Saber tests in the various areas include interpretation and representation, reasoning and argumentation, formulation and execution, articulation of a text, reflection on content, social thinking, and explanation of different phenomena, among others (Díaz Pinzón, 2020).

With the advancement of data science in recent years, educational data mining (EDM) has emerged as an interdisciplinary field tasked with applying computational methods to explore data originating from the educational context. Its purpose is to extract value-added information that benefits decision-making in the educational sector regarding academic performance, the teaching-learning process, or the enhancement of educational quality (Devasia et al., 2016; Nasiri et al., 2012; Pathan et al., 2014; Romero & Ventura, 2010). In this regard, various contributions have been identified in the state of the art concerning the application of EDM in the context of Saber tests. Thus, in (Timarán-Pereira et al., 2019), a classification model based on decision trees is applied to detect factors associated with the academic performance of Colombian students who took the Saber 11 tests in 2015 and 2016. Similarly, in (Timarán Buchely & Timarán Pereira, 2023), a data mining study based on decision trees is conducted to determine patterns associated with academic performance in the generic competencies of the Saber Pro tests among students at the Universidad Javeriana de Cali in the years 2017 and 2018. In (Chanchí-Golondrino et al., 2021), an exploratory and spatial analysis of the data pertaining to the Saber 5 tests of 2016 is conducted to correlate and classify test results with the spatial distribution of the data. In (García-González et al., 2019), a neural network-based model is proposed for predicting performance in the Saber Pro tests at Higher Education Institutions in the city of Barranquilla, based on a dataset containing the results of this test. In (Arboleda-Posada et al., 2022), a data mining study is conducted to identify factors influencing performance levels for military or police training programs in Colombia, using data from the Saber Pro tests between 2012 and 2019. In (Timarán-Pereira et al., 2023), decision tree models are applied to detect patterns influencing academic performance in the mathematics competency of the Saber 5 tests. In (Oviedo Carrascal & Jiménez Giraldo, 2019), a data mining study based on supervised and unsupervised learning is conducted on the results of the Saber Pro tests in the Antioquia department (Colombia) from the year 2016, aiming to determine the economic, social, and demographic factors influencing students' performance. In (Narváez Zúñiga, 2022), various predictive models are tested to determine the most effective one for predicting performance based on different factors from the Saber Pro tests between 2016 and 2019, revealing that the multivariable linear regression model is the most suitable. In (Acevedo et al., 2015), the factors associated with course repetition and graduation delays in Engineering programs at the University of Cartagena are assessed, considering the results obtained in the competencies of the Saber Pro tests, as well as information published in the university's statistical bulletins. In (Gorostiaga & Rojo-Álvarez, 2016), a study based on exploratory analysis and machine learning is conducted to characterize the PISA tests in Spain, revealing that variables such as computer availability and immigration status are key factors in mathematics performance.
The studies reviewed show that research has been conducted across different regions using data mining techniques to identify factors influencing academic performance in standardized tests among primary, secondary, and university-level students. In most of these studies, besides conducting exploratory data analysis, machine learning models are applied, with decision tree models being predominantly used. However, there is a lack of research focused on the variables affecting the academic performance of 11th-grade students in the city of Cartagena. Similarly, the studies explored did not reveal a prior investigation into the optimal attributes influencing the prediction of performance across the five areas through heuristic-based selection methods. Cartagena was chosen as the focus of the study because, despite being one of the most important cities and tourist destinations in Colombia, it is also one of the major cities with a high index of poverty and inequality (Ayala-García & Meisel-Roca, 2016).
This article proposes both an exploratory data analysis and the application of a heuristic search method for selecting the best attributes on a representative sample of the dataset from the Saber 11 tests of 2019, corresponding to the city of Cartagena de Indias (13,535 records). The objective of the analysis is to determine possible relationships between the academic, social, and demographic variables in the dataset and the students' academic performance, both overall and in the knowledge areas assessed by the test (critical reading, mathematics, natural sciences, social sciences, and English). The dataset used is freely accessible and was obtained from the Colombia open data portal. Both the exploratory data analysis and the process of selecting the best attributes from the dataset were carried out with the support of open-source tools. For the exploratory data analysis, open-source libraries such as pandas, numpy, matplotlib, seaborn, and scikit-learn were used, while for obtaining the best attributes, the BestFirst algorithm implementation provided by the GPL-licensed Weka tool was used. Hence, the purpose of this study is to produce findings that help governmental authorities and educational institutions identify the main factors influencing student performance in the city of Cartagena, with a view to proposing strategies that contribute to improving academic performance. Similarly, the identification of relevant attributes is a fundamental contribution to the construction of predictive models associated with the academic performance of students in Cartagena in the Saber 11 tests.
The remainder of the article is organized as follows: firstly, the methodology considered for the development of the present study is presented. Subsequently, the results obtained from the pre-processing of the dataset are described, as well as those related to the application of exploratory data analysis and the selection of relevant dataset attributes. Likewise, this section presents the discussion of the results in relation to some literature works. Finally, in the last section, the conclusions and future work derived from the present research are presented.

Methodology
For the development of the present research, a 4-phase adaptation of the SEMMA methodology (Sample, Explore, Modify, Model, and Assess) (Palacios-Gómez et al., 2016; Tariq et al., 2019) was employed, whereby the following phases were defined: F1. Data sampling, F2. Data exploration and modification, F3. Application of the attribute selection method, and F4. Analysis of obtained results (see Figure 1).
In phase 1 of the methodology, the dataset corresponding to the Saber 11 tests at the national level was obtained from the Colombia open data portal. This dataset initially comprises a total of 546,612 records corresponding to the results of students from different municipalities and departments of Colombia. Using the pandas Python library, the records corresponding to the city of Cartagena were filtered, resulting in a total of 13,535 records. It is worth mentioning that, once the data are filtered for the city of Cartagena, some fields or columns must be removed because they are not relevant to the analysis.

The removed columns are identifiers, consecutive values, or columns that hold the same value in every record (country, department, municipality). These columns were eliminated using the functionality provided by the pandas library. This step ensured that the dataset was refined and relevant for the subsequent phases of data exploration and attribute selection.
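The filtering and column-removal step described above can be sketched with pandas as follows. This is only an illustrative sketch on a tiny made-up frame: the column names ESTU_CONSECUTIVO, ESTU_PAIS_RESIDE, COLE_MCPIO_UBICACION and PUNT_GLOBAL, the helper functions, and all values are assumptions, not the exact names or code used in the study.

```python
import pandas as pd

# Tiny stand-in for the national Saber 11 file. Apart from
# FAMI_ESTRATOVIVIENDA, the column names and all values are hypothetical.
national = pd.DataFrame({
    "ESTU_CONSECUTIVO": ["SB1", "SB2", "SB3", "SB4"],
    "ESTU_PAIS_RESIDE": ["COLOMBIA"] * 4,
    "COLE_MCPIO_UBICACION": ["CARTAGENA DE INDIAS", "BOGOTA",
                             "CARTAGENA DE INDIAS", "CALI"],
    "FAMI_ESTRATOVIVIENDA": ["Estrato 1", "Estrato 3", "Estrato 2", "Estrato 1"],
    "PUNT_GLOBAL": [236, 310, 251, 280],
})


def filter_city(df, city_column, city):
    """Keep only the rows whose city column equals the requested city."""
    return df[df[city_column] == city].copy()


def drop_uninformative(df, id_columns):
    """Drop identifier columns and columns holding a single value."""
    constant = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
    to_drop = sorted(set(id_columns) | set(constant))
    return df.drop(columns=to_drop), to_drop


cartagena = filter_city(national, "COLE_MCPIO_UBICACION", "CARTAGENA DE INDIAS")
cleaned, dropped = drop_uninformative(cartagena, id_columns=["ESTU_CONSECUTIVO"])
print(dropped)
print(cleaned)
```

After filtering, the country and municipality columns become constant, which is why they are detected automatically, while identifier columns have to be named explicitly.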

Once the columns listed in Table 1 were deleted, Table 2 presents the set of 60 columns considered for the development of exploratory data analysis and the application of inference rules. These variables generally include information about the student at the socio-economic, family, educational, and nutritional levels, as well as the performance obtained in the areas of critical reading, mathematics, natural sciences, social sciences, and English. For the columns defined for the dataset and presented in Table 2, the records containing NA or "-" values were then cleaned.
For categorical attributes with missing data, imputation was performed using the mode, while for numerical attributes with missing data, imputation was carried out using the mean. These imputation processes were conducted using the functionality provided by the data frames of the pandas library in Python. With the dataset free of null or missing values, the descriptive and exploratory data analysis continued, which included the following operations: counting categories and determining the mode of discrete variables, obtaining statistical measures (mean, minimum, and maximum values) for numerical variables, analyzing correlations between numerical variables to determine their impact on performance, analyzing scatter plots of numerical variables to determine linear increasing relationships, analyzing the distribution of some numerical variables, generating box and whisker plots for columns associated with performance in the 5 areas, and finally generating violin plots that relate different categorical variables to the student's overall performance in the Saber tests.
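The imputation rule described above (mode for categorical attributes, mean for numerical ones) could look roughly like this in pandas. It is a minimal sketch: the helper name impute_missing, the column PUNT_INGLES, and the category labels are illustrative assumptions, and the "-" placeholder is treated as missing before imputing.

```python
import numpy as np
import pandas as pd


def impute_missing(df, placeholders=("-",)):
    """Treat placeholder strings as missing, then impute column by column:
    mean for numeric columns, mode for every other column."""
    out = df.mask(df.isin(list(placeholders)))  # placeholders become NaN
    for col in out.columns:
        if out[col].isna().all():
            continue  # nothing to learn a fill value from
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].mean())
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out


# Hypothetical sample; labels and scores are made up for illustration.
raw = pd.DataFrame({
    "FAMI_EDUCACIONMADRE": ["Secundaria", "-", "Secundaria", "Postgrado"],
    "PUNT_INGLES": [50.0, np.nan, 40.0, 60.0],
})
clean = impute_missing(raw)
print(clean)
```

Mean and mode imputation keep every record but shrink the variance of the imputed columns, which is worth keeping in mind when reading dispersion measures later on.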
In phase 3 of the methodology, following the adjustments and cleaning carried out in phase 2, the BestFirst algorithmic model was applied to determine the best attributes affecting performance in each of the 5 areas of the test and overall performance in the context of Cartagena.This algorithm corresponds to a heuristic search model that explores the space of possible attribute subsets, continuously evaluating the quality of each combination with respect to a predefined evaluation criterion.This algorithm allows the user to specify an evaluator, such as CfsSubsetEval, which determines the relevance and redundancy of the attributes.
BestFirst conducts a heuristic-guided search to find subsets that maximize the evaluation criterion. Among its advantages are the ability to handle large attribute spaces and the flexibility to adapt to different evaluation criteria. By selecting optimal subsets, BestFirst helps improve the performance of machine learning models by reducing dimensionality and highlighting key attributes for the specific task.
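To make the idea of a heuristic subset search concrete, the following is a simplified forward best-first search driven by a correlation-based merit in the style of CFS, merit = k·r_cf / sqrt(k + k(k-1)·r_ff), where r_cf is the mean attribute-class correlation and r_ff the mean attribute-attribute correlation. It is a sketch, not Weka's implementation: the function names, the patience stopping rule, and the correlation values are assumptions made purely for illustration.

```python
import heapq
import itertools

import numpy as np


def cfs_merit(subset, feat_class_corr, feat_feat_corr):
    """CFS-style merit: rewards relevance to the class, penalises redundancy."""
    k = len(subset)
    if k == 0:
        return 0.0
    idx = sorted(subset)
    r_cf = np.mean([feat_class_corr[i] for i in idx])
    if k == 1:
        r_ff = 0.0
    else:
        r_ff = np.mean([feat_feat_corr[i, j]
                        for i, j in itertools.combinations(idx, 2)])
    return float(k * r_cf / np.sqrt(k + k * (k - 1) * r_ff))


def best_first_search(n_features, evaluate, patience=5):
    """Forward best-first search: always expand the best unexpanded subset and
    stop after `patience` consecutive expansions without improvement."""
    start = frozenset()
    best_subset, best_score = start, evaluate(start)
    tie_breaker = itertools.count(1)
    open_heap = [(-best_score, 0, start)]  # max-heap via negated scores
    visited = {start}
    stale = 0
    while open_heap and stale < patience:
        _, _, subset = heapq.heappop(open_heap)
        improved = False
        for f in range(n_features):
            child = subset | {f}
            if f in subset or child in visited:
                continue
            visited.add(child)
            score = evaluate(child)
            heapq.heappush(open_heap, (-score, next(tie_breaker), child))
            if score > best_score + 1e-12:
                best_subset, best_score = child, score
                improved = True
        stale = 0 if improved else stale + 1
    return sorted(best_subset), best_score


# Made-up correlations: attributes 0 and 1 are relevant but redundant with
# each other, attribute 2 is weak, attribute 3 is relevant and independent.
feat_class = np.array([0.80, 0.75, 0.10, 0.60])
feat_feat = np.array([
    [1.00, 0.95, 0.10, 0.10],
    [0.95, 1.00, 0.10, 0.10],
    [0.10, 0.10, 1.00, 0.10],
    [0.10, 0.10, 0.10, 1.00],
])
selected, merit = best_first_search(
    4, lambda s: cfs_merit(s, feat_class, feat_feat))
print(selected, round(merit, 4))
```

In this toy case the search keeps attributes 0 and 3 and leaves out attribute 1, since adding a near-duplicate of attribute 0 lowers the merit even though it is individually relevant.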
Finally, in phase 4 of the methodology, the attributes determined by the BestFirst algorithm for each of the 5 areas evaluated in the test were collected, along with the analysis of the merit metric of each subset of data obtained.This metric indicates the quality of the subset of attributes selected by the algorithm during the search and refers to the measure of how well a particular set of attributes performs according to the selected evaluation criterion.

Results
In the first instance, at the level of descriptive analysis of the dataset, statistical measures associated with the numerical variables were obtained. A key finding was the average performance achieved globally and in the five disciplinary areas of the test (critical reading, mathematics, natural sciences, social sciences, and English).
The average overall score obtained in the city of Cartagena was 236 out of 500 possible points, while the maximum and minimum scores obtained globally were 475 and 195, respectively.
Similarly, the score with the highest mean in the city of Cartagena corresponds to critical reading competence, with a value of 50.642, whereas the score with the lowest mean in Cartagena is in social and civic sciences, with a value of 43.8963.
Additionally, the score showing the lowest data dispersion (standard deviation) is in critical reading, with a value of 11.2107. The average results in the five considered areas can be more clearly observed in Figure 2. In the same vein, upon conducting quartile analysis, it was found that for critical reading, 75% of students (Q3 quartile) obtained scores equal to or below 59, whereas for social sciences, 75% of students obtained scores equal to or below 53.
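The descriptive measures discussed above (mean, standard deviation, extremes, and the third quartile) can be obtained in a few lines of pandas. The sketch below uses made-up scores, and the column names PUNT_LECTURA_CRITICA and PUNT_SOCIALES_CIUDADANAS are assumed for illustration.

```python
import pandas as pd

# Made-up scores for two areas; column names are assumptions.
scores = pd.DataFrame({
    "PUNT_LECTURA_CRITICA": [40, 50, 60, 70],
    "PUNT_SOCIALES_CIUDADANAS": [30, 40, 50, 60],
})

# One row per area, one column per statistic.
summary = scores.agg(["mean", "std", "min", "max"]).T
summary["q3"] = scores.quantile(0.75)  # 75% of students score at or below q3

highest_mean_area = summary["mean"].idxmax()
print(summary)
print(highest_mean_area)
```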
Furthermore, regarding the count of categorical variables, it is noteworthy in terms of mode that most students who took the test belong to socioeconomic stratum 1, come from families of 3 to 4 members, reside primarily in houses with 2 rooms, and the most common educational level among their parents is high school.Moreover, the fathers mostly work independently, the schools attended by the students are predominantly technical/academic and non-bilingual, many schools are located in urban areas and operate in the morning shift, mostly adhering to schedule A.

To identify potential correlations among the continuous numerical variables of the dataset, and to determine whether any pair of variables exhibits a linear relationship that would subsequently allow a regression model to be fitted, a correlation matrix was computed for these variables (see Figure 3) and depicted as a heatmap generated with the seaborn library in Python.
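A correlation matrix and heatmap of this kind can be produced along the following lines. This is a sketch under assumptions: the sample values are invented, the column names PUNT_GLOBAL, PUNT_MATEMATICAS and ESTU_INSE_INDIVIDUAL are assumed, and Pearson correlation is used because the text speaks of linear relationships.

```python
import pandas as pd


def correlation_matrix(df):
    """Pearson correlation between the numeric columns only."""
    return df.select_dtypes("number").corr(method="pearson")


def plot_heatmap(corr):
    """Draw the matrix as an annotated heatmap. Plotting libraries are
    imported here so the numeric part runs without them installed."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    fig, ax = plt.subplots(figsize=(6, 5))
    sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm",
                vmin=-1, vmax=1, ax=ax)
    ax.set_title("Correlation matrix of numerical variables")
    return fig


# Invented values; column names are assumptions.
sample = pd.DataFrame({
    "PUNT_GLOBAL": [200, 250, 300, 350],
    "PUNT_MATEMATICAS": [40, 50, 60, 70],
    "ESTU_INSE_INDIVIDUAL": [50.0, 45.0, 60.0, 55.0],
    "FAMI_TIENEINTERNET": ["Si", "No", "Si", "Si"],  # ignored: not numeric
})
corr = correlation_matrix(sample)
ranked = corr["PUNT_GLOBAL"].drop("PUNT_GLOBAL").sort_values(ascending=False)
print(ranked)
```

Calling plot_heatmap(corr) renders the figure. Ranking one column of the matrix, as done for PUNT_GLOBAL here, is a quick way to read off which variables are most linearly related to the overall score.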
Based on the correlation matrix presented in Figure 3, it becomes pertinent to assess the correlations between performance in the five different areas and the overall score obtained. Likewise, it is important to describe the correlation between the economic indices and the score of each area of interest.
It can be observed in Figure 3 that all five disciplinary areas exhibit correlations of at least 0.82 with the overall score, with social sciences showing the highest correlation (0.92) and English the lowest (0.82). Additionally, as expected, each score column for the five areas is highly correlated (values exceeding 0.9) with its associated attributes (percentile and performance).
Likewise, it is important to mention that the highest correlation of critical reading scores is observed with social sciences scores (0.8).Similarly, the highest correlation of mathematics scores is with natural sciences scores (0.81).
On the other hand, the highest correlations of natural sciences scores are with mathematics scores (0.81) and social sciences scores (0.78). Furthermore, the highest correlation of English scores is with natural sciences scores (0.73). Finally, concerning economic variables, the correlation between the overall score and the student's socioeconomic index (INSE) is not high, with a value of 0.43. Moreover, the highest correlation between the socioeconomic index (INSE) and the scores in the different areas is with English (0.48), although this value is not high either. This leads to the conclusion that it is possible to fit linear regression models between some knowledge areas and the overall score, as correlations exceeding 0.8 are evident in these cases.
According to the results obtained from the box plot in Figure 5, the median values of the five areas range between 40 and 60, with the highest median value observed in the critical reading area, while the lowest median value is in the social sciences area.
Additionally, the area with the least data dispersion (narrower box) is natural sciences, whereas the area with the most data dispersion (wider box) is social sciences.Concerning data symmetry, the critical reading and mathematics areas demonstrate better symmetry in the data distribution, suggesting a balanced distribution within the interquartile region and an equal amount of data above and below the median for these two areas.In contrast, the natural sciences, social sciences, and English areas exhibit a negative skewness or left tail, indicating that scores in these areas are concentrated below the median.Finally, the areas with a higher number of different outliers are mathematics, natural sciences, and English, corresponding to scores close to or equivalent to the maximum and minimum values.
To identify the relationship between certain socioeconomic categorical variables and overall performance, violin plots were employed. In Figure 6, the violin plot relates the student's socioeconomic stratum (FAMI_ESTRATOVIVIENDA) to the normalized overall score obtained in the Saber tests. From Figure 6, it is possible to observe that there is no clear differentiation in test performance across socioeconomic strata, as evidenced by the overlaps between the violins of the different strata. However, the highest median is observed for stratum 5, while the lowest median is apparent in the categories without stratum and in stratum 6.
On the other hand, Figure 7 depicts the violin plot relating the categorical variable FAMI_EDUCACIONPADRE to the normalized overall score obtained in the test, where FAMI_EDUCACIONPADRE corresponds to the categorized educational level of the student's father. The purpose of this graph is to ascertain whether the father's level of education has an impact on the overall score.
According to the results in Figure 7, it is evident that students whose fathers have postgraduate education achieved a higher overall score in the test, with a median of 0.6 in the normalized value of the overall score. Similarly, students whose fathers do not have any formal education exhibit the lowest median in the overall score.
FAMI_TIENEINTERNET corresponds to whether the student's family has internet service. From Figure 9, it can be observed that although the median (0.4) of students who have internet is slightly higher than the median of students who do not have internet service (0.3), there is significant overlap between the violins, indicating that this attribute is not considered decisive in student performance.
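A violin plot of a normalized score against a categorical variable can be sketched as below. The article does not state how the overall score was normalized, so min-max scaling to the range 0 to 1 is assumed here. The column PUNT_GLOBAL, the new column PUNT_GLOBAL_NORM, and the sample values are likewise illustrative assumptions.

```python
import pandas as pd


def min_max_normalize(series):
    """Rescale a numeric series to the [0, 1] interval (assumed method)."""
    return (series - series.min()) / (series.max() - series.min())


def plot_violin(df, category, value):
    """One violin per category. Plotting libraries are imported here so the
    numeric part runs without them installed."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    fig, ax = plt.subplots(figsize=(7, 4))
    sns.violinplot(data=df, x=category, y=value, ax=ax)
    ax.tick_params(axis="x", rotation=45)
    return fig


# Invented records for two strata.
students = pd.DataFrame({
    "FAMI_ESTRATOVIVIENDA": ["Estrato 1", "Estrato 1", "Estrato 2",
                             "Estrato 2", "Estrato 2"],
    "PUNT_GLOBAL": [200, 300, 400, 250, 350],
})
students["PUNT_GLOBAL_NORM"] = min_max_normalize(students["PUNT_GLOBAL"])

# The per-category medians are the values compared in the text.
medians = students.groupby("FAMI_ESTRATOVIVIENDA")["PUNT_GLOBAL_NORM"].median()
print(medians)
```

Calling plot_violin(students, "FAMI_ESTRATOVIVIENDA", "PUNT_GLOBAL_NORM") draws the figure. The grouped medians give the same comparison numerically, which is useful when violins overlap heavily.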
Similarly, in Figure 10, the violin plot relates the variable FAMI_TIENEAUTOMOVIL to the normalized overall test score, where FAMI_TIENEAUTOMOVIL corresponds to whether the student's family has a car or not. From Figure 10, it can be observed that both the median of students whose families own a car and the median of students whose families do not are close to 0.4, with significant overlap between the violins of these categories. Therefore, this attribute is not considered decisive in student performance.
The dataset was adapted from spreadsheet format to ARFF format. Thus, using Weka, the best attributes influencing performance in the Saber tests at both global and individual area levels were determined (see Figure 13). It is worth mentioning that attributes involving scores and performances in the different areas were discarded for the process so that these attributes can be used in the future as predictor attributes. Both the continuous attributes and the remaining discrete attributes were considered for the process.
Based on the results obtained in Table 4, it is notable that for the critical reading and social sciences areas in relation to the overall test score, the attribute related to maternal education is more significant than that involving paternal education. Similarly, in these cases, the attribute referring to the number of hours dedicated to daily reading (ESTU_DEDICACIONLECTURADIARIA) influences performance in these two areas. Regarding the mathematics and natural sciences areas in relation to the overall test score, Table 4 shows that maternal education is more relevant than paternal education for predicting performance in these areas. In the case of the English area, it was likewise determined that, in relation to the overall score, the attribute related to maternal education is more important than that addressing paternal education. In this case, the attributes referring to the school's calendar (COLE_CALENDARIO) and the student's household stratum (FAMI_ESTRATOVIVIENDA) are also crucial in predicting performance in this area. It is worth mentioning that these two attributes (COLE_CALENDARIO and FAMI_ESTRATOVIVIENDA) do not appear as relevant in the other four analyzed areas. Finally, the set of attributes with the best metric is the one associated with the English area, with a merit metric of 1.75, indicating that this set of attributes can contribute to the implementation of more efficient supervised learning models.
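The spreadsheet-to-ARFF conversion mentioned in this section can be approximated with a small writer like the one below. It is a minimal sketch rather than the procedure actually followed: the function name, the EDAD column, and the sample rows are hypothetical, nominal values are assumed to contain no single quotes, and missing cells are written as "?", the ARFF marker for missing values.

```python
import pandas as pd


def dataframe_to_arff(df, relation):
    """Serialise a DataFrame as ARFF text: numeric columns become `numeric`
    attributes, all other columns become nominal attributes."""
    lines = [f"@relation {relation}", ""]
    numeric = {c: pd.api.types.is_numeric_dtype(df[c]) for c in df.columns}
    for col in df.columns:
        if numeric[col]:
            lines.append(f"@attribute {col} numeric")
        else:
            values = sorted(df[col].dropna().astype(str).unique())
            quoted = ",".join(f"'{v}'" for v in values)
            lines.append(f"@attribute {col} {{{quoted}}}")
    lines += ["", "@data"]
    for row in df.itertuples(index=False):
        cells = []
        for col, value in zip(df.columns, row):
            if pd.isna(value):
                cells.append("?")  # ARFF missing-value marker
            elif numeric[col]:
                cells.append(str(value))
            else:
                cells.append(f"'{value}'")
        lines.append(",".join(cells))
    return "\n".join(lines) + "\n"


# Hypothetical rows; EDAD is an assumed column name.
table = pd.DataFrame({
    "FAMI_ESTRATOVIVIENDA": ["Estrato 1", "Estrato 2", None],
    "EDAD": [17, 18, 16],
    "Q_GLOBAL": ["Q1", "Q3", "Q2"],
})
arff_text = dataframe_to_arff(table, "saber11_cartagena")
print(arff_text)
```

Placing the class attribute (here Q_GLOBAL) last matches Weka's default assumption that the final attribute is the class.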
Similarly, the free tool Weka, through the BestFirst algorithm, was valuable in identifying the best attributes influencing performance in the Saber tests at the global level and in each of the five test areas.
The application of the BestFirst algorithm for determining the relevant attributes impacting performance in each area revealed that, in general terms, maternal education, school characteristics, economic factors, and benefits received through the Generación-E program serve as indicators of performance.In the specific cases of critical reading and social sciences, the parameter associated with hours devoted to daily reading influences performance in these areas.For the English area, two additional attributes related to household stratum and school calendar are significant.
The set of attributes selected for the English area achieved a better merit metric than those obtained in the other four areas, suggesting promising outcomes when implementing supervised learning models to predict performance based on the attributes presented in Table 4. Unlike the other areas, English performance has more categories associated with classifications corresponding to international standards.
As future work stemming from this research, the intention is to evaluate various supervised learning models to determine, either through cross-validation or metrics from the confusion matrix, the best model for predicting performance in the Saber 11 tests.This will be done considering the different attributes selected and presented in Table 4.

Figure 2. Average performance obtained in the 5 areas evaluated

Figure 3. Correlation matrix of the numerical variables of the dataset

Figure 4. Distribution of the global score and individual socioeconomic index variables

Figure 5. Box plot diagram for the 5 evaluated knowledge areas

Figure 6. Violin plot for the strata and the normalized overall score

Figure 7. Violin plot for paternal education and normalized overall score

Figure 8. Violin plot for maternal education and normalized overall score

Figure 9. Violin plot for internet service and normalized overall score

Figure 10. Violin plot for car availability and normalized overall score

Figure 11. Violin plot for number of books and normalized overall score

Figure 12. Violin plot for meat/fish/egg consumption frequency and normalized overall score

Figure 13. Application of the BestFirst algorithm in determining the optimal parameters

Table 1. Columns of the dataset deleted

Table 2. Columns considered in the dataset
ESTU_FECHANACIMIENTO: It corresponds to the student's date of birth.
ESTU_TIENEETNIA: It indicates whether the student belongs to a particular ethnic group or not. It is categorical data.
FAMI_ESTRATOVIVIENDA, FAMI_PERSONASHOGAR, FAMI_CUARTOSHOGAR: They correspond respectively to the stratum to which the household belongs, the number of people per household, and the number of rooms in the household. They are categorical data.
FAMI_EDUCACIONPADRE, FAMI_EDUCACIONMADRE: They respectively correspond to the educational level of the student's father and mother. They are categorical data.

Table 3 presents the results obtained when applying the BestFirst algorithm to overall performance. Prior to this, the PERCENTIL_GLOBAL attribute was categorized into quartiles, creating a new attribute with categories named Q_GLOBAL.
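The categorization of PERCENTIL_GLOBAL into the Q_GLOBAL attribute could be done as sketched below. The text does not say whether fixed cut points or empirical quantiles were used, so fixed cut points at 25, 50 and 75 are assumed here, and the labels Q1 to Q4 and the sample percentiles are illustrative.

```python
import pandas as pd

# Made-up percentile values, chosen to show how the bin edges are handled.
results = pd.DataFrame({"PERCENTIL_GLOBAL": [5, 25, 26, 50, 75, 76, 100]})

# Bins are right-closed: (0, 25], (25, 50], (50, 75], (75, 100];
# include_lowest=True also places a percentile of exactly 0 in Q1.
results["Q_GLOBAL"] = pd.cut(
    results["PERCENTIL_GLOBAL"],
    bins=[0, 25, 50, 75, 100],
    labels=["Q1", "Q2", "Q3", "Q4"],
    include_lowest=True,
)
print(results)
```

If the quartiles were meant to be empirical, that is, with an equal number of students in each category, pd.qcut(results["PERCENTIL_GLOBAL"], q=4, labels=["Q1", "Q2", "Q3", "Q4"]) would be the corresponding call.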

Table 3. Best attributes identified for overall performance

The results from Table 3 highlight the relevant attributes to be considered for implementing predictive models associated with the overall score. These attributes predominantly focus on economic indices, school characteristics (such as schedule, nature, location), parental education, the number of books available at home, the student's age, and benefits received through the GENERACION-E program. On the other hand, Table 4 presents the best attributes obtained by the BestFirst algorithm for the five areas evaluated in the Saber tests.

Table 4. Best attributes identified for the 5 evaluated areas