An Integrated Clustering Method for Pedagogical Performance

We present an interdisciplinary approach to data clustering, based on an algorithm originally developed for the Big Data Modelling of Sustainable Development Goals (BDMSDG). Its application context combines the mechanics of machine learning techniques with underlying domain knowledge, unifying the narratives of data scientists and educationists in searching for potentially useful information in historical data. Starting from an initial structure masking, results from multiple samples of identified sets of two to five clusters reveal a consistent number of three clear clusters. We present and discuss the results from both technical and soft perspectives to stimulate interdisciplinarity and support decision making. We explain how the findings of this paper present not only continuity of ongoing clustering optimisation, but also an intriguing starting point for interdisciplinary discussions aimed at enhancing students' performance.


Introduction
The United Nations Sustainable Development Goals (SDG) [1] identify good quality education as the foundation for creating sustainable development, improved quality of life, innovation and creativity. Investment in the education sector and pedagogical innovations are well-documented, especially in the developed world. However, despite all the evidence on the impact of education on our livelihood, we are still witnessing huge gaps and variations in attainment and performance across the world. This paper presents an interdisciplinary approach to Big Data Modelling, based on two algorithms designed for machine learning techniques. A key motivation of this paper is to expand pathways for educationists and researchers in attaining unified efforts to uncover and analyse the factors behind such gaps in interdisciplinary contexts. It seeks to address the foregoing challenges by tracking undiscernible and potentially useful information hidden in multiple data attributes. Unlike in Miguis et al. [2], Brooks et al. [3] and Hua Leong and Marshall [4], where the focus was on the segmentation of the dynamics of static groups, this paper takes a Big Data modelling approach to tracking potential triggers of performance among university students (3639 observations on 19 variables) over an 11-year period (2005-2016). This work follows the national guidelines of the Commission for Academic Accreditation (CAA) within the Ministry of Education (MoE) in the United Arab Emirates (UAE), which is authorized to license educational institutions, accredit programs and grant degrees and other academic awards across the country.
The standards that guide the foregoing processes and the criteria that institutions must meet are specified in the Standards for Institutional Licensure and Program Accreditation (SILPA) [5]. SILPA [5] clearly stipulates that institutions offering programs in professional fields such as medicine and other health-related disciplines, education, engineering and others must provide opportunities for learning through workplace experience, such as internships or practicums. Internships provide a structured practical learning experience in which students are academically supervised and undergo a rigorous process to complement their theoretical learning. At the university degree level, internships are usually required as a part of the major's curriculum and, as such, they provide students with the opportunity to implement what they have learned theoretically while being supervised to ensure they are on the right track. Research shows that through internships, students add more value to their knowledge by getting exposed to real-life experimental learning experiences and opportunities. The paper is organised as follows. Section 1 presents the background, motivation, research aim, objectives and a brief review of relevant literature. Section 2 details the methods (data description and modelling techniques), followed by implementation, analyses, results and general discussions in Section 3. Finally, concluding remarks are drawn in Section 4, highlighting potential new research directions.

Motivation
Attaining good quality education is the ideal dream of all learners, institutions and nations across the world (Attwell and Pumilia [6], Meusburger [7]). The United Nations identifies good quality education as the foundation for creating sustainable development, naturally leading to improved quality of life, innovation and creativity. In the modern era, where we generate more data than we can process, the issue becomes both a challenge and an opportunity. In a typical academic environment, where thousands of students from multiple demographic backgrounds study multiple modules at different levels, the underlying and resulting data attributes are highly correlated sources of Big Data [8,9]. Just what type of data, how much of it and how fast are questions that researchers have to deal with routinely. A key motivation of this paper is to expand pathways for educationists and researchers in attaining unified efforts to uncover and analyse such factors in interdisciplinary contexts. It is expected that this work will contribute to the work of the Center for Higher Education Data and Statistics (CHEDS), which collects vital educational data for the MoE. CHEDS [10] makes evidence-based decisions, influencing higher education policies and planning at both institutional and national levels. This helps the educational sector to enhance its strengths and ranking in the increasingly competitive world of higher education. Reports and analyses will help in advancing students' learning experiences and curriculum designs.

Research Aim and Objectives
The aim of this paper is to highlight robust pathways for applying machine learning techniques in real-life applications in an interdisciplinary context [11]. It seeks to address the problem of optimising naturally arising patterns in large datasets by applying a clustering technique within an integrated generic algorithm to detect and model potentially relevant educational performance data attributes. Its objectives, listed below, are two-fold. Objectives 1 through 3 focus on the technical aspects of the work, while 4 and 5 are on the underlying domain knowledge.
1 To capture multiple data attributes on students' performance across disciplines and carry out data cleaning, data wrangling and initial exploratory analyses for the purpose of gaining insights into the data.
2 To explore initial data for indications of inherent patterns based on selected key attributes (specialisation, level of study and gender) and their potential impact on performance.
3 To assess the performance of a novel algorithm based on the mechanics of a standard clustering algorithm.
4 To highlight pathways for educationists, data scientists and other researchers to follow in engaging policy makers, development stakeholders and the general public in putting generated data to use.
5 To share findings with colleagues across disciplines and contribute towards the unification of scientific research.

Preliminary Studies
Attwell and Pumilia [6] emphasised the need for forging pedagogical competences in analysing and sharing results across disciplines. They particularly reiterated the use of open-source material in higher education, mainly for providing scholars and learners with easy access to data, information and knowledge. Data-driven investigations into aspects of teaching, learning and assessment have attracted the interest of many researchers and professionals, not least educationists and data analysts, for many years. This paper looks at the two as homing in on a common interdisciplinary problem and solution. While the former seek to enhance the learning process, the latter focus mainly on the tools and techniques that are deployed for learning enhancements. At face value, the two may be seen as representing soft and technical skills respectively, but together they form an interdisciplinary fabric upon which the learning process can thrive. In recent years, interdisciplinarity has been widely promoted as a learning methodology. For example, Aikat et al. [12] see an interdisciplinary gap in graduate education, as it "...remains largely focused on individual achievement within a single scientific domain." They argue that lacking interdisciplinary pedagogy deprives students of data-oriented approaches that could help them "...translate scientific data into new solutions to today's critical challenges." Thus, they propose a data-centered pedagogy for graduate education that unifies the efforts of the educationist and the data scientist. This paper has been strongly influenced by the foregoing narratives [6,12], which, despite a ten-year gap between them, did not exhibit strong data-driven evidence. In searching for potentially useful information in the students' data attributes, we shall be adopting their narrative.

Methods
We present the study methodology as a collection of projects relating to the cause-effect relationship between knowledge and development in a spatio-temporal context. The methodology, described below, focuses on gaining insights into the learning fabric of the sampled students, using identifiable attributes as drivers, to learn the concept via unsupervised learning. Its original ideas are in [8,9], where it was applied to map and deliver knowledge about societal SDG clusters.

Implementation Strategy
Implementation strategy is driven by model optimisation, achieved by harmonising data variability through the Sampling-Measuring-Assessing (SMA) Algorithm 1 [8,9,13]. The algorithm can be adapted for both unsupervised and supervised modelling scenarios. In a typical unsupervised learning setting, where the goal is to cluster data objects according to some measures of homogeneity (heterogeneity), the focus is on parameter estimation and likelihood. For the attributes in Table 1, Fisher's correlation form holds in a multiple regression scenario, where the deviations between the fitted values and the mean are replaced by the deviations due to the linear relationship (Kim and Timm [14]). We can use cluster analysis [15,16] to group students according to this type of similarity measure. That is, given data $X = [x_{i,j}]$ and assuming $k$ distinct clusters, i.e., $C = \{c_1, c_2, \ldots, c_k\}$, each with a specified centroid, for each of the vectors $v_j$, $j = 1, 2, \ldots, 10$, we can obtain the distance from $v_j \in X$ to the nearest centroid in the set $\{x_1, x_2, \ldots, x_k\}$, where $x_k \in X$ and $d(\cdot)$ is an adopted measure of distance. The clustering objective is then to minimise the sum of the distances from each of the data points in $X$ to the nearest centroid. That is, optimal partitioning of $C$ requires identifying $k$ vectors $x^*_1, x^*_2, \ldots, x^*_k \in \mathbb{R}^n$ that solve the continuous optimisation function in Equation 3.
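The nearest-centroid objective described above can be illustrated with a minimal sketch. It assumes Euclidean distance for $d(\cdot)$ and uses hypothetical toy points rather than the study data; the function names are illustrative and do not come from the SMA implementation.

```python
import math

def nearest_centroid_distance(point, centroids):
    """Distance from one observation to its nearest centroid (Euclidean assumed)."""
    return min(math.dist(point, c) for c in centroids)

def clustering_objective(data, centroids):
    """Sum, over all observations, of the distance to the nearest centroid."""
    return sum(nearest_centroid_distance(p, centroids) for p in data)

# Hypothetical two-dimensional toy observations (not from the student dataset)
toy_data = [(0.0, 0.0), (0.0, 1.0), (4.0, 4.0), (4.0, 5.0)]
well_placed = [(0.0, 0.5), (4.0, 4.5)]    # one centroid inside each group
poorly_placed = [(2.0, 2.0), (2.0, 3.0)]  # both centroids between the groups

print(clustering_objective(toy_data, well_placed))
print(clustering_objective(toy_data, poorly_placed))
```

A lower value indicates a better partition, which is why the result depends on where the centroids are initialised.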
Minimisation will depend on the initial values in $C$. Hence, if we let $z_{i=1,2,\ldots,n}$ be an indicator variable denoting group membership with unknown values, the search for the optimal solution can proceed through iterative smoothing of the random vector $x \mid (z = k)$, for which we can compute $\mu = E(x)$ and $\delta = \{\mu_k - \mu \mid y = k \in c_z\}$. Given labelled data, EDA outputs provide insights into the overall behaviour of the data, particularly how the attributes relate to the target variable. Typically, SMA then learns the model in Equation 4, where $D$ is the underlying distribution. The sampling steps of Algorithm 1 are:

Initialise: s as a percentage of [x_{ν,τ}], say 1%
for i := 1 → K do (set K large and iterate in search of optimal values)
    while s ≤ 50% of [x_{ν,τ}] do (vary sample sizes up to the nearest integer to 50% of X)
        Sampling for training: s_tr ← X
        Sampling for testing: s_ts ← X

The SMA algorithm also caters for association rules, which can be used to investigate associations among the data attributes in Table 1, and for data clustering, which investigates variations among the variables and the naturally emerging structures. The estimates can be obtained in various ways. One of the most common methods is the Metropolis-Hastings algorithm, based on the original ideas of Markov Chain Monte Carlo (MCMC) simulation techniques [17], which allow for sampling from probability distributions as long as the density function can be evaluated.
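The repeated sampling loop can be sketched as follows. This is only an illustration under stated assumptions: the measuring step is a user-supplied function (the sample mean here), the assessing step is reduced to the average gap between training and testing measurements, the sample percentage grows in fixed steps of 7, and the GPA values are synthetic. None of these choices are taken from Algorithm 1 itself.

```python
import random
import statistics

def sma_sampling_sketch(data, measure, k_iterations=5,
                        start_pct=1, max_pct=50, step_pct=7, seed=0):
    """Repeatedly draw training/testing samples of growing size and measure each."""
    rng = random.Random(seed)
    results = []
    for _ in range(k_iterations):                             # for i := 1 -> K
        for pct in range(start_pct, max_pct + 1, step_pct):   # while s <= 50%
            size = max(2, round(len(data) * pct / 100))
            train = rng.sample(data, size)                    # sampling for training
            test = rng.sample(data, size)                     # sampling for testing
            results.append((pct, measure(train), measure(test)))
    return results

def assess_gaps(results):
    """Average absolute train/test gap for each sample percentage."""
    gaps = {}
    for pct, m_train, m_test in results:
        gaps.setdefault(pct, []).append(abs(m_train - m_test))
    return {pct: statistics.mean(values) for pct, values in gaps.items()}

# Synthetic GPA-like values, purely for illustration
_rng = random.Random(1)
gpas = [_rng.uniform(1.0, 4.0) for _ in range(400)]

results = sma_sampling_sketch(gpas, statistics.mean)
for pct, gap in sorted(assess_gaps(results).items()):
    print(pct, round(gap, 3))
```

The training and testing samples are drawn independently here, so they may overlap; a disjoint split would be an equally reasonable reading of the sampling steps.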

Sequence of Analyses
Implementation goes through a sequence of logical steps. We deploy Exploratory Data Analysis (EDA) to provide initial insights into the general behaviour of the student data attributes. Ideally, EDA should guide the understanding and interpretation of the analyses and of the results from data visualisation and other summaries. Based on the data insights from EDA, unsupervised modelling is implemented by deploying Algorithm 1 based on the Affinity Propagation Clustering (APC) algorithm, as originally described in [18] and illustrated in [19]. Its original ideas are to merge data clusters until satisfactory levels of similarity (or dissimilarity) are achieved. This type of cluster merging is only possible if the dataset has inherent clusters not less than the initial number stipulated by the algorithm, hence the rationale for EDA. Further, it should be possible to repeatedly extract samples from the data that could then be merged into a cluster. Frey and Dueck [18] describe the merged clusters as exemplars that maximize the levels of average similarity. By repeated sampling and validation, we shall gain a better understanding of the influential factors in the formation of clusters. In the next exposition, we describe the mechanics of propagated clustering as deployed via Algorithm 1 [8,9]. If we let $X = [x_{i,j}]$, where $i = 1, 2, \ldots, n$ and $j = 1, 2, \ldots, p$, be the source dataset with $k$ assumed distinct clusters, we can repeatedly extract samples based on indicator variables $z_i = 1, 2, \ldots, n_z$ and $s_i = 1, 2, \ldots, n_s$, such that $n_z + n_s \le n$. The initial potential joint exemplar, $\mathrm{exemp}(z, s)$, is the sample that maximizes the average similarity to all samples in the joint cluster $C[z \cup s]$, that is:

$$\mathrm{exemp}(z, s) = \arg\max_{i \in z \cup s} \frac{1}{n_z + n_s} \sum_{j \in z \cup s} D_{i,j}$$

where $D_{i,j}$ is the similarity matrix with the indices corresponding to the $i$th and $j$th items in the two samples. The choice of the measure of similarity is application-dependent and user-defined.
Then the merging objective is computed as

$$\mathrm{obj}(z, s) = \frac{1}{2}\left[\frac{\sum_{\rho \in z} D_{\mathrm{exemp}(z,s)\rho}}{n_z} + \frac{\sum_{\nu \in s} D_{\mathrm{exemp}(z,s)\nu}}{n_s}\right] = \frac{n_s \sum_{\rho \in z} D_{\mathrm{exemp}(z,s)\rho} + n_z \sum_{\nu \in s} D_{\mathrm{exemp}(z,s)\nu}}{2 n_z n_s} \qquad (7)$$
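The joint exemplar and the merging objective can be sketched in a few lines. The sketch assumes that similarity is the negative squared distance between one-dimensional toy values, which is a common choice for affinity propagation but is not stated in this paper. The names `joint_exemplar` and `merge_objective` are illustrative.

```python
def joint_exemplar(D, z, s):
    """Index in the union of z and s with the highest average similarity to the union."""
    joint = list(z) + list(s)
    return max(joint, key=lambda i: sum(D[i][j] for j in joint) / len(joint))

def merge_objective(D, z, s):
    """Balanced average similarity of the two samples to their joint exemplar."""
    e = joint_exemplar(D, z, s)
    avg_z = sum(D[e][rho] for rho in z) / len(z)
    avg_s = sum(D[e][nu] for nu in s) / len(s)
    return 0.5 * (avg_z + avg_s)

# Hypothetical one-dimensional values; similarity = negative squared distance
values = [0.0, 1.0, 2.0, 10.0]
D = [[-(a - b) ** 2 for b in values] for a in values]

near = merge_objective(D, [0, 1], [2])  # merge {0, 1} with a nearby sample
far = merge_objective(D, [0, 1], [3])   # merge {0, 1} with a distant sample
print(joint_exemplar(D, [0, 1], [2]), near, far)
```

Under this similarity, merging with the nearby sample gives the higher objective, so it would be the preferred merge.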

Implementation, Analyses and Results
Implementation goes through a sequence of logical steps. Insights gained from Exploratory Data Analysis (EDA) guide the application of Algorithm 1, based on the Affinity Propagation Clustering algorithm as originally described in [18] and illustrated in [19]. EDA plays a crucial role in defining the research problem and objectives. We adopt it here as an initial step in grouping students according to some measures of similarity.

Graphical Data Visualisation
The two panels in Figure 1 provide basic insights into existing frequency structures in the data based on three key attributes: specialisation, level of study and gender. The most popular courses are law, education and business administration at bachelor's and diploma levels. Females have a significant representation in the three most popular courses. They dominate in education, have a fair share in business administration and make up over 34% of law enrolment. The six panels in Figure 2 exhibit the overall GPA distributions between prior and post-intern semesters, which appear to be fairly similar. As our interest is in detecting naturally arising structures in the data, we can examine the distributions at different bandwidths. Figure 3 shows that only at very low bandwidths can we detect underlying structures in each of the GPA categories, more pronounced in the before semester than in the other two. The average GPAs before, in-semester and after semester are 2.74, 3.11 and 2.84 respectively, suggesting either spurious clusters or masking in the top left panel in Figure 3. In the next exposition, we carry out further explorations by looking at the densities of the individual dominating categories: Law, Education and Business Administration.
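The effect of bandwidth on structure masking can be illustrated with a small Gaussian kernel density sketch. The GPA values below are hypothetical, grouped around 2.2, 3.0 and 3.5, and the two bandwidths are arbitrary choices for illustration. Neither reproduces the settings behind Figure 3.

```python
import math

def gaussian_kde(sample, bandwidth):
    """Return a Gaussian kernel density estimate as a function of x."""
    norm = 1.0 / (len(sample) * bandwidth * math.sqrt(2.0 * math.pi))
    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample)
    return density

def count_modes(sample, bandwidth, lo=0.0, hi=4.0, steps=400):
    """Count strict local maxima of the density on an evenly spaced grid."""
    f = gaussian_kde(sample, bandwidth)
    ys = [f(lo + (hi - lo) * i / steps) for i in range(steps + 1)]
    return sum(1 for i in range(1, steps) if ys[i - 1] < ys[i] > ys[i + 1])

# Hypothetical GPAs forming three tight groups
toy_gpas = [2.15, 2.20, 2.25, 2.95, 3.00, 3.05, 3.45, 3.50, 3.55]

print(count_modes(toy_gpas, bandwidth=0.05))  # narrow kernel: groups stay separate
print(count_modes(toy_gpas, bandwidth=0.5))   # wide kernel: groups are smoothed together
```

With a wide kernel the separate groups merge into a single smooth curve, which is one way masking can arise in a density plot.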

Unsupervised Modelling
The Affinity Propagation Clustering algorithm generated heavily overlapping clusters for the GPA data. Figure 4 shows patterns for two, three, four and five clusters, clockwise from top left respectively. They all indicate a separation not based on the average GPA. Hence, we take a closer look at the data to establish the basis of the clusters' formation. Table 2a shows the proportions of cases, based on selected attributes, in each of the detected clusters. The rows in the first column are coded from C12 and C22, for clusters 1 and 2 in the two-cluster group, through to C15 to C55, for clusters 1 to 5 in the five-cluster group. The remaining columns represent data from the attributes Specialisation and Gender. Table 2b shows the average GPA levels in each of the selected categories. These statistics are potentially useful in the sense that the choice of a course, specialisation and performance are conditional on various factors, including the quality of teaching and delivery, course organisation and general management, as well as the assessment and feedback students receive [20]. Such statistics could help the CHEDS [10] in the UAE in making evidence-based decisions to guide and influence higher education policies and planning at all levels.
The two panels in Figure 5 correspond to the values in Tables 2a and 2b respectively. The horizontal axis on the left-hand side panel corresponds to the three specialisation categories and gender, in the order given in the two tables, and the vertical axis represents the category percentage. The horizontal axis on the right-hand side panel displays the 14 clusters as shown in the first column of Table 2a, while the vertical axis shows the average GPAs. By visual inspection along the line, cluster overlapping is evident: those on the same horizontal line have similar scores. The patterns in Figure 6 are the best representations of the underlying structure in the sampled data. They were obtained based on multiple runs of sampling through the data inside the clusters in Figure 4. Both panels present a clear conclusion that, in terms of GPA performance based on the sampled data, we can isolate three distinct clusters, centred around GPAs of 3.5, 3.0 and between 2.5 and 2.0. It is important to note that the three clusters are dependent on both level and gender, which the two panels do not distinguish. While a table detailing the dominance in each category may be useful, it is imperative to interpret such data in conjunction with other relevant attributes, such as the left-hand side panel of Figure 5. The data for each of the 14 clusters is available for potential future examinations.
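Summaries of the kind reported in Tables 2a and 2b can be computed as in the sketch below. The records, field names and GPA values are hypothetical. Only the cluster coding (for example, C13 for cluster 1 in the three-cluster group) follows the convention described above.

```python
from collections import Counter, defaultdict

def cluster_proportions(records, attribute):
    """Share of each attribute value within each cluster (cf. Table 2a)."""
    counts = defaultdict(Counter)
    for r in records:
        counts[r["cluster"]][r[attribute]] += 1
    return {c: {v: n / sum(cnt.values()) for v, n in cnt.items()}
            for c, cnt in counts.items()}

def mean_gpa(records, attribute):
    """Average GPA per (cluster, attribute value) pair (cf. Table 2b)."""
    totals = defaultdict(lambda: [0.0, 0])
    for r in records:
        key = (r["cluster"], r[attribute])
        totals[key][0] += r["gpa"]
        totals[key][1] += 1
    return {key: total / n for key, (total, n) in totals.items()}

# Hypothetical records, not taken from the student dataset
records = [
    {"cluster": "C13", "specialisation": "Law", "gender": "F", "gpa": 3.6},
    {"cluster": "C13", "specialisation": "Education", "gender": "F", "gpa": 3.4},
    {"cluster": "C23", "specialisation": "Law", "gender": "M", "gpa": 3.0},
    {"cluster": "C23", "specialisation": "Law", "gender": "F", "gpa": 3.1},
    {"cluster": "C33", "specialisation": "Business Administration", "gender": "M", "gpa": 2.2},
]

print(cluster_proportions(records, "specialisation"))
print(mean_gpa(records, "gender"))
```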

Concluding Remarks and General Discussions
The paper sought to address a two-fold problem. On the one hand, it focused on the technical aspects of Big Data Modelling (BDM), for which it deployed the affinity clustering algorithm [18,19] based on the mechanics of the SMA algorithm [8,9]. On the other hand, it focused on the soft, interdisciplinary aspects of BDM, i.e., applying machine learning techniques to real-life applications in an interdisciplinary context. Objectives 1 through 3 were met in sub-sections 3.1 and 3.2. It is imperative to note that more analyses could have been carried out based on the settings in this paper. However, the scope for this application was confined to 3 of the original 19 attributes, i.e., Specialisation, Level and Gender, so as to accommodate the technical aspect of the set objectives under limited interdisciplinary interpretations. The findings presented in this paper are therefore intended to fulfil objectives 4 and 5, i.e., they should open new discussions and highlight novel paths for interdisciplinary research involving data scientists and educationists.
Even within this limited application, our findings show that there is great potential in incorporating interdisciplinary approaches in university curricula, bringing together domain sciences on the one side and data science on the other. Further tests and evaluations of the SMA algorithm can be conducted using a wide range of unsupervised and supervised techniques, with any combination of the 19 data attributes. Algorithm 1 is also capable of handling association rules, which were originally developed for analysing shopping transactions (see Agrawal et al. [21]). In this particular application, association rules can play a unifying role between unsupervised and supervised modelling in that they can capture underlying rules of association among the students' data attributes. We expect the technical and soft aspects of the paper to increasingly attract attention to collaborative, interdisciplinary research activities in various sectors.
Finally, and as emphasised by Aikat et al. [12], our paper showed, via real data, that uncovering attainment and performance triggers cannot be confined to silos of domain knowledge, nor to algorithms developed by data scientists. A unified understanding can only be achieved through cross-institutional collaborative research, sharing data and findings. The outcomes of this work should provide useful inputs to the Center for Higher Education Data and Statistics
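The two standard quantities behind association rules, support and confidence, can be sketched as follows. The attribute-value items and the GPA bands ("high", "mid") are hypothetical and serve only to show how student records could be treated like transactions.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Support of antecedent and consequent together, relative to the antecedent."""
    base = support(transactions, antecedent)
    joint = support(transactions, set(antecedent) | set(consequent))
    return joint / base if base else 0.0

# Hypothetical student records encoded as sets of attribute=value items
students = [
    {"spec=Law", "gender=F", "gpa=high"},
    {"spec=Law", "gender=M", "gpa=mid"},
    {"spec=Education", "gender=F", "gpa=high"},
    {"spec=Education", "gender=F", "gpa=mid"},
]

print(support(students, {"gender=F", "gpa=high"}))
print(confidence(students, {"gender=F"}, {"gpa=high"}))
```

A rule such as "gender=F implies gpa=high" would then be retained only if both quantities exceed user-chosen thresholds.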