K-Means for Majoring Informatics Students' Interests Based on Brainwave Signals

This study investigates the potential of utilizing EEG (electroencephalogram) as a determinant for the specialization choices of Informatics students. EEG, measuring brain activity patterns, is employed to discern majors of interest among students. A questionnaire revealed that some students opt for specializations due to class availability and peer influence, leading to potential mismatches between their abilities and interests, consequently affecting their final project or thesis. EEG data from 30 respondents, recorded using NeuroSky Mindwave and MyndPlayer Pro software, were subjected to K-Means Clustering after feature extraction through PCA. However, the evaluation using Silhouette indicated a low score of 0.453, possibly due to significant distance between cluster data and centroids, minimal dataset size, and random respondent selection without considering their specific areas of interest. This suggests limitations in using EEG alone for determining specialization choices, necessitating further refinement and integration with additional factors for more accurate predictions.


INTRODUCTION
The brain is the control center of all human activity, serving as the center of communication and decision-making for the body [1]. Communication between neurons generates electrical currents that form brain wave signals, which can only be observed by recording them with an electroencephalogram (EEG) [2].
The brain waves recorded by EEG lie in the frequency range of 0.5 Hz to 100 Hz. Brain wave signals from EEG recordings can identify a person's condition, ranging from an idle state, through full concentration and thinking, to states of high mental activity such as panic and fear [3]. The recording results can be used to evaluate a person based on the activities carried out during the recording, such as answering questions or making a decision under a certain condition [4].
The activity serves as the stimulus for the recorded brain waves. One activity used in previous research is completing a learning ability test, also called an achievement test, in the form of a Basic Mathematical Test in the 2021 study by Yumiko, Triroasmoro, and Fauzi [5]. In this study, the achievement test was used as a stimulus to obtain EEG signals for determining the area of interest (major) in the Informatics study program at Ahmad Dahlan University. The areas of interest offered are "sistem cerdas" (intelligent systems) and "relata". Students begin to determine their area of interest in the sixth semester by taking several relevant specialization courses. In reality, not all students can confidently choose an area of interest based on their abilities, even though learning outcome data is available to inform the decision.
Based on a questionnaire given to 30 Informatics students, ranging from sixth-semester students to recent graduates, 12 students chose their area of interest because their ability in that field was stronger than in the other field; 14 students chose it because of their interest in the specialization courses and in the development of knowledge in that field; 2 students chose it confidently because of both their abilities and their interests; and the remaining 2 students chose it because classes were full and, unsure of their abilities and interests, they followed their friends when choosing specialization courses. These reasons can lead to students changing their area of interest and being unprepared when working on their final project or thesis.
These factors motivate research that can determine the area of interest based on EEG data, achievement test data, and learning outcome data. The research was conducted by recording brain wave signals using EEG while the respondents answered achievement test questions, which served as the stimulus. The achievement test is a collection of questions from the basic courses that represent each area of interest.
EEG data, stimulus result data, and grade transcript data are processed using the k-means clustering method. K-means is widely used because it is a simple unsupervised clustering technique that can be applied to large datasets [6]. The method groups data with similar characteristics into the same cluster, so that data with different characteristics fall into other clusters [2], [6], [7].

Field of Interest
The Informatics Engineering Study Program at Ahmad Dahlan University offers two areas of interest, namely "sistem cerdas" and "relata". The area of interest determines the topic of the student's thesis. Students choose their area of interest in the 6th semester.

Data Collection
This study acquired data from 30 respondents. The respondents are Informatics Engineering students at Ahmad Dahlan University in at least their 5th semester, that is, students who have chosen or will soon choose an area of interest. Three types of data were collected from each respondent: brain wave signal data, research stimulus data, and grade transcript data. The brain wave signals were recorded using an electroencephalogram (EEG) and the MyndPlayer Pro software. During the recording, the respondents were given a test as the research stimulus, with 30 minutes allowed to complete the entire test. The recording is stored as a *.log file and must be reprocessed with the same software to convert it into *.csv data that is ready for the next stage.

Exploratory Data Analysis (EDA)
Exploratory data analysis (EDA) is the initial stage of processing the EEG data. At this stage, the EEG data is analyzed and examined to find patterns, detect anomalies, test hypotheses, and check assumptions using statistical calculations [8], [9]. EDA is used to analyze and select the best data variables to carry forward to the feature extraction and clustering stages.

Feature Extraction
Feature extraction is the process of separating EEG signals based on their feature categories in order to produce characteristics that distinguish one object from another. In this study, feature extraction goes through two stages, namely Order I and Principal Component Analysis (PCA).
Order I
Order I (first-order statistics) is a statistical approach with several parameters, including the mean, median, standard deviation, skewness, and kurtosis.
• Mean (x̄)
Used to determine the mean of the data distribution. The mean is given in equation 1.
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \quad (1)$$
Where x_i is the i-th data value and n is the number of data.

• Median (𝑀𝑒𝑑)
Used to calculate the middle value of the sorted data distribution. The midpoint is determined differently for an odd and an even number of data. For an odd number of data, use equation 2.
$$Med = X_{\frac{n+1}{2}} \quad (2)$$
For an even number of data, use equation 3.
$$Med = \frac{1}{2} \left( X_{\frac{n}{2}} + X_{\frac{n}{2}+1} \right) \quad (3)$$
Where Med is the median, X is the data value (sorted in ascending order), and n is the number of data.

• Standard Deviation (𝑠)
Used to measure the spread of the data distribution by first finding the variance. The standard deviation equation can be seen in equation 4.
Where (x_i − x̄) is the difference between the data value x_i and the mean x̄.
• Skewness
Skewness is used to measure the asymmetry of the data distribution. The skewness equation can be seen in equation 5.
• Kurtosis
Kurtosis is used to measure the peakedness of the data distribution. The equation of kurtosis can be seen in equation 6.
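The five Order I parameters can be illustrated with a short Python sketch that uses only the standard library. The function name `first_order_features` and the sample values are made up for illustration. The sketch assumes the sample standard deviation (dividing by n − 1) and moment-based skewness and kurtosis; the exact variants used in this study may differ.

```python
import math
import statistics


def first_order_features(signal):
    """Return the five Order I statistics of a one-dimensional signal."""
    n = len(signal)
    mean = sum(signal) / n
    median = statistics.median(signal)  # handles both odd and even n

    # Assumption: sample variance (divide by n - 1); the study may divide by n.
    variance = sum((x - mean) ** 2 for x in signal) / (n - 1)
    std = math.sqrt(variance)

    # Assumption: skewness and kurtosis as standardized central moments.
    m2 = sum((x - mean) ** 2 for x in signal) / n
    m3 = sum((x - mean) ** 3 for x in signal) / n
    m4 = sum((x - mean) ** 4 for x in signal) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2  # a normal distribution gives a value near 3

    return {
        "mean": mean,
        "median": median,
        "std": std,
        "skewness": skewness,
        "kurtosis": kurtosis,
    }


# Made-up values standing in for one EEG variable of one respondent.
features = first_order_features([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(features)
```

Applying the function to each EEG variable of each respondent would give one row of Order I features per respondent, which is the form of input that PCA expects.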

Principal Component Analysis (PCA)
The PCA method is used to reduce the dimensionality of data that has many variables by selecting the most important components, with the aim of making signal processing calculations more efficient [10]-[12]. Data reduction is needed to reduce the complexity of the data, much of which is correlated, by converting it into a smaller set of uncorrelated variables that is easier to interpret [11].
The PCA algorithm begins with a data matrix X of dimensions m × n, where m is the number of samples and n is the number of attributes. The zero-mean technique is then applied by subtracting from every value X_{i,j} in X the mean of its attribute (column) j. Centering shifts every attribute so that its mean is zero. By the central limit theorem, the closer the sample is to the population, the closer the data is to a normal distribution, so the results of these calculations can represent the population data. The centered matrix X* is found using equation 7.
$$X^{*}_{i,j} = X_{i,j} - \bar{X}_{j} \quad (7)$$
The covariance matrix is then calculated using equation 8.
$$C = \frac{1}{m-1} (X^{*})^{T} X^{*} \quad (8)$$
Where C is the n × n covariance matrix and m is the number of samples. The next step is to find the eigenvalues and eigenvectors, as seen in equation 9.
$$|C - \lambda I| = 0 \quad \text{and} \quad (C - \lambda I)\, v = 0 \quad (9)$$
Where I is the identity matrix, λ is the eigenvalue, and v is the eigenvector. The eigenvectors are the principal components that determine the new variables. The number of new variables to use depends on the cumulative contribution of variance η_r, which is calculated using equation 10.
$$\eta_r = \frac{\sum_{i=1}^{r} \lambda_i}{\sum_{i=1}^{D} \lambda_i} \quad (10)$$
Where D is the number of initial attributes and r is the number of selected components [10].
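The PCA steps described above (centering, covariance matrix, eigen decomposition, cumulative contribution) can be sketched in plain Python. This is an illustrative sketch, not the implementation used in the study: the eigenvectors are found by power iteration with deflation so that no external library is needed, the covariance divides by m − 1 as an assumption, and all function names and data values are hypothetical.

```python
import math


def center(data):
    """Zero-mean step: subtract each attribute (column) mean."""
    m, n = len(data), len(data[0])
    means = [sum(row[j] for row in data) / m for j in range(n)]
    return [[row[j] - means[j] for j in range(n)] for row in data]


def covariance(centered):
    """n x n covariance matrix of centered data (assumes division by m - 1)."""
    m, n = len(centered), len(centered[0])
    return [[sum(row[a] * row[b] for row in centered) / (m - 1)
             for b in range(n)] for a in range(n)]


def top_eigenpairs(matrix, r, iterations=500):
    """Largest r eigenpairs of a symmetric matrix via power iteration with deflation."""
    n = len(matrix)
    work = [row[:] for row in matrix]
    pairs = []
    for k in range(r):
        v = [1.0 / (i + 1 + k) for i in range(n)]  # arbitrary non-zero start vector
        for _ in range(iterations):
            w = [sum(work[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        eigenvalue = sum(v[i] * sum(work[i][j] * v[j] for j in range(n))
                         for i in range(n))
        pairs.append((eigenvalue, v))
        # Deflation: remove the component that was just found.
        for i in range(n):
            for j in range(n):
                work[i][j] -= eigenvalue * v[i] * v[j]
    return pairs


# Made-up feature matrix: 6 samples (rows) x 3 attributes (columns).
data = [
    [2.5, 2.4, 1.0],
    [0.5, 0.7, 0.9],
    [2.2, 2.9, 1.2],
    [1.9, 2.2, 0.8],
    [3.1, 3.0, 1.1],
    [2.3, 2.7, 1.0],
]
centered = center(data)
cov = covariance(centered)
pairs = top_eigenpairs(cov, r=2)

# Project every sample onto the r selected components.
scores = [[sum(x * c for x, c in zip(row, vec)) for _, vec in pairs]
          for row in centered]

# Cumulative contribution: the sum of all eigenvalues equals the trace of cov.
trace = sum(cov[i][i] for i in range(len(cov)))
contribution = sum(val for val, _ in pairs) / trace
print(scores, contribution)
```

In practice a numerical library would normally replace the hand-written eigen decomposition; the sketch only makes each step of the description explicit.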

Clustering
This stage groups the data using K-Means Clustering. K-Means Clustering is a non-hierarchical clustering method that partitions data into one or more clusters. The method groups data based on similar characteristics, so that data with the same characteristics fall into one cluster and data with different characteristics fall into other clusters [6], [7], [13]. In general, k-means clustering uses the following algorithm:
• Determine the number of clusters k to be used.
• Determine k initial center points (centroids) at random. Random centroid determination only applies to the first iteration.
• Calculate the distance between every data point and every centroid using the Euclidean distance in equation 11.
$$d(x_i, c_k) = \sqrt{\sum_{j=1}^{n} (x_{i,j} - c_{k,j})^2} \quad (11)$$
• Assign each data point to the cluster with the nearest centroid.
• Recalculate each centroid as the mean of the data points assigned to its cluster, using equation 12, to start the next iteration.
$$c_k = \frac{1}{n_k} \sum_{x_i \in S_k} x_i \quad (12)$$
Where S_k is the set of data points assigned to cluster k and n_k is the number of data points in it.
Repeat the distance calculation, assignment, and centroid update while the position of any centroid still changes. When no centroid changes position, the clustering process is complete.
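The algorithm above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the function names, the fixed random seed, the iteration cap, and the two-dimensional sample points are all hypothetical.

```python
import math
import random


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def k_means(points, k, seed=0, max_iter=100):
    """Cluster points into k groups; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Random initial centroids, drawn from the data (first iteration only).
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        labels = [min(range(k), key=lambda c: euclidean(p, centroids[c]))
                  for p in points]
        # Recompute each centroid as the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append([sum(col) / len(members)
                                      for col in zip(*members)])
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        # Stop when no centroid changes position.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


# Made-up two-dimensional points, e.g. two PCA components per respondent.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
          (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels, centroids = k_means(points, k=2)
print(labels, centroids)
```

Because the initial centroids are random, the cluster numbering (0 or 1) can differ between seeds even when the grouping itself is the same.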

Evaluation
In this study, the system was evaluated by analyzing the validity of the clustering using the silhouette technique. The silhouette technique compares the tightness (cohesion) of a cluster with its separation from other clusters. The silhouette value reflects how well each object matches the cluster it was assigned to [14]-[17]. The silhouette is defined by equation 13.
$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}, \qquad \bar{S} = \frac{1}{N} \sum_{i=1}^{N} s(i) \quad (13)$$
Where a(i) is the average distance from data i to the other data in the same cluster, b(i) is the minimum average distance from data i to the data in any other cluster, and N is the number of data [14].
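The silhouette definition can be made concrete with a small Python sketch. It assumes Euclidean distance, at least two clusters, and the common convention that a point in a single-member cluster scores 0; the function names and sample points are illustrative only.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def silhouette_score(points, labels):
    """Mean silhouette value over all points (requires at least two clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    values = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            values.append(0.0)  # convention for single-member clusters
            continue
        # a(i): mean distance to the other members of the same cluster
        # (the distance from p to itself is 0, so it does not affect the sum).
        a = sum(euclidean(p, q) for q in own) / (len(own) - 1)
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(sum(euclidean(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return sum(values) / len(values)


# Made-up points: two tight, well-separated clusters give a score close to 1.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
score = silhouette_score(points, labels)
print(score)
```

Scores range from -1 to 1. Values near 1 indicate compact, well-separated clusters, while values near 0 indicate overlapping clusters.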

Data Collection
EEG Data
EEG data was obtained by recording the brain wave signals of 30 respondents using an EEG device and the MyndPlayer Pro software. The recording, during which the respondents were given the stimulus, took approximately 30 minutes. The recorded EEG data is saved in a file with the *.log format.
The *.log recording is then imported back into the software, which automatically translates the data into waveforms divided by category. Once the waveforms have been obtained per category, the data can be exported to a file in *.csv format. This data is used for further processing. An example of EEG data from one respondent that is ready to be processed can be seen in Table 2 (example of EEG data from Respondent 1).

Stimulus Data
The results of the stimulus are part of the data needed in this study. The stimulus is a collection of questions representing each of the subjects that form the basis for the specialization courses from semester 1 to semester 5. For the "sistem cerdas" area of interest, the courses are data structures, artificial intelligence, algorithmic strategies, and automata language theory. For the "relata" area of interest, the courses are web programming, data communication and computer networks, databases, and human-computer interaction. Each question is scored, the average score per area of interest is calculated, and the respondent is labeled with the area of interest whose average is higher. The stimulus data can be seen in Table 3.
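The labeling rule for the stimulus results can be sketched as follows. The function name and the scores are hypothetical and are not taken from Table 3. Ties are resolved in favor of the area listed first, which is an assumption because the text does not say how ties are handled.

```python
def label_by_average(scores_by_area):
    """Return the area with the highest average score, plus all averages."""
    averages = {area: sum(scores) / len(scores)
                for area, scores in scores_by_area.items()}
    # max() returns the first area encountered when averages are tied.
    label = max(averages, key=averages.get)
    return label, averages


# Made-up question scores for one respondent, grouped by area of interest.
example_scores = {
    "sistem cerdas": [80, 60, 70, 90],
    "relata": [50, 70, 60, 40],
}
label, averages = label_by_average(example_scores)
print(label, averages)
```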

Transcript Data
The transcript data was obtained from the respondents through an online questionnaire. The average grade for the basic courses of each area of interest is then calculated. The average values for each area of interest can be seen in Table 4.

Exploratory Data Analysis (EDA)
Data analysis is the stage for finding the best data variables using Exploratory Data Analysis (EDA) techniques. Only the EEG data from one of the respondents is analyzed, by displaying the data variables that are candidates for processing and have a good data distribution. The variables analyzed were "Low Beta" and "High Beta", because these waves are obtained under conditions of concentration, and "Attention", "Meditation", and "Zone", because these values are derived from the brain waves by the MyndPlayer Pro software. The analysis shows that the best distributions are found for "Zone" against "Meditation" and "Zone" against "Attention", as shown in Figure 1.

Fig 1. The Results of Exploratory Data Analysis
Plotting a variable against itself only shows a distribution that rises in the middle and then decreases toward the end. "Meditation" against "Attention" shows a distribution pattern, but it is too wide and irregular. "Zone" against "Meditation" and "Zone" against "Attention" show regular patterns: an increase in "Zone" is accompanied by an increase in "Meditation" and "Attention", and vice versa.
The variable pair used in this system is therefore "Zone" against "Attention". "Zone" against "Meditation" was not selected because the "Meditation" variable is derived from brain waves produced when a person is in a state of meditation or relaxation. Whereas in

Principal Component Analysis (PCA)
Before the data is grouped, it must be reduced by extracting the most important information. The data is reduced to two dimensions because the k-means implementation used in this study, written in Python 3, clusters and plots two-dimensional data. The results of the data reduction using PCA can be seen in Table 6 (PCA feature extraction).

Fig 2. Visualization of K-Means Clustering
Based on the clustering process, the percentage of each area of interest per cluster was obtained: cluster 0 consists of 75% respondents with the temporary label "relata" and 25% with the temporary label "sistem cerdas", whereas cluster 1 consists of 50% with the temporary label "relata" and 50% with the temporary label "sistem cerdas". The respondents and their temporary labels in each cluster can be seen in Table 7.
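Percentages of this kind can be derived from the cluster assignments and the temporary labels with a short sketch such as the one below. The function name and the eight toy assignments are hypothetical; they are not the respondents listed in Table 7.

```python
from collections import Counter, defaultdict


def cluster_composition(cluster_ids, temporary_labels):
    """Percentage of each temporary label inside every cluster."""
    counts = defaultdict(Counter)
    for cluster_id, label in zip(cluster_ids, temporary_labels):
        counts[cluster_id][label] += 1

    composition = {}
    for cluster_id, counter in counts.items():
        total = sum(counter.values())
        composition[cluster_id] = {label: 100.0 * n / total
                                   for label, n in counter.items()}
    return composition


# Toy assignments for eight hypothetical respondents.
cluster_ids = [0, 0, 0, 0, 1, 1, 1, 1]
temporary_labels = ["relata", "relata", "relata", "sistem cerdas",
                    "relata", "sistem cerdas", "relata", "sistem cerdas"]
composition = cluster_composition(cluster_ids, temporary_labels)
print(composition)
```

A cluster whose labels split evenly, like cluster 1 in this toy example, does not separate the two areas of interest at all.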

Evaluation
At the evaluation stage, the silhouette score was 0.453. This low score may be caused by the large distance between the data points in a cluster and their centroid, as can be seen in Figure 3.

Table 3. The Stimulus Data

Table 4. The Transcript Data

Table 7. Field of Interest Based on Clusters