The K-Means Algorithm for Generating Sets of Items in Educational Assessment

In a national-scale educational assessment system, such as the National Examination, several sets of questions with the same level of difficulty are needed to prevent cheating among students. The objective of this research is therefore to generate sets of questions with the same level of difficulty automatically, using a machine learning approach, namely K-Means. To achieve this goal, several procedures are implemented. First, we create banks of questions to be assigned to students. Then, we build training data by determining the value of each question based on Bloom's Taxonomy, item characteristics/types, and other parameters. Next, by utilizing K-Means, several cluster centers are obtained that represent the uniformity of the questions among the cluster members. By applying several previously defined heuristic criteria, several sets or packages of questions that have the same characteristics and difficulty levels are obtained. From the experiments conducted, an analysis using descriptive statistics (i.e., mean, standard deviation, and data visualization) and inferential statistics (i.e., ANOVA) is presented, showing that the questions in each set have the same characteristics, which ensures the fairness of examinations. Moreover, with this system, the contents of the questions in the generated sets do not need to be the same, the packages of questions can be generated automatically and quickly, and the level of difficulty can be measured and guaranteed.



INTRODUCTION
Because knowing the quality of education and the ability level of students is important for improving the education system as a whole, educational evaluation should be done systematically (Scheerens and Glas, 2003; Tyler, 1942). Therefore, we should evaluate and analyze educational performance at all stages (i.e., input, teaching and learning processes, output, instruments, curriculum, etc.). One part of educational evaluation, namely educational assessment, is used to obtain information about how well students understand the materials that have been taught.
Educational assessment basically serves five purposes: placement, formative, summative, diagnostic, and selective assessment. Placement assessment is used to put a student into a certain level/class according to prior knowledge/achievement, so that students in the same class have similar ability. Formative assessment identifies the gap between students' knowledge and teachers' instruction, while summative assessment is aimed at giving each student a final course grade (Harlen and James, 1997). Meanwhile, information regarding students' difficulties during the learning process can be obtained through diagnostic assessment, and a test used for filtering or choosing the best participants is called selective assessment.
Moreover, students' performance can be evaluated in many ways, for example written tests (e.g., essay, multiple choice, etc.) and oral tests (e.g., interviews and observations). In this research, we focus on written tests. An issue that can arise in written tests is how to build several sets of questions/items that have the same characteristics and difficulty. Such sets are necessary to prevent cheating among students. The usual ways to build sets of questions are randomizing the order of the questions and modifying the answer options. Modifying the questions themselves is rarely done because it is not easy and takes a lot of time. This issue should therefore be solved by finding a strategy to generate sets of questions automatically.
Therefore, this research aims to generate several packages/sets of questions automatically. To ensure fairness, all sets should contain questions with the same characteristics and difficulty. To achieve this objective, we utilize the K-Means algorithm (Bansal et al., 2017), a classic unsupervised learning method in Machine Learning (Mitchell et al., 1990) that defines cluster centers and their members, which share the same characteristics. Several implementations of K-Means show its contribution to various problems. For example, K-Means was used to determine shuttlecock placement and stroke types in badminton. Related to generating sets of questions, a variant of K-Means was used in our previous research. Additionally, a Machine Learning method called the apriori association rule was utilized to detect aspects of students' difficulty and give recommendations (Munir et al., 2018).
Figure 1 shows the proposed method used in this research for generating sets of items by using K-Means. It was adopted from previous research. Basically, there are three stages, as follows:

Data preparation.
This stage is aimed at generating training data, i.e., the data used for training the algorithm so that we obtain a sufficient model for building sets of items. This stage consists of the following processes:
a. Collecting questions/items: In the data preparation step, we first need to collect items on a particular subject. For simplicity, in this research we collected 638 questions from the following three chapters: computer and networking, application layer, and transport layer, in five textbooks used by many universities around the world, as follows:
• Computer Network by Tanenbaum & Wetherall (2011 (2013).
b. Defining features: We define characteristics, or features, for each question. These features are used to determine whether one question is similar to another. In this research, we defined 14 features as follows:
• C1: The first level of the cognitive domain (i.e., remembering), which has a value between 0 and 1.
• C2: The second level of the cognitive domain (i.e., understanding), which has a value between 0 and 1.
• C3: The third level of the cognitive domain (i.e., applying), which has a value between 0 and 1.
• C4: The fourth level of the cognitive domain (i.e., analysing), which has a value between 0 and 1.
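As an illustration of how one item and its features might be stored, the following minimal sketch models a question as a small record. It is not the authors' implementation: the class name `Question`, the field names, and the example values are hypothetical, and only the four cognitive-level features C1 to C4 listed above are modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    """One item from the question bank with a partial feature vector.

    Hypothetical structure: only C1..C4 (cognitive levels) are included.
    """
    item_id: str   # e.g. "1.58" = chapter 1, question 58
    c1: float      # remembering
    c2: float      # understanding
    c3: float      # applying
    c4: float      # analysing

    def __post_init__(self):
        # Each cognitive-level feature must lie between 0 and 1.
        for name in ("c1", "c2", "c3", "c4"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be between 0 and 1, got {value}")

    def features(self):
        """Return the feature values as a list, ready for clustering."""
        return [self.c1, self.c2, self.c3, self.c4]


# Illustrative values only; they are not taken from the paper's data.
example = Question("1.58", 0.7, 0.3, 0.0, 0.0)
print(example.features())
```

Keeping the range check in the record itself means that malformed training data is rejected before it reaches the clustering step.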

Clustering using K-Means
In this step, we implement and execute the K-Means algorithm, supplying input data such as the training data, the maximum number of iterations, and the number of cluster centers. A detailed explanation of the algorithm can be found in (Na et al., 2010). In short, it contains four steps, as follows:
a. Initialization of cluster centers: In the first step, we need to choose the cluster centers. This can be done randomly. It should be noted that the
b. Assignment step: After the cluster centers are chosen, the distances from all data to the cluster centers are calculated to determine the cluster members. Instances included in the same cluster are therefore close to each other.
c. Update step: We then update the position/location of each cluster center by averaging the values of all members included in the cluster.
d. Repeat the processes: The same processes are repeated until the maximum number of iterations is reached or the algorithm converges.
The output of this step is the cluster centers with their members, representing questions that have the same or similar characteristics. It should be noted that each cluster center represents a group of similar items, and the items are its members.
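The four steps can be sketched in plain Python as follows. This is a minimal illustration rather than the authors' implementation: the function names, the `seed` parameter, and the use of squared Euclidean distance are assumptions, since the text does not name the distance measure.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors (assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(data, k, max_iter=100, seed=0):
    """Cluster `data` (a list of equal-length lists) into k groups.

    Returns (centers, labels), where labels[i] is the cluster index of data[i].
    """
    rng = random.Random(seed)
    # a. Initialization: pick k distinct data points at random as centers.
    centers = [list(point) for point in rng.sample(data, k)]
    labels = None
    for _ in range(max_iter):
        # b. Assignment: each point joins the cluster of its nearest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(point, centers[c]))
            for point in data
        ]
        # d. Stop early once the assignments no longer change (convergence).
        if new_labels == labels:
            break
        labels = new_labels
        # c. Update: move each center to the mean of its current members.
        for c in range(k):
            members = [p for p, label in zip(data, labels) if label == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels
```

In the setting described here, each row of `data` would be the feature vector of one question, and the returned labels group questions with similar characteristics.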

Building sets of items:
The last step is to pick questions according to the cluster centers and their members. For example, suppose we need to generate three packages of items where each set contains five questions. We first choose one cluster center randomly. Then, we pick three questions from the selected cluster, one for each of the three packages. Now each set has one question, and these questions have the same characteristics. We repeat this process until all sets contain five questions. It should be noted that other criteria can be added in this process to ensure the quality of the sets, for example that duplicate questions are not allowed and that the proportion of selected questions across all chapters is considered. Finally, after this step we obtain sets of items whose questions have similar characteristics, so that fairness can be ensured.
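The picking procedure can be sketched as follows, assuming the clustering output is available as a mapping from cluster index to question IDs. The function name `build_sets` and its parameters are hypothetical. Of the additional criteria mentioned above, only the no-duplicates rule is implemented; the chapter-proportion criterion is left out of this sketch.

```python
import random


def build_sets(clusters, n_sets, n_questions, seed=0):
    """Build n_sets packages with n_questions each.

    clusters: dict mapping a cluster index to a list of question IDs.
    In every round, one cluster is chosen at random and each package
    receives a different question from that cluster.
    """
    rng = random.Random(seed)
    # Work on copies so that used questions can be removed (no duplicates).
    pool = {cid: list(items) for cid, items in clusters.items()}
    packages = [[] for _ in range(n_sets)]
    for _ in range(n_questions):
        # Only clusters that still hold enough unused questions qualify.
        eligible = [cid for cid, items in pool.items() if len(items) >= n_sets]
        if not eligible:
            raise ValueError("not enough unused questions left in any cluster")
        cid = rng.choice(eligible)
        picked = rng.sample(pool[cid], n_sets)
        for package, question in zip(packages, picked):
            package.append(question)
            pool[cid].remove(question)
    return packages
```

Because the questions at the same position in every package come from the same cluster, the packages are aligned item by item, which is what keeps their overall characteristics comparable.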

RESULTS AND DISCUSSION
After designing the proposed computational model as explained previously, we built a web-based application, as shown in Figure 2, which presents the result page showing several packages of items generated by the system. The system also provides other functionalities, such as creating a new project, creating and loading items along with the metadata required to build the data, and setting other parameters (e.g., number of sets and K-Means parameters).
Moreover, we ran several simulations to validate the performance of the proposed model. Using the training data obtained from the five textbooks introduced before, we built, for example, three sets containing 10 questions each, drawn from the 638 questions in the textbooks. In the resulting sets, the ID of a question represents the chapter and question number. For example, a question with the ID 1.58 is the 58th question of the first chapter. According to the results, the chapters are represented in equal proportions. In other words, every set contains questions from all chapters. Moreover, we also analyzed the results based on the feature values of all questions. The average values can be seen in Table 1. According to the average values in Table 1, the means of the three sets are 11.8, 11.4, and 9.86, and the standard deviations are 35.3, 34.2, and 28.9. This means that all sets have relatively similar characteristics on the 14 features. To explain in more detail, a visualization of the average values of all features can be seen in Figure 3.
Additionally, we performed an analysis of variance (ANOVA) test with α = 0.05. The following hypotheses were constructed to test whether the items in each set have similar characteristics:
• H0: There is no difference between the averages of the feature values in Sets 1, 2, and 3.
• H1: There is a difference between the averages of the feature values in Sets 1, 2, and 3.
After running ANOVA, we obtained a p-value of 0.987. Because this value is greater than α, H0 is not rejected. Therefore, we can state that the characteristics of the questions in all sets are relatively similar, so that the fairness of the examination can be maintained. Additionally, a comparison with our previous research shows that the system provides the same results when Fuzzy C-Means is used.
In the future, we plan to improve the model by using alternative methods, such as Rough Sets (Riza et al., 2014), Naïve Bayes (Mulyani et al., 2016), and Fuzzy Sets (Riza et al., 2015). These are Machine Learning methods, so the resulting computational model can be smart. Moreover, we also propose a computational model to generate the bank of questions and the values of the features of the questions automatically. Various intelligent classifiers can be used to improve the computational model (Alasker et al., 2017). We can also improve the computational cost by implementing data streaming (Mediayani et al., 2013).
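Returning to the ANOVA comparison reported above, the following minimal sketch shows how the one-way ANOVA F statistic is computed for several groups of values. The function name is hypothetical, and the sketch stops at the F statistic and its degrees of freedom. Obtaining the p-value additionally requires the F distribution, which a statistics library can supply (for example, `scipy.stats.f_oneway` returns both the statistic and the p-value).

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    groups: list of lists, one list of numeric observations per set.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, means))
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within
```

An F statistic close to zero means that the group means differ very little relative to the spread within each group. This corresponds to a large p-value and is consistent with not rejecting H0.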

CONCLUSION
The contributions of this research are as follows. First, we provide a computational model using K-Means for generating sets of items that have the same characteristics, to ensure the fairness of the examination. Second, before performing K-Means, we proposed 14 features to be used for building the training data. The 14 defined features, such as Bloom's taxonomy levels, types of questions, etc., represent the intrinsic characteristics of the questions. Moreover, an experiment was conducted to validate the model. According to the results and their analysis using descriptive statistics (i.e., mean, standard deviation, and data visualization) and inferential statistics (i.e., ANOVA), we can state that the proposed system produced the sets of items as required.

AUTHORS' NOTE
The author(s) declare(s) that there is no conflict of interest regarding the publication of this article. The authors confirm that the data and the paper are free of plagiarism.