Combining Multiple Text Representations for Improved Automatic Evaluation of Indonesian Essay Answers

Purpose: Essay questions serve as an important examination method, providing more elaborative insight into students' learning achievement than multiple-choice questions. When the number of students in a class is large, however, examinations using essay questions become hard to conduct and take a long evaluation time. Automatic essay evaluation has, therefore, become a potential approach in this situation. Various methods have been proposed; however, optimal solutions for such evaluation in the Indonesian language are less known. Furthermore, with the rapid development of machine learning approaches, in particular deep learning, the investigation of such optimal solutions becomes more necessary. Method: To address this issue, this study investigated text representation approaches for the optimal automatic evaluation of Indonesian essay answers. The investigation compared pre-trained word embedding methods, namely Word2vec, GloVe, FastText, and RoBERTa, as well as text encoding methods, namely long short-term memories (LSTMs) and transformers. LSTMs are able to capture temporal semantics by employing state variables, while transformers are able to capture long-term dependencies between parts of their input sequences. Additionally, we investigated classification-based and similarity-based training to build text encoders; we expected these training approaches to allow the encoders to extract different views of the information. We compared classification results produced by different text encoders and by combinations of text encoders. Result: We evaluated the various text representation approaches using the UKARA dataset. Our experiments showed that the FastText word embedding method outperformed the Word2vec, GloVe, and RoBERTa methods. The FastText method achieved an F1-score of 75.43% on the validation sets, while the Word2vec, GloVe, and RoBERTa methods achieved F1-scores of 69.56%, 74.53%, and 72.87%, respectively.
In addition, the experiments showed that combinations of text encoders outperformed individual encoders. The combination of the LSTM encoder, the transformer encoder, and the TF-IDF encoder obtained an F1-score of 77.22% in the best case, better than the best F1-score of the individual LSTM encoders (75.35%), the best combination of transformer encoders (71.49%), and the individual TF-IDF encoder (76.69%). We observed that LSTM encoders performed better when built using classification-based training, while transformer encoders performed better when built using similarity-based training. Novelty: The novelty of this research is the optimal combination of text encoders specifically constructed for the evaluation of essay answers in the Indonesian language. Our experiments showed that the combination of three encoders, namely the LSTM encoder built using classification-based training, the transformer encoder built using classification-based and similarity-based training, and the TF-IDF encoder, obtained the best classification performance.


INTRODUCTION
Essay questions are important methods to examine student understanding of course materials [1]. Compared to multiple-choice questions, essay questions allow a more elaborative examination and provide deeper insight into students' learning achievement [2]. Essay questions require students to apply logical reasoning and higher-order thinking skills to formulate answers [3]. Moreover, this approach allows students to provide opinions with respective justifications and pushes them to practice systematic writing, which is an essential life skill.
With the increasing popularity of online courses, including massive open online courses (MOOCs), a large number of students (hundreds or more) can be enrolled in a single online class. When this happens, examinations using essay questions become hard to conduct and take a long evaluation time. While course providers may spend more funding to hire graders or examiners, the additional cost may not be feasible. Meanwhile, removing essay questions from student assessments may degrade course quality. Automatic essay scoring has, therefore, become a potential solution to overcome this issue [4]. An automatic system is able to quickly analyze students' answers and decide whether the answers are correct, partially correct, or incorrect [4].
The history of automatic essay scoring has been reviewed by Burrows et al. [5], who divided the timeline into five eras: concept mapping, information extraction, corpus-based methods, machine learning, and evaluation. The review emphasized recent developments in the area that focus on reproducibility, standardized corpora, and permanent evaluations, e.g., those supported by evaluation forums such as SemEval Semantic Textual Similarity (STS) [6]-[8].
The most direct method of automatic essay evaluation applies grammar and semantic analysis to extract the core content of essay answers [9], [10]. The core content can then be matched to ground-truth answers to obtain the degree of correctness. This method, which directly corresponds to the way humans evaluate, is, however, not adaptive. It requires an explicit encoding of human knowledge and often involves extensive manual tuning; reapplying the method from one domain to another is not practical. A more practical approach for automatic evaluation is machine learning. Assuming that a large number of answers is available as examples, together with the respective labels, a machine learning algorithm can learn and extract patterns from the examples, based on which future answers can be evaluated and marked.
Machine learning approaches can be divided into two categories: response-based and reference-based approaches [11]. The response-based approach exploits the availability of a large number of pre-scored student responses to train reliable regression or classification models. Static-length features such as bag-of-words or n-grams and vector-space classifiers such as support vector machines (SVMs) have been used to build scoring systems [12]-[14]. Zedan and Al-Sultan [15], for example, combined features such as term weights, length ratios, and lexical or semantic similarities with supervised learning algorithms to produce essay scores. They also proposed the application of text augmentation based on key grading-specific constructs, i.e., question demoting, to obtain more accurate text classifiers. Evaluations on the SemEval-2013 and Mohler datasets [6], [12] have confirmed the effectiveness of the augmentation.
Reference-based approaches address the issue of scarce responses by formulating automatic scoring tasks as text similarity problems. In this case, the score of a student's response is computed by comparing the response to some reference answers; higher syntactic or semantic similarity means a higher mark for the answer. Current work on reference-based approaches is focused on vector-space regression. Pado [14] performed automatic essay evaluation by employing features such as n-grams, text similarities, dependencies, abstract semantic representations, and entailment votes. Features obtained from students' responses were matched against those obtained from reference answers. This work covered two types of corpora, language skills and content assessments, showing that the features may have different levels of effectiveness depending on corpus type. Ratna et al. [16] introduced a web-based essay grading system called SIMPLE-O that utilized latent semantic analysis (LSA) to estimate similarities between student responses and reference answers. Their experiments, which involved 40 students and 3 lecturers as graders, demonstrated that the system obtained 86% agreement with human ratings. One problem with the aforementioned methods is that they cannot directly handle inputs with variable lengths. The features employed by the methods are also heuristically crafted, leading to sub-optimal performance.
While various methods have been proposed, optimal solutions for automatic essay evaluation in the Indonesian language are less known. The Indonesian language is a standardized form of Malay that has been used as a lingua franca in the Nusantara archipelago for centuries [17]. The language is now spoken by almost 300 million people [18] and has joined the ranks of UNESCO's ten official languages alongside English, Arabic, Mandarin, Russian, French, Spanish, Hindi, Italian, and Portuguese [19]. Optimal text representations for Indonesian essay answers are, therefore, very important. With the rapid development of machine learning approaches, particularly deep learning approaches, investigating such optimal solutions becomes even more necessary.
To seek the aforementioned optimal solutions, this study compared pre-trained word embedding methods such as Word2vec, GloVe, FastText, and RoBERTa, and compared text encoding methods such as long short-term memories (LSTMs) [20] and transformers [21]. LSTMs are known to capture temporal semantics by employing state variables, while transformers are able to capture long-term dependencies between parts of their input sequences. We also investigated combinations of text encoding methods to determine their effectiveness in improving the accuracy of essay evaluation.

METHODS
Figure 1 shows the proposed essay evaluation method. The method takes a student's answer (text) at a time and processes this input by applying word embedding, text encoding, and classification/matching to obtain an evaluation result.

Pre-processing
The input of the proposed method is a text provided by a student as an answer to a particular question. To extract features from this text, some pre-processing steps are applied. The first step is replacing non-relevant characters in the text with spaces; the only characters considered relevant are a-z, A-Z, 0-9, ".", ",", "?", and "!". Following this, the text is split into words using whitespace as the delimiter. At this point, the text has become a sequence of words.
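The two pre-processing steps above can be sketched as follows. The example sentence is hypothetical and serves only to illustrate the character filtering and whitespace splitting.

```python
import re

def preprocess(text):
    # Replace every character outside a-z, A-Z, 0-9, ".", ",", "?", "!"
    # with a space, then split on whitespace.
    cleaned = re.sub(r"[^a-zA-Z0-9.,?!]", " ", text)
    return cleaned.split()

# Hypothetical student answer (Indonesian): the colon is dropped,
# while the comma and exclamation mark are kept.
tokens = preprocess("Jawaban siswa: benar, karena air menguap!")
# tokens -> ['Jawaban', 'siswa', 'benar,', 'karena', 'air', 'menguap!']
```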

Word embedding
Before a sequence of words can be further processed, each word in the sequence has to be converted into a vector. We evaluated several word embedding methods for this conversion, namely Word2vec [22], GloVe [23], FastText [24], and RoBERTa [25], [26]. Word2vec [22] finds the vector representation of a word as the vector that is most predictive of the surrounding words. The resulting vectors are able to capture semantic meanings, allowing words with similar meanings to have similar vector representations. GloVe [23] finds representations of words by employing information from the words' neighborhoods and explicitly using the number of word co-occurrences in a particular corpus. The method thus aims to obtain representations that are representative from a global perspective. FastText [24] works on character n-grams to obtain more fine-grained representations of texts before producing vector representations of words. This approach captures sub-word information, making it effective for languages with rich morphology and for handling out-of-vocabulary words. RoBERTa [25] is an improved version of BERT [26], which also works on sub-word tokens. This method employs the powerful representational capability of the transformer [21] while considering both left and right context (thus bidirectional). RoBERTa is trained using dynamic masking on longer sequences to improve its generalization.
We employed pre-trained Word2vec, GloVe, FastText, and RoBERTa models trained on corpora in the Indonesian language. These pre-trained models were built from large and representative datasets and are therefore able to produce more general and better representations. We used the models to embed words into vectors of fixed dimensions: the pre-trained Word2vec, GloVe, FastText, and RoBERTa models produce vectors of 300, 50, 300, and 768 dimensions, respectively. After applying word embedding, the sequence of words is converted into a sequence of vectors. Note that we padded input sequences with zero vectors when necessary so that they all had the same length (equal to the length of the longest sequence in the dataset).

Text encoding using classification-based method
To build both encoders, we plugged a feed-forward layer with a sigmoid activation function at the end of the encoders (Figure 3). Correct answers are given the label 1, while incorrect answers are given the label 0. We then conducted classification-based training using the mean squared error (MSE) as the loss function. Note that the MSE was computed from the values obtained from the sigmoid outputs and the targets specified in the training data. After the training finished, we unplugged the feed-forward layer and used the output of the original model as vector representations of texts.
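A minimal numpy sketch of this classification-based training, assuming the encoder outputs are fixed-length vectors (random stand-ins here, not real LSTM or transformer outputs): a sigmoid feed-forward head is trained with MSE loss against the 0/1 labels; afterwards the head would be discarded and the encoder outputs kept as text representations.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy stand-in for encoder outputs: one 8-dim vector per answer.
H = rng.normal(size=(16, 8))
y = rng.integers(0, 2, size=16).astype(float)  # 1 = correct, 0 = incorrect

# Sigmoid feed-forward head trained with gradient descent on the MSE.
w, b, lr = np.zeros(8), 0.0, 0.5
for _ in range(200):
    p = sigmoid(H @ w + b)
    grad = (p - y) * p * (1 - p)        # d(MSE)/d(logit), up to a constant
    w -= lr * (H.T @ grad) / len(y)
    b -= lr * grad.mean()

mse = float(np.mean((sigmoid(H @ w + b) - y) ** 2))
```

With the head untrained (all-zero weights) the MSE starts at exactly 0.25; training drives it below that, after which only the encoder part would be reused.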

Text encoding using similarity-based method
To complete our investigation, we also applied similarity-based methods to build text encoders. For this purpose, we trained the encoders using the Siamese network architecture [27], [28], as shown in Figure 4. This architecture takes two sequences as inputs, both of which are given to the same text encoder. The two outputs produced by the encoder are compared using the cosine similarity formula. The Siamese networks were built using training sets containing matched pairs and unmatched pairs. Matched pairs were given the label +1, representing high similarity, while unmatched pairs were given the label -1, representing low similarity. Using these labels, the training was conducted by employing the mean squared error (MSE) as the loss function.
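A sketch of the Siamese comparison under these labels, with a toy shared encoder (a single tanh projection standing in for the LSTM or transformer): both inputs pass through the same encoder, the outputs are compared with cosine similarity, and the MSE loss is taken against the +1/-1 target.

```python
import numpy as np

rng = np.random.default_rng(1)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Shared toy encoder: one projection matrix applied to both inputs,
# standing in for the trained LSTM/transformer encoder.
W = rng.normal(size=(6, 4))

def encode(x):
    return np.tanh(W.T @ x)

xa, xb = rng.normal(size=6), rng.normal(size=6)   # a toy input pair
sim = cosine(encode(xa), encode(xb))

# MSE against the pair label: +1 for matched pairs, -1 for unmatched pairs.
loss_if_matched = (sim - 1.0) ** 2
loss_if_unmatched = (sim + 1.0) ** 2
```

During training, the gradient of this loss would be backpropagated through the shared encoder weights for both branches at once.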

Since the dataset provides only answers with correct or incorrect labels (not training pairs), we needed to generate the pairs from the dataset. For a particular question, we generated the pairs as follows:
1. Initiate the set of unmatched pairs P- and the set of matched pairs P+ to empty sets;
2. Randomly choose an answer from the training set of incorrect answers X- = {x-1, x-2, x-3, …, x-M} and an answer from the training set of correct answers X+ = {x+1, x+2, x+3, …, x+N}, add the pair to the set of unmatched pairs P-, and repeat as required;
3. Extract features from each answer in the set of correct answers X+ to produce a set of representation vectors U+ = {u+1, u+2, u+3, …, u+N};
4. Apply agglomerative clustering to U+ using cosine distance as the metric to produce clusters U+1, U+2, U+3, …, U+C of correct answers;
5. For each cluster U+i, randomly choose two answers from the cluster, add the pair to the set of matched pairs P+, and repeat as required.
The procedure chooses any pair of a correct and an incorrect answer as an unmatched pair. For the matched pairs, it chooses only pairs of correct answers from the same cluster, which makes the matched pairs close enough to each other in the feature space. Note that we refer to the correct answers in the training set as reference answers.
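The pair-generation procedure can be sketched with scipy's agglomerative (hierarchical) clustering. The representation vectors, pair counts, and cluster count below are illustrative stand-ins.

```python
import random
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
random.seed(2)

U_pos = rng.normal(size=(12, 5))   # toy representation vectors of correct answers
neg_ids = list(range(8))           # toy indices of incorrect answers

# Agglomerative clustering of correct answers with cosine distance,
# cut into (an illustrative) 3 clusters.
Z = linkage(U_pos, method="average", metric="cosine")
labels = fcluster(Z, t=3, criterion="maxclust")

# Matched pairs P+: two distinct correct answers from the same cluster.
P_plus = []
for c in set(labels.tolist()):
    members = [i for i, lab in enumerate(labels) if lab == c]
    if len(members) >= 2:
        P_plus.append(tuple(random.sample(members, 2)))

# Unmatched pairs P-: a random incorrect answer paired with a random correct one.
P_minus = [(random.choice(neg_ids), random.randrange(len(U_pos))) for _ in range(5)]
```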
During inference, we used the trained text encoders to encode input answers. The obtained representation vectors were then compared to the average representation vector of the reference answers to produce the final representations (Figure 5). Note that we used the Hadamard product for this comparison, since the elementwise products are the ingredients of the cosine similarity.
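A small numeric sketch of this comparison step: the student's representation vector is multiplied elementwise (Hadamard product) with the average of the reference-answer vectors, so the resulting feature vector holds the summands of the cosine-similarity numerator. All vectors here are toy values.

```python
import numpy as np

# Toy encoded reference (correct) answers and one encoded student answer.
reference_vectors = np.array([[1.0, 0.0, 2.0],
                              [3.0, 2.0, 0.0]])
answer_vector = np.array([0.5, 1.0, 1.0])

avg_ref = reference_vectors.mean(axis=0)   # -> [2.0, 1.0, 1.0]
final_repr = answer_vector * avg_ref       # Hadamard product -> [1.0, 1.0, 1.0]
```

Summing `final_repr` (and dividing by the two vector norms) would recover the cosine similarity; keeping the elementwise products as a vector instead preserves more information for the downstream classifier.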

Feature fusion and classification
As explained in the previous subsections, we have two alternatives to perform text encoding, namely LSTM encoders and transformer encoders. In addition to using the encoders individually, we proposed combining multiple representation vectors obtained from multiple encoders by concatenation. This scheme is known as feature fusion. We considered that the various representations may carry complementary information and can be combined to produce improved performance. The last step of the proposed method is classification: we used support vector machines (SVMs) [29] to classify the given essay answers as correct or incorrect.
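Feature fusion then reduces to concatenating the per-encoder representation vectors before the SVM. The vectors, dimensions, and labels below are random stand-ins for the LSTM, transformer, and TF-IDF outputs.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
n = 60
z_lstm = rng.normal(size=(n, 8))    # stand-in LSTM encoder outputs
z_trans = rng.normal(size=(n, 8))   # stand-in transformer encoder outputs
z_tfidf = rng.normal(size=(n, 20))  # stand-in TF-IDF vectors
y = (z_lstm[:, 0] + z_trans[:, 0] > 0).astype(int)  # toy correct/incorrect labels

# Feature fusion: concatenate the representation vectors into one long vector.
Z = np.concatenate([z_lstm, z_trans, z_tfidf], axis=1)

clf = SVC(kernel="rbf").fit(Z, y)
train_acc = clf.score(Z, y)
```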

RESULTS AND DISCUSSIONS

Dataset and evaluation setup
To evaluate the validity and effectiveness of the proposed method, experiments were carried out on the UKARA dataset. The UKARA dataset contains 10 essay questions, each with approximately 5000-6000 answers. Answers for each question are further classified into two sets, namely correct answers and incorrect answers. This grouping was performed by numerous human graders who received technical guidance on how to perform the grading. Table 1 shows in detail the number of answers and the grouping of the answers for the 10 questions. We regarded the set of correct answers as the positive class. We used 3-fold cross-validation to conduct experiments on the UKARA dataset. For each question, stratified random sampling was employed to create training, validation, and test sets using proportions of approximately 60%, 10%, and 30%, respectively. The validation sets were used to fine-tune the proposed methods before they were applied to the test sets. By using three folds, we attempted to balance real-world scenarios and experimental validity. With more folds (e.g., five or ten), the number of training data, in the context of essay evaluation, would become much lower than the number of test data. This does not conform to real situations of essay evaluation, where the number of labeled data (i.e., training data) is normally much smaller than the number of unlabeled data (test data). We also did not choose one or two folds, since these numbers are too low for cross-validation (and are, therefore, uncommon); the obtained results might not be valid and convincing enough to support research conclusions.
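The stratified 60/10/30 split can be sketched with scikit-learn by splitting twice: first carving off the 30% test set, then taking 10% of the total (10/70 of the remainder) as the validation set. The labels are toy data with a class imbalance similar to the dataset's.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 answers, 30% labeled correct (positive class).
y = np.array([0] * 70 + [1] * 30)
X = np.arange(100).reshape(-1, 1)

# Split off 30% as the test set, stratified by label.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0)

# Carve 10% of the ORIGINAL total (10/70 of the remainder) for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=10 / 70, stratify=y_tmp, random_state=0)
```

Stratification keeps the correct/incorrect ratio roughly constant across the three sets, which matters because the positive class is the smaller one.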

Model tuning and classification-based training
We built text encoders by connecting feed-forward neural networks with sigmoid outputs to the encoders and conducted classification-based training. Table 2 shows classification results when LSTM encoders, trained using training sets, were employed to classify answers in validation sets. Bold numbers represent the best performance on a particular metric, while underlined numbers represent the best performance on a particular metric for a certain word embedding. We compared the four word embedding methods and evaluated various numbers of neurons for text encoding. Since the number of correct answers is generally smaller than that of incorrect answers, we valued precision, recall, and F1-score more than accuracy. Table 2 shows that the best results were achieved by the FastText method with LSTM encoders of 20 or 40 neurons. The FastText method produced an F1-score of 75.43% in the best case, better than the best results produced by the Word2vec, GloVe, and RoBERTa methods. The Word2vec method produced the worst performance, i.e., an F1-score of 69.56% in the best case. The GloVe and RoBERTa methods were the second- and third-best performers, producing F1-scores of 74.26% and 73.00% in the best cases, respectively.
Table 3 shows classification results when transformer encoders were employed for text encoding. Similar to the previous results, Table 3 shows that the best results were achieved by the FastText method with text encoders of 20 or 40 neurons. The FastText method produced an F1-score of 75.14% in the best case. The Word2vec method again produced the worst performance, i.e., an F1-score of 69.01% in the best case. The GloVe and RoBERTa methods produced F1-scores of 74.53% and 72.87% in the best cases, respectively. We used these results to direct further experiments: in the subsequent experiments, we employed the FastText method for word embedding and text encoders of 40 neurons for text encoding.

Similarity-based training
We built text encoders using similarity-based training by employing the Siamese network architecture. The encoders were trained to produce high similarities when given pairs of correct answers and reference answers and low similarities when given pairs of incorrect answers and reference answers. Figure 7 shows distributions of cosine similarities of correct answers (a) and incorrect answers (b) when the answers were encoded using LSTM encoders and compared to the average representation vectors of reference answers. Figure 8 shows similar distributions obtained from transformer encoders. Both figures show that most correct answers had high cosine similarities, while most incorrect answers had low cosine similarities, suggesting that the similarity-based training worked as expected. We compared the classification performance of the four aforementioned encoders and their combinations. These encoders and combinations can be regarded as the methods proposed in this paper and are written in bold in Table 4. Table 4 shows classification results when encoders constructed using training sets were used to evaluate answers in validation and test sets. The bold number represents the best F1-score, while underlined numbers represent the best F1-score of a particular class of text encoding. Note that for combinations of multiple encoders, we concatenated the outputs of the encoders into a single long vector. We first compared text encoders trained using the classification-based approach, as shown in the first three rows of Table 4. The best results were achieved by the individual LSTM encoder (an F1-score of 75.35%), outperforming the individual transformer encoder (74.76%) and the combination of the two (75.19%).
We also compared text encoders trained using the similarity-based approach, as shown in the next three rows of Table 4. In this case, the best results were achieved by combining the LSTM encoder and the transformer encoder (an F1-score of 71.49%). The individual transformer encoder came second with an F1-score of 69.62%, and the individual LSTM encoder came third with an F1-score of 68.18%.
From the results presented in Table 4, we may infer that LSTM encoders are better suited to classification-based training, while transformer encoders are better suited to similarity-based training. In other words, recurrence and attention mechanisms have their own characteristics and work differently in various situations.
After obtaining results from the classification-based and similarity-based approaches, we combined the best encoders from the two approaches, as shown in the seventh row of Table 4. Combining the two encoders produced better performance than using them individually: the combined model achieved an F1-score of 75.44%, slightly better than the best individual encoder built using the classification-based approach and clearly better than the best individual encoder built using the similarity-based approach. These results suggest that the classification-based and similarity-based approaches have advantages that complement each other.
Moreover, we compared our text encoders to existing methods from previous research [30], [31]. Rajagede and Hastuti [30] proposed a method that encodes texts by taking the average of FastText word embedding vectors; this represents a basic text encoding approach that applies a simple aggregate operation over a collection of vectors. Meanwhile, Sari et al. [31] proposed a method that encodes texts using a more classical approach, namely the bag-of-words approach or the TF-IDF method. Unlike recent methods, the TF-IDF method simply ignores the order of words in input texts and puts more emphasis on the appearance of distinguishing keywords; it was originally developed for information retrieval rather than semantic understanding. Table 4 shows that our combined encoders clearly performed better than the method used in [30]. However, compared to the TF-IDF method (the bag-of-words approach) used in [31], our combined encoders performed worse: the TF-IDF approach achieved an F1-score of 76.69%, while our combined encoders achieved an F1-score of 75.44%. These results reveal that although the TF-IDF approach is simple, it is able to extract useful information from Indonesian essay answers effectively, and they suggest that the appearance of certain keywords is a strong indicator of the correctness of an answer.
Motivated by this observation, as the final experiment of this research, we combined the TF-IDF approach with the best combination of LSTM and transformer encoders. Combining these approaches further increased classification performance: the combination achieved an F1-score of 77.22%, the best performance in the overall comparison. These results suggest that the different encoders captured complementary information that can be used to improve the automatic evaluation of Indonesian essay answers. We may also conclude that classical and recent approaches have their own strengths and can be employed together to achieve better performance.

CONCLUSION
In this research, we proposed to evaluate and combine several text encoders, including LSTM encoders and transformer encoders, to extract text representations from essay answers in the Indonesian language. Several methods were used to build the encoders, including classification-based training and similarity-based training. We also evaluated word embedding methods such as Word2vec, GloVe, FastText, and RoBERTa. We compared various alternatives of the proposed text encoders and compared them with some existing methods, such as average FastText vectors and TF-IDF. Our experiments showed that FastText was the best word embedding method. The experiments also showed that the best text encoding was performed by the combination of three encoders: the LSTM encoder, built using classification-based training; the transformer encoder, built using classification-based and similarity-based training; and the TF-IDF encoder. The combined encoders were able to achieve an F1-score of 77.22% when evaluated on the UKARA dataset.

Figure 2. Text encoding using (a) the LSTM and (b) the transformer

Figure 3. Classification-based training to build text encoders

Figure 4. Similarity-based training to build text encoders

Figure 5. Extraction of final representations in similarity-based text encoders

Figure 6 shows the MSE obtained from training the LSTM model of question number 1 with 40 neurons. The graph reveals that the MSE steadily decreased as the training progressed. At the beginning of the training, the MSE was close to 0.09 (for both the training data and the validation data). After 40 epochs, the MSE of the training data dropped to around 0.04, while the MSE of the validation data dropped to around 0.065. The MSE of the validation data was, therefore, always higher than the MSE of the training data, as expected. The training, however, did not suffer from overfitting, since the MSE of the validation data did not increase before the training finished.

Figure 7. Distributions of cosine similarities of (a) correct answers and (b) incorrect answers, with respect to the average representation vectors of reference answers, obtained from LSTM encoders

Figure 8. Distributions of cosine similarities of (a) correct answers and (b) incorrect answers, with respect to the average representation vectors of reference answers, obtained from transformer encoders

Method comparison
As described previously, we have two alternatives for text encoders, namely LSTM encoders and transformer encoders; both can be trained using a classification-based approach (denoted CB) or a similarity-based approach (denoted SB). We therefore have four alternatives of text encoders: LSTM constructed using classification-based training (LSTM-CB), transformer constructed using classification-based training (transformer-CB), LSTM constructed using similarity-based training (LSTM-SB), and transformer constructed using similarity-based training (transformer-SB). These different alternatives of encoders produce different representation vectors of texts.

Table 1. Statistics of the UKARA dataset

Table 2. Validation results obtained from the LSTM encoders

Table 3. Validation results obtained from the transformer encoders

Table 4. Classification results produced by various alternatives of text encoders