Automatic Essay Scoring Method Based on Multi-Scale Features

Abstract: Essays are a pivotal component of conventional exams; accurately, efficiently, and effectively grading them is a significant challenge for educators. Automated essay scoring (AES) is a complex task that utilizes computer technology to assist teachers in scoring. Traditional AES techniques only focus on shallow linguistic features based on the grading criteria, ignoring the influence of deep semantic features. The AES model based on deep neural networks (DNN) can eliminate the need for feature engineering and achieve better accuracy. In addition, the DNN-AES model combining different scales of essays has recently achieved excellent results. However, it has the following problems: (1) It mainly extracts sentence-scale features manually and cannot be fine-tuned for specific tasks. (2) It does not consider the shallow linguistic features that the DNN-AES cannot extract. (3) It does not contain the relevance between the essay and the corresponding prompt. To solve these problems, we propose an AES method based on multi-scale features. Specifically, we utilize Sentence-BERT (SBERT) to vectorize sentences and connect them to the DNN-AES model. Furthermore, the typical shallow linguistic features and prompt-related features are integrated into the distributed features of the essay. The experimental results show that the Quadratic Weighted Kappa of our proposed method on the Kaggle ASAP competition dataset reaches 79.3%, verifying the efficacy of the extended method in the AES task.


Introduction
During the writing stage, examinees are asked to write an essay according to the prompt; then, scorers mark the essay. Since scoring requires a lot of time and effort, it is difficult to grade a large number of essays [1]. In addition, scorers are easily influenced by personal subjective factors during grading [2]. AES is a technique that uses NLP technology to evaluate essays; it analyzes an essay from multiple aspects (such as language, structure, and content) and aims to produce an objective, fast, and accurate score [3]. AES can circumvent many disadvantages of manual scoring: it saves labor costs and is not influenced by personal subjective factors [4], which can substantially improve the fairness and accuracy of scoring. Over the years, researchers have designed many AES methods, which can generally be classified as feature engineering methods or DNN methods [5].
Early feature engineering-based AES methods construct shallow features based on scoring criteria, such as grammar, syntax, and chapter structure; then, they use machine learning to indirectly evaluate the essay [6]. Such technology, which is based on handcrafted features, overlooks the potential deep information in the essay. Consequently, it cannot obtain satisfactory results for AES tasks, which require the use of the deep semantic information in the essay. Secondly, this technology requires a lot of time and cost in terms of hand-crafted feature extraction. In addition, the evaluation criteria of different AES tasks

To reduce the influence of the above problems and to score essays more comprehensively, we propose an AES method based on Multi-Scale Semantic Features (MSSF). In particular, we extract features at multiple scales with different modules: (1) We use the LSTM-MoT model to extract document-scale global semantic features of essays. (2) After SBERT extracts the sentence vectors of the essay, an LSTM extracts the contextual relevance of these local features. We then apply attention pooling to weight each sentence's contribution to the final score and obtain sentence-scale local semantic features. (3) The relevance between an essay and its prompt is an important basis for scoring. We vectorize both with Doc2Vec and calculate their similarity to obtain prompt relevance features. (4) To address the shortcomings of DNN models in extracting shallow features, such as grammatical errors and text richness, we use manually extracted features with adaptive weights to obtain the shallow linguistic features of the essays. MSSF fuses the global semantic features, local semantic features, prompt relevance features, and shallow linguistic features of an essay. We evaluate the proposed model on the Kaggle ASAP competition dataset. The experimental results show that the proposed AES model with multi-scale semantic hybrid features effectively improves automatic essay scoring and achieves the best overall performance among the compared baseline models. Our main contributions are as follows:

• We add 18 typical hand-crafted features with adaptive weights to the distributed representation of essays. These features capture valuable quantitative information that is difficult for DNN-AES to extract, and their weight parameters adapt to different AES tasks.

• We use SBERT as the sentence vectorization method of the essay. Unlike manually extracted sentence-level features, SBERT is pre-trained and can then be fine-tuned for a specific task, which makes the final score more accurate.

• We add a relevance feature between the prompt and the essay. This feature allows the model to learn the correlation between them, rather than scoring from the essay alone.
Furthermore, the features of point 2 and point 3 are easily extended to other AES models, since traditional AES models ultimately have a distributed representation layer for final scoring.
The remainder of this paper is organized as follows: we will introduce different types of existing AES methods and discuss their limitations in Section 2; Section 3 will develop the various components of the model proposed in this paper; Section 4 will present the experimental results and analysis, with the validity of the proposed method being discussed based on the experiments; finally, we will summarize the conclusions and propose further work in Section 5.

Related Work
The AES system is a significant tool used to assist teachers in scoring by computer technology; many research results have been achieved in related fields. According to different methods, the AES model mainly consists of four categories: AES based on shallow linguistic features, AES based on deep neural networks, pre-trained AES, and methods based on hybrid models.

AES Based on Shallow Linguistic Features
Methods of this type extract the main shallow linguistic features (such as spelling mistakes, essay length, and average sentence length) manually according to the scoring criteria and obtain the final score with regression, classification, or ranking algorithms. Project Essay Grade (PEG) [11] is one of the first AES models; it analyzes the text of an essay and scores it based on factors such as grammar, vocabulary, and coherence. Mathias et al. [12] extracted features such as content, organization, and sentence fluency from the essay and gave each feature an independent score; they then fused these scores using a random forest to determine the final score. Sakaguchi et al. [13] used N-gram and syntactic features to obtain one score with a Support Vector Machine (SVM) and used the cosine similarity between sentences to obtain another score; they obtained the final score by fusing the two scores. Cummins et al. [14] designed a method that ranks all essays using quantitative metrics, such as essay length and grammatical relevance, as features and then transforms the ranking into a predicted essay score. Nguyen et al. [15] implemented an end-to-end argument mining system, which collects the argumentative structures of essays and creates argumentative features from these structures, using machine learning to develop the proof-ready argument-mining-enabled AES model. The greatest advantage of such methods is that they possess strong explanatory power, but they cannot capture the deep semantic information of essays.

AES Based on Deep Neural Networks
Many DNN structures have been used in AES and have achieved excellent performance in recent years. Unlike traditional approaches, DNN-AES does not need manual feature design and extraction and can automatically learn deep semantic representations of complex essays. Essays contain intricate information that is difficult to capture with traditional feature engineering, and DNN-AES models have the potential to generalize well to unseen data. In addition, DNN-AES models are highly flexible and can adapt to different types of essay prompts and writing styles. TextCNN is effective at extracting the local features of texts [16] and is often used to extract important semantic information from essay sentences. The LSTM, which is often used in AES systems [17], has advantages in processing long-sequence data: it effectively extracts the interdependence between words and handles long-term dependencies in sentences. Dong et al. [18] proposed a hierarchical convolutional neural network (CNN): the first layer extracts sentence-scale features, the next layer extracts article-scale content information, and a fully connected layer generates the final score of the essay. Taghipour et al. [19] and Nguyen et al. [20] used a recurrent structure in the AES system to model the complex relationship between the text and its score and derive the final score. Dong et al. [21] proposed a CNN-LSTM with a two-layer network structure, in which the CNN obtains local features and the LSTM mainly learns global context sequence features. The weight of each sentence vector's contribution to the final score is then obtained dynamically through an attention pooling mechanism, and the final score is obtained after weighted summation. Ridley et al. [22] proposed an algorithm for evaluating both the total score and the dimension scores of an essay across topics; the final essay representation is obtained through a shared layer and private layers and is used as input to predict the final score.

AES Based on Pre-Trained Models
AES based on pre-trained models has become an important part of the essay scoring task. A pre-trained model is a deep network trained on a large-scale unlabeled corpus: its parameters are trained in advance on pre-training tasks and then fine-tuned on the AES task. BERT [23], introduced in 2018, is a bidirectional Transformer encoder that can be fine-tuned and has achieved excellent performance in many NLP tasks. Pedroet et al. [24] proposed a sequence-to-sequence model, which extracts semantics from both directions with BERT and features such as the next sentence with XLNet [25]. Wang et al. [26] used two BERT pre-trained models to extract features at multiple scales. The first BERT model extracted document-scale features; the second extracted chapter-scale features through an LSTM layer after connecting multiple BERT models in parallel. The predicted scores of all scales are fused to determine the final score. Their experiments indicate that using a pre-trained model can effectively improve the performance of AES. Yang et al. [27] used a combination of regression and ranking loss functions to fine-tune the BERT model for the same task. The regression loss aims at an accurate score, while the ranking loss aims at an accurate essay ranking.

AES Based on Hybrid Model
AES based on hybrid models has been shown to enhance the capability of scoring. Combining DNN features and hand-crafted features has been proven by many studies to obtain better scores in essays. Farag et al. [28] considered coherence features between sentences and combined them with a DNN model to further enhance the performance of scoring for AES. Cozma et al. [8] fused character-scale n-gram features and word representation features to extract semantic features and achieved better performance on the ASAP dataset. Liu et al. [29] designed a Two-Stage Learning Framework (TSLF) to extract semantic features, fluency features, and relevance features through the neural network model, fusing artificial features for scoring. Uto et al. [30] proposed a fusion method that utilizes item response theory to consider differences in scoring behavioral characteristics and integrate prediction scores from various AES models.
Existing methods often use a single network structure, which can no longer meet the accuracy requirements of AES. Therefore, we use document-scale and sentence-scale information to characterize the deep semantic features of essays and manually extract typical shallow linguistic features to complement them. In addition, we fuse essay-prompt relevance features to make the scoring more comprehensive.

Approach
In this section, we introduce the multi-scale AES method that fully considers the impact of all aspects on the final score. As shown in Figure 1, we utilize an LSTM-based model to extract document-scale global information and a hybrid model to mine essay sentence-scale local features to complement the former and extract deep semantic features of the essay through both. To mine the shallow semantic information, we design and extract the typical features. Finally, we combine these features with the prompt relevance to calculate the essay score.

Document-Scale Global Features
In this part, we introduce the LSTM-MoT network used to extract document-scale features of essays, as shown in Figure 2. It builds on the LSTM-based AES model first proposed by Alikaniotis et al. [31], which predicts the score of an essay by passing its word sequence through a multilayer neural network. A large body of existing work shows that this structure has significant advantages in extracting document-scale features [32][33][34][35][36]. Let $V$ be the vocabulary of the essay. We represent the essay as a word sequence $w_i \in V$, $i = 1, 2, \ldots, n$, where $w_i$ denotes the $i$-th word in the essay and $n$ denotes the essay length, and process the words with the following layers.

Lookup Table Layer. This layer maps the word sequence $w = \{w_1, w_2, \ldots, w_i, \ldots, w_n\}$ to a $D$-dimensional hidden space to generate vectorized representations of the words; semantically similar words have similar vectors. Specifically, the one-hot vector of each input word $w_i$ is multiplied by the $D \times |V|$ embedding matrix to obtain the corresponding word embedding $x_i$.

Recurrent Layer. The word embeddings output by the lookup table layer carry no contextual or order information on their own, so using them directly does not yield an accurate score. To give the model the ability to extract context, a recurrent structure is added to capture the temporal features. LSTM is an important variant of the RNN that incorporates a gate mechanism into the conventional RNN architecture. This gate mechanism is specifically designed to mitigate exploding gradients, vanishing gradients, and the difficulty of learning long-term dependencies. In its standard form, the LSTM is calculated as follows:

$$
\begin{aligned}
f_t &= \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \\
i_t &= \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \\
\tilde{C}_t &= \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \\
o_t &= \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \\
h_t &= o_t \odot \tanh(C_t)
\end{aligned}
$$

where $x_t$ is the input vector at time $t$; $h_t$ is the output vector at each timestamp; $f_t$, $i_t$, and $o_t$ are the forget, input, and output gates; $C_t$ is the cell state; $W_f$, $W_i$, $W_C$, and $W_o$ are the learnable parameter matrices; $b_f$, $b_i$, $b_C$, and $b_o$ are the bias terms; $\odot$ denotes element-wise multiplication; and $\sigma$ represents the sigmoid function.

Pooling Layer. We introduce a pooling layer to convert the output of the recurrent layer to the target dimension, which helps reduce the number of parameters, decrease computational complexity, and improve the model's generalization ability. Mean-over-time (MoT) [19] pooling computes the average of the LSTM outputs over all timestamps, as shown in the following equation:

$$
\mathrm{MoT}(H) = \frac{1}{n} \sum_{t=1}^{n} h_t
$$

where $H = (h_1, h_2, \ldots, h_n)$ is the sequence of LSTM outputs.

Linear Layer. This layer performs a linear transformation on the output of the pooling layer and then applies the sigmoid function to map the output to the range [0, 1]:

$$
\hat{y} = \sigma(W_o \cdot \mathrm{MoT}(H) + b_o)
$$

where $W_o$ and $b_o$ are the learnable weight and bias of this layer, $\sigma$ is the sigmoid function, and $\hat{y}$ is the predicted score. During training, the actual scores are normalized to [0, 1]. For the final prediction, we rescale the calculated scores to match the original scoring range.
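The pooling and linear layers described above can be illustrated with a small sketch. It is a minimal pure-Python illustration rather than the implementation used in this paper: the function names (`mean_over_time`, `score_from_states`, `rescale`), the toy hidden states, and the weights are all hypothetical, and rounding to the nearest integer when rescaling is an assumption.

```python
import math

def mean_over_time(hidden_states):
    """Average a list of equally sized hidden-state vectors over time (MoT)."""
    n = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(h[d] for h in hidden_states) / n for d in range(dim)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score_from_states(hidden_states, weights, bias):
    """Linear layer on the pooled vector followed by a sigmoid, giving a score in (0, 1)."""
    pooled = mean_over_time(hidden_states)
    return sigmoid(sum(w * p for w, p in zip(weights, pooled)) + bias)

def rescale(unit_score, low, high):
    """Map a score in [0, 1] back to an integer in [low, high] (rounding is an assumption)."""
    return round(low + unit_score * (high - low))

# Toy example: three time steps, two hidden units (all values are made up).
states = [[0.2, -0.4], [0.6, 0.0], [0.1, 0.4]]
pooled = mean_over_time(states)
unit_score = score_from_states(states, weights=[1.0, -1.0], bias=0.0)
final_score = rescale(unit_score, low=2, high=12)
```

Because the sigmoid output lies strictly between 0 and 1, the rescaled score always falls inside the original scoring range.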

Sentence-Scale Local Features
Document-scale essay scoring treats an essay as a sequence of words, so the influence of each sentence as a whole on the score is ignored. Previous work has therefore examined sentence-scale features. For example, Uto et al. [37] proposed that combining document-scale and sentence-scale features could achieve better performance: one module is based on the embeddings of the input words, the other is based on sentence-scale features, and the two are combined for training. Reimers et al. [38] proposed the sentence-embedding model SBERT, which is pre-trained on the Semantic Textual Similarity (STS) task; its aim is to quantify the degree of semantic similarity or relatedness between pairs of text snippets. In this paper, we extract sentence vector representations of essays with SBERT and propose a new sentence-scale feature for the AES task, as shown in Figure 3.

Sentence Vectorization Layer. Let the words in an essay sentence be $\{t_1, t_2, \ldots, t_m\}$, where $m$ denotes the number of words in the sentence. We use the pre-trained SBERT model to extract the deep semantic information of the words and obtain the vectorized representation of each sentence by mean pooling. A BERT model on its own cannot be used directly to compute sentence vectors without a pooling layer. Mean pooling applies the same weight to each word in the sentence, which reduces over-reliance on specific words while preserving the overall semantic information.

Recurrent Layer. After essays are converted into sentence vectors, the vectors carry no sequential relationship. To recover the original order of the sentences in the essay, we use an LSTM to mine the temporal features. Specifically, the sequence of sentence vectors $\{s_1, s_2, \ldots, s_N\}$ output by the sentence vectorization layer is fed to the LSTM to obtain the sequence of output vectors $\{h_1, h_2, \ldots, h_N\}$, which captures the dependencies between sentences, where $N$ is the number of sentences in the essay.

Pooling Layer. This layer uses attention pooling to convert the LSTM outputs at all time steps into a fixed-length vector. Dong et al. [21] compared MoT pooling and attention pooling in a global essay scoring task, where MoT pooling treats each word in an essay equally and can better extract the words' semantic information. For sentence-level semantic extraction, attention pooling is employed to emphasize the impact of important sentences on essay scoring. Additionally, sentence sequences are much shorter than the extra-long word sequences used for global features, over which attention becomes diluted. Taghipour et al. [19] applied attention over words with a single-layer LSTM but did not exceed the MoT baseline model. Attention pooling obtains the weight with which each LSTM output contributes to the final score, as shown in the following equations:

$$
\begin{aligned}
m_i &= \tanh(W_m h_i + b_m) \\
u_i &= \frac{\exp(W_u m_i)}{\sum_{j=1}^{N} \exp(W_u m_j)} \\
x &= \sum_{i=1}^{N} u_i h_i
\end{aligned}
$$

where $W_m$ and $W_u$ are the weight parameters, $b_m$ is the bias vector, $h_i$ is the LSTM hidden state at time step $i$, $m_i$ and $u_i$ are the attention vector and attention weight at time step $i$, and $x$ is the final text representation.

Linear Layer. This layer applies a linear transformation to the output of the pooling layer, normalizes the result with the sigmoid function, and thereby calculates the final score. During prediction, the predicted scores are rescaled to match the original scoring range.
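As an illustration of attention pooling over sentence-level hidden states, the following pure-Python sketch computes the attention vectors, the softmax weights, and the weighted sum. It is only a sketch under stated assumptions: the function name `attention_pool`, the representation of `W_u` as a single weight vector, and the toy inputs are hypothetical and are not taken from the paper's implementation.

```python
import math

def attention_pool(hidden_states, W_m, b_m, W_u):
    """Return (weights, pooled) for a list of hidden-state vectors.

    W_m is a list of rows, b_m a bias vector, and W_u a single weight vector
    (an assumption made for this sketch).
    """
    logits = []
    for h in hidden_states:
        # Attention vector: m_i = tanh(W_m h_i + b_m)
        m = [math.tanh(sum(w * x for w, x in zip(row, h)) + b)
             for row, b in zip(W_m, b_m)]
        # Unnormalized attention logit: W_u . m_i
        logits.append(sum(w * v for w, v in zip(W_u, m)))
    # Softmax over time steps, shifted by the maximum for numerical stability.
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Final representation: weighted sum of the hidden states.
    dim = len(hidden_states[0])
    pooled = [sum(u * h[d] for u, h in zip(weights, hidden_states))
              for d in range(dim)]
    return weights, pooled

# Toy example with two sentences and two hidden units (values are made up).
identity = [[1.0, 0.0], [0.0, 1.0]]
weights, pooled = attention_pool([[0.0, 1.0], [2.0, 1.0]],
                                 W_m=identity, b_m=[0.0, 0.0], W_u=[1.0, 0.0])
```

Unlike MoT pooling, the weights here depend on the hidden states themselves, so a sentence with a larger attention logit contributes more to the pooled vector.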

Prompt Relevance Features
A prevalent pattern in essay writing is the prompted essay, where writers receive a prompt asking them to write about a particular topic. Evaluators consider the conformity of the essay to the prompt a crucial criterion during grading, so measuring the semantic relevance between the prompt and the essay is essential, and the quality of this measurement directly affects the performance of AES. Figure 4 shows the prompt relevance model proposed in this paper. Traditional approaches commonly use word co-occurrence and LDA topic models [39], which usually extract only statistical features of essays and cannot capture the degree of relevance between the essay and the prompt at the semantic scale.
Since essays are long texts, transformer-based models struggle to provide good vectorized representations of them. Doc2Vec [40] can represent the whole essay using a single low-dimensional dense vector; its structure overcomes the shortcomings of traditional topic models and transformer-based models and is more suitable for essay-prompt similarity analysis.
We use Doc2Vec to obtain the vectorized representation $E = \{e_1, e_2, \ldots, e_N\}$ of the essay and $P = \{p_1, p_2, \ldots, p_N\}$ of its prompt, and then apply the cosine similarity function to calculate the semantic similarity $F_T$ between them, which can be formulated as follows:

$$
F_T = \cos(E, P) = \frac{\sum_{i=1}^{N} e_i p_i}{\sqrt{\sum_{i=1}^{N} e_i^2} \, \sqrt{\sum_{i=1}^{N} p_i^2}}
$$
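The similarity computation can be sketched as follows. This is a minimal pure-Python illustration that assumes the essay and prompt vectors have already been produced by some document embedding model; the function name `cosine_similarity` and the choice to return 0.0 for a zero vector are assumptions made here.

```python
import math

def cosine_similarity(essay_vec, prompt_vec):
    """Cosine similarity between two equally sized document vectors."""
    dot = sum(e * p for e, p in zip(essay_vec, prompt_vec))
    norm_e = math.sqrt(sum(e * e for e in essay_vec))
    norm_p = math.sqrt(sum(p * p for p in prompt_vec))
    if norm_e == 0.0 or norm_p == 0.0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_e * norm_p)

# Toy vectors (made up): the second pair points in the same direction.
unrelated = cosine_similarity([1.0, 0.0], [0.0, 1.0])
aligned = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

The value depends only on the angle between the two vectors, so essays of different lengths can be compared with the same prompt vector.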

Shallow Linguistic Features
Scoring essays is a complex act and a single-structured model has constraints on scoring performance. Compared to traditional technologies, DNN-AES models have shown superior ability in extracting essay features from the semantic layer of neural networks. However, the detection of spelling and grammatical errors, which are critical in scoring, remains challenging. To address this limitation, the combination of hand-crafted features and deep learning has been proposed to enhance the performance of AES. This hybrid approach takes advantage of both feature-engineering-based and neural-network-based approaches, which are complementary rather than competitive.
The selection of appropriate hand-crafted features has a significant impact on performance. On the one hand, too many hand-crafted features are time-consuming and labor-intensive to extract and cause computational redundancy, resulting in sub-optimal performance. On the other hand, too few hand-crafted features cannot usefully complement the features extracted by DNN-AES, which also leads to sub-optimal performance. Here, we mainly select linguistic richness features that are difficult to mine with DNN models and that reflect essay quality from various aspects; Table 1 shows the specific hand-crafted features.

Table 1 (excerpt). Hand-crafted features.
No.  Feature
     The average length of clauses
14   The average sentence length
15   The variance of sentence length
16   The average depth of the syntax tree of each sentence
17   The average depth of each leaf node of the syntax tree
Prompt-relevant features
18   Number of words in the essay that appear in the prompt

We design 18 typical hand-crafted features as shallow semantics to reflect the shallow information of the essay. The hand-crafted features are expressed as $\{F_1, \ldots, F_{18}\}$. Different hand-crafted features contribute differently to the final score, so we add a weight parameter to each feature to obtain the weighted hand-crafted features $F_w = \{w_1 F_1, \ldots, w_{18} F_{18}\}$. We obtain the final shallow linguistic features by applying a linear transformation to $F_w$.
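To make the hand-crafted features concrete, the sketch below computes three of them (average sentence length, variance of sentence length, and the number of essay words that appear in the prompt) and applies per-feature weights. It is a simplified illustration, not the extraction pipeline used in this paper: the regular-expression tokenizer, the use of population variance, counting every matching token rather than distinct words, and the example weights are all assumptions.

```python
import re

def sentence_lengths(text):
    """Split on sentence-ending punctuation and count words per sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(re.findall(r"[A-Za-z']+", s)) for s in sentences]

def shallow_features(essay, prompt):
    """Return [average sentence length, sentence-length variance, prompt overlap]."""
    lengths = sentence_lengths(essay)
    if not lengths:
        return [0.0, 0.0, 0.0]
    mean_len = sum(lengths) / len(lengths)
    # Population variance (an assumption; the paper does not specify which variance).
    var_len = sum((n - mean_len) ** 2 for n in lengths) / len(lengths)
    prompt_vocab = set(re.findall(r"[a-z']+", prompt.lower()))
    essay_words = re.findall(r"[a-z']+", essay.lower())
    # Counts every essay token found in the prompt (tokens, not distinct words).
    overlap = sum(1 for w in essay_words if w in prompt_vocab)
    return [mean_len, var_len, float(overlap)]

def apply_weights(features, weights):
    """Scale each feature F_k by its weight w_k."""
    return [w * f for w, f in zip(weights, features)]

essay = "Computers help people learn. They also connect distant friends and families."
prompt = "Write about the effects computers have on people."
features = shallow_features(essay, prompt)
weighted = apply_weights(features, [0.5, 1.0, 2.0])  # hypothetical weights
```

In the proposed method the weights are learned during training; the fixed values above only show how the weighting is applied.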

Essay Scoring
In order to improve the scoring efficiency, we propose a method of blending multi-scale features. The document-scale global features, sentence-scale local features, manually extracted shallow features, and prompt relevance features are fused to obtain the overall essay feature $F$. We take $F$ as the input of the fully connected layer and generate the score through the sigmoid activation function:

$$
\mathrm{Score} = \sigma(W F + b)
$$

where $W$ is the weight matrix, $b$ is the bias, $\sigma$ represents the sigmoid function, and Score is the predicted essay score. We normalize all gold standard scores to [0, 1] and use them to train the network. During testing, we rescale the output to the original score range and use the rescaled scores to evaluate the system.
For training, the Mean Squared Error (MSE) between the predicted and actual scores is employed as the loss function, which can be expressed as follows:

$$
\mathrm{MSE}(y, \hat{y}) = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2
$$

where $y_i$ denotes the actual score of the $i$-th essay, $\hat{y}_i$ denotes the predicted score of the $i$-th essay, and $N$ denotes the number of essays. Many of the modules and features in our proposed method can easily be added to existing DNN-AES models because each of them ultimately outputs a distributed representation of the essay.
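The scoring and training steps can be sketched in a few lines. This is a hedged illustration only: fusing the feature groups by concatenation is an assumption (the fusion operator is not recoverable from the text), and the function names `fuse`, `predict_score`, `mse`, and `normalize` as well as all numeric values are hypothetical.

```python
import math

def fuse(*feature_groups):
    """Concatenate feature groups into one vector (concatenation is an assumption)."""
    fused = []
    for group in feature_groups:
        fused.extend(group)
    return fused

def predict_score(fused, weights, bias):
    """Fully connected layer followed by a sigmoid: sigma(W F + b)."""
    z = sum(w * f for w, f in zip(weights, fused)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def mse(gold, predicted):
    """Mean squared error between gold and predicted scores."""
    return sum((y - p) ** 2 for y, p in zip(gold, predicted)) / len(gold)

def normalize(score, low, high):
    """Map a gold score from [low, high] to [0, 1] for training."""
    return (score - low) / (high - low)

# Toy example with made-up feature groups for one essay.
F = fuse([0.1, 0.4], [0.3], [0.8], [0.2, 0.5])
prediction = predict_score(F, weights=[0.5] * len(F), bias=-0.5)
loss = mse([normalize(9, 2, 12)], [prediction])
```

Because both the targets and the predictions live in [0, 1] during training, the loss is comparable across prompts with different score ranges.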

Dataset
We use a publicly available dataset published in the 2012 Kaggle Automated Student Assessment Prize (ASAP) competition (https://www.kaggle.com/c/asap-aes/data, accessed on 23 May 2023), which is widely used in the field of AES. The essays in the ASAP dataset were written by students in grades 7 through 10 and are divided into 8 groups according to the essay prompts, as listed in Table 2. Each group consists of a prompt document and multiple essays written in response to that prompt, and each essay has an overall score. The time, place, person, organization, and other sensitive information appearing in the essays has been anonymized.

Evaluation Metric
The Quadratic Weighted Kappa (QWK) coefficient is commonly used as an evaluation metric for AES in existing work. This is primarily due to its strong sensitivity to incorrect predictions: the penalty grows as the difference between the actual score and the predicted score increases, which better measures the consistency of the model's scores. This metric is also officially designated by Kaggle as the evaluation standard for the ASAP competition. QWK extends the Kappa coefficient with a quadratic weight matrix, which is defined as follows:

$$
W_{i,j} = \frac{(i - j)^2}{(N - 1)^2}
$$

where $i$ and $j$ represent the scores given by the human rater and by the AES system, respectively, and $N$ is the number of possible ratings. The final QWK coefficient is calculated as follows:

$$
\kappa = 1 - \frac{\sum_{i,j} W_{i,j} O_{i,j}}{\sum_{i,j} W_{i,j} E_{i,j}}
$$

where $O$ is the observation matrix and $O_{i,j}$ represents the number of essays that were rated as the $i$-th category by the human rater and as the $j$-th category by the AES system. The expected count matrix $E$ is obtained by taking the outer product of the reference and hypothesis rating histograms, normalized so that $E$ and $O$ have the same sum. The QWK coefficient is thus calculated from the three matrices $W$, $O$, and $E$.
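A compact way to check the definition is to compute QWK directly from two lists of integer ratings. The sketch below is a plain-Python illustration of the formulas above; the function name and argument names are hypothetical, it assumes `max_rating > min_rating`, and returning 1.0 when the denominator is zero is a convention chosen here.

```python
def quadratic_weighted_kappa(human, system, min_rating, max_rating):
    """QWK between two equally long lists of integer ratings."""
    n = max_rating - min_rating + 1  # number of possible ratings
    # Observation matrix O: O[i][j] counts essays rated i by the human and j by the system.
    observed = [[0] * n for _ in range(n)]
    for h, s in zip(human, system):
        observed[h - min_rating][s - min_rating] += 1
    hist_h = [sum(row) for row in observed]
    hist_s = [sum(observed[i][j] for i in range(n)) for j in range(n)]
    total = len(human)
    num = 0.0
    den = 0.0
    for i in range(n):
        for j in range(n):
            w = (i - j) ** 2 / (n - 1) ** 2
            # Expected count: outer product of the histograms, scaled to sum to `total`.
            expected = hist_h[i] * hist_s[j] / total
            num += w * observed[i][j]
            den += w * expected
    if den == 0.0:
        return 1.0  # both raters gave one identical rating to every essay
    return 1.0 - num / den

# Toy ratings on a 1-4 scale (made up).
perfect = quadratic_weighted_kappa([1, 2, 3, 4], [1, 2, 3, 4], 1, 4)
reversed_kappa = quadratic_weighted_kappa([1, 2, 3, 4], [4, 3, 2, 1], 1, 4)
partial = quadratic_weighted_kappa([1, 2, 3, 4], [1, 2, 4, 4], 1, 4)
```

Perfect agreement yields a value of 1, chance-level agreement yields a value near 0, and systematic disagreement yields negative values.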

Experimental Configuration
To evaluate the proposed model, a 5-fold cross-validation approach is utilized in this study. Specifically, each fold comprises 60% training data, 20% validation data, and 20% testing data.
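One way to realize such a split is to cut the data into five equal chunks and rotate the roles of the chunks across folds. The sketch below shows this scheme; it is an assumption for illustration only, since the text does not state how the folds are constructed, and the function name `five_fold_splits`, the interleaved chunking, and the absence of shuffling are all hypothetical choices.

```python
def five_fold_splits(n_items, n_folds=5):
    """Return a list of (train, dev, test) index lists, one tuple per fold."""
    # Interleaved chunks: item i goes to chunk i % n_folds.
    chunks = [list(range(k, n_items, n_folds)) for k in range(n_folds)]
    splits = []
    for k in range(n_folds):
        test_id = k
        dev_id = (k + 1) % n_folds
        train = [i for c in range(n_folds)
                 if c not in (test_id, dev_id) for i in chunks[c]]
        splits.append((train, chunks[dev_id], chunks[test_id]))
    return splits

splits = five_fold_splits(100)
```

With five chunks, each fold uses three chunks for training, one for validation, and one for testing, which gives the 60/20/20 proportions, and every essay appears in exactly one test set.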
We adopt 50-dimensional GloVe [41] embeddings as the embedding matrix in the document-scale DNN-AES model. The base model of SBERT is the public pre-trained RoBERTa-large (https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/ (accessed on 23 May 2023)). The LSTM hidden vector dimension used in the hybrid method is 300, the batch size is 16, and the maximum number of epochs is set to 60. We use dropout with a probability of 0.5 to avoid overfitting. We use Adam as the optimization algorithm and also apply early stopping to prevent overfitting.

Comparative Experiment
To validate the efficacy of the proposed multi-scale feature-based AES method, we compare it with the baseline methods listed in Table 3.

Table 3. Comparing the performance of the present models with that of the state-of-the-art models on ASAP. The best performance for each prompt is highlighted in bold.

1. EASE (SVR) [42]: Enhanced AI Scoring Engine (EASE) is a public machine-learning-based classification scoring engine (https://github.com/openedx-unsupported/ease (accessed on 23 May 2023)). As EASE relies on hand-engineered features and regression techniques, we adopt the Support Vector Regression (SVR) model as the baseline approach for comparison purposes in this paper.

2. LSTM-MoT [19]: Treating the essay text as a word sequence, this model uses an LSTM network to extract the temporal relationship and averages the outputs of all time steps for the final scoring.

3. CNN(10runs) + LSTM(10runs) [19]: Using an ensemble learning approach, ten CNN and ten LSTM models are used for prediction, and their predictions are averaged to yield the final prediction.

4. CNN-LSTM-ATT [21]: This model uses a CNN and the attention mechanism to extract sentence-scale features of the essay, feeds the result into an LSTM, and performs the final scoring through the attention mechanism.

5. SkipFlow LSTM [43]: To enhance the performance of essay scoring, this model incorporates the SkipFlow mechanism into the LSTM network. This mechanism leverages the semantic relationships between the hidden layers of the LSTM as auxiliary features.

6. TSLF [29]: This model uses an LSTM to obtain semantic features, consistency features, and semantic relevance features, combines them with features such as grammatical errors, and then uses XGBoost to score the essay.

7. Tran-BERT-MS-ML-R [26]: Two BERT models are used to extract token-scale, document-scale, and segment-scale features; multiple losses are used for essay scoring.
The above table shows the proposed multi-scale essay scoring model and the baseline model for comparison in this paper and the experimental results show that: (1) The prediction of the final score by manually extracting features and then utilizing traditional machine learning algorithms is not effective, indicating that relying on shallow linguistic features alone to characterize the whole essay will miss the deep semantic meaning of the essay, which is very fatal in the process of scoring essays. Secondly, parameters such as the kernel function and penalty parameters of SVR have a significant impact on performance; however, it is difficult to determine the optimal value and it is more difficult to expand the future than the DNN-AES method. LSTM-MoT and CNN(10runs) + LSTM(10runs) utilize the document-scale features of the essay to extract deep semantic information, which has significantly improved the effect compared to the manually extracted features; however, the individual document-scale DNN structure does not extract the semantic features of longer texts well. Furthermore, the whole essay is split into words, with the relevance between the words in the same sentence and the relevance between different sentences of the essay being lost. The training of multiple individual networks using integrated learning has slightly improved the effect but the scoring performance is not good. (2) Incorporating the SkipFlow mechanism into the LSTM to model the semantic relevance between hidden layers leads to a slight improvement in performance. However, it can only rely on word-level co-occurrence patterns rather than capturing the overall meaning and coherence of the essay. Concretely, in comparison to the conventional LSTM, the performance improvement is only significant for the P1 and P8 subsets, whereas the performance enhancement is comparatively similar for the remaining subsets. 
(3) The CNN-LSTM-ATT model uses sentence-scale features as the basis for scoring, which is better than document-scale scoring; however, scoring only at the sentence semantic scale limits the model's performance improvement. Furthermore, its scoring accuracy largely depends on the sentence vectorization method. (4) The Tran-BERT-MS-ML-R model achieves a QWK coefficient of 79.1% on the ASAP dataset, indicating the effectiveness of pre-trained models in the AES task. This also indicates that scoring based on features of different scales can effectively improve performance, at the cost of a higher computational load. Moreover, our proposed MSSF method outperforms the Tran-BERT-MS-ML-R model and achieves the highest scores on four subsets because MSSF extracts document-scale global features and sentence-scale local features from deep semantic features, measures the similarity between the essay and the prompt at the topic scale, and extracts linguistic features from shallow information for scoring. Thus, the proposed model exhibits the best overall performance among the compared models without excessive computational load.
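For readers unfamiliar with the metric, the QWK coefficient used in this comparison can be computed from two lists of integer scores. The following is a minimal sketch of the standard Quadratic Weighted Kappa definition; the function name, its signature, and the handling of the degenerate constant-score case are illustrative assumptions and are not taken from the paper's implementation.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_rating, max_rating):
    """QWK between two lists of integer scores in [min_rating, max_rating]."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("score lists must be non-empty and of equal length")
    n = max_rating - min_rating + 1
    total = len(rater_a)

    # observed[i][j]: number of essays scored i by rater A and j by rater B.
    observed = [[0] * n for _ in range(n)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_rating][b - min_rating] += 1

    # Marginal score histograms, used to build the chance-agreement matrix.
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    numerator = 0.0    # weighted observed disagreement
    denominator = 0.0  # weighted disagreement expected by chance
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2 if n > 1 else 0.0
            expected = hist_a[i] * hist_b[j] / total
            numerator += weight * observed[i][j]
            denominator += weight * expected

    if denominator == 0.0:
        # Assumption: both raters gave the same single score everywhere.
        return 1.0
    return 1.0 - numerator / denominator
```

A value of 1 indicates perfect agreement between the predicted and human scores, 0 indicates chance-level agreement, and negative values indicate systematic disagreement.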

Feature Performance Analysis
To verify the effect of features at different scales on the final essay scores, we perform ablation experiments on each scale module of the model. We use SLF to denote the shallow linguistic feature module; DOC to denote the document-scale global semantic module; SEN to denote the sentence-scale local semantic module; DOC-SEN to denote the deep feature module combining global and local semantics; DOC-SEN-SLF to denote the simultaneous use of local semantic, global semantic, and shallow linguistic features; and MSSF to denote the proposed essay scoring method based on all features. Compared with using only the deep and shallow features, the addition of prompt-relevance features improves the scores on all subsets. The improvement is small, however, at 0.5% overall: incorporating essay-prompt relevance features enhances scoring performance, but the current approach yields only a limited gain.
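The ablation configurations named above can be summarized as a mapping from each configuration to the feature modules it enables. The sketch below is illustrative only: the module keys, the toy feature values, and the use of plain concatenation as the fusion step are assumptions, not the paper's implementation.

```python
# Each ablation configuration lists the feature modules it enables.
# Module keys are hypothetical labels chosen for this sketch.
ABLATION_CONFIGS = {
    "SLF": ("shallow",),
    "DOC": ("document",),
    "SEN": ("sentence",),
    "DOC-SEN": ("document", "sentence"),
    "DOC-SEN-SLF": ("document", "sentence", "shallow"),
    "MSSF": ("document", "sentence", "shallow", "prompt"),
}


def fuse_features(features, config):
    """Concatenate the feature vectors enabled by one ablation configuration.

    Concatenation is an assumed fusion strategy used here for illustration.
    """
    fused = []
    for module in ABLATION_CONFIGS[config]:
        fused.extend(features[module])
    return fused


# Toy feature vectors, invented for illustration.
toy_features = {
    "document": [0.1, 0.2],
    "sentence": [0.3],
    "shallow": [12.0, 3.0],
    "prompt": [0.8],
}
doc_sen = fuse_features(toy_features, "DOC-SEN")
full = fuse_features(toy_features, "MSSF")
```

Under these assumptions, MSSF differs from DOC-SEN-SLF only by the appended prompt-relevance feature, which is the comparison behind the 0.5% gain discussed above.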
To confirm the validity of the hand-crafted shallow linguistic features, we analyzed their feature weights experimentally. Figure 5 shows a heat map of the weights of the shallow linguistic features on each data set; a higher weight implies a greater impact on score prediction. Further analysis shows that the weights of the shallow features vary widely between different groups. We believe there are four main causes: (a) Scorers do not refer only to shallow manual features, such as the number of words and sentences; they pay more attention to the deep semantics of the essay, which leads to inaccurate weights on shallow features. (b) Different scorers focus on different shallow manual features. For example, some scorers care a lot about the number of sentences and others care a lot about the number of nouns. (c) Even if scorers pay attention to the same shallow features, they do not quantitatively analyze the specific value of each shallow feature during scoring, which leads to bias in the final weights. (d) Different essay types and different prompts lead to a different emphasis on shallow feature weights for each essay. Therefore, it is difficult to guarantee the consistency of weights across essays.
Table 4. Ablation Experiment Results. The bold number is the best performance for each prompt.
To verify the influence of the sentence vectorization method on the final score during the extraction of sentence-scale local features, we compared the sentence vectorization method used in this paper with several alternatives. The experimental results are shown in Figure 6. Avg. GloVe Embedding, which obtains a sentence vector by averaging the word vectors of the sentence, did not yield excellent results.
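One plausible reason the averaged-embedding baseline falls short is that averaging discards word order. The following minimal sketch illustrates this with invented 3-dimensional vectors (they are not real GloVe values, and the helper name is hypothetical): two sentences with opposite meanings but the same words receive the same sentence vector.

```python
def average_embedding(tokens, word_vectors):
    """Mean of the vectors of the known tokens (unknown tokens are skipped)."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        raise ValueError("no token has a word vector")
    dim = len(known[0])
    return [sum(vec[k] for vec in known) / len(known) for k in range(dim)]


# Toy word vectors, invented for illustration only.
toy_vectors = {
    "dog": [1.0, 0.0, 2.0],
    "bites": [0.0, 1.0, 1.0],
    "man": [2.0, 2.0, 0.0],
}

vec_a = average_embedding(["dog", "bites", "man"], toy_vectors)
vec_b = average_embedding(["man", "bites", "dog"], toy_vectors)
same_vector = vec_a == vec_b  # averaging ignores the order of the words
```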
Because SBERT is pre-trained on the STS task and supports fine-tuning, the SBERT-based model performs substantially better than InferSent and the Universal Sentence Encoder (USE). We identified the following four reasons why SBERT makes MSSF work best:

• It generates contextually aware sentence embeddings. This allows SBERT to capture the contextual meaning of sentences and include word order and dependencies, rather than just averaging out the word vectors in the sentence, resulting in more accurate embeddings.
• It benefits from the large-scale pre-training of BERT models. It is typically pre-trained on vast amounts of text data, allowing it to learn a broad range of language patterns. This extensive pre-training contributes to the model's ability to generalize well across different tasks and domains.
• It allows fine-tuning on specific downstream tasks. This fine-tuning process adapts the pre-trained model to the specific task, resulting in improved performance. On the other hand, InferSent and USE are not designed for easy fine-tuning and lack this adaptability.
• It can be used to calculate the semantic similarity between sentences. A more accurate measure of similarity can be obtained by embedding sentences in a high-dimensional space and computing the cosine similarity between them. It can better understand the semantics of sentences and improve the capability of sentence vectorization.
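The cosine similarity mentioned in the last point can be sketched as follows. The vectors here are toy stand-ins for sentence embeddings (real SBERT embeddings have hundreds of dimensions), and the function name is an illustrative assumption rather than part of any library.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy embeddings, invented for illustration.
parallel = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # same direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])          # unrelated
```

Because the measure depends only on direction, two embeddings that differ in magnitude but point the same way are treated as semantically equivalent.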

Time Complexity
Our proposed method uses features at several different scales. To verify the efficiency of the method, we performed a time complexity analysis to study its computational load.
In the document-scale feature, since the GloVe dimension is smaller than the hidden layer dimension of the LSTM, the main time complexity of the model is produced by the LSTM, which is $O(KED_1^2)$, where $K$ is the number of LSTM layers, $E$ is the sequence length (the length of the essay), and $D_1$ is the number of hidden layer units in the LSTM.
In the sentence-scale feature, we utilize the pre-trained SBERT model; additionally, the number of hidden layer units is greater than the length of the input sequence. Its main time complexity is produced by SBERT, which is $O(PLSD_2^2)$, where $P$ is the number of sentences in the essay, $L$ is the number of layers in SBERT, $S$ is the length of the sentence, and $D_2$ is the hidden dimension of the BERT model.
In the prompt-related features, Doc2vec has been trained in advance and its time complexity is similar to that of a lookup table layer. The time complexity of the cosine operation is $O(D_3)$, where $D_3$ is the dimension of the essay and prompt vectors.
In the shallow linguistic features, the main time complexity is $O(D_4D_5)$, where $D_4$ is the input dimension (the number of manual features) and $D_5$ is the dimension of the output.
On the whole, since $D_2$ is much larger than $D_1$, the sentence-scale feature has the largest time complexity, followed by the document-scale feature. The time complexity of the shallow linguistic features and prompt-related features is very small. In the sentence-scale features, BERT has a large number of parameters, with many layers and large hidden layers. In addition, the LSTM cannot be computed in parallel and requires loop calculations over $E$ sentences. Therefore, the extraction of sentence-level features has a large time complexity. On the other hand, compared with other pre-trained models, our sentence-level feature extraction structure is already lighter.
The time complexity analysis shows that the shallow linguistic features and prompt-related features have very small time complexity and are easily extended to the distributed representations of other DNN-AES models; therefore, they can improve performance at little cost, which shows the effectiveness of these two kinds of features.
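To make the relative magnitudes concrete, the four dominant terms can be evaluated for one set of sizes. Every numeric value below is a hypothetical assumption chosen for illustration (the paper does not report these settings here), and the results are asymptotic terms, not measured operation counts.

```python
def dominant_costs(K, E, D1, P, L, S, D2, D3, D4, D5):
    """Evaluate the dominant complexity term of each feature module."""
    return {
        "document (LSTM)": K * E * D1 ** 2,      # O(K E D1^2)
        "sentence (SBERT)": P * L * S * D2 ** 2,  # O(P L S D2^2)
        "prompt (cosine)": D3,                    # O(D3)
        "shallow (dense)": D4 * D5,               # O(D4 D5)
    }


# Hypothetical sizes: 1 LSTM layer, 500-token essay, 100 hidden units;
# 20 sentences of 25 tokens, 12 SBERT layers, hidden size 768;
# 100-dimensional Doc2vec vectors; 20 manual features mapped to 32 outputs.
costs = dominant_costs(K=1, E=500, D1=100, P=20, L=12, S=25, D2=768,
                       D3=100, D4=20, D5=32)
ranking = sorted(costs, key=costs.get, reverse=True)
```

Under these assumed sizes, the sentence-scale term exceeds the document-scale term by roughly three orders of magnitude, while the prompt-related and shallow terms are negligible, which matches the ordering described above.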

Threats to Validity
Our proposed method based on multi-scale features shows excellent results on the ASAP dataset. However, there are several potential limitations or threats to the validity of the research:
• It does not have high content validity, that is, the degree to which it covers the content areas it intends to measure. Specifically, AES systems are typically trained on a specific dataset of essays, which may not fully represent the entire scope of writing styles. Consequently, AES may struggle to assess essays that fall outside the scope of the training data, leading to potential biases and incomplete evaluations.
• It can be susceptible to language and cultural biases. Differences in language usage, dialects, or cultural references can impact the accuracy of AES, particularly for non-native English speakers or individuals from diverse linguistic backgrounds. In addition, scoring varies greatly between languages, and the model cannot be transferred well from one language to another.
• It often relies on surface-level information, such as words, sentences, or grammar. Although this information is important, it may not capture the deeper aspects of writing quality, such as coherence, organization, or argumentation.

Conclusions
In this paper, we propose an AES method based on multi-scale semantic features, which extracts essay features from multiple perspectives to address the complexity of essay scoring. We take the document scale as input to extract the global features of essays with the LSTM-MoT structure; we then use a hybrid neural network structure to extract sentence-level local semantic features after vectorizing the sentences of the essay with SBERT, so as to obtain deep semantic features of essays. To compensate for the inability of DNN-AES to extract shallow linguistic features, we manually construct shallow features with weights. We also add prompt-relevance features to enhance the scoring effect of the model. These two kinds of features are easily extended to the distributed representations of other AES models without significant time complexity. The final score is then produced by semantic feature fusion. We conducted experiments on the Kaggle ASAP dataset. The results demonstrate that our proposed multi-scale AES method is highly effective in extracting diverse semantic features and outperforms the baseline models. We also conducted ablation experiments and a time complexity analysis to verify the effectiveness of our method.
One limitation of the current approach is that the hand-crafted features used in this paper are designed in advance and computed offline. Furthermore, the selection and construction of shallow features remains a challenge for future work. In addition, because languages differ, scoring essays written in different languages is an important direction for future research. Finally, because DNN models are difficult to interpret, we cannot readily identify the potential errors in the features at each scale. In the future, we will analyze these potential errors and adjust the structure of the model accordingly.