Comparative Study on Feature-Based Scoring Using Vector Space Modelling System

This paper shows the importance of automated scoring (AS) and argues that it is better than human graders in terms of reproducibility. Considering the potential of automated scoring systems, there is a need to refine and develop the existing systems further. The paper reviews the state of the art and presents results concerning the problems of existing systems. It also presents the semantic features that are indispensable in a scoring system because they carry the content of an answer. Moreover, the proposed system shows a large deviation from existing approaches in the performance analysis presented later in the study, which indicates its novelty and improved results. The paper explains the algorithms included in the methodology of the proposed system. The novelty of our work lies in the use of its own similarity function and its own scoring mechanism; it does not use the cosine similarity function between two vectors. This paper describes and develops a more accurate system which employs a statistical method for scoring and integrates rule-based semantic feature analysis.


Introduction
In the education sector, nearly every institute conducts examinations to evaluate the abilities of students. In these examinations, student responses to given questions are evaluated. These questions may call for either subjective or objective answers. Evaluation of objective answers is a comparatively trivial task and is not covered in this research.
Unlike multiple-choice questions (MCQ), constructed-response (CR) questions require students to write their own answers. They can express their own ideas and support them suitably in their response. Subjective questions test the adaptive ability of a student, but their assessment encounters several issues such as synonymy, polysemy, and trickiness. Subjective answer evaluation is further divided into long answer essays and short answer essays. Long answer essays are also known as free-text answers, in which both content and writing style are evaluated. Scoring is generally done by extracting grammatical and semantic relations from the student response and the reference response [1]. The vector space model can be used to correlate words as well as textual contexts from the student response with reference responses [2]. The remainder of this paper is organized as follows. In Section 2, we describe the state of the art of different scoring methods, both manual and automatic. Section 3 describes the existing system. It contains two important parts: the first covers predefined features (feature construction, ranking, and selection), and the second covers related systems, including the vector space model approach and other related concepts. Section 4 gives the proposed methodology. Section 5 describes the objectives achieved: the development of resources, the identification of predefined features, the development of the statistical model, and finally the scoring mechanism. Section 6 is reserved for the results and the discussion. It contains the performance analysis and the cosine similarity comparison. Finally, in Section 7, we conclude and give recommendations for future work.

State of the Art
In this section, we will present the state of the art of different scoring methods, both manual and automatic.

Manual Scoring.
In the traditional examination system, students submit their answer sheets, which are then evaluated by a human rater. Since this mechanism has been in use for a long time, its limitations cannot be overlooked. The answer sheets are provided to an examiner for scoring [3]. This process is time consuming and depends greatly upon the examiner's availability [4]. Errors are likely to occur because different evaluators are employed for checking the answer sheets. Every human rater has their own perception when examining an answer, as there are no standardized criteria for marking. The results are then compiled [5].
With advancement in technology, more advanced concepts of scoring an answer sheet were introduced in the examination system. The use of computerized tools overcomes the limitations of the manual process. In this system, the students are supposed to submit the answers written in the answer book [6]. Automated examination terminals transfer the student's response to a centralized database by electronic means, thereby restricting the physical movement of answer booklets. Intelligent software tools are advantageous in many ways, as they are not only fast but also avoid human errors of omission and totalling mistakes [7]. Using the same inference mechanism for checking all the answers ensures uniformity of the marking scheme and speedy declaration of results [8].
Although the above process is partially automated, accuracy can be enhanced further if the student answer is typed directly into the system and then scored automatically, with the score based on content similarity. If the results produced by the automated scoring system correlate to a great extent with the scores generated by human graders, the system will be more consistent than manual scoring [9]. One relevant solution to overcome reliability and validity conflicts is to define external criteria against which human- and machine-generated scores can be validated. Another alternative is to define a true score against which these scores can be validated. The superiority of the AES system lies in the fact that it generates the same score for the same essay every time [10]. This ensures test-retest reliability and fairness in evaluation.

Automated Essay Scoring (AES) for Assisting Expert Human Raters.
Expert human raters have to deal with issues like complexity in answers and subjectivity in evaluation. These limitations can be overcome using an automated scoring system, which also makes scoring more effective and convenient. Specifically, automated tools can assist expert human raters in achieving the following objectives:
(1) To evaluate their own scoring criteria.
(2) To assess deviation from consistencies and undesirable tendencies while scoring.
(3) To study the steps of drawing summary conclusions from response features.
(4) To identify the immediate and evolutionary changes in automated scoring.
(5) To determine the causes of scoring differences between humans and automated grades.
(6) To locate and correct the automated grades for answers that require manual intervention.
Although there exist numerous AES systems, the focus of most studies is on the agreement between automated scores and human-assigned scores on a single essay. Furthermore, such agreement does not tell much about what is measured by automated scores. It therefore provides insufficient evidence for validating AES and contributes little to building a validation argument for AES. Table 1 shows the strength of AES over manual scoring.

Automated Answer Scoring Methods.
There are rule-based and statistics-based automated short answer scoring methods, which are shown graphically in Figure 1 and explained in the subsequent sections.

Rule-Based Approach.
Every student answer contains some inherent lexical rules or concepts. Such rules can be matched lexically, and certain features can be extracted by rule-based methods, although they cannot be proved statistically. The surface form of the text is therefore used, and the student answer is matched lexically with the reference answers. This approach helps to obtain a more accurate score.

Statistical Approach.
This approach identifies the probabilities of assigning score values for the given reference answers. The probabilities are calculated to extract features used to score the answer. Compared with the fully rule-based model, the probabilistic and mathematical model produces a more accurate score. Existing automatic essay grading systems rely on two aspects, namely, machine learning techniques and grammatical measures of quality. However, neither of them identifies meanings (propositions) in the text. Therefore, they prove to be inappropriate for scoring the content of an answer.

Automatic Scoring Challenges.
Automated scoring has been developed and adopted mainly for the English language. There may be a few instances where it is used for other foreign languages, but not for Indian languages. This work integrates the development and demonstration of one such system for the Hindi language in Devanagari script. Some of the challenges related to the Hindi language are the use of compound or complex sentences and the frequent use of polysemous words [11]. Extending the system to other languages would therefore broaden the scope of the proposed system. The main advantages of automated scoring over manual scoring include efficiency and the application of the same evaluation criteria with greater consistency.

Existing System
Various AES systems exist to evaluate a number of varying features, each covering different aspects. Currently, there are four major automated essay scoring systems which are widely used by universities, schools, and testing companies: Project Essay Grader (PEG), Intelligent Essay Assessor (IEA), E-rater, and IntelliMetric [12].
There are many advantages of automated assessment over manual assessment. These advantages include efficiency and the application of the same evaluation criteria with more consistency. Moreover, its ability to provide immediate feedback is its primary strength. Automated scoring achieves greater objectivity than manual scoring [13], as computers are not affected by external and emotional factors.
The majority of automated scoring systems generate nearly real-time performance feedback on various aspects of writing. For example, the ETS e-rater model provides feedback on grammar, use of words, mechanics, style, and organization of a typed text. Similarly, Pearson's IEA provides feedback on different aspects of writing, including ideas, organization, conventions, fluency, and choice of words. This is an advantage of AES over human rating, which cannot provide such analytical feedback for huge quantities of essays. Also, human raters usually need to be trained on several grade ranges linked with a specific rubric and certain tasks, and they require adequate training before shifting to a new grade. Such training is not required for AES systems, which are able to evaluate essays at different grade levels (for example, e-rater, IEA, and IntelliMetric). A comparison of AES systems is shown in Table 2.

Predefined Features.
In this research, scores are based on the extraction of syntactic and semantic features; that is, the research incorporates feature-based grading. Grading also focuses on the similarity among the given answers, established by extracting semantic, syntactic, and lexical features.
Mathematical Problems in Engineering
One of the key techniques for handling and organizing text data is text categorization. It is important because more and more documents are now available in digital form, and at the same time, online information is growing rapidly. Statistical classification methods and machine learning techniques are also used in text categorization. Since the domain and the reference answer are fixed in the proposed system, text categorization can be implemented efficiently and on a continuous basis. Text categorization involves feature extraction, which is the most important part of any machine learning task. In this research, to build an effective essay scoring algorithm, the aim is to develop model attributes such as language fluency, grammatical and syntactic correctness, vocabulary and types of words used, essay length, and domain information. The existing systems use the following steps for feature extraction.

Feature Construction.
Features are measurable attributes in a text, and they are used as input to the machine learning (automated) software. Feature construction is a process in which possible features are defined.

Feature Ranking.
This procedure determines how important each feature is for categorization. The extremely randomized trees (ERT) algorithm provided by Scikit-learn [14] is used to generate the feature ranking, that is, to place the features in order of their importance.

Feature Selection.
In this process, the ranked features are used as the input for the feature selection algorithm. It is a grid-search process in which the respective classifiers are also taken into consideration. At each step, the feature with the lowest ranking is eliminated, and the cross-validation error is computed after performing a classification. Ultimately, the process arrives at a minimal set of features.
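The elimination loop described above can be sketched as follows. This is only an illustration under stated assumptions, not the procedure used by the existing systems: the function name `backward_elimination`, the feature names, and the stand-in error function `toy_cv_error` are hypothetical, and the grid search over classifiers is omitted. In practice the error callable would train a classifier and return its cross-validation error.

```python
from typing import Callable, List, Sequence, Tuple


def backward_elimination(
    ranked_features: Sequence[str],
    cv_error: Callable[[Sequence[str]], float],
) -> Tuple[List[str], float]:
    """Drop the lowest-ranked feature step by step and keep the best subset.

    ranked_features must be ordered from most to least important.
    cv_error returns the cross-validation error for a feature subset.
    """
    current = list(ranked_features)
    best_subset, best_error = list(current), cv_error(current)
    while len(current) > 1:
        current = current[:-1]  # eliminate the lowest-ranked feature
        error = cv_error(current)
        if error <= best_error:  # on ties, prefer the smaller subset
            best_subset, best_error = list(current), error
    return best_subset, best_error


def toy_cv_error(features: Sequence[str]) -> float:
    """Hypothetical stand-in for a real cross-validation run."""
    useful = {"content_overlap", "word_count"}
    missing = len(useful - set(features))  # useful features left out
    noise = len(set(features) - useful)    # uninformative features kept
    return 0.10 + 0.20 * missing + 0.02 * noise


ranking = ["content_overlap", "word_count", "spelling", "avg_word_len"]
subset, error = backward_elimination(ranking, toy_cv_error)
```

With this toy error function the loop keeps the two informative features and discards the two uninformative ones.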
Chen and He [15] defined four different types of predefined features that indicate essay quality: lexical features, syntactical features, grammar and fluency features, and content and prompt-specific features. These features have been appropriately refined and modified to achieve the objective of the study. The four classes of features used in this system are as follows:
(i) Syntactical features
(ii) Lexical features
(iii) Content and prompt-specific features
(iv) Grammar and fluency features

Related System.
The overall objective is to assess the shortcomings of earlier techniques. First, traditional automated systems are discussed. Thereafter, other approaches that are specifically related to the proposed research are discussed. These approaches have been applied to short question answering.
Leacock and Chodorow [16] described an automated scoring engine called C-rater, which was developed to grade answers to content-based short answer questions. C-rater uses morphological analysis, synonyms, and predicate argument structure to assign full or partial credit to a short answer, so it cannot be regarded merely as a string matching program. C-rater agrees with human raters about 84% of the time. Song et al. [17] explained user-interactive question answering by applying short-text similarity assessment. The various applications of interactive question answering lie in IR and text mining, such as text summarization, text categorization, content-based image retrieval, and machine translation. It should be noted that short-text question-answer pairs are used. Kaur and Jyoti [18] addressed the automated assessment of short one-line free-text answers in the field of computer science. In their research, they defined a set of evaluation criteria covering all the relevant areas of a short-text evaluation system. Gomaa and Fahmy [19] compared a number of corpus-based and string-based similarity measures in order to explore text similarity approaches for automated short answer scoring in the Arabic language. The comparison between similarity measures also supports immediate feedback to the student. The resulting correlation and error rate findings showed that the system is useful for application in a real scoring environment. Rababah and Al-Taani [20] proposed an automated scoring technique for short-answer Arabic essay questions. For scoring, they used the cosine similarity measure, based on the similarity between the student's answer and the standard one. The experimental results showed that competitive scores were achieved compared with other such approaches.

Vector Space Model (VSM) Approach
Tsatsaronis and Panagiotopoulou [21] discussed a generalized vector space model for text retrieval based on semantic relatedness. The most difficult task is modifying the standard interpretation of the VSM so that semantic information is incorporated in a theoretically sound and rigorous manner.
Ekbal et al. [22] elaborated plagiarism detection in text using the vector space model. In order to detect external plagiarism, they proposed a technique based on textual similarity. It further identifies the set of source documents from which the suspicious document was copied. Another study [23] examined vector space model information retrieval for analysis. It is one of the best traditional retrieval models for evaluating the relevance of a web page. Various vector space model approaches to computing the similarity score of search engine hits were found to be important. Jahan and Ragel [24] discussed plagiarism detection on electronic text-based assignments using the vector space model. Their analysis found that, even though the trigram approach takes considerable time, it is more suitable for detecting plagiarism using the cosine similarity measure in all text documents. The vector space model was used for retrieving information through query processing. The cosine similarity measure, which showed higher results, was preferred over the Jaccard similarity measure. Their future work is to reduce the time needed to deal with a large number of long assignments and to detect plagiarism optimally. Alzahrani et al. [25] developed and compared a number of NLP techniques that accomplish the task of automated scoring. They presented the multivector model, which is closer to human judgement and gives more accurate and reliable results. They also plan to apply their methodology to different languages. Lilleberg et al. [26] demonstrated text classification with semantic features using support vector machines and word2vec. They assumed that word2vec brings extra semantic features that help in text classification. Based on this, the effectiveness of word2vec was demonstrated by showing that a combination of TF-IDF and word2vec can outperform TF-IDF alone. Their approach was incomplete, as it only scratches the surface, and better results can still be expected.
Recommendations for future work concern ways to improve consistency, which can be achieved in many ways, such as modifying the stopword list or changing the weights.

Other Related Concepts
Keller [27] conducted a comparative study of the generalizability of scores produced by automated scoring systems and expert graders. The paper describes the performance of AES systems based on various reports collected from expert raters and computer-produced scores. Performance was examined for a computer-delivered assessment of physicians' patient management skills. The final results show a relatively positive outcome regarding the performance of the regression-based scoring algorithm. Hajeer [28] conducted a study on the effectiveness of various statistical similarity measures. The use of different statistical measures in information retrieval (IR) is very effective for document retrieval using a unified set of documents. Two issues were addressed: firstly, to study the effectiveness of the different statistical measures on a unified set of documents and, secondly, to find the most appropriate one for classifying documents by comparing them in an orderly manner. The analysis concluded that the cosine similarity measure is the best for document retrieval. In future work, the author hopes to extend this project to test other measures.
Weigle [29] presented numerous considerations that are critical for English language learners and the automated scoring of essays. The study set out various considerations for using automated scoring systems to evaluate second language writing, and it also listed challenges and opportunities. The article analyses the extent to which system developers can address the particular needs of English language learners. It concludes that the more evaluators and authorities know about automated scoring systems, the greater the chance that this technology will be used widely to meet the ever-growing demands of a huge population. Paskaleva et al. [30] developed a new set of similarity functions for information retrieval. Records were treated as multisets of tokens, which were mapped into real vectors. To bridge the gap between set-based models and the vector space model, consistent extensions of set-based similarity functions were developed.
McNamara et al. [31] explained the significance of a hierarchical classification approach for computing essay scores from a set of text variables. Their analysis revealed 55% exact accuracy between predicted essay scores and human scores, along with 92% adjacent accuracy. Although the features that inform the overall assessment will differ depending on the specific problem, this approach is able to produce more accurate and informative performance models than simple one-shot regression. Sultan et al. [32] discussed short answer grading in which a student's response is given together with the correct answer, and the grade is derived from the semantic similarity between the two. The key measure employed in their supervised model uses recent short-text similarity features. In addition, term weighting mechanisms are needed in many cases to identify important answer words. Accurate answer scoring can be achieved with a simple base model that can be easily extended with new features.
Wang et al. [33] conducted a study on identifying current issues in short answer grading (SAG). In order to observe the issues involved in SAG, they analyzed the results of a simple SAG approach. They used KNN to score query answers, where vector representations of answers are generated from weighted, pretrained word embeddings. By analyzing the errors of this approach, they showed how the diversity and short length of answers cause problems for SAG. Properties of short answer scoring, such as the diversity of answers, were statistically analyzed. Raczynski and Cohen [34], in their research article "Appraising the scoring performance of automated essay scoring systems-some additional considerations," provided a useful validation framework for the assessment of automated scoring systems. They determined the types of essays which can be used to calibrate and test automated essay scoring (AES) systems. They also discussed which human scores should be used when there are scoring disagreements among multiple human raters. Wang and Brown [35] discussed the validation of manual and automated essay scores against "true" scores. Raters were divided into two groups (14 or 15 raters per group), and they rated 250 essays in two sets, all written in response to the same prompt, thereby providing an approximate true score for each essay. An automated essay scoring (AES) system was trained on the datasets and scored the essays using a cross-validation scheme. They concluded that the correlation between automated and human scores is of the same order as the correlation between manual graders.

Proposed Methodology
The proposed system is based on the vector space model, but it incorporates further changes to obtain better results:
(1) Our vector incorporates syntactical and semantic features. Figure 2 shows how a document is vectorized. There are two arrays for each document. The first column, referred to as the predefined feature, holds the terms (T1, ..., Tn). The second column holds the weights (W1, ..., Wn) of the corresponding term features. Whenever a new document is added, the number of columns in the matrix is incremented, and the number of rows is incremented when a new term is added.
(2) Generally, cosine similarity is used to find the similarity among vectors. Here, a new, well-defined similarity measure that includes syntactic and semantic features is proposed for scoring the answers. This technique is expected to produce better results than cosine similarity. Equation (1) represents the similarity.
(3) Term weighting is the key in the vector space method. Several studies on term weighting techniques have been conducted, but there is still disagreement regarding which method is more appropriate.

Term Weighting.
Advanced text retrieval systems treat term weighting as an important component. The major content of a text is described in terms of words, phrases, or other indexing units. Every word of the text has its own importance and worth. This is expressed as a term weight, represented by the following equation:
$$\mathrm{idf}_t = \log \frac{N}{\mathrm{df}_t}, \tag{2}$$
where $N$ is the total number of documents and $\mathrm{df}_t$ is the document frequency of term $t$.

TF-IDF Weighting.
The term frequency and the inverse document frequency are combined to produce a composite weight for every term in each document. This scheme is known as TF-IDF, and the weight it assigns to a term t in a document d is the basis of the proposed system. Equation (3) illustrates the TF-IDF formula:
$$\text{tf-idf}_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_t, \tag{3}$$
where $\mathrm{tf}_{t,d}$ is the frequency of term $t$ in document $d$ and $\mathrm{idf}_t$ is the inverse document frequency of $t$.
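As a rough illustration of the weighting scheme, the following sketch computes TF-IDF weights for tokenized documents. It assumes the common form of the inverse document frequency, log(N / df_t) with the natural logarithm, and raw term counts as the term frequency; the paper may use a different variant. The function names `idf` and `tf_idf` and the sample documents are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def idf(documents: List[List[str]]) -> Dict[str, float]:
    """Inverse document frequency, assuming idf_t = ln(N / df_t)."""
    n = len(documents)
    df: Counter = Counter()
    for doc in documents:
        df.update(set(doc))  # count each term once per document
    return {term: math.log(n / count) for term, count in df.items()}


def tf_idf(doc: List[str], idf_weights: Dict[str, float]) -> Dict[str, float]:
    """Weight of each term in one document: raw count times idf."""
    tf = Counter(doc)
    return {term: count * idf_weights.get(term, 0.0) for term, count in tf.items()}


docs = [["sun", "rises", "east"], ["sun", "sets", "west"]]
weights = tf_idf(docs[0], idf(docs))
```

A term that occurs in every document, such as "sun" here, receives a weight of zero, while terms that occur in only one document receive the highest weight.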
The Hindi literary document is the kind of document on which information retrieval with the proposed approach is focused. Such a document is easy and interactive for Hindi readers and students, as they get pictures along with the rhymes, and its content is easy for beginners to learn and remember. Performance tuning is another important feature of this system, which supports around 41 categories. Additionally, more documents can be added in future.
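(4) In the present study, performance is analysed using Pearson's correlation coefficient between the system scores and the human scores, which is computed using the following equation:
$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},$$
where $x_i$ and $y_i$ are the system score and the human score for the $i$th answer, $\bar{x}$ and $\bar{y}$ are their means, and $n$ is the number of answers. The analysis is more reliable and accurate because performance is correlated with human graders.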
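A minimal sketch of this performance measure is given below, using the standard definition of Pearson's correlation coefficient. The function name `pearson_r` and the sample scores are hypothetical and are not results from the study.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's correlation coefficient between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for five answers (illustrative values only).
system_scores = [2.0, 3.5, 4.0, 1.0, 5.0]
human_scores = [2.5, 3.0, 4.5, 1.5, 5.0]
r = pearson_r(system_scores, human_scores)
```

A value of r close to 1 indicates that the system ranks and spaces the answers in much the same way as the human graders.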
This study combines both approaches, that is, rule-based and statistical methods, to obtain the scores for a given answer. To build a vector, it uses NLP tools such as a morphological analyzer and a POS tagger. The advantage of this system lies in the synergistic effect of the two approaches and the accuracy of the output.
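For reference, the cosine similarity that item (2) of this section names as the usual baseline can be sketched as follows. This is the conventional measure, not the similarity function proposed in this paper. The representation of a document as a dictionary from terms to weights and the function name `cosine_similarity` are assumptions made for illustration.

```python
import math
from typing import Dict


def cosine_similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector shares nothing with any other vector
    return dot / (norm_a * norm_b)


# Hypothetical term weights for a reference answer and a student answer.
reference = {"sun": 1.0, "rises": 2.0, "east": 2.0}
student = {"sun": 1.0, "rises": 2.0}
similarity = cosine_similarity(reference, student)
```

Because cosine similarity depends only on shared terms and their weights, it cannot by itself reward a synonym of a reference term, which is one reason a feature-aware measure may be preferred.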

Objective Achieved
This research develops an automatic answer scoring system suitable for Indian languages, specifically Hindi in Devanagari script. The subobjectives to achieve this aim are as follows.

Development of Resources.
Resources developed and used to accomplish this automated scoring were as follows:
Predefined questions: predefined questions developed by teachers are fed to the system. The system is flexible in the sense that questions can be deleted, modified, or added at any time.
Reference answers: standard reference answers defined by expert human raters are fed into the system. There can be more than one reference answer for a given question.
Corpus: a fixed corpus is developed and selected for a domain so that teachers can set or select fixed questions. This helps in easy extraction of information and in increasing the accuracy of the system.
Lexicons: words and their synonyms relevant to the domain are collected, and their contextual meanings are defined. This dictionary is then added to the database. Unlike English, lexical material is not easily available in Hindi, so a special dictionary with synonyms has been prepared.
Other standard NLP tools: NLP tools such as a morphological analyzer, stemmers, and a part-of-speech tagger have been integrated into the system.

Identification of Predefined Features.
To define and maintain the quality of an essay, certain predefined features are first identified and then extracted so that they can be evaluated with respect to the reference answers. High importance is therefore given to the predefined features of both reference answers and student answers. Feature selection is used to determine and limit the dimensionality of the features. Instances with higher probabilities, which have feature relevance, are selected. This helps in improving the performance of feature selection. A wide range of algorithms is used for text clustering in feature selection. A distance measure is selected in clustering to determine the similarity of two answers. Cosine similarity, Euclidean distance, the Jaccard coefficient, and the Pearson correlation coefficient are some of the similarity or distance measures which have been widely applied in the study. The identification of predefined features involves the following three steps.

Preprocessing.
Text preprocessing is used to transform the whole text into a form suitable for learning algorithms. It involves tasks such as cleaning and refining the data. Preprocessing includes:
(1) Converting byte strings to tokens, which can be called lexical analysis.
(2) Eliminating stopwords such as the, and, of, and a.
(3) Reducing different forms of a word to a single "stem" form by removing affixes such as ing, ed, pre, and sub.
(4) Selecting terms (features), which can be individual words or noun phrases.
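The preprocessing steps above can be sketched as follows. This is a deliberately naive illustration on English text, using only the stopwords and affixes named in the list. The function names, the minimum stem length of three letters, and the sample sentence are assumptions, and a pipeline for Hindi would need a Devanagari-aware tokenizer, a Hindi stopword list, and a proper morphological analyzer instead.

```python
import re
from typing import List

STOPWORDS = {"the", "and", "of", "a"}  # examples from step (2)
PREFIXES = ("pre", "sub")              # examples from step (3)
SUFFIXES = ("ing", "ed")               # examples from step (3)


def tokenize(text: str) -> List[str]:
    """Step (1): split lowercased text into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def stem(token: str) -> str:
    """Step (3): crude affix stripping that keeps at least three letters."""
    for prefix in PREFIXES:
        if token.startswith(prefix) and len(token) - len(prefix) >= 3:
            token = token[len(prefix):]
            break
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            token = token[: -len(suffix)]
            break
    return token


def preprocess(text: str) -> List[str]:
    """Steps (1) to (4): tokenize, drop stopwords, stem, keep single words as terms."""
    return [stem(token) for token in tokenize(text) if token not in STOPWORDS]


terms = preprocess("The student answered and explained the subheading of a preview")
```

Blind affix stripping of this kind over-stems many words (for example, it would reduce "present" to "sent"), which is why a morphological analyzer is normally preferred.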

Extraction.
In extraction, natural language processing (NLP) tools are used to extract feature terms. The process is also applied in the feature reduction phase of text classification. Linguistic features are extracted from the text and used as part of the feature vectors. One of the methods for extracting features is part-of-speech (POS) tagging.
The document is tagged with a standard n-gram tagger. Besides the above NLP tools, other tools such as a morphological analyzer and a spell checker can be used for feature extraction.

Feature Selection.
Feature selection is performed after the features have been extracted. Thereafter, the standard predefined features are selected.

Developing Statistical Model.
A statistical model has been developed and used. It has the following advantages:
(1) It characterizes numerical data, describes measurements, and helps in the development of conceptual models of a system or process.
(2) It helps to estimate the uncertainties in observational data and in calculations based on them.
(3) It characterizes numerical output from mathematical models. The information gathered from the model can be fed back to the system to enhance its performance.
(4) Input parameters can be estimated when more complex mathematical models are encountered.
VSM is essentially a statistical model. Considering the above advantages and strengths, it has been adopted in the present study. The required changes have been applied to remove its limitations in the proposed system.

Scoring Mechanism.
As already discussed, the scoring mechanism is implemented by extracting grammatical and semantic relations between the student answer and the reference answers. The internal scoring mechanism is explained in the next chapter, "Experimentation and Evaluation." Semantic similarity is a metric used to determine the distance between a set of documents or terms. It is based on the similarity of their meaning or semantic content, as distinct from syntactical similarity. A comparative analysis of the different meanings allows us to obtain a numerical description. Semantic similarity must be distinguished from semantic relatedness. Any relationship between two terms constitutes semantic relatedness, whereas semantic similarity is based only on the "is a" relation. For example, "car" is similar to "bus" but is also related to "road" and "driving." Semantic relatedness is very hard to extract, so the proposed statistical model greatly reduces this complexity. However, the two terms are used interchangeably in much of the literature. Basically, the three terms semantic similarity, semantic relatedness, and semantic distance all ask, "How much does term A have to do with term B?" The answer is usually a number between 0 and 1 or between −1 and 1, where 1 means very high similarity. Semantic similarity is an active topic in natural language processing (NLP), a field of computer science and linguistics, where the semantic similarity between concepts is used to measure the semantic similarities or distances among the given answers. In other words, semantic similarity is used to identify concepts that have common "features."

Result and Discussion
The novelty of the scoring is that it does not rely on the cosine similarity function between two vectors; the system uses its own scoring function as well as its own similarity formula. For effective evaluation, the proposed model is classified into three weighting intervals. The maximum value belongs to one of the following intervals: (i) [33%, 50%] (ii) [50%, 75%] (iii) [75%, 100%].
Word_Count_feature: to give a complete demonstration, suppose the maximum grade for the word count is 0.5, a value obtained from a questionnaire filled in by 50 expert human scorers. For finer grading, this maximum of 0.5 points for the word count feature is further divided into three parts.
After the range of the feature has been calculated, the next step is to assign a score by comparing the total word count of the student answer vector with the total word count of the reference answer vector. Target grade scores are awarded after matching and are calculated by

if ((Word_Count ≥ x) && (Word_Count ≤ y)) then word_count = 0.35;
elseif ((Word_Count ≥ y) && (Word_Count ≤ z)) then word_count = 0.45;
elseif (Word_Count ≥ z) then word_count = 0.5;
else word_count = 0;
End if

where x, y, and z represent percentages of the total word count (TWC) of the reference answers (Ri). They are computed as follows:

x = TWC(Ri) * 0.33
y = TWC(Ri) * 0.50
z = TWC(Ri) * 0.75

Feature-based scoring is incorporated in this research. The similarities between the student answer and the reference answer are calculated to evaluate the correct scores. The system also extracts different lexical, syntactical, and semantic features for giving the scores. Each feature has its own weightage towards the target grade. The implication is that the absence of a semantic feature affects the target grade more than the absence of a syntactic feature. Other features have less weightage than semantic features because semantic features depend on the content similarities, as shown in Figure 3.
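The word count rule above can be sketched in Python as follows. The function name, the whitespace tokenization, and the handling of an empty reference answer are assumptions made for illustration; the thresholds and the scores 0.35, 0.45, and 0.5 follow the rule as stated, including the order in which the branches are tested.

```python
def word_count_score(student_answer: str, reference_answer: str) -> float:
    """Score the word count feature of a student answer (maximum 0.5).

    Assumption: words are counted by splitting on whitespace.
    """
    twc_ref = len(reference_answer.split())  # TWC(Ri)
    word_count = len(student_answer.split())

    # Assumption: an empty reference answer yields no score.
    if twc_ref == 0:
        return 0.0

    x = twc_ref * 0.33
    y = twc_ref * 0.50
    z = twc_ref * 0.75

    # Branch order mirrors the stated rule, so a count exactly equal to y
    # falls in the first branch and a count exactly equal to z in the second.
    if x <= word_count <= y:
        return 0.35
    elif y <= word_count <= z:
        return 0.45
    elif word_count >= z:
        return 0.5
    return 0.0


if __name__ == "__main__":
    reference = " ".join(["word"] * 100)  # illustrative 100-word reference
    for n in (10, 40, 60, 80):
        print(n, word_count_score(" ".join(["word"] * n), reference))
```

With a 100-word reference answer, the thresholds are 33, 50, and 75 words, so answers shorter than 33 words receive no credit for this feature.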

Performance Analysis.
The performance of the system is evaluated by comparing the output generated by the system with the results given by human raters. The correlation coefficient is calculated, and it shows that the system scores and the human raters' scores are highly correlated with each other, i.e., the coefficient is close to one (positive correlation). The agreement value between human raters is quite close to the score agreement value achieved between a human rater and the system. Table 3 shows the final scores of 100 students graded by the system. From Table 4 and Figure 4, it is observed that the cosine of the angle is not close to one. In this research, the similarities between the reference answers and the student answers are not far from one, which shows the efficiency of the proposed system. The cosine similarity function mentioned above is computed as shown in Table 3.
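The two quantities used in this comparison can be sketched as follows. The text does not state which correlation coefficient is used, so the Pearson coefficient is assumed here; the function names and the sample values are illustrative only and are not data from the study.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in xs))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in ys))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / (dev_x * dev_y)


def cosine_similarity(u, v):
    """Cosine of the angle between two term vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an empty vector is treated as dissimilar
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Illustrative values only, not results from the study.
    system_scores = [3.5, 4.0, 2.5, 5.0]
    human_scores = [3.0, 4.5, 2.5, 5.0]
    print(pearson_correlation(system_scores, human_scores))
    print(cosine_similarity([1, 2, 0, 1], [1, 1, 1, 0]))
```

A correlation close to one indicates that the system ranks and scales answers much as the human raters do, while a cosine value close to one indicates that two term vectors point in nearly the same direction.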

Conclusion, Recommendations, and Future Work
This section presents a summary of the whole work, a few recommendations, and the future scope for extending the reported work. This work contributes to the active research area of automated answer scoring systems. The focus of this work is to analyze the existing automated scoring systems along with their workflow. The research objectives have been satisfied by developing a reliable automated answer scoring system for an Indian language.
7.1. Conclusion. Various approaches reported by researchers have been reviewed, and they show the feasibility of an automated short answer scoring system. Evaluating a large number of students' answers in a given period of time, with feedback, is not a trivial task. Manual grading has proved to be weak with respect to resource requirements, fairness, cost, and timely feedback. On the other hand, an automated system is very helpful for providing grades as well as feedback on students' answers within the specified time frame. In order to ensure consistency and to overcome the problems of manual scoring, the automated system gives correct scores. These scores can be reproduced consistently at different times. Recognizing the importance and fairness of the automated scoring system, there is a need to refine and develop the system further for more accuracy. Other issues, such as computational and linguistic features, need to be handled to make the system more effective. A number of AES systems are available for the evaluation of different features. They still need to be developed and refined so as to overcome their shortcomings.

Recommendations.
Based on the computer-based automated scoring system, the following recommendations are suggested for reliable scoring. The validation strategy for AES in this system takes into consideration the following: (1) The statistical relation between scores assigned by the AES system. (2) Other aspects of AES, such as the selection of features and their weightage. (3) The agreement between independent measurements of students' writing skills and the scores assigned by an AES system.

Future Work.
The work undertaken in this research can be expanded in diverse directions. The present system achieves its target of objectivity and fairness of evaluation in scoring. The achievements have already been mentioned. However, one may say that the system is rigid and lacks flexibility from the human perspective. This charge, if it is to be levelled at all, can easily be addressed in the future development of the system. For example, the system can be modified so that the scoring can be graded or classified as strict, moderate, standard, etc., according to the requirements of the situation or the teachers. This idea can surely be developed in the automated scoring system.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.