Automated Prediction of Good Dictionary EXamples (GDEX): A Comprehensive Experiment with Distant Supervision, Machine Learning, and Word Embedding-Based Deep Learning Techniques

Dictionaries are not only a source of word meanings; they also help the reader comprehend the context in which words are used. For this purpose, comprehensive book dictionaries and, more recently, online dictionaries give a short sentence as an example for each word. Lexicographers perform a meticulous activity to elicit Good Dictionary EXamples (GDEX), i.e., the sentences that best fit a dictionary entry for a word's definition. The rules for eliciting GDEX are strenuous, and the manual process takes a lot of time. In this regard, this paper focuses on two major tasks: the development of labelled corpora for the top 3K English words through a distant supervision approach, and the design of a state-of-the-art artificial intelligence-based automated procedure for discriminating Good Dictionary EXamples from bad ones. The proposed methodology involves a suite of five machine learning (ML) and five word embedding-based deep learning (DL) architectures. A thorough analysis of the results shows that GDEX elicitation can be done by both ML and DL models; however, DL-based models show a small improvement of 3.5% over the conventional ML models. We find that random forests with parts-of-speech information and the word2vec-based bidirectional LSTM are the best ML and DL combinations for automated GDEX elicitation; on the test set, these models secured a balanced accuracy of 73% and 77%, respectively.


Introduction
The comprehensive dictionary of any language provides the meaning of a word; at the same time, it shows the correct usage of that word through an example sentence. Thus, for any word we can think of, a suite of multiple sentences can serve as examples to define it. All of these examples can be accurate w.r.t. grammatical structure, the metaphor they deliver, and the context in which they are used. In practice, given a corpus of many (hundreds of thousands of) sentences for a single word, lexicographers, in the activity of selecting Good Dictionary EXamples (GDEX), try to elicit the one sentence that best defines the word on the qualitative grounds of being typical, informative, and highly readable [1,2]. There are certain rules that lexicographers have to follow during the elicitation process. For example, Kilgarriff et al. [2] maintain that a good sentence (one) has an adequate length of 10-25 words, (two) is composed of words that are among the top 17,000, (three) has the target collocation in the main clause, (four) does not use pronouns and anaphors, (five) provides a context, et cetera. Overall, the activity is quite slow, and sometimes it ends in a compromise, when a good sentence is not good enough to be an example in the context of contemporary usage. All of this creates a strong need to replace the manual GDEX elicitation process with an automated one based on artificial intelligence, specifically natural language processing (NLP) and natural language understanding (NLU). The recent methods of automating such text classification tasks are based on supervised machine learning (ML) and Neural Network (NN) based deep learning (DL) techniques.
These systems rely heavily on prelabelled data, which technically means a dataset that is labelled by humans. The accuracy of any such system depends directly on the size of the data and the quality of the labelling. Researchers have recently produced abundant datasets for various classification tasks, but for the problem under study labelled data is scarce: a lot of data is available over the Internet in the form of raw/unlabelled corpora, and employing humans to label it would require a huge amount of time and labour. In contrast, techniques such as distant supervision make generalized assumptions for data labelling. For example, instead of labelling the marriage relation between Barack and Michelle from the sentence "Michelle married Barack in 1992, and they have two daughters," we consider every sentence in which the terms Obama and Michelle appear as an instance of the marriage relation [3]. Similarly, for sentiment analysis of product reviews, we can derive binary labels from star ratings (such as treating reviews with 3 or more stars out of 5 as positive, otherwise negative [4]).
Thus, for the automation of the manual GDEX procedure, this paper contributes (i) the development of a dataset for GDEX classification using the distant supervision technique; (ii) the application of supervised ML and DL algorithms to predict whether, for a given word, a sentence in running English text is good or not; (iii) a comparative analysis of the robustness and trade-off between ML and DL approaches; and (iv) a competitive analysis between manual GDEX elicitation routines and automated GDEX classification.
However, this does not mean that the proposed methodology explicitly examines the syntax and other linguistic elements of good writing, nor does it deal with the inference of polarity in the given text (the computational study of affect), which conventionally refers to the task of sentiment analysis. Instead, as preliminary research, it aims to verify whether a discriminative classifier can be found that categorizes English sentences into one of the binary classes, good and bad, through supervised ML algorithms.
This paper is systematically divided into 5 sections. The related work on the same problem is given in Section 2. Section 3 provides details on the materials and methods: data source, data labelling strategy, and approaches, followed by information on the ML and DL methods. The insights into the results, critique, and comparative and competitive analyses of the results are presented in Section 4. The conclusion of the paper and future work are given at the end.

Literature Review
On the problem under study, there are many significant methodologies proposed by researchers; however, we maintain that, in comparison to other classification tasks in NLP, the amount of work for GDEX classification is small.
Pilán et al. [5] produced the most relevant work for GDEX classification; they developed a system to evaluate the suitability of a sentence either as a dictionary example or as a good example for teaching purposes. They argue that a good example should be typical, informative, intelligible, and easily readable for learners. Two techniques, based on natural language processing and machine learning, were used for sentence selection. Both techniques were applied to content taken from Swedish novels, newspapers, and blogs. In this work, 70% of the total sentences were found suitable for understanding by students and teachers. Srdanović and Kosem [6] presented GDEX classification for the Japanese language; it was designed mainly for Japanese lexicography and learning purposes. In this research, a randomly extracted list of lemmas was used for evaluating GDEX configurations.
Kilgarriff et al. [2] presented some rules and boundaries for a good sentence; according to the study, the sentence should hold the following characteristics (or comply with the following rules): (i) (Rule#1) A sentence consisting of 10 to 25 words will be preferred. (ii) (Rule#2) A sentence will be penalized when its words do not lie among the 17,000 commonest words in the language. (iii) (Rule#3) A sentence containing pronouns and anaphors will be penalized. (iv) (Rule#4) The target collocation should be in the main clause. (v) (Rule#5) A sentence should start with a capital letter and end with a full stop, exclamation mark, or question mark.
Moreover, for a GDEX, Kilgarriff et al. [2] argue that the first two characteristics/features (sentence length and word frequency) should be given the highest weight compared to the other features. According to Kosem et al. [7], the most important characteristics of a GDEX are authenticity, typicality, informativeness, and intelligibility. The developers of the Good Dictionary EXamples system and its configurations are often lexicographers and in many cases lack programming skills.
Geyken et al. [8] show that GDEX work can be extended through ML techniques for mapping example sentences to dictionary senses. They computed all collocation sets and then used maximum entropy [9] to learn the correct mapping between corpus sentences and their correct dictionary sense. Ljubešić and Peronja [10] presented another ML approach to extract GDEX. The dataset used in their experiment contains example sentences annotated with four classes/levels (i.e., very bad, bad, good, and very good). They used the Random Forest regressor algorithm [11] and secured an average precision of 90%.
Stanković et al. [12] presented similar work on the selection of GDEX for Serbian, which was used for the development of preliminary components of the model. Their approach analyses the lexical and syntactic aspects of a corpus consisting of five digitized volumes of examples from the Serbian Academy of Sciences and Arts (SASA) dictionary. They compared the feature distribution of examples from their corpora with the feature distribution of sentence samples extracted from corpora comprising various other texts.
In this way, 140 selected candidate examples were represented as feature vectors, and supervised machine learning classifiers were applied to standard and nonstandard Serbian sentences.
Koppel [13] presented work on GDEX classification for the Estonian language. The group used the etTenTen13 web corpus; their approach focuses on sentence length, word length, the number of subordinate clauses, and keyword position. In another similar study, Uprety and Shakya [14] conducted a test to analyse the effectiveness of context clue sentences among Nepalese students. Their results showed that context clue sentences were more useful for learning vocabulary than GDEX sentences. Based on these results, they recommended that context clue sentences be included in the Good Dictionary EXamples to help new learners.

Materials and Methods
This section is divided into three subsections, each dealing with a part of the methodology: data gathering and labelling (Subsection 3.1), preprocessing and feature selection (Subsection 3.2), and an overview of the experimental setup employing the suite of predictive algorithms for machine learning (Subsection 3.3).

Data Source and Data Labelling.
We prepared our dataset in the fashion of distant supervision. Using BeautifulSoup (https://www.crummy.com/software/BeautifulSoup/bs4/doc/), we scraped sentences from the website sentence.yourdictionary.com (YD.com). The scraping covers the top 3K English words listed by Oxford Learner's Dictionaries (https://www.oxfordlearnersdictionaries.com/wordlist (accessed May 20, 2021)). On average we obtained ≈250 sentences per word and more than 785K English sentences in total. Furthermore, the website not only provides the example sentences but also presents the count of thumbs-up and thumbs-down votes for every sentence against the given word. Hence we maintain the corpus in a dictionary structure where, for every word as a key, a list of tuples is retained. Mathematically, consider equation (1):

$$C = \{\, w : [\,(S^w_i, U_i, D_i)\,]_i \,\} \tag{1}$$

where $C$ is a dictionary of key-value pairs in which each word $w$ is a key whose value is a list of tuples; each tuple holds an example sentence $S^w_i$ along with its thumbs-up votes ($U_i$) and thumbs-down votes ($D_i$), and the subscript $i$ is the index of the sentence. The target label of a sentence, i.e., good or bad (or 1 and -1, respectively), is determined by the counts of thumbs-up and thumbs-down votes. On further analysis, we noticed that YD.com holds different votes for the same sentence if that sentence is referenced as an example for different words. Hence, $C$ cannot be used directly while it contains redundant sentences with different votes.
To restructure the dataset, we extract a set of distinct sentences $S^*$ from $C$ as per equation (2). Further, we prepared different datasets, one for each pooling function $\Psi(\cdot)$, holding sentences and their labels in the form of tuples in the manner of equation (3), where $s_i \in S^*$ and $\lambda_i$ is the label of the respective $i$-th sentence, determined by the criterion of function $\Phi$ given in equation (4):

$$\Phi(a_1, a_2) = \begin{cases} \text{good}, & \text{if } a_1 \ge a_2, \\ \text{bad}, & \text{otherwise.} \end{cases} \tag{4}$$
In equation (4), $a_j$ is a real number yielded by a pooling function $\Psi_f(\cdot)$ (explained below). In $\Phi(a_1, a_2)$, the subscript $j$ of $a_j$ indicates the vote type, $U$ or $D$; hence $j \in \{1, 2\}$, with $a_{j=1}$ for thumbs-up votes and $a_{j=2}$ for thumbs-down votes. Lastly, the value of $a_j$ is calculated as per equation (5), where $\Psi_f$ is the final score calculating function and $f \in \{\max, \text{average}, \text{sum}\}$. The index set $I$ in equation (5) has already been defined in equation (2). Thus, we used these votes as crowdsourced labelling and adjudged a sentence to be good if its total thumbs-up votes are equal to or greater than its thumbs-down votes (see equation (4)).
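The pooling and labelling steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function name `pool_votes`, the `POOLERS` table, and the toy corpus are hypothetical, and the corpus is assumed to follow the word-to-list-of-tuples structure of equation (1).

```python
from statistics import mean

# Pooling functions f for the final score Psi_f; the keys mirror the text.
POOLERS = {"max": max, "average": mean, "sum": sum}

def pool_votes(corpus, f="sum"):
    """Collapse repeated sentences and label each as good (1) or bad (-1)."""
    pooler = POOLERS[f]
    votes = {}  # sentence -> ([thumbs-up counts], [thumbs-down counts])
    for entries in corpus.values():
        for sentence, up, down in entries:
            ups, downs = votes.setdefault(sentence, ([], []))
            ups.append(up)
            downs.append(down)
    labelled = []
    for sentence, (ups, downs) in votes.items():
        a1, a2 = pooler(ups), pooler(downs)  # pooled up-votes and down-votes
        labelled.append((sentence, 1 if a1 >= a2 else -1))  # rule of equation (4)
    return labelled

# Hypothetical toy corpus: the same sentence appears under two words
# with different vote counts.
toy_corpus = {
    "cat": [("The cat sat on the mat.", 5, 1), ("It was there.", 0, 4)],
    "mat": [("The cat sat on the mat.", 1, 3)],
}
labels = dict(pool_votes(toy_corpus, f="sum"))
```

With `f="sum"`, the repeated sentence pools to 6 thumbs-up against 4 thumbs-down and is labelled good, even though one of its two occurrences on its own would have been labelled bad.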
Table 1 gives the statistical information on the labelled dataset employed in this experiment. The dataset for every scoring function is balanced, i.e., each class contains 20K records (which means that 40K sentences in total are used in the experiments). One key observation from the table is that the average sentence length of good examples is approximately half that of its counterclass. This further suggests that distant supervision (or nearly crowdsourced data) aligns with Rule#1 (already stated in Subsection 2.2).

Machine Learning-Based Classification: Feature Enhancement, Transformation, and Algorithms.
This section begins with a summary of the experiments conducted for GDEX classification based on conventional ML algorithms; the following itemized text provides a brief commentary on the components depicted in Figure 1.
(i) We experimented with two different approaches for feature enhancement: (1) Bag of Words (BoW). (2) Usage of Part of Speech (PoS) tags alongside the words.
(ii) Besides the above two approaches, we set two feature transformation (or vectorization) techniques for the sentences in the dataset: (1) Word frequency-based count vectorization. (2) TF * IDF vectorization.
(iii) The combination of these feature enhancement approaches and feature vectorization techniques is evaluated with five conventional ML algorithms on 10 randomly generated training and testing subsets under the Monte Carlo method. The ML algorithms used in this paper are enumerated below: (1) k-nearest neighbours (k-NN). (2) Naïve Bayes. (3) Random Forest (RF). (4) Linear Support Vector Machine (linear-SVM). (5) Radial basis function Support Vector Machine (rbf-SVM).

Feature Enhancement Approaches.
The BoW approach is considered a very basic approach for any task in NLP [15]. It consists of tokenizing a running text/document and submitting the tokens for further processing. However, these sequences of words become more meaningful and informative when they are analysed with their corresponding PoS tags. Thus, hundreds of papers in the domain of NLP and NLU have utilized PoS information alongside the words themselves [16,17]. In the same regard, we anticipate that words combined with their respective PoS tag information (BoW + PoS) will make the predictive ML model more robust, under two hypotheses: (i) BoW + PoS creates highly discriminative features for classifying a GDEX. (ii) Beyond the previous point, BoW + PoS embodies a writing pattern that exists over a comparatively longer sequence in n-grams; we surmise that it may capture better syntactic and semantic attributes.
On a technical note, we used the Natural Language ToolKit (NLTK) word tokenizer (https://www.nltk.org/api/nltk.tokenize.html; the module provides many tokenizers, and the function used in this paper is word_tokenize) for sentence tokenization; the PoS tagging is then done with the NLTK PoS tagging module (https://www.nltk.org/api/nltk.tag.html#module-nltk.tag). We concatenated each word and its respective PoS tag with an underscore, as shown for a single sentence in Table 2; information on the tag-set can be found in the online documentation of NLTK (https://www.nltk.org/book/ch05.html).
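The word and tag concatenation can be sketched as follows. To keep the example runnable without downloading NLTK models, the tagged pairs are written by hand; the helper name `attach_pos` is hypothetical, and with NLTK installed the pairs would instead come from `nltk.pos_tag(nltk.word_tokenize(sentence))`.

```python
def attach_pos(tagged_tokens):
    """Join each (word, tag) pair into a single 'word_TAG' feature token."""
    return [f"{word}_{tag}" for word, tag in tagged_tokens]

# Hand-tagged toy input using Penn Treebank tags (an assumption for
# illustration; the paper obtains tags from NLTK's tagger).
tagged = [("This", "DT"), ("is", "VBZ"), ("a", "DT"), ("cat", "NN")]
features = attach_pos(tagged)
```

Each combined token then acts as one vocabulary entry, so the same word used with different tags produces different features.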

Feature Vectorization Techniques.
ML algorithms cannot work directly on running text. Since the vocabulary contains thousands of terms and only a few of them appear in any one sentence, we need to transform every sentence through a mechanism that applies to all sentences in the dataset and is workable for the ML algorithms. Typically, the sentence transformation mechanism takes a sentence and projects it into a high-dimensional vector space [15]. The final structure of the dataset is a matrix whose number of rows equals the number of sentences and whose number of columns equals the size of the vocabulary (in other words, the dimensionality of a single vector equals the size of the vocabulary). Thus, the dimensions corresponding to the words present in the sentence carry nonzero numeric values, and all other dimensions are zero (in the nonsparse representation). The matrix can be made sparse by not storing the dimensions of words that are absent from the sentence and retaining only the records for the words that appear in it.
Count vectorization, the first vectorization technique used in this paper, vectorizes a sentence by keeping the count of each word that appears in the sentence and zero on the remaining dimensions. Figure 2 illustrates the count vectorization process: the first step generates a dictionary of word indices, and the second uses that dictionary to transform sentences into the vector space.
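The two steps of count vectorization can be sketched in plain Python. This is an illustrative sketch only: the function names are hypothetical, and whitespace splitting stands in for the NLTK tokenizer used in the paper.

```python
def build_index(sentences):
    """Assign a column index to every distinct token, in order of first appearance."""
    index = {}
    for sentence in sentences:
        for token in sentence.lower().split():
            index.setdefault(token, len(index))
    return index

def count_vectorize(sentence, index):
    """Return a dense count vector with one dimension per vocabulary entry."""
    vector = [0] * len(index)
    for token in sentence.lower().split():
        if token in index:  # tokens outside the vocabulary are ignored
            vector[index[token]] += 1
    return vector

sentences = ["the cat sat", "the cat saw the dog"]
index = build_index(sentences)
matrix = [count_vectorize(s, index) for s in sentences]
```

The resulting matrix has one row per sentence and one column per vocabulary term, matching the structure described above.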

We may think of cases where the most frequent words (i.e., a, an, the, of, to, et cetera, known as stop words) dominate a sentence, resulting in larger values on their respective dimensions and diminishing the impact of the least frequent words. In this regard, the TF * IDF approach sets a trade-off between high-frequency and low-frequency words [15,18]. It works by calculating the product of the term frequency relative to a document (TF) and the inverse document frequency of the term in the corpus (IDF). To express TF and IDF mathematically, consider the following equations; moreover, Figure 3 uses these formulae to illustrate the TF * IDF calculations.
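As a rough illustration of the trade-off, the sketch below uses one common textbook variant, TF as the relative frequency of a term in a document and IDF as the logarithm of the number of documents divided by the document frequency. This variant is an assumption; the exact formulae used in the paper (and the smoothed variant used by default in scikit-learn) may differ.

```python
import math

def tf(term, doc_tokens):
    """Relative frequency of the term within one document."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus_tokens):
    """Log of (number of documents / documents containing the term).

    Assumes the term occurs in at least one document.
    """
    df = sum(1 for doc in corpus_tokens if term in doc)
    return math.log(len(corpus_tokens) / df)

corpus = [s.split() for s in ["the cat sat", "the dog sat", "the cat ran"]]
doc = corpus[0]
weights = {t: tf(t, doc) * idf(t, corpus) for t in doc}
```

In this toy corpus "the" occurs in every document, so its IDF is zero and its weight vanishes, whereas "cat" and "sat" keep a positive weight.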
On a technical note, we used the n-gram range [1,3] in sklearn, which forms unigrams, bigrams, and trigrams from the input string. Alongside this, we kept the same tokenization function for both vectorization processes, as discussed in the previous subsection.
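What an n-gram range of 1 to 3 produces can be shown with a small helper. The function `ngrams` below is a hypothetical stand-in written for illustration; in scikit-learn the equivalent setting is the `ngram_range=(1, 3)` parameter of `CountVectorizer` and `TfidfVectorizer`.

```python
def ngrams(tokens, n_min=1, n_max=3):
    """Return all contiguous n-grams for n in [n_min, n_max], joined by spaces."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

grams = ngrams(["this", "is", "a", "cat"])
```

For a four-token sentence this yields four unigrams, three bigrams, and two trigrams, each of which becomes its own feature dimension.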

Machine Learning Algorithms.
K-Nearest Neighbours is among the instance-based lazy learning techniques in conventional ML algorithms [19,20]. Functionally, it computes the distance between the target document vector and all other document vectors and then selects the k documents with the minimum distance. Finally, it decides the class of the target document through voting among the k-nearest neighbour vectors. The number of neighbours set for this work is five (i.e., k = 5). There are many measures for computing distances between documents; the one employed in this paper is cosine similarity. Since similarity is inversely related to distance, the k-NN algorithm performs voting on the k documents with the maximum similarity. For nonnegative document vectors, the value of the cosine similarity ranges in [0, 1], where a score of 0 indicates no similarity and 1 indicates absolute similarity. The cosine similarity between two document vectors (A and B) is calculated through the following equation [15]:

$$\cos(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$$

Naïve Bayes is a conventional ML algorithm for classification tasks [4,15]. It classifies sentences by exploiting conditional probability using Bayes' theorem, under the basic assumption of conditional independence between the features. The basic calculation done by naïve Bayes for classifying a sentence ($X$) is given in the following equation:

$$P(C_k \mid X) = \frac{P(C_k)\, P(X \mid C_k)}{P(X)}$$

Expanding it w.r.t. the individual features ($X = x_0, x_1, \ldots, x_n$) gives equation (9):

$$P(C_k \mid x_0, \ldots, x_n) \propto P(C_k) \prod_{i=0}^{n} P(x_i \mid C_k) \tag{9}$$

However, when the documents are normalized and transformed through TF * IDF vectorization, the feature values are no longer discrete. Thus, for continuous features, we cannot employ the above conventional naïve Bayes algorithm. Instead, we have to use its variant that uses the Gaussian distribution (hence known as Gaussian naïve Bayes) [21,22]; the substitution for $P(x_i \mid C_k)$ in Gaussian naïve Bayes is defined in the following equation:

$$P(x_i \mid C_k) = \frac{1}{\sqrt{2\pi\sigma_k^2}} \exp\!\left(-\frac{(x_i - \mu_k)^2}{2\sigma_k^2}\right)$$

where $\mu_k$ and $\sigma_k^2$ are the mean and variance of the feature within class $C_k$. Finally, the target class $y$ (by either conventional or Gaussian naïve Bayes) is the one for which $P(\cdot)$ is maximum; see equation (11), where $K$ is the set of classes:

$$y = \underset{k \in K}{\arg\max}\; P(C_k) \prod_{i=0}^{n} P(x_i \mid C_k) \tag{11}$$

Random Forest is an ensemble approach among ML classification algorithms, which is based on Decision Trees (DT) [23]. Instead of relying on a single decision tree, the basic aim is to draw multiple decision trees from bootstrapped random samples of the training data. The testing data is predicted by each DT, and the final label is elicited through voting [11]. Thus, RF mitigates the issue of overfitting through the ensemble technique. Figure 4 shows how the RF classifier works and outputs a final class from all of the DTs. In the experiment, we used 200 trees (or DT estimators) for building a forest.
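The k-NN voting with cosine similarity described earlier in this subsection can be sketched in plain Python. The function names, toy vectors, and labels are hypothetical, and the sketch uses k = 3 on five training vectors only to keep the example small (the paper uses k = 5).

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def knn_predict(target, train_vectors, train_labels, k=5):
    """Vote among the k training vectors most similar to the target."""
    ranked = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: cosine(target, pair[0]),
        reverse=True,  # highest similarity first
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data for illustration only.
train_vectors = [[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 1], [1, 0, 0]]
train_labels = ["good", "good", "bad", "bad", "good"]
prediction = knn_predict([1, 0, 0], train_vectors, train_labels, k=3)
```

Because the method is lazy, nothing is learned ahead of time; every prediction compares the target against all stored training vectors.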
Support Vector Machine is one of the most widely employed classifiers among conventional ML algorithms [24]. It is well suited to the classification of complex or imbalanced datasets, provided they are small or medium-sized. The SVM aims to draw a hyperplane in an n-dimensional vector space such that the hyperplane separates the data points into two distinct partitions representing the respective classes [25]. The SVM can be used for linear or nonlinear classification. The basic SVM, which fits a hyperplane, is conventionally known as linear-SVM [25,26]. Equation (12) gives the mathematical semantics for understanding linear-SVM.
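Once a hyperplane is available, classification with a linear SVM reduces to checking on which side of it a point falls. The sketch below shows only this decision step; the weights and bias are hand-picked, hypothetical values, and the training procedure that finds the maximum-margin hyperplane is omitted.

```python
def linear_svm_predict(x, w, b):
    """Return +1 or -1 depending on the side of the hyperplane w.x + b = 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

w, b = [1.0, -1.0], 0.0  # hypothetical hyperplane, not learned from data
side_a = linear_svm_predict([2.0, 1.0], w, b)
side_b = linear_svm_predict([1.0, 3.0], w, b)
```

The two example points fall on opposite sides of the line and therefore receive opposite class labels.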
In this work, we have used linear-SVM and radial basis function (rbf) SVM (through the kernel trick). The basic objective of rbf-SVM is to fit a circular boundary margin for nonlinear datasets. The illustration in Figure 5(a) shows the linear-SVM; in contrast, Figure 5(b) shows the situation where a hyperplane is not suitable for separating the dataset into two distinct parts, and instead, this can be achievable

Feature Engineering for Deep Learning-Based Classification.
In this section, the DL models and the input data encoding schemes are discussed in detail.
As with the earlier summary of the ML methodology in Subsection 3.3, we give a brief commentary on the DL-based models; Figure 6 shows the overall scheme for these experiments.
(i) Since we empirically found that, in the suite of ML-based algorithms, the best result was secured with the dataset based on the final scoring function $\Psi_{f=\text{sum}}$, all of the DL-based experiments are performed only on that dataset.
(ii) Since an NN essentially requires input data to be encoded in numeric form, we used 3 different data encoding approaches: (1) One-hot encoding. (2) word2vec word embeddings. (3) GloVe word embeddings.
(iii) These data encoding approaches were combined with the following 5 DL algorithms/networks: (1) RNN. (2) LSTM. (3) GRU. (4) Bidirectional LSTM. (5) Bidirectional GRU.
Thus, the total number of experiments done with DL-based methods is 15, i.e., 3 (data encoding approaches) × 5 (DNNs) = 15. The details of these components are given in the subsequent subsections.

Data Encoding Approaches.
The NNs need data in numeric form, for which there are many transformation or encoding approaches. One-hot encoding is one of these techniques; it generates a single vector for every word in a sentence, such that the index corresponding to the word is 1 and all remaining entries are 0. Thus, we obtain a sparse matrix-like structure (or a list of four one-hot vectors) for the sentence "This is a cat," as illustrated in Figure 7(a). Each row of the yellow block is a vector with only a single entry of 1, indicating the presence of the word in the vector. Hence, with this technique, the input data is sparse and exists in a very high-dimensional space.
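One-hot encoding of the example sentence can be sketched as below. The four-word vocabulary and the function name are hypothetical; a real vocabulary would contain thousands of entries, which is what makes the representation so sparse.

```python
def one_hot_encode(tokens, vocabulary):
    """Return one vector per token with a single 1 at the token's index."""
    index = {word: i for i, word in enumerate(vocabulary)}
    rows = []
    for token in tokens:
        row = [0] * len(vocabulary)
        row[index[token]] = 1
        rows.append(row)
    return rows

vocabulary = ["a", "cat", "is", "this"]  # toy vocabulary, alphabetical order
encoded = one_hot_encode(["this", "is", "a", "cat"], vocabulary)
```

Every row contains exactly one nonzero entry, so the width of each row grows with the vocabulary while the information per row stays the same.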
In contrast, the second approach for data encoding is based on NN-inspired word embeddings and statistical means, which are dense and adjustable to any n-dimensional space, provided that n > 0; Figure 7(b) illustrates an example of word embedding where each row in blue is a dense representation of a word in 4-D space. Word embeddings give substance to Firth's dictum "You shall know a word by the company it keeps!" [27] by retaining the context of words, such that every word lies alongside similar words in the n-dimensional space of word embeddings (using GloVe, examples of the nearest words for the word "king" are "kings," "queen," "monarch," et cetera; retrieved through the online tool available at http://bionlp-www.utu.fi/wv_demo). In this work, we have employed two different word embeddings, namely, word2vec [28] and GloVe [29], developed by Google and Stanford, respectively. word2vec employs continuous BoW in an NN for learning to predict the current word (given its context words as input) and skip-grams for learning similar words (given a source/input word), whereas GloVe utilizes matrix factorization techniques such as Latent Semantic Analysis (LSA) [30] on a word-word context matrix for generating word vector representations. On a technical note, the representation used in this work is based on 300 dimensions (these vectors can be accessed at http://vectors.nlpl.eu/repository).
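The idea that similar words lie close together in the embedding space can be illustrated with a toy lookup table. The 4-D vectors below are invented for illustration and are not real word2vec or GloVe values; the helper names are likewise hypothetical.

```python
import math

# Hypothetical 4-D vectors for illustration only; the pretrained vectors
# used in the text have 300 dimensions and are learned from corpora.
embeddings = {
    "king":  [0.9, 0.8, 0.1, 0.0],
    "queen": [0.8, 0.9, 0.1, 0.1],
    "cat":   [0.0, 0.1, 0.9, 0.8],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def nearest(word, table):
    """Return the other word whose vector is most similar to the given word."""
    others = (w for w in table if w != word)
    return max(others, key=lambda w: cosine(table[word], table[w]))

# A sentence becomes a dense matrix: one embedding row per token.
sentence_matrix = [embeddings[w] for w in ["king", "cat"]]
closest = nearest("king", embeddings)
```

Unlike one-hot rows, each embedding row is dense and its width is fixed by the embedding size rather than by the vocabulary size.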

Deep Neural Networks.
The NNs are computational systems of connected units that loosely simulate the working of biological neurons in the brains of living beings. The history of ideas and advancements in the field of NN is long. (The earliest NNs were devised by McCulloch and Pitts [45] in 1943 for artificially simulating the working of a biological neuron [39,46]. This early work was realized with computational means known as "Calculators" and the "Perceptron," in 1954 by Farley and Clark [47] and in 1958 by Rosenblatt [48], respectively; however, these works were limited to presenting the working of a single neuron [39,44]. The upgrade of NNs to several layers (thus called DNN) was made in 1965 by Ivakhnenko and Lapa [49]. In 1975, Werbos [50] showed that the backpropagation technique can be used to learn new weights when training multilayer networks [46]; further work by Rumelhart et al. [51] showed that backpropagation learns interesting features for text processing.) Here we give a brief introduction to the basic working of these connected units, or an NN (also illustrated in Figure 8), where inputs (or signals) received at the input layer are analysed and transmitted to the further neurons to which they are connected. The input should be numeric (as discussed in the previous section); thus, the input ($X = x_0, x_1, \ldots, x_n$) received at the units of the hidden layer and the respective weights ($W = w_0, w_1, \ldots, w_n$) associated with the edges are combined through the dot product ($X \cdot W$), creating a linear output. In the next step, the bias ($b$) is added to this linear output ($z = X \cdot W + b$), and the result is made nonlinear by passing it to a nonlinear activation function, in our case the tanh function, which is given in equation (13):

$$\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}} \tag{13}$$
Similarly, the output of the hidden neurons is transmitted to the final output neuron, which applies a step function to compute the class of the given input data. The step function used in this paper is the sigmoid, which returns a number in the range [0, 1]; we consider the prediction to belong to the positive class if the value is above 0.5, and to the negative class otherwise. The sigmoid function is given in the following equation:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

The backpropagation technique is used to update the weights considering the prediction errors that occur during training. In this context, a DNN typically divides the training set into multiple batches; with each batch it calculates the error and then updates the weights. Executing the same process on every batch marks one run, which is technically called an epoch.
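The forward computation described above, a tanh hidden layer feeding a single sigmoid output unit, can be sketched in plain Python. All parameter values are hypothetical and hand-picked; a trained network would learn them through backpropagation, which is not shown here.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, hidden_weights, hidden_biases, out_weights, out_bias):
    """One hidden tanh layer followed by a single sigmoid output unit."""
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(weights, x)) + b)  # z = X.W + b
        for weights, b in zip(hidden_weights, hidden_biases)
    ]
    z_out = sum(w * h for w, h in zip(out_weights, hidden)) + out_bias
    return sigmoid(z_out)

# Hypothetical hand-picked parameters for two inputs and two hidden units.
probability = forward(
    x=[1.0, 0.0],
    hidden_weights=[[0.5, -0.5], [1.0, 1.0]],
    hidden_biases=[0.0, 0.0],
    out_weights=[1.0, 1.0],
    out_bias=0.0,
)
label = "good" if probability > 0.5 else "bad"  # 0.5 threshold from the text
```

The output is read as the probability of the positive class and thresholded at 0.5 to obtain the binary label.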
In this paper, we have used three types of NN that were specifically developed for text (or, more generally, sequence) processing. The RNN [31] is one of the first DNNs that attempted to incorporate input history in sequential data, such that the RNN moves onwards through subsequent inputs while incorporating the result (the hidden state) of the previous input units.
W.r.t. Figure 9, the RNN works for every time step $t$, and the hidden state $a^{\langle t \rangle}$ and the output $y^{\langle t \rangle}$ are expressed as per equations (15) and (16):

$$a^{\langle t \rangle} = f\!\left(W_{aa}\, a^{\langle t-1 \rangle} + W_{ax}\, x^{\langle t \rangle} + b_a\right) \tag{15}$$

$$y^{\langle t \rangle} = f\!\left(W_{ya}\, a^{\langle t \rangle} + b_y\right) \tag{16}$$

where $W_{aa}$, $W_{ax}$, $W_{ya}$, $b_a$, and $b_y$ are the coefficients, and $f$ is the activation function; these coefficients and the internal structure of the blue box (illustrated in Figure 9) are explained in Figure 10. Although RNNs were developed to retain memory, they fail to do so for longer sequences. As an alternative, Hochreiter and Schmidhuber [33] presented another RNN-based architecture, namely, LSTM, which handles the problem of input retention better. The LSTM introduced the concept of the gate for remembering inputs; later, an upgraded form of LSTM was presented by Gers et al. [34], which added a forget gate to the architecture; with the introduction of the forget gate, the LSTM became capable of resetting its state [35]. Although the LSTM is a capable RNN architecture, it takes more memory and processing time [36,37]. Cho et al. [38] introduced the GRU, which is similar to the LSTM but contains fewer parameters. Traditional RNNs suffer from the vanishing gradient problem, which is handled far better in LSTM and GRU [32,33,39]. The bidirectional LSTM and GRU are variants of the vanilla LSTM and GRU that make the DNN process a string in both forward and backward directions [39]. Table 3 presents a summary of the gates used in LSTM and GRU, and Table 4 illustrates their usage in LSTM and GRU, where ⊙ denotes elementwise multiplication between two vectors. The networks we have employed in this paper have the same input and output layer.
However, the hidden layer varies w.r.t. the architecture. These DNNs are programmed with Keras using the sequential model. Information on the layers' hyperparameters used in this work is given in Tables 5 and 6.
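The hidden-state recurrence of a vanilla RNN can be sketched in plain Python as below. This is an illustration only and not the Keras implementation used in the paper: the parameters are hypothetical toy values, tanh is assumed as the activation function, and only the hidden state is computed (the output step is omitted).

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def rnn_step(a_prev, x_t, W_aa, W_ax, b_a):
    """Compute a<t> = tanh(W_aa a<t-1> + W_ax x<t> + b_a)."""
    rec = matvec(W_aa, a_prev)  # contribution of the previous hidden state
    inp = matvec(W_ax, x_t)     # contribution of the current input
    return [math.tanh(r + i + b) for r, i, b in zip(rec, inp, b_a)]

# Hypothetical toy parameters: 2 hidden units, 2-D inputs.
W_aa = [[0.5, 0.0], [0.0, 0.5]]
W_ax = [[1.0, 0.0], [0.0, 1.0]]
b_a = [0.0, 0.0]

state = [0.0, 0.0]
for x_t in [[1.0, 0.0], [0.0, 1.0]]:  # a sequence of two input vectors
    state = rnn_step(state, x_t, W_aa, W_ax, b_a)
```

After the second step, the first hidden unit still carries a damped trace of the first input, which is the memory effect that LSTM and GRU gates are designed to preserve over longer sequences.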

Results and Discussion
In this section, we present a thorough discussion of the evaluation and comparison of the ML and DL models. Before proceeding, it should be noted that the evaluation is done on a validation set extracted from the labelled corpus following the Pareto principle, or 80/20 rule [15,40]. These details are given in the subsequent subsections.

Evaluation Criteria and Metrics.
The classification task in a supervised learning domain is often evaluated through the confusion matrix (CM), which presents the number of correct and incorrect predictions with respect to the actual labels in the validation set. A sample CM is given in Table 7, where TN (true negatives) is the number of documents that are actually negative and are also predicted as negative; TP (true positives) is the counterpart for the positive class. FP (false positives) is the number of documents that are actually negative but misclassified as positive; FN (false negatives) is the opposite of FP, i.e., the number of documents that are actually positive but falsely predicted as negative.
We can derive several evaluation statistics for assessing the quality of the Predictive System (PS) using the CM. The statistics and their derivations used for the assessment of ML and DL models in this work are defined in Table 8. Moreover, for the ideal PS, we expect the highest values on the left (main) diagonal of every individual CM, whereas on the right diagonal of the matrix we expect the lowest values.
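The statistics derived from the CM can be computed directly from its four counts. The sketch below uses the standard definitions of precision, recall, specificity, F1, accuracy, and balanced accuracy; the function name and the example counts are made up for illustration and do not come from the experiments.

```python
def cm_metrics(tp, tn, fp, fn):
    """Derive common evaluation statistics from confusion-matrix counts.

    Assumes every denominator is non-zero (no empty class or empty prediction set).
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # true positive rate (R)
    specificity = tn / (tn + fp)       # true negative rate (S)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    balanced_accuracy = (recall + specificity) / 2
    return {
        "P": precision,
        "R": recall,
        "S": specificity,
        "F1": f1,
        "A": accuracy,
        "BA": balanced_accuracy,
    }

# Hypothetical counts for a balanced validation set of 200 documents.
m = cm_metrics(tp=80, tn=70, fp=30, fn=20)
```

With equally sized classes, as in this made-up example, accuracy and balanced accuracy coincide; they diverge once one class outnumbers the other.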
For the evaluation of performance in this paper, we consider R and BA to be of greater importance. R is critical because we consider losing or misclassifying a positive document into another category perilous, as we have little data for the GDEX classification in comparison to the colossal dataset; thus, we will consider an ML or DL model optimal where R is higher. In a similar context, this should not imply a small value of S; hence, the BA is the

Analysis of ML Models and Results.
The quantified statistics of all evaluation metrics are given (respectively, for the final scoring functions, i.e., $\Psi_{f=\mathrm{avg}}$, $\Psi_{f=\mathrm{max}}$, and $\Psi_{f=\mathrm{sum}}$) in Tables 9-11. The overall result of the ML-based models is positive. It is evident that all ML models perform better (on all datasets corresponding to the final scoring functions) when the text is vectorized with the TF * IDF approach. Collectively, the dataset created with $\Psi_{f=\mathrm{sum}}$, per se, indicates the most optimal method for dataset creation through distant supervision. In contrast, the results with $\Psi_{f=\mathrm{avg}}$ show the least significance for making a discriminant dataset for predictive modelling; hence, we maintain that distant supervision should not be paired with averaging methods for data curation in supervised learning tasks.
Since the dataset with $\Psi_{f=\mathrm{sum}}$ shows better results, we consider it (see Table 11) for the discussion in the remaining text. Turning to the evaluation of the feature enhancement technique, BoW + PoS tags show better results than BoW alone. In particular, a drastic change in the accuracy of k-NN (i.e., for $\Psi_{f=\mathrm{sum}}$, an improvement of +12% with count vectorization and +2% with TF * IDF vectorization) is seen when the PoS information is added alongside the plain words. However, in comparison to the count vectorization technique, we maintain that the improvement with the additional PoS information is slightly more visible with the TF * IDF vectorization technique. The most optimal ML algorithm and feature combination, with a maximum accuracy of 77.3%, is rbf-SVM + TBP. (TBP is the acronym for the combination of TF * IDF vectorization + BoW + PoS tag features. Similarly, CBP is the combination of count vectorization + BoW + PoS tag features, TB stands for the combination of TF * IDF + BoW features, and CB is count vectorization + BoW features.) Ignoring the trivial difference between linear-SVM and its rbf variant, RFT + TBP secures the second position with an accuracy of 76.8%. For BA, k-NN + TBP is the best combination with a score of 75.5%, followed by RFT + TB with 73.9%. Besides accuracy and balanced accuracy, the highest recall (i.e., 75.4%) is seen on the dataset with $\Psi_{f=\mathrm{max}}$ with RFT + CB and linear-SVM + TB. In addition, R is high with SVM throughout. Figure 11 shows the improvement of the BoW + PoS approach over the conventional BoW approach. The subfigures in the top row indicate the improvement with count vectorization, whereas the bottom row shows the improvement with TF * IDF vectorization.
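To make the CBP and TBP feature combinations more concrete, the following sketch builds count and TF * IDF vectors over a bag of words extended with PoS tags. How the PoS tags are merged with the words is not specified in this section, so appending the tags as extra tokens is an assumption; the pre-tagged sentences, the tag names, and the plain tf × log(N/df) weighting are likewise illustrative, and library implementations typically add smoothing and normalization.

```python
import math
from collections import Counter

# Hypothetical pre-tagged sentences: (word, PoS tag) pairs.
tagged_docs = [
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("she", "PRON"), ("sleeps", "VERB")],
]

def bow_pos_tokens(tagged):
    """BoW + PoS: keep the words and append their tags as extra tokens (assumed scheme)."""
    return [w for w, _ in tagged] + ["POS_" + t for _, t in tagged]

def count_vectors(docs):
    """Count vectorization: raw term frequencies per document."""
    return [Counter(doc) for doc in docs]

def tfidf_vectors(docs):
    """Unsmoothed TF * IDF: tf(t, d) * log(N / df(t))."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {term: tf * math.log(n / df[term]) for term, tf in Counter(doc).items()}
        for doc in docs
    ]

docs = [bow_pos_tokens(d) for d in tagged_docs]
cbp = count_vectors(docs)   # count vectorization + BoW + PoS
tbp = tfidf_vectors(docs)   # TF * IDF vectorization + BoW + PoS
```

In this toy corpus the token `POS_VERB` occurs in every document, so its TF * IDF weight is zero while its count stays at one, which illustrates how TF * IDF down-weights ubiquitous tokens that count vectorization treats like any other.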
The overall observation on the improvement is mixed, except for the TF * IDF features on the averaging-based dataset ($\Psi_{f=\mathrm{avg}}$), where the positive trend of improvement is steady. However, the least improvement, i.e., ≈0.8% on average, is seen for the same dataset. In the same context, the maximum average improvement (i.e., ≈3%) is found for the dataset with $\Psi_{f=\mathrm{sum}}$. Figure 12 shows the CMs of all conventional ML algorithms, separated by feature enhancement and vectorization technique. However, instead of tripling the figure space to cover each of the datasets with the respective final scoring functions, we present the aggregated-normalized CM. The colour bar on the right of Figure 13 is set such that the maximum value is 0.5 (≈50%), which corresponds to the share of data in one class.
We maintain that SVM + TBP, in both its linear and rbf variants, is the most optimal algorithm among all. This is so because linear-SVM achieved TN + TP = 0.35 + 0.40 ≈ 0.75, i.e., 75% accuracy, with the other variant, rbf-SVM, standing second. The authors would also like to note the performance of RFT + CBP, 0.34 + 0.41 ≈ 0.75, which is similar to that of the previously mentioned linear- and rbf-SVM. Moreover, the competition between SVM + CBP and RFT + CBP is nearly equal, but RFT + CBP is found to be the champion as it has the minimum value on the right diagonal (i.e., FP + FN = 0.16 + 0.09 ≈ 0.25) and, in a similar context, the least FN, which, per se, is an additional advantage.
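As a worked check of the figures quoted for RFT + CBP, assume the aggregated-normalized CM cells are TN = 0.34, TP = 0.41, FP = 0.16, and FN = 0.09, so that each class occupies 0.5 of the data and the four cells sum to one. The standard definitions of accuracy, recall, specificity, and balanced accuracy then give:

```latex
\begin{align*}
A &= TP + TN = 0.41 + 0.34 = 0.75,\\
FP + FN &= 0.16 + 0.09 = 0.25 = 1 - A,\\
R &= \frac{TP}{TP + FN} = \frac{0.41}{0.50} = 0.82,\\
S &= \frac{TN}{TN + FP} = \frac{0.34}{0.50} = 0.68,\\
BA &= \frac{R + S}{2} = \frac{0.82 + 0.68}{2} = 0.75.
\end{align*}
```

Because the two classes are equally sized in this normalized CM, accuracy and balanced accuracy coincide, while the gap between R and S shows that the errors fall more on the negative class.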

Analysis of DL Models and Results.
Figure 9: Illustration of the RNN. Image courtesy [32].

The NN-based DL models are used on the dataset with the scoring function $\Psi_{f=\mathrm{sum}}$, as it produced the most optimal results in comparison to the remaining two scoring functions. Table 12 shows the metrics for the validation set only. Among the three input encoding techniques, word2vec is found to be better for GDEX classification. However, the unidirectional (vanilla) GRU and LSTM are found to be biased towards the negative class. In other words, these DL networks failed to discriminate between GDEX and bad examples and hence developed a propensity towards the negative class only. (The authors maintain that the bias of the unidirectional NNs could be overcome by introducing dropout, but we refrained from doing so in order not to be unfair to the rest of the NNs employed in this work.) Moreover, this behaviour is seen for both of the dense embedding techniques, word2vec and GloVe. In contrast, the bidirectional variants of GRU and LSTM achieved approximately equal and comparatively optimal results. We maintain that word2vec with Bi-LSTM is the optimal algorithm for GDEX, as it achieved 77% accuracy (and balanced accuracy as well). Alongside it, the highest recall, i.e., 86%, is also on record for this setting. The NNs with one-hot encodings show the lowest but steady results. Figure 14 shows the epochwise loss and accuracy achieved on the training and validation sets. We observe the typical behaviour as the number of epochs increases: the loss on the validation set decreases up to a point and afterwards tends to increase, whereas the loss on the training set continues to diminish [39,42,43]. We can see this behaviour in all DL models except for the Bi-LSTM and Bi-GRU with word2vec and GloVe, which show steady performance.
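The pattern described above, where the validation loss bottoms out and then rises while the training loss keeps falling, is what an early-stopping rule looks for. The following sketch picks a stopping epoch from a recorded validation-loss history; the loss values and the `patience` parameter are invented for illustration and are not the curves shown in Figure 14.

```python
def best_epoch(val_losses, patience=1):
    """Return the 1-based epoch with the lowest validation loss, ending the scan
    once the loss has failed to improve for `patience` consecutive epochs."""
    best, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best

# Hypothetical history: training loss keeps falling, validation loss turns upward.
train_loss = [0.69, 0.55, 0.44, 0.35, 0.27, 0.20]
val_loss = [0.68, 0.58, 0.53, 0.56, 0.61, 0.67]
stop_at = best_epoch(val_loss)
```

In Keras, the same idea is usually expressed with the `EarlyStopping` callback monitoring `val_loss`, rather than by inspecting the history after training.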
Furthermore, since DL is more appropriate for large-scale datasets, and the data employed for this experiment is comparatively small, we can expect a small number of epochs to be enough for training (i.e., to avoid overfitting the model to the training data). In this regard, the authors maintain that 3 epochs are enough for any of the DL-NNs used in the experiments, because the accuracy on the validation dataset declines after the 3rd epoch. Figure 13 shows the improvement in accuracy and balanced accuracy achieved by one DNN over the other networks; the improvement is computed as $NN_x - NN_y$ with $x \neq y$, where x is the DNN (together with the input encoding method used in it) on the x-axis and y is the DNN on the y-axis. The cells with shades of red in the figures indicate negative improvement; in contrast, the cells with grey shades indicate improvement. The intensity of the shade is directly proportional to the magnitude of the improvement. In line with the observation reported in Table 8, we found that, except for a few network comparisons, the improvements in accuracy and balanced accuracy yielded by the DNNs are nearly identical. We find that bidirectional DNNs with word embeddings achieve a major improvement over all the other DNNs. Although the highest gains in accuracy and balanced accuracy are seen over the unidirectional NNs, we neglect this case on the ground of the biased performance shown by the unidirectional DNNs (with word embeddings). Apart from that case, the highest real gain in accuracy and balanced accuracy is seen over RNN + word2vec; i.e., Bi-GRU and Bi-LSTM secured ≈23% improvement with word2vec, followed by ≈22% improvement for the same DNNs with GloVe. Focusing on Bi-GRU and Bi-LSTM, the most optimal word embedding scheme is word2vec, as it achieved ≈4% and ≈5% improvement over the vanilla one-hot encodings used for the same DNNs and ≈1% improvement over GloVe.

Gate summary (fragment): How much information of the cell should be revealed? (LSTM)

Table 4: Architecture and variables' information of the LSTM and GRU. Columns: Variables, LSTM, GRU; rows include Illustration.

Table 8: Evaluation statistics and their derivations.

| Statistic | Derivation | Description |
| --- | --- | --- |
| Recall (R) | R = TP/(TP + FN) | Recall (or true positive rate) shows the potential of the PS for predicting positive documents among all positive documents in the system |
| Specificity (S) | S = TN/(TN + FP) | Specificity (or true negative rate) is the counterpart of R for the negative class. It gives the potential of the PS for negative documents |
| F1-measure (F) | F1 = 2PR/(P + R), with precision P = TP/(TP + FP) | F1-measure is a harmonic mean of P and R. It is important to use where the dataset is imbalanced; further, it is a strict measure, which tends towards the minimum of P and R [41] |
| Accuracy (A) | A = (TP + TN)/(TP + FP + TN + FN) | Accuracy gives the overall credibility of the PS |
| Balanced accuracy (BA) | BA = (R + S)/2 | Like F, the balanced accuracy is also a mean statistic, which gives the arithmetic mean of R and S |
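The pairwise comparison behind Figure 13 amounts to subtracting the score of one network from that of another for every ordered pair. A minimal sketch follows; the model names and scores are placeholders chosen for illustration, not the values reported in Table 12.

```python
# Placeholder balanced-accuracy scores (in %), NOT the paper's reported values.
scores = {
    "model_a": 50.0,
    "model_b": 70.0,
    "model_c": 75.0,
}

def improvement_matrix(scores):
    """Map (x, y) -> scores[x] - scores[y] for every ordered pair with x != y."""
    return {
        (x, y): scores[x] - scores[y]
        for x in scores
        for y in scores
        if x != y
    }

improvements = improvement_matrix(scores)
```

Each entry is the negative of its mirrored entry, which is why a heat map of such a matrix shows matching positive and negative shades across the diagonal.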
Observing the CMs presented in Figure 15, we confirm that the bidirectional LSTM with word2vec is the most optimal pairing of NN and input embedding for resolving the problem under study. We also maintain that, in comparison to GloVe and word2vec, the one-hot encoding is the most underperforming input encoding scheme.

Comparative Analysis on ML vs. DL Models.
As reported in several studies comparing ML and DL models [16,39,43,44], the authors of this paper reassert that DL models outperform conventional ML models. In addition, we maintain that DL-based models attain balanced scores in accuracy and balanced accuracy. However, the DL-based unidirectional algorithms failed, which we consider specific to the problem under study; in contrast, the bidirectional DL algorithms are found to be the most optimal ones. Thus, with reference to the results compiled in Table 13, if we compare the averages of all ML models (for the dataset $\Psi_{f=\mathrm{sum}}$) with the averaged values of the DL-based models (i.e., RNN, Bi-LSTM, and Bi-GRU; leaving out the unidirectional LSTM and GRU due to their bias), we see an improvement of only ≈+3.56% and ≈+2.47% in recall and balanced accuracy, respectively, for GDEX classification. The principal reason for such a small improvement lies in the lower scores of the RNN in comparison to the two bidirectional NNs. On the other hand, the ML-based models took very little time in preprocessing and training. Similarly, the one-hot encoding made training longer, whereas the DL models with 300-dimensional dense word embeddings were trained in a short amount of time.

Table 14 shows selected example sentences from the test/validation set and the predictions made for them by the most optimal ML and DL models. In addition, we show the GDEX rules presented in the seminal work by Kilgarriff et al. [2]. These were actually 5 rules, which are already mentioned in the literature review (see Subsection 2.2); however, rule 3 is omitted from the discussion as it deals with the penalization of sentences containing anaphors and pronouns (although some sentences contain them, the ML/DL models do not explicitly apply such penalization).
Examples 1-8 show TP and TN, wherein, specifically, rule 4 is false when the actual label is bad. Examples 9 and 10 show FP, where rule 4 is false. Examples 11 and 12 are real mistakes, as these are FN, and the sentences not only comply with all rules but also appear to be very succinct in structure.

We can draw another meaningful insight into dataset curation through distant supervision. The unanimously true values for rules 1 and 2 and the correct assessment of rules 4 and 5 confirm the reliability of using the web-based data available at YD.com together with the label assignment method based on the scoring function $\Psi_{f=\mathrm{sum}}$ for GDEX classification and other similar problems.

Conclusions
This paper provides the implementation of both ML and DL models for GDEX classification. Following the results compiled in the experiments, we conclude that the proposed methodology is feasible for automating the manual GDEX elicitation routine. The dataset of 50K examples is extracted with the distant supervision technique, for which the summation method is found to be better than the vote aggregation (averaging and max) methods. For the conventional ML-based methods, the advantage of TF * IDF normalization over count vectorization was reconfirmed during the experiments. We have also found that PoS features are important and helpful for the classification and discrimination of GDEX. For the DL-based models, Bi-LSTM + word2vec is the champion among all DL-based combinations.
In the future, this work could be extended by incorporating supervised learning for GDEX elicitation against a given target word. We would also like to evaluate the current system on attention-based DL models. Finally, we would like to apply and evaluate the current technique on oriental languages such as Arabic, Persian, and Urdu, where the GDEX is considered to have historic relevance in poetic work.
Data Availability
The models and data files can be accessed at https://github.com/MuhammadYaseenKhan/english-gdex.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.