Text Classification Using Novel Term Weighting Scheme-Based Improved TF-IDF for Internet Media Reports

With the rapid development of internet technology, a large amount of internet text data can be obtained. Text classification (TC) technology plays a very important role in processing massive text data, but the accuracy of classification is directly affected by the performance of term weighting in TC. Because it was originally designed for information retrieval (IR), term frequency-inverse document frequency (TF-IDF) is not effective enough for TC, especially for processing text data with unbalanced distributions in internet media reports. Therefore, the variance between the DF value of a particular term and the average of all DFs ($\overline{DF}$), namely, the document frequency variance (ADF), is proposed to enhance the ability to process text data with unbalanced distribution. Then, the normal TF-IDF is modified by the proposed ADF for processing unbalanced text collections in four different ways, namely, TF-IADF, TF-IADF+, TF-IADFnorm, and TF-IADF+norm. As a result, an effective model can be established for the TC task of internet media reports. A series of simulations have been carried out to evaluate the performance of the proposed methods. Compared with TF-IDF on state-of-the-art classification algorithms, the effectiveness and feasibility of the proposed methods are confirmed by simulation results.


Introduction
Due to the rapid development of internet technology and information infrastructure construction, the volume of text data which can be obtained online has increased dramatically. As the China Internet Network Information Center (CNNIC) stated, the number of netizens in China had increased to 828.51 million by the end of 2018 [1]. The internet has become the main channel for Chinese people to obtain information.
The content of internet media is the most important data source, and textual documents are its main form. It is increasingly important to analyze massive textual documents effectively, for example through classification, indexing, and clustering. As a consequence, text classification (TC) has attracted many researchers working in the field. Based on previous studies, many applications of TC technology have been developed, such as author identification [2,3], spam e-mail filtering [4], medical document classification [5], customer relationship management, and web page classification [6,7]. Text classification (TC) is a task that assigns textual documents to predefined classes based on knowledge extracted from their content. The process of TC is as follows [8]:
(i) Given a set of k different discrete class label values $C = \{C_1, \ldots, C_k\}$ and, as training data, a set of documents $D = \{D_1, \ldots, D_n\}$, each document of which is labeled with a specific value from set C
(ii) Calculate text representations for documents in set D
(iii) Build a classification model based on the training data, which indicates the relationship between the features in the underlying document and one of the classes
(iv) Predict class labels for class-unknown documents using the trained model
Calculating text representations, training classification models, and predicting class labels for class-unknown documents are the main steps of text classification. The steps, the factors, and the way they are organized in TC are shown in Figure 1.
As shown in Figure 1, before documents can be analyzed by a classification model, they need to be preprocessed in a specific way, such as being represented by vectors of numerical values. These values relate to predefined classes that the classification model can understand. This process is called text representation, and it is an essential prerequisite for TC tasks [9].
Many methods have been proposed for text representation, among which the vector space model (VSM) is the most commonly used one [10]. VSM represents a document as a feature vector that consists of numerical values, which are also called term weights. Components of this kind of model can be of different types such as words, sentences, and phrases [11]. These components are also called terms, which are extracted from a document to form a bag of words (BOW) [12]. The abilities of these terms to distinguish different documents are represented by the numerical values (weights) related to the terms [13]. For example, a document can be represented as a vector of weighted features (or terms) $d_k = (t_1, t_2, \ldots, t_n)$ and a corresponding weight vector $w_k = (w_1, w_2, \ldots, w_n)$, where n is the number of selected features (terms) and $w_1, w_2, \ldots, w_n$ are the weights of $t_1, t_2, \ldots, t_n$. Then, a collection of documents (corpus) can be represented as shown in Figure 2, where element $w_{i,j}$ represents the weight of $t_j$ in $d_i$.
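The document-term matrix described above can be illustrated with a minimal sketch. It assumes documents are already tokenized and uses raw term counts as placeholder weights; the function name `build_vsm` and the toy corpus are illustrative and not taken from the paper.

```python
from collections import Counter

def build_vsm(tokenized_docs):
    """Return (vocabulary, matrix); matrix[i][j] is the raw count of term j in document i."""
    # The vocabulary (t_1, ..., t_n) is the sorted set of all terms in the corpus.
    vocabulary = sorted({term for doc in tokenized_docs for term in doc})
    index = {term: j for j, term in enumerate(vocabulary)}
    matrix = []
    for doc in tokenized_docs:
        row = [0] * len(vocabulary)
        for term, count in Counter(doc).items():
            row[index[term]] = count  # placeholder weight w_{i,j}: raw count
        matrix.append(row)
    return vocabulary, matrix

# Toy corpus (made up for illustration).
docs = [["stock", "market", "stock"], ["football", "match"], ["market", "match"]]
vocabulary, weights = build_vsm(docs)
print(vocabulary)  # ['football', 'market', 'match', 'stock']
print(weights)     # [[0, 1, 0, 2], [1, 0, 1, 0], [0, 1, 1, 0]]
```

Any term weighting scheme can then be seen as a rule that replaces these raw counts with more discriminative values.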
As we can see in the matrix, each term in each document can be assigned only one weight at a time in VSM. Assigning appropriate weights to terms is therefore crucial for the performance of text classification. Many methods, called term weighting schemes (TWSs), have been proposed to determine the weights of terms in documents. Different TWSs generate different vectors for the same document, thus giving the document different representations. "Good" term weighting methods are of fundamental importance for guaranteeing good TC performance. So far, there are two main categories of TWSs in the literature: semantic-based TWSs and statistics-based TWSs [14]. The semantic-based TWSs focus on the semantic relationships between terms and documents which are hidden behind the extracted features (words), as well as on the meaning of words [15]. For example, Rao et al. proposed a new model based on a neural network which captures semantics of continuous text representations [16]. Based on the distributional hypothesis, in which terms from documents with similar meanings also have similar meanings, a neural network was used to embed words into a continuous vector space (Word2Vec) for capturing the semantic information of words [17]. Doc2Vec, based on Word2Vec, was extended from the word level to the document level by fully using the information of the word sequence [18]. Because the above methods all have limitations when used on their own, Kim et al. combined BOW and Doc2Vec and proposed the bag-of-concepts method to overcome these limitations [19], and Kim et al. also tried different methods, namely, TF-IDF, LDA, and Doc2Vec, for text representation [20]. More recently, Wu et al. proposed a novel phrase-based text representation called Phrase2Vec, which includes skip phrase, CBOP, and GloVeFP. They applied the novel method to text analysis research, and the results show that Phrase2Vec can improve the performance of TC and clustering tasks [21]. Jaeyoung Kim et al. first applied capsule networks, which had achieved success in image classification, to the TC task and demonstrated performance comparable to well-known schemes at the time [22].
For non-English languages, semantic analysis is also widely used for TC. Ye-wang Chen et al. proposed a novel method using the biggest open and free internet knowledge base, Baidu Baike, to capture the semantic relationships of words to categories and thereby enhance the performance of TC on Chinese text [23]. Ashraf Elnagar et al. introduced two new datasets for Arabic TC tasks, namely, SANAD and NADiA, both of which are freely available for research. In their extensive comparisons among several deep learning (DL) models for Arabic TC on SANAD and NADiA, their method outperformed the others because it requires no preprocessing stage and is based entirely on deep learning models [24].
However, TWSs based on semantic analysis are more complex than their statistical counterparts in terms of analysis and calculation. Furthermore, semantic-based methods do not improve performance significantly. Therefore, statistics-based TWSs are still major topics in the field of text classification [25]. Normally, most statistics-based TWSs depend on the following philosophies:
(i) Terms with higher occurrences in a document relate to the document better, which is the basic idea of "term frequency" (TF)
(ii) Terms that occur in fewer documents relate better to the documents where they occur, which is the basic idea of "inverse document frequency" (IDF)
According to these principles, there are two main factors in a statistics-based TWS, as shown in the following equation:

$$\text{TWS} = \text{term frequency factor} \times \text{collection frequency factor}. \quad (1)$$

Representative TWSs and their two factors are listed in Table 1, where values of "NONE" indicate that there is no corresponding method for the specific parameter. As the table shows, some methods focus on modifying the term frequency factor (e.g., LogTF-RF [11] and SQRT_TF-IGM [25]), while some focus on developing novel methods for the collection frequency factor (e.g., TF-IDF [26], TF-CHI2 [27], TF-IEF [14], and TF-IGM [25]). Nevertheless, TF-IDF is still one of the most preferred methods.
This paper is concerned with enriching the collection frequency factor of a statistics-based TWS, i.e., TF-IDF, for handling situations of imbalanced data distribution. A new formula is designed by using the variance between the DF of a specific term and the average value of all DFs ($\overline{DF}$) instead of the original DF in TF-IDF. Based on the new formula, a novel method named TF-IADF and three other TWSs based on the same idea are proposed to enhance the TC performance in the imbalanced situation of internet media reports. The remainder of this paper is organized as follows. An overview of the background on statistics-based term weighting schemes is given in Section 2. The main idea of our novel methods is described in Section 3. Section 4 briefly introduces the experimental settings and datasets, including data preprocessing, the classifiers, and the measurements. Experimental results and the analysis are presented in Section 5. The final conclusion is given in Section 6.

Background Study
When looking at the studies related to TWS in the literature, TF-IDF, originally designed for information retrieval (IR), may be at the top of the list. However, as Chen et al. stated, due to its original design, TF-IDF is not effective enough in the text classification domain [25]. Thus, they proposed a new statistics-based model named inverse gravity moment (IGM) to describe the intercategory distribution. Based on IGM, TF-IGM and sqrt_TF-IGM (RTF) were proposed. In their demonstration on popular classifiers, namely, SVM and kNN, the proposed methods performed better on measurements such as micro-F1 and macro-F1 than existing TWSs (TF, TF-IDF, TFIDF-ICSDF, TF-CHI, TF-PB, and TF-RF). However, Turgut Dogan et al. revealed that TF-IGM gives a term the same weight in every case where its document frequency changes. This means that terms with different distinguishing abilities obtain the same weights from the standard IGM method, which is unreasonable [28]. In their study, two novel TWSs derived from IGM, namely, SQRT_TF-IGMimp and TF-IGMimp, were proposed to overcome its limitations. In other aspects, Zhong Tang et al. described two deficiencies from which TF-IDF suffers, namely, the collection frequency factor being undefined (division by zero) or being equal to zero in some special cases. They proposed a novel method, namely, term frequency-inverse exponential frequency (TF-IEF), to overcome these drawbacks [14]. The proposed method replaces the IDF with a global weighting factor IEF, and a log-like method is used to characterize the collection frequency factor. It greatly reduces the influence of terms with high TF values, which helps in generating a more representative vector of terms. The experiments showed that the novel methods performed better than the compared schemes.
The knowledge about Chinese language and Chinese culture provided by Baidu Baike is learned and organized by Chinese-speaking people and professional employees of Baidu. Therefore, Baidu Baike has been used several times for optimizing TC on Chinese text [23,29]. However, both Baidu Baike-based methods rely on semantic analysis and require a large amount of computation.
However, most of these methods are based on the assumption that the dataset is relatively balanced in distribution. In fact, imbalanced distribution of the dataset occurs frequently in the TC domain [30]. Furthermore, the classification performance is heavily affected by the imbalanced distribution of the dataset in TC [31,32]. Many studies have addressed this problem, such as [33,34]. In these studies, two common approaches are used to solve the problem of data imbalance, namely, data-driven methods and algorithm-driven methods. The data-driven method adjusts the proportion of data categories by undersampling, oversampling, or a combination of the two. The algorithm-driven method adjusts the classification algorithm to promote learning without changing the dataset. The simulation results of these methods show that the more unbalanced the proportion of categories is, the lower the overall performance of TC becomes. One of the main reasons is that some less-common terms in large-scale categories are weighted even higher than some more-common terms in small-scale categories due to their frequencies of occurrence. This paper studies document classification of Chinese internet media reports, which is also a TC problem with an imbalanced dataset, and attempts to create a more representative model for imbalanced data by modifying the term weighting method.

Novel Term Weighting Methods Based on Improved TF-IDF
TF-IDF, proposed by Karen Spärck Jones [26], is the most widely used TWS. In this section, a new TWS based on TF-IDF, namely TF-IADF, and its variants proposed in this paper are described in detail.

Overview of TF-IDF.
TF-IDF [35] is a combination of term frequency (TF) and inverse document frequency (IDF). Since the original value of term frequency in a document is used directly, the TF representation is one of the simplest TWSs. TF is based on the assumption that a term with a higher term frequency value is more important than one with a lower term frequency value. It depends only on the number of occurrences of a specific term in a local document. Therefore, the capacity of TF for distinguishing relevant documents from irrelevant ones is very low, because it ignores collection frequency. To address this problem, the inverse document frequency (IDF) was proposed, which takes collection frequency into account and enhances the discriminative capacity of a term for text classification [36]. IDF extends document frequency (DF), which is the number of documents in which a term occurs. It is based on the assumption that a term which occurs in fewer documents is more important than one which occurs in more documents [11]. The IDF value of a specific term can be obtained as follows:

$$\mathrm{IDF}(t, D) = \log \frac{|D|}{\mathrm{DF}(t, D)}. \quad (2)$$

In equation (2), $\mathrm{DF}(t, D)$ represents the DF value of term t in corpus D, and $|D|$ represents the total number of documents in corpus D. To avoid infinite values in some extreme cases, the formula is sometimes smoothed, for example by adding 1 to the denominator:

$$\mathrm{IDF}(t, D) = \log \frac{|D|}{\mathrm{DF}(t, D) + 1}. \quad (3)$$

After that, Jones extended the IDF method by adding the TF value into the calculation [26]. The resulting combination of TF and IDF is the most well-known term weighting method, namely, TF-IDF. Similar to IDF, TF-IDF is also a global statistical measure. The classical structure of TF-IDF is

$$\mathrm{TF\text{-}IDF}(t, d, D) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t, D). \quad (4)$$

In equation (4), $\mathrm{TF\text{-}IDF}(t, d, D)$ represents the weight of term t of document d in corpus D, while $\mathrm{TF}(t, d)$ represents the TF value of term t in document d.
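As a concrete illustration of the classical TF-IDF weighting, the following minimal sketch computes weights for a tokenized corpus. The natural logarithm, the add-one smoothing in the denominator, the function names, and the toy corpus are all assumptions made for this example; the paper's exact formulas may differ in these details.

```python
import math
from collections import Counter

def document_frequency(tokenized_docs):
    """DF(t, D): number of documents in which each term occurs at least once."""
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))
    return df

def tf_idf(tokenized_docs, smooth=False):
    """Return one {term: weight} dict per document, with weight = TF * IDF."""
    n_docs = len(tokenized_docs)
    df = document_frequency(tokenized_docs)
    weighted = []
    for doc in tokenized_docs:
        row = {}
        for term, count in Counter(doc).items():
            # Assumed smoothing: add 1 to the denominator when smooth=True.
            denominator = df[term] + 1 if smooth else df[term]
            row[term] = count * math.log(n_docs / denominator)
        weighted.append(row)
    return weighted

# Toy corpus (made up for illustration).
docs = [["stock", "market", "stock"], ["football", "match"], ["market", "match"]]
for row in tf_idf(docs):
    print(row)
```

Note that a term occurring in every document receives a weight of zero in the unsmoothed form, since log(1) = 0.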
As introduced in Section 2, TF-IDF is not effective enough in the text classification domain due to its original design. Many studies have optimized term weighting methods based on TF-IDF from different perspectives. Some of them developed new methods that replace the term frequency factor or the document frequency factor of TF-IDF, while others modified the existing TF-IDF method. This paper focuses on modifying IDF to improve the TC performance, especially for Chinese internet media content.

Proposed Methods.
When looking into the formula for calculating the value of IDF as shown in equations (2) and (3), we notice that when the corpus is not well balanced, meaning that the sizes of different categories in the corpus vary, terms from larger categories will be assigned smaller values than terms from other categories. This is obviously not in line with the real situation. Moreover, for some terms with low document frequency, the value of IDF is much higher than for others even when those terms are meaningless, which is not in line with the true situation either. To address this kind of problem, we focus on the deviation between the DF value of a specific term and the DF values of all terms in the whole corpus, since when the deviation between the DF value of a specific term and the average of all DF values is large, the term's discriminative ability is weak. This factor should be considered in the term weighting process.
Definition 1 (average document frequency (ADF)). It is the variance between the DF value of a specific term and the average of all DF values in a corpus.
We modified the collection frequency factor by adding ADF into the calculation to address the problems mentioned above. In this study, the average of all DF values in the corpus is represented as $\overline{DF}$, while the ADF value of term t in corpus D is represented as $A_{DF}(t, D)$. Equations (5) and (6) show how they are calculated, where n is the number of terms. As ADF is extended from DF, the simplest way of optimizing IDF is to replace DF by ADF in the formula. Then, we get a novel formula of collection frequency, shown in equation (7). In fact, the IDF method is successful enough in most cases; what we need to do is just to modify it for some extreme cases. Then, we get another novel formula as shown in equation (8), where ADF is used to reduce the weight of the terms with extremely high or extremely low DF values.
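To make the quantity in Definition 1 more tangible, here is a minimal sketch of one possible reading of it. It assumes that the "variance" of a term is the squared deviation of its DF from the mean DF over all n terms; this is an illustrative assumption, and the exact scaling used in equations (5)-(8) may differ. The function names and toy corpus are hypothetical.

```python
from collections import Counter

def document_frequency(tokenized_docs):
    """DF(t, D): number of documents in which each term occurs at least once."""
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))
    return df

def mean_df(df):
    """Average of all DF values, taken over the n distinct terms of the corpus."""
    return sum(df.values()) / len(df)

def squared_df_deviation(df):
    """Assumed reading of ADF: squared deviation of each term's DF from the mean DF."""
    average = mean_df(df)
    return {term: (value - average) ** 2 for term, value in df.items()}

# Toy corpus (made up): "a" occurs in every document, "c" and "d" in only one.
docs = [["a", "b"], ["a", "c"], ["a", "b", "d"], ["a"]]
df = document_frequency(docs)
print(mean_df(df))               # 2.0
print(squared_df_deviation(df))  # {'a': 4.0, 'b': 0.0, 'c': 1.0, 'd': 1.0}
```

Under this reading, both the very frequent term "a" and the rare terms "c" and "d" deviate from the mean, which matches the stated goal of reducing the weight of terms with extremely high or extremely low DF.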
The two ADF-based methods can improve the TC performance in the cases we mentioned. However, the variance itself still has limitations: when the size is too large or too small, the variance will be relatively too small or too large. Extreme values for terms will obviously impact the TC performance. Therefore, we further optimized the formula by normalizing the ADF to reduce the effect caused by the extreme values of terms. First, $A_{DF}(t, D)$ is modified as shown in equation (9) and then normalized as shown in equation (10). Based on the normalized value $A''_{DF}$, another two novel formulas are designed as shown in equations (11) and (12), where α (default value 1) is an optional weight proportion used to adjust the importance of $A''_{DF}$ in different cases.
Based on the above four proposed formulas of collection frequency based on IDF, we get four novel term weighting methods, which are given in equations (13)-(16). As a result, the optimized text representation model for processing internet media reports is shown in Figure 3. The four new formulas replace the IDF in TF-IDF, yielding four novel term weighting methods that enhance the performance of processing unbalanced text collections.

Case Study
To evaluate our proposed TWSs, experiments are carried out by using the proposed methods with state-of-the-art classification algorithms on both Chinese and English corpora. In this section, the datasets used in the experiments are briefly described. Then, the algorithms utilized for the classification process and the measurements used for performance evaluation in this study are introduced. Finally, the experiment settings are also presented.

The Data Source.
This study carried out experiments on three different datasets: a standard dataset of English text, namely, the Reuters-21578 corpus; a classic dataset of Chinese text, namely, the Fudan corpus; and a collection of Chinese internet media reports named the Internet corpus, which were crawled from the web and transformed into Chinese textual documents.

Reuters-21578 Corpus.
The Reuters-21578 corpus used here contains the top-10 categories of the ModApte split, which is the most preferred split in the TC domain [37]. In this study, multilabeled samples are removed since the focus is on single-label classification. So, only 8 categories with 5607 training samples and 2270 test samples in Reuters-21578 were used in our experiments. The details of the data distribution of this corpus are shown in Figure 4 and Table 2.

Fudan Corpus.
The Fudan University TC corpus is from the Chinese NLP group in the Department of Computer Information and Technology, Fudan University, China. There are 20 categories, whose data distributions are shown in Figure 5 and Table 3. Similar to Reuters-21578, the Fudan corpus is also an unbalanced dataset, but in Chinese.

Internet Corpus.
To test the performance of our proposed methods on Chinese internet media reports, some reports were crawled from the web to form this corpus. There are seven categories, namely, sport, education, tourism, traffic, tech, finance, and food. The data format is shown in Figure 6. Each instance of the test data has three parts: the category index, the article content, and the total number of words. In this study, both balanced and unbalanced data distributions of this corpus are tried.

Classification Algorithms Used for Experiments and Measurements.
Three popular classification algorithms, namely, naïve Bayes (NB), support vector machine (SVM), and random forests (RF), are used with our proposed methods and existing methods for a brief comparison.

Mathematical Problems in Engineering
Introductions about these algorithms and measurements for evaluation are given as follows.

Naïve Bayes.
The naïve Bayes algorithm [38] is a well-known TC classifier based on Bayes' theorem and the assumption that the features are independent of each other. In the TC process, document $d_k$ can be represented as a vector of terms $(t_1, t_2, \ldots, t_n)$. The probability that $d_k$ belongs to a specific category $c_i$ can be calculated using equation (17). More details about the NB classifier can be found in [39].
In this study, an NB classifier is used for evaluating the term weighting performance.
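For orientation, the usual textbook form of the naïve Bayes posterior under the independence assumption is sketched below. This is the standard formulation and is given here as an assumption; its notation may differ from that of equation (17).

```latex
% Standard naive Bayes posterior for document d_k = (t_1, ..., t_n) and class c_i
P(c_i \mid d_k) = \frac{P(c_i)\,\prod_{j=1}^{n} P(t_j \mid c_i)}{P(d_k)},
\qquad
\hat{c} = \arg\max_{c_i} \; P(c_i) \prod_{j=1}^{n} P(t_j \mid c_i)
```

Because $P(d_k)$ is the same for every class, it can be dropped when choosing the most probable category.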

Support Vector Machine.
SVM [40,41] is one of the most preferred algorithms for TC and many other pattern recognition problems. It handles high-dimensional problems well. The main principle of the SVM is to create linear or nonlinear hyperplanes to separate positive and negative samples. SVM uses some samples in the training set (called support vectors) to place the hyperplanes at locations that maximize the margins between negative and positive samples. In this study, a classic SVM classifier is used for evaluating the term weighting performance.

Random Forests.
The random forest (RF) algorithm [42] is a parallelizable ensemble method and one of the most preferred classifiers in the field of TC [43]. An RF is composed of multiple decision trees (DT), which are built in a random way. There is no correlation between the decision trees in the RF. After the RF is obtained, each decision tree in the forest judges which category an input sample belongs to, and the category that receives the most votes is taken as the prediction for that input. The architecture of RF is shown in Figure 7, where DT refers to the decision tree. In this study, an RF classifier is used for evaluating the term weighting performance.
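The voting step described above can be sketched in a few lines. This only illustrates how per-tree predictions are combined; it does not train any trees, and the function name `majority_vote` and the example labels are illustrative.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the label predicted by the most trees (ties go to the label seen first)."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical predictions from five decision trees for one input document.
votes = ["sport", "food", "sport", "finance", "sport"]
print(majority_vote(votes))  # sport
```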

Measurements for TC Performance.
To evaluate the classification performance on the aforementioned datasets, accuracy, precision, recall, and F1 score are used (see Table 4).
In multiclass classification problems, the overall performance can be measured by averaging the evaluation measures. Microaverage and macroaverage are used widely for this purpose. In this study, the microaveraged F1 (micro-F1) and macroaveraged F1 (macro-F1) measurements are also calculated to evaluate the experimental methods. The definitions of macro-F1 and micro-F1 are given in equations (22) and (23). In cases of unbalanced distribution, it is better to use micro-F1 than macro-F1 since the data size of categories is not considered in micro-F1 score calculation.
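The two averages can be illustrated with a short sketch based on the standard definitions: macro-F1 is the unweighted mean of per-category F1 scores, while micro-F1 is computed from the true positives, false positives, and false negatives summed over all categories. The function names and the per-category counts below are made up for illustration.

```python
def f1(tp, fp, fn):
    """F1 from counts; algebraically equal to 2PR / (P + R)."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

def macro_f1(counts):
    """Unweighted mean of the per-category F1 scores."""
    return sum(f1(*c) for c in counts.values()) / len(counts)

def micro_f1(counts):
    """F1 of the counts pooled over all categories."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1(tp, fp, fn)

# Hypothetical (tp, fp, fn) counts for a large and a small category.
counts = {"food": (90, 10, 5), "sport": (5, 5, 10)}
print(round(macro_f1(counts), 4))  # 0.6615
print(round(micro_f1(counts), 4))  # 0.8636
```

With these made-up counts, the large category dominates the pooled counts used by micro-F1, while macro-F1 gives both categories equal weight.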

Experiment Settings.
In this study, we carried out three experiments on the aforementioned datasets. All experiments were implemented on a 64-bit Windows 10 computer with 8 GB of memory. The experimental code was written in Python using scikit-learn (sklearn), a commonly used third-party machine learning module which encapsulates many common algorithms for regression, dimension reduction, clustering [44], and classification. For each dataset, preprocessing, term extraction, term weighting, term representation, and classification were carried out. In preprocessing, all documents were segmented into words by the open-source tool Jieba, and stop words were removed in this process. After that, a vector space model (VSM) was used for term representation using words as terms. Term weighting methods including TF-IDF and the proposed methods, i.e., TF-IADF, TF-IADFnorm, TF-IADF+, and TF-IADF+norm, were used here to form a final representation for each document. Finally, the NB, SVM, and RF classifiers were utilized for the TC purpose. Combinations of different term weighting methods and different classification algorithms on different datasets were compared for a brief analysis.

The results for the balanced dataset are shown in Figure 8. The details of the experimental results are shown in Table 5, where the proposed TF-IADF+norm demonstrates better performance than TF-IDF in all cases. Furthermore, each of the proposed methods outperformed TF-IDF in at least some cases.

Results' Analysis and Discussion
For the SVM classifier, the overall classification effect is better than that of the other two classifiers, and the micro-F1 value is over 94%. TF-IADF achieves the best result of 94.45% (0.31% higher than TF-IDF). For the RF classifier, TF-IADF also achieves the best result, and TF-IADFnorm and TF-IADF+norm show some improvement as well. The micro-F1 value of TF-IDF is 84.51%, while that of TF-IADF+norm is 85.52% and that of TF-IADFnorm is 85.26%. The micro-F1 value of TF-IADF reaches 86.37%, which is the best among all methods, an improvement of 1.86% over TF-IDF. For the NB classifier, both TF-IADF+ and TF-IADF+norm show improvements, with micro-F1 values of 92.26% and 92.33%, respectively, while that of TF-IDF is only 92.13%. TF-IADF+norm achieves the best effect with an increase of 1.2% over TF-IDF. In the classification results, we notice that the precision and recall of finance are relatively low, at only 80% and 85%, respectively. The reason may be that the terms of this category are not distinctive enough and the scope involved is relatively wide, so it may cover some content from the tourism and traffic categories, resulting in the poor classification effect for the whole category.
The results show that, in the case of a balanced dataset, which is not the focus of our design, the proposed methods can still improve the classifiers to a certain extent, even though the improvement is not large.

Unbalanced Dataset.
In this section, we investigate how TC performance is impacted by unbalanced datasets. The dataset here is specially designed. The size of food is increased to 5000, and the sizes of the other categories are kept the same as in the balanced dataset (1000), except sport. For sport, the size is gradually reduced from 400 to 50, decreasing by 50 at a time.
Definition 2 (balance ratio). It is the proportion between the sizes of the category with the smallest size and the category with the largest size.
In this section, the balance ratio is the proportion between the size of sport and that of food, which drops from 8% to 1% as the size of sport decreases. Experiments using the SVM classifier with TF-IDF and our proposed methods are carried out.
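The balance ratios used in this experiment follow directly from the category sizes given above, as the following small sketch shows. The function name `balance_ratio` and the dictionary layout are illustrative; the sizes are those stated in the text.

```python
def balance_ratio(category_sizes):
    """Ratio of the smallest category size to the largest category size."""
    return min(category_sizes.values()) / max(category_sizes.values())

# Sizes described in the text: food = 5000, the other categories = 1000,
# and sport shrinking from 400 to 50 in steps of 50.
OTHERS = ("education", "tourism", "traffic", "tech", "finance")

ratios = []
for sport_size in range(400, 0, -50):
    sizes = {name: 1000 for name in OTHERS}
    sizes["food"] = 5000
    sizes["sport"] = sport_size
    ratios.append(round(balance_ratio(sizes), 2))

print(ratios)  # [0.08, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02, 0.01]
```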
The experimental results show that performance on categories other than sport and food stays basically the same as the balance ratio changes. However, the recall and F1 score of sport and the precision of food are impacted heavily by the decreasing balance ratio. The details are shown in Tables 6-8. As shown, as the balance ratio decreases, the recall and F1 score of sport also decrease for all methods. However, the proposed TF-IADF and TF-IADFnorm outperformed TF-IDF in all cases. In particular, when the balance ratio decreased to an extreme value (1%), TF-IADFnorm achieved a recall of 0.7168, which is almost 170% of that of TF-IDF. As the balance ratio changed, the precision of the category with a relatively large size (food) was also impacted, as can be seen from Table 8.
Definition 3 (decline ratio). It is the ratio of the current value to the initial value in a declining trend.
To make the relationship between the balance ratio and the performance clear, the decline ratio is calculated. It is the ratio of the value at each balance ratio to its initial value (the value at 8%). This can be seen in Figure 9, where the ordinate refers to the decline ratio of the corresponding performance factor. As shown, as the balance ratio decreases, the decline becomes faster and faster. It is obvious that, for datasets with extreme categories such as food (relatively too large) and sport (relatively too small), the performance of TF-IADF and TF-IADFnorm is more stable than that of TF-IDF. The overall performance on this dataset is also impacted by the balance ratio. The details of micro-F1 and macro-F1 are shown in Tables 9 and 10, respectively. The proposed TF-IADF and TF-IADFnorm outperformed TF-IDF in all cases. To make clear how the balance ratio impacts the performance, we calculated the decline ratio, which is shown in Figure 10. As shown, with the decrease of the balance ratio, both micro-F1 and macro-F1 decrease gradually. For example, for micro-F1, TF-IADFnorm decreased by just 3.65%, while TF-IDF decreased by 8.06%, which is more than twice the decrease of the proposed TF-IADFnorm, meaning the performance of TF-IADFnorm is much more stable than that of TF-IDF on this dataset.
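The decline ratio of Definition 3 is simple to compute from a series of scores ordered from the initial balance ratio downward. The sketch below uses made-up scores, not values from the paper's tables, and the function name `decline_ratios` is illustrative.

```python
def decline_ratios(values):
    """Ratio of each value to the initial value (the first element of the series)."""
    initial = values[0]
    return [value / initial for value in values]

# Hypothetical recall values measured at decreasing balance ratios.
recalls = [0.8, 0.4, 0.2]
print(decline_ratios(recalls))  # [1.0, 0.5, 0.25]
```

A method whose decline ratios stay close to 1.0 as the balance ratio drops is the more stable one in the sense used above.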
Even though TF-IADF+ and its variant do not achieve improvement in this experiment, TF-IADF and its variant outperformed TF-IDF significantly. Furthermore, it can be seen from Figure 10 that our proposed TF-IADF and TF-IADFnorm are not only numerically better but also more stable than TF-IDF. Therefore, considering ADF in the term weighting method can improve the performance of text classification considerably in unbalanced cases.

Analysis.
For this corpus, the TF-IADF+ and TF-IADF+norm methods are more suitable for the NB classifier, and TF-IADF+norm is the best. For the RF and SVM classifiers, TF-IADF+norm and TF-IADF are more suitable. In addition, TF-IADF is better in the case of a balanced dataset, while TF-IADFnorm is better in the case of an unbalanced dataset. This means that the proposed TF-IADFnorm method is more sensitive and stable in the case of unbalanced datasets, while TF-IADF improves the classification effect more when the datasets are relatively balanced. Across the proposed formulas, it can be concluded that different classification algorithms may suit different formulas. It can also be concluded that the ADF index improves the effect of text classification, which confirms the conjecture; in particular, when the dataset is not evenly distributed, the classification effect is more stable.

Table 11 and Figure 11 show the micro-F1 and macro-F1 scores obtained on the Fudan corpus using the SVM, RF, and NB algorithms with different TWSs, while Tables 12-14 show the detailed results. For the NB classifier, TF-IADF+norm has the best performance, 89.38%, an increase of 0.39% compared to TF-IDF. Meanwhile, TF-IADF+ also achieves an increase of 0.32%.

Results' Analysis.
This means that, for Chinese text datasets, these two models are more suitable for an NB classifier and can improve the overall classification effect. The specific measurements are shown in Figure 12, and more details can be seen in Table 14; TF-IADF + norm achieves many of the highest performance scores. Furthermore, in categories where TF-IADF + norm is not as good as TF-IDF, the difference between their performance scores is relatively small. For example, in C32 and C35, where TF-IDF achieves the best precision score, the difference between the precision score of TF-IDF and that of TF-IADF + norm is less than 1%. However, in categories such as C17, where TF-IADF + norm achieves the highest precision score, the difference from the precision score of TF-IDF is more than 6%, which is more than six times the difference in categories where TF-IDF achieves the higher score. In fact, in C17, compared with TF-IDF, TF-IADF + norm improves the precision by 6.62% and the recall by 7.41%, a clear gain. Looking at the F1 scores overall, in all 20 categories except C35 and C5, the scores of TF-IADF + norm are not lower than those of TF-IDF. In particular, in categories where TF-IADF + norm gains the most, such as C16, the F1 score increases by nearly 10%.
For the RF classifier, the micro-F1 score obtained by TF-IADF is 81.95%, which is 1.63% higher than that obtained by TF-IDF. Meanwhile, TF-IADF + and TF-IADF norm also achieve improvements to different extents. The RF algorithm is known to have a certain randomness, but our proposed TF-IADF method is more stable and achieves better results, as can be seen in Figure 13. In the detailed results shown in Table 13, in some categories, such as C29 and C5, the precision score is improved by 25.98% and 30.16%, respectively, while the recall score remains basically unchanged. In terms of the F1 score, in some categories, such as C16 and C36, the score obtained by TF-IADF is about 10% higher than that obtained by TF-IDF. Furthermore, there are only three categories where the F1 score of TF-IADF is not as good as that of TF-IDF.
For the SVM classifier, TF-IADF and TF-IADF norm again achieve improved performance. Compared with TF-IDF in the micro-F1 score, TF-IADF increases by 0.73%, while TF-IADF norm increases by 0.63%. It can be seen in Figure 14 that, for the SVM classifier, these two improved methods are more stable and achieve some improvements. As shown in Table 12, there are many categories where the precision scores are very high, even up to 1. It can easily be seen in Figure 14 that the size (data amount) of those categories is very small. For example, there are only 32 training samples in C15. The reason for the high precision score is that, because the feature values are small, documents from other categories are not assigned to this category. However, it can also be concluded from the detailed results that, for most of those categories with a high precision but a small size, the recall score is significantly reduced; that is to say, many documents are assigned to the wrong categories, which may be due to the small number of features. For example, the recall score of C15 is only 0.09. TF-IADF and TF-IADF norm perform better in most of the categories, especially in C29, where the size is small. Although the precision scores obtained by the two methods decrease by about 4%, the recall scores improve by about 30%, which is significant. In addition, there are only two categories where the F1 scores of TF-IADF are lower than those of TF-IDF, while for most categories, TF-IADF is better.

Further Discussion.
Due to the unbalanced distribution of this corpus, all categories can be roughly divided into three groups by size for specific analysis. The first group comprises categories with a size in the range of 0 to 100, the second comprises categories with a size in the range of 101 to 1000, and the third comprises categories with a size over 1000. For the first group, because of the insufficient size of the training data, all term weighting methods show the same property, i.e., high precision and low recall scores, as in C16. Taking SVM combined with TF-IDF as an example, the precision score reaches 100%, while the recall score is only 3.57%, resulting in a poor classification effect. In short, it assigns a few samples correctly, while a large number of test samples are assigned to the wrong categories. The performance is similar for the RF and NB classifiers. This is the impact of the unbalanced distribution and the small size of the training data. For the second group, the size of the training set is larger. For example, in C11, when using the SVM classifier with TF-IDF, the precision score and recall score are 96.10% and 92.06%, respectively, which is a significant improvement compared to the first group. For the third group, such as C19, the precision score and recall score are 95.64% and 98.53%, respectively. It can be concluded that, in categories with a large training set, the precision score is relatively low, while the recall score is relatively high. In categories with a small training set, the precision is better but the recall score is very low. Under these conditions, our proposed methods show a similar but more stable performance compared to TF-IDF. Taking C29 as an example, there are only 57 training samples. Comparing TF-IDF with TF-IADF using the SVM classifier, the precision scores are 100% and 96.15%, while the recall scores are 8.48% and 42.37%, respectively.
It can be seen that although the precision of our algorithm is slightly reduced, the recall increases nearly fivefold, and the F1 score is also greatly improved, from 15.63% to 58.82%, a very obvious improvement.
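The C29 figures quoted above can be checked directly from the definition of the F1 score as the harmonic mean of precision and recall. The short Python sketch below uses only the precision and recall values reported in the text for the SVM classifier; the function and variable names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported in the text for C29 with the SVM classifier.
f1_tfidf = f1_score(1.0000, 0.0848)   # TF-IDF
f1_tfiadf = f1_score(0.9615, 0.4237)  # TF-IADF

# Ratio of the two recall scores.
recall_ratio = 0.4237 / 0.0848

print(f"TF-IDF  F1: {100 * f1_tfidf:.2f}%")
print(f"TF-IADF F1: {100 * f1_tfiadf:.2f}%")
print(f"recall ratio: {recall_ratio:.2f}")
```

The computed F1 values agree with the reported 15.63% and 58.82%, and the recall of TF-IADF is roughly five times that of TF-IDF.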
In the Fudan corpus, a phenomenon similar to that in the internet corpus can be seen. For example, the TF-IADF + norm method is the best one with the NB classifier, while TF-IADF + is also better than TF-IDF. With the RF and SVM classifiers, both TF-IADF and TF-IADF norm achieve a relatively stable performance. The difference is that TF-IADF achieves the best effect on the unbalanced Fudan dataset. It is also noticed that although the micro-F1 score of the SVM classifier is the best, its macro-F1 score, which is about 50%, is not as good as that of the NB classifier, which is over 65%. That is to say, although the overall accuracy of the SVM classifier is high, the NB classifier performs better when each category is regarded as equally important.
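The gap between micro-F1 and macro-F1 on an unbalanced corpus can be illustrated with a small Python sketch. It uses the standard definitions (micro-F1 pools the counts over all classes, macro-F1 averages the per-class F1 scores); the two-class counts below are hypothetical and are not taken from the Fudan results.

```python
def f1_from_counts(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(counts):
    """counts maps a class label to a (tp, fp, fn) tuple."""
    total_tp = sum(c[0] for c in counts.values())
    total_fp = sum(c[1] for c in counts.values())
    total_fn = sum(c[2] for c in counts.values())
    micro = f1_from_counts(total_tp, total_fp, total_fn)
    macro = sum(f1_from_counts(*c) for c in counts.values()) / len(counts)
    return micro, macro


# Hypothetical counts: a large class of 100 documents and a small class of 10.
# 2 large documents are assigned to the small class, 8 small ones to the large class.
counts = {"large": (98, 8, 2), "small": (2, 2, 8)}
micro, macro = micro_macro_f1(counts)
print(f"micro-F1 = {micro:.4f}, macro-F1 = {macro:.4f}")
```

In this toy case micro-F1 is about 0.91 while macro-F1 is about 0.62, because the poorly classified small class counts as much as the large one in the macro average.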

Results on Reuters-21578.
The overall performance of all methods on Reuters-21578 is shown in Table 15, while Figure 15 shows a brief comparison between all proposed methods. Detailed results are given in Tables 16-18. For the SVM classifier, TF-IADF norm gives the best performance, which is 0.86% higher than TF-IDF in the micro-F1 score. According to the detailed results shown in Table 15 and Figure 16, the improvement of TF-IADF norm comes mainly from the recall score, which is greatly improved in the categories of crude, grain, interest, and money-fx. In the category of money-fx in particular, the recall score increases by 10.35%. The other categories also improve to different extents, namely, crude increases by 3.31%, interest increases by 4%, and grain increases by 10%. From the perspective of the F1 score, of the eight categories in total, the maximum F1 score of five categories is obtained by TF-IADF norm, and the largest increase is obtained in the category of grain, where the F1 score increases from 57.14% for TF-IDF to 66.67% for TF-IADF norm, a gain of close to 10%.
For the NB classifier, the results differ from those on the Chinese dataset, where TF-IADF + and TF-IADF + norm achieved better performance; on this corpus, the performances of TF-IADF + and TF-IADF + norm are worse. However, the proposed TF-IADF and TF-IADF norm show an improvement, with micro-F1 scores 0.59% and 0.5% higher than that of TF-IDF, respectively. As shown in Table 16 and Figure 17, TF-IADF achieves the highest F1 score among all methods in all categories of this corpus, and the maximum improvement, 3.83%, occurs in the category of interest. Specifically, in terms of precision, the four categories of crude, interest, ship, and trade improve significantly, by 2.36%, 4.35%, 6.82%, and 3.75%, respectively. In terms of the recall score, the increase in the money-fx category is 4.60%.
For the RF algorithm, TF-IADF + norm and TF-IADF norm improve the effect, with micro-F1 scores 0.69% and 0.32% higher than that of TF-IDF, respectively. TF-IADF, which performs better on the Chinese dataset, performs worse on this corpus. It can be seen in Table 17 and Figure 18 that the better performance of TF-IADF + norm is mainly due to the higher recall score. Compared with TF-IDF, the recall score obtained in the trade category is 6.52% higher, and in the crude category, it is 2.48% higher. In terms of precision, it achieves good performance in the categories of interest and trade, with increases of 10.54% and 8.52%, respectively. In terms of the F1 score, the trade category improves significantly, by up to 7.59%. However, we also see a decrease in the performance on the categories of grain and money-fx. This may be due to the small size of the test dataset, which makes the scores fluctuate greatly, so it is not easy to draw more accurate conclusions. According to all experimental results on Reuters-21578, the proposed TF-IADF norm outperformed TF-IDF in nearly all conditions except the micro-F1 score of the RF classifier. In fact, the RF classifier's performance was the worst among all classifiers, suggesting that the RF classifier may not be suitable for this corpus.
Figure 15: Performance of the NB classifier on Reuters-21578 (precision, recall, and F1 score).

Discussion.
The experimental results obtained on the English dataset (Reuters) are somewhat different from those on the Chinese dataset, and the most effective combination of a term weighting method and a classification algorithm is different. For example, TF-IADF + norm is suitable for the Chinese internet corpus using the NB algorithm, whereas TF-IADF performs best on the English unbalanced dataset. It can be inferred from the experimental results that each classification algorithm can generally be paired with one of the proposed mathematical models that performs better than the original TF-IDF. For example, TF-IADF + norm performs better with the RF algorithm, while TF-IADF norm performs better with the SVM. The best combinations concluded from the experiments are shown in Table 19. For the two Chinese datasets in the experiment, we draw the following conclusions: (1) for the NB algorithm, TF-IADF + and TF-IADF + norm are more suitable, and TF-IADF + norm can achieve better performance on both balanced and unbalanced datasets; (2) for the RF algorithm and SVM algorithm, TF-IADF and TF-IADF norm can achieve a relatively stable improvement. In the experiments with the internet corpus, it can be concluded that TF-IADF gives a greater improvement when the dataset is relatively balanced, and TF-IADF norm gives a better classification effect when the dataset is unbalanced. However, on the Fudan corpus (unbalanced), TF-IADF achieves better performance than TF-IADF norm, although TF-IADF norm also improves on TF-IDF. It is possible that TF-IADF is more suitable for the RF and SVM algorithms when there are many unbalanced categories in the corpus.

Conclusions
In this paper, an improved TF-IDF with novel term weighting schemes is proposed to greatly reduce the impact of the unbalanced distribution of datasets. It is easy to observe that, in all unbalanced corpora, categories with a larger amount of data affect the classification effect because of their larger number of feature words. In such a category, the precision score decreases, because documents from other categories are easily mistaken for this category.
This is because, as the training set grows, some feature words with strong representational ability in other categories also appear in this category, which causes errors. Meanwhile, categories with a smaller amount of data show a significant reduction in the recall score because of the insufficient collection of feature words. These categories cannot be classified well without sufficient representation in the training process. As a result, many documents are assigned to wrong categories. In these cases, the proposed methods can increase the weight of feature words whose document frequency is close to the average value, while reducing the weight of low-frequency and high-frequency words, in order to obtain better results. The simulation results show that the proposed methods with ADF are more effective than the original TF-IDF, although different mathematical models may be needed when different classification algorithms are used. In particular, in the experiments designed so that the size of the sport category decreases while other conditions stay the same, the results show that the proposed methods perform better and are more stable than the well-known TF-IDF on the unbalanced corpus. It can also be concluded that our proposed methods perform better than TF-IDF on the balanced dataset. The document frequency of specific words may vary across categories, even when the training sets appear roughly the same. Methods with ADF can weight those words more reasonably and form a more representative model.
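The ADF formula and the TF-IADF variants are defined earlier in the paper and are not reproduced here. Purely to illustrate the idea of measuring how far a term's document frequency lies from the average, the following Python sketch counts document frequencies on a toy collection and computes each term's squared deviation from the mean DF. The function names, the toy documents, and the use of a plain squared deviation are assumptions made for illustration; this is not the paper's weighting formula.

```python
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # each document counts a term at most once
    return dict(df)


def df_deviation_from_mean(df):
    """Squared deviation of each term's DF from the mean DF over all terms."""
    mean_df = sum(df.values()) / len(df)
    return {term: (value - mean_df) ** 2 for term, value in df.items()}


# Toy collection (hypothetical): "the" is very frequent, "zebra" is rare.
docs = [
    ["the", "team"],
    ["the", "team"],
    ["the", "team", "zebra"],
    ["the"],
    ["the"],
    ["the"],
]
df = document_frequencies(docs)
deviation = df_deviation_from_mean(df)
closest_to_mean = min(deviation, key=deviation.get)
print(df)
print(closest_to_mean)
```

Here the mid-frequency term "team" lies closest to the mean DF, while the very frequent "the" and the rare "zebra" lie far from it, which is the kind of distinction the conclusions above attribute to ADF-based weighting.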
However, for the purpose of TC on internet media reports, this paper focuses only on the term weighting scheme under unbalanced distribution and ignores linguistic characteristics, which might be helpful in the term extraction process. Therefore, our next study will focus on enhancing the TC performance by combining the proposed methods with language characteristics.
Data Availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.