Research on the TF-IDF algorithm combined with semantics for automatic extraction of keywords from network news texts

As the number of online news texts continues to increase, automatic keyword extraction has become key to giving users fast access to the content they want. This article first introduced two common algorithms: term frequency-inverse document frequency (TF-IDF) and TextRank. Then, a news title weight was added to the TF-IDF algorithm according to the characteristics of network news text. Moreover, a new automatic extraction algorithm was designed by applying Word2vec to extract semantics. The experimental results demonstrated that on the ACE2005 dataset, as the number of automatically extracted keywords increased, the precision of the TF-IDF, TextRank, and semantics-combined TF-IDF algorithms gradually decreased, and the recall rates gradually increased. When five keywords were extracted, the gap between the semantics-combined TF-IDF algorithm and the other two algorithms was the largest, and its precision, recall rate, and F-measure were 72.77, 78.64, and 75.59%, respectively. Finally, the F-measure of the semantics-combined TF-IDF algorithm reached 81% for network news texts. The experimental results support the performance of the semantics-combined TF-IDF algorithm in automatically extracting keywords from network news texts, and it has promising applications in practice.


Introduction
The term "online news text" refers to news texts disseminated through the Internet, which spread faster and reach a wider audience than traditional news texts. With the Internet's ongoing evolution, the number of news texts on the network has increased significantly, which lets people obtain news more quickly and directly but also makes it harder for them to find the news they want. Keywords help readers quickly comprehend the main content of a text and thus improve search efficiency, and they play a significant role in fields such as text categorization and information retrieval [1]. With the massive growth of information, manual annotation has become increasingly unable to meet current needs; thus, algorithms for automatic keyword extraction have been widely studied [2]. Compared with ordinary texts, network news texts differ in text structure and writing techniques, so existing algorithms for automatic keyword extraction are not fully applicable. In this article, the traditional term frequency-inverse document frequency (TF-IDF) method was combined with the Word2vec word vector model to improve semantic extraction, and the performance of this combined approach was verified through experiments on a dataset. The research in this article provides a new reliable method for automatically extracting keywords from online news texts, which can serve as the foundation for classifying and retrieving online news texts, thereby further enhancing the efficiency of processing such texts.

Related works
Thiyagarajan et al. [3] studied three popular keyword extraction techniques, namely the rapid automatic keyword extraction, TF-IDF, and semantic fingerprinting algorithms, and found through experiments that the TF-IDF algorithm had the strongest correlation with human assessment. Li et al. [4] designed a new unsupervised method for Weibo texts by combining two hashtag enhancement algorithms and found through experiments that the method was accurate. Yang et al. [5] introduced a word network based on the relationship between sentences. Experiments showed that their new word-sentence approach was superior to the classical TF-IDF and TextRank algorithms in precision and recall rate. Hassani et al. [6] conducted a study on video text mining and proposed a new key phrase extraction method that considered the local and global features of every candidate phrase; experiments on five datasets in English and Persian showed that the method performed better in aspects such as precision. Okada et al. [7] extended a multi-keyword pattern matching machine, called the Aho and Corasick machine, and proposed an effective substring search method to achieve keyword extraction. The simulation results showed that the method had good performance. Azcarraga et al. [8] put forward an approach called liGHtSOM, based on analyzing how weights are distributed in the weight vector of the training graph and on simple operations of the random projection matrix applied for input data compression. The experiment showed that the keywords obtained by the approach were highly accurate. Tixier et al. [9] introduced an unsupervised technique using graph degeneracy, carried out experiments on documents of different sizes, and obtained good performance. Campos et al. [10] proposed YAKE, a lightweight unsupervised keyword extraction approach that uses statistical characteristics of the text from a single document to select the most significant keywords within it, and demonstrated the advantages of the method through experiments on 20 datasets. Yan et al. [11] integrated eye movement signals with electroencephalogram (EEG) signals and utilized neural networks to automatically extract keywords from microblogs. They verified the collaborative effect of EEG and eye movement signals through experiments. Zhang and Zhang [12] introduced a method that utilizes human reading time for keyword extraction. They extracted fixation durations from publicly available language resources and designed two neural network models for keyword extraction. The effectiveness of the proposed method was demonstrated through both quantitative and qualitative experiments. Zhang et al. [13] developed a neural framework for extracting keyphrases, which obtains indicative representations through conversation context encoders and inputs them into the keyphrase table to extract important words. The experiment found that this method performed better than previous models.
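TF refers to term frequency:

$$\mathrm{TF}_{i,j} = \frac{N_{i,j}}{\sum_{m=1}^{k} N_{m,j}},$$

where $N_{i,j}$ denotes the occurrence frequency of word $i$ in text $d_j$ and $k$ denotes the quantity of different words in text $d_j$.

IDF refers to inverse document frequency:

$$\mathrm{IDF}_{i} = \log \frac{|D|}{|\{j : t_i \in d_j\}|},$$

where $|D|$ is the total quantity of texts in the corpus and $|\{j : t_i \in d_j\}|$ refers to the quantity of texts containing word $i$ in the corpus.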
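The TF-IDF value is obtained by

$$\mathrm{TF\text{-}IDF}_{i,j} = \mathrm{TF}_{i,j} \times \mathrm{IDF}_{i}.$$

If a word has a high TF value and a high IDF value, i.e., it occurs often in the text but in few texts of the corpus, the word is considered highly critical [16]. This method is simple to operate and widely used [17].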
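The TF-IDF calculation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the experiments: the function name `tf_idf` and the toy corpus are hypothetical, the texts are assumed to be already segmented and stop-word filtered, and the natural logarithm without smoothing is an assumption.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Return one {word: TF-IDF} dict per tokenized text in `corpus`."""
    n_docs = len(corpus)
    # Document frequency: number of texts that contain each word.
    df = Counter(word for doc in corpus for word in set(doc))
    scores = []
    for doc in corpus:
        counts = Counter(doc)
        total = sum(counts.values())
        scores.append({
            # TF = count / total words in the text; IDF = log(|D| / df).
            word: (count / total) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return scores

# Hypothetical toy corpus of three already-segmented texts.
corpus = [
    ["iphone", "price", "drop", "price"],
    ["iphone", "camera", "upgrade"],
    ["stock", "market", "price"],
]
scores = tf_idf(corpus)
```

In the first toy text, "drop" outranks "price" even though "price" occurs twice, because "price" also appears in another text of the corpus and is therefore penalized by a lower IDF.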
The TextRank algorithm is a refined version of the PageRank algorithm [18]. The principle of PageRank is that if many other web pages link to a web page, that page is relatively important, which means its PageRank value is high. PageRank is calculated using the following equation:

$$S(V_i) = (1 - d) + d \sum_{V_j \in \mathrm{In}(V_i)} \frac{1}{|\mathrm{Out}(V_j)|} S(V_j),$$

where $S(V_i)$ is the PR value of web page $V_i$, $V_j$ is a web page linking to $V_i$, i.e., an inbound link, $\mathrm{In}(V_i)$ is the set of inbound links, $|\mathrm{Out}(V_j)|$ is the quantity of elements in the set of links pointing to external web pages in web page $V_j$, and $d$ is the damping factor.
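As an illustration of the PageRank principle, the following sketch iterates the score update on a tiny link graph. The function name `pagerank`, the fixed number of iterations, the initial score of 1.0, and the four-page example web are assumptions made for this sketch; the damping factor 0.85 follows the usual choice.

```python
def pagerank(out_links, d=0.85, iterations=50):
    """Iterate S(V_i) = (1 - d) + d * sum over inbound V_j of S(V_j) / |Out(V_j)|."""
    pages = list(out_links)
    scores = {p: 1.0 for p in pages}
    # Invert the outbound-link map to obtain the inbound set In(V_i) of each page.
    in_links = {p: [] for p in pages}
    for src, targets in out_links.items():
        for dst in targets:
            in_links[dst].append(src)
    for _ in range(iterations):
        scores = {
            p: (1 - d) + d * sum(scores[q] / len(out_links[q]) for q in in_links[p])
            for p in pages
        }
    return scores

# Hypothetical four-page web: A, B, and C all link to D; D links back to A.
web = {"A": ["D"], "B": ["D"], "C": ["D"], "D": ["A"]}
ranks = pagerank(web)
```

Page D, which receives the most inbound links, ends up with the highest score, and pages B and C, which receive none, stay at the baseline value of 1 - d.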
TextRank is a graph-based ranking algorithm. It treats sentences or words in a text as nodes of a graph, treats the relationships between them as edges, and determines their importance by calculating the weights between the nodes.
The calculation formula of the PageRank-based TextRank algorithm is

$$WS(V_i) = (1 - d) + d \sum_{V_j \in \mathrm{In}(V_i)} \frac{W_{ji}}{\sum_{V_k \in \mathrm{Out}(V_j)} W_{jk}} WS(V_j),$$

where $WS(V_i)$ is the weight of sentence $i$, $W_{ji}$ refers to the similarity between sentences $j$ and $i$, and $d$ is the damping factor, usually 0.85.
A text is first segmented into sentences. After preprocessing, the candidate keyword graph $G = (V, E)$ is built, where $V$ is the set of nodes, i.e., the obtained candidate keywords, and $E$ is the set of edges, each indicating a co-occurrence relationship between two nodes. Subsequently, the node weights are calculated according to the above formula to obtain the $T$ most important words.
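The keyword variant of TextRank described above can be sketched as follows. This is a simplified illustration under stated assumptions: the function name `textrank_keywords`, the co-occurrence window of two adjacent words, the use of co-occurrence counts as edge weights, and the example word list are all hypothetical choices, not details given in this article.

```python
from collections import defaultdict

def textrank_keywords(words, window=2, d=0.85, iterations=50, top_t=3):
    """Rank candidate words on an undirected co-occurrence graph G = (V, E)."""
    # Edge weight: how often two different words co-occur within `window` words.
    weights = defaultdict(lambda: defaultdict(float))
    for i, a in enumerate(words):
        for b in words[i + 1 : i + window]:
            if a != b:
                weights[a][b] += 1.0
                weights[b][a] += 1.0
    scores = {w: 1.0 for w in weights}
    for _ in range(iterations):
        # Each neighbor u passes on its score in proportion to the edge weight.
        scores = {
            v: (1 - d) + d * sum(
                weights[u][v] / sum(weights[u].values()) * scores[u]
                for u in weights[v]
            )
            for v in weights
        }
    return sorted(scores, key=scores.get, reverse=True)[:top_t]

# Hypothetical preprocessed word sequence.
words = ["iphone", "price", "drop", "iphone", "price", "reduction", "iphone", "sale"]
top_words = textrank_keywords(words)
```

Because the graph is undirected, the inbound and outbound neighbor sets of a node coincide, so one adjacency map serves both roles in the update.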
Most ordinary texts consist of a single body of text, while online news texts are generally composed of a title and a body. The title of a news text is usually a concise summary of its main content. To further enhance the performance of automatic keyword extraction from online news texts, this article improves the TF-IDF algorithm by taking the title into account and combining semantics.
First, consideration of the title is added when calculating the importance of words:

$$\mathrm{HF}_{i,j} = \frac{n_{i,j}}{\sum_{m=1}^{k} n_{m,j}},$$

where $n_{i,j}$ means the quantity of word $i$ in the title of news text $d_j$ and $k$ is the quantity of different words in the title of $d_j$.
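A minimal sketch of how a title term could enter the word weight is given below. The text above does not state how the title frequency is merged with TF and IDF, so the rule used here, (TF + HF) × IDF with TF computed over the body, is purely an assumption for illustration, as are the function names and the IDF values.

```python
from collections import Counter

def frequency(tokens):
    """Relative frequency of each word: its count divided by the total count."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def tf_hf_idf(title, body, idf):
    """Combine body frequency, title frequency, and IDF into one weight per word."""
    tf = frequency(body)   # term frequency in the body
    hf = frequency(title)  # frequency of the word in the title
    # ASSUMPTION: the title term is added to TF before multiplying by IDF;
    # the article's actual combination rule is not recoverable from the text.
    return {
        word: (tf.get(word, 0.0) + hf.get(word, 0.0)) * idf.get(word, 0.0)
        for word in set(tf) | set(hf)
    }

# Hypothetical segmented title and body, with made-up IDF values.
title = ["iphone", "price", "drop"]
body = ["iphone", "price", "store", "price"]
idf = {"iphone": 1.0, "price": 1.0, "drop": 2.0, "store": 2.0}
weights = tf_hf_idf(title, body, idf)
```

Under this assumed rule, a word such as "drop" that appears only in the title still receives a nonzero weight, which reflects the idea that title words summarize the news.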
The TF-HF-IDF is combined with the Word2vec model [19] to improve the extraction of semantics. Suppose there is a text $d = \{t_1, t_2, \ldots, t_n\}$; the word vector obtained for each word $t_i$ after training with the Word2vec model is $e_i$. Then, the word vector is weighted. The obtained vector is expressed as

$$e_i' = w_i e_i,$$

where $w_i$ denotes the TF-HF-IDF value of word $i$ in the text, which is used as the initial weight of the word.
The specific process of the method is as follows. After the text is processed by word segmentation and stop word elimination, the TF-HF-IDF value is computed, and then the individual words are represented by Word2vec word vectors. After that, the semantic similarity between the processed words is calculated. The set of semantic topic concepts, i.e., the sets of words with similar semantics, is obtained by the hierarchical clustering algorithm [20]. Finally, the comprehensive weight value is calculated, with

$$\mathrm{sim}(t_i, t_j) = \cos(e_i, e_j) = \frac{e_i \cdot e_j}{\|e_i\| \, \|e_j\|}, \qquad (10)$$

where $\sum_{j \neq i} \mathrm{sim}(t_i, t_j)$ refers to the sum of the semantic similarity between word $t_i$ and the other words, $\mathrm{sim}(t_i, t_j)$ is the Word2vec-based semantic similarity between words $t_i$ and $t_j$, $e_i$ is the word vector of $t_i$, and $e_j$ is the word vector of $t_j$.
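The cosine-based similarity and the per-word similarity sum can be illustrated with a short sketch. The two-dimensional vectors below are hypothetical stand-ins for trained Word2vec vectors, and the function names are chosen only for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similarity_sums(vectors):
    """For each word t_i, sum sim(t_i, t_j) over all other words t_j."""
    return {
        wi: sum(cosine(ei, ej) for wj, ej in vectors.items() if wj != wi)
        for wi, ei in vectors.items()
    }

# Hypothetical 2-dimensional vectors; trained Word2vec vectors would be much longer.
vectors = {
    "price": [1.0, 0.0],
    "discount": [0.8, 0.6],
    "camera": [0.0, 1.0],
}
sums = similarity_sums(vectors)
```

A word that is semantically close to many other words in the text, here "discount", obtains the largest similarity sum.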
Finally, according to the comprehensive weight of words, the word with the highest weight in every semantic topic concept set is used as a keyword to obtain the keyword set of the document.
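The final selection step, grouping words into semantic topic concept sets and keeping the highest-weighted word of each set, might look like the sketch below. The article cites hierarchical clustering [20] without giving the linkage criterion or stopping rule, so the average linkage and the similarity threshold of 0.7 are assumptions; the comprehensive weights are supplied as hypothetical input values rather than computed.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def cluster_words(vectors, threshold=0.7):
    """Agglomerative clustering; `threshold` is a hypothetical stopping point."""
    clusters = [[word] for word in vectors]

    def linkage(a, b):
        # ASSUMPTION: average linkage over all cross-cluster word pairs.
        sims = [cosine(vectors[x], vectors[y]) for x in a for y in b]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        best, i, j = max(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if best < threshold:
            break  # no remaining pair of clusters is similar enough to merge
        merged = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i] = clusters[i] + merged
    return clusters

def select_keywords(vectors, weights, threshold=0.7):
    """Keep the word with the highest comprehensive weight in each cluster."""
    return [max(c, key=weights.get) for c in cluster_words(vectors, threshold)]

# Hypothetical word vectors and comprehensive weights.
vectors = {
    "price": [1.0, 0.0], "discount": [0.9, 0.1],
    "camera": [0.0, 1.0], "pixel": [0.1, 0.9],
}
weights = {"price": 0.9, "discount": 0.4, "camera": 0.5, "pixel": 0.7}
keywords = select_keywords(vectors, weights)
```

With these toy values, the words fall into a pricing cluster and a camera cluster, and one keyword is taken from each, which avoids returning two near-synonyms as separate keywords.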

Experimental analysis
The experiments were conducted on a Windows 7 system with 4 GB of memory. Word segmentation was performed with the Institute of Computing Technology, Chinese Lexical Analysis System (ICTCLAS) [21]. The algorithm was implemented in Java. The experimental dataset came from the ACE2005 corpus [22], containing news reports from Xinhua News Agency and China National Radio. Table 1 presents the statistics of the corpus. There were 500 texts in the dataset. The semantics-combined TF-IDF algorithm was compared with the TF-IDF and TextRank algorithms. The evaluation indexes were as follows.
(1) Precision: P = A/B, where A is the quantity of keywords extracted correctly by the algorithm and B is the quantity of all keywords extracted by the algorithm.
(2) Recall rate: R = A/C, where C is the actual total number of keywords.
(3) F-measure: F-measure = 2PR/(P + R), indicating the overall performance of an algorithm.

First, for Word2vec, the chosen dimension of word vectors affects the results. With all other conditions held constant, the performance of the proposed method with different dimensions (64, 96, 128, and 200) was compared. Five keywords were extracted, and the outcomes are presented in Table 2.
As the word vector dimension increased, the training time of the algorithm gradually increased. When the dimension was 200, the training time was 12.24 min, 14.71% longer than at 128 dimensions, while the precision was 71.77%, only 0.61% higher than at 128 dimensions. This indicated that the training time increased significantly, but the improvement in precision was limited. Therefore, the word vector dimension was set to 128 in the following experiments.
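The three evaluation indexes defined above (P = A/B, R = A/C, and F-measure = 2PR/(P + R)) can be computed per text as in the following sketch. The function name `evaluate` and the example keyword lists are hypothetical.

```python
def evaluate(extracted, labeled):
    """Return (precision, recall, F-measure) for one text."""
    extracted, labeled = set(extracted), set(labeled)
    a = len(extracted & labeled)                  # A: correctly extracted keywords
    p = a / len(extracted) if extracted else 0.0  # P = A / B
    r = a / len(labeled) if labeled else 0.0      # R = A / C
    f = 2 * p * r / (p + r) if p + r else 0.0     # F-measure = 2PR / (P + R)
    return p, r, f

# Hypothetical example: five extracted keywords, three of which match the labels.
extracted = ["iphone", "price", "drop", "store", "color"]
labeled = ["iphone", "price", "drop", "e-commerce", "pro"]
p, r, f = evaluate(extracted, labeled)
```

When the number of extracted keywords equals the number of labeled keywords, as in this example, precision and recall coincide; they diverge once more or fewer keywords are extracted than were labeled.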
Next, the impact of the number of extracted keywords on algorithm performance was examined. The number of extracted keywords ranged from 1 to 10, and the variation in precision is presented in Figure 1.
Figure 1 shows that when only one keyword was extracted, the precision of all three algorithms was close to 100%, indicating that all three algorithms performed well in this case. When the number of extracted keywords reached five, the gap between the semantics-combined TF-IDF algorithm and the TF-IDF and TextRank algorithms started to widen; at this point, the precision of the TF-IDF and TextRank algorithms was 70.12 and 70.23%, respectively, while the precision of the semantics-combined algorithm reached 72.77%, which was 2.65 and 2.54 percentage points higher, respectively. When the number of keywords reached ten, all three algorithms reached their lowest precision: 19.56, 19.87, and 25.12%, respectively. The variation of the recall rate of different algorithms is shown in Figure 2.
Figure 2 shows that, contrary to the precision, the recall rates of the algorithms gradually improved as the number of automatically extracted keywords increased. As with the precision, however, the gap between algorithms became obvious when the number of keywords reached five; at this point, the recall rates of the TF-IDF and TextRank algorithms were 74.28 and 75.34%, respectively, while the recall rate of the semantics-combined algorithm was 78.64%, which was 4.36 and 3.30 percentage points higher, respectively. When the number reached ten, the recall rate of all three algorithms was around 90%.
Finally, the F-measure of different algorithms was compared, as shown in Figure 3. Figure 3 shows that when the number of keywords was small, the F-measures of the algorithms were almost the same. When five keywords were extracted, the F-measure of the semantics-combined algorithm was 3.45 and 2.89 percentage points higher than those of the other two algorithms, respectively. When the number of keywords reached ten, the F-measures of the TF-IDF and TextRank algorithms were 32.09 and 32.55%, respectively, while the F-measure of the semantics-combined algorithm was 39.39%. It was concluded from Figures 1-3 that the semantics-combined algorithm had the best performance when the number of extracted keywords was five.
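As a quick arithmetic check, the F-measures at five keywords can be recomputed from the precision and recall values reported above using F-measure = 2PR/(P + R). The dictionary layout and names are only for illustration.

```python
def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# (precision, recall) in percent at five extracted keywords, as reported above.
reported = {
    "TF-IDF": (70.12, 74.28),
    "TextRank": (70.23, 75.34),
    "semantics-combined TF-IDF": (72.77, 78.64),
}
f_scores = {name: round(f_measure(p, r), 2) for name, (p, r) in reported.items()}
```

This gives 72.14, 72.70, and 75.59 for the three algorithms, which agrees with the reported F-measure of 75.59% and with the stated gaps of 3.45 and 2.89 percentage points.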
To further understand the effect of the semantics-combined TF-IDF algorithm on the automatic extraction of keywords from web news texts, 500 articles were crawled from news web pages with a crawler built on the Scrapy framework. Ten keywords were manually labeled. The F-measures under different numbers of extracted keywords are compared in Figure 4.
Figure 4 shows that the F-measure of the semantics-based approach was higher than that of the other approaches in automatically extracting keywords from the 500 crawled online news texts. When five keywords were extracted, the F-measure of the proposed algorithm reached its highest value, 81.13%, which was 13.06 percentage points higher than the TF-IDF approach and 7.8 percentage points higher than the TextRank approach, further supporting the performance of the proposed method.
Wen et al. [23] proposed an optimized weighted TextRank algorithm to extract keywords. The outcomes of that method when five keywords were extracted are displayed in Table 3. In Table 3, the precision, recall rate, and F-measure of the weighted TextRank approach were the highest when the weight value was 0.5. For the case of five extracted keywords, Table 4 compares how much the weighted TextRank method and the semantics-based TF-IDF method each improved on the TextRank method for the different indicators.
In Table 4, the object of study in the literature [23] was 500 news articles crawled from Sohu news, and the object of this article was 500 randomly crawled online news articles. The amount of data in the two datasets was similar. Compared to the TextRank method, the weighted TextRank method improved by 0.0296, 0.0532, and 0.03805 in precision, recall rate, and F-measure, respectively, while the semantics-based TF-IDF method improved by 0.077, 0.07797, and 0.078 in precision, recall rate, and F-measure, respectively. The comparison revealed that the improvement of the semantics-based TF-IDF approach over TextRank was larger than that of the weighted TextRank approach, indicating that the semantics-based TF-IDF method was more beneficial to improving the keyword extraction effect than the weighted TextRank method.
Taking one of the online news texts entitled "iPhone 14 fastest price drop: record-breaking speed" as an example, the keyword extraction effect of the semantics-based TF-IDF method was analyzed.The text content is as follows.
After the full iPhone 14 series went on sale, the offline price of the two standard models fell below the initial offer price. Even with heavyweight upgrades such as the Dynamic Island and a 48-megapixel camera, the premium for the two Pro models also fell rapidly after the launch; spot goods at the original price were available for some models and colors offline, and the service break can be expected soon.
Judging by the price trend in the last 2 days, iPhone 14 has seen a big drop on e-commerce platforms, and the offline spot price has also seen a new low. iPhone 14 has dropped by about 600 yuan, iPhone 14 Plus has dropped by about 800 yuan, iPhone 14 Pro has also dropped slightly, and the price of the high-capacity version has dropped a little more.
After analysis, experts believe that compared to the iPhone 12 and iPhone 13 series in the previous 2 years, the lack of price reduction in the last month after the release means that the iPhone 14 is the model with the fastest price reduction in recent years.
The manually labeled keywords and the keywords extracted by the TF-IDF, TextRank, and semantics-based TF-IDF methods are shown in Table 5. Table 5 shows that for this online news text, the TF-IDF and TextRank approaches each extracted three keywords correctly, while their other two keywords differed from the manually labeled results. In contrast, the keywords extracted by the semantics-combined algorithm were consistent with the manually labeled ones, which further supported the reliability of the TF-IDF algorithm combined with semantics.

Conclusion and future works
This article designed a semantics-combined TF-IDF algorithm that incorporates title weights and the Word2vec word vector model to improve the performance of automatic keyword extraction. Experiments showed that this method had advantages in precision and recall rate and achieved greater improvements in precision, recall rate, and F-measure than existing methods when extracting keywords. The case study showed that the extracted keywords matched the manually labeled keywords better. The semantics-combined TF-IDF algorithm can be further applied in the real world. However, this study also has some limitations, such as the small number of languages studied and the small number of texts. In future work, the applicability of the proposed method will be investigated in more languages, and the scale of experiments will be further expanded to determine the reliability of the method.

Figure 1: Comparison of precision between different algorithms.

Figure 2: Comparison of recall rates of different algorithms.

Figure 3: Comparison of F-measure between different algorithms.

Figure 4: Comparison of F-measure between different algorithms for the automatic extraction of keywords from online news texts.

Table 2: The effect of word vector dimensions on the algorithm performance.

Table 3: Comparison between the TextRank and weighted TextRank methods.

Table 4: Comparison between the weighted TextRank method and the semantics-based TF-IDF method.

Table 5: Example of automatic keyword extraction results.