Next Article in Journal
Evaluation of Contextual and Game-Based Training for Phishing Detection
Next Article in Special Issue
Mapping Art to a Knowledge Graph: Using Data for Exploring the Relations among Visual Objects in Renaissance Art
Previous Article in Journal
Detecting IoT Attacks Using an Ensemble Machine Learning Model
Previous Article in Special Issue
Coarse-to-Fine Entity Alignment for Chinese Heterogeneous Encyclopedia Knowledge Base
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Cross-Domain Transfer Learning Prediction of COVID-19 Popular Topics Based on Knowledge Graph

1
Institute of Microelectronics of the Chinese Academy of Sciences, Beijing 100029, China
2
School of Information, University of International Business and Economics, No.10 Huixin East Street, Chaoyang District, Beijing 100029, China
3
Xiangsihu College of Guangxi University for Nationalities, Nanning 530031, China
4
School of Microelectronics, University of Chinese Academy of Sciences, Beijing 100049, China
*
Author to whom correspondence should be addressed.
Future Internet 2022, 14(4), 103; https://doi.org/10.3390/fi14040103
Submission received: 20 February 2022 / Revised: 20 March 2022 / Accepted: 20 March 2022 / Published: 24 March 2022
(This article belongs to the Special Issue Knowledge Graph Mining and Its Applications)

Abstract

:
The significance of research on public opinion monitoring of social network emergencies is becoming increasingly important. As a platform for users to communicate and share information online, social networks are often the source of public opinion about emergencies. Considering the relevance and transmissibility of the same event in different social networks, this paper takes the COVID-19 outbreak as the background and selects the platforms Weibo and TikTok as the research objects. In this paper, first, we use the transfer learning model to apply the knowledge obtained in the source domain of Weibo to the target domain of TikTok. From the perspective of text information, we propose an improved TC-LDA model to measure the similarity between the two domains, including temporal similarity and conceptual similarity, which effectively improves the learning effect of instance transfer and makes up for the problem of insufficient sample data in the target domain. Then, based on the results of transfer learning, we use the improved single-pass incremental clustering algorithm to discover and filter popular topics in streaming data of social networks. Finally, we build a topic knowledge graph using the Neo4j graph database and conduct experiments to predict the evolution of popular topics in new emergencies. Our research results can provide a reference for public opinion monitoring and early warning of emergencies in government departments.

1. Introduction

Social networks have changed the form of information dissemination about public emergencies from “limited reporting” to “nationwide communication”. User-generated content (UGC) has emerged in large numbers, making every citizen both the receiver and disseminator of information. In 2020, the outbreak of COVID-19 triggered the biggest public panic in human history. As the first platform to form and disseminate information related to emergencies, social networks have become the incubator and catalyst for mass panic, which has played a driving role in promoting information dissemination. However, this pattern makes a lot of information interweave, which leads to the problem of information overload. As human attention to information is limited, attention becomes a scarce resource [1]. In the process of information transmission, the distribution of popularity is uneven; most of the information has low popularity, and only a small portion of the information can maintain high popularity. In this case, what information will attract people’s attention and how it changes over time become the focus of research [2]. To solve this problem, it is necessary to make full use of network text data and construct a scientific and effective prediction model to analyze the potential evolution of popular topics.
While most previous studies only focused on information in a single network, social networks are interconnected, and the same event spread in different social networks has relevance and transitivity. For example, when an emergency breaks out on Weibo, it will drive the clicks of its related content on video websites, so cross-domain information association is needed. Based on this, this paper has two research questions:
(1)
How can one measure the text information similarity of different social network platforms for knowledge transfer learning?
(2)
From the perspective of text information, how can one detect potential popular topics on social networks and predict popular topics?
This paper adopts the transfer learning algorithm to predict the popular topics of emergencies in combination with information from other social network platforms to improve the accuracy of prediction. Based on the transfer learning results, this paper combines the knowledge graph with the emergency data to construct the public opinion topic graph, which will present the network structure of social network topics with semantic characteristics and then reach the goal of tracking the evolution of popular topics. The basic research framework of this paper is shown in Figure 1. The research results help government departments to guide practice, grasp the public opinion trends of emergencies, and realize monitoring analysis and early warning of public opinion so as to take scientific and effective measures to guide and control public opinion.

2. Related Work

At present, for research on topic detection, scholars mostly work on methods such as topic modeling, multidimensional attribute extraction and algorithm-based improvement perspectives to identify the Internet users’ topics [3,4,5]. The majority of them adopt multiple perspectives and techniques to detect topics in online public opinion events, but little work on the field of public opinion has been done with the help of knowledge graph methods [6,7,8,9]. The topic graph constructed by combining knowledge graph with data of online public opinion emergencies will present a complete network structure reflecting comment forwarding relationships and with semantic characteristics. Through the extraction of multi-dimensional feature attributes in the graph, the objective of tracking the evolution of topics in a comprehensive and systematic manner is achieved. We discuss related work in the fields of transfer learning, popular topic detection, and knowledge graphs.
Transfer learning. This concept first originated in psychology, arguing that people can use previously learned knowledge and skills to guide the learning of new knowledge [10]. Common machine learning techniques traditionally address isolated tasks. In contrast, transfer learning aims to transfer knowledge learned in one source domain and use it to improve learning in a related target domain [11]. Transfer learning has been previously used in various cases including classification, image clustering, collaborative filtering, and sensor-based location prediction [12,13,14]. The feature learning problem of little or no sample data in the target domain can be solved by transferring the source domain data [15].
Popular Topic Detection. Researchers have investigated a wide variety of methods and resources to detect popular topics of online information [16], e.g., the LDA model replaces the traditional VSM model to extract hidden topic information [17], a method of clustering words based on the similarity of related time series [18], overview-based topic models and real-time detection techniques [19], incremental clustering to detect new topics and use content and temporal features to discover popular topics in time [20], and layered topic detection methods [21].
Knowledge Graph. The essence of a knowledge graph is a semantic network graph that reveals entity knowledge [22]. Scholars use the vocabulary provided by the semantic network to achieve short text understanding, word segmentation, type annotation, and concept labeling [3]. Knowledge graph technology has made remarkable achievements in medicine, film and television, transportation, and other fields, but there is still a lot of work to be done in the field of social networking [23,24].

3. Data Pre-Processing

This paper takes the novel coronavirus pneumonia in 2020 as its research scenario. We collected datasets from the Harvard Dataverse platform [25]. The two main social network platforms of Weibo and TikTok were taken as the research object. Based on transfer learning, the cross-domain prediction problem of the prevalence of new text messages on the social network platform was studied.
Weibo COVID dataset. Weibo, commonly referred to as “Chinese Twitter”, is a micro-blogging site. The data was crawled on the Weibo platform from 7 December 2019 to 4 April 2020. The data was crawled in two phases covering a total of 4,047,389 Weibo posts. The first crawler ran on 26 February 2020 and collected 3.3 million Weibo posts from 18 January 2020 to 26 February 2020. The second crawler ran on 4 April crawling from 7 December 2020 to 4 April 2020 to complement the original dataset.
TikTok COVID dataset. We mainly used the GooSeeker big data crawler software to crawl the data of the TikTok website. Data were collected from 25 December 2019 to 30 May 2020, crawling a total of 15,756 pieces of data from 2685 government media users. Since the short videos were released by government media accounts, they have high authority, a wide spread, and great influence.
We adopted Chinese word segmentation (jieba) to segment Chinese text, and added a user-defined dictionary (dict. Txt) based on the new crown open concept knowledge graph publicly released on the OpenKG platform (COVID-19-Concept) to optimize the word segmentation results. Additionally, this study makes comprehensive use of the advantages of “Baidu deactivation Thesaurus”, “Harbin Institute of technology deactivation Thesaurus” and “Sichuan University Machine Intelligence Laboratory deactivation Thesaurus” to filter stop words [26,27]. We combined the contents of the three deactivation thesauruses to build a new deactivation thesaurus and deleted meaningless information so as to reduce the interference to the word segmentation results.
After the above processing, the Chinese text data was still in text format and could not be directly recognized and calculated by the computer. Therefore, text representation processing is required. This paper represents text based on the vector space model (VSM). Then characteristic words with strong discrimination and representativeness were extracted from the text so as to reduce the dimension of vector space, simplify the calculation process, and improve the efficiency of text processing without damaging the core information. At the same time, the word frequency was normalized to avoid the interference of the length of the text. In this paper, term frequency inverse document frequency (TF-IDF) was used to extract text features. If a word had a higher word frequency in one document and a lower word frequency in other documents, it was considered that the word could distinguish documents well and it was given a greater weight; On the contrary, if the word appears in multiple documents, it indicates that its distinguishing ability is not strong, and the value of IDF is small. The whole process of data preprocessing is shown in Figure 2. Finally, we used the obtained data for model construction and analysis.

4. Transfer Learning Model

In this paper, we adopted the transfer learning model which was established based on the DSN Network [28]. This method considers that the source domain D s and the target domain D t are composed of two parts: the public part and the private part. The former can learn the public features and the latter is used to keep the independent characteristics of each domain. Traditional machine learning algorithms require the training and testing sample data to have the same distribution. However, the transfer learning algorithm broke the limitations, as long as there is a correlation between the D s and D t domains, knowledge can be obtained from the D s domain which can be applied to the different but similar D t domain to solve the problem of insufficient D t training samples. Therefore, this paper establishes a transfer learning model based on the DSN network method.
According to the principle of similarity matching, instance data with high similarity between the source domain D s dataset (Weibo platform) and the target domain D t dataset (TikTok platform) were transferred to the target domain to help with target domain model learning. In the transfer learning model, the source domain was set as D s and the learning task in the source domain was denoted as T s and the target domain was set as D t and the learning task in the target domain was denoted as T t . The data distribution of these two domains is P ( X s ) and P ( X t ) , and P ( X s ) P ( X t ) . In order to improve the performance of transfer learning, the D t domain data was divided into training sample set X = { X 1 , X 2 , , X u } and test sample set Y = { Y 1 , Y 2 , , Y v } , (i.e., D t = X Y ), where u and v represent the time sequence length, respectively. Then, D s and X were combined into the training set. The structure of the DSN network transfer learning algorithm is shown in Figure 3.
In the Figure 3, h c t and h c s are the hidden vectors of common features extracted from the D s and D t domain. h p t and h p s are the hidden vectors of private features extracted from the D s and D t domain. F c o m m o n is the similar feature extracted from the D s and D t domain by calculating the similarity between h c s and h c t , and F d i f f e r e n c e is the dissimilar feature extracted from D s and D t domain, which ensures that the private part still plays a role in learning task T s and T t . F l a t e n t is the latent feature which is further mined based on dissimilar features. φ s i m i l a r i t y is an indicator to measure the similarity between platforms, which is calculated by two dimensions: temporal similarity and conceptual similarity. Based on the φ s i m i l a r i t y value, the paper transfers the source domain samples to the target domain for training research.

4.1. The Improved TC-LDA Model

The LDA (Latent Dirichlet Allocation) model is a classic subject model. This paper uses the distribution of words in the text to find the potential subject by clustering words with similar distribution (Figure 4). After LDA model subject extraction, the text set of the source domain is represented as D s = { v 1 , v 2 , v k , p j } , which is composed of feature word { v 1 , v 2 , v n } and subject word p j . Similarly, the text set extracted from the training set of the target domain is represented as X = { f 1 , f 2 , f m , d i } , consisting of feature word { f 1 , f 2 , f u } and subject word d i .
As the source domain and target domain of this study are both social network media, the text content belongs to the short text type, which has the characteristics of time dynamic and feature sparsity of short text. In view of the characteristics of short texts, this paper improves the LDA model to calculate text similarity and proposes TC-LDA similarity model. First, considering the influence of the time factor, time feature similarity is introduced. Since this paper studies hot topics on social media, the popularity will drop significantly after 3 days. Therefore, the time interval is set as 3 days (72 h) in this paper. The longer the time interval, the lower the similarity value of the time feature between texts with more than 3 days. We take text w 1 in the source domain and text w 2 in the target domain for example. For texts with the same time, a greater weight is assigned, and the calculation is shown in Formula (1):
Timesim ( w 1 , w 2 ) = { 100 w 2 . T i m e = w 1 . T i m e 72 | w 2 . T i m e w 1 . T i m e | w 2 . T i m e w 1 . T i m e
Second, for the feature sparsity of short texts, concept similarity is introduced to expand the feature space of short texts.
The calculation formula of the number of concept words shared by two texts c i is as follows:
N ( c i | w 1 w 2 ) = Conof ( w 1 ) Conof ( w 2 )
The formula for calculating the total number of concept words contained in the two texts c j is as follows:
N ( c j | w 1 w 2 ) = Conof ( w 1 ) Conof ( w 2 )
The calculation formula of concept similarity is:
Consim ( w 1 , w 2 ) = N ( c i | w 1 w 2 ) N ( c j | w 1 w 2 )
The definition of the similarity calculation formula in this paper is given. For text w 1 in the source domain and text w 2 in the target domain, under the condition that they are divided into the same subject class through the analysis of LDA subject model, through the feature weight coefficient λ , combined with the two factors of time feature similarity Timesim ( w 1 , w 2 ) and concept similarity Consim ( w 1 , w 2 ) , β w 1 and β w 2 are the subject weights of text division obtained by LDA model. After testing, the similarity threshold of 0.3 is found to be the most appropriate. The calculation of determining the similarity between w 1 and w 2 is shown in Formula (5):
Similarity ( w 1 , w 2 ) = ( λ Timesim ( w 1 , w 2 ) + ( 1 λ ) Consim ( w 1 , w 2 ) ) β w 1 β w 2

4.2. Experiment

Based on previous research results and combined with the characteristics of COVID-19 emergencies [29], this paper divides the development of epidemic public opinion into six stages: incubation period, growth period, outbreak period, fluctuation period, decline period, and long tail period, as is depicted in Figure 5. Additionally, in order to more accurately reflect the topic content in each period, according to the statistical analysis results of the life cycle stages, this paper divided data into time slices in the granularity of one week (7 days). The first 70% of the data in the two domains was selected as the training set, including 14 time slices (T0, T1, …, T14) from 25 December 2019 to 7 April 2020). The last 30% of the data was selected as the testing set, including 7 time slices (T15, T16, …, T21) from 8 April 2020 to 31 May 2020. The data were then allocated to the corresponding time slice, and the text in each time slice was processed in turn and the subject extracted.
The data used in this experiment are the data from the Weibo and TikTok platforms after data pretreatment. This paper takes the LDA model processing of source domain and target domain data on T1 time slice as an example to illustrate. Data on other time slices were processed according to the similar LDA model mentioned above.
Through Gensim’s LDA Model method, the subject number range was set as [2,20], and the subject consistency scores of LDA models in the source domain and target domain under different subject numbers were obtained, as shown in Figure 6. The best number of subjects for the source domain T1 time slice data was obtained as 16, and the model’s perplexity is −6.9. For subjects in the target domain, the optimal number was 12, with the model’s perplexity equal to −5.27.
Then we set the number of subjects in the source domain to 16, the number of subjects in the target domain to 12, and reserved 10 keywords for each subject. Check the subject results extracted from the source domain and target domain in Figure 7.
Each subject is composed of multiple words, and the same word may be contained in multiple subjects. For example, the 10 key words of the EP subject include “prevention and control, COVID-19, epidemic, combat, Novel Coronavirus, pneumonia, gather, disinfection, and Wuhan” (note: original data keywords are Chinese; this is translated data). Based on the above analysis, this paper is mainly divided into 10 subjects: epidemic condition (EC), data reporting (DR), epidemic prevention (EP), epidemic impact (EI), charity donations (CD), scientific research and popularization (SRP), people’s stories (PS), rumor clarification (RC), emotional subjects (ES), and other subjects (OS).

4.3. Results

According to the results of subject extraction from LDA models of the source domain and target domain, we proposed an improved TC-LDA model to calculate the similarity of texts with the same subject. Some text data are extracted for explanation, as shown in Table 1.
According to Formula (1) given by TC-LDA, the temporal similarity results of the above three groups of text examples are calculated (see Table 2). For conceptual similarity, we used the open API interface of the CN-Probase concept knowledge graph platform to query the concept words of related entity words in the text. This approach can expand feature words and reduce feature sparsity in short texts. Then, we used Formula (4) to calculate the conceptual similarity of text feature words, and the results are shown in Table 3. Finally, the similarity between the two text message fields was obtained according to Formula (5). For information with high similarity, instance transfer is performed, and information with low similarity is retained to keep the private features of source and target domains.
In order to measure the effect of text similarity calculation, we adopted A c c u r a c y and F M e a s u r e values to measure the effect of text similarity. Parameters related to A c c u r a c y and F M e a s u r e are shown in Table 4.
A c c u r a c y refers to the ratio of correctly classified text data to all texts, which is used to evaluate the performance of correct migration of similar texts in source domain and target domain. F M e a s u r e is used to evaluate the overall performance of similar text transferring in the source domain and target domain. It is calculated by P r o c i s i o n and R e c a l l . The formula is defined as follows:
A c c u r a c y = T P + T N T P + T N + F P + F N
P r o c i s i o n = T P T P + F P
R e c a l l = T N T N + F N
F M e a s u r e = 2 × ( P r o c i s i o n × R e c a l l P r o c i s i o n + R e c a l l )
In order to obtain better experimental results, this paper sets the characteristic weight coefficient to change from 0.1 to 1 and carries out experiments according to the value of different parameters. According to the principle of TC-LDA model in Section 4.1, if the similarity threshold is set to 0.3, that is, for texts with a Similarity ( w 1 , w 2 ) 0.3, it is considered that the texts of the source domain and the target domain are similar, and transfer learning can be carried out. Otherwise, the texts are considered to be dissimilar samples, and the results are shown in Figure 8. When the feature weight coefficient λ = 0.5, the A c c u r a c y and F M e a s u r e value of TC-LDA model are the highest, so we set λ = 0.5 in this study.
In order to measure the performance of the improved TC-LDA model, this paper compares the LDA model with the TC-LDA model through four evaluation criteria, and the results are shown in Table 5. We can see that the TC-LDA model proposed in this study has significant advantages over the LDA model in similar text detection. The new model improves the A c c u r a c y , P r o c i s i o n , R e c a l l and F M e a s u r e value, which contributes to the subsequent transfer learning and data mining.

5. Prediction of Popular Topics Based on Knowledge Graph

According to the results of transfer learning in Section 4, for text data with low similarity in time slices, transfer learning cannot compensate for training samples in the target domain. Therefore, the source domain and target domain retain their private features respectively, but text topics in different time slices may also have an evolutionary relationship with relevance. Therefore, the potential topic information can be extracted from the source domain which can be transferred to the target domain in the adjacent time slice for the prediction of potential popular topics in the target domain.

5.1. Topic Discovery and Filtering

Subject extraction of text data of each time slice belongs to subject discovery of static text information. However, social network texts are particularly affected by time, so this study adopts single-pass algorithm to find topics [30]. Due to the large amount of data, this study takes the data of the T2 time slice in the source domain and the T3 time slice in the target domain as an example to illustrate the topic discovery process of single-pass algorithm. The calculation steps for data in the other time slices are the same.
Aiming at the shortcomings of the single-pass algorithm [31], we made the following improvements to this algorithm: On the one hand, we sorted text data according to the release time before input, which was suitable for the evolution of the topic. In this way, we reduced the influence of different input sequences on clustering results. On the other hand, we use the way of the inverted index to accelerate the clustering and improve the search efficiency when the data volume is large.
After several experiments, it was found that it is appropriate to define the threshold as 0.2 and realize the model through Python software to obtain the respective topic categories of the source domain and target domain. As machine learning cannot fully understand human natural language text, limitations exist in the topic filtering. Therefore, we manually set “problem-answer” to train the binary classifier, which is used to filter topics after training to some texts. If relevant, they will be left, while irrelevant topics will be deleted. In this way, we filter out the topics unrelated to COVID-19 events. The final number of topic categories is shown in Table 6.
In this section, there were a total of nine subjects involved in the source domain and target domain data, which are encoded by letters [a–i], respectively. As for the extraction results, the topics were manually labeled according to topic category keywords and topic sentences and were encoded by 1 to 10, respectively. The specific results are shown in Table 7.

5.2. Topic Evolution Relationship Mining

After the topic is discovered, this paper mines the evolutionary relationship between topics. KL divergence refers to the information gained or relative entropy used to quantify the asymmetric difference between two probability distributions, P and Q, which can be used to measure the evolution of topic content [32]. As the evolutionary relationship is asymmetric, KL divergence is adopted in this paper to measure the evolutionary relationship. The calculation of KL divergence is shown in Formula (10).
KL ( P T m | | P T n ) = i = 1 P T m ( x i ) × l o g P T m ( x i ) P T n ( x i )
where, P T m represents the distribution of feature words of topic T m , P T n represents the distribution of feature words of topic T n , P T m ( x i ) represents the probability of feature words x i in P T m , and KL ( P T m | | P T n ) represents the loss of information to approximate with T n and T m . The smaller the value, the closer T n is to T m , the more likely T n is to evolve from T m , and vice versa. At the same time, considering that some of the directly extracted keywords have the same conceptual semantics, but the words are different and therefore cannot be recognized, this study extends the conceptual semantics of the topic feature words based on the CN-Probase conceptual knowledge atlas platform.
The evolution of topics in the source domain and target domain can be divided into four types: one-to-one relationship (OTO), one-to-many relationship (OTM), many-to-one relationship (MTO), and many-to-many relationship (MTM). This paper set the calculation results of KL divergence as the basis to determine whether the evolution between topics takes place. After many tests, we set different KL divergence thresholds to observe the results of the experiment. When KL = 0.25, the classification accuracy of topics with evolutionary relationships was the highest. Therefore, the threshold value of KL divergence is set to 0.25. If the threshold value is lower than 0.25, it indicates that there may be an evolutionary relationship between topics. Taking the scientific research and popularization (SRP) subject as an example, the results of the evolutionary relationship between topics are shown in Table 8.
In order to facilitate the construction of the topic knowledge graph, this paper defines four categories of topics, including starting topic, ending topic, single topic, and hot topic. The above method is used to mine the evolutionary relationship of T2 time slice in the source domain and T3 time slice in the target domain, and the evolutionary results are shown in Figure 9.

5.3. Topic Evolution Prediction Based on Knowledge Graph

A knowledge graph is a network with a semantic nature, and its basic constituent unit is the triple of “entity, relation, entity”, through which knowledge is expressed in a graph structure [33,34]. Entities are essentially the semantic objects in the real world. Network public opinion entities can be individuals or organizations for opinions through social networks such as media, government, Internet users, social institutions, etc., or incidents causing wide public concern. The entity of this article is extracted from the social network dataset and contains four entity classes: Subject, topic, event, person. Through the research on the defined entities, four relations of <subject, include, topic>, <event, belong to, topic>, <people, belong to, topic> and <people, involve, event> are obtained, as shown in Figure 10, where the ellipse represents the entity category, and the line represents the relation.
The topic knowledge graph was constructed from text data of source and target domains from 25 December 2019 to 7 April 2020, including subjects extracted by LDA model and topics, events, and people extracted by single-pass model. Considering the display effect of the graph, the knowledge graph ignoring the human entities as shown in Figure 11a,b intercepted part of the public opinion topic subgraph with the source domain T2 time slice and the target domain T3 time slice data as the core data, with the human entity added.
This paper takes the topic knowledge graph established by the source domain T2 time slice and the target domain T3 time slice as the research samples and selects five representative online public opinion events in the source domain on 22 January 2020 as test samples. They are, respectively, “Henan has banned the sale of live poultry in the market” (N1), “The Ministry of Transportation launched the level II emergency response” (N2), “Wuhan set up the epidemic prevention and control headquarters” (N3), “CDC will announce the first novel Coronavirus case in the US” (N4), and “Strict prevention of Novel Coronavirus in Spring Festival Transportation” (N5).
By traversing the nodes in the topic knowledge graph, cosine similarity was used to calculate the similarity between the key words of test events N1–N5 and the key words of nodes in the knowledge graph, and the node with the highest similarity was found. However, the accuracy of node prediction results in the knowledge graph of the target domain is low, so nodes with an evolutionary relationship between the source domain and the target domain are transferred to improve the accuracy of topic prediction in the target domain so as to realize topic prediction in the target domain.
According to the similarity calculation of topic nodes in knowledge graph, we get the source domain N1–N5 test events subject prediction results, as shown in Table 9. At the same time, we traverse the nodes to find the target domain generalization topics, as shown in Table 10. This paper extracts the weight of the biggest prediction of top 10 subjects and topics, and the accuracy of the prediction results was judged manually. The correct prediction was recorded as Y, and the wrong prediction was recorded as N.
Additionally, this paper adopts the MRR (Mean Reciprocal Rank) indicator to measure the topic prediction effect. The prediction accuracy of N1–N5 test events calculated by MRR method is shown in Table 11, and the mean value is taken as the final result, according to which the subject prediction accuracy is 82.6% and the topic prediction accuracy is 70.46%, indicating that the model has good prediction effect and strong applicability.

6. Conclusions and Future Work

We implemented cross-domain transfer learning between two social network platforms and propose an improved TC-LDA model to measure the similarity between the two domains. Compared with the LDA model, the TC-LDA model has improved A c c u r a c y , P r o c i s i o n , R e c a l l and F M e a s u r e values. Based on the results of transfer learning, we built a topic knowledge graph by using the Neo4j graphics database and conducted experiments to predict the evolution of popular topics in new emergencies. Experimental results show that knowledge graph technology is effective in popular topic prediction.
In this paper, entities, relationships, and attributes were extracted from COVID-19 emergency information to construct knowledge maps and predict topics. On the one hand, the effect of online public opinion analysis could be improved, so that government departments could make predictions in advance and take corresponding guidance measures in a timely manner to prevent public opinion from blindly expanding. On the other hand, the construction of the topic knowledge map based on the characteristics of the COVID-19 event also expands the application of the knowledge map in the field of public opinion analysis of emergencies on social networks, which can provide theoretical reference for studies in similar fields.
In future research, the graph convolutional network will be considered to model richer sentence semantics. In future research, multiple source domain data will be considered to be used for transfer learning to improve the prediction accuracy of information popular topics.

Author Contributions

Data curation, X.C.; methodology, Q.Q. and S.C.; project administration, X.C.; writing—original draft, X.C.; writing—review & editing, C.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (No. 71801047); Beijing Philosophy and Social Sciences Planning Project (No. 17GLC052); Fundamental Research Funds for the Central Universities in UIBE (CXTD12-04); The Foundation of Key Laboratory of Trustworthy Distributed Computing and Service (BUPT), Ministry of Education; UIBE Distinguished Young Scholars Project of UIBE.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable, the study does not report any data.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Wu, F.; Huberman, B.A. Popularity, novelty and attention. In Proceedings of the 9th ACM Conference on Electronic Commerce, Chicago, IL, USA, 8–12 July 2008; pp. 240–245. [Google Scholar]
  2. Wang, C.K.; Xin, X.; Shang, J.W. When to Make a Topic Popular Again? A Temporal Model for Topic Rehotting Prediction in Online Social Networks. IEEE Trans. Signal Inf. Process. Over Netw. 2018, 4, 202–216. [Google Scholar] [CrossRef]
  3. Hua, W.; Wang, Z.; Wang, H.; Zheng, K.; Zhou, X. Short text understanding through lexical-semantic analysis. In Proceedings of the 2015 IEEE 31st International Conference on Data Engineering, Seoul, Korea, 13–17 April 2015; pp. 495–506. [Google Scholar]
  4. Allan, J. (Ed.) Introduction to Topic Detection and Tracking. In Topic Detection and Tracking: Event-Based Information Organization; Springer: Boston, MA, USA, 2002; pp. 1–16. [Google Scholar]
  5. Geng, X.; Zhang, Y.; Jiao, Y.; Mei, Y. A novel hybrid clustering algorithm for microblog topic detection. In Proceedings of the 2nd International Conference on Materials Science, ReSource and Environmental Engineering (MSREE 2017), Wuhan, China, 27–29 October 2017; p. 040074. [Google Scholar]
  6. Dong, G.Z.; Yang, W.; Zhu, F.D.; Wang, W. Discovering burst patterns of burst topic in twitter. Comput. Electr. Eng. 2017, 58, 551–559. [Google Scholar] [CrossRef]
  7. Ghoorchian, K.; Girdzijauskas, S.; Rahimian, F. DeGPar: Large Scale Topic Detection using Node-Cut Partitioning on Dense Weighted Graphs. In Proceedings of the 37th IEEE International Conference on Distributed Computing Systems (ICDCS), Atlanta, GA, USA, 5–8 June 2017; pp. 775–785. [Google Scholar]
  8. Zhang, C.; Wang, H.; Cao, L.L.; Wang, W.; Xu, F.J. A hybrid term-term relations analysis approach for topic detection. Knowl.-Based Syst. 2016, 93, 109–120. [Google Scholar] [CrossRef]
  9. Torres-Tramón, P.; Hromic, H.; Heravi, B.R. Topic Detection in Twitter Using Topology Data Analysis; Springer: Cham, Switzerland, 2015; pp. 186–197. [Google Scholar]
  10. Gick, M.L.; Holyoak, K.J. Schema induction and analogical transfer. Cogn. Psychol. 1983, 15, 1–38. [Google Scholar] [CrossRef] [Green Version]
  11. Roy, S.D.; Mei, T.; Zeng, W.; Li, S. SocialTransfer: Cross-domain transfer learning from social streams for media applications. In Proceedings of the 20th ACM International Conference on Multimedia, Nara, Japan, 29 October–2 November 2012; pp. 649–658. [Google Scholar]
  12. Dai, W.; Jin, O.; Xue, G.-R.; Yang, Q.; Yu, Y. EigenTransfer: A unified framework for transfer learning. In Proceedings of the 26th Annual International Conference on Machine Learning, Montreal, QC, Canada, 14–18 June 2009; pp. 193–200. [Google Scholar]
  13. Pan, S.J.; Yang, Q. A Survey on Transfer Learning. IEEE Trans. Knowl. Data Eng. 2010, 22, 1345–1359. [Google Scholar] [CrossRef]
  14. Zhao, P.; Hoi, S.C.H. OTL: A Framework of Online Transfer Learning. In Proceedings of the ICML, Haifa, Israel, 21–24 June 2010. [Google Scholar]
  15. Bussas, M.; Sawade, C.; Kuhn, N.; Scheffer, T.; Landwehr, N. Varying-coefficient models for geospatial transfer learning. Mach. Learn. 2017, 106, 1419–1440. [Google Scholar] [CrossRef] [Green Version]
  16. Fiscus, J.G.; Doddington, G.R. Topic detection and tracking evaluation overview. Top. Detect. Track. 2002, 12, 17–31. [Google Scholar]
  17. Huang, B.; Yang, Y.; Mahmood, A.; Wang, H. Microblog Topic Detection Based on LDA Model and Single-Pass Clustering; Springer: Berlin/Heidelberg, Germany, 2012; pp. 166–171. [Google Scholar]
  18. Stilo, G.; Velardi, P. Efficient temporal mining of micro-blog texts and its application to event discovery. Data Min. Knowl. Discov. 2016, 30, 372–402. [Google Scholar] [CrossRef]
  19. Xie, W.; Zhu, F.D.; Jiang, J.; Lim, E.P.; Wang, K. TopicSketch: Real-time Bursty Topic Detection from Twitter. In Proceedings of the IEEE 13th International Conference on Data Mining (ICDM), Dallas, TX, USA, 7–10 December 2013; pp. 837–846. [Google Scholar]
  20. Chen, Y.; Amiri, H.; Li, Z.; Chua, T.-S. Emerging topic detection for organizations from microblogs. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, Dublin, Ireland, 28 July–1 August 2013; pp. 43–52. [Google Scholar]
  21. Chen, P.; Zhang, N.L.; Liu, T.; Poon, L.K.M.; Chen, Z.; Khawar, F. Latent tree models for hierarchical topic detection. Artif. Intell. 2017, 250, 105–124. [Google Scholar] [CrossRef] [Green Version]
  22. Zhang, L. Knowledge Graph Theory and Structural Parsing; Twente University Press: Enschede, The Netherlands, 2002; p. 216. [Google Scholar]
  23. Hao, X.; Ji, Z.; Li, X.; Yin, L.; Liu, L.; Sun, M.; Liu, Q.; Yang, R. Construction and Application of a Knowledge Graph. Remote Sens. 2021, 13, 2511. [Google Scholar] [CrossRef]
  24. Pu, T.; Huang, M.; Yang, J. Migration knowledge graph framework and its application. J. Phys. Conf. Ser. 2021, 1955, 012071. [Google Scholar] [CrossRef]
  25. Leng, Y.; Zhai, Y.; Sun, S.; Wu, Y.; Selzer, J.; Strover, S.; Zhang, H.; Chen, A.; Ding, Y. Misinformation during the COVID-19 Outbreak in China: Cultural, Social and Political Entanglements. IEEE Trans. Big Data 2021, 7, 69–80. [Google Scholar] [CrossRef]
  26. Gitee. Stopwords [DB/OL]. 2019. Available online: https://gitee.com/UsingStuding/stopwords (accessed on 30 June 2021).
  27. Guan, Q.; Deng, S.; Wang, H. Chinese Stopwords for Text Clustering: A Comparative Study. Data Anal. Knowl. Discov. 2006, 1, 72–80. [Google Scholar]
  28. Bousmalis, K.; Trigeorgis, G.; Silberman, N.; Krishnan, D.; Erhan, D. Domain separation networks. Adv. Neural Inf. Process. Syst. 2016, 29, 343–351. [Google Scholar]
  29. Mo, P.; Xing, Y.; Xiao, Y.; Deng, L.; Zhao, Q.; Wang, H.; Xiong, Y.; Cheng, Z.; Gao, S.; Liang, K.; et al. Clinical Characteristics of Refractory Coronavirus Disease 2019 in Wuhan, China. Clin. Infect. Dis. 2020, 73, e4208–e4213. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  30. Jelodar, H.; Wang, Y.; Yuan, C.; Feng, X. Latent Dirichlet allocation (LDA) and topic modeling: Models, applications, a survey. Multimed. Tools Appl. 2018, 78, 15169–15211. [Google Scholar] [CrossRef] [Green Version]
  31. Yi, X.; Zhao, X.; Ke, N.; Zhao, F. An improved Single-Pass clustering algorithm internet-oriented network topic detection. In Proceedings of the 2013 Fourth International Conference on Intelligent Control and Information Processing (ICICIP), Beijing, China, 9–11 June 2013; pp. 560–564. [Google Scholar]
  32. Blei, D.M.; Lafferty, J.D. Dynamic topic models. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA, 25–29 June 2006; pp. 113–120. [Google Scholar]
  33. Wise, C.; Ioannidis, V.N.; Calvo Rebollar, M.; Song, X.; Price, G.D.; Kulkarni, N.; Brand, R.; Bhatia, P.; Karypis, G. COVID-19 Knowledge Graph: Accelerating Information Retrieval and Discovery for Scientific Literature. arXiv 2020, arXiv:2007.12731. [Google Scholar]
  34. Flocco, D.; Palmer-Toy, B.; Wang, R.; Zhu, H.; Sonthalia, R.; Lin, J.; Bertozzi, A.; Brantingham, P.J. An Analysis of COVID-19 Knowledge Graph Construction and Applications. In Proceedings of the 2021 IEEE International Conference on Big Data (Big Data), Orlando, FL, USA, 15–18 December 2021; pp. 2631–2640. [Google Scholar]
Figure 1. The basic research framework of the study. Based on the preprocessed text data: (1) For data of the same time slice, the LDA model was used to extract similar subject text data for transfer learning; (2) for data of different time slices, the single-pass algorithm was used to extract potential topics and evolutionary relationships; (3) the above results are used to construct the topic knowledge graph and predict popular topics.
Figure 1. The basic research framework of the study. Based on the preprocessed text data: (1) For data of the same time slice, the LDA model was used to extract similar subject text data for transfer learning; (2) for data of different time slices, the single-pass algorithm was used to extract potential topics and evolutionary relationships; (3) the above results are used to construct the topic knowledge graph and predict popular topics.
Futureinternet 14 00103 g001
Figure 2. The process of text data preprocessing and optimization methods.
Figure 2. The process of text data preprocessing and optimization methods.
Futureinternet 14 00103 g002
Figure 3. Structure diagram of transfer learning algorithm based on DSN Network.
Figure 3. Structure diagram of transfer learning algorithm based on DSN Network.
Futureinternet 14 00103 g003
Figure 4. LDA model potential subject topology diagram.
Figure 4. LDA model potential subject topology diagram.
Futureinternet 14 00103 g004
Figure 5. Stage division of life cycle of Weibo and TikTok platform events.
Figure 5. Stage division of life cycle of Weibo and TikTok platform events.
Futureinternet 14 00103 g005
Figure 6. This is the subject consistency of T1 time slice data in the two domains: (a) is the subject consistency score of source domain T1 time slice data; (b) is the subject consistency score of target domain T1 time slice data.
Figure 6. This is the subject consistency of T1 time slice data in the two domains: (a) is the subject consistency score of source domain T1 time slice data; (b) is the subject consistency score of target domain T1 time slice data.
Futureinternet 14 00103 g006
Figure 7. This is the subject classification of the source domain and target domain T1 time slice data: The T1 time slice of the source domain has 16 subjects; The T1 time slice of the target domain has 12 subjects. There are 10 subjects in the union of the two domains.
Figure 7. This is the subject classification of the source domain and target domain T1 time slice data: The T1 time slice of the source domain has 16 subjects; The T1 time slice of the target domain has 12 subjects. There are 10 subjects in the union of the two domains.
Futureinternet 14 00103 g007
Figure 8. A c c u r a c y and F M e a s u r e change in feature weight coefficient λ = [0.1, 1].
Figure 8. A c c u r a c y and F M e a s u r e change in feature weight coefficient λ = [0.1, 1].
Futureinternet 14 00103 g008
Figure 9. Graph of topic evolution over time (8 January 2020–21 January 2021).
Figure 9. Graph of topic evolution over time (8 January 2020–21 January 2021).
Futureinternet 14 00103 g009
Figure 10. Entity, relation model diagram.
Figure 10. Entity, relation model diagram.
Futureinternet 14 00103 g010
Figure 11. This is the topic knowledge graph constructed by using Neo4j graph database: (a) is the COVID-19 Topic Knowledge Graph; (b) is the topic knowledge graphs of source domain T2 and target domain T3.
Figure 11. This is the topic knowledge graph constructed by using Neo4j graph database: (a) is the COVID-19 Topic Knowledge Graph; (b) is the topic knowledge graphs of source domain T2 and target domain T3.
Futureinternet 14 00103 g011
Table 1. Examples of the texts of the same topic in D s and D t (T1 time slice).
Table 1. Examples of the texts of the same topic in D s and D t (T1 time slice).
NumDomainRelease TimeSubjectWeightContent
1Ds2 January 2020 22:29Subject 160.7808SRP
Dt5 January 2020 23:18Subject 120.6944SRP
2Ds3 January 2020 19:13Subject 130.8815EP
Dt6 January 2020 9:03Subject 20.9167EP
3Ds6 January 2020 14:27Subject 40.4109ES
Dt1 January 2020 8:00Subject 50.5085ES
Table 2. Temporal similarity of T1 time slice data in the D s and D t .
Table 2. Temporal similarity of T1 time slice data in the D s and D t .
NumDomainRelease TimeTime Interval (Hours)Temporal Similarity
1Ds2 January 2020 22:2972.820.99
Dt5 January 2020 23:18
2Ds3 January 2020 19:1361.831.16
Dt6 January 2020 9:03
3Ds6 January 2020 14:27126.450.57
Dt8 January 2020 8:00
Table 3. Conceptual similarity of T1 time slice data in the D s and D t .
Table 3. Conceptual similarity of T1 time slice data in the D s and D t .
NumDomainShared WordsThe Total Number of WordsConceptual Similarity
1Ds8160.5
Dt
2Ds6160.375
Dt
3Ds2280.071
Dt
Table 4. Accuracy and F-measure Parameters.
Table 4. Accuracy and F-measure Parameters.
NumClassificationSimilarDissimilar
1TrueTPTN
2FalseFPFN
Table 5. Performance comparison between LDA model and TC-LDA model.
Table 5. Performance comparison between LDA model and TC-LDA model.
ModelAccuracyPrecisionRecallF-Measure
LDA57.14%76.92%45.45%57.14%
TC-LDA77.14%81.82%69.23%75.00%
Table 6. The number of topics extracted by single-pass in T2 and T3.
Table 6. The number of topics extracted by single-pass in T2 and T3.
DomainTimeNumber of Topics Extracted by Single-PassNumber of Filtered Topics
Ds8 January 2020915
9 January 2020616
10 January 2020235
11 January 2020175
12 January 2020374
13 January 2020995
14 January 2020397
Dt15 January 2020–21 January 20202517
Table 7. Topic extraction results extracted by single-pass in T2 and T3.
Table 7. Topic extraction results extracted by single-pass in T2 and T3.
DomainNumEP(a)DR(b)SRP(c)CD(d)EI(e)EC(f)OS(g)
Ds-T2 Topics1Mask purchase PublicityEight patients with viral pneumonia of unknown cause have been discharged from hospital in WuhanSARS has been ruled out as the cause of pneumonia of unknown cause in WuhanFace mask shortageEntertainment industry field influenceUs flu outbreakAustralian bushfire
2Wildlife quarantinePeople are demanding transparencyThe pathogen of the pneumonia of unknown cause in Wuhan is coronavirusMask supply in Southeast AsiaRiots in Hong KongThe situation in Wuhan is grimJapanese swine fever infection
3Comprehensive prevention and control of infectious diseasesOne person has died of pneumonia of unknown cause in WuhanExperts interpret pneumonia of unknown cause in WuhanJapan has a shortage of masksYellow alert for flu in Tianjin other
4The government increased access to informationUpdate on the novel Coronavirus pneumonia in WuhanThe novel Coronavirus gene sequence is highly similar to SARS Wuhan virus laboratory investigated
5Prevent coronavirus infectionNotification of confirmed casesThe coronavirus is not same as SARS Online teaching by students
NumEP(a)DR(b)SRP(c)CD(d)EC(f)ES(h)PS(i)
Dt-T3 Topics6Dou action to defeat the epidemicUpdate on COVID-19 casesZhong nanshan is certain of human-to-human transmission of COVID-19The government has paid for the treatment of all patients in WuhanThere is a possibility of a wider outbreak of novel CoronavirusHats off to the medical staff! Wuhan refuelingAnhua Wu went to the front lines of COVID-19
7Wearing masks is the key to preventing novel CoronavirusSichuan confirmed the first case of wuhan novel pneumoniaChina will attend the meeting to share information on the epidemicWuhan road
8ECMO technology successfully treated a patient with pneumonia Teach you how to wear a mask correctly
9Folk ancestral special prescription for fighting novel pneumonia Traditional Chinese medicine for COVID-19 prevention
10Henan has banned the sale of live poultry in markets
Table 8. The evolutionary relationship of SRP topics.
Table 8. The evolutionary relationship of SRP topics.
NumTopic NumDirection Topic NumKullback-Leible (KL)Evolutionary Relationship
1c1-->c60.16MTM
2c1-->c70.23MTM
3c1-->c80.21OTM
4c1-->c90.12MTM
5c2-->c70.22MTM
6c3-->c60.25MTM
7c3-->c70.17MTM
8c4-->c60.19MTO
9c5-->c60.18MTM
10c5-->c90.22MTM
Table 9. N1–N5 Test event target domain subject division prediction results.
Table 9. N1–N5 Test event target domain subject division prediction results.
NumN1SimY/NN2SimY/NN3SimY/NN4SimY/NN5SimY/N
1RC0.396NEP0.305YEC0.498YEC0.331YSRP0.408Y
2EI0.377YEI0.225YSRP0.461NEP0.241NEP0.379Y
3EP0.370YEC0.215YDR0.457NSRP0.240NPS0.361N
4EC0.363YES0.210NRC0.411YCD0.203NCD0.358N
5DR0.358NSRP0.196NEP0.397YDR0.201YES0.352N
6CD0.346NCD0.191NES0.391NRC0.195NEI0.340N
7SRP0.344NDR0.181NCD0.368NES0.193NEC0.338N
8PS0.342NPS0.179NEI0.340NEI0.192NRC0.330N
9ES0.327NRC0.172NPS0.310NPS0.165NDR0.314N
Table 10. N1–N5 test the target domain generalization topic prediction results of events.
Table 10. N1–N5 test the target domain generalization topic prediction results of events.
NumN1SimY/NN2SimY/NN3SimY/NN4SimY/NN5SimY/N
1Virus associated0.312YGovernment measures0.489YDomestic outbreak0.616YGovernment measures0.374YPrevention publicity0.506Y
2Public opinion0.332YLife influence0.356YVirus associated0.535YForeign outbreak0.341YVirus associated0.430Y
3Public events0.359NEducation influence0.261YInfection Source 0.531NVirus associated0.330YPersonal protective0.422Y
4Prevention publicity0.338YVirus associated0.246YLocal rumors0.521YDomestic outbreak0.320YInfection Source 0.398N
5Infection Source 0.339YDomestic outbreak0.244Yconfirmed case0.516YInfection Source 0.306YGovernment measures0.390Y
6Domestic outbreak0.304YMedical measures0.226NPrevention publicity0.484Yconfirmed case0.283YLife influence0.387Y
7Medical measures0.346NPublic opinion0.225YMedical measures0.427YPrevention publicity0.256YMaterial donations0.387N
8Environmental impact0.452NPublic events0.223Ndead csae0.447NLife influence0.250NTreatment technology0.377N
9Material donations0.313NEnvironmental impact0.215NEnvironmental impact0.421NEducation influence0.243NPublic opinion0.318Y
10Education influence0.303NMaterial donations0.185NMaterial donations0.321NForeign donors0.229NEnvironmental impact0.303N
Table 11. Results Prediction Accuracy Test.
Table 11. Results Prediction Accuracy Test.
Test EventsSubject Prediction AccuracyTopic Prediction Accuracy
N177.80%74.00%
N288.90%73.30%
N374.10%68.30%
N477.80%70.00%
N594.40%67.00%
Average82.60%70.46%
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Chen, X.; Qu, Q.; Wei, C.; Chen, S. Cross-Domain Transfer Learning Prediction of COVID-19 Popular Topics Based on Knowledge Graph. Future Internet 2022, 14, 103. https://doi.org/10.3390/fi14040103

AMA Style

Chen X, Qu Q, Wei C, Chen S. Cross-Domain Transfer Learning Prediction of COVID-19 Popular Topics Based on Knowledge Graph. Future Internet. 2022; 14(4):103. https://doi.org/10.3390/fi14040103

Chicago/Turabian Style

Chen, Xiaolin, Qixing Qu, Chengxi Wei, and Shudong Chen. 2022. "Cross-Domain Transfer Learning Prediction of COVID-19 Popular Topics Based on Knowledge Graph" Future Internet 14, no. 4: 103. https://doi.org/10.3390/fi14040103

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop