Network Pseudohealth Information Recognition Model: An Integrated Architecture of Latent Dirichlet Allocation and Data Block Update

The unchecked dissemination of network pseudohealth information has brought great harm to people's health, lives, and property, so it is important to detect and identify it. This paper defines the concepts of pseudohealth information, data block, and data block integration, designs an architecture that combines the latent Dirichlet allocation (LDA) algorithm with data block update integration, and proposes the corresponding combination algorithm model. Crawler technology is used to collect the pseudohealth information transmitted on the Sina Weibo platform during the epidemic from February to March 2020, and the resulting experimental case dataset is used for a simulation test. The results show that (1) the LDA model can deeply mine the semantic information of network pseudohealth information, obtain document-topic distribution features, and supply these topic features as input variables for classification training; (2) the dataset partitioning method can effectively block data according to the text attributes and class labels of network pseudohealth information, and the data block reintegration method can accurately classify and integrate the blocked data; and (3) because the combination model has certain limitations in detecting network pseudohealth information, the support vector machine (SVM) model is used to extract the granular content of data blocks in pseudohealth information in real time, which greatly improves the recognition performance of the combination model.


Introduction
At present, pneumonia caused by the new coronavirus has been effectively controlled nationwide, but the panic and fear it caused still make people nervous. People are looking for effective ways to improve their immunity, resist the virus, and prevent infection. Against this background, some people take advantage of public panic to produce and disseminate a large amount of pseudohealth information on the Internet in the name of health. Examples include "drinking strong liquors can kill novel coronavirus," "drinking radix isatidis and smoking vinegar can prevent novel coronavirus," "drinking sterilizing fluid can kill novel coronavirus," and "wearing multilayer masks can prevent novel coronavirus." The publishers and disseminators of this pseudohealth information, driven by personal interests, promote the "unhealthy" in the name of the "healthy" and induce unwise behavior in people who do not know the truth. This has brought great harm to the physical and mental health of the general public and can also cause property loss and danger to life. WeChat's circle of friends is filled with all kinds of "Health Articles," "Cancer Alert," and "Private Sector" posts. The problem is not limited to social platforms; the whole network environment is affected: health information is full of health care pseudoscience, and it is persuasive enough that people who lack health knowledge and literacy believe it. In addition, pseudohealth information spreads unchecked in rural areas, with serious consequences. For example, in recent years fake health care products have been promoted in rural areas in China. The swindlers take advantage of rural residents' desire for low prices and their worries about health to carry out frauds that cause heavy losses to farmers.
In 2015, the Financial Channel of China Central Television reported that acetochlor pesticide residues had been detected in strawberries and that long-term consumption would cause cancer risks. For this kind of pseudohealth information, it is difficult for nonprofessionals to distinguish whether the information is true or false. Although professionals explained, using eight validation samples, that dosage determines toxicity, the report still left large quantities of strawberries unsold and had a great economic impact on farmers. Therefore, effective identification of pseudohealth information in networks is of great significance for maintaining the physical and mental health of the general public.
At present, there is no universally accepted definition of "pseudohealth information." In general, pseudohealth information is interpreted as false health information without a factual basis, but in the real world much pseudohealth information is built on certain facts, which it extends, distorts, exaggerates, or even fabricates. Therefore, the pseudohealth information studied in this paper is so-called health information that deviates from the truth, either because it is fabricated without a factual basis or because it has a certain factual basis that the publisher has distorted or exaggerated. Network pseudohealth information refers to false health information, fabricated or distorting the truth, that is transmitted specifically through social media on the network. It is the "noise" in health transmission; it often induces people to form incorrect health cognition and even engage in improper health behaviors, which brings inestimable harm to the public's physical and mental health. Thus, it is of great practical significance to study methods for identifying network pseudohealth information in order to prevent its spread and maintain social stability.

Related Works
Internet pseudohealth information is mostly rumor-like in nature: it spreads rapidly, has a wide range of influence, and causes great social harm. It often triggers wide-ranging network public opinion or public health events and attracts widespread attention. At present, research on pseudohealth information identification focuses mainly on the following three aspects: (1) The "select instance" (or sliding window) classification method. For example, Molinaro and Greco proposed a two-stage instance selection algorithm consisting of concept detection and retraining. If the semantics of class health are detected, the algorithm automatically updates the classifier and finds classification labels in the class health information data for classification [1]. Han et al. proposed the sliding window algorithm, which can deal with the attribute classification problem of network pseudohealth information [2]. Hoens et al. proposed a support vector machine (SVM) model for detecting network pseudohealth information, in which classification is realized by updating the weight allocation of instances [3].
(2) The batch classification method. For example, Sutskever et al. proposed the information batch processing model, which processes class health information in batches by constantly updating the classifier and thereby classifies pseudohealth information [4]. Rodriguez and Laio proposed an integrated model based on time limits, which can preliminarily compare and distinguish pseudohealth information and health information in the network [5].
(3) The online learning classification method. For example, the online learning combination model for network pseudohealth information proposed by Brzezinski and Stefanowski is composed of network online classifiers. Since the number of classifiers is usually fixed, the weighted sum update is also fixed [6]. Shi et al. proposed an online incremental algorithm to deal with the classification of network pseudohealth information, but the narrow range of online increments leads to poor fault tolerance [7]. Eskandari and Javidi adopted the network online learning method to classify pseudohealth information through centralized processing, but its classification accuracy was relatively low and the classification effect was poor [8].
In previous related studies, scholars have proposed a variety of classification algorithms for the identification of network pseudohealth information, including combination models of different algorithms. These algorithms and models recognize pseudohealth information well when the information sources and text semantic tags are clearly identifiable. However, pseudohealth information with unclear information sources and unclear semantic tags is difficult to identify, detect, and classify. Whether it is the "select instance" (or sliding window) classification method, the batch classification method, or the online learning classification method, each has its own advantages and disadvantages. Although pseudohealth information can be classified from different aspects, the existing methods are mainly single classifiers or batch processing, so that either the classification is ineffective or the recognition accuracy is not high. Through the research on pseudohealth information, this paper aims to help people distinguish pseudohealth information and improve their health information literacy, thus fundamentally improving the quality of network health information and purifying the network health information environment. Based on this, this paper proposes an integrated combination of the latent Dirichlet allocation (LDA) algorithm with data partitioning and accurate update. Network pseudohealth information is identified by topic, and class tag blocks are accurately updated and integrated with data blocks to effectively identify and classify pseudohealth information.

Concept Definition.
The core of the combination algorithm proposed in this paper for identifying network pseudohealth information is to divide the dataset corresponding to network pseudohealth information into "granularity" blocks according to its class label properties. To detect the attribute of the minimum information unit contained in the dataset, the dataset is continuously updated in blocks according to the category of the information attribute contained in that unit, and it is then reintegrated and classified according to the category of the data blocks so that pseudohealth information can be identified effectively. The concepts involved in this combined algorithm are as follows.
Definition 1. Pseudohealth information. Pseudohealth information refers to information that, in the name of health, misleads others into blindly following or accepting false publicity through misleading and deceptive means in order to serve the personal interests of its producers and broadcasters, and that has been falsified.
In summary, pseudohealth information usually appears in the external form of health information. It takes advantage of people's demand for health information and uses false, deceptive, misleading, and other means to spread and advocate unscientific, false content for personal purposes, and the information has been falsified. The semantics of pseudohealth information deviate from the information title and semantic label and exhibit a concept drift from the original meaning. Accordingly, pseudohealth information can also be defined from the perspective of information dissemination, as shown in Definition 2.
Definition 2. Pseudohealth information (information definition). Consider the class health information dataset $S = \{(x_t, y_t) \mid t = 1, 2, \ldots, T\}$, where $x$ is the attribute value and $y$ is the vector of the class label, and decompose its joint probability as $P(x, y) = P(x)P(y \mid x)$. If the prior probability $P(x)$ and the conditional probability $P(y \mid x)$ of the samples in the class health information dataset change, semantic concept drift occurs in $S$. During semantic concept drift, if $P(x)$ does not change and $P(y \mid x)$ changes, the drift belongs to the conditional change class, and the class health information is determined to be true health information; if both $P(x)$ and $P(y \mid x)$ change, the drift belongs to the feature change class, and the class health information is determined to be false health information.
Generally, health information refers to class health information in which the attributes or labels contained in the information dataset have not changed, although their external representations or conditions have changed over a period of time. Pseudohealth information, in contrast, refers to information that appears as "health" and has a relatively stable feature distribution, but whose content has changed or deviates from the class label corresponding to the "health" feature vector.
Definition 3. Data block. Suppose the information dataset $S = \{(x_t, y_t) \mid t = 1, 2, \ldots, T\}$ is divided into an ordered series of sequences $z_1, z_2, \ldots, z_i, \ldots, z_n, \ldots$, each of which contains one data record or several logical markers. If each sequence $z_i = (x, y)$ consists of feature vectors $x \in X$ and class labels $y \in Y$, the sequence elements $z_i$ are called data blocks.
Definition 4. Data block integration. Suppose the information dataset $S = \{(x_t, y_t) \mid t = 1, 2, \ldots, T\}$ is divided into data blocks $z_1, z_2, \ldots, z_i, \ldots, z_n, \ldots$ of uniform size, and each type of information data block contains $d$ data blocks. For each newly added block $z_j$, the classifier $C_i \in \varepsilon$ is weighted by the weighting function $Q(\cdot)$, which depends on the classification accuracy of the classifier. If the size of the data block set is set to $k$ and the limit has not been exceeded, $z_j$ is classified and added to the data block set of the corresponding type; if a data block set is full and the weight of the newly added data block is greater than that of the remaining data blocks, the newly added data block replaces the weakest block in the original set. This process is called data block integration.
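The replacement rule in Definition 4 can be illustrated with a short sketch. It is not the authors' implementation: the `WeightedBlock` type, the `integrate_block` helper, and the example weights are hypothetical, and the sketch assumes that "greater than that of the remaining data blocks" means greater than the weight of the weakest block currently in the set.

```python
from dataclasses import dataclass


@dataclass
class WeightedBlock:
    """A data block together with the weight assigned by Q(.)."""
    name: str
    weight: float


def integrate_block(block_set, new_block, k):
    """Add new_block to block_set, keeping at most k blocks.

    If the set is not full, the block is appended. If it is full, the new
    block replaces the weakest member only when its weight is larger
    (an assumed reading of Definition 4).
    Returns the replaced block, or None if nothing was replaced.
    """
    if len(block_set) < k:
        block_set.append(new_block)
        return None
    weakest = min(block_set, key=lambda b: b.weight)
    if new_block.weight > weakest.weight:
        block_set[block_set.index(weakest)] = new_block
        return weakest
    return None


# Made-up weights for illustration only.
blocks = []
for name, w in [("z1", 0.62), ("z2", 0.48), ("z3", 0.71)]:
    integrate_block(blocks, WeightedBlock(name, w), k=3)

replaced = integrate_block(blocks, WeightedBlock("z4", 0.55), k=3)  # replaces z2
rejected = integrate_block(blocks, WeightedBlock("z5", 0.10), k=3)  # too weak
print([b.name for b in blocks])
```

Keeping the set bounded at k means that memory and evaluation cost stay constant however many blocks arrive, which is the usual motivation for this kind of replacement policy.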

Algorithm Idea.
The combination algorithm proposed in this paper blocks the data of the health-like dataset. Based on the class labels in the health-like dataset, topic recognition, information dataset partitioning, data block classification integration, and semantic offset detection are carried out by the LDA model, Algorithm 1, the SVM model, and Algorithm 2, respectively. The logical framework of the combined model is shown in Figure 1.

LDA Model. LDA was proposed by David Blei, Andrew Ng, and Michael I. Jordan in 2003. It is mainly used for document-topic generation and contains three levels of structure: document, topic, and word. Therefore, it is also called a three-layer Bayesian probability model [9]. As soon as the LDA model was proposed, it attracted the attention of scholars, especially in the field of semantic mining, because it can greatly reduce the representation dimension of text, and it has since been widely used [10,11]. Additionally, as a typical unsupervised model, the LDA model has the advantage that the number of topics can be determined as long as the important input parameters of the model are determined, which greatly simplifies the algorithm process [12]. Based on this, when determining the optimal number of document topics, this paper selects perplexity as the index for evaluating the model, calculated as follows:
$$\mathrm{perplexity}(D) = \exp\left(-\frac{\sum_{d=1}^{M} \log p(w_d)}{\sum_{d=1}^{M} N_d}\right) \tag{1}$$
In the equation, $M$ is the number of documents, $D$ is the set of words in the documents, $w_d$ denotes the words in document $d$, $N_d$ is the number of words in document $d$, and $p(w_d)$ is the probability of the words in the document.
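As a minimal sketch of how the perplexity index can be computed and used to pick the number of topics, the following assumes the standard definition, exp(-Σ log p(w_d) / Σ N_d), and assumes that per-document log-likelihoods are already available from a trained model. The function names and the numbers in the example are hypothetical.

```python
import math


def perplexity(doc_log_probs, doc_lengths):
    """Corpus perplexity from per-document log-likelihoods.

    doc_log_probs[d] is log p(w_d) for document d (natural logarithm);
    doc_lengths[d] is N_d, the number of words in document d.
    """
    if len(doc_log_probs) != len(doc_lengths):
        raise ValueError("one log-probability and one length per document")
    total_words = sum(doc_lengths)
    if total_words <= 0:
        raise ValueError("corpus must contain at least one word")
    return math.exp(-sum(doc_log_probs) / total_words)


def best_topic_count(perplexity_by_k):
    """Return the topic number with the lowest perplexity (ties: smallest k)."""
    return min(sorted(perplexity_by_k), key=lambda k: perplexity_by_k[k])


# Sanity check: a model that assigns probability 1/8 to every word
# should have a perplexity of about 8, whatever the document lengths.
lengths = [4, 6]
log_probs = [n * math.log(1 / 8) for n in lengths]
uniform_ppl = perplexity(log_probs, lengths)

# Made-up perplexities for three candidate topic numbers.
chosen_k = best_topic_count({3: 410.0, 5: 385.0, 7: 402.0})
```

A lower perplexity means the model is less "surprised" by the observed words, so the candidate topic number with the smallest value is preferred.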
According to the statistical results, users who have published more Weibo posts generally do not spread fake health information, and their credibility can be measured by the number of fans, the number of followers, and their ratio. Users who follow many accounts but have few fans have relatively low credibility, and their fans are often members of the Internet Water Army. These users are the most likely publishers or sources of a large amount of pseudohealth information. They publish or disseminate pseudohealth information through various network social platforms, such as Weibo and WeChat. Therefore, user credibility Reliability(u) can be defined by
$$\mathrm{Reliability}(u) = \ln(\mathit{follower} - \mathit{following} + \mathit{num}) + \mathit{verify} \tag{2}$$
In the equation, follower, following, and num are the number of fans, the number of followers, and the number of Weibo posts, respectively, after z-score standardization. Reliability(u) is an important basis for measuring user credibility: the larger its value, the higher the user credibility.
For users, the number of fans and the numbers of Weibo forwards, comments, and praises are the basis for evaluating their influence. Generally, the more fans a user has, the greater the probability that a microblog posted by the user will be seen and spread by others, and the more forwards, comments, and praises it receives. Whatever Weibo operations fans perform, they all focus on the content published by users. Therefore, the influence of a user's Weibo post, Influence(t), can be defined according to equation (3). In the equation, follower, repost, comment, and like are the numbers of fans, forwards, comments, and praises after z-score standardization. Influence(t) is an important indicator for evaluating the influence of users' microblogs: the greater its value, the greater the influence.
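The credibility score in equation (2) can be sketched as follows. Several points are assumptions rather than statements from the source: the population standard deviation is used for z-score standardization, `verify` is treated as a number supplied by the caller, and the function raises an error when the logarithm argument is not positive, since the text does not say how that case is handled.

```python
import math
from statistics import mean, pstdev


def z_scores(values):
    """Standardize a list of numbers to zero mean and unit variance."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumed)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def reliability(follower, following, num, verify):
    """Equation (2): ln(follower - following + num) + verify.

    The inputs are expected to be standardized already. A non-positive
    logarithm argument is rejected because the source does not define it.
    """
    arg = follower - following + num
    if arg <= 0:
        raise ValueError("logarithm argument must be positive")
    return math.log(arg) + verify


# Made-up standardized values for one user.
score = reliability(follower=2.0, following=0.5, num=1.5, verify=1)
standardized = z_scores([1, 2, 3])
```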

Data Block Update Integration Algorithm
(1) Dataset Partitioning Algorithm.
The identification of network pseudohealth information determines the essence of the information semantics according to the degree of deviation between the target class label and the semantic ontology. If the semantic concept in information dataset $S_t$ is replaced by that in $S_{t+1}$ and the deviation is subversive, the "health" information content contained in the information semantics has been replaced by pseudohealth information and its semantic ontology has changed fundamentally; the semantic ontology of network pseudohealth information belongs to this category. According to this principle, the information dataset $S$ is divided into a stream of data blocks $z_1, z_2, \ldots, z_i, \ldots, z_n, \ldots$, each containing one record or several logical records. Classifier $C_i$ is constructed, and the newly added data block $Z_j$ is weighted. The classification performance of classifier $C_i$ is determined by the weighting function $Q(\cdot)$. In the process of information dataset partitioning, if a certain type of data block set is not full, data block $Z_j$ is added to this set; if the set is full and the weight of block $Z_j$ is greater than that of any block, the weakest block is replaced. The block integration algorithm of the dataset is shown in Algorithm 1.
(2) Data Block Set Classification Integration SVM Model. SVM is a typical binary classification model with superior generalization ability; therefore, it has been widely used in the field of information and data classification [13,14]. In this paper, when identifying network pseudohealth information, the SVM model is adopted to integrate and classify the data block set, which transforms the classification of the instance sample dataset into a convex quadratic programming problem. Then, the best classification hyperplane of the sample space is obtained. The classification hyperplane equation is as follows:
$$\omega^{T} x + b = 0 \tag{4}$$
In the equation, $\omega = (\omega_1, \omega_2, \ldots, \omega_7)$ is the normal vector, which determines the direction of the hyperplane; $b$ is the displacement term, which determines the distance between the hyperplane and the origin; and $x = (f_1, f_2, \ldots, f_7)$ is the feature vector of the sample point. The distance of the hyperplane is a controllable factor that makes the distance between the two types of sample points and the classification hyperplane reach the optimal size based on the requirement of classification accuracy [15]. In addition, the SVM model has good fault tolerance in the training process, and the optimal classification hyperplane is obtained by solving
$$\min_{\omega, b, \xi} \; \frac{1}{2}\lVert \omega \rVert^{2} + C \sum_{i=1}^{N} \xi_i \quad \text{s.t.} \quad T_i\left(\omega^{T} x_i + b\right) \ge 1 - \xi_i, \;\; \xi_i \ge 0, \;\; i = 1, 2, \ldots, N \tag{5}$$
In the equation, $x_i$ is the feature vector of the $i$-th sample point, $\xi_i$ is the slack variable of the $i$-th sample point, $T_i$ is the category label of the $i$-th sample point, $N$ is the number of training samples, and $C$ is the penalty coefficient. The classification performance of the SVM model is determined by its kernel function, and choosing different kernel functions leads to great differences in classification accuracy. At present, the kernel functions commonly used in SVM models include the linear, polynomial, and radial basis function (RBF) kernels [16].
Since the classification accuracy of the RBF kernel function is much higher than that of other kernel functions and it is suitable for situations where the number of features is less than or equal to the number of samples [17,18], this paper chooses the RBF kernel function, as shown in the following equation:
$$K(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^{2}\right), \quad \gamma > 0 \tag{6}$$
Before the SVM model is trained, the penalty coefficient $C$ and the parameter $\gamma$ of the RBF kernel function need to be determined. The fault tolerance of the model is controlled by $C$, and the two are negatively correlated: the larger the penalty coefficient $C$ is, the smaller the fault tolerance is. When $C$ is too high, overfitting occurs [19,20]; when $C$ is too small, the classification accuracy of the model is reduced accordingly. The parameter $\gamma$ in the RBF kernel function, in turn, affects the distribution of the sample points mapped to the high-dimensional space and interacts with the penalty coefficient $C$; choosing both appropriately gives the SVM model high classification integration accuracy.
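To make the role of the slack variables and the penalty coefficient C concrete, the sketch below evaluates a given hyperplane on a few labeled points. It assumes the standard soft-margin formulation, with labels in {-1, +1} and slack max(0, 1 - T(ω·x + b)); it does not solve the quadratic program, and the function names, the weights, and the two-feature samples are hypothetical (the paper uses seven features).

```python
def decision_value(w, b, x):
    """Signed score w.x + b; its sign gives the side of the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def slack(w, b, x, label):
    """Slack xi = max(0, 1 - label * (w.x + b)) for a label in {-1, +1}."""
    return max(0.0, 1.0 - label * decision_value(w, b, x))


def soft_margin_objective(w, b, samples, C):
    """0.5 * ||w||^2 + C * (sum of slacks) over (x, label) samples."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    penalty = C * sum(slack(w, b, x, t) for x, t in samples)
    return margin_term + penalty


# Made-up hyperplane and samples: two points lie outside the margin,
# one point is on the correct side but inside the margin (slack 0.5).
w, b = [1.0, -1.0], 0.0
samples = [([2.0, 0.0], +1), ([0.0, 2.0], -1), ([0.5, 0.0], +1)]

slacks = [slack(w, b, x, t) for x, t in samples]
objective = soft_margin_objective(w, b, samples, C=2.0)
```

Raising C makes each unit of slack more expensive, which is why a large C tolerates fewer margin violations and a small C tolerates more.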
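The effect of the RBF kernel parameter can be seen directly from the kernel value. The sketch below assumes the usual form K(x, y) = exp(-γ‖x - y‖²), where γ is the kernel parameter tuned alongside C; the function name and the sample points are hypothetical.

```python
import math


def rbf_kernel(x, y, gamma):
    """K(x, y) = exp(-gamma * ||x - y||^2)."""
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-gamma * sq_dist)


# The squared distance between these two made-up points is 2.
a, b = [0.0, 0.0], [1.0, 1.0]
k_small_gamma = rbf_kernel(a, b, gamma=0.5)  # exp(-1)
k_large_gamma = rbf_kernel(a, b, gamma=2.0)  # exp(-4)
```

With a larger kernel parameter the similarity between the same two points decays faster, so each training sample influences a smaller neighborhood and the decision boundary can become more irregular.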

(3) Semantic Offset Detection Algorithm.
The semantic changes in network pseudohealth information are very complex. Existing studies use online weighted and incremental classification methods to detect changes in the target semantics of network pseudohealth information, but data block integration is much more complicated than incremental classification, and the existing semantic offset detection algorithms have defects. To compensate for these defects, this paper adopts the following semantic offset detection algorithm. Each data block contains one record or multiple logical markers. The data block set classified by the SVM model is batch processed, and the candidate classifier corresponding to the integration component of the data block set is triggered for a classification check. If the current data block set is correctly classified, the original data block set of the classification integration is kept unchanged; if the fault tolerance of the current data block classification integration is poor or the classification accuracy is low, the integration component is reweighted and the class tags in the data block are redetected, which improves the classification accuracy of the classifier so that the target semantic attributes are detected effectively. The semantic offset detection algorithm is shown in Algorithm 2.
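The check-then-reweight idea described above can be sketched in a few lines. This is an illustration under stated assumptions, not Algorithm 2 itself: classifiers are plain functions, the weighting function is taken to be accuracy on the current block, and the 0.8 accuracy threshold and all names are hypothetical.

```python
def accuracy(classifier, block):
    """Fraction of (x, y) records in the block that the classifier gets right."""
    return sum(1 for x, y in block if classifier(x) == y) / len(block)


def weighted_vote(ensemble, x):
    """Label with the largest total weight among the ensemble members."""
    totals = {}
    for classifier, weight in ensemble:
        label = classifier(x)
        totals[label] = totals.get(label, 0.0) + weight
    return max(totals, key=totals.get)


def check_block(ensemble, block, threshold=0.8):
    """Keep the ensemble if the block is classified well, else reweight it.

    Returns (ensemble, offset_detected). The threshold is a hypothetical
    value chosen for illustration only.
    """
    vote_acc = accuracy(lambda x: weighted_vote(ensemble, x), block)
    if vote_acc >= threshold:
        return ensemble, False
    reweighted = [(clf, accuracy(clf, block)) for clf, _ in ensemble]
    return reweighted, True


# Two toy classifiers on a single numeric feature.
def clf_a(x):
    return 1 if x > 0 else 0


def clf_b(x):
    return 1 if x > 5 else 0


ensemble = [(clf_a, 0.9), (clf_b, 0.1)]
old_block = [(1, 1), (-1, 0), (2, 1), (-3, 0)]        # labels follow x > 0
new_block = [(1, 0), (2, 0), (6, 1), (7, 1), (3, 0)]  # labels now follow x > 5

_, drift_old = check_block(ensemble, old_block)
new_ensemble, drift_new = check_block(ensemble, new_block)
new_weights = [weight for _, weight in new_ensemble]
```

On the second block the dominant classifier is wrong on three of five records, so the check fails and the weights shift toward the classifier that matches the changed labels.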

Instance Data Acquisition.
In this paper, crawler software is used to collect the experimental data, and pseudohealth information published by the publicity section of the Sina Weibo community management center is used as a reference. This pseudohealth information was reported as false and has been clearly confirmed by the government as pseudohealth information. Because of the spread of various pseudohealth information during the new coronavirus epidemic, the amount of pseudohealth information on Sina Weibo is considerable. This paper crawled pseudohealth information through the Sina Weibo API from February 1 to March 31, 2020, and randomly collected 1,183 pseudohealth information items. Among them, 759 original Weibo posts have more than 100 comments. The content of each Weibo post was marked, counted, and sorted by its numbers of forwards, comments, and praises, and the experimental case dataset was constructed together with user information and the numbers of followers and fans.
To prevent the classifier from assigning all experimental data to health information, we added a manual verification step and selected microblogs with more than 100 comments whose text is not purely symbols and has a length greater than 10. The classification basis was obtained through manual verification and compared with health information. A total of 368 pieces of health information data were obtained through layer-by-layer screening, with more than 96.43 million comment texts. Based on the characteristics of the comment anomaly parameters and the SVM model parameters determined by the algorithm, the collected instance datasets were labeled manually. The selected instance dataset includes 359 pieces of pseudohealth information and 268 pieces of health information. When verifying the pseudohealth information recognition model, we used the remaining 100 pieces of pseudohealth information and 100 pieces of health information to conduct precision comparison training experiments. The composition of the experimental dataset is shown in Table 1.

LDA Topic Recognition and Preprocessing.
According to the data variables given in Table 2, the LDA model was used to preprocess the instance dataset and mine the document-topic distribution characteristics of the pseudohealth information dataset. The variables listed in Table 2, together with their meanings, are the characteristic indicators required for LDA model preprocessing.

Algorithms 1 and 2 (steps as far as the listing is recoverable):
(1) for all information data blocks $Z_j \in S$ do
(2) According to $Z_j$ and $Q(\cdot)$, candidate classifier $C'$ is established and weighted;
(3) According to $Z_j$ and $Q(\cdot)$, all classifiers $C_i$ in set $\varepsilon$ are weighted;
(4) if $|\varepsilon| < k$, then $\varepsilon \leftarrow \varepsilon \cup C'$; or the offset is detected, then
(5) According to $W$ and $Q(\cdot)$, candidate classifier $C'$ is constructed and weighted;
(6) According to $W$ and $Q(\cdot)$, the classifier $C_i$ in integration $\varepsilon$ is weighted;

In Figure 2, the horizontal axis is the number of topics, the vertical axis is the perplexity, and the polyline covers topic numbers from 3 to 28 at an interval of 1. As seen in Figure 2, the perplexity reaches its lowest value when the number of topics is 5. As the number of topics increases further, the perplexity rises with some fluctuation and reaches its maximum when the number of topics is 28. Based on the minimum principle of "perplexity + number of topics," 5 is selected as the topic parameter value of the LDA model. After the optimal topic parameter value is determined, the LDA model can be used to perform deep semantic training on the segmented instance dataset and determine the "document-topic" and "topic-word" distributions. These distributions give the class labels or classification features of topics and words and prepare for the blocking and reintegration of the instance datasets. The training results are shown in Table 3. As seen in Table 3, LDA model training yields 5 topics. The first 5 words are selected to represent each topic, and the probability of occurrence of each word is given.
Next, we randomly selected 6 documents as examples and show their "document-topic" distribution map to explore the probability of their themes and subject words. e specific results are shown in Figure 3. As seen in Figure 3, the probability of six document topics is different, but there is always a higher probability of one or two topics, while the probability of other topics is lower, which shows that the LDA model can divide the topic of microblog text well and provide a good foundation for the next step of this paper to block and integrate the microblog pseudohealth information instance dataset.

Block Experimental Datasets.
The experimental dataset processed by the LDA model was evaluated with K-fold cross-validation. The instance dataset $S$ was input and randomly divided into $K$ mutually exclusive subsets of different sizes, $S' = \{S_1, S_2, \ldots, S_K\}$. $S'$ was then trained and tested $K$ times; that is, in the $i$-th iteration, subset $S_i$ was retained as the test set and the remaining subsets were used for training. The block efficiency is computed as the training times of the $K$ iterations divided by the total number of experiments. The K-fold cross-validation uses the classifier in Algorithm 1 to extract the weight of the interactive information. The purpose of the cross-validation experiment is to verify the block efficiency and performance of Algorithm 1.
According to Algorithm 1, for a given instance dataset, if the attributes and class labels of the information text are obvious, the accuracy on the instance dataset is very high; if they are vague or not clearly defined, the window algorithm (Algorithm 2) needs to be used to detect semantic deviation. In the process of grouping instance datasets, the classification discrimination boundary changes as the candidate classifier $C'$ changes. Using the weights of all classifiers $C_i$, the instance information dataset $S$ is divided into data blocks of uneven size $z_1, z_2, \ldots, z_i, \ldots, z_n, \ldots$; the candidate classifier $C'$ is established according to $Z_j$ and $Q(\cdot)$ and weighted accordingly, so that the decision boundary of the instance dataset does not fall, step by step, into the center point of the one-dimensional, two-dimensional, and three-dimensional spherical Gaussians. The cross-validation data blocks follow a Gaussian distribution, and the block discrimination boundary is composed of two hyperbolic surfaces. The block decision area is not simply connected; it is the area where the two elliptical contour lines formed by the probability density are located, as shown in Figures 4(a) and 4(b).
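The cross-validation protocol described above can be sketched with index lists only. Two points are assumptions: the folds are made as equal as possible (sizes differ by at most one), which is the common convention and may differ from the "different sizes" mentioned in the text, and the example uses 627 samples (359 + 268) with K = 10 purely for illustration. The function names and the fixed seed are hypothetical.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Split range(n_samples) into k mutually exclusive, shuffled folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding spreads the remainder so fold sizes differ by at most one.
    return [indices[i::k] for i in range(k)]


def cross_validation_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) once per fold."""
    folds = k_fold_indices(n_samples, k, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(cross_validation_splits(n_samples=627, k=10))
fold_sizes = sorted(len(test) for _, test in splits)
```

Each sample appears in exactly one test fold, so every record is used for testing once and for training K - 1 times.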
In Figure 4(a), candidate classifier C ′ implements the partitioning of instance datasets according to the attributes of class labels. It uses all candidate classifiers C ′ in set ε to assign and update data blocks and creates k components to retain the original class labels of data blocks. e weight is updated based on the size of the instance buffer d to ensure that all data blocks have corresponding nonzero weights. In Figure 4(b), instance buffer d can not only retain the class tags of data blocks but also decide whether to replace the data blocks with the weakest class tags in the data block set according to classifier C i . In addition, the data blocks with the weakest class labels can be removed or collected into the sets of other classes to effectively block instance datasets.

Data Block Classification and Integration.
Vectorization is required for data block classification and integration. is paper uses the SVM model for classification and integration training, calls the libSVM tool, and adjusts the values of parameters C and c to make the covariance matrix of the instance data block set distribution equal to obtain two n-dimensional spherical distribution information sets, namely, "health" and "pseudohealth" information data block classification integration datasets σ 1 and σ 2 , where σ 1 and σ 2 are located on both sides of a n − 1-dimensional normalized hyperplane. e hyperplane is the classification decision boundary of the two. e central line of the two n-dimensional spherical distributions formed by σ 1 and σ 2 is perpendicular to the hyperplane, as shown in Figure 5(a) and 5(b). In the process of classification integration, assuming ∃i: Q(C ′ ) > Q(C i ), all classifiers C i in the set ε are empowered according to Z j and Q(·). If the errors of all types d ∈ ε to S are equal, then all data blocks with Z j ∈ ε are classified and weighted. By measuring the Oushi distance from each data block to the ε-mean vector, the "minimum distance" of the boundary (hyperplane) is judged based on ε display and classification, classify and collect the weighted data blocks into the nearest dataset σ i (i � 1, 2), and the weakest data block in the set σ i (i � 1, 2) is replaced with C ′ to realize the preliminary classification and integration of data blocks, as shown in Figure 5(a). e window algorithm (Algorithm 2) is different from other integrated classifiers. Its combination with the SVM model can continuously update and empower data blocks; therefore, data blocks are classified and integrated into the form of class labels, and the semantic deviation of data blocks can be effectively detected.
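The minimum-distance step, in which a vectorized data block is collected into the nearest of the two sets, can be sketched as a nearest-mean assignment. This is a simplified illustration: it uses plain Euclidean distance to each class mean, ignores the classifier weights, and all names and vectors are hypothetical.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def assign_to_nearest(block_vector, class_sets):
    """Return the key of the class whose mean is closest in Euclidean distance."""
    return min(
        class_sets,
        key=lambda name: math.dist(block_vector, mean_vector(class_sets[name])),
    )


# Made-up two-dimensional block vectors for the two sets.
class_sets = {
    "health": [[0.0, 0.0], [1.0, 1.0]],
    "pseudohealth": [[4.0, 4.0], [5.0, 5.0]],
}

label_a = assign_to_nearest([1.0, 2.0], class_sets)
label_b = assign_to_nearest([4.0, 3.0], class_sets)
```

When both classes have the same spherical covariance, assigning to the nearest mean is equivalent to splitting the space with a hyperplane perpendicular to the line joining the two means, which matches the geometric picture in the text.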
The candidate classifier C′ in Algorithm 2 and all the classifiers C_i in the set ε determine the distance between the classification integration dataset and the hyperplane, which continuously updates the weight of the instance data block set d. The distance between the two types of datasets and the classification hyperplane is maximized by the optimal classification hyperplane in the SVM model (see equation (4)). At the same time, the slack variable ξ is introduced to improve the fault tolerance of SVM training, and the sample points affected by the parameter γ in the RBF kernel function are mapped to a high-dimensional space to continuously correct the classification and integration efficiency so that the instance dataset can be accurately divided into the "health" and "pseudohealth" information sets σ_1 and σ_2. The precise classification and integration process is shown in Figure 5(b).
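For reference, the soft-margin SVM objective with slack variables and the RBF kernel are commonly written in the following textbook form. The symbols w, b, φ, x_i, y_i, and n are assumptions introduced here for illustration and may differ from the notation of equation (4); γ denotes the RBF kernel parameter that is tuned together with the penalty parameter C.

```latex
\begin{aligned}
\min_{w,\,b,\,\xi}\quad & \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_{i} \\
\text{s.t.}\quad & y_{i}\bigl(w^{\top}\varphi(x_{i}) + b\bigr) \ge 1 - \xi_{i},
\qquad \xi_{i} \ge 0,\quad i = 1,\dots,n, \\
K(x_{i}, x_{j}) &= \varphi(x_{i})^{\top}\varphi(x_{j})
= \exp\bigl(-\gamma\,\lVert x_{i} - x_{j}\rVert^{2}\bigr).
\end{aligned}
```

In this form, a larger C penalizes margin violations more heavily, and a larger γ makes the influence of each sample point more local.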

Performance Evaluation of Classified Detection.
To illustrate the advantages of the algorithm proposed in this paper, the logistic algorithm [21], decision tree (DT) [22], and artificial neural network (ANN) [23] are adopted for comparison, and the classification accuracy of the four algorithms is tested. The classifiers of these four algorithms can use sliding windows in free combination to update and classify the instance data block sets. Therefore, the experimental data block set can be classified and integrated. Because the cross-validation strategy can reduce the effect of classifier overfitting on the evaluation and better reflect the generalization ability of the four algorithms, the classification accuracy of the four algorithms is compared by using the cross-validation strategy. The 10 instance subsets in this paper are randomly selected for training to verify the classification accuracy of the different models. The experimental results are shown in Figure 6.
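A cross-validation comparison of this kind can be sketched as follows, assuming a standard 10-fold scheme. The function names and the `fit`/`predict` callables are hypothetical placeholders for any of the four classifiers; the paper does not specify how its 10 subsets are formed.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.

    Assumes n_samples >= k so that no fold is empty.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint folds covering all indices
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx


def cross_validated_accuracy(samples, labels, fit, predict, k=10, seed=0):
    """Average held-out accuracy of the classifier defined by `fit` and `predict`."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(samples), k, seed):
        model = fit([samples[i] for i in train_idx], [labels[i] for i in train_idx])
        hits = sum(predict(model, samples[i]) == labels[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / len(scores)
```

Each sample is held out exactly once, so the averaged score reflects performance on data that the model did not see during training.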
As seen in Figure 6, the detection effects of the DT, logistic algorithm, and ANN are better than that of the method proposed in this paper within the first 100 seconds, whereas the method in this paper is better than the other three algorithms after 100 seconds. The reason is that the mean absolute error (MAE) of the classification of the proposed method is higher than that of the other three methods within the first 100 seconds but lower after 100 seconds; all three overlap settings show a similar pattern. To examine this further, the window unit is set to minutes, the sliding window size is set to 100 minutes, and the overlap size is set to 20%, 50%, and 70% of the window size. The four algorithms are used to detect the example dataset, and the detection effect is shown in Figure 7.
In Figure 7, the DT, logistic, and ANN algorithms can greatly reduce MAE after 30 minutes by adjusting parameter settings and adopting supervised/semisupervised methods to improve the classification effect. The algorithm in this paper can efficiently detect and classify instance datasets from the beginning, and its MAE value always fluctuates between 0.5 and 0.8. Therefore, at both the second and minute scales, the algorithm in this paper is overall superior to the DT, logistic algorithm, and ANN model. The classification accuracy of the four algorithms is also compared on the example dataset of this paper, and the experimental results are shown in Table 4. As seen in Table 4, the classification accuracy of the four classifiers differs considerably. The algorithm in this paper has the highest classification accuracy: 96.88% on the training sample and 98.73% on the test sample. DT has the lowest classification accuracy: 71.35% on the training sample and 76.29% on the test sample. The accuracy of the logistic algorithm and ANN lies between the two, and the accuracy of ANN is slightly higher than that of the logistic algorithm.
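The overlapping-window evaluation with MAE can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, the overlap is passed as an integer percentage, only full windows are kept, and labels are encoded as numbers so that an absolute error is defined.

```python
def sliding_windows(n_items, window_size=100, overlap_percent=50):
    """Return (start, end) index pairs of full windows that overlap by a percentage."""
    if not 0 <= overlap_percent < 100:
        raise ValueError("overlap_percent must be in [0, 100)")
    step = max(1, window_size * (100 - overlap_percent) // 100)
    return [(s, s + window_size) for s in range(0, n_items - window_size + 1, step)]


def mean_absolute_error(y_true, y_pred):
    """Mean absolute difference between true and predicted numeric labels."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def windowed_mae(y_true, y_pred, window_size=100, overlap_percent=50):
    """MAE computed separately inside each overlapping window."""
    return [
        mean_absolute_error(y_true[s:e], y_pred[s:e])
        for s, e in sliding_windows(len(y_true), window_size, overlap_percent)
    ]


# With a window of 100 and overlaps of 20%, 50%, and 70%, consecutive windows
# start 80, 50, and 30 positions apart, respectively.
for percent in (20, 50, 70):
    print(percent, sliding_windows(300, 100, percent)[:2])
```

A larger overlap moves the window forward in smaller steps, so the error curve is sampled more densely over the same stream.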

Conclusions
The identification of network pseudohealth information is not only a frontier and focus in the field of news dissemination but also a focus and difficulty in the field of data mining. Although some scholars have studied this problem and proposed many recognition methods, the existing methods are mainly single classifiers or batch processing, with the result that either the classification is ineffective or the recognition accuracy is low. Based on the class-label attributes of network pseudohealth information datasets and on previous research results, this paper proposes a combination algorithm that integrates data partitioning and classification update. The algorithm combines the LDA topic recognition model, the dataset partitioning algorithm, the SVM data block classification integration model, the semantic offset detection algorithm, and other methods, and Web crawler technology is adopted to conduct simulation experiments on the pseudohealth information of the Sina Weibo platform during the epidemic from February 1 to March 31, 2020. The simulation results show that the combination algorithm proposed in this paper performs well in both the topic recognition of pseudohealth information and the blocking and integrated classification of instance datasets. Compared with DT, the logistic algorithm, and ANN, the classification integration accuracy of this method is higher, which illustrates the reliability and practicability of the method. The identification of pseudohealth information will be of great significance in the future for maintaining normal public health order and building a "Healthy China." Traditional mainstream media has high authority and influence. As a public tool of society, the media should perform its functions to serve the audience and society and strengthen the checking of fake health information to clarify its authenticity.
At the same time, the media should also promptly clarify the pseudohealth information that disturbs people to prevent its spread, which is also a way for the media to maintain their own image and authority. Therefore, we should not only pay attention to the problems existing in the dissemination of various kinds of information but also make full use of technical means and tools to curb the further dissemination and influence of pseudohealth information.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.