Semantic Outbreak Power Based Evolution of Web Event in Large-Scale Ubiquitous Contexts

Nowadays, emergencies’ events have a great impact on people's daily life. Web is acting as an important platform for the diffusion and evolution of social events. With the popularization and development of web technology in the world, web is becoming an important platform to cover, transmit, and release the news. Information holders can use Internet as a medium to broadcast various news in real time. But it is difficult for people to excavate the events development trend from web information. In this paper, we introduce a method to measure the evolution process of the Web event based on the semantic outbreak power. Then, we propose an approach to distinguish the event type based on the fuzzy recognition. Finally, we give some instances validation of semantic outbreak power of web event and the correctness verification of the event type distinguish of Web event.


Introduction
With the popularization and development of web technology in the world, web is becoming an important platform to cover, transmit, and release the news.Information holders can use Internet as a medium to broadcast various news in real time.Web users can quickly and comprehensively grasp the dynamic events [1].
Web event refers to the web event which is sustainedly and extensively reported and discussed in short online media (such as BBS, news sites, and blogs) [2,3].It is likely to cause some harm to the social reality.The reported and discussed forms are various, such as a web page news commentary, the post and reply in BBS, and the records and message in blog.These reports and discussions have a major impact on the social stability and a large number of Internet users.It often accompanies the occurrence, development, and change of the hot topics and social events.Web event has some typical features: (a) it may have wide and speedy transmission; (b) it may have devastating effects on the society; (c) it is not easy to find and control the breaking point.So how to detect and compute the propagation speed, predict the direction of evolution, and effectively control web event is a major challenge.Therefore it is necessary to analyze the future of the web event and measure its evolution.
With the development of internet of things [4][5][6], big data [7][8][9][10], and cloud computing [11][12][13][14], evolution is a basic feature of the web events and important research content in the field of topic detection and tracking (TDT) [3].The main research contents of the TDT [15,16] involve topic detection and information collection, segmentation of event information and the first report event time detection, and topic tracking.Generally speaking, the TDT technologies try to detect unknown events and cluster news related to these events.
The TDT traces the development of the events.But it is not to measure the dynamic evolution process of the events and does not consider event semantic characteristics in the evolution process.So the TDT cannot offer us a global and clear understanding of the web event.Thus, in order to recover the insufficiency of the TDT, we have proposed a method to measure the evolution process of the web event based on semantic outbreak power.Then, we put forward an 2 International Journal of Distributed Sensor Networks approach to distinguish the event type based on the fuzzy recognition.
The paper is organized as follows.Section 2 presents briefly the topic detection and topic tracking.Section 3 introduces a method to measure the evolution process of the web event based on semantic outbreak power.Section 4 puts forward an approach to distinguish the event type based on the fuzzy recognition.Section 5 gives some instances of validation of semantic outbreak power of web event and the correctness verification of the event type distinguish of web event.

Related Work
The TDT [17][18][19] was first initiated by the Defense Advanced Research Projects Agency (DARPA) and National Institute of Standards and Technology (NIST).It aims to develop a series of information organization technology based events and help people cope with information overload problem.Then, we mainly introduce the research of the two subtasks: topic detection and topic tracking.

Topic Detection.
Topic detection puts data stream from the newswire and news sources reported into different topics.If it is necessary, topic detection will establish the new topic of technology.Topic detection can be thought of as an event cluster.Most of these clusters are incremental [20].
Most of current research topic detection using the traditional method of natural language processing: the center vector method [17], NN [21], -means [22], single pass clustering algorithm [23], and so forth.The processing object of the topic detection is the reporting flow, which is changing with time, rather than a static text set.So most studies use a combination of a variety of clustering strategies to extend and improve these basic clustering algorithms.
More representative research includes the following: Yang et al. [24] used the strategy which combined the condensing type clustering algorithm and the average clustering algorithm; the researchers of CMU [25] mainly use the single pass clustering method with time Windows to detect the topic; literature [26] proposed a hierarchical community discovering algorithm based on event network, which exploits the semantic properties of event nodes and edge-weight information in the network, to discover fine granularity communities that are semantically meaningful.
In addition, establishing the model for the topic and report is another important study in the process of topic detection.Related models include space vector model, probabilistic model, lexical chain model, and graph model.The vector space model is the most commonly used model.There are many kinds of the formula to calculate the similarity based on this model, such as okap, Clarity, Hellinger, Tanimoto, Weightsumt, and cosine similarity formula.

Topic
Tracking.Topic tracking is the technology used to identify the rest news associated with the topic from news flow according to the given small amounts of training reports associated with a topic.The essence of this task is a guided learning.Topic tracking uses few positive instances data (one or more samples) and a lot of negative instances (history data) to get a classifier.This classifier is used to distinguish whether the new report is associated with the topic.So the topic tracking can be regarded as a special kind of binary classification problem.Many of the technologies in text categorization are the foundation of topic tracking.
In the topic tracking study, the commonly used methods include  nearest neighbours (NN) [27,28], the decision TREE (D-TREE) [29], Rocchio algorithm [30], the SVM algorithm [31], and language model [32].Most of the research improved the basic algorithm and then applied it to the topic tracking.The following simple introduce several representative researches.
Allan et al. used Rocchio algorithm to implement the topic tracking.The core idea of the algorithm is the structural strategy of the topic model.If the feature is conducive to the topic description, its weight will be strengthened.If the feature is tending to wrongly guide topic description, its weight will be weakened [33].Leek and Sista used the probability model in their topic tracking and recognition system, mainly based on the simple Bayesian algorithm.The system combines multiple classifier system, organized to present the results of each classifier to the user [32].
In addition, another research about web events or topics includes this paper [34] which presents a novel hot event discovery framework to detect hot events online.It contained three stages: document preprocessing, threshold-resilient document classification, and adaptive splitting document clustering.Deng and Xu [35] propose a method to measure the influence and represent the event evolution graph.Lee et al. [36] proposed an incremental tracking framework for cluster evolution over highly dynamic networks.This paper considered the event evolution tracking task in social streams as an application, where a social stream and an event are modeled as a dynamic postnetwork and a postcluster, respectively.
Generally speaking, the traditional TDT technologies just detect unknown events and put related news reports into different topics based on text clustering and text classification.The TDT also traces the development of the events but does not analyze the event deeply.So it cannot provide us with a global and clear understanding of the web event.

Semantic Features of the Web Event.
Our goal is to measure the evolution process of the web event based on semantic outbreak power.So, first of all, we need to get the semantic features set of the web events in the evolution process (Table 2).
Definition 1 (the semantic features set of web event Fe).The semantic features set of web event Fe consists of three parts including event seed set (  ,   ), web page set (  ,   ), and events keywords set (  ,   ).
Events seed set (  ,   ) contains the core attributes word of events which are usually used as the keywords to search for events.Web page set (  ,   ) is a set of resources on the web.The set is a cluster of related web document collections which The basic steps to get the semantic features set of web event Fe are as follows: (1) using events seed set (  ,   ) as search events attributes, search engine will return web page set (  ,   ); (2) extracting events keywords set (  ,   ) from web page set (  ,   ), the weight of event attributes   of web page   can be computed by the TF-IDF method [37].
Then, we will discuss several basic semantic features of the web event set.
It means that there are new increased  related web pages of event e during the time [  ,   ], and no intersection set of web pages during all previous time [  ,   ], namely, Δ(  ,   )∩ (  ,   ) = ⌀.
Table 1 shows the event's time-series data of the new increased web page.The right of the table was the new increased number of Chinese web page of the specific source from the search engine on the day.Definition 3 (the new increased keywords set: Δ(  ,   )).From time   to   , the new increased keywords set of event e is extracted from the new increased web page set Δ(  ,   ), which can be labelled as Δ(  ,   ) = {  ,  +1 , . . .,   }.It means that there are new increased  events attributes of event e during the time [  ,   ] and no intersection set of events keywords set during all previous time [  ,   ], namely, Δ(  ,   ) ∩ Δ(  ,   ) = ⌀. ) .
The new increased web page set Δ(  ,   ), the new increased events keywords set Δ(  ,   ), and the distribution of event attributes in the new increased web page (  ,   ) can be obtained from the source data shown in Table 3, and calculate the binary chart shown in Figure 1.The picture reflects the relationship of web page, event keywords, and the distribution of event attributes in the web page.

The Semantic Outbreak Power
Definition 5 (the outbreak power of web event: op(  ,   )).From time   to   , the outbreak power of web event can be labelled as op(  ,   ).It represents the fact that the severity power of the event e may cause harm and urgency power to deal with emergency during the time [  ,   ].
We understand the web event life process more comprehensively after discussing the semantic features set of web events.We will propose an iterative algorithm combining all of these characteristics of events, measuring the web event life process.

Input:
The set of pages (  ,   ), the set of keyword (  ,   ), the distribution of keywords among pages (  ,   ).Output: The outbreak power op(  ,   ). for each 𝜑

𝜓(t i , t j )
Figure 1: The relationship of the events seed set (  ,   ), the new increased web page set Δ(  ,   ), the new increased events keywords set Δ(  ,   ), and the distribution of event attributes in the new increased web page (  ,   ) (  ,   ).
In the previous section, the semantic features set of web events have been defined and attained.Now, we will use these features to calculate the outbreak power.Before proposing the Calculation method, we introduce two important propositions: the representability of event keywords and the credibility of a web page.

Proposition 6 (the representability of event keywords: er(𝑘)). It describes the ability of the event keywords to express the event.
Proposition 7 (the credibility of a web page: ()).It describes the believable event power of the keywords to express the event.
According to the observation of the actual data set and cognitive knowledge, we give the following few deductions to be the basis of calculating semantic outbreak power algorithm (Algorithm 1).Deduction 1.The more web pages on event discussion, the higher semantic outbreak power during the time [  ,   ], namely, op(  ,   )∞|Δ(  ,   )|.
Deduction 2. The more keywords on event discussion, the higher semantic outbreak power during the time [  ,   ], namely, op(  ,   )∞|Δ(  ,   )|.Deduction 3. When the disagreement is larger about the event discussion, the distribution of event keywords in the new increased web page (  ,   ) is one-to-one mapping.Then, the semantic outbreak power is higher, namely, According to Deduction 3, on the condition that each event keyword is provided by one web page which only offers one keyword, the similarity power between each page will be 0.This means that all the webs are different.At this point, the disagreement is bigger and the event is likely to further deteriorate.In contrast, we can make Deduction 4. Deduction 4. If the web page number and keywords on the event discussion are decay during the time [  ,   ], the distribution of event attributes in the new increased web page (binary diagram in Figure 1) tends to be a complete graph.At this time, the power of the outbreak of the event is low.
According to Deduction 4, if the entire keywords exit in every web page, namely, the event keywords set and web page set constitute a complete graph, the similarity power between each page will be 1.This means that all web pages are reproduced from a web page.At this point, the web event is relatively uniform and is less likely to generate further threats.
From the previous discussion, we can know that the greatest influences on the semantic outbreak of web events are the new increased web page set Δ(  ,   ), the new increased keywords set Δ(  ,   ), and the distribution of event attributes in the new increased web page: (  ,   ).
Considering the above discussion, we can use the representability of event keywords er() to infer the credibility of a web page ().The reverse is also true. it is shown in Figure 2 which consists of three parts: the new increased web page set Δ(  ,   ), the new increased keywords set Δ(  ,   ), and the distribution of event attributes in the new increased web page: (  ,   ).
Figure 2: The process of iterative calculating the credibility of web page and the represented ability of event keywords.
According to Figure 2, we get the following two deductions.
Deduction 5.For a web page , if most of the event keywords have the strong representability er(), the web page will have high credibility ().Deduction 6.If a keyword  is provided by a higher () web page, the keyword  will have a strong er().er() and () affect each other.So we can use altercative algorithm to compute the semantic outbreak power considering the new increased web page, the new increased keywords, distribution of keywords in the web page.
According to Deduction 4, for a web page , we can compute the average of the representability of event keywords to compute the credibility of a web page.Consider where Kw() is the event keywords set for a web page .
According to [38], we use the probability function to calculate the representability of event keywords: where Wp() is the web page set event keywords .
The semantic outbreak power op(  ,   ) during the time where  is |Δ(  ,   )| and it represents the new increased web pages during the time [  ,   ].
Therefore, we put forward the detailed steps on semantic outbreak power algorithm as follows: (1) give each page the initial credibility; at the same time, all the initial credibility is a vector of a web page ⃗ ; namely, ⃗  = {( 1 ), ( 2 ), . . ., (  )}; (2) according to formula (4), compute the representability of each event keyword er(); (3) according to formula (3), compute the credibility of a web page: ()  ; (4) all types of the credibility calculated using (3) constitute a vector:  →   , namely,  →   = {( 1 )  , ( 2 )  , . . ., (  )  }; (5) computing will reach a stable state until cosine similarity of ⃗  and  →   is greater than ; at this time, the representability of event keywords er() and the credibility of a web page () will no longer be changed; if this condition is not satisfied, make ⃗  =  →   ; turn to step (2); if this condition is satisfied, end the iteration; (6) according to formula (4), calculate the semantic outbreak power.

Experiment
4.1.The Verification of the Semantics Outbreak Power.Experiments involve events derived from Baidu news sites.We chose 50 web events about 450000 web pages and 100 events about 900000 web pages as the experimental data set, which includes various areas such as political, accidents, and disasters, and terrorist attacks.Tables 3 and 4 show the statistical results of the experimental data set in detail in this paper.The determination of these web start events stamp according to the method in [26], and the end events stamp general settings for the experimental process in the time we crawl events related web pages; the average length of each event sampling 6 International Journal of Distributed Sensor Networks time is 30∼40 days.In addition, to determine the evolution process of event, we also obtain semantic futures of events, including event seed set, event web page set, and the event keywords set.We get the event seeds from Baidu.Baidu provides hot issues; at the same time, it also provides hot search words to help users search events.For example, about Japan's catastrophic earthquake and tsunami disaster in 2011, many web users want to search the related web page information.Baidu will provide a set of keywords about events such as "Japan, earthquake, tsunami." We will be event seed as a search term and crawl events related web from the Internet, after getting the event seed from the search engines.
The detailed steps we collect from the web resources are (1) obtaining event seed set (  ,   ) of web event from the news websites Baidu, such as "Japan, earthquake, tsunami, " (2) using the event seed set as the search keywords, crawl events related web pages from the Internet set, (5) doing step (4) in the different sources of information (including news, blogs, and BBS).
We took "days" for the minimum time granularity, collected, from different sources on the web event, every day, the semantic characteristics of the source data, used the iterative algorithm to calculate the semantic outbreak powers every day based on the source data, and then got the events timeseries data of the semantics outbreak over a period of time, as shown in Figures 3, 4, and 5.

Conclusions
Nowadays, emergencies events have a great impact on people's daily lives.And, with the popularization and development of web technology worldwide, web is acting as a platform for the diffusion and evolution of social events.However, faced with the huge disorder and continuous web resources, it is impossible for people to efficiently recognize, collect, and organize the events.Therefore, automatically collecting and organizing the information about events and then tracking the evolution process of events becomes a hot research field.Generally, the traditional topic detection and tracking (TDT) techniques have been attempting to detect or cluster news stories into these events, without discussing or interpreting the evolution process of events.
In this paper, we proposed the semantic outbreak power and events process measurement algorithm.For any one event, we can calculate the event time-series data of the semantic outbreak power and measure the evolution process of events by analysing the information from the web about the event.It can help people to understand clearly the evolution process of a web event.Then, we propose an approach to distinguish the event type based on the fuzzy

Figure 3 :Figure 4 :Figure 5 :
Figure 3: The event "Japan nuclear leak" evolution based on semantic outbreak power.

Table 1 :
The new increased web page number of the events of "Libya unrest" every day.the events in the time period [  ,   ], and it can be labelled as (  ,   ) = {  ,  +1 , . . .,   }.We can use the search engine (e.g., Google) as interface and the seed set as a keyword to search events.We can crawl from the Internet for time-series data of hot issues.Event attributes set (  ,   ) is a bunch of event attributes set which is extracted from the web page set (  ,   ), and it can be labelled as (  ,   ) = { described  ,  +1 , . . .,   }.

Table 2 :
The semantic features original set of the web event evolution process.
determining the event start time stamp   according to the web page set (  ,   ) and the event end time stamp   according to the download page time, (4) obtaining timing source data of event every day from the web page set (  ,   ) including new increasing web page number |Δ(  ,   )|, new increasing keywords number |Δ(  ,   )|, and the distribution of event keywords in the new increased web page: (  ,   ),