Information Classification and Extraction on Official Web Pages of Organizations

Abstract: As a real-time and authoritative source, the official Web pages of organizations contain a large amount of information. Because Web content and formats are diverse, pre-processing is essential to obtain unified attributed data, which is valuable for organizational analysis and mining. Existing research handles multiple Web scenarios insufficiently and with limited accuracy. This paper proposes a method to transform the official Web pages of organizations into data with attributes. After the active blocks in the Web pages are located, structural and content features are proposed to classify information with a specific model. Extraction methods based on a trigger lexicon and LSTM (Long Short-Term Memory) are then proposed, which efficiently process the classified information and extract data that matches the attributes. Together, these steps form an accurate and efficient method to classify and extract information from organizational official Web pages. Experimental results show that our approach improves the performance indicators and outperforms state-of-the-art methods on a real data set of organizational official Web pages.


Introduction
Organizations, such as companies and schools, are collectives of people working together to achieve specific goals. As a force leading social development, most organizations maintain official Web pages, which contain extensive personal, departmental, and business information and show its significance and correlations. In application scenarios such as supervision and investigation, it is essential to obtain real-time and useful information from organizations. Compared with other sources, the official Web pages of organizations have the following characteristics: 1) The content caters to public perception. 2) The information is published in real time. 3) The information contained is authoritative.
Mining the value of this information through essential pre-processing contributes to tasks such as analyzing organizational relationships and predicting development trends. To analyze the information in official Web pages of organizations effectively, pre-processing complex multi-source Web pages is an essential first step. Current research has several problems: 1) Public data sets related to these Web pages are lacking [Pasternack and Roth (2009); Hernández, Rivero, Ruiz et al. (2014)], so researchers must crawl the pages themselves. 2) Every organizational Web page has its own layout and content [Gautam and Kumar (2013)] and contains much irrelevant information. 3) Organizational information is of varying lengths [Thamviset and Wongthanavasu (2014); Wang, Ma, Zhang et al. (2008)], which makes it challenging to map attributes. In general, there is still no effective method to deal with the official Web pages of organizations.
In this paper, an information classification and extraction method for the official Web pages of organizations is proposed, which processes Web pages into unified attributed information. Active information blocks are obtained based on the analysis of valid characters. By combining the structural and content features of Web pages with a specific model, the method classifies the information accurately. For organizational information of varying lengths, two processing methods based on trigger rules and LSTM are proposed to extract the classified information. The architecture of the classification and extraction method is shown in Fig. 1. The main contributions of the paper are summarized as follows: 1) Classification and extraction methods are designed specifically for the official Web pages of organizations. 2) After the active information blocks of the official Web pages are located, specific structural and content features for classification are proposed and shown to be effective by the experiments. 3) By combining the trigger lexicon and LSTM, methods for extracting different types of information from organizational Web pages are designed, which perform better than similar methods.
The remainder of this paper is organized as follows. Recent studies are described in Section 2. The information classification and extraction methods for official Web pages are presented in Sections 3 and 4. Experiments on a real official Web page data set are reported in Section 5. Finally, the discussion and conclusion are presented in Section 6.

Related work
In this section, recent studies are divided into the classification and extraction of Web page information.

Classification on Web page information
Classification of Web pages plays an essential role in the process of content mining. Hashemi [Hashemi (2020)] not only surveyed the methodologies proposed in the literature but also traced their evolution and portrayed different perspectives on this problem. Recent studies mainly focus on two aspects: classic text classification and classification based on Web page features. In the aspect of classic text classification, Gautam et al. [Gautam and Kumar (2013)] improved the mutual information formula by adding a class correlation balance factor and applied it to the feature weighting algorithm, which significantly enhanced the effect of text categorization. Saleh et al. proposed ONBC, a novel strategy for vertical Web page classification, which employs several Web mining techniques and depends mainly on a proposed multi-layered domain ontology. Xu et al. [Xu, Yu and Qi (2018)] presented a novel sensitive information classification algorithm and topic tracking algorithm for the contents of Tibetan Web pages. However, Web pages are semi-structured HTML documents, which have structural features of layout and rendering in addition to text information. Therefore, classification based on classic text methods alone has limitations in practical applications.
Building on traditional text classification methods, many studies work with Web page features. Pasternack et al. [Pasternack and Roth (2009)] looked for the largest subsequence of a Web page and obtained the content of Web information by segmentation with the proposed method EMSS. Hernández et al. [Hernández, Rivero, Ruiz et al. (2014)] proposed an unsupervised URL-based Web page classification method. By constructing URL patterns, it classifies Web pages into categories and matches the classified Web pages with those patterns. Saleh et al. proposed a new centroid-based model to solve the class imbalance problem, which learns from the training data and weighs each category to indicate the data distribution of the corresponding categories. Wang et al. [Wang and Qu (2017)] proposed a new method of Web text classification based on an improved CNN and SVM, using a CNN model with a five-layer network structure to extract text features and then classifying and predicting with SVM. Onan [Onan (2016)] presented a comparative analysis of four different feature selection methods and four different ensemble learning methods based on four different base learners. The experimental results of these methods indicate that feature selection and ensemble learning can enhance the predictive performance of classifiers in Web page classification.

Extraction on Web page information
According to the extraction methods, recent studies can be divided into three categories: pattern-based, domain ontology-based, and machine learning-based methods. Thamviset et al. [Thamviset and Wongthanavasu (2014)] designed a semi-supervised extraction system and proposed ERSP, a method of information extraction based on repetitive patterns. This system applies a topic tree clustering algorithm to cluster the target data records and create extraction patterns. Li et al. [Li, Jiang, Xu et al. (2017)] built a Web information retrieval matching and structure extraction model based on a search engine, which locates and automatically extracts news information from multiple Web sites with regular expressions. However, when the structure of Web pages changes, the extraction rules need to be modified. Compared with the pattern-based approach, the domain ontology-based method ClusTex proposed by Ashraf et al. [Ashraf, Özyer and Alhajj (2008)] has its own specificity and limitations. A domain ontology is the collective knowledge recognized by people in a specific domain and can also be learned from many Web pages. Moreover, ClusTex takes more effort to construct. Zhang et al. [Zhang and Ding (2015)] introduced Web page segmentation into the Web page pre-processing stage by analyzing ontology-based Web information extraction technology. Vigneshwari et al. [Vigneshwari and Aramudhan (2015)] developed a model based on multiple ontologies constructed from mutual information, whose experimental results show a clear improvement in quick access to useful data from a huge information resource such as the Internet. Web page information extraction based on machine learning utilizes learning models such as the conditional random field method [Li (2012)], the SVM (support vector machine) method [Wu, Hu and Liang (2014)], and multimodal learning [Gong, Wang and Peng (2017)] to transform Web page information extraction into a model optimization problem. The advantage of this approach is that it adapts better to changes in the structure of the Web page, although its cost is high.
Therefore, to classify the official Web page information effectively, a combination of location and features is adopted. When the classified data is extracted, the appropriate method is applied according to the classification features, which is a reasonable way to classify and extract the official Web page information.

Classification on official Web pages
In this section, an information block location method based on valid characters is proposed to complete the information classification work. After that, the structural and content classification features are summarized in Tab. 1 to classify the active blocks in official Web pages.

Location of active information blocks
Block location is an effective way to identify the active components of Web pages. In official organizational Web pages, valid characters are mostly articles, prepositions, and conjunctions, which are key to expressing a sentence smoothly. Therefore, it is feasible to use valid characters to locate active information blocks in pages. The presence of valid characters indicates that the text consists of semantically complete and fluent sentences. The more valid characters a DOM node contains, the more likely it is to be an active information block. The number of valid characters can therefore be recorded as an attribute of each DOM tree node. As shown in algorithm CountChars, an attribute named validChars is added. When a node is a leaf of the DOM tree, its text content is checked for valid characters. After processing, a Web page file with the attribute validChars is obtained, in which each DOM node shows its number of valid characters. This number represents the likelihood that the node lies in the information block.
Definition 1. i represents a node in the DOM tree. Ci represents the number of valid characters of i, and j represents the subnode of i. The maximum character ratio of the subnode (MPV) is defined as follows.
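One form of Eq. (1) that is consistent with Definition 1, assuming the ratio is taken between the valid-character count of each subnode and that of its parent, is sketched below. The notation children(i) is introduced here for illustration only.

```latex
\mathrm{MPV}_i = \max_{j \in \mathrm{children}(i)} \frac{C_j}{C_i} \tag{1}
```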
MPV indicates the importance of a child node within its parent node. As shown in the block location algorithm Major, the DOM tree annotated with the attribute validChars is taken as input. For the current node, all subnodes are checked and the one with the maximum number of valid characters is selected as maxNode. If the value of MPV is less than the threshold k, Major stops and outputs the current node. Otherwise, Major is performed recursively on maxNode until the Web page information block is determined.
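The two steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Node` class, the `VALID_WORDS` set (a small hypothetical list of English articles, prepositions, and conjunctions), and the default threshold are assumptions made for illustration, and MPV is taken as the dominant child's share of the parent's valid-character count.

```python
from dataclasses import dataclass, field
from typing import List

# Assumption: "valid characters" are approximated by a few English function
# words; the actual set used in the paper is not given in the text.
VALID_WORDS = {"the", "a", "an", "of", "in", "on", "to", "for", "with",
               "and", "or", "but"}


@dataclass
class Node:
    """A simplified stand-in for a DOM node (hypothetical structure)."""
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)
    valid_chars: int = 0  # mirrors the validChars attribute


def count_chars(node: Node) -> int:
    """Annotate each node with the number of valid words beneath it."""
    if not node.children:
        # Leaf node: count the valid words in its own text.
        node.valid_chars = sum(
            w.strip(".,;:!?").lower() in VALID_WORDS for w in node.text.split()
        )
    else:
        # Inner node: accumulate the counts of its subnodes.
        node.valid_chars = sum(count_chars(c) for c in node.children)
    return node.valid_chars


def major(node: Node, k: float = 0.5) -> Node:
    """Descend into the dominant subnode while its share (MPV) is at least k."""
    if not node.children or node.valid_chars == 0:
        return node
    max_node = max(node.children, key=lambda c: c.valid_chars)
    mpv = max_node.valid_chars / node.valid_chars
    if mpv < k:
        return node  # no single subnode dominates: this is the block
    return major(max_node, k)


# Usage: a navigation list and an article with two paragraphs.
nav = Node("ul", children=[Node("li", "Home"), Node("li", "About")])
article = Node("div", children=[
    Node("p", "The school of engineering is a part of the university."),
    Node("p", "It was founded in 1950 and serves the region."),
])
body = Node("body", children=[nav, article])
count_chars(body)
block = major(body, k=0.7)
print(block.tag, block.valid_chars)
```

In this toy page the navigation list contributes no valid words, so the search descends into the article container and stops there, because neither paragraph alone reaches the threshold.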

Structural features of official Web pages
After the organization's official Web pages are analyzed with the location of active information blocks, the information that needs to be classified is obtained. These active information blocks have features that are mainly reflected in the structure and content of Web pages. Considering the valid characters and subtags, three structural features are given as follows.
1) The maximum proportion of valid characters in subtags of Web page information (MPV is shown in Eq. (1)).
2) The maximum difference in the proportion of valid characters in subtags of Web page information (MDP).
As shown in Eq. (2), i represents the node with DOM number. Ci represents the number of valid characters of DOM node i, and j represents the child of DOM node i.
3) The number of sub tags in Web page information (NSW).
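The three structural features can be computed from the valid-character counts of a block's subtags, as in the following sketch. The function name is hypothetical, and because Eq. (2) is not reproduced in the text, MDP is assumed here to be the difference between the largest and smallest subtag proportions; other readings are possible.

```python
def structural_features(child_counts):
    """Compute MPV, MDP and NSW from the valid-character counts of subtags.

    Assumption: each proportion is a subtag count divided by the parent's
    count (the sum of its subtags), and MDP is max minus min proportion.
    """
    nsw = len(child_counts)
    total = sum(child_counts)
    if nsw == 0 or total == 0:
        return {"MPV": 0.0, "MDP": 0.0, "NSW": nsw}
    ratios = [c / total for c in child_counts]
    return {"MPV": max(ratios), "MDP": max(ratios) - min(ratios), "NSW": nsw}


# Usage: a block with three subtags holding 4, 3 and 1 valid characters.
features = structural_features([4, 3, 1])
print(features)
```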

Content features of official Web pages
In the research of this paper, the classification is mainly oriented to the three types of information as follows. 1) Personal information comes from the employees of the organization, such as name, age, and school. 2) Department information comes from the organizational departments, such as name, e-mail, and functions. 3) News information comes from organizational news, such as name, description, and knowledge.
In Eq. (5), S represents the set of identical names in the main body information block, and s represents each specific result. The most significant difference between list-type single-person and list-type multi-person information pages is the presence of multiple identical personal names.
3) The number of business information (NBI)
As shown in Eq. (8), K denotes the set of identical department names in the main body information block, and s indicates each specific result. The most significant difference between list-style single-department and list-style multi-department information pages is the presence of more identical departmental names.

6) The character proportion of business information (CPB)
In Eq. (9), C denotes the regular matching result set of business information, s represents each specific result and r denotes the characters in s. SUM represents the total number of valid characters in the Web page information block. The cumulative distribution of the experimental data set shown in Fig. 3 proves the validity of content features.
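As a rough illustration of CPB, the sketch below sums the characters of all regular-expression matches and divides by a total for the block. The pattern and the sample text are hypothetical, and the total is approximated by the length of the block text, which is an assumption; the paper defines SUM as the number of valid characters in the block.

```python
import re


def character_proportion(block_text, pattern, total_chars):
    """Share of the block covered by regular-expression matches."""
    matched = sum(len(m.group(0)) for m in re.finditer(pattern, block_text))
    return matched / total_chars if total_chars else 0.0


# Usage with a hypothetical business-information pattern.
text = "We provide cloud hosting. We provide data analytics."
cpb = character_proportion(text, r"provide [a-z ]+", len(text))
print(round(cpb, 3))
```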

Extraction on official Web pages
After the active blocks are located and the features are presented, the information of the organization's official Web page is classified. Short information can be extracted by rules, while longer information requires a model learned from labeled examples. Two methods, based on a trigger lexicon and on LSTM, are proposed to extract attributes of the classified organizational information as follows.

Extraction method based on trigger lexicon
It is difficult to identify personal and departmental information in official Web pages. TRIE, a method of information extraction based on a trigger lexicon, is therefore proposed. A trigger lexicon is established for the target information, and rule matching is then applied to extract the information that the triggers indicate.
The information to be extracted is usually accompanied by trigger words such as verbs and nouns. The trigger lexicon, constructed from expert knowledge, is shown in Tab. 2. It is essential to locate the sentences in which the target information resides. The corresponding rules are then matched by searching the rule base, which is a collection of regular expressions per attribute that extract well for the known classification types.
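The trigger-then-rule procedure can be sketched as follows. The lexicon entries, the regular expressions, and the function name are hypothetical examples, not the contents of Tab. 2 or the authors' rule base.

```python
import re

# Hypothetical trigger lexicon: attribute -> trigger words or phrases.
TRIGGERS = {
    "email": ("e-mail", "email", "contact"),
    "school": ("graduated from", "studied at"),
}

# Hypothetical rule base: attribute -> regular expressions.
RULES = {
    "email": [re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")],
    "school": [re.compile(
        r"(?:graduated from|studied at)\s+([A-Z][\w ]+?University)")],
}


def extract_attributes(text):
    """Locate sentences containing triggers, then apply the matching rules."""
    results = {}
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        lowered = sentence.lower()
        for attribute, triggers in TRIGGERS.items():
            if attribute in results:
                continue  # keep the first value found for each attribute
            if not any(t in lowered for t in triggers):
                continue  # no trigger in this sentence
            for rule in RULES[attribute]:
                match = rule.search(sentence)
                if match:
                    # Use the capturing group when the rule defines one.
                    results[attribute] = match.group(match.lastindex or 0)
                    break
    return results


# Usage on a short personal-information block.
sample = ("Li Wei graduated from Peking University in 2001. "
          "Contact e-mail: li@example.edu.")
extracted = extract_attributes(sample)
print(extracted)
```

Restricting the rules to sentences that contain a trigger keeps the regular expressions simple, at the cost of missing values that appear without any trigger word.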

Extraction method based on LSTM
The work scope of an organization is reflected in the business information on its official Web pages, which has several features: 1) it has no apparent triggers; 2) it has no structural similarity; 3) it resides in long text. Because business information is long, natural language processing is required. An RNN (Recurrent Neural Network) can vectorize sentences and find solutions quickly. However, when an RNN is trained on long business information, the gradient vanishes as the sequence grows, so the RNN retains only part of the data. LSTM [Hochreiter and Schmidhuber (1997)] is a recurrent neural network model with control units: the input, output, and forget gates. These gates let the weights of the recurrent structure change continuously during learning and add the ability to capture long-term dependencies to the RNN. LSTM can therefore handle long-term dependence on the preceding text and predict what follows based on what came before. Given the features of business information listed above, LSTM has advantages over other methods in processing it.
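To make the role of the three gates concrete, the following sketch runs one step of a standard LSTM cell with scalar input and state. It follows the common textbook formulation, not the configuration used in the paper; the function names and parameter layout are assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One step of a standard LSTM cell with scalar input and state.

    params maps each of "i" (input gate), "f" (forget gate), "o" (output
    gate) and "g" (candidate) to a tuple (w_x, w_h, bias).
    """
    pre = {name: w_x * x + w_h * h_prev + b
           for name, (w_x, w_h, b) in params.items()}
    i = sigmoid(pre["i"])    # how much new information to write
    f = sigmoid(pre["f"])    # how much of the old cell state to keep
    o = sigmoid(pre["o"])    # how much of the cell state to expose
    g = math.tanh(pre["g"])  # candidate content
    c = f * c_prev + i * g   # updated cell state (long-term memory)
    h = o * math.tanh(c)     # updated hidden state
    return h, c


# Usage: a large forget-gate bias keeps the previous cell state almost intact.
keep = {"i": (0.0, 0.0, -10.0), "f": (0.0, 0.0, 10.0),
        "o": (0.0, 0.0, 0.0), "g": (1.0, 0.0, 0.0)}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.8, params=keep)
print(round(c, 3))
```

Because the cell state is updated additively and scaled by the forget gate, information can be carried across many steps, which is the property the text relies on for long business descriptions.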
As shown in Fig. 4, an information extraction method based on LSTM is proposed. First, the text containing long information is selected and divided into clauses. Information such as the business name and description is identified by expert labeling. The business information of organizations in different fields, labeled by experts, is used as the training set, and the performance of the LSTM model is gradually improved by continuous training. The processing model consists of word2vector, vector joint, LSTM computing, and softmax selection, and it performs a probability calculation on the business information to obtain the label with the highest matching value. After the model computation, the business information and the corresponding labels are obtained.
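The final softmax selection step can be sketched as follows. The label names and scores are hypothetical; in the described pipeline the scores would come from the LSTM output for a clause.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities (shifted for stability)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_label(scores, labels):
    """Return the label with the highest probability and that probability."""
    probs = softmax(scores)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs[best]


# Usage with hypothetical labels and scores for one clause.
LABELS = ["business_name", "description", "other"]
label, prob = select_label([2.0, 0.5, -1.0], LABELS)
print(label, round(prob, 3))
```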

Experimental study
After introducing the implementation of the above methods, this section will explain the experimental part, including data set, environment and experimental results.

Data set and environment
Official Web pages of 900 organizations in various industries were collected to test the effectiveness of the above methods. The collection source contains 300 NGOs (nongovernmental organizations), companies and schools each. Besides, three categories were obtained through the summary of the Web pages. 20% of each category data set was randomly selected as the experimental data set. The categories and subcategories were labeled by expert knowledge. The summary of the experimental data set is shown in Tab. 3. All experiments were conducted on CentOS 8.0 with an Intel i7 CPU@3.4 GHz, 16 GB of memory, and an SSD hard disk with a capacity of 480 GB.

Information classification on official Web pages
In this experiment, three structural and six content features of official Web pages were summarized. The training data set was obtained by randomly selecting 80% of the experimental data set, while the remaining 20% formed the test set.
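A random 80/20 split of this kind can be reproduced with a few lines of Python. The function name and the fixed seed are assumptions added for repeatability; the paper does not state how the random selection was implemented.

```python
import random


def split_data(items, train_ratio=0.8, seed=0):
    """Shuffle a copy of the items and split it into training and test sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


# Usage on 100 placeholder page identifiers.
train_set, test_set = split_data(range(100))
print(len(train_set), len(test_set))
```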

Selection of classification models
The BP neural network model, the C4.5 decision tree, and the SVM algorithm were applied to construct classification models using the same training data and the features mentioned above. The comparison of the classification performance of the three models is shown in Fig. 5. Because it achieves relatively high accuracy, recall, and F-measure on official Web page classification, the SVM model is chosen as the classification model.
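The indicators used in this comparison can be computed as in the sketch below, assuming the standard definitions: accuracy over all classes, and precision, recall, and F-measure for one class at a time. The labels and predictions are made-up examples, not data from the experiment.

```python
def evaluate(y_true, y_pred, positive):
    """Accuracy over all classes plus precision/recall/F-measure for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}


# Usage with made-up labels for six information blocks.
truth = ["personal", "personal", "business", "department", "personal", "business"]
preds = ["personal", "business", "business", "department", "personal", "personal"]
scores = evaluate(truth, preds, positive="personal")
print(scores)
```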

Comparative analysis of indicator performance
After SVM was selected as the classification model, the proposed features were applied to process the information. The indicator performance was compared across the following scenarios: all proposed features, each feature removed in turn, and the baseline methods EMSS and ONBC. As shown in Tab. 4, all nine features have positive effects on Web classification, which supports their rationality. It is noteworthy that the accuracy decreases more when the features MPV, MDP, NSW, or CPB are missing, and the classification effect is reduced most significantly when CPB is lacking. Compared with the baseline methods, the proposed method performs better in terms of both performance indicators and time cost.

Information extraction on official Web pages
After classification, the official Web pages were divided into personal, departmental, and business information. In this experiment, the two proposed methods were applied for extraction according to the text length. We applied TRIE to personal and departmental information and LSTMIE to business information. The methods ERSP and ClusTex were applied as the baselines.

Analysis of personal information extraction
As shown in Fig. 6A), TRIE performs slightly better than the baseline methods in most categories. However, for the extraction of personal names, TRIE is slightly less accurate than the baseline methods. The reason is that some official Web pages of organizations present names without any trigger from the lexicon, which causes a slight loss. Overall, compared with the baseline methods ERSP and ClusTex, TRIE performs better in dealing with personal information.

Analysis of departmental information extraction
As shown in Fig. 6B), the accuracy of TRIE in all four categories of information extraction is higher than that of the baseline methods, which supports the effectiveness of the algorithm. It is noteworthy that the extraction of function and name information performs relatively poorly. The accuracy for departmental names is low because the official Web pages of organizations often contain no trigger words for them. The accuracy for departmental functions is low because the style of this information is flexible, which makes it challenging to cover all cases with extraction rules.

Analysis of business information extraction
The LSTM model was trained with labeled text containing business information. With the trained model, the remaining unlabeled text was processed to complete the business information extraction. The accuracy of the extraction results, shown in Fig. 6C), was compared across three types of information. In the extraction of business information, the accuracy of LSTMIE is higher than that of the baseline methods. Compared with the baseline methods, LSTMIE performs better on official organizational Web page information with flexible forms, frequent variation, and low similarity between Web pages.

Analysis of three types of organizations
By applying TRIE and LSTMIE, the information was extracted from the Web pages of schools, NGOs, and companies in the experimental data set. As shown in Fig. 7, the accuracy for the three types of organizations is consistent across the information categories. Schools perform better because their organizational structure is relatively fixed. In contrast, NGOs are less explicit in presenting their personal and business information. To evaluate the efficiency of our methods, ERSP and ClusTex were also applied in the experiment. Fig. 8 shows the results of the comparative efficiency experiment. It can be observed that TRIE and LSTMIE have an efficiency advantage when processing data sets of the same size across the various categories.

Discussion and conclusion
The experiments show that the proposed method has the following characteristics: 1) The proposed structural and content features focus on an intuitive representation of the active information blocks, are shown to be effective, and improve the performance of Web classification. 2) The comparison with the baseline algorithms supports the validity of the selected classification model and features. 3) TRIE and LSTMIE can be applied together to extract attributes of the classified information from different types of organizations. Compared with existing works, the proposed method achieves better classification and extraction performance on the official Web pages of organizations.
In this paper, we have proposed an information classification and extraction method for the official Web pages of organizations. After the active information blocks of Web pages were located, the content and structural features were summarized. A specific model was constructed to classify the Web pages. Two extraction methods, based respectively on a trigger lexicon and on LSTM, were proposed for different types of Web information. Experiments showed that our method performs better than existing methods in terms of effectiveness and efficiency. In the future, it will be necessary to expand the trigger lexicon and reduce costs through parallelization. The focus will also be placed on further enhancing efficiency and discovering more stable features.