
Future Generation Computer Systems

Volume 76, November 2017, Pages 510-518

An optimized approach for massive web page classification using entity similarity based on semantic network

https://doi.org/10.1016/j.future.2017.03.003

Highlights

  • A weight estimation algorithm based on the depth and breadth of the Wikipedia network calculates the class weight of all Wikipedia entity words.

  • A kinship-relation association based on content similarity is proposed to address the imbalance problem when a category node inherits from multiple fathers.

  • A Bayesian classifier estimates the page class probability.

Abstract

With the development of mobile technology, users’ browsing habits have gradually shifted from pure information retrieval to active recommendation. Mapping user interests to web content has become increasingly difficult as the volume and variety of web pages grow. Large news portal sites and social media companies hire more editors to label new concepts and words, and rely on computing servers with ever larger memory to handle massive document classification with traditional supervised or semi-supervised machine learning methods. This paper presents an optimized classification algorithm for massive web pages that exploits semantic networks such as Wikipedia and WordNet. We used the Wikipedia data set and initialized a small number of category entity words as class words. A weight estimation algorithm based on the depth and breadth of the Wikipedia network calculates the class weight of all Wikipedia Entity Words. A kinship-relation association based on the content similarity of entities is proposed to address the imbalance problem that arises when a category node inherits probabilities from multiple fathers. The keywords of a web page are extracted from the title and main text using N-grams matched against Wikipedia Entity Words, and a Bayesian classifier estimates the page class probability. Experimental results show that the proposed method achieves good scalability, robustness and reliability for massive web pages.

Introduction

With the development of mobile technology, users’ browsing habits have gradually shifted from pure information retrieval to active recommendation, and mapping user interests to web content has become ever more complicated as the volume and variety of web pages grow. Web page classification plays a vital role in Internet information management, convenient retrieval, web page crawling and user profile identification [1], [2]. To provide friendly browsing and rapid retrieval, the directories of sites such as Yahoo! and the DMOZ ODP define site structures based on web page content to support structured and hierarchical browsing. According to a 2008 Netscape Communications Corporation report, the ODP relied on over 78,940 people to classify and maintain its web pages. With the growth of the mobile Internet, the number of web pages increases by roughly ten million annually, and the average page count of these sites approaches the millions. Manual classification, with its quality verification and other tedious processes, can no longer keep pace with the rapid expansion of web sites, and classification labeling has come to consume enormous human and financial resources for Internet applications. Automated or semi-automated mechanisms for web page classification with high accuracy, reliability and scalability are therefore needed to replace manual editing.

Traditionally, web page classification systems [2] are mostly based on supervised machine learning algorithms, which learn a classification model from labeled training data and use it to predict the classes of unseen test data. Web page classification methods can be divided into several main categories: subject classification, functional classification, sentiment classification and other forms of classification.

Subject classification labels a page with the topic of its content [3], which facilitates site information management and publishing. Most integrated information sites, such as Yahoo!, America Online and Sina, adopt this classification system, with channels named “health”, “culture”, “travel”, etc. Web pages are first encoded as multi-dimensional vectors to facilitate the machine learning and classification processes. Dimensionality reduction or feature selection algorithms are then used to save both computation time and space, and machine learning methods such as SVM [4], ANN [5], Rocchio [6] and the Bayesian classifier [7] are applied to classify the pages.
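As a point of reference for this conventional pipeline, the following minimal sketch (not the implementation used in this paper) combines TF-IDF word vectors, chi-square feature selection for dimensionality reduction and a naive Bayes classifier; the sample pages, labels and parameter values are illustrative placeholders.

```python
# Minimal sketch of the conventional subject-classification pipeline:
# weighted word vectors -> feature selection -> supervised classifier.
# The sample pages, labels and k are placeholders, not data from the paper.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pages = ["daily exercise and diet tips", "museum opening hours and exhibits",
         "cheap flights and hotel deals", "vaccination schedule for children"]
labels = ["health", "culture", "travel", "health"]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),          # encode each page as a weighted word vector
    ("select", SelectKBest(chi2, k=10)),   # keep the most discriminative features
    ("clf", MultinomialNB()),              # SVM, ANN or Rocchio could be used instead
])
pipeline.fit(pages, labels)
print(pipeline.predict(["cheap train tickets and hotels"]))
```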

Actually, the data sets used by conventional methods have significant structural characteristics. An accessible data set provider will optimize the data structure to match the customers’ requirements, and users have sufficient resources and capabilities to integrate and optimize the data structure for a small data collection. However, this becomes impossible when the data are too cumbersome for the users. At the same time, the data may come from different providers whose data structures are inconsistent. Furthermore, traditional data analysis and processing methods are too complex and inefficient to cope with diverse noise conditions when the page data reach the TB/PB level. Big data has thus brought many development opportunities and challenges to the content classification field. More and more big news portal sites and social media companies use computing servers with larger memory to handle massive document classification based on traditional supervised or semi-supervised machine learning methods.

This paper proposes a web page content classification algorithm based on Wikipedia knowledge and its network topology. Wikipedia is the largest encyclopedia knowledge base in the world, with over 200 language editions containing thousands or even millions of entries, which are expanded, edited and refined every day by separate volunteer groups. We first defined the Preliminary Classes (PCs) space for web page content by consulting the category systems of various portal sites. Then, some category words in the Wikipedia Knowledge Network (WKN) were defined as Elemental Keywords (EKs) for the PCs. For propagation over the Wikipedia category tree, we propose three inheritance association algorithms: Preliminary Association (PA), Rule Association (RA) and Kinship-relation Association (KRA). In the association method, the Class Probabilities (CPs) of all Wikipedia category words are estimated using the Levenshtein distance between the category words and the EKs. The RA method analyzes the breadth and depth between a Wikipedia category word and its father or ancestor words to estimate the CPs of the current category word. The KRA method introduces sister concepts to optimize the hereditary weights of Wikipedia category words when a node has multiple fathers.
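To make the Levenshtein-based estimation concrete, the sketch below scores a category word against each class’s EKs by normalized edit-distance similarity; the normalization, max and renormalization steps are assumptions for illustration, not the exact formulas of the paper.

```python
# Hedged sketch of Levenshtein-based class probability estimation for a
# Wikipedia category word against Elemental Keywords (EKs). The similarity
# normalization below is an illustrative assumption.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def class_probabilities(category_word, elemental_keywords):
    """elemental_keywords: dict mapping class name -> list of EK strings."""
    scores = {}
    for cls, eks in elemental_keywords.items():
        sims = [1 - levenshtein(category_word, ek) / max(len(category_word), len(ek), 1)
                for ek in eks]
        scores[cls] = max(sims)                  # best-matching EK defines the class score
    total = sum(scores.values()) or 1.0
    return {cls: s / total for cls, s in scores.items()}   # normalize into probabilities

# e.g. class_probabilities("basketball league",
#                          {"sports": ["basketball", "football"], "culture": ["museum", "opera"]})
```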

For the page classification process itself, we collected several typical Chinese web pages as test samples, and a Main-text Founder Process (MFP) tool was used to extract the kernel content information. The contents were segmented into minimum phrases by a Chinese tokenizer, and these words were combined into an additional vocabulary bag with an N-gram algorithm referring to Wikipedia category phrases. We used a naive Bayes classifier model to estimate the page PCs from the word CPs, and a sigmoid function was introduced to optimize the hereditary weights, accounting for the imbalance in word frequency and word length.
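The sketch below illustrates one way such a combination could work: per-keyword class probabilities are merged with a naive Bayes-style log-probability sum, while a sigmoid weight damps the influence of keyword frequency and length. The weighting constants and smoothing value are illustrative assumptions, not the parameters used in the paper.

```python
# Hedged sketch of combining word-level class probabilities into a page-level
# estimate with a sigmoid weight; constants below are illustrative assumptions.
import math

def page_class_scores(keyword_freqs, word_class_probs, classes):
    """keyword_freqs: dict word -> frequency in the page.
       word_class_probs: dict word -> {class: probability}."""
    log_scores = {c: 0.0 for c in classes}
    for word, freq in keyword_freqs.items():
        if word not in word_class_probs:
            continue
        # sigmoid in frequency * length keeps one dominant keyword from swamping the page
        weight = 1.0 / (1.0 + math.exp(-(freq * len(word) - 5.0)))
        for c in classes:
            p = max(word_class_probs[word].get(c, 1e-6), 1e-6)   # simple smoothing
            log_scores[c] += weight * math.log(p)
    z = max(log_scores.values())
    expd = {c: math.exp(s - z) for c, s in log_scores.items()}    # softmax-style normalization
    total = sum(expd.values())
    return {c: v / total for c, v in expd.items()}
```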

The experimental results showed that the KRA method based on the WKN clearly improved classification accuracy on all benchmarks. The proposed method achieved high accuracy, reliability and scalability across a variety of page qualities compared with the traditional BOW and TF–IDF methods. Given the continuous update model of the WKN, we argue that the KRA algorithm based on the WKN is well suited to web page classification, especially for large-scale data mining.

The rest of the paper is organized as follows. Section 2 reviews recent web page classification methods. Section 3 explains the motivation for using the Wikipedia knowledge database and the reconstruction algorithm. Section 4 presents the core web page keyword extraction and classification approach. Section 5 discusses the experimental evaluation, and Section 6 gives the conclusions and future work.

Section snippets

Related work

Web pages are classified based on their contents or subjects in topic-based classification research. For English documents, it is easy to segment a document into a vector of words, under the basic hypothesis that each word indicates a certain concept in the document. A web page is then represented by a vector of words with weights, an approach often referred to as Bag-of-Words [4], [8]. In order to improve the weight generation, Mladenić [9] introduced N-gram for feature word
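A tiny illustration of this representation, with placeholder text and features: each page is reduced to a weighted word vector, and word N-grams extend the feature set.

```python
# Minimal Bag-of-Words illustration: word counts as weights, plus word bigrams
# as N-gram features. The example text is a placeholder.
from collections import Counter

def bag_of_words(text, n=2):
    tokens = text.lower().split()
    features = Counter(tokens)                           # unigram counts as weights
    features.update(" ".join(tokens[i:i + n])            # add word N-gram features
                    for i in range(len(tokens) - n + 1))
    return features

print(bag_of_words("web page classification with web content"))
```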

Wikipedia structure

Wikipedia is a multilingual, web-based, free-content encyclopedia project operated by the Wikimedia Foundation and based on an openly editable model. Wikipedia is a live collaboration that differs from paper-based reference sources in important ways. It is the largest encyclopedia knowledge base in the world, with over 200 language editions containing thousands or even millions of entries.

Its English version contains more than 4.3 million entries currently. It

Keyword extraction

A traditional plain-text document contains only a title and body content, whereas a web page generally also carries site structure information such as tags, headers, legal statements, billboards and image sources. We crawled pages from typical Chinese Web sites, and a main-text founder processing tool was set up to extract the kernel content of each web page. The contents were separated into elementary strings of minimum phrases based on the “Jieba” toolbox.

The thesaurus derived from Wikipedia contains a
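A hedged sketch of this keyword-extraction step, assuming the segmented tokens are joined into N-gram candidates and kept only when they match the Wikipedia-derived entity words (the thesaurus set here is a placeholder):

```python
# Hedged sketch: segment the extracted main text with the "jieba" tokenizer,
# then keep N-gram candidates that appear in the Wikipedia-derived thesaurus.
import jieba

def extract_keywords(main_text, wiki_entity_words, max_n=3):
    tokens = jieba.lcut(main_text)                   # minimum-phrase segmentation
    keywords = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            candidate = "".join(tokens[i:i + n])     # Chinese phrases join without spaces
            if candidate in wiki_entity_words:
                keywords.append(candidate)
    return keywords

# e.g. extract_keywords("南京大学的人工智能研究", {"南京大学", "人工智能"})
```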

System structure

The workflow of the whole system is straightforward. First, we reprocessed the downloaded Wikipedia knowledge base and saved it into a database. Then, the EKs dictionary was initialized based on the “sogo” topic dictionary and the Wikipedia knowledge base, and the CPs of all WEWs were generated using PA, RA and KRA. The main contents of the crawled web pages were provided by an “MFP” toolbox, and page keywords were selected by a tokenizer referring to the WEWs through N-gram applications [41]. After introducing an inverse sigmoid

Conclusions

With the explosion of Internet information, traditional web page classification algorithms based on trained models over fixed data sets have been unable to handle the complexity of web page classification. This paper introduced a classification method using the WKN. As the most popular scientific knowledge database in the world, the Wikipedia knowledge base contains more than 200 language-specific knowledge bases. We used the Chinese knowledge base to solve the Chinese web page classification, and this

Acknowledgments

This work was supported by the NSFC (No. 61502247, 11501302, 61502243), China Postdoctoral Science Foundation (No. 2016M600434), Natural Science Foundation of Jiangsu Province (BK20140895), Scientific and Technological Support Project (Society) of Jiangsu Province (No. BE2016776), and Postdoctoral Science Foundation of Jiangsu Province (1601128B).


References (41)

  • T. Joachims, A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization, Tech. rep. (1996)
  • L. Jiang et al., Naive Bayes text classifiers: a locally weighted learning approach, J. Exp. Theor. Artif. Intell. (2013)
  • M. Rusinol et al., Logo spotting by a bag-of-words approach for document categorization
  • D. Mladenić, Feature subset selection in text-learning
  • Y. Jung et al., An effective term weighting scheme for information retrieval, Tech. rep. (2000)
  • C. Buckley et al., The effect of adding relevance information in a relevance feedback environment
  • T. Joachims, Text categorization with support vector machines: learning with many relevant features
  • M. Hall et al., The WEKA data mining software: an update, ACM SIGKDD Explor. Newsl. (2009)
  • C.B. Markwardt, Non-linear least squares fitting in IDL with MPFIT, arXiv preprint...
  • J. Zhang et al., Network traffic classification using correlation information, IEEE Trans. Parallel Distrib. Syst. (2013)

Huakang Li was born in Suzhou, China. He received the Master and Ph.D. degrees from the School of Computer Science and Engineering, University of Aizu, Aizuwakamatsu, Japan, in 2007 and 2011, respectively. He is currently working in the School of Computer Science and Technology, School of Software, Nanjing University of Posts and Telecommunications. His current research interests include big data mining, web mining, social networks, user profiling and the semantic Web. He has authored or co-authored more than 20 publications, including IEEE and ACM venues. He has served as a program chair and TPC member for several international conferences and journals such as TPDS, HRI, FCST, NBiS, RACS, AINA and CSE. He is a member of IEEE CS, ACM and CCF China.

Zheng Xu was born in Shanghai, China. He received the Diploma and Ph.D. degrees from the School of Computer Engineering and Science, Shanghai University, Shanghai, in 2007 and 2012, respectively. He is currently working in the Third Research Institute of the Ministry of Public Security and as a postdoctoral researcher at Tsinghua University, China. His current research interests include topic detection and tracking, the semantic Web and Web mining. He has authored or co-authored more than 70 publications, including papers in IEEE Trans. on Fuzzy Systems, IEEE Trans. on Automation Science and Engineering, IEEE Trans. on Cloud Computing, IEEE Trans. on Emerging Topics in Computing, and IEEE Transactions on Systems, Man, and Cybernetics: Systems.

Tao Li is currently a full professor in the School of Computer Science, Florida International University, and the head of the School of Computer Science and Technology, School of Software, Nanjing University of Posts and Telecommunications, China. He received his Ph.D. in computer science from the Department of Computer Science, University of Rochester in 2004. He was a recipient of an NSF CAREER Award (2006–2010) and multiple IBM Faculty Research Awards (2005, 2007 and 2008). In 2009, he received FIU’s Excellence in Research and Creativities Award. In 2010, he received the IBM Scalable Data Analytics Innovation Award. He received the inaugural Graduate Student Mentorship Award from the College of Engineering and Computing at FIU in 2011. He is on the editorial boards of ACM Transactions on Knowledge Discovery from Data (ACM TKDD), IEEE Transactions on Knowledge and Data Engineering (IEEE TKDE), and Knowledge and Information Systems (KAIS).

Guozi Sun is currently a professor in the School of Computer, Nanjing University of Posts and Telecommunications, China. His recent research interests mainly include digital forensics, multimedia forensics, social network forensics, and digital investigation. Dr. Sun has published more than 100 refereed papers in academic journals and international conference proceedings in the related research areas. He has served as a program chair and TPC member for several international conferences, and as editor-in-chief, associate editor, editorial board member and guest editor for a number of scientific journals. He is a member of IEEE CS and ACM, USA, and of CCF, CIE and ISFS, China.

Kim-Kwang Raymond Choo received the Ph.D. degree in Information Security from Queensland University of Technology, Australia, in 2006. He currently holds the Cloud Technology Endowed Professorship at The University of Texas at San Antonio. He serves as Special Issue Guest Editor of ACM Transactions on Embedded Computing Systems (2017), ACM Transactions on Internet Technology (2016), Digital Investigation (2016), Future Generation Computer Systems (2016), IEEE Cloud Computing (2015), IEEE Network (2016), Journal of Computer and System Sciences (2017), Multimedia Tools and Applications (2017), Pervasive and Mobile Computing (2016), etc. He is a recipient of various awards, including the ESORICS 2015 Best Paper Award, membership of the winning team of Germany’s University of Erlangen-Nuremberg (FAU) Digital Forensics Research Challenge 2015, the 2014 Highly Commended Award by the Australia New Zealand Policing Advisory Agency, the Fulbright Scholarship in 2009, the 2008 Australia Day Achievement Medallion, and the British Computer Society’s Wilkes Award in 2008. He is a Fellow of the Australian Computer Society and a Senior Member of IEEE.
