Semantic Web and Web Page Clustering Algorithms: A Landscape View

The semantic web has evolved to support data exchange between applications in all domains of activity. Based on this vision, applications have recently been designed in fields such as community web portals, social networking, e-learning, and multimedia retrieval. Due to the growing number of web services, clustering of web resources has become a valuable tool for semantic web mining. Clustering internet objects such as web pages calls for new methods that group correlated content for better understanding and that cope with the massive number of results returned for user queries in web page search. Hence, web page clustering algorithms should be able to handle massive irregular content and discover knowledge regardless of web page complexity. These algorithms vary depending on the characteristics and data types involved. Choosing the most appropriate algorithm is therefore not easy, as it should be accurate as well as efficient in terms of time and space complexity. This paper rigorously surveys the most important algorithms of different types used for web page clustering. In addition, a comparative analysis of these algorithms is provided in terms of several parameters. Finally, a brief discussion is provided on why web page clustering is important in the emerging era of Semantic Web of Things (SWoT) applications.


Introduction
Today, petabytes of data are available on the web in unstructured, structured, and semi-structured form. The tremendous growth of websites and web content in the form of text and multimedia messages on the WWW has created demand for a strategy that can derive knowledge from the vast data scattered over different servers. Accessing and retrieving information from structured data is easy, but owing to the massive growth of web data in the past few years, most of the content on the WWW is unstructured and understandable only by humans. Hence, improving users' search results over the massive number of scattered web pages is a key challenge for existing search engines, which typically aim to present the relevant results as a sequential list. Here, semantic web mining comes into action: it combines web mining and the semantic web to make the data structured and machine readable, thus supporting easier data discovery, data integration, navigation, and automation of tasks. Web content mining resembles data mining in that the applications aim to extract hidden patterns or features from web pages. The clustering techniques of the semantic web provide query-relevant knowledge by applying feature extraction techniques to the massive and linked data present on the WWW [4] [5]. The objective of this paper is to introduce the core ideas of the commonly used clustering algorithms and to analyse the advantages and disadvantages of each one.
Web mining [1] [2] is defined as mining of the World Wide Web (WWW) to find useful information about web content, user queries, user behaviour, and the structure of the web. Here, users can act as consumers of or contributors to its data and services. In the present use of the WWW, there is a paradigm shift among web users from demand for information to demand for knowledge, and this requirement is transforming the WWW into the semantic web. The semantic web is knowledge oriented and provides query-relevant knowledge using clustering techniques on the massive and linked data present on the web in different fields. Figure 1 shows an image of open linked data on the Internet (or cloud) from several domains such as social networking sites, media, publications, and user-generated data. The image shows the datasets published in the Linked Data format, which contained 1,255 datasets with 16,174 links as of May 2020. Figure 1 is adopted from the web page that hosts the LOD cloud diagram [3].

Types of semantic web mining
Pattern discovery methods provide tools and techniques to extract significant content from the web by applying data mining techniques in a sophisticated manner. Web mining can be classified into three classes depending on which web data is to be mined: Web Structure Mining (WSM), Web Usage Mining (WUM), and Web Content Mining (WCM) [6]. WSM techniques are concerned with the correlation that exists among related pages, WUM techniques focus on discovering patterns in raw usage data, while WCM refers to finding, extracting, and gathering valuable information that is better suited to a search query and grouping the related web pages coherently. Web structure mining and content mining are often performed together, which allows the hypertext content and the structure to be exploited simultaneously.

Ontology and semantic web
The semantic web has emerged as a solution to the information overload problem caused by the massive and ever-growing data on the web. It is a machine-readable web, designed as a global document repository, with easy routes to access, publish, and link documents. Ontology is a recognized approach for knowledge representation and sharing across several applications. Ontology is the backbone of the semantic web, as it deals with structured data in order to develop standards and technologies that are understandable to both users and machines. Developing a sophisticated knowledge-based and semantic web-based system requires a fast and effective method of ontology construction. Figure 2 shows how the ontology serves as a vital input to semantic web mining, which produces a model and a pattern set as output. Along with one or more web pages, relational databases, graphs, text documents, etc. are also key inputs to such methods [7] [8].
However, setting up an ontology manually is a difficult task, as it is not only error prone but also time consuming, and it requires the participation of domain experts. A better solution is therefore an automatic or semi-automatic ontology construction methodology. Over the past years, several research attempts have been made to build appropriate ontologies for the semantic web, but many open issues still exist in this field.

Figure 2. Ontology in semantic web data mining
The rest of the paper is organized as follows. Section 2 discusses the background of clustering in the semantic web and its importance. Several clustering methods are then thoroughly described in Section 3. The comparative results of these methods are presented in Section 4, followed by their interpretation in Section 5. The importance of clustering in SWoT applications is briefly discussed in Section 6. Finally, we conclude in Section 7 with some possibilities for future work.

Background
Clustering is the method of categorizing similar objects into groups, or in other words, partitioning a dataset into subsets called clusters. The process is considered a valuable tool for semantic web agents, as it is applicable to a large range of problems. Clustering guides the user to the results through several collections of clusters relevant to the corresponding query. It thus becomes easy for users to locate valuable search results according to their needs, which better satisfies user requirements and makes better use of web surfing time. The main aim of semantic web page clustering is to group related pages together depending on their contents; this information is then used to improve the results of web search engines and other applications such as information retrieval systems. New algorithms are frequently required to process the complex data types that are collected from web pages in order to gather meaningful content. Compared with other clustering tasks, web page clustering is critical because of the complex structure of web pages, which differs from other formats and combines extra embedded information. Web page clustering is the most popular strategy in web mining; it puts web pages into groups depending on similarity or other proximity measures, so that pages in the same cluster are more similar to each other than to pages in other clusters [9][10]. However, clustering is a significant and difficult task when a large number of unlabeled web pages or objects are frequently accessed by several users. In this regard, new clustering algorithms should be developed, or existing algorithms should be modified efficiently, so that they can offer new analysis criteria that match users' requirements. Much research has been done so far, and as a result many clustering algorithms have been proposed for different classes of web mining techniques. This work discusses such algorithms in detail, focusing mainly on web content mining, and compares their characteristics with each other. After going through the whole survey, it becomes easier to select an algorithm according to the requirements by analyzing the merits and demerits of each.

Categorization of web page clustering methods
In web page clustering, the data is first collected in the form of web pages or search results. Data preprocessing is then performed to make the data suitable for clustering. In the next phase, features are extracted based on web page content (mainly text), on interconnected web links, or on both content and links. Lastly, a clustering method is applied to the extracted features and the results are obtained. Here, we initially focus on methods based on web content clustering, which falls under web content mining. Later, a few clustering methods from web structure mining and web usage mining are also discussed [11].
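The preprocessing and feature extraction phases can be illustrated with a minimal Python sketch. It assumes a plain TF-IDF weighting and cosine similarity, which are common choices for text-based features but are not prescribed by the surveyed works; the function names and the three toy pages are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Preprocessing: lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(pages):
    """Feature extraction: one sparse TF-IDF vector (term -> weight) per page."""
    token_lists = [tokenize(page) for page in pages]
    n_pages = len(pages)
    # Document frequency: number of pages in which each term occurs.
    df = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)
        # A term that occurs in every page gets weight 0 (log(1) = 0).
        vectors.append({
            term: (count / len(tokens)) * math.log(n_pages / df[term])
            for term, count in tf.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero norm."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical page texts standing in for collected web pages.
pages = [
    "semantic web ontology and linked data",
    "ontology design for the semantic web",
    "football match results and league table",
]
vectors = tfidf_vectors(pages)
```

With these toy pages, the first two vectors are more similar to each other than the first and the third, which is the signal a clustering method applied in the last phase would exploit.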
Web content clustering can be defined as a form of unsupervised classification in which the classes of web pages are not known beforehand; its purpose is to explore and discover significant content in massive data by grouping the data points into coherent clusters. The points that fall within one cluster are more similar to each other than to points in other clusters. Nowadays, most of the information on the internet is stored in the form of text, which has made text mining a very important topic in web mining [8]. Text-based clustering algorithms characterize every page by its contents (words, or sometimes phrases). The underlying idea is that pages sharing many common words are likely to be very similar. However, the partitioning algorithms produce spherically shaped clusters, as they assign each data point to its closest cluster centroid [14]. Figure 3 shows the final clustering obtained from the initial points after applying the k-means algorithm, where spherically shaped clusters are produced from a varied text document dataset.
In centroid-based techniques, optimizing the within-cluster variation, which measures the partitioning quality of a cluster $C_i$, is a computationally challenging task. To obtain the within-cluster variation, for each object in every cluster $C_i$ ($i = 1, \ldots, k$), the distance between the object and its cluster centre is squared, and the squared distances are summed up, as in equation 1:
\[ E = \sum_{i=1}^{k} \sum_{p \in C_i} \operatorname{dist}(p, c'_i)^2 \qquad (1) \]
where $E$ is the sum of the squared error for all objects in data set $D$ in $d$-dimensional space, $p$ is the point in space representing a given object, $c'_i$ is the centroid of cluster $C_i$, and $\operatorname{dist}(p, c'_i)$ is the (Euclidean) distance between data point $p$ and centroid $c'_i$.
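As a small illustration of the within-cluster variation, the following Python sketch sums the squared Euclidean distance of every object to the centroid of its cluster. The function names and the toy clusters are assumptions made for this example only.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points given as equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def centroid(points):
    """Component-wise mean of the points of one cluster."""
    dim = len(points[0])
    return tuple(sum(p[j] for p in points) / len(points) for j in range(dim))


def sum_squared_error(clusters):
    """Sum, over all clusters, of squared distances from objects to their centroid."""
    total = 0.0
    for points in clusters:
        c = centroid(points)
        total += sum(euclidean(p, c) ** 2 for p in points)
    return total


# Hypothetical partition: two objects in the first cluster, one in the second.
clusters = [[(0.0, 0.0), (2.0, 0.0)], [(10.0, 10.0)]]
sse = sum_squared_error(clusters)
```

For this partition the first centroid is (1, 0), each of its two objects lies at distance 1, and the singleton cluster contributes nothing, so the error equals 2.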
In the $k$-medoids method, which is an object-based technique, the mean value of the objects is not used as the reference point of a cluster. Instead, actual objects represent the clusters, with one representative object per cluster. Each remaining object is then assigned to the cluster whose representative object is the most similar to it, i.e. the closest. The partitioning is then performed following the principle of minimizing the sum of the dissimilarities between each object $p$ and its corresponding representative object. This method groups $n$ objects into $k$ clusters by minimizing the absolute error, defined as in equation 2:
\[ E = \sum_{i=1}^{k} \sum_{p \in C_i} \operatorname{dist}(p, o_i) \qquad (2) \]
where $E$ is the sum of the absolute error for all objects $p$ in $D$, and $o_i$ is the representative object of $C_i$.
A medoid is less affected by outliers or other extreme values than a mean. Thus, in the presence of outliers and noise, the $k$-medoids method is more robust than $k$-means. However, the complexity of each iteration in $k$-medoids increases for large values of $n$ and $k$, and the computation becomes more costly than with the $k$-means method.
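The robustness of a medoid can be seen in a one-dimensional toy example. The sketch below is only illustrative: it assumes scalar objects and absolute difference as the dissimilarity, and all names and values are hypothetical.

```python
def mean_point(values):
    """Arithmetic mean, the reference point used by a centroid-based method."""
    return sum(values) / len(values)


def medoid(values):
    """Actual object whose summed absolute distance to all objects is smallest."""
    return min(values, key=lambda c: sum(abs(c - v) for v in values))


def absolute_error(values, representative):
    """Sum of absolute dissimilarities between the objects and a representative."""
    return sum(abs(v - representative) for v in values)


# Four nearby objects and one outlier (hypothetical data).
values = [1.0, 2.0, 3.0, 4.0, 100.0]
centre_mean = mean_point(values)   # pulled towards the outlier
centre_medoid = medoid(values)     # stays among the nearby objects
```

Here the mean is 22.0 while the medoid is 3.0, and the absolute error around the medoid (101.0) is lower than around the mean (156.0). Note that finding the medoid compares every object with every other one, which reflects the higher per-iteration cost mentioned above.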
The main goal of partitioning is to obtain a coherent set of clusters with minimal Sum of Squared Error (SSE).

Hierarchical clustering
Hierarchical clustering methods were developed to overcome the drawbacks of partitioning or flat clustering algorithms. Hierarchical clustering can be classified into two approaches: divisive and agglomerative. In the agglomerative approach, the process starts with each data point in its own cluster and then successively merges sub-clusters into larger clusters. It calculates a proximity matrix, i.e. the similarity of each cluster to all other clusters. The clusters that are most similar to each other are merged, and the proximity matrix is recomputed. The process is repeated until only one cluster remains. The divisive approach is the reverse of the agglomerative one. It starts from a single all-inclusive cluster and recursively splits the cluster points into sub-clusters until a certain number of similar clusters is obtained [18]. Any partition-based algorithm can be used to split the cluster points at each step. Figure 4 presents the procedure of both types of hierarchical algorithm, where each letter is considered a single cluster. Table 2 illustrates the characteristics of the most well-known hierarchical algorithms developed at different times; its notation uses $n$ for the number of data points/objects, as above, together with symbols for the average and maximum number of neighbours of a point and for the cluster radius. Divisive algorithms are more accurate and more efficient in terms of running time, but they are computationally more complex than agglomerative techniques. Figure 5 shows the diversely shaped clusters produced by hierarchical algorithms [15].
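The agglomerative procedure can be sketched in a few lines of Python. The sketch assumes single linkage (the distance between two clusters is the distance between their closest members) and Euclidean distance; both are illustrative choices rather than the method of any particular algorithm in Table 2, and the function name is hypothetical.

```python
import math


def single_linkage(points, num_clusters=1):
    """Merge the two closest clusters repeatedly until `num_clusters` remain.

    Returns the final clusters (lists of point indices) and the merge history.
    """
    clusters = [[i] for i in range(len(points))]  # start: one cluster per point

    def cluster_distance(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    merges = []
    while len(clusters) > num_clusters:
        # Recompute the proximities and pick the closest pair of clusters.
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = cluster_distance(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        merges.append((sorted(clusters[x]), sorted(clusters[y]), d))
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters, merges


# Hypothetical 2-D points: two tight pairs and one distant point.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
clusters, merges = single_linkage(points, num_clusters=2)
```

Calling the function with `num_clusters=1` runs the merging until a single cluster remains, and the recorded merge history then corresponds to the full hierarchy. Recomputing all pairwise proximities at every step is what makes this naive version expensive for large inputs.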
Hierarchical algorithms can be combined with other clustering algorithms such as $k$-means to find coherent groups, for example when clustering text documents. Both hierarchical and partition-based algorithms are distance-based clustering methods, which can be used for any data type provided that a proper distance function is defined for that type. In such methods, the clustering quality depends on the algorithm, the distance function, and the application; inter-cluster distances should be maximized and intra-cluster distances minimized. Thus, designing an appropriate distance function is a crucial task and an important area of research in web data mining. However, hierarchical algorithms are not commonly used when the data is high dimensional, for the following reasons:
• They require high computation time.
• The number of clusters should be known in advance.
• The continuous clustering process produces large, incoherent clusters.

Density-based clustering
Table 3 shows the characteristics of the most widely used density-based clustering algorithms, where $\varepsilon$ refers to the cluster radius (the maximum distance to consider) and $MinPts$ refers to the minimum number of data points needed in a neighbourhood to define a cluster. Apart from the classical density-based algorithms presented in the table, there are further algorithms under the two broad categories of density-based clustering, such as -ExCC and MR-Stream under density grid-based clustering, and DenStream and FlockStream under density micro-clustering. These clustering methods produce better-quality clusters than grid-based methods, but they need more computation time. Later, different hybrid methods were proposed by modifying the DBSCAN algorithm, but they are not suitable for today's distributed web environments and web content [22].
In these methods, the clustering results are sensitive to the parameters, and a large volume of data requires a huge amount of memory. When the density of the data space is uneven, the methods produce low-quality clusters.
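The role of the two parameters (the neighbourhood radius and the minimum number of points) can be illustrated with a compact, unoptimized sketch in the style of DBSCAN. The parameter names `eps` and `min_pts` and the toy data are assumptions of this example; the neighbourhood search is a plain linear scan, so the sketch is not intended for large datasets.

```python
import math


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    unvisited, noise = None, -1
    labels = [unvisited] * len(points)

    def neighbours(i):
        # All points within distance eps of point i (including i itself).
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not unvisited:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = noise  # may later be relabelled as a border point
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not unvisited:
                continue
            labels[j] = cluster_id
            j_neighbours = neighbours(j)
            if len(j_neighbours) >= min_pts:  # j is a core point: keep expanding
                queue.extend(j_neighbours)
    return labels


# Two dense groups and one isolated point (hypothetical data).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

With `eps=1.5` and `min_pts=3`, the two dense groups become clusters 0 and 1 and the isolated point is labelled as noise. Changing either parameter can change the result substantially, which is the parameter sensitivity noted above.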

Graph based clustering
Graph-based algorithms first construct a graph or hypergraph and then apply a clustering algorithm to partition it. For clustering web content, the web pages can be viewed as a set of nodes, and the web links as the edges among the nodes, representing the strength of the relationship. A set of vertices is considered a good cluster if it has low conductance, i.e. if it has far fewer external edges than internal ones. However, such algorithms have the following drawbacks: the graph must fit in memory, and the technique used for calculating the similarity among nodes has to use a cut-off property. This type of clustering is a hybrid method that uses both content and links for feature extraction (Section 2) and falls under web structure mining (WSM).
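Conductance can be made concrete with a short sketch. It assumes an undirected, unweighted graph stored as an adjacency dictionary and uses a common definition: the number of edges leaving the vertex set divided by the smaller of the total degree inside and outside the set. A low value means that few edges leave the set relative to the edges attached to it. The graph and the function name are hypothetical.

```python
def conductance(adjacency, subset):
    """Conductance of `subset` in an undirected graph given as {node: set_of_neighbours}."""
    subset = set(subset)
    # Edges with exactly one endpoint inside the subset.
    cut = sum(1 for u in subset for v in adjacency[u] if v not in subset)
    vol_in = sum(len(adjacency[u]) for u in subset)
    vol_out = sum(len(adjacency[u]) for u in adjacency if u not in subset)
    denom = min(vol_in, vol_out)
    return cut / denom if denom else 0.0


# Two triangles of pages joined by a single link c-d (hypothetical web graph).
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e", "f"},
    "e": {"d", "f"},
    "f": {"d", "e"},
}
good = conductance(graph, {"a", "b", "c"})  # one triangle: 1 cut edge, volume 7
poor = conductance(graph, {"a", "d"})       # unrelated pages: every edge is cut
```

The triangle has conductance 1/7, whereas the set of two unrelated pages has conductance 1.0, so the triangle is the better cluster under this criterion.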
Figure 7 shows a simple example of graph representation of linked web documents/pages with text (TX), title (TI), link (L) [27].

Figure 7. Graph representation of web content
Graph construction methods extract a similarity graph that preserves the key properties of the dataset; to do so, they sparsify the similarity matrix under different heuristics (from simple thresholding to sophisticated regularisations). However, the choice of sparsity (the method parameters) has a strong impact on the performance of such methods. The workflow of graph-based clustering is shown in Figure 8. The most commonly used methods are the $\varepsilon$-ball graph, kNN, CkNN, etc. for graph construction, and Markov Stability (MS) for community detection [28]. MS has been successfully applied to social networks, airport networks, etc. Other dynamical processes have also been applied widely in network analysis. The high-dimensional nature of web data leads to complex geometries in the datasets and thus poses challenges to standard clustering methods. Graph clustering helps in capturing the complex geometry of a dataset and in managing complex network analysis on the web. Table 4 illustrates the most common characteristics of graph-based algorithms; its notation covers the number of objects to be clustered and the number of attributes of each, the number of links or edges, the starting vertex, the target conductance, and $h$, the size of the sparsity of the given dataset and graph. For any vertex set with small conductance, the resulting PageRank vector is not close to the stationary distribution for many starting vertices contained in that set, as it has significantly more probability within the set. The problems of this kind of clustering for web mining are determining the input parameters, memory consumption, and runtime for massive input data such as today's web. In addition, selecting the appropriate clustering method is difficult given the massive growth of web resources.
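Two of the graph construction heuristics named above can be sketched as follows. The sketch assumes Euclidean distance on numeric feature vectors and a symmetrised kNN graph; the function names and data are hypothetical, and CkNN and Markov Stability are not covered.

```python
import math


def epsilon_ball_graph(points, eps):
    """Connect every pair of points whose distance is at most eps."""
    n = len(points)
    return {
        i: {j for j in range(n)
            if j != i and math.dist(points[i], points[j]) <= eps}
        for i in range(n)
    }


def knn_graph(points, k):
    """Connect each point to its k nearest neighbours, then symmetrise the edges."""
    n = len(points)
    graph = {i: set() for i in range(n)}
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(points[i], points[j]))
        for j in others[:k]:
            graph[i].add(j)
            graph[j].add(i)  # keep the graph undirected
    return graph


# Three nearby points and one distant point (hypothetical feature vectors).
points = [(0, 0), (1, 0), (2, 0), (10, 0)]
ball = epsilon_ball_graph(points, eps=1.5)
knn = knn_graph(points, k=1)
```

On this toy data the $\varepsilon$-ball graph leaves the distant point isolated, while the kNN graph attaches it to its nearest neighbour. This small difference illustrates why the sparsification parameters strongly affect the downstream partitioning.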
Graph-based clustering methods yield high-quality, accurate clusters, but as the complexity of the graphs increases, the time complexity of the algorithms increases drastically.

Probabilistic Clustering
Probabilistic clustering, also called distribution-based clustering, is a special type of hard clustering method.The probabilistic clustering algorithms are most closely related to statistics that follow Bayesian classification arguments.
In such methods, the data is considered a sample drawn independently from a mixture model of several probability distributions. Each vector x is assigned to the cluster Ci for which Probability(Ci | x) is maximum, i.e. to the cluster to which it most probably belongs. The assignment of the vectors to individual clusters is carried out optimally with respect to this optimality criterion. These methods are used extensively in applications such as handwriting recognition, text document clustering, retrieval systems, and topic modeling. Figure 10 shows a scatter plot of a dataset with clusters identified using Gaussian mixture clustering in Python. These algorithms use statistical models rather than predefined similarity measures to calculate the similarity among data points. However, their time complexity is quite high, and they converge slowly in some situations. These methods are therefore not always suitable for clustering today's huge amounts of web data, where time is the prime factor. Table 5 shows the most widely used probabilistic clustering algorithms and their characteristics; its notation covers the number of objects to be clustered, the number of attributes of each object, and the number of Gaussian components/clusters, where the data are produced from a mixture of that many distributions whose parameters are the mean, variance, etc.
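The assignment rule (choose the cluster with the highest posterior probability) can be illustrated with a minimal expectation-maximization sketch for a one-dimensional Gaussian mixture. This is an illustrative sketch under simplifying assumptions, namely scalar data, user-supplied initial means, and a fixed number of iterations; it is not the implementation behind Figure 10, and all names are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(data, initial_means, n_iter=50, min_var=1e-6):
    """Fit a 1-D Gaussian mixture with EM and assign each value to its likeliest component."""
    k = len(initial_means)
    means = list(initial_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each value.
        resp = []
        for x in data:
            joint = [weights[i] * gaussian_pdf(x, means[i], variances[i])
                     for i in range(k)]
            total = sum(joint) or 1e-300  # guard against numerical underflow
            resp.append([j / total for j in joint])
        # M-step: re-estimate the parameters from the weighted data.
        for i in range(k):
            r_sum = sum(r[i] for r in resp)
            if r_sum == 0.0:
                continue  # component received no mass; leave it unchanged
            means[i] = sum(r[i] * x for r, x in zip(resp, data)) / r_sum
            var = sum(r[i] * (x - means[i]) ** 2 for r, x in zip(resp, data)) / r_sum
            variances[i] = max(var, min_var)  # floor avoids a collapsed variance
            weights[i] = r_sum / len(data)
    # Assignment by maximum posterior, using the last E-step.
    labels = [max(range(k), key=lambda i: r[i]) for r in resp]
    return means, variances, weights, labels


# Two well-separated groups of values (hypothetical data).
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
means, variances, weights, labels = em_gmm_1d(data, initial_means=[0.0, 6.0])
```

Each iteration touches every value for every component, and the number of iterations needed depends on the data and the initialization, which is consistent with the cost and slow convergence noted above.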

Special clustering methods
Apart from the clustering methods above, there are some special clustering methods that do not fit exactly into any of the above categories. Here we briefly discuss a few of them.

Branch and bound
This method exploits the dynamic programming principle. It provides a globally optimal clustering, for a fixed number of clusters and under certain prespecified conditions, without the need to consider all possible clusterings. A variant of this method, called A* search, is used in some applications on today's web.

Stochastic
Like the branch and bound method, the stochastic method is based on certain prespecified conditions, under which it guarantees convergence in probability to the globally optimal clustering.

Genetic
Based on certain prespecified conditions, genetic clustering generates a new population of clusterings at each iteration, starting from an initial population of possible ones. Apart from the above methods, some clustering methods proposed at different times [35][36] do not fall into any specific category of clustering; rather, they may be considered mixed clustering techniques. These methods intend to use more than one standard clustering technique.

Results and Discussion
Most well-known clustering algorithms have many limitations and drawbacks. They cannot work properly with ever-growing web data unless their parameters are adapted to such data. Additionally, one of the most critical challenges arises because text data from the web environment is unlabeled. This makes the evaluation of any clustering result more difficult, as it depends on the analysis model and the validation measures. This paper aims to review the studies and perceptions concerning the algorithms that are mostly used with text content on the web.
In Table 6, a comparative study of the techniques discussed here is presented based on real-time application, CPU time used, memory consumption, algorithm limitations, and data dimensionality. The terms used for comparison are as follows: Flexible, Low, High, Convenient, Inconvenient, and Not assigned.

Interpretation and Evaluation
Cluster analysis of web content is a significant process to understand and interpret, and choosing a suitable clustering algorithm for it is difficult. The choice is typically restricted by the data types used and the purpose of the application. Traditional clustering algorithms are unable to perform satisfactorily in the current scenarios of web data mining for several reasons, such as the following.
• Scaling issues arise from the large number of samples to be processed.
• The number of features is too large and sometimes exceeds the number of samples, owing to the high dimensionality of the data.
• Finding outliers is very significant, but it is difficult for large volumes of data.
• The knowledge from previous cluster analyses could be reused to avoid starting the analysis from scratch, but it is often unavailable.
• Data sources are heterogeneous and distributed, so local cluster analysis results have to be integrated into global models.
Most clustering algorithms used with web pages consider only the text part, and most of them use the statistically based Vector Space Model (VSM) [36]. In this model, a web document is represented conceptually by a vector of keywords mined from it. However, none of the methods based on this model is well suited to all types of queries generated on the web. In the field of information retrieval, web clustering engines such as Clusty, Lingo3G, Grokker, KartOO, and CREDO are an emerging trend; they organize search results by topic [37]. They group the search results into different (hierarchical) clusters and display the cluster labels. As a result, the user can conveniently and quickly locate the desired document. Such clustering involves billions of constantly changing pages. The dynamic nature of the data, along with the interactive use of the clustered results, poses new needs and challenges for clustering technology, such as the selection of a similarity measure, meaningful cluster labels, handling cluster overlap, removing clustering ambiguity, and increasing computational efficiency. Depending on the specific algorithm used, the clustering phase can contribute significantly to the overall processing time.
To improve clustering efficiency, the features extracted from a cluster should be powerful. In this regard, developers should adopt methods that generate more expressive and effective descriptions of clusters and should find optimal cluster representatives. However, usage of the WWW is increasing every day, and as a result, in the era of big data and the digital world, managing the massive content on the internet is becoming more difficult. So, before choosing any clustering technique, the innovative uses and requirements of web content clustering should be identified first; these include classifying network traffic, identifying fake news, spam filtering, sales/marketing, and document analysis [38]. Each algorithm has its own advantages and demerits, as discussed in this article, and cannot work for all real situations. A combination of existing clustering algorithms should therefore be used to obtain better clusters.

Semantic web clustering in IoT applications
Today, an enormous amount of data in different formats is generated on the web by a huge number of heterogeneous devices, networks, applications, and communication protocols. With this growing number of devices, the reality of the Internet of Things (IoT) and its diversity is pushing current technologies towards a smarter integration of their applications, data, and services. While the web is considered a convenient platform for integrating things, the semantic web can further extend its capacity to recognize the things' data and simplify their interoperability. In this regard, the Semantic Web of Things (SWoT) has been proposed for integrating the semantic web with the IoT. The Web of Things (WoT) allows different things and systems to communicate with each other through APIs over the HTTP or CoAP protocol, whereas the SWoT is the fusion of IoT trends moving toward web technologies with protocols like CoAP, the REST architecture, and the WoT concept. Semantic web technologies, as well as some well-accepted ontologies, are used to develop applications and services for the IoT. However, the existing approaches lack well-defined standards and conventional tools to solve the semantic interoperability problem in IoT applications [39][40].
One strategy to deal with these challenges is to reduce the number of discovered services using methodologies such as semantic web clustering. However, most of the existing approaches are suitable only for static contexts and do not take into consideration the dynamicity of services and gateways. Unsupervised clustering mechanisms therefore need to be explored with much more attention for analysing IoT sensor data along with the dynamicity of WoT services.

Conclusion
This paper provides a rigorous survey of several algorithms used for web content clustering, along with a brief discussion of the importance of semantic web clustering for WoT services. The behaviour of text content in web pages is very different from that of traditional text documents. Selecting the most suitable algorithm for handling the variety of text tokens is therefore very difficult, as it depends on the domain complexity and the types of data used on the web. Several clustering algorithms have been proposed so far, but the main problem with such algorithms is that they cannot be standardized. One algorithm may give appropriate results with one type of dataset but poor results with datasets of other types. Although there have been many attempts to standardize the algorithms, no major achievement has been accomplished so far. Our future work aims to study the behaviour of hybrid algorithms obtained by combining a few of the popular existing algorithms, such as graph-based algorithms with hierarchical algorithms. The aim is to handle the heterogeneity of web content from large datasets and heterogeneous devices successfully, in a way that can also be implemented in SWoT applications.

Figure 3. Spherical shapes produced from Partitioning clustering algorithms

Flat or Partition-based clustering
The partition-based clustering method classifies the information into multiple groups based on the similarity and characteristics of the data. The number of clusters to be generated can be determined from data analysis. Given a data set ($D$) of $n$ web objects and the number $k$ of clusters to form, a partitioning method constructs $k$ user-specified partitions (where $k < n$) of the data, in which each partition represents a cluster and a particular region. The most widely used partitioning methods are $k$-means and $k$-medoids. These are classical partitioning algorithms, and several variations of them are used on the web today to handle the growing volume of data. The $k$-means method is a centroid-based technique built on a point assignment procedure. In other words, the main point of each cluster, called the centroid $C$, is obtained by computing the mean value of the objects in the cluster. Each of the $n$ objects of $D$ is then assigned to the cluster with the closest centroid, i.e. the cluster to which the object is most similar according to the mean value. The main objective of partition-based clustering algorithms is to find $k$ clusters $\{C_1, C_2, \ldots, C_k\}$ by partitioning dataset $D$ into coherent groups. The centroids are chosen randomly at the initialization step. The partitioning process is repeated over a number of trials, with the cluster centroids updated at each trial, until the centroids no longer change or the algorithm yields optimally similar clusters.
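The iterative procedure described above can be sketched as follows. This is a minimal illustration that assumes Euclidean distance, random selection of initial centroids from the data, and termination when the assignment stops changing; the function name, the `seed` parameter, and the toy data are hypothetical.

```python
import math
import random


def k_means(points, k, max_iter=100, seed=0):
    """Partition `points` (tuples) into k clusters; return assignments and centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial centroids taken from the data
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each object goes to the cluster with the closest centroid.
        new_assignment = [
            min(range(k), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no object changed cluster, so the centroids are stable
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its assigned objects.
        for i in range(k):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignment, centroids


# Two well-separated groups of objects (hypothetical data).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
assignment, centroids = k_means(points, k=2)
```

Because the initial centroids are random, the numbering of the clusters may differ between runs, and on less well-separated data different seeds can lead to different partitions.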

Figure 5. Clusters produced from Hierarchical algorithms

Thus, owing to the many irrelevant dimensions of the high-dimensional data on today's web, the quality of the distance function may be reduced. It may exhibit errors that in turn reduce the statistical significance of web data mining results. Moreover, the size of web clusters varies, and web datasets contain noise. Most hierarchical clustering methods are sensitive to outliers. Outliers are not assigned to any cluster, and they can be considered anomalous points depending on the context. So, for huge and growing web datasets, and because of the arbitrary shapes of web clusters, density-based clustering is preferred over distance-based clustering methods such as hierarchical and partition-based algorithms.

Figure 8. Basic idea of Graph-based clustering method

Many works in the literature have proposed methods for graph construction in conjunction with graph partitioning and community detection on high-dimensional datasets such as web data.

Figure 9. Clusters generated from Graph clustering method on large web data

EAI Endorsed Transactions on Energy Web | 03 2021 - 05 2021 | Volume 8 | Issue 33 | e7

Table 2. Properties of Hierarchical Algorithms

Table 5. Characteristics of common Probabilistic Clustering Algorithms