A Soft Set-based Co-occurrence for Clustering Web User Transactions

Grouping web transactions into clusters is essential for gaining a better understanding of user behavior, and this grouping process is widely used by e-commerce companies. Hence, clustering web transactions is an important yet challenging web mining problem, owing to the uncertainty involved in forming clusters. Rough set theory has been utilized for clustering web user transactions while managing uncertainty in the clustering process. However, it suffers from high computational complexity and low cluster purity. In this study, we propose a soft set-based co-occurrence approach for clustering web user transactions. Unlike the rough set approach, which uses a similarity approach, the novelty of this approach is its use of the co-occurrence approach of soft set theory. We compare the proposed approach with rough set approaches in terms of computational complexity and cluster purity. The results demonstrate that the proposed approach performs better than the previous approaches: it achieves lower computational complexity, with an improvement of more than 100%, and higher cluster purity as compared to rough set-based approaches.


Introduction
Data mining is the procedure of extracting important information from large databanks. It is a process of automatically extracting knowledge or useful information from a large databank. It can also be regarded as an algorithmic process that takes a sample as input and yields patterns such as classification, association rules, or clustering as output [1]. Clustering is the process of partitioning a set of data objects into subsets. Objects within a cluster have similar characteristics to each other and are different from objects in other clusters [2]. Therefore, clustering is very useful and can find previously unknown groups in the data. Clustering is widely used in various applications such as web search, pattern recognition, biology, business intelligence, and security. There are many approaches to data clustering; some are able to handle uncertainty and others improve computational complexity. Many practical and complicated clustering problems in engineering, environment, economics, medical science, and social science often involve uncertain and vague data.
There are several well-known approaches to handling uncertainty during the clustering process. These include fuzzy set theory [3], rough set theory [4], vague sets [5], and interval mathematics. However, all of these theories have their inherent difficulties, as pointed out in [6]. Consequently, Molodtsov [6] proposed soft set theory as a completely new approach for dealing with uncertainty. It uses the concept of parameterization as its main vehicle and therefore offers wider applications in real problems. Currently, work on soft set theory is progressing rapidly, in both theory and practice, including the work of [7][8][9][10][11][12][13][14][15].
Clustering is another major issue in soft set-based data analysis, particularly when the data are large and imprecise, e.g. clustering of web user transactions. Web clustering is one of the most widely used techniques in the web mining context; it groups similar web usage data (such as web pages, user transactions or user sessions, mouse clicks and scrolls, and any kind of data produced by the interaction between users and the web [16]) into a number of groups by measuring their mutual vector distance [17]. Grouping web transactions into clusters is essential for gaining a better understanding of user behavior, and is widely used by e-commerce companies [16]. Recently, Rough Set Theory (RST) has been used for clustering web user transactions. De and Krishna [18] proposed a rough set approach based on the similarity of upper approximations of transactions, using a given threshold for clustering web user transactions. However, high complexity remains an outstanding issue in finding the identical approximations used to merge two or more clusters that have the same similarity.
To cope with this outstanding issue, Yanto et al. [19] proposed a framework based on rough set theory for clustering web user transactions. Further, they proposed a technique, namely RoCeT, for clustering web transactions by using the similarity class concept, and showed how transactions are allocated to the same cluster [20]. However, all of the aforementioned approaches still suffer from high computational complexity and low clustering purity. Therefore, in this study, we propose a soft set-based co-occurrence approach for clustering web user transactions.
To sum up, the main contributions of our work are as follows:
a. We propose a soft set-based approach for clustering web user transactions, which is capable of achieving lower computational complexity and higher clustering purity.
b. Unlike the rough set approach, which uses a similarity approach, the novelty of this approach is its use of the co-occurrence approach of the soft set.
c. This study presents a comparison, in terms of theoretical analysis and performance results, of the proposed approach against two rough set-based approaches.
The rest of this study is structured as follows. Section 2 reviews the basic notions of rough set and soft set theory. Section 3 describes the proposed soft set-based approach. Section 4 elaborates the experimental and comparison results. Finally, the conclusion of this study is given in Section 5.

Theoretical Background
In this section, we briefly recall some basic notions of rough set theory and soft set theory.

Rough Set Theory
In rough set theory, an information system is defined as a 4-tuple (quadruple) $S = (U, A, V, f)$, where $U$ is a non-empty finite set of objects, $A$ is a non-empty finite set of attributes, $V = \bigcup_{a \in A} V_a$ with $V_a$ the domain (value set) of attribute $a$, and $f : U \times A \to V$ is called the information (knowledge) function. Rough set theory was initially developed by Pawlak [4] for modeling vagueness and granularity in an information system. The indiscernibility relation, which is generated by information about the objects of interest, is the starting point of rough set approximation. If two objects have the same features, they are regarded as indiscernible (indistinguishable) in an information system.
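The indiscernibility idea can be illustrated with a minimal sketch: objects whose values agree on every chosen attribute fall into the same class. The table, the attribute names, and the function name `indiscernibility_classes` are hypothetical and serve only as an illustration.

```python
from collections import defaultdict


def indiscernibility_classes(table, attributes):
    """Group objects whose values agree on every attribute in `attributes`."""
    classes = defaultdict(list)
    for obj, row in table.items():
        # Objects sharing the same value tuple are indiscernible.
        key = tuple(row[a] for a in attributes)
        classes[key].append(obj)
    return sorted(sorted(members) for members in classes.values())


# Hypothetical information system: object -> {attribute: value}
table = {
    "x1": {"colour": "red", "size": "small"},
    "x2": {"colour": "red", "size": "small"},
    "x3": {"colour": "blue", "size": "small"},
}

by_both = indiscernibility_classes(table, ["colour", "size"])
by_size = indiscernibility_classes(table, ["size"])
```

With both attributes, `x1` and `x2` are indistinguishable while `x3` stands alone; with `size` only, all three objects collapse into a single class.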

Soft Set Theory
Let $U$ be a non-empty universe of objects and $E$ a set of parameters in relation to the objects of $U$, where $n$ is the number of parameters and $q$ is the number of objects. In the house example (Example 2.1), the five parameters stand for "expensive", "beautiful", "wooden", "cheap", and "in the green surroundings", respectively. Consider the mapping given by "house condition (.)", where (.) is to be filled in by one of the parameters.
In the following section, an alternative approach for clustering web user transactions using soft set theory is proposed.

Proposed Soft Set-based Technique
This section presents the proposed soft set-based co-occurrence approach for clustering web user transactions. It is based on the fact that user transactions can be represented as a soft set.

The Proposed Technique
Firstly, we present the relation between a soft set and a Boolean-valued information system, as follows.
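Proposition 3.1. Let $(F, E)$ be a soft set over the universe $U$. We define a mapping $F = \{f_1, f_2, \ldots, f_n\}$, where $f_i : U \to \{0, 1\}$ with $f_i(x) = 1$ if $x \in F(e_i)$ and $f_i(x) = 0$ otherwise, for $1 \le i \le n$; then the soft set $(F, E)$ can be considered as a Boolean-valued information system.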
From Proposition 3.1, a binary-valued information system can be easily represented as a soft set. Thus, there is a one-to-one correspondence between a soft set $(F, E)$ and a Boolean-valued information system. The soft set-based representation of web user transactions is illustrated in Example 3.1 as follows.
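The correspondence between a soft set and a Boolean-valued table can be sketched as a round trip. This is a minimal sketch under stated assumptions: a soft set is stored as a dictionary from parameter to the set of objects it holds for, and the users `u1` to `u3`, the hyperlinks `hl1` to `hl3`, and both function names are hypothetical.

```python
def soft_set_to_boolean_table(universe, soft_set):
    """Return object -> {parameter: 0 or 1}; 1 means the object is in F(parameter)."""
    return {
        u: {e: int(u in members) for e, members in soft_set.items()}
        for u in universe
    }


def boolean_table_to_soft_set(table):
    """Inverse direction: collect, for each parameter, the objects whose entry is 1."""
    soft_set = {}
    for u, row in table.items():
        for e, value in row.items():
            members = soft_set.setdefault(e, set())
            if value == 1:
                members.add(u)
    return soft_set


# Hypothetical data: three users and the hyperlinks they clicked.
universe = ["u1", "u2", "u3"]
soft_set = {"hl1": {"u1", "u2"}, "hl2": {"u2"}, "hl3": set()}

table = soft_set_to_boolean_table(universe, soft_set)
recovered = boolean_table_to_soft_set(table)
```

Converting the table back yields the original soft set, which is the sense in which the two representations carry the same information.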
Example 3.1. The data of web user transactions, adopted from [18] and given in Table 1, contain four users.

Data transactions of four users
Table 2 above shows whether the user clicks on hyperlink $hl_1$, hyperlink $hl_2$, and so on. Table 2 can be transformed into a soft set. From Table 2, the parameter co-occurrence of each transaction is obtained. From Table 2 and from Example 3.3, the binary relations are formed with a given threshold of 0.4. From Table 2, we also obtain the similarity classes of each transaction with a given threshold of 0.4.

Correctness Proof
The following definition states that two web user clusters in $U$ are similar if their unions are equal.
Proof. We suppose that the contrary holds. This contradicts the hypothesis.

Algorithm and Its Complexity
The algorithm of the proposed technique is presented in Figure 1.

Algorithm: Soft set technique
Input: Web user transactions data set
Output: Web user transactions clusters
Begin
Step 1. Calculate the size of the similarity between two transaction objects.
Step 2. Get the similarity classes for the given threshold value.

Thus, the overall computational complexity is polynomial.
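The two steps above can be sketched in Python. This is a minimal sketch, not the authors' implementation, and it rests on three assumptions: the co-occurrence set of a transaction is the set of hyperlinks it clicked, the similarity between two transactions is the size of the intersection of their co-occurrence sets divided by the size of their union, and the final clusters are the distinct similarity classes. All function names and the sample data are hypothetical.

```python
def co_occurrence(table, u):
    """Assumed co-occurrence set: parameters (hyperlinks) with value 1 for u."""
    return {e for e, value in table[u].items() if value == 1}


def similarity(table, t, u):
    """Assumed similarity: shared hyperlinks over all hyperlinks of t and u."""
    a, b = co_occurrence(table, t), co_occurrence(table, u)
    union = a | b
    # Two empty transactions are treated as identical to keep the relation reflexive.
    return len(a & b) / len(union) if union else 1.0


def similarity_classes(table, threshold):
    """Step 2: for each u, collect the transactions related to u under the threshold."""
    return {
        u: frozenset(t for t in table if similarity(table, t, u) >= threshold)
        for u in table
    }


def clusters(table, threshold):
    """Assumed clustering rule: identical similarity classes form one cluster."""
    distinct = set(similarity_classes(table, threshold).values())
    return sorted(sorted(c) for c in distinct)


# Hypothetical Boolean-valued table: transaction -> {hyperlink: 0 or 1}
table = {
    "t1": {"hl1": 1, "hl2": 1, "hl3": 0, "hl4": 0, "hl5": 0},
    "t2": {"hl1": 1, "hl2": 1, "hl3": 1, "hl4": 0, "hl5": 0},
    "t3": {"hl1": 0, "hl2": 0, "hl3": 0, "hl4": 1, "hl5": 0},
    "t4": {"hl1": 0, "hl2": 0, "hl3": 0, "hl4": 1, "hl5": 1},
}

result = clusters(table, threshold=0.4)
```

In this sketch every pair of transactions is compared once per similarity class, so the number of similarity evaluations grows quadratically with the number of transactions, which is consistent with a polynomial overall cost.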

Results and Discussion
In this section, the experimental results of the proposed technique for clustering web user transactions are presented. The experiments compare the proposed technique with the rough set-based techniques [18][19][20] on a 1.86 GHz Intel Core i3-3217U machine with 4 GB of memory running the Windows 10 operating system. The techniques for clustering web transactions are implemented in MATLAB version 8.1.0.604 (R2013a). For the experiments, two benchmark datasets from the UCI Repository of Machine Learning Databases are used, obtained from [21]. The client-side cached data are not recorded; thus, the data contain only the server-side log. A total of 2000 transaction instances are used in the experiments; these data are then divided into five sample sizes, i.e. 100, 200, 500, 1000, and 2000 transaction instances, respectively.

Cluster Purity
In general terms, web user transaction clustering algorithms are based on criteria for assessing the quality of a given class. In particular, they take some parameters as input, e.g. the number of classes. Since the clustering algorithms are an a priori technique, they need to be assessed in terms of the purity of classes [44]. Formally, the purity of classes (clusters) is defined as follows:

$$\mathrm{Purity}(i) = \frac{t_{hit}}{t_n},$$
where $t_{hit}$ is the number of data occurring in both the $i$-th cluster under the given threshold and $t_n$ is the number of data in the dataset. Meanwhile, the overall purity of clusters aggregates the purity of the individual clusters. According to the above definitions, the highest value of overall class purity reflects the best clustering result, where a perfect clustering result has a value close to 100%. In the following sub-section, we present the experimental results of our proposed soft set-based technique on the benchmark datasets.
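A purity computation can be sketched as follows. This is a minimal sketch under stated assumptions that may differ from the paper's exact definition: the "hit" count of a cluster is taken to be its overlap with its best-matching reference class, and the overall purity is taken to be the sum of the per-cluster hits divided by the dataset size (the standard clustering purity). The reference labels, the clusters, and the function names are hypothetical, and every cluster is assumed to be non-empty.

```python
from collections import Counter


def cluster_hits(cluster, labels):
    """Number of members of `cluster` that belong to its most frequent reference class."""
    return Counter(labels[x] for x in cluster).most_common(1)[0][1]


def cluster_purity(cluster, labels):
    """Per-cluster purity: hits divided by the number of data in the dataset."""
    return cluster_hits(cluster, labels) / len(labels)


def overall_purity(clusters, labels):
    """Assumed aggregation: sum of the per-cluster purities."""
    return sum(cluster_purity(c, labels) for c in clusters)


# Hypothetical reference classes and clustering result.
labels = {"t1": "A", "t2": "A", "t3": "B", "t4": "B", "t5": "B"}
found = [["t1", "t2", "t3"], ["t4", "t5"]]

purity = overall_purity(found, labels)
```

Here the first cluster contributes two hits and the second contributes two, so four of the five transactions are counted as correctly grouped.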

Msnbc.com Dataset
The data describe the web pages visited by the visitors and are recorded chronologically at the level of URL category. The data were taken from Internet Information Server (IIS) logs for msnbc.com and the news-related portions of msn.com. The requests are not recorded at the finest level of detail, that is, at the URL level; instead, they are recorded at the level of page category (as specified by the site administrator). Page requests served through the caching mechanism were not logged in the server log and therefore are not in the data. Table 6 shows a performance comparison of the proposed approach in terms of processing time (in seconds) and cluster purity. Based on the results shown in Table 6, the rough set techniques proposed in [18][19][20] took longer to execute because they compute the similarity of upper approximations and require many iterations. Meanwhile, the proposed approach shows a significant improvement of more than 116% for web user transactions ranging from 100 to 2000 data samples. For cluster purity, the proposed approach improves by up to 2.3% over the two rough set-based techniques.

Microsoft.com data set
The data describe the pages visited by the 38000 users who visited www.microsoft.com. For each user, the data list all the areas of the website (Vroots) that the user visited within a one-week time frame. The Vroots can be identified by their title and URL. Table 7 shows the comparison of the processing time (in seconds) and cluster purity. As can be seen, the proposed approach shows a significant improvement of more than 105%.

Conclusion
In this paper, we have studied the web user transaction clustering problem, with an emphasis on reducing computational complexity and increasing cluster purity. This is the first study to propose algorithms for clustering web user transactions by using soft set theory. We proposed an algorithm with lower computational complexity and higher cluster purity. Although several existing baseline techniques address the issues concerning web user transaction clustering, none of these techniques provides both lower computational complexity and higher cluster purity. We have carried out a comparative analysis of the proposed technique with respect to computational complexity and cluster purity. The results demonstrate that the proposed approach outperforms the two rough set-based approaches in terms of computational complexity and cluster purity.

Definition 2.3. (See [6].) A pair $(F, E)$ is called a soft set over $U$, where $F$ is a mapping given by $F : E \to P(U)$, with $P(U)$ the power set of $U$. This shows that a soft set is a parameterized family of subsets of the universe $U$. For $\varepsilon \in E$, $F(\varepsilon)$ is regarded as the set of $\varepsilon$-elements of the soft set $(F, E)$, or as the set of $\varepsilon$-approximate elements of the soft set, instead of a crisp set.
Example 2.1. (See [9].) Let a soft set $(F, E)$ describe the "attractiveness of houses" that one is going to purchase. Suppose that there are five houses under consideration in the universe $U$ and $E$ is a set of parameters,

From Proposition 3.1, the idea of similarity between two parameters (representing two transactions) $t$ and $u$ in $U$ is presented. First, the notion of co-occurrence of parameters in soft set theory is defined as follows.
Definition 3.1. Let $(F, E)$ be a soft set over the universe $U$ representing data of web user transactions, and let $u \in U$ be a web user transaction. A parameter co-occurrence set of an object $u$ can be defined as:

Example 3.2. Definition 3.1 can be illustrated by the following example, which uses the soft set $(F, E)$ in Table 2.

From Definition 2.1, a soft set can be regarded as a binary relation. Furthermore, from Definition 3.2, we present the notion of a binary relation with respect to the similarity between two user transactions.
Definition 3.3. Let $(F, E)$ be a soft set over the universe $U$ representing data of web user transactions, and let $t, u \in U$ be two user transactions. A binary relation $R$ between $t$ and $u$, denoted by $tRu$, is defined with respect to a pre-defined threshold value. The relation $R$ in Definition 3.3 is both reflexive and symmetric, but may not be transitive. Definition 3.3 can be demonstrated by the following example.
Example 3.4. The example uses the soft set $(F, E)$ in Table 2.

From Definition 3.3, the similarity class is formulated in the following definition.
Definition 3.4. Let $(F, E)$ be a soft set over the universe $U$ representing data of web user transactions, and let $u \in U$ be a web user transaction. The similarity class of $u$, denoted by $SC(u)$, is defined as the set of transactions which are similar to $u$, i.e. $SC(u) = \{t \in U : tRu\}$.
From Definition 3.4, for any given threshold value, we obtain the corresponding similarity classes. An expert can choose a threshold based on their preferred value to get the expected similarity classes. Definition 3.4 can be demonstrated by the following example.
Example 3.5. The example uses the soft set $(F, E)$ in Table 2.

Definition 3.5. Let $(F, E)$ be a soft set over the universe $U$ representing data of web user transactions. Two web user clusters $C_i$ and $C_j$ in $U$, for $i \ne j$, are said to be the same if

From the similarity classes in Definitions 3.4 and 3.5, we can form a cluster of web user transactions as shown in Proposition 3.2.
Proposition 3.2. Let $(F, E)$ be a soft set over the universe $U$ representing data of web user transactions and $SC(u_i)$ be a similarity class of transaction $u_i$, for

As a consequence, we get the following:

Figure 1. The pseudo-code of the proposed approach

A Soft Set-based Co-occurrence for Clustering Web User Transactions (Edi Sutoyo)


ISSN: 1693-6930 TELKOMNIKA Vol. 15, No. 3, September 2017: 1344-1353

The improvement is observed for web user transactions ranging from 100 to 2000 sample data. For cluster purity, the proposed approach improves by up to 7.2% over the two rough set-based techniques.

Definition 2.1. Two elements
As from the example above,

To illustrate Proposition 3.1, let us consider Example 2.1 above. The soft set in Example 2.1 can be represented as a Boolean-valued information system.
Table 1. A Boolean-tabular representation of the soft set

Table 6. Performance comparison of Msnbc.com data set

Table 7. Performance comparison of Microsoft.com data set