An analytical model of website relationships based on browsing history embedding considerations of page transitions

Abstract: In recent years, it has become possible to obtain a large amount of information and receive various services through websites. Consequently, web browsing activity is increasing and numerous companies conduct their businesses online. In this scenario, users' web browsing behaviour is one of the important topics in web marketing analysis. Several studies have analysed web browsing history with distributed representation models, which can capture the relationships between continuously browsed websites as similar representations. However, browsing behaviour without a clear purpose in the web browsing history can deteriorate the representations. In this paper, we propose a sparse skip gram model that uses a regularised online learning approach to analyse website relationships robustly. In addition, we apply our method to actual browsing history data and discuss the findings acquired from the analytical results. We show that our proposed model represents the characteristics of websites with subspaces of the embedding space.


Introduction
In recent years, various services have been provided to consumers through websites over the internet. For example, one can purchase a variety of items on electronic commerce (EC) sites and have the purchased items delivered to one's home. When going out for dinner, gourmet websites allow us to search for preferable restaurants and make a reservation. The internet has therefore become an indispensable tool in our daily life. The accumulated web browsing history generated by the browsing behaviour of consumers reflects their preferences, and several studies have approached this topic from various viewpoints. For example, Liu et al. (2015) employed a convolutional neural network model to predict the tendency of consumers to click on advertisements, using their web browsing behaviour as input. Marketing analysis based on web browsing history data is a well-investigated approach in this field. Such analysis is performed to optimise marketing measures such as internet advertising or mail distribution, and consumer preferences are frequently explained in terms of the websites they browse. For instance, it is assumed that a consumer who has browsed several fashion sites is interested in fashion and will visit other fashion sites in the future as well. However, directly interpreting and analysing the characteristics of the large number of websites browsed on the internet is difficult.
Hence, a method for conducting an effective analysis that estimates the characteristics of each website through web browsing history data is required. In previous studies, such as by Tsuruhara et al. (2008), the authors performed an independent component analysis of browsing history data and extracted the browsing concepts. Schroder et al. (2019) segmented the consumer data based on the topic model and analysed the type of websites the consumers of each segment were interested in. Notably, these approaches assumed that all the websites browsed by a single consumer are largely similar.
However, considering the web browsing activities performed by each user, although there are strong relationships between the websites browsed sequentially before and after a particular website, not all pairs of websites browsed by the same consumer are similar. For example, if a consumer planning a trip browses hotel-booking sites and gourmet sites in succession, both are browsed for the purpose of planning the trip. Therefore, it should be assumed that there may be multiple browsing purposes within a single consumer's web browsing history. In this situation, if we apply methods that define the similarities between websites by simple co-occurrence at the consumer level, distinguishing between multiple browsing purposes can be challenging. Therefore, an analytical model should be applied that considers the similarities between local browsing actions instead of within the complete browsing history.
Meanwhile, in the field of natural language processing, extensive research has been performed on the distributed representation of text data, which shows high performance in the semantic analysis of words. Distributed representation is a technique that represents the meaning of a word through a semantic vector; word2vec, proposed by Mikolov et al. (2013a), is a typical model for learning distributed representations. In this model, each word in the corpus is assumed to be similar to the context words that tend to appear in the same sentence, and the representation learning is accomplished by an online learning algorithm. By the same logic, objects in web browsing history data also show similarity to surrounding objects. Later, Tagami et al. (2018) and Hosaka et al. (2019) demonstrated the effectiveness of applying distributed representation models such as word2vec to web browsing history data as well.
However, there is a significant difference between web browsing history data and text data. The words in the text data are based on clear contextual meanings, whereas browsing behaviour does not always reflect the consumers' browsing purpose. For example, during multi-purpose browsing, there is a possibility that websites sequentially browsed before and after a particular one are based on different purposes. Moreover, if one gets a phone call during web browsing, the browsing behaviour can change suddenly because of the content being discussed on the phone. Thus, there are several website relationships that cannot be extracted in a straightforward manner from web browsing history.
In this paper, we propose a distributed representation learning method that extracts strong relationships in web browsing history data by incorporating a novel sparse regularisation into the estimation. The browsing behaviour of users is not always generated by a rational process, and websites viewed sequentially by a user may have no relationship among themselves; in a few cases, a user may browse websites randomly in his/her spare time. However, most users usually browse websites on the internet sequentially with a certain context or meaning. The website relationships that we focus on in this study are defined by the relative frequency with which websites are browsed by each user in the same context. If the websites that tend to be browsed before or after a target website can be identified, we can form a hypothesis about the contexts and intentions behind the browsing behaviour. Such hypotheses can be useful in considering appropriate promotion strategies and marketing measures. Occasionally, reasonably interpreting the relationships between websites generated by users' browsing behaviours may be difficult. Even so, such findings may be useful for marketing planning, as it is difficult for an individual marketer to imagine them based on just his/her own experience or logical thinking.
Website relationships can be appropriately analysed using our approach. In particular, we introduce follow-the-regularised-leader proximal (FTRL-proximal), an online learning algorithm suggested by McMahan et al. (2013), which enables regularised learning of the word2vec model and the estimation of sparse distributed representations of websites. The algorithm allows the extraction of strong relationships between websites from the sparse distributed representations constructed by the proposed method. Finally, we apply the proposed method to actual web browsing history data and analyse the relationships between websites based on the resulting distributed representations. We demonstrate the effectiveness of the proposed model through an analysis of the real data.

Distributed representation
In the field of natural language processing, one of the most important issues is the numerical representation of words. Previously, two major methods were used to represent words numerically: the one-hot encoding method, which treats each word as a dummy variable, and the term frequency-inverse document frequency (TF-IDF) method, which assigns weights to word occurrences. In recent years, however, distributed representation methods that provide numerical representations in an embedding space while considering the meanings of words have gained prominence. This technique represents each word as a real-valued vector in the constructed embedding space. Words with similar meanings have similar distributed representations, and the distance between word vectors quantifies the similarity between the meanings of the words. However, to use distributed representations efficiently, an appropriate algorithm is necessary to estimate the word vectors accurately. In this context, word2vec, suggested by Mikolov et al. (2013a), is the most typical model for learning distributed representations and constructing an embedding space.
The word2vec model estimates distributed representations using neural networks, based on the hypothesis that a word in the corpus can be predicted from its context words. There are two learning approaches for word2vec: the skip gram model and the continuous bag-of-words (CBoW) model. In the skip gram model, a one-hot vector of a word in the corpus is received as input and a two-layer neural network is trained to predict the context words. Then, the weight vector from the input unit to the hidden layer is extracted as the word vector. In practice, from the viewpoint of the computational complexity of the learning algorithm, negative sampling, which approximates the multiclass classification problem with a few binary classification problems, as suggested by Mikolov et al. (2013b), is widely used. The learning algorithm of the skip gram model with negative sampling is shown below.
We denote the number of words in the corpus by I and the size of the vocabulary by V. Further, the W words before and after each word (the window size) are regarded as its context, and the negative sample size K is the number of negative examples sampled for each word. Let u_v and c_v be the D-dimensional word vector and context vector of the word v (1 ≤ v ≤ V), respectively. Then, we build a noise distribution θ as follows:

θ_v = freq_v^{3/4} / Σ_{v′=1}^{V} freq_{v′}^{3/4}    (1)

Here, freq_v is the frequency of the word v in the corpus. In the skip gram model, K negative words w_k (1 ≤ k ≤ K) are sampled from θ for each combination of a word w_i in the corpus and a context word w_{i+j} (−W ≤ j ≤ W, j ≠ 0), and the loss L(w_{i+j} | w_i) is calculated as follows:

L(w_{i+j} | w_i) = −log σ(u_{w_i} · c_{w_{i+j}}) − Σ_{k=1}^{K} log σ(−u_{w_i} · c_{w_k})    (2)

Here, σ(·) is the sigmoid function. The loss for the whole corpus, L_SGNS, is calculated as follows:

L_SGNS = Σ_{i=1}^{I} Σ_{−W ≤ j ≤ W, j ≠ 0} L(w_{i+j} | w_i)    (3)

The model parameters are estimated by minimising equation (3) with stochastic gradient descent. Word2vec performs excellently in several natural language processing tasks. For example, Xue et al. (2014) applied word2vec to text data collected from social media and conducted a sentiment analysis, calculating a score representing the positivity or negativity of each word based on inter-word similarities. In addition, Siencnik (2015) applied a clustering method to the word vectors learned with word2vec and used the clusters for named entity recognition, confirming a higher accuracy than a linear support vector machine classification baseline.
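As a concrete illustration of this learning algorithm, the following sketch implements one online pass of skip gram training with negative sampling on a toy corpus; the vocabulary size, dimensions, window, and learning rate are illustrative values, not settings from this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus and hyperparameters (illustrative values only).
V, D, W, K = 100, 16, 2, 5            # vocabulary, dimensions, window, negatives
corpus = rng.integers(0, V, size=1000)
freq = np.bincount(corpus, minlength=V).astype(float)

# Equation (1): noise distribution proportional to freq^(3/4).
theta = freq ** 0.75
theta /= theta.sum()

u = rng.normal(0.0, 0.1, (V, D))      # word vectors u_v
c = rng.normal(0.0, 0.1, (V, D))      # context vectors c_v

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(i, lr=0.05):
    """One stochastic-gradient update for word w_i and its context window."""
    wi = corpus[i]
    for j in range(-W, W + 1):
        if j == 0 or not 0 <= i + j < len(corpus):
            continue
        pos = corpus[i + j]
        negs = rng.choice(V, size=K, p=theta)
        # Positive pair: push sigma(u . c) toward 1 (first term of equation (2)).
        g = sigmoid(u[wi] @ c[pos]) - 1.0
        grad_u = g * c[pos]
        c[pos] -= lr * g * u[wi]
        # Negative pairs: push sigma(u . c) toward 0 (second term of equation (2)).
        for neg in negs:
            g = sigmoid(u[wi] @ c[neg])
            grad_u += g * c[neg]
            c[neg] -= lr * g * u[wi]
        u[wi] -= lr * grad_u

for i in range(len(corpus)):          # one online pass over the corpus
    sgns_step(i)
```

Each step touches only the vectors of the sampled words, which is what makes the algorithm scale to large corpora.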

Online learning with regularisation
Using the stochastic gradient descent to update the parameters in the negative gradient direction of loss per data is the most basic optimisation method used in online learning. However, the stochastic gradient descent cannot be directly applied to a loss function that includes nondifferentiable points such as the L1 norm. Although Duchi et al. (2011) adapted the subgradient method for online learning and estimation of parameters even in such a situation, it is difficult to learn sparse parameters from a loss function with L1 norm using the subgradient method.
However, the effectiveness of a learning algorithm that combines the stochastic gradient descent and the proximal gradient method has been demonstrated in previous studies. Duchi and Singer (2009) suggested an online learning algorithm called forward backward splitting (FOBOS). This algorithm can optimise the loss function, including the regularisation term, efficiently by dividing it into differentiable and nondifferentiable terms and alternately applying the stochastic gradient descent and the proximal gradient method. Xiao (2009) suggested an online learning algorithm called regularised dual averaging that sparsifies a parameter based on the past gradients. Furthermore, McMahan et al. (2013) suggested FTRL-proximal, a hybrid of FOBOS and regularised dual averaging algorithms. This algorithm can efficiently estimate a parameter with the desired structure according to the regularisation term because it learns the parameter by referring to not only the past gradients but also the past parameter values. The learning algorithm of FTRL-proximal with L1 regularisation is presented below.
We denote the loss function for the parameter x by ρ(x), which can be divided into a differentiable function f(x) and an L1 regularisation term with strength β:

ρ(x) = f(x) + β ‖x‖_1    (4)

Let x_t be the value of x after the t-th update, g_s the gradient of f at x_s, and η_s the learning rate of the s-th update. Then, x_{t+1} is calculated as follows:

x_{t+1} = argmin_x ( Σ_{s=1}^{t} g_s · x + Σ_{s=1}^{t} (σ_s / 2) ‖x − x_s‖²_2 + β ‖x‖_1 )    (5)

Here, σ_s = 1/η_s − 1/η_{s−1}, so that Σ_{s=1}^{t} σ_s = 1/η_t. Defining

z_t = Σ_{s=1}^{t} g_s − Σ_{s=1}^{t} σ_s x_s    (6)

equation (5) has a closed-form per-coordinate solution. Letting x_{t,i} and z_{t,i} be the i-th elements of x_t and z_t,

x_{t+1,i} = 0    if |z_{t,i}| ≤ β    (7)

x_{t+1,i} = −η_t (z_{t,i} − β sgn(z_{t,i}))    otherwise    (8)

Here, η_t is the learning rate of the t-th update.
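The per-coordinate update above can be sketched as a small optimiser class; the hyperparameter values and the quadratic test problem at the end are illustrative assumptions, not taken from McMahan et al. (2013).

```python
import numpy as np

class FTRLProximal:
    """Per-coordinate FTRL-proximal with L1 regularisation, a sketch after
    McMahan et al. (2013). beta is the L1 strength, eta0 the base rate."""

    def __init__(self, dim, beta=1.0, eta0=0.5):
        self.beta = beta
        self.eta0 = eta0
        self.z = np.zeros(dim)    # accumulated g_s - sigma_s * x_s
        self.n = np.zeros(dim)    # accumulated squared gradients
        self.x = np.zeros(dim)

    def update(self, grad):
        # Adagrad-style rate: 1/eta_t = sqrt(n_t)/eta0, so
        # sigma_t = 1/eta_t - 1/eta_{t-1} = (sqrt(n_t) - sqrt(n_{t-1}))/eta0.
        n_new = self.n + grad ** 2
        sigma = (np.sqrt(n_new) - np.sqrt(self.n)) / self.eta0
        self.z += grad - sigma * self.x
        self.n = n_new
        # Closed-form minimiser: zero where |z| <= beta, soft-thresholded else.
        eta_t = self.eta0 / np.sqrt(np.maximum(self.n, 1e-12))
        self.x = np.where(np.abs(self.z) <= self.beta, 0.0,
                          -eta_t * (self.z - self.beta * np.sign(self.z)))
        return self.x

# Usage: minimise f(x) = 0.5 * ||x - target||^2 plus the implicit L1 term.
target = np.array([2.0, 0.0, -3.0])
opt = FTRLProximal(3, beta=1.0, eta0=0.5)
for _ in range(300):
    x = opt.update(opt.x - target)    # gradient of f at the current iterate
```

The second coordinate, whose gradient is always zero, stays exactly at zero, while the other coordinates converge to soft-thresholded values; this is the sparsifying behaviour that the proposed model exploits.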

Sparse distributed representation
In this study, our goal was to estimate sparse distributed representations of websites. Several previous studies have also constructed models that estimate such sparse distributed representations from the point of view of interpretability. Faruqui et al. (2015) proposed a method to map the estimated word vectors linearly into a higher-dimensional, sparse vector space using a matrix factorisation approach. In their research, the loss function L_SMF was defined as follows:

L_SMF = Σ_{v=1}^{V} ( ‖u_v − Q s_v‖²_2 + β ‖s_v‖_1 ) + γ ‖Q‖²_2    (9)

Here, Q is the dictionary of base vectors and s_v is the sparse word vector for the word v (1 ≤ v ≤ V); β is the L1 regularisation strength for s_v and γ is the L2 regularisation strength for Q. Faruqui et al. (2015) optimised equation (9) using regularised dual averaging and showed that it is possible to learn word vectors with high interpretability and representability by sparsifying them while increasing the dimensions of the embedding space. Besides Faruqui et al. (2015), several other researchers have successfully transformed word vectors into sparse word vectors. Park et al. (2017) proposed a vector rotation algorithm that decreases the complexity of the representation. Moreover, Subramanian et al. (2018) proposed a method to nonlinearly transform the distributed representation of each word into high-dimensional sparse vectors by applying the k-sparse autoencoder suggested by Makhzani and Frey (2014). They evaluated the interpretability of each axis of the embedding space using a word intrusion detection test and demonstrated an improvement in interpretability.
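A minimal sketch of optimising an objective of the form of equation (9) is shown below; for brevity it uses batch ISTA (a proximal gradient method) rather than regularised dual averaging, holds the dictionary Q fixed, and all sizes and strengths are toy values rather than those of Faruqui et al. (2015).

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup: map dense D-dimensional word vectors to sparse D'-dimensional
# codes under a fixed random dictionary Q.
V, D, Dp = 50, 8, 32
U = rng.normal(size=(V, D))           # dense word vectors u_v
Q = rng.normal(size=(D, Dp))          # dictionary of base vectors
beta, step = 0.5, 0.005               # L1 strength, ISTA step size

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

S = np.zeros((V, Dp))                 # sparse codes s_v
for _ in range(2000):                 # ISTA: gradient step + soft-threshold
    grad = (S @ Q.T - U) @ Q          # gradient of 0.5 * sum_v ||u_v - Q s_v||^2
    S = soft_threshold(S - step * grad, step * beta)

sparse_rate = np.mean(S == 0)         # fraction of exactly-zero code entries
```

The soft-thresholding step is what produces exact zeros in the codes, mirroring the role of the L1 term in equation (9).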
In contrast, Sun et al. (2016) proposed a method to learn sparse word vectors by introducing regularised dual averaging into the continuous bag-of-words (CBoW) model, one of the structural models of word2vec. Unlike the other studies mentioned above, they built an algorithm that sparsifies and estimates the distributed representation of each word simultaneously.

Proposed model
In this paper, we propose a model to learn the website vector for extracting strong relationships from web browsing history data by introducing the sparse regularisation method to the distributed representation model. Previously, Sun et al. (2016) had proposed a similar algorithm that combined regularised dual averaging and the CBoW model. However, the parameters of infrequent words estimated by the CBoW model are unstable because it predicts the word from the context and there are representation conflicts between words occurring in similar contexts. Moreover, it has been shown that FTRL-proximal can estimate parameters more accurately than regularised dual averaging because it updates the parameters while referring to their past values. Furthermore, online learning is compatible with web browsing history data not only from the perspective of the convergence of the calculations, but also because numerous websites are created or shut down every day, and the semantic structure of the websites keeps changing constantly.
Thus, we propose a new model to estimate the sparse website vector with higher accuracy by introducing FTRL-proximal to the skip gram model. Our proposed model is formulated as follows.
Let I and V be the number of total views and websites, respectively. Additionally, we denote the window size and the negative sample size by W and K, respectively. Further, we denote the D-dimensional website vector as u_v and the context vector as c_v for website v (1 ≤ v ≤ V). We build the noise distribution θ according to equation (1). Furthermore, let λ, η, and α be the regularisation strength, the learning rate of the website vectors, and the learning rate of the context vectors, respectively. We adjust the learning rates according to Adagrad, suggested by Duchi et al. (2011), and the adjusted parameter is denoted as ϵ.
The loss for each combination of the website w_i and the context website w_{i+j} (−W ≤ j ≤ W, j ≠ 0) is calculated according to equation (2). Then, the loss for the learning dataset, L_SSGNS, is calculated as follows:

L_SSGNS = Σ_{i=1}^{I} Σ_{−W ≤ j ≤ W, j ≠ 0} L(w_{i+j} | w_i) + λ Σ_{v=1}^{V} ‖u_v‖_1    (10)

Moreover, the gradients of equation (2) with respect to each parameter are calculated as follows:

∂L(w_{i+j} | w_i)/∂u_{w_i} = (σ(u_{w_i} · c_{w_{i+j}}) − 1) c_{w_{i+j}} + Σ_{k=1}^{K} σ(u_{w_i} · c_{w_k}) c_{w_k}    (11)

∂L(w_{i+j} | w_i)/∂c_{w_{i+j}} = (σ(u_{w_i} · c_{w_{i+j}}) − 1) u_{w_i}    (12)

∂L(w_{i+j} | w_i)/∂c_{w_k} = σ(u_{w_i} · c_{w_k}) u_{w_i}    (13)

The calculation procedure for the t-th update of the website vector u_{w_i} is shown below. Let u_{w_i,t} be the value of u_{w_i} before the t-th update. In addition, let g_{w_i,t} and g²_{w_i,t} be the sum of the gradients and the sum of the squared gradients at u_{w_i} until the t-th update, respectively. In other words, denoting the gradient of the t-th update by ∇_t, g_{w_i,t} and g²_{w_i,t} are updated as follows:

g_{w_i,t} = g_{w_i,t−1} + ∇_t,    g²_{w_i,t} = g²_{w_i,t−1} + ∇_t²    (14)

Then, the learning rate of the t-th update, η_{w_i,t}, is calculated element-wise as follows:

η_{w_i,t} = η / √(ϵ + g²_{w_i,t})    (15)

Finally, u_{w_i,t,d}, the d-th element (1 ≤ d ≤ D) of u_{w_i,t}, is updated as follows:

u_{w_i,t+1,d} = 0    if |z_{w_i,t,d}| ≤ λ;    u_{w_i,t+1,d} = −η_{w_i,t,d} (z_{w_i,t,d} − λ sgn(z_{w_i,t,d}))    otherwise    (16)

Here, z_{w_i,t} = g_{w_i,t} − Σ_{s=1}^{t} σ_{w_i,s} u_{w_i,s} with σ_{w_i,s} = 1/η_{w_i,s} − 1/η_{w_i,s−1}, following the FTRL-proximal update, and η_{w_i,t,d} and z_{w_i,t,d} are the d-th elements of η_{w_i,t} and z_{w_i,t}. The context vector c_v is estimated by applying stochastic gradient descent with Adagrad, as in the skip gram model.
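Putting the pieces together, the proposed learning procedure can be sketched as follows. This is a toy, single-pass sketch under assumed data sizes and hyperparameters, not the authors' implementation: website vectors are updated with FTRL-proximal (sparsifying) and context vectors with Adagrad stochastic gradient descent, as described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes and hyperparameters (the real settings appear in Section 4).
V, D, W, K = 50, 8, 2, 3
lam, eta, alpha, eps = 0.01, 0.5, 0.5, 1.0

views = rng.integers(0, V, size=2000)            # a browsing sequence
freq = np.bincount(views, minlength=V).astype(float)
theta = freq ** 0.75
theta /= theta.sum()                             # noise distribution

u = rng.normal(0.0, 0.1, (V, D))                 # website vectors
c = rng.normal(0.0, 0.1, (V, D))                 # context vectors
z = np.zeros((V, D))                             # FTRL-proximal state for u
n_u = np.zeros((V, D))                           # squared-gradient sums for u
n_c = np.zeros((V, D))                           # squared-gradient sums for c

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for i in range(len(views)):
    wi = views[i]
    for j in range(-W, W + 1):
        if j == 0 or not 0 <= i + j < len(views):
            continue
        ctx = views[i + j]
        negs = rng.choice(V, size=K, p=theta)
        # Gradient of the skip gram loss with respect to the website vector.
        g_u = (sigmoid(u[wi] @ c[ctx]) - 1.0) * c[ctx]
        for nk in negs:
            g_u += sigmoid(u[wi] @ c[nk]) * c[nk]
        # FTRL-proximal update of the website vector (L1, sparsifying).
        nn = n_u[wi] + g_u ** 2
        sigma = (np.sqrt(eps + nn) - np.sqrt(eps + n_u[wi])) / eta
        z[wi] += g_u - sigma * u[wi]
        n_u[wi] = nn
        eta_t = eta / np.sqrt(eps + n_u[wi])
        u[wi] = np.where(np.abs(z[wi]) <= lam, 0.0,
                         -eta_t * (z[wi] - lam * np.sign(z[wi])))
        # Adagrad SGD for the context vectors (no regularisation).
        g_c = (sigmoid(u[wi] @ c[ctx]) - 1.0) * u[wi]
        n_c[ctx] += g_c ** 2
        c[ctx] -= alpha / np.sqrt(eps + n_c[ctx]) * g_c
        for nk in negs:
            g_n = sigmoid(u[wi] @ c[nk]) * u[wi]
            n_c[nk] += g_n ** 2
            c[nk] -= alpha / np.sqrt(eps + n_c[nk]) * g_n

sparse_rate = np.mean(u == 0)                    # share of exactly-zero elements
```

Because the FTRL-proximal state is kept per website and per dimension, the update remains fully online: each view touches only the vectors of the sampled websites.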

Analytical experiments with actual data
In this study, we applied the proposed model to real web browsing history data and obtained sparse distributed representation of websites. In addition, we analysed the relationships between the websites based on the learned representation to verify the effectiveness of the proposed model.

Experimental setup
We employed web browsing history data provided by VALUES, Inc., Japan, to demonstrate the suitability of our proposed model for real-world data analysis. The data comprise the host names of websites browsed on a PC or smartphone, captured from monitors (panel members) who agreed to register for the survey. The data include I = 78,508,580 total views, 49,787 users, and V = 47,854 websites. The model parameters are set as described below. We set the dimensions of the website and context vectors as D = 200, the window size as W = 4, and the negative sample size as K = 20. In addition, based on a preliminary analysis, we set the regularisation strength as λ = 2.5, the learning rates of the website and context vectors as η = α = 0.5, and the adjusted parameter of Adagrad as ϵ = 1.0.
We extracted websites that displayed high similarities with a particular target website and verified the reasonability of the estimated website relationships. Moreover, as the model should also be evaluated by how well it suppresses the extraction of weak relationships, we defined the average sparse rate of the website vectors, S_vec, and that of the cosine similarity matrix generated by the website vectors, S_sim, as follows:

S_vec = (1 / (V D)) Σ_{v=1}^{V} Σ_{d=1}^{D} 1l(u_{v,d} = 0)    (17)

S_sim = (1 / V²) Σ_{v=1}^{V} Σ_{v′=1}^{V} 1l(cos(u_v, u_{v′}) = 0)    (18)

Here, 1l(·) is an indicator function that evaluates whether the relationship in parentheses holds, and cos(a, b) represents the cosine similarity between a and b.
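These two sparse rates can be computed directly from the matrix of learned website vectors; below is a small sketch with hand-made vectors (illustrative, not experiment data).

```python
import numpy as np

def sparse_rates(U):
    """Average sparse rate of the website vectors (S_vec) and of their
    cosine similarity matrix (S_sim)."""
    s_vec = np.mean(U == 0)
    norms = np.linalg.norm(U, axis=1)
    norms[norms == 0] = 1.0                  # guard against all-zero vectors
    Un = U / norms[:, None]
    cos = Un @ Un.T
    s_sim = np.mean(np.isclose(cos, 0.0))
    return s_vec, s_sim

# Hand-made sparse vectors: the first two rows share no axis, so their
# cosine similarity is exactly zero.
U = np.array([[1.0, 0.0, 0.0],
              [0.0, 2.0, 0.0],
              [1.0, 1.0, 0.0]])
s_vec, s_sim = sparse_rates(U)    # s_vec = 5/9, s_sim = 2/9
```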
We compared the proposed model with the models proposed by Mikolov et al. (2013a), Faruqui et al. (2015), and Sun et al. (2016).

Comparison with previous study
In this section, we discuss the characteristics of the representations learned by the proposed model by comparing them with the representations obtained in previous studies. To measure the effectiveness of the representations, we extracted websites displaying a high similarity with a target website and analysed the website relationships in the embedding space. As the extent of similarity should also reflect the number of shared axes, we defined the sign-dot similarity between the D-dimensional vectors a and b as follows:

sd(a, b) = Σ_{d=1}^{D} sgn(a_d) sgn(b_d)    (19)

Here, sgn(·) is the sign function with sgn(0) = 0, so that only the axes on which both vectors are non-zero contribute. For ease of interpretation, we set 'Tabelog' (tabelog.com), a typical Japanese gourmet site, as the target site of the analytical experiments in this study.

Table 1 shows the top five websites with a high cosine similarity to the website vector of 'Tabelog' for each model. According to Table 1, the model of Mikolov et al. (2013a) can extract gourmet sites similar to 'Tabelog'. The models of Faruqui et al. (2015) and Sun et al. (2016) extracted restaurant and hotel sites in addition to gourmet sites, as these sites are expected to have strong relationships with gourmet sites. In the case of the proposed model, fashion sites and regional sites were also extracted in addition to the gourmet sites; fashion sites, like gourmet sites, are presumably browsed by young consumers. Next, Table 2 shows the top five websites with a high sign-dot similarity to the website vector of 'Tabelog'. According to Table 2, the model of Sun et al. (2016) and the proposed model extracted gourmet sites and websites highly relevant to gourmet sites. Therefore, it is evident that the proposed model also learns representations that capture meaning, as observed in the previous studies. In addition, a clear concept can be assigned to each axis of the embedding space onto which the websites are mapped by the proposed model. Next, we discuss the sparsity of the parameters learned by the proposed model.
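Under this definition, with sgn(0) = 0 so that only axes on which both vectors are non-zero contribute, the sign-dot similarity can be computed as follows (the vectors are illustrative):

```python
import numpy as np

def sign_dot(a, b):
    """Sign-dot similarity: axes where a and b share a non-zero sign count
    +1, axes with conflicting signs count -1, and axes zeroed in either
    vector contribute nothing (np.sign(0) = 0)."""
    return int(np.sum(np.sign(a) * np.sign(b)))

a = np.array([0.7, 0.0, -1.2, 0.3])
b = np.array([0.2, 0.9, 0.0, -0.4])
sign_dot(a, b)   # 0: the agreeing axis 0 cancels the conflicting axis 3
sign_dot(a, a)   # 3: the number of non-zero axes of a
```

Unlike cosine similarity, this score depends only on which axes of the sparse embedding space two websites share, which is what makes it useful for interpreting subspaces.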
Table 3 shows the average sparse rate of the website vectors, S_vec, and that of the cosine similarity matrix of the website vectors, S_sim, for each model. According to Table 3, the model of Mikolov et al. (2013a) constructed dense vectors, and almost all elements of the word vectors are non-zero. In contrast, the average sparse rate of the website vectors learned by the model of Faruqui et al. (2015) is more than 90% owing to the adjusted regularisation parameter. However, in this method, the sparse rate of the cosine similarity matrix is zero. The model of Faruqui et al. (2015) transforms the learned distributed representation into a sparse vector space while maintaining the relationships; thus, it cannot sparsify the dense relationships constructed by the model of Mikolov et al. (2013a). In contrast, the proposed model can sparsify not only the website vectors but also the cosine similarity matrix, because it learns and sparsifies the distributed representation simultaneously. This result implies that websites have a structure that can be distributed over multiple subspaces of the embedding space. However, the model of Sun et al. (2016) cannot learn the sparse relationships. There are two possible reasons for this result. First, the CBoW model averages the context vectors, so the website vectors are related to more abstract representations; the CBoW model therefore has difficulty in learning the characteristics of websites with low frequency. Second, the CBoW model treats the website as a negative sample for the context websites, whereas the skip gram model samples the context websites positively. Therefore, the CBoW model emphasises learning the characteristics of the context websites instead of those of the website itself.

Figure 1  The scatter of websites on a certain subspace in the embedding space (see online version for colours)

Analysis by proposed model
In this section, we give an example of the analysis performed with the proposed model. First, we interpret the subspaces of the embedding space learned by the proposed model. Table 4 shows the top five websites with a high value in certain dimensions of the website vectors learned by the proposed model. According to Table 4, the 118th and 133rd axes of the embedding space learned by the proposed model can be assigned interpretable topics: the websites with a high value on the 118th axis are relevant to travel or leisure, whereas a high value on the 133rd axis indicates relevance to app game walkthroughs. Moreover, Figure 1 shows the scatter of websites on the subspace comprising the 28th, 49th, and 157th axes of the embedding space learned by the proposed model. Here, we arbitrarily selected the automotive websites in the dataset. In Figure 1, the automotive websites are represented by a star sign and the websites in the subspace are shown in a darker colour. The percentage of automotive websites that lie in this subspace is 60.5%, whereas the corresponding percentage of non-automotive websites is 7.2%. Thus, it can be said that this subspace represents the characteristics of automotive websites. In summary, according to Table 4 and Figure 1, the proposed model represents the characteristics of a website by using a subspace of the embedding space.

Figure 2  Browsing sequence of a consumer and the sign-dot similarity to previous websites (see online version for colours)

Figure 2 shows the browsing sequence of a consumer in a session and the average of the sign-dot similarities to the previous W websites. According to Figure 2, this consumer first checked the weather on a weather forecast site. Next, he/she viewed various news sites and a baseball team site, and used mail and social networking services. Afterward, the consumer browsed multiple automotive websites and finally viewed a few bank sites. According to the figure, the partial browsing sequence from the 9th to the 16th browse has generally high similarities among the visited websites. In other words, this partial browsing sequence can be interpreted as browsing on the same subspace of the embedding space with an interest in automotive websites. Meanwhile, a low similarity indicates a change of subspace in the embedding space, and partial browsing sequences with consistently low similarities to the immediately preceding websites can be interpreted as browsing without a certain purpose, similar to internet surfing. Hence, we can detect a consumer's interests during web browsing by superimposing the similarities between the website vectors learned by the proposed model onto the actual browsing sequence. Furthermore, as the similarities can be calculated simultaneously with browsing, we expect that this analysis can be applied to web marketing measures in real time.
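The similarity curve of Figure 2 can be reproduced, in sketch form, by averaging the sign-dot similarity of each browsed website to its previous W websites; the session and vectors below are toy stand-ins for the real data.

```python
import numpy as np

def sign_dot(a, b):
    return float(np.sum(np.sign(a) * np.sign(b)))

def window_similarities(seq, vectors, W):
    """For each browse t >= 1, the average sign-dot similarity of the
    current website to the previous (up to) W websites, i.e., the quantity
    plotted over a session in Figure 2."""
    sims = []
    for t in range(1, len(seq)):
        prev = seq[max(0, t - W):t]
        sims.append(np.mean([sign_dot(vectors[seq[t]], vectors[p])
                             for p in prev]))
    return sims

# Toy session: two websites on one subspace, then a jump to another subspace.
seq = [0, 1, 0, 2]
vectors = np.array([[1.0, 1.0, 0.0],
                    [1.0, 1.0, 0.0],
                    [0.0, 0.0, 1.0]])
sims = window_similarities(seq, vectors, W=2)   # [2.0, 2.0, 0.0]
```

The drop to zero at the last browse corresponds to the subspace change described above.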

Distributed representation model for browsing history data
In Subsection 4.2, we verified that the distributed representations learned by word2vec and its extension models can extract websites similar to a given website. This is due to the structural similarity between text data and browsing history data: both are time-series data comprising a large number of objects. Moreover, Piantadosi (2014) and Adamic and Huberman (2002) have shown that the frequencies of words and of websites, respectively, follow Zipf's law.
In this study, we focused on the possibility that similarities may exist between proximately visited websites, analogous to the characteristic assumed by word2vec. This is a cogent hypothesis because we browse the internet with a single purpose in the short term but with a variety of purposes in the long term. Consequently, each method can extract reasonable relationships by calculating the cosine similarity between the learned distributed representations.
The distributed representations of websites can also be applied to machine learning tasks related to the internet. For example, Tagami et al. (2018) learned distributed representations of consumers from web browsing history data and applied them to advertisement click prediction.

Regularisation of distributed representation
Among the previous studies on sparse distributed representation, the majority proposed a model that transformed the representation learned by other models to the sparse representation. However, the model of Sun et al. (2016) and our proposed model learn and sparsify the representations simultaneously.
According to the results presented in Subsection 4.2, the sparsity of the similarities between the representations learned by the proposed model differs greatly from that of the previous models, although all of them can learn sparse representations. According to Table 1, both the models in previous studies and the proposed model can extract strong relationships between websites. However, the results for websites with weak relationships differ: the proposed model estimates the similarity between these websites as exactly zero, whereas the models in previous studies still yield small non-zero values.
The models that transform representations learned by other models into sparse representations retain the dense relationships constructed by the original representation. In addition, the model of Sun et al. (2016) cannot learn sparse similarities because of the CBoW model's features of averaging the context vectors and sampling the websites as negative samples. Therefore, our proposed model provides a novel method to sparsify the similarities between websites. Furthermore, the difference between our proposed model and that of Sun et al. (2016) demonstrates the advantage of the skip gram model over the CBoW model from the perspective of capturing the characteristics of websites.

Significance of removing weak relationships
In the analysis of relational data, most studies focus on finding high similarity between objects, such as items in a purchase history, because several tasks pertaining to relational data aim to extract strong relationships from big data. In other words, from a micro perspective, the purpose of learning is to assign high similarity to objects with strong relationships. For example, on an EC site, users are recommended items that are similar to items they purchased in the past.
However, extracting strong relationships is insufficient from a macro perspective. If we want to analyse the multiple relationships between objects comprehensively, simply extracting the strong relationships can cause confusion as well as difficulties in interpreting the analysis results. Therefore, suppressing the extraction of weak relationships is as important as extracting strong relationships when analysing multiple object relationships simultaneously.
In actual analysis, we often handle relationships such as the similarity between objects. In such situations, the proposed model is effective because it estimates not only the strength of a relationship but also its existence. For example, in Subsection 4.3, browsing activity with zero similarity between browsed websites can be interpreted as browsing with shifting interest, whereas browsing activity with high similarity between browsed websites represents browsing with a strong interest. Moreover, the network structure of websites can be simplified by decreasing the links between them through the introduced sparsifying method, and network analysis of the representations learned by the proposed model is therefore expected to be effective.
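As a sketch of this simplification, the sparse cosine similarity matrix can be turned into a website network by linking only the pairs with non-zero similarity, with no ad-hoc threshold (toy vectors, illustrative only).

```python
import numpy as np

def similarity_graph(U):
    """Adjacency list linking websites whose cosine similarity is non-zero;
    the exact zeros produced by the sparse model remove links directly."""
    norms = np.linalg.norm(U, axis=1)
    norms[norms == 0] = 1.0
    Un = U / norms[:, None]
    cos = Un @ Un.T
    V = len(U)
    return {v: [w for w in range(V)
                if w != v and not np.isclose(cos[v, w], 0.0)]
            for v in range(V)}

# Websites 0 and 1 share no axis, so they are not linked; both overlap
# with website 2 on one axis.
U = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
graph = similarity_graph(U)   # {0: [2], 1: [2], 2: [0, 1]}
```

With dense representations, a similar graph would require choosing an arbitrary similarity cut-off, which is precisely what the sparse model avoids.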

Conclusions and future work
In this paper, we introduced FTRL-proximal into the word2vec model, which shows excellent performance in semantic text analysis, to account for both the estimation accuracy and the sparsity of the parameters and thereby extract strong relationships between websites. Additionally, we proposed a representation learning algorithm that analyses the relationships between websites while ignoring the weak relationships present in web browsing history data. We then applied the proposed model to real web browsing history data and compared its performance with that of existing models from the perspectives of semantic validity and sparsity of representation. Finally, we analysed the web browsing behaviour of a certain consumer using the proposed model. As a result, we verified that the representation learned by the proposed model combines semantic expressiveness, extracting strongly related websites as well as previous models do, with sparsity, marking the similarity between weakly related websites as exactly zero. Furthermore, we demonstrated the capability of the proposed model to extract consumers' interests from their web browsing behaviour.
As future work, it is necessary to consider a method to analyse relationships within groups of multiple websites. Numerous studies based on the distributed representation model have analysed one-to-one relationships between two objects, whereas few have focused on the overall relationships among sites, among users, or between sites and users. Therefore, it is possible that the representation learned by the proposed model is effective in such analyses as well. In addition, we need to verify whether the proposed model is effective on the text data analysed in previous studies. Interpretability is an important research subject in the field of distributed representation of language, and the proposed model is expected to learn representations with good interpretability, as achieved in previous studies.
Moreover, we proposed a new model to estimate sparse website vectors with high accuracy and applied the model to actual data. The knowledge extracted with the proposed method depends on the particular dataset, and the purpose of this study is not to establish new general facts. Therefore, additional experiments are required to examine the validity of the extracted hypotheses as future work.