QoS Prediction for Neighbor Selection via Deep Transfer Collaborative Filtering in Video Streaming P 2 P Networks

To expand the server capacity and reduce the bandwidth, P2P technologies are widely used in video streaming systems in recent years. Each client in the P2P streaming network should select a group of neighbors by evaluating the QoS of the other nodes. Unfortunately, the size of video streaming P2P network is usually very large, and evaluating the QoS of all the other nodes is resource-consuming. An attractive way is that we can predict the QoS of a node by taking advantage of the past usage experiences of a small number of the other clients who have evaluated this node. Therefore, collaborative filtering (CF) methods could be used for QoS evaluation to select neighbors. However, we might use different QoS properties for different video streaming policies. If a new video steaming policy needs to evaluate a newQoS property, but the historical experiences include very few evaluation data for this QoS property, CF methods would incur severe overfitting issues, and the clients then might get unsatisfied recommendation results. In this paper, we proposed a novel neural collaborative filtering method based on transfer learning, which can evaluate the QoS with few historical data by evaluating the other different QoS properties with rich historical data.We conduct our experiments on a large real-world dataset, the QoS values of which are obtained from 339 clients evaluating on the other 5825 clients. The comprehensive experimental studies show that our approach offers higher prediction accuracy than the traditional collaborative filtering approaches.


Introduction
In recent years, video content accounts for a large proportion of global Internet consumption.Video steaming is gradually becoming the most attractive service [1][2][3].However, Internet does not provide any quality of service guarantees to video content delivery.To expand the server capacity and reduce the video streaming bandwidth, P2P technologies are widely adopted by many content delivery systems [4][5][6][7].In a P2P network, a peer not only downloads the media data from the network but also uploads the download data to other clients in the same network.To get a better user experience of watching videos, each client (or node) in the P2P network should select some other nodes as its neighbors in terms of the quality of service (QoS) for this client [8][9][10].For example, a client might prefer to select nodes with high bandwidth.Due to the different locations and network conditions, different clients might have different QoS experience for the same node.To get the neighbors with the best QoS, one might want to evaluate the QoS of all the other nodes for each client.Unfortunately, the video streaming P2P network usually includes an extremely large number of users, and evaluating the QoS of all the nodes is time-consuming and resourceconsuming.
An attractive way is that we can predict the QoS value of a node by taking advantage of the past usage experiences of a small number of the other clients who have evaluated this node.This refers to a famous technology, collaborative filtering (CF), which has been extremely studied in recommender systems [11][12][13].With the help of CF method, each client only needs to know a small number of the real QoS values of the other nodes to select neighbors.The core idea is that if two clients have similar evaluation values of a specific QoS for some known nodes, they might also have similar QoS evaluation values for the other unknown nodes.
2 International Journal of Digital Multimedia Broadcasting However, the neighbor selection policy might need to be changed to improve the quality of video content delivery.If the new policy uses the new QoS property to select neighbors, but the historical user experiences include very few data of this new QoS property, CF methods would incur severe overfitting issues, and then each client might get worse neighbor recommendation list.Transfer learning aims to adapt a model trained in a source domain with rich labeled data for use in a target domain with less labeled data, where the source and target domain are usually related but under different distributions [14][15][16].Recently, deep neural networks have yielded remarkable success on many applications, especially on the computer vision, speech recognition, and natural language processing.Deep neural networks are powerful for learning general and transferable features.There are two major transfer learning scenarios, fine-tuning the pretrained network or treating the pretrained network as a fixed feature exactor.Instead of random initialization, we can initialize the network with a pretrained network, or we can freeze the weights of some layers of the network [17][18][19].
Unlike many supervised transfer learning tasks, we cannot simply fine-tune or freeze the weights of the network.The only information about the nodes in the video streaming P2P network is their identifiers (IDs) and the QoS evaluation historical experience.There is no raw feature for each node, and we need to lean abstract features for the nodes using embedding.Freezing the embedding features seems unreasonable.Furthermore, different QoS properties have different value ranges, and fine-tuning will make the final weights differ greatly from the initial weights pretrained in the source domain.Due to the sparsity of target domain labeled data, fine-tuning too much would incur severe overfitting problem.
In this paper we proposed a novel neural style collaborative filtering method, DTCF (Deep Transfer Collaborative Filtering).We can first train the model using the QoS evaluation data in the source domain and then adapt the model in the target domain with different QoS property.The core idea is that we only use the weights of first several layers to initialize the same layers of the model in the target domain, and randomly initialize the remaining layers.To control the degree of fine-tuning, we integrate the maximum mean discrepancy (MMD) measurement into the loss function [20][21][22].The main contributions of our work are as follows: (i) We propose a novel neural collaborative filtering model for QoS prediction using transfer learning technology.
(ii) We provide a novel interaction layer to represent the relationship between latent embedding factors of the nodes.
(iii) We adopt partial fine-tuning and MMD measurement to train the target domain model to implement domain adapting.
The remainder of this paper is organized as follows: We introduce the related work in Section 2. Section 3 presents the design details of our method.Section 4 describes our experiments and Section 5 concludes this paper.

Related Work
Distributed user-generated videos delivery poses a new challenge to large-scale streaming systems.To stream live videos generated by users, many existing video streaming systems rely on a centralized network architecture [23][24][25].Even these streaming systems use Content Delivery Network (CDN) for video delivery, such a kind of solution is not cost-effective [26][27][28].The unit price of content delivery over the Internet has dramatically decreased in recent years.However, there are higher requirements in terms of resolution, frame rate, or bitrate than before.Therefore, the amount of bandwidth consumed per user has grown at a faster rate.To reduce the bandwidth or the costs and improve the user experience, the P2P architectures can be adopted instead.
Collaborative filtering is a rational QoS prediction technology to select neighbors for each client in the P2P video streaming network [29][30][31].To select the best neighbors with high delivery quality for the clients, CF should predict the QoS values between the clients and then select the top  best neighbors in terms of the QoS values.Each client only knows partial information about the QoS values for all the nodes in the network.Memory-based CF methods are some kinds of generalized k-nearest-neighbors (KNN) algorithms [32,33], which have two types of models: user-based and itembased.Model-based CF methods are more popular, which act like generalized regression or classification algorithms, but they deal with abstract features not concrete or raw features.Among many model-based CF methods, matrix factorization has become the most popular technology to handle such kind of issues [34][35][36][37][38][39][40].Probabilistic Matrix Factorization (PMF) model considers that the QoS values obey a normal Gaussian distribution, and the latent factors should be learned from zero-mean Gaussian distribution [41].Nonnegative matrix factorization (NMF) can learn the nonnegative latent factors for the users or items, but it usually deals with the implicit feedback [42][43][44].
However, even if matrix factorization CF algorithms have obtained remarkable success, they have difficulty in dealing with cross-domain learning tasks if the output values of the source and target domain have different ranges.Deep neural networks can easily learn general and transferable features.More and more cross-domain applications adopt deep learning technologies and have yielded remarkable performance [45][46][47].However, the exploration of deep neural networks on recommender systems or QoS prediction has received relatively less attraction.Recently, some studies have proposed some deep learning-based collaborative filtering models.Two impressive technologies are Google's Wide & Deep [48] and Microsoft's Deep Crossing [49].The input of these models is side information, not the interaction between the users and items.Neural Collaborative Filtering (NCF) models are designed purely for user and item interactions [50].However, none of them are designed for cross-domain QoS prediction.
International Journal of Digital Multimedia Broadcasting 3

Model Architecture Overview.
We propose a novel neural architecture, outlined in Figure 1.The source domain and the target domain share the same network architecture.The input of the model is the identifier number of the nodes.For example, if size of nodes in the P2P network is , the ID of each node is an integer number from 1 to .The output of the mode is the QoS value that the node   evaluates on the node   .
Since we do not use any concrete feature for each node, we need to learn abstract features for them.Here, we use embedding layer to learn a continuous latent vector/factor u for each node.The details of designing embedding layers are described in Section 3.1.
If we get two latent vectors for   and   , u  and u  , one might expect that we should concatenate these two vectors and then use affine function T + b to transform the latent vectors into the input of the other hidden layer above.However, there is no interactive action between the latent factors, but only weighted summation of elements of the vectors.Some studies use the dot product of the vectors to represent the interaction, which is described as follows.
Unfortunately, it is too simple to completely represent the complex interaction between nodes.In this paper, we propose a novel interaction layer to tackle this problem, which has powerful representation capacity.We will give the design details in Section 3.2.
Above the interaction layer, we use ReLU as the hidden layer.We might need multiple ReLU layers.The ReLU activation function is as follows.
Finally, we use a fully connected layer to generate the output.When training the model in the source domain, we use the regression loss.We then use the all the layers of the pretrained model but the last FC layer to construct the model for target domain.The weights of these layers are kept as the initialized weights of the target domain model, but the final FC layer is initialized randomly.To avoid the overadaptation problem, we use both the domain loss and regression loss to train the target domain model.We will describe how to design the domain loss in Section 3.3.

Embedding Layer.
Since we can assign a unique integer number as the identifier for each node in the network, we can use a one-hot vector to represent the identifier.If we have at most  nodes in the network, the th nodes can be expressed as follows.
Our embedding layer is defined as follows: where W is a  ×  matrix.Expanding the formulas Wk  , we can see the following.
Therefore, u  is the th column of matrix W. Since the node identifier number is transformed to a one-hot vector, the result of matrix multiplication is exactly a specific latent vector for each node.This weight matrix is jointly trained with the other parameters of the whole network.

Interaction Layer.
There are two inputs of the interaction layer, u  and u  .Suppose any single vector is a column vector, and concatenating the two inputs will get a longer vector.This concatenation vector will be transformed to another vector, encoding interactive information between these two inputs.The transformation process is outlined in Figure 2.
Suppose the output of interaction layer is a vector h, the length of which is .The th element of the vector h is defined as follows.
If the length of u  is , W  is a 2×2 square matrix.ℎ  is a scalar, the value of which is determined by the matrix W  and the bias   .If the length of h is , we need  weight matrices and biases.ℎ  include all the possible interaction relationships between u  and u  .Denote

International Journal of Digital Multimedia Broadcasting
If the function class F is too large, it is not practical to work with this rich function class in the finite sample setting.A rational choice of the function class is a universal reproducing kernel Hilbert space H, named universal RKHS.Therefore, we have that () = ⟨, ()⟩ H , where  ∈ H.The kernel function (, ) is equal to ⟨(), ()⟩.

MMD [F, 𝑝, 𝑞]
Similarly, the empirical estimate can be defined now as follows.

MMD [F, S, T]
In this paper, we use the empirical estimate of MMD 2 as the domain loss.What we need to do is to select suitable universal kernel function.Here, we adopt Gaussian kernel function, which is defined as follows.L

𝑘 (𝑥, 𝑦) = exp (− 󵄩 󵄩 󵄩 󵄩 𝑥 − 𝑦 󵄩 󵄩 󵄩 󵄩
However, the loss function of the model in the target domain is defined as where To optimize our model, we need to compute the gradient of each weight.For any weigh related to both of the regression and domain loss, its gradient is computed as follows.

𝜕MMD [F, S, T] 𝜕𝑤
Note that h   is not used for computing gradients, because we only train the target domain model after pretraining in the source domain.The training process is described as follows: (i) We first train the model of the source domain using the loss function L(M  ).The gradient of each weigh is computed according to formula (13).
(ii) After training, we use the weights of this model to initialize the model in the target domain except the weights of the last FC layer.The last FC layer of the model of the target domain is initialized randomly.
International Journal of Digital Multimedia Broadcasting (iii) While training the model of the target domain, we use the loss function L(M  ).(iv) For each training iteration, we randomly select examples in the dataset, and compute the gradient according to formulas ( 14) and ( 15).(v) We use ADAM (Adaptive Moment Estimation) as the optimizer.

Experimental Results
4.1.Dataset and Evaluation Metrics.We conduct our experiments on a publicly large accessible dataset, WS-DREAM dataset#1, obtained from 339 hosts doing QoS evaluation on the other 5825 hosts.There are two types of QoS properties in this dataset: response time and throughput.Here, we use the response time as the source domain, and the throughput as the target domain.
For the source domain, we randomly extract 30% (density) of the data as the source training set.For the target domain, we construct 5 different training sets with different density of 0.5%, 1%, 1.5%, 2%, 2.5%, and 3%.Consequently, the remaining data is the test set.
We adopt a common evaluation metric: Mean Absolute Error (MAE), which is widely employed to measure the QoS prediction quality.The results are reported in Figures 3 and 4. We can make the following observations: (iii) DTCF model has more weights that need to be trained than the other models, but it gets the best performance, which indicates that the relationship between nodes is very complex, and shallow models cannot capture these structures.

𝑀𝐴𝐸
Although shallow models are not easily overfitting when the target domain training dataset is extremely sparse, they cannot transfer rich information from the source domain.The deep models might easily incur overfitting problem, but they can learn common latent features from the source domain.To balance this dilemma, we need to control the degree of fine-tuning the deep model.This experiment shows that MMD domain loss is an efficient way of controlling the adapting degree.

Impact of the Network Depth.
The network depth usually has important impact on the prediction performance.Here, the number of neurons of each ReLU is set to 128, and we add the number of ReLU layers from 1 to 6 to see how the MAE values change.The experimental result is outlined in Figure 5, from which we can see the following: (i) Adding more ReLU layers can get better prediction performance, but when the depth exceeds a limited value, the performance starts to become worse.
(ii) Although adding more ReLU layers can improve the performance, it seems that enlarging the size of the training data would be more helpful.
(iii) Sometimes, adding more layers would not improve the performance anymore, but it also does not get worse prediction performance.This indicates that deep neural network has some kind of regularization property.
Actually, if the training dataset is very large, adding more layers usually does not incur overfitting problems, but for the  cross-domain learning, the target domain has very little data, so the network depth needs control.

Impact of the Gaussian Kernel Bandwidth.
Another hyperparameter that we need to determine is the Gaussian  kernel bandwidth.By default, it is set to the median pairwise distance on the source training data.We scale the default value from 0.25 to 2.0, and the experimental result is outlined in Figure 6.
International Journal of Digital Multimedia Broadcasting (i) Obviously, the default value is a rational choice, and scaling too small or too large would get worse prediction performance.
(ii) If the bandwidth is too large, the kernel will be approximately equal to 1, and the nodes would look the same.We cannot propose personal recommendation for them.
(iii) If the bandwidth is too small, the kernel will be approximately equal to 0, and the nodes cannot find similar neighbors to follow their past experiences.

Conclusion
Selecting neighbors in terms of the QoS is an effective way of providing high quality contents in video streaming P2P networks.Due to the heterogeneous network conditions, the QoS between any pairs of nodes is different.However, evaluating the QoS of all the nodes for each user is resourceconsuming.An attractive way is to adopt collaborative filtering technologies, which use only a small amount of past usage experience.Unfortunately, the video content providers might often choose different QoS properties to select neighbors.Traditional CF methods cannot solve the cross-domain QoS prediction problem.This paper proposed a novel neural style CF method based on transfer learning.We first outlined our model architecture and then introduced the details of important parts of this model.To avoid the overadaptation problem, we combined domain loss and prediction loss together to train the model of the target domain.We adopted MMD distance as our domain loss, and we also provide its principle and how to compute the gradient.Finally, we conducted our experiments on a real-world public dataset.The experimental results show that our DTCF model can outperform the other models for cross-domain QoS prediction.

2 2𝛿 2 ) ( 11 ) 3 . 5 .
Algorithm.The total loss of target domain includes regression loss and MMD loss.We use the minibatch to train the model.Only a small group of examples are used to compute the loss per training iteration.Denote the set of the minibatch examples in the source domain M  and the set of the minibatch examples in the target domain M  .The loss function of the model in the source domain is defined as follows.

Figure 3 :
Figure 3: MAE with respect to density.

Figure 6 :
Figure 6: MAE with respect to Gaussian kernel bandwidth scale.
For the cross-domain QoS prediction in the video streaming P2P network, we are given a source domain D  = {⟨   ,    ,    ⟩}  ̸ = with   examples, which is characterized by the probability distribution  and a target domain D  = {⟨   ,    ,    ⟩}  ̸ = with   examples which is characterized by the probability .Usually the size of examples in the target domain is extremely small,   ≻   .Our work aims to build a deep neural network to learn transferable features that bridge these two domains' discrepancy.
[22]we can see that ℎ  = ∑    U  U  +   , where    is the element at the th row and the th column in the matrix W  3.4.Domain Loss.The output of the last ReLU layer of the model in the source domain is denoted as h  , and the output of the last ReLU layer of the model in the target domain is denoted as h  .If we want to avoid the overadaptation problem, one possible way is to minimize the differences between the distributions  = Pr(   ,    , h  (   ,    )) and  = Pr(   ,    , h  (   ,    )), where    =    and    =    .Let (X, ) be a metric space, and ⟨   ,    , h  (   ,    )⟩ ∈ X, ⟨   ,    , h  (   ,    )⟩ ∈ X.Let F be a class of functions : X → R, and the Maximum Mean Discrepancy (MMD) is as[22] s  ,t  ∈S  ∩T  ([ (s  )] − [ (t  )]) = ∑ (,, , )∈Q        , − r,