SVD-CNN: A Convolutional Neural Network Model with Orthogonal Constraints Based on SVD for Context-Aware Citation Recommendation

Context-aware citation recommendation aims to automatically predict suitable citations for a given citation context, which is of great help to researchers writing scientific papers. In existing neural network-based approaches, overcorrelation in the weight matrix distorts semantic similarity and is a difficult problem to solve. In this paper, we propose a novel context-aware citation recommendation approach that substantially improves the orthogonality of the weight matrix and explores more accurate citation patterns. We quantitatively show that the various reference patterns in a paper have interactional features that can significantly affect link prediction. We conduct experiments on the CiteSeer dataset. The results show that our model is superior to baseline models in all metrics.


Introduction
Citation recommendation, which helps researchers quickly find the appropriate relevant literature, is a rapidly developing research area [1]. Within this area, context-aware citation recommendation is a particular type that predicts citations for a given citation context [2]. The citation context is usually a few sentences before and after the placeholder, such as "[]". The key problem for context-aware citation recommendation is how to measure the similarity between the citation context and a specific scientific paper.
Similar to other NLP tasks (e.g., information retrieval (IR) and text mining), the simplest solution for context-aware citation recommendation calculates the relevance score between a citation context and candidate papers via Euclidean distance [3] and then selects the salient citations. However, simple text similarity is obviously too coarse to be a good measurement. In recent years, neural network models have been widely used to recommend documents due to their efficiency and effectiveness [4][5][6][7]. Neural network models can be regarded as better solutions than traditional machine learning methods because they simplify feature engineering and can handle large-scale data. However, the weight vectors in existing neural network-based models are usually strongly correlated. In fact, a critical assumption of using similarity measurements, such as Euclidean distance or cosine distance, is that the entries in the feature vectors should be as independent as possible [8]. When the weight vectors are overcorrelated, some entries of the descriptor will dominate the measurement and cause poor ranking results. The above problems seriously affect the performance of citation recommendation because citing activity exhibits strong orthogonality. Assume there are three types of citations in a paper: "field-reference" (red), "method-reference" (purple), and "math-reference" (blue). "Field-reference" usually appears in the introduction and cites scientific articles that use the same techniques in other research fields. "Method-reference" usually appears in related work and cites scientific articles solving the same task. "Math-reference" usually appears in the main part of the paper describing the researcher's method in detail, and its citations are more related to mathematical theorems. It is obvious that these three types of citations have strong orthogonality.
In the neural network model, these three citation types are usually mapped into a matrix and can be seen as base vectors for the inputs. As shown in Figure 1, the vectors in the mapping matrix learned by traditional neural network models are not orthogonal. When a sample is mapped by w_1, w_2, and w_3, w_1 and w_3 will dominate the output and consequently yield low discriminative ability. A more satisfactory w_2' (yellow) imposes orthogonality. To address the aforementioned problems, we propose a neural network model with orthogonal regularization for context-aware citation recommendation. Our model uses a CNN to extract the semantic features of the citation context and candidate papers. We then add an orthogonal constraint based on SVD to weaken the correlation of the weight vectors in the FC layer, which allows the model to learn well-interpretable features for citation contexts and papers. To the best of our knowledge, this is the first work that addresses context-aware citation recommendation with a CNN and orthogonal constraint framework. Experimental results show that our model significantly outperforms the baseline methods.

Citation Recommendation.
A variety of citation recommendation approaches have been proposed in the literature, including text similarity-based [9,10], topic model-based [11,12], probabilistic model-based [13], translation model-based [7], and collaborative filtering-based [14] methods. Sun et al. [15] proposed a method for recommending appropriate papers to academic reviewers by using a similarity-based algorithm. Their method builds preference vectors for reviewers based on publication history and calculates the similarity between the preference vector and candidate document vectors. The literature with high similarity is recommended to the corresponding reviewers. Shaparenko and Joachims [16] considered the relevance of the citation context and the paper content and applied a language model to the recommendation task. Strohman et al. [17] showed that using text similarity alone was not ideal for recommending citations, because scholars tend to coin new terms to describe their own achievements, while two scholars who study the same topic may use different expressions for the same concept and method. To address this problem, Strohman et al. [17] regarded each document as a node in a directed graph to perform citation recommendation. They believe that a similarity measurement with reference information can reflect the citation behavior of a node more authentically. Livne et al. [18] proposed a citation recommendation method by coupling the enriched citation context of the literature and adopted various techniques, including machine learning, when making recommendations. Some works addressed the language gap between cited papers and citation contexts and attempted to use translation models or distributed semantic representations. Lu et al. [19] assumed that the languages used in the citation contexts and in the cited papers were different and used a translation model to solve this problem. He et al.
[3] combined a language model, topic model, and feature model to find the appropriate citation context. Huang et al. [20] treated the appearance of cited papers as a particular language and represented the cited papers with unique IDs regarded as new "words." The probability of citing a paper given a citation context is then directly estimated by a translation model. Tang et al. [21] proposed a joint embedding model to learn a low-dimensional embedding space for both contexts and citations.
In recent years, neural networks have shown better performance in many fields, and some researchers have attempted to recommend citations with them. Huang et al. [4] learned a distributed word representation for the citation context and an associated document embedding via a feedforward neural network and then estimated the probability of citing a paper given a citation context. Tan et al. [5] proposed a neural network method based on LSTM to solve quote recommendation tasks. They focused on the characteristics of quotes and trained neural networks to bridge the language gap. Their neural network model learned semantic representations of arbitrary-length texts from a large corpus.

Orthogonal Constraint in Deep Learning.
One of the greatest advantages of orthogonal matrices is that they preserve the norm of a vector under multiplication.
This property is useful in gradient backpropagation, especially for dealing with gradient explosion and gradient vanishing problems. Orthogonal regularization is widely used in many fields. Brock et al. [22] used orthogonal regularization to improve the generalization performance of image generation and editing tasks with generative adversarial networks (GANs) [23]. They further expanded their work into BigGAN [24]. Their results showed that, with orthogonal regularization applied, the generator allows fine-tuning the tradeoff between fidelity and diversity of samples by truncating the latent space, which lets the model achieve the best performance in class-conditional image synthesis. Another advantage of orthogonal matrices is that they benefit deep representation learning. If the weight vectors of the fully connected layer in a convolutional neural network are highly correlated, the entries of each fully connected descriptor will also be highly correlated, which greatly reduces retrieval performance. Sun et al. [25] proposed SVD-Net to show that enforcing orthogonality on the feature weights of the FC layer can strengthen the orthogonal constraint of the network and improve accuracy. Zheng et al. [26] reported that regularization is an efficient method for improving the generalization ability of deep CNNs because it makes it possible to train more complex models while maintaining lower overfitting, and they proposed a method for optimizing the feature boundary of a deep CNN through a two-stage training step to reduce overfitting. However, the mixed features learned by a CNN potentially reduce the robustness of network models for identification or classification. To address this problem, Wang et al. [27] decomposed deep face features into two orthogonal components representing age-related and identity-related features to learn age-invariant deep face features.
In the above model, age-invariant deep features can be effectively obtained to improve AIFR performance. Chen et al. [28] proposed a group orthogonal convolutional neural network (GoCNN) model based on the idea of learning different groups of convolutional functions that are "orthogonal" to those in other groups, i.e., with no significant correlation among the produced features. Optimizing orthogonality among convolutional functions reduces the redundancy and increases the diversity within the architecture. Moreover, it can also obtain a single CNN model with sufficient inherent diversity, such that the model learns more diverse representations and has stronger generalization ability than vanilla CNNs.

Problem Formulation.
The context-aware citation recommendation task is defined as a matching task between a citation context and candidate papers. The main architecture of our model is shown in Figure 2. Our model is a convolutional neural network with two inputs and orthogonal constraints. It consists of the following main steps: (1) we adopt word2vec to obtain the raw input vectors and then use CNNs to extract multiple-granularity semantic features; (2) the multiple-granularity semantic features are then orthogonalized by an SVD-FC layer; (3) we use fully connected layers to obtain the final vector representation, and a logistic function or SVM is used to obtain the recommendation result.

Input Layer.
Word2vec [29] is used to embed the input of our model. Each word is represented as a d_0-dimensional precomputed vector, where d_0 = 300. As a result, each sentence is represented as a feature matrix of dimension d_0 × s. Through this layer, we obtain the raw representations of the citation context c and candidate document d.
We also calculate the weight of common words according to the inputs. Then, we can obtain the basic input feature TF-IDF(c, d) for our model, which is the product of TF(w_c, d) and IDF(w_c) and reflects how important a word in citation context c is for a candidate document d in the corpus [30]. Here, w_c is a word in citation context c. These two variables are calculated as follows:

TF(w_c, d) = count(w_c, d) / top(w*, d),

IDF(w_c) = log(N / docs(w_c, D)),

where count(w_c, d) is the number of times word w_c appears in document d, top(w*, d) is the number of occurrences of the word w* that appears most frequently in candidate document d, docs(w_c, D) is the number of documents containing the word w_c among all candidate citations D, and N is the total number of candidate citations.
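The TF-IDF feature above can be sketched as follows. This is a minimal illustration, not the paper's code; the log base and the assumption that the word occurs in at least one candidate document are ours.

```python
# Sketch of the TF-IDF input feature: TF normalized by the most frequent
# word's count, IDF as log(N / docs). Assumes the queried word appears in
# at least one candidate document.
import math
from collections import Counter

def tf(word, doc_tokens):
    """count(w_c, d) divided by top(w*, d), the max word count in d."""
    counts = Counter(doc_tokens)
    return counts[word] / counts.most_common(1)[0][1]

def idf(word, all_docs):
    """log(N / docs(w_c, D)) over the candidate collection."""
    return math.log(len(all_docs) / sum(1 for d in all_docs if word in d))

def tfidf(word, doc_tokens, all_docs):
    return tf(word, doc_tokens) * idf(word, all_docs)
```

A word that is frequent in one candidate document but rare across the collection thus receives a high weight.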

Convolution Layer.
The inputs of the convolution layer are the feature matrices of the citation context c and document d. The process of this layer is demonstrated in Figure 3. We first pad the two inputs to the same length s = max(|c|, |d|) with zero vectors. For every input, let v_1, v_2, ..., v_s be the words in a sentence. We define g_i ∈ R^{w·d_0}, 0 < i < s + w − 1, as the concatenation of v_{i−w}, ..., v_i. Then, this layer generates the feature p_i ∈ R^{d_1} for the phrase v_{i−w}, ..., v_i as follows:

p_i = tanh(W g_i + b),

where W ∈ R^{d_1 × w·d_0} is a convolution kernel and b ∈ R^{d_1} is the bias.
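The wide convolution above can be sketched with toy dimensions. The zero-padding scheme and tanh nonlinearity follow the description; the sizes are illustrative.

```python
# Wide convolution over concatenated word vectors: each window of w
# consecutive (zero-padded) word vectors is flattened into g_i and mapped
# through tanh(W g_i + b), yielding s + w - 1 feature columns.
import numpy as np

rng = np.random.default_rng(0)
s, d0, w, d1 = 6, 4, 3, 5            # sentence length, embed dim, window, kernels
V = rng.normal(size=(s, d0))          # word vectors v_1 .. v_s
Vpad = np.vstack([np.zeros((w - 1, d0)), V, np.zeros((w - 1, d0))])

W = rng.normal(size=(d1, w * d0))     # convolution kernel
b = rng.normal(size=d1)

feats = np.stack([np.tanh(W @ Vpad[i:i + w].ravel() + b)
                  for i in range(s + w - 1)])
print(feats.shape)                    # (s + w - 1, d1) = (8, 5)
```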

Average Pooling Layer.
The pooling layer is usually used for feature compression. In our model, we choose average pooling, because whole sentences or paragraphs can express more meaningful semantics. As shown in Figure 4, we design two pooling layers. The first one is "w-ap," which averages columns over windows of w consecutive columns. After the convolution layer, an s-column feature map is converted into an (s + w − 1)-column feature map; by using "w-ap," the new feature map is recovered to s columns. This architecture facilitates the extraction of more useful abstract features. The second one is "all-ap," which averages over all columns. As shown in Figure 5, "all-ap" generates a representation vector for each feature map. The generated feature combines the information of the whole citation context or cited document. Now, we can obtain the features of the citation context and independent features of the cited document. The next step is to obtain the semantic relationships between the citation context and the candidate paper. We use cosine similarity to measure the semantic relations:

sim_j = cos(C_j, D_j) = (C_j · D_j) / (‖C_j‖ ‖D_j‖),

where C_j and D_j are the distributed representations of the citation context and candidate document after the j-th "all-ap" layer, respectively. A total of ten "all-ap" layers are used in our model; therefore, j ∈ [1, 10]. The benefit is that we can obtain the semantic relation between the citation context and the cited document at multiple granularities. As shown in Figure 6, the final output feature consists of all sim_j and the basic features. Then, it is fed into the SVD-FC layer.
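The two pooling operations and one per-granularity similarity can be sketched as follows (toy feature maps; shapes follow the text, with s = 6 and w = 3):

```python
# "w-ap" averages windows of w consecutive columns, shrinking an
# (s + w - 1)-column map back to s columns; "all-ap" averages all columns
# into one vector, from which a cosine similarity sim_j is computed.
import numpy as np

rng = np.random.default_rng(1)
w = 3
C_map = rng.normal(size=(8, 5))       # (s + w - 1) x d1 map of the context
D_map = rng.normal(size=(8, 5))       # same-shaped map of the candidate doc

def w_ap(fmap, w):
    return np.stack([fmap[i:i + w].mean(axis=0)
                     for i in range(fmap.shape[0] - w + 1)])

def all_ap(fmap):
    return fmap.mean(axis=0)

C_j, D_j = all_ap(w_ap(C_map, w)), all_ap(w_ap(D_map, w))
sim_j = C_j @ D_j / (np.linalg.norm(C_j) * np.linalg.norm(D_j))
print(w_ap(C_map, w).shape)           # (6, 5): recovered to s columns
```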
In most cases, we find that using all outputs of the pooling layers as the input of the SVD-FC layer improves performance. The reason is that features from different layers represent different levels of semantics; neglecting any layer causes information loss.
Next, we use the SVD-FC layer to learn nonlinear combination features of citation relationships. This layer forces the vectors in the feature map to be independent and orthogonal to each other. The added SVD-FC layer can also reduce the negative impact of excessive parameters.

SVD-FC Layer.
In this layer, we use SVD to factorize the weight matrix W (W = USV^T) and replace it with US. Our experimental results show that this replacement has no negative impact on the sample space.
The Euclidean distance between samples can be used to measure whether their feature expression changes in the sample space. Denoting e_m and e_n as the feature maps of two different samples, we obtain two different outputs of the fully connected operation by using the weight matrix W or US as follows:

p = e W, (4)

q = e (US). (5)
As seen in the above equations, q is the orthogonalized output, while p is not. Then, we can state the following theorem.

Theorem 1. p and q in equations (4) and (5) generate the same Euclidean distance for samples e_m and e_n.

Proof. The Euclidean distance L between p_m and p_n is calculated as follows:

L = ‖p_m − p_n‖^2 = ‖(e_m − e_n) U S V^T‖^2. (6)

Since V is an orthogonal matrix, equation (6) is equivalent to

L = ‖(e_m − e_n) U S‖^2 = ‖q_m − q_n‖^2. (7)

It can be seen that ‖p_m − p_n‖^2 = ‖q_m − q_n‖^2. It should be noted that replacing the weight causes no negative impact and no change in discrimination ability over the entire sample space. As shown in Figure 7, we use the SVD of the weight matrix W to map the feature map into an orthogonal linear space. The citation recommendation problem is regarded as a classification task in our model; in the final layer, logistic regression or an SVM handles the binary classification task and predicts the final citation relationship.
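Theorem 1 can be checked numerically with toy matrices; this is a sanity-check sketch, not the paper's implementation:

```python
# Numerical check of Theorem 1: replacing W with US preserves the pairwise
# Euclidean distance between fully connected outputs, because the dropped
# factor V^T is orthogonal and norm-preserving.
import numpy as np

rng = np.random.default_rng(2)
e_m, e_n = rng.normal(size=6), rng.normal(size=6)
W = rng.normal(size=(6, 4))

U, s, Vt = np.linalg.svd(W, full_matrices=False)
US = U @ np.diag(s)

p_dist = np.linalg.norm(e_m @ W - e_n @ W)    # distance under W
q_dist = np.linalg.norm(e_m @ US - e_n @ US)  # distance under US
print(np.isclose(p_dist, q_dist))             # True
```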

Training Details
3.3.1. Embeddings. In our model, words are initialized with 300-dimensional word2vec embeddings that are not updated during training. A single randomly initialized embedding is created for all unknown words by uniform sampling from [−0.01, 0.01]. We employ AdaGrad [31] and L2 regularization. We introduce adversarial training [32] on the embeddings to make the model more robust. This is achieved by replacing the word vector v obtained from the word2vec embedding with a perturbed vector v*:

v* = v + r_adv,

where r_adv is the worst-case perturbation of the word vector. Goodfellow et al. [33] approximated this value by linearizing the loss function log p(y|x, θ) around x, where θ is a constant set to the current parameters of our model; it only participates in the calculation of r_adv and receives no gradient from backpropagation. With the linear approximation and an L2-norm constraint, the adversarial perturbation is

r_adv = −ε g / ‖g‖_2,  g = ∇_x log p(y|x, θ),

where ε is the perturbation magnitude. This perturbation can be easily computed by backpropagation in neural networks.
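The L2-normalized perturbation can be sketched as below. The epsilon value is an illustrative assumption; in practice the gradient g would come from backpropagation through the model.

```python
# Sketch of the adversarial perturbation on an embedding:
# r_adv = -eps * g / ||g||_2, so the perturbation always has norm eps
# regardless of the gradient's magnitude.
import numpy as np

def adversarial_perturbation(grad, eps=0.02):
    return -eps * grad / (np.linalg.norm(grad) + 1e-12)  # small term avoids /0

g = np.array([3.0, 4.0])                  # stand-in gradient, ||g|| = 5
r = adversarial_perturbation(g, eps=0.05)
print(np.linalg.norm(r))                  # 0.05
```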

Layerwise Training.
In our training steps, we define conv-pooling blocks b_t (t ≥ 2), each consisting of a convolution layer and a pooling layer. Our network model is then assembled from the initialization block b_1, which is initialized using word2vec, and (n − 1) conv-pooling blocks.
First, we train the conv-pooling block b_2 after b_1 is trained. On this basis, the next conv-pooling block b_3 is created while keeping the previous blocks fixed. We repeat this procedure until all (n − 1) conv-pooling blocks are trained.
Second, the following semiorthogonal training procedure is used to train the whole network.
Semiorthogonal training (SOT) is crucial for training SVD-CNN and consists of the following three steps: Step 1. Decompose the weight matrix by SVD, i.e., W = USV^T, where W is the weight matrix of the linear layer, U is the left unitary matrix, S is the singular value matrix, and V is the right unitary matrix. We then replace W with US and take all eigenvectors of US(US)^T as weight vectors.
Step 2. The backbone model is fine-tuned while keeping the SVD-FC layer fixed.
Step 3. The model keeps fine-tuning with the SVD-FC layer unfixed.
Step 1 generates orthogonal weights, but the prediction performance cannot be guaranteed. The reason is that excessive orthogonality will overly penalize synonymous sentences, which is clearly inappropriate. Therefore, we introduce Steps 2 and 3 to solve this problem. The weight matrix is defined as W = (w_1, w_2, ..., w_m)^T, and the expected outputs are defined as D = (d_1, d_2, ..., d_l)^T. The error function is defined as

E = (1/2) Σ_{k=1}^{l} (d_k − o_k)^2, (10)

where o_k = f(Σ_{j=0}^{m} w_kj y_j), k = 1, 2, ..., l. Differentiating E with respect to o_k gives

∂E/∂o_k = −(d_k − o_k). (11)

We utilize the gradient descent strategy to find the gradient of the error with respect to the weights. The iterative update of the weights is

Δw_kj = −η ∂E/∂w_kj. (12)

Defining the error signal δ_k^o = −∂E/∂net_k, where net_k = Σ_j w_kj y_j, equation (12) is equivalent to

Δw_kj = η δ_k^o y_j. (13)

According to equation (11),

δ_k^o = (d_k − o_k) f′(net_k). (14)

We use the sigmoid f(x) = 1/(1 + e^{−x}) as the nonlinear function, so equation (13) is equivalent to

Δw_kj = η (d_k − o_k) o_k (1 − o_k) y_j. (15)

In Step 1, the weight matrix W is decomposed by SVD and replaced with US, where U = (q_1, q_2, ..., q_m)^T and S = diag(λ_1, λ_2, ..., λ_m). Since d_k − o_k is given, we define Loss = d_k − o_k, so that equation (15) can be rewritten in terms of the rows q_i of U. The vectors q_i in the left unitary matrix U satisfy q_i · q_j = 0 for i ≠ j, so the model operation is not affected by nonorthogonal eigenvectors q_i. This is the reason why Step 1 excessively penalizes synonymous sentences. However, orthogonality has a positive effect on Δw_kj in Step 2.
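The sigmoid delta-rule update derived above can be traced on toy scalars; the learning rate and values are illustrative assumptions:

```python
# One gradient step of the delta rule for a sigmoid output unit:
# delta = (d - o) * o * (1 - o), then w_kj += eta * delta * y_j.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

y = np.array([1.0, -0.5])             # inputs y_j to the output unit
w = np.array([0.2, 0.4])              # weights w_kj
d, eta = 1.0, 0.1                     # target and (assumed) learning rate

o = sigmoid(w @ y)                    # net input is 0.2 - 0.2 = 0, so o = 0.5
delta = (d - o) * o * (1 - o)         # 0.5 * 0.5 * 0.5 = 0.125
w_new = w + eta * delta * y
print(w_new)                          # [0.2125, 0.39375]
```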
The purpose of SVD is to maintain the orthogonality of the weight vectors in geometric space. When the weight vectors are conditioned by orthogonal regularization, the relevance between weight vectors decreases. We use the following measure of relevance in Step 3. Let W be a weight matrix containing k weight vectors, and let h_ij = w_i · w_j (i, j = 1, ..., k) be the dot product of w_i and w_j. We define S(W) as the correlation measurement over all column vectors of W:

S(W) = Σ_{i=1}^{k} |h_ii| / Σ_{i=1}^{k} Σ_{j=1}^{k} |h_ij|.

When W is an orthogonal matrix, the value of S(W) is 1. When all weight vectors are fully correlated, S(W) reaches its minimum value 1/k. Therefore, the value of S(W) falls into [1/k, 1]; when S(W) is close to 1/k, the weight matrix has high relevance.
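The correlation measure S(W) can be sketched directly from its definition (a reconstruction; the Gram-matrix notation is assumed):

```python
# S(W) = sum of |diagonal| of the Gram matrix W^T W over the sum of all
# |entries|: 1.0 for orthogonal columns, 1/k when all k columns coincide.
import numpy as np

def S(W):
    G = np.abs(W.T @ W)               # |w_i . w_j| for every pair of columns
    return np.trace(G) / G.sum()

print(S(np.eye(4)))                   # 1.0  (orthogonal)
print(S(np.ones((4, 4))))             # 0.25 (= 1/k, fully correlated)
```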

Complexity Analysis.
Assume that the training sample size is |C|, the average number of words in each citation context is |c|, C_l is the number of kernels in the l-th layer, and w is the size of the sliding window. For one convolution layer, the training complexity is O(C_{l−1} · C_l · w · (s − w + 1)). The training complexity of one w-ap layer is O(C_l^2 · w · s), and that of one all-ap layer is O(C_l^2 · (s − w + 1)). Following Van Loan [12], computing the eigenvalues for an SVD decomposition of size K with the Jacobi method takes O(K). Assume that the size of the weight matrix in the SVD-FC layer is K and the number of input channels is C_in. The computational cost of the SVD-FC layer is then O(2K^2 · C_in + K).

Dataset.
We use the CiteSeer dataset [34] to evaluate the performance of our model. The dataset was published by Huang et al. [4]. In this dataset, citation relationships are represented by pairs of citation contexts and the abstracts of cited papers. A citation context includes the sentence where the citation placeholder appears and the sentences before and after it. Within each paper in the corpus, the 50 words before and the 50 words after each citation reference are treated as the corresponding citation context (a discussion on the number of words can be found in [7]). Before word embedding, we also remove stop words from the contexts. To preserve the time-sensitive past/present/future tenses of verbs and the singular/plural styles of named entities, no stemming is performed, but all words are lowercased. The training set contains 3,989,547 pairs of citation contexts and citations, and the test set contains 1,021,685 citation relations.
Following common practice in information retrieval (IR), we employ the following four evaluation metrics to evaluate recommendation results: recall, mean reciprocal rank (MRR), mean average precision (MAP), and normalized discounted cumulative gain (nDCG).

Evaluation Metric.
For each query in the test set, we use the original set of references as the ground truth R_g. Assume that the set of recommended citations is R_r; the correct recommendations are then R_g ∩ R_r. Recall is defined as

Recall = |R_g ∩ R_r| / |R_g|.

In our experiments, the number of recommended citations ranges from 1 to 10. Recall does not reveal the order of the recommended references. To address this problem, we select the following additional metrics.
For a query q, let rank_q be the rank of the first correct recommendation within the list. MRR [35] is defined as

MRR = (1/|Q|) Σ_{q ∈ Q} 1/rank_q,

where Q is the testing set. MRR reveals the average ranking of the first correct recommendation. For each citation placeholder, we search for the papers that may be referenced there, and each retrieval model returns a ranked list of papers. Since there may be one or more references for one citation context, we use mean average precision (MAP) as an evaluation metric:

MAP = (1/|Q|) Σ_{q ∈ Q} (Σ_i P(i) · R(d_i)) / Σ_i R(d_i),

where P(i) is the precision at rank i and R(d_i) is a binary function indicating whether document d_i is relevant. For our problem, the papers cited at the citation placeholder are considered relevant documents.
We use normalized discounted cumulative gain (NDCG) to measure the ranked recommendation list.
The NDCG value of a ranking list at position i is calculated as

NDCG@i = Z_i Σ_{j=1}^{i} (2^{rel(d_j)} − 1) / log_2(j + 1),

where rel(d_j) is the 4-scale relevance of document d_j in the ranked list and Z_i is a normalizer ensuring that the perfect ranking at position i has an NDCG value of 1. We use the average cocited probability [2] of 〈d_i, d*〉 to weigh the citation relevance score of d_i to d* (an original citation of the query). We report the average NDCG score over all testing documents.
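Two of the metrics above, Recall@k and MRR, can be sketched on toy rankings (illustrative document IDs, not dataset entries):

```python
# Recall@k: fraction of ground-truth citations found in the top k.
# MRR: mean reciprocal rank of the first correct recommendation per query.
def recall_at_k(recommended, ground_truth, k):
    return len(set(recommended[:k]) & set(ground_truth)) / len(ground_truth)

def mrr(queries):
    """queries: list of (ranked_list, ground_truth_set) pairs."""
    total = 0.0
    for ranked, truth in queries:
        rank = next((i + 1 for i, d in enumerate(ranked) if d in truth), None)
        total += 1.0 / rank if rank else 0.0
    return total / len(queries)

qs = [(["d3", "d1", "d7"], {"d1"}),   # first hit at rank 2 -> 1/2
      (["d2", "d5"], {"d2"})]         # first hit at rank 1 -> 1
print(recall_at_k(["d3", "d1", "d7"], {"d1"}, 2), mrr(qs))   # 1.0 0.75
```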

Baseline Comparison.
We choose the following methods for comparison:
(i) Cite-PLSA-LDA (CP-LDA) [36]. We use the original implementation provided by the author. The number of topics is set to 60.
(ii) Restricted Boltzmann Machine (RBM-CS) [37]. We train two layers of RBM-CS according to the authors' suggestion and set the hidden layer size to 600.
(iii) Word2vec Model (W2V) [29]. We use the word2vec model to learn word and document representations. The cited document is treated as a "word" (a document uses a unique marker when it is cited by different papers). The dimensions of the word and document vectors are set to n = 300.
(iv) Neural Probabilistic Model (NPM) [4]. We follow the original implementation. The dimensions of the word and document representation vectors are set to n = 600. For negative sampling, we set the number of negative samples k = 10, where k is the number of noise words in the citation context. For noise contrastive estimation, we set the number of noise samples k = 1000.
(v) Neural Citation Network (NCN) [7]. In NCN, the gradient clipping is 5, the dropout probability is 0.2, and there are 2 recurrent layers. The region sizes for the encoder are set to 4, 4, and 5, and the region sizes for the author network are set to 1 and 2.
Figures 8 and 9 show the performance of each method on the CiteSeer dataset. It is obvious that the SVD-CNN model leads the performance in most cases. More detailed analyses are given as follows.
First, we compare CP-LDA, RBM-CS, W2V, and SVD-CNN. Our SVD-CNN significantly exceeds the other models in all metrics. The success of our model is ascribed to the content and correlation modeling of our network. Due to the lack of citation context information, W2V is clearly worse than the other methods in all metrics. CP-LDA works much better than W2V, which indicates that link information is very important for finding relevant papers. RBM-CS shows a clear performance gain over W2V because RBM-CS automatically discovers topical aspects of each paper based on the citation context. However, the vector representations of citation context in RBM-CS are extracted by traditional word vector representations, which fully neglect semantic relations between the cited document and the citation context and thus may be limited by vocabulary.
Second, we compare the performance among NPM, NCN, and SVD-CNN. It is not surprising that NPM and NCN achieve worse performance than SVD-CNN since their distributed representation of words and documents relies solely on deep learning without restraint. NPM recommends citations based on trained distributed representations. NCN further enhances the performance by considering author information and using a more sophisticated neural network architecture. However, the CNN in NCN does not have orthogonal constraints, which makes it difficult to capture different types of citing activities. In addition, NCN only utilizes the title of the cited paper for a decoder, which is apparently not sufficient for learning good embedding.

The Influence of Reference Pattern Interactional Features on Link Prediction.
According to the chapter position of the citation context in the article, we divide the training set into three parts: the introduction part contains 1,307,885 pairs of citation contexts and citations, the related work part contains 1,599,897 pairs, and the main part contains 1,024,783 pairs. Furthermore, these datasets form three mixed datasets. In this part of the experiment, we use the CNN model without SVD as the baseline. These datasets are split into training and test sets at a ratio of 3:1. Tables 1 and 2 show the results on the abovementioned datasets.
From the results, we obtain the following observations. First, both CNN and SVD-CNN perform better on the unmixed datasets than on the mixed datasets across the different evaluation metrics, which shows that the diversity of reference patterns increases the difficulty of the citation recommendation task.
Second, in Tables 1 and 2, we observe that our model is particularly good at resolving the difficulties in mixed datasets, which come from the diversity of reference patterns.
To better explore why mixed datasets are more complex than unmixed datasets, in Figure 10, we show the change in S(W) during the training process of SVD-CNN among various datasets.
As shown in Figure 10, the increase in S(W) on the mixed datasets indicates that SVD-CNN is good at decorrelation. We can also see in Tables 1 and 2 that the CNN model performs well on the unmixed datasets while achieving poor performance on the mixed datasets. However, SVD-CNN achieves almost the same performance on both types of datasets. This proves that the correlation arising from various reference patterns can significantly affect link prediction.
The reason why the change in S(W) is small on the unmixed datasets is that the reference patterns of unmixed datasets have similar features, which belong to the same category. As a result, the orthogonality of the weight matrix is hard to improve on unmixed datasets. Nevertheless, a citation recommendation algorithm performs well on the unmixed datasets because their complexity is low.
Although the mixed datasets are more complicated than the unmixed ones, SVD-CNN still performs well on them. This indicates that SVD-CNN reduces the negative impact of the correlation of reference patterns and that our approach is more suitable for complex scenarios.

Comparison with Other Types of Decorrelation.
In addition to SVD, there are other methods for decorrelating the feature matrix. However, these methods cannot maintain the discriminative ability of the CNN model. To illustrate this, we compare SVD with several variants:
(1) Using the originally learned W
(2) Replacing W with US
(3) Replacing W with U
(4) Replacing W with UV^T
(5) Replacing W with QD, where D is the diagonal matrix extracted from the upper triangular matrix in QR decomposition
(6) Replacing W with W_PCA, where W_PCA is the weight matrix W after PCA dimension reduction
After training converges, the different orthogonal matrices are used to replace the weight matrix W. We define T-cost as the time cost of replacing the weight, i.e., the proportion of the added time to the original time. As shown in Table 3, all types of decorrelation except W → US and W → W_PCA degrade performance. However, the time cost of W → W_PCA is higher than that of W → US.
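The distance-preservation contrast between two of the variants above can be checked on a toy matrix. This is our own sanity check, not the paper's experiment; the matrix is constructed with spectrum {2, 2, 2, 2, 2} so the outcome is deterministic.

```python
# Toy comparison of replacement variants: W -> US keeps pairwise output
# distances, while W -> U rescales them (here by the singular value 2),
# changing the discriminative geometry.
import numpy as np

rng = np.random.default_rng(3)
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))
W = 2.0 * Q                           # all singular values equal 2
U, s, Vt = np.linalg.svd(W)
e1, e2 = rng.normal(size=5), rng.normal(size=5)

def dist(M):
    return np.linalg.norm(e1 @ M - e2 @ M)

print(np.isclose(dist(W), dist(U @ np.diag(s))))   # True
print(np.isclose(dist(W), dist(U)))                # False (off by factor 2)
```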
4.6. Ablation Study. In our method, there are two essential parameters: sot, the number of SOT iterations, and the dimension parameter d_0. In this section, we conduct an ablation study on these parameters.
We first evaluate the effectiveness of sot by empirically fixing d_0 = 300. When sot = 0, the weight matrix in the FC layer is highly correlated, and S(W) has its lowest value. The recommendation performance then increases as sot grows, which indicates that reducing the correlation of the weight matrix in the FC layer is critical for improving performance. When sot = 10, our model achieves the best performance.
In our model, d_0 is the dimension of the citation context and cited document representations. Figure 12 shows how the performance of SVD-CNN varies with d_0 for the same sot. When d_0 is small, the information content of the citation context is very limited, producing worse performance. The recommendation performance increases until d_0 reaches 300. It should be noted that although a larger d_0 is better, it also significantly increases the training time. Therefore, we choose d_0 = 300.

Conclusion and Future Works
We propose a convolutional neural network model with orthogonal regularization to solve the context-aware citation recommendation task. In our model, orthogonal regularization is achieved by using SVD to factorize the weight matrix of the FC layer, which makes each vector in the feature map essentially more independent. The orthogonal regularization also enhances the feature extraction ability of the CNN.
The experimental results show that SVD-CNN outperforms the compared methods on CiteSeer. Our model only takes the abstract as the content of the cited paper. In the future, we will explore the performance of our model using the full text of papers.

Data Availability
Previously reported CiteSeer data were used to support this study and are available at https://psu.app.box.com/v/refseer. These prior datasets are cited at relevant places within the text as reference [4].