Article

Detecting Pseudo-Manipulated Citations in Scientific Literature through Perturbations of the Citation Graph

Renata Avros, Saar Keshet, Dvora Toledano Kitai, Evgeny Vexler and Zeev Volkovich *
Software Engineering Department, Braude College of Engineering, Karmiel 21982, Israel
* Author to whom correspondence should be addressed.
Mathematics 2023, 11(18), 3820; https://doi.org/10.3390/math11183820
Submission received: 1 August 2023 / Revised: 26 August 2023 / Accepted: 4 September 2023 / Published: 6 September 2023
(This article belongs to the Section Mathematics and Computer Science)

Abstract
Ensuring the integrity of scientific literature is essential for advancing knowledge and research. However, the credibility and trustworthiness of scholarly publications are compromised by manipulated citations. Traditional methods, such as manual inspection and basic statistical analyses, are limited in detecting intricate patterns and subtle manipulations of citations. In recent years, network-based approaches have emerged as promising techniques for identifying and understanding citation manipulation. This study introduces a novel method for identifying potential citation manipulation in academic papers using perturbations of a deep embedding model. The key idea is that the meaningful connections represented by citations can be reconstructed within a network by exploring somewhat longer alternative paths. These indirect pathways enable the recovery of reliable citations while estimating their trustworthiness. The investigation takes a comprehensive approach to link prediction, leveraging the consistent behavior of prominent connections when exposed to network perturbations. Through numerical experiments, the method demonstrates a high capability to identify reliable citations as the core of the analyzed data and to raise suspicions about unreliable references that may have been manipulated. This research thus presents a refined method for tackling the urgent problem of citation manipulation in academic papers, harnessing statistical sampling and graph-embedding techniques to evaluate the credibility of scholarly publications through an assessment of the whole citation graph.

1. Introduction

The integrity and dependability of scientific literature are essential for the advancement of knowledge and research. Against this backdrop, manipulated citations present a significant obstacle to the credibility and trustworthiness of scholarly publications. Manipulated citations involve intentional actions by authors to artificially inflate the quantity or influence of their papers by including unnecessary or irrelevant citations. This practice undermines the accuracy, impartiality, and scientific validity of scholarly discussion. Citation manipulation aimed at increasing researchers’ citation counts can also occur when editors or peer reviewers of a manuscript request the inclusion of unnecessary and unrelated references, a practice known as “coercive citation.” Although researchers recognize that citations are of unequal value and have attempted to assign them varying weights, most studies have focused on differentiating and weighting a single citation type. In fact, as demonstrated by Prabha [1], more than two-thirds of the references in a paper are deemed unnecessary, providing further evidence of the existence of dubious citations.
Numerous surveys have assessed various aspects of reference-list manipulation [2,3,4]. Most authors acknowledge that such citations are anomalous compared to typical, regular references.
While traditional methods such as manual inspection and basic statistical analyses have been employed, they are limited in capturing intricate patterns and subtle manipulations. Network-based approaches have emerged in recent years as promising techniques for identifying and understanding citation manipulation. Because citation graph data are intricate, characterized by irregular structures and relational dependencies, conventional anomaly detection techniques struggle to address this issue effectively. In contrast, anomaly detection methods that leverage graph learning can simultaneously preserve both node attributes and network structures throughout the learning process. By exploiting the structure and connections within the citation network, network-based approaches can unveil hidden relationships and abnormalities that indicate potential citation manipulation. These methods go beyond individual papers and examine the broader network dynamics, allowing for a more comprehensive understanding of manipulation patterns. Research of this kind is presented in [5,6,7,8,9].
The article [10] can be highlighted in this connection. It introduces GLAD (Graph Learning for Anomaly Detection), a deep graph learning model designed to identify anomalies in citation networks. GLAD integrates semantic text mining into network representation learning by incorporating both node attributes (related to the content of the papers) and link attributes (capturing the citation relationships) using a graph neural network. This combined approach enhances the detection and classification of anomalous citations within the network.
Indeed, the availability and quality of textual information can vary significantly across scenarios. In some cases, textual information may be absent or only poorly represented, for example, by a simple bag-of-words model that captures only coarse content. In such situations, exploring alternative methods to infer potential citations becomes essential. One natural approach is to leverage the graph’s internal connection information, tapping into the inherent structure and relationships within the graph itself to make informed decisions about potential citations. By incorporating network connection information, such an approach can complement or even replace traditional text-based methods for citation recognition, offering a valuable alternative when textual information is inadequate or absent.
This article presents a novel method of this kind, aimed at identifying possible manipulation of citations in academic papers. The central idea is that citations represent meaningful connections between studies and can be reconstructed within a network by following slightly longer alternative paths. These indirect pathways can therefore be employed to recover the original, authentic citations while estimating their reliability.
The investigation is approached from the general perspective of link prediction, building on the natural proposition that prominent connections behave stably under network perturbations. In other words, genuine relationships are expected to survive distortions that remove a subset of connections and then reconstruct them using a link prediction method.
The current paper applies embedding-based link prediction in a graph, i.e., forecasting the presence or absence of edges (connections) between nodes.
Graph-embedding methods transform nodes and edges into vectors, or embeddings, that encode important information about the graph’s structure and semantics. The basic idea is to map each node of the graph to a continuous vector representation in a relatively low-dimensional space. This representation captures the relational information between nodes and can be used to infer potential links or relationships missing from the original graph. By learning meaningful embeddings, link prediction algorithms can estimate the likelihood of a potential edge between two nodes based on the proximity of their embeddings.
The Node2Vec approach [11] is a commonly used algorithm for learning continuous representations of nodes by capturing the neighborhood structure of the graph. It explores the notion of “node neighborhoods” by defining a random walk strategy to sample node sequences from the graph. These sequences are then used to train a Skip-gram Word2Vec model [12], yielding the node embeddings.
After the node embeddings are learned with Node2Vec, they can be used for the mentioned link prediction by measuring the similarity between the embeddings of node pairs. In the studied case, a fixed fraction of edges is deliberately omitted and then recovered based on the embeddings obtained from the reduced networks. The stability and effectiveness of this approach are demonstrated in the source paper [11].
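For concreteness, the following minimal sketch shows this embed-then-score step on a toy graph. It assumes the open-source node2vec Python package and NetworkX; the graph, parameter values, and the edge_score helper are illustrative stand-ins, not the exact setup of this study.

import networkx as nx
import numpy as np
from node2vec import Node2Vec  # assumed: pip install node2vec

G = nx.karate_club_graph()  # toy stand-in for a citation graph

# Sample biased random walks and fit a Skip-gram Word2Vec model on them.
n2v = Node2Vec(G, dimensions=64, walk_length=30, num_walks=200, p=1, q=1)
model = n2v.fit(window=10, min_count=1)

def edge_score(u, v):
    """Cosine similarity between the embeddings of nodes u and v."""
    a, b = model.wv[str(u)], model.wv[str(v)]
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(edge_score(0, 33))  # a higher score suggests a more plausible link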
Several perturbation levels are applied in the experimental study to examine small, moderate, and significant perturbations of two different citation graphs. The two datasets turn out to demonstrate comparable behavior under distortion despite their distinct structures. This finding indirectly supports the earlier observation that more than two-thirds of the references in a paper may be unnecessary: the fact that the citation graphs exhibit consistent behavior under different perturbation levels reinforces the idea that many references in academic papers may not be essential or genuinely cited.
By conducting these experiments and observing the robustness of the citation graphs, deeper insights are gained into the credibility of the citation relationships within the network. The results further highlight the potential for identifying anomalous or unnecessary citations, shedding light on potential issues related to citation manipulation or fraudulent behavior.
The paper’s contribution is twofold. First, it introduces a new citation-consistency model that evaluates the credibility of citations by applying a sequential perturbation technique to a citation network; notably, the method strives to retain the most significant references, ensuring their integrity throughout the assessment. Second, it offers a new perspective on evaluating the trustworthiness of citations in scholarly work, adding to the existing body of knowledge.
The remainder of the paper is organized as follows. Section 2 provides an overview of the mathematical preliminaries relevant to the study. In Section 3, the proposed model for identifying citation manipulation is presented. Section 4 presents the experimental study conducted to evaluate the effectiveness of the proposed model. Finally, Section 5 concludes the paper by summarizing the main findings and discussing their implications and further research directions.

2. Preliminaries

In this section, several mathematical models are introduced and discussed, forming the algorithmic foundation of the proposed research.

2.1. Word2Vec

Numerous traditional methods in text mining rely on vector representations, such as the bag-of-words approach, which treats texts as vectors of term occurrences. However, these techniques have a known limitation: they disregard the order of words and the relationships between them, leading to a loss of semantic information. Deep learning embedding systems offer novel strategies to address these limitations. They provide real-valued vector representations for words, in which words with similar meanings are represented by vectors that lie close together. Word embedding, encompassing a range of language modeling techniques, plays a crucial role in natural language processing. It represents words from a given vocabulary as dense vectors that effectively preserve the underlying semantic and syntactic information. As a result, word embedding proves invaluable for enhancing performance across various natural language processing tasks.
Word2Vec [12] has gained popularity as a highly effective algorithm for learning word embeddings. Its success lies in its ability to capture semantic relationships between words and represent them in a continuous vector space. The algorithm operates through a shallow, two-layer neural network with two main architectures: Continuous Bag-of-Words (CBOW) and Skip-gram.
In the CBOW architecture, the model predicts a target word based on its context words (words surrounding it in a sentence or text). This approach is particularly useful for tasks like language modeling, where the goal is to predict the next word given its context. On the other hand, the Skip-gram architecture reverses the process, aiming to predict context words from a given target word. This architecture is advantageous for applications like word similarity and analogy tasks.
The continuous word embeddings generated by Word2Vec facilitate more efficient mathematical operations, allowing for meaningful vector arithmetic and semantic similarity comparisons between words. This capability has spurred significant advancements in language-related tasks and has become a fundamental building block for numerous language models and applications in modern natural language-processing research. An application of this approach to the graph-embedding task is considered in the following subsection.
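As a minimal illustration of the two architectures, the following gensim sketch trains both variants on a toy corpus; the corpus and parameter values are illustrative assumptions, not those used elsewhere in this paper.

from gensim.models import Word2Vec

corpus = [
    ["citation", "networks", "capture", "relations", "between", "papers"],
    ["graph", "embeddings", "map", "nodes", "to", "vectors"],
]

# sg=0 selects CBOW (predict a word from its context);
# sg=1 selects Skip-gram (predict the context from a word).
cbow = Word2Vec(corpus, vector_size=32, window=2, min_count=1, sg=0)
skipgram = Word2Vec(corpus, vector_size=32, window=2, min_count=1, sg=1)

# The learned vectors support similarity queries and vector arithmetic.
print(skipgram.wv.most_similar("graph", topn=3))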

2.2. Graph-Embeddings Methods

Graph-embedding techniques aim, like their word-embedding counterparts, to represent the nodes and edges of a graph as continuous, relatively low-dimensional vectors; such representations allow machine learning algorithms to operate on graph data efficiently. A short review of several well-known graph-embedding methods follows.

2.2.1. Node2Vec

Node2Vec [11] is a widely used graph-embedding algorithm that generates embeddings by capturing nodes’ local and global neighborhood information. The algorithm simulates random walks on the graph, which lets it generate low-dimensional representations for nodes, and optimizes a neighborhood-preserving objective using a corresponding Skip-gram model. Roughly speaking, the random walks generate sequences of nodes that are treated as sentences within the general Word2Vec method applied in Skip-gram mode.
To strike a balance between exploration and exploitation during the random walks, Node2Vec employs two hyperparameters: the “return” parameter p and the “inout” parameter q. These hyperparameters control the probabilities associated with the random walk, determining whether it stays close to previous nodes, explores outward, or explores inward, and thus play a crucial role in shaping the behavior of the walks conducted during the algorithm’s learning process.
Adjusting the “return” hyperparameter controls the likelihood of revisiting previous nodes during the random walk, influencing the algorithm’s exploration of local neighborhoods and its ability to capture the structural properties of the graph. Similarly, the “inout” hyperparameter governs the walk’s decision making, determining the probabilities of exploring outward or inward at each step; this provides control over exploring global structure versus exploiting local neighborhood information. Specifically, the unnormalized weights 1/p and 1/q govern the likelihood of returning to the previous node and of exploring outward, respectively.
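As a concrete reading of this rule, the fragment below computes the unnormalized weight that Node2Vec assigns to a candidate next step of a walk; it is a didactic sketch against NetworkX, not the walk generator used in the experiments.

import networkx as nx

def transition_weight(G: nx.Graph, prev, curr, nxt, p: float, q: float) -> float:
    """Unnormalized Node2Vec weight for stepping from curr to a neighbor nxt,
    given that the walk arrived at curr from prev."""
    if nxt == prev:              # distance 0 from prev: return
        return 1.0 / p
    if G.has_edge(nxt, prev):    # distance 1 from prev: stay nearby
        return 1.0
    return 1.0 / q               # distance 2 from prev: explore outward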
In study [11], a perturbation analysis is conducted on the BlogCatalog network to explore the impact of imperfect information about the network’s edge structure. The study examines two scenarios in which the accuracy of the edge information is compromised. In the first scenario, performance is evaluated as a function of the fraction of missing edges relative to the entire network. The missing edges are selected at random while keeping the number of connected components constant. The study measures the decrease in the Macro-F1 score as the fraction of missing edges increases. The findings show that the decline is roughly linear, with a small slope, indicating that the network can tolerate some level of missing edges without significant degradation in performance.

2.2.2. GraphSAGE (Graph Sample and Aggregated)

GraphSAGE [13] is a highly scalable algorithm designed for large graphs. It operates by sampling a fixed-size neighborhood around each node and then aggregating information from the sampled neighbors to create node embeddings. By doing so, the approach can effectively capture and preserve structural information in the graph, including higher-order proximity between nodes. Additionally, the scalability of GraphSAGE allows it to handle massive graphs efficiently.

2.2.3. DeepWalk

Inspired by Word2Vec, DeepWalk [14] generates node embeddings by treating graph traversals as sentences and applying Word2Vec techniques. It captures structural information by considering the local node context within random walks. Both DeepWalk and Node2Vec are effective graph-embedding algorithms that leverage random walks: DeepWalk focuses on capturing local neighborhood information, while Node2Vec offers more control and adaptability in the exploration strategy, enabling it to preserve both local and global structural information.

2.2.4. LINE (Large-Scale Information Network Embedding)

LINE [15] aims to preserve both the first-order (i.e., local node proximity) and second-order (i.e., global graph structure) proximity information. It optimizes two objectives: one for preserving the first-order proximity through the similarity of nearby nodes and another for preserving the second-order proximity through the similarity of nodes connected to the same neighborhood.

2.2.5. Graph Attention Network (GAT)

GAT [16] is a graph neural network (GNN) that learns node embeddings by paying different attention to neighbors during aggregation. It uses self-attention mechanisms to determine the importance of neighboring nodes for each node’s embedding, allowing it to capture complex graph structures.
Indeed, selecting a graph-embedding method should take into account various factors, such as the characteristics of the graph data, the available computational resources, and the specific downstream tasks. After carefully assessing the available techniques, the Node2Vec approach is chosen as the most suitable for our particular use case. The main reasons for this selection are its inherent suitability to the type of perturbation involved in the present research and its intuitive connection to the task at hand. However, this choice does not undermine the significance of the other graph-embedding methods.
The subsequent section discusses how the techniques mentioned earlier are applied to our proposed approach—a step-by-step process of investigating the stability of citations through perturbations and analyzing deviations from expected patterns.

3. Approach

This section presents the proposed approach. As previously mentioned, the assumption is that manipulated or fraudulent citations may appear as anomalies within a citation network. These anomalies are expected to make the manipulated citations vulnerable to appropriate network perturbations, causing them to be unstable or detectable. The hypothesis is that manipulated citations, intentionally added to inflate the impact or credibility of certain publications, may not conform to the natural patterns and structures of the citation network. Therefore, when subjected to network perturbations, such as the removal of specific nodes or edges, manipulated citations are more likely to exhibit inconsistencies or irregularities that distinguish them from genuine citations.
Investigating the stability of citations under perturbations and analyzing deviations from expected patterns can therefore be a powerful approach to identifying anomalous citations within a citation network. By perturbing the network in various ways, such as omitting edges or altering connections, it is possible to observe how the network responds and how citation relationships change; node pairs whose embeddings retain high similarity are considered more likely to be genuinely connected.
The perturbations of the citation network considered here are essentially those involved in the perturbation analysis mentioned earlier: artificial changes or modifications to the network structure. In our study, they consist of randomly removing citations in order to evaluate the robustness, stability, and integrity of the citation network and of its individual links.
Perturbations can reveal vulnerabilities or weaknesses in a network, making it more likely for anomalies or manipulated elements to exhibit abnormal behavior or stand out from the genuine components.
As mentioned previously, Node2Vec is selected as the embedding method because of its resilience against the considered perturbations, which makes it a suitable choice for this initial research. The natural and intuitive character of the Node2Vec approach further supports its application and the successful demonstration of the basic idea in the present study.
A pseudocode of the proposed approach is given as Algorithm 1.
Algorithm 1. The proposed procedure’s pseudocode.

Input parameters:
- Graph_C — graph of paper citations.
- p, q — “return” and “inout” parameters of Node2Vec.
- Nwalk — number of random walks in a model generation.
- Lwalk — length of a random walk.
- d — dimension of the Word2Vec embedding in Node2Vec.
- N_iter — number of perturbations.
- Fr — fraction of edges randomly omitted in each iteration.
- S — similarity measure.
- Tr — similarity threshold.

Procedure:
1. Load the dataset Graph_C.
2. Initialize an array Result0 of zeros with length equal to the number of edges in Graph_C.
3. For iter = 1 : N_iter do:
   3.1. Create a temporary graph Graph_T by removing a fraction Fr of the edges of Graph_C without replacement.
   3.2. Create an embedding of Graph_T: W(Graph_T) = Node2Vec(Graph_T, Nwalk, Lwalk, p, q, d).
   3.3. Calculate the similarity values for all pairs of nodes.
   3.4. Compose the set ED_R of edges reconstructed by Link_prediction(Graph_T, W(Graph_T), S, Tr).
   3.5. For each edge in ED_R: Result0(edge) = Result0(edge) + 1.
4. Summarize by sorting the array Result0 in ascending order.
In the first step, the analyzed citation graph Graph_C is loaded, and an array Result0 of zeros, with length equal to the number of edges in Graph_C, is initialized. Then, N_iter sequential iterations are performed. At each iteration, a temporary graph Graph_T is created by randomly removing a fraction Fr of the edges, and the reduced graph is embedded in R^d using the Node2Vec approach.
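A minimal Python sketch of this loop is given below, again assuming NetworkX and the node2vec package from the earlier sketch. For brevity it scores only the original edges rather than all node pairs, and it applies the 1 − Tr cosine cutoff formalized in the Link_prediction procedure that follows; it is an illustrative reading of Algorithm 1, not the exact experimental code.

import random
import networkx as nx
import numpy as np
from node2vec import Node2Vec  # assumed third-party package

def perturbation_scores(Graph_C: nx.Graph, Fr=0.3, Tr=0.1, N_iter=100,
                        Nwalk=200, Lwalk=30, p=1.0, q=1.0, d=64):
    edges = list(Graph_C.edges())
    result = {e: 0 for e in edges}  # plays the role of Result0
    for _ in range(N_iter):
        # Step 3.1: omit a random fraction Fr of the edges, without replacement.
        Graph_T = Graph_C.copy()
        Graph_T.remove_edges_from(random.sample(edges, int(Fr * len(edges))))
        # Step 3.2: embed the perturbed graph.
        wv = Node2Vec(Graph_T, dimensions=d, walk_length=Lwalk,
                      num_walks=Nwalk, p=p, q=q).fit(window=10, min_count=1).wv
        # Steps 3.3-3.5: credit each original edge whose endpoints stay similar.
        for u, v in edges:
            a, b = wv[str(u)], wv[str(v)]
            sim = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
            if sim > 1.0 - Tr:
                result[(u, v)] += 1
    # Step 4: ascending counts put the most suspicious citations first.
    return sorted(result.items(), key=lambda kv: kv[1])

# Example: scores = perturbation_scores(nx.karate_club_graph(), N_iter=10)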
A flowchart of the algorithm is given in Figure 1.
Two additional parameters are introduced: a similarity measure (S) and a threshold value (Tr). The similarity measure quantifies the similarity between pairs of nodes, while the threshold determines the cutoff for deciding whether a pair is considered “connected.” Specifically, if the similarity score between two nodes exceeds the cutoff 1 − Tr, they are deemed connected, whereas pairs with a similarity score below this cutoff are considered disconnected, as formalized in the following link prediction procedure.
Procedure Link_prediction(Graph_T, W(Graph_T), S, Tr)
The procedure predicts the presence of an edge between two nodes n1 and n2.
Input parameters:
- Graph_T — graph of paper citations.
- W(Graph_T) — embedding of Graph_T.
- S — similarity measure.
- Tr — similarity threshold.
Procedure:
- If the similarity score S(n1, n2) is greater than 1 − Tr, return 1, indicating that there could be an edge between n1 and n2.
- Otherwise, return 0, indicating that there is likely no edge between n1 and n2.
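In code, the procedure reduces to a one-line threshold test; a sketch, with S standing for any similarity function on node pairs (for instance, the cosine edge_score from the earlier sketch):

def link_prediction(n1, n2, S, Tr):
    """Return 1 if an edge between n1 and n2 is predicted, 0 otherwise."""
    return 1 if S(n1, n2) > 1.0 - Tr else 0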
It should be emphasized that the studied citation graph is treated as undirected. This scenario focuses on the connectivity between papers rather than on the specific direction of citation, which allows the overall structure and patterns of the network to be analyzed and understood. Disregarding edge direction enables a more holistic view of the network, capturing the relationships and interdependencies between papers regardless of whether one paper cites another or is cited by it. An experimental study of the suggested algorithm is described in the following Section 4.

4. Experiments

Numerical experiments in this study are conducted using two citation datasets. The first is the well-known “Cora” dataset, which comprises scientific research papers broadly representative of the computer science research landscape. The second dataset is sampled from PubMed, a vast online resource managed by the National Center for Biotechnology Information (NCBI) and the U.S. National Library of Medicine. PubMed houses an extensive collection of biomedical literature, including research papers, reviews, and scholarly publications; this dataset allows the citation patterns and relationships within the biomedical research domain to be explored.
Using these two diverse datasets allows us to assess the performance and effectiveness of our proposed approach in different research domains and under varying contexts. The experiments aim to provide valuable insights into the strengths and limitations of the proposed approach in its application to citation networks in distinct academic disciplines.

4.1. Cora Dataset

The Cora dataset (https://relational.fit.cvut.cz/dataset/CORA, accessed on 15 May 2023) is a well-known and extensively used dataset in machine learning and natural language processing, specifically for studying citation networks. Each paper in the dataset is represented by a bag-of-words feature vector, which indicates the presence or absence of specific words within the document. In addition to the textual data, the Cora dataset provides information about citation links between the documents to establish connections among the papers, allowing us to study citation patterns and investigate techniques for citation network analysis.
The Cora dataset contains 2708 scientific publications categorized into seven classes. With 5429 links, the dataset’s citation network captures the connections between these publications. Each publication is additionally represented by a binary word vector of 0s and 1s, indicating the presence or absence of words from a dictionary of 1433 distinct words. Figure 2 shows a partial visualization of the data graph.
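For reference, a short sketch of loading Cora; here the DGL copy of the dataset is assumed as a convenient stand-in for the archive linked above, and the graph is converted to an undirected NetworkX graph, as required by the method.

import dgl
import networkx as nx

g = dgl.data.CoraGraphDataset()[0]
print(g.num_nodes(), g.num_edges())  # 2708 nodes; edges are stored directionally

# Collapse to an undirected simple graph for the perturbation procedure.
G = nx.Graph(dgl.to_networkx(g))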
Numerical experiments investigating the dataset’s structure are performed with various Fr values (30%, 40%, and 50%) and Tr values (0.05, 0.1, and 0.2) across 100 iterations, following the procedure presented in Section 3. It is worth mentioning that the use of such small thresholds does not yield conclusive evidence for the existence of the required connecting edges. As in statistical hypothesis testing, a similarity below these critical values signifies rejection of the hypothesis that such an edge is present, but it does not offer substantial evidence to confirm its presence. Therefore, connections whose scores collapse below these thresholds are deemed questionable and suggest possible manipulation.
Experiments are performed with the following parameters:
- p = 1;
- q = 1;
- Nwalk = 200;
- Lwalk = 30;
- d = 64;
- N_iter = 100;
- Fr = 30/40/50%;
- S — the cosine similarity;
- Tr = 0.05/0.1/0.2.
The choice of fraction values in our study was carefully considered to capture different perturbation levels in the citation network. The intention is to evaluate the network’s behavior under small, moderate, and significant perturbations, enabling an understanding of its stability across varying degrees of edge omission.
The specific fraction values for edge omission aim to create perturbed networks that reflect realistic scenarios: small fractions represent minor disturbances, while moderate and large fractions introduce more substantial perturbations. This approach allows exploring how citation relationships behave under different perturbation levels and provides insight into the robustness of both the proposed method and the network itself. The chosen fraction values are adequate for tracing the network’s stability: they balance preserving the network’s overall structure against introducing enough perturbation to reveal citation anomalies or patterns.
Recall that cosine similarity is a metric used to measure the similarity between two vectors in a vector space. It calculates the cosine of the angle between the vectors. Its range is from −1 to 1, where 1 indicates identically oriented vectors, 0 indicates orthogonal (unrelated) vectors, and −1 indicates diametrically opposed vectors. To compute cosine similarity, the dot product of the two vectors is divided by the product of their norms. This normalization makes the measure independent of the vectors’ lengths, so it depends only on their directions. Cosine similarity finds applications in various fields, such as natural language processing, information retrieval, and data mining, where it quantifies the similarity between vectors or documents based on their relative orientations in a multi-dimensional space.
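A small worked example of the formula, using NumPy (the vectors are arbitrary):

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0, 1.0])
b = np.array([1.0, 1.0, 0.0])
print(cosine_similarity(a, b))  # 1 / (sqrt(2) * sqrt(2)) = 0.5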
Three sets of histograms illustrate the distributions of the scores obtained during the experiments. Each set includes three histograms: the upper histogram shows the scores achieved with a threshold of Tr = 0.05, the middle corresponds to Tr = 0.1, and the bottom to Tr = 0.2. In the visual depiction, the red section corresponds to the interval with upper bound 10; the next interval, highlighted in yellow, has its upper bound at half the maximum number of reconstructed edges; and the subsequent intervals, marked in blue and green, have upper bounds at the maximum number of reconstructed edges minus 10 and at the maximum number itself, respectively. Figure 3, Figure 4 and Figure 5 display the histograms of the edge distributions within these categories for the Cora dataset, and Table 1, Table 2 and Table 3 provide additional detail on the allocation of edges across the categories.
Upon analyzing the histograms and tables obtained for various Fr values, a noticeable similarity between them becomes apparent. This finding suggests the presence of a consistent underlying structure within the dataset that remains resilient to perturbations. It is worth noting that approximately 20% of the total edges (citations) fail to withstand the distortion procedure adequately. These edges, which exhibit high sensitivity to data transformation, do not align with the stable inner structure of the core system. Consequently, the corresponding citations may be considered suspicious and potentially manipulated.
Conversely, a distinct set of edges exhibits consistent behavior when subjected to perturbations, resulting in a high probability of being accurately reconstructed. These connections constitute a stable core within the data, comprising a substantial number of critical edges.
Table 4 showcases 15 distinct sets of specific edges that consistently emerge across various parameter combinations, demonstrating the behavior discussed earlier. There is a significant overlap between these sets, indicating a strong association among the identified edges. The table presents the top 15 highly reconstructed edges for all removed fractions (30%, 40%, and 50%) and all similarity thresholds, along with their corresponding average counts. Edges that are successfully reconstructed in every iteration are marked. The table columns form sequential groups: the first three groups contain the results for all Tr values at each fraction, while the last group gives the corresponding mean counts.
Table 5 presents the titles of the corresponding papers where available. Note that, due to the incompleteness of the Cora dataset, certain IDs do not have corresponding titles; these cases are represented as “--”.

4.2. Sampled PubMed-Diabetes Dataset

The term “PubMed-Diabetes dataset” commonly refers to a compilation of scientific articles concerning diabetes available in the PubMed database. In our research, the dgl.data.PubmedGraphDataset function from the Deep Graph Library (DGL) is used to retrieve and load the dataset, enabling convenient access to and analysis of the interconnected information within these articles.
A subset of 5201 edges was randomly chosen from the dataset during our analysis; of these, 4867 edges were found to be connected. This offers valuable insight into the interconnectedness of the chosen portion of the PubMed-Diabetes dataset, a random sample accounting for 10% of the original dataset. Such a sample can be analyzed in the same manner as the Cora dataset.
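A sketch of such a sampling step is shown below, using the dgl.data.PubmedGraphDataset loader named above; the uniform edge sampling shown here is an illustrative simplification of the sampling actually performed.

import dgl
import torch

g = dgl.data.PubmedGraphDataset()[0]

# Randomly pick 5201 edge ids and keep the induced edge subgraph.
eids = torch.randperm(g.num_edges())[:5201]
sub = dgl.edge_subgraph(g, eids)
print(sub.num_nodes(), sub.num_edges())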
In line with previous discussions, histograms in Figure 6, Figure 7 and Figure 8 and Table 6, Table 7 and Table 8 showcase the distributions of edges across these specific categories. Corresponding tables provide detailed supplementary information regarding the allocation of edges within each category.
The observed sensitivity of the dataset to the considered perturbations highlights the need for careful parameter selection. The results indicate that the dataset is especially responsive when the similarity threshold is set to Tr = 0.05; this setting consistently produces suitable outcomes for Fr values of 0.3 and 0.4. When Fr is increased to 0.5, the optimal choice becomes slightly more nuanced: both Tr = 0.05 and Tr = 0.1 provide favorable results, suggesting a broader range of acceptable thresholds. The data used in our study were obtained by sampling the complete dataset with a random-walk approach; combining this sampling method with the subsequent perturbation procedure enhances the sensitivity of the approach. Still, higher perturbation rates can disrupt the inner structure of the data.
Nevertheless, despite the desirable outcomes, the stable core associated with the reconstructed edges appears weaker than in the other cases. This implies that the reliability and relevance of the reconstructed edges within this subset may be questionable and should be treated cautiously.
It is also important to note that the results discussed so far are based on analyzing a subset of the dataset, representing only 10% of the entire collection. This limited sample size may have implications for the generalizability and reliability of the findings. Therefore, it is crucial to interpret the results within the context of this subset and exercise caution when drawing broader conclusions about the entire dataset.
The obtained results corroborate the previous findings concerning the Cora dataset. Specifically, a consistent pattern emerges where around one-third (or possibly slightly more, considering the lower fraction of the first category) of the edges demonstrate instability and lack relevance. This consistency between the results obtained for both datasets suggests a common underlying characteristic regarding the reliability of the edges. It indicates that many connections within citation datasets may be less trustworthy or subject to potential manipulation.

5. Summary and Conclusions

This paper proposes a novel method for identifying valid scientific citations. The central concept revolves around the stability of genuine citation paths, as indicated by their coverage through indirect but not overly long routes that include the cited references. To uncover such connections, the citation network undergoes perturbations that omit random samples of edges. The resulting network is then embedded into a suitable Euclidean space using the Node2Vec algorithm, so that closely located node representations suggest a link between the corresponding nodes.
Iteratively applying the perturbation and embedding process makes it possible to organize the edges based on their recovery success. As a result, the top recovered edges are deemed the most reliable, as they consistently appear in the recovered networks across multiple iterations. On the other hand, the edges that consistently fail to be recovered appear towards the bottom of the list and are considered more suspicious.
This repeated procedure of perturbing the network, recovering edges, and assessing their reliability allows for systematically ranking edges based on their consistency in the recovered networks. By prioritizing the top-ranked edges, which are consistently recovered, and raising concerns about the lower-ranked ones, this method provides valuable insights into the robustness and credibility of the citation relationships within the network. This ranking of edges based on their recovery success helps in ensuring the accuracy and trustworthiness of citation data in scientific research.
Analyzing the Cora and PubMed-Diabetes datasets using the proposed methodology has yielded valuable insights into citation interconnectivity and its sensitivity to perturbations. It must be acknowledged that the PubMed findings are based on a subset representing only 10% of the complete PubMed-Diabetes dataset. Despite the distinct internal structures of the datasets, the results of the numerical experiments exhibit meaningful comparability, suggesting a potential general tendency within the mutual citation structure: shared characteristics that transcend specific dataset variations. Multiple deep-learning models will be applied to the respective datasets to explore this phenomenon further, facilitating a more comprehensive examination of the underlying patterns and dynamics within citation networks on a broader scale.
As research in this area continues to evolve, further improvements and refinements can be made to our method. It is crucial to explore more extensive datasets and diverse research domains to validate the generalizability of our approach. By leveraging the advancements in graph-embedding techniques and network analysis, it is possible to continue to enhance the quality and trustworthiness of citation networks in academic and scientific communities.
The limitations of this approach are directly correlated with the constraints of the Node2Vec method, which include computational intensity, memory requirements, and sensitivity to the chosen sampling strategy. In particular, the effectiveness of Node2Vec hinges on the choice of sampling strategy, involving parameters such as p and q of the biased random walk. Additionally, in cases where nodes have few or no connections, Node2Vec may require a deeper understanding of the graph’s context to generate meaningful embeddings. These limitations should be kept in mind when applying Node2Vec within the proposed method.
A deeper insight into the general citation interconnectivity is anticipated to be gained by conducting these forthcoming investigations, enabling more nuanced interpretations and analyses across diverse domains. Leveraging multiple deep-learning models will uncover valuable insights and potential commonalities in citation structures, thereby advancing the field of research in this area.

Author Contributions

R.A., D.T.K. and Z.V. collaborated on model creation, design, and the writing and organization of the paper. S.K. and E.V. were responsible for designing and conducting the experimental study. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Acknowledgments

The authors express their sincere gratitude to the anonymous reviewers for their valuable and constructive comments, which have greatly contributed to the substantial improvement of this article.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Prabha, C.G. Some aspects of citation behavior: A pilot study in business administration. J. Am. Soc. Inf. Sci. 1983, 34, 202–206. [Google Scholar] [CrossRef]
  2. Resnik, D.B.; Gutierrez-Ford, C.; Peddada, S. Perceptions of Ethical Problems with Scientific Journal Peer Review: An Exploratory Study. Sci. Eng. Ethics 2008, 14, 305–310. [Google Scholar] [CrossRef] [PubMed]
  3. Wilhite, A.; Fong, E. Coercive citation in academic publishing. Science 2012, 335, 542–543. [Google Scholar] [CrossRef] [PubMed]
  4. Wren, J.D.; Georgescu, C. Detecting anomalous referencing patterns in PubMed papers suggestive of author-centric reference list manipulation. Scientometrics 2022, 127, 5753–5771. [Google Scholar] [CrossRef]
  5. Dong, M.; Zheng, B.; Quoc Viet Hung, N.; Su, H.; Li, G. Multiple rumor source detection with graph convolutional networks. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, Beijing, China, 3–7 November 2019; pp. 569–578. [Google Scholar]
  6. Lu, Y.-L.; Li, C.-T. GCAN: Graph-aware co-attention networks for explainable fake news detection on social media. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Virtual, 5–10 July 2020; pp. 505–514. [Google Scholar]
  7. Bian, T.; Xiao, X.; Xu, T.; Zhao, P.; Huang, W.; Rong, Y.; Huang, J. Rumor detection on social media with bi-directional graph convolutional networks. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 549–556. [Google Scholar]
  8. Li, A.; Qin, Z.; Liu, R.; Yang, Y.; Li, D. Spam review detection with graph convolutional networks. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, Beijing, China, 3–7 November 2019; pp. 2703–2711. [Google Scholar]
  9. Yu, S.; Xia, F.; Sun, Y.; Tang, T.; Yan, X.; Lee, I. Detecting outlier patterns with query-based artificially generated searching conditions. IEEE Trans. Comput. Soc. Syst. 2020, 8, 134–147. [Google Scholar] [CrossRef]
  10. Liu, J.; Xia, F.; Feng, X.; Ren, J.; Liu, H. Deep Graph Learning for Anomalous Citation Detection. IEEE Trans. Neural Netw. Learn. Syst. 2022, 33, 2543–2557. [Google Scholar] [CrossRef] [PubMed]
  11. Grover, A.; Leskovec, J. Node2vec: Scalable Feature Learning for Networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ‘16, San Francisco, CA, USA, 13–17 August 2016; pp. 855–864. [Google Scholar]
  12. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient Estimation of Word Representations in Vector Space. In Proceedings of the 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, AZ, USA, 2–4 May 2013. [Google Scholar]
  13. Hamilton, W.L.; Ying, Z.; Leskovec, J. Inductive representation learning on large graphs. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NeurIPS ‘17, Long Beach, CA, USA, 4–9 December 2017; pp. 1024–1034. [Google Scholar]
  14. Perozzi, B.; Al-Rfou, R.; Skiena, S. DeepWalk: Online Learning of Social Representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ‘14, New York, NY, USA, 24–27 August 2014; pp. 701–710. [Google Scholar]
  15. Tang, J.; Qu, M.; Wang, M.; Zhang, M.; Yan, J.; Mei, Q. LINE: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, WWW ‘15, Florence, Italy, 18–22 May 2015; pp. 1067–1077. [Google Scholar]
  16. Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; Bengio, Y. Graph Attention Networks. In Proceedings of the 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
Figure 1. A flowchart of the algorithm.
Figure 2. Partial visualization of the Cora dataset.
Figure 3. Histograms of distributions of edges recovered for the Cora dataset for Fr = 30%.
Figure 4. Histograms of distributions of edges recovered for the Cora dataset for Fr = 40%.
Figure 5. Histograms of distributions of edges recovered for the Cora dataset for Fr = 50%.
Figure 6. Histograms of distributions of edges recovered for the PubMed dataset for Fr = 30%.
Figure 7. Histograms of distributions of edges recovered for the PubMed dataset for Fr = 40%.
Figure 8. Histograms of distributions of edges recovered for the PubMed dataset for Fr = 50%.
Table 1. Distributions of edges recovered for the Cora dataset for Fr = 30%.

Tr | Frequencies (four score intervals) | Mean | Median
0.05 | 0.04 / 0.22 / 0.72 / 0.02 | 26.57 | 28.00
0.1 | 0.08 / 0.21 / 0.69 / 0.02 | 25.79 | 27.00
0.2 | 0.07 / 0.24 / 0.66 / 0.03 | 24.96 | 27.00
Table 2. Distributions of edges recovered for the Cora dataset for Fr = 40%.

Tr | Frequencies (four score intervals) | Mean | Median
0.05 | 0.01 / 0.19 / 0.77 / 0.02 | 33.96 | 35.00
0.1 | 0.01 / 0.19 / 0.77 / 0.02 | 33.96 | 35.00
0.2 | 0.07 / 0.23 / 0.68 / 0.01 | 30.86 | 33.00
Table 3. Distributions of edges recovered for the Cora dataset for Fr = 50%.

Tr | Frequencies (four score intervals) | Mean | Median
0.05 | 0.00 / 0.18 / 0.77 / 0.05 | 40.04 | 42.00
0.1 | 0.04 / 0.21 / 0.71 / 0.04 | 37.98 | 40.00
0.2 | 0.08 / 0.24 / 0.66 / 0.03 | 35.24 | 38.00
Table 4. The top 15 highly reconstructed edges. An X marks a cell of the Fr × Tr grid (Fr = 30/40/50%; Tr = 0.05/0.1/0.2) in which the edge was reconstructed in every iteration (highlighted in red in the original); the last three columns give the average counts for Fr = 30%, 40%, and 50%.

Edge | X marks | Avg (30%) | Avg (40%) | Avg (50%)
(116,553, 116,545) | XXX | 46.3 | 49.6 | 51.6
(559,804, 73,162) | XX | 44.3 | 50.6 | 53.6
(17,476, 6385) | — | 44 | 55 | 62
(96,335, 3243) | — | 44 | 53 | 55.6
(582,343, 4660) | — | 44 | 50 | 56
(6639, 22,431) | — | 43.3 | 53 | 60.3
(78,511, 78,557) | — | 43.3 | 50 | 58.6
(1,104,379, 13,885) | XXX | 42.6 | 50 | 52.6
(39,126, 31,483) | — | 42.6 | 51.3 | 56
(10,177, 27,606) | — | 42.3 | 53 | 58
(1,129,683, 608,326) | XXXXXX | 42.3 | 45.3 | 49.6
(38,829, 1,116,397) | XXXXXX | 40.3 | 44.6 | 38.6
(1,107,567, 12,165) | — | 42.3 | 57 | 60
(287,787, 634,975) | XXX | 42 | 50.3 | 50.6
(643,221, 644,448) | X | 42 | 49.3 | 54.3
Table 5. The titles of the top 15 highly reconstructed edges.

Edge | Name of ID 1 | Name of ID 2
(116,553, 116,545) | A survey of intron research in genetics. | Duplication of coding segments in genetic programming.
(559,804, 73,162) | On the testability of causal models with latent and instrumental variables. | Causal diagrams for experimental research.
(17,476, 6385) | Markov games as a framework for multi-agent reinforcement learning. | Multi-agent reinforcement learning: independent vs.
(96,335, 3243) | Geometry in learning. | A system for induction of oblique decision trees.
(582,343, 4660) | Transferring and retraining learned information filters. | Context-sensitive learning methods for text categorization.
(6639, 22,431) | Stochastic Inductive Logic Programming. | An investigation of noise-tolerant relational concept learning algorithms.
(78,511, 78,557) | Genetic Algorithms and Very Fast Reannealing: A Comparison. | Application of statistical mechanics methodology to term-structure bond-pricing models.
(1,104,379, 13,885) | -- | Learning controllers for industrial robots.
(39,126, 31,483) | Toward optimal feature selection. | Induction of selective Bayesian classifiers.
(10,177, 27,606) | Learning in the presence of malicious errors. | Statistical queries and faulty PAC oracles.
(1,129,683, 608,326) | -- | A sampling-based heuristic for tree search.
(38,829, 1,116,397) | From Design Experiences to Generic Mechanisms: Model-Based Learning in Analogical Design. | --
(1,107,567, 12,165) | -- | Slonim. The power of team exploration: Two robots can learn unlabeled directed graphs.
(287,787, 634,975) | A User-Friendly Workbench for Order-Based Genetic Algorithm Research. | Reducing disruption of superior building blocks in genetic algorithms.
(643,221, 644,448) | Minorization conditions and convergence rates for Markov chain Monte Carlo. | Adaptive Markov chain Monte Carlo through regeneration.
Table 6. Distributions of edges recovered for the PubMed dataset for Fr = 30%.

Tr | Frequencies (four score intervals) | Mean | Median
0.05 | 0.39 / 0.52 / 0.09 / 0.00 | 15.26 | 12.00
0.1 | 0.61 / 0.31 / 0.08 / 0.00 | 12.36 | 8.00
0.2 | 0.64 / 0.26 / 0.09 / 0.00 | 6.57 | 2.00
Table 7. Distributions of edges recovered for the PubMed dataset for Fr = 40%.

Tr | Frequencies (four score intervals) | Mean | Median
0.05 | 0.17 / 0.69 / 0.14 / 0.00 | 19.56 | 16.00
0.1 | 0.49 / 0.39 / 0.11 / 0.00 | 15.45 | 11.00
0.2 | 0.65 / 0.26 / 0.09 / 0.00 | 10.42 | 3.00
Table 8. Distributions of edges recovered for the PubMed dataset for Fr = 50%.

Tr | Frequencies (four score intervals) | Mean | Median
0.05 | 0.05 / 0.84 / 0.11 / 0.00 | 23.17 | 20.00
0.1 | 0.34 / 0.58 / 0.08 / 0.00 | 17.72 | 13.00
0.2 | 0.67 / 0.27 / 0.06 / 0.00 | 11.22 | 4.00
