INFLECT-DGNN: Influencer Prediction With Dynamic Graph Neural Networks

Leveraging network information for predictive modeling has become widespread in many domains. Within the realm of referral and targeted marketing, influencer detection stands out as an area that could greatly benefit from the incorporation of dynamic network representation due to the continuous evolution of customer-brand relationships. In this paper, we present INFLECT-DGNN, a new method for profit-driven INFLuencer prEdiCTion with Dynamic Graph Neural Networks that innovatively combines Graph Neural Networks (GNNs) and Recurrent Neural Networks (RNNs) with weighted loss functions, synthetic minority oversampling adapted to graph data, and a carefully crafted rolling-window strategy. We introduce a novel profit-driven framework that supports decision-making based on model predictions. To test the framework, we use a unique corporate dataset with diverse networks, capturing the customer interactions across three cities with different socioeconomic and demographic characteristics. Our results show how using RNNs to encode temporal attributes alongside GNNs significantly improves predictive performance, while the profit-driven framework determines the optimal classification threshold for profit maximization. We compare the results of different models to demonstrate the importance of capturing network representation, temporal dependencies, and using a profit-driven evaluation. Our research has significant implications for the fields of referral and targeted marketing, expanding the technical use of deep graph learning within corporate environments.


Introduction
Advancements in data analytics have enabled businesses worldwide to leverage data for effective decision-making and increased value creation (Baesens, 2014). With a growing number of connections between people, networks can be constructed and used for predictive modeling (Óskarsdóttir et al., 2017), forming social networks consisting of individuals who have social interactions and exchange information (Scott, 1988). Such networks provide valuable insights into interconnectivity patterns between individuals and can serve as an additional data source for real-life problem modeling (Baesens, Van Vlasselaer, & Verbeke, 2015). Network relationships determine how people influence each other and the extent to which information is exchanged, making network analytics useful for domains like fraud detection (Baesens et al., 2015; Van Vlasselaer, Eliassi-Rad, Akoglu, Snoeck, & Baesens, 2017), e-commerce recommendations (Sun et al., 2015), credit scoring (Óskarsdóttir, Bravo, Sarraute, Vanthienen, & Baesens, 2019), and churn prediction (Óskarsdóttir et al., 2017).
Word of Mouth (WOM) has long been recognized as a powerful tool for customer influence in marketing for spreading information effectively among customers (Puigbo, Sánchez-Hernández, Casabayó, & Agell, 2014). In the modern era of digital technologies, its prominence has become even more pronounced. Customers who possess the ability to convince others within their network can be considered influencers or opinion leaders (Rogers & Cartano, 1962). Early detection of influencers enables targeted marketing efforts with the potential to attract more customers for the most profitable initiatives by effectively spreading information across the network.
The increase in social contacts led to the emergence of referral marketing, which exploits social connections for the purpose of promoting a product or service (Roelens, 2018). In this context, social network topology is considered a major factor in the information dissemination process, and influencers are the best candidates to make referrals. To fully harness the power of social networks, it is essential to go beyond simply identifying opinion leaders. Anticipating the emergence of potential influencers before they even join the network can be a key competitive accelerator. Furthermore, predicting the possibility of re-influencing behavior from past influencers, who may refer new clients multiple times throughout their customer lifetime, is also critical. By taking these proactive steps, businesses can effectively leverage the power of social networks to expand their reach and drive growth. Moreover, the usability of influencer prediction models for businesses heavily depends on the profit they generate. Cost-sensitive decision making based on instance-dependent threshold tuning has proved to result in the highest savings for businesses (Vanderschueren, Verdonck, Baesens, & Verbeke, 2022).
As a result, the problem of influencer prediction is highly complex, involving intricate networks, temporal dynamics, diverse social connections, and the need for profit orientation. Dynamic Graph Neural Networks (DGNNs) offer a promising solution by effectively learning from evolving network data, capturing temporal dependencies and evolving patterns (Skarding, Gabrys, & Musial, 2021). Moreover, DGNNs can leverage multiple relationship types within networks through edge coloring, by attributing specific characteristics to edges. To the best of our knowledge, no research has addressed the influencer prediction problem on dynamic attributed edge-colored customer networks with cost-sensitive decision making.
This paper presents INFLECT-DGNN, a novel framework designed for predicting influencers in evolving customer networks. It integrates Graph Neural Networks (GNN), specifically the best-to-date Graph Isomorphism Networks (Xu, Hu, Leskovec, & Jegelka, 2018), and Recurrent Neural Networks (RNN) using weighted loss functions and the Synthetic Minority Oversampling TEchnique (SMOTE) adapted for graphs (Zhao, Zhang, & Wang, 2021). This combination enables effective learning in highly imbalanced networks. In addition, INFLECT-DGNN incorporates a [...]

2 Related work

Representation learning on graphs
Network data is non-linear, so it cannot be readily represented in Euclidean space, which is the typical format for many modern machine learning and deep learning algorithms. Hence, network topology should be extracted first to be used for downstream tasks. There exist standard approaches for encoding network topology, such as neighborhood and centrality metrics as well as collective inference algorithms (Baesens et al., 2015). With advances in computational power and the popularity of deep learning, Graph Neural Networks (GNNs) have become a popular method for representation learning on graphs (Zhou et al., 2020). Variations of GNNs, such as Graph Convolutional Networks (GCNs), Graph Attention Networks (GATs), and Graph Isomorphism Networks (GINs), have demonstrated excellent performance on various deep learning tasks (Kipf & Welling, 2017; Veličković et al., 2018; Xu et al., 2018; Zhou et al., 2020). GNNs have found numerous practical applications, such as physical systems modeling (Sanchez-Gonzalez et al., 2018), disease classification (Rhee, Seo, & Kim, 2017), and traffic state prediction (Guo, Lin, Feng, Song, & Wan, 2019). Tiukhova et al. (2022) demonstrated that DGNNs obtain an outstanding performance for ex-post influencer detection with past influencers being propagated into the future. Section 2.1.1 focuses on general GNN approaches to network representation learning, while Section 2.1.2 delves into DGNN literature.

Graph Neural Networks
Deep learning's increasing popularity has brought GNNs to the forefront of representational learning for networks (Wu et al., 2021). With their versatility, GNNs can solve node-, edge-, and graph-level tasks using supervised, semi-supervised, and unsupervised learning. GNNs can be categorized into several types based on their learning approach, task level, and complexity. Recurrent GNNs, the pioneers in this field, use the same set of parameters recurrently to learn high-level node representations (Wu et al., 2021). Convolutional GNNs, a more advanced type, generalize the fixed grid structure of Convolutional Neural Networks (CNN) and use either spectral-based or spatial-based graph convolution to aggregate neighbor features and learn node representations (Wu et al., 2021). Graph Autoencoders learn node embeddings in an unsupervised manner by reconstructing the adjacency matrix, while Spatial-Temporal GNNs enable dynamic network analysis by incorporating the time dimension (Wu et al., 2021).
Although there are numerous GNNs available, they share a similar design pipeline (Zhou et al., 2020). It includes the steps of identifying network structure and type, designing the loss function, and building the model using computational modules (Zhou et al., 2020). Network structure can be either explicit, where the network is given upfront, or implicit, where the network is to be built from the task at hand. The main network types are directed/undirected, homogeneous/heterogeneous, and static/dynamic networks. Loss functions are dependent on the task type, including its level and the type of supervision (Zhou et al., 2020). Computational modules can be classified into propagation, sampling, and pooling modules. A propagation module is used to propagate feature and topological information between nodes and can include convolution operators, recurrent operators, or skip connections. A sampling module can be utilized to sample large networks, while a pooling module is used to extract information from nodes (Zhou et al., 2020).

Dynamic Graph Neural Networks
A network that changes over time, with appearing/disappearing nodes and edges, is called a dynamic network (Skarding et al., 2021). Research shows that incorporating the temporal component of networks into the prediction task enhances the models' performance (Óskarsdóttir, Van Calster, Baesens, Lemahieu, & Vanthienen, 2018). The process of network learning typically involves an encoder and a decoder, with the encoder focused on learning node embeddings and the decoder used to solve prediction tasks such as node or edge classification (Hamilton, Ying, & Leskovec, 2017). DGNNs can also be represented with the encoder-decoder structure. In a recent survey, different encoder-decoder architectures were examined for supervised dynamic network learning, with techniques categorized as either Discrete Time Dynamic Graph (DTDG) or Continuous Time Dynamic Graph (CTDG) learning (Zhu, Lyu, Hu, Chen, & Liu, 2022). DTDG and CTDG learning can be applied to graphs of varying nature, including static/dynamic and attributed/non-attributed graphs, and both presume an implicit notion of time (Zhu et al., 2022). The primary difference between DTDG and CTDG learning is in how they handle time: DTDG models use network snapshots that capture the graph topology at a particular moment, whereas CTDG models consider networks as an event stream that updates over time.
Due to the nature of the input graph and the complexity of the problem, in this research we follow the stacked DTDG modeling approach and combine the encoder and decoder in the ways described in Section 3.2.1.
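To make the snapshot view of DTDG learning concrete, the following minimal sketch turns a list of timestamped edges into cumulative network snapshots. It is an illustration only: the function name `build_snapshots`, the `(u, v, t)` edge format, and the assumption that nodes and edges persist once they appear are ours and are not taken from the implementation used in this paper.

```python
from collections import defaultdict


def build_snapshots(timed_edges, num_steps):
    """Return a list of cumulative (nodes, edges) snapshots, one per time step.

    timed_edges: iterable of (u, v, t), where t in 1..num_steps is the step at
    which the undirected edge first appears. Assumption: edges persist forever.
    """
    by_step = defaultdict(list)
    for u, v, t in timed_edges:
        # frozenset makes the edge undirected: (u, v) equals (v, u).
        by_step[t].append(frozenset((u, v)))

    snapshots, nodes, edges = [], set(), set()
    for t in range(1, num_steps + 1):
        for edge in by_step[t]:
            edges.add(edge)
            nodes.update(edge)
        # Copy the sets so that later steps do not mutate earlier snapshots.
        snapshots.append((set(nodes), set(edges)))
    return snapshots


timed_edges = [("a", "b", 1), ("b", "c", 2), ("c", "d", 3), ("a", "d", 3)]
snapshots = build_snapshots(timed_edges, num_steps=3)
for t, (nodes, edges) in enumerate(snapshots, start=1):
    print(f"t={t}: {len(nodes)} nodes, {len(edges)} edges")
```

A CTDG model would instead consume `timed_edges` directly as an event stream, without grouping the events into steps.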

Referral marketing
As opposed to Business-to-Customer (B2C) relationships leveraged by traditional marketing techniques, referral marketing employs Customer-to-Customer (C2C) relationships to promote products or services. Referral programs create an additional motivation for customers to refer other potential customers in order to get rewarded. The reward can be a coupon, cash, bonus points, bonus products, etc. Targeted and referral marketing programs catalyze WOM effects by reaching out to influential customers to encourage them to make product recommendations (Ryu & Feick, 2007). Ryu and Feick (2007) found that the offer of a reward increases the likelihood of a referral, regardless of the size of the reward offered. In particular, this effect is stronger for weak ties between a referrer and a potential customer, i.e., exchange relationships, as well as for weaker brands. Thus, referral needs higher encouragement in the absence of close relationships or high brand recognition. Nevertheless, even for stronger ties and brands, the reward should be present (non-zero) to reduce the level of inequity for the existing customers. Therefore, companies need to balance the marginal revenue impact with the size of the reward and the change in the likelihood of referrals (Ryu & Feick, 2007). Armelini, Barrot, and Becker (2015) discovered that referred customers are more loyal than non-referred ones. In addition, they found that referred customers are more valuable if the referring customer has a high Customer Lifetime Value (CLV). This implies that referral marketing is most successful when the targeted customers are not only influential but also profitable. Besides, referred customers have a longer customer lifetime than non-referred ones (Roelens, 2018).

Influencer detection in networks
Influencer detection is a challenging task in the referral marketing domain aimed at discovering customers with a high potential of being a referrer. Moreover, with the increasing availability of network data and the rise of the Internet, this challenge has become even more pressing (Puigbo et al., 2014). To address this issue, previous research has focused on different approaches, including network centrality measures, prestige ranking algorithms, and information diffusion methods (Puigbo et al., 2014). Another way to capture complex relationships is to consider networks with nodes of different types (Huynh, Nguyen, Zelinka, Dinh, & Pham, 2020).
Traditional network analytics approaches typically lack the ability to incorporate more advanced indicators of relationships between network nodes. To overcome these limitations, recent research has proposed using more advanced data types and sophisticated representation learning techniques. One such approach that has gained significant popularity in recent literature is the combination of semantic analysis with network representation learning, as exemplified by the works of Rios, Aguilera, Nuñez-Gonzalez, and Graña (2019) and Zheng, Zhang, Young, and Wang (2020). The former proposes filtering out irrelevant network interactions based on semantic analysis and identifying key network actors using a classical authority discovery algorithm (Rios et al., 2019). The latter combines network topology and text message semantics for influencer discovery, leveraging both a language attention network and graph convolutional networks (GCN) (Zheng et al., 2020). In addition to these approaches, new GNN techniques tailored specifically for influencer detection are emerging. For instance, the graph influence network (GINN) employs GNNs and local graph structures to identify influential neighbors of a node (Shi, Quan, Xiao, Lei, & Niu, 2022). However, GINN is designed for static graphs only and does not detect global influencers. Tiukhova et al. (2022) tackled influencer detection using DGNNs and investigated the best combinations of GNNs (Graph Convolutional Networks vs. Graph Attention Networks) and RNNs (Gated Recurrent Unit Networks vs. Long Short-Term Memory Networks) applied to the network of one city. Influencers were defined as the customers in the network who at least once referred other potential customers to get a credit card, with the influencing behavior being propagated to the future: "once an influencer, always an influencer". The study found that neighbor feature representations captured by GNNs play an important role in detecting influencers as well as in capturing the dynamics of the network evolution. We build upon their insights and reformulate the transductive influencer detection problem into an inductive influencer prediction problem for imbalanced networks that allows for re-influencing among the existing network nodes and aims to generalize to nodes not seen by the network. We also broaden the analysis by incorporating a more sophisticated GNN encoder, i.e., the Graph Isomorphism Network, and employ focal loss, which has not been researched before for the influencer prediction task.

Problem definition
Following the design pipeline developed by Zhou et al. (2020), we define the task of influencer detection in this paper as supervised node-level learning on a dynamic heterogeneous undirected network. That is to say, we consider a dynamic network as an ordered sequence of $T$ undirected network snapshots $G = \langle G_1, G_2, \dots, G_T \rangle$, $G_t = (V^t, E^t)$, where $V^t = \{v^t_1, \dots, v^t_n\}$ is the node set and $E^t \subseteq V^t \times V^t$ is the edge set, $t \in \{1, \dots, T\}$. Nodes are labelled: a label vector $L^t = \{l^t_1, \dots, l^t_n\}$, $l^t_i \in \{0, 1\}$, is constructed such that the value 1 labels an influencer customer using the intuition described in Section 4.1, and 0 otherwise. Nodes are characterised by a set of node features $X^t \in \mathbb{R}^{n \times F}$, where $n$ is the number of nodes and $F$ is the number of features in each node. Edges are colored according to the type of connection between nodes, i.e., each $e^t_{uv} \in E^t$ has $\vec{c}^{\,t}_{vu} \in \{0, 1\}$. The task is node classification, i.e., predicting a label for the nodes from a node set $V^{T+k}$ at time $T + k$: we learn a representation vector $h_v$ of a node $v$ and subsequently use $h_v$ in a downstream task of predicting the node's label $l^{T+k}_v$.

GNN-RNN configurations
Tiukhova et al. (2022) conducted a study on influencer detection and found that a Graph Attention Network (GAT) encoder consistently outperforms Graph Convolutional Networks (GCN) (Kipf & Welling, 2017) on this task. Based on these results, we do not include GCN as an encoder in our approach.
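Introduced by Veličković et al. (2018), GATs are composed of layers where nodes attend to their neighbors, employing a self-attention mechanism to assign importance to each connection in the network. By stacking these attention layers, the implicit weights to different neighboring nodes are specified, even without knowing the entire graph structure. Brody, Alon, and Yahav (2022) introduced a modified version of GAT, known as GATv2, that calculates dynamic graph attention and is more expressive than the original GAT. In GATv2, the ranking of attention scores is conditioned on the query node, enabling every node to attend to any other node. In a single layer of GATv2, node features $\mathbf{x} = \{\vec{x}_1, \dots, \vec{x}_n\}$, $\vec{x}_i \in \mathbb{R}^F$, as well as edge features $\vec{e}_{ij}$ (if any), are taken as input. The layer produces a set of node embeddings $\mathbf{x}' = \{\vec{x}_1', \dots, \vec{x}_n'\}$, $\vec{x}_i' \in \mathbb{R}^{F'}$. To do this, each node attends to its one-hop neighbors, including itself, by calculating attention coefficients $\alpha_{ij}$ (as shown in Equation 1).

$$\alpha_{ij} = \frac{\exp\left(\vec{a}^{\top}\,\mathrm{LeakyReLU}\left(W\,[\vec{x}_i \,\Vert\, \vec{x}_j \,\Vert\, \vec{e}_{ij}]\right)\right)}{\sum_{l \in \mathcal{N}(i) \cup \{i\}} \exp\left(\vec{a}^{\top}\,\mathrm{LeakyReLU}\left(W\,[\vec{x}_i \,\Vert\, \vec{x}_l \,\Vert\, \vec{e}_{il}]\right)\right)}, \qquad \vec{x}_i' = \Big\Vert_{k=1}^{K}\, \sigma\Big(\sum_{j \in \mathcal{N}(i) \cup \{i\}} \alpha^{k}_{ij}\, W^{k} \vec{x}_j\Big) \tag{1}$$

where $\vec{a}$ is the weight vector of the attention mechanism network, $W$ is a learnable weight matrix, $\mathcal{N}(i)$ is the set of nodes adjacent to $i$, $\sigma$ is a nonlinearity, $\Vert$ is concatenation, and $\alpha^k_{ij}$ are normalized attention coefficients computed by the $k$-th attention mechanism. Once the attention coefficients are obtained, the final embeddings $\vec{x}_i'$ can be calculated. In this paper, we apply multi-head attention, where $K$ independent attention mechanisms are calculated and the final embedding outputs are concatenated (Equation 1).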
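As a small illustration of how the attention coefficients of one node are normalized over its one-hop neighborhood, the sketch below computes single-head GATv2-style scores in plain Python. It is a simplified sketch under our own assumptions: edge features are omitted, the LeakyReLU slope of 0.2 is a common default rather than a value reported here, and the toy weights `W` and `a` are arbitrary.

```python
import math


def leaky_relu(vec, slope=0.2):
    return [x if x > 0 else slope * x for x in vec]


def matvec(W, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in W]


def gatv2_attention(i, neighbors, X, W, a):
    """Normalized attention of node i over its neighbors and itself."""
    targets = [i] + [j for j in neighbors if j != i]
    scores = []
    for j in targets:
        # X[i] + X[j] is list concatenation, i.e. [x_i || x_j].
        hidden = leaky_relu(matvec(W, X[i] + X[j]))
        scores.append(sum(ak * hk for ak, hk in zip(a, hidden)))
    # Softmax over the neighborhood (shifted by the max for stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return {j: e / total for j, e in zip(targets, exps)}


# Toy inputs: three nodes with two features each, hidden size 3.
X = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
W = [[0.5, -0.2, 0.1, 0.3], [0.0, 0.4, -0.3, 0.2], [0.2, 0.1, 0.0, -0.5]]
a = [1.0, -1.0, 0.5]
alpha = gatv2_attention(0, [1, 2], X, W, a)
print(alpha)
```

Because the nonlinearity is applied before the dot product with `a`, the ranking of neighbors can differ from one query node to another, which is the "dynamic" attention property that distinguishes GATv2 from the original GAT.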
Recent advancements in GNN architectures have focused on maximizing the representational power of learned embeddings. Graph Isomorphism Networks (GINs), introduced by Xu et al. (2018), have been proven to be as powerful as the Weisfeiler-Lehman graph isomorphism test (WL test) (Leman & Weisfeiler, 1968). GINs are thus considered to be among the most powerful GNNs in terms of discriminative and representational power (Xu et al., 2018). Graph isomorphism is closely related to the problem of graph representation: isomorphic graphs should be mapped to the same representation, while non-isomorphic graphs should have different representations. The GNN with the highest representation power can distinguish between different graph structures by producing distinct embeddings for these structures in the embedding space (Xu et al., 2018). To achieve the same level of power as the WL test, a GNN's neighbor aggregation must be injective. GINs employ a Multi-Layer Perceptron (MLP) to model and learn these injective functions, as shown in Equation 2 (Xu et al., 2018).

$${x_i'}^{(k)} = \mathrm{MLP}^{(k)}\Big((1 + \epsilon) \cdot h_i^{(k-1)} + \sum_{j \in \mathcal{N}(i)} h_j^{(k-1)}\Big) \tag{2}$$

where $\epsilon$ determines the importance of the target node compared to its neighbors, $h_i^{(k-1)}$ is the feature vector of node $i$ at layer $k - 1$, $h_i^{(0)}$ represents node $i$'s features, $\mathcal{N}(i)$ is the set of nodes adjacent to $i$, and ${x_i'}^{(k)}$ represents the final embeddings at layer $k$. $\epsilon$ is set to 0 following the findings of Xu et al. (2018), where learning $\epsilon$ yielded no gain in fitting training data compared to fixing it to 0.
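The role of injective aggregation can be seen in a toy example. The sketch below implements only the sum-based aggregation inside the GIN update (the MLP applied afterwards is omitted) and compares it with mean aggregation; the function names and the one-dimensional features are illustrative assumptions.

```python
def gin_aggregate(node, neighbors, h, eps=0.0):
    """(1 + eps) * h[node] plus the element-wise sum of neighbor features."""
    dim = len(h[node])
    out = [(1.0 + eps) * h[node][d] for d in range(dim)]
    for j in neighbors:
        for d in range(dim):
            out[d] += h[j][d]
    return out


def mean_aggregate(neighbors, h):
    """Element-wise mean of neighbor features, shown for comparison."""
    dim = len(h[neighbors[0]])
    return [sum(h[j][d] for j in neighbors) / len(neighbors) for d in range(dim)]


# Node 0 with two versus three identical neighbors.
h = {0: [0.0], 1: [1.0], 2: [1.0], 3: [1.0]}
print(mean_aggregate([1, 2], h), mean_aggregate([1, 2, 3], h))      # identical
print(gin_aggregate(0, [1, 2], h), gin_aggregate(0, [1, 2, 3], h))  # different
```

Mean aggregation maps both neighborhoods to the same value, so the two structures become indistinguishable, whereas the sum keeps them apart.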
While the GNN architectures mentioned earlier have been proven effective, they are static and cannot capture changes in the network that occur over time. To address this limitation, we incorporate Recurrent Neural Networks (RNNs) into the decoder component of our framework, specifically using the Long Short-Term Memory (LSTM) model (Hochreiter & Schmidhuber, 1997) and the Gated Recurrent Unit (GRU) model (Cho et al., 2014). By leveraging these RNNs, our model can account for temporal dynamics in the network and make accurate predictions even as the network changes over time.
The LSTM model is well-suited to learning long-term dependencies, which is particularly important for dynamic graph networks. At its core, an LSTM cell comprises several gates, including the input gate $i_t$, forget gate $f_t$, cell gate $g_t$, and output gate $o_t$ (see Equation 3) (Hochreiter & Schmidhuber, 1997). These gates allow the cell state, which can be thought of as the network's long-term memory, to be modified at each timestamp $t$ (see Equation 3). Additionally, the LSTM cell receives the hidden state from the previous timestamp $t - 1$ as an input at timestamp $t$ (see Equation 3). By leveraging these operations and the cell state, the LSTM can effectively capture and learn from long-term patterns in the dynamic graph network.

$$\begin{aligned}
i_t &= \sigma(W_{ii} x_t + b_{ii} + W_{hi} h_{t-1} + b_{hi}) \\
f_t &= \sigma(W_{if} x_t + b_{if} + W_{hf} h_{t-1} + b_{hf}) \\
g_t &= \tanh(W_{ig} x_t + b_{ig} + W_{hg} h_{t-1} + b_{hg}) \\
o_t &= \sigma(W_{io} x_t + b_{io} + W_{ho} h_{t-1} + b_{ho}) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned} \tag{3}$$

where $x_t$ is the input (the output embeddings $x_i'$ of a GNN model serve as an input to the LSTM model at timestamp $t$), $h_{t-1}$ is the hidden state of the layer at time $t - 1$, $c_t$ is the cell state, $W$ and $b$ are learnable weights and biases, $\sigma$ is the sigmoid function, $\tanh$ is the hyperbolic tangent function, and $\odot$ is the Hadamard product.
The LSTM's operations enable it to either remember or forget information, and to update its "memory" by updating the cell state. The forget gate uses the sigmoid function to decide which information to retain and which to discard, while the input and cell gates use the input and previous hidden state to update the cell state by applying sigmoid and hyperbolic tangent functions. The cell state can then be updated with the outputs of the input, cell, and forget gates (refer to Equation 3). Finally, the output gate, together with the cell state, is used to update the hidden state.
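The gate interplay described above can be traced in a single LSTM step. The following sketch uses scalar inputs and states to keep the arithmetic visible; the parameter layout (one `(w_x, w_h, b)` triple per gate, with the two bias terms merged into one) is our simplification and not the layout of any particular library.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x_t, h_prev, c_prev, params):
    """One scalar LSTM step; params maps gate name -> (w_x, w_h, b)."""

    def pre_activation(gate):
        w_x, w_h, b = params[gate]
        return w_x * x_t + w_h * h_prev + b

    i_t = sigmoid(pre_activation("i"))    # input gate
    f_t = sigmoid(pre_activation("f"))    # forget gate
    g_t = math.tanh(pre_activation("g"))  # cell gate (candidate memory)
    o_t = sigmoid(pre_activation("o"))    # output gate
    c_t = f_t * c_prev + i_t * g_t        # keep part of the old memory, add new
    h_t = o_t * math.tanh(c_t)            # expose part of the memory
    return h_t, c_t


# With all-zero parameters every sigmoid gate equals 0.5 and g_t equals 0,
# so the cell state is simply halved.
zero_params = {gate: (0.0, 0.0, 0.0) for gate in "ifgo"}
h_t, c_t = lstm_step(x_t=1.0, h_prev=0.0, c_prev=2.0, params=zero_params)
print(h_t, c_t)
```

In the GNN-RNN configurations, `x_t` would be a node's GNN embedding at snapshot `t`, and the step would be applied once per snapshot.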
The GRU model, which is a simpler version of the LSTM, consists of two primary gates: the reset gate $r_t$ and the update gate $z_t$ (see Equation 4) (Cho et al., 2014). Unlike the LSTM, the GRU does not have a cell state and stores long-term memory directly in the hidden states. As with the LSTM model, the GRU model takes the output embeddings $x_i'$ of a GNN model as input.

$$\begin{aligned}
r_t &= \sigma(W_{ir} x_t + b_{ir} + W_{hr} h_{t-1} + b_{hr}) \\
z_t &= \sigma(W_{iz} x_t + b_{iz} + W_{hz} h_{t-1} + b_{hz}) \\
n_t &= \tanh\left(W_{in} x_t + b_{in} + r_t \odot (W_{hn} h_{t-1} + b_{hn})\right) \\
h_t &= (1 - z_t) \odot n_t + z_t \odot h_{t-1}
\end{aligned} \tag{4}$$

where $x_t$ is the input at $t$, $h_{t-1}$ is the hidden state of the layer at $t - 1$, $n_t$ is the new gate used to update a hidden state using a reset gate, $W$ and $b$ are learnable weights and biases, $\sigma$ is the sigmoid function, $\tanh$ is the hyperbolic tangent function, and $\odot$ is the Hadamard product. Equation 4 contains the calculations of the reset gate, update gate, and new gate, which are integral to updating the hidden state. The update gate operates similarly to the input gate in LSTM models, and the reset gate functions in a manner comparable to the forget gate. Each hidden state is updated with the previous hidden state values and the new gate value.
In order to get final predictions, we require an additional Fully Connected Network (FCN) on top of the GNN-RNN configuration. The FCN consists of two linear layers followed by ReLU and sigmoid activation functions, respectively. It takes the node embeddings represented by the hidden states of the last output layer of the LSTM/GRU models as input and outputs a final probability of whether a node is an influencer or not. The aforementioned architecture is displayed in Figure 1a.
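The decoder described above, a recurrent update over the snapshot sequence followed by a two-layer FCN, can be sketched end to end for a single node. The scalar states, the parameter layout, and all numeric values below are illustrative assumptions; a real model operates on embedding vectors with learned weights.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(x_t, h_prev, params):
    """One scalar GRU step; params maps gate name -> (w_x, w_h, b_x, b_h)."""

    def parts(gate):
        w_x, w_h, b_x, b_h = params[gate]
        return w_x * x_t + b_x, w_h * h_prev + b_h

    r_in, r_hid = parts("r")
    z_in, z_hid = parts("z")
    n_in, n_hid = parts("n")
    r_t = sigmoid(r_in + r_hid)          # reset gate
    z_t = sigmoid(z_in + z_hid)          # update gate
    n_t = math.tanh(n_in + r_t * n_hid)  # new gate
    return (1.0 - z_t) * n_t + z_t * h_prev


def fcn_head(h, w1, b1, w2, b2):
    """Linear -> ReLU -> Linear -> sigmoid applied to a scalar hidden state."""
    hidden = [max(0.0, w * h + b) for w, b in zip(w1, b1)]
    logit = sum(w * v for w, v in zip(w2, hidden)) + b2
    return sigmoid(logit)


def predict(embeddings, gru_params, head_params, h0=0.0):
    """Run the GRU over one node's per-snapshot embeddings, then classify."""
    h = h0
    for x_t in embeddings:
        h = gru_step(x_t, h, gru_params)
    return fcn_head(h, *head_params)


gru_params = {"r": (0.5, 0.1, 0.0, 0.0), "z": (0.3, -0.2, 0.0, 0.0),
              "n": (1.0, 0.5, 0.0, 0.0)}
head_params = ([1.0, -1.0], [0.0, 0.0], [0.5, 0.5], 0.0)
probability = predict([0.2, 0.4, 0.6], gru_params, head_params)
print(probability)
```

Only the hidden state after the last snapshot is passed to the FCN, which mirrors the use of the last output layer's hidden states described above.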

Baseline configurations - Features(+PR)-RNN
Following the strategy used in the research by Óskarsdóttir et al. (2017), we can enrich non-relational classifiers with network features. One such network feature that can summarize node importance is PageRank (PR) (Brin & Page, 1998). Originally developed to measure the importance of web pages, PR can be used to measure the importance of nodes based on the number of incoming edges and their importance. The PageRank centrality measure and variants of it have been shown to provide added value to predictive models in both fraud detection (Óskarsdóttir et al., 2022; Van Vlasselaer et al., 2015) and credit risk (Óskarsdóttir & Bravo, 2021). Following the assumption of the PR algorithm for web pages, the importance of a node is based on the importance of the nodes linked to it (Equation 5).

$$PR(u) = (1 - d) + d \sum_{v \in B_u} \frac{PR(v)}{C(v)} \tag{5}$$

where $B_u$ is the set of nodes linking to $u$, $d$ is a damping factor set to 0.85, and $C(v)$ is the out-degree of a node $v$.
The non-personalized PR values can be calculated for each node. As the PR value represents the relative importance of a node within one component, we calculate its value separately within each of the components. As the network is attributed, PR values can be seen as an additional feature of the node, and this enriched feature set can be seen as an encoder part that can subsequently be used for dynamic node classification with RNNs. Static node features together with PR are used as input to the RNN model. Similarly to GNNs in Section 2.1.1, two types of RNN can be utilized, namely LSTM (Hochreiter & Schmidhuber, 1997) and GRU (Cho et al., 2014). The architecture of the PageRank + RNN models is displayed in Figure 1b.
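A minimal sketch of computing PageRank separately within each connected component is given below. It is not the implementation used for the experiments: it assumes an undirected adjacency-list input, uses plain power iteration, and adopts the common normalized variant in which the teleport term is divided by the component size so that scores sum to one within each component. Isolated nodes are given a score of 1 by convention.

```python
from collections import deque


def connected_components(adj):
    """Breadth-first search over an undirected adjacency dict."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), []
        while queue:
            u = queue.popleft()
            component.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        components.append(component)
    return components


def pagerank_per_component(adj, d=0.85, iters=100):
    """Power iteration run independently inside every component."""
    scores = {}
    for component in connected_components(adj):
        n = len(component)
        pr = {u: 1.0 / n for u in component}
        if n > 1:
            for _ in range(iters):
                # In an undirected graph the out-degree of v is len(adj[v]).
                pr = {u: (1 - d) / n
                         + d * sum(pr[v] / len(adj[v]) for v in adj[u])
                      for u in component}
        scores.update(pr)
    return scores


# A star (c with leaves x, y, z), a separate pair (p, q), and an isolated node s.
adj = {"c": ["x", "y", "z"], "x": ["c"], "y": ["c"], "z": ["c"],
       "p": ["q"], "q": ["p"], "s": []}
scores = pagerank_per_component(adj)
print(scores)
```

The resulting score can be appended to each node's feature vector before the features are passed to the RNN.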

Baseline configurations - Static GNN
This configuration consists of a GNN encoder similar to the one in the GNN-RNN configuration described in Section 3.2.1, while its decoder only contains an FCN that is fed with the embeddings produced by the GNN model. We utilize the GAT and GIN models described in Section 3.2.1 as the GNN encoder. The architecture of the static GNN model is displayed in Figure 1c.

Table 1: Connection types in the network.
Connection type | Description
Credit card | There exists an edge between customers who used the same credit card in the delivery app.
Geohash | There exists an edge between customers who ordered more than 4 times in close geographical proximity (Geohash 7, 152.9 m × 152.4 m).
Contact | There exists an edge between customers if one of them has the other one in their phone contacts.

4 Experimental setup

Networks
The data used in this study is sourced from a Super-App company that operates in Latin America, providing both a delivery app and credit card services to its customers. The data covers a period of nine months, during which aggregated information about delivery app usage was used to create monthly snapshots of the network, while credit card usage data was utilized to construct node features (see online Appendix C). The network includes only credit card customers with three types of connections based on the delivery app usage as outlined in Table 1. To account for the various types of connections, edge features are encoded with information about the connection type(s), allowing for the representation of multiple connection types per edge. The network is dynamic, with new nodes and edges appearing over time, while existing nodes and connections persist, leading to its continuous growth.
Our analysis is based on data from three different cities, each represented by a separate network. The network characteristics are described in Figures 2a, 2b, and 2c for City 1, City 2, and City 3, respectively. The cities were chosen to provide a diverse reference of population size and income, representing a range from middle-income to high-income countries, with per capita Purchasing Power Parity-adjusted Gross Domestic Products (PPP GDP) ranging from $10,000 to $35,000. The network sizes vary, with City 1 having the smallest network and City 3 having the largest. Over time, all three networks have grown in terms of both the number of nodes and edges. However, we should note that the proportion of influencers among the nodes is relatively low, ranging from approximately 1.5% in the last month to around 5% in the first month. As a result, the data is heavily imbalanced.
The labels of nodes in the network are determined based on referral information from credit card applications. A referrer is always a node in the network, while the referred person may or may not be in the network depending on the timing of getting a credit card. Specifically, a node is labeled as an influencer at the time of the actual referral and remains labeled as such until the referred person obtains a credit card, i.e., until the point when the company has access to their usage data, with a maximum of six months. At that point, the influencer label is changed to a non-influencer label, unless the customer has made other referrals in the past six months. If a referral does not result in a credit card application within six months, the referrer is not considered an influencer [...]. To maintain the connectivity expressed by the referral, an edge of type "Contact" is created between the referrer and the referred person. This involves adding the referred person as a node in the network and connecting them to the referrer. Figure 3 illustrates the aforementioned labeling process.

Data preprocessing
Apart from the network construction described in Section 4.1, some data preprocessing is required for the node features data. Numerical features are normalized using min-max normalization to stabilize learning and make the convergence faster. Categorical features are one-hot encoded. As mentioned in Section 4.1, new nodes can appear in the network over time. To account for that, we add artificial node features for the future nodes with features that are zeroed out. These features are added to account for new future nodes and are not used in the backpropagation process during training.
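The three preprocessing steps can be sketched as follows. The helper names, the toy values, and the use of a boolean mask to keep placeholder nodes out of the loss are illustrative assumptions rather than the exact implementation; in practice the minimum and maximum would be estimated on training data only.

```python
def min_max_normalize(values):
    """Scale a numerical feature to [0, 1]; constant features map to 0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(values):
    """One-hot encode a categorical feature; returns rows and category order."""
    categories = sorted(set(values))
    index = {c: k for k, c in enumerate(categories)}
    rows = [[1 if index[v] == k else 0 for k in range(len(categories))]
            for v in values]
    return rows, categories


def pad_future_nodes(features, num_future):
    """Append zeroed-out rows for nodes that only join in later snapshots.

    The mask marks real nodes, so placeholder rows can be excluded from the
    loss and therefore from backpropagation.
    """
    width = len(features[0])
    padded = features + [[0.0] * width for _ in range(num_future)]
    mask = [True] * len(features) + [False] * num_future
    return padded, mask


print(min_max_normalize([10, 20, 30]))
print(one_hot(["gold", "basic", "gold"]))
print(pad_future_nodes([[0.5, 1.0]], num_future=2))
```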
To address the issue of imbalance described in Section 4.1, we utilize the GraphSMOTE approach (Zhao et al., 2021). Specifically, we employ the synthetic node generation method from GraphSMOTE, which performs oversampling on top of the embedding space constructed by the feature extractor. In our case, we oversample the final embeddings generated by the RNN model for the GNN-RNN model configurations or by the GNN encoder in static GNN models.
We chose an oversampling scale of 0.5, as it has been demonstrated that scales below 0.8 lead to better GNN performance (Zhao et al., 2021). We utilize this approach to tackle class imbalance and to determine the need for oversampling in GNNs. Consequently, we have included a positive oversampling ratio as a hyperparameter specification, but we also allow for a non-oversampling strategy with a ratio of 0.
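The interpolation at the heart of SMOTE-style node generation can be sketched in a few lines. This is a simplified illustration, not the GraphSMOTE code: it works on plain lists of minority-class embeddings, interprets the oversampling scale as the number of synthetic embeddings per existing minority embedding (an assumption), and needs at least two minority embeddings whenever the scale is positive.

```python
import random


def smote_oversample(minority, scale, rng):
    """Create int(scale * len(minority)) synthetic minority embeddings.

    Each synthetic point lies on the segment between a randomly chosen
    minority embedding and its nearest minority neighbor.
    """

    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(int(scale * len(minority))):
        i = rng.randrange(len(minority))
        anchor = minority[i]
        nearest = min((k for k in range(len(minority)) if k != i),
                      key=lambda k: squared_distance(anchor, minority[k]))
        delta = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([a + delta * (b - a)
                          for a, b in zip(anchor, minority[nearest])])
    return synthetic


rng = random.Random(0)
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
synthetic = smote_oversample(minority, scale=0.5, rng=rng)
print(synthetic)
```

With a scale of 0 the function returns an empty list, which corresponds to the non-oversampling strategy.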

Model specifications
The models discussed in Section 3.2 were all implemented using the PyTorch Geometric library (Fey & Lenssen, 2019). For both the GNN-RNN models (Section 3.2.1) and Static GNN models (Section 3.2.3), we employed two regularization techniques that are commonly used for GNNs to enhance their performance and generalization capabilities. Specifically, we applied layer normalization and a 50% dropout rate. In addition, the GNN portion of the models uses the Exponential Linear Unit (ELU) as its activation function.

Loss functions
In Section 4.1, we discussed how predicting influencers poses a challenge due to the imbalance between influencers and non-influencers in the network.The traditional binary cross-entropy (BCE) loss is not suitable for handling class imbalance, since it treats both classes equally.Our primary goal is to accurately predict the positive minority class of influencers, but binary cross-entropy prioritizes overall accuracy rather than class-specific performance.Therefore, a tailored binary cross-entropy loss function is necessary to handle class imbalance and weight positive (minority) examples accordingly.By using such a loss function, we can balance recall and precision trade-offs (see Equation 6).
where c = 1 for single-label binary classification, n is the number of samples, $p_c = n_{neg} / n_{pos}$ is the weight of the positive class, $n_{neg}$ is the number of samples in the negative (majority) class, $n_{pos}$ is the number of samples in the positive (minority) class, and σ is the sigmoid function.
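A minimal NumPy sketch of a positive-weighted binary cross-entropy that is consistent with the symbols defined above. It assumes the common form in which the weight $p_c$ multiplies only the positive-class term and the loss is averaged over the n samples; the function name `weighted_bce` is hypothetical. The same weighting is exposed in PyTorch through the `pos_weight` argument of `torch.nn.BCEWithLogitsLoss`.

```python
import numpy as np

def weighted_bce(logits, targets, pos_weight):
    """Mean binary cross-entropy with the positive term scaled by pos_weight.

    logits     : raw model outputs x (before the sigmoid)
    targets    : 0/1 labels y
    pos_weight : p_c, e.g. n_neg / n_pos
    """
    logits = np.asarray(logits, dtype=float)
    targets = np.asarray(targets, dtype=float)
    # Numerically stable log(sigmoid(x)) and log(1 - sigmoid(x)).
    log_p = -np.logaddexp(0.0, -logits)
    log_not_p = -np.logaddexp(0.0, logits)
    loss = -(pos_weight * targets * log_p + (1.0 - targets) * log_not_p)
    return loss.mean()
```

For example, with 10 influencers among 1,000 nodes, `pos_weight` would be 990 / 10 = 99, so each misclassified influencer contributes 99 times more to the loss than a misclassified non-influencer with the same predicted probability of the true class.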
In order to surpass the simple inverse sample weighting technique, Cui, Jia, Lin, Song, and Belongie (2019) proposed a modified version of the focal loss function that addresses the challenge of class imbalance in a more sophisticated way. This new class-balanced focal loss function accounts for both the effective number of samples and the relative loss for well-classified samples (Cui et al., 2019; Lin, Goyal, Girshick, He, & Dollár, 2017). By adjusting the class-balanced term between no re-weighting and re-weighting by inverse class frequency, the class-balanced focal loss is capable of handling class imbalance more effectively than previous methods (see Equation 7) (Cui et al., 2019).
where C is the total number of classes, $n_c$ is the number of samples in the ground-truth class c, β ∈ [0, 1) is the hyperparameter of the effective-number-of-samples term, and γ is the focusing hyperparameter from the focal loss definition.
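A minimal NumPy sketch of the class-balanced focal loss for the binary, single-logit case. It assumes the weight $(1 - β) / (1 - β^{n_c})$ for the ground-truth class and the focal modulation $(1 - p_t)^γ$, averaged over samples; the normalization details of the reference implementation by Cui et al. (2019) may differ, and the function name is hypothetical.

```python
import numpy as np

def class_balanced_focal_loss(logits, targets, beta=0.999, gamma=2.0):
    """Class-balanced focal loss for binary labels and a single logit per sample."""
    logits = np.asarray(logits, dtype=float)
    targets = np.asarray(targets, dtype=int)
    # n_c: number of samples in each class (index 0 = negative, 1 = positive).
    counts = np.bincount(targets, minlength=2).astype(float)
    # Effective number of samples is (1 - beta^n_c) / (1 - beta); the weight is its inverse.
    effective = 1.0 - np.power(beta, counts)
    class_weight = (1.0 - beta) / np.maximum(effective, 1e-12)
    # p_t is the predicted probability of the true class.
    signed = np.where(targets == 1, logits, -logits)
    log_pt = -np.logaddexp(0.0, -signed)
    pt = np.exp(log_pt)
    # Well-classified samples (p_t close to 1) are down-weighted by (1 - p_t)^gamma.
    loss = -class_weight[targets] * (1.0 - pt) ** gamma * log_pt
    return loss.mean()
```

Setting β = 0 and γ = 0 recovers the unweighted binary cross-entropy, which is a convenient sanity check; as β approaches 1, the class weight approaches the inverse class frequency.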
As part of our model optimization process, we carefully select the loss function based on the performance on validation data. We explore the two aforementioned loss function configurations: weighted binary cross-entropy and class-balanced focal loss. By experimenting with these options, we aim to find the loss function that best fits our specific problem and maximizes our model's detection performance.

Hyperparameter tuning
We train the models for 300 epochs with a learning rate of 0.0001 using the Adam optimizer (Kingma & Ba, 2015), as we observed that this combination led to the convergence of all the aforementioned models. The best model configuration is searched via exhaustive grid search over the hyperparameter space (Table 2). The number of heads in GAT is fixed to 2 to maintain manageable computational complexity, and the dropout rate for GNNs is set at 0.5. We also set the hyperparameters of the focal loss (see Section 4.4) γ and β to 2 and 0.999, respectively, as these values have been found to perform best (Cui et al., 2019; Lin et al., 2017). We pick the best model based on the average of AUPRC values on seen and unseen nodes (Section 4.6) as well as on the stability of the training process, i.e., we make sure that the model converges well on both training and validation sets and does not overfit.
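The selection rule described above can be sketched as an exhaustive search over the Cartesian product of a hyperparameter grid, scoring each configuration by the average of its validation AUPRC on seen and unseen nodes. This is a minimal sketch: the function names and the grid keys in the usage note are hypothetical and are not taken from Table 2, and the manual check of training stability is not modelled.

```python
from itertools import product

def grid_search(grid, evaluate):
    """Return (best_config, best_score) over all combinations in `grid`.

    grid     : dict mapping a hyperparameter name to a list of candidate values
    evaluate : callable taking a config dict and returning
               (auprc_seen, auprc_unseen) measured on validation data
    """
    names = list(grid)
    best_config, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        auprc_seen, auprc_unseen = evaluate(config)
        # Selection criterion: mean AUPRC over seen and unseen nodes.
        score = 0.5 * (auprc_seen + auprc_unseen)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score
```

A call such as `grid_search({"hidden_dim": [16, 32], "oversampling_ratio": [0.0, 0.5]}, evaluate)` would train and score four configurations, where `evaluate` stands for a user-supplied training-and-validation routine.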

Performance metrics
A widely used metric for evaluating the classification performance of a model is the Receiver Operating Characteristic (ROC) curve, which plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at different classification thresholds. The Area Under the ROC Curve (AUC) is often calculated to compare the performance of different models, with a range of values from 0 to 1 and a baseline of 0.5. However, in the presence of class imbalance, AUC may not be a reliable indicator of model performance. This is because the FPR can remain low even when the number of false positives is high due to the large number of negatives, which can result in an overly optimistic view of a model's performance (Saito & Rehmsmeier, 2015).
To properly compare model performance in imbalanced settings, it is necessary to use metrics that take into account the skewed class distribution. The Precision-Recall (PR) curve provides a more informative view of model performance in tasks with a large skew in the class distribution (Davis & Goadrich, 2006). Similar to the ROC curve, the PR curve summarizes performance over different thresholds. However, it focuses on correctly predicting positives by using precision and recall as its axes. To facilitate model comparison, the Area Under the PR Curve (AUPRC) can be calculated with the baseline corresponding to the fraction of positives, resulting in different baselines for different classes.
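As a minimal illustration, the AUPRC can be estimated with the average-precision summary, i.e., the mean of the precision values at the ranks where a positive is retrieved. The sketch below assumes that this step-wise estimator is acceptable and that ties in the scores are broken by input order; the function name is hypothetical, and library implementations may handle ties differently.

```python
import numpy as np

def average_precision(y_true, scores):
    """Step-wise estimate of the area under the precision-recall curve."""
    y_true = np.asarray(y_true, dtype=int)
    # Rank samples from the highest to the lowest score.
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    hits = y_true[order]
    true_positives = np.cumsum(hits)
    precision = true_positives / np.arange(1, len(hits) + 1)
    # Recall increases by 1 / n_pos at every positive, so the area is the
    # mean precision taken at the positions of the positives.
    return precision[hits == 1].mean()

y = np.array([1, 0, 1, 0])
s = np.array([0.9, 0.8, 0.7, 0.1])
ap = average_precision(y, s)   # (1/1 + 2/3) / 2
baseline = y.mean()            # fraction of positives
```

In this toy case the baseline is 0.5; with, say, 1% influencers the baseline would be 0.01, which is why AUPRC values should be read relative to the class prior rather than relative to 0.5.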
While the AUPRC metric is our primary decision criterion for evaluating model performance, we also report the AUC metric as it is commonly used in the research community. We calculate and report the values of both these metrics separately for seen and unseen nodes of the network. The term "seen nodes" refers to the nodes that the model encountered during backpropagation in training, whereas "unseen nodes" refers to nodes that were not part of the training set and only appear in the network in the future. Hence, the "unseen" nodes for the validation data are the nodes that appeared in the last month of the validation window, while the "unseen" nodes for the test data include new nodes that appeared in the last month of the test window as well as new nodes from the last month of the validation window, as none of them were part of the training data. Model performance on seen nodes shows the model's applicability for transductive learning, which focuses on making predictions specifically for the given instances in the training data without generalizing to unseen instances, whereas performance on unseen nodes highlights the ability to generalize from observed data to make predictions on unseen instances. By reporting these metrics for both seen and unseen nodes, we can better understand how well our model generalizes to new data.
In addition to detection performance metrics, we report the training time and an estimate of the carbon footprint of training the model, measured in grams of CO2 equivalent (gCO2eq), as we also want to evaluate the trade-off between energy efficiency and prediction performance (Anthony, Kanding, & Selvan, 2020).

GNN-RNN models
To evaluate the performance of a model and ensure that it generalizes to new data, we employ a rolling window strategy with a window size of three and a shift of one to split the data into train, validation, and test subsets (refer to Figure 4). The train data comprises seven months, resulting in five three-month windows. Within each window, we apply the GNN-RNN framework described in Section 3.2.1 to make predictions for the last month. We use a network snapshot of each month as an input to the GNN encoder, which produces embeddings that are subsequently fed into the RNN decoder. To initialize the input hidden states for the RNN decoder at the first time window, we use a tensor of ones. For all subsequent time windows, the hidden state produced by the first month of the previous time window is used as the input hidden state. The output produced by the RNN decoder at timestamp t is used as the input hidden state at timestamp t + 1, along with the embeddings produced by the GNN encoder with an input of the network snapshot at timestamp t + 1. At the final timestamp of the window, the hidden states produced by the RNN decoder are fed into the FCN decoder to generate final probabilities, and backpropagation occurs during training.
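The window arithmetic above (seven training months yield five three-month windows with a shift of one) can be checked with a short sketch; the function name `rolling_windows` and the month indices are illustrative assumptions.

```python
def rolling_windows(months, size=3, shift=1):
    """Split an ordered list of monthly snapshots into overlapping windows."""
    return [months[i:i + size] for i in range(0, len(months) - size + 1, shift)]

train_months = list(range(1, 8))          # seven training months, indexed 1..7
windows = rolling_windows(train_months)   # [[1, 2, 3], [2, 3, 4], ..., [5, 6, 7]]
```

Within each window, the prediction target is the last month, so the five training windows target months 3 to 7.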
Both validation and test data consist of one three-month window. The trained models are applied on a validation set that is used to measure the generalization capability of the model and to check if it is overfitting or underfitting during training by looking at the validation loss. Additionally, a validation set is utilized to evaluate the model's performance with different hyperparameter settings in order to choose the best one (see Section 4.5). A test set is used for the final comparison of different model configurations (with the best hyperparameter setting chosen on a validation set) to get an estimate of the model's generalization performance.

Features(+PR)-RNN models
The process of splitting the data into train, validation, and test subsets and the training methodology for the Features(+PR)-RNN models are similar to those of the GNN-RNN models, with the only difference being the encoder part. In this case, instead of using GNN embeddings, the encoder solely comprises node features (including the PageRank feature for PR configurations) that are fed directly into the RNN decoder. The remaining training process follows the steps outlined in Section 4.7.1 (Figure 4).

Static GNN models
Unlike DGNN models that incorporate RNN components to account for changes over time, static GNN models do not consider dynamics. The training data for the static model spans seven months, with predictions made for each month, and the model's training performance is averaged across these months. The validation and test sets consist of data from a single month each, with predictions made for their respective months (Figure 5).

Results
Table 3 presents the performance of the models on the smallest network considered. The results demonstrate that GIN-RNN models (including GIN-LSTM and GIN-GRU) perform similarly to dynamic non-GNN models (such as Feat.(+PR)-GRU, Feat.(+PR)-LSTM, Feat.-LSTM, and Feat.-GRU) in terms of AUC for both seen and unseen nodes, with GIN-GRU being the best-performing model for seen nodes. The AUPRC values for seen nodes are also comparable across these model types, with GAT-LSTM achieving the highest value. Notably, the Static GIN model delivers exceptional performance in terms of AUC on unseen nodes. Nevertheless, the highest AUPRC values for unseen nodes are obtained by the GIN-LSTM and GIN-GRU models. The GIN-RNN models obtain time performance comparable to the baseline models, while the GAT-RNN models take the longest to train due to the expensive attention calculation. The same goes for the CO2 emissions produced by the models: the GAT-RNN models produce the highest amount of emissions during training.
Next, Table 4 displays the performance metrics for the models applied on the bigger network of City 2. Similarly to the results for City 1, the GNN-RNN models obtain similar results to the dynamic non-GNN models according to the AUC metric calculated on seen nodes. However, according to the AUC metric on unseen nodes, the best performance is shown by the static GIN model followed by the GIN-GRU model. The GIN-GRU/LSTM models show the best performance according to the AUPRC metric on both seen and unseen nodes, with the GAT-GRU/LSTM models also achieving high values on unseen nodes. The dominance of GIN-RNN models on unseen nodes according to the AUPRC metric is strong, with an AUPRC uplift of around 0.12 over the best performing dynamic non-GNN model. Overall, the GIN-GRU model delivers the best performance based on a majority of the metrics. When it comes to time and energy performance, the models for City 2 display the same pattern as for City 1: complex GAT-RNN models take the longest time to train and produce the largest amount of carbon emissions.
Finally, Table 5 shows the results for the largest network of City 3, and it is clear that the GNN-RNN models outperform the baseline models. In particular, the GIN-RNN models achieve the highest AUC scores on both seen and unseen nodes, indicating their superior performance. Additionally, when it comes to the AUPRC metric calculated on seen and unseen nodes, all GNN-RNN models perform exceptionally well, with no clear winner among them. Notably, the Features+PR-GRU and Features-LSTM models, which do not have a computationally expensive GNN component, exhibit the poorest time performance and the highest environmental impact. This is due to the need for a high positive SMOTE oversampling rate to achieve optimal performance, which incurs significant computational costs for larger networks (see online Appendix B for the detailed overview of optimal model configurations).
In summary, more sophisticated models, and specifically the GIN-GRU combination, perform better as the network grows, becoming dominant in the largest network. This is consistent with other results in deep learning that show how data-hungry these models can be: to improve on simpler alternatives, they need enough data diversity. In terms of computing time, models with pure attention mechanisms are the most complex to train, which is consistent with the quadratic complexity they present. In our models, this does not translate to better performance, so modellers should carefully test other alternatives to benchmark the right model. Interestingly, the use of a balanced error measure, AUPRC, shows that more powerful models do present better results. As already discussed, AUC can be misleading in the presence of high imbalance, so measures such as AUPRC can be more helpful.
In Figures 6a, 6b and 6c, we display train and validation losses, AUC and AUPRC convergence plots for the consistently well-performing GIN-GRU models for all three cities. These plots demonstrate that the GIN-GRU model achieves convergence across all cities. While validation AUC consistently grows with epochs for all cities, Figure 6c reveals that the model's training for the largest city, City 3, is the most stable, with AUPRC gradually increasing with each epoch. In contrast, the validation AUPRC for City 1 is the least stable. Therefore, providing more data to the model improves its training stability as well as its accuracy. The convergence plots for the other models can be found in online Appendix A.

Profit evaluation
In order to act upon the model predictions, a decision on the classification threshold has to be made to return the final predicted labels and use them for a marketing campaign. When dealing with a severe class imbalance problem, the default threshold of 0.5 used for balanced tasks can result in poor performance (Fernández et al., 2018). Hence, this threshold needs to be tuned in order to find the optimal one. We apply the predict-then-optimize approach introduced by Vanderschueren et al. (2022): a cost-insensitive influencer detection model is used to optimize decision making, i.e., tuning a classification threshold for obtaining the maximal expected profit of potential customers brought by influencers. As profit is customer specific, we utilize the approach of instance-dependent cost-sensitive thresholding (Vanderschueren et al., 2022). We assume that the influencer detection model is used for targeted marketing, i.e., the customers who are predicted by the model as influencers are contacted with a contact cost c. True influencers also get a reward that comes at a cost ic. As we know from the company providing the data, all the customers have been targeted during the time frame of the study. However, the assumption that this will be permanent is unrealistic, so we must model the impact of a targeted policy. Hence, we introduce a probability p of not referring new customers while having the chance to do so (being an influencer). As we know everyone has been targeted, true influencers have a probability of not referring equal to 0 (p = 0). For non-influencers, this probability is a free parameter to be estimated (p ≠ 0). Consequently, (1 − p) represents the probability that an actual influencer will still bring new customers to the customer base even without being rewarded for doing so (targeted).
The total profit of the campaign depends on the accuracy of a model's predictions. In particular, we can distinguish between four different categories of profit depending on actual and predicted influencer labels (Table 6). The first category is the Target Influencer (TI) category that corresponds to the True Positive predictions of the model. The profit coming from this category of customers is calculated as the future profit $P_u$ of potential customers referred by an influencer u, reduced by the incentive cost ic (Equation 8). Additionally, a contact cost c is subtracted once, as the action to contact the customer may come at a cost. The TNI category represents false positive predictions: the customers that have been targeted but did not refer other customers. The profit from this category is negative, as we spend money on contacting the customers while not getting new customers back (Equation 8). The NTI category consists of the false negatives: true influencers who were not contacted. We expect to get profit from this category of customers with the aforementioned probability 1 − p, with the reward cost ic subtracted (Equation 8). The last category, NTNI, represents true negative predictions that bring neither profit nor costs (Equation 8).
In Equation 8, $N_u$ is the number of customers referred by u, c is the contact cost, ic is the reward cost, and $P_v$ is the future 6-month profit of a customer v, calculated as the interest earned on the outstanding balance on a credit card for non-defaulters or the negative outstanding balance for defaulters.
We search for an optimal threshold that maximizes profit on the validation data and subsequently utilize this threshold on the test data to get a profit estimation. Hence, we solve the optimization problem formalized in Equation 9.
where t is a classification threshold. We illustrate the aforementioned framework using the best performing GIN-GRU model for City 2. From Equation 8, we can see that there are three parameters, ic, c and p, that influence the resulting profit. We obtained a reward parameter from the company and normalized it to a generic value of 100 monetary units. All values are expressed based on these normalized results. We solve the optimization problem from Equation 9 for each combination of c and p, with c ∈ [0, 80] with a step of 10 (the step is reduced to 5 in the interval [0, 30] for better illustration of smaller values of c) and p ∈ [0, 1] with a step of 0.05. We vary the threshold from 0 to the maximal probability score returned by the model, which is 0.78 for the validation month data. The resulting matrix of optimal classification thresholds is displayed in Figure 7.
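The profit categories and the threshold search can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions: the function names are hypothetical; `referral_profit[u]` is assumed to hold the precomputed sum of $P_v$ over the customers referred by u (zero for non-influencers); the reward cost ic is subtracted outside the (1 − p) factor for the NTI category, following Equation 8 as printed; a customer is assumed to be targeted when the score is strictly above t, so that the largest candidate threshold corresponds to targeting nobody; and an evenly spaced grid of candidate thresholds is an illustrative choice.

```python
import numpy as np

def total_profit(y_true, y_pred, referral_profit, c, ic, p):
    """Campaign profit summed over the TI, TNI, NTI and NTNI categories."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    referral_profit = np.asarray(referral_profit, dtype=float)
    ti = y_true & y_pred      # targeted influencers (true positives)
    tni = ~y_true & y_pred    # targeted non-influencers (false positives)
    nti = y_true & ~y_pred    # non-targeted influencers (false negatives)
    profit = np.zeros(len(y_true))   # NTNI (true negatives) stays at 0
    profit[ti] = referral_profit[ti] - ic - c
    profit[tni] = -c
    profit[nti] = (1.0 - p) * referral_profit[nti] - ic
    return profit.sum()

def best_threshold(y_true, scores, referral_profit, c, ic, p, n_steps=100):
    """Grid-search the threshold t in [0, max score] that maximizes profit."""
    scores = np.asarray(scores, dtype=float)
    candidates = np.linspace(0.0, scores.max(), n_steps)
    profits = [
        total_profit(y_true, scores > t, referral_profit, c, ic, p)
        for t in candidates
    ]
    best = int(np.argmax(profits))   # first maximizer if several tie
    return candidates[best], profits[best]
```

In line with the setup above, `best_threshold` would be called on validation data once per (c, p) combination, and the returned threshold would then be passed to `total_profit` on test data to obtain the profit estimate.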
From Figure 7, we can see how the optimal threshold changes for different c and p combinations. In particular, when contacting customers comes at no cost (c = 0), the most profitable solution is to contact everyone in the customer base irrespective of the value of p. This conclusion comes as no surprise, as targeting false positives (TNI category) comes at no cost while true positive predictions (TI category) bring non-negative profit. With an increasing cost of contact c and a decreasing probability of referral (i.e., increasing probability p of not referring), the optimal classification threshold increases: it is no longer economically feasible to contact everyone. This is a more realistic situation, where the influencer prediction model brings added value to the business by picking the best customers to be contacted. In particular, when the contact cost c gets higher than 60, it is no longer feasible to target customers: the cost of targeting becomes higher than the added value of using the model's predictions irrespective of the value of p. At this point, the profit brought by the NTI category of customers gets high enough to outweigh the costs of the targeting campaign. Hence, the matrix displayed in Figure 7 can be used to determine the contact cost up to which the marketing campaign still makes sense from a profit perspective.
Consequently, we apply the thresholds from Figure 7 on test data to obtain a profit estimation for each c and p combination. The resulting matrix of profits is displayed in Figure 8. It is evident that the maximum profit is achieved when there is no contact cost, regardless of the value of p, or when the probability of referring without being targeted is one, i.e., 1 − p = 1. The former case corresponds to the classification threshold being set to 0 and contacting all individuals. This finding is expected, given that in this scenario false positives carry no cost. However, this case is not realistic, as contacting customers via different channels carries non-zero costs. On the contrary, the latter case represents a situation where no one is targeted, as the profit obtained from the true influencers is higher than when a targeting campaign that comes at a contact cost is activated. When looking at non-zero values of contact cost c, we see that the profit gets lower with increasing values of probability p. This is expected as with a rising probability p, the necessity of the marketing campaign diminishes as the referral likelihood is high enough on its own with no need to be stimulated by a target. When the contact cost gets too high (≥ 60), the profit is negative, and targeted marketing is no longer economically rational.
A similar analysis can be performed for the case with varying ic instead of c. In this scenario, we fix c at 3 (a value selected based on conversations with the company; it must be selected based on the use case and cost structure) with ic ∈ [0, 250] with a step of 50 and p ∈ [0, 1] with a step of 0.05 (Figure 9). Similarly to the p vs. c case, when the probability of referring without being targeted (1 − p) is 1, the targeting campaign does not make sense: all the influencers will refer other customers anyway, with no need to be nudged to do so. For a rising probability p and decreasing reward cost ic, the optimal threshold value decreases, with more customers being predicted as influencers. When the reward cost becomes too high (≥ 200), it is again optimal not to use a model and to rely on referrals without a targeting campaign. When these thresholds are applied on test data to obtain profit estimates, we observe patterns similar to the p vs. c case: the profit decreases with increasing values of p and increasing reward costs (Figure 10). As we observed in the threshold matrix, when the reward cost surpasses the value of 200, the profit is no longer positive due to high reward costs.
Both the p vs. c and p vs. ic cases illustrate that some prior knowledge is required before deploying the influencer prediction model in production. The optimal threshold values should be estimated based on this prior information to get the maximal benefits from the model. The probability p cannot be influenced by the company itself and is defined by a personal attitude towards a brand and personal marketing preferences (Ryu & Feick, 2007). This probability can be modelled in order to estimate its values depending on the aforementioned characteristics. However, this is outside the scope of the current research and is left for future research.

Discussion
As was illustrated in Section 5, the GNN-RNN models dominate the baseline models, and this dominance is most pronounced on unseen nodes. Performance on unseen nodes is crucial for inductive learning, where we focus on generalizing to unseen nodes to predict influencers. Notably, the generalization capability is the highest on the medium-size network of City 2, with the highest values of AUC and AUPRC metrics calculated on unseen nodes. Nevertheless, the best combination of performance on seen and unseen nodes is obtained on the biggest network of City 3: here we consistently see high values on all performance metrics. In other words, when task complexity is increasing, as is the case with inductive learning, the applicability of complex GNN-RNN models is justified.
Figure 10: Profit matrix on test set: p vs. ic

It is worth mentioning that the performance of non-GNN dynamic baseline models, i.e., Features(+PR)-RNN models, deteriorates with a growing network size. While on the smallest network of City 1 these models are able to perform comparably to the GNN-RNN models, the opposite is true for the biggest network of City 3. Hence, for networks of smaller sizes, the Features(+PR)-RNN models can be considered for use, as the performance gain of using GNN-RNN models is not large enough. However, when it comes to bigger networks, the dominance of the GNN-RNN models is clearly noticeable, especially on unseen nodes. When comparing the two types of GNNs, the GIN encoder emerges as the clear winner in both GNN-RNN and static GNN model types. This is due to its ability to be trained in less time and with less energy, while consistently outperforming the GAT in terms of detection performance. The superiority of GIN may be attributed to the complex network structures we are dealing with: GINs are universal and invariant to the graph structure, whereas GATs are more sensitive to the structure of the graph, which can be disadvantageous for complex dynamic networks. This result illustrates a trade-off between complexity and performance: the increased complexity of GATs is not justified by a relatively low performance gain.
Another insight is that for inductive learning, the GNN component plays a more important role than the network dynamics, especially in the case of the GIN encoder: the AUC and AUPRC metrics for the models containing a GNN encoder are on average higher than for non-GNN dynamic models. Nevertheless, the dynamics of the network play a crucial role and must be captured: the best results are obtained when GNNs and RNNs are utilized together.
Regarding profit-driven decision making, the applicability of the model relies heavily both on the costs required to contact a customer and on the value of the reward offered to the referrer. According to Ryu & Feick (2007), this reward should be non-zero. The model is found to be useful in cases with smaller contact/reward costs and higher probabilities of not referring without being targeted. As the company can only influence contact/reward costs, those should be kept as low as possible. However, as there is a positive relationship between reward size and referral likelihood, as found by Ryu & Feick (2007), a trade-off between the two should also be taken into account. We leave a more sophisticated modeling of reward likelihood for future research.

Conclusion
Network data has been receiving increased attention in many research domains due to its capacity to reveal complex relationships that might go unnoticed by traditional data sources. The marketing domain is no exception, as network information can be utilized to promote products and services through referral marketing initiatives. Referral marketing leverages the power of Word of Mouth effects, with opinion leaders being the most effective candidates to refer new customers. Identifying these influencers early on is critical for the success of targeted marketing campaigns, which can benefit immensely from reaching the right customers and yielding the highest profits from their subsequent referral activities. Conversely, reaching out to these customers can provide faster bancarization and access to desirable services, while limiting the economic cost of targeting these potential customers individually. Moreover, influencer detection is the most beneficial when performed in an inductive manner resulting in generalization to unseen network nodes, i.e., in the influencer prediction problem.
To this end, this paper investigated the applicability of DGNNs for the inductive influencer prediction problem with cost-sensitive decision-making optimization. Four DGNN configurations were trained and evaluated on the network data of three cities and compared with the dynamic non-GNN and static GNN baseline models. The evaluation was performed based on detection performance metrics, namely AUC and AUPRC calculated on seen and unseen nodes, as well as on the time performance and environmental impact of a model. A predict-then-optimize approach was applied in order to illustrate the usability of the models in real business settings.
First, our study shows that using DGNNs is highly advantageous for the inductive influencer prediction task. By incorporating network information and dynamics, these models demonstrate better generalization ability to unseen data while maintaining strong detection performance on nodes encountered during training. Among the different types of encoders tested, the Graph Isomorphism Network (GIN) emerged as the most effective, owing to its capacity to handle universal graph structures. We therefore recommend the use of DGNNs with a GIN encoder for influencer detection tasks that require generalization to new nodes in the network.
Next, we have found that larger networks with more nodes and edges provide the best combination of detection performance and model stability. In these cases, DGNNs are particularly useful, as they can effectively capture the increasing complexity of the network that cannot be fully captured by centrality metrics like PageRank. However, for smaller networks where network topology is not as informative, simpler dynamic non-GNN models can perform comparably to DGNNs. Therefore, when choosing a model, one must carefully evaluate whether the size of the available data warrants the use of more sophisticated models, taking into account performance and the technical debt of operating the models.
Finally, practical deployment of the model depends heavily on additional parameters: those controlled by the business and external factors that cannot be controlled but only estimated. Thus, we conclude that a proper analysis of target campaign costs together with the costs of the referral marketing should be conducted in order to find the optimal threshold used for the final deployment of the model in production. In the case of influencer prediction tackled in this paper, we advise a careful study of the minimum reward necessary to induce a probability of response, using the methods proposed in this paper to evaluate outcomes. A/B tests and similar strategies can be used to fine-tune the incentive value for this purpose.
While this study provides valuable insights into the applicability of DGNNs for influencer prediction, it is important to acknowledge its limitations. The approach used to construct the networks can influence the performance of the models. Investigating the impact of network topology on DGNN performance is an area for future research. Additionally, in the profit-driven decision-making part of our research, we relied on the probability of referring without being targeted, which requires explicit modeling to determine the classification threshold. Future research can focus on modeling this probability.

Figure 5: Train/validation/test setup for Static GNN models

Figure 6: Convergence plots, GIN-GRU models

$$
\begin{aligned}
P_u(TI) &= \sum_{v=1}^{N_u} P_v - ic - c \\
P_u(TNI) &= -c \\
P_u(NTNI) &= 0 \\
P_u(NTI) &= (1 - p) \sum_{v=1}^{N_u} P_v - ic \\
P_{total} &= \sum_{S \in G} \sum_{u \in S} P_u, \quad G = \{TI, TNI, NTNI, NTI\}
\end{aligned}
\tag{8}
$$

Figure 7: Threshold matrix optimized on validation set: p vs. c

Figure 8: Profit matrix on test set: p vs. c

Table 1: Edge types

Table 2: Hyperparameter tuning. All the models are trained on an NVIDIA A100 GPU with 40 GB memory.

Table 3: Models performance on test data: City 1

Table 4: Models performance on test data: City 2

Table 5: Models performance on test data: City 3