MRLBot: Multi-Dimensional Representation Learning for Social Media Bot Detection

Social media bots pose potential threats to the online environment, and the continuously evolving anti-detection technologies require bot detection methods to be more reliable and general. Current detection methods encounter challenges, including limited generalization ability, susceptibility to evasion in traditional feature engineering, and insufficient exploration of user relationships. To tackle these challenges, this paper proposes MRLBot, a social media bot detection framework based on unsupervised representation learning. We design a behavior representation learning model that utilizes Transformer and a CNN encoder–decoder to simultaneously extract global and local features from behavioral information. Furthermore, a network representation learning model is proposed that introduces intra- and outer-community-oriented random walks to learn structural features and community connections from the relationship graph. Finally, the behavioral representation and relationship representation learning models are combined to generate fused representations for bot detection. The experimental results of four publicly available social network datasets demonstrate that the proposed method has certain advantages over state-of-the-art detection methods in this field.


Introduction
The advent of online social networks (OSNs) such as Twitter, Instagram, and Weibo has brought revolutionary advancements in communication tools, providing users with a novel means to create and disseminate personal content. With the growing influence of social networks, an increasing number of individuals and organizations are utilizing OSNs to accomplish their objectives. To fully leverage the power of social networks, it is necessary to have sufficient influence on these platforms, leading to the emergence of social media bots.
Social media bots (SMBs) are computer programs that generate content, engage in social interactions, online discussions, and other activities, with the aim of imitating human accounts. Research on SMBs has shown that a substantial proportion of English-speaking users on Twitter, ranging from 9% to 15%, display bot-like behavior. The estimated total number of bots on Twitter is approximately 8.5% of all users, equating to tens of millions [1]. SMBs can be categorized as benign, neutral, or malicious. Among these, malicious bots pose a significant threat to OSNs. Attackers typically control and direct malicious bots, planning and executing their attack behaviors. The presence and activities of malicious bots have a significant impact on specific topics and public opinions on the Internet. For instance, during the 2010 U.S. midterm elections, malicious bots infiltrated Twitter, promoting specific candidates and spreading rumors and misinformation about others [2]. Similar attacks were observed during the 2016 U.S. presidential election [3].
To mitigate the proliferation of malicious behaviors on OSNs, researchers have proposed various techniques for detecting and banning bot accounts. Nevertheless, attackers can reverse-engineer their bot accounts by exploiting existing detection methods to evade detection. As a result, current detection methods encounter the following challenges:
1. The generalization ability of detection methods; existing detection models may only be applicable to particular OSNs, and their performance on other OSNs may be suboptimal.
2. Traditional feature engineering-based methods can be expensive and susceptible to evasion; the manual extraction of distinctive features to discern between malicious bots and legitimate users requires significant domain expertise and human resources. Moreover, when malicious bots enhance their anti-detection capabilities, the originally defined feature set may lose its effectiveness.
3. Insufficient exploration of user relationships; existing detection methods that rely on user relationships demand substantial time and computational resources. Furthermore, considering the privacy regulations of OSNs, potential features should be extracted from a restricted and accessible set of user relationships to the fullest extent feasible.
To tackle these challenges, this paper introduces MRLBot, a framework for detecting malicious bots in social networks using representation learning. The framework models multi-dimensional information about user behaviors and relationships through automated feature extraction and the high generalization capability of general representation learning, and uses the result to accomplish the detection task. The primary contributions of this paper are summarized as follows:
• We propose a behavior representation learning model, DDTCN. The model abstracts user activities on social networks to obtain behavior sequences, which are then encoded using the contextual global feature extraction ability of Transformer for time series. A CNN encoder-decoder is subsequently cascaded to extract local information from the sequences. Additionally, the representation ability of the output vectors is enhanced through the proposed Transformer dual decoder;
• We present a network representation learning model, IB2V, based on latent communities. The model learns the structural features of node neighborhoods through a novel random walk algorithm, while also preserving the internal structure of communities and the correlations between them. Additionally, we propose an incremental learning strategy for large-scale network graphs to reduce the time cost of generating representations for newly added nodes, while maintaining the performance of the model;
• We design a generalized detection framework for various social network platforms, achieved through unified input from multiple platforms. The framework integrates components for behavior representation and relationship representation learning to generate fused representations, thereby enhancing detection performance.
In the rest of this paper, we conduct a comprehensive review of the relevant literature in Section 2. Section 3 provides a detailed overview of the preliminaries of user behaviors and relationships in social networks. Subsequently, in Section 4, we present the proposed MRLBot. In Section 5, we present the experimental results and corresponding analyses. Section 6 includes a review of the study's findings, along with our directions for future research.

Graph-Based Detection Approaches
A graph, G, in mathematical terms is a collection of vertices, V(G), and edges, E(G), that represent points connected in a plane or space. Graph structures are widely employed in diverse fields to represent pairwise relationships between objects. Early graph-based models for SMB detection were based on the architecture of the studied OSN and established user relationships. Feng et al. [5] constructed an undirected graph using the bidirectional follow relationship among target users, and then created a matrix of target users and their related users based on the Jaccard coefficient. Subsequently, the probability of a user being identified as a social bot was calculated based on the similarity between matrices. Dorri et al. [6] proposed SocialBotHunter, which was based on the isomorphism of social network graphs, where isomorphism suggests that two accounts may share similar attributes if they are associated in the social network. This model detected bots on Twitter by analyzing users' social behaviors and interaction records. Abu-El-Rub et al. [7] employed a graph-based approach to model geographical information and clustered accounts based on graph structure and other collected information. Furthermore, Ahmad et al. [8] employed an unsupervised machine learning method to cluster graphs for bot detection by modeling social media data using a set of features and weighted graphs.
Extracting features from rich relational information in graphs has long been a challenging problem. Recently, new research approaches have emerged, with a common method involving the transformation of topological and relational information into low-dimensional vectors, followed by training and inference using machine learning algorithms. Pham et al. [9] proposed a community-based random walk strategy that generates low-dimensional node representations while preserving local neighborhood relationships and intra-community structure. Magelinski et al. [10] utilized a graph neural network to extract latent local features of social network graphs by aggregating nodes along one-dimensional slices of the feature space, and then performed classification based on the generated multi-channel histograms. Feng et al. [11] introduced a framework called BotRGCN, which constructs a relation-based heterogeneous graph and enhances the model's ability to detect bots disguised as normal users by using multimodal user semantics and profiles. Building upon this, they considered varying relationship strengths between users in their recent work [12], and constructed a heterogeneous network with users as nodes and diverse relationships as edges.
However, in reality, OSNs are typically of massive scale, with some even having billions of user nodes. Consequently, most graph-based methods for detecting malicious SMBs in OSNs demand substantial training times and significant computing resources. Moreover, obtaining relationships for all users is highly challenging due to privacy constraints in OSNs, which impedes the further development of graph-based methods in this field.

Node-Based Detection Approaches
Most graph-based methods focus solely on the relationships between nodes, disregarding the valuable information contained within user nodes. Supervised machine learning is the prevailing method for node-based detection. These methods employ machine learning classifiers to detect malicious SMBs, treating the detection as binary classification and relying on a substantial amount of annotated data for training. Daouadi et al. [13] proposed an augmented set of features that leverage the interaction volume between accounts, combined with other features from previous research, to detect bot accounts on Twitter. Kudugunta et al. [14] employed the content of individual tweets and six account features to identify bots on Twitter. Wang et al. [15] hypothesized that tweets of social bots exhibit similarity due to shared goals among attackers, and utilized tweet similarity for detecting social bots on Twitter. Ping et al. [16] utilized CNN-LSTM to extract features from tweet content and metadata. Wei et al. [17] used the bidirectional long short-term memory (BiLSTM) network to capture features from tweets for classifying human and spam bots on Twitter. Stanton et al. [18] introduced a method called spamGAN that utilizes generative adversarial networks (GAN) to detect spam bots and enhance text classification accuracy in online comments with limited labeled data.
In addition to supervised learning, unsupervised machine learning methods have also been applied in this field. These methods do not rely on annotated data or unique features of individual accounts for classification purposes. Cresci et al. [19,20]  analyzed digital DNA sequences from users' online behaviors, and applied standard DNA analysis techniques to distinguish between legitimate accounts and spam accounts on the platform. Mazza et al. [21] collected a dataset of 10 million retweets and developed a novel visualization method to differentiate between benign and malicious retweet activities. They proposed an unsupervised bot detection technique called Retweet-Buster (RTbust), which utilizes feature extraction and clustering. Feng et al. [22] proposed a representation learning framework called SATAR for the unsupervised identification of bot accounts on Twitter. SATAR employs semantic, attribute, and neighborhood information about specific users for unsupervised pretraining on a large number of user samples to achieve generalization across different OSNs, and fine-tunes the model for adaptability to specific OSNs.
The majority of the machine learning-based models mentioned above detect malicious SMBs at the level of the individual user node, while disregarding the relationships between users and the structural information of the social graph. Moreover, these models are only applicable to OSNs with typical features. In different OSNs, users can perform varied actions, access personal information, and form relationships with other users, making it challenging to transfer feature sets or extraction methods proposed for a specific OSN to another. Additionally, the behavioral patterns of malicious SMBs demonstrate high variability and diversity across different OSNs, which places greater demands on the generalizability of detection methods.

Preliminaries
This section presents definitions for "behavior" and "relationship" in social networks, and integrates input from various platforms to ensure the generalizability of the detection framework across different OSNs.

User Behaviors in Social Networks
Generally, social networks are commonly used by individuals to fulfill various needs, such as communication, information acquisition, and social interaction, among others. The actions of users in different OSNs such as those depicted in Figure 1 can be considered events that take place at different timestamps on a timeline, encompassing the diverse actions and interactions carried out by users at each timestamp.
Figure 1. The behavioral activities of a social network user from the past to the present. When the timestamp is $t_1$, the user was posting. When the timestamp is $t_3$, the user liked a posted piece of content. When the timestamp is $t_5$, the user reposted posts. When the timestamp is $t_7$, the user was posting another piece of content.
The data generated from these behaviors contain valuable information that reflects the unique characteristics of users. Therefore, deep learning techniques can be utilized to analyze and model user behaviors, generating representation vectors that capture the behavioral patterns of users. By comparing and analyzing these representation vectors, it is possible to distinguish normal accounts from malicious SMBs.
We summarize the common actions performed by users in OSNs, including Post, Follow, Like, and Repost. The behavior of a user, $u$, on a social network can be defined by the behavior type sequence, $B_u$, and the corresponding timestamp sequence, $T_u$, both of which comprise a set number of operations:
$$B_u = (b_{u,1}, b_{u,2}, \dots, b_{u,i}, \dots, b_{u,l-1}, b_{u,l})$$
$$T_u = (t_{u,1}, t_{u,2}, \dots, t_{u,i}, \dots, t_{u,l-1}, t_{u,l})$$
$B_u$ represents the behavior type sequence for user $u$'s $l$ actions, where $b_{u,i} \in \{\text{Post}, \text{Follow}, \text{Like}, \text{Repost}\}$ denotes the behavior type of the $i$-th action. $T_u$ is the timestamp sequence for each action of user $u$, where $t_{u,i} \in T$ represents the timestamp of the $i$-th action, and $T$ represents the collection of all timestamps within the data collection period. To account for the temporal characteristics of user behavior, $T_u$ is arranged in an increasing sequence, with timestamps being sorted based on values, i.e.,
$$t_{u,1} \le t_{u,2} \le \dots \le t_{u,l-1} \le t_{u,l}$$
Posts or reposts generated by malicious SMBs often exhibit biased or directed characteristics and differ from the content posted by regular users. Therefore, the semantic features of these posts play a critical role in detecting malicious SMBs. Consequently, the content of posts should be incorporated as part of the definition of user behavior sequences. Specifically, the content sequence, $C_u$, can be defined as follows:
$$C_u = (c_{u,1}, c_{u,2}, \dots, c_{u,i}, \dots, c_{u,l-1}, c_{u,l})$$
$C_u$ represents the posts published by user $u$ during $l$ actions, where $c_{u,i}$ denotes the content posted by user $u$ in the $i$-th action. If no content is posted during a particular operation, $c_{u,i}$ is recorded as empty.
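As a rough illustration of these definitions, the following Python sketch builds the three aligned sequences for one user from raw action records. The `Action` record, the `build_sequences` helper, and the integer timestamp encoding are assumptions made for this example and are not part of the paper's implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple

# The four behavior types named in the text.
BEHAVIOR_TYPES = ("Post", "Follow", "Like", "Repost")


@dataclass(frozen=True)
class Action:
    """One user action (hypothetical record layout)."""
    timestamp: int     # assumed to be an integer such as Unix seconds
    behavior: str      # one of BEHAVIOR_TYPES
    content: str = ""  # empty when the action carries no posted content


def build_sequences(actions: List[Action]) -> Tuple[List[str], List[int], List[str]]:
    """Return (B_u, T_u, C_u) for one user, ordered by increasing timestamp."""
    for action in actions:
        if action.behavior not in BEHAVIOR_TYPES:
            raise ValueError(f"unknown behavior type: {action.behavior}")
    ordered = sorted(actions, key=lambda a: a.timestamp)
    b_u = [a.behavior for a in ordered]
    t_u = [a.timestamp for a in ordered]
    # Content is kept only for actions that publish text; otherwise it is empty.
    c_u = [a.content if a.behavior in ("Post", "Repost") else "" for a in ordered]
    return b_u, t_u, c_u


example = [
    Action(300, "Like"),
    Action(100, "Post", "hello world"),
    Action(500, "Repost", "RT: hello world"),
]
B_u, T_u, C_u = build_sequences(example)
```

All three lists share the same length $l$, so position $i$ in each list refers to the same action.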

User Relationships in Social Networks
Social relationships among users in social networks can be integrated to form a social network graph. An OSN can be represented as a directed graph, $G = (V, E)$, where $V$ and $E$ denote the sets of user nodes and relationship edges, respectively. However, because users are connected by different types of relationships and interact with varying frequencies, an unweighted directed graph cannot adequately represent the relationship information among users. To more accurately represent the strength of relationships between users, we propose assigning weights to the edges to create a directed weighted graph, $G = (V, E, W)$, where $W$ denotes the set of edge weights. In this study, we categorize the strength of relationships into five levels: very weak, weak, medium, strong, and very strong. The allocation rules for relationship strength levels and corresponding weights, $W_i$, are based on the assumption that relationship $E_i$ occurs $k$ times from user $V_p$ to user $V_q$, as shown in Table 1.
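The following Python sketch illustrates how such a directed weighted graph could be assembled from a log of interactions. The count thresholds and weight values in `PLACEHOLDER_LEVELS` are hypothetical placeholders and do not reproduce Table 1; the helper names are also assumptions, and the sketch ignores relationship types for brevity.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Edge = Tuple[str, str]  # (source user V_p, target user V_q)


def strength_weight(k: int, levels: List[Tuple[int, float]]) -> float:
    """Map an interaction count k to an edge weight.

    levels holds (minimum count, weight) pairs; the pair with the largest
    minimum count that does not exceed k determines the weight.
    """
    weight = None
    for min_count, level_weight in sorted(levels):
        if k >= min_count:
            weight = level_weight
    if weight is None:
        raise ValueError("k is below every configured level")
    return weight


def build_weighted_graph(
    interactions: Iterable[Edge], levels: List[Tuple[int, float]]
) -> Dict[Edge, float]:
    """Count repeated directed interactions and attach a strength weight."""
    counts = Counter(interactions)
    return {edge: strength_weight(k, levels) for edge, k in counts.items()}


# HYPOTHETICAL levels (very weak .. very strong); NOT the values of Table 1.
PLACEHOLDER_LEVELS = [(1, 0.2), (2, 0.4), (4, 0.6), (8, 0.8), (16, 1.0)]

graph = build_weighted_graph(
    [("a", "b"), ("a", "b"), ("b", "a"), ("a", "c")], PLACEHOLDER_LEVELS
)
```

Because the graph is directed, the edge from `a` to `b` and the edge from `b` to `a` are counted and weighted independently.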

MRLBot: Methodology
This paper proposes a malicious SMB detection framework called MRLBot based on multi-dimensional representation learning, as depicted in Figure 2.

Figure 2. The framework of MRLBot. DDTCN is the behavior representation learning model, and IB2V is the relationship representation learning model. $u_b$ is the generated behavior representation, and $u_g$ is the generated relationship representation. $u$ is the multi-dimensional representation.
The framework consists of the following three steps:

1. Data restructuring and preprocessing; because structuring rules may differ across datasets, the data are first restructured to allow further preprocessing and expansion. The rules for data restructuring are formulated based on the user behaviors and relationships defined in Sections 3.1 and 3.2. Each record in the restructured table represents the behavior of a user or bot at a specific time point, including the behavior type and posted content, as well as interaction relationships. Preprocessing of the restructured data involves aggregating the records by user to generate the user behavior sequences ($B_u$, $C_u$, $T_u$), and by relationship (source node and target node) to build the social network relationship graph, $G = (V, E, W)$.
2. The generation and fusion of multi-dimensional user representations; the behavior sequences are input into the behavior representation learning model, DDTCN, and the directed weighted graph is input into the relationship representation learning model, IB2V. The two models are optimized separately through unsupervised learning to generate behavior representations and relationship representations that capture each user's characteristics. Then, these two types of representations are concatenated to complete the fusion of multi-dimensional representations.
3. The training and detection of the deep learning classifier; a fully connected neural network is employed to construct the classifier, with the fused representations serving as input. During the training process of the detection framework, hyperparameters are adjusted to ensure optimal performance. Additionally, a labeled dataset is utilized to train the classifier, so that malicious SMBs can be identified from their user representations.
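Steps 2 and 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the single hidden layer, its size, the ReLU and sigmoid activations, and the random untrained weights are all hypothetical choices, since the text specifies only that the two representations are concatenated and passed to a fully connected classifier.

```python
import math
import random
from typing import Dict, List


def fuse(u_b: List[float], u_g: List[float]) -> List[float]:
    """Fuse behavior and relationship representations by concatenation."""
    return list(u_b) + list(u_g)


def dense(x: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """Fully connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def init_params(in_dim: int, hidden_dim: int, seed: int = 0) -> Dict[str, list]:
    """Random, untrained parameters (placeholder for a trained classifier)."""
    rng = random.Random(seed)
    return {
        "W1": [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(hidden_dim)],
        "b1": [0.0] * hidden_dim,
        "W2": [[rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)]],
        "b2": [0.0],
    }


def bot_probability(u: List[float], params: Dict[str, list]) -> float:
    """Forward pass: hidden layer with ReLU, then a sigmoid output unit."""
    hidden = [max(0.0, h) for h in dense(u, params["W1"], params["b1"])]
    (logit,) = dense(hidden, params["W2"], params["b2"])
    return 1.0 / (1.0 + math.exp(-logit))


u_b = [0.3, -1.2, 0.7]  # toy behavior representation
u_g = [0.5, 0.1]        # toy relationship representation
u = fuse(u_b, u_g)
p = bot_probability(u, init_params(len(u), hidden_dim=4))
```

In practice the parameters would be learned from the labeled dataset mentioned in step 3; the forward pass above only shows how the fused vector $u$ feeds the classifier.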

DDTCN: Behavioral Representation Learning Model
Traditional RNN structures are capable of analyzing high-frequency and uniform behavior information, but exhibit limitations in handling low-frequency and non-uniform behaviors that evolve over time [23]. In contrast, Transformer structures have been shown to effectively model multiple sequences and generate meaningful contextual representations in temporal interaction data [24]. Inspired by this, we propose a Transformer-based behavioral representation learning model, named DDTCN, for modeling user behavior in social networks. The architecture of DDTCN is illustrated in Figure 3.
When considering behavior sequences in social networks, relying solely on Transformer may present limitations. To address this, DDTCN incorporates two optimizations based on Transformer:
1. Incorporating a CNN encoder-decoder concatenated with Transformer to capture both local and global information of user behavior simultaneously;
2. Adding a parallel decoder on top of Transformer to retain diverse information and further mitigate information loss during the generation of user behavior representations.
DDTCN encodes different types of sequences at the input layer and merges them together. The original input sequences are encoded using embedding layers, including a behavior type-embedding matrix, $M_b \in \mathbb{R}^{|B| \times d}$, and a time-embedding matrix, $M_t \in \mathbb{R}^{|T| \times d}$, where $d$ is the dimension of the projection vectors, $|B|$ denotes the number of behavior types, and $|T|$ is the number of timestamps. By performing a table look-up, input embeddings for $B_u$ and $T_u$ are obtained, denoted as $E_b \in \mathbb{R}^{l \times d}$ and $E_t \in \mathbb{R}^{l \times d}$, respectively:
$$E_b = (e_{b,1}, e_{b,2}, \dots, e_{b,i}, \dots, e_{b,l-1}, e_{b,l})$$
$$E_t = (e_{t,1}, e_{t,2}, \dots, e_{t,i}, \dots, e_{t,l-1}, e_{t,l})$$
where $e_{b,i} \in \mathbb{R}^d$ is the embedding of $b_{u,i}$, and $e_{t,i} \in \mathbb{R}^d$ is the embedding of $t_{u,i}$.
As the length of each content in the content sequence, $C_u$, may be different, it is necessary to encode each text to align its dimension with the embeddings of the other sequences ($E_b$ and $E_t$), while preserving its semantic features. To achieve this, we utilize the pre-trained BERT [25] model to encode $C_u$ and obtain the content embeddings $E_c \in \mathbb{R}^{l \times d}$:
$$E_c = (e_{c,1}, e_{c,2}, \dots, e_{c,i}, \dots, e_{c,l-1}, e_{c,l})$$
where $e_{c,i} \in \mathbb{R}^d$ is the embedding of $c_{u,i}$ after encoding with BERT. Based on the embeddings defined above, the input of the Transformer encoder can be defined as $E_I \in \mathbb{R}^{l \times d}$:
$$E_I = E_b + E_t + E_c$$
The Transformer encoder is the key component of our model, responsible for fusing and compressing different types of information. It is composed of two components: the multi-headed self-attention mechanism (MHSA) and the feedforward network (FFN). The MHSA captures valid information from a variety of subspaces and is defined as follows:
$$\mathrm{MHSA}(X^n) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O$$
$$\mathrm{head}_i = \mathrm{Attention}(X^n W_i^Q,\; X^n W_i^K,\; X^n W_i^V)$$
where $X^n$ is the input to the $n$-th layer of the Transformer encoder, i.e., $X^1 = E_I$, $W_i^Q$, $W_i^K$, and $W_i^V$ are the parameters that can be learned in each attention head in the MHSA, and $W^O$ is the output projection. In this paper, the scaled dot product is used as the formula for attention, defined as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V$$
where $d_k$ is the dimension of the key vectors, and $\sqrt{d_k}$ is the scaling factor that prevents the inner product ($QK^{\top}$) from becoming excessively large.
Additionally, we provide MHSA with the nonlinear characteristic using a feedforward network, which is defined as follows:
$$\mathrm{FFN}(x) = \mathrm{LeakyReLU}(x W_1^F + b_1^F)\, W_2^F + b_2^F$$
where $W_1^F$, $W_2^F$, $b_1^F$, and $b_2^F$ are all trainable parameters in FFN, and LeakyReLU is the activation function.
Through the residual structure, the final output at the $n$-th layer of the Transformer encoder is $X^{n+1} \in \mathbb{R}^{l \times d}$.
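To make the two building blocks concrete, the following pure-Python sketch implements single-head scaled dot-product attention and a LeakyReLU feedforward network on small nested lists. It is a minimal illustration rather than the DDTCN implementation: it omits the learned projections, multiple heads, and the residual structure, and the LeakyReLU slope of 0.01 is an assumed default.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def softmax_rows(a: Matrix) -> Matrix:
    """Numerically stable softmax applied to each row."""
    out = []
    for row in a:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention(q: Matrix, k: Matrix, v: Matrix) -> Tuple[Matrix, Matrix]:
    """Scaled dot-product attention; returns (output, attention weights)."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = softmax_rows(scaled)
    return matmul(weights, v), weights


def leaky_relu(x: float, slope: float = 0.01) -> float:
    # slope=0.01 is an assumption; the text does not state the value used.
    return x if x > 0 else slope * x


def ffn(x: Matrix, w1: Matrix, b1: List[float], w2: Matrix, b2: List[float]) -> Matrix:
    """FFN(x) = LeakyReLU(x W1 + b1) W2 + b2, applied row by row."""
    hidden = [[leaky_relu(v + b) for v, b in zip(row, b1)] for row in matmul(x, w1)]
    return [[v + b for v, b in zip(row, b2)] for row in matmul(hidden, w2)]


# Toy input with l = 3 positions and d = 2 dimensions, used as Q, K, and V.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = attention(x, x, x)

identity = [[1.0, 0.0], [0.0, 1.0]]
ffn_out = ffn([[1.0, -2.0]], identity, [0.0, 0.0], identity, [0.0, 0.0])
```

Each row of `weights` is a probability distribution over the sequence positions, which is what lets every position draw on global context.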

CNN Encoder-Decoder
Peng et al. [26] proposed that the cascaded multi-head self-attention (MHSA) in Transformer can capture long-range feature dependencies. However, it may suffer from the loss of local feature information. In contrast, CNN convolutional operation excels at extracting local features but struggles to capture global features simultaneously. Relying solely on MHSA to focus on global user behavior representations may overlook the active state during specific periods. Therefore, taking these factors into consideration, we design a cascaded CNN encoder-decoder to capture important local information in behavior sequences.
Based on TextCNN [27], we have developed a CNN encoder-decoder that consists of a CNN encoder and a CNN decoder. The CNN encoder comprises a two-dimensional convolutional layer, a one-dimensional max pooling layer, and a fully connected layer. The size of the convolutional kernel, $[k_w, k_e]$, can be adjusted to capture specific lengths of continuous information. Meanwhile, to ensure coverage of the sequence embeddings, the dimension of the convolutional kernel is set to be the same as the embedding dimension, i.e., $k_e = d$. The dimension of the max pooling layer is also set to match the feature map generated by the convolutional kernel, which reduces the model parameter size through downsampling to alleviate overfitting. Finally, the fully connected layer compresses the feature map and generates the user behavior representation, $u_b$. The definition of the CNN encoder is as follows:
$$u_b = W^{CE}\, \mathrm{MaxPool}\big(\mathrm{Conv2D}(X_u; [k_w, k_e], \Theta)\big) + b^{CE}$$
where $X_u$ is the output of the Transformer encoder, $[k_w, k_e]$ and $\Theta$ represent the parameters of the CNN encoder, and $W^{CE}$ and $b^{CE}$ are the parameters of the linear layer.
In the CNN decoder, $u_b$ is used to reconstruct the input, $X_u$, of the CNN encoder, in preparation for the reconstruction task in the Transformer decoder. Corresponding to the encoder, the CNN decoder begins with a fully connected layer, followed by an upsampling layer to recover information lost due to max pooling. Finally, a two-dimensional transposed convolutional layer is connected to restore the convolutional operation of the encoder. The definition of the CNN decoder is as follows:
$$\hat{X}_u = \mathrm{ConvTranspose2D}\big(\mathrm{UpSample}(W^{CD} u_b + b^{CD})\big)$$
where $\hat{X}_u$ is the reconstruction of $X_u$, and $W^{CD}$ and $b^{CD}$ are the parameters of the linear layer.
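The encoder half of this design can be illustrated with a small pure-Python sketch: each kernel spans the full embedding width ($k_e = d$), slides over $k_w$ consecutive positions, and is max-pooled over its whole feature map before a linear layer produces $u_b$. The kernel values, the number of kernels, and the function names are assumptions for this example, and no activation function is applied because the text does not mention one.

```python
from typing import List

Matrix = List[List[float]]


def conv_full_width(x: Matrix, kernel: Matrix, bias: float = 0.0) -> List[float]:
    """Slide a [k_w, d] kernel over the l positions of x (valid padding).

    Because the kernel is as wide as the embedding, the feature map is
    one-dimensional with l - k_w + 1 entries.
    """
    k_w = len(kernel)
    feature_map = []
    for start in range(len(x) - k_w + 1):
        window = x[start:start + k_w]
        total = sum(
            w * v
            for kernel_row, x_row in zip(kernel, window)
            for w, v in zip(kernel_row, x_row)
        )
        feature_map.append(total + bias)
    return feature_map


def cnn_encode(
    x: Matrix, kernels: List[Matrix], w_ce: Matrix, b_ce: List[float]
) -> List[float]:
    """Convolve, max-pool each feature map to one value, then apply a linear layer."""
    pooled = [max(conv_full_width(x, kernel)) for kernel in kernels]
    return [sum(w * p for w, p in zip(row, pooled)) + b for row, b in zip(w_ce, b_ce)]


# Toy Transformer-encoder output X_u with l = 4 positions and d = 2.
x_u = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
kernels = [
    [[1.0, 0.0], [0.0, 1.0]],  # k_w = 2, k_e = d = 2
    [[0.0, 1.0], [1.0, 0.0]],
]
feature_maps = [conv_full_width(x_u, kernel) for kernel in kernels]
u_b = cnn_encode(x_u, kernels, w_ce=[[0.5, 0.5]], b_ce=[0.1])
```

Choosing a larger $k_w$ makes each kernel respond to longer runs of consecutive actions, which is how the encoder targets local activity within specific periods.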

Transformer Dual Decoder
Typically, research involving the encoder-decoder structure (autoencoder) employs only one unsupervised reconstruction task to train the model. To minimize information loss during compression, our method incorporates parallel reconstruction tasks. Specifically, $E_b$ and $E_c$ are used as the Query to reconstruct the timestamp embeddings, $E_t$, in addition to reconstructing $E_c$ to retain the content features. The parallel reconstruction tasks endow the generated representations with the ability to recover diverse types of data, thereby enhancing the feature extraction capability of the autoencoder.
As the conventional Transformer decoder cannot perform parallel reconstruction tasks, we design the Transformer dual decoder (TDD) with two decoders: TDDc, which reconstructs $E_c$, and TDDt, which reconstructs $E_t$. Considering model complexity and computation time, we do not reconstruct $E_b$. In the formulas defining the reconstructions of $E_c$ and $E_t$, MHA and FFN denote the multi-head attention mechanism and the feedforward network in Transformer. $D_c^n$ is the input to the $n$-th layer of TDDc, and when $n = 1$, $D_c^1 = E_b + E_t$. $D_t^n$ denotes the input to the $n$-th layer of TDDt, and when $n = 1$, $D_t^1 = E_b + E_c$. Lastly, $\hat{E}_c$ and $\hat{E}_t$ serve as the final outputs of the dual decoders, which are the reconstructions of the content embeddings, $E_c$, and the timestamp embeddings, $E_t$, respectively.
To minimize information loss and enhance feature extraction performance, the restoration of $\hat{E}_c$ and $\hat{E}_t$ to $C_u$ and $T_u$ is considered. Since $T_u$ is encoded using a lookup table in the embedding layer, the changing weights of the embedding layer during training may cause loss of information in $E_t$. If only the differences between $E_t$ and $\hat{E}_t$ are compared, the reconstructed information may not match the original sequence. Therefore, the most probable reconstructed time sequence, $\hat{T}_u$, is calculated by reverse computation using $M_t$ and the softmax layer. In contrast, $C_u$ is encoded using the pre-trained BERT model. To retain its semantic features, the parameters of BERT are fixed during training, so the embedding vectors in $E_c$ remain unchanged. Hence, the differences between $E_c$ and $\hat{E}_c$ can be calculated directly to assess the model's ability to reconstruct contents.
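One plausible reading of the "reverse computation using $M_t$ and the softmax layer" is sketched below. It assumes that $M_t$ is the timestamp lookup table (one row per time slot) and that each reconstructed embedding is scored against every row by a dot product; the text does not spell this out, so both points, as well as the function names, are assumptions made for illustration.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def reconstruct_time_ids(e_t_hat: List[List[float]], m_t: List[List[float]]) -> List[int]:
    """Map each reconstructed embedding to its most probable time-slot index.

    Assumption: the score of slot j is the dot product between the
    reconstructed vector and row j of the lookup table m_t.
    """
    ids = []
    for vec in e_t_hat:
        logits = [sum(a * b for a, b in zip(vec, row)) for row in m_t]
        probs = softmax(logits)
        ids.append(max(range(len(probs)), key=probs.__getitem__))
    return ids


m_t = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]        # 3 time slots, d = 2
e_t_hat = [[0.9, 0.1], [0.2, 0.8], [-1.0, -1.0]]    # noisy reconstructions
t_u_hat = reconstruct_time_ids(e_t_hat, m_t)
```

Comparing slot indices rather than raw embedding vectors keeps the target fixed even while the lookup table itself changes during training, which is the motivation given in the paragraph above.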

Optimization Objectives
The DDTCN model adopts a multi-task joint training strategy. Specifically, the model uses two reconstruction tasks to achieve unsupervised training of the encoder-decoder structure, as outlined in Algorithm 1.

Algorithm 1
2. Output: The learned user behavioral representations
3. Randomly initialize the parameters in DDTCN;
4. Project the input series to obtain the behavior type embeddings, $E_b$, the time embeddings, $E_t$, and the content embeddings, $E_c$;
6. Learn the user behavioral representation, $u_b$, from the Transformer and CNN encoder-decoder;
8. Learn the content sequence reconstruction, $\hat{E}_c$, and the time sequence reconstruction, $\hat{T}_u$, using Equations (17) and (19);
9. Perform SGD based on Equation (24) to reduce the reconstruction error and improve the performance of the behavioral representation;
10. End

Following the reconstruction tasks described in Section 4.1.2, the softmax cross-entropy (SCE) loss function is employed to calculate $\mathcal{L}_{time}$ for time sequence reconstruction, and the mean squared error (MSE) loss function is employed to calculate $\mathcal{L}_{content}$ for content sequence reconstruction. Here, $\mathcal{L}_{SCE}$ is the calculation formula for SCE, $\mathbb{1}_y$ denotes the one-hot encoding of the true label $y$, and $\hat{y}$ is the predicted value; $\mathcal{L}_{MSE}$ is the calculation formula for MSE. When calculating $\mathcal{L}_{content}$, since $\hat{E}_c$ and $E_c$ both lie in $\mathbb{R}^{l \times d}$, the mean squared error is computed for each embedding vector, and the losses of all embedding vectors are then averaged. The loss function of the entire model sums the two loss values:

$$\mathcal{L}_{main} = \mathcal{L}_{time} + \mathcal{L}_{content}$$

where $\mathcal{L}_{main}$ accounts for the losses incurred in both the time reconstruction and the content reconstruction.
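The joint objective can be sketched in a few lines of plain Python. This is a minimal illustration rather than the DDTCN training code: the averaging of the cross-entropy over sequence positions is an assumption, and the function names are hypothetical.

```python
import math
from typing import List


def softmax_cross_entropy(logits: List[float], target: int) -> float:
    """SCE for one position: -log softmax(logits)[target], computed stably."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[target]


def time_loss(logits_seq: List[List[float]], targets: List[int]) -> float:
    """L_time: SCE averaged over sequence positions (averaging is assumed)."""
    losses = [softmax_cross_entropy(lg, t) for lg, t in zip(logits_seq, targets)]
    return sum(losses) / len(losses)


def content_loss(e_c: List[List[float]], e_c_hat: List[List[float]]) -> float:
    """L_content: MSE per embedding vector, then the mean over all vectors."""
    per_vector = [
        sum((a - b) ** 2 for a, b in zip(true_vec, rec_vec)) / len(true_vec)
        for true_vec, rec_vec in zip(e_c, e_c_hat)
    ]
    return sum(per_vector) / len(per_vector)


def main_loss(logits_seq, targets, e_c, e_c_hat) -> float:
    """L_main = L_time + L_content."""
    return time_loss(logits_seq, targets) + content_loss(e_c, e_c_hat)


l_time = time_loss([[0.0, 0.0]], [0])                                   # ln 2
l_content = content_loss([[1.0, 2.0], [3.0, 4.0]], [[1.0, 2.0], [3.0, 6.0]])
l_main = main_loss([[0.0, 0.0]], [0],
                   [[1.0, 2.0], [3.0, 4.0]], [[1.0, 2.0], [3.0, 6.0]])
```

Because the two terms are added without weights, their relative scale determines how much each reconstruction task influences the gradient.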

IB2V: Relationship Representation Learning Model
According to the research by Pham et al. [9] (Bot2Vec), in various OSNs, normal users or bots tend to interact with other users who belong to their shared social circles. Typically, botnets managed by attackers establish relationships with each other and maintain interaction frequency to disguise themselves as normal users for evading detection. These social circles can be considered potential communities within the network graph, and by utilizing community information and its internal structural features, it is possible to differentiate between normal users and bots.
Based on prior work on network representation learning and Bot2Vec, we have developed a novel relationship network representation learning model called IB2V (Incremental Bot2Vec), as shown in Figure 4. The main focus of IB2V is to characterize the relationships among users and embed the user nodes from the network graph into low-dimensional vectors using unsupervised learning. Once potential communities are identified in the graph, we employ a strategy that combines breadth-first sampling and depth-first sampling to generate context neighborhoods for each node. Meanwhile, in order to preserve the internal structure of each community, transfer probabilities are designed to restrict the neighborhood set to remain within the community. By acquiring the neighborhood relationships of user nodes and preserving the internal structure of communities, node representations are generated. In order to enhance the performance and usability of the model, IB2V introduces two optimizations:

1. Outer-community association. During the random walk process, in case of cross-community movement, dummy nodes related to the communities are added between the nodes and participate in the similarity calculation of context neighborhoods;
2. An incremental learning strategy. This strategy aims to learn representation vectors of newly added nodes while maintaining model performance as much as possible, avoiding the retraining of the entire graph structure and reducing time costs.


Intra- and Outer-Community-Oriented Random Walks
Initially, the Louvain community detection algorithm [28] is used to determine the community to which the node belongs. Subsequently, leveraging the community information, the graph-based structure of OSN is converted into a Skip-gram model using a walk strategy.
In a random walk process, the length of the random walk is set to $l$, the starting point is $s = v_0$, $i$ steps have been taken, and the walker is at node $v_i$. $v_{i+1}$ is a node in the social graph, and the transition probability from $v_i$ to $v_{i+1}$ is defined as $\pi(v_i \to v_{i+1}, C_{v_i})$. In the formula defining $\pi$, $E$ is the set of directed edges (relationships) in the graph. $\alpha_{(v_i, v_{i+1})}$ is the non-normalized transition probability from $v_i$ to $v_{i+1}$ within the community when $v_i$ and $v_{i+1}$ belong to the same community, $C_{v_i}$. $\beta_{(v_i, v_{i+1})}$ is the non-normalized transition probability across communities from $v_i$ to $v_{i+1}$ when $v_i$ and $v_{i+1}$ do not belong to the same community. $w_{(v_i, v_{i+1})}$ is the weight of the directed edge between $v_i$ and $v_{i+1}$. $\lambda$ is the global normalization constant, which is calculated by summing all the non-normalized transition probabilities of node $v_i$.
The random walk strategy in Bot2Vec sets the transition probabilities $\alpha_{(v_i, v_{i+1})}$ and $\beta_{(v_i, v_{i+1})}$ within and outside the community. In their definitions, $v_{i-1}$ represents the node visited before $v_i$, and $spd(v_{i-1}, v_{i+1})$ is the shortest path distance from $v_{i-1}$ to $v_{i+1}$. Since it takes at most two steps from $v_{i-1}$ to $v_{i+1}$, the value of $spd(v_{i-1}, v_{i+1})$ can only be selected from {0, 1, 2}. $|C_{v_i}|$ is the number of nodes in the community to which $v_i$ belongs. Meanwhile, $p$, $q$, and $r$ are hyperparameters, and the penalty value is formed from $r$ and $|C_{v_i}|$. Figure 5 shows the random walk process of our model. The in-out parameter $q$ controls the distance of the random walk. When $q < 1$, the walk is more likely to choose nodes farther away from $v_{i-1}$. When $q > 1$, the walk is more likely to visit nodes closer to $v_{i-1}$; that is, nodes around $v_{i-1}$. By setting $q$, the random walk process can be steered to explore outward from the source node or to explore its surroundings. In addition, the return parameter $p$ controls the probability of the random walk revisiting the previous node; that is, $v_{i+1} = v_{i-1}$. Increasing the value of $p$ helps reduce the likelihood of sampling the same node again in the random walk process. The penalty value is used for cross-community sampling to control whether the random walker, starting from the current node, moves outside or stays within the current community. Thus, changing the value of $r$ can guide the random walker in capturing the local structure of each user node.
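The roles of $p$, $q$, and the cross-community penalty can be sketched as follows. The weights $1/p$, $1$, and $1/q$ for $spd$ values 0, 1, and 2 follow the node2vec convention, which matches the behavior described above but is assumed here rather than taken from the text. The exact form of the penalty in terms of $r$ and the community size is not reproduced, so it is passed in as a plain number; the names `alpha` and `transition_probs` are hypothetical.

```python
from typing import Dict, Hashable, Tuple


def alpha(spd: int, p: float, q: float) -> float:
    """node2vec-style weight for an intra-community step (assumed form)."""
    if spd == 0:   # stepping back to the previous node
        return 1.0 / p
    if spd == 1:   # next node is also adjacent to the previous node
        return 1.0
    if spd == 2:   # moving away from the previous node
        return 1.0 / q
    raise ValueError("spd must be 0, 1, or 2")


def transition_probs(
    current: Hashable,
    candidates: Dict[Hashable, Tuple[float, int]],
    community: Dict[Hashable, int],
    p: float,
    q: float,
    penalty: float,
) -> Dict[Hashable, float]:
    """Normalize weighted scores over the out-neighbors of `current`.

    candidates maps each out-neighbor to (edge weight, spd from the
    previous node). `penalty` stands in for the r-based cross-community term.
    """
    scores = {}
    for nxt, (weight, spd) in candidates.items():
        same_community = community[nxt] == community[current]
        factor = alpha(spd, p, q) if same_community else penalty
        scores[nxt] = factor * weight
    lam = sum(scores.values())  # normalization constant for `current`
    return {node: score / lam for node, score in scores.items()}


probs = transition_probs(
    current="b",
    candidates={"a": (1.0, 0), "c": (1.0, 2), "d": (2.0, 2)},
    community={"a": 0, "b": 0, "c": 0, "d": 1},
    p=2.0, q=0.5, penalty=0.25,
)
```

In this toy setting, $q < 1$ favors the outward step to `c`, while the small penalty keeps the walker from leaving its community for `d` despite the heavier edge.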
In addition, to preserve cross-community information during the sampling process, IB2V introduces dummy community nodes between communities, as depicted in Figure 5. Let $D_k \in D$ be a dummy node, where $D$ is the set of all dummy nodes and $D_k$ is positioned between two different communities, $C_{v_k}$ and $C_{v_{k+1}}$. When the random walker moves across communities, from $v_k$ to $v_{k+1}$, we include the dummy node $D_k$ in the sampled context, forming $v_k \to D_k \to v_{k+1}$. As a result, the node contexts generated during random walk sampling include community dummy nodes, which capture the association between different communities and cross-community information.

Figure 5. Intra- and outer-community-oriented random walks.
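The insertion of dummy nodes into a sampled walk can be sketched as below. The sketch assumes one dummy node per unordered pair of communities, which the text does not state explicitly, and the tuple encoding `("D", a, b)` and the function name are hypothetical.

```python
from typing import Dict, Hashable, List


def add_dummy_nodes(walk: List[Hashable], community: Dict[Hashable, int]) -> List[Hashable]:
    """Insert a dummy community node wherever the walk crosses communities.

    Assumption: the dummy node is shared by both crossing directions, so it
    is keyed on the sorted pair of community ids.
    """
    if not walk:
        return []
    context = [walk[0]]
    for prev, nxt in zip(walk, walk[1:]):
        c_prev, c_next = community[prev], community[nxt]
        if c_prev != c_next:
            context.append(("D", min(c_prev, c_next), max(c_prev, c_next)))
        context.append(nxt)
    return context


community = {"u1": 0, "u2": 0, "u3": 1, "u4": 0}
context = add_dummy_nodes(["u1", "u2", "u3", "u4"], community)
```

The dummy nodes then receive embeddings like any other token in the Skip-gram corpus, so users who frequently cross the same community boundary share context.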

Node Representation Learning and Optimization Objectives
The primary objective of the model is to acquire the latent node representation $u_g$ with a dimensionality of $e$. The Skip-gram model and Word2Vec are used to learn the representation of each node by sampling context neighborhoods through random walks. For a node $v$ in a social network, the optimization objective of IB2V is to maximize the probability of its neighboring nodes appearing, as shown in the following equation:

$$\max_{\theta} \sum_{c \in N(v)} \log \Pr(c \mid v; \theta) \quad (28)$$

where $N(v)$ represents the set of neighboring nodes of $v$ obtained through sampling strategies. Given node $v$, $\Pr(c \mid v; \theta)$ is the conditional probability of node $c$ occurring, which is defined in terms of $u_c$ and $u_v$, the representation vectors of $c$ and $v$, respectively. To reduce the computational cost of model parameter updates, the network embedding process adopts a negative sampling technique, which randomly selects a subset of neurons for updates. Negative sampling has proven effective for training on large-scale datasets, such as text corpora and information networks. In more detail, for each positive training sample, a set of $K$ negative samples is randomly selected from the given node set $V$ according to the noise distribution, $P(u)$. The optimization function with the negative sampling strategy is $L_\theta$, in which $\sigma$ is the sigmoid activation function, $P(u)$ is the noise distribution for randomly selecting $K$ negative-sample nodes from the given node set, and $u_k$ is the user node selected through negative sampling.
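For orientation, the standard Skip-gram negative-sampling objective for one (node, context) pair is sketched below: $\log \sigma(u_c \cdot u_v) + \sum_{k=1}^{K} \log \sigma(-u_k \cdot u_v)$. This is the common Word2Vec form and is assumed to correspond to $L_\theta$; the text gives only the symbol definitions, and the function names here are hypothetical.

```python
import math
from typing import List


def log_sigmoid(x: float) -> float:
    """log(sigma(x)) computed without overflow for large |x|."""
    return -(max(-x, 0.0) + math.log1p(math.exp(-abs(x))))


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def neg_sampling_objective(u_v: List[float], u_c: List[float],
                           negatives: List[List[float]]) -> float:
    """Objective for one positive pair (v, c) and K sampled negatives u_k.

    Higher is better: the positive pair should score high, and every
    negative sample should score low.
    """
    value = log_sigmoid(dot(u_c, u_v))
    for u_k in negatives:
        value += log_sigmoid(-dot(u_k, u_v))
    return value


aligned = neg_sampling_objective([1.0, 0.0], [2.0, 0.0], [[-2.0, 0.0]])
opposed = neg_sampling_objective([1.0, 0.0], [-2.0, 0.0], [[2.0, 0.0]])
neutral = neg_sampling_objective([1.0, 0.0], [0.0, 0.0], [[0.0, 0.0]])
```

Only the $K + 1$ vectors involved in a pair are touched per update, which is where the saving over a full softmax across all nodes comes from.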

Incremental Learning Strategy
OSNs in the real world are typically characterized by a large scale, with some networks having hundreds of millions of user nodes. When new nodes and relationships are added, incremental training on the existing social graph must be considered. However, reloading the complete social graph, resampling, and relearning representations for each node can consume a large amount of time and computational resources. Therefore, this paper proposes an incremental learning strategy for IB2V that aims to reduce the time cost of continued model training as much as possible, while minimizing performance degradation. The specific process is illustrated in Figure 6.

To improve the efficiency of model training, the incremental learning strategy adopts several methods, including avoiding redundant community detection, reducing sampling frequency, and fixing the weights of old nodes, as follows:

1. The model first conducts community detection and saves the community membership of each node;
2. The original social graph is then converted into a community graph, where nodes are represented as communities. Based on the saved community information and structure, newly added nodes are integrated into the community graph and undergo a new round of community detection to determine their community membership;
3. The newly added nodes are then used as the starting points for the random walk process, which generates context sequences after sampling from the new social graph;
4. In the final step, the representations of old nodes are fixed, and the Skip-gram model is used to learn the representations for all new nodes.
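The final step, learning new-node vectors while old-node vectors stay fixed, can be sketched as a single gradient-ascent update on the negative-sampling objective. This is a simplified illustration, not the IB2V code: it uses one embedding table for both node and context roles (Word2Vec normally keeps two), and the name `train_pair` and the `frozen` set are hypothetical.

```python
import math
from typing import Dict, List, Set


def sigmoid(x: float) -> float:
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def train_pair(emb: Dict[str, List[float]], center: str, context: str,
               negatives: List[str], frozen: Set[str], lr: float = 0.1) -> None:
    """One ascent step on log s(u_c.u_v) + sum_k log s(-u_k.u_v).

    Vectors of nodes listed in `frozen` (the old nodes) are never modified.
    """
    u_v = emb[center]
    grad_v = [0.0] * len(u_v)
    pairs = [(context, 1.0)] + [(k, 0.0) for k in negatives]
    for node, label in pairs:
        u_n = emb[node]
        g = label - sigmoid(sum(a * b for a, b in zip(u_n, u_v)))
        for j in range(len(u_v)):
            grad_v[j] += g * u_n[j]
        if node not in frozen:
            emb[node] = [a + lr * g * b for a, b in zip(u_n, u_v)]
    if center not in frozen:
        emb[center] = [a + lr * gj for a, gj in zip(u_v, grad_v)]


emb = {"old": [1.0, 0.0], "new": [0.0, 1.0], "neg": [0.5, 0.5]}
train_pair(emb, center="new", context="old", negatives=["neg"],
           frozen={"old", "neg"})
```

Since walks start only from new nodes and only their vectors move, the cost of an update round scales with the number of added nodes rather than with the size of the whole graph.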

Datasets and Evaluation Metrics
In order to assess and evaluate the performance of the detection framework and its generalization ability across different OSNs, this study utilizes four publicly available datasets from distinct OSNs:

1. Cresci-2015 [29] (Twitter). The statistical information of the dataset is presented in Table 2. It comprises 5301 users, of which 3351 accounts are labeled as bots, accounting for 63% of all users. This dataset includes tweet records and user profiles of relevant users, and consists of five sub-datasets: TFP, E13, FSF, INT, and TWT. All sub-datasets have label information, with users annotated as either bots or normal users.
2. Social-Spammer [30] (Tagged). This dataset is collected from the Tagged social networking website and contains a large labeled dataset of bots. It encompasses a total of 5,607,454 accounts with their profiles, 912,280,409 relationship records between accounts, and timestamps of interactions. This dataset forms a heterogeneous network with seven different types of relationships between users ("Message", "Pet Game", "Meet-Me Game", "Add Friend", "Give A Gift", "Report Abuse", and "View Profile"). Among these, 221,305 accounts are labeled as bots, which constitutes 3.9% of all accounts.
3. MicroblogPCU [31] (Weibo). This dataset was collected by researchers for spam detection in Weibo, and includes basic attributes of users, as well as the content they posted and the corresponding timestamps. It contains a total of 48,848 Weibo posts and 781 accounts. Among all the accounts, 113 are labeled as normal users and 66 as malicious bots.
4. TwiBot-22 [32] (Twitter). A comprehensive graph-based Twitter bot detection benchmark that presents the largest dataset to date. The dataset contains 1,000,000 users (39,943 accounts are labeled as bots), 86,764,167 tweets, and 170,185,937 edges between users.

The four datasets above are processed using the re-structuring and pre-processing methods. For the experiments, all datasets are divided into training, validation, and test sets in a ratio of 7:1:2. To test the model's generalization ability and its applicability to new users, MRLBot is trained on the training set in an unsupervised manner to generate user representations. The deep learning classifier is trained using a supervised task of predicting malicious SMBs, and the performance of the detection framework is evaluated using the test set. Additionally, to prevent model overfitting, the training quality is evaluated using the validation set. Table 3 presents the evaluation metrics used in this study.

| Metric | Formula | Description |
| --- | --- | --- |
| Accuracy | (TP + TN) / (TP + TN + FP + FN) | The ratio of correctly classified samples to all samples. |
| Precision | TP / (TP + FP) | The ratio of the number of correctly identified malicious SMBs to the total number of identified samples. |
| Recall | TP / (FN + TP) | The ratio of the number of correctly identified malicious SMBs to the number of samples that should be identified. |
| F1 score | 2 * TP / (2 * TP + FP + FN) | A statistical metric of model accuracy that considers both precision and recall. |
In Table 3, TP, TN, FP, and FN represent true positive, true negative, false positive, and false negative, respectively. The primary focus of this study is the detection of malicious SMBs, with bots being considered positive samples and normal users being considered negative samples. Therefore, accuracy is employed as a measure of the overall classification performance. In cases where the dataset has a significant imbalance between positive and negative samples (e.g., in the Social-Spammer dataset where bots account for only 3.9% of all accounts), evaluation metrics that emphasize positive samples, such as precision, recall, and F1 score, should be given priority.
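The point about imbalanced data can be made concrete with a short calculation. The helper below computes the four metrics from confusion-matrix counts, treating bots as the positive class; the function name and the example counts are illustrative and are not results from the paper.

```python
from typing import Dict


def detection_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 with bots as the positive class.

    Ratios with an empty denominator are reported as 0.0.
    """
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "f1": 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0,
    }


# A balanced-looking detector on 100 hypothetical accounts.
typical = detection_metrics(tp=8, tn=85, fp=2, fn=5)

# A detector that labels every account "normal" when 3.9% are bots
# (39 of 1000 hypothetical accounts).
all_normal = detection_metrics(tp=0, tn=961, fp=0, fn=39)
```

The second case reaches 96.1% accuracy while finding no bots at all, which is why precision, recall, and F1 take priority on a dataset such as Social-Spammer.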

Experimental Setups
Based on the PyTorch deep learning framework (version 1.11.0), we constructed MRLBot. The hyperparameter settings are provided in Table 4. In the experimental discussion section of Section 5.4, significant hyperparameters will be adjusted to assess their impact on the model's performance. To demonstrate MRLBot's ability to detect malicious SMBs, we compared it against the following baseline methods:

1. AdaBoost [14]. This method uses a 10-dimensional feature set based on user profiles and employs AdaBoost for bot classification. The feature "favorite_counts" is present in the Cresci-2015 dataset, but it is not available in the other two datasets, and thus is not used in experiments involving these datasets.
2. DeeProBot [33]. This method uses metadata from user profiles and replaces the descriptive text in the metadata with pre-trained global word embeddings. The model consists of LSTM and fully connected layers to handle mixed types of features, including numerical, binary, and text. As the datasets used in this paper lack the "Sentiment" and "Timing" features, these features are not taken into consideration during input construction.
3. BotRGCN [11]. The authors constructed a heterogeneous graph from the follower relationships, embedding multimodal user semantics and attribute information into the graph, and applied a graph convolutional network for detecting bots. Since the datasets used in this paper do not include account avatars, this feature is not considered when constructing inputs.
4. SATAR [22]. This method is an unsupervised Twitter user representation learning framework that jointly utilizes semantic, attribute, and neighborhood information, and employs a co-influence module to aggregate this information.
The above methods have corresponding implementations in publicly available code repositories, and hyperparameters from the published papers are referred to for optimal performance.

Experimental Results
This section discusses the comparative experiments conducted on the proposed MRLBot, using datasets from three distinct social networks and the baseline methods described in Section 5.2. Table 5 presents the experimental results of all detection frameworks on Cresci-2015 and TwiBot-22. MRLBot demonstrated superior performance in terms of accuracy and the F1 score compared to other bot detection frameworks, achieving 97.25% accuracy and a 97.85% F1 score. Despite achieving the highest precision, AdaBoost generated a significant number of false negatives in bot identification, mistakenly classifying numerous bot samples as normal users. SATAR, although achieving the highest recall, also misclassified some normal users as bots. Overall, MRLBot achieved the best detection performance on the Cresci-2015 dataset. Similarly, MRLBot achieved the best performance on TwiBot-22.
The detection performance on Social-Spammer and MicroblogPCU is shown in Table 6. On the Social-Spammer dataset, which contains richer relational information, BotRGCN achieved the highest detection performance. This is attributed to the graph convolutional networks in BotRGCN, which capture the global contextual information of the social graph and treat it as a heterogeneous graph, taking into account the diverse types of relationships among users. Our proposed MRLBot outperformed all other methods except BotRGCN, indicating that integrating multi-dimensional features enhances detection performance and approaches the performance of graph neural networks trained on supervised tasks.
On the MicroblogPCU dataset with a small sample size, MRLBot outperformed SATAR, which is also an unsupervised framework, and achieved the best performance among all the compared frameworks. This indicates that the fusion of multi-dimensional information improved the detection performance of user representations.
In light of the experimental results and the analysis presented above, we can draw the following conclusions: MRLBot demonstrated effective detection results on diverse datasets, showcasing its ability to generalize across various OSNs. Furthermore, across the four publicly available datasets, MRLBot surpassed the state-of-the-art baseline frameworks on Cresci-2015, TwiBot-22, and MicroblogPCU, and was second only to BotRGCN on Social-Spammer, validating the efficacy of our approach in integrating multi-dimensional representations.
To illustrate the significance of timestamp sequences for the model, ablation experiments were conducted to compare the performance of TCN, which uses timestamp sequences, against that of TCNpos, which does not. The improvement brought by timestamp sequences is evident in both the accuracy and the F1 score. This is because, although Transformer is capable of capturing contextual relationships in sequences, it lacks information about the timing of user behavior. Moreover, the Transformer dual decoder demonstrated a notable improvement of 1.47% in the F1 score compared to the other components in the control group. We posit that the parallel decoder, through the reconstruction of the timestamp sequence, enhances the temporal features of user representations. This facilitates more accurate identification of samples that were previously challenging to classify, thereby resulting in a more balanced detection performance.


Efficiency of Incremental Learning Strategy
In the comparative experiment, a graph of all nodes and relationships from the dataset was constructed to learn user representations. MRLBot was trained on a per-user-node basis, using 50% of the nodes from the Social-Spammer dataset as the initial training set. Subsequently, the percentage of nodes was incrementally increased from 50% to 100% to simulate the scenario of adding new users.
In the absence of the incremental learning strategy described in Section 4.2.3, the representations of all nodes must be relearned following the algorithmic flow of IB2V, and the classifier needs to be retrained using the newly added 10% nodes as the test set to validate the classification performance. The effectiveness of this strategy was verified by comparing the time taken and detection performance of generating representation vectors for newly added nodes with and without using the incremental learning strategy.
Based on the findings in Tables 7 and 8, the incremental learning strategy can greatly decrease the time cost of learning representations for newly added nodes. However, the strategy may result in a trade-off between time cost and detection performance. This is because this strategy does not re-sample and generate context sequences for all nodes, resulting in the loss of structural information of some nodes in the graph. Additionally, the performance of the generated representations for new nodes may also be affected due to the fixed representation of old nodes and the loss of structure features. As the number of nodes grows, the loss of detection performance caused by the incremental learning strategy becomes more severe. Therefore, it is recommended to use the incremental learning strategy to generate representation vectors for a certain amount of newly added nodes to reduce training time costs. However, when a large number of new users are added to the social network over a period of time, in order to ensure detection performance, it is recommended to re-sample the new social graph structure and refresh the representation vectors of all nodes. In Section 4.1.1, the user's social network was defined as a directed weighted graph, and the relationship strength among users was divided. We tested the variation in MRLBot's detection performance on the Social-Spammer dataset when the input of IB2V was a directed unweighted graph and a directed weighted graph.
Based on the findings in Table 9, the detection performance of MRLBot was enhanced when the relationship strength among users was considered. We defined the relationship strength based on the number of interactions among users and quantified it using the weight of edges in the relationship graph. The experimental results indicate that this optimization scheme enables the network representation learning method to capture more relationship features from the social graph, leading to improved performance of the node representations in bot detection tasks.
The primary objective of selecting suitable hyperparameters for the model is to achieve optimal performance within the constraints of time and computational resources. This section investigates the impact of different hyperparameters on the model's performance, using the Cresci-2015 dataset, in order to evaluate the robustness of our model.
In our model, MRLBot, the behavior representation part is based on the Transformer encoder-decoder structure. The number of layers in the Transformer encoder and decoder may affect the model's performance. Keeping the other hyperparameters unchanged, we varied the number of layers in the Transformer encoder and dual decoder in DDTCN and recorded the accuracy and F1 score of the detection framework, as shown in Figures 8 and 9. The detection performance was best when the number of layers in the Transformer dual decoder was set to 1 and the number of layers in the Transformer encoder was set to 2. Thus, the depth of our model is not the determining factor of performance, while increasing the depth would raise the computational cost and model complexity.
To extract local information from the behavior sequence, we used a CNN autoencoder. The size of the convolutional kernel can affect the final performance of the model. We set the size of the convolutional kernel as two-dimensional [k_w, k_e], where k_e is the same as the embedding dimension of the model, i.e., k_e = d. By varying k_w in the set {2, 4, 8, 16, 32, 64}, we recorded the performance change of the detection framework with the change in the convolutional kernel size, as shown in Figure 10.
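The effect of the kernel shape can be made concrete with a minimal sketch. It assumes a single kernel, stride 1, and no padding, none of which is stated in the text; the function name `conv_full_width` is hypothetical. Because the kernel spans the full embedding dimension, it can only slide along the sequence axis, so each output value summarizes a window of k_w consecutive behavior embeddings.

```python
def conv_full_width(seq, kernel):
    """Slide a k_w x d kernel over an n x d sequence (stride 1, no padding).

    seq    : list of n rows, each a list of d floats (behavior embeddings)
    kernel : list of k_w rows, each a list of k_e floats, with k_e == d
    Returns a list of n - k_w + 1 window responses.
    """
    n, d = len(seq), len(seq[0])
    k_w, k_e = len(kernel), len(kernel[0])
    if k_e != d:
        raise ValueError("k_e must equal the embedding dimension d")
    if k_w > n:
        raise ValueError("k_w must not exceed the sequence length")
    out = []
    for start in range(n - k_w + 1):
        total = 0.0
        for i in range(k_w):
            for j in range(d):
                total += seq[start + i][j] * kernel[i][j]
        out.append(total)
    return out


# Toy usage: a sequence of 64 behaviors with embedding dimension d = 4.
seq = [[1.0] * 4 for _ in range(64)]
output_lengths = {
    k_w: len(conv_full_width(seq, [[1.0] * 4 for _ in range(k_w)]))
    for k_w in (2, 4, 8, 16, 32, 64)
}
```

Under these assumptions, a larger k_w yields fewer outputs, each mixing more behaviors into one value.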
Larger convolutional kernels result in longer sequence lengths being calculated within each kernel, causing the model to focus more on textual content and temporal information. As a consequence, the ratio of important local information to unimportant information decreases, leading to diminishing returns in model performance. Furthermore, as the convolutional kernel size increases, projecting high-dimensional information into a low-dimensional space does not increase the amount of effective information in the user vectors. As a result, the accuracy and F1 score plateau, with no further improvement observed beyond a kernel size of eight.
Most network representation learning techniques are highly sensitive to changes in hyperparameters, such as the number of random walks per node (w) and the length of each walk (l). Based on the results depicted in Figure 11, stable and optimal performance in bot classification tasks was achieved with w > 18 and l between 30 and 50. This can be attributed to the fact that increasing the values of w and l increases the number of contextual nodes for each user node. Therefore, these parameters should be chosen carefully, taking into consideration the size of the network, in order to generate sufficient training samples for the network representation learning. Finally, the parameter sensitivity experiments indicate that, considering the balance between time, computational resources, and model performance, settings of w = 20 and l between 30 and 50 can be adopted.
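How w and l control the amount of training data can be seen in a minimal sketch. It uses uniform first-order walks rather than the community-oriented walks of IB2V, and the context window size is a hypothetical parameter that the text does not specify; the function names are likewise illustrative.

```python
import random


def generate_walks(graph, walks_per_node, walk_length, seed=0):
    """Sample `walks_per_node` (w) walks of up to `walk_length` (l) nodes per node.

    graph: dict mapping node -> list of successor nodes.
    """
    rng = random.Random(seed)
    nodes = list(graph)
    walks = []
    for _ in range(walks_per_node):
        rng.shuffle(nodes)  # visit start nodes in a fresh order each pass
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                successors = graph.get(walk[-1], [])
                if not successors:
                    break  # dead end: stop this walk early
                walk.append(rng.choice(successors))
            walks.append(walk)
    return walks


def count_context_pairs(walks, window):
    """Count (center, context) pairs produced by a symmetric window."""
    pairs = 0
    for walk in walks:
        for i in range(len(walk)):
            lo = max(0, i - window)
            hi = min(len(walk), i + window + 1)
            pairs += hi - lo - 1  # neighbours inside the window, excluding i
    return pairs


# Toy usage: a directed ring of five nodes, w = 20 and l = 30.
ring = {i: [(i + 1) % 5] for i in range(5)}
walks = generate_walks(ring, walks_per_node=20, walk_length=30)
pairs = count_context_pairs(walks, window=5)
```

The number of walks is w times the number of nodes, and each walk contributes roughly l times twice the window size in context pairs, so the training set grows about linearly in both w and l. This is why both values need to be scaled to the size of the network.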
Figure 11. Performance of the detection framework with different IB2V parameters (Cresci-2015 dataset); (a) performance of the detection framework with different numbers of random walks per node; (b) performance of the detection framework with different random walk distances.

Limitations
Our proposed method has the following limitations:

• Limitations of research hypotheses. In Section 3, we simplified user behaviors and relationships in social networks, which is a key weakness of this work. In actual social networks, user behaviors and relationships are more complex and involve more variables. Furthermore, the hypotheses we adopted are based solely on past research and historical experience and do not take future changes into account. This has two main consequences. Firstly, because fewer variables are considered, the detection performance may decrease in a real environment, even though the rationality of the research is guaranteed. Secondly, attackers may bypass the proposed detection methods by exploiting these assumptions to disguise software robots as real users.

• Limitations of technical methods. During the representation fusion stage, we concatenated representations of different dimensions, which has the potential to affect the final performance.

• Limitations of experimental scenarios. All experiments in this paper were conducted using publicly available datasets and did not involve actual online environments. These datasets were collected by researchers in the past, and the performance on these datasets can indicate whether or not the detection method was effective in the past time period. However, it is also important to consider the timeliness of the detection method; that is, whether or not it is effective in the latest time period. Real-time monitoring in online environments requires more complex engineering work and will be the focus of our future research.

Conclusions
This paper presents a novel method for detecting malicious SMBs based on multidimensional representation learning. Detection methods relying on a single feature, such as behavior or relationship, may perform poorly when that feature is missing in a social network platform. To address this limitation, this study proposes a framework that combines behavior representation and relationship representation learning models to generate fused representations. By unifying the input from multiple platforms, the framework achieves generalization across different social network platforms. Specifically, an unsupervised representation learning model, DDTCN, based on user behavior, is proposed, along with different optimization components to enhance the model's output vectors for users. Additionally, a network representation learning model, IB2V, is proposed, which incorporates an incremental learning strategy for large-scale social network graphs to reduce the time cost of generating representations for newly added nodes while maintaining performance. This model captures not only the structural features of node neighborhoods, but also the internal structure of communities and the correlations between communities. The experimental results demonstrate the effectiveness of the framework, with good detection performance being achieved on all datasets.

Funding: This work receives no external funding.

Data Availability Statement:
The source code can be obtained by contacting the authors via the emails provided.

Conflicts of Interest:
The authors declare no conflict of interest.