1 Introduction

Context and Motivations. The ubiquitous diffusion of mobile devices is a major driver for the development of new threats. For instance, attackers have successfully developed techniques to eavesdrop on unencrypted communications or to steal computing resources for mining cryptocurrencies (Almaiah et al. 2021). At the same time, the creation of mechanisms to exfiltrate sensitive data has also intensified, especially to gather personal details during reconnaissance campaigns (Mazurczyk and Caviglione 2021). From this perspective, the creation of effective countermeasures to mitigate advanced threats targeting mobile devices is mandatory to improve the security of the Internet.

Unfortunately, the availability of efficient schemes for the static and dynamic analysis of mobile applications ignited an “arms race” between defenders and attackers (Spreitzenbarth et al. 2013). As an example, mechanisms implemented in many stores to reject unsafe applications can be eluded by loading additional code at run time (Poeplau et al. 2014). Among the various techniques to distribute attack routines or extend offensive functionalities, the adoption of steganography is becoming very popular among threat actors (Caviglione and Mazurczyk 2022). Even if the cloaking mechanism may vary, attackers mainly exploit digital images, especially owing to their capacity and popularity (Mazurczyk and Caviglione 2015; Caviglione and Mazurczyk 2022). Steganographic malware then conceals additional code or configuration data (e.g., IP addresses to contact) within innocent-looking images or icons. The first evidence of threats using this technique was observed in 2014 within applications published on the Google Play Store (Suarez-Tangil et al. 2014).

To face such a challenging scenario, security should be enforced in a holistic manner, starting from the very early phases of the development process until the final delivery of the software. Alas, the most popular countermeasures (e.g., security-by-design, sandboxing, and code signatures) often fail to mitigate threats exploiting information-hiding techniques, especially when deployed in mobile ecosystems (Cheddad et al. 2010). In this regard, Machine Learning (ML) and Artificial Intelligence (AI) are now core tools to mitigate the impact of threats leveraging advanced offensive schemes (Gibert et al. 2020). For instance, ML demonstrated its effectiveness in inspecting mobile applications to reveal rogue code within software components (Yuan et al. 2016) or to detect hidden/obfuscated payloads (Gibert et al. 2020; Guarascio et al. 2022).

Objectives. The main goal of this work is to overcome some limitations characterizing many machine learning approaches used to improve the security of mobile ecosystems targeted by steganographic threats. In fact, despite their effectiveness, AI-based solutions often clash with the constraints of production-quality scenarios. For instance, resource-intensive computations may not be possible in a centralized manner due to scalability issues. Another limitation concerns the lack of appropriate datasets, including the need to merge details of a wide range of shared libraries and software components (Mylonas et al. 2013; Zhou et al. 2012). For the case of revealing malicious data hidden in mobile applications, the proliferation of multiple stores to bypass national constraints as well as the availability of unofficial distribution vectors (e.g., p2p networks) prevent having a single, Internet-scale framework. Therefore, this paper introduces a supervised approach exploiting federated learning, which can be distributed across multiple cloud replicas or edge nodes cooperating in the definition of models. Information can also be provided by a local replica of a store, crawled from Web pages hosting the software (e.g., .apk), or gathered from unofficial stores highly popular in regions like Asia (Li et al. 2017). The federated approach may also prevent GDPR violations due to processing operations hosted in areas with incompatible policies on data confidentiality (Papageorgiou et al. 2018).

Contributions and Improvements. The contributions of this paper are twofold. First, it showcases a framework based on a federated approach that allows cooperation among different “app stores” or Web sources to reveal applications containing steganographic threats. Second, it presents a performance evaluation not limited to hiding techniques observed in real samples but also considering contents obfuscated to prevent detection. Compared to our preliminary work (Cassavia et al. 2023), this paper has the following improvements: it is more focused on applications that can be obtained through multiple sources including the Web and unofficial stores, and it largely extends the performance evaluation campaign, especially by investigating the impact of “elusive” schemes exploiting zip compression and Base64 encoding.

Organization of the paper. The rest of the paper is structured as follows. Section 2 introduces the attack model and the reference scenario, while Sect. 3 deals with the proposed framework. Section 4 showcases numerical results obtained through simulations, and Sect. 5 reviews past works using federated approaches for counteracting malware. Finally, Sect. 6 concludes the paper and outlines some possible future directions.

Fig. 1

Reference attack scenario where mobile applications “repacked” with icons cloaking malicious contents are delivered through a store or made available via the Web

2 Attack model and federated approach

The general attack model considered in this work analyzes a scenario where a threat actor employs steganography to conceal a harmful payload in application icons to evade detection. Such a scheme could be exploited to make the reverse engineering of the attack chain harder or to distribute additional assets (e.g., configuration files, URLs pointing to remote servers, or small scripts) without triggering standard security mechanisms based on signatures or static analysis of software. Each icon is “repacked” within an application, which is then published through a store to make it available to users. Moreover, malicious applications could be made available through alternative stores, repositories, or Web sources (e.g., AppChina, Anzhi and F-Droid (Li et al. 2017)). Figure 1 depicts the reference attack scenario. To hide the malicious payload, we consider an attacker using the plain Least Significant Bit (LSB) technique, which has been observed in various real-world campaigns (Caviglione and Mazurczyk 2022; Mazurczyk and Caviglione 2015). In essence, LSB steganography alters the least significant bit(s) of the color components of each pixel of the container image to conceal a secret. We point out that the more bits are altered, the higher the chance of revealing the presence of the hidden payload via visible alterations or artifacts. To mitigate such an attack, a suitable scheme must be deployed within the store to “reveal” the presence of hidden data and prevent the delivery of a malicious application (see, e.g., Cassavia et al. (2022) and the references therein for the case of centralized distribution pipelines).
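As a concrete illustration, plain LSB embedding with a single bit per color component can be sketched as follows. This is a minimal stand-in, not code from any observed campaign; the function names and the flat list of 8-bit color values are our assumptions:

```python
# Illustrative sketch of plain LSB steganography: the payload is serialized
# as a bit stream and each bit overwrites the least significant bit of one
# color component of the cover image.
def embed_lsb(pixels, payload):
    """pixels: flat list of 8-bit RGB component values; payload: bytes."""
    bits = [(byte >> i) & 1 for byte in payload for i in range(7, -1, -1)]
    if len(bits) > len(pixels):
        raise ValueError("payload exceeds steganographic capacity")
    stego = list(pixels)
    for idx, bit in enumerate(bits):
        stego[idx] = (stego[idx] & ~1) | bit  # alter the last bit only
    return stego

def extract_lsb(pixels, n_bytes):
    """Recover n_bytes of payload from the LSBs of the first pixels."""
    bits = [p & 1 for p in pixels[:n_bytes * 8]]
    return bytes(
        sum(b << (7 - i) for i, b in enumerate(bits[j:j + 8]))
        for j in range(0, len(bits), 8)
    )
```

Since every component changes by at most 1 out of 255 intensity levels, the stego image is visually indistinguishable from the cover when the payload is small, which is exactly why detection requires a dedicated scheme.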

To detect the hidden data, in this work, we leverage a federated-learning-based approach to learn a global, optimal model in a distributed fashion. Figure 2 depicts our reference architecture. For the sake of simplicity, we will refer to the centralized store as the server. Similarly, the term end nodes will identify other (groups of) machines located on the Internet cooperating toward the distribution of applications and the detection process. Specifically, we assume that the server contains various mobile applications along with their icons, e.g., it can be considered an “app store.” To prevent computational or security hazards, the server also contains pointers to some applications acting as a sort of “cache,” for instance, for the most popular contents.

Instead, the end nodes represent local data centers containing a subset of applications/icons already published in the main store and replicated for redundancy. End nodes can also contain novel/unseen data collected from third-party markets or scraped from Web sources and social media, e.g., by retrieving .apk or .ipa bundles directly from repositories. To spot the presence of an application/icon hiding malicious content, end nodes and the server collaborate to find the optimal Deep Neural Network (DNN)-based model via a federated approach. Such strategies are mainly used to avoid moving raw data from end nodes to the server, take advantage of each node’s computational capabilities, and enforce privacy constraints of local devices or users.

Fig. 2

Federated approach of cooperating stores

Referring again to Fig. 2, we assume a centralized server, denoted as \(S\) in the figure. The server contains a “weak” DNN detector model \(M_S\) trained on an initial dataset \(D_S\) and validated through the validation set \(D_{val}\). In the early stage, the detector is shared across \(K\) end nodes, which fine-tune their model \(M_i\) against the local data \(D_i\). To make the predictor more robust and to find a global model in a distributed manner, a subset of end nodes periodically sends updates to the server S containing the weights of each layer composing their local DNN. The server S merges the information received to obtain an ensemble model, whose predictive performance is evaluated against the validation set. If the model performs better than the previous one, the server sends back the best parameters to the end nodes. This process is iterated until a certain convergence criterion is reached. More formally, Algorithm 1 details the federated learning algorithm for training the malware classifier. The procedures to yield the ensemble model and to fine-tune both global and local models are fully described in Sect. 3.3, and have been named \(\texttt{CreateSoupModel}\) and \(\texttt{FineTune}\), respectively.

Algorithm 1
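The loop of Algorithm 1 can be sketched as follows, with model weights reduced to flat lists of floats and with \(\texttt{FineTune}\) and the validation score passed in as callables. All names and the toy score function are illustrative stand-ins, not the paper's actual procedures:

```python
# Minimal sketch of the federated loop: fine-tune on each end node, merge
# via a uniform average, and keep the merged model only if it improves on
# the validation score.
def create_soup_model(peer_weights):
    # Uniform soup: per-coordinate average of the peers' weights (Sect. 3.3).
    k = len(peer_weights)
    return [sum(w[j] for w in peer_weights) / k
            for j in range(len(peer_weights[0]))]

def federated_training(server_weights, local_data, fine_tune, validate,
                       max_iter=10):
    best_weights, best_score = server_weights, validate(server_weights)
    for _ in range(max_iter):
        # Each end node fine-tunes the current global model on its local data.
        peer_weights = [fine_tune(best_weights, d) for d in local_data]
        # The server merges the peers' updates into an ensemble model ...
        soup = create_soup_model(peer_weights)
        score = validate(soup)
        # ... and propagates it back only if it beats the previous best model.
        if score > best_score:
            best_weights, best_score = soup, score
    return best_weights
```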

3 Framework

In this section, we first illustrate the methodology used to detect and classify compromised images, then we describe the neural architecture devised to tackle these problems. Finally, we present the ensemble solution for combining the different neural models yielded by end nodes.

3.1 Solution approach

Figure 3 depicts the general methodology for discovering images compromised via steganographic methods. In more detail, digital images represent the input of the proposed approach and are modeled as matrices with dimensions \(X\times Y\). The pixel is the smallest manageable element of these matrices and stores the color information. The color of each pixel can be decomposed into three main components, i.e., Red (R), Green (G), and Blue (B). The values associated with the RGB components represent the intensity of the various colors, and each value ranges in the interval [0, 255]. Hereinafter, we denote with N the size of the image, computed as \(N = X \times Y \times 3\). In this work, we focus on high-resolution icons as they offer a sort of “unified playground” for various threats. At the same time, this does not entail a loss of generality, as the approach can also be applied and scaled to address regular-sized images.

Concerning the hiding method, LSB steganography is considered a prominent approach to hide malicious code or data in legitimate pictures by changing the value of the \(k\) least significant bit(s) of each color composing the pixels of the image (see Fig. 3 for the case of \(k=1\)). When only a limited number of changes are performed, the image will not exhibit any visible alteration, i.e., pixels will look homogeneous compared to the surrounding elements (Zuppelli et al. 2021). As a consequence, many approaches proposed in the literature partially fail to detect the presence of hidden content, as they produce weak detection models unable to discover the slight differences between licit and compromised contents.

To address all these issues, in this work, we devised a (Federated) Deep Learning approach that processes and analyzes the k LSBs of the icon under investigation. Basically, the first block of the proposed neural network is devoted to yielding a flat representation of the image by extracting the k least significant bits of each color channel composing each pixel. Such a representation is then propagated to the subsequent layers to detect and classify different malicious contents. The DNN allows for extracting high-level discriminative features to be further combined for producing the final classification.
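The low-level feature extraction step can be sketched as follows for an \(X \times Y\) RGB image: flatten the image and keep only the k least significant bits of each of the \(N = X \times Y \times 3\) color components. The function name and the tuple-based image representation are illustrative:

```python
# Sketch of the first block of the network: extract the k LSBs of every
# color channel of every pixel into a flat vector of length N = X * Y * 3.
def lsb_plane(image, k=1):
    """image: X x Y matrix of (R, G, B) tuples -> flat list of k-bit values."""
    mask = (1 << k) - 1  # e.g. 0b1 for k = 1, 0b11 for k = 2
    return [channel & mask
            for row in image
            for pixel in row
            for channel in pixel]
```

This flat representation is what the subsequent dense layers consume; discarding the high-order bits removes the visible image content and leaves only the plane where an LSB payload, if any, resides.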

Fig. 3

Methodological approach for the classification of a stegomalware cloaked in a digital image via LSB steganography

3.2 Neural architecture

To mitigate the impact of threats taking advantage of information hiding, we designed a supervised neural architecture for the classification of image icons. In more detail, we exploited the deep architecture shown in Fig. 4, which permits producing reliable predictions. Essentially, our neural architecture is composed of a stack of several blocks. The first layer acts as a handler for the input provided to the network (denoted as Input Handler in the figure) and propagates the information (i.e., the image) to the subsequent layers of the DNN for further processing. The second component (denoted as Low Level Feature Extraction in the figure) yields a flat representation of the image and extracts the raw information by means of a masking procedure.

The overall DNN is composed of a variable number m of Building Blocks (BBs) obtained by stacking three main components: (i) on top, a fully-connected dense layer, equipped with a Rectified Linear Unit (ReLU) activation function (Nair and Hinton 2010), is instantiated, (ii) then, a batch-normalization layer is stacked on the previous one to improve the stability of the learning phase and to boost the performance of the model, and (iii) a dropout layer is finally added to mitigate the risk of overfitting (Hinton et al. 2014).

Figure 4 details the architecture of the first instance of this specific configuration, labeled BB\(_1\). In more detail, the Batch Normalization standardizes the data to be propagated to the subsequent layers of the DNN w.r.t. the current batch (by considering the average \(\mu\) and the variance \(\sigma\) of each input). A reset of a random number of neurons in the training phase is performed via a dropout mechanism. As pinpointed in Hinton et al. (2012), the usage of the dropout method induces in the DNN a behavior similar to an ensemble model. Hence, the overall output of the whole neural network can be considered as the combination of different sub-networks resulting from this random masking, which disables some paths of the neural architecture. In our experiments, the neural model is instantiated with \(m=4\) building blocks.

The proposed neural classifier also includes a skip connection to implement a residual block. Essentially, residual blocks (He et al. 2016) differ from regular ones by the mere addition (through the skip connection) of an identity function to their output. Formally, the output of a residual block is defined as activation\((F(\mathbf {x'}) + \mathbf {x'})\), where F denotes the function that the residual block is learning to transform the input \(\mathbf {x'}\) of the block itself. A more detailed view of this subnet is sketched in Fig. 5. The usage of skip connections induces in the base DNN classifier a behavior similar to Residual Networks (He et al. 2016), which have proven to be effective solutions to the well-known degradation problem (i.e., neural networks performing worse at increasing depth) and capable of ensuring a good trade-off between convergence speed and expressivity/accuracy. Moreover, the high-level features extracted by the building blocks \({\mathbf{BB_{2}}}\) and \({\mathbf{BB_{4}}}\) are concatenated and used to feed the output layer of the model.
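Under the assumptions that the dense layers use the sizes reported in Sect. 4 (128 neurons for BB\(_1\), 64 for the others, dropout 0.1) and that the skip connection wraps BB\(_2\), the architecture can be sketched in PyTorch roughly as follows. The exact placement of the skip and the projection layer are our guesses, not the paper's released code:

```python
import torch
import torch.nn as nn

class BuildingBlock(nn.Module):
    """Dense -> ReLU -> BatchNorm -> Dropout, as in one BB of Fig. 4."""
    def __init__(self, in_dim, out_dim, p_drop=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.ReLU(),
            nn.BatchNorm1d(out_dim),
            nn.Dropout(p_drop),
        )

    def forward(self, x):
        return self.net(x)

class StegoClassifier(nn.Module):
    """Illustrative reconstruction: m = 4 building blocks, a residual skip
    around BB2, and the concatenated outputs of BB2 and BB4 feeding a
    C-way softmax output layer (dimensions assumed)."""
    def __init__(self, n_inputs, n_classes=6):
        super().__init__()
        self.bb1 = BuildingBlock(n_inputs, 128)
        self.bb2 = BuildingBlock(128, 64)
        self.bb3 = BuildingBlock(64, 64)
        self.bb4 = BuildingBlock(64, 64)
        # Projection so the skip can be summed when dimensions differ
        # (the paper describes an identity skip; this is an approximation).
        self.skip = nn.Linear(128, 64)
        self.out = nn.Linear(64 + 64, n_classes)

    def forward(self, x):
        h1 = self.bb1(x)
        h2 = torch.relu(self.bb2(h1) + self.skip(h1))  # activation(F(x') + x')
        h4 = self.bb4(self.bb3(h2))
        return torch.softmax(self.out(torch.cat([h2, h4], dim=1)), dim=1)
```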

Finally, the Output Layer is instantiated with C neurons (one for each class) and equipped with a softmax activation function (Guarascio et al. 2018). The proposed neural model is trained against a set \({\mathcal {D}}=\{({\textbf{x}}_1, {\textbf{y}}_1), ( {\textbf{x}}_2, {\textbf{y}}_2), \ldots , ({\textbf{x}}_{D},{\textbf{y}}_{D})\}\), where \({\textbf{x}}_i\) is the matrix representation of the i-th image and \({\textbf{y}}_i\) is its class. As regards the output, a one-hot encoding based on C classes is used to model the different labels, each one indicating a specific malicious payload. As will be detailed later, in our work we considered C classes representing “clean” image icons and image icons cloaking JavaScript, HTML, PowerShell, Ethereum wallets, and URL/IP addresses. Finally, the training stage is responsible for optimizing the network weights by minimizing the loss function. The categorical cross-entropy is adopted for the classification task and is calculated as follows:

$$\begin{aligned} \textrm{CCE}({\textbf{y}},\tilde{{\textbf{y}}}) = - \sum _{i=1}^{|D|} {\textbf{y}}_i \log \tilde{{\textbf{y}}}_i. \end{aligned}$$
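A worked numeric instance of the loss above, with \({\textbf{y}}_i\) one-hot and \(\tilde{{\textbf{y}}}_i\) the softmax output (toy probabilities, not experimental values):

```python
import math

# Categorical cross-entropy over a batch: the one-hot y_i selects the
# predicted probability of the correct class, whose log is accumulated.
def categorical_cross_entropy(y_true, y_pred):
    return -sum(
        yc * math.log(pc)
        for y, p in zip(y_true, y_pred)
        for yc, pc in zip(y, p) if yc > 0
    )

y_true = [[1, 0, 0], [0, 1, 0]]              # two samples, C = 3 classes
y_pred = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]  # toy softmax outputs
loss = categorical_cross_entropy(y_true, y_pred)  # -(log 0.7 + log 0.8)
```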
Fig. 4

Neural architecture for hidden content detection and classification

Fig. 5

Residual block subnet architecture

3.3 Ensembling via soup models

As discussed in Sect. 2 (see Algorithm 1), the proposed federated approach uses a soup model mechanism to merge the contribution of each model produced by the K peer nodes (Wortsman et al. 2022). This ensemble approach is inspired by the work described in Neyshabur et al. (2020), which demonstrated that models independently fine-tuned from the same base model fall within the same loss landscape basin. The authors also suggested that interpolating two solutions (e.g., combining DNN weights) may yield a result that falls closer to the basin’s center. The main benefit of the Soup Model is hence to extract robust and reliable classification models by simply averaging the weights, without requiring additional memory or inference time. In our framework, the idea consists in using this ensemble method to average (per layer) the DNN weights yielded by the peer models. Specifically, this strategy is known as Uniform Soup. Although other Soup strategies have been proposed in the literature to combine different models (e.g., Greedy Soups and Learned Soup (Wortsman et al. 2022)), we adopted the Uniform Soup to keep the proposed approach as lightweight as possible.

Formally, let \(f(x, \theta)\) be a neural network with input data x and parameters \(\theta \in \mathcal {R}^{\mathcal {D}}\). Let \(\theta = \texttt{FineTune}(\theta _{0},x)\) be the parameters obtained by fine-tuning the pre-trained initialization \(\theta _{0}\) against data x, and let \(\theta _i = \texttt{FineTune}(\theta _{0}, x_i)\) be the parameters obtained by fine-tuning \(\theta _{0}\) against \(x_i\), i.e., the data of node \(N_i\). The model soup \(f(x,\theta _T)\) is computed using the average of the \(\theta _i\), i.e., \(\theta _T = \frac{1}{K} \sum ^{K}_{i=1} \theta _i\).
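The per-layer averaging can be sketched on weight dictionaries, with plain lists standing in for tensors (layer names are illustrative):

```python
# Uniform Soup: average the K peers' parameters layer by layer,
# coordinate by coordinate, yielding a single merged model.
def uniform_soup(peer_params):
    """peer_params: list of K dicts mapping layer name -> flat weight list."""
    k = len(peer_params)
    return {
        name: [sum(p[name][j] for p in peer_params) / k
               for j in range(len(peer_params[0][name]))]
        for name in peer_params[0]
    }
```

Because the result is a single set of weights of the same shape as any peer model, inference cost and memory footprint are identical to those of one model, which is the lightweight property motivating the choice of the Uniform Soup.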

4 Experimental results

In this section, we first present the dataset modeling realistic images used in mobile applications containing steganographic threats as well as the parameters and metrics adopted in our trials. Then, we will discuss numerical results.

4.1 Dataset and parameters

Table 1 Performance of the federated approach

Our federated approach is evaluated on the “Stego-Images-Dataset”Footnote 1 described in Cassavia et al. (2022). It contains 48,000 icons of \(512 \times 512\) pixels hiding different realistic malicious payloads, i.e., JavaScript, HTML, PowerShell, URLs, and Ethereum addresses, embedded via the LSB steganography technique. The payloads allow for modeling a wide range of threats, such as malicious scripts and routines, links to additional configuration files or lists of commands, and wallets collecting the outcome of cryptojacking and ransomware campaigns. The dataset is split into 16,000, 8,000, and 8,000 icons corresponding to the training, the validation, and the test set, respectively. The training set is further divided among the server and the end nodes composing our architecture. In more detail, \(25\%\) of the training set (4,000 icons) is used to train the model on the server S, whereas the remaining \(75\%\) is assigned to the \(K = 5\) end nodes, i.e., \(15\%\) of the images (2,400 icons) for each node. Instead, the validation set is used in its entirety to validate the ensemble model of the server and partially to validate the models of the end nodes. Finally, the dataset contains three different test sets (each composed of 8,000 icons) to model an attacker unaware/aware of the countermeasure and trying to elude the detection via obfuscation approaches. In particular, the first is generated considering “plain” payloads, i.e., the attacker is completely unaware of the detection mechanism, whereas the others consider payloads encoded in Base64 and compressed with a zip method. Such datasets model an attacker performing a sort of “lateral movement” to bypass security checks.
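The split arithmetic above can be sanity-checked with a few lines (variable names are ours; the figures come from the dataset description):

```python
# 48,000 icons total: 16,000 training + 8,000 validation + three test
# sets of 8,000 each (plain, Base64, zip).
train, val, test = 16_000, 8_000, 8_000

server_share = int(0.25 * train)        # 25% of training kept on server S
K = 5                                   # number of end nodes
per_node = (train - server_share) // K  # remaining 75% split evenly
```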

Table 2 Performance of the federated approach

We implemented the proposed model using the PyTorch framework (Paszke et al. 2019). Basically, it consists of 4 Building Blocks (BBs) in which the fully-connected dense layers, stacked on top of each of them, include 64 neurons, except for the first BB, which is instantiated with 128 neurons. The dropout rate is set to 0.1. The Output Layer includes 6 neurons. The model has been trained over 35 epochs with a batch size of 256, and the best model has been chosen according to the F1-Score. We decided to use 5 clients representing the end nodes and 1 server. For the initialization stage, the server model is trained over 10 epochs, and the best model that maximizes the F1-Score on the whole validation set is selected. Each client model is trained over 6 epochs, and the best model is again chosen according to the F1-Score. The number of iterations is set to 10. AdamW (Loshchilov and Hutter 2019) with a learning rate of 0.0001 is adopted as the optimizer.
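The training setup described above can be sketched as a PyTorch skeleton: AdamW at learning rate 1e-4, cross-entropy loss, and selection of the epoch whose model maximizes a validation score (the F1-Score in our trials). The `select` callback and the structure of the loop are illustrative, not the authors' exact code:

```python
import torch
from torch import nn

# Skeleton of one training stage with score-based model selection.
def train(model, loader, epochs, select):
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()   # categorical cross-entropy
    best_state, best_score = None, float("-inf")
    for _ in range(epochs):
        for x, y in loader:           # loader yields mini-batches
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
        score = select(model)         # e.g. F1-Score on validation data
        if score > best_score:        # keep the best-scoring snapshot
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
            best_score = score
    return best_state
```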

4.2 Evaluation metrics

To evaluate our approach, we relied upon the following metricsFootnote 2:

  • F1-Score: it summarizes the overall system performance and is defined as the harmonic mean of precision and recall. Specifically, the precision is calculated as \(\frac{\text {TP}}{\text {TP} + \text {FP}}\), whereas the recall is calculated as \(\frac{\text {TP}}{\text {TP} + \text {FN}}\);

  • Area Under the Curve (AUC): it is the area under the Receiver Operating Characteristic (ROC) curve, obtained by plotting the true-positive rate (i.e., the recall) against the false-positive rate for different class probability thresholds;

  • AUC-PR: it is the area under the Precision-Recall curve, obtained by plotting the precision and recall for different class probability thresholds.
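The F1-Score definition above reduces to a few lines when computed from raw counts (toy confusion-matrix values, not our experimental results):

```python
# F1-Score as the harmonic mean of precision and recall,
# computed from true-positive, false-positive, and false-negative counts.
def f1_score(tp, fp, fn):
    precision = tp / (tp + fp)   # TP / (TP + FP)
    recall = tp / (tp + fn)      # TP / (TP + FN)
    return 2 * precision * recall / (precision + recall)
```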

4.3 Numerical results

Table 3 Performance of the federated approach
Table 4 Comparison of centralized and federated approaches with different test sets

The first round of tests aims at evaluating the effectiveness of our federated-learning-based approach in detecting malicious payloads hidden within images. Tables 1, 2 and 3 summarize the obtained results and show how the performances of the end nodes (i.e., peers) improve over 10 iterations of the algorithm for the plain text, Base64-encoded, and zip-compressed test sets, respectively. As reported in Table 1, for the plain text, the average AUC of the peers improves from \(94.3\%\) in the 1st iteration to a maximum value of \(96.5\%\) when \(max\_iter\), i.e., the number of iterations defined in Algorithm 1, is reached. The AUC-PR and F1-Score exhibit the same behavior: both metrics improve up to \(82.9\%\) and \(81.1\%\), respectively. As a consequence, the performances of the server improve as well, i.e., from an AUC of \(92.6\%\) to \(97.1\%\). A similar trend can be observed for the other metrics. Analogous results can be observed for the zip compression (Table 3), where the average AUC of the end nodes improves from \(80.8\%\) in the 1st iteration to \(83.5\%\) in the 3rd iteration. Moreover, a better result can be observed for the server, whose AUC and AUC-PR improve in the last round from \(76.9\%\) to \(85.6\%\) and from \(40.3\%\) to \(49.8\%\), respectively. With regard to the results using the Base64 test set (Table 2), the improvement of the metrics for the peers is slightly more limited (e.g., the average AUC improves from \(89.5\%\) in the 1st iteration to \(90.8\%\) in the 5th iteration), whereas we can observe better results for the server. Specifically, the average AUC and AUC-PR improve up to \(89.3\%\) and \(59.4\%\), respectively.

Fig. 6

All classes ROC curves, plain text test set

Fig. 7

All classes ROC curves, Base64 test set

Fig. 8

All classes ROC curves, zip test set

We also quantify the prediction capabilities of the model by showing its performance w.r.t. each class. To this end, we plotted the ROC curves for each test set. Figure 6 shows the ROC curves for all classes when the malicious payload is embedded in plain text with no additional encoding. In this case, although the classifier achieves good results, we can observe some misclassification for URL/IP addresses and Ethereum addresses, with an AUC equal to 0.89 in both cases. This could be because both Ethereum addresses and URLs are composed of few, similar alphanumeric characters, making the classification more difficult. The ROC curves for the Base64-encoded test set are depicted in Fig. 7, where we can observe a small degradation of the performances for each class except for the legitimate images. As in the plain text test, there is again a similar misclassification for the URL/IP and Ethereum address classes, with AUC equal to 0.88 and 0.89, respectively. Moreover, the PowerShell class tends to be misclassified as JavaScript, probably due to the presence of similar statements like if-then-else, for, and type declarations. Finally, in Fig. 8 the ROC curves for the zip compression are shown. In this case, the system is still able to distinguish between compromised and legitimate images, but different payloads are confused with each other, e.g., the AUC for PowerShell is equal to 0.70. An explanation for this behavior could be that zip compression reduces the differences between different payloads and introduces new metadata, which may be similar for all classes.

The second round of tests aimed at comparing the federated approach against a “centralized” blueprint, i.e., one where all the data are stored in the node S. Moreover, we also evaluated the different approaches when dealing with payloads obfuscated by the attacker, i.e., via Base64 encoding and zip compression. Table 4 showcases the results. Concerning plain and Base64-encoded payloads, the differences between the approaches are minimal. Instead, in the case of zip-compressed payloads, the federated solution achieves an improvement of \({\sim }10\%\) in terms of AUC and AUC-PR w.r.t. the centralized approach.

Summing up, the above results demonstrate that federated learning can be effectively used to reveal the presence of concealed contents while guaranteeing privacy and scalability constraints. In more detail, the federated solution achieves performance comparable to a fully centralized method without the necessity of moving data toward a single node. In this way, our approach can also be used in resource-constrained scenarios, e.g., IoT ecosystems.

5 Related works

In recent years, the impact of threat actors taking advantage of information hiding techniques and steganography to make their attack chains more complex and stealthier has also been extended to the Web and its services. For instance, cloaking techniques can be used to juxtapose an additional layer for managing contents of the “hidden Web”, i.e., a portion of the Web only accessible via suitable search interfaces through known keywords (Ntoulas et al. 2005). At the same time, the surge of social media services offers a fertile playground for a variety of scams and malicious activities that can be cloaked in the bulk of data. Among others, notable examples are the use of text-based steganography to abuse contents of online social networks (e.g., Twitter) to orchestrate bots and implement command & control communications (Gurunath et al. 2021). Another relevant scenario is the exploitation of images and metadata used in Facebook to enrich the overall user experience (Hiney et al. 2015). Owing to their double-edged nature, hiding mechanisms can also be considered effective tools to enforce copyright or track Web resources. For example, XML contents can be “marked” by embedding empty elements or patterns of white spaces in tags (Inoue et al. 2001).

Unfortunately, both the Web and ad-hoc distribution channels (e.g., application stores or software repositories) are characterized by an almost infinite set of digital assets that can be exploited for cloaking data. In particular, executables, such as .exe, .apk, and .ipa, can be retrieved almost everywhere on the Internet, thus making it impossible to outline a precise attack surface. Data can then be cloaked in executable code through the manipulation of specific redundancies, such as the injection of uncommon instructions or the alteration of the relative frequency of jumps (see, e.g., Anckaert et al. (2005) for IA-32 code).

For the specific case of enforcing the security of mobile applications, the most popular frameworks typically rely upon a variety of techniques (e.g., static binary analysis, anomaly-based detection, and definition of access control policies in end nodes). Alas, this requires strict cooperation among developers, users, and administrators of application stores (He et al. 2015). In general, the current trend is to exploit ML or AI to analyze the behavior of software, mainly to search for known attack signatures or unexpected interactions among software components.Footnote 3 Such tools have also proven effective in mitigating attacks based on information hiding targeting digital media (Monika and Eswari 2022; Cassavia et al. 2022). Unfortunately, the deployment of AI-capable techniques might clash with practical constraints. First, the inspection of bundled assets and copyrighted material should respect privacy-enforcing regulations requiring architectures not to process any personal information (Pawlicka et al. 2020). Second, the continuous growth of mobile ecosystems is leading to millions of samples to verify. Applications can also be made available via different store replicas for performance purposes, through unofficial channels (e.g., alternative stores (Guarascio et al. 2018) or via sideloading (Li et al. 2017)), as well as on the Web or ad-hoc social media channels, thus rendering the creation of comprehensive datasets a hard task (Wang et al. 2019). To partially cope with these challenges and improve the security of mobile ecosystems, federated approaches are becoming a precious tool (Rahman et al. 2020).

As regards the goal of spotting contents cloaked in digital images via distributed or cloud-native frameworks, Yang et al. (2020) exploit federated transfer learning to improve the performance of image steganalysis tasks while preserving the privacy of users. Even if this work has a similar goal, there are some major differences with our idea. First, it considers (user) end nodes instead of app stores and does not focus on real malware samples. As a consequence, the authors investigated the performance in the presence of advanced steganographic methods acting on the spatial domain (i.e., WOW, S-UNIWARD, and HILL), which have never been observed in real attacks due to their complexity (Mazurczyk and Caviglione 2015; Caviglione and Mazurczyk 2022). Although the work concentrates on digital images similar to those used for the creation of Android/iOS icons (i.e., cropped pictures of \(512 \times 512\) pixels), its experiments only consider greyscale images, which are seldom used in modern mobile applications. Rather, greyscale or B&W images are used for UI widgets, but their steganographic capacity could be very limited, and the detection of massive tampering of assets bundled within a mobile application could be effectively performed without the need for AI (see, e.g., Faruki et al. (2013)).

Considering different use cases, the literature offers various works dealing with the adoption of federated learning to tame realistic threats. As an example, Jiang et al. (2022) present a framework enabling end nodes running Android to classify several types of malware, including ransomware and spyware but not steganographic threats. The problem of classifying malicious samples is also addressed in Lin and Huang (2020), but the work focuses on a generic scenario not related to the security of mobile applications. Instead, Shamili et al. (2010) concentrate on the problem of detecting malware, but for an OS no longer in use (i.e., Symbian S60).

To face the multifaceted cybersecurity challenges of modern deployments, a possible “meet in the middle” blueprint should offload end nodes toward edge entities placed at the border of the network, and cooperating stores could be adopted to implement such an architecture. To this extent, the literature does not offer prior attempts based on edge computing to reveal the presence of threats endowed with information-hiding or image steganography capabilities. In fact, this paradigm, jointly with federated techniques, has been largely used in IoT scenarios, often composed of resource-constrained nodes (Tian et al. 2021). Besides, for the specific case of mobile security, edge/federated approaches have been mainly adopted to guarantee privacy constraints. A notable exception is Hsu et al. (2020), which demonstrates how to detect malware without exposing sensitive information of end users, such as configuration details or how various application programming interfaces are invoked.
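In such a blueprint, each cooperating store or edge node trains on its local samples and only exchanges model parameters, which a coordinator merges via weighted averaging (the classical FedAvg rule). A minimal sketch of the aggregation step follows; abstracting models as flat lists of parameters is our simplification, not the implementation of any cited work:

```python
def federated_average(client_weights, client_sizes):
    """FedAvg aggregation: average client parameters weighted by local dataset size.

    client_weights: list of per-client parameter vectors (flat lists of floats).
    client_sizes:   number of local training samples held by each client.
    """
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * size for w, size in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]
```

A store holding more samples thus contributes proportionally more to the shared detector, while raw icons and user data never leave the node.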

6 Conclusions and future works

In this paper, we have presented a federated framework for the detection of malicious assets cloaked in icon images bundled or repackaged within mobile applications. Our approach demonstrated its effectiveness in handling applications made available through multiple (un)official stores or directly via the Web and social media. The federated framework also showed good performance against threat actors trying to avoid detection via elusive schemes, e.g., when the secret data is encoded in Base64 or compressed with the zip algorithm to create an “obfuscating envelope”.
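To make the elusion scheme concrete, the following self-contained sketch (our own illustration, operating on a raw greyscale pixel buffer rather than a real PNG, with hypothetical function names) wraps a payload in a compress-then-Base64 “envelope” and hides it one bit per pixel via least-significant-bit substitution:

```python
import base64
import zlib

def wrap_payload(secret: bytes) -> bytes:
    """Apply the 'obfuscating envelope': compress, then Base64-encode."""
    return base64.b64encode(zlib.compress(secret))

def embed_lsb(pixels: bytearray, payload: bytes) -> bytearray:
    """Hide each payload bit in the least significant bit of one pixel byte."""
    bits = [(byte >> i) & 1 for byte in payload for i in range(8)]
    if len(bits) > len(pixels):
        raise ValueError("cover image too small for payload")
    stego = bytearray(pixels)  # leave the original cover untouched
    for idx, bit in enumerate(bits):
        stego[idx] = (stego[idx] & 0xFE) | bit
    return stego

def extract_lsb(pixels: bytearray, n_bytes: int) -> bytes:
    """Recover n_bytes from the LSBs, then undo the envelope."""
    out = bytearray()
    for i in range(n_bytes):
        byte = 0
        for j in range(8):
            byte |= (pixels[i * 8 + j] & 1) << j
        out.append(byte)
    return zlib.decompress(base64.b64decode(bytes(out)))
```

Since every carrier byte changes by at most one, simple integrity or visual checks miss the tampering, while the Base64/zip layers destroy the plaintext signatures that string-matching scanners look for, which is precisely why learned detectors are needed.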

As shown, the federated blueprint should be considered of particular value for enforcing the security of scenarios where applications could also be distributed outside classic pipelines, for instance, software made directly available in public repositories, on social media, or on Web pages. A federated scheme can avoid the constraints and bottlenecks characterizing single-point architectures, e.g., scalability issues and the lack of comprehensive snapshots for training the models. Even if implementing such a vision is almost straightforward for single-vendor deployments, federating stores owned by different entities could be unfeasible or require additional engineering. Another limitation concerns the ability of a threat actor running a malicious repository to join the federation and inject incorrect information to improve its undetectability. Lastly, crawling large sources to gather the required data could be time-consuming or difficult. As an example, some websites prevent scraping, and many social media services limit the rate at which information can be requested.

Therefore, future work aims at removing the aforementioned weaknesses. For instance, a suitable communication mechanism (e.g., a specific set of protocols endowed with security guarantees) could promote cooperation among various stores while mitigating the risk of attacks. Another relevant goal of our future research is to extend the proposed approach to detect other types of steganographic threats, especially malicious information cloaked in network traffic. In more detail, we are interested in evaluating whether the benefits of the federated approach can also be leveraged to monitor large-scale networks or microservice-based architectures. As an example, traffic can be collected at multiple points placed at the border of a network/datacenter so as to enforce scalability properties and avoid the necessity of moving sensitive/confidential data.