
Beyond the Hype: An Evaluation of Commercially Available Machine Learning–based Malware Detectors

Published: 10 August 2023


Abstract

There is a lack of scientific testing of commercially available malware detectors, especially those that boast accurate classification of never-before-seen (i.e., zero-day) files using machine learning (ML). Consequently, efficacy of malware detectors is opaque, inhibiting end users from making informed decisions and researchers from targeting gaps in current detectors. In this article, we present a scientific evaluation of four prominent commercial malware detection tools to assist an organization with two primary questions: To what extent do ML-based tools accurately classify previously and never-before-seen files? Is purchasing a network-level malware detector worth the cost? To investigate, we tested each tool against 3,536 total files (2,554 or 72% malicious and 982 or 28% benign) of a variety of file types, including hundreds of malicious zero-days, polyglots, and APT-style files, delivered on multiple protocols. We present statistical results on detection time and accuracy, consider complementary analysis (using multiple tools together), and provide two novel applications of the recent cost–benefit evaluation procedure of Iannacone and Bridges. Although the ML-based tools are more effective at detecting zero-day files and executables, the signature-based tool might still be an overall better option. Both network-based tools provide substantial (simulated) savings when paired with either host tool, yet both show poor detection rates on protocols other than HTTP or SMTP. Our results show that all four tools have near-perfect precision but alarmingly low recall, especially on file types other than executables and office files: Thirty-seven percent of malware, including all polyglot files, were undetected. Priorities for researchers and takeaways for end users are given. Code for future use of the cost model is provided.


1 INTRODUCTION

Attackers use malicious software, known as malware, to steal sensitive data, damage network infrastructure, and hold information for ransom. One of the top priorities for computer security tools is to detect malware and prevent or minimize its impact on both corporate and personal networks. Traditionally, signature-based methods have been used to detect files previously identified as malicious with near perfect precision but potentially miss newer malware samples. With the advent of self-modifying malware and the rapid increase in novel threats, signature-based methods are insufficient on their own. By generalizing patterns of known benign/malicious training examples, machine learning (ML) exhibits the capability to quickly and accurately classify novel file samples in many research studies [19]. Moreover, ML-based malware research has made the transition from the subject of myriad research efforts to a current mainstay of commercial-off-the-shelf (COTS) malware detectors. Yet few practical evaluations of COTS ML-based technologies have been conducted.

Turning from the academic literature, market reports from commercial companies can provide (for a fee) useful information, specifically, end-user feedback, itemization of all technologies in the antivirus/endpoint detection and response marketplace [17], and even statistics showing the efficacy of the detectors on malware tests [4, 40]. (See Section 2.0.2 for recent examples from AV-TEST and SE Labs.) Yet, these tests often report near-perfect detection for many tools, diminishing trust in the capability of the test to sufficiently stress test the tools or to differentiate the tools. Further, as our results—under 60% recall for all tools tested—and other previous works [10] confirm, these near-perfect evaluations promote a false narrative that modern (especially ML-driven) malware detectors provide just shy of perfect protection. Even when armed with many statistics on detection capabilities, balancing or weighing these heterogeneous measurements when reasoning about the efficacy of tools is difficult [20].

As a result, organizations have limited insight into whether or how COTS malware detection tools add value to their current defense solutions, short of trusting possibly skewed vendor-provided claims and market/consumer reports. The origin of this article is evidence of this fact, as the authors were tasked with evaluating four market-leading malware detection tools to assist in understanding their efficacy. This work describes a scientific experiment designed and conducted by the authors (professional researchers) to assist an organization in examining the benefits and tradeoffs of commercially available ML-based malware detection tools.

Guided by the organization’s requests, we established meetings with 22 sales and technical representatives of many popular vendors to learn the merits and approaches of their detection technologies. Based on the organization’s requirements, four prominent detectors were chosen for study: two host-based (one signature based and one ML based) and two network-based (both ML based, one static and one dynamic) detectors. Of particular interest in this study are detection capability across varying file types and categories, efficacy of the network-level detectors across varying protocols, and studying value provided by ML- vs. signature-based, host vs. network, and static vs. dynamic tools, with an eye toward “defense-in-depth” (i.e., strategically combining tools). Our overall goal is to provide the funding organization with methods for quantitatively reasoning about malware detectors and to exhibit results on these four tools that can provide insights into two questions:

(Q1) ML Generalization Hypothesis: We are interested in the fundamental promise of the three ML-based malware detectors, namely “Can ML-based COTS tools accurately classify both never-before-seen files (especially zero-day malware) and publicly available files?”

(Q2) Network-level Malware Detection Hypothesis: Network-level malware detectors seek to complement host detectors. They leverage the advantage of greater computational resources (available in a network appliance, generally a commodity server) without affecting users’ workstations. This means, in theory, they can afford more in-depth analysis without impacting usability, namely slowing processes on each workstation. Notably, the network-level detector must carve the file out of the packets to correctly process it, which is not necessary for host detectors. Of course, if the network detector cannot expand the detection capabilities, then it simply adds cost without enhancing defenses. In our experience with approximately 10 security operation centers (SOCs) [7, 32], all centers required endpoint malware detection and signature-based network intrusion detection, but none used network-level malware detection technologies. Thus, we investigate the question, “Is it worth purchasing, configuring, and maintaining a network-level malware detector, given in-place host-based detection?”

We were given licenses from four vendors for experimentation under the agreement that results could only be released with anonymity; we are unable to provide any details that might disclose the vendors or their intellectual property. The four COTS tools used in this experiment—two endpoint detection tools, one signature based and one claiming to be solely ML based (supervised learning with no signatures), and two network-level tools, both ML based with one claiming static analysis (does not execute files) and the other claiming dynamic analysis (executes files in a sandbox)—provide representatives of different available malware detection approaches. The first endpoint malware detector being solely signature based is considered a baseline to represent the status quo for modern commercial malware defenses. Both network detection tools perform file extraction from packets (whereas endpoint detectors scan files on each host), and both claim to use ML techniques to detect files. The first tool uses static analysis and an ensemble of ML- and signature-based detectors. The second tool performs dynamic analysis to feed an ML classifier. See Section 3.2. We note that two types of commercially available malware detectors are not represented in this study: those that leverage a cloud connection for interactive intelligence (which can be at the host and/or network level) and network-level detectors that are solely signature based.

To test these technologies, we configured a network at the National Cyber Range (NCR) [14] (see Section 3) to deliver 2,554 malware and 982 benignware samples of varying file types to hosts. Zero-days (specially crafted malware never seen in the wild), files to mimic advanced persistent threat (APT) actions, and polyglots (files functioning with multiple file types) were also included to test the capability of these tools to identify novel threats (see Figure 1).


Fig. 1. Malware detection experiment.

Statistical results (Section 4) from the experiment give enlightening insight into how well or how poorly commercially available malware detectors perform. Analysis of detection statistics by file type and delivery protocol illuminates rather alarming gaps. Comparison of the signature-based tools to the three ML tools, especially on zero-day files, allows us to quantify question (Q1), the ML generalization hypothesis. Furthermore, we provide an analysis of conviction latency (i.e., time to detect) per tool and per file type to illustrate the gains/losses of network tools and of using static versus dynamic analysis. Under the assumption that host-based malware detection is a requirement, we consider a defense-in-depth question, “Is it worthwhile to use multiple detectors together?,” by simulating complementary detectors using the logical OR of their alerts.

While these statistical results are enlightening, they present tradeoffs that are difficult to reconcile when choosing the "best" malware detection tools (e.g., how to balance give and take among detection rates, false alerts, detection delays, and anticipated attack costs). To aid the funding organization in using all the results to assess the tools, we design a novel instantiation of the general cost–benefit framework of Iannacone and Bridges [20] to quantifiably assess and compare the network-level tools (question (Q2)). This method simulates the cost to an enterprise using the technologies by integrating detection accuracy and time-to-detect statistics (all learned from experimental testing), along with estimates of attack damage costs, labor costs, and resource costs. We configure the cost model to estimate costs of the two network-level tools on the emulated network. Then in a separate evaluation, we estimate the additional costs for adding the network tools to complement each endpoint tool. Our configuration of the cost model, in particular in evaluating complementary detectors used in tandem, is novel. It gives new vantage points for understanding the tools' efficacy and provides a procedure to assist future acquisition decisions. See Section 5.

1.1 Results Summary

The key findings from our experiment are as follows:

  • These COTS detectors have nearly perfect precision but with detection rates in the 34–55% range.

  • Detection rates jump to ~60% for any pair of a host and a network detector; combining more than two detectors provides little additional gain.

  • From a cost simulation analysis, the host signature-based detector is the best under our baseline assumptions.

  • When assuming that “hard” malware (i.e., zero-days, polyglots, APT-style) files will incur larger damage costs than n-day files, ML-based detectors provide greater savings according to our cost model.

  • Substantial savings are provided when adding one of the network-based detectors, with the dynamic detector providing the largest savings (for both host detectors).

We conclude this article with a discussion in Section 6 of the overall takeaways and the limitations of this work. These findings are mapped to takeaways for (1) researchers wanting to strengthen the state of malware defense and (2) SOCs considering the purchase of a malware detector.

1.2 Contributions

Our research is one of the few academic studies providing empirical evaluation of commercially available ML-based malware detection tools, and we believe it is the largest academic evaluation of such tools to date—using a corpus of more than 3,500 files (in comparison to 1,000 files in prior work on COTS [15]). More notably, it provides a unique set of files with a wide variety of file types of varying difficulty to detect, illuminating strengths and weaknesses of the tools. It is in response to commercial evaluations [4, 40] that do provide results of COTS malware detectors on larger corpora but fail to differentiate among the tools. Our contribution is also in our analysis of the results, considering complementary detectors and multiple applications of a cost–benefit analysis. We provide two novel contributions to the general cost model framework of Iannacone and Bridges [20]. We show how to use a large number of benign/malware samples to accurately estimate the detection statistics, which are needed inputs to the model, while scaling the model to represent a realistic benign/malware ratio in cost calculations. Our second configuration of the cost model estimates the additional cost/savings of a network-level tool, which differs from, and is potentially more appropriate or useful than, the originally proposed configuration. Code for our cost model configuration is provided; see the footnote on the first page.

Our statistical results provide empirical verification that contributes to a better understanding of the state of commercial malware detection in practice, a problem that has received little attention from researchers; in particular, we find several key weaknesses in the four tools investigated, including surprisingly low detection rates (but high precision), a complete failure to detect polyglot files, and varying detection capabilities across file categories and, for network-level tools, across protocols. This helps confirm our own (the authors') suspicions about the efficacy of commercial detectors. Finally, we map our findings to takeaways first for researchers, giving prioritized research directions to enhance detection in practice, and second for SOC personnel, giving guidance to consider when purchasing such tools. See the discussion in Section 6.


2 RELATED WORK

A large body of work proposes and tests ML algorithms for malware detection, and surveys have collected and organized these works. However, to our knowledge, few works seek to evaluate commercial, especially ML-based, malware detection, and none provides cost–benefit analysis informed by tool-specific detection statistics learned from experimental results for reasoning about these detectors. In addition, the most thorough existing evaluation of malware detection tools in the literature uses a corpus of 1,000 samples, compared with our evaluation of more than 3,500 samples. Finally, no previous work has evaluated the efficacy of pairing complementary detectors as suggested by the "defense-in-depth" idea. We also address commercial evaluations, which use much larger corpora than ours but do not differentiate the tools—many tools receive near-perfect results—inducing skepticism about the efficacy of these evaluations.

2.0.1 COTS Evaluations.

Early work by Christodorescu and Jha [10] presents an evaluation of three COTS malware detectors on eight malware samples, with a goal of studying what signatures a blackhat hacker can learn about a blackbox detector. Notably, these authors summarize findings by exclaiming "From our experimental results we conclude that the state of the art for malware detectors is dismal!" The two studies that focus specifically on the capability of existing tools to detect malware use a corpus of 200 samples (100 malicious and 100 benign) [2] and 29 samples (all malicious) [34], respectively. The more thorough of these evaluations, conducted by Aslan and Samet [2], finds that static and dynamic analyses work better in tandem than either one on its own and that, in general, dynamic analysis outperforms static. Neither of these works focuses on the benefit that ML provides for malware detection.

A large body of work has emerged to evaluate whether adversarial machine learning (AML) can be applied to perturb files to "trick" a detector [25], although the focus is to develop the AML techniques rather than evaluate commercial detectors. Notably, Fleshman et al. [15] test four commercial antivirus (AV) technologies and self-made ML classifiers on perturbations of malware. As in our study, time constraints limited Fleshman et al. to a corpus of 1,000 malware files, and further limitations are noted: "All the malware ... has been known for some time and likely been used by the AV companies in their product updates." Overall, the results show that the two (noncommercial) ML-based classifiers are more robust to adversarial perturbations than the four commercial detectors.

VirusTotal1 (VT) is a leading threat intelligence source that tracks malware and provides the determinations of 65+ commercial detection engines for any submitted file. Previous work has used VT data to study commercial malware detection engines. Zhu et al. [50] track the labels (benign/malicious) assigned by these 65+ vendors through VT for more than 14K files over time and focus on the label dynamics. A primary finding is that commercial detection engines represented on VT are finicky, changing detection results often; nevertheless, categorizing files based on a threshold voting scheme can be reliable for many intermediate thresholds. A portion of the article uses files for which Zhu et al. know the ground truth (benign/malicious) and is most related to our study. Specifically, Zhu et al. create 120 zero-day ransomware samples (60 obfuscations of two known ransomware) and 236 benignware samples to test the commercial engines through VT. Observations include that detection rates improve over time, but for these samples, true positive and false positive rates vary from approximately 25–100%, which indicates wide variation in the commercial tools but also much worse precision than found in our study. Finally, Zhu et al. compare desktop versions of 36 AV technologies against their VT counterparts, finding that the VT versions have on average higher recall but more false positives.

Prior works have evaluated the efficacy of open source network-based intrusion detection systems (IDSs), specifically Snort, Suricata, and Zeek [30, 41, 45]. Although these are commonly used in SOCs, they are not malware detectors.

2.0.2 Market Reports by Companies.

Market reports are used by security operations to learn about products in the cyber technology marketplace; although these are not peer-reviewed academic research, we include information on relevant market reporting companies to provide context. Gartner (gartner.com/en/information-technology), a popular provider of market reports for the cyber technology space, curates market summaries composed of company profiles of the vendors, ratings, and comments from end users of the products, information from the vendors' white papers, and often scores on a "magic quadrant," providing quantification of the technologies' "ability to execute" and "completeness of vision" [18]. Gartner does not provide any scientific testing of the efficacy of detection tools.

Evaluations of IDSs are provided by commercial companies [15]. MITRE ATT&CK provides an online evaluation [28] in which 21 IDS products were examined against the APT3 and APT29 threat groups. This evaluation includes custom malware and alternate execution methods. However, the APT29 and APT3 evaluations each include only two scenarios, for a total of four attack scenarios, so very little malware is actually executed. These evaluations do not specifically focus on the capability of the tools under test to detect malware but rather on their capability to detect realistic attack campaigns at any stage of execution. AV-TEST (av-test.org) and SE Labs (selabs.uk) both provide ratings based on evaluations of malware detectors. Although the AV-TEST and SE Labs endpoint test results provide a rich source of statistics, they have two main drawbacks. First, the evaluations are seemingly too easy, diminishing trust in the tests and inhibiting the tests' capabilities to differentiate the technologies. As an example, in the April 2022 evaluation [4], of the 20 detectors evaluated, 17 receive a 6/6 protection rating, with the remaining three obtaining a 5.5/6. Digging deeper, we find that the reviewers claim the industry average for detecting zero-day files is 99.8%. Similarly, the quarter 1 endpoint detection test from SE Labs [40] finds that the top (approximately 10) detectors achieve near-perfect results (above 99% accuracy) in most tests. Comparing these results with our findings—where recall of any single detector is under 60% across all tested malware—we are skeptical of the differentiation power of the tests provided. The second drawback is inherent to statistical results; making sense of and reasoning logically about many diverse measurements for each tool is difficult and has been a problem for evaluation of cyber technologies more generally [20]. Our work performs similar tests on these tools but addresses these drawbacks by providing a variety of files intended to be sufficiently difficult and by using a cost–benefit framework to reason about the many diverse measurements produced by the experiments.

Notably, VT does not provide comparative analytics of detection engines [46].

2.0.3 Malware Detection Method Surveys.

Myriad papers examine existing or proposed novel malware detection methods. Because our work focuses on evaluating the efficacy of commercially available tools, we do not discuss these works in detail. However, in this section, we provide references to several surveys that provide thorough overviews of this literature. A number of well-cited surveys discuss malware detection methods generally [3, 21, 49], and others focus more narrowly on areas such as dynamic [33], static [42, 44], and mobile [37] malware detection. In their recent survey of malware detection approaches, Aslan and Samet [3] highlight several research challenges that remain open issues in the malware detection space, including the following that are particularly applicable to our work: No detection method can detect all unknown malware, obfuscation might prevent malware from being examined by existing tools, and false positives (FPs) and false negatives (FNs) are a problem with existing approaches. In our research, we quantify these issues in existing tools and suggest ways they can be alleviated, such as by using multiple approaches in tandem.

2.0.4 Methods of Evaluation of Detectors.

We refer the reader to Iannacone and Bridges [20], which provides a survey of evaluation methods for cyber defenses, in particular intrusion detectors. Trends other than the usual statistical metrics (e.g., recall, precision) include incorporating time of detection into IDS metrics [16] and cyber competition scoring frameworks [12, 13, 29, 35, 38, 47]. Several significant works on security cost modeling, such as the Security Attribute Evaluation Method [8], the Return on Security Investment model [11, 43], and the Information Security Risk Analysis Method [22], led to the general cost–benefit framework of Iannacone and Bridges that we build upon in Section 5. Kondakci [23] provides a cost analysis for the spread of malware through a network using epidemic-inspired models, which is an interesting and worthwhile work but not appropriate for our setting.

2.0.5 Polyglot Files.

Polyglot files are valid as multiple file types. For example, a file could be a valid Joint Photographic Experts Group (JPEG) image and a valid Java archive (JAR) file, or it could be a valid portable document format (PDF) file and a valid PHP (hypertext preprocessor) file. Malware detection methods often require the file type to be known, so a polyglot file can be used to prevent correct file type association and thereby bypass the detection mechanism. A large body of research exists related to exploiting systems using polyglot files [1, 5, 6, 27, 48], as well as online tutorials on creating polyglot files.2 However, there is little research focused specifically on helping IDSs handle polyglot files. The majority of existing research is focused on the challenges of identifying malicious PDF files [9, 31]. Although solutions such as binwalk3 can be used to help identify polyglot files, it is unclear why they have not been integrated into commercial tools. One possible reason is that using a tool such as binwalk would have a negative impact on throughput.
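As a rough illustration of the kind of multiple-signature check discussed here, the following sketch (in the spirit of binwalk, not any vendor's implementation) flags a file whose bytes contain magic numbers of more than one file type; the signature list and matching rules are simplified assumptions for illustration only.

```python
# Simplified sketch: flag files containing magic bytes of more than one type.
# The signature list is illustrative and far from exhaustive; real carving
# tools (e.g., binwalk) validate structure rather than just searching bytes.
MAGIC_SIGNATURES = {
    b"%PDF-": "pdf",
    b"PK\x03\x04": "zip/jar",   # JAR files are ZIP archives
    b"GIF87a": "gif",
    b"GIF89a": "gif",
    b"\xff\xd8\xff": "jpeg",
}

def file_types_present(data: bytes) -> set:
    """Return the set of illustrative file types whose magic bytes appear in the data."""
    return {ftype for magic, ftype in MAGIC_SIGNATURES.items() if magic in data}

def looks_polyglot(data: bytes) -> bool:
    """Heuristic: more than one distinct type signature suggests a possible polyglot."""
    return len(file_types_present(data)) > 1

# Example: a GIF header followed by an embedded JAR (ZIP) archive would report
# both "gif" and "zip/jar", so looks_polyglot(...) would return True.
```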


3 EXPERIMENTAL DESIGN

To execute the experiments, we leveraged the NCR, an air-gapped research test bed equipped with state-of-the-art network and user emulation capabilities to facilitate the experiment [14]. A variety of services were emulated, including email, website servers, and domain controllers, and they were used by emulated users both internal to the network and in the emulated worldwide web (external). We provide a network diagram in the Appendix (Figure A.1).

Traffic generation was provided on all client endpoint virtual machines (VMs) by the NCR Mantra software, a proprietary emulator built on Lariat [39]. All clients had broadly similar traffic generation profiles and were configured to automatically and randomly browse websites (both internal and external), send and reply to emails, generate Microsoft (MS) Office documents, transfer files to and from file shares and Sharepoint sites, and run a preset list of other benign executables. All clients were set to a fixed diurnal schedule of increased daytime activity between 0700 and 1700 local time; activity outside of these hours was decreased by a factor of 10 (i.e., the average inter-arrival time of traffic generation events was set to be 10 times as long).
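To make the schedule concrete, a minimal sketch of diurnal traffic generation follows; the exponential inter-arrival assumption and the daytime mean are our own illustrative choices, not the Mantra/Lariat configuration.

```python
import random

DAY_MEAN_S = 300.0    # illustrative mean inter-arrival time during 0700-1700
NIGHT_FACTOR = 10.0   # off-hours inter-arrival times are 10x longer (activity / 10)

def next_event_delay_seconds(hour_of_day: int) -> float:
    """Sample the delay until the next traffic-generation event for the given hour."""
    mean = DAY_MEAN_S if 7 <= hour_of_day < 17 else DAY_MEAN_S * NIGHT_FACTOR
    return random.expovariate(1.0 / mean)  # exponential inter-arrival times
```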

System time was synchronized between all devices except the network static appliance via NTP to the central NCR testbed time source. Every node with an in-band connection synchronized to an NTP server in the emulated Internet environment, which in turn connected to the central time source via the testbed control network; everything with only an out-of-band connection (namely the network detection servers, a traffic collection server, and SOC workstations) connected directly via the control network to the same source. The network static appliance was supplied as a vendor-loaned server without administrative (root) access, which is required to change time settings; the offset of this appliance’s time vs. actual time was noted for compensation in after-action data analysis.

This environment enabled delivery of malware over the network to endpoints running the two host-level tools, so the network-level tools could (attempt to) reconstruct files from the packet stream and provide detection. In particular, emulated internal and external websites, mail servers, and hosts were used to deliver files (both benign and malicious) to internal clients through web download, email attachments, and direct connections, respectively.

The authors met with vendors of the technologies and configured the tools in advance of the NCR experiment to ensure proper configuration. Configuration guides were used to duplicate the process at the NCR, and, as a quality assurance check, detection results on a set of test files obtained before the experiment (with the tools configured by the vendors) were confirmed to be identical to those obtained after setup in the NCR. All technologies operated on-premises, without connections outside the internal portion of the emulated network. Although both host-level tools provide pre-execution conviction and quarantine, for the purpose of the experiment all tools were set to alert only (not block/prevent) to permit complementary (OR) analysis of multiple tools. All machines were time synchronized. For the network-level tools, a secure sockets layer (SSL) "break-and-inspect" decryption technology was not added to this network. Files were delivered on unencrypted protocols as the goal was to test the detectors' capabilities to detect malware without dependency on the capabilities of the decryption technology. (In practice, under this configuration malware sent encrypted would almost certainly be opaque to these detectors. How such a decryption technology affects the network-level detectors is out of scope for this work.)

For this experiment, we evaluated endpoint detection on both Windows 7 and Windows 10 endpoint (nonserver) hosts, although Linux hosts and servers were present. Windows 7 and 10 VMs were given 4 GB of RAM and 1 CPU core, along with a large disk drive for the installed operating system. RHEL6 VMs were given 1 GB of RAM and 1 CPU core, also with a large disk drive for storing the operating system and associated files. These values reflect the minimum memory needed to create a VM for the specific operating system, with the Windows VMs requiring at least 4 GB and RHEL requiring only 1 GB. The test bed was controlled via orchestration infrastructure leveraging virtualization software and custom wrapping software to coordinate with the VMs and manage the VM lifecycle.

The detection technologies were each exposed to 3,536 total files (2,554/982 = 72/28% malicious/benign) of varying categories, including 32- and 64-bit portable executables (PE32, PE64), MS Office, PDF, JAR, and three unusual categories: APT (described subsequently), zero-day PE (Windows PEs believed to not be available publicly), and polyglot (files that are valid as multiple file types, e.g., ZIP and PDF). Most malicious files were obtained through VirusShare.4 Our polyglot categories should also be never-before-seen files, and at least one of the valid file types has malicious functionality; specifically, we combined malicious Java JAR files with GIF and JPEG files. Notably, one can leverage open source software for polyglot creation ( https://github.com/corkami/mitra). The zero-day files were manufactured samples that never touched infrastructure with access to the Internet.

The goal of the APT class was to create artifacts that exemplified common APT evasion tactics. This included using signed vs. unsigned executables, cloning the PE headers of common benign executables, and advanced in-memory characteristics. Cobalt Strike5 provided a convenient way to test these features. Because Cobalt Strike is a known tool, we needed to minimize the risk that our artifacts would be discovered by generic Cobalt Strike signatures. To this end, we built a custom obfuscator for the samples in the APT class using the Cobalt Strike artifact kit. The artifact kit is effectively an API that allows users to change the way Cobalt Strike generates binary artifacts, allowing advanced users to more successfully evade AV technologies. As a quality assurance step, we built a named pipe bypass using the artifact kit and ensured that default samples (i.e., samples without any of the APT-like evasions) generated with this obfuscation were not detected by a fifth, commonly used malware detector. With this obfuscation in hand, the APT class was created by combining that obfuscation with the various APT-like evasions available in Cobalt Strike. For those APT samples that were correctly detected in our experiment, we cannot know if the correct detection was from a generic signature (that our obfuscator failed to evade) or if indeed the APT-like evasion tactic was correctly identified behaviorally.

3.1 File Samples

Our experiment used imbalanced classes, that is, a smaller number of benign files than malicious. See Figure 2. As the experiment involved instantiating an entire analysis environment and waiting up to 2 minutes for a detection decision on each file, time limitations forced our decision to limit the benign files. We biased the benign set to include a set of "tricky" benign samples generated and donated by researchers who developed the Hyperion tool [26, 36]. These benign files are mostly C/C++ compiled utilities and were often (incorrectly) classified as malicious in Hyperion research. The rest of the benign files are standard office-type documents, JAR files, RPMs, DEBs, and EXEs/MSIs.


Fig. 2. File category percentages.

Overall, the inclusion of nonpublic malware and benignware allows testing of the ML generalizability hypothesis—that ML can correctly identify never-before-seen files—while the variety of malware allows examination of recall across file types and categories.

3.2 Tested Technologies

All technologies tested are established malware detection products from well-known companies in the cyber technology marketplace. The three ML-based tools used in the test are a host ML detector, a network static detector, and a network dynamic detector. All ML technologies came pretrained. A popular endpoint signature-based detector was used as the baseline. The endpoint ML detector claims to use no signatures; therefore, it allows us to compare signature-only with ML-only detection at the host level. Both network-level tools provide ML-guided detection, but one is static and the other is dynamic. To test their technologies, we agreed to keep the vendors and their intellectual property confidential.

Host-level, Signature-based Detector: This proprietary detection tool uses static code analysis to compare hashes and portions of the file’s binary with signatures, known malicious/benign hashes, and code snippets. This standard approach for an AV tool promises high precision and fast detection but an inability to identify novel, even slightly changed, malicious files. In general, porting of signatures to each host requires relatively large host memory usage but facilitates fast and computationally inexpensive detection.

Host-level, ML Detector: This host-based AV tool promises pre-execution file conviction for certain file types. The binary classification uses supervised ML trained on large quantities of both malicious and benign files. According to vendor reports, the product requires 150 MB of memory on each host and claims that detection incurs computational expense similar to that incurred by taking a hash.

Network-level, Static, ML Detector: The network-level, static detection tool was designed to passively detect (not block) both existing and new/polymorphic attacks before a breach, on the wire, in near real time, with an on-premises solution. This technology centers on a binary (i.e., benign/malicious) classification of files and code snippets extracted from network traffic. All features of the reconstructed files and code snippets result from static analysis. This product’s sensors sit at a tap between the firewall and switch and between the router and hosts and rely on a variety of open source tools for both feature extraction and classification; that is, in addition to their proprietary ML classifier, open source file conviction technologies are used.

Network-level, Dynamic, ML Detector: The network-level dynamic detector reconstructs files from the network data stream and, using a central analytics server that resides either on-premises or in the cloud, builds both static and dynamic features by running the file in sandboxes and emulation environments for analysis. The vendor claims the tool accommodates many file types including executable, DLL, Mach-o, Dmg, PDF, MS Office, Flash, ISO, ELF, RTF, APK, Silverlight, Archive, and JAR. Features are then fed to a binary supervised classifier. Presumably, network-level and behavioral features are included as well. Detection timing was quoted as up to 20 s to a few minutes, which, based on vendor reports, is much (i.e., orders of magnitude) slower than competitors, although most competitors use static analysis only. This highlights the fact that, as with the network-level static detector tested, this product is primarily for detection, not prevention.


4 STATISTICAL RESULTS

Here we provide the usual accuracy and conviction time statistics in aggregate, broken down by file type and, for the network-level tools, by delivery protocol. Furthermore, we present results for each detector side by side and in complementary combinations.
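For reference, the per-category precision, recall, and F1 values reported in the following subsections can be computed from per-file alert decisions as in the sketch below; the record format and field names are illustrative assumptions, not the tools' actual output.

```python
from collections import defaultdict

def per_category_metrics(samples):
    """samples: iterable of dicts with keys 'category', 'malicious' (bool), 'alerted' (bool)."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for s in samples:
        c = counts[s["category"]]
        if s["alerted"] and s["malicious"]:
            c["tp"] += 1          # true positive: malware correctly alerted
        elif s["alerted"]:
            c["fp"] += 1          # false positive: benign file alerted
        elif s["malicious"]:
            c["fn"] += 1          # false negative: malware missed
    results = {}
    for cat, c in counts.items():
        prec = c["tp"] / (c["tp"] + c["fp"]) if c["tp"] + c["fp"] else 0.0
        rec = c["tp"] / (c["tp"] + c["fn"]) if c["tp"] + c["fn"] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        results[cat] = {"precision": prec, "recall": rec, "f1": f1}
    return results
```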

4.1 Individual Sensor Results

Table 1 itemizes the results of each malware analysis technology, with rows per file type except the bottom row, which gives overall detection statistics. Because a focus of this experiment is to identify the ways in which ML-based technologies add value, we consider the host signature-based sensor as a baseline. The host signature-based tool demonstrates perfect precision but poor recall, pulled down by its inability to detect never-before-seen malware (0% polyglots, 3.5% zero-days detected). The tool performs relatively well with respect to MS Office files and APT files and fairly well with respect to executable and PDF files. Good performance on executable, MS Office, and PDF file types is not surprising, because these are the most common packaging for malware and, therefore, have the broadest set of signatures. As expected, the host signature-based tool performed poorly on JAR files and zero-day samples, where there are fewer relevant signatures, or in the case of zero-day samples, no existing signatures.

Category (#mal./#ben.) | Host Signature (Prec. / Recall / F1) | Host ML (Prec. / Recall / F1) | Network Static (Prec. / Recall / F1) | Network Dynam. (Prec. / Recall / F1)
APT (27/0) | 1 / 0.667 / 0.8 | 1 / 0.63 / 0.773 | 1 / 0.296 / 0.457 | 1 / 0.556 / 0.714
JAR (325/200) | 1 / 0.0892 / 0.164 | 0 / 0 / 0 | 1 / 0.249 / 0.399 | 1 / 0.209 / 0.346
Office (400/200) | 1 / 0.782 / 0.878 | 0 / 0 / 0 | 1 / 0.873 / 0.932 | 1 / 0.735 / 0.847
PDF (400/200) | 1 / 0.512 / 0.678 | 0 / 0 / 0 | 1 / 0.51 / 0.675 | 0.996 / 0.565 / 0.721
PE32 (400/200) | 1 / 0.505 / 0.671 | 0.922 / 0.855 / 0.887 | 0.814 / 0.69 / 0.747 | 0.977 / 0.642 / 0.775
PE64 (400/182) | 1 / 0.233 / 0.377 | 1 / 0.777 / 0.875 | 1 / 0.833 / 0.909 | 1 / 0.757 / 0.862
Polyglot (199/0) | 0 / 0 / 0 | 0 / 0 / 0 | 0 / 0 / 0 | 0 / 0 / 0
Zero-Day (403/0) | 1 / 0.0347 / 0.0671 | 1 / 0.486 / 0.654 | 1 / 0.395 / 0.566 | 1 / 0.553 / 0.712
Total (2554/982) | 1 / 0.342 / 0.51 | 0.968 / 0.339 / 0.502 | 0.957 / 0.552 / 0.7 | 0.995 / 0.543 / 0.702

Table 1. Individual Sensor Results Itemized

  • Categories are disjoint in this table: all 403 zero-day files are PE32 files but are not counted in the PE32 row; similarly, all polyglot files are malicious JARs but are not counted in the JAR row. The Total row provides the statistics computed on all files.

The good performance on the APT category was a surprise. It is likely that the Cobalt Strike framework used for packaging the APT malware samples just happens to be well known to the signature set used in this host signature-based sensor. The host signature-based tool’s perfect precision simplifies this analysis, as recall becomes the primary metric of comparison. Performing worse in overall recall, the host ML tool has 0% detection for all non-PE files. The two network-level tools perform quite similarly, with good (\(\gt\)95%) precision and both with better (but still alarmingly poor) recall of 55%, giving F1 near 70%.

The network static tool achieves a recall ~20% greater than the host-based tools. As this tool uses a suite of ML- and signature-based subclassifiers, these results suggest that complementary signature and ML tools might work well. This increase in recall comes with a higher quantity of FP alerts. Nearly identical overall statistics are achieved by the lone dynamic tool. Our results imply that network-level tools do increase recall substantially, with precision comparable to the host ML tool and only slightly worse than the host signature-based tool.

Overall, regarding Table 1, there is no clear “winner.” Precision values are high for all four tools, meaning all sensors performed with relatively few false alerts. Perhaps the biggest takeaway of these results is the utter failure of these tools to detect most malware. Recall of 33–55% is both disappointing and alarming, with roughly 37% of the malware used in the exercise undetected. Furthermore, recall varies wildly across file categories for each tool. Notably, most operations use a sole endpoint detector, and both of these state-of-the-art commercial detectors identified only a third of the malware used in this test. Polyglots are a glaring problem for these four tools, as no such file was detected. This suggests that the approach to detection is dependent on determination of file type.

Considering the zero-day category, we see that all three ML-based tools (host ML and both network tools) increase recall by more than a factor of 10. This quantifies our first hypothesis, at least for these tools, that ML, as incorporated into modern detection products, can help generalize to never-before-seen files. Simply put, in our results, the detection rates increase by approximately 10 times but at the cost of precision—specifically precision on PE32 files—which drops by approximately 10%. Considering only the zero-day PEs and the APT files, the network dynamic tool is (unsurprisingly) superior. Comparing this result with the amount of time needed to detect will put these figures in context.

4.1.1 Detection Latency.

In addition to accuracy, timely detection is imperative to effective defense against malware. Figure 3 shows each tool's time from file delivery until alert, both per file type (with true positives (TPs) and FPs indicated by color) and in aggregate across all files to visualize the overall empirical distribution of alert times. Detection latency is calculated from the timestamp generated when the file transmission is completed until the timestamp of a given tool's alert on the respective file and encompasses all processing, including any file carving or decompression required. Tools' average conviction times rank as expected, but the results hold some unexpected takeaways. The network static tool has the fastest average conviction time of 15.3 s, perhaps due to the resources available to a network appliance (much more than an application on a host) and its static technique (generally faster than dynamic detection). Interestingly, all its FPs occur in PE32 files that take maximum times to alert for this tool. (Without visibility into the detection mechanism of the tool, we cannot diagnose this phenomenon. We notified the vendors of this anomaly.) Both host-based tools take at most a couple of minutes to alert, with the ML-based tool slower on average. Anomalously, the signature-based tool's JAR file alerts exhibit wide variance in time to alert. Recall that the host ML detector does not alert on many file categories and has many FPs on PE32 files, whereas the host signature-based tool has no FPs. The network dynamic tool has a much greater average conviction time of 1,039 s. This is understandable given the process of creating a sandbox environment and then executing the file in that environment to make a conviction decision.


Fig. 3. Swarm plots show time to detection for each file type. Left plots show per-file-category results with blue/orange data for true/false alerts. Right plots show all file categories and depict the time-to-detection distribution. Top: (Left) As the signature-based detector has perfect precision, there are no FPs (no orange data points). (Right) Host ML detector does not detect many file categories and has many FPs in PE32 files. Bottom: (Left) Network static detector has nearly identical time to detection for each file type except PE 32, the only type with false alerts, and all have relatively long analytic time. (Right) The network dynamic detector has very few false alerts, but time to detect is one to two orders of magnitude longer.

We dug into the detection latency of these network-level tools in an additional, more realistic experiment. Both vendors provided updated network appliances, and we tested their time-to-detect capabilities in this separate setting. Each tool received a mirrored feed of a real network's traffic blended with emulated network traffic, allowing more than 2,000 malware samples to be delivered safely to VMs but spread over a few hours. While seeing data rates of up to 7 GB/s, the tools achieved median detection times, from file delivery until alert, of under 1 s for the network static tool and 258 s (4.3 min) for the network dynamic tool. (Median is reported here because of outliers in the statistics that affected the mean.) The network-level tools had a significant bottleneck in the computational overhead of decompressing files. Introducing emulated network data had a positive effect on detection latency as it significantly decreased the rate at which compressed files were delivered to the tools despite dramatically increasing the amount of raw data to process—the tools received more network traffic but a decreased rate of compressed files to process. These results show that, indeed, there is a latency cost for using at least this dynamic analysis tool and that both network tools' latencies can increase by an order of magnitude if files cross the network too rapidly relative to the server's computational resources.

4.1.2 Network-level Detectors Efficacy across Protocols.

The two network-level tools rely on extracting files from traffic and claim to accommodate multiple protocols. As such, we ran a small test using multiple protocols. Our results revealed that visibility into those network streams was not as reliable as expected. All of the 3,536 file samples used in the analysis were delivered over the HTTP protocol. To test the network tools' capabilities across protocols, 200 of the malware samples were also delivered over four additional protocols. To deliver files over multiple protocols, web downloads (HTTP), external file transfers (FTP), internal file transfers (SMB), attachments to emails (SMTP), and clear-text TCP via Netcat were used. Netcat was used to test a vendor's claim that it could detect files sent in this manner. We present the counts of correctly detected samples for these 200 malware samples on the five total protocols in Table 2. The network static tool has an email client that was not used in this analysis, and so its performance on email was not determined. All network transmissions were unencrypted; both network-level tools require decrypted or unencrypted traffic. These results show that the tools' detection capabilities vary greatly by protocol, and this does not include the complexity induced by the realistic assumption that these tools would sit downstream of a companion decrypter.

Protocol | Network Dynamic | Network Static
FTP | 0/200 | 9/200
HTTP | 145/200 | 156/200
SMB | 0/200 | 38/200
SMTP | 180/200 | N/A
TCP | 0/200 | 0/200

Table 2. Detection Counts for Network-level Detectors on 200 Malware Samples Each Sent on Five Protocols

4.2 Complementary Detection

In light of the surprisingly low detection rates (33–55%, see Table 1), we consider simulated results of using pairs of tools together by taking the logical OR of binary classifiers. Tool pairs will score well if they alert on different malware—thereby increasing recall—but overlap on false alerts. Specifically, we consider pairing the two host-based detectors together to see if host signature and ML techniques are a good combination. Then, we consider combining each host tool paired independently with each network-level tool. See Table 3.
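A minimal sketch of this pairing operation follows (our illustration, assuming each tool's per-file alert decisions are available as booleans in a common file order; variable names are hypothetical).

```python
def combine_or(alerts_a, alerts_b):
    """Logical OR of two detectors' per-file alert decisions (booleans, same file order)."""
    return [a or b for a, b in zip(alerts_a, alerts_b)]

# Precision, recall, and F1 for the pair are then recomputed on the combined
# alerts exactly as for a single tool (e.g., with the per_category_metrics
# sketch in Section 4). Illustrative usage:
# paired_alerts = combine_or(host_sig_alerts, net_dyn_alerts)
```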

Top half, \(result(A \texttt{ OR } B)\) (each cell: Prec. / Recall / F1):
Category (#mal/#benign) | HostSig\(\cup\)HostML | HostSig\(\cup\)NetStat | HostSig\(\cup\)NetDyn | HostML\(\cup\)NetStat | HostML\(\cup\)NetDyn
APT (27/0) | 1 / 0.67 / 0.8 | 1 / 0.67 / 0.8 | 1 / 0.67 / 0.8 | 1 / 0.63 / 0.77 | 1 / 0.63 / 0.77
JAR (325/200) | 1 / 0.089 / 0.16 | 1 / 0.3 / 0.46 | 1 / 0.25 / 0.4 | 1 / 0.25 / 0.4 | 1 / 0.21 / 0.35
Office (400/200) | 1 / 0.78 / 0.88 | 1 / 0.93 / 0.96 | 1 / 0.91 / 0.95 | 1 / 0.87 / 0.93 | 1 / 0.73 / 0.85
PDF (400/200) | 1 / 0.51 / 0.68 | 1 / 0.57 / 0.73 | 1 / 0.58 / 0.73 | 1 / 0.51 / 0.68 | 1 / 0.56 / 0.72
PE32 (400/200) | 0.93 / 0.91 / 0.92 | 0.85 / 0.86 / 0.86 | 0.98 / 0.83 / 0.9 | 0.85 / 0.93 / 0.89 | 0.92 / 0.9 / 0.91
PE64 (400/182) | 1 / 0.82 / 0.9 | 1 / 0.86 / 0.92 | 1 / 0.8 / 0.89 | 1 / 0.86 / 0.93 | 1 / 0.81 / 0.9
Polyglot (199/0) | 0 / 0 / 0 | 0 / 0 / 0 | 0 / 0 / 0 | 0 / 0 / 0 | 0 / 0 / 0
Zero-Day (403/0) | 1 / 0.49 / 0.65 | 1 / 0.4 / 0.57 | 1 / 0.56 / 0.72 | 1 / 0.5 / 0.67 | 1 / 0.62 / 0.77
Total (2554/982) | 0.98 / 0.57 / 0.72 | 0.96 / 0.61 / 0.75 | 0.996 / 0.61 / 0.76 | 0.96 / 0.61 / 0.75 | 0.98 / 0.6 / 0.75

Bottom half, difference from the lone host detector (\(\Delta\)HostSig for the first three pairs, \(\Delta\)HostML for the last two; each cell: \(\Delta\)Prec. / \(\Delta\)Recall / \(\Delta\)F1):
Category (#mal/#benign) | HostSig\(\cup\)HostML | HostSig\(\cup\)NetStat | HostSig\(\cup\)NetDyn | HostML\(\cup\)NetStat | HostML\(\cup\)NetDyn
APT (27/0) | 0 / 0 / 0 | 0 / 0 / 0 | 0 / 0 / 0 | 0 / 0 / 0 | 0 / 0 / 0
JAR (325/200) | 0 / 0 / 0 | 0 / 0.21 / 0.29 | 0 / 0.16 / 0.24 | 1 / 0.25 / 0.4 | 1 / 0.21 / 0.35
Office (400/200) | 0 / 0 / 0 | 0 / 0.15 / 0.08 | 0 / 0.12 / 0.07 | 1 / 0.87 / 0.93 | 1 / 0.73 / 0.85
PDF (400/200) | 0 / 0 / 0 | 0 / 0.06 / 0.05 | 0 / 0.07 / 0.06 | 1 / 0.51 / 0.68 | 1 / 0.56 / 0.72
PE32 (400/200) | -0.07 / 0.4 / 0.24 | -0.15 / 0.36 / 0.18 | -0.02 / 0.32 / 0.23 | -0.07 / 0.07 / 0 | 0 / 0.05 / 0.02
PE64 (400/182) | 0 / 0.59 / 0.53 | 0 / 0.63 / 0.55 | 0 / 0.56 / 0.51 | 0 / 0.09 / 0.05 | 0 / 0.04 / 0.02
Polyglot (199/0) | 0 / 0 / 0 | 0 / 0 / 0 | 0 / 0 / 0 | 0 / 0 / 0 | 0 / 0 / 0
Zero-Day (403/0) | 0 / 0.45 / 0.59 | 0 / 0.36 / 0.5 | 0 / 0.52 / 0.65 | 0 / 0.02 / 0.01 | 0 / 0.14 / 0.11
Total (2554/982) | -0.02 / 0.23 / 0.21 | -0.04 / 0.27 / 0.24 | 0 / 0.27 / 0.25 | 0 / 0.27 / 0.25 | 0.01 / 0.27 / 0.25

Table 3. Top Half of Table: Itemized Complementary Malware Detection Results ( \(result(A \texttt { OR } B)\) )

  • Bottom half of table: Difference of the first (lone, base, host) detector’s results in Table 1 from the paired complementary detectors’ results in the top half of this table (\(result(A \texttt { OR } B) - results(A)\)).

Complementing the host signature-based tool with other technologies increases recall dramatically for PE32, PE64, and zero-day files. Precision of the host signature tool is minimally sacrificed only for PE32 files in these pairings. Similarly, as the host ML tool does not alert on many file categories, substantial gains in detection capabilities are possible by adding any other tool. As a final baseline, we considered the union of all four tools (Table 4), which only slightly increases recall with little change to precision over these pairs. In short, combining either host-level tool with one other malware detector yields much better recall with little impact on precision, and only pairs of tools are needed—unioning three or more detectors provides little gain.

Category (#mal/#benign) | Prec. | Recall | F1
APT (27/0) | 1 | 0.67 | 0.8
JAR (325/200) | 1 | 0.3 | 0.46
Office (400/200) | 1 | 0.93 | 0.96
PDF (400/200) | 1 | 0.58 | 0.73
PE32 (400/200) | 0.85 | 0.94 | 0.89
PE64 (400/182) | 1 | 0.89 | 0.94
Polyglot (199/0) | 0 | 0 | 0
Zero-Day (403/0) | 1 | 0.63 | 0.77
Total (2554/982) | 0.96 | 0.67 | 0.79

Table 4. Results of Complementary Detection (Logical Union) of All Four Detectors


5 COST MODEL EVALUATION

Although the aforementioned metrics provide a valuable statistical summary of these sensors’ performance, recent work by Iannacone and Bridges [20] provides a cost–benefit framework that models the real-world implications of using a particular tool, yielding a single, comparable cost metric to more easily reason out the tradeoffs implied in Section 4. For example, the network dynamic tool has an average detection time of 1,039.0 s, an FP rate of 0.71%, and a TP rate of 54.3%, whereas the network static tool detects on average in 15.3 s (faster) with an FP rate of 6.42% (worse) and TP rate of 55.21% (about the same). It is difficult to know which tool is best for a given SOC to adopt based on these basic statistics. It is even more difficult to choose the best pair of tools.

To assist with these determinations, we configured the cost model to evaluate these tools as standalone detectors (Section 5.2). This follows the original cost-model work [20] but makes one novel contribution to the basic model. Since the cost model as originally proposed depends on the accuracy statistics of the tool, we test with a relatively large number of both malicious and benign files to accurately estimate these inputs. Our change to the model is to compute average costs per benign and per malicious file and then scale these costs according to two new model inputs, namely the expected number of files per year and the benign/malware ratio expected in the wild. This allows us to gain accuracy in both the detection statistics and the cost model. In Section 5.3 we create a new version of the model to compute the savings of adding a network-based malware detector, assuming a host-based detector is in place. This new estimate is created to answer (Q2).

5.1 Cost Model Overview

This model estimates and sums labor and resource costs incurred for buying, configuring, and using the tool in addition to attack damage cost estimates. These costs are itemized by the following components. Refer to Table 5.

Tool | Initial | Ongoing
Host Sig. | $2K | $8K/y
Host ML | $6.5K | $35K/y
Net. Static | $15K | $20K/y
Net. Dynam. | $23K | $16K/y

Table 5. Initial (One-time) Server or Appliance and Ongoing (Annual) Subscription Costs

\(C_I\), the one-time initial costs of purchases, setup, and installation. The initial configuration costs are based on our team’s experience with installing and configuring these devices at the NCR. All costs are slightly obfuscated to preserve anonymity of the tool vendors. This minimally impacts results but does not impact the rankings of the tools. All license, hardware, and labor costs assume a relatively small network (~1K IPs), comparable in size and complexity to the NCR test network described previously. Note that we ignore several smaller costs, such as power and cooling, because they are negligible.

\(C_B\), the ongoing (per month, year, etc.) costs of normal operation, including subscription-based licenses and labor needed for periodic updates, reconfiguration, and maintenance. Since we did not use the tools in a real setting for a long period of time, we simply cannot give per-tool estimates. We assumed 8 h/tool/month for all tools. This does not affect our comparison and was included for completeness.

\(C_{IR}\), the incident response (IR) cost (per true positive alert), representing labor costs for investigation and remediation of detected attacks. We estimate this at $280 = $70/h (fully burdened cost of SOC operator) \(\times\) 4 h.

\(C_T\), the alert triage costs (per alert), representing labor costs and storage costs for the alert’s data. We estimate this at $70/h \(\times\) 1 h. A SOC lead verified that these estimates of labor time for triage and (above) for IR are reasonable. The triage resource cost of $0.05 was obtained by finding the lowest volume Splunk license based on the rate of raw data being processed ($1,800 for 1 GB/month, or 34 MB/day) and estimating what fraction of that volume would be consumed by each alert. For both detectors, we observed a peak of 1,000 alerts per day during our testing, and the average alert size was under 1 KB per alert, for a maximum of 1 MB/day. Because each detector is using less than 1/34th of the licensed volume for its 1,000 alerts per day, each alert represents a cost of (at most) $1,800/34/1,000 or about $0.05.

\(C_A\): While the above four costs are “defense costs,” \(C_A\) estimates the losses due to attack damages. Attack cost is a function of time modeling the average cost incurred from the moment of infection. We follow formulation of attack cost from the initial work [20] and use an “S” curve to model average attack cost over time. The intuition is that costs will begin at $0 and after some time (e.g., the malware unpacking itself) begin to grow quickly, then level out approaching a maximum cost. In reality, this might happen in a sequence of discrete jumps, but as discussed in the previous work, an “S” curve provides a reasonable model of the average cost for an attack. Specifically, the attack damage cost \(f(t)\) is defined as \(f(t) = M \exp (- (\alpha /t)^2 \ln {2}) = M 2^{-1/(t/\alpha)^2}\) where parameter M is the maximum cost (and horizontal asymptote), and parameter \(\alpha\) is the time when half the maximum cost is attained (\(f(\alpha) = M/2)\)). See depiction of \(f(t)\) in Figure 4 and input parameter choices in Table 6. The shape of the attack cost function, the parameters in Table 6, and the cost of labor hours used in this model were informed by discussions with a SOC lead who verified they are reasonable.

Fig. 4. Attack cost model, \(f(t) = M 2^{-1/(t/\alpha)^2}\).

Param. | Value | Description
M | $2,000 | Max. attack cost
\(\alpha\) | 15 m = 900 s | Half-cost time
N | 125K | # of files
r | 1.16% | Malware ratio

Table 6. Cost Model Parameters
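To make the model concrete, the following minimal Python sketch implements \(f(t)\) with the Table 6 parameters (the constant and function names are ours for illustration and are not from the released code):

```python
# Attack cost model f(t) = M * 2^(-1/(t/alpha)^2) with the Table 6 parameters.
M_MAX = 2_000.0      # maximum attack cost M, in dollars (horizontal asymptote)
ALPHA = 15 * 60      # half-cost time alpha, in seconds (15 minutes)

def attack_cost(t: float, m: float = M_MAX, alpha: float = ALPHA) -> float:
    """Average attack damage accrued t seconds after infection."""
    if t <= 0:
        return 0.0
    return m * 2.0 ** (-1.0 / (t / alpha) ** 2)

# Sanity checks: f(alpha) = M/2, and f(t) approaches M for large t.
assert abs(attack_cost(ALPHA) - M_MAX / 2) < 1e-9
assert attack_cost(100 * ALPHA) > 0.99 * M_MAX
```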

To apply the model, we first compute the initial and ongoing costs, as explained previously, and then compute each tool’s cost per sample as follows:

False Negative Costs:

If the tool does not alert on a malicious sample, then maximum attack cost is incurred (\(M = \lim _{t\rightarrow \infty } f(t)\)).

True Positive Costs:

If the tool correctly detects malware in t seconds, then the incurred cost is the sum of the triage resource and labor costs ($0.05 + $70), IR labor ($280), and attack damages \(f(t)\). Very late detection of malware can cost more than no detection because, for large t, the attack cost is near maximal, yet alert triage and remediation costs still accrue.

False Positive Costs:

If the tool incorrectly alerts on a benign sample, then triage costs are incurred ($0.05 + $70).

True Negative Costs:

If the tool (correctly) does not alert on benignware, then no extra costs are incurred.

Note that FN and TP costs are costs from malicious samples, whereas FP and TN costs are from benign samples.
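Continuing the earlier sketch (and reusing attack_cost and M_MAX from it), the four cases above translate directly into a per-sample cost function; the constants mirror the estimates of \(C_T\) and \(C_{IR}\) given earlier, and the names are ours:

```python
# Per-sample cost following the four cases above (a sketch, not the released code).
TRIAGE = 0.05 + 70.0   # C_T: alert storage plus one hour of triage labor
IR = 280.0             # C_IR: four hours of incident-response labor

def sample_cost(is_malicious: bool, alerted: bool, detect_time_s: float = 0.0) -> float:
    if is_malicious and not alerted:
        return M_MAX                                      # FN: maximum attack cost
    if is_malicious and alerted:
        return TRIAGE + IR + attack_cost(detect_time_s)   # TP: triage + IR + damage until detection
    if alerted:
        return TRIAGE                                     # FP: triage only
    return 0.0                                            # TN: no extra cost
```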

We break the sum \(C_{IR} + C_T + C_A\) (incident response, triage, and attack costs) into two quantities: costs from false alerts and costs from attacks (true alerts or missed detections). To do so, we compute each tool’s average benignware cost \(a_b\) (from the remaining 982 benign samples) and average malware cost \(a_m\) (from the 2,554 malware samples). Finally, we set N, the number of unique files crossing the network detector (border traffic + internal-to-internal) for a 1,000-IP network, to 125K files per year (125 per host). We also need an estimate of the ratio of malware to benignware in the wild; for this we leverage Li et al. [24], who, upon investigating files on 10K IPs, find 1.16% malware and 98.84% benignware. Thus, we can scale the average cost per benign or malicious file to 125K files per year, respecting the 98.84%/1.16% benign-to-malware ratio.

Putting it all together, the estimated cost for the first \(y\) years is
(1) \(C = C_I + y\,[\, C_B + N \times 98.84\% \times a_b + N \times 1.16\% \times a_m \,],\)
where \(C_I\) is the initial resource and labor cost; the bracketed quantity is incurred each year and comprises \(C_B\), the yearly ongoing costs (e.g., subscriptions and tweaking); \(N \times 98.84\% \times a_b\), the annual cost from false alerts; and \(N \times 1.16\% \times a_m\), the annual cost from attacks.

Importantly, N linearly scales all costs except the initial and ongoing costs (see Figure 5 (left)), so unless the initial and ongoing costs for a tool are exorbitant, inaccuracies in this file-number estimate will not affect any comparison of the tools.
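The yearly aggregation of Equation (1) is likewise short; in this sketch \(a_b\) and \(a_m\) are a tool’s measured mean per-file costs and c_initial, c_annual are its Table 5 costs (all other names are ours):

```python
# Equation (1): C = C_I + y * [C_B + N*(1-r)*a_b + N*r*a_m]  (a sketch).
N_FILES = 125_000        # unique files per year crossing the detector (125 per host, ~1K IPs)
MALWARE_RATIO = 0.0116   # r, from Li et al. [24]

def total_cost(years: float, c_initial: float, c_annual: float,
               a_b: float, a_m: float,
               n_files: int = N_FILES, r: float = MALWARE_RATIO) -> float:
    yearly = c_annual + n_files * (1 - r) * a_b + n_files * r * a_m
    return c_initial + years * yearly

# Example with the Host ML figures from Tables 5 and 7; the result lands near the
# reported first-year cost (small gaps come from rounding and the maintenance labor
# folded into the ongoing cost).
print(total_cost(1, c_initial=6_500, c_annual=35_000, a_b=2.07, a_m=1_440.54))
```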

Fig. 5. Cost model results for varying \(M =\) maximum attack cost with half-cost time \(\alpha = 15\) minutes (left); varying \(\alpha\) with \(M = \$2{,}000\) (right). Regarding the left plot, for M near 0, it is cheaper to never detect (the never alert baseline is cheapest). For \(M \in [\sim \$300, \sim \$2{,}300]\) the host signature detector is best and is second to the network static tool for \(M \gt \sim \$2{,}300\). Regarding the right plot, for small (large) \(\alpha\), attack cost accrues quickly (slowly), with the effect that slow detection is heavily (lightly) penalized. The lone dynamic tool is very slow to detect but very accurate; hence, with large \(\alpha\) it dominates. Of the two host-based tools, the signature-based method wins for nearly all parameters.

5.2 Single-tool Cost Model Results

For comparison in the subsequent results, we include three simulated detector devices: “Never Alert,” which simulates having no detector; “Always Alert,” which alerts on every file; and a “Perfect Detector,” which correctly alerts on only the malware. For the latter two, we assume detection is “immediate,” using a detection time of \(t = 1\mathrm{E}{-10}\) s.

Table 7 displays the average benign-/malware costs and the estimated cost of using each tool alongside the simulated baselines under the parameters shown in Table 6. First and foremost, the simulations validate the model: all tools perform much better than the always alert simulation (which accrues enormous false positive costs but minimal true positive costs) and much worse than the perfect detector. Because malware is relatively rare (ratio of 1.16%), the cost of never alerting, though still higher than that of all actual detectors, is closer to them than the other two simulated baselines are. The tool rankings show that the host signature tool is superior, because it has no FPs and a competitive time to detect; considering only the host tools as realistic stand-alone defenses, even with identical initial and ongoing costs, the host signature-based tool remains superior. Interestingly, the network static detector places second, because it is by far the fastest detector and has high recall, both of which outweigh its higher FP rate (high mean benignware cost \(a_b\)). The network dynamic tool is hindered by its slow detection time (Figure 3) but benefits from its accuracy.

Tool | Benign | Malware | 1st Year
Host Sig. | $0.00 | $1,435.47 | $2,098,712
Host ML | $2.07 | $1,440.54 | $2,393,151
Network Dyn. | $0.50 | $1,500.41 | $2,283,562
Network Static | $4.49 | $1,089.10 | $2,176,719
Never Alert | $0.00 | $2,000.00 | $2,942,280
Always Alert | $70.05 | $350.05 | $9,204,530
Perfect Detector | $0.00 | $350.05 | $549,852
  • Cost increases linearly in the per-file costs: mean benignware cost is a multiple of the false alert rate, while mean malware cost is monotonically related to recall and time to detect via the attack cost function.

Table 7. Cost Model Results (Mean Cost per Benign-/Malware File (\(a_b\), \(a_m\)) and 1st Year Cost) with Three Simulated Baselines Using Parameters as in Table 6


5.2.1 Insights from Varying Cost Model Parameters.

A skeptic might object to cost–benefit analyses on the grounds that, especially for cybersecurity applications, such methods require inputs that are fundamentally difficult to ascertain [8, 20]; indeed, our model is no different. To account for these unknowns, this section investigates parameter sensitivity.

Costs for all tools scale linearly in N, the number of files, so this parameter will not change rankings. Cost increases linearly with the mean benignware cost \(a_b\) (with weight \(N\times (1-r)\)) and with the mean malware cost \(a_m\) (with weight \(N\times r\)), where r is the malware ratio. As noted previously, \(a_b\) is a multiple of the false alert rate, while \(a_m\) is monotonically related to the tool’s recall and time to detect via the attack cost function. Thus, increasing (decreasing) r decreases (increases) the impact of the false alert rate and increases (decreases) the impact of recall and time to detect. Precision directly affects results, whereas recall’s effect is mediated by M and \(\alpha\) through the attack cost. To investigate, we present costs while varying M and \(\alpha\) in Figure 5. The main takeaway is that, based on these results and this cost model, the host signature tool is the best stand-alone choice: it offers the lowest cost for nearly all values of M and \(\alpha\), and otherwise (Figure 5) is second only to the network tools.
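The sweep behind Figure 5 amounts to recomputing each tool’s mean malware cost under different values of \(M\) (or \(\alpha\)) and feeding it back into Equation (1). A sketch, reusing attack_cost, ALPHA, TRIAGE, and IR from the earlier sketches; toy_detections is a hypothetical stand-in for a tool’s measured detection results:

```python
# Sensitivity sweep over the maximum attack cost M (alpha held at 15 minutes).
toy_detections = [(True, 2.0), (True, 300.0), (False, 0.0)]  # (detected?, seconds to detect)

def mean_malware_cost(detections, m, alpha=ALPHA):
    costs = [TRIAGE + IR + attack_cost(t, m, alpha) if hit else m
             for hit, t in detections]
    return sum(costs) / len(costs)

for m in (500, 1_000, 2_000, 5_000, 10_000):
    a_m = mean_malware_cost(toy_detections, m)
    # feed a_m into total_cost(...) for each tool and compare, as in Figure 5 (left)
    print(m, round(a_m, 2))
```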

5.2.2 Insights from Increasing Cost of “Hard” Files—Zero-days, Polyglots, and APT-style Files.

Another addition to the original cost model that is applicable to this study is increasing the cost of the “hard” malware: zero-days, polyglots, and APT-style files. This represents the assumption that such files, if undetected, will on average result in more damage than \(n\)-day files. To exhibit results under these conditions, we compute \(a_{mh}\), the average hard-file cost, using the detection times on only the zero-day, polyglot, and APT-style files and an augmented maximum attack cost \(M_{hard}\) (still using the attack cost equation from the Figure 4 caption with identical \(\alpha\)). Similarly, we compute \(a_{me}\), the average “easy”-file cost, identically on the remaining (\(n\)-day) malware detection results using a maximum attack cost of \(M = \$2\)K. Finally, we replace the final term of Equation (1) (namely, \(N \times 1.16\% \times a_m\)) with a convex combination of the average costs from hard and easy malware files (namely, \(N\times 1.16\% \times p \times a_{mh} + N\times 1.16\% \times (1- p)\times a_{me}\)), where the parameter p is the percentage of malware assumed to be hard. Results are depicted in Figure 6. The main takeaway is that if one expects (1) more than a couple percent of malicious files to be zero-day, polyglot, or APT-style files and (2) the cost of an undetected hard file to be much greater than otherwise, then ML-based detectors provide substantial savings.

Fig. 6. Cost model results for varying \(M_{hard} =\) maximum attack cost for hard files (namely polyglot, zero-day, and APT-style files) with percentage of hard malware \(p=0.05\) (left) and varying p, the percentage of malware that is hard, with \(M_{hard} = \$20{,}000\) (right). In both, the same half-cost time \(\alpha = 15\) m is used for all malware, and for easy malware, \(M =\$2{,}000\). First notice that both plots are similar in structure because of the linear nature of these variables—both increases in \(M_{hard}\) and p convert savings from tools with low recall to those with better recall on hard files. Hence, in both plots, we see that for small values of \(M_{hard}\) and p the signature-based detector wins; for intermediate values, the network static tool (relatively faster detection, competitive recall) wins; yet for larger values, the host ML and network dynamic tools (best recall on hard files) win. These results are intuitive, which is reassuring, but force a difficult reality; that is, deciding which tool is best depends on one’s estimate of the “hard” malware expected. This suggests pairing signature- and ML-based tools.
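A sketch of this modification (reusing N_FILES and MALWARE_RATIO from the earlier sketch; the hard and easy mean costs below are hypothetical placeholders, not measured values):

```python
# Replace the attack term N*r*a_m of Equation (1) with a convex combination of mean
# costs from "hard" (zero-day/polyglot/APT-style) and "easy" (n-day) malware.
def yearly_attack_cost(p_hard: float, a_m_hard: float, a_m_easy: float,
                       n_files: int = N_FILES, r: float = MALWARE_RATIO) -> float:
    return n_files * r * (p_hard * a_m_hard + (1 - p_hard) * a_m_easy)

# a_m_hard would be computed from hard-file detections with M = M_hard (e.g., $20,000);
# a_m_easy from the remaining n-day malware with M = $2,000. Placeholder values below.
print(yearly_attack_cost(p_hard=0.05, a_m_hard=8_500.0, a_m_easy=1_200.0))
```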

5.3 Savings by Adding Network Detectors

Here we describe a new way to configure the cost model that is likely more useful for real-world use: estimating the cost/savings of adding each network malware detector to each host-based detector. To do so, we add the initial and ongoing costs for the network tool and then compute the difference in cost between using both the host-based and network-based tools and using solely the host-based tool. As before, we (1) compute the additional cost/savings per file, (2) find the average cost/savings per malware file and per benignware file, and (3) linearly scale these costs/savings to 125K total files, respecting the 98.84%/1.16% benign-to-malicious file ratio.

To estimate the change in cost on a given file: if the network tool does not alert on the sample, there is no change in cost. If the network tool alerts on a sample (FP or TP), then triage resource and labor costs (\(\$0.05 + \$70\)) are incurred, accounting for an operator fielding the alert from the network tool (possibly in addition to alerts from the host-based detector). Moreover, if this network-based alert is a TP and the host-based tool also correctly alerts on the sample, then we make no other change to costs for this file; only triage cost is added for the network alert. This follows two assumptions: (1) IR actions will have taken place with or without the network-based alert, since the host-based tool correctly alerted, and (2) host-based detection is at least as fast as the network detector, resulting in no change in attack cost. These assumptions are based on the fact that most host-based tools offer pre-execution file conviction and blocking. As the final case, consider when the network tool’s alert is a TP but the host-based tool failed to alert on the malware. In this scenario, without the network-based tool, the attack goes undetected and the maximum attack cost is incurred; with the network-based tool, there are now triage and IR costs, but the attack cost is reduced from the maximum to \(f(t)\), the attack cost at \(t =\) the network-level detection time. In our model, this is the lone scenario in which savings from the network-level tool can occur: when the network tool identifies malware that is undetected by the host tool and does so fast enough that the reduction in attack cost exceeds the added triage and IR costs.
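A sketch of the per-file change in cost when a network tool is layered on a host tool (reusing TRIAGE, IR, M_MAX, and attack_cost from the earlier sketches; a negative return value indicates savings, matching the convention of Table 8):

```python
# Change in cost for one file when a network detector is added to a host detector.
def added_cost(net_alert: bool, net_true_positive: bool,
               host_detected: bool, net_detect_time_s: float = 0.0) -> float:
    if not net_alert:
        return 0.0               # network tool silent: cost unchanged
    if not net_true_positive or host_detected:
        return TRIAGE            # FP, or TP already caught by the host tool: extra triage only
    # TP that the host tool missed: triage + IR are now incurred,
    # but attack cost drops from the maximum M to f(t) at the network detection time.
    return TRIAGE + IR + attack_cost(net_detect_time_s) - M_MAX
```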

Table 8 reports the results of adding both network-based tools to the two host-based detectors with parameters as in Table 6. Reassuringly, both network-level detectors yield substantial ($22K–$325K) savings when added to the host tools. These results predict that, for SOCs using either host detector, dynamic detection is the best pairing, saving 3–15\(\times\) more than the static detector.

Tool | Benign | Host ML: Malware | Host ML: 1st Year | Host Sig.: Malware | Host Sig.: 1st Year
Dyn. | $0.50 | -$145.26 | -$102,651 | -$298.22 | -$324,443
Stat. | $4.49 | -$434.09 | -$31,913 | -$427.36 | -$22,148
  • Negative costs indicate savings.

Table 8. Cost on Average per File and for First Year of Network Detection Tools when Added to Each Host-based Detector Individually



6 CONCLUSIONS, LIMITATIONS, NEXT STEPS, AND TAKEAWAYS

This study describes experiments with four prominent malware detection technologies aimed at helping an organization assess (Q1) the ML generalization hypothesis and (Q2) the added value of network-level malware detectors. Our results provide empirical quantification of the efficacy of four market-leading malware detectors. The three ML tools demonstrated at least a 2\(\times\) increase in detection coverage for publicly available executables and a 10\(\times\) increase in malware detection coverage for zero-day executables. Less intuitive, perhaps, was how well the host signature–based tool performed: it exhibited the best (perfect) precision; demonstrated recall (\(\sim\)35%) comparable to the host ML tool; and, according to the cost model, performed best overall when the percentage of hard (zero-day, polyglot, or APT-style) files is small and the maximum damage cost incurred by an undetected hard file is not much more than that of an n-day malware. Notably, if one expects (1) more than a couple percent of malicious files to be zero-day, polyglot, or APT-style files, and (2) the cost of an undetected hard file to be much greater than otherwise, then ML-based detectors provide substantial savings; this is a strong argument for pairing detectors. Combining the signature-based tool with an ML-based tool yields detection rates above 90% for in-the-wild executables and above 60% for zero-day malware. As demonstrated by the increased recall and the new configuration of the cost model, substantial value is added by pairing either of the two network-level tools with either host-level tool. Unsurprisingly, dynamic analysis trades latency for increased recall, and both network-level detectors vary in their capabilities across protocols. Our results indicate that low false positive rates (precision \(\gt\)98%) should be expected, along with fast detection times for static detectors, but that this comes at the cost of recall: of the four detectors in this study, none surpassed 55% recall, and recall falls dramatically on JAR and polyglot file types.

Testing only four representative tools is a clear limitation; in particular, many COTS tools offer (and some require) cloud connectivity to enhance detection at the expense of privacy, but no such tool was represented. Our test environment did not allow connection to the Internet, which might have stymied malware actions and therefore affected detection results. Similarly, network-level signature-based tools are not fully represented. Impact on host resources (CPU, memory, and disk I/O, in particular) was not taken into account in this study. More sophisticated (and presumably more accurate) models of attack cost could be integrated in future work (e.g., Reference [23]). Finally, accurate cost estimates from the cost model will require per-SOC inputs; based on our parameter sensitivity analysis, however, the rankings and takeaways of the cost model results for these tools should generalize (and our treatment provides a blueprint for future use). Nevertheless, we argue that trends found among these four market leaders are sufficient to pose hypotheses, raise awareness of gaps, and sharpen next-step research.

We itemize priorities for next-step research from this work:
  • Our polyglot detection results suggest that detectors fail to preclassify a file into the possibly many file types under which it can execute. Research is needed to accurately and quickly determine all of the file types of a given sample and to integrate this capability into COTS detectors, with throughput and computational expense in mind.
  • More accurate detection models are needed for all file types. In particular, future malware detection research should focus on increasing detection capabilities for non-PE and non-MS Office file types. This might entail enhancing the quantity of malware and benignware files of many currently less prevalent types (e.g., through an effort to build and make public a more diverse malware data set).
  • For the two network-level detectors, recall varied widely when identical files were delivered on different protocols. We suggest a study to identify whether file carving is the limiting factor and to find methods for accurately extracting a copy of the file from the packet stream for the detector.
  • Future work to tune the cost model to a specific SOC is both reasonable and likely very valuable if successful, especially if it produces a procedure and code base for refitting to any SOC.
  • Malware detection studies in which computational resources of the hosts are monitored and incorporated are needed, in particular for a higher-fidelity cost model.
  • As evidenced by our results, many of the takeaways depend on difficult assumptions, for example, the percentage of files that are malware (e.g., Li et al. [24]) or the percentage of different file types or categories in the wild. Empirical studies providing data on such cyber-related artifacts would supply fodder for useful follow-on work.

Finally, some takeaways for SOCs. All pairs of tools in our study permit great gains in detection rate (recall) with few false positives (high precision), but little is gained from using more than two of these tools. When choosing a pair of tools, strive for tools that complement each other by increasing the coverage (detection rate) across all file types rather than duplicating results; this might require some preliminary testing with different file types. A feasibility study of any desired pairing of host tools is needed before deployment to examine how the combination will affect the host’s resources. Our results (Table 8) show that substantial savings are yielded by adding a network-level detector in all four combinations of network and host detectors tested. The accuracy gains permitted by the dynamic detector in our study outweighed its latency costs. For any network detector, testing bandwidth and file rates with the specific software and hardware is needed. Note that network-level detectors require unencrypted traffic, so we presuppose that an SSL/TLS “break and inspect” technology is in place.

APPENDIX

Figure A.1. Network diagram.


REFERENCES

[1] Albertini Ange. 2015. The International Journal of Proof of Concept GTFO PoC or GTFO 0x07; Abusing file formats or Corkami the Novella. https://www.alchemistowl.org/pocorgtfo/pocorgtfo07.pdf.
[2] Aslan Ömer and Samet Refik. 2017. Investigation of possibilities to detect malware using existing tools. In Proceedings of the IEEE/ACS 14th International Conference on Computer Systems and Applications (AICCSA’17). IEEE, 1277–1284.
[3] Aslan Ömer and Samet Refik. 2020. A comprehensive review on malware detection approaches. IEEE Access 8 (2020), 6249–6271.
[4] AV-TEST. 2022. The best Windows antivirus software for business users. Retrieved June 23, 2022 from https://www.av-test.org/en/antivirus/business-windowsclient/windows-10/april-2022/.
[5] Bratus Sergey, Goodspeed Travis, Albertini Ange, and Solanky Debanjum S. 2016. Fillory of PHY: Toward a periodic table of signal corruption exploits and polyglots in digital radio. In Proceedings of the 10th USENIX Workshop on Offensive Technologies (WOOT’16). USENIX Association, Austin, TX. https://www.usenix.org/conference/woot16/workshop-program/presentation/bratus.
[6] Bratus Sergey and Shubina Anna. 2017. Exploitation as code reuse: On the need of formalization. Inf. Technol. 59, 2 (2017), 93.
[7] Bridges Robert A., Iannacone Michael D., Goodall John R., and Beaver Justin M. 2018. How do information security workers use host data? A summary of interviews with security analysts. arXiv:1812.02867 [cs.HC]. Retrieved from https://arxiv.org/abs/1812.02867.
[8] Butler Shawn A. 2002. Security attribute evaluation method: A cost-benefit approach. In Proceedings of the 24th International Conference on Software Engineering (ICSE’02). Association for Computing Machinery, New York, NY, 232–240.
[9] Carmony Curtis, Hu Xunchao, Yin Heng, Bhaskar Abhishek Vasisht, and Zhang Mu. 2016. Extract me if you can: Abusing PDF parsers in malware detectors. In Proceedings of the Network and Distributed System Security Symposium (NDSS’16). The Internet Society.
[10] Christodorescu Mihai and Jha Somesh. 2004. Testing malware detectors. ACM SIGSOFT Softw. Eng. Not. 29, 4 (2004), 34–44.
[11] Davis Adrian. 2005. Return on security investment—Proving it’s worth it. Netw. Secur. 2005, 11 (2005), 8–10.
[12] Doupé Adam et al. 2011. Hit ’em where it hurts: A live security exercise on cyber situational awareness. In Proceedings of the 27th Annual Computer Security Applications Conference. ACM, 51–61.
[13] Drinkwater Doug and Zurkus Kacy. 2017. Red Team Versus Blue Team: How to Run an Effective Simulation. Retrieved from www.csoonline.com/article/2122440/disaster-recovery/emergency-preparedness-red-team-versus-blue-team-how-to-run-an-effective-simulation.html.
[14] Ferguson B., Tall A., and Olsen D. 2014. National cyber range overview. In Proceedings of the IEEE Military Communications Conference. 123–128.
[15] Fleshman William, Raff Edward, Zak Richard, McLean Mark, and Nicholas Charles. 2018. Static malware detection & subterfuge: Quantifying the robustness of machine learning and current anti-virus. In Proceedings of the 13th International Conference on Malicious and Unwanted Software (MALWARE’18). IEEE, 1–10.
[16] Garcia Sebastian, Grill Martin, Stiborek Jan, and Zunino Alejandro. 2014. An empirical comparison of botnet detection methods. Comput. Secur. 45 (2014), 100–123.
[17] Gartner. 2022. Endpoint Detection & Response EDR Reviews & Ratings. Retrieved June 23, 2022 from https://www.gartner.com/reviews/market/endpoint-detection-and-response-solutions.
[18] Gartner. 2022. Gartner’s Magic Quadrant Frequently Asked Questions. Retrieved June 23, 2022 from https://emtemp.gcom.cloud/ngw/globalassets/en/research/documents/magic-quad-faq.pdf.
[19] Gibert Daniel, Mateu Carles, and Planes Jordi. 2020. The rise of machine learning for detection and classification of malware: Research developments, trends and challenges. J. Netw. Comput. Appl. 153 (2020), 102526.
[20] Iannacone Michael D. and Bridges Robert A. 2020. Quantifiable & comparable evaluations of cyber defensive capabilities: A survey & novel, unified approach. Comput. Secur. (2020), 101907.
[21] Idika Nwokedi and Mathur Aditya P. 2007. A survey of malware detection techniques. Purdue University 48 (2007), 20–2. https://profsandhu.com/cs5323_s17/im_2007.pdf.
[22] Karabacak Bilge and Sogukpinar Ibrahim. 2005. ISRAM: Information security risk analysis method. Comput. Secur. 24, 2 (2005), 147–159.
[23] Kondakci Suleyman. 2009. A concise cost analysis of Internet malware. Comput. Secur. 28, 7 (October 2009), 648–659.
[24] Li Bo, Roundy Kevin, Gates Chris, and Vorobeychik Yevgeniy. 2017. Large-scale identification of malicious singleton files. In Proceedings of the 7th Conference on Data and Application Security and Privacy. ACM, 227–238.
[25] Ling Xiang, Wu Lingfei, Zhang Jiangyu, Qu Zhenqing, Deng Wei, Chen Xiang, Wu Chunming, Ji Shouling, Luo Tianyue, Wu Jingzheng, et al. 2023. Adversarial attacks against Windows PE malware detection: A survey of the state-of-the-art. Computers & Security (2023), 103134.
[26] Linger Rick, Pleszkoch Mark, Prowell Stacy, Sayre Kirk, and Ankrum T. Scott. 2013. Computing legacy software behavior to understand functionality and security properties: An IBM/370 demonstration. In Proceedings of the 8th Annual Cyber Security and Information Intelligence Research Workshop. 1–4.
[27] Magazinius Jonas, Rios Billy K., and Sabelfeld Andrei. 2013. Polyglots: Crossing origins by crossing formats. In Proceedings of the ACM SIGSAC Conference on Computer & Communications Security (CCS’13). Association for Computing Machinery, New York, NY, 753–764.
[28] MITRE. 2022. ATT&CK Evaluations, APT 3. Retrieved May 23, 2021 from https://attackevals.mitre-engenuity.org/enterprise/evaluations.html?round=APT3.
[29] Mullins Barry E., Lacey Timothy H., Mills Robert F., Trechter Joseph E., and Bass Samuel D. 2007. How the cyber defense exercise shaped an information-assurance curriculum. IEEE Secur. Priv. 5, 5 (2007), 40–49.
[30] Murphy Brandon R. 2019. Comparing the Performance of Intrusion Detection Systems: Snort and Suricata. Ph.D. Dissertation. Colorado Technical University.
[31] Nissim Nir, Cohen Aviad, Glezer Chanan, and Elovici Yuval. 2015. Detection of malicious PDF files and directions for enhancements: A state-of-the art survey. Comput. Secur. 48 (2015), 246–266.
[32] Oesch Sean, Bridges Robert, Smith Jared, Beaver Justin, Goodall John, Huffer Kelly, Miles Craig, and Scofield Dan. 2020. An assessment of the usability of machine learning based tools for the security operations center. In Proceedings of the International Conferences on Internet of Things (iThings’20), IEEE Green Computing and Communications (GreenCom’20), IEEE Cyber, Physical and Social Computing (CPSCom’20), IEEE Smart Data (SmartData’20), and IEEE Congress on Cybermatics (Cybermatics’20). IEEE, 634–641.
[33] Or-Meir Ori, Nissim Nir, Elovici Yuval, and Rokach Lior. 2019. Dynamic malware analysis in the modern era—A state of the art survey. ACM Comput. Surv. 52, 5 (2019), 1–48.
[34] Pandey Sudhir Kumar and Mehtre B. M. 2014. Performance of malware detection tools: A comparison. In Advanced Communications, Control and Computing Technologies. IEEE, 1811–1817.
[35] Patriciu Victor-Valeriu and Furtuna Adrian Constantin. 2009. Guide for designing cyber security exercises. In Proceedings of the 8th WSEAS International Conference on E-Activities and Information Security and Privacy. World Scientific and Engineering Academy and Society, 172–177.
[36] Prowell Stacy J., Sayre Kirk D., and Awad Rima L. 2016. Automatic clustering of malware variants based on structured control flow. (December 8, 2016). US Patent App. 15/172,884.
[37] Qamar Attia, Karim Ahmad, and Chang Victor. 2019. Mobile malware attacks: Review, taxonomy & future directions. Fut. Gener. Comput. Syst. 97 (2019), 887–909.
[38] Reed Theodore, Nauer Kevin, and Silva Austin. 2013. Instrumenting competition-based exercises to evaluate cyber defender situation awareness. In International Conference on Augmented Cognition. Springer, 80–89.
[39] Rossey Lee M., Cunningham Robert K., Fried David J., Rabek Jesse C., Lippmann Richard P., Haines Joshua W., and Zissman Marc A. 2002. LARIAT: Lincoln adaptable real-time information assurance testbed. In Proceedings of the IEEE Aerospace Conference, Vol. 6. IEEE, 66.
[40] SE Labs. 2022. Endpoint Security (EPS): Enterprise 2022 Q1. Retrieved June 23, 2022 from https://selabs.uk/reports/enterprise-endpoint-protection-2022-q1/.
[41] Shah Syed Ali Raza and Issac Biju. 2018. Performance comparison of intrusion detection systems and application of machine learning to Snort system. Fut. Gener. Comput. Syst. 80 (2018), 157–170.
[42] Shalaginov Andrii et al. 2018. Machine learning aided static malware analysis: A survey and tutorial. In Cyber Threat Intelligence. Springer, 7–45.
[43] Sonnenreich Wes, Albanese Jason, Stout Bruce, et al. 2006. Return on security investment (ROSI)—A practical quantitative model. J. Res. Pract. Inf. Technol. 38, 1 (2006), 45.
[44] Souri Alireza and Hosseini Rahil. 2018. A state-of-the-art survey of malware detection approaches using data mining techniques. Hum.-centr. Comput. Inf. Sci. 8, 1 (2018), 3.
[45] Thongkanchorn Kittikhun et al. 2013. Evaluation studies of three intrusion detection systems under various attacks and rule sets. In Proceedings of the IEEE Region 10 Conference (TENCON’13). IEEE, 1–4.
[46] VirusTotal. 2022. Why Do Not You Include Statistics Comparing Antivirus Performance? Retrieved June 22, 2022 from https://support.virustotal.com/hc/en-us/articles/115002094589-Why-do-not-you-include-statistics-comparing-antivirus-performance-.
[47] Werther Joseph, Zhivich Michael, Leek Tim, and Zeldovich Nickolai. 2011. Experiences in cyber security education: The MIT Lincoln Laboratory Capture-the-Flag Exercise. In Proceedings of the Workshop on Cyber Security Experimentation and Test (CSET’11).
[48] Wolf Julia. 2010. OMG WTF PDF. In 27th Chaos Communication Congress.
[49] Ye Yanfang, Li Tao, Adjeroh Donald, and Iyengar S. Sitharama. 2017. A survey on malware detection using data mining techniques. ACM Comput. Surv. 50, 3 (2017), 41.
[50] Zhu Shuofei, Shi Jianjun, Yang Limin, Qin Boqin, Zhang Ziyi, Song Linhai, and Wang Gang. 2020. Measuring and modeling the label dynamics of online anti-malware engines. In Proceedings of the USENIX Security Symposium (USENIX Security’20). 2361–2378.
