research-article

ProvG-Searcher: A Graph Representation Learning Approach for Efficient Provenance Graph Search

Authors:
Enes Altinisik, Qatar Computing Research Institute, HBKU, Doha, Qatar (ORCID 0000-0001-9300-6564)
Fatih Deniz, Qatar Computing Research Institute, HBKU, Doha, Qatar (ORCID 0000-0001-9987-9569)
Hüsrev Taha Sencar, Qatar Computing Research Institute, HBKU, Doha, Qatar (ORCID 0000-0001-6910-6194)

CCS '23: Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security, November 2023, Pages 2247–2261
https://doi.org/10.1145/3576915.3623187

Published: 21 November 2023

ABSTRACT

We present ProvG-Searcher, a novel approach for detecting known APT behaviors within system security logs. Our approach leverages provenance graphs, a comprehensive graph representation of event logs, to capture and depict data provenance relations by mapping system entities as nodes and their interactions as edges. We formulate the task of searching provenance graphs as a subgraph matching problem and employ a graph representation learning method. The central component of our search methodology involves embedding of subgraphs in a vector space where subgraph relationships can be directly evaluated. We achieve this through the use of order embeddings that simplify subgraph matching to straightforward comparisons between a query and precomputed subgraph representations. To address challenges posed by the size and complexity of provenance graphs, we propose a graph partitioning scheme and a behavior-preserving graph reduction method. Overall, our technique offers significant computational efficiency, allowing most of the search computation to be performed offline while incorporating a lightweight comparison step during query execution. Experimental results on standard datasets demonstrate that ProvG-Searcher achieves superior performance, with an accuracy exceeding 99% in detecting query behaviors and a false positive rate of approximately 0.02%, outperforming other approaches.
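
The comparison step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the scoring function, so the code below assumes the penalty commonly used with order embeddings: a query counts as contained in a target subgraph when its embedding is no larger than the target's in every coordinate. The function names, the partition names, the embedding values, and the zero threshold are all hypothetical and chosen only for illustration; in the paper's setting the embeddings would come from a trained graph neural network.

```python
from typing import Dict, List, Sequence


def order_violation(query: Sequence[float], target: Sequence[float]) -> float:
    """Sum of squared amounts by which `query` exceeds `target` per coordinate.

    A value of 0 means the query embedding lies entirely "below" the target
    embedding, which is read here as "query is a subgraph of target".
    """
    if len(query) != len(target):
        raise ValueError("embeddings must have the same dimension")
    return sum(max(0.0, q - t) ** 2 for q, t in zip(query, target))


def search(
    query: Sequence[float],
    index: Dict[str, Sequence[float]],
    threshold: float = 0.0,
) -> List[str]:
    """Return the names of indexed subgraphs whose violation is within threshold."""
    return [
        name
        for name, embedding in index.items()
        if order_violation(query, embedding) <= threshold
    ]


# Hypothetical embeddings, assumed to have been computed offline.
index = {
    "partition_a": [0.9, 0.7, 0.8],
    "partition_b": [0.2, 0.9, 0.1],
}

# Hypothetical embedding of a query behavior graph.
query = [0.5, 0.6, 0.3]

matches = search(query, index)
print(matches)
```

Under these assumptions the query-time work is one elementwise comparison per stored embedding, which is consistent with the abstract's point that most of the computation can be moved offline.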


Index Terms

1. Security and privacy
  1. Intrusion/anomaly detection and malware mitigation
    1. Intrusion detection systems
  2. Systems security


Published in
CCS '23: Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security
November 2023
3722 pages
ISBN: 9798400700507
DOI: 10.1145/3576915
General Chairs: Weizhi Meng (Technical University of Denmark), Christian D. Jensen (Technical University of Denmark)
Program Chairs: Cas Cremers (CISPA Helmholtz Center for Information Security), Engin Kirda (Khoury College of Computer Sciences)
Copyright © 2023 ACM
Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
apt behaviors
graph entailment
graph neural networks
graph reduction
order embeddings
provenance graph
security logs
subgraph matching
threat hunting
Qualifiers
- research-article
