ABSTRACT
We present ProvG-Searcher, a novel approach for detecting known advanced persistent threat (APT) behaviors within system security logs. Our approach leverages provenance graphs, a comprehensive graph representation of event logs, to capture data provenance relations by mapping system entities to nodes and their interactions to edges. We formulate the task of searching provenance graphs as a subgraph matching problem and solve it with a graph representation learning method. The central component of our search methodology is embedding subgraphs in a vector space where subgraph relationships can be evaluated directly. We achieve this with order embeddings, which reduce subgraph matching to straightforward comparisons between a query and precomputed subgraph representations. To address the challenges posed by the size and complexity of provenance graphs, we propose a graph partitioning scheme and a behavior-preserving graph reduction method. Overall, our technique is computationally efficient: most of the search computation is performed offline, leaving only a lightweight comparison step at query time. Experimental results on standard datasets demonstrate that ProvG-Searcher achieves an accuracy exceeding 99% in detecting query behaviors and a false positive rate of approximately 0.02%, outperforming other approaches.
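As an illustration of the order-embedding comparison described above, the following minimal sketch assumes the standard order-embedding criterion: a query is treated as a subgraph of a target when every coordinate of the query embedding is at most the corresponding coordinate of the target embedding. The function names, the three-dimensional vectors, the partition names, and the zero threshold are all hypothetical; in practice the embeddings would come from a trained graph encoder, which is not shown here.

```python
from typing import Dict, List, Sequence


def order_violation(query: Sequence[float], target: Sequence[float]) -> float:
    """Sum of squared amounts by which `query` exceeds `target` per coordinate.

    A value of 0.0 means the order constraint query <= target holds in
    every dimension, which is read as "query is a subgraph of target".
    """
    if len(query) != len(target):
        raise ValueError("embeddings must have the same dimension")
    return sum(max(0.0, q - t) ** 2 for q, t in zip(query, target))


def search(
    query: Sequence[float],
    index: Dict[str, Sequence[float]],
    threshold: float = 0.0,
) -> List[str]:
    """Return names of precomputed subgraphs whose violation is within threshold."""
    return [
        name
        for name, embedding in index.items()
        if order_violation(query, embedding) <= threshold
    ]


# Hypothetical embeddings, computed offline for two graph partitions.
index = {
    "partition_a": [0.9, 0.4, 0.7],
    "partition_b": [0.2, 0.8, 0.1],
}

# Hypothetical embedding of the query behavior graph.
query = [0.5, 0.3, 0.6]

matches = search(query, index)
print(matches)
```

In this sketch only the comparison runs at query time; the index is built once in advance, which mirrors the offline/online split described in the abstract. A trained model would typically use a small positive threshold rather than exactly zero, but that value is an assumption and would need tuning.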