DOI: 10.1145/3576915.3623187
research-article

ProvG-Searcher: A Graph Representation Learning Approach for Efficient Provenance Graph Search

Published: 21 November 2023

ABSTRACT

We present ProvG-Searcher, a novel approach for detecting known APT behaviors within system security logs. Our approach leverages provenance graphs, a comprehensive graph representation of event logs, to capture and depict data provenance relations by mapping system entities as nodes and their interactions as edges. We formulate the task of searching provenance graphs as a subgraph matching problem and employ a graph representation learning method. The central component of our search methodology involves embedding of subgraphs in a vector space where subgraph relationships can be directly evaluated. We achieve this through the use of order embeddings that simplify subgraph matching to straightforward comparisons between a query and precomputed subgraph representations. To address challenges posed by the size and complexity of provenance graphs, we propose a graph partitioning scheme and a behavior-preserving graph reduction method. Overall, our technique offers significant computational efficiency, allowing most of the search computation to be performed offline while incorporating a lightweight comparison step during query execution. Experimental results on standard datasets demonstrate that ProvG-Searcher achieves superior performance, with an accuracy exceeding 99% in detecting query behaviors and a false positive rate of approximately 0.02%, outperforming other approaches.
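The comparison step the abstract describes can be illustrated with a minimal sketch of an order-embedding test. It assumes the common convention that a query is treated as a subgraph of a target when the query embedding is no larger than the target embedding in every coordinate. The function names, the example vectors, and the zero threshold are hypothetical and are not taken from the paper; in practice the embeddings would come from a trained graph neural network.

```python
from typing import Dict, List, Sequence


def order_violation(query: Sequence[float], target: Sequence[float]) -> float:
    """Squared hinge penalty; zero exactly when query <= target coordinate-wise."""
    if len(query) != len(target):
        raise ValueError("embeddings must have the same dimension")
    return sum(max(0.0, q - t) ** 2 for q, t in zip(query, target))


def search(
    query: Sequence[float],
    index: Dict[str, Sequence[float]],
    threshold: float = 0.0,  # assumed value; a real system would tune this
) -> List[str]:
    """Return names of precomputed subgraphs whose embedding dominates the query."""
    return [
        name
        for name, embedding in index.items()
        if order_violation(query, embedding) <= threshold
    ]


# Hypothetical precomputed (offline) embeddings of provenance subgraphs.
index = {
    "subgraph_a": [0.9, 0.8, 0.7],
    "subgraph_b": [0.2, 0.9, 0.1],
}

# Hypothetical embedding of a query graph describing a known behavior.
query = [0.5, 0.6, 0.3]

matches = search(query, index)
print(matches)
```

Only the loop in `search` runs at query time. The embeddings in `index` are computed in advance, which is the property the abstract points to when it says most of the search computation can be performed offline.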

