Abstract
Command-line commands form a special kind of semi-natural language. Analyzing their structure and classifying them is a useful approach in cyber security for detecting anomalous commands used by malicious actors. Without contextual knowledge, analyzing commands is difficult: similar-looking commands may perform different tasks, while commands that use different aliases may perform the same task. To capture the structure of command-line commands and their syntactic and semantic meaning, we created a rule-based system based on expert opinions. Using this system, we classified command-line commands into similar and not-similar classes, which transformed the command-line data into a binary-labeled form. We trained three machine learning models (a logistic regression document classifier, a deep learning document classifier, and a deep learning sentence-pair classifier) to learn the set of rules created in the rule-based system. We used the Matthews Correlation Coefficient (MCC) score to compare the models' performance. The logistic regression model achieved an MCC score of 0.85, whereas both deep learning (DL) models scored above 0.98. The DL document classifier and the DL sentence-pair classifier achieved accuracies of 0.943 and 0.983, respectively, on unseen data. Our proposed hybrid approach solves the complex problem of classifying semi-natural language data. This approach can be used to create a domain-specific set of rules and to classify any semi-natural language data into multiple classes.
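The abstract does not list the expert rules, so the following is only a minimal sketch of the general idea: a rule assigns a binary similar / not-similar label, and MCC compares those labels with a model's predictions. The alias table, the single "same program" rule, the example commands, and the hard-coded predictions are all hypothetical stand-ins, not the authors' rules, data, or models.

```python
import math

# Hypothetical alias table; the chapter's actual expert rules are not shown here.
ALIASES = {"dir": "ls", "del": "rm", "copy": "cp"}


def program(command: str) -> str:
    """Return the alias-normalized program name (first token) of a command."""
    tokens = command.strip().lower().split()
    name = tokens[0] if tokens else ""
    return ALIASES.get(name, name)


def rule_label(cmd_a: str, cmd_b: str) -> int:
    """Toy rule: 1 (similar) if both commands run the same program, else 0."""
    return int(program(cmd_a) == program(cmd_b))


def mcc(y_true, y_pred) -> float:
    """Matthews Correlation Coefficient for binary labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0.0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Made-up command pairs, labeled by the toy rule above.
pairs = [
    ("ls -la /tmp", "dir /tmp"),
    ("rm -rf build", "del build"),
    ("cp a.txt b.txt", "rm a.txt"),
    ("ls", "cp x y"),
]
labels = [rule_label(a, b) for a, b in pairs]

# Stand-in for a trained classifier's output, with one deliberate mistake.
predictions = [1, 1, 0, 1]
score = mcc(labels, predictions)
print(labels, round(score, 4))
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so a single false positive in this tiny example already pulls the score well below 1.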
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Hussain, Z., Nurminen, J.K., Mikkonen, T., Kowiel, M. (2023). Combining Rule-Based System and Machine Learning to Classify Semi-natural Language Data. In: Arai, K. (eds) Intelligent Systems and Applications. IntelliSys 2022. Lecture Notes in Networks and Systems, vol 542. Springer, Cham. https://doi.org/10.1007/978-3-031-16072-1_32
DOI: https://doi.org/10.1007/978-3-031-16072-1_32
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-16071-4
Online ISBN: 978-3-031-16072-1
eBook Packages: Intelligent Technologies and Robotics (R0)