Combining Rule-Based System and Machine Learning to Classify Semi-natural Language Data

Hussain, Zafar; Nurminen, Jukka K.; Mikkonen, Tommi; Kowiel, Marcin

doi:10.1007/978-3-031-16072-1_32

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 542)

Included in the following conference series:

Proceedings of SAI Intelligent Systems Conference

Abstract

Command-line commands form a special kind of semi-natural language. In cyber security, analyzing their structure and classifying them is a useful way to detect anomalous commands used by malicious actors. Without contextual knowledge, analyzing commands is difficult: similar-looking commands may perform different tasks, and commands with different aliases may perform the same task. To capture the structure of command-line commands and their syntactic and semantic meaning, we created a rule-based system based on expert opinions. Using this system, we classified pairs of command-line commands into similar and not-similar classes, which transformed the command-line data into a binary-labeled form. We then trained three machine learning models (a logistic regression document classifier, a deep learning document classifier, and a deep learning sentence-pair classifier) to learn the set of rules encoded in the rule-based system. We used the Matthews Correlation Coefficient (MCC) to compare the models’ performance. The logistic regression model achieved an MCC score of 0.85, whereas both Deep Learning (DL) models scored above 0.98. On unseen data, the DL document classifier and the DL sentence-pair classifier achieved accuracies of 0.943 and 0.983, respectively. Our proposed hybrid approach addresses the complex problem of classifying semi-natural language data. The approach can be used to create a domain-specific set of rules and to classify any semi-natural language data into multiple classes.
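As an illustration of the evaluation metric named in the abstract, the following minimal sketch computes MCC and accuracy for a binary similar/not-similar task in plain Python. The label vectors are hypothetical and are not data from the paper; the function names are likewise assumptions made for this example. It only shows how MCC is derived from the confusion-matrix counts and how it can differ from accuracy on the same predictions.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels: 1 = similar, 0 = not-similar."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def mcc(y_true, y_pred):
    """Matthews Correlation Coefficient, in [-1, 1]; 0.0 if the denominator is zero."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention when a row or column of the confusion matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Hypothetical labels for ten command pairs (illustrative only).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
print(f"MCC      = {mcc(y_true, y_pred):.3f}")
```

Because MCC uses all four confusion-matrix counts, a classifier that always predicts one class scores 0.0 under this convention regardless of how imbalanced the classes are, whereas its accuracy can still look high.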

Author information

Authors and Affiliations

University of Helsinki, Helsinki, Finland
Zafar Hussain, Jukka K. Nurminen & Tommi Mikkonen
F-Secure Corporation, Warsaw, Poland
Marcin Kowiel

Corresponding author

Correspondence to Zafar Hussain.

Editor information

Editors and Affiliations

Saga University, Saga, Japan
Kohei Arai

About this paper

Cite this paper

Hussain, Z., Nurminen, J.K., Mikkonen, T., Kowiel, M. (2023). Combining Rule-Based System and Machine Learning to Classify Semi-natural Language Data. In: Arai, K. (eds) Intelligent Systems and Applications. IntelliSys 2022. Lecture Notes in Networks and Systems, vol 542. Springer, Cham. https://doi.org/10.1007/978-3-031-16072-1_32

DOI: https://doi.org/10.1007/978-3-031-16072-1_32
Published: 31 August 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-16071-4
Online ISBN: 978-3-031-16072-1
eBook Packages: Intelligent Technologies and Robotics (R0)
