
Combining Rule-Based System and Machine Learning to Classify Semi-natural Language Data

  • Conference paper
Intelligent Systems and Applications (IntelliSys 2022)

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 542)


Abstract

Command-line commands form a special kind of semi-natural language. Analyzing their structure and classifying them is useful in cyber security for detecting anomalous commands used by malicious actors. Without contextual knowledge, analyzing commands is difficult: similar-looking commands may perform different tasks, and commands with different aliases may perform the same task. To capture the structure of command-line commands and their syntactic and semantic meaning, we created a rule-based system based on expert opinions. Using this system, we classified command-line commands into similar and not-similar classes, which transformed the command-line data into a binary-labeled form. We then trained three machine learning models (a logistic regression document classifier, a deep learning document classifier, and a deep learning sentence-pair classifier) to learn the rules encoded in the rule-based system. We used the Matthews Correlation Coefficient (MCC) to compare the models' performance. The logistic regression model achieved an MCC of 0.85, whereas both deep learning (DL) models scored above 0.98. The DL document classifier and the DL sentence-pair classifier achieved accuracies of 0.943 and 0.983, respectively, on unseen data. Our proposed hybrid approach solves the complex problem of classifying semi-natural language data. The approach can be used to create a domain-specific set of rules and to classify any semi-natural language data into multiple classes.
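To make the pipeline in the abstract concrete, the following minimal sketch labels command pairs with a toy rule and then scores a set of predictions with MCC. It is an illustration under stated assumptions, not the authors' system: the alias table, the `normalize` and `rule_label` helpers, the example commands, and the stand-in predictions are all hypothetical, since the abstract does not describe the actual expert rules or models.

```python
import math
import shlex

# Hypothetical alias table (assumption): the expert rules used in the
# paper are not given in the abstract.
ALIASES = {"dir": "ls", "del": "rm", "copy": "cp"}


def normalize(command):
    """Reduce a command to (canonical program name, set of '-' flags)."""
    tokens = shlex.split(command.lower())
    program = ALIASES.get(tokens[0], tokens[0])
    flags = frozenset(t for t in tokens[1:] if t.startswith("-"))
    return program, flags


def rule_label(cmd_a, cmd_b):
    """Toy rule: 1 (similar) if program and flags match after normalization."""
    return int(normalize(cmd_a) == normalize(cmd_b))


def mcc(y_true, y_pred):
    """Matthews Correlation Coefficient for binary labels (0 or 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: report 0.0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Illustrative command pairs (assumption), labeled by the toy rule.
pairs = [
    ("dir -l", "ls -l"),          # alias of the same program, same flags
    ("rm -r build", "rm build"),  # same program, different flags
    ("copy a b", "cp a b"),       # alias of the same program, no flags
    ("ls -a", "rm -a"),           # different programs
]
rule_labels = [rule_label(a, b) for a, b in pairs]

# Stand-in for a trained classifier's output (assumption): one pair missed.
predicted = [1, 0, 0, 0]

score = mcc(rule_labels, predicted)
```

In this framing the rule-generated labels play the role of ground truth, and MCC measures how closely a learned model reproduces them. MCC uses all four cells of the confusion matrix, so it stays informative when the similar and not-similar classes are imbalanced.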




Author information

Corresponding author

Correspondence to Zafar Hussain.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Hussain, Z., Nurminen, J.K., Mikkonen, T., Kowiel, M. (2023). Combining Rule-Based System and Machine Learning to Classify Semi-natural Language Data. In: Arai, K. (eds) Intelligent Systems and Applications. IntelliSys 2022. Lecture Notes in Networks and Systems, vol 542. Springer, Cham. https://doi.org/10.1007/978-3-031-16072-1_32
