
Leveraging BERT for extractive text summarization on federal police documents

  • Regular Paper
  • Published in: Knowledge and Information Systems

Abstract

A document known as notitia criminis (NC) is used by the Brazilian Federal Police as the starting point of a criminal investigation. An NC reports a summary of investigative activities and therefore contains all relevant information about an alleged crime. To manage an inquiry and correlate similar investigations, the Federal Police usually needs to extract essential information from an NC document. Manual extraction (reading and understanding the entire content) can be mentally exhausting due to the size and complexity of the documents. In this light, natural language processing (NLP) techniques are commonly used for automatic information extraction from textual documents. Deep neural networks have been successfully applied to many different NLP tasks. One neural network model that boosted results across a wide range of NLP tasks is BERT, an acronym for Bidirectional Encoder Representations from Transformers. In this article, we propose approaches based on the BERT model to extract relevant information from textual documents using automatic text summarization techniques. In other words, we analyze the feasibility of using the BERT model to extract and synthesize the most essential information of an NC document. We evaluate the performance of the proposed approaches using two real-world datasets: the Federal Police dataset (a private domain dataset) and the Brazilian WikiHow dataset (a public domain dataset). Experimental results using different variants of the ROUGE metric show that our approaches can significantly increase extractive text summarization effectiveness without sacrificing efficiency.
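
The full pipeline is described in the article body; as a rough illustration, the following is a minimal sketch of the kind of BERT-based extractive summarization the abstract describes: embed each sentence with a pretrained BERT encoder, cluster the embeddings with k-means, and keep the sentence closest to each centroid. The checkpoint (BERTimbau, a Brazilian Portuguese BERT), the mean pooling, and the fixed number of clusters are illustrative assumptions, not the authors' exact configuration.

```python
# Hedged sketch: extractive summarization by clustering BERT sentence
# embeddings. Checkpoint, pooling, and k are illustrative assumptions,
# not the exact configuration used in the article.
import numpy as np
import torch
from sklearn.cluster import KMeans
from transformers import AutoModel, AutoTokenizer

CHECKPOINT = "neuralmind/bert-base-portuguese-cased"  # BERTimbau
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModel.from_pretrained(CHECKPOINT)
model.eval()

def embed(sentences):
    """Mean-pool the last hidden states into one vector per sentence."""
    enc = tokenizer(sentences, padding=True, truncation=True,
                    return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state   # (batch, tokens, dim)
    mask = enc["attention_mask"].unsqueeze(-1)    # ignore padding tokens
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

def summarize(sentences, k=3):
    """Pick the sentence nearest each k-means centroid, in document order."""
    X = embed(sentences)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    picked = {int(np.argmin(np.linalg.norm(X - c, axis=1)))
              for c in km.cluster_centers_}
    return [sentences[i] for i in sorted(picked)]
```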


Notes

  1. Term Frequency-Inverse Document Frequency (TF-IDF) is a technique for text vectorization based on the Bag-of-Words (BoW) model (see the sketch after these notes).

  2. The elbow method is a common heuristic in mathematical optimization. In clustering, it selects the number of clusters at the point where adding another cluster no longer provides a substantially better model of the data (see the sketch after these notes).
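
To make notes 1 and 2 concrete, here is a small hedged sketch using scikit-learn: TF-IDF vectorization of a toy corpus, followed by the inertia curve that the elbow heuristic inspects when choosing the number of clusters. The corpus and the range of k are invented for illustration.

```python
# Hedged sketch for notes 1 and 2: TF-IDF vectors plus the inertia curve
# inspected by the elbow heuristic. Corpus and k range are illustrative.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the suspect opened a bank account",
    "the account received several wire transfers",
    "officers interviewed the bank manager",
    "the manager confirmed the transfers",
    "a report was filed with the court",
]

# Note 1: TF-IDF turns each text into a sparse BoW-weighted vector.
X = TfidfVectorizer().fit_transform(corpus)

# Note 2: inertia drops as k grows; the "elbow" is the k after which
# adding another cluster no longer models the data much better.
for k in range(1, len(corpus)):
    inertia = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(f"k={k}: inertia={inertia:.3f}")
```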


Acknowledgements

This work was supported by the Federal Police and the Federal University of Campina Grande under the Epol project 08200.01128/2019-72. We thank them for providing the data used in this research.

Author information


Corresponding author

Correspondence to Thierry S. Barros.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Barros, T.S., Pires, C.E.S. & Nascimento, D.C. Leveraging BERT for extractive text summarization on federal police documents. Knowl Inf Syst 65, 4873–4903 (2023). https://doi.org/10.1007/s10115-023-01912-8

