Using Active Learning for Segmentation and Semantic Classification of Legal Acts Extracted from Official Diaries

Authors

DOI:

https://doi.org/10.5753/jidm.2023.3181

Keywords:

Semantic Classification, Active Learning, Official Diaries, Annotation Tool

Abstract

Good governance rests on openness and transparency, which in turn require unimpeded and verifiable access to legal and regulatory information. With such access, we can monitor government actions to ensure that public financial resources are not used improperly or inconsistently. This facilitates, for example, the detection of unlawful behavior in public actions such as bidding processes and auctions. However, public agencies each adopt their own criteria for the models and formats in which they publish information, as the varying styles of municipal, state, and federal (union) documents show. In this context, we aim to minimize the effort required to process public documents, notably official gazettes. To this end, we propose a structure-oriented heuristic for extracting relevant excerpts from their texts. We then characterize these excerpts through morphosyntactic analysis and entity recognition. Subsequently, we semantically classify the extracted fragments into "sections of interest" (e.g., bids, laws, personnel, budget) using an active learning strategy that reduces the manual labeling effort. We also improve the classification process by incorporating transformers and stacking, and by combining different types of representations (e.g., frequentist, static, and contextual semantic embeddings). Furthermore, we exploit oversampling based on semi-supervised learning to deal with the scarcity and skewness of labeled data. Finally, we combine all these contributions in a real-time annotation tool with active learning support that achieves 100% accuracy in extraction and an overall accuracy of 85% in classification with very little labeling effort.
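
To make the active learning step more concrete, the following is a minimal sketch of pool-based active learning with least-confidence uncertainty sampling. It is an illustration under stated assumptions, not the method used in the paper: the abstract does not specify the selection strategy or the classifier, so a small naive Bayes model stands in for the transformer and stacking classifiers, and the excerpts, labels, and the `oracle` function (which simulates a human annotator) are all hypothetical.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    return text.lower().split()

def train_nb(labeled):
    """Fit a multinomial naive Bayes model on (text, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> token counts
    class_counts = Counter()            # label -> number of excerpts
    vocab = set()
    for text, label in labeled:
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        class_counts[label] += 1
        vocab.update(tokens)
    return word_counts, class_counts, vocab

def predict_proba(model, text):
    """Return {label: posterior probability}, using add-one smoothing."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    log_scores = {}
    for label, n_docs in class_counts.items():
        denom = sum(word_counts[label].values()) + len(vocab) + 1
        score = math.log(n_docs / total_docs)
        for tok in tokenize(text):
            score += math.log((word_counts[label][tok] + 1) / denom)
        log_scores[label] = score
    top = max(log_scores.values())  # subtract the max for numerical stability
    exp = {label: math.exp(s - top) for label, s in log_scores.items()}
    z = sum(exp.values())
    return {label: v / z for label, v in exp.items()}

def least_confident(model, pool):
    """Index of the pool excerpt whose most likely class has the lowest probability."""
    return min(range(len(pool)),
               key=lambda i: max(predict_proba(model, pool[i]).values()))

def active_learning(seed, pool, oracle, budget):
    """Ask the oracle to label the most uncertain excerpt, retrain, and repeat."""
    labeled, pool = list(seed), list(pool)
    for _ in range(min(budget, len(pool))):
        model = train_nb(labeled)
        text = pool.pop(least_confident(model, pool))
        labeled.append((text, oracle(text)))
    return train_nb(labeled), labeled

# Hypothetical excerpts and labels, for illustration only.
seed = [
    ("notice of bidding for road paving services", "bids"),
    ("appointment of public servant to the health department", "personnel"),
]
pool = [
    "bidding notice for school meals supply",
    "dismissal of public servant at own request",
    "supplementary credit opened in the municipal budget",
]
answers = dict(zip(pool, ["bids", "personnel", "budget"]))

def oracle(text):
    # Stands in for the human annotator working in the annotation tool.
    return answers[text]

model, labeled = active_learning(seed, pool, oracle, budget=2)
```

With a budget of two queries, only two of the three pool excerpts are sent to the annotator, chosen because the current model is least sure about them. This is the sense in which active learning reduces manual labeling effort.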

Published

2023-10-31

How to Cite

Constantino, K., Silva, T. H. P., Silva, J. V. B., Cruz, V. A. L., Zucheratto, O. M. M., Carvalho, M., Santos, W., França, C., de Andrade, C. M. V., Laender, A. H. F., & Gonçalves, M. A. (2023). Using Active Learning for Segmentation and Semantic Classification of Legal Acts Extracted from Official Diaries. Journal of Information and Data Management, 14(1). https://doi.org/10.5753/jidm.2023.3181

Section

SBBD 2022 Full papers - Extended Papers