Paper: Improving NLP Model Performance on Small Educational Data Sets Using Self-Augmentation

Authors: Keith Cochran 1,2; Clayton Cohn 1,2 and Peter Hastings 1,2

Affiliations: 1 DePaul University, Chicago IL 60604, U.S.A.; 2 Vanderbilt University, Nashville TN 37240, U.S.A.

Keyword(s): Educational Texts, Natural Language Processing, BERT, Data Augmentation, Text Augmentation, Imbalanced Data Sets.

Abstract: Computer-supported education studies can perform two important roles. They can allow researchers to gather important data about student learning processes, and they can help students learn more efficiently and effectively by providing automatic immediate feedback on what the students have done so far. The evaluation of student work required for both of these roles can be relatively easy in domains like math, where there are clear right answers. When text is involved, however, automated evaluations become more difficult. Natural Language Processing (NLP) can provide quick evaluations of student texts. However, traditional neural network approaches require a large amount of data to train models with enough accuracy to be useful in analyzing student responses. Typically, educational studies collect data, but often only in small amounts and with a narrow focus on a particular topic. BERT-based neural network models have revolutionized NLP because they are pre-trained on very large corpora, developing a robust, contextualized understanding of the language. They can then be “fine-tuned” on a much smaller set of data for a particular task. However, these models still need a certain base level of training data to be reasonably accurate, and that base level can exceed what educational applications provide, which might be only a few dozen examples. In other areas of artificial intelligence, such as computer vision, model performance on small data sets has been improved by “data augmentation”: adding scaled and rotated versions of the original images to the training set. This has been attempted on textual data; however, augmenting text is much more difficult than simply scaling or rotating images. The newly generated sentences may not be semantically similar to the original sentence, resulting in an improperly trained model. In this paper, we examine a self-augmentation method that is straightforward and yields substantial performance improvements with different BERT-based models in two different languages and on two different tasks that have small data sets. We also identify the limitations of the self-augmentation procedure.
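The abstract describes fine-tuning BERT-based models on small educational data sets and enlarging the training data with additional, automatically labeled sentences. As an illustration only, a minimal self-training-style sketch of that idea using the Hugging Face transformers and datasets libraries might look like the following; the checkpoint name, the fine_tune and self_augment helpers, the confidence threshold, and the source of the extra (e.g., paraphrased) sentences are assumptions, not the procedure reported in the paper.

```python
# Hypothetical sketch of self-augmentation for a small text-classification data set.
# This is a generic self-training loop, not the paper's exact method.
import torch
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "bert-base-uncased"  # assumption: any BERT-style checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

def fine_tune(texts, labels, num_labels):
    """Fine-tune a fresh BERT classifier on the (possibly augmented) training set."""
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=num_labels)
    ds = Dataset.from_dict({"text": texts, "label": labels}).map(tokenize, batched=True)
    args = TrainingArguments(output_dir="out", num_train_epochs=3,
                             per_device_train_batch_size=8, logging_steps=10)
    Trainer(model=model, args=args, train_dataset=ds).train()
    return model

def self_augment(train_texts, train_labels, unlabeled_texts, num_labels, threshold=0.9):
    """Label extra (e.g., paraphrased) sentences with the current model, keep only
    high-confidence predictions, and retrain on the enlarged training set."""
    model = fine_tune(train_texts, train_labels, num_labels)
    model.eval()
    enc = tokenizer(unlabeled_texts, truncation=True, padding=True,
                    max_length=128, return_tensors="pt")
    enc = {k: v.to(model.device) for k, v in enc.items()}
    with torch.no_grad():
        probs = torch.softmax(model(**enc).logits, dim=-1).cpu().numpy()
    keep = probs.max(axis=1) >= threshold          # confidence filter (assumed value)
    new_texts = [t for t, k in zip(unlabeled_texts, keep) if k]
    new_labels = probs.argmax(axis=1)[keep].tolist()
    # Retrain on the original data plus the self-labeled augmentations.
    return fine_tune(train_texts + new_texts, train_labels + new_labels, num_labels)
```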

License: CC BY-NC-ND 4.0


Paper citation in several formats:
Cochran, K.; Cohn, C. and Hastings, P. (2023). Improving NLP Model Performance on Small Educational Data Sets Using Self-Augmentation. In Proceedings of the 15th International Conference on Computer Supported Education - Volume 1: CSEDU; ISBN 978-989-758-641-5; ISSN 2184-5026, SciTePress, pages 70-78. DOI: 10.5220/0011857200003470

@conference{csedu23,
author={Keith Cochran and Clayton Cohn and Peter Hastings},
title={Improving NLP Model Performance on Small Educational Data Sets Using Self-Augmentation},
booktitle={Proceedings of the 15th International Conference on Computer Supported Education - Volume 1: CSEDU},
year={2023},
pages={70-78},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0011857200003470},
isbn={978-989-758-641-5},
issn={2184-5026},
}

TY - CONF

JO - Proceedings of the 15th International Conference on Computer Supported Education - Volume 1: CSEDU
TI - Improving NLP Model Performance on Small Educational Data Sets Using Self-Augmentation
SN - 978-989-758-641-5
IS - 2184-5026
AU - Cochran, K.
AU - Cohn, C.
AU - Hastings, P.
PY - 2023
SP - 70
EP - 78
DO - 10.5220/0011857200003470
PB - SciTePress
ER -