Dialogue Act Classification via Transfer Learning for Automated Labeling of Interviewee Responses in Virtual Reality Job Interview Training Platforms for Autistic Individuals

Computer-based job interview training, including virtual reality (VR) simulations, has gained popularity in recent years as a means of supporting autistic individuals, who face significant barriers in finding and maintaining employment. Although popular, these training systems often fail to capture the complexity and dynamism of a real employment interview: dialogue management for the virtual conversation agent either relies on choosing from a menu of prespecified answers, or dialogue processing is based on keyword extraction from the interviewee's transcribed speech, which ties the system to a particular interview script. We address this limitation through automated dialogue act classification via transfer learning, which recognizes intent from user speech independent of the domain of the interview. We also redress the lack of training data for a domain-general job interview dialogue act classifier by providing an original dataset of responses to interview questions, collected from 22 autistic participants within a virtual job interview platform. Participants' responses to a customized interview script were transcribed to text and annotated according to a custom 13-class dialogue act scheme. The best classifier was a fine-tuned bidirectional encoder representations from transformers (BERT) model, with an f1-score of 87%.


Introduction
Autism spectrum disorder (ASD) is a lifelong neurological and developmental disorder, with a prevalence of 1 in 44 children in the United States (US) [1], which affects an individual's ability to learn, as well as to communicate and interact socially. ASD is characterized by deficits and/or impairments in social and emotional reciprocity, in communicative nonverbal behaviors, and in forming and maintaining relationships; the level of severity varies, and an individual may require minimal to substantial support [2]. Autistic individuals (identity-first terminology has been chosen based on the preference of autistic individuals [3]) are often under- or unemployed, and finding and keeping employment is often difficult [4][5][6]. A 2017 National Autism Indicators Report indicated that amongst 3520 autistic adults (ages 18-64) from 31 US states who received developmental disability services, only 14% were employed with pay in the community, 54% had unpaid jobs or activities in a facility, and 27% had no jobs or any form of participation in the community [4]. Factors behind these low rates of employment amongst autistic individuals include difficulties in social interaction, discrimination in the workplace [7], sensitivity to workplace environments (e.g., lighting, sounds, and smells) [8], lack of support from vocational rehabilitation services or a lack of such services altogether [9], and/or a poor job fit [8]. Amongst these challenges, the job interview, where autistic individuals are expected to adhere to the "norm" in terms of mannerisms and interactions, poses a major barrier to employment, and in turn, to independent living [7,10,11].
To address the above, several computer-based job interview simulators have been developed that cater to autistic individuals' needs (e.g., routine and certainty) and have demonstrated efficacy [12][13][14][15][16][17]. VR-based interview training systems, including VR simulations, provide autistic job candidates a controlled, self-paced learning environment in which to practice their interviewing skills across multiple scenarios, with lower interpersonal risk and a sense of psychological safety [12,18,19]. Studies have shown that technology-enhanced job interview training platforms can improve interviewing skills and overall self-confidence when compared to traditional interview training methods [12,20]. Further, virtual job interview training can potentially reduce costs, as it reduces the need for personnel to conduct real-life mock interviews [21]. Studies have demonstrated that, depending on its realism, a virtual interview can induce the anxiety or nervousness of a real interview, which can help candidates get accustomed to the affective experience of interviews, and in turn, feel more comfortable in a real interview scenario [22][23][24]. Previous virtual job interview simulations with conversation agents aimed at helping autistic individuals collect interviewee (participant) responses and perform keyword extraction on transcribed text, which is dependent on the domain of the interview script [12,14-16,19]; oftentimes, the interview systems do not allow freely spoken responses at all, and instead provide multiple answers from which to choose the appropriate one. Though the existing methods work, they are ad hoc solutions, and the interview systems lack text classification methods that can recognize the intent of the interviewee's spoken utterances independent of the domain of the interview script.
In our recent work [25], we show how manually identifying intents from autistic interviewees' utterances can provide insight into their performance. Hence, automatic labeling to recognize the meaning of utterances has the potential to enhance current simulators for training autistic individuals. We address this via automated dialogue act classification.
Dialogue act classification or recognition is a method whereby the utterances of a conversation are classified into one of several dialogue acts or speech acts [26]. A dialogue act (DA) or speech act represents the meaning or intent of an utterance, e.g., whether the utterance is a question, a yes or no response, a "thank you" or a "good-bye", or a statement of opinion or fact [27]. Hence, DAs can capture the semantics of utterances in a domain by mapping utterances to DAs. DA classification has several applications, such as intent mining and identification [28]; classifying issues from comments on online collaborative platforms such as GitHub [29]; and, in conversation agents and dialogue systems, enabling an agent/system to understand the user and execute tasks that drive the flow of dialogue [30][31][32][33]. When combined with conversation context, DA classification has been shown to improve contextual topic modeling (the extraction of themes or topics in text [34]) [35]. When automated, DA classification can be used to automatically label utterances, whether from text alone or from other modes of data, such as text combined with facial expressions [36,37] or speech signals [32,38].
During the development of our previous work [39], we encountered a lack of publicly available data to train a text classifier that classifies utterances into domain-independent DAs relevant in a job interview context. Hence, in response to the aforementioned gap, we collected utterances of participants responding to questions from an interview script while interacting with our virtual job interview platform described in [39]. This data collection study resulted in a small dataset of 640 utterances. Observing patterns in the responses to interview questions, we developed a DA labeling scheme based on previous works [32,40,41] that captures the semantics of utterances in the context of a job interview, regardless of the domain of the interview. The anonymized data were then labeled by two annotators according to the developed DA scheme, and the annotated data were used to train the classifier. First, a baseline support vector machine (SVM) classifier and a random forest (RF) classifier were trained on the annotated data, which helped us understand our data and highlighted the labels that traditional machine learning failed to classify. Second, a pretrained bidirectional encoder representations from transformers (BERT) model was fine-tuned via transfer learning on our annotated data (building on the work of [41]). The final BERT classifier achieved an accuracy of 0.92 and an f1-score of 0.87.
The scope of this article is twofold: (1) to introduce a DA scheme that captures the semantics of utterances in a job interview context, through data collected in an experimental study with autistic individuals using our virtual reality job interview training system [39]; and (2) to present the best-performing classifier on the data. The three main contributions of this paper are (1) an original dataset comprising annotated responses to job interview questions by autistic individuals of working age, where responses are labeled using an in-house dialogue act scheme for job interviews; (2) a predictive model that takes an interview response as input and outputs the dialogue act capturing the general intent of the response; and (3) the interview script and annotation guidelines, to help future developers replicate this work and collect more data. In this work, we aim to introduce a new dataset of real speech from autistic individuals, collected in a simulated job interview, that can be used for automated DA classification in VR simulations to aid autistic individuals. Note that we have adapted a standard BERT model for DA classification of utterances to demonstrate feasibility as a first step towards this aim; more sophisticated models and/or methods can be employed in the future to improve accuracy.

Related Work
Previous studies discuss several methods for automated DA classification. Early DA classification models were traditional machine learning (ML) methods such as support vector machines [42,43], sometimes combined with hidden Markov models [44], Bayesian network-based classifiers [45,46], decision trees [47], and more [48]. More recently, solutions for DA classification have involved deep learning methods. Khatri et al. [35] created a 14-DA scheme based on observed patterns in a user-chatbot interaction dataset to determine the goal of the user's utterances; example DAs were InformationRequest, InformationDelivery, GeneralChat, and TopicSwitch. They used DAs as context to improve the accuracy of a topic modeling algorithm, and their best-performing DA classifier was a contextual deep average network (DAN) model. Raheja and Tetreault [49] combined a context-aware self-attention mechanism with a hierarchical recurrent neural network (RNN) to conduct utterance-level DA classification. They demonstrated improvements in classification on the Switchboard Dialogue Act (SwDA) Corpus [50], a benchmark dataset with a 42-DA scheme that captures general conversation [40]. Ahmadvand et al. [31] introduced a novel method for contextual DA classification built upon the method of [35]. The features included utterances represented as 300-dimension Word2Vec embeddings, parts-of-speech (POS) tags, topics, and lexical features, including word count and sentence count. Their model was a fully connected convolutional neural network tested on known state-of-the-art conversation corpora, demonstrating improved accuracy relative to previously reported baselines. Chatterjee and Sengupta [28] introduced an extension to the density-based clustering algorithm DBSCAN [51] for intent identification in conversation agents using available conversation data.
Their motivation was to reduce the need to label conversation data manually by automatically labeling previously available data with their classifier. Results on six datasets show the potential of their algorithm for intent mining. Although the above approaches have shown promise, they do not show improvements as significant as those demonstrated via transfer learning [52] using pretrained BERT [53]. Transfer learning in ML is an approach in which learning on a new task is improved by transferring knowledge from a previously learned related task, where the previously trained model is further trained or "fine-tuned" on the new task's data [54]. Duran et al. [55] conducted experiments to investigate the effect of different single-sentence representations when training DA classifiers, covering embeddings, punctuation, text case, vocabulary size, and tokenization. The results demonstrated the impact of the different features in sentence representations on the training of several models. The results also showed that pretrained BERT and its variant RoBERTa [56] yielded a notable increase in accuracy on the SwDA corpus when compared to other models, demonstrating that the contextual sentence representations produced by the BERT models are an improvement over other fine-tuned models. In another study, Noble and Maraev [57] found that while a standard pretrained BERT performs well, fine-tuning the pretrained BERT is essential for good performance on task-specific DA classification. Wu et al. [58] further pretrained a BERT model on eight datasets using masked language modeling (MLM), a fill-in-the-blank pretraining strategy in which the model is trained to predict the original token at a [MASK] position in a sentence (e.g., "Paris is the [MASK] of France", where [MASK] should be filled with the token that makes sense in context, namely "capital").
Their task-oriented dialogue BERT, or TOD-BERT, can be further fine-tuned for tasks such as response selection, dialogue state tracking, intent recognition, and DA prediction, which can help address the task-oriented dialogue data scarcity problem. Chakravarty et al. [41] created a DA scheme to fit the context of a question-answering scenario, where DAs were developed for questions and answers, respectively (e.g., wh-question or yes-answer). They trained three models on an annotated question-answering dataset with transcribed text from interviews: a convolutional neural network (CNN), a Bi-LSTM, and a BERT-base-cased pretrained model. Amongst the three fine-tuned models, their BERT model performed the best, with an overall f1-score of 0.84.
From the above literature review, we concluded that fine-tuning a pretrained model via transfer learning demonstrates significantly better results than training a model "from scratch". We can also see in previous literature [41,58] that the pretrained BERT models perform significantly better when it comes to text classification, which includes DA classification. Hence, we chose to train a DA classifier using a transfer learning approach.

Data Collection via Job Interview Training System
A total of 22 autistic participants (mean age = 19.1; SD = 2.8; 17 male and 5 female; 20 White, of whom 4 were Hispanic/Latino, and 2 African American) were recruited for a data collection study. Participants were required to be verbal, fluent in English, and of eligible employment age (16+ years) in the state of Tennessee, and the study was approved by the Institutional Review Board (IRB) of the lead author's university. Participants went through a mock interview with our career interview readiness in virtual reality (CIRVR) job interview environment, which features a virtual interviewer avatar [39]. During the interview simulation, the participant is seated in front of a desktop computer that runs the CIRVR simulation, as shown in Figure 1, and can interact with CIRVR and move around the virtual space using mouse and keyboard controls. Spoken responses are captured via a microphone headset and are then submitted via keyboard controls; this gives the user some time to think about their response before submission. Although CIRVR provides an immersive VR option, we chose not to use a VR headset or head-mounted display for this study due to the preferences of the population, as mentioned in our previous work [39].
To ensure a realistic job interview experience in CIRVR, the interview script comprised questions for the position of a data entry clerk. The questions reflected both a review of multiple years of questions used by actual employers, drawn from a university career management center's database, and interviews with 36 autistic employees, support professionals (e.g., job coaches and university career center professionals), and employers. The interview structure consisted of the following segments, as described in [15]: greetings ("hello" and "good morning"); technical questions (about work experience and specific job skills); education questions (e.g., formal schooling, favorite subjects, cocurricular activities, or job-relevant hobbies and activities); personal questions (e.g., behavior-based questions, such as experience working in a team); and an interviewee-initiated questions section, in which the participant can ask questions about the job (e.g., the work environment and work hours). CIRVR stepped through the questions of a predefined interview script, which the virtual avatar spoke aloud using the Microsoft Azure Text-to-Speech synthesizer. The participants' responses to each question were captured via a microphone on a headset and transcribed in real time using Microsoft Azure Speech-to-Text [59], which has been benchmarked as the speech recognizer with the lowest transcription error rates among competitors [60]. Transcribed speech for each session was stored as text, along with the corresponding questions, in a comma-separated values (CSV) file. These files were then analyzed to create a DA labeling scheme, as discussed in the next subsection.
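The per-session storage step can be sketched with Python's csv module. This is an illustrative sketch; the column names and file name are assumptions, not CIRVR's actual schema:

```python
import csv

def save_session(path, qa_pairs):
    """Write (question, transcribed response) pairs for one session to a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["question", "response"])  # assumed header, for illustration
        writer.writerows(qa_pairs)

save_session("session_01.csv", [
    ("Do you have any prior work experience?", "Yes. I worked as a cashier."),
    ("How do you feel this interview is going?", "Pretty well, I think."),
])
```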

The BERT Architecture and Fine-Tuning BERT for Text Classification
BERT is a transformer-based model, first introduced by Devlin et al. [53], that has been pretrained via self-supervised learning on a large corpus of unlabeled English data. It consists of a multilayer transformer encoder architecture based on the original transformer encoder described in [61]. Figure 2 presents a visualization of the BERT base model, which has 12 encoder layers, a hidden size of 768, and 12 attention heads [53]. Each encoder consists of (1) a self-attention layer, which helps the model learn to which words in a sentence it should "pay attention" based on the context of the surrounding words; and (2) a feedforward neural network layer that processes the output of the attention layer into an acceptable form for the next encoder [53,61]. Pretraining of BERT was conducted over two tasks: MLM and next sentence prediction (NSP). In MLM, a model is trained to predict missing words in a sentence based on the context provided by the neighboring words. In the MLM task for BERT, 15% of the words in an input sentence were masked, and the masked sentence was run through the BERT model to predict the missing words; this trains the model to learn a deep bidirectional representation of the sentence. During the NSP task, the input consisted of two sentences, in order to train the model to understand the relationship between them, i.e., whether sentence B follows sentence A [53]. Pretraining BERT using these two methods prepares the model for downstream tasks such as question answering, or in our case, sentence classification; specifically, single-sentence classification. To understand how BERT classifies text, let us take an example such as "I like cooking". This sentence is first converted to lowercase (we chose the BERT-base-uncased model for fine-tuning). Then, the sentence is tokenized by the BERT tokenizer, which gives the Tokenized Input, as shown in Figure 3.
The word "cooking" is broken into two subword tokens, "cook" and "##ing", so as not to increase the size of the vocabulary, and to handle any variation of the word "cook", such as "cooks", which is broken into "cook" and "##s". Each token has a pretrained embedding, and collectively, for a sentence input, they form the Token Embeddings. The [CLS] token is a special classification-task token, and the [SEP] token is a separator between multiple sentences; a second sentence can follow [SEP] for multi-sentence classification. The next layer of inputs are the segment embeddings, learned embeddings that indicate the sentence to which each token belongs. In our example, we have a single sentence, so all are E_A, i.e., each token belongs to sentence A. The final inputs are the Positional Embeddings, which encode the position of each token in the input sequence. The sum of the three embeddings is the input to the BERT encoder. Each token, when passed through the encoders, is embedded into a vector of length 768. In our example in Figure 3, the 4 tokens in the sentence plus the 2 special tokens form an embedding matrix of size 6 × 768, where 768 corresponds to the number of hidden units. This is the Model Output. For the text classification task, BERT takes the final hidden state of the [CLS] token, which is a representation of the whole sequence (the input sentence), and passes it to the Classifier to predict the label. The Classifier here can be a feedforward neural network to whose output we apply SoftMax to obtain the probabilities for each class. The class with the maximum probability is the predicted label [62].
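The final classification step can be sketched in plain Python. This is an illustrative stand-in only: a 4-dimensional vector and invented weights take the place of the real 768-dimensional [CLS] state and learned classifier parameters:

```python
import math

def softmax(logits):
    # Subtract the max logit for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(cls_hidden, weights, biases, labels):
    """Linear layer over the [CLS] hidden state, then SoftMax, then argmax."""
    logits = [sum(w * h for w, h in zip(row, cls_hidden)) + b
              for row, b in zip(weights, biases)]
    probs = softmax(logits)
    # The class with the maximum probability is the predicted label.
    return labels[probs.index(max(probs))]

# Toy example: a 4-dimensional stand-in for the [CLS] vector,
# three dialogue-act classes, and invented weights.
cls_vec = [0.2, -1.1, 0.7, 0.05]
W = [[0.5, 0.1, -0.3, 0.0],
     [-0.2, 0.4, 0.9, 0.1],
     [0.3, -0.5, 0.2, 0.7]]
b = [0.1, 0.0, -0.1]
print(classify(cls_vec, W, b, ["sno", "y", "n"]))
```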

Dialogue Act Labeling
The DAs used to label the interview questions were adapted from the Questions DA scheme in [41], with modifications made to the descriptions and guidelines to fit the context of our data and, in general, the job interview context. These DAs and our modified descriptions are given in Table 1. The interviewees' utterances/responses were classified according to the scheme in Table 2, which is an adaptation of the Answers DA scheme from [41], with the addition of six labels to capture all types of utterances in our dataset: xx, query, ft, fa, fp, and fe. The xx label was used to account for any responses that were ambiguous or uninterpretable, and the query label was used to account for interviewee-initiated questions. The last four of the six labels were adapted from [32]. Based on patterns observed in the mock interview text, a set of in-house guidelines was developed for annotating the questions and the participant utterances according to the DA labeling schemes in Tables 1 and 2.
Table 1. Questions Dialogue Acts.

wh: Wh-question. Example: "What is your name?"
wh-d: Wh-declarative question, where there is more than one statement in a wh-* question. Example: "You said math is your favorite subject. What kind of grades did you get in math?"
bin: Binary question, which can be answered with a "yes" or "no". Example: "Does that sound good?"
bin-d: Binary-declarative question, which can also be answered with a "yes" or "no", but has an explanation or a statement before it. Example: "Before getting into some technical questions about the position, tell me, do you have any prior work experience?"
qo: Open or general question, not specific to any context, asked to learn the opinions of the person answering. Example: "How do you feel this interview is going?"
or: Choice question, made of two parts connected by the conjunction "or". Example: "Do you have any experience with spreadsheet software such as Microsoft Excel or Google Spreadsheets?"
Since the mock interview questions were the same for all participants, two annotators first labeled the questions according to the scheme in Table 1, settling any differences in labels via discussion. The DA labels of the questions were then used as a reference when annotating each participant's (interviewee's) responses to those questions according to the scheme in Table 2. Types of questions often prompt certain responses [63]; e.g., a yes-or-no question will often prompt a yes or no answer, which can make the annotation less ambiguous and can help increase interannotator agreement. The interannotator agreement on the interviewee utterances, after settling differences, was calculated to be 86%, i.e., the percentage of utterances labeled with the same DA by both annotators out of the total number of utterances. The utterances on which the two annotators did not agree were discarded, as were those labeled xx, since these were uninterpretable and would add noise to the dataset. Finally, we were left with a dataset of 640 annotated interviewee utterances; the distribution of labels across the 13 classes is shown in Table 3. Note that the dataset has been deidentified and contains the placeholder '{personName}' where participants introduce themselves to the virtual interviewer.

Table 2. Answers Dialogue Acts.

y: Variations of "yes" answers, e.g., "yes", "yeah", "of course", "definitely is", "that's right", "I am sure".
y-d: Yes-answer with an explanation. Example: "Yes. I have experience with Excel."
n: Variations of "no" answers.
xx: Uninterpretable responses, or responses that look incomplete, such as a single word after which the user stopped talking.
sno: Non-opinionated statement. Example: "I started working on a project the other day."
dno: Response given when the person is unsure, does not know, or does not recall.
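The 86% interannotator agreement reported above is simple percentage agreement; as a minimal sketch (with toy labels invented for illustration), it can be computed as:

```python
def percent_agreement(labels_a, labels_b):
    """Fraction of utterances to which both annotators assigned the same label."""
    assert len(labels_a) == len(labels_b)
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

# Toy example: two annotators label five utterances and disagree on one.
annotator_1 = ["y", "sno", "n-d", "query", "sno"]
annotator_2 = ["y", "sno", "n", "query", "sno"]
print(percent_agreement(annotator_1, annotator_2))  # → 0.8
```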

The Data
From the distribution in Table 3, we can clearly see that the sno (non-opinionated statements) label comprises almost 60% of the dataset. This distribution is similar to that of the SwDA benchmark dataset, where the analogous class comprises over 50% of the data [40]. The ft, fa, fe, and fp classes have fewer than 10 samples each, and fa has only 1 sample. This would have been an issue during the training and testing split, where the single fa sample could end up in either the training set or the test set, leaving the other set with no samples from the fa class. Instead of random oversampling (creating copies of the minority samples), which is known to lead to overfitting (the model trains well on training data, but performs poorly on unseen or test data [64]) [65], we added new data to the ft, fp, fa, and fe classes. Data for ft, fp, and fa were gathered from the SwDA dataset, as those were general cases of thanks, greetings, and apologies. After isolating the utterances in those classes from the SwDA corpus and removing duplicates, we ended up with 38 samples of ft, 57 samples of fp, and 41 samples of fa. For the fe class, we gathered samples from a corpus developed in our previous work, which included exclamation-type words spelled as captured by the Microsoft Azure Speech-to-Text service. For example, curse words are censored by Azure Speech, with the asterisk '*' character replacing the letters (e.g., ****), and other words are spelled differently, such as "jeez", which Azure transcribes as "geez". After removing duplicates, we ended up with 51 samples of fe.
We added these to the original 640-sample dataset, resulting in an 827-sample dataset. To prepare the dataset for model training, we removed most punctuation, with the following exceptions: '?', which was found in interviewee-initiated queries; '{' and '}', which were used for the placeholder '{personName}'; '*', which was left in for the curse words censored by Azure's Speech-to-Text transcriber; and the apostrophe, to preserve forms such as " 's " and "I'm" in utterances. The utterances were then converted to lowercase, which produced our final dataset.
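A minimal sketch of that preprocessing, assuming a simple character whitelist (the authors' actual script may differ):

```python
import re

# Characters kept, per the preprocessing described above: letters, digits,
# whitespace, '?', '{', '}', '*', and the apostrophe. Everything else is dropped.
DROP = re.compile(r"[^a-z0-9\s?{}*']")

def preprocess(utterance):
    """Lowercase an utterance and strip all punctuation except ? { } * and '."""
    return DROP.sub("", utterance.lower()).strip()

print(preprocess("Hi, I'm {personName}! What are the work hours?"))
# → "hi i'm {personname} what are the work hours?"
```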

Model Training
Model training was conducted on a Microsoft Windows 11 PC with 64 GB RAM and a 16 GB NVIDIA Quadro RTX 5000 GPU. To understand the predictive abilities of our small dataset, we trained two classifiers as baseline models. Baseline models are often used to understand a dataset: they help identify the specific classes on which a model fails, which may affect later steps in the project, and whether there are sufficient data for each class [66]. For example, they can help determine whether 10-15 samples in a class are enough or whether the dataset needs revisiting. We chose a support vector machine (SVM) [67] and a random forest [68] as the baseline models, as they have previously been shown to provide acceptable results in DA classification tasks [32,69,70]. The model training process is summarized in Figure 4.
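A sketch of the baseline setup with scikit-learn; the toy one-hot features and labels here are invented, and the hyperparameter values are those reported later as the best found:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Toy one-hot utterance vectors and DA labels, invented for illustration.
X = [[1, 0, 0, 1], [0, 1, 0, 0], [1, 0, 1, 0],
     [0, 1, 1, 0], [1, 1, 0, 0], [0, 0, 1, 1]]
y = ["y", "n", "y", "n", "y", "n"]

# SVM baseline with the rbf kernel and C = 10 (the best values found later).
svm = SVC(kernel="rbf", C=10).fit(X, y)

# Random forest baseline; n_estimators = 190 matches the reported best value.
rf = RandomForestClassifier(n_estimators=190, random_state=47).fit(X, y)

print(svm.predict([[1, 0, 0, 1]])[0], rf.predict([[1, 0, 0, 1]])[0])
```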
Since our dataset was small and imbalanced, we chose to perform 5-fold cross-validation (CV) to obtain the best-performing model, as there were not enough data to apply a 3-part split with a training, a validation, and a testing set [71]. We used the scikit-learn KFold CV [72] to conduct the training and testing splits and set the shuffle parameter of the method to True to shuffle the data. For reproducibility, we chose a random seed of 47 as a parameter, to obtain the same splits with all 13 classes present in both the training and the testing datasets. The distribution of the 5 folds of data is given in Table 4. Note that the 5-fold split was conducted on the text version of the data and not the encoded version, because setting the random_state parameter of scikit-learn's KFold method changes the indices of the encoded tokens of utterances, which is not ideal, as we want to keep the order of tokens in the dataset the same for BERT. After the 5 folds (groups/splits) of data were obtained, the untokenized, unencoded text versions were saved to comma-separated values (CSV) files for later use with BERT.
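The split above can be reproduced with scikit-learn's KFold; an illustrative sketch, with 10 placeholder utterances standing in for the 827-sample dataset:

```python
from sklearn.model_selection import KFold

utterances = [f"utterance {i}" for i in range(10)]  # stand-in for the real data

# shuffle=True with a fixed random_state yields the same 5 folds on every run.
kf = KFold(n_splits=5, shuffle=True, random_state=47)
for fold, (train_idx, test_idx) in enumerate(kf.split(utterances)):
    print(f"fold {fold}: train={len(train_idx)} test={len(test_idx)}")
```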
For feature extraction for the traditional ML models, we first accumulated all the unique words in the utterances and created a vocabulary of 1305 words. To account for unknown words or words outside of the vocabulary during classification of unseen, new data, we added a token '<UNK>' to the vocabulary list (which made the length of the vocabulary 1306 words). We then tokenized each utterance using the word tokenizer from the Natural Language Toolkit (NLTK) [73], and then one-hot encoded the utterances, where each feature vector representing the utterance was the length of the vocabulary (see Appendix A.1). We used the label encoder [74] from scikit-learn ML toolkit [75] to encode the labels associated with each utterance.
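A minimal sketch of this one-hot feature extraction, using whitespace tokenization here instead of NLTK's word tokenizer for brevity:

```python
def build_vocab(utterances):
    """Collect all unique tokens, plus an '<UNK>' slot for out-of-vocabulary words."""
    vocab = sorted({tok for u in utterances for tok in u.split()})
    return vocab + ["<UNK>"]

def one_hot(utterance, vocab):
    """Bag-of-words vector: 1 if the vocabulary word occurs in the utterance."""
    index = {w: i for i, w in enumerate(vocab)}
    vec = [0] * len(vocab)
    for tok in utterance.split():
        vec[index.get(tok, index["<UNK>"])] = 1
    return vec

vocab = build_vocab(["yes i have experience", "no i do not"])
# "maybe" is out of vocabulary, so it maps to the <UNK> slot.
print(one_hot("yes maybe", vocab))
```

Note that this representation records only which vocabulary words occur, not their order, a limitation the Discussion returns to.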
Hyperparameter tuning was conducted via randomized search. Randomized search [76] is a method of hyperparameter tuning in which random combinations from a fixed grid of values (see Table 5) are tried with the data to obtain the hyperparameter values that produce the best-estimating classifier. Scikit-learn's RandomizedSearchCV allows the user to perform hyperparameter tuning while cross-validating the combinations of hyperparameters on the k-fold splits of data, which makes the process faster. After obtaining the best hyperparameters on the 5 folds of data, we ran 5-fold CV again to obtain the classification results on each fold (train and test set) of data. The classification results for each best-performing classifier and the best parameters obtained from hyperparameter tuning are discussed in Section 4. Figure 4 summarizes the aforementioned preprocessing, hyperparameter tuning, and training process.

We then moved on to fine-tuning the pretrained BERT model. The data for BERT were utterances that were preprocessed (but not encoded), as discussed in Section 3.4, and the labels were encoded by the aforementioned label encoder. The same 5 folds of data that were used for the traditional ML classifiers were also used to cross-validate the BERT model, to keep the datasets consistent. The BERT pretrained model and tokenizer were retrieved from the Hugging Face Transformers library [78]; specifically, we chose the bert-base-uncased version on Hugging Face, which consists of 12 transformer encoders stacked together, with a hidden size of 768, 12 attention heads, and 110 million parameters, and which ignores case in words (e.g., "Camel" and "camel" are the same). For feature extraction, we used the Fast Tokenizer [79] from Hugging Face. BERT models have variations for several tasks, such as token classification, text classification, language modeling, and question answering.
The BERT model for sequence classification has an additional layer for fine-tuning BERT on text classification tasks. All 12 layers of BERT were left unfrozen; hence, all 12 layers were fine-tuned. First, we conducted an in-house randomized search CV over a fixed grid of hyperparameters (see the third column of Table 5) on the 5 folds of data. The weight decay was kept constant at 0.01, and the default AdamW [80] optimizer was used. On training, we obtained CV scores for each hyperparameter combination on each fold of testing data; the best parameter combination was the one that produced the highest average f1-score over the 5 folds of data. In Section 4, we report the average accuracies and f1-scores across all 5 folds of data, and highlight the best fine-tuned classifier, as determined by 5-fold CV, with the best parameters found via randomized search CV. In addition, we also report which split of data gave the best classifier. The final predictive BERT classifier was trained on the entire dataset with the best hyperparameters from the randomized search CV.

Results

Table 6 presents the 5-fold CV results of all three models on the 13-class dataset. As mentioned above, randomized search CV was used to determine the best hyperparameter values on which to train each model. For SVM, these were C = 10 and the rbf kernel [69]. For RF, max_depth = 70, max_features = 'sqrt', min_samples_leaf = 1, min_samples_split = 10, n_estimators = 190, and bootstrap = False. Finally, the best hyperparameters for BERT on the 13 classes were batch_size = 8, learning_rate = 5 × 10⁻⁵, and number of epochs = 40. Since our dataset was imbalanced and we had multiple classes, we considered the f1-score, rather than accuracy, as the determining factor for identifying the best-performing models. In Table 6, we observe that training on Split 2 (see Table 4) produced the best models across all three classifiers.
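Because the f1-score drove model selection on this imbalanced dataset, a minimal sketch of a macro-averaged f1-score may be useful (one common averaging choice for imbalanced multiclass data; scikit-learn's classification_report computes these averages directly):

```python
def macro_f1(y_true, y_pred):
    """Average the per-class f1-scores, weighting every class equally."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)

# A majority-class predictor can score well on accuracy yet poorly on macro f1.
y_true = ["sno", "sno", "sno", "y", "n"]
y_pred = ["sno", "sno", "sno", "sno", "sno"]
print(macro_f1(y_true, y_pred))  # → 0.25, despite 0.6 accuracy
```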
Table 7 presents the classification report of the best-performing classifiers on the Split 2 dataset. The SVM classifier for 13 classes had an accuracy of 0.85 and an f1-score of 0.72, with regularization C = 10 and the rbf kernel as the best parameters. The RF classifier performed better with respect to the f1-score (0.74), but had a slightly lower accuracy of 0.84. The fine-tuned BERT model, on the other hand, outperformed both baseline classifiers, with an accuracy of 0.92 and an f1-score of 0.87. The confusion matrices of the classifiers in Figure 5 provide more insight into the classification report and the scores we observe in Table 7. The color gradient to the right of each matrix corresponds to the number of samples, i.e., a grid cell with ≥70 samples is filled in yellow.
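The confusion matrices in Figure 5 are tallied in the standard way: cell (i, j) counts samples whose true class is i and predicted class is j. A minimal sketch with toy labels (not our data):

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """matrix[i][j] counts samples of true class i predicted as class j;
    the diagonal therefore holds the correct classifications."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true_label, pred_label in zip(y_true, y_pred):
        matrix[true_label][pred_label] += 1
    return matrix

# toy example with three classes
confusion_matrix([0, 0, 1, 2, 2], [0, 1, 1, 2, 0], 3)
# [[1, 1, 0], [0, 1, 0], [1, 0, 1]]
```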

Discussion
From the results in Table 7, we observe that the SVM baseline classifier had an acceptable overall accuracy and f1-score; however, it failed to classify the classes n-d (no, with explanation) and y-d (yes, with explanation), and it performed poorly on so (opinionated statements) utterances. On inspecting the data, an utterance in the n-d class looks like "No but I can try", while an utterance in the n class looks like "No ma'am". The features for the SVM model did not take into account the order of words, since one-hot encoding was performed by the position of the word in the vocabulary rather than by the order of occurrence of the words in the utterance (see Appendix A.1 for an example). The presence of "No" in both utterances may be why the utterances in n-d were misclassified as the n label. However, on viewing the confusion matrix for SVM in Figure 5, we see that the n-d sample was misclassified as sno. This misclassification was also observed with the y-d utterances being misclassified as sno class utterances, possibly due to the presence of words common to both classes and/or model bias towards the majority sno class; however, there are not enough samples of the minority classes in the test set for a thorough analysis. The RF model performs better for y-d, where only 1 of the 2 samples is misclassified as sno (see Figure 5); however, n-d is misclassified completely. The performance on so improved in the RF classification report. The BERT model's features, on the other hand, are integer-encoded utterances, in which the order of the words in the utterance is preserved and each feature vector is padded to a maximum fixed length (see Appendix A.2 for an example). Hence, it performed well on the two classes that the SVM model missed, although its accuracy on so remained the same.
Figure 6 shows two graphs in which we plot the average accuracy and average f1-score of the three models (see Table 6), with standard error bars, to give a visual representation of the variation in the 5-fold CV results across the three classifiers. Independent one-tailed t-tests were conducted between the results of SVM and RF, SVM and fine-tuned BERT, and RF and fine-tuned BERT. The p-values are visualized on the dotted lines between the pairs. The results show no statistically significant difference between SVM and RF (t(4) = 1.25, p = 0.123 for mean accuracy; t(4) = 1.19, p = 0.135 for mean f1-score). However, the differences between SVM and fine-tuned BERT (t(4) = −2.11, p = 0.034 for mean accuracy; t(4) = −3.46, p = 0.004 for mean f1-score) and between RF and fine-tuned BERT (t(4) = −3.56, p = 0.004 for mean accuracy; t(4) = −3.93, p = 0.002 for mean f1-score) were statistically significant: the fine-tuned BERT performed significantly better than both SVM and RF. In Section 1, we described how our project was motivated by Chakravarty et al. [41], who performed transfer learning and trained three classifiers, including a pretrained BERT, for the task of question answering in an interview context, where the input was the question-answer pair. Their DA scheme was adapted from the early work by Jurafsky et al. [81] and led to the development of a rich dataset of interview questions and answers that they used to fine-tune a pretrained BERT to an f1-score of 0.84. Our DA scheme builds on their work and that of Jurafsky et al. [32,40], and achieves similar results, with an f1-score of 0.87. Our score is likely slightly higher due to the difference in the number of classes, and because we were performing single-sentence text classification, which has no dependence on the question.
Their fine-tuned BERT model had a high f1-score compared to the other two models they trained; however, it failed to classify two labels, including one that is similar to our query class. Coincidentally, their classifier produced the same f1-score for the so label (0.67). Recent research by Wu et al. [58] introduces two further pretrained BERT models for task-oriented dialogue (TOD-BERT), built on the bert-base-uncased model. After pretraining, they fine-tuned their TOD-BERT model for downstream tasks such as DA classification. Although our model had not been further pretrained on other datasets, our micro-f1-score of 0.92 (92.2%) is in line with their DA classification results (91.7%, 93.8%, and 99.5%) achieved on three task-oriented datasets spanning several domains. These comparisons with previous research further demonstrate the potential of our model to perform well on job interview-related utterances. The results also suggest that further pretraining of our model on additional dialogue data would likely improve performance. As for our baseline models, they perform significantly better than those reported in [47], which also aimed to classify DAs (in the context of online chat forums) using traditional ML methods with 10-fold CV.

Conclusions
Autistic individuals face disproportionately poor employment outcomes. Though previous job interview training systems have shown promise, they lack an automated response understanding/labeling mechanism that can be used by the system to automatically classify the types of responses received for a question, regardless of the domain. In this article, we discuss the contributions of our project. First, we present a DA classification scheme in the context of job interviews and provide original data comprising interviewee utterances collected via a virtual job interview training environment. We also share the interview script for use in future data collection studies, which we hope will pave the way for more available data for virtual reality-based job interview training systems for all individuals, including autistic individuals. Second, we present a classifier based on the pretrained BERT, fine-tuned via transfer learning on our original data, that performs with acceptable accuracy across each class, ready to be integrated into a job interview platform for automatic classification of interviewee responses. Automated classification of interviewee utterances in existing job interview training environments can help create adaptive environments where the virtual interviewer (the conversational agent) can understand the basic intent of an interviewee's utterance, which can then be used to create individualized and naturalistic training experiences. The DA detected may also provide insights into the performance of the autistic interviewee to facilitate individualized feedback for improvement. Further, a future developer may experiment with different combinations of classes on which to train their models or further train our fine-tuned BERT, to see what fits best with their application (e.g., another job interview system may only require the utterances from sno, query, and yes/no type of answers). 
Appendix A.3 describes the process on how to further train our fine-tuned BERT model for classification.
Despite the above accomplishments, our work has a few limitations. The original dataset, even with the addition of more data, is still quite small, and the results reported rest on a very small number of samples as support for each class. A larger sample of participants and more training data would be ideal for comparing the efficacy of our ML models. Our ability to carry this out is limited by two key factors: (1) we are sampling a specialized population, namely working-age, employment-seeking autistic adults; (2) CIRVR currently requires participants to physically come into our lab, where all the multimodal data (responses to questions, eye gaze tracking data, stress detection, etc.) can be collected. We are, however, working to partner with other organizations who work with autistic adults preparing for employment to collect more data by deploying CIRVR at their sites or having these partner organizations help identify a set of participants willing and able to come to our lab. We are also in the early stages of developing a version of CIRVR that can be used on the user's personal device in any location they prefer, which would help in acquiring more data in the future.
As for improvements, to further address the limited training data, we will explore data augmentation methods that have been accepted for use with autistic data [82,83]. It would be useful in future work to examine how a pretrained BERT that has not been fine-tuned on our data performs on the entire dataset. Furthermore, in the hope of improving the individual label accuracy of the minority classes, we will explore variations of BERT, such as those in current works [84][85][86], and explore fine-tuning of TOD-BERT [58] on our data to observe differences in accuracy, and in turn, determine the best-performing method on our data. Future research also needs to evaluate the model's real-world performance by integrating the final BERT classifier trained on the entire dataset, where the predicted labels will be used by the conversational agent to direct the flow of the interview. The interview script and annotation guidelines in the Supplementary Materials can aid in replication of our study. The final BERT model has been provided to allow for integration into existing work or for further fine-tuning (please contact the corresponding author for all data, the model, and the code).

Informed Consent Statement:
Informed consent was obtained from all subjects involved in the study. Written informed consent has been obtained from the patient(s) to publish this paper.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript:

Appendix A.1
Categorical features such as text in sentences are represented as one-hot encoded vectors. Suppose we have the following utterances spoken by the interviewee or participant:

data = ["I don't have experience with Microsoft Word", "I don't know"]

The vocabulary would then consist of the following words when the above sentences are tokenized and converted to lowercase:

vocab = ["i", "do", "n't", "have", "experience", "with", "microsoft", "word", "know"]

The tokenized data will look like the following:

tokenized_data = [
    ["i", "do", "n't", "have", "experience", "with", "microsoft", "word"],
    ["i", "do", "n't", "know"]
]

The encoded data comprise each data sample represented as a list of zeros with the length of the vocabulary. If a token exists in the vocabulary, we mark its presence with a 1 at the position of the word in the vocabulary list (like an "on-and-off" binary switch from 0 to 1). In this example, the vocabulary has nine words, so each data sample starts as [0, 0, 0, 0, 0, 0, 0, 0, 0]. The first utterance becomes [1, 1, 1, 1, 1, 1, 1, 1, 0], and the second is represented as [1, 1, 1, 0, 0, 0, 0, 0, 1], where the final 1 marks the presence of the word "know" in the vocabulary. The final encoded data look like the following:

encoded_data = [[1, 1, 1, 1, 1, 1, 1, 1, 0], [1, 1, 1, 0, 0, 0, 0, 0, 1]]

Since the encoding is carried out by the position of the word in the vocabulary, the order of words in the utterance is lost.
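The encoding above can be reproduced with a short, self-contained sketch. Note that, for simplicity, this sketch uses a naive whitespace tokenizer rather than the NLTK-style tokenizer in the example, so contractions such as "don't" are not split:

```python
def one_hot_encode(utterances):
    """Binary bag-of-words: a 1 marks the presence of a vocabulary word,
    regardless of where it occurs in the utterance."""
    tokenized = [u.lower().split() for u in utterances]
    # build the vocabulary in order of first occurrence
    vocab = []
    for tokens in tokenized:
        for token in tokens:
            if token not in vocab:
                vocab.append(token)
    # one vector per utterance, one slot per vocabulary word
    encoded = [[1 if word in tokens else 0 for word in vocab]
               for tokens in tokenized]
    return vocab, encoded

vocab, encoded = one_hot_encode(
    ["I don't have experience with Microsoft Word", "I don't know"])
# word order within each utterance is lost: only presence/absence survives
```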

Appendix A.2
Building on the example from Appendix A.1, integer encoding by the BERT tokenizer maintains the order of words/tokens in an utterance. We define a maximum limit on the length of each feature vector representing each utterance. For this example, let us say that the maximum length of the feature vector is 15 tokens. Since the lengths of the utterances vary, the tokenizer fills the empty space in the feature vector with padding, which represents "no presence of tokens" in those spaces. Each of the words in the vocabulary list above is assigned a unique integer identifier. Here, we use the position of the word in the vocabulary list to assign the unique identifier for each word, which can be imagined as a hash table. We assign the integer 0 to the token <PAD>, which represents padding:

vocab_dictionary = {"<PAD>": 0, "i": 1, "do": 2, "n't": 3, "have": 4, "experience": 5, "with": 6, "microsoft": 7, "word": 8, "know": 9}

Following the same concept, the tokenizer encodes the data according to the dictionary above, where each feature vector has a maximum length of 15 tokens. Although we did not need to encode the data ourselves, we have presented the underlying concept of how the BERT model accepts feature vectors.
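Using the dictionary above, the padded integer encoding can be sketched as follows. This is a simplified stand-in for what the BERT tokenizer does internally; the real tokenizer also adds special tokens such as [CLS] and [SEP] and uses subword units:

```python
vocab = {"<PAD>": 0, "i": 1, "do": 2, "n't": 3, "have": 4,
         "experience": 5, "with": 6, "microsoft": 7, "word": 8, "know": 9}

def integer_encode(tokens, vocab, max_length=15):
    """Map tokens to their vocabulary ids, preserving their order,
    and pad with the <PAD> id up to max_length."""
    ids = [vocab[token] for token in tokens]
    return ids + [vocab["<PAD>"]] * (max_length - len(ids))

integer_encode(["i", "do", "n't", "know"], vocab)
# [1, 2, 3, 9, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

Unlike the one-hot vectors of Appendix A.1, swapping two words in the utterance changes the encoded vector, which is how word order is preserved.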

Appendix A.3. Training a Fine-Tuned BERT
There are several libraries and methods for fine-tuning a pretrained BERT model. We use functions from the Hugging Face transformers library to demonstrate the process of training a fine-tuned BERT model. Just as we loaded the bert-base-uncased model for fine-tuning, a user can load our fine-tuned model for further fine-tuning on classification tasks.
The fine-tuned BERT model, when saved, has the following files in a single folder, which we named "fine-tuned-BERT":
• config.json has the BERT-base-uncased architecture and configuration of the model, saved in JavaScript Object Notation (JSON).
Below, we describe the process of loading the model for further training/fine-tuning on new data. The programs were written in Python, and the models are stored as PyTorch models that accept PyTorch tensors:
1.
Data Preprocessing The first step is to prepare the data. Based on the task, a user may choose to preprocess their data, as we describe in Section 3.4. For further fine-tuning of BERT, a user may have fewer or more classes than the model was previously trained to predict. The utterances need to be in text (not tokens) and stored in the first column of a CSV file with the header "text". The labels should be encoded as unique integers. For example, if your labels are 'positive', 'negative', and 'neutral', you can encode them as {'positive': 0, 'negative': 1, 'neutral': 2}. The encoded label for each corresponding utterance should be in the second column of the CSV file, with the header "label". The training and testing data are to be separated into a training.csv file and a testing.csv file, and the columns in both files should have the same headers "text" and "label" (see Figure A1). Note that the training data can be further split into a training and validation set for evaluation of models during training, which is especially useful for early stopping. In the program, the CSV files are loaded using the load_dataset() function of the Python datasets library:

train_data = load_dataset('csv', data_files = {'train': 'training.csv'})

For tokenization and feature extraction, we use the BertTokenizerFast class from the transformers library, which has a from_pretrained() function that loads the saved tokenizer in fine-tuned-BERT. The function takes as its first argument the model name, or the name of the folder where the fine-tuned BERT model files are stored, i.e., fine-tuned-BERT. The second parameter is do_lower_case = True, which internally converts the text to lowercase if it is not already:

tokenizer = BertTokenizerFast.from_pretrained('fine-tuned-BERT', do_lower_case = True)

Once loaded, the dataset is tokenized one sample at a time, where the tokenizer accepts a few arguments: padding (True/False), truncation (True/False), and max_length (an integer). For example, to make all samples a fixed length, we can set padding to True; if the sentences are very long (greater than 512 tokens), we can set truncation to True; or, if we want each sequence to be of a specific length, e.g., 32 tokens, we can set max_length to that value. The preprocess function below is passed each sample in train_data and returns the tokenized data:

def preprocess(samples):
    return tokenizer(samples['text'], padding = True)

tokenized_train_data = train_data.map(preprocess)

Once the training and testing sets have been tokenized, we move on to the model.

2.
Loading the Model Here, we use the transformers library's BertForSequenceClassification class to load the model via its from_pretrained() function, as shown below:

model = BertForSequenceClassification.from_pretrained("fine-tuned-BERT", num_labels = 3)

We can also specify where we want to store the model: on the GPU or on the CPU. To store the model on the GPU, we use model.to("cuda"), and to store it on the CPU, we use model.to("cpu").

3.
Defining Training Arguments There are several hyperparameters that can be initialized before training. Here, we use the TrainingArguments class of the transformers library to set them. Specifically, we focus on the number of training epochs (training cycles), the learning rate, and the batch size, which is constrained by the amount of free memory available on the system. Here, we had 16 GB of GPU memory, which must hold the model and the training batch at any given time; hence, the maximum batch size we could use was 32. Other parameters of TrainingArguments can be found at [87]. Just as we conducted hyperparameter tuning using an in-house randomized search on 5 folds of data, the user can follow the same method, or experiment with different values and use early stopping to find the best model. More hyperparameter-tuning options can be found in [88]. The training arguments, for example, are initialized as follows:

training_args = TrainingArguments(
    num_train_epochs = 40,
    learning_rate = 5e-5,
    per_device_train_batch_size = 8,
    weight_decay = 0.01,
    output_dir = "Models/fine-tuned-BERT2")

4.
Model Training We use the transformers Trainer class to train the model. It takes a few arguments. The first is the model loaded in Step 2; then, args, which are the training_args from Step 3. The tokenized_train_data from Step 1 is passed to the train_dataset parameter. The tokenizer parameter is set to the tokenizer that we loaded from the pretrained model. Another optional parameter is the data collator. Data collators [89] are objects that form batches of data from a list of data inputs. Here, we used DataCollatorWithPadding and passed the tokenizer as an argument:

data_collator = DataCollatorWithPadding(tokenizer = tokenizer)
trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = tokenized_train_data,
    tokenizer = tokenizer,
    data_collator = data_collator)

Next, we call the Trainer's train() function:

trainer.train()

After training, the model, along with the learned weights, can be saved using the Trainer's save_model() function.
5.
Inference For predictions, we create an inference pipeline. For this task, we can switch to the CPU if not enough GPU memory is available. The text input, for example, "Hi there!", is first preprocessed by converting it to lowercase and removing any punctuation, as described in Section 3.4. The input is converted into tokens by the tokenizer with the same arguments as those used to tokenize the training set (e.g., padding = True). This input is passed to the model, whose outputs we pass through a softmax, which gives a probability for each of the three classes, summing to 1. The argmax of the probabilities is the predicted class label.
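The final softmax-and-argmax step can be sketched in plain Python; the logits below are hypothetical, whereas in practice they come from the model's output layer:

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_label(logits):
    """Return the index of the highest-probability class (the argmax)."""
    probs = softmax(logits)
    return max(range(len(probs)), key=lambda i: probs[i])

predict_label([0.3, 2.1, -1.4])  # class 1 has the highest probability
```

Since softmax is monotonic, taking the argmax of the logits directly would give the same label; the probabilities are still useful, e.g., as a confidence estimate for the conversational agent.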