kFolden: k-Fold Ensemble for Out-Of-Distribution Detection

Out-of-Distribution (OOD) detection is an important problem in natural language processing (NLP). In this work, we propose a simple yet effective framework, kFolden, which mimics the behavior of OOD detection during training without the use of any external data. For a task with k training labels, kFolden induces k sub-models, each of which is trained on a subset with k-1 categories, with the remaining category masked unknown to the sub-model. By exposing each sub-model to an unknown label during training, the sub-model is encouraged to distribute probability evenly over the seen k-1 labels when given examples of the unknown label, enabling the framework to handle in-distribution classification and out-of-distribution detection simultaneously in a natural way via OOD simulation. Taking text classification as an archetype, we develop benchmarks for OOD detection using existing text classification datasets. By conducting comprehensive comparisons and analyses on the developed benchmarks, we demonstrate the superiority of kFolden over current methods in terms of improving OOD detection performance while maintaining in-domain classification accuracy.


Introduction
Recent progress in deep neural networks has drastically improved accuracy on numerous NLP tasks (Sun et al., 2019; Raffel et al., 2019; Chai et al., 2020; He et al., 2020), but detecting out-of-distribution (OOD) examples among in-domain (ID) examples remains a challenge for existing state-of-the-art deep NLP models. The ability to identify OOD examples is critical for building reliable and trustworthy NLP systems for, say, text classification (Hendrycks and Gimpel, 2016; Mukherjee and Awadallah, 2020), question answering (Kamath et al., 2020) and neural machine translation (Kumar and Sarawagi, 2019). Existing works studying OOD detection in NLP often rely on external data (Hendrycks et al., 2018) to diversify model predictions and achieve better generality in OOD detection. The reliance on external data not only brings an additional burden for data collection, but also raises the thorny issue of deciding which subset of external data to use: there is a massive amount of external data, and using different subsets leads to different final results. Therefore, developing OOD detection systems without external data is an important step towards building reliable NLP systems.
In this work, we propose a novel, simple yet effective framework, kFolden, short for k-Fold ensemble, to address OOD detection for NLP without the use of any external data. We accomplish this goal by simulating the process of detecting OOD examples during training. Concretely, for a standard NLP task with k labels for both training and test, we first obtain k separate sub-models, each of which is trained on a different set of k − 1 labels with the remaining label masked unknown to the model. We train each sub-model by jointly optimizing the cross-entropy loss for the visible k − 1 labels and the KL divergence loss between the predicted distribution and the uniform distribution for the left-out label. At test time, we simply average the probability distributions produced by these k sub-models and treat the result as the final probability estimate for a given input. Intuitively, if the input is an ID example, the final probability distribution will place most of its mass on one of the k seen labels; if the input is an OOD example, we expect the final probability distribution to get close to the uniform distribution, since each sub-model has been trained to flatten its probability distribution when encountering its unseen label.
This training paradigm does not rely on any external data, and by mimicking the behavior of distinguishing unseen labels from seen ones, i.e., simulating the process of OOD detection during training via the KL divergence loss, the framework naturally detects OOD examples and performs better than other widely used strong OOD detection methods. Moreover, kFolden is complementary to existing post-hoc OOD detection methods, and combining the two yields the largest performance gains.
To facilitate OOD detection research in NLP, we also construct benchmarks on top of four widely used text classification datasets: 20NewsGroups, Reuters, AG News and Yahoo!Answers. The resulting benchmark consists of 7 datasets with different levels of difficulty, targeting two types of OOD examples: semantic shift and non-semantic shift, which differ in whether the shift involves the inclusion of new semantic categories. The proposed benchmarks help comprehensively examine OOD detection methods, and we hope they can serve as a convenient and general tool for developing more robust and effective OOD detection models.
To summarize, the contributions of this work are:
• We propose a simple yet effective framework, kFolden, which simulates the process of OOD detection during training without using any external data.
• We construct benchmarks for OOD detection in text classification to facilitate future research.
• We conduct comprehensive comparisons and analyses between existing methods and the proposed kFolden on the benchmarks, and show that kFolden achieves performance boosts in OOD detection while maintaining ID classification accuracy.

Related Work
Out-Of-Distribution Detection Detecting OOD examples with deep neural models has gained substantial traction over recent years. Hendrycks and Gimpel (2016) proposed a baseline for detecting misclassified and OOD examples by thresholding candidates based on the predicted softmax class probability. Lee et al. (2018) trained a classifier concurrently with a generator under the GAN framework (Goodfellow et al., 2014): the generator produces examples at the in-domain boundary, and the classifier is forced to give lower confidence when predicting the classes of those examples. Hendrycks et al. (2018) leveraged real datasets instead of generated examples, enabling the classifier to better generalize and detect anomalies. Liang et al. (2017) observed that temperature scaling and small perturbations widen the gap between ID and OOD examples, and proposed ODIN, a technique that makes OOD instances distinguishable by pulling apart the softmax scores of ID and OOD examples. Kamath et al. (2020) proposed to leverage the confidence estimate of a QA model to decide whether a question should be answered under domain shift in order to maintain a moderate accuracy. Hendrycks et al. (2019, 2020) showed that pretraining improves model robustness in terms of uncertainty estimation and OOD detection. Measuring model confidence has also exhibited power in detecting OOD examples (Lee et al., 2017a,b; DeVries and Taylor, 2018; Papadopoulos et al., 2021). This work differs from Hendrycks et al. (2020) mainly in that (1) they used a simple MaxProb-based method (Hendrycks and Gimpel, 2016) to estimate uncertainty while we propose a novel framework, kFolden, to improve OOD detection; and (2) they focused on comparing different NLP models on OOD generalization and shed light on the importance of pretraining for OOD robustness, whereas we highlight the merits of OOD simulation during training without the use of any external data, and construct a dedicated benchmark for OOD detection in text classification.
Meta Learning in NLP Meta learning (Thrun and Pratt, 2012; Andrychowicz et al., 2016; Nichol et al., 2018; Finn et al., 2017) tackles the problem of learning in a domain with scarce data when large quantities of data are accessible in another, related domain.
Meta learning has been applied to a wide range of NLP tasks, including semantic parsing (Huang et al., 2018; Guo et al., 2019; Sun et al., 2020), dialog generation (Song et al., 2019; Huang et al., 2020), text classification (Wu et al., 2019; Sun et al., 2020; Bansal et al., 2020; Lin et al., 2021) and machine translation (Gu et al., 2018). Our work is distantly related to meta learning in the way we train kFolden, by simulating the behavior of predicting the unseen label during training, but we do not aim at strong few-shot learning performance, which is the main goal of meta learning.
Task Definition

Let D_train = {x, y_train} and D_test = {x, y_test} denote the two sets respectively used for model training and test, where we assume the label space for training consists of k distinct labels Y_train = {1, ..., k} and the possible labels for test are the ones in Y_train plus t additional labels, i.e., Y_test = {1, ..., k, k+1, ..., k+t}. Assume that a neural network f is trained on D_train and tested on D_test.

Sub-Model Training

More specifically, assume we are training the i-th sub-model f_i (1 ≤ i ≤ k); the visible label set for training f_i is thus Y_train \ {i}. All training examples in D_train with label i now become unknown to f_i. For the visible k − 1 labels, f_i should still achieve high accuracy; but for the masked label i, f_i needs to give non-deterministic estimates when the input instance x has the ground-truth label i, because label i is masked and not found in the training set. This implies that the model cannot determine which label x belongs to and may attribute it to an OOD example. These two considerations can be satisfied by jointly optimizing the following objective:

\mathcal{L}_i = \mathcal{L}_{CE} + \gamma \, \mathcal{L}_{KL} \quad (1)
\mathcal{L}_{CE} = \mathbb{E}_{(x, y_{train}) \in D_{train},\, y_{train} \neq i} \big[ \mathrm{CrossEntropy}(y_{train}, f_i(x)) \big] \quad (2)
\mathcal{L}_{KL} = \mathbb{E}_{(x, y_{train}) \in D_{train},\, y_{train} = i} \big[ \mathrm{KL}(u \,\|\, f_i(x)) \big] \quad (3)

where γ is a hyper-parameter ranging over [0, 1] and tuned on the validation set, and u is the uniform distribution over the k − 1 visible labels. Eq. (2) is a standard cross-entropy loss that requires the model to make accurate predictions on the visible labels, while Eq. (3) draws on the KL divergence to encourage the model to produce a probability distribution close to the uniform distribution u over the k − 1 labels for the masked label. By jointly training on both loss functions, f_i is able to detect the OOD label i while preserving performance on the other k − 1 labels. We proceed with this process for all k sub-models, each with a different masked label. f_i(x) takes x as input and outputs a probability distribution of dimensionality k − 1. f_i can be implemented with any model backbone such as an LSTM (Hochreiter and Schmidhuber, 1997), CNN (Kim, 2014), Transformer (Vaswani et al., 2017) or BERT (Devlin et al., 2018).
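To make the joint objective concrete, the following is a minimal sketch of the per-batch sub-model loss, assuming PyTorch; the function name, the default γ = 0.5, and the convention of encoding the masked label as -1 (with visible labels remapped to 0..k-2) are our own illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

def submodel_loss(logits, labels, gamma=0.5):
    """Cross entropy on visible labels + KL(uniform || prediction) on the masked label.

    logits: (batch, k-1) outputs of sub-model f_i (the masked label i is never predicted).
    labels: (batch,) class ids remapped to [0, k-2] for visible labels, -1 for the masked label.
    """
    visible = labels >= 0
    masked = ~visible

    ce_loss = logits.new_zeros(())
    if visible.any():
        # Eq. (2): standard cross entropy on the visible k-1 labels
        ce_loss = F.cross_entropy(logits[visible], labels[visible])

    kl_loss = logits.new_zeros(())
    if masked.any():
        # Eq. (3): KL(u || f_i(x)) pushes the prediction towards the uniform distribution
        log_probs = F.log_softmax(logits[masked], dim=-1)
        uniform = torch.full_like(log_probs, 1.0 / logits.size(-1))
        kl_loss = F.kl_div(log_probs, uniform, reduction="batchmean")

    return ce_loss + gamma * kl_loss
```

Because each sub-model only produces k − 1 logits, the ground-truth labels have to be remapped to the visible label set before the cross-entropy term is computed.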

Sub-Model Ensemble
A single sub-model f_i will inevitably perform poorly at test time on ID examples with label i. This is because the label i masked during training never has a chance to be predicted by f_i, so all test examples with label i in D_test will receive low probability, leading to reduced overall accuracy.
To tackle this issue, we adopt the idea of model ensemble: given an input x, we first obtain k probability distributions {f_1(x), f_2(x), ..., f_k(x)} respectively produced by the k sub-models. In order to coordinate the label dimensions across sub-models, we manually pad a zero dimension into each probability distribution at the corresponding masked position, yielding k-dimensional distributions. For example, if k = 4 and the output from f_2 is f_2(x) = (p_1, p_3, p_4), the padded output distribution would be \tilde{f}_2(x) = (p_1, 0, p_3, p_4). Next, we average all the k padded probability distributions and take the result as the final probability estimate:

f(x) = \frac{1}{k} \sum_{i=1}^{k} \tilde{f}_i(x) \quad (4)

f(x) is still a valid probability distribution and naturally remedies the shortcoming of a single sub-model: if x is an ID example, i.e., its ground-truth label y belongs to Y_train, f(x) will put most of the probability mass on one of the k labels; if x is an OOD example, f(x) will get close to the uniform distribution because every sub-model comprising f(x) spreads its probability mass over its visible labels. After training, f(x) can be used for ID evaluation and OOD evaluation simultaneously.
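The padding-and-averaging step can be written in a few lines; the sketch below assumes NumPy arrays and a hypothetical helper name, and is only meant to illustrate the ensemble described above.

```python
import numpy as np

def ensemble_predict(sub_probs, k):
    """sub_probs[i] is the (k-1)-dim probability vector from sub-model f_i (label i masked)."""
    assert len(sub_probs) == k
    padded = []
    for i, p in enumerate(sub_probs):
        padded.append(np.insert(p, i, 0.0))  # restore a zero at the masked label position i
    return np.mean(padded, axis=0)           # f(x): averaged distribution, still sums to 1
```

Since each padded vector sums to 1, their average is itself a valid distribution over the k labels, which is what allows the same output to be used for both ID classification and OOD scoring.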

Benchmark Construction
Out-of-distribution data can be conceptually divided into two categories: non-semantic shift (NSS) and semantic shift (SS) (Hsu et al., 2020). They differ in whether the shift is related to the inclusion of new semantic categories: the training and OOD test examples in an NSS dataset come from different sub-categories of the same broader category. For example, the training and OOD test sets in an NSS dataset are both from the "car" category, but examples in the training set are about "real car", e.g., "that's when they took out the fuel tank and poured it into a jug", while all OOD test examples are about "toy car", e.g., "Raleigh 2-year-old fills up toy car with 'gas' amidst shortage". We construct benchmarks on multi-class topic classification datasets, since topic classification has less vocabulary overlap between ID and OOD data. We use data from 20NewsGroups (Joachims, 1996), Reuters-21578, AG News (Del Corso et al., 2005) and Yahoo!Answers (Zhang et al., 2015). More details of the original datasets can be found in Appendix A. The statistics of the benchmark are presented in Table 1.
We construct NSS benchmarks as follows:
20Newsgroups-6S This dataset is a modified version of 20Newsgroups. The original 20Newsgroups dataset has 20 newsgroups, and each newsgroup (e.g., "comp.sys.ibm.pc.hardware") has a root subject topic (e.g., "comp"). We divide articles by their root subjects and obtain 6 groups ("comp", "rec", "sci", "religion", "politics" and "misc"). In this way, the training and test data share the same root topic labels but have different fine-grained topic labels: the training and ID test data are from 11 sub-classes in 20News, while the OOD test data are from the remaining 9 sub-classes.
Yahoo-AGNews-five This dataset contains a subset of Yahoo!Answers and a subset of the AG Corpus. The original Yahoo!Answers dataset has 10 classes, and we use 5 of them ("Health", "Science & Mathematics", "Sports", "Entertainment & Music", "Business & Finance") for the training and ID test data.
We construct SS benchmarks as follows:
Reuters-mK-nL This dataset is a modified version of Reuters. We first follow previous works (Yang and Liu, 1999; Joachims, 1998) in using the ModApte split to remove documents belonging to multiple classes, and then consider only the 10 classes ("Acquisitions", "Corn", "Crude", "Earn", "Grain", "Interest", "Money-fx", "Ship", "Trade" and "Wheat") with the highest numbers of training examples. The resulting dataset is called Reuters-ModApte. We train the model on a subset of Reuters-ModApte and test on the remaining subset. Specifically, we train on articles from m topics and test the model on the other n = 10 − m topics. In this paper, we use five settings: (m, n) = (9, 1)/(6, 4)/(5, 5)/(3, 7)/(2, 8).
AGNews-FL The dataset is adapted from AGNews, with additional articles taken from the AG Corpus. In this setting, the training and ID test data are from the 4 classes ("World", "Sports", "Business", "Sci/Tech") in AGNews, and the OOD test data are from another 4 classes ("U.S.", "Europe", "Italia", "Software and Development") in the AG Corpus.
AGNews-FM This dataset is adapted from AGNews, with additional articles taken from the AG Corpus. In this setting, the training and ID data are from the 4 classes ("World", "Sports", "Business", "Sci/Tech") in AGNews, and the OOD test data are from another 4 classes ("Entertainment", "Health", "Top Stories", "Music Feeds") in the AG Corpus. This dataset is easier than AGNews-FL because the OOD labels are more distinct from the ID labels in terms of label semantics.
Yahoo!Answers-FM This dataset is modified from the Yahoo!Answers dataset. We use five topics ("Health", "Science & Mathematics", "Sports", "Entertainment & Music", "Business & Finance") for the training and ID test data, and the other five unseen topics ("Society & Culture", "Education & Reference", "Computers & Internet", "Family & Relationships", "Politics & Government") for the OOD test data.

Experimental Setups
We use both contextual and non-contextual model backbones for experiments. We use CNN and BiLSTM as the non-contextual backbones. We follow the CNN-non-static model (Kim, 2014) as the CNN implementation, and the BiLSTM model has a single layer. Both CNN and BiLSTM use 300d word vectors pretrained on Wikipedia 2014 with GloVe (Pennington et al., 2014). The average of the hidden states of all words is used as the feature for classification. We train the non-contextual models with a batch size of 32 and an initial learning rate of 0.001 using Adam (Kingma and Ba, 2014). For contextual models, we use the officially pretrained BERT-Base (uncased) (Devlin et al., 2018) and RoBERTa-Base (Liu et al., 2019) for comparison. We use AdamW to optimize all contextual models, with 0.01 weight decay and 1000 warmup steps. The learning rate is chosen from {1e−5, 2e−5, 3e−5}.
We use batch sizes in the range of {16, 24, 32} for all experiments, and a dropout of 0.2 for the BERT and RoBERTa experiments.
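For reference, the hyperparameter ranges above can be consolidated into a single configuration object; the dictionary layout and key names below are purely illustrative, and only the values come from the text.

```python
# Illustrative summary of the search space described above (not the authors' actual config).
SEARCH_SPACE = {
    "contextual": {            # BERT-Base / RoBERTa-Base
        "optimizer": "AdamW",
        "weight_decay": 0.01,
        "warmup_steps": 1000,
        "learning_rate": [1e-5, 2e-5, 3e-5],
        "batch_size": [16, 24, 32],
        "dropout": 0.2,
    },
    "non_contextual": {        # CNN / BiLSTM with 300d GloVe vectors
        "optimizer": "Adam",
        "learning_rate": 0.001,
        "batch_size": 32,
    },
}
```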

Baselines
We choose the following OOD detection methods for comparison:
MSP: The Maximum Softmax Probability method proposed by Hendrycks and Gimpel (2016). It uses the maximum probability in the final probability distribution over labels as the prediction score. If the maximum probability is under some specified threshold ϕ ∈ [0, 1], the example is classified as OOD. We tune the threshold on the dev set. This is the default setting for all model backbones.
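A minimal sketch of the MSP decision rule, assuming the softmax outputs are already available; the helper name and array layout are our assumptions.

```python
import numpy as np

def msp_is_ood(probs, phi):
    """probs: (n, k) softmax outputs; returns True where max probability < threshold phi."""
    return probs.max(axis=-1) < phi
```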
Scaling: The temperature scaling (Guo et al., 2017) method leverages a temperature T > 0 to sharpen or widen the probability distribution, and then treats the maximum probability as the final score. The temperature T is chosen from {1, 10, 100, 1000, 5000} and is selected on the OOD validation set.
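A small sketch of how the scaled score could be computed, assuming access to the pre-softmax logits; the default T shown is only a placeholder from the search range, not a recommended value.

```python
import numpy as np

def scaled_max_prob(logits, T=1000.0):
    """Divide logits by temperature T, apply softmax, and return the max probability as score."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)                     # numerical stability
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return probs.max(axis=-1)                                 # higher -> more likely ID
```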
Mahalanobis: Lee et al. (2018) defined the confidence score using the Mahalanobis distance of a test example x with respect to the closest class-conditional distribution, which can be expressed as:

\mathrm{score}(x) = \max_{c} \, -\big(\psi(x) - \mu_c\big)^{\top} \Sigma^{-1} \big(\psi(x) - \mu_c\big)

where \psi(x) is the vector representation of the input x, \mu_c = \frac{1}{N_c} \sum_{x \in D_c} \psi(x) is the centroid for class c in the validation set D_valid, \Sigma = \frac{1}{N} \sum_{c} \sum_{x \in D_c} (\psi(x) - \mu_c)(\psi(x) - \mu_c)^{\top} is the shared covariance matrix, and N_c is the number of instances belonging to class c in D_valid.
Dropout: Gal and Ghahramani (2016) cast dropout training as Bayesian inference for neural networks and obtained multiple predictions by running the model multiple times with dropout enabled for a fixed input. These predictions are then averaged to give the final probability distribution. Note that this method can be combined with the above three approaches.
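The sketch below illustrates the Mahalanobis score under these definitions, with centroids and a shared covariance estimated from validation-set features; the use of a pseudo-inverse and the function names are our assumptions.

```python
import numpy as np

def fit_class_gaussians(features, labels, num_classes):
    """features: (n, d) representations psi(x) from the validation set; labels: (n,) class ids."""
    mus = np.stack([features[labels == c].mean(axis=0) for c in range(num_classes)])
    centered = features - mus[labels]
    sigma = centered.T @ centered / len(features)           # shared (tied) covariance
    return mus, np.linalg.pinv(sigma)                       # pseudo-inverse for stability

def mahalanobis_score(x_feat, mus, sigma_inv):
    """Negative squared Mahalanobis distance to the closest class centroid (higher = more ID)."""
    diffs = mus - x_feat                                    # (num_classes, d)
    dists = np.einsum("cd,de,ce->c", diffs, sigma_inv, diffs)
    return -dists.min()
```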
More details regarding hyperparameter selection are presented in Table 4. Since the proposed strategy ensembles k sub-models, we also implement an ensemble of k vanilla models as a baseline.

Metrics
We use accuracy (ACC) to evaluate model performance on the in-distribution test set, and follow previous work (Hendrycks and Gimpel, 2016; Hsu et al., 2020; Lee et al., 2018) in employing three metrics for the OOD detection task: AUROC, AUPR_out and TNR@95TPR.

AUROC:
The AUROC is short for the area under the receiver operating characteristic curve. The ROC curve plots the true positive rate against the false positive rate (FPR = FP/(FP+TN)) by varying a threshold. This score is a threshold-independent evaluation metric and can be interpreted as the probability that a positive example has a greater detector score than a negative example (Fawcett, 2006). A random classifier has an AUROC score of 50%. A higher AUROC value indicates better OOD detection performance.
AUPR_out: The AUPR is short for the area under the precision-recall curve. The precision-recall curve plots the precision (TP/(TP+FP)) against the recall (TP/(TP+FN)) by varying a threshold. AUPR_out takes the out-of-distribution data as the positive class. It is more suitable for highly imbalanced data than AUROC.
TNR@95TPR: The TNR@95TPR is short for the true negative rate (TNR) at 95% true positive rate (TPR). It measures the true negative rate (TNR = TN/(FP+TN)) when the true positive rate (TPR = TP/(TP+FN)) is 95%, where TP, TN, FP and FN denote true positives, true negatives, false positives and false negatives, respectively. It can be interpreted as the probability that a negative (OOD) example is correctly identified when 95% of the positive (ID) examples are correctly detected.
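Putting the three metrics together, a possible implementation with scikit-learn could look as follows; the convention that higher scores mean "more in-distribution" and the helper name are our assumptions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, roc_curve

def ood_metrics(id_scores, ood_scores):
    """id_scores / ood_scores: detector scores for ID and OOD test examples (higher = more ID)."""
    y = np.concatenate([np.ones_like(id_scores), np.zeros_like(ood_scores)])  # 1 = ID (positive)
    s = np.concatenate([id_scores, ood_scores])
    auroc = roc_auc_score(y, s)
    aupr_out = average_precision_score(1 - y, -s)    # OOD as the positive class, flipped scores
    fpr, tpr, _ = roc_curve(y, s)
    tnr_at_95tpr = 1.0 - fpr[np.searchsorted(tpr, 0.95)]   # TNR when TPR first reaches 95%
    return auroc, aupr_out, tnr_at_95tpr
```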

Results
Experimental results for the non-semantic shift and semantic shift benchmarks are shown in Table 2 and Table 3, respectively. The first observation is that contextual models (BERT and RoBERTa) achieve significantly better performance on both in-distribution and out-of-distribution data than non-contextual models (e.g., CNN, LSTM). The second observation is that existing methods including Scaling, Mahalanobis and Dropout can improve ID and OOD performance. The proposed kFolden framework introduces performance boosts over the ensemble of its corresponding vanilla model (e.g., CNN, LSTM, BERT and RoBERTa) in both ID and OOD evaluations. Additionally, we find that kFolden is a flexible and general framework: it can be combined with existing OOD detection methods such as Mahalanobis, Scaling and Dropout, and introduces additional performance boosts in OOD detection.
It is interesting to see that the improvements on SS datasets are greater than those on NSS datasets when augmenting with the kFolden framework. This is because, compared to NSS tasks, SS poses more variability in data distributions and requires better generalization from ID to OOD samples. kFolden serves this purpose well since it acts as OOD simulation during training, which naturally addresses ID classification and OOD detection at the same time. This training paradigm leads to better results for kFolden on SS data.

The Ratio of Unseen Labels
In this subsection, we explore the effect of different ratios of unseen categories. We use RoBERTa as the model backbone and conduct experiments on the Reuters-mK-nL datasets, including 9K-1L, 6K-4L, 5K-5L, 3K-7L and 2K-8L. We use accuracy and the error rate as evaluation metrics. The error rate represents the proportion of OOD examples that are incorrectly classified into an in-distribution label, i.e., the maximum class probability is above the threshold tuned on the validation set. Experimental results are shown in Table 5. As we can see from Table 5, the overall trend is that the error rate increases as more unseen text categories are added to the out-of-distribution test set. Regarding specific models, we find that kFolden always outperforms Dropout, and the combination of kFolden and Mahalanobis leads to the best performance. We speculate that this is because, unlike Dropout which relies on the masking patterns within the neural network, the kFolden framework operates directly at the output, or training-objective, level using the training data. This gives the model a direct learning signal to distinguish OOD examples.
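The error rate used here can be computed directly from the averaged kFolden distributions on OOD test examples; the sketch below is a straightforward reading of the definition above, with an assumed helper name.

```python
import numpy as np

def ood_error_rate(ood_probs, threshold):
    """Fraction of OOD examples whose max class probability exceeds the tuned threshold,
    i.e. OOD examples wrongly accepted as in-distribution."""
    return float(np.mean(ood_probs.max(axis=-1) > threshold))
```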

Conclusion
In this paper, we propose a simple yet effective framework, kFolden, for OOD detection. It works by simulating the process of OOD detection during training without relying on any external data, and improves OOD detection performance while maintaining in-domain classification accuracy on the constructed benchmarks.

Appendix A: Dataset Details

AGNews-FL
• ID-TestSet We use 4,000 articles from the test set in AG-News. Each class has 1,000 articles.
• OOD-ValidSet We assemble titles and description fields of articles in the AG Corpus from another 4 classes different from AG-News.
There are 4,000 articles in total, 1,000 per class.
• OOD-TestSet We assemble titles and description fields of articles in the AG Corpus from another 4 classes different from AG-News. There are 4,000 articles in total, 1,000 per class.
AGNews-FM The dataset is composed of data from AGNews and additional articles from the AG Corpus. In this setting, the training and ID data are from the 4 classes ("World", "Sports", "Business", "Sci/Tech") in AGNews, and the OOD data are from another 4 classes ("Entertainment", "Health", "Top Stories", "Music Feeds") in the AG Corpus. This dataset is easier than AGNews-FL because the OOD labels are more distinct from the ID labels in terms of label semantics. Data in the five sets do not overlap.
• TrainingSet We use 116,000 articles from the training set in AG-News belonging to 4 classes.
Each class contains 29,000 articles.
• ID-ValidSet We use 4,000 articles from the training set in AG-News. Each class has 1,000 articles.
• ID-TestSet We use 4,000 articles from the test set in AG-News. Each class has 1,000 articles.
• OOD-ValidSet We assemble titles and description fields of articles in the AG Corpus from another 4 classes different from AG-News. There are 4,000 articles in total, 1,000 per class.
• OOD-TestSet We assemble titles and description fields of articles in the AG Corpus from another 4 classes different from AG-News. There are 4,000 articles in total, 1,000 per class.

Yahoo!Answers-FM This dataset is modified from the Yahoo!Answers dataset. We use five topics ("Health", "Science & Mathematics", "Sports", "Entertainment & Music", "Business & Finance") for the training and ID data, and the other five unseen topics ("Society & Culture", "Education & Reference", "Computers & Internet", "Family & Relationships", "Politics & Government") for the OOD data. Data in the five sets do not overlap.
• TrainingSet We use 680,000 examples belonging to five categories in Yahoo!Answers, 136,000 samples per class.
• ID-ValidSet We use 20,000 examples belonging to five categories in Yahoo!Answers, 4,000 samples per class.
• ID-TestSet We use 25,000 examples belonging to five categories in Yahoo!Answers, 5,000 samples per class.
• OOD-ValidSet The data are from another five categories in Yahoo!Answers. The OOD-ValidSet contains 20,000 articles with 4,000 per class.
• OOD-TestSet The data are from another five categories in Yahoo!Answers. The OOD-TestSet contains 25,000 articles with 5,000 per class.

Table 1 :
Statistics for the constructed benchmark. "T" is for "Training Set", "V" is for "Valid Set", and "T" is for "Test Set". All the data in each set are evenly distributed over the labels except 20News-6S. f(m, train/valid/test) means that the actual number depends on m and the corresponding train/valid/test set in the original Reuters-ModApte dataset.
Table 2 :
Results of Non-Semantic Shift (NSS) datasets. The number in the bracket (k) denotes averaging k model predictions, where k equals the number of labels in the training dataset.

Table 3 :
Results of Semantic Shift (SS) datasets. The number in the bracket (k) denotes averaging k model predictions, where k equals the number of labels in the training dataset.

Table 4 :
The range of hyperparameter values.

Table 5 :
Results on Reuters-mK-nL OOD test sets. The Reuters dataset contains 10 label categories. We use m to represent the number of labels in the ID training set and n for the number of categories in the OOD test set.
