Bias Analysis and Mitigation in the Evaluation of Authorship Verification

The PAN series of shared tasks is well known for its continuous and high-quality research in the field of digital text forensics. Among others, PAN contributions include original corpora, tailored benchmarks, and standardized experimentation platforms. In this paper we review, theoretically and practically, the authorship verification task and conclude that the underlying experiment design cannot guarantee pushing forward the state of the art—in fact, it allows for top benchmarking with a surprisingly straightforward approach. In this regard, we present a "Basic and Fairly Flawed" (BAFF) authorship verifier that is on a par with the best approaches submitted so far, and that illustrates sources of bias that should be eliminated. We pinpoint these sources in the evaluation chain and present a refined authorship corpus as an effective countermeasure.


Introduction
When tackling a problem in empirical research, a sound and reliable evaluation of competing solution approaches is a prerequisite for reaching agreement on the state-of-the-art performance. For authorship verification, the PAN series of shared tasks provides the most important benchmarks to which new approaches refer and against which they compare. The fundamental problem in authorship verification is to decide whether two given texts were written by the same author. When experimenting within the PAN setting, we learned that one can quickly achieve competitive performance for this task with one of the most basic approaches: a TFIDF-weighted character 3-gram model. By extending this model with a few additional features, such as the Kullback-Leibler divergence and related measures, we were able to reach the performance of the best verifiers submitted so far. 1 However, reality caught up with us when we applied our verifier to other authorship verification problems, with little success. To get to the bottom of this rather baffling outcome, we carried out a systematic analysis of the entire evaluation chain, its problem definition, its corpora, its evaluation procedure, and, of course, our model, in search of any sources of bias that may have artificially inflated the performance of our approach. The paper in hand introduces our "Basic and Fairly Flawed" (BAFF) model and reports on our bias analysis. Moreover, in an attempt to improve the situation and call for better data, we not only contribute a new and carefully curated authorship verification corpus, 2 but also collect a few best practices for the creation of such corpora. The outlined situation calls into question much of what we believed to know about the state of the art, and future PAN tasks on verification will have to rectify these issues in order to provide a more valid assessment of it.

1 https://www.tira.io/task/authorship-verification/

Related Work
Authorship verification is a young task in the field of authorship analysis. Proposed by Koppel and Schler (2004), and mostly solved on book-sized texts right away, it remains a challenging task on short texts. The numerous verification approaches developed over the years employ a wide array of features, methods, and corpora (Stamatatos, 2009), rendering a comparison between approaches difficult. A dedicated shared task series at PAN (Argamon and Juola, 2011; Juola and Stamatatos, 2013; Stamatatos et al., 2014, 2015) was a key enabler for comparability and reproducibility. The verifiers submitted by Bagnall (2015), Fréry et al. (2014), and Modaresi and Gross (2014) form the state of the art. While new verifiers are run against the shared task's data to assess their performance against these baselines (e.g., Halvani et al., 2017; Kocher and Savoy, 2017), PAN continues to develop new benchmarks on closely related tasks. 3

BAFF: A Baffling Authorship Verifier
In authorship verification, the most basic question to answer is whether two given texts p and q have been written by the same author. 4 Key to solving the task is finding a good representation r of the style difference between p and q. We resort to seven well-known measures for this purpose.

Features: Style Difference Measures
To compute the style difference measures listed below, we first represent p and q as character trigram vectors p and q; character n-grams are considered robust style indicators across many authorship analysis tasks (Stamatatos, 2013). Given p and q, we calculate the following well-known measures: 5

1. Cosine similarity (TF-weighted)
2. Cosine similarity (TFIDF-weighted)
3. Kullback-Leibler divergence (KLD)
4. Skew divergence (skew-balanced KLD)
5. Jensen-Shannon divergence
6. Hellinger distance
7. Avg. logarithmic sentence length difference (a feature frequently used by PAN participants)

After assembling r as a 7-dimensional vector from these difference measures, we rescale all computed features to the interval [0, 1] with respect to the dataset so as to align the diverse value ranges. We fully expect the divergence measures to be correlated to a greater or lesser extent; the learning algorithm will select the best-performing ones.

Table 1a shows the performance of four WEKA classifiers based on our model on the PAN15 test dataset. The decision tree performs best, beating Bagnall's winning deep learning approach in terms of accuracy by one percentage point for an overall second place (Table 1e). We can produce similar results on the PAN14 novels dataset (Table 1f), and, switching to a random forest, even claim first place on the essays dataset (Table 1g). Altogether, with very little effort, our model outperforms the 31 approaches submitted to PAN in 2014 and 2015, competing with much more elaborate solutions.

4 In forensic applications, a text of unknown authorship and one or more texts known to be written by a given author are considered (van Halteren, 2004). If solved, other authorship-related tasks, such as authorship attribution, would be solved as well, since they can be reduced to a series of verifications.
5 Except for the cosine similarity and the average sentence length difference, the other statistical difference measures we use have rarely been considered for verification to date.
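A few of the listed measures can be sketched in plain Python over character trigram profiles. This is an illustrative reimplementation under our own assumptions (epsilon smoothing for unseen n-grams, relative-frequency profiles), not the exact feature extraction used for BAFF:

```python
import math
from collections import Counter

def trigram_profile(text):
    """Relative frequencies of character 3-grams."""
    counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

def cosine_sim(p, q):
    """TF-weighted cosine similarity (Measure 1)."""
    dot = sum(v * q.get(g, 0.0) for g, v in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q)

def kl_divergence(p, q, eps=1e-9):
    """KLD (Measure 3), with epsilon smoothing for unseen trigrams."""
    return sum(v * math.log(v / (q.get(g, 0.0) + eps)) for g, v in p.items())

def js_divergence(p, q):
    """Jensen-Shannon divergence (Measure 5): symmetrized, bounded KLD."""
    m = {g: 0.5 * (p.get(g, 0.0) + q.get(g, 0.0)) for g in set(p) | set(q)}
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def hellinger(p, q):
    """Hellinger distance (Measure 6) between the two trigram distributions."""
    grams = set(p) | set(q)
    return math.sqrt(0.5 * sum(
        (math.sqrt(p.get(g, 0.0)) - math.sqrt(q.get(g, 0.0))) ** 2
        for g in grams))
```

Identical texts yield a cosine similarity of 1 and divergences near 0, while texts with disjoint vocabulary approach the opposite extremes.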

Bias Analysis
Unable to reproduce these outstanding results on other verification problems, our ensuing analysis of the evaluation chain revealed several interdependent sources of bias in all its components, namely our model, the data, and the evaluation procedure.
In what follows, we discuss these biases, outline their underlying flaws, and ways to mitigate them.

Model Bias
In an attempt to pinpoint how much each feature contributes to the overall performance, we ran an ablation test. While the removal of each feature causes some performance loss, the removal of Feature 2, the TFIDF-weighted cosine similarity, resulted in a loss of 19 percentage points, by far the largest among all features. What makes TFIDF special is its IDF factor, which was the key to identifying two sources of bias in our model:

(B1) Corpus-relative features. TFIDF is used so matter-of-factly throughout machine learning that hardly anyone discusses the origin of its document frequency (DF) values. In the absence of any explanation, one may assume that they are computed from the currently processed dataset. This is perfectly fine for most tasks, but crucially not for authorship verification, where computing DF from the evaluation datasets at runtime is both unrealistic and prone to overfitting. The rather small number of test cases in the PAN datasets, combined with Bias B4, allows the learning algorithm to "reverse-engineer" part of the ground truth from the DF values, whereas in practice a forensic linguist analyzes only one case at a time, not many (see Bias B6). Table 1c ("scaled" rows) shows BAFF's performance when computing DF from the processed corpus, and when using the Brown corpus instead, revealing a severe drop in performance. Hence, corpus-relative features should be avoided.
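The problem with corpus-relative DF can be shown with a toy example. The function and datasets below are hypothetical illustrations, not the paper's implementation; the point is only that an IDF weight computed from the evaluation data changes whenever the surrounding test cases change:

```python
import math

def idf_from_corpus(term, documents):
    """IDF computed from whatever corpus is currently being processed:
    the weight of a term now depends on the other test cases (Bias B1)."""
    df = sum(1 for d in documents if term in d.split())
    return math.log(len(documents) / (1 + df))

# The same term receives a different IDF weight depending on which
# other cases happen to be in the evaluation dataset:
test_set_a = ["alpha beta", "beta gamma", "gamma delta"]
test_set_b = ["alpha beta", "alpha gamma", "alpha delta"]
weight_a = idf_from_corpus("alpha", test_set_a)  # log(3/2), term is rare
weight_b = idf_from_corpus("alpha", test_set_b)  # log(3/4), term is common
```

A learner trained on such features fits the composition of the dataset rather than the style of the texts; computing DF from a fixed external corpus (e.g., Brown) removes this dependency.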
(B2) Feature scaling. Another machine learning technique that is often applied without a second thought is scale normalization of all features. However, applying the same reasoning as for the (I)DF calculation, scale normalization biases our features towards corpus specifics. Table 1c shows BAFF's performance with and without scale normalization. We experience a massive performance drop in combination with corpus-relative IDF, but much less so with "external" IDF from the Brown corpus. This aggravation of Bias B1 through feature scaling is most likely influenced by Biases B3-B6.

[Table 1 caption (fragment): ... (c), and a comparison of 10-fold cross-validation naive Bayes with corpus-relative TFIDF as the only feature between the two corpora (d). Column 2 ranks BAFF against the top-5 PAN15 (e) and PAN14 (f / g) submissions (final score = C@1 · ROC). Column 3 lists general statistics for all corpora (h) and genres and time periods covered by our Gutenberg corpus (i). Table notes: (a) counting "known" texts as a single large text; a case in the essays corpus has one "unknown" and up to five "known" texts; (b) not all authors are unique across subsets.]
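The same dataset dependency arises with min-max scaling. The sketch below is a generic illustration of the effect, not BAFF's code: the scaled value of one case shifts when the other cases in the corpus change, even though its raw feature value is identical.

```python
def minmax_scale(values):
    """Rescale a feature column to [0, 1] relative to the dataset (Bias B2):
    the scaled value of any one case depends on all other cases."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# The same raw similarity of 0.5 maps to different scaled values
# depending on which other cases happen to surround it:
corpus_a = minmax_scale([0.0, 0.5, 1.0])  # middle value -> 0.5
corpus_b = minmax_scale([0.3, 0.5, 0.9])  # middle value -> ~0.33
```

Scaling against fixed, externally chosen bounds (or not scaling at all) avoids leaking corpus statistics into individual cases.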

Data Bias
Just as the creators of a verification model should mitigate bias by avoiding unsuitable features and techniques, so should the creators of an evaluation dataset take precautions not to make it readily exploitable. The reason why Biases B1 and B2 inflated the performance of our model is largely due to the fact that the data is biased, too, or else the model's biased features would not have had such a significant positive effect. Reviewing PAN's datasets, we identify three sources of bias.
(B3) Plain text heterogeneity. Inspecting the plain text files of the datasets, we found that many of them carry artifacts that are unlikely to signal authorial style, but rather originate from the plain text converter used or the human transcriber. Examples we observed include mixed use of ASCII and Unicode ellipsis markers (some as iconic as ". . . ."), a wide variety of quotation marks and em dashes (also mixed encodings), and curly braces for parentheses. Moreover, the texts are formatted to be human-readable by preserving white space, including indentations and line breaks, which vary greatly across authors, but were not necessarily introduced by them. Given that many verification models use character n-grams as basic style representation, n-grams covering these artifacts may indicate authorship even across cases. To mitigate this bias, the texts in a dataset should be fully homogenized (particularly in the presence of Bias B4).
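How transcription artifacts leak into character n-gram models can be seen in a small (constructed, not taken from the PAN data) example: two texts that share no wording, but both contain the spaced ellipsis ". . .", still share character trigrams.

```python
from collections import Counter

def trigrams(text):
    """Character 3-gram counts, as used by many verification models."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

# No actual wording overlaps between these two texts -- only the
# transcriber's spaced ellipsis does:
a = "He paused . . . then left."
b = "Nobody answered . . . silence."
artifact_overlap = set(trigrams(a)) & set(trigrams(b))
```

A model comparing trigram profiles would count these purely editorial trigrams as stylistic agreement, which is why full homogenization of the plain text is necessary.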
(B4) Population homogeneity. Many monographs are required to construct a verification dataset. But the sources tapped so far lack scale, so that three shortcuts are commonly applied to maximize yield: 6 For same-author cases, more than one case is constructed for a given author, (1) by systematically pairing more than two texts by that author, and/or (2) by splitting long texts (e.g., books) to obtain more text chunks from that author. For different-authors cases, (3) texts from authors for whom same-author cases exist are reused, using different, or even the same chunks also found in same-author cases. Such imbalance causes authors' styles to be over-/underrepresented. Steady use of these shortcuts also gives rise to Bias B5.
(B5) Accidental text overlap. The strong contribution of the TFIDF-weighted cosine similarity points to text overlap in same-author cases that renders them easier to discriminate from different-authors cases. Caused by Bias B4, text overlap includes named entities (e.g., speaker names in the plays of PAN15), topic words shared between text chunks taken from the same source text, repeated phrases, and unique character sequences. The fanfiction used for PAN14 contains text reuse from the original books. Accidental overlap between cases may lead a learning algorithm astray, especially in the presence of Biases B1 and B6. For mitigation, a text overlap analysis and correction is necessary.
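One simple way such an overlap analysis can be carried out is via the Jaccard overlap of word n-grams between the two texts of a case. This is a generic sketch of the idea, not the specific analysis performed for the paper:

```python
def word_ngrams(text, n=5):
    """The set of word 5-grams of a text (case-folded)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def reuse_score(a, b, n=5):
    """Jaccard overlap of word 5-grams: a crude proxy for text reuse
    between the two texts of a verification case."""
    ga, gb = word_ngrams(a, n), word_ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)
```

Cases whose score exceeds a chosen threshold would then be flagged for manual inspection or removal.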

Evaluation Bias
Lastly, the evaluation procedure itself is biased.
(B6) Test conflation. At testing time, authorship verifiers can usually access the entire test dataset. This is unrealistic; a forensic linguist works on a case-by-case basis, and cases are independent of one another, or their underlying population is unknown. Emulating this scenario, a verifier should process only one test case at a time, without referring to previously processed cases to solve the next one. Incidentally, this policy would mitigate many of the aforementioned biases. While not enforceable in individual evaluations and shared tasks with run submissions, at PAN it may indeed be, by adjusting the TIRA platform to handle the software runs accordingly.
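The one-case-at-a-time policy amounts to an evaluation loop like the following sketch. The toy verifier is hypothetical and only serves to make the loop runnable; the essential point is that each case is decided by a fresh verifier instance that never sees another case:

```python
def evaluate_stateless(make_verifier, cases):
    """One-case-at-a-time policy (Bias B6): every case is decided by a
    freshly created verifier that cannot observe any other test case."""
    return [make_verifier()(known, unknown) for known, unknown in cases]

def make_toy_verifier():
    """Hypothetical verifier: 'same author' iff the two texts share
    a majority of their vocabulary. Purely illustrative."""
    def decide(known, unknown):
        a, b = set(known.split()), set(unknown.split())
        return len(a & b) / len(a | b) > 0.5
    return decide
```

Under this regime, corpus-relative features such as dataset-level DF or min-max scaling (Biases B1 and B2) simply cannot be computed, which is why the policy mitigates them as a side effect.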

The Webis Authorship Verification Corpus
With the goal of avoiding all data biases, we constructed a new authorship verification corpus based on books obtained from Project Gutenberg: 7 the Webis Authorship Verification Corpus 2019. We validate the corpus using our BAFF approach.

Corpus Construction
At Project Gutenberg, transcriptions of many public domain books are provided. Given their diversity, we limit our choice to fiction books from the 19th and 20th century and the two specific genres adventure and science fiction, controlling for respective style variation. Table 1h and i compare the corpus statistics with the three PAN corpora.
To avoid Bias B4, we ensured that each author is unique within, though not necessarily across, any combination of time period and genre. Moreover, no texts were reused to construct different-authors cases; instead, texts from previously unused authors were collected. The same-author cases were created so that both texts are from different books and, where possible, neither book is from the same series. Altogether, we created a total of 274 verification cases, of which 50 % are same-author and the rest different-authors cases, with a 70/30 split between training and test. The size of each text varies between 3,500 and 4,000 words (21,870 characters on average), with a few individual texts being shorter due to insufficient material. Unlike the PAN datasets, we aimed for a corpus that can also be processed by Koppel and Schler's unmasking, an important state-of-the-art approach.

To avoid Bias B3, all texts were carefully normalized to remove editorial and non-authorial artifacts. We stripped book and chapter titles, illustration placeholders, ASCII art, repeated character runs, footnotes, and obvious quotations from the texts (to also avoid Bias B5), as well as any Gutenberg-related front pages and additions to the original text. Gutenberg books use underscores to signify italic text; we removed those as well. Special characters like ellipses and quotation marks were manually replaced by a consistent ASCII representation. We further collapsed all newlines and other white space into a single space character to avoid incidental and inadvertent bias due to formatting.

7 https://www.gutenberg.org/
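Some of the character-level normalization steps described above can be approximated programmatically, as in the following sketch. This is a simplified stand-in for the actual pipeline (which also strips titles, footnotes, front matter, and quotations, partly by hand):

```python
import re

def normalize(text):
    """A minimal sketch of the Bias-B3 normalization: consistent ASCII
    punctuation, no italics markers, collapsed white space."""
    text = text.replace("\u2026", "...")            # Unicode ellipsis -> ASCII
    text = re.sub(r"(?:\.\s+){2,}\.", "...", text)  # spaced ellipses ". . ."
    text = re.sub(r"[\u2018\u2019]", "'", text)     # curly single quotes
    text = re.sub(r"[\u201c\u201d]", '"', text)     # curly double quotes
    text = re.sub(r"[\u2013\u2014]", "-", text)     # en and em dashes
    text = text.replace("_", "")                    # Gutenberg italics markers
    text = re.sub(r"\s+", " ", text)                # collapse all white space
    return text.strip()
```

After normalization, character n-grams can no longer pick up mixed encodings or transcriber-specific formatting as pseudo-stylistic signals.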

Corpus Validation
As per Bias B1, a high performance of TFIDF-weighted cosine similarity hints at a biased dataset. To validate our corpus in this respect, we cross-validated a naive Bayes classifier using only this feature (Table 1d), which achieved merely 57 % accuracy compared to 74 % on PAN15. Excluding cosine similarity, BAFF still gets up to 70 % accuracy (Table 1b), which marks statistical divergence measures as promising features for future verifiers.

Conclusion
In shared tasks, basic approaches sometimes outperform more sophisticated ones. This is frequently the case when machine learning meets small data: inadvertent properties of the data act as confounders that a learning algorithm will gladly fit to if they are not controlled for. In the case of authorship verification as per PAN, this was a major part of the problem. As long as much larger corpora remain out of reach for lack of a sufficient source of monographs, extra care needs to be taken in preparing the data, as exemplified for our corpus.
Another important take-away message is that model authors in authorship verification need to be extra careful about their feature selection. Fortunately, this will come naturally to researchers in the field as they are already trained to avoid features that encode topic rather than style. In particular, we strongly suggest that future evaluations should adopt a stateless one-case-at-a-time test policy.
Finally, in a spin-off study on unmasking, we generalized the algorithm to work on short, essay-length texts (Bevendorff et al., 2019): it achieves an accuracy of 0.73, an F1 of 0.69, and a precision of 0.82, marking the first baseline for our corpus.