How to Extract Good Knowledge from Bad Data: An Experiment with Eighteenth Century French Texts


From a digital historian's point of view, Ancien Régime French texts suffer from obsolete grammar, unreliable spelling, and poor optical character recognition, which makes these texts ill-suited to digital analysis. This paper summarizes methodological experiments that have allowed the author to extract useful quantitative data from such unlikely source material. A discussion of the general characteristics of hand-keyed and OCR'ed historical corpora shows that they differ in scale of difficulty rather than in nature. Behavioural traits that make text mining certain eighteenth century corpora particularly challenging, such as error clustering, a relatively high cost of acquisition relative to salience, outlier hiding, and unpredictable patterns of error repetition, are then explained. The paper then outlines a method that circumvents these challenges. This method relies on heuristic formulation of research questions during an initial phase of open-ended data exploration; selective correction of spelling and OCR errors, through application of Levenshtein's algorithm, that focuses on a small set of keywords derived from the heuristic project design; and careful exploitation of the keywords and the corrected corpus, either as raw data for algorithms, as entry points from which to construct valuable data manually, or as focal points directing the scholar's attention to a small subset of texts to read. Each step of the method is illustrated by examples drawn from the author's research on the hand-keyed Encyclopédie and Bibliothèque Bleue and on collections of periodicals obtained through optical character recognition.
Keywords: text mining; data mining; textometry; production of space; digital history; error correction
Laramée: How to Extract Good Knowledge from Bad Data, Art. 2
What is a digital historian supposed to do with data that is barely tractable to digital methods? Most humanists, whether they use computational methods or not, have to contend with incomplete, inconsistent, error-ridden, or otherwise problematic data. When the amount of this messy data required to answer a research question is small enough, it may be possible to clean it up by hand or even to fill in the blanks and filter out the inconsistencies mentally as one reads through sources and computational results. However, this strategy grows less feasible as the volume of data increases, especially for an individual scholar with finite reserves of time. Historical sources compound the problem by introducing issues that are not found in more recent documents. For example, eighteenth century French books and periodicals are peppered with obsolete grammar and irregular spelling, which natural language processing software designed with modern digitized text in mind is ill-equipped to handle. Historical text is also prone to poor optical character recognition (OCR) results, and error correction techniques that perform well when applied to modern text do not always translate well to sources that do not follow modern text's OCR error production patterns. Thus, in many cases, perhaps in most cases, cleaning an entire historical corpus prior to performing a digital analysis is unrealistic.
By analogy with Big Data, this paper calls a large corpus of text that defies cleanup efforts because of its size or its internal characteristics Bad Data. This paper contends that, when handled with proper care, this Bad Data can still yield good quantitative results. The theoretical-methodological framework required to do so involves an intimate knowledge of the corpus, a targeted approach to error correction, and a measure of humility about the historical questions that can be answered given the limits of the two. Throughout the paper, this theoretical framework will be presented step by step and illustrated by examples from the author's own research.
The framework also illustrates the inescapable need, in making a Bad Data corpus tractable, for a symbiosis between digital methods and human judgement. Finally, the paper contends that its framework applies to many situations in which mining a Bad Data corpus is likely to be useful (albeit at the cost of some customization based on the idiosyncrasies of the case at hand), if only as a parable about the limits of quantification and about the value that one can derive despite these limits.
This article is divided into two major parts. First comes a characterization of Bad Data, how it differs from Big Data, and why historical text qualifies as Bad Data. First and foremost among these reasons is the way in which defects in historical text, including actual errors and artefacts of language that may pose similar challenges to scholars, tend to be clustered rather than spread more or less uniformly. When these error clusters occur in parts of the source material that are highly salient to the research questions under study, they can skew the results to an unacceptable degree. Other reasons include the potentially high unit cost of acquiring a large corpus of text; the ways in which text encoding methods may introduce crucial defects; the ambiguity caused by polysemy; and the partial incompatibility between historical text and current digital language processing tools.
The second half of the article is devoted to outlining a method that can extract reliable information from a Bad Data historical text corpus despite these defects. This method is divided into three steps, and it may need to be iterated several times before it reaches a stable solution. The first step is a heuristic process of research question design that relies on exploration of the corpus to figure out what it may be able to answer. The second step consists of a targeted error correction scheme built around a limited number of keywords that are likely to lead to an answer. In the third step, the scholar assesses how the keywords and the corpus can be mined for the answer, either as raw data themselves, as tools to guide the indirect extraction or construction of further data, or as a way to focus the scholar's close reading on a small number of particularly salient elements of the corpus. Thus, the approach described in this article relies on constant back and forth between data curation, judgement calls, and digital methods, far more than it does on unadulterated technical wizardry. This approach has, however, served the current author well.

Big Data, Bad Data, and the Perils of Historical Corpora
Historians are accustomed to working with Small Data. The typical historical argument relies upon a limited collection of highly salient documents, painstakingly exhumed from the archive at considerable expense in time and toil. These documents are then critically interpreted, sometimes against the grain, to filter out the biases of their creators or to unearth nuggets of information that the documents' creators never directly intended to transmit to posterity. In other words: each unit of Small Data comes at a high price, but it yields correspondingly high value. Big Data, inasmuch as it can be defined, is the opposite in all aspects. Big Data all but accumulates on its own: once a pipeline has been set up to harvest tweets or Web search queries, the incremental effort required to obtain millions of them is negligible. However, since we collect Big Data to reveal trends and patterns that escape the human eye, it is only meaningful in very large amounts. Finally, the promise of Big Data is that quantity trumps quality (Mayer-Schönberger and Cukier 2013, 16-33). If one has easy access to millions of units of content, says the theory, there is no need for critical assessment of each individual unit because the errors will spread more or less uniformly, and a useful signal will still percolate from underneath the noise.
For digital historians, it is tempting to approach large textual corpora as if they were Big Data. And indeed, it is certainly possible to assemble corpora that are large enough to qualify. Broadly speaking, such corpora can be divided into two categories.
The first category includes the small number of sources that have been hand-keyed into digital form through the efforts of scholars and volunteers, either as plain text files or as sets of TEI-encoded, metadata-enhanced documents. Chief among them, for the historian of Ancien Régime France, is the treasure trove provided by the University of Chicago's invaluable ARTFL project, including the Encyclopédie (Morrissey and Roe 2017), which was painstakingly reconstructed from microfilm in the late 1990s, and the Bibliothèque Bleue (ARTFL 2016), ARTFL's collection of 284 works of popular literature published between the 16th and 19th centuries. As scholarly editions, these digital archives reproduce the original source materials, with all of their idiosyncrasies, as faithfully as possible.
Far more common, of course, is the second category, which includes the sources to which scholars only have access thanks to optical character recognition. Gallica (Bibliothèque nationale de France 2018), the French national library's online archive, provides an enormous collection of such documents, including complete or nearly complete runs of several eighteenth century periodicals such as the news-oriented Gazette, the literary Mercure de France, and the western world's first scholarly publication, the Journal des Savants. In both cases, it is relatively easy for a scholar to mine these resources to assemble data sets containing tens of millions of word tokens; hardly comparable to the billions of tokens in Google's word vector training set, perhaps, but plenty to make a Big Data approach seem appealing.
However, treating such data sets as Big Data would be hazardous because historical text tends to violate the rules that define Big Data. OCR errors and other artefacts of language are definitely not distributed at random. Mining textual corpora that have been created by others means abiding by the decisions of others, including the authors and editors of the original source material in the past, which may or may not be appropriate for one's needs. Critical interpretation and close reading are always necessary because words have multiple meanings and their usage changes over time. Perhaps worst of all, extracting numerical features from a large volume of text may suggest the existence of patterns that are mere artefacts of the ways in which the text has been encoded, something that the transformation into numbers has made invisible. And of course, if the sources we want to examine have not yet been digitized and only exist in print or microfilm, acquiring data in bulk can be extremely time consuming and, especially when the source material has been poorly preserved, devilishly tricky. For these reasons, historical text corpora should not be considered Big Data, but rather a form of Bad Data that combines some of the most troublesome features of Small Data and Big Data, even when they have been hand-keyed to perfection. The next two sections will explain why.

Why Historical Text Violates the Random Distribution of Errors
As mentioned earlier, one of the key assumptions of Big Data theory is that defects are spread more or less uniformly, which makes them irrelevant when the amount of data is large enough. All historical text corpora violate this rule because they contain clusters of defects, some but not all of them predictable. The very nature of language is the source of most of these clusters; print technology and editorial decisions create others. And while common sense dictates that hand-keyed corpora are preferable in the abstract, they are no more immune to the clustering effect than those assembled through OCR.
The clustering effect emerges as a consequence of the three types of defects identified by Michael Piotrowski (2012) in his discussion of the pitfalls of historical text processing: changes in spelling and word meanings over time, irregular spelling in the case of sources that predate the standardization of orthography, and uncertainty due to errors in transcription or optical recognition. The first two types of defects are not errors per se but rather historical phenomena that may or may not be significant to a scholar's work. For linguists, these defects may be salient pieces of data; for historians interested in measuring the number of references to a place whose name is spelled in multiple ways, they are functionally identical to OCR errors unless the scholar knows the list of possible spellings ahead of time and plans accordingly. In any case, as we will now see, none of these defects are randomly distributed. Intimate knowledge of the corpus is necessary to figure out what defects are present, how they can influence the research process, and how to implement the appropriate corrective measures.

In the case of eighteenth century French texts, a particularly cumbersome cluster of defects due to spelling changes over time arises from the relatively recent (historically speaking) replacement of an o with an a in such ubiquitous French words as avoit/avait (had), étoit/était (was), and even françois/français (French). This seemingly innocuous change can wreak havoc on digital text analysis because most natural language processing tools have been designed with contemporary grammar and spelling (and only contemporary grammar and spelling) in mind. The popular TreeTagger part-of-speech parser (Schmid 1997), for example, does not possess an Early Modern French grammar, and its contemporary French grammar regularly misidentifies the word types avoit and étoit as nouns, when they are in fact archaic past-tense spellings of the two most common verbs in the French language. This mistake repeats itself thousands of times in any large corpus, with potentially dire consequences for the unwary. When using TreeTagger along with the TXM textometric software package (Heiden, Magué and Pincemin 2010), scholars studying vocabulary divided by parts of speech must compensate for these tagging errors by hand.
Spelling variance also tends to cluster in highly salient parts of historical text, such as named entities (people, places, etc.). For instance, the word Louisiane is spelled three different ways in the ARTFL Encyclopédie, and a keyword search for Encyclopédie content mentioning Louisiana that only takes the "correct" variant into consideration would miss nearly a quarter of the relevant entries, including the main article about Louisiana itself. (Called Louysiane with a Y, this article contains all three occurrences of the Louysiane word type found in the entire seventeen-volume encyclopedia, and no trace of either the "correct" spelling or of any other.) Only through an iterative process of trial and error can language artefacts such as these be uncovered, and it is all but impossible to guarantee that none will escape the scholar's attention.
Even transcription errors may cluster, especially in OCR data. In the Gazette, for example, article headers contain highly valuable information about the cities from which the news originates and the dates on which it was sent to the editor. However, Gazette headers are italicized and therefore misread by OCR at a much higher rate than the surrounding text. The author has observed that the ubiquitous Versailles, for example, is misread in headers in more than a dozen different ways, some of which are not recognizable as words at all. For a scholar interested in news dissemination patterns, this type of error cluster is extremely damaging.
As an aside, transcription defects are by no means limited to OCR. The ARTFL Encyclopédie was hand-keyed by professionals, and yet more than 650,000 corrections had to be applied to the database between 1998 and 2013 (Morrissey 2016), a process that was undoubtedly made more difficult by the fact that, to twenty-first century eyes, the difference between a transcription error and a correct transcription of a word that was incorrectly or fancifully spelled in the eighteenth century is far from obvious. OCR data derived from eighteenth century periodicals is itself of much lower and much less predictable quality than what a scholar accustomed to working with twentieth century sources would expect. Eighteenth century printers often packed text tightly (paper wasn't cheap) and had to contend with irregular type and with ink that seeped through one sheet to the next. Many old documents accumulated stains, rips and mildew in musty attics for several decades before they even entered the archive. Some sources only survive on microfilm, as slightly misaligned or warped pictures that cause no trouble to the human readers for whom they were produced but bedevil OCR software. As a result, while the OCR success rates reported by Gallica can reach 95% or more for many issues of the Gazette, they fall below 50% for some annual compendia of the Journal des Savants, whose most problematic passages are almost indistinguishable from strings of characters generated at random.
In summary, because of the clustering effect, the difference between hand-keyed corpora and those obtained through OCR seems to be one of degree rather than of nature. The fact that the examples given for the first two types of defects have been drawn from the most recent release of the hand-keyed Encyclopédie, which may very well be the highest-quality digital source available for Early Modern French studies, is sobering indeed. The lesson: every historical corpus must be treated as potential Bad Data until proven otherwise.

Further Characteristics of Bad Data
Beyond the clustering effect, which applies everywhere, other characteristics of historical text may sometimes violate the rules that define Big Data.

First, acquiring a unit of textual data can be relatively expensive compared to its salience. Recovering the places and dates of origin from the Gazette's italicized article headlines had to be done by hand. Each of the 1,184 Gazette articles that discuss the colonial Atlantic world between 1740 and 1761 also had to be cut from the raw OCR files and pasted into its own .txt file by hand, one at a time; the process could not be automated in any way because the endpoints of an article are just as likely to be misread by OCR as anything else. In both cases, it was obvious that the data would reveal interesting patterns only when acquired in bulk, but the acquisition process required effort at retail.
Second, the sheer volume of text and lack of a regular structure in large corpora make it more difficult to pinpoint outliers. This is dangerous because some outliers are highly salient while others are mere artefacts of the way in which the sources were encoded and must therefore be discarded. Figure 1 shows an intriguing pattern that emerges from correspondence analysis (Benzécri c1992; Cibois 2007) of the 14,547 Encyclopédie articles that discuss geography: the articles extracted from Volume XIII seem to have very little in common with the others. At first glance, nothing seems to explain this phenomenon. A deep dive into the word frequency statistics calculated by volume, however, reveals an odd discrepancy. The letter P, written as a single-character word type, appears no fewer than 9,613 times in Volume XIII and no more than a few dozen times in any of the others. Further inquiry reveals that nearly all of these unexpected occurrences belong to a single article, about the Italian city of Reggia. It turns out that the ARTFL Encyclopédie has encoded this article and Volume XIII's appendix in a single file. The appendix in question contains a table listing the prime factors of every number between 1 and 100,000, a table in which prime numbers are marked with a P. Deleting the offending table from the dataset makes the abnormal correspondence analysis result disappear; a less conspicuous culprit, however, could easily have gone unnoticed, with potentially deleterious consequences.
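The volume-by-volume frequency check described above can be approximated with a few lines of code. The following is a minimal sketch, not the procedure actually used on the Encyclopédie: the function names, the thresholds `ratio` and `min_count`, and the toy volumes are all assumptions made for illustration. It counts word types per volume and flags any type that is frequent in one volume yet rare in every other.

```python
from collections import Counter


def counts_by_volume(volumes):
    """Map each volume label to a Counter of its whitespace-separated word types."""
    return {label: Counter(text.split()) for label, text in volumes.items()}


def suspicious_types(counts, ratio=20, min_count=50):
    """Flag word types that are frequent in one volume and rare in all others.

    Returns (word type, volume, count, highest count elsewhere) tuples,
    most frequent first. Both thresholds are arbitrary illustrative values.
    """
    flagged = []
    for label, counter in counts.items():
        for word, n in counter.items():
            if n < min_count:
                continue
            elsewhere = max(
                (other[word] for name, other in counts.items() if name != label),
                default=0,
            )
            if n >= ratio * max(elsewhere, 1):
                flagged.append((word, label, n, elsewhere))
    return sorted(flagged, key=lambda row: -row[2])


# Toy stand-ins for three volumes; real input would be the full text of each volume.
volumes = {
    "XII": "la ville de Paris est grande P",
    "XIII": "la ville " + "P " * 200,
    "XIV": "la ville de Lyon P P",
}
flagged = suspicious_types(counts_by_volume(volumes))
print(flagged)
```

A screen of this kind only points to places worth inspecting; deciding whether a flagged word type is a salient outlier or an encoding artefact remains a judgement call.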
Third, because words are polysemic, word type counts can never be taken at face value. In the corpus of eighteenth century periodicals, for example, the word type Halifax refers to a port city in Nova Scotia and to a British lord (usually spelled Hallifax). When trying to figure out how often a reader is reminded of the existence of the city, should mentions of the lord be counted? And if the text also talks about the warship HMS Halifax, does that count? How do we know whether the ship was named after the colonial port, the lord, or some other town in Britain? A similar problem occurs with the word type colonie (colony), which is sometimes used in the periodicals to refer to Atlantic world colonies but far more often to talk about ancient Greco-Roman cities or, in a few cases, about Cardinal Coloni of the Roman Catholic Church. It is surprisingly easy to second-guess one's judgement calls in matters like these.
Finally, from a purely technical standpoint, the application of OCR to historical text tends to produce defects that do not follow the patterns found in OCR data obtained from contemporary text. Among the latter are relatively high numbers of words split into two parts by a phantom period or blank space, and the letter m recognized as a sequence made up of an i and an n (Lopresti 2009). Experience has shown that applying algorithms designed to fix errors in modern OCR data to historical corpora yields mediocre gains. For example, an attempt to repair words

From Bad Data to Good Science, Step 2: Focused Error Correction
Designing a research question that can be answered by mining a large corpus for a small number of keywords naturally orients error correction efforts towards making sure that the presence of these keywords is measured accurately. This means finding and fixing, through some sort of fuzzy search algorithm, as many misspelled and badly recognized keyword tokens as possible, so that every article, page or paragraph relevant to the research question can be extracted from the corpus. (At this time, it may not be obvious whether this extracted content will itself be suitable for digital analysis or whether the scholar will have to examine it through close reading, but the extraction process is the same in either case.) Some Web-based resources, like ARTFL, may include their own fuzzy search engines. When dealing with raw text files downloaded from an archive like Gallica, however, the scholar must apply their own solution. Levenshtein's algorithm (Levenshtein 1966), which defines the distance between two strings as the minimum number of single-character deletions, insertions or substitutions required to transform one into the other, provides an easily customizable model.
and three. This means that a visual inspection of the 2,956 candidate string types eliminated 95% of them. This was easier than one might think: the vast majority of these discarded candidates were either French words themselves and therefore unlikely to represent misread keywords, or else nothing more than random OCR detritus from which nothing could be recovered. The word musique (music), which stands at a Levenshtein distance of 3 from Amérique (America), is an example of the first case; a string made up of the letter a repeated six times, which stands at a Levenshtein distance of 3 from Canada, is an example of the second.
Coloni, for example, came as a complete surprise to this article's author.) Table 2 summarizes the results of this two-part process as applied to the Gazette.
Overall, Levenshtein's algorithm was able to recover 1,867 of the "canonical" keyword tokens themselves, plus 532 damaged or misspelled keyword tokens of 103 different types, which resulted in an increase of 29.5% in the total number of tokens compared to a simple perfect-match search. In total, 2,399 keyword tokens (damaged or not) were found in 1,184 different articles.
Note that, while most of the variants of keyword tokens recovered by this method are the results of one-, two- and three-character OCR errors, the method is equally adept at finding unexpected but correct keyword spellings. For example, the algorithm found 143 instances of the unaccented word type Bresil (Brazil), against only 5 occurrences of the accented form Brésil used in twenty-first century French. This last example is particularly telling of the method's value: while the application of Levenshtein's algorithm to the raw OCR data uncovered relevant articles about every part of the colonial Atlantic, Brazil's importance in the corpus would have been vastly underestimated without it.
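The candidate extraction described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's actual code: the function names, the case-folding and Unicode normalization step, and the toy word list are hypothetical, and the default threshold of 3 simply mirrors the distances discussed in the text. The sketch computes the Levenshtein distance with the standard dynamic programming recurrence and lists, for each keyword, every word type found within the threshold.

```python
import unicodedata


def normalise(s):
    """Lowercase and compose accents so that 'é' counts as one character (NFC)."""
    return unicodedata.normalize("NFC", s).lower()


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def candidates(word_types, keywords, max_distance=3):
    """Map each keyword to the (word type, distance) pairs within max_distance."""
    found = {keyword: [] for keyword in keywords}
    for word in sorted(set(word_types)):
        for keyword in keywords:
            d = levenshtein(normalise(word), normalise(keyword))
            if d <= max_distance:
                found[keyword].append((word, d))
    return found


# Toy word types standing in for the vocabulary of a raw OCR corpus.
word_types = ["Bresil", "Brésil", "musique", "Amérique",
              "Ainérique", "aaaaaa", "Canada", "Londres"]
found = candidates(word_types, ["Amérique", "Brésil", "Canada"])
for keyword, matches in found.items():
    print(keyword, matches)
```

As in the text, the list returned for each keyword mixes genuine variants (Bresil, Ainérique) with ordinary French words (musique) and OCR detritus (aaaaaa), so the visual inspection step cannot be skipped.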
While the process described in this section of the article was designed to handle a few dozen keywords and several thousand documents, it is relatively straightforward to adapt it to different contexts. If the list of keywords is very long, for example, extracting candidate word types that are only separated from a keyword by a Levenshtein distance of 2 or less might provide an acceptable compromise, since the number of such candidates is approximately ten times smaller than for a distance of 3 and experience has shown that only a handful of candidates at distance 3 turn out to be useful. (In the case of the Gazette, approximately 50% of the recovered tokens were at distance 1, 45% at distance 2, and only 5% at distance 3.) It may also be a good idea to run the algorithm more than once with short distances instead of once with a longer distance, augmenting the list of keywords with frequent alternative spellings after each iteration. This is how l'amérique and d'amérique were included in the keyword list, which allowed a second pass to find a handful of misread tokens separated from l'amérique by a distance of 2 but from the original amérique by a distance of 4. Finally, if two keywords are very similar, such as Russie (Russia) and Prusse (Prussia), some candidate word types may lie within a short Levenshtein distance of both. A good rule of thumb, in this case, is to assign the candidate to the keyword to which it is closest; if there is a tie, the decision must be made through a judgement call, possibly one token at a time, after visual inspection of the token in context.
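The rule of thumb for similar keywords can be sketched in code as well. The example below is an illustration only: the function name `closest_keywords`, the threshold of 2 and the sample candidates (including Rusfie, an invented misreading) are assumptions. A candidate is assigned to the nearest keyword; when two keywords are equally close, both are returned so that the tie is left to the scholar.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def levenshtein(a, b):
    """Recursive Levenshtein distance; adequate for single words."""
    if not a or not b:
        return len(a) or len(b)
    cost = 0 if a[0] == b[0] else 1
    return min(levenshtein(a[1:], b) + 1,          # deletion
               levenshtein(a, b[1:]) + 1,          # insertion
               levenshtein(a[1:], b[1:]) + cost)   # substitution


def closest_keywords(candidate, keywords, max_distance=2):
    """Return (keywords at the smallest distance, that distance).

    One keyword in the list: unambiguous assignment.
    Several keywords: a tie, to be settled by reading the token in context.
    Empty list: the candidate is too far from every keyword.
    """
    distances = [(levenshtein(candidate.lower(), k.lower()), k) for k in keywords]
    best = min(d for d, _ in distances)
    if best > max_distance:
        return [], None
    return [k for d, k in distances if d == best], best


keywords = ["Russie", "Prusse"]
for candidate in ["Rusfie", "Russe", "Londres"]:
    print(candidate, closest_keywords(candidate, keywords))
```

Here Russe lies one insertion away from both Russie and Prusse, which is exactly the kind of tie that has to be resolved by inspecting each token in context.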

From Bad Data to Good Science, Step 3: Resolution
Now that the corpus has assisted in the process of heuristic research question design and that targeted error correction has solidified our understanding of the presence of a certain number of crucial keywords in the corpus, it is time to return to the source material. Broadly speaking, the project has reached one of four states, depending on the prevalence of the remaining defects and on their relevance to the research question.
In the best-case scenario, token counts for the corrected keywords themselves (or statistics that can be directly computed from them) are the signal that needs to be measured to answer the research question. Now that the keywords have been Atlantic is all but invisible. Insomuch as they can find a message in this silence, it is that the New World is none of their concern.
In the next best case, the keywords are not the answer but they allow us to extract a subset of the corpus that can be treated as a reasonable approximation of Big Data from which to derive the answer. To qualify as ersatz Big Data, the errors that remain in this subcorpus must be either relatively rare, randomly distributed, or irrelevant to the research question. In the Encyclopédie, for example, a set of several dozen keywords representing the four major parts of the known world (America, Africa, Asia and Europe) and some of the better known Atlantic World colonies of the eighteenth century allows us to extract a collection of 6,053 articles that refer to one or more of these parts of the world. The text of these Encyclopédie articles is of high quality and can safely be submitted to any number of textometric methods, provided that the scholar is aware of the spelling issues mentioned earlier in this paper. Linguistic specificity analysis (Figure 2), for example, shows that articles about America feature an abnormally high number of verbs in the present tense, such as sont and est (to be), font (to make), trouve (to find) and servent (to serve), compared to the rest of the world. Asia and Africa, on the other hand, show high linguistic specificity for verbs conjugated in the past tense (Figure 3) earlier, mining twenty-two years' worth of Gazette issues for articles discussing Atlantic World colonies yielded a relatively small set of 1,184 articles. The OCR data in these articles is too noisy for most purposes, and the idiosyncratic nature of Gazette articles, which read like diaries of unrelated events rather than carefully constructed narratives, makes techniques like topic modeling irrelevant anyway. However, visual inspection of the articles shows that far more of them seem to originate from abroad than from France itself. 
This information is invisible in the OCR data because of an error cluster: the articles' cities of origin appear in headlines, which the Gazette printed in italics, and OCR software reads italics poorly. It is, however, a relatively simple matter to retrieve this information visually from the PDF versions of the periodicals and to include it in a metadata file that also contains, for each article, computationally extracted values such as the year of publication and sentinels indicating the presence or absence of each keyword. This metadata file, created through a mixture of hand-crafting and calculation, contains reliable data that can be studied while the unreliable raw text is set aside. A k-means classification of the 1,184 articles into five classes, based on the contents of this metadata file, confirms the suspected pattern.
Four of the five classes, including the ones characterized by the presence of the keyword Amérique and by the presence of the keyword colonie, are overwhelmingly composed of articles emanating from abroad. Many of them appear to be material copied from English periodicals and translated so faithfully that they use first-person constructs such as notre (our) and nous (us) when discussing British fleets and colonies. In only one class out of five do the names of French colonies appear more often than those of their foreign rivals (Figure 4), and even in this class, sources of French origin are in the minority compared to news briefs sent to the editor from London (Figure 5). Thus, a dirty corpus has indirectly revealed that the Gazette presents the Atlantic world to the French reading public as an essentially foreign phenomenon.
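The metadata-driven classification described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the six article rows, the field names, the FRENCH_CITIES set, and the deterministic farthest-point initialization are invented here, and the toy data is split into two classes rather than the five used in the study. It is not the author's code or data; it only shows how hand-entered cities of origin and keyword sentinels can be clustered while the noisy raw text is set aside.

```python
# Hypothetical metadata rows, one per article. "origin" stands for the city
# read visually from the PDF; the other fields are keyword sentinels (1/0).
ARTICLES = [
    {"origin": "London", "amerique": 1, "colonie": 1},
    {"origin": "London", "amerique": 1, "colonie": 0},
    {"origin": "Madrid", "amerique": 1, "colonie": 1},
    {"origin": "Paris", "amerique": 0, "colonie": 1},
    {"origin": "Versailles", "amerique": 0, "colonie": 0},
    {"origin": "Paris", "amerique": 0, "colonie": 1},
]
FRENCH_CITIES = {"Paris", "Versailles"}  # assumed list, for illustration only
KEYWORDS = ["amerique", "colonie"]


def to_vector(row):
    """Encode a row as [foreign-origin flag, keyword sentinels...]."""
    foreign = 0.0 if row["origin"] in FRENCH_CITIES else 1.0
    return [foreign] + [float(row[key]) for key in KEYWORDS]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, max_iterations=100):
    """Plain Lloyd's algorithm with a deterministic farthest-point start."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        # Next seed: the vector farthest from all centroids chosen so far.
        farthest = max(
            vectors,
            key=lambda v: min(squared_distance(v, c) for c in centroids),
        )
        centroids.append(list(farthest))
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each vector joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


labels = kmeans([to_vector(row) for row in ARTICLES], k=2)
for row, label in zip(ARTICLES, labels):
    print(label, row["origin"], row["amerique"], row["colonie"])
```

On this toy input the foreign-origin articles and the French-origin articles fall into separate classes. On real data, the choice of features and of the number of classes is one of the judgement calls the method depends on.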
If all else fails, the keyword instances retrieved from the noisy data by Levenshtein's algorithm at least show the scholar a better picture of which parts

Conclusion
This article has shown that raw historical text, as a category, cannot be treated as Big Data. It also outlined a method that can neutralize the defects of this source material through an iterative process of heuristic project design, targeted error correction, and careful assessment of what can be computed from the two. This method is somewhat labour-intensive and relies on judgement calls at every step, which suggests a trade-off between the size of the corpus that can be mined, the internal structure of that corpus, the types of error clusters found within it, the size of the list of keywords that can be fixed, the frequency at which these keywords occur in the corpus, and the data quality in the raw text. A very large corpus made up of low-quality OCR data, for example, may not be compatible with a research question that can only be answered by looking at every instance of hundreds of ubiquitous and polysemic keywords. Within these parameters, the method has been applied to several corpora and unrelated research projects. There is no reason to believe that it cannot be adapted across languages and time periods, as long as the language has clearly defined written word boundaries and uses an alphabet so that a Levenshtein distance between words can be computed.
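The two requirements named here, word boundaries and an alphabet, can be made concrete with a minimal sketch. The sample sentence, its misspellings, the function names, and the distance threshold of 2 are all invented for illustration and are not taken from the corpora or from the author's code. The tokenizer relies on word boundaries, and the edit distance relies on comparing words letter by letter.

```python
import re


def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute, or match at no cost
                )
            )
        previous = current
    return previous[-1]


def tokenize(text):
    """Split lowercased text on word boundaries."""
    return re.findall(r"\w+", text.lower())


def find_variants(text, keyword, max_distance=2):
    """Return the distinct tokens within max_distance edits of keyword."""
    keyword = keyword.lower()
    return sorted(
        {
            token
            for token in tokenize(text)
            if levenshtein(token, keyword) <= max_distance
        }
    )


# Invented sample: one reference spelling and two hypothetical variants.
SAMPLE = "La Louisiane ou Louysiane, que l'OCR lit parfois Lonisiane, est une colonie."
print(find_variants(SAMPLE, "Louisiane"))
```

Candidates returned this way still need review by the scholar, since a short edit distance can also match unrelated words.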
Yet the prudent scholar will retain a healthy skepticism regarding results derived from Bad Data. Cross-validation of multiple experiments on a Bad Data corpus, involving different digital methods and visual confirmation of the results, is required to protect the scholar against software bugs and data accidents. Perhaps more importantly, the lower the quality of the original data, the stronger and more consistent across methods the results must be before they can be used to support, and only to support, humanistic interpretation.
A final word on reproducibility. It is easy to publish the general parameters employed in a given study, such as the list of original keywords, a table of additional keyword types and tokens identified using Levenshtein's algorithm, the number of keyword tokens that have been discarded from consideration as a result of judgement calls about polysemy, etc.
However, the method outlined in this paper only provides a (very) partial cleanup of the source data. Further, what counts as a correction for a scholar's purpose, such as merging all of the spellings of Louisiane in the Encyclopédie into a single word type, may count as introducing even more noise for someone else's research. Thus, distributing the corrected data files to the community would be of limited value, except perhaps for those attempting an exact duplication of the original results. And of course, the judgement calls required at every step call into question the level of duplication that can be achieved anyway. Perhaps this should serve as a warning. In the current author's experience, digital history projects involving text reach the limits of what can be achieved through algorithmic approaches distressingly fast. In other words, the human era is far from over.