Big Data for Global History: The Transformative Promise of Digital Humanities

This article discusses the promises and challenges of digital humanities methodologies for historical inquiry. In order to address the great outstanding question whether big data will re-invigorate macro-history, a number of research projects are described that use cultural text mining to explore big data repositories of digitised newspapers. The advantages of quantitative analysis, visualisation and named entity recognition in both exploration and analysis are illustrated in the study of public debates on drugs, drug trafficking, and drug users in the early twentieth century (wahsp), the comparative study of discourses about heredity, genetics, and eugenics in Dutch and German newspapers, 1863-1940 (biland) and the study of trans-Atlantic discourses (Translantis). While many technological and practical obstacles remain, advantages over traditional hermeneutic methodology are found in heuristics, analytics, quantitative trans-disciplinarity, and reproducibility, offering a quantitative and trans-national perspective on the history of mentalities.


big data for global history: the transformative promise of digital humanities van eijnatten, pieters and verheul

[...] 'situated' knowledge. Some scholars even feel the urge to clarify why so many humanities scholars are reluctant to convert to the new digital creed. Others simply ignore the hype. 6 If digital humanities is a new paradigm, it is one that comes with its stalwart believers and underwhelmed agnostics. We believe that overstatements and emotions will do little to help us assess the intrinsic value of these new digital methodologies for the humanities. It is important to ask ourselves, though, whether these new technologies and approaches will change the nature of historical inquiry. Will we see a gradual change, with tools or techniques increasingly being added to established practices, or will digital humanities inaugurate a fundamentally new way of framing historical questions? Are we facing a revolution, as some seem to suggest, comparable to the influx of Byzantine scholarship that led to the re-evaluation of Antiquity on the eve of the Renaissance, or to the opening of national archives at the beginning of the nineteenth century that radically transformed history writing? The question of whether the application of computational methods to the humanities holds a transformative promise seems urgent. 7 One way to address this question is to look at the possibilities offered by the availability of so-called big data in combination with the emergence of digital tools that enable us to mine and analyse gargantuan quantities of digital sources in innovative ways. What is the promise of research projects in which innovative interpretative techniques are applied to big data repositories that are now increasingly being opened up by digital humanities research infrastructures?
From a humanities point of view, big data refers to huge quantities of digitised information that can be analysed using data-intensive methods but for which conventional humanist methods, geared as they are towards the interpretation of a limited number of texts, images and data sets, are simply inadequate. It was in this sense that the celebrated Digital Humanities Manifesto 2.0 of 2009 called for a new wave of scholarship that would be 'qualitative, interpretive, experimental, emotive, generative in character' and predicted the emergence of 'bigger pictures out of the tesserae of expert knowledge'. 8 Will big data re-invigorate big history? What large outstanding questions can historians hope to address by implementing digital humanities?
In what follows we will gauge some of the new avenues currently being explored in the Netherlands, with a particular focus on the projects wahsp and biland (funded by clarin, one of the European research infrastructures), and the project Translantis and its tool Texcavator (funded by the Netherlands Organisation for Scientific Research (nwo)). All of these projects revolve around finding new digital forms of the quantitative history of mentalities.
Our premise is that new text mining techniques for big data analysis will not replace traditional hermeneutic methods in historical research. Rather, the two should be seen as complementary.

A new turn in digital historical methodology
Most researchers explore data manually, using their knowledge and expertise to extract the information they deem relevant. Media research as a field is almost inherently interested in discovering large patterns of opinion formation. Media historians have traditionally employed a variety of sampling methods. 9 An example from the Netherlands is the methodology that media historian Frank van Vree adopted when he studied Dutch public opinion regarding Germany in the period 1930-1939. 10 This was one of the first studies to use public media to gain insight into what the French have called histoire des mentalités - the history of mentalities. Using newspapers as the most important mass medium of the interwar period, Van Vree selected four titles that each represented a major population group (such as Catholics and Protestants). Newspaper issues were then browsed manually, yielding a selection of almost 4,000 articles expressing an opinion on the subject. The 'neutral' press, with a market share of about forty-five per cent, was left out of the selection.

9 The most common approach to meeting this challenge in historical research is to use statistically grounded sampling methods, such as simple or stratified sampling, snowball sampling, and sampling with replacement. Simple sampling refers to the random selection of individual data from a single population, a method that is sometimes refined by sampling from subpopulations or strata; snowball sampling uses a small selection of initial data to select further data (comparable to the way social networks expand through the selection of 'friends'); while sampling with replacement is a form of random sampling that leaves open the possibility that individual data are selected more than once.
Where big data is involved, however, it is impossible to analyse all relevant articles by browsing, while making a selection through a sampling method becomes increasingly problematic because the end selection always needs to be manageable for an individual researcher. Indeed, historians have recognised since at least the 1970s that there are corpora available for historical research that are simply too large to be examined in their entirety and to be perused manually. Nevertheless, manual browsing is still common practice in historical research. Conventional sampling methods do to some extent address the challenge of big data in that they reduce the amount of data to manageable proportions, but in practice they are relevant only to the analysis of a limited number of serial titles. 11
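The sampling strategies discussed above can be made concrete in a few lines of code. The following is a minimal illustrative sketch using only the Python standard library; the function names and the newspaper-article framing are our own, not part of any of the tools discussed in this article.

```python
import random

def simple_sample(population, k):
    """Simple random sampling: k distinct items, each equally likely."""
    return random.sample(population, k)

def stratified_sample(strata, k_per_stratum):
    """Stratified sampling: draw separately from each subpopulation
    (e.g. Catholic, Protestant, socialist and 'neutral' newspapers)."""
    return {name: random.sample(items, k_per_stratum)
            for name, items in strata.items()}

def sample_with_replacement(population, k):
    """Random sampling in which an item may be selected more than once."""
    return [random.choice(population) for _ in range(k)]

def snowball_sample(seed_ids, neighbours, rounds):
    """Snowball sampling: start from a few seed items and repeatedly add
    their 'neighbours' (items linked to those already selected), the way
    a social network expands through the selection of 'friends'."""
    selected = set(seed_ids)
    frontier = set(seed_ids)
    for _ in range(rounds):
        frontier = {n for item in frontier
                    for n in neighbours.get(item, [])} - selected
        selected |= frontier
    return selected
```

Each function reduces a corpus to manageable proportions, but, as noted above, the end selection must still be small enough for an individual researcher to read.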

Text mining big data collections: programme design
One of the expectations about digital humanities is that it will enable us to investigate much larger quantities of public media. After half a century of digital humanities we are now entering a new phase in which historians are able to analyse massive volumes of text, particularly by integrating (socio-)linguistic methods into humanities research. New techniques of large-scale data analysis allow historians to manage big data sets that were previously difficult to access. Semantic text analysis is a particularly promising form of data mining that can be applied in order to derive subject-specific information from 'mountains' of textual data without having to read it all.
Text analytics or text mining is an umbrella term for incorporating and implementing a wide range of tools or techniques (algorithms, methods), including data mining, machine learning, natural language processing and artificial intelligence. Semantic text analytics focuses specifically on the historical-contextual meanings of words and phrases in a big data set. 12 The goal of text mining is to reduce the effort required of humanities researchers to obtain useful information from large digitised textual data sources. Current international and national programmes such as Digging into Data and catch-plus demonstrate the feasibility of performing interdisciplinary humanities research facilitated by digital research tools. 13 Adapting the digital methodologies arising from these programmes to humanities research gives rise to more easily reproducible results, more refined computationally based research methods for historians, and new research questions. These programmes also demonstrate that collaborative and integrative strategies such as common group learning (all necessary knowledge is pooled and learning is both shared and cumulative), modelling, negotiation among experts and integration by leaders are central to the functioning and therefore the success of this approach. The design and execution of such large digital humanities programmes is obviously grafted onto common practice in the sciences and may be contrasted with the great majority of humanities research (notable exceptions such as linguistics excluded), where research is predominantly individualistic.

10 Frank van Vree, De Nederlandse pers en Duitsland, 1930-1939.
The role of experts in the field, in our case cultural and science historians, in the development of new text and data mining technologies is particularly important. The process of articulating the needs and demands of users in relation to available technical options is no less significant and crucially depends on including programme mediators who bring a strong background in the humanities as well as state-of-the-art text mining expertise into the research team. Incorporating regular feedback loops, for instance, allows an iterative refinement of analysis algorithms and the development of a user-friendly digital tool. In the following sections we will illustrate two particular programmes, wahsp/biland and Translantis.

Towards historical sentiment mining in public media: wahsp/biland
The first step towards the development of an open-source mining technology that can be used by historians without specific computer skills is to obtain hands-on experience with research groups that use currently available open-source mining tools. A recently developed tool that has been utilised to accomplish this is the clarin-supported web application for historical sentiment mining (a form of semantic text analytics that focuses on historical opinions, attitudes, and value judgments) in public media that is known under its acronym wahsp. 14 wahsp is specifically designed for text mining the digital newspaper archive of the National Library of the Netherlands (Koninklijke Bibliotheek).

Although the wahsp tool offers a number of options for quantitative analysis, such as the frequency of words or combinations of words used in specific newspaper articles in a certain period of time, it derives its most promising analytical potential from its visualisation and arrangement features. Each query results in a term cloud that is based on the relative frequencies of the words occurring in the retrieved selection of documents from the corpus. The visualisation of word associations in these term clouds allows the historian, on the basis of existing domain expertise, to quickly determine the characteristics of the selected documents and to refine or adapt the query. The wahsp software is also able to indicate sentiments by highlighting terms with a negative or positive connotation (although it should be noted that this technique of sentiment detection is still in need of historical contextualisation). Advanced techniques for what is called Named Entity Recognition (ner) enable the researcher to recognise and highlight the names of 'entities' such as places, persons, institutions and events. This tool allows the historian to place the occurrence of certain terms, ideas or debates within a geographical context, or connect them to persons or organisations (see fig. 3).
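The core of such a term cloud is straightforward: relative word frequencies over the retrieved documents, optionally tagged against a sentiment lexicon for highlighting. The following sketch uses only the Python standard library; the toy lexica and the function name are our own illustrative assumptions and bear no relation to wahsp's actual word lists or implementation.

```python
from collections import Counter

# Toy sentiment lexica, invented for illustration only.
POSITIVE = {"progress", "cure", "hope"}
NEGATIVE = {"danger", "crime", "addiction"}

def term_cloud(documents, top_n=25):
    """Relative frequency of each word across the retrieved documents,
    with a crude sentiment tag that a front end could use for colouring."""
    counts = Counter(word for doc in documents
                     for word in doc.lower().split())
    total = sum(counts.values())
    cloud = []
    for word, n in counts.most_common(top_n):
        sentiment = ("pos" if word in POSITIVE
                     else "neg" if word in NEGATIVE
                     else "neutral")
        cloud.append((word, n / total, sentiment))
    return cloud
```

A real system would additionally strip stop words, normalise spelling variants and correct ocr errors before counting, which is precisely where the historical contextualisation mentioned above becomes indispensable.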
Lastly, a visualisation of the temporal distribution of the documents allows the historian to discover patterns in publication dates. This approach has been successfully employed by Stephen Snelders in a wahsp-assisted study of public debates on drugs, drug trafficking and drug users in the early twentieth century to qualify and quantify these 'hidden debates'. In this kind of cultural history (or 'history of mentalities'), the combination of scientific concepts and cultural notions is of primary interest. Thus, we will not only be able to mine concepts but also explore the more unconscious, latent use of genetic or eugenic ideas by ordinary people as they were mediated in public debates. 22 Such analyses are not limited to debates about specific issues. They also open up a much wider panorama. Text mining techniques offer an innovative way to map trans-national influences and measure how debates crossed regional, cultural and national borders.
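The temporal distribution mentioned above amounts to bucketing the retrieved documents by publication year, so that bursts in a debate show up as peaks on a timeline. A minimal sketch, assuming iso-formatted date strings as metadata (an assumption of ours, not a description of the actual archive format):

```python
from collections import Counter

def temporal_distribution(dated_documents):
    """Count retrieved documents per publication year.

    dated_documents: iterable of (date_string, text) pairs, with dates
    in 'YYYY-MM-DD' form; returns [(year, count), ...] sorted by year."""
    per_year = Counter(date[:4] for date, _text in dated_documents)
    return sorted(per_year.items())
```

Plotted as a bar chart, such counts let the historian spot, for instance, the years in which a drug scandal pushed the topic into the newspapers.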
The availability of massive repositories of digitised periodicals that span not years or decades but centuries, and represent national 'climates of opinion', offers the possibility to map long-term changes in national and transnational debates on a myriad of issues in their cultural, economic, political and social contexts. It will be evident that there is a link here with global or world history, which as an established sub-discipline is primarily focused on both long-term developments and large-scale comparisons. 27 Given that the larger part of global history research has been exclusively socio-political or economic in nature, one of the most promising new approaches is the study of interactions between large culturally defined areas in domains such as health, crime, religion and mass communication. The emergence of trans-national power constellations or empires, as well as the way their influence radiated beyond national borders, can now be mapped over longer periods of time.
A way to conceptualise these trans-national and trans-cultural vectors of influence is to investigate the emergence of 'reference cultures'. This concept foregrounds the shifting subjectivities central to cultural encounters and questions, rather than assumes, national identity formation. Reference cultures are mental constructs or 'cognitive maps' that do not necessarily represent a geopolitical reality with an internal hierarchy and recognisable borders. These culturally conditioned images of trans-national models are typically established and negotiated in public discourse over a long period of time. 28 The academic discussion suggests that the interplay of political, economic and technological supremacy with the 'soft power' of cultural attraction and reputation plays a crucial role in how dominant nations and cultures establish guiding standards for other cultures. However, the specific historical dynamics of reference cultures have never been systematically analysed and hence are not fully understood.
The key to understanding the emergence and dominance of reference cultures is to chart the public discourses in which these collective frames of reference take shape. Cultural and science historians, information scientists and text mining experts are currently addressing these questions in Translantis. 31 The programme will implement the text mining tools that have emerged from the wahsp and biland projects to study long-term developments and transformations in national discourse in a systematic, longitudinal, and quantifiable way. It is expected that the implementation of text mining tools will provide historians with a sophisticated heuristic model outlining the emergence, role and decline of reference cultures such as the United States, and possibly rising economic powers such as China. The outcomes of this project - insight into reference cultures and the experience with digital technology to mine public debate - will serve as a springboard for comparative studies on European, trans-Atlantic and global levels to determine the patterns of transnational discourse and global cultural exchange.

The digital promise
Based on the examples discussed, we argue that the application of new digital techniques offers a number of methodological advantages over more conventional approaches in humanities research, particularly the history of mentalities. These advantages manifest themselves especially in, but are not necessarily limited to, research that involves textual big data repositories. The advantages arguably apply to at least four related areas -heuristics, analytics, quantitative trans-disciplinarity and new forms of reproducibility.

Heuristics
Digital search tools allow searches into textual data of virtually unlimited size, meaning that they are constrained only by the availability of digitised data repositories and computational capacity. This crucial dimension of big data research has several important implications. Firstly, it means that both manual browsing, which is inherently limited in scope, and sampling in its various forms, which involves restrictions with regard to representativeness, are no longer necessary. Secondly, searches no longer depend on indexing or registers, allowing explorations in both structured and unstructured data sets.
Thirdly, digital search techniques allow unlimited combinations of searches, facilitating associative thought in a controllable and reproducible way. This fosters creativity and serendipity, bringing the domain specific expertise of the researcher into full play. Even more promising is the potential to discover and quantify 'hidden' debates that offer a new perspective on the history of mentalities. 32
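The 'unlimited combinations of searches' mentioned above boil down to freely composable boolean filters over the corpus. The sketch below is a minimal stdlib illustration of such combinable queries; it is our own construction and does not reflect the query language of wahsp, biland or Texcavator.

```python
import re

def search(documents, all_of=(), any_of=(), none_of=()):
    """Combinable boolean search: every matching document contains all
    terms in all_of, at least one term in any_of (if given), and no
    term in none_of. Queries can be nested by re-searching the hits."""
    def words(text):
        return set(re.findall(r"\w+", text.lower()))
    hits = []
    for doc in documents:
        w = words(doc)
        if (all(t in w for t in all_of)
                and (not any_of or any(t in w for t in any_of))
                and not any(t in w for t in none_of)):
            hits.append(doc)
    return hits
```

Because each query is an explicit, re-runnable function call, the associative exploration it supports remains controllable and reproducible in exactly the sense argued above.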

Analytics
Computational techniques allow new ways to analyse research results. The computational methods which are currently being explored in wahsp, biland and Texcavator allow the recognition and display of patterns in textual data.
They recognise and identify entities in texts, such as proper names, events and geographical locations, and reveal historical arrays of sentiments and values.
Combining these data with metadata (data that describe the structure or content of a repository) allows the researcher to reconstruct the structure, intensity and emotions of historical debates in public media.
Statistical forms of cluster analysis can point to patterns in debates that possibly eluded the traditional researcher. This offers novel ways to trace the course of debates in newspapers and other public media, hence offering a new perspective on public debates.
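As a concrete illustration of such cluster analysis, the sketch below groups documents by similarity of their word-frequency vectors using a plain k-means loop. This is a didactic stdlib version written for this article, assuming a small fixed vocabulary; production tools would use more sophisticated vectorisation and clustering than shown here.

```python
from collections import Counter
import math
import random

def vectorise(doc, vocabulary):
    """Represent a document as word counts over a fixed vocabulary."""
    counts = Counter(doc.lower().split())
    return [counts[w] for w in vocabulary]

def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(vectors, k, iterations=10, seed=0):
    """Basic k-means: assign each vector to its nearest centroid,
    then move each centroid to the mean of its cluster."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda i: distance(v, centroids[i]))
            clusters[nearest].append(v)
        centroids = [
            [sum(col) / len(c) for col in zip(*c)] if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return clusters
```

Run over newspaper articles, clusters of this kind can surface groups of texts that share a vocabulary - and thus, potentially, a debate - that the individual researcher might never have thought to query for.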

Quantitative trans-disciplinarity
Big data analytics not only opens up new fields of research in cultural history, but also results in quantitative data sets that can be confronted with other historical data. Especially promising are combinations of quantitative textual data with the historical data sets being produced by economic and social historians. Digital data sets also circumvent the cons of conventional sampling: not only will more historical sources be made accessible retrospectively in digital form, but the computerised documentation of today will be the historian's material of tomorrow.

Concluding remarks
We expect that methods geared to text mining big data will set agendas for historical research, in the sense that they will determine what the significant themes in public debates within a specific time frame actually were. This kind of fingerprinting or fixation of historical mentalities on the basis of big data evidence is unprecedented. It might well - at long last - allow humanities scholars to validate their inferences in the manner of social scientists. However, it remains to be seen whether the conclusions obtained through digital methodologies will differ substantially from those acquired by traditional means. Ascertaining this is one of the aims of the Translantis project outlined above and also of the European follow-up hera project 'Asymmetrical Encounters: E-Humanity Approaches to Reference Cultures in Europe, 1815-1992' (asymenc) that started in the autumn of 2013.
Convinced of their transformative power as we may be, we also readily recognise that there are downsides to big data research as proposed here. Given the scarcity of research money, financial investments in quantitative big data research will inevitably occur at the expense of 'traditional' humanities research. We believe that the kind of 'digital mentalities' research outlined in this article will make it possible to quantify the cultural baggage of groups of people - to identify changes in their sentiments, attitudes and values over decades, if not centuries. Yet to interpret such changes the hermeneutic skills traditionally associated with historical scholarship remain indispensable to assess in which cases mining big data for meaning is sound, productive and worthwhile from a humanities point of view. Indeed, a word cloud is meaningless without the historian's ability to contextualise and understand the past 'from within'; and it takes a humanities scholar to understand whether a correlation that is statistically significant is also culturally relevant and historically meaningful. Moreover, no historian worth his salt will rely exclusively on text mining techniques, at least not in the near future. Text mining techniques will displace but they will not replace traditional hermeneutic methods. Indeed, Translantis explicitly includes conventional in-depth analyses as a way to explain the specific patterns, or parts of patterns, generated through computational methods: and of course, the source criticism that is part and parcel of the historical profession will remain a sine qua non, even if specific skills need to be honed to properly gauge the quality of a digital source. Historians do not have to worry that big data will ever replace intelligent inquiry, or that digital methodologies will serve as an alternative to historical theories. In fact, digital methods are worthless without meaningful research questions and conceptual frameworks.
One of the most important realisations is that even 'big' digitised data collections represent only a tiny and biased part of the historical evidence that historians have at their disposal, and using them requires constant awareness of their inherent biases and distortions. Among other things this is exemplified by the rather frustrating experiences with poor ocr quality and by the numerous lexicological challenges that still have to be met. We agree, therefore, that the proper use of 'big data' requires equivalent quantities of critical sense and that we should steer clear of the dangerous illusion, as Andrew Prescott recently warned, 'that data can somehow be cut free from its historical moorings to enjoy an autonomous existence'. 35 We might live in the age of the exabyte if not the zettabyte, but it is important to acknowledge that only a sliver of our vast historical past is available in the form of bits and bytes. On a global scale, the current state of big data repositories hardly allows us to go beyond a still very circumscribed number of newspapers, a smattering of magazines and a painfully limited amount of digitised material that is as disparate as it is fragmentary.
Furthermore, it remains to be seen whether different repositories can be accessed in a way that is useful to impatient scholars with limited time on their hands. Obstacles are not just technological in nature and do not only concern standards and protocols; excruciatingly complicated copyright issues are involved, as well as urgent concerns about ocr quality, textual stability, discrepancies within and between public and commercial providers of source material and the real threat that profitability may triumph over free access. At this moment we are nowhere near 'fingerprinting' different cultures in terms of attitudes, opinions and values - an ambition that is likely to revolutionise global history once it is realised. On the other hand, if these difficulties are surmounted - and there is no reason to believe that they cannot be - the opportunities for and possibilities of innovative historical research will be manifold.
There are things digital humanities research cannot do. One of them is to produce a historical narrative authored by a craftsman whose evocation of the past depends on individual erudition, scholarship, insight, talent and the ability to tell a story. 36 However, it is clear to us that the new tools and methodologies discussed in this paper, and to which all historians will soon have access, will be an important contribution to future historical scholarship.

Joris van Eijnatten (1964) is Professor of Cultural History at Utrecht University. He has worked on three overlapping and interrelated fields - the history of ideas, the history of religion, and the history of media and communication. He has written or edited more than ten books and authored around a hundred articles and contributions on subjects ranging from the history of medicine and press freedom to the history of cultural values and public administration. Recent publications: co-authored with Fred van Lieburg, Nederlandse religiegeschiedenis (second edition; Hilversum 2006) and Hogere sferen. De ideeënwereld van Willem Bilderdijk, 1756-1831 (Hilversum 1998). He has a Dutch-language textbook From Village Square to Cyberspace: A History of Communication, Media and Information forthcoming (Prometheus). Email: j.vaneijnatten@uu.nl.