Perspective
Text as big data: Develop codes of practice for rigorous computational text analysis in energy social science

https://doi.org/10.1016/j.erss.2020.101691

Abstract

Augmenting traditional social science methods with computational analysis is crucial if we are to exploit the vast digital archives of text data that have become available over the past two decades. In this journal, Benites-Lazaro et al. [1] showcase this in an application of topic modeling and other computational methods to an actor-specific examination of changes in policy discourse on ethanol in Brazil and point out methodological promises and challenges. However, their contribution also highlights the need to establish codes of practice for computational text analysis. In this perspective, we discuss five areas for improvement when treating text as big data in light of guiding principles from computational research – transparency, reproducibility and validation – to facilitate rigorous research practice: (1) full transparency over data collection and corpus construction, (2) comprehensive method descriptions that enable reproducibility by other researchers, (3) application of rigorous model validation procedures, (4) interpretation of results based on primary text and a clear research design and (5) critical discussion and contextualization of main findings. We conclude that the energy social science community needs to develop codes of practice to build on the promising research within the field of computational text analysis, and we suggest first steps in this direction.

Introduction

We live in an age of big data: digital information is increasingly characterized by challenging levels of volume, velocity, variety and veracity [2]. Part of this data explosion is the recent availability of digital text at a massive scale, which can be used for social science inquiry. Digital archives of scientific and newspaper articles, parliamentary records, party documents, patents, legislation, treaties or social media posts such as those from Twitter, Facebook or Reddit provide rich sources for research. As shown in Fig. 1, these text archives are vast and growing fast.

Using text as big data provides new opportunities and requires the augmentation of traditional social science methods with computational text analysis applications (e.g. [3], [4], [5], [6], [7]). The analysis of language is central to a variety of methods in energy social science. Yet these methods are challenged by volume and velocity: today, vast amounts of text are produced so fast that manual analysis cannot keep up. This precludes timely and comprehensive inquiry. Computational text analysis tools promise to scale traditional social science methodologies to vast archives of text without massive funding support [8], [9], [10]. A large variety of supervised and unsupervised methods allows the automation of tasks such as document classification, content analysis, sentiment analysis, part-of-speech tagging, machine translation and information retrieval [11], [12]. This provides entirely new opportunities for energy social science research (e.g. [13], [14], [15]).

For example, discourse analysis studies the use of spoken or written language in social contexts to understand how debates are shaped by actors and develop over time [16], [17]. Discourse analysis is often limited by the capacity to classify and code text manually: vast digital text archives can only be analyzed partially, exposing traditional approaches to the criticism of ‘cherry-picking’ or neglecting less frequent linguistic patterns when only small samples of text are analyzed [18]. Computational methods for text analysis can assist in or take over some of these coding tasks and scale discourse analysis to text archives of almost any size [19], [20], [21], [22].

Despite their promise and notwithstanding at least 20 years of research in computational social science [23], [4] and digital humanities [24], [25], [26], the adoption of automated text analysis methods by the social science community has been slow. Searching the 1318 contributions in this journal, we find only six articles that apply automated text analysis methods [1], [27], [28], [29], [30], [31].

It is, therefore, a major contribution of Benites-Lazaro et al. [1] to be among the first to discuss the methodological potential of computational text analysis in the field of energy social science, as part of a 2018 “Special Issue on the Problems of Methods in Climate and Energy Research” in this journal. They apply topic modeling – an unsupervised machine learning method for automated content analysis – and other computational methods to understand policy discourses in Brazil on ethanol production across different actors. Their analysis of thousands of documents from governments, businesses, NGOs and media outlets reveals the thematic structure of the discourse over the observed 11-year time-span and identifies, for example, distinct thematic discourses across actor groups.

Yet, the application of computational text analysis in energy social science comes with particular risks and challenges. First, it requires deep interdisciplinary knowledge in multiple fields. Analyses like the one by Benites-Lazaro et al. [1] require expertise in applied text-mining methods, social science methods for understanding discourses and subject matter knowledge on energy policy in the Brazilian context. To complicate things, different fields may be characterized by potentially contrasting ontologies and ideas about epistemic hierarchies [38], [39], [40], [41]. Second, limited or biased data availability precludes the application of these methods in certain research areas. The establishment of text mining methods will illuminate some research areas while obfuscating others [42]. It may thus lead to selective coverage or at least introduce new biases in the representation of different types of research questions and subjects. To mitigate or correct for these biases, researchers need to be aware of them. Third and related to this, text mining applications in social science need to take into account the power differentials that are inscribed in the technologies that enable them and the effects of quantification practices on society and communication [42]. Therefore, although we argue here that the importance of computational text analysis approaches is growing, they represent only one set of methods in the toolbox of energy social science.

Pioneering studies that introduce new methods into a field furthermore face the challenge that they operate in a comparatively open space, in which methodological standards, codes of practice and community expertise are not well established. We argue that, despite prominently introducing and highlighting the importance of computer-assisted discourse methodology in this journal, the paper by Benites-Lazaro et al. [1] highlights the need for a critical discussion of how such an analysis should be conducted. To varying degrees, this need is also reflected in other contributions to the field, including our own (e.g. [13], [18], [27], [29], [43], [44], [45]). In the next section, we outline some key principles of good scientific practice for computational research, mainly from computer science. In light of critical limitations in Benites-Lazaro et al. [1], we use these principles in the subsequent section to propose recommendations for applications of computational text analysis in energy social science, which supplement guidance for good scientific practice as mapped out by Sovacool et al. [46]. As Benites-Lazaro et al. [1] apply mainly unsupervised machine learning and other data science applications such as topic modeling, the discussion will mainly focus on these approaches, but the broader recommendations are equally valid for other computational text analysis methods. We close this perspective by highlighting the need for further developing and implementing codes of practice for computational research that promote transparency, reproducibility and validation in this emerging research area in energy social science.

Principles for computational research

In empirical research, scientific inquiry is characterized by claims that are based on empirical data and can be defended in a rational debate. Defensible claims have to be based on rigorous scientific practice (see Table 1 for a definition of rigor in the scientific context). Following Sovacool et al. [46, p. 13], this implies that researchers apply “a mix of carefulness and thoroughness” in designing study objectives, applying methods and interpreting results. The advancement of

Applying computational text analysis in energy social science

Using the principles from computer science and the recommendations by Sovacool et al. [46], we discuss below specific recommendations for computational text analysis in energy social science. The study by Benites-Lazaro et al. [1] serves as an example to develop and illustrate our points. We choose this particular example because the authors explicitly focus on methodological innovation and the methods’ potential for future research. However, we find similar limitations to various

Outlook: towards codes of practice for computational text analysis in energy social science

With this contribution we aim to start a discussion on codes of practice for computational text analysis in energy social science that complement the general good practice guidance provided by Sovacool et al. [46]. Such codes would establish minimum standards that help authors and reviewers to provide and guarantee good quality research. Based on the principles from computer science – transparency, reproducibility and validation – we lay out ideas for how such guidelines could look for this

Code and data availability

Code and data for figures are available at https://github.com/mcc-apsis/text-mining-commentary.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work was developed within the project ‘Strategic Scenario Analysis’ (START) funded by the German Ministry of Research and Education (grant reference: 03EK3046B). Max Callaghan is supported by a PhD scholarship from the Heinrich Böll Foundation. We thank William Lamb and three anonymous reviewers for helpful comments.

References (78)

  • B.K. Sovacool et al., Promoting novelty, rigor, and style in energy social science: Towards codes of practice for appropriate methods and research design, Energy Res. Soc. Sci. (2018)
  • B.K. Sovacool, What are we doing here? Analyzing fifteen years of energy scholarship and proposing a social science research agenda, Energy Res. Soc. Sci. (2014)
  • D. O’Callaghan et al., An analysis of the coherence of descriptors in topic modeling, Expert Syst. Appl. (2015)
  • M. Beyer et al., The importance of big data: a definition, Gartner (2015)
  • G. Miller, Social scientists wade into the tweet stream, Science (2011)
  • D. Lazer et al., Computational social science, Science (2009)
  • S.A. Golder et al., Diurnal and seasonal mood vary with work, sleep, and daylength across diverse cultures, Science (2011)
  • T. Lansdall-Welfare et al., Content analysis of 150 years of British periodicals, P. Natl. Acad. Sci. USA (2017)
  • P.S. Dodds et al., Temporal patterns of happiness and information in a global social network: Hedonometrics and Twitter, PLoS ONE (2011)
  • J. Grimmer et al., Text as data: The promise and pitfalls of automatic content analysis methods for political texts, Polit. Anal. (2013)
  • K. Isoaho et al., A big data view of the European energy union: shifting from ‘a floating signifier’ to an active driver of decarbonisation?, Polit. Gov. (2019)
  • M. Gentzkow et al., Text as Data, J. Econ. Lit. (2019)
  • G. Ignatow et al., Text mining. A guidebook for the social sciences (2016)
  • E.M. Cody et al., Climate change sentiment on Twitter: An unsolicited public opinion poll, PLoS ONE (2015)
  • F.C. Moore et al., Rapidly declining remarkability of temperature anomalies may obscure public perception of climate change, P. Natl. Acad. Sci. USA (2019)
  • Y. Kryvasheyeu et al., Rapid assessment of disaster damage using social media activity, Sci. Adv. (2016)
  • M. Hajer et al., A decade of discourse analysis of environmental politics: Achievements, challenges, perspectives, J. Environ. Policy Plan. (2005)
  • H. Klüver, Measuring interest group influence using quantitative text analysis, Eur. Union Polit. (2009)
  • L. Collingwood et al., Tradeoffs in accuracy and efficiency in supervised learning methods, J. Inf. Technol. Polit. (2012)
  • M.E. Roberts et al., A model of text for experimentation in the social sciences, J. Am. Stat. Assoc. (2016)
  • J. Lawrence et al., Argument mining: a survey, Comput. Linguist. (2019)
  • C. Cioffi-Revilla, Computational social science, WIREs Comput. Stat. (2010)
  • M.L. Jockers et al., Text-Mining the Humanities
  • S. Schreibman et al., A Companion to Digital Humanities (2008)
  • P. Svensson, The landscape of digital humanities, Digit. Humanit. (2010)
  • D. Reinsel et al., The Digitization of the World - From Edge to Core, Technical Report (2018)
  • Scopus factsheet, 2019....
  • R. Johnson, A. Watkinson, M. Mabe, The STM Report: An overview of scientific and scholarly publishing, Technical Report...