Perspective
Text as big data: Develop codes of practice for rigorous computational text analysis in energy social science
Introduction
We live in an age of big data: digital information is increasingly characterized by challenging levels of volume, velocity, variety and veracity [2]. Part of this data explosion is the recent availability of digital text at a massive scale, which can be used for social science inquiry. Digital archives of scientific and newspaper articles, parliamentary records, party documents, patents, legislation, treaties or social media posts such as those from Twitter, Facebook or Reddit provide vast and rich sources for research. As shown in Fig. 1, these text archives are already large and growing fast.
Using text as big data provides new opportunities and requires the augmentation of traditional social science methods with computational text analysis applications (e.g. [3], [4], [5], [6], [7]). The analysis of language is central for a variety of methods in energy social science. Yet these methods are challenged by volume and velocity: today, vast amounts of text are produced so fast that manual analysis cannot keep up. This precludes timely and comprehensive inquiry. Computational text analysis tools promise to scale traditional social science methodologies to vast archives of text without massive funding support [8], [9], [10]. There is a large variety of supervised and unsupervised methods that allow the automation of tasks such as document classification, content analysis, sentiment analysis, part of speech tagging, machine translation and information retrieval [11], [12]. This provides entirely new opportunities for energy social science research (e.g. [13], [14], [15]).
For example, discourse analysis studies the use of spoken or written language in social contexts to understand how debates are shaped by actors and develop over time [16], [17]. Discourse analysis is often limited by the capacity to classify and code text manually: vast digital text archives can only be analyzed partially, exposing traditional approaches to the criticism of ‘cherry-picking’ or neglecting less frequent linguistic patterns when only small samples of text are analyzed [18]. Computational methods for text analysis can assist in or take over some of these coding tasks and scale discourse analysis to text archives of almost any size [19], [20], [21], [22].
Despite their promise and notwithstanding at least 20 years of research in computational social science [23], [4] and digital humanities [24], [25], [26], the adoption of automated text analysis methods by the social science community has been slow. Searching the 1318 contributions in this journal, we find only six articles that apply automated text analysis methods [1], [27], [28], [29], [30], [31].
It is, therefore, a major contribution of Benites-Lazaro et al. [1] to be among the first to discuss the methodological potential of computational text analysis in the field of energy social science, as part of a 2018 “Special Issue on the Problems of Methods in Climate and Energy Research” in this journal. They apply topic modeling – an unsupervised machine learning method for automated content analysis – and other computational methods in order to understand policy discourses in Brazil on ethanol production across different actors. Their analysis of thousands of documents from governments, businesses, NGOs and media outlets reveals the thematic structure of the discourse over the observed 11-year time-span and identifies, for example, distinct thematic discourses across actor groups.
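To make the idea of topic modeling concrete, the following is a minimal sketch of latent Dirichlet allocation fitted by collapsed Gibbs sampling, using only the Python standard library. It is not the implementation used by Benites-Lazaro et al. [1]. The toy corpus, the function name `fit_lda` and all parameter values (`n_topics`, `n_iter`, `alpha`, `beta`, `seed`) are assumptions made purely for illustration.

```python
import random
from collections import Counter


def fit_lda(docs, n_topics=2, n_iter=200, alpha=0.1, beta=0.01, seed=0, n_top=3):
    """Toy collapsed Gibbs sampler for LDA on tokenised documents.

    All names and default values are illustrative assumptions.
    """
    rng = random.Random(seed)  # explicit seed, so repeated runs give the same result
    vocab_size = len({w for doc in docs for w in doc})

    # Count tables maintained by the sampler.
    doc_topic = [[0] * n_topics for _ in docs]          # tokens per topic in each document
    topic_word = [Counter() for _ in range(n_topics)]   # word counts per topic
    topic_total = [0] * n_topics                        # total tokens per topic

    # Random initial topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment of this token from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic given all other assignments.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Most frequent words per topic, ignoring words whose count dropped to zero.
    top_words = [
        [w for w, c in tw.most_common() if c > 0][:n_top] for tw in topic_word
    ]
    return top_words, doc_topic


# Hypothetical toy corpus; a real study would use thousands of documents.
texts = [
    "ethanol sugarcane biofuel production ethanol",
    "sugarcane ethanol biofuel export",
    "climate emissions policy carbon climate",
    "carbon policy emissions target",
]
docs = [t.split() for t in texts]
topics, doc_topic = fit_lda(docs, seed=42)
print(topics)
print(doc_topic)
```

Recording the seed together with the code means that a second run with the same inputs yields the same topics, which is one small ingredient of the reproducibility discussed in this perspective. Whether the resulting topics are meaningful still has to be validated by the researcher.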
Yet, the application of computational text analysis in energy social science comes with particular risks and challenges. First, it requires deep interdisciplinary knowledge in multiple fields. Analyses like the one by Benites-Lazaro et al. [1] require expertise in applied text-mining methods, social science methods for understanding discourses and subject matter knowledge on energy policy in the Brazilian context. To complicate things, different fields may be characterized by potentially contrasting ontologies and ideas about epistemic hierarchies [38], [39], [40], [41]. Second, limited or biased data availability precludes the application of these methods in certain research areas. The establishment of text mining methods will illuminate some research areas while obfuscating others [42]. It may thus lead to selective coverage or at least introduce new biases in the representation of different types of research questions and subjects. To mitigate or correct for these biases, researchers need to be aware of them. Third and related to this, text mining applications in social science need to take into account power differentials that are inscribed in the technologies that enable them and the effects of quantification practices on society and communication [42]. Therefore, although we argue here that the importance of computational text analysis approaches is growing, they represent only one set of methods in the toolbox of energy social science.
Pioneering studies that introduce new methods into a field furthermore face the challenge that they operate in a comparatively open space, in which methodological standards, codes of practice and community expertise are not well established. We argue that despite prominently introducing and highlighting the importance of computer-assisted discourse methodology in this journal, the paper by Benites-Lazaro et al. [1] highlights the need for a critical discussion of how such an analysis should be conducted. To varying degrees, this is also reflected in other contributions to the field, including our own (e.g. [13], [18], [27], [29], [43], [44], [45]). In the next section, we outline some key principles of good scientific practice for computational research, mainly from computer science. In light of critical limitations in Benites-Lazaro et al. [1], we use these principles in the subsequent section to propose recommendations for applications of computational text analysis in energy social science, which supplement guidance for good scientific practice as mapped out by Sovacool et al. [46]. As Benites-Lazaro et al. [1] apply mainly unsupervised machine learning and other data science applications such as topic modeling, the discussion will mainly focus on these approaches, but broader recommendations are equally valid for other computational text analysis methods. We close this perspective by highlighting the need for further developing and implementing codes of practice for computational research that promote transparency, reproducibility and validation in this emerging research area in energy social science.
Principles for computational research
In empirical research, scientific inquiry is characterized by claims that are based on empirical data and can be defended in a rational debate. Defensible claims have to be based on rigorous scientific practice (see Table 1 for a definition of rigor in the scientific context). Following Sovacool et al. [46, p. 13], this implies that researchers apply “a mix of carefulness and thoroughness” in designing study objectives, applying methods and interpreting results. The advancement of
Applying computational text analysis in energy social science
Using the principles from computer science and the recommendations by Sovacool et al. [46], we discuss specific recommendations for computational text analysis in energy social science in the following. The study by Benites-Lazaro et al. [1] serves as an example to develop and illustrate our points. We choose this particular example because the authors explicitly focus on methodological innovation and the methods’ potential for future research. However, we find similar limitations to various
Outlook: towards codes of practice for computational text analysis in energy social science
With this contribution, we aim to start a discussion on codes of practice for computational text analysis in energy social science that complement the general good practice guidance provided by Sovacool et al. [46]. Such codes would establish minimum standards that help authors and reviewers to provide and guarantee good quality research. Based on the principles from computer science – transparency, reproducibility and validation – we lay out ideas for how such guidelines could look for this
Code and data availability
Code and data for figures are available at https://github.com/mcc-apsis/text-mining-commentary.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
This work was developed within the project ‘Strategic Scenario Analysis’ (START) funded by the German Ministry of Research and Education (grant reference: 03EK3046B). Max Callaghan is supported by a PhD scholarship from the Heinrich Böll Foundation. We thank William Lamb and three anonymous reviewers for helpful comments.
References (78)
- et al. Topic modeling method for analyzing social actor discourses on climate change, energy and food security. Energy Res. Soc. Sci. (2018)
- et al. A critical review of discursive approaches in energy transitions. Energy Policy (2019)
- et al. Muslims in social media discourse: Combining topic modeling and critical discourse analysis. Discourse, Context Media (2016)
- Vodka on ice? Unveiling Russian media perceptions of the Arctic. Energy Res. Soc. Sci. (2016)
- et al. Villainous or valiant? Depictions of oil and coal in American fiction and non fiction narratives. Energy Res. Soc. Sci. (2017)
- et al. Business storytelling about energy and climate change: The case of Brazil’s ethanol industry. Energy Res. Soc. Sci. (2017)
- et al. Energy ideals, visions, narratives, and rhetoric: Examining sociotechnical imaginaries theory and methodology in energy research. Energy Res. Soc. Sci. (2018)
- Shattered frames in global energy governance: Exploring fragmented interpretations among renewable energy institutions. Energy Res. Soc. Sci. (2020)
- et al. CSR as a legitimatizing tool in carbon market: Evidence from Latin America’s Clean Development Mechanism. J. Clean. Prod. (2017)
- et al. Modeling landscape sustainability in the oil producing Niger delta area of Nigeria. Energy Pol. (2019)