Image as Data: Automated Content Analysis for Visual Presentations of Political Actors and Events

Images matter because they help individuals evaluate policies, primarily through emotional resonance, and can help researchers from a variety of fields measure otherwise difficult to estimate quantities. The lack of scalable analytic methods, however, has prevented researchers from incorporating large scale image data in studies. This article offers an in-depth overview of automated methods for image analysis and explains their usage and implementation. It elaborates on how these methods and results can be validated and interpreted and discusses ethical concerns. Two examples then highlight approaches to systematically understanding visual presentations of political actors and events from large scale image datasets collected from social media. The first study examines gender and party differences in the self-presentation of U.S. politicians through their Facebook photographs, using an off-the-shelf computer vision model, Google's Label Detection API. The second study develops image classifiers based on convolutional neural networks to detect custom labels from images of protesters shared on Twitter to understand how protests are framed on social media. These analyses demonstrate the advantages of computer vision and deep learning as a novel analytic tool that can expand the scope and size of traditional visual analysis to thousands of features and millions of images. The paper also provides comprehensive technical details and practices to help guide political communication scholars and practitioners.

IMAGE AS DATA 6

Inputs Into Decision Making
Humans are more likely to notice and learn from visual information than textual.
Images provide information about a situation, such as a politician's patriotism or the beneficiaries of a new healthcare policy, more accessibly and quickly than text (Barry, 1997). This advantage likely exists because writing is a technology that must be learned, while visual processing is evolutionarily antecedent (Gazzaniga, 1998). Compared to text, images provide "a more comprehensive and error-free grasp of information, better recall, and greater emotional involvement" (Graber, 1996).
These emotions affect decisions ranging from vote choice (Joo, Steen, & Zhu, 2015) to mobilization (Casas & Webb Williams, 2019). Understanding how images matter for politics is therefore central to understanding how politics works.
Images are a powerful means of persuasion and a critical device in media framing, agenda-setting, and propaganda (Geise & Baden, 2015). They are carefully selected, edited, and presented to audiences, conveying various intentions encoded in subtle or sometimes very obvious ways. Scholars have demonstrated the effect of visuals on issue perceptions (Soroka, Loewen, Fournier, & Rubenson, 2016) and candidate evaluations (Barrett & Barrington, 2005). Given a multimodal message, audiences construct a blended representation of issues and events from verbal and visual cues, and when the two are not congruent, the visual cue may dominate (Gibson & Zillmann, 2000).
Images encapsulate underlying, complex issues, providing an information shortcut for individuals to evaluate multi-faceted political issues (Popkin, 1994).

Advancing Communication Research
Framing. Facial expressions of politicians are an indicator of overall favorability. For instance, a smiling face is more likely to convey a positive sentiment about the main person being depicted. Based on this assumption, Groeling, Joo, Li, and Steen (2016) have examined the degree of media bias present in TV news programs in the U.S. by automatically analyzing facial expressions of presidential candidates across news networks. Going beyond traditional professional sources, attempts have also been made to analyze political images in social media. For instance, You, Cao, Cong, Zhang, and Luo (2015) have analyzed multimodal cues of Flickr posts related to presidential candidates in the U.S. to predict election outcomes based on facial expressions and hashtags.
Candidate Evaluation. Computer vision methods have also shown the potential effects of politicians' facial appearance on voters' trait judgments and election outcomes. Personality inference from facial appearance is a well studied topic in psychology (Zebrowitz & Montepare, 2008), and political scientists have attempted to explain public responses to politicians, including election outcomes, based on the physical appearance of political leaders, such as their visually-inferred competence (Todorov, Mandisodza, Goren, & Hall, 2005). Automated models have been used to extract visual features from facial images to predict subjective trait judgments on dimensions such as intelligence or trustworthiness (Rojas, Masip, Todorov, & Vitria, 2011; Vernon, Sutherland, Young, & Hartley, 2014). Automatically inferred facial traits may also predict election outcomes (Joo et al., 2015). Section 6.1's analysis of politicians' images shared on Facebook shows how deep learning informs the study of elected officials' self-portrayal (Fenno, 1978). Most people access news through multimodal (a combination of print, audio, or visual) media; even newspapers devote significant space to photographs, and the claim that the visual dimension of politics matters is not new (Barrett & Barrington, 2005; Gilliam Jr & Iyengar, 2000; Grabe & Bucy, 2009; Hansen, 2015; Schill, 2012). Presidential debates, for instance, are both verbal exchanges of policy positions and, because they are televised, conveyors of emotions and tensions between the candidates (Joo et al., 2019; Shah et al., 2016). Indeed, the nonverbal cues and visual exposures of politicians may encode their emotions and invoke voter reactions (Grabe & Bucy, 2009; Sullivan & Masters, 1988). Prominent recent examples from the United States include Donald Trump's stalking of Hillary Clinton during their debates as well as Speaker Pelosi's sarcastic clapping during President Trump's 2019 State of the Union address. Visuals are an especially important information shortcut for low-information voters (Lenz & Lawson, 2011), which may explain why out-parties tend to prefer more attractive candidates (Atkinson, Enos, & Hill, 2009).

Media Bias.
Computer vision techniques also enable measurement of media bias and framing, which Section 6.2 demonstrates. Large literatures analyze media bias in political news coverage (D'Alessio & Allen, 2000; Gentzkow & Shapiro, 2010), its public perception (Watts, Domke, Shah, & Fan, 1999), and its effects (Baum & Groeling, 2008; Druckman & Parkin, 2005). Measuring media bias objectively is a challenging task because the ground truth is unknown. For systematic analysis, studies have relied not only on verbal content analysis (Baum & Groeling, 2008) but also on visual analysis, ranging from counting the number of photographs of a candidate in newspapers (Stovall, 1988) to manually coding how favorable or unfavorable their portrayals are (Grabe & Bucy, 2009). Computer vision based techniques can significantly reduce coding costs by automatically recognizing people in photographs, their expressions, and their favorability and comparing the results across outlets or candidates (Peng, 2018). As traditional media face increasing competition from online, decentralized content producers (Blumler & Kavanagh, 1999), the ability to analyze image framing at scale will only increase in importance (Schmuck & Matthes, 2017).

Opinion Formation. Issue behavior responds to visual communication.
Negative opinions towards immigration, for example, may be due to media conflation of immigrants with crime and disease (Tukachinsky et al., 2011). Attitudes about immigration are more positive, however, when the imagery accompanying an article evokes European, instead of Latin American, immigration, and this effect is caused by intervening emotional variables, especially anxiety (Brader, Valentino, & Suhay, 2008).
The power of images explains why anti-immigrant rhetoric focuses on symbolic (visual) appeals over economic ones (Schmuck & Matthes, 2017). Deep learning techniques can also offer insight into what features of images provoke behavior. For example, people are more likely to pay attention to negative or shocking events (Baumeister, Bratslavsky, & Vohs, 2001), so newspapers and television report those types of events. But how those events are portrayed should also matter.

Polarization. Computer vision techniques can also shed light on changes in political polarization. Dietrich (2018), for example, uses video data of members of the House of Representatives to show that the frequency of physically crossing the aisle to talk to members of the other party predicts how polarized an upcoming vote will be. The images politicians share on their Facebook, Twitter, and Instagram profiles may reveal their ideological positions (Xi et al., 2020). Measuring ideology via images would prove especially useful for evaluating challengers to incumbents, since their ideology cannot be determined from voting histories and campaign donation data may not provide this information early enough in an election cycle (Bonica, 2018).
Appendix Section B details additional applications in the study of development, natural disasters, civil war, state capacity, and protests.

Computer Vision and Deep Learning
Computer vision addresses visual problems with whatever methods work, while deep learning refers to a family of efficient methods applicable to many kinds of data, not just images. Computer vision is an interdisciplinary branch of study crossing computer science, statistics, cognitive science, and psychology. Its primary goal is the automatic understanding of visual content, i.e., replicating human visual abilities with computational models. Human vision is versatile, complicated, and not fully understood, and computer vision systems cannot simply reconstruct the mechanisms of human vision. Research has therefore mostly focused on statistical inference and machine learning approaches that deal with noisy inputs and discover meaningful patterns.
In practice, this pipeline usually consists of collecting a large amount of visual data, manually labeling them, and training a model that best explains the observed data.
Prior to the start of the deep learning era, the insufficient reliability and accuracy of computer vision based methods was the primary factor limiting practical applications, including political analysis of visual content. The field made a dramatic leap forward with the advances in deep learning based approaches (Krizhevsky, Sutskever, & Hinton, 2012). The next section introduces those advances.

Deep Learning and Hierarchical Representations
Deep learning refers to a class of machine learning methods which utilizes hierarchical, multi-layered models. 1 In contrast to single-layered models, such as linear regression, in which output variables can be directly computed from input variables, "deep" models employ repetitive structures with multiple layers such that the final outputs of the model are obtained through a sequence of operations applied to the input data and intermediate results.
In machine learning, hierarchical model structures are commonly used, as in some topic models (Griffiths, Jordan, Tenenbaum, & Blei, 2004). These models incorporate different levels of representations which capture structured and global information (e.g., topic), as well as local information (e.g., words) from input data. In political science, hierarchical text models have been used to study Congressional press releases (Grimmer, 2010) and open-ended survey responses (Roberts et al., 2014).
Deep learning based methods profit from the same hierarchical structure, but they employ a larger number of consecutive layers. These extra layers add the "deep" to the learning. Indeed, the success of deep learning is related to the depth of the models, as additional layers can encode abstract visual attributes and capture more complex data distributions than shallower models can (Delalleau & Bengio, 2011; Eldan & Shamir, 2016).
Furthermore, these complex internal structures are directly learned from the images rather than manually defined by the researcher. Direct learning contrasts with other approaches, explained in the next sub-section, that require the researcher to specify the visual features of an image that correspond to the desired image label ("car", "torch", "rally", etc.). That approach is similar to using a dictionary in text analysis, identifying a text as being about a topic if it contains some combination of keywords from the dictionary. Dictionary approaches are more productive for text than manual feature specification is for images because text can be represented more simply. Deep learning, by contrast, does not use a pre-defined feature set, an advantage when working with complex data such as images.

Advances Over Previous Computer Vision Methodology
Artificial neural networks have a long history in machine learning and computer vision and regained popularity after Krizhevsky et al. (2012) demonstrated a 21.9%-33.8% improvement in image classification performance using a convolutional neural network on a benchmark dataset, ImageNet (Russakovsky et al., 2015). Two major requirements for deep learning, very large-scale datasets and high-performance computation using Graphics Processing Units (GPUs), had contemporaneously become available.
Traditional computer vision methods heavily rely on manual feature engineering.
These methods typically utilize a two-step process, as shown in Figure 1. Given raw input image data, the methods first extract features using a hand-crafted feature extractor. Hand-crafting means that a researcher has to manually design and define the feature extraction function based on instinct and experience. These hand-crafted features capture the most important cues in the raw data, and a separate classifier, such as logistic regression, exploits them in the second step.

Figure 1
Comparing deep learning to traditional computer vision methods.
In contrast, deep learning methods learn their representations directly from data without hand-crafted feature extraction. These methods employ a data-driven approach in feature learning and train an integrated model that will automatically learn and capture low-and high-level representations of data. This approach is advantageous because the learning algorithm can discover many subtle features which are specific to the given task. In other words, the features in deep learning are optimized for the task during training, as opposed to traditional methods that require the researcher to specify features before training.
The Appendix provides technical details about how convolutional neural networks work. We leave the technical discussion for the appendix because it is challenging for practitioners to design and construct their own CNN from scratch.
Rather, it is much more efficient to acquire a training set of images that can be used to customize an existing pre-trained model. Appendix Section D elaborates details for transfer learning, training, and validating models for advanced readers.
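As a concrete illustration of that customization workflow, the sketch below shows transfer learning in its simplest form: freeze the pre-trained network, extract its penultimate-layer features, and fit only a lightweight classifier on a small labeled set. Everything here is a stand-in (random vectors replace real CNN activations, and the labels are synthetic), so it conveys the shape of the approach rather than the paper's actual pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in features: hypothetical 64-dimensional penultimate-layer CNN
# activations. In practice these would come from running each image
# through a pre-trained network and keeping the last hidden layer.
rng = np.random.default_rng(0)
n_images, n_features = 200, 64
features = rng.normal(size=(n_images, n_features))

# Hypothetical binary labels from a small hand-annotated training set;
# the label depends on the first feature so the task is learnable.
labels = (features[:, 0] + 0.5 * rng.normal(size=n_images) > 0).astype(int)

# "Transfer learning" at its simplest: the pre-trained network stays
# frozen, and only this lightweight classifier is fit on its features.
clf = LogisticRegression(max_iter=1000).fit(features[:150], labels[:150])
accuracy = clf.score(features[150:], labels[150:])
```

Because the heavy feature extractor is reused rather than retrained, a few hundred annotated images can suffice where training a CNN from scratch would require millions.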

4 Tasks in Computer Vision
This section discusses three common tasks in computer vision: image classification, object detection, and face and person analysis.

Image Classification
Image classification is a popular topic in computer vision. Given an input image, I, the goal of image classification is to assign a label, y, from a predefined label set, Y, based on the image content:

y* = argmax_{y ∈ Y} P(y | I). (1)

For binary classification, Y = {positive (belongs to category), negative (does not belong to category)}. In general, Y may contain any number of possible labels. The posterior probability for each label is computed for a given input image, and the classifier chooses the category with the highest output score, similar to how a topic is assigned by some text classifiers.

Figure 2
Example results of image classification with the confidence scores computed from a CNN.

Red color indicates the correct category and blue color indicates the incorrect categories.
In multiclass (or multinomial) classification, Y contains more than two, mutually exclusive categories. The softmax function is commonly used in multiclass classification to normalize output scores over multiple categories such that the final scores sum to 1; the class with the highest normalized output is assigned to that image. Suppose that the last fully connected layer outputs a vector x = (x_1, x_2, ..., x_n), where x_k is the raw output score before normalization for the k-th class out of n classes. The final score for class k is obtained as follows:

softmax(x)_k = exp(x_k) / (exp(x_1) + exp(x_2) + ... + exp(x_n)). (2)

An image can contain more than one label. In this situation, called multilabel classification, an image is allowed to be assigned more than one label. For instance, Section 6.1 uses multilabel classification to understand politicians' imagery, and Section 6.2 uses multiclass classification to identify images of protest.
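A minimal numeric sketch of the softmax normalization and the highest-score decision rule, using hypothetical raw scores for three classes:

```python
import numpy as np

def softmax(x):
    # Subtracting the max before exponentiating improves numerical
    # stability without changing the result.
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Hypothetical raw scores from the last fully connected layer (3 classes).
raw_scores = np.array([2.0, 1.0, 0.1])
probs = softmax(raw_scores)

# The classifier assigns the class with the highest normalized score.
predicted_class = int(np.argmax(probs))
```

The normalized scores sum to 1 and preserve the ordering of the raw scores, so the argmax decision is unchanged by the normalization; softmax simply makes the scores interpretable as a probability distribution over classes.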

Object Detection
The goal of object detection is to localize (find) objects in images and assign a category (gun, flag, or cup, for example) to each object. The output of object detection is a set of detected objects, their locations, and their categories. Figure 3 shows example results of object detection with detection scores from Google's Cloud Vision API. 2 Object detection is a more complex problem than image classification because the model must both classify the objects and determine their locations in the image. In practice, many object detection systems utilize a two-stage procedure. First, the system generates a number of generic object "proposals" from an input image (Uijlings, Van De Sande, Gevers, & Smeulders, 2013). These proposals are image subregions which the system believes are likely to contain an object instance, regardless of its category. An object location is represented by a rectangular bounding box, (x, y, w, h), indicating the coordinates and size of the bounding box. This bounding box is the rectangular area of minimum size that covers all the pixels the object occupies in the image. Second, the image classification step is then applied to each object proposal to determine whether it belongs to a category or is background.
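The (x, y, w, h) bounding-box representation supports simple geometric comparisons, such as intersection-over-union (IoU), the standard measure of how well a detected box matches a reference box. The boxes below are toy values, not output from any real detector:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x, y, w, h) bounding boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Overlap rectangle: clamp to zero when the boxes do not intersect.
    ix = max(ax, bx)
    iy = max(ay, by)
    iw = max(0.0, min(ax + aw, bx + bw) - ix)
    ih = max(0.0, min(ay + ah, by + bh) - iy)
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

# Two toy 2x2 boxes offset by one pixel: overlap area 1, union 7.
score = iou((0, 0, 2, 2), (1, 1, 2, 2))
```

Detection systems typically use a threshold on IoU (often 0.5) to decide whether a proposal counts as a correct detection of a ground-truth object.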

Face and Person
The human face has received enormous research attention as a special domain in computer vision since the 1970s, for two main reasons. First, facial recognition has many useful applications, such as personal identification and security. Second, face images are relatively easy to handle compared to other objects because the appearance of a human face is consistent across individuals yet distinct from other objects. These properties motivated early approaches such as automated feature extraction (Kanade, 1977), feature learning with neural networks (Fleming & Cottrell, 1990), and classification based on statistical analysis of data (Belhumeur, Hespanha, & Kriegman, 1997). A face attribute classifier has a similar structure to an image classification model: the system first detects every face in an image, and each facial region is then classified separately by a model trained for face attribute classification. Figure 4 shows two examples of face recognition and gender and race classification from facial appearance.

Figure 3
Example results of object detection by Google Cloud Vision API.

Figure 4
Example results of face detection, recognition, and attribute classification. The labels were computed by a model from Kärkkäinen and Joo (2019).

Ethics
The explosion of data and computational power that has enabled academic and commercial advances in the study of human behavior stimulates a growing awareness of their ethical implications. Since deep learning is a result of these advances, it is also implicated in resulting ethical debates. This section focuses on five areas of concern: training data bias, privacy, informed consent, model opacity, and access to resources.
Bias. Perhaps the biggest ethical challenge facing those employing computer vision techniques is that a model will reproduce any biases in the input data, and input data often already contain racist and gendered stereotypes. 3 For a similar reason, commercial gender classification APIs offered by Microsoft, IBM, and Face++ have been criticized for their inferior classification accuracy on darker-skinned females (65%, compared to 99% on lighter-skinned males) (Buolamwini & Gebru, 2018). In image search results, women are, on average, underrepresented relative to their participation rate in a given occupation (Lam, Wojcik, Broderick, & Hughes, 2018). A researcher relying on pre-trained models or commercial APIs should make sure he or she is aware of any biases that model embeds. When building one's own model, the labels applied to a validation dataset should be examined for any biases before subsequent analysis uses the model output.
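One simple way to audit a validation set along these lines is to compute accuracy separately by demographic group, mirroring the disparity analysis of Buolamwini and Gebru (2018). The labels, predictions, and group codes below are entirely hypothetical:

```python
import numpy as np

# Hypothetical validation records: true label, model prediction, and a
# demographic group code for each image.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 0])
group  = np.array(["a", "a", "a", "b", "b", "b", "b", "a", "a", "b"])

# Per-group accuracy: a large gap between groups flags a biased model.
per_group = {g: float((y_pred[group == g] == y_true[group == g]).mean())
             for g in np.unique(group)}
```

In this toy example the model is perfect for group "a" but noticeably worse for group "b", exactly the kind of disparity a researcher should detect before using the model's output in substantive analysis.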
Privacy. If a model involves face detection, one may be able to identify individuals, violating their privacy. This concern is especially relevant in the study of contentious politics, as this capability means governments could engage in targeted repression by finding protesters in photographs and matching those faces to identifying information. Governments like Russia and China already deploy this technology to identify anyone in a crowd (Purdy, 2018), and some law enforcement agencies in the United States have adopted similar technology (Shaban, 2018). To protect individuals, researchers should not release photographs that could be used to identify them.
Researchers should also consider whether or not their research requires identifying particular individuals at all.

User Consent.
The concern about identifying individuals based on their faces segues into a third concern, informed consent. When a user makes his or her social media posts public, a researcher can reasonably assume that the user has provided consent to be studied, much in the same way driving on a public street provides data to traffic engineers. This assumption is more questionable for individuals who appear in photographs posted by others.

Model Opacity. Understanding the internal logic of deep learning models is an active area of research.
6 Case Studies

Self-presentation of Politicians in Social Media
Social media have been widely used by politicians for political communication (Stier, Bleier, Lietz, & Strohmaier, 2018). Like ideology, gender represents another axis along which politicians may vary their self-presentation. Regardless of gender, voters prefer attractive candidates (Ahler, Citrin, Dougal, & Lenz, 2017; Mattes & Milazzo, 2014). This evaluation then maps onto gender, with female candidates stereotyped as warm and male candidates as strong (Johns & Shephard, 2007). Voters in the United States reward female candidates who appear more feminine (Carpinella et al., 2016). Regardless of ideology, we therefore expect that female candidates will emphasize physical features more than male candidates.

The detected labels serve as concise semantics describing image content, allowing researchers to perform standard statistical analysis. In this example, we compare images of Democrats and Republicans by conducting a chi-squared test on the labels. To characterize such visual priorities, we measure the cross-party difference for each label by comparing the number of images with and without the label. Since the two parties have different gender ratios, we perform the chi-squared tests on male and female images separately.
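The per-label test described above amounts to a chi-squared test on a 2x2 table of label counts by party. The counts below are hypothetical placeholders, not the paper's data:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts for one label: images with and without the label,
# split by party. Real counts would come from the Label Detection output.
#                 with label, without label
table = np.array([[120, 380],    # party 1
                  [200, 300]])   # party 2
chi2, p, dof, expected = chi2_contingency(table)
```

Repeating this for every detected label and sorting by the chi-squared statistic reproduces the ranking logic behind Tables 1 and 2: labels at the top of the list are the ones whose usage differs most sharply between the parties.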
The results are shown in Table 1 (male) and Table 2 (female). The labels associated with each party are sorted in decreasing order of the chi-squared statistic.

Figure 5
Example results (input images and automatically detected labels) from Google's Label Detection API.

Female candidates' images are associated with appearance-related labels (long hair, pink, beauty). These two sets of results suggest that politicians optimize their self-presentation by combining partisan values and gender stereotypes (Bauer & Carpinella, 2018).
For an example of unsupervised learning using the politicians' images, see Appendix F. That example runs k-means clustering (k=200) on the penultimate layer of a pre-trained CNN. The resulting clusters contain very similar, often identical, images, revealing common themes within an image corpus.
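A sketch of that clustering step, using synthetic stand-in vectors in place of real penultimate-layer activations and a much smaller k than the appendix's 200:

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in features: in the paper these are penultimate-layer CNN
# activations; here we draw synthetic 64-dimensional vectors tightly
# grouped around five hypothetical "themes".
rng = np.random.default_rng(1)
centers = rng.normal(size=(5, 64))
features = np.vstack([c + 0.05 * rng.normal(size=(40, 64)) for c in centers])

# k would be 200 in the appendix; 5 suffices for this sketch.
km = KMeans(n_clusters=5, n_init=10, random_state=1).fit(features)
cluster_sizes = np.bincount(km.labels_)
```

Inspecting a handful of images from each cluster is then a cheap way to surface recurring visual themes in a corpus without any manual annotation.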
Using a pre-trained classifier or API is a simple yet effective way to perform a comparative visual analysis on an unfamiliar domain. Researchers do not need to prepare any training data or annotations or train their own models. The key disadvantage of using an existing classifier is limited customizability when a researcher wants to classify concepts not defined in the classifier (or API). One solution is to train a custom classifier using one's own annotations, which we show in the next example.

Frame Alignment During Protest
Protests are a key tactic of social movements, recruitment to protest affects the probability of success (Snow, Rochford Jr., Worden, & Benford, 1986), and how they are portrayed to bystanders ("framed") is a key input into recruitment success (Benford & Snow, 2000). This example demonstrates that Twitter users frame protests in ways likely to encourage bystanders to join.
Protesters seek to frame events to appeal to the greatest number of people. For example, labor organizers and the family of Mohammed Bouazizi, the Tunisian fruit vendor whose self-immolation sparked the Arab Spring, transformed his death into a parable about corruption and gender politics in a way that bridged class and geographic divides (Lim, 2013). From the other side, states portray protesters as radical, foreign, violent, or some combination thereof (Hamdy & Gomaa, 2012). This framing delegitimizes a protest, decreasing the cost a state pays if it engages in repression (Stephan & Chenoweth, 2008).
The rise of the internet and social media has empowered individuals to construct frames, weakening media and activist gatekeepers (Livingston & Bennett, 2003). A new logic of connective action now means that personal action frames are commonly invoked during social movements, as they allow individuals to connect their issues with a larger collective (Bennett & Segerberg, 2013). This ability is especially important because the primary sources for framing movements, newspapers, prefer to report on violent events (Hellmeier, Weidmann, & Geelmuyden Rød, 2018) and often have a status quo bias, causing them to frame protests differently than protesters would frame themselves (Hamdy & Gomaa, 2012).
The ability of individuals to construct and disseminate their own frames is especially important because newspapers and television emphasize protester violence (Myers & Caniglia, 2004). Media are especially likely to negatively frame events seen as threatening status quo institutional interests, whether in democracies (Gitlin, 1980; Wittebols, 1996) or elsewhere.

In addition to emphasizing state violence, individuals should prefer to frame a protest as a collective endeavor. Because the risk of protesting decreases as the size of the protest increases, bystanders are more likely to join a protest they believe is already attended by large crowds. A large crowd decreases the probability that an individual will suffer reputational costs or be the victim of state repression (Moore, 1995). Since crowds create a positive feedback loop of mobilization (Biggs, 2016), we expect that protest images shared on Twitter will frame the event as containing crowds, not individuals.
This subsection investigates these two expectations about framing by analyzing protests in five countries. 5 We operationalize frames according to six labels. Many types of frames are chosen to normalize a protest. Protests that are peaceful or that mobilize multiple types of participants often include pictures of youth or of participants' faces, the first two labels. Participants will often share images of large groups to convey that the issue being protested is not fringe, while small groups tend to convey personal action frames (Bennett & Segerberg, 2013); these two types of groups are labels three and four.
Because previous literature has identified violence as a key frame (Myers & Caniglia, 2004), we also generate protester and state violence labels, the final two. Figure 6 shows sample images and their ratings for protester and state violence.
To measure which frames protesters choose, we then detect duplicate images and identify the rate of duplicate images within each label. To identify duplicate images, we take each image's last fully connected layer, a 1,000-dimensional feature vector, and measure the pairwise distance between that vector and every other image's vector. If that normalized distance is below 0.2, a threshold chosen by inspecting the distance histogram, the two images are considered duplicates.
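One plausible implementation of this duplicate-detection step, with toy three-dimensional vectors standing in for the 1,000-dimensional CNN features:

```python
import numpy as np

def duplicate_rate(features, threshold=0.2):
    """Fraction of images with at least one near-duplicate.

    `features` holds one feature vector per image (in the paper, the
    CNN's last fully connected layer). Vectors are L2-normalized so the
    distance threshold is scale-free; 0.2 follows the histogram
    inspection described in the text."""
    norm = features / np.linalg.norm(features, axis=1, keepdims=True)
    dist = np.linalg.norm(norm[:, None, :] - norm[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # ignore self-comparisons
    return float((dist.min(axis=1) < threshold).mean())

# Three toy images: the first two nearly identical, the third distinct.
feats = np.array([[1.0,   0.0,  0.0],
                  [0.999, 0.01, 0.0],
                  [0.0,   1.0,  0.0]])
rate = duplicate_rate(feats)
```

Computing this rate within each label then gives the per-label duplication (frame alignment) figures compared across events in Table 3.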

Figure 6
Sample images and their ratings for protester and state violence.

Table 3 provides initial support for the claims made about framing, violence, and crowds. In three of the five protests, images containing state violence are shared more.
Images of groups are also shared at higher rates in three of the events, though not the same three that frame state violence. The protests framed more strongly as containing state violence (Catalonia and Venezuela) also emphasize the group nature of protests.
For an example of the images driving frame alignment, see Figure 7. It shows the four most duplicated images in our sample; two use the small group frame, one uses a sign frame (a pleasant surprise, because it is not a frame we expected to be prominent), and one uses a state violence frame.

Figure 7
The four most common images causing frame alignment in our sample. The top row, and the two most shared, are from Venezuela and use a small group frame. The bottom left image is from South Korea and uses a frame, sign usage, not shown in Table 3. The bottom right is from Catalonia and is an example of the state violence frame.
Because policy makers are more likely to respond to protests the more that protesters put forth a consistent frame (Wouters & Walgrave, 2017), higher rates of duplication may indicate episodes of greater frame alignment, both across events and within event labels (Ketelaars, Walgrave, & Wouters, 2017). For example, the divisiveness of protests in Russia may be reflected in the lower rates of duplication of state and protester violent frames in comparison to Catalonia, Spain and Venezuela.
While protest success is the result of multiple factors, the ability to measure framing across countries may contribute to understanding when they succeed or fail.
These results are provisional: this example demonstrates additional understanding about protest framing that computer vision techniques can generate, but it should not be considered a definitive answer. We have suggested one way of measuring framing, but future work should explore other operationalizations such as number of tweets containing a label (instead of percentage) and expand the frames considered. This analysis also discards temporal variation, which is almost certainly an important determinant of when certain frames receive emphasis. Which frames receive emphasis may also be affected by city and country correlates that we do not consider.
The results in Table 3 reveal interesting variation warranting further exploration.
Across events, the most obvious difference is that each event exhibits a different baseline amount of framing intensity. For example, Venezuela shows the highest framing intensity (duplication rate) across all labels, and the rank correlation of events across labels is quite high. The relative rates of duplication, moreover, vary significantly: the most duplicated event-label, Venezuela state-violence, resonates almost 103 times as much as the least, photos from Hong Kong with children. Two possibilities are that frame alignment increases with the violence of state repression or with social media penetration, which raises the rewards to frame alignment. Within each event, violent images are duplicated the most, with images of state violence shared more in 3 events, protester violence in 2, and a tie in Ukraine. Individuals also prefer to frame protests in terms of groups as opposed to individuals, as evidenced by the higher duplication rates for the group labels versus the child and faces labels. That the rank ordering of frames within events appears to correlate across events suggests a hierarchy of protest frames, implying that forces beyond just the presence of professional organizations affect frame alignment (Ketelaars et al., 2017).

Conclusion
If a picture is worth 1,000 words, then it would require approximately two kilobytes of storage (Jagenstedt, 2008). Images from consumer cell phones and digital cameras, however, require at least three megabytes of storage, usually more. Even images shared on social media platforms, which are compressed from their original size, require hundreds of kilobytes of space. A picture, in other words, is worth anywhere from 50,000 (100 kilobytes) to 1,500,000 words (3 megabytes). A picture is actually worth a book. 6

This paper has argued that recent advances in computer vision, namely deep convolutional neural networks, hold much promise for the study of politics. Images pervade political life, and analyzing them in large quantities can inform research in behavior, communication, development, and conflict. The paper then introduced deep learning methods and how to validate model output. These techniques are especially promising for the study of protest, and an example analyzed six protests. The use of large, passively collected datasets raises new ethical issues of which researchers should be aware, especially when the data are images.
The increasing prevalence of digital technology has led to a greater appreciation of the importance of images in political life. Images make arguments, set agendas, document and dramatize events, activate emotions, shape perceptions, build identity, generate social cohesion, build empathy, and strategically create ambiguity (Schill, 2012). Whereas pedagogy, communication, and academic analysis have traditionally focused on acquiring textual information, cheap computing means that individuals consume and produce increasing amounts of visual information (Kraidy, 2002). Images are key drivers of political phenomena, and we would do well to take advantage of new techniques to analyze them in large quantities in research.
Notes

1 A layer is a separate operation or a collection of internal nodes placed at the same stage in a network. Layers are elaborated below; the Supplementary Materials discuss the different types of layers in neural networks.
2 Google has not published details of the Vision API's architecture, though it is safe to assume that it is based on a CNN. It is concerning that users are not informed about these details; we discuss these issues, for example model biases and interpretability, in the Ethics section. We recommend this API provided that researchers are aware of the potential issues and validate it for their purposes (Appendix D), e.g., by measuring the accuracy of the API against manual annotations.

Table 1
Top 20 labels associated with Democrats (left) and Republicans (right). p_D and p_R indicate the ratio of images containing the label for each party. Male candidates only. (*** p < 0.001, ** p < 0.01, * p < 0.1)

Appendix A Comparing Text to Images
Visual data differ from text data in ways summarized in Table A1. The most critical distinction between them is that, since words are the units of meaning in texts and are easier to define than objects in images, it is easier to process text than images.
An image's constituent elements, pixels, carry no meaning, as opposed to text data, whose atomic elements are words. In other words, texts contain less uncertainty about meaning than images, and the five differences in Table A1 flow from that distinction. It is simple to give a computer a text building block: a word is any sequence of characters bounded by a punctuation or space character. Word detection is therefore the text equivalent of object detection in images. A single word can provide a great deal of semantic information (e.g., "Trump" or "election"), and a simple string comparison operation allows one to access that information. In contrast, one pixel, and even small groups of pixels, are meaningless. In visual analysis, one has to process a huge number of meaningless pixels to detect and identify people, objects, and events. Recognizing elementary content, visual "words," from an image is, however, extremely difficult. This technical difficulty has been the main obstacle to research involving quantitative analysis of visual data on a large scale.
It is also easier to build meaning from a collection of words than from pixels because words are arranged in one dimension, whereas pixels spread across two. The simplest text models take a bag-of-words approach, where the order of words does not matter; while more complex models perform better, bag-of-words models are nonetheless useful. A bag-of-pixels model would fail, however, since each pixel is meaningless.
Visual models therefore need to identify groups of pixels. Groups are identified using sliding windows, and these windows vary in two dimensions. Even then, groups of pixels are ambiguous: a patch of white pixels surrounded by other white pixels could be a dress shirt, or it could be part of a flag. Brown pixels separated from other brown pixels by 100 other pixels could be two eyes, but they could also be two shoes or two coffee cups. Because there is no easy definition of objects in images, it is harder to infer meaning from images than text.
Because words have clearer meaning than pixels, text files require less space than images. For example, images in tweets require, on average, 100 kilobytes of storage space. A tweet cannot contain more than 280 characters, which require .28 kilobytes of space. A tweet of 100 kilobytes could contain 100,000 characters. The smaller size of texts means they are easier than images to store, share for replication, and, most importantly, analyze.
Because there is not a universal verbal language, object detection in images is more universal than meaning detection in texts. For example, the vast majority of faces contain two eyes, two ears, a nose, mouth, and forehead. The words for these facial features, however, vary across languages. An image classifier trained to detect faces is therefore more likely to detect all faces than a text classifier trained on one language, such as English, is to detect facial words in another language's text. The lack of structure to images at the pixel level is therefore a blessing and a curse: it is a curse because building and training image classifiers is harder than for text, but it is a blessing because an image classifier is more broadly applicable than a text one.

Satellite imagery, for example, has long been used to measure socioeconomic attributes (Jensen and Cowen, 1999). Imagery with a resolution of one meter or smaller can provide data on socioeconomic characteristics as they vary by neighborhood, allowing for frequent census-like data creation, an ability especially useful in countries with no, or irregular, censuses (Tapiador, Avelar, Tavares-Correa and Zah, 2011). For agricultural areas, it can measure changes in rainfall and crop growth, proximate measures of income for many countries (Toté, Patricio, Boogaard, van der Wijngaart, Tarnavsky and Funk, 2015). Since income shocks are a precursor to civil conflict, data that accurately measure subnational changes in income could act as an early warning system (Hsiang et al., 2013).
It is possible to measure socioeconomic variables using photographs of places taken by people. Manual analysis of Google Street View (GSV) imagery shows that photographs of streets correlates strongly with survey based measures of neighborhood attributes (Odgers, Caspi, Bates, Sampson and Moffitt, 2012;Wilson, Kelly, Schootman, Baker, Banerjee, Clennin and Miller, 2012). A model trained on GSV images recovers income by block in New York City (Glaeser, Kominers, Luca and Naik, 2018), and a deep learning model of cars in GSV images can measure income, race, and education at the precinct level (Gebru, Krause, Wang, Chen, Deng, Aiden and Fei-Fei, 2017). Another promising approach is to pay people to take photographs of specific phenomena, such as the price of goods at a supermarket or the prevalence of anti-incumbent signs at a protest (Premise Data, 2017). Paying people to capture images is especially useful in areas with otherwise insufficient publicly available data.
Natural Disasters. Image data also provide access to temporal changes in local regions. For example, a model that accurately recovers built features of towns and cities could provide insight into how institutions affect recovery from natural disasters.
If images exist of the same area immediately before and after a natural disaster, the physical and geographic extent of damage as well as the speed and amount of recovery may be measurable. These dependent variables may then be related to various institutional ones. Recovery may occur more quickly in democracies than non-democracies or in countries with free media, for example. In democracies, subnational variation could depend on whether a disaster strikes a powerful politician's district or if there is an impending election.

Contentious Politics
Civil War. Using computer vision, greed and grievance can be measured with more geographic and temporal precision (Collier and Hoeffler, 2004;Kern, 2011). Those two concepts are notoriously difficult to operationalize, and researchers rely on imperfect measures such as the availability of natural resources (greed) or aggregate economic statistics such as gross domestic product (economic grievance). For example, greed is measurable using the precise outline of diamond mines, virgin forests, or oil deposits, and their depletion can be observed from satellite data or resource maps (Hunziker and Cederman, 2017). Grievance is reflected in city-level variation in economic activity measurable using light emissions (Weidmann and Schutte, 2017).
Whether these measures are better than existing datasets will depend on the dataset and country on which the researcher is focused.
State Capacity. Images can also be used to measure state capacity.
Humans-as-sensors can take photographs of specific objects, such as prices in markets (to measure inflation), road conditions, or school conditions, using smart phones (Premise Data, 2017). These images can give disaggregated information about a state's ability to repress intranational conflict, as well as the ability of rebels to attack the state. Maps are also images, and digitizing them can provide historical data on state capacity, especially power projection, that current measures, such as GDP, may not capture (Hunziker, Müller-Crepon and Cederman, 2018).

Figure C1
An example computation in a node and its connected nodes. Figure C1 shows an example configuration of a node and its connected nodes.
Each node takes input values from nodes in the preceding layer and evaluates a weighted sum using the weights associated with its edges (in this example, 1 · 0.7 + 0.5 · −0.3 + 0.3 · 1.0 = 0.85). Typically this value is transformed by a non-linear activation function, e.g., a sigmoid or rectified linear unit (ReLU), and then passed to the output node. For example, these input values might be an individual's values for gender (x_1), race (x_2), or income (x_3), and the output variable might be political ideology.
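The computation described above can be sketched in a few lines of NumPy (a toy illustration reproducing the Figure C1 numbers; the function names are ours, not from any library):

```python
import numpy as np

def relu(z):
    """Rectified linear unit: keep positive values, zero out negatives."""
    return max(0.0, z)

def node_output(inputs, weights, activation=None):
    """Weighted sum of inputs from the preceding layer, optionally
    passed through a nonlinear activation function."""
    z = float(np.dot(inputs, weights))
    return activation(z) if activation else z

# The example from Figure C1: inputs (1, 0.5, 0.3), edge weights (0.7, -0.3, 1.0)
z = node_output([1.0, 0.5, 0.3], [0.7, -0.3, 1.0])            # weighted sum: 0.85
a = node_output([1.0, 0.5, 0.3], [0.7, -0.3, 1.0], relu)      # after ReLU, unchanged here
```

A full network simply chains many such nodes, layer by layer.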

Figure C2
An example architecture of a neural network with an input layer, an output layer, and two hidden layers. Figure C2 shows an example architecture of a neural network with several layers.
Neural networks with multiple hidden layers are considered "deep." A layer in a neural network is a set of nodes which takes inputs from the nodes in the previous layer and delivers outputs to the nodes in the next layer. When a network is visualized as in Figure C2, a column of nodes is a layer, and the number of columns is the number of layers. Inputs to the whole network therefore undergo several steps of transformation through the layers until they reach the output layer of the network. The output layer is the network's final layer, and, in the case of classification, it contains one node per desired label.
Hidden layers are intermediate layers between the input and output layers in a network whose true values are not observed during training. They play a critical role in modeling complex concepts by giving deeper networks greater expressive power. Studies have shown, both experimentally and theoretically, that the more layers a neural network has, the better performance it can achieve (Eldan & Shamir, 2016; Poggio, Mhaskar, Rosasco, Miranda and Liao, 2017). A drawback of having too many layers is that such a model is more difficult to train, e.g., due to vanishing gradients (Bengio, Simard and Frasconi, 1994). 7

Figure C3
An example of a convolutional neural network architecture.

Convolutional layer
A convolutional layer in a CNN applies a filtering operation ("convolution") to its input, which is either raw image data or the output of the previous layer. Convolution is widely used in signal processing for transforming or comparing time series data. For example, one can reduce noise in an audio signal by convolving it with a Gaussian filter, which smooths the original signal by blending its value at time t with the values at adjacent time points around t.
Formally, the convolution of two functions, f and g, is another function defined by

(f ∗ g)(t) = ∫ f(x) g(t − x) dx.

The second function, g, is called a kernel. Note that the kernel is flipped (g(t − x)) by the definition of convolution. In the discrete case, convolution computes the sum of element-wise multiplications between the two functions, with one function shifted over time:

(f ∗ g)(t) = Σ_x f(x) g(t − x).
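The smoothing example above can be reproduced with NumPy's `np.convolve` (a toy signal and kernel of our choosing, for illustration only):

```python
import numpy as np

signal = np.array([0.0, 1.0, 0.0, 5.0, 0.0, 1.0, 0.0])  # spike at t = 3
kernel = np.array([0.25, 0.5, 0.25])                     # smoothing kernel (sums to 1)

# np.convolve flips the kernel, matching the textbook definition g(t - x);
# mode="same" keeps the output the same length as the input.
smoothed = np.convolve(signal, kernel, mode="same")
# The spike at t = 3 is blended with its neighbors: 0.25*0 + 0.5*5 + 0.25*0 = 2.5
```

Because the kernel sums to one, the total "mass" of the signal is preserved while sharp spikes are spread over adjacent time points.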

Figure C4
Illustration of computations in a convolutional layer.

Each convolutional layer in a CNN uses a convolution operation in order to
compare the input data with the kernels (also called filters in the deep learning literature) in the model. In practice, the kernel is not flipped in most implementations, as flipping is unnecessary for the purposes of a CNN. 9 Not flipping the kernel creates a slightly modified definition of convolution of a two-dimensional input I and a two-dimensional kernel K:

F(x, y) = Σ_i Σ_j I(x + i − 1, y + j − 1) K(i, j),  i = 1, …, h,  j = 1, …, w.

I(x, y) and K(x, y) denote the element in the xth row and yth column of the matrices I and K. h and w denote the height and width of the kernel K, and, typically, CNNs use square kernels (h = w). The result of the convolution is another 2D array, F, which is called a feature map. The feature map is the output of the convolutional layer, and (with padding) it has the same spatial dimensions as the input data. This computation is performed at every location in the input map and the result is stored in the same location in the output feature map (see Figure C4).
Most images are three-dimensional data with two spatial dimensions and an additional dimension of color (e.g., RGB). Feature maps in each layer are therefore also three-dimensional as each individual feature map (also called a channel) corresponds to the response from a specific kernel (filter). Each filter describes a specific pattern to be detected from an input from the previous layer. The entire weight parameters of each convolutional layer (K) are therefore represented by a four-dimensional array of size (w, h, m, n), where m is the number of channels of the input (the number of channels in the previous convolutional layer) and n is the number of channels in the current layer.
The number of channels (feature maps) in each layer is arbitrary and typically ranges from 32 to 1,024, except for the input's color channels (3). The feature map for the nth channel is therefore obtained as follows:

F_n(x, y) = Σ_m Σ_i Σ_j I_m(x + i − 1, y + j − 1) K(i, j, m, n).   (C4)

Convolutional layers enable the following two key properties of convolutional neural networks.
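The single-channel, unflipped 2D convolution described above can be sketched directly in NumPy (a toy input and a hypothetical edge-detection kernel of our choosing, not a real model's learned weights):

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' cross-correlation of a 2D input with a 2D kernel,
    as used in CNN convolutional layers (kernel not flipped)."""
    h, w = kernel.shape
    H, W = image.shape
    out = np.zeros((H - h + 1, W - w + 1))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            out[x, y] = np.sum(image[x:x + h, y:y + w] * kernel)
    return out

# A vertical-edge kernel responds strongly where intensity changes left to right.
image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]], dtype=float)
edge_kernel = np.array([[-1.0, 1.0],
                        [-1.0, 1.0]])
feature_map = conv2d(image, edge_kernel)  # activates only at the dark-to-light edge
```

Real implementations vectorize this computation and add the channel dimensions of Equation C4, but the arithmetic per location is the same.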

Weight sharing. In Equation C4, the kernel is invariant to the location of each input node (x, y). Therefore, the same kernel applies to every location of the input map, and the connections between two layers (the input and output nodes of each convolutional layer) share the same weights. Weight sharing is effective because an object may appear in any location of an image and its appearance is invariant to its placement. Weight sharing also reduces the number of free parameters in the network and makes it easier to train.
Local and sparse connectivity. Convolutional layers in a CNN achieve sparse connectivity by using a kernel much smaller than the input map (h, w < 10, usually). Each node in a convolutional layer is only connected to a small number of nodes in the previous layer, i.e., a local region. The kernel can be small because adjacent pixels and subregions of an image are more highly correlated than distant regions.

Nonlinear Layer
Each convolutional layer is typically followed by a nonlinear activation function that applies to each element in the feature map. One of the most common activation functions is the rectified linear unit (ReLU):

ReLU(z) = max(0, z).

This function simply replaces negative feature map values with 0 and keeps positive values. Other functions, such as the sigmoid or hyperbolic tangent function, can also be used. The main advantage of the ReLU is that it is much faster to compute than those functions.
The nonlinearity of visual models is important because it allows them to capture complex data distributions. Visual data, as 2D projections of the world, are highly nonlinear due to many factors, such as occlusion, object deformation, and camera exposure saturation. Human visual systems are capable of processing this nonlinearity. Nonlinear layers are especially essential in deep networks because consecutive layers of linear operations collapse into one linear layer; without nonlinear functions, there would thus be no benefit to adding more layers to the network.

Pooling layer
Pooling is another important operation in convolutional neural networks since it reduces computational complexity. A pooling layer takes an input feature map from the previous layer and generates a transformed map, typically smaller than its input.
Most images and feature maps in a CNN are spatially correlated: values in closer pixels or nodes 10 tend to be more similar than those far away. Instead of keeping similar values redundantly from adjacent locations, one can simply choose the maximum response (or the average value) in each spatial neighborhood (pooling window) to represent the area.

Figure C5
Illustration of a max-pooling operation of the window size 2 × 2. For each window, only the maximum value will be retained.
Specifically, a max pooling layer compares values in each sub window (e.g., a 2 × 2 window of pixels) of the input feature map and chooses the maximum value (see Figure C5). Only these maximum values will be stored in the output map; the other values are disregarded. Removing non-maximum values also means that the resulting feature map will be of a smaller size than the input map. For example, an input image of size 256 × 256 will be downsampled to 16 × 16 after applying 4 max-pooling layers of size 2 × 2. During the process, the information originally encoded in the spatial dimension in images will be translated into the non-spatial dimension in the feature map, e.g., 16 × 16 × 1,024.
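The max-pooling operation in Figure C5 can be sketched in NumPy (toy values of our choosing; real implementations are vectorized):

```python
import numpy as np

def max_pool(feature_map, size=2):
    """Max pooling: keep only the maximum value in each non-overlapping
    size x size window, discarding the rest."""
    H, W = feature_map.shape
    out = np.zeros((H // size, W // size))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = feature_map[i * size:(i + 1) * size,
                                 j * size:(j + 1) * size]
            out[i, j] = window.max()
    return out

fmap = np.array([[1.0, 3.0, 2.0, 0.0],
                 [4.0, 2.0, 1.0, 1.0],
                 [0.0, 0.0, 5.0, 6.0],
                 [1.0, 2.0, 7.0, 8.0]])
pooled = max_pool(fmap)  # 4x4 input downsampled to 2x2
```

Each application of a 2 × 2 pooling layer halves both spatial dimensions, which is why four of them reduce a 256 × 256 input to 16 × 16.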
One main difficulty in visual learning is the high geometric variation of objects and parts arising from part movements and viewpoint changes. A robust computer vision system needs to handle such geometric variations, and pooling operations help by disregarding small spatial perturbations within the pooling window. Pooling thus not only reduces the number of free trainable parameters but also helps the network achieve translation invariance, an important property for computer vision systems. In the case of classification, the fully connected layer(s) in a CNN are usually followed by a softmax function, which normalizes the final classification scores over categories. This procedure is the same as in multinomial logistic regression.
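The softmax normalization mentioned above can be sketched as follows (the input scores are illustrative, not from a real classifier):

```python
import numpy as np

def softmax(scores):
    """Normalize raw classification scores into probabilities that sum to 1.
    Subtracting the max first is a standard numerical-stability trick."""
    exp = np.exp(scores - np.max(scores))
    return exp / exp.sum()

# Hypothetical raw scores for three categories
probs = softmax(np.array([2.0, 1.0, 0.1]))
```

The largest raw score always receives the largest probability, so softmax changes the scale of the outputs without changing the predicted category.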

Appendix D Training and Validation
This section discusses practical issues in training a model and introduces tools to diagnose model performance. For technical details of training and validation, see Appendix C, which also provides precise definitions of technical concepts, such as weights, kernels, and loss functions, and their computations in greater detail.

Training
New Models. As in other machine learning methods, training a new model means using training data (labeled images) to estimate optimal values for model parameters.
Training a neural network means finding optimal values for the weights in the model (see Figure C1). In most cases, the objective functions of neural networks are non-convex and cannot be directly optimized, so training is conducted by a gradient descent method with the backpropagation algorithm (LeCun et al., 1989), alternating between forward and backward passes. 11 In the "forward" pass, given an input value, the network evaluates its output and computes the loss function based on the ground-truth output value, i.e., the image's labels or class. In the "backward" pass, the gradient of the loss with respect to each weight is propagated back through the network, and the weights are updated to reduce the loss.

There exist many types of loss functions. One can use a specific loss function or a combination of multiple loss functions depending on the task (classification, detection, or face recognition) and the output dimension (number of variables). In image classification, for example, the most popular loss function is cross-entropy loss, also called log-loss. In a binary classification task, the binary cross-entropy loss is

L(y, ŷ) = −[y log ŷ + (1 − y) log(1 − ŷ)],

where y ∈ {0, 1} is the true label for the example and ŷ ∈ (0, 1) is the output value computed from the model. In training, all the model parameters are optimized to minimize this loss function across the entire training set. Other loss functions can be used in other tasks; for example, mean square error loss can be used to estimate continuous outputs such as age.

Note that, after the 20th epoch, 12 the model performance saturates and the validation loss starts increasing although the training loss continues to decrease (Figure D1(a)). 13 This degradation arises because the model is fitted too closely to the training set. One can stop training at that point and take the final model. Using more training data can help avoid overfitting and train a better model (see Figure D1(b)).
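The binary cross-entropy loss above can be computed in a few lines (a toy illustration with made-up predictions, not the training code used in any of the paper's models):

```python
import math

def binary_cross_entropy(y, y_hat):
    """Binary cross-entropy (log) loss for a true label y in {0, 1}
    and a model output y_hat in (0, 1)."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# The loss is near zero for a confident correct prediction and grows
# without bound as the model becomes confidently wrong.
good = binary_cross_entropy(1, 0.99)  # confident and correct: small loss
bad = binary_cross_entropy(1, 0.01)   # confident and wrong: large loss
```

Minimizing the average of this quantity over the training set is what gradient descent and backpropagation accomplish.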
Pre-Trained Models. Deep learning usually requires a large amount of training data (1.28 million images in ImageNet (Russakovsky et al., 2015)) to be successful. It is usually not feasible for an individual researcher to collect such a large training set or to train a model that exploits those images' complexity. One method of overcoming this requirement is to take a model trained for another task on a larger dataset and apply it to the current task, for which only a small amount of data is available.
This is known as transfer learning and is the process we recommend others follow.

Figure D2
The effect of fine-tuning (using a pre-trained model): (a) classification accuracy trend, (b) validation loss trend. Figure D2 illustrates the benefit of fine-tuning: a model initialized from a pre-trained network achieves better performance with less training than a model trained from scratch. There exist many pre-trained models which are widely adopted as baselines for fine-tuning, such as AlexNet (Krizhevsky et al., 2012), Places365 (Zhou, Lapedriza, Khosla, Oliva, & Torralba, 2017), and VGG-Face (Cao, Shen, Xie, Parkhi, & Zisserman, 2018). Using one of these pre-trained models also facilitates topic discovery. By taking the last fully connected layer or the softmax layer of images run through a classifier, one can find similar images using any preferred clustering algorithm. The images in the clusters will contain similar features (pictures of John McCain, for example), suggesting they are about the same topic. Appendix Section F shows this approach to topic discovery using politicians' images shared on Facebook and k-means clustering.
Whether using transfer learning or training a new model, it is critical to ensure that the training data represent a diverse and balanced set of images before they are annotated so that recall is high for each desired label. For example, if one wants to collect images to train a protest event classifier, the set should contain enough protest images and non-protest images. This task may not be trivial if the target event occurs infrequently. If the task is well defined and clearly explainable by simple statements, one can crowdsource the annotation task using online services such as Amazon Mechanical Turk. If an annotation task requires more expertise, one should hire and directly supervise annotators. It is beyond the scope of this paper to discuss at length how to optimize the architecture of the CNN to be used (the number of layers in a model or the types of regularization), preprocessing, optimization methods, and other hyperparameters. In general, these are empirical questions whose optimal solutions vary by task.

Validation and Interpretation
Deep neural networks often receive criticism due to the lack of interpretability of their results and internal mechanisms compared to simple models with a handful of explanatory variables. A deep model typically comprises millions of parameters (see Table D1), and it is impossible to identify their meanings or roles from the classifier output.
One method of validation is to use a validation dataset which does not overlap with the training set. As in other classification problems, the accuracy of a CNN-based classifier can be measured by several metrics, including raw accuracy, precision and recall, or average precision, among others. These measures, however, do not explain how the model achieves its results.
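As an illustration of these metrics, they can be computed from a held-out validation set as follows (the labels here are hypothetical, not the paper's data):

```python
def classification_metrics(y_true, y_pred):
    """Raw accuracy, precision, and recall for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return {
        "accuracy": correct / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,  # of predicted positives, how many are right
        "recall": tp / (tp + fn) if tp + fn else 0.0,     # of true positives, how many are found
    }

# Hypothetical validation labels (1 = protest image, 0 = not)
metrics = classification_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
```

Reporting precision and recall alongside raw accuracy matters especially when labels are imbalanced, as protest images are relative to all images shared online.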

Language-based Interpretation
Just as humans use language to explain a concept, one can develop a joint model that incorporates visual and textual data such that the text explains its visual counterpart. For example, image captioning generates a sentence describing the visual content of an input image (Kiros, Salakhutdinov, & Zemel, 2014), and related methods generate text-based justifications to explain why a model produces particular outputs (Hendricks et al., 2016).
Another line of research on text-based interpretation of visual learning utilizes questioning and answering (Antol et al., 2015). Such methods take both an image and a text question as input and output a text-based answer to the input question. This allows a more flexible interface between a user and a model than a traditional classification task, which essentially asks a fixed question to the model.
The key limitation of these methods is that they do not generalize: they are unable to deal with novel content or questions. The models are trained on image-text pairs and simply reproduce the mapping learned from the training data. When the model is given a novel question which was not given during training, it will not understand the meaning of the question.

Visual Validation
Another method of understanding how a deep network produces its output is through visualization. Since convolutional neural networks are largely used for visual learning from images, visual validation is especially effective. We introduce the two most popular approaches: feature-based and region-based. Figure D3 provides examples of the feature-based approach, using a random sample of images from ImageNet. This approach uses a "deconvolutional" network (Zeiler & Fergus, 2014), which is akin to a reverse CNN. Figure D3 shows that visually similar image patches that contain the same image feature (left sub-panel) will trigger high activation scores in the same node in the network that captures the image feature.
The image feature can be visually identified from the feature activation maps (right sub-panel). Moreover, this visualization also confirms that the lower layers in a network respond to low-level visual features such as color or texture, while the higher layers capture more structured and semantically meaningful shapes ("face", "web").
The region-based approach is exemplified by Gradient-weighted Class Activation Mapping (Grad-CAM). Grad-CAM highlights pixels in an image based on how much they contribute to the final output of the model. See Figure D4 for an example visualization using this paper's protester framing example. Grad-CAM can confirm that the model was able to learn meaningful features, such as "smoke", to model the concept of "violence".

Figure D3
Visualization of feature activations at different layers in a CNN by a deconvolutional network (Zeiler & Fergus, 2014).

Software Libraries
There exist many open-source or commercial libraries and tools that researchers can use for visual content analysis in their projects. Compared to software for text analysis, these libraries are in general larger and have more complex internal structures, which are required to provide various image processing functionalities. Fortunately, there are a small number of standardized, popular libraries that can be adopted for computer vision and deep learning projects, which will be briefly reviewed in this section.
• OpenCV and dlib are currently the most popular computer vision libraries. They offer a wide range of basic image processing, computer vision, and machine learning functionalities. Python is best for OpenCV, though there is a light wrapper for it in R; dlib is accessible via both R and Python libraries.
• In case researchers simply want to use existing, already-trained classifiers without developing a model themselves, they can use commercial services through APIs. These options include the Google Vision API, Microsoft Vision API, Face++, and Amazon Rekognition. These services return labels for submitted images.
Appendix F

Self-presentation of Politicians in Social Media
One can also use an inductive approach by clustering a given set of images without any annotations or labels. Figure F1 shows example clusters obtained from images in the same dataset, not using the Google Vision labels. Specifically, we first computed generic image features using an image embedding from a CNN pre-trained on ImageNet. We ran the model on each image and obtained a numeric vector of length 2,048 from the activation values of the second-to-last layer of the CNN. Then we ran K-means clustering (K = 200) on these features.
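The clustering step can be sketched as follows (a minimal, self-contained k-means with toy 4-dimensional vectors standing in for the 2,048-dimensional CNN embeddings; a real analysis would use a library implementation and K = 200):

```python
import numpy as np

def kmeans(features, k, n_iter=20):
    """Minimal k-means: assign each feature vector to its nearest
    centroid, recompute centroids as cluster means, and repeat."""
    # Simple deterministic init: evenly spaced points (real libraries use k-means++)
    idx = np.linspace(0, len(features) - 1, k).astype(int)
    centroids = features[idx].copy()
    for _ in range(n_iter):
        dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centroids[c] = features[labels == c].mean(axis=0)
    return labels

# Toy stand-in for image embeddings: two well-separated groups of vectors.
features = np.vstack([np.random.default_rng(1).normal(0, 0.1, (5, 4)),
                      np.random.default_rng(2).normal(5, 0.1, (5, 4))])
labels = kmeans(features, k=2)  # recovers the two groups
```

With real embeddings, images assigned to the same cluster share visual features, which is what makes the inspection of clusters a useful discovery tool.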
By grouping similar images, one can identify clusters showing the various activities and events which politicians attend. A cluster of John McCain (the last cluster in Figure F1) arises because many politicians posted his photograph after his death on August 25, 2018. Clustering analysis is an effective way of discovering issues or topics which may be unknown to researchers prior to analysis. This example of unsupervised learning is very similar to unsupervised topic modeling in text analysis.

The Grad-CAM process is similar to the regular model training procedure. The closer to red an area of an image is, the more it contributes to the classifier output. Figure D4 shows that the classifier is driven by the parts of an image that a human would recognize as important for each category. For example, the protest label primarily activates on signs. Tear gas and police helmets drive the violence classifier, while a child's face, but not the nearby adults', drives the children classifier.