Gender, power and emotions in the collaborative production of knowledge: A large-scale analysis of Wikipedia editor conversations

This paper studies the conversations behind the operations of a large-scale, online knowledge production community: Wikipedia. We investigate gender differences in the conversational styles (emotionality) and conversational domain choices (controversiality and gender stereotypicality of content) among contributors, and how these differences change as we look up the organizational hierarchy. In the general population of contributors, we expect and find significant gender differences, whereby comments and statements from women are higher-valenced, have more affective content, and are in domains that are less controversial and more female-typed. Importantly, these differences diminish or disappear among people in positions of power: female authorities converge to the behavior of their male counterparts, such that the gender gaps in valence and willingness to converse on controversial content disappear. In contrast, we find greater sorting into topics according to their gender stereotypicality. We discuss mechanisms and implications for research on gender differences, leadership behavior, and conversational phenomena arising from such large-scale forms of knowledge production.


Introduction
Collaborative work would be unthinkable absent people's ability to converse in order to share information and to coordinate and motivate efforts. Conversations influence work, for instance through their effects on productivity and creativity (Huang, Gino, & Galinsky, 2015; Wu, Waber, Aral, Brynjolfsson, & Pentland, 2008). At the same time, conversations are also shaped by work processes. Expressions of emotions in natural collaborative production processes offer an important window into the psychology of work. They can inform our understanding of the differential motivations and experiences of various subgroups of workers, and how their presence might influence the broader organizational climate and culture (Cross & Madson, 1997; Schein, 2004). Women in positions of power are one important subgroup of workers about whom our knowledge is still limited, largely due to the unavailability of data.
Research has shown that men and women in the general population differ in their choices (Kugler, Reif, Kaschner, & Brodbeck, 2018), preferences (see Croson & Gneezy, 2009), and personality traits (Costa Jr., Terracciano, & McCrae, 2001;Feingold, 1994), but little is known about gender differences higher up in organizational hierarchies (Adams & Funk, 2012) and how they compare to gender differences at lower levels of the same organization. Our paper addresses this gap by observing conversations between individuals who jointly and voluntarily work on one of the largest knowledge production platforms, Wikipedia. Specifically, we address the following questions: Are there systematic differences in the expression of emotions by women and men, and in their choice of conversational topics in terms of domain gender stereotype and topic controversiality? Do possible gender differences persist as we shift our perspective to people in positions of authority? Are they amplified, or are they attenuated?
The responses to these questions are important from a gender perspective because they speak to the more fundamental question of whether some of the main effects of gender that have been found in previous research might result from confounding gender with underlying power differentials (see, e.g., Johnson & Helgeson, 2002). The answers matter from an organizational perspective because they allow us to shed further light on how conversations and verbal interactions are related to emotional experience and motivation in organizations (Herring, 2000; Pennebaker, Mehl, & Niederhoffer, 2003). We put a special emphasis on the role of an important and growing subgroup of workers: women who advance to the core of collaborative knowledge production. We study a rich trove of data, which covers a period of more than 15 years and contains 166,322 different discussion threads across 1,236 articles/topics on Wikipedia Talk pages (Prabhakaran & Rambow, 2016). Importantly, we have information on contributors' gender as well as their roles: general editors versus so-called "administrators" with greater decision-making power (e.g., the right to block and unblock other editors' accounts, to restrict or allow editing of certain Wikipedia articles, or to judge the outcome of certain discussions). Large-scale natural language datasets obtained from the Internet have proven extremely useful for understanding human behavior, with important applications in many fields, such as management (George, Osinga, Lavie, & Scott, 2016), public health (Hawn, 2009), cognitive science (Griffiths, 2015), marketing (Humphreys & Wang, 2017), and psychology (Harlow & Oswald, 2016; Kosinski & Behrend, 2017).
Our use of the Wikipedia conversations dataset, along with novel techniques from natural language processing and computational linguistics, allow us to analyze differences in the expression of emotions (valence, arousal, as well as overall degree of emotionality) and how they unfold across different levels of the organizational hierarchy (normal editors versus administrators). Since Wikipedia aims to cover the sum of all human knowledge (as opposed to technical and focused communities such as StackOverflow), and since people self-select into topics of their choosing (rather than being told what to work on by managers), we can, moreover, study differences in the gender stereotype of the domain and in the controversiality of articles that different editors choose to converse about. This allows us not only to analyze gender differences in conversational styles (the expression of emotions) among general editors and those in positions of power, but also in their conversational domain choice with respect to the topic's gender stereotype and controversiality.

Gender differences in emotionality
Previous research shows that men and women in the general population differ systematically in terms of their preferences (Croson & Gneezy, 2009), negotiation behavior (Kugler et al., 2018), and linguistic behavior (Carli, 1990; Mulac, 1998). With regard to emotionality, women have been found to use references to emotion (e.g., "I am happy") more frequently than men (Palomares, 2004), and to make more emotional and positive contributions in asynchronous computer-mediated communications (Guiller & Durndell, 2007). Although a large number of studies have drawn their observations from university students, it is possible to predict a communicator's gender with high accuracy from observing their language use (see, e.g., Mulac (1998) and more recent advances such as Schwartz et al. (2013)). Popular accounts such as Tannen's (1990) You Just Don't Understand: Women and Men in Conversation (a New York Times bestseller) even argue that men and women belong to different linguistic communities with stark differences in their conversational styles. But again, most observations stem from observing women from the general population, where power may be a confounding factor.
Such differences in emotionality may at least in part be explained by society's gender role beliefs (Eagly & Wood, 2012), or gender stereotypes, which lead to expectations for women to be communal (i.e., warm, emotional, supportive and caring) as opposed to agentic and dominant (e.g., Amanatullah & Tinsley, 2013; Eagly, 1987; Eagly & Carli, 2003; Williams & Tiedens, 2016). Gender role beliefs impact individuals' behavior through various mechanisms (Wood & Eagly, 2010). One important mechanism is social sanctions for counter-stereotypical behavior, also termed the backlash effect (Rudman, 1998; Rudman & Fairchild, 2004). Thus, women may refrain from displaying leadership behaviors and using the concomitant language in order to avoid negative evaluations due to the perceived gender-leadership role incongruity (Eagly & Karau, 2002). In many cultural contexts, gender-specific norms make it appropriate for women but not for men to express positive emotions (Brody, 2000).
Even absent others' knowledge of an individual's gender, such as in many online contexts, gender role beliefs can produce gender differences in behavior through internalization of a given gender identity (Wood & Eagly, 2015). It is therefore an interesting question what happens when we consider modern knowledge production contexts, where gender cues are much less salient because individuals work in large-scale online communities and are not co-located. Despite the reduced prominence of gender cues in these contexts, past empirical research (Kucuktunc, Cambazoglu, Weber, & Ferhatosmanoglu, 2012; Laniado, Kaltenbrunner, Castillo, & Morell, 2012) as well as the gender identity mechanism (Wood & Eagly, 2015) suggest that we can expect to find among the general population of editors gender differences in emotionality that are similar to those in offline contexts.

Gender differences in domain choice
A well-established research stream following Gneezy, Niederle, and Rustichini (2003) and Niederle and Vesterlund (2007) in economics shows that women shy away from competition and conflict (see, e.g., Bear, Weingart, & Todorova, 2014; Schneider, Holman, Diekman, & McAndrew, 2016, as well as earlier work by Stuhlmacher & Walters, 1999). Following this research, we would expect female editors to be less likely to engage in conversations about controversial topics. Another explanation for observing such a relationship between gender and controversiality would be that particular topics come to be regarded as controversial specifically because men gravitate towards discussing them. Research in personality psychology suggests that men on average are somewhat less 'agreeable' than women (e.g., McCrae, Terracciano, & 78 Members of the Personality Profiles of Cultures Project, 2005). Therefore, a high concentration of men might lead to a different tone in the discussion, such that the article becomes labeled as controversial. We will not be able to rule out such an interpretation, but consider the alternative more plausible: that women in the general population of editors are more reluctant to engage in controversial content discussions. Recent survey evidence on Wikipedia editors indeed suggests that female editors display greater avoidance of conflict than men (Bear & Collier, 2016).
Similarly, albeit focused on non-work contexts, it has been suggested that men and women in the general population differ in their choice of conversation topics (Bischoping, 1993). Moreover, gender-incongruent situations may lead to increased anxiety, role conflict, backlash, and avoidance (Bem & Lenney, 1976; Luhaorg & Zivian, 1995; Rudman, 1998). We therefore expect to find a gender-specific separation of labor, whereby female editors from the general population are more likely to converse on female-typed content, while male editors are more likely to converse on male-typed content. Such domain-specific sorting by gender should be reinforced by differences in previously accumulated expertise (e.g., somebody with expertise in the arts will be more likely to contribute to articles related to the arts). If this is the case, we may expect the same domain-specific gender difference to persist as we consider editors in positions of power. This would contrast with the previously discussed gender differences in emotionality and article controversiality, as further discussed below.

The gender gap across the organizational hierarchy
Understanding whether systematic differences between men and women persist as we look up the organizational hierarchy is important because it speaks to whether gender differences found in the general population are absolute, or whether they may have been partly confounded with related differences in status and power (Johnson & Helgeson, 2002;Watson, 1994). Moreover, from a practical perspective, analyzing gender differences at the top of organizational hierarchies advances our understanding of the implications of increased female participation in organizational leadership (Adams & Funk, 2012). Differences in the expression of emotions and in the domain choices made by men and women in power have implications for the broader organizational culture (e.g., through the expression of emotions) and functioning (e.g., if female leaders were to avoid controversy).

Emotionality
Prior research suggests that the differences in male and female leaders' styles are merely "mild shading" and that general similarities in style prevail (see Gipson, Pfaff, Mendelsohn, Catenacci, & Burke, 2017 for a recent survey of the literature). Moreover, there appear to be no significant differences between female and male leaders' demonstrations of emotional intelligence competencies (Hopkins & Bilimoria, 2008). Elevated power has been found to be associated with increased freedom and more socially disinhibited behavior (Keltner, Gruenfeld, & Anderson, 2003). Thus, women in positions of authority may be less bound by the female gender role. We therefore expect to find smaller differences in the expression of emotions (valence, arousal) by women and men in positions of power, compared to the differences in the general population of editors.
Extending the analysis to the expression of mental states beyond emotions and, specifically, considering the extent to which reference is made to cognitive as opposed to affective processes, we do not have a prediction about whether female leaders would behave differently from women in the general population of editors. But in a context of online knowledge work, where at baseline comments can be expected to be significantly more cognitively-loaded and hence less affect-based, we do expect a pattern whereby comments from women are generally more affective than comments from men. Whether women's reference to cognitive processes changes as they come into positions of power is an empirical question that we will analyze.

Domain choice
Using a survey of directors, Adams and Funk (2012) show that several of the well-established gender differences found in the general population no longer hold or are even reversed when looking at female and male directors. Notably, female directors in their sample are more risk tolerant and less security- and tradition-oriented than their male counterparts. Translated to our context, this suggests that women in positions of power may be more likely than women in the general population to engage in conversations about controversial content, such that the gender gap may disappear. However, to the extent that women have greater knowledge of stereotypically female content, the gender gap in topic choice (male- vs. female-typed) may remain.
Hence, overall, we expect to find smaller or no gender differences in the expression of emotions (valence, arousal) and the choice of engaging in controversial content discussions. We conjecture that this will be driven by women converging to the behavior of their male counterparts as they come to occupy positions of authority. An intriguing question also for future research is what accounts for any potential closing of the gender gaps.

Mechanisms
There are three non-exclusive mechanisms by which gender differences may disappear when considering men and women in power. The first is a treatment effect of the position of authority on behavior and possibly preferences (see Magee and Galinsky (2008) for a review of the effects of power on individuals' psychological states and behavior). This would suggest that the position of authority mutes gender differences: putting women in positions of authority allows or compels them to express less positive emotion and to engage more in controversial content discussions. As the experience of power makes individuals more goal-directed and more likely to take action (Galinsky, Gruenfeld, & Magee, 2003), they may devote less attention to other dimensions, such as conforming with their gender role. Power has been found to make individuals less likely to consider others' perspectives (Galinsky, Magee, Inesi, & Gruenfeld, 2006), which may also reduce women's awareness of (or concern about) social expectancies related to their gender role.
Besides this explanation of a treatment effect of the position of authority, we consider two forms of sorting, whereby female editors who display a stereotypically male emotional tone or choice of domain are more likely to seek or find themselves in positions of power. Hence, the second mechanism is self-selection (in line with occupational sorting à la Polachek (1981)). This is a supply-side factor or, as referred to by psychologists, an intrapersonal effect (Gino, Wilmuth, & Brooks, 2015), in that it takes place within ("intra") the individual and reflects the person's own decisions. For instance, recent research shows that women see professional advancement as less desirable, and that they seem to be less status-seeking than men (Huberman, Loch, & Önçüler, 2004). [4] In short, it is possible that women who seek to advance to positions of authority systematically differ from women in the general population.
The third mechanism is social selection by the majority-male population of editors. [5] This would correspond to demand-side factors, or interpersonal effects, which may or may not be conscious. In contrast to intrapersonal effects, interpersonal effects take place at the intersection between ("inter") individuals and refer to others' decisions. This explanation is in line with the argument that women must act like men to climb organizational hierarchies and be successful (Branson, 2006). In virtual collaboration contexts such as Wikipedia, gender cues are less salient than in processes where physical characteristics are apparent (e.g., Brooks, Huang, Kearney, & Murray, 2014; Goldin & Rouse, 2000). Nevertheless, research has shown that gender differences often persist in computer-mediated contexts with gender anonymity (Guiller & Durndell, 2007; Herring, 2000; Lee, 2007), and that there continues to be greater conformity to ostensible male interaction partners even where linguistic features are used as bases for gender inferences (Lee, 2007). Thus, social selection may also occur based on behavioral differences, such that women who act more like men are more likely to be accorded higher-status positions.
A recent analysis by Fernandez-Mateo and Fernandez (2016) discusses the intricacies of distinguishing demand- and supply-side factors (social and self-selection, respectively), including how anticipatory effects can make demand-side factors (e.g., a discriminatory environment) look like supply-side preferences on the part of women who select out to preempt being discriminated against. The authors propose an original approach for untangling the mechanisms in the context of executive search. Distinguishing these two sorting mechanisms from a treatment effect adds an additional layer of complication and is beyond the scope of the present paper. Yet investigating the past behavior of female editors who eventually rise to positions of authority will yield some insight as to the relevance of the different mechanisms. If the two forms of sorting are sufficient to explain a possible closing of the gender gap among editors in power, we would expect to see that women who rise to the top already differed from the general population before their ascent. We present these analyses in the Results section.

Dataset
Our dataset consists of Wikipedia Talk page discussions collected by Prabhakaran and Rambow (2016) and made available at https://www.cs.stanford.edu/~vinod/publication.html. These discussions contain 906,671 comments made by 104,982 unique Wikipedia editors in 166,322 threads spanning 1,236 articles from 2001 to 2015. There are an average (mean) of 5.45 comments per thread and an average of 2.84 editors per thread, and comments have an average length of 85.25 words. As shown in Prabhakaran and Rambow (2016), these comments are not distributed uniformly over time: the number of comments per year rises from 2001 to 2006 and drops from 2007 to 2015, with 2005 to 2008 being the peak of editor interactions in the corpus. Note that about 5% of the comments do not have an assigned date (as these are from a period when Wikipedia did not enforce formats for editor signatures in comments).
Crucially, this dataset contains editor metadata obtained through the MediaWiki API, which includes whether the editor is registered, whether the editor is an administrator at the time of the post, as well as editor gender and number of edits made. 57% of editors in this dataset are registered, of whom 12% reveal their gender in their user accounts. 92% and 8% of the gender-identifiable editors are male and female, respectively, and the bulk of our analysis pertains to the comments made by these editors (in the Discussion section we consider the limitations of this restriction). The participation of male and female editors is relatively stable over time (i.e., there is no statistical relationship between the date at which a comment is posted and the gender of the poster). Additionally, around 1% of the editors are administrators, and the average number of prior edits at the time of the comment (a measure of experience) across all editors for whom we have edit data is 4,428. We summarize these and other variables relevant to our analysis in the Variables section below. Additional details regarding the dataset are presented in Prabhakaran and Rambow (2016).

Emotion ratings
We examine the emotionality of editors' comments using automated text analysis. For this, we rely on the valence and arousal norms collected by Warriner, Kuperman, and Brysbaert (2013), in which valence corresponds to the overall positive or negative quality of a word, and arousal corresponds to the degree to which the word connotes excitement, intensity, and activation. Warriner et al. collected these norms using surveys among individuals who self-identified as current residents of the US, aged between 16 and 87 years, about 60% of whom were female. Participants were asked to rate the valence and arousal of words on a scale from 1 to 9 (with higher ratings for higher valence or higher arousal). The highest-valence words in this dataset are vacation and happiness (average ratings of 8.53 and 8.48, respectively); the lowest-valence words are pedophile and rapist (average ratings of 1.26 and 1.30, respectively). The highest-arousal words are insanity and gun (average ratings of 7.79 and 7.74, respectively), and the lowest-arousal words are grain and dull (average ratings of 1.60 and 1.67, respectively). The split-half reliabilities for valence and arousal ratings are 0.91 and 0.69, respectively. This lexicon is balanced for the valence ratings: 55% of words are rated at or above 5 (positive) and 45% of words are rated below 5 (negative). It is not balanced for the arousal ratings: only 18% of words are rated at or above 5 (high arousal), whereas 82% are rated below 5 (low arousal). This likely reflects psycholinguistic features of the English language (most words are non-arousing) rather than a bias in Warriner et al.'s lexicon. There are many measures of emotion. We limit our analyses to valence and arousal, as these two dimensions capture the majority of the variance in the structure of emotional experience (Russell, 1980). We complement this with an examination of the degree to which comments by men and women are affective vs. cognitive, which extends our analysis to other mental states beyond emotions (see the Affective vs. cognitive content subsection below). Although there are other datasets that could be used to obtain valence and arousal ratings for words (Bradley & Lang, 1999), the Warriner et al. dataset is the largest lexicon currently in existence, and it contains participant-generated valence and arousal ratings for 13,915 different words. Importantly, this lexicon has been compiled by psychologists and is widely used in psychological research on emotion, language, memory, and decision making.

[4] Although there are also studies suggesting that the desire for status is universal (Anderson, Hildreth, & Howland, 2015).
[5] The most recent survey conducted by the Wikimedia Foundation puts the fraction of female editors on Wikimedia projects at 9% (Wikimedia, 2018), which corresponds closely to earlier surveys (Glott, Schmidt, & Ghosh, 2010). Readership rates, however, seem to be equal across genders (Zickuhr & Rainie, 2011).

Word-frequency averaging method
We use two different methods for analyzing emotions in the Wikipedia comments. Our first method, the word-frequency averaging (WFA) method, measures the valence or arousal of a comment based simply on the aggregate valence or arousal of its component words. Specifically, this method first tokenizes the comment by lower-casing it, removing all punctuation, and splitting up the comment by white space. This step transforms the natural language sentence or paragraph that makes up the comment into a "bag-of-words" representation, i.e., a set of component words and their corresponding frequencies in the comment. After this step, the WFA method queries the Warriner et al. lexicon for the valence and arousal ratings of each word. Finally, it averages the valence or arousal ratings for all words in the comment that are also contained in the Warriner et al. lexicon, to obtain an aggregate measure of the valence or arousal of the comment (see, e.g., Humphreys and Wang (2017) for an overview of this approach).
More formally, the average valence or arousal rating of comment i using this method is r_i = (∑_{j=1}^{N} f_ij · r_j) / (∑_{j=1}^{N} f_ij). Here j = 1, 2, …, N indexes the words in the Warriner et al. lexicon, f_ij is the frequency of occurrence of word j in comment i, r_j is the valence or arousal rating of word j in the Warriner et al. lexicon, and N = 13,915 is the total number of words in the lexicon. We have f_ij = 0 if word j is not present in comment i.
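As a concrete illustration, the WFA pipeline can be sketched in a few lines of Python. The four-word lexicon below is a tiny stand-in for the 13,915-word Warriner et al. norms, seeded with the example valence ratings quoted above; the function names are ours, not the paper's code:

```python
# Minimal sketch of the word-frequency averaging (WFA) method.
# VALENCE is a four-word stand-in for the Warriner et al. (2013) lexicon,
# using the example ratings quoted in the text.
import string

VALENCE = {"vacation": 8.53, "happiness": 8.48, "pedophile": 1.26, "rapist": 1.30}

def tokenize(comment):
    """Lower-case the comment, strip punctuation, and split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return comment.lower().translate(table).split()

def wfa_rating(comment, lexicon):
    """Average the lexicon ratings of all in-lexicon tokens (frequency-weighted).

    Returns None when no token appears in the lexicon -- the data sparsity
    issue that motivates the embeddings method described below."""
    ratings = [lexicon[token] for token in tokenize(comment) if token in lexicon]
    return sum(ratings) / len(ratings) if ratings else None
```

Because the average runs over the token list, each word is implicitly weighted by its frequency in the comment, matching the formula above; for instance, `wfa_rating("A vacation, what happiness!", VALENCE)` averages the ratings of "vacation" and "happiness".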

Embeddings method
One limitation of the WFA method is that not every comment contains words in the Warriner et al. lexicon, and there are many words not in the Warriner et al. lexicon that are commonly mentioned in the comments. To avoid these data sparsity issues and to ensure the robustness of our results, we also analyze the valence and arousal of comments using a second approach: the word embeddings method.
Word embeddings are popular tools in computational linguistics that quantify the meanings of words by describing them as high-dimensional vectors. Word vectors are derived from the structure of word co-occurrence in natural language, and are useful for a variety of text analysis applications (see Bhatia, Richie, and Zou (2019) or Lenci (2018) for overviews of word embeddings and a discussion of their relevance for psychological research). Here, we use word embeddings to extrapolate the valence and arousal ratings collected by Warriner et al. to other words mentioned in the Wikipedia comments that are not in the lexicon. Our analysis is based on the Google News word2vec embeddings model, a powerful pretrained model that provides 300-dimensional representations for over 3 million words and phrases (Mikolov, Sutskever, Chen, Corrado, & Dean, 2013). We use the word2vec model to map each word in the Warriner et al. lexicon to this 300-dimensional embedding space, and then use the valence and arousal ratings for the words in the Warriner et al. lexicon to train an algorithm capable of predicting the valence or arousal of the 3 million words and phrases that exist in the word2vec model's space. Through this approach, we can measure the valence and arousal of a large number of words in the Wikipedia comments, including words not present in the Warriner et al. lexicon (see Hollis, Westbury, and Lefsrud (2017) and Sedoc, Preoţiuc-Pietro, and Ungar (2017) for overviews of this method).
More specifically, each word j = 1, 2, …, N in the Warriner et al. lexicon can be described as a vector w_j in the word2vec embedding space. Based on the valence and arousal ratings r_j associated with each word, we can learn a function g that best maps the vector w_j onto r_j. In our analysis we assume this function is linear, and we train the weights of this function using a ridge regression implemented in the scikit-learn Python machine learning library. We optimize the regularization hyperparameter through cross-validation and find that the best performing ridge regression model achieves an out-of-sample correlation of 0.79 for predicting word valence and 0.61 for predicting word arousal in a ten-fold cross-validation exercise on the Warriner et al. data.
With the best-fit function g (now trained on the entire data), we can take an arbitrary vector w and make a valence or arousal rating prediction R = g(w) for the word or comment corresponding to the vector. Thus, to obtain valence and arousal ratings for a given comment i in the Wikipedia corpus, we first tokenize the comment (by lower-casing and splitting by white space, as above), then vectorize the comment (by averaging the word vectors for each word in the comment), and then pass the comment vector through our trained function g. The predicted rating R_i is subsequently given by R_i = g(w_i). Here, w_i = (∑_{k=1}^{M} f_ik · w_k) / (∑_{k=1}^{M} f_ik), and k = 1, 2, …, M indexes the words in the word2vec vocabulary, f_ik is the frequency of occurrence of word k in comment i, w_k is the vector representation of word k in the word2vec model, and M = 3,000,000 is the total number of words in the word2vec model. We have f_ik = 0 if word k is not present in comment i. Note that we always normalize our vectors before passing them through the function g; thus, we have ||w|| = 1 for all vectors used in the training and extrapolation parts of our analysis. Also note that this approach first aggregates the vectors of the words in a comment into an overall comment vector before mapping the vector onto the valence or arousal rating scale. This is in contrast to using the vector for each word in the comment to obtain a predicted valence or arousal rating for the word, and then averaging the predicted ratings for all words in the comment, as in the WFA method. The two approaches are nearly identical since our function g is linear (g applied to an average of a set of vectors is the same as the average of g applied to the individual vectors).
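The training-and-extrapolation pipeline described above can be sketched as follows. Since the pretrained word2vec vectors and the Warriner et al. ratings are not bundled here, the sketch simulates both with random data; the names `vocab` and `comment_rating`, and the simulated ratings, are illustrative assumptions rather than the paper's actual data or code:

```python
# Sketch of the embeddings method: learn a linear function g mapping word
# vectors to valence ratings via ridge regression (regularization chosen by
# cross-validation), then score a comment by applying g to its normalized
# average word vector. word2vec vectors and human ratings are simulated.
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
D = 300  # word2vec dimensionality
vocab = {f"word{j}": rng.normal(size=D) for j in range(500)}

# Simulated "lexicon" ratings for the first 400 words (stand-in for the
# 13,915 human-rated Warriner et al. words); the remaining words are unrated.
lexicon_words = [f"word{j}" for j in range(400)]
X = np.array([vocab[w] / np.linalg.norm(vocab[w]) for w in lexicon_words])
true_weights = rng.normal(size=D)
y = X @ true_weights + rng.normal(scale=0.1, size=len(lexicon_words))

# Cross-validated ridge regression, as in the paper's scikit-learn pipeline.
g = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0]).fit(X, y)

def comment_rating(tokens):
    """Average the vectors of in-vocabulary tokens, normalize, apply g."""
    vectors = [vocab[t] for t in tokens if t in vocab]
    if not vectors:
        return None
    w = np.mean(vectors, axis=0)
    w = w / np.linalg.norm(w)  # ||w|| = 1, as in the paper
    return float(g.predict(w.reshape(1, -1))[0])
```

Because g is linear, scoring the averaged comment vector is equivalent to averaging per-word predictions, as noted above; with the real data, g would be trained on the Warriner et al. ratings and could then score any of the 3 million word2vec vectors, rated or not.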

Affective vs. Cognitive content
In addition to the emotion ratings obtained using the WFA and embeddings methods described above, we also analyze the degree to which each comment expresses affective vs. cognitive content. This accounts for the fact that not all mental states are emotions: they also comprise expressions of thinking, planning, and decision-making. Affective content corresponds to the use of emotion-related words and concepts (e.g., "I feel"), whereas cognitive content involves the use of belief-related words and concepts (e.g., "I think"). Measuring the relative amount of affective vs. cognitive content allows us to assess the overall degree of emotionality (or, instead, emphasis on cognitive processes) of a comment.
Our analysis of affective vs. cognitive content relies on the distributed dictionary method (Garten et al., 2018) applied to the sets of cognitive and affective process words in the Linguistic Inquiry and Word Count (LIWC) lexicon (Pennebaker, Boyd, Jordan, & Blackburn, 2015). As with the embeddings method for measuring valence and arousal, the distributed dictionary method uses a word embeddings representation of a comment to measure the semantic distance (i.e., dissimilarity in meaning) between the comment and a given construct characterized by a set of words. Since word embeddings quantify word meaning, comments that are closely related to the construct being analyzed will have vectors that are closer to the vectors of the words describing the construct, whereas comments that are unrelated to the construct will have vectors that are further away. In our case, the constructs being analyzed are cognition and affect, and the words describing these constructs are words typically used to express the outcomes of thought and emotion processes. There are a total of 791 words in the LIWC cognitive processes word set and 1,391 words in the LIWC affective processes word set. The LIWC lexicon is a well-validated dataset of words reflecting a variety of psychological constructs, and it has been used in numerous text analysis applications in psychology (see Pennebaker et al., 2015, for a summary of reliability statistics and an overview of applications).
Formally, we calculate the degree of affective vs. cognitive content in each comment by first tokenizing and vectorizing the comment with the word2vec embeddings model (as in the embeddings method described above). This gives us, for each comment i, a 300-dimensional vector w_i. We also obtain vector representations for affective and cognitive processes by vectorizing and averaging each of the words in the LIWC lexicon. Specifically, we use the word2vec model to obtain word vectors w_k for each of the LIWC affective and cognitive processes words, and then average the vectors for the affective processes word set and the vectors for the cognitive processes word set to obtain a single vector representation a for affective processes and a single vector representation c for cognitive processes. Finally, for each comment i we calculate the relative semantic distance between its vector w_i and the affective and cognitive vectors a and c. This is done using cosine similarity, so that our measure of the affective vs. cognitive content of comment i is given by

s_i = cos(w_i, c) − cos(w_i, a),

where cos(·, ·) denotes the cosine similarity between two vectors. As each cosine term lies in the range [−1, 1], the difference in affective vs. cognitive content s_i lies in the range [−2, 2]. Comments with higher values of s_i have words that are more semantically related to cognitive process words than to affective process words; the opposite is true for comments with lower values of s_i.
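As a minimal sketch of this procedure, the following computes s_i for a comment. The three-dimensional toy embeddings and two-word construct lists are hypothetical stand-ins for the 300-dimensional word2vec vectors and the full LIWC affective and cognitive word sets used in the actual analysis.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def construct_vector(words, embeddings):
    """Average the vectors of all in-vocabulary words describing a construct."""
    return np.mean([embeddings[w] for w in words if w in embeddings], axis=0)

def affective_vs_cognitive(comment_tokens, embeddings, affect_words, cog_words):
    """s_i = cos(w_i, c) - cos(w_i, a); positive values indicate more cognitive content."""
    vecs = [embeddings[t] for t in comment_tokens if t in embeddings]
    if not vecs:
        return None  # no in-vocabulary words: measure undefined
    w_i = np.mean(vecs, axis=0)
    c = construct_vector(cog_words, embeddings)
    a = construct_vector(affect_words, embeddings)
    return cosine(w_i, c) - cosine(w_i, a)

# Toy 3-d embeddings purely for illustration.
emb = {
    "feel": np.array([1.0, 0.1, 0.0]), "love": np.array([0.9, 0.2, 0.1]),
    "think": np.array([0.0, 1.0, 0.1]), "because": np.array([0.1, 0.9, 0.2]),
    "i": np.array([0.3, 0.3, 0.3]),
}
s = affective_vs_cognitive(["i", "think"], emb, ["feel", "love"], ["think", "because"])
```

With these toy vectors, "i think" lies closer to the cognitive construct, so s is positive.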

Variables
The richness of the Wikipedia dataset and the range of text analysis methods introduced above allow us to analyze the relationships between a large number of variables. Most of our analyses study the emotional characteristics of editor comments, and thus this section summarizes the statistics of our variables at the comment level. Detailed statistics for the variables discussed here are provided in Table 1.

Comment-level variables
As discussed in the Dataset section above, the Wikipedia discussions corpus contains a total of 906,671 comments. Of these, 98.05% contain words present in the Warriner et al. lexicon. For these comments, we use the word-frequency averaging (WFA) method to calculate valence and arousal. The average WFA valence in this set of comments is 5.65 and the average WFA arousal is 3.91. The main benefit of our embeddings method is that it does not suffer from the WFA method's data sparsity issues: we can use the word2vec embeddings model to obtain representations for 99.41% of comments, and subsequently use our embeddings method to calculate their embeddings-based valence and arousal. The average embeddings valence in this set of comments is 5.39 and the average embeddings arousal is 4.15. The WFA and embeddings measures are positively correlated, with a correlation of 0.61 for valence and 0.44 for arousal across the comments in our dataset for which we have both measures (p < 0.001). Finally, we analyze the comments with embeddings representations for affective vs. cognitive content. Using the approach outlined above, we find that the average distance to the affective vs. cognitive processes words is 0.12. Thus, comments are on average more semantically related to cognitive process words than to affective process words. This is in line with Wikipedia's focus on knowledge creation, where claiming reason and objectivity would be expected to be a prevalent discussion tactic.
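To make the WFA procedure concrete, here is a minimal sketch; the three-word lexicon is hypothetical, standing in for the Warriner et al. ratings (a 1-9 scale), and the `None` return corresponds to the data-sparsity case in which no word of the comment appears in the lexicon.

```python
def wfa_score(tokens, lexicon):
    """Word-frequency averaging: mean lexicon rating over in-lexicon tokens.
    Returns None when no token is covered (the data-sparsity case noted above)."""
    ratings = [lexicon[t] for t in tokens if t in lexicon]
    return sum(ratings) / len(ratings) if ratings else None

# Hypothetical ratings on Warriner et al.'s 1-9 valence scale.
valence = {"happy": 8.5, "terrible": 1.9, "table": 5.0}
wfa_score(["a", "happy", "table"], valence)  # averages 8.5 and 5.0 -> 6.75
```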
Other comment-level variables include the length of the comment, the order in which the comment is placed in the thread, and the date of the comment's posting. Comment length is measured as the number of words in the comment. This variable is not normally distributed (most comments have few words and some comments have many words), so in our analyses we use the log of the comment length, whose mean is 3.90. Comment order is simply the rank of the comment in the thread (with rank 1 for the first comment); the average comment order in our data is 9. Finally, comment date is measured as the number of days between the posting of the comment and 01/01/2000. We transform the date into a "days" variable for ease of analyzing time trends. The average time in days since 01/01/2000 for our comments is 3,269 (indicating a date in December 2008).
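The date transform is a simple day count; as a sketch (the example date is chosen to match the dataset mean of 3,269 days, i.e., mid-December 2008):

```python
from datetime import date

def days_since_2000(d):
    """Comment date encoded as the number of days since 01/01/2000."""
    return (d - date(2000, 1, 1)).days

days_since_2000(date(2008, 12, 13))  # 3269 days, the dataset mean
```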

User-level variables
The dataset we use is unique in that it contains information about a subset of the editors' genders, as revealed by the editors on their user accounts. Of the comments in the dataset, 17.97% are written by male editors, whereas 1.44% are written by female editors. The gender behind 80.59% of comments cannot be determined, as these are written either by non-registered editors (editors without user accounts) or by registered editors who have decided not to reveal their gender. We analyze a number of other editor-level variables, such as whether or not the editor is an administrator at the time of the post and the editor's number of prior edits, which is a measure of editor experience. In our dataset, 9.09% of comments are made by administrators, and 90.91% by non-administrators. As mentioned in the Dataset section, the average number of prior edits of all editors for whom we have edit data is 4,428. We use a log-transformation of this variable in all subsequent analyses, as the number of edits is highly skewed (most users make very few edits and some users make many). Over all the comments in our dataset, the mean of the log number of prior edits of the commenter at the time of the comment is 8.32.

Article-level variables
We also consider various variables pertaining to the article being discussed. Two important variables here are whether or not the article is tagged as "controversial" on Wikipedia (31% of all articles in the dataset are controversial, though 61% of all comments are made on threads pertaining to controversial articles), and the article's gender-typedness. We compute the latter using both a WFA method and an embeddings method. The WFA gender-typedness measure of an article is obtained by calculating the ratio of the sum of male pronouns ("him", "he", "himself", "his") to the sum of male and female pronouns ("her", "she", "herself", "hers") in the article. There are a total of 1,144 unique articles that we could access which mention at least one male or female pronoun, and these articles have an average proportion of male pronouns of 80.20% (the remaining articles either did not mention any English pronouns or were not available on Wikipedia at the time of our analysis). There are 305 articles with exclusively male pronouns (including "God", "Walmart", "Communism", "BBC", and "American Civil War") and 12 articles with exclusively female pronouns (mostly pertaining to women's health, childbirth, and sexuality). The average gender-typedness (ratio of male pronouns to all pronouns) of articles associated with each comment is 0.84, indicating that most comments pertain to articles that are primarily male-typed.
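A minimal sketch of this pronoun-ratio measure, assuming simple whitespace tokenization of lowercased article text:

```python
MALE = {"he", "him", "his", "himself"}
FEMALE = {"she", "her", "hers", "herself"}

def pronoun_gender_typedness(tokens):
    """WFA gender-typedness: male pronouns / (male + female pronouns).
    Returns None for articles without gendered English pronouns."""
    m = sum(t in MALE for t in tokens)
    f = sum(t in FEMALE for t in tokens)
    return m / (m + f) if m + f else None

pronoun_gender_typedness("he said she saw his paper".split())  # 2 male of 3 gendered
```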
The embeddings-based method for calculating article gender-typedness involves the type of distributed dictionary analysis (Garten et al., 2018) discussed in the Affective vs. cognitive content section above. Specifically, we first tokenize and vectorize the article using the word2vec embedding space, and then calculate its semantic distance to a set of 20 male words relative to a set of 20 female words. The male and female words are gender pronouns, gendered relationship words (e.g., father, mother, nephew, niece), and other words describing men and women (e.g., male, female, man, woman, boy, girl). As this approach measures semantic distance to male vs. female words, higher values of the gender-typedness variable correspond to greater female-typedness of the article.
The embeddings-based distributed dictionary approach can be applied to 1,154 articles discussed in the Wikipedia corpus (the remaining articles either did not mention any English words or were not available on Wikipedia at the time of our analysis). The average gender-typedness score for these articles using the embeddings method is −0.04, indicating that the average article is more semantically distant to female words than male words. The average embeddings-based gender-typedness of articles associated with each comment is −0.05. The embeddings and WFA measures of gender-typedness are closely related, with a correlation of −0.62 on the comment level (p < 0.001); this correlation is negative because more male-typed articles have high values on the WFA measure and negative values on the embeddings measure. As with the WFA measure, articles with high levels of male content pertain to war, religion, and business, and articles with high levels of female content pertain to women's health, childbirth, and sexuality. Importantly, we control for the valence and arousal of the article being discussed, as high or low valence and arousal articles are likely to have comments that are correspondingly high or low in valence and arousal. We again do so using both the WFA and the embeddings methods. There are a total of 1,154 unique articles for which we are able to compute WFA valence and arousal measures, with an average valence of 5.50 and an average arousal of 4.11. To illustrate, the articles with the highest and lowest valence scores in our dataset are "Ruth Westheimer" (an American sex therapist, media personality, and author) and "Crime in the United States", respectively. Other high valence articles include articles for popular celebrities, e.g. "Whoopi Goldberg", and articles for cultural products and phenomena such as "Smooth Jazz" and "Buddhism". Other low valence articles include ones for diseases, e.g. "Hodgkin lymphoma", social phenomena, e.g.
"Hate group", and wars, e.g. "Korean War". In contrast, the articles with the highest and lowest arousal scores in our dataset are "Sexual abuse" and "Mesoamerican Long Count Calendar", respectively. Other high arousal articles include political movements and topics such as "Fascism" and "Nuclear war". Other low arousal articles include various uncontroversial topics, such as "Scientific method". The average WFA valence and arousal values of articles associated with each comment are 5.50 and 4.11, respectively.
The embeddings method also allows us to analyze the valence and arousal of 1,154 articles in our data. We find that the mean embeddings-based valence of these articles is 5.16 whereas the mean embeddings-based arousal of these articles is 4.22. The mean embeddingsbased valence and arousal values of articles associated with each comment are 5.12 and 4.23, respectively. The embeddings and WFA methods for calculating valence and arousal are highly correlated, with a correlation of 0.90 and 0.81 on the comment level (p < 0.001). As is reflected in these strong correlations, the articles considered high or low in valence and high or low in arousal by the two methods are nearly identical.

Thread-level variables
A final set of controls involves thread-level variables. These are the number of unique editors commenting on the thread, the total number of comments on the thread, and the time difference (in number of days) between the first and the last comment on the thread. These have mean values of 5.31 unique editors, 16.99 comments, and 59.13 days on the comment level, respectively. Since the number of days between the first and last comment on a thread is highly skewed, with 36.76% of threads resolved within the same day, and 94.69% of threads resolved within two weeks, we log-transform this variable in all subsequent analyses. The mean log-transformed value of thread time difference in days is 1.86.

Table 1
Descriptive statistics for key variables. WFA refers to variables measured using the word-frequency averaging method, whereas EMB refers to variables measured using the embeddings method. COV indicates the coverage of the variable in the data, that is, the percentage of comments for which the variable could be calculated. Variables labeled with * are typically used as dependent variables in our analysis, whereas the remainder are typically independent variables and controls. Note: The statistics are aggregated on the comment level. Thus, for example, the table presents the average gender-typedness of the articles associated with each of the comments rather than the average gender-typedness of the unique articles.

Overview
The code and analysis for this paper are available at https://osf.io/s8hef/. Below we analyze the relationship between gender, power, and a number of variables that, most notably, capture the emotional content of the comments in the Wikipedia discussions. As our goal is to understand these variables in the context of conversations within organizations, we exclude comments on threads in which only one editor makes a comment (i.e., there is no conversation). We also run regressions with numerous control variables, and thus exclude data for which these variables are not defined. Finally, we typically run two sets of regressions: one with variables obtained using our word-frequency averaging (WFA) method and one with variables obtained using our embeddings method. This means, for example, that our analysis of a comment's valence, as measured by our WFA method, will control for the valence, arousal, and gender-typedness of the article that is the topic of the thread using the WFA method. Conversely, our analysis of a comment's valence, as measured by our embeddings method, will control for the valence, arousal, and gender-typedness of the article that is the topic of the thread using the embeddings method. For expositional simplicity we do not include embeddings controls for WFA dependent variables, or vice versa.

Gender differences in domain choice
Before analyzing the emotionality of the conversations in our dataset, we examine whether there are systematic differences in the topics that women and men choose to converse on. We use a multiple logistic regression in which each observation corresponds to a comment, the dependent variable is whether or not the comment is written by a female editor, and the independent variables are various article-level characteristics, such as the article's valence, arousal, and gender-typedness. We run two of these regressions: one with the article valence, arousal, and gender-typedness variables obtained using our word-frequency averaging (WFA) method, and one with these variables obtained using our embeddings method. For both regressions we permit random effects in intercepts on the thread level. These random effects group (or nest) comments based on the thread they are in, in order to accommodate variability in gender across threads. In this sense our regression involves a hierarchical analysis.
As can be seen in Table 2, a comment is significantly more likely to be written by a woman if the article it pertains to is more female-typed (i.e., has fewer male pronouns relative to female pronouns under the WFA method, or is more semantically distant to male words relative to female words under the embeddings method). Using the WFA method we also find that women are less likely to comment on controversial articles when controlling for article characteristics like valence, arousal, and gender-typedness, though this pattern is weaker and becomes non-significant when these characteristics are measured using the embeddings method. Finally, we find a positive relationship with article arousal and a non-systematic relationship with article valence. In the subsequent analyses, we control for these article-level variables when analyzing the relationship between the gender of the communicator and the emotionality of the comment.

Gender differences in emotionality
We now examine whether there is a systematic gender difference in the expression of emotions among the general population of editors, where we first focus on valence and subsequently on arousal. We therefore regress comment valence and arousal on gender and also include the other editor-level, article-level, and thread-level variables discussed above. As before, we consider editor gender (=1 if female) as the main coefficient of interest, and we control for admin-status (=1 if the editor is an administrator) and experience (log number of prior edits). At the article-level, we control for valence, arousal, controversiality, and gender-typedness of the content. Thread-level controls are the number of comments, the number of unique editors, and the length of time between the first and last comments in the thread (in log days). To gain further insight about the structure of conversations, we also explore the role of comment order for emotionality by including a discrete variable indexing the comment's position in the thread. This variable takes on a value of 1 if the comment is the first in the thread, 2 if it is the second, and so on. We control for the date of the comment's posting (measured in days since 01/01/2000).
We use intercept random effects in our regressions to control for thread-and user-level heterogeneity not captured by our control variables. These nest comments made by each user in a thread in a single group. Thus, for example, we allow the valence of a comment to depend not only on the article-, thread-, and user-variables that are of central concern to our analysis, but also on an additive effect of the specific user in the thread. In this way comments by a given user in a given thread are grouped together, capturing user-and thread-level heterogeneity.
Lastly, as there are multiple variables being tested in each regression, we apply a Bonferroni correction for multiple comparisons. With twelve variables tested, this yields a significance cutoff of p = 0.05/12 = 0.0042. We run this regression in two ways: one using our WFA variables and one using our embeddings variables.
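The cutoffs used throughout the paper follow directly from dividing the conventional α = 0.05 by the number of coefficients tested, e.g.:

```python
def bonferroni_cutoff(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests

round(bonferroni_cutoff(0.05, 12), 4)  # 0.0042, the cutoff used in these regressions
```

The same computation yields the 0.0045 cutoff used for the eleven interaction regressions below and the 0.0125 cutoff used for the four-way domain-choice comparisons.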
The results are shown in Table 3 for valence and Table 4 for arousal. As can be seen in Table 3, gender is a strong and significant predictor of comment valence for both the WFA and embeddings methods. The sign is positive, meaning that comments made by female editors are significantly higher in valence than comments from male editors. There are no other significant editor-level determinants of comment valence. There are, however, other article- and thread-level determinants (which we control for in the main analyses on gender and power dynamics). While these are not the focus of this study, we briefly present the patterns that emerge. Table 3 shows that comments have a significantly more positive valence (p < 0.001) if they are in threads about positively valenced articles, with fewer comments, fewer unique editors, and a shorter time between the first and last comments. Additionally, comments occurring later in a conversation have a significantly higher valence than comments occurring towards the beginning. More recent comments also have higher valence. These patterns emerge with both the WFA and embeddings methods. In addition, we find that less arousing articles and more female-typed articles have higher valence, as measured by the embeddings method.

Table 2
Word-frequency averaging and embeddings-based logistic regressions predicting whether the originator of the comment is female, as a function of various article-level variables.

Table 3
Word-frequency averaging and embeddings-based regressions predicting comment valence from various user-, article-, thread-, and comment-level variables. Note: Random effects on thread- and user-level. Bonferroni correction for multiple comparisons yields a significance cutoff of p = 0.0042 for each regression.

Table 4
Word-frequency averaging and embeddings-based regressions predicting comment arousal from various user-, article-, thread-, and comment-level variables. Note: Random effects on thread- and user-level. Bonferroni correction for multiple comparisons yields a significance cutoff of p = 0.0042 for each regression.

Table 4 shows that comment arousal (a measure of the excitement or intensity of the comment) is not significantly predicted by editor gender. It does, however, depend on the editor's prior experience, with editors with more prior edits writing relatively lower-arousal comments (in line with, e.g., Kucuktunc et al., 2012). Comments also have significantly higher arousal if they belong to conversations about lower-valence and higher-arousal articles, and if they involve a larger number of editors and unfold over a longer time span. More recently made comments have lower arousal. These differences appear with both the WFA and embeddings methods. Our regression with the embeddings method also suggests that threads with more comments have higher arousal.
We also examine the determinants of a comment's affective vs. cognitive content. Recall that this variable encodes the semantic distance between the comment and affective-process words relative to cognitive-process words (with positive values indicating a stronger cognitive component). As measuring this variable involves comment embeddings, we use only one set of regressions (with controls given by our embeddings approach for measuring article valence, arousal, and gender-typedness). The results of this regression are shown in Table 5. Comments made by female editors have more affective content than comments made by male editors, as expected. Editors with fewer prior edits also write comments with more affective rather than cognitive content. Additionally, there is more affective content in comments written about uncontroversial and male-typed articles, as well as about articles with higher valence and arousal. Table 5 also shows further significant thread- and comment-level predictors, which we do not discuss here. While this regression focused on comments for which we could identify the editor's gender, the article- and thread-level patterns persist even when we examine all comments, including comments with non-identifiable editor gender and prior edit count (see Tables A1-A3 in the Online Appendix). In the Online Appendix we also consider various user-, article-, and thread-level predictors of comment length. Table A4 shows that female editors write significantly longer comments than male editors, suggesting gender differences in commenting style that extend beyond emotionality.

Moderators of the gender-valence relationship
In the previous section we observed a strong main effect of editor gender on comment emotionality, with comments from female editors displaying a significantly more positive valence and more affective content than comments from male editors. In this section our goal is to understand the moderators of this tendency. While our main interest is the interaction between gender and power (admin-status, captured by the variable user administrator), we have also explored the interactions between gender and the ten other variables used in our analyses (user prior edits, article valence, article arousal, article controversiality, article gender-typedness, number of comments in thread, number of unique editors in thread, length of time of thread, position in thread, and date of comment), and we report those results for completeness. We run separate regressions with the emotionality of the comment (valence, arousal, or affective vs. cognitive content) as the dependent variable, the variables examined in the prior section as independent variables, and an interaction term between gender and one of these eleven variables. As above, our regressions include random effects on the user- and thread-level, and are performed with both the WFA and embeddings variables.
The outputs of the interaction effects for the regressions for comment valence are shown in Table 6. As can be seen, the only significant interaction with editor gender, for both the WFA and embeddings regressions, is admin-status (i.e., the position of authority). The negative value of this interaction shows that there is a drop in comment valence for female administrators relative to female non-administrators. Thus, admin-status appears to be the only variable that reduces the difference in comment valence between men and women. Table 7 shows a similar set of interactions for comment arousal.
Here we see that no variable crosses the threshold for significance under a Bonferroni correction for multiple comparisons for both the WFA and embeddings methods. Thus, not only are there no gender differences in comment arousal, but gender also does not systematically interact with other variables to influence comment arousal.
Table 8 shows the results of these regressions for the affective vs. cognitive content of comments. As we measure affective vs. cognitive content using embeddings, the interacting variables here include only embeddings variables. Interestingly, the previously observed tendency for female editors' comments to carry a higher affective vs. cognitive load is not mitigated by power, unlike our valence results. That is, female administrators also use more affective and less cognitive language than male administrators. We discuss possible interpretations of this finding in the last section of the paper. Finally, Table 8 also shows a significant interaction with article valence, suggesting that comments made by female editors have more affective content in threads involving higher-valenced articles.

The gender gap across the organizational hierarchy
In this final section, our goal is to examine the interaction between gender and power (as proxied by admin-status) in more detail.

Domain choice
Our analysis in Table 2 showed that there are differences between men and women in terms of the articles they choose to converse on, with women more frequently commenting on female-typed articles, which are higher in arousal. In that analysis we did not find systematic effects of article controversiality and valence that persisted with both the WFA and embeddings methods. However, that analysis pooled administrators and non-administrators, and thus examined aggregate effects of gender, irrespective of power. Here we perform similar tests separately for individuals at different levels of the organizational hierarchy. We use a random-effects logistic regression on the comment level to predict whether a given comment is written by a man or a woman, using various article-level characteristics. Table 9 shows that there are important reversals in gender differences for administrators vs. non-administrators in terms of their domain choice. Female non-administrators are significantly more likely than male non-administrators to comment on female-typed content and on articles that are uncontroversial. In contrast, although female administrators still disproportionately comment on female-typed articles, the gender difference in article controversiality reverses, with female administrators being slightly more likely than male administrators to comment on controversial articles. These results emerge for both our WFA and embeddings methods, and suggest that the non-systematic effects of article controversiality documented in our prior analysis (Table 2) were a product of gender differences across levels of power. In contrast, we do not find systematic and consistent differences across the two methods in terms of the valence and arousal of articles that women and men at different levels of the hierarchy comment on. (We do, however, consistently find that female non-administrators comment more on arousing articles than male non-administrators.)

Table 5
Embeddings-based regressions predicting cognitive vs. affective content of comment from various user-, article-, thread-, and comment-level variables. Note: Random effects on thread- and user-level. Bonferroni correction for multiple comparisons yields a significance cutoff of p = 0.0042 for each regression.

Emotionality
The above analysis finds that gender interacts with power to influence comment valence. In contrast, there are no systematic interactions between gender and power for comment arousal or a comment's affective vs. cognitive content. To develop an intuition for how the gender difference in valence changes as we consider individuals in positions of authority, we perform a simple aggregate analysis of comment valence across the four groups of male administrators, female administrators, male non-administrators, and female non-administrators. The basic analysis regresses comment valence on gender (1 if female, 0 otherwise), admin-status (1 if administrator, 0 otherwise), and their interaction, and does not control for the other editor-, article-, or thread-level variables.

Table 6
Interaction effects between user gender and other possible predictors of comment valence, from eleven separate regressions for the word-frequency averaging and embeddings methods. Note: Each of the eleven regressions includes our standard set of controls as well as random effects on the user- and thread-level. Bonferroni correction for multiple comparisons yields a significance cutoff of p = 0.0045 for each set of regressions.

Table 7
Interaction effects between user gender and other possible predictors of comment arousal, from eleven separate regressions for the word-frequency averaging and embeddings methods. Note: Each of the eleven regressions includes our standard set of controls as well as random effects on the user- and thread-level. Bonferroni correction for multiple comparisons yields a significance cutoff of p = 0.0045 for each set of regressions.

Table 8
Note: Each of the coefficient, standard error, and confidence interval values has been multiplied by 10^3 to aid exposition; to obtain the actual values, multiply each number by 10^−3. Each of the eleven regressions includes our standard set of controls as well as random effects on the user- and thread-level. Bonferroni correction for multiple comparisons yields a significance cutoff of p = 0.0045 for each set of regressions.

Table 9
Word-frequency averaging and embeddings-based logistic regressions predicting whether the originator of the comment is female, using various article-level variables, for non-administrators and administrators, respectively. Note: Random effects on article-level. Bonferroni correction for multiple comparisons yields a significance cutoff of p = 0.0125 for each regression.

To gain a deeper understanding of the interaction effects, we also perform an analysis of the valence of the words with the highest relative probabilities of being used by each of the four groups. The analysis only considers Warriner et al. words that occur more than 1,000 times in the dataset. This is done to ensure that the results are not driven by rare words, which have low probabilities of occurrence and are consequently very hard to predict (Taleb, 2007). Including such rare words would yield spurious, highly skewed probabilities that would bias our results.
There are 12,338 words that occur more than 1,000 times in the dataset. To measure the relative probabilities of these words being used by the four groups, we first calculate how many times each word occurs in comments made by male administrators, female administrators, male non-administrators, and female non-administrators. We then divide each word's frequency within a group by the total number of words written by that group of editors, to obtain each word's probability of occurrence in comments made by each of the four groups; we write this probability for word i and group g as p_i^g. Fig. 2 shows the relative probabilities of occurrence for the ten highest-valence words that occur at least 1,000 times in our dataset. Here, we see that female non-administrators have the highest relative probabilities for four of these ten words ("happy", "free", "love", and "good"), and the second-highest relative probabilities for another three ("live", "relationship", and "thank").
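The per-group word probabilities described above can be sketched as follows (toy comments, hypothetical group labels, and simple whitespace tokenization; the actual analysis uses the full corpus and keeps only words occurring more than 1,000 times):

```python
from collections import Counter

def group_word_probs(comments_by_group):
    """For each group, P(word) = word's frequency in the group's comments
    divided by the total number of words written by that group."""
    probs = {}
    for group, comments in comments_by_group.items():
        counts = Counter(tok for c in comments for tok in c.split())
        total = sum(counts.values())
        probs[group] = {w: n / total for w, n in counts.items()}
    return probs

probs = group_word_probs({
    "female_non_admin": ["happy to help", "happy edits"],
    "male_non_admin": ["revert this edit"],
})
probs["female_non_admin"]["happy"]  # 2 of 5 tokens -> 0.4
```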
We use a logistic regression to test the relationship between the valence of each of the 12,338 words that occur more than 1,000 times in the dataset (our independent variable) and whether or not the word has the highest relative probability of occurrence in the comments made by female non-administrators (our dependent variable, which takes the value one for such words and zero when one of the three remaining user groups has the highest relative probability). This analysis reveals a significant positive relationship (β = 0.20, z = 2.43, p = 0.015, 95% CI = [0.04, 0.36]), showing that higher-valenced words are indeed significantly more likely to come from female non-administrators (compared to male administrators, male non-administrators, and female administrators).
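The shape of this word-level test can be illustrated with a minimal single-predictor logistic fit. The sketch below fits the model by gradient ascent on the log-likelihood purely for exposition; the paper's estimates come from standard statistical software, and the variable names here are assumptions.

```python
import math

def logit_slope(x, y, lr=0.1, iters=5000):
    """Fit y ~ intercept + beta * x by logistic-regression gradient ascent.

    Here x would be a word's valence and y the indicator that the word's
    highest relative probability of use lies with female non-administrators.
    Returns (intercept, slope).
    """
    b0, b1 = 0.0, 0.0
    n = len(x)
    for _ in range(iters):
        g0 = g1 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))  # predicted probability
            g0 += yi - p           # gradient w.r.t. intercept
            g1 += (yi - p) * xi    # gradient w.r.t. slope
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1
```

A positive fitted slope corresponds to the paper's finding that higher-valenced words are more likely to have their highest relative probability among female non-administrators.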
In Fig. 3 we divide these 12,338 words into four quartiles based on their valence (the 1st and 4th quartiles corresponding to the lowest- and highest-valence words, respectively), and show the proportion of words in each quartile with the highest relative probability of occurrence in the comments made by female non-administrators.
Here we can see that lower-valenced words (1st and 2nd quartiles) typically do not have their highest relative probability of occurrence in the comments made by female non-administrators, whereas higher-valenced words (3rd and 4th quartiles) typically do. This again shows that female non-administrators are more likely than the other three groups to use higher-valenced words.
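The quartile summary underlying Fig. 3 can be sketched as follows. This is an illustrative reconstruction under assumed inputs: a list of word valences and a parallel indicator of whether female non-administrators have the highest relative probability for each word.

```python
def quartile_proportions(valences, is_fna_top):
    """Split words into valence quartiles (1st = lowest valence) and report,
    per quartile, the share of words whose highest relative probability of
    use lies with female non-administrators."""
    order = sorted(range(len(valences)), key=lambda i: valences[i])
    n = len(order)
    bounds = [0, n // 4, n // 2, 3 * n // 4, n]
    props = []
    for q in range(4):
        idx = order[bounds[q]:bounds[q + 1]]
        props.append(sum(is_fna_top[i] for i in idx) / len(idx))
    return props
```

Applied to the 12,338 frequent words, the resulting four proportions correspond to the bars of Fig. 3 (standard errors would be added separately).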

Exploratory analysis of mechanisms
As discussed in the theory section, the main mechanisms behind the convergence we observe may be a treatment effect, or sorting in the form of social and self-selection. A comprehensive comparison of these mechanisms would require novel, ideally experimental data involving the random assignment of users to positions of power, which is beyond the scope of the current paper. A weaker analysis involves comparing the emotional styles of users who eventually become administrators with those of users who do not come to occupy administrator positions, or alternatively comparing the emotional styles of users before and after they become administrators. Although our dataset is extensive, the gender imbalances in Wikipedia editor and administrator roles mean that there are only twenty-six women for whom we observe comments made in both non-administrator and administrator positions. Thus, our ability to test for underlying mechanisms is restricted. Nonetheless, we include some exploratory tests, which indicate that a treatment effect of the position of authority may be involved (again, the three mechanisms are not mutually exclusive). First, we analyze whether comments made by female editors who later rise to a position of authority differ from those of their female peers in the general population who do not become administrators. We do this by regressing comment valence on a binary variable indicating whether or not the user eventually becomes an administrator. We run this regression only for comments made by female non-administrators, and include our standard set of controls (user log edit count; article valence, arousal, controversiality, and gender-typedness; comment order; comment date; thread number of comments, users, and time difference) as well as random effects on the user and thread level.

[Fig. 3. Proportion of words with highest relative probability of usage by female non-admins, by valence quartile. Error bars indicate +/- 1 SE.]
We do not find a significant difference between the comment valence of female non-administrators who eventually become administrators and that of female non-administrators who do not, whether we run this regression with the WFA method (β = 0.03, SE = 0.04, z = 0.76, p = 0.49, 95% CI = [−0.05, 0.10]) or with the embeddings method (β = 0.04, SE = 0.03, z = 1.45, p = 0.15, 95% CI = [−0.01, 0.10]). This suggests that, before they come to occupy the position of authority, female administrators do not differ in their emotionality (valence) from other women.
Second, we tentatively explore whether there may be a treatment effect of the position of authority on women's subsequent behavior by analyzing the data on women for whom we have observations both before and during their adminship. We test whether there is a change in comment valence as they become administrators. Again, this involves a regression of comment valence on a binary variable indicating whether or not the user is an administrator at the time of posting (using only the comments generated by women for whom we have data from before and after they become administrators). We run this regression with the controls discussed above and include random effects on the user and thread level. We observe a directional drop in comment valence as women come to occupy the position of authority using the WFA method, though this drop is not significant (β = −0.04, SE = 0.04, z = −1.06, p = 0.29, 95% CI = [−0.12, 0.04]). However, we do find a significant drop using the embeddings method (β = −0.07, SE = 0.03, z = −2.39, p = 0.02, 95% CI = [−0.13, −0.01]). This analysis is limited, as there are only twenty-six editors for whom we have the requisite data, so we remain cautious about robustness; still, these results would be consistent with an interpretation that holding powerful office may influence behavior, possibly legitimizing or compelling women to reduce the valence of their communications. Replicating these results and analyzing these mechanisms in more detail is an important and promising avenue for future research.
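The logic of this before/after comparison can be sketched with a simplified, unadjusted stand-in for the mixed-effects regression: an average within-user change in mean comment valence across the two periods. The data layout is an assumption for illustration; the paper's actual estimates additionally include the controls and random effects described above.

```python
from collections import defaultdict

def within_user_valence_shift(comments):
    """Average within-user change in mean comment valence from the
    pre-adminship period to the adminship period.

    `comments` is a list of (user, is_admin, valence) tuples restricted to
    users observed in both periods. A negative return value corresponds to
    the drop in valence tested for in the paper.
    """
    pre, post = defaultdict(list), defaultdict(list)
    for user, is_admin, v in comments:
        (post if is_admin else pre)[user].append(v)
    users = [u for u in pre if u in post]
    diffs = [sum(post[u]) / len(post[u]) - sum(pre[u]) / len(pre[u])
             for u in users]
    return sum(diffs) / len(diffs)
```
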

Discussion and conclusion
Our analysis yields several implications for research on gender differences, leadership behavior, and conversational phenomena within modern forms of knowledge production. In many of these novel organizational forms (e.g., Wikipedia, open source software production), selection into different work domains is voluntary and neither mandated nor predominantly motivated by pecuniary incentives. Millions of people around the world coordinate their efforts in virtual space, often without much personal interaction (e.g., without private communication in small or two-person teams, discussion of matters not related to work, or face-to-face meetings). Analysis of communication in such forms of production provides interesting insights for the future of work given the increasing predominance of large teams and the rise in alternative, often platform-based work arrangements. This makes understanding the linguistic coordination of people working on Wikipedia important for organizational scholars. Wikipedia has attracted much interest based on what is considered a relatively anti-authoritarian and decentralized structure. It is therefore surprising to see the role played by authoritative positions even in such an environment where workers are possibly less influenced by authority.
With regard to research on gender in organizations, we show that there are significant gender differences in people's conversational styles (specifically, in their emotionality and emphasis on affect vs. cognition) and domain choices (controversiality and gender-typedness). Importantly, once we look up the organizational hierarchy to individuals in positions of power, these differences diminish or even disappear: female and male authorities are just as (un)emotional in terms of valence in their language use, and they are just as likely to engage in conversations about controversial content. As our analyses also show, this change is driven by women who converge to the behavior of their male counterparts as they assume positions of power. There are two notable exceptions. First, the gender-specific separation of labor, that is, sorting into conversational topics based on their gender stereotype, seems to increase. This may be explained by differences in accumulated knowledge and expertise that editors can leverage once they become administrators. Second, female administrators continue to use fewer cognitive process words and more affective process words than male administrators. This is an interesting result that future research should explore further. It might be an indication of female leaders' intent to navigate a competence-warmth trade-off (Fiske, Cuddy, Glick, & Xu, 2002), whereby they counterbalance their position of power by renouncing the use of overly cognitive words. (The average comment in our dataset is more semantically related to cognitive than to affective words, which is expected given Wikipedia's focus on knowledge creation rather than on being a social media platform.) Our finding that important gender gaps in emotionality and domain choice disappear among people in positions of power is in line with other work in the gender literature (Croson & Gneezy, 2009).
Previous work shows, for instance, that the well-established gender difference in risk preferences does not extend from the general population to managers. Croson and Gneezy (2009) conclude that "the evidence suggests that managers and professional business persons present an important exception to the rule that women are more risk averse than men" (p. 454). These findings were obtained for trained managers, which opens the possibility (also discussed by Croson and Gneezy) that the training may have affected women's behavior (see, e.g., Johnson and Powell (1994), who compare trained and untrained subpopulations, as well as Masters and Meier (1988) and Birley (1987), who focus on entrepreneurs). We find such convergence even in a population of untrained individuals, as Wikipedia administrators presumably did not undergo formal management education.
We find suggestive evidence that the position of authority may have an effect on the disappearance of the gender gaps. In line with our findings, a recent study of laughter occurrences documents a similar pattern in which women in positions of power converge to the behavior of men and exhibit less inauthentic laughter, even when power is exogenously assigned (Bitterly, Brooks, Aaker, & Schweitzer, 2020).
Other possible mechanisms behind the smaller and even disappearing gender gaps in our data are self- and social selection (i.e., supply- and demand-side factors). Analyzing these mechanisms, including how they interact, is an important avenue for research (Fernandez-Mateo & Kaplan, 2018). Such future work could also consider further measures of power, for instance by using social network analyses to build centrality measures. Replicating our results with such measures would be useful, and it would also open other intriguing questions, such as the extent to which formal (adminship) and informal (social network-based) measures of power overlap in the context of Wikipedia and beyond.
To draw implications for interventions, it will be important to replicate our findings, ideally by experimentally assigning power in order to identify its causal effects. To illustrate, if lower valence is a cause of power, then organizations may want to actively counterbalance its weight in promotion processes (assuming it is not related to ability and leader effectiveness). Conversely, if power causes lower-valenced communication, the implications would be different: for instance, to the extent that low valence creates a less enjoyable organizational culture, policies that foster more positive interactions would be advisable.
More generally, understanding differences in the behaviors of men and women across different hierarchy levels of organizations is a necessary first step to understanding how to remedy gender differences in organizational outcomes. Our analysis takes this step, and it sets the basis for a more rigorous and naturalistic examination of power-gender dynamics. Another practical benefit is that our methods can be deployed at scale and in real time. Using automated text analysis, organizations can thus monitor possible gender differences. This will help them better understand organizational dynamics and, if need be, control for these when making promotion decisions.
Follow-up work should further investigate the role played by the mode of collaboration, comparing conversations in more traditional, small-scale, and co-located team production settings (Leavitt, 1989) with the novel conversational phenomena that arise from large-scale collaborations among self-governing "peers". Our study focuses on the latter. It is interesting that even in such a context, where gender cues are reduced and work takes place in virtual space, we observe notable gender differences among the general population of editors. This provides further evidence, from a non-student sample, that gendered power differentials may persist in online contexts (Guiller & Durndell, 2007). Different conversational dynamics likely unfold where gender cues are more salient and where voice and nonverbal behaviors may be used by women in an attempt to mitigate adverse, possibly gender-specific, consequences of leader-like behaviors (Carli, LaFleur, & Loeber, 1995; Eagly & Karau, 2002; Hall, Coats, & LeBeau, 2005; Schroeder, Kardas, & Epley, 2017). Such analyses in offline contexts would have the benefit that they do not rely on people's decision to reveal their gender and/or other characteristics (e.g., with our dataset we cannot study or control for ethnicity). It remains an empirical question whether people who share their gender differ systematically from those who do not, and on what dimensions. Our results pertain to the population of editors whose gender is revealed, which is only a subgroup of Wikipedia editors.
By considering emotions as a window into the psychology of knowledge production, we hope to provide a basis for further research into the motivations driving the production of global public goods such as Wikipedia. It would be interesting to study the extent to which expressions of emotions in this virtual knowledge production context reflect actually experienced feelings as opposed to possible attempts to conform to gender-and leadership-role specific display rules (Brody, 2000;Simpson & Stroh, 2004). More generally, future work could use automated text analysis to examine a variety of psychological variables and constructs in naturally occurring conversations (see Humphreys & Wang, 2017), with important implications for our understanding of gender, power, and other key social variables in organizations and in everyday life. By using automated text analysis applied to a large dataset of Wikipedia editor conversations, our paper is intended to help lay the groundwork for such an analysis.