Concept Class Analysis: A Method for Identifying Cultural Schemas in Texts

Abstract: Recent methodological work at the intersection of culture, cognition, and computational methods has drawn attention to how cultural schemas can be “recovered” from social survey data. Defining cultural schemas as slowly learned, implicit, and unevenly distributed relational memory structures, researchers show how schemas (or rather, the downstream consequences of people drawing upon them) can be operationalized and measured from domain-specific survey modules. Respondents can then be sorted into “classes” on the basis of the schema to which their survey response patterns best align. In this article, we extend this “schematic class analysis” method to text data. We introduce concept class analysis (CoCA): a hybrid model that combines word embeddings and correlational class analysis to group documents across a corpus by the similarity of schemas recovered from them. We introduce the CoCA model, illustrate its validity and utility using simulations, and conclude with considerations for future research and applications.

Sociologists often use the concept of "schema" to articulate the role of personal culture in social action (DiMaggio 1997; Blair-Loy 2001; Swidler 2001; Vaisey 2009; Cerulo 2010; Spillman 2016). Measuring cultural schemas, which are defined as implicit and unevenly distributed relational memory structures that are slowly learned through prior life experiences, has nonetheless proved tricky for traditional methods like surveys, as has measuring many other kinds of culture (Mohr et al. 2020:3). For this reason, Goldberg's (2011) article was particularly influential: it presented a formal method for not only operationalizing and measuring schemas but doing so in an unlikely place, namely social surveys. Goldberg's (2011) method, relational class analysis (RCA), takes respondents' response vectors across an array of domain-specific ordinal survey items as input and, using a schematic similarity metric and graph-partitioning technique that we briefly summarize later, sorts respondents into classes on the basis of the schema to which their survey response patterns best align. Boutyline's follow-up approach, correlational class analysis (CCA), further exemplifies the utility of these "schematic class analysis" methods for quantitative cultural analysis (Boutyline 2017:385). Given the preponderance of interest in measuring cultural schemas (and finding what they do and what predicts them) and the plethora of types of data for cultural analysis, a logical next step is to develop methods for extracting schemas from sources other than surveys.
We propose such a method: namely, a method for identifying schemas in texts. We introduce concept class analysis (CoCA): a hybrid model that combines word embeddings (specifically, concept mover's distance [Stoltz and Taylor 2019] and semantic directions [Bolukbasi et al. 2016; Kozlowski et al. 2019; Taylor and Stoltz 2020]) and CCA to find groups of documents across a corpus that share similar schematic traces.
In contrast to inductive methods for clustering documents, such as topic modeling, our proposed workflow incorporates a deductive approach in which the analyst can define the dimensions of interest; and, unlike traditional supervised classifiers, our workflow requires no hand-coding of documents. More importantly, unlike surveys, the set of meaningful dimensions that can be estimated from a corpus is not limited by the questions of a survey instrument. If theory and empirical research suggest other semantic directions might be salient for a particular question or corpus, all that is needed is a list of juxtaposed terms that approximate this dimension. This opens up a vast landscape of applications for CoCA, given the ever-growing storehouse of relatively inexpensive and easily accessed natural language corpora. Any corpus can be used with our proposed workflow.
In what follows, we first overview the theoretical foundations of the schematic class analysis framework and further extend these foundations to show how schemas can also be approximated using their "traces" in natural language. We then detail the CoCA model and report the results from a series of simulations that illustrate CoCA's validity and utility as a schema-identifying tool for document corpora. We then conclude with a discussion of future research and applications using the method.

Defining Cultural Schemas
Simply put, a schema is a "flexible memory structure, automatically acquired and updated from patterned activity, [and] composed of multimodal neural associations" (Wood et al. 2018:246). In other words, a schema is a nondeclarative cognitive network that people acquire slowly over the life course to mentally represent situations and their prototypical features as well as typical sensorimotor procedures for interacting with objects and situations (Squire 1992; Ghosh and Gilboa 2014; Lizardo 2017). 1 Schemas are central to human learning, thinking, and action, as they are activated (literally, as a weighted network of co-activating neurons) to recognize, fill, and update patterns in one's everyday observations (Rumelhart 1980:39): a sort of mental shortcut that is especially crucial when faced with "conditions of incomplete information" (DiMaggio 1997:269). Schemas, then, are central neurocognitive structures behind the "cognitive-miser" model of human thinking (Fiske and Taylor 2013). 2 For instance, one may see a 23-wheeled, translucent, octagon-shaped object with a person inside of it moving down a freeway, a previously unobserved object for this person, to be sure. But the fact that the object is on a road, has wheels, is moving, and has a person inside of it will likely satisfy enough of the prototypical features of their internalized "automobile" schema for it to be activated as the relevant sense-making mechanism; the object will almost instantaneously be understood as a road vehicle of some kind. The schema comprises the co-activation of the neural representations of these prototypical features. Furthermore, the internalization of this "automobile" schema is due to the person's repeated and multisensory exposure to cars, trucks, roads, and the like (seeing cars, driving cars, hearing wheels squeal, smelling gasoline, etc.), all correlationally experienced.
A schema is then said to be cultural to the extent that it is a learned cognitive structure (Wood et al. 2018:246). Furthermore, for sociologists, the primary focus of investigation has been the extent to which people share similar schemas (Boutyline and Soter 2020:6). 3 By exploring the distributions of patterns of schematic traces in social life, or what Sperber (1996:1-2) refers to as an "epidemiology of representation," the analyst "'dissolv[es]' the individual, unlock[ing] the potential to theorize the 'collective-social' at a level below the person" (Lizardo et al. 2020:9).
The automobile schema in the previous illustration is an example of a cultural schema, as it is certainly learned through one's experience of the social world. Importantly, although it may be widely shared (in terms of relative similarity between people), it is not universal, and it is not necessarily the most salient sense-making mechanism. If a person grew up and lived in a militarized zone, for instance, then the features of this object may instead prime a "tank" schema (or perhaps a more general "military" schema).
For our purposes, there are two important points to be gleaned from the discussion thus far. First, schemas are inherently relational in the sense that they are "implicit recognition procedures that emerge from intricate associational links" between elements represented in memory (Goldberg 2011:1401; Boutyline and Vaisey 2017; Hunzaker and Valentino 2019). Second, cultural schemas will be shared, to the extent that people have similar life histories and therefore occupy similar social positions and life-course trajectories. Cultural schemas, by their nature of being learned and unevenly shared across populations, can be conceptualized as a means of distinguishing between "thought communities," or social groups with more or less distinct meaning-making and meaning-maintenance practices (Fleck 1979; Zerubavel 2009; Lee and Martin 2018). Methodologically, then, we can "reverse engineer" and identify cultural schemas by identifying the social groups that demonstrate similar patterns of activities, attitudes, or products.

Schematic Class Analysis with Survey Data
Cultural schemas have been measured in a number of ways, e.g., through experiments in cultural and social psychology and semistructured interviews and participant observation in cognitive anthropology (Strauss and Quinn 1997; Shore 1998; Morris and Mok 2011; Hunzaker 2014; Homan et al. 2017; Hunzaker and Valentino 2019; Miles 2019). In a now classic 2011 article, Goldberg put forth RCA as an innovative method for measuring cultural schemas in social survey data. Specifically, RCA quantifies the extent to which participants can be grouped by formal similarities in their responses and then uses group-specific inter-item correlations to interpret the schemas that define each group. Goldberg (2011:1402-1405) gives the following example to show the intuition behind RCA. Consider Figure 1, which reports musical tastes for four (A, B, C, and D) hypothetical survey respondents. Each respondent rates the extent to which they "strongly dislike" (1) or "strongly like" (5) seven different music genres, listed along the x axis of panel A. Respondent A likes pop, blues, and rock, strongly likes classical and opera, and is fairly "in the middle" when it comes to bluegrass and country. Respondent B does not respond with any of the same ordinal categories across the genres: they dislike pop, blues, and rock, are neutral about classical and opera, and strongly dislike bluegrass and country.

[Figure 1. Caption fragment: Goldberg (2011:1405) and Boutyline (2017:357, 361). The figure shows survey responses on musical tastes from four hypothetical respondents, ranging from "strongly dislike" (1) to "strongly like" (5).]
We may default to using Euclidean distances as a measure of the extent to which these two respondents are similar in their musical tastes, but, as Goldberg notes (2011:1406), Euclidean distances only account for respondent differences specific to each variable (e.g., $A_{Pop} - B_{Pop}$) and rely on summed squares (i.e., $\sum_{g \in G} (A_g - B_g)^2$, where g is a genre in the total set of genres G). This means that Euclidean distances account for neither (a) interrespondent similarities in within-respondent response patterns across the set of survey items nor (b) the directionality of the similarities. Consequently, respondents A and B exhibit a (standardized) Euclidean distance of about 0.5. If, however, we are interested in the extent to which respondents A and B are schematically similar, then we should instead identify similarities in patterns of relative adjudication, that is, the extent to which respondents show similarity in the oppositions they implicitly assign between survey items. Though the respondents do not give the same answers for any of the taste items, they nonetheless "weight" their relative like or dislike for each genre in the same direction: respondents A and B both equally (dis)like pop, blues, and rock, like classical and opera equal to one another (and more so than they like pop, blues, and rock), and they both dislike bluegrass and country more than they dislike any of the other genres. These respondents, then, are schematically identical: they use the same musical taste schema to valuate musical genres.
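The contrast between item-wise distance and pattern similarity can be checked with a small sketch. The rating vectors below are inferred from the prose description of respondents A and B; the genre order and the standardization (dividing by the largest distance possible on a 1-to-5 scale) are assumptions made for illustration, not necessarily the procedure behind the reported value.

```python
import math

# Ratings inferred from the prose description (1 = strongly dislike,
# 5 = strongly like); the genre order is an assumption for illustration.
GENRES = ["pop", "blues", "rock", "classical", "opera", "bluegrass", "country"]
A = [4, 4, 4, 5, 5, 3, 3]
B = [2, 2, 2, 3, 3, 1, 1]

def euclidean(x, y):
    """Square root of the summed squared item-wise differences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Assumed standardization: divide by the maximum possible distance on a 1-5 scale.
max_distance = math.sqrt(len(GENRES) * (5 - 1) ** 2)
standardized = euclidean(A, B) / max_distance

# Each respondent's ratings relative to their own first item: the
# within-respondent pattern that the Euclidean distance does not compare.
pattern_a = [a - A[0] for a in A]
pattern_b = [b - B[0] for b in B]

print(round(standardized, 3))
print(pattern_a == pattern_b)
```

Under these assumptions the two respondents are a substantial distance apart item by item even though their within-respondent patterns are identical.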
Goldberg introduces a metric called "relationality" to quantify the extent to which two respondents are schematically similar. For respondents A and B in Figure 1, their relationality is equal to 1 (on a -1 to 1 scale)-indicating they are schematically identical (notice that A's and B's "trend line" across the genres is exactly the same; B is simply shifted down by two scale categories for each genre).
Importantly, two people can acquire similar schemas and yet express diametrically opposed valuations of individual objects within that same schematic understanding. For example, respondent C does not share any of the same answers across the genres as respondents A and B. Furthermore, their valuations go in the opposite direction: they are neutral about pop, blues, and rock, strongly dislike classical and opera, and strongly like bluegrass and country. Although C's valuations run opposite to those of A and B, they nonetheless exhibit the same pattern of evaluation: e.g., classical, opera, bluegrass, and country are highly distinguishing genres in their meaning-making. Appropriately, C's relationality with A and B is negative. Finally, respondent D shows no discernible similarities in how they organize their tastes; as such, they have a relationality close to zero for all three pairwise comparisons.
RCA then applies a graph-partitioning algorithm to the symmetric matrix of absolute pairwise interrespondent relationalities to identify the schematic classes, i.e., those groups of respondents who approximate the same cultural schema for organizing their beliefs, attitudes, or tastes within the domain of interest.
An important variant to RCA is CCA (Boutyline 2017). In short, CCA differs from RCA only in the measure of schematic similarity; the theoretical foundations, default graph-partitioning algorithm, and interpretation techniques are otherwise the same. Boutyline observed that two respondents-say, i and j-draw upon exactly the same schema for finding meaning in some domain if their associated response vectors across the set of K domain-specific survey items are perfect linear transformations of one another (Boutyline 2017:356). The absolute Pearson correlation is-like absolute relationality-a continuous measure in the [0,1] interval of the extent to which i and j follow the same schema. Boutyline reported a series of simulations and found that the interrespondent absolute Pearson correlation was consistently more accurate than the absolute relationality as a schematic similarity metric and also more computationally efficient. For these reasons carefully laid out by Boutyline, we rely on CCA as our model for schematic class analysis.
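As a minimal sketch of the CCA similarity measure, the following computes Pearson correlations among respondents A, B, and C. The rating vectors are inferred from the prose description of the Figure 1 example (the genre order is an assumption), and the `pearson` helper is written out here for illustration rather than taken from any CCA implementation.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length response vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Ratings inferred from the prose description; genre order is assumed to be
# pop, blues, rock, classical, opera, bluegrass, country.
respondents = {
    "A": [4, 4, 4, 5, 5, 3, 3],
    "B": [2, 2, 2, 3, 3, 1, 1],
    "C": [3, 3, 3, 1, 1, 5, 5],
}

r_ab = pearson(respondents["A"], respondents["B"])
r_ac = pearson(respondents["A"], respondents["C"])
r_bc = pearson(respondents["B"], respondents["C"])

# CCA uses the absolute value, so inverted patterns count as the same schema.
for label, r in [("A-B", r_ab), ("A-C", r_ac), ("B-C", r_bc)]:
    print(label, round(r, 3), round(abs(r), 3))
```

B is A shifted down by two categories and C is an inverted, rescaled version of A, so all three absolute correlations equal 1 even though no two respondents share a single answer.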
Together, RCA and CCA have enjoyed wide empirical application. They have been used to study taste ( (Rossoni et al. 2020), and literary schemas (Rawlings and Childress 2019). What all of these applications have in common, of course, is the reliance on attitudinal measures for identifying schemas.
How might RCA/CCA be applicable for researchers with nonsurvey data-say, texts? 4

Can Cultural Schemas Be Identified in Natural Language?
Before outlining our method, we must first (briefly) lay out some theoretical and empirical support for the notion that schemas can be recovered from written natural language. Although the production of any piece of text is shaped by a collective process that is always embedded in social organization, and thus shaped by the exigencies of specific fields (e.g., Childress 2017), writing is also a fundamentally cognitive process. Furthermore, although the specific content is often the subject of explicit consideration and deliberate "framing" (DiMaggio et al. 2013; Wood et al. 2018), there are also intuitive aspects of writing that are often never more explicit than quickly judging whether a string of words "sounds right." As such, along with practice-theoretic traditions in the study of the production of culture (Bourdieu 1993; Breiger 2000), we argue that there is a relative homology between the schemas authors draw upon to produce texts and the latent patterns learned from the texts themselves (Ignatow 2016). This argument is supported by a growing body of cognitive, psychological, and linguistic research using distributed representations of words (specifically, word embeddings) to recover schematic conceptual information from the statistical properties of natural language corpora. In a review and comprehensive expansion of this line of research, Utsumi (2020) found that, with some variation, many distinct kinds of conceptual knowledge (concrete, abstract, spatial, temporal, perceptual, emotional, etc.) are accurately encoded in word embeddings (hereafter just embeddings). For example, one team (Fulda et al. 2017) found that embeddings trained on Wikipedia articles encoded prototypical object affordances, whereas another team found that embeddings encoded visual and evaluative features of objects, such as size and dangerousness (Grand et al. 2018) and can thus be used to accurately model human judgement about a variety of domains (Richie et al. 2019).
Also working with word embeddings, Joseph and Morgan (2020) found that these distributional models reveal stereotypical biases (e.g., engineers are men) that largely mirror those found in survey data (see also Caliskan et al. 2017; Kozlowski et al. 2019).
But why do embeddings have this range of ecological validity? At core, this research is based on the "distributional hypothesis" in linguistics, which Harris (1954:156) defined as "difference of meaning correlates with difference of distribution." Specifically (Arseniev-Koehler and Foster 2020; Caliskan and Lewis 2020), this is because distributed representations are a neurally plausible model of how human brains actually represent meaning, following decades of research on connectionism and artificial neural networks (Günther et al. 2019). Put another way, the key point of similarity between human minds and these computational models of language is that both store information as distributions:

    In contrast to maintaining slots to represent all possible concepts in the world, storing a concept as a distributed representation means that many concepts can be represented by a limited number of neurons because even a limited number of neurons may have many, many possible patterns of activation (Arseniev-Koehler and Foster 2020:7).

As a striking demonstration of this, a team was able to predict stories people were reading from functional magnetic resonance imaging data by using semantic representations of the stories obtained with embedding methods. Even more compelling, "this was possible even when the distributed representations were calculated using stories in a different language than the participant was reading" (Dehghani et al. 2017:6096).
Arseniev-Koehler and Foster (2020) further argue that one of the most common word embedding models, word2vec, is also an accurate model for the process by which human minds acquire or build these distributed representations (see also Caliskan and Lewis 2020). There are alternative procedures to produce word embeddings that have outputs comparable to word2vec, namely GloVe (Pennington et al. 2014). As our proposed method begins with embeddings, it is more or less agnostic to the underlying embedding model that produced them. 5

Concept Class Analysis: The Building Blocks
Having outlined the mechanics of RCA/CCA and shown how cognitive-schematic information can be recovered from written natural language, we now outline how CCA can be combined with embedding methods to group texts into schematic classes.

Concept Mover's Distance
Concept mover's distance (CMD) is a method for measuring the extent to which a document is similar to a location in a word embedding vector space (Stoltz and Taylor 2019). It begins by conceiving of documents as clouds of locations, drawing on prior work showing that similarity between two documents can be quantified by the "cost" of moving the clouds constituting one document to the locations of the clouds constituting another (Kusner et al. 2015; Atasu et al. 2017). CMD measures the cost of moving all words in a document to one or very few points in the embedding space. In its simplest form, this point is defined by a single word vector. Treating this as a transportation problem, in which term counts constitute amounts to be moved and word vectors define distances to be moved, this becomes a special case of earth mover's distance (EMD; i.e., the Wasserstein metric for comparing discrete probability distributions).
Imagine two documents represented as bags of words (i.e., vectors of unique term counts), $doc_1$ and $doc_2$. These documents are normalized by dividing each by its respective total term count such that each vector sums to 1. EMD finds how much of word p in $doc_1$ has to flow to word q in $doc_2$, and the cost of moving word p in $doc_1$ to word q in $doc_2$. Formally, this is defined as follows (Atasu et al. 2017:890; see also Rubner et al. 1998):

$$\mathrm{EMD}(doc_1, doc_2) = \min_{y \geq 0} \sum_{p=1}^{n} \sum_{q=1}^{n} y[p, q] \, c[p, q], \tag{1}$$

where y and c are n × n matrices (where n is the total vocabulary in the corpus), y[p, q] is how much of word p in $doc_1$ flows to word q in $doc_2$, and c[p, q] is the cosine distance (one minus the cosine similarity) between words p and q in an embedding space (i.e., the cost of movement). The amount a word can flow from $doc_1$ is ultimately constrained by the relative frequency of its nearest neighbors in $doc_2$, such that (Atasu et al. 2017:890):

$$\sum_{q=1}^{n} y[p, q] = doc_1[p], \qquad \sum_{p=1}^{n} y[p, q] = doc_2[q]. \tag{2}$$

Let's say that the nearest neighbor of the word "dog" in $doc_1$ is "cat" in $doc_2$, but 20 percent of the words in $doc_1$ are "dog," whereas only 10 percent of the words in $doc_2$ are "cat." The remaining 10 percent of $doc_1$ must be distributed to the next nearest neighbor in $doc_2$, and on and on. The row vectors of the resulting flow matrix, y, between $doc_1$ and $doc_2$ must therefore sum to the relative frequency of word p in $doc_1$ (i.e., $\sum_q y[p, q] = doc_1[p]$) (see Kusner et al. 2015:3). Similarly, the column vectors of y must sum to the relative frequency of word q in $doc_2$ (i.e., $\sum_p y[p, q] = doc_2[q]$). The strength of this, however, is that the documents can be very different lengths, up to the limiting case where one document is a single word. CMD exploits this strength.
In the simplest case, CMD calculates the cost of moving to a single word denoting a concept of interest-for example, "love." The concept can be further specified either by including additional terms-e.g., "romantic love"-or by finding the average vector of various words denoting the concept of interest-e.g. "love," "romance," "devotion," "affection," and so on. CMD accomplishes this task by treating the word (or words) denoting a concept as a "pseudo-document," appending that pseudo-document to the document-term matrix (consisting of all 0s except for the columns that are for concept-denoting terms, which get 1s), then using a computationally efficient solver for the EMD problem 6 to arrive at a vector with scores indicating how similar each real corpus document is to the pseudo-document-that is, to the concept of interest.
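The single-word case can be sketched directly, because when the pseudo-document contains only one term every unit of a document's mass must flow to that term, so the transport cost reduces to a frequency-weighted average distance. The three-dimensional embeddings and the `cost_to_concept` helper below are hypothetical and invented for illustration; the sketch uses cosine distance as the cost and omits the solver and the standardization of scores used in practice.

```python
import math
from collections import Counter

# Hypothetical toy embeddings, invented purely for illustration.
EMBEDDINGS = {
    "love": [0.9, 0.1, 0.0],
    "romance": [0.8, 0.2, 0.1],
    "tax": [0.0, 0.9, 0.3],
    "budget": [0.1, 0.8, 0.4],
}

def cosine_distance(u, v):
    """One minus the cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def cost_to_concept(tokens, concept):
    """Cost of moving a document's normalized term mass to one concept word.

    With a one-word target, the only feasible flow sends all mass to that
    word, so the minimum cost is a weighted sum of distances.
    """
    counts = Counter(t for t in tokens if t in EMBEDDINGS)
    total = sum(counts.values())
    target = EMBEDDINGS[concept]
    return sum(
        (count / total) * cosine_distance(EMBEDDINGS[word], target)
        for word, count in counts.items()
    )

doc_romantic = ["love", "romance", "romance", "tax"]
doc_fiscal = ["tax", "budget", "budget", "love"]

cost_romantic = cost_to_concept(doc_romantic, "love")
cost_fiscal = cost_to_concept(doc_fiscal, "love")
print(cost_romantic < cost_fiscal)  # lower cost means closer to the concept
```

A lower cost indicates a document sits closer to the concept, so a similarity score would reverse the ordering of these costs.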
Finally, and central to our proposed method, we can also measure the cost of moving a document in a particular direction defined by juxtaposed terms-e.g., "love" as opposed to "hate." We now move to a discussion of this particular procedure.

Semantic Directions
To understand semantic directions, consider gender bias in texts (Bolukbasi et al. 2016). This is a major area of study using word embeddings because an analyst can easily define a direction toward women and away from men, for example. This is implicit in the king:man as queen:woman analogy task introduced by Mikolov et al. (2013). If we subtracted "man" from "woman," the result would be a vector that defined a direction pointing toward "woman" and away from "man"; it is not a word vector per se but a kind of relation vector. If we measured the cosine similarity of "queen" and "king" with this new relation vector, we would find that queen is close (as measured by the cosine of the angle between the vectors) and king is far. But queen and king are obviously gendered labels; what about monarch or castle? We might find they are somewhere between queen and king. If the cosine similarity is zero (i.e., the vectors are perpendicular), this means they are equidistant to both man and woman. However, if they lean to one side or the other, we would say these terms have a bias toward man or woman (Bolukbasi et al. 2016; Caliskan et al. 2017).
To increase the accuracy of this direction, we can find several gendered word vectors corresponding to man on the one hand (e.g., men, gentlemen, boys) and woman on the other (e.g., women, ladies, girls), subtract the vectors for each pair, and then average the result (Kozlowski et al. 2019; Arseniev-Koehler and Foster 2020; Taylor and Stoltz 2020). Therefore, we define a semantic direction, d, as the mean of a set of vector differences between a collection of juxtaposed word pairs:

$$d = \frac{1}{|P|} \sum_{p \in P} (p_1 - p_2), \tag{3}$$

where p is a word pair in the total set P of relevant juxtaposed word pairs, $p_1$ and $p_2$ are the vector representations of the two words in juxtaposed pair p, and d points toward $p_1$ and away from $p_2$. 7 We can then take a sample of terms and measure whether they are closer to one side or the other of this semantic direction to determine a range of gender bias. For example (Jones et al. 2020:14-15), we could determine the gender bias of "Science" by measuring the cosine similarity between our gender relation vector and the following terms: science, technology, physics, chemistry, Einstein, NASA, experiment, and astronomy. We can then average each similarity to arrive at a summarized measure of gender bias in "Science." If the average cosine is zero, the domain leans neither toward women (a cosine of 1) nor toward men (a cosine of −1).
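The averaging of pair offsets described above can be sketched as follows. The three-dimensional vectors are hypothetical and chosen only so the arithmetic is easy to follow; real embeddings have hundreds of dimensions, and the helper names are illustrative rather than drawn from any package.

```python
import math

# Hypothetical toy vectors, invented for illustration only.
VECTORS = {
    "woman": [0.9, 0.1, 0.2], "man": [-0.9, 0.1, 0.2],
    "women": [0.8, 0.2, 0.1], "men": [-0.8, 0.2, 0.1],
    "girls": [0.7, 0.0, 0.3], "boys": [-0.7, 0.0, 0.3],
    "queen": [0.6, 0.7, 0.1], "king": [-0.6, 0.7, 0.1],
    "castle": [0.0, 0.8, 0.5],
}

# Juxtaposed pairs (p1, p2): the direction points toward p1 and away from p2.
PAIRS = [("woman", "man"), ("women", "men"), ("girls", "boys")]

def semantic_direction(pairs):
    """Mean of the vector differences p1 - p2 over all juxtaposed pairs."""
    dim = len(VECTORS[pairs[0][0]])
    direction = [0.0] * dim
    for first, second in pairs:
        for i in range(dim):
            direction[i] += (VECTORS[first][i] - VECTORS[second][i]) / len(pairs)
    return direction

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

direction = semantic_direction(PAIRS)
for term in ["queen", "king", "castle"]:
    print(term, round(cosine(VECTORS[term], direction), 3))
```

In this toy space "queen" has a positive cosine with the direction, "king" a negative one, and "castle" a cosine of zero, mirroring the interpretation given in the text.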
We can easily generalize this procedure to domains beyond gender. For example, Kozlowski et al. (2019) estimate directions for affluence and education, whereas Arseniev-Koehler and Foster (2020) estimate directions for health and morality, and Taylor and Stoltz (2020) get directions for life/death and political ideology (see Table 1 for examples of juxtaposing terms for a few semantic directions). To estimate a semantic direction, we simply need a list of juxtaposing terms-or what Arseniev-Koehler and Foster (2020) refer to as "anchor" words-for each dimension of interest within a given domain. This can be accomplished using precompiled dictionaries, prior theory and research, synonym/antonym tools like WordNet (Fellbaum 1998), and seeing which other terms in the word embeddings have high cosine similarity to our anchor terms.
Finally, once we derive a vector for our semantic direction using the above procedure, we can use CMD to measure engagement with a particular pole of a semantic direction. For example, in a corpus of U.S. political texts, one document may engage more with the concept of "conservative" relative to "liberal," meaning it would have a larger and positive "political ideology" CMD score, assuming that the "liberal" word was subtracted from the "conservative" word across the political ideology word pairs. These semantic direction CMDs are the foundational input to CoCA.

Robustness Checks for Semantic Directions
As robustness checks, we should consider the following: sensitivity to juxtaposed terms, sensitivity to the method of averaging the vector offsets, and overall face validity. Regarding the first, we could compare each document's engagement with the averaged semantic direction to the component directions defined by the individual juxtaposed term pairs (see Taylor and Stoltz 2020:10) to ensure that a single set of term pairs is not driving the majority of the engagement. Put another way, are all term pairs approximating a similar semantic direction? Second, we could compare our proposed method of averaging (as defined in Equation 3) with potential alternatives (see, for example, Arseniev-Koehler and Foster 2020:19; Taylor and Stoltz 2020, appendix). Finally, we could check the face validity of the direction by using it as a term-classifier (e.g., Arseniev-Koehler and Foster 2020:19). Using CMD, we could also do this at the level of document classification by checking whether key document-level covariates roughly track engagement as expected.

CCA and Modularity Maximization
Once the semantic direction CMDs have been computed-where each document has a standardized score indicating the extent to which they engage with a particular pole of a binary concept-the last step is to simply pass this document-by-CMD matrix to the CCA algorithm. Consider the hypothetical example in Figure 2, a CoCA variant of the RCA/CCA example presented in Figure 1. Instead of four respondents with answers across seven ordinal and equidistant survey items, we now have four documents explicitly discussing immigration (panel A). These documents have seven scores, each indicating the extent to which they engage with one pole of a direction pertaining to the topic of immigration. These scores are continuous.
Documents A and B follow the same cultural schema in their discussion of immigration issues: they both engage very little with the racial, skills, legality, employment, and education dimensions of immigration discourse (or otherwise engage with the binary oppositions that define those dimensions relatively equally), but display pronounced biases toward discussions of open borders and low economic capital when it comes to the dimensions of borders and socioeconomic status. Just like in the Figure 1 example, document C does not have any of the same CMD scores as A and B, but instead displays an inversion of this pattern. As such, we might say that documents A, B, and C carry "traces" of the same schema: a "closed/open borders and high/low SES" schema, perhaps, where the concepts of borders and socioeconomic capital are highly distinguishing meaning-making features. Document D, however, displays a substantively different relational pattern across the CMD scores, and therefore carries traces of a different schema.

[Figure 2: CMD scores for four hypothetical documents across seven immigration-related semantic directions.]

Documents A and B are perfect linear transformations of one another (the two equations that follow jointly imply B = A − 0.4), so their Pearson correlation is 1. Similarly, the Pearson correlations between A and C and between B and C are appropriately −1 because, as before, A and C and B and C are perfect linear transformations of one another (C = 0.1 − A; C = −0.3 − B). The variance of document D, however, is not fully accounted for by its bivariate relationship with any of the other documents ($D = 0.4548 + 0.2903A + \epsilon_1$, $D = 0.571 + 0.2903B + \epsilon_2$, and $D = 0.4839 - 0.2903C + \epsilon_3$, where $\epsilon_1$, $\epsilon_2$, and $\epsilon_3$ are residual terms and where each $\epsilon_i \neq 0$). As expected, the Pearson correlations between D and the other three documents are close to zero.
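The link between perfect linear transformations and correlations of 1 or −1 follows from a short derivation, given here as a worked check rather than as part of the original exposition. Let x and y be two documents' score vectors over K directions, with $y = \alpha + \beta x$, $\beta \neq 0$, and x not constant. Then $\bar{y} = \alpha + \beta \bar{x}$, so each deviation satisfies $y_k - \bar{y} = \beta (x_k - \bar{x})$, and

$$
r_{xy}
= \frac{\sum_{k=1}^{K} (x_k - \bar{x})(y_k - \bar{y})}
       {\sqrt{\sum_{k=1}^{K} (x_k - \bar{x})^2}\,\sqrt{\sum_{k=1}^{K} (y_k - \bar{y})^2}}
= \frac{\beta \sum_{k=1}^{K} (x_k - \bar{x})^2}
       {|\beta| \sum_{k=1}^{K} (x_k - \bar{x})^2}
= \frac{\beta}{|\beta|}.
$$

For C = 0.1 − A the slope is $\beta = -1$, which gives $r = -1$ and $|r| = 1$; the intercept plays no role in the result.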
The absolute values of the interdocument correlations are taken prior to subsetting the documents into schematic classes so that documents with inverted patterns are grouped together. Specifically, the symmetric document-by-document matrix, A, of absolute Pearson correlations is treated as a fully connected and weighted graph and partitioned using modularity maximization (Newman 2004; Goldberg 2011:1409-1410; Boutyline 2017:386). CoCA returns the set of schematic classes that maximizes graph A's modularity score (after optionally decreasing the density of the graph by removing statistically nonsignificant correlations at some researcher-specified α-level). 8 A higher graph modularity score, Q, indicates stronger absolute correlations between documents assigned to the same class than one would expect if the correlations were placed between documents at random.
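To make the role of Q concrete, the sketch below scores candidate partitions of a small weighted graph using the standard weighted form of Newman's modularity. It only evaluates partitions that are supplied to it and does not perform the search over partitions that a modularity-maximization algorithm carries out. The six-document matrix is hypothetical, and setting the diagonal to zero (no self-loops) is an assumption of this sketch.

```python
def modularity(weights, labels):
    """Weighted modularity Q of a partition of an undirected graph.

    Q = (1 / 2m) * sum over same-class pairs of [w_ij - k_i * k_j / (2m)],
    where k_i is the summed edge weight of node i and 2m is the total of all k_i.
    """
    n = len(weights)
    strength = [sum(row) for row in weights]
    two_m = sum(strength)
    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                q += weights[i][j] - strength[i] * strength[j] / two_m
    return q / two_m

# Hypothetical absolute correlations: documents 0-2 share one pattern,
# documents 3-5 share another; the diagonal is zeroed out.
GROUP = [0, 0, 0, 1, 1, 1]
ABS_CORR = [
    [0.0 if i == j else (0.9 if GROUP[i] == GROUP[j] else 0.1) for j in range(6)]
    for i in range(6)
]

q_true = modularity(ABS_CORR, GROUP)                 # the two built-in classes
q_mixed = modularity(ABS_CORR, [0, 0, 1, 0, 1, 1])   # a scrambled partition
q_single = modularity(ABS_CORR, [0] * 6)             # everything in one class

print(round(q_true, 3), round(q_mixed, 3), round(q_single, 3))
```

The partition that matches the built-in groups scores highest, the scrambled partition scores below zero, and placing every document in one class yields a Q of zero.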
As a class detection method, this modularity-based technique has the benefit of reporting the class solution that best maximizes Q-as opposed to, for example, agglomerative clustering, which requires researcher decisions on where to "cut" and define the classes. Lastly, modularity maximization also aligns with cognitive "theories of schematic transmission," as Goldberg notes (2011:1410). 9

Simulation Analysis
As a document's engagement with semantic directions is both continuous and less precise than are the ordinal scales found in survey responses, it is important to determine whether CCA is robust to variation on continuous measures. Therefore, we present a simulation analysis to illustrate how CoCA accurately identifies "ground truth" schemas from a range of artificial corpora with varying amounts of random noise injected into the schematic patterns. 10 Next, we use one of the simulated class solutions to present some schema-specific interdirection correlation networks-common visualization strategies for interpreting the derived cultural schemas.

Partition Validity
Boutyline illustrated how CCA is robust to varying levels of noise (2017:365-366). We ran a series of simulations to test how accurate CCA remains as a schema-partitioning method when the input data are continuous (semantic direction CMDs) rather than ordinal (survey items with Likert and Likert-type response scales).
First, we created three separate within-document, cross-direction patterns that varied within a [−3, 3] interval: one that follows document A's pattern in Figure 2, one that follows document D's pattern in Figure 2, and a third pattern. The three patterns are shown in Figure 3, in blue. Then, noting that two documents i and j (with x and y being i's and j's score vectors across a set of semantic direction CMDs, respectively) have traces of the exact same schema if y = α + βx with β ≠ 0, we created three separate linear transformations of each of these patterns: one shifted by α = −0.4, one scaled by β = 0.6, and one inverted (y = −x). Those transformed patterns are visualized with each original pattern in Figure 3. We then repeated each of these 12 patterns 10 times to create a simulated data set of 120 documents across seven semantic direction CMDs. Each pattern correlates perfectly with every other pattern within the same set (see Table 2; only within-set correlations are shown). This means that each set of documents ({A, B, C, D}, {E, F, G, H}, and {I, J, K, L}, respectively) follows its own schematic pattern. Indeed, a CCA of this N = 120 corpus accurately partitions the documents into three schematic classes, with 40 documents in each.
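The construction of the noise-free data set can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the three base patterns are made-up values (the actual patterns appear only in Figure 3), and the names `transform`, `base_patterns`, and `variants` are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length score vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(
        sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)
    )


def transform(pattern, alpha=0.0, beta=1.0):
    """Return y = alpha + beta * x for every score x in the pattern."""
    return [alpha + beta * v for v in pattern]


# Assumed base patterns across seven semantic directions, within [-3, 3].
base_patterns = {
    1: [0.1, -0.2, 0.0, 1.5, -1.2, 0.2, -0.1],
    2: [1.0, -1.1, 0.9, 0.1, 0.0, -0.9, 1.1],
    3: [-2.0, -1.0, 0.0, 1.0, 2.0, 1.0, -1.0],
}

# (alpha, beta) pairs: original, shifted, scaled, and inverted.
variants = [(0.0, 1.0), (-0.4, 1.0), (0.0, 0.6), (0.0, -1.0)]

corpus, true_class = [], []
for label, pattern in base_patterns.items():
    for alpha, beta in variants:
        for _ in range(10):  # each of the 12 patterns is repeated 10 times
            corpus.append(transform(pattern, alpha, beta))
            true_class.append(label)

# Smallest absolute correlation between any two documents in the same set.
min_within = min(
    abs(pearson(corpus[i], corpus[j]))
    for i, j in combinations(range(len(corpus)), 2)
    if true_class[i] == true_class[j]
)
print(len(corpus), round(min_within, 6))
```

Because every document within a set is a linear transformation of the same base pattern with β ≠ 0, all within-set absolute correlations equal 1 up to floating-point error.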
We then created 981 additional simulated data sets, each with random noise injected into the within-document, cross-direction patterns (i.e., the document row vectors). Each document across these data sets was perturbed with a random deviation drawn from a uniform distribution (see the "jitter" function in base R; R Core Team 2020). Let the "noise factor" be f. For the first of the 981 data sets, f = 1, corresponding to a relatively small amount of noise. For each subsequent data set, f was increased by 0.05, so that the final data set (number 981) had a noise factor of 50. CCA was run on each of the simulations, leading to 982 total schematic class solutions (the original, representing the true class partition, and 981 class assignments from the noise-added simulations).
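The noise schedule and the perturbation step can be sketched in Python as follows. The function below is a simplified stand-in that loosely follows the default behavior of R's "jitter" (uniform noise with half-width f/5 times the smallest gap between distinct values); it does not reproduce R's rounding or fallback rules, and whether the original analysis jittered each row or the whole matrix is not stated here, so the row-wise application is an assumption.

```python
import random


def jitter(x, factor=1.0, rng=random):
    """Add uniform noise in [-amount, amount] to each value of x.

    amount = factor / 5 * d, where d is the smallest gap between distinct
    values of x. The fallback d = 1.0 for constant vectors is an assumption.
    """
    distinct = sorted(set(round(v, 6) for v in x))
    gaps = [b - a for a, b in zip(distinct, distinct[1:])]
    d = min(gaps) if gaps else 1.0
    amount = factor / 5 * d
    return [v + rng.uniform(-amount, amount) for v in x]


# Noise factors f = 1.00, 1.05, ..., 50.00 (981 values in total).
noise_factors = [round(1 + 0.05 * k, 2) for k in range(981)]

# Perturb one hypothetical document row at the smallest noise factor.
rng = random.Random(42)
document = [0.0, 0.5, 2.0]
noisy = jitter(document, factor=noise_factors[0], rng=rng)
print(len(noise_factors), noise_factors[0], noise_factors[-1])
```

Starting at f = 1 and adding 0.05 a total of 980 times gives 1 + 980 × 0.05 = 50, which matches the noise factor reported for data set number 981.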
How accurate was CCA at finding the correct class solution with varying levels of random noise? To assess this, we calculated Cramér's V between the true class solution and each of the 981 simulated solutions. A Cramér's V of 1 indicates a perfect association: each category of the row variable (the true three-class solution) corresponds perfectly to one (and only one) category of the column variable (one of the simulated solutions). A Cramér's V of 0 indicates no association, and values between 0 and 1 fall between these extremes. The 981 V statistics are summarized in Table 3. The mean V was 0.81 with a standard [...]
These results suggest that CCA does an adequate job of finding the correct class solutions when the input data are continuous (i.e., semantic direction CMDs), even at high levels of noise. Indeed, as Figure 4 shows, most of the lower V statistics were from simulations with relatively high levels of noise. Altogether, these simulations suggest that CCA is a robust method for detecting schematic patterns across semantic direction CMDs.
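For readers unfamiliar with the statistic, the following Python sketch computes the standard (uncorrected) Cramér's V between two class assignments. Whether the original analysis applied any bias correction is not stated here, so the uncorrected form is an assumption, as are the function name `cramers_v` and the toy label vectors.

```python
from collections import Counter
from math import sqrt


def cramers_v(labels_a, labels_b):
    """Uncorrected Cramér's V between two categorical label vectors."""
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))
    row = Counter(labels_a)
    col = Counter(labels_b)
    chi2 = 0.0
    for r in row:
        for c in col:
            expected = row[r] * col[c] / n
            chi2 += (joint[(r, c)] - expected) ** 2 / expected
    k = min(len(row), len(col)) - 1
    if k == 0:
        # V is undefined when either solution has a single class;
        # returning 0.0 here is a convention chosen for this sketch.
        return 0.0
    return sqrt(chi2 / (n * k))


true_solution = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
relabeled = ["c", "c", "c", "c", "a", "a", "a", "a", "b", "b", "b", "b"]
v_perfect = cramers_v(true_solution, relabeled)
v_independent = cramers_v([1, 1, 2, 2], [1, 2, 1, 2])
print(round(v_perfect, 6), round(v_independent, 6))
```

Because V depends only on the cross-tabulation, a simulated solution that recovers the true classes under different labels still scores 1, which is why it suits comparisons between partitions whose class numbering is arbitrary.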

Correlation Networks
In a simulation such as the one just described, the schematic patterns are known by design. Of course, in the vast majority of empirical use cases, this will not be the case. How, then, can researchers interpret the schema associated with each class after deriving them?
In RCA/CCA, the norm is to present class-specific inter-item correlation networks. The same visualization strategy can be used with CoCA, but with semantic direction CMDs rather than survey variables. The result is a set of interdirection correlation networks in which the edges indicate the extent to which engagement with one binary concept is predictive of engagement with another binary concept.
Consider the initial simulation data set, consisting of 120 documents and with each document a perfect linear transformation of one of three schematic patterns (see Figure 3). As stated above, CoCA accurately partitioned this set of documents into three schematic classes, each consisting of 40 documents. Correlation networks for each class are presented in Figure 5. The left panel corresponds to the top schematic class in Figure 3, the middle to the middle schema, and the right panel corresponds to the bottom schema.
Correlation networks such as these provide the information necessary to "reconstruct" the schemas that constitute each class. For example, looking at the Class #1 network in Figure 5, we see that semantic directions #4 and #5 are strongly and positively correlated. This is in line with the schema as shown in the top panel of Figure 3, which shows that each document engages the same poles of each direction. Suppose, for instance, that direction #4 is an "Employment" direction opposing "employed" with "unemployed" and direction #5 is an "Education" direction opposing "educated" with "uneducated." Then, if a document in Class #1 engages with the concept of "employed" (in opposition to "unemployed"), that document also engages the concept of "educated" (as opposed to "uneducated"), and vice versa. We also see negative correlations between direction #7 and direction #5 on the Class #1 network. This again is in line with the top panel of Figure 3, where engagement with direction #5 is inversely related to #7 for each document. If, for example, #5 is "Education" and #7 is "Class" (opposing "rich" with "poor"), then documents that engage highly with the "rich" pole of #7 will also engage highly with the "uneducated" pole of #5 (and vice versa).
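The edge list behind such a network can be sketched in Python as follows. This is an illustrative sketch only: the CMD scores are invented, the direction names reuse the hypothetical "Employment," "Education," and "Class" labels from the example above, and a fixed |r| threshold stands in for the significance-based filtering, which is not reproduced here.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length score vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(
        sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)
    )


def correlation_edges(rows, names, min_abs_r=0.3):
    """Return (direction, direction, r) edges for one schematic class.

    rows: CMD score vectors for the documents assigned to the class
    names: one label per semantic direction (column)
    min_abs_r: assumed cutoff; edges with weaker |r| are dropped
    """
    columns = [[row[j] for row in rows] for j in range(len(names))]
    edges = []
    for j, k in combinations(range(len(names)), 2):
        r = pearson(columns[j], columns[k])
        if abs(r) >= min_abs_r:
            edges.append((names[j], names[k], r))
    return edges


# Hypothetical CMD scores for four documents in a single class.
names = ["Employment", "Education", "Class"]
class_rows = [
    [1.0, 0.9, -1.1],
    [-0.5, -0.6, 0.4],
    [0.2, 0.3, -0.2],
    [-1.0, -0.8, 0.9],
]
edges = correlation_edges(class_rows, names)
for a, b, r in edges:
    print(f"{a} -- {b}: r = {r:+.2f}")
```

In these made-up data, "Employment" and "Education" are positively correlated across documents, whereas "Education" and "Class" are negatively correlated, mirroring the interpretation given for the Class #1 network. A direction with no variance within a class would make the correlation undefined and would need separate handling.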

Summary
In this article, we proposed a workflow for recovering schemas in naturally occurring language. Defining cultural schemas as slowly learned, implicit, and unevenly distributed relational memory structures that allow people to make sense of the world, recent work has demonstrated ways of recovering schemas (or, rather, the downstream consequences of people drawing upon them) from domain-specific survey modules. We extended these methods to text data. Specifically, we drew on "schematic class analysis," which sorts respondents into "classes" on the basis of the schema to which their survey response patterns best align, and introduced concept class analysis (CoCA): a hybrid model that combines word embeddings and correlational class analysis to group documents across a corpus by the similarity of schemas recovered from them.
Word embeddings, which use term co-occurrence statistics to determine semantic similarity between words, form the backbone of this workflow. We can then aggregate up from the meaning of words to the meaning of documents by conceiving of documents as clouds of locations defined by the words they contain.
Next, we proposed a deductive method for measuring a document's engagement with specific concepts by measuring the cost of moving all the words contained in a document to a specific vector in the N-dimensional semantic space (Stoltz and Taylor 2019). Although these vectors may be specific terms denoting a concept, we can also extract "semantic directions," which point toward one pole of a binary concept and away from the other pole. For example, a document may "point" toward "feminine" and away from "masculine." Finally, by measuring a document's engagement with several key directions as continuous variables, we can then apply correlational class analysis to extract classes of similar engagement patterns (Goldberg 2011; Boutyline 2017). Using this workflow allows the text analyst to partition documents by recovering the "traces" of schematic information in the texts.

Further Research and Applications
As stated at the onset, one of the key strengths of our proposed workflow is that, unlike surveys, the set of meaningful dimensions that can be estimated are not limited by the questions of a survey instrument. Therefore, if theory and empirical research suggest other semantic directions might be salient for a particular question or corpus, the analyst only needs a list of juxtaposed terms that approximate this dimension.
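As an illustration of how a list of juxtaposed terms can approximate a dimension, the Python sketch below averages the offsets between paired term vectors, which is one common way of building a semantic direction. It is not necessarily the exact estimator used in CoCA. The three-dimensional "embeddings" are toy values invented for this example, and the cosine check is only a sanity test; CoCA itself measures document engagement with CMD, which is not reproduced here.

```python
from math import sqrt


def semantic_direction(pairs, vectors):
    """Average the offsets (first pole minus second pole) across term pairs."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for first, second in pairs:
        for d in range(dim):
            total[d] += vectors[first][d] - vectors[second][d]
    return [t / len(pairs) for t in total]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


# Toy word vectors (assumed values; real embeddings have far more dimensions).
vectors = {
    "rich": [0.9, 0.1, 0.2],
    "poor": [-0.8, 0.2, 0.1],
    "wealthy": [0.8, 0.0, 0.3],
    "impoverished": [-0.9, 0.1, 0.2],
    "yacht": [0.7, 0.5, -0.1],
    "debt": [-0.6, 0.4, 0.0],
}

# Juxtaposed term pairs approximating a "Class" dimension (rich vs. poor).
class_pairs = [("rich", "poor"), ("wealthy", "impoverished")]
class_direction = semantic_direction(class_pairs, vectors)

for term in ("yacht", "debt"):
    print(term, round(cosine(class_direction, vectors[term]), 2))
```

With these toy vectors, "yacht" falls on the "rich" side of the direction and "debt" on the "poor" side. Adding or swapping term pairs changes the direction, which is why the choice of pairs bears on construct validity.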
It may be the case, however, that certain directions are more tightly organized as continuous bipolarities (such as gender in many contexts), whereas other semantic directions that could be estimated (such as movies juxtaposed to books or college versus professional sports) may not be so cleanly organized. Furthermore, although prior research, theory, and semantic dictionaries (e.g., WordNet) can aid in selecting juxtaposed term pairs, there is no agreed-upon number of pairs required. Thus, this greater flexibility for estimating variables along which documents may be partitioned comes with the additional burden of considering construct validity for each distinct dimension.
Relatedly, the fit between context and dimensions of interest directly bears on selecting an appropriate corpus. That is, like network analysis, text analysis runs into a boundary specification problem. Laumann, Marsden, and Prensky (1989:18-9) note that, for example, "it is obviously of great consequence if a key intervening actor or 'bridging' tie is omitted due to oversight or use of data that are merely convenient," and that "misrepresentation of the process under study" is "precisely the outcome of errors in the definition of system boundaries in a network analysis." Relational analyses using texts must consider this same issue. CoCA falls squarely within this discussion because the recovered schemas are interpreted from document classes partitioned on the basis of similarities in edge placement between documents in the corpus. In other words, like most dimension reduction techniques, the schemas are derived from patterns internal to the data themselves.
RCA and CCA applied to surveys can often bypass this concern because they can be estimated using probability samples derived from representative sampling frames where each population member has a nonzero probability of selection. Most text corpora are not compiled with probability theory and representativeness in mind. Future methodological research should therefore think more critically about some of the statistical steps in CoCA, namely the use of asymptotically derived p values and "statistical significance" to filter out correlations in both the interdocument graph and the interdimension correlation networks (where the edges indicate whether engagement with one binary concept is predictive of engagement with another).
Finally, any corpus can be used with our proposed workflow (in fact, one does not even need the raw texts; a "bag-of-words" document-term matrix representation of the texts is sufficient). Furthermore, CoCA may even be particularly well suited for analyzing open-ended questions from surveys and interviews, especially those with well-defined domains and individual "authors" (i.e., respondents and interviewees, respectively). There has been text-analytic work on these types of data: for example, analysts have applied topic modeling to open-ended responses from experiments and surveys (Roberts et al. 2014; Finch et al. 2018; Pietsch and Lessmann 2018) and unsupervised document clustering to interview transcripts (Janasik et al. 2009; Sherin 2013). We believe CoCA has much potential with these types of data: the single-author nature of each "document" and, importantly, their domain-specific content (where the participants are answering or ruminating on a specific, researcher-delivered question or prompt) mean that CoCA is better able to isolate individual-level schematic traces. Additionally, open-ended responses from representative surveys may alleviate some of the statistical concerns raised in the preceding paragraph (assuming that there is also no nonresponse bias in the open-ended questions). We hope to see these types of CoCA applications in the near future.
Notes
1 See Wood et al. (2018) and Arseniev-Koehler and Foster (2020) for a discussion of some of the various types of sociologically relevant schemas and how they are different from the related concept of frames.
2 As Fiske and Taylor note (2013:15), "[t]he idea behind [the cognitive miser model] is that people are limited in their capacity to process information, so they take shortcuts whenever they can.... People adopt strategies that simplify complex problems; the strategies may not be correct processes or produce correct answers, but they are efficient. The capacity-limited thinker searches for rapid, adequate solutions rather than for slow, accurate solutions" (see also Taylor 1981;Spears et al. 1999).
3 It is an important, although minor, point that we consider schemas cultural so long as they are learned. The extent to which schemas may be shared is an important empirical question but makes them no more or less "cultural" per se (see White 1959; Weiss 1973).
4 After drafting the current article, we discovered one earlier article that applies RCA to texts, Miranda et al. (2015). The authors, along with three research assistants, hand-coded firms' documents discussing their "initiatives" regarding social media. The documents were coded for the presence of six "principles." Each unique initiative then received a score for each principle: the number of times a principle was present in a document divided by the total number of documents relating to the initiative. The authors then apply RCA to the pattern of scores to extract schematic classes.

5 The key difference is that GloVe begins with a matrix that represents the global term co-occurrence statistics for a given corpus; by contrast, word2vec iterates through the text, slowly extracting co-occurrence statistics (Levy and Goldberg 2014; Levy et al. 2015). The former is not neurally plausible because humans do not begin with schemas weighted by "global" probabilities of features in some domain. Rather, like word2vec, humans slowly accrue these probabilities from their ongoing lived experience.
6 EMD is computationally demanding. There are several efficient solvers that provide good enough approximations. CMD relies on the "relaxed word mover's distance" (RWMD) algorithm, originally detailed by Kusner and colleagues (2015). Readers interested in the technical details relating RWMD to CMD are encouraged to read the original CMD article by Stoltz and Taylor (2019). Lastly, because the CMD algorithm relies heavily on the text2vec R package (Selivanov and Wang 2016), CMD uses a considerably faster approximation of RWMD written into the package known as linear complexity relaxed word mover's distance (Atasu et al. 2017).