Integrating topic modeling and word embedding to characterize violent deaths

Significance

We introduce an approach to identify latent topics in large-scale text data. Our approach integrates two prominent methods of computational text analysis: topic modeling and word embedding. We apply our approach to written narratives of violent death (e.g., suicides and homicides) in the National Violent Death Reporting System (NVDRS). Many of our topics reveal aspects of violent death not captured in existing classification schemes. We also extract gender bias in the topics themselves (e.g., a topic about long guns is particularly masculine). Our findings suggest new lines of research that could contribute to reducing suicides or homicides. Our methods are broadly applicable to text data and can unlock similar information in other administrative databases.

of topics. Formally, we want to keep the $\ell_0$ "norm" of each column of X (i.e., the number of non-zero elements) small, so that it is less than or equal to the sparsity constraint hyperparameter $T_0$. Thus, the objective function of the K-SVD constrained optimization problem is

$$\min_{D,\,X} \; \|Y - DX\|_F^2 \quad \text{subject to} \quad \|x_i\|_0 \le T_0 \;\; \forall i.$$

Recall that $\|x_i\|_0$ is the $\ell_0$ norm of the i-th column of X and $\|\cdot\|_F^2$ is the squared Frobenius norm, i.e., the sum of squared entries of the matrix. Hence we want to choose D and X such that the total squared difference between the original embedding Y and the reconstruction DX is minimized, while constraining each column of X to at most $T_0$ non-zero entries; in other words, a sparse representation of each word vector in terms of the atom vectors.
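To make the objective concrete, the following is a minimal numpy sketch of evaluating it. The function name (ksvd_objective) and the array shapes are illustrative assumptions, not part of our implementation; only the conventions (one word vector per column of Y, one atom per column of D) follow the text.

import numpy as np

def ksvd_objective(Y, D, X, T0):
    # Squared Frobenius reconstruction error ||Y - DX||_F^2.
    error = np.linalg.norm(Y - D @ X, ord="fro") ** 2
    # Check that every column of X (one word's coefficients) uses at most T0 atoms.
    sparsity_ok = bool(np.all((X != 0).sum(axis=0) <= T0))
    return error, sparsity_ok

# Illustrative shapes: 100-dimensional vectors for 5,000 vocabulary words, 225 atoms, T0 = 5.
rng = np.random.default_rng(0)
Y = rng.normal(size=(100, 5000))   # embedding matrix, one word vector per column
D = rng.normal(size=(100, 225))    # dictionary, one atom vector per column
X = np.zeros((225, 5000))          # sparse coefficient matrix
error, ok = ksvd_objective(Y, D, X, T0=5)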

A. Solving the objective function of K-SVD to arrive at topics. In general, this constrained optimization problem cannot be "solved" (i.e., truly optimized); therefore approximate methods must be used. The overall strategy to minimize the objective function of K-SVD (and thus identify our topics) involves alternating updates to the coefficient matrix X and the dictionary D. We begin with a randomly initialized dictionary D.

A.1. Updating the coefficients. Holding the dictionary D fixed, we update the coefficients. Because each word vector's reconstruction depends only on its own coefficient vector, this amounts to a separate minimization problem for each column of X (each subject to the constraint $T_0$ on the number of non-zero coefficients). These separate minimization problems can be addressed using OMP to give an approximate solution (11).‡

A.2. Updating the dictionary. Once the coefficients are updated for all columns of X, we freeze the coefficients. We then update the dictionary of atoms; here we follow (11) closely. We update one atom vector (i.e., column of the dictionary) at a time. To update the kth atom vector, we identify the word vectors whose reconstructions use that atom (i.e., the corresponding coefficient in the sparse representation vector $x_i$ is nonzero). Now define a representation error matrix $E_k = Y - \sum_{j \neq k} d_j x_T^j$, where $d_j$ is the jth column of the dictionary matrix D (i.e., the atom vector for topic j) and $x_T^j$ is the jth row of the representation matrix X, i.e., all of the coefficients for the jth atom vector. $E_k$ essentially corresponds to all of the reconstruction error that remains after we have reconstructed Y with the other K − 1 topics.

We want to reduce the reconstruction error further by updating the vector for the kth atom $d_k$ and the corresponding row of the coefficient matrix $x_T^k$, but we must do so in a way that preserves sparsity. We do so by considering only the columns of the error matrix that correspond to word vectors whose reconstruction currently uses the kth atom, yielding a restricted matrix $E_k^R$. We likewise restrict $x_T^k$ to only those elements of the row with non-zero entries (i.e., those coefficients where atom vector k is currently used); call this $x_R^k$. We now update $d_k$ and $x_R^k$ to minimize $\|E_k^R - d_k x_R^k\|_F^2$; this is, in essence, the "best we can do" to further reduce error by only changing the atom vector $d_k$ and altering the way that reconstructions already using that atom vector load onto it. By construction, this update cannot lead to violation of the sparsity constraint. This sparsity-preserving minimization with respect to $d_k$ and $x_R^k$ can be done via singular value decomposition (SVD) of the error matrix $E_k^R = U \Delta V^T$. In essence we want a rank-one approximation of the error matrix $E_k^R$; the optimal such approximation is obtained by setting $d_k$ to be the first left singular vector (the first column of U) and the reduced coefficient vector $x_R^k$ to be the transpose of the first right singular vector (the first column of V) times the first singular value (i.e., $\Delta_{11}$). This updating process must be carried out for every column of the dictionary matrix D.

The process iterates between updates to the dictionary and updates to the coefficients (which assign sparse combinations of atoms to each word) until it reaches a predetermined stopping point. In our case, the process stops after 10 iterations or once the total reconstruction error falls below $1 \times 10^{-6}$, whichever happens first. The final result is a matrix of atom vectors D and a matrix of coefficients X that allow us to reconstruct each vocabulary word as a sparse linear combination of atoms.
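To make the alternation concrete, the following is a minimal sketch of one K-SVD iteration along the lines described above. It uses scikit-learn's orthogonal matching pursuit for the coefficient step and numpy's SVD for the atom step; the function name and the exact array conventions are illustrative assumptions rather than the authors' implementation.

import numpy as np
from sklearn.linear_model import orthogonal_mp

def ksvd_iteration(Y, D, T0):
    # A.1: with the dictionary fixed, sparse-code every word vector (column of Y)
    # with at most T0 non-zero coefficients, using orthogonal matching pursuit.
    X = orthogonal_mp(D, Y, n_nonzero_coefs=T0)              # shape (K, V)

    # A.2: update each atom (column of D) together with the coefficients that use it.
    for k in range(D.shape[1]):
        users = np.nonzero(X[k, :])[0]                        # words whose code uses atom k
        if users.size == 0:
            continue
        # Residual error for those words when atom k's contribution is excluded.
        E_k = Y[:, users] - D @ X[:, users] + np.outer(D[:, k], X[k, users])
        U, S, Vt = np.linalg.svd(E_k, full_matrices=False)
        D[:, k] = U[:, 0]                                     # new (unit-norm) atom vector
        X[k, users] = S[0] * Vt[0, :]                         # rescaled coefficients for atom k
    return D, X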
Conceptually, updating atoms in this way encourages distinct atoms; each time an atom is updated, the goal is to best account for all the variation in words' meanings that the other atoms do not already explain.

We note that our overall approach is extremely modular. While we use K-SVD to discretize the semantic space and identify topics, other dictionary learning algorithms (or even clustering algorithms, like k-means) can be used instead.§ As long as the chosen algorithm returns a set of vectors in the embedding space that can serve as topics, the rest of our pipeline applies unchanged.

‡ See https://en.wikipedia.org/wiki/Matching_pursuit for a simple exposition of the related Matching Pursuit algorithm. OMP works in our case as follows: For a given word vector, we find the closest possible atom vector using cosine similarity. The projection of the word vector onto that first atom vector represents our first attempt at reconstructing the word vector, and hence our first pass at the coefficients. We next compute the residual (the vector difference between the word vector and the reconstruction). We then find the atom vector closest to the residual (i.e., to what is not explained by the atom(s) already assigned to this word vector). This becomes the next atom vector with a non-zero coefficient. In OMP, we compute new coefficients for both atom vectors by projecting the full word vector onto their span (in this case, a plane); this yields a new set of coefficients and a better reconstruction of the original word vector. We iterate this process (compute the difference between the word vector and its current reconstruction; find the atom vector closest to the residual; project the full word vector onto the span of the iteratively selected atom vectors) until we have chosen $T_0$ atom vectors, corresponding to $T_0$ non-zero coefficients in $x_i$ for the sparse coefficient matrix X corresponding to the current dictionary D.

§ We conducted experiments using k-means. While performance on our corpus was comparable to K-SVD, we found that K-SVD produced interpretable topics more robustly across different corpora. We also note that the "theory of meaning" implicit in the K-SVD approach is more realistic: it views all words as combinations of basic semantic "building blocks" and finds those building blocks with that picture in mind. Using k-means implicitly assumes that the meaning of each word is best represented by the nearest cluster of word vectors, ignoring polysemy.

To choose the number of topics, we evaluated candidate models on three metrics: coherence, diversity, and coverage. To measure topic coherence, we computed the average pairwise cosine similarity among the vectors of the top 25 words in each topic, averaged across topics.¶ This metric is well suited for topic modeling in embeddings, is efficient to compute, and correlates well with human judgement (17). As illustrated in Figure S2A, we found that models with fewer topics tended to produce slightly more coherent topics, but models were coherent across various numbers of topics. Topic diversity is the proportion of unique words among the top 25 words of all topics in a model; it would take its lowest value if the same 25 words were the "top 25" in all topics. As illustrated in Figure S2B, we found that models with fewer topics also tended to produce more distinct topics, and topic diversity dropped rapidly in models with more than approximately 225 topics.
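As a concrete illustration of these two metrics, the sketch below computes coherence as the mean pairwise cosine similarity among each topic's top words and diversity as the share of unique words across all topics' top-25 lists. It assumes `top_words` is a list of per-topic word lists and `wv` is a trained gensim KeyedVectors model; the exact formulas are our reading of the description above rather than the authors' code.

import numpy as np
from itertools import combinations

def topic_coherence(top_words, wv):
    # Mean pairwise cosine similarity among each topic's top words, averaged over topics.
    per_topic = []
    for words in top_words:
        words = [w for w in words if w in wv]
        if len(words) < 2:
            continue
        sims = [wv.similarity(a, b) for a, b in combinations(words, 2)]
        per_topic.append(np.mean(sims))
    return float(np.mean(per_topic))

def topic_diversity(top_words, n=25):
    # Proportion of unique words among the top-n words of all topics.
    flat = [w for words in top_words for w in words[:n]]
    return len(set(flat)) / len(flat)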

While coherence and diversity favor a parsimonious topic model with few topics, it is nevertheless important that the model "explains" the space of possible meanings in the corpus. To capture this important aspect, we turned to our third metric: coverage. To measure how well the topics in a given model cover the semantic space, we computed the extent to which we could "reconstruct" the original semantic space using just the set of topics. As in k-means, which is in fact a special case of K-SVD (11, 12), the objective function of K-SVD minimizes the sum of squared errors between the original data and the reconstructed data. Using the sum of squared errors and the sum of squares total, we computed the proportion of the original variance explained by the topics (i.e., $R^2$) to measure how well a candidate set of topics explains the semantic space (we refer to the value of $R^2$ here as coverage). In contrast with topic diversity and coherence, coverage continues to increase in models with more topics, but the marginal gains from adding more topics reduce considerably around 225 topics in our data (Figure S2C).
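A minimal sketch of the coverage calculation as we read it is shown below, assuming the embedding matrix Y and a fitted dictionary D and coefficient matrix X. Centering each embedding dimension at its mean when computing the total sum of squares is our assumption; the function name is illustrative.

import numpy as np

def coverage_r2(Y, D, X):
    # Sum of squared errors between the embedding and its sparse reconstruction.
    sse = np.sum((Y - D @ X) ** 2)
    # Total sum of squares, centering each embedding dimension at its mean across words.
    sst = np.sum((Y - Y.mean(axis=1, keepdims=True)) ** 2)
    return 1.0 - sse / sst     # proportion of variance "explained" by the topics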

We selected our final model to balance all three of our metrics for a good quality topic model. Coherence steadily decreased with more topics. Diversity dropped rapidly after around 225 topics. At first, coverage rapidly increased with more topics, but the marginal gains diminished considerably around 225 topics, and we selected our final model accordingly. An alternative, data-driven estimate of the number of topics (Figure S2D) was quite close to the value selected by the procedure above balancing coherence, diversity, and coverage.

In Table S2 we list all the topics identified in our data using the Discourse Atom Topic Model. For each topic, we include our label (manually assigned) and the 10 most representative terms (from highest to lowest cosine similarity to the topic's atom vector).
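For reference, a minimal sketch of how the most representative terms for each topic can be retrieved is given below, assuming `atoms` is a K x d numpy array of atom vectors and `wv` is a gensim KeyedVectors model; the helper name is illustrative.

def top_terms(atoms, wv, n=10):
    # Terms closest to each atom vector by cosine similarity, most similar first.
    return {k: [w for w, _ in wv.similar_by_vector(atoms[k], topn=n)]
            for k in range(atoms.shape[0])}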
¶ Hypothetically this coherence metric could range to −1, since cosine similarity between two word vectors in our word embedding may range from −1 to 1. In practice, word vectors rarely have a negative cosine similarity. For clarity, we report this value as ranging from 0 to 1 in the main text (18).
‖ The final hyperparameter in the Discourse Atom Topic Model is the sparsity constraint $T_0$, which is the number of topics that a word in the embedding matrix is allowed to "load" onto (i.e., have a non-zero coefficient for). The sparsity constraint must be between 1 (in which case K-SVD is identical to k-means) and the number of topics in the model. We follow Arora et al. (10) in setting the sparsity constraint to 5. As they describe, if this sparsity constraint is not sufficiently low, then some of the coefficients must necessarily be small; this makes the corresponding components indistinguishable from noise (10). We empirically observed that models with more non-zero coefficients per word have lower coherence and slightly less diversity, but higher coverage.

Topic modeling is a core method in text analysis. It is therefore unsurprising that a plethora of specific approaches exist. These approaches differ, among other things, in their assumptions about the role of topics in the text-generation process.

DATM has several practical advantages compared to many prior topic models. It is robust to stopwords and domain-specific vocabulary, and it can be used on documents of varying lengths. It can also yield highly interpretable and coherent topics, as we illustrate in this paper. As we show next, the topics identified by DATM are qualitatively different from those picked up by LDA topic models (the mainstream approach in computational social science). We emphasize, however, that the "ideal" topic model for a particular use case will depend on the researcher's data, theoretical assumptions, and research questions.

A. LDA Topic Modeling on NVDRS Narratives. Here we provide sample topics generated on our data using one of the most popular topic models: LDA topic modeling (Table S3).

To train our LDA topic models, we used a Python wrapper (2) for the MALLET implementation of LDA topic modeling (24), after observing that this implementation offered substantially more interpretable topics than the default implementation in Python using the Gensim package (2). For instance, in an LDA model trained with 225 topics using the default implementation, the five most probable terms for one topic (topic 214, selected at random) include: "hispanic," "homeless," "wood," "decomposing," and "inflicted." The five most probable terms for another randomly chosen topic (topic 154) include: "seen_alive," "last," "initiated," "doorway," and "letters." For reference, the overall model had a coherence of 0.11 and a topic diversity of 0.69. As a second example, in an LDA model trained with 100 topics, the five most probable terms for one topic (topic 60, selected at random) include: "garage," "nature," "dog," "fatal_injury," and "contents." This overall model scored similarly on coherence and diversity (0.12 and 0.66, respectively).

We initially tried using the exact same vocabulary as we used for our Discourse Atom Topic Model. However, the resulting topics were uninterpretable. They contained many stopwords and words that are very common in our data and thus lose meaning (e.g., "the" and "victim"). LDA models require careful pre-processing that is specific to the corpus, and they are often not robust to stopwords (30-32), unlike the Discourse Atom Topic Model. Thus, for LDA topic modeling, we removed standard stopwords using a list from the nltk package in Python (we retained gender pronouns, however, even though these are considered stopwords in the nltk list). We also removed words that occurred in more than 75% of the documents (ubiquitous words) or fewer than 15 times total in the corpus (very rare words).
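The following is a minimal sketch of this preprocessing and training pipeline, assuming `narratives` is a list of tokenized documents. It uses the default Gensim LdaModel rather than the MALLET wrapper the text describes, and it approximates the "fewer than 15 occurrences in total" cutoff with Gensim's document-frequency threshold; the pronoun list and all variable names are illustrative assumptions.

from nltk.corpus import stopwords
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Retain gender pronouns even though nltk counts them as stopwords.
KEEP = {"he", "she", "him", "her", "his", "hers", "himself", "herself"}
stop = set(stopwords.words("english")) - KEEP

docs = [[w for w in doc if w not in stop] for doc in narratives]

dictionary = Dictionary(docs)
# Drop words appearing in more than 75% of documents; no_below approximates the
# rare-word cutoff using document frequency rather than total frequency.
dictionary.filter_extremes(no_above=0.75, no_below=15)

corpus = [dictionary.doc2bow(doc) for doc in docs]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=100,
               passes=10, random_state=0)

# Inspect a few topics by their most probable terms.
for topic_id, words in lda.show_topics(num_topics=5, num_words=5, formatted=False):
    print(topic_id, [w for w, _ in words])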

To select the best LDA model, we trained 11 LDA models with varying values of K (i.e., numbers of topics): 15, 25, 50, 100, 150, 175, 200, 225, 250, 400, and 800. We selected our final LDA topic model among these using the coherence and diversity metrics described in the main paper; coverage does not apply to LDA topic models, since it has to do with the ability of topic atoms to reconstruct an embedding space. To compute coherence and diversity, we selected the "top 25" words for each topic by considering their probability given the topic. In our LDA topic models, coherence has minimal gains after 100 topics (when coherence is 0.18); it then drops with more than 250 topics. Topic diversity begins at 0.66 (at 15 topics, the smallest number of topics we considered) but rapidly diminishes (e.g., by 250 topics the topic diversity is 0.41). Using the elbow method, we selected a final LDA model with 100 topics to balance both coherence and diversity.

As a further illustration of measuring topics' loadings on semantic dimensions, we examine which topics are associated with descriptions of indoors or outdoors in the narratives. Indeed, this core approach to measuring bias or cultural meaning in topics can be used for any strong, stable semantic contrast, including contrasts that may have theoretical motivation but are not (yet) cleanly represented in structured variables.
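A minimal sketch of projecting topics onto such a semantic contrast is shown below, using the indoors/outdoors word lists described next. It assumes `wv` is a trained gensim KeyedVectors model and `atoms` is a K x d numpy array of topic (atom) vectors; the helper names are illustrative, not the authors' code.

import numpy as np

def contrast_dimension(wv, pole_a, pole_b):
    # Average the vectors for one pole of the contrast and subtract the
    # average of the vectors for the other pole.
    a = np.mean([wv[w] for w in pole_a], axis=0)
    b = np.mean([wv[w] for w in pole_b], axis=0)
    return a - b

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

dim = contrast_dimension(wv, ["indoors", "inside", "indoor"],
                             ["outdoors", "outside", "outdoor"])

# Loadings of each topic's atom vector on the indoors-outdoors dimension:
# large positive values are most "indoors", large negative most "outdoors".
loadings = np.array([cosine(atoms[k], dim) for k in range(atoms.shape[0])])
print("most indoors topics:", np.argsort(-loadings)[:3])
print("most outdoors topics:", np.argsort(loadings)[:3])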

As with the construct of gender, we extract a dimension for indoors versus outdoors in the corpus. Specifically, we average the vectors for the words indoors, inside, and indoor, and then subtract out the average of the vectors for the words outdoors, outside, and outdoor. We examine the topics that load most highly onto the resulting dimension (i.e., have the highest or lowest cosine similarity). Topics with large negative cosine similarities are more distinctive of language about outdoors (and not indoors), while topics with large positive cosine similarities are more distinctive of language about indoors (and not outdoors). In our data, the most "outdoors" topic is one we had labeled "Specific outdoor locations" (Topic 120), followed by topics labeled "Rural outdoor areas" (Topic 219) and "Canvassed" (Topic 52). The most "indoors" topic is one labeled "Fumes" (Topic 14), followed by topics about "Tubes" (Topic 214) and "Gun Actions" (Topic 152). For context, fumes and tubes both reflect gas poisonings, which occur in closed (i.e., indoor) spaces.

Table S2 (excerpt). Family members: mother, grandmother, father, sister, niece, aunt, stepfather, brother, fiance, best_friend. Note: Most representative terms are listed in order of highest to lowest cosine similarity to the topic's vector. Misspellings that were not caught in preprocessing are retained here. Topics observed to be syntactic are denoted with (syntactic) in the topic label; topics that may be graphic are denoted with (graphic) in the topic label. One term is modified in this table to "may_X_20XX" to retain anonymity.

Table S3 (excerpt): hospital, transported, died, local, admitted, injuries, ambulance, expired, transferred, surgery; medical, emergency, service, scene, pronounced, notified, considered, clear_whether, moments_before, pronounces. Note: Most representative terms are listed in order of highest to lowest probability of being generated by each LDA topic. Misspellings that were not caught in preprocessing are retained here.