Topic Specificity: A Descriptive Metric for Algorithm Selection and Finding the Right Number of Topics

Topic modeling is a prevalent task for discovering the latent structure of a corpus: identifying a set of topics that represent the underlying themes of its documents. Despite its popularity, issues with its standard evaluation metric, the coherence score, give rise to two common challenges: algorithm selection and determining the number of topics. To address these two issues, we propose the topic specificity metric, which captures the relative frequency of topic words in the corpus and is used as a proxy for the specificity of a word. In this work, we first formulate the metric. Second, we demonstrate that algorithms train topics at different specificity levels. This insight can be used for algorithm selection, as it allows users to distinguish and select algorithms with the desired specificity level. Lastly, we show a strictly positive monotonic correlation between the topic specificity and the number of topics for LDA, FLSA-W, NMF and FLSA. This correlation can be used to set the number of topics, as it allows users to adjust the number of topics to their desired level. Moreover, our descriptive metric provides a new perspective for characterizing topic models, allowing them to be better understood.


Introduction
Topic models are unsupervised, statistical algorithms that facilitate understanding large document collections by identifying and categorizing semantic patterns within the texts. The output is a set of 'topics', each represented by a cluster of words. These clusters are constructed based on the shared meaning or relevance among the words, and the expectation is that a well-functioning topic model presents thematically coherent word groups, mirroring the inherent thematic structure of the document collection. Ultimately, topic models aim to process and simplify vast text data into manageable and meaningful forms. The applications of topic models are vast and include document mining (Syed et al., 2018; Ahammad, 2024), emotion mining (Rao et al., 2014; Rao, 2015; Pang et al., 2019), text categorization (Haribhakta et al., 2012) and finding text similarity (Spina et al., 2014). Also, a topic model's output may be used as a text embedding for downstream tasks, such as text classification (Mosteiro et al., 2021). Topic models are commonly evaluated with the coherence score (Newman et al., 2010; Aletras and Stevenson, 2013; Lau et al., 2014; Röder et al., 2015). Ever since this metric was shown to correlate with interpretability, the field has largely abandoned human evaluations (Hoyle et al., 2021), and there is a growing trend towards evaluating topics using the coherence score (Doogan and Buntine, 2021). However, the coherence score's validity independent of human judgments is questionable: discrepancies have been shown between the score and human judgment (Hoyle et al., 2021) and between the score and topic-document representation (Bhatia et al., 2018), and evidence shows different coherence scores consistently favouring different algorithms (Rijcken et al., 2022b). These limitations demonstrate an implementation gap. Although the metric behaves in unintended ways, we still use it to evaluate, select, optimize and understand algorithms. Because of this gap, we believe there is a need for more (descriptive) metrics that highlight different properties of various models.
Two open problems in topic modeling are algorithm selection and determining the number of topics. Given a corpus, the first task before training any topic model is selecting the most appropriate algorithm. This task is not straightforward: with the vast range of algorithms to choose from (Grootendorst, 2022; Bianchi et al., 2021; Rijcken et al., 2021; Dieng et al., 2020; Karami et al., 2018; Srivastava and Sutton, 2017; Blei et al., 2003; Lee and Seung, 1999; Landauer et al., 1998) and the limited number of available metrics, it is challenging to characterize and distinguish between algorithms effectively. Users commonly compare different algorithms and choose the one with the highest performance metric, typically the coherence score.
Once an algorithm is selected, most algorithms require the number of topics to be determined a priori. This number is typically set by optimizing a single metric, often the coherence score. Thus, the number of topics is subject to the optimization goal of that metric. Similar to Stammbach et al. (2023), we believe that finding the optimal number of topics is subjective and that there is no objective optimum. The topics of interest might be a few broad or many specific topics; the number and specificity depend on an individual's preferred scope.
However, given the abovementioned limitations, it is questionable whether using the coherence score for algorithm selection and for determining the number of topics yields the desired outcome. We propose topic specificity to address both issues. This automated measure assumes a word's frequency is inversely related to its specificity. By assessing the relative frequency of topic words in the corpus, it reveals the specificity of topic words. This way, users can select algorithms and the number of topics that match their desired specificity. In our experiments, we compare various algorithms trained with different numbers of topics on various open datasets. We find: (1) the algorithms produce topics at different topic specificity levels, and (2) a strictly positive monotonic correlation between the number of topics and the topic specificity for LDA, FLSA-W, NMF and FLSA.
The first outcome supports the claim that this metric helps with algorithm selection, whereas the second outcome indicates this metric helps to find the right number of topics for some algorithms.
In Section 2, we discuss the literature on automated topic modeling metrics and establish a context for introducing the topic specificity. In Section 3, we elaborate on our notion of specificity and how it relates to word frequency. In Section 4, we formally introduce the topic specificity measure. We discuss the experimental setup in Section 5. The results in Section 6 show the varying specificity levels and the monotonic relationship between the topic specificity and the number of topics for different algorithms. Subsequently, we demonstrate how topic specificity can be used in Section 7. Lastly, we discuss and conclude our findings in Sections 8 and 9.

Topic Modeling Metrics
Topic modeling evaluation has been actively studied during the last decade. Starting with Chang et al. (2009)'s foundational word-intrusion work, subsequent research has explored different coherence measures and their effects on human evaluation. Newman et al. (2010) showed that coherence measures correlate with human evaluation when aggregate pairwise Pointwise Mutual Information (PMI) scores were calculated for the top N topic words. This performance further improved when the PMI was substituted with its normalized version, NPMI (Aletras and Stevenson, 2013; Lau et al., 2014). Then, based on an extensive experimental setup, Röder et al. (2015) evaluated prior work and established new coherence measures, most notably the C_v score, which correlates higher with human judgment. Lau and Baldwin (2016) study the sensitivity of the topic coherence score to topic cardinality and find the correlation with human judgment to decrease as the number of words per topic increases. Bhatia et al. (2018) propose an automated topic intrusion approach. Similarly, Stammbach et al. (2023) let large language models perform the rating and intrusion detection tasks for topic model evaluation and propose LLM scores as automated metrics. Ding et al. (2018) propose two word-embedding-based coherence metrics that correlate almost equally well with the NPMI. Lund et al. (2019) propose an automated methodology for evaluating local topic model quality. Morstatter and Liu (2018) propose that topic interpretability should consist of topic coherence and topic consensus; the latter is a metric that measures how well a mixture of labels given by a group of human subjects matches given categories of topics. Similarly, Dieng et al. (2020) propose that the interpretability score of a topic should be the product of an intra- and inter-topic score. Rijcken et al. (2023) study the effect of weighting word distances and find that the word distance matters less than the size of a sliding window.

Limitations of the Coherence Score
Topic modeling is predominantly used to guide the exploration of large datasets (Agrawal et al., 2018). The assumption is that the topic words reflect the latent topics in the underlying corpus well. However, Bhatia et al. (2017) empirically demonstrate there can be large discrepancies between topic coherence and document-topic associations. They create an adversarial topic model that produces highly coherent topics that collectively reveal little about the content of the document collection. Moreover, recent experiments show a discrepancy between human evaluations and the coherence score (Hoyle et al., 2021). Similarly, other experiments show different coherence scores disagreeing by consistently favouring different algorithms (Rijcken et al., 2022b). These observations demonstrate a lack of thorough understanding of the coherence scores. Meanwhile, these scores are used for model selection and evaluation, and there is a growing trend towards using the coherence score (Doogan and Buntine, 2021). For this reason, we believe there is a need for more metrics that help distinguish and understand different algorithms.

Word Frequency and Specificity
The topic specificity is based on the assumption that the frequency of a word is inversely related to its specificity. This section defines our notion of word specificity and substantiates this underlying assumption.
A common use of the term specificity is the semantic-pragmatic notion that distinguishes between different uses or interpretations of indefinite noun phrases (von Heusinger, 2019). Our notion of specificity is different; we define it as the degree to which a word or phrase is limited to a particular context, domain or subject matter. In this sense, words with high specificity are those most often found in specific, narrow contexts, such as technical jargon or industry-specific language. Words such as 'lexicology' in linguistics or 'photosynthesis' in biology are highly specific, as they are predominantly used within their respective fields and are less likely to be found in general discourse. Conversely, words with low specificity are those commonly used across a wide range of contexts and subjects. These include function words like 'the', 'is' and 'at', as well as common nouns and verbs. Such words are not strongly associated with any particular subject or context and are fundamental to general communication.
Some words fall in between these extremes, like 'thought', 'car' and 'innovation'. These words are not confined to a single specialized field but are more context-specific than common function words. These examples demonstrate the spectrum of word specificities, from highly specialized to broadly applicable. The relative term frequency is used to quantify this spectrum: it reflects how often words are used in specific contexts, helping to map words onto the spectrum of specificity.
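This frequency-to-specificity mapping can be illustrated with a short sketch. The toy corpus below is invented for illustration: words are ranked by corpus frequency, and a lower rank (closer to 0) indicates a more common, less specific word.

```python
from collections import Counter

# Toy corpus (invented for illustration only).
docs = [
    "the cell is the basic unit of life",
    "the process of photosynthesis occurs in the cell",
    "the car is parked at the house",
    "the thought of innovation drives the car industry",
]

# Raw term frequencies over the whole corpus.
counts = Counter(w for d in docs for w in d.split())

# Rank words from most to least frequent: a low rank (close to 0)
# means the word is common, i.e. has LOW specificity under our assumption.
by_freq = sorted(counts, key=counts.get, reverse=True)
rank = {w: i for i, w in enumerate(by_freq)}

print(rank["the"], rank["photosynthesis"])  # 'the' ranks first (0)
```

As expected, the function word 'the' sits at the low-specificity end of the ranking, while the domain term 'photosynthesis' sits further towards the high-specificity end.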

Methods
This section formally describes the topic specificity and topic descriptors within the framework of a topic model. In most topic models, a finite set of topics Θ is derived from the corpus, where each topic θ_k is a probability distribution p(w_i | θ_k) over the vocabulary V, indicating to what extent a word w_i is associated with the topic. We have:

Θ = {θ_1, θ_2, ..., θ_K},   (1)

Σ_{i=1}^{N} p(w_i | θ_k) = 1,  for k = 1, ..., K.   (2)

Here, K is the total number of topics, and N is the number of unique vocabulary words. Then, the topic descriptor for topic k, denoted by θ*_k, is defined as the set of words w_i in V that have the M highest scores of p(w_i | θ_k), given by:

θ*_k = argmax^(M)_{w_i ∈ V} p(w_i | θ_k).   (3)

To calculate the topic specificity for a word or topic, we start with V. Then, V_sorted is a reordering of V based on the frequency of each word w_i in the corpus, denoted by f(w_i):

V_sorted = (w_(1), w_(2), ..., w_(N)),  with f(w_(1)) ≥ f(w_(2)) ≥ ... ≥ f(w_(N)).   (4)

In V_sorted, each word has a unique index reflecting its rank based on word frequency. However, multiple words may have the same frequency in the corpus. For each unique frequency α, we define the set F_α as:

F_α = {i ∈ {1, ..., N} : f(w_(i)) = α}.   (5)

We then define a rank mapping function ϕ(w_i) that maps each index in F_α to the lowest index in F_α, i.e., ϕ(w_i) = min F_{f(w_i)}. This ensures that words with the same frequency are assigned the same rank in the topic descriptors. Each word w_i in V can thus be assigned a rank score reflecting its relative frequency. The topic specificity score ρ(w_i) for a given word w_i is the rank score ϕ(w_i) normalized by the vocabulary size:

ρ(w_i) = ϕ(w_i) / N.   (6)
The function ϕ(w_i) assigns the minimum index in F_α to all words w_i with frequency α, effectively ranking words with the same frequency identically. Dividing by N normalizes the rank and allows comparison across different vocabularies. While ρ(w_i) represents the topic specificity score of an individual word w_i based on its frequency in the corpus, ρ(Θ) aggregates these scores across the top M words of all K topics. This provides a comprehensive measure of topic specificity across the entire set of topics. Formally, the topic specificity for Θ is given by the average of the rank scores ρ(w_i) over all words w_i in the top M words of each topic:

ρ(Θ) = (1 / (KM)) Σ_{k=1}^{K} Σ_{w_i ∈ θ*_k} ρ(w_i).   (7)
Equation 7 calculates the sum of the topic specificity scores ρ(w_i) for all words w_i in each of the top-M word sets θ*_k across all K topics. This total is then divided by KM, the total number of words considered, yielding the average topic specificity score for the set of topics Θ. Note that some topic modeling algorithms do not represent topics as a set of probability distributions; instead, they derive their topic descriptors differently. In this case, the topic specificity ρ(Θ) can still be calculated as the mean score over the descriptor words.
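The computation of equations 6 and 7 can be sketched in a few lines of Python. This is a simplified illustration, not the reference implementation; the function names are our own, ranks are zero-indexed here, and documents are assumed to be pre-tokenized lists of words.

```python
from collections import Counter

def specificity_scores(corpus_tokens):
    """Map each vocabulary word to its normalized frequency rank rho(w_i).

    Words are sorted from most to least frequent. Ties (words sharing the
    same corpus frequency alpha) all receive the lowest index of their tie
    group, implementing the rank mapping phi. Dividing by the vocabulary
    size N makes scores comparable across corpora.
    """
    freq = Counter(w for doc in corpus_tokens for w in doc)
    n = len(freq)
    v_sorted = sorted(freq, key=freq.get, reverse=True)
    rho, min_index_of_freq = {}, {}
    for idx, w in enumerate(v_sorted):
        # phi(w): first (lowest) index seen for this frequency value.
        min_index_of_freq.setdefault(freq[w], idx)
        rho[w] = min_index_of_freq[freq[w]] / n
    return rho

def topic_specificity(topics, rho):
    """rho(Theta): average rho(w_i) over the top-M words of all K topics."""
    words = [w for topic in topics for w in topic]
    return sum(rho[w] for w in words) / len(words)

# Tiny invented example: three tokenized documents, two topic descriptors.
docs = [["model", "learn", "network", "model"],
        ["model", "bayesian", "inference", "learn"],
        ["network", "model", "privacy", "artificial"]]
rho = specificity_scores(docs)
theta_star = [["model", "learn"], ["privacy", "artificial"]]
print(topic_specificity(theta_star, rho))
```

In this example, the topic built from frequent words ('model', 'learn') scores lower than the one built from rare words ('privacy', 'artificial'), matching the intended inverse relation between frequency and specificity.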
The code for the topic specificity score can be found in the Python package FuzzyTM (Rijcken et al., 2022a), the first Python package for training fuzzy topic models.
Experimental Setup

We compare the algorithms on four open datasets: BBC-News (Greene and Cunningham, 2006), DBLP (Pan et al., 2016), M10 (Lim and Buntine, 2015) and 20Newsgroup (Lang, 1995), all retrieved from and preprocessed by the OCTIS package (Terragni et al., 2021), which applies punctuation removal, lemmatization and stop word removal. See Table 1 for descriptive statistics of the datasets. For each algorithm-dataset combination, we use Bayesian optimization (Snoek et al., 2012; Archetti and Candelieri, 2019) to find optimal values for three hyperparameters per algorithm. We optimize the C_v coherence score because this metric correlates the highest with human judgment (Röder et al., 2015) and is the most common evaluation metric. While we acknowledge the limitations of the coherence score discussed in Section 2.1, we are unaware of more suitable metrics. Consequently, we consider the coherence score our best available option for model evaluation.
Moreover, we observed no strictly positive correlation between the number of optimization runs and the topic coherence in previous experiments. For this reason, it seemed unrealistic to expect high-quality topics using a convergence-based stop condition for optimization. Instead, we believe the most effective approach is to run a fixed number of iterations and select the best-performing set of hyperparameters. Determining the number of optimization runs involves balancing computational resources and the thoroughness of the search. A larger number of optimization runs increases the chance of finding a better-performing set of hyperparameters at the cost of increased computational time and resources. A common heuristic is to use 15 times the number of hyperparameter settings. To increase the chance of obtaining well-performing hyperparameters, we use 25 test runs per parameter instead of 15, slightly favouring performance. Since we optimize three hyperparameters, this means we test 75 different settings. We train each setting five times and select the highest median score. Based on these hyperparameters, we train each algorithm-dataset combination with 10, 20, ..., 100 topics, ten times each. Hence, we train 13 500 models for optimization and 3 600 for evaluation. To assess whether different algorithms produce topics with different specificity scores, we report the average specificity scores (as defined in Section 4) on the ten most probable words per topic for each combination of algorithm, dataset and number of topics. Subsequently, we use the Spearman correlation coefficient ρ and Kendall's tau τ to test the monotonicity per algorithm and assess whether the number of topics correlates with the specificity. Moreover, we report the average specificity slope, defined as the average distance between specificity scores for different numbers of topics. This number shows how sensitive a model's specificity is to the number of topics.
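The monotonicity check described above can be sketched as follows. The specificity values in this snippet are hypothetical, and we use the simplified tie-free Spearman formula for illustration; the actual experiments used standard statistical tooling.

```python
def spearman_rho(x, y):
    """Spearman rank correlation, assuming no tied values (the simplified
    textbook formula 1 - 6*sum(d^2)/(n*(n^2-1))); real analyses should use
    a tie-aware implementation such as scipy.stats.spearmanr."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rk, i in enumerate(order):
            r[i] = rk
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical specificity scores for K = 10, 20, ..., 100 topics.
num_topics = list(range(10, 101, 10))
specificity = [0.03, 0.05, 0.08, 0.10, 0.11, 0.13, 0.15, 0.16, 0.18, 0.19]

print(spearman_rho(num_topics, specificity))  # strictly increasing -> 1.0
```

A strictly monotonic increase, as reported for LDA, FLSA-W, NMF and FLSA, yields a coefficient of exactly 1.0.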

Results
Figures 2-5 show the mean specificity scores per algorithm and number of topics on the different datasets. From these figures, two observations can be made: (1) algorithms produce topics with different specificities, and (2) there is a significant positive correlation between the number of topics and the topic specificity for many algorithms.
Distinguishing between algorithms. Different algorithms produce topics at different specificity levels and can be distinguished based on this measure. Table 2 shows the large variation in the mean specificity per algorithm. ETM produces topics with the most generic scope (lowest specificity), whereas FLSA-W produces topics with the most specific scope (highest specificity). Also, the algorithms' slopes across different numbers of topics differ from each other. ETM's specificity score remains approximately constant over different numbers of topics. In contrast, FLSA-W starts with generic topics (low specificity) and follows a concave slope, becoming more specific (high specificity) as the number of topics increases.

Positive correlation. Table 3 shows the mean slope per algorithm, combined with its Spearman correlation coefficient ρ and Kendall's tau τ, to show the monotonicity and significance of the slope. This table shows that LDA, FLSA-W, NMF and FLSA all have positive monotonic correlations with p-values ≤ 0.01. Hence, these algorithms become more specific as the number of topics increases.

Algorithms. Distinct patterns across the different datasets are found for the various algorithms. The Latent Dirichlet Allocation (LDA) algorithm exhibits a notable upward trend in specificity in three datasets, but this trend is absent in the BBC News dataset, where it produces less specific topics. In contrast, the FLSA-W algorithm consistently demonstrates a concave pattern in topic specificity across all datasets. Also, both FLSA and NMF show a positive linear relationship between the number of topics and specificity in all datasets. Counterintuitively, CTM shows an inverse relationship, becoming more generic as the number of topics increases (p-value: 0.035). ETM and LSI consistently produce the most generic topics; ETM is always more specific than LSI, and their specificity levels vary little across different numbers of topics. In most datasets, NeuralLDA's specificity grows with the number of topics. However, the BBC News dataset depicts a reversed trend for low numbers of topics. ProdLDA maintains a baseline specificity level (≥ 0.19) in three datasets, but the DBLP dataset is an exception, where it generates more generic topics. The consistency of ProdLDA's performance thus varies notably across datasets.

Using Topic Specificity
Our results show that LDA, FLSA-W, NMF, FLSA and ProdLDA exhibit a significant monotonic relationship between the number of topics and the topic specificity. Leveraging this relationship can guide the selection of the number of topics in a model. The following methodology demonstrates how the topic specificity can be used in practice.
1. Define a set of words of interest.
2. Find the specificity of these words.
3. Run an algorithm on the data with an arbitrary number of topics.
4. Iteratively adjust the number of topics, aiming to align the average topic specificity closer to that of the defined words of interest.
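The four steps above can be sketched as a simple search loop. This is only an illustration: the helper `train_topic_model(k)` (trains a model with k topics and returns its average topic specificity ρ(Θ)) and `specificity_of(w)` (the word-level score ρ from Section 4) are hypothetical placeholders, and the loop only increases k, relying on the positive monotonic relationship reported in Section 6.

```python
def tune_num_topics(train_topic_model, specificity_of, words_of_interest,
                    k=10, step=10, max_k=100, tol=0.02):
    """Grow the number of topics k until the model's average topic
    specificity is within `tol` of the specificity of the words of interest.

    `train_topic_model(k)` and `specificity_of(w)` are user-supplied
    callables; both are hypothetical placeholders in this sketch.
    """
    # Steps 1-2: the target is the mean specificity of the chosen words.
    target = sum(specificity_of(w) for w in words_of_interest) / len(words_of_interest)
    # Steps 3-4: retrain with more topics until close enough to the target
    # (topics grow more specific as k increases for several algorithms).
    while k < max_k:
        if abs(train_topic_model(k) - target) <= tol:
            break
        k += step
    return k
```

For algorithms whose specificity decreases with the number of topics (such as CTM in our experiments), the adjustment direction in step 4 would be reversed.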

Example
We are interested in finding the underlying topics of the DBLP dataset. Specifically, the goal is to extract topics with specificity scores similar to machine learning tasks such as clustering and regression, which have specificity scores of 0.081 and 0.110, respectively. We first train LDA with ten topics. The resulting topics have 78 unique words and are shown in Tables 4 and 5 in the Appendix. The specificity of this model is 0.035. Looking at the first topic (ρ = 0.029), we find generic machine learning words such as: model (ρ = 0.001), learn (ρ = 0.005), neural (ρ = 0.019), network (ρ = 0.005), bayesian (ρ = 0.058) and inference (ρ = 0.065), which aligns with the low specificity score. Although this seems to be a coherent topic, its focus is too generic for our interest. For this reason, we train another model with more topics. When training 30 topics, we get a model with a specificity of 0.103, which falls within the range of our desired specificity. This model has 270 unique words and comprises 75 (out of 78) words from the ten-topic model. Hence, similar topics are depicted as in the ten-topic model, but in a more specific context. Topic 19 (ρ = 0.131) contains the words neural and network. While this topic has these generic words, it also contains more specific words such as artificial (ρ = 0.171), preserve (ρ = 0.189), privacy (ρ = 0.184), interaction (ρ = 0.165) and sequential (ρ = 0.170), which reveal a more specific context in which the words are used. With topic modeling, we cannot guarantee that words of interest appear in the topics; in our example, neither model produced our words of interest, clustering and regression. However, the example demonstrates that increasing the number of topics leads to topics with more specific contexts.

Discussion
This study introduces topic specificity, an automated metric designed to assist in algorithm selection and in determining the number of topics, offering new insights into the behaviours of various topic modeling algorithms. While a search strategy is still necessary to select the algorithm and number of topics, this metric enables users to tailor those settings to their desired specificity level, a capability previously unavailable. Our experimental results show that algorithms vary in the specificity of the topics they generate, allowing users to choose an algorithm that best fits their needs. Additionally, there is a positive correlation between the topic specificity score and the number of topics for several algorithms (LDA, FLSA-W, NMF, FLSA), with p-values below 0.01. Conversely, CTM exhibits a reverse effect, producing more general topics as the number of topics increases, with a significance level of 0.035. The following sections discuss the implications and limitations of these findings.
Datasets. The dataset on which topic models are trained seems to affect the specificity of the topics they produce. In particular, LDA and NeuralLDA show different patterns on the BBC News dataset than on the other datasets. This dataset varies significantly from the others in the number of articles, average document length and vocabulary size (see Table 1). Because there is more than one distinguishing feature, further research is needed to explain the deviations. On the DBLP dataset, ProdLDA produces generic topics, whereas it produces specific topics on the other datasets. DBLP stands out from the other datasets in the number of articles it contains. Yet, further research into dataset characteristics is needed to explain these behavioural deviations.

Descriptive Statistic. The topic specificity is a descriptive statistic rather than a performance metric; the 'optimal' specificity depends on the user's preference and cannot be objectively measured. Although there is no objective optimum, the information provided by the metric helps in understanding and comparing different algorithms. With this information, users can select the algorithm and number of topics that best match their preferences.

Topic Specificity vs. Coherence. Figure 6 shows how the topic specificity score relates to the coherence score. Each data point reflects a topic model trained with a setting from the experiments (algorithm, dataset and number of topics). Although low coherence scores co-occur with low specificity scores, the plot indicates that coherence does not necessarily increase as topic specificity increases. In fact, high coherence scores occur across the entire range of topic specificity. Hence, no clear linear relationship between coherence and specificity is visible in our data.
Exploring Algorithms. Our experiments show that changing the number of topics influences the specificity of topics for most algorithms, but not for all. This leads to an intriguing observation: while it seems intuitive to expect this pattern for all algorithms, the observed variation suggests otherwise. We expect that the topic specificity metric, together with these experiments, can serve as a starting point for further research into why some algorithms behave in certain ways.
Limitations. Our experimental results indicate that the topic specificity of some algorithms is insensitive to the number of topics. Similarly, some algorithms seem to 'behave' consistently on three datasets but differently on one. In this work, we have not explored the reasons for these behaviours, as our scope is the topic specificity metric itself. Nonetheless, we think such explorations would be useful for understanding the inner workings of different algorithms and the influence of dataset characteristics.

Conclusion
This work proposes an automated metric called topic specificity to address the problems of algorithm selection and determining the number of topics; it reflects the average rank of topic words in the vocabulary. We demonstrate that algorithms can be distinguished based on their specificity scores, which can help users select the algorithm closest to their preferred specificity. Moreover, the strictly monotonic relationship between the topic specificity and the number of topics can help users set the number of topics. Lastly, we demonstrate with a practical example how to use our metric and show that topics become more specific when the number of topics is increased. Future work should aim to explain the differences in topic specificity behaviour among algorithms and datasets.

Declaration of Competing Interests and Funding
The authors acknowledge the COVIDA funding provided by the strategic alliance of TU/e, WUR, UU, and UMC Utrecht.They have no competing interests to declare relevant to this article's content.

Declaration of Generative AI and AI-assisted technologies in the writing process
During the preparation of this work, the authors used ChatGPT to improve the writing style.After using this tool, the authors reviewed and edited the content as needed and take full responsibility for the publication's content.


Figure 1: Python code for topic specificity.

Figure 2: Topic specificity on the BBC-News dataset for various algorithms and numbers of topics.

Figure 3: Topic specificity on the DBLP dataset for various algorithms and numbers of topics.

Figure 4: Topic specificity on the M10 dataset for various algorithms and numbers of topics.

Figure 5: Topic specificity on the 20Newsgroup dataset for various algorithms and numbers of topics.

Figure 6: The average coherence scores combined with the average specificity for each combination of model, dataset and number of topics.

Table 1: Descriptive statistics of the four datasets.

Table 2: Mean specificity per algorithm.

Table 3: Mean slope per algorithm, with the Spearman correlation coefficient ρ and Kendall's tau τ to test the strength of monotonicity.

Table 4: The ten topics produced by training an LDA model on the DBLP dataset.