Topic Modeling with LSA, pLSA, LDA, NMF, BERTopic, Top2Vec: a Comparison
A comparison between different topic modeling strategies including practical Python examples
In Natural Language Processing (NLP), the term topic modeling encompasses a series of statistical and Deep Learning techniques to find hidden semantic structures in sets of documents.
Topic modeling is an unsupervised Machine Learning problem. Unsupervised means that the algorithm learns patterns in the absence of tags or labels.
Most of the information we generate and exchange as human beings has a textual nature. Documents, conversations, phone calls, messages, emails, notes, social media posts. The ability to automatically extract value from these sources in the absence of (or with limited) a priori knowledge is an everlasting and ubiquitous problem in Data Science.
In this post, we discuss popular approaches to topic modeling, from conventional algorithms to the most recent techniques based on Deep Learning. We aim to share a friendly introduction to these models and to compare their advantages and disadvantages in practical applications.
We also provide end-to-end Python examples for the most predominant approaches.
Finally, we share some considerations on two of the most challenging aspects of unsupervised textual data analysis: the discrepancy between the human definition of “topic” and its statistical counterpart, and the difficulties associated with the quantitative assessment of topic model performance.
2. Topic Modeling Strategies
Latent Semantic Analysis (LSA) (Deerwester¹ et al., 1990), probabilistic Latent Semantic Analysis (pLSA) (Hofmann², 1999), Latent Dirichlet Allocation (LDA) (Blei³ et al., 2003) and Non-Negative Matrix Factorization (NMF) (Lee⁴ et al., 1999) are conventional and well-known approaches to topic modeling.
They represent a document as a bag-of-words and assume that each document is a mixture of latent topics.
They all start with the conversion of a textual corpus into a Document-Term Matrix (DTM), a table where each row is a document, and each column is a distinct word:
Note: implementations/research papers may also use/refer to the Term-Document Matrix (TDM), the transpose of the DTM.
Each cell <i, j> contains a count, i.e. how many times the word j appears in document i. A common alternative to the word count is the TF-IDF score. It considers both term frequency (TF) and inverse document frequency (IDF) to penalize the weight of terms that appear very often in the corpus, and increase the weight of rarer terms.
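To make the two weighting schemes concrete, here is a minimal sketch in plain Python. The toy corpus is hypothetical, and the TF-IDF variant used (tf ∙ log(N / df)) is only one common choice that we assume for illustration; libraries usually add smoothing and normalization on top of it.

```python
import math
from collections import Counter

# Toy corpus (hypothetical documents, for illustration only).
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

tokenized = [doc.split() for doc in docs]
vocab = sorted({word for tokens in tokenized for word in tokens})

# Document-Term Matrix: one row per document, one column per distinct word.
dtm = []
for tokens in tokenized:
    counts = Counter(tokens)
    dtm.append([counts[word] for word in vocab])

# Document frequency: in how many documents each word appears.
n_docs = len(docs)
df = [sum(1 for row in dtm if row[j] > 0) for j in range(len(vocab))]

# Plain TF-IDF (assumed variant): term count times log(N / document frequency).
tfidf = [
    [row[j] * math.log(n_docs / df[j]) for j in range(len(vocab))]
    for row in dtm
]
```

In the first document, "the" has the highest raw count, but "cat" receives the highest TF-IDF weight because it appears in no other document.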
The basic principle behind the search of latent topics is the decomposition of the DTM into a document-topic and a topic-term matrix. The following methods differ in how they define and reach this goal.
2.2 Latent Semantic Analysis (LSA)
To decompose the DTM and extract topics, Latent Semantic Analysis (LSA) applies a matrix factorization technique called Singular Value Decomposition (SVD).
SVD decomposes the DTM into the product of three distinct matrices:
DTM = U ∙ Σ ∙ Vᵗ, where U is m x m, Σ is m x n with the singular values of the DTM on its main diagonal, and V is n x n, with m the number of documents and n the number of distinct words.
LSA selects the t <= min(m, n) largest singular values of the DTM, and thus discards the last m - t and n - t columns of U and V, respectively. This procedure is known as truncated SVD. The resulting approximation of the DTM has rank t.
The rank t approximation of the DTM is optimal in the sense that it is the closest rank t matrix to the DTM in terms of L₂ norm. The remaining columns of U and V can be interpreted as document-topic and word-topic matrices, and t indicates the number of topics.
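The truncation step can be sketched with NumPy as follows. The count matrix is a hypothetical toy example, and scaling the document-topic matrix by the singular values is a convention we assume here, not the only possible one.

```python
import numpy as np

# Hypothetical count matrix: 4 documents (rows) x 5 words (columns).
dtm = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 0, 0, 1],
    [0, 0, 3, 1, 0],
    [0, 0, 1, 2, 0],
], dtype=float)

t = 2  # number of topics, i.e. singular values to keep

# Thin SVD: U is m x k, s holds k singular values, Vt is k x n, with k = min(m, n).
U, s, Vt = np.linalg.svd(dtm, full_matrices=False)

# NumPy returns singular values in descending order, so we keep the first t.
U_t, s_t, Vt_t = U[:, :t], s[:t], Vt[:t, :]

doc_topic = U_t * s_t      # m x t: documents expressed in topic space
topic_term = Vt_t          # t x n: topics expressed in word space
dtm_approx = doc_topic @ topic_term

rank = np.linalg.matrix_rank(dtm_approx)
# Spectral-norm error of the approximation: equals the first discarded singular value.
error = np.linalg.norm(dtm - dtm_approx, 2)
```

The approximation error in the L₂ (spectral) norm equals the first discarded singular value, which is what makes the rank t approximation optimal.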
2.3 Probabilistic Latent Semantic Analysis (pLSA)
Hofmann² (1999) proposed a variation of the LSA where the topics are estimated using a probabilistic model instead of SVD. Hence the name, probabilistic Latent Semantic Analysis (pLSA).
In particular, pLSA models the joint probability P(d, w) of seeing a word w and a document d together as a mixture of conditionally independent multinomial distributions:
P(d, w) = P(d) ∙ ∑_z P(z|d) ∙ P(w|z)
The previous expression can be re-written as:
P(d, w) = ∑_z P(z) ∙ P(d|z) ∙ P(w|z)
We can draw an analogy between this expression and the previous formulation of the DTM decomposition, where P(d|z) corresponds to U, P(z) to Σ, and P(w|z) to Vᵗ.
The model can be fit using the expectation-maximization algorithm (EM). In brief, EM performs maximum likelihood estimation in the presence of latent variables (in this case, the topics).
Notably, the decomposition of the DTM relies on different objective functions. For LSA, it is the L₂ norm, while for pLSA it is the likelihood function. The latter aims at an explicit maximization of the predictive power of the model.
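As an illustration of the EM procedure, the sketch below fits pLSA on a toy count matrix with NumPy. It uses the formulation P(w|d) = ∑_z P(z|d) ∙ P(w|z). The function name `plsa`, the random initialization and the small constant that guards against division by zero are our own assumptions; this is not a reference implementation.

```python
import numpy as np

def plsa(dtm, n_topics, n_iter=50, seed=0):
    """Fit pLSA with EM; return P(z|d), P(w|z) and the log-likelihood trace."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = dtm.shape
    # Random initialization, normalized so that each row is a distribution.
    p_z_d = rng.random((n_docs, n_topics))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    trace = []
    for _ in range(n_iter):
        # E-step: responsibilities P(z|d,w), shape (docs, topics, words).
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]
        p_w_d = joint.sum(axis=1)  # P(w|d), shape (docs, words)
        # Log-likelihood of the counts (up to the constant P(d) term).
        trace.append(float((dtm * np.log(p_w_d + 1e-12)).sum()))
        resp = joint / (p_w_d[:, None, :] + 1e-12)
        # M-step: re-estimate parameters from expected counts n(d,w) * P(z|d,w).
        expected = dtm[:, None, :] * resp
        p_w_z = expected.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_z_d = expected.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    return p_z_d, p_w_z, trace

# Hypothetical count matrix: 4 documents (rows) x 5 words (columns).
dtm = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 0, 0, 1],
    [0, 0, 3, 1, 0],
    [0, 0, 1, 2, 0],
], dtype=float)

p_z_d, p_w_z, trace = plsa(dtm, n_topics=2)
```

Each row of `p_z_d` and `p_w_z` is a probability distribution, and the log-likelihood stored in `trace` should not decrease from one iteration to the next, which is the property EM is designed to provide.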
pLSA shares the same advantages and drawbacks as the LSA model, with some peculiar differences:
pLSA provides no probabilistic model at the level of documents. This implies that:
2.4 Latent Dirichlet Allocation (LDA)
The Latent Dirichlet Allocation (LDA) (Blei³ et al., 2003) improves pLSA by using Dirichlet priors to estimate the document-topic and term-topic distributions in a Bayesian approach.
The Dirichlet distribution Dir(α) is a family of continuous multivariate probability distributions parameterized by a vector α of positive reals.
Let us imagine a newspaper with three sections: politics, sports and arts, each section also representing a topic. The hypothetical distribution of the topics mixture in the newspaper sections is an example of Dirichlet distribution:
Section 1 (politics):
Section 2 (sports):
Section 3 (arts):
Let us observe the plate notation (a conventional method to represent variables in a graphical model) of the LDA to explain the use of Dirichlet priors:
M indicates the number of documents and N the number of words in a document. From the top, we observe α, the parameter of the Dirichlet prior on the per-document topic distributions. From the Dirichlet distribution Dir(α), we draw a random sample representing the topic distribution θ for a document. It is as if, in our newspaper example, we were drawing a mixture (0.99 politics, 0.005 sports, 0.005 arts) describing the distribution of topics for an article.
From the selected mixture θ, we draw a topic z based on the distribution (in our example, politics). From the bottom, we observe β, the parameter of the Dirichlet prior on the per-topic word distributions. From the Dirichlet distribution Dir(β), we choose a sample representing the word distribution φ given the topic z. And, from φ, we draw a word w.
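The generative story described above can be simulated in a few lines of NumPy. This is only a sketch of the sampling process, not of inference; the vocabulary, the number of topics and the prior values are hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy vocabulary and sizes (hypothetical values).
vocab = ["election", "vote", "match", "goal", "painting", "museum"]
n_topics, n_docs, n_words = 3, 4, 8

alpha = np.full(n_topics, 0.5)     # Dirichlet prior on per-document topic mixtures
beta = np.full(len(vocab), 0.5)    # Dirichlet prior on per-topic word distributions

# One word distribution phi per topic, each drawn from Dir(beta).
phi = rng.dirichlet(beta, size=n_topics)    # shape: (topics, vocabulary)

thetas, corpus = [], []
for _ in range(n_docs):
    theta = rng.dirichlet(alpha)            # topic mixture for this document
    thetas.append(theta)
    words = []
    for _ in range(n_words):
        z = rng.choice(n_topics, p=theta)       # draw a topic from theta
        w = rng.choice(len(vocab), p=phi[z])    # draw a word from phi[z]
        words.append(vocab[w])
    corpus.append(words)
```

Prior values below 1 tend to produce peaked mixtures, similar to the newspaper example where one topic dominates an article.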
In the end, we are interested in estimating the probability of a topic z given a document d and the parameters α and β, i.e. P(z|d, α, β). The problem is formulated as the calculation of the posterior distribution of the hidden variables given a document:
P(θ, z|d, α, β) = P(θ, z, d|α, β) / P(d|α, β)
Since this distribution is intractable to compute, Blei³ et al. (2003) suggested the use of an approximate inference algorithm (variational approximation). The optimizing values are found by minimizing the Kullback-Leibler divergence between the approximate distribution and the true posterior P(θ, z|d, α, β). Once we get the optimal parameters for our data, we can again compute P(z|d, α, β), which, in a sense, corresponds to the document-topic matrix U. Each entry of β₁, β₂, ..., βₜ is p(w|z), which corresponds to the term-topic matrix V. The main difference is that, much like in pLSA, the matrix coefficients have a statistical interpretation.
Practical example with LDA
In the following example, we use the Gensim library with pyLDAvis for a visual topic exploration.
2.5 Non-negative Matrix Factorization (NMF)
Non-negative Matrix Factorization (NMF), introduced by Lee⁴ et al. (1999), is a variation of LSA.
LSA leverages SVD to decompose the Document-Term Matrix and extract latent information (the topics). A property of SVD is that the basis vectors are orthogonal to each other, forcing some elements in the bases to be negative.
In brief, factorizations with negative matrix coefficients (like SVD) pose a problem for interpretability. Subtractive combinations make it hard to understand how a component contributes to the whole. NMF decomposes the Document-Term Matrix into a document-topic matrix U and a topic-term matrix Vᵗ, much like SVD, but with the additional constraint that U and Vᵗ can only contain non-negative elements.
Moreover, while we exploited a decomposition of the form U ∙ Σ ∙ Vᵗ, in the case of non-negative matrix factorization, this becomes U ∙ Vᵗ.
The decomposition of the DTM can be posed as an optimization problem that aims at minimizing the difference between the DTM and its approximation. Frequently adopted distance measures are the Frobenius norm and the Kullback-Leibler divergence.
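As a minimal sketch of this optimization, the code below minimizes the Frobenius norm with multiplicative update rules, a widely used scheme for NMF. The function name `nmf`, the random initialization and the constant `eps` are assumptions made for illustration; library implementations typically rely on more refined solvers and initializations.

```python
import numpy as np

def nmf(dtm, n_topics, n_iter=200, seed=0, eps=1e-10):
    """Approximate dtm as U @ Vt with non-negative factors (Frobenius objective)."""
    rng = np.random.default_rng(seed)
    m, n = dtm.shape
    U = rng.random((m, n_topics))     # document-topic weights, non-negative
    Vt = rng.random((n_topics, n))    # topic-term weights, non-negative
    errors = []
    for _ in range(n_iter):
        # Multiplicative updates: factors are rescaled by non-negative ratios,
        # so they can never become negative.
        Vt *= (U.T @ dtm) / (U.T @ U @ Vt + eps)
        U *= (dtm @ Vt.T) / (U @ Vt @ Vt.T + eps)
        errors.append(float(np.linalg.norm(dtm - U @ Vt)))
    return U, Vt, errors

# Hypothetical count matrix: 4 documents (rows) x 5 words (columns).
dtm = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 0, 0, 1],
    [0, 0, 3, 1, 0],
    [0, 0, 1, 2, 0],
], dtype=float)

U, Vt, errors = nmf(dtm, n_topics=2)
```

Because both factors stay non-negative, each document is described as a purely additive combination of topics, which is the interpretability benefit discussed above.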
NMF shares the same main advantages and drawbacks as the other classical models (bag-of-words approach, need for pre-processing, …), but with some peculiar traits:
Practical example with NMF
2.6 BERTopic and Top2Vec
Grootendorst¹¹ (2022) and Angelov¹² (2020) proposed novel approaches to topic modeling, BERTopic and Top2Vec respectively. These models address the limitations of conventional strategies discussed so far. We explore them together in the following paragraphs.
2.6.1 Document embedding
BERTopic and Top2Vec create semantic embeddings from input documents.
In the original papers, BERTopic leveraged BERT Sentence Transformers (SBERT) to create high-quality, contextual word and sentence vector representations. Top2Vec, instead, used Doc2Vec to create jointly embedded word, document, and topic vectors.
At the time of writing, both algorithms support a variety of embedding strategies, although BERTopic has a broader coverage of embedding models:
2.6.2 Dimensionality reduction with UMAP
One may apply a clustering algorithm to the embeddings directly, but this would increase the computational cost and lead to poor clustering performance (due to the “curse of dimensionality”).
Therefore, a dimensionality reduction technique is applied before clustering. UMAP (Uniform Manifold Approximation and Projection) (McInnes¹³ et al., 2018) provides several benefits:
2.6.3 Clustering
Both BERTopic and Top2Vec originally leveraged HDBSCAN (McInnes¹⁵ et al., 2017) as the clustering algorithm.
BERTopic currently also supports K-Means and agglomerative clustering, providing flexibility of choice. K-Means lets the user select the desired number of clusters and forces every document into a cluster. This avoids the generation of outliers, but may also result in poorer topic representation and coherence.
2.6.4 Topic representation
BERTopic and Top2Vec differ from each other in how they build a representation for the topics.
BERTopic concatenates all documents within the same cluster (topic) and applies a modified TF-IDF. In brief, it replaces documents with clusters in the original TF-IDF formula. Then, it uses the most important words for each cluster as the topic representation.
This score is called class-based TF-IDF (c-TF-IDF), as it estimates the importance of words in clusters instead of documents.
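The idea can be sketched in plain Python as follows. We assume the weighting tf ∙ log(1 + A / f), where tf is the count of a word in a cluster, f its count across all clusters and A the average number of words per cluster, as described in the BERTopic paper. The clusters and the helper names `c_tf_idf` and `top_words` are hypothetical, and the library itself applies further normalization.

```python
import math
from collections import Counter

# Hypothetical clusters of already-tokenized documents.
clusters = {
    "sports": [["goal", "match", "team"], ["match", "score", "team"]],
    "arts": [["painting", "museum", "team"], ["museum", "gallery"]],
}

# Concatenate all documents of a cluster into a single "class document".
class_counts = {
    name: Counter(word for doc in docs for word in doc)
    for name, docs in clusters.items()
}

# Word counts across all clusters.
total_counts = Counter()
for counts in class_counts.values():
    total_counts.update(counts)

# Average number of words per cluster.
avg_words = sum(total_counts.values()) / len(class_counts)

def c_tf_idf(word, name):
    """Class-based TF-IDF of a word within a cluster (assumed formula)."""
    tf = class_counts[name][word]
    return tf * math.log(1 + avg_words / total_counts[word])

def top_words(name, k=2):
    """Top-k words of a cluster; ties are broken alphabetically."""
    words = class_counts[name]
    return sorted(words, key=lambda w: (-c_tf_idf(w, name), w))[:k]
```

In this toy example, "team" and "match" are equally frequent in the sports cluster, but "team" also appears in the arts cluster and therefore receives a lower score.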
Top2Vec, instead, builds a representation with the words closest to the cluster’s centroid. In particular, for each dense area obtained through HDBSCAN, it calculates the centroid of document vectors in the original dimension, then selects the most proximal word vectors.
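A minimal sketch of this centroid-based representation is shown below. The 2-D vectors and cluster labels are invented for illustration (real embeddings have hundreds of dimensions), and the use of cosine similarity as the proximity measure is our assumption.

```python
import numpy as np

# Hypothetical embeddings in a joint word/document space.
word_vectors = {
    "goal": np.array([0.9, 0.1]),
    "match": np.array([0.8, 0.2]),
    "museum": np.array([0.1, 0.9]),
    "painting": np.array([0.2, 0.8]),
}
doc_vectors = np.array([
    [0.85, 0.15],
    [0.90, 0.20],
    [0.10, 0.95],
    [0.15, 0.80],
])
labels = np.array([0, 0, 1, 1])  # cluster label per document, e.g. from HDBSCAN

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def topic_words(cluster, k=2):
    """Words whose vectors are closest to the centroid of the cluster's documents."""
    centroid = doc_vectors[labels == cluster].mean(axis=0)
    ranked = sorted(
        word_vectors,
        key=lambda w: cosine(word_vectors[w], centroid),
        reverse=True,
    )
    return ranked[:k]
```

Since words and documents live in the same space, the topic representation comes directly from the geometry of the embeddings, with no word counting involved.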
Pros of BERTopic and Top2Vec:
Cons of BERTopic and Top2Vec:
Practical example with BERTopic
Practical example with Top2Vec
The following table summarizes the salient traits of the different topic modeling strategies considering practical application scenarios:
This summary table provides high-level selection criteria for a given use case.
Let’s share some examples.
Imagine the need to find trending topics in Tweets with little pre-processing effort. In this case, one may choose to use Top2Vec and BERTopic. They work splendidly on shorter textual sources and do not require much pre-processing.
Instead, imagine a scenario where a customer is interested in finding how a given document may contain a mixture of multiple topics. In this case, approaches like LDA and NMF would be preferable. BERTopic and Top2Vec assign a document to one topic only. Although the probability distribution produced by HDBSCAN may be used as a proxy of the topic distribution, BERTopic and Top2Vec are not mixed membership models by design.
4. Additional remarks
When discussing topic modeling, there are two major points of attention worth mentioning at the end of our journey.
4.1 A topic is not (necessarily) what we think it is
When we come across a magazine in a waiting room, we know at a glance which genre it belongs to. When we enter a conversation, a few sentences are enough to let us guess the object of discussion. This is a “topic” from the perspective of a human being.
Unfortunately, the term “topic” assumes a completely different meaning in the models discussed so far.
Let us remember the Document-Term Matrix. At a high level, we want to decompose it as the product of a document-topic and a topic-term matrix, and extract the latent dimension (the topics) in the process. The goal of these strategies (like LSA) is the minimization of the decomposition error.
Probabilistic generative models (like LDA) add a layer of statistical formalism with a robust and elegant Bayesian approach, but what they really try to do is to recreate the original document-word distribution with minimal error.
None of these models ensures that the obtained topics will be informative or useful from a human point of view.
In the beautiful words of Blei³ et al. (2003):
“We refer to the latent multinomial variables in the LDA model as topics, so as to exploit text-oriented intuitions, but we make no epistemological claims regarding these latent variables beyond their utility in representing probability distributions on sets of words.”
On the other hand, BERTopic and Top2Vec leverage semantic embeddings. Therefore, the vectors used to represent documents carry a proxy (the closest we have so far) of their “meaning” from a “human” perspective. These amazing models assume that clustering a projection of these embeddings may lead to more meaningful and specific topics.
Studies (to cite a few: Grootendorst¹¹ 2022, Angelov¹² 2020, Egger¹⁶ et al. 2022) show that topics obtained leveraging semantic embeddings are more informative and coherent across several domains.
But even in this case, there is still an underlying mismatch between the definition of topic for a human and for such models. What these algorithms produce is “interpretable, well represented and coherent groups of semantically similar documents”.
Make no mistake: this is an outstanding and unique result that opened an entirely new frontier in the field and achieved unprecedented performances.
But we may still debate on how this approximates the human definition of topic, and under what circumstances.
And if you think this is a trivial subtlety, have you ever tried to explain a topic like “mail_post_email_posting” to business stakeholders? Yes, it is coherent and interpretable, but is it what they think of when they imagine a “topic”?
This leads us to the second point of attention.
4.2 Topics are not easy to evaluate
Topic modeling is an unsupervised technique. There are no labels to rely on during evaluation.
Coherence measures have been proposed to evaluate topic quality with respect to interpretability. For example, Normalized Pointwise Mutual Information (NPMI) (Bouma¹⁷, 2009) estimates how much more likely the co-occurrence of two words x and y is than we would expect by chance:
NPMI(x, y) = log( P(x, y) / (P(x) ∙ P(y)) ) / ( -log P(x, y) )
NPMI can vary from -1 (no co-occurrences) to +1 (perfect co-occurrence). Independence between the occurrences of x and y results in NPMI=0.
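To illustrate these three regimes, here is a minimal sketch that computes NPMI(x, y) = log(P(x, y) / (P(x) ∙ P(y))) / (-log P(x, y)) in plain Python. The toy corpus is hypothetical, and estimating probabilities from document-level co-occurrence is an assumption; in practice, sliding windows over a reference corpus are often used.

```python
import math

# Hypothetical corpus: each document reduced to its set of distinct words.
docs = [
    {"mail", "email", "post", "news"},
    {"mail", "email", "server"},
    {"goal", "match", "news"},
    {"goal", "match", "team"},
]

def npmi(x, y, docs):
    """NPMI of two words, with probabilities estimated as document frequencies."""
    n = len(docs)
    p_x = sum(x in d for d in docs) / n
    p_y = sum(y in d for d in docs) / n
    p_xy = sum(x in d and y in d for d in docs) / n
    if p_xy == 0:
        return -1.0  # limit value: the two words never co-occur
    if p_xy == 1:
        return 1.0   # both words occur in every document
    return math.log(p_xy / (p_x * p_y)) / -math.log(p_xy)

always_together = npmi("mail", "email", docs)   # the words only appear together
never_together = npmi("mail", "goal", docs)     # the words never appear together
independent = npmi("mail", "news", docs)        # co-occurrence matches chance
```

A topic-level coherence score is then commonly obtained by averaging NPMI over the pairs of top words of a topic.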
Lau¹⁸ et al. (2014) suggest that this metric emulates human judgement to a reasonable extent.
Other coherence measures also exist. For example, Cv (Röder¹⁹ et al. 2015) and UMass (Mimno²⁰ et al., 2011).
These coherence metrics suffer from a series of shortcomings:
As Grootendorst¹¹ (2022) writes:
“Validation measures such as topic coherence and topic diversity are proxies of what is essentially a subjective evaluation. One user might judge the coherence and diversity of a topic differently from another user. As a result, although these measures can be used to get an indication of a model’s performance, they are just that, an indication”.
In conclusion, validation measures fail to estimate a topic model's performance with clarity. They do not provide the unequivocal interpretation that an accuracy or F1 score would for a classification problem. As a consequence, the quantification of a “measure of goodness” for the obtained topics still requires domain knowledge and human evaluation. The assessment of business value (“Will these topics benefit the project?”) is no trivial feat, and may need composite metrics and a holistic approach.
In this post, we shared a friendly overview of popular topic modeling algorithms, from generative statistical models to transformers-based approaches.
We also provided a table highlighting advantages and disadvantages of each technique. This could be used for comparison and to aid a preliminary model selection in different scenarios.
Finally, we shared two of the most challenging aspects of unsupervised textual data analysis.
First, the difference, so often overlooked, between the human definition of “topic” and its statistical counterpart as the result of a “topic modeling” algorithm. Understanding this discrepancy is paramount to meet project goals and guide the expectations of business stakeholders in NLP endeavours.
Then, we discussed the difficulty of quantitatively assessing topic model performance, by introducing popular metrics and their shortcomings.
[1] Deerwester et al., Indexing by latent semantic analysis, Journal of the American Society for Information Science, Volume 41, Issue 6, p. 391–407, 1990.
[2] Hofmann, Probabilistic Latent Semantic Analysis, Proceedings of the XV Conference on Uncertainty in Artificial Intelligence (UAI1999), 1999.
[3] Blei et al., Latent dirichlet allocation, The Journal of Machine Learning Research, Volume 3, p. 993–1022, 2003.
[4] Lee et al., Learning the parts of objects by non-negative matrix factorization, Nature, Volume 401, p. 788–791, 1999.
[5] Barbieri et al., Probabilistic topic models for sequence data, Machine Learning, Volume 93, p. 5–29, 2013.
[6] Rizvi et al., Analyzing social media data to understand consumers’ information needs on dietary supplements, Stud. Health Technol. Inform., Volume 264, p. 323–327, 2019.
[7] Alnusyan et al., A semi-supervised approach for user reviews topic modeling and classification, International Conference on Computing and Information Technology, 1–5, 2020.
[8] Egger and Yu, Identifying hidden semantic structures in Instagram data: a topic modelling comparison, Tour. Rev. 2021:244, 2021.
[9] Xu et al., Document clustering based on non-negative matrix factorization, Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, p. 267–273, 2003.
[10] Casalino et al., Nonnegative matrix factorizations for intelligent data analysis, Non-negative Matrix Factorization Techniques, Springer, p. 49–74, 2016.
[11] Grootendorst, BERTopic: Neural topic modeling with a class-based TF-IDF procedure, 2022.
[12] Angelov, Top2Vec: Distributed Representations of Topics, 2020.
[13] McInnes et al., UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction, 2018.
[14] Allaoui et al., Considerably improving clustering algorithms using umap dimensionality reduction technique: A comparative study, International Conference on Image and Signal Processing, Springer, p. 317–325, 2020.
[15] McInnes et al., hdbscan: Hierarchical density based clustering, The Journal of Open Source Software, 2(11):205, 2017.
[16] Egger et al., A Topic Modeling Comparison Between LDA, NMF, Top2Vec, and BERTopic to Demystify Twitter Posts, Frontiers in Sociology, Volume 7, Article 886498, 2022.
[17] Bouma, Normalized (pointwise) mutual information in collocation extraction, Proceedings of GSCL, 30:31–40, 2009.
[18] Lau et al., Machine reading tea leaves: Automatically evaluating topic coherence and topic model quality, Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, p. 530–539, 2014.
[19] Röder et al., Exploring the space of topic coherence measures, Proceedings of the eighth ACM international conference on Web search and data mining, p. 399–408, ACM, 2015.
[20] Mimno et al., Optimizing semantic coherence in topic models, Proc. of the Conf. on Empirical Methods in Natural Language Processing, p. 262–272, 2011.
[21] Zuo et al., Word network topic model: a simple but general solution for short and imbalanced texts, Knowledge and Information Systems, 48(2), p. 379–398.
[22] Blair et al., Aggregated topic models for increasing social media topic coherence, Applied Intelligence, 50(1), p. 138–156, 2020.
[23] Doogan et al., Topic Model or Topic Twaddle? Re-evaluating Semantic Interpretability Measures, Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, p. 3824–3848, 2021.
[24] Hoyle et al., Is automated topic model evaluation broken? the incoherence of coherence, Advances in Neural Information Processing Systems, 34, 2021.