Commit 1d02b56

Merge pull request #66 from x-tabdeveloping/quality_of_life
Quality of life improvements and better docs
2 parents 9755d9f + 6ecdd44 commit 1d02b56

19 files changed

Lines changed: 911 additions & 480 deletions

docs/GMM.md

Lines changed: 30 additions & 45 deletions
@@ -1,4 +1,4 @@
-# GMM
+# GMM (Gaussian Mixture Model)
 
 GMM is a generative probabilistic model over the contextual embeddings.
 The model assumes that contextual embeddings are generated from a mixture of underlying Gaussian components.
@@ -9,47 +9,51 @@ These Gaussian components are assumed to be the topics.
 <figcaption>Components of a Gaussian Mixture Model <br>(figure from scikit-learn documentation)</figcaption>
 </figure>
 
-## The Model
+## How does GMM work?
 
-### 1. Generative Modeling
-
-GMM assumes that the embeddings are generated according to the following stochastic process:
-
-1. Select global topic weights: $\Theta$
-2. For each component select mean $\mu_z$ and covariance matrix $\Sigma_z$ .
-3. For each document:
-    - Draw topic label: $z \sim Categorical(\Theta)$
-    - Draw document vector: $\rho \sim \mathcal{N}(\mu_z, \Sigma_z)$
+### Generative Modeling
 
+GMM assumes that the embeddings are generated according to the following stochastic process from a number of Gaussian components.
 Priors are optionally imposed on the model parameters.
 The model is fitted either using expectation maximization or variational inference.
 
-### 2. Topic Inference over Documents
+??? info "Click to see formula"
+    1. Select global topic weights: $\Theta$
+    2. For each component select mean $\mu_z$ and covariance matrix $\Sigma_z$.
+    3. For each document:
+        - Draw topic label: $z \sim Categorical(\Theta)$
+        - Draw document vector: $\rho \sim \mathcal{N}(\mu_z, \Sigma_z)$
+
+
+### Calculate Topic Probabilities
 
 After the model is fitted, soft topic labels are inferred for each document.
 A document-topic-matrix ($T$) is built from the likelihoods of each component given the document encodings.
 
-Or in other words for document $i$ and topic $z$ the matrix entry will be: $T_{iz} = p(\rho_i|\mu_z, \Sigma_z)$
+??? info "Click to see formula"
+    - For document $i$ and topic $z$ the matrix entry will be: $T_{iz} = p(\rho_i|\mu_z, \Sigma_z)$
 
-### 3. Soft c-TF-IDF
+### Soft c-TF-IDF
 
 Term importances for the discovered Gaussian components are estimated post-hoc using a technique called __Soft c-TF-IDF__,
 an extension of __c-TF-IDF__, that can be used with continuous labels.
 
-Let $X$ be the document term matrix where each element ($X_{ij}$) corresponds with the number of times word $j$ occurs in a document $i$.
-Soft Class-based tf-idf scores for terms in a topic are then calculated in the following manner:
+??? info "Click to see formula"
+
+    Let $X$ be the document term matrix where each element ($X_{ij}$) corresponds with the number of times word $j$ occurs in a document $i$.
+    Soft Class-based tf-idf scores for terms in a topic are then calculated in the following manner:
 
-- Estimate weight of term $j$ for topic $z$: <br>
-$tf_{zj} = \frac{t_{zj}}{w_z}$, where
-$t_{zj} = \sum_i T_{iz} \cdot X_{ij}$ and
-$w_{z}= \sum_i(|T_{iz}| \cdot \sum_j X_{ij})$ <br>
-- Estimate inverse document/topic frequency for term $j$:
-$idf_j = log(\frac{N}{\sum_z |t_{zj}|})$, where
-$N$ is the total number of documents.
-- Calculate importance of term $j$ for topic $z$:
-$Soft-c-TF-IDF{zj} = tf_{zj} \cdot idf_j$
+    - Estimate weight of term $j$ for topic $z$: <br>
+    $tf_{zj} = \frac{t_{zj}}{w_z}$, where
+    $t_{zj} = \sum_i T_{iz} \cdot X_{ij}$ and
+    $w_{z}= \sum_i(|T_{iz}| \cdot \sum_j X_{ij})$ <br>
+    - Estimate inverse document/topic frequency for term $j$:
+    $idf_j = log(\frac{N}{\sum_z |t_{zj}|})$, where
+    $N$ is the total number of documents.
+    - Calculate importance of term $j$ for topic $z$:
+    $Soft-c-TF-IDF_{zj} = tf_{zj} \cdot idf_j$
 
-### _(Optional)_ 4. Dynamic Modeling
+### Dynamic Modeling
 
 GMM is also capable of dynamic topic modeling. This happens by fitting one underlying mixture model over the entire corpus, as we expect that there is only one semantic model generating the documents.
 To gain temporal representations for topics, the corpus is divided into equal, or arbitrarily chosen time slices, and then term importances are estimated using Soft-c-TF-IDF for each of the time slices separately.
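To make the formulas in this hunk concrete, here is a rough NumPy sketch of the document-topic matrix and the Soft c-TF-IDF weighting. It is not the turftopic implementation: the inputs are random stand-ins, and `GaussianMixture.predict_proba` returns component responsibilities, which stand in here for the likelihoods $p(\rho_i|\mu_z, \Sigma_z)$ named above.

```python
# Illustrative sketch of the document-topic matrix and Soft c-TF-IDF steps
# described above; random stand-in data, not the turftopic code.
import numpy as np
from sklearn.mixture import GaussianMixture

embeddings = np.random.rand(100, 16)                   # contextual embeddings (docs x dims)
doc_term_matrix = np.random.randint(0, 3, (100, 500))  # X: word counts (docs x terms)

mixture = GaussianMixture(n_components=10).fit(embeddings)

# Document-topic matrix T: soft topic labels for each document.
# (predict_proba gives responsibilities; the formula above uses raw likelihoods.)
T = mixture.predict_proba(embeddings)                  # shape: (n_docs, n_topics)

# Soft c-TF-IDF term importances.
t = T.T @ doc_term_matrix                              # t_zj = sum_i T_iz * X_ij
w = np.abs(T).T @ doc_term_matrix.sum(axis=1)          # w_z = sum_i |T_iz| * sum_j X_ij
tf = t / w[:, None]                                    # tf_zj = t_zj / w_z
idf = np.log(doc_term_matrix.shape[0] / np.abs(t).sum(axis=0))  # idf_j = log(N / sum_z |t_zj|)
soft_ctfidf = tf * idf[None, :]                        # importance of term j for topic z
```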
@@ -90,25 +94,6 @@ from sklearn.decomposition import IncrementalPCA
 model = GMM(20, dimensionality_reduction=IncrementalPCA(20))
 ```
 
-## Considerations
-
-### Strengths
-
-- Efficiency, Stability: GMM relies on a rock solid implementation in scikit-learn, you can rest assured that the model will be fast and reliable.
-- Coverage of Ingroup Variance: The model is very efficient at describing the extracted topics in all their detail.
-This means that the topic descriptions will typically cover most of the documents generated from the topic fairly well.
-- Uncertainty: GMM is capable of expressing and modeling uncertainty around topic labels for documents.
-- Dynamic Modeling: You can model changes in topics over time using GMM.
-
-### Weaknesses
-
-- Curse of Dimensionality: The dimensionality of embeddings can vary wildly from model to model. High-dimensional embeddings might decrease the efficiency and performance of GMM, as it is sensitive to the curse of dimensionality. Dimensionality reduction can help mitigate these issues.
-- Assumption of Gaussianity: The model assumes that topics are Gaussian components, it might very well be that this is not the case.
-Fortunately enough this rarely effects real-world perceived performance of models, and typically does not present an issue in practical settings.
-- Moderate Scalability: While the model is scalable to a certain extent, it is not nearly as scalable as some of the other options. If you experience issues with computational efficiency or convergence, try another model.
-- Moderate Robustness to Noise: GMM is similarly sensitive to noise and stop words as BERTopic, and can sometimes find noise components. Our experience indicates that GMM is way less volatile, and the quality of the results is more reliable than with clustering models using C-TF-IDF.
-
-
 ## API Reference
 
 ::: turftopic.models.gmm.GMM
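For the dynamic modeling workflow described above, usage would look roughly like the sketch below; the `fit_transform_dynamic` method name and its arguments are assumptions about turftopic's dynamic interface (only `print_topics_over_time` and `partial_fit_dynamic` appear verbatim elsewhere in this commit), so verify them against the API reference.

```python
# Hypothetical dynamic-modeling usage for GMM; fit_transform_dynamic() and its
# arguments are assumed, not confirmed by this commit -- check the API reference.
from datetime import datetime

from turftopic import GMM

corpus = [
    "NASA launches a new climate satellite.",
    "The central bank raised interest rates again.",
]
timestamps = [datetime(2020, 1, 1), datetime(2021, 6, 1)]

model = GMM(2)
document_topic_matrix = model.fit_transform_dynamic(corpus, timestamps=timestamps, bins=2)
model.print_topics_over_time()
```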

docs/KeyNMF.md

Lines changed: 65 additions & 79 deletions
@@ -19,42 +19,42 @@ model.fit(corpus)
 model.print_topics()
 ```
 
-## Keyword Extraction
+## How does KeyNMF work?
 
-The first step of the process is gaining enhanced representations of documents by using contextual embeddings.
-Both the documents and the vocabulary get encoded with the same sentence encoder.
-Keywords are assigned to each document based on the cosine similarity of the document embedding to the embedded words in the document.
-Only the top K words with positive cosine similarity to the document are kept.
-These keywords are then arranged into a document-term importance matrix where each column represents a keyword that was encountered in at least one document,
-and each row is a document. The entries in the matrix are the cosine similarities of the given keyword to the document in semantic space.
+### Keyword Extraction
 
-- For each document $d$:
-    1. Let $x_d$ be the document's embedding produced with the encoder model.
-    2. For each word $w$ in the document $d$:
-        1. Let $v_w$ be the word's embedding produced with the encoder model.
-        2. Calculate cosine similarity between word and document
+KeyNMF discovers topics based on the importances of keywords for a given document.
+This is done by embedding words in a document, and then extracting the cosine similarities of documents to words using a transformer-model.
+Only the `top_n` keywords with positive similarity are kept.
 
-$$
-\text{sim}(d, w) = \frac{x_d \cdot v_w}{||x_d|| \cdot ||v_w||}
-$$
+??? info "Click to see formula"
+    - For each document $d$:
+        1. Let $x_d$ be the document's embedding produced with the encoder model.
+        2. For each word $w$ in the document $d$:
+            1. Let $v_w$ be the word's embedding produced with the encoder model.
+            2. Calculate cosine similarity between word and document
+
+    $$
+    \text{sim}(d, w) = \frac{x_d \cdot v_w}{||x_d|| \cdot ||v_w||}
+    $$
 
-3. Let $K_d$ be the set of $N$ keywords with the highest cosine similarity to document $d$.
+    3. Let $K_d$ be the set of $N$ keywords with the highest cosine similarity to document $d$.
 
-$$
-K_d = \text{argmax}_{K^*} \sum_{w \in K^*}\text{sim}(d,w)\text{, where }
-|K_d| = N\text{, and } \\
-w \in d
-$$
+    $$
+    K_d = \text{argmax}_{K^*} \sum_{w \in K^*}\text{sim}(d,w)\text{, where }
+    |K_d| = N\text{, and } \\
+    w \in d
+    $$
 
-- Arrange positive keyword similarities into a keyword matrix $M$ where the rows represent documents, and columns represent unique keywords.
+    - Arrange positive keyword similarities into a keyword matrix $M$ where the rows represent documents, and columns represent unique keywords.
 
-$$
-M_{dw} =
-\begin{cases}
-\text{sim}(d,w), & \text{if } w \in K_d \text{ and } \text{sim}(d,w) > 0 \\
-0, & \text{otherwise}.
-\end{cases}
-$$
+    $$
+    M_{dw} =
+    \begin{cases}
+    \text{sim}(d,w), & \text{if } w \in K_d \text{ and } \text{sim}(d,w) > 0 \\
+    0, & \text{otherwise}.
+    \end{cases}
+    $$
 
 You can do this step manually if you want to precompute the keyword matrix.
 Keywords are represented as dictionaries mapping words to keyword importances.
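The keyword-extraction step described in this hunk can be sketched with sentence-transformers and NumPy; the encoder name and the simple whitespace tokenizer are illustrative assumptions, not what turftopic does internally.

```python
# Rough sketch of the keyword extraction step described above; illustrative only.
import numpy as np
from sentence_transformers import SentenceTransformer

corpus = [
    "the cat sat on the mat",
    "stocks fell sharply on wall street",
]
top_n = 5

encoder = SentenceTransformer("all-MiniLM-L6-v2")

vocab = sorted({word for doc in corpus for word in doc.split()})
word_idx = {word: i for i, word in enumerate(vocab)}

doc_emb = encoder.encode(corpus)   # document embeddings x_d
word_emb = encoder.encode(vocab)   # word embeddings v_w

# Cosine similarities between every document and every vocabulary word.
doc_emb = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
word_emb = word_emb / np.linalg.norm(word_emb, axis=1, keepdims=True)
sim = doc_emb @ word_emb.T         # sim(d, w)

# Keyword matrix M: keep the top_n positive similarities per document,
# restricted to words that actually occur in that document.
M = np.zeros_like(sim)
for d, doc in enumerate(corpus):
    in_doc = [word_idx[w] for w in set(doc.split())]
    best = sorted(in_doc, key=lambda w: sim[d, w], reverse=True)[:top_n]
    for w in best:
        if sim[d, w] > 0:
            M[d, w] = sim[d, w]
```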
@@ -78,19 +78,22 @@ keyword_matrix = model.extract_keywords(corpus)
 model.fit(None, keywords=keyword_matrix)
 ```
 
-## Topic Discovery
+### Topic Discovery
 
 Topics in this matrix are then discovered using Non-negative Matrix Factorization.
 Essentially the model tries to discover underlying dimensions/factors along which most of the variance in term importance
 can be explained.
 
-- Decompose $M$ with non-negative matrix factorization: $M \approx WH$, where $W$ is the document-topic matrix, and $H$ is the topic-term matrix. Non-negative Matrix Factorization is done with the coordinate-descent algorithm, minimizing square loss:
+??? info "Click to see formula"
+
+    - Decompose $M$ with non-negative matrix factorization: $M \approx WH$, where $W$ is the document-topic matrix, and $H$ is the topic-term matrix. Non-negative Matrix Factorization is done with the coordinate-descent algorithm, minimizing square loss:
+
+    $$
+    L(W,H) = ||M - WH||^2
+    $$
 
-$$
-L(W,H) = ||M - WH||^2
-$$
+You can fit KeyNMF on the raw corpus, with precomputed embeddings or with precomputed keywords.
 
-You can fit KeyNMF on the raw corpus, with precomputed embeddings or with precomputed keywords.
 ```python
 # Fitting just on the corpus
 model.fit(corpus)
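The factorization step then amounts to running NMF on the keyword matrix $M$; scikit-learn's coordinate-descent solver minimizes the same squared loss as the formula above. The matrix below is a random stand-in for a real keyword matrix, so this is only a sketch of the idea.

```python
# Sketch of the topic discovery step: factorize the keyword matrix M ~ WH.
import numpy as np
from sklearn.decomposition import NMF

M = np.abs(np.random.rand(100, 500))     # stand-in for a real keyword matrix

nmf = NMF(n_components=10, solver="cd")  # coordinate descent, squared loss
W = nmf.fit_transform(M)                 # document-topic matrix
H = nmf.components_                      # topic-term matrix
```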
@@ -109,7 +112,7 @@ keyword_matrix = model.extract_keywords(corpus)
 model.fit(None, keywords=keyword_matrix)
 ```
 
-## Asymmetric and Instruction-tuned Embedding Models
+### Asymmetric and Instruction-tuned Embedding Models
 
 Some embedding models can be used together with prompting, or encode queries and passages differently.
 This is important for KeyNMF, as it is explicitly based on keyword retrieval, and its performance can be substantially enhanced by using asymmetric or prompted embeddings.
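The prompting setup described here can be sketched as follows; the model name and prompt strings are placeholders rather than turftopic's documented defaults (the page's own example, `model = KeyNMF(10, encoder=encoder)`, appears in the next hunk).

```python
# Hypothetical prompted encoder setup for KeyNMF; model name and prompt
# wording are placeholders, not recommendations from the turftopic docs.
from sentence_transformers import SentenceTransformer
from turftopic import KeyNMF

encoder = SentenceTransformer(
    "intfloat/multilingual-e5-large-instruct",
    prompts={
        "query": "Instruct: Retrieve relevant keywords from the given document. Query: ",
        "passage": "Passage: ",
    },
    default_prompt_name="query",
)

model = KeyNMF(10, encoder=encoder)
```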
@@ -154,21 +157,23 @@ model = KeyNMF(10, encoder=encoder)
 Setting the default prompt to `query` is especially important, when you are precomputing embeddings, as `query` should always be your default prompt to embed documents with.
 
 
-## Dynamic Topic Modeling
+### Dynamic Topic Modeling
 
 KeyNMF is also capable of modeling topics over time.
 This happens by fitting a KeyNMF model first on the entire corpus, then
 fitting individual topic-term matrices using coordinate descent based on the document-topic and document-term matrices in the given time slices.
 
-1. Compute keyword matrix $M$ for the whole corpus.
-2. Decompose $M$ with non-negative matrix factorization: $M \approx WH$.
-3. For each time slice $t$:
-    1. Let $W_t$ be the document-topic proportions for documents in time slice $t$, and $M_t$ be the keyword matrix for words in time slice $t$.
-    2. Obtain the topic-term matrix for the time slice, by minimizing square loss using coordinate descent and fixing $W_t$:
+??? info "Click to see formula"
 
-$$
-H_t = \text{argmin}_{H^{*}} ||M_t - W_t H^{*}||^2
-$$
+    1. Compute keyword matrix $M$ for the whole corpus.
+    2. Decompose $M$ with non-negative matrix factorization: $M \approx WH$.
+    3. For each time slice $t$:
+        1. Let $W_t$ be the document-topic proportions for documents in time slice $t$, and $M_t$ be the keyword matrix for words in time slice $t$.
+        2. Obtain the topic-term matrix for the time slice, by minimizing square loss using coordinate descent and fixing $W_t$:
+
+    $$
+    H_t = \text{argmin}_{H^{*}} ||M_t - W_t H^{*}||^2
+    $$
 
 Here's an example of using KeyNMF in a dynamic modeling setting:

@@ -200,12 +205,7 @@ model.print_topics_over_time()
 | - | - | - | - | - | - |
 | 2012 12 06 - 2013 11 10 | genocide, yugoslavia, karadzic, facts, cnn | cnn, russia, chechnya, prince, merkel | france, cnn, francois, hollande, bike | tennis, tournament, wimbledon, grass, courts | beckham, soccer, retired, david, learn |
 | 2013 11 10 - 2014 10 14 | keith, stones, richards, musician, author | georgia, russia, conflict, 2008, cnn | civil, rights, hear, why, should | cnn, kidneys, traffickers, organ, nepal | ronaldo, cristiano, goalscorer, soccer, player |
-| 2014 10 14 - 2015 09 18 | ethiopia, brew, coffee, birthplace, anderson | climate, sutter, countries, snapchat, injustice | women, guatemala, murder, country, worst | cnn, climate, oklahoma, women, topics | sweden, parental, dads, advantage, leave |
-| 2015 09 18 - 2016 08 22 | snow, ice, winter, storm, pets | climate, crisis, drought, outbreaks, syrian | women, vulnerabilities, frontlines, countries, marcelas | cnn, warming, climate, sutter, theresa | sutter, band, paris, fans, crowd |
-| 2016 08 22 - 2017 07 26 | derby, epsom, sporting, race, spectacle | overdoses, heroin, deaths, macron, emmanuel | fear, died, indigenous, people, arthur | siblings, amnesia, palombo, racial, mh370 | bobbi, measles, raped, camp, rape |
-| 2017 07 26 - 2018 06 30 | her, percussionist, drums, she, deported | novichok, hurricane, hospital, deaths, breathing | women, day, celebrate, taliban, international | abuse, harassment, cnn, women, pilgrimage | maradona, argentina, history, jadon, rape |
-| 2018 06 30 - 2019 06 03 | athletes, teammates, celtics, white, racism | pope, archbishop, francis, vigano, resignation | racism, athletes, teammates, celtics, white | golf, iceland, volcanoes, atlantic, ocean | rape, sudanese, racist, women, soldiers |
-| 2019 06 03 - 2020 05 07 | esports, climate, ice, racers, culver | esports, coronavirus, pandemic, football, teams | racers, women, compete, zone, bery | serena, stadium, sasha, final, naomi | kobe, bryant, greatest, basketball, influence |
+| | | ... | | | |
 | 2020 05 07 - 2021 04 10 | olympics, beijing, xinjiang, ioc, boycott | covid, vaccine, coronavirus, pandemic, vaccination | olympic, japan, medalist, canceled, tokyo | djokovic, novak, tennis, federer, masterclass | ronaldo, cristiano, messi, juventus, barcelona |
 | 2021 04 10 - 2022 03 16 | olympics, tokyo, athletes, beijing, medal | covid, pandemic, vaccine, vaccinated, coronavirus | olympic, athletes, ioc, medal, athlete | djokovic, novak, tennis, wimbledon, federer | ronaldo, cristiano, messi, manchester, scored |

@@ -225,11 +225,11 @@ model.plot_topics_over_time(top_k=5)
 ```
 
 <figure>
-<img src="../images/dynamic_keynmf.png" width="80%" style="margin-left: auto;margin-right: auto;">
+<img src="../images/dynamic_keynmf.png" width="50%" style="margin-left: auto;margin-right: auto;">
 <figcaption>Topics over time on a Figure</figcaption>
 </figure>
 
-## Online Topic Modeling
+### Online Topic Modeling
 
 KeyNMF can also be fitted in an online manner.
 This is done by fitting NMF with batches of data instead of the whole dataset at once.
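Online fitting as described here usually amounts to a loop over batches; the `partial_fit` call below is assumed by analogy with the `partial_fit_dynamic` call shown in the next hunk, so confirm it against the API reference.

```python
# Hypothetical online fitting loop for KeyNMF; partial_fit() is assumed to
# exist by analogy with partial_fit_dynamic shown later in this file.
from itertools import islice

from turftopic import KeyNMF

def batched(iterable, n):
    it = iter(iterable)
    while batch := list(islice(it, n)):
        yield batch

corpus = ["document one", "document two", "document three", "document four"]
model = KeyNMF(5)

for text_batch in batched(corpus, 2):
    model.partial_fit(text_batch)
```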
@@ -354,22 +354,22 @@ for batch in batched(zip(corpus, timestamps)):
 model.partial_fit_dynamic(text_batch, timestamps=ts_batch, bins=bins)
 ```
 
-## Hierarchical Topic Modeling
+### Hierarchical Topic Modeling
 
 When you suspect that subtopics might be present in the topics you find with the model, KeyNMF can be used to discover topics further down the hierarchy.
 
 This is done by utilising a special case of **weighted NMF**, where documents are weighted by how high they score on the parent topic.
-In other words:
-
-1. Decompose keyword matrix $M \approx WH$
-2. To find subtopics in topic $j$, define document weights $w$ as the $j$th column of $W$.
-3. Estimate subcomponents with **wNMF** $M \approx \mathring{W} \mathring{H}$ with document weight $w$
-    1. Initialise $\mathring{H}$ and $\mathring{W}$ randomly.
-    2. Perform multiplicative updates until convergence. <br>
-    $\mathring{W}^T = \mathring{W}^T \odot \frac{\mathring{H} \cdot (M^T \odot w)}{\mathring{H} \cdot \mathring{H}^T \cdot (\mathring{W}^T \odot w)}$ <br>
-    $\mathring{H}^T = \mathring{H}^T \odot \frac{ (M^T \odot w)\cdot \mathring{W}}{\mathring{H}^T \cdot (\mathring{W}^T \odot w) \cdot \mathring{W}}$
-4. To sufficiently differentiate the subcomponents from each other a pseudo-c-tf-idf weighting scheme is applied to $\mathring{H}$:
-    1. $\mathring{H} = \mathring{H}_{ij} \odot ln(1 + \frac{A}{1+\sum_k \mathring{H}_{kj}})$, where $A$ is the average of all elements in $\mathring{H}$
+
+??? info "Click to see formula"
+    1. Decompose keyword matrix $M \approx WH$
+    2. To find subtopics in topic $j$, define document weights $w$ as the $j$th column of $W$.
+    3. Estimate subcomponents with **wNMF** $M \approx \mathring{W} \mathring{H}$ with document weight $w$
+        1. Initialise $\mathring{H}$ and $\mathring{W}$ randomly.
+        2. Perform multiplicative updates until convergence. <br>
+        $\mathring{W}^T = \mathring{W}^T \odot \frac{\mathring{H} \cdot (M^T \odot w)}{\mathring{H} \cdot \mathring{H}^T \cdot (\mathring{W}^T \odot w)}$ <br>
+        $\mathring{H}^T = \mathring{H}^T \odot \frac{ (M^T \odot w)\cdot \mathring{W}}{\mathring{H}^T \cdot (\mathring{W}^T \odot w) \cdot \mathring{W}}$
+    4. To sufficiently differentiate the subcomponents from each other a pseudo-c-tf-idf weighting scheme is applied to $\mathring{H}$:
+        1. $\mathring{H} = \mathring{H}_{ij} \odot ln(1 + \frac{A}{1+\sum_k \mathring{H}_{kj}})$, where $A$ is the average of all elements in $\mathring{H}$
 
 To create a hierarchical model, you can use the `hierarchy` property of the model.
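The weighted-NMF updates and pseudo-c-tf-idf reweighting in the hunk above translate into NumPy as shown below; this is a naive illustration on random placeholder data, not the turftopic implementation.

```python
# Naive NumPy illustration of the weighted-NMF updates and pseudo-c-tf-idf
# reweighting described above; placeholder data only.
import numpy as np

rng = np.random.default_rng(0)
M = np.abs(rng.random((50, 200)))   # keyword matrix (documents x keywords)
W = np.abs(rng.random((50, 5)))     # parent document-topic matrix
w = W[:, 2]                         # weights: how much each doc loads on topic 2

k, eps = 3, 1e-9                    # number of subtopics
Wr = np.abs(rng.random((50, k)))    # subtopic document-topic matrix (W ring)
Hr = np.abs(rng.random((k, 200)))   # subtopic topic-term matrix (H ring)

MtW = M.T * w                       # M^T with each column weighted by w
for _ in range(200):
    Wr = (Wr.T * (Hr @ MtW) / (Hr @ Hr.T @ (Wr.T * w) + eps)).T
    Hr = (Hr.T * (MtW @ Wr) / (Hr.T @ (Wr.T * w) @ Wr + eps)).T

# Pseudo-c-tf-idf reweighting to differentiate the subtopics.
A = Hr.mean()
Hr = Hr * np.log(1 + A / (1 + Hr.sum(axis=0)))
```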

@@ -395,20 +395,6 @@ print(model.hierarchy)
 
 For a detailed tutorial on hierarchical modeling click [here](hierarchical.md).
 
-## Considerations
-
-### Strengths
-
-- Stability, Robustness and Quality: KeyNMF extracts very clean topics even when a lot of noise is present in the corpus, and the model's performance remains relatively stable across domains.
-- Scalability: The model can be fitted in an online fashion, and we recommend that you choose KeyNMF when the number of documents is large (over 100 000).
-- Fail Safe and Adjustable: Since the modelling process consists of multiple easily separable steps it is easy to repeat one if something goes wrong. This also makes it an ideal choice for production usage.
-- Can capture multiple topics in a document.
-
-### Weaknesses
-
-- Lack of Nuance: Since only the top K keywords are considered and used for topic extraction some of the nuances, especially in long texts might get lost. We therefore recommend that you scale K with the average length of the texts you're working with. For tweets it might be worth it to scale it down to 5, while with longer documents, a larger number (let's say 50) might be advisable.
-- Practitioners have to choose the number of topics a priori.
-
 ## API Reference
 
 ::: turftopic.models.keynmf.KeyNMF
