- Estimate inverse document/topic frequency for term $j$:
  $idf_j = \log(\frac{N}{\sum_z |t_{zj}|})$, where $N$ is the total number of documents.
- Calculate importance of term $j$ for topic $z$:
  $\text{Soft-c-TF-IDF}_{zj} = tf_{zj} \cdot idf_j$
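As a rough illustration, here is a minimal NumPy sketch of these two steps; it assumes `tf` is a topics × terms matrix of soft term frequencies ($tf_{zj}$) already estimated from the mixture model, and that every term occurs in at least one topic (the variable names are illustrative, not the library's internals):

```python
import numpy as np

def soft_ctf_idf(tf: np.ndarray, n_documents: int) -> np.ndarray:
    """Soft-c-TF-IDF from a (topics x terms) soft term-frequency matrix."""
    # idf_j = log(N / sum_z |tf_zj|)
    idf = np.log(n_documents / np.abs(tf).sum(axis=0))
    # importance of term j for topic z: tf_zj * idf_j
    return tf * idf[None, :]
```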
### Dynamic Modeling
GMM is also capable of dynamic topic modeling. This happens by fitting one underlying mixture model over the entire corpus, as we expect that there is only one semantic model generating the documents.
To gain temporal representations for topics, the corpus is divided into equally sized or arbitrarily chosen time slices, and term importances are then estimated with Soft-c-TF-IDF for each time slice separately.
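The same computation, done once per time slice (again a sketch; `tf_per_slice` and `n_docs_per_slice` are assumed to hold the slice-level soft term frequencies and document counts):

```python
import numpy as np

temporal_importance = []
for tf_t, n_t in zip(tf_per_slice, n_docs_per_slice):
    idf_t = np.log(n_t / np.abs(tf_t).sum(axis=0))
    temporal_importance.append(tf_t * idf_t)  # Soft-c-TF-IDF for slice t
```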
You can speed up and stabilize GMM on high-dimensional embeddings by passing a dimensionality reduction method to the model:

```python
from sklearn.decomposition import IncrementalPCA
from turftopic import GMM

model = GMM(20, dimensionality_reduction=IncrementalPCA(20))
```
## Considerations
### Strengths
- Efficiency, Stability: GMM relies on a rock-solid implementation in scikit-learn, so you can rest assured that the model will be fast and reliable.
- Coverage of Ingroup Variance: The model is very good at describing the extracted topics in all their detail.
This means that the topic descriptions will typically cover most of the documents generated from the topic fairly well.
- Uncertainty: GMM is capable of expressing and modeling uncertainty around topic labels for documents.
- Dynamic Modeling: You can model changes in topics over time using GMM.
### Weaknesses
- Curse of Dimensionality: The dimensionality of embeddings can vary wildly from model to model. High-dimensional embeddings might decrease the efficiency and performance of GMM, as it is sensitive to the curse of dimensionality. Dimensionality reduction can help mitigate these issues.
- Assumption of Gaussianity: The model assumes that topics are Gaussian components, and it might very well be that this is not the case.
Fortunately, this rarely affects the perceived real-world performance of the model, and typically does not present an issue in practical settings.
- Moderate Scalability: While the model is scalable to a certain extent, it is not nearly as scalable as some of the other options. If you experience issues with computational efficiency or convergence, try another model.
- Moderate Robustness to Noise: GMM is about as sensitive to noise and stop words as BERTopic, and can sometimes find noise components. Our experience indicates that GMM is far less volatile, however, and that the quality of its results is more reliable than with clustering models using c-TF-IDF.
---

**docs/KeyNMF.md**
A minimal usage example:

```python
from turftopic import KeyNMF

model = KeyNMF(10)
model.fit(corpus)  # corpus: a list of text documents
model.print_topics()
```
## How does KeyNMF work?

### Keyword Extraction

KeyNMF discovers topics based on the importance of keywords for a given document.
Both the documents and the vocabulary are encoded with the same sentence encoder, and keywords are assigned to each document based on the cosine similarity of the document embedding to the embeddings of the words occurring in that document.
Only the `top_n` keywords with positive similarity to the document are kept.

- For each document $d$:
    1. Let $x_d$ be the document's embedding produced with the encoder model.
    2. For each word $w$ in the document $d$:
        1. Let $v_w$ be the word's embedding produced with the encoder model.
        2. Calculate the cosine similarity of the word and document embeddings: $\text{sim}(d, w) = \frac{x_d \cdot v_w}{||x_d|| \cdot ||v_w||}$
    3. Keep the `top_n` words with the highest positive similarity to the document.

These keywords are then arranged into a document-term importance matrix, where each column represents a keyword that was encountered in at least one document, and each row is a document. The entries in the matrix are the cosine similarities of the given keyword to the document in semantic space.
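As a sketch of this step using `sentence-transformers` and scikit-learn directly (the encoder name, the naive whitespace tokenization, and the `top_n` value are all illustrative, not the library's internals):

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

encoder = SentenceTransformer("all-MiniLM-L6-v2")
top_n = 25

def extract_keywords(document: str) -> dict[str, float]:
    words = sorted(set(document.lower().split()))
    doc_emb = encoder.encode([document])
    word_embs = encoder.encode(words)
    sims = cosine_similarity(doc_emb, word_embs)[0]
    top = np.argsort(-sims)[:top_n]
    # keep only keywords with positive similarity to the document
    return {words[i]: float(sims[i]) for i in top if sims[i] > 0}
```

Each such per-document keyword dictionary corresponds to one row of the document-term importance matrix described above.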
Topics in this matrix are then discovered using Non-negative Matrix Factorization.
Essentially, the model tries to discover underlying dimensions/factors along which most of the variance in term importance can be explained.
??? info "Click to see formula"

    - Decompose $M$ with non-negative matrix factorization: $M \approx WH$, where $W$ is the document-topic matrix, and $H$ is the topic-term matrix. Non-negative Matrix Factorization is done with the coordinate-descent algorithm, minimizing square loss:

    $$
    L(W,H) = ||M - WH||^2
    $$
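In scikit-learn terms, the decomposition step is roughly the following (assuming `M` is the document-keyword importance matrix from above, with 10 topics as an illustrative choice):

```python
from sklearn.decomposition import NMF

# M: (n_documents x n_keywords) non-negative keyword importance matrix
nmf = NMF(n_components=10, solver="cd")  # coordinate descent, squared loss
W = nmf.fit_transform(M)  # document-topic matrix
H = nmf.components_       # topic-term matrix
```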
You can fit KeyNMF on the raw corpus, with precomputed embeddings or with precomputed keywords.
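For instance, the precomputed-embeddings route might look roughly like this (a sketch: it assumes `fit` accepts an `embeddings` keyword argument and that `corpus` is a list of documents; check the API reference for the exact signature):

```python
from sentence_transformers import SentenceTransformer
from turftopic import KeyNMF

encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(corpus)  # precompute document embeddings once

model = KeyNMF(10, encoder=encoder)
model.fit(corpus, embeddings=embeddings)
```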
### Asymmetric and Instruction-tuned Embedding Models
Some embedding models can be used together with prompting, or encode queries and passages differently.
This is important for KeyNMF, as it is explicitly based on keyword retrieval, and its performance can be substantially enhanced by using asymmetric or prompted embeddings.
Setting the default prompt to `query` is especially important when you are precomputing embeddings, as `query` should always be the default prompt used to embed documents.
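As a sketch, an asymmetric encoder with named prompts could be set up along these lines (the model name and prompt strings are illustrative; recent `sentence-transformers` versions accept the `prompts` and `default_prompt_name` arguments):

```python
from sentence_transformers import SentenceTransformer
from turftopic import KeyNMF

encoder = SentenceTransformer(
    "intfloat/e5-large-v2",
    prompts={"query": "query: ", "passage": "passage: "},
    default_prompt_name="query",  # documents get embedded as queries by default
)
model = KeyNMF(10, encoder=encoder)
```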
### Dynamic Topic Modeling
KeyNMF is also capable of modeling topics over time.
This happens by fitting a KeyNMF model first on the entire corpus, then fitting individual topic-term matrices using coordinate descent, based on the document-topic and document-term matrices in the given time slices.
??? info "Click to see formula"

    1. Compute keyword matrix $M$ for the whole corpus.
    2. Decompose $M$ with non-negative matrix factorization: $M \approx WH$.
    3. For each time slice $t$:
        1. Let $W_t$ be the document-topic proportions for documents in time slice $t$, and $M_t$ be the keyword matrix for words in time slice $t$.
        2. Obtain the topic-term matrix for the time slice, by minimizing square loss using coordinate descent and fixing $W_t$:

    $$
    H_t = \text{argmin}_{H^{*}} ||M_t - W_t H^{*}||^2
    $$
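A small NumPy/SciPy sketch of this per-slice problem, solving it exactly column by column with non-negative least squares rather than coordinate descent (`M_t` and `W_t` are the slice-level matrices defined above):

```python
import numpy as np
from scipy.optimize import nnls

# M_t: (docs_in_slice x n_terms), W_t: (docs_in_slice x n_topics), both fixed
n_topics, n_terms = W_t.shape[1], M_t.shape[1]
H_t = np.zeros((n_topics, n_terms))
for j in range(n_terms):
    # H_t[:, j] = argmin_h ||M_t[:, j] - W_t h||^2  subject to h >= 0
    H_t[:, j], _ = nnls(W_t, M_t[:, j])
```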
Here's an example of using KeyNMF in a dynamic modeling setting:
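A sketch of what such a call could look like; the `fit_transform_dynamic` and `print_topics_over_time` names, the `timestamps` argument, and the `bins` parameter are assumptions about the dynamic-modeling interface, so check the API reference for the exact names:

```python
from turftopic import KeyNMF

model = KeyNMF(5)
# corpus: list of documents; timestamps: one datetime per document
document_topic_matrix = model.fit_transform_dynamic(
    corpus, timestamps=timestamps, bins=10
)
model.print_topics_over_time()
```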
### Hierarchical Topic Modeling

When you suspect that subtopics might be present in the topics you find with the model, KeyNMF can be used to discover topics further down the hierarchy.
This is done by utilising a special case of **weighted NMF**, where documents are weighted by how high they score on the parent topic.
In other words:

1. Decompose the keyword matrix $M \approx WH$.
2. To find subtopics in topic $j$, define document weights $w$ as the $j$th column of $W$.
3. Estimate subcomponents with **wNMF** $M \approx \mathring{W} \mathring{H}$ with document weight $w$:
    1. Initialise $\mathring{H}$ and $\mathring{W}$ randomly.
    2. Perform multiplicative updates until convergence (a generic sketch of such updates is shown after this list).
4. To sufficiently differentiate the subcomponents from each other, a pseudo-c-TF-IDF weighting scheme is applied to $\mathring{H}$:
    1. $\mathring{H}_{ij} \leftarrow \mathring{H}_{ij} \cdot \ln(1 + \frac{A}{1+\sum_k \mathring{H}_{kj}})$, where $A$ is the average of all elements in $\mathring{H}$.
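A compact sketch of steps 3 and 4 with a generic weighted-NMF multiplicative-update scheme (not necessarily the exact updates used internally); `M` is the keyword matrix, `W` the parent document-topic matrix, and `j` the parent topic index, all assumed to exist already:

```python
import numpy as np

def weighted_nmf(M, w, n_components, n_iter=200, eps=1e-10):
    """Weighted NMF: approximately minimize sum_ij w_i * (M_ij - (WH)_ij)^2 with W, H >= 0."""
    rng = np.random.default_rng(0)
    n_docs, n_terms = M.shape
    W_hat = rng.random((n_docs, n_components))
    H_hat = rng.random((n_components, n_terms))
    A = np.repeat(w[:, None], n_terms, axis=1)  # per-entry weights from document weights
    for _ in range(n_iter):
        WH = W_hat @ H_hat
        W_hat *= ((A * M) @ H_hat.T) / (((A * WH) @ H_hat.T) + eps)
        WH = W_hat @ H_hat
        H_hat *= (W_hat.T @ (A * M)) / ((W_hat.T @ (A * WH)) + eps)
    return W_hat, H_hat

# Subtopics of parent topic j: weight documents by the j-th column of W
W_sub, H_sub = weighted_nmf(M, w=W[:, j], n_components=3)
# Step 4: pseudo-c-TF-IDF weighting to differentiate the subcomponents
A_mean = H_sub.mean()
H_sub = H_sub * np.log(1 + A_mean / (1 + H_sub.sum(axis=0)))
```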
To create a hierarchical model, you can use the `hierarchy` property of the model.
```python
print(model.hierarchy)
```
For a detailed tutorial on hierarchical modeling click [here](hierarchical.md).
## Considerations
### Strengths
- Stability, Robustness and Quality: KeyNMF extracts very clean topics even when a lot of noise is present in the corpus, and the model's performance remains relatively stable across domains.
- Scalability: The model can be fitted in an online fashion, and we recommend that you choose KeyNMF when the number of documents is large (over 100 000).
- Fail-safe and Adjustable: Since the modelling process consists of multiple, easily separable steps, it is easy to repeat one if something goes wrong. This also makes it an ideal choice for production usage.
- Can capture multiple topics in a document.
### Weaknesses
- Lack of Nuance: Since only the top K keywords are considered and used for topic extraction, some of the nuances, especially in long texts, might get lost. We therefore recommend that you scale K with the average length of the texts you're working with. For tweets it might be worth scaling it down to 5, while with longer documents a larger number (say, 50) might be advisable.
- Practitioners have to choose the number of topics a priori.