Topics in this matrix are then discovered using Non-negative Matrix Factorization.
Essentially the model tries to discover underlying dimensions/factors along which most of the variance in term importance can be explained.
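To make this concrete, here is a minimal sketch of the decomposition step using scikit-learn's `NMF`; the random matrix is only a stand-in for a real document-term matrix of keyword importances.

```python
import numpy as np
from sklearn.decomposition import NMF

# Stand-in for a (documents x terms) matrix of non-negative keyword importances
keyword_matrix = np.random.rand(100, 500)

nmf = NMF(n_components=10)
doc_topic = nmf.fit_transform(keyword_matrix)  # how strongly each document loads on each topic
topic_term = nmf.components_                   # how important each term is for each topic
```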
You can fit KeyNMF on the raw corpus, with precomputed embeddings or with precomputed keywords.
=== "Fitting on a corpus"

    ```python
    model.fit(corpus)
    ```

=== "Pre-computed embeddings"

    ```python
    from sentence_transformers import SentenceTransformer

    trf = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = trf.encode(corpus)

    model = KeyNMF(10, encoder=trf)
    model.fit(corpus, embeddings=embeddings)
    ```

=== "Pre-computed keyword matrix"

    ```python
    keyword_matrix = model.extract_keywords(corpus)
    model.fit(None, keywords=keyword_matrix)
    ```
## Seeded Topic Modeling
When investigating a set of documents, you might already have an idea about what aspects you would like to explore.
In KeyNMF, you can describe the aspect along which you want to investigate your corpus using a free-text seed phrase,
which will then be used to extract only those topics that are relevant to your research question.
??? info "How is this done?"
    KeyNMF encodes the seed phrase into a seed embedding.
    Word importance scores in a document get weighted by their similarity to the seed embedding:

    - Embed the seed phrase into a seed embedding $s$.
    - When extracting keywords from a document:
        1. Let $x_d$ be the document's embedding produced with the encoder model.
        2. Let the document's relevance be $r_d = \text{sim}(x_d, s)$.
        3. For each word $w$ with embedding $x_w$:
            1. Let the word's importance in the keyword matrix be $\text{sim}(x_d, x_w) \cdot r_d$ if $r_d > 0$, otherwise $0$ (see the sketch below).
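Below is a rough illustration of this weighting with a generic sentence transformer; the seed phrase, document, and vocabulary are made-up examples, and this is a sketch of the idea rather than turftopic's internal code.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical seed phrase, document and vocabulary
seed = encoder.encode("chemistry and chemical elements")
doc = encoder.encode("Hydrogen is the lightest element.")
words = ["hydrogen", "element", "football"]

r_d = cosine(doc, seed)  # document relevance to the seed phrase
importances = {
    word: cosine(doc, encoder.encode(word)) * r_d if r_d > 0 else 0.0
    for word in words
}
```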
```python
from turftopic import KeyNMF

model = KeyNMF(5, seed_phrase="<your seed phrase>")
```
## Dynamic Topic Modeling
KeyNMF is also capable of modeling topics over time.
This happens by fitting a KeyNMF model first on the entire corpus, then fitting individual topic-term matrices using coordinate descent based on the document-topic and document-term matrices in the given time slices.
<figure>
    <figcaption>Topics over time in a Dynamic KeyNMF model.</figcaption>
</figure>
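In practice, fitting a dynamic model might look like the sketch below; it assumes turftopic's dynamic fitting interface (`fit_transform_dynamic` with per-document timestamps and a number of time bins), and the corpus and timestamps are placeholders.

```python
from datetime import datetime

from turftopic import KeyNMF

corpus = ["First document ...", "Second document ..."]      # placeholder documents
timestamps = [datetime(2018, 3, 1), datetime(2022, 11, 5)]  # one timestamp per document

model = KeyNMF(5)
document_topic_matrix = model.fit_transform_dynamic(
    corpus, timestamps=timestamps, bins=10
)
model.print_topics_over_time()
```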
## Hierarchical Topic Modeling
When you suspect that subtopics might be present in the topics you find with the model, KeyNMF can be used to discover topics further down the hierarchy.
This is done by utilising a special case of **weighted NMF**, where documents are weighted by how high they score on the parent topic.
??? info "Click to see formula"
    1. Decompose keyword matrix $M \approx WH$
    2. To find subtopics in topic $j$, define document weights $w$ as the $j$th column of $W$.
    3. Estimate subcomponents with **wNMF** $M \approx \mathring{W} \mathring{H}$ with document weight $w$:
        1. Initialise $\mathring{H}$ and $\mathring{W}$ randomly.
        2. Perform multiplicative updates until convergence (see the sketch below).
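As a rough numpy sketch of these weighted multiplicative updates (a hypothetical standalone helper, not turftopic's internal implementation):

```python
import numpy as np

def weighted_nmf(M, w, n_components, n_iter=200, eps=1e-9):
    """Approximate M ~ W_sub @ H_sub, weighting each document
    (each row of M) by its parent-topic score in w."""
    rng = np.random.default_rng(42)
    n_docs, n_terms = M.shape
    W_sub = rng.random((n_docs, n_components))
    H_sub = rng.random((n_components, n_terms))
    A = w[:, None]  # broadcast document weights over all terms
    for _ in range(n_iter):
        H_sub *= (W_sub.T @ (A * M)) / (W_sub.T @ (A * (W_sub @ H_sub)) + eps)
        W_sub *= ((A * M) @ H_sub.T) / ((A * (W_sub @ H_sub)) @ H_sub.T + eps)
    return W_sub, H_sub

# To find subtopics of topic j, w would be the j-th column of the
# parent model's document-topic matrix, i.e. w = W[:, j].
```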
For a detailed tutorial on hierarchical modeling click [here](hierarchical.md).

## Asymmetric and Instruction-tuned Embedding Models
Some embedding models can be used together with prompting, or encode queries and passages differently.
This is important for KeyNMF, as it is explicitly based on keyword retrieval, and its performance can be substantially enhanced by using asymmetric or prompted embeddings.
Microsoft's E5 models are, for instance, all prompted by default, and it would be detrimental to performance not to use these prompts yourself.
In these cases, you're better off NOT passing a string to Turftopic models, but explicitly loading the model using `sentence-transformers`.

Here's an example of using instruct models for keyword retrieval with KeyNMF.
In this case, documents will serve as the queries and words as the passages:

```python
from sentence_transformers import SentenceTransformer

from turftopic import KeyNMF

encoder = SentenceTransformer(
    "intfloat/multilingual-e5-large-instruct",
    prompts={
        "query": "Instruct: Retrieve relevant keywords from the given document. Query: ",
        "passage": "Passage: ",
    },
    # Make sure to set default prompt to query!
    default_prompt_name="query",
)

model = KeyNMF(10, encoder=encoder)
```

And a plain asymmetric example with an E5 model:
```python
encoder = SentenceTransformer(
    "intfloat/e5-large-v2",
    prompts={
        "query": "query: ",
        "passage": "passage: ",
    },
    # Make sure to set default prompt to query!
    default_prompt_name="query",
)
model = KeyNMF(10, encoder=encoder)
```
Setting the default prompt to `query` is especially important when you are precomputing embeddings, as `query` should always be your default prompt to embed documents with.
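For instance, with the encoder defined above (a sketch; `corpus` stands in for your list of documents):

```python
# Since default_prompt_name="query", encode() embeds the documents
# with the query prompt, which is what KeyNMF expects.
embeddings = encoder.encode(corpus)

model = KeyNMF(10, encoder=encoder)
model.fit(corpus, embeddings=embeddings)
```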