Commit c66ec54: Updated docs in KeyNMF
1 parent 560682e commit c66ec54

2 files changed, 928 additions & 83 deletions

docs/KeyNMF.md: 156 additions & 83 deletions
@@ -8,20 +8,30 @@ while taking inspiration from classical matrix-decomposition approaches for extr

<figcaption>Schematic overview of KeyNMF</figcaption>
</figure>

Here's an example of how you can fit and interpret a KeyNMF model in the easiest way.

```python
from turftopic import KeyNMF

model = KeyNMF(10, encoder="paraphrase-MiniLM-L3-v2")
model.fit(corpus)

model.print_topics()
```

!!! question "Which embedding model should I use?"
    - You should probably use KeyNMF with a `paraphrase-` type embedding model, as these seem to perform best on most tasks. Some examples include:
        - [paraphrase-MiniLM-L3-v2](https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L3-v2) - Absolutely tiny :mouse:
        - [paraphrase-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-mpnet-base-v2) - High performance :star2:
        - [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) - Multilingual, high performance :earth_americas: :star2:
    - KeyNMF also works remarkably well with static models, which are incredibly fast, even on your laptop:
        - [sentence-transformers/static-retrieval-mrl-en-v1](https://huggingface.co/sentence-transformers/static-retrieval-mrl-en-v1) - Blazing fast :zap:
        - [sentence-transformers/static-similarity-mrl-multilingual-v1](https://huggingface.co/sentence-transformers/static-similarity-mrl-multilingual-v1) - Multilingual, blazing fast :earth_americas: :zap:
## How does KeyNMF work?

#### Keyword Extraction

KeyNMF discovers topics based on the importance of keywords for a given document.
This is done by embedding the words in a document, and then extracting the cosine similarities of documents to words using a transformer model.
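As a rough sketch of this idea, keyword importances can be computed from precomputed document and word embeddings. The helper below is hypothetical and only illustrates the mechanism, not Turftopic's actual code:

```python
import numpy as np

def keyword_importances(doc_emb, word_emb, doc_term_counts, top_n=6):
    # Illustrative sketch (hypothetical helper, not Turftopic's implementation).
    # L2-normalise embeddings so dot products equal cosine similarities.
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    w = word_emb / np.linalg.norm(word_emb, axis=1, keepdims=True)
    sim = d @ w.T
    keywords = np.zeros_like(sim)
    for i in range(sim.shape[0]):
        # Only words that actually occur in the document are candidates.
        present = doc_term_counts[i] > 0
        scores = np.where(present, sim[i], -np.inf)
        top = np.argsort(-scores)[:top_n]
        # Keep the top_n similarities, clipped at zero so the matrix
        # stays non-negative for the NMF step.
        keywords[i, top] = np.clip(scores[top], 0.0, None)
    return keywords
```

The result is a non-negative document-term importance matrix, which is the shape of input the NMF step needs.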
@@ -78,7 +88,7 @@

```python
keyword_matrix = model.extract_keywords(corpus)
model.fit(None, keywords=keyword_matrix)
```

#### Topic Discovery

Topics in this matrix are then discovered using Non-negative Matrix Factorization.
Essentially, the model tries to discover the underlying dimensions/factors along which most of the variance in term importance

@@ -94,70 +104,89 @@ can be explained.
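Conceptually, this factorization step can be sketched with scikit-learn's `NMF` on the keyword matrix. This is an illustrative stand-in with made-up data, not necessarily the solver Turftopic uses internally:

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical non-negative document-keyword matrix (documents x terms);
# in KeyNMF this would come from the keyword-extraction step.
keyword_matrix = np.random.default_rng(0).random((20, 30))

nmf = NMF(n_components=5, init="nndsvd", random_state=0, max_iter=500)
W = nmf.fit_transform(keyword_matrix)  # document-topic loadings
H = nmf.components_                    # topic-term importances
```

Each row of `H` can then be read off as a topic: the terms with the highest importance in that row.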

You can fit KeyNMF on the raw corpus, with precomputed embeddings, or with precomputed keywords.

=== "Fitting on a corpus"

    ```python
    model.fit(corpus)
    ```

=== "Pre-computed embeddings"

    ```python
    from sentence_transformers import SentenceTransformer

    trf = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = trf.encode(corpus)

    model = KeyNMF(10, encoder=trf)
    model.fit(corpus, embeddings=embeddings)
    ```

=== "Pre-computed keyword matrix"

    ```python
    keyword_matrix = model.extract_keywords(corpus)
    model.fit(None, keywords=keyword_matrix)
    ```

## Seeded Topic Modeling

When investigating a set of documents, you might already have an idea about what aspects you would like to explore.
In KeyNMF, you can describe the aspect from which you want to investigate your corpus using a free-text seed phrase,
which will then be used to extract only those topics that are relevant to your research question.

??? info "How is this done?"
    KeyNMF encodes the seed phrase into a seed embedding.
    Word importance scores in a document get weighted by the document's similarity to the seed embedding.

    - Embed the seed phrase into a seed embedding: $s$
    - When extracting keywords from a document:
        1. Let $x_d$ be the document's embedding produced with the encoder model.
        2. Let the document's relevance be $r_d = \text{sim}(x_d, s)$.
        3. For each word $w$ with embedding $x_w$, let the word's importance in the keyword matrix be $\text{sim}(x_d, x_w) \cdot r_d$ if $r_d > 0$, otherwise $0$.

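The seeded weighting described above can be sketched in a few lines of numpy. This is an illustration of the formula only; the function name and the clipping of similarities at zero are my assumptions, not the library's internals:

```python
import numpy as np

def seeded_keyword_importances(doc_emb, word_emb, seed_emb):
    # Illustrative sketch of the seeded weighting, not Turftopic's code.
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    d, w, s = normalize(doc_emb), normalize(word_emb), normalize(seed_emb)
    relevance = d @ s                       # r_d = sim(x_d, s) per document
    sim = d @ w.T                           # sim(x_d, x_w) per doc-word pair
    weights = np.where(relevance > 0, relevance, 0.0)  # zero out r_d <= 0
    # Clip similarities at zero to keep the keyword matrix non-negative.
    return np.clip(sim, 0.0, None) * weights[:, None]
```

Documents that are irrelevant or contrary to the seed ($r_d \leq 0$) contribute nothing, so the discovered topics concentrate on the seeded aspect.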

```python
from turftopic import KeyNMF

model = KeyNMF(5, seed_phrase="<your seed phrase>")
model.fit(corpus)

model.print_topics()
```

=== "`'Is the death penalty moral?'`"

    | Topic ID | Highest Ranking |
    | - | - |
    | 0 | morality, moral, immoral, morals, objective, morally, animals, society, species, behavior |
    | 1 | armenian, armenians, genocide, armenia, turkish, turks, soviet, massacre, azerbaijan, kurdish |
    | 2 | murder, punishment, death, innocent, penalty, kill, crime, moral, criminals, executed |
    | 3 | gun, guns, firearms, crime, handgun, firearm, weapons, handguns, law, criminals |
    | 4 | jews, israeli, israel, god, jewish, christians, sin, christian, palestinians, christianity |

=== "`'Evidence for the existence of god'`"

    | Topic ID | Highest Ranking |
    | - | - |
    | 0 | atheist, atheists, religion, religious, theists, beliefs, christianity, christian, religions, agnostic |
    | 1 | bible, christians, christian, christianity, church, scripture, religion, jesus, faith, biblical |
    | 2 | god, existence, exist, exists, universe, creation, argument, creator, believe, life |
    | 3 | believe, faith, belief, evidence, blindly, believing, gods, believed, beliefs, convince |
    | 4 | atheism, atheists, agnosticism, belief, arguments, believe, existence, alt, believing, argument |

=== "`'Operating system kernels'`"

    | Topic ID | Highest Ranking |
    | - | - |
    | 0 | windows, dos, os, microsoft, ms, apps, pc, nt, file, shareware |
    | 1 | ram, motherboard, card, monitor, memory, cpu, vga, mhz, bios, intel |
    | 2 | unix, os, linux, intel, systems, programming, applications, compiler, software, platform |
    | 3 | disk, scsi, disks, drive, floppy, drives, dos, controller, cd, boot |
    | 4 | software, mac, hardware, ibm, graphics, apple, computer, pc, modem, program |

## Dynamic Topic Modeling

KeyNMF is also capable of modeling topics over time.
This happens by fitting a KeyNMF model first on the entire corpus, then
@@ -229,7 +258,48 @@ model.plot_topics_over_time()

<figcaption> Topics over time in a Dynamic KeyNMF model. </figcaption>
</figure>

## Hierarchical Topic Modeling

When you suspect that subtopics might be present in the topics you find with the model, KeyNMF can be used to discover topics further down the hierarchy.

This is done by utilising a special case of **weighted NMF**, where documents are weighted by how highly they score on the parent topic.

??? info "Click to see formula"
    1. Decompose the keyword matrix: $M \approx WH$
    2. To find subtopics in topic $j$, define document weights $w$ as the $j$th column of $W$.
    3. Estimate subcomponents with **wNMF** $M \approx \mathring{W} \mathring{H}$ with document weights $w$:
        1. Initialise $\mathring{H}$ and $\mathring{W}$ randomly.
        2. Perform multiplicative updates until convergence. <br>
        $\mathring{W}^T = \mathring{W}^T \odot \frac{\mathring{H} \cdot (M^T \odot w)}{\mathring{H} \cdot \mathring{H}^T \cdot (\mathring{W}^T \odot w)}$ <br>
        $\mathring{H}^T = \mathring{H}^T \odot \frac{(M^T \odot w) \cdot \mathring{W}}{\mathring{H}^T \cdot (\mathring{W}^T \odot w) \cdot \mathring{W}}$
    4. To sufficiently differentiate the subcomponents from each other, a pseudo-c-tf-idf weighting scheme is applied to $\mathring{H}$:
        1. $\mathring{H}_{ij} = \mathring{H}_{ij} \cdot \ln\left(1 + \frac{A}{1 + \sum_k \mathring{H}_{kj}}\right)$, where $A$ is the average of all elements in $\mathring{H}$

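The multiplicative updates above can be sketched directly in numpy. This illustrates the formulas only; the random initialisation, fixed iteration count, and the small `eps` for numerical stability are my additions, not Turftopic's actual implementation:

```python
import numpy as np

def weighted_nmf(M, w, n_components=3, n_iter=200, eps=1e-9, seed=0):
    # Illustrative sketch of the weighted-NMF updates, not Turftopic's code.
    # Rows of M are documents; w holds one non-negative weight per document.
    rng = np.random.default_rng(seed)
    n_docs, n_terms = M.shape
    W = rng.random((n_docs, n_components))
    H = rng.random((n_components, n_terms))
    Mw = M * w[:, None]  # weight each document's row of the keyword matrix
    for _ in range(n_iter):
        # Multiplicative updates keep W and H non-negative throughout.
        W *= (Mw @ H.T) / ((W * w[:, None]) @ H @ H.T + eps)
        H *= (W.T @ Mw) / (W.T @ (W * w[:, None]) @ H + eps)
    return W, H
```

These are the transposed forms of the two update rules in the formula box: weighting each document's row of $M$ and of $\mathring{W}$ by $w$ before the usual NMF updates.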
To create a hierarchical model, you can use the `hierarchy` property of the model.

```python
# This divides each of the topics in the model into 3 subtopics.
model.hierarchy.divide_children(n_subtopics=3)
print(model.hierarchy)
```

<div style="background-color: #F5F5F5; padding: 10px; padding-left: 20px; padding-right: 20px;">
<tt style="font-size: 11pt">
<b>Root </b><br>
├── <b style="color: blue">0</b>: windows, dos, os, disk, card, drivers, file, pc, files, microsoft <br>
│ ├── <b style="color: magenta">0.0</b>: dos, file, disk, files, program, windows, disks, shareware, norton, memory <br>
│ ├── <b style="color: magenta">0.1</b>: os, unix, windows, microsoft, apps, nt, ibm, ms, os2, platform <br>
│ └── <b style="color: magenta">0.2</b>: card, drivers, monitor, driver, vga, ram, motherboard, cards, graphics, ati <br>
└── <b style="color: blue">1</b>: atheism, atheist, atheists, religion, christians, religious, belief, christian, god, beliefs <br>
. ├── <b style="color: magenta">1.0</b>: atheism, alt, newsgroup, reading, faq, islam, questions, read, newsgroups, readers <br>
. ├── <b style="color: magenta">1.1</b>: atheists, atheist, belief, theists, beliefs, religious, religion, agnostic, gods, religions <br>
. └── <b style="color: magenta">1.2</b>: morality, bible, christian, christians, moral, christianity, biblical, immoral, god, religion <br>
</tt>
</div>

For a detailed tutorial on hierarchical modeling click [here](hierarchical.md).

## Online Topic Modeling

KeyNMF can also be fitted in an online manner.
This is done by fitting NMF with batches of data instead of the whole dataset at once.
@@ -326,7 +396,7 @@

```python
for epoch in range(5):
    model.partial_fit(keywords=keyword_batch)
```

### Dynamic Online Topic Modeling

KeyNMF can be fitted online in a dynamic manner as well.
This is useful when you have large corpora of text over time, or when you want to fit the model on future information flowing in
@@ -354,46 +424,49 @@

```python
for batch in batched(zip(corpus, timestamps)):
    model.partial_fit_dynamic(text_batch, timestamps=ts_batch, bins=bins)
```

## Asymmetric and Instruction-tuned Embedding Models

Some embedding models can be used together with prompting, or encode queries and passages differently.
This matters for KeyNMF, as it is explicitly based on keyword retrieval, and its performance can be substantially enhanced by using asymmetric or prompted embeddings.
Microsoft's E5 models, for instance, are all prompted by default, and it would be detrimental to performance not to use these prompts yourself.

In these cases, you're better off NOT passing a string to Turftopic models, but explicitly loading the model using `sentence-transformers`.

Here's an example of using instruct models for keyword retrieval with KeyNMF.
In this case, documents will serve as the queries and words as the passages:

```python
from turftopic import KeyNMF
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer(
    "intfloat/multilingual-e5-large-instruct",
    prompts={
        "query": "Instruct: Retrieve relevant keywords from the given document. Query: ",
        "passage": "Passage: ",
    },
    # Make sure to set the default prompt to query!
    default_prompt_name="query",
)
model = KeyNMF(10, encoder=encoder)
```

And a regular, asymmetric example:

```python
encoder = SentenceTransformer(
    "intfloat/e5-large-v2",
    prompts={
        "query": "query: ",
        "passage": "passage: ",
    },
    # Make sure to set the default prompt to query!
    default_prompt_name="query",
)
model = KeyNMF(10, encoder=encoder)
```

Setting the default prompt to `query` is especially important when you are precomputing embeddings, as `query` should always be the default prompt used to embed documents.

## API Reference