Commit 6ecdd44

Refining docs + added finetuning and modifying models page
1 parent 6c0f251 commit 6ecdd44

5 files changed

Lines changed: 213 additions & 10 deletions

docs/GMM.md

Lines changed: 2 additions & 2 deletions
Original file line number | Diff line number | Diff line change
@@ -1,4 +1,4 @@
1-
# GMM
1+
# GMM (Gaussian Mixture Model)
22

33
GMM is a generative probabilistic model over the contextual embeddings.
44
The model assumes that contextual embeddings are generated from a mixture of underlying Gaussian components.
@@ -33,7 +33,7 @@ A document-topic-matrix ($T$) is built from the likelihoods of each component gi
3333
??? info "Click to see formula"
3434
- For document $i$ and topic $z$ the matrix entry will be: $T_{iz} = p(\rho_i|\mu_z, \Sigma_z)$
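As a rough illustration of the formula above, the matrix $T$ can be computed by evaluating each component's density at each document embedding. This is a minimal sketch, not the library's implementation: it assumes diagonal covariances for simplicity, and the function names `diag_gaussian_pdf` and `document_topic_matrix` are hypothetical.

```python
import math


def diag_gaussian_pdf(x, mean, var):
    """Density of a Gaussian with diagonal covariance, evaluated at x.

    Assumption: Sigma_z is diagonal, so the density factorises over dimensions.
    """
    log_p = 0.0
    for x_d, m_d, v_d in zip(x, mean, var):
        log_p += -0.5 * (math.log(2 * math.pi * v_d) + (x_d - m_d) ** 2 / v_d)
    return math.exp(log_p)


def document_topic_matrix(embeddings, means, variances):
    """T[i][z] = p(rho_i | mu_z, Sigma_z) for every document i and component z."""
    return [
        [diag_gaussian_pdf(rho, mu, var) for mu, var in zip(means, variances)]
        for rho in embeddings
    ]


# Toy 2-D embeddings and two hypothetical components
embeddings = [[0.0, 0.0], [3.0, 3.0]]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]

T = document_topic_matrix(embeddings, means, variances)
```

Each document gets its highest value in the column of the component whose mean lies closest to it.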
3535

36-
### 3. Soft c-TF-IDF
36+
### Soft c-TF-IDF
3737

3838
Term importances for the discovered Gaussian components are estimated post-hoc using a technique called __Soft c-TF-IDF__,
3939
an extension of __c-TF-IDF__, that can be used with continuous labels.

docs/finetuning.md

Lines changed: 195 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,195 @@
1+
# Modifying and finetuning models
2+
3+
Some models in Turftopic can be flexibly modified after being fitted.
4+
This allows users to fit pretrained topic models to their specific use cases.
5+
6+
## Naming/renaming topics
7+
8+
Topics can be freely renamed in all topic models.
9+
This can be beneficial when interpreting models, as it allows you to assign labels to the topics you've already looked at.
10+
11+
```python
12+
from turftopic import SemanticSignalSeparation
13+
14+
model = SemanticSignalSeparation(10).fit(corpus)
15+
16+
# you can specify a dict mapping IDs to names
17+
model.rename_topics({0: "New name for topic 0", 5: "New name for topic 5"})
18+
# or a list of topic names
19+
model.rename_topics([f"Topic {i}" for i in range(10)])
20+
```
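To make the two accepted input shapes concrete, here is a minimal sketch of how a dict or a list of names could be applied to an existing list of topic names. The helper `apply_renames` is hypothetical and only mirrors the behaviour described above; it is not Turftopic's implementation.

```python
from typing import Union


def apply_renames(
    topic_names: list[str], names: Union[dict[int, str], list[str]]
) -> list[str]:
    """Return a new list of topic names (hypothetical helper).

    A dict renames only the listed topic IDs; a list replaces every name.
    """
    if isinstance(names, dict):
        renamed = list(topic_names)
        for topic_id, new_name in names.items():
            renamed[topic_id] = new_name
        return renamed
    if len(names) != len(topic_names):
        raise ValueError("A list of names must contain one name per topic.")
    return list(names)


old_names = ["0_a_b", "1_c_d", "2_e_f"]
partially_renamed = apply_renames(old_names, {1: "Sports"})
fully_renamed = apply_renames(old_names, [f"Topic {i}" for i in range(3)])
```

The dict form is convenient when you have only inspected a few topics, since untouched topics keep their existing names.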
21+
22+
## Changing the number of topics
23+
24+
Several models allow you to change the number of topics after fitting.
25+
26+
### Refitting $S^3$ with a different number of topics
27+
28+
$S^3$ models store all information that is needed to refit them using a different number of topics, iterations or random seed.
29+
This process is fast and lets you explore the semantics of a corpus at multiple levels of detail.
30+
Moreover, any model you load from a third party can be refitted at will.
31+
32+
```python
33+
from turftopic import load_model
34+
35+
model = load_model("hf_user/some_s3_model")
36+
37+
print(type(model))
38+
# turftopic.models.decomp.SemanticSignalSeparation
39+
40+
print(len(model.topic_names))
41+
# 10
42+
43+
model.refit(n_components=20, random_seed=42)
44+
print(len(model.topic_names))
45+
# 20
46+
```
47+
48+
### Merging topics in clustering models
49+
50+
Clustering models are very flexible in this regard, as they allow you to merge clusters after the model has been fitted.
51+
52+
#### Manual topic merging
53+
54+
You can merge topics manually in a clustering model by using the `join_topics()` method:
55+
56+
```python
57+
from turftopic import ClusteringTopicModel
58+
59+
model = ClusteringTopicModel().fit(corpus)
60+
61+
# This will join topics 0, 5 and 4 into topic 0
62+
model.join_topics([0, 5, 4])
63+
```
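Conceptually, joining topics amounts to relabelling documents. The sketch below shows that idea on a plain list of cluster labels. The helper `join_labels` is hypothetical, and keeping the first listed ID as the target is an assumption chosen to match the example above, where the result is topic 0.

```python
def join_labels(labels, to_join):
    """Relabel every document in `to_join` to the first ID in that list.

    Assumption: the first listed topic ID is kept as the merged topic.
    """
    target = to_join[0]
    members = set(to_join)
    return [target if label in members else label for label in labels]


# One cluster label per document; -1 marks outliers in this toy example
labels = [0, 5, 4, 2, 5, -1]
joined = join_labels(labels, [0, 5, 4])
```

Documents outside the joined topics, including outliers, keep their original labels.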
64+
65+
#### Hierarchical merging
66+
67+
You can also merge clusters automatically into a desired number of topics.
68+
This can be done with the `reduce_topics()` method:
69+
70+
!!! info
71+
For more info on topic merging methods, check out [this page](clustering.md)
72+
73+
```python
74+
model = ClusteringTopicModel().fit(corpus)
75+
model.reduce_topics(n_reduce_to=20, reduction_method="smallest")
76+
```
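The following is one plausible reading of what a `"smallest"`-style reduction does: repeatedly merge the smallest cluster into the cluster with the most similar centroid until the target count is reached. This is a sketch under that assumption, not the library's code. `reduce_smallest` and its helpers are hypothetical, and outlier handling is left out.

```python
import math
from collections import Counter


def centroid(vectors):
    """Mean vector of a non-empty list of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def reduce_smallest(embeddings, labels, n_reduce_to):
    """Merge the smallest cluster into its nearest neighbour until
    only `n_reduce_to` clusters remain (assumed behaviour)."""
    if n_reduce_to < 1:
        raise ValueError("n_reduce_to must be at least 1.")
    labels = list(labels)
    while len(set(labels)) > n_reduce_to:
        sizes = Counter(labels)
        # Smallest cluster first; ties are broken by the lowest label
        smallest = min(sizes, key=lambda lab: (sizes[lab], lab))
        centroids = {
            lab: centroid([e for e, l in zip(embeddings, labels) if l == lab])
            for lab in sizes
        }
        others = [lab for lab in sizes if lab != smallest]
        nearest = max(
            others, key=lambda lab: cosine(centroids[smallest], centroids[lab])
        )
        labels = [nearest if l == smallest else l for l in labels]
    return labels


embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.8, 0.3]]
labels = [0, 0, 1, 1, 2]
reduced = reduce_smallest(embeddings, labels, n_reduce_to=2)
```

In this toy data the single-document cluster 2 points in roughly the same direction as cluster 0, so it is absorbed by cluster 0.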
77+
78+
## Finetuning models on a new corpus
79+
80+
Currently, only KeyNMF can be finetuned on a new corpus.
81+
You can do this by using the `partial_fit()` method on texts the model hasn't seen before:
82+
83+
```python
84+
from turftopic import load_model
85+
86+
model = load_model("pretrained_keynmf_model")
87+
88+
print(type(model))
89+
# turftopic.models.keynmf.KeyNMF
90+
91+
new_corpus: list[str] = [...]
92+
# Finetune the model to the new corpus
93+
model.partial_fit(new_corpus)
94+
95+
model.to_disk("finetuned_model/")
96+
```
97+
98+
99+
## Re-estimating word importance
100+
101+
Both $S^3$ and clustering models come with multiple ways of estimating the importance of words for topics.
102+
Since both of these models use post-hoc measures, these scores can be calculated without fitting a new model or refitting an old one.
103+
This allows you to try different feature importance measures on the same model (same underlying clusters or axes).
104+
105+
Here's an example with $S^3$:
106+
```python
107+
from turftopic import SemanticSignalSeparation
108+
109+
model = SemanticSignalSeparation(5, feature_importance="combined").fit(corpus)
110+
model.print_topics()
111+
┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
112+
┃ Topic ID ┃ Highest Ranking ┃ Lowest Ranking ┃
113+
┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
114+
0 │ hypocrisy, hypocritical, fallacy, debated, skeptics │ xfree86, emulator, codes, 9600, cd300 │
115+
├──────────┼──────────────────────────────────────────────────────────────┼───────────────────────────────────────────────────────┤
116+
1 │ spectrometer, dblspace, statistically, nutritional, makefile │ uh, um, yeah, hm, oh │
117+
├──────────┼──────────────────────────────────────────────────────────────┼───────────────────────────────────────────────────────┤
118+
2 │ bullpen, goaltenders, pitchers, goaltender, pitching │ intel, nsa, spying, encrypt, terrorism │
119+
├──────────┼──────────────────────────────────────────────────────────────┼───────────────────────────────────────────────────────┤
120+
3 │ espionage, wiretapping, cia, fbi, wiretaps │ agnosticism, agnostic, upgrading, affordable, cheaper │
121+
├──────────┼──────────────────────────────────────────────────────────────┼───────────────────────────────────────────────────────┤
122+
4 │ affordable, dealers, warrants, handguns, dealership │ semitic, theologians, judaism, persecuted, pagan │
123+
└──────────┴──────────────────────────────────────────────────────────────┴───────────────────────────────────────────────────────┘
124+
125+
126+
model.estimate_components("angular")
127+
model.print_topics()
128+
┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
129+
┃ Topic ID ┃ Highest Ranking ┃ Lowest Ranking ┃
130+
┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
131+
0 │ hypocritical, debated, hypotheses, misconceptions, fallacy │ diagnostics, win31, modems, cd300, gd3004 │
132+
├──────────┼──────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────┤
133+
1 │ spectrometer, dblspace, statistically, makefile, nutritional │ ye, sub, naked, experiences, uh │
134+
├──────────┼──────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────┤
135+
2 │ bullpen, puckett, hitters, clemens, jenks │ encryption, encrypt, intel, cryptosystem, cryptosystems │
136+
├──────────┼──────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────┤
137+
3 │ journalists, cdc, chlorine, npr, briefing │ values, ratios, upgrading, calculations, inherit │
138+
├──────────┼──────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────┤
139+
4 │ handguns, warrants, warranty, reliability, handgun │ nutritional, metabolism, deuteronomy, pathology, hormone │
140+
└──────────┴──────────────────────────────────────────────────────────────┴──────────────────────────────────────────────────────────┘
141+
142+
```
143+
144+
And one with clustering models:
145+
146+
!!! info
147+
Remember, these are the same underlying clusters, just described in two different ways. For further details, check out [this page](clustering.md)
148+
149+
```python
150+
from turftopic import ClusteringTopicModel
151+
152+
model = ClusteringTopicModel(n_reduce_to=5, feature_importance="soft-c-tf-idf").fit(corpus)
153+
model.print_topics()
154+
┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
155+
┃ Topic ID ┃ Highest Ranking ┃
156+
┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
157+
-1 │ like, just, don, use, does, know, time, good, people, edu │
158+
├──────────┼────────────────────────────────────────────────────────────────────────────┤
159+
0 │ people, said, god, president, mr, think, going, say, did, myers │
160+
├──────────┼────────────────────────────────────────────────────────────────────────────┤
161+
1 │ max, g9v, b8f, a86, pl, 00, 145, 1d9, dos, 34u │
162+
├──────────┼────────────────────────────────────────────────────────────────────────────┤
163+
2 │ msg, cancer, food, battery, water, candida, medical, vitamin, yeast, diet │
164+
├──────────┼────────────────────────────────────────────────────────────────────────────┤
165+
3 │ 25, 55, pit, det, pts, la, bos, 03, 10, 11 │
166+
├──────────┼────────────────────────────────────────────────────────────────────────────┤
167+
4 │ insurance, car, dog, radar, health, bike, helmet, private, detector, speed │
168+
└──────────┴────────────────────────────────────────────────────────────────────────────┘
169+
170+
171+
model.estimate_components(feature_importance="centroid")
172+
model.print_topics()
173+
174+
┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
175+
┃ Topic ID ┃ Highest Ranking ┃
176+
┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
177+
-1 │ documented, concerns, dubious, obsolete, concern, alternative, et4000, complaints, cx, discussed │
178+
├──────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────┤
179+
0 │ persecutions, persecution, condemning, condemnation, fundamentalists, persecuted, fundamentalism, │
180+
│ │ theology, advocating, fundamentalist │
181+
├──────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────┤
182+
1 │ xfree86, pcx, emulation, microsoft, hardware, emulator, x11r5, netware, workstations, chipset │
183+
├──────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────┤
184+
2 │ contamination, fungal, precautions, harmful, poisoning, chemicals, treatments, toxicity, dangers, │
185+
│ │ prevention │
186+
├──────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────┤
187+
3 │ nhl, bullpen, goaltenders, standings, sabres, canucks, braves, mlb, flyers, playoffs │
188+
├──────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────┤
189+
4 │ automotive, vehicle, vehicles, speeding, automobile, automobiles, driving, motorcycling, │
190+
│ │ motorcycles, highways │
191+
└──────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────┘
192+
193+
```
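The reason word importance can be re-estimated without refitting is that the cluster assignments stay fixed while only the scoring function changes. The sketch below illustrates this with two simplified scoring functions on toy data. Both are stand-ins rather than the library's `soft-c-tf-idf` or `centroid` implementations, and all names and numbers are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def count_importance(doc_term_counts, labels, topic):
    """Count-based stand-in: sum term counts over documents in `topic`."""
    n_terms = len(doc_term_counts[0])
    return [
        sum(row[t] for row, lab in zip(doc_term_counts, labels) if lab == topic)
        for t in range(n_terms)
    ]


def centroid_importance(doc_embeddings, word_embeddings, labels, topic):
    """Embedding-based stand-in: similarity of each word to the topic centroid."""
    members = [e for e, lab in zip(doc_embeddings, labels) if lab == topic]
    dim = len(members[0])
    center = [sum(e[d] for e in members) / len(members) for d in range(dim)]
    return [cosine(center, w) for w in word_embeddings]


def top_words(scores, vocab, k=2):
    """The k highest scoring vocabulary items."""
    ranked = sorted(range(len(vocab)), key=lambda t: scores[t], reverse=True)
    return [vocab[t] for t in ranked[:k]]


vocab = ["like", "hockey", "playoffs"]
labels = [0, 0]  # fixed cluster assignments, shared by both measures
doc_term_counts = [[3, 1, 0], [2, 1, 1]]
doc_embeddings = [[0.9, 0.1], [1.0, 0.0]]
word_embeddings = [[0.1, 1.0], [1.0, 0.0], [0.9, 0.2]]

by_counts = top_words(count_importance(doc_term_counts, labels, 0), vocab)
by_centroid = top_words(
    centroid_importance(doc_embeddings, word_embeddings, labels, 0), vocab
)
```

The same cluster is described by frequent but generic words under the count-based score and by semantically closer words under the centroid-based score, which mirrors the contrast between the two tables above.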
194+
195+

docs/online.md

Lines changed: 4 additions & 4 deletions
@@ -54,18 +54,18 @@ for epoch in range(5):
5454
You can pretrain a topic model on a large corpus and then finetune it on a novel corpus the model has not seen before.
5555
This will morph the model's topics to the corpus at hand.
5656

57-
In this example I will load a pretrained KeyNMF model from disk. (see [Persistance](persistance.md))
57+
In this example I will load a pretrained KeyNMF model from disk. (see [Model Loading and Saving](persistance.md))
5858

5959
```python
60-
import joblib
60+
from turftopic import load_model
6161

62-
model = joblib.load("pretrained_keynmf.joblib")
62+
model = load_model("pretrained_keynmf_model")
6363

6464
new_corpus: list[str] = [...]
6565
# Finetune the model to the new corpus
6666
model.partial_fit(new_corpus)
6767

68-
joblib.dump(model, "finetuned_keynmf.joblib")
68+
model.to_disk("finetuned_model/")
6969
```
7070

7171
## Precomputed Embeddings
