Commit ae349c8

Added documentation on TopicData
1 parent d5d84d7 commit ae349c8

5 files changed: 150 additions & 13 deletions


docs/model_definition_and_training.md

Lines changed: 49 additions & 7 deletions
@@ -182,22 +182,64 @@ See a detailed guide [here](../namers.md).

## Training and Inference

### Model Training

All models in Turftopic follow the scikit-learn API for fitting topic models.
Every model in the library can be trained by passing a set of documents (a _corpus_) to the model.
The corpus has to be a reusable `Iterable`, as models will typically do multiple passes over it.

```python
corpus: list[str] = ["this is a document", "this is yet another document", ...]
```

!!! quote "Fit your topic model"
    === "`fit(raw_documents, embeddings=None)`"

        `fit()` fits the topic model and returns the same, now fitted, model object.
        You can optionally pass a set of precomputed embeddings for the documents.

        ```python
        model.fit(corpus)
        # or
        model.fit(corpus, embeddings=embeddings)
        ```

    === "`fit_transform(raw_documents, embeddings=None)`"

        `fit_transform()` not only trains the model but also returns the topic proportions of all documents in the corpus.

        ```python
        document_topic_matrix = model.fit_transform(corpus)
        # or
        document_topic_matrix = model.fit_transform(corpus, embeddings=embeddings)
        print(document_topic_matrix.shape)
        # prints (n_documents, n_topics)
        ```

    === "`prepare_topic_data(corpus, embeddings=None)`"

        `prepare_topic_data()` fits the model (if it has not been fitted already) and also saves other aspects of topic inference, which makes the resulting object easier to use for pretty printing and visualizing your models (see [Model Interpretation](model_interpretation.md)).

        ```python
        topic_data = model.prepare_topic_data(corpus)
        # print to see what attributes you can access
        print(topic_data)
        ```
        ```
        TopicData
        ├── corpus (1000)
        ├── vocab (1746,)
        ├── document_term_matrix (1000, 1746)
        ├── topic_term_matrix (10, 1746)
        ├── document_topic_matrix (1000, 10)
        ├── document_representation (1000, 384)
        ├── transform
        ├── topic_names (10)
        ├── has_negative_side
        └── hierarchy
        ```

        See [Using TopicData](topic_data.md) for more detail.

### Precomputing Embeddings

To cut down on costs and computational load when fitting multiple models in a row, you might want to encode the documents before fitting a model.
Encoding the corpus is the heaviest part of the process, and you can save yourself a lot of time by only doing it once.
Some models have to encode the vocabulary as well; this cannot be done before fitting, as these models learn the vocabulary itself from the corpus.

The `fit()` method of all models takes an `embeddings` argument, which allows you to pass a precomputed embedding matrix along to fitting.

docs/model_interpretation.md

Lines changed: 3 additions & 3 deletions
@@ -11,7 +11,7 @@ topic_data = model.prepare_topic_data(corpus)

## Topic Tables

The easiest way to investigate topics in your fitted model is to use the built-in pretty-printing utilities, which you can call on every fitted model or [`TopicData`](topic_data.md) object.

!!! quote "Interpret your models with topic tables"
    === "Relevant Words"

@@ -127,7 +127,7 @@ pip install topic-wizard

### Web App

The easiest way to investigate any topic model interactively is to use the topicwizard web app.
You can launch the app using either a [`TopicData`](topic_data.md) or a model object together with a representative sample of documents.

=== "With `TopicData`"

@@ -148,7 +148,7 @@ You can launch the app either using a `TopicData` or a model object and a repres

### Figures

You can also produce individual interactive figures using the [Figures API in topicwizard](https://x-tabdeveloping.github.io/topicwizard/figures.html).
Almost all figures in the Figures API can be called on the `figures` submodule of any [`TopicData`](topic_data.md) object.

!!! quote "Interpret your models using interactive figures"
    === "Topic Map"

docs/persistence.md

Lines changed: 27 additions & 1 deletion
@@ -1,4 +1,7 @@

# Saving and loading

## Model persistence

All models in Turftopic can be serialized and saved to disk, or published to the HuggingFace Hub.

### Saving locally

@@ -32,3 +35,26 @@ model = load_model("./local_directory/")

# or from hub
model = load_model("your_user/s3_20-newsgroups_10-topics")
```

## `TopicData` persistence

You can also save and load `TopicData` objects with Turftopic.
These are saved using `joblib`, so we recommend giving all `TopicData` files a `.joblib` extension:

!!! note "Note on compatibility"
    For backwards compatibility, `TopicData` objects are saved using `joblib` as simple `dict` objects.
    If you load a saved `TopicData` object with `joblib` directly, without using `from_disk()`, it will load as a `dict`.

=== "Save"

    ```python
    topic_data = model.prepare_topic_data(corpus)
    topic_data.to_disk("topic_data.joblib")
    ```

=== "Load"

    ```python
    from turftopic.data import TopicData

    topic_data = TopicData.from_disk("topic_data.joblib")
    ```
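
The compatibility behavior can be demonstrated with `joblib` directly (a minimal sketch; the dictionary keys shown here are illustrative stand-ins, not the full set Turftopic stores):

```python
import joblib

# TopicData is persisted as a plain dict, so loading with joblib alone
# yields a dict rather than a TopicData object.
saved = {"corpus": ["doc one", "doc two"], "topic_names": ["0_example"]}
joblib.dump(saved, "topic_data.joblib")

raw = joblib.load("topic_data.joblib")
print(type(raw).__name__)  # dict
# TopicData.from_disk("topic_data.joblib") restores the full object instead.
```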

docs/topic_data.md

Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@

# `TopicData`

While Turftopic provides a fully sklearn-compatible interface for training and using topic models, this is not always optimal, especially when you have to visualize models or save more information about inference than would be practical to keep in a model object.
We have therefore added an abstraction borrowed from [topicwizard](https://github.com/x-tabdeveloping/topicwizard) called `TopicData`.

## Producing `TopicData`

Every model has methods with which you can produce this object:

!!! quote "Prepare `TopicData` objects"
    === "`prepare_topic_data(corpus, embeddings=None)`"

        ```python
        topic_data = model.prepare_topic_data(corpus)
        # print to see what attributes are available
        print(topic_data)
        ```
        ```
        TopicData
        ├── corpus (1000)
        ├── vocab (1746,)
        ├── document_term_matrix (1000, 1746)
        ├── topic_term_matrix (10, 1746)
        ├── document_topic_matrix (1000, 10)
        ├── document_representation (1000, 384)
        ├── transform
        ├── topic_names (10)
        ├── has_negative_side
        └── hierarchy
        ```

    === "`prepare_dynamic_topic_data(corpus, timestamps, embeddings=None, bins=10)`"

        Models that support dynamic topic modeling also have this method, which includes dynamic topics in the resulting `TopicData` object.

        ```python
        import datetime

        timestamps: list[datetime.datetime] = [...]
        topic_data = model.prepare_dynamic_topic_data(corpus, timestamps=timestamps)
        ```

## Using `TopicData`

`TopicData` is a dict-like object and can, for all intents and purposes, be used as a Python dictionary, but for convenience you can also access its attributes with dot syntax:

```python
# These are equivalent
assert topic_data["document_term_matrix"].shape == topic_data.document_term_matrix.shape
```

Much like with models, you can pretty-print information about topic models based on a `TopicData` object, but since it contains more information on inference than the model object itself, you sometimes have to pass fewer parameters than if you called the same method on the model:

```python
model.print_representative_documents(0, corpus, document_topic_matrix)
# This is simpler with TopicData, since you only have to pass the topic ID
topic_data.print_representative_documents(0)
```

When producing figures, `TopicData` also gives you shorthands for accessing the topicwizard web app and Figures API:

```python
topic_data.figures.topic_map()
```

<center>
<iframe src="https://x-tabdeveloping.github.io/topicwizard/_static/plots/topic_map.html" width="1000px" height="450px" frameborder=0></iframe>
</center>

See our guide on [Model Interpretation](model_interpretation.md) for more info.

## API Reference

::: turftopic.data.TopicData

mkdocs.yml

Lines changed: 3 additions & 2 deletions
@@ -6,12 +6,13 @@ nav:
  - Getting Started: index.md
  - Defining and Fitting Topic Models: model_definition_and_training.md
  - Interpreting and Visualizing Models: model_interpretation.md
  - Seeded Topic Modeling: seeded.md
  - Dynamic Topic Modeling: dynamic.md
  - Online Topic Modeling: online.md
  - Hierarchical Topic Modeling: hierarchical.md
  - Modifying and Finetuning Models: finetuning.md
  - Saving and Loading: persistence.md
  - Using TopicData: topic_data.md
  - Models:
    - Model Overview: model_overview.md
    - Semantic Signal Separation (S³): s3.md
