Commit b22ef1d

Restructured docs in a more sensible way
1 parent 52e78ef commit b22ef1d

1 file changed
Lines changed: 108 additions & 75 deletions
# Defining and Training Topic Models

In order to start modeling your corpora, you will need to define a topic model.
Turftopic offers a wide array of models, each with its own unique behaviour.
All models, however, share certain components and have attributes you can adjust to your needs.
This page provides a guide on how to define models, train them, and use them for inference.

<figure>
<img src="../images/topic_modeling_pipeline.png" width="800px" style="margin-left: auto;margin-right: auto;">
<figcaption>Components of a Topic Modeling Pipeline</figcaption>
</figure>
## Defining a Model

### 1. [Topic Model](../models.md)

In order to initialize a model, you will first need to choose which **topic model** you'd like to use.
Have a look at the [Models](../models.md) page to make an informed choice about the topic model you intend to train.

Here are some examples of models you can load and use in the package:
<table>
<tr>
<td> Model </td> <td> Example Definition </td>
</tr>
<tr>
<td>

<a href="https://x-tabdeveloping.github.io/turftopic/KeyNMF/">KeyNMF</a>

</td>
<td>

```python
from turftopic import KeyNMF

model = KeyNMF(n_components=10, top_n=15)
```

</td>
</tr>
<tr>
<td>

<a href="https://x-tabdeveloping.github.io/turftopic/clustering/">ClusteringTopicModel</a>

</td>
<td>

```python
from turftopic import ClusteringTopicModel

model = ClusteringTopicModel(n_reduce_to=10, feature_importance="centroid")
```

</td>
</tr>
<tr>
<td>

<a href="https://x-tabdeveloping.github.io/turftopic/s3/">SemanticSignalSeparation</a>

</td>
<td>

```python
from turftopic import SemanticSignalSeparation

model = SemanticSignalSeparation(n_components=10, feature_importance="combined")
```

</td>
</tr>
</table>
### 2. [Vectorizer](../vectorizers.md)

In Turftopic, all models have a vectorizer component, which is responsible for extracting word content from the documents in the corpus.
This means that the vectorizer also determines which words become part of the model's vocabulary.
For a more detailed explanation, see the [Vectorizers](../vectorizers.md) page.

The default is scikit-learn's CountVectorizer:

```python
from sklearn.feature_extraction.text import CountVectorizer

default_vectorizer = CountVectorizer(min_df=10, stop_words="english")
```
You can pass a custom vectorizer to a topic model upon initializing it, thereby getting different behaviours.
You can, for instance, use noun phrases instead of words in your model by using NounPhraseCountVectorizer:

```bash
pip install turftopic[spacy]
python -m spacy download "en_core_web_sm"
```

```python
from turftopic import KeyNMF
from turftopic.vectorizers.spacy import NounPhraseCountVectorizer

model = KeyNMF(10, vectorizer=NounPhraseCountVectorizer())
```
### 3. [Encoder](../encoders.md)

Since all models in Turftopic rely on contextual embeddings, you will need to specify a contextual embedding model to use.
The default is [`all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2), a very fast and reasonably performant embedding model for English.
You might, however, want to use custom embeddings, either because your corpus is not in English, or because you need higher speed or performance.
See a detailed guide on encoders [here](../encoders.md).

Similar to a vectorizer, you can pass an encoder to a topic model upon initializing it.

```python
from turftopic import KeyNMF
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
model = KeyNMF(10, encoder=encoder)
```
### 4. [Namer](../namers.md) (*optional*)

A namer is an optional part of your topic modeling pipeline that can automatically assign human-readable names to topics.
Namers are technically **not part of your topic model** and should be used *after training*.
See a detailed guide [here](../namers.md).

```python
from turftopic import KeyNMF
from turftopic.namers import LLMTopicNamer

model = KeyNMF(10).fit(corpus)
namer = LLMTopicNamer("HuggingFaceTB/SmolLM2-1.7B-Instruct")

model.rename_topics(namer)
```
## Training and Inference

### Model Training

All models in Turftopic have a `fit()` method that takes a textual corpus in the form of an iterable of strings.

Beware that the iterable has to be reusable, as models have to do multiple passes over the corpus.

```python
corpus: list[str] = ["this is a document", "this is yet another document", ...]

model.fit(corpus)
```
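Because multiple passes are needed, a plain generator will not do: it is exhausted after one pass. One way to stream a large corpus from disk is a small reusable iterable; this is a sketch, assuming a hypothetical one-document-per-line text file:

```python
class CorpusReader:
    """A reusable iterable over a one-document-per-line text file.

    Unlike a generator, every call to __iter__ re-opens the file,
    so a model can iterate over the corpus multiple times.
    """

    def __init__(self, path: str):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as file:
            for line in file:
                yield line.strip()
```

You could then call `model.fit(CorpusReader("corpus.txt"))` (the filename is illustrative) without loading all documents into memory at once.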
### Precomputing Embeddings

In order to cut down on costs and computational load when fitting multiple models in a row, you might want to encode the documents before fitting a model.
Encoding the corpus is the heaviest part of the process, and you can spare yourself a lot of time by only doing it once.
```python
gmm = GMM(10, encoder=encoder).fit(corpus, embeddings=embeddings)
clustering = ClusteringTopicModel(encoder=encoder).fit(corpus, embeddings=embeddings)
```
### Inference

Some models in Turftopic are capable of estimating topic importance scores for documents in your corpus.
In order to get the importance of each topic for the documents in the corpus, use `fit_transform()` instead of `fit()`:

!!! warning
    Note that using `fit()` and `transform()` in succession is not the same as using `fit_transform()`, and the latter should be preferred under all circumstances.
    For one, not all models have a `transform()` method, but `fit_transform()` is also far more efficient, as documents don't have to be encoded twice.
    Some models run additional optimizations when using `fit_transform()`, and the `fit()` method typically uses `fit_transform()` in the background.
```python
document_topic_matrix = model.fit_transform(corpus)
```
You can infer topical content for new documents with a fitted model using the `transform()` method:

```python
document_topic_matrix = model.transform(new_documents, embeddings=None)
```