1. The __vectorizer__ is used to turn documents into Bag-of-Words representations and to learn the vocabulary. The default used in the package is sklearn's `CountVectorizer`.
1. The __encoder__ is used to encode documents, and optionally the vocabulary, into contextual representations. This will most frequently be a Sentence Transformer. The default in Turftopic is `all-MiniLM-L6-v2`, a very lightweight English model.
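To make the vectorizer's role concrete, here is a hand-rolled toy sketch of what Bag-of-Words vectorization does. This is only an illustration of the concept, not code from the package; in practice the model uses a configured `CountVectorizer` for this step:

```python
def fit_bow(corpus: list[str]) -> tuple[dict[str, int], list[list[int]]]:
    """Learn a vocabulary and turn each document into a Bag-of-Words count vector."""
    vocabulary: dict[str, int] = {}
    for doc in corpus:
        for token in doc.lower().split():
            vocabulary.setdefault(token, len(vocabulary))
    matrix = []
    for doc in corpus:
        counts = [0] * len(vocabulary)
        for token in doc.lower().split():
            counts[vocabulary[token]] += 1
        matrix.append(counts)
    return vocabulary, matrix

vocab, X = fit_bow(["a cat sat", "a cat and a dog"])
```

A real vectorizer additionally handles tokenization rules, frequency thresholds (`min_df`, `max_df`), stop words, and so on — which is why sklearn's implementation is used rather than anything like this sketch.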
You can use any of the built-in encoders in Turftopic to encode your documents, or any sentence transformer from the HuggingFace Hub.
This allows you to use embeddings of different quality and computational efficiency for different purposes.
Here's a model that uses E5 large as the embedding model, and only learns words that occur in at least 20 documents:

```python
from sklearn.feature_extraction.text import CountVectorizer
from turftopic import ClusteringTopicModel

model = ClusteringTopicModel(
    encoder="intfloat/e5-large-v2",
    vectorizer=CountVectorizer(min_df=20),
    n_reduce_to=10,
    feature_importance="centroid",
)
```
Some embedding models can be used together with prompting, or encode queries and passages differently.
This can significantly influence performance, especially in the case of models that are based on retrieval ([KeyNMF](KeyNMF.md)) or clustering ([ClusteringTopicModel](clustering.md)).
Microsoft's E5 models, for instance, are all prompted by default, and it would be detrimental to performance not to use these prompts yourself.
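Concretely, prompting here just means that a short prefix is prepended to each text before encoding; E5 models, for example, expect `"query: "` and `"passage: "` prefixes. A toy illustration of the mechanics (in practice `sentence-transformers` handles this via its `prompts` argument, as shown in the examples below):

```python
# Prompts are plain string prefixes prepended to the text before encoding
prompts = {"query": "query: ", "passage": "passage: "}

def apply_prompt(text: str, prompt_name: str) -> str:
    return prompts[prompt_name] + text

prompted = apply_prompt("how do topic models work?", "query")
# prompted == "query: how do topic models work?"
```

Because the model was trained with these prefixes, omitting them puts your texts out of distribution, which is why the performance penalty can be substantial.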
Since all models in Turftopic rely on contextual embeddings, you will need to specify a contextual embedding model to use.
The default is [`all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2), which is a very fast and reasonably performant embedding model for English.
You might, however, want to use custom embeddings, either because your corpus is not in English, or because you need higher speed or performance.
See a detailed guide on Encoders [here](../encoders.md).

Similar to a vectorizer, you can add an encoder to a topic model upon initializing it.
With prompted models, however, you're better off NOT passing a string to Turftopic models, but explicitly loading the model using `sentence-transformers`.
Here's an example for clustering models:
```python
from turftopic import ClusteringTopicModel
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer(
    "intfloat/multilingual-e5-large-instruct",
    prompts={
        "query": "Instruct: Cluster documents according to the topic they are about. Query: "
    },
    default_prompt_name="query",
)
model = ClusteringTopicModel(encoder=encoder)
```
When using KeyNMF with E5, make sure to specify the prompts even if you're not using instruct models:
```python
from turftopic import KeyNMF
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer(
    "intfloat/e5-large-v2",
    prompts={
        "query": "query: ",
        "passage": "passage: ",
    },
    # Make sure to set default prompt to query!
    default_prompt_name="query",
)
model = KeyNMF(10, encoder=encoder)
```

Setting the default prompt to `query` is especially important when you are precomputing embeddings, as `query` should always be your default prompt to embed documents with.

## Training and Inference

### Model Training

All models in Turftopic have a `fit()` method that takes a textual corpus in the form of an iterable of strings.
Beware that the iterable has to be reusable, as models have to do multiple passes over the corpus.

```python
corpus: list[str] = ["this is a document", "this is yet another document", ...]

model.fit(corpus)
```
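The reusability requirement matters in practice: a plain generator is exhausted after a single pass, so a model's second pass over it would see nothing. A quick illustration, along with a minimal reusable wrapper:

```python
def stream_docs():
    for text in ["first document", "second document"]:
        yield text

gen = stream_docs()
first_pass = list(gen)
second_pass = list(gen)  # the generator is exhausted, so this pass is empty

# A reusable alternative: an object whose __iter__ returns a fresh generator
class Corpus:
    def __iter__(self):
        return stream_docs()

corpus = Corpus()
passes = [list(corpus), list(corpus)]  # both passes see all documents
```

Lists and objects like `Corpus` above are safe to pass to `fit()`; raw generators are not.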
### Precomputing Embeddings
In order to cut down on costs/computational load when fitting multiple models in a row, you might want to encode the documents before fitting a model.
Encoding the corpus is the heaviest part of the process and you can spare yourself a lot of time by only doing it once.
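The pattern is sketched below with a stand-in encoder so the example stays self-contained; in real use you would load a `SentenceTransformer` instead, and the `embeddings` keyword of `fit()` in the trailing comment is an assumption to verify against your Turftopic version:

```python
# Stand-in with the same .encode() interface as a SentenceTransformer;
# swap in SentenceTransformer("all-MiniLM-L6-v2") for real use.
class ToyEncoder:
    def encode(self, docs: list[str]) -> list[list[float]]:
        # deterministic toy "embedding": [character count, token count]
        return [[float(len(d)), float(len(d.split()))] for d in docs]

corpus: list[str] = ["this is a document", "this is yet another document"]

encoder = ToyEncoder()
embeddings = encoder.encode(corpus)  # encode the corpus once

# Every model fitted afterwards can reuse the cached matrix, e.g.:
#   KeyNMF(10, encoder=encoder).fit(corpus, embeddings=embeddings)
#   ClusteringTopicModel(encoder=encoder).fit(corpus, embeddings=embeddings)
```

Since encoding dominates the runtime, this turns every model fit after the first into a comparatively cheap operation.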
Some models in Turftopic are capable of estimating topic importance scores for documents in your corpus.
In order to get the importance of each topic for the documents in the corpus, you might want to use `fit_transform()` instead of `fit()`.
!!! warning
    Note that using `fit()` and `transform()` in succession is not the same as using `fit_transform()`, and the latter should be preferred under all circumstances.
    For one, not all models have a `transform()` method; `fit_transform()` is also far more efficient, as documents don't have to be encoded twice.
    Some models apply additional optimizations when using `fit_transform()`, and `fit()` typically uses `fit_transform()` in the background.