docs/clustering.md
Lines changed: 11 additions & 7 deletions
@@ -23,19 +23,19 @@ model = ClusteringTopicModel(dimensionality_reduction=TSNE())
It is common practice to reduce the dimensionality of the embeddings before clustering them.
This is to avoid the curse of dimensionality, an issue, which many clustering models are affected by.
-Dimensionality reduction by default is done with **TSNE** in Turftopic,
+Dimensionality reduction by default is done with [**TSNE**](https://scikit-learn.org/stable/modules/manifold.html#t-distributed-stochastic-neighbor-embedding-t-sne) in Turftopic,
but users are free to specify the model that will be used for dimensionality reduction.
!!! tip "Use openTSNE for better performance!"
-By default, a scikit-learn implementation is used, but if you have the openTSNE package installed on your system, Turftopic will automatically use it.
+By default, a scikit-learn implementation is used, but if you have the [openTSNE](https://github.com/pavlin-policar/openTSNE) package installed on your system, Turftopic will automatically use it.
You can potentially speed up your clustering topic models by multiple orders of magnitude.
```bash
-pip install opentsne
+pip install turftopic[opentsne]
```
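As a rough illustration of the fallback described in the tip, the sketch below checks which TSNE implementation could be imported in the current environment. It uses only the standard library. The helper name `available_tsne_backend` is hypothetical, and this is not necessarily how Turftopic performs the check internally.

```python
import importlib.util


def available_tsne_backend() -> str:
    """Report which TSNE implementation is importable (hypothetical helper)."""
    # The openTSNE package is imported as `openTSNE`; prefer it when present.
    if importlib.util.find_spec("openTSNE") is not None:
        return "openTSNE"
    # Otherwise fall back to scikit-learn, which is imported as `sklearn`.
    if importlib.util.find_spec("sklearn") is not None:
        return "scikit-learn"
    return "none"


print(available_tsne_backend())
```

Running this before and after installing the extra is a quick way to confirm that the faster backend is visible to Python.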
??? note "What reduction model should I choose?"
Our knowledge about the impacts of choice of dimensionality reduction is limited, and has not yet been explored in the literature.
-Top2Vec and BERTopic both use UMAP, which has a number of desirable properties over alternatives (arranging data points into cluster-like structures, better preservation of global structure than TSNE, speed).
+Top2Vec and BERTopic both use [UMAP](https://umap-learn.readthedocs.io/en/latest/basic_usage.html), which has a number of desirable properties over alternatives (arranging data points into cluster-like structures, better preservation of global structure than TSNE, speed).
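If you would like to try UMAP instead of the default, a minimal sketch could look like the following. It assumes the `umap-learn` package is installed and that its `UMAP` estimator can be passed as `dimensionality_reduction` in the same way as `TSNE()`. The function name `build_umap_topic_model` and the `n_components=2` setting are illustrative choices, not recommendations from the Turftopic authors. Imports are deferred into the function so the snippet can be loaded even when the optional package is missing.

```python
def build_umap_topic_model(n_components: int = 2):
    """Build a clustering topic model that reduces embeddings with UMAP.

    Hypothetical helper: assumes `turftopic` and `umap-learn` are installed.
    """
    # Deferred imports: only needed when the model is actually built.
    from turftopic import ClusteringTopicModel
    from umap import UMAP

    # Swap the default reducer for UMAP; everything else stays at its default.
    return ClusteringTopicModel(
        dimensionality_reduction=UMAP(n_components=n_components)
    )
```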
### Clustering
@@ -47,7 +47,7 @@ model = ClusteringTopicModel(clustering=HDBSCAN())
```
After reducing the dimensionality of the embeddings, they are clustered with a clustering model.
-Turftopic uses **HDBSCAN** as its default.
+Turftopic uses [**HDBSCAN**](https://scikit-learn.org/stable/modules/clustering.html#hdbscan) as its default.
??? note "What clustering model should I choose?"
Some clustering models are capable of discovering the number of clusters in the data (HDBSCAN, DBSCAN, OPTICS, etc.).
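The practical consequence can be sketched as follows: a density-based model such as HDBSCAN infers the number of clusters from the data, whereas a model that needs the number up front must be given it explicitly. KMeans is used here purely as an illustration of the second kind; it is an assumption of this sketch rather than something named above. The helper name `build_topic_model` and its `n_topics` parameter are hypothetical, and the code assumes scikit-learn 1.3 or later, which ships `sklearn.cluster.HDBSCAN`.

```python
def build_topic_model(n_topics=None):
    """Pick a clustering model depending on whether the topic count is known.

    Hypothetical helper: assumes `turftopic` and scikit-learn >= 1.3.
    """
    # Deferred imports: only needed when the model is actually built.
    from sklearn.cluster import HDBSCAN, KMeans
    from turftopic import ClusteringTopicModel

    if n_topics is None:
        # HDBSCAN discovers the number of clusters on its own.
        clustering = HDBSCAN()
    else:
        # KMeans needs the number of clusters specified in advance.
        clustering = KMeans(n_clusters=n_topics)
    return ClusteringTopicModel(clustering=clustering)
```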
@@ -183,8 +183,12 @@ model.reset_topics()
### Visualization
-You can interactively explore clusters using `datamapplot` directly in Turftopic!
-You will first have to install `datamapplot` for this to work.
+You can interactively explore clusters using [datamapplot](https://github.com/TutteInstitute/datamapplot) directly in Turftopic!
+You will first have to install `datamapplot` for this to work: