Commit 3ed07d2

draft: started working on docs
1 parent 24eaa9e commit 3ed07d2

2 files changed

Lines changed: 32 additions & 3 deletions

docs/vectorizers.md

Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@

# Vectorizers

One of the most important choices you make when configuring a topic model is the vectorizer. It determines which terms word-importance scores will be calculated for.

By default, Turftopic uses sklearn's [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html), which naively counts word/n-gram occurrences in text. This usually works quite well, but your use case might call for a different or more sophisticated approach. This is why we provide a `vectorizers` module, where a wide range of useful options is available to you.

## Chinese text

Unlike most Indo-European languages, Chinese does not separate tokens with whitespace, so special tokenization rules are needed for Chinese text. Turftopic provides tools for Chinese tokenization via the [Jieba](https://github.com/fxsjy/jieba) package.

You will need to install the package in order to use our Chinese vectorizer:

```bash
pip install turftopic[jieba]
```

You can then use the `ChineseCountVectorizer` object, which comes preloaded with the jieba tokenizer and a Chinese stop word list.

```python
from turftopic import KeyNMF
from turftopic.vectorizers.chinese import ChineseCountVectorizer

vectorizer = ChineseCountVectorizer(min_df=10, stop_words="chinese")

model = KeyNMF(10, vectorizer=vectorizer)
```

mkdocs.yml

Lines changed: 1 addition & 3 deletions
```diff
@@ -10,9 +10,6 @@ nav:
   - Hierarchical Topic Modeling: hierarchical.md
   - Modifying and Finetuning Models: finetuning.md
   - Saving and Loading Models: persistence.md
-  - Tutorials:
-      - Topic Modeling in Chinese: chinese.md
-      - Keyphrase-based Topic Modeling: keyphrase.md
   - Models:
       - Model Overview: model_overview.md
       - Semantic Signal Separation (S³): s3.md
@@ -22,6 +19,7 @@ nav:
       - Autoencoding Models: ctm.md
       - FASTopic: FASTopic.md
   - Encoders: encoders.md
+  - Vectorizers: vectorizers.md
   - Namers: namers.md
 theme:
   name: material
```
