One of the most important attributes of a topic model that you will have to choose is the vectorizer.
It determines for which terms word-importance scores will be calculated.
By default, Turftopic uses sklearn's [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html),
which naively counts word/n-gram occurrences in text. This usually works quite well, but your use case might require you to use a different or more sophisticated approach.
This is why we provide a `vectorizers` module, where a wide range of useful options is available to you.
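For instance, you can pass a customized `CountVectorizer` to a model to also count bigrams or to drop rare terms. A minimal sketch, assuming models such as `KeyNMF` take the number of topics as their first argument and accept the vectorizer through a `vectorizer` keyword argument:

```python
from sklearn.feature_extraction.text import CountVectorizer
from turftopic import KeyNMF

# Count unigrams and bigrams, ignore terms occurring in fewer than 5 documents,
# and remove English stop words.
vectorizer = CountVectorizer(ngram_range=(1, 2), min_df=5, stop_words="english")

# Assumed call signature: number of topics first, vectorizer as a keyword argument.
model = KeyNMF(10, vectorizer=vectorizer)
```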
## Chinese text
The Chinese language does not separate tokens by whitespace, unlike most Indo-European languages.
You thus need to use special tokenization rules for Chinese.
Turftopic provides tools for Chinese tokenization via the [Jieba](https://github.com/fxsjy/jieba) package.
You will need to install this package to use our Chinese vectorizer.
```bash
pip install turftopic[jieba]
```
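To see what Jieba's segmentation looks like, here is a standalone illustration, independent of Turftopic; the example sentence and its segmentation are just for demonstration:

```python
import jieba

# Chinese text contains no spaces between words, so jieba segments it for us;
# naive whitespace splitting would treat the whole sentence as one token.
print(jieba.lcut("我们提供中文分词工具"))
# Roughly: ['我们', '提供', '中文', '分词', '工具']
```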
You can then use the `ChineseCountVectorizer` object, which comes preloaded with the jieba tokenizer along with a Chinese stop word list.
```python
from turftopic import KeyNMF
from turftopic.vectorizers.chinese import ChineseCountVectorizer
```
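
Continuing from the imports above, a minimal sketch of wiring the vectorizer into a model, assuming `KeyNMF` takes the number of topics as its first argument and accepts the vectorizer through a `vectorizer` keyword argument (check the signatures in your installed version):

```python
# Illustrative only: the constructor arguments below are assumptions, not quoted API.
vectorizer = ChineseCountVectorizer()  # comes preloaded with the jieba tokenizer and Chinese stop words
model = KeyNMF(10, vectorizer=vectorizer)
```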