Commit 3ed07d2

draft: started working on docs
1 parent 24eaa9e commit 3ed07d2

2 files changed

Lines changed: 32 additions & 3 deletions

docs/vectorizers.md

Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@

# Vectorizers

One of the most important choices you make when configuring a topic model is the vectorizer. It determines which terms word-importance scores will be calculated for.

By default, Turftopic uses sklearn's [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html), which naively counts word/n-gram occurrences in text. This usually works quite well, but your use case might call for a different or more sophisticated approach. This is why we provide a `vectorizers` module, where a wide range of useful options is available to you.

## Chinese text

Unlike most Indo-European languages, Chinese does not separate tokens with whitespace, so special tokenization rules are needed for Chinese text. Turftopic provides tools for Chinese tokenization via the [Jieba](https://github.com/fxsjy/jieba) package.

You will need to install the package in order to use our Chinese vectorizer:

```bash
pip install turftopic[jieba]
```

You can then use the `ChineseCountVectorizer` object, which comes preloaded with the jieba tokenizer and a Chinese stop word list.

```python
from turftopic import KeyNMF
from turftopic.vectorizers.chinese import ChineseCountVectorizer

vectorizer = ChineseCountVectorizer(min_df=10, stop_words="chinese")

model = KeyNMF(10, vectorizer=vectorizer)
```

mkdocs.yml

Lines changed: 1 addition & 3 deletions
```diff
@@ -10,9 +10,6 @@ nav:
   - Hierarchical Topic Modeling: hierarchical.md
   - Modifying and Finetuning Models: finetuning.md
   - Saving and Loading Models: persistence.md
-  - Tutorials:
-      - Topic Modeling in Chinese: chinese.md
-      - Keyphrase-based Topic Modeling: keyphrase.md
   - Models:
       - Model Overview: model_overview.md
       - Semantic Signal Separation (S³): s3.md
@@ -22,6 +19,7 @@ nav:
       - Autoencoding Models: ctm.md
       - FASTopic: FASTopic.md
   - Encoders: encoders.md
+  - Vectorizers: vectorizers.md
   - Namers: namers.md
 theme:
   name: material
```
