# Vectorizers
One of the most important attributes of a topic model you will have to choose is the vectorizer.
A vectorizer is responsible for extracting term features from text.
It determines which terms word-importance scores will be calculated for.

By default, Turftopic uses sklearn's [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html),
which naively counts word/n-gram occurrences in text. This usually works quite well, but your use case might require a different or more sophisticated approach.
This is why we provide a `vectorizers` module, where a wide range of useful options is available to you.
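To make the idea concrete, here is a toy, dependency-free sketch of what a count vectorizer does. The documents are made up for illustration, and this is not sklearn's actual implementation:

```python
from collections import Counter

# Toy illustration: a count vectorizer maps each document to a vector of
# term counts over a shared vocabulary (the documents here are made up).
docs = [
    "topic models describe corpora",
    "vectorizers extract terms from corpora",
]

# The vocabulary is the set of all terms seen across the documents.
vocabulary = sorted({word for doc in docs for word in doc.split()})

# Each document becomes one count per vocabulary term.
vectors = [
    [Counter(doc.split())[term] for term in vocabulary]
    for doc in docs
]

print(vocabulary)
print(vectors)
```

Limiting which terms end up in the vocabulary is exactly the lever the vectorizers below give you.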
!!! question "How is this different from preprocessing?"
    You might think that preprocessing the documents would have the same effect as some of these vectorizers, but this is not entirely the case.
    When you remove stop words or lemmatize texts in preprocessing, you remove a lot of valuable information that your topic model then can't use.
    By defining a custom vectorizer you *limit the vocabulary* of your model, thereby only learning word-importance scores for certain words, but **you keep your documents fully intact**.
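The difference can be sketched in a few lines of plain Python (toy stop-word list and documents, purely for illustration):

```python
# Toy illustration of preprocessing vs. vocabulary limiting.
stop_words = {"the", "a", "of"}
docs = ["the rise of topic models", "a survey of neural topic models"]

# Preprocessing rewrites the documents themselves: information is gone.
preprocessed = [
    " ".join(word for word in doc.split() if word not in stop_words)
    for doc in docs
]

# A custom vectorizer only restricts the vocabulary; documents stay intact.
vocabulary = sorted({word for doc in docs for word in doc.split()} - stop_words)

print(preprocessed)
print(vocabulary)
print(docs)  # the documents themselves are unchanged
```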
## Phrase Vectorizers

You might want to get phrases in your topic descriptions instead of individual words.
This could prove a very reasonable choice, as it is often not the words in themselves but the phrases they make up that describe a topic most accurately.
Turftopic supports multiple ways of using phrases as fundamental terms.
### N-gram Features with `CountVectorizer`

`CountVectorizer` supports n-gram extraction right out of the box.
Just define a custom vectorizer with an `ngram_range`.

!!! tip
    While this option is naive, and will likely yield the lowest-quality results, it is also incredibly fast in comparison to other phrase vectorization techniques.
    It might, however, be slower if the topic model encodes its vocabulary when fitting.

```python
from sklearn.feature_extraction.text import CountVectorizer

from turftopic import KeyNMF

model = KeyNMF(10, vectorizer=CountVectorizer(ngram_range=(2, 4)))
model.fit(corpus)
model.print_topics()
```

| Topic ID | Highest Ranking |
| - | - |
| ... | ... |
| 1 | faq alt atheism, alt atheism archive, atheism overview alt, alt atheism resources, atheism faq frequently, archive atheism overview, alt atheism faq, overview alt atheism, titles alt atheism, readers alt atheism |
| ... | ... |
### Noun phrases with `NounPhraseCountVectorizer`

Turftopic can also use noun phrases by utilizing the [SpaCy](https://spacy.io/) package.
For noun phrase vectorization to work, you will have to install SpaCy.
```bash
pip install turftopic[spacy]
```

You will also need to install a relevant SpaCy pipeline for the language you intend to use.
The default pipeline is English; you should install it before attempting to use `NounPhraseCountVectorizer`.

You can find a model that fits your needs [here](https://spacy.io/models).
```bash
python -m spacy download en_core_web_sm
```

Using SpaCy pipelines will substantially slow down model fitting, but the results might be more correct and of higher quality than with naive n-gram extraction.

```python
from turftopic import KeyNMF
from turftopic.vectorizers.spacy import NounPhraseCountVectorizer

model = KeyNMF(10, vectorizer=NounPhraseCountVectorizer("en_core_web_sm"))
model.fit(corpus)
model.print_topics()
```

| Topic ID | Highest Ranking |
| - | - |
| ... | ... |
| 3 | fanaticism, theism, fanatism, all fanatism, theists, strong theism, strong atheism, fanatics, precisely some theists, all theism |
| 4 | religion foundation darwin fish bumper stickers, darwin fish, atheism, 3d plastic fish, fish symbol, atheist books, atheist organizations, negative atheism, positive atheism, atheism index |
| ... | ... |
### Keyphrases with [KeyphraseVectorizers](https://github.com/TimSchopf/KeyphraseVectorizers/tree/master)

You can extract candidate keyphrases from text using KeyphraseVectorizers.
KeyphraseVectorizers uses POS-tag patterns to identify phrases, rather than the dependency parse that `NounPhraseCountVectorizer` relies on.
It can potentially be faster, as the dependency-parser component is not needed in the SpaCy pipeline.
This vectorizer is not part of the Turftopic package, but can easily be used with it out of the box.
```bash
pip install keyphrase-vectorizers
```

```python
from keyphrase_vectorizers import KeyphraseCountVectorizer

from turftopic import KeyNMF

vectorizer = KeyphraseCountVectorizer()
model = KeyNMF(10, vectorizer=vectorizer).fit(corpus)
```

## Lemmatizing and Stemming Vectorizers

Since the same word can appear in multiple forms in a piece of text, one can sometimes obtain higher-quality results by stemming or lemmatizing words in a text before processing them.

!!! warning
    You should **NEVER** lemmatize or stem texts before passing them to a topic model in Turftopic. Instead, use a vectorizer that limits the model's vocabulary to the terms you are interested in.
### Extracting lemmata with `LemmaCountVectorizer`

Similarly to `NounPhraseCountVectorizer`, `LemmaCountVectorizer` relies on a SpaCy pipeline for extracting lemmas from a piece of text.
This means you will have to install SpaCy and a SpaCy pipeline to be able to use it.
```bash
pip install turftopic[spacy]
python -m spacy download en_core_web_sm
```

```python
from turftopic import KeyNMF
from turftopic.vectorizers.spacy import LemmaCountVectorizer

model = KeyNMF(10, vectorizer=LemmaCountVectorizer("en_core_web_sm"))
model.fit(corpus)
```

### Stemming words with `StemmingCountVectorizer`

You might find that lemmatization isn't aggressive enough for your purposes, and that many forms of the same word still penetrate topic descriptions.
In that case, you should try stemming! Stemming is available in Turftopic via the [Snowball Stemmer](https://snowballstem.org/), which has to be installed before using stemming vectorization.

!!! question "Should I choose stemming or lemmatization?"
    In almost all cases you should **prefer lemmatization** over stemming, as it provides higher-quality and more correct results. You should only use a stemmer if:

    1. You need something fast (lemmatization is slower due to a more involved pipeline), or
    2. You know what you want, and it is definitely stemming.
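The contrast can be sketched with hand-written toy rules. Neither the Snowball stemmer nor a SpaCy lemmatizer works this simply; this only illustrates the behavioural difference:

```python
def toy_stem(word: str) -> str:
    # Crude suffix stripping: fast, but can produce non-words.
    for suffix in ("ies", "ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Lemmatization maps words to dictionary forms (tiny toy lookup here).
lemma_lookup = {"studies": "study", "running": "run", "better": "good"}

words = ["studies", "running", "better"]
print([toy_stem(word) for word in words])            # stems, possibly non-words
print([lemma_lookup.get(word, word) for word in words])  # dictionary forms
```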
```bash
pip install turftopic[snowball]
```

Then you can initialize a topic model with this vectorizer:
```python
from turftopic import KeyNMF
from turftopic.vectorizers.snowball import StemmingCountVectorizer

model = KeyNMF(10, vectorizer=StemmingCountVectorizer(language="english"))
model.fit(corpus)
```