Commit 14443c1: Added vectorizers to docs
1 parent 3ed07d2 · 1 file changed: docs/vectorizers.md (189 additions, 2 deletions)

# Vectorizers

One of the most important attributes of a topic model you will have to choose is the vectorizer.
A vectorizer is responsible for extracting term features from text.
It determines for which terms word-importance scores will be calculated.

By default, Turftopic uses sklearn's [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html),
which naively counts word/n-gram occurrences in text. This usually works quite well, but your use case might require a different or more sophisticated approach.
This is why we provide a `vectorizers` module, where a wide range of useful options is available to you.
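
For instance, you can pass a configured `CountVectorizer` to a model through its `vectorizer` argument. A minimal sketch (assuming `corpus` is your list of documents; the `min_df` and `stop_words` settings here are purely illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer
from turftopic import KeyNMF

# Only learn term-importance scores for words that occur in at
# least 10 documents and are not English stop words.
vectorizer = CountVectorizer(min_df=10, stop_words="english")

model = KeyNMF(10, vectorizer=vectorizer)
model.fit(corpus)
```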

!!! question "How is this different from preprocessing?"
    You might think that preprocessing the documents would have the same effect as some of these vectorizers, but this is not entirely the case.
    When you remove stop words or lemmatize texts in preprocessing, you remove a lot of valuable information that your topic model then can't use.
    By defining a custom vectorizer you *limit the vocabulary* of your model, thereby only learning word-importance scores for certain words, but **you keep your documents fully intact**.
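
    As a quick sklearn-only illustration of the difference (the example sentence is made up):

    ```python
    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["The cats are sleeping on the mat."]
    vectorizer = CountVectorizer(stop_words="english")
    vectorizer.fit(docs)
    # The vocabulary is limited to content words...
    print(vectorizer.get_feature_names_out())
    # ['cats' 'mat' 'sleeping']
    # ...but the document itself stays fully intact and would be
    # passed to the topic model unchanged.
    ```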

## Phrase Vectorizers

You might want to get phrases in your topic descriptions instead of individual words.
This can be a very reasonable choice, as it's often not individual words but the phrases they form that describe a topic most accurately.
Turftopic supports multiple ways of using phrases as fundamental terms.

### N-gram Features with `CountVectorizer`

`CountVectorizer` supports n-gram extraction right out of the box.
Just define a custom vectorizer with an `ngram_range`.

!!! tip
    While this option is naive, and will likely yield the lowest-quality results, it is also incredibly fast in comparison to other phrase vectorization techniques.
    It might, however, be slower if the topic model encodes its vocabulary when fitting.

```python
from sklearn.feature_extraction.text import CountVectorizer
from turftopic import KeyNMF

vectorizer = CountVectorizer(ngram_range=(2, 3), stop_words="english")

model = KeyNMF(10, vectorizer=vectorizer)
model.fit(corpus)
model.print_topics()
```

| Topic ID | Highest Ranking |
| - | - |
| 0 | bronx away sank, blew bronx away, blew bronx, bronx away, sank manhattan, stay blew bronx, manhattan sea, away sank manhattan, said queens stay, queens stay |
| 1 | faq alt atheism, alt atheism archive, atheism overview alt, alt atheism resources, atheism faq frequently, archive atheism overview, alt atheism faq, overview alt atheism, titles alt atheism, readers alt atheism |
| 2 | theism factor fanatism, theism leads fanatism, fanatism caused theism, theism correlated fanaticism, fanatism point theism, fanatism deletion theism, fanatics tend theism, fanaticism said fanatism, correlated fanaticism belief, strongly correlated fanaticism |
| 3 | alt atheism, atheism archive, alt atheism archive, archive atheism, atheism atheism, atheism faq, archive atheism introduction, atheism archive introduction, atheism introduction alt, atheism introduction |
| | ... |

### Noun phrases with `NounPhraseCountVectorizer`

Turftopic can also use noun phrases by utilizing the [SpaCy](https://spacy.io/) package.
For noun phrase vectorization to work, you will have to install SpaCy.

```bash
pip install turftopic[spacy]
```

You will also need to install a relevant SpaCy pipeline for the language you intend to use.
The default pipeline is English, and you should install it before attempting to use `NounPhraseCountVectorizer`.

You can find a model that fits your needs [here](https://spacy.io/models).

```bash
python -m spacy download en_core_web_sm
```
Using SpaCy pipelines will substantially slow down model fitting, but the results may be more accurate and of higher quality than with naive n-gram extraction.

```python
from turftopic import KeyNMF
from turftopic.vectorizers.spacy import NounPhraseCountVectorizer

model = KeyNMF(
    n_components=10,
    vectorizer=NounPhraseCountVectorizer("en_core_web_sm"),
)
model.fit(corpus)
model.print_topics()
```

| Topic ID | Highest Ranking |
| - | - |
| 0 | atheists, atheism, atheist, belief, beliefs, theists, faith, gods, christians, abortion |
| 1 | alt atheism, usenet alt atheism resources, usenet alt atheism introduction, alt atheism faq, newsgroup alt atheism, atheism faq resource txt, alt atheism groups, atheism, atheism faq intro txt, atheist resources |
| 2 | religion, christianity, faith, beliefs, religions, christian, belief, science, cult, justification |
| 3 | fanaticism, theism, fanatism, all fanatism, theists, strong theism, strong atheism, fanatics, precisely some theists, all theism |
| 4 | religion foundation darwin fish bumper stickers, darwin fish, atheism, 3d plastic fish, fish symbol, atheist books, atheist organizations, negative atheism, positive atheism, atheism index |
| | ... |
### Keyphrases with [KeyphraseVectorizers](https://github.com/TimSchopf/KeyphraseVectorizers/tree/master)

You can extract candidate keyphrases from text using KeyphraseVectorizers.
KeyphraseVectorizers uses POS-tag patterns to identify phrases, rather than the word dependency graphs used by `NounPhraseCountVectorizer`.
It can therefore be faster, as the dependency parser component is not needed in the SpaCy pipeline.
This vectorizer is not part of the Turftopic package, but it can easily be used with it out of the box.

```bash
pip install keyphrase-vectorizers
```

```python
from keyphrase_vectorizers import KeyphraseCountVectorizer
from turftopic import KeyNMF

vectorizer = KeyphraseCountVectorizer()
model = KeyNMF(10, vectorizer=vectorizer).fit(corpus)
```
## Lemmatizing and Stemming Vectorizers

Since the same word can appear in multiple forms in a piece of text, one can sometimes obtain higher-quality results by stemming or lemmatizing words in a text before processing them.

!!! warning
    You should **NEVER** lemmatize or stem texts before passing them to a topic model in Turftopic; rather, use a vectorizer that limits the model's vocabulary to the terms you are interested in.

### Extracting lemmata with `LemmaCountVectorizer`

Similarly to `NounPhraseCountVectorizer`, `LemmaCountVectorizer` relies on a SpaCy pipeline for extracting lemmata from a piece of text.
This means you will have to install SpaCy and a SpaCy pipeline to be able to use it.

```bash
pip install turftopic[spacy]
python -m spacy download en_core_web_sm
```

```python
from turftopic import KeyNMF
from turftopic.vectorizers.spacy import LemmaCountVectorizer

model = KeyNMF(10, vectorizer=LemmaCountVectorizer("en_core_web_sm"))
model.fit(corpus)
model.print_topics()
```

| Topic ID | Highest Ranking |
| - | - |
| 0 | atheist, theist, belief, christians, agnostic, christian, mythology, asimov, abortion, read |
| 1 | morality, moral, immoral, objective, society, animal, natural, societal, murder, morally |
| 2 | religion, religious, christianity, belief, christian, faith, cult, church, secular, christians |
| 3 | atheism, belief, agnosticism, religious, faq, lack, existence, theism, atheistic, allah |
| 4 | islam, muslim, islamic, rushdie, khomeini, bank, imam, bcci, law, secular |
| | ... |

### Stemming words with `StemmingCountVectorizer`

You might find that lemmatization isn't aggressive enough for your purposes, and that many forms of the same word still make it into topic descriptions.
In that case you should try stemming! Stemming is available in Turftopic via the [Snowball Stemmer](https://snowballstem.org/), which has to be installed before using stemming vectorization.

!!! question "Should I choose stemming or lemmatization?"
    In almost all cases you should **prefer lemmatization** over stemming, as it provides higher-quality and more correct results. You should only use a stemmer if

    1. you need something fast (lemmatization is slower due to a more involved pipeline), or
    2. you know what you want and it is definitely stemming.

```bash
pip install turftopic[snowball]
```

Then you can initialize a topic model with this vectorizer:

```python
from turftopic import KeyNMF
from turftopic.vectorizers.snowball import StemmingCountVectorizer

model = KeyNMF(10, vectorizer=StemmingCountVectorizer(language="english"))
model.fit(corpus)
model.print_topics()
```

| Topic ID | Highest Ranking |
| - | - |
| 0 | atheism, belief, alt, theism, agnostic, stalin, lack, sceptic, exist, faith |
| 1 | religion, belief, religi, cult, faith, theism, secular, theist, scientist, dogma |
| 2 | bronx, manhattan, sank, queen, sea, away, said, com, bob, blew |
| 3 | moral, human, instinct, murder, kill, law, behaviour, action, behavior, ethic |
| 4 | atheist, theist, belief, asimov, philosoph, mytholog, strong, faq, agnostic, weak |
| | ... |
## Chinese Vectorizer

The Chinese language does not separate tokens by whitespace, unlike most Indo-European languages.
You thus need to use special tokenization rules for Chinese.

@@ -27,5 +193,26 @@ from turftopic.vectorizers.chinese import ChineseCountVectorizer

vectorizer = ChineseCountVectorizer(min_df=10, stop_words="chinese")

model = KeyNMF(10, vectorizer=vectorizer, encoder="BAAI/bge-small-zh-v1.5")
model.fit(corpus)

model.print_topics()
```

| Topic ID | Highest Ranking |
| - | - |
| 0 | 消息, 时间, 科技, 媒体报道, 美国, 据, 国外, 讯, 宣布, 称 |
| 1 | 体育讯, 新浪, 球员, 球队, 赛季, 火箭, nba, 已经, 主场, 时间 |
| 2 | 记者, 本报讯, 昨日, 获悉, 新华网, 基金, 通讯员, 采访, 男子, 昨天 |
| 3 | 股, 下跌, 上涨, 震荡, 板块, 大盘, 股指, 涨幅, 沪, 反弹 |
| | ... |

## API Reference

:::turftopic.vectorizers.spacy.NounPhraseCountVectorizer

:::turftopic.vectorizers.spacy.LemmaCountVectorizer

:::turftopic.vectorizers.snowball.StemmingCountVectorizer

:::turftopic.vectorizers.chinese.ChineseCountVectorizer
