Commit a639ebf

Merge pull request #79 from x-tabdeveloping/cross_lingual_keynmf
Added Cross-lingual KeyNMF
2 parents: 615a6a2 + ae9252d

16 files changed: 241 additions and 22 deletions

.github/workflows/tests.yml (1 addition, 1 deletion)

@@ -29,7 +29,7 @@ jobs:
 run: python3 -c "import sys; print(sys.version)"

 - name: Install dependencies
-run: python3 -m pip install --upgrade turftopic[pyro-ppl] pandas pytest plotly igraph
+run: python3 -m pip install --upgrade turftopic[pyro-ppl] pandas pytest plotly igraph datasets
 - name: Run tests
 run: python3 -m pytest tests/

docs/KeyNMF.md (45 additions, 0 deletions)

@@ -299,6 +299,51 @@ print(model.hierarchy)

 For a detailed tutorial on hierarchical modeling click [here](hierarchical.md).

+## Cross-lingual KeyNMF
+
+KeyNMF, by default, does not come with cross-lingual capabilities, since only words that appear in a document can be assigned to it as keywords.
+We do, however, provide a term-matching scheme that allows you to match words across languages based on their cosine similarity in a multilingual embedding model.
+
+This is done by:
+
+1. Computing a similarity matrix over terms.
+2. Checking which terms have a similarity above a given threshold (_0.9_ by default).
+3. Building a graph from these connections and finding its connected components.
+4. Adding up, over all documents, the importances of terms that appear in the same component.
+
+```python
+from datasets import load_dataset
+from sklearn.feature_extraction.text import CountVectorizer
+
+from turftopic import KeyNMF
+
+# Loading a parallel corpus
+ds = load_dataset(
+    "aiana94/polynews-parallel", "deu_Latn-eng_Latn", split="train"
+)
+# Subsampling
+ds = ds.train_test_split(test_size=1000)["test"]
+corpus = ds["src"] + ds["tgt"]
+
+model = KeyNMF(
+    10,
+    cross_lingual=True,
+    encoder="paraphrase-multilingual-MiniLM-L12-v2",
+    vectorizer=CountVectorizer()
+)
+model.fit(corpus)
+model.print_topics()
+```
+
+| Topic ID | Highest Ranking |
+| - | - |
+| ... | |
+| 15 | africa-afrikanisch-african, media-medien-medienwirksam, schwarzwald-negroe-schwarzer, apartheid, difficulties-complicated-problems, kontinents-continent-kontinent, äthiopien-ethiopia, investitionen-investiert-investierenden, satire-satirical, hundred-100-1001 |
+| 16 | lawmaker-judges-gesetzliche, schutz-sicherheit-geschützt, an-success-eintreten, australian-australien-australischen, appeal-appealing-appeals, lawyer-lawyers-attorney, regeln-rule-rules, öffentlichkeit-öffentliche-publicly, terrorism-terroristischer-terrorismus, convicted |
+| 17 | israels-israel-israeli, palästinensischen-palestinians-palestine, gay-lgbtq-gays, david, blockaden-blockades-blockade, stars-star-stelle, aviv, bombardieren-bombenexplosion-bombing, militärischer-army-military, kampfflugzeuge-warplanes |
+| 18 | russischer-russlands-russischen, facebookbeitrag-facebook-facebooks, soziale-gesellschaftliche-sozialbauten, internetnutzer-internet, activism-aktivisten-activists, webseiten-web-site, isis, netzwerken-networks-netzwerk, vkontakte, media-medien-medienwirksam |
+| 19 | bundesstaates-regierenden-regiert, chinesischen-chinesische-chinesisch, präsidentschaft-presidential-president, regions-region-regionen, demokratien-democratic-democracy, kapitalismus-capitalist-capitalism, staatsbürgerin-citizens-bürger, jemen-jemenitische-yemen, angolanischen-angola, media-medien-medienwirksam |
+
 ## Online Topic Modeling

 KeyNMF can also be fitted in an online manner.
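
The four-step term-matching scheme described in the Cross-lingual KeyNMF section above can be sketched roughly as follows. This is a minimal illustration, not Turftopic's actual implementation; the function name `match_terms` and the toy inputs are made up for the example:

```python
import numpy as np


def match_terms(term_embeddings, doc_term_matrix, threshold=0.9):
    """Merge term importances across terms with near-identical embeddings."""
    # 1. Cosine similarity matrix over all terms
    normed = term_embeddings / np.linalg.norm(term_embeddings, axis=1, keepdims=True)
    similarity = normed @ normed.T
    # 2.-3. Union-find over term pairs above the threshold; the resulting
    # sets are the connected components of the similarity graph
    parent = list(range(len(similarity)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(similarity)):
        for j in range(i + 1, len(similarity)):
            if similarity[i, j] > threshold:
                parent[find(i)] = find(j)
    roots = sorted({find(i) for i in range(len(parent))})
    labels = [roots.index(find(i)) for i in range(len(parent))]
    # 4. Sum term importances within each component, for every document
    merged = np.zeros((doc_term_matrix.shape[0], len(roots)))
    for term, group in enumerate(labels):
        merged[:, group] += doc_term_matrix[:, term]
    return merged, labels


# Toy example: the first two "terms" are near-identical in embedding space,
# as translations of the same word would be under a multilingual encoder
embeddings = np.array([[1.0, 0.0], [0.99, 0.14], [0.0, 1.0]])
importances = np.array([[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]])
merged, labels = match_terms(embeddings, importances)
print(labels)  # → [0, 0, 1]: the first two terms fall into one component
print(merged)  # the columns of the two matched terms are added up per document
```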

docs/clustering.md (1 addition, 1 deletion)

@@ -4,7 +4,7 @@ Clustering topic models conceptualize topic modeling as a clustering task.
 Essentially a topic for these models is a tightly packed group of documents in semantic space.
 The first contextually sensitive clustering topic model was introduced with Top2Vec, and BERTopic has also iterated on this idea.

-If you are looking for a probabilistic/soft-clustering model you should also check out [GMM](gmm.md).
+If you are looking for a probabilistic/soft-clustering model you should also check out [GMM](GMM.md).

 <figure>
 <iframe src="../images/cluster_datamapplot.html", title="Cluster visualization", style="height:600px;width:800px;padding:0px;border:none;"></iframe>

docs/cross_lingual.md (83 additions, 0 deletions)

@@ -0,0 +1,83 @@
+# Cross-lingual Topic Modeling
+
+Under certain circumstances you might want to run a topic model on a multilingual corpus without the model capturing differences between languages.
+In these cases we recommend that you turn to cross-lingual topic modeling.
+
+## Natively multilingual models
+Some topic models in Turftopic support cross-lingual modeling by default.
+The only difference is that you will have to choose a multilingual encoder model to produce document embeddings (consult [MTEB(Multilingual)](http://mteb-leaderboard.hf.space/?benchmark_name=MTEB%28Multilingual%2C+v1%29) to find an encoder for your use case).
+
+=== "`SemanticSignalSeparation`"
+
+    ```python
+    from turftopic import SemanticSignalSeparation
+
+    model = SemanticSignalSeparation(10, encoder="paraphrase-multilingual-MiniLM-L12-v2")
+    ```
+
+=== "`ClusteringTopicModel`"
+
+    ```python
+    from turftopic import ClusteringTopicModel
+
+    model = ClusteringTopicModel(encoder="paraphrase-multilingual-MiniLM-L12-v2")
+    ```
+
+=== "`AutoEncodingTopicModel(combined=False)`"
+
+    ```python
+    from turftopic import AutoEncodingTopicModel
+
+    model = AutoEncodingTopicModel(combined=False, encoder="paraphrase-multilingual-MiniLM-L12-v2")
+    ```
+
+=== "`GMM`"
+
+    ```python
+    from turftopic import GMM
+
+    model = GMM(encoder="paraphrase-multilingual-MiniLM-L12-v2")
+    ```
+
+## Term Matching
+
+Other models do not support cross-lingual use out of the box, and therefore need assistance to be applicable in a multilingual context.
+
+[KeyNMF](KeyNMF.md) can use a trick called term-matching, in which highly similar terms get merged into the same term, thereby allowing one term to represent the same word in multiple languages:
+
+!!! note
+    Term matching is an experimental feature in Turftopic, and might be improved or extended to more models in the future.
+
+```python
+from datasets import load_dataset
+from sklearn.feature_extraction.text import CountVectorizer
+
+from turftopic import KeyNMF
+
+# Loading a parallel corpus
+ds = load_dataset(
+    "aiana94/polynews-parallel", "deu_Latn-eng_Latn", split="train"
+)
+# Subsampling
+ds = ds.train_test_split(test_size=1000)["test"]
+corpus = ds["src"] + ds["tgt"]
+
+model = KeyNMF(
+    10,
+    cross_lingual=True,
+    encoder="paraphrase-multilingual-MiniLM-L12-v2",
+    vectorizer=CountVectorizer()
+)
+model.fit(corpus)
+model.print_topics()
+```
+
+| Topic ID | Highest Ranking |
+| - | - |
+| ... | |
+| 15 | africa-afrikanisch-african, media-medien-medienwirksam, schwarzwald-negroe-schwarzer, apartheid, difficulties-complicated-problems, kontinents-continent-kontinent, äthiopien-ethiopia, investitionen-investiert-investierenden, satire-satirical, hundred-100-1001 |
+| 16 | lawmaker-judges-gesetzliche, schutz-sicherheit-geschützt, an-success-eintreten, australian-australien-australischen, appeal-appealing-appeals, lawyer-lawyers-attorney, regeln-rule-rules, öffentlichkeit-öffentliche-publicly, terrorism-terroristischer-terrorismus, convicted |
+| 17 | israels-israel-israeli, palästinensischen-palestinians-palestine, gay-lgbtq-gays, david, blockaden-blockades-blockade, stars-star-stelle, aviv, bombardieren-bombenexplosion-bombing, militärischer-army-military, kampfflugzeuge-warplanes |
+| 18 | russischer-russlands-russischen, facebookbeitrag-facebook-facebooks, soziale-gesellschaftliche-sozialbauten, internetnutzer-internet, activism-aktivisten-activists, webseiten-web-site, isis, netzwerken-networks-netzwerk, vkontakte, media-medien-medienwirksam |
+| 19 | bundesstaates-regierenden-regiert, chinesischen-chinesische-chinesisch, präsidentschaft-presidential-president, regions-region-regionen, demokratien-democratic-democracy, kapitalismus-capitalist-capitalism, staatsbürgerin-citizens-bürger, jemen-jemenitische-yemen, angolanischen-angola, media-medien-medienwirksam |
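
The hyphenated entries in the table above (e.g. `africa-afrikanisch-african`) show matched terms rendered as one merged label. A hypothetical helper sketching how such labels can be produced from component assignments (`merge_term_names` is made up for illustration and is not part of Turftopic's API):

```python
from collections import defaultdict


def merge_term_names(vocab, labels):
    """Join terms that share a component label into one hyphenated name."""
    groups = defaultdict(list)
    for term, label in zip(vocab, labels):
        groups[label].append(term)
    # One display name per component, in label order
    return ["-".join(terms) for _, terms in sorted(groups.items())]


vocab = ["africa", "afrikanisch", "african", "apartheid"]
labels = [0, 0, 0, 1]  # e.g. component labels from the term-matching step
print(merge_term_names(vocab, labels))  # → ['africa-afrikanisch-african', 'apartheid']
```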

docs/hierarchical.md (1 addition, 1 deletion)

@@ -21,7 +21,7 @@ _Drag and click to zoom, hover to see word importance_
 ## 1. Divisive/Top-down Hierarchical Modeling

 In divisive modeling, you start from larger structures, higher up in the hierarchy, and divide topics into smaller sub-topics on-demand.
-This is how hierarchical modeling works in [KeyNMF](keynmf.md), which, by default does not discover a topic hierarchy, but you can divide topics to as many subtopics as you see fit.
+This is how hierarchical modeling works in [KeyNMF](KeyNMF.md), which, by default does not discover a topic hierarchy, but you can divide topics to as many subtopics as you see fit.

 As a demonstration, let's load a corpus, that we know to have hierarchical themes.

docs/model_definition_and_training.md (8 additions, 8 deletions)

@@ -13,9 +13,9 @@ This page provides a guide on how to define models, train them, and use them for

 ## Defining a Model

-### 1. [Topic Model](../models.md)
+### 1. [Topic Model](model_overview.md)
 In order to initialize a model, you will first need to make a choice about which **topic model** you'd like to use.
-You might want to have a look at the [Models](models.md) page in order to make an informed choice about the topic model you intend to train.
+You might want to have a look at the [Models](model_overview.md) page in order to make an informed choice about the topic model you intend to train.

 Here are some examples of models you can load and use in the package:

@@ -43,11 +43,11 @@ Here are some examples of models you can load and use in the package:
 model = SemanticSignalSeparation(n_components=10, feature_importance="combined")
 ```

-### 2. [Vectorizer](../vectorizers.md)
+### 2. [Vectorizer](vectorizers.md)

 In Turftopic, all Models have a vectorizer component, which is responsible for extracting word content from documents in the corpus.
 This means, that a vectorizer also determines which words will be part of the model's vocabulary.
-For a more detailed explanation, see the [Vectorizers](../vectorizers.md) page
+For a more detailed explanation, see the [Vectorizers](vectorizers.md) page

 The default is scikit-learn's CountVectorizer:

@@ -126,12 +126,12 @@ thereby getting different behaviours. You can for instance use noun-phrases in y

 ```

-### 3. [Encoder](../encoders.md)
+### 3. [Encoder](encoders.md)

 Since all models in Turftopic rely on contextual embeddings, you will need to specify a contextual embedding model to use.
 The default is [`all-MiniLM-L6-v2`](sentence-transformers/all-MiniLM-L6-v2), which is a very fast and reasonably performant embedding model for English.
 You might, however want to use custom embeddings, either because your corpus is not in English, or because you need higher speed or performance.
-See a detailed guide on Encoders [here](../encoders.md).
+See a detailed guide on Encoders [here](encoders.md).

 Similar to a vectorizer, you can add an encoder to a topic model upon initializing it.

@@ -143,11 +143,11 @@ encoder = SentenceTransformer("parahprase-multilingual-MiniLM-L12-v2")
 model = KeyNMF(10, encoder=encoder)
 ```

-### 4. [Namer](../namers.md) (*optional*)
+### 4. [Namer](namers.md) (*optional*)

 A Namer is an optional part of your topic modeling pipeline, that can automatically assign human-readable names to topics.
 Namers are technically **not part of your topic model**, and should be used *after training*.
-See a detailed guide [here](../namers.md).
+See a detailed guide [here](namers.md).

 === "LLM from HuggingFace"
 ```python

docs/model_overview.md (3 additions, 3 deletions)

@@ -8,7 +8,7 @@ It is quite important that you choose the right topic model for your use case.

 | :zap: Speed | :book: Long Documents | :elephant: Scalability | :nut_and_bolt: Flexibility |
 | - | - | - | - |
-| **[SemanticSignalSeparation](s3.md)** | **[KeyNMF](KeyNMF.md)** | **[KeyNMF](KeyNMF.md)** | **[ClusteringTopicModel](ClusteringTopicModel.md)** |
+| **[SemanticSignalSeparation](s3.md)** | **[KeyNMF](KeyNMF.md)** | **[KeyNMF](KeyNMF.md)** | **[ClusteringTopicModel](clustering.md)** |

 _Table 1: You should tailor your model choice to your needs_

@@ -40,7 +40,7 @@ Some models are also capable of being used in a dynamic context, some can be fit
 You should take the results presented here with a grain of salt. A more comprehensive and in-depth analysis can be found in [Kardos et al., 2024](https://arxiv.org/abs/2406.09556), though the general tendencies are similar.
 Note that some topic models are also less stable than others, and they might require tweaking optimal results (like BERTopic), while others perform well out-of-the-box, but are not as flexible ($S^3$)

-The quality of the topics you can get out of your topic model can depend on a lot of things, including your choice of [vectorizer](../vectorizers.md) and [encoder model](../encoders.md).
+The quality of the topics you can get out of your topic model can depend on a lot of things, including your choice of [vectorizer](vectorizers.md) and [encoder model](encoders.md).
 More rigorous evaluation regimes can be found in a number of studies on topic modeling.

 Two usual metrics to evaluate models by are *coherence* and *diversity*.

@@ -57,7 +57,7 @@ In general, the most balanced models are $S^3$, Clustering models with `centroid

 | Model | :1234: Multiple Topics per Document | :hash: Detecting Number of Topics | :chart_with_upwards_trend: Dynamic Modeling | :evergreen_tree: Hierarchical Modeling | :star: Inference over New Documents | :globe_with_meridians: Cross-Lingual | :ocean: Online Fitting |
 | - | - | - | - | - | - | - | - |
-| **[KeyNMF](KeyNMF.md)** | :heavy_check_mark: | :x: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :x: | :heavy_check_mark: |
+| **[KeyNMF](KeyNMF.md)** | :heavy_check_mark: | :x: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
 | **[SemanticSignalSeparation](s3.md)** | :heavy_check_mark: | :x: | :heavy_check_mark: | :x: | :heavy_check_mark: | :heavy_check_mark: | :x: |
 | **[ClusteringTopicModel](clustering.md)** | :x: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :x: | :heavy_check_mark: | :x: |
 | **[GMM](GMM.md)** | :heavy_check_mark: | :x: | :heavy_check_mark: | :x: | :heavy_check_mark: | :heavy_check_mark: | :x: |
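
The *diversity* metric mentioned in the model_overview.md hunk above is commonly computed as the proportion of unique words among all topics' top words, so 1.0 means no two topics share a word. A minimal sketch of that common formulation, not Turftopic's evaluation code:

```python
def topic_diversity(topics):
    """Fraction of unique words among the top words of all topics."""
    all_words = [word for topic in topics for word in topic]
    return len(set(all_words)) / len(all_words)


topics = [
    ["space", "nasa", "orbit"],
    ["hockey", "team", "game"],
    ["nasa", "launch", "rocket"],  # shares "nasa" with the first topic
]
print(topic_diversity(topics))  # 8 unique words out of 9, roughly 0.89
```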

docs/online.md (1 addition, 1 deletion)

@@ -54,7 +54,7 @@ for epoch in range(5):
 You can pretrain a topic model on a large corpus and then finetune it on a novel corpus the model has not seen before.
 This will morph the model's topics to the corpus at hand.

-In this example I will load a pretrained KeyNMF model from disk. (see [Model Loading and Saving](persistance.md))
+In this example I will load a pretrained KeyNMF model from disk. (see [Model Loading and Saving](persistence.md))

 ```python
 from turftopic import load_model

docs/seeded.md (1 addition, 1 deletion)

@@ -4,7 +4,7 @@ When investigating a set of documents, you might already have an idea about what
 Some models are able to account for this by taking seed phrases or words.
 This is currently only possible with KeyNMF in Turftopic, but will likely be extended in the future.

-In [KeyNMF](../keynmf.md), you can describe the aspect, from which you want to investigate your corpus, using a free-text seed-phrase,
+In [KeyNMF](keynmf.md), you can describe the aspect, from which you want to investigate your corpus, using a free-text seed-phrase,
 which will then be used to only extract topics, which are relevant to your research question.

 In this example we investigate the 20Newsgroups corpus from three different aspects:

docs/vectorizers.md (2 additions, 2 deletions)

@@ -113,7 +113,7 @@ Since the same word can appear in multiple forms in a piece of text, one can som

 ### Extracting lemmata with `LemmaCountVectorizer`

-Similarly to `NounPhraseCountVectorizer`, `LemmaCountVectorizer` relies on a [SpaCy](spacy.io) pipeline for extracting lemmas from a piece of text.
+Similarly to `NounPhraseCountVectorizer`, `LemmaCountVectorizer` relies on a [SpaCy](https://spacy.io/) pipeline for extracting lemmas from a piece of text.
 This means you will have to install SpaCy and a SpaCy pipeline to be able to use it.

 ```bash

@@ -180,7 +180,7 @@ In these cases we recommend that you use a vectorizer with its own language-spec

 ### Vectorizing Any Language with `TokenCountVectorizer`

-The [SpaCy](spacy.io) package includes language-specific tokenization and stop-word rules for just about any language.
+The [SpaCy](https://spacy.io/) package includes language-specific tokenization and stop-word rules for just about any language.
 We provide a vectorizer that you can use with the language of your choice.

 ```bash
