Commit a639ebf

Merge pull request #79 from x-tabdeveloping/cross_lingual_keynmf
Added Cross-lingual KeyNMF
2 parents: 615a6a2 + ae9252d

16 files changed: 241 additions and 22 deletions

.github/workflows/tests.yml (1 addition, 1 deletion)

@@ -29,7 +29,7 @@ jobs:
 run: python3 -c "import sys; print(sys.version)"

 - name: Install dependencies
-run: python3 -m pip install --upgrade turftopic[pyro-ppl] pandas pytest plotly igraph
+run: python3 -m pip install --upgrade turftopic[pyro-ppl] pandas pytest plotly igraph datasets
 - name: Run tests
 run: python3 -m pytest tests/

docs/KeyNMF.md (45 additions, 0 deletions)

@@ -299,6 +299,51 @@ print(model.hierarchy)

 For a detailed tutorial on hierarchical modeling click [here](hierarchical.md).

+## Cross-lingual KeyNMF
+
+KeyNMF, by default, does not come with cross-lingual capabilities, since only words that appear in a document can be assigned to it as keywords.
+We do, however, provide a term-matching scheme that allows you to match words across languages based on their cosine similarity in a multilingual embedding model.
+
+This is done by:
+
+1. Computing a similarity matrix over terms.
+2. Checking which terms have a similarity above a given threshold (_0.9_ by default).
+3. Building a graph from these connections and finding its connected components.
+4. Adding up, over all documents, the importances of terms that appear in the same component.
+
+```python
+from datasets import load_dataset
+from sklearn.feature_extraction.text import CountVectorizer
+
+from turftopic import KeyNMF
+
+# Loading a parallel corpus
+ds = load_dataset(
+    "aiana94/polynews-parallel", "deu_Latn-eng_Latn", split="train"
+)
+# Subsampling
+ds = ds.train_test_split(test_size=1000)["test"]
+corpus = ds["src"] + ds["tgt"]
+
+model = KeyNMF(
+    10,
+    cross_lingual=True,
+    encoder="paraphrase-multilingual-MiniLM-L12-v2",
+    vectorizer=CountVectorizer()
+)
+model.fit(corpus)
+model.print_topics()
+```
+
+| Topic ID | Highest Ranking |
+| - | - |
+| ... | |
+| 15 | africa-afrikanisch-african, media-medien-medienwirksam, schwarzwald-negroe-schwarzer, apartheid, difficulties-complicated-problems, kontinents-continent-kontinent, äthiopien-ethiopia, investitionen-investiert-investierenden, satire-satirical, hundred-100-1001 |
+| 16 | lawmaker-judges-gesetzliche, schutz-sicherheit-geschützt, an-success-eintreten, australian-australien-australischen, appeal-appealing-appeals, lawyer-lawyers-attorney, regeln-rule-rules, öffentlichkeit-öffentliche-publicly, terrorism-terroristischer-terrorismus, convicted |
+| 17 | israels-israel-israeli, palästinensischen-palestinians-palestine, gay-lgbtq-gays, david, blockaden-blockades-blockade, stars-star-stelle, aviv, bombardieren-bombenexplosion-bombing, militärischer-army-military, kampfflugzeuge-warplanes |
+| 18 | russischer-russlands-russischen, facebookbeitrag-facebook-facebooks, soziale-gesellschaftliche-sozialbauten, internetnutzer-internet, activism-aktivisten-activists, webseiten-web-site, isis, netzwerken-networks-netzwerk, vkontakte, media-medien-medienwirksam |
+| 19 | bundesstaates-regierenden-regiert, chinesischen-chinesische-chinesisch, präsidentschaft-presidential-president, regions-region-regionen, demokratien-democratic-democracy, kapitalismus-capitalist-capitalism, staatsbürgerin-citizens-bürger, jemen-jemenitische-yemen, angolanischen-angola, media-medien-medienwirksam |
+
 ## Online Topic Modeling

 KeyNMF can also be fitted in an online manner.
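
The four-step term-matching scheme described in the Cross-lingual KeyNMF section above can be sketched roughly as follows. This is a minimal illustration, not Turftopic's actual implementation; the function name `match_terms` and the toy inputs are made up for the example:

```python
import numpy as np


def match_terms(term_embeddings, doc_term_matrix, threshold=0.9):
    """Merge term importances across terms with near-identical embeddings."""
    # 1. Cosine similarity matrix over all terms
    normed = term_embeddings / np.linalg.norm(term_embeddings, axis=1, keepdims=True)
    similarity = normed @ normed.T
    # 2.-3. Union-find over term pairs above the threshold; the resulting
    # sets are the connected components of the similarity graph
    parent = list(range(len(similarity)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(similarity)):
        for j in range(i + 1, len(similarity)):
            if similarity[i, j] > threshold:
                parent[find(i)] = find(j)
    roots = sorted({find(i) for i in range(len(parent))})
    labels = [roots.index(find(i)) for i in range(len(parent))]
    # 4. Sum term importances within each component, for every document
    merged = np.zeros((doc_term_matrix.shape[0], len(roots)))
    for term, group in enumerate(labels):
        merged[:, group] += doc_term_matrix[:, term]
    return merged, labels


# Toy example: the first two "terms" are near-identical in embedding space,
# as translations of the same word would be under a multilingual encoder
embeddings = np.array([[1.0, 0.0], [0.99, 0.14], [0.0, 1.0]])
importances = np.array([[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]])
merged, labels = match_terms(embeddings, importances)
print(labels)  # → [0, 0, 1]: the first two terms fall into one component
print(merged)  # the columns of the two matched terms are added up per document
```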

docs/clustering.md (1 addition, 1 deletion)

@@ -4,7 +4,7 @@ Clustering topic models conceptualize topic modeling as a clustering task.
 Essentially a topic for these models is a tightly packed group of documents in semantic space.
 The first contextually sensitive clustering topic model was introduced with Top2Vec, and BERTopic has also iterated on this idea.

-If you are looking for a probabilistic/soft-clustering model you should also check out [GMM](gmm.md).
+If you are looking for a probabilistic/soft-clustering model you should also check out [GMM](GMM.md).

 <figure>
 <iframe src="../images/cluster_datamapplot.html", title="Cluster visualization", style="height:600px;width:800px;padding:0px;border:none;"></iframe>

docs/cross_lingual.md (83 additions, 0 deletions)

@@ -0,0 +1,83 @@
+# Cross-lingual Topic Modeling
+
+Under certain circumstances you might want to run a topic model on a multilingual corpus without the model capturing differences between languages.
+In these cases we recommend that you turn to cross-lingual topic modeling.
+
+## Natively multilingual models
+Some topic models in Turftopic support cross-lingual modeling by default.
+The only difference is that you will have to choose a multilingual encoder model to produce document embeddings (consult [MTEB(Multilingual)](http://mteb-leaderboard.hf.space/?benchmark_name=MTEB%28Multilingual%2C+v1%29) to find an encoder for your use case).
+
+=== "`SemanticSignalSeparation`"
+
+    ```python
+    from turftopic import SemanticSignalSeparation
+
+    model = SemanticSignalSeparation(10, encoder="paraphrase-multilingual-MiniLM-L12-v2")
+    ```
+
+=== "`ClusteringTopicModel`"
+
+    ```python
+    from turftopic import ClusteringTopicModel
+
+    model = ClusteringTopicModel(encoder="paraphrase-multilingual-MiniLM-L12-v2")
+    ```
+
+=== "`AutoEncodingTopicModel(combined=False)`"
+
+    ```python
+    from turftopic import AutoEncodingTopicModel
+
+    model = AutoEncodingTopicModel(combined=False, encoder="paraphrase-multilingual-MiniLM-L12-v2")
+    ```
+
+=== "`GMM`"
+
+    ```python
+    from turftopic import GMM
+
+    model = GMM(encoder="paraphrase-multilingual-MiniLM-L12-v2")
+    ```
+
+## Term Matching
+
+Other models do not support cross-lingual use out of the box, and therefore need assistance to be applicable in a multilingual context.
+
+[KeyNMF](KeyNMF.md) can use a trick called term-matching, in which highly similar terms get merged into the same term, thereby allowing one term to represent the same word in multiple languages:
+
+!!! note
+    Term matching is an experimental feature in Turftopic, and might be improved or extended to more models in the future.
+
+```python
+from datasets import load_dataset
+from sklearn.feature_extraction.text import CountVectorizer
+
+from turftopic import KeyNMF
+
+# Loading a parallel corpus
+ds = load_dataset(
+    "aiana94/polynews-parallel", "deu_Latn-eng_Latn", split="train"
+)
+# Subsampling
+ds = ds.train_test_split(test_size=1000)["test"]
+corpus = ds["src"] + ds["tgt"]
+
+model = KeyNMF(
+    10,
+    cross_lingual=True,
+    encoder="paraphrase-multilingual-MiniLM-L12-v2",
+    vectorizer=CountVectorizer()
+)
+model.fit(corpus)
+model.print_topics()
+```
+
+| Topic ID | Highest Ranking |
+| - | - |
+| ... | |
+| 15 | africa-afrikanisch-african, media-medien-medienwirksam, schwarzwald-negroe-schwarzer, apartheid, difficulties-complicated-problems, kontinents-continent-kontinent, äthiopien-ethiopia, investitionen-investiert-investierenden, satire-satirical, hundred-100-1001 |
+| 16 | lawmaker-judges-gesetzliche, schutz-sicherheit-geschützt, an-success-eintreten, australian-australien-australischen, appeal-appealing-appeals, lawyer-lawyers-attorney, regeln-rule-rules, öffentlichkeit-öffentliche-publicly, terrorism-terroristischer-terrorismus, convicted |
+| 17 | israels-israel-israeli, palästinensischen-palestinians-palestine, gay-lgbtq-gays, david, blockaden-blockades-blockade, stars-star-stelle, aviv, bombardieren-bombenexplosion-bombing, militärischer-army-military, kampfflugzeuge-warplanes |
+| 18 | russischer-russlands-russischen, facebookbeitrag-facebook-facebooks, soziale-gesellschaftliche-sozialbauten, internetnutzer-internet, activism-aktivisten-activists, webseiten-web-site, isis, netzwerken-networks-netzwerk, vkontakte, media-medien-medienwirksam |
+| 19 | bundesstaates-regierenden-regiert, chinesischen-chinesische-chinesisch, präsidentschaft-presidential-president, regions-region-regionen, demokratien-democratic-democracy, kapitalismus-capitalist-capitalism, staatsbürgerin-citizens-bürger, jemen-jemenitische-yemen, angolanischen-angola, media-medien-medienwirksam |
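
The hyphenated entries in the table above (e.g. `africa-afrikanisch-african`) show matched terms rendered as one merged label. A hypothetical helper sketching how such labels can be produced from component assignments (`merge_term_names` is made up for illustration and is not part of Turftopic's API):

```python
from collections import defaultdict


def merge_term_names(vocab, labels):
    """Join terms that share a component label into one hyphenated name."""
    groups = defaultdict(list)
    for term, label in zip(vocab, labels):
        groups[label].append(term)
    # One display name per component, in label order
    return ["-".join(terms) for _, terms in sorted(groups.items())]


vocab = ["africa", "afrikanisch", "african", "apartheid"]
labels = [0, 0, 0, 1]  # e.g. component labels from the term-matching step
print(merge_term_names(vocab, labels))  # → ['africa-afrikanisch-african', 'apartheid']
```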

docs/hierarchical.md (1 addition, 1 deletion)

@@ -21,7 +21,7 @@ _Drag and click to zoom, hover to see word importance_
 ## 1. Divisive/Top-down Hierarchical Modeling

 In divisive modeling, you start from larger structures, higher up in the hierarchy, and divide topics into smaller sub-topics on-demand.
-This is how hierarchical modeling works in [KeyNMF](keynmf.md), which, by default does not discover a topic hierarchy, but you can divide topics to as many subtopics as you see fit.
+This is how hierarchical modeling works in [KeyNMF](KeyNMF.md), which, by default does not discover a topic hierarchy, but you can divide topics to as many subtopics as you see fit.

 As a demonstration, let's load a corpus, that we know to have hierarchical themes.

docs/model_definition_and_training.md (8 additions, 8 deletions)

@@ -13,9 +13,9 @@ This page provides a guide on how to define models, train them, and use them for

 ## Defining a Model

-### 1. [Topic Model](../models.md)
+### 1. [Topic Model](model_overview.md)
 In order to initialize a model, you will first need to make a choice about which **topic model** you'd like to use.
-You might want to have a look at the [Models](models.md) page in order to make an informed choice about the topic model you intend to train.
+You might want to have a look at the [Models](model_overview.md) page in order to make an informed choice about the topic model you intend to train.

 Here are some examples of models you can load and use in the package:

@@ -43,11 +43,11 @@ Here are some examples of models you can load and use in the package:
 model = SemanticSignalSeparation(n_components=10, feature_importance="combined")
 ```

-### 2. [Vectorizer](../vectorizers.md)
+### 2. [Vectorizer](vectorizers.md)

 In Turftopic, all Models have a vectorizer component, which is responsible for extracting word content from documents in the corpus.
 This means, that a vectorizer also determines which words will be part of the model's vocabulary.
-For a more detailed explanation, see the [Vectorizers](../vectorizers.md) page
+For a more detailed explanation, see the [Vectorizers](vectorizers.md) page

 The default is scikit-learn's CountVectorizer:

@@ -126,12 +126,12 @@ thereby getting different behaviours. You can for instance use noun-phrases in y

 ```

-### 3. [Encoder](../encoders.md)
+### 3. [Encoder](encoders.md)

 Since all models in Turftopic rely on contextual embeddings, you will need to specify a contextual embedding model to use.
 The default is [`all-MiniLM-L6-v2`](sentence-transformers/all-MiniLM-L6-v2), which is a very fast and reasonably performant embedding model for English.
 You might, however want to use custom embeddings, either because your corpus is not in English, or because you need higher speed or performance.
-See a detailed guide on Encoders [here](../encoders.md).
+See a detailed guide on Encoders [here](encoders.md).

 Similar to a vectorizer, you can add an encoder to a topic model upon initializing it.

@@ -143,11 +143,11 @@ encoder = SentenceTransformer("parahprase-multilingual-MiniLM-L12-v2")
 model = KeyNMF(10, encoder=encoder)
 ```

-### 4. [Namer](../namers.md) (*optional*)
+### 4. [Namer](namers.md) (*optional*)

 A Namer is an optional part of your topic modeling pipeline, that can automatically assign human-readable names to topics.
 Namers are technically **not part of your topic model**, and should be used *after training*.
-See a detailed guide [here](../namers.md).
+See a detailed guide [here](namers.md).

 === "LLM from HuggingFace"
 ```python

docs/model_overview.md (3 additions, 3 deletions)

@@ -8,7 +8,7 @@ It is quite important that you choose the right topic model for your use case.

 | :zap: Speed | :book: Long Documents | :elephant: Scalability | :nut_and_bolt: Flexibility |
 | - | - | - | - |
-| **[SemanticSignalSeparation](s3.md)** | **[KeyNMF](KeyNMF.md)** | **[KeyNMF](KeyNMF.md)** | **[ClusteringTopicModel](ClusteringTopicModel.md)** |
+| **[SemanticSignalSeparation](s3.md)** | **[KeyNMF](KeyNMF.md)** | **[KeyNMF](KeyNMF.md)** | **[ClusteringTopicModel](clustering.md)** |

 _Table 1: You should tailor your model choice to your needs_

@@ -40,7 +40,7 @@ Some models are also capable of being used in a dynamic context, some can be fit
 You should take the results presented here with a grain of salt. A more comprehensive and in-depth analysis can be found in [Kardos et al., 2024](https://arxiv.org/abs/2406.09556), though the general tendencies are similar.
 Note that some topic models are also less stable than others, and they might require tweaking optimal results (like BERTopic), while others perform well out-of-the-box, but are not as flexible ($S^3$)

-The quality of the topics you can get out of your topic model can depend on a lot of things, including your choice of [vectorizer](../vectorizers.md) and [encoder model](../encoders.md).
+The quality of the topics you can get out of your topic model can depend on a lot of things, including your choice of [vectorizer](vectorizers.md) and [encoder model](encoders.md).
 More rigorous evaluation regimes can be found in a number of studies on topic modeling.

 Two usual metrics to evaluate models by are *coherence* and *diversity*.

@@ -57,7 +57,7 @@ In general, the most balanced models are $S^3$, Clustering models with `centroid

 | Model | :1234: Multiple Topics per Document | :hash: Detecting Number of Topics | :chart_with_upwards_trend: Dynamic Modeling | :evergreen_tree: Hierarchical Modeling | :star: Inference over New Documents | :globe_with_meridians: Cross-Lingual | :ocean: Online Fitting |
 | - | - | - | - | - | - | - | - |
-| **[KeyNMF](KeyNMF.md)** | :heavy_check_mark: | :x: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :x: | :heavy_check_mark: |
+| **[KeyNMF](KeyNMF.md)** | :heavy_check_mark: | :x: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
 | **[SemanticSignalSeparation](s3.md)** | :heavy_check_mark: | :x: | :heavy_check_mark: | :x: | :heavy_check_mark: | :heavy_check_mark: | :x: |
 | **[ClusteringTopicModel](clustering.md)** | :x: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :x: | :heavy_check_mark: | :x: |
 | **[GMM](GMM.md)** | :heavy_check_mark: | :x: | :heavy_check_mark: | :x: | :heavy_check_mark: | :heavy_check_mark: | :x: |
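
The *diversity* metric mentioned in the model_overview.md hunk above is commonly computed as the proportion of unique words among all topics' top words, so 1.0 means no two topics share a word. A minimal sketch of that common formulation, not Turftopic's evaluation code:

```python
def topic_diversity(topics):
    """Fraction of unique words among the top words of all topics."""
    all_words = [word for topic in topics for word in topic]
    return len(set(all_words)) / len(all_words)


topics = [
    ["space", "nasa", "orbit"],
    ["hockey", "team", "game"],
    ["nasa", "launch", "rocket"],  # shares "nasa" with the first topic
]
print(topic_diversity(topics))  # 8 unique words out of 9, roughly 0.89
```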

docs/online.md (1 addition, 1 deletion)

@@ -54,7 +54,7 @@ for epoch in range(5):
 You can pretrain a topic model on a large corpus and then finetune it on a novel corpus the model has not seen before.
 This will morph the model's topics to the corpus at hand.

-In this example I will load a pretrained KeyNMF model from disk. (see [Model Loading and Saving](persistance.md))
+In this example I will load a pretrained KeyNMF model from disk. (see [Model Loading and Saving](persistence.md))

 ```python
 from turftopic import load_model

docs/seeded.md (1 addition, 1 deletion)

@@ -4,7 +4,7 @@ When investigating a set of documents, you might already have an idea about what
 Some models are able to account for this by taking seed phrases or words.
 This is currently only possible with KeyNMF in Turftopic, but will likely be extended in the future.

-In [KeyNMF](../keynmf.md), you can describe the aspect, from which you want to investigate your corpus, using a free-text seed-phrase,
+In [KeyNMF](keynmf.md), you can describe the aspect, from which you want to investigate your corpus, using a free-text seed-phrase,
 which will then be used to only extract topics, which are relevant to your research question.

 In this example we investigate the 20Newsgroups corpus from three different aspects:

docs/vectorizers.md (2 additions, 2 deletions)

@@ -113,7 +113,7 @@ Since the same word can appear in multiple forms in a piece of text, one can som

 ### Extracting lemmata with `LemmaCountVectorizer`

-Similarly to `NounPhraseCountVectorizer`, `LemmaCountVectorizer` relies on a [SpaCy](spacy.io) pipeline for extracting lemmas from a piece of text.
+Similarly to `NounPhraseCountVectorizer`, `LemmaCountVectorizer` relies on a [SpaCy](https://spacy.io/) pipeline for extracting lemmas from a piece of text.
 This means you will have to install SpaCy and a SpaCy pipeline to be able to use it.

 ```bash

@@ -180,7 +180,7 @@ In these cases we recommend that you use a vectorizer with its own language-spec

 ### Vectorizing Any Language with `TokenCountVectorizer`

-The [SpaCy](spacy.io) package includes language-specific tokenization and stop-word rules for just about any language.
+The [SpaCy](https://spacy.io/) package includes language-specific tokenization and stop-word rules for just about any language.
 We provide a vectorizer that you can use with the language of your choice.

 ```bash
