Added cross-lingual KeyNMF to docs

x-tabdeveloping · x-tabdeveloping · commit 71cb95dd9cef · 2025-02-18T11:23:44.000+01:00
diff --git a/docs/KeyNMF.md b/docs/KeyNMF.md
@@ -299,6 +299,48 @@ print(model.hierarchy)
 
 For a detailed tutorial on hierarchical modeling click [here](hierarchical.md).
 
+## Cross-lingual KeyNMF
+
+KeyNMF, by default, does not come with cross-lingual capabilities, since only words that appear in a document can be assigned to it as keywords.
+We, however provide a term-matching scheme that allows you to match words across languages based on their cosine similarity in a multilingual embedding model.
+
+This is done by:
+
+1. Computing a similarity matrix over terms.
+2. Checking, which terms have similarity over a given threshold (_0.9_ is the default)
+3. Building a graph from these connections, and finding graph components.
+4. Adding up term importances for terms that appear in the same component for all documents.
+
+```python
+from datasets import load_dataset
+from sklearn.feature_extraction.text import CountVectorizer
+
+from turftopic import KeyNMF
+
+# Loading a parallel corpus
+ds = load_dataset(
+    "aiana94/polynews-parallel", "dan_Latn-hun_Latn", split="train"
+)
+corpus = ds["src"] + ds["tgt"]
+
+model = KeyNMF(
+    10,
+    cross_lingual=True,
+    encoder="paraphrase-multilingual-MiniLM-L12-v2",
+    vectorizer=CountVectorizer()
+)
+model.fit(corpus)
+model.print_topics()
+```
+
+| Topic ID | Highest Ranking |
+| - | - |
+| ... | |
+| 4 | internettets-internettet-interneten, nyitottság-åbne-åbnede, censurer-cenzúra-cenzúrázása, crowdsourcing-crowdsourcinghez, ytringsfrihed-szólásszabadság, hálózat-netværke-netværket, kommunikálhat-kommunikere, orosz-oroszországi-oroszországban, lært-uddanelse-oktatásnak, szabadság-szabadságát-friheder |
+| 5 | colombianske-colombia-kolumbiai, hangjai-voicesnál-voices, dignity-méltóság, béketárgyalásokba-béke-békét, női-nőket-kvindelige, áldozatok-ofre-áldozata, viszály-konflikter-konflikt, jogairól-rettighederne-jogainak, petronilas-petronila, bevæbnede-fegyveres-pisztolyt |
+| 6 | karikaturistára-karikaturtegning-karikaturista, bloggermøde-blogs-bloggere, hver-international-letartóztatásával, rslans-rslan, történetét-historier-biografi, kritikere-kritikát-kritisk, salvadori-salvador, szeptember-september-júliusban, aktivistát-aktivisták-aktivister, vietnami-vietnamesiske |
+| ... | |
+
 ## Online Topic Modeling
 
 KeyNMF can also be fitted in an online manner.
diff --git a/docs/model_overview.md b/docs/model_overview.md
@@ -57,7 +57,7 @@ In general, the most balanced models are $S^3$, Clustering models with `centroid
 
 | Model | :1234: Multiple Topics per Document  | :hash: Detecting Number of Topics  | :chart_with_upwards_trend: Dynamic Modeling  | :evergreen_tree: Hierarchical Modeling  | :star: Inference over New Documents  | :globe_with_meridians: Cross-Lingual  | :ocean: Online Fitting  |
 | - | - | - | - | - | - | - | - |
-| **[KeyNMF](KeyNMF.md)** | :heavy_check_mark: | :x: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :x:  | :heavy_check_mark: |
+| **[KeyNMF](KeyNMF.md)** | :heavy_check_mark: | :x: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark:  | :heavy_check_mark: |
 | **[SemanticSignalSeparation](s3.md)** | :heavy_check_mark: | :x: | :heavy_check_mark: | :x: | :heavy_check_mark: | :heavy_check_mark: | :x: |
 | **[ClusteringTopicModel](clustering.md)** | :x: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :x: | :heavy_check_mark: | :x: |
 | **[GMM](GMM.md)** | :heavy_check_mark: | :x: | :heavy_check_mark: | :x: | :heavy_check_mark: | :heavy_check_mark: | :x: |