Commit 71cb95d

Added cross-lingual KeyNMF to docs
1 parent 2b934af commit 71cb95d

2 files changed

Lines changed: 43 additions & 1 deletion

docs/KeyNMF.md

Lines changed: 42 additions & 0 deletions
@@ -299,6 +299,48 @@ print(model.hierarchy)
For a detailed tutorial on hierarchical modeling click [here](hierarchical.md).

## Cross-lingual KeyNMF

By default, KeyNMF has no cross-lingual capabilities, since only words that appear in a document can be assigned to it as keywords.
We do, however, provide a term-matching scheme that lets you match words across languages based on their cosine similarity in a multilingual embedding model.

This is done by:

1. Computing a similarity matrix over terms.
2. Checking which terms have a similarity above a given threshold (_0.9_ is the default).
3. Building a graph from these connections and finding its connected components.
4. Adding up term importances, for all documents, over terms that appear in the same component.

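The four steps above can be sketched on toy data as follows. This is a minimal, dependency-free illustration and not Turftopic's actual implementation: the helper names `term_components` and `merge_importances`, the two-dimensional toy embeddings and the importance values are all made up for the example, and treating the threshold as a strict "greater than" is an assumption.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def term_components(embeddings, threshold=0.9):
    """Steps 1-3: link terms whose similarity exceeds the threshold and
    return one connected-component label per term (union-find)."""
    n = len(embeddings)
    parent = list(range(n))

    def find(i):
        # Follow parent pointers to the root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if cosine(embeddings[i], embeddings[j]) > threshold:
                parent[find(i)] = find(j)

    # Relabel the roots as consecutive integers 0, 1, 2, ...
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


def merge_importances(doc_term, labels):
    """Step 4: for every document, sum the importances of terms that
    share a component label."""
    n_groups = max(labels) + 1
    merged = []
    for row in doc_term:
        out = [0.0] * n_groups
        for weight, label in zip(row, labels):
            out[label] += weight
        merged.append(out)
    return merged


# Hypothetical toy data: two Danish-Hungarian word pairs.
terms = ["frihed", "szabadság", "netværk", "hálózat"]
embeddings = [[1.0, 0.05], [0.98, 0.1], [0.1, 1.0], [0.05, 0.97]]
# Rows are documents, columns are per-term importances.
doc_term = [[0.5, 0.0, 0.2, 0.0], [0.0, 0.4, 0.0, 0.3]]

labels = term_components(embeddings)
merged = merge_importances(doc_term, labels)
print(labels)
print(merged)
```

In this toy setup the Danish and Hungarian words for "freedom" and "network" each end up in one component, so every document's importances collapse from four columns to two. In practice you do not need to write any of this yourself; passing `cross_lingual=True` to `KeyNMF` turns the matching on:
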
```python
from datasets import load_dataset
from sklearn.feature_extraction.text import CountVectorizer

from turftopic import KeyNMF

# Loading a parallel corpus
ds = load_dataset(
    "aiana94/polynews-parallel", "dan_Latn-hun_Latn", split="train"
)
# Pool both language sides of the parallel corpus into one list of documents
corpus = ds["src"] + ds["tgt"]

model = KeyNMF(
    10,
    cross_lingual=True,
    # A multilingual encoder, so that terms from both languages share one embedding space
    encoder="paraphrase-multilingual-MiniLM-L12-v2",
    vectorizer=CountVectorizer()
)
model.fit(corpus)
model.print_topics()
```

| Topic ID | Highest Ranking |
| - | - |
| ... | |
| 4 | internettets-internettet-interneten, nyitottság-åbne-åbnede, censurer-cenzúra-cenzúrázása, crowdsourcing-crowdsourcinghez, ytringsfrihed-szólásszabadság, hálózat-netværke-netværket, kommunikálhat-kommunikere, orosz-oroszországi-oroszországban, lært-uddanelse-oktatásnak, szabadság-szabadságát-friheder |
| 5 | colombianske-colombia-kolumbiai, hangjai-voicesnál-voices, dignity-méltóság, béketárgyalásokba-béke-békét, női-nőket-kvindelige, áldozatok-ofre-áldozata, viszály-konflikter-konflikt, jogairól-rettighederne-jogainak, petronilas-petronila, bevæbnede-fegyveres-pisztolyt |
| 6 | karikaturistára-karikaturtegning-karikaturista, bloggermøde-blogs-bloggere, hver-international-letartóztatásával, rslans-rslan, történetét-historier-biografi, kritikere-kritikát-kritisk, salvadori-salvador, szeptember-september-júliusban, aktivistát-aktivisták-aktivister, vietnami-vietnamesiske |
| ... | |

## Online Topic Modeling

KeyNMF can also be fitted in an online manner.

docs/model_overview.md

Lines changed: 1 addition & 1 deletion
@@ -57,7 +57,7 @@ In general, the most balanced models are $S^3$, Clustering models with `centroid

| Model | :1234: Multiple Topics per Document | :hash: Detecting Number of Topics | :chart_with_upwards_trend: Dynamic Modeling | :evergreen_tree: Hierarchical Modeling | :star: Inference over New Documents | :globe_with_meridians: Cross-Lingual | :ocean: Online Fitting |
| - | - | - | - | - | - | - | - |
- | **[KeyNMF](KeyNMF.md)** | :heavy_check_mark: | :x: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :x: | :heavy_check_mark: |
+ | **[KeyNMF](KeyNMF.md)** | :heavy_check_mark: | :x: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| **[SemanticSignalSeparation](s3.md)** | :heavy_check_mark: | :x: | :heavy_check_mark: | :x: | :heavy_check_mark: | :heavy_check_mark: | :x: |
| **[ClusteringTopicModel](clustering.md)** | :x: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :x: | :heavy_check_mark: | :x: |
| **[GMM](GMM.md)** | :heavy_check_mark: | :x: | :heavy_check_mark: | :x: | :heavy_check_mark: | :heavy_check_mark: | :x: |
