Commit 24c60ff

Merge pull request #43 from x-tabdeveloping/dynamic_keynmf

Dynamic keynmf

2 parents 0ec180e + 0226673

6 files changed: 97 additions & 32 deletions

docs/KeyNMF.md

Lines changed: 7 additions & 1 deletion

@@ -6,7 +6,7 @@ while taking inspiration from classical matrix-decomposition approaches for extr
 ## The Model
 
 <figure>
-    <img src="/images/keynmf.png" width="90%" style="margin-left: auto;margin-right: auto;">
+    <img src="../images/keynmf.png" width="90%" style="margin-left: auto;margin-right: auto;">
     <figcaption>Schematic overview of KeyNMF</figcaption>
 </figure>
 
@@ -30,6 +30,12 @@ Topics in this matrix are then discovered using Non-negative Matrix Factorizatio
 Essentially the model tries to discover underlying dimensions/factors along which most of the variance in term importance
 can be explained.
 
+### _(Optional)_ 3. Dynamic Modeling
+
+KeyNMF is also capable of modeling topics over time.
+This happens by first fitting a KeyNMF model on the entire corpus, then
+fitting individual topic-term matrices with coordinate descent, based on the document-topic and document-term matrices in the given time slices.
+
 ## Considerations
 
 ### Strengths
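The dynamic-modeling step added to the docs above can be sketched numerically: hold a time slice's document-topic matrix fixed and refit only the topic-term matrix. The sketch below uses plain NMF multiplicative updates as a simplified stand-in for the coordinate descent used in this PR; all names, shapes, and data are illustrative.

```python
import numpy as np

def refit_topic_term(X, W, n_iter=300, eps=1e-9):
    """Solve X ~= W @ H for a nonnegative H with W held fixed.

    Simplified stand-in (multiplicative updates) for the
    coordinate-descent refit dynamic KeyNMF performs per time slice.
    """
    rng = np.random.default_rng(0)
    H = rng.random((W.shape[1], X.shape[1]))
    for _ in range(n_iter):
        # Standard NMF multiplicative update for H with W fixed
        H *= (W.T @ X) / (W.T @ W @ H + eps)
    return H

# Synthetic "time slice": 40 documents, 3 topics, 25 terms
rng = np.random.default_rng(42)
W = rng.random((40, 3))      # document-topic matrix (kept fixed)
X = W @ rng.random((3, 25))  # document-term matrix of the slice
H = refit_topic_term(X, W)   # slice-specific topic-term matrix
rel_err = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
```

Because W is fixed, the problem is convex in H, so even this simple update scheme recovers a slice-specific topic-term matrix that reconstructs the slice well.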

docs/dynamic.md

Lines changed: 8 additions & 19 deletions

@@ -4,41 +4,30 @@ If you want to examine the evolution of topics over time, you will need a dynami
 
 > Note that regular static models can also be used to study the evolution of topics and information dynamics, but they can't capture changes in the topics themselves.
 
-## Theory
+## Models
 
-A number of different conceptualizations can be used to study evolving topics in corpora, for instance:
-
-1. One can imagine topic representations to be governed by a Brownian Markov Process (random walk), in such a case the evolution is part of the model itself.
-In layman's terms you describe the evolution of topics directly in your generative model by expecting the topic representations to be sampled from Gaussian noise around the last time step.
-Sometimes researchers will also refer to such models as _state-space_ approaches.
-This is the approach that the original [DTM paper](https://mimno.infosci.cornell.edu/info6150/readings/dynamic_topic_models.pdf) utilizes.
-Along with [this paper](https://arxiv.org/pdf/1709.00025.pdf) on Dynamic NMF.
-2. You can fit one underlying statistical model over the entire corpus, and then do post-hoc term importance estimation per time slice.
-This is [what BERTopic does](https://maartengr.github.io/BERTopic/getting_started/topicsovertime/topicsovertime.html).
-3. You can fit one model per time slice, and then use some aggregation procedure to merge the models.
-This approach is used in the Dynamic NMF in [this paper](https://www.cambridge.org/core/journals/political-analysis/article/exploring-the-political-agenda-of-the-european-parliament-using-a-dynamic-topic-modeling-approach/BBC7751778E4542C7C6C69E6BF954E4B).
-
-Developing such approaches takes a lot of time and effort, and we have plans to add dynamic modeling capabilities to all models in Turftopic.
-For now only models of the second kind are on our list of things to do, and dynamic topic modeling has been implemented for GMM, and will soon be implemented for Clustering Topic Models.
-For more theoretical background, see the page on [GMM](GMM.md).
+In Turftopic you can currently use three different topic models for modeling topics over time:
+1. [ClusteringTopicModel](clustering.md), where an overall model is fitted on the whole corpus, and then term importances are estimated over time slices.
+2. [GMM](GMM.md), where, similarly to clustering models, term importances are re-estimated per time slice.
+3. [KeyNMF](KeyNMF.md), where an overall decomposition is done first, and topic-term matrices are then recalculated with coordinate descent, based on document-topic importances in the given time slice.
 
 ## Usage
 
 Dynamic topic models in Turftopic have a unified interface.
 To fit a dynamic topic model you will need a corpus that has been annotated with timestamps.
 The timestamps need to be Python `datetime` objects, but pandas `Timestamp` objects are also supported.
 
-Models that have dynamic modeling capabilities (currently, `GMM` and `ClusteringTopicModel`) have a `fit_transform_dynamic()` method, that fits the model on the corpus over time.
+Models that have dynamic modeling capabilities (`KeyNMF`, `GMM` and `ClusteringTopicModel`) have a `fit_transform_dynamic()` method that fits the model on the corpus over time.
 
 ```python
 from datetime import datetime
 
-from turftopic import GMM
+from turftopic import KeyNMF
 
 corpus: list[str] = [...]
 timestamps: list[datetime] = [...]
 
-model = GMM(5)
+model = KeyNMF(5)
 document_topic_matrix = model.fit_transform_dynamic(corpus, timestamps=timestamps)
 ```
 
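Once fitted, the dynamic models in this commit store per-slice topic-term matrices in a `temporal_components_` array of shape (n_bins, n_topics, n_vocab). Below is a minimal sketch of reading top terms per time slice out of such an array; the array and vocabulary here are random placeholders, not real model output:

```python
import numpy as np

# Placeholder for model.temporal_components_: (n_bins, n_topics, n_vocab)
rng = np.random.default_rng(0)
n_bins, n_topics, n_vocab = 3, 2, 8
temporal_components = rng.random((n_bins, n_topics, n_vocab))
vocab = np.array([f"term_{i}" for i in range(n_vocab)])

# Top 3 terms for every (time slice, topic) pair
top_terms = {
    (t, k): list(vocab[np.argsort(-temporal_components[t, k])[:3]])
    for t in range(n_bins)
    for k in range(n_topics)
}
```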

docs/model_overview.md

Lines changed: 1 addition & 1 deletion

@@ -90,7 +90,7 @@ Here is an opinionated guide for common use cases:
 ### 1. When in doubt **use KeyNMF**.
 
 When you can't make an informed decision about which model is optimal for your use case, or you just want to get your hands dirty with topic modeling,
-KeyNMF is the best option.
+KeyNMF is by far the best option.
 It is very stable, gives high quality topics, and is incredibly robust to noise.
 It is also the closest to classical topic models and thus conforms to your intuition about topic modeling.

pyproject.toml

Lines changed: 1 addition & 1 deletion

@@ -6,7 +6,7 @@ line-length=79
 
 [tool.poetry]
 name = "turftopic"
-version = "0.2.13"
+version = "0.3.0"
 description = "Topic modeling with contextual representations from sentence transformers."
 authors = ["Márton Kardos <power.up1163@gmail.com>"]
 license = "MIT"

tests/test_integration.py

Lines changed: 5 additions & 9 deletions

@@ -1,5 +1,5 @@
-from datetime import datetime
 import tempfile
+from datetime import datetime
 from pathlib import Path
 
 import numpy as np
@@ -8,13 +8,8 @@
 from sentence_transformers import SentenceTransformer
 from sklearn.datasets import fetch_20newsgroups
 
-from turftopic import (
-    GMM,
-    AutoEncodingTopicModel,
-    ClusteringTopicModel,
-    KeyNMF,
-    SemanticSignalSeparation,
-)
+from turftopic import (GMM, AutoEncodingTopicModel, ClusteringTopicModel,
+                       KeyNMF, SemanticSignalSeparation)
 
 
 def generate_dates(
@@ -75,8 +70,9 @@ def generate_dates(
         n_reduce_to=5,
         feature_importance="soft-c-tf-idf",
         encoder=trf,
-        reduction_method="smallest"
+        reduction_method="smallest",
     ),
+    KeyNMF(5, encoder=trf),
 ]
 
 
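The test module calls a `generate_dates` helper whose body is elided from this diff (only its signature appears). A hypothetical re-implementation consistent with how it is used — producing random timestamps to feed `fit_transform_dynamic()` — might look like:

```python
import random
from datetime import datetime, timedelta

def generate_dates(n_dates: int, seed: int = 0) -> list:
    """Hypothetical sketch of the elided test helper: random
    datetimes spread over a ten-year window."""
    rng = random.Random(seed)
    start = datetime(2000, 1, 1)
    return [
        start + timedelta(days=rng.randint(0, 10 * 365))
        for _ in range(n_dates)
    ]

timestamps = generate_dates(100)
```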

turftopic/models/keynmf.py

Lines changed: 75 additions & 1 deletion

@@ -1,22 +1,67 @@
 import itertools
 import json
 import random
+from datetime import datetime
 from typing import Dict, Iterable, List, Optional, Union
 
 import numpy as np
 from rich.console import Console
 from sentence_transformers import SentenceTransformer
 from sklearn.decomposition import NMF, MiniBatchNMF
+from sklearn.decomposition._nmf import (_initialize_nmf,
+                                        _update_coordinate_descent)
 from sklearn.exceptions import NotFittedError
 from sklearn.feature_extraction import DictVectorizer
 from sklearn.feature_extraction.text import CountVectorizer
 from sklearn.metrics.pairwise import cosine_similarity
+from sklearn.utils import check_array
 
 from turftopic.base import ContextualModel, Encoder
 from turftopic.data import TopicData
+from turftopic.dynamic import DynamicTopicModel, bin_timestamps
 from turftopic.vectorizer import default_vectorizer
 
 
+def fit_timeslice(
+    X,
+    W,
+    H,
+    tol=1e-4,
+    max_iter=200,
+    l1_reg_W=0,
+    l1_reg_H=0,
+    l2_reg_W=0,
+    l2_reg_H=0,
+    verbose=0,
+    shuffle=False,
+    random_state=None,
+):
+    """Fits topic_term_matrix based on a precomputed document_topic_matrix.
+    This is used to get temporal components in dynamic KeyNMF.
+    """
+    Ht = check_array(H.T, order="C")
+    if random_state is None:
+        rng = np.random.mtrand._rand
+    else:
+        rng = np.random.RandomState(random_state)
+    for n_iter in range(1, max_iter + 1):
+        violation = 0.0
+        violation += _update_coordinate_descent(
+            X.T, Ht, W, l1_reg_H, l2_reg_H, shuffle, rng
+        )
+        if n_iter == 1:
+            violation_init = violation
+        if violation_init == 0:
+            break
+        if verbose:
+            print("violation:", violation / violation_init)
+        if violation / violation_init <= tol:
+            if verbose:
+                print("Converged at iteration", n_iter + 1)
+            break
+    return W, Ht.T, n_iter
+
+
 def batched(iterable, n: int) -> Iterable[List[str]]:
     "Batch data into tuples of length n. The last batch may be shorter."
     # batched('ABCDEFG', 3) --> ABC DEF G
@@ -48,7 +93,7 @@ def __iter__(self) -> Iterable[Dict[str, float]]:
         yield deserialize_keywords(line.strip())
 
 
-class KeyNMF(ContextualModel):
+class KeyNMF(ContextualModel, DynamicTopicModel):
     """Extracts keywords from documents based on semantic similarity of
     term encodings to document encodings.
     Topics are then extracted with non-negative matrix factorization from
@@ -305,3 +350,32 @@ def prepare_topic_data(
             "topic_names": self.topic_names,
         }
         return res
+
+    def fit_transform_dynamic(
+        self,
+        raw_documents,
+        timestamps: list[datetime],
+        embeddings: Optional[np.ndarray] = None,
+        bins: Union[int, list[datetime]] = 10,
+    ) -> np.ndarray:
+        time_labels, self.time_bin_edges = bin_timestamps(timestamps, bins)
+        topic_data = self.prepare_topic_data(
+            raw_documents, embeddings=embeddings
+        )
+        n_bins = len(self.time_bin_edges) + 1
+        n_comp, n_vocab = self.components_.shape
+        self.temporal_components_ = np.zeros((n_bins, n_comp, n_vocab))
+        self.temporal_importance_ = np.zeros((n_bins, n_comp))
+        for label in np.unique(time_labels):
+            idx = np.nonzero(time_labels == label)
+            X = topic_data["document_term_matrix"][idx]
+            W = topic_data["document_topic_matrix"][idx]
+            _, H = _initialize_nmf(
+                X, self.components_.shape[0], random_state=self.random_state
+            )
+            _, H, _ = fit_timeslice(X, W, H, random_state=self.random_state)
+            self.temporal_components_[label] = H
+            topic_importances = np.squeeze(np.asarray(W.sum(axis=0)))
+            topic_importances = topic_importances / topic_importances.sum()
+            self.temporal_importance_[label] = topic_importances
+        return topic_data["document_topic_matrix"]
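`fit_transform_dynamic` above depends on `bin_timestamps` from `turftopic.dynamic`, which this diff does not show. A hypothetical sketch of the contract the method appears to assume — integer slice labels plus bin edges, with `n_bins == len(edges) + 1` — using `np.digitize` over evenly spaced interior edges:

```python
import numpy as np
from datetime import datetime, timedelta

def bin_timestamps_sketch(timestamps, bins=10):
    """Hypothetical stand-in for turftopic.dynamic.bin_timestamps."""
    ts = np.array([t.timestamp() for t in timestamps])
    # Interior edges only, so digitize yields labels 0..bins-1
    # and len(edges) + 1 == bins, matching fit_transform_dynamic.
    edges = np.linspace(ts.min(), ts.max(), bins + 1)[1:-1]
    labels = np.digitize(ts, edges)
    return labels, edges

start = datetime(2020, 1, 1)
stamps = [start + timedelta(days=30 * i) for i in range(24)]
labels, edges = bin_timestamps_sketch(stamps, bins=4)
```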
