Commit 4491342

Merge branch 'main' into partial_keynmf
2 parents ecffc37 + ef5fd2f

6 files changed

Lines changed: 178 additions & 77 deletions


README.md

Lines changed: 39 additions & 30 deletions
@@ -4,26 +4,39 @@
<b>Topic modeling is your turf too.</b> <br> <i> Contextual topic models with representations from transformers. </i></p>


## Intentions
- Provide simple, robust and fast implementations of existing approaches (BERTopic, Top2Vec, CTM) with minimal dependencies.
- Implement state-of-the-art approaches from my papers. (papers work in progress)
- Put all approaches in a broader conceptual framework.
- Provide clear and extensive documentation about the best use cases for each model.
- Make the models' API streamlined and compatible with topicwizard and scikit-learn.
- Develop smarter, transformer-based evaluation metrics.

## Roadmap
- [x] Model Implementation
- [x] Pretty Printing
- [x] Implement visualization utilities for these models in topicwizard
- [x] Thorough documentation
- [x] Dynamic modeling (currently `GMM` and `ClusteringTopicModel`; others might follow)
- [ ] Publish papers :hourglass_flowing_sand: (in progress...)
- [ ] High-level topic descriptions with LLMs.
- [ ] Contextualized evaluation metrics.

## Features
- Novel transformer-based topic models:
  - Semantic Signal Separation - S³ 🧭
  - KeyNMF 🔑
  - GMM
- Implementations of existing transformer-based topic models:
  - Clustering Topic Models: BERTopic and Top2Vec
  - Autoencoding Topic Models: CombinedTM and ZeroShotTM
- Streamlined scikit-learn compatible API 🛠️
- Easy topic interpretation 🔍
- Dynamic Topic Modeling 📈 (GMM, ClusteringTopicModel and KeyNMF)
- Visualization with [topicwizard](https://github.com/x-tabdeveloping/topicwizard) 🖌️

> This package is still a work in progress, and scientific papers on some of the novel methods are currently undergoing peer review. If you use this package and encounter any problems, let us know by opening a relevant issue.

#### New in version 0.3.0: Dynamic KeyNMF

KeyNMF can now be used for dynamic topic modeling:

```python
from datetime import datetime

from turftopic import KeyNMF

corpus: list[str] = [...]
timestamps: list[datetime] = [...]

model = KeyNMF(10)
doc_topic_matrix = model.fit_transform_dynamic(
    corpus, timestamps=timestamps, bins=10
)

model.print_topics_over_time()

# This needs Plotly: pip install plotly
model.plot_topics_over_time()
```

## Basics [(Documentation)](https://x-tabdeveloping.github.io/turftopic/)
[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/x-tabdeveloping/turftopic/blob/main/examples/basic_example_20newsgroups.ipynb)
@@ -146,14 +159,10 @@ topicwizard.visualize(corpus, model=model)
Alternatively you can use the [Figures API](https://x-tabdeveloping.github.io/topicwizard/figures.html) in topicwizard for individual HTML figures.
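For instance, a word map over two discovered topic axes can be produced like this (a sketch adapted from the Word Map example in the docs; the corpus and the topic axis names are placeholders):

```python
from topicwizard import figures
from turftopic import SemanticSignalSeparation

corpus: list[str] = [...]  # your documents (placeholder)

model = SemanticSignalSeparation(10)
topic_data = model.prepare_topic_data(corpus)

# Axis names below are illustrative; substitute the topic names
# discovered on your own corpus.
figures.word_map(
    topic_data,
    topic_axes=(
        "9_api_apis_register_automatedsarcasmgenerator",
        "4_study_studying_assessments_exams",
    ),
)
```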

## Models

| Model | Description | Usage |
| - | - | - |
| KeyNMF | Non-negative Matrix Factorization enhanced with keyword extraction using sentence embeddings | `model = KeyNMF(n_components=10).fit(corpus)` |
| GMM | Gaussian Mixture Model over contextual embeddings + post-hoc term importance estimation | `model = GMM(n_components=10).fit(corpus)` |
| S³ | Separates semantic signals, i.e. axes of semantics, in a corpus using independent component analysis. | `model = SemanticSignalSeparation(n_components=10).fit(corpus)` |
| Autoencoding Models | Learn topics using amortized variational inference enhanced by contextual representations. | `model = AutoEncodingTopicModel(n_components=10, combined=False).fit(corpus)` |
| Clustering Models | Cluster semantic embeddings, and estimate term importances for the clusters. | `model = ClusteringTopicModel(feature_importance="ctfidf").fit(corpus)` |

For an extensive comparison see our [Model Overview](https://x-tabdeveloping.github.io/turftopic/model_overview/).

## References

- Kardos, M., Kostkan, J., Vermillet, A., Nielbo, K., Enevoldsen, K., & Rocca, R. (2024). $S^3$ - Semantic Signal Separation. arXiv. https://arxiv.org/abs/2406.09556
- Grootendorst, M. (2022). BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv. https://arxiv.org/abs/2203.05794
- Angelov, D. (2020). Top2Vec: Distributed representations of topics. arXiv. https://arxiv.org/abs/2008.09470
- Bianchi, F., Terragni, S., & Hovy, D. (2020). Pre-training is a Hot Topic: Contextualized Document Embeddings Improve Topic Coherence. arXiv. https://arxiv.org/abs/2004.03974
- Bianchi, F., Terragni, S., Hovy, D., Nozza, D., & Fersini, E. (2021). Cross-lingual Contextualized Topic Models with Zero-shot Learning. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume (pp. 1676–1683). Association for Computational Linguistics.

docs/images/arxiv_ml_compass.png

285 KB

docs/images/arxiv_ml_map.png

439 KB

docs/images/s3_math_correct.png

319 KB

docs/s3.md

Lines changed: 63 additions & 47 deletions
@@ -1,85 +1,102 @@
# Semantic Signal Separation ($S^3$)

Semantic Signal Separation tries to recover the dimensions/axes along which most of the semantic variation can be explained.
A topic in S³ is an axis of semantics in the corpus.
This makes the model able to recover more nuanced topical content in documents, but it is not optimal when you expect topics to be groupings of documents.

<figure>
  <img src="../images/s3_math_correct.png" width="60%" style="margin-left: auto;margin-right: auto;">
  <figcaption> Schematic overview of S³ </figcaption>
</figure>

## The Model

### 0. Encoding

Documents in $S^3$ are first encoded using an [encoder model](encoders.md).

- Let the encodings of the documents in the corpus be $X$.

### 1. Decomposition

The next step is to decompose the embedding matrix using ICA; this step discovers the underlying semantic axes as latent independent components in the embeddings.

- Decompose $X$ using FastICA: $X = AS$, where $A$ is the mixing matrix and $S$ is the document-topic matrix.

### 2. Term Importance Estimation

Term importances for each topic are calculated by encoding the entire vocabulary of the corpus using the same embedding model, then recovering the strength of each latent component in the word embedding matrix.
The strength of a component in a word's embedding is interpreted as that word's importance in the given topic.

- Let the matrix of word encodings be $V$.
- Calculate the pseudo-inverse of the mixing matrix, $C = A^{+}$, where $C$ is the _unmixing matrix_.
- Estimate component strength by multiplying word encodings with the unmixing matrix: $W = VC^T$. $W^T$ is then the topic-term matrix (`model.components_`).
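Conceptually, these steps boil down to the following minimal sketch using scikit-learn and numpy (an illustration only, not Turftopic's exact internals; the encoder choice, `corpus` and `vocab` are placeholder assumptions):

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import FastICA

corpus: list[str] = [...]  # your documents (placeholder)
vocab: list[str] = [...]   # vocabulary extracted from the corpus (placeholder)

# Encode documents and vocabulary with the same encoder (arbitrary model choice).
encoder = SentenceTransformer("all-MiniLM-L6-v2")
X = encoder.encode(corpus)   # document embeddings

ica = FastICA(n_components=10)
S = ica.fit_transform(X)     # document-topic matrix
A = ica.mixing_              # mixing matrix

V = encoder.encode(vocab)    # word embeddings
C = np.linalg.pinv(A)        # unmixing matrix
W = V @ C.T                  # term-topic matrix; W.T corresponds to components_
```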

## Comparison to Classical Models

$S^3$ is potentially the closest you can get with contextually sensitive models to classical matrix decomposition approaches, such as NMF or Latent Semantic Analysis.
The conceptualization is very similar to these models, but instead of recovering factors of word use, $S^3$ recovers dimensions in a continuous semantic space.
This means that you get many of the advantages of those models, including incredible speed, low sensitivity to hyperparameters and stable results.

Most of the intuitions you have about LSA will also apply to $S^3$, but it might give more surprising results, as embedding models can learn efficient representations of semantics that differ from humans'.

$S^3$ is also much more robust to stop words, meaning that you won't have to do extensive preprocessing.

## Interpretation

S³ is one of the trickier models to interpret due to the way it conceptualizes topics.
Unlike many other models, the fact that a word ranks very low for a topic is also useful information for interpretation's sake.
In other words, both ends of term importance matter for S³: the words that rank highest, and the words that rank lowest.

### Negative terms

Terms that rank lowest on a topic also carry meaning in $S^3$.
Whenever interpreting semantic axes, you should probably consider both ends of the axis.
As such, when you print or export topics from $S^3$, the lowest ranking terms will also be shown along with the highest ranking ones.

Here's an example on ArXiv ML papers:

```python
from turftopic import SemanticSignalSeparation
from sklearn.feature_extraction.text import CountVectorizer

model = SemanticSignalSeparation(5, vectorizer=CountVectorizer(), random_state=42)
model.fit(corpus)

model.print_topics(top_k=5)
```

| | **Positive** | **Negative** |
|---|---|---|
| 0 | clustering, histograms, clusterings, histogram, classifying | reinforcement, exploration, planning, tactics, reinforce |
| 1 | textual, pagerank, litigants, marginalizing, entailment | matlab, waveforms, microcontroller, accelerometers, microcontrollers |
| 2 | sparsestmax, denoiseing, denoising, minimizers, minimizes | automation, affective, chatbots, questionnaire, attitudes |
| 3 | rebmigraph, subgraph, subgraphs, graphsage, graph | adversarial, adversarially, adversarialization, adversary, security |
| 4 | clustering, estimations, algorithm, dbscan, estimation | cnn, deepmind, deeplabv3, convnet, deepseenet |

### Concept Compass

If you want to gain a deeper understanding of terms' relation to axes, you can produce a *concept compass*.
This involves plotting the terms in a corpus along two semantic axes.

In order to use the compass in Turftopic you will need to have `plotly` installed:

```bash
pip install plotly
```

You can display a compass based on a fitted model like so:

```python
fig = model.concept_compass(topic_x=1, topic_y=4)
fig.show()
```

<figure>
  <img src="../images/arxiv_ml_compass.png" width="60%" style="margin-left: auto;margin-right: auto;">
  <figcaption> Concept Compass of ArXiv ML Papers along two semantic axes. </figcaption>
</figure>

## Considerations

### Strengths
@@ -91,7 +108,6 @@ figures.word_map(

### Weaknesses

- Noise Components: The model tends to find components in corpora that only contain noise. This is typical in other applications of ICA as well, and it is frequently used for noise removal in other disciplines. We are working on automated solutions to detect and flag these components.
- Sometimes Unintuitive: Neural embedding models might have a different mapping of the semantic space than humans. Sometimes S³ uncovers unintuitive dimensions of meaning as a result of this.
- Moderate Scalability: The model cannot be fitted in an online fashion. It is reasonably scalable, but for very large corpora you might want to consider using a different model.

turftopic/models/decomp.py

Lines changed: 76 additions & 0 deletions
@@ -6,6 +6,7 @@
```python
from sklearn.base import TransformerMixin
from sklearn.decomposition import FastICA
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances

from turftopic.base import ContextualModel, Encoder
from turftopic.vectorizer import default_vectorizer
```
@@ -170,3 +171,78 @@ def export_representative_documents(
```python
            show_negative,
            format,
        )

    def concept_compass(
        self, topic_x: Union[int, str], topic_y: Union[str, int]
    ):
        """Display a compass of concepts along two semantic axes.
        In order for the plot to be concise and readable, terms are randomly selected on
        a grid of the two topics.

        Parameters
        ----------
        topic_x: int or str
            Index or name of the topic to display on the X axis.
        topic_y: int or str
            Index or name of the topic to display on the Y axis.

        Returns
        -------
        go.Figure
            Plotly interactive plot of the concept compass.
        """
        try:
            import plotly.express as px
        except (ImportError, ModuleNotFoundError) as e:
            raise ModuleNotFoundError(
                "Please install plotly if you intend to use plots in Turftopic."
            ) from e
        # Resolve topic names to component indices.
        if isinstance(topic_x, str):
            try:
                topic_x = list(self.topic_names).index(topic_x)
            except ValueError as e:
                raise ValueError(
                    f"{topic_x} is not a valid topic name or index."
                ) from e
        if isinstance(topic_y, str):
            try:
                topic_y = list(self.topic_names).index(topic_y)
            except ValueError as e:
                raise ValueError(
                    f"{topic_y} is not a valid topic name or index."
                ) from e
        x = self.components_[topic_x]
        y = self.components_[topic_y]
        vocab = self.get_vocab()
        points = np.array(list(zip(x, y)))
        # Lay a 20x20 grid over the two components' range, jitter it slightly,
        # then keep only the vocabulary item closest to each grid point,
        # so the plot stays readable instead of showing every term.
        xx, yy = np.meshgrid(
            np.linspace(np.min(x), np.max(x), 20),
            np.linspace(np.min(y), np.max(y), 20),
        )
        coords = np.array(list(zip(np.ravel(xx), np.ravel(yy))))
        coords = coords + np.random.default_rng(0).normal(
            [0, 0], [0.1, 0.1], size=coords.shape
        )
        dist = euclidean_distances(coords, points)
        idxs = np.argmin(dist, axis=1)
        fig = px.scatter(
            x=x[idxs],
            y=y[idxs],
            text=vocab[idxs],
            template="plotly_white",
        )
        fig = fig.update_traces(
            mode="text", textfont_color="black", marker=dict(color="black")
        ).update_layout(
            xaxis_title=f"{self.topic_names[topic_x]}",
            yaxis_title=f"{self.topic_names[topic_y]}",
        )
        fig = fig.update_layout(
            width=1000,
            height=1000,
            font=dict(family="Times New Roman", color="black", size=21),
            margin=dict(l=5, r=5, t=5, b=5),
        )
        # Axis lines through the origin make the two semantic poles visible.
        fig = fig.add_hline(y=0, line_color="black", line_width=4)
        fig = fig.add_vline(x=0, line_color="black", line_width=4)
        return fig
```
