Commit 4491342

Merge branch 'main' into partial_keynmf
2 parents ecffc37 + ef5fd2f

6 files changed

Lines changed: 178 additions & 77 deletions


README.md

Lines changed: 39 additions & 30 deletions
@@ -4,26 +4,39 @@
<b>Topic modeling is your turf too.</b> <br> <i> Contextual topic models with representations from transformers. </i></p>


## Intentions
- Provide simple, robust and fast implementations of existing approaches (BERTopic, Top2Vec, CTM) with minimal dependencies.
- Implement state-of-the-art approaches from my papers. (papers work in progress)
- Put all approaches in a broader conceptual framework.
- Provide clear and extensive documentation about the best use cases for each model.
- Make the models' API streamlined and compatible with topicwizard and scikit-learn.
- Develop smarter, transformer-based evaluation metrics.

## Roadmap
- [x] Model Implementation
- [x] Pretty Printing
- [x] Implement visualization utilities for these models in topicwizard
- [x] Thorough documentation
- [x] Dynamic modeling (currently `GMM` and `ClusteringTopicModel`; others might follow)
- [ ] Publish papers :hourglass_flowing_sand: (in progress...)
- [ ] High-level topic descriptions with LLMs.
- [ ] Contextualized evaluation metrics.

## Features
- Novel transformer-based topic models:
  - Semantic Signal Separation - S³ 🧭
  - KeyNMF 🔑
  - GMM
- Implementations of existing transformer-based topic models:
  - Clustering Topic Models: BERTopic and Top2Vec
  - Autoencoding Topic Models: CombinedTM and ZeroShotTM
- Streamlined scikit-learn compatible API 🛠️
- Easy topic interpretation 🔍
- Dynamic Topic Modeling 📈 (GMM, ClusteringTopicModel and KeyNMF)
- Visualization with [topicwizard](https://github.com/x-tabdeveloping/topicwizard) 🖌️

> This package is still a work in progress, and scientific papers on some of the novel methods are currently undergoing peer review. If you use this package and encounter any problems, let us know by opening a relevant issue.

#### New in version 0.3.0: Dynamic KeyNMF

KeyNMF can now be used for dynamic topic modeling:

```python
from datetime import datetime

from turftopic import KeyNMF

corpus: list[str] = [...]
timestamps: list[datetime] = [...]

model = KeyNMF(10)
doc_topic_matrix = model.fit_transform_dynamic(
    corpus, timestamps=timestamps, bins=10
)

model.print_topics_over_time()

# This needs Plotly: pip install plotly
model.plot_topics_over_time()
```

## Basics [(Documentation)](https://x-tabdeveloping.github.io/turftopic/)
[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/x-tabdeveloping/turftopic/blob/main/examples/basic_example_20newsgroups.ipynb)
@@ -146,14 +159,10 @@ topicwizard.visualize(corpus, model=model)
Alternatively you can use the [Figures API](https://x-tabdeveloping.github.io/topicwizard/figures.html) in topicwizard for individual HTML figures.
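For instance, a word map over two discovered topic axes can be produced like this (a sketch adapted from the Word Map example in the docs; the corpus and the topic axis names are placeholders):

```python
from topicwizard import figures
from turftopic import SemanticSignalSeparation

corpus: list[str] = [...]  # your documents (placeholder)

model = SemanticSignalSeparation(10)
topic_data = model.prepare_topic_data(corpus)

# Axis names below are illustrative; substitute the topic names
# discovered on your own corpus.
figures.word_map(
    topic_data,
    topic_axes=(
        "9_api_apis_register_automatedsarcasmgenerator",
        "4_study_studying_assessments_exams",
    ),
)
```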

## Models

| Model | Description | Usage |
| - | - | - |
| KeyNMF | Non-negative Matrix Factorization enhanced with keyword extraction using sentence embeddings | `model = KeyNMF(n_components=10).fit(corpus)` |
| GMM | Gaussian Mixture Model over contextual embeddings + post-hoc term importance estimation | `model = GMM(n_components=10).fit(corpus)` |
| S³ | Separates semantic signals, i.e. axes of semantics, in a corpus using independent component analysis. | `model = SemanticSignalSeparation(n_components=10).fit(corpus)` |
| Autoencoding Models | Learn topics using amortized variational inference enhanced by contextual representations. | `model = AutoEncodingTopicModel(n_components=10, combined=False).fit(corpus)` |
| Clustering Models | Cluster semantic embeddings, and estimate term importances for the clusters. | `model = ClusteringTopicModel(feature_importance="ctfidf").fit(corpus)` |

For an extensive comparison see our [Model Overview](https://x-tabdeveloping.github.io/turftopic/model_overview/).

## References

- Kardos, M., Kostkan, J., Vermillet, A., Nielbo, K., Enevoldsen, K., & Rocca, R. (2024). $S^3$ - Semantic Signal Separation. arXiv. https://arxiv.org/abs/2406.09556
- Grootendorst, M. (2022). BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv. https://arxiv.org/abs/2203.05794
- Angelov, D. (2020). Top2Vec: Distributed representations of topics. arXiv. https://arxiv.org/abs/2008.09470
- Bianchi, F., Terragni, S., & Hovy, D. (2020). Pre-training is a Hot Topic: Contextualized Document Embeddings Improve Topic Coherence. arXiv. https://arxiv.org/abs/2004.03974
- Bianchi, F., Terragni, S., Hovy, D., Nozza, D., & Fersini, E. (2021). Cross-lingual Contextualized Topic Models with Zero-shot Learning. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume (pp. 1676–1683). Association for Computational Linguistics.

docs/images/arxiv_ml_compass.png

285 KB

docs/images/arxiv_ml_map.png

439 KB

docs/images/s3_math_correct.png

319 KB

docs/s3.md

Lines changed: 63 additions & 47 deletions
@@ -1,85 +1,102 @@
# Semantic Signal Separation ($S^3$)

Semantic Signal Separation tries to recover the dimensions/axes along which most of the semantic variation can be explained.
A topic in S³ is an axis of semantics in the corpus.
This makes the model able to recover more nuanced topical content in documents, but it is not optimal when you expect topics to be groupings of documents.

<figure>
  <img src="../images/s3_math_correct.png" width="60%" style="margin-left: auto;margin-right: auto;">
  <figcaption> Schematic overview of S³ </figcaption>
</figure>

## The Model

### 0. Encoding

Documents in $S^3$ are first encoded using an [encoder model](encoders.md).

- Let the encodings of the documents in the corpus be $X$.

### 1. Decomposition

The next step is to decompose the embedding matrix using ICA; this step discovers the underlying semantic axes as latent independent components in the embeddings.

- Decompose $X$ using FastICA: $X = AS$, where $A$ is the mixing matrix and $S$ is the document-topic matrix.

### 2. Term Importance Estimation

Term importances for each topic are calculated by encoding the entire vocabulary of the corpus using the same embedding model, then recovering the strength of each latent component in the word embedding matrix.
The strength of a component in a word's embedding is interpreted as that word's importance in the given topic.

- Let the matrix of word encodings be $V$.
- Calculate the pseudo-inverse of the mixing matrix, $C = A^{+}$, where $C$ is the _unmixing matrix_.
- Estimate component strength by multiplying word encodings with the unmixing matrix: $W = VC^T$. $W^T$ is then the topic-term matrix (`model.components_`).
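Conceptually, these steps boil down to the following minimal sketch using scikit-learn and numpy (an illustration only, not Turftopic's exact internals; the encoder choice, `corpus` and `vocab` are placeholder assumptions):

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import FastICA

corpus: list[str] = [...]  # your documents (placeholder)
vocab: list[str] = [...]   # vocabulary extracted from the corpus (placeholder)

# Encode documents and vocabulary with the same encoder (arbitrary model choice).
encoder = SentenceTransformer("all-MiniLM-L6-v2")
X = encoder.encode(corpus)   # document embeddings

ica = FastICA(n_components=10)
S = ica.fit_transform(X)     # document-topic matrix
A = ica.mixing_              # mixing matrix

V = encoder.encode(vocab)    # word embeddings
C = np.linalg.pinv(A)        # unmixing matrix
W = V @ C.T                  # term-topic matrix; W.T corresponds to components_
```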

## Comparison to Classical Models

$S^3$ is potentially the closest you can get with contextually sensitive models to classical matrix decomposition approaches, such as NMF or Latent Semantic Analysis.
The conceptualization is very similar to these models, but instead of recovering factors of word use, $S^3$ recovers dimensions in a continuous semantic space.
This means that you get many of the advantages of those models, including incredible speed, low sensitivity to hyperparameters and stable results.

Most of the intuitions you have about LSA will also apply to $S^3$, but it might give more surprising results, as embedding models can learn efficient representations of semantics that differ from humans'.

$S^3$ is also much more robust to stop words, meaning that you won't have to do extensive preprocessing.

## Interpretation

S³ is one of the trickier models to interpret due to the way it conceptualizes topics.
Unlike many other models, the fact that a word ranks very low for a topic is also useful information for interpretation's sake.
In other words, both ends of term importance matter for S³: the words that rank highest, and the words that rank lowest.

### Negative terms

Terms that rank lowest on a topic also carry meaning in $S^3$.
Whenever interpreting semantic axes, you should probably consider both ends of the axis.
As such, when you print or export topics from $S^3$, the lowest ranking terms will also be shown along with the highest ranking ones.

Here's an example on ArXiv ML papers:

```python
from turftopic import SemanticSignalSeparation
from sklearn.feature_extraction.text import CountVectorizer

model = SemanticSignalSeparation(5, vectorizer=CountVectorizer(), random_state=42)
model.fit(corpus)

model.print_topics(top_k=5)
```

| | **Positive** | **Negative** |
|---|---|---|
| 0 | clustering, histograms, clusterings, histogram, classifying | reinforcement, exploration, planning, tactics, reinforce |
| 1 | textual, pagerank, litigants, marginalizing, entailment | matlab, waveforms, microcontroller, accelerometers, microcontrollers |
| 2 | sparsestmax, denoiseing, denoising, minimizers, minimizes | automation, affective, chatbots, questionnaire, attitudes |
| 3 | rebmigraph, subgraph, subgraphs, graphsage, graph | adversarial, adversarially, adversarialization, adversary, security |
| 4 | clustering, estimations, algorithm, dbscan, estimation | cnn, deepmind, deeplabv3, convnet, deepseenet |

### Concept Compass

If you want to gain a deeper understanding of terms' relation to axes, you can produce a *concept compass*.
This involves plotting the terms in a corpus along two semantic axes.

In order to use the compass in Turftopic you will need to have `plotly` installed:

```bash
pip install plotly
```

You can display a compass based on a fitted model like so:

```python
fig = model.concept_compass(topic_x=1, topic_y=4)
fig.show()
```

<figure>
  <img src="../images/arxiv_ml_compass.png" width="60%" style="margin-left: auto;margin-right: auto;">
  <figcaption> Concept Compass of ArXiv ML Papers along two semantic axes. </figcaption>
</figure>

## Considerations

### Strengths
@@ -91,7 +108,6 @@ figures.word_map(

### Weaknesses

- Noise Components: The model tends to find components in corpora that only contain noise. This is typical in other applications of ICA as well, and it is frequently used for noise removal in other disciplines. We are working on automated solutions to detect and flag these components.
- Sometimes Unintuitive: Neural embedding models might have a different mapping of the semantic space than humans. Sometimes S³ uncovers unintuitive dimensions of meaning as a result of this.
- Moderate Scalability: The model cannot be fitted in an online fashion. It is reasonably scalable, but for very large corpora you might want to consider using a different model.

turftopic/models/decomp.py

Lines changed: 76 additions & 0 deletions
@@ -6,6 +6,7 @@
```python
from sklearn.base import TransformerMixin
from sklearn.decomposition import FastICA
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances

from turftopic.base import ContextualModel, Encoder
from turftopic.vectorizer import default_vectorizer
```
@@ -170,3 +171,78 @@ def export_representative_documents(
```python
            show_negative,
            format,
        )

    def concept_compass(
        self, topic_x: Union[int, str], topic_y: Union[str, int]
    ):
        """Display a compass of concepts along two semantic axes.
        In order for the plot to be concise and readable, terms are randomly selected on
        a grid of the two topics.

        Parameters
        ----------
        topic_x: int or str
            Index or name of the topic to display on the X axis.
        topic_y: int or str
            Index or name of the topic to display on the Y axis.

        Returns
        -------
        go.Figure
            Plotly interactive plot of the concept compass.
        """
        try:
            import plotly.express as px
        except (ImportError, ModuleNotFoundError) as e:
            raise ModuleNotFoundError(
                "Please install plotly if you intend to use plots in Turftopic."
            ) from e
        # Resolve topic names to component indices.
        if isinstance(topic_x, str):
            try:
                topic_x = list(self.topic_names).index(topic_x)
            except ValueError as e:
                raise ValueError(
                    f"{topic_x} is not a valid topic name or index."
                ) from e
        if isinstance(topic_y, str):
            try:
                topic_y = list(self.topic_names).index(topic_y)
            except ValueError as e:
                raise ValueError(
                    f"{topic_y} is not a valid topic name or index."
                ) from e
        x = self.components_[topic_x]
        y = self.components_[topic_y]
        vocab = self.get_vocab()
        points = np.array(list(zip(x, y)))
        # Lay a 20x20 grid over the two components' range, jitter it slightly,
        # then keep only the vocabulary item closest to each grid point,
        # so the plot stays readable instead of showing every term.
        xx, yy = np.meshgrid(
            np.linspace(np.min(x), np.max(x), 20),
            np.linspace(np.min(y), np.max(y), 20),
        )
        coords = np.array(list(zip(np.ravel(xx), np.ravel(yy))))
        coords = coords + np.random.default_rng(0).normal(
            [0, 0], [0.1, 0.1], size=coords.shape
        )
        dist = euclidean_distances(coords, points)
        idxs = np.argmin(dist, axis=1)
        fig = px.scatter(
            x=x[idxs],
            y=y[idxs],
            text=vocab[idxs],
            template="plotly_white",
        )
        fig = fig.update_traces(
            mode="text", textfont_color="black", marker=dict(color="black")
        ).update_layout(
            xaxis_title=f"{self.topic_names[topic_x]}",
            yaxis_title=f"{self.topic_names[topic_y]}",
        )
        fig = fig.update_layout(
            width=1000,
            height=1000,
            font=dict(family="Times New Roman", color="black", size=21),
            margin=dict(l=5, r=5, t=5, b=5),
        )
        # Axis lines through the origin make the two semantic poles visible.
        fig = fig.add_hline(y=0, line_color="black", line_width=4)
        fig = fig.add_vline(x=0, line_color="black", line_width=4)
        return fig
```
