# Semantic Signal Separation ($S^3$)
Semantic Signal Separation tries to recover dimensions/axes along which most of the semantic variation can be explained.
A topic in S³ is an axis of semantics in the corpus.
This makes the model able to recover more nuanced topical content in documents, but it is not optimal when you expect topics to be groupings of documents.

<figure>
    <figcaption> Schematic overview of S³ </figcaption>
</figure>

## The Model

### 0. Encoding
Documents in $S^3$ are first encoded using an [encoder model](encoders.md).

- Let the encodings of documents in the corpus be $X$.
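
For intuition, here is a minimal sketch of what this step amounts to, assuming a sentence-transformers encoder and a list of document strings called `corpus` (both names are illustrative; in practice `model.fit()` handles the encoding for you):

```python
from sentence_transformers import SentenceTransformer

# Illustrative only: Turftopic encodes documents internally during fit().
# `corpus` is assumed to be a list of document strings.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
X = encoder.encode(corpus)  # shape: (n_documents, embedding_dim)
```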
### 1. Decomposition
The next step is to decompose the embedding matrix using ICA. This step discovers the underlying semantic axes as latent independent components in the embeddings.

- Decompose $X$ using FastICA: $X = AS$, where $A$ is the mixing matrix and $S$ is the document-topic matrix.
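
As a rough sketch, this step corresponds to something along these lines with scikit-learn's FastICA (the variable names follow the notation above, and `n_topics` is an illustrative choice; this mirrors rather than reproduces Turftopic's internals):

```python
from sklearn.decomposition import FastICA

n_topics = 10                  # illustrative number of semantic axes
ica = FastICA(n_components=n_topics)
S = ica.fit_transform(X)       # document-topic matrix, shape (n_documents, n_topics)
A = ica.mixing_                # mixing matrix, shape (embedding_dim, n_topics)
# In scikit-learn's convention, X is reconstructed (up to the mean) as S @ A.T
```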
### 2. Term Importance Estimation
Term importances for each topic are calculated by encoding the entire vocabulary of the corpus using the same embedding model,
then recovering the strength of each latent component in the word embedding matrix.
The strength of the components in the words is interpreted as the words' importance in a given topic.

- Let the matrix of word encodings be $V$.
- Calculate the pseudo-inverse of the mixing matrix, $C = A^{+}$, where $C$ is the _unmixing matrix_.
- Estimate component strength by multiplying word encodings with the unmixing matrix: $W = VC^T$. $W^T$ is then the topic-term matrix (`model.components_`).
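
Continuing the sketch above, the term importances can be estimated roughly as follows (`vocab` is an assumed list of vocabulary terms; the actual implementation lives inside Turftopic):

```python
import numpy as np

V = encoder.encode(vocab)   # word encodings, shape (n_words, embedding_dim)
C = np.linalg.pinv(A)       # unmixing matrix, shape (n_topics, embedding_dim)
W = V @ C.T                 # word-topic strengths, shape (n_words, n_topics)
components = W.T            # topic-term matrix, analogous to model.components_
```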
## Comparison to Classical Models
$S^3$ is potentially the closest you can get with contextually sensitive models to classical matrix decomposition approaches, such as NMF or Latent Semantic Analysis.
The conceptualization is very similar to these models, but instead of recovering factors of word use, $S^3$ recovers dimensions in a continuous semantic space.
This means that you get many of the advantages of those models, including incredible speed, low sensitivity to hyperparameters and stable results.

Most of the intuitions you have about LSA will also apply to $S^3$, but it might give more surprising results, as embedding models can potentially learn efficient representations of semantics that differ from those of humans.

$S^3$ is also way more robust to stop words, meaning that you won't have to do extensive preprocessing.
## Interpretation
### Negative terms
Terms that rank lowest on a topic also have meaning in $S^3$.
Whenever interpreting semantic axes, you should probably consider both ends of the axis.
As such, when you print or export topics from $S^3$, the lowest ranking terms will also be shown along with the highest ranking ones.

Here's an example on ArXiv ML papers:

```python
from turftopic import SemanticSignalSeparation
from sklearn.feature_extraction.text import CountVectorizer

model = SemanticSignalSeparation(5, vectorizer=CountVectorizer(), random_state=42)
model.fit(corpus)
```
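
Printing the topics of the fitted model then shows both ends of each axis. A minimal usage sketch, assuming the `print_topics()` method shared by Turftopic models (the `top_k` argument name is an assumption here):

```python
# Displays the highest and lowest ranking terms for every semantic axis.
model.print_topics(top_k=5)
```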
<figure>
    <figcaption> Concept Compass of ArXiv ML Papers along two semantic axes. </figcaption>
</figure>

## Considerations
### Strengths
### Weaknesses
- Sometimes Unintuitive: Neural embedding models might have a different mapping of the semantic space than humans. Sometimes S³ uncovers unintuitive dimensions of meaning as a result of this.
- Moderate Scalability: The model cannot be fitted in an online fashion. It is reasonably scalable, but for very large corpora you might want to consider using a different model.