# Semantic Signal Separation ($S^3$)
Semantic Signal Separation tries to recover dimensions/axes along which most of the semantic variation can be explained.
A topic in S³ is an axis of semantics in the corpus.
This makes the model able to recover more nuanced topical content in documents, but it is not optimal when you expect topics to be groupings of documents.

<figure>
    <figcaption> Schematic overview of S³ </figcaption>
</figure>

## The Model

### 0. Encoding
Documents in $S^3$ are first encoded using an [encoder model](encoders.md).

- Let the encodings of documents in the corpus be $X$.
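
For intuition, here is a minimal sketch of what this step amounts to, assuming a sentence-transformers encoder and a list of document strings called `corpus` (both names are illustrative; in practice `model.fit()` handles the encoding for you):

```python
from sentence_transformers import SentenceTransformer

# Illustrative only: Turftopic encodes documents internally during fit().
# `corpus` is assumed to be a list of document strings.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
X = encoder.encode(corpus)  # shape: (n_documents, embedding_dim)
```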
### 1. Decomposition
The next step is to decompose the embedding matrix using ICA. This step discovers the underlying semantic axes as latent independent components in the embeddings.

- Decompose $X$ using FastICA: $X = AS$, where $A$ is the mixing matrix and $S$ is the document-topic matrix.
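
As a rough sketch, this step corresponds to something along these lines with scikit-learn's FastICA (the variable names follow the notation above, and `n_topics` is an illustrative choice; this mirrors rather than reproduces Turftopic's internals):

```python
from sklearn.decomposition import FastICA

n_topics = 10                  # illustrative number of semantic axes
ica = FastICA(n_components=n_topics)
S = ica.fit_transform(X)       # document-topic matrix, shape (n_documents, n_topics)
A = ica.mixing_                # mixing matrix, shape (embedding_dim, n_topics)
# In scikit-learn's convention, X is reconstructed (up to the mean) as S @ A.T
```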
### 2. Term Importance Estimation
Term importances for each topic are calculated by encoding the entire vocabulary of the corpus using the same embedding model,
then recovering the strength of each latent component in the word embedding matrix.
The strength of the components in the words is interpreted as the words' importance in a given topic.

- Let the matrix of word encodings be $V$.
- Calculate the pseudo-inverse of the mixing matrix, $C = A^{+}$, where $C$ is the _unmixing matrix_.
- Estimate component strength by multiplying word encodings with the unmixing matrix: $W = VC^T$. $W^T$ is then the topic-term matrix (`model.components_`).
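
Continuing the sketch above, the term importances can be estimated roughly as follows (`vocab` is an assumed list of vocabulary terms; the actual implementation lives inside Turftopic):

```python
import numpy as np

V = encoder.encode(vocab)   # word encodings, shape (n_words, embedding_dim)
C = np.linalg.pinv(A)       # unmixing matrix, shape (n_topics, embedding_dim)
W = V @ C.T                 # word-topic strengths, shape (n_words, n_topics)
components = W.T            # topic-term matrix, analogous to model.components_
```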
## Comparison to Classical Models
$S^3$ is potentially the closest you can get with contextually sensitive models to classical matrix decomposition approaches, such as NMF or Latent Semantic Analysis.
The conceptualization is very similar to these models, but instead of recovering factors of word use, $S^3$ recovers dimensions in a continuous semantic space.
This means that you get many of the advantages of those models, including incredible speed, low sensitivity to hyperparameters and stable results.

Most of the intuitions you have about LSA will also apply to $S^3$, but it might give more surprising results, as embedding models can potentially learn efficient representations of semantics that differ from those of humans.

$S^3$ is also way more robust to stop words, meaning that you won't have to do extensive preprocessing.
## Interpretation
### Negative terms
Terms that rank lowest on a topic also have meaning in $S^3$.
Whenever interpreting semantic axes, you should probably consider both ends of the axis.
As such, when you print or export topics from $S^3$, the lowest ranking terms will also be shown along with the highest ranking ones.

Here's an example on ArXiv ML papers:

```python
from turftopic import SemanticSignalSeparation
from sklearn.feature_extraction.text import CountVectorizer

model = SemanticSignalSeparation(5, vectorizer=CountVectorizer(), random_state=42)
model.fit(corpus)
```
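
Printing the topics of the fitted model then shows both ends of each axis. A minimal usage sketch, assuming the `print_topics()` method shared by Turftopic models (the `top_k` argument name is an assumption here):

```python
# Displays the highest and lowest ranking terms for every semantic axis.
model.print_topics(top_k=5)
```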
<figure>
    <figcaption> Concept Compass of ArXiv ML Papers along two semantic axes. </figcaption>
</figure>

## Considerations
### Strengths
### Weaknesses
- Sometimes Unintuitive: Neural embedding models might have a different mapping of the semantic space than humans. Sometimes S³ uncovers unintuitive dimensions of meaning as a result of this.
- Moderate Scalability: The model cannot be fitted in an online fashion. It is reasonably scalable, but for very large corpora you might want to consider using a different model.