Semantic Signal Separation tries to recover dimensions/axes along which most of the semantic variation in the corpus can be explained.
A topic in $S^3$ is an axis of semantics in the corpus.
This enables the model to recover more nuanced topical content in documents, but it is not optimal when you expect topics to be groupings of documents.
<figure>
    <figcaption> Schematic overview of S³ </figcaption>
</figure>
$S^3$ is one of the fastest topic models out there, even rivalling vanilla NMF, when not accounting for embedding time.
It also typically produces very high-quality topics, and our evaluations indicate that it performs significantly better when no preprocessing is applied to texts.
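
Fitting the model takes only a few lines. Here is a minimal usage sketch, with 20 Newsgroups standing in for your corpus:

```python
from sklearn.datasets import fetch_20newsgroups
from turftopic import SemanticSignalSeparation

# Stand-in example corpus; substitute any list of raw documents.
corpus = fetch_20newsgroups(subset="all").data

model = SemanticSignalSeparation(n_components=10)
model.fit(corpus)
model.print_topics()
```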
## How does $S^3$ work?
### Step 1: Document-embedding Decomposition
The first step is to decompose the embedding matrix using ICA. This step discovers the underlying semantic axes as latent independent components in the embeddings.
??? info "See formula"

    - Let the encodings of documents in the corpus be $X$.
    - Decompose $X$ using FastICA: $X = AS$, where $A$ is the mixing matrix and $S$ is the document-topic matrix.
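
A minimal sketch of this step, using scikit-learn's `FastICA` with random vectors standing in for real document embeddings:

```python
import numpy as np
from sklearn.decomposition import FastICA

# Stand-in for encoded documents: 100 documents, 384-dimensional embeddings.
X = np.random.default_rng(42).normal(size=(100, 384))

ica = FastICA(n_components=10)
S = ica.fit_transform(X)  # document-topic matrix (n_documents, n_topics)
A = ica.mixing_           # mixing matrix; X is approximately S @ A.T + ica.mean_
```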
### Step 2: Term Importance Estimation
Term importances for each topic are calculated by encoding the entire vocabulary of the corpus using the same embedding model,
then recovering the strength of each latent component in the word embedding matrix.
The strength of the components in the words will be interpreted as the words' importance in a given topic.
<figure>
    <figcaption> Visual representation of term importance approaches in S³ </figcaption>
</figure>
??? info "See formula"

    - Let the matrix of word encodings be $V$.
    - Calculate the pseudo-inverse of the mixing matrix, $C = A^{+}$, where $C$ is the _unmixing matrix_.
    - Project word embeddings onto the semantic axes by multiplying them with the unmixing matrix: $W = VC^T$. $W^T$ is then the topic-term matrix (`model.components_`).
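
Continuing the sketch above, these steps amount to a couple of matrix operations, with `V` standing in for vocabulary embeddings from the same encoder:

```python
# Stand-in for encoded vocabulary: 5000 words, same embedding dimension as X.
V = np.random.default_rng(0).normal(size=(5000, 384))

C = np.linalg.pinv(A)  # unmixing matrix: pseudo-inverse of the mixing matrix
W = V @ C.T            # project word embeddings onto the semantic axes
components = W.T       # topic-term matrix, exposed as model.components_
```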
There are three distinct methods to calculate term importances from word projections:
!!! quote "Choose a word importance method"

    === "Axial"

        ```python
        from turftopic import SemanticSignalSeparation

        model = SemanticSignalSeparation(n_components=10, feature_importance="axial")
        ```

        Axial word importances are defined as the words' positions on the semantic axes.
        This approach selects highly relevant words for topic descriptions, but topic descriptions might share words if a word scores high on multiple axes.

        The importance of word $j$ for topic $t$ is: $\beta_{tj} = W_{jt}$

    === "Angular"

        ```python
        from turftopic import SemanticSignalSeparation

        model = SemanticSignalSeparation(n_components=10, feature_importance="angular")
        ```

        Angular topics can be calculated by taking the cosine of the angle between projected word vectors and semantic axes.
        This makes axis descriptions very distinct and specific to the given axis, but they might include words that are not as relevant in the corpus.

        The importance of word $j$ for topic $t$ is: $\beta_{tj} = \frac{W_{jt}}{||W_j||}$
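
    === "Combined"

        ```python
        from turftopic import SemanticSignalSeparation

        model = SemanticSignalSeparation(n_components=10, feature_importance="combined")
        ```

        Combined word importance, which is the default, is a combination of the axial and angular approaches.

        The importance of word $j$ for topic $t$ is: $\beta_{tj} = \frac{(W_{jt})^3}{||W_j||}$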