
Commit de1d511

Added topeax paper starting stage
1 parent dfd7fac commit de1d511

10 files changed

Lines changed: 607 additions & 0 deletions

File tree

papers/topeax/Merriweather.ttf
4.39 MB
Binary file not shown.

(file name not shown)
662 KB

papers/topeax/figures/peax.png
260 KB

papers/topeax/figures/peax.svg
Lines changed: 173 additions & 0 deletions

papers/topeax/figures/performance.svg
Lines changed: 98 additions & 0 deletions

(file name not shown)
265 KB

papers/topeax/figures/perplexity_robustness.svg
Lines changed: 101 additions & 0 deletions

papers/topeax/main.html

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
</head>
<body>
<h1>A Fluid Dynamic Model for Glacier Flow</h1>
</body>
</html>

papers/topeax/main.pdf

903 KB
Binary file not shown.

papers/topeax/main.typ

Lines changed: 225 additions & 0 deletions
@@ -0,0 +1,225 @@
// Typst has no built-in `title` element, so define one with the intended styling.
#let title(body) = {
  set text(size: 18pt)
  set align(left)
  body
}

#set text(
  size: 12pt,
  weight: "medium",
)
#set page(
  paper: "a4",
  margin: (x: 1.8cm, y: 1.5cm),
)
#set highlight(
  fill: rgb("#ddddff"),
  radius: 5pt,
  extent: 3pt,
)
#title[
  #highlight[Topeax] -
  An Improved Clustering Topic Model with Density Peak Detection and Lexical-Semantic Term Importance
]

#v(10pt)
#par[
  *Márton Kardos* \
  Aarhus University \
  #link("mailto:martonkardos@cas.au.dk")
]

== Abstract

#text[
  Text clustering is today the most popular paradigm of topic modelling, both in research and in industry settings.
  Despite the apparent success of clustering topic models, I identify a number of issues in Top2Vec and BERTopic that remain largely unsolved.
  Firstly, these approaches are unreliable at discovering the number of topics in a corpus, owing to their sensitivity to hyperparameters.
  Secondly, BERTopic ignores the semantic distance of keywords to topic vectors, while Top2Vec ignores word counts in the corpus.
  This results in hardly interpretable topics riddled with stop words and junk terms on the one hand, and in a lack of keyword variety and reliability on the other.
  In this paper, I introduce a new approach, *#highlight[Topeax]*, which discovers the number of clusters from peaks in density estimates,
  and combines lexical and semantic term importance to obtain high-quality topic keywords.
]

#set heading(numbering: "1.")

= Introduction
= Model Specification

I introduce Topeax, a novel topic modelling approach based on document clustering.
The model differs in a number of aspects from traditional clustering topic models like BERTopic and Top2Vec.

#figure(
  image("figures/peax.png", width: 100%),
  caption: [A schematic overview of the Peax clustering algorithm.
    \ _The illustrations were generated from the *political ideologies* dataset._],
) <peax>

== Dimensionality Reduction

Unlike other clustering topic models, Topeax relies on
t-distributed Stochastic Neighbour Embedding (t-SNE) (cite it here) instead of UMAP.
I use the cosine metric to calculate document similarities for t-SNE,
as it is widely used for model training and downstream applications.
The number of dimensions was fixed to 2 in all experiments,
as this allows the reduced embeddings to be visualized.
Additionally, t-SNE has fewer hyperparameters than UMAP.
While it has been demonstrated that t-SNE can be sensitive to the chosen value of `perplexity`,
I will show that, within a reasonable range, it has no effect on the number of topics
or on topic quality.
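The reduction step can be sketched with scikit-learn's t-SNE implementation (an assumption: the paper does not name a specific library; the function name and random-seed choice below are illustrative):

```python
import numpy as np
from sklearn.manifold import TSNE

def reduce_embeddings(doc_embeddings: np.ndarray, perplexity: float = 30.0) -> np.ndarray:
    """Project document embeddings to 2D with cosine-metric t-SNE."""
    tsne = TSNE(
        n_components=2,        # fixed to 2 so the embeddings can be visualized
        metric="cosine",       # cosine distances between documents, as in the paper
        perplexity=perplexity,
        init="random",         # deterministic given random_state
        random_state=42,
    )
    return tsne.fit_transform(doc_embeddings)
```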

== The Peax Clustering Model

While HDBSCAN is the clustering model of choice for both BERTopic and Top2Vec,
I introduce a new technique for document clustering, termed *#highlight[Peax]*, which,
instead, clusters documents based on density peaks in the reduced document space.

The Peax algorithm consists of the following steps:

+ A Gaussian Kernel Density Estimate (KDE) is obtained over the reduced document embeddings.
  Bandwidth is determined with Scott's method.
+ The KDE is evaluated on a 100x100 grid over the embedding space.
  Density peaks are then detected by applying a local-maximum filter to the KDE heatmap.
  A neighbourhood of 25 pixels is used, i.e. a 5x5 square centred on each grid point.
+ Cluster centres are assigned to these density peaks.
  The density structure of each cluster is estimated
  by fitting a Gaussian mixture model, with its means fixed to the peaks, using the Expectation-Maximization algorithm.
  Documents are assigned to the component with the highest responsibility:
  \ #align(center)[$accent(z_d, "^") = arg max_(k) (r_("kd"))" , and " r_("kd")=p(z_k=1 | accent(x, "^")_d)$]
  where $accent(z_d, "^")$ is the estimated underlying component assigned to document $d$,
  $accent(x, "^")_d$ is the t-SNE embedding of document $d$, and $r_("kd")$ is the responsibility of component $k$ for document $d$.
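The steps above can be sketched in NumPy/SciPy. This is a minimal illustration, not a released implementation: the function name is hypothetical, and the covariance regularization and iteration count are my own choices, not from the paper.

```python
import numpy as np
from scipy.ndimage import maximum_filter
from scipy.stats import gaussian_kde, multivariate_normal

def peax_cluster(X, grid_size=100, n_iter=50):
    """Cluster 2D document embeddings X of shape (n_docs, 2) around density peaks."""
    # Step 1: Gaussian KDE, bandwidth chosen by Scott's rule (scipy's default).
    kde = gaussian_kde(X.T, bw_method="scott")

    # Step 2: evaluate the KDE on a grid and detect local maxima
    # with a 5x5 (25-pixel) neighbourhood filter.
    xs = np.linspace(X[:, 0].min(), X[:, 0].max(), grid_size)
    ys = np.linspace(X[:, 1].min(), X[:, 1].max(), grid_size)
    gx, gy = np.meshgrid(xs, ys)
    density = kde(np.vstack([gx.ravel(), gy.ravel()])).reshape(gx.shape)
    peaks_mask = density == maximum_filter(density, size=5)
    peaks = np.column_stack([gx[peaks_mask], gy[peaks_mask]])

    # Step 3: EM for a Gaussian mixture whose means stay fixed at the peaks;
    # only the weights and covariances are re-estimated.
    k = len(peaks)
    weights = np.full(k, 1.0 / k)
    covs = np.array([np.cov(X.T) for _ in range(k)])
    for _ in range(n_iter):
        # E-step: responsibilities r_kd = p(z_k = 1 | x_d).
        resp = np.stack([
            w * multivariate_normal.pdf(X, mean=m, cov=c)
            for w, m, c in zip(weights, peaks, covs)
        ])
        resp /= resp.sum(axis=0, keepdims=True)
        # M-step: update weights and covariances; means stay at the peaks.
        nk = resp.sum(axis=1) + 1e-12
        weights = nk / nk.sum()
        for j in range(k):
            centred = X - peaks[j]
            covs[j] = (resp[j, :, None] * centred).T @ centred / nk[j] + 1e-6 * np.eye(2)
    # Hard assignment: the component with the highest responsibility.
    return resp.argmax(axis=0), peaks, resp
```

Note that this sketch does not filter out spurious low-density maxima; a production implementation would likely threshold peaks by density.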

#figure(
  placement: top,
  image("figures/bbc_news_density.png", width: 100%),
  caption: [Topeax model illustrated on the BBC News dataset. Topics are identified at density peaks, and keywords are selected based on combined term importance.],
) <bbc_news_density>

== Term Importance Estimation

To mitigate the issues with c-TF-IDF and centroid-based term importance estimation in previously proposed clustering topic models,
I introduce a novel approach that combines a semantic and a lexical measure of cluster-term importance.

=== Semantic Importance

Semantic term importance is estimated similarly to (cite Top2Vec), but,
since we have access to a probabilistic, non-spherical model, and cluster boundaries are not hard,
topic vectors are estimated as the responsibility-weighted average of document embeddings: \
#align(center)[$t_k = frac(sum_(d) r_("kd") dot x_d, sum_(d) r_("kd"))$]
where $t_k$ is the embedding of topic $k$ and $x_d$ is the embedding of document $d$.
Let the embedding of term $j$ be $w_j$. The semantic importance of term $j$ for cluster $k$ is then:
#align(center)[$s_("kj") = cos(t_k, w_j)$]
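In code, this amounts to a weighted average followed by a cosine similarity; a minimal NumPy sketch (the function name and argument layout are assumptions for illustration):

```python
import numpy as np

def semantic_importance(doc_embeddings, term_embeddings, resp):
    """s_kj = cos(t_k, w_j), with t_k a responsibility-weighted mean.

    doc_embeddings: (n_docs, dim), term_embeddings: (n_terms, dim),
    resp: (n_topics, n_docs) responsibilities r_kd.
    """
    # Topic vectors: responsibility-weighted average of document embeddings.
    topic_vecs = (resp @ doc_embeddings) / resp.sum(axis=1, keepdims=True)
    # Cosine similarity between every topic vector and every term embedding.
    t = topic_vecs / np.linalg.norm(topic_vecs, axis=1, keepdims=True)
    w = term_embeddings / np.linalg.norm(term_embeddings, axis=1, keepdims=True)
    return t @ w.T  # shape: (n_topics, n_terms)
```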

=== Lexical Importance

Instead of relying on a tf-idf-based measure for computing the salience of a term in a corpus,
an information-theoretical approach is used.
Theoretically, we can estimate the lexical importance of a term for a cluster
by computing the mutual information of the term's occurrence with the cluster's occurrence.
Due to its convenient interpretability properties, I opt for normalized pointwise mutual information (NPMI),
which has historically been used for phrase detection (cite) and topic-coherence evaluation (cite).

The pointwise mutual information is calculated by taking the logarithm of the ratio of the conditional and marginal word probabilities:
#align(center)[$"pmi"_("kj") = log_2 frac(p(v_j|z_k), p(v_j))$]
where $p(v_j|z_k)$ is the conditional probability of word $j$ given the presence of topic $z_k$,
and $p(v_j)$ is the marginal probability of word $j$ occurring.

A naive approach would estimate these probabilities empirically:
#align(center)[$p(v_j) = frac(n_j, sum_i n_i)", and " p(v_j | z_k) = frac(n_("jk"), sum_i n_("ik"))$]
where $n_j$ is the number of times word $j$ occurs in the corpus, and $n_("jk")$ is the number of times word $j$ occurs in cluster $k$.

This would, however, overestimate the importance of rare words in the clusters where they appear.
I therefore opt for a posterior-mean estimate under a symmetric Dirichlet prior with an $alpha$ _smoothing_ parameter,
which is analytically tractable:
#align(center)[$p(v_j) = frac(n_j + alpha, N alpha + sum_i n_i)", and " p(v_j | z_k) = frac(n_("jk") + alpha, N alpha + sum_i n_("ik"))$]
where $N$ is the size of the vocabulary. In all further analyses, $alpha=2$ is used.
Since regular PMI scores have no lower bound, they are normalized to obtain NPMI:
#align(center)[$"npmi"_("kj") = frac("pmi"_("kj"), -log_2 p(v_j, z_k))", where " p(v_j, z_k) = p(v_j|z_k) dot p(z_k)$]
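The smoothed NPMI computation can be written down directly from the formulas. A minimal sketch, assuming a per-cluster word-count matrix as input and estimating $p(z_k)$ from each cluster's token mass (the function name is illustrative):

```python
import numpy as np

def npmi_importance(counts: np.ndarray, alpha: float = 2.0) -> np.ndarray:
    """NPMI lexical importance from a (n_clusters, n_vocab) count matrix."""
    n_clusters, n_vocab = counts.shape
    n_j = counts.sum(axis=0)  # corpus-wide word counts
    # Posterior-mean estimates under a symmetric Dirichlet(alpha) prior.
    p_w = (n_j + alpha) / (n_vocab * alpha + n_j.sum())
    p_w_given_k = (counts + alpha) / (n_vocab * alpha + counts.sum(axis=1, keepdims=True))
    # p(z_k): cluster probability, estimated from the cluster's token mass.
    p_k = counts.sum(axis=1, keepdims=True) / counts.sum()
    pmi = np.log2(p_w_given_k / p_w)
    joint = p_w_given_k * p_k  # p(v_j, z_k)
    return pmi / -np.log2(joint)
```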

=== Combined Term Importance

To balance the semantic proximity of keywords to topic embeddings against cluster-term occurrences,
I introduce a combined score: the geometric mean of the lexical and semantic scores, each rescaled from $[-1, 1]$ to $[0, 1]$:

#align(center)[$beta_("kj") = sqrt(frac(1 + "npmi"_("kj"), 2) dot frac(1 + s_("kj"), 2))$]
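The combination itself is a one-liner; the sketch below assumes `npmi` and `s` are (n_topics, n_terms) arrays of NPMI and cosine scores, both in $[-1, 1]$:

```python
import numpy as np

def combined_importance(npmi: np.ndarray, s: np.ndarray) -> np.ndarray:
    """beta_kj: geometric mean of lexical and semantic scores rescaled to [0, 1]."""
    return np.sqrt((1.0 + npmi) / 2.0 * (1.0 + s) / 2.0)
```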

= Experimental Methods

One of the main strengths of clustering approaches is that they can supposedly discover the number of clusters in the data rather than being given it a priori.
A good clustering topic model should therefore be able to faithfully replicate a human-assigned clustering of the data, and should be able to describe these clusters in a human-interpretable manner. I will therefore utilize datasets with gold-standard labels.
In this section I outline the criteria and considerations taken into account when designing the evaluation procedure:

+ The number of clusters found by the topic model should preferably not be too far from the number of gold categories.
+ Preferably, if two points are in the same gold category, they should also belong together in the predicted clustering, while points that are not should be separated.
+ For topic modelling purposes, it is often preferable that the number of clusters is not overly large.
  Topic models should, in theory, aid the understanding of a corpus. Using a topic model becomes impractical when the number of topics one has to interpret exceeds a couple of hundred.
+ Topics should be distinct and easily readable.

== Datasets

In order to evaluate these properties, I used a number of openly available datasets with gold-standard category metadata.
These included all clustering tasks from the new version of the Massive Text Embedding Benchmark, `MTEB(eng, v2)` (cite).
To avoid evaluating on the same corpus twice, the P2P variants of the tasks were used.
In addition, an annotated Twitter topic-classification dataset and a BBC News dataset were used.

#figure(
  caption: [Descriptive statistics of the datasets used for evaluation\ _Document length is reported as mean±standard deviation_],
  table(
    columns: 4,
    stroke: none,
    align: (left, center, center, center),
    table.hline(),
    table.header[*Dataset*][*Document Length*\ _N characters_][*Corpus Size*\ _N documents_][*Clusters*\ _N unique gold labels_],
    table.hline(),
    [ArXivHierarchicalClusteringP2P], [1008.44±438.01], [2048], [23],
    [BiorxivClusteringP2P.v2], [1663.97±541.93], [53787], [26],
    [MedrxivClusteringP2P.v2], [1981.20±922.01], [37500], [51],
    [StackExchangeClusteringP2P.v2], [1091.06±808.88], [74914], [524],
    [TwentyNewsgroupsClustering.v2], [32.04±14.60], [59545], [20],
    [TweetTopicClustering], [165.66±68.19], [4374], [6],
    [BBCNewsClustering], [1000.46±638.41], [2224], [5],
    table.hline(),
  ),
) <dataset_stats>

== Models

To compare Topeax with existing approaches, it was run on all corpora alongside BERTopic and Top2Vec.
Implementations were sourced from the Turftopic (cite) Python package.
For the main analysis, the default hyperparameters of the original BERTopic and Top2Vec packages were used, respectively,
as these give different clusterings despite sharing the same pipeline.
All models were run with both the `all-MiniLM-L6-v2` and the slightly larger, higher-performing `all-mpnet-base-v2` sentence encoders (cite sbert),
to control for embedding size and quality.
The models were fitted without filtering out stop words and uncommon terms,
since state-of-the-art topic models are able to handle such information without issues (cite S3).

== Metrics

For evaluating model performance, both clustering quality and topic quality were assessed.
I evaluated the faithfulness of the predicted clustering to the gold labels using the Fowlkes-Mallows index (cite).
The FMI is very similar to the F1 score for classification, in that it also intends to balance precision and recall.
Unlike F1, however, FMI uses the geometric mean of these quantities:
#align(center)[$"FMI" = N_("TP")/sqrt((N_("TP") + N_("FP")) dot (N_("TP") + N_("FN")))$]
where $N_("TP")$ is the number of pairs of points that are clustered together in both clusterings (true positives),
$N_("FP")$ is the number of pairs that are clustered together in the predicted clustering but not in the gold labels (false positives), and
$N_("FN")$ is the number of pairs that are not clustered together in the predicted clustering despite belonging together in the gold labels (false negatives).
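The pair-counting definition translates directly into code; a naive O(n²) sketch for illustration (scikit-learn's `fowlkes_mallows_score` computes the same quantity efficiently):

```python
from itertools import combinations

def fowlkes_mallows(pred, gold):
    """FMI from pair counts over two label sequences of equal length."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(pred)), 2):
        same_pred = pred[i] == pred[j]
        same_gold = gold[i] == gold[j]
        tp += same_pred and same_gold      # together in both clusterings
        fp += same_pred and not same_gold  # together only in the prediction
        fn += same_gold and not same_pred  # together only in the gold labels
    return tp / ((tp + fp) * (tp + fn)) ** 0.5
```

Note that FMI is invariant to relabelling: a prediction identical to the gold labels up to permutation of cluster ids scores 1.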

For topic quality, I adopt the methodology of (cite S3), with minor differences:
I use GloVe embeddings (cite GloVe) instead of Skip-gram for evaluating internal word embedding coherence.
Topic quality was thus evaluated on topic diversity $d$, external word embedding coherence $C_("ex")$ using the `word2vec-google-news-300` word embedding model,
as well as internal word embedding coherence $C_("in")$ with a GloVe model trained on each corpus.
Ideally, a model should have both high intrinsic and extrinsic coherence, and thus an aggregate measure of coherence can give a better
estimate of topic quality: $accent(C, -) = sqrt(C_("in") dot C_("ex"))$.
In addition, an aggregate metric of topic quality can be calculated by taking the geometric mean of coherence and diversity: $I = sqrt(accent(C, -) dot d)$.

== Robustness checks

+ Hyperparameters (perplexity)
+ Corpus Subsampling

= Results

== Cluster Recovery
