
Commit 3dd624a

Added more results
1 parent 44847ac commit 3dd624a

9 files changed

Lines changed: 3227 additions & 55 deletions

papers/topeax/figures/.xdp-performance.svg.2025_11_06_21_47_55.0.svg-LKVaas

Lines changed: 112 additions & 0 deletions
Large diffs are not rendered by default.
181 KB

papers/topeax/figures/clustering_models.svg

Lines changed: 2878 additions & 0 deletions
336 KB

papers/topeax/figures/performance.svg

Lines changed: 66 additions & 52 deletions
283 KB

papers/topeax/figures/subsampling.svg

Lines changed: 81 additions & 0 deletions

papers/topeax/main.pdf

714 KB
Binary file not shown.

papers/topeax/main.typ

Lines changed: 90 additions & 3 deletions
@@ -43,6 +43,15 @@
 #set heading(numbering: "1.")
 
 = Introduction
+
+== Clustering Topic Models
+
+#figure(
+  image("figures/clustering_models.png", width: 100%),
+  caption: [Schematic overview of clustering topic models' steps.],
+) <clustering_topic_models>
+
+
 = Model Specification
 
 I introduce Topeax, a novel topic modelling approach based on document clustering.
@@ -51,7 +60,7 @@ The model differs in a number of aspects from traditional clustering topic model
 #figure(
   image("figures/peax.png", width: 100%),
   caption: [A schematic overview of the Peax clustering algorithm.
-  \ _The illustrations were generated from the *political ideologies* dataset._],
+  \ Illustrations were generated from the _political ideologies dataset#footnote[https://huggingface.co/datasets/JyotiNayak/political_ideologies]._],
 ) <peax>
 
 == Dimensionality Reduction
@@ -214,25 +223,85 @@ as well as internal word embedding coherence $C_("in")$ with a GloVe model train
 Ideally, a model should have both high intrinsic and extrinsic coherence, and thus an aggregate measure of coherence can give a better
 estimate of topic quality: $accent(C, -) = sqrt(C_("in") dot C_("ex"))$.
 In addition, an aggregate metric of topic quality can be calculated by taking the geometric mean of coherence and diversity: $I = sqrt(accent(C, -) dot d)$.
+We will also refer to this quantity as _interpretability_.
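For concreteness, the two aggregate metrics amount to geometric means; a minimal sketch (illustrative, not code from this commit; the function names are mine):

```python
from math import sqrt

def aggregate_coherence(c_in: float, c_ex: float) -> float:
    """Geometric mean of internal and external coherence: sqrt(C_in * C_ex)."""
    return sqrt(c_in * c_ex)

def interpretability(c_in: float, c_ex: float, diversity: float) -> float:
    """Geometric mean of aggregate coherence and topic diversity: sqrt(C * d)."""
    return sqrt(aggregate_coherence(c_in, c_ex) * diversity)
```

The geometric mean penalises models that excel on one component while collapsing on the other, which an arithmetic mean would mask.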
 
 == Sensitivity to Perplexity
 
 Both t-SNE and UMAP have a hyperparameter that determines how many neighbours of a given point are considered when generating lower-dimensional projections; this hyperparameter is usually referred to as _perplexity_.
 It is also known that both methods are sensitive to the choice of hyperparameters, and that, depending on these, structures that do not exist in the higher-dimensional feature space might occur (cite Distill article and "Understanding UMAP").
-In order to see how this affects the Topeax algorithm, and how robust it is to the choice of this hyperparameter in comparison with other clustering topic models, I fitted each model to the 20 Newsgroups corpus using `all-MiniLM-L6-v2` with `perplexities=[2, 5, 30, 50, 100]`.
+In order to see how this affects the Topeax algorithm, and how robust it is to the choice of this hyperparameter in comparison with other clustering topic models, I fitted each model to the 20 Newsgroups corpus from `scikit-learn`, using `all-MiniLM-L6-v2` embeddings with `perplexities=[2, 5, 30, 50, 100]`.
 This choice of values was inspired by (cite Distill). Each model was evaluated on the metrics outlined above.
 
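An illustrative sketch of such a perplexity sweep, with toy random data standing in for the real `all-MiniLM-L6-v2` document embeddings (the actual experiment fits full topic models, not bare t-SNE):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(42)
# Toy stand-in for sentence embeddings: 200 "documents" in 384 dimensions,
# the output dimensionality of all-MiniLM-L6-v2.
embeddings = rng.normal(size=(200, 384))

projections = {}
for perplexity in [2, 5, 30, 50, 100]:
    # scikit-learn requires perplexity < n_samples
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=42)
    projections[perplexity] = tsne.fit_transform(embeddings)
```

Each entry in `projections` is a 2-D embedding of the same points; comparing downstream clustering across these projections is what the robustness test measures.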
 == Subsampling Invariance
 
+Ideally, a good topic model should recover roughly the same topics, and the same number of topics, in a corpus even when it only has access to a subsample of that corpus, assuming that the underlying categories are the same.
+On the other hand, we would reasonably assume that giving a model access to the full corpus, instead of a subsample, should increase the accuracy of the results, not decrease it.
+To evaluate models' ability to cope with subsampling, I fit each model on the same corpus and embeddings as in the perplexity sensitivity test, and evaluate them on the previously outlined metrics.
+Subsample sizes are the following: `[250, 1000, 5000, 10_000, "full"]`.
+
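A minimal sketch of how such subsamples might be drawn (illustrative; `draw_subsamples` and the fixed seed are my assumptions, not the paper's actual code):

```python
import random

def draw_subsamples(corpus, sizes=(250, 1000, 5000, 10_000, "full"), seed=42):
    """Yield (size, subsample) pairs; the "full" entry returns the whole corpus."""
    rng = random.Random(seed)
    for size in sizes:
        if size == "full":
            yield size, list(corpus)
        else:
            # Sample without replacement, capped at the corpus size
            yield size, rng.sample(corpus, min(size, len(corpus)))
```

Fitting each model on every yielded subsample, then comparing the recovered topic counts and quality metrics across sizes, gives the invariance curves reported below.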
 = Results
 
+Topeax substantially outperformed both Top2Vec and BERTopic in cluster recovery, as well as in the quality of the topic keywords (see @performance).
+A regression analysis predicting the Fowlkes-Mallows index from model type, with random effects and intercepts for encoders and datasets, was conducted.
+The regression was significant at $alpha=0.05$ ($R^2=0.127$, $F=4.368$, $p=0.0169$).
+Both BERTopic and Top2Vec had significantly negative slopes (see @coeffs).
+
+#figure(
+  table(
+    columns: 4,
+    align: (left, center, center, center),
+    stroke: none,
+    table.hline(),
+    table.header([*Coefficients*], [*Estimate*], [*p-value*], [*95% CI*]),
+    table.hline(),
+    [Intercept (_Topeax_)], [0.3405], [0.000], [[0.267, 0.414]],
+    [Top2Vec], [-0.1106], [0.038], [[-0.215, -0.006]],
+    [BERTopic], [-0.1479], [0.006], [[-0.252, -0.044]],
+    table.hline(),
+
+  ),
+  caption: [Regression coefficients for predicting the Fowlkes-Mallows index from the choice of topic model.]
+) <coeffs>
+
+Topeax also exhibited the lowest absolute percentage error in recovering the number of topics (see @performance), with $"MAPE" = 60.52%$ ($"SD"=26.19$),
+while Top2Vec ($M=1797.29%, "SD"=2622.52$) and BERTopic ($M = 2438.91%, "SD" = 3011.63$) drastically deviated from the number of gold labels in the datasets.
+It is also important to note the opposite directionality of these errors.
+While Topeax almost universally underestimated the number of topics, especially in `StackExchangeClusteringP2P` and `MedrxivClusteringP2P`, where the number of unique labels was very large, Top2Vec and BERTopic almost always grossly overestimated the number of clusters in the data.
+This is undesirable behaviour for a topic model, as topic interpretation requires manual effort, and vast numbers of topics (>500) become difficult and labour-intensive to label for any individual.
+
+
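As a hedged illustration of the error metrics discussed here (function names are mine, not the paper's): MAPE captures the magnitude of the deviation, while a signed mean error captures its direction.

```python
def mape(predicted, gold):
    """Mean absolute percentage error of recovered topic counts, in percent."""
    errors = [abs(p - g) / g for p, g in zip(predicted, gold)]
    return 100 * sum(errors) / len(errors)

def mean_signed_error(predicted, gold):
    """Negative: topic counts underestimated on average; positive: overestimated."""
    return sum(p - g for p, g in zip(predicted, gold)) / len(gold)
```

A model that halves the topic count on one dataset and doubles it on another has a large MAPE but a signed error near zero, which is why both views are reported.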
+#figure(
+  table(
+    columns: 5,
+    align: (left, center, center, center, center),
+    stroke: none,
+    table.hline(),
+    table.vline(x: 4),
+    table.header([*Model*], [*$C_("in")$*], [*$C_("ex")$*], [*$d$*], [*$I$*]),
+    table.hline(),
+    [Topeax], [*0.35±0.15*], [#underline[0.32±0.09]], [*0.96±0.05*], [*0.55±0.10*],
+    [Top2Vec], [0.21±0.11], [*0.39±0.09*], [#underline[0.57±0.29]], [#underline[0.38±0.15]],
+    [BERTopic], [#underline[0.24±0.12]], [0.17±0.04], [0.64±0.17], [0.35±0.10],
+    table.hline(),
+  ),
+  caption: [Metrics of topic quality compared between different models. Best in bold, second best underlined. Uncertainty is standard deviation.]
+) <topic_quality>
+
+#figure(
+  image("figures/performance.png", width: 100%),
+  caption: [Performance comparison of clustering topic models.\
+  _Left (higher is better)_: Fowlkes-Mallows index against topic interpretability. The large point with error bars represents the mean with a bootstrapped 95% confidence interval. \
+  _Right (lower is better)_: Distribution of absolute percentage error in finding the number of topics.
+  ],
+) <performance>
+
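The bootstrapped confidence interval in the figure caption can be sketched with a simple percentile bootstrap (illustrative; the resampling count and seed are arbitrary choices of mine):

```python
import random

def bootstrap_mean_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    # Resample with replacement n_boot times, record each resample's mean
    means = sorted(
        sum(rng.choices(values, k=len(values))) / len(values)
        for _ in range(n_boot)
    )
    lo = means[int(n_boot * alpha / 2)]
    hi = means[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi
```

Percentile bootstrapping makes no normality assumption, which suits the skewed per-dataset score distributions shown in the figure.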
 
 == Perplexity
 
 Metrics of quality and the number of topics across perplexity values are displayed in @perplexity_robustness.
 Topeax converges very early on the number of topics, remaining stable from `perplexity=5`, while its quality metrics converge at around `perplexity=30`. In light of this, 50 is a reasonable recommendation and default value.
 Meanwhile, BERTopic converges at around `perplexity=50`, and has the lowest performance on all metrics. Top2Vec does not seem to converge at all for the values of perplexity tested, and is the most unstable, though it does seem to improve with larger values of the hyperparameter.
-Keep in mind, that while BERTopic and Top2Vec improve with higher values, their default is set at `perplexity=15`, which, according to these evaluations seems rather unreasonable.
+Keep in mind that while BERTopic and Top2Vec improve with higher values, their default is set at `perplexity=15`, which, in light of these evaluations, seems rather unreasonable.
 
 
 #figure(
@@ -244,3 +313,21 @@ Keep in mind, that while BERTopic and Top2Vec improve with higher values, their
 
 ],
 ) <perplexity_robustness>
+
+== Subsampling
+
+The number of topics, topic quality, and cluster quality are displayed in @subsampling.
+Topeax is relatively well-behaved, and converges to its highest performance when it has access to the full corpus.
+The number of topics is also relatively stable from a sample size of 5000 upwards (hovering around 10-12).
+In contrast, BERTopic and Top2Vec do not converge to a single number of topics, which keeps growing with the size of the subsample.
+This also has an impact on cluster and topic quality: BERTopic performs best on the smallest subsamples (250-1000), while Top2Vec performs best at a subsample of 5000, and both methods decrease in performance as the number of topics grows with sample size. This behaviour is far from ideal, and it is apparent that Topeax is much more reliable at determining the number and structure of clusters in subsampled and full corpora.
+
+#figure(
+  image("figures/subsampling.png", width: 100%),
+  caption: [Topic models' performance at different subsample sizes.\
+  _Left_: Fowlkes-Mallows index as a function of sample size,
+  _Middle_: topic interpretability score at different subsample sizes,
+  _Right_: number of topics discovered at each subsample size.
+
+  ],
+) <subsampling>
