caption: [Schematic overview of clustering topic models' steps.],
) <clustering_topic_models>
= Model Specification
I introduce Topeax, a novel topic modelling approach based on document clustering.
The model differs in a number of aspects from traditional clustering topic models.
#figure(
image("figures/peax.png", width: 100%),
caption: [A schematic overview of the Peax clustering algorithm.
\ Illustrations were generated from the _political ideologies dataset#footnote[https://huggingface.co/datasets/JyotiNayak/political_ideologies]._],
) <peax>
== Dimensionality Reduction
as well as internal word embedding coherence $C_("in")$ with a GloVe model trained
Ideally, a model should have both high intrinsic and extrinsic coherence, and thus an aggregate measure of coherence can give a better
estimate of topic quality: $accent(C, -) = sqrt(C_("in") dot C_("ex"))$.
In addition, an aggregate metric of topic quality can be calculated by taking the geometric mean of coherence and diversity: $I = sqrt(accent(C, -) dot d)$.
We will also refer to this quantity as _interpretability_.
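The two aggregate measures are straightforward to compute; a minimal sketch in Python (the helper names are mine, not part of any library):

```python
import math

def aggregate_coherence(c_in: float, c_ex: float) -> float:
    """Geometric mean of internal and external coherence."""
    return math.sqrt(c_in * c_ex)

def interpretability(c_in: float, c_ex: float, diversity: float) -> float:
    """Geometric mean of aggregate coherence and topic diversity."""
    return math.sqrt(aggregate_coherence(c_in, c_ex) * diversity)
```

The geometric mean is used instead of the arithmetic mean so that a model cannot compensate for a near-zero score on one component with a high score on the other.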
== Sensitivity to Perplexity
Both t-SNE and UMAP have a hyperparameter that determines how many neighbours of a given point are considered when generating lower-dimensional projections; this hyperparameter is usually referred to as _perplexity_.
It is also known that both methods are sensitive to the choice of hyperparameters and that, depending on these, structures that do not exist in the higher-dimensional feature space might appear (cite Distill article and "Understanding UMAP").
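In scikit-learn's t-SNE this knob is exposed directly as `perplexity` (UMAP's analogous parameter is `n_neighbors`); a minimal illustration on synthetic stand-in embeddings, projecting the same points at two perplexity values:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 32))  # stand-in for document embeddings

# The same data projected at different perplexities can exhibit
# substantially different neighbourhood structure.
for perplexity in (5, 50):
    projection = TSNE(
        n_components=2, perplexity=perplexity, random_state=0
    ).fit_transform(embeddings)
    print(perplexity, projection.shape)
```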
In order to see how this affects the Topeax algorithm, and how robust it is to the choice of this hyperparameter compared with other clustering topic models, I fitted each model to the 20 Newsgroups corpus from `scikit-learn`, using `all-MiniLM-L6-v2` embeddings with `perplexities=[2, 5, 30, 50, 100]`.
This choice of values was inspired by (cite Distill). Each model was evaluated on the metrics outlined above.
== Subsampling Invariance
Ideally, a good topic model should recover roughly the same topics, and the same number of topics, in a corpus even when we only have access to a subsample of that corpus, assuming that the underlying categories are the same.
On the other hand, we would reasonably assume that giving a model access to the full corpus, instead of a subsample, should increase the accuracy of the results, not decrease it.
To evaluate the models' ability to cope with subsampling, I fit each model on the same corpus and embeddings as in the perplexity sensitivity test and evaluate them on the previously outlined metrics.
Subsample sizes are the following: `[250, 1000, 5000, 10_000, "full"]`.
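The sampling procedure can be sketched as follows (the helper is illustrative and not part of any of the evaluated libraries; the stand-in corpus replaces the real data):

```python
import random

SUBSAMPLE_SIZES = [250, 1000, 5000, 10_000, "full"]

def subsample(corpus, size, seed=42):
    """Draw a random subsample of the corpus; "full" keeps every document."""
    if size == "full":
        return list(corpus)
    return random.Random(seed).sample(corpus, size)

corpus = [f"doc_{i}" for i in range(12_000)]  # stand-in corpus
samples = {size: subsample(corpus, size) for size in SUBSAMPLE_SIZES}
```

Each model is then fitted once per subsample and scored with the same metrics as in the perplexity experiment.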
= Results
Topeax substantially outperformed both Top2Vec and BERTopic in cluster recovery, as well as in the quality of the topic keywords (see @performance).
A regression analysis predicting the Fowlkes-Mallows index from model type, with random intercepts for encoders and datasets, was conducted.
The regression was significant at $alpha=0.05$ ($R^2=0.127$, $F=4.368$, $p=0.0169$).
Both BERTopic and Top2Vec had significantly negative slopes (see @coeffs).
caption: [Regression coefficients for predicting Fowlkes-Mallows Index from choice of topic model]
) <coeffs>
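This kind of analysis can be sketched with `statsmodels`' mixed linear model, treating datasets as the grouping factor and encoders as a variance component; the data below is synthetic stand-in data and the exact specification used for the reported coefficients may differ:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
# Synthetic stand-in for the long-format evaluation results.
df = pd.DataFrame({
    "model": np.tile(["Topeax", "BERTopic", "Top2Vec"], 40),
    "dataset": np.repeat([f"ds{i}" for i in range(5)], 24),
    "encoder": np.tile(np.repeat(["e1", "e2"], 3), 20),
    "fmi": rng.uniform(0.1, 0.9, 120),
})

# Fixed effect of model type (Topeax as reference level), random
# intercepts for datasets (groups) and encoders (variance component).
mixed = smf.mixedlm(
    "fmi ~ C(model, Treatment('Topeax'))",
    data=df,
    groups="dataset",
    vc_formula={"encoder": "0 + C(encoder)"},
)
result = mixed.fit()
print(result.params)
```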
Topeax also exhibited the lowest absolute percentage error in recovering the number of topics (see @performance) with $"MAPE" = 60.52%$ ($"SD"=26.19$),
while Top2Vec ($M=1797.29%$, $"SD"=2622.52$) and BERTopic ($M = 2438.91%$, $"SD" = 3011.63$) drastically deviated from the number of gold labels in the datasets.
It is also important to note the opposite directionality of these errors.
While Topeax almost universally underestimated the number of topics, especially in `StackExchangeClusteringP2P` and `MedrxivClusteringP2P`, where the number of unique labels was very large, Top2Vec and BERTopic almost always grossly overestimated the number of clusters in the data.
This is undesirable behaviour for a topic model, as topic interpretation requires manual effort, and vast numbers of topics (>500) become difficult and labour-intensive for any individual to label.
caption: [Metrics of topic quality compared between different models. Best bold, second best underlined. Uncertainty is standard deviation.]
) <topic_quality>
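The error metric used above, along with a signed variant that exposes the directionality of the errors, can be sketched as (helper names are mine):

```python
def mape(predicted, gold):
    """Mean absolute percentage error of the recovered number of topics."""
    errors = [abs(p - g) / g * 100 for p, g in zip(predicted, gold)]
    return sum(errors) / len(errors)

def mean_signed_error(predicted, gold):
    """Positive values indicate overestimation, negative underestimation."""
    return sum(p - g for p, g in zip(predicted, gold)) / len(gold)
```

MAPE alone cannot distinguish a model that consistently underestimates from one that overestimates, which is why the signed direction of the errors is reported separately.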
#figure(
image("figures/performance.png", width: 100%),
caption: [Performance comparison of clustering topic models.\
_Left (Higher is better)_: Fowlkes-Mallows Index against topic interpretability. Large point with error bar represents mean with bootstrapped 95% confidence interval. \
_Right (Lower is better)_: Distribution of absolute percentage error in finding the number of topics.
],
) <performance>
== Perplexity
Metrics of quality and the number of topics across perplexity values are displayed in @perplexity_robustness.
Topeax converges very early on the number of topics, remaining stable from `perplexity=5`, and converges at around `perplexity=30` on the quality metrics. In light of this, `perplexity=50` is a reasonable recommendation and default value.
Meanwhile, BERTopic converges at around `perplexity=50` and has the lowest performance on all metrics. Top2Vec does not seem to converge at all for the values of perplexity tested and is the most unstable, though it does seem to improve with larger values of the hyperparameter.
Keep in mind that, while BERTopic and Top2Vec improve with higher values, their default is set at `perplexity=15`, which, in light of these evaluations, seems rather unreasonable.
#figure(
],
) <perplexity_robustness>
== Subsampling
The number of topics, topic quality and cluster quality are displayed in @subsampling.
Topeax is relatively well-behaved and converges to its highest performance when it has access to the full corpus.
The number of topics is also relatively stable from a sample size of 5000 upwards (hovering around 10-12).
In contrast, BERTopic and Top2Vec do not converge to a single number of topics; the number keeps growing with the size of the subsample.
This also has an impact on cluster and topic quality. BERTopic has the highest performance on the smallest subsamples (250-1000), while Top2Vec performs best on a subsample of 5000; both methods decrease in performance as the number of topics grows with sample size. This behaviour is far from ideal, and it is apparent that Topeax is much more reliable at determining the number and structure of clusters in both subsampled and full corpora.
#figure(
image("figures/subsampling.png", width: 100%),
caption: [Topic models' performance at different subsample sizes.\
_Left_: Fowlkes-Mallows Index as a function of sample size,
_Middle_: Topic Interpretability Score at different subsamples,
_Right_: Number of topics discovered at each sample size.
],
) <subsampling>