#show title: set text(size: 18pt)
#show title: set align(left)

#set text(
  size: 12pt,
  weight: "medium",
)
#set page(
  paper: "a4",
  margin: (x: 1.8cm, y: 1.5cm),
)
#set highlight(
  fill: rgb("#ddddff"),
  radius: 5pt,
  extent: 3pt,
)

#title[
  #highlight[Topeax] -
  An Improved Clustering Topic Model with Density Peak Detection and Lexical-Semantic Term Importance
]

#v(10pt)
#par[
  *Márton Kardos* \
  Aarhus University \
  #link("mailto:martonkardos@cas.au.dk")
]

== Abstract

#text[
  Text clustering is today the most popular paradigm of topic modelling in both research and industry settings.
  Despite the apparent success of clustering topic models, I identify a number of issues in Top2Vec and BERTopic which remain largely unsolved.
  Firstly, these approaches are unreliable at discovering the number of topics in a corpus, due to their sensitivity to hyperparameters.
  Secondly, while BERTopic ignores the semantic distance of keywords to topic vectors, Top2Vec ignores word counts in the corpus.
  On the one hand, this results in topics that are hard to interpret due to the presence of stop words and junk words; on the other, in a lack of keyword variety and trustworthiness.
  In this paper, I introduce a new approach, *#highlight[Topeax]*, which discovers the number of clusters from peaks in density estimates,
  and combines lexical and semantic term importance to obtain high-quality topic keywords.
]

#set heading(numbering: "1.")

= Introduction

= Model Specification

I introduce Topeax, a novel topic modelling approach based on document clustering.
The model differs in a number of aspects from traditional clustering topic models like BERTopic and Top2Vec.

#figure(
  image("figures/peax.png", width: 100%),
  caption: [A schematic overview of the Peax clustering algorithm.
    \ _The illustrations were generated from the *political ideologies* dataset._],
) <peax>

== Dimensionality Reduction

Unlike other clustering topic models, Topeax relies on
t-Distributed Stochastic Neighbour Embedding (TSNE) (cite) instead of UMAP.
I use the cosine metric to calculate document similarities for TSNE,
as it is widely used for model training and downstream applications.
The number of dimensions was fixed at 2 in all experiments,
as this allows the reduced embeddings to be visualized.
Additionally, TSNE has fewer hyperparameters than UMAP.
While it has been demonstrated that TSNE can be sensitive to the chosen value of `perplexity`,
I will show that, within a reasonable range, it affects neither the number of topics
nor topic quality.
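
As a concrete illustration, the reduction step can be sketched with scikit-learn's TSNE implementation (an assumption on my part; the text above does not name an implementation, and the perplexity value below is merely scikit-learn's default):

```python
# Hedged sketch of the reduction step, assuming scikit-learn's TSNE.
# The random matrix stands in for real sentence-encoder output.
import numpy as np
from sklearn.manifold import TSNE

embeddings = np.random.rand(500, 384)  # stand-in: (n_docs, embedding_dim)
reduced = TSNE(
    n_components=2,    # fixed to 2 so the reduced space stays plottable
    metric="cosine",   # the document similarity metric named in the text
    perplexity=30.0,   # scikit-learn's default, not prescribed by the paper
).fit_transform(embeddings)
```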

== The Peax Clustering Model

While HDBSCAN is the clustering model of choice for both BERTopic and Top2Vec,
I introduce a new technique for document clustering, termed *#highlight[Peax]*, which,
instead, clusters documents based on density peaks in the reduced document space.

The Peax algorithm consists of the following steps (a code sketch follows the list):

+ A Gaussian Kernel Density Estimate (KDE) is obtained over the reduced document embeddings.
  The bandwidth is determined with Scott's rule.
+ The KDE is evaluated on a 100x100 grid over the embedding space.
  Density peaks are then detected by applying a local-maximum filter to the KDE heatmap.
  A neighbourhood connectivity of 25 is used, meaning that
  every pixel within a 5x5 footprint around each grid point is included.
+ Cluster centres are assigned to these density peaks.
  The density structure of each cluster is estimated
  by fitting a Gaussian mixture model, with its means fixed to the peaks, using the Expectation-Maximization algorithm.
  Documents are assigned to the component with the highest responsibility:
  \ #align(center)[$accent(z_d, "^") = arg max_(k) (r_("kd"))", and " r_("kd") = p(z_k = 1 | accent(x, "^")_d)$]
  where $accent(z_d, "^")$ is the estimated underlying component assigned to document $d$,
  $accent(x, "^")_d$ is the TSNE embedding of document $d$, and $r_("kd")$ is the responsibility of component $k$ for document $d$.
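
A minimal sketch of these steps, under my reading of the algorithm (scipy's `gaussian_kde` uses Scott's rule by default; the filtering of spurious low-density maxima and the convergence checks a real implementation would need are omitted):

```python
# Sketch of Peax: KDE peak detection plus a fixed-mean Gaussian mixture.
# `X` is an (n_docs, 2) array of TSNE-reduced document embeddings.
import numpy as np
from scipy.ndimage import maximum_filter
from scipy.stats import gaussian_kde, multivariate_normal

def find_density_peaks(X, grid_size=100, footprint=5):
    kde = gaussian_kde(X.T)  # Scott's rule is scipy's default bandwidth
    xs = np.linspace(X[:, 0].min(), X[:, 0].max(), grid_size)
    ys = np.linspace(X[:, 1].min(), X[:, 1].max(), grid_size)
    gx, gy = np.meshgrid(xs, ys)
    density = kde(np.vstack([gx.ravel(), gy.ravel()])).reshape(gx.shape)
    # A grid point is a peak if it equals the maximum of its neighbourhood.
    is_peak = maximum_filter(density, size=footprint) == density
    py, px = np.nonzero(is_peak)
    return np.column_stack([gx[py, px], gy[py, px]])

def fixed_mean_responsibilities(X, means, n_iter=50):
    # EM for a Gaussian mixture whose means stay frozen at the peaks;
    # only the component weights and covariances are re-estimated.
    k = len(means)
    weights = np.full(k, 1 / k)
    covs = [np.cov(X.T) for _ in range(k)]
    for _ in range(n_iter):
        # E-step: responsibilities resp[d, k] = p(z_k = 1 | x_d)
        dens = np.stack(
            [w * multivariate_normal(m, c).pdf(X)
             for w, m, c in zip(weights, means, covs)], axis=1)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update weights and covariances, keeping the means fixed.
        nk = resp.sum(axis=0)
        weights = nk / len(X)
        covs = [np.einsum("d,di,dj->ij", resp[:, j], X - means[j], X - means[j])
                / nk[j] + 1e-6 * np.eye(2) for j in range(k)]
    return resp

peaks = find_density_peaks(reduced)
resp = fixed_mean_responsibilities(reduced, peaks)
labels = resp.argmax(axis=1)  # hard cluster assignments
```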

#figure(
  placement: top,
  image("figures/bbc_news_density.png", width: 100%),
  caption: [Topeax model illustrated on the BBC News dataset. Topics are identified at density peaks, and keywords are selected based on combined term importance.],
) <bbc_news_density>

== Term Importance Estimation

To mitigate the issues experienced with c-TF-IDF and centroid-based term importance estimation in previously proposed clustering topic models,
I introduce a novel approach that combines a semantic and a lexical measure of cluster-term importance.

=== Semantic Importance

Semantic term importance is estimated similarly to Top2Vec (cite), but,
since we have access to a probabilistic, non-spherical model in which cluster boundaries are not hard,
topic vectors are estimated from the responsibility-weighted average of document embeddings: \
#align(center)[$t_k = frac(sum_(d) r_("kd") dot x_d, sum_(d) r_("kd"))$]
where $t_k$ is the embedding of topic $k$ and $x_d$ is the embedding of document $d$.
Let the embedding of term $j$ be $w_j$. The semantic importance of term $j$ for cluster $k$ is then:
#align(center)[$s_("kj") = cos(t_k, w_j)$]
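
In code, this is a weighted average followed by a cosine similarity; a minimal sketch (assuming document and term embeddings come from the same sentence encoder, and reusing `resp` from the clustering sketch):

```python
# Sketch of semantic term importance. `doc_emb` is (n_docs, dim),
# `term_emb` is (n_terms, dim), and `resp` is the (n_docs, k)
# responsibility matrix produced by the clustering step.
import numpy as np

def semantic_importance(doc_emb, term_emb, resp):
    # Responsibility-weighted average of document embeddings per topic.
    topics = (resp.T @ doc_emb) / resp.sum(axis=0)[:, None]  # (k, dim)
    t = topics / np.linalg.norm(topics, axis=1, keepdims=True)
    w = term_emb / np.linalg.norm(term_emb, axis=1, keepdims=True)
    return t @ w.T  # s[k, j] = cos(t_k, w_j)
```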

=== Lexical Importance

Instead of relying on a tf-idf-based measure for computing the salience of a term in a corpus,
an information-theoretic approach is used.
Theoretically, we can estimate the lexical importance of a term for a cluster
by computing the mutual information of the term's occurrence with the cluster's occurrence.
Due to its convenient interpretability properties, I opt for normalized pointwise mutual information (NPMI),
which has historically been used for phrase detection (cite) and topic-coherence evaluation (cite).

We calculate the pointwise mutual information by taking the logarithm of the ratio of conditional and marginal word probabilities:
#align(center)[$"pmi"_("kj") = log_2 frac(p(v_j|z_k), p(v_j))$]
where $p(v_j|z_k)$ is the conditional probability of word $j$ given the presence of topic $z_k$,
and $p(v_j)$ is the marginal probability of word $j$ occurring.

A naive approach might estimate these probabilities empirically:
#align(center)[$p(v_j) = frac(n_j, sum_i n_i)", and " p(v_j | z_k) = frac(n_("jk"), sum_i n_("ik"))$]
where $n_j$ is the number of times word $j$ occurs in the corpus, and $n_("jk")$ is the number of times word $j$ occurs in cluster $k$.

This would, however, overestimate the importance of rare words in the clusters where they appear.
We can therefore opt for a posterior-mean estimate under a symmetric Dirichlet prior with an $alpha$ _smoothing_ parameter,
which is analytically tractable:
#align(center)[$p(v_j) = frac(n_j + alpha, N alpha + sum_i n_i)", and " p(v_j | z_k) = frac(n_("jk") + alpha, N alpha + sum_i n_("ik"))$]
where $N$ is the size of the vocabulary. In all further analyses, $alpha = 2$ is used.
Since regular PMI scores have no lower bound, we normalize them to obtain NPMI:
#align(center)[$"npmi"_("kj") = frac("pmi"_("kj"), -log_2 p(v_j, z_k))", where " p(v_j, z_k) = p(v_j|z_k) dot p(z_k)$]
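
A sketch of the smoothed NPMI computation (my reconstruction of the formulas above; the text does not pin down how $p(z_k)$ is estimated, so the token-share estimate below is an assumption):

```python
# Sketch of smoothed NPMI lexical importance. `counts` is an
# (n_clusters, n_terms) matrix of per-cluster word occurrence counts.
import numpy as np

def npmi_importance(counts, alpha=2.0):
    n_vocab = counts.shape[1]
    n_j = counts.sum(axis=0)  # corpus-wide word counts
    p_j = (n_j + alpha) / (n_vocab * alpha + n_j.sum())
    p_j_given_k = (counts + alpha) / (
        n_vocab * alpha + counts.sum(axis=1, keepdims=True))
    pmi = np.log2(p_j_given_k / p_j)
    # Assumption: p(z_k) estimated as each cluster's share of all tokens.
    p_k = counts.sum(axis=1, keepdims=True) / counts.sum()
    return pmi / -np.log2(p_j_given_k * p_k)
```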

=== Combined Term Importance

To balance the semantic proximity of keywords to topic embeddings against cluster-term occurrences,
I introduce a combined approach, which consists of the geometric mean of the lexical and semantic scores, each rescaled from $[-1, 1]$ to $[0, 1]$:

#align(center)[$beta_("kj") = sqrt(frac(1 + "npmi"_("kj"), 2) dot frac(1 + s_("kj"), 2))$]
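
Given the two score matrices from the previous sketches, the combination is a one-liner:

```python
# Combined term importance: geometric mean of the rescaled scores.
# `npmi` and `s` are aligned (n_clusters, n_terms) arrays.
import numpy as np

def combined_importance(npmi, s):
    return np.sqrt((1 + npmi) / 2 * (1 + s) / 2)
```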

= Experimental Methods

Since one of the main strengths of clustering approaches is that they can supposedly find the number of clusters in the data, rather than being given this information a priori,
a good clustering topic model should be able to faithfully replicate a human-assigned clustering of the data, and should be able to describe these clusters in a human-interpretable manner.
I will therefore utilize datasets with gold-standard labels.
In this section, I outline the criteria and considerations taken into account when designing the evaluation procedure:

+ The number of clusters found by the topic model should preferably not be too far from the number of gold categories.
+ Preferably, if two points are in the same gold category, they should also belong together in the predicted clustering, while points in different gold categories should not.
+ For topic modelling purposes, it is often preferable that the number of clusters is not overly large.
  Topic models should, in theory, aid the understanding of a corpus, and using one becomes impractical when the number of topics to interpret exceeds a couple of hundred.
+ Topics should be distinct and easily readable.

== Datasets

In order to evaluate these properties, I used a number of openly available datasets with gold-standard category metadata.
These included all clustering tasks from the new version of the Massive Text Embedding Benchmark, `MTEB(eng, v2)` (cite).
To avoid evaluating on the same corpus twice, the P2P variants of the tasks were used.
In addition, an annotated Twitter topic-classification dataset and a BBC News dataset were used.

#figure(
  caption: [Descriptive statistics of the datasets used for evaluation\ _Document length is reported as mean±standard deviation_],
  table(
    columns: 4,
    stroke: none,
    align: (left, center, center, center),
    table.hline(),
    table.header[*Dataset*][*Document Length*\ _N characters_][*Corpus Size*\ _N documents_][*Clusters*\ _N unique gold labels_],
    table.hline(),
    [ArXivHierarchicalClusteringP2P], [1008.44±438.01], [2048], [23],
    [BiorxivClusteringP2P.v2], [1663.97±541.93], [53787], [26],
    [MedrxivClusteringP2P.v2], [1981.20±922.01], [37500], [51],
    [StackExchangeClusteringP2P.v2], [1091.06±808.88], [74914], [524],
    [TwentyNewsgroupsClustering.v2], [32.04±14.60], [59545], [20],
    [TweetTopicClustering], [165.66±68.19], [4374], [6],
    [BBCNewsClustering], [1000.46±638.41], [2224], [5],
    table.hline(),
  ),
) <dataset_stats>

== Models

To compare Topeax with existing approaches, it was run on all corpora alongside BERTopic and Top2Vec.
Implementations were sourced from the Turftopic (cite) Python package.
For the main analysis, the default hyperparameters of the original BERTopic and Top2Vec packages were used respectively,
as the two sets of defaults yield different clusterings despite the shared pipeline.
All models were run with both the `all-MiniLM-L6-v2` and the slightly larger, higher-performing `all-mpnet-base-v2` sentence encoders (cite sbert)
to control for embedding size and quality.
The models were fitted without filtering out stop words and uncommon terms,
since state-of-the-art topic models are able to handle such information without issues (cite S3).
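
For illustration, encoding a corpus with both encoders might look as follows (a sketch using the `sentence-transformers` package; the actual experiments use Turftopic's own pipelines, not this snippet):

```python
# Hedged illustration of the embedding setup; `docs` is assumed to be a
# list of raw document strings. Model names are the two encoders above.
from sentence_transformers import SentenceTransformer

encoder_names = ["all-MiniLM-L6-v2", "all-mpnet-base-v2"]
embeddings_by_encoder = {
    name: SentenceTransformer(name).encode(docs)
    for name in encoder_names
}
```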

== Metrics

Both clustering quality and topic quality were evaluated.
I evaluated the faithfulness of the predicted clustering to the gold labels using the Fowlkes-Mallows index (FMI) (cite).
The FMI is very similar to the F1 score for classification, in that it also intends to balance precision and recall.
Unlike F1, however, FMI uses the geometric mean of these quantities:
#align(center)[$"FMI" = N_("TP")/sqrt((N_("TP") + N_("FP")) dot (N_("TP") + N_("FN")))$]
where $N_("TP")$ is the number of pairs of points that are clustered together in both clusterings (true positives),
$N_("FP")$ is the number of pairs that are clustered together in the predicted clustering but not in the gold labels (false positives), and
$N_("FN")$ is the number of pairs that are not clustered together in the predicted clustering despite belonging together in the gold labels (false negatives).

For topic quality, I adopt the methodology of (cite S3), with minor differences:
I use GloVe embeddings (cite GloVe) for evaluating internal word embedding coherence instead of Skip-gram.
Topic quality was thus evaluated on topic diversity $d$, external word embedding coherence $C_("ex")$ using the `word2vec-google-news-300` word embedding model,
as well as internal word embedding coherence $C_("in")$ with a GloVe model trained on each corpus.
Ideally, a model should have both high internal and external coherence, and an aggregate measure of coherence can thus give a better
estimate of topic quality: $accent(C, -) = sqrt(C_("in") dot C_("ex"))$.
In addition, an overall metric of topic quality can be calculated by taking the geometric mean of coherence and diversity: $I = sqrt(accent(C, -) dot d)$.
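
The aggregation itself is just a pair of geometric means; with hypothetical component scores (and assuming non-negative coherence values, since the geometric mean is otherwise undefined):

```python
# Aggregate topic-quality metrics from hypothetical component scores.
import numpy as np

c_in, c_ex, d = 0.42, 0.35, 0.88  # hypothetical coherence and diversity
c_bar = np.sqrt(c_in * c_ex)      # aggregate coherence
quality = np.sqrt(c_bar * d)      # overall topic quality I
```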

== Robustness checks

+ Hyperparameters (perplexity)
+ Corpus Subsampling

= Results

== Cluster Recovery
