
Commit 5399de6

Added citation boxes to all models in docs

1 parent 796b003 · commit 5399de6
9 files changed · 271 additions & 4 deletions
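The commit stamps the same "## Citation" box (sharing one Turftopic BibTeX entry) onto nine model pages under `docs/`. A small helper script could apply such a box uniformly; the sketch below is hypothetical and not part of the Turftopic repository — the `CITATION_BOX` constant and file layout are assumptions, with the BibTeX body abbreviated:

```python
# Hypothetical helper for a change like this commit: append a shared
# "## Citation" box to every Markdown model page in docs/ that lacks one.
# Not a script from the repo; paths and the citation text are assumed.
from pathlib import Path

FENCE = "`" * 3  # build the code fence so it doesn't terminate this example

CITATION_BOX = (
    "\n## Citation\n\n"
    "Please cite Turftopic when using this model:\n\n"
    f"{FENCE}bibtex\n"
    "@article{Kardos2025, ...}\n"  # shared entry, abbreviated here
    f"{FENCE}\n"
)

def add_citation_box(docs_dir: Path) -> list[str]:
    """Append the citation box to pages missing one; return updated names."""
    updated = []
    for page in sorted(docs_dir.glob("*.md")):
        text = page.read_text(encoding="utf-8")
        if "## Citation" not in text:  # idempotent: skip already-cited pages
            page.write_text(text.rstrip() + "\n" + CITATION_BOX,
                            encoding="utf-8")
            updated.append(page.name)
    return updated
```

Because the guard checks for an existing "## Citation" heading, re-running the helper leaves already-updated pages untouched.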

File tree

docs/FASTopic.md

Lines changed: 1 addition & 1 deletion
````diff
@@ -25,7 +25,7 @@ doc_topic_matrix = model.fit_transform(documents)
 model.print_topics()
 ```
 
-## References
+## Citation
 
 Please cite the authors of the paper, and Turftopic when using the FASTopic model:
 
````
docs/GMM.md

Lines changed: 20 additions & 0 deletions
````diff
@@ -94,6 +94,26 @@ from sklearn.decomposition import IncrementalPCA
 model = GMM(20, dimensionality_reduction=IncrementalPCA(20))
 ```
 
+## Citation
+
+Please cite Turftopic when using GMM in publications:
+
+```bibtex
+@article{
+Kardos2025,
+title = {Turftopic: Topic Modelling with Contextual Representations from Sentence Transformers},
+doi = {10.21105/joss.08183},
+url = {https://doi.org/10.21105/joss.08183},
+year = {2025},
+publisher = {The Open Journal},
+volume = {10},
+number = {111},
+pages = {8183},
+author = {Kardos, Márton and Enevoldsen, Kenneth C. and Kostkan, Jan and Kristensen-McLachlan, Ross Deans and Rocca, Roberta},
+journal = {Journal of Open Source Software}
+}
+```
+
 ## API Reference
 
 ::: turftopic.models.gmm.GMM
````
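The context line in this hunk shows GMM taking an `IncrementalPCA(20)` reducer. As a standalone illustration of what that step does before the mixture is fitted, here is a sklearn-only sketch on synthetic data (the random embeddings stand in for real sentence-transformer output; this is not the Turftopic pipeline itself):

```python
# Sketch of the dimensionality reduction step from the GMM example above:
# IncrementalPCA compresses document embeddings to 20 dimensions.
# Synthetic data is used in place of sentence-transformer embeddings.
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(42)
embeddings = rng.normal(size=(100, 384))  # 100 documents, 384-dim embeddings

reducer = IncrementalPCA(n_components=20)
reduced = reducer.fit_transform(embeddings)

print(reduced.shape)  # (100, 20)
```

Reducing to roughly the number of topics before fitting the Gaussian mixture keeps the covariance estimation tractable in high-dimensional embedding spaces.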

docs/KeyNMF.md

Lines changed: 39 additions & 1 deletion
````diff
@@ -1,6 +1,6 @@
 # KeyNMF
 
-KeyNMF is a topic model that relies on contextually sensitive embeddings for keyword retrieval and term importance estimation,
+KeyNMF (Kristensen-McLachlan et al., 2024) is a topic model that relies on contextually sensitive embeddings for keyword retrieval and term importance estimation,
 while taking inspiration from classical matrix-decomposition approaches for extracting topics.
 
 <figure>
@@ -513,6 +513,44 @@ model = KeyNMF(10, encoder=encoder)
 
 Setting the default prompt to `query` is especially important, when you are precomputing embeddings, as `query` should always be your default prompt to embed documents with.
 
+## Citation
+
+Please cite Turftopic and Kristensen-McLachlan et al. (2024) when using KeyNMF:
+
+```bibtex
+@article{
+Kardos2025,
+title = {Turftopic: Topic Modelling with Contextual Representations from Sentence Transformers},
+doi = {10.21105/joss.08183},
+url = {https://doi.org/10.21105/joss.08183},
+year = {2025},
+publisher = {The Open Journal},
+volume = {10},
+number = {111},
+pages = {8183},
+author = {Kardos, Márton and Enevoldsen, Kenneth C. and Kostkan, Jan and Kristensen-McLachlan, Ross Deans and Rocca, Roberta},
+journal = {Journal of Open Source Software}
+}
+
+@inproceedings{keynmf,
+title = "Context is Key(NMF): Modelling Topical Information Dynamics in Chinese Diaspora Media",
+abstract = "Does the People{\textquoteright}s Republic of China (PRC) interfere with European elections through ethnic Chinese diaspora media? This question forms the basis of an ongoing research project exploring how PRC narratives about European elections are represented in Chinese diaspora media, and thus the objectives of PRC news media manipulation. In order to study diaspora media efficiently and at scale, it is necessary to use techniques derived from quantitative text analysis, such as topic modelling. In this paper, we present a pipeline for studying information dynamics in Chinese media. Firstly, we present KeyNMF, a new approach to static and dynamic topic modelling using transformer-based contextual embedding models. We provide benchmark evaluations to demonstrate that our approach is competitive on a number of Chinese datasets and metrics. Secondly, we integrate KeyNMF with existing methods for describing information dynamics in complex systems. We apply this pipeline to data from five news sites, focusing on the period of time leading up to the 2024 European parliamentary elections. Our methods and results demonstrate the effectiveness of KeyNMF for studying information dynamics in Chinese media and lay groundwork for further work addressing the broader research questions.",
+keywords = "Chinese, contextual topic models, information dynamics, keywords, novelty",
+author = "Kristensen-McLachlan, {Ross Deans} and Hicke, {Rebecca Marie Matouschek} and M{\'a}rton Kardos and Mette Thun{\o}",
+year = "2024",
+month = dec,
+language = "English",
+volume = "3834",
+series = "CEUR Workshop Proceedings",
+publisher = "CEUR-WS",
+pages = "829--847",
+editor = "Haverals, {Wouter} and Koolen, {Marijn} and Thompson, {Laure}",
+booktitle = "Proceedings of the Computational Humanities Research Conference 2024",
+address = "Germany",
+}
+```
+
+
 ## API Reference
 
 ::: turftopic.models.keynmf.KeyNMF
````
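The KeyNMF description above mentions "classical matrix-decomposition approaches". As a rough illustration of that decomposition idea (not the actual KeyNMF pipeline — the keyword-importance matrix here is random data), plain sklearn NMF factorises a non-negative document-term matrix into document-topic and topic-term factors:

```python
# Sketch of the matrix-decomposition idea KeyNMF draws on: a non-negative
# document x keyword importance matrix is factorised into document-topic and
# topic-term matrices. Synthetic data; not the actual KeyNMF pipeline.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
importance = rng.random((50, 200))  # 50 documents x 200 keyword weights

nmf = NMF(n_components=10, init="nndsvda", random_state=0, max_iter=500)
doc_topic = nmf.fit_transform(importance)  # document-topic weights (50, 10)
topic_term = nmf.components_               # topic-term weights (10, 200)

print(doc_topic.shape, topic_term.shape)
```

Both factors are non-negative, which is what makes the topic-term rows interpretable as ranked keyword lists.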

docs/SensTopic.md

Lines changed: 20 additions & 0 deletions
````diff
@@ -162,6 +162,26 @@ model.print_topics()
 | 4 | tennis, competing, federer, wimbledon, iaaf, olympic, tournament, athlete, rugby, olympics |
 | 5 | gdp, stock, economy, earnings, investments, investment, invest, exports, finance, economies |
 
+## Citation
+
+Please cite Turftopic when using the SensTopic model:
+
+```bibtex
+@article{
+Kardos2025,
+title = {Turftopic: Topic Modelling with Contextual Representations from Sentence Transformers},
+doi = {10.21105/joss.08183},
+url = {https://doi.org/10.21105/joss.08183},
+year = {2025},
+publisher = {The Open Journal},
+volume = {10},
+number = {111},
+pages = {8183},
+author = {Kardos, Márton and Enevoldsen, Kenneth C. and Kostkan, Jan and Kristensen-McLachlan, Ross Deans and Rocca, Roberta},
+journal = {Journal of Open Source Software}
+}
+```
+
 
 ## API Reference
 
````

docs/Topeax.md

Lines changed: 31 additions & 1 deletion
````diff
@@ -1,6 +1,6 @@
 # Topeax
 
-Topeax is a probabilistic topic model based on the Peax clustering model, which finds topics based on peaks in point density in the embedding space. The model can recover the number of topics automatically.
+Topeax (Kardos, 2026) is a probabilistic topic model based on the Peax clustering model, which finds topics based on peaks in point density in the embedding space. The model can recover the number of topics automatically.
 
 In the following example I run a Topeax model on the BBC News corpus, and plot the steps of the algorithm to inspect how our documents have been clustered and why:
 
@@ -123,6 +123,36 @@ topeax.plot_components_datamapplot()
 <figcaption> Figure 5: Datapoints colored by mixture components on a datamapplot. </figcaption>
 </figure>
 
+## Citation
+
+Please cite Turftopic and Kardos (2026) when using the Topeax model:
+
+```bibtex
+@article{
+Kardos2025,
+title = {Turftopic: Topic Modelling with Contextual Representations from Sentence Transformers},
+doi = {10.21105/joss.08183},
+url = {https://doi.org/10.21105/joss.08183},
+year = {2025},
+publisher = {The Open Journal},
+volume = {10},
+number = {111},
+pages = {8183},
+author = {Kardos, Márton and Enevoldsen, Kenneth C. and Kostkan, Jan and Kristensen-McLachlan, Ross Deans and Rocca, Roberta},
+journal = {Journal of Open Source Software}
+}
+
+@misc{kardos2026topeaximprovedclustering,
+title={Topeax -- An Improved Clustering Topic Model with Density Peak Detection and Lexical-Semantic Term Importance},
+author={Márton Kardos},
+year={2026},
+eprint={2601.21465},
+archivePrefix={arXiv},
+primaryClass={cs.AI},
+url={https://arxiv.org/abs/2601.21465},
+}
+```
+
 ## API Reference
 
 ::: turftopic.models.topeax.Topeax
````

docs/clustering.md

Lines changed: 43 additions & 0 deletions
````diff
@@ -403,7 +403,50 @@ _See Figure 1_
 !!! info
     If you are not running Turftopic from a Jupyter notebook, make sure to call `fig.show()`. This will open up a new browser tab with the interactive figure.
 
+## Citation
+
+Please cite Turftopic when using clustering models:
+
+```bibtex
+
+@article{
+Kardos2025,
+title = {Turftopic: Topic Modelling with Contextual Representations from Sentence Transformers},
+doi = {10.21105/joss.08183},
+url = {https://doi.org/10.21105/joss.08183},
+year = {2025},
+publisher = {The Open Journal},
+volume = {10},
+number = {111},
+pages = {8183},
+author = {Kardos, Márton and Enevoldsen, Kenneth C. and Kostkan, Jan and Kristensen-McLachlan, Ross Deans and Rocca, Roberta},
+journal = {Journal of Open Source Software}
+}
+```
 
+In addition, cite Grootendorst (2022) or Angelov (2020) when using BERTopic or Top2Vec respectively:
+
+```bibtex
+@misc{grootendorst2022bertopicneuraltopicmodeling,
+title={BERTopic: Neural topic modeling with a class-based TF-IDF procedure},
+author={Maarten Grootendorst},
+year={2022},
+eprint={2203.05794},
+archivePrefix={arXiv},
+primaryClass={cs.CL},
+url={https://arxiv.org/abs/2203.05794},
+}
+
+@misc{angelov2020top2vecdistributedrepresentationstopics,
+title={Top2Vec: Distributed Representations of Topics},
+author={Dimo Angelov},
+year={2020},
+eprint={2008.09470},
+archivePrefix={arXiv},
+primaryClass={cs.CL},
+url={https://arxiv.org/abs/2008.09470},
+}
+```
 
 ## API Reference
 
````
docs/ctm.md

Lines changed: 41 additions & 1 deletion
````diff
@@ -1,6 +1,6 @@
 # Variational Autoencoding Topic Models
 
-Topic models based on Variational Autoencoding are generative models based on ProdLDA (citation) enhanced with contextual representations.
+Topic models based on Variational Autoencoding are generative models based on ProdLDA (Bianchi et al., 2021) enhanced with contextual representations.
 
 <figure>
 <img src="../images/CTM_plate.png" width="60%" style="margin-left: auto;margin-right: auto;">
@@ -56,6 +56,46 @@ This has a number of implications, most notably:
 
 Turftopic, similarly to clustering models, might not contain some model-specific utilities that CTM boasts.
 
+## Citation
+
+Please cite Turftopic and Bianchi et al. (2021) when using Autoencoding models in Turftopic:
+```bibtex
+
+@article{
+Kardos2025,
+title = {Turftopic: Topic Modelling with Contextual Representations from Sentence Transformers},
+doi = {10.21105/joss.08183},
+url = {https://doi.org/10.21105/joss.08183},
+year = {2025},
+publisher = {The Open Journal},
+volume = {10},
+number = {111},
+pages = {8183},
+author = {Kardos, Márton and Enevoldsen, Kenneth C. and Kostkan, Jan and Kristensen-McLachlan, Ross Deans and Rocca, Roberta},
+journal = {Journal of Open Source Software}
+}
+
+@inproceedings{bianchi-etal-2021-pre,
+title = "Pre-training is a Hot Topic: Contextualized Document Embeddings Improve Topic Coherence",
+author = "Bianchi, Federico and
+Terragni, Silvia and
+Hovy, Dirk",
+editor = "Zong, Chengqing and
+Xia, Fei and
+Li, Wenjie and
+Navigli, Roberto",
+booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)",
+month = aug,
+year = "2021",
+address = "Online",
+publisher = "Association for Computational Linguistics",
+url = "https://aclanthology.org/2021.acl-short.96/",
+doi = "10.18653/v1/2021.acl-short.96",
+pages = "759--766",
+abstract = "Topic models extract groups of words from documents, whose interpretation as a topic hopefully allows for a better understanding of the data. However, the resulting word groups are often not coherent, making them harder to interpret. Recently, neural topic models have shown improvements in overall coherence. Concurrently, contextual embeddings have advanced the state of the art of neural models in general. In this paper, we combine contextualized representations with neural topic models. We find that our approach produces more meaningful and coherent topics than traditional bag-of-words topic models and recent neural models. Our results indicate that future improvements in language models will translate into better topic models."
+}
+```
+
 ## API Reference
 
 ::: turftopic.models.ctm.AutoEncodingTopicModel
````

docs/cvp.md

Lines changed: 33 additions & 0 deletions
````diff
@@ -71,6 +71,39 @@ print(concept_df)
 1    0.269454    0.009495
 ```
 
+## Citation
+
+Please cite Lyngbæk et al. (2025) and Turftopic when using Concept Vector Projection in publications:
+
+```bibtex
+@article{
+Kardos2025,
+title = {Turftopic: Topic Modelling with Contextual Representations from Sentence Transformers},
+doi = {10.21105/joss.08183},
+url = {https://doi.org/10.21105/joss.08183},
+year = {2025},
+publisher = {The Open Journal},
+volume = {10},
+number = {111},
+pages = {8183},
+author = {Kardos, Márton and Enevoldsen, Kenneth C. and Kostkan, Jan and Kristensen-McLachlan, Ross Deans and Rocca, Roberta},
+journal = {Journal of Open Source Software}
+}
+
+@incollection{Lyngbaek2025,
+title = {Continuous Sentiment Scores for Literary and Multilingual
+Contexts},
+author = {Laurits Lyngbaek and Pascale Feldkamp and Yuri Bizzoni and Kristoffer L. Nielbo and Kenneth Enevoldsen},
+year = {2025},
+booktitle = {Computational Humanities Research 2025},
+publisher = {Anthology of Computers and the Humanities},
+pages = {480--497},
+editor = {Taylor Arnold and Margherita Fantoli and Ruben Ros},
+doi = {10.63744/nVu1Zq5gRkuD}
+}
+```
+
+
 ## API Reference
 
 
````
docs/s3.md

Lines changed: 43 additions & 0 deletions
````diff
@@ -202,6 +202,49 @@ fig.show()
 <figcaption> Image Compass of IKEA furniture along two semantic axes </figcaption>
 </figure>
 
+## Citation
+
+Please cite Turftopic and Kardos et al. (2025) when using $S^3$ in Turftopic:
+```bibtex
+
+@article{
+Kardos2025,
+title = {Turftopic: Topic Modelling with Contextual Representations from Sentence Transformers},
+doi = {10.21105/joss.08183},
+url = {https://doi.org/10.21105/joss.08183},
+year = {2025},
+publisher = {The Open Journal},
+volume = {10},
+number = {111},
+pages = {8183},
+author = {Kardos, Márton and Enevoldsen, Kenneth C. and Kostkan, Jan and Kristensen-McLachlan, Ross Deans and Rocca, Roberta},
+journal = {Journal of Open Source Software}
+}
+
+@inproceedings{kardos-etal-2025-s3,
+title = "$S^3$ - Semantic Signal Separation",
+author = "Kardos, M{\'a}rton and
+Kostkan, Jan and
+Enevoldsen, Kenneth and
+Vermillet, Arnault-Quentin and
+Nielbo, Kristoffer and
+Rocca, Roberta",
+editor = "Che, Wanxiang and
+Nabende, Joyce and
+Shutova, Ekaterina and
+Pilehvar, Mohammad Taher",
+booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
+month = jul,
+year = "2025",
+address = "Vienna, Austria",
+publisher = "Association for Computational Linguistics",
+url = "https://aclanthology.org/2025.acl-long.32/",
+doi = "10.18653/v1/2025.acl-long.32",
+pages = "633--666",
+ISBN = "979-8-89176-251-0",
+abstract = "Topic models are useful tools for discovering latent semantic structures in large textual corpora. Recent efforts have been oriented at incorporating contextual representations in topic modeling and have been shown to outperform classical topic models. These approaches are typically slow, volatile, and require heavy preprocessing for optimal results. We present Semantic Signal Separation ($S^3$), a theory-driven topic modeling approach in neural embedding spaces. $S^3$ conceptualizes topics as independent axes of semantic space and uncovers these by decomposing contextualized document embeddings using Independent Component Analysis. Our approach provides diverse and highly coherent topics, requires no preprocessing, and is demonstrated to be the fastest contextual topic model, being, on average, 4.5x faster than the runner-up BERTopic. We offer an implementation of $S^3$, and all contextual baselines, in the Turftopic Python package."
+}
+```
 
 ## API Reference
 
````