Commit 898185f

Merge branch 'main' into contextual_encoder
2 parents b52dff8 + 5359e3c commit 898185f

12 files changed

Lines changed: 360 additions & 10 deletions

README.md

Lines changed: 20 additions & 1 deletion
@@ -15,7 +15,6 @@
 | [Topic Analysis](https://x-tabdeveloping.github.io/turftopic/analyzers/) | :robot: LLM-generated names and descriptions, :wave: Manual Topic Naming |
 | [Informative Topic Descriptions](https://x-tabdeveloping.github.io/turftopic/vectorizers/) | :key: Keyphrases, Noun-phrases, Lemmatization, Stemming |
 
-
 ## Basics
 [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/x-tabdeveloping/turftopic/blob/main/examples/basic_example_20newsgroups.ipynb)
 
@@ -228,6 +227,26 @@ topicwizard.visualize(corpus, model=model)
 
 Alternatively you can use the [Figures API](https://x-tabdeveloping.github.io/topicwizard/figures.html) in topicwizard for individual HTML figures.
 
+## Citation
+
+Please cite us when using Turftopic:
+
+```bibtex
+@article{
+Kardos2025,
+title = {Turftopic: Topic Modelling with Contextual Representations from Sentence Transformers},
+doi = {10.21105/joss.08183},
+url = {https://doi.org/10.21105/joss.08183},
+year = {2025},
+publisher = {The Open Journal},
+volume = {10},
+number = {111},
+pages = {8183},
+author = {Kardos, Márton and Enevoldsen, Kenneth C. and Kostkan, Jan and Kristensen-McLachlan, Ross Deans and Rocca, Roberta},
+journal = {Journal of Open Source Software}
+}
+```
+
 ## References
 - Kardos, M., Kostkan, J., Vermillet, A., Nielbo, K., Enevoldsen, K., & Rocca, R. (2024, June 13). $S^3$ - Semantic Signal separation. arXiv.org. https://arxiv.org/abs/2406.09556
 - Wu, X., Nguyen, T., Zhang, D. C., Wang, W. Y., & Luu, A. T. (2024). FASTopic: A Fast, Adaptive, Stable, and Transferable Topic Modeling Paradigm. ArXiv Preprint ArXiv:2405.17978.

docs/FASTopic.md

Lines changed: 50 additions & 5 deletions
@@ -1,14 +1,59 @@
 # FASTopic
 
-FASTopic is a neural topic model based on Dual Semantic-relation Reconstruction.
+FASTopic (Wu et al., 2024) is a neural topic model based on Dual Semantic-relation Reconstruction.
 
-> Turftopic contains an implementation repurposed for our API, but the implementation is mostly from the [original FASTopic package](https://github.com/BobXWu/FASTopic).
+<figure>
+<img src="../images/fastopic.png" style="width:1050px;padding:0px;border:none;">
+<figcaption> Figure 1: Schematic Overview of the FASTopic Model.<br> <i>Figure from Wu et al. (2024)</i> </figcaption>
+</figure>
 
-:warning: This part of the documentation is still under construction :warning:
+Instead of reconstructing bag-of-words representations, as classical topic models and VAE-based models do, FASTopic reconstructs the relations between topics, words, and documents.
 
-## References
+Wu et al. (2024) model these semantic relations using the Embedding Transport Plan (ETP) method.
+
+The model uses a combined loss function that helps it learn semantic relations between topic and word embeddings, and learn to reconstruct these relations.
+
+## Usage
+
+```python
+from turftopic import FASTopic
+
+documents = [...]
+
+model = FASTopic(10)
+doc_topic_matrix = model.fit_transform(documents)
+model.print_topics()
+```
+
+## Citation
+
+Please cite the authors of the paper and Turftopic when using the FASTopic model:
+
+```bibtex
+@inproceedings{
+wu2024fastopic,
+title={{FAST}opic: Pretrained Transformer is a Fast, Adaptive, Stable, and Transferable Topic Model},
+author={Xiaobao Wu and Thong Thanh Nguyen and Delvin Ce Zhang and William Yang Wang and Anh Tuan Luu},
+booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
+year={2024},
+url={https://openreview.net/forum?id=7t6aq0Fa9D}
+}
+
+@article{
+Kardos2025,
+title = {Turftopic: Topic Modelling with Contextual Representations from Sentence Transformers},
+doi = {10.21105/joss.08183},
+url = {https://doi.org/10.21105/joss.08183},
+year = {2025},
+publisher = {The Open Journal},
+volume = {10},
+number = {111},
+pages = {8183},
+author = {Kardos, Márton and Enevoldsen, Kenneth C. and Kostkan, Jan and Kristensen-McLachlan, Ross Deans and Rocca, Roberta},
+journal = {Journal of Open Source Software}
+}
+```
 
-Wu, X., Nguyen, T., Zhang, D. C., Wang, W. Y., & Luu, A. T. (2024). FASTopic: A Fast, Adaptive, Stable, and Transferable Topic Modeling Paradigm. ArXiv Preprint ArXiv:2405.17978.
 
 ## API Reference
 
docs/GMM.md

Lines changed: 20 additions & 0 deletions
@@ -94,6 +94,26 @@ from sklearn.decomposition import IncrementalPCA
 model = GMM(20, dimensionality_reduction=IncrementalPCA(20))
 ```
 
+## Citation
+
+Please cite Turftopic when using GMM in publications:
+
+```bibtex
+@article{
+Kardos2025,
+title = {Turftopic: Topic Modelling with Contextual Representations from Sentence Transformers},
+doi = {10.21105/joss.08183},
+url = {https://doi.org/10.21105/joss.08183},
+year = {2025},
+publisher = {The Open Journal},
+volume = {10},
+number = {111},
+pages = {8183},
+author = {Kardos, Márton and Enevoldsen, Kenneth C. and Kostkan, Jan and Kristensen-McLachlan, Ross Deans and Rocca, Roberta},
+journal = {Journal of Open Source Software}
+}
+```
+
 ## API Reference
 
 ::: turftopic.models.gmm.GMM
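The soft document-topic assignments that GMM produces can be illustrated with scikit-learn's `GaussianMixture` (a toy sketch of the underlying idea, not Turftopic's implementation; the random vectors stand in for real sentence-transformer embeddings):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
# Toy stand-ins for document embeddings: two loosely separated clouds.
embeddings = np.concatenate([
    rng.normal(loc=-2.0, size=(50, 8)),
    rng.normal(loc=2.0, size=(50, 8)),
])

# A Gaussian mixture gives each document a *soft* topic membership:
# the posterior probability of each component given the embedding.
gmm = GaussianMixture(n_components=2, random_state=42).fit(embeddings)
doc_topic = gmm.predict_proba(embeddings)

print(doc_topic.shape)  # one probability row per document
```

Each row of `doc_topic` is a probability distribution over topics, which is what makes GMM a soft-clustering topic model rather than a hard-assignment one.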

docs/KeyNMF.md

Lines changed: 39 additions & 1 deletion
@@ -1,6 +1,6 @@
 # KeyNMF
 
-KeyNMF is a topic model that relies on contextually sensitive embeddings for keyword retrieval and term importance estimation,
+KeyNMF (Kristensen-McLachlan et al., 2024) is a topic model that relies on contextually sensitive embeddings for keyword retrieval and term importance estimation,
 while taking inspiration from classical matrix-decomposition approaches for extracting topics.
 
 <figure>
@@ -513,6 +513,44 @@ model = KeyNMF(10, encoder=encoder)
 
 Setting the default prompt to `query` is especially important when you are precomputing embeddings, as `query` should always be your default prompt for embedding documents.
 
+## Citation
+
+Please cite Turftopic and Kristensen-McLachlan et al. (2024) when using KeyNMF:
+
+```bibtex
+@article{
+Kardos2025,
+title = {Turftopic: Topic Modelling with Contextual Representations from Sentence Transformers},
+doi = {10.21105/joss.08183},
+url = {https://doi.org/10.21105/joss.08183},
+year = {2025},
+publisher = {The Open Journal},
+volume = {10},
+number = {111},
+pages = {8183},
+author = {Kardos, Márton and Enevoldsen, Kenneth C. and Kostkan, Jan and Kristensen-McLachlan, Ross Deans and Rocca, Roberta},
+journal = {Journal of Open Source Software}
+}
+
+@inproceedings{keynmf,
+title = "Context is Key(NMF): Modelling Topical Information Dynamics in Chinese Diaspora Media",
+abstract = "Does the People{\textquoteright}s Republic of China (PRC) interfere with European elections through ethnic Chinese diaspora media? This question forms the basis of an ongoing research project exploring how PRC narratives about European elections are represented in Chinese diaspora media, and thus the objectives of PRC news media manipulation. In order to study diaspora media efficiently and at scale, it is necessary to use techniques derived from quantitative text analysis, such as topic modelling. In this paper, we present a pipeline for studying information dynamics in Chinese media. Firstly, we present KeyNMF, a new approach to static and dynamic topic modelling using transformer-based contextual embedding models. We provide benchmark evaluations to demonstrate that our approach is competitive on a number of Chinese datasets and metrics. Secondly, we integrate KeyNMF with existing methods for describing information dynamics in complex systems. We apply this pipeline to data from five news sites, focusing on the period of time leading up to the 2024 European parliamentary elections. Our methods and results demonstrate the effectiveness of KeyNMF for studying information dynamics in Chinese media and lay groundwork for further work addressing the broader research questions.",
+keywords = "Chinese, contextual topic models, information dynamics, keywords, novelty",
+author = "Kristensen-McLachlan, {Ross Deans} and Hicke, {Rebecca Marie Matouschek} and M{\'a}rton Kardos and Mette Thun{\o}",
+year = "2024",
+month = dec,
+language = "English",
+volume = "3834",
+series = "CEUR Workshop Proceedings",
+publisher = "CEUR-WS",
+pages = "829--847",
+editor = "Haverals, {Wouter} and Koolen, {Marijn} and Thompson, {Laure}",
+booktitle = "Proceedings of the Computational Humanities Research Conference 2024",
+address = "Germany",
+}
+```
+
+
 ## API Reference
 
 ::: turftopic.models.keynmf.KeyNMF
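One way to picture the pipeline the KeyNMF description hints at, keyword retrieval from embedding similarities followed by a classical matrix decomposition, is the following sketch (an illustration under assumptions, not the actual KeyNMF algorithm: the random vectors stand in for contextual embeddings, and real KeyNMF retrieves keywords differently):

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
# Stand-ins for contextual embeddings of 20 documents and 100 vocabulary terms.
doc_embeddings = rng.normal(size=(20, 64))
word_embeddings = rng.normal(size=(100, 64))

def cosine(a, b):
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

# Keyword matrix: non-negative document-term similarities.
sims = np.maximum(cosine(doc_embeddings, word_embeddings), 0)

# Decompose the keyword matrix into document-topic and topic-term factors.
nmf = NMF(n_components=5, init="nndsvda", max_iter=500, random_state=0)
doc_topic = nmf.fit_transform(sims)  # document-topic weights
topic_word = nmf.components_         # topic-term weights
top_terms = np.argsort(-topic_word, axis=1)[:, :10]  # ten top terms per topic
```

The non-negativity of both factors is what makes the topic-term weights directly interpretable as term importances.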

docs/SensTopic.md

Lines changed: 20 additions & 0 deletions
@@ -162,6 +162,26 @@ model.print_topics()
 | 4 | tennis, competing, federer, wimbledon, iaaf, olympic, tournament, athlete, rugby, olympics |
 | 5 | gdp, stock, economy, earnings, investments, investment, invest, exports, finance, economies |
 
+## Citation
+
+Please cite Turftopic when using the SensTopic model:
+
+```bibtex
+@article{
+Kardos2025,
+title = {Turftopic: Topic Modelling with Contextual Representations from Sentence Transformers},
+doi = {10.21105/joss.08183},
+url = {https://doi.org/10.21105/joss.08183},
+year = {2025},
+publisher = {The Open Journal},
+volume = {10},
+number = {111},
+pages = {8183},
+author = {Kardos, Márton and Enevoldsen, Kenneth C. and Kostkan, Jan and Kristensen-McLachlan, Ross Deans and Rocca, Roberta},
+journal = {Journal of Open Source Software}
+}
+```
+
 
 ## API Reference
 
docs/Topeax.md

Lines changed: 31 additions & 1 deletion
@@ -1,6 +1,6 @@
 # Topeax
 
-Topeax is a probabilistic topic model based on the Peax clustering model, which finds topics based on peaks in point density in the embedding space. The model can recover the number of topics automatically.
+Topeax (Kardos, 2026) is a probabilistic topic model based on the Peax clustering model, which finds topics based on peaks in point density in the embedding space. The model can recover the number of topics automatically.
 
 In the following example we run a Topeax model on the BBC News corpus and plot the steps of the algorithm to inspect how our documents have been clustered and why:
 
@@ -123,6 +123,36 @@ topeax.plot_components_datamapplot()
 <figcaption> Figure 5: Datapoints colored by mixture components on a datamapplot. </figcaption>
 </figure>
 
+## Citation
+
+Please cite Turftopic and Kardos (2026) when using the Topeax model:
+
+```bibtex
+@article{
+Kardos2025,
+title = {Turftopic: Topic Modelling with Contextual Representations from Sentence Transformers},
+doi = {10.21105/joss.08183},
+url = {https://doi.org/10.21105/joss.08183},
+year = {2025},
+publisher = {The Open Journal},
+volume = {10},
+number = {111},
+pages = {8183},
+author = {Kardos, Márton and Enevoldsen, Kenneth C. and Kostkan, Jan and Kristensen-McLachlan, Ross Deans and Rocca, Roberta},
+journal = {Journal of Open Source Software}
+}
+
+@misc{kardos2026topeaximprovedclustering,
+title={Topeax -- An Improved Clustering Topic Model with Density Peak Detection and Lexical-Semantic Term Importance},
+author={Márton Kardos},
+year={2026},
+eprint={2601.21465},
+archivePrefix={arXiv},
+primaryClass={cs.AI},
+url={https://arxiv.org/abs/2601.21465},
+}
+```
+
 ## API Reference
 
 ::: turftopic.models.topeax.Topeax

docs/clustering.md

Lines changed: 43 additions & 0 deletions
@@ -403,7 +403,50 @@ _See Figure 1_
 !!! info
     If you are not running Turftopic from a Jupyter notebook, make sure to call `fig.show()`. This will open up a new browser tab with the interactive figure.
 
+## Citation
+
+Please cite Turftopic when using clustering models:
+
+```bibtex
+@article{
+Kardos2025,
+title = {Turftopic: Topic Modelling with Contextual Representations from Sentence Transformers},
+doi = {10.21105/joss.08183},
+url = {https://doi.org/10.21105/joss.08183},
+year = {2025},
+publisher = {The Open Journal},
+volume = {10},
+number = {111},
+pages = {8183},
+author = {Kardos, Márton and Enevoldsen, Kenneth C. and Kostkan, Jan and Kristensen-McLachlan, Ross Deans and Rocca, Roberta},
+journal = {Journal of Open Source Software}
+}
+```
 
+In addition, cite Grootendorst (2022) or Angelov (2020) when using BERTopic or Top2Vec, respectively:
+
+```bibtex
+@misc{grootendorst2022bertopicneuraltopicmodeling,
+title={BERTopic: Neural topic modeling with a class-based TF-IDF procedure},
+author={Maarten Grootendorst},
+year={2022},
+eprint={2203.05794},
+archivePrefix={arXiv},
+primaryClass={cs.CL},
+url={https://arxiv.org/abs/2203.05794},
+}
+
+@misc{angelov2020top2vecdistributedrepresentationstopics,
+title={Top2Vec: Distributed Representations of Topics},
+author={Dimo Angelov},
+year={2020},
+eprint={2008.09470},
+archivePrefix={arXiv},
+primaryClass={cs.CL},
+url={https://arxiv.org/abs/2008.09470},
+}
+```
 
 ## API Reference
 
docs/ctm.md

Lines changed: 41 additions & 1 deletion
@@ -1,6 +1,6 @@
 # Variational Autoencoding Topic Models
 
-Topic models based on Variational Autoencoding are generative models based on ProdLDA (citation) enhanced with contextual representations.
+Topic models based on Variational Autoencoding are generative models based on ProdLDA, enhanced with contextual representations (Bianchi et al., 2021).
 
 <figure>
 <img src="../images/CTM_plate.png" width="60%" style="margin-left: auto;margin-right: auto;">
@@ -56,6 +56,46 @@ This has a number of implications, most notably:
 
 Similarly to clustering models, Turftopic might not contain some model-specific utilities that CTM boasts.
 
+## Citation
+
+Please cite Turftopic and Bianchi et al. (2021) when using Autoencoding models in Turftopic:
+
+```bibtex
+@article{
+Kardos2025,
+title = {Turftopic: Topic Modelling with Contextual Representations from Sentence Transformers},
+doi = {10.21105/joss.08183},
+url = {https://doi.org/10.21105/joss.08183},
+year = {2025},
+publisher = {The Open Journal},
+volume = {10},
+number = {111},
+pages = {8183},
+author = {Kardos, Márton and Enevoldsen, Kenneth C. and Kostkan, Jan and Kristensen-McLachlan, Ross Deans and Rocca, Roberta},
+journal = {Journal of Open Source Software}
+}
+
+@inproceedings{bianchi-etal-2021-pre,
+title = "Pre-training is a Hot Topic: Contextualized Document Embeddings Improve Topic Coherence",
+author = "Bianchi, Federico and Terragni, Silvia and Hovy, Dirk",
+editor = "Zong, Chengqing and Xia, Fei and Li, Wenjie and Navigli, Roberto",
+booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)",
+month = aug,
+year = "2021",
+address = "Online",
+publisher = "Association for Computational Linguistics",
+url = "https://aclanthology.org/2021.acl-short.96/",
+doi = "10.18653/v1/2021.acl-short.96",
+pages = "759--766",
+abstract = "Topic models extract groups of words from documents, whose interpretation as a topic hopefully allows for a better understanding of the data. However, the resulting word groups are often not coherent, making them harder to interpret. Recently, neural topic models have shown improvements in overall coherence. Concurrently, contextual embeddings have advanced the state of the art of neural models in general. In this paper, we combine contextualized representations with neural topic models. We find that our approach produces more meaningful and coherent topics than traditional bag-of-words topic models and recent neural models. Our results indicate that future improvements in language models will translate into better topic models."
+}
+```
+
 ## API Reference
 
 ::: turftopic.models.ctm.AutoEncodingTopicModel
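The generative core shared by ProdLDA-style autoencoding topic models can be sketched in a few lines of numpy (a schematic illustration only: real models learn the per-document mean and log-variance with an encoder network over contextual representations, whereas here they are faked with random values):

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, n_topics = 4, 10

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# The encoder of a ProdLDA-style model outputs a mean and log-variance
# per topic for each document; we fake them here with random values.
mu = rng.normal(size=(n_docs, n_topics))
log_sigma = rng.normal(scale=0.1, size=(n_docs, n_topics))

# Reparameterization trick: sample latent vectors, then map them to
# document-topic proportions via softmax (a logistic-normal distribution).
eps = rng.standard_normal(mu.shape)
z = mu + np.exp(0.5 * log_sigma) * eps
doc_topic = softmax(z)
```

Every row of `doc_topic` is a valid topic distribution, which is what the decoder then uses to reconstruct the (relation to the) bag-of-words representation.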

docs/cvp.md

Lines changed: 34 additions & 0 deletions
@@ -110,6 +110,40 @@ fig.show()
 <figcaption> Figure 2: Concepts evolving over tokens in the first document. </figcaption>
 </figure>
 
+
+## Citation
+
+Please cite Lyngbæk et al. (2025) and Turftopic when using Concept Vector Projection in publications:
+
+```bibtex
+@article{
+Kardos2025,
+title = {Turftopic: Topic Modelling with Contextual Representations from Sentence Transformers},
+doi = {10.21105/joss.08183},
+url = {https://doi.org/10.21105/joss.08183},
+year = {2025},
+publisher = {The Open Journal},
+volume = {10},
+number = {111},
+pages = {8183},
+author = {Kardos, Márton and Enevoldsen, Kenneth C. and Kostkan, Jan and Kristensen-McLachlan, Ross Deans and Rocca, Roberta},
+journal = {Journal of Open Source Software}
+}
+
+@incollection{Lyngbaek2025,
+title = {Continuous Sentiment Scores for Literary and Multilingual Contexts},
+author = {Laurits Lyngbaek and Pascale Feldkamp and Yuri Bizzoni and Kristoffer L. Nielbo and Kenneth Enevoldsen},
+year = {2025},
+booktitle = {Computational Humanities Research 2025},
+publisher = {Anthology of Computers and the Humanities},
+pages = {480--497},
+editor = {Taylor Arnold and Margherita Fantoli and Ruben Ros},
+doi = {10.63744/nVu1Zq5gRkuD}
+}
+```
+
+
 ## API Reference
 
docs/images/fastopic.png

116 KB