Commit dc42d80

Merge pull request #80 from x-tabdeveloping/paper
Added paper draft
2 parents a639ebf + 09fa4f4 commit dc42d80

5 files changed

Lines changed: 391 additions & 1 deletion

.github/CONTRIBUTING

Lines changed: 82 additions & 0 deletions
@@ -0,0 +1,82 @@
# Contributing to Turftopic

We do our best to maintain Turftopic, and any help or suggestions from the community are most welcome, including:

- Reporting a bug
- Discussing the current state of the code
- Submitting a fix
- Proposing new features

## We Develop with GitHub
We use GitHub to host code, track issues and feature requests, and accept pull requests.

## Setting up your environment

#### Installing dependencies

Install all developer dependencies by running:
```bash
pip install "turftopic[dev]"
```

#### Formatting

We use black and isort on all files and encourage you to do the same. Code formatted with Ruff will also be accepted:

```bash
pip install black isort

python -m black turftopic/
python -m isort turftopic/
```

#### Running tests

To run the tests locally, run:

```bash
python -m pytest tests/
```

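As a rough illustration of what a test collected by the pytest command above might look like, here is a minimal sketch. The helper `top_words` and the file name `tests/test_example.py` are hypothetical and not part of Turftopic; they only show the shape of a small, self-contained test.

```python
# tests/test_example.py (hypothetical file name, for illustration only)


def top_words(weights: dict[str, float], k: int) -> list[str]:
    """Hypothetical helper: return the k highest-weighted words, best first."""
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:k]]


def test_top_words_orders_by_weight():
    weights = {"topic": 0.5, "model": 0.9, "the": 0.1}
    assert top_words(weights, k=2) == ["model", "topic"]


def test_top_words_handles_small_vocabulary():
    # Asking for more words than exist should return everything, not fail.
    assert top_words({"topic": 0.5}, k=3) == ["topic"]
```

pytest collects functions whose names start with `test_` from files named `test_*.py`, so a file like this would be picked up without extra configuration.
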
#### Running documentation

Turftopic uses [mkdocs](https://www.mkdocs.org/) and [mkdocs-material](https://squidfunk.github.io/mkdocs-material/) to generate documentation.
You can build the documentation and serve it locally by running:

```bash
mkdocs serve
```

## How to contribute?

1. Fork the repo and create your branch from `main`.
2. If you've added code that should be tested, add tests.
3. If you intend to introduce breaking changes, please start a GitHub discussion before submitting.
4. If you've changed APIs, update the documentation.
5. Ensure the test suite passes.
6. Make sure to format the code.
7. Issue that pull request!

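The formatting and testing steps in the list above can be bundled into one local check before opening a pull request. The following is only a sketch, not a script that ships with Turftopic: the function names are hypothetical, and it assumes black, isort and pytest are installed and that it is run from the repository root.

```python
import subprocess
import sys


def build_check_commands(
    package_dir: str = "turftopic/", tests_dir: str = "tests/"
) -> list[list[str]]:
    """Commands mirroring the formatting and testing steps of this guide."""
    return [
        # --check makes black report unformatted files without rewriting them.
        [sys.executable, "-m", "black", "--check", package_dir],
        # --check-only is the isort equivalent.
        [sys.executable, "-m", "isort", "--check-only", package_dir],
        [sys.executable, "-m", "pytest", tests_dir],
    ]


def run_checks(commands: list[list[str]]) -> bool:
    """Run each command in order; stop and return False on the first failure."""
    for command in commands:
        print("Running:", " ".join(command))
        if subprocess.run(command).returncode != 0:
            return False
    return True
```

Calling `run_checks(build_check_commands())` returns `True` only if all three steps succeed. The check flags are used here so that nothing is rewritten silently; drop them if you would rather have the formatters fix files in place.
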
## Report bugs using GitHub's [issues](https://github.com/x-tabdeveloping/turftopic/issues)
We use GitHub issues to track public bugs. Report a bug by [opening a new issue](https://github.com/x-tabdeveloping/turftopic/issues/new/choose); it's that easy!

**Great Bug Reports** tend to have:

- A quick summary and/or background
- Steps to reproduce
  - Be specific!
  - Give sample code if you can.
- What you expected would happen
- What actually happens
- Notes (possibly including why you think this might be happening, or stuff you tried that didn't work)

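Environment details often help when reproducing a bug. The guide does not prescribe any particular tooling for this, so the snippet below is just one possible sketch using only the Python standard library; the function names are hypothetical.

```python
import platform
import sys
from importlib import metadata


def package_version(name: str) -> str:
    """Return the installed version of a package, or a note if it is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def environment_summary(packages: tuple[str, ...] = ("turftopic",)) -> str:
    """Collect details that are commonly useful in a bug report."""
    lines = [
        f"Python: {sys.version.split()[0]}",
        f"Platform: {platform.platform()}",
    ]
    lines += [f"{name}: {package_version(name)}" for name in packages]
    return "\n".join(lines)


print(environment_summary())
```

The printed summary can be pasted into the issue alongside the steps to reproduce and the sample code.
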
## Any contributions you make will be under the MIT Software License
By contributing, you agree to the terms of the [MIT License](http://choosealicense.com/licenses/mit/), under which Turftopic is released.

## Be Respectful
We expect all discussions on this repository to be conducted in a civil and respectful tone. Please refrain from being offensive.

## References
This document was adapted from the open-source contribution guidelines for [Facebook's Draft.js](https://github.com/facebook/draft-js/blob/a9316a723f9e918afde44dea68b5f9f39b7d9b00/CONTRIBUTING.md).

paper.bib

Lines changed: 218 additions & 0 deletions
@@ -0,0 +1,218 @@
@misc{s3,
  title={$S^3$ -- Semantic Signal Separation},
  author={Márton Kardos and Jan Kostkan and Arnault-Quentin Vermillet and Kristoffer Nielbo and Kenneth Enevoldsen and Roberta Rocca},
  year={2024},
  eprint={2406.09556},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2406.09556},
}

@inproceedings{keynmf,
  title = "Context is Key(NMF): Modelling Topical Information Dynamics in Chinese Diaspora Media",
  abstract = "Does the People{\textquoteright}s Republic of China (PRC) interfere with European elections through ethnic Chinese diaspora media? This question forms the basis of an ongoing research project exploring how PRC narratives about European elections are represented in Chinese diaspora media, and thus the objectives of PRC news media manipulation. In order to study diaspora media efficiently and at scale, it is necessary to use techniques derived from quantitative text analysis, such as topic modelling. In this paper, we present a pipeline for studying information dynamics in Chinese media. Firstly, we present KeyNMF, a new approach to static and dynamic topic modelling using transformer-based contextual embedding models. We provide benchmark evaluations to demonstrate that our approach is competitive on a number of Chinese datasets and metrics. Secondly, we integrate KeyNMF with existing methods for describing information dynamics in complex systems. We apply this pipeline to data from five news sites, focusing on the period of time leading up to the 2024 European parliamentary elections. Our methods and results demonstrate the effectiveness of KeyNMF for studying information dynamics in Chinese media and lay groundwork for further work addressing the broader research questions.",
  keywords = "Chinese, contextual topic models, information dynamics, keywords, novelty",
  author = "Kristensen-McLachlan, {Ross Deans} and Hicke, {Rebecca Marie Matouschek} and M{\'a}rton Kardos and Mette Thun{\o}",
  year = "2024",
  month = dec,
  language = "English",
  volume = "3834",
  series = "CEUR Workshop Proceedings",
  publisher = "CEUR-WS",
  pages = "829--847",
  editor = "Haverals, {Wouter} and Koolen, {Marijn} and Thompson, {Laure}",
  booktitle = "Proceedings of the Computational Humanities Research Conference 2024",
  address = "Germany",
}

@article{bertopic_paper,
  title={BERTopic: Neural topic modeling with a class-based TF-IDF procedure},
  author={Grootendorst, Maarten},
  journal={arXiv preprint arXiv:2203.05794},
  year={2022}
}

@inproceedings{topmost,
  title = "Towards the {T}op{M}ost: A Topic Modeling System Toolkit",
  author = "Wu, Xiaobao and
    Pan, Fengjun and
    Luu, Anh Tuan",
  editor = "Cao, Yixin and
    Feng, Yang and
    Xiong, Deyi",
  booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)",
  month = aug,
  year = "2024",
  address = "Bangkok, Thailand",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2024.acl-demos.4/",
  doi = "10.18653/v1/2024.acl-demos.4",
  pages = "31--41",
  abstract = "Topic models have a rich history with various applications and have recently been reinvigorated by neural topic modeling. However, these numerous topic models adopt totally distinct datasets, implementations, and evaluations. This impedes quick utilization and fair comparisons, and thereby hinders their research progress and applications. To tackle this challenge, we in this paper propose a Topic Modeling System Toolkit (TopMost). Compared to existing toolkits, TopMost stands out by supporting more extensive features. It covers a broader spectrum of topic modeling scenarios with their complete lifecycles, including datasets, preprocessing, models, training, and evaluations. Thanks to its highly cohesive and decoupled modular design, TopMost enables rapid utilization, fair comparisons, and flexible extensions of diverse cutting-edge topic models. Our code, tutorials, and documentation are available at https://github.com/bobxwu/topmost."
}

@article{quantitative_text_analysis,
  title={Quantitative text analysis},
  volume={4},
  url={https://www.nature.com/articles/s43586-024-00302-w#citeas},
  DOI={10.1038/s43586-024-00302-w},
  number={1},
  journal={Nature Reviews Methods Primers},
  author={Nielbo, Kristoffer L. and Karsdorp, Folgert and Wevers, Melvin and Lassche, Alie and Baglini, Rebekah B. and Kestemont, Mike and Tahmasebi, Nina},
  year={2024},
  month=apr
}

@inproceedings{stream,
  title = "{STREAM}: Simplified Topic Retrieval, Exploration, and Analysis Module",
  author = {Thielmann, Anton and
    Reuter, Arik and
    Weisser, Christoph and
    Kant, Gillian and
    Kumar, Manish and
    S{\"a}fken, Benjamin},
  editor = "Ku, Lun-Wei and
    Martins, Andre and
    Srikumar, Vivek",
  booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)",
  month = aug,
  year = "2024",
  address = "Bangkok, Thailand",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2024.acl-short.41/",
  doi = "10.18653/v1/2024.acl-short.41",
  pages = "435--444",
  abstract = "Topic modeling is a widely used technique to analyze large document corpora. With the ever-growing emergence of scientific contributions in the field, non-technical users may often use the simplest available software module, independent of whether there are potentially better models available. We present a Simplified Topic Retrieval, Exploration, and Analysis Module (STREAM) for user-friendly topic modelling and especially subsequent interactive topic visualization and analysis. For better topic analysis, we implement multiple intruder-word based topic evaluation metrics. Additionally, we publicize multiple new datasets that can extend the so far very limited number of publicly available benchmark datasets in topic modeling. We integrate downstream interpretable analysis modules to enable users to easily analyse the created topics in downstream tasks together with additional tabular information. The code is available at the following link: https://github.com/AnFreTh/STREAM"
}

@inproceedings{ctm,
  title = "Pre-training is a Hot Topic: Contextualized Document Embeddings Improve Topic Coherence",
  author = "Bianchi, Federico and
    Terragni, Silvia and
    Hovy, Dirk",
  editor = "Zong, Chengqing and
    Xia, Fei and
    Li, Wenjie and
    Navigli, Roberto",
  booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)",
  month = aug,
  year = "2021",
  address = "Online",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2021.acl-short.96",
  doi = "10.18653/v1/2021.acl-short.96",
  pages = "759--766",
  abstract = "Topic models extract groups of words from documents, whose interpretation as a topic hopefully allows for a better understanding of the data. However, the resulting word groups are often not coherent, making them harder to interpret. Recently, neural topic models have shown improvements in overall coherence. Concurrently, contextual embeddings have advanced the state of the art of neural models in general. In this paper, we combine contextualized representations with neural topic models. We find that our approach produces more meaningful and coherent topics than traditional bag-of-words topic models and recent neural models. Our results indicate that future improvements in language models will translate into better topic models.",
}

@inproceedings{zeroshot_tm,
  title = "Cross-lingual Contextualized Topic Models with Zero-shot Learning",
  author = "Bianchi, Federico and
    Terragni, Silvia and
    Hovy, Dirk and
    Nozza, Debora and
    Fersini, Elisabetta",
  editor = "Merlo, Paola and
    Tiedemann, Jorg and
    Tsarfaty, Reut",
  booktitle = "Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume",
  month = apr,
  year = "2021",
  address = "Online",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2021.eacl-main.143",
  doi = "10.18653/v1/2021.eacl-main.143",
  pages = "1676--1683",
  abstract = "Many data sets (e.g., reviews, forums, news, etc.) exist parallelly in multiple languages. They all cover the same content, but the linguistic differences make it impossible to use traditional, bag-of-word-based topic models. Models have to be either single-language or suffer from a huge, but extremely sparse vocabulary. Both issues can be addressed by transfer learning. In this paper, we introduce a zero-shot cross-lingual topic model. Our model learns topics on one language (here, English), and predicts them for unseen documents in different languages (here, Italian, French, German, and Portuguese). We evaluate the quality of the topic predictions for the same document in different languages. Our results show that the transferred topics are coherent and stable across languages, which suggests exciting future research directions.",
}

@article{blei_prob_topic_models,
  title={Probabilistic topic models},
  volume={55},
  url={https://doi.org/10.1145/2133806.2133826},
  DOI={10.1145/2133806.2133826},
  number={4},
  journal={Communications of the ACM},
  author={Blei, David M.},
  year={2012},
  month=apr,
  pages={77--84}
}

@misc{top2vec,
  title={Top2Vec: Distributed Representations of Topics},
  author={Dimo Angelov},
  year={2020},
  eprint={2008.09470},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

@inproceedings{prodla,
  title={Autoencoding Variational Inference For Topic Models},
  author={Akash Srivastava and Charles Sutton},
  booktitle={International Conference on Learning Representations},
  year={2017},
  url={https://openreview.net/forum?id=BybtVK9lg}
}

@article{scikit-learn,
  title={Scikit-learn: Machine Learning in {P}ython},
  author={Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V.
    and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P.
    and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and
    Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E.},
  journal={Journal of Machine Learning Research},
  volume={12},
  pages={2825--2830},
  year={2011}
}

@inproceedings{blei_dynamic,
  author = {Blei, David M. and Lafferty, John D.},
  title = {Dynamic Topic Models},
  year = {2006},
  isbn = {1595933832},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/1143844.1143859},
  doi = {10.1145/1143844.1143859},
  abstract = {A family of probabilistic time series models is developed to analyze the time evolution of topics in large document collections. The approach is to use state space models on the natural parameters of the multinomial distributions that represent the topics. Variational approximations based on Kalman filters and nonparametric wavelet regression are developed to carry out approximate posterior inference over the latent topics. In addition to giving quantitative, predictive models of a sequential corpus, dynamic topic models provide a qualitative window into the contents of a large document collection. The models are demonstrated by analyzing the OCR'ed archives of the journal Science from 1880 through 2000.},
  booktitle = {Proceedings of the 23rd International Conference on Machine Learning},
  pages = {113--120},
  numpages = {8},
  location = {Pittsburgh, Pennsylvania, USA},
  series = {ICML '06}
}

@inproceedings{blei_hierarchical,
  author = {Blei, David M. and Jordan, Michael I. and Griffiths, Thomas L. and Tenenbaum, Joshua B.},
  title = {Hierarchical Topic Models and the Nested Chinese Restaurant Process},
  year = {2003},
  publisher = {MIT Press},
  address = {Cambridge, MA, USA},
  abstract = {We address the problem of learning topic hierarchies from data. The model selection problem in this domain is daunting---which of the large collection of possible trees to use? We take a Bayesian approach, generating an appropriate prior via a distribution on partitions that we refer to as the nested Chinese restaurant process. This nonparametric prior allows arbitrarily large branching factors and readily accommodates growing data collections. We build a hierarchical topic model by combining this prior with a likelihood that is based on a hierarchical variant of latent Dirichlet allocation. We illustrate our approach on simulated data and with an application to the modeling of NIPS abstracts.},
  booktitle = {Proceedings of the 16th International Conference on Neural Information Processing Systems},
  pages = {17--24},
  numpages = {8},
  location = {Whistler, British Columbia, Canada},
  series = {NIPS'03}
}

@misc{ctm_docs,
  author={Bianchi, Federico and Terragni, Silvia and Hovy, Dirk},
  title={Contextualized Topic Models --- Contextualized Topic Models 2.5.0 documentation},
  url={https://contextualized-topic-models.readthedocs.io/en/latest/introduction.html},
  year={2020}
}

@inproceedings{fastopic,
  title={FASTopic: Pretrained Transformer is a Fast, Adaptive, Stable, and Transferable Topic Model},
  author={Wu, Xiaobao and Nguyen, Thong Thanh and Zhang, Delvin Ce and Wang, William Yang and Luu, Anh Tuan},
  booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
  year={2024}
}

@inproceedings{sentence_transformers,
  title = "Sentence-{BERT}: Sentence Embeddings using {S}iamese {BERT}-Networks",
  author = "Reimers, Nils and
    Gurevych, Iryna",
  editor = "Inui, Kentaro and
    Jiang, Jing and
    Ng, Vincent and
    Wan, Xiaojun",
  booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)",
  month = nov,
  year = "2019",
  address = "Hong Kong, China",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/D19-1410/",
  doi = "10.18653/v1/D19-1410",
  pages = "3982--3992",
  abstract = "BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019) has set a new state-of-the-art performance on sentence-pair regression tasks like semantic textual similarity (STS). However, it requires that both sentences are fed into the network, which causes a massive computational overhead: Finding the most similar pair in a collection of 10,000 sentences requires about 50 million inference computations ({\textasciitilde}65 hours) with BERT. The construction of BERT makes it unsuitable for semantic similarity search as well as for unsupervised tasks like clustering. In this publication, we present Sentence-BERT (SBERT), a modification of the pretrained BERT network that use siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine-similarity. This reduces the effort for finding the most similar pair from 65 hours with BERT / RoBERTa to about 5 seconds with SBERT, while maintaining the accuracy from BERT. We evaluate SBERT and SRoBERTa on common STS tasks and transfer learning tasks, where it outperforms other state-of-the-art sentence embeddings methods."
}

@software{topicwizard,
  author = {Kardos, Márton},
  month = nov,
  title = {{topicwizard: Pretty and opinionated topic model visualization in Python}},
  url = {https://github.com/x-tabdeveloping/topic-wizard},
  version = {0.5.0},
  year = {2023}
}
