paper.md
+19 -18 (19 additions & 18 deletions)
@@ -24,7 +24,7 @@ authors:
24
24
affiliations:
25
25
- name: Center for Humanities Computing, Aarhus University, Denmark
26
26
index: 1
27
-
- name: Interactive Minds Center, Aarhus University, Denmark
27
+
- name: Interacting Minds Center, Aarhus University, Denmark
28
28
index: 2
29
29
- name: Department of Linguistics, Cognitive Science, and Semiotics, Aarhus University, Denmark
30
30
index: 3
@@ -34,10 +34,10 @@ bibliography: paper.bib
34
34
35
35
# Summary
36
36
37
-
Turftopic is a topic modelling library including a number of recent topic models that go beyond bag-of-words and can understand text in context, utilizing representations from transformers.
38
-
Turftopic focuses on ease-of-use, providing a unified, interface for a number of different modern topic models, and boasting both model-specific and model-agnostic interpretation and visualization utilities.
39
-
The user is afforded great flexibility in model choice and customization, but the library comes with reasonable defaults, not to overwhelm first-time users with a plethora of choices.
40
-
In addition, Turftopic allows you to model topics, as they change over time, learning themes from streams of texts, finding hierarchical topics, and multilingual usage.
37
+
Turftopic is a topic modelling library including a number of recent topic models that go beyond bag-of-words models and can understand text in context, utilizing representations from transformers.
38
+
Turftopic focuses on ease of use, providing a unified interface for a number of different modern topic models, and boasting both model-specific and model-agnostic interpretation and visualization utilities.
39
+
While the user is afforded great flexibility in model choice and customization, the library comes with reasonable defaults, so as not to needlessly overwhelm first-time users.
40
+
In addition, Turftopic allows the user to: a) model topics as they change over time, b) learn topics on-line from a stream of texts, c) find hierarchical structure in topics, and d) learn topics in multilingual texts and corpora.
41
41
Users can utilize the power of large language models (LLMs) to give human-readable names to topics.
42
42
Turftopic also comes with built-in utilities for generating topic descriptions based on key-phrases or lemmas rather than individual words.
43
43
@@ -47,38 +47,39 @@ Turftopic also comes with built-in utilities for generating topic descriptions b
47
47
48
48
While a number of software packages have been developed for contextual topic modelling in recent years, including BERTopic [@bertopic_paper], Top2Vec [@top2vec], CTM [@ctm], these packages include implementations of one or two topic models, and most of the utilities they provide are model-specific. This has resulted in the unfortunate situation that practitioners need to switch between different libraries and adapt to their particularities in both interface and functionality.
49
49
Some attempts have been made at creating unified packages for modern topic models, including STREAM [@stream] and TopMost [@topmost].
50
-
These packages, however have a focus on neural models and topic model evaluation, have abstract and highly specialized interfaces, and do not include some popular topic models.
51
-
Additionally, while model interpretation is an incredibly important aspect of topic modelling, the interpretation utilities provided in these libraries are fairly limited, especially in comparison with model-specific packages, like BERTopic.
50
+
These packages, however, have a focus on neural models and topic model evaluation, have abstract and highly specialized interfaces, and do not include some popular topic models.
51
+
Additionally, while model interpretation is a fundamental aspect of topic modelling, the interpretation utilities provided in these libraries are fairly limited, especially in comparison with model-specific packages, like BERTopic.
52
52
53
53
Turftopic unifies state-of-the-art contextual topic models under a superset of the `scikit-learn`[@scikit-learn] API, which users are likely already familiar with, and can be readily included in `scikit-learn` workflows and pipelines.
54
-
We focused on making Turftopic first and foremost an easy-to-use library, that does not necessitate expert knowledge or excessive amounts of code to get started with, but gives great flexibility to power users.
55
-
Furthermore, included an extensive suite of pretty-printing and visualization utilities that aid users in interpreting their results.
56
-
The library also includes three topic models, which to our knowledge only have implementations in Turftopic, these are: KeyNMF [@keynmf], S^3^ [@s3], and GMM.
54
+
We focused on making Turftopic first and foremost an easy-to-use library that does not necessitate expert knowledge or excessive amounts of code to get started with, but gives great flexibility to power users.
55
+
Furthermore, we included an extensive suite of pretty-printing and visualization utilities that aid users in interpreting their results.
56
+
The library also includes three topic models that, to our knowledge, are only implemented in Turftopic: KeyNMF [@keynmf], S^3^ [@s3], and GMM, a Gaussian mixture model of document representations with a soft-c-tf-idf term weighting scheme.
57
57
58
58
# Functionality
59
59
60
60
Turftopic includes a wide array of contextual topic models from the literature, including:
61
61
FASTopic [@fastopic], clustering models, such as BERTopic [@bertopic_paper] and Top2Vec [@top2vec], auto-encoding topic models, like CombinedTM [@ctm] and ZeroShotTM [@zeroshot_tm], KeyNMF [@keynmf], Semantic Signal Separation [@s3] and GMM.
62
-
We believe these models to be representative of the state of the art in contextual topic modelling and intend to expand on them in the future.
62
+
At the time of writing, these models are representative of the state of the art in contextual topic modelling, and we intend to expand on them in the future.
63
63
64
64
{width="800px"}
65
65
66
-
Each model in Turftopic has an *encoder* component, which is used for producing continuous document-representations, and a *vectorizer* component, which extracts term counts in each documents, thereby dictating which terms will be considered in topics.
66
+
Each model in Turftopic has an *encoder* component, which is used for producing continuous document representations [@sentence_transformers], and a *vectorizer* component, which extracts term counts in each document, thereby dictating which terms will be considered in topics.
67
67
The user has full control over what components should be used at different stages of the topic modelling process, thereby having fine-grained influence on the nature and quality of topics.
68
68
69
-
The library comes loaded with a lot of utilities to help users interpret their results, including *pretty printing* utilities for exploring topics, *interactive visualizations* partially powered by the `topicwizard`[@topicwizard] Python package, and *automated topic naming* with LLMs.
69
+
The library comes loaded with numerous utilities to help users interpret their results, including *pretty printing* utilities for exploring topics, *interactive visualizations* partially powered by the `topicwizard`[@topicwizard] Python package, and *automated topic naming* with LLMs.
70
70
71
-
To accommodate a variety of use-cases, Turftopic can be used for dynamic topic modelling, where we expect topics to change over time, can be used for uncovering hierarchical structure in topics.
72
-
Some models can also be fitted in an *online* fashion, where documents are accounted for as they come in by batches.
71
+
To accommodate a variety of use cases, Turftopic can be used for *dynamic* topic modelling, where we expect topics to change over time.
72
+
Turftopic is also capable of extracting topics at multiple levels of granularity, thereby uncovering *hierarchical* topic structures.
73
+
Some models can also be fitted in an *online* fashion, where documents are accounted for as they come in batches.
73
74
Turftopic also includes *seeded* topic modelling, where a seed phrase can be used to retrieve topics relevant to the specific research question.
74
75
75
76
# Use Cases
76
77
77
-
Topic models can be utilized in a number of research settings, including exploratory data analysis, discourse analysis of diverse domains, such as newspapers, social media or policy documents.
78
-
Turftopic has already been utilized by @keynmf for analyzing information dynamics in Chinese Diaspora Media, and is currently being used in multiple ongoing research projects, including one analyzing discourse on the HPV vaccine in Denmark, and studying Danish golden-age literature.
78
+
Topic modelling is a key tool for quantitative text analysis [@quantitative_text_analysis], and can be utilized in a number of research settings, including exploratory data analysis and discourse analysis of diverse domains, such as newspapers, social media or policy documents.
79
+
Turftopic has already been utilized by @keynmf for analyzing information dynamics in Chinese diaspora media, and is currently being used in multiple ongoing research projects, including one analyzing discourse on the HPV vaccine in Denmark and one studying Danish golden-age literature.
79
80
80
81
# Target Audience
81
82
82
83
We expect that Turftopic will prove useful to a diverse user base including computational researchers in digital humanities and social sciences, and industry NLP professionals.
83
-
Due to ease of use, Turftopic is also an appropriate choice for educational purposes.
84
+
Turftopic is also an appropriate choice for educational purposes, providing instructors with a single, user-friendly framework for students to explore and compare alternative topic modelling approaches.
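
A side note on the Functionality text in this diff: the *vectorizer* component is described as extracting term counts and thereby dictating which terms can appear in topics. A minimal sketch of that idea in plain Python follows. It is not taken from Turftopic's code; the helper name `count_terms`, its `vocabulary` parameter, and the toy corpus are all assumptions made for illustration.

```python
import re
from collections import Counter


def count_terms(documents, vocabulary=None):
    """Toy vectorizer: return (vocabulary, per-document term counts).

    Hypothetical helper for illustration only; not Turftopic's API.
    """
    # Lowercase each document and split it into alphabetic tokens.
    tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in documents]
    if vocabulary is None:
        # Default: every term seen in the corpus, in sorted order.
        vocabulary = sorted({token for doc in tokenized for token in doc})
    rows = []
    for doc in tokenized:
        term_counts = Counter(doc)
        # Terms outside the vocabulary are dropped here, so they can
        # never show up in a topic description downstream.
        rows.append([term_counts[term] for term in vocabulary])
    return vocabulary, rows


corpus = ["Topic models find topics", "Models of text"]
vocab, counts = count_terms(corpus)
restricted_vocab, restricted_counts = count_terms(
    corpus, vocabulary=["models", "text"]
)
```

Passing a restricted vocabulary changes the candidate terms without touching the documents, which is the sense in which a vectorizer shapes the resulting topics.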