Commit 81edd97

Merge branch 'main' into bayes_rule

2 parents ef4c28d + 34c3761

11 files changed: 219 additions & 181 deletions

.github/workflows/documentation.yml

Lines changed: 2 additions & 2 deletions
```diff
@@ -23,12 +23,12 @@ jobs:
       - name: Dependencies
         run: |
           python -m pip install --upgrade pip
-          pip install "turftopic[pyro-ppl,docs]"
+          pip install "turftopic[pyro-ppl]" "griffe" "mkdocstrings[python]" "mkdocs" "mkdocs-material"

       - name: Build and Deploy
         if: github.event_name == 'push'
         run: mkdocs gh-deploy --force

       - name: Build
         if: github.event_name == 'pull_request'
-        run: mkdocs build
+        run: mkdocs build
```

README.md

Lines changed: 19 additions & 32 deletions
````diff
@@ -7,7 +7,7 @@
 ## Features
 - Novel transformer-based topic models:
    - Semantic Signal Separation - S³ 🧭
-   - KeyNMF 🔑 (paper in progress ⏳)
+   - KeyNMF 🔑
    - GMM :gem: (paper soon)
 - Implementations of other transformer-based topic models
    - Clustering Topic Models: BERTopic and Top2Vec
@@ -20,42 +20,29 @@
 > This package is still work in progress and scientific papers on some of the novel methods are currently undergoing peer-review. If you use this package and you encounter any problem, let us know by opening relevant issues.

-### New in version 0.5.0
+### New in version 0.6.0

-#### Hierarchical KeyNMF
+#### Prompting Embedding Models

-You can now subdivide topics in KeyNMF at will.
+KeyNMF and clustering topic models can now efficiently utilise asymmetric and instruction-finetuned embedding models.
+This, in combination with the right embedding model, can enhance performance significantly.

 ```python
 from turftopic import KeyNMF
-
-model = KeyNMF(2, top_n=15, random_state=42).fit(corpus)
-model.hierarchy.divide_children(n_subtopics=3)
-print(model.hierarchy)
-```
-
-```
-Root
-├── windows, dos, os, disk, card, drivers, file, pc, files, microsoft
-│   ├── 0.0: dos, file, disk, files, program, windows, disks, shareware, norton, memory
-│   ├── 0.1: os, unix, windows, microsoft, apps, nt, ibm, ms, os2, platform
-│   └── 0.2: card, drivers, monitor, driver, vga, ram, motherboard, cards, graphics, ati
-└── 1: atheism, atheist, atheists, religion, christians, religious, belief, christian, god, beliefs
-.   ├── 1.0: atheism, alt, newsgroup, reading, faq, islam, questions, read, newsgroups, readers
-.   ├── 1.1: atheists, atheist, belief, theists, beliefs, religious, religion, agnostic, gods, religions
-.   └── 1.2: morality, bible, christian, christians, moral, christianity, biblical, immoral, god, religion
+from sentence_transformers import SentenceTransformer
+
+encoder = SentenceTransformer(
+    "intfloat/multilingual-e5-large-instruct",
+    prompts={
+        "query": "Instruct: Retrieve relevant keywords from the given document. Query: ",
+        "passage": "Passage: ",
+    },
+    # Make sure to set the default prompt to query!
+    default_prompt_name="query",
+)
+model = KeyNMF(10, encoder=encoder)
 ```

-#### FASTopic *(Experimental)*
-
-You can now use [FASTopic](https://github.com/BobXWu/FASTopic) inside Turftopic.
-
-```python
-from turftopic import FASTopic
-
-model = FASTopic(10).fit(corpus)
-model.print_topics()
-```

 ## Basics [(Documentation)](https://x-tabdeveloping.github.io/turftopic/)
 [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/x-tabdeveloping/turftopic/blob/main/examples/basic_example_20newsgroups.ipynb)
@@ -184,5 +171,5 @@ Alternatively you can use the [Figures API](https://x-tabdeveloping.github.io/to
 - Grootendorst, M. (2022, March 11). BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv.org. https://arxiv.org/abs/2203.05794
 - Angelov, D. (2020, August 19). Top2VEC: Distributed representations of topics. arXiv.org. https://arxiv.org/abs/2008.09470
 - Bianchi, F., Terragni, S., & Hovy, D. (2020, April 8). Pre-training is a Hot Topic: Contextualized Document Embeddings Improve Topic Coherence. arXiv.org. https://arxiv.org/abs/2004.03974
-- Bianchi, F., Terragni, S., Hovy, D., Nozza, D., & Fersini, E. (2021). Cross-lingual Contextualized Topic Models with Zero-shot Learning. In Proceedings of the 16th Conference of the European
-- Chapter of the Association for Computational Linguistics: Main Volume (pp. 1676–1683). Association for Computational Linguistics.
+- Bianchi, F., Terragni, S., Hovy, D., Nozza, D., & Fersini, E. (2021). Cross-lingual Contextualized Topic Models with Zero-shot Learning. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume (pp. 1676–1683). Association for Computational Linguistics.
+- Kristensen-McLachlan, R. D., Hicke, R. M. M., Kardos, M., & Thunø, M. (2024, October 16). Context is Key(NMF): Modelling Topical Information Dynamics in Chinese Diaspora Media. arXiv.org. https://arxiv.org/abs/2410.12791
````
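As a usage note on the snippet added above: once the prompted encoder is constructed, fitting and inspecting the model works the same as for any other Turftopic model (`print_topics` appears later in the README). A minimal sketch; the corpus strings are placeholders:

```python
from turftopic import KeyNMF
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer(
    "intfloat/multilingual-e5-large-instruct",
    prompts={
        "query": "Instruct: Retrieve relevant keywords from the given document. Query: ",
        "passage": "Passage: ",
    },
    default_prompt_name="query",
)

# Placeholder corpus; any list of raw documents works here.
corpus = [
    "a post about windows drivers and graphics cards",
    "a discussion of atheism and religious belief",
]

model = KeyNMF(10, encoder=encoder)
model.fit(corpus)
model.print_topics()
```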

docs/KeyNMF.md

Lines changed: 45 additions & 0 deletions
````diff
@@ -109,6 +109,51 @@ keyword_matrix = model.extract_keywords(corpus)
 model.fit(None, keywords=keyword_matrix)
 ```

+## Asymmetric and Instruction-tuned Embedding Models
+
+Some embedding models can be used together with prompting, or encode queries and passages differently.
+This is important for KeyNMF, as it is explicitly based on keyword retrieval, and its performance can be substantially enhanced by using asymmetric or prompted embeddings.
+Microsoft's E5 models are, for instance, all prompted by default, and it would be detrimental to performance not to do so yourself.
+
+In these cases, you're better off NOT passing a string to Turftopic models, but explicitly loading the model using `sentence-transformers`.
+
+Here's an example of using instruct models for keyword retrieval with KeyNMF.
+In this case, documents will serve as the queries and words as the passages:
+
+```python
+from turftopic import KeyNMF
+from sentence_transformers import SentenceTransformer
+
+encoder = SentenceTransformer(
+    "intfloat/multilingual-e5-large-instruct",
+    prompts={
+        "query": "Instruct: Retrieve relevant keywords from the given document. Query: ",
+        "passage": "Passage: ",
+    },
+    # Make sure to set the default prompt to query!
+    default_prompt_name="query",
+)
+model = KeyNMF(10, encoder=encoder)
+```
+
+And a regular, asymmetric example:
+
+```python
+encoder = SentenceTransformer(
+    "intfloat/e5-large-v2",
+    prompts={
+        "query": "query: ",
+        "passage": "passage: ",
+    },
+    # Make sure to set the default prompt to query!
+    default_prompt_name="query",
+)
+model = KeyNMF(10, encoder=encoder)
+```
+
+Setting the default prompt to `query` is especially important when you are precomputing embeddings, as `query` should always be your default prompt for embedding documents.
+
 ## Dynamic Topic Modeling

 KeyNMF is also capable of modeling topics over time.
````
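The precomputation caveat in the added text is worth spelling out. Below is a minimal sketch of that workflow, assuming (as the Precomputing Embeddings section of `docs/basics.md` implies) that Turftopic models accept precomputed document embeddings through an `embeddings` keyword to `fit`; the corpus is a stand-in:

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from turftopic import KeyNMF

encoder = SentenceTransformer(
    "intfloat/e5-large-v2",
    prompts={"query": "query: ", "passage": "passage: "},
    default_prompt_name="query",
)

corpus: list[str] = ["first example document", "second example document"]

# Because default_prompt_name="query", a plain encode() call prefixes every
# document with "query: ", the same treatment KeyNMF applies internally.
embeddings = np.asarray(encoder.encode(corpus))

model = KeyNMF(10, encoder=encoder)
model.fit(corpus, embeddings=embeddings)
```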

docs/basics.md

Lines changed: 63 additions & 2 deletions
````diff
@@ -27,7 +27,7 @@ Here's a model that uses E5 large as the embedding model, and only learns words
 from turftopic import SemanticSignalSeparation
 from sklearn.feature_extraction.text import CountVectorizer

-model = SemanticSignalSeparation(10, encoder="intfloat/e5-large-v2", vectorizer=CountVectorizer(min_df=20))
+model = SemanticSignalSeparation(10, encoder="all-MiniLM-L6-v2", vectorizer=CountVectorizer(min_df=20))
 ```

 You can also use external models for encoding, here's an example with [OpenAI's embedding models](encoders.md#external_embeddings):
@@ -60,6 +60,67 @@ corpus: list[str] = ["this is a a document", "this is yet another document", ...
 model.fit(corpus)
 ```

+## Prompting Embedding Models
+
+Some embedding models can be used together with prompting, or encode queries and passages differently.
+This can significantly influence performance, especially in the case of models that are based on retrieval ([KeyNMF](KeyNMF.md)) or clustering ([ClusteringTopicModel](clustering.md)).
+Microsoft's E5 models are, for instance, all prompted by default, and it would be detrimental to performance not to do so yourself.
+
+In these cases, you're better off NOT passing a string to Turftopic models, but explicitly loading the model using `sentence-transformers`.
+
+Here's an example for clustering models:
+```python
+from turftopic import ClusteringTopicModel
+from sentence_transformers import SentenceTransformer
+
+encoder = SentenceTransformer(
+    "intfloat/multilingual-e5-large-instruct",
+    prompts={
+        "query": "Instruct: Cluster documents according to the topic they are about. Query: ",
+        "passage": "Passage: ",
+    },
+    # Make sure to set the default prompt to query!
+    default_prompt_name="query",
+)
+model = ClusteringTopicModel(encoder=encoder)
+```
+
+You can also use instruct models for keyword retrieval with KeyNMF.
+In this case, documents will serve as the queries and words as the passages:
+
+```python
+from turftopic import KeyNMF
+from sentence_transformers import SentenceTransformer
+
+encoder = SentenceTransformer(
+    "intfloat/multilingual-e5-large-instruct",
+    prompts={
+        "query": "Instruct: Retrieve relevant keywords from the given document. Query: ",
+        "passage": "Passage: ",
+    },
+    # Make sure to set the default prompt to query!
+    default_prompt_name="query",
+)
+model = KeyNMF(10, encoder=encoder)
+```
+
+When using KeyNMF with E5, make sure to specify the prompts even if you're not using instruct models:
+
+```python
+encoder = SentenceTransformer(
+    "intfloat/e5-large-v2",
+    prompts={
+        "query": "query: ",
+        "passage": "passage: ",
+    },
+    # Make sure to set the default prompt to query!
+    default_prompt_name="query",
+)
+model = KeyNMF(10, encoder=encoder)
+```
+
+Setting the default prompt to `query` is especially important when you are precomputing embeddings, as `query` should always be your default prompt for embedding documents.
+
 ## Precomputing Embeddings

 In order to cut down on costs/computational load when fitting multiple models in a row, you might want to encode the documents before fitting a model.
@@ -78,7 +139,7 @@ import numpy as np
 from sentence_transformers import SentenceTransformer
 from turftopic import GMM, ClusteringTopicModel

-encoder = SentenceTransformer("intfloat/e5-large-v2")
+encoder = SentenceTransformer("intfloat/e5-large-v2", prompts={"query": "query: ", "passage": "passage: "}, default_prompt_name="query")

 corpus: list[str] = ["this is a a document", "this is yet another document", ...]
 embeddings = np.asarray(encoder.encode(corpus))
````
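The hunk above stops right after the corpus is embedded. Here is a minimal sketch of how those embeddings would then be reused across models, assuming (per the surrounding section) that `fit` accepts an `embeddings` keyword:

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from turftopic import GMM, ClusteringTopicModel

encoder = SentenceTransformer(
    "intfloat/e5-large-v2",
    prompts={"query": "query: ", "passage": "passage: "},
    default_prompt_name="query",
)

corpus: list[str] = ["this is a document", "this is yet another document"]

# Encode the corpus once...
embeddings = np.asarray(encoder.encode(corpus))

# ...then fit several models without re-embedding anything.
gmm = GMM(10).fit(corpus, embeddings=embeddings)
clustering = ClusteringTopicModel().fit(corpus, embeddings=embeddings)
```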

docs/encoders.md

Lines changed: 59 additions & 8 deletions
````diff
@@ -21,6 +21,65 @@ model = GMM(10, encoder="paraphrase-multilingual-MiniLM-L12-v2")
 Different encoders have different performance and model sizes.
 To make an informed choice about which embedding model you should be using check out the [Massive Text Embedding Benchmark](https://huggingface.co/blog/mteb).

+## Asymmetric and Instruction-tuned Embedding Models
+
+Some embedding models can be used together with prompting, or encode queries and passages differently.
+Microsoft's E5 models are, for instance, all prompted by default, and it would be detrimental to performance not to do so yourself.
+
+In these cases, you're better off NOT passing a string to Turftopic models, but explicitly loading the model using `sentence-transformers`.
+
+Here's an example of using instruct models for keyword retrieval with KeyNMF.
+In this case, documents will serve as the queries and words as the passages:
+
+```python
+from turftopic import KeyNMF
+from sentence_transformers import SentenceTransformer
+
+encoder = SentenceTransformer(
+    "intfloat/multilingual-e5-large-instruct",
+    prompts={
+        "query": "Instruct: Retrieve relevant keywords from the given document. Query: ",
+        "passage": "Passage: ",
+    },
+    # Make sure to set the default prompt to query!
+    default_prompt_name="query",
+)
+model = KeyNMF(10, encoder=encoder)
+```
+
+And a regular, asymmetric example:
+
+```python
+encoder = SentenceTransformer(
+    "intfloat/e5-large-v2",
+    prompts={
+        "query": "query: ",
+        "passage": "passage: ",
+    },
+    # Make sure to set the default prompt to query!
+    default_prompt_name="query",
+)
+model = KeyNMF(10, encoder=encoder)
+```
+
+## Performance tips
+
+From `sentence-transformers` version `3.2.0` you can significantly speed up some models by using
+the `onnx` backend instead of regular torch.
+
+```
+pip install "sentence-transformers[onnx,onnx-gpu]"
+```
+
+```python
+from turftopic import SemanticSignalSeparation
+from sentence_transformers import SentenceTransformer
+
+encoder = SentenceTransformer("all-MiniLM-L6-v2", backend="onnx")
+
+model = SemanticSignalSeparation(10, encoder=encoder)
+```
+
 ## External Embeddings

 If you do not have the computational resources to run embedding models on your own infrastructure, you can also use high quality 3rd party embeddings.
@@ -33,11 +92,3 @@ Turftopic currently supports OpenAI, Voyage and Cohere embeddings.
 :::turftopic.encoders.OpenAIEmbeddings

 :::turftopic.encoders.VoyageEmbeddings
-
-## E5 Embeddings
-
-Most E5 models expect the input to be prefixed with something like `"query: "` (see the [multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) model card).
-In instructional E5 models, it is also possible to add an instruction, following the format `f"Instruct: {task_description} \nQuery: {document}"` (see the [multilingual-e5-large-instruct](https://huggingface.co/intfloat/multilingual-e5-large-instruct) model card).
-In Turftopic, E5 embeddings including the prefixing is handled by the `E5Encoder`.
-
-:::turftopic.encoders.E5Encoder
````
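With the `E5Encoder` removed, its prefixing behaviour is taken over by the `prompts` mechanism shown above. A minimal sketch of what that replacement looks like at the `sentence-transformers` level, using the standard `prompt_name` argument of `encode`; the input strings are placeholders:

```python
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer(
    "intfloat/e5-large-v2",
    prompts={"query": "query: ", "passage": "passage: "},
    default_prompt_name="query",
)

# Documents pick up the default "query: " prefix automatically...
doc_embeddings = encoder.encode(["an example document about topic models"])

# ...while vocabulary terms can be embedded as passages explicitly.
word_embeddings = encoder.encode(["topic", "model"], prompt_name="passage")
```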

pyproject.toml

Lines changed: 3 additions & 3 deletions
```diff
@@ -6,17 +6,17 @@ line-length=79

 [tool.poetry]
 name = "turftopic"
-version = "0.5.4"
+version = "0.6.0"
 description = "Topic modeling with contextual representations from sentence transformers."
 authors = ["Márton Kardos <power.up1163@gmail.com>"]
 license = "MIT"
 readme = "README.md"

 [tool.poetry.dependencies]
 python = "^3.9"
-numpy = "^1.23.0"
+numpy = ">=1.23.0"
 scikit-learn = "^1.2.0"
-sentence-transformers = "^2.2.0"
+sentence-transformers = ">=2.2.0"
 torch = "^2.1.0"
 scipy = "^1.10.0"
 rich = "^13.6.0"
```

turftopic/encoders/__init__.py

Lines changed: 0 additions & 2 deletions
```diff
@@ -2,12 +2,10 @@
 from turftopic.encoders.cohere import CohereEmbeddings
 from turftopic.encoders.openai import OpenAIEmbeddings
 from turftopic.encoders.voyage import VoyageEmbeddings
-from turftopic.encoders.e5 import E5Encoder

 __all__ = [
     "CohereEmbeddings",
     "OpenAIEmbeddings",
     "VoyageEmbeddings",
     "ExternalEncoder",
-    "E5Encoder",
 ]
```
