
Commit 043b498

Merge pull request #69 from x-tabdeveloping/llm_naming
Automated topic naming
2 parents 11e3c6e + 94fb935 commit 043b498

12 files changed

Lines changed: 516 additions & 22 deletions


docs/basics.md

Lines changed: 33 additions & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -170,6 +170,7 @@ document_topic_matrix = model.transform(new_documents, embeddings=None)
170170
> Some models have additional optimizations going on when using `fit_transform()`, and the `fit()` method typically uses `fit_transform()` in the background.
171171
172172

173+
173174
## Interpreting Models
174175

175176
Turftopic comes with a number of pretty printing utilities for interpreting the models.
@@ -236,7 +237,7 @@ latex_table: str = model.export_topics(format="latex")
236237
md_table: str = model.export_representative_documents(0, corpus, document_topic_matrix, format="markdown")
237238
```
238239

239-
### Naming topics
240+
### Manual topic naming
240241

241242
You can manually name topics in Turftopic models after having interpreted them.
242243
If you find a more fitting name for a topic, feel free to rename it in your model.
@@ -246,8 +247,39 @@ from turftopic import SemanticSignalSeparation
246247

247248
model = SemanticSignalSeparation(10).fit(corpus)
248249
model.rename_topics({0: "New name for topic 0", 5: "New name for topic 5"})
250+
249251
```
250252

253+
### Automated topic naming
254+
255+
You can also use large language models or other NLP techniques to assign human-readable names to topics.
256+
Here is an example of using an OpenAI chat model (`gpt-4o-mini`) to generate topic names from the highest ranking keywords.
257+
258+
Read more about namer models [here](namers.md).
259+
260+
```python
261+
from turftopic import KeyNMF
262+
from turftopic.namers import OpenAITopicNamer
263+
264+
namer = OpenAITopicNamer("gpt-4o-mini")
265+
model.rename_topics(namer)
266+
267+
model.print_topics()
268+
```
269+
270+
| Topic ID | Topic Name | Highest Ranking |
271+
| - | - | - |
272+
| 0 | Operating Systems and Software | windows, dos, os, ms, microsoft, unix, nt, memory, program, apps |
273+
| 1 | Atheism and Belief Systems | atheism, atheist, atheists, belief, religion, religious, theists, beliefs, believe, faith |
274+
| 2 | Computer Architecture and Performance | motherboard, ram, memory, cpu, bios, isa, speed, 486, bus, performance |
275+
| 3 | Storage Technologies | disk, drive, scsi, drives, disks, floppy, ide, dos, controller, boot |
276+
| 4 | Moral Philosophy and Ethics | morality, moral, objective, immoral, morals, subjective, morally, society, animals, species |
277+
| 5 | Christian Faith and Beliefs | christian, bible, christians, god, christianity, religion, jesus, faith, religious, biblical |
278+
| 6 | Serial Modem Connectivity | modem, port, serial, modems, ports, uart, pc, connect, fax, 9600 |
279+
| 7 | Graphics Card Drivers | card, drivers, monitor, vga, driver, cards, ati, graphics, diamond, monitors |
280+
| 8 | Windows File Management | file, files, ftp, bmp, windows, program, directory, bitmap, win3, zip |
281+
| 9 | Printer Font Management | printer, print, fonts, printing, font, printers, hp, driver, deskjet, prints |
282+
251283
### Visualization
252284

253285
Turftopic does not come with built-in visualization utilities, but [topicwizard](https://github.com/x-tabdeveloping/topicwizard), a package for interactive topic model interpretation, is fully compatible with Turftopic models.

docs/finetuning.md

Lines changed: 9 additions & 0 deletions
@@ -19,6 +19,15 @@ model.rename_topics({0: "New name for topic 0", 5: "New name for topic 5"})
1919
model.rename_topics([f"Topic {i}" for i in range(10)])
2020
```
2121

22+
You can also automatically name topics with a [topic namer](namers.md) model.
23+
24+
```python
25+
from turftopic.namers import LLMTopicNamer
26+
27+
namer = LLMTopicNamer("HuggingFaceTB/SmolLM2-1.7B-Instruct")
28+
model.rename_topics(namer)
29+
```
30+
2231
## Changing the number of topics
2332

2433
Multiple models allow you to change the number of topics in a model after fitting them.

docs/namers.md

Lines changed: 136 additions & 0 deletions
@@ -0,0 +1,136 @@
1+
# Topic Namers
2+
3+
Sometimes, especially when the number of topics grows large,
4+
it might be convenient to assign human-readable names to topics in an automated manner.
5+
6+
Turftopic allows you to accomplish this with a number of different topic namer models.
7+
8+
## Large Language Models
9+
10+
Turftopic lets you utilise Large Language Models for generating human-readable topic names.
11+
This is done by instructing the language model to generate a topic name based on the keywords the topic model assigns as the most important for a given topic.
12+
13+
### Running LLMs locally
14+
15+
You can use any LLM from the HuggingFace Hub to generate topic names on your own machine.
16+
The default in Turftopic is [SmolLM](https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct), due to its small size and speed, but we recommend using larger LLMs for higher quality topic names, especially in multilingual contexts.
17+
18+
```python
19+
from turftopic import KeyNMF
20+
from turftopic.namers import LLMTopicNamer
21+
22+
model = KeyNMF(10).fit(corpus)
23+
24+
namer = LLMTopicNamer("HuggingFaceTB/SmolLM2-1.7B-Instruct")
25+
model.rename_topics(namer)
26+
27+
model.print_topics()
28+
```
29+
30+
| Topic ID | Topic Name | Highest Ranking |
31+
| - | - | - |
32+
| 0 | Windows NT | windows, dos, os, ms, microsoft, unix, nt, memory, program, apps |
33+
| 1 | Theism vs. Atheism | atheism, atheist, atheists, belief, religion, religious, theists, beliefs, believe, faith |
34+
| 2 | "486 Motherboard" | motherboard, ram, memory, cpu, bios, isa, speed, 486, bus, performance |
35+
| 3 | Disk Drives | disk, drive, scsi, drives, disks, floppy, ide, dos, controller, boot |
36+
| 4 | Ethics | morality, moral, objective, immoral, morals, subjective, morally, society, animals, species |
37+
| 5 | Christianity | christian, bible, christians, god, christianity, religion, jesus, faith, religious, biblical |
38+
| 6 | modem-port-serial-connect-uart-pc-9600 | modem, port, serial, modems, ports, uart, pc, connect, fax, 9600 |
39+
| 7 | "Graphics Card" | card, drivers, monitor, vga, driver, cards, ati, graphics, diamond, monitors |
40+
| 8 | File Manager | file, files, ftp, bmp, windows, program, directory, bitmap, win3, zip |
41+
| 9 | Printer and Fonts | printer, print, fonts, printing, font, printers, hp, driver, deskjet, prints |
42+
43+
### Using OpenAI's LLMs
44+
45+
You might not have the computational resources to run a high-quality LLM locally.
46+
Luckily, Turftopic allows you to use OpenAI's chat models for topic naming too!
47+
48+
49+
!!! info
50+
You will also need to install the `openai` Python package.
51+
```bash
52+
pip install openai
53+
export OPENAI_API_KEY="sk-<your key goes here>"
54+
```
55+
56+
```python
57+
from turftopic.namers import OpenAITopicNamer
58+
59+
namer = OpenAITopicNamer("gpt-4o-mini")
60+
model.rename_topics(namer)
61+
model.print_topics()
62+
```
63+
64+
| Topic ID | Topic Name | Highest Ranking |
65+
| - | - | - |
66+
| 0 | Operating Systems and Software | windows, dos, os, ms, microsoft, unix, nt, memory, program, apps |
67+
| 1 | Atheism and Belief Systems | atheism, atheist, atheists, belief, religion, religious, theists, beliefs, believe, faith |
68+
| 2 | Computer Architecture and Performance | motherboard, ram, memory, cpu, bios, isa, speed, 486, bus, performance |
69+
| 3 | Storage Technologies | disk, drive, scsi, drives, disks, floppy, ide, dos, controller, boot |
70+
| 4 | Moral Philosophy and Ethics | morality, moral, objective, immoral, morals, subjective, morally, society, animals, species |
71+
| 5 | Christian Faith and Beliefs | christian, bible, christians, god, christianity, religion, jesus, faith, religious, biblical |
72+
| 6 | Serial Modem Connectivity | modem, port, serial, modems, ports, uart, pc, connect, fax, 9600 |
73+
| 7 | Graphics Card Drivers | card, drivers, monitor, vga, driver, cards, ati, graphics, diamond, monitors |
74+
| 8 | Windows File Management | file, files, ftp, bmp, windows, program, directory, bitmap, win3, zip |
75+
| 9 | Printer Font Management | printer, print, fonts, printing, font, printers, hp, driver, deskjet, prints |
76+
77+
### Prompting
78+
79+
Since these namers use chat-finetuned LLMs, you can freely define custom prompts for topic name generation:
80+
81+
```python
82+
from turftopic.namers import OpenAITopicNamer
83+
84+
system_prompt = """
85+
You are a topic namer. When the user gives you a set of keywords, you respond with a name for the topic they describe.
86+
You only respond briefly with the name of the topic, and nothing else.
87+
"""
88+
89+
prompt_template = """
90+
You will be tasked with naming a topic.
91+
Based on the keywords, create a short label that best summarizes the topics.
92+
Only respond with a short, human readable topic name and nothing else.
93+
94+
The topic is described by the following set of keywords: {keywords}.
95+
"""
96+
97+
namer = OpenAITopicNamer("gpt-4o-mini", prompt_template=prompt_template, system_prompt=system_prompt)
98+
```
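The `{keywords}` placeholder in the prompt template is where the topic's keywords end up. A minimal sketch of that substitution, assuming plain `str.format` and a comma-separated keyword list; `build_prompt` is a hypothetical helper, not part of Turftopic, and the library may join or format keywords differently.

```python
# Hypothetical helper for illustration; not a Turftopic function.
def build_prompt(prompt_template: str, keywords: list[str]) -> str:
    """Fill the {keywords} placeholder with a comma-separated keyword list."""
    return prompt_template.format(keywords=", ".join(keywords))


prompt_template = (
    "Only respond with a short, human readable topic name and nothing else.\n"
    "The topic is described by the following set of keywords: {keywords}."
)

# Keywords as a topic model might rank them for one topic.
prompt = build_prompt(prompt_template, ["modem", "port", "serial"])
print(prompt)
```

If the substitution does go through `str.format`, any literal braces in a custom template would need to be doubled (`{{` and `}}`).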
99+
100+
## N-gram Patterns
101+
102+
You can also name topics after the n-grams in the corpus that are semantically closest to the topic descriptions.
103+
This method typically results in lower quality names, but might be good enough for your use case.
104+
105+
106+
```python
107+
from turftopic.namers import NgramTopicNamer
108+
109+
namer = NgramTopicNamer(corpus, encoder="all-MiniLM-L6-v2")
110+
model.rename_topics(namer)
111+
model.print_topics()
112+
```
113+
114+
| Topic ID | Topic Name | Highest Ranking |
115+
| - | - | - |
116+
| 0 | windows and dos | windows, dos, os, ms, microsoft, unix, nt, memory, program, apps |
117+
| 1 | many atheists out there | atheism, atheist, atheists, belief, religion, religious, theists, beliefs, believe, faith |
118+
| 2 | hardware and software | motherboard, ram, memory, cpu, bios, isa, speed, 486, bus, performance |
119+
| 3 | floppy disk drives and | disk, drive, scsi, drives, disks, floppy, ide, dos, controller, boot |
120+
| 4 | morality is subjective | morality, moral, objective, immoral, morals, subjective, morally, society, animals, species |
121+
| 5 | the christian bible | christian, bible, christians, god, christianity, religion, jesus, faith, religious, biblical |
122+
| 6 | the serial port | modem, port, serial, modems, ports, uart, pc, connect, fax, 9600 |
123+
| 7 | the video card | card, drivers, monitor, vga, driver, cards, ati, graphics, diamond, monitors |
124+
| 8 | the file manager | file, files, ftp, bmp, windows, program, directory, bitmap, win3, zip |
125+
| 9 | the print manager | printer, print, fonts, printing, font, printers, hp, driver, deskjet, prints |
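The idea behind n-gram naming can be sketched in a few lines: collect candidate n-grams from the corpus, embed them and the topic's keywords, and pick the closest candidate. This is a toy illustration only. It swaps the sentence encoder for bag-of-words counts, and every function name here (`ngrams`, `embed`, `cosine`, `name_topic`) is hypothetical rather than taken from `NgramTopicNamer`.

```python
import math
from collections import Counter


def ngrams(text: str, n_max: int = 3) -> set[str]:
    """All word n-grams of length 1..n_max in a document."""
    tokens = text.lower().split()
    return {
        " ".join(tokens[i : i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    }


def embed(text: str) -> Counter:
    # Toy stand-in for a sentence encoder: bag-of-words counts.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def name_topic(keywords: list[str], corpus: list[str]) -> str:
    """Return the corpus n-gram closest to the topic's keyword list."""
    candidates = set().union(*(ngrams(doc) for doc in corpus))
    topic_vector = embed(" ".join(keywords))
    # Sorting first makes tie-breaking deterministic.
    return max(sorted(candidates), key=lambda c: cosine(embed(c), topic_vector))


corpus = ["the serial port is slow", "my modem uses the serial port"]
print(name_topic(["modem", "serial", "port"], corpus))  # prints "serial port"
```

With a real encoder the candidates are ranked by meaning rather than word overlap, which is why names like "the video card" can appear even when "video" is not among the keywords.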
126+
127+
128+
## API Reference
129+
130+
:::turftopic.namers.base.TopicNamer
131+
132+
:::turftopic.namers.hf_transformers.LLMTopicNamer
133+
134+
:::turftopic.namers.openai.OpenAITopicNamer
135+
136+
:::turftopic.namers.ngram.NgramTopicNamer

mkdocs.yml

Lines changed: 1 addition & 0 deletions
@@ -19,6 +19,7 @@ nav:
1919
- Autoencoding Models: ctm.md
2020
- FASTopic: FASTopic.md
2121
- Encoders: encoders.md
22+
- Namers: namers.md
2223
theme:
2324
name: material
2425
logo: images/logo.svg

pyproject.toml

Lines changed: 2 additions & 0 deletions
@@ -23,12 +23,14 @@ rich = "^13.6.0"
2323
huggingface-hub = "^0.23.2"
2424
joblib = "^1.2.0"
2525
pyro-ppl = { version = "^1.8.0", optional = true }
26+
openai = { version = "^1.40.0", optional = true }
2627
mkdocs = { version = "^1.5.2", optional = true }
2728
mkdocs-material = { version = "^9.5.12", optional = true }
2829
mkdocstrings = { version = "^0.24.0", extras = ["python"], optional = true }
2930

3031
[tool.poetry.extras]
3132
pyro-ppl = ["pyro-ppl"]
33+
openai = ["openai"]
3234
docs = ["mkdocs", "mkdocs-material", "mkdocstrings"]
3335

3436
[build-system]

turftopic/base.py

Lines changed: 58 additions & 21 deletions
@@ -15,6 +15,7 @@
1515

1616
from turftopic.data import TopicData
1717
from turftopic.encoders import ExternalEncoder
18+
from turftopic.namers.base import TopicNamer
1819
from turftopic.serialization import create_readme, get_package_versions
1920
from turftopic.utils import export_table
2021

@@ -35,7 +36,9 @@ def get_topics(
3536
"""Returns high-level topic representations in form of the top K words
3637
in each topic.
3738
38-
Parameters ---------- top_k: int, default 10
39+
Parameters
40+
----------
41+
top_k: int, default 10
3942
Number of top words to return for each topic.
4043
4144
Returns
@@ -65,13 +68,36 @@ def get_topics(
6568
topics.append(topic_data)
6669
return topics
6770

71+
def _top_terms(
72+
self, top_k: int = 10, positive: bool = True
73+
) -> list[list[str]]:
74+
terms = []
75+
vocab = self.get_vocab()
76+
for component in self.components_:
77+
lowest = np.argpartition(component, top_k)[:top_k]
78+
lowest = lowest[np.argsort(component[lowest])]
79+
highest = np.argpartition(-component, top_k)[:top_k]
80+
highest = highest[np.argsort(-component[highest])]
81+
if not positive:
82+
terms.append(list(vocab[lowest]))
83+
else:
84+
terms.append(list(vocab[highest]))
85+
return terms
86+
87+
def _rename_automatic(self, namer: TopicNamer) -> list[str]:
88+
self.topic_names_ = namer.name_topics(self._top_terms())
89+
return self.topic_names_
90+
6891
def _topics_table(
6992
self,
7093
top_k: int = 10,
7194
show_scores: bool = False,
7295
show_negative: bool = False,
7396
) -> list[list[str]]:
74-
columns = ["Topic ID", "Highest Ranking"]
97+
columns = ["Topic ID"]
98+
if getattr(self, "topic_names_", None):
99+
columns.append("Topic Name")
100+
columns.append("Highest Ranking")
75101
if show_negative:
76102
columns.append("Lowest Ranking")
77103
rows = []
@@ -80,7 +106,9 @@ def _topics_table(
80106
except AttributeError:
81107
classes = list(range(self.components_.shape[0]))
82108
vocab = self.get_vocab()
83-
for topic_id, component in zip(classes, self.components_):
109+
for i_topic, (topic_id, component) in enumerate(
110+
zip(classes, self.components_)
111+
):
84112
highest = np.argpartition(-component, top_k)[:top_k]
85113
highest = highest[np.argsort(-component[highest])]
86114
lowest = np.argpartition(component, top_k)[:top_k]
@@ -105,7 +133,10 @@ def _topics_table(
105133
else:
106134
concat_positive = ", ".join([word for word in vocab[highest]])
107135
concat_negative = ", ".join([word for word in vocab[lowest]])
108-
row = [f"{topic_id}", f"{concat_positive}"]
136+
row = [f"{topic_id}"]
137+
if getattr(self, "topic_names_", None):
138+
row.append(self.topic_names_[i_topic])
139+
row.append(f"{concat_positive}")
109140
if show_negative:
110141
row.append(concat_negative)
111142
rows.append(row)
@@ -130,20 +161,19 @@ def print_topics(
130161
"""
131162
columns, *rows = self._topics_table(top_k, show_scores, show_negative)
132163
table = Table(show_lines=True)
133-
table.add_column("Topic ID", style="blue", justify="right")
134-
table.add_column(
135-
"Highest Ranking",
136-
justify="left",
137-
style="magenta",
138-
max_width=100,
139-
)
140-
if show_negative:
141-
table.add_column(
142-
"Lowest Ranking",
143-
justify="left",
144-
style="red",
145-
max_width=100,
146-
)
164+
for column in columns:
165+
if column == "Highest Ranking":
166+
table.add_column(
167+
column, justify="left", style="magenta", max_width=100
168+
)
169+
elif column == "Lowest Ranking":
170+
table.add_column(
171+
column, justify="left", style="red", max_width=100
172+
)
173+
elif column == "Topic ID":
174+
table.add_column(column, style="blue", justify="right")
175+
else:
176+
table.add_column(column)
147177
for row in rows:
148178
table.add_row(*row)
149179
console = Console()
@@ -324,22 +354,29 @@ def topic_names(self) -> list[str]:
324354
names.append(f"{topic_id}_{concat_words}")
325355
return names
326356

327-
def rename_topics(self, names: Union[list[str], dict[int, str]]) -> None:
328-
"""Rename topics in a model manually.
357+
def rename_topics(
358+
self, names: Union[list[str], dict[int, str], TopicNamer]
359+
) -> None:
360+
"""Rename topics in a model manually or automatically, using a namer.
329361
330362
Examples:
331363
```python
332364
model.rename_topics(["Automobiles", "Telephones"])
333365
# Or:
334366
model.rename_topics({-1: "Outliers", 2: "Christianity"})
367+
# Or:
368+
namer = OpenAITopicNamer()
369+
model.rename_topics(namer)
335370
```
336371
337372
Parameters
338373
----------
339374
names: list[str] or dict[int,str]
340375
Should be a list of topic names, or a mapping of topic IDs to names.
341376
"""
342-
if isinstance(names, dict):
377+
if isinstance(names, TopicNamer):
378+
self._rename_automatic(names)
379+
elif isinstance(names, dict):
343380
topic_names = self.topic_names
344381
for topic_id, topic_name in names.items():
345382
try:
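The new `isinstance` dispatch in `rename_topics` means any object deriving from `TopicNamer` can be passed in place of a list or dict. A self-contained sketch of that pattern follows. The `TopicNamer` class here is a hypothetical stand-in for `turftopic.namers.base.TopicNamer` (only the `name_topics` call is taken from the diff), and `ToyModel` and `FirstKeywordNamer` are invented for illustration.

```python
from abc import ABC, abstractmethod
from typing import Union


class TopicNamer(ABC):
    """Hypothetical stand-in for turftopic.namers.base.TopicNamer."""

    @abstractmethod
    def name_topics(self, keywords: list[list[str]]) -> list[str]:
        ...


class FirstKeywordNamer(TopicNamer):
    """Toy namer: title-cases the highest ranking keyword of each topic."""

    def name_topics(self, keywords: list[list[str]]) -> list[str]:
        return [topic[0].title() for topic in keywords]


class ToyModel:
    """Hypothetical model that already holds the top terms of each topic."""

    def __init__(self, top_terms: list[list[str]]):
        self._terms = top_terms
        self.topic_names_ = [f"Topic {i}" for i in range(len(top_terms))]

    def rename_topics(
        self, names: Union[list[str], dict[int, str], TopicNamer]
    ) -> None:
        if isinstance(names, TopicNamer):
            # Automatic naming: the namer only sees the keywords.
            self.topic_names_ = names.name_topics(self._terms)
        elif isinstance(names, dict):
            # Toy simplification: topic IDs are treated as list positions.
            for topic_id, name in names.items():
                self.topic_names_[topic_id] = name
        else:
            self.topic_names_ = list(names)


model = ToyModel([["windows", "dos"], ["printer", "fonts"]])
model.rename_topics(FirstKeywordNamer())
print(model.topic_names_)  # ['Windows', 'Printer']
model.rename_topics({1: "Printing"})
print(model.topic_names_)  # ['Windows', 'Printing']
```

Because the check is `isinstance` rather than duck typing, a custom namer would presumably need to subclass the real `TopicNamer` to be accepted by the library.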
