Commit 94fb935

Updated docs
1 parent d743ed2 commit 94fb935

4 files changed

Lines changed: 179 additions & 1 deletion

docs/basics.md

Lines changed: 33 additions & 1 deletion
@@ -170,6 +170,7 @@ document_topic_matrix = model.transform(new_documents, embeddings=None)
170170
> Some models have additional optimizations going on when using `fit_transform()`, and the `fit()` method typically uses `fit_transform()` in the background.
171171
172172

173+
173174
## Interpreting Models
174175

175176
Turftopic comes with a number of pretty printing utilities for interpreting the models.
@@ -236,7 +237,7 @@ latex_table: str = model.export_topics(format="latex")
236237
md_table: str = model.export_representative_documents(0, corpus, document_topic_matrix, format="markdown")
237238
```
238239

239-
### Naming topics
240+
### Manual topic naming
240241

241242
You can manually name topics in Turftopic models after having interpreted them.
242243
If you find a more fitting name for a topic, feel free to rename it in your model.
@@ -246,8 +247,39 @@ from turftopic import SemanticSignalSeparation
246247

247248
model = SemanticSignalSeparation(10).fit(corpus)
248249
model.rename_topics({0: "New name for topic 0", 5: "New name for topic 5"})
250+
249251
```
250252

253+
### Automated topic naming
254+
255+
You can also use large language models or other NLP techniques to assign human-readable names to topics.
256+
Here is an example of using OpenAI's `gpt-4o-mini` to generate topic names from the highest-ranking keywords.
257+
258+
Read more about namer models [here](namers.md).
259+
260+
```python
261+
from turftopic import KeyNMF
262+
from turftopic.namers import OpenAITopicNamer
263+
model = KeyNMF(10).fit(corpus)
264+
namer = OpenAITopicNamer("gpt-4o-mini")
265+
model.rename_topics(namer)
266+
267+
model.print_topics()
268+
```
269+
270+
| Topic ID | Topic Name | Highest Ranking |
271+
| - | - | - |
272+
| 0 | Operating Systems and Software | windows, dos, os, ms, microsoft, unix, nt, memory, program, apps |
273+
| 1 | Atheism and Belief Systems | atheism, atheist, atheists, belief, religion, religious, theists, beliefs, believe, faith |
274+
| 2 | Computer Architecture and Performance | motherboard, ram, memory, cpu, bios, isa, speed, 486, bus, performance |
275+
| 3 | Storage Technologies | disk, drive, scsi, drives, disks, floppy, ide, dos, controller, boot |
276+
| 4 | Moral Philosophy and Ethics | morality, moral, objective, immoral, morals, subjective, morally, society, animals, species |
277+
| 5 | Christian Faith and Beliefs | christian, bible, christians, god, christianity, religion, jesus, faith, religious, biblical |
278+
| 6 | Serial Modem Connectivity | modem, port, serial, modems, ports, uart, pc, connect, fax, 9600 |
279+
| 7 | Graphics Card Drivers | card, drivers, monitor, vga, driver, cards, ati, graphics, diamond, monitors |
280+
| 8 | Windows File Management | file, files, ftp, bmp, windows, program, directory, bitmap, win3, zip |
281+
| 9 | Printer Font Management | printer, print, fonts, printing, font, printers, hp, driver, deskjet, prints |
282+
251283
### Visualization
252284

253285
Turftopic does not come with built-in visualization utilities, but [topicwizard](https://github.com/x-tabdeveloping/topicwizard), a package for interactive topic model interpretation, is fully compatible with Turftopic models.

docs/finetuning.md

Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -19,6 +19,15 @@ model.rename_topics({0: "New name for topic 0", 5: "New name for topic 5"})
1919
model.rename_topics([f"Topic {i}" for i in range(10)])
2020
```
2121

22+
You can also automatically name topics with a [topic namer](namers.md) model.
23+
24+
```python
25+
from turftopic.namers import LLMTopicNamer
26+
27+
namer = LLMTopicNamer("HuggingFaceTB/SmolLM2-1.7B-Instruct")
28+
model.rename_topics(namer)
29+
```
30+
2231
## Changing the number of topics
2332

2433
Several models allow you to change the number of topics after they have been fitted.

docs/namers.md

Lines changed: 136 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,136 @@
1+
# Topic Namers
2+
3+
Sometimes, especially when the number of topics grows large,
4+
it might be convenient to assign human-readable names to topics in an automated manner.
5+
6+
Turftopic allows you to accomplish this with a number of different topic namer models.
7+
8+
## Large Language Models
9+
10+
Turftopic lets you utilise Large Language Models for generating human-readable topic names.
11+
This is done by instructing the language model to generate a topic name based on the keywords the topic model assigns as the most important for a given topic.
12+
13+
### Running LLMs locally
14+
15+
You can use any LLM from the HuggingFace Hub to generate topic names on your own machine.
16+
The default in Turftopic is [SmolLM2](https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct), due to its small size and speed, but we recommend using larger LLMs for higher-quality topic names, especially in multilingual contexts.
17+
18+
```python
19+
from turftopic import KeyNMF
20+
from turftopic.namers import LLMTopicNamer
21+
22+
model = KeyNMF(10).fit(corpus)
23+
24+
namer = LLMTopicNamer("HuggingFaceTB/SmolLM2-1.7B-Instruct")
25+
model.rename_topics(namer)
26+
27+
model.print_topics()
28+
```
29+
30+
| Topic ID | Topic Name | Highest Ranking |
31+
| - | - | - |
32+
| 0 | Windows NT | windows, dos, os, ms, microsoft, unix, nt, memory, program, apps |
33+
| 1 | Theism vs. Atheism | atheism, atheist, atheists, belief, religion, religious, theists, beliefs, believe, faith |
34+
| 2 | "486 Motherboard" | motherboard, ram, memory, cpu, bios, isa, speed, 486, bus, performance |
35+
| 3 | Disk Drives | disk, drive, scsi, drives, disks, floppy, ide, dos, controller, boot |
36+
| 4 | Ethics | morality, moral, objective, immoral, morals, subjective, morally, society, animals, species |
37+
| 5 | Christianity | christian, bible, christians, god, christianity, religion, jesus, faith, religious, biblical |
38+
| 6 | modem-port-serial-connect-uart-pc-9600 | modem, port, serial, modems, ports, uart, pc, connect, fax, 9600 |
39+
| 7 | "Graphics Card" | card, drivers, monitor, vga, driver, cards, ati, graphics, diamond, monitors |
40+
| 8 | File Manager | file, files, ftp, bmp, windows, program, directory, bitmap, win3, zip |
41+
| 9 | Printer and Fonts | printer, print, fonts, printing, font, printers, hp, driver, deskjet, prints |
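Small local models sometimes return names wrapped in quotes, as in rows 2 and 7 above. A minimal post-processing sketch follows; `clean_name` is a hypothetical helper, not part of Turftopic.

```python
def clean_name(raw: str) -> str:
    """Strip surrounding whitespace and one layer of matching quotes."""
    name = raw.strip()
    if len(name) >= 2 and name[0] == name[-1] and name[0] in "\"'":
        name = name[1:-1].strip()
    return name


raw_names = ['"486 Motherboard"', "Disk Drives", ' "Graphics Card" ']
print([clean_name(n) for n in raw_names])
# → ['486 Motherboard', 'Disk Drives', 'Graphics Card']
```

The cleaned list could then be passed to `model.rename_topics(...)`, which, per the fine-tuning docs, also accepts a list of names.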
42+
43+
### Using OpenAI's LLMs
44+
45+
You might not have the computational resources to run a high-quality LLM locally.
46+
Luckily, Turftopic allows you to use OpenAI's chat models for topic naming too!
47+
48+
49+
!!! info
50+
You will also need to install the `openai` Python package and set the `OPENAI_API_KEY` environment variable.
51+
```bash
52+
pip install openai
53+
export OPENAI_API_KEY="sk-<your key goes here>"
54+
```
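Because the namer calls a remote API, it can help to fail early when the key is missing. The sketch below assumes the key is read from the `OPENAI_API_KEY` environment variable, as the `export` line suggests; `require_api_key` is a hypothetical helper, not part of Turftopic or the `openai` package.

```python
import os


def require_api_key(name: str = "OPENAI_API_KEY") -> str:
    """Return the key stored in the environment, or raise a readable error."""
    key = os.environ.get(name, "").strip()
    if not key:
        # Hypothetical guard: stop before any request is attempted.
        raise RuntimeError(f"{name} is not set; export it before creating the namer.")
    return key
```

Calling `require_api_key()` once before constructing the namer turns a missing key into an immediate, readable error.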
55+
56+
```python
57+
from turftopic.namers import OpenAITopicNamer
58+
59+
namer = OpenAITopicNamer("gpt-4o-mini")
60+
model.rename_topics(namer)
61+
model.print_topics()
62+
```
63+
64+
| Topic ID | Topic Name | Highest Ranking |
65+
| - | - | - |
66+
| 0 | Operating Systems and Software | windows, dos, os, ms, microsoft, unix, nt, memory, program, apps |
67+
| 1 | Atheism and Belief Systems | atheism, atheist, atheists, belief, religion, religious, theists, beliefs, believe, faith |
68+
| 2 | Computer Architecture and Performance | motherboard, ram, memory, cpu, bios, isa, speed, 486, bus, performance |
69+
| 3 | Storage Technologies | disk, drive, scsi, drives, disks, floppy, ide, dos, controller, boot |
70+
| 4 | Moral Philosophy and Ethics | morality, moral, objective, immoral, morals, subjective, morally, society, animals, species |
71+
| 5 | Christian Faith and Beliefs | christian, bible, christians, god, christianity, religion, jesus, faith, religious, biblical |
72+
| 6 | Serial Modem Connectivity | modem, port, serial, modems, ports, uart, pc, connect, fax, 9600 |
73+
| 7 | Graphics Card Drivers | card, drivers, monitor, vga, driver, cards, ati, graphics, diamond, monitors |
74+
| 8 | Windows File Management | file, files, ftp, bmp, windows, program, directory, bitmap, win3, zip |
75+
| 9 | Printer Font Management | printer, print, fonts, printing, font, printers, hp, driver, deskjet, prints |
76+
77+
### Prompting
78+
79+
Since these namers use chat-finetuned LLMs, you can freely define custom prompts for topic name generation:
80+
81+
```python
82+
from turftopic.namers import OpenAITopicNamer
83+
84+
system_prompt = """
85+
You are a topic namer. When the user gives you a set of keywords, you respond with a name for the topic they describe.
86+
You only respond briefly with the name of the topic, and nothing else.
87+
"""
88+
89+
prompt_template = """
90+
You will be tasked with naming a topic.
91+
Based on the keywords, create a short label that best summarizes the topic.
92+
Only respond with a short, human readable topic name and nothing else.
93+
94+
The topic is described by the following set of keywords: {keywords}.
95+
"""
96+
97+
namer = OpenAITopicNamer("gpt-4o-mini", prompt_template=prompt_template, system_prompt=system_prompt)
98+
```
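To preview what the model might receive, the template can be filled in by hand. This sketch assumes the namer substitutes a comma-separated string of the top keywords into the `{keywords}` field using ordinary Python string formatting; the exact join format is an assumption, and `build_prompt` is a hypothetical helper rather than a Turftopic function.

```python
prompt_template = """
You will be tasked with naming a topic.
Based on the keywords, create a short label that best summarizes the topic.
Only respond with a short, human readable topic name and nothing else.

The topic is described by the following set of keywords: {keywords}.
"""


def build_prompt(template: str, keywords: list[str]) -> str:
    # Hypothetical helper: joins the keywords and fills the {keywords} field.
    return template.format(keywords=", ".join(keywords)).strip()


prompt = build_prompt(prompt_template, ["modem", "port", "serial", "uart"])
print(prompt)
```

If the template is indeed filled with `str.format`, as assumed here, any literal braces in a custom prompt would need to be doubled (`{{` and `}}`).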
99+
100+
## N-gram Patterns
101+
102+
You can also name topics after the n-grams in the corpus that are semantically closest to the topic descriptions.
103+
This method typically results in lower quality names, but might be good enough for your use case.
104+
105+
106+
```python
107+
from turftopic.namers import NgramTopicNamer
108+
109+
namer = NgramTopicNamer(corpus, encoder="all-MiniLM-L6-v2")
110+
model.rename_topics(namer)
111+
model.print_topics()
112+
```
113+
114+
| Topic ID | Topic Name | Highest Ranking |
115+
| - | - | - |
116+
| 0 | windows and dos | windows, dos, os, ms, microsoft, unix, nt, memory, program, apps |
117+
| 1 | many atheists out there | atheism, atheist, atheists, belief, religion, religious, theists, beliefs, believe, faith |
118+
| 2 | hardware and software | motherboard, ram, memory, cpu, bios, isa, speed, 486, bus, performance |
119+
| 3 | floppy disk drives and | disk, drive, scsi, drives, disks, floppy, ide, dos, controller, boot |
120+
| 4 | morality is subjective | morality, moral, objective, immoral, morals, subjective, morally, society, animals, species |
121+
| 5 | the christian bible | christian, bible, christians, god, christianity, religion, jesus, faith, religious, biblical |
122+
| 6 | the serial port | modem, port, serial, modems, ports, uart, pc, connect, fax, 9600 |
123+
| 7 | the video card | card, drivers, monitor, vga, driver, cards, ati, graphics, diamond, monitors |
124+
| 8 | the file manager | file, files, ftp, bmp, windows, program, directory, bitmap, win3, zip |
125+
| 9 | the print manager | printer, print, fonts, printing, font, printers, hp, driver, deskjet, prints |
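The idea of picking the closest n-gram can be illustrated with a toy example. This is not Turftopic's implementation: a bag-of-words count vector stands in for the sentence encoder (such as `all-MiniLM-L6-v2`) so that the sketch runs with the standard library only, and all function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(text: str, n_range=(2, 3)) -> list[str]:
    """All word n-grams of the given lengths from one document."""
    words = text.lower().split()
    return [
        " ".join(words[i : i + n])
        for n in range(n_range[0], n_range[1] + 1)
        for i in range(len(words) - n + 1)
    ]


def embed(text: str) -> Counter:
    # Toy stand-in for a sentence encoder: a bag-of-words count vector.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def name_topic(keywords: list[str], corpus: list[str]) -> str:
    """Pick the corpus n-gram whose vector is closest to the keyword list."""
    topic_vec = embed(" ".join(keywords))
    candidates = sorted({g for doc in corpus for g in ngrams(doc)})
    return max(candidates, key=lambda g: cosine(embed(g), topic_vec))


toy_corpus = [
    "my serial port stopped working after the upgrade",
    "the video card needs new drivers",
]
print(name_topic(["modem", "port", "serial"], toy_corpus))
# → serial port
```

With a real sentence encoder the similarity would reflect meaning rather than word overlap, which may explain names in the table above that share few words with the keywords, such as "the video card".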
126+
127+
128+
## API Reference
129+
130+
:::turftopic.namers.base.TopicNamer
131+
132+
:::turftopic.namers.hf_transformers.LLMTopicNamer
133+
134+
:::turftopic.namers.openai.OpenAITopicNamer
135+
136+
:::turftopic.namers.ngram.NgramTopicNamer

mkdocs.yml

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -19,6 +19,7 @@ nav:
1919
- Autoencoding Models: ctm.md
2020
- FASTopic: FASTopic.md
2121
- Encoders: encoders.md
22+
- Namers: namers.md
2223
theme:
2324
name: material
2425
logo: images/logo.svg
