|
| 1 | +# Topic Namers |
| 2 | + |
| 3 | +Sometimes, especially when the number of topics grows large, |
| 4 | +it might be convenient to assign human-readable names to topics in an automated manner. |
| 5 | + |
| 6 | +Turftopic allows you to accomplish this with a number of different topic namer models. |
| 7 | + |
| 8 | +## Large Language Models |
| 9 | + |
| 10 | +Turftopic lets you utilise Large Language Models for generating human-readable topic names. |
| 11 | +This is done by instructing the language model to generate a topic name based on the keywords the topic model assigns as the most important for a given topic. |
| 12 | + |
| 13 | +### Running LLMs locally |
| 14 | + |
| 15 | +You can use any LLM from the HuggingFace Hub to generate topic names on your own machine. |
| 16 | +The default in Turftopic is [SmolLM](https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct), due to it's small size and speed, but we recommend using larger LLMs for higher quality topic names, especially in multilingual contexts. |
| 17 | + |
| 18 | +```python |
| 19 | +from turftopic import KeyNMF |
| 20 | +from turftopic.namers import LLMTopicNamer |
| 21 | + |
| 22 | +model = KeyNMF(10).fit(corpus) |
| 23 | + |
| 24 | +namer = LLMTopicNamer("HuggingFaceTB/SmolLM2-1.7B-Instruct") |
| 25 | +model.rename_topics(namer) |
| 26 | + |
| 27 | +model.print_topics() |
| 28 | +``` |
| 29 | + |
| 30 | +| Topic ID | Topic Name | Highest Ranking | |
| 31 | +| - | - | - | |
| 32 | +| 0 | Windows NT | windows, dos, os, ms, microsoft, unix, nt, memory, program, apps | |
| 33 | +| 1 | Theism vs. Atheism | atheism, atheist, atheists, belief, religion, religious, theists, beliefs, believe, faith | |
| 34 | +| 2 | "486 Motherboard" | motherboard, ram, memory, cpu, bios, isa, speed, 486, bus, performance | |
| 35 | +| 3 | Disk Drives | disk, drive, scsi, drives, disks, floppy, ide, dos, controller, boot | |
| 36 | +| 4 | Ethics | morality, moral, objective, immoral, morals, subjective, morally, society, animals, species | |
| 37 | +| 5 | Christianity | christian, bible, christians, god, christianity, religion, jesus, faith, religious, biblical | |
| 38 | +| 6 | modem-port-serial-connect-uart-pc-9600 | modem, port, serial, modems, ports, uart, pc, connect, fax, 9600 | |
| 39 | +| 7 | "Graphics Card" | card, drivers, monitor, vga, driver, cards, ati, graphics, diamond, monitors | |
| 40 | +| 8 | File Manager | file, files, ftp, bmp, windows, program, directory, bitmap, win3, zip | |
| 41 | +| 9 | Printer and Fonts | printer, print, fonts, printing, font, printers, hp, driver, deskjet, prints | |
| 42 | + |
| 43 | +### Using OpenAI's LLMs |
| 44 | + |
| 45 | +You might not have the computational resources to run a high-quality LLM locally. |
| 46 | +Luckily Turftopic allows you to use OpenAI's chat models for topic naming too! |
| 47 | + |
| 48 | + |
| 49 | +!!! info |
| 50 | + You will also need to install the `openai` Python package. |
| 51 | + ```bash |
| 52 | + pip install openai |
| 53 | + export OPENAI_API_KEY="sk-<your key goes here>" |
| 54 | + ``` |
| 55 | + |
| 56 | +```python |
| 57 | +from turftopic.namers import OpenAITopicNamer |
| 58 | + |
| 59 | +namer = OpenAITopicNamer("gpt-4o-mini") |
| 60 | +model.rename_topics(namer) |
| 61 | +model.print_topics() |
| 62 | +``` |
| 63 | + |
| 64 | +| Topic ID | Topic Name | Highest Ranking | |
| 65 | +| - | - | - | |
| 66 | +| 0 | Operating Systems and Software | windows, dos, os, ms, microsoft, unix, nt, memory, program, apps | |
| 67 | +| 1 | Atheism and Belief Systems | atheism, atheist, atheists, belief, religion, religious, theists, beliefs, believe, faith | |
| 68 | +| 2 | Computer Architecture and Performance | motherboard, ram, memory, cpu, bios, isa, speed, 486, bus, performance | |
| 69 | +| 3 | Storage Technologies | disk, drive, scsi, drives, disks, floppy, ide, dos, controller, boot | |
| 70 | +| 4 | Moral Philosophy and Ethics | morality, moral, objective, immoral, morals, subjective, morally, society, animals, species | |
| 71 | +| 5 | Christian Faith and Beliefs | christian, bible, christians, god, christianity, religion, jesus, faith, religious, biblical | |
| 72 | +| 6 | Serial Modem Connectivity | modem, port, serial, modems, ports, uart, pc, connect, fax, 9600 | |
| 73 | +| 7 | Graphics Card Drivers | card, drivers, monitor, vga, driver, cards, ati, graphics, diamond, monitors | |
| 74 | +| 8 | Windows File Management | file, files, ftp, bmp, windows, program, directory, bitmap, win3, zip | |
| 75 | +| 9 | Printer Font Management | printer, print, fonts, printing, font, printers, hp, driver, deskjet, prints | |
| 76 | + |
| 77 | +### Prompting |
| 78 | + |
| 79 | +Since these namers use chat-finetuned LLMs you can freely define custom prompts for topic name generation: |
| 80 | + |
| 81 | +```python |
| 82 | +from turftopic.namers import OpenAITopicNamer |
| 83 | + |
| 84 | +system_prompt = """ |
| 85 | +You are a topic namer. When the user gives you a set of keywords, you respond with a name for the topic they describe. |
| 86 | +You only repond briefly with the name of the topic, and nothing else. |
| 87 | +""" |
| 88 | + |
| 89 | +prompt_template = """ |
| 90 | +You will be tasked with naming a topic. |
| 91 | +Based on the keywords, create a short label that best summarizes the topics. |
| 92 | +Only respond with a short, human readable topic name and nothing else. |
| 93 | +
|
| 94 | +The topic is described by the following set of keywords: {keywords}. |
| 95 | +""" |
| 96 | + |
| 97 | +namer = OpenAITopicNamer("gpt-4o-mini", prompt_template=prompt_template, system_prompt=system_prompt) |
| 98 | +``` |
| 99 | + |
| 100 | +## N-gram Patterns |
| 101 | + |
| 102 | +You can also name topics based on the semantically closest n-grams from the corpus to the topic descriptions. |
| 103 | +This method typically results in lower quality names, but might be good enough for your use case. |
| 104 | + |
| 105 | + |
| 106 | +```python |
| 107 | +from turftopic.namers import NgramTopicNamer |
| 108 | + |
| 109 | +namer = NgramTopicNamer(corpus, encoder="all-MiniLM-L6-v2") |
| 110 | +model.rename_topics(namer) |
| 111 | +model.print_topics() |
| 112 | +``` |
| 113 | + |
| 114 | +| Topic ID | Topic Name | Highest Ranking | |
| 115 | +| - | - | - | |
| 116 | +| 0 | windows and dos | windows, dos, os, ms, microsoft, unix, nt, memory, program, apps | |
| 117 | +| 1 | many atheists out there | atheism, atheist, atheists, belief, religion, religious, theists, beliefs, believe, faith | |
| 118 | +| 2 | hardware and software | motherboard, ram, memory, cpu, bios, isa, speed, 486, bus, performance | |
| 119 | +| 3 | floppy disk drives and | disk, drive, scsi, drives, disks, floppy, ide, dos, controller, boot | |
| 120 | +| 4 | morality is subjective | morality, moral, objective, immoral, morals, subjective, morally, society, animals, species | |
| 121 | +| 5 | the christian bible | christian, bible, christians, god, christianity, religion, jesus, faith, religious, biblical | |
| 122 | +| 6 | the serial port | modem, port, serial, modems, ports, uart, pc, connect, fax, 9600 | |
| 123 | +| 7 | the video card | card, drivers, monitor, vga, driver, cards, ati, graphics, diamond, monitors | |
| 124 | +| 8 | the file manager | file, files, ftp, bmp, windows, program, directory, bitmap, win3, zip | |
| 125 | +| 9 | the print manager | printer, print, fonts, printing, font, printers, hp, driver, deskjet, prints | |
| 126 | + |
| 127 | + |
| 128 | +## API Reference |
| 129 | + |
| 130 | +:::turftopic.namers.base.TopicNamer |
| 131 | + |
| 132 | +:::turftopic.namers.hf_transformers.LLMTopicNamer |
| 133 | + |
| 134 | +:::turftopic.namers.openai.OpenAITopicNamer |
| 135 | + |
| 136 | +:::turftopic.namers.ngram.NgramTopicNamer |
0 commit comments