|
7 | 7 | ## Features |
8 | 8 | - Novel transformer-based topic models: |
9 | 9 | - Semantic Signal Separation - S³ 🧭 |
10 | | - - KeyNMF 🔑 (paper in progress ⏳) |
| 10 | + - KeyNMF 🔑 |
11 | 11 | - GMM :gem: (paper soon) |
12 | 12 | - Implementations of other transformer-based topic models |
13 | 13 | - Clustering Topic Models: BERTopic and Top2Vec |
|
20 | 20 |
|
21 | 21 | > This package is still a work in progress, and scientific papers on some of the novel methods are currently undergoing peer review. If you use this package and encounter any problems, let us know by opening an issue. |
22 | 22 |
|
23 | | -### New in version 0.5.0 |
| 23 | +### New in version 0.6.0 |
24 | 24 |
|
25 | | -#### Hierarchical KeyNMF |
| 25 | +#### Prompting Embedding Models |
26 | 26 |
|
27 | | -You can now subdivide topics in KeyNMF at will. |
| 27 | +KeyNMF and clustering topic models can now efficiently utilise asymmetric and instruction-finetuned embedding models. |
| 28 | +This, in combination with the right embedding model, can enhance performance significantly. |
28 | 29 |
|
29 | 30 | ```python |
30 | 31 | from turftopic import KeyNMF |
31 | | - |
32 | | -model = KeyNMF(2, top_n=15, random_state=42).fit(corpus) |
33 | | -model.hierarchy.divide_children(n_subtopics=3) |
34 | | -print(model.hierarchy) |
35 | | -``` |
36 | | - |
37 | | -``` |
38 | | -Root |
39 | | -├── windows, dos, os, disk, card, drivers, file, pc, files, microsoft |
40 | | -│ ├── 0.0: dos, file, disk, files, program, windows, disks, shareware, norton, memory |
41 | | -│ ├── 0.1: os, unix, windows, microsoft, apps, nt, ibm, ms, os2, platform |
42 | | -│ └── 0.2: card, drivers, monitor, driver, vga, ram, motherboard, cards, graphics, ati |
43 | | -└── 1: atheism, atheist, atheists, religion, christians, religious, belief, christian, god, beliefs |
44 | | -. ├── 1.0: atheism, alt, newsgroup, reading, faq, islam, questions, read, newsgroups, readers |
45 | | -. ├── 1.1: atheists, atheist, belief, theists, beliefs, religious, religion, agnostic, gods, religions |
46 | | -. └── 1.2: morality, bible, christian, christians, moral, christianity, biblical, immoral, god, religion |
| 32 | +from sentence_transformers import SentenceTransformer |
| 33 | + |
| 34 | +encoder = SentenceTransformer( |
| 35 | + "intfloat/multilingual-e5-large-instruct", |
| 36 | + prompts={ |
| 37 | + "query": "Instruct: Retrieve relevant keywords from the given document. Query: ",
| 38 | + "passage": "Passage: " |
| 39 | + }, |
| 40 | + # Make sure to set default prompt to query! |
| 41 | + default_prompt_name="query", |
| 42 | +) |
| 43 | +model = KeyNMF(10, encoder=encoder) |
47 | 44 | ``` |
48 | 45 |
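To make the role of the two prompts in the snippet above concrete, here is a minimal sketch of what a prompt does at encoding time: sentence-transformers prepends the selected prompt string to each input text before embedding it. The helper `apply_prompt` is hypothetical and only mimics that string-level step; it is not part of Turftopic or sentence-transformers. Which prompt is applied to documents and which to candidate keywords is an assumption here, not something this README states.

```python
# Hypothetical helper: mimics, at the string level, how a prompt is
# prepended to each text before it is embedded.
PROMPTS = {
    "query": "Instruct: Retrieve relevant keywords from the given document. Query: ",
    "passage": "Passage: ",
}


def apply_prompt(texts, prompt_name, prompts=PROMPTS):
    """Return the texts with the named prompt prepended to each one."""
    prefix = prompts[prompt_name]
    return [prefix + text for text in texts]


# Assumption: documents take the "query" prompt,
# candidate keywords take the "passage" prompt.
documents = ["The new graphics driver crashes on startup."]
keywords = ["driver", "graphics"]

prompted_documents = apply_prompt(documents, "query")
prompted_keywords = apply_prompt(keywords, "passage")

print(prompted_documents[0])
print(prompted_keywords)
```

With a real encoder, the same selection is made by passing `prompt_name` to `SentenceTransformer.encode`; when it is omitted, the prompt named by `default_prompt_name` is used.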
|
49 | | -#### FASTopic *(Experimental)* |
50 | | - |
51 | | -You can now use [FASTopic](https://github.com/BobXWu/FASTopic) inside Turftopic. |
52 | | - |
53 | | -```python |
54 | | -from turftopic import FASTopic |
55 | | - |
56 | | -model = FASTopic(10).fit(corpus) |
57 | | -model.print_topics() |
58 | | -``` |
59 | 46 |
|
60 | 47 | ## Basics [(Documentation)](https://x-tabdeveloping.github.io/turftopic/) |
61 | 48 | [Open in Colab](https://colab.research.google.com/github/x-tabdeveloping/turftopic/blob/main/examples/basic_example_20newsgroups.ipynb) |
@@ -184,5 +171,5 @@ Alternatively you can use the [Figures API](https://x-tabdeveloping.github.io/to |
184 | 171 | - Grootendorst, M. (2022, March 11). BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv.org. https://arxiv.org/abs/2203.05794 |
185 | 172 | - Angelov, D. (2020, August 19). Top2VEC: Distributed representations of topics. arXiv.org. https://arxiv.org/abs/2008.09470 |
186 | 173 | - Bianchi, F., Terragni, S., & Hovy, D. (2020, April 8). Pre-training is a Hot Topic: Contextualized Document Embeddings Improve Topic Coherence. arXiv.org. https://arxiv.org/abs/2004.03974 |
187 | | - - Bianchi, F., Terragni, S., Hovy, D., Nozza, D., & Fersini, E. (2021). Cross-lingual Contextualized Topic Models with Zero-shot Learning. In Proceedings of the 16th Conference of the European |
188 | | - - Chapter of the Association for Computational Linguistics: Main Volume (pp. 1676–1683). Association for Computational Linguistics. |
| 174 | + - Bianchi, F., Terragni, S., Hovy, D., Nozza, D., & Fersini, E. (2021). Cross-lingual Contextualized Topic Models with Zero-shot Learning. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume (pp. 1676–1683). Association for Computational Linguistics. |
| 175 | + - Kristensen-McLachlan, R. D., Hicke, R. M. M., Kardos, M., & Thunø, M. (2024, October 16). Context is Key(NMF): Modelling Topical Information Dynamics in Chinese Diaspora Media. arXiv.org. https://arxiv.org/abs/2410.12791 |