Commit 8a720f6

Exposed concept browser and added docs with example
1 parent 6304bac commit 8a720f6

3 files changed

Lines changed: 850 additions & 0 deletions

docs/concept_induction.md

Lines changed: 110 additions & 0 deletions
@@ -0,0 +1,110 @@
# Concept Induction (BETA)

Concept induction is the idea that higher-level concepts in a corpus can be discovered and described in detail using Large Language Models ([Lam et al. 2024](https://arxiv.org/abs/2404.12259)).
These high-level concepts can also be discovered from particular angles, using seeds.
The original study and the [Lloom package](https://stanfordhci.github.io/lloom/) use LLMs at every step, and therefore require substantial computational resources and aggressive down-sampling of the original corpus.

To address this scalability issue, we use a [seeded topic model](seeded.md) ([KeyNMF](keynmf.md)) to discover the concepts, and only use LLMs to name and describe them.
This allows us to get results similar to Lloom at a fraction of the cost.

In addition, we allow users to generate a **Concept Browser** programmatically, with which these concepts and their related documents can be explored.

<figure>
<iframe src="../images/concept_induction.html" title="Concepts discovered on the political ideologies dataset" style="height:1000px;width:1200px;padding:0px;border:none;"></iframe>
<figcaption> Figure 1: Concepts discovered on the political ideologies dataset. </figcaption>
</figure>

## Example Usage

The example below uses a synthetically generated political ideologies dataset, which we examine from the following angles:

- Taxation
- Stance on immigration
- Environmental policy

We use an OpenAI analyzer and KeyNMF, with the `paraphrase-MiniLM-L12-v2` embedding model.
The code runs in about ten minutes.

Install dependencies and set the API key:

```bash
pip install "turftopic[openai]" datasets
export OPENAI_API_KEY="sk-<your API key here>"
```

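Since the full example takes several minutes, it can be worth confirming up front that the key is actually visible to Python. The following is a minimal sketch using only the standard library; `require_api_key` is a hypothetical helper name introduced here for illustration and is not part of Turftopic.

```python
import os


def require_api_key(name: str = "OPENAI_API_KEY") -> str:
    """Return the value of an environment variable, or fail with a clear message.

    Hypothetical helper for illustration only; not part of Turftopic.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; export it before running the example."
        )
    return value
```

Calling `require_api_key()` at the top of the script fails fast, before any embeddings are computed.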
```python
import numpy as np
from datasets import load_dataset
from sentence_transformers import SentenceTransformer

from turftopic import KeyNMF, create_concept_browser
from turftopic.analyzers import OpenAIAnalyzer

# Loading the dataset from huggingface
ds = load_dataset("JyotiNayak/political_ideologies", split="train")
corpus = list(ds["statement"])

# Embedding all documents in the corpus
encoder = SentenceTransformer("paraphrase-MiniLM-L12-v2")
embeddings = encoder.encode(corpus, show_progress_bar=True)

# Running separate seeded KeyNMF models for each tab and saving them
seeds = ["Taxation", "Stance on immigration", "Environmental policy"]
models = []
doc_topic = []
for seed in seeds:
    model = KeyNMF(
        3, encoder=encoder, seed_phrase=seed, seed_exponent=2, random_state=42
    )
    doc_topic_matrix = model.fit_transform(corpus, embeddings=embeddings)
    doc_topic.append(doc_topic_matrix)
    models.append(model)

# Calculating topic sizes
top_documents = []
topic_sizes = []
for doc_topic_matrix in doc_topic:
    # We say that a document contains a topic if its importance is at least
    # five percent of the maximum importance in the document-topic matrix
    rescaled = doc_topic_matrix / doc_topic_matrix.max()
    sizes = (rescaled >= 0.05).sum(axis=0)
    topic_sizes.append(sizes)
    # Finding representative documents for each topic
    docs = []
    for doc_t in rescaled.T:
        # Extracting top 10 documents for each topic
        top = np.argsort(-doc_t)[:10]
        # Making sure only those documents get in
        # that we have marked to contain the topic
        top = top[doc_t[top] >= 0.05]
        docs.append([corpus[i] for i in top])
    top_documents.append(docs)
topic_sizes = np.stack(topic_sizes)

# Running topic analysis on all models using GPT-5-Nano
analyzer = OpenAIAnalyzer()
analysis_results = []
for model, docs in zip(models, top_documents):
    res = analyzer.analyze_topics(
        keywords=model.get_top_words(), documents=docs
    )
    analysis_results.append(res)

# Creating the concept browser:
browser = create_concept_browser(
    seeds=seeds,
    topic_names=[res.topic_names for res in analysis_results],
    keywords=[model.get_top_words() for model in models],
    topic_descriptions=[res.topic_descriptions for res in analysis_results],
    topic_sizes=topic_sizes,
    top_documents=top_documents,
)
browser.show()
```

_See Figure 1 for the results._

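To make the thresholding step in the example easier to follow, here is a small sketch that applies the same rule to a toy document-topic matrix using only NumPy. The function name `topic_sizes_and_top_docs` and the toy values are assumptions made for illustration; they are not part of Turftopic.

```python
import numpy as np


def topic_sizes_and_top_docs(doc_topic_matrix, threshold=0.05, top_k=10):
    """Hypothetical helper mirroring the thresholding logic of the example.

    Returns the number of documents per topic and, for each topic,
    the indices of its top documents that pass the threshold.
    """
    # Rescale by the global maximum of the matrix, as in the example
    rescaled = doc_topic_matrix / doc_topic_matrix.max()
    # A document "contains" a topic if its rescaled weight reaches the threshold
    sizes = (rescaled >= threshold).sum(axis=0)
    top_indices = []
    for doc_t in rescaled.T:
        # Highest-weighted documents first, limited to top_k
        top = np.argsort(-doc_t)[:top_k]
        # Keep only documents marked as containing the topic
        top_indices.append(top[doc_t[top] >= threshold])
    return sizes, top_indices


# Toy matrix: 4 documents (rows) x 2 topics (columns)
toy = np.array(
    [
        [0.90, 0.00],
        [0.30, 0.02],
        [0.01, 0.60],
        [0.00, 0.10],
    ]
)
sizes, top_indices = topic_sizes_and_top_docs(toy, threshold=0.05, top_k=3)
print(sizes)  # [2 2]
print([idx.tolist() for idx in top_indices])  # [[0, 1], [2, 3]]
```

Note that the rescaling divides by the maximum of the whole matrix, so a topic whose weights are low across the board can end up with few or no documents. Normalising each column separately with `doc_topic_matrix.max(axis=0)` would be one alternative, but that is a different rule from the one used in the example.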
## API reference

::: turftopic._concept_browser.create_browser
