Commit ae349c8

Added documentation on TopicData
1 parent d5d84d7 commit ae349c8

5 files changed: 150 additions & 13 deletions


docs/model_definition_and_training.md

Lines changed: 49 additions & 7 deletions
@@ -182,22 +182,64 @@ See a detailed guide [here](../namers.md).

## Training and Inference

### Model Training

All models in Turftopic follow the scikit-learn API for fitting topic models.
Every model in the library can be trained by passing a set of documents (a _corpus_) to the model.
The corpus has to be a reusable `Iterable`, as models will typically do multiple passes over it.

```python
corpus: list[str] = ["this is a document", "this is yet another document", ...]
```

!!! quote "Fit your topic model"
    === "`fit(raw_documents, embeddings=None)`"

        `fit()` fits the topic model and returns the same, now fitted, model object.
        You can optionally pass a set of precomputed embeddings for the documents.

        ```python
        model.fit(corpus)
        # or
        model.fit(corpus, embeddings=embeddings)
        ```

    === "`fit_transform(raw_documents, embeddings=None)`"

        `fit_transform()` not only trains the model but also returns the topic proportions of all documents in the corpus.

        ```python
        document_topic_matrix = model.fit_transform(corpus)
        # or
        document_topic_matrix = model.fit_transform(corpus, embeddings=embeddings)
        print(document_topic_matrix.shape)
        # prints (n_documents, n_topics)
        ```

    === "`prepare_topic_data(corpus, embeddings=None)`"

        `prepare_topic_data()` fits the model (if it has not been fitted already) and also saves other aspects of topic inference, which makes the resulting object easier to use for pretty printing and visualizing your models (see [Model Interpretation](model_interpretation.md)).

        ```python
        topic_data = model.prepare_topic_data(corpus)
        # print to see what attributes you can access
        print(topic_data)
        ```
        ```
        TopicData
        ├── corpus (1000)
        ├── vocab (1746,)
        ├── document_term_matrix (1000, 1746)
        ├── topic_term_matrix (10, 1746)
        ├── document_topic_matrix (1000, 10)
        ├── document_representation (1000, 384)
        ├── transform
        ├── topic_names (10)
        ├── has_negative_side
        └── hierarchy
        ```

        See [Using TopicData](topic_data.md) for more detail.

### Precomputing Embeddings

To cut down on costs and computational load when fitting multiple models in a row, you might want to encode the documents before fitting a model.
Encoding the corpus is the heaviest part of the process, and you can save yourself a lot of time by only doing it once.
Some models have to encode the vocabulary as well; this cannot be done before fitting, as these models learn the vocabulary itself from the corpus.

The `fit()` method of all models takes an `embeddings` argument, which allows you to pass a precomputed embedding matrix along to fitting.

docs/model_interpretation.md

Lines changed: 3 additions & 3 deletions
@@ -11,7 +11,7 @@ topic_data = model.prepare_topic_data(corpus)

## Topic Tables

The easiest way to investigate topics in your fitted model is to use the built-in pretty-printing utilities, which you can call on every fitted model or [`TopicData`](topic_data.md) object.

!!! quote "Interpret your models with topic tables"
    === "Relevant Words"

@@ -127,7 +127,7 @@ pip install topic-wizard

### Web App

The easiest way to investigate any topic model interactively is to use the topicwizard web app.
You can launch the app using either a [`TopicData`](topic_data.md) or a model object together with a representative sample of documents.

=== "With `TopicData`"

@@ -148,7 +148,7 @@ You can launch the app either using a `TopicData` or a model object and a repres

### Figures

You can also produce individual interactive figures using the [Figures API in topicwizard](https://x-tabdeveloping.github.io/topicwizard/figures.html).
Almost all figures in the Figures API can be called on the `figures` submodule of any [`TopicData`](topic_data.md) object.

!!! quote "Interpret your models using interactive figures"
    === "Topic Map"

docs/persistence.md

Lines changed: 27 additions & 1 deletion
@@ -1,4 +1,7 @@

# Saving and loading

## Model persistence

All models in Turftopic can be serialized and saved to disk, or published to the HuggingFace Hub.

### Saving locally

@@ -32,3 +35,26 @@ model = load_model("./local_directory/")

# or from hub
model = load_model("your_user/s3_20-newsgroups_10-topics")
```

## `TopicData` persistence

You can also save and load `TopicData` objects with Turftopic.
These are saved using `joblib`, so we recommend giving all `TopicData` files a `.joblib` extension:

!!! note "Note on compatibility"
    For backwards compatibility, `TopicData` objects are saved using `joblib` as simple `dict` objects.
    If you load a saved `TopicData` object with `joblib` directly, without using `from_disk()`, it will load as a `dict`.

=== "Save"

    ```python
    topic_data = model.prepare_topic_data(corpus)
    topic_data.to_disk("topic_data.joblib")
    ```

=== "Load"

    ```python
    from turftopic.data import TopicData

    topic_data = TopicData.from_disk("topic_data.joblib")
    ```
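
The compatibility behavior can be demonstrated with `joblib` directly (a minimal sketch; the dictionary keys shown here are illustrative stand-ins, not the full set Turftopic stores):

```python
import joblib

# TopicData is persisted as a plain dict, so loading with joblib alone
# yields a dict rather than a TopicData object.
saved = {"corpus": ["doc one", "doc two"], "topic_names": ["0_example"]}
joblib.dump(saved, "topic_data.joblib")

raw = joblib.load("topic_data.joblib")
print(type(raw).__name__)  # dict
# TopicData.from_disk("topic_data.joblib") restores the full object instead.
```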

docs/topic_data.md

Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@

# `TopicData`

While Turftopic provides a fully sklearn-compatible interface for training and using topic models, this is not always optimal, especially when you have to visualize models or save more information about inference than would be practical to keep in a model object.
We have therefore added an abstraction borrowed from [topicwizard](https://github.com/x-tabdeveloping/topicwizard) called `TopicData`.

## Producing `TopicData`

Every model has methods with which you can produce this object:

!!! quote "Prepare `TopicData` objects"
    === "`prepare_topic_data(corpus, embeddings=None)`"

        ```python
        topic_data = model.prepare_topic_data(corpus)
        # print to see what attributes are available
        print(topic_data)
        ```
        ```
        TopicData
        ├── corpus (1000)
        ├── vocab (1746,)
        ├── document_term_matrix (1000, 1746)
        ├── topic_term_matrix (10, 1746)
        ├── document_topic_matrix (1000, 10)
        ├── document_representation (1000, 384)
        ├── transform
        ├── topic_names (10)
        ├── has_negative_side
        └── hierarchy
        ```

    === "`prepare_dynamic_topic_data(corpus, timestamps, embeddings=None, bins=10)`"

        Models that support dynamic topic modeling also have this method, which includes dynamic topics in the resulting `TopicData` object.

        ```python
        import datetime

        timestamps: list[datetime.datetime] = [...]
        topic_data = model.prepare_dynamic_topic_data(corpus, timestamps=timestamps)
        ```

## Using `TopicData`

`TopicData` is a dict-like object and can, for all intents and purposes, be used as a Python dictionary, but for convenience you can also access its attributes with dot syntax:

```python
# These are equivalent
assert topic_data["document_term_matrix"].shape == topic_data.document_term_matrix.shape
```

Much like with models, you can pretty-print information about topic models based on a `TopicData` object, but since it contains more information on inference than the model object itself, you sometimes have to pass fewer parameters than if you called the same method on the model:

```python
model.print_representative_documents(0, corpus, document_topic_matrix)
# This is simpler with TopicData, since you only have to pass the topic ID
topic_data.print_representative_documents(0)
```

When producing figures, `TopicData` also gives you shorthands for accessing the topicwizard web app and Figures API:

```python
topic_data.figures.topic_map()
```

<center>
<iframe src="https://x-tabdeveloping.github.io/topicwizard/_static/plots/topic_map.html" width="1000px" height="450px" frameborder=0></iframe>
</center>

See our guide on [Model Interpretation](model_interpretation.md) for more info.

## API Reference

::: turftopic.data.TopicData

mkdocs.yml

Lines changed: 3 additions & 2 deletions
@@ -6,12 +6,13 @@ nav:
  - Getting Started: index.md
  - Defining and Fitting Topic Models: model_definition_and_training.md
  - Interpreting and Visualizing Models: model_interpretation.md
  - Seeded Topic Modeling: seeded.md
  - Dynamic Topic Modeling: dynamic.md
  - Online Topic Modeling: online.md
  - Hierarchical Topic Modeling: hierarchical.md
  - Modifying and Finetuning Models: finetuning.md
  - Saving and Loading: persistence.md
  - Using TopicData: topic_data.md
  - Models:
    - Model Overview: model_overview.md
    - Semantic Signal Separation (S³): s3.md
