
Commit 1b5026b

Camera ready (#48)
* Add reproducibility package infrastructure
  - Makefile with targets for running table/figure generation
  - README with usage instructions and data download link
  - requirements.txt with dependencies
  - download_data.py script for fetching pre-generated graphs

* Add PGD benchmark and MMD table generation scripts
  - generate_benchmark_tables.py: standard PGD metrics with VUN
  - generate_mmd_tables.py: Gaussian TV and RBF MMD metrics

  Both scripts use the polygraph-benchmark API:
  - StandardPGDInterval for PGD computation
  - GaussianTVMMD2BenchmarkInterval/RBFMMD2BenchmarkInterval for MMD
  - Proper graph format conversion for DIGRESS tensor outputs

* Add GKLR table generation script

  Computes PGD metrics using a Logistic Regression classifier instead of TabPFN, with standard descriptors (orbit counts, degree, spectral, clustering, GIN). Uses PolyGraphDiscrepancyInterval with sklearn LogisticRegression for classifier-based evaluation.

* Add concatenation ablation table generation script

  Compares standard PGD (max over individual descriptors) vs concatenated PGD (all descriptors combined into a single feature vector). Features:
  - ConcatenatedDescriptor class with PCA dimensionality reduction
  - Handles the TabPFN 500-feature limit via PCA to 100 components
  - Uses LogisticRegression for concatenated features
  - Optimized subset mode for faster testing

* Add figure generation scripts for reproducibility
  - generate_model_quality_figures.py: training/denoising curves
  - generate_perturbation_figures.py: metric sensitivity to edge perturbations
  - generate_phase_plot.py: PGD vs VUN training dynamics
  - generate_subsampling_figures.py: bias-variance tradeoff analysis

  All scripts use StandardPGDInterval from the polygraph-benchmark API. The phase plot gracefully handles missing VUN values (which require graph_tool).
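The PCA reduction in the concatenation ablation can be sketched roughly as follows. The class name `ConcatenatedDescriptor` and the 500-feature TabPFN limit come from the commit message; the body (a plain SVD-based PCA over horizontally stacked descriptor outputs) is an illustrative assumption, not the repository's implementation:

```python
import numpy as np


class ConcatenatedDescriptor:
    """Concatenate per-graph descriptor features, then reduce with PCA.

    Illustrative sketch: TabPFN accepts at most 500 input features, so
    the concatenated vector is projected down (here via SVD-based PCA).
    """

    def __init__(self, descriptors, n_components=100):
        self.descriptors = descriptors  # callables: graphs -> (n, d_i) array
        self.n_components = n_components

    def fit_transform(self, graphs):
        # Stack all descriptor outputs into one wide feature matrix.
        feats = np.hstack([np.asarray(d(graphs)) for d in self.descriptors])
        centered = feats - feats.mean(axis=0)
        # Principal directions from the SVD of the centered matrix.
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        k = min(self.n_components, vt.shape[0])
        return centered @ vt[:k].T
```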
* Update pixi config to new workspace schema and fix package name

  Rename [project] to [workspace] per the updated pixi schema, and correct the pypi-dependencies package name from polygraph to polygraph-benchmark.

* Relocate data directory to data/polygraph_graphs/

  Move the expected data path from polygraph_graphs/ to data/polygraph_graphs/ to keep generated data under the gitignored data/ directory.

* Remove standalone reproducibility README and requirements.txt

  Documentation is consolidated into the main README, and dependencies are managed through pyproject.toml extras.

* Add SLURM cluster submission infrastructure

  Add a submitit-based cluster module for distributing reproducibility workloads across SLURM nodes. Includes YAML-configurable job parameters, job metadata tracking, and result collection helpers.
  - cluster.py: shared wrapper with SlurmConfig, submit_jobs, collect_results
  - configs/: default CPU and GPU SLURM configurations
  - pyproject.toml: new [cluster] optional dependency group (submitit, pyyaml)

* Add SLURM cluster support to table generation scripts

  Add --slurm-config, --local, and --collect CLI options to all four table generation scripts for distributing computation across SLURM nodes. Each script gains a standalone task function suitable for submitit, result reshaping helpers, and three execution modes (local, submit, collect). Also updates DATA_DIR paths and adds tables-submit/tables-collect Make targets.

* Add reproducibility section to README

  Document the full reproducibility workflow, including data download, script overview, Make targets, hardware requirements, SLURM cluster submission, and troubleshooting tips.

* Add pre-generated tables and figures for reference

  Include LaTeX tables and PDF figures produced by the reproducibility scripts so reviewers can verify outputs without re-running computation.
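The YAML-configurable job parameters of the cluster module might be modeled roughly like this. `SlurmConfig` and the field names mirror the YAML keys shown in the README section added by this PR; the `from_dict` helper and defaults are assumptions for illustration, and the real module would hand these values to submitit rather than just storing them:

```python
from dataclasses import dataclass


@dataclass
class SlurmConfig:
    """SLURM job parameters, typically parsed from a YAML config file."""

    partition: str = "cpu"
    timeout_min: int = 360
    cpus_per_task: int = 8
    mem_gb: int = 32

    @classmethod
    def from_dict(cls, raw):
        # Accept a parsed YAML document shaped like {"slurm": {...}};
        # missing keys fall back to the defaults above.
        return cls(**raw.get("slurm", {}))
```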
* Restructure reproducibility pipeline and regenerate all results
  - Replace monolithic generate_*.py scripts with modular 01-08 experiment directories, each with compute.py, plot.py, and/or format.py
  - Add Hydra configs for all experiments with SLURM launcher support
  - Fix sparse feature OOM in GKLR (Bug 12), package name in graph_storage, TabPFN CPU limit workaround, and stale cache issues
  - Add kernel logistic regression module and async results I/O utility
  - Regenerate all tables with correct PGD values, subscores, and GKLR graph kernel metrics (PM/SP/WL)
  - Regenerate all figures, including new subsampling, perturbation, model quality, and phase plot visualizations
  - Include all JSON result files for full reproducibility

* Fix KernelLogisticRegression float64 precision in kernel normalization

  Ensure consistent float64 dtype in kernel diagonal computation and normalization to prevent precision issues with sparse matrix outputs.

* Fix equal-size graph splits for ego and proteins perturbation

  The ego dataset has 757 graphs (an odd number), causing unequal reference/perturbed splits, which fails the equal-count requirement. Use half = len // 2 and slice [half : 2*half] to guarantee equal sizes. The same fix is applied to the proteins split for consistency.

* Handle TabPFN NaN errors for near-constant features in PGD

  TabPFN v2.0.9 raises ValueError when its encoder produces NaN from near-constant features after StandardScaler normalization (see github.com/PriorLabs/TabPFN/issues/108). This caused lobster/GRAN n=32 PGD subsampling to crash completely. Wrap classifier fit/predict in try/except in both the CV fold loop and the refit section. On failure, treat the distributions as indistinguishable (score=0), matching the existing constant-feature fallback semantics.

* Fix concatenation experiment PCA pipeline and subsample sizes

  Match the original polygraph CombinedDescriptor behavior: per-descriptor StandardScaler + PCA, both fit on reference data only.
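The equal-size split fix described above (half = len // 2, slice [half : 2*half]) can be expressed directly; the function name is illustrative:

```python
def equal_size_split(graphs):
    """Split a graph list into equally sized reference/perturbed halves.

    With an odd count (e.g. the 757-graph ego dataset), a naive
    graphs[:n//2] / graphs[n//2:] split yields 378 vs 379 graphs and
    violates the equal-count requirement; slicing [half : 2*half]
    drops the one leftover graph instead.
    """
    half = len(graphs) // 2
    return graphs[:half], graphs[half : 2 * half]
```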
  Fix the subsample size calculation to use 50% of the minimum subset capped at 2048, matching the original experiment configuration.

* Fix GKLR experiment reference graph count and subsample sizes

  Increase the reference graph count from 512 to 4096 to match the original experiment. Fix the subsample size to 50% of the minimum subset capped at 2048, consistent with the 2x requirement of PolyGraphDiscrepancyInterval.

* Add H100 GPU SLURM launcher configuration

  Add a submitit launcher config for the p.hpcl94g partition with H100 GPUs, enabling faster TabPFN computation for PGD experiments.

* Regenerate all reproducibility results with bug fixes

  Recompute all experiments after fixing:
  - KernelLogisticRegression float64 precision
  - Ego/proteins unequal graph splits
  - TabPFN NaN handling for near-constant features
  - Concatenation PCA pipeline
  - GKLR reference graph count and subsample sizes

  PGD subsampling: 117/120 results (3 ESGG n=4096 runs infeasible due to dataset size); all values within bootstrap variance of the paper. Perturbation: 25/25 results, including the ego dataset. Benchmark, concatenation, and GKLR tables: all 16/16 regenerated.

* Update PGD computation, reproducibility scripts, and dependencies

  Includes TabPFN v6 classifier updates, plotting and formatting improvements across all reproducibility experiments, and added backoff/tabpfn dependencies.

* Update all figures and tables to TabPFN weights v2.5

  Regenerate all reproducibility tables and figures using TabPFN weights v2.5 for camera-ready preparation. Add --results-suffix support to 03_model_quality/format.py. Include comparison and merge utility scripts.

* Rename PGS to PGD in model quality table headers

* Add pymupdf and pillow dependencies

  Needed for PDF-to-image conversion in diff report generation.

* Add parallel VUN computation with isomorphism timeout

  Refactor the VUN metric to support multiprocessing for novelty and validity checks, and add a per-pair SIGALRM timeout on isomorphism to prevent hangs on pathological graph pairs.
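The per-pair SIGALRM timeout can be sketched like this. `check_with_timeout` is a hypothetical name, and SIGALRM only works in the main thread on Unix — which is why a later commit in this PR swaps it for a ThreadPoolExecutor:

```python
import signal


class IsomorphismTimeout(Exception):
    """Raised when a single isomorphism check exceeds its time budget."""


def check_with_timeout(check, g1, g2, seconds=30):
    """Run one isomorphism check under a per-pair SIGALRM timeout.

    Sketch of the approach in this commit: a pathological graph pair
    can make the check run effectively forever, so the alarm aborts it
    and lets the caller move on.
    """
    def _handler(signum, frame):
        raise IsomorphismTimeout

    old = signal.signal(signal.SIGALRM, _handler)
    signal.alarm(seconds)
    try:
        return check(g1, g2)
    finally:
        signal.alarm(0)                  # cancel any pending alarm
        signal.signal(signal.SIGALRM, old)  # restore previous handler
```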
  Extract shared VUN helpers into reproducibility/utils/vun.py for reuse across experiments.

* Refactor TabPFN classifier creation to use explicit version map

  Replace ad-hoc if/else branching on the weights version with a version_map dict that raises on unknown versions instead of silently falling back. Applied consistently across all five compute scripts.

* Add VUN computation scripts for model quality and benchmark experiments

  Add dedicated scripts to compute VUN (Valid-Unique-Novel) metrics for denoising-iteration checkpoints and benchmark results. These patch existing result JSONs with VUN values using parallel isomorphism checking.

* Add SLURM launcher configs and train-test reference experiment

  Add CPU-only (hpcl94c) and GPU (hpcl93) SLURM launcher configs for Hydra multirun. Add experiment 09, which computes train-vs-test reference PGD values to establish metric baselines per dataset.

* Improve table formatting with per-row ranking and VUN column

  Add bold/underline formatting for best/second-best values per row in the correlation and benchmark tables. Scale correlation values by 100 for readability. Add VUN column support in the denoising PGS table. Add subscore ranking in the benchmark table. Rename orbit_pgs to orbit4_pgs.

* Add single-dataset perturbation plotting command

  Add a new CLI subcommand for generating perturbation metric-vs-noise figures for a single dataset (e.g. SBM-only plots), supporting both single-perturbation and all-perturbation layouts.

* Regenerate all figures and tables for camera-ready

  Updated with TabPFN weights v2.5, improved table formatting (bold/underline ranking, values scaled by 100), new SBM perturbation plots, and additional versioned table snapshots for comparison.

* Add reproducibility debug utilities and analysis scripts

  Add helper scripts used during the camera-ready recomputation: PGD diff checking, environment validation, pickle inspection, HTML diff report generation, SLURM recompute wrappers, and rerun notes documenting the process.
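The version-map refactor follows a common pattern; this sketch returns a resolved identifier rather than constructing an actual TabPFNClassifier, and the map's keys and values are illustrative assumptions, not the repository's exact strings:

```python
def make_tabpfn_classifier(weights_version):
    """Resolve a TabPFN weights version via an explicit map.

    Hypothetical sketch: the dict replaces ad-hoc if/else branching
    and raises on unknown versions instead of silently falling back.
    The real factory would construct a TabPFN classifier from the
    resolved entry; here we just return it.
    """
    version_map = {
        "v2": "tabpfn-v2",
        "v2.5": "tabpfn-v2.5",
    }
    try:
        return version_map[weights_version]
    except KeyError:
        raise ValueError(
            f"Unknown TabPFN weights version {weights_version!r}; "
            f"expected one of {sorted(version_map)}"
        ) from None
```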
* Move pymupdf and pillow to dev optional dependencies

  These are only used by the diff report generator script, not the core library. Move them from top-level pixi.toml dependencies into the dev extras in pyproject.toml so they're pulled in via the existing extras = ["dev", "cluster"] configuration.

* Remove pre-generated figures and tables from tracking

  These are generated artifacts that should be reproduced from the scripts, not tracked in version control.

* Remove vscode settings

* Fix ruff lint, ruff format, and pyright type check errors
  - Fix 37 ruff lint errors (unused imports, f-strings, ambiguous variable names, unused assignments)
  - Auto-format all files with ruff
  - Fix 31 pyright type errors: numpy.bool return types, scipy sparse shape stubs, Literal type annotations, None-safety assertions, and conditional weight kwargs

* Add kernel_diag abstraction and refactor KernelLogisticRegression

  Extract a _resolve_kernel method to eliminate duplicated kernel selection logic in _compute_kernel_matrix and _compute_kernel_diag. Add an abstract kernel_diag method to DescriptorKernel with concrete implementations in all subclasses, replacing inline isinstance checks.

* Relax pydantic constraint, remove grakel dev dep, register slow marker

  Widen pydantic to >=2.0,<3.0. Remove grakel from dev dependencies since tests now use frozen reference values. Register the slow pytest marker to suppress warnings.

* Clean up test fixtures and add requires_import helper

  Remove hard imports of rdkit and graph_tool that prevented the test suite from loading without optional dependencies. Remove autouse=True from fixtures that are only needed by specific tests. Add a requires_import() skip decorator and --skip-slow marker filtering.

* Replace DGL runtime dependency with frozen reference values

  DGL is incompatible with PyTorch>=2.4. Replace runtime DGL comparisons with reference values precomputed under DGL 2.3.0 / PyTorch 2.3.1, with regeneration instructions in comments.
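A requires_import() skip decorator of the kind described above is commonly built on pytest.mark.skipif; the exact signature in the repository may differ, but the idea is:

```python
import importlib.util

import pytest


def requires_import(module_name):
    """Skip a test when an optional dependency is not installed.

    Sketch of the helper named in the commit message: unlike a hard
    `import rdkit` at module scope, this lets the test file load even
    when the optional package (rdkit, graph_tool, ...) is absent.
    """
    return pytest.mark.skipif(
        importlib.util.find_spec(module_name) is None,
        reason=f"requires {module_name}",
    )
```

Usage would be `@requires_import("rdkit")` on individual tests.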
* Replace grakel runtime dependency with frozen reference values

  grakel is incompatible with numpy>=2. Replace runtime grakel comparisons with reference gram matrices precomputed under grakel 0.1.10, with regeneration instructions in comments.

* Use @pytest.mark.slow consistently for slow tests

  Replace skipif("config.getoption('--skip-slow')") with the @pytest.mark.slow decorator for consistency across the test suite.

* Remove one-off debug and comparison scripts from reproducibility/

  Remove 15 files that were used during development and paper review but are not part of the reproducibility pipeline. All unique configuration they contained is already captured in the Hydra configs and compute scripts. Removed: debug utilities (_check_pgd_diffs, check_env, check_pkl), HTML comparison generators (compare_figures, compare_pgd_v2_vs_v25, compare_tables, generate_diff_report), one-off recomputation scripts (recompute_training_pgd, slurm_recompute_*), merge_v2_results, rerun_notes.md, and the generated rebuttal_vs_camera_ready_diff.html.

* Restore grakel_wl_mmd function with lazy import

  grakel is incompatible with numpy>=2, but the reference code should remain accessible. Restore the function with a lazy import inside the body and re-reference it in the skipped test_measure_runtime test.
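The lazy-import pattern used for grakel_wl_mmd looks roughly like this; the function body below is a placeholder (the real computation lives in the repository), but it shows why the import sits inside the function:

```python
def grakel_wl_mmd(graphs_a, graphs_b):
    """Reference WL-kernel MMD via grakel (kept for documentation).

    grakel is incompatible with numpy>=2, so it is imported lazily:
    merely importing this module never pulls in grakel, and callers
    without it get a clear ImportError only when they actually call
    the function. (Placeholder body; see the repository for the full
    computation.)
    """
    import grakel  # noqa: F401  -- lazy import, fails only when called

    raise NotImplementedError("see the repository for the full computation")
```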
* Speed up test suite and fix parallel test issues
  - Mark slow tests (snippets, demo, TabPFN, bootstrap, graph_tool, standard PGD) so --skip-slow skips them by default
  - Add xdist_group markers to prevent dataset cache races and graph_tool concurrency issues under parallel execution
  - Add a test-all pixi task for running the full suite including slow tests
  - Fix MockDescriptorKernel missing the kernel_diag abstract method
  - Reduce molecule SMILES lists to 10 (sufficient for smoke tests)
  - Switch test output from -sv to -v --tb=short for cleaner parallel output
  - Use --dist loadgroup to respect xdist_group markers

* Fix ruff and pyright CI failures
  - Remove unused unattr_ref variable in test_gin_metrics.py
  - Suppress pyright reportOptionalSubscript for csr_array.shape[0]

* Fix ruff format: remove extra blank line in test_mmd.py

* Add pyright to pre-commit hooks and fix whitespace issues

  Add pyright as a dev dependency and local pre-commit hook so the pre-commit workflow mirrors CI (ruff check, ruff format, pyright). Fix trailing whitespace and missing EOF newlines caught by hooks.

* Remove redundant reproducibility pixi tasks

  The reproducibility workflow is fully covered by the Makefile in reproducibility/. Keep pixi tasks for the dev workflow only (test, docs).

* Expand pyright to cover tests/ and reproducibility/
  - Add __iter__ to NetworkXView so list() works on dataset views
  - Cast np.quantile/mean/std to float in MetricInterval.from_samples
  - Replace the BinomConfidenceInterval namedtuple with a typed class
  - Fix Literal type mismatches for split/variant params at call sites
  - Add type narrowing assertions in tests for Optional attributes
  - Fix matplotlib private imports (use ticker/colors modules directly)
  - Exclude third-party test implementations (ggm/gran) from pyright
  - Fix the pre-commit pyright hook to use pixi run

* Install dev dependencies in pyright CI workflow

  Pyright now checks tests/ and reproducibility/, which import pytest and other dev dependencies.
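The xdist_group mechanism above relies on pytest-xdist's `--dist loadgroup` mode, which schedules all tests sharing a group name onto the same worker. A sketch with hypothetical test names (the group names match those implied by the commit message):

```python
import pytest


# With `pytest -n auto --dist loadgroup`, every test in the
# "dataset_cache" group runs on the same worker, serializing access
# to the shared dataset cache and avoiding races.
@pytest.mark.xdist_group(name="dataset_cache")
def test_reference_dataset_download():
    ...


# Slow tests are additionally marked so --skip-slow excludes them;
# graph_tool is not thread/process-safe, hence its own group.
@pytest.mark.slow
@pytest.mark.xdist_group(name="graph_tool")
def test_graph_tool_isomorphism():
    ...
```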
* Remove --paper-dir CLI option from all reproducibility scripts

  This option copied generated outputs to an external paper directory, which is no longer needed.

* Remove --results-suffix and --mmd-only/--pgd-only CLI options

  These options are no longer needed. Output files use fixed paths directly instead of being parameterized through a suffix.

* Remove results_suffix from compute scripts and Hydra configs

  The suffix was always empty. Compute scripts now use fixed result directory names that match the hardcoded paths in plot/format scripts. The 09_train_test_reference script now embeds the tabpfn weights version directly in the directory name.

* Make default classifier explicit via default_classifier() factory

  The classifier parameter previously defaulted to None, with the actual TabPFN instantiation hidden deep inside _descriptions_to_classifier_metric. Now the None sentinel is resolved immediately at the top of the function, and all docstrings document that the default is TabPFN via default_classifier().

* Resolve classifier=None at init, not deep in the call chain

  Each class now resolves None to default_classifier() in its __init__, so _classifier is always a concrete ClassifierProtocol. The internal _descriptions_to_classifier_metric now requires a classifier (keyword-only) and never sees None.

* Add docstring to _json_default serializer helper

* Simplify io.py: drop maybe_append_reproducibility_jsonl alias

  Kept only maybe_append_jsonl as the single function. Added docstrings to all public functions. Updated all 13 import sites.

* Remove dead compute scripts
  - 01_subsampling/compute.py: duplicated compute_pgd.py, unused by the Makefile and submit scripts
  - 09_train_test_reference/: experiment never integrated into the reproduction pipeline (no Makefile target, no plot/format scripts, no results)

* Remove dead 01_subsampling/compute.py, restore 09_train_test_reference

  compute.py in 01_subsampling duplicated compute_pgd.py and was unused.
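The resolve-None-at-init pattern can be sketched as follows; `PGDIntervalSketch` is an illustrative stand-in (the library's classes construct a real TabPFN model in `default_classifier()`, not a bare object):

```python
def default_classifier():
    """Factory for the default PGD classifier (TabPFN in the library).

    Making the default a named factory, called in one place, keeps
    the choice explicit instead of hiding instantiation deep in the
    call chain. A stand-in object is returned here for illustration.
    """
    return object()  # stand-in for the real TabPFN classifier


class PGDIntervalSketch:
    """Illustrative class: resolve the None sentinel once, in __init__."""

    def __init__(self, classifier=None):
        # _classifier is always a concrete object from here on, so
        # downstream code never has to handle None.
        self._classifier = (
            classifier if classifier is not None else default_classifier()
        )
```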
  09_train_test_reference is kept and will be integrated into the Makefile.

* Integrate 09_train_test_reference into Makefile

  Added to compute-tables, submit-tables, and as the standalone target 09.

* Remove cluster-specific SLURM configs and hardcoded paths
  - Removed 6 cluster-specific launcher variants (slurm_cpu_hpcl94c, slurm_cpu_large, slurm_cpu_small, slurm_gpu_fallback, slurm_gpu_h100, slurm_gpu_hpcl93). Only generic slurm_cpu and slurm_gpu remain, with placeholder partitions.
  - Replaced hardcoded absolute paths in submit scripts with git rev-parse --show-toplevel.
  - Replaced cluster-specific partition names with TODO placeholders.
  - Cleaned up docstring references to removed launchers.

* Remove review-camera-ready.md

* Fix critical, high, and medium review issues in core library

  kernel_lr.py:
  - C1: Use K.shape[0] for alpha_init instead of len(X)
  - C4: Use np.logaddexp(0, -yf) for numerical stability
  - H1: Merge objective/gradient to eliminate a redundant K @ alpha
  - H2: Avoid double featurization when X2 is None
  - L1: Remove unused random_state and project_dim parameters

  vun.py:
  - C2: Replace the signal.SIGALRM timeout with a ThreadPoolExecutor (works in multiprocessing workers and on Windows)
  - H3: Add edges="links" to nx.node_link_data/graph calls
  - M7: Remove section separator comments

  generic_descriptors.py:
  - M2: Replace .get() defensive defaults with direct access
  - M3: Remove bare except Exception in PyramidMatchDescriptor

  io.py:
  - M1: Replace hasattr with isinstance(obj, np.generic)

* Fix remaining review items: performance, style, cleanup

  polygraphdiscrepancy.py:
  - P3: Vectorize the _is_constant sparse check (col min/max vs row loop)

  Reproducibility scripts:
  - L10: Remove duplicate runtime fields (keep *_perf_seconds only)
  - L11: Remove pointless _fmt_pgs/_best_two aliases in format scripts
  - M7: Remove section separator comments across all scripts

* Deduplicate TabPFN factory and VUN logic, add sparse eigenvalue path
  - H6: 05_benchmark/compute_vun.py now imports
    compute_vun_parallel from utils.vun instead of duplicating ~120 lines of VUN logic
  - L5: Extracted make_tabpfn_classifier to utils/data.py, replacing 6 local copies across compute scripts
  - P4: EigenvalueHistogram uses scipy.sparse.linalg.eigsh for graphs with >500 nodes, avoiding dense conversion of large Laplacians

* Deduplicate utilities and add kernel size guard
  - L4: load_graphs/get_reference_dataset in 01-03 now delegate to utils/data.py instead of local copies
  - L6: load_results extracted to utils/formatting.py, removed from 4 format scripts
  - P1: Warn when the kernel matrix exceeds 10k samples in kernel_lr.py

* Add download hash verification and restore test thresholds
  - H4: download_data.py now verifies the SHA-256 hash after download, before extraction. Uses a placeholder hash with a TODO for now.
  - M8: Restore test thresholds from 0.5 to 0.7 in test_polygraphdiscrepancy.py. The test distributions (ER 0.8 vs ER 0.1) are clearly distinct; 0.5 was essentially random chance.

* Clean up dead code, deduplicate constants, and fix style issues

  Remove unused utilities (to_list, mol2smiles, BOND_STEREO_TYPES, MetricInterval.__getitem__), extract shared constants (_DEFAULT_RBF_BANDWIDTHS, _molecule_descriptors, _standard_descriptors), fix an f-string bug in polygraphdiscrepancy, use NamedTuple classes over namedtuple calls, modernize super() calls, replace assert False with proper exceptions, use the sqeuclidean metric directly, and move tqdm to core dependencies.

* Split TabPFN tests into slow variants with larger sample sizes

  Separate logistic and TabPFN classifier tests so TabPFN variants use 256 samples (up from 128) for stability, and mark them @pytest.mark.slow instead of using request.applymarker at runtime.

* Fix MoleculePGDInterval test: reduce subsample_size to fit test data

  The test only has 10 molecules, but subsample_size=8 requires at least 16 reference molecules (2 * subsample_size). Reduce to 4.

* Fix ruff formatting
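The H4 download verification follows the standard chunked-hashing pattern; the helper name below is illustrative, but the check-before-extract flow is as described in the commit message:

```python
import hashlib


def verify_sha256(path, expected_hex, chunk_size=1 << 20):
    """Check a downloaded archive's SHA-256 before extracting it.

    Sketch of the H4 fix: the archive is ~3GB, so it is hashed in
    1MiB chunks rather than read into memory at once, and extraction
    is refused on a mismatch.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    if digest.hexdigest() != expected_hex:
        raise ValueError(
            f"SHA-256 mismatch for {path}: got {digest.hexdigest()}, "
            f"expected {expected_hex}"
        )
```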
1 parent 7a60b5c · commit 1b5026b

99 files changed

Lines changed: 15532 additions & 621 deletions


.github/workflows/pyright.yaml

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@ jobs:
       - name: Upgrade pip and install project (editable)
         run: |
           python -m pip install --upgrade pip
-          pip install -e .
+          pip install -e ".[dev]"

       - name: Install Pyright
         run: |

.gitignore

Lines changed: 9 additions & 1 deletion
@@ -9,6 +9,14 @@ experiments/*/figures/*.png
 experiments/*/results/*.csv
 experiments/*/tables/*.tex
 experiments/model_benchmark/benchmark_results.tex
+multirun/
+
+# Reproducibility generated outputs
+reproducibility/figures/**/*.pdf
+reproducibility/figures/**/results/
+reproducibility/figures/**/results_*/
+reproducibility/tables/results/
+reproducibility/figure_comparison/

 # Byte-compiled / optimized / DLL files
 __pycache__/
@@ -793,4 +801,4 @@ TSWLatexianTemp*
 *.vtc

 # glossaries
-*.glstex
+*.glstex

.pre-commit-config.yaml

Lines changed: 9 additions & 0 deletions
@@ -25,3 +25,12 @@ repos:
       - id: ruff-format
         types_or: [ python, pyi ]
         args: [--config=pyproject.toml ]
+
+  - repo: local
+    hooks:
+      - id: pyright
+        name: Pyright Type Check
+        entry: pixi run pyright --project pyproject.toml
+        language: system
+        types: [python]
+        pass_filenames: false

.readthedocs.yaml

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@ build:
   jobs:
     build:
       html:
-        - python -m mkdocs build --clean --site-dir $READTHEDOCS_OUTPUT/html --config-file mkdocs.yml
+        - python -m mkdocs build --clean --site-dir $READTHEDOCS_OUTPUT/html --config-file mkdocs.yml
         - python -m mkdocs build --clean --site-dir $READTHEDOCS_OUTPUT/html --config-file mkdocs.yml # Execute twice to make links to /images work
         - ls -la $READTHEDOCS_OUTPUT/html

.vscode/settings.json

Lines changed: 0 additions & 7 deletions
This file was deleted.

LICENSE

Lines changed: 1 addition & 1 deletion
@@ -25,4 +25,4 @@ DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
 SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
 CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
 OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
-OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

README.md

Lines changed: 168 additions & 2 deletions
@@ -243,19 +243,185 @@ The following results mirror the tables from [our paper](https://arxiv.org/abs/2

 <sub>* AutoGraph* denotes a variant that leverages additional training heuristics as described in the [paper](https://arxiv.org/abs/2510.06122).</sub>

+## Reproducibility
+
+The [`reproducibility/`](reproducibility/) directory contains scripts to reproduce all tables and figures from the paper.
+
+### Quick Start
+
+```bash
+# 1. Install dependencies
+pixi install
+
+# 2. Download the graph data (~3GB)
+cd reproducibility
+python download_data.py
+
+# 3. Generate all tables and figures
+make all
+```
+
+### Data Download
+
+The generated graph data (~3GB) is hosted on [Proton Drive](https://drive.proton.me/urls/VM4NWYBQD0#3sqmZtmSgWTB). After downloading, extract to `data/polygraph_graphs/` in the repository root.
+
+```bash
+# Full dataset (required for complete reproducibility)
+python download_data.py
+
+# Small subset for testing/CI (~50 graphs per model)
+python download_data.py --subset
+```
+
+Expected data structure after extraction:
+
+```
+data/polygraph_graphs/
+├── AUTOGRAPH/
+│   ├── planar.pkl
+│   ├── lobster.pkl
+│   ├── sbm.pkl
+│   └── proteins.pkl
+├── DIGRESS/
+│   ├── planar.pkl
+│   ├── lobster.pkl
+│   ├── sbm.pkl
+│   ├── proteins.pkl
+│   ├── denoising-iterations/
+│   │   └── {15,30,45,60,75,90}_steps.pkl
+│   └── training-iterations/
+│       └── {119,209,...,3479}_steps.pkl
+├── ESGG/
+│   └── *.pkl
+├── GRAN/
+│   └── *.pkl
+└── molecule_eval/
+    └── *.smiles
+```
+
+### Scripts Overview
+
+#### Table Generation
+
+| Script | Output | Description |
+|--------|--------|-------------|
+| `generate_benchmark_tables.py` | `tables/benchmark_results.tex` | Main PGD benchmark (Table 1) comparing AUTOGRAPH, DiGress, GRAN, ESGG |
+| `generate_mmd_tables.py` | `tables/mmd_gtv.tex`, `tables/mmd_rbf_biased.tex` | MMD² metrics with GTV and RBF kernels |
+| `generate_gklr_tables.py` | `tables/gklr.tex` | PGD with Kernel Logistic Regression using WL and SP kernels |
+| `generate_concatenation_tables.py` | `tables/concatenation.tex` | Ablation comparing individual vs concatenated descriptors |
+
+#### Figure Generation
+
+| Script | Output | Description |
+|--------|--------|-------------|
+| `generate_subsampling_figures.py` | `figures/subsampling/` | Bias-variance tradeoff as function of sample size |
+| `generate_perturbation_figures.py` | `figures/perturbation/` | Metric sensitivity to edge perturbations |
+| `generate_model_quality_figures.py` | `figures/model_quality/` | PGD vs training/denoising steps for DiGress |
+| `generate_phase_plot.py` | `figures/phase_plot/` | Training dynamics showing PGD vs VUN |
+
+Each script can be run independently with `--subset` for quick testing:
+
+```bash
+# Tables (full computation)
+python generate_benchmark_tables.py
+python generate_mmd_tables.py
+python generate_gklr_tables.py
+python generate_concatenation_tables.py
+
+# Tables (quick testing with --subset)
+python generate_benchmark_tables.py --subset
+python generate_mmd_tables.py --subset
+
+# Figures (full computation)
+python generate_subsampling_figures.py
+python generate_perturbation_figures.py
+python generate_model_quality_figures.py
+python generate_phase_plot.py
+
+# Figures (quick testing)
+python generate_subsampling_figures.py --subset
+python generate_perturbation_figures.py --subset
+```
+
+### Make Targets
+
+```bash
+make download         # Download full dataset (manual step required)
+make download-subset  # Create small subset for CI testing
+make tables           # Generate all LaTeX tables
+make figures          # Generate all figures
+make all              # Generate everything
+make tables-submit    # Submit table jobs to SLURM cluster
+make tables-collect   # Collect results from completed SLURM jobs
+make clean            # Remove generated outputs
+make help             # Show available targets
+```
+
+### Hardware Requirements
+
+- **Memory:** 16GB RAM recommended for full dataset
+- **Storage:** ~4GB for data + outputs
+- **Time:** Full generation takes ~2-4 hours on a modern CPU
+
+The `--subset` flag uses ~50 graphs per model, runs in minutes, and verifies code correctness (results are not publication-quality).
+
+### Cluster Submission
+
+Table generation scripts support SLURM cluster submission via [submitit](https://github.com/facebookincubator/submitit). Install the cluster extras first:
+
+```bash
+pip install -e ".[cluster]"
+```
+
+SLURM parameters are configured in YAML files (see `reproducibility/configs/slurm_default.yaml`):
+
+```yaml
+slurm:
+  partition: "cpu"
+  timeout_min: 360
+  cpus_per_task: 8
+  mem_gb: 32
+```
+
+Submit jobs, then collect results after completion:
+
+```bash
+cd reproducibility
+
+# Submit all table jobs to SLURM
+python generate_benchmark_tables.py --slurm-config configs/slurm_default.yaml
+
+# After jobs complete, collect results and generate tables
+python generate_benchmark_tables.py --collect
+
+# Or use Make targets
+make tables-submit                                       # submit all
+make tables-submit SLURM_CONFIG=configs/my_cluster.yaml  # custom config
+make tables-collect                                      # collect all
+```
+
+Use `--local` with `--slurm-config` to test the submission pipeline in-process without SLURM.
+
+### Troubleshooting
+
+**Memory issues:** Use the `--subset` flag for testing, process one dataset at a time, or increase system swap space.
+
+**Missing data:** Verify `data/polygraph_graphs/` exists in the repo root, run `python download_data.py` to check data status, or download manually from Proton Drive.
+
+**TabPFN issues:** TabPFN is pinned to v2.0.0 for reproducibility: `pip install tabpfn==2.0.0`.

 ## Citing

 To cite our paper:

 ```latex
 @misc{krimmel2025polygraph,
-      title={PolyGraph Discrepancy: a classifier-based metric for graph generation},
+      title={PolyGraph Discrepancy: a classifier-based metric for graph generation},
       author={Markus Krimmel and Philip Hartout and Karsten Borgwardt and Dexiong Chen},
       year={2025},
       eprint={2510.06122},
       archivePrefix={arXiv},
       primaryClass={cs.LG},
-      url={https://arxiv.org/abs/2510.06122},
+      url={https://arxiv.org/abs/2510.06122},
 }
 ```

environment.yml

Lines changed: 0 additions & 1 deletion
@@ -7,4 +7,3 @@ dependencies:
   - graph-tool
   - pip:
     - -e .[dev]
-

logo/logo.tex

Lines changed: 0 additions & 2 deletions
@@ -538,5 +538,3 @@
 \node[anchor=base west, text=edgeBColor, inner sep=0pt, font=\LogoTextFont] at ($(benchbbox.east |- 0,\TextBaselineY) + (\BenchTextGapLen,0)$) {enchmark};
 \end{tikzpicture}
 \end{document}
-
-

logo/logo_full.tex

Lines changed: 0 additions & 2 deletions
@@ -571,5 +571,3 @@
 \node[anchor=base west, text=edgeBColor, inner sep=0pt, font=\LogoTextFont] at ($(benchbbox.east |- 0,\TextBaselineY) + (\BenchTextGapLen,0)$) {enchmark};
 \end{tikzpicture}
 \end{document}
-
-
