pyspark-antipattern

A static analysis linter for PySpark — catch performance antipatterns before they reach your cluster.

Written in Rust, installable as a Python package, and designed to run in CI/CD pipelines. +60 rules across 8 categories covering driver actions, shuffle explosions, UDFs, loops, and more.

What it catches

Real antipatterns, caught at commit time:

Code	Rule	Why it matters
`df.collect()`	D001	Pulls all data to driver — OOM risk on large datasets
`for c in cols: df.withColumn(...)`	L003	Each call adds a projection — plan explodes exponentially
`array_distinct(collect_list(x))`	ARR001	Use `collect_set(x)` — one step instead of two
`df.rdd.collect()`	PERF001	Use `.toPandas()` — 10x faster with Arrow enabled
`df.join(other)`	S011	No condition = Cartesian product
`@udf` returning `StringType`	U001	Built-in string functions are orders of magnitude faster

Why this exists

PySpark is easy to misuse. .collect() on a 10 GB DataFrame, .withColumn() called in a loop, UDFs where built-in functions exist — these patterns work fine locally and silently destroy performance at scale. This tool catches them early, at commit time, before they reach your cluster.

Installation

pip install pyspark-antipattern

Usage

Check a single file:

pyspark-antipattern check pipeline.py

Check an entire directory recursively:

pyspark-antipattern check src/

Use a custom config location:

pyspark-antipattern check src/ --config path/to/pyproject.toml

Exit codes

0 — no errors (warnings are allowed)
1 — one or more error-level violations found

CLI output

Default output — violations only:

Each violation line includes a colored severity badge — [HIGH] in red, [MEDIUM] in yellow, [LOW] in green — immediately after the rule ID:

error[D001][HIGH]: Avoid using collect()
  --> pipeline.py:42:10

Filter by your cluster's PySpark version to suppress rules for newer APIs:

pyspark-antipattern check src/ --pyspark-version=3.3  # suppress rules requiring 3.4+

Filter by severity directly from the CLI:

pyspark-antipattern check src/ --severity=high    # only HIGH violations
pyspark-antipattern check src/ --severity=medium  # MEDIUM and HIGH

With show_information = true — inline explanation for each rule:

With show_best_practice = true — best-practice guidance for each rule:

Rules

Full documentation is available at https://skanderboudawara.github.io/pyspark-antipattern/.

Rules are organized by category in the docs/rules/ folder. Each rule has its own markdown file with a full explanation, best-practice guidance, and a severity badge indicating its performance impact.

Category	Folder	Focus
ARR — Array	`docs/rules/arr/`	Array function antipatterns
D — Driver	`docs/rules/driver/`	Actions that pull data to the driver node
F — Format	`docs/rules/format/`	Code style and DataFrame API misuse
L — Looping	`docs/rules/looping/`	DataFrame operations inside loops
P — Pandas	`docs/rules/pandas/`	Pandas interop pitfalls
PERF — Performance	`docs/rules/performance/`	Runtime performance antipatterns
S — Shuffle	`docs/rules/shuffle/`	Joins, partitioning, and data movement
U — UDF	`docs/rules/udf/`	User-defined functions and their alternatives

Each rule carries a severity reflecting its performance impact:

Severity	Meaning
🔴 HIGH	Major performance impact — OOM risk, full scans, shuffle explosion
🟡 MEDIUM	Moderate performance impact — avoidable overhead at scale
🟢 LOW	Minor impact — style, API correctness, small inefficiencies

Configuration

Add a [tool.pyspark-antipattern] section to your project's pyproject.toml:

[tool.pyspark-antipattern]

# Show only these rules — everything else is silenced (default: all active)
# select = ["D001", "S"]

# Cluster PySpark version — silences rules requiring a newer version (default: all)
# pyspark_version = "3.3"     # suppress rules that require PySpark 3.4+

# Downgrade these rules from error to warning (exit code stays 0)
warn = ["F008", "F011"]

# Completely silence these rules — no output, no exit code impact
# Accepts exact rule IDs or single-letter group prefixes
ignore = ["S004"]                # silence one rule
# ignore = ["F"]                 # silence all F rules
# ignore = ["S", "L", "D001"]    # silence all S and L rules

# Only report violations at or above this performance-impact level (default: all)
# severity = "medium"            # show only MEDIUM and HIGH violations
# severity = "high"              # show only HIGH violations

# Show inline explanation for each rule that fired (default: false)
show_information = false

# Show best-practice guidance for each rule that fired (default: false)
show_best_practice = false

# PERF003: fire when more than N shuffle ops occur without a checkpoint (default: 9)
max_shuffle_operations = 9

# S004: flag when the weighted count of .distinct() calls exceeds this (default: 5)
distinct_threshold = 5

# S008: flag when the weighted count of explode() calls exceeds this (default: 3)
explode_threshold = 3

# L001/L002/L003: flag for-loops where range(N) > threshold;
#                 while-loops always assume 99 iterations (default: 10)
loop_threshold = 10

# Directories to skip during recursive scanning (default: common build/venv dirs)
# exclude_dirs = ["my_generated_code", "vendor"]

Suppressing a specific line

Add a # noqa: pap: RULE_ID comment to suppress one or more rules on that line:

result = df.collect()  # noqa: pap: D001
bad_join = df.crossJoin(other)  # noqa: pap: S010, S002

CI/CD integration

GitHub Actions

- name: Lint PySpark code
  run: |
    pip install pyspark-antipattern
    pyspark-antipattern check src/

The job fails automatically if any error-level rule fires. Warnings are reported but do not block the pipeline.

Pre-commit hook

# .pre-commit-config.yaml
repos:
  - repo: local
    hooks:
      - id: pyspark-antipattern
        name: PySpark antipattern linter
        entry: pyspark-antipattern check
        language: system
        types: [python]
        pass_filenames: false
        args: ["src/"]

A word on strictness

This linter will challenge code that your team may have written deliberately and knowingly. That is by design.

Each violation is not a verdict — it is a question: "Did you mean to do this, and do you understand the trade-off?" If the answer is yes, suppress the rule on that line or downgrade it to a warning in your config. If the answer is no, you just avoided a production issue.

The strictest setup is the default: every rule is a hard error. Relax only what you have a documented reason to relax.

Author

Skander Boudawara — skander.education@proton.me

Name		Name	Last commit message	Last commit date
Latest commit History 145 Commits
.githooks		.githooks
.github		.github
docs		docs
img		img
samples		samples
scripts		scripts
src		src
tests		tests
.gitignore		.gitignore
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
LICENSE		LICENSE
README.md		README.md
mkdocs.yml		mkdocs.yml
pyproject.toml		pyproject.toml
rustfmt.toml		rustfmt.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

pyspark-antipattern

What it catches

Why this exists

Installation

Usage

CLI output

Rules

Configuration

Suppressing a specific line

CI/CD integration

GitHub Actions

Pre-commit hook

A word on strictness

Author

About

Uh oh!

Releases 15

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

pyspark-antipattern

What it catches

Why this exists

Installation

Usage

CLI output

Rules

Configuration

Suppressing a specific line

CI/CD integration

GitHub Actions

Pre-commit hook

A word on strictness

Author

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 15

Uh oh!

Contributors

Uh oh!

Languages