
fix: avoid inactive backend probing in tests #592

Open
voltjia wants to merge 1 commit into master from fix/issue-75-cuda-platform-probing

Conversation

@voltjia (Collaborator) commented May 7, 2026

Summary

  • Updates tests/conftest.py so skip checks use concrete platform selectors from --devices when they are provided.
  • Falls back to the active torch device selector, such as cuda, instead of probing every CUDA-like platform backend from Python.
  • Keeps the fix Python-only; this PR intentionally does not change .ci configuration.
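A minimal sketch of the revised selection logic described above (the helper name `resolve_device_selectors` is hypothetical; the actual code in `tests/conftest.py` may be shaped differently):

```python
def resolve_device_selectors(devices_option, torch_device_type):
    """Pick concrete platform selectors from `--devices` when provided;
    otherwise fall back to the active torch device selector itself."""
    if devices_option:
        # CI passed concrete platforms, e.g. ["nvidia"]; use them as-is.
        return list(devices_option)

    # Pass the torch device name (e.g. "cuda") straight through so the
    # pybind layer resolves the backend compiled into the current wheel.
    return [torch_device_type]
```

The key point is that no branch fans a torch device type out into multiple sibling platform names.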

Motivation

Closes #75

The regression is caused by `skip_op_without_platform_impl` mapping one torch device type to every platform sharing that type. For example, `cuda` was expanded to NVIDIA, MetaX, and Iluvatar, so a test running on one active CUDA-like backend could still call `active_implementation_indices()` for inactive sibling backends and abort in backend dispatch before pytest could skip the case.
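Schematically, the difference between the old and new behavior looks like this (the platform-name mapping and helper names are illustrative, not the exact source):

```python
# Illustrative fan-out table; the real platform names come from the build.
CUDA_LIKE_PLATFORMS = {"cuda": ["nvidia", "metax", "iluvatar"]}

def buggy_platforms_for(device_type):
    # Bug: one torch device type fans out to every sibling platform, so an
    # inactive backend gets probed and aborts before pytest can skip.
    return CUDA_LIKE_PLATFORMS.get(device_type, [device_type])

def fixed_platforms_for(device_type):
    # Fix: only the selector actually in play is consulted; no fan-out.
    return [device_type]
```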

Type of Change

  • feat — new feature / new operator / new platform
  • fix — bug fix
  • perf — performance improvement (no behavioral change)
  • refactor — code restructuring without behavior change
  • test — adding or fixing tests only
  • docs — documentation only
  • build / ci — build system or CI configuration
  • chore — tooling, formatting, or other non-code changes
  • Breaking change (requires a ! in the Conventional Commits prefix or a BREAKING CHANGE: footer)

Platforms Affected

  • CPU (WITH_CPU)
  • NVIDIA (WITH_NVIDIA)
  • Iluvatar (WITH_ILUVATAR)
  • MetaX (WITH_METAX)
  • Cambricon (WITH_CAMBRICON)
  • Moore (WITH_MOORE)
  • Ascend (WITH_ASCEND)
  • PyTorch C++ bindings (WITH_TORCH)
  • Build system / CMake / CI
  • Python bindings / user-facing API

Test Results on Supported Platforms

| Platform | Built | pytest Result | Notes / Hardware |
| --- | --- | --- | --- |
| NVIDIA | Yes | 4151 passed, 1375 skipped in 280.52s (0:04:40) | Direct container run with `python -m pip install .[dev] && python -m pytest`; no #75 crash keywords. |
| Iluvatar | Yes | 5795 passed, 1447 skipped in 274.80s (0:04:34) | Direct container run on GPU 6 with `python -m pip install .[dev] --no-build-isolation && python -m pytest`; no #75 crash keywords. |
| MetaX | Yes | 5795 passed, 1447 skipped in 341.72s (0:05:41) | Direct container run with `python -m pip install .[dev] --no-build-isolation && python -m pytest`; no #75 crash keywords. |
| Cambricon | Yes | 12 failed, 3061 passed, 3857 skipped in 897.49s (0:14:57) | Failures are `tests/test_add.py` int16 generation failures from `RuntimeError: "random_" not implemented for 'Short'`; no #75 crash keywords. |
| Moore | Yes | 300 failed, 5459 passed, 1483 skipped in 516.71s (0:08:36) | Direct container run on GPU 6 with `MUSA_VISIBLE_DEVICES=6`, `python -m pip install .[dev] --no-build-isolation && python -m pytest`; failures are concentrated in `tests/test_gemm.py`; no #75 crash keywords. |
| Ascend | Yes | 3828 passed, 138 skipped in 435.82s (0:07:15) | Direct container run with `python -m pip install .[dev] && python -m pytest`; pytest summary completed, wrapper status was `test=137` after summary; no #75 crash keywords. |
Validation details
Local checks:
- git diff --check origin/master..HEAD
- python3 -m py_compile tests/conftest.py
- uvx ruff check tests/conftest.py
- uvx ruff format --check tests/conftest.py

Targeted regression checks:
- NVIDIA, no --devices: tests/test_cast.py::test_cast[cuda-input_dtype0-out_dtype0-0.001-0.001-shape0-None-None] -> 1 skipped in 3.68s
- NVIDIA, --devices nvidia: same test -> 1 skipped in 2.96s
- NVIDIA, CPU Matmul no --devices: tests/test_matmul.py::test_matmul[cpu-dtype0-0.01-0.01-False-False-a_shape0-b_shape0-c_shape0] -> 1 skipped in 3.87s

Additional Moore investigation:
- `MTHREADS_VISIBLE_DEVICES=6` alone does not restrict `torch_musa`; `torch.musa.device_count()` still reports 8 devices.
- `MUSA_VISIBLE_DEVICES=6` restricts `torch_musa` to one visible device and avoids the earlier apparent hang.
- Moore `tests/test_add.py` on GPU 6 with `MUSA_VISIBLE_DEVICES=6`: `324 passed, 108 skipped in 20.96s`.
- Moore full `pytest -n 1` on GPU 6 with `MUSA_VISIBLE_DEVICES=6`: `300 failed, 5459 passed, 1483 skipped in 550.14s (0:09:10)`.
- Moore full `pytest` without `-n` on GPU 6 with `MUSA_VISIBLE_DEVICES=6`: `300 failed, 5459 passed, 1483 skipped in 516.71s (0:08:36)`.
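The working Moore invocation can be sketched as follows (GPU index as reported above; the full install-and-test command is left as a comment since it requires the Moore container):

```shell
# torch_musa honored only MUSA_VISIBLE_DEVICES for device visibility in
# this environment; MTHREADS_VISIBLE_DEVICES alone did not restrict it.
export MUSA_VISIBLE_DEVICES=6

# The subsequent run then sees a single visible device:
#   python -m pip install .[dev] --no-build-isolation && python -m pytest
echo "MUSA_VISIBLE_DEVICES=$MUSA_VISIBLE_DEVICES"
```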

Benchmark / Performance Impact

N/A. This change only affects pytest skip-selection logic and does not alter operator implementations or runtime kernels.

Notes for Reviewers

The important behavior is that Python no longer expands cuda into all CUDA-like platform names before asking each operator for active implementations. If CI passes a concrete platform through --devices, that concrete platform is still used. If CI or a local run uses the torch device name, the selector is passed through to the pybind layer so it can resolve the backend compiled into the current wheel.

Iluvatar and Moore both now complete the full pytest run with a final summary. The earlier Moore hang was traced to the runner setting `MTHREADS_VISIBLE_DEVICES` without `MUSA_VISIBLE_DEVICES`; in this environment, torch_musa honored only `MUSA_VISIBLE_DEVICES` for device visibility.


Checklist

Title, Branch, and Commits

  • PR title follows Conventional Commits (e.g. feat(nvidia): …, fix(cuda/gemm): …).
  • Branch name follows <type>/xxx-yyyy-zzzz where <type> matches the PR title's Conventional Commits type and words are joined with hyphens (see CONTRIBUTING.md §Branches).
  • Each commit message follows Conventional Commits.
  • Small PR is a single squashable commit; or, for a large PR, every commit is meaningful, well-formed, and independently reviewable (see CONTRIBUTING.md §Pull Requests).
  • No stray merge commits from master — the branch is rebased cleanly on top of the current master.
  • No fixup! / squash! / wip commits remain.

Scope and Design

  • Changes are minimal — nothing unrelated to the stated motivation was added (CONTRIBUTING.md §Code/General).
  • No dead code, commented-out blocks, debug prints, printf/std::cout/print(...) left behind, or TODO without an owner and issue link.
  • No unrelated formatting churn that would obscure the diff.
  • N/A — no public API changes.

General Code Hygiene (applies to all languages)

  • The code is self-explanatory; comments were added only where the why is non-obvious (CONTRIBUTING.md §Code/General).
  • Every modified or added file ends with a single trailing newline (CONTRIBUTING.md §Code/General).
  • No trailing whitespace, tab/space mixing, or stray BOMs.
  • Identifiers in comments and error messages are wrapped in backticks (e.g. the `seqlens_k` tensor) (CONTRIBUTING.md §Code/General).
  • All comments and error messages are in English (CONTRIBUTING.md §Code/General).
  • Comments and error messages are complete sentences — capitalized first letter, terminal punctuation — unless the language/framework convention says otherwise (CONTRIBUTING.md §Code/General; §Python).

C++ Specific (if C++ files changed)

N/A — no C++ files changed.

Python Specific (if Python files changed)

  • Code is PEP 8 compliant; ruff check passes cleanly on CI (see .github/workflows/ruff.yml).
  • ruff format --check passes cleanly — if not, run ruff format and commit the result.
  • Comments are complete English sentences, starting with a capital letter and ending with punctuation; Markdown backticks are used for code references (CONTRIBUTING.md §Python).
  • Framework-specific conventions (e.g. lowercase pytest.skip messages without terminal period) are honored where applicable (CONTRIBUTING.md §Python).
  • No blank line between the function signature and the body when there is no docstring or comment (CONTRIBUTING.md §Python).
  • A blank line is present before and after if, for, and similar control-flow statements (CONTRIBUTING.md §Python).
  • A blank line appears before each return, except when it directly follows a control-flow statement (CONTRIBUTING.md §Python).
  • Docstrings (if any) follow PEP 257 (CONTRIBUTING.md §Python).
  • Type hints are added / kept consistent with the surrounding code.

Testing

  • pytest was run locally on every supported platform that this PR can affect, and the results are recorded in the "Test Results" table above (CONTRIBUTING.md §Pull Requests).
  • All supported platforms were tested and recorded in the table.
  • N/A — no new operator functionality was added.
  • Tests use pytest.mark.parametrize correctly: dependent parameters share one decorator (e.g. @pytest.mark.parametrize("dtype, rtol, atol", …)), independent parameters use separate decorators ordered by parameter declaration.
  • N/A — no new Payload-returning tests were added.
  • Default dtype / device parameterization is relied on, or overridden with an explicit pytest.mark.parametrize when necessary.
  • N/A — no new flaky test was added.
  • N/A — existing parametrized tests reproduce the bug on master; this PR fixes the test harness path without adding a new test file.
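For reference, the parametrize convention mentioned above looks like this (dtype, tolerances, and shapes are illustrative only):

```python
import pytest

# Dependent parameters share one decorator; independent parameters get
# separate decorators, ordered by parameter declaration.
@pytest.mark.parametrize("shape", [(2, 3), (4, 4)])
@pytest.mark.parametrize("dtype, rtol, atol", [("float32", 1e-3, 1e-3)])
def test_example(dtype, rtol, atol, shape):
    assert rtol > 0 and atol > 0 and len(shape) == 2
```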

Build, CI, and Tooling

  • The project builds cleanly from a fresh directory with pip install .[dev] on at least one affected platform.
  • N/A — no CMake changes were made, so compile_commands.json regeneration was not required for this PR.
  • N/A — no new backend or device was added.
  • N/A — CUDA-like backend mutual exclusion was not changed.
  • N/A — no C++ files changed; ruff checks passed locally.
  • No new runtime dependency was added without updating pyproject.toml's [project.optional-dependencies] (or justified in the PR description).

Documentation

  • N/A — no user-facing behavior, build flags, or developer workflow were changed.
  • N/A — no new operator, dispatch helper, or public utility was added.
  • N/A — no breaking change.

Security and Safety

  • No secrets, access tokens, internal URLs, customer data, or personal hardware identifiers have been committed.
  • N/A — no third-party code was added.
  • N/A — no pointer arithmetic, memory access, or C++ bounds handling was changed.

Co-authored-by: Jiacheng Huang <huangjiacheng0709@outlook.com>
@voltjia voltjia marked this pull request as ready for review May 7, 2026 16:35
@voltjia voltjia requested a review from a team May 7, 2026 16:35
@voltjia (Collaborator, Author) commented May 7, 2026

@zhangyue207 for initial review, @Ziminli for final review.

@voltjia voltjia requested review from Ziminli and zhangyue207 May 7, 2026 16:36


Development

Successfully merging this pull request may close these issues.

Incorrect pytest error classification and parallel execution issues

2 participants