
[PERF] Pytest Discovery - highly parameterized tests lead to inefficient test node creation in vscode_pytest #25948

@tboddyspargo

I'm raising this as a new issue, since #25348 didn't address the majority of the performance concern and I dug deeper to identify a variety of specific action items that I'm confident will help. Thanks to @eleanorjboyd for working with me on this in the past!


Unfortunately, #25658 doesn't appear to have made much improvement for my suite of tests. Possibly a little bit, but not the order of magnitude improvement I was expecting to see.

Note that an effective reproduction will require the presence of highly parameterized test functions (e.g. 10,000+ parameterized cases per test, 100,000+ total test cases in the suite).

Collecting at the command line (without vscode_pytest):

328987 tests collected in 10.43s

Collecting using VSCode Test Explorer (tested with ms-python.python versions 2026.0.0 and 2026.1.2026010901):

328987 tests collected in 66.07s (0:01:06)

Is there any logging/tracing within the build_test_tree logic that I can turn on to help you investigate the sources of delay that aren't scaling well to hundreds of thousands of tests?

I can see from #25658 that you suspected the use of list and duplicate checks to be a significant contributor, so I'll point out that there are other such uses ("children" key) that may also require attention. E.g.

    if test_node not in function_test_node["children"]:
        function_test_node["children"].append(test_node)
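
To illustrate why the `not in` check above scales poorly: membership testing on a list compares the candidate against every existing element, so appending N children this way costs on the order of N^2 comparisons. Here is a minimal sketch of one alternative, which tracks already-seen IDs in a set alongside the list. The helper name `append_unique` and the `id_` key are assumptions for illustration, not the actual vscode_pytest code.

```python
from typing import Any


def append_unique(
    children: list[dict[str, Any]],
    seen_ids: set[str],
    node: dict[str, Any],
) -> bool:
    """Append node to children unless its ID was already seen.

    The set lookup is O(1) on average, whereas `node not in children`
    scans the whole list.
    """
    node_id = node["id_"]  # assumed key name, for illustration only
    if node_id in seen_ids:
        return False
    seen_ids.add(node_id)
    children.append(node)
    return True


children: list[dict[str, Any]] = []
seen: set[str] = set()
for i in range(5):
    append_unique(children, seen, {"id_": f"test_mod.py::test_fn[{i}]"})

# A repeated ID is rejected without scanning the list.
was_added = append_unique(children, seen, {"id_": "test_mod.py::test_fn[0]"})
```

If duplicates cannot occur for parameterized cases in the first place, dropping the check entirely would be cheaper still.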


I did some deeper testing with vscode_pytest locally and found that the key inefficiencies all seem to stem from either A) avoiding duplicates in a list (as you suspected) or B) performing redundant computations for every parameterized item of a single function. Here's what I would suggest (prioritized by the impact on runtime performance):

Key Performance Issues:

  1. Continue avoiding inefficient deduplication any time a list is used, esp. for the test cases of a parameterized function (just remove the first duplicate check in process_parameterized_test?).
  2. create_test_node redundantly extracts the line number from every parameterized instance of a test function
    • This could be cached by the parent function id rather than the parameterized item ID
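
A rough sketch of the caching idea in item 2, assuming the parent function's ID can be derived by stripping the trailing `[param-id]` from the item ID. The names `slow_lineno_lookup`, `parent_function_id`, and `cached_lineno` are hypothetical, and the lookup body is a placeholder rather than the real extraction logic in create_test_node.

```python
from functools import lru_cache


def slow_lineno_lookup(function_id: str) -> int:
    """Hypothetical stand-in for the per-item line number extraction."""
    return 10  # placeholder value, not a real line number


def parent_function_id(item_id: str) -> str:
    """Strip the parameter suffix, e.g. 'a.py::test_x[1]' -> 'a.py::test_x'.

    Assumes no '[' appears in the path or function name.
    """
    bracket = item_id.find("[")
    return item_id if bracket == -1 else item_id[:bracket]


@lru_cache(maxsize=None)
def cached_lineno(function_id: str) -> int:
    """Resolve the line number once per function, not once per item."""
    return slow_lineno_lookup(function_id)


item_ids = [f"tests/test_math.py::test_add[{i}]" for i in range(10_000)]
linenos = [cached_lineno(parent_function_id(item_id)) for item_id in item_ids]

# cached_lineno.cache_info() reports how often the slow path actually ran.
info = cached_lineno.cache_info()
```

With 10,000 parameterized cases of one function, the slow lookup runs once and the rest are cache hits.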

Performance-improving Refactors:
  3. Although this will be more effort, I think that the payload size could be drastically reduced if you changed the JSON schema sent back to the TEST_RUN_PIPE to avoid duplicating the absolute path so many times.
    • If you store the root package folder once and allow all test paths/IDs to remain relative, it would save both computation time and payload size (memory and copy/transfer speed).
    • If you store path, class+function name, and parameterized ID separately, then many fields of the test nodes themselves will be much smaller.
  4. It might help to use cached_fspath wherever os.fspath is currently called, so that repeated conversions of the same path are served from the cache.
  5. I would suggest extracting and consolidating the parent/module/file node creation logic that's duplicated across multiple branches of build_test_tree and process_parameterized_test.
    • The way I see it, the first two conditional branches of build_test_tree should result in the identification of a top-level function/class node (and the creation of all its children class/test nodes). Then, as a separate logical section within the for loop of build_test_tree, the file-level node can be added to file_nodes_dict (if not present) and the top-level item added to its children.
    • Once you do that, I think some additional short-circuiting opportunities will become obvious, since many functions will share the same parent module/file.
  6. Generally speaking, I think there may be some unnecessary extraction/computation/conversion of file paths (and possibly some IO cost that could be avoided).
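
To make the schema suggestion in item 3 concrete, here is a simplified comparison of two payload shapes. Both shapes, the field names (`root`, `path`, `function`, `param`), and the example paths are hypothetical and much smaller than the real vscode_pytest payload. The point is only that storing the root once and keeping the rest relative shrinks every node.

```python
import json
import posixpath

ROOT = "/home/user/project"      # hypothetical project root
REL_PATH = "tests/test_math.py"  # hypothetical test file


def verbose_payload(n: int) -> list[dict[str, str]]:
    """Each node repeats the absolute path in both 'path' and 'id_'."""
    abs_path = posixpath.join(ROOT, REL_PATH)
    return [
        {
            "name": f"test_add[{i}]",
            "path": abs_path,
            "id_": f"{abs_path}::test_add[{i}]",
        }
        for i in range(n)
    ]


def compact_payload(n: int) -> dict:
    """Root stored once; path, function name, and param ID kept separate."""
    return {
        "root": ROOT,
        "tests": [
            {"path": REL_PATH, "function": "test_add", "param": str(i)}
            for i in range(n)
        ],
    }


def rebuild_id(root: str, test: dict[str, str]) -> str:
    """Reconstruct the absolute test ID on the receiving side if needed."""
    abs_path = posixpath.join(root, test["path"])
    return f"{abs_path}::{test['function']}[{test['param']}]"


verbose = verbose_payload(1000)
compact = compact_payload(1000)
verbose_size = len(json.dumps(verbose))
compact_size = len(json.dumps(compact))
```

Nothing is lost in the compact form, since `rebuild_id` recovers the full ID from the shared root plus the relative fields.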
