
[WIP] Add e2e test for KubeRay NativeWorkloadScheduling#6227

Draft
mboersma wants to merge 15 commits into kubernetes-sigs:main from mboersma:kuberay-native-scheduling-e2e

Conversation

@mboersma
Contributor

What type of PR is this?

/kind feature

What this PR does / why we need it:

Add a new e2e test that exercises the unreleased NativeWorkloadScheduling feature from the kuberay workload-poc branch. This feature uses the Kubernetes-native scheduling.k8s.io/v1alpha2 API (Workload + PodGroup) for gang scheduling of Ray pods.

Changes:

  • New cluster template ci-version-native-scheduling with K8s feature gates GenericWorkload, GangScheduling, and runtime config for scheduling.k8s.io/v1alpha2
  • InstallHelmChartFromPath and InstallKubeRayOperatorFromSource helpers for installing kuberay from a local chart with custom image
  • KubeRayNativeSchedulingSpec test that creates a RayCluster with the opt-in annotation, verifies Workload and PodGroup resources are created, and confirms all pods reach Running state
  • New Ginkgo test case tagged [KubeRay] [NativeScheduling]

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests
  • cherry-pick candidate

Release note:

NONE

@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot added the do-not-merge/work-in-progress, release-note-none, and kind/feature labels Apr 10, 2026
@k8s-ci-robot added the cncf-cla: yes label Apr 10, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign jont828 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the size/XXL label Apr 10, 2026
@mboersma changed the title from "[WIP] Add e2e test for KubeRay NativeWorkloadScheduling on self-managed k8s 1.36+" to "[WIP] Add e2e test for KubeRay NativeWorkloadScheduling" Apr 10, 2026
@codecov

codecov Bot commented Apr 10, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 43.85%. Comparing base (a248a9c) to head (a94382b).
⚠️ Report is 10 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #6227   +/-   ##
=======================================
  Coverage   43.85%   43.85%           
=======================================
  Files         289      289           
  Lines       25341    25341           
=======================================
  Hits        11113    11113           
  Misses      13450    13450           
  Partials      778      778           


… 1.36+

Add a new e2e test that exercises the unreleased NativeWorkloadScheduling
feature from the kuberay workload-poc branch. This feature uses the
Kubernetes-native scheduling.k8s.io/v1alpha2 API (Workload + PodGroup)
for gang scheduling of Ray pods.

Changes:
- New cluster template ci-version-native-scheduling with K8s feature
  gates GenericWorkload, GangScheduling, and runtime config for
  scheduling.k8s.io/v1alpha2
- InstallHelmChartFromPath and InstallKubeRayOperatorFromSource helpers
  for installing kuberay from a local chart with custom image
- KubeRayNativeSchedulingSpec test that creates a RayCluster with the
  opt-in annotation, verifies Workload and PodGroup resources are
  created, and confirms all pods reach Running state
- New Ginkgo test case tagged [KubeRay] [NativeScheduling]

…dential provider dependency

- Add scripts/ci-build-kuberay-operator.sh to clone marosset/kuberay@workload-poc,
  build the operator image, and push it to the local registry
- Source the build script from ci-e2e.sh when GINKGO_FOCUS matches NativeScheduling
- Remove ACR credential provider scripts and kubelet args from the
  ci-version-native-scheduling template (not needed without custom CCM)
- Remove cloud-provider-azure-chart-ci HelmChartProxy (use released CCM)
- Remove CLOUD_PROVIDER_AZURE_LABEL=azure-ci override from the test
- Add _kuberay-source/ to .gitignore

The ci-version-native-scheduling template requires the azure-ci CCM chart
variant with explicit image tags, same as other ci-version flavors. Without
it the released cloud-provider-azure chart fails to install because it
cannot auto-detect images for unreleased K8s versions.

- Restore template to full ci-version parity (ACR credential provider,
  cloud-provider-azure-chart-ci HelmChartProxy)
- Restore CLOUD_PROVIDER_AZURE_LABEL=azure-ci in the test
- Trigger CCM build in ci-e2e.sh when GINKGO_FOCUS matches NativeScheduling
@mboersma force-pushed the kuberay-native-scheduling-e2e branch from b956e97 to a8aa2fc April 10, 2026 20:36
@mboersma
Contributor Author

/test pull-cluster-api-provider-azure-e2e-kuberay

The Prow e2e-kuberay job sets GINKGO_FOCUS=\[KubeRay\], but the build
trigger only checked for NativeScheduling. This caused KUBERAY_SOURCE_DIR
to never be set, failing InstallKubeRayOperatorFromSource.
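
The mismatch described above can be sketched as a small predicate. This is an illustration in Go only: the real check lives in the ci-e2e.sh shell script and may be written differently, and the function name needsKubeRayBuild is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// needsKubeRayBuild reports whether a GINKGO_FOCUS value selects tests that
// need the locally built kuberay operator. Hypothetical helper: it accepts
// either the broad KubeRay tag or the narrower NativeScheduling tag.
func needsKubeRayBuild(focus string) bool {
	return strings.Contains(focus, "KubeRay") ||
		strings.Contains(focus, "NativeScheduling")
}

func main() {
	// The Prow job passes the focus with escaped brackets.
	fmt.Println(needsKubeRayBuild(`\[KubeRay\]`))
	// A narrower focus should also trigger the build.
	fmt.Println(needsKubeRayBuild(`\[NativeScheduling\]`))
	// An unrelated focus should not.
	fmt.Println(needsKubeRayBuild(`\[Conformance\]`))
}
```

Checking only for NativeScheduling would return false for the first input, which matches the failure mode in the commit message.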
@mboersma
Contributor Author

/test pull-cluster-api-provider-azure-e2e-kuberay

The ci-build-kuberay-operator.sh script computed REPO_ROOT as a relative
path (e.g. ./scripts/..), which resulted in KUBERAY_SOURCE_DIR being
exported as a relative path. When Ginkgo runs the test binary, the
working directory differs from the repo root, causing the Helm chart
lookup to fail with 'no such file or directory'.

Fix by resolving REPO_ROOT to an absolute path via pwd after the cd.
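
A minimal Go sketch of why the relative path breaks. It assumes Unix-style paths; the directories /repo and /repo/test/e2e and the helper name resolve are hypothetical, and placing the checkout under _kuberay-source is inferred from the .gitignore entry rather than confirmed.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// resolve returns the location a path refers to when a process with the
// given working directory opens it. Absolute paths ignore the working
// directory; relative paths are joined onto it.
func resolve(workDir, p string) string {
	if filepath.IsAbs(p) {
		return filepath.Clean(p)
	}
	return filepath.Join(workDir, p)
}

func main() {
	rel := "./scripts/../_kuberay-source"

	// Seen from the repo root, the relative path points at the checkout.
	fmt.Println(resolve("/repo", rel))
	// Seen from a different working directory, it points somewhere else.
	fmt.Println(resolve("/repo/test/e2e", rel))
	// An absolute path resolves the same way from anywhere.
	fmt.Println(resolve("/repo/test/e2e", "/repo/_kuberay-source"))
}
```

Exporting the absolute form removes the dependency on the test binary's working directory.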
@mboersma
Contributor Author

/test pull-cluster-api-provider-azure-e2e-kuberay

The kuberay-operator Helm chart includes ~71K lines of CRDs that need
to be applied to the API server. On a small self-managed VM
(Standard_D2s_v3), processing these CRDs plus ACR credential provider
auth and image pull can exceed the 5-minute timeout. The CI run
confirmed an exact 5-minute timeout hit (context deadline exceeded).

Increase to 10 minutes to give sufficient headroom.

When InstallHelmChartFromPath fails, dump:
- Pod status, container states, and restart counts
- Pod events (scheduling, image pull, etc.)
- Pod logs (operator startup errors)
- CRD list (to check if CRDs finished processing)

Also pass --debug to helm install for verbose Helm output
showing what it's waiting on during --wait.
@mboersma
Copy link
Copy Markdown
Contributor Author

/test pull-cluster-api-provider-azure-e2e-kuberay

…values

The kuberay operator Helm chart template uses Go's %t format verb on
featureGates[N].enabled values. When --set-string is used, Helm stores
these as strings ("true") rather than booleans, causing fmt.Sprintf
to produce "%!t(string=true)" which fails strconv.ParseBool.

Using --set passes the values as booleans, matching what the chart expects.
@mboersma
Contributor Author

/test pull-cluster-api-provider-azure-e2e-kuberay

@mboersma
Contributor Author

/test pull-cluster-api-provider-azure-e2e-kuberay

Last run looked like a flake, actually.

@mboersma
Contributor Author

Test passed, and it looks like it's testing the right things. Now I'll add some more specific workload API checks.

@mboersma
Contributor Author

/test pull-cluster-api-provider-azure-e2e-kuberay

Run the test again after adding more specific checks.

@mboersma
Contributor Author

/test pull-cluster-api-provider-azure-e2e-kuberay

Add a parallel negative test that creates a RayCluster WITHOUT the
ray.io/native-workload-scheduling annotation and verifies that:
- No Workload resources are created
- No PodGroup resources are created
- Pods do not have schedulingGroup set

The test runs on its own cluster (vm-nonatsched) in a separate Ginkgo
Context, so it executes in parallel with the positive test.
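
One way the "schedulingGroup not set" check could look, sketched over plain maps with the standard library only. The field path spec.schedulingGroup and the podGroupName key are assumptions, and hasNestedField is a hypothetical helper rather than the PR's code.

```go
package main

import "fmt"

// hasNestedField reports whether obj contains a value at the given path of
// map keys. It returns false as soon as a key is missing or an intermediate
// value is not a map.
func hasNestedField(obj map[string]interface{}, fields ...string) bool {
	var cur interface{} = obj
	for _, f := range fields {
		m, ok := cur.(map[string]interface{})
		if !ok {
			return false
		}
		cur, ok = m[f]
		if !ok {
			return false
		}
	}
	return true
}

func main() {
	// A pod created without the opt-in annotation (assumed shape).
	plainPod := map[string]interface{}{
		"spec": map[string]interface{}{
			"containers": []interface{}{},
		},
	}
	// A pod that was attached to a scheduling group (assumed shape).
	gangPod := map[string]interface{}{
		"spec": map[string]interface{}{
			"schedulingGroup": map[string]interface{}{"podGroupName": "example"},
		},
	}

	fmt.Println(hasNestedField(plainPod, "spec", "schedulingGroup")) // false
	fmt.Println(hasNestedField(gangPod, "spec", "schedulingGroup"))  // true
}
```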
@mboersma
Contributor Author

/test pull-cluster-api-provider-azure-e2e-kuberay

@mboersma
Contributor Author

/test pull-cluster-api-provider-azure-e2e-kuberay

Just to ensure it's still passing after a refactoring to remove duplicated code.

- Unify 4 identical input structs into single KubeRaySpecInput
- Extract waitForRayClusterReady() and waitForRayPodRunning() helpers
- Simplify newRayClusterWithNativeScheduling() to build on newRayClusterUnstructured()
- Move podGVR to package level (was declared locally in 2 functions)
- Simplify Workload/PodGroup lookups from scan to direct Get by name
- Use rayClusterName variable consistently instead of hardcoded strings
- Net reduction of ~141 lines
@mboersma force-pushed the kuberay-native-scheduling-e2e branch from 51e99fb to f16dc8c April 13, 2026 21:27
@mboersma
Contributor Author

/test pull-cluster-api-provider-azure-e2e-kuberay

- Use NestedFieldNoCopy+BeNumerically instead of NestedInt64 for
  minCount assertion (JSON numbers are float64, not int64)
- Strengthen gang scheduling assertion to verify all-or-nothing:
  at most 1 worker Running, >= 19 Pending (not just pendingCount > 0)
- Hardcode apiServer feature-gates consistently with controller-manager
  and scheduler (remove K8S_FEATURE_GATES override that could cause
  component feature gate mismatch)
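
The first two points can be illustrated with the standard library. The sketch shows that encoding/json decodes a number into float64 when the target is interface{}, and how an all-or-nothing check differs from a bare "some pods are pending" check. The helper names, the flat {"minCount": 20} document, and the 20-worker example are assumptions; the thresholds come from the commit message.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// decodeField unmarshals a JSON object and returns the raw value under key.
func decodeField(doc, key string) (interface{}, error) {
	var obj map[string]interface{}
	if err := json.Unmarshal([]byte(doc), &obj); err != nil {
		return nil, err
	}
	return obj[key], nil
}

// gangHeld reports whether the observed worker phases are consistent with a
// gang being held back as a unit: at most maxRunning workers Running and at
// least minPending workers Pending.
func gangHeld(phases []string, maxRunning, minPending int) bool {
	running, pending := 0, 0
	for _, p := range phases {
		switch p {
		case "Running":
			running++
		case "Pending":
			pending++
		}
	}
	return running <= maxRunning && pending >= minPending
}

func main() {
	raw, err := decodeField(`{"minCount": 20}`, "minCount")
	if err != nil {
		panic(err)
	}
	_, isInt64 := raw.(int64)
	asFloat, isFloat64 := raw.(float64)
	fmt.Println(isInt64, isFloat64, asFloat == 20) // false true true

	phases := make([]string, 20)
	for i := range phases {
		phases[i] = "Pending"
	}
	fmt.Println(gangHeld(phases, 1, 19)) // true

	// Two workers running means the gang was partially scheduled.
	phases[0], phases[1] = "Running", "Running"
	fmt.Println(gangHeld(phases, 1, 19)) // false
}
```

A check of only pendingCount > 0 would still pass in the second case, which is why the stricter bounds are a stronger signal of gang behavior.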
@mboersma
Contributor Author

/test pull-cluster-api-provider-azure-e2e-kuberay


Labels

  • cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
  • do-not-merge/work-in-progress - Indicates that a PR should not merge because it is a work in progress.
  • kind/feature - Categorizes issue or PR as related to a new feature.
  • release-note-none - Denotes a PR that doesn't merit a release note.
  • size/XXL - Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

Status: Todo

2 participants