[CELEBORN-2310] Reject RESERVE_SLOTS when disks are full#3666
[CELEBORN-2310] Reject RESERVE_SLOTS when disks are full#3666saurabhd336 wants to merge 7 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
This PR updates the worker’s local-disk “availability” checks to treat disks with actualUsableSpace <= 0 as unusable, so RESERVE_SLOTS can be rejected earlier during disk-full scenarios and avoid wasted push/write network I/O.
Changes:
- Refine
healthyWorkingDirs()to exclude disks that areHEALTHYbut haveactualUsableSpace <= 0. - Refine
createDiskFile()directory selection to avoid using a suggested mount point when its disk has no usable space.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
Review: [CELEBORN-2310] Reject RESERVE_SLOTS when disks are full PR: #3666 | Author: saurabhd336 | Change: +6 / -2 in StorageManager.scala
Adds an actualUsableSpace > 0 check in two places so that workers reject RESERVE_SLOTS requests upfront when local disks are full, instead of accepting the reservation and only triggering a HARD_SPLIT after wasted network I/O on write.
|
|
@saurabhd336, thanks for contribution. Please address above comments of Copilot and Claude Code. |
|
@SteNicholas Addressed comments. PTAL! |
|
This PR is stale because it has been open 20 days with no activity. Remove stale label or comment or this will be closed in 10 days. |
|
@saurabhd336, could you please rebase the latest main branch to resolve the conflicts? |
…llReserveSlotsRejection
|
@SteNicholas Thanks for helping review. Rebased |
What changes were proposed in this pull request?
Disk full only lead to HARD_SPLITs as a response to writes. However, doesn't lead to reserve slot rejections. This means too many write retries (due to HARD_SPLITs on each write attempt) leads to wasted network I/O. We can reject RESERVE_SLOT during disk full to avoid the wasted data write network IO.
Why are the changes needed?
Reject reserve slots during disk full, avoid unnecessary network IO.
Does this PR resolve a correctness bug?
No.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Added UTs, CI.