feat: support prechecking down peers before restarting tikv pod #6877

Open
liubog2008 wants to merge 4 commits into pingcap:main from liubog2008:support-detect-peer-down

@liubog2008 liubog2008 commented May 5, 2026

  • support prechecking down peers before restarting tikv pod
  • support waiting until leaders are evicted before restarting tikv pod

liubog2008 added 2 commits May 5, 2026 21:31
Signed-off-by: liubo02 <liubo02@pingcap.com>
Signed-off-by: liubo02 <liubo02@pingcap.com>

ti-chi-bot Bot commented May 5, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign linuxgit for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot Bot requested a review from howardlau1999 May 5, 2026 13:42
@github-actions github-actions Bot added the v2 for operator v2 label May 5, 2026
@ti-chi-bot ti-chi-bot Bot added the size/XXL label May 5, 2026

codecov-commenter commented May 5, 2026

Codecov Report

❌ Patch coverage is 70.75472% with 31 lines in your changes missing coverage. Please review.
✅ Project coverage is 37.61%. Comparing base (2b81667) to head (c7b20b0).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6877      +/-   ##
==========================================
+ Coverage   37.44%   37.61%   +0.17%     
==========================================
  Files         392      392              
  Lines       22432    22483      +51     
==========================================
+ Hits         8399     8458      +59     
+ Misses      14033    14025       -8     
| Flag | Coverage Δ | |
|---|---|---|
| unittest | 37.61% <70.75%> (+0.17%) | ⬆️ |

Signed-off-by: liubo02 <liubo02@pingcap.com>

Copilot AI left a comment

Pull request overview

This PR enhances TiKV pod restart safety by introducing PD-based prechecks (down-peer regions and leader eviction) before allowing TiKV pod recreation, and refactors leader-eviction condition syncing into the eviction task flow.

Changes:

  • Add a PD API client method and types for querying regions with down peers (/pd/api/v1/regions/check/down-peer).
  • Gate TiKV pod recreation on (a) zero non-self down peers and (b) leaders being evicted, and trigger leader-eviction scheduling when needed.
  • Refactor syncing of TiKVCondLeadersEvicted from the status task into the leader-eviction task, with updated/added unit tests.
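The gating described in the second bullet can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the `Peer`, `Region`, and `RegionsCheckInfo` types are simplified hypothetical stand-ins, `canRecreatePod` is an invented helper, and counting regions (rather than individual peers) is an assumption made so the result is comparable with `Count`.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical, simplified stand-ins for the PD response types added in the PR.
type Peer struct{ StoreID string }

type Region struct {
	ID        uint64
	DownPeers []Peer
}

type RegionsCheckInfo struct {
	Count   int
	Regions []Region
}

// countNonSelfDownPeers returns the number of regions that have at least one
// down peer on a store other than selfStoreID. When the store ID is unknown,
// every reported region is counted (assumption: the conservative choice).
func countNonSelfDownPeers(info *RegionsCheckInfo, selfStoreID string) int {
	if info == nil {
		return 0
	}
	if selfStoreID == "" {
		return info.Count
	}
	n := 0
	for _, region := range info.Regions {
		for _, peer := range region.DownPeers {
			if peer.StoreID != selfStoreID {
				n++
				break // count each region at most once
			}
		}
	}
	return n
}

// canRecreatePod is a hypothetical helper combining both prechecks:
// (a) zero non-self down peers and (b) leaders already evicted.
func canRecreatePod(info *RegionsCheckInfo, selfStoreID string, leadersEvicted bool) error {
	if n := countNonSelfDownPeers(info, selfStoreID); n > 0 {
		return fmt.Errorf("%d region(s) have down peers on other stores", n)
	}
	if !leadersEvicted {
		return errors.New("leaders are not evicted yet")
	}
	return nil
}

func main() {
	info := &RegionsCheckInfo{
		Count: 2,
		Regions: []Region{
			{ID: 1, DownPeers: []Peer{{StoreID: "1"}}}, // down peer on this store: ignored
			{ID: 2, DownPeers: []Peer{{StoreID: "2"}}}, // down peer elsewhere: blocks restart
		},
	}
	fmt.Println(canRecreatePod(info, "1", true))
}
```

Excluding the store's own down peers matters because a TiKV pod that is already unhealthy would otherwise block its own restart.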

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 4 comments.

| File | Description |
|---|---|
| pkg/pdapi/v1/types.go | Adds PD response types for down-peer region checks. |
| pkg/pdapi/v1/client.go | Adds GetDownPeerRegions PD client call and endpoint constant. |
| pkg/pdapi/v1/mock_generated.go | Updates PD client mock to include GetDownPeerRegions. |
| pkg/pdapi/v1/client_test.go | Adds unit test coverage for GetDownPeerRegions. |
| pkg/controllers/tikv/tasks/util.go | Adds helper checks for leader-eviction status/timeout; fixes VolumeName import aliasing. |
| pkg/controllers/tikv/tasks/pod.go | Adds restart prechecks (down peers + leaders evicted) and wires PD client usage into restart flow. |
| pkg/controllers/tikv/tasks/pod_test.go | Extends pod task tests to cover down-peer filtering and leader-eviction gating behavior. |
| pkg/controllers/tikv/tasks/evict_leader.go | Changes eviction scheduler management based on ShouldEvictLeader and syncs LeadersEvicted condition here. |
| pkg/controllers/tikv/tasks/evict_leader_test.go | Adds tests for starting/stopping leader eviction scheduler behavior. |
| pkg/controllers/tikv/tasks/offline.go | Switches offline flow to use the new leader-eviction check helper and ShouldEvictLeader. |
| pkg/controllers/tikv/tasks/status.go | Removes leader-eviction condition syncing and related wait behavior from status task. |
| pkg/controllers/tikv/tasks/status_test.go | Updates expectations after removing leader-eviction condition management from status task. |
| pkg/controllers/tikv/tasks/ctx.go | Minor formatting/structure adjustments; no functional change observed. |
| pkg/controllers/tikv/builder.go | Updates runner wiring to pass PD client manager into TaskPod. |
| api/core/v1alpha1/tikv_types.go | Adds ReasonStoreNotExist and deprecates ReasonStoreIsRemoved. |
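For readers unfamiliar with the PD endpoint behind `GetDownPeerRegions`, the call can be sketched with only the standard library. This is an illustration under assumptions, not the client added in `pkg/pdapi/v1/client.go`: the `downPeerRegions` type, the `fetchDownPeerRegions` and `newFakePD` helpers, and the JSON field names `count`, `regions`, and `id` are hypothetical, and the server is a local fake rather than a real PD.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Endpoint path quoted in the review overview.
const downPeerPath = "/pd/api/v1/regions/check/down-peer"

// downPeerRegions is a hypothetical, trimmed-down response shape.
type downPeerRegions struct {
	Count   int `json:"count"`
	Regions []struct {
		ID uint64 `json:"id"`
	} `json:"regions"`
}

// fetchDownPeerRegions queries the down-peer check endpoint and decodes the body.
func fetchDownPeerRegions(client *http.Client, baseURL string) (*downPeerRegions, error) {
	resp, err := client.Get(baseURL + downPeerPath)
	if err != nil {
		return nil, fmt.Errorf("query down-peer regions: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	var out downPeerRegions
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, fmt.Errorf("decode down-peer regions: %w", err)
	}
	return &out, nil
}

// newFakePD starts a local test server that answers only the down-peer path.
func newFakePD() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path != downPeerPath {
			http.NotFound(w, r)
			return
		}
		fmt.Fprint(w, `{"count":1,"regions":[{"id":7}]}`)
	}))
}

func main() {
	srv := newFakePD()
	defer srv.Close()

	info, err := fetchDownPeerRegions(srv.Client(), srv.URL)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("regions with down peers:", info.Count)
}
```

A fake server like this is also one way to unit test such a client method without a running PD.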

Comment thread pkg/controllers/tikv/tasks/pod.go Outdated
Comment on lines +105 to +109
        return task.Wait().With("cannot recreate pod, check down peer: %v", err)
    }

    if err := CheckTiKVLeadersEvicted(state.TiKV()); err != nil {
        return task.Wait().With("cannot recreate pod, check leader count: %v", err)
Comment on lines +150 to +159
func countNonSelfDownPeers(downPeerInfo *pdapi.RegionsCheckInfo, store *pdv1.Store) int {
    if store == nil || store.ID == "" {
        return downPeerInfo.Count
    }
    if downPeerInfo.Count == 0 {
        return 0
    }

    nonSelfDownPeerCount := 0
    for _, region := range downPeerInfo.Regions {
case !state.PDSynced:
return task.Wait().With("pd is unsynced")
case state.Store == nil:
if state.Store == nil {
Comment thread pkg/controllers/tikv/tasks/pod.go Outdated
Comment on lines +78 to +82
    pc, ok := state.GetPDClient(cm)
    if !ok {
        return task.Wait().With("wait if pd client is not registered")
    }

Signed-off-by: liubo02 <liubo02@pingcap.com>
Labels

size/XXL v2 for operator v2

3 participants