Skip to content

Commit routing config before demoting old version in handleSetCurrent#9973

Draft
Shivs11 wants to merge 4 commits intomainfrom
commit-routing-first
Draft

Commit routing config before demoting old version in handleSetCurrent#9973
Shivs11 wants to merge 4 commits intomainfrom
commit-routing-first

Conversation

@Shivs11
Copy link
Copy Markdown
Member

@Shivs11 Shivs11 commented Apr 16, 2026

What changed?

  • Fix non-atomic two-step sync in handleSetCurrent and setRamp where step 2 (demoting old version) failure leaves routing config uncommitted, causing orphaned CURRENT versions and burned revision numbers.

  • New flow: commit routing config immediately after step 1 (promote new version) succeeds, then fire-and-forget signal to old version instead of blocking sync activity. Gated behind workflow.GetVersion for NDE safety.

  • The only thing that I don't have in this PR are new tests to test this out. Happy to hear ideas if someone has any, but the core idea was that the current ones should be passing and testing the code paths. I am still thinking about this.

Why?

  • Reliability IMO

How did you test it?

  • built
  • run locally and tested manually
  • covered by existing tests
  • added new unit test(s)
  • added new functional test(s)

Potential risks

  • I would appreciate a very careful review on this one!

Shivs11 and others added 2 commits April 16, 2026 13:07
Fix non-atomic two-step sync in handleSetCurrent and setRamp where
step 2 (demoting old version) failure leaves routing config uncommitted,
causing orphaned CURRENT versions and burned revision numbers.

New flow: commit routing config immediately after step 1 (promote new
version) succeeds, then fire-and-forget signal to old version instead
of blocking sync activity. Gated behind workflow.GetVersion for NDE
safety.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
SignalExternalWorkflow requires .Get() for the command to be committed
in the workflow task response. This matches the established pattern
used by syncSummary and signalPropagationComplete.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
wfID := GenerateVersionWorkflowID(v.GetDeploymentName(), v.GetBuildId())
err := workflow.SignalExternalWorkflow(ctx, wfID, "", DemoteVersionSignalName, &deploymentspb.DemoteVersionSignalArgs{
RoutingConfig: routingConfig,
}).Get(ctx, nil)
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this is something that has stumped me - in my first commit, on this PR, I did not have the .Get() call which let to a bunch of functional tests failing. Adding the .Get() call made the tests work.

Now, according to my knowledge, it seems that when we have the .Get() present, it means that there is a guarantee that the signal reaches the server. I checked other places in worker-versioning workflows where we have also sent cross workflow signals, and we seem to be following the pattern of using the .Get() method. However, it does mean that there is a quick round trip between the worker and the server here.

I am thinking on the side if I should just wrap this signal up in a workflow.Go routine, so that we don't make this round trip....

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well, my answer is here this seems to be a known limitation: #1289

It seems to me that if we do have a world in which the workflow task sends back the CAN command to the server, and a signal too, we drop the signal and just CAN. So, we can't not have this .Get() call which shall make this RPC trip!

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

func (s *Versioning3Suite) waitForDeploymentDataPropagationQueryWorkerDeployment(tv *testvars.TestVars) {
if s.deploymentWorkflowVersion == workerdeployment.AsyncSetCurrentAndRamping {
if s.deploymentWorkflowVersion >= workerdeployment.AsyncSetCurrentAndRamping {
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we need this to also run even when we have the VersionDataRevisionNumber (version number 2) running. My changes in this PR caught that, given that we have eventual consistency now for the old version changes too.

I think this also meant that some of the tests that used this function could be flaky on main. If you look closely, in the VersionDataRevisionNumber mode, we should honestly be waiting here until the propagation is completed! So, in theory, I think this change makes sense to me

The reason why this was caught(flaky) in this PR is because this PR introduced the latency of the update finally completing in the next workflow task, due to the extra RPC trip that I mentioned above.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant