Commit routing config before demoting old version in handleSetCurrent#9973
Commit routing config before demoting old version in handleSetCurrent#9973
Conversation
Fix non-atomic two-step sync in handleSetCurrent and setRamp where step 2 (demoting old version) failure leaves routing config uncommitted, causing orphaned CURRENT versions and burned revision numbers. New flow: commit routing config immediately after step 1 (promote new version) succeeds, then fire-and-forget signal to old version instead of blocking sync activity. Gated behind workflow.GetVersion for NDE safety. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
SignalExternalWorkflow requires .Get() for the command to be committed in the workflow task response. This matches the established pattern used by syncSummary and signalPropagationComplete. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
| wfID := GenerateVersionWorkflowID(v.GetDeploymentName(), v.GetBuildId()) | ||
| err := workflow.SignalExternalWorkflow(ctx, wfID, "", DemoteVersionSignalName, &deploymentspb.DemoteVersionSignalArgs{ | ||
| RoutingConfig: routingConfig, | ||
| }).Get(ctx, nil) |
There was a problem hiding this comment.
so this is something that has stumped me - in my first commit, on this PR, I did not have the .Get() call which let to a bunch of functional tests failing. Adding the .Get() call made the tests work.
Now, according to my knowledge, it seems that when we have the .Get() present, it means that there is a guarantee that the signal reaches the server. I checked other places in worker-versioning workflows where we have also sent cross workflow signals, and we seem to be following the pattern of using the .Get() method. However, it does mean that there is a quick round trip between the worker and the server here.
I am thinking on the side if I should just wrap this signal up in a workflow.Go routine, so that we don't make this round trip....
There was a problem hiding this comment.
well, my answer is here this seems to be a known limitation: #1289
It seems to me that if we do have a world in which the workflow task sends back the CAN command to the server, and a signal too, we drop the signal and just CAN. So, we can't not have this .Get() call which shall make this RPC trip!
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
||
| func (s *Versioning3Suite) waitForDeploymentDataPropagationQueryWorkerDeployment(tv *testvars.TestVars) { | ||
| if s.deploymentWorkflowVersion == workerdeployment.AsyncSetCurrentAndRamping { | ||
| if s.deploymentWorkflowVersion >= workerdeployment.AsyncSetCurrentAndRamping { |
There was a problem hiding this comment.
nit: we need this to also run even when we have the VersionDataRevisionNumber (version number 2) running. My changes in this PR caught that, given that we have eventual consistency now for the old version changes too.
I think this also meant that some of the tests that used this function could be flaky on main. If you look closely, in the VersionDataRevisionNumber mode, we should honestly be waiting here until the propagation is completed! So, in theory, I think this change makes sense to me
The reason why this was caught(flaky) in this PR is because this PR introduced the latency of the update finally completing in the next workflow task, due to the extra RPC trip that I mentioned above.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
What changed?
Fix non-atomic two-step sync in handleSetCurrent and setRamp where step 2 (demoting old version) failure leaves routing config uncommitted, causing orphaned CURRENT versions and burned revision numbers.
New flow: commit routing config immediately after step 1 (promote new version) succeeds, then fire-and-forget signal to old version instead of blocking sync activity. Gated behind workflow.GetVersion for NDE safety.
The only thing that I don't have in this PR are new tests to test this out. Happy to hear ideas if someone has any, but the core idea was that the current ones should be passing and testing the code paths. I am still thinking about this.
Why?
How did you test it?
Potential risks