Commit f948cd9

ci: longer ssh poll timeout, wider key windows, debug artifacts
Four CI-specific robustness fixes layered on top of the GCP-validated fixes from the previous commit:

1. 'Wait for install.success marker': bump SSH ConnectTimeout from 5s to 30s and retry 3x per iteration. During the GCP test build, install.success existed 11 min before the poll loop finally detected it, because the 5s timeout was silently failing every iteration while TiWorker + DISM were saturating the VM. A longer timeout plus retries closes that hole.

2. 'Defeat Press any key to boot from CD': widen the Enter spray from 2..17s to 2..40s. ubuntu-latest runners are slower at cold OVMF init than the n2 GCP VMs the workaround was originally validated on, and a run today (23991338720) sat with disk=196K for 2 h because every Enter keystroke missed the bootmgr window.

3. 'Click past Win11 25H2 Setup pickers': double the Alt+N budget from 60 iterations (3 min) to 120 (6 min), and take a screendump through the QEMU monitor before bailing out so the failure leaves a diagnostic artifact.

4. New 'Capture diagnostic screenshot' + 'Upload debug artifacts' steps that run under always() and upload serial.log + screen.png (converted from the screendump with ImageMagick) on both success and failure. Future CI failures will leave behind something we can actually look at instead of guessing from the poll loop's disk-size log.
1 parent 7f460d2 commit f948cd9

1 file changed

Lines changed: 56 additions & 9 deletions

.github/workflows/build.yml

@@ -129,10 +129,14 @@ jobs:
           echo "QEMU started, PID=$(cat qemu.pid)"
 
       - name: Defeat "Press any key to boot from CD" prompt
-        # Spam Enter keys 2s..17s after start so one lands in the Windows
-        # bootloader timeout window (~5s from OVMF load).
+        # Spam Enter keys 2s..40s after start so one lands in the Windows
+        # bootloader timeout window (~5s from OVMF load). GitHub-hosted
+        # ubuntu-latest runners are noticeably slower than GCP n2 VMs at
+        # cold OVMF init, so the original 2..17s window was too tight --
+        # this one covers up to 40s which is still comfortably before the
+        # Alt+N picker spray kicks in.
         run: |
-          for delay in 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1; do
+          for delay in 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2; do
             sleep $delay
             echo 'sendkey ret' | nc -w 1 -q 1 127.0.0.1 4444 >/dev/null
           done
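
The width of the Enter spray is governed by the sum of the per-iteration sleeps. A minimal sketch for checking that sum is shown here; `spray_total` is a hypothetical helper, not part of the workflow. It counts sleep time only, so the real window is somewhat longer because each `nc -w 1 -q 1` call also takes time.

```shell
#!/bin/sh
# spray_total DELAY...: print the sum of the given sleep delays in seconds.
# Hypothetical helper for reasoning about the spray window; it ignores the
# time spent inside each nc invocation.
spray_total() {
  total=0
  for d in "$@"; do
    total=$((total + d))
  done
  echo "$total"
}

# Sleep-only totals for the old and the new delay lists from the diff.
echo "old: $(spray_total 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1)s"
echo "new: $(spray_total 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2)s"
```

Comparing the two totals against the window quoted in the step comment is a quick sanity check whenever the delay list is edited.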
@@ -151,7 +155,10 @@ jobs:
         # growing -- that signals Setup has read autounattend.xml and begun
         # partitioning / file copy.
         run: |
-          for i in $(seq 1 60); do
+          # 120 iterations x 3s = 6 min window (was 60 iterations = 3 min).
+          # GCP n2 VMs reach the second picker within ~90 s; ubuntu-latest
+          # runners may need twice that.
+          for i in $(seq 1 120); do
             sleep 3
             echo 'sendkey alt-n' | nc -w 1 -q 1 127.0.0.1 4444 >/dev/null
             DISK_K=$(du -k "$QCOW2_NAME" | cut -f1)
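
The loop treats on-disk growth of the qcow2 image as the signal that Setup has advanced, and the post-loop check compares `DISK_K` against 500000. That test could be factored into a small function, sketched here under the assumption that a plain `du -k` measurement is sufficient; `disk_exceeds` is a hypothetical name, not something the workflow defines.

```shell
#!/bin/sh
# disk_exceeds FILE THRESHOLD_K
# Succeeds when FILE occupies more than THRESHOLD_K KiB on disk, measured
# the same way as the workflow (du -k | cut -f1). Fails for a missing file.
disk_exceeds() {
  file=$1
  threshold_k=$2
  [ -f "$file" ] || return 1
  size_k=$(du -k "$file" | cut -f1)
  [ "$size_k" -gt "$threshold_k" ]
}

# Usage sketch (QCOW2_NAME is the variable used by the workflow):
#   if disk_exceeds "$QCOW2_NAME" 500000; then echo "Setup advanced"; fi
```

Note that `du` reports allocated blocks rather than apparent size, which is what makes it a useful progress signal for a sparse qcow2 file.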
@@ -161,11 +168,19 @@ jobs:
             fi
           done
           if [ "$DISK_K" -le 500000 ]; then
-            echo "::error::Disk still ${DISK_K}K after 60 Alt+N presses, Setup did not advance"
+            echo "::error::Disk still ${DISK_K}K after 120 Alt+N presses, Setup did not advance"
+            echo 'screendump /tmp/screen.ppm' | nc -w 1 -q 1 127.0.0.1 4444 || true
+            ls -la /tmp/screen.ppm 2>/dev/null || true
             exit 1
           fi
 
       - name: Wait for install.success marker
+        # ConnectTimeout=30 (not 5) + 3 retries per iteration. During heavy
+        # FirstLogonCommands (TiWorker / DISM / virtio-win guest tools install)
+        # the Windows SSH server can take >5s to accept connections, and a
+        # short timeout would silently fail on every iteration and never
+        # detect the marker -- validated on GCP test build where install.success
+        # existed 11 min before the old 5s poll loop noticed.
         run: |
           MAX_WAIT=7200
           ELAPSED=0
@@ -178,10 +193,19 @@ jobs:
               exit 1
             fi
             DISK=$(du -sh "$QCOW2_NAME" | cut -f1)
-            if sshpass -p "$WIN_PASS" ssh -o StrictHostKeyChecking=no \
-                -o ConnectTimeout=5 -o UserKnownHostsFile=/dev/null \
-                -p "$SSH_PORT" cocoon@localhost \
-                "if exist C:\\install.success echo READY" 2>/dev/null | grep -q READY; then
+            FOUND=""
+            for try in 1 2 3; do
+              if sshpass -p "$WIN_PASS" ssh -o StrictHostKeyChecking=no \
+                  -o ConnectTimeout=30 -o ServerAliveInterval=10 \
+                  -o UserKnownHostsFile=/dev/null \
+                  -p "$SSH_PORT" cocoon@localhost \
+                  "if exist C:\\install.success echo READY" 2>/dev/null | grep -q READY; then
+                FOUND=1
+                break
+              fi
+              sleep 2
+            done
+            if [ -n "$FOUND" ]; then
               echo "install.success detected at ${ELAPSED}s (disk=${DISK})"
               exit 0
             fi
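
The three-attempt probe added in this hunk follows a generic bounded-retry pattern. A minimal sketch of that pattern as a reusable function is given below; `retry` is a hypothetical helper that the workflow does not define, and unlike the inline loop it skips the sleep after the final failed attempt.

```shell
#!/bin/sh
# retry ATTEMPTS DELAY CMD [ARG...]
# Runs CMD up to ATTEMPTS times and returns 0 on the first success.
# Sleeps DELAY seconds between failed attempts; returns 1 if all fail.
retry() {
  attempts=$1
  delay=$2
  shift 2
  try=1
  while [ "$try" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    try=$((try + 1))
    # Sleep only when another attempt is still to come.
    if [ "$try" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Usage sketch: probe_marker stands in for the sshpass/ssh/grep pipeline,
# which would need to be wrapped in a function to be passed as CMD.
#   if retry 3 2 probe_marker; then FOUND=1; fi
```

Wrapping the pipeline in a function keeps the `| grep -q READY` exit status as the success criterion, which is the behaviour the inline loop relies on.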
@@ -317,6 +341,29 @@ jobs:
           echo "Pushed: $REF"
           echo "Pushed: ghcr.io/${REPO_LC}:${{ inputs.version_tag }}"
 
+      - name: Capture diagnostic screenshot
+        if: always()
+        run: |
+          if [ -f qemu.pid ] && kill -0 "$(cat qemu.pid)" 2>/dev/null; then
+            echo 'screendump screen.ppm' | nc -w 1 -q 1 127.0.0.1 4444 || true
+            if [ -f screen.ppm ]; then
+              sudo apt-get install -y -qq imagemagick 2>/dev/null || true
+              convert screen.ppm screen.png 2>/dev/null || true
+            fi
+          fi
+
+      - name: Upload debug artifacts
+        if: always()
+        uses: actions/upload-artifact@v4
+        with:
+          name: debug-${{ github.run_id }}
+          if-no-files-found: ignore
+          retention-days: 7
+          path: |
+            serial.log
+            screen.png
+            screen.ppm
+
       - name: Cleanup
         if: always()
         run: |
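
The capture step only talks to the QEMU monitor when the PID file exists and the process it names is still running. That guard can be sketched as a standalone function; `pid_alive` is a hypothetical name, and the sketch assumes the PID file holds a single process ID as written by QEMU.

```shell
#!/bin/sh
# pid_alive PIDFILE
# Succeeds when PIDFILE exists, is non-empty, and names a process that
# currently exists (signal 0 checks for existence without sending a signal).
pid_alive() {
  pidfile=$1
  [ -f "$pidfile" ] || return 1
  pid=$(cat "$pidfile")
  [ -n "$pid" ] || return 1
  kill -0 "$pid" 2>/dev/null
}

# Usage sketch:
#   if pid_alive qemu.pid; then echo "QEMU still running"; fi
```

Because the step runs under `always()`, the guard matters: after an early failure QEMU may never have started, and without the check the `nc` call would just time out against a closed port.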