Commit b298661

libcontainer: skip EPERM from rootfsParentMountPrivate in userns
In a user namespace, mounts inherited from a more privileged mount namespace are locked by the kernel. Attempting to change their propagation to MS_PRIVATE returns EPERM. This is safe to ignore because prepareRoot() has already set MS_SLAVE recursively, which is sufficient for pivot_root() and prevents mount leaks.

This affects kernels before Linux 6.17, where commit cffd0441872e ("use uniform permission checks for all mount propagation changes", CVE-2025-38498) reworked do_change_type() to use ns_capable() instead of the stricter check that returned EPERM in user namespaces. The fix is also backported to some enterprise kernels (e.g. RHEL 9 5.14.0-570.46.1).

An integration test is added for the cross-userns exec scenario: when two containers have separate user namespaces but share an IPC namespace (the Kubernetes sandbox/workload pattern), runc exec must handle the setns ordering correctly.

Fixes: #5241
Signed-off-by: yksun <yksun@alauda.io>
1 parent 496b68a commit b298661

2 files changed

Lines changed: 47 additions & 0 deletions

libcontainer/rootfs_linux.go

Lines changed: 9 additions & 0 deletions
@@ -1058,6 +1058,15 @@ func rootfsParentMountPrivate(path string) error {
 		}
 		path = filepath.Dir(path)
 	}
+	// In a user namespace, mounts inherited from a more privileged mount
+	// namespace are "locked" and cannot be changed to MS_PRIVATE (the
+	// kernel returns EPERM). This is safe to ignore because prepareRoot()
+	// has already set the propagation to MS_SLAVE recursively, which is
+	// sufficient for pivot_root() to succeed and prevents mount events
+	// from leaking to the parent namespace.
+	if err == unix.EPERM && userns.RunningInUserNS() {
+		return nil
+	}
 	return &mountError{
 		op:     "remount-private",
 		target: path,

tests/integration/userns.bats

Lines changed: 38 additions & 0 deletions
@@ -264,6 +264,44 @@ function teardown() {
 	run ! ip link del dummy0
 }
 
+# Regression test for cross-userns exec failure.
+# When two containers have separate user namespaces but share IPC (the
+# Kubernetes sandbox/workload pattern), runc exec fails because nsexec
+# joins the user namespace BEFORE the IPC namespace. After joining a
+# different userns, setns to the IPC namespace (owned by the first
+# userns) returns EPERM.
+@test "userns exec with cross-userns shared IPC" {
+	requires root
+
+	# Container A (simulates sandbox): own userns, creates IPC namespace.
+	update_config '.process.args = ["sleep", "infinity"]'
+	runc run -d --console-socket "$CONSOLE_SOCKET" sandbox_userns
+	[ "$status" -eq 0 ]
+
+	sandbox_pid="$(__runc state sandbox_userns | jq .pid)"
+
+	# Container B (simulates workload): own userns (different instance),
+	# but joins sandbox's IPC namespace via path.
+	# Remove mqueue mount — mounting mqueue in a different userns than
+	# the IPC namespace owner is not permitted.
+	update_config '.process.args = ["sleep", "infinity"]
+		| .linux.namespaces |= map(
+			if .type == "ipc" then
+				(.path = "/proc/'"$sandbox_pid"'/ns/ipc")
+			else .
+			end
+		)
+		| .mounts |= map(select(.type != "mqueue"))'
+	runc run -d --console-socket "$CONSOLE_SOCKET" workload_userns
+	[ "$status" -eq 0 ]
+
+	# Exec into workload container — this fails because nsexec joins the
+	# workload's userns first, then tries to setns into the sandbox's IPC
+	# namespace which is owned by a different userns → EPERM.
+	runc exec workload_userns id
+	[ "$status" -eq 0 ]
+}
+
 @test "userns with network interface renamed" {
 	requires root
