Reusable Docker base image for the EPFL RCP CaaS GPU cluster.
The image provides:
- CUDA runtime libraries.
- uv for per-project Python environments.
- An LDAP-mapped user so mounted RCP storage has the right UID/GID.
Build the image once, push it to the RCP registry, then reuse it for different uv projects. Each project is mounted as a volume and gets its own .venv from its pyproject.toml and uv.lock.
```
.
|-- Dockerfile      # CUDA runtime base, system deps, uv, LDAP user
|-- .dockerignore   # Minimal build context ignores
|-- build.sh        # Build wrapper with required build args
`-- README.md
```
There is intentionally no pyproject.toml / uv.lock / source code here. This template builds a runtime, not a project.
- Docker installed. Apple Silicon Macs are fine; `build.sh` forces `--platform linux/amd64`.
When the container runs on RCP, the in-container UID/GID must match your EPFL identity. Otherwise mounted RCP storage may be read-only or owned by the wrong user.
SSH into the RCP jumphost and run `id`:

```bash
ssh <gaspar-username>@jumphost.rcp.epfl.ch id
```

The output looks like this:

```
uid=XXXXXX(<gaspar-username>) gid=YYYYY(<LAB-GROUP-NAME>) groups=YYYYY(<LAB-GROUP-NAME>),...
```
Read off the four build-args from that line:
| Build-arg | Where it comes from in the output |
|---|---|
| `LDAP_USERNAME` | the name inside the parentheses after `uid=...` |
| `LDAP_UID` | the number right after `uid=` |
| `LDAP_GROUPNAME` | the name inside the parentheses after `gid=...` (your primary group) |
| `LDAP_GID` | the number right after `gid=` |
If you have an SSH alias for the jumphost (e.g. `Host EPFL_RCP` in `~/.ssh/config`), `ssh EPFL_RCP id` works too.
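If you want such an alias, a minimal entry might look like this (the alias name is arbitrary and the username is a placeholder):

```
# ~/.ssh/config
Host EPFL_RCP
    HostName jumphost.rcp.epfl.ch
    User <gaspar-username>
```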
This command prints shell assignments you can paste into `build.sh`:

```bash
ssh <gaspar-username>@jumphost.rcp.epfl.ch \
  'printf "LDAP_USERNAME=%s\nLDAP_UID=%s\nLDAP_GROUPNAME=%s\nLDAP_GID=%s\n" "$(id -un)" "$(id -u)" "$(id -gn)" "$(id -g)"'
```

If SSH isn't available, look up the same values at it-info.epfl.ch or ask your lab admin.
The Dockerfile installs OpenGL, OSMesa, GLFW, and patchelf by default. These are useful for RL, robotics, simulation, NeRF, and video recording.
If your workload does not need rendering, comment out the OpenGL/MuJoCo `RUN` block in the Dockerfile to make the image smaller.
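For orientation, that block is an apt-get layer along these lines; the package names here are illustrative, so match them against the actual Dockerfile before editing:

```dockerfile
# Rendering / simulation stack (OpenGL, OSMesa, GLFW, patchelf).
# Package names are illustrative -- comment this layer out if you never render.
RUN apt-get update && apt-get install -y --no-install-recommends \
        libgl1 \
        libosmesa6-dev \
        libglfw3 \
        patchelf \
    && rm -rf /var/lib/apt/lists/*
```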
Edit the values at the top of build.sh:
LDAP_USERNAME="<gaspar-username>"
LDAP_UID="<UID>"
LDAP_GROUPNAME="<LAB-GROUP-NAME>"
LDAP_GID="<GID>"
PROJECT="<harbor-project>" # Harbor project you created
IMAGE_NAME="rcp-uv-base"
IMAGE_TAG="v0.1"If you prefer environment variables, use export NAME=value before running build.sh. Plain NAME=value variables are not inherited by ./build.sh.
Then:

```bash
./build.sh
```

Or call `docker build` directly:
```bash
DOCKER_BUILDKIT=1 docker build \
  --platform linux/amd64 \
  --tag registry.rcp.epfl.ch/<harbor-project>/rcp-uv-base:v0.1 \
  --build-arg LDAP_USERNAME=<gaspar-username> \
  --build-arg LDAP_UID=<UID> \
  --build-arg LDAP_GROUPNAME=<LAB-GROUP-NAME> \
  --build-arg LDAP_GID=<GID> \
  .
```

Default is `nvidia/cuda:12.2.0-runtime-ubuntu22.04` (~1.5 GB). It works for uv projects that install PyTorch, JAX, or other GPU libraries from prebuilt wheels. Those wheels usually bundle the CUDA/cuDNN pieces they need, so the base image does not need system cuDNN by default.
Swap to a heavier variant only when you actually need to:
| Override `--build-arg BASE_IMAGE=...` | When you need it |
|---|---|
| `nvidia/cuda:12.2.0-devel-ubuntu22.04` | Compiling CUDA extensions from source: flash-attn, xformers `--no-binary`, custom CUDA kernels, anything that runs `nvcc` during `pip install` / `uv sync`. |
| `nvidia/cuda:12.2.0-cudnn8-devel-ubuntu22.04` | The above, plus libraries that link against system cuDNN headers at build time. |
| `nvcr.io/nvidia/pytorch:24.10-py3` | NGC's pre-built PyTorch image. Heavy; usually unnecessary when uv already pins PyTorch. |
| `ubuntu:22.04` | CPU-only local iteration, no GPU stack at all. |
If a build fails with `nvcc: not found` or a missing `cudnn.h`, that's the signal to step up to a `devel` variant.
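For example, assuming the Dockerfile exposes `ARG BASE_IMAGE` (as the override column above implies), a devel build might look like this; the `-devel` tag suffix is just an illustrative convention:

```bash
# Same LDAP build-args as before, heavier base for source-built CUDA extensions.
DOCKER_BUILDKIT=1 docker build \
  --platform linux/amd64 \
  --build-arg BASE_IMAGE=nvidia/cuda:12.2.0-devel-ubuntu22.04 \
  --build-arg LDAP_USERNAME=<gaspar-username> \
  --build-arg LDAP_UID=<UID> \
  --build-arg LDAP_GROUPNAME=<LAB-GROUP-NAME> \
  --build-arg LDAP_GID=<GID> \
  --tag registry.rcp.epfl.ch/<harbor-project>/rcp-uv-base:v0.1-devel \
  .
```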
```bash
docker run --rm -it \
  registry.rcp.epfl.ch/<harbor-project>/rcp-uv-base:v0.1 \
  bash -c 'whoami && id && uv --version'
```

Expected output (placeholders for your real values):

```
<gaspar-username>
uid=<UID>(<gaspar-username>) gid=<GID>(<LAB-GROUP-NAME>) groups=<GID>(<LAB-GROUP-NAME>)
uv 0.6.x
```
If the username, UID, and GID are correct, push the image.
RCP runs a Harbor registry at: registry.rcp.epfl.ch
Open https://registry.rcp.epfl.ch and log in with your GASPAR account.
Each Harbor project is a namespace (it becomes registry.rcp.epfl.ch/<project>/...).
A whole lab can share one project, or you can create one per research direction.
- Public: any RCP user can pull. Good for sharing.
- Private: only members or robot accounts can pull.
You can choose at creation time, or change it later in project settings.
Members: add by GASPAR username and pick a role (Developer / Maintainer / ProjectAdmin / ...).
The currently deployed Harbor version does not support LDAP-group permissions. Add individual users instead.
Cluster jobs (Run.ai / Kubernetes) need to pull images from the registry, but must not use your GASPAR account. Create a Robot Account inside the project instead:
- Project > Robot Accounts > New Robot Account
- Set the expiration and permissions (usually `pull` is enough)
- Save the generated name and secret. They are shown only once.
Then create an image-pull Secret in your Run.ai / Kubernetes namespace using those credentials. See the RCP "how-to-use-secret" docs for the exact YAML.
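The RCP docs are authoritative for the exact YAML; as a sketch, the same secret can usually be created directly with `kubectl` (the secret and namespace names here are placeholders):

```bash
kubectl create secret docker-registry rcp-registry-pull \
  --namespace <your-runai-namespace> \
  --docker-server=registry.rcp.epfl.ch \
  --docker-username='<robot-account-name>' \
  --docker-password='<robot-account-secret>'
```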
```bash
docker login registry.rcp.epfl.ch
# Username: <GASPAR username>
# Password: <GASPAR password>
```

Tag the image (you can apply several tags at build time):

```bash
docker build -t registry.rcp.epfl.ch/<project>/<image>:latest \
             -t registry.rcp.epfl.ch/<project>/<image>:v0.1 .
```

`latest` is the default tag. `docker run` without an explicit tag pulls `latest`.

```bash
docker push registry.rcp.epfl.ch/<project>/<image>:latest
docker push registry.rcp.epfl.ch/<project>/<image>:v0.1
```

Pulling works the same way:

```bash
docker pull registry.rcp.epfl.ch/<project>/<image>:latest
```

For example, mirror a Docker Hub image into RCP:

```bash
docker pull alpine
docker tag alpine registry.rcp.epfl.ch/<project>/alpine:latest
docker push registry.rcp.epfl.ch/<project>/alpine:latest
```

When you are done:

```bash
docker logout registry.rcp.epfl.ch
```

The base image has no project source in it. Mount a uv project, run `uv sync`, and the project gets its own `.venv` inside the mounted directory.
For a uv project with pyproject.toml and uv.lock in the project root:
```bash
docker run --rm -it \
  --gpus all \
  -v ~/dev/my-experiment:/home/<gaspar-username>/my-experiment \
  -v ~/.cache/rcp-uv:/home/<gaspar-username>/.cache/uv \
  -w /home/<gaspar-username>/my-experiment \
  registry.rcp.epfl.ch/<harbor-project>/rcp-uv-base:v0.1 \
  bash -lc 'uv sync --locked && uv run python train.py'
```

What this does:

- `-v ~/dev/my-experiment:/home/<gaspar-username>/my-experiment` mounts the project inside the container user's home directory.
- `-v ~/.cache/rcp-uv:/home/<gaspar-username>/.cache/uv` keeps uv's package cache on your host. If you skip this mount, the project's `.venv` still persists, but downloaded packages are cached only for that container run.
- `-w` makes that mount the working directory.
- `uv sync --locked` reads the project's `pyproject.toml` / `uv.lock` and creates `.venv` inside the mount. Because the mount is on your host or RCP shared storage, the `.venv` persists between runs.
- `uv run python train.py` runs your script in that venv.
You can swap `train.py` for `bash` to drop into an interactive shell, or for any other command (`uv run pytest`, `uv run jupyter lab`, etc.).
For each new experiment, change only the volume mount and the working directory:
```bash
# Project A
docker run ... -v ~/dev/project-a:/home/<gaspar-username>/project-a \
  -v ~/.cache/rcp-uv:/home/<gaspar-username>/.cache/uv \
  -w /home/<gaspar-username>/project-a \
  registry.rcp.epfl.ch/<harbor-project>/rcp-uv-base:v0.1 \
  bash -lc 'uv sync --locked && uv run python train.py'

# Project B
docker run ... -v ~/dev/project-b:/home/<gaspar-username>/project-b \
  -v ~/.cache/rcp-uv:/home/<gaspar-username>/.cache/uv \
  -w /home/<gaspar-username>/project-b \
  registry.rcp.epfl.ch/<harbor-project>/rcp-uv-base:v0.1 \
  bash -lc 'uv sync --locked && uv run python eval.py'
```

The base image stays the same. Each project keeps its own `pyproject.toml`, `uv.lock`, and `.venv` inside its own directory.
The same pattern maps to a Run.ai or Kubernetes job (a sketch follows the list):

- The `image` in the job spec is your pushed base: `registry.rcp.epfl.ch/<harbor-project>/rcp-uv-base:v0.1`.
- The `-v` volume mount becomes a PVC mount onto your home or scratch path.
- The uv cache path is `/home/<gaspar-username>/.cache/uv`; mount that path, or mount the whole home directory, if you want package downloads to persist between cluster jobs.
- The `--gpus all` flag is replaced by the cluster's GPU request mechanism (Run.ai's `gpu` field, or a Kubernetes `nvidia.com/gpu` resource).
- The `command`/`args` is the same `uv sync --locked && uv run python ...` line.
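As a non-authoritative sketch of how those pieces line up in plain Kubernetes (the job name, PVC name, and pull-secret name are hypothetical, and Run.ai's own schema differs; check the RCP docs for the exact spec):

```yaml
# Hypothetical Kubernetes Job mapping the docker run flags above.
apiVersion: batch/v1
kind: Job
metadata:
  name: my-experiment
spec:
  template:
    spec:
      restartPolicy: Never
      imagePullSecrets:
        - name: rcp-registry-pull            # robot-account secret from above
      containers:
        - name: train
          image: registry.rcp.epfl.ch/<harbor-project>/rcp-uv-base:v0.1
          workingDir: /home/<gaspar-username>/my-experiment   # replaces -w
          command: ["bash", "-lc", "uv sync --locked && uv run python train.py"]
          resources:
            limits:
              nvidia.com/gpu: 1              # replaces --gpus all
          volumeMounts:
            - name: home
              mountPath: /home/<gaspar-username>   # covers project + uv cache
      volumes:
        - name: home
          persistentVolumeClaim:
            claimName: <your-home-pvc>       # replaces the -v host mounts
```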
- Do not put passwords, API keys, or tokens in the Dockerfile or in git. Use Run.ai / Kubernetes Secrets.
- Containers are ephemeral. Put checkpoints, logs, datasets, and project source on mounted storage.
- `.venv` lives on the mounted project volume, not in the image.
- uv's package cache lives at `/home/<gaspar-username>/.cache/uv`. Mount that path if you want downloads to persist.
- Use versioned tags such as `v0.3` for reproducible jobs. Do not rely on `latest` long term.
- Re-run `uv lock` in your project after editing `pyproject.toml`; `uv sync --locked` will fail if the lockfile is stale.
- Apple Silicon Macs must build with `--platform linux/amd64`. The cluster is amd64.
- The base image is per-user because UID/GID are baked in at build time.