This is an experimental and untested setup for serving DeepSeek‑R1‑Distill‑Llama‑70B on your DGX H200 with vLLM under Slurm and Docker, using your shared NFS.
It follows your DGX quick reference (cluster switching, Docker-in-Slurm, NFS paths) and adds model download, server launch, tuning, and usage examples.
TL;DR
- 1) Switch to the DGX cluster → 2) Pre‑pull a pinned vLLM image → 3) Pre‑download the model to NFS → 4) `sbatch` the vLLM server (TP=4, BF16 + FP8 KV) → 5) Call it via the OpenAI‑compatible API.
- DGX Slurm cluster name: `dgx-h200` (via your `slurm-dgx` helper)
- GPUs: 4× H200 (NVLink/NVSwitch)
- NFS mounts (same as Golem): `/nfs/datastore0`, `/nfs/datastore1`, `/nfs/datastore2`, `/nfs/nettarkivet1`
- Model ID: `deepseek-ai/DeepSeek-R1-Distill-Llama-70B`
- Base model: `Llama-3.3-70B-Instruct` (distilled)
- Recommended context cap for the Distill family: start with 32k (`--max-model-len 32768`) unless you know you need more (see Notes).
- Pinned vLLM container tag: `vllm/vllm-openai:v0.10.0` (avoid `:latest` in production)
- Service port: 8000 (internal)
You can change paths/tags later; all scripts are parameterized.
Add to ~/.bashrc (if you haven’t already):
# --- Minimal Slurm cluster switching prompt ---
update_prompt() {
# Always remove existing tags
PS1="$(echo "$PS1" | sed 's/^(dgx-h200) //;s/^(golem) //')"
# Only add tag if on DGX
if [ "$SLURM_CLUSTER" = "dgx-h200" ]; then
PS1="(dgx-h200) $PS1"
fi
}
slurm-dgx() {
export SLURM_CONF=/opt/nb/slurm.conf
export SLURM_CLUSTER="dgx-h200"
update_prompt
echo "Switched to DGX H200 cluster"
}
slurm-golem() {
export SLURM_CONF=/etc/slurm/slurm.conf
unset SLURM_CLUSTER
update_prompt
echo "Switched to Golem cluster"
}
slurm-status() {
if [ -z "$SLURM_CLUSTER" ]; then
echo "Currently on Golem (default cluster)"
else
echo "Current cluster: $SLURM_CLUSTER"
echo " Config: $SLURM_CONF"
fi
}
Reload:
source ~/.bashrc
slurm-dgx
slurm-status
Pick a consistent layout under NFS so jobs are portable across DGX/Golem:
export USER_BASE=/nfs/datastore1/$USER/deepseek_r1_70b
export MODELS_DIR=$USER_BASE/models
export LOGS_DIR=$USER_BASE/logs
export HF_HOME_DIR=$USER_BASE/hf_home
mkdir -p "$MODELS_DIR" "$LOGS_DIR" "$HF_HOME_DIR"
We will keep the downloaded model under `$MODELS_DIR/DeepSeek-R1-Distill-Llama-70B` and the Hugging Face cache under `$HF_HOME_DIR`.
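When other scripts need the same layout, deriving it in one place avoids drift. A minimal Python sketch (the base path is this guide's default; `USER_BASE` and the `layout_for`/`ensure_layout` helpers are illustrative, not part of any existing tooling):

```python
import os
from pathlib import Path


def layout_for(base: Path) -> dict[str, Path]:
    """The directory layout used throughout this guide."""
    return {
        "MODELS_DIR": base / "models",
        "LOGS_DIR": base / "logs",
        "HF_HOME_DIR": base / "hf_home",
    }


def ensure_layout(base: Path) -> dict[str, Path]:
    """Create the tree idempotently (like mkdir -p) and return the paths."""
    paths = layout_for(base)
    for path in paths.values():
        path.mkdir(parents=True, exist_ok=True)
    return paths


if __name__ == "__main__":
    default_base = Path(f"/nfs/datastore1/{os.environ.get('USER', 'nobody')}/deepseek_r1_70b")
    base = Path(os.environ.get("USER_BASE", default_base))
    for name, path in layout_for(base).items():
        print(f"{name}={path}")
```

The sbatch scripts below recompute the same paths in bash; if you change the layout, change it in both places.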
Set HF token securely (replace ***):
export HF_TOKEN=***YOUR_HF_READ_TOKEN***
# Optional: persist for this shell/session
echo 'export HF_TOKEN=***YOUR_HF_READ_TOKEN***' >> ~/.bashrc
vLLM and `huggingface-cli` will also pick up `HF_HOME`/`HF_HUB_CACHE`; we set them in the scripts.
module add docker
docker pull vllm/vllm-openai:v0.10.0
docker images | grep vllm-openai
Pinning a known‑good tag avoids surprises from upstream changes.
Submit this model-prep batch job (it installs the Hugging Face CLI in a slim Python image and runs `hf download`). The download typically takes 10–15 minutes to complete:
File: 01_model_prep.sbatch
#!/bin/bash -l
#SBATCH --job-name=ds_r1_70b_prep
#SBATCH --output=/nfs/datastore1/%u/deepseek_r1_70b/logs/prep_%j.out
#SBATCH --error=/nfs/datastore1/%u/deepseek_r1_70b/logs/prep_%j.err
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=16G
#SBATCH --time=02:00:00
#SBATCH --gres=gpu:0
# #SBATCH --partition=dgx
# --- Safe temporary directory (shared & visible) ---
export TMPDIR=/nfs/datastore1/$USER/tmp
mkdir -p "$TMPDIR"
set -euo pipefail
module add docker
# --- Project directories on shared NFS ---
USER_BASE="/nfs/datastore1/$USER/deepseek_r1_70b"
MODELS_DIR="$USER_BASE/models"
HF_HOME_DIR="$USER_BASE/hf_home"
LOGS_DIR="$USER_BASE/logs"
mkdir -p "$MODELS_DIR" "$HF_HOME_DIR" "$LOGS_DIR"
# --- Print quick environment info ---
echo "===== ENVIRONMENT INFO ====="
echo "Host: $(hostname)"
echo "Date: $(date)"
echo "SLURM_CLUSTER: ${SLURM_CLUSTER:-unknown}"
echo "TMPDIR: $TMPDIR"
echo "MODELS_DIR: $MODELS_DIR"
echo "HF_HOME_DIR: $HF_HOME_DIR"
echo "Disk usage on /nfs/datastore1:"
df -h /nfs/datastore1 | tail -n 1
echo "============================"
# --- Use minimal Python image and install latest HF CLI ---
docker pull python:3.11-slim
echo "🚀 Installing latest Hugging Face CLI inside container and starting download..."
docker run --rm \
-e HF_TOKEN="${HF_TOKEN:-}" \
-v "$MODELS_DIR":/model \
-v "$HF_HOME_DIR":/hf_home \
python:3.11-slim \
bash -c '
set -e
export DEBIAN_FRONTEND=noninteractive
apt-get update -qq && apt-get install -y -qq git curl && rm -rf /var/lib/apt/lists/*
echo "Python: $(python3 --version)"
pip install -U "huggingface_hub[cli]" > /dev/null
echo "✅ huggingface_hub installed. Starting model download..."
mkdir -p /model/DeepSeek-R1-Distill-Llama-70B
hf download deepseek-ai/DeepSeek-R1-Distill-Llama-70B \
--local-dir /model/DeepSeek-R1-Distill-Llama-70B \
--no-force-download
echo "✅ Model download complete."
echo "Listing a few files in /model:"
ls -lh /model/DeepSeek-R1-Distill-Llama-70B | head
echo "Total size:"
du -sh /model/DeepSeek-R1-Distill-Llama-70B || true
'
echo "📂 Model files stored at: $MODELS_DIR/DeepSeek-R1-Distill-Llama-70B"
echo "===== JOB COMPLETE $(date) ====="
Submit:
slurm-dgx
sbatch 01_model_prep.sbatch
squeue -u $USER
File: 02_vllm_serve_4xH200.sbatch
#!/bin/bash -l
#SBATCH --job-name=ds_r1_70b_vllm
#SBATCH --output=/nfs/datastore1/%u/deepseek_r1_70b/logs/vllm_%j.out
#SBATCH --error=/nfs/datastore1/%u/deepseek_r1_70b/logs/vllm_%j.err
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=24
#SBATCH --mem=200G
#SBATCH --gres=gpu:4
#SBATCH --time=12:00:00
# #SBATCH --partition=dgx
# --- Safe temp directory (shared + visible) ---
export TMPDIR=/nfs/datastore1/$USER/tmp
mkdir -p "$TMPDIR"
set -euo pipefail
module add docker
# --- Paths ---
USER_BASE="/nfs/datastore1/$USER/deepseek_r1_70b"
MODELS_DIR="$USER_BASE/models"
HF_HOME_DIR="$USER_BASE/hf_home"
LOGS_DIR="$USER_BASE/logs"
mkdir -p "$MODELS_DIR" "$HF_HOME_DIR" "$LOGS_DIR"
# --- Networking ---
HOST_PORT="${HOST_PORT:-8000}"
# --- Container paths ---
CONTAINER_MODEL_DIR="/model"
CONTAINER_HF_HOME="/hf_home"
# --- NCCL debug level ---
export NCCL_DEBUG=WARN
# --- Pull pinned image ---
docker pull vllm/vllm-openai:v0.10.0
# --- Run server inside container ---
exec docker run \
--gpus all \
--rm \
--ipc=host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
--shm-size=16g \
-e HF_TOKEN="${HF_TOKEN:-}" \
-e HF_HOME="$CONTAINER_HF_HOME" \
-e NCCL_DEBUG="$NCCL_DEBUG" \
-v "$MODELS_DIR":"$CONTAINER_MODEL_DIR" \
-v "$HF_HOME_DIR":"$CONTAINER_HF_HOME" \
-p "${HOST_PORT}:8000" \
vllm/vllm-openai:v0.10.0 \
bash -c 'vllm serve /model/DeepSeek-R1-Distill-Llama-70B \
--served-model-name DeepSeek-R1-Distill-Llama-70B \
--tensor-parallel-size 4 \
--dtype bfloat16 \
--kv-cache-dtype fp8 \
--enable-prefix-caching \
--max-model-len 32768 \
--gpu-memory-utilization 0.95 \
--max-num-seqs 512 \
--block-size 32 \
--download-dir /model \
--port 8000 \
--api-key changeme'
Submit:
slurm-dgx
sbatch 02_vllm_serve_4xH200.sbatch
squeue -u $USER
Notes on the chosen flags:
- `--tensor-parallel-size 4`: shards the model across all 4 GPUs.
- `--dtype bfloat16`: BF16 weights are stable and fast on H200.
- `--kv-cache-dtype fp8`: halves KV memory and usually improves throughput on Hopper/H200.
- `--enable-prefix-caching`: reuses shared prompt prefixes (RAG, multi‑turn).
- `--max-model-len 32768`: practical default for the DeepSeek‑R1‑Distill family. Raise only if needed.
- `--gpu-memory-utilization 0.95`: gives the KV cache more headroom.
- `--max-num-seqs 512`: high concurrency without fragmentation. Tune for your workload.
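To see why the FP8 KV cache and the 32k cap matter, it helps to estimate KV‑cache memory. A back‑of‑the‑envelope sketch, assuming the standard Llama‑70B geometry (80 layers, 8 KV heads via GQA, head dim 128 — verify against the downloaded `config.json` before relying on these numbers):

```python
# Rough KV-cache sizing for DeepSeek-R1-Distill-Llama-70B.
# Architecture numbers are assumptions from the Llama-70B family.
NUM_LAYERS = 80
NUM_KV_HEADS = 8   # grouped-query attention
HEAD_DIM = 128


def kv_bytes_per_token(bytes_per_elem: int) -> int:
    # K and V, per layer: num_kv_heads * head_dim elements each.
    return 2 * NUM_KV_HEADS * HEAD_DIM * NUM_LAYERS * bytes_per_elem


def kv_gib_per_seq(seq_len: int, bytes_per_elem: int) -> float:
    return kv_bytes_per_token(bytes_per_elem) * seq_len / 2**30


if __name__ == "__main__":
    for name, bpe in (("fp8", 1), ("bf16", 2)):
        print(f"{name}: {kv_bytes_per_token(bpe) / 1024:.0f} KiB/token, "
              f"{kv_gib_per_seq(32768, bpe):.1f} GiB per 32k-token sequence")
```

Under these assumptions, FP8 KV costs about 160 KiB per token (5 GiB for a full 32k sequence) versus twice that for BF16, which is why the FP8 cache roughly doubles the effective batch/context budget.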
The examples below are sketches of intended usage; the endpoint is not exposed externally today. You can test it by SSH-ing to bcm1.nb.no and then on to node1 (or wherever the model is running).
# Health check
curl -s http://localhost:8000/health
# Chat completion (replace API key if you changed it)
curl http://localhost:8000/v1/chat/completions \
-H "Authorization: Bearer changeme" \
-H "Content-Type: application/json" \
-d '{
"model": "DeepSeek-R1-Distill-Llama-70B",
"messages": [
{"role":"user","content":"Solve 24*37 step by step and give final answer in \\boxed{}"}
],
"temperature": 0.6,
"top_p": 0.95,
"max_tokens": 512,
"stream": true
}'
Python example using the OpenAI SDK:
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="changeme",
)
resp = client.chat.completions.create(
model="DeepSeek-R1-Distill-Llama-70B",
messages=[
{"role": "user", "content": "Please reason step by step and put the final result in \\boxed{}: What is the derivative of x^3?"}
],
temperature=0.6,
top_p=0.95,
max_tokens=512,
stream=False,
)
print(resp.choices[0].message.content)
For a quick request‑level benchmark from a login node or inside another container, run a lightweight load (do not DDoS your own server):
python - <<'PY'
import concurrent.futures, time, json, requests
URL="http://localhost:8000/v1/chat/completions"
HDR={"Authorization":"Bearer changeme","Content-Type":"application/json"}
PAY={"model":"DeepSeek-R1-Distill-Llama-70B",
"messages":[{"role":"user","content":"Write one sentence about vLLM."}],
"max_tokens":64, "temperature":0.0}
def one():
t=time.time()
r=requests.post(URL, headers=HDR, data=json.dumps(PAY), timeout=60)
return time.time()-t, r.status_code
N=64
t0=time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=32) as ex:
res=list(ex.map(lambda _: one(), range(N)))
dt=time.time()-t0
ok=sum(1 for d,s in res if s==200)
print(f"Requests: {ok}/{N} in {dt:.2f}s -> {ok/dt:.2f} req/s")
PY
For deeper profiling, use `vllm bench serve` from the image (not shown here to keep this guide concise).
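The snippet above measures request rate only. Token throughput can be derived from the standard OpenAI `usage` object that vLLM returns for non‑streaming requests. A sketch (the `summarize_usage` helper is illustrative; the field names are the standard OpenAI response schema):

```python
def summarize_usage(responses: list[dict], wall_time_s: float) -> dict:
    """Aggregate OpenAI-style usage objects into token-throughput numbers."""
    prompt = sum(r.get("usage", {}).get("prompt_tokens", 0) for r in responses)
    completion = sum(r.get("usage", {}).get("completion_tokens", 0) for r in responses)
    return {
        "prompt_tokens": prompt,
        "completion_tokens": completion,
        "gen_tokens_per_s": completion / wall_time_s if wall_time_s > 0 else 0.0,
    }


if __name__ == "__main__":
    # Fake data shaped like /v1/chat/completions responses, for illustration only.
    fake = [{"usage": {"prompt_tokens": 12, "completion_tokens": 64}} for _ in range(64)]
    print(summarize_usage(fake, wall_time_s=8.0))
```

In the benchmark loop above, collect `r.json()` for each successful response and pass the list plus the measured wall time to this function.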
- Use NFS for model + HF cache so that re‑scheduling/retries do not re‑download. We already mount `$MODELS_DIR` and `$HF_HOME_DIR` into the container.
- Pin image tags (we used `v0.10.0`) to avoid drift.
- Shard via tensor parallelism for multi‑GPU on a single node (we set `-tp 4`).
- Tune concurrency with `--max-num-seqs` and KV‑cache headroom with `--gpu-memory-utilization`.
- Prefix caching helps repeated prompts/system messages; keep it on unless you see pathologies.
- FP8 KV cache gives you larger effective context and better batching. Keep it on for Hopper/H200.
- Logs & metrics go to your `--output`/`--error` files; keep them on NFS (`$LOGS_DIR`) as in the scripts.
- Firewall / exposure: by default this binds container `:8000` mapped to host `:8000`. If needed for cross‑node access, coordinate ports/ACLs with the infra team.
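Loading 70B weights takes several minutes, so scripts that depend on the server should poll for readiness rather than sleep a fixed time. A minimal sketch (`/health` is vLLM's liveness endpoint; the URL, timeout, and `wait_for_health` helper name are assumptions to adjust):

```python
import time
import urllib.error
import urllib.request


def wait_for_health(url: str, timeout_s: float = 600.0, interval_s: float = 5.0) -> bool:
    """Poll until the endpoint answers HTTP 200, or give up after timeout_s."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # not up yet (refused / timeout / HTTP error); keep polling
        time.sleep(interval_s)
    return False


if __name__ == "__main__":
    ready = wait_for_health("http://localhost:8000/health")
    print("server ready" if ready else "timed out waiting for /health")
```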
# One-time: switch cluster and create dirs
slurm-dgx
export USER_BASE=/nfs/datastore1/$USER/deepseek_r1_70b
mkdir -p $USER_BASE/{models,logs,hf_home}
# Set HF token (once)
export HF_TOKEN=***YOUR_HF_READ_TOKEN***
# 1) Pre-download weights
sbatch 01_model_prep.sbatch
watch -n 5 "squeue -u $USER"
# 2) Start server on 4×H200
sbatch 02_vllm_serve_4xH200.sbatch
squeue -u $USER
# 3) Test (from the same node or via port if reachable)
curl -s http://localhost:8000/health
- OOM during load: lower `--max-model-len` (e.g., 24k), ensure `--kv-cache-dtype fp8`, and keep `--gpu-memory-utilization <= 0.95`.
- Slow TTFT on very long prompts: this is expected. Keep chunked prefill (enabled by default in modern vLLM), stream responses, and consider shortening the prompt or using prefix caching.
- Networking: if you cannot reach port `8000` externally, keep usage on the allocated node (SSH into it) or ask infra to expose it.
- Ray vs MP: vLLM auto‑selects a backend for multi‑GPU, single‑node serving. We keep the defaults; you generally don't need Ray on a single node.
- Prefix caching vs FP8 KV: if you observe odd behavior with very new vLLM builds, temporarily disable one of them to isolate the cause (keep FP8 KV).
- Need higher concurrency / throughput: raise `--max-num-seqs` (measure; 512 → 1024) and ensure `--gpu-memory-utilization 0.95`.
- Need longer context: bump `--max-model-len` gradually (e.g., 48k → 64k). Expect lower throughput and higher memory use.
- Memory pressure: reduce `--max-num-seqs` and/or `--max-model-len`.
- Stability over squeezing: if you hit edge‑case issues, drop `--kv-cache-dtype fp8` to fall back to BF16 KV.
- Interactive: `Ctrl+C` in the foreground shell (the container stops).
- Batch: `scancel <jobid>`; Docker is launched with `--rm`, so containers are cleaned up automatically.
If you must run without Docker:
python -m venv ~/venvs/vllm
source ~/venvs/vllm/bin/activate
pip install -U "vllm[cuda]"
export HF_TOKEN=***; export HF_HOME=/nfs/datastore1/$USER/deepseek_r1_70b/hf_home
vllm serve deepseek-ai/DeepSeek-R1-Distill-Llama-70B \
--tensor-parallel-size 4 --dtype bfloat16 --kv-cache-dtype fp8 \
--enable-prefix-caching --max-model-len 32768 --gpu-memory-utilization 0.95
Docker remains preferred for reproducibility and ease of use on the DGX.
File: client_test.py
from openai import OpenAI
import os
BASE_URL = os.environ.get("VLLM_BASE_URL", "http://localhost:8000/v1")
API_KEY = os.environ.get("VLLM_API_KEY", "changeme")
MODEL = os.environ.get("VLLM_MODEL", "DeepSeek-R1-Distill-Llama-70B")
client = OpenAI(base_url=BASE_URL, api_key=API_KEY)
msg = [{"role":"user","content":"Please reason step by step and answer in \\boxed{}: 24*37"}]
r = client.chat.completions.create(model=MODEL, messages=msg, temperature=0.6, top_p=0.95, max_tokens=256)
print(r.choices[0].message.content)
Run:
python client_test.py
- TP=4 (matches 4× H200)
- BF16 weights (`--dtype bfloat16`)
- FP8 KV cache (`--kv-cache-dtype fp8`)
- Prefix caching ON
- `gpu_memory_utilization` ~0.95
- `max_num_seqs` sized to your concurrency
- Model + HF cache on NFS, pre‑downloaded
- Pinned vLLM image tag
- Logs to NFS
DeepSeek’s Distill page suggests 32k generation/context settings in its examples. The base Llama‑3.3‑70B supports long contexts (128k), but the distillation config and your serving budget may favor 32k as a high‑throughput default. If you raise `--max-model-len`, do it gradually and monitor memory/throughput.
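Before raising `--max-model-len`, it is worth confirming what the downloaded checkpoint actually declares. A sketch that reads `max_position_embeddings` (the standard Hugging Face Llama config key) from the model directory in this guide's layout; the `declared_context` helper is illustrative:

```python
import json
import os
from pathlib import Path


def declared_context(model_dir: str) -> int:
    """Return the context length the checkpoint's config.json declares."""
    cfg = json.loads((Path(model_dir) / "config.json").read_text())
    return int(cfg["max_position_embeddings"])


if __name__ == "__main__":
    models_dir = os.environ.get("MODELS_DIR")
    if models_dir:
        n = declared_context(os.path.join(models_dir, "DeepSeek-R1-Distill-Llama-70B"))
        print(f"config declares max_position_embeddings = {n}")
```

vLLM rejects a `--max-model-len` above this value, so checking first saves a failed server launch.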