DGX H200 (4×H200) — vLLM Deployment Guide for deepseek-ai/DeepSeek-R1-Distill-Llama-70B

This is an experimental and untested setup for serving DeepSeek‑R1‑Distill‑Llama‑70B on your DGX H200 with vLLM under Slurm and Docker, using your shared NFS.
It follows your DGX quick reference (cluster switching, Docker-in-Slurm, NFS paths) and adds model download, server launch, tuning, and usage examples.

TL;DR

  1) Switch to the DGX cluster → 2) Pre‑pull a pinned vLLM image → 3) Pre‑download the model to NFS → 4) sbatch the vLLM server (TP=4, BF16 + FP8 KV) → 5) Call it via the OpenAI‑compatible API.

0) Assumptions and Paths

  • DGX Slurm cluster name: dgx-h200 (via your slurm-dgx helper)
  • GPUs: 4× H200 (NVLink/NVSwitch)
  • NFS mounts (same as Golem): /nfs/datastore0, /nfs/datastore1, /nfs/datastore2, /nfs/nettarkivet1
  • Model ID: deepseek-ai/DeepSeek-R1-Distill-Llama-70B
  • Base model: Llama-3.3-70B-Instruct (distilled)
  • Recommended context cap for Distill family: start with 32k (--max-model-len 32768) unless you know you need more (see Notes).
  • Pinned vLLM container tag: vllm/vllm-openai:v0.10.0 (avoid :latest in production)
  • Service port: 8000 (internal)

You can change paths/tags later; all scripts are parameterized.


1) One-time Slurm Switching Helpers (slightly changed from guide)

Add to ~/.bashrc (if you haven’t already):

# --- Minimal Slurm cluster switching prompt ---
update_prompt() {
    # Always remove existing tags
    PS1="$(echo "$PS1" | sed 's/^(dgx-h200) //;s/^(golem) //')"
    # Only add tag if on DGX
    if [ "$SLURM_CLUSTER" = "dgx-h200" ]; then
        PS1="(dgx-h200) $PS1"
    fi
}

slurm-dgx() {
    export SLURM_CONF=/opt/nb/slurm.conf
    export SLURM_CLUSTER="dgx-h200"
    update_prompt
    echo "Switched to DGX H200 cluster"
}

slurm-golem() {
    export SLURM_CONF=/etc/slurm/slurm.conf
    unset SLURM_CLUSTER
    update_prompt
    echo "Switched to Golem cluster"
}

slurm-status() {
    if [ -z "$SLURM_CLUSTER" ]; then
        echo "Currently on Golem (default cluster)"
    else
        echo "Current cluster: $SLURM_CLUSTER"
        echo "   Config: $SLURM_CONF"
    fi
}

Reload:

source ~/.bashrc
slurm-dgx
slurm-status

2) Create Project Layout on NFS

Pick a consistent layout under NFS so jobs are portable across DGX/Golem:

export USER_BASE=/nfs/datastore1/$USER/deepseek_r1_70b
export MODELS_DIR=$USER_BASE/models
export LOGS_DIR=$USER_BASE/logs
export HF_HOME_DIR=$USER_BASE/hf_home
mkdir -p "$MODELS_DIR" "$LOGS_DIR" "$HF_HOME_DIR"

We will keep the downloaded model under $MODELS_DIR/DeepSeek-R1-Distill-Llama-70B and the Hugging Face cache under $HF_HOME_DIR.


3) Authenticate to Hugging Face (non‑interactive)

Set HF token securely (replace ***):

export HF_TOKEN=***YOUR_HF_READ_TOKEN***
# Optional: persist across future shells (appends to ~/.bashrc)
echo 'export HF_TOKEN=***YOUR_HF_READ_TOKEN***' >> ~/.bashrc

vLLM and huggingface-cli will also pick up HF_HOME / HF_HUB_CACHE. We set them in the scripts.
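
Before submitting any jobs, you can sanity-check that the token is valid. This is a minimal sketch, assuming huggingface_hub is installed in a local Python environment (it is not part of the batch scripts below):

# Optional token sanity check (local helper, not used by the jobs)
import os
from huggingface_hub import HfApi

info = HfApi(token=os.environ["HF_TOKEN"]).whoami()
print("Token OK for user:", info["name"])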


4) Pre‑pull a pinned vLLM container image (recommended)

module add docker
docker pull vllm/vllm-openai:v0.10.0
docker images | grep vllm-openai

Pinning a known‑good tag avoids surprises from upstream changes.


5) Pre‑download the model to NFS (so jobs don’t re‑download)

Submit this model-prep batch job. It installs the latest Hugging Face CLI in a minimal python:3.11-slim container and runs hf download; the download typically takes 10–15 minutes:

File: 01_model_prep.sbatch

#!/bin/bash -l
#SBATCH --job-name=ds_r1_70b_prep
#SBATCH --output=/nfs/datastore1/%u/deepseek_r1_70b/logs/prep_%j.out
#SBATCH --error=/nfs/datastore1/%u/deepseek_r1_70b/logs/prep_%j.err
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=16G
#SBATCH --time=02:00:00
#SBATCH --gres=gpu:0
# #SBATCH --partition=dgx

# --- Safe temporary directory (shared & visible) ---
export TMPDIR=/nfs/datastore1/$USER/tmp
mkdir -p "$TMPDIR"

set -euo pipefail
module add docker

# --- Project directories on shared NFS ---
USER_BASE="/nfs/datastore1/$USER/deepseek_r1_70b"
MODELS_DIR="$USER_BASE/models"
HF_HOME_DIR="$USER_BASE/hf_home"
LOGS_DIR="$USER_BASE/logs"
mkdir -p "$MODELS_DIR" "$HF_HOME_DIR" "$LOGS_DIR"

# --- Print quick environment info ---
echo "===== ENVIRONMENT INFO ====="
echo "Host:        $(hostname)"
echo "Date:        $(date)"
echo "SLURM_CLUSTER: ${SLURM_CLUSTER:-unknown}"
echo "TMPDIR:      $TMPDIR"
echo "MODELS_DIR:  $MODELS_DIR"
echo "HF_HOME_DIR: $HF_HOME_DIR"
echo "Disk usage on /nfs/datastore1:"
df -h /nfs/datastore1 | tail -n 1
echo "============================"

# --- Use minimal Python image and install latest HF CLI ---
docker pull python:3.11-slim

echo "🚀 Installing latest Hugging Face CLI inside container and starting download..."
docker run --rm \
  -e HF_TOKEN="${HF_TOKEN:-}" \
  -v "$MODELS_DIR":/model \
  -v "$HF_HOME_DIR":/hf_home \
  python:3.11-slim \
  bash -c '
    set -e
    export DEBIAN_FRONTEND=noninteractive
    apt-get update -qq && apt-get install -y -qq git curl && rm -rf /var/lib/apt/lists/*
    echo "Python: $(python3 --version)"
    pip install -U "huggingface_hub[cli]" > /dev/null
    echo "✅ huggingface_hub installed. Starting model download..."
    mkdir -p /model/DeepSeek-R1-Distill-Llama-70B
    hf download deepseek-ai/DeepSeek-R1-Distill-Llama-70B \
        --local-dir /model/DeepSeek-R1-Distill-Llama-70B
    echo "✅ Model download complete."
    echo "Listing a few files in /model:"
    ls -lh /model/DeepSeek-R1-Distill-Llama-70B | head
    echo "Total size:"
    du -sh /model/DeepSeek-R1-Distill-Llama-70B || true
  '

echo "📂 Model files stored at: $MODELS_DIR/DeepSeek-R1-Distill-Llama-70B"
echo "===== JOB COMPLETE $(date) ====="

Submit:

slurm-dgx
sbatch 01_model_prep.sbatch
squeue -u $USER
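
When the prep job finishes, it is worth checking that the weights actually landed on NFS. A quick sketch, assuming the default layout from section 2; the BF16 70B weights should total roughly 140 GB:

# Post-download sanity check (sketch; adjust the path if you changed USER_BASE)
import os
from pathlib import Path

model_dir = Path(f"/nfs/datastore1/{os.environ['USER']}/deepseek_r1_70b/models/DeepSeek-R1-Distill-Llama-70B")
shards = sorted(model_dir.glob("*.safetensors"))
total_gb = sum(p.stat().st_size for p in shards) / 1e9
print(f"{len(shards)} safetensors shards, {total_gb:.1f} GB total")
print("config.json present:", (model_dir / "config.json").exists())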

6) Start the vLLM server on 4×H200 (TP=4, BF16 + FP8 KV)

File: 02_vllm_serve_4xH200.sbatch

#!/bin/bash -l
#SBATCH --job-name=ds_r1_70b_vllm
#SBATCH --output=/nfs/datastore1/%u/deepseek_r1_70b/logs/vllm_%j.out
#SBATCH --error=/nfs/datastore1/%u/deepseek_r1_70b/logs/vllm_%j.err
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=24
#SBATCH --mem=200G
#SBATCH --gres=gpu:4
#SBATCH --time=12:00:00
# #SBATCH --partition=dgx

# --- Safe temp directory (shared + visible) ---
export TMPDIR=/nfs/datastore1/$USER/tmp
mkdir -p "$TMPDIR"

set -euo pipefail
module add docker

# --- Paths ---
USER_BASE="/nfs/datastore1/$USER/deepseek_r1_70b"
MODELS_DIR="$USER_BASE/models"
HF_HOME_DIR="$USER_BASE/hf_home"
LOGS_DIR="$USER_BASE/logs"
mkdir -p "$MODELS_DIR" "$HF_HOME_DIR" "$LOGS_DIR"

# --- Networking ---
HOST_PORT="${HOST_PORT:-8000}"

# --- Container paths ---
CONTAINER_MODEL_DIR="/model"
CONTAINER_HF_HOME="/hf_home"

# --- NCCL debug level ---
export NCCL_DEBUG=WARN

# --- Pull pinned image ---
docker pull vllm/vllm-openai:v0.10.0

# --- Run server inside container ---
exec docker run \
  --gpus all \
  --rm \
  --ipc=host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  --shm-size=16g \
  -e HF_TOKEN="${HF_TOKEN:-}" \
  -e HF_HOME="$CONTAINER_HF_HOME" \
  -e NCCL_DEBUG="$NCCL_DEBUG" \
  -v "$MODELS_DIR":"$CONTAINER_MODEL_DIR" \
  -v "$HF_HOME_DIR":"$CONTAINER_HF_HOME" \
  -p "${HOST_PORT}:8000" \
  vllm/vllm-openai:v0.10.0 \
  bash -c 'vllm serve /model/DeepSeek-R1-Distill-Llama-70B \
      --served-model-name DeepSeek-R1-Distill-Llama-70B \
      --tensor-parallel-size 4 \
      --dtype bfloat16 \
      --kv-cache-dtype fp8 \
      --enable-prefix-caching \
      --max-model-len 32768 \
      --gpu-memory-utilization 0.95 \
      --max-num-seqs 512 \
      --block-size 32 \
      --download-dir /model \
      --port 8000 \
      --api-key changeme'

Submit:

slurm-dgx
sbatch 02_vllm_serve_4xH200.sbatch
squeue -u $USER

Notes on the chosen flags

  • --tensor-parallel-size 4: sharded across all 4 GPUs.
  • --dtype bfloat16: H200 BF16 weights are stable/fast.
  • --kv-cache-dtype fp8: halves KV memory and usually improves throughput on Hopper/H200 (rough sizing sketch after this list).
  • --enable-prefix-caching: reuse shared prompt prefixes (RAG, multi‑turn).
  • --max-model-len 32768: practical default for DeepSeek‑R1‑Distill family. Raise only if needed.
  • --gpu-memory-utilization 0.95: gives KV‑cache more headroom.
  • --max-num-seqs 512: high concurrency without fragmentation. Tune with your workload.
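
To make the FP8 KV point concrete, here is a back-of-envelope per-token KV-cache sizing sketch. It assumes the standard Llama-3 70B attention shape (80 layers, 8 KV heads, head dim 128); check the model's config.json for the exact values.

# Rough KV-cache sizing for a Llama-3-70B-shaped model (assumed shape)
layers, kv_heads, head_dim = 80, 8, 128

def kv_bytes_per_token(dtype_bytes):
    # K and V, per layer, per KV head, per head dimension
    return 2 * layers * kv_heads * head_dim * dtype_bytes

for name, nbytes in [("bf16", 2), ("fp8", 1)]:
    per_tok = kv_bytes_per_token(nbytes)
    per_seq_32k = per_tok * 32768 / 2**30
    print(f"{name}: {per_tok / 1024:.0f} KiB/token, ~{per_seq_32k:.0f} GiB per 32k-token sequence")
# bf16: 320 KiB/token, ~10 GiB per 32k-token sequence
# fp8:  160 KiB/token,  ~5 GiB per 32k-token sequence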

7) Using the service (OpenAI‑compatible)

The points below are sketches of how the service is intended to be used; it is not exposed externally yet. To test it, SSH into bcm1.nb.no and then onward to node1 (or whichever node the model is running on).

7.1 Quick curl checks

# Health check
curl -s http://localhost:8000/health

# Chat completion (replace API key if you changed it)
curl http://localhost:8000/v1/chat/completions \
  -H "Authorization: Bearer changeme" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "DeepSeek-R1-Distill-Llama-70B",
    "messages": [
      {"role":"user","content":"Solve 24*37 step by step and give final answer in \\boxed{}"}
    ],
    "temperature": 0.6,
    "top_p": 0.95,
    "max_tokens": 512,
    "stream": true
  }'

7.2 Python (OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="changeme",
)

resp = client.chat.completions.create(
    model="DeepSeek-R1-Distill-Llama-70B",
    messages=[
        {"role": "user", "content": "Please reason step by step and put the final result in \\boxed{}: What is the derivative of x^3?"}
    ],
    temperature=0.6,
    top_p=0.95,
    max_tokens=512,
    stream=False,
)
print(resp.choices[0].message.content)
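
For interactive use you will usually want streaming. A sketch using the same endpoint and API key as above:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="changeme")

# Streaming variant: print tokens as they arrive
stream = client.chat.completions.create(
    model="DeepSeek-R1-Distill-Llama-70B",
    messages=[{"role": "user", "content": "Summarize what vLLM does in two sentences."}],
    temperature=0.6,
    top_p=0.95,
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()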

8) Throughput sanity checks (optional)

For a quick request‑level benchmark from a login node or inside another container, run a lightweight load (do not DDoS your own server):

python - <<'PY'
import concurrent.futures, time, json, requests
URL="http://localhost:8000/v1/chat/completions"
HDR={"Authorization":"Bearer changeme","Content-Type":"application/json"}
PAY={"model":"DeepSeek-R1-Distill-Llama-70B",
     "messages":[{"role":"user","content":"Write one sentence about vLLM."}],
     "max_tokens":64, "temperature":0.0}

def one():
    t=time.time()
    r=requests.post(URL, headers=HDR, data=json.dumps(PAY), timeout=60)
    return time.time()-t, r.status_code

N=64
t0=time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=32) as ex:
    res=list(ex.map(lambda _: one(), range(N)))
dt=time.time()-t0
ok=sum(1 for d,s in res if s==200)
print(f"Requests: {ok}/{N} in {dt:.2f}s -> {ok/dt:.2f} req/s")
PY

For deeper profiling, use vllm bench serve from the image (not shown here to keep this guide concise).
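
If you want a rough tokens-per-second number rather than requests per second, non-streaming responses include a usage field you can read back. A sketch against the same endpoint and API key as above:

# Single-request generation throughput via the usage field
import time, requests

URL = "http://localhost:8000/v1/chat/completions"
HDR = {"Authorization": "Bearer changeme", "Content-Type": "application/json"}
PAY = {"model": "DeepSeek-R1-Distill-Llama-70B",
       "messages": [{"role": "user", "content": "Explain tensor parallelism in three sentences."}],
       "max_tokens": 256, "temperature": 0.0}

t0 = time.time()
r = requests.post(URL, headers=HDR, json=PAY, timeout=120)
dt = time.time() - t0
usage = r.json()["usage"]
print(f"{usage['completion_tokens']} completion tokens in {dt:.2f}s "
      f"-> {usage['completion_tokens'] / dt:.1f} tok/s (single request)")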


9) Operational tips (DGX + Docker + NFS)

  • Use NFS for model + HF cache so that re‑scheduling/retries do not re‑download. We already mount $MODELS_DIR and $HF_HOME_DIR into the container.
  • Pin image tags (we used v0.10.0) to avoid drift.
  • Shard via tensor parallelism for multi‑GPU on a single node (we set -tp 4).
  • Tune concurrency with --max-num-seqs and KV cache headroom with --gpu-memory-utilization.
  • Prefix caching helps repeated prompts/system messages; keep it on unless you see pathologies.
  • FP8 KV cache gives you larger effective context and better batching. Keep it on for Hopper/H200.
  • Logs & metrics go to your --output/--error files; keep them on NFS ($LOGS_DIR) as in the scripts.
  • Firewall / exposure: by default this binds to container :8000 mapped to host :8000. If needed for cross‑node access, coordinate porting/ACLs with the infra team.

10) Sample end‑to‑end: from zero to served

# One-time: switch cluster and create dirs
slurm-dgx
export USER_BASE=/nfs/datastore1/$USER/deepseek_r1_70b
mkdir -p $USER_BASE/{models,logs,hf_home}

# Set HF token (once)
export HF_TOKEN=***YOUR_HF_READ_TOKEN***

# 1) Pre-download weights
sbatch 01_model_prep.sbatch
watch -n 5 "squeue -u $USER"

# 2) Start server on 4×H200
sbatch 02_vllm_serve_4xH200.sbatch
squeue -u $USER

# 3) Test (from the same node or via port if reachable)
curl -s http://localhost:8000/health
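
Loading the 70B weights takes several minutes, so the health endpoint will refuse connections at first. A small readiness poll (a sketch, assuming the server is reachable on localhost:8000):

# Poll the vLLM health endpoint until the server is up (or give up)
import time
import requests

URL = "http://localhost:8000/health"
for attempt in range(120):          # up to ~20 minutes
    try:
        if requests.get(URL, timeout=5).status_code == 200:
            print("Server is ready.")
            break
    except requests.RequestException:
        pass
    time.sleep(10)
else:
    print("Server did not become ready in time; check the Slurm logs on NFS.")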

11) Troubleshooting

  • OOM during load: lower --max-model-len (e.g., 24k), ensure --kv-cache-dtype fp8, and keep --gpu-memory-utilization <= 0.95.
  • Slow TTFT on very long prompts: it’s expected. Keep chunked prefill (enabled by default in modern vLLM), stream responses, and consider cutting the prompt or using prefix caching.
  • Networking: if you cannot reach port 8000 externally, keep usage on the allocated node (ssh into it) or ask infra to expose it.
  • Ray vs MP: vLLM auto‑selects a backend for multi‑GPU single‑node. We keep defaults; you generally don’t need Ray on a single node for serving.
  • Prefix caching vs FP8 KV: if you observe odd behavior with very new vLLM builds, temporarily disable one of them to isolate (keep FP8 KV).

12) When to change defaults

  • Need higher concurrency / throughput: raise --max-num-seqs (measure; 512→1024) and ensure --gpu-memory-utilization 0.95.
  • Need longer context: bump --max-model-len gradually (e.g., 48k → 64k). Expect lower throughput and higher memory.
  • Memory pressure: reduce --max-num-seqs and/or --max-model-len.
  • Stability over squeezing: if you hit edge issues, drop --kv-cache-dtype fp8 to fall back to BF16 KV.

13) Clean shutdown

  • Interactive: Ctrl+C in the foreground shell (container stops).
  • Batch: scancel <jobid>; Docker is launched with --rm so containers are cleaned up automatically.

Appendix A — Minimal bare‑metal (pip) alternative (not recommended here)

If you must run without Docker:

python -m venv ~/venvs/vllm
source ~/venvs/vllm/bin/activate
pip install -U "vllm[cuda]"
export HF_TOKEN=***; export HF_HOME=/nfs/datastore1/$USER/deepseek_r1_70b/hf_home
vllm serve deepseek-ai/DeepSeek-R1-Distill-Llama-70B \
  --tensor-parallel-size 4 --dtype bfloat16 --kv-cache-dtype fp8 \
  --enable-prefix-caching --max-model-len 32768 --gpu-memory-utilization 0.95

Docker remains preferred for reproducibility and ease on DGX.


Appendix B — Small client script for your users

File: client_test.py

from openai import OpenAI
import os

BASE_URL = os.environ.get("VLLM_BASE_URL", "http://localhost:8000/v1")
API_KEY  = os.environ.get("VLLM_API_KEY", "changeme")
MODEL    = os.environ.get("VLLM_MODEL", "DeepSeek-R1-Distill-Llama-70B")

client = OpenAI(base_url=BASE_URL, api_key=API_KEY)

msg = [{"role":"user","content":"Please reason step by step and answer in \\boxed{}: 24*37"}]
r = client.chat.completions.create(model=MODEL, messages=msg, temperature=0.6, top_p=0.95, max_tokens=256)
print(r.choices[0].message.content)

Run:

python client_test.py

Appendix C — Quick performance checklist

  • TP=4 (matches 4×H200)
  • BF16 weights (--dtype bfloat16)
  • FP8 KV cache (--kv-cache-dtype fp8)
  • Prefix caching ON
  • gpu_memory_utilization ~0.95
  • max_num_seqs sized to your concurrency
  • Model + HF cache on NFS, pre‑downloaded
  • Pinned vLLM image tag
  • Logs to NFS

Notes on context length

DeepSeek’s Distill page suggests 32k generation/context settings in examples. The base Llama‑3.3‑70B supports long contexts (128k), but the distillation config and your serving budget may favor 32k as a high‑throughput default. If you raise --max-model-len, do it gradually and monitor memory/throughput.
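
As a very rough illustration of why longer contexts cost throughput, the sketch below estimates how many full-length sequences fit in the FP8 KV cache at different --max-model-len values. It assumes the Llama-3 70B shape used earlier, ~140 GB of BF16 weights, 4×141 GB of HBM, and a crude overhead guess; the server log reports the actual KV-cache capacity at startup, so trust that over this estimate.

# Very rough: full-length sequences that fit in FP8 KV cache on 4x H200
layers, kv_heads, head_dim = 80, 8, 128               # assumed Llama-3 70B shape
kv_per_token = 2 * layers * kv_heads * head_dim * 1   # FP8 -> 1 byte per value
total_hbm = 4 * 141e9                 # 4x H200
budget    = 0.95 * total_hbm          # --gpu-memory-utilization 0.95
weights   = 140e9                     # BF16 70B weights, roughly
overhead  = 20e9                      # activations, graphs, etc. (guess)
kv_budget = budget - weights - overhead

for max_len in (32_768, 65_536, 131_072):
    per_seq = kv_per_token * max_len
    print(f"max_len={max_len:>7}: ~{kv_budget / per_seq:.0f} full-length sequences")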
