LLM fine-tuning with LoRA + NVFP4/MXFP8 on NVIDIA DGX Spark (Blackwell GB10)
Note: This project is a work in progress.
Fine-tune large language models using LoRA adapters with 4-bit/8-bit quantization optimized for NVIDIA DGX Spark and Blackwell GPUs.
- NVFP4 (4-bit): Native Blackwell FP4 training via Transformer Engine
- MXFP8 (8-bit): High-precision training with Transformer Engine
- bitsandbytes FP4: Works on any CUDA GPU
- DGX Spark optimized: Tested on GB10 (~41GB VRAM for 3B model)
- LoRA adapters: Memory-efficient fine-tuning (~240MB output)
- TensorBoard logging: Real-time training metrics
- Extended thinking: `/think` and `/no_think` modes for reasoning models
| Backend | Bits | VRAM (3B model) | GPU Support | Best For |
|---|---|---|---|---|
| bitsandbytes FP4 | 4-bit | ~45GB | Any CUDA GPU | Development |
| Transformer Engine NVFP4 | 4-bit | ~41GB | Blackwell | Production |
| Transformer Engine MXFP8 | 8-bit | ~50GB | Blackwell | Higher precision |
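For orientation, the default bitsandbytes backend corresponds roughly to loading the base model with a 4-bit quantization config like the sketch below. This is illustrative only; the exact settings used by finetune.py's `--use-fp4` path may differ, and the model ID shown is just the project default.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Illustrative sketch of the bitsandbytes FP4 path (--use-fp4);
# finetune.py's actual settings may differ.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="fp4",              # bitsandbytes FP4 ("nf4" is the alternative)
    bnb_4bit_compute_dtype=torch.bfloat16,  # store weights in 4-bit, compute in bf16
)

model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM3-3B",
    quantization_config=bnb_config,
    device_map="auto",
)
```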
# Activate environment
conda activate pytorch
# Train (saves adapter only)
./run_training.sh
# Train + save merged model
python finetune.py --save-merged
# Train with NVFP4 via Docker (Blackwell)
./run_training_docker.sh nvfp4
# Train with MXFP8 via Docker (Blackwell)
./run_training_docker.sh mxfp8

┌──────────────────────────────────────────────────────────────────────────┐
│                            COMPLETE PIPELINE                             │
├──────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  1. TRAIN                 2. EXPORT                3. SERVE              │
│  ───────────────────      ───────────────────      ──────────────        │
│                                                                          │
│  ./run_training_docker.sh ./run_export_nvfp4.sh    ./run_serve.sh        │
│      nvfp4                                                               │
│           │                        │                        │            │
│           ▼                        ▼                        ▼            │
│  ┌─────────────────┐      ┌─────────────────┐      ┌─────────────────┐   │
│  │  LoRA Adapter   │─────▶│   NVFP4 Model   │─────▶│   OpenAI API    │   │
│  │ (~462MB, bf16)  │      │  (~1.5GB, FP4)  │      │ localhost:8000  │   │
│  └─────────────────┘      └─────────────────┘      └─────────────────┘   │
│                                                                          │
│  Training with TE         Merge + Quantize         TensorRT-LLM          │
│  NVFP4 compute            nvidia-modelopt          OpenAI compatible     │
│  ~41GB VRAM                                                              │
│                                                                          │
└──────────────────────────────────────────────────────────────────────────┘
# Step 1: Train with NVFP4 (Blackwell optimized)
./run_training_docker.sh nvfp4
# Step 2: Export to NVFP4 format for TensorRT-LLM
./run_export_nvfp4.sh
# Step 3: Serve with OpenAI-compatible API
./run_serve.sh
# Step 4: Use the API
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "smollm3-3b-nvfp4", "messages": [{"role": "user", "content": "Hello!"}]}'# Interactive chat with fine-tuned model
# Interactive chat with fine-tuned model
./run_inference_docker.sh nvfp4
# Single prompt
./run_inference_docker.sh nvfp4 "" "Explain machine learning"- NVIDIA GPU with CUDA support (Blackwell recommended for NVFP4/MXFP4)
- Miniconda
- Docker (for Transformer Engine)
# Create environment
conda create -n pytorch python=3.11 -y
conda activate pytorch
# Install PyTorch
pip install torch torchvision torchaudio
# Install dependencies
pip install transformers datasets accelerate peft trl bitsandbytes tensorboard
# Verify
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, GPU: {torch.cuda.get_device_name(0)}')"# Pull NVIDIA PyTorch container (includes Transformer Engine 2.9+)
docker pull nvcr.io/nvidia/pytorch:25.11-py3
# Pull TensorRT-LLM container for serving (OpenAI API)
docker pull nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev
python finetune.py [OPTIONS]
# Model and dataset
--model ID # HuggingFace model ID (default: HuggingFaceTB/SmolLM3-3B)
--dataset ID # HuggingFace dataset ID (default: TeichAI/claude-4.5-opus-high-reasoning-250x)
--output-dir PATH # Output directory for LoRA adapter (default: ./output/smollm3-3b-reasoning-lora)
# Quantization backends (pick one)
--use-fp4 # bitsandbytes FP4 4-bit (default, any GPU)
--use-nvfp4 # Transformer Engine NVFP4 4-bit (Blackwell + Docker)
--use-mxfp8 # Transformer Engine MXFP8 8-bit (Blackwell + Docker)
# Output options
--save-merged # Save merged model (adapter + base)
--merged-output PATH # Path for merged model (default: ./output/smollm3-3b-merged)
# Basic training with bitsandbytes FP4 (any GPU)
python finetune.py
# Training with NVFP4 4-bit (Blackwell, inside Docker)
python finetune.py --use-nvfp4
# Training with MXFP8 8-bit (Blackwell, inside Docker)
python finetune.py --use-mxfp8
# Training + save merged model
python finetune.py --use-nvfp4 --save-merged
# Custom model and dataset
python finetune.py --use-nvfp4 --model meta-llama/Llama-3.2-3B --dataset your-org/your-dataset
# Custom output directory
python finetune.py --use-nvfp4 --output-dir ./output/my-custom-lora
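When swapping in your own dataset, it can help to inspect a sample first to confirm the format matches what the training script expects. A minimal sketch using the project's default dataset ID (assumes the dataset has a `train` split; `your-org/your-dataset` from the example above would work the same way):

```python
from datasets import load_dataset

# Load the default dataset and inspect one record before training.
ds = load_dataset("TeichAI/claude-4.5-opus-high-reasoning-250x", split="train")
print(ds)     # dataset size and column names
print(ds[0])  # one raw sample, to check the expected fields
```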
# NVFP4 inference - interactive mode
./run_inference_docker.sh nvfp4
# NVFP4 inference - single prompt
./run_inference_docker.sh nvfp4 "" "Explain quantum computing"
# MXFP8 inference
./run_inference_docker.sh mxfp8
# Custom adapter path
./run_inference_docker.sh nvfp4 ./output/my-custom-adapter
# Basic inference (FP4)
python inference.py --adapter ./output/smollm3-3b-reasoning-lora --prompt "Hello"
# Without extended thinking
python inference.py --adapter ./output/smollm3-3b-reasoning-lora --prompt "What is 2+2?" --no-think
# Interactive mode
python inference.py --adapter ./output/smollm3-3b-reasoning-lora
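Under the hood, loading a LoRA adapter for inference amounts to attaching it to the base model with peft. A minimal sketch (illustrative; inference.py's actual loading logic, dtype, and quantization setup may differ):

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "HuggingFaceTB/SmolLM3-3B"
adapter_path = "./output/smollm3-3b-reasoning-lora"

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Attach the LoRA adapter produced by finetune.py.
model = PeftModel.from_pretrained(base, adapter_path)
```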
The model supports extended thinking with /think and /no_think flags:
# Enable thinking (default) - detailed reasoning
python inference.py --prompt "Explain quantum computing"
# Disable thinking - direct answers
python inference.py --prompt "What is 2+2?" --no-thinkIn interactive mode, prefix your prompt:
User: /think Explain AI
User: /no_think What is 5+5?
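The prefix convention above can be reproduced directly with the tokenizer's chat template to see what the model actually receives. A sketch (whether inference.py uses this prefix or a template argument internally is an assumption here):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")

# Mirror the interactive-mode convention: prefix the user turn with the flag.
messages = [{"role": "user", "content": "/no_think What is 5+5?"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)  # inspect how the template renders the thinking flag
```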
python inference.py [OPTIONS]
--adapter PATH # Path to LoRA adapter
--backend [fp4|nvfp4-te] # Quantization backend
--prompt TEXT # Single prompt (omit for interactive)
--no-think # Disable extended thinking mode
--max-tokens INT # Max new tokens (default: 2048)
--temperature FLOAT # Temperature (default: 0.7)
--top-p FLOAT # Top-p sampling (default: 0.9)
# Export options
--merge # Merge LoRA with base model
--export-nvfp4 # Export to NVFP4 for TensorRT-LLM
# Export fine-tuned model to NVFP4 format for TensorRT-LLM
./run_export_nvfp4.sh
# With custom paths
./run_export_nvfp4.sh ./output/smollm3-3b-reasoning-nvfp4-lora ./output/merged ./output/nvfp4
This script:
- Merges LoRA adapter with base model (see the sketch after this list)
- Quantizes to NVFP4 using nvidia-modelopt
- Outputs TensorRT-LLM compatible model
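A minimal sketch of the merge step with peft, using the project's default paths. The NVFP4 quantization step itself is handled by nvidia-modelopt inside the script and is not reproduced here:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "HuggingFaceTB/SmolLM3-3B"
adapter_path = "./output/smollm3-3b-reasoning-nvfp4-lora"

base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, adapter_path)

# Fold the LoRA weights into the base model and save a standalone checkpoint.
merged = model.merge_and_unload()
merged.save_pretrained("./output/merged")
AutoTokenizer.from_pretrained(base_id).save_pretrained("./output/merged")
```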
# Start server on port 8000
./run_serve.sh
# Custom port and batch size
./run_serve.sh ./output/smollm3-3b-nvfp4 8080 8
# Health check
curl http://localhost:8000/health
# Chat completion (OpenAI compatible)
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "smollm3-3b-nvfp4",
"messages": [{"role": "user", "content": "Hello!"}]
}'
# Merge only
python inference.py --merge --adapter ./output/smollm3-3b-reasoning-nvfp4-lora
# Export to NVFP4
python inference.py --export-nvfp4
./output/
├── smollm3-3b-reasoning-lora/ # bitsandbytes FP4 adapter (~240MB)
│ └── logs/ # TensorBoard logs
├── smollm3-3b-reasoning-nvfp4-lora/ # NVFP4 4-bit adapter
│ └── logs/ # TensorBoard logs
├── smollm3-3b-reasoning-mxfp8-lora/ # MXFP8 8-bit adapter
│ └── logs/ # TensorBoard logs
├── smollm3-3b-merged/ # Merged model (~6GB)
└── smollm3-3b-nvfp4/ # NVFP4 export for TensorRT-LLM
| Output | Size | Use Case |
|---|---|---|
| LoRA Adapter | ~240MB | Development, multiple versions |
| Merged Model | ~6GB | Production inference |
| NVFP4 Export | ~1.5GB | TensorRT-LLM deployment |
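These sizes follow from simple arithmetic: a ~3B-parameter model takes roughly 3B × 2 bytes ≈ 6 GB in bf16 (the merged model) and roughly 3B × 0.5 bytes ≈ 1.5 GB at 4 bits per weight (the NVFP4 export), plus small overheads for embeddings and quantization scales.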
# Monitor specific training
tensorboard --logdir ./output/smollm3-3b-reasoning-mxfp8-lora/logs --port 6006
# Monitor all trainings at once
tensorboard --logdir ./output --port 6006
Edit finetune.py to adjust:
| Parameter | Default | Description |
|---|---|---|
| `MAX_SEQ_LENGTH` | 8192 | Max sequence length |
| `per_device_train_batch_size` | 16 | Batch size per GPU |
| `gradient_accumulation_steps` | 1 | Effective batch = batch × accumulation |
| `num_train_epochs` | 3 | Training epochs |
| `learning_rate` | 2e-4 | Learning rate |
| `r` (LoRA rank) | 64 | LoRA rank |
| `lora_alpha` | 128 | LoRA alpha scaling |
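The LoRA-specific values map onto a peft LoraConfig along these lines. This is a sketch: the dropout value and the target module list are assumptions (a typical choice for Llama-style models), since the actual choices live in finetune.py:

```python
from peft import LoraConfig

# Sketch of a LoRA configuration matching the table above.
lora_config = LoraConfig(
    r=64,                # LoRA rank
    lora_alpha=128,      # alpha scaling (effective scale = alpha / r)
    lora_dropout=0.05,   # assumed value; not specified in the table
    bias="none",
    task_type="CAUSAL_LM",
    # Assumed target modules; finetune.py may target a different set.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
```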
For OOM errors:
# In finetune.py
MAX_SEQ_LENGTH = 4096 # Reduce from 8192
per_device_train_batch_size = 1 # Reduce from 16
gradient_accumulation_steps = 8 # Increase to maintain effective batch
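With these values the effective batch size is 1 × 8 = 8, versus the default 16 × 1 = 16, so peak memory drops substantially while the optimizer still sees reasonably large batches.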
.
├── finetune.py # Main training script
├── inference.py # Inference + merge + export
├── run_training.sh # bitsandbytes FP4 (Conda, any GPU)
├── run_training_docker.sh # NVFP4/MXFP8 training (Docker, Blackwell)
├── run_inference_docker.sh # Inference with Docker
├── run_export_nvfp4.sh # Export to NVFP4 for TensorRT-LLM
├── run_serve.sh # Serve with TensorRT-LLM (OpenAI API)
├── nvfp4.py # Custom NVFP4 implementation (reference)
├── quantize_nvfp4_tensorrt.py # TensorRT export script
└── README.md
# Reduce sequence length and batch size in finetune.py
MAX_SEQ_LENGTH = 4096
per_device_train_batch_size = 1
# Docker permission denied? Add your user to the docker group
sudo usermod -aG docker $USER
# Log out and log back in for the group change to take effect
If Transformer Engine is unavailable locally, use Docker instead of a local installation:
./run_training_docker.sh nvfp4
For gated models:
export HF_TOKEN="your_token_here"
# Fast iteration with bitsandbytes FP4
./run_training.sh
python inference.py --adapter ./output/smollm3-3b-reasoning-lora
# Full pipeline: Train → Export → Serve
./run_training_docker.sh nvfp4
./run_export_nvfp4.sh
./run_serve.sh
# Test inference without serving
./run_inference_docker.sh nvfp4
References:
- DGX Spark Playbooks - Official NVIDIA examples
- SmolLM3-3B
- NVIDIA Transformer Engine
- LoRA: Low-Rank Adaptation
- bitsandbytes
- TensorRT-LLM