LLM fine-tuning with LoRA + NVFP4/MXFP8 on NVIDIA DGX Spark (Blackwell GB10)
Note: This project is a work in progress.
Fine-tune large language models using LoRA adapters with 4-bit/8-bit quantization optimized for NVIDIA DGX Spark and Blackwell GPUs.
- NVFP4 (4-bit): Native Blackwell FP4 training via Transformer Engine
- MXFP8 (8-bit): High-precision training with Transformer Engine
- bitsandbytes FP4: Works on any CUDA GPU
- DGX Spark optimized: Tested on GB10 (~41GB VRAM for 3B model)
- LoRA adapters: Memory-efficient fine-tuning (~240MB output)
- TensorBoard logging: Real-time training metrics
- Extended thinking: `/think` and `/no_think` modes for reasoning models
| Backend | Bits | VRAM (3B model) | GPU Support | Best For |
|---|---|---|---|---|
| bitsandbytes FP4 | 4-bit | ~45GB | Any CUDA GPU | Development |
| Transformer Engine NVFP4 | 4-bit | ~41GB | Blackwell | Production |
| Transformer Engine MXFP8 | 8-bit | ~50GB | Blackwell | Higher precision |
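For orientation, the default bitsandbytes backend corresponds roughly to loading the base model with a 4-bit quantization config like the sketch below. This is illustrative only; the exact settings used by finetune.py's `--use-fp4` path may differ, and the model ID shown is just the project default.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Illustrative sketch of the bitsandbytes FP4 path (--use-fp4);
# finetune.py's actual settings may differ.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="fp4",              # bitsandbytes FP4 ("nf4" is the alternative)
    bnb_4bit_compute_dtype=torch.bfloat16,  # store weights in 4-bit, compute in bf16
)

model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM3-3B",
    quantization_config=bnb_config,
    device_map="auto",
)
```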
# Activate environment
conda activate pytorch
# Train (saves adapter only)
./run_training.sh
# Train + save merged model
python finetune.py --save-merged
# Train with NVFP4 via Docker (Blackwell)
./run_training_docker.sh nvfp4
# Train with MXFP8 via Docker (Blackwell)
./run_training_docker.sh mxfp8

┌──────────────────────────────────────────────────────────────────────────┐
│                            COMPLETE PIPELINE                             │
├──────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  1. TRAIN                 2. EXPORT                3. SERVE              │
│  ───────────────────      ───────────────────      ──────────────        │
│                                                                          │
│  ./run_training_docker.sh ./run_export_nvfp4.sh    ./run_serve.sh        │
│      nvfp4                                                               │
│           │                        │                        │            │
│           ▼                        ▼                        ▼            │
│  ┌─────────────────┐      ┌─────────────────┐      ┌─────────────────┐   │
│  │  LoRA Adapter   │─────▶│   NVFP4 Model   │─────▶│   OpenAI API    │   │
│  │ (~462MB, bf16)  │      │  (~1.5GB, FP4)  │      │ localhost:8000  │   │
│  └─────────────────┘      └─────────────────┘      └─────────────────┘   │
│                                                                          │
│  Training with TE         Merge + Quantize         TensorRT-LLM          │
│  NVFP4 compute            nvidia-modelopt          OpenAI compatible     │
│  ~41GB VRAM                                                              │
│                                                                          │
└──────────────────────────────────────────────────────────────────────────┘
# Step 1: Train with NVFP4 (Blackwell optimized)
./run_training_docker.sh nvfp4
# Step 2: Export to NVFP4 format for TensorRT-LLM
./run_export_nvfp4.sh
# Step 3: Serve with OpenAI-compatible API
./run_serve.sh
# Step 4: Use the API
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "smollm3-3b-nvfp4", "messages": [{"role": "user", "content": "Hello!"}]}'# Interactive chat with fine-tuned model
# Interactive chat with fine-tuned model
./run_inference_docker.sh nvfp4
# Single prompt
./run_inference_docker.sh nvfp4 "" "Explain machine learning"- NVIDIA GPU with CUDA support (Blackwell recommended for NVFP4/MXFP4)
- Miniconda
- Docker (for Transformer Engine)
# Create environment
conda create -n pytorch python=3.11 -y
conda activate pytorch
# Install PyTorch
pip install torch torchvision torchaudio
# Install dependencies
pip install transformers datasets accelerate peft trl bitsandbytes tensorboard
# Verify
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, GPU: {torch.cuda.get_device_name(0)}')"# Pull NVIDIA PyTorch container (includes Transformer Engine 2.9+)
docker pull nvcr.io/nvidia/pytorch:25.11-py3
# Pull TensorRT-LLM container for serving (OpenAI API)
docker pull nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev
python finetune.py [OPTIONS]
# Model and dataset
--model ID # HuggingFace model ID (default: HuggingFaceTB/SmolLM3-3B)
--dataset ID # HuggingFace dataset ID (default: TeichAI/claude-4.5-opus-high-reasoning-250x)
--output-dir PATH # Output directory for LoRA adapter (default: ./output/smollm3-3b-reasoning-lora)
# Quantization backends (pick one)
--use-fp4 # bitsandbytes FP4 4-bit (default, any GPU)
--use-nvfp4 # Transformer Engine NVFP4 4-bit (Blackwell + Docker)
--use-mxfp8 # Transformer Engine MXFP8 8-bit (Blackwell + Docker)
# Output options
--save-merged # Save merged model (adapter + base)
--merged-output PATH # Path for merged model (default: ./output/smollm3-3b-merged)
# Basic training with bitsandbytes FP4 (any GPU)
python finetune.py
# Training with NVFP4 4-bit (Blackwell, inside Docker)
python finetune.py --use-nvfp4
# Training with MXFP8 8-bit (Blackwell, inside Docker)
python finetune.py --use-mxfp8
# Training + save merged model
python finetune.py --use-nvfp4 --save-merged
# Custom model and dataset
python finetune.py --use-nvfp4 --model meta-llama/Llama-3.2-3B --dataset your-org/your-dataset
# Custom output directory
python finetune.py --use-nvfp4 --output-dir ./output/my-custom-lora
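When swapping in your own dataset, it can help to inspect a sample first to confirm the format matches what the training script expects. A minimal sketch using the project's default dataset ID (assumes the dataset has a `train` split; `your-org/your-dataset` from the example above would work the same way):

```python
from datasets import load_dataset

# Load the default dataset and inspect one record before training.
ds = load_dataset("TeichAI/claude-4.5-opus-high-reasoning-250x", split="train")
print(ds)     # dataset size and column names
print(ds[0])  # one raw sample, to check the expected fields
```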
# NVFP4 inference - interactive mode
./run_inference_docker.sh nvfp4
# NVFP4 inference - single prompt
./run_inference_docker.sh nvfp4 "" "Explain quantum computing"
# MXFP8 inference
./run_inference_docker.sh mxfp8
# Custom adapter path
./run_inference_docker.sh nvfp4 ./output/my-custom-adapter
# Basic inference (FP4)
python inference.py --adapter ./output/smollm3-3b-reasoning-lora --prompt "Hello"
# Without extended thinking
python inference.py --adapter ./output/smollm3-3b-reasoning-lora --prompt "What is 2+2?" --no-think
# Interactive mode
python inference.py --adapter ./output/smollm3-3b-reasoning-lora
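Under the hood, loading a LoRA adapter for inference amounts to attaching it to the base model with peft. A minimal sketch (illustrative; inference.py's actual loading logic, dtype, and quantization setup may differ):

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "HuggingFaceTB/SmolLM3-3B"
adapter_path = "./output/smollm3-3b-reasoning-lora"

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Attach the LoRA adapter produced by finetune.py.
model = PeftModel.from_pretrained(base, adapter_path)
```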
The model supports extended thinking with /think and /no_think flags:
# Enable thinking (default) - detailed reasoning
python inference.py --prompt "Explain quantum computing"
# Disable thinking - direct answers
python inference.py --prompt "What is 2+2?" --no-thinkIn interactive mode, prefix your prompt:
User: /think Explain AI
User: /no_think What is 5+5?
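The prefix convention above can be reproduced directly with the tokenizer's chat template to see what the model actually receives. A sketch (whether inference.py uses this prefix or a template argument internally is an assumption here):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")

# Mirror the interactive-mode convention: prefix the user turn with the flag.
messages = [{"role": "user", "content": "/no_think What is 5+5?"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)  # inspect how the template renders the thinking flag
```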
python inference.py [OPTIONS]
--adapter PATH # Path to LoRA adapter
--backend [fp4|nvfp4-te] # Quantization backend
--prompt TEXT # Single prompt (omit for interactive)
--no-think # Disable extended thinking mode
--max-tokens INT # Max new tokens (default: 2048)
--temperature FLOAT # Temperature (default: 0.7)
--top-p FLOAT # Top-p sampling (default: 0.9)
# Export options
--merge # Merge LoRA with base model
--export-nvfp4 # Export to NVFP4 for TensorRT-LLM
# Export fine-tuned model to NVFP4 format for TensorRT-LLM
./run_export_nvfp4.sh
# With custom paths
./run_export_nvfp4.sh ./output/smollm3-3b-reasoning-nvfp4-lora ./output/merged ./output/nvfp4
This script:
- Merges LoRA adapter with base model (see the sketch after this list)
- Quantizes to NVFP4 using nvidia-modelopt
- Outputs TensorRT-LLM compatible model
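A minimal sketch of the merge step with peft, using the project's default paths. The NVFP4 quantization step itself is handled by nvidia-modelopt inside the script and is not reproduced here:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "HuggingFaceTB/SmolLM3-3B"
adapter_path = "./output/smollm3-3b-reasoning-nvfp4-lora"

base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, adapter_path)

# Fold the LoRA weights into the base model and save a standalone checkpoint.
merged = model.merge_and_unload()
merged.save_pretrained("./output/merged")
AutoTokenizer.from_pretrained(base_id).save_pretrained("./output/merged")
```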
# Start server on port 8000
./run_serve.sh
# Custom port and batch size
./run_serve.sh ./output/smollm3-3b-nvfp4 8080 8
# Health check
curl http://localhost:8000/health
# Chat completion (OpenAI compatible)
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "smollm3-3b-nvfp4",
"messages": [{"role": "user", "content": "Hello!"}]
}'
# Merge only
python inference.py --merge --adapter ./output/smollm3-3b-reasoning-nvfp4-lora
# Export to NVFP4
python inference.py --export-nvfp4
./output/
├── smollm3-3b-reasoning-lora/ # bitsandbytes FP4 adapter (~240MB)
│ └── logs/ # TensorBoard logs
├── smollm3-3b-reasoning-nvfp4-lora/ # NVFP4 4-bit adapter
│ └── logs/ # TensorBoard logs
├── smollm3-3b-reasoning-mxfp8-lora/ # MXFP8 8-bit adapter
│ └── logs/ # TensorBoard logs
├── smollm3-3b-merged/ # Merged model (~6GB)
└── smollm3-3b-nvfp4/ # NVFP4 export for TensorRT-LLM
| Output | Size | Use Case |
|---|---|---|
| LoRA Adapter | ~240MB | Development, multiple versions |
| Merged Model | ~6GB | Production inference |
| NVFP4 Export | ~1.5GB | TensorRT-LLM deployment |
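These sizes follow from simple arithmetic: a ~3B-parameter model takes roughly 3B × 2 bytes ≈ 6 GB in bf16 (the merged model) and roughly 3B × 0.5 bytes ≈ 1.5 GB at 4 bits per weight (the NVFP4 export), plus small overheads for embeddings and quantization scales.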
# Monitor specific training
tensorboard --logdir ./output/smollm3-3b-reasoning-mxfp8-lora/logs --port 6006
# Monitor all trainings at once
tensorboard --logdir ./output --port 6006
Edit finetune.py to adjust:
| Parameter | Default | Description |
|---|---|---|
| `MAX_SEQ_LENGTH` | 8192 | Max sequence length |
| `per_device_train_batch_size` | 16 | Batch size per GPU |
| `gradient_accumulation_steps` | 1 | Effective batch = batch × accumulation |
| `num_train_epochs` | 3 | Training epochs |
| `learning_rate` | 2e-4 | Learning rate |
| `r` (LoRA rank) | 64 | LoRA rank |
| `lora_alpha` | 128 | LoRA alpha scaling |
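The LoRA-specific values map onto a peft LoraConfig along these lines. This is a sketch: the dropout value and the target module list are assumptions (a typical choice for Llama-style models), since the actual choices live in finetune.py:

```python
from peft import LoraConfig

# Sketch of a LoRA configuration matching the table above.
lora_config = LoraConfig(
    r=64,                # LoRA rank
    lora_alpha=128,      # alpha scaling (effective scale = alpha / r)
    lora_dropout=0.05,   # assumed value; not specified in the table
    bias="none",
    task_type="CAUSAL_LM",
    # Assumed target modules; finetune.py may target a different set.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
```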
For OOM errors:
# In finetune.py
MAX_SEQ_LENGTH = 4096 # Reduce from 8192
per_device_train_batch_size = 1 # Reduce from 16
gradient_accumulation_steps = 8 # Increase to maintain effective batch
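With these values the effective batch size is 1 × 8 = 8, versus the default 16 × 1 = 16, so peak memory drops substantially while the optimizer still sees reasonably large batches.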
.
├── finetune.py # Main training script
├── inference.py # Inference + merge + export
├── run_training.sh # bitsandbytes FP4 (Conda, any GPU)
├── run_training_docker.sh # NVFP4/MXFP8 training (Docker, Blackwell)
├── run_inference_docker.sh # Inference with Docker
├── run_export_nvfp4.sh # Export to NVFP4 for TensorRT-LLM
├── run_serve.sh # Serve with TensorRT-LLM (OpenAI API)
├── nvfp4.py # Custom NVFP4 implementation (reference)
├── quantize_nvfp4_tensorrt.py # TensorRT export script
└── README.md
# Reduce sequence length and batch size in finetune.py
MAX_SEQ_LENGTH = 4096
per_device_train_batch_size = 1
# Docker permission denied? Add your user to the docker group
sudo usermod -aG docker $USER
# Log out and log back in for the group change to take effect
If Transformer Engine is unavailable locally, use Docker instead of a local installation:
./run_training_docker.sh nvfp4
For gated models:
export HF_TOKEN="your_token_here"
# Fast iteration with bitsandbytes FP4
./run_training.sh
python inference.py --adapter ./output/smollm3-3b-reasoning-lora
# Full pipeline: Train → Export → Serve
./run_training_docker.sh nvfp4
./run_export_nvfp4.sh
./run_serve.sh
# Test inference without serving
./run_inference_docker.sh nvfp4
References:
- DGX Spark Playbooks - Official NVIDIA examples
- SmolLM3-3B
- NVIDIA Transformer Engine
- LoRA: Low-Rank Adaptation
- bitsandbytes
- TensorRT-LLM