This guide covers training LoRAs on FLUX.2, Black Forest Labs' latest image generation model family.
**Note:** The default model flavour is `klein-9b`, but this guide focuses on `dev` (the full 12B transformer with 24B Mistral-3 text encoder) since it has the highest resource requirements. Klein models are easier to run - see Model Variants below.
FLUX.2 comes in three variants:
| Variant | Transformer | Text Encoder | Total Blocks | Default |
|---|---|---|---|---|
| `dev` | 12B params | Mistral-3 (24B) | 56 (8+48) | |
| `klein-9b` | 9B params | Qwen3 (bundled) | 32 (8+24) | ✓ |
| `klein-4b` | 4B params | Qwen3 (bundled) | 25 (5+20) | |
Key differences:
- dev: Uses standalone Mistral-Small-3.1-24B text encoder, has guidance embeddings
- klein models: Use Qwen3 text encoder bundled in the model repo, no guidance embeddings (guidance training options are ignored)
To select a variant, set `model_flavour` in your config:

```json
{
  "model_flavour": "dev"
}
```

**Important:** For `klein-4b` and `klein-9b`, leave `pretrained_text_encoder_model_name_or_path` unset unless you intentionally want to replace the bundled Qwen3 text encoder. Setting that field overrides the Klein default and can trigger downloads of a different text encoder.
FLUX.2-dev introduces significant architectural changes from FLUX.1:
- Text Encoder: Mistral-Small-3.1-24B (dev) or Qwen3 (klein)
- Architecture: 8 DoubleStreamBlocks + 48 SingleStreamBlocks (dev)
- Latent Channels: 32 VAE channels → 128 after pixel shuffle (vs 16 in FLUX.1)
- VAE: Custom VAE with batch normalization and pixel shuffling
- Embedding Dimension: 15,360 for dev (3×5,120), 12,288 for klein-9b (3×4,096), 7,680 for klein-4b (3×2,560)
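The embedding-dimension figures above can be sanity-checked directly; the ×3 factor corresponds to three concatenated projections of the per-variant hidden size (an inference from the quoted numbers, not a claim about the model internals):

```shell
# Verify the quoted embedding dimensions (3 x per-variant hidden size)
echo "dev:      $((3 * 5120))"   # 15360
echo "klein-9b: $((3 * 4096))"   # 12288
echo "klein-4b: $((3 * 2560))"   # 7680
```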
Hardware requirements vary significantly by model variant.
Klein models are much more accessible:
| Variant | bf16 VRAM | int8 VRAM | System RAM |
|---|---|---|---|
| `klein-4b` | ~12GB | ~8GB | 32GB+ |
| `klein-9b` | ~22GB | ~14GB | 64GB+ |

- Recommended for `klein-9b`: a single 24GB GPU (RTX 3090/4090, A5000)
- Recommended for `klein-4b`: a single 16GB GPU (RTX 4080, A4000)
FLUX.2-dev has significant resource requirements due to the Mistral-3 text encoder:
The 24B Mistral text encoder alone requires significant VRAM:
| Component | bf16 | int8 | int4 |
|---|---|---|---|
| Mistral-3 (24B) | ~48GB | ~24GB | ~12GB |
| FLUX.2 Transformer | ~24GB | ~12GB | ~6GB |
| VAE + overhead | ~4GB | ~4GB | ~4GB |
| Configuration | Approximate Total VRAM |
|---|---|
| bf16 everything | ~76GB+ |
| int8 text encoder + bf16 transformer | ~52GB |
| int8 everything | ~40GB |
| int4 text encoder + int8 transformer | ~22GB |
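As a sketch, the ~22GB configuration in the last row maps onto the quantization keys used later in this guide (VRAM totals are approximate and depend on resolution and batch size):

```json
{
  "text_encoder_1_precision": "int4-quanto",
  "base_model_precision": "int8-quanto",
  "base_model_default_dtype": "bf16"
}
```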
- Minimum: 96GB system RAM (loading 24B text encoder requires substantial memory)
- Recommended: 128GB+ for comfortable operation
- Minimum: 2x 48GB GPUs (A6000, L40S) with FSDP2 or DeepSpeed
- Recommended: 4x H100 80GB with fp8-torchao
- With heavy quantization (int4): 2x 24GB GPUs may work but is experimental
Multi-GPU distributed training (FSDP2 or DeepSpeed) is essentially required for FLUX.2-dev due to the combined size of the Mistral-3 text encoder and transformer.
FLUX.2 requires Python 3.10 or later with recent transformers:

```bash
python --version  # Should be 3.10+
pip install 'transformers>=4.45.0'
```

FLUX.2 models require access approval on Hugging Face:
For dev:
- Visit black-forest-labs/FLUX.2-dev
- Accept the license agreement
For klein models:
- Visit black-forest-labs/FLUX.2-klein-base-9B or black-forest-labs/FLUX.2-klein-base-4B
- Accept the license agreement
Ensure you're logged in to the Hugging Face CLI: `huggingface-cli login`
```bash
pip install 'simpletuner[cuda]'

# CUDA 13 / Blackwell users (NVIDIA B-series GPUs)
pip install 'simpletuner[cuda13]' --extra-index-url https://download.pytorch.org/whl/cu130
```

For development setup:

```bash
git clone https://github.com/bghira/SimpleTuner
cd SimpleTuner
pip install -e ".[cuda]"
```

To use the web UI, run:

```bash
simpletuner server
```

Access http://localhost:8001 and select FLUX.2 as the model family.
Create config/config.json:
View example config
```json
{
  "model_type": "lora",
  "model_family": "flux2",
  "model_flavour": "dev",
  "pretrained_model_name_or_path": "black-forest-labs/FLUX.2-dev",
  "output_dir": "/path/to/output",
  "train_batch_size": 1,
  "gradient_accumulation_steps": 1,
  "gradient_checkpointing": true,
  "mixed_precision": "bf16",
  "learning_rate": 1e-4,
  "lr_scheduler": "constant",
  "max_train_steps": 10000,
  "validation_resolution": "1024x1024",
  "validation_num_inference_steps": 20,
  "flux_guidance_mode": "constant",
  "flux_guidance_value": 1.0,
  "lora_rank": 16
}
```

**Note:** Klein models (`klein-4b`, `klein-9b`) do not have guidance embeddings. The following guidance options only apply to `dev`.
FLUX.2-dev uses guidance embedding similar to FLUX.1:
View example config
```json
{
  "flux_guidance_mode": "constant",
  "flux_guidance_value": 1.0
}
```

Or for random guidance during training:
View example config
```json
{
  "flux_guidance_mode": "random-range",
  "flux_guidance_min": 1.0,
  "flux_guidance_max": 5.0
}
```

For reduced VRAM usage:
View example config
```json
{
  "base_model_precision": "int8-quanto",
  "text_encoder_1_precision": "int8-quanto",
  "base_model_default_dtype": "bf16"
}
```

FLUX.2 supports TREAD for faster training:
View example config
```json
{
  "tread_config": {
    "routes": [
      {"selection_ratio": 0.5, "start_layer_idx": 2, "end_layer_idx": -2}
    ]
  }
}
```

Show advanced experimental details
SimpleTuner includes experimental features that can significantly improve training stability and performance.
- Scheduled Sampling (Rollout): reduces exposure bias and improves output quality by letting the model generate its own inputs during training.
⚠️ These features increase the computational overhead of training.
Create config/multidatabackend.json:
View example config
```json
[
  {
    "id": "my-dataset",
    "type": "local",
    "crop": true,
    "crop_aspect": "square",
    "crop_style": "center",
    "resolution": 1024,
    "minimum_image_size": 1024,
    "maximum_image_size": 1024,
    "resolution_type": "pixel_area",
    "cache_dir_vae": "cache/vae/flux2/my-dataset",
    "instance_data_dir": "datasets/my-dataset",
    "caption_strategy": "textfile",
    "metadata_backend": "discovery",
    "repeats": 10
  },
  {
    "id": "text-embeds",
    "type": "local",
    "dataset_type": "text_embeds",
    "default": true,
    "cache_dir": "cache/text/flux2",
    "write_batch_size": 64
  }
]
```

See `caption_strategy` options and requirements in DATALOADER.md.
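With `caption_strategy` set to `textfile`, each image in `instance_data_dir` is paired with a `.txt` caption file of the same basename. A minimal sketch (file and caption contents are placeholders):

```shell
# Each image (e.g. photo1.png) gets a sibling photo1.txt holding its caption.
mkdir -p datasets/my-dataset
echo "a professional portrait photograph of sks person" \
  > datasets/my-dataset/photo1.txt
cat datasets/my-dataset/photo1.txt
```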
FLUX.2 can train either plain text-to-image (no conditioning) or with paired reference/edit images. To add conditioning, pair your main dataset to one or more conditioning datasets using conditioning_data and choose a conditioning_type:
View example config
- Use `conditioning_type=reference_strict` when you need crops aligned 1:1 with the edit image; `reference_loose` allows mismatched aspect ratios.
- File names must match between edit and reference datasets; each edit image should have a corresponding reference file.
- When supplying multiple conditioning datasets, set `conditioning_multidataset_sampling` (`combined` vs `random`) as needed; see OPTIONS.
- Without `conditioning_data`, FLUX.2 falls back to standard text-to-image training.
Available LoRA target presets:
- `all` (default): All attention and MLP layers
- `attention`: Only attention layers (qkv, proj)
- `mlp`: Only MLP/feed-forward layers
- `tiny`: Minimal training (just qkv layers)
View example config
```json
{
  "--flux_lora_target": "all"
}
```

Log in to the required services:

```bash
huggingface-cli login
wandb login  # optional
```

Then start training:

```bash
simpletuner train
```

Or via script:

```bash
./train.sh
```

For memory-constrained setups, FLUX.2 supports group offloading for both the transformer and optionally the Mistral-3 text encoder:
```bash
--enable_group_offload \
--group_offload_type block_level \
--group_offload_blocks_per_group 1 \
--group_offload_use_stream \
--group_offload_text_encoder
```

The `--group_offload_text_encoder` flag is recommended for FLUX.2 since the 24B Mistral text encoder benefits significantly from offloading during text embedding caching. You can also add `--group_offload_vae` to include the VAE in offloading during latent caching.
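If you prefer file-based configuration, the same settings can be sketched as config keys, assuming dashed CLI flags map to same-named keys (verify the exact spelling against OPTIONS before relying on this):

```json
{
  "enable_group_offload": true,
  "group_offload_type": "block_level",
  "group_offload_blocks_per_group": 1,
  "group_offload_use_stream": true,
  "group_offload_text_encoder": true
}
```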
Create config/user_prompt_library.json:
View example config
```json
{
  "portrait_subject": "a professional portrait photograph of <subject>, studio lighting, high detail",
  "artistic_subject": "an artistic interpretation of <subject> in the style of renaissance painting",
  "cinematic_subject": "a cinematic shot of <subject>, dramatic lighting, film grain"
}
```

FLUX.2 LoRAs can be loaded with the SimpleTuner inference pipeline or compatible tools once community support develops.
- Training with `flux_guidance_value=1.0` works well for most use cases
- At inference, use normal guidance values (3.0-5.0)
| Aspect | FLUX.1 | FLUX.2-dev | FLUX.2-klein-9b | FLUX.2-klein-4b |
|---|---|---|---|---|
| Text Encoder | CLIP-L/14 + T5-XXL | Mistral-3 (24B) | Qwen3 (bundled) | Qwen3 (bundled) |
| Embedding Dim | CLIP: 768, T5: 4096 | 15,360 | 12,288 | 7,680 |
| Latent Channels | 16 | 32 (→128) | 32 (→128) | 32 (→128) |
| VAE | AutoencoderKL | Custom (BatchNorm) | Custom (BatchNorm) | Custom (BatchNorm) |
| Transformer Blocks | 19 joint + 38 single | 8 double + 48 single | 8 double + 24 single | 5 double + 20 single |
| Guidance Embeds | Yes | Yes | No | No |
- Enable `--offload_during_startup=true`
- Use `--quantize_via=cpu` for text encoder quantization
- Reduce `--vae_batch_size`
Mistral-3 is large; consider:
- Pre-caching all text embeddings before training
- Using text encoder quantization
- Batch processing with a larger `write_batch_size`
- Lower learning rate (try 5e-5)
- Increase gradient accumulation steps
- Enable gradient checkpointing
- Use `--max_grad_norm=1.0`
- Enable quantization (`int8-quanto` or `int4-quanto`)
- Enable gradient checkpointing
- Reduce batch size
- Enable group offloading
- Use TREAD for token routing efficiency
TREAD (Token Routing for Efficient Architecture-agnostic Diffusion) speeds up training by selectively processing tokens:
View example config
```json
{
  "tread_config": {
    "routes": [
      {
        "selection_ratio": 0.5,
        "start_layer_idx": 4,
        "end_layer_idx": -4
      }
    ]
  }
}
```

- `selection_ratio`: Fraction of tokens to keep (0.5 = 50%)
- `start_layer_idx`: First layer to apply routing
- `end_layer_idx`: Last layer (negative = from end)
Expected speedup: 20-40% depending on configuration.
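To make the negative indexing concrete, here is the span the example config above covers on `dev` (56 blocks total), assuming Python-style negative indexing from the end:

```shell
# end_layer_idx=-4 with 56 total blocks resolves to layer 52,
# so routing applies from layer 4 through layer 52.
TOTAL=56
START=4
END=$((TOTAL - 4))
echo "routing layers ${START}..${END}"
```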
- FLUX.1 Quickstart - For FLUX.1 training
- TREAD Documentation - Detailed TREAD configuration
- LyCORIS Training Guide - LoRA and LyCORIS training methods
- Dataloader Configuration - Dataset setup
Example edit/reference conditioning dataset pairing:

```json
[
  {
    "id": "flux2-edits",
    "type": "local",
    "instance_data_dir": "/datasets/flux2/edits",
    "caption_strategy": "textfile",
    "resolution": 1024,
    "conditioning_data": ["flux2-references"],
    "cache_dir_vae": "cache/vae/flux2/edits"
  },
  {
    "id": "flux2-references",
    "type": "local",
    "dataset_type": "conditioning",
    "instance_data_dir": "/datasets/flux2/references",
    "conditioning_type": "reference_strict",
    "resolution": 1024,
    "cache_dir_vae": "cache/vae/flux2/references"
  }
]
```