This guide covers training LoRAs on FLUX.2, Black Forest Labs' latest image generation model family.
**Note:** The default model flavour is `klein-9b`, but this guide focuses on `dev` (the full 12B transformer with 24B Mistral-3 text encoder) since it has the highest resource requirements. Klein models are easier to run - see Model Variants below.
FLUX.2 comes in three variants:
| Variant | Transformer | Text Encoder | Total Blocks | Default |
|---|---|---|---|---|
| `dev` | 12B params | Mistral-3 (24B) | 56 (8+48) | |
| `klein-9b` | 9B params | Qwen3 (bundled) | 32 (8+24) | ✓ |
| `klein-4b` | 4B params | Qwen3 (bundled) | 25 (5+20) | |
Key differences:
- dev: Uses standalone Mistral-Small-3.1-24B text encoder, has guidance embeddings
- klein models: Use Qwen3 text encoder bundled in the model repo, no guidance embeddings (guidance training options are ignored)
To select a variant, set `model_flavour` in your config:

```json
{
  "model_flavour": "dev"
}
```

**Important:** For `klein-4b` and `klein-9b`, leave `pretrained_text_encoder_model_name_or_path` unset unless you intentionally want to replace the bundled Qwen3 text encoder. Setting that field overrides the Klein default and can trigger downloads of a different text encoder.
FLUX.2-dev introduces significant architectural changes from FLUX.1:
- Text Encoder: Mistral-Small-3.1-24B (dev) or Qwen3 (klein)
- Architecture: 8 DoubleStreamBlocks + 48 SingleStreamBlocks (dev)
- Latent Channels: 32 VAE channels → 128 after pixel shuffle (vs 16 in FLUX.1)
- VAE: Custom VAE with batch normalization and pixel shuffling
- Embedding Dimension: 15,360 for dev (3×5,120), 12,288 for klein-9b (3×4,096), 7,680 for klein-4b (3×2,560)
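The embedding-dimension figures above can be sanity-checked directly; the ×3 factor corresponds to three concatenated projections of the per-variant hidden size (an inference from the quoted numbers, not a claim about the model internals):

```shell
# Verify the quoted embedding dimensions (3 x per-variant hidden size)
echo "dev:      $((3 * 5120))"   # 15360
echo "klein-9b: $((3 * 4096))"   # 12288
echo "klein-4b: $((3 * 2560))"   # 7680
```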
Hardware requirements vary significantly by model variant.
Klein models are much more accessible:
| Variant | bf16 VRAM | int8 VRAM | System RAM |
|---|---|---|---|
| `klein-4b` | ~12GB | ~8GB | 32GB+ |
| `klein-9b` | ~22GB | ~14GB | 64GB+ |

- Recommended for `klein-9b`: a single 24GB GPU (RTX 3090/4090, A5000)
- Recommended for `klein-4b`: a single 16GB GPU (RTX 4080, A4000)
FLUX.2-dev has significant resource requirements due to the Mistral-3 text encoder:
The 24B Mistral text encoder alone requires significant VRAM:
| Component | bf16 | int8 | int4 |
|---|---|---|---|
| Mistral-3 (24B) | ~48GB | ~24GB | ~12GB |
| FLUX.2 Transformer | ~24GB | ~12GB | ~6GB |
| VAE + overhead | ~4GB | ~4GB | ~4GB |
| Configuration | Approximate Total VRAM |
|---|---|
| bf16 everything | ~76GB+ |
| int8 text encoder + bf16 transformer | ~52GB |
| int8 everything | ~40GB |
| int4 text encoder + int8 transformer | ~22GB |
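As a sketch, the ~22GB configuration in the last row maps onto the quantization keys used later in this guide (VRAM totals are approximate and depend on resolution and batch size):

```json
{
  "text_encoder_1_precision": "int4-quanto",
  "base_model_precision": "int8-quanto",
  "base_model_default_dtype": "bf16"
}
```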
- Minimum: 96GB system RAM (loading 24B text encoder requires substantial memory)
- Recommended: 128GB+ for comfortable operation
- Minimum: 2x 48GB GPUs (A6000, L40S) with FSDP2 or DeepSpeed
- Recommended: 4x H100 80GB with fp8-torchao
- With heavy quantization (int4): 2x 24GB GPUs may work but is experimental
Multi-GPU distributed training (FSDP2 or DeepSpeed) is essentially required for FLUX.2-dev due to the combined size of the Mistral-3 text encoder and transformer.
FLUX.2 requires Python 3.10 or later with recent transformers:

```bash
python --version  # Should be 3.10+
pip install 'transformers>=4.45.0'
```

FLUX.2 models require access approval on Hugging Face:
For dev:
- Visit black-forest-labs/FLUX.2-dev
- Accept the license agreement
For klein models:
- Visit black-forest-labs/FLUX.2-klein-base-9B or black-forest-labs/FLUX.2-klein-base-4B
- Accept the license agreement
Ensure you're logged in to the Hugging Face CLI: `huggingface-cli login`
```bash
pip install 'simpletuner[cuda]'

# CUDA 13 / Blackwell users (NVIDIA B-series GPUs)
pip install 'simpletuner[cuda13]' --extra-index-url https://download.pytorch.org/whl/cu130
```

For development setup:

```bash
git clone https://github.com/bghira/SimpleTuner
cd SimpleTuner
pip install -e ".[cuda]"
```

To use the web UI, run:

```bash
simpletuner server
```

Access http://localhost:8001 and select FLUX.2 as the model family.
Create config/config.json:
View example config
```json
{
  "model_type": "lora",
  "model_family": "flux2",
  "model_flavour": "dev",
  "pretrained_model_name_or_path": "black-forest-labs/FLUX.2-dev",
  "output_dir": "/path/to/output",
  "train_batch_size": 1,
  "gradient_accumulation_steps": 1,
  "gradient_checkpointing": true,
  "mixed_precision": "bf16",
  "learning_rate": 1e-4,
  "lr_scheduler": "constant",
  "max_train_steps": 10000,
  "validation_resolution": "1024x1024",
  "validation_num_inference_steps": 20,
  "flux_guidance_mode": "constant",
  "flux_guidance_value": 1.0,
  "lora_rank": 16
}
```

**Note:** Klein models (`klein-4b`, `klein-9b`) do not have guidance embeddings. The following guidance options only apply to `dev`.
FLUX.2-dev uses guidance embedding similar to FLUX.1:
View example config
```json
{
  "flux_guidance_mode": "constant",
  "flux_guidance_value": 1.0
}
```

Or for random guidance during training:
View example config
```json
{
  "flux_guidance_mode": "random-range",
  "flux_guidance_min": 1.0,
  "flux_guidance_max": 5.0
}
```

For reduced VRAM usage:
View example config
```json
{
  "base_model_precision": "int8-quanto",
  "text_encoder_1_precision": "int8-quanto",
  "base_model_default_dtype": "bf16"
}
```

FLUX.2 supports TREAD for faster training:
View example config
```json
{
  "tread_config": {
    "routes": [
      {"selection_ratio": 0.5, "start_layer_idx": 2, "end_layer_idx": -2}
    ]
  }
}
```

Show advanced experimental details
SimpleTuner includes experimental features that can significantly improve training stability and performance.
- Scheduled Sampling (Rollout): reduces exposure bias and improves output quality by letting the model generate its own inputs during training.
⚠️ These features increase the computational overhead of training.
Create config/multidatabackend.json:
View example config
```json
[
  {
    "id": "my-dataset",
    "type": "local",
    "crop": true,
    "crop_aspect": "square",
    "crop_style": "center",
    "resolution": 1024,
    "minimum_image_size": 1024,
    "maximum_image_size": 1024,
    "resolution_type": "pixel_area",
    "cache_dir_vae": "cache/vae/flux2/my-dataset",
    "instance_data_dir": "datasets/my-dataset",
    "caption_strategy": "textfile",
    "metadata_backend": "discovery",
    "repeats": 10
  },
  {
    "id": "text-embeds",
    "type": "local",
    "dataset_type": "text_embeds",
    "default": true,
    "cache_dir": "cache/text/flux2",
    "write_batch_size": 64
  }
]
```

See `caption_strategy` options and requirements in DATALOADER.md.
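With `caption_strategy` set to `textfile`, each image in `instance_data_dir` is paired with a `.txt` caption file of the same basename. A minimal sketch (file and caption contents are placeholders):

```shell
# Each image (e.g. photo1.png) gets a sibling photo1.txt holding its caption.
mkdir -p datasets/my-dataset
echo "a professional portrait photograph of sks person" \
  > datasets/my-dataset/photo1.txt
cat datasets/my-dataset/photo1.txt
```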
FLUX.2 can train either plain text-to-image (no conditioning) or with paired reference/edit images. To add conditioning, pair your main dataset to one or more conditioning datasets using conditioning_data and choose a conditioning_type:
View example config
- Use `conditioning_type=reference_strict` when you need crops aligned 1:1 with the edit image; `reference_loose` allows mismatched aspect ratios.
- File names must match between edit and reference datasets; each edit image should have a corresponding reference file.
- When supplying multiple conditioning datasets, set `conditioning_multidataset_sampling` (`combined` vs `random`) as needed; see OPTIONS.
- Without `conditioning_data`, FLUX.2 falls back to standard text-to-image training.
Available LoRA target presets:
- `all` (default): All attention and MLP layers
- `attention`: Only attention layers (qkv, proj)
- `mlp`: Only MLP/feed-forward layers
- `tiny`: Minimal training (just qkv layers)
View example config
```json
{
  "--flux_lora_target": "all"
}
```

Log in to the required services:

```bash
huggingface-cli login
wandb login  # optional
```

Then start training:

```bash
simpletuner train
```

Or via script:

```bash
./train.sh
```

For memory-constrained setups, FLUX.2 supports group offloading for both the transformer and optionally the Mistral-3 text encoder:
```bash
--enable_group_offload \
--group_offload_type block_level \
--group_offload_blocks_per_group 1 \
--group_offload_use_stream \
--group_offload_text_encoder
```

The `--group_offload_text_encoder` flag is recommended for FLUX.2 since the 24B Mistral text encoder benefits significantly from offloading during text embedding caching. You can also add `--group_offload_vae` to include the VAE in offloading during latent caching.
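If you prefer file-based configuration, the same settings can be sketched as config keys, assuming dashed CLI flags map to same-named keys (verify the exact spelling against OPTIONS before relying on this):

```json
{
  "enable_group_offload": true,
  "group_offload_type": "block_level",
  "group_offload_blocks_per_group": 1,
  "group_offload_use_stream": true,
  "group_offload_text_encoder": true
}
```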
Create config/user_prompt_library.json:
View example config
```json
{
  "portrait_subject": "a professional portrait photograph of <subject>, studio lighting, high detail",
  "artistic_subject": "an artistic interpretation of <subject> in the style of renaissance painting",
  "cinematic_subject": "a cinematic shot of <subject>, dramatic lighting, film grain"
}
```

FLUX.2 LoRAs can be loaded with the SimpleTuner inference pipeline or compatible tools once community support develops.
- Training with `flux_guidance_value=1.0` works well for most use cases
- At inference, use normal guidance values (3.0-5.0)
| Aspect | FLUX.1 | FLUX.2-dev | FLUX.2-klein-9b | FLUX.2-klein-4b |
|---|---|---|---|---|
| Text Encoder | CLIP-L/14 + T5-XXL | Mistral-3 (24B) | Qwen3 (bundled) | Qwen3 (bundled) |
| Embedding Dim | CLIP: 768, T5: 4096 | 15,360 | 12,288 | 7,680 |
| Latent Channels | 16 | 32 (→128) | 32 (→128) | 32 (→128) |
| VAE | AutoencoderKL | Custom (BatchNorm) | Custom (BatchNorm) | Custom (BatchNorm) |
| Transformer Blocks | 19 joint + 38 single | 8 double + 48 single | 8 double + 24 single | 5 double + 20 single |
| Guidance Embeds | Yes | Yes | No | No |
- Enable `--offload_during_startup=true`
- Use `--quantize_via=cpu` for text encoder quantization
- Reduce `--vae_batch_size`
Mistral-3 is large; consider:
- Pre-caching all text embeddings before training
- Using text encoder quantization
- Batch processing with a larger `write_batch_size`
- Lower learning rate (try 5e-5)
- Increase gradient accumulation steps
- Enable gradient checkpointing
- Use `--max_grad_norm=1.0`
- Enable quantization (`int8-quanto` or `int4-quanto`)
- Enable gradient checkpointing
- Reduce batch size
- Enable group offloading
- Use TREAD for token routing efficiency
TREAD (Token Routing for Efficient Architecture-agnostic Diffusion) speeds up training by selectively processing tokens:
View example config
```json
{
  "tread_config": {
    "routes": [
      {
        "selection_ratio": 0.5,
        "start_layer_idx": 4,
        "end_layer_idx": -4
      }
    ]
  }
}
```

- `selection_ratio`: Fraction of tokens to keep (0.5 = 50%)
- `start_layer_idx`: First layer to apply routing
- `end_layer_idx`: Last layer (negative = from end)
Expected speedup: 20-40% depending on configuration.
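To make the negative indexing concrete, here is the span the example config above covers on `dev` (56 blocks total), assuming Python-style negative indexing from the end:

```shell
# end_layer_idx=-4 with 56 total blocks resolves to layer 52,
# so routing applies from layer 4 through layer 52.
TOTAL=56
START=4
END=$((TOTAL - 4))
echo "routing layers ${START}..${END}"
```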
- FLUX.1 Quickstart - For FLUX.1 training
- TREAD Documentation - Detailed TREAD configuration
- LyCORIS Training Guide - LoRA and LyCORIS training methods
- Dataloader Configuration - Dataset setup
Example edit/reference conditioning dataset pairing:

```json
[
  {
    "id": "flux2-edits",
    "type": "local",
    "instance_data_dir": "/datasets/flux2/edits",
    "caption_strategy": "textfile",
    "resolution": 1024,
    "conditioning_data": ["flux2-references"],
    "cache_dir_vae": "cache/vae/flux2/edits"
  },
  {
    "id": "flux2-references",
    "type": "local",
    "dataset_type": "conditioning",
    "instance_data_dir": "/datasets/flux2/references",
    "conditioning_type": "reference_strict",
    "resolution": 1024,
    "cache_dir_vae": "cache/vae/flux2/references"
  }
]
```