This is the edit/img2img variant of LongCat‑Image. Read LONGCAT_IMAGE.md first; this file only lists what changes for the edit flavour.
|  | Base (text2img) | Edit |
|---|---|---|
| Flavour | final / dev | edit |
| Conditioning | none | requires conditioning latents (reference image) |
| Text encoder | Qwen‑2.5‑VL | Qwen‑2.5‑VL with vision context (prompt encoding needs ref image) |
| Pipeline | TEXT2IMG | IMG2IMG/EDIT |
| Validation inputs | prompt only | prompt and reference |
Keep `aspect_bucket_alignment` at 64. Do not disable conditioning latents; the edit pipeline expects them.
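The 64px grid constraint above can be sketched as a small helper (illustrative only; SimpleTuner computes aspect buckets internally, and this function name is not part of its API):

```python
# Sketch: snap a target resolution down to the 64px bucket grid.
def snap_to_grid(width: int, height: int, align: int = 64) -> tuple[int, int]:
    # Floor each dimension to the nearest multiple of `align`.
    return (width // align * align, height // align * align)

print(snap_to_grid(1000, 768))  # -> (960, 768)
```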
Fast config creation:

```bash
cp config/config.json.example config/config.json
```

Then set `model_family`, `model_flavour`, dataset paths, and `output_dir`.
Use two aligned datasets: edit images (caption = edit instruction) and reference images. The edit dataset's `conditioning_data` must point to the reference dataset ID. Filenames must match 1‑to‑1.
```json
[
  {
    "id": "edit-images",
    "type": "local",
    "instance_data_dir": "/data/edits",
    "caption_strategy": "textfile",
    "resolution": 768,
    "cache_dir_vae": "/cache/vae/longcat/edit",
    "conditioning_data": ["ref-images"]
  },
  {
    "id": "ref-images",
    "type": "local",
    "instance_data_dir": "/data/refs",
    "caption_strategy": null,
    "resolution": 768,
    "cache_dir_vae": "/cache/vae/longcat/ref"
  }
]
```

See `caption_strategy` options and requirements in DATALOADER.md.
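The 1‑to‑1 filename requirement can be checked up front with a minimal sketch (the paths follow the example above; the helper name and extension list are assumptions, not part of SimpleTuner):

```python
from pathlib import Path

# Sketch: verify that edit and reference datasets pair 1-to-1 by filename stem.
# The extension list is an assumption; extend it if your data uses other formats.
def unmatched(edit_dir: str, ref_dir: str,
              exts=(".png", ".jpg", ".jpeg", ".webp")):
    def stems(d):
        return {p.stem for p in Path(d).iterdir() if p.suffix.lower() in exts}
    edit, ref = stems(edit_dir), stems(ref_dir)
    # Returns (edit stems missing a reference, reference stems missing an edit).
    return sorted(edit - ref), sorted(ref - edit)

# Usage: missing_refs, missing_edits = unmatched("/data/edits", "/data/refs")
```

Both returned lists should be empty before training starts.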
Notes:
- Aspect buckets: keep on the 64px grid.
- Reference captions are optional; if present they replace edit captions (usually undesired).
- VAE caches for edit and reference should be separate paths.
- If you see cache misses or shape errors, clear the VAE caches for both datasets and regenerate.
- Validation needs reference images to produce conditioning latents. Point the validation split of `edit-images` to `ref-images` via `conditioning_data`.
- Guidance: 4–6 works well; keep the negative prompt empty.
- Preview callbacks are supported; latents are unpacked for decoders automatically.
- If validation fails due to missing conditioning latents, check that the validation dataloader includes both edit and reference entries with matching filenames.
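The cache-clearing step from the notes above can be sketched as follows (paths follow the dataloader example; substitute your own `cache_dir_vae` values):

```python
import shutil

# Sketch: wipe both VAE caches so they regenerate on the next run.
# ignore_errors=True makes this a no-op if a cache path does not exist yet.
for cache in ("/cache/vae/longcat/edit", "/cache/vae/longcat/ref"):
    shutil.rmtree(cache, ignore_errors=True)
```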
Quick CLI validation:

```bash
simpletuner validate \
  --model_family longcat_image \
  --model_flavour edit \
  --validation_resolution 768x768 \
  --validation_guidance 4.5 \
  --validation_num_inference_steps 40
```

WebUI: choose the Edit pipeline, supply both the source image and the edit instruction.
After config and dataloader are set:

```bash
simpletuner train --config config/config.json
```

Ensure the reference dataset is present during training so conditioning latents can be computed or loaded from cache.
- Missing conditioning latents: ensure the reference dataset is wired via `conditioning_data` and filenames match.
- MPS dtype errors: the pipeline auto‑downgrades pos‑ids to float32 on MPS; keep the rest at float32/bf16.
- Channel mismatch in previews: previews un‑patchify latents before decoding (keep this SimpleTuner version).
- OOM during edit: lower validation resolution/steps, reduce `lora_rank`, enable group offload, and prefer `int8-quanto`/`fp8-torchao`.
```json
{
  "model_type": "lora",
  "model_family": "longcat_image",
  "model_flavour": "edit",
  "base_model_precision": "int8-quanto",
  "train_batch_size": 1,
  "gradient_checkpointing": true,
  "learning_rate": 5e-5,
  "validation_guidance": 4.5,
  "validation_num_inference_steps": 40,
  "validation_resolution": "768x768"
}
```

`fp8-torchao` is also fine for `base_model_precision` and helps fit 16–24 GB.
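A hedged sanity check for the config before launching training (the required-key set below is an assumption for illustration, not SimpleTuner's full schema):

```python
import json
from pathlib import Path

# Sketch: catch obvious config mistakes before a long training run.
def check_edit_config(path: str) -> None:
    cfg = json.loads(Path(path).read_text())
    missing = {"model_family", "model_flavour"} - cfg.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if cfg["model_flavour"] != "edit":
        raise ValueError("model_flavour must be 'edit' for this pipeline")

# Usage: check_edit_config("config/config.json")
```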