A partial Rust implementation of the ByteLatent Transformer (BLT) entropy model, built on the Burn framework and designed for hypersphere embedding pipelines.
Note: This is NOT a complete port of the original BLT repository. This implementation focuses specifically on entropy-based text segmentation and pre-norm signal extraction for hypersphere embeddings. Many features from the original BLT are intentionally omitted.
BLT-Burn is a specialized implementation of select BLT components, extracting only the entropy model and embedding functionality needed for hypersphere-based systems. It provides:
- Pre-norm signal extraction - Captures embedding magnitudes before L2 normalization for prominence detection
- Entropy-based patching - Uses model confidence to determine natural segmentation boundaries
- Multimodal pre-tokenization - Supports text, images, audio, and code
- GPU acceleration - Uses the WGPU backend (Metal on macOS, Vulkan on Linux)
- bf16 model weights - Uses half-precision weights for efficiency
This project is:
- A focused implementation of BLT's entropy model for text segmentation
- A tool for extracting pre-norm embeddings for hypersphere placement
- A preprocessing pipeline for the Sphere water-filling algorithms
- A Rust implementation of specific BLT components

This project is not:
- A complete port of the BLT repository
- A training framework for BLT models
- A general-purpose transformer library
- A replacement for the original BLT implementation

If you need the full BLT functionality, please refer to the original repository.
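As background for the entropy-based segmentation this crate performs, here is a minimal, hypothetical sketch of threshold-based patch boundary detection in pure-std Rust. It is illustrative only (the function names and threshold are ours, not the crate's API): a new patch opens wherever the model's per-position entropy jumps above a global threshold.

```rust
/// Shannon entropy (in nats) of a next-byte probability distribution.
fn shannon_entropy(probs: &[f64]) -> f64 {
    probs.iter().filter(|&&p| p > 0.0).map(|&p| -p * p.ln()).sum()
}

/// Split a stream into patches: open a new patch at every position
/// whose entropy exceeds `threshold`.
fn entropy_patches(entropies: &[f64], threshold: f64) -> Vec<(usize, usize)> {
    let mut patches = Vec::new();
    let mut start = 0;
    for (i, &h) in entropies.iter().enumerate() {
        if h > threshold && i > start {
            patches.push((start, i)); // half-open range [start, i) is one patch
            start = i;
        }
    }
    patches.push((start, entropies.len()));
    patches
}

fn main() {
    // Low entropy inside a word, spikes at the word boundaries.
    let entropies = [0.2, 0.1, 0.15, 1.9, 0.3, 0.2, 2.1, 0.1];
    let patches = entropy_patches(&entropies, 1.5);
    println!("{:?}", patches); // patch boundaries fall at the entropy spikes
}
```

The intuition: where the model is confident (low entropy), bytes belong to the same patch; a confidence drop marks a natural segmentation boundary.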
- Video Processing: Requires FFmpeg (v3.4 - 8.0) with development headers. See Video Processing for setup instructions.
```bash
# 1. Run ingestion (exports entropy + coherence automatically)
cargo run --release --bin ingest -- --file input.txt --output-dir output/

# 2. Apply entropy-weighted water-filling
python scripts/water_filling_integration.py --input output/ --entropy-weighted

# 3. Test and validate
python scripts/test_entropy_weighted.py --input output/item_0.safetensors
```

See docs/PRE_NORM_SIGNAL_EXTRACTION.md for the full write-up.
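The water-filling step is handled by the Python scripts above; as background, the textbook water-filling allocation can be sketched in a few lines of pure-std Rust (a generic illustration, not the project's implementation): given per-channel floors and a total budget, find the water level μ with Σ max(μ − floor_i, 0) = budget.

```rust
/// Textbook water-filling: bisect on the water level `mu`, then give each
/// channel the excess of `mu` over its floor.
fn water_fill(floors: &[f64], budget: f64) -> Vec<f64> {
    let (mut lo, mut hi) = (0.0, floors.iter().cloned().fold(0.0, f64::max) + budget);
    for _ in 0..100 {
        let mu = 0.5 * (lo + hi);
        let used: f64 = floors.iter().map(|f| (mu - f).max(0.0)).sum();
        if used < budget { lo = mu } else { hi = mu }
    }
    let mu = 0.5 * (lo + hi);
    floors.iter().map(|f| (mu - f).max(0.0)).collect()
}

fn main() {
    // Channels with lower floors (e.g. lower entropy) absorb more of the budget.
    let alloc = water_fill(&[0.1, 0.5, 0.9], 1.0);
    println!("{alloc:?}"); // sums to the budget; monotone in the floors
}
```

In an entropy-weighted variant, the per-channel floors would be derived from the exported entropy signal, biasing allocation toward coherent regions.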
```bash
git clone https://github.com/SashimiSaketoro/blt-burn.git
cd blt-burn
cargo build --release
```

```rust
use blt_burn::{model::LMTransformerConfig, tokenizer::BltTokenizer};
use burn::backend::wgpu::{Wgpu, WgpuDevice};
use burn::record::{FullPrecisionSettings, Recorder};
use burn_import::safetensors::SafetensorsFileRecorder;

let device = WgpuDevice::default();
let config = LMTransformerConfig {
    dim: 768,
    n_layers: 14,
    n_heads: Some(12),
    vocab_size: 260,
    rope_theta: 10000.0,
    // ... other config
};
let model = config.init::<Wgpu>(&device);

// Load weights from safetensors (auto-detected from HF cache)
let recorder = SafetensorsFileRecorder::<FullPrecisionSettings>::default();
let model = model.load_record(
    recorder.load("model.safetensors".into(), &device)?,
);

// Extract pre-norm embeddings
let output = model.forward_with_embeddings(tokens);
let embeddings = output.pre_norm_embeddings; // For sphere placement
let prominence = output.embedding_norms;     // For water-filling
```

The build script automatically detects the Facebook BLT entropy model in your HuggingFace cache:
```bash
# Download the model first (if not already cached)
hf download facebook/blt-entropy model.safetensors

# Build/run - model is auto-detected from HF cache
cargo build --release
cargo run --bin ingest
```

Source Model: facebook/blt-entropy
Format: SafeTensors (also supports Burn MPK format with --use-mpk flag)
For datasets that won't fit on your internal drive (like FineWeb-Edu 10B):
```bash
# Use an external drive
cargo run --release --bin ingest -- \
  --external-drive /Volumes/MyExternalDrive/blt_data \
  --limit 1000

# Or specify paths manually
cargo run --release --bin ingest -- \
  --cache-dir /Volumes/MyExternalDrive/cache \
  --output-dir /Volumes/MyExternalDrive/output
```

- ✅ RoPE & Causal Masking - Full positional encoding following Meta's BLT architecture
- ✅ Pre-L2-Norm Signal Extraction - Preserves magnitude variance for prominence detection
- ✅ Entropy-Based Patching - Monotonic boundary detection using model confidence
- ✅ Entropy-Weighted Allocation - Physics-inspired allocation for coherence-biased retrieval
- ✅ Multimodal Support - Text, images, audio, code pre-tokenization
- ✅ Direct SafeTensors Loading - Load Facebook's original weights without conversion
- ✅ GPU Acceleration - Automatic acceleration via WGPU
- ✅ FineWeb-Edu Integration - Built-in dataset utilities
- ✅ Water-Filling Ready - Output format optimized for hypersphere pipelines
- ✅ Hypergraph Sidecar - SQLite-based storage with explicit Trunk-Branch-Leaf topology alongside tensors
- ✅ JAX-Compatible Sharding - Automatic dataset sharding for distributed processing
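To illustrate why the pre-norm signal matters: once a vector is L2-normalized its magnitude is gone, so the norm must be captured first if it is to serve as a prominence signal. A small pure-std sketch (illustrative only, not the crate's API):

```rust
fn l2_norm(v: &[f32]) -> f32 {
    v.iter().map(|x| x * x).sum::<f32>().sqrt()
}

/// Returns the unit-length direction plus the pre-norm magnitude.
fn split_direction_and_prominence(v: &[f32]) -> (Vec<f32>, f32) {
    let norm = l2_norm(v);
    let unit = v.iter().map(|x| x / norm).collect();
    (unit, norm)
}

fn main() {
    let a = [3.0_f32, 4.0]; // norm 5.0
    let b = [0.3_f32, 0.4]; // same direction, 10x smaller magnitude
    let (ua, na) = split_direction_and_prominence(&a);
    let (ub, nb) = split_direction_and_prominence(&b);
    // After L2 normalization the two are indistinguishable...
    for (x, y) in ua.iter().zip(&ub) {
        assert!((x - y).abs() < 1e-6);
    }
    // ...but the pre-norm magnitudes still carry the 10x prominence gap.
    println!("prominence a = {na}, b = {nb}");
}
```

This is the variance the "Pre-L2-Norm Signal Extraction" feature preserves for prominence detection.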
BLT-Burn is the embedding backend for sphere-pipeline, which adds:
```bash
# Unified pipeline: BLT encoding → Sphere optimization → ROOTS index → Harmonics
sphere-ingest --local-dir ./data --output ./corpus

# Stream any HuggingFace dataset
sphere-ingest --hf-dataset ytz20/LMSYS-Chat --text-column response --output ./corpus
```

Polar Zone Architecture: The sphere is divided into semantic zones:
- North pole (θ < 15°): Instruction zone (behavioral anchors)
- Equatorial torus: Content zone (documents, facts)
- South pole (θ > 165°): QA pairs zone (fine-tuning examples)
```bash
# Force data to QA zone (south pole)
sphere-ingest --hf-dataset org/qa-pairs --target-zone qa --output ./corpus
```

See sphere-pipeline/README.md for full documentation.
BLT-Burn includes a multi-format pre-tokenization system:
| Modality | Decode Backend | Entropy Patching | Status |
|---|---|---|---|
| Text | Raw Bytes / HF Tokenizer | ✅ Yes | Stable |
| Image | image crate (Rust) | ✅ Yes | Beta |
| Audio | symphonia (Rust) | ✅ Yes | Beta |
| Code | tree-sitter (Rust) | ✅ Yes | Beta |
| Video | ffmpeg-next | 🚧 Optional | Beta (Requires --features video) |
| Document | pdfium-render (Rust) | ✅ Yes | Beta (Multi-view support) |
| Binary | goblin (Rust) | ✅ Yes | Beta (ELF/PE/Mach-O) |
For compound documents like PDFs, BLT-Burn supports multi-view extraction:
```bash
# Enable multi-view mode: emits raw + text + images as separate hypergraph branches
cargo run --bin ingest -- --file document.pdf --multiview-pdf

# Or select a specific mode
cargo run --bin ingest -- --file document.pdf --pdf-mode text_only   # default
cargo run --bin ingest -- --file document.pdf --pdf-mode raw_only
cargo run --bin ingest -- --file document.pdf --pdf-mode image_only
```

Multi-view mode creates `same_source` hyperedges connecting all views of the same document, enabling downstream models to learn cross-modal associations.
Automatic format detection based on file signatures:
- JPEG (`FF D8`), PNG (`89 PNG`)
- PDF (`%PDF-`), MP4/Video (`ftyp`)
- WAV (`RIFF`), MP3 (`ID3` or sync bytes)
- ELF binaries (`7F ELF`), ZIP archives (`PK`)
- Code files (shebang, import statements)
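A minimal sketch of how such signature checks can be written (an illustrative subset, not the crate's actual detector):

```rust
/// Guess a format from the leading magic bytes of a buffer.
fn detect(bytes: &[u8]) -> &'static str {
    match bytes {
        [0xFF, 0xD8, ..] => "jpeg",
        [0x89, b'P', b'N', b'G', ..] => "png",
        [0x7F, b'E', b'L', b'F', ..] => "elf",
        b if b.starts_with(b"%PDF-") => "pdf",
        b if b.starts_with(b"RIFF") => "wav",
        b if b.starts_with(b"ID3") => "mp3",
        b if b.starts_with(b"PK") => "zip",
        b if b.starts_with(b"#!") => "script", // shebang → code file
        _ => "unknown",
    }
}

fn main() {
    println!("{}", detect(b"%PDF-1.7 ..."));        // pdf
    println!("{}", detect(&[0xFF, 0xD8, 0xFF]));    // jpeg
    println!("{}", detect(b"#!/bin/sh\necho hi\n")); // script
}
```

Note that some signatures (e.g. MP4's `ftyp`) sit at a fixed offset rather than at byte 0, so a real detector checks more than the prefix.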
To maintain portability and ease of deployment, BLT-Burn prioritizes pure-Rust implementations where possible:
- Audio: Uses `symphonia` for pure-Rust decoding (MP3, OGG, MP4, WAV, FLAC, etc.)
- Images: Uses the `image` crate (pure-Rust JPEG/PNG/GIF support)
- Documents: `pdf` crate for PDF parsing (pure-Rust)
- Binaries: `goblin` for ELF/PE/Mach-O analysis (pure-Rust)
- Video: `ffmpeg-next` v8.0 (optional, requires system FFmpeg 3.4-8.0)
- `cargo run --bin ingest -- --huggingface-dataset openai/gdpval --limit 1` now streams Hugging Face splits directly with Polars.
- The loader resolves `hf://datasets/...` URIs from `dataset_info.json`, downloads the referenced Parquet shards through `hf-hub`, and falls back to Arrow IPC or JSON automatically.
- Because it leans on the Polars plugin architecture (hashing/spatial ops), no Python runtime or `burn-dataset` shim is required; everything stays inside the Rust process (see the Polars plugin overview).
- Every ingestion run now performs a prefetch pass: it scans the entire Polars `DataFrame`, deduplicates all `images/`, `files/`, and `hf://` references, and hydrates them once into `<output>/.cache/<dataset-slug>/…` before per-row work begins.
- The cache is persistent: if you keep the output directory around, downstream training or navigation stages can reuse the already-downloaded assets without touching Hugging Face again.
- When a dataset only exposes pointers (e.g. TreeVGR's `images/...` paths), the loader can stream individual files out of remote archives hosted under a different Hugging Face repo. The TreeVGR ingestion, for example, automatically pulls the matching image out of `lmms-lab/LLaVA-NeXT-Data/llava_next_raw_format_images_*.tar.gz` without requiring Python or pre-extraction of the entire tarball.
- Prefetch still happens lazily per dataset: files are fetched on demand, but once cached they're treated as the authoritative copy for subsequent ingest runs or analysis tools.
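The dedup step of the prefetch pass can be sketched as follows (a hypothetical pure-std function; the real pass operates on a Polars `DataFrame`): collect every external reference, keep the first occurrence of each, and hydrate only the unique set.

```rust
use std::collections::BTreeSet;

/// Keep the first occurrence of each external asset reference, in row order.
fn unique_refs<'a>(rows: &[&'a str]) -> Vec<&'a str> {
    let mut seen = BTreeSet::new();
    rows.iter()
        .filter(|r| r.starts_with("images/") || r.starts_with("files/") || r.starts_with("hf://"))
        .filter(|r| seen.insert(**r)) // insert returns false for duplicates
        .copied()
        .collect()
}

fn main() {
    let rows = [
        "images/cat.png",
        "hf://datasets/org/data/shard0.parquet",
        "images/cat.png",  // duplicate: hydrated only once
        "inline text row", // not an external reference
    ];
    let refs = unique_refs(&rows);
    println!("{} unique assets to hydrate: {:?}", refs.len(), refs);
}
```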
BLT-Burn uses FFmpeg for video codec support (H.264, H.265/HEVC, VP8, VP9, AV1, MPEG-4, MPEG-2).
Setup by Platform:
🍎 macOS (Homebrew)

```bash
# Install FFmpeg and pkg-config
brew install ffmpeg pkg-config

# Set environment (add to ~/.zshrc for permanent setup)
source scripts/setup_ffmpeg_env.sh

# Build with video feature
cargo build --features video
```

🐧 Linux (Ubuntu/Debian)
```bash
# Install FFmpeg development headers
sudo apt install ffmpeg libavcodec-dev libavformat-dev \
    libswscale-dev libavutil-dev pkg-config

# Build with video feature
cargo build --features video
```

🐧 Linux (Fedora)
```bash
# Install FFmpeg development headers
sudo dnf install ffmpeg ffmpeg-devel

# Build with video feature
cargo build --features video
```

🪟 Windows
```powershell
# Download FFmpeg from: https://www.gyan.dev/ffmpeg/builds/
# Extract to C:\ffmpeg (or your preferred location)

# Set environment variable
$env:FFMPEG_DIR = "C:\ffmpeg"

# Build with video feature
cargo build --features video
```

CLI Options:
- `--no-audio-video`: Skip audio/video processing entirely
- The video feature is opt-in: build without `--features video` to skip the FFmpeg dependency
Process a directory of files, extracting embeddings and generating the hypergraph sidecar:
```bash
# 1. Run ingest (video requires --features video and FFmpeg setup)
cargo run --release --bin ingest -- \
  --input ./samples \
  --output ./out

# 1b. For large files, enable JAX sharding
cargo run --release --bin ingest -- \
  --text "Your large text content..." \
  --output ./out \
  --num-shards 4   # Creates 4 shards for distributed processing

# 2. Inspect the resulting topology
python scripts/inspect_sphere_result.py ./out/output.safetensors
```

BLT-Burn produces a SQLite "sidecar" file alongside the tensor output. This preserves the Trunk-Branch-Leaf topology that is otherwise flattened into the tensor.
Example structure (when exported as JSON with --export-json):
```jsonc
{
  "nodes": [
    { "Trunk": { "source_hash": "a1b2c3...", "total_bytes": 50000 } },
    { "Branch": { "label": "text_content", "modality": "text" } },
    { "Leaf": { "bytes": [], "label": "text_token", "metadata": { "start_offset": 0, "end_offset": 5 } } }
  ],
  "edges": [
    { "label": "contains", "weight": 1.0 },
    { "label": "next", "weight": 1.0 }
  ],
  "topology": {
    "edges": [
      [0, [0, 1]], // Edge 0 connects Node 0 -> Node 1
      [1, [1, 2]]  // Edge 1 connects Node 1 -> Node 2
    ]
  }
}
```

- API Reference - Complete library documentation
- Pre-Norm Signal Guide - Why pre-norm matters
- Signal Extraction Details - Technical deep-dive
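The `topology.edges` pairs in the sidecar JSON above map an edge id to a `[from, to]` node pair. A small sketch (our own types, not the crate's API) of walking them as a directed adjacency list:

```rust
/// Collect the targets of all edges leaving `node`.
/// Each topology entry is (edge_id, [from, to]).
fn children(topology: &[(usize, [usize; 2])], node: usize) -> Vec<usize> {
    topology
        .iter()
        .filter(|(_, [from, _])| *from == node)
        .map(|(_, [_, to])| *to)
        .collect()
}

fn main() {
    // Matches the JSON example: edge 0 is Trunk(0)→Branch(1),
    // edge 1 is Branch(1)→Leaf(2).
    let topology = [(0, [0, 1]), (1, [1, 2])];
    println!("trunk's branches: {:?}", children(&topology, 0));
    println!("branch's leaves:  {:?}", children(&topology, 1));
}
```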
```
blt-burn/
├── src/
│   ├── model.rs                 # BLT transformer architecture
│   ├── tokenizer.rs             # Text tokenization
│   ├── hf_resolver.rs           # Centralized HuggingFace path resolution
│   ├── modalities/              # Multimodal pre-tokenization
│   │   ├── mod.rs               # Trait + factory + auto-detection
│   │   ├── text.rs              # Raw/HF tokenizer
│   │   ├── image.rs             # Patch extraction
│   │   ├── audio.rs             # Symphonia decoder
│   │   ├── code.rs              # Tree-sitter AST
│   │   ├── document.rs          # PDF multi-view
│   │   ├── video.rs             # FFmpeg frames
│   │   └── binary.rs            # ELF/PE sections
│   ├── pretokenize.rs           # Legacy (deprecated, use modalities)
│   ├── patcher.rs               # Entropy & patch extraction
│   ├── dataset.rs               # FineWeb-Edu utilities
│   ├── generic_processor.rs     # Schema-inferred dataset processing
│   ├── polars_dataset_loader.rs # Polars-based HF dataset loading
│   ├── prefetch.rs              # Async document prefetching
│   ├── batching.rs              # Length sorting & batch stats
│   ├── quantization.rs          # INT8/INT4 model quantization
│   ├── sidecar.rs               # Hypergraph SQLite storage
│   └── bin/
│       └── ingest.rs            # Main ingestion pipeline
├── cubecl.toml                  # GPU memory optimization config
├── docs/
│   └── OPTIMIZATION_GUIDE.md    # Future research areas
└── Cargo.toml
```
- Rust 1.70+
- macOS (for Metal) or Linux/Windows (CPU/WGPU)
- Python 3.9+ (for scripts)
This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International (CC-BY-NC 4.0) license.
This codebase is a partial Rust/Burn implementation of specific components from the ByteLatent Transformer (BLT) architecture.
- Original Work: BLT (Meta Research)
- Original License: CC-BY-NC 4.0
- Scope: This implementation includes only:
- The entropy model for text segmentation
- Embedding extraction (pre-L2-norm)
- Basic tokenization
- NOT included: Training code, full transformer capabilities, compression features, or other BLT functionality
- Modifications:
- Implemented select components in Rust using the Burn framework
- Added multimodal pre-tokenization system
- Added pre-norm signal extraction for hypersphere integration
- Optimized for Metal acceleration (Apple Silicon)
Commercial Use: Commercial use of this software is prohibited under the terms of the CC-BY-NC 4.0 license, unless you obtain separate permission from the original rights holders (Meta) and the authors of this derivative work.
Contributions welcome! Please see CONTRIBUTING.md for guidelines.
- `tokenizer::BltTokenizer::encode_bytes` - Turn raw bytes into entropy-model-ready tokens.
- `model::LMTransformer::forward_with_embeddings` - Run the model and get both `pre_norm_embeddings` (for geometry) and `embedding_norms` (for prominence).
- `pretokenize::detect_modality` - Auto-detect content type (Image, Audio, Video, Code) from file signatures.
- `prefetch::DocumentPrefetcher` - Background document loading with bounded channel for I/O overlap.
- `batching::BatchStats` - Document size distribution and length-sorted processing utilities.
- `quantization::quantize_model` - Apply INT8/INT4 quantization to model weights using Burn's QuantScheme.
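As background on weight quantization, here is a generic symmetric INT8 scheme in pure-std Rust (a textbook technique; Burn's QuantScheme internals may differ): scale by the maximum absolute weight, round to `i8`, and multiply back by the scale to dequantize.

```rust
/// Symmetric INT8 quantization: one scale per tensor, zero-point fixed at 0.
fn quantize_int8(weights: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = weights.iter().fold(0.0_f32, |m, w| m.max(w.abs()));
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    let q = weights.iter().map(|w| (w / scale).round() as i8).collect();
    (q, scale)
}

fn dequantize(q: &[i8], scale: f32) -> Vec<f32> {
    q.iter().map(|&v| v as f32 * scale).collect()
}

fn main() {
    let w = [63.0_f32, -127.0, 32.0, 0.0];
    let (q, scale) = quantize_int8(&w);
    let back = dequantize(&q, scale);
    // Round-trip error is bounded by scale/2 per weight.
    for (a, b) in w.iter().zip(&back) {
        assert!((a - b).abs() <= scale / 2.0 + 1e-6);
    }
    println!("q = {:?}, scale = {}", q, scale);
}
```

INT4 follows the same idea with a 15-level range and is usually applied per channel or per block to keep the scale tight.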
```
--quantize int8|int4     # Enable model quantization
--quant-stats            # Print quantization statistics and exit
--prefetch-buffer N      # Documents to buffer ahead (default: 4)
--batch-stats            # Print document size distribution
--profile                # Enable CubeCL kernel profiling
--entropy-histogram      # Export entropy distribution as JSON
--output-format FORMAT   # Output format: safetensors (default) or webdataset
--multiview-pdf          # Multi-view PDF: emit raw + text + images with cross-view edges
--pdf-mode MODE          # PDF mode: raw_only, text_only (default), image_only
--huggingface-dataset    # Load dataset from HuggingFace Hub
--hf-subset              # HuggingFace dataset subset
--skip-missing-files     # Skip missing files instead of using fallbacks
--hf-token TOKEN         # HuggingFace authentication token (also via HF_TOKEN env)
```

BLT-Burn supports two output formats:
| Format | Description | Use Case |
|---|---|---|
| SafeTensors (default) | Individual `.safetensors` files with `.hypergraph.db` sidecars | Most use cases, random access |
| WebDataset | Sharded `.tar.gz` archives | PyTorch DataLoader streaming |
```bash
# Default: SafeTensors output (recommended)
cargo run --bin ingest -- --text "hello world"

# Optional: WebDataset for PyTorch streaming
cargo run --bin ingest -- --text "hello world" --output-format webdataset --webdataset-shard-size 1000
```

WebDataset shards are compatible with the PyTorch WebDataset loader:
```python
import webdataset as wds

dataset = wds.WebDataset("output/shard_*.tar.gz").decode()
```

Version: 0.7.0
Last Updated: 2025-11-26