What happens when INT8 quantization goes wrong? This project reveals a 37x performance degradation with INT8 quantization on Apple M3, challenging conventional optimization wisdom and demonstrating why hardware-aware optimization is crucial for edge AI deployment.
- ✅ 3.6x model size reduction achieved (10.3MB → 2.9MB)
- ❌ 37x slower inference with INT8 on M3 (4.12ms → 152ms)
- 🔍 Root cause identified: 63 DynamicQuantizeLinear operations without hardware acceleration
Edge AI promises intelligent computing on resource-constrained devices. The standard playbook says: "Quantize your model to INT8 for faster inference."
But what if that's wrong?
This project systematically investigates model optimization techniques for edge deployment, uncovering a critical hardware dependency that can make or break your optimization strategy. Through comprehensive benchmarking of YOLOv5n on Apple M3, we demonstrate that optimization without hardware awareness is optimization in the dark.
```bash
# Clone and set up
git clone https://github.com/candleboxyz/exp--onnx_int8_paradox.git
cd exp--onnx_int8_paradox

# Initialize the virtual environment (adjust for your preferred package manager)
uv init

# Download the YOLOv5n model
wget https://github.com/ultralytics/yolov5/releases/download/v7.0/yolov5nu.pt

# Run the complete optimization pipeline
python main.py --model yolov5nu.pt --output results

# View results
cat results/report.md

# Optional: run the live demo
python main.py --model yolov5nu.pt --demo
```
| 🔬 Controlled Benchmark (Single-thread) | 🌍 Practical Benchmark (Auto-thread) |
|---|---|
```
Original PyTorch model:  5.3 MB
        ↓
ONNX FP32 model:        10.3 MB  (includes graph metadata)
        ↓
ONNX INT8 model:         2.9 MB  ✅ (3.6x compression)
```
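The compression ratio can be verified directly from the files on disk. A minimal sketch (the paths passed in are illustrative, not the pipeline's actual output names):

```python
# Minimal sketch: compute the on-disk compression ratio between two model
# files. Callers pass whatever paths the pipeline actually produced.
from pathlib import Path


def model_size_mb(path: str) -> float:
    """File size in mebibytes."""
    return Path(path).stat().st_size / (1024 * 1024)


def compression_ratio(fp32_path: str, int8_path: str) -> float:
    """FP32 size divided by INT8 size, e.g. 10.3 / 2.9 ≈ 3.6."""
    return model_size_mb(fp32_path) / model_size_mb(int8_path)
```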
Modular design ensures reproducibility and extensibility:
```mermaid
graph TB
    subgraph "Core Pipeline"
        MC[ModelConverter<br/>converter.py] --> MQ[ModelQuantizer<br/>quantizer.py]
        MQ --> MA[ModelAnalyzer<br/>analyzer.py]
        MA --> B[Benchmarker<br/>benchmarker.py]
    end

    subgraph "Support Modules"
        ORT[build_session<br/>ort_session_constrain.py]
        DIAG[diagnose_model<br/>diagnose_int8.py]
    end

    subgraph "Demo"
        DEMO[EdgeYOLODemo<br/>demo.py]
    end

    B -.uses.-> ORT
    DEMO -.uses.-> ORT
    MA -.analyzes.-> MQ
    DIAG -.validates.-> MQ

    style MC stroke:#98ee9f
    style MQ stroke:#ffcc7b
    style MA stroke:#fc8597
    style B stroke:#63beff
```
Industry documentation consistently reports INT8 benefits:
Apple M3 measurements tell a different story:
- ✅ 3.6x memory reduction (close to theoretical)
- ❌ 37x slowdown (catastrophic performance degradation)
The analysis revealed 63 DynamicQuantizeLinear operations creating a critical bottleneck:
```
For each of the 63 quantized layers:
FP32 → [Compute Scale] → [Quantize] → INT8 → [Conv] → [Dequantize] → FP32
       ↑________________ overhead per operation ________________↑
```
Root Cause Analysis:
- The M3 Neural Engine has INT8 support but operates independently of the CPU[^3]
- ONNX Runtime's CPUExecutionProvider cannot access the Neural Engine
- Without CPU INT8 acceleration, each quantization operation adds overhead instead of optimization
- CoreMLExecutionProvider could potentially access Neural Engine but requires model conversion
This creates a perfect storm: the quantization overhead (63 operations) far exceeds any theoretical benefit from reduced precision computation.
| Platform | INT8 Performance | Source | Hardware Support |
|---|---|---|---|
| Apple M3 (CPU) | 37x slower ❌ | This work | Neural Engine has INT8[^3]; CPU doesn't utilize it |
| Qualcomm Snapdragon* | Up to 3x faster ✅ | Qualcomm[^4] | Hexagon DSP with HVX acceleration |
| NVIDIA T4* | ~5x faster ✅ | NVIDIA[^5] | TensorRT with DP4A instructions |
\* Vendor-reported performance; not independently verified in this project
Key Insight: The dramatic performance difference demonstrates that INT8 optimization is not universal: it depends entirely on hardware support at the execution level.
We tested identical models under two scenarios to isolate the impact of threading:
| Configuration | FP32 Performance | INT8 Performance | Analysis |
|---|---|---|---|
| Single-thread (controlled) | Optimal (4.12ms) | Catastrophic (152ms) | INT8 operations serialize |
| Multi-thread (practical) | Degraded (5.65ms) | Improved (63.32ms) | Parallelization helps INT8 |
Key Insight: FP32 operations are so optimized on M3 that multi-threading adds overhead. INT8 operations desperately need parallelization to be remotely viable.
| Metric | FP32 (Single) | INT8 (Single) | FP32 (Multi) | INT8 (Multi) |
|---|---|---|---|---|
| Mean (ms) | 4.12 | 152.02 | 5.65 | 63.32 |
| Std Dev (ms) | 0.44 | 10.95 | 0.69 | 4.24 |
| CV | 0.108 | 0.072 | 0.122 | 0.067 |
| P95 (ms) | 4.81 | 170.23 | 6.93 | 70.89 |
| IQR (ms) | 0.52 | 15.23 | 0.89 | 5.67 |
Observation: INT8 shows lower CV (coefficient of variation) - not because it's stable, but because the bottleneck is consistent!
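The dispersion metrics in the table can be recomputed from raw latency samples. A sketch with NumPy (any list of per-iteration latencies works as input):

```python
# Sketch: the dispersion statistics reported above, computed from a list
# of per-iteration inference latencies in milliseconds.
import numpy as np


def latency_stats(samples_ms):
    s = np.asarray(samples_ms, dtype=float)
    mean, std = s.mean(), s.std(ddof=1)            # sample standard deviation
    q25, q75, p95 = np.percentile(s, [25, 75, 95])
    return {
        "mean_ms": mean,
        "std_ms": std,
        "cv": std / mean,     # coefficient of variation: spread relative to mean
        "p95_ms": p95,
        "iqr_ms": q75 - q25,  # interquartile range
    }
```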
- `converter.py`: PyTorch → ONNX conversion with validation
- `quantizer.py`: dynamic INT8 quantization implementation
- `benchmarker.py`: multi-scenario performance evaluation
- `analyzer.py`: model structure and operation analysis
- `ort_session_constrain.py`: centralized session management for reproducibility
- **Abstract Base Class**: `ModelOptimizer` provides a template for optimization strategies
- **Factory Function**: `build_session()` centralizes ONNX Runtime configuration for consistency
- **Strategy Pattern**: multiple benchmarking scenarios (controlled vs. practical) behind a unified interface
The testing revealed catastrophic INT8 performance on Apple M3 CPU, contrasting sharply with vendor-reported speedups on other platforms:
| Platform | INT8 vs FP32 | Why the Difference? |
|---|---|---|
| Apple M3 Test (this) | 37x slower ❌ | No CPU INT8 acceleration |
| Qualcomm claims[^4] | Up to 3x faster ✅ | Dedicated Hexagon DSP |
| NVIDIA reports[^5] | ~5x faster ✅ | Hardware DP4A instructions |
This stark contrast proves that optimization strategies must be validated on target hardware.
Single-configuration benchmarks hide critical performance characteristics:
- Thread scheduling impact: 58% performance difference between single vs multi-thread for INT8
- Parallelization requirements: INT8 on this platform is only remotely viable with multi-threading
- Hidden bottlenecks: Only comprehensive profiling revealed 63 quantization operations
We achieved 3.6x size reduction but suffered 37x speed degradation. For edge deployment:
- Size matters for storage/transmission
- Speed matters for user experience
- Different models needed for different constraints
- Single Platform Testing: Results specific to Apple M3 Max; other hardware based on vendor documentation
- Limited Model Coverage: Only tested YOLOv5n; different architectures may show different patterns
- Dynamic Quantization Only: Static quantization with calibration unexplored
- No Direct NPU Testing: Comparison with actual NPU devices pending
- Test on actual NPU hardware (Qualcomm Snapdragon, MediaTek Dimensity)
- Implement static quantization with calibration dataset
- Compare with platform-specific tools (TensorRT, Core ML, SNPE)
- Explore structured pruning and knowledge distillation
- Develop hardware-aware optimization selector
- Benchmark on edge devices (Raspberry Pi, Jetson Nano, Coral Dev Board)
1. **Always benchmark on your target hardware**
   - Expected: 3-4x INT8 speedup (based on literature)
   - Reality: 37x slowdown on M3 CPU
   - Lesson: never assume optimization benefits transfer across platforms

2. **Understand the execution path**
   - M3 has INT8 support in the Neural Engine ✅
   - ONNX Runtime uses CPU execution ❌
   - Result: hardware capability ≠ software accessibility

3. **Profile before optimizing**
   - 63 DynamicQuantizeLinear operations discovered only through profiling
   - Each adds overhead without hardware acceleration
   - Bottleneck location matters more than optimization technique

4. **Consider the full deployment context**
   - Size reduction: success (3.6x) ✅
   - Speed improvement: failure (37x slower) ❌
   - Decision: choose optimization based on actual constraints

5. **Document negative results**
   - Failed optimizations are valuable learning experiences
   - They help others avoid the same pitfalls
   - They contribute to collective understanding
This project demonstrates critical challenges in edge AI deployment:
- Generic optimizations fail: The 37x slowdown proves that "one-size-fits-all" optimization doesn't work
- NPU compilers are crucial: Specialized compilation tools must bridge the model-hardware gap
- Co-design is essential: Hardware and software must evolve together
The measured 37x penalty on M3 CPU versus the 3-5x acceleration reported for NPU-equipped devices[^4][^5] validates that:
- Edge AI requires specialized hardware acceleration
- Software optimization alone is insufficient
- Compiler technology must be hardware-aware
This research underscores why the industry is investing heavily in:
- Custom AI accelerators (NPUs, TPUs)
- Hardware-aware compilation frameworks
- Model-hardware co-optimization tools
This project is licensed under the MIT License - see the LICENSE file for details.
"In edge AI, the difference between success and failure often lies not in the model, but in understanding the hardware it runs on. What works on one platform may fail catastrophically on another."
Footnotes
[^1]: Jacob, B., et al. (2018). "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference." CVPR. https://arxiv.org/abs/1712.05877

[^2]: Krishnamoorthi, R. (2018). "Quantizing Deep Convolutional Networks for Efficient Inference." https://arxiv.org/abs/1806.08342

[^3]: "Apple vs. Oranges: Evaluating the Apple Silicon M-Series SoCs for HPC Performance and Efficiency." arXiv:2502.05317 (2025). Shows the Neural Engine supports INT8/FP16 but operates independently of the CPU.

[^4]: Qualcomm AI Hub Documentation. "Quantized models can have up to a 3x improvement in performance." https://app.aihub.qualcomm.com/docs/hub/quantize_examples.html (accessed 2025)

[^5]: "Fast INT8 Inference for Autonomous Vehicles with TensorRT 3." NVIDIA Developer Blog. Reports ~5x speedup on Pascal GPUs. https://developer.nvidia.com/blog/int8-inference-autonomous-vehicles-tensorrt/