What happens when INT8 quantization goes wrong? This project reveals a 37x performance degradation with INT8 quantization on Apple M3, challenging conventional optimization wisdom and demonstrating why hardware-aware optimization is crucial for edge AI deployment.
- ✅ 3.6x model size reduction achieved (10.3MB → 2.9MB)
- ❌ 37x slower inference with INT8 on M3 (4.12ms → 152ms)
- 🔍 Root cause identified: 63 DynamicQuantizeLinear operations without hardware acceleration
Edge AI promises intelligent computing on resource-constrained devices. The standard playbook says: "Quantize your model to INT8 for faster inference."
But what if that's wrong?
This project systematically investigates model optimization techniques for edge deployment, uncovering a critical hardware dependency that can make or break your optimization strategy. Through comprehensive benchmarking of YOLOv5n on Apple M3, we demonstrate that optimization without hardware awareness is optimization in the dark.
```bash
# Clone and set up
git clone https://github.com/candleboxyz/exp--onnx_int8_paradox.git
cd exp--onnx_int8_paradox

# Initialize the virtual environment (adjust for your preferred package manager)
uv init

# Download the YOLOv5n model
wget https://github.com/ultralytics/yolov5/releases/download/v7.0/yolov5nu.pt

# Run the complete optimization pipeline
python main.py --model yolov5nu.pt --output results

# View results
cat results/report.md

# Optional: run the live demo
python main.py --model yolov5nu.pt --demo
```
| 🔬 Controlled Benchmark (Single-thread) | 🌍 Practical Benchmark (Auto-thread) |
|---|---|
```
Original PyTorch model:  5.3 MB
        ↓
ONNX FP32 model:        10.3 MB  (includes graph metadata)
        ↓
ONNX INT8 model:         2.9 MB  ✅ (3.6x compression)
```
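The compression ratio can be verified directly from the files on disk. A minimal sketch (the paths passed in are illustrative, not the pipeline's actual output names):

```python
# Minimal sketch: compute the on-disk compression ratio between two model
# files. Callers pass whatever paths the pipeline actually produced.
from pathlib import Path


def model_size_mb(path: str) -> float:
    """File size in mebibytes."""
    return Path(path).stat().st_size / (1024 * 1024)


def compression_ratio(fp32_path: str, int8_path: str) -> float:
    """FP32 size divided by INT8 size, e.g. 10.3 / 2.9 ≈ 3.6."""
    return model_size_mb(fp32_path) / model_size_mb(int8_path)
```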
Modular design ensures reproducibility and extensibility:
```mermaid
graph TB
    subgraph "Core Pipeline"
        MC[ModelConverter<br/>converter.py] --> MQ[ModelQuantizer<br/>quantizer.py]
        MQ --> MA[ModelAnalyzer<br/>analyzer.py]
        MA --> B[Benchmarker<br/>benchmarker.py]
    end

    subgraph "Support Modules"
        ORT[build_session<br/>ort_session_constrain.py]
        DIAG[diagnose_model<br/>diagnose_int8.py]
    end

    subgraph "Demo"
        DEMO[EdgeYOLODemo<br/>demo.py]
    end

    B -.uses.-> ORT
    DEMO -.uses.-> ORT
    MA -.analyzes.-> MQ
    DIAG -.validates.-> MQ

    style MC stroke:#98ee9f
    style MQ stroke:#ffcc7b
    style MA stroke:#fc8597
    style B stroke:#63beff
```
Industry documentation consistently reports INT8 benefits:
Apple M3 measurements tell a different story:
- ✅ 3.6x memory reduction (close to theoretical)
- ❌ 37x slowdown (catastrophic performance degradation)
The analysis revealed 63 DynamicQuantizeLinear operations creating a critical bottleneck:
```
For each of the 63 quantized layers:
FP32 → [Compute Scale] → [Quantize] → INT8 → [Conv] → [Dequantize] → FP32
       ↑________________ overhead per operation ________________↑
```
Root Cause Analysis:
- The M3 Neural Engine has INT8 support but operates independently of the CPU[^3]
- ONNX Runtime's CPUExecutionProvider cannot access the Neural Engine
- Without CPU INT8 acceleration, each quantization operation adds overhead instead of optimization
- CoreMLExecutionProvider could potentially access Neural Engine but requires model conversion
This creates a perfect storm: the quantization overhead (63 operations) far exceeds any theoretical benefit from reduced precision computation.
| Platform | INT8 Performance | Source | Hardware Support |
|---|---|---|---|
| Apple M3 (CPU) | 37x slower ❌ | This work | Neural Engine has INT8[^3]; CPU doesn't utilize it |
| Qualcomm Snapdragon* | Up to 3x faster ✅ | Qualcomm[^4] | Hexagon DSP with HVX acceleration |
| NVIDIA T4* | ~5x faster ✅ | NVIDIA[^5] | TensorRT with DP4A instructions |
\* Vendor-reported performance; not independently verified in this project
Key Insight: The dramatic performance difference demonstrates that INT8 optimization is not universal: it depends entirely on hardware support at the execution level.
We tested identical models under two scenarios to isolate the impact of threading:
| Configuration | FP32 Performance | INT8 Performance | Analysis |
|---|---|---|---|
| Single-thread (controlled) | Optimal (4.12ms) | Catastrophic (152ms) | INT8 operations serialize |
| Multi-thread (practical) | Degraded (5.65ms) | Improved (63.32ms) | Parallelization helps INT8 |
Key Insight: FP32 operations are so optimized on M3 that multi-threading adds overhead. INT8 operations desperately need parallelization to be remotely viable.
| Metric | FP32 (Single) | INT8 (Single) | FP32 (Multi) | INT8 (Multi) |
|---|---|---|---|---|
| Mean (ms) | 4.12 | 152.02 | 5.65 | 63.32 |
| Std Dev (ms) | 0.44 | 10.95 | 0.69 | 4.24 |
| CV | 0.108 | 0.072 | 0.122 | 0.067 |
| P95 (ms) | 4.81 | 170.23 | 6.93 | 70.89 |
| IQR (ms) | 0.52 | 15.23 | 0.89 | 5.67 |
Observation: INT8 shows lower CV (coefficient of variation) - not because it's stable, but because the bottleneck is consistent!
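The dispersion metrics in the table can be recomputed from raw latency samples. A sketch with NumPy (any list of per-iteration latencies works as input):

```python
# Sketch: the dispersion statistics reported above, computed from a list
# of per-iteration inference latencies in milliseconds.
import numpy as np


def latency_stats(samples_ms):
    s = np.asarray(samples_ms, dtype=float)
    mean, std = s.mean(), s.std(ddof=1)            # sample standard deviation
    q25, q75, p95 = np.percentile(s, [25, 75, 95])
    return {
        "mean_ms": mean,
        "std_ms": std,
        "cv": std / mean,     # coefficient of variation: spread relative to mean
        "p95_ms": p95,
        "iqr_ms": q75 - q25,  # interquartile range
    }
```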
- `converter.py`: PyTorch → ONNX conversion with validation
- `quantizer.py`: dynamic INT8 quantization implementation
- `benchmarker.py`: multi-scenario performance evaluation
- `analyzer.py`: model structure and operation analysis
- `ort_session_constrain.py`: centralized session management for reproducibility
- **Abstract Base Class**: `ModelOptimizer` provides a template for optimization strategies
- **Factory Function**: `build_session()` centralizes ONNX Runtime configuration for consistency
- **Strategy Pattern**: multiple benchmarking scenarios (controlled vs. practical) behind a unified interface
The testing revealed catastrophic INT8 performance on Apple M3 CPU, contrasting sharply with vendor-reported speedups on other platforms:
| Platform | INT8 vs FP32 | Why the Difference? |
|---|---|---|
| Apple M3 Test (this) | 37x slower ❌ | No CPU INT8 acceleration |
| Qualcomm claims[^4] | Up to 3x faster ✅ | Dedicated Hexagon DSP |
| NVIDIA reports[^5] | ~5x faster ✅ | Hardware DP4A instructions |
This stark contrast proves that optimization strategies must be validated on target hardware.
Single-configuration benchmarks hide critical performance characteristics:
- Thread scheduling impact: 58% performance difference between single vs multi-thread for INT8
- Parallelization requirements: INT8 on this platform is only remotely viable with multi-threading
- Hidden bottlenecks: Only comprehensive profiling revealed 63 quantization operations
We achieved 3.6x size reduction but suffered 37x speed degradation. For edge deployment:
- Size matters for storage/transmission
- Speed matters for user experience
- Different models needed for different constraints
- Single Platform Testing: Results specific to Apple M3 Max; other hardware based on vendor documentation
- Limited Model Coverage: Only tested YOLOv5n; different architectures may show different patterns
- Dynamic Quantization Only: Static quantization with calibration unexplored
- No Direct NPU Testing: Comparison with actual NPU devices pending
- Test on actual NPU hardware (Qualcomm Snapdragon, MediaTek Dimensity)
- Implement static quantization with calibration dataset
- Compare with platform-specific tools (TensorRT, Core ML, SNPE)
- Explore structured pruning and knowledge distillation
- Develop hardware-aware optimization selector
- Benchmark on edge devices (Raspberry Pi, Jetson Nano, Coral Dev Board)
1. **Always benchmark on your target hardware**
   - Expected: 3-4x INT8 speedup (based on literature)
   - Reality: 37x slowdown on M3 CPU
   - Lesson: never assume optimization benefits transfer across platforms

2. **Understand the execution path**
   - M3 has INT8 support in the Neural Engine ✅
   - ONNX Runtime uses CPU execution ❌
   - Result: hardware capability ≠ software accessibility

3. **Profile before optimizing**
   - 63 DynamicQuantizeLinear operations discovered only through profiling
   - Each adds overhead without hardware acceleration
   - Bottleneck location matters more than optimization technique

4. **Consider the full deployment context**
   - Size reduction: success (3.6x) ✅
   - Speed improvement: failure (37x slower) ❌
   - Decision: choose optimization based on actual constraints

5. **Document negative results**
   - Failed optimizations are valuable learning experiences
   - They help others avoid the same pitfalls
   - They contribute to collective understanding
This project demonstrates critical challenges in edge AI deployment:
- Generic optimizations fail: The 37x slowdown proves that "one-size-fits-all" optimization doesn't work
- NPU compilers are crucial: Specialized compilation tools must bridge the model-hardware gap
- Co-design is essential: Hardware and software must evolve together
The measured 37x penalty on M3 CPU versus the 3-5x acceleration reported for NPU-equipped devices[^4][^5] validates that:
- Edge AI requires specialized hardware acceleration
- Software optimization alone is insufficient
- Compiler technology must be hardware-aware
This research underscores why the industry is investing heavily in:
- Custom AI accelerators (NPUs, TPUs)
- Hardware-aware compilation frameworks
- Model-hardware co-optimization tools
This project is licensed under the MIT License - see the LICENSE file for details.
"In edge AI, the difference between success and failure often lies not in the model, but in understanding the hardware it runs on. What works on one platform may fail catastrophically on another."
Footnotes
[^1]: Jacob, B., et al. (2018). "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference." CVPR. https://arxiv.org/abs/1712.05877

[^2]: Krishnamoorthi, R. (2018). "Quantizing Deep Convolutional Networks for Efficient Inference." https://arxiv.org/abs/1806.08342

[^3]: "Apple vs. Oranges: Evaluating the Apple Silicon M-Series SoCs for HPC Performance and Efficiency." arXiv:2502.05317 (2025). Shows the Neural Engine supports INT8/FP16 but operates independently of the CPU.

[^4]: Qualcomm AI Hub Documentation. "Quantized models can have up to a 3x improvement in performance." https://app.aihub.qualcomm.com/docs/hub/quantize_examples.html (accessed 2025)

[^5]: "Fast INT8 Inference for Autonomous Vehicles with TensorRT 3." NVIDIA Developer Blog. Reports ~5x speedup on Pascal GPUs. https://developer.nvidia.com/blog/int8-inference-autonomous-vehicles-tensorrt/