title: Ollama + Qwen Local AI Setup for KENL
date: 2025-11-14
atom: ATOM-DOC-20251114-004
classification: OWI-DOC
status: production
platform: Bazzite, Fedora Atomic, Linux, macOS, Windows

Ollama + Qwen Local AI Setup for KENL

Complete guide to deploying Qwen language models locally using Ollama for offline AI assistance.


Why Local AI with Qwen?

KENL's token strategy balances cost, privacy, and capability:

| Provider | Usage | Use Case | Monthly Cost |
|---|---|---|---|
| Claude | 10% | Complex reasoning, code review, architecture | ~$20-50 |
| Perplexity | 30% | Research, documentation lookup, current events | ~$20 |
| Qwen (Local) | 60% | Code completion, refactoring, documentation generation | $0 |

Benefits of local Qwen:

  • Zero API costs - No usage limits or billing
  • Privacy - Code never leaves your machine
  • Offline - Works without internet connection
  • Low latency - No network round trips (~100ms response time on AMD Ryzen 5 5600H)
  • Open weights - Apache 2.0 license, full control

Trade-offs:

  • Hardware requirements - 16GB+ RAM for 8B models, 32GB+ for 14B models
  • Capability ceiling - Qwen 2.5 < Claude Sonnet 4.5 for complex reasoning
  • GPU acceleration - AMD GPUs require ROCm setup (CUDA easier for NVIDIA)

Hardware Requirements

Minimum (Qwen 3:4B model)

CPU: 4 cores (any modern processor)
RAM: 8GB
GPU: Not required (CPU inference works)
Storage: 5GB for model + 2GB for Ollama
Network: Only for initial download (~3GB)

Performance: ~5-10 tokens/second (acceptable for code completion)

Recommended (Qwen 3:8B model)

CPU: AMD Ryzen 5 5600H or Intel i5-11400 equivalent
RAM: 16GB
GPU: Integrated AMD Radeon Vega (7 CUs) or NVIDIA GTX 1650 (optional, 2-5x speedup)
Storage: 10GB for model + 2GB for Ollama
Network: Only for initial download (~5GB)

Performance: ~15-30 tokens/second (smooth interactive use)

Optimal (Qwen 2.5:14B model)

CPU: AMD Ryzen 7 5800H or Intel i7-11800H
RAM: 32GB
GPU: AMD RX 6600 or NVIDIA RTX 3060 (6GB+ VRAM)
Storage: 20GB for model + 2GB for Ollama
Network: Only for initial download (~9GB)

Performance: ~20-40 tokens/second (near-instant responses)


Installation

Option 1: Distrobox (Recommended for Bazzite)

Why Distrobox? Keeps Ollama in a container, preserving immutable OS rollback safety.

# Create Ubuntu 24.04 container for Ollama
distrobox create --name ollama-qwen --image docker.io/library/ubuntu:24.04

# Enter container
distrobox enter ollama-qwen

# Install Ollama (inside container)
curl -fsSL https://ollama.com/install.sh | sh

# Start Ollama service
ollama serve &

# Pull Qwen 3 (8B model - balanced performance)
ollama pull qwen3:8b

# Test
ollama run qwen3:8b "Write a bash function to check if a file exists"

# Exit container (Ctrl+D or 'exit')

ATOM Trail:

echo "ATOM-CFG-$(date +%Y%m%d)-001: Created ollama-qwen distrobox container with Qwen 3:8B model" >> ~/.kenl/logs/atom-trail.log

Option 2: Toolbox (Alternative for Fedora Atomic)

# Create Fedora toolbox
toolbox create --distro fedora --release 39 ollama-qwen

# Enter toolbox
toolbox enter ollama-qwen

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Continue with steps from Option 1...

Option 3: RPM-OSTree Layered (System-wide, breaks immutability)

⚠️ WARNING: This modifies the immutable base system. Use Distrobox/Toolbox instead unless you need system-wide access.

# Add Ollama repository
sudo tee /etc/yum.repos.d/ollama.repo <<EOF
[ollama]
name=Ollama
baseurl=https://ollama.com/download/fedora
enabled=1
gpgcheck=0
EOF

# Layer Ollama onto system
sudo rpm-ostree install ollama

# Reboot to apply
systemctl reboot

# After reboot, enable Ollama service
sudo systemctl enable --now ollama

# Pull model
ollama pull qwen3:8b

ARCREF Required: Create ARCREF-OLLAMA-SYSTEM-001 with a rollback plan before using this approach.
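A minimal rollback sketch for that ARCREF, using standard rpm-ostree commands:

# Remove the layered package from the current deployment
sudo rpm-ostree uninstall ollama

# Or boot back into the previous deployment entirely
sudo rpm-ostree rollback
systemctl reboot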

Option 4: Flatpak (Experimental)

Status: Ollama Flatpak not officially available as of 2025-11-14. Monitor Flathub for updates.


Model Selection Guide

Qwen offers multiple model sizes. Choose based on your hardware and use case:

Qwen 3 Series (Latest, 2025)

| Model | Parameters | RAM Required | Download Size | Use Case | Speed (Ryzen 5 5600H) |
|---|---|---|---|---|---|
| qwen3:4b | 4 billion | 6GB | 2.8GB | Quick code completion, summaries | ~10 tok/sec (CPU) |
| qwen3:8b | 8 billion | 12GB | 5.2GB | Recommended - balanced capability | ~15 tok/sec (CPU) |
| qwen3:14b | 14 billion | 24GB | 8.9GB | Complex refactoring, architecture | ~8 tok/sec (CPU) |
| qwen3:32b | 32 billion | 48GB+ | 20GB | Reserved for powerful workstations | ~3 tok/sec (CPU) |

Qwen 2.5 Series (Stable, 2024)

| Model | Parameters | RAM Required | Download Size | Use Case | Speed (Ryzen 5 5600H) |
|---|---|---|---|---|---|
| qwen2.5:7b | 7 billion | 10GB | 4.7GB | Stable, proven for code generation | ~12 tok/sec (CPU) |
| qwen2.5:14b | 14 billion | 24GB | 8.9GB | Highest quality (needs 24GB+ RAM) | ~7 tok/sec (CPU) |
| qwen2.5-coder:7b | 7 billion | 10GB | 4.7GB | Best for coding - fine-tuned | ~12 tok/sec (CPU) |

Recommendation for KENL (AMD Ryzen 5 5600H + 16GB RAM):

# Best balance of quality and speed
ollama pull qwen2.5-coder:7b

# Alternative: Latest Qwen 3
ollama pull qwen3:8b
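To check which model your hardware actually sustains, ollama run --verbose prints timing statistics (including eval rate in tokens/second) after each response. A quick comparison sketch:

# Compare real-world speed of candidate models on this machine
# (--verbose stats may go to stderr, hence 2>&1)
for model in qwen3:4b qwen3:8b qwen2.5-coder:7b; do
  echo "=== $model ==="
  ollama run --verbose "$model" "Write a one-line bash command to count files" 2>&1 | grep "eval rate"
done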

Usage Patterns

1. Command Line (Quick Queries)

# One-shot question
ollama run qwen2.5-coder:7b "How do I optimize this bash script for performance?"

# Interactive chat
ollama run qwen2.5-coder:7b
>>> Explain ATOM trails in KENL
>>> How do I create a PowerShell module manifest?
>>> /bye

2. VS Code Integration (Continue.dev)

Install Continue.dev extension:

  1. Open VS Code
  2. Install extension: Continue - Codestral, Claude, and more
  3. Configure ~/.continue/config.json:
{
  "models": [
    {
      "title": "Qwen 2.5 Coder (Local)",
      "provider": "ollama",
      "model": "qwen2.5-coder:7b",
      "apiBase": "http://localhost:11434"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen 3 (Fast)",
    "provider": "ollama",
    "model": "qwen3:4b"
  }
}

Usage:

  • Ctrl+L - Chat with Qwen
  • Ctrl+I - Inline code completion
  • Highlight code + Ctrl+L - Explain/refactor selection

3. Claude Desktop MCP Integration

Goal: Use Qwen for quick queries, escalate to Claude for complex reasoning.

Install Ollama MCP Server:

# Inside Distrobox container
npm install -g @modelcontextprotocol/server-ollama

# Configure Claude Desktop (~/.config/Claude/claude_desktop_config.json)
{
  "mcpServers": {
    "ollama": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-ollama"],
      "env": {
        "OLLAMA_HOST": "http://localhost:11434"
      }
    }
  }
}

Workflow:

User query → Claude decides:
  - Simple code gen? → MCP call to Ollama (fast, free)
  - Complex architecture? → Claude handles directly (slower, costs tokens)
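Before relying on that routing, confirm the HTTP API the MCP server calls is reachable. A minimal check against Ollama's standard /api/generate endpoint (stream disabled so curl gets a single JSON reply):

# Verify the Ollama REST API responds
curl -s http://localhost:11434/api/generate -d '{
  "model": "qwen2.5-coder:7b",
  "prompt": "Say hello in one word.",
  "stream": false
}'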

ATOM Trail:

echo "ATOM-MCP-$(date +%Y%m%d)-001: Configured Claude Desktop to use Ollama MCP for code generation offload" >> ~/.kenl/logs/atom-trail.log

4. Python API (Programmatic Access)

#!/usr/bin/env python3
# SPDX-License-Identifier: MIT
# SPDX-FileCopyrightText: 2025 KENL Project

import ollama

# Generate code
response = ollama.chat(model='qwen2.5-coder:7b', messages=[
    {
        'role': 'user',
        'content': 'Write a Python function to parse ATOM trail logs'
    }
])

print(response['message']['content'])

# Streaming for long responses
stream = ollama.chat(
    model='qwen2.5-coder:7b',
    messages=[{'role': 'user', 'content': 'Explain Proton compatibility layers'}],
    stream=True
)

for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)

Install Python client:

pip install ollama

Performance Optimization

1. GPU Acceleration (AMD GPUs)

Install ROCm for AMD Radeon:

# Inside Distrobox (Ubuntu 24.04)
# Download the installer package (wget cannot expand the * glob; replace VERSION
# with the current release listed at repo.radeon.com/amdgpu-install/latest/ubuntu/jammy/)
wget https://repo.radeon.com/amdgpu-install/latest/ubuntu/jammy/amdgpu-install_VERSION_all.deb
sudo dpkg -i amdgpu-install_VERSION_all.deb
# --no-dkms skips the kernel module build; inside a container the driver comes from the host
sudo amdgpu-install --usecase=rocm --no-dkms

# Verify GPU detection
rocm-smi

# Ollama automatically uses ROCm if available
ollama run qwen3:8b

Expected speedup: 2-5x faster inference (CPU: ~15 tok/sec → GPU: ~40 tok/sec)
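Note: integrated Vega GPUs like the 5600H's are not on ROCm's official support list. A commonly used workaround is overriding the GFX version ROCm reports; whether it works on your exact APU is not guaranteed:

# Workaround for officially unsupported AMD iGPUs (9.0.0 targets GFX9/Vega parts)
HSA_OVERRIDE_GFX_VERSION=9.0.0 ollama serve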

2. GPU Acceleration (NVIDIA GPUs)

NVIDIA is easier:

# Ollama auto-detects CUDA
# Just ensure nvidia-docker or CUDA toolkit is installed
ollama run qwen3:8b

3. RAM Optimization

Context window size affects RAM usage:

# Default: 2048-token context (moderate RAM)
ollama run qwen3:8b

# The CLI has no --context flag; set num_ctx from the interactive REPL:
>>> /set parameter num_ctx 1024    # reduced context: lower RAM, less history
>>> /set parameter num_ctx 4096    # increased context: more RAM, longer conversations

# Or persist a context size in a derived model via a Modelfile:
cat > Modelfile <<'EOF'
FROM qwen3:8b
PARAMETER num_ctx 4096
EOF
ollama create qwen3-8b-4k -f Modelfile

Rule of thumb: Each 1000 tokens ≈ 1GB RAM for 8B models.

4. Model Quantization

Ollama's default model tags ship 4-bit quantized weights (Q4_K_M on recent releases, Q4_0 on some older ones). For lower RAM:

# Pull Q2 quantized version (2-bit, 50% RAM reduction, quality loss)
# Not officially available - custom GGUF required
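If you do source a more aggressively quantized GGUF yourself, Ollama can import it through a Modelfile (the filename below is a placeholder, not a real artifact):

# Import a custom-quantized GGUF (qwen2.5-coder-7b-q2_k.gguf is a hypothetical name)
cat > Modelfile <<'EOF'
FROM ./qwen2.5-coder-7b-q2_k.gguf
EOF
ollama create qwen-coder-q2 -f Modelfile
ollama run qwen-coder-q2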

Troubleshooting

Issue: "Ollama command not found"

Cause: Ollama installed in Distrobox, but you're in host system.

Fix:

# Enter Distrobox first
distrobox enter ollama-qwen

# Then run Ollama
ollama run qwen3:8b

Issue: "Error: model not found"

Cause: Model not pulled yet.

Fix:

ollama pull qwen3:8b

Issue: Slow inference (<5 tokens/second)

Diagnostics:

# Check if GPU is used
ollama ps

# Check system resources
htop  # CPU usage
nvidia-smi  # NVIDIA GPU
rocm-smi  # AMD GPU

Common causes:

  • CPU-only inference (no GPU acceleration)
  • Insufficient RAM (swapping to disk)
  • Large context window (reduce num_ctx, e.g. /set parameter num_ctx 1024)

Issue: "Connection refused: localhost:11434"

Cause: Ollama service not running.

Fix:

# Inside Distrobox
ollama serve &

# Or enable systemd service (if using system install)
sudo systemctl start ollama
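To confirm the service is back, hit the root endpoint; a running instance answers with "Ollama is running":

curl -s http://localhost:11434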

Issue: AMD GPU not detected

Check ROCm installation:

rocm-smi
lsmod | grep amdgpu

Fix:

# Reinstall AMD GPU drivers
sudo amdgpu-install --usecase=graphics,rocm
sudo systemctl reboot

Integration with KENL Workflows

1. ATOM Trail Generation

Use Qwen to suggest ATOM tags:

ollama run qwen2.5-coder:7b "I just installed MangoHud for AMD GPU monitoring. Suggest an ATOM tag."

# Output:
# ATOM-CFG-20251114-001: Installed MangoHud for Radeon Vega GPU performance overlay
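To make this a one-liner, a hypothetical shell helper could wrap the call (atom-suggest is not part of KENL; the name, prompt, and model choice are assumptions):

# Add to ~/.bashrc: ask the local model for an ATOM tag suggestion
atom-suggest() {
  ollama run qwen2.5-coder:7b "Suggest a single KENL ATOM tag (format ATOM-TYPE-YYYYMMDD-NNN: description) for: $*"
}

# Usage:
# atom-suggest "Installed MangoHud for AMD GPU monitoring"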

2. Play Card Documentation

Generate Play Card YAML from natural language:

ollama run qwen2.5-coder:7b "Create a KENL Play Card YAML for Halo MCC on ProtonGE 9-20 with DXVK async enabled"

# Output:
# game: "Halo: The Master Chief Collection"
# proton_version: "GE-Proton9-20"
# launch_options: "DXVK_ASYNC=1 %command%"
# ...

3. Script Refactoring

Improve bash scripts:

# Save script to file
cat > /tmp/myscript.sh <<'EOF'
#!/bin/bash
rpm-ostree status
EOF

# Get Qwen's suggestions
ollama run qwen2.5-coder:7b "Refactor this script to add error handling and ATOM trail logging: $(cat /tmp/myscript.sh)"

4. Documentation Generation

Generate README sections:

ollama run qwen2.5-coder:7b "Write a README section explaining how to install KENL on Bazzite for gaming"

Comparison: Qwen vs. Claude

| Feature | Qwen 2.5-Coder 7B (Local) | Claude Sonnet 4.5 (API) |
|---|---|---|
| Cost | $0 (one-time download) | ~$3/million input tokens |
| Privacy | 100% local | Cloud (Anthropic servers) |
| Speed | ~12 tok/sec (CPU) | ~40 tok/sec (network dependent) |
| Context | 8K tokens (Qwen 2.5), 128K (Qwen 3) | 200K tokens |
| Code quality | Good (7/10) | Excellent (9.5/10) |
| Reasoning | Moderate | Exceptional |
| Offline | ✅ Works offline | ❌ Requires internet |

When to use Qwen:

  • Code completion and snippets
  • Refactoring existing code
  • Documentation generation
  • Quick explanations
  • Privacy-sensitive code

When to escalate to Claude:

  • Architectural design
  • Complex debugging
  • Multi-file refactoring
  • Security-critical code review
  • Novel algorithm design

Backup and Model Management

List installed models

ollama list

Remove unused models

ollama rm qwen3:32b

Copy models to another machine

# Ollama has no save/load subcommands; copy the model store instead.
# Default store: ~/.ollama/models (per-user) or /usr/share/ollama/.ollama/models (systemd install)
tar czf qwen-models.tar.gz -C ~/.ollama models

# On new machine
tar xzf qwen-models.tar.gz -C ~/.ollama

Update models

# Pull latest version
ollama pull qwen3:8b

# Ollama automatically replaces old version
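To refresh everything you have installed in one pass, a small sketch that feeds ollama list back into ollama pull (assumes the default list layout with the model name in the first column):

# Re-pull every installed model (tail skips the header line of `ollama list`)
ollama list | tail -n +2 | awk '{print $1}' | xargs -n1 ollama pull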

Resources

Official Documentation

  • Ollama: https://ollama.com
  • Ollama GitHub (install docs, API reference): https://github.com/ollama/ollama
  • Qwen models (QwenLM): https://github.com/QwenLM

ATOM Trail

ATOM-DOC-20251114-004: Created comprehensive Ollama + Qwen setup guide for KENL
Intent: Enable 60% local AI usage per KENL token strategy
Validation: Tested on AMD Ryzen 5 5600H + 16GB RAM with Qwen 2.5-Coder 7B
Rollback: Remove Distrobox container: distrobox rm ollama-qwen
Next: Integrate with Continue.dev for VS Code AI assistance

Last Updated: 2025-11-14
Tested On: Bazzite KDE (Fedora Atomic 39), Ubuntu 24.04 (Distrobox)
Hardware: AMD Ryzen 5 5600H, 16GB RAM, Radeon Vega Graphics