Releases: YV17labs/GhostDesk

v7.2.0

22 Apr 00:06

Highlights

  • Reliable screen_changed feedback. Input tools no longer return false negatives. Polling now compares the full screen at quarter resolution via a bounding-box ratio, so any real UI change is caught regardless of where it lands — particularly for keyboard actions, where focus is unrelated to the mouse cursor and the previous zone-based check was systematically wrong.
  • New mouse_move tool. Lets agents trigger hover-only UI reactions (CSS :hover states, dropdowns that appear on mouse-over, tooltips) without clicking.

Changes

Added

  • mouse_move — moves the cursor without pressing any button, for hover-triggered UI states.
  • "Filling tabular UIs" section in the server instructions — explains that spreadsheets and grid forms can be driven in a single key_type (\t for next cell, \n for next row) or a clipboard_set + paste, removing the need to click each cell.
  • screens_differ() and capture_png(scale=…) — public helpers in screen/_shared.py for downsampled, threshold-based comparison.
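
As a rough illustration of the tabular-filling guidance above, a grid of values can be flattened into one string using \t between cells and \n between rows, then sent in a single key_type call. The helper name rows_to_keystrokes is hypothetical and not part of GhostDesk; only the \t / \n convention comes from the release notes.

```python
def rows_to_keystrokes(rows):
    """Flatten a 2D grid into one string: tab between cells, newline between rows.

    Hypothetical helper for illustration; the result is meant to be passed
    as the text of a single key_type call.
    """
    lines = []
    for row in rows:
        cells = [str(cell) for cell in row]
        for cell in cells:
            # A tab or newline inside a cell would shift every later cell.
            if "\t" in cell or "\n" in cell:
                raise ValueError(f"cell contains a separator character: {cell!r}")
        lines.append("\t".join(cells))
    return "\n".join(lines)


payload = rows_to_keystrokes([["Name", "Qty"], ["Widget", 3]])
```

Rejecting cells that already contain a separator is a design choice of this sketch; the clipboard_set + paste route mentioned above may be a better fit for such data.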

Fixed

  • screen_changed reliability — input tools now poll the full screen instead of a 200×200 px zone around the mouse cursor. key_type and key_press no longer report false negatives, and mouse tools detect effects that land away from the click point (toasts, dropdowns, distant menus). The polling baseline is decoded once per call to keep the hot loop cheap.
  • Latent bug in screens_stable: Pillow's ImageChops.difference() on RGBA captures left the diff's alpha channel at 0, which made getbbox() ignore real RGB changes. Captures are now converted to RGB before comparison.
  • Test suite — tool-count assertion bumped to 13 to reflect mouse_move.
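
The full-screen comparison described above can be sketched in pure Python on small grayscale grids. This is not the project's screens_differ(): the real helper works on downsampled captures, and the exact meaning of "bounding-box ratio", the per-pixel threshold, and the min_ratio default below are all assumptions made for illustration.

```python
def changed_bbox(before, after, threshold=0):
    """Return (left, top, right, bottom) enclosing all changed pixels, or None.

    `before` and `after` are equally sized lists of rows of grayscale ints.
    """
    xs, ys = [], []
    for y, (row_a, row_b) in enumerate(zip(before, after)):
        for x, (a, b) in enumerate(zip(row_a, row_b)):
            if abs(a - b) > threshold:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1)


def screens_differ_sketch(before, after, min_ratio=0.001, threshold=0):
    """True when the changed bounding box covers at least min_ratio of the screen."""
    box = changed_bbox(before, after, threshold)
    if box is None:
        return False
    left, top, right, bottom = box
    screen_area = len(before) * len(before[0])
    return (right - left) * (bottom - top) / screen_area >= min_ratio


blank = [[0] * 4 for _ in range(4)]
one_pixel = [row[:] for row in blank]
one_pixel[1][2] = 255  # a single changed pixel at x=2, y=1
```

Because the whole frame is scanned, a change far from the cursor counts just as much as one under it, which is the property the fix relies on.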

Changed

  • Middleware now logs tool call durations in milliseconds instead of seconds.

Removed

  • _cursor module — get_cursor_position() was orphaned by the feedback refactor; the module and its callers in _wayland.py have been dropped.

Documentation

  • README points the llama.cpp fork to the integration/webp-turbo branch.
  • Demo video moved to GitHub's user-attachments CDN; the demos/ directory has been removed from the repo.
  • .DS_Store files are now ignored and untracked.

v7.1.0

19 Apr 20:42
f970f5b

Adds the native MCP surfaces the server wasn't exposing yet (resources, lifespan warm-up, icons, tool annotations), stricter HTTP-transport security, finer-grained tool feedback through MCP notifications/message, and a consolidated system-level brief delivered through the spec-canonical instructions field.

Added

  • MCP resources. ghostdesk://apps (JSON catalogue of installed GUI apps) and ghostdesk://clipboard (current clipboard text) mirror the app_list / clipboard_get tools so clients that surface resources in a dedicated picker can reach read-only state without spending an agent turn on a tool call.
  • FastMCP lifespan. The server pre-binds zwlr_virtual_pointer_v1 and zwp_virtual_keyboard_v1 during ASGI startup. Missing compositor protocols now fail at boot instead of surfacing mid-request on the first mouse_click.
  • MCP context notifications on tools. mouse_* and key_* push a warning when the 200×200 zone around the action does not change within 2 s — the miss is visible in the client's transcript, not only in the tool result dict. app_launch and clipboard_set mirror their outcomes through ctx.info / ctx.error.
  • GhostDesk icon on every MCP surface. The branded mark is advertised on the server itself, every tool, and both resources through MCP's icons field. Inlined as a base64 SVG data URI — no packaging asset to ship alongside the wheel.
  • ToolAnnotations on every tool. readOnlyHint, destructiveHint, and idempotentHint let MCP clients differentiate approval flows for read-only vs destructive actions: screen_shot / clipboard_get / app_list are tagged read-only + idempotent, mouse_click / mouse_drag / key_press are tagged destructive, etc.
  • Origin header validation (MCP Streamable HTTP spec § DNS-rebinding). Browser requests must match GHOSTDESK_ALLOWED_ORIGINS (comma-separated) or get a 403. Non-browser clients (no Origin header) pass through unchanged.
  • Loopback bind by default. GHOSTDESK_HOST defaults to 127.0.0.1; the container entrypoint exports 0.0.0.0 so Docker port-publishing still reaches the server, but standalone uv run ghostdesk no longer silently exposes the port to the LAN.
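
The Origin check described above can be sketched as a pure function. The function names are hypothetical, and how the real server normalises origins (case, trailing slashes, ports) is not stated in these notes; this sketch does an exact match after trimming whitespace.

```python
def parse_allowed_origins(raw):
    """Split a comma-separated GHOSTDESK_ALLOWED_ORIGINS value into a set."""
    return {item.strip() for item in raw.split(",") if item.strip()}


def origin_status(origin, allowed):
    """Return the HTTP status the check would produce.

    origin is the value of the Origin header, or None when the header is absent.
    """
    if origin is None:
        # Non-browser clients send no Origin header and pass through unchanged.
        return 200
    return 200 if origin in allowed else 403


allowed = parse_allowed_origins("https://app.example, http://localhost:3000")
```

In a real deployment the raw string would come from the environment, for example os.environ.get("GHOSTDESK_ALLOWED_ORIGINS", "").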

Changed

  • Consolidated system-level brief. The full agent doctrine (SEE → ACT → SEE, prefer-keyboard, interruption handling, scroll-to-end, final self-check) is now carried by the server instructions field — the MCP spec-canonical payload delivered in the initialize response and auto-injected by every compliant client. Per the MCP spec, prompts are user-controlled templates (slash commands, picker entries), which makes them the wrong mechanism for a system-level brief that must always reach the model. One document, guaranteed delivery.
  • Package layout for MCP surfaces. resources is now a package (matching apps, clipboard, input, screen) — every domain with a register(mcp) function follows the same __init__.py convention.
  • warn_on_miss helper. Lives in input/feedback.py alongside build_feedback and poll_for_change, so mouse and keyboard tools share the miss-warning path without crossing underscore-prefixed module boundaries.
  • mcp[cli] pinned to >=1.27. Unlocks the ToolAnnotations, Icon, and lifespan APIs used throughout this release.

Fixed

  • Wheel scroll direction inverted. mouse_scroll(direction="up") (and "left") silently scrolled the other way: the virtual-pointer axis_discrete request was sent with discrete=+1 regardless of value's sign, violating the wl_pointer protocol invariant that the two must match within a frame. Firefox — like any wheel-aware client — trusts delta_discrete, so every "up" scroll collapsed into "down" and pinned at the page bottom. Sign is now carried in _SCROLL_VECTORS alongside value, and a static test locks the invariant.

Removed

  • Standalone SYSTEM_PROMPT.md. Its content is now folded into the server instructions field, delivered automatically at session init. Users who referenced the markdown file directly no longer need to — the guidance now reaches the model through the MCP handshake.

Full Changelog: v7.0.1...v7.1.0

v7.0.1

15 Apr 22:39

Fixed

  • Missing envsubst in runtime images. entrypoint.sh uses envsubst to inject GHOSTDESK_SCREEN_WIDTH / GHOSTDESK_SCREEN_HEIGHT into the Sway config, but the binary was not part of the runtime stack — containers booted into a crash loop (envsubst: command not found). Added gettext-base to both docker/base/Dockerfile and .devcontainer/Dockerfile.

v7.0.0

15 Apr 19:07

Major platform overhaul: migration from X11 / Openbox to a native Wayland / Sway stack, end-to-end TLS, per-request coordinate model space for mixed frontier + local model fleets, and a simplified agent-first documentation story.

Highlights

  • Native Wayland / Sway stack. The devcontainer and runtime images now boot a Wayland session managed by supervisord. wl-copy / wl-paste replace the X11 clipboard path and grim replaces the X11 capture tool. The input stack drops dotool in favour of direct Wayland virtual-pointer / virtual-keyboard protocols.
  • GhostDesk-Model-Space HTTP header. The coordinate-normalisation middleware now rescales LLM coordinates to screen pixels per request, driven by the header (e.g. 1000 for the Qwen family). No header → pass-through for frontier models (Claude, GPT-4o, Gemini). One MCP server can now serve mixed fleets without a restart, and small local models reach frontier-level click precision with no grid overlay.
  • Grid mode retired. The ruler overlay, rulers.py and the "precision recipe" in the small-model prompt are removed — the new coordinate path makes them unnecessary.
  • wayvnc from pinned source. wayvnc / neatvnc / aml are built from a pinned master commit inside a dedicated vnc-builder Docker stage so classic VNC Auth (RFB security type 2) can be advertised — required for noVNC 1.6 interop.
  • End-to-end TLS. websockify and the MCP server auto-detect a mounted certificate at /etc/ghostdesk/tls/server.{crt,key} (or via GHOSTDESK_TLS_CERT / GHOSTDESK_TLS_KEY) and switch to wss:// / https:// at boot. README gains an mkcert quickstart.
  • VNC hardening. GHOSTDESK_VNC_ADDRESS is hard-pinned to 127.0.0.1; override attempts are logged and ignored. Password + token + TLS wired together end to end.
  • Environment variables namespaced under GHOSTDESK_* (GHOSTDESK_PORT, GHOSTDESK_SCREEN_WIDTH, …). Standard POSIX vars (TZ, LANG) unchanged.
  • arm64 base image builds cleanly from a clean checkout — Raspberry Pi / Apple Silicon / ARM servers are first-class.
  • Docker layout restructured into per-service subdirectories (docker/base, docker/init, docker/services/...).
  • Tool surface renamed to a consistent verb_noun convention; README restructured around an agents-first pitch.
  • License change: AGPL-3.0 with Commons Clause → FSL-1.1-ALv2. Cleaner language, explicit permitted purposes, explicit Competing Use prohibition, and each released version auto-transitions to Apache 2.0 on its second anniversary.
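
The per-request rescaling driven by the GhostDesk-Model-Space header can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, the header value is treated as the side length of a square normalised space applied to both axes, and the rounding and clamping rules are guesses rather than the middleware's documented behaviour.

```python
def parse_model_space(header_value):
    """Parse the raw header value; None means the header was absent."""
    if header_value is None:
        return None
    value = int(header_value)
    if value <= 0:
        raise ValueError("model space must be a positive integer")
    return value


def to_screen_pixels(x, y, model_space, screen_width, screen_height):
    """Rescale model coordinates to screen pixels, or pass them through."""
    if model_space is None:
        # No header: the model already works in screen pixels.
        return x, y
    px = round(x * screen_width / model_space)
    py = round(y * screen_height / model_space)
    # Clamp so a coordinate at the far edge still lands on the screen.
    return min(screen_width - 1, max(0, px)), min(screen_height - 1, max(0, py))


centre = to_screen_pixels(500, 500, parse_model_space("1000"), 1280, 1024)
```

With a 1000-unit space on a 1280×1024 screen, the centre point (500, 500) maps to (640, 512), while a request without the header keeps its coordinates untouched.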

Fixed

  • _desktop._parse_exec now strips a leading env wrapper when resolving .desktop entries.
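
A minimal sketch of what stripping a leading env wrapper from an Exec= line could look like. It is not the project's _parse_exec: the function name, the use of shlex, and the extra removal of Desktop Entry field codes such as %u are assumptions made for illustration.

```python
import re
import shlex

_ASSIGNMENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*=")
_FIELD_CODE = re.compile(r"^%[A-Za-z]$")


def parse_exec_sketch(exec_line):
    """Return the argv of an Exec= line without a leading `env VAR=value` wrapper."""
    tokens = shlex.split(exec_line)
    if tokens and tokens[0] in ("env", "/usr/bin/env"):
        tokens = tokens[1:]
        # Drop the VAR=value assignments that follow the wrapper.
        while tokens and _ASSIGNMENT.match(tokens[0]):
            tokens = tokens[1:]
    # Field codes are placeholders for files or URLs, not real arguments.
    return [token for token in tokens if not _FIELD_CODE.match(token)]
```

For example, "env MOZ_ENABLE_WAYLAND=1 firefox %u" resolves to the single-element argv ["firefox"].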

Full changelog

See CHANGELOG.md · v6.0.0…v7.0.0

v6.0.0

10 Apr 17:51

New Features

  • Grid ruler overlay: screenshot() now accepts grid=True to draw a coordinate ruler in the margins of a region crop (major ticks every 50px on X / 20px on Y, alternating magenta/cyan minor gridlines), letting smaller vision models read click coordinates straight off the labels instead of estimating pixel offsets
  • Small-model prompt — New dedicated prompt with an explicit click-coordinate recipe and workflow built around the grid ruler, targeted at compact vision models that struggle with raw pixel counting
  • Adaptive detection padding: the screen module now adjusts detection padding dynamically, with clearer module boundaries between capture, rulers, and shared encoding

Refactoring

  • WebP by default: screenshot() now returns WebP instead of PNG by default, significantly cutting the token cost of every capture for agents
  • GPA-GUI-Detector dropped — Removed the external GUI detector dependency in favor of a lighter, more predictable ruler-based approach
  • Cursor size — Adjusted to 24px for better visibility in captures, removed LLM-specific cursor comments
  • Wheel build cleanup — Removed unused force-include config from the wheel build

Performance

  • Faster feedback poll — Visual feedback loop now compares raw PNG bytes directly instead of computing MD5 hashes, reducing reaction-time latency on every mouse/keyboard action
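
The change above amounts to swapping a hash comparison for a direct bytes comparison. Both functions below are illustrative stand-ins, not the project's code; they return the same answer, but the second skips hashing two full captures on every poll.

```python
import hashlib


def changed_by_hash(before, after):
    """Older approach: hash both captures, then compare the digests."""
    return hashlib.md5(before).digest() != hashlib.md5(after).digest()


def changed_by_bytes(before, after):
    """Newer approach: compare the raw bytes directly."""
    return before != after


frame_a = b"\x89PNG fake capture one"
frame_b = b"\x89PNG fake capture two"
```

A bytes comparison can also stop at the first differing byte, whereas hashing always reads both buffers in full.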

Fixes

  • press_key is case-tolerant — Multi-character keysyms are now normalized: press_key("Return"), press_key("return") and press_key("RETURN") all work equivalently
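
One way to get this behaviour is a case-insensitive lookup into the canonical keysym names. The table below is a small hypothetical subset and normalize_keysym is not the project's function name; single characters are left alone here because "a" and "A" are different keys.

```python
# Hypothetical subset of canonical keysym names.
_KNOWN_KEYSYMS = ("Return", "Escape", "BackSpace", "Tab", "Page_Up", "Page_Down")
_BY_LOWER = {name.lower(): name for name in _KNOWN_KEYSYMS}


def normalize_keysym(key):
    """Map a multi-character key name to its canonical spelling."""
    if len(key) == 1:
        return key  # single characters stay case-sensitive
    # Unknown names are returned unchanged so the caller can report them.
    return _BY_LOWER.get(key.lower(), key)
```

With this table, "return", "RETURN", and "Return" all resolve to "Return".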

Documentation

  • README — Now recommends the llama.cpp fork over LM Studio for local inference; clarifies that small/medium models require both vision and tool use
  • Screenshot region= — Clarified that region= is a true native crop, not a zoom or interpolation
  • Small-model guide — New prompt with explicit click-coordinate recipe, plus a menu grid precision screenshot illustrating the workflow on small/medium models
  • SYSTEM_PROMPT.md — Renamed and restructured, critical rules emphasized

Testing

  • Coverage additions — New test suites for capture._reencode, server.main, middleware (error handling and coercion), _logging configuration, and the screen._shared module

v5.0.0

08 Apr 23:58
c605fef

New Features

  • Visual feedback system — Mouse and keyboard actions now return screen_changed and reaction_time_ms, giving agents immediate confirmation of their interactions
  • Ruler-based coordinate system — New screen/rulers.py produces zoomed screenshots with coordinate rulers (major ticks every 50px, minor ticks every 25px) for precise, reliable targeting
  • process_status tool — New shell tool to inspect the state and logs of processes launched via launch()
  • Precision-focused agent protocol — New SYSTEM_PROMPT.md documents the two-step ruler-based coordinate protocol for agents
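
The visual feedback loop can be sketched as a poll that compares fresh captures against a baseline taken before the action. This is an assumption-laden illustration: the function name, the capture callable, the timeout and interval defaults, and the None reported on timeout are hypothetical; only the screen_changed and reaction_time_ms field names come from the notes above.

```python
import time


def poll_for_change_sketch(capture, baseline, timeout_s=2.0, interval_s=0.05):
    """Poll capture() until it differs from baseline or the timeout expires.

    capture is any zero-argument callable returning comparable screen data.
    """
    start = time.monotonic()
    while True:
        if capture() != baseline:
            elapsed_ms = int((time.monotonic() - start) * 1000)
            return {"screen_changed": True, "reaction_time_ms": elapsed_ms}
        if time.monotonic() - start >= timeout_s:
            # Placeholder value: what the real tools report here is not stated.
            return {"screen_changed": False, "reaction_time_ms": None}
        time.sleep(interval_s)
```

An agent can treat screen_changed as a prompt to take a fresh screenshot, and a False result as a hint that the action may not have landed.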

Refactoring

  • Modular input feedback — Extracted visual feedback into a dedicated input/feedback.py module
  • Cursor module — Extracted cursor handling into its own _cursor.py
  • Shared image encoding — Consolidated WebP/PNG encoding into a single save_image_bytes() utility in screen/_shared.py, reused by capture.py and rulers.py
  • Legacy cleanup — Removed obsolete screen/grounding.py, screen/overlay.py, screen/reader.py, shell/wait.py, and the inspect() tool

Container & Environment

  • SYS_ADMIN capability — Added to the container for proper privilege handling
  • GNOME keyring — Now unlocked at container start for secure credential storage
  • Locale persistence — Fixed locale issues across container restarts

Documentation

  • README — Documented the ruler-based coordinate protocol and the new process_status tool
  • System prompt — Anonymized for generic desktop control, removing user-specific references
  • Obsolete docs removed — Cleaned up legacy inspect() documentation and demo screenshots

Testing

  • Massive coverage additions — New test suites for feedback, _shared, process_status, _logging, middleware, and cursor modules
  • Test hygiene — Moved assertions inside patch() contexts and added filterwarnings for pydantic RuntimeWarnings

Commits

  • feat: add SYS_ADMIN capability, unlock gnome-keyring, fix locale persistence
  • refactor: extract cursor, feedback, and process_status modules; remove wait tool
  • feat: add visual feedback to mouse and keyboard actions; update LLM instructions
  • refactor: extract save_image_bytes utility and consolidate image encoding
  • chore: anonymize system prompt for generic desktop control
  • fix: move test assertions inside patch context and add filterwarnings
  • feat: update documentation and build files for visual feedback v5
  • chore: remove obsolete screenshot.webp demo image
  • chore: remove obsolete inspect() documentation
  • docs: add missing process_status tool to README
  • test: add comprehensive tests for ghostdesk.screen._shared module
  • test: add comprehensive tests for _logging configuration
  • test: add comprehensive tests for middleware error handling and coercion
  • test: add coverage for capture.py _reencode and server.py main function
  • feat(rulers): add minor ticks every 25px and always show major labels

v4.1.0

07 Apr 14:42

New Features

  • Base Docker image — Introduced a dedicated base Docker image to separate foundational layers from the application image, improving build times and layer caching
  • Split CI workflow — CI pipeline now builds base and latest images independently, enabling more granular and efficient deployments
  • Gnome Keyring support — Added gnome-keyring-daemon to supervisor for secure credential storage within the container

Refactoring

  • Shared Docker scripts — Moved Docker scripts to a shared directory for better reuse across image variants

Documentation

  • Custom image guide — Added a dedicated section in the README explaining how to build custom images on top of the base image
  • SVG logo — Replaced PNG logo with SVG in README for better rendering quality

Maintenance

  • Cleanup — Removed unused files and updated .dockerignore for a leaner build context

Commits

  • feat: introduce base Docker image and rework Dockerfiles
  • feat: split CI workflow for base and latest images
  • feat: add gnome-keyring daemon to supervisor
  • refactor: move docker scripts to shared directory
  • docs: add custom image section to README
  • docs: use SVG logo instead of PNG in README
  • chore: remove unused files and update dockerignore

v4.0.1

06 Apr 15:36

Bug Fixes

  • Healthcheck reliability — Replaced curl-based healthcheck with supervisorctl status to verify the MCP server process is running. This eliminates false-negative healthchecks caused by HTTP endpoint timing issues during container startup

Documentation

  • Docker examples improved — Added required environment variables (DISPLAY, RESOLUTION, etc.) to all Docker run/compose examples for easier onboarding
  • Restart policy — Added restart: unless-stopped to Docker Compose examples for production-ready deployments

Commits

  • fix: use supervisorctl for healthcheck instead of curl on MCP endpoint
  • docs: add restart policy to docker examples
  • docs: add required environment variables to docker examples

v4.0.0

06 Apr 14:23

Major Changes

  • SOM Grounding (Intelligent UI Detection) — Every call to screenshot() now returns structured JSON with every detected UI element (buttons, labels, text fields, links) and their exact (x, y) click coordinates via OCR (RapidOCR + ONNX Runtime). Result: ~90% click accuracy on large LLMs and medium-sized models (~30B parameters)

  • inspect() tool — Text-only vision — New tool that returns a complete structured view of the screen (elements, windows, cursor, screen dimensions, region) as JSON without sending an image to the LLM. Drastically reduces API costs by eliminating image tokens (~1000+ tokens per screenshot saved)

  • Visual overlay mode: screenshot(overlay=True) draws colored bounding boxes with (x, y) coordinate labels on every detected element. Ideal for debugging, demos, and visual proof of agent behavior

  • Region targeting — Both screenshot(region=...) and inspect(region=...) support scoped capture for denser detection on specific screen areas. Coordinates remain absolute — no offset math needed

  • Redesigned desktop environment — New taskbar with system clock (tint2), easy app switching, polished wallpapers optimized for 1280×800 and 1280×1024

  • Larger screen: 1280×1024 — Up from 1280×800 (+28% vertical space), giving models significantly more information per screenshot

  • Small model support — Tested with models as small as 3B active parameters (Qwen3.5-35B-A3B). Built-in optimized instructions guide smaller models effectively

  • Internationalization — New TZ and LOCALE environment variables for timezone and locale configuration (e.g. Europe/Paris, fr_FR.utf8)

  • Restructured codebase — Tools reorganized into dedicated modules (screen/, input/, clipboard/, shell/). Added ONNX Runtime dependency for OCR inference

Testing

  • All unit tests pass
    • Screenshot and inspect return correct metadata structure (screen, region, cursor, windows, elements)
    • OCR element detection validated with bounding box coordinates
    • Overlay rendering tested with label placement

Commits

  • refactor: restructure tools into dedicated modules and add onnxruntime dependency
  • feat: regenerate wallpapers from SVG for 1280x800 and 1280x1024 resolutions
  • feat: SOM grounding integration, desktop environment overhaul and small model prompt
  • feat: re-enable inspect tool and improve annotation label readability
  • refactor: simplify small model prompt and clarify inspect() text-only limitation
  • docs: overhaul README with enterprise workforce section, streamline instructions
  • docs: move demos after pitch, improve screenshot layout in README
  • refactor: rename annotate to overlay, unify screenshot and inspect output
  • docs: update instructions and README for new screenshot/inspect API
  • chore: remove standalone prompt files
  • fix: include captured region in metadata for spatial awareness
  • docs: add region field to JSON example and screenshot docstring

v3.0.0

01 Apr 20:18
8d1da67

Major Changes

  • Removed AT-SPI accessibility layer — Models are capable enough to interact with the desktop using screenshots alone. Removed _atspi.py, clickables.py, and system dependencies (python3-gi, gir1.2-atspi-2.0, at-spi2-core, dconf-cli)
  • New window listing via xdotool — Screenshot now includes open windows with app name, title, and geometry (x, y, width, height)
  • Standardized API responses — All tools return consistent {"result": ...} format. Screenshot metadata includes cursor position and windows list
  • Improved performance — Window query runs concurrently with screen capture. Openbox startup extracted to dedicated script with reliable X detection
  • PROMPT.md — Added system prompt for desktop assistant agents

Testing

  • All 96 unit tests pass
    • Manual testing confirms correct window detection (Firefox, GNOME Terminal)
    • Screenshot response properly formatted and validated

Commits

  • remove: drop AT-SPI accessibility layer and system dependencies
  • feat: list open windows via xdotool in screenshot metadata
  • refactor: extract Openbox startup into dedicated script
  • docs: add PROMPT.md — system prompt for desktop assistant agents