Releases: YV17labs/GhostDesk
v7.2.0
Highlights
- Reliable `screen_changed` feedback. Input tools no longer return false negatives. Polling now compares the full screen at quarter resolution via a bounding-box ratio, so any real UI change is caught regardless of where it lands. This matters most for keyboard actions, where focus is unrelated to the mouse cursor and the previous zone-based check was systematically wrong.
- New `mouse_move` tool. Lets agents trigger hover-only UI reactions (CSS `:hover` states, dropdowns that appear on mouse-over, tooltips) without clicking.
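The bounding-box ratio idea can be illustrated without an imaging library. The sketch below is not GhostDesk's `screens_differ()`. It assumes frames are plain lists of rows of RGB tuples that have already been downsampled, and the names `changed_bbox` and `frames_differ` as well as the 1% threshold are hypothetical.

```python
from typing import Optional

Pixel = tuple[int, int, int]
Frame = list[list[Pixel]]  # rows of RGB pixels, assumed already downsampled


def changed_bbox(a: Frame, b: Frame) -> Optional[tuple[int, int, int, int]]:
    """Return (left, top, right, bottom) around all differing pixels, or None."""
    xs, ys = [], []
    for y, (row_a, row_b) in enumerate(zip(a, b)):
        for x, (pa, pb) in enumerate(zip(row_a, row_b)):
            if pa != pb:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    # right and bottom are exclusive, the usual bounding-box convention
    return min(xs), min(ys), max(xs) + 1, max(ys) + 1


def frames_differ(a: Frame, b: Frame, min_ratio: float = 0.01) -> bool:
    """True when the changed box covers at least min_ratio of the whole frame."""
    box = changed_bbox(a, b)
    if box is None:
        return False
    left, top, right, bottom = box
    total = len(a) * len(a[0])
    return (right - left) * (bottom - top) / total >= min_ratio


black = (0, 0, 0)
before = [[black] * 20 for _ in range(10)]
after = [row[:] for row in before]
after[9][19] = (255, 255, 255)  # change in the far corner, away from any cursor
after[8][17] = (255, 255, 255)
```

Because the whole frame is compared, the change in the corner is detected even though no cursor position is involved.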
Changes
Added
- `mouse_move`: moves the cursor without pressing any button, for hover-triggered UI states.
- "Filling tabular UIs" section in the server instructions: explains that spreadsheets and grid forms can be driven in a single `key_type` (`\t` for next cell, `\n` for next row) or a `clipboard_set` + paste, removing the need to click each cell.
- `screens_differ()` and `capture_png(scale=…)`: public helpers in `screen/_shared.py` for downsampled, threshold-based comparison.
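As a rough illustration of the tabular-fill approach, a table can be serialised into one string before it is handed to a single `key_type` call. The helper name `rows_to_keystrokes` is hypothetical, and the sketch assumes cells never contain tab or newline characters themselves.

```python
def rows_to_keystrokes(rows: list[list[str]]) -> str:
    """Join cells with tabs and rows with newlines for one typing call."""
    for row in rows:
        for cell in row:
            if "\t" in cell or "\n" in cell:
                # A tab or newline inside a cell would move the focus early.
                raise ValueError(f"cell contains a navigation character: {cell!r}")
    return "\n".join("\t".join(row) for row in rows)


payload = rows_to_keystrokes([["Name", "Qty"], ["Widget", "3"]])
```

The resulting `payload` is what an agent would pass as the text argument, replacing one click per cell with one typing action.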
Fixed
- `screen_changed` reliability: input tools now poll the full screen instead of a 200×200 px zone around the mouse cursor. `key_type` and `key_press` no longer report false negatives, and mouse tools detect effects that land away from the click point (toasts, dropdowns, distant menus). The polling baseline is decoded once per call to keep the hot loop cheap.
- Latent bug in `screens_stable`: Pillow's `ImageChops.difference()` on RGBA captures left the diff's alpha channel at 0, which made `getbbox()` ignore real RGB changes. Captures are now converted to RGB before comparison.
- Test suite: tool-count assertion bumped to 13 to reflect `mouse_move`.
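The alpha-channel problem is easy to see with plain arithmetic: subtracting two fully opaque pixels gives an alpha difference of |255 - 255| = 0. The sketch below models a per-channel absolute difference in pure Python; `abs_diff` is an illustrative stand-in and not a Pillow call.

```python
def abs_diff(p, q):
    """Per-channel absolute difference of two pixels."""
    return tuple(abs(a - b) for a, b in zip(p, q))


rgba_before = (10, 20, 30, 255)
rgba_after = (200, 20, 30, 255)  # red channel changed, both pixels opaque

diff_rgba = abs_diff(rgba_before, rgba_after)
diff_rgb = abs_diff(rgba_before[:3], rgba_after[:3])  # alpha dropped first

# A check that looks only at alpha reports "no change" on the RGBA diff,
# while the RGB diff exposes the real change.
seen_via_alpha = diff_rgba[3] != 0
seen_via_rgb = any(diff_rgb)
```

Dropping the alpha channel before differencing is the pure-Python analogue of converting captures to RGB first.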
Changed
- Middleware now logs tool call durations in milliseconds instead of seconds.
Removed
- `_cursor` module: `get_cursor_position()` was orphaned by the feedback refactor; the module and its callers in `_wayland.py` have been dropped.
Documentation
- README points the llama.cpp fork to the `integration/webp-turbo` branch.
- Demo video moved to GitHub's user-attachments CDN; the `demos/` directory has been removed from the repo.
- `.DS_Store` files are now ignored and untracked.
v7.1.0
Native MCP surfaces the server wasn't exposing yet (resources, lifespan warm-up, icons, tool annotations), stricter HTTP-transport security, finer-grained tool feedback through MCP notifications/message, and a consolidated system-level brief delivered through the spec-canonical instructions field.
Added
- MCP resources. `ghostdesk://apps` (JSON catalogue of installed GUI apps) and `ghostdesk://clipboard` (current clipboard text) mirror the `app_list` / `clipboard_get` tools so clients that surface resources in a dedicated picker can reach read-only state without spending an agent turn on a tool call.
- FastMCP lifespan. The server pre-binds `zwlr_virtual_pointer_v1` and `zwp_virtual_keyboard_v1` during ASGI startup. Missing compositor protocols now fail at boot instead of surfacing mid-request on the first `mouse_click`.
- MCP context notifications on tools. `mouse_*` and `key_*` push a `warning` when the 200×200 zone around the action does not change within 2 s, so the miss is visible in the client's transcript, not only in the tool result dict. `app_launch` and `clipboard_set` mirror their outcomes through `ctx.info` / `ctx.error`.
- GhostDesk icon on every MCP surface. The branded mark is advertised on the server itself, every tool, and both resources through MCP's `icons` field. Inlined as a base64 SVG data URI, so there is no packaging asset to ship alongside the wheel.
- `ToolAnnotations` on every tool. `readOnlyHint`, `destructiveHint`, and `idempotentHint` let MCP clients differentiate approval flows for read-only vs destructive actions: `screen_shot` / `clipboard_get` / `app_list` are tagged read-only + idempotent, `mouse_click` / `mouse_drag` / `key_press` are tagged destructive, etc.
- Origin header validation (MCP Streamable HTTP spec § DNS-rebinding). Browser requests must match `GHOSTDESK_ALLOWED_ORIGINS` (comma-separated) or get a `403`. Non-browser clients (no `Origin` header) pass through unchanged.
- Loopback bind by default. `GHOSTDESK_HOST` defaults to `127.0.0.1`; the container entrypoint exports `0.0.0.0` so Docker port-publishing still reaches the server, but standalone `uv run ghostdesk` no longer silently exposes the port to the LAN.
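The Origin check and the loopback default can be sketched in a few lines. This is a minimal model of the rules described in this list, not the server's middleware: the function names are hypothetical, and exact-string matching of origins (no case folding, no wildcard support) is an assumption.

```python
import os
from typing import Mapping, Optional


def allowed_origins(env: Mapping[str, str] = os.environ) -> set[str]:
    """Parse the comma-separated GHOSTDESK_ALLOWED_ORIGINS value."""
    raw = env.get("GHOSTDESK_ALLOWED_ORIGINS", "")
    return {item.strip() for item in raw.split(",") if item.strip()}


def origin_status(origin: Optional[str], allowed: set[str]) -> int:
    """Return the HTTP status the request would get from the Origin check."""
    if origin is None:
        return 200  # non-browser client: no Origin header, pass through
    return 200 if origin in allowed else 403


def bind_host(env: Mapping[str, str] = os.environ) -> str:
    """Loopback unless GHOSTDESK_HOST says otherwise."""
    return env.get("GHOSTDESK_HOST", "127.0.0.1")


allowed = allowed_origins(
    {"GHOSTDESK_ALLOWED_ORIGINS": "https://a.example, https://b.example"}
)
```

Under these assumptions a browser request from an unlisted origin is rejected, while a request without any `Origin` header is treated as a non-browser client.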
Changed
- Consolidated system-level brief. The full agent doctrine (SEE → ACT → SEE, prefer-keyboard, interruption handling, scroll-to-end, final self-check) is now carried by the server `instructions` field, the MCP spec-canonical payload delivered in the `initialize` response and auto-injected by every compliant client. Per the MCP spec, `prompts` are user-controlled templates (slash commands, picker entries), which makes them the wrong mechanism for a system-level brief that must always reach the model. One document, guaranteed delivery.
- Package layout for MCP surfaces. `resources` is now a package (matching `apps`, `clipboard`, `input`, `screen`); every domain with a `register(mcp)` function follows the same `__init__.py` convention.
- `warn_on_miss` helper. Lives in `input/feedback.py` alongside `build_feedback` and `poll_for_change`, so mouse and keyboard tools share the miss-warning path without crossing underscore-prefixed module boundaries.
- `mcp[cli]` pinned to `>=1.27`. Unlocks the `ToolAnnotations`, `Icon`, and lifespan APIs used throughout this release.
Fixed
- Wheel scroll direction inverted. `mouse_scroll(direction="up")` (and `"left"`) silently scrolled the other way: the virtual-pointer `axis_discrete` request was sent with `discrete=+1` regardless of `value`'s sign, violating the `wl_pointer` protocol invariant that the two must match within a frame. Firefox, like any wheel-aware client, trusts `delta_discrete`, so every "up" scroll collapsed into "down" and pinned at the page bottom. Sign is now carried in `_SCROLL_VECTORS` alongside `value`, and a static test locks the invariant.
Removed
- Standalone `SYSTEM_PROMPT.md`. Its content is now folded into the server `instructions` field, delivered automatically at session init. Users who referenced the markdown file directly no longer need to: the guidance now reaches the model through the MCP handshake.
Full Changelog: v7.0.1...v7.1.0
v7.0.1
Fixed
- Missing `envsubst` in runtime images. `entrypoint.sh` uses `envsubst` to inject `GHOSTDESK_SCREEN_WIDTH` / `GHOSTDESK_SCREEN_HEIGHT` into the Sway config, but the binary was not part of the runtime stack, so containers booted into a crash loop (`envsubst: command not found`). Added `gettext-base` to both `docker/base/Dockerfile` and `.devcontainer/Dockerfile`.
v7.0.0
Major platform overhaul: migration from X11 / Openbox to a native Wayland / Sway stack, end-to-end TLS, per-request coordinate model space for mixed frontier + local model fleets, and a simplified agent-first documentation story.
Highlights
- Native Wayland / Sway stack. The devcontainer and runtime images now boot a Wayland session managed by supervisord. `wl-copy` / `wl-paste` replace the X11 clipboard path and `grim` replaces the X11 capture tool. The input stack drops `dotool` in favour of direct Wayland `virtual-pointer` / `virtual-keyboard` protocols.
- `GhostDesk-Model-Space` HTTP header. The coordinate-normalisation middleware now rescales LLM coordinates to screen pixels per request, driven by the header (e.g. `1000` for the Qwen family). No header → pass-through for frontier models (Claude, GPT-4o, Gemini). One MCP server can now serve mixed fleets without a restart, and small local models reach frontier-level click precision with no grid overlay.
- Grid mode retired. The ruler overlay, `rulers.py` and the "precision recipe" in the small-model prompt are removed; the new coordinate path makes them unnecessary.
- wayvnc from pinned source. `wayvnc` / `neatvnc` / `aml` are built from a pinned `master` commit inside a dedicated `vnc-builder` Docker stage so classic VNC Auth (RFB security type 2) can be advertised, which is required for noVNC 1.6 interop.
- End-to-end TLS. `websockify` and the MCP server auto-detect a mounted certificate at `/etc/ghostdesk/tls/server.{crt,key}` (or via `GHOSTDESK_TLS_CERT` / `GHOSTDESK_TLS_KEY`) and switch to `wss://` / `https://` at boot. README gains an `mkcert` quickstart.
- VNC hardening. `GHOSTDESK_VNC_ADDRESS` is hard-pinned to `127.0.0.1`; override attempts are logged and ignored. Password + token + TLS wired together end to end.
- Environment variables namespaced under `GHOSTDESK_*` (`GHOSTDESK_PORT`, `GHOSTDESK_SCREEN_WIDTH`, …). Standard POSIX vars (`TZ`, `LANG`) unchanged.
- arm64 base image builds cleanly from a clean checkout; Raspberry Pi / Apple Silicon / ARM servers are first-class.
- Docker layout restructured into per-service subdirectories (`docker/base`, `docker/init`, `docker/services/...`).
- Tool surface renamed to a consistent `verb_noun` convention; README restructured around an agents-first pitch.
- License change: AGPL-3.0 with Commons Clause → FSL-1.1-ALv2. Cleaner language, explicit permitted purposes, explicit Competing Use prohibition, and each released version auto-transitions to Apache 2.0 on its second anniversary.
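The per-request rescaling driven by `GhostDesk-Model-Space` amounts to a linear mapping. The sketch below is one plausible reading, not the middleware itself: it assumes the model space is a square grid applied to both axes, that results are rounded and clamped to the screen, and the function names are hypothetical.

```python
from typing import Mapping, Optional


def parse_model_space(headers: Mapping[str, str]) -> Optional[int]:
    """Read the header case-insensitively; None means pass-through."""
    for name, value in headers.items():
        if name.lower() == "ghostdesk-model-space":
            return int(value)
    return None


def to_screen_pixels(
    x: float, y: float, model_space: Optional[int], screen_w: int, screen_h: int
) -> tuple[int, int]:
    """Map model coordinates to screen pixels, or pass them through."""
    if model_space is None:
        return int(x), int(y)  # frontier models already emit pixels
    if model_space <= 0:
        raise ValueError("model space must be positive")
    px = round(x * screen_w / model_space)
    py = round(y * screen_h / model_space)
    # Clamp so that the grid's upper edge still lands on a valid pixel.
    return min(max(px, 0), screen_w - 1), min(max(py, 0), screen_h - 1)


space = parse_model_space({"GhostDesk-Model-Space": "1000"})
centre = to_screen_pixels(500, 500, space, 1280, 1024)
```

With a 1000-unit model space on a 1280×1024 screen, the model's `(500, 500)` lands on the screen centre, while a request without the header keeps its coordinates untouched.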
Fixed
- `_desktop._parse_exec` now strips a leading `env` wrapper when resolving `.desktop` entries.
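A minimal sketch of stripping an `env` wrapper from an `Exec=` line is shown below. It is not the project's `_parse_exec`: the helper name `strip_env_wrapper` is hypothetical, `env` options such as `-u` are not handled, and field codes like `%u` are deliberately left in place.

```python
import shlex


def strip_env_wrapper(exec_line: str) -> list[str]:
    """Return argv with a leading `env NAME=value ...` prefix removed."""
    argv = shlex.split(exec_line)
    if argv and argv[0] in ("env", "/usr/bin/env"):
        argv = argv[1:]
        # Skip the NAME=value assignments that follow the wrapper.
        while argv and "=" in argv[0] and not argv[0].startswith("-"):
            argv = argv[1:]
    return argv


wrapped = strip_env_wrapper("env MOZ_ENABLE_WAYLAND=1 firefox %u")
plain = strip_env_wrapper("firefox %u")
```

Both forms resolve to the same argv, so the real executable name can be looked up regardless of the wrapper.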
Full changelog
See CHANGELOG.md · v6.0.0…v7.0.0
v6.0.0
New Features
- Grid ruler overlay: `screenshot()` now accepts `grid=True` to draw a coordinate ruler in the margins of a region crop (major ticks every 50px on X / 20px on Y, alternating magenta/cyan minor gridlines), letting smaller vision models read click coordinates straight off the labels instead of estimating pixel offsets
- Small-model prompt: New dedicated prompt with an explicit click-coordinate recipe and workflow built around the grid ruler, targeted at compact vision models that struggle with raw pixel counting
- Adaptive detection padding: `screen` module now adjusts detection padding dynamically with clearer module boundaries between capture, rulers, and shared encoding
Refactoring
- WebP by default: `screenshot()` now returns WebP instead of PNG by default, significantly cutting the token cost of every capture for agents
- GPA-GUI-Detector dropped: Removed the external GUI detector dependency in favor of a lighter, more predictable ruler-based approach
- Cursor size: Adjusted to 24px for better visibility in captures, removed LLM-specific cursor comments
- Wheel build cleanup: Removed unused `force-include` config from the wheel build
Performance
- Faster feedback poll — Visual feedback loop now compares raw PNG bytes directly instead of computing MD5 hashes, reducing reaction-time latency on every mouse/keyboard action
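A feedback poll of this kind can be sketched as a loop that compares each new capture with the baseline bytes directly, with no hashing step. The function name `wait_for_change`, its parameters, and the choice to report the elapsed time even on a miss are assumptions; only the `screen_changed` and `reaction_time_ms` keys come from these release notes.

```python
import time
from typing import Callable


def wait_for_change(
    capture: Callable[[], bytes],
    baseline: bytes,
    timeout_s: float = 2.0,
    interval_s: float = 0.05,
) -> dict:
    """Poll until a capture differs from the baseline or the timeout expires."""
    start = time.monotonic()
    while True:
        changed = capture() != baseline  # raw byte comparison, no MD5
        elapsed = time.monotonic() - start
        if changed or elapsed >= timeout_s:
            return {
                "screen_changed": changed,
                "reaction_time_ms": int(elapsed * 1000),
            }
        time.sleep(interval_s)


# Fake capture source: the screen changes on the third frame.
frames = iter([b"a", b"a", b"b"])
hit = wait_for_change(lambda: next(frames), b"a", interval_s=0.0)
miss = wait_for_change(lambda: b"a", b"a", timeout_s=0.0)
```

Comparing bytes directly short-circuits on the first differing byte, which is why it can be cheaper than hashing both buffers on every iteration.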
Fixes
- `press_key` is case-tolerant: multi-character keysyms are now normalized, so `press_key("Return")`, `press_key("return")` and `press_key("RETURN")` all work equivalently
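One way to get this behaviour is a lookup from lowercased names to canonical keysyms. The sketch below is illustrative only: the table covers a handful of keysyms, the name `normalize_keysym` is hypothetical, and passing unknown names through unchanged is an assumption.

```python
# Lowercased name -> canonical keysym spelling (small illustrative subset).
_CANONICAL = {
    name.lower(): name
    for name in ("Return", "Escape", "BackSpace", "Tab", "Page_Up", "Page_Down")
}


def normalize_keysym(key: str) -> str:
    """Normalize multi-character keysyms; leave single characters alone."""
    if len(key) == 1:
        return key  # "a" and "A" are different keys and must stay distinct
    return _CANONICAL.get(key.lower(), key)


variants = [normalize_keysym(k) for k in ("Return", "return", "RETURN")]
```

Single characters are skipped on purpose, since case carries meaning there.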
Documentation
- README: Now recommends the llama.cpp fork over LM Studio for local inference; clarifies that small/medium models require both vision and tool use
- Screenshot `region=`: Clarified that `region=` is a true native crop, not a zoom or interpolation
- Small-model guide: New prompt with explicit click-coordinate recipe, plus a menu grid precision screenshot illustrating the workflow on small/medium models
- `SYSTEM_PROMPT.md`: Renamed and restructured, critical rules emphasized
Testing
- Coverage additions: New test suites for `capture._reencode`, `server.main`, middleware (error handling and coercion), `_logging` configuration, and the `screen._shared` module
v5.0.0
New Features
- Visual feedback system: Mouse and keyboard actions now return `screen_changed` and `reaction_time_ms`, giving agents immediate confirmation of their interactions
- Ruler-based coordinate system: New `screen/rulers.py` produces zoomed screenshots with coordinate rulers (major ticks every 50px, minor ticks every 25px) for precise, reliable targeting
- `process_status` tool: New shell tool to inspect the state and logs of processes launched via `launch()`
- Precision-focused agent protocol: New `SYSTEM_PROMPT.md` documents the two-step ruler-based coordinate protocol for agents
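The tick spacing described for the rulers (major every 50px, minor every 25px) can be computed as follows. This is a sketch rather than the contents of `screen/rulers.py`: the function name `ruler_ticks` is hypothetical, and it assumes ticks are labelled in absolute screen coordinates.

```python
def ruler_ticks(
    start: int, end: int, major: int = 50, minor: int = 25
) -> tuple[list[int], list[int]]:
    """Return (major, minor) tick positions inside [start, end]."""
    first = -(-start // minor) * minor  # round start up to a multiple of minor
    majors, minors = [], []
    for pos in range(first, end + 1, minor):
        if pos % major == 0:
            majors.append(pos)
        else:
            minors.append(pos)
    return majors, minors


majors, minors = ruler_ticks(110, 260)
```

For a crop spanning x = 110 to 260, the major labels fall on 150, 200 and 250, with minor ticks halfway between them.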
Refactoring
- Modular input feedback: Extracted visual feedback into a dedicated `input/feedback.py` module
- Cursor module: Extracted cursor handling into its own `_cursor.py`
- Shared image encoding: Consolidated WebP/PNG encoding into a single `save_image_bytes()` utility in `screen/_shared.py`, reused by `capture.py` and `rulers.py`
- Legacy cleanup: Removed obsolete `screen/grounding.py`, `screen/overlay.py`, `screen/reader.py`, `shell/wait.py`, and the `inspect()` tool
Container & Environment
- SYS_ADMIN capability — Added to the container for proper privilege handling
- GNOME keyring — Now unlocked at container start for secure credential storage
- Locale persistence — Fixed locale issues across container restarts
Documentation
- README: Documented the ruler-based coordinate protocol and the new `process_status` tool
- System prompt: Anonymized for generic desktop control, removing user-specific references
- Obsolete docs removed: Cleaned up legacy `inspect()` documentation and demo screenshots
Testing
- Massive coverage additions: New test suites for `feedback`, `_shared`, `process_status`, `_logging`, `middleware`, and `cursor` modules
- Test hygiene: Moved assertions inside `patch()` contexts and added `filterwarnings` for pydantic RuntimeWarnings
Commits
- feat: add SYS_ADMIN capability, unlock gnome-keyring, fix locale persistence
- refactor: extract cursor, feedback, and process_status modules; remove wait tool
- feat: add visual feedback to mouse and keyboard actions; update LLM instructions
- refactor: extract save_image_bytes utility and consolidate image encoding
- chore: anonymize system prompt for generic desktop control
- fix: move test assertions inside patch context and add filterwarnings
- feat: update documentation and build files for visual feedback v5
- chore: remove obsolete screenshot.webp demo image
- chore: remove obsolete inspect() documentation
- docs: add missing process_status tool to README
- test: add comprehensive tests for ghostdesk.screen._shared module
- test: add comprehensive tests for _logging configuration
- test: add comprehensive tests for middleware error handling and coercion
- test: add coverage for capture.py _reencode and server.py main function
- feat(rulers): add minor ticks every 25px and always show major labels
v4.1.0
New Features
- Base Docker image: Introduced a dedicated base Docker image to separate foundational layers from the application image, improving build times and layer caching
- Split CI workflow: CI pipeline now builds base and latest images independently, enabling more granular and efficient deployments
- Gnome Keyring support: Added `gnome-keyring-daemon` to supervisor for secure credential storage within the container
Refactoring
- Shared Docker scripts — Moved Docker scripts to a shared directory for better reuse across image variants
Documentation
- Custom image guide — Added a dedicated section in the README explaining how to build custom images on top of the base image
- SVG logo — Replaced PNG logo with SVG in README for better rendering quality
Maintenance
- Cleanup: Removed unused files and updated `.dockerignore` for a leaner build context
Commits
- feat: introduce base Docker image and rework Dockerfiles
- feat: split CI workflow for base and latest images
- feat: add gnome-keyring daemon to supervisor
- refactor: move docker scripts to shared directory
- docs: add custom image section to README
- docs: use SVG logo instead of PNG in README
- chore: remove unused files and update dockerignore
v4.0.1
Bug Fixes
- Healthcheck reliability: Replaced `curl`-based healthcheck with `supervisorctl status` to verify the MCP server process is running. This eliminates false-negative healthchecks caused by HTTP endpoint timing issues during container startup
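A process-level health check boils down to reading the state column of `supervisorctl status`. The parser below is a sketch under stated assumptions: the program names `mcp-server` and `vnc` are hypothetical, and the sample text imitates the usual `name STATE description` layout rather than real output from this container.

```python
def program_is_running(status_output: str, program: str) -> bool:
    """True when the named program's state column reads RUNNING."""
    for line in status_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == program:
            return fields[1] == "RUNNING"
    return False  # program not listed at all


sample = (
    "mcp-server                       RUNNING   pid 42, uptime 0:01:07\n"
    "vnc                              STARTING\n"
)
healthy = program_is_running(sample, "mcp-server")
```

Unlike an HTTP probe, this check does not depend on the endpoint already accepting connections during startup.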
Documentation
- Docker examples improved: Added required environment variables (`DISPLAY`, `RESOLUTION`, etc.) to all Docker run/compose examples for easier onboarding
- Restart policy: Added `restart: unless-stopped` to Docker Compose examples for production-ready deployments
Commits
- fix: use supervisorctl for healthcheck instead of curl on MCP endpoint
- docs: add restart policy to docker examples
- docs: add required environment variables to docker examples
v4.0.0
Major Changes
- SOM Grounding (Intelligent UI Detection): Every call to `screenshot()` now returns structured JSON with every detected UI element (buttons, labels, text fields, links) and their exact `(x, y)` click coordinates via OCR (RapidOCR + ONNX Runtime). Result: ~90% click accuracy on large LLMs and medium-sized models (~30B parameters)
- `inspect()` tool (text-only vision): New tool that returns a complete structured view of the screen (elements, windows, cursor, screen dimensions, region) as JSON without sending an image to the LLM. Drastically reduces API costs by eliminating image tokens (~1000+ tokens per screenshot saved)
- Visual overlay mode: `screenshot(overlay=True)` draws colored bounding boxes with `(x, y)` coordinate labels on every detected element. Ideal for debugging, demos, and visual proof of agent behavior
- Region targeting: Both `screenshot(region=...)` and `inspect(region=...)` support scoped capture for denser detection on specific screen areas. Coordinates remain absolute, so no offset math is needed
- Redesigned desktop environment: New taskbar with system clock (tint2), easy app switching, polished wallpapers optimized for 1280×800 and 1280×1024
- Larger screen (1280×1024): Up from 1280×800 (+28% vertical space), giving models significantly more information per screenshot
- Small model support: Tested with models as small as 3B active parameters (Qwen3.5-35B-A3B). Built-in optimized instructions guide smaller models effectively
- Internationalization: New `TZ` and `LOCALE` environment variables for timezone and locale configuration (e.g. `Europe/Paris`, `fr_FR.utf8`)
- Restructured codebase: Tools reorganized into dedicated modules (`screen/`, `input/`, `clipboard/`, `shell/`). Added ONNX Runtime dependency for OCR inference
Testing
- All unit tests pass
- Screenshot and inspect return correct metadata structure (screen, region, cursor, windows, elements)
- OCR element detection validated with bounding box coordinates
- Overlay rendering tested with label placement
Commits
- refactor: restructure tools into dedicated modules and add onnxruntime dependency
- feat: regenerate wallpapers from SVG for 1280x800 and 1280x1024 resolutions
- feat: SOM grounding integration, desktop environment overhaul and small model prompt
- feat: re-enable inspect tool and improve annotation label readability
- refactor: simplify small model prompt and clarify inspect() text-only limitation
- docs: overhaul README with enterprise workforce section, streamline instructions
- docs: move demos after pitch, improve screenshot layout in README
- refactor: rename annotate to overlay, unify screenshot and inspect output
- docs: update instructions and README for new screenshot/inspect API
- chore: remove standalone prompt files
- fix: include captured region in metadata for spatial awareness
- docs: add region field to JSON example and screenshot docstring
v3.0.0
Major Changes
- Removed AT-SPI accessibility layer: Models are capable enough to interact with the desktop using screenshots alone. Removed _atspi.py, clickables.py, and system dependencies (python3-gi, gir1.2-atspi-2.0, at-spi2-core, dconf-cli)
- New window listing via xdotool: Screenshot now includes open windows with app name, title, and geometry (x, y, width, height)
- Standardized API responses: All tools return consistent `{"result": ...}` format. Screenshot metadata includes cursor position and windows list
- Improved performance: Window query runs concurrently with screen capture. Openbox startup extracted to dedicated script with reliable X detection
- PROMPT.md: Added system prompt for desktop assistant agents
Testing
- All 96 unit tests pass
- Manual testing confirms correct window detection (Firefox, GNOME Terminal)
- Screenshot response properly formatted and validated
Commits
- remove: drop AT-SPI accessibility layer and system dependencies
- feat: list open windows via xdotool in screenshot metadata
- refactor: extract Openbox startup into dedicated script
- docs: add PROMPT.md — system prompt for desktop assistant agents