Technical Whitepaper

MCC-H

Let AI work like a human

Pre-alpha · Apple Silicon

Release date to be announced soon

Abstract

MCC-H is a universal GUI-only AI agent that controls virtual machines and desktops like a human: through screenshots, mouse, keyboard, and SSH. Instead of building new interfaces for AI, we let AI use ours — the same screens, mice, and keyboards humans use. The agent sees what we see and acts as we do.

On the name: MCC — Mission Control Room. It feels like launching rockets from a control room, watching screens and sending commands. The H is for Houston, where that control room lives. And, eventually, it may have a problem — as Swigert said.

This document describes the concepts, implementation, achievements, roadmap, and open questions. MCC-H runs on Houston, a macOS hypervisor layer that provides VM lifecycle, screenshot capture, input injection, and on-device models (supported where possible, otherwise cloud models).

Core Concepts

Stateless Agent & Recipes

MCC-H uses a stateless agent model: no persistent memory between sessions. Each task starts fresh. Knowledge is encoded in recipes — reusable, shareable, verifiable task flows — not in agent state. Recipes are sequences of tool calls (snapshots, clicks, typing, SSH) with assessments and screenshots. They can be generated by the agent, verified or edited by humans, and exported as HTML + screenshots (ZIP) for sharing and auditing. PGP signing for trust and provenance is planned.
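A recipe step could be modeled roughly like this. This is a hypothetical sketch: the field names (`assessment`, `clarification`, `tool`, `args`, `screenshot`) are illustrative, not the actual MCC-H schema.

```typescript
// Hypothetical shape of one recipe step: the tool call plus the agent's
// explicit reasoning and the screenshot it observed. Names are illustrative.
interface RecipeStep {
  assessment: string;    // what the agent concluded from the previous result
  clarification: string; // why this step is taken and what outcome is expected
  tool: string;          // e.g. "mouse_click", "keyboard_type", "take_snapshot"
  args: Record<string, unknown>;
  screenshot?: string;   // path or embedded reference to the captured image
}

interface Recipe {
  title: string;
  steps: RecipeStep[];
}

// A two-step fragment of a hypothetical Debian-install recipe.
const recipe: Recipe = {
  title: "Install Debian from ISO",
  steps: [
    {
      assessment: "VM booted to the installer menu.",
      clarification: "Select 'Graphical install' to proceed.",
      tool: "keyboard_type",
      args: { keys: "Enter" },
      screenshot: "step-001.png",
    },
    {
      assessment: "Language selection screen appeared.",
      clarification: "Confirm English to continue with defaults.",
      tool: "mouse_click",
      args: { x: 512, y: 384 },
    },
  ],
};

console.log(recipe.steps.length); // 2
```

Because every step carries its own assessment and clarification, a recipe exported this way can be audited step by step by a human reviewer.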

Human-like Interaction Loop

The agent follows a human-like loop until the task is done:

  1. Assessment — What happened so far? What does the last result mean?
  2. Clarification — Why am I doing this next step? What outcome do I expect?
  3. Action — Execute (click, type, SSH, etc.)
  4. Observation — What changed on screen? Did it match expectations?

Every tool call carries an assessment and a clarification, so the agent reasons explicitly at each step. The agent also receives a fixed feature mask: interactive elements with known coordinates, such as checkboxes, radio buttons, text regions, and icons. Just like a human, it sees what changed, then decides where to click or what to type.
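The four-phase loop above can be sketched in a few lines. This is a minimal illustration, not the actual MCC-H implementation: `decideNext` stands in for the LLM producing an assessment, clarification, and action, and `execute` stands in for the real MCP tool plumbing and snapshot capture.

```typescript
// Minimal sketch of the assessment → clarification → action → observation
// loop. Snapshot, decideNext, and execute are hypothetical stand-ins.
type Snapshot = { text: string };
type Step = { assessment: string; clarification: string; action: string };

function runLoop(
  decideNext: (s: Snapshot, history: Step[]) => Step | null,
  execute: (action: string) => Snapshot,
  initial: Snapshot,
  maxSteps = 50
): Step[] {
  const history: Step[] = [];
  let observed = initial;
  for (let i = 0; i < maxSteps; i++) {
    const step = decideNext(observed, history); // assessment + clarification
    if (step === null) break;                   // task judged complete
    observed = execute(step.action);            // action → new observation
    history.push(step);
  }
  return history;
}

// Toy run: a fake screen that reaches "done" after two actions.
const steps = runLoop(
  (s) =>
    s.text.includes("done")
      ? null
      : {
          assessment: `screen shows: ${s.text}`,
          clarification: "advance one state",
          action: "next",
        },
  (() => {
    let n = 0;
    return () => ({ text: ++n >= 2 ? "done" : `state ${n}` });
  })(),
  { text: "start" }
);
console.log(steps.length); // 2
```

The accumulated `history` is exactly the material a recipe is built from: each entry already pairs an action with the reasoning behind it.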

Observe, Don't Poll

Instead of polling, the agent observes changes in a selected domain (visual or audio) and acts only when something changes, unless a schedule dictates otherwise. Event-driven proactivity (work in progress) builds on this idea.
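One simple way to implement this is to reduce each snapshot to a digest and wake the agent only when the digest differs from the last one seen. This is a sketch under stated assumptions: a real system would more likely hash pixel data or diff OCR output with some tolerance, and the FNV-1a digest here is just a stand-in.

```typescript
// Sketch of observe-don't-poll: each snapshot is reduced to a digest and
// the agent wakes only when the digest differs from the previous one.
// digest() is a stand-in; a real system might hash pixels or diff OCR text.
function digest(frame: string): string {
  // FNV-1a: good enough for a sketch
  let h = 0x811c9dc5;
  for (let i = 0; i < frame.length; i++) {
    h ^= frame.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16);
}

function makeObserver(onChange: (frame: string) => void) {
  let last: string | null = null;
  return (frame: string) => {
    const d = digest(frame);
    if (d !== last) {
      last = d;
      onChange(frame); // something changed on screen: act
    }
    // identical frame: stay idle instead of re-running the agent
  };
}

let wakeups = 0;
const observe = makeObserver(() => {
  wakeups++;
});
observe("frame A");
observe("frame A"); // unchanged: no wakeup
observe("frame B");
console.log(wakeups); // 2
```

Three frames are observed but the agent only wakes twice, which is the whole point: idle screens cost nothing.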

Actions Over Planning

If we only talk, we only plan — nothing happens. Only actions have value. The agent is designed to act, not just reason. The loop is: observe → decide → act → observe.

What Was Achieved

MCP Tool Suite

The agent exposes a full set of tools via the Model Context Protocol:

  • take_snapshot — Screenshot + OCR + vision overlay
  • power_on / power_off — VM lifecycle
  • mouse_click, mouse_double_click, mouse_move, mouse_scroll
  • keyboard_type — Text and key combos (Ctrl+C, etc.)
  • start_ssh_session, send_to_ssh_session, close_ssh_session
  • wait — Timed delays
  • secrets_list, secrets_get, secrets_set, secrets_delete
  • config_list, config_set, config_delete
  • start_task, finalize_task, ask_user, get_skill
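Since these tools are exposed over MCP, a client invokes them with a standard JSON-RPC 2.0 `tools/call` request. The request below is illustrative: the method name follows the MCP specification, but the argument names (`x`, `y`, `button`) are assumptions, not MCC-H's documented schema.

```typescript
// Hypothetical JSON-RPC 2.0 request invoking the mouse_click tool via MCP.
// tools/call is the standard MCP method; the argument names are illustrative.
const request = {
  jsonrpc: "2.0" as const,
  id: 1,
  method: "tools/call",
  params: {
    name: "mouse_click",
    arguments: { x: 640, y: 400, button: "left" },
  },
};

const body = JSON.stringify(request);
console.log(body.includes("mouse_click")); // true
```

Any MCP-capable client can drive the agent's tools this way, which is what makes the suite reusable outside the bundled UI.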

Vision Pipeline

Houston (the hypervisor) runs a full OCR and vision pipeline. On-device models are used where possible; otherwise cloud models. On Apple Silicon:

Tool                  Model                    Purpose
OCR                   Apple Vision             Text recognition with coordinates
CheckboxDetector      YOLOv8                   Checkbox state (checked/unchecked)
WebFormDetector       YOLOv8                   39 form field classes (button, input, dropdown, etc.)
OmniParserDetector    OmniParser-v2.0          Cross-platform desktop icons
UIElementsDetector    YOLOv11                  macOS accessibility-style elements
IconCaptionDetector   qwen3-vl-2b (LM Studio)  Icon captioning via local VL model

All Core ML models run with the cpuAndNeuralEngine compute-unit option, optimized for Apple Silicon.

Recipe System

Full recipe capture with assessment/clarification per step, screenshot embedding, markdown export, and ZIP export (HTML + images) for sharing and auditing.

AI Provider Support

Claude (Anthropic), OpenRouter, ChatGPT, and custom OpenAI-compatible endpoints.

Proven Task Completions

We have successfully completed:

  • Debian installation — Full OS install from ISO in a VM
  • macOS configuration — System setup and app configuration
  • Custom requests — Arbitrary tasks phrased in natural language

Sample recipes (coming soon).


Unexpected Model Behavior

To our surprise, when models encounter failures they often attempt recovery or make smart moves we didn't anticipate. The agents exhibit emergent problem-solving behavior — retrying with different approaches, backing out of dead ends, or adapting their strategy when something doesn't work. This suggests the GUI-only paradigm may unlock capabilities we haven't explicitly designed for.

Interacting with MCC-H

Configuration

Before running tasks, configure your AI provider. MCC-H supports:

  • Claude (Anthropic) — API key
  • OpenRouter — API key, model selection
  • ChatGPT — OAuth or API key
  • Custom — Any OpenAI-compatible endpoint

An inexpensive setup for a MacBook with 16 GB RAM: OpenRouter with qwen3-vl-8b-instruct as the vision model and gemini-flash or kimi-2.5 as the main agent model. Add your API keys in the app settings.
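That setup corresponds roughly to settings like the following. This is a hypothetical shape for illustration only; the field names are not MCC-H's actual settings schema, and the key placeholder is yours to replace in the app settings.

```typescript
// Hypothetical provider settings for the 16 GB OpenRouter setup described
// above. Field names are illustrative, not MCC-H's actual settings schema.
const settings = {
  provider: "openrouter",
  apiKey: "<YOUR_OPENROUTER_KEY>", // paste your real key in the app settings
  visionModel: "qwen3-vl-8b-instruct",
  agentModel: "gemini-flash", // or "kimi-2.5"
};

console.log(settings.visionModel);
```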

The community is welcome to try MCC-H on high-end Macs with local models (LM Studio, Ollama, etc.) for a fully on-device experience.

Giving Instructions

Interaction is conversational. Start with what you want the agent to do, for example:

I want you to install Debian in this VM from the ISO.
I want you to configure macOS to use dark mode and set up the dock.
I want you to open a terminal and run apt update.

The agent will assess your request, take snapshots of the screen, and act step by step until the task is done or it needs your input. The agent can ask for clarifications when something is ambiguous — for example, which option to choose or what value to enter. You can interrupt, clarify, or inject new instructions at any time.

Architecture

┌────────────────────────────────────────────────────────────┐
│  MCC-H (Electron)                                          │
│  ├── Vue UI (chat, VM config, recipe view)                 │
│  ├── MCP Server (tools: take_snapshot, mouse_click, etc.)  │
│  ├── AI Agent (Claude / OpenRouter / ChatGPT)              │
│  └── Recipe Store (in-memory → export to ZIP)              │
└────────────────────────────────────────────────────────────┘
                              │
                              │ HTTP (127.0.0.1:5899)
                              ▼
┌────────────────────────────────────────────────────────────┐
│  Houston (Swift, macOS) — Hypervisor                       │
│  ├── VM management (Apple Virtualization.framework)        │
│  ├── Screenshot capture (IOSurface)                        │
│  ├── Input injection (mouse, keyboard)                     │
│  └── OCR pipeline (on-device models)                       │
└────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌────────────────────────────────────────────────────────────┐
│  Guest VM (Linux / macOS)                                  │
│  └── Desktop or TUI to be controlled                       │
└────────────────────────────────────────────────────────────┘
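The MCC-H to Houston boundary is plain HTTP on 127.0.0.1:5899. A screenshot request from the Electron side might be assembled as below. Only the host and port come from the diagram; the `/screenshot` path and `vm` query parameter are assumptions for illustration.

```typescript
// Hypothetical request builder for the MCC-H → Houston HTTP boundary.
// Only the host and port (127.0.0.1:5899) come from the architecture
// diagram; the /screenshot path and vm parameter are illustrative.
const HOUSTON_BASE = "http://127.0.0.1:5899";

function screenshotUrl(vmId: string): string {
  return `${HOUSTON_BASE}/screenshot?vm=${encodeURIComponent(vmId)}`;
}

// The Electron side would GET such a URL and receive image bytes, e.g.:
//   const res = await fetch(screenshotUrl("debian-1"));
//   const png = new Uint8Array(await res.arrayBuffer());
console.log(screenshotUrl("debian-1"));
```

Keeping the boundary on loopback HTTP means the agent process never needs hypervisor entitlements itself; only Houston touches the Virtualization framework.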

Roadmap

  • Event-driven proactivity — Agent reacts automatically when the screen changes
  • Windows as controlled OS — Support for Windows guests
  • Recipes store and ecosystem — Public repository, discovery, ratings, community sharing
  • Audio bridges — Agent speaks into microphone and listens to audio (e.g. during calls)
  • PGP signing for recipes — Trust and provenance (similar to Debian packages)

Community & Open Questions

We ask the community to test more recipes and operating systems and share them with us if possible. Your contributions help expand what MCC-H can do.

We also welcome input on:

  • Extended testing — On-device VL models (LM Studio, Ollama) for icon captioning and vision
  • Faster on-device models — Smaller, faster vision/language models for form fields, icons, OCR
  • Form fields and icons recognition — Better detection accuracy, new model backends, training data
  • Recipe sharing — Creating, verifying, sharing recipes for common tasks (OS install, app setup)
  • General improvements — Bug fixes, UX, documentation

Participate

MCC-H is an open experiment. We invite developers, researchers, and enthusiasts to contribute, test, and shape the future of GUI-only AI agents.

mcc-h.ai · Houston hypervisor · Pre-alpha