GitHub - unitedbyai/droidclaw: turn old phones into ai agents - give it a goal in plain english. it reads the screen, thinks about what to do, taps and types via adb, and repeats until the job is done.

unitedbyai
2026.02.19
·GitHub·by 이호민
#Agent#AI#Android#Automation#LLM

Key Points

  • DroidClaw is an AI agent that controls Android phones by understanding screen content and executing actions like tapping and typing, enabling automation of any app without APIs.
  • It operates on a perception-reasoning-action loop, using an LLM to interpret accessibility trees or screenshots and decide the next ADB commands toward a goal given in plain English.
  • The system includes failure-handling mechanisms such as stuck-loop detection and repetition tracking, and offers interactive, AI-powered workflow, and deterministic flow modes for diverse automation needs.

DroidClaw is an AI agent designed to control Android phones by interpreting screen content and executing actions, enabling automation of tasks without requiring APIs or integrations. It transforms old Android devices into intelligent agents capable of achieving user-defined goals in plain English.

The core methodology of DroidClaw is a perception → reasoning → action loop that iterates until a goal is achieved or a step limit is reached.

  1. Perception: The agent first perceives the current screen state. This primarily involves dumping the accessibility tree via ADB and parsing its XML output into structured, interactive UI elements. It also diffs the current screen state against the previous one to detect changes. Optionally, a screenshot can be captured for visual processing.
  2. Reasoning: The perceived screen state, the user's goal, and the conversation history are sent to a Large Language Model (LLM). The LLM processes this information and returns a structured output containing its think process, a plan, and the specific action to take. For instance, the LLM might identify a search icon at specific coordinates and decide to tap it.
  3. Action: The LLM's chosen action (e.g., tap, type, swipe) is executed via ADB. The result of this action (success or failure) is fed back to the LLM in the next step, providing crucial feedback. After execution, the agent checks if the overall goal is done. If not, the loop repeats, returning to the perception phase.
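The three-step loop above can be sketched as follows. This is a minimal illustration, not DroidClaw's real API: `perceive`, `reason`, and `execute` are hypothetical stand-ins for the ADB and LLM calls that kernel.ts actually orchestrates.

```typescript
// Minimal sketch of the perception → reasoning → action loop.
// All interfaces and helpers are illustrative stand-ins.

interface Action { name: string; args?: Record<string, unknown>; }
interface StepResult { ok: boolean; message: string; }

interface Agent {
  perceive(): Promise<string>;                        // dump + parse the accessibility tree
  reason(goal: string, screen: string, history: StepResult[]): Promise<Action>;
  execute(action: Action): Promise<StepResult>;       // run the chosen action via ADB
}

async function runLoop(agent: Agent, goal: string, maxSteps = 30): Promise<boolean> {
  const history: StepResult[] = [];
  for (let step = 0; step < maxSteps; step++) {
    const screen = await agent.perceive();                     // 1. perception
    const action = await agent.reason(goal, screen, history);  // 2. reasoning
    if (action.name === "done") return true;                   // goal achieved
    const result = await agent.execute(action);                // 3. action
    history.push(result);                                      // feedback for the next turn
  }
  return false;                                                // step limit reached
}
```

The step counter is what the MAX_STEPS setting described later bounds; feeding `history` back into `reason` is what gives the agent its action feedback and multi-turn memory.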

To prevent the system from "falling apart" due to the inherent fragility of LLM-controlled UIs, DroidClaw incorporates several robust failure handling mechanisms:

  • Stuck Loop Detection: If the screen state remains unchanged for a predefined number of steps (e.g., 3), recovery hints are injected into the LLM's prompt, guiding it to break the loop. These hints are context-aware, tailored to the type of failing action.
  • Repetition Tracking: A sliding window monitors recent actions. If the agent repeatedly executes the same action, such as tapping the same coordinates multiple times (e.g., 3+), it is explicitly instructed to stop and attempt an alternative approach.
  • Drift Detection: If the agent continuously spams navigation actions (e.g., swipe, back, wait) without meaningful interaction, it is nudged to take a more direct action.
  • Vision Fallback: In cases where the accessibility tree is empty (common in webviews, Flutter apps, or games), a screenshot is automatically sent to the LLM instead, alongside coordinate-based tap suggestions, allowing visual reasoning. This can be configured as a fallback or to always be active.
  • Action Feedback: Every action's outcome (success/failure and an accompanying message) is provided back to the LLM in the subsequent turn, enabling the agent to learn from its immediate past.
  • Multi-turn Memory: A conversation history is maintained across steps, providing the LLM with context on what actions have already been attempted, preventing redundant or ineffective retries.
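Repetition tracking over a sliding window can be sketched like this. The class name, window size, and action-key format are assumptions for illustration; only the mechanism (count recent identical actions, trigger at 3+) comes from the description above.

```typescript
// Illustrative repetition tracker: flags when the same action key
// (e.g. "tap:540,960") recurs `threshold`+ times within a sliding window.

class RepetitionTracker {
  private window: string[] = [];

  constructor(private windowSize = 6, private threshold = 3) {}

  record(actionKey: string): void {
    this.window.push(actionKey);
    if (this.window.length > this.windowSize) this.window.shift();
  }

  /** True when `actionKey` appears `threshold` or more times in the window. */
  isRepeating(actionKey: string): boolean {
    return this.window.filter((k) => k === actionKey).length >= this.threshold;
  }
}
```

When the tracker fires, a recovery hint along the lines of "you have tapped the same spot 3 times; try a different element" would be injected into the next LLM prompt.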

DroidClaw offers a rich set of 28 atomic actions, which map directly to ADB commands, categorised as:

  • Basic Interactions: tap, type, enter, longpress, clear, paste, swipe, scroll.
  • Navigation: home, back, launch, switch_app, open_url, open_settings.
  • Clipboard: clipboard_get, clipboard_set.
  • Multi-step Skills: Higher-level compound actions that encapsulate common patterns, reducing LLM decision-making. Examples include read_screen (auto-scrolls and collects text), submit_message, copy_visible_text, wait_for_content, find_and_tap, and compose_email (fills fields using Android intents).
  • System: screenshot, shell, keyevent, pull_file, push_file, wait, done.
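To show how such atomic actions reduce to ADB, the sketch below builds the corresponding `adb shell input` command strings for a few of them. This is not DroidClaw's actions.ts; it only assembles the standard Android input-tooling invocations (`input tap`, `input text`, `input swipe`, `input keyevent`).

```typescript
// Simplified mapping of a few atomic actions to `adb shell input` commands.
// Only builds the command strings; executing them is out of scope here.

type AtomicAction =
  | { name: "tap"; x: number; y: number }
  | { name: "type"; text: string }
  | { name: "swipe"; x1: number; y1: number; x2: number; y2: number; ms?: number }
  | { name: "back" }
  | { name: "home" };

function toAdbCommand(action: AtomicAction): string {
  switch (action.name) {
    case "tap":
      return `adb shell input tap ${action.x} ${action.y}`;
    case "type":
      // `input text` requires spaces to be escaped as %s
      return `adb shell input text "${action.text.replace(/ /g, "%s")}"`;
    case "swipe":
      return `adb shell input swipe ${action.x1} ${action.y1} ${action.x2} ${action.y2} ${action.ms ?? 300}`;
    case "back":
      return "adb shell input keyevent KEYCODE_BACK";
    case "home":
      return "adb shell input keyevent KEYCODE_HOME";
  }
}
```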

The system supports three operational modes:

  • Interactive Mode: The user provides a single goal in plain English, and the LLM determines the necessary sequence of actions on the fly. Best for one-off tasks and exploration.
  • Workflows (AI-powered): Defined in JSON files, these consist of a sequence of sub-goals, potentially spanning multiple applications. The LLM handles navigation and interaction within each step. Ideal for multi-app tasks, recurring routines, and complex processes. formData can inject specific data into steps.
  • Flows (Deterministic): Defined in YAML files, these are fixed sequences of predefined actions (taps, types). They do not involve the LLM, offering instant execution, akin to a macro. Suitable for simple, highly repeatable tasks where no AI reasoning is required.
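An AI-powered workflow file might look roughly like this. The exact schema is not documented in this summary, so every field name and value here is an illustrative assumption apart from the use of JSON, per-step sub-goals, and formData:

```json
{
  "name": "morning-briefing",
  "steps": [
    { "goal": "Open Gmail and read the newest unread email" },
    { "goal": "Open the weather app and check today's forecast" },
    {
      "goal": "Send a summary message to the recipient in the messaging app",
      "formData": { "recipient": "Alice" }
    }
  ]
}
```

Each step is still executed by the LLM-driven loop; the workflow only fixes the sequence of sub-goals and supplies concrete data.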

Configuration is managed via an .env file, allowing selection of LLM providers (Groq, Ollama, OpenAI, OpenRouter, Bedrock) and tuning of MAX_STEPS (the agent's step limit), STUCK_THRESHOLD, VISION_MODE (off/fallback/always), MAX_ELEMENTS (the number of UI elements sent to the LLM), MAX_HISTORY_STEPS, and STREAMING_ENABLED.
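A hypothetical .env putting these options together. The tunable names are taken from the list above; the provider-selection variable and all values are assumptions:

```
# Illustrative .env — variable for provider selection and all values are assumed
LLM_PROVIDER=groq
MAX_STEPS=30
STUCK_THRESHOLD=3
VISION_MODE=fallback
MAX_ELEMENTS=60
MAX_HISTORY_STEPS=10
STREAMING_ENABLED=true
```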

The codebase is modular, with kernel.ts orchestrating the main loop, actions.ts implementing atomic ADB commands, skills.ts defining compound actions, llm-providers.ts handling LLM integrations and the system prompt, and sanitizer.ts parsing accessibility XML.

Connectivity can be via USB or remotely using Tailscale, enabling control of the Android device from any location by connecting ADB over the network, effectively turning the phone into an always-on remote AI agent.
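Assuming Tailscale is already running on both the phone and the controlling machine, the remote hookup would use standard ADB networking commands (`tcpip`, `connect`); the IP below is a placeholder for the phone's Tailscale address:

```shell
adb tcpip 5555                # switch ADB to TCP mode (run once over USB)
adb connect 100.x.y.z:5555    # connect via the phone's Tailscale IP
adb devices                   # verify the remote device is listed
```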