GitHub - rounak/PhoneAgent: An AI agent that can get things done across iPhone apps.

rounak
2025.06.08
· GitHub · by Anonymous
#AI Agent#iOS#OpenAI#GPT-4#UI Testing

Key Points

  • PhoneAgent is an AI agent for iPhones that uses OpenAI models to automate tasks across multiple applications by interacting with their accessibility trees.
  • It leverages Xcode's UI testing harness to inspect and control apps without jailbreaking, communicating via a TCP server to process commands such as tapping, typing, and opening applications.
  • While capable of executing complex instructions and supporting voice input, the experimental software has limitations, including confusion when the UI is captured mid-animation and the lack of a visual representation of the screen.

PhoneAgent is an AI-powered agent for iPhones that enables task automation and interaction across multiple applications, emulating human user behavior. Developed during an OpenAI hackathon, the system leverages OpenAI's gpt-4.1 model to interpret user commands and execute actions on the device.

The core methodology of PhoneAgent circumvents the sandboxed nature of iOS applications and the need for jailbreaking by utilizing Xcode's UI testing harness, specifically XCTest APIs. This allows the agent to inspect and interact with the operating system and various applications. The gpt-4.1 model receives as input the accessibility tree contents of the current application, which provides a structured, semantic representation of the UI elements available on the screen.
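To make the accessibility-tree input concrete, here is a minimal sketch of how such a tree might be flattened into the textual context the model receives. The `AXNode` struct, its field names, and the rendered format are illustrative assumptions; PhoneAgent derives the real tree from XCTest element snapshots.

```swift
// Hypothetical sketch: a simplified accessibility-tree node and a renderer
// that flattens it into text for the model's context window.
// (Illustrative only; PhoneAgent builds the real tree from XCTest snapshots.)
struct AXNode {
    let type: String        // e.g. "Button", "TextField"
    let label: String       // accessibility label
    let identifier: String  // accessibility identifier, if any
    let children: [AXNode]

    /// Render the subtree as indented text, one element per line.
    func rendered(indent: Int = 0) -> String {
        let pad = String(repeating: "  ", count: indent)
        var line = "\(pad)\(type) label=\"\(label)\""
        if !identifier.isEmpty { line += " id=\"\(identifier)\"" }
        return ([line] + children.map { $0.rendered(indent: indent + 1) })
            .joined(separator: "\n")
    }
}

let screen = AXNode(
    type: "Window", label: "Messages", identifier: "",
    children: [
        AXNode(type: "TextField", label: "Message",
               identifier: "composeField", children: []),
        AXNode(type: "Button", label: "Send",
               identifier: "sendButton", children: [])
    ]
)
print(screen.rendered())
```

A structured, line-per-element rendering like this keeps the prompt compact while preserving the identifiers the model needs to reference elements in its tool calls.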

Upon processing the accessibility information and a user's prompt, the model is equipped with a specific set of tools to perform actions:
  • Getting contents of the current app: querying the accessibility tree of the foreground application.
  • Tapping on a UI element: programmatically simulating taps on components identified in the accessibility tree.
  • Typing in a text field: entering text into a designated text field.
  • Opening an app: launching an installed application.
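The tools above would typically be exposed to the model via OpenAI function calling, with each tool call arriving as a name plus JSON arguments. The following sketch shows one way to decode such a call into a typed action; the tool names and argument keys here are assumptions, not the repo's actual schema.

```swift
import Foundation

// Hypothetical sketch of decoding a model tool call into a typed action.
// Tool names ("tap", "open_app", ...) and argument keys are illustrative
// assumptions; PhoneAgent defines its own set.
enum AgentAction: Equatable {
    case getContents
    case tap(element: String)
    case typeText(text: String, field: String)
    case open(app: String)
}

func decodeToolCall(name: String, argumentsJSON: String) -> AgentAction? {
    // Function-calling arguments arrive as a JSON object of strings here.
    let args = (try? JSONSerialization.jsonObject(
        with: Data(argumentsJSON.utf8))) as? [String: String] ?? [:]
    switch name {
    case "get_contents":
        return .getContents
    case "tap":
        guard let element = args["element"] else { return nil }
        return .tap(element: element)
    case "type":
        guard let text = args["text"], let field = args["field"] else { return nil }
        return .typeText(text: text, field: field)
    case "open_app":
        guard let app = args["app"] else { return nil }
        return .open(app: app)
    default:
        return nil
    }
}
```

Mapping free-form tool calls into an exhaustive enum early keeps the XCTest-side executor a simple switch over well-typed cases.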
Communication between the host PhoneAgent application and the underlying UI test runner is facilitated via a TCP Server. This server acts as an intermediary, allowing the host app to trigger prompts and relay information to the XCTest environment, which then executes the model's determined actions.
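A TCP link like this needs some framing so the receiver can tell where one command ends and the next begins. The summary does not document PhoneAgent's actual wire format; the sketch below assumes newline-delimited UTF-8 messages, and `LineFramer` is an illustrative name.

```swift
import Foundation

// Hypothetical sketch of message framing for the host-app-to-test-runner
// TCP link. Newline-delimited UTF-8 messages are an assumption; the repo's
// actual protocol may differ.
struct LineFramer {
    private var buffer: [UInt8] = []

    /// Append an incoming chunk and return any complete messages,
    /// retaining any trailing partial message for the next chunk.
    mutating func consume(_ chunk: Data) -> [String] {
        buffer.append(contentsOf: chunk)
        var messages: [String] = []
        while let nl = buffer.firstIndex(of: UInt8(ascii: "\n")) {
            messages.append(String(decoding: buffer[..<nl], as: UTF8.self))
            buffer.removeSubrange(...nl)
        }
        return messages
    }
}
```

Because TCP is a byte stream, a reader must tolerate commands arriving split across reads or several per read; buffering until a delimiter appears handles both cases.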

Key features of PhoneAgent include its ability to understand and interact with an app's accessibility tree, perform fundamental UI operations such as tapping, swiping, scrolling, typing, and opening applications. It supports follow-up tasks through completion notifications, enables voice interaction via a microphone button, and offers an optional "Always On" mode that listens for a customizable wake word (defaulting to "Agent") even when the application is backgrounded. The system securely stores the user's OpenAI API key on the device's keychain.
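The "Always On" mode described above implies scanning a live speech transcript for the wake word and treating whatever follows as the prompt. A minimal sketch of that step, assuming the real app's speech recognition yields plain text (the function and its name are illustrative):

```swift
import Foundation

// Hypothetical sketch of wake-word handling: find the (customizable) wake
// word in a transcript and return the command that follows it, or nil if
// the wake word is absent or nothing follows.
func command(after wakeWord: String = "Agent", in transcript: String) -> String? {
    guard let range = transcript.range(of: wakeWord, options: .caseInsensitive)
    else { return nil }
    let rest = String(transcript[range.upperBound...])
        .trimmingCharacters(in: CharacterSet(charactersIn: " ,.\t\n"))
    return rest.isEmpty ? nil : rest
}
```

Case-insensitive matching and punctuation trimming matter in practice, since speech recognizers vary in how they capitalize and punctuate dictated text.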

Despite its capabilities, PhoneAgent has limitations. These include challenges with keyboard input, confusion when the view hierarchy is captured during active UI animations, and a tendency for the model to give up prematurely on long-running tasks that lack adequate waiting mechanisms. While the current implementation relies primarily on accessibility tree data, the README notes that future iterations could incorporate an image representation of the screen using XCTest APIs.