
Agent-Computer Interface Paradigms: How do AI Agents Interact with Software?

March 24, 2025 · 12 min read

William Lee · Boltzmann Soul

Thank you to my reviewers Kevin Zhang, Hanyi Du, Rubab Uddin, and Qian Zheng.

Not a day goes by where you don't see 100 posts about how 2025 is the year of AI agents. But what exactly defines an AI agent, and how do they communicate with digital environments? People commonly mistake simple LLM-powered chatbots for AI agents, but real AI agents have the ability to perform actions and interact with existing software systems.

While numerous companies are developing AI agents with specialized capabilities, the ChatGPT moment for agents will probably belong to a general-purpose agent that can seamlessly interact across multiple computing environments and carry out virtually any task asked of it. Manus AI is one of the first to demonstrate this vision to consumers.

Much like how biological organisms evolved specialized sensory systems to perceive and interact with their environments, AI agents are developing distinct interface modalities to sense and manipulate software. Just as evolution favored organisms that could integrate multi-sensory inputs—sight, sound, touch, smell—the next evolution in AI agents may favor those that can seamlessly integrate multiple interface types. Future general purpose AI agents will need to be capable of OS interactions, remote web browsing, and API tool-use, and infrastructure has to emerge to tackle the challenges that arise from each of these ACI categories.

This article explores the current landscape of Agent-Computer Interfaces (ACIs) and lays the groundwork for future deep-dives into the infrastructural challenges of achieving true multi-category agency.

Understanding AI Agents and ACIs

At their core, AI agents are software systems whose workflows are controlled by outputs from Language Models (LMs). These outputs determine how and when to use various tools or software services. While there's a broad spectrum of agency—from simple task routing to complex decision-making—today's bar for AI agents requires a system to be capable of autonomously deciding when and how to invoke tools to perform actions based on LM-driven instructions.

The concept of an Agent-Computer Interface (ACI) was first introduced by John Yang et al. from Princeton in May 2024 in the SWE-agent paper. ACIs serve as abstraction layers between LM-driven agents and the digital environment. Much like robots navigating and interacting in the physical world, there are two fundamental ACI components:

  • Sensors are components that let agents take in information about their environment and create inputs for LMs to process.

  • Actuators are components that translate LM output into some action on the agent's environment.
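As a minimal sketch of how these two components fit together (illustrative names, not the SWE-agent implementation), an agent step can be reduced to a perceive-decide-act loop:

```python
from typing import Protocol

class Sensor(Protocol):
    """Turns the current state of the environment into input for the LM."""
    def observe(self) -> str: ...

class Actuator(Protocol):
    """Turns an LM decision into a concrete action on the environment."""
    def act(self, command: str) -> str: ...

def agent_step(sensor: Sensor, actuator: Actuator, llm) -> str:
    """One perceive -> decide -> act cycle of a minimal ACI-based agent."""
    observation = sensor.observe()   # sense the environment
    decision = llm(observation)      # the LM decides what to do next
    result = actuator.act(decision)  # actuate the decision
    return result                    # feeds into the next observation
```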

The Three Main Categories of ACI

My observation of current agent-computer interfaces indicates that they fall into three distinct categories, each tailored to different types of agent interactions:

1. Computer-use ACI

This type primarily interacts directly with operating systems, command lines, and graphical user interfaces (GUIs). It includes tools for cursor movement, clicking buttons, typing text, and other direct interactions. For GUI interactions, multimodal models with vision capabilities are essential.

Computer-use ACIs often employ a mixture of accessibility trees, screenshots, and the Set of Marks (SOM) method as their sensors. Single-modal LMs may have to navigate OS environments through XML accessibility trees, whereas vision language models (VLMs) can work from direct screenshots and pixel counting. SOM, on the other hand, is a method where the agent identifies and interacts with UI elements by creating visual reference points (e.g. numbered marks) on the interface. This perception-based approach allows agents to "see" and interact with elements that might not be easily accessible through programmatic means, similar to how humans visually navigate unfamiliar interfaces.
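To make the accessibility-tree sensor concrete, here is a minimal sketch that flattens a toy tree into numbered lines a text-only LM can reference, and resolves the LM's chosen index back to screen coordinates for the actuator. The node fields are illustrative and do not correspond to any particular OS accessibility API.

```python
# Toy accessibility tree; the fields ("role", "name", "bounds", "children") are illustrative.
toy_tree = {
    "role": "window", "name": "Settings", "bounds": (0, 0, 800, 600),
    "children": [
        {"role": "button", "name": "Wi-Fi", "bounds": (20, 40, 120, 70), "children": []},
        {"role": "button", "name": "Display", "bounds": (20, 80, 120, 110), "children": []},
    ],
}

def flatten(node, elements=None):
    """Depth-first walk that collects every node along with its bounding box."""
    if elements is None:
        elements = []
    elements.append(node)
    for child in node.get("children", []):
        flatten(child, elements)
    return elements

elements = flatten(toy_tree)
# Numbered lines that become the text-only LM's observation.
for i, el in enumerate(elements):
    print(f"[{i}] {el['role']}: {el['name']} @ {el['bounds']}")

# If the LM answers e.g. "click 1", the actuator resolves the index to coordinates.
x1, y1, x2, y2 = elements[1]["bounds"]
click_point = ((x1 + x2) // 2, (y1 + y2) // 2)  # center of the "Wi-Fi" button
```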

Example agents here include Anthropic's Claude Computer-use and ChatGPT Operator.

Fig. 1: Computer use illustration. Reproduced from OSWorld (May 2024)

2. Browsing ACI

Browsing ACIs are used by web agents to enable browser interactions. While they traditionally rely on parsing the Document Object Model (DOM) of webpages as their sensor to identify interactive elements, recent web agents combine DOM parsing with the Set of Marks (SOM) approach.

DOM parsing and the SOM method represent different approaches to interface interaction (structural versus perceptual), and VLM-powered agents can leverage both. DOM parsing provides structural understanding of a webpage's elements, but can fail with dynamically generated content or visually distinct but structurally similar elements. The SOM method complements DOM parsing by creating visual anchors that help the agent identify elements based on how they appear rather than just their underlying code structure.

This hybrid approach—combining structural DOM understanding with visual perception—enhances the agent's ability to navigate complex websites and perform tasks reliably. It also provides greater resilience when websites update their underlying code without changing their visual appearance.
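As a rough illustration of the structural half of this hybrid, the sketch below parses a small HTML snippet with BeautifulSoup and assigns SOM-style indices to the interactive elements; a VLM-based agent would additionally see the same indices drawn as visual marks on a screenshot. This is a simplified sketch, not how any particular web agent is implemented.

```python
from bs4 import BeautifulSoup  # assumes `pip install beautifulsoup4`

html = """
<form>
  <input type="text" name="q" placeholder="Search">
  <button id="go">Search</button>
  <a href="/login">Log in</a>
</form>
"""

soup = BeautifulSoup(html, "html.parser")
# Collect the elements an agent could plausibly interact with.
interactive = soup.find_all(["a", "button", "input", "select", "textarea"])

# Assign SOM-style indices; a VLM would also see these numbers overlaid on a screenshot.
for i, el in enumerate(interactive):
    label = el.get_text(strip=True) or el.get("placeholder") or el.get("name") or ""
    print(f"[{i}] <{el.name}> {label}")

# An LM output like {"action": "click", "element": 1} can then be resolved
# back to the <button id="go"> node for the actuator to act on.
target = interactive[1]
```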

Example agents here include the recently viral Browser-use and Convergence AI's Proxy—probably the best web agent on the market right now.

Fig. 2: Browser Use agent in action with SOM and DOM parsing. Reproduced from Browser Use (Mar 2025)

3. Tool-use ACI

Tool-use ACIs are centered around APIs, external services, and code execution environments. This category relies on reading and understanding documentation, API schemas, and code as its sensors. These interfaces increasingly adopt structured communication protocols like Anthropic's Model Context Protocol (MCP), though MCP is not without its flaws. Unlike traditional API integration, LLM-driven tool calling introduces unique challenges where the model must parse capabilities, determine when to invoke tools, generate valid parameters, and interpret responses back in natural language—all while maintaining semantic context across multiple calls.
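To make those steps concrete, here is a minimal sketch of a statically defined tool in the JSON-Schema style used by common function-calling APIs, together with the dispatch an agent runtime performs once the LM decides to invoke it. The tool name and fields are illustrative, and this is not MCP's exact wire format.

```python
import json

# A static tool definition in the JSON-Schema style of common function-calling APIs.
get_weather_tool = {
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name, e.g. 'Berlin'"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
    },
}

def get_weather(city: str, unit: str = "celsius") -> str:
    """Stand-in implementation; a real agent would call an external API here."""
    return f"22 degrees {unit} in {city}"

# The LM sees the schema in its context and, when it decides the tool is needed,
# emits a structured call like this (normally parsed from the model's output):
llm_tool_call = '{"name": "get_weather", "arguments": {"city": "Berlin"}}'

call = json.loads(llm_tool_call)
result = {"get_weather": get_weather}[call["name"]](**call["arguments"])
# `result` is fed back to the LM so it can interpret the response in natural language.
print(result)
```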

Tool-calling requires defining tools either ahead of time with explicit schemas (the static approach) or allowing the LLM to generate code for ad-hoc interactions (the dynamic approach). Both methods present their own challenges: static tool definitions consume valuable context tokens but provide structured guardrails, while dynamic scripting offers flexibility but increases security and reliability risks.

Effective tool-calling schemas must surface and prioritize essential information while minimizing token usage. LLMs need comprehensive parameter descriptions to generate valid requests but struggle with overly verbose schemas that consume limited context windows, a trade-off not present in traditional API integration patterns. API interdependencies introduce further complications: when outputs from one tool must serve as inputs to another, agents need to maintain semantic understanding across service boundaries.
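As a toy illustration of that last point, the sketch below chains two hypothetical tools where the identifier returned by one call has to be interpreted correctly and reused as a parameter for the next; in a real agent, the "pick the cheapest flight" reasoning is done by the LM itself, with the `min` call here standing in for it.

```python
import json

# Hypothetical pair of interdependent tools: the id returned by the first
# must be carried, with its meaning intact, into the second call.
def search_flights(origin: str, destination: str) -> str:
    return json.dumps([{"flight_id": "UA-0042", "price_usd": 412},
                       {"flight_id": "LH-0717", "price_usd": 389}])

def book_flight(flight_id: str) -> str:
    return json.dumps({"status": "confirmed", "flight_id": flight_id})

# Step 1: the agent calls the first tool and the raw JSON lands in the LM's context.
options = json.loads(search_flights("SFO", "NRT"))

# Step 2: the LM must interpret that output (e.g. "pick the cheapest flight")
# and generate a *valid* parameter for the next tool from it.
cheapest = min(options, key=lambda f: f["price_usd"])
print(book_flight(flight_id=cheapest["flight_id"]))
```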

Fig. 3: Function calling (tool-use) illustration. Reproduced from OpenAI (Mar 2025)

It is worth noting that ACIs are not mutually exclusive; they can be used in tandem or nested. For example, browser automation may be invoked as a tool through an API, a computer-use agent might access and utilize a browser available in its VM instance, or a web-use agent might have API tool-calls available in its action space. This type of hybridization is most commonly seen in the design of general AI agents intended to tackle a breadth of different tasks. Regardless of which type of ACI an agent interacts with, LMs need to translate their decisions and thoughts into structured outputs to trigger actions, and this too can influence ACI effectiveness. We will briefly cover the two most common output types below.
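Before turning to those output types, here is a minimal sketch of the nesting described above: a browsing capability exposed to the tool-use ACI as just another tool. The `run_browser_task` function and the schema fields are hypothetical placeholders, not any particular framework's API.

```python
# Hypothetical sketch: a browsing ACI wrapped as a tool inside the tool-use ACI.
def run_browser_task(goal: str) -> str:
    """Placeholder for a web sub-agent (e.g. a DOM+SOM browsing loop) run to completion."""
    return f"(browsing result for: {goal})"

browse_tool = {
    "name": "browse_web",
    "description": "Delegate a goal to a browser-based sub-agent and return its findings.",
    "parameters": {"type": "object",
                   "properties": {"goal": {"type": "string"}},
                   "required": ["goal"]},
}

# From the orchestrating agent's point of view, browsing is now just one more
# tool call; the browsing ACI's own sensors and actuators are hidden behind it.
registry = {"browse_web": run_browser_task}
print(registry["browse_web"](goal="Find the pricing page of example.com"))
```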

Actuator Languages: JSON vs. Code-Based Agent Outputs

AI agents commonly use the above categories of ACIs through two types of outputs:

  • JSON outputs, where the LM emits a structured object (typically a function or tool call with named parameters) that the runtime validates and dispatches to a predefined handler.

  • Code outputs, where the LM writes executable code (most often Python) that is run directly as the action, allowing it to compose multiple steps, loops, and conditionals in a single turn.

The optimal choice is likely context-dependent—with JSON potentially outperforming in highly structured environments and code excelling in open-ended tasks. As AI agents mature, we are likely to see increased hybridization of ACI categories and output types.
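As a rough sketch of how a runtime might handle each style, the toy example below validates and dispatches a JSON action to a known handler, and executes a code action directly via `exec` (with the obvious caveat that real systems need sandboxing). The `click` handler and output formats are illustrative only.

```python
import json

def click(element_id: int) -> str:
    return f"clicked element {element_id}"

HANDLERS = {"click": click}

def run_json_action(llm_output: str) -> str:
    """Structured path: parse, validate against known actions, dispatch."""
    action = json.loads(llm_output)
    return HANDLERS[action["action"]](**action["arguments"])

def run_code_action(llm_output: str) -> str:
    """Flexible path: execute model-written code; needs a real sandbox in practice."""
    scope = {"click": click, "results": []}
    exec(llm_output, scope)  # WARNING: only safe inside a proper sandbox
    return "\n".join(scope["results"])

print(run_json_action('{"action": "click", "arguments": {"element_id": 3}}'))
print(run_code_action("for i in range(2):\n    results.append(click(i))"))
```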

Fig. 4: Agent using JSON as action vs code as action. Reproduced from Executable Code Actions Elicit Better LLM Agents (June 2024)

The Performance Advantage of Multi-ACI Agents

The parallel between ACI development and biological sensory evolution provides an interesting thought experiment about why multi-ACI agents represent a necessary step toward effective general AI agents.

In nature, while some organisms evolved extreme specialization in one sensory system (like a star-nosed mole's extraordinary touch sensitivity), casual observation suggests that the most adaptable and successful species developed balanced, integrated multi-sensory capabilities. Similarly, while specialized agents may excel in narrow domains, general-purpose agents require integrated multi-ACI capabilities. Just as biological evolution produced creatures with sensory specializations matched to their ecological niches, we'll likely see specialized agents optimized for particular digital environments. However, the AI equivalent of apex predators—the most generally capable agents—will be those that seamlessly integrate all ACI categories.

Recent research, such as the beyond-browsing paper from CMU, provides indication that agents capable of operating across multiple ACI categories can outperform single-ACI agents. This performance gap widens particularly for complex tasks requiring diverse interactions.

Fig. 5: Browsing vs API vs hybrid agent performance on WebArena. Reproduced from Beyond Browsing (Jan 2025)

Hypothetically, agents that can dynamically switch between JSON and code outputs based on the task context may also achieve greater use-case flexibility and consistency. JSON provides structured reliability for well-defined tasks, while code offers flexibility for novel or complex situations. The highest-performing agents may leverage both approaches strategically.

While hybridizing ACIs may lead to more powerful agents, it also introduces new infrastructure challenges—particularly around unified authentication, session management, and security. These challenges represent one current frontier in agent infrastructure development.

Authentication Differences Across ACIs

Computer-use and web-use authentication appear to share the same pattern at first glance: Generally, these agents either inherit the user's existing authenticated session (e.g. Anthropic's Claude computer-use) if used locally or require the user to intervene and input credentials (ChatGPT Operator) if used in a sandbox.

Tool-use authentication, by contrast, follows a different pattern: agents using APIs or external services typically authenticate through API keys, OAuth tokens, or similar credentials managed separately from user logins.
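A hypothetical sketch of how an agent runtime might resolve credentials differently per ACI category: computer-use and browsing reuse (or request) a user session, while tool-use pulls an API key or OAuth token from a secrets store. The function, the session path, and the environment-variable convention are all illustrative assumptions, not ACI.dev's design.

```python
import os

# Hypothetical per-ACI credential resolution; names and structure are illustrative.
def resolve_credentials(aci_category: str, service: str) -> dict:
    if aci_category in ("computer_use", "browsing"):
        # Inherit the user's existing authenticated session (local cookies/keychain),
        # or pause for human-in-the-loop login when running in a remote sandbox.
        has_session = os.path.exists(os.path.expanduser("~/.myagent/session"))
        return {"mode": "user_session", "service": service,
                "requires_user_login": not has_session}
    if aci_category == "tool_use":
        # API keys / OAuth tokens managed separately from the user's own logins.
        token = os.environ.get(f"{service.upper()}_API_KEY")
        return {"mode": "api_key", "service": service, "token": token}
    raise ValueError(f"unknown ACI category: {aci_category}")

print(resolve_credentials("browsing", "github"))
print(resolve_credentials("tool_use", "github"))
```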

However, a deep-dive into authentication in computer-use and remote browsing quickly reveals completely different sets of challenges around granularity limitations, persistence risks, and identity context, among others. Human-in-the-loop design for the JSON vs. code output types and action actuation will likely also differ. As Rubab from Moveworks puts it: at what point should the autonomous operation of an agent be paused for review? We will save the deep dive on agent authentication across ACI categories for a separate article, but there is a clear need for unified authentication infrastructure to streamline and secure cross-category agent operations.

Looking Ahead: The Age of Multi-ACI Agents

Just as biological evolution progressed from single-celled organisms with primitive light sensitivity to complex creatures with integrated multi-sensory systems, AI agents may evolve from single-ACI specialists toward integrated multi-ACI generalists.

The integration challenges, particularly around unified authentication across ACI categories, may parallel early evolutionary bottlenecks that needed to be solved. Much like how the development of the vertebrate brain created a central processing hub for multiple sensory inputs, new infrastructure is needed to coordinate authentication and context across different digital interfaces.

As these multi-ACI agents mature, we'll witness a profound shift in software interaction patterns. Traditional SaaS product interfaces—designed for human sensory systems—will increasingly be supplemented or replaced by interfaces optimized for agent perception. The primary "users" of these interfaces will increasingly be AI agents rather than humans.

The next article, which will be on our company blog, will deep-dive into the challenges around cross-category agent authentication and how ACI.dev is working to address them. If you are excited about this topic or AI agents in general, please reach out; I'd love to chat!

Please subscribe and follow me on LinkedIn and X for more content!

Many thanks to Kevin Zhang, Hanyi Du, Rubab Uddin, and Qian Zheng for reviewing this article!