
April 3, 2026
20 min read
Every major AI lab in 2024 was racing to ship a better chatbot. By early 2026, the race had shifted entirely. The question was no longer “can AI understand me?” — it was “can AI do things for me?” That shift is what Large Action Models are about.
OpenAI shipped Operator. Anthropic shipped computer use. Google launched Project Mariner. Salesforce open-sourced xLAM — an 8-billion parameter model that outperforms GPT-4o on function calling. This guide covers everything: what LAMs actually are under the hood, the real benchmark numbers, the challenges nobody is talking about, and what founders building MVPs today need to know.
TL;DR
The simplest way to understand the difference: ask GPT-4 to book you a flight and it will tell you how. Ask a LAM and it will open the browser, navigate to Kayak, fill in your dates, compare results, and book the cheapest option — without you touching the keyboard.
Large Language Models are prediction engines. They predict the next token in a sequence. Their output is text — words, code, structured data — all of it lives inside the conversation. Large Action Models are execution engines. Their output is state change in the real world. A file gets created. An email gets sent. A database row gets updated. The action is the output.
| Dimension | LLM | LAM |
|---|---|---|
| Output | Text / tokens | Real-world actions |
| Autonomy | Passive — human follows up | Active — completes end-to-end tasks |
| Tool use | Prompted manually | Built-in, systematic |
| State | Stateless within context | Maintains task state across steps |
| Environment | Text-only world model | Screens, APIs, file systems |
| Feedback loop | None | Continuous — observes outcome, adapts |
| Optimization target | Next-token prediction | Task completion success |
| Training data | Text corpora | Action trajectories + demonstrations |
Where LLMs operate in a single pass (prompt in, tokens out), LAMs run in a continuous loop. Every step in the loop is informed by the outcome of the previous one. That feedback is what makes LAMs genuinely agentic.
1. **Perceive.** Parse user input + current environment state: screenshots, API responses, DOM, sensor data.
2. **Plan.** Decompose the goal into ordered sub-tasks with dependency graphs. Chain-of-thought reasoning.
3. **Execute.** Invoke tools: click UI elements, call APIs, type into forms, run code, read/write files.
4. **Adapt.** Evaluate the outcome. Handle errors. Re-plan if the environment changed unexpectedly.
5. **Iterate.** Loop until the goal is achieved, a human checkpoint is needed, or failure is declared.
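The loop above can be sketched in a few lines of Python. Everything here (the helper functions, the hard-coded sub-task list) is a hypothetical stand-in for real perception, planning, and tooling layers:

```python
# Minimal sketch of the perceive-plan-execute-adapt loop.
# All helpers are illustrative stand-ins, not any vendor's API.

def perceive(env: dict) -> dict:
    """Read the current environment state (screenshot, DOM, API response...)."""
    return {"observation": env["state"]}

def plan(goal: str, obs: dict) -> list[str]:
    """Decompose the goal into the sub-tasks not yet done (hard-coded here)."""
    return [s for s in ["open_site", "fill_form", "submit"] if s not in obs["observation"]]

def execute(action: str, env: dict) -> bool:
    """Invoke a tool and report whether it succeeded."""
    env["state"].append(action)
    return True

def run_agent(goal: str, env: dict, max_steps: int = 10) -> str:
    for _ in range(max_steps):
        obs = perceive(env)          # perceive
        steps = plan(goal, obs)      # plan
        if not steps:                # goal reached: nothing left to do
            return "done"
        ok = execute(steps[0], env)  # execute next sub-task
        if not ok:
            continue                 # adapt: observe failure, re-plan next pass
    return "max_steps"

env = {"state": []}
print(run_agent("book flight", env))  # done
print(env["state"])                   # ['open_site', 'fill_form', 'submit']
```

Each pass re-reads the environment before planning, which is what lets the agent adapt when an earlier action did not land as expected.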
An action space defines the complete set of operations a LAM can execute. This single design choice determines everything from capability to reliability to latency. There are three main types, and production systems in 2026 are converging on hybrids.
**API / function calling.** `search(query)`, `send_email(to, subject, body)`, `create_order(items, payment)`, `update_crm(contact_id, data)`
✓ Structured, verifiable, low error rates
✗ Requires API integrations per service
Examples: Salesforce xLAM, OpenAI function calling

**GUI control.** `click(x, y)`, `type(text)`, `scroll(direction, amount)`, `screenshot()`
✓ Works with any app — no API needed
✗ Higher latency, fragile on UI changes
Examples: OpenAI Operator, Anthropic Computer Use

**Hybrid.** Prefers API calls when available; falls back to GUI when no API exists; validates outcomes either way; self-corrects on failure.
✓ Best of both worlds — speed + reach
✗ Most complex to build and maintain
Examples: most production 2026 systems
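The hybrid pattern reduces to a lookup with a fallback: try a registered API handler first, drive the GUI otherwise, validate either way. The registry, GUI stub, and validator below are hypothetical placeholders, not any vendor's API:

```python
# Sketch of hybrid action dispatch: API path preferred, GUI fallback,
# outcome validation on both. All names here are illustrative.

api_registry = {"send_email": lambda **kw: {"status": "sent", **kw}}

def gui_fallback(action: str, **kwargs) -> dict:
    """Pretend to drive the UI (click/type) when no API exists for the action."""
    return {"status": "done_via_gui", "action": action, **kwargs}

def validate(result: dict) -> bool:
    """Check the outcome regardless of which path produced it."""
    return result.get("status") in {"sent", "done_via_gui"}

def dispatch(action: str, **kwargs) -> dict:
    handler = api_registry.get(action)            # prefer structured API call
    result = handler(**kwargs) if handler else gui_fallback(action, **kwargs)
    if not validate(result):                      # validate either way
        raise RuntimeError(f"action {action!r} failed validation")
    return result

print(dispatch("send_email", to="a@b.com"))       # API path
print(dispatch("update_legacy_erp", field="qty")) # GUI fallback path
```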
Training a LAM is fundamentally different from training an LLM. The data unit is no longer a passage of text — it's an action trajectory: a sequence of (state, action, outcome) tuples recording how a human or strong AI completed a real task.
Salesforce's ActionStudio framework — one of the most rigorous open-source LAM training pipelines available — ships with 97,755 action trajectories across 30,000+ APIs in 300+ domains. Multi-turn trajectories average 9 steps each. That scale of high-quality trajectory data is what separates LAMs that actually work from demos.
Unlike LLM training where a “bad” training example produces a slightly worse text output, a bad action trajectory teaches the model to take a wrong step in a multi-step task — compounding all the way to failure. Quality over quantity is non-negotiable.
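A trajectory can be represented as a chain of (state, action, outcome) steps in which each step's outcome becomes the next step's state. The schema below is illustrative, not the ActionStudio format:

```python
# Sketch of a single action trajectory as (state, action, outcome) tuples.
# Field names and values are made up for illustration.
from dataclasses import dataclass

@dataclass
class Step:
    state: str    # what the agent observed (screenshot ref, DOM, API response)
    action: str   # what it did (tool name + arguments)
    outcome: str  # what happened (new state, error, return value)

trajectory = [
    Step("search page",    "type('NYC to SF')", "results listed"),
    Step("results listed", "click(cheapest)",   "booking form"),
    Step("booking form",   "submit()",          "confirmation"),
]

# Why a bad example compounds: each outcome is the next step's input state,
# so one wrong step early poisons every step downstream.
for prev, nxt in zip(trajectory, trajectory[1:]):
    assert prev.outcome == nxt.state
```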
Standard RLHF for LLMs uses a reward model trained on human preference rankings of text outputs. For LAMs, the reward signal is grounded in reality: did the API call succeed, did the file actually get created, did the environment reach the goal state. Verifiable task outcomes, not preference rankings over text, drive the training signal.
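One way to make a grounded reward concrete: a function that checks a verifiable postcondition in the environment rather than scoring text. The goal spec and file-based check below are a made-up example:

```python
# Sketch of an outcome-grounded reward: instead of a learned preference
# model over text, the reward verifies a postcondition in the real
# environment. The goal schema here is illustrative.
import os
import tempfile

def reward(goal: dict, env_dir: str) -> float:
    """1.0 if the task's postcondition holds in the environment, else 0.0."""
    if goal["kind"] == "file_created":
        return 1.0 if os.path.exists(os.path.join(env_dir, goal["name"])) else 0.0
    return 0.0

with tempfile.TemporaryDirectory() as d:
    goal = {"kind": "file_created", "name": "report.txt"}
    print(reward(goal, d))                             # 0.0
    open(os.path.join(d, "report.txt"), "w").close()   # the agent's "action"
    print(reward(goal, d))                             # 1.0
```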
For GUI-based LAMs, the hardest technical problem is grounding: connecting “click the Submit button” to the correct pixel coordinates on a real screen. Classic approaches parsed HTML accessibility trees — brittle, breaks with dynamic frontends. Modern approaches use Vision Transformers to literally “see” the screen as pixels and locate elements visually.
Microsoft's GUI-Actor (2025) introduced a coordinate-free visual grounding approach that generates multiple candidate interaction regions per forward pass, reducing the pixel-precision errors that plague earlier models. The UGround dataset — 10 million GUI elements with referring expressions across 1.3 million screenshots — has become the canonical pre-training dataset for visual grounding.
Launched January 2025, Operator is OpenAI's production-ready LAM product. It combines GPT-4o's vision capabilities with action-specific reinforcement learning. The core innovation: no custom API integrations required. It works on any website by visually interpreting screenshots, just like a human would.
- 38.1% on OSWorld (full computer use), screenshot-only mode
- 58.1% on WebArena (web task completion)
- 87% on WebVoyager (real-world web tasks)
Updated to o3 Operator in May 2025, with additional safety fine-tuning and reduced susceptibility to prompt injection. As of July 2025, fully integrated into ChatGPT as “ChatGPT agent.” For sensitive actions — entering payment info, confirming purchases — it pauses and asks for human confirmation.
Shipped in October 2024 with Claude 3.5 Sonnet, Anthropic's computer use gives Claude a set of tools: screenshot capability plus mouse and keyboard control. The architecture uses a hybrid topology — high-level semantic planning happens on Anthropic's cloud, while the actual keyboard/mouse manipulation occurs locally on the user's machine, reducing latency for interactive tasks.
As of 2025-2026, Claude Sonnet 4.5 leads the OSWorld benchmark at 61.4%. Claude Opus 4.5 broke the 80% barrier on SWE-bench Verified (80.9%), the software engineering benchmark that measures real code changes on GitHub issues.
Anthropic leans hardest on safety constraints. The model pauses for confirmation more aggressively than competitors — which reduces raw autonomy but builds the kind of trust that enterprise customers actually need before deploying agents with real access to their systems.
Powered by Gemini 2.0, announced December 2024 and expanded at Google I/O 2025. Mariner differentiates itself with two standout features. First, multi-task parallelism — handling up to 10 simultaneous tasks in separate sandboxed browser sessions. Second, “Teach & Repeat” — the model learns custom workflows from a single human demonstration and replicates them on demand.
Mariner scores 83.5% on WebVoyager — among the highest of any production GUI agent. Currently available to Google AI Ultra subscribers ($249.99/month) in the US, with deeper integration into Google Search's AI Mode for booking restaurants, finding event tickets, and other consumer tasks.
xLAM is the most important story in LAMs that most founders aren't paying attention to. Salesforce AI Research built a family of open-source models — 1B to 70B parameters — focused entirely on function calling and tool use (API action space, not GUI). The flagship result: the 8B parameter xLAM-2 model outperforms GPT-4o on both BFCL v3 accuracy (72.83% vs. 72.08%) and multi-turn task accuracy (69.25% vs. 47.62%).
```python
from openai import OpenAI  # xLAM-2 is OpenAI-compatible

client = OpenAI(
    base_url="https://api.together.xyz/v1",
    api_key="YOUR_TOGETHER_API_KEY",
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "search_flights",
            "description": "Search for available flights",
            "parameters": {
                "type": "object",
                "properties": {
                    "origin": {"type": "string"},
                    "destination": {"type": "string"},
                    "date": {"type": "string", "format": "date"},
                    "max_price": {"type": "number"},
                },
                "required": ["origin", "destination", "date"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="Salesforce/xLAM-2-8b-fc-r",
    messages=[
        {"role": "user", "content": "Find me a flight from NYC to SF next Tuesday under $300"}
    ],
    tools=tools,
    tool_choice="auto",
)

# The model returns a structured function call — not free text
tool_call = response.choices[0].message.tool_calls[0]
print(tool_call.function.name)       # search_flights
print(tool_call.function.arguments)  # {"origin":"JFK","destination":"SFO","date":"2026-04-07","max_price":300}
```

The xLAM-2 architecture was trained with the APIGen-MT pipeline — a two-phase framework that generates verifiable multi-turn agent training data using LLM committee review and iterative feedback loops. The result: training data quality that produces a model with stronger multi-turn reasoning than GPT-4o, at a fraction of the inference cost.
Benchmark scores get misquoted constantly. Here's how to read them honestly.
| Benchmark | Measures | Best Model | Score | Human |
|---|---|---|---|---|
| OSWorld | Full computer use (Ubuntu) | Claude Sonnet 4.5 | 61.4% | 72.4% |
| WebArena | Web task completion | OpenAI CUA | 58.1% | 78% |
| WebVoyager | Real-world web navigation | Google Mariner | 83.5% | ~95% |
| BFCL v3 | Function calling accuracy | xLAM-2-8b | 72.83% | — |
| tau-bench (multi-turn) | Multi-turn tool use | xLAM-2-8b | 69.25% | — |
| GAIA Level 3 | Complex reasoning + actions | Writer Action Agent | 61% | 92% |
| SWE-bench Verified | Real GitHub code issues | Claude Opus 4.5 | 80.9% | ~86% |
The pattern is clear: LAMs are approaching human performance on narrow benchmarks, but complex real-world tasks (GAIA Level 3, full OSWorld) still show meaningful gaps. More importantly, benchmark performance and production reliability are different things. In practice, production reliability on real-world enterprise tasks is typically 30–50% lower than benchmark numbers until extensive edge-case handling is built in.
When an LLM hallucinates, it produces a wrong sentence. A reader notices, ignores it, moves on. When a LAM hallucinates an action — deletes a file, sends a draft email, cancels a subscription — the damage is real and potentially irreversible. Worse, errors propagate: a wrong action at step 3 of a 10-step task corrupts every subsequent step. You don't discover the problem until step 9.
Critical: Prompt Injection
Prompt injection is the #1 security vulnerability for production LAMs. A malicious website can embed hidden instructions — invisible to humans, visible to the AI reading the page — that redirect the agent to perform unauthorized actions: steal credentials, exfiltrate data, make unintended purchases.
OWASP ranks it as LLM01:2025, appearing in 73%+ of production AI deployments during security audits. Prompt injection is present in more agentic systems than any other vulnerability class.
OpenAI's official position: “Prompt injection, much like scams and social engineering on the web, is unlikely to ever be fully ‘solved.’”
LAMs are inherently slower than LLMs. Every action step requires a model inference call. GUI-based agents add screen-capture overhead, network latency, and real-world wait times (page loads, API responses). A 10-step task running at 3 seconds per step is 30 seconds minimum — and that's if nothing fails. For asynchronous background workflows, this is fine. For interactive use, it's a significant UX constraint that changes what products you can build.
Enterprises won't deploy LAMs with broad permissions unless they can audit exactly what actions were taken and why. The minimum viable audit trail for a production LAM includes: every action taken, the screen state that prompted it, the model's reasoning step, and the outcome. This infrastructure is non-negotiable for regulated industries — finance, healthcare, legal — and is still underdeveloped in most open-source frameworks.
The LAM API layer matured significantly between 2024 and 2026. Founders can now build genuinely useful LAM-powered products without training their own models. Here's how to think about it.
| Approach | Tooling | When to use |
|---|---|---|
| Use an existing LAM API | OpenAI Agents SDK, Anthropic Computer Use, Google Mariner API | Web/GUI tasks, speed to market is the priority, standard task domains |
| Fine-tune a specialized model | xLAM base, Llama-xLAM-2 (open weights) | You have proprietary trajectory data, need domain superiority at lower cost, latency matters at scale |
| Multi-agent orchestration | OpenAI Agents SDK (March 2025), LangGraph with action nodes | Tasks span multiple specialized domains, reliability is critical, complex workflow logic |
- **Competitor intelligence agent** (~2–3 days). Monitors competitor websites, pricing pages, and job postings. Synthesizes weekly reports. Works on any site without API access. Stack: Anthropic Computer Use + Claude + cron.
- **Lead enrichment pipeline** (~3–5 days). Takes a list of company names, navigates LinkedIn and web properties, fills your CRM with enriched profiles. Replaces 3–5 hours of manual research per sales rep per day.
- **Legacy system automation** (~1–2 weeks). Your customer has enterprise software from 2008 with no modern API. A GUI LAM can navigate it, extract data, and push it into modern systems — zero integration work required.
- **Compliance monitoring** (~3–5 days). Agent navigates regulatory websites (SEC, FDA, GDPR portals), identifies new rules relevant to the client's industry, and delivers structured summaries to the legal team.
- **Developer workflow agent** (~2–4 days). Uses xLAM function-calling to automate routine dev tasks: writing test cases, updating changelogs, performing PR reviews with structured feedback. Integrates with GitHub API.
```python
import anthropic
import logging

# 1. Audit log — every action, every outcome
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
audit_log = logging.getLogger("lam_audit")

client = anthropic.Anthropic()

SENSITIVE_ACTIONS = {"delete", "send_email", "make_payment", "update_database"}

def run_lam_task(user_goal: str) -> dict:
    """Minimal safe LAM agent with HITL, audit trail, and sandboxed scope.

    computer_use_tool, execute_in_sandbox, and format_tool_results are
    placeholders for your tool definition, sandboxed executor, and
    tool-result formatter.
    """
    messages = [{"role": "user", "content": user_goal}]
    step = 0
    max_steps = 20  # Hard cap — prevent runaway loops

    while step < max_steps:
        response = client.beta.messages.create(
            model="claude-opus-4-5-20251001",
            max_tokens=4096,
            tools=[computer_use_tool],
            messages=messages,
        )

        for action in response.content:
            if action.type != "tool_use":  # skip plain text blocks
                continue

            # 2. Audit every action before execution
            audit_log.info("Step %d | Action: %s | Input: %s", step, action.type, action.input)

            # 3. Human-in-the-loop for sensitive/irreversible actions
            action_name = str(action.input.get("action", "")).lower()
            if any(s in action_name for s in SENSITIVE_ACTIONS):
                confirm = input(f"[CONFIRM] Agent wants to: {action.input}. Approve? (y/n): ")
                if confirm.lower() != "y":
                    audit_log.warning("Step %d | BLOCKED by human: %s", step, action.input)
                    return {"status": "blocked", "step": step}

            # 4. Execute in sandboxed environment (separate credentials, min permissions)
            result = execute_in_sandbox(action)
            audit_log.info("Step %d | Outcome: %s", step, result)

        if response.stop_reason == "end_turn":
            return {"status": "success", "steps": step}

        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": format_tool_results(response)})
        step += 1

    return {"status": "max_steps_reached", "steps": step}
```

Build the audit trail, human confirmation, and sandboxing into your LAM from the first commit — not as an afterthought.
⚡ **Domain-specific LAMs will dominate**
xLAM's 8B model outperforming GPT-4o is the template. Fine-tuned domain-specific LAMs — legal, medical, financial, DevOps — will dramatically outperform general models at 10% of the cost. Founders who collect action trajectory data in their domain have a real moat.

🔗 **Multi-agent orchestration goes mainstream**
Deloitte identifies multi-agent orchestration as a defining 2026 enterprise technology trend. The architecture: planner agent → specialist agents → validator agent. The market could reach $8.5B by 2026 and $35B by 2030.

🤖 **Physical AI: LAMs leave the screen**
Vision-Language-Action models for robot manipulation are approaching commercial deployment. IBM research forecasts “world foundation models” — physics-aware LAMs for robotics — entering commercial production in 2026.

🔄 **RPA replacement is underway**
Traditional RPA vendors (UiPath, Automation Anywhere) are integrating LAM capabilities or being disrupted. The migration: Phase 1 — hybrid RPA + LAM. Phase 2 — LAM absorbs rule-based tasks. Phase 3 — full agentic automation with minimal hardcoded rules.
Reality Check: Gartner's Warning
Gartner predicts 40%+ of agentic AI projects will be canceled by 2027 due to soaring costs, unclear business value, and inadequate risk controls. Technical capability and production deployment at scale are entirely different problems. The biggest mistake early LAM startups make: assuming 70% benchmark accuracy translates to 70% production reliability. It doesn't. Build for the 30–50% failure rate first.
Large Action Models are not a research concept anymore. OpenAI, Anthropic, and Google are shipping production LAMs today. An open-source 8B model from Salesforce outperforms GPT-4o on the function-calling benchmark that matters most for enterprise use cases. The infrastructure — APIs, SDKs, fine-tuning pipelines — is available to any founder right now.
What isn't solved: prompt injection, production reliability in complex multi-step workflows, and the auditability infrastructure that regulated industries require. These aren't blockers — they're the engineering surface where 2026's LAM startups will win or lose.
The LLM era was the “read” era. LAMs are the “write” era — AI that doesn't just understand your world but actively changes it. The question for founders isn't whether to build with LAMs. It's which action trajectory dataset you're going to collect that nobody else has.