🖱️ Computer Use Tool
Let Claude control a computer — screenshots, mouse, and keyboard
Imagine giving Claude the ability to see your screen and interact with it — just like a human sitting in front of a computer. That is exactly what the Computer Use tool does. It is one of Claude's most advanced and exciting features.
What is Computer Use?
The Computer Use tool gives Claude the ability to:
- Take screenshots to see what is on the screen
- Move the mouse cursor and click on elements
- Type on the keyboard and enter text
- Scroll up and down on pages
- Execute keyboard shortcuts like Ctrl+C and Ctrl+V
In other words, Claude can interact with any application or website just like you do.
How Does It Work? The Screenshot-Analyze-Action Loop
The Computer Use tool works in a continuous loop:
- Take a screenshot — Claude captures an image of the current screen
- Analyze the image — Claude understands what it sees (buttons, input fields, text, etc.)
- Take action — Claude decides what to do (click, type, scroll)
- Take a new screenshot — to verify the result
- Repeat — until the task is complete
┌─────────────────────────────────────────────┐
│ Computer Use Tool Loop │
│ │
│ 📸 Take Screenshot │
│ ↓ │
│ 🧠 Analyze: What do I see? │
│ ↓ │
│ 🎯 Decide: What do I do next? │
│ ↓ │
│ 🖱️ Execute: Click / Type / Scroll │
│ ↓ │
│ 📸 Take New Screenshot to Verify │
│ ↓ │
│ ✅ Is the task complete? │
│ Yes → Done | No → Repeat │
└─────────────────────────────────────────────┘
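The steps above can be sketched as a short Python loop. Everything here is a stand-in (`toy_model`, `toy_perform`, and the decision format are invented for illustration, not the real API); the full, API-backed versions appear later in this chapter:

```python
# Minimal sketch of the screenshot-analyze-act loop.
# The helpers are stubs so the shape of the loop is runnable;
# a real agent would call the Anthropic API and a screen-control library.

def run_loop(task, ask_model, perform_action, max_steps=20):
    """Drive the loop until the model reports the task is done."""
    history = [("user", task)]
    for _ in range(max_steps):
        decision = ask_model(history)                # analyze latest screenshot
        if decision["done"]:
            return decision["summary"]
        result = perform_action(decision["action"])  # click / type / scroll
        history.append(("tool_result", result))      # includes the new screenshot
    return "Step limit reached"

# Toy model: requests one click, then declares success.
def toy_model(history):
    if any(kind == "tool_result" for kind, _ in history):
        return {"done": True, "summary": "Task complete"}
    return {"done": False, "action": {"action": "click", "coordinate": [10, 20]}}

def toy_perform(action):
    return f"executed {action['action']}"

print(run_loop("demo task", toy_model, toy_perform))  # Task complete
```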
Tool Definition in the API
To enable the Computer Use tool, you need to define it in your tools list:
{
"tools": [
{
"type": "computer_20250124",
"name": "computer",
"display_width_px": 1920,
"display_height_px": 1080,
"display_number": 1
}
]
}

Important note: You must specify the screen resolution (width and height) so Claude can understand coordinates correctly.
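A related detail worth noting: if you downscale screenshots before sending them to the model, the coordinates it returns refer to the scaled image and must be mapped back onto the real display before clicking. A minimal sketch, assuming a hypothetical 2560x1440 display presented to the model at 1280x720:

```python
# Map a coordinate from the model's (scaled) view onto the physical display.
# The sizes below are illustrative assumptions, not values from the API.

def to_display_coords(x, y, model_size, display_size):
    """Scale a (x, y) point from the screenshot size to the display size."""
    sx = display_size[0] / model_size[0]
    sy = display_size[1] / model_size[1]
    return round(x * sx), round(y * sy)

# Model sees 1280x720 screenshots; the real display is 2560x1440.
print(to_display_coords(640, 360, (1280, 720), (2560, 1440)))  # (1280, 720)
```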
Full Example: Sending a Request with the Computer Tool
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();
async function computerUseLoop(task) {
let messages = [
{
role: "user",
content: task,
},
];
while (true) {
const response = await client.messages.create({
model: "claude-sonnet-4-20250514",
max_tokens: 4096,
tools: [
{
type: "computer_20250124",
name: "computer",
display_width_px: 1920,
display_height_px: 1080,
display_number: 1,
},
],
messages,
});
// Check if Claude wants to use a tool
const toolUseBlocks = response.content.filter(
(block) => block.type === "tool_use"
);
if (toolUseBlocks.length === 0) {
// Claude is done — no more tools needed
const textBlock = response.content.find(
(block) => block.type === "text"
);
return textBlock?.text || "Task complete";
}
// Process each tool request
const toolResults = [];
for (const toolUse of toolUseBlocks) {
const result = await executeComputerAction(toolUse.input);
toolResults.push({
type: "tool_result",
tool_use_id: toolUse.id,
content: result,
});
}
// Add Claude's response and tool results to the conversation
messages.push({ role: "assistant", content: response.content });
messages.push({ role: "user", content: toolResults });
}
}

Implementing Computer Actions
When Claude requests an action, you will receive data like this:
{
  "action": "click",
  "coordinate": [945, 520]
}

{
  "action": "type",
  "text": "Hello world"
}

{
  "action": "key",
  "text": "ctrl+a"
}

{
  "action": "screenshot"
}

{
  "action": "scroll",
  "coordinate": [960, 540],
  "delta_x": 0,
  "delta_y": -300
}

Your execution function should handle each action type:
async function executeComputerAction(input) {
const { action } = input;
switch (action) {
case "screenshot":
// Take a screenshot and return it as a base64 image
const screenshot = await takeScreenshot();
return [
{
type: "image",
source: {
type: "base64",
media_type: "image/png",
data: screenshot,
},
},
];
case "click":
// Click on the specified coordinates
const [x, y] = input.coordinate;
await moveMouse(x, y);
await mouseClick(input.button || "left");
// Take a screenshot after clicking
const afterClick = await takeScreenshot();
return [
{
type: "image",
source: {
type: "base64",
media_type: "image/png",
data: afterClick,
},
},
];
case "type":
// Type the specified text
await typeText(input.text);
const afterType = await takeScreenshot();
return [
{
type: "image",
source: {
type: "base64",
media_type: "image/png",
data: afterType,
},
},
];
case "key":
// Press a key or keyboard shortcut
await pressKey(input.text);
const afterKey = await takeScreenshot();
return [
{
type: "image",
source: {
type: "base64",
media_type: "image/png",
data: afterKey,
},
},
];
case "scroll":
// Scroll in the specified direction
const [sx, sy] = input.coordinate;
await scrollAt(sx, sy, input.delta_x, input.delta_y);
const afterScroll = await takeScreenshot();
return [
{
type: "image",
source: {
type: "base64",
media_type: "image/png",
data: afterScroll,
},
},
];
default:
return [{ type: "text", text: "Unknown action: " + action }];
}
}

Sandbox Setup (Mandatory!)
Security warning: Never run the Computer Use tool directly on your personal machine. Always use an isolated environment (Sandbox).
Why is a Sandbox Necessary?
- Claude can click on anything on the screen — including things you do not want it to click
- It might open sensitive files or accidentally delete data
- It could interact with other applications open on your machine
- In a sandbox, the worst that happens is you restart the sandbox
Common Sandbox Options
┌──────────────────────────────────────────────────────┐
│ Sandbox Environment Options │
│ │
│ 🐳 Docker Container (Most Common) │
│ - Quick to set up │
│ - Fully isolated from your system │
│ - Easy to stop and restart │
│ │
│ 🖥️ Virtual Machine (VM) │
│ - Complete isolation │
│ - Can run any operating system │
│ - Slower to set up │
│ │
│ ☁️ Cloud VM │
│ - Does not affect your machine at all │
│ - Easy to scale │
│ - Requires internet connection │
└──────────────────────────────────────────────────────┘
Setting Up a Docker Sandbox
FROM ubuntu:22.04
# Install desktop environment
RUN apt-get update && apt-get install -y \
xfce4 \
xfce4-goodies \
tightvncserver \
firefox \
python3 \
python3-pip \
xdotool \
scrot \
&& rm -rf /var/lib/apt/lists/*
# Set up VNC
RUN mkdir -p /root/.vnc && \
echo "password" | vncpasswd -f > /root/.vnc/passwd && \
chmod 600 /root/.vnc/passwd
# Start the desktop
CMD ["vncserver", ":1", "-geometry", "1920x1080", "-depth", "24"]

# Build and run the container
docker build -t claude-sandbox .
docker run -d -p 5901:5901 --name claude-computer claude-sandbox
# Now you can connect to the screen via VNC on port 5901

Practical Examples
Example 1: Filling a Web Form
const result = await computerUseLoop(
"Open Firefox and go to https://example.com/form " +
"then fill in the form with the following data: " +
"Name: John Smith, Email: john@example.com, " +
"Age: 30, then click the Submit button"
);
console.log(result);

Claude will automatically:
- Open the browser and navigate to the URL
- Take a screenshot and identify form fields
- Click on each field and type the data
- Click the submit button
- Verify the operation succeeded
Example 2: Browsing a Website and Gathering Information
const result = await computerUseLoop(
"Open Firefox and go to https://news.ycombinator.com " +
"and collect the titles of the first 5 articles on the homepage"
);
console.log(result);

Example 3: Testing a User Interface
const result = await computerUseLoop(
"Open Firefox and go to http://localhost:3000 " +
"and test the following points: " +
"1. Does the login button work? " +
"2. Does an error appear when entering incorrect data? " +
"3. Does the design look correct on the screen? " +
"Record your observations for each point"
);
console.log(result);

Available Actions — Quick Reference
| Action | Description | Parameters |
|---|---|---|
| screenshot | Take a screenshot | None |
| click | Click on a point | coordinate, button (optional) |
| double_click | Double click | coordinate |
| type | Type text | text |
| key | Press a key/shortcut | text (e.g. "ctrl+c") |
| scroll | Scroll | coordinate, delta_x, delta_y |
| drag | Drag | start_coordinate, end_coordinate |
| wait | Wait | duration (in seconds) |
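Before dispatching an action it can help to validate the payload against the table above. A small sketch (the required-parameter map simply mirrors the quick reference; adjust it to the tool version you actually use):

```python
# Required parameters per action, mirroring the quick-reference table.
REQUIRED_PARAMS = {
    "screenshot": [],
    "click": ["coordinate"],
    "double_click": ["coordinate"],
    "type": ["text"],
    "key": ["text"],
    "scroll": ["coordinate", "delta_x", "delta_y"],
    "drag": ["start_coordinate", "end_coordinate"],
    "wait": ["duration"],
}

def validate_action(payload):
    """Return a list of problems with an incoming action payload."""
    action = payload.get("action")
    if action not in REQUIRED_PARAMS:
        return [f"unknown action: {action}"]
    return [f"missing parameter: {p}"
            for p in REQUIRED_PARAMS[action] if p not in payload]

print(validate_action({"action": "click"}))               # ['missing parameter: coordinate']
print(validate_action({"action": "type", "text": "hi"}))  # []
```

Rejecting malformed payloads early gives Claude a clear error message to react to, instead of a silent no-op on screen.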
Limitations and Important Notes
This Tool is in Beta
- The tool is still in Beta stage
- Performance can vary — sometimes Claude clicks in the wrong place
- Very small text can be difficult for Claude to read
- Animated or rapidly changing applications may cause confusion
Tips for Best Results
- Use an appropriate screen resolution — 1920x1080 or 1280x720 work well
- Make instructions clear and specific — "Click the blue button in the top right" is better than "click there"
- Break down large tasks — instead of one complex task, split it into steps
- Monitor execution — especially in early development stages
- Handle errors — add retry logic for failures
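The last tip, retry logic, can be a thin wrapper around your action executor. A sketch with an invented flaky helper; real code should catch a narrower exception type than `Exception`:

```python
import time

def with_retries(fn, attempts=3, delay=1.0):
    """Retry a flaky action a few times before giving up."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:               # narrow this in real code
            last_error = exc
            time.sleep(delay * (attempt + 1))  # simple linear backoff
    raise RuntimeError(f"action failed after {attempts} attempts") from last_error

# Toy action that fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("screenshot failed")
    return "ok"

print(with_retries(flaky, attempts=3, delay=0.0))  # ok
```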
When to Use Computer Use vs Other Tools?
┌─────────────────────────────────────────────────────────┐
│ When to Use the Computer Use Tool? │
│ │
│ ✅ Use it when: │
│ • There is no API for the app you want to interact │
│ with │
│ • You need to visually test a user interface │
│ • You are dealing with legacy apps without an API │
│ • You want to automate processes that require human │
│ interaction │
│ │
│ ❌ Do NOT use it when: │
│ • An API is available — use regular tool_use │
│ • You want to process data — use the code execution │
│ tool │
│ • You want web search — use the search tool │
│ • The task is purely text-based — no screen needed │
└─────────────────────────────────────────────────────────┘
Full Python Example
import anthropic
import base64
import subprocess

client = anthropic.Anthropic()
def take_screenshot():
"""Take a screenshot using scrot"""
subprocess.run(["scrot", "/tmp/screen.png"], check=True)
with open("/tmp/screen.png", "rb") as f:
return base64.b64encode(f.read()).decode()
def execute_action(action_input):
"""Execute an action on the computer"""
action = action_input["action"]
if action == "screenshot":
img = take_screenshot()
return [{"type": "image", "source": {
"type": "base64", "media_type": "image/png", "data": img
}}]
elif action == "click":
x, y = action_input["coordinate"]
subprocess.run(["xdotool", "mousemove", str(x), str(y)])
subprocess.run(["xdotool", "click", "1"])
img = take_screenshot()
return [{"type": "image", "source": {
"type": "base64", "media_type": "image/png", "data": img
}}]
elif action == "type":
subprocess.run(["xdotool", "type", "--", action_input["text"]])
img = take_screenshot()
return [{"type": "image", "source": {
"type": "base64", "media_type": "image/png", "data": img
}}]
elif action == "key":
subprocess.run(["xdotool", "key", action_input["text"]])
img = take_screenshot()
return [{"type": "image", "source": {
"type": "base64", "media_type": "image/png", "data": img
}}]
elif action == "scroll":
x, y = action_input["coordinate"]
subprocess.run(["xdotool", "mousemove", str(x), str(y)])
clicks = abs(action_input["delta_y"]) // 100
direction = "5" if action_input["delta_y"] < 0 else "4"
for _ in range(max(1, clicks)):
subprocess.run(["xdotool", "click", direction])
img = take_screenshot()
return [{"type": "image", "source": {
"type": "base64", "media_type": "image/png", "data": img
}}]
return [{"type": "text", "text": f"Unknown action: {action}"}]
def computer_use_loop(task):
"""Main computer use loop"""
messages = [{"role": "user", "content": task}]
while True:
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=4096,
tools=[{
"type": "computer_20250124",
"name": "computer",
"display_width_px": 1920,
"display_height_px": 1080,
"display_number": 1,
}],
messages=messages,
)
tool_blocks = [b for b in response.content if b.type == "tool_use"]
if not tool_blocks:
text_blocks = [b for b in response.content if b.type == "text"]
return text_blocks[0].text if text_blocks else "Task complete"
results = []
for tb in tool_blocks:
result = execute_action(tb.input)
results.append({
"type": "tool_result",
"tool_use_id": tb.id,
"content": result,
})
messages.append({"role": "assistant", "content": response.content})
messages.append({"role": "user", "content": results})
# Run the example
result = computer_use_loop("Open Firefox and search Google for 'Anthropic Claude'")
print(result)

Pricing
The Computer Use tool consumes more tokens than regular requests because of:
- Images: Each screenshot counts as image tokens (based on resolution)
- The loop: Multiple cycles may be needed to complete a single task
- Estimate: A simple task may cost 10-50 cents, a complex task may cost more
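You can get a rough order-of-magnitude estimate from screenshot size alone. The tokens ≈ width × height / 750 approximation comes from Anthropic's image documentation; treat the result as a ballpark, not a quote, and check current pricing:

```python
def estimate_screenshot_tokens(width, height):
    """Approximate input tokens for one screenshot (tokens ~= w * h / 750)."""
    return (width * height) // 750

# One 1280x720 screenshot is roughly this many input tokens:
tokens_per_shot = estimate_screenshot_tokens(1280, 720)
print(tokens_per_shot)       # 1228

# A task that loops 10 times sends roughly 10 screenshots (plus text):
print(tokens_per_shot * 10)  # 12280
```

This is one reason a lower resolution like 1280x720 can be cheaper than 1920x1080 for the same task: each iteration of the loop carries a smaller image.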
Summary
| Aspect | Details |
|---|---|
| What it is | A tool that lets Claude interact with a computer visually |
| How it works | Loop: screenshot -> analyze -> action -> repeat |
| Status | Beta — experimental |
| Security | Always use a Sandbox |
| Best for | Automating interfaces without APIs, UI testing |
| Alternative | Regular API tools when an API is available |
Next: We will learn about the Code Execution tool — which lets Claude write and run code in a safe environment.