Advanced · 15 min read · Module 6, Lesson 8

🖱️Computer Use Tool

Let Claude control a computer — screenshots, mouse, and keyboard

Imagine giving Claude the ability to see your screen and interact with it, much as a person sitting at the computer would. That is what the Computer Use tool does: Claude looks at screenshots and replies with mouse and keyboard actions, and your code carries those actions out.

What is Computer Use?

The Computer Use tool gives Claude the ability to:

  • Take screenshots to see what is on the screen
  • Move the mouse cursor and click on elements
  • Type on the keyboard and enter text
  • Scroll up and down on pages
  • Execute keyboard shortcuts like Ctrl+C and Ctrl+V

In other words, Claude can operate most applications and websites through the same screen, mouse, and keyboard that you use, provided your code executes the actions it requests.

How Does It Work? The Screenshot-Analyze-Action Loop

The Computer Use tool works in a continuous loop:

  1. Take a screenshot — Claude captures an image of the current screen
  2. Analyze the image — Claude understands what it sees (buttons, input fields, text, etc.)
  3. Take action — Claude decides what to do (click, type, scroll)
  4. Take a new screenshot — to verify the result
  5. Repeat — until the task is complete

Computer Use Tool Loop

    📸 Take Screenshot
            ↓
    🧠 Analyze: What do I see?
            ↓
    🎯 Decide: What do I do next?
            ↓
    🖱️ Execute: Click / Type / Scroll
            ↓
    📸 Take New Screenshot to Verify
            ↓
    ✅ Is the task complete?
       Yes → Done | No → Repeat
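
The loop above can be sketched as plain control flow. This is a minimal sketch, not the API client: `capture`, `decide`, and `execute` are hypothetical callables standing in for the screenshot helper, the model call, and the action executor, and `max_steps` is an added safety cap that the loop described above does not include.

```python
from typing import Callable, Optional


def run_loop(
    capture: Callable[[], str],
    decide: Callable[[str], Optional[dict]],
    execute: Callable[[dict], None],
    max_steps: int = 25,
) -> int:
    """Run screenshot -> analyze -> act until `decide` returns None.

    Returns the number of actions executed. All three callables are
    hypothetical stand-ins used only for illustration.
    """
    for step in range(max_steps):
        screen = capture()        # take a screenshot
        action = decide(screen)   # analyze it and choose the next action
        if action is None:        # nothing left to do: the task is complete
            return step
        execute(action)           # act; the next pass captures the result
    raise RuntimeError(f"gave up after {max_steps} steps")
```

A cap like `max_steps` keeps a confused run from looping (and billing) indefinitely.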

Tool Definition in the API

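To enable the Computer Use tool, define it in your tools list. Because the tool is in beta, the request must also carry the matching beta flag (computer-use-2025-01-24 for the computer_20250124 tool version), for example by calling the SDK's beta messages endpoint with a betas list.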

JSON
{
  "tools": [
    {
      "type": "computer_20250124",
      "name": "computer",
      "display_width_px": 1920,
      "display_height_px": 1080,
      "display_number": 1
    }
  ]
}

Important note: You must specify the screen resolution (width and height) so Claude can understand coordinates correctly.
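
The coordinates Claude returns refer to the image it was shown. If you downscale screenshots before sending them (and declare that smaller size in the tool definition), each coordinate has to be mapped back to the real screen before clicking. A minimal sketch of that mapping follows; `to_screen_coords` is a hypothetical helper name, not part of any SDK.

```python
def to_screen_coords(x, y, shot_size, screen_size):
    """Map a point in the (possibly downscaled) screenshot to real screen pixels.

    shot_size and screen_size are (width, height) tuples. The result is
    clamped so that it always lands inside the screen.
    """
    shot_w, shot_h = shot_size
    screen_w, screen_h = screen_size
    sx = round(x * screen_w / shot_w)
    sy = round(y * screen_h / shot_h)
    return (min(max(sx, 0), screen_w - 1), min(max(sy, 0), screen_h - 1))


# A click at the center of a 1280x720 screenshot of a 1920x1080 screen
print(to_screen_coords(640, 360, (1280, 720), (1920, 1080)))  # (960, 540)
```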

Full Example: Sending a Request with the Computer Tool

JavaScript
import Anthropic from "@anthropic-ai/sdk";

// Reads ANTHROPIC_API_KEY from the environment
const client = new Anthropic();

// The declared display size must match the screenshots you send back
const COMPUTER_TOOL = {
  type: "computer_20250124",
  name: "computer",
  display_width_px: 1920,
  display_height_px: 1080,
  display_number: 1,
};

async function computerUseLoop(task) {
  const messages = [{ role: "user", content: task }];

  while (true) {
    // Computer use is a beta tool: call the beta endpoint with its beta flag
    const response = await client.beta.messages.create({
      model: "claude-sonnet-4-20250514",
      max_tokens: 4096,
      tools: [COMPUTER_TOOL],
      messages,
      betas: ["computer-use-2025-01-24"],
    });

    // Check if Claude wants to use a tool
    const toolUseBlocks = response.content.filter(
      (block) => block.type === "tool_use"
    );

    if (toolUseBlocks.length === 0) {
      // Claude is done: no more tools needed
      const textBlock = response.content.find((block) => block.type === "text");
      return textBlock?.text || "Task complete";
    }

    // Process each tool request
    const toolResults = [];
    for (const toolUse of toolUseBlocks) {
      const result = await executeComputerAction(toolUse.input);
      toolResults.push({
        type: "tool_result",
        tool_use_id: toolUse.id,
        content: result,
      });
    }

    // Add Claude's response and the tool results to the conversation
    messages.push({ role: "assistant", content: response.content });
    messages.push({ role: "user", content: toolResults });
  }
}

Implementing Computer Actions

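When Claude requests an action, the input of the tool_use block looks like one of the following. The action names are those of the computer_20250124 tool version: the primary click is called left_click, and scroll takes a direction and an amount rather than pixel deltas.

JSON
{ "action": "left_click", "coordinate": [945, 520] }
JSON
{ "action": "type", "text": "Hello world" }
JSON
{ "action": "key", "text": "ctrl+a" }
JSON
{ "action": "screenshot" }
JSON
{ "action": "scroll", "coordinate": [960, 540], "scroll_direction": "down", "scroll_amount": 3 }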

Your execution function should handle each action type:

JavaScript
// takeScreenshot, moveMouse, mouseClick, typeText, pressKey, and scrollAt are
// platform-specific helpers that you implement yourself (for example with xdotool).

// Capture the screen and wrap it as an image block for a tool_result
async function screenshotResult() {
  const data = await takeScreenshot(); // base64-encoded PNG
  return [
    { type: "image", source: { type: "base64", media_type: "image/png", data } },
  ];
}

async function executeComputerAction(input) {
  const { action } = input;

  switch (action) {
    case "screenshot":
      // Nothing to do first; the screenshot is taken below
      break;

    case "left_click": {
      // computer_20250124 names the primary-button click "left_click"
      const [x, y] = input.coordinate;
      await moveMouse(x, y);
      await mouseClick("left");
      break;
    }

    case "type":
      // Type the specified text
      await typeText(input.text);
      break;

    case "key":
      // Press a key or keyboard shortcut such as "ctrl+a"
      await pressKey(input.text);
      break;

    case "scroll": {
      // scroll_direction is "up", "down", "left", or "right";
      // scroll_amount is the number of scroll steps
      const [x, y] = input.coordinate;
      await scrollAt(x, y, input.scroll_direction, input.scroll_amount);
      break;
    }

    default:
      return [{ type: "text", text: "Unknown action: " + action }];
  }

  // Every handled action returns a fresh screenshot so Claude can verify the result
  return screenshotResult();
}
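
An action can fail at runtime (a helper binary is missing, a coordinate is off-screen). Rather than crashing the loop, the failure can be reported back so Claude can try something else; tool_result blocks accept an `is_error` field for this. The sketch below is in Python and assumes a hypothetical `execute` callable with the same contract as the handler above; `safe_tool_result` is an illustrative name.

```python
from typing import Any, Callable


def safe_tool_result(
    tool_use_id: str,
    action_input: dict,
    execute: Callable[[dict], list],
) -> dict[str, Any]:
    """Run one action and always return a well-formed tool_result block."""
    try:
        content = execute(action_input)
        return {"type": "tool_result", "tool_use_id": tool_use_id, "content": content}
    except Exception as exc:
        # Report the failure to the model instead of aborting the whole task
        return {
            "type": "tool_result",
            "tool_use_id": tool_use_id,
            "content": f"{type(exc).__name__}: {exc}",
            "is_error": True,
        }
```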

Sandbox Setup (Mandatory!)

Security warning: Never run the Computer Use tool directly on your personal machine. Always use an isolated environment (Sandbox).

Why is a Sandbox Necessary?

  • Claude can click on anything on the screen — including things you do not want it to click
  • It might open sensitive files or accidentally delete data
  • It could interact with other applications open on your machine
  • In a sandbox, the worst that happens is you restart the sandbox
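
One way to enforce this rule in code is to refuse to start outside a sandbox. The following is a rough sketch under two assumptions: the environment variable `COMPUTER_USE_SANDBOX` is a hypothetical flag you would set yourself in the sandbox image, and the `/.dockerenv` check is only a heuristic for "running inside a Docker container", not a guarantee.

```python
import os
from pathlib import Path


def running_in_sandbox(env=None, marker=Path("/.dockerenv")) -> bool:
    """Best-effort check: explicit opt-in flag, or Docker's marker file."""
    env = os.environ if env is None else env
    return env.get("COMPUTER_USE_SANDBOX") == "1" or marker.exists()


def require_sandbox(env=None, marker=Path("/.dockerenv")) -> None:
    """Raise before any mouse or keyboard action if this is not a sandbox."""
    if not running_in_sandbox(env, marker):
        raise RuntimeError("refusing to control this machine: not in a sandbox")
```

Calling `require_sandbox()` once at startup, before the loop begins, is enough.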

Common Sandbox Options

Sandbox Environment Options

  🐳 Docker Container (Most Common)
    - Quick to set up
    - Fully isolated from your system
    - Easy to stop and restart

  🖥️ Virtual Machine (VM)
    - Complete isolation
    - Can run any operating system
    - Slower to set up

  ☁️ Cloud VM
    - Does not affect your machine at all
    - Easy to scale
    - Requires internet connection

Setting Up a Docker Sandbox

DOCKERFILE
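FROM ubuntu:22.04

# Avoid interactive prompts (time zone, keyboard layout) during the build
ENV DEBIAN_FRONTEND=noninteractive
# tightvncserver refuses to start unless USER is set
ENV USER=root

# Install desktop environment
# Note: on Ubuntu 22.04 the firefox package is a wrapper around the snap,
# so the browser may need a different install source inside a container.
RUN apt-get update && apt-get install -y \
    xfce4 \
    xfce4-goodies \
    tightvncserver \
    firefox \
    python3 \
    python3-pip \
    xdotool \
    scrot \
    && rm -rf /var/lib/apt/lists/*

# Set up VNC (change "password" before exposing the port anywhere)
RUN mkdir -p /root/.vnc && \
    echo "password" | vncpasswd -f > /root/.vnc/passwd && \
    chmod 600 /root/.vnc/passwd

# Start the desktop. vncserver forks into the background, so follow its log
# to keep the container's main process alive.
CMD ["sh", "-c", "vncserver :1 -geometry 1920x1080 -depth 24 && tail -F /root/.vnc/*.log"]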
Terminal
# Build and run the container
docker build -t claude-sandbox .
docker run -d -p 5901:5901 --name claude-computer claude-sandbox

# Now you can connect to the screen via VNC on port 5901
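
The action code has to run against the container's display, not the host's. One option, sketched here under the assumption that the agent runs on the host and shells into the container, is to wrap every command in `docker exec` with `DISPLAY` set to the VNC display. The helper names are hypothetical; the container name `claude-computer` and display `:1` come from the commands above.

```python
import subprocess


def in_container(argv, container="claude-computer", display=":1"):
    """Build a `docker exec` command that runs argv on the sandbox display."""
    return ["docker", "exec", "-e", f"DISPLAY={display}", container, *argv]


def click_in_container(x: int, y: int) -> None:
    """Move the pointer and left-click inside the sandbox (needs Docker running)."""
    cmd = in_container(["xdotool", "mousemove", str(x), str(y), "click", "1"])
    subprocess.run(cmd, check=True)


print(in_container(["xdotool", "getmouselocation"]))
```

The alternative is to run the whole agent script inside the container, in which case only `DISPLAY=:1` needs to be set in its environment.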

Practical Examples

Example 1: Filling a Web Form

JavaScript
const result = await computerUseLoop(
  "Open Firefox and go to https://example.com/form " +
    "then fill in the form with the following data: " +
    "Name: John Smith, Email: john@example.com, " +
    "Age: 30, then click the Submit button"
);
console.log(result);

Claude will automatically:

  1. Open the browser and navigate to the URL
  2. Take a screenshot and identify form fields
  3. Click on each field and type the data
  4. Click the submit button
  5. Verify the operation succeeded

Example 2: Browsing a Website and Gathering Information

JavaScript
const result = await computerUseLoop(
  "Open Firefox and go to https://news.ycombinator.com " +
    "and collect the titles of the first 5 articles on the homepage"
);
console.log(result);

Example 3: Testing a User Interface

JavaScript
const result = await computerUseLoop(
  "Open Firefox and go to http://localhost:3000 " +
    "and test the following points: " +
    "1. Does the login button work? " +
    "2. Does an error appear when entering incorrect data? " +
    "3. Does the design look correct on the screen? " +
    "Record your observations for each point"
);
console.log(result);

Available Actions — Quick Reference

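Action names below follow the computer_20250124 tool version.

| Action | Description | Parameters |
| --- | --- | --- |
| screenshot | Take a screenshot | None |
| left_click | Click on a point (left button) | coordinate |
| right_click | Click on a point (right button) | coordinate |
| double_click | Double click | coordinate |
| type | Type text | text |
| key | Press a key/shortcut | text (e.g. "ctrl+c") |
| scroll | Scroll | coordinate, scroll_direction, scroll_amount |
| left_click_drag | Drag | start_coordinate, coordinate |
| wait | Wait | duration (in seconds) |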
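
Before executing a requested action, it can help to sanity-check it. The sketch below is illustrative only: `KNOWN_ACTIONS` is assumed to mirror a subset of the computer_20250124 action names and should be checked against the current API reference, and `check_action` is a hypothetical helper.

```python
# Assumed subset of the computer_20250124 action names (verify against the docs)
KNOWN_ACTIONS = {
    "screenshot", "left_click", "right_click", "double_click",
    "type", "key", "scroll", "left_click_drag", "wait",
}


def check_action(action_input: dict, width: int, height: int) -> list[str]:
    """Return a list of problems; an empty list means the action looks executable."""
    problems = []
    action = action_input.get("action")
    if action not in KNOWN_ACTIONS:
        problems.append(f"unknown action: {action!r}")
    # Any coordinate that is present must fall inside the declared display
    for key in ("coordinate", "start_coordinate"):
        if key in action_input:
            x, y = action_input[key]
            if not (0 <= x < width and 0 <= y < height):
                problems.append(f"{key} {x},{y} is outside {width}x{height}")
    return problems
```

A non-empty result can be sent back to Claude as an error instead of being executed.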

Limitations and Important Notes

This Tool is in Beta

  • The tool is in beta: requests must include the beta flag, and tool versions and behavior may change between releases
  • Performance can vary — sometimes Claude clicks in the wrong place
  • Very small text can be difficult for Claude to read
  • Animated or rapidly changing applications may cause confusion
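
For screens that are still changing (a page loading, an animation finishing), one mitigation is to wait until two consecutive captures are identical before sending a screenshot. This is a sketch under the assumption that `capture` is a hypothetical function returning image bytes that are byte-identical for an unchanged screen; a blinking cursor or an on-screen clock will defeat it.

```python
import time
from typing import Callable


def capture_when_stable(
    capture: Callable[[], bytes],
    interval: float = 0.2,
    max_tries: int = 10,
) -> bytes:
    """Return a capture once two consecutive captures match.

    Falls back to the most recent capture if the screen never settles.
    """
    previous = capture()
    for _ in range(max_tries):
        time.sleep(interval)
        current = capture()
        if current == previous:
            return current
        previous = current
    return previous
```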

Tips for Best Results

  1. Use an appropriate screen resolution: 1280x720 works well; at 1920x1080 the API may downscale screenshots, which can make small targets harder to click accurately
  2. Make instructions clear and specific — "Click the blue button in the top right" is better than "click there"
  3. Break down large tasks — instead of one complex task, split it into steps
  4. Monitor execution — especially in early development stages
  5. Handle errors — add retry logic for failures
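
The retry logic from tip 5 can be as small as the sketch below. `with_retries` is a hypothetical helper; the delays double after each failure, and the `sleep` parameter exists only so the function can be tested without waiting. Retrying is safest for idempotent steps such as taking a screenshot; for clicks, check a fresh screenshot first so a form is not submitted twice.

```python
import time


def with_retries(fn, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call fn() up to `attempts` times, backing off exponentially between tries."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * 2 ** attempt)
```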

When to Use Computer Use vs Other Tools?

When to Use the Computer Use Tool?

  ✅ Use it when:
    • There is no API for the app you want to interact with
    • You need to visually test a user interface
    • You are dealing with legacy apps without an API
    • You want to automate processes that require human interaction

  ❌ Do NOT use it when:
    • An API is available: use regular tool_use
    • You want to process data: use the code execution tool
    • You want web search: use the search tool
    • The task is purely text-based: no screen needed

Full Python Example

Python
import base64
import subprocess

import anthropic

# Run this inside the sandbox, with DISPLAY pointing at the virtual display
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

COMPUTER_TOOL = {
    "type": "computer_20250124",
    "name": "computer",
    "display_width_px": 1920,
    "display_height_px": 1080,
    "display_number": 1,
}

# xdotool mouse buttons that correspond to scroll-wheel movement
SCROLL_BUTTONS = {"up": "4", "down": "5", "left": "6", "right": "7"}


def take_screenshot():
    """Take a screenshot using scrot and return it base64-encoded"""
    # --overwrite: recent scrot versions do not replace an existing file by default
    subprocess.run(["scrot", "--overwrite", "/tmp/screen.png"], check=True)
    with open("/tmp/screen.png", "rb") as f:
        return base64.b64encode(f.read()).decode()


def screenshot_result():
    """Wrap a fresh screenshot as tool_result content"""
    return [{"type": "image", "source": {
        "type": "base64",
        "media_type": "image/png",
        "data": take_screenshot(),
    }}]


def execute_action(action_input):
    """Execute an action on the computer, then return a new screenshot"""
    action = action_input["action"]

    if action == "screenshot":
        pass  # the screenshot is taken below
    elif action == "left_click":
        x, y = action_input["coordinate"]
        subprocess.run(["xdotool", "mousemove", str(x), str(y)], check=True)
        subprocess.run(["xdotool", "click", "1"], check=True)
    elif action == "type":
        subprocess.run(["xdotool", "type", "--", action_input["text"]], check=True)
    elif action == "key":
        subprocess.run(["xdotool", "key", action_input["text"]], check=True)
    elif action == "scroll":
        x, y = action_input["coordinate"]
        subprocess.run(["xdotool", "mousemove", str(x), str(y)], check=True)
        button = SCROLL_BUTTONS[action_input["scroll_direction"]]
        for _ in range(max(1, action_input["scroll_amount"])):
            subprocess.run(["xdotool", "click", button], check=True)
    else:
        return [{"type": "text", "text": f"Unknown action: {action}"}]

    return screenshot_result()


def computer_use_loop(task):
    """Main computer use loop"""
    messages = [{"role": "user", "content": task}]

    while True:
        # Computer use is a beta tool: call the beta endpoint with its beta flag
        response = client.beta.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=4096,
            tools=[COMPUTER_TOOL],
            messages=messages,
            betas=["computer-use-2025-01-24"],
        )

        tool_blocks = [b for b in response.content if b.type == "tool_use"]
        if not tool_blocks:
            # No tool requests left: return Claude's final text
            text_blocks = [b for b in response.content if b.type == "text"]
            return text_blocks[0].text if text_blocks else "Task complete"

        results = []
        for tb in tool_blocks:
            results.append({
                "type": "tool_result",
                "tool_use_id": tb.id,
                "content": execute_action(tb.input),
            })

        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": results})


# Run the example
result = computer_use_loop("Open Firefox and search Google for 'Anthropic Claude'")
print(result)

Pricing

The Computer Use tool consumes more tokens than regular requests because of:

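  • Images: Each screenshot counts as image tokens (based on resolution)
  • The loop: A single task usually takes many request cycles, and each request resends the conversation so far, including earlier screenshots
  • Estimate: As a rough guide, a simple task may cost 10-50 cents and a complex task more; the real figure depends on the model, the screen resolution, and the number of steps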
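
To get a feel for the numbers, the sketch below uses the approximation from Anthropic's vision documentation, tokens ≈ width × height / 750, after images are scaled down to a long edge of about 1568 px and roughly 1,600 tokens at most. Treat the constants as approximations and both function names as hypothetical. The second function assumes one screenshot per step, that every earlier screenshot stays in the conversation, and that prompt caching is not used.

```python
import math


def screenshot_tokens(width: int, height: int) -> int:
    """Approximate input tokens for one screenshot of the given size."""
    scale = min(1.0, 1568 / max(width, height))  # long edge capped at ~1568 px
    tokens = (width * scale) * (height * scale) / 750
    return min(math.ceil(tokens), 1600)          # overall cap of ~1,600 tokens


def loop_image_tokens(width: int, height: int, steps: int) -> int:
    """Image tokens billed across `steps` follow-up requests.

    Request k carries k screenshots, so the total is 1 + 2 + ... + steps
    screenshots' worth of tokens.
    """
    return screenshot_tokens(width, height) * steps * (steps + 1) // 2


print(screenshot_tokens(1024, 768))      # 1049
print(loop_image_tokens(1024, 768, 10))  # 57695
```

The quadratic growth in the second function is why long tasks cost disproportionately more than short ones, and why dropping old screenshots from the history can help.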

Summary

| Aspect | Details |
| --- | --- |
| What it is | A tool that lets Claude interact with a computer visually |
| How it works | Loop: screenshot -> analyze -> action -> repeat |
| Status | Beta, experimental |
| Security | Always use a Sandbox |
| Best for | Automating interfaces without APIs, UI testing |
| Alternative | Regular API tools when an API is available |

Next: We will learn about the Code Execution tool — which lets Claude write and run code in a safe environment.