Advanced · 15 min read · Module 6, Lesson 8

🖱️ Computer Use Tool

Let Claude control a computer — screenshots, mouse, and keyboard

Imagine giving Claude the ability to see your screen and interact with it — just like a human sitting in front of a computer. That is exactly what the Computer Use tool does. It is one of Claude's most advanced and exciting features.

What is Computer Use?

The Computer Use tool gives Claude the ability to:

Take screenshots to see what is on the screen
Move the mouse cursor and click on elements
Type on the keyboard and enter text
Scroll up and down on pages
Execute keyboard shortcuts like Ctrl+C and Ctrl+V

In other words, Claude can operate most graphical applications and websites much as you would, within the limits described later in this lesson.

How Does It Work? The Screenshot-Analyze-Action Loop

The Computer Use tool works in a continuous loop:

Take a screenshot: Claude requests a screenshot, and your application captures the screen and returns the image
Analyze the image: Claude interprets what it sees (buttons, input fields, text, etc.)
Take action: Claude decides what to do (click, type, scroll), and your application executes it
Take a new screenshot: your application returns a fresh image so Claude can verify the result
Repeat: the cycle continues until the task is complete

┌─────────────────────────────────────────────┐
│       Computer Use Tool Loop                │
│                                             │
│   📸 Take Screenshot                        │
│         ↓                                   │
│   🧠 Analyze: What do I see?               │
│         ↓                                   │
│   🎯 Decide: What do I do next?            │
│         ↓                                   │
│   🖱️ Execute: Click / Type / Scroll         │
│         ↓                                   │
│   📸 Take New Screenshot to Verify          │
│         ↓                                   │
│   ✅ Is the task complete?                  │
│      Yes → Done  |  No → Repeat            │
└─────────────────────────────────────────────┘

Tool Definition in the API

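To enable the Computer Use tool, send the matching beta flag with each request (computer-use-2025-01-24 for the tool version used here) and define the tool in your tools list: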

JSON
{
  "tools": [
    {
      "type": "computer_20250124",
      "name": "computer",
      "display_width_px": 1920,
      "display_height_px": 1080,
      "display_number": 1
    }
  ]
}

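Important note: You must specify the screen resolution (width and height), and it should match the size of the screenshots you return, so that the coordinates Claude chooses line up with the real screen. The display_number field is optional and identifies the X11 display.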
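If the real display is larger than the size declared in the tool definition, one common approach is to downscale screenshots before sending them and scale the returned coordinates back up before acting on them. The sketch below shows only the coordinate arithmetic. The function names and the 1280x720 declared size are illustrative assumptions, not part of the API.

```python
from typing import Sequence, Tuple

Size = Tuple[int, int]


def scale_to_screen(coord: Sequence[int], declared: Size, actual: Size) -> Tuple[int, int]:
    """Map a coordinate from the declared (model-facing) size to the real screen."""
    x, y = coord
    declared_w, declared_h = declared
    actual_w, actual_h = actual
    if not (0 <= x < declared_w and 0 <= y < declared_h):
        raise ValueError(f"{tuple(coord)} is outside the declared {declared_w}x{declared_h} display")
    return round(x * actual_w / declared_w), round(y * actual_h / declared_h)


def scale_to_model(coord: Sequence[int], declared: Size, actual: Size) -> Tuple[int, int]:
    """Map a real-screen coordinate back into the declared coordinate space."""
    x, y = coord
    declared_w, declared_h = declared
    actual_w, actual_h = actual
    return round(x * declared_w / actual_w), round(y * declared_h / actual_h)


# Hypothetical setup: the tool is declared as 1280x720, the sandbox runs at 1920x1080
print(scale_to_screen([640, 360], (1280, 720), (1920, 1080)))
```

Keeping the declared size and the screenshot size identical means this mapping is the only place where the two coordinate spaces meet.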

Full Example: Sending a Request with the Computer Tool

JavaScript

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

async function computerUseLoop(task) {
  let messages = [
    {
      role: "user",
      content: task,
    },
  ];

  while (true) {
    const response = await client.beta.messages.create({
      model: "claude-sonnet-4-20250514",
      max_tokens: 4096,
      // Computer use is a beta feature and needs the matching beta flag
      betas: ["computer-use-2025-01-24"],
      tools: [
        {
          type: "computer_20250124",
          name: "computer",
          display_width_px: 1920,
          display_height_px: 1080,
          display_number: 1,
        },
      ],
      messages,
    });

    // Check if Claude wants to use a tool
    const toolUseBlocks = response.content.filter(
      (block) => block.type === "tool_use"
    );

    if (toolUseBlocks.length === 0) {
      // Claude is done — no more tools needed
      const textBlock = response.content.find(
        (block) => block.type === "text"
      );
      return textBlock?.text || "Task complete";
    }

    // Process each tool request
    const toolResults = [];
    for (const toolUse of toolUseBlocks) {
      const result = await executeComputerAction(toolUse.input);
      toolResults.push({
        type: "tool_result",
        tool_use_id: toolUse.id,
        content: result,
      });
    }

    // Add Claude's response and tool results to the conversation
    messages.push({ role: "assistant", content: response.content });
    messages.push({ role: "user", content: toolResults });
  }
}
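The loop above runs until Claude stops requesting tools, so a confused run could keep spending tokens indefinitely. A minimal Python sketch of a step budget follows. Here `call_model` and `execute_action` are hypothetical callables standing in for the API request and the action executor, and the plain-dict content blocks are a simplification of the SDK's response objects.

```python
from typing import Any, Callable, Dict, List

Block = Dict[str, Any]


def run_with_step_budget(
    task: str,
    call_model: Callable[[List[Dict[str, Any]]], List[Block]],
    execute_action: Callable[[Dict[str, Any]], List[Block]],
    max_steps: int = 25,
) -> str:
    """Run the screenshot/action loop, giving up after max_steps model calls."""
    messages: List[Dict[str, Any]] = [{"role": "user", "content": task}]

    for _ in range(max_steps):
        content = call_model(messages)  # content blocks of the assistant turn
        tool_uses = [b for b in content if b["type"] == "tool_use"]

        if not tool_uses:
            # No tool requested: the model considers the task finished
            texts = [b["text"] for b in content if b["type"] == "text"]
            return texts[0] if texts else "Task complete"

        results = [
            {
                "type": "tool_result",
                "tool_use_id": b["id"],
                "content": execute_action(b["input"]),
            }
            for b in tool_uses
        ]
        messages.append({"role": "assistant", "content": content})
        messages.append({"role": "user", "content": results})

    raise RuntimeError(f"Stopped after {max_steps} steps without finishing the task")
```

Raising an exception instead of returning a string makes an exhausted budget hard to mistake for a successful run.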

Implementing Computer Actions

When Claude requests an action, the input field of the tool_use block contains data like this:

JSON
{
  "action": "left_click",
  "coordinate": [945, 520]
}

JSON
{
  "action": "type",
  "text": "Hello world"
}

JSON
{
  "action": "key",
  "text": "ctrl+a"
}

JSON
{
  "action": "screenshot"
}

JSON
{
  "action": "scroll",
  "coordinate": [960, 540],
  "scroll_direction": "down",
  "scroll_amount": 3
}

Your execution function should handle each action type:

JavaScript
// takeScreenshot, moveMouse, mouseClick, typeText, pressKey, and scrollAt are
// placeholders for your own sandbox bindings (for example, wrappers around xdotool).

// Capture the screen and wrap it as an image block for the tool_result
async function screenshotResult() {
  const screenshot = await takeScreenshot(); // base64-encoded PNG
  return [
    {
      type: "image",
      source: {
        type: "base64",
        media_type: "image/png",
        data: screenshot,
      },
    },
  ];
}

async function executeComputerAction(input) {
  const { action } = input;

  switch (action) {
    case "screenshot":
      return screenshotResult();

    case "left_click":
    case "right_click":
    case "middle_click": {
      // Move to the requested coordinates, then click with the matching button
      const [x, y] = input.coordinate;
      await moveMouse(x, y);
      await mouseClick(action.replace("_click", ""));
      return screenshotResult();
    }

    case "type":
      // Type the specified text
      await typeText(input.text);
      return screenshotResult();

    case "key":
      // Press a key or keyboard shortcut, e.g. "ctrl+a"
      await pressKey(input.text);
      return screenshotResult();

    case "scroll": {
      // scroll_direction is "up", "down", "left", or "right";
      // scroll_amount is the number of scroll steps
      const [x, y] = input.coordinate;
      await scrollAt(x, y, input.scroll_direction, input.scroll_amount);
      return screenshotResult();
    }

    default:
      return [{ type: "text", text: "Unknown action: " + action }];
  }
}

Sandbox Setup (Mandatory!)

Security warning: Never run the Computer Use tool directly on your personal machine. Always use an isolated environment (Sandbox).

Why is a Sandbox Necessary?

Claude can click on anything on the screen — including things you do not want it to click
It might open sensitive files or accidentally delete data
It could interact with other applications open on your machine
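In a sandbox, mistakes stay contained and recovery is usually as simple as restarting it. Even so, limit the sandbox's network access and keep real credentials out of it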

Common Sandbox Options

┌──────────────────────────────────────────────────────┐
│  Sandbox Environment Options                         │
│                                                      │
│  🐳 Docker Container (Most Common)                  │
│     - Quick to set up                                │
│     - Isolated, but shares the host kernel           │
│     - Easy to stop and restart                       │
│                                                      │
│  🖥️ Virtual Machine (VM)                            │
│     - Complete isolation                             │
│     - Can run any operating system                   │
│     - Slower to set up                               │
│                                                      │
│  ☁️ Cloud VM                                        │
│     - Does not affect your machine at all            │
│     - Easy to scale                                  │
│     - Requires internet connection                   │
└──────────────────────────────────────────────────────┘

Setting Up a Docker Sandbox

DOCKERFILE
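FROM ubuntu:22.04

# Avoid interactive prompts (for example tzdata) during the build;
# tightvncserver also expects USER to be set
ENV DEBIAN_FRONTEND=noninteractive \
    USER=root

# Install a desktop environment, VNC server, browser, and automation tools.
# Note: on Ubuntu 22.04 the "firefox" apt package is a wrapper around the snap
# and may not work inside a container; swap in another browser package if it fails.
RUN apt-get update && apt-get install -y \
    xfce4 \
    xfce4-goodies \
    tightvncserver \
    firefox \
    python3 \
    python3-pip \
    xdotool \
    scrot \
    && rm -rf /var/lib/apt/lists/*

# Set up VNC (replace the placeholder password before sharing this image)
RUN mkdir -p /root/.vnc && \
    echo "password" | vncpasswd -f > /root/.vnc/passwd && \
    chmod 600 /root/.vnc/passwd

# Start the desktop. vncserver forks into the background, so follow its log
# to keep the container's main process alive.
CMD ["sh", "-c", "vncserver :1 -geometry 1920x1080 -depth 24 && tail -F /root/.vnc/*.log"]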

Terminal
# Build and run the container (the VNC port is published on localhost only)
docker build -t claude-sandbox .
docker run -d -p 127.0.0.1:5901:5901 --name claude-computer claude-sandbox

# Now you can connect to the screen via VNC on port 5901
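The VNC server takes a moment to start after `docker run` returns, so automation that connects immediately can fail. A small readiness check, sketched here with the Python standard library, polls the port until it accepts connections. The helper name `wait_for_port` and the timeout values are illustrative assumptions.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 30.0, interval_s: float = 0.5) -> bool:
    """Return True once host:port accepts TCP connections, False after timeout_s."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect means something is listening on the port
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)


if __name__ == "__main__":
    # 5901 is the VNC port published by the docker run command above
    ready = wait_for_port("127.0.0.1", 5901, timeout_s=5.0)
    print("VNC reachable" if ready else "VNC not reachable yet")
```

An open port only shows that the server is listening; the desktop session itself may need a few more seconds before the first screenshot looks complete.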

Practical Examples

Example 1: Filling a Web Form

JavaScript
const result = await computerUseLoop(
  "Open Firefox and go to https://example.com/form " +
  "then fill in the form with the following data: " +
  "Name: John Smith, Email: john@example.com, " +
  "Age: 30, then click the Submit button"
);
console.log(result);

Claude will automatically:

Open the browser and navigate to the URL
Take a screenshot and identify form fields
Click on each field and type the data
Click the submit button
Verify the operation succeeded

Example 2: Browsing a Website and Gathering Information

JavaScript
const result = await computerUseLoop(
  "Open Firefox and go to https://news.ycombinator.com " +
  "and collect the titles of the first 5 articles on the homepage"
);
console.log(result);

Example 3: Testing a User Interface

JavaScript
const result = await computerUseLoop(
  "Open Firefox and go to http://localhost:3000 " +
  "and test the following points: " +
  "1. Does the login button work? " +
  "2. Does an error appear when entering incorrect data? " +
  "3. Does the design look correct on the screen? " +
  "Record your observations for each point"
);
console.log(result);

Available Actions — Quick Reference

Action	Description	Parameters
`screenshot`	Take a screenshot	None
`left_click`	Left-click on a point	`coordinate`
`right_click`, `middle_click`	Click with another mouse button	`coordinate`
`double_click`	Double click	`coordinate`
`mouse_move`	Move the cursor	`coordinate`
`type`	Type text	`text`
`key`	Press a key/shortcut	`text` (e.g. "ctrl+c")
`scroll`	Scroll	`coordinate`, `scroll_direction`, `scroll_amount`
`left_click_drag`	Drag	`start_coordinate`, `coordinate`
`wait`	Wait	`duration` (in seconds)
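One way to implement these actions in a Linux sandbox is to translate each request into xdotool commands. The sketch below assumes the action names of the computer_20250124 tool version (`left_click`, `left_click_drag`, `scroll` with `scroll_direction` and `scroll_amount`). It only builds the argument lists, so it can be checked without a display; `xdotool_commands` is a hypothetical helper name.

```python
from typing import Any, Dict, List, Sequence

# X11 button numbers: 1 = left, 2 = middle, 3 = right
CLICK_BUTTONS = {"left_click": "1", "middle_click": "2", "right_click": "3"}
# X11 wheel buttons: 4 = up, 5 = down, 6 = left, 7 = right
SCROLL_BUTTONS = {"up": "4", "down": "5", "left": "6", "right": "7"}


def _move(coord: Sequence[int]) -> List[str]:
    x, y = coord
    return ["xdotool", "mousemove", str(x), str(y)]


def xdotool_commands(action_input: Dict[str, Any]) -> List[List[str]]:
    """Translate one computer-use action into a list of xdotool argument lists."""
    action = action_input["action"]

    if action in CLICK_BUTTONS:
        return [_move(action_input["coordinate"]), ["xdotool", "click", CLICK_BUTTONS[action]]]
    if action == "double_click":
        return [_move(action_input["coordinate"]), ["xdotool", "click", "--repeat", "2", "1"]]
    if action == "mouse_move":
        return [_move(action_input["coordinate"])]
    if action == "type":
        return [["xdotool", "type", "--", action_input["text"]]]
    if action == "key":
        return [["xdotool", "key", action_input["text"]]]
    if action == "scroll":
        button = SCROLL_BUTTONS[action_input["scroll_direction"]]
        steps = str(max(1, int(action_input["scroll_amount"])))
        return [_move(action_input["coordinate"]), ["xdotool", "click", "--repeat", steps, button]]
    if action == "left_click_drag":
        return [
            _move(action_input["start_coordinate"]),
            ["xdotool", "mousedown", "1"],
            _move(action_input["coordinate"]),
            ["xdotool", "mouseup", "1"],
        ]
    raise ValueError(f"Unsupported action: {action}")
```

Each returned list can be passed to `subprocess.run`. The `screenshot` and `wait` actions are left out because they do not map to xdotool.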

Limitations and Important Notes

This Tool is in Beta

The tool is still in Beta stage
Performance can vary — sometimes Claude clicks in the wrong place
Very small text can be difficult for Claude to read
Animated or rapidly changing applications may cause confusion

Tips for Best Results

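Use a moderate screen resolution: sizes around 1280x720 or 1024x768 tend to work well. Larger screenshots such as 1920x1080 are downscaled by the API before Claude sees them, which can make small text harder to read and clicks less precise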
Make instructions clear and specific — "Click the blue button in the top right" is better than "click there"
Break down large tasks — instead of one complex task, split it into steps
Monitor execution — especially in early development stages
Handle errors — add retry logic for failures
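For the error-handling tip, one option is to retry a failed action a few times and, if it still fails, report the failure to Claude as a `tool_result` with `is_error` set so it can choose a different approach. This is a minimal sketch: `safe_tool_result` is a hypothetical helper, and `execute` stands for whatever action executor you use. Retrying is only safe for actions that can be repeated without side effects, so typing actions may need different treatment.

```python
import time
from typing import Any, Callable, Dict, List


def safe_tool_result(
    tool_use_id: str,
    execute: Callable[[Dict[str, Any]], List[Dict[str, Any]]],
    action_input: Dict[str, Any],
    retries: int = 2,
    delay_s: float = 0.5,
) -> Dict[str, Any]:
    """Run an action with retries and always return a tool_result block."""
    last_error: Exception = RuntimeError("action was never attempted")

    for attempt in range(retries + 1):
        try:
            content = execute(action_input)
            return {"type": "tool_result", "tool_use_id": tool_use_id, "content": content}
        except Exception as exc:  # report any failure back to the model
            last_error = exc
            if attempt < retries:
                time.sleep(delay_s * (2 ** attempt))  # exponential backoff

    return {
        "type": "tool_result",
        "tool_use_id": tool_use_id,
        "is_error": True,
        "content": [
            {"type": "text", "text": f"Action failed after {retries + 1} attempts: {last_error}"}
        ],
    }
```

Returning an error result keeps the conversation valid, since every `tool_use` block still receives a matching `tool_result`.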

When to Use Computer Use vs Other Tools?

┌─────────────────────────────────────────────────────────┐
│  When to Use the Computer Use Tool?                     │
│                                                         │
│  ✅ Use it when:                                        │
│  • There is no API for the app you want to interact     │
│    with                                                 │
│  • You need to visually test a user interface            │
│  • You are dealing with legacy apps without an API       │
│  • You want to automate processes that require human     │
│    interaction                                           │
│                                                         │
│  ❌ Do NOT use it when:                                 │
│  • An API is available — use regular tool_use           │
│  • You want to process data — use the code execution   │
│    tool                                                 │
│  • You want web search — use the search tool            │
│  • The task is purely text-based — no screen needed     │
└─────────────────────────────────────────────────────────┘

Full Python Example

Python

import base64
import subprocess

import anthropic

client = anthropic.Anthropic()

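def take_screenshot():
    """Take a screenshot using scrot and return it as base64-encoded PNG data"""
    # --overwrite replaces the previous capture; without it, scrot writes to a
    # new numbered file and this function would keep returning a stale image
    subprocess.run(["scrot", "--overwrite", "/tmp/screen.png"], check=True)
    with open("/tmp/screen.png", "rb") as f:
        return base64.b64encode(f.read()).decode()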

def screenshot_result():
    """Capture the screen and wrap it as an image block for a tool_result"""
    return [{"type": "image", "source": {
        "type": "base64", "media_type": "image/png", "data": take_screenshot()
    }}]


# xdotool button numbers for click actions and scroll directions
CLICK_BUTTONS = {"left_click": "1", "middle_click": "2", "right_click": "3"}
SCROLL_BUTTONS = {"up": "4", "down": "5", "left": "6", "right": "7"}


def execute_action(action_input):
    """Execute an action on the computer and return a fresh screenshot"""
    action = action_input["action"]

    if action == "screenshot":
        pass  # nothing to do before capturing the screen

    elif action in CLICK_BUTTONS:
        x, y = action_input["coordinate"]
        subprocess.run(["xdotool", "mousemove", str(x), str(y)])
        subprocess.run(["xdotool", "click", CLICK_BUTTONS[action]])

    elif action == "type":
        subprocess.run(["xdotool", "type", "--", action_input["text"]])

    elif action == "key":
        subprocess.run(["xdotool", "key", action_input["text"]])

    elif action == "scroll":
        x, y = action_input["coordinate"]
        subprocess.run(["xdotool", "mousemove", str(x), str(y)])
        # One wheel-button click per requested scroll step
        button = SCROLL_BUTTONS[action_input["scroll_direction"]]
        for _ in range(max(1, action_input["scroll_amount"])):
            subprocess.run(["xdotool", "click", button])

    else:
        return [{"type": "text", "text": f"Unknown action: {action}"}]

    return screenshot_result()


def computer_use_loop(task):
    """Main computer use loop"""
    messages = [{"role": "user", "content": task}]

    while True:
        response = client.beta.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=4096,
            # Computer use is a beta feature and needs the matching beta flag
            betas=["computer-use-2025-01-24"],
            tools=[{
                "type": "computer_20250124",
                "name": "computer",
                "display_width_px": 1920,
                "display_height_px": 1080,
                "display_number": 1,
            }],
            messages=messages,
        )

        tool_blocks = [b for b in response.content if b.type == "tool_use"]

        if not tool_blocks:
            text_blocks = [b for b in response.content if b.type == "text"]
            return text_blocks[0].text if text_blocks else "Task complete"

        results = []
        for tb in tool_blocks:
            result = execute_action(tb.input)
            results.append({
                "type": "tool_result",
                "tool_use_id": tb.id,
                "content": result,
            })

        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": results})


# Run the example
result = computer_use_loop("Open Firefox and search Google for 'Anthropic Claude'")
print(result)
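The loop above resends every earlier screenshot on each request, so long tasks grow expensive and can eventually exceed the context window. A common mitigation is to keep only the most recent images and replace older ones with a short text placeholder. The sketch below assumes tool results are stored as plain dicts, as in the loop above; `prune_old_screenshots` and its defaults are illustrative choices.

```python
from typing import Any, Dict, List


def prune_old_screenshots(
    messages: List[Dict[str, Any]],
    keep_last: int = 3,
    placeholder: str = "[screenshot omitted]",
) -> List[Dict[str, Any]]:
    """Return a copy of messages keeping only the newest keep_last tool-result images."""
    remaining = keep_last
    pruned: List[Dict[str, Any]] = []

    # Walk from newest to oldest so the most recent images are the ones kept
    for message in reversed(messages):
        content = message["content"]
        if message["role"] != "user" or not isinstance(content, list):
            pruned.append(message)
            continue

        new_content = []
        for block in reversed(content):
            is_result = isinstance(block, dict) and block.get("type") == "tool_result"
            if is_result and isinstance(block.get("content"), list):
                inner = []
                for item in reversed(block["content"]):
                    if item.get("type") == "image" and remaining > 0:
                        remaining -= 1
                        inner.append(item)
                    elif item.get("type") == "image":
                        inner.append({"type": "text", "text": placeholder})
                    else:
                        inner.append(item)
                block = {**block, "content": list(reversed(inner))}
            new_content.append(block)
        pruned.append({**message, "content": list(reversed(new_content))})

    return list(reversed(pruned))
```

Calling `prune_old_screenshots(messages)` just before each request leaves the stored history untouched and trims only what is sent.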

Pricing

The Computer Use tool consumes more tokens than regular requests because of:

Images: Each screenshot counts as image tokens (based on resolution)
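The loop: Multiple cycles may be needed to complete a single task, and each cycle resends the earlier conversation, including previous screenshots
Estimate: Cost depends on the model, the screenshot resolution, and the number of steps. Treat a figure such as 10-50 cents for a simple task as a rough guide, and measure your own workloads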
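To get a feel for the image cost, the sketch below uses the rule of thumb from Anthropic's vision documentation that an image costs roughly width × height / 750 tokens, with images whose long edge exceeds about 1568 pixels downscaled first. Treat both numbers as approximations. The function names are hypothetical, and the cumulative figure assumes every earlier screenshot is resent on each step.

```python
def screenshot_tokens(width: int, height: int, max_edge: int = 1568) -> int:
    """Approximate input tokens for one screenshot of the given size."""
    # Images with a long edge above max_edge are scaled down before counting
    scale = min(1.0, max_edge / max(width, height))
    return round((width * scale) * (height * scale) / 750)


def cumulative_image_tokens(width: int, height: int, steps: int) -> int:
    """Approximate image tokens sent over a task of `steps` screenshots.

    Step k resends all k screenshots taken so far, so the total is
    per_image * (1 + 2 + ... + steps).
    """
    per_image = screenshot_tokens(width, height)
    return per_image * steps * (steps + 1) // 2


if __name__ == "__main__":
    for size in [(1024, 768), (1920, 1080)]:
        print(size, screenshot_tokens(*size), cumulative_image_tokens(*size, steps=10))
```

Multiplying the result by your model's input-token price gives a rough lower bound, since text, tool definitions, and output tokens come on top. The quadratic growth in the second function is the main reason to cap steps or drop old screenshots.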

Summary

Aspect	Details
What it is	A tool that lets Claude interact with a computer visually
How it works	Loop: screenshot -> analyze -> action -> repeat
Status	Beta — experimental
Security	Always use a Sandbox
Best for	Automating interfaces without APIs, UI testing
Alternative	Regular API tools when an API is available

Next: We will learn about the Code Execution tool — which lets Claude write and run code in a safe environment.
