2024-10-22

Computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku

Computer Use: What It Is and How It Works

Models can now control a sandboxed desktop in a loop: they receive screenshots and respond with mouse/keyboard actions.
Reference implementation uses Docker/VMs; not a native “Claude Desktop” app.
It can scroll, click, type, open apps/browsers, and even persist through slow app startups, but struggles with finer actions like dragging/zooming.
Many see this as ideal for GUI-based automation, end-to-end tests, and “agents that actually do work,” including on legacy Windows/Mac apps.
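
The screenshot-and-action loop described above can be sketched in a few lines. This is a minimal illustration only, not the reference implementation: `Action`, `FakeDesktop`, `run_loop`, and `scripted_model` are hypothetical names, the "desktop" merely records actions, and a scripted function stands in for the model.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Action:
    """Hypothetical action record; real tools define their own schema."""
    kind: str  # "click", "type", "scroll", or "done"
    x: int = 0
    y: int = 0
    text: str = ""


@dataclass
class FakeDesktop:
    """Stand-in for a sandboxed desktop: records actions instead of executing them."""
    log: List[Action] = field(default_factory=list)

    def screenshot(self) -> bytes:
        # A real sandbox would return image bytes; here the "frame" is just
        # a label derived from how many actions have been applied so far.
        return f"frame-{len(self.log)}".encode()

    def apply(self, action: Action) -> None:
        self.log.append(action)


def run_loop(desktop: FakeDesktop,
             choose_action: Callable[[bytes], Action],
             max_steps: int = 10) -> int:
    """Observe, decide, act; stop on "done" or after max_steps. Returns steps taken."""
    for step in range(max_steps):
        frame = desktop.screenshot()    # observe
        action = choose_action(frame)   # decide (the model call in a real system)
        if action.kind == "done":
            return step
        desktop.apply(action)           # act
    return max_steps


def scripted_model(frame: bytes) -> Action:
    """Assumed stand-in for the model: a fixed script keyed on the frame label."""
    script = {
        b"frame-0": Action("click", x=100, y=200),
        b"frame-1": Action("type", text="hello"),
    }
    return script.get(frame, Action("done"))


if __name__ == "__main__":
    desktop = FakeDesktop()
    steps = run_loop(desktop, scripted_model)
    print(steps, [a.kind for a in desktop.log])
```

The `max_steps` cap is a design choice in this sketch: because the loop is driven by what the model decides, a hard upper bound keeps a confused run from continuing indefinitely.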

Privacy, Security, and Safety Concerns

Strong worries about sending screenshots and granting remote control, especially on real workstations with PII/PHI or corporate data.
Multiple commenters advocate strict sandboxing (VMs, remote desktops, limited accounts) and “read-only” or confirm-before-click modes.
People anticipate incidents: accidental deletion, being tricked by phishing UIs, or exfiltration of sensitive data.
Some see this as a likely way CAPTCHAs and web anti-bot defenses will be bypassed.
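
The “read-only” and confirm-before-click modes that commenters advocate could look roughly like the gate below. It is a sketch under stated assumptions: the action dictionary shape, the `READ_ONLY_KINDS` set, and the `gate` function are hypothetical and not part of any real tool's API.

```python
from typing import Callable, Dict

# Assumption: these action kinds only observe the screen and never change state.
READ_ONLY_KINDS = {"screenshot", "scroll"}


def gate(action: Dict[str, str],
         confirm: Callable[[Dict[str, str]], bool],
         read_only: bool = False) -> bool:
    """Return True if the action may run.

    Observing actions always pass. In read-only mode every other action is
    refused; otherwise the decision is delegated to `confirm`, which in a
    real setup would prompt a human operator.
    """
    if action["kind"] in READ_ONLY_KINDS:
        return True
    if read_only:
        return False
    return confirm(action)


if __name__ == "__main__":
    always_deny = lambda action: False
    print(gate({"kind": "scroll"}, always_deny))  # observing action, allowed
    print(gate({"kind": "click"}, always_deny))   # needs confirmation, denied
```

A gate like this only limits what the agent is allowed to do; it does not replace running inside a VM or limited account, which is what contains the damage if an approved action turns out to be harmful.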

RPA, Legacy Software, and Accessibility

Widely compared to Robotic Process Automation (UiPath, etc.): same idea of automating GUIs when no clean API exists.
Many note this may be the only practical way to integrate with entrenched, GUI-only enterprise tools (medical, tax, ERP, banking).
Others highlight accessibility potential: AI as a powerful screen-reader / voice-driven operator for people with visual or motor impairments.

Model Quality: Coding and Reasoning

Several users report that the new Claude 3.5 Sonnet (“New”/20241022) is much better at coding than GPT-4o, with fewer hallucinations and cleaner Python/Rust.
Benchmarks cited: big gains on SWE-bench Verified and Aider’s coding/refactor leaderboards; competitive but below o1-preview on some reasoning tests.
Haiku 3.5 is said to roughly match the prior Opus's performance at much lower cost, though its pricing relative to 4o-mini draws some criticism.

Versioning, Product Positioning, and UX

Heavy confusion/annoyance over naming: “Claude 3.5 Sonnet (New)” instead of 3.6 or 4.0, plus dated model IDs.
Opus 3.5’s status is unclear; some think Sonnet 3.5 has effectively displaced Opus 3.0 for most tasks.
Rate limits on the chat UI frustrate frequent users; many route through APIs or third‑party tools.
Branding and UX are praised as warmer and less “dramatic” than competitors, but missing features (e.g., robust LaTeX, real-time voice) are noted.

Developer Workflow and Tools

Strong migration pattern: many coders report switching from GPT-based tools to Claude, especially via editors like Cursor, Continue.dev, Cody, Aider, etc.
Desired next step: tight integration between code edits and browser results using Computer Use, so agents can iteratively debug UIs on their own.

Broader Implications and Skepticism

Some see this as a step toward “FSD for computers” and a threat to many remote/white-collar roles.
Others argue reliability, error handling, and organizational constraints will keep humans heavily in the loop for the foreseeable future.
