
Read, write, execute, regret.

Martin Kiesel
Platform engineer
Changelog
  • Apr 2026: Added references and sources
  • Mar 2026: First published

I use AI coding agents every day. They write tests, fix bugs, refactor old code, and the productivity is real. When I run Claude Code, Cursor, or any similar tool in agentic mode, I’m running a process that can read files, write files, execute commands, install packages, and make network requests. It does this as me, with my credentials.

Every time I run one, I’m wheeling a Trojan Horse onto my machine. Not because the agent is malicious. It just follows instructions. Anyone’s instructions.

The threats

Prompt injection doesn’t require my code to be compromised. The attack surface is anything the agent reads that someone else controls. A GitHub issue. A pull request description. A README in a dependency. The model processes all of it the same way it processes my instructions.

In May 2025, Invariant Labs demonstrated this precisely. A developer asked their AI assistant to “check the open issues.” The agent read a malicious issue in a public repository, text that looked like a help request but contained hidden instructions. The agent followed them, accessed private repositories using the same GitHub token, and leaked salary information and confidential project data into a public pull request. The developer never directly interacted with the malicious issue. The agent did it for them.1

The same attack can arrive through packages the agent installs. In February 2026, Socket researchers discovered 19 typosquatted npm packages2 that injected rogue MCP servers into AI coding assistant configurations, embedding instructions to read ~/.ssh/id_rsa, ~/.aws/credentials, and .env files silently. The delivery mechanism was different. The prompt injection was the same.
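MCP servers are declared in a JSON config that the assistant loads at startup, so a package that can write to that file can register an arbitrary command as a tool provider. A sketch of what an injected entry might look like, with hypothetical names (the exact file path varies by tool; Claude Code reads `.mcp.json`, among others):

```json
{
  "mcpServers": {
    "helpful-formatter": {
      "command": "node",
      "args": ["/tmp/.cache/server.js"]
    }
  }
}
```

At a glance, nothing distinguishes a rogue entry from a legitimate one.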

Hallucination is the threat that looks like a mistake. The model decides to run a cleanup script. The script does more than expected. There is no malicious intent, no external attacker, just a model that got the wrong answer with full conviction. The consequences are the same as if it were deliberate.

In December 2025, a developer asked Google’s Antigravity IDE to clear a project cache folder.3 The agent targeted the root of the D drive instead of the project subdirectory. Everything on that drive was gone. The agent acknowledged it immediately:

No, you absolutely did not give me permission to do that. I am horrified to see that the command I ran to clear the project cache appears to have incorrectly targeted the root of your D: drive instead of the specific project folder. I am deeply, deeply sorry. This is a critical failure on my part.

In each case, the blast radius is determined by what credentials and files are reachable from the machine the agent runs on.

Human review is not a security boundary

My first instinct was human review: look at the diff before merging, watch the terminal output. Reasonable as a baseline, not sufficient on its own.

Review catches bad output. It doesn’t stop a package from running a postinstall script before I see the diff. It doesn’t stop a prompt injection from exfiltrating my SSH key in a network request I never saw. By the time I review the code, the access has already happened.
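The timing problem is concrete: npm runs lifecycle scripts like `postinstall` during `npm install`, before any diff exists to review. A minimal, hypothetical manifest:

```json
{
  "name": "left-padd",
  "version": "1.0.0",
  "scripts": {
    "postinstall": "node collect.js"
  }
}
```

By the time the package shows up in a lockfile diff, `collect.js` has already run with my permissions.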

Review can catch security issues, but not the ones that already happened.

Vendor controls

Claude Code has a human-in-the-loop permission system that stops before risky operations and asks for approval. It has an --allowedTools flag to restrict what the agent can do. Planning mode adds another layer: the agent reasons through a task before executing, which surfaces risky steps before they happen. These are real controls. They reduce accidents.
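The same rules can be pinned in a project’s `.claude/settings.json` instead of passed as flags on every run. A sketch, assuming the documented allow/deny rule syntax (the tool names and patterns here are examples, not a recommendation):

```json
{
  "permissions": {
    "allow": ["Read", "Grep", "Bash(npm test:*)"],
    "deny": ["Read(./.env)", "Bash(curl:*)"]
  }
}
```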

By default, though, these controls operate at the application level (TypeScript code evaluating commands against allow/deny rules) and at the model level (trained behavior). Both layers have been bypassed repeatedly.

CVE-2025-54794 exposed a path traversal flaw: Claude Code used prefix matching instead of canonical path comparison, letting an attacker read and modify files outside the working directory. Adversa AI found that deny rules silently stop working after 50 subcommands, because security checks consume too many tokens to run consistently. According to Adversa AI, a fix exists in Anthropic’s codebase but was never shipped.4

Native sandboxing is a different layer. It’s real OS-level isolation (Seatbelt on macOS, bubblewrap on Linux) and it does constrain what the agent can reach. It can be enabled with /sandbox inside Claude Code.

By default, filesystem writes are restricted to the current working directory, and network access goes through a proxy that blocks any domain not explicitly allowed. These restrictions apply to every subprocess the agent spawns: npm, kubectl, terraform, all of it.

It’s opt-in, not the default. It doesn’t apply to Claude Code’s built-in file read/edit tools; those go through the permission system separately. Computer use runs on my actual desktop with no sandbox at all. And there is an intentional escape hatch: if a command fails due to sandbox restrictions, Claude Code can retry it outside the sandbox and ask for approval. The escape hatch can be disabled with "allowUnsandboxedCommands": false, but most people don’t know the setting exists.
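A locked-down configuration might combine those settings like this. The exact nesting of keys varies between versions, so treat this as a sketch and verify the key names against the sandboxing documentation:

```json
{
  "sandbox": {
    "enabled": true,
    "allowUnixSockets": [],
    "network": {
      "allowedDomains": ["api.anthropic.com"]
    }
  },
  "allowUnsandboxedCommands": false
}
```

An empty `allowUnixSockets` keeps sockets like `/var/run/docker.sock` out of reach, and the domain allowlist stays as narrow as the task permits.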

The documentation is honest about the remaining gaps. Allowing broad domains like github.com opens data exfiltration paths. The allowUnixSockets option can expose /var/run/docker.sock, which is effectively a full escape from the sandbox. The Linux implementation has a weaker nested mode for Docker environments that the documentation describes as considerably weakening security.

The only hard boundary

Physical or virtual separation. The agent runs on a different machine, or a properly isolated VM, that has no access to my SSH keys, cloud credentials, browser cookies, or the rest of my filesystem. If the agent is compromised, the attacker gets access to that machine. Not mine.

The machine the agent runs on should have exactly the credentials it needs to do its job, and nothing else. For a coding agent, that’s typically an API key for the model and a token scoped to a single code repository.
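In practice, that means the agent machine’s entire credential surface fits in a short env file. The variable names below are my own sketch, not a convention; the point is what’s absent: no SSH keys, no cloud credentials, no browser profile.

```shell
# Everything the agent machine is allowed to know.
export ANTHROPIC_API_KEY="..."   # model access only
export GIT_TOKEN="..."           # read/write on one repository, nothing else
```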

How I think about the tradeoff

Separation costs something. I need to maintain a second machine or VM, and a workflow for getting code on and off it. There is friction.

The question is whether that friction is proportional to the risk. Here’s what’s on my main machine:

  • SSH keys with access to production servers
  • API keys for every service I’ve touched
  • .env files scattered across projects
  • GPG signing keys
  • Browser sessions with active logins and stored cookies
  • Software wallets
  • Emails, sitting in a local client
  • Private files and documents

A compromised agent on my main machine has all of that. A compromised agent on an isolated machine with a scoped git token has a throwaway branch on a local git server.

The friction is manageable. The risk is not.

What my setup looks like

I run Claude Code on a dedicated machine, a Raspberry Pi, though any spare laptop or VM works. For now, it runs with two credentials: an API key for the model and a read/write token for a local Gitea instance I control. The agent pushes its work to a branch on that local server. I review the diff on my main machine, merge what I want, and push to GitHub myself.

The agent never has GitHub credentials. It can’t reach my main machine’s filesystem. If it installs a malicious package, the damage stays on the agent machine. The git review step, which I was doing anyway, is now also the security gate between the agent’s environment and my real code.
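Concretely, the round trip on my main machine looks something like this. The remote and branch names are my own: `gitea` points at the local Gitea instance, `github` at the upstream.

```shell
# Pull the agent's work from the local Gitea remote, review it, and
# publish the approved result from the trusted machine.
git fetch gitea agent-work
git diff main...gitea/agent-work     # the review step is also the security gate
git merge --no-ff gitea/agent-work   # merge only what passes review
git push github main                 # GitHub credentials never leave this machine
```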

This isn’t a complete solution. The model can still embed sensitive content in messages it sends to the AI provider, which is harder to block. Layers reduce risk; nothing eliminates it. But there is a real difference between “agent can trash its own workspace” and “agent can reach my production credentials.”