AI Code Sandboxes: A Comparative Security Study (Part 1 of 2 — Engine-Level Properties)

Andronchik, George; Lokhmakov, Pavel

In one paragraph

We measured how well five popular "run-untrusted-AI-code" sandboxes actually isolate that code from the host machine. We looked at six engine-level properties — attack surface, information leakage, ability to stack extra hardening, public bug (CVE) history, how fast fixes reach you, and whether the engine is continuously fuzz-tested. The headline: the type of engine (microVM vs. userspace-kernel vs. plain container) decides most of the architectural security, but the product's own defaults and update policy decide whether that security actually reaches you. No product wins on everything, so we publish per-property scores instead of one ranking.

The five products & three engine classes

Product	Engine	Class	One-line takeaway
e2b	Firecracker	microVM	Tightest attack surface, 0 leaks — but got its first escape bugs in 2026 and ships a frozen, 399-day-old version.
arrakis	Cloud Hypervisor	microVM	Strong engine, but exposes nested `/dev/kvm` to guests and is frozen 471+ days behind upstream.
microsandbox	libkrun	microVM	Cleanest leak result, fast updates — but 0 CVEs + no fuzzer = its real bug risk is simply unknown.
gvisor	runsc	userspace kernel	Most consistent across properties: tightest reachable surface, 0 escape CVEs, the only continuously fuzz-tested engine, and always current.
daytona	runc	OCI container	Shares the host kernel verbatim (10 leaks, 4 escape CVEs) and blocks most hardening — but patches arrive on time via Docker.
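
As a rough illustration of the nested `/dev/kvm` exposure noted for arrakis, an operator could run a small probe inside a guest to see whether a KVM device node is visible and openable. This is a minimal sketch, not the study's measurement method: the function name `probe_device` is hypothetical, and a visible device node does not by itself prove that nested virtualization works.

```python
import os
import stat


def probe_device(path: str = "/dev/kvm") -> dict:
    """Report whether `path` exists, is a character device, and is read/write accessible.

    Hypothetical helper for illustration; run it inside the sandbox under test.
    """
    try:
        st = os.stat(path)
    except OSError:
        # Missing node (or no permission to stat it): treat as not exposed.
        return {"exists": False, "char_device": False, "rw_access": False}
    return {
        "exists": True,
        "char_device": stat.S_ISCHR(st.st_mode),
        "rw_access": os.access(path, os.R_OK | os.W_OK),
    }


if __name__ == "__main__":
    print(probe_device())
```

A result with all three fields true would suggest the guest can at least open the device; whether it can actually create nested VMs would need a further check.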

What we found

Engine class is the dominant signal — but only for architecture. microVMs, userspace kernels, and containers separate cleanly on attack surface, leakage, and bug history. Within a class, products still differ a lot, so class alone isn't enough.
Getting fixes is mostly about the product, not the engine. Upstream engines release fixes with roughly 0 days of delay, but what reaches operators ranges from same-day to 471+ days to "we can't even tell", driven entirely by each product's version-pinning policy.
The safest theoretical combo doesn't exist yet. Nobody ships a microVM (strongest isolation) with a continuous public fuzzer (strongest bug-finding). Operators therefore face a real trade-off: gvisor, whose continuous fuzzing should leave the fewest undiscovered bugs, or a microVM for the strongest isolation, but not both.
"Zero known bugs" can be the worst sign, not the best. microsandbox/libkrun has no CVEs and no fuzzer and no academic study — meaning its real risk is unmeasured, not low.

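The "days behind upstream" figures for pinned products can be thought of as a simple date difference. The sketch below assumes the lag is the gap between the release date of the version a product pins and the release date of the newest upstream version; the study's exact definition is in the full methodology, and the function name and dates here are hypothetical.

```python
from datetime import date


def days_behind(pinned_release: date, latest_upstream: date) -> int:
    """Days between the pinned engine release and the newest upstream release.

    Returns 0 when the pinned version is already the latest (or newer).
    """
    return max(0, (latest_upstream - pinned_release).days)


# Hypothetical dates, for illustration only (not taken from the study).
print(days_behind(date(2024, 1, 1), date(2024, 3, 1)))  # 60
```

A product that tracks upstream on every release would score 0 under this definition, while a frozen pin grows by one day for each day upstream stays ahead.
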
Quick scorecard

▲ best · ◆ middle · ▼ worst · — unmeasured. These are per-property positions, **not** a combined score — different operators weight these differently.
Product	Attack surface	Leak resistance	Can harden it?	Bug history	Gets patches?	Fuzz-tested?
gvisor	▲	◆	◆	◆	▲	▲
e2b	▲	▲	▲	◆	▼	▼
microsandbox	◆	▲	▲	—	▲	▼
arrakis	▼	◆	▲	▲	▼	◆
daytona	◆	▼	▼	▼	▲	◆

Bottom line for operators

If you weight escape resistance most, gvisor is the only clear pick. If you weight getting patches promptly, three qualify (microsandbox, gvisor, daytona-on-Docker). The two frozen-pin products (arrakis, e2b self-hosted) are strong engines undermined by stale versions. daytona is the riskiest by architecture but stays current. There is no single winner — match the product to the property you care about most.
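
Because the scorecard is deliberately not summed into one number, one way to use it is to query a single property at a time. The sketch below transcribes the scorecard positions into a small table and lists the products at a given position; the property keys and the function name `products_at` are hypothetical labels chosen for this example.

```python
from typing import Optional

# Column order follows the scorecard table.
PROPERTIES = ("attack_surface", "leak_resistance", "hardening",
              "bug_history", "patches", "fuzzing")

# One tuple per product, in PROPERTIES order; None means "unmeasured".
SCORECARD: dict[str, tuple[Optional[str], ...]] = {
    "gvisor":       ("best", "middle", "middle", "middle", "best", "best"),
    "e2b":          ("best", "best", "best", "middle", "worst", "worst"),
    "microsandbox": ("middle", "best", "best", None, "best", "worst"),
    "arrakis":      ("worst", "middle", "best", "best", "worst", "middle"),
    "daytona":      ("middle", "worst", "worst", "worst", "best", "middle"),
}


def products_at(prop: str, position: Optional[str] = "best") -> list[str]:
    """Return the products whose scorecard position for `prop` equals `position`."""
    idx = PROPERTIES.index(prop)
    return [name for name, row in SCORECARD.items() if row[idx] == position]


print(products_at("patches"))            # ['gvisor', 'microsandbox', 'daytona']
print(products_at("bug_history", None))  # ['microsandbox']
```

Querying `"patches"` reproduces the three products named above as receiving fixes promptly; an operator with several priorities would run one query per property and intersect the results by hand rather than average them.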

Scope

This study covers engine-level properties only (Part 1 of 2). It assumes a single-tenant operator running untrusted code on their own trusted infrastructure. Out of scope: multi-tenant isolation, side-channel attacks, and live exploit development. The full methodology, all measurements, and the companion code are on arXiv and GitHub.