Engine-Level Properties — Part 1 of 2
AI Code Sandboxes:
A Comparative Security Study
Six engine-level axes read together across five AI-sandbox products and three engine classes — attack surface · leakage · stackability · CVE history · patch cadence · fuzzing posture. No single axis is a sufficient basis for judgement; the cross-axis reading is the load-bearing analysis.
This page is a reading copy. The authoritative version of record is published on arXiv (2606.08433) under CC BY 4.0; that page is set as the canonical source for indexing. Companion code (Apache-2.0): orbitalab/RnD-ai-sandboxes-sec-study-part-1.
In one paragraph
We measured how well five popular "run-untrusted-AI-code" sandboxes actually isolate that code from the host machine. We looked at six engine-level properties — attack surface, information leakage, ability to stack extra hardening, public bug (CVE) history, how fast fixes reach you, and whether the engine is continuously fuzz-tested. The headline: the type of engine (microVM vs. userspace-kernel vs. plain container) decides most of the architectural security, but the product's own defaults and update policy decide whether that security actually reaches you. No product wins on everything, so we publish per-property scores instead of one ranking.
The five products & three engine classes
| Product | Engine | Class | One-line takeaway |
|---|---|---|---|
| e2b | Firecracker | microVM | Tightest attack surface, 0 leaks — but got its first escape bugs in 2026 and ships a frozen, 399-day-old version. |
| arrakis | Cloud Hypervisor | microVM | Strong engine, but exposes nested /dev/kvm to guests and is frozen 471+ days behind upstream. |
| microsandbox | libkrun | microVM | Cleanest leak result and fast updates, but with 0 CVEs and no fuzzer its real bug risk is simply unknown. |
| gvisor | runsc | userspace kernel | Best overall: tightest reachable surface, 0 escape CVEs, the only continuously fuzz-tested engine, always current. |
| daytona | runc | OCI container | Shares the host kernel verbatim (10 leaks, 4 escape CVEs) and blocks most hardening — but patches arrive on time via Docker. |
What we found
- Engine class is the dominant signal — but only for architecture. microVMs, userspace kernels, and containers separate cleanly on attack surface, leakage, and bug history. Within a class, products still differ a lot, so class alone isn't enough.
- Getting fixes is mostly about the product, not the engine. Upstream engines ship fixes with roughly zero delay. But what reaches operators ranges from same-day to 471+ days to "we can't even tell", and that spread is driven entirely by each product's version-pinning policy.
- The safest theoretical combo doesn't exist yet. Nobody ships a microVM (strongest isolation) with a continuous public fuzzer (strongest bug-finding). So operators face a real trade-off: gvisor, whose continuous fuzzing leaves the fewest shallow bugs behind, or a microVM for the strongest isolation, but not both.
- "Zero known bugs" can be the worst sign, not the best. microsandbox/libkrun has no CVEs, no fuzzer, and no academic study, which means its real risk is unmeasured, not low.
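The patch-delivery gap in the second finding comes down to simple date arithmetic. As a minimal sketch, assuming lag is measured as the number of days between the engine release a product pins and the newest upstream release (the study's exact definition may differ), it could be computed as below. The function name `pin_lag_days` and both dates are hypothetical placeholders, not the study's measurements.

```python
from datetime import date


def pin_lag_days(pinned_release: date, latest_upstream_release: date) -> int:
    """Days the pinned engine release trails the newest upstream release.

    Assumption: lag is a plain calendar-day difference, floored at zero
    so that a product pinning the newest release reports no lag.
    """
    return max(0, (latest_upstream_release - pinned_release).days)


# Hypothetical dates, chosen only to illustrate the arithmetic.
pinned = date(2025, 3, 1)
latest = date(2025, 6, 9)

print(pin_lag_days(pinned, latest))  # 100
print(pin_lag_days(latest, latest))  # 0
```

A product that always ships the newest release yields 0, while a frozen pin falls further behind with every new upstream release.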
Quick scorecard
| Product | Attack surface | Leak resistance | Can harden it? | Bug history | Gets patches? | Fuzz-tested? |
|---|---|---|---|---|---|---|
| gvisor | ▲ | ◆ | ◆ | ◆ | ▲ | ▲ |
| e2b | ▲ | ▲ | ▲ | ◆ | ▼ | ▼ |
| microsandbox | ◆ | ▲ | ▲ | — | ▲ | ▼ |
| arrakis | ▼ | ◆ | ▲ | ▲ | ▼ | ◆ |
| daytona | ◆ | ▼ | ▼ | ▼ | ▲ | ◆ |
Bottom line for operators
If you weight escape resistance most, gvisor is the only clear pick. If you weight getting patches promptly, three qualify (microsandbox, gvisor, daytona-on-Docker). The two frozen-pin products (arrakis, e2b self-hosted) are strong engines undermined by stale versions. daytona is the riskiest by architecture but stays current. There is no single winner — match the product to the property you care about most.
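Matching a product to one property can be sketched as a lookup over the quick scorecard. The snippet below transcribes the scorecard and the engine classes from the tables on this page; it assumes that ▲ is the strongest mark (which the one-line takeaways suggest but this page does not state), and the names `SCORECARD`, `ENGINE_CLASS`, and `products_with` are hypothetical helpers, not part of the companion code.

```python
AXES = [
    "Attack surface", "Leak resistance", "Can harden it?",
    "Bug history", "Gets patches?", "Fuzz-tested?",
]

# One mark per axis, in AXES order, copied from the scorecard table.
SCORECARD = {
    "gvisor":       "▲◆◆◆▲▲",
    "e2b":          "▲▲▲◆▼▼",
    "microsandbox": "◆▲▲—▲▼",
    "arrakis":      "▼◆▲▲▼◆",
    "daytona":      "◆▼▼▼▲◆",
}

# Engine class per product, copied from the product table.
ENGINE_CLASS = {
    "e2b": "microVM",
    "arrakis": "microVM",
    "microsandbox": "microVM",
    "gvisor": "userspace kernel",
    "daytona": "OCI container",
}


def products_with(axis: str, mark: str = "▲") -> list[str]:
    """Return the products whose scorecard cell for `axis` equals `mark`."""
    col = AXES.index(axis)
    return [name for name, marks in SCORECARD.items() if marks[col] == mark]


# Products with the top mark for patch delivery.
print(products_with("Gets patches?"))  # ['gvisor', 'microsandbox', 'daytona']

# microVM products that also have the top mark for fuzzing: none.
fuzzed_microvms = [
    p for p in products_with("Fuzz-tested?") if ENGINE_CLASS[p] == "microVM"
]
print(fuzzed_microvms)  # []
```

Under that assumption the lookup reproduces two statements from this page: three products qualify on prompt patching, and no microVM product carries the top fuzzing mark.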