tl;dr: After a few months of on-and-off effort – 353 Claude Code messages across 39 sessions in the last month alone – I have a Rust port of Angr’s symbolic execution engine that matches or beats the Python original on 13 of 16 CTF benchmarks (up to 12.7x on the largest), and captures the same flag as Python on every benchmark it solves. Rust is now the source of truth for state; Python is called back only for unhandled externals. Getting there took two failed attempts and a pivot to agent loops with persistent memory (beads + a Ralph-style script that accumulated 30+ hours of runtime). This post is a trip report.
Angr
Angr is a fantastic tool that has evolved over many years of research from a symbolic execution engine into a fully-featured binary analysis platform. When Angr was first created, the developers chose to implement the core state exploration engine in Python. This made a lot of sense: the major bottleneck in symbolic execution is often constraint solving, and by using Python bindings for z3 (which is implemented in C++) they got native performance where it mattered most. But what if they had implemented the core engine in Rust from the beginning? (Unlikely, given that Angr's first public release came just three months after Rust 1.0 in May 2015.)
Early days
In January, I read Vjeux’s blog post about porting 100k lines of TypeScript to Rust and wanted to see for myself whether the models were capable of achieving something like this for Angr.
When I first started this effort (in late January), I gave Claude a link to the blog post and said that I wanted to do the same for Angr. Specifically, I said that I wanted to leverage Rust’s O(1) Copy-on-Write primitives to see what we could improve. I did not know the scope of the work or what exactly needed to be done. Claude investigated the Angr source code and we planned an implementation of SimMemory in Rust which it implemented pretty quickly. However, after implementation, the Python-to-Rust overhead was too high for this to be useful. So, I did the next logical thing and asked Claude to put together a plan to create a VEX engine written in Rust with support for all the architectures that Angr supports. I iterated many times on unimplemented instructions, had it write differential tests with the “vanilla” Angr engine, and let it fix the Rust engine when it found differences.
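The O(1) copy-on-write idea can be sketched in Rust. This is a minimal illustration under my own naming (`Page`, `Memory` are not the port's actual SimMemory types): forked states share pages behind an `Arc`, so cloning a state copies no data, and `Arc::make_mut` deep-copies a page only on the first write after a fork.

```rust
use std::sync::Arc;

// Illustrative sketch, not the real SimMemory port: a page of memory
// shared between forked states via reference counting.
#[derive(Clone)]
struct Page(Arc<Vec<u8>>);

#[derive(Clone)]
struct Memory {
    pages: Vec<Page>,
}

impl Memory {
    fn write(&mut self, page: usize, off: usize, byte: u8) {
        // make_mut deep-copies the page only if another state still
        // holds a reference to it (clone-on-write).
        Arc::make_mut(&mut self.pages[page].0)[off] = byte;
    }

    fn read(&self, page: usize, off: usize) -> u8 {
        self.pages[page].0[off]
    }
}

fn main() {
    let parent = Memory { pages: vec![Page(Arc::new(vec![0u8; 4096]))] };
    let mut child = parent.clone(); // O(1): bumps refcounts, copies no pages
    child.write(0, 0, 0x41);        // first write triggers the page copy
    assert_eq!(parent.read(0, 0), 0); // parent is unaffected
    assert_eq!(child.read(0, 0), 0x41);
    println!("ok");
}
```

Forking a symbolic state at a branch becomes a cheap `clone()`; only the pages a path actually touches afterwards get copied.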
After a few days, I was excited to see that I had a working VEX engine and entered a phase of assuming that one more plan/implement cycle would “finish” this project. I ran the following planning prompt (paraphrased) many times:
i am trying to build a rust symbolic execution engine for angr. do a comprehensive analysis of the differences between the engines and design a plan to address the findings.
Each time Claude planned an implementation, I would review it and make suggestions. Other than that (and spending the tokens), I mainly left Claude to investigate deficiencies and implement fixes. Eventually, I became increasingly tired of watching Claude struggle and decided to experiment with different agentic loops. This made some progress but did not achieve my goal – the Rust symbolic execution engine was still slower than the Python implementation. I had Claude fork the repo and try again to see if that would fix it but that did not work either.
In an attempt to salvage the work, I restated what I thought was my original goal and found that Claude and I had diverged on what the “goal” even was. During the initial planning, I had not imposed a requirement that
Rust should be the source of truth so Claude was in implementation hell trying
to synchronize states across dual solvers in Rust and Python. rust_manager.py
grew to 5,167 lines maintaining eleven separate caches that tried to keep Rust
and Python views of the world consistent: reconstructing registers (claripy BVV
wrapping on every callback), syncing memory pages (endianness bugs –
flareon10 alone had three separate ones here), syncing constraints (identity
mismatches between Python and Rust symbol names), and holding two solvers that
each saw a different subset of constraints – neither complete. This generated
the majority of our bugs: identity conflicts, constraint desync, empty stash
exports, wrong outputs, UNSAT on valid states. Compounding it, Python (claripy)
and Rust (z3-rs) each had their own Z3 context; constraints could not be shared
directly and had to be translated. Z3_solver_translate has a silent
same-context footgun – it returns a solver with zero assertions when source
and destination contexts match – and structurally different ASTs for
semantically identical constraints solved 3-7x slower in Z3 (hackcon: 66s
Rust vs. 9s Python). Each plan/implement cycle was fixing a sync bug instead of
the architecture that generated sync bugs in the first place.
I could have burned tokens for months and never reached my goal. I was never even close to one plan/implement cycle away from completion. After realizing this, it was time to change approaches.
Relevant branches:
- https://github.com/jcrussell/angr/tree/rust-engine
- https://github.com/jcrussell/angr/tree/rust-engine-v2
Evolved approach
One thing that I had noticed during my manual loops was how often Claude had to “relearn” something that it had already discovered. This wasted both time and tokens. I decided to experiment with beads to see if it could help keep Claude on track over many sessions.
I initialized a new branch from the upstream repo and started with a new planning phase with the following prompt (paraphrased):
my goal is to have a rust symex that gets initialized in python and then hands off execution to rust. rust becomes the source of truth for the state of the analysis and callbacks to python only when there is an unhandled external call. this allows users to extend the engine without writing rust but we get all of the neat performance improvements from using rust
I had Claude explore the existing Rust engine branches to see what it could learn from not only the current state of the repo but also the commit history. I had it store this knowledge in beads. I then had it investigate several additional sources to develop a more comprehensive understanding of the current state. Specifically, I had it examine the Python implementation to see what it could learn from the Angr developers. I also cloned the z3 repo to see what we could learn from z3 itself about how to better construct constraints from the Rust engine. These findings all went into the beads issues and/or memories. The two major architecture changes based on this exploration were:
- `RustStateProxy` – a lightweight `(manager, state_id)` handle instead of a full `SimState` sync; the real `SimState` is only constructed when a SimProcedure actually needs one, which keeps the common path cheap.
- Rust solver as the single source of truth – no bidirectional sync; Python solver failures fall back to Rust via `eval_with_fallback`.
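The proxy idea can be sketched in Python. The names here (`FakeManager`, `materialize`) are illustrative stand-ins, not the branch's real API; the point is that the expensive state reconstruction is deferred until Python actually needs it.

```python
class RustStateProxy:
    """Sketch of the lightweight-handle idea (illustrative names, not
    the branch's actual API): hold only (manager, state_id); the heavy
    SimState-like object is materialized lazily, on first real use."""

    def __init__(self, manager, state_id):
        self.manager = manager
        self.state_id = state_id
        self._full_state = None  # never built on the hot path

    @property
    def full_state(self):
        # Pay for full state reconstruction only when a SimProcedure
        # actually needs one (e.g. an unhandled external call).
        if self._full_state is None:
            self._full_state = self.manager.materialize(self.state_id)
        return self._full_state


class FakeManager:
    """Stand-in for the Rust-side manager, to demonstrate the laziness."""

    def __init__(self):
        self.materialize_calls = 0

    def materialize(self, state_id):
        self.materialize_calls += 1
        return {"state_id": state_id}
```

On the common path, Rust steps states without ever calling `materialize`; the proxy only exists so a callback has something to hand to Python.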
After enumerating and exploring the Rust engine code from multiple angles
(including for code quality) and building up a substantial backlog, I realized
that I did not want to babysit the implementation. Given that all the planning
had already been done, I had Claude write a Bash script to implement a Ralph
loop. This script evolved over the next few weeks
as I encountered failure modes (e.g., OOMs and session limits). The most recent
script
is stored in the repo. The loop is a thin bash wrapper that, each iteration,
consults beads for the next ready task, spawns a fresh claude -p session with
a long-form prompt, lets Claude claim the task, run cargo check and pytest,
commit, and close the task in beads.
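The shape of one iteration looks roughly like this. It is a sketch, not the script in the repo: the `bd` beads invocation and the `claude -p` prompt are assumptions, kept overridable via environment variables so the structure can be exercised without either tool installed.

```shell
#!/usr/bin/env bash
# Sketch of one Ralph-loop iteration (illustrative, not the repo's
# actual script). NEXT_TASK_CMD stands in for the beads CLI and
# AGENT_CMD for `claude -p`; both are overridable for testing.
set -u

NEXT_TASK_CMD=${NEXT_TASK_CMD:-bd ready --limit 1}
AGENT_CMD=${AGENT_CMD:-claude -p}

iteration() {
    # Ask beads for the next ready task; stop when the backlog is empty.
    local task
    task=$($NEXT_TASK_CMD) || return 1
    [ -n "$task" ] || return 1
    # Fresh agent session per task: claim it, implement, verify with
    # cargo check and pytest, commit, then close the task in beads.
    $AGENT_CMD "Claim $task from beads, implement it, run cargo check and pytest, commit, and close it."
}

# The real wrapper runs `while iteration; do :; done` with restart and
# OOM handling layered around it.
```

Each iteration spawning a fresh session is what makes beads essential: the durable task state lives outside the context window.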
Across the evolved phase the loop accumulated 30+ hours of runtime in aggregate (broken up across many restarts – OOMs, rate limits, session boundaries). I used Abox – inspired by Emil’s blog post – to reduce the chances of ruining my computer.
Final results
After all that, here are the latest benchmark results.
Numbers come from tests/benchmarks/run_single.py --both run sequentially over
16 CTF binaries from angr-examples/examples. Each benchmark runs in an
isolated subprocess with a 4 GB memory limit (RLIMIT_AS); timing is
wall-clock end-to-end, including angr project creation, state initialization,
exploration, and constraint solving. Hardware: an 8-vCPU AMD EPYC Milan VM, 8
GB RAM, Ubuntu 24.04, no swap. This is a notional benchmark to motivate the
trip report: single runs, variance not characterized, and some binaries were
excluded where either engine failed. A future paper-grade evaluation
(SynSec or similar) would pin versions, report medians
and stddev, and enumerate exclusions; here, treat sub-1.1x deltas as noise.
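The isolation described above can be sketched as follows. This is an illustrative harness shape, not the repo's actual `run_single.py`: cap the child's address space with `RLIMIT_AS` and time end-to-end wall clock. POSIX-only (`preexec_fn`, `RLIMIT_AS`).

```python
import resource
import subprocess
import sys
import time

# Illustrative benchmark harness (not the repo's run_single.py): run one
# benchmark in an isolated subprocess with a 4 GB address-space cap,
# timed wall-clock end-to-end.
LIMIT_BYTES = 4 * 1024**3  # RLIMIT_AS, bytes


def _cap_memory():
    # Runs in the child between fork and exec.
    resource.setrlimit(resource.RLIMIT_AS, (LIMIT_BYTES, LIMIT_BYTES))


def run_benchmark(argv):
    start = time.monotonic()
    proc = subprocess.run(argv, preexec_fn=_cap_memory, capture_output=True)
    return time.monotonic() - start, proc.returncode


if __name__ == "__main__":
    elapsed, rc = run_benchmark([sys.executable, "-c", "print('ok')"])
    print(f"rc={rc} elapsed={elapsed:.3f}s")
```

Using `RLIMIT_AS` rather than a cgroup keeps the harness dependency-free, at the cost of counting virtual rather than resident memory.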
Correctness is checked by flag recovery: each CTF binary has a known flag, and a benchmark passes only if the engine produces inputs that reach it. The Rust engine captures the same flag as Python on every benchmark in the table below. Stronger differential checks (state-by-state, constraint-by-constraint) are future work.
| Benchmark | Python (s) | Rust (s) | Speedup |
|---|---|---|---|
| csaw_wyvern | 16.79 | 1.32 | 12.7x |
| ekopartyctf2016_rev250 | 33.38 | 4.48 | 7.5x |
| flareon2015_5 | 53.07 | 7.41 | 7.2x |
| defcamp_r100 | 1.09 | 0.23 | 4.7x |
| ais3_crackme | 2.71 | 0.98 | 2.8x |
| whitehatvn2015_re400 | 3.47 | 1.36 | 2.5x |
| codegate_2017-angrybird | 7.14 | 2.82 | 2.5x |
| sym-write | 1.06 | 0.45 | 2.4x |
| strcpy_find | 0.90 | 0.48 | 1.9x |
| csgames2018 | 1.70 | 1.07 | 1.6x |
| google2016_unbreakable_1 | 2.68 | 1.83 | 1.5x |
| google2016_unbreakable_0 | 1.39 | 0.98 | 1.4x |
| flareon2015_10 | 7.62 | 5.51 | 1.4x |
| fauxware | 0.40 | 0.42 | 0.95x |
| flareon2015_2 | 4.51 | 4.80 | 0.94x |
| securityfest_fairlight | 10.47 | 20.54 | 0.51x |
The Rust engine is now faster than the Python engine on 13 of the 16 benchmarks! I had hoped to have this done in time to submit a paper to SynSec, but maybe next year. The latest Rust engine lives on the rust-symex branch.
Three benchmarks are still slower: fauxware (0.95x), flareon2015_2
(0.94x), and securityfest_fairlight (0.51x).

fauxware (0.95x) and flareon2015_2 (0.94x) – both are 5-6% deltas; at N=1
these are most likely run-to-run noise rather than real regressions. I would
want N>=5 before reading meaning into them.
securityfest_fairlight (0.51x) – the Rust VEX interpreter is slower than
Python’s SimEngine on symbolic-heavy blocks. Fairlight has 8,102 VEX
statements and 28 rand() calls producing symbolic values that flow through
arithmetic; 13.9s of the runtime is in block_exec alone. We already moved
rand() to native (saved 9s of Python callback overhead) and pushed exit()
to a fast-path deadend (saved 3s of state creation), but the raw Rust
interpreter grew proportionally. The root cause is that Python’s SimEngine
benefits from claripy’s backend: claripy constructs Z3 ASTs via the C API with
automatic simplification during solver preprocessing, while our Rust
interpreter builds RustBV::Expression trees that get converted to Z3 ASTs
later via build_z3_ast(). For symbolic-heavy blocks, that extra indirection
loses to Python’s direct path; Z3 solving also dominates here, with exit times
of 56ms, 77ms, 283ms, then a jump to 3.8s at the SAT phase transition. Fixing
fairlight would require either JIT compilation, building Z3 ASTs directly
during interpretation (skipping the RustBV layer), or more aggressive
expression simplification than the 17 eager rules we currently run.
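To make "eager rules" concrete, here is a toy sketch in the spirit of construction-time simplification – two illustrative rewrites (x + 0 → x, x ^ x → 0), not the engine's actual 17-rule set or its `RustBV` types:

```rust
// Toy expression tree with construction-time simplification
// (illustrative only; not the engine's RustBV implementation).
#[derive(Clone, Debug, PartialEq)]
enum Expr {
    Const(u64),
    Sym(String),
    Add(Box<Expr>, Box<Expr>),
    Xor(Box<Expr>, Box<Expr>),
}

// Smart constructors apply rewrites eagerly, so the simplified form is
// what eventually gets lowered to a Z3 AST.
fn add(a: Expr, b: Expr) -> Expr {
    match (a, b) {
        (Expr::Const(0), x) | (x, Expr::Const(0)) => x, // x + 0 -> x
        (Expr::Const(a), Expr::Const(b)) => Expr::Const(a.wrapping_add(b)),
        (a, b) => Expr::Add(Box::new(a), Box::new(b)),
    }
}

fn xor(a: Expr, b: Expr) -> Expr {
    if a == b {
        Expr::Const(0) // x ^ x -> 0
    } else {
        Expr::Xor(Box::new(a), Box::new(b))
    }
}

fn main() {
    let x = Expr::Sym("x".into());
    assert_eq!(add(x.clone(), Expr::Const(0)), x);
    assert_eq!(xor(x.clone(), x.clone()), Expr::Const(0));
    println!("ok");
}
```

The fairlight problem is that rewrites like these fire per node, while claripy gets deeper simplification for free during Z3's own preprocessing.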
None of these are blockers for the port’s usefulness – they just mark the current state.
What did not work
The autonomous loop was not as hands-off as I wanted. Three failure modes kept interrupting runs that should have needed no supervision:
- OOM cascades. The loop had per-subprocess RLIMIT_AS=4GB, but no guardrail on the Claude session itself. A memory-heavy task would crash the session mid-iteration; Claude would lose whatever context it had and leave partial changes in the working tree.
- Regression cascades. Fixes frequently cascaded into regressions on other benchmarks because we never set up a pre-commit gate running the full suite. By the time Claude caught a regression it was often two or three commits deep.
- Status-check tax. Over the month, I manually checked in using a separate Claude session to gauge the progress of the agentic worker. Claude sometimes misread the system state (reporting a process as stopped when it was still running, suggesting wrong causes for OOM kills), which turned a status check into another round of debugging.
Lessons learned
I have learned a lot throughout this project. When I read Lalit’s blog
post a couple of weeks ago,
several of his lessons hit close to home. Like him, I ended up throwing away my
first attempt to switch architectures (rust-engine-v2 → rust-symex). Like
him, I shifted roles over the course of the project – early on I acted as a
semi-technical manager delegating design and implementation; by the
evolved-approach phase I was making the architectural calls (shared Z3 context,
proxy state) and letting the loop execute. And like him, I felt the pull of a
“slot machine” – watching the loop tick and being tempted to kick off another
plan/implementation cycle. Beads really helped break this cycle. A few
additional thoughts:
- Reference implementations are insanely helpful to save tokens. Creating a Rust symbolic execution engine from scratch would have been much harder (although I have not run that experiment). Having Claude distill lessons from the Python engine was a huge source of quick wins. In “A Deepness in the Sky” by Vernor Vinge, there are “programmer archaeologists” who write programs by stitching together existing code. I think this future is far more likely now.
- Bugs in libraries hurt. In the old days, if you encountered and reported a bug in a library, you could plausibly pivot to something else while you waited for the upstream maintainer to fix it – diving into someone else’s code had a high upfront cost. When you are developing with AI and the library’s maintainers are not, the speed mismatch makes forking inevitable. A remaining challenge: how do we get changes back into upstream libraries when not every library is some form of vibe-maintained?
Two memories that were both counterintuitive enough for the agentic loop that they are worth calling out:
- avoid-check-assumptions. `check_assumptions()` looks like it should be faster than push/assert/check/pop (one call instead of four). It is actually 2x slower because it prevents Z3 from reusing its internal bit-blasting cache between calls. Fairlight went from 14s to 28.6s when we tried it. Completely counterintuitive.
- flareon10-three-bugs. One benchmark, three independent bugs stacked on top of each other: (1) wrong endianness flag for memory sync, (2) memory writes wider than 128 bits silently truncated by `u128`, (3) `step_func` called once instead of per-step. Each fix revealed the next. Without the memory of all three, we would have “fixed” it and been mystified when it broke again.
Next steps
I do not expect to see the Rust Symbolic Execution engine merge into Angr in its current form. There’s some junk that accumulated in the repo, and the z3-rs fork – a minimal patch that lets Rust borrow Python’s Z3 context so AST pointers pass through without translation – should be upstreamed before anything else can be. What I hope comes from this effort is a proof-of-concept that demonstrates the utility of a Rust engine with enough pitfalls and lessons documented to reduce the cost of building a more polished version that can be merged.
Lincoln supposedly said that if he had six hours to chop down a tree he would spend the first four sharpening the axe. If I picked this up again, the loop itself would be the first thing I would rebuild: a more self-healing wrapper (automatic restarts after OOMs and exhausted session rate limits), plus a structured-output status channel that pings me only when something actually changes. These tweaks are small relative to the time they would save on the next port.
Rewriting Angr’s core symbolic execution engine in Rust could be worth the upfront cost. Agentic workflows with persistent memory and a reference implementation mean it has never been a better time to start.