Forge end-to-end testing

Recap

A few months ago, I was using Claude to fix bugs and add features to Forge, a tiling window manager for Gnome that I now use every day. The last “piece” was robust end-to-end (E2E) testing to reduce the need for manual testing. I was making progress, but with Forge already working great for me there was little sense of urgency, so I decided to focus on other projects instead. With those other projects now humming along, I decided to revisit Forge.

Scaffolding

I have started to use beads to track state across sessions and decided to try to incorporate them into my Forge work. The first step was to populate the backlog based on the initial state of my fork. I started by prompting Claude:

we have just started using beads to track issues and norms for this repo. do a comprehensive review of the repo (including the commit history) from multiple angles to determine all the beads that we should create and all the memories that we should store. use a swarm of subagents.

I reviewed the plan and everything looked reasonable. Claude also found a .todos/ directory that had snuck into the repo a while ago (where I had been tracking what to do next), which we removed. After this initial set of beads, I downloaded the E2E logs from GitHub and asked Claude to triage the failures to identify root causes and create beads as needed. This led to a healthy backlog to implement.

When I first started working on E2E tests, I did not have a robust sandbox so I was reluctant to use Claude’s --dangerously-skip-permissions flag. This made creating the E2E tests, even when starting from an example, take forever as I monitored and approved each action. Now that I can use abox and Claude has an auto mode, I felt confident letting Claude iterate on E2E tests without supervision during implementation. However, for the first few iterations, I was still stuck in the loop. Specifically, since I still “own” push/pull operations and did not have E2E tests running locally, the way I tested was to push to GitHub and download the logs when CI/CD finished. This was awful for many reasons, so after a few iterations I paused to get E2E testing working locally. This was relatively simple, as abox allows you to run Docker containers inside the VM (although it does require baking abox’s CA certificate into the containers for them to build properly). Once E2E tests were running locally, we were ready to iterate.

At its core, this is essentially an observe-orient-decide-act (OODA) loop. Claude uses these kinds of feedback loops all the time – it generates some code, runs it through the compiler, parses the errors, and sets up the next edit. In this context, instead of using a compiler to report errors, we are using the full E2E suite. This allowed me to adjust my planning so that the end “goal” was to work towards an E2E suite that passes on all supported Gnome versions.
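To make the shape of that loop concrete, here is a minimal sketch in Python. It is not how Claude or the Forge test harness is actually wired up: the function names, the version labels, and the fake suite below are all hypothetical stand-ins. The only idea taken from the text is that the loop observes the full E2E results per Gnome version, acts on the failures, and stops when every version is green.

```python
from typing import Callable, Dict, List, Optional

# Maps a Gnome version label to the names of the tests that failed on it.
Failures = Dict[str, List[str]]


def iterate_until_green(
    run_suite: Callable[[], Failures],
    apply_fix: Callable[[Failures], None],
    max_iterations: int = 10,
) -> Optional[int]:
    """Return the iteration on which the suite first passed everywhere,
    or None if the iteration budget ran out first."""
    for iteration in range(1, max_iterations + 1):
        # Observe: run the full suite and keep only versions with failures.
        failures = {v: tests for v, tests in run_suite().items() if tests}
        if not failures:
            return iteration  # goal reached: every version is green
        # Orient, decide, act: hand the failures to whatever makes the next edit.
        apply_fix(failures)
    return None


# Hypothetical stand-ins for the real suite and the real "fixer".
pending: Failures = {"gnome-48": ["test_a"], "gnome-49": ["test_a", "test_b"]}


def fake_run_suite() -> Failures:
    return {version: list(tests) for version, tests in pending.items()}


def fake_apply_fix(failures: Failures) -> None:
    # Pretend each iteration fixes one failing test on every version.
    target = next(iter(failures.values()))[0]
    for tests in pending.values():
        if target in tests:
            tests.remove(target)


iterations_needed = iterate_until_green(fake_run_suite, fake_apply_fix)
```

The iteration budget matters in practice: without it, a loop like this can spin indefinitely on a failure it cannot fix.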

Aside: we are running Claude in a sandbox with trusted data (i.e., data that we generate ourselves). Agentic loops that utilize untrusted data have significant challenges that sandboxing can only partially mitigate.

With this scaffolding all in place, I was able to set broader goals during planning sessions (e.g., work towards a fully functioning E2E test suite based on the open beads), let Claude pick a theme, and then iterate on the implementation. In the last few weeks, I have started asking Claude to peer review every plan before it presents it to me – this seems to reduce the number of iterations before I am ready to approve a plan.

Rabbit Hole

Early on during implementation, Claude became convinced that Forge was fundamentally broken on Gnome/Mutter 49+ (which dropped support for X11):

Mutter 49 (Aug 2025) added Wayland configure-path tightening: xdg_toplevel edge constraints + a new should_configure() check in src/wayland/meta-window-wayland.c (commit 6c7565eee5) infer tile state from geometry that matches workspace edges. For a window whose programmatic geometry equals workspace dimensions minus a small chrome offset (~16px), Mutter latches it into tile_mode (META_TILE_LEFT/RIGHT/MAXIMIZED) and silently overrides subsequent move_resize_frame() width/height — position applies, size is rejected. Forge’s default tile geometry exactly matches this pattern, causing 13/90 E2E tests to fail on Fedora 43 (Mutter 49.5).

Claude was stuck here for quite a while. First, I gave it access to the Mutter source code locally so that it could cross-check its assumption. Second, I asked it to give me a prompt for Gemini’s Deep Research to explore but that seemed to only reinforce what Claude had already determined. Claude was convinced that we needed to submit an upstream issue to request a meta_window_untile API to fix the issue. I did not believe that this could be true and continued to push back. This led to Claude instrumenting the extension, running single tests in isolation, and then finally bisecting the tests to find that state was being contaminated across E2E tests. After further debugging, Claude discovered the major underlying issue: the way we were feeding keyboard input (through Clutter’s VirtualInputDevice) for super+c was incorrectly causing the window to snap into tiling. Claude updated the E2E tests to invoke actions via D-Bus instead (cutting out the problematic keyboard inputs) and most of the tests passed. It also found and fixed a minor bug in Forge itself where the layout was not reset after the last window closed. There was never an issue with Mutter.
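The bisection step is worth spelling out, since it is what finally broke the deadlock. The following is a minimal Python sketch of the general technique, not the procedure Claude actually ran: it assumes a single earlier test leaves behind the state that breaks a later "victim" test, that results are deterministic, and that the polluter causes the failure on its own. The test names and the `passes_after` callback are hypothetical.

```python
from typing import Callable, List, Optional, Sequence


def find_polluter(
    earlier_tests: Sequence[str],
    passes_after: Callable[[Sequence[str]], bool],
) -> Optional[str]:
    """Find the earlier test whose leftover state makes the victim test fail.

    `passes_after(prefix)` must run the tests in `prefix` in a fresh session,
    then run the victim, and return True if the victim passed.
    """
    # If the victim fails in isolation, contamination is not the cause.
    if not passes_after([]):
        return None
    # If the victim passes after the whole list, nothing was reproduced.
    if passes_after(earlier_tests):
        return None
    candidates: List[str] = list(earlier_tests)
    while len(candidates) > 1:
        first_half = candidates[: len(candidates) // 2]
        if passes_after(first_half):
            # Victim survived the first half, so the polluter is in the rest.
            candidates = candidates[len(first_half):]
        else:
            candidates = first_half
    return candidates[0]


# Hypothetical example: one of four earlier tests leaves state behind.
earlier = [
    "test_float_toggle",
    "test_stacked_layout",
    "test_tabbed_layout",
    "test_workspace_move",
]


def fake_passes_after(prefix: Sequence[str]) -> bool:
    return "test_tabbed_layout" not in prefix


polluter = find_polluter(earlier, fake_passes_after)
```

Each probe costs one run of part of the suite, so the search takes roughly log2(n) runs rather than n. The two early checks matter as much as the search itself: a victim that fails on its own points away from contamination entirely.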

Oof Loop

After emerging from the rabbit hole, Claude iterated towards E2E tests that passed across all the Gnome versions. There was a minor issue with one version of Fedora but Claude was able to figure it out after a nudge to start the working and broken containers side by side and compare them. After that, we entered the “Oof loop” phase where the tests would pass locally but not always in CI/CD. The issues stemmed from either:

Timing differences between local machine and CI/CD infrastructure.
Applying fixes with only spot checks for regressions instead of a full local E2E run.

This was definitely the most tedious part of the process. CI/CD takes 10+ minutes to run so the throughput for iterations is a few per hour at best.
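For the timing class of failures, one common mitigation (an assumption on my part here, not necessarily what the Forge tests ended up doing) is to poll for the expected state with a deadline instead of sleeping for a fixed interval that happens to be long enough on a fast local machine. A minimal Python sketch, with hypothetical names:

```python
import time
from typing import Callable


def wait_until(
    condition: Callable[[], bool],
    timeout_s: float = 5.0,
    interval_s: float = 0.05,
) -> bool:
    """Poll `condition` until it returns True or `timeout_s` elapses.

    Returns True as soon as the condition holds, False on timeout.
    The condition is always checked at least once.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

A test would then assert on something like `wait_until(lambda: window_is_tiled(), timeout_s=10.0)`, where `window_is_tiled` is a hypothetical check. On a fast machine this returns almost immediately, and on slower CI/CD runners it simply waits longer, up to the timeout.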

End State

There is still a bit more work to do on the E2E test infrastructure but I am happy with how it is progressing. One of the nicer features that we added was a recording feature so that we could see the E2E tests in action.

I pushed the beads for the repo to DoltHub. I am still unsure how these should interact with GitHub issues. At some point soon, I want to ingest all the issues from the upstream repo into beads to capture anything that I might have missed. Until then, I will keep capturing (and fixing) bugs as I encounter them.
