Did you know that SQLite has 590x more test code and test scripts than the library itself? I did not, at least until last week (thanks, HN). From How SQLite is Tested:

As of version 3.42.0 (2023-05-16), the SQLite library consists of approximately 155.8 KSLOC of C code. (KSLOC means thousands of “Source Lines Of Code” or, in other words, lines of code excluding blank lines and comments.) By comparison, the project has 590 times as much test code and test scripts - 92053.1 KSLOC.

For a library as important as SQLite, this level of rigorous testing makes sense – many important systems use SQLite and bugs that may lead to data loss (or worse, corruption) should be avoided. But what about less “critical” libraries? Like those supporting all modern digital infrastructure?

xkcd 2347.

Should we not test all code like SQLite? We do not know where code will end up being used – why not build the test infrastructure alongside the code? Put simply: who has time for that? Developers are notoriously bad at writing tests for their own code. AI offers a new possibility – what if we could use it to test all code as thoroughly as SQLite?

In my experiments with Forge, I was nervous about letting AI change things without a way to confirm, short of manually exercising the extension, that it still worked correctly. So, in parallel with fixing issues and adding features, I worked to generate a comprehensive test suite. I mildly guided Claude but, for the most part, I let it figure out a test plan, implement the tests, and then update the plan for the next phase. This worked well and even uncovered several bugs in the process. However, the number of tests (and LoC of tests) grew quickly, with unknown “quality”. Early on, I did not even bother measuring coverage, and when I finally got around to enabling codecov, coverage was around 55% (up to 62% now). This was better than I expected, and I decided to spend some tokens refining the tests. Specifically, at commit #38 (db36193), I asked Claude to review the tests and focus on testing behaviors, not implementation. This removed ~190 tests, which caused a small hit to branch coverage, but the other coverage metrics were largely unchanged.

Test coverage over time (commits).

Over the following commits, I continued to refine the testing suite. There is no doubt that the testing infrastructure introduced a ton of code into the repo (fortunately, not 590x):

LoC over time (commits). LoC based on cloc.

Regression Tests

One of my goals in building the testing infrastructure was to determine whether bugs were fixed in my latest code by whether the tests pass, rather than by manual testing. Since I had already iterated through the issue board several times, I asked Claude to generate a list of issues suitable for conversion into regression tests. I did not try too hard to turn every such issue into a unit test (and a few were duplicates), but Claude was able to implement a total of 26 regression tests (which grew the test suite further). I still need to finish this experiment and confirm that these tests fail on the earlier version of the repo.
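That confirmation step can be scripted: keep today's test suite in place but swap the sources back to a pre-fix commit, and check that the regression tests now fail. A minimal sketch, where the source directory (lib/) and the test command (npm test) are assumptions about the repo layout, not Forge's actual structure:

```shell
# Hypothetical sketch: run the current regression tests against older code.
# The lib/ path and `npm test` command are assumptions about the repo layout.
check_regressions_fail() {
  local old_commit="$1"                 # a commit from before the bug fixes
  git checkout "$old_commit" -- lib/    # old sources, current tests
  if npm test; then
    echo "unexpected: tests still pass on the old code"
  else
    echo "tests fail on the old code, as expected for regression tests"
  fi
  git checkout HEAD -- lib/             # restore the current sources
}
```

If the suite passes on the old code, the "regression" tests are not actually pinning the fixed behavior and need another look.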

End-to-end Testing

One of the biggest challenges with testing a Gnome extension is that it often requires a lot of manual testing. Here’s what I asked Gemini to explore via deep research:

i have a gnome extension that manages tiling windows. i have a lot of unit tests with mocked APIs but it would be nice to do automated end-to-end tests without needing a human in the loop or a GUI so that i can put this in CI/CD. What are my options? The repo is here: https://github.com/jcrussell/forge

This pointed me towards the gnome-shell-pod repo and I used Gemini to generate an implementation plan for Claude. I pasted this into a planning session and then let Claude work. This took forever (many hours) but it eventually figured out how to implement end-to-end tests. I probably could have helped it accomplish this faster but I was focusing on other things. I am so glad I did not have to figure this out and I could let Claude do the trial-and-error.
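From what I understand of the gnome-shell-pod approach, the shape of an E2E run is: start a containerized GNOME Shell, install the packed extension into it, drive it from outside, and tear down. A rough sketch, where the image tag, extension UUID, and in-container test script are all my assumptions rather than Forge's actual scripts:

```shell
# Hypothetical sketch of an E2E run against a gnome-shell-pod container.
# The image tag, extension UUID, and run-tests.sh script are assumptions,
# not the actual files from the Forge repo.
run_e2e() {
  local image="ghcr.io/schneegans/gnome-shell-pod-46"  # assumed image tag
  local uuid="forge@jmmaranan.com"                     # assumed extension UUID

  podman run --rm -d --name forge-e2e "$image"

  # Copy the packed extension into the container, then install and enable it.
  podman cp "${uuid}.zip" "forge-e2e:/home/gnomeshell/"
  podman exec forge-e2e gnome-extensions install "/home/gnomeshell/${uuid}.zip"
  podman exec forge-e2e gnome-extensions enable "$uuid"

  # Drive window operations and assert on the resulting layout from inside
  # the container, then tear everything down.
  podman exec forge-e2e "/home/gnomeshell/run-tests.sh"  # hypothetical script
  local rc=$?
  podman stop forge-e2e
  return "$rc"
}
```

Because everything runs headless inside a container, the same function can run in CI without a human or a GUI in the loop.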

The commit introducing the E2E testing is on the dev branch. It took long enough to implement that I did not get to experiment with it as much as I would have liked.

CI/CD

Whenever I set up CI/CD on a repo that has not had it before, I usually prepare myself for a world of pain. One of the coolest discoveries this time was that gh can pull CI/CD logs, which meant I could create a simple loop: push, check whether CI/CD passed, and fix it if it did not. I probably could have replicated the CI/CD stages locally (and should in the future), but this was a surprisingly easy process. The only downside was that gh does not have a long-polling mechanism (at least, none that Claude was able to find) to wait for CI/CD to complete.
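The push-and-check loop can be sketched with real gh subcommands (gh run list, gh run view); the polling interval and the overall shell structure are my own framing of the workflow, not the exact loop used in the repo:

```shell
# Sketch of the push-and-check loop. The gh subcommands and flags are real;
# the structure around them is my own framing of the workflow.
watch_ci() {
  git push

  # Grab the ID of the most recent workflow run.
  local run_id
  run_id=$(gh run list --limit 1 --json databaseId --jq '.[0].databaseId')

  # Poll until the run completes.
  until [ "$(gh run view "$run_id" --json status --jq '.status')" = "completed" ]; do
    sleep 30
  done

  # On failure, pull only the failing step logs so the fix-it pass has
  # something concrete to work from.
  if [ "$(gh run view "$run_id" --json conclusion --jq '.conclusion')" != "success" ]; then
    gh run view "$run_id" --log-failed
    return 1
  fi
}
```

Handing the output of `gh run view --log-failed` back to the model closes the loop: push, wait, and feed failures back in until the run goes green.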