Defect Injection testing - the lost art of 'Bedbugging'


“Quis custodiet ipsos custodes” - Juvenal

Let’s get meta: how to test the tests?

One of the most irksome questions I’ve come across in QA (and automation in general) is figuring out whether your tooling actually does what you expect. In the QA realm, this is summed up as: are my tests (or manual testers) going to catch the bugs that I expect them to catch?

Colleagues and friends tell tales of that time they ripped out huge swathes of test code, only to discover that it didn’t matter. Bugs would slip right through the cracks, defeating the entire purpose, because they weren’t really testing anything but patience. It’s like cargo cult science - if you write automated tests, bugs just disappear. Right?

So. Wrong. Never mind that automation code isn’t special; it rots just like any other code, calling forth the supervillain of the software world: maintenance. The friction this adds to the average developer’s workflow alone leads to negative behaviours, such as skipping tests when no one is watching.

Besides happenstance and heroism, I’ve wondered how one could programmatically address this problem (and slim down test suites in the process). I took inspiration from Chaos Engineering a la Netflix’s Chaos Monkey to demonstrate one method: defect injection aka bedbugging.

Defect Injection demo - the setup

Let’s start with the Conduit application, a Medium clone, and a suite of Cypress E2E tests (link to code is coming) that cover some basic functionality. To create defects, we’re going to mangle the responses & response codes from the APIs that we hit during the Cypress tests. The expectation is that our tests should catch the errors that we’ve created, and/or we see a graceful partial failure in line with expectations. We’ll use MITMProxy to make this a reality.

Since we only care about the APIs that are called during a test run, I’ve used the “automocker” plugin, or better yet, the “autorecord” plugin to identify the endpoints that we care about in each case. Note that there are a bunch of other ways to do this using traffic collection or proxies.

(Figure: Example of captured requests)

Here’s a graphic of the APIs called in each functional test:

(Figure: APIs we hit)
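If you’d rather have that mapping as data than as a graphic, here is a minimal sketch of one way to build it. Everything about the input is an assumption: the capture format (one tuple of test name, method, and URL per request), the `conduit.example` host, and most of the test names are made up for illustration, and this is not what the autorecord plugin emits.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical capture format: one (test name, HTTP method, URL) tuple per request.
# The host and most test names are invented for this example.
captured = [
    ("can log in", "POST", "https://conduit.example/api/users/login"),
    ("can see popular tags on homepage", "GET", "https://conduit.example/api/tags"),
    ("can see article feed", "GET", "https://conduit.example/api/articles?limit=10&offset=0"),
    ("can see article feed", "GET", "https://conduit.example/api/tags"),
]


def endpoints_by_test(records):
    """Group captured requests into {test name: sorted list of 'METHOD /path'}."""
    grouped = defaultdict(set)
    for test_name, method, url in records:
        # Keep only the path so the same route with different query strings collapses.
        grouped[test_name].add(f"{method.upper()} {urlsplit(url).path}")
    return {test: sorted(paths) for test, paths in grouped.items()}


if __name__ == "__main__":
    for test, paths in endpoints_by_test(captured).items():
        print(test, "->", paths)
```

The result tells you which routes are worth mangling for each test, which is all the demo below needs.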

Now, we’ll fire up MITMProxy in its standard HTTP mode. This was tricky for me, but I’m not good at this. Let me know if posting my notes would help you… Let’s set up a filter to make it easier to see what we’re doing:

f:conduit|api

MITMProxy can be scripted to do some pretty fascinating stuff automatically. We’re going to keep it simple and just intercept requests:

i:tags
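For the curious, here is a rough sketch of what a scripted version could look like as a mitmproxy addon. It is not what the demo uses: the demo pauses requests interactively, whereas this rewrites matching responses into a 500 with a junk body. The class name `DefectInjector`, the file name, and the error payload are all made up for illustration.

```python
import json


class DefectInjector:
    """mitmproxy addon sketch: mangle responses whose path contains a fragment."""

    def __init__(self, path_fragment, status_code=500, body=None):
        self.path_fragment = path_fragment
        self.status_code = status_code
        # Hypothetical error payload; swap in whatever defect you want to seed.
        self.body = body if body is not None else {"errors": {"injected": ["defect"]}}

    def response(self, flow):
        # mitmproxy calls this hook once the full server response has been read.
        if self.path_fragment not in flow.request.path:
            return  # leave unrelated traffic untouched
        flow.response.status_code = self.status_code
        flow.response.text = json.dumps(self.body)


# mitmproxy loads addons from this module-level list.
addons = [DefectInjector("/tags")]
```

Saved as, say, `inject_defects.py`, this would be loaded with `mitmdump -s inject_defects.py`. Rewriting the response rather than holding the request tests a different failure mode (a fast error instead of a hang), so it’s worth trying both.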

Demonstrating Defect Injection for fun and profit

We’re ready to rock. This screen recording shows how defect injection works with this setup.


Here’s what you’re seeing:

  1. run an E2E test suite that covers homepage functionality, such as: login, article content, user settings, and popular tags. This runs without error.
  2. intercept the /tags route, then run the E2E homepage test suite again. You’ll notice that the “popular tags” don’t show up, and we catch that with our final test:
    it("can see popular tags on homepage", function() {
      cy.wait(500);
      cy.get(".tag-list").find('.tag-pill').should('have.length.greaterThan', 1)
    });
  3. intercept the /articles route, then run the E2E homepage test suite again. You’ll notice that the feed loads forever, and we don’t get a test failure. So this is an opportunity to think through what the app should do in this state, or to put a test in place that requires articles to load in under 5 seconds, say.

Note: in the 2nd example, if we’d just checked that the class “tag-list” has loaded, we’d have missed this error:

(Figure: Tag list is easy to miss)

I’d argue that this feature isn’t essential so the graceful failure is pretty solid, even without an error message in the UI.


Concluding musings

Hopefully the defect injection strategy to test your tests sparks your interest, too. There are some interesting parallels to Chaos Testing, though there’s something nice about how this doesn’t require you to actually take down services or machines - and also how directly it maps to the actual user experience. I imagine this would also have some interesting applications to root cause analysis or reproducing nasty bugs, since we have pretty good control and visibility at the API layer.

The next step here would be to automate the extraction of the APIs hit by each E2E test, and to programmatically mangle responses using MITMProxy’s scripting facility. It would also be good to make the proxy setup a little bit easier or better documented. So do let me know if that would be useful to your organization so I feel like finishing it :P
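To make that next step a bit more concrete, here is a minimal sketch of the outer loop: inject one defect at a time, run the suite, and note whether anything failed. The `run_suite` callable is hypothetical; in practice it would switch on a proxy rule for the endpoint and launch the Cypress run, which is exactly the part I haven’t built yet.

```python
def bedbug_report(endpoints, run_suite):
    """Inject one defect at a time and record whether the suite noticed.

    run_suite(endpoint) is a caller-supplied function (hypothetical here) that
    runs the E2E suite with that endpoint mangled and returns True if every
    test passed.
    """
    caught, missed = [], []
    for endpoint in endpoints:
        suite_passed = run_suite(endpoint)
        # A passing suite means the injected defect slipped through unnoticed.
        (missed if suite_passed else caught).append(endpoint)
    return {"caught": caught, "missed": missed}


if __name__ == "__main__":
    # Stand-in runner mirroring the two outcomes from the demo:
    # the /tags defect failed a test, the /articles defect did not.
    demo_outcomes = {"/tags": False, "/articles": True}
    print(bedbug_report(["/tags", "/articles"], demo_outcomes.get))
```

The “missed” list is the interesting output: each entry is either a test that needs writing or a conscious decision that the failure is graceful enough.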

One last thing: this doesn’t just apply to automated testing; it can also be used to check that manual test scripts are working (and being run at all - a fear that many who outsource QA have). Turns out a similar technique, called “bedbugging”, was employed to ensure that bored radar operators didn’t miss rare events - spawning a popular software engineering test coverage technique. Implementing this with a manual test team, and learning from the fault seeding efforts of yesteryear, would be super interesting as well!


Copyright 2024-infinity, Paul Pereyda Karayan. Design by Zeon Studio