No matter how much time you spend on your test automation, you'll inevitably reach a point where one or more of your tests begin to fail intermittently. And like a cruel joke from the Universe, it usually happens at the worst possible time, like right before a critical deadline or in the middle of your busiest week of the year.

When this situation occurs, many testers take the easy way out and block out the noise from the failing tests. The usual response is to mark the failing test as pending so it doesn't run with the rest of the test suite. Occasionally, some people outright delete the test without a second thought. Some teams even go as far as to leave the failing test in place but silence all notifications telling them about the broken build.
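
For example, in a Python suite using pytest, "marking a test as pending" usually looks something like the following sketch (the test name and reason are hypothetical):

```python
import pytest


# Skipped "temporarily" - the test no longer runs with the suite,
# and the failure it was reporting disappears from view.
@pytest.mark.skip(reason="Fails intermittently - will fix next week")
def test_checkout_applies_discount():
    ...
```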

It always begins with what seems like a justifiable decision. The testers making these decisions usually have some reasoning - or excuses, if you prefer - for going down this route. One of the most popular justifications for disabling a test or shutting down notifications is, "Oh, don't worry - I'll fix that next week when I have some time."

You can guess what usually happens next. Next week becomes the week after, which becomes next month, which becomes three months, and so on. In the meantime, more flaky tests are bound to appear in the test suite. Since the team has already deemed it acceptable to disable or discard anything that interrupts their work, the new offenders join the ever-expanding list of pending or deleted tests.

Eventually, it reaches a point where nobody on the team bothers to fix the unstable tests. Everyone stops trusting the test suite altogether. No one's even sure what value it's providing, and it becomes just another thing burdening the team. They went from one test that occasionally failed to a test suite that no one wants to look at anymore. It may sound like a stretch, but I've experienced this more than once.

There are times when a flaky test is no longer useful, and it should be removed since it's not providing any value to the test suite. However, the problem isn't the flaky test or the intermittent failures themselves. The problem is ignoring the issue for an extended period, with no concrete plan to correct it. One skipped test can render your test suite useless in no time.

The Broken Windows Theory

In the field of criminology, there's a theory called the Broken Windows Theory that discusses how leaving minor offenses unattended, like vandalism and loitering, leads to more severe crimes. The theory's name comes from observations that if a building has a broken window left unrepaired for too long, the rest of the building's windows eventually reach the same state. In some people's minds, having one broken window permits them to damage others. We've all had moments where we've thought, "I already did this once, who cares if I do it again?"

The theory became popular in software development circles thanks to the book The Pragmatic Programmer. Authors Andrew Hunt and David Thomas used the idea to discuss how leaving "broken windows" in a codebase, like poor code decisions and inefficient implementations, can "rot" the code faster than any other problem. One excused "hack" leads to another, and then another, until your codebase is an unmaintainable mess.

The advice given in The Pragmatic Programmer is to limit the damage by "boarding up" the windows - comment out offending code, use dummy data, display "Not Implemented" messages, and so on. However, this approach doesn't entirely fix the problem, especially when the only intention is to cover up the issue with no real plan to address the underlying cause.

It's the same problem with our automated test suites. If you leave a failing test unattended for a long time, it signals to the rest of your team - consciously or unconsciously - that it's okay for them to do the same. When you cover up flaky tests by marking them as "pending" or by using incorrect data or mocks to force them to pass, you hide the problem and decrease the overall quality of your work. "Boarding up" your broken windows solves the immediate issue, but sooner or later, you'll have to replace the window, or you'll risk having a structure full of covered-up problems.

What to do if you're on a test automation project with too many "broken windows"

As testers, we'll all find ourselves in situations that are less than ideal. Whether you're joining a new team with a patched-up test suite or your current test suite has deteriorated into an intolerable mess no one wants to touch, these tips can help you course-correct and get your automation back to a valuable state.

Fix problems first before adding new tests

Continuing to add new tests when something's not working right is one of the biggest mistakes teams can make with their test suite. Testers and developers alike want to stay on schedule, so they put up with the broken things around them for the sake of getting this sprint's work done. What ends up happening is that lots of lousy code slips through, digging the team deeper and deeper into a hole they can't climb out of.

Successful test automation relies on long-term maintenance. The longer you ignore a problem in your environment, the harder it becomes to fix. Repairing a flaky test that someone introduced to the test suite a few days ago is much easier than repairing the same test six months later. By then, you'll likely have forgotten the details of the test case and will have to spend hours rebuilding the context needed to solve the issue.

If your team has the habit of creating new tests while ignoring issues in the test suite, put a stop to it as soon as possible. Establish a workflow where fixing broken tests takes priority over adding new ones. That doesn't mean the whole team has to drop everything whenever a test breaks - it's enough that someone takes ownership of getting the build back to a stable state. The point is not to leave any windows "broken" for too long.
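
One lightweight way to enforce that habit is to fail the build when the number of skipped tests crosses a threshold your team agrees on. Here's a minimal sketch in Python, assuming your test runner emits a JUnit-style XML report (the file name and budget below are hypothetical):

```python
# check_skip_budget.py - fail the build if too many tests are skipped.
# Assumes a JUnit-style XML report, e.g., pytest --junitxml=report.xml.
import sys
import xml.etree.ElementTree as ET

SKIP_BUDGET = 3  # hypothetical limit agreed on by the team
REPORT_PATH = "report.xml"  # hypothetical report location


def main() -> int:
    root = ET.parse(REPORT_PATH).getroot()
    # JUnit-style reports mark skipped tests with a <skipped> child element.
    skipped = len(root.findall(".//testcase/skipped"))
    if skipped > SKIP_BUDGET:
        print(f"{skipped} skipped tests exceed the budget of {SKIP_BUDGET}. "
              "Fix or delete some before adding new ones.")
        return 1
    print(f"{skipped} skipped tests - within the budget of {SKIP_BUDGET}.")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

Running a check like this as a build step right after the tests turns "we'll fix it later" into a decision the whole team has to confront.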

Delete failing tests that have stuck around for too long

Software projects and their automated tests are living, breathing systems. They change frequently, gaining new functionality while older sections stop providing value to your organization or your customers. No code written for your automated test suite will live forever. Tests can outlive their usefulness and become a burden, often showing up in the form of failing test runs.

Many testers have an aversion to deleting any code, especially code they've written. Even when it's clear that a test is no longer functioning as expected or no longer necessary, they keep it around. They mark the test as pending or comment out its assertions, thinking that the code "can be useful someday". Spoiler alert: once a test outlives its usefulness, it rarely returns to a useful state.

If the application under test changed permanently and broke a test, or your team found a better way to test some functionality, you don't need the old test anymore. In almost all cases, when you notice an automated test is no longer useful, delete it from your codebase and don't look back. It'll improve your test suite's maintainability and give you peace of mind.

Make the problems difficult to ignore

When a team gets bombarded by constant messages from their continuous integration system, especially during time crunches, a common tactic is to disable the service's notifications. Although we'd love for problems to go away when we don't see them, the fact is the problem is still there. Even worse, the issue will likely compound as others forget about it. Out of sight, out of mind.

The solution to this problem is simple - never turn off notifications from your automated build environment. Even if it's painful to receive constant reminders of a broken test suite, insist on keeping those messages on so someone fixes the issue. If something's annoying enough, it will get dealt with quickly.
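
If your team does temporarily silence a flaky test, one way to keep the problem loud is to give the quarantine an expiry date. Below is a hypothetical sketch for pytest using a custom quarantine marker (which you'd register in your pytest configuration); once the deadline passes, the run aborts instead of staying quiet:

```python
# conftest.py - a sketch of a quarantine marker with a built-in deadline.
# Assumes a custom marker registered in pytest.ini, for example:
#   markers =
#       quarantine(ticket, until): temporarily skip a known-flaky test
import datetime

import pytest


def pytest_collection_modifyitems(config, items):
    today = datetime.date.today()
    for item in items:
        marker = item.get_closest_marker("quarantine")
        if marker is None:
            continue
        until = datetime.date.fromisoformat(marker.kwargs["until"])
        ticket = marker.kwargs.get("ticket", "no ticket linked")
        if today <= until:
            # Within the grace period: skip, but keep the reason visible.
            item.add_marker(
                pytest.mark.skip(reason=f"Quarantined until {until} ({ticket})")
            )
        else:
            # Deadline passed: abort the run so no one can keep ignoring it.
            raise pytest.UsageError(
                f"Quarantine expired for {item.nodeid} ({ticket}). "
                "Fix the test or delete it."
            )
```

A test would then opt in with something like @pytest.mark.quarantine(ticket="QA-123", until="2025-01-31"), where the ticket number and date are placeholders for your own tracking system.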

However, there's another issue to remain vigilant about - alert fatigue. If a team receives constant warnings about a flaky test, they'll eventually become desensitized to them, and important alerts get lost in the noise. We've all been affected by some form of alert fatigue. Maybe you missed an important text because you swiped it away along with all the other non-essential notifications on your smartphone, or your computer stopped working because you kept ignoring the "low disk space" alerts that popped up every few minutes.

The best way around this is not to rely solely on automatic notifications from a computer. Instead, mention the issue every reasonable chance you get. Many development and testing teams have daily standups or weekly meetings to discuss what's been going on with the project. These moments provide an excellent opportunity to raise awareness of what's happening and seek assistance in solving the problem. If you show you're interested in becoming part of the solution to your broken test suite, others will find ways to help you.

Cover your automation shortcomings with manual testing

Ideally, we'd all have time to dedicate to fixing unstable tests and keeping them in a maintainable, healthy state. However, sometimes we don't have that luxury. You and your team might be way behind schedule with your work, or there may not be enough resources to allocate toward correcting an issue immediately.

If there's a legitimate reason you can't work on fixing your tests, it's acceptable to mark the test as pending and continue with your work. But that doesn't mean you get to ignore the problem. If the test is important enough to keep in the automated test suite, yet it's not feasible to correct right now, make sure you cover that scenario with manual testing.
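
To keep that handoff traceable, you can point to the manual coverage right in the skip reason, so anyone reading the test output knows the scenario is still being checked by hand. A hypothetical pytest example (the ticket and manual test case IDs are placeholders):

```python
import pytest


@pytest.mark.skip(
    reason="Flaky (QA-456): covered manually via test case TC-789 until fixed"
)
def test_password_reset_email():
    ...
```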

Even with highly efficient and robust automation, the best testing strategies still involve plenty of manual testing. Many teams with good test automation think they can get away with relying only on their automated test suite, but manual and exploratory testing remain essential parts of quality. That's especially true when your automation breaks down for any reason and you can't fix the problems promptly.

If you're part of a team that doesn't incorporate some form of manual testing in its everyday workflow, start a discussion to begin the process. If your team already does manual testing, remember to cover the gaps left behind by any flaky tests you can't stabilize right now. It will likely feel less efficient than your automated tests, but it's better to have stable and more complete coverage of what matters most than an unstable, untrustworthy test suite full of holes.

Summary

No one is immune to test failures caused by an unstable test case in their automation. It's nearly impossible to have a fully stable test suite, especially for always-changing applications.

Unfortunately, the way many teams cope with this problem is by covering up the issue. They keep the failing tests but comment them out or mark them as pending so the entire build doesn't complain. Some even go as far as shutting off their notifications to avoid dealing with the issue. These actions create a "broken window" that, left unattended for far too long, encourages others to ignore similar problems. Before you know it, no one cares about the results, and your test suite loses its value.

If you're on a project containing too many "broken windows", you can take action to avoid making the problem worse. The best thing you can do is attend to the problem immediately. You and your team should get into the habit of correcting problems before working on new functionality. The earlier you attack a problem, the easier it becomes to deal with.

Another step you can take is to determine whether the failing test is still valuable to your test suite. As your project grows, some of your code will outlive its usefulness. If an automated test no longer serves its original purpose, get rid of it as soon as possible. Don't get attached to any of your tests, thinking they still might be useful someday - most likely, they won't.

If someone wants to reduce or eliminate failure notifications from your tests, insist on keeping them around. You and your team need reminders to fix the problems instead of covering them up and potentially forgetting about them. Still, make sure you deal with issues soon to avoid alert fatigue, where the team receives notifications so frequently that they ignore them. Letting alerts slide can cause the team to miss important ones.

At times, you won't have the time or resources to fix a flaky test right away. For justifiable reasons, like projects running behind schedule or a sudden shortage of team members, it's acceptable to temporarily comment out the test or mark it as pending. But don't leave the gap in coverage from that silenced test unaddressed. Use manual testing in the meantime to ensure you're still checking what you need.

It might sound far-fetched that a single test can bring down an entire automation strategy, but it can happen. By taking action as soon as the inevitable test failure occurs in your automated test suite, you can avoid falling deeper into a hole you can't easily climb out of.

How do you and your team cope with flaky tests? Share your experiences with others in the comments section below!