End-to-end tests are great for testing a system in its entirety, especially in this age of microservices and third-party integrations. As part of a test automation process, they free testers from boring, repetitive tasks while keeping a system well-tested in a fraction of the time.

Unfortunately, end-to-end tests are also notoriously unstable and flaky, no matter how much work you put into building your test suite. They can work well one second, and then suddenly start failing with no explanation, only to "magically" get back to a green build state just by re-running them. If you've spent any time dealing with automated end-to-end testing, you know what I'm talking about.

A Real-World Example of End-To-End Test Frustration

I recently ran into a frustrating situation with end-to-end tests at a company I work with. They have an end-to-end test suite that takes our continuous integration service about 15 minutes to set up the testing environment and execute, and we run it every time someone updates one of the primary branches in the code repository.

One afternoon, we received an alert from our continuous integration service that the end-to-end tests had failed. When I checked the logs, I found failures because a third-party service the application requires was temporarily inaccessible. It's an unfortunate side effect, since we can't take this external service out of the equation during testing.

After verifying that the third-party service was up again, I re-ran the tests. A few minutes later, another alert from the CI system came in. This time, the CI service choked for some unknown reason while spinning up the infrastructure for the application to test. Another temporary issue that was out of our hands.

I re-ran the tests again. Guess what happened? They failed yet again. This time, a single test failed because the application took a few seconds longer than expected to complete an action and the test suite timed out. It's something the team should look into later, since it was likely due to a dynamic selector, but at that point I just wanted the test suite to pass.

I re-ran the tests once again. Another failure alert. This time, the CI service itself failed to pull in one of its dependencies, something we have absolutely no control over. I was about to give up when, after one more re-run, everything passed without a hitch. That's four retries and five runs to get a green build, if you're keeping score.

Between reviewing logs and waiting for retries, a process that takes at most 15 minutes ended up consuming nearly two hours. I don't know how I resisted the urge to rip the end-to-end tests out of the whole process and not worry about them for the rest of the product's lifetime. It's just one example of why these tests have such a bad reputation, despite their usefulness.

Improving the Stability of Your End-To-End Tests

The situation I described was an extreme example; we had never had to retry the end-to-end tests more than once before this happened. Many of these failures were problems that seemed to be out of our hands. Still, there were a few things we could do to mitigate them, and once we eventually addressed them, we cut down on random test failures considerably.

In my years of experience helping build and improve automated end-to-end test suites, I've spotted a few common issues that often lead to flakiness and instability. Here are five ways to combat the frequent causes of brittle test suites and increase their stability. Fixing some or all of these issues will help reduce hair-pulling moments like the one mentioned above.

Don't put your test to sleep

One of the quickest ways to kill the stability of your end-to-end tests is to tell a test case to pause for a specific number of seconds before checking for the existence of an element or the result of an action. Typically, these delays come in the form of a sleep command: test execution freezes for the number of seconds defined in the command, then proceeds with the next step in the automated test. These commands get sprinkled through test suites to give the application under test time to reach the state the test needs before proceeding.

The problem with sleep commands is that the duration you choose won't always be enough, for various reasons. Maybe the application is responding to database queries more slowly than usual during the tests, or it's communicating with a third-party service that's experiencing some slowness. Inexperienced testers often bump up the number of seconds, which fixes the immediate issue at the expense of making the tests slower. Upping the sleep duration also doesn't guarantee a fix in the long run: the number of seconds that works today might not work tomorrow.

To resolve this, testers should replace sleep and similar commands with implicit or explicit waits against the application. These waits let the test automatically check for an element to show up or an action to finish before moving on, either explicitly (where you define a condition that must hold before proceeding) or implicitly (where the framework waits based on the next step in the test case). Most modern end-to-end testing frameworks have this functionality baked in, such as TestCafe's Smart Assertion Query Mechanism. Using it massively increases the stability of end-to-end tests, since the testing framework does its best to ensure the application reaches the expected state.
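For instance, here's a minimal TestCafe sketch contrasting the two approaches. The page URL and selectors are hypothetical, but the assertion's `timeout` option is TestCafe's built-in retry mechanism:

```typescript
import { Selector } from 'testcafe';

// Hypothetical page and selectors, for illustration only.
fixture('Checkout').page('https://example.com/checkout');

// Brittle: freezes the test for a fixed five seconds, whether the
// confirmation needs one second or six to appear.
test('confirms an order (fragile sleep)', async t => {
    await t.click(Selector('#place-order'));
    await t.wait(5000); // hard-coded sleep
    await t.expect(Selector('.confirmation').innerText).contains('Thank you');
});

// Stable: the assertion retries automatically until it passes or the
// timeout elapses, so the test waits exactly as long as it needs to.
test('confirms an order (smart assertion)', async t => {
    await t.click(Selector('#place-order'));
    await t
        .expect(Selector('.confirmation').innerText)
        .contains('Thank you', { timeout: 10000 });
});
```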

Use element selectors you can trust

A common issue with end-to-end tests is keeping the selectors used to identify elements in the application up to date. With today's rapid software development cycles, it's becoming more and more challenging to keep automated tests in sync with changes in the application under test. This situation is worse when a "developer vs. tester" mindset is prevalent in an organization and there's little to no communication between teams to fix issues quickly.

To make matters worse, modern web applications are dynamic in nature, with the contents of a page constantly changing. In some applications, too many different variables come into play in how a web page or screen renders. There are also libraries and frameworks whose components generate their own markup, making it nearly impossible for automation engineers to create stable selectors.

One solution to these problems is to reduce the reliance on CSS classes and other frequently changing selectors to locate elements in an application, and opt instead for attributes that won't change often. A typical suggestion is to use accessibility attributes, such as aria-label on a website, since they rarely change (with the bonus of encouraging developers to make their applications more accessible). Another is to add data-testid attributes to elements, as these are unlikely to change even when the element's other attributes do. Whichever route your team chooses, finding selectors that don't rely on presentation will make your end-to-end tests more resistant to failures caused by frontend changes.
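As a quick sketch (the attribute values are hypothetical), here's how those options compare in a TestCafe selector:

```typescript
import { Selector } from 'testcafe';

// Brittle: tied to presentation. A CSS refactor or a framework that
// generates its own class names will break this selector.
const submitFragile = Selector('div.form-wrapper > button.btn.btn-primary');

// Sturdier: an accessibility attribute that rarely changes, with the
// bonus of nudging developers toward a more accessible UI.
const submitAccessible = Selector('button[aria-label="Submit order"]');

// Sturdier still: a dedicated test hook that survives styling and
// markup changes because it exists only for the tests.
const submitTestId = Selector('[data-testid="submit-order"]');
```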

Keep your test infrastructure performant

In my experience, testers run end-to-end tests on their own systems only when they're actively working on them, and developers will likely never run them locally unless there's a problem they're looking to debug. Usually, that's because these tests are slow to run or require complicated setup. Instead, the majority of end-to-end test executions happen in a continuous integration service, on hardware controlled by a third party like GitHub, Azure, or CircleCI. Running tests mostly on CI services is fine, but have you ever thought about what your CI service uses to run your tests?

An often-overlooked part of end-to-end testing is the infrastructure where the tests run. Typically, CI services provide default environments that are significantly less powerful than the standard developer or automation tester laptop. For example, CircleCI's default server class has 2 vCPUs and 4 GB of memory. While that might sound sufficient, some end-to-end tests also need to spin up your application's services on the same instance, which can quickly eat away at those resources. When the underlying infrastructure runs out of resources, you'll begin seeing random timeouts and failures during your test runs.

If you suspect that low-powered CI environments are affecting your end-to-end tests, experiment with a test run on a system with extra resources. If you're self-hosting a CI system like Jenkins, use a more powerful server to see if it resolves some of the instability. On paid services, you can pay extra to spin up your test infrastructure on higher-powered environments, and some services also let you spin up your own runners to get the resources you need. You might be surprised at how much an under-powered continuous integration environment can hurt your automated tests.
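On CircleCI, for example, that experiment can be a single line in the pipeline configuration. Here's a minimal sketch; the job name, Docker image, and npm scripts are assumptions rather than anyone's real setup:

```yaml
version: 2.1

jobs:
  e2e-tests:
    docker:
      - image: cimg/node:lts
    # The default "medium" class has 2 vCPUs and 4 GB of RAM.
    # "large" doubles that to 4 vCPUs and 8 GB, at a higher credit
    # cost per minute.
    resource_class: large
    steps:
      - checkout
      - run: npm ci
      - run: npm run test:e2e # hypothetical script that runs the suite

workflows:
  run-tests:
    jobs:
      - e2e-tests
```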

Don't be afraid to move your tests elsewhere

Many organizations lean too heavily on one form of testing over others. Focusing too much on one area of automated testing can be a detriment to your QA processes, since it won't give you as much coverage as you might expect. It also blinds teams to better ways of building their tests for specific purposes. For instance, developers who mostly write unit tests will likely ignore or forget other forms of testing that verify their code works well with other parts of the system.

This issue occurs frequently with end-to-end testing. Many teams, upon seeing the benefits of having their applications tested as if someone were manually running through them, begin putting unnecessary strain on that side of the test pyramid. One example comes from a team whose test automation processes I helped audit. I noticed they had many end-to-end tests written to test specific components of their frontend. The team had never considered component testing; they only focused on the setup they had available. Given that they had to spin up their entire application for these tests, they were wasting precious time and resources on something they could test in isolation through other means.

Over time, most teams accumulate automated tests that could be shifted elsewhere to get more out of them. It's well worth the effort to take stock of your entire automated testing process and analyze where your organization can improve its test cases, especially if you rely heavily on automated end-to-end tests. You might find test cases that are better suited to API testing or isolated component tests, which will make your tests much faster and more reliable in the long run.
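As a sketch of what that shift looks like: suppose an end-to-end test drives a browser through checkout just to confirm a discount calculation. Assuming the application exposes a pricing endpoint (the URL, payload, and expected numbers below are all hypothetical), the same check becomes a fast, isolated API test using Node 18+'s built-in test runner and fetch:

```typescript
import assert from 'node:assert/strict';
import { test } from 'node:test';

test('applies a 10% discount to orders over $100', async () => {
    // Call the hypothetical pricing endpoint directly instead of
    // clicking through the UI in a browser.
    const response = await fetch('https://example.com/api/pricing', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ subtotal: 120, couponCode: 'SAVE10' }),
    });

    assert.equal(response.status, 200);
    const { total } = await response.json();
    assert.equal(total, 108); // 120 minus the 10% discount
});
```

The behavior under test is identical; what changed is that the check no longer needs a browser, a rendered frontend, or the full application stack to run.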

Review and refactor your code frequently

Many automation testers fall into the trap of writing the code for a test scenario, committing it to the codebase, and never touching it again. After all, as the old adage goes, don't fix what isn't broken. Often, the only time someone updates an automated test case is when a problem occurs. Unfortunately, this is a missed opportunity to keep your automated testing processes running smoothly for the foreseeable future.

Even if your test suite has been working great for a long time, you may still find plenty of opportunities to improve your tests. As testing tools and frameworks evolve, new enhancements and bug fixes can make your code perform better. You also continuously grow and develop as a tester and discover better ways to write those tests. For example, you might have used a hacky workaround in a test, even though it caused sporadic failures, because it was the only way you knew how to do it or because your testing library lacked the support. Over time, you may have learned the right way to do it, or your tools may now have what you need.

The next time you're adding new automated test cases to your codebase, take a look at some older tests you wrote a few months or even a few years ago. There's an excellent chance you'll immediately spot areas where you can make significant improvements in minutes. I'm constantly amazed when I look back at my code from just a few months ago and notice how many places I could make things better. Don't be afraid to spend some extra time maintaining what you built in the past, as doing so will lead to more efficient and effective end-to-end tests.
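One concrete example of this kind of maintenance is extracting the selectors and actions that old tests repeat everywhere into a page model, so the next markup change means editing one file instead of dozens of tests. A hypothetical TestCafe sketch:

```typescript
import { Selector, t } from 'testcafe';

// Before this refactor, every login test repeated these raw selectors
// and the same three actions. Centralizing them here means a markup
// change touches one file. All names are hypothetical.
class LoginPage {
    emailInput = Selector('[data-testid="email"]');
    passwordInput = Selector('[data-testid="password"]');
    submitButton = Selector('[data-testid="login-submit"]');

    async logIn (email: string, password: string) {
        await t
            .typeText(this.emailInput, email)
            .typeText(this.passwordInput, password)
            .click(this.submitButton);
    }
}

export default new LoginPage();
```

Tests then call the page model's logIn method instead of repeating those steps, which also tends to make the test cases themselves read more like the scenarios they cover.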

Summary: You Can Make a Difference With Your End-To-End Tests

The complexities of automated end-to-end testing can make it seem like you're forever destined to deal with slow, flaky tests. But it doesn't have to be that way. The five strategies mentioned in this article will help you and your team minimize those issues. Each one will lead to fewer test failures, faster development cycles, and happier team members who don't have to deal with the headaches of unstable, inconsistent automated testing.

That doesn't mean you'll have consistently stable end-to-end tests just by following everything in this article. You'll still experience the occasional failure despite your best efforts; it's the nature of the beast. The goal of these strategies is to build a robust and resilient environment for your automated tests, and to make the time and effort you put into these processes worth it for your organization.

In what ways have you improved the stability and robustness of your end-to-end tests? Share your personal strategies with the testing community by leaving a comment below!