How can Chaos Testing Improve Software Quality?


Most business applications are delivered as software, and that software must be able to withstand adverse circumstances. One practice that agile and DevOps teams use to build this resilience is chaos testing.

Netflix pioneered chaos testing in 2010. Following a significant outage, their development and operations teams made a bold move: migrating their entire infrastructure to Amazon Web Services (AWS).

During the migration, they established principles for testing on production systems to ensure that the integrity and reliability of their services would remain intact. In this article, we will discuss the importance of chaos testing, which organizations should and should not apply its principles, and how it fits into the DevOps and DevAug life cycles.

What is Chaos Testing?

Chaos testing, also known as chaos engineering, identifies vulnerabilities in a system before they cause unplanned downtime or a negative user experience. Engineering teams run controlled experiments on production environments and use monitoring tools to observe the results. This approach shows how the application behaves under duress in real-time, real-world scenarios. The benefit is clear: an organization that finds weaknesses early may avoid the costly repairs and lost revenue that follow a major failure in operation.

According to PagerDuty, there are five main principles of chaos testing:

  1. Ensure the system works and define a steady state.
  2. Hypothesize the system’s steady state will hold.
  3. Ensure minimal impact to users.
  4. Introduce chaos.
  5. Monitor and repeat.

Here is a breakdown of each principle.

1. Ensure the system works and define a steady state

Before running any experiment, confirm that the system works and define its steady state: a measurable output that represents normal working behavior, such as an error rate below 1%. If the system does not yet meet that definition, the team works towards those measurements over time until they are met.
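As a minimal sketch of what a steady-state definition can look like in code, the snippet below checks an error rate against the 1% figure mentioned above. The `WindowStats` class, its fields, and the function names are hypothetical; a real system would read these counts from its own monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Request counts for one monitoring window (hypothetical shape)."""
    total_requests: int
    failed_requests: int


# Assumed threshold, taken from the 1% error rate used as an example in the text.
MAX_ERROR_RATE = 0.01


def error_rate(stats: WindowStats) -> float:
    """Fraction of requests that failed; an empty window counts as 0.0."""
    if stats.total_requests == 0:
        return 0.0
    return stats.failed_requests / stats.total_requests


def in_steady_state(stats: WindowStats,
                    max_error_rate: float = MAX_ERROR_RATE) -> bool:
    """True when the window's error rate is strictly below the threshold."""
    return error_rate(stats) < max_error_rate
```

Error rate is only one possible measure. Latency percentiles or throughput could be checked the same way, depending on what "normal" means for the system.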

2. Hypothesize the system’s steady state will hold

Once the system is in its steady state, hypothesize that this state will continue in both the control group and the experimental group. The experiment then tries to disprove that hypothesis.
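One way to make this hypothesis testable is to compare the same metric between a group left untouched and a group exposed to the experiment. The sketch below assumes error rate is the metric and that a small tolerance for normal noise is acceptable; the function name and the default tolerance are illustrative, not prescribed by any tool.

```python
def hypothesis_holds(control_error_rate: float,
                     experiment_error_rate: float,
                     tolerance: float = 0.005) -> bool:
    """Return True if the steady state held under the experiment.

    The hypothesis holds when the experimental group's error rate does not
    exceed the control group's error rate by more than `tolerance`.
    """
    return (experiment_error_rate - control_error_rate) <= tolerance
```

A result of False is the useful outcome here: it means the experiment exposed a weakness worth fixing.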

3. Ensure minimal impact to users

This is the cardinal rule of chaos testing. The chaos engineer must walk a fine line between art and science, stressing the system enough to learn from it while not adversely affecting users of the production system.
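A common way to limit user impact is to expose only a small share of traffic to an experiment. The sketch below assumes users are identified by a string ID and assigns them to the experiment deterministically by hashing that ID; the function name and the bucketing scheme are illustrative assumptions.

```python
import hashlib


def in_experiment(user_id: str, percent: float) -> bool:
    """Deterministically place roughly `percent`% of users in the experiment.

    The same user always gets the same answer for a given percentage, and
    raising the percentage only ever adds users to the experiment.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10_000  # value in 0..9999
    return bucket < percent * 100
```

Starting at a low percentage and raising it only after the steady state holds keeps the number of affected users small if something goes wrong.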

4. Introduce chaos

The true test of any system comes when nobody knows whether it will fail. Chaos testing tools introduce different variables to simulate real-world events, from a server crash to malfunctioning hardware and severed network connections, in order to see how the system handles these conditions before they actually happen.
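To illustrate the idea at the smallest possible scale, the sketch below wraps a single call and either fails it outright or delays it. It is a toy in-process example, not how any particular chaos tool works; `with_chaos`, `InjectedFault`, and the parameters are hypothetical names.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class InjectedFault(Exception):
    """Raised in place of a real failure during an experiment."""


def with_chaos(call: Callable[[], T],
               failure_rate: float,
               max_delay_s: float,
               rng: random.Random) -> T:
    """Run `call`, but first maybe fail outright or add random latency.

    failure_rate: probability (0.0 to 1.0) of raising InjectedFault.
    max_delay_s:  upper bound, in seconds, on the injected delay.
    rng:          random source, passed in so runs can be reproduced.
    """
    if rng.random() < failure_rate:
        raise InjectedFault("simulated dependency failure")
    time.sleep(rng.uniform(0.0, max_delay_s))
    return call()
```

Passing in a seeded `random.Random` makes an experiment repeatable, which helps when a failure needs to be reproduced after the fact.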

5. Monitor and repeat

The goal of chaos engineering is to find weaknesses in the system and make it more reliable. Achieve this by testing consistently, introducing one form of disorder or another to pinpoint vulnerabilities that could prove costly if they went unnoticed.

Who Should and Who Shouldn’t Perform Chaos Testing

Chaos testing is a sound concept, but it is not suitable for all software vendors. It is best suited to the following software development scenarios:

  • Systems that require scalability.
  • Mission-critical systems which run the business.
  • Systems that require 24x7 availability or an SLA uptime above 99%.
  • Larger and more complex software systems, as opposed to desktop software or smaller systems.

Chaos Testing Importance in DevOps

A chaos engineer is a person on the development or QA team responsible for executing tests and determining results. They minimize customer impact during production runs with techniques such as failover scripts and data backup plans in case something goes wrong. They also identify best practices that can scale across teams by automating tasks such as reporting metrics from different environments, whether the experiments ran locally on dev/test infrastructure or in a staging environment running production code.

As mentioned, the chaos engineer must walk a hair-thin line, using both science and art to stress the production system effectively without affecting real users. Knowing when to abort a chaos test is key to the role.
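The abort decision can be partly automated. The sketch below assumes the experiment produces a stream of error-rate readings and that a rollback action, such as a failover script, is available as a function. Both the threshold and the names are hypothetical.

```python
from typing import Callable, Iterable


def run_experiment(error_rates: Iterable[float],
                   abort_threshold: float,
                   rollback: Callable[[], None]) -> bool:
    """Watch successive error-rate readings and abort on the first breach.

    Returns True if every reading stayed below `abort_threshold`, or False
    if the experiment was aborted, in which case `rollback` is called once.
    """
    for rate in error_rates:
        if rate >= abort_threshold:
            rollback()  # restore normal service before doing anything else
            return False
    return True
```

An automatic guard like this does not replace the engineer's judgment, but it sets a hard limit on how far an experiment can degrade service before it stops.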

How to Get Started with Chaos Testing in DevAug?

Chaos testing sets out to break a system on purpose in order to uncover its vulnerabilities. We have touched on automation, but we have not yet discussed the role of artificial intelligence in this practice in depth. Artificial intelligence and machine learning are likely to shape the future of chaos engineering.

DevAug is a methodology of using AI to enhance software development. DevAug-focused development teams train the AI to understand the business logic of the software, give it the quality control objectives, tell it what the steady state is, and warn it not to push the system into a critical failure state. They then turn the machine loose to scout for vulnerabilities and report its findings for analysis.

This is the evolution of software testing: using AI to aid test automation tasks. As a result, automation engineers will focus more on training AI.

We highly recommend that DevAug teams interested in software and systems efficiency give chaos engineering a try. To get started, experiment with Chaos Monkey, a tool that Netflix shared on GitHub. It tests the resiliency of software by randomly terminating instances in a running environment.

Chaos Monkey is the best-known member of a wider Netflix suite called the Simian Army, which features several tools to test a range of scenarios, including simulating outages, performing systems health checks, inspecting for best practices, analyzing security, and freeing cluttered resources to ensure smooth operation.

All of these tools are designed to encourage system architects to innovate and build in such a way that, even if chaos monkeys wreak havoc in their systems, those systems can withstand the damage without adversely affecting real users.

We encourage professionals in the field to advance further with AI infusion. Allowing an algorithm to build, manage, and control chaos testing is the evolution of software resiliency. It is the DevAug way!