From Click to Chaos: Chaos Testing Within Neo4j Aura

Chris Heisz

Senior Software Engineer

Linking Argo Workflows and Backstage for Automated Testing

By Chris Heisz and Luke Beamish

Two-headed dragon breathing fire and ice — one representing Argo, the other representing Backstage.

Every good software engineer should want to test their changes before merging them. But what do you do when that software is running in a complex multi-cloud environment across thousands of Kubernetes clusters? Do you look for ways to run tests locally? Do you spin up a simplified environment? Or do you run tests in a large production-like environment?

Each option has its pros and cons, and most organizations use some combination of them. However, some software simply has to be tested in a production-like environment, which may be shared among hundreds of engineers and can be complicated or difficult to run tests against. At Neo4j, we ran into this when making sure that developers working on the open source Neo4j software can test their changes within our cloud offering, Aura.

In this article, we’ll talk about how we use Argo Workflows alongside Backstage to make it simple to run chaos tests against instances of Neo4j on multiple K8s clusters across Google Cloud Platform (GCP), AWS, and Azure.

With great scale comes great responsibility. To ensure that Neo4j remains performant and stable, we built chaos tests that allow engineers to validate the stability of new features in a realistic cloud environment. At its simplest, one of these tests spins up a Docker container next to a Neo4j database running in Aura and fires off a heavy query. We then monitor the health of the database, ensuring that it recovers without crashes, slowdowns, or degraded performance.

We’ll dive into how we designed these tests, the challenges we faced, and how our approach improves the reliability of Neo4j Aura.

Testing Neo4j Aura in Production: A Mission to Mars

Planet Earth sending a rocket to Planet Mars, representing a new feature arriving to Production

Imagine our production Aura database lives on Mars, and we, as engineers, develop and test our features on Earth. Each new feature we deploy is like a rocket launching into space, traveling across the unknown before reaching its destination.

To ensure that these new features survive the journey and function reliably once they land on Mars, we need a way to stress-test them under real-world conditions — even within production if we choose to.

Executing a Simple Chaos Test

A simple way we can test a new feature in Neo4j Aura is by executing a chaos test directly against a database within the Kubernetes cluster:

Code example presenting a Kubernetes Job

Let’s do it in the simplest way:

  • Start by creating a dedicated test database in our production environment.
  • Deploy a Kubernetes job that knows the URL of the database and the Cypher queries we want to send (Cypher being the query language for graph databases).
  • This job then spins up two containers, each responsible for a crucial part of the test:
    1. Query overload — One container bombards the database with intense Cypher queries, deliberately trying to break it and test its ability to handle extreme load.
    2. Health check — The second container monitors the health of the Aura database, ensuring that the pods haven’t crashed and are still functioning as expected after the stress test.
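To make the two-container setup concrete, here is a minimal sketch of such a Job, built in Go with only the standard library and printed as JSON (which kubectl also accepts as a manifest). The structs cover just the small subset of the batch/v1 Job schema needed here. The container names, both image references, the CYPHER_QUERY variable, and the database URI are placeholders invented for illustration, not what runs inside Aura.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// The types below mirror a small subset of the Kubernetes batch/v1 Job schema.

type EnvVar struct {
	Name  string `json:"name"`
	Value string `json:"value"`
}

type Container struct {
	Name  string   `json:"name"`
	Image string   `json:"image"`
	Env   []EnvVar `json:"env,omitempty"`
}

type PodSpec struct {
	RestartPolicy string      `json:"restartPolicy"`
	Containers    []Container `json:"containers"`
}

type PodTemplate struct {
	Spec PodSpec `json:"spec"`
}

type JobSpec struct {
	BackoffLimit int         `json:"backoffLimit"`
	Template     PodTemplate `json:"template"`
}

type Metadata struct {
	Name string `json:"name"`
}

type Job struct {
	APIVersion string   `json:"apiVersion"`
	Kind       string   `json:"kind"`
	Metadata   Metadata `json:"metadata"`
	Spec       JobSpec  `json:"spec"`
}

// newChaosJob builds a Job with one container that sends the query and one
// that watches the database's health. Both image names are hypothetical.
func newChaosJob(name, neo4jURI, cypherQuery string) Job {
	target := EnvVar{Name: "NEO4J_URI", Value: neo4jURI}
	return Job{
		APIVersion: "batch/v1",
		Kind:       "Job",
		Metadata:   Metadata{Name: name},
		Spec: JobSpec{
			BackoffLimit: 0, // do not retry: a failed run is itself the test result
			Template: PodTemplate{Spec: PodSpec{
				RestartPolicy: "Never",
				Containers: []Container{
					{
						Name:  "query-overload",
						Image: "example.com/chaos/query-overload:latest", // hypothetical image
						Env:   []EnvVar{target, {Name: "CYPHER_QUERY", Value: cypherQuery}},
					},
					{
						Name:  "health-check",
						Image: "example.com/chaos/health-check:latest", // hypothetical image
						Env:   []EnvVar{target},
					},
				},
			}},
		},
	}
}

func main() {
	// The URI and the query are stand-ins for a real test database and a heavy query.
	job := newChaosJob("chaos-test-demo", "neo4j+s://example.databases.neo4j.io", "MATCH (n) RETURN count(n)")
	out, err := json.MarshalIndent(job, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Even in this reduced form, everything is hard-coded per test run, which hints at why writing these by hand becomes tedious.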

The outcome? We can manually observe whether the Kubernetes job passes or fails, giving us direct feedback on the stability of the database under pressure.

Frustrated man sitting on a park’s bench
Source: Pexels

The Problem: Manual Testing Doesn’t Scale

But here’s the thing: This is a lot of toil to repeat every time we want to run a chaos or load test. And we have more than 100 engineers working on Neo4j changes, all having to recreate the same test setup repeatedly.

Not everyone in our team spends their days working with Kubernetes (not everyone is lucky enough to be a cloud engineer!). For example, some of our engineers are mathematicians focused on algorithmic improvements to Neo4j.

The bottom line? Not everyone feels comfortable interacting with Kubernetes. If chaos testing is too complicated, people won’t use it — and that’s a problem.

Bringing Chaos Testing Closer to Home

Rockets representing new features are sent from Planet Earth to the Moon instead of Mars.

Instead of running our Kubernetes job directly on production, let’s take a step back. We still need to test the stability of our databases, but running these tests against live customer environments isn’t exactly ideal.

So, what if we ran them in our development environment instead?

Our development environment is identical to production, with one key difference: It isn’t exposed to customers. That means we can safely simulate real-world conditions without the risk of affecting live workloads.

But there’s still a problem: Manually defining a Kubernetes job every time we want to run a chaos test is inefficient. Instead of drafting them by hand, we need a better approach — one that automates the process while keeping it flexible.

Choosing the Right Tool: Enter Argo Workflows

We evaluated multiple tools to help orchestrate our chaos tests, and while we liked several of them, after a few weeks of testing, we landed on Argo Workflows.

Why Argo Workflows?

Lunar base representing Argo Workflows

Think of Argo Workflows as our lunar base — a container-native workflow engine that lets us orchestrate parallel jobs inside Kubernetes.

We’re already very familiar with GitHub Actions, and since we’re a multi-cloud company, GitHub Actions still plays an important role in running workloads that operate outside of GCP, AWS, and Azure.

However, there are some issues:

  • When you use GitHub Actions in a mono repository, workflow runtimes can be slow.
  • Every time we trigger a job, it clones the entire repository, even if we use a shallow clone.
  • It then needs to build dependencies and authenticate with our Kubernetes clusters.
  • Some complex tasks take 10–15 minutes to run — far from ideal when testing at scale.

By switching to Argo Workflows, the impact was immediate:

  • Chaos tests that used to take 12 minutes now take just 2 minutes.
  • It runs inside Kubernetes, so all it needs is a service account, with no complex authentication required.
  • Network latency is significantly lower, as the requests never leave the cluster.

Argo Workflows provided the speed, efficiency, and scalability we needed to make chaos testing fast and accessible for engineers across Neo4j.

Automating Chaos Tests With Argo Workflows

Code example showing an Argo Workflow YAML

Once you install Argo Workflows in a Kubernetes cluster, it introduces a powerful feature: Workflow Templates. These are custom resources that the Argo Workflow Controller recognizes and can execute on demand.

Think of a Workflow Template like a VHS tape: It’s a pre-recorded sequence of steps that we can replay anytime. (If you’re too young to remember VHS tapes, just imagine it as a reusable automation script!)

How It Works

When we create a Workflow Template, it’s stored inside Kubernetes, and we can refer back to it whenever we need to rerun the same workflow. Let’s break it down:

  • Line 4: The template declares, “I am a Workflow Template, and my name is cypher-send-query.”
  • Line 10: It defines required inputs, like NEO4J_URI and the Cypher queries to execute for chaos testing.
  • Line 13+: This is where the business logic happens. The workflow pulls the official Neo4j Cypher image and runs a shell script, passing in the required parameters.

Using Argo Workflows, we’ve turned chaos testing into a simple, repeatable process. Instead of manually setting up Kubernetes jobs every time, engineers can trigger predefined Workflow Templates with their own parameters — and let Kubernetes handle the rest.

We can go even further:

  • We can automatically create an Aura database as part of the workflow.
  • After the chaos test runs, we can verify the database’s health, ensuring no pods crashed.

With this approach, chaos testing at Neo4j can be automated, reproducible, and accessible to all engineers.

Scaling Up: Running Chaos Tests Across 1,000+ Kubernetes Clusters

If we only had one Kubernetes cluster to manage, Argo Workflows alone would be enough. However, at Neo4j Aura, we operate more than a thousand Kubernetes clusters across multiple cloud providers — so we needed a way to send chaos test requests to the right cluster, seamlessly and efficiently.

To achieve this, we built a lightweight microservice in Go.

A satellite representing a micro-service orbiting the Moon.

This microservice is only responsible for:

  • Accepting an incoming request to trigger a chaos test.
  • Determining which Kubernetes cluster the target database is in.
  • Sending the request to that cluster, where the Argo Workflow is triggered immediately.

That’s it — nothing more, nothing less.

It acts as a bridge between engineers and Kubernetes, ensuring that chaos tests run next to the correct database, no matter which cloud provider or region it resides in.

A code example showcasing the one-liner that creates the Argo Workflow Kubernetes resource within the Kubernetes cluster.

Looking at the actual code, the microservice logic is remarkably straightforward. The core function that creates the Argo Workflow resource is just a few lines long. It simply:

  1. Creates the Workflow resource
  2. Passes in the necessary parameters
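As a rough illustration of those two steps, the sketch below builds a Workflow manifest that points at an existing Workflow Template through Argo's workflowTemplateRef field and passes the parameters as arguments. It uses only the standard library and stops at producing the manifest. A real service would hand an equivalent object to a Kubernetes or Argo client, which is not shown. The template name cypher-send-query and the NEO4J_URI parameter come from the example above, while CYPHER_QUERY and the URI value are assumed for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// The types below mirror a small subset of Argo's Workflow schema.

type Parameter struct {
	Name  string `json:"name"`
	Value string `json:"value"`
}

type Arguments struct {
	Parameters []Parameter `json:"parameters"`
}

type TemplateRef struct {
	Name string `json:"name"`
}

type WorkflowSpec struct {
	WorkflowTemplateRef TemplateRef `json:"workflowTemplateRef"`
	Arguments           Arguments   `json:"arguments"`
}

type WorkflowMeta struct {
	GenerateName string `json:"generateName"`
}

type Workflow struct {
	APIVersion string       `json:"apiVersion"`
	Kind       string       `json:"kind"`
	Metadata   WorkflowMeta `json:"metadata"`
	Spec       WorkflowSpec `json:"spec"`
}

// newWorkflow returns a Workflow that runs the named Workflow Template with
// the given parameters. Parameters are sorted by name for stable output.
func newWorkflow(templateName string, params map[string]string) Workflow {
	names := make([]string, 0, len(params))
	for name := range params {
		names = append(names, name)
	}
	sort.Strings(names)

	parameters := make([]Parameter, 0, len(names))
	for _, name := range names {
		parameters = append(parameters, Parameter{Name: name, Value: params[name]})
	}

	return Workflow{
		APIVersion: "argoproj.io/v1alpha1",
		Kind:       "Workflow",
		// generateName lets the API server append a unique suffix per run.
		Metadata: WorkflowMeta{GenerateName: templateName + "-"},
		Spec: WorkflowSpec{
			WorkflowTemplateRef: TemplateRef{Name: templateName},
			Arguments:           Arguments{Parameters: parameters},
		},
	}
}

func main() {
	wf := newWorkflow("cypher-send-query", map[string]string{
		"NEO4J_URI":    "neo4j+s://example.databases.neo4j.io", // placeholder URI
		"CYPHER_QUERY": "MATCH (n) RETURN count(n)",            // assumed parameter name
	})
	out, err := json.MarshalIndent(wf, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```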

If you’ve worked with GitHub Actions, this process may feel familiar. It’s similar to how a GitHub Workflow calls a custom GitHub Action, passing in parameters and letting the system handle execution.

For Aura engineers, this microservice allows chaos testing by letting them send a simple API request with a JSON payload. The payload references an existing Argo Workflow Template and passes in the required parameters. The workflow triggers automatically in the correct cluster where the database is running.
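The exact shape of that payload is not shown here, so the following is only a guess at what decoding and validating it might look like. The field names databaseId, workflowTemplate, and parameters are assumptions made for this sketch, as is the choice to reject unknown fields.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// TriggerPayload is a hypothetical request body for triggering a chaos test.
type TriggerPayload struct {
	DatabaseID       string            `json:"databaseId"`
	WorkflowTemplate string            `json:"workflowTemplate"`
	Parameters       map[string]string `json:"parameters"`
}

// parsePayload decodes the JSON body and checks the fields the service needs
// before it can pick a cluster and create a workflow.
func parsePayload(body string) (TriggerPayload, error) {
	var p TriggerPayload
	dec := json.NewDecoder(strings.NewReader(body))
	dec.DisallowUnknownFields() // surface typos in field names early
	if err := dec.Decode(&p); err != nil {
		return TriggerPayload{}, fmt.Errorf("invalid payload: %w", err)
	}
	if p.DatabaseID == "" {
		return TriggerPayload{}, errors.New("databaseId is required")
	}
	if p.WorkflowTemplate == "" {
		return TriggerPayload{}, errors.New("workflowTemplate is required")
	}
	return p, nil
}

func main() {
	body := `{
	  "databaseId": "db-1234",
	  "workflowTemplate": "cypher-send-query",
	  "parameters": {"NEO4J_URI": "neo4j+s://example.databases.neo4j.io"}
	}`
	p, err := parsePayload(body)
	if err != nil {
		panic(err)
	}
	fmt.Printf("run %s against %s with %d parameter(s)\n", p.WorkflowTemplate, p.DatabaseID, len(p.Parameters))
}
```

Validating at the edge like this keeps malformed requests from ever reaching a cluster.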

And that’s the entire back end. No manual setup. No guessing where a database is located. Just a lightweight, automated way to execute chaos tests at scale.

Mission Control: We Need You

A comic-style illustration showing Mission Control

So far, engineers can trigger our chaos tests via cURL requests, as long as they know that the option exists, what the endpoint is, and which parameters to send, on top of everything else they already need to remember. Ideally, new engineers should be able to run our tests quickly, with as little cognitive load as possible.

Fortunately, we already have a Backstage instance that our developers are familiar with, and which is easy to integrate with existing services. If you haven’t come across Backstage already, it’s an open source framework for creating developer portals that we highly recommend you check out. A bonus of using Backstage is that it has a large library of plugins that continues to grow rapidly. These can provide even greater functionality to the tooling you serve to developers. For example, we’ll use the Spotify RBAC plugin to quickly and easily control who can run our chaos tests.

A snapshot of Spotify Backstage’s scaffolder plugin

Scaffolder Plugin

A core part of the Backstage offering is its Scaffolder plugin. Ostensibly, it’s meant as a mechanism to create new software from predefined templates and provides several key features:

  • Engineers can define workflows that can be triggered within the Backstage UI, which will run in the Backstage back end.
  • Users can explore the available workflows within the Backstage UI, as well as see which team maintains them.
  • Each workflow can have a complex input form with validation, making it easy for users to make sure they’re sending the correct information to their workflow. The inputs can also use the Backstage Software Catalog (if you need to specify which component you want to run tests against, for example).
  • When the workflows run, users can see the logs in real time.
  • Users can see the state of different workflows, including which ones they triggered.
  • Backstage maintainers can add their own custom actions that any workflow template can reference. Since every action runs in the Backstage back end, this opens up a lot of possibilities.

A snapshot of the form in Spotify’s Backstage Scaffolder plugin

When all of these features are taken together out of context, the Scaffolder sounds less like a tool for templating software and more like a multi-purpose workflow tool. And that is exactly what we treat it as at Neo4j. In fact, we rebranded it entirely, so everyone at Neo4j knows it as Aura Actions rather than the Scaffolder.

Most of the features listed above are really useful for managing chaos tests. We want engineers without a background in cloud technologies to quickly trigger tests, knowing that the inputs to their workflows have been validated. We also want developers to be able to discover the chaos tests for themselves, without us having to do too much sign-posting. And when a chaos test fails, we want developers to see what went wrong in the logs for their workflow.

Extending the Scaffolder

If you want to add your own custom actions to the Scaffolder, you can easily do so with the Backstage CLI tool by running yarn backstage-cli new and selecting scaffolder-module (the Backstage docs go into more detail than we will here).

Image of a CLI showing how one can create new plugins

Once you’ve got your new module, you can go to the provided example action and replace it with your own implementation.

You can see in our example below that we’re calling the createTemplateAction function from the scaffolder node plugin and passing in the definition for our action. We give the action a unique ID (we usually follow the same naming convention as the built-in actions, which uses colons to separate category and subcategory) and define the schema that the action’s input should follow.

Code example showing how to create a new Scaffolder integration

Backstage invokes the handler function, which contains our business logic. In our case, it’s quite simple, since it just calls out to the API service mentioned earlier. There’s nothing fancy about our request just because it’s within a Backstage action. A simple helper function extracts the Argo Workflow ID from our API’s response and uses it to make a subsequent request for that run’s logs. These logs can be displayed back to the user by simply passing them to the output() method of the handler’s context argument.

In our example, we’ve set up the action so it returns all the logs in one go, but realistically, you’ll probably want to stream the logs back as the Workflow runs:

Code example showing how to create a new Scaffolder integration

In our case, we’re using a microservice to create the actual workflow resources in K8s, but if your Backstage instance can reach the K8s API server, it could also apply the resources directly using the JavaScript K8s client.

Summary

Let’s have a quick overview of the full picture showing how chaos testing works within Neo4j Aura.

At the heart of this system is Backstage, our internal developer portal, which acts as the entry point for engineers to trigger chaos tests without needing to interact with Kubernetes directly.

An overview showing Planet Earth, a rocket representing a new feature, a satellite representing a microservice, and the Moon representing a Developer Environment where features are sent.

Step-by-step execution:

  1. An engineer initiates a chaos test in Backstage.
    Backstage, using the Scaffolder plugin, sends a POST request to our microservice. This request contains all the necessary parameters, including the database location and the name of the referenced Workflow Template.
  2. The microservice forwards the request to the correct Kubernetes cluster.
    It creates an Argo Workflow resource in the cluster where the target database resides. The Argo Workflow Controller picks up this resource and starts executing the workflow.
  3. Argo executes the chaos test.
    The workflow overloads the database with particular Cypher queries, deliberately trying to push it into an unhealthy state. After the test, it checks whether the database remains healthy and records the results.
  4. Logs and results are collected.
    The logs from the test are delivered to a central location, making it easy to track execution details. The microservice then returns a link to these logs to Backstage, allowing the engineer who triggered the test to monitor its progress without needing direct Kubernetes access.

Where Next?

Image of a firefighter
Source: Pexels

Bring on more chaos!

We’re already exploring the idea of introducing random chaos events at runtime using this same method, similar to many out-of-the-box chaos testing tools. Some of you may be familiar with tests where you don’t know in advance what will go wrong: maybe a pod gets deleted, maybe a deployment scales down unexpectedly, or maybe a cloud resource disappears. With Argo Workflows, all of these scenarios are possible if set up correctly.

Looking ahead, we plan to enable engineers to run the same load tests and chaos tests against our production environment.

We’ve already made it possible for engineers to write their own chaos tests, allowing them to simulate and validate failure scenarios specific to their features.

Another step we’re considering is automating our internal fire drills using the same Backstage-to-Argo Workflows setup. In the future, when a new engineer joins, they could trigger a fire drill at the end of their induction. This would run in their own development environment, simulating pre-scripted failures. Argo Workflows would guide them through the process and provide feedback on Slack as they progress.

By continuously improving our chaos testing approach, we’re making Neo4j Aura more resilient while keeping the process simple and accessible for engineers across different teams.


From Click to Chaos: Chaos Testing Within Neo4j Aura was originally published in Neo4j Developer Blog on Medium.