
after acceptance: reason about what your system is producing with data testing

tl;dr

Modern software development allows us to prove that new work is functionally complete. We write a set of executable specifications. We automatically execute them in the form of acceptance tests. When all the tests pass, we are done!

This approach is superior to what came before it, but it is not perfect. Testing frequently ends at the point of release; however, it is typically at this point that code first loads actual production data. This data is usually very different from the data generated in a test environment. Acceptance tests usually represent only a simplified version of user interactions. In production, data will almost certainly have been generated by code paths that have not been fully exercised in tests. And production data usually exists in far greater volumes than testing data. If the current version of the system is stateful, how do we know that future versions will consider that state valid and act upon it appropriately? This problem becomes even harder in a microservices environment, where different versions of a service may be running at the same time.

In this post we describe data testing. We define it as a type of integration testing that happens after the acceptance testing phase and that integrates code with data (usually production data) before that code is released to production. Data testing allows us to write a broad set of assertions on the interaction of our code with production data, catching a whole category of potential bugs before the code hits production.


It's 2017. You're asked to implement a feature in your team. The feature is distilled into a story with acceptance criteria. You start by writing some honest end-to-end acceptance tests and then test-drive the development of the implementation. An iteration later, all the tests are passing and the feature goes into production. Where it proceeds to fail. Maybe exceptions are thrown left, right and centre. Sound familiar? The development process above sounds almost textbook, so what could have gone wrong?

setUp to fail

Let's assume we're developing a piece of software for retail banking. What could have gone wrong with our release that our acceptance tests in our CI environment did not catch?

making our data invalid

Let's assume that we model bank accounts as follows:

enum AccountType {
    PERSONAL, BUSINESS, PRIVATE
}

public class Account {
    private final String name;
    private final AccountType type;
}

Now, our bank decides to get out of the private banking business. We then proceed to engage in one of the most satisfying experiences a developer can have: deleting code. Our code no longer supports the creation of PRIVATE accounts: we delete the enum value and its supporting code. And of course, any acceptance test that used to exercise private banking accounts goes too. Hey, we don't support this type of account any more.

Except that we forgot that there are still private banking accounts in production, so as soon as we release the code, reporting starts failing with exceptions!
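
To make the failure concrete, here's a minimal sketch of how loading such an account might blow up, assuming (hypothetically) that the account type is persisted as a string and mapped back with Enum.valueOf:

// hypothetical mapping code: the account type was stored as a string by an
// earlier release that still knew about private banking accounts
String storedType = row.getString("account_type"); // e.g. "PRIVATE"

// valueOf throws an IllegalArgumentException for any value that no longer
// has a matching enum constant, so reading an old private account fails
AccountType type = AccountType.valueOf(storedType);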

nobody expects that value there!

In our domain, we assume that every account has a name. That's the name you see when you log in to your online banking portal. Armed with this assumption, we add an equals() method to our Account. Unfortunately, one of the previous revisions had a bug that under some circumstances created accounts with null names; we fixed that bug a long time ago, so we can no longer create null-name accounts in our CI environment. Or perhaps an account name was optional in the distant past. Either way, there are now accounts in production with null name fields that are impossible to create in our CI environment. As soon as we release our code and equals() is run on the right account, it fails with a NullPointerException.
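
For illustration, a minimal sketch of what that equals() might have looked like (a hypothetical implementation, not a prescribed one):

@Override
public boolean equals(Object other) {
    if (this == other) return true;
    if (!(other instanceof Account)) return false;
    Account that = (Account) other;

    // name.equals(...) throws a NullPointerException for the legacy accounts
    // whose name is null; Objects.equals(name, that.name) would have been
    // the defensive alternative
    return name.equals(that.name) && type == that.type;
}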

did our migration work?

Our bank is expanding! It used to be the case that every account was implicitly a USD account - but we are now going to support accounts in different currencies. We modify our account to look like this:

public class Account {
    private final String name;
    private final AccountType type;
    private final CurrencyCode currency;
}

We migrate (or perhaps have our data team migrate) all production accounts to have a default currency code of USD, as they were all created when the only supported currency was USD. How do we actually know that our migration worked? How can we rely on all our accounts having a non-null USD currency code? After all, we cannot create null-currency-code accounts in our CI environment.
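
For illustration only, the back-fill itself might be as simple as the following, assuming a relational accounts table and a newly added currency column (both names are hypothetical):

// run once against production: every account created before multi-currency
// support was implicitly a USD account
try (Statement statement = connection.createStatement()) {
    statement.executeUpdate(
            "UPDATE accounts SET currency = 'USD' WHERE currency IS NULL");
}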

does my heap look big in this?

As with any bank, risk management is crucial. Up until now, the business has been spending hours preparing a daily risk report for people with large loans and overdrafts. Technology offers to make their life simpler by generating that report automatically, nightly. We get a story, break it down, storm on it, acceptance test it; and it all works brilliantly.

Well, it turns out that production has orders of magnitude more accounts than our testing environment. And while the report is produced in seconds in CI, it takes over 20 hours in production (or worse, we run out of memory!) meaning that the feature just doesn't work.

This is just a small sample of problems that traditional CI may miss. So how can we avoid them?

acceptance tests are not enough

Software tests are the best means of defence we have against bugs and regressions. And in modern software development, passing tests around a new feature is how we know that the feature is done and working. But standard tests may not be enough. Here's what a typical test for our banking software could look like, assuming it's an end-to-end test:

@Test
public void getsTheCorrectBalanceAfterADeposit() {
    //given
    Account account = new Account();

    //when
    account.deposit(200);

    //then
    assertEquals(200, account.getBalance());
}

Basically, given an account, when I deposit $200 into it, then the balance of the account is $200. This is nice and simple, right?

software is complicated

Unfortunately, real systems are rarely that simple. Take an example system that has 200 boolean variables. Even such a trivial system has 2^200 possible states it can be in. This is a huge number, larger than the number of stars in the known universe. A system also does not stay static; we all like frequent releases, and for very good reasons, but frequent releases also mean frequent exposure to change and therefore new risk.

Our hypothetical system's 200 booleans are interpreted in a certain way at a particular release. So as if 2^200 possible states wasn't hard enough on its own, we now need to add a time dimension to that complexity. This is because most systems produce durable data, and that data needs to be read and dealt with sensibly in any future release. After all, you wouldn't want to lose the money in your bank account when your bank releases a new version of its software.

continuous delivery of stateful systems is hard

Our given, when, then test above shows us that the system works when we've just created a bank account and deposited $200. In reality, the balance in your account is the result of a sequence of operations you've instigated over a long period of time, actions other actors (e.g. direct debits) have performed, system releases and data migrations. In other words, real data (your bank account) is the result of a very complex set of interactions that testing is unlikely to replicate. Testing cannot be exhaustive, given that our systems are far too complex and produce and amend data over long time periods and continuous releases.

Durable data over multiple releases means that data is produced and added to by multiple versions of your software. A CI environment and continuous delivery pipeline by definition test a single snapshot in time: the software revision / commit that we're deciding whether to release to production. So even if you somehow think you can do exhaustive testing, it's actually impossible to address the time dimension using standard CI acceptance tests. To address it, you'd need to run the acceptance tests for your nth release on top of data generated by the tests for the previous n - 1 releases, which clearly does not scale. This means that, no matter what you do, your new release is going to touch data that was generated by sequences of operations your CI environment has never tested!

what we'd like to test

There are four data attributes we'd like to reason about: validity, invariants across releases, migrations, and volume.

Validity is an indication of the system's correctness. An invalid state can lead our system to catastrophic failure (e.g. exceptions) or to undefined behaviour that our tests have not predicted. There are two aspects to validity:

  1. All data that could be loaded at runtime should load without error. The data should be broadly sane (e.g. non-null). This tests for things like unexpected (even corrupt) data that the system can no longer read correctly.
  2. Each application has its own business-specific criteria for what it means to be valid, which can be asserted on. In our fictional banking software, it may be that no account can have a balance below its (negative) overdraft limit.

These tests could be expressed rather simply:

@Test
public void allAccountsAreReadable() {
    final AccountDao accountDao = new AccountDao();
    for (User user : users) {
        final Account account = accountDao.loadAccountFor(user);
        verify(account);
    }
}

@Test
public void allAccountsHaveABalanceThatsAboveTheOverdraftLimit() {
    final AccountDao accountDao = new AccountDao();
    for (User user : users) {
        Account account = accountDao.loadAccountFor(user);
        assertTrue(account.getBalance() > NEGATIVE_OVERDRAFT_LIMIT);
    }
}

The state of the system before and after a release should be the same (with some exceptions, of course). Given that data is the result of a complex set of operations that we have no hope of fully replicating in our CI environment, we can't really make concrete assertions on what individual data values should be. To put this another way, if you're a customer of our fictional bank, we can't write an assertion on exactly what your balance should be. But what we can do is this: if revision X of our banking software is out in production and we're taking revision Y through our continuous delivery pipeline, then revision X should give the same balance as revision Y in a controlled environment. In code, this could look as follows:

@Test
public void accountBalanceIsMaintained() {
    final AccountDao accountDao = new AccountDao();
    for (User user : users) {
        Account account = accountDao.loadAccountFor(user);

        BigDecimal productionBalance = dataOfProductionRevision.getBalance(account.getId());

        assertEquals(productionBalance, account.getBalance());
    }
}

More generally, invariant testing involves the following:

  1. capture invariants
  2. perform action
  3. verify invariants

In our case, the action is to deploy the new version of the system in a controlled environment.
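
In pseudo-Java, the whole cycle might look roughly like this (the helper methods here are hypothetical):

// 1. capture invariants from data produced by the production revision
Map<String, BigDecimal> balancesBefore = captureBalances(accountDao);

// 2. perform the action: deploy the candidate revision (running its data
//    migrations) against the same data in a controlled environment
deployCandidateRevision();

// 3. verify that the invariants still hold
Map<String, BigDecimal> balancesAfter = captureBalances(accountDao);
assertEquals(balancesBefore, balancesAfter);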

At some point in a system's life, we're going to change the way its data is stored. To put it another way, a continuously delivered system cannot have a data representation that's set in stone. One way or another, we're going to have to migrate our data from its current representation in production to a different one. If our bank is expanding to include accounts in different currencies, as explained above, how can we make sure that the migration of all existing accounts to explicitly USD accounts happens correctly? Well, here's a test:

@Test
public void allAccountsAreInUSD() {
    final AccountDao accountDao = new AccountDao();
    for (User user : users) {
        Account account = accountDao.loadAccountFor(user);
        assertEquals(CurrencyCode.USD, account.getCurrency());
    }
}

There's a set of issues that only manifest themselves when code interacts with production-sized sets of data. This is because, while acceptance tests create the state that they need to manipulate and assert on, production usually has orders of magnitude more data. Issues encountered include running out of memory or taking too long to perform an operation. Typically, most developers are not aware of how much data exists in production, and even if they are, that knowledge can quickly go out of date in a rapidly growing business.

Let's take as an example a mission-critical risk report that needs to be produced nightly. If this is important enough, we might want to see how long it takes before shipping it to production - this assumes we have production-sized hardware in CI to test on:

@Test
public void generatesARiskReport() {
    final AccountDao accountDao = new AccountDao();
    final RiskReportGenerator riskReportGenerator = new RiskReportGenerator(accountDao);
    time(riskReportGenerator::make, 60, TimeUnit.MINUTES);
}
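
The time() helper above isn't part of JUnit; here's a minimal sketch of what it might do, assuming JUnit 4's assertTrue:

// fails the test if the task takes longer than the given limit
static void time(Runnable task, long limit, TimeUnit unit) {
    long start = System.nanoTime();
    task.run();
    long elapsedNanos = System.nanoTime() - start;
    assertTrue("took longer than " + limit + " " + unit,
            elapsedNanos <= unit.toNanos(limit));
}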

So what's stopping us from writing these kinds of tests? The missing link is data - real, production data. Despite agile development techniques, continuous delivery, iterating over requirements and implementation, test-driven development, CI and the like, our code tends to touch production data for the first time at the point of release. This is as late as possible and smells an awful lot like waterfall development.

if it's painful, do it early and often

If we were able to get production data into a CI environment, we could import it into a database or a running system and write these kinds of tests.

Unfortunately, this would probably be illegal. If you're a customer of our fictional bank, you wouldn't want your details to be available in a CI environment for all devs to see. But what if we could get something as close as possible to production data into CI, but something without personally identifying information?

data sanitisation to the rescue

The solution to this problem is to include a data sanitisation service and ship it to production along with the rest of the code. The service, a cleanser, is responsible for the following:

  1. reading the production data
  2. removing or anonymising anything that is personally identifying or otherwise sensitive
  3. producing a sanitised copy of the data that can safely be imported into CI

In our banking example, assuming we're using a database, the cleanser would, for instance, replace customer names with generated ones and mask or remove any other personally identifying details, while leaving balances, account types and overall data volumes intact.

The cleanser is an integral part of our application and should be deployed with it. This is because new versions of the application may include new types of sensitive data, and the cleanser needs to be updated to sanitise them.
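
As an illustration only, a cleanser for our banking example might look something like this, assuming Account exposes a suitable constructor and getters; the anonymisation rule is hypothetical and would need to match whatever your domain considers sensitive:

// produces a sanitised copy of an account that is safe to import into CI:
// structure and volumes are preserved, personal details are not
public class AccountCleanser {

    public Account cleanse(Account account) {
        // replace the real customer name with a generated one; everything
        // without privacy implications is kept as-is
        return new Account(fakeNameFor(account), account.getType());
    }

    private String fakeNameFor(Account account) {
        // deterministic, so repeated runs produce the same dataset;
        // Objects.hashCode copes with the legacy null-name accounts
        return "customer-" + Integer.toHexString(Objects.hashCode(account.getName()));
    }
}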

the migration blues

A cleanser will allow us to ship our data from production to our CI environment. But can the latest version of our software in CI read that data? It may be able to, but what if the latest version serialises its data differently because, for example, we now support multiple account currencies in our retail banking software? If only we could migrate the data to the latest schema. The solution presents itself:

Data is owned by the application, and therefore data migration is integral to it. This means that the migration process is owned by the application too: new migrations should ship with the application and be performed with the deployment of every new version (a minimal sketch of this idea follows the list below). This allows for a CI process that does the following:

  1. import the sanitised data that the cleanser produced
  2. capture invariants (e.g. account balances)
  3. migrate it to the latest version of the application
  4. run the tests
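
Here's a minimal sketch of what application-owned migrations could look like; the interface and runner are hypothetical, and libraries such as Flyway or Liquibase implement the same idea:

// each release ships the migrations it needs; they run as part of deployment
public interface Migration {
    int version();
    void apply(Connection connection) throws SQLException;
}

public class Migrator {

    private final List<Migration> migrations; // in version order, shipped with the app

    public Migrator(List<Migration> migrations) {
        this.migrations = migrations;
    }

    // applies every migration newer than the version recorded for the deployed data
    public void migrate(Connection connection, int deployedVersion) throws SQLException {
        for (Migration migration : migrations) {
            if (migration.version() > deployedVersion) {
                migration.apply(connection);
            }
        }
    }
}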

volumes and frequency

This approach sounds good in principle, but how often should you do it? When should the data be refreshed? How frequently should the data tests run in CI? What do you do if you have terabytes of data?

As is often the case with difficult questions, the answer is: it depends. In principle, we want to do these things as often as possible. This includes refreshing data and running the tests in CI. After all, frequent refreshes and test runs stand a chance of catching bugs earlier. But often we have to be pragmatic. Here are some techniques to consider:

a note on databases

We've deliberately avoided using the word database in most of this article. These techniques are equally applicable to databases and to other forms of data serialisation.

a note on staging

If you've implemented this approach, consider using it in your staging environment too. Once you have real (sanitised) production data, why not import it into your staging environment prior to release? That way you end up with an environment that's as close as possible to production, both topology- and data-wise, running the code that you wish to release to production.

conclusions

Data testing can help you catch a whole new category of bugs and errors before your users do. We hope that this article has outlined some approaches for doing this. This will be talked about (or has been talked about, depending on when you read this) in more detail at QCon London in March 2017 and at JAX Finance in April 2017. Hopefully see you there (or maybe you were there already :-).

changelog