Automated Testing for the Data Warehouse

Case: we've integrated two sources of customers. We want to add a third source.

Q: How do we at the same time know that our current integration and solutions will continue to work while at the same time integrating the new sources?

A: Test it.

Q: How do we get faster deployments and more stability?

A: Automate the tests, so they can run continuously.

When integrating data, especially in agile environments, our already-integrated data is very likely to get some more integration. So WHY does automated testing happen so rarely within Data Warehouse projects?

Very-high-level (conceptual) overview

In essence, testing a Data Warehouse isn't that complex - there are only four steps:

  1. Set up a pre-defined environment
  2. Insert data at the one and only place data1 can enter the system
  3. Perform ETL to bring the data to the places where users read data from our BI system:
    • Data Marts
    • Cubes
    • Reports
  4. Test if the results are like the expected results
high-level-testing-process

The high-level automated testing process

Still-high-level (technical) overview

Technically, this can be achieved in four of the same steps. When using the Microsoft BI stack:

  1. Check out the latest sources
  2. Use Continuous Build and Continuous Deployment to set up the environment2
  3. The technical implementation of inserting test data is mainly about how to store the data-to-be-inserted. Some ideas:
    1. Use a insert scripts linked to test cases, for example:
      • SQL Insert scripts inside a database table
      • SQL Insert scripts inside a folder (folder name = test id)
    2. Use source CSV or Excel files inside a folder (folder name = test id)
    3. Use source database tables with sample data (equal to staging table + extra column for test id)
    4. Use some key-value store inside (for example) MDS.
  4. ETL is "just" calling on the ETL packages doing the internal data flow. If the database is empty except data needed for the test, this should be a pretty fast process. Keep an eye on your architecture: if ETL introduces new sources not flowing via staging, testing becomes way harder.
  5. For automated testing I'd recommend using NBi, which is essentially NUnit for the MS BI stack.

After each test, the environment needs to be reset. The easiest (but most time-consuming) way is to start at step 1 again. An easy shortcut is to make a backup of all databases directly after deployment. Then, after each test restore these "virgin" databases, allowing a fast insert-ETL-test cycle.

high-level-testing-process-backup-restore-cycle

High-level testing process, improved for performance by using backups

It starts with architecture

Note that all data enters the system via a few pre-defined points. This is mainly an architectural thing, but if we let this go testing will become exponentially harder. Hans Hultgren puts it like this:

One good way to visualize the EDW from a data movement perspective is that it is a One Way flow. Data cannot seep into the middle of the pipe. Likewise, no data leaks out of the middle of the pipe either.

Think about it: ultimately we're testing not for ourselves, but for our users. The inner workings of the DWH don't matter that much to our users, so let's keep them out of there - and keep the integration tests out of there as well. This gives us the freedom to change the internal implementation details as we like or need, as long as the external interfaces stay the same and the data stays right4.

one-way-flow

The One Way Flow: EDW as a pipe

This "clear cut" about where data can enter the DWH, and where our users can access delivered data means we need to think hard about how on-the-fly input can be processed: using external web services (like computing travel distance), doing supervised cleaning (like a Data Steward overseeing DQS) cannot "just" insert all data in the staging area5. Options could be a preprocessing layer (before "warehousing" the data or applying Business Rules), or well-defined interfaces that can be replaced by mock easily during the test process.

With regards to responsiblities, it might be that the BI system automatically sends an e-mail to a manager once a KPI is in deep red. This is something we definitely should test (probably we want to automate it, too), but isn't the core responsibility of the Data Warehouse. So in the architecture we'll want to have this e-mail sending system as a separate process acting on data already delivered by the Data Warehouse, thus enabling us to isolate the Data Warehouse test from the e-mail sending test.

I'm planning to do a few more posts on Data Warehousing and automated testing, but I'm curious about opinions first. Especially if you attempted or thought about automated testing for the Data Warehouse, but ran into problems with the execution. Please, speak up in the comments below - or send me an e-mail!6

Update (2016-09-27 09:05) - see for comments also these LinkedIn Groups discussions about this blogpost:

Founder of this blog. Business Intelligence consultant, developer, coach, trainer and speaker at events. Currently working at Dura Vermeer. Loves to explain things, providing insight in complex issues. Watches the ongoing development of the Microsoft Business Intelligence stack closely. Keeping an eye on Big Data, Data Science and IoT.

Leave a Reply

Next ArticleAzure ML Thursday 7: xgboost in Azure ML Studio