Implementing ‘Always Encrypted’ in an ETL process
Storing and processing personal data
In case your data warehouse processes and stores sensitive data such as social security or credit card numbers you could implement a mechanism to secure this data. In case you store the data on a MS SQL Server database, one way to do this is by using Always Encrypted. If you’re interested to read more about Always Encrypted and the other possible SQL Server security measures check out the whitepaper Microsoft has written about GDPR and SQL Server. This post will focus on Always Encrypted and provides an example of an Always Encrypted implementation in a simple data warehouse environment.
Encrypt data using Always Encrypted
Always Encrypted stores the encrypted data in the database (both in memory and on-disk). The data can be decrypted by a client application using a .NET driver and access to the right certificate. An overview of Always Encrypted and all details can be found here. SQL Server Analysis Service (SSAS) or SSIS can also act like a client application and decrypt data using the same method. For SSAS this is explained by Kasper de Jonge, and for SSIS there is a blogpost on MSDN.
Limitations of encryption
There are two ways of encryption: randomized and deterministic. Deterministic encryption always generates the same encrypted value for a given value. Using deterministic encryption, you are still allowed to use equality joins and searches (=) in the database. All other joins and searches (>, <, <> and like) won’t work. Randomized is the most secure option but prevents searching, grouping, indexing and joining on encrypted columns. For both types it is not possible to apply manipulations on column level (for example LEFT, UPPERCASE, CAST, CONCAT and aggregated calculations).
Because SSIS is able to decrypt the data we are able to apply transformations after loading the data. Data is loaded from our source table using the ADO.NET connector, decrypted, transformed, encrypted and inserted into our target table, as shown below:
Joining on encrypted columns can be done in the same way by using Lookups or – in case of deterministic encryption - by using a SQL query on database-level.
Example of an implementation
In our example implementation we have a staging area with a copy from our source data. This data is loaded into our history area. From there on the data is transformed to a dimensional model and moved to the datamart area. All ETL is handled by SSIS.
Source to staging
In this example the source system data is decrypted beforehand. This data is loaded into the staging area. In the staging area the relevant columns are encrypted and the encryption key is stored.
Staging to History
The data is copied to the history database. If joins between tables are needed, these can be handled either by SQL queries or by transformations in SSIS. Other transformations such as LEFT, UPPERCASE, CAST etc need to be executed by SSIS transformations. The relevant columns are encrypted using the same key as in the staging database. These keys can be duplicated using t-SQL or PowerShell. The keys need to be duplicated otherwise the inserts will not work.
History to Datamart
Between history and the datamart the same logic can be applied as described above. The transformations to a dimensional model can be done either by joins or by SSIS lookups.
I’m very interested in other cases of Always Encrypted implementations in ETL / Data warehouse environments. If you have any thoughts on the subject, please let me know!