This blog contains a commentary on the GDPR, as Microsoft interprets it, as of the date of publication. The tools and services referenced herein are not designed to ensure GDPR compliance but to assist you and your organization with your data classification and categorization, an important step in the journey to compliance. The application of GDPR is highly fact-specific and not all aspects and interpretations of GDPR are well-settled. As a result, this blog is provided for informational purposes only and should not be relied upon as legal advice. We encourage you to work with a qualified legal professional to discuss the meaning and applicability of GDPR and how best to ensure compliance for you and your organization.
MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED, OR STATUTORY, AS TO THE INFORMATION IN THIS BLOG. This blog is provided “as-is.” Information and views expressed in this blog, including URL and other Internet website references, may change without notice. This blog does not provide you with any legal rights to any intellectual property in any Microsoft product.
About the Authors: Alice Kupcik is a Senior PM in the SQL Database Security team at Microsoft and is passionate about all things around data privacy & protection. Tony Smith is an Azure Data Solution Architect with a primary focus on analytics & data platforms
The GDPR sets a new global bar for privacy rights, security, and compliance to protect the privacy rights of individuals and will require significant operational changes by organizations. The requirements around the right of individuals to access, rectify, erase (‘be forgotten’), object to processing personal data & the need to strongly secure that personal data may lead to challenges in traditional data movement processes prevalent in analytics and data warehousing.
This post will discuss those challenges and illustrate how the GDPR may place new restraints on data staging & transformation. It will then present an alternative architecture which utilizes several capabilities within Azure Analysis Services to address those challenges in a unique way.
Fundamentally, the in memory architecture & data shaping capabilities of Azure Analysis Services allow high performance, flexible reporting whilst potentially allowing personal data to remain in place at the System of record. In turn this will hopefully help support organisations on their journey towards GDPR compliance.
GDPR – a brief summary of responsibilities
GDPR is a complex and far reaching regulation with multiple requirements.
Under the GDPR, individuals have the right to access their personal data, correct inaccuracies in that data, have their personal data erased upon request (the “right to be forgotten”), object to the processing of their personal data and generally have a finer level of control. There are also complex privacy rules around who can access personal data, the justification for that access & how long the data can be retained. Further, organizations will likely need to think about classification of that data as the GDPR has multiple aspects. This post on the GDPR & Azure Data Catalog explains that classification process in more detail and shows how ADC can be used to capture the necessary categorisations across the organisation.
Furthermore, organizations will likely have to ensure appropriate policies are applied to personal data sets. Again this post goes into more detail however applicable policies may include some of the following:
The key point however is that policies and processes may need to be in place for every data set held anywhere in the organisation if that data set includes personal data. For example, if an individual makes an erasure (“right to be forgotten”) request it may be the case that their information must be removed from the customer system of record, any downstream systems holding their customer data and any staging or intermediate area in between. Unfortunately, for traditional Analytics architectures, there could be a lot of downstream systems, and a lot of intermediate stages.
GDPR & data replication, the problem statement
Many enterprise architectures will be in the process of moving to a services or micro-services model which allows customer (and therefore personal) data to be fetched as needed via a service interface over the Enterprise system of record for that data. However, although this approach works well for OLTP scenarios the same isn’t generally true for analytics environments. In almost all cases customer data may and will take part in large scale join operations – fundamentally the 2 data sets need to be brought together in some form. Furthermore, Customer data can be reasonably complex, requiring significant flattening & denormalising before it can be used, for example, in a star schema query.
Finally, the data movement path of personal data, in fact any data sets used in analytics, can be multi stage and complicated.
Above is a simplified customer master schema, normalised over 3 tables. As customers can have multiple addresses (and similarly 2 or more customers can have the same address) there’s a many to many relationship in place. There are analytics tools which can deal with this but typically some work would be done to turn this into a reporting structure.
In a standard (pre GDPR) data warehousing/analytics environment the normalised structures might be landed in a landing area or data lake. That data may go through one or more transformations and stages before being transformed into a dimension table as part of a star schema. In turn that customer dimension table may be pushed to 1 or more data marts for focused analysis.
At the same time there may be a number of data analysts who prefer large flattened structures to work on. In this instance for example customer, product, sales & revenue data may all be combined into a single wide row of denormalised data which is then exported to be used in Excel or other desktop tool.
The GDPR Personal Data issue
In the post GDPR world organizations will likely find they need a full catalog of all personal data sets, and this will potentially include the multiple copies of personal data above. Each version may potentially need a policy to respond to data subject requests to access, rectify, erase, object to processing, and move any personal data & each must also have retention & access policies. In the case above for example a given customer could be persisted in 7 separate places.
Finally, every stage and data copy within that stage introduces an additional risk of GDPR violation. A developer may take a copy of a staging table for development purposes and a leak could occur anywhere customer data is stored. A user may take a copy of data containing interesting metrics without appreciating that the flattened table also contains customer data.
Clearly, following traditional practices of staging and duplicating personal data may in fact increase exposure to potential GDPR regulation violations and may also increase compliance overhead. However, against this there is still the very real need to shape & transform customer data, and the need to join that data with other datasets for meaningful analytics and other purposes. What are the alternatives?
Azure Analysis Services & the In Memory model
This is where Analysis Services (AAS) can play a significant part as several key features can be utilised to address some of these issues:
- Data Shaping – AAS has considerable data shaping capabilities.
- Multi source capability – AAS can reach out across multiple data sources to create a semantic layer which spans those sources in a single logical model.
- In memory model – AAS tabular mode operates using an in memory model. This has 2 distinct advantages:
- Speed & performance – AAS can join customer data from one source with fact & dimension data from others to create a highly performant in memory representation.
- No persistence in the traditional sense – AAS does offload to blob storage however this is not accessible by any Azure User. For all intents and purposes it does not persist data in any accessible way beyond the in memory model.
In practice this makes the following scenario possible.
The above solution (as the title of this blog suggests) leaves the personal data in place. The data could be tagged as such by Azure Data Catalog and this would help enforce policies around secondary copies.
2 semantic models are then created across 2 distinct data marts/data warehouses (which contain no GDPR data) and the central GDPR system of record. The models are then processed as needed, and in particular the personal data sections of the models are refreshed completely from the source each cycle.
The advantage here is that all data shaping/staging/flattening is carried out by the Analysis Services models – there are no intermediate staging tables and a copy of the original Customer schema is not moved to the Data Warehouse/Data Mart.
Furthermore, performance is taken care of by the in memory model which carries out any heavy lifting at processing time, not query time. This same in memory model also means that the major policies may not be deemed applicable to the Tabular representation but instead are owned by the Customer System of Record.
- The “right to access” still technically applies however the requirement can also be served by the underlying data source as the tabular model reflects that source (note, the exception here might be if metrics which can be tied to the customer are generated within the tabular model)
- The right to rectify & erase is already addressed by rectifying or erasing at the central source and then refreshing the personal data in the tabular model.
- Security access policies are still needed however there’s now far fewer data sinks to be covered. Furthermore, masking can be applied on a profile by profile basis to hide customer data from certain tabular model users.
There may of course be some cases where an organisation decides to apply full GDPR policies (because for example customer scoring data is generated in the tabular model in some way). However the fact remains that even in this instance there are far fewer points of failure & data sinks to address using this method.
Leaving data in place as a means of addressing GDPR compliance requirements at first seems like a radical and potentially naive approach, all normal wisdom would require 2 data sets which need to be joined using SQL be moved to the same physical location. However the data shaping capabilities and in memory model of Azure Analysis Services make it ideally suited to exactly this approach.
Potentially the idea could be extended slightly (the Customer System of Record may not hold historical customer changes and so AAS may need to connect to the SoR and an additional historical change store) however the principle would be the same. One or more in memory representations of personal data can be deployed with the knowledge that any correction of customer data happens once, in one place, with a simple 1 click refresh correcting or removing that customer from any downstream models.
It is true that AAS provides an OLAP like model rather than a strict RDBMS type interface however this can also potentially be an advantage. AAS can apply selective masking at differing levels and for different profiles allowing the model to be used by users with varying access rights in accordance with their permissions. Add this to the rich security model within Azure Analysis Services & it becomes an ideal candidate for dramatically simplifying the GDPR picture for analytics.