This blog contains a commentary on the GDPR, as Microsoft interprets it, as of the date of publication. The tools and services referenced herein are not designed to ensure GDPR compliance but to assist you and your organization with your data classification and categorization, an important step in the journey to compliance. The application of GDPR is highly fact-specific and not all aspects and interpretations of GDPR are well-settled. As a result, this blog is provided for informational purposes only and should not be relied upon as legal advice. We encourage you to work with a qualified legal professional to discuss the meaning and applicability of GDPR and how best to ensure compliance for you and your organization.
MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED, OR STATUTORY, AS TO THE INFORMATION IN THIS BLOG. This blog is provided “as-is.” Information and views expressed in this blog, including URL and other Internet website references, may change without notice. This blog does not provide you with any legal rights to any intellectual property in any Microsoft product.
About the Authors: Alice Kupcik is a Senior PM in the SQL Database Security team at Microsoft and is passionate about all things around data privacy & protection. Tony Smith is an Azure Data Solution Architect with a primary focus on analytics & data platforms.
GDPR is on the minds of IT managers & compliance officers and will be an ongoing requirement for many organisations. One of the initial challenges, however, is simply the identification & categorization of GDPR impacted data sets in disparate locations across the enterprise. Furthermore, it may not simply be a case of flagging data as being in or out of scope for GDPR. Instead, organizations may decide to define multiple levels of categorization and, in turn, apply these categories to data at the source, table & even attribute level. Finally, how do we scale the process of identification, logging and categorization of the potential multitude of relevant data stores?
Enter Azure Data Catalog, Microsoft’s solution to understanding & cataloging the data estate of any organization. Designed to promote self-serve data discovery, it has several out of the box capabilities which intersect with & support parts of the GDPR process. Furthermore, as a SaaS solution, it has a minimal footprint and low start up cost, requires minimal training and can therefore be rapidly deployed at scale.
The previous post in this series of 3 explains in brief what is involved in GDPR classification and presents a simplified taxonomy as an example. This final post will cover implementing that taxonomy in Azure Data Catalog following a step by step process.
Please note these posts are not intended to provide advice on the legal requirements of GDPR. Organizations will need to assess for themselves how they will meet their GDPR responsibilities.
As a quick recap, the 1st post in this series explained the basics behind GDPR classification and how Azure Data Catalog (ADC) can help provision that classification. The 2nd post then presented a simplified GDPR Taxonomy and associated concepts. This final post will provide one example of how that taxonomy can be implemented within ADC. Incidentally, although this post covers GDPR, the essential concepts can be used for any classification taxonomy of data within ADC.
GDPR – key concepts
Briefly, the simplified GDPR classification presented in post 2 falls into broad categories which in turn have associated policies. Categories can also have extended attributes as well as child data attributes as shown below in the example taken from the 2nd post.
Building the Taxonomy
The major level in the taxonomy is the category, which in our example holds 8 entries. Policies are attached to categories; however, we will connect the 2 using an alternative method (see later). Each category has additional attributes around scope, communication preferences, rights & exceptions. This, however, creates an issue, as the glossary within ADC is a straightforward hierarchy which allows neither multiple parents for a term nor attributes on that term.
We do however have several options within ADC to record these additional attributes:
- Use one or more of these additional attributes as an additional level above or below the main category level
- Create additional taxonomy hierarchies
- Add the attributes into the “Description” for the term
- Create a separate policy document to hold these attributes
In this example we’ll use the first 3 of these options – the 4th will be used later for “policies”. However, any combination could be used for your company scenario.
In scope for GDPR Attribute
This element should be reasonably stable as we should be able to define with some confidence which categories are in or out of scope for GDPR. Furthermore, having this as a term in the hierarchy may slightly simplify tagging later.
Note: One feature gap in ADC is that you can only search for a term itself, not any terms which are children of that term. Therefore to search for “GDPR” tagged data the term “GDPR” must be attached to the data item. More on this later.
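To make this gap concrete, here is a minimal sketch in plain Python. It does not call ADC: the `PARENT` map, the asset names and both search helpers are assumptions made purely for illustration, and “Address Line” is a hypothetical column-level term.

```python
# Illustrative model only: nothing here calls Azure Data Catalog.
# Child term -> parent term, mirroring a single-parent glossary hierarchy.
PARENT = {
    "Customer Data (Personal)": "GDPR – In Scope",
    "Address Line": "Customer Data (Personal)",  # hypothetical column-level term
}

# Hypothetical assets and the terms attached directly to each of them.
ASSET_TAGS = {
    "Address.AddressLine1": {"Address Line"},
    "Address": {"Customer Data (Personal)", "GDPR – In Scope"},
}


def search_by_term(term):
    """Exact-match search: only assets carrying the term itself are returned."""
    return sorted(name for name, tags in ASSET_TAGS.items() if term in tags)


def has_ancestor(term, ancestor):
    """Walk up the hierarchy and report whether `ancestor` sits above `term`."""
    while term in PARENT:
        term = PARENT[term]
        if term == ancestor:
            return True
    return False


def search_including_children(term):
    """What a hierarchy-aware search would return (the behaviour we do not get)."""
    return sorted(
        name
        for name, tags in ASSET_TAGS.items()
        if any(t == term or has_ancestor(t, term) for t in tags)
    )


print(search_by_term("GDPR – In Scope"))             # exact match only
print(search_including_children("GDPR – In Scope"))  # would also include the column
```

The column only turns up in the second search, which is why the parent term has to be attached explicitly wherever we expect to search for it.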
Category Data Attributes
Of our 8 categories, 6 are in scope for GDPR and 2 are out. We can now add in the column level terms which will classify each attribute within a data set. Please note the terms below are an example subset; you will likely have many more attribute terms in your taxonomy.
This is our main GDPR taxonomy which we will map onto ADC shortly. However, we have 4 other attributes to work through.
Communication Preferences & Rights
These 2 classifications represent options for how individuals can interact with their held data, specifically whether or not “yes, you may communicate with me” needs to be recorded against this data set and whether or not rights to be forgotten etc can be applied. For the sake of simplicity we’ll create 2 near identical taxonomies:
Exceptions, however, are more free-form in nature, such as “Log Data is Exempt from Update”. These could be held separately in a policy; however, in this example we’ll add them directly into the term description (see below).
Policies govern cases such as minimum and maximum data retention and boundaries on geographical storage, and as such go beyond simple terms or notes and into full blown documents. Furthermore, there is likely to be a many to many relationship: some categories will share the same policy and each category will have multiple policies to attach. These policy documents likely exist or will exist on the corporate website, and therefore the simplest method for reference would likely be a simple web page per category which includes a list of links to the relevant policies for that category. A link to this web page can be included in the description for the category within the glossary.
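The many to many relationship can be pictured with a small sketch. Everything named here is an assumption for illustration: the policy names, the `intranet.example.com` base URL and the helper functions are hypothetical, and the code is plain Python rather than anything ADC provides.

```python
from collections import defaultdict

# Hypothetical policy assignments; the policy names are illustrative only.
CATEGORY_POLICIES = {
    "Customer Data (Personal)": ["Data Retention", "Geographical Storage"],
    "Sensor Data": ["Data Retention"],
}


def categories_by_policy(category_policies):
    """Invert the mapping to see which categories share each policy."""
    shared = defaultdict(list)
    for category, policies in category_policies.items():
        for policy in policies:
            shared[policy].append(category)
    return dict(shared)


def policy_page_lines(category, base_url="https://intranet.example.com/policies"):
    """Lines for the per-category web page: one link per applicable policy."""
    return [
        f"{policy}: {base_url}/{policy.lower().replace(' ', '-')}"
        for policy in CATEGORY_POLICIES[category]
    ]


print(categories_by_policy(CATEGORY_POLICIES))
print(policy_page_lines("Customer Data (Personal)"))
```

Keeping one page per category means the glossary description needs only a single link, however many policies end up attached to that category.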
Personal Data in Combination
There is one final, subtle categorisation to be applied. Some data sets do not constitute “Personal Data” in isolation but can become “Personal Data” when combined with other data sets, like 2 parts of a puzzle. There is no easy way in ADC to keep track of the matrix of dependencies which indicate which data sets should not be combined with which; however, we can at least record that a data set is “at risk”. A final hierarchy can be created which indicates whether this data can or cannot be combined with other data sets.
Implementing the Taxonomy in Azure Data Catalog
Now we have all of the elements we can start creating them in ADC.
We’ll begin by creating the 2 simpler rights hierarchies around communication and view/update/delete. For this we’ll create a “GDPR – Customer Rights” parent term.
We open the ADC for our organisation, select “Glossary” and then “New Term +”, then enter the above.
Note: the catch all term serves an additional purpose. ADC does not allow the top term in a Glossary to be used as a tag, so it can be useful to create a “parent” term.
We select “Create and New” and then add in the rest of the terms below this parent until we have the 2 hierarchies filled out.
Personal Data in Combination
This is a simple “Yes/No” hierarchy indicating whether or not a given data set can be combined with others to create identifiable “Personal Data”.
We’ll move on now to the main hierarchy we built up in the previous section. Again, we create the parent term and then add in the In Scope/Out of Scope level, the categories and the column level attributes.
Above is a portion of the taxonomy showing the “GDPR – In Scope” branch.
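Because each term is created beneath an existing parent, it can help to list the terms in creation order before entering them. The sketch below is plain Python and only an illustration: the root name “GDPR”, the column-level terms and the `creation_order` helper are assumptions, and only a slice of the taxonomy is shown.

```python
# Hypothetical slice of the taxonomy; root and column-level names are illustrative.
TAXONOMY = {
    "GDPR": {
        "GDPR – In Scope": {
            "Customer Data (Personal)": {"Address Line": {}, "Postal Code": {}},
            "Sensor Data": {},
        },
        "GDPR – Out of Scope": {},
    }
}


def creation_order(tree, parent=None):
    """Yield (parent, term) pairs so every parent is listed before its children."""
    for term, children in tree.items():
        yield parent, term
        yield from creation_order(children, term)


for parent, term in creation_order(TAXONOMY):
    print(f"{term}  (parent: {parent})")
```

Working through the list from top to bottom guarantees that the parent already exists in the glossary when each child term is added.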
Policies & Exceptions
Before we move on to adding the taxonomies onto actual data entities we will deal with the final 2 items, Policies and Exceptions. As mentioned, Policies can be held as separate documents on a share site with a web page pulling together links for a particular category. We can then embed a link to this web page in the description for the category. Similarly, we can either hold GDPR Exceptions as a separate policy or we can hold them as formatted text within the category. The latter is the approach we’ll take here.
Above we’ve added a link to a web page which in turn has been set up to hold links to all applicable policies for this GDPR category. The GDPR Exception is a free form piece of text which would be omitted if no exceptions are in place for the underlying category. We would add these 2 items in for all 8 categories and our glossary is now complete: we have our 3 main GDPR taxonomies & our policies and exceptions handled. We can now move on to adding categorisations onto the underlying data.
Note: Descriptions in Glossary terms are not searchable. If you need to search for, say, “GDPR Exceptions”, this would need to be placed as a glossary term, a tag or in the description of a data item.
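If the same description layout is to be repeated across all 8 categories, the text can be generated consistently. This is a minimal sketch in plain Python: the `build_description` helper, the “Policies:” and “GDPR Exception:” labels and the example URL are assumptions for illustration, not an ADC format.

```python
def build_description(policy_page_url, exception=None):
    """Compose a category description: a policy page link plus an optional exception."""
    lines = [f"Policies: {policy_page_url}"]
    if exception:
        # The exception line is left out when no exception applies to the category.
        lines.append(f"GDPR Exception: {exception}")
    return "\n".join(lines)


# Hypothetical URL; the exception text is the example used earlier in this post.
print(build_description(
    "https://intranet.example.com/gdpr/sensor-data",
    "Log Data is Exempt from Update",
))
```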
Applying GDPR Taxonomy classifications
We won’t cover Data Source publishing to ADC here; however, it is a straightforward process which connects to a data source and pulls metadata for selected items across into ADC. For this example we’ll include 2 data sets: a set of SQL Server tables from AdventureWorks and a folder of data on Azure Data Lake Store (ADLS). The AdventureWorks data is an OLTP database which contains customer data, and the Data folder on ADLS contains IoT Data from customer cars.
Once the metadata is loaded it is then classified.
Note: ADC includes a “preview” view of certain RDBMS data sources, that being a copy of the first 10 rows of a table. You may wish to turn this off if the data is not anonymised at source.
Address Table from SQL Server
Below is a sample classification for the Address table columns:
This address table holds a set of personal customer addresses and so the appropriate terms are added from the glossary (the hierarchy above the term can also be seen). However, there are additional tags to apply, and applying all of them at the column level would be time consuming. Furthermore, ADC only allows searching on terms themselves, so if we search for “GDPR – In Scope” it will not return the above columns.
As an alternative we can also add some classifications at the Address Table Level:
Firstly we add the 2 “rights” tags – the right to store communication preferences and the right to be viewed/updated/forgotten – both of these apply to the “Customer Data (Personal)” category. Then we add in the 2 parent tags. Given that ADC does not allow parent level searches we add in terms to indicate this data set is “Customer Data (Personal)” and “GDPR – In Scope” as it is highly likely we will find ourselves searching for all objects in this category.
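The parent tags for a table can be worked out from the column-level terms already applied. Here is a minimal sketch of that idea in plain Python: the `PARENT` map, the column-level term names and the `table_level_tags` helper are assumptions for illustration, and the resulting tags would still be added to the table by hand in ADC.

```python
# Child term -> parent term; the column-level terms are hypothetical examples.
PARENT = {
    "Address Line": "Customer Data (Personal)",
    "Postal Code": "Customer Data (Personal)",
    "Customer Data (Personal)": "GDPR – In Scope",
}


def ancestors(term):
    """Return every term above `term` in the hierarchy, nearest parent first."""
    found = []
    while term in PARENT:
        term = PARENT[term]
        found.append(term)
    return found


def table_level_tags(column_terms):
    """Collect the parent terms worth attaching at the table level."""
    tags = set()
    for term in column_terms:
        tags.update(ancestors(term))
    return sorted(tags)


print(table_level_tags(["Address Line", "Postal Code"]))
```

The rights tags are not derived this way, as they sit in separate hierarchies and are chosen per category.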
That completes our GDPR categorisation for this table and we would do similar classifications for the customer master table and any other tables in the database holding GDPR type data. Now we move onto the IoT Data.
IoT Data from Azure Data Lake Store
Data on ADLS is in flat file format and is often held in folders; therefore ADC is not able to extract metadata for individual columns, only for the overall file or parent folder.
In a similar manner to the Address table level tagging we can add in “GDPR – In Scope” and “Sensor Data”, and we also add in “N/A” as communication preferences are not applicable to IoT/Sensor data. The customer does still retain rights over the data and so we have “Yes” for “Rights of View/Update/Be Forgotten”. There is one addition, however: this sensor data is anonymous in isolation but not when combined with other data. We cannot record here what those other data sets might be but at least we can flag the risk.
And that’s it. Any of the tags could be clicked on to open the glossary so for example we can click “Sensor Data” above to find out if any exceptions can occur and what policies apply to that category. If we enter “Termname:GDPR – In Scope” we’ll return the 2 data sets as we’ve applied that term at the data set level.
This ends the series of 3 posts on Azure Data Catalog and GDPR. The above is just an example walk through of how you might categorise your data to apply GDPR guidelines, as well as some methods for applying that categorisation within ADC. Your implementation will, of course, depend on your circumstances. It has to be said that ADC may not be as fully functional as some GDPR metadata tools on the market; however, all the examples above were built in a few hours on a turnkey SaaS infrastructure. Thanks for following the series to its end.