“Where’s my raw data? Is this data set complete? Am I using dev or test data from the lake? Can I trust this data set? Where did it come from, and who owns it?”
Data Lakes are a great design feature for a modern data warehouse; however, they can quickly deteriorate into a Data Swamp if there is no governance of their structure. But governance means process, and process can be the enemy of agility.
This post will introduce Azure Data Catalog as a platform for providing metadata for a Data Lake. It will briefly introduce data lakes & their role in Modern Data Warehousing, as well as Azure Data Lake Store (ADLS), Microsoft’s bespoke Data Lake storage platform. There’ll then be a short backgrounder on how data lakes present a multi-dimensional categorisation challenge and why metadata beyond that supported by ADLS is recommended. Finally, the post will demonstrate how Azure Data Catalog is an ideal option for providing that additional metadata and ensuring that your data lake becomes a searchable, multi-purpose asset for your organisation.
Modern Data Warehousing (including the Lambda Architecture for mixing real-time & batch analytic feeds) makes extensive use of the Data Lake design pattern. This pattern uses a large, low-cost data store to capture all information from a given source for initial data loads. A more detailed description is available here; in summary, the data is landed in raw form, fully or partially cleansed, and then processed for BI purposes. A key feature of a data lake, therefore, is that the same data can appear multiple times in various forms; in particular, various subsets of the data may be integrated or processed for different BI purposes.
Step forward Azure Data Lake Store
Azure Data Lake Store is an ideal platform for hosting this data lake. It has near infinite scalability, a full WebHDFS interface and works in tandem with Azure Data Lake Analytics, HDInsight and other platforms to provide a custom-designed “smart storage” Data Lake platform. It also has significant security benefits (data is encrypted at rest, integration with Azure Active Directory) and has other capabilities over and above, say, Azure Blob Storage (though Azure Blob can also play a part in a data lake strategy).
Data may be held as individual files or as collections of files within folders, perhaps with a sub-folder structure based on year/month/day/time to reflect snapshots coming into the lake.
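For example, a daily feed might land under a date-partitioned structure like the following (a purely illustrative layout; the account and folder names are hypothetical):

```
/raw/sales/orders/2017/06/21/orders_20170621.csv
/raw/sales/orders/2017/06/22/orders_20170622.csv
```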
However, in terms of categorising data sets, ADLS has limited capability. Essentially one is limited to name, description, the parent folder structure & metadata such as owner (which will likely be a role rather than an individual), date created, etc. Data Lakes, as we’ll see below, often need richer categorisation capabilities.
The Multi-dimensional Data Lake
Successful data lakes will hold data from multiple sources & in multiple forms; in fact, a key approach for Data Lakes is to “land everything now, understand & consume later”.
This source & data agnostic paradigm can lead to its own issues. The data will have differing levels of quality & completeness and will have multiple owners. Furthermore, Development, Test & Production data will often be held in the same lake (indeed, the boundary between these stages can blur).
Even after it has arrived, the data can go through various stages. As well as the usual Dev > Test > Production cycle, it will go through various levels of processing, from (for example) raw through cleansed to eventually integrated (or “curated”, to use a data lake term). This differentiation between “Raw” & “Curated” is vital as both sets may be needed depending on the use case (mainstream analytics is best served by a highly cleansed & conformed data layer, whereas data science scenarios may need the data at its very lowest level). Thus “tiers” are created in the lake: logical layers which separate out these different forms of the same or similar data.
Data categorisation has now become multi-dimensional. A given folder may hold data with several distinct characteristics in one of several distinct stages. To add further complication, data can in effect be “forked” as localised analytics teams take a copy of the data at any point in the process to create bespoke & selective solutions. At this point Data Lakes are in danger of becoming “Data Swamps”, and this is why Azure Data Catalog can add significant value.
Azure Data Catalog – a Primer
Azure Data Catalog (ADC) is a SaaS service offered by Microsoft aimed at categorising and surfacing an enterprise’s entire data estate. Although offered on Azure, it’s intended to be used across almost any data source and has connectors for most of the major DBMSs on the market (with more being added).
ADC can hold tags, glossary terms (more on this later), friendly names, descriptions, experts and other metadata for a given data set within a data source. By default it follows a “crowdsourcing” model, in that many users in different roles (Business Analyst, DBA etc.) can add their own metadata for a given data source without overwriting each other; all versions are held against the data source distinctly & simultaneously. That said, it has (and is adding) role-based capabilities to limit updates to certain users (such as a data steward).
The primary purpose of ADC is to provide rich support for self-serve data discovery: users can search for data sources, find SMEs and identify the process for requesting access to data sets. However, its rich metadata capabilities, particularly its support for tags & glossaries, make it eminently suitable for addressing the Data Lake categorisation problem.
Any user can add a tag to an element in ADC; however, like all tagging, this is subject to potential duplication: 3 different users can use “Dev”, “Dev.” & “Development” as distinct tags that all mean the same thing. Glossaries can address this issue.
They provide a preformed taxonomy or hierarchy of terms which can be defined up front, independent of any data sources, perhaps by a nominated role such as a data steward.
As shown above, these glossaries can be N levels deep; in this example there are 4 parent categorisations around the “provenance” of the data.
Data Cataloguing in action: an example
Having introduced the categorisation issue, Azure Data Catalog and its glossary capability, the means by which ADC can help with a Data Lake is probably best illustrated with an example. The following is a sample subset of a typical basic data lake structure:
The lake has 3 tiers: Raw, Cleansed & Integrated. Raw order data is landed in the lake on a country-by-country basis, with each data set being cleansed as it arrives before being pushed into the next layer. Order data is then integrated across markets into the “Orders Integrated” folder. An additional project (Sparta) contains a copy of the orders data with bespoke transforms; this represents a “non-certified” data set, built for a local analytics requirement.
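Laid out as folders, that structure might look something like this (an illustrative sketch; only the tier and project names come from the example above):

```
/raw/orders/france/
/raw/orders/germany/
/cleansed/orders/france/
/cleansed/orders/germany/
/integrated/orders-integrated/
/projects/sparta/orders/
```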
To categorise this data, 6 glossaries were created in Azure Data Catalog (a sketch for scripting these via the REST API follows the list):
- Layer (Integrated, Cleansed, Raw)
- Owning Business Unit (Marketing, Operations, ….)
- Consuming Business Unit (Marketing, Operations, ….)
- Stage (Dev, Test, Prod)
- Granularity (Detail, Aggregated)
- Quality Level (Raw, User defined, Business Ratified, Enterprise Ratified)
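Glossary terms can be created through the portal, but they can also be scripted. Below is a minimal sketch in Python, assuming the ADC REST glossary endpoint pattern; the catalog name, glossary name, payload fields and api-version are assumptions to verify against the Azure Data Catalog REST reference.

```python
import requests

# Sketch only: names, endpoint shape and api-version are assumptions.
TOKEN = "<azure-ad-access-token>"       # placeholder; acquire via Azure AD
CATALOG = "DefaultCatalog"              # hypothetical catalog name
GLOSSARY = f"{CATALOG}Glossary"         # hypothetical glossary name
BASE = f"https://api.azuredatacatalog.com/catalogs/{CATALOG}"
HEADERS = {"Authorization": f"Bearer {TOKEN}",
           "Content-Type": "application/json"}

def create_term(name, definition, parent_id=None):
    """Create one glossary term, optionally nested under a parent term."""
    body = {"name": name, "definition": definition}
    if parent_id:
        body["parentId"] = parent_id
    resp = requests.post(
        f"{BASE}/glossaries/{GLOSSARY}/terms?api-version=2016-03-30",
        headers=HEADERS, json=body)
    resp.raise_for_status()
    return resp.headers["Location"]     # URI of the newly created term

# Build one of the six taxonomies, "Stage", as a parent with three children.
stage = create_term("Stage", "Lifecycle stage of the data set")
for child in ("Dev", "Test", "Prod"):
    create_term(child, f"{child} stage data", parent_id=stage)
```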
The next step is to gather the initial metadata. ADC is connected to ADLS and the underlying data set folders are selected as needed. (Note: Azure Data Catalog manages big data sets by defining either an individual file or its parent folder as a data set; this post explains the approach in more detail.) RDBMS data sources can return column-level information, whereas a data folder on ADLS basically returns just its name & location. A sketch of registering a folder through the REST API follows.
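This sketch reuses TOKEN, BASE and HEADERS from the glossary sketch above. The payload follows the documented ADC “views/tables” pattern, but the sourceType, objectType and annotation shapes shown for ADLS folders are assumptions to verify against the REST reference; the lake URL is hypothetical.

```python
import requests

def register_folder(folder_url, friendly_name, description):
    """Register an ADLS folder as a catalog asset with a friendly name."""
    body = {
        "properties": {
            "fromSourceSystem": False,
            "name": folder_url.rstrip("/").rsplit("/", 1)[-1],
            "dataSource": {"sourceType": "Azure Data Lake Store",
                           "objectType": "Directory"},
            "dsl": {"protocol": "webhdfs",
                    "authentication": "oauth",
                    "address": {"url": folder_url}},
        },
        "annotations": {
            "friendlyName": [{"properties": {"friendlyName": friendly_name}}],
            "descriptions": [{"properties": {"description": description}}],
        },
    }
    resp = requests.post(f"{BASE}/views/tables?api-version=2016-03-30",
                         headers=HEADERS, json=body)
    resp.raise_for_status()
    return resp.headers["Location"]     # URI of the registered asset

asset_uri = register_folder(
    "https://mylake.azuredatalakestore.net/webhdfs/v1/cleansed/orders/france",
    "Cleansed France Orders",
    "French order data, cleansed on arrival into the lake")
```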
Metadata can now be added to the data set in ADC, starting with a friendly name & description. The metadata definition for “Cleansed France Orders” is shown below:
There is also a documentation pane for additional details; this can include links to supporting information such as the ETL processes used to transform the data. However, the major categorisation comes in the next section.
One or more experts or SMEs can be allocated, and a link to the “request access” process included. In terms of the major categorisation proposed in this post, we can also add glossary terms and/or tags (the process for adding each is exactly the same; the only difference is that glossary terms are flagged with their hierarchy).
As can be seen above, glossary terms from the 6 taxonomies were added as appropriate to this data set (Cleansed, Detail level, Enterprise-class data for France). Standard “user” tags have also been added for “Financial Data” & “Private”.
These steps would be followed for the rest of the ratified data sets. (Note: there is a full REST API for all actions in ADC; anything available in the interface is available through these APIs, so bulk loaders can be written if there are many data folders to categorise, as sketched below.)
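A minimal bulk loader built on the register_folder sketch above might look like this; the folder list and tag values are illustrative, and the per-asset tags endpoint is an assumption based on how ADC addresses annotations as child resources.

```python
# Illustrative list of ratified folders with their tag assignments.
ratified = [
    ("https://mylake.azuredatalakestore.net/webhdfs/v1/cleansed/orders/germany",
     "Cleansed Germany Orders", ["Cleansed", "Prod", "Detail"]),
    ("https://mylake.azuredatalakestore.net/webhdfs/v1/integrated/orders-integrated",
     "Orders Integrated", ["Integrated", "Prod", "Enterprise Ratified"]),
]

for url, name, tags in ratified:
    uri = register_folder(url, name, f"{name} (bulk registered)")
    # Attach each tag to the new asset (assumed child-resource endpoint).
    for tag in tags:
        resp = requests.post(f"{uri}/tags?api-version=2016-03-30",
                             headers=HEADERS,
                             json={"properties": {"tag": tag}})
        resp.raise_for_status()
```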
“Project Sparta”, however, is a one-off, localised solution in the example above. It may still have value and should be categorised, but that categorisation needs to reflect its “non-enterprise” characteristics.
Sparta is flagged appropriately as a data set from “Innovations” with “User Defined” data quality, and the description also reflects some of the decisions made in processing the data. This differentiates it from the more mainstream data sets.
Once all entries are completed, we have a fully categorised set of data and ADC becomes a fully searchable resource.
The above is an example of the Azure Data Catalog main interface, showing a search for:
All Production, Enterprise quality assets owned by Operations
Of course, any aspect of the metadata can be searched, including descriptions and friendly names; however, the main purpose of this example is to show how the data sets can be categorised.
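The same search can be issued programmatically. Below is a sketch reusing BASE and HEADERS from the earlier examples; the endpoint follows the documented ADC search pattern, but the tag-filter query syntax shown is an assumption to check against the search documentation.

```python
# Search for Production, Enterprise-quality assets owned by Operations.
resp = requests.get(f"{BASE}/search/search", headers=HEADERS, params={
    "searchTerms": 'tags:"Prod" AND tags:"Enterprise Ratified" AND tags:"Operations"',
    "api-version": "2016-03-30",
})
resp.raise_for_status()
for result in resp.json().get("results", []):
    print(result["content"]["properties"]["name"])
```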
To summarise: Data Lakes need structure or they rapidly degenerate into a swamp. However, heavyweight governance processes which restrict additions to the lake can slow down adoption and limit its flexibility and power. Further, although folders can be used to impose structure, they are limited & can be fragile, especially when nested (Source System > Business Unit or Business Unit > Source System? What do we do after a reorganisation or merger?). In short, we need a means of adding structure, but the mechanism must be lightweight, flexible and apply governance that’s fit for purpose (a spreadsheet of interesting marketing data should not have the same process as system-of-record account transactions landing from the ERP).
Azure Data Catalog provides that solution and, through its REST API, is also extensible (I’m working with one customer on a workflow which will auto-create an ADC entry & email a data steward when a new folder is created on ADLS; a sketch of one possible approach follows). Not only does it provide a robust mechanism for formal & informal categorisation of data (and not only for Azure Data Lake; the above process can apply to any source for which ADC has a connector), it also provides a searchable business asset that promotes self-serve data discovery in the BI & Analytics space.
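One way such a workflow could be wired, sketched as a simple polling loop (ADLS Gen1 exposes no folder-created event here, so we poll the WebHDFS LISTSTATUS operation). The account, watched path and notification helper are hypothetical, and in practice ADC and ADLS need separately scoped Azure AD tokens rather than the single HEADERS reused below.

```python
import time
import requests

LAKE = "https://mylake.azuredatalakestore.net/webhdfs/v1"   # hypothetical
WATCHED = "/raw/orders"
seen = set()

def list_folders(path):
    """Return the sub-folder names of an ADLS path via WebHDFS."""
    resp = requests.get(f"{LAKE}{path}", headers=HEADERS,
                        params={"op": "LISTSTATUS"})
    resp.raise_for_status()
    statuses = resp.json()["FileStatuses"]["FileStatus"]
    return [s["pathSuffix"] for s in statuses if s["type"] == "DIRECTORY"]

while True:
    for folder in list_folders(WATCHED):
        if folder not in seen:
            seen.add(folder)
            register_folder(f"{LAKE}{WATCHED}/{folder}", folder,
                            "Auto-registered; awaiting steward review")
            notify_data_steward(folder)   # hypothetical email/alert helper
    time.sleep(300)                       # poll every five minutes
```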
Click here for a tutorial on getting started with Azure Data Catalog.
Click here for an introduction to Azure Data Lake Store.