The phrase “Modern Data Platform” is increasingly common, but what is it? Is it just the latest name for the Data Warehouse (DW), or something different? Will it support my big data/near real time streaming needs? What if I don’t know what those needs are yet? And why do I even need a “Modern Data Platform”? The first post in this series briefly examined the history of data warehousing and how it led to the need for a Modern Data Platform approach. This post examines one such approach and explores the capabilities it suggests. Later posts will layer on the Microsoft Azure cloud technologies that meet those capabilities.
Microsoft Azure provides a rich and highly functional set of data services which together can meet virtually all of the requirements of a modern data platform. Typically they are PaaS (Platform as a Service) offerings and therefore provide cloud-centric advantages such as serverless set-up and pay-as-you-go scalability. However, they are often presented as individual products, so it’s not always obvious where they fit and what role they play in an overall approach. With that in mind, the primary aim of this set of posts is to present a Modern Data Platform (MDP) first as a technology-free capability architecture. The individual capabilities will be examined in turn, and in future posts the appropriate Azure technology will be mapped to each capability.
Why a new Modern Data Platform Architecture?
The honest answer is: it’s not really new. Various modern data warehousing and modern data platform architectures exist, and most share common characteristics such as data lakes and big data analytics. The architecture below is not particularly unique and isn’t intended to be; in fact it represents a blend of the Lambda architecture, Gartner’s Logical Data Warehouse pattern and Microsoft’s own Azure Data Services architecture. Many readers will already have variations of this architecture, and that is by design. The boxes may move around and have different names, but hopefully the overall conceptual flow will resonate.
Capability First Approach
This area of cloud data services is undergoing rapid and exciting change, which can create confusion. How does Azure Databricks fit into my overall data wrangling and data science methodology? How do I implement streaming/near real time processing alongside my “traditional” warehousing? How do I extend my implementation into big data and Machine Learning? What role does Azure Analysis Services fill? How do I avoid silos and/or a “data swamp”? Most importantly, what does a single Azure Data Services architecture that covers the different needs of the enterprise look like?
The aim of these initial posts, therefore, is to highlight the key capabilities that any Modern Data Platform should support. Not all of these will need to be implemented immediately; however, it is useful to know, for example, that one can start with a Modern Data Warehouse architecture and then, when ready, move to the left into streaming, IoT and so on, or to the right into Big Data and Machine Learning.
The overall architecture divides into 2 major areas. The top layer is the presentation/consumption tooling layer and focuses on a typical set of user personas and their needs. Any modern data platform will need to service a wide range of users with different reporting and consumption needs; this in turn leads to the need for several distinct delivery mechanisms. Casual users, for example, will require highly curated, directed interfaces with minimal customisation, aimed at delivering focused insights on a repeatable basis. At the other end of the spectrum, Data Scientists will need maximum flexibility in creating sandpits and experimentation areas with a wide range of tooling.
Supporting this is the main body of the MDP. From the bottom up, we have data ingestion and initial storage. The foundational capability of the MDP is the Data Lake, which is the initial landing area for data and the first option for data manipulation and cleansing. Alongside this is the data movement capability, a catch-all for both batch and real time streaming as well as data movement and manipulation within the architecture.
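As a concrete illustration of the landing area idea, raw data is often organised by source, dataset and ingestion date so that it stays traceable and replayable. The sketch below shows one such path convention in Python; the zone, source and dataset names are hypothetical and not tied to any particular product.

```python
from datetime import date

def landing_path(source: str, dataset: str, ingest_date: date) -> str:
    """Build a raw-zone path for newly landed data.

    The zone/source/dataset/date layout is one common lake convention,
    shown purely as an illustration -- the names are hypothetical.
    """
    return f"raw/{source}/{dataset}/{ingest_date:%Y/%m/%d}/"

# e.g. a nightly batch extract from a hypothetical CRM source
print(landing_path("crm", "customers", date(2019, 3, 1)))
# raw/crm/customers/2019/03/01/
```

Date-partitioned paths like this make it straightforward to reload a single day's data or reprocess history without touching other partitions.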
In the centre are the 3 main pillars: Real Time Analytics (focused on events and streaming), Relational Analytics (more traditional SQL and relational workloads) and Big Data/AI/Machine Learning (the 2 are merged in this model because, to a certain extent, they are starting to overlap; other architectures show them as distinct entities). Together these 3 cover the primary mechanisms for adding value to the various data sets entering the MDP, whether in batch or real time.
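To make the Real Time Analytics pillar a little more tangible, one common way of adding value to an event stream is to aggregate it over fixed (tumbling) time windows. The sketch below does this in plain Python over an in-memory list, a deliberately simplified stand-in for a real streaming engine.

```python
from collections import defaultdict

def tumbling_counts(events, window_seconds=60):
    """Count events per fixed (tumbling) time window.

    `events` is an iterable of (epoch_seconds, payload) pairs -- a
    simplified stand-in for a real streaming source.
    """
    counts = defaultdict(int)
    for ts, _payload in events:
        # Align each event's timestamp to the start of its window
        window_start = ts - (ts % window_seconds)
        counts[window_start] += 1
    return dict(counts)

# Three events in the first minute, one in the second
events = [(5, "a"), (30, "b"), (59, "c"), (61, "d")]
print(tumbling_counts(events))  # {0: 3, 60: 1}
```

A production streaming engine adds concerns this sketch ignores, such as late-arriving events and watermarks, but the core windowing idea is the same.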
Finally, there are 2 additional capabilities: “Semantic Layer & Data Virtualisation” and “Metadata & Governance”. The semantic layer brings meaning to the underlying data in business terms and, alongside Data Virtualisation, helps abstract away the underlying data structures and data sources.
“Metadata & Governance” is the final part of the architecture. It focuses on “data about data” and provides intelligence on what data the MDP holds, its provenance and who is responsible for it.
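To illustrate the kind of “data about data” involved, the sketch below models a minimal catalog entry in Python, capturing provenance, ownership and lineage. The field names are illustrative, not taken from any real catalog product.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """A minimal 'data about data' record; fields are illustrative."""
    dataset: str
    source_system: str  # provenance: where the data came from
    owner: str          # who is responsible for it
    lineage: list = field(default_factory=list)  # upstream datasets

entry = CatalogEntry(
    dataset="sales_summary",
    source_system="erp",
    owner="finance-team",
    lineage=["raw/erp/orders", "raw/erp/customers"],
)
print(entry.owner)  # finance-team
```

Even a simple record like this, kept up to date, goes a long way towards preventing the “data swamp” mentioned earlier, because every data set can be traced back to a source and an accountable owner.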
The key message here is that this architecture is intended to represent a full-spectrum modern data platform covering the 4 areas of relational analytics, real time analytics, big data & advanced analytics, and machine learning & AI. Ideally, any modern data platform will look to deliver capabilities in all of these areas, or at least those which tie into its main use cases. An MDP which includes these capabilities should provide a flexible, enterprise-wide foundation for each distinct area of analytics. In the following posts we’ll take a closer look at each of these capabilities.