The phrase “Modern Data Platform” is increasingly being used, but what is it? Is it just the latest name for the Data Warehouse (DW) or something different? Will it support my big data/near real time streaming needs? What if I don’t know what those needs are yet? And why do I even need a “Modern Data Platform”? This post, the first in a series, will provide one perspective on those questions by first examining the history of data warehousing and how it led to the need for a Modern Data Platform approach. Later posts will examine a technology-free reference architecture for the modern data platform and then map Azure Data Services onto that architecture to provide a technology overlay.
Data Warehousing & Analytics – a brief history
Analytics (or Business Intelligence, or Management Information Systems, or Decision Support, or any of the varied names applied over the years) has a long and varied history with several iterations and forking paths; however, for the purposes of this post we’ll start with the Data Warehouse. First coined in the late 1980s, the Data Warehouse (DW) became the primary repository for company reporting, the ideal being that data from across the enterprise would be duplicated and landed in the DW so that it could be queried in multiple ways to provide business insight. There were three primary drivers for this:
- Operational systems (and their database schemas) were not designed for and could not support the kinds of queries needed for analytics.
- If the requirement was, for example, to merge sales and customer satisfaction data, then those two data sets really needed to be in the same place (co-located).
- Different data sources & schemas had different conventions, possibly even different identifiers for the same thing. A level of data cleansing & conforming was needed to align the data together for query performance and cross subject reporting.
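The conforming step described above can be sketched in plain Python. This is a hypothetical, minimal example (the system names, keys and values are invented for illustration): two source extracts identify the same customer differently, and a cross-reference table maps both source keys onto a single conformed surrogate key.

```python
# Hypothetical rows from two source systems that identify the same
# customer differently: the CRM uses "C-001", billing uses 101.
crm = [{"crm_id": "C-001", "satisfaction": 4.2},
       {"crm_id": "C-002", "satisfaction": 3.8}]
billing = [{"bill_id": 101, "sales": 5000},
           {"bill_id": 102, "sales": 7200}]

# A cross-reference table maps both source keys onto one
# conformed (surrogate) customer key.
xref = [{"customer_key": 1, "crm_id": "C-001", "bill_id": 101},
        {"customer_key": 2, "crm_id": "C-002", "bill_id": 102}]

def conform(crm, billing, xref):
    """Join both extracts together on the conformed customer key."""
    by_crm = {row["crm_id"]: row for row in crm}
    by_bill = {row["bill_id"]: row for row in billing}
    return [{"customer_key": x["customer_key"],
             "sales": by_bill[x["bill_id"]]["sales"],
             "satisfaction": by_crm[x["crm_id"]]["satisfaction"]}
            for x in xref]

rows = conform(crm, billing, xref)
# Each row now carries both measures under a single, shared key,
# which is what makes cross-subject reporting possible.
```

In a real warehouse this cross-referencing happens at scale in the ETL layer, but the principle is the same: align identifiers first, then co-locate the measures.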
Two principal techniques, Inmon and Kimball, appeared and came to dominate Data Warehousing (the following is a simplification; both techniques have extensive material behind them which I won’t repeat here):
- Inmon – the Inmon approach favoured a centralised, usually normalised, enterprise data model as the core of the DW. ETL processes would take data in from the various sources and transform it into a logical Enterprise Data Model, this model being the abstracted representation of the Enterprise Data Architecture. Secondary data marts would then be created on an as-needed basis for each department, business unit or initiative – each a subset of the whole, optimised for query performance and synchronised daily/hourly/instantly with the central warehouse.
- Kimball – Ralph Kimball championed a different approach. There is no central “Enterprise Logical Data Model”; instead there are star schemas and staging tables. The staging tables (and associated ETL) perform any cleansing and conforming of the different data sets, and the star schemas become the public face. Each star schema consists of a Fact table, generally focused on one subject such as Sales or Shipments, and its related Dimensions – denormalised, flattened structures containing all attributes for a given entity (such as Product or Customer). Simply put, Fact tables contain things you measure (Order Quantity, Profit) and Dimensions contain things you measure by (Sales Rep, Geography, Channel).
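The fact/dimension split above can be illustrated with a tiny in-memory sketch. This is a hypothetical example (the table contents and attribute names are invented): the fact rows hold measures plus a dimension key, and any dimension attribute can be used to slice the measures – a star join in miniature.

```python
# Hypothetical Kimball-style star schema: a Sales fact table holding
# the things you measure, and a Customer dimension holding the
# things you measure by.
dim_customer = {
    1: {"customer": "Contoso", "geography": "UK", "channel": "Direct"},
    2: {"customer": "Fabrikam", "geography": "US", "channel": "Partner"},
}
fact_sales = [
    {"customer_key": 1, "order_qty": 10, "profit": 250.0},
    {"customer_key": 2, "order_qty": 4,  "profit": 90.0},
    {"customer_key": 1, "order_qty": 6,  "profit": 120.0},
]

def profit_by(attribute):
    """Aggregate a fact measure by any dimension attribute (a star join)."""
    totals = {}
    for row in fact_sales:
        key = dim_customer[row["customer_key"]][attribute]
        totals[key] = totals.get(key, 0.0) + row["profit"]
    return totals

# profit_by("geography") slices the same facts one way,
# profit_by("channel") another – no change to the fact table needed.
```

The appeal of the pattern is visible even at this scale: new ways of slicing the data come from adding dimension attributes, not from restructuring the facts.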
Each technique had its proponents (I personally favoured Kimball) and its pros & cons. However, both rely heavily on some key features:
- They relied on a large central appliance or environment to hold all of the data (in its various forms). Eventually some companies would reach the limit of SMP architecture (think big multiprocessor, but essentially individual servers) and be pushed onto MPP (Massively Parallel Processing) appliances such as Teradata or SQL Server Parallel Data Warehouse.
- They assumed all of the data ingested would be transformed into “valuable”, accessible BI structures. Every attribute on every incoming data set would be moved through the entire process and surfaced somewhere in the presentation layer.
- They were focused on structured relational data only (which was fine at the time because Analytics only focused on structured data).
- They both relied on a heavily controlled and formatted data modelling approach, which didn’t lend itself particularly well to agile development.
- They typically led to monolithic ETL stacks that created technical debt and legacy code that was hard to refactor or extend.
- They were essentially batch – the update cycle was typically daily or weekly. More frequent updates were desirable but difficult given the overhead of updating the entire warehouse stack.
- Finally, being largely on premise they were limited to one capacity level. Something not really obvious until cloud came along, but all data warehouse environments could only operate at one level of memory, CPU and storage. Although many appliances (particularly MPP) lent themselves to expansion, in reality expansions were costly and time consuming, and upgrades would happen once a year at most. This meant that the procurement process had to estimate the peak usage need for the appliance up to 3 or 4 years in advance, and furthermore in operation the appliance would run at one speed and capacity. If more throughput was needed, something had to be removed – whether data, costly ETL processes running in parallel, or business users.
Cracks appearing in “traditional DWs”
Although still largely robust and fit for their original purpose, in recent years these features have led to numerous concerns about these approaches. Firstly, they are essentially “Mode 1”, waterfall-style practices. They work well for predictable, robust, repeatable analytics (such as accurate quarterly reporting of performance figures to the markets) but they don’t easily lend themselves to “Mode 2” agile, iterative delivery.
A second issue is cost – all data is held on the appliance, and the appliance is expensive. As data volumes grow the enterprise has to consider archiving strategies, purchasing expensive additional capacity or, worse, reducing the amount of data ingested.
To reduce cost, many organisations would limit the amount of data taken from the source systems into the DW. Either attributes or tables would be dropped, or the data would be aggregated in some way at the source system before being sent to the warehouse. Along with this, the DW team would often look to push data quality responsibility to the source systems, requiring them to supply 100% clean data to the DW. Initially this seems a sensible approach, but in reality it creates a tight coupling between the source systems and the DW, and a separation of concerns issue in the architecture – any issues or gaps in the source extracts had to be pushed back onto the delivery plan of the source system, creating additional lag on delivery.
As mentioned above, the assumption was generally that all data landed would be pushed into the reporting or presentation area – storage on the appliance was simply too expensive to hold data “just in case” it was needed, so everything would be transformed and conformed: not a trivial task in an enterprise with hundreds of source systems. However, changing the core data models of the two approaches to incorporate new features was also costly and risky because of the dependencies on the original data structures.
Data warehouses were starting to stagnate in many cases. The rate of change was too slow for new data features and insights, DBAs spent much of their time trying to avoid costly upgrades by tuning and retuning the database, direct access for business users was limited in case they ran “killer” queries by mistake, and all the while the ETL stack was accreting new functionality and code.
Then things got worse for Data Warehousing…
The 4 Horsemen of the on premise DW Apocalypse
Already under pressure, the traditional Data Warehouse has for the last few years had four additional developments to contend with.
- Data Discovery Tools – time to market for new BI reports (over new data) was already typically an issue in many businesses; then Data Discovery tools such as Tableau, QlikView (and eventually Power BI) started to appear. These tools came with the premise of “give me a spreadsheet and see what I can deliver in an hour” and completely disrupted the traditional enterprise reporting tool environment. Although it wasn’t a completely fair fight (the more traditional enterprise tools had full semantic layers and could provide the same report to thousands of users), the Data Discovery Tools redefined analytics delivery expectations for many business and data analysts.
- IoT and the rise of real/near real time – DWs had typically been batch oriented. Near real time implementations existed but they were seen as costly and complex. With the rise of IoT and streaming generally, however, expectations shifted – businesses now wanted to see events as they happened as well as over time, but the typical DW architecture, with its rigid batch process approach, simply wasn’t set up to support this.
- Note: another potential casualty of this change is the treasured distinction between “operational” and “BI” reporting. Many arguments have been held over the years about operational reporting, which manages day-to-day requirements, versus analytics or BI reporting, which focuses on historical data. With the emergence of streaming and the Lambda design pattern this distinction has largely melted away – a given analytics report can combine real-time and historical data.
- Big Data & AI – in a relatively short space of time, BI data was no longer all structured, and not all of it needed a formal presentation layer for reporting (in fact that was actually a hindrance for Artificial Intelligence and Big Data scenarios). More importantly, it was no longer feasible to store all of the data needed by these use cases on expensive high-end Data Warehouse appliances.
- The rise of Agile – in truth, “Agile development” has gone through as many reinventions and relaunches as analytics, but arguably it is now more mainstream than ever before. Business consumers are no longer content with three-month incremental release cycles for new features in the reporting environment, nor can they afford to be with numerous disruptive entrants coming into the market. They’re looking for weekly or even daily release cycles, a cadence that the Inmon and Kimball approaches simply weren’t designed for.
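The Lambda pattern mentioned above can be sketched in a few lines. This is a hypothetical, simplified illustration (the region names and totals are invented): a serving layer merges a batch view, precomputed up to a cut-off, with a speed view covering events that have streamed in since – which is why a single report can span both historical and real-time data.

```python
# Hypothetical Lambda-style serving layer: the batch view holds
# totals computed overnight up to a cut-off; the speed view holds
# totals for events that have streamed in since that cut-off.
batch_view = {"UK": 1200, "US": 800}   # historical, recomputed each batch run
speed_view = {"UK": 35, "DE": 12}      # real-time, reset after each batch run

def serve(region):
    """Merge the historical (batch) and real-time (speed) totals."""
    return batch_view.get(region, 0) + speed_view.get(region, 0)

# serve("UK") covers both layers; serve("DE") only has streamed
# events so far; serve("FR") has seen no events at all.
```

The key design point is that the speed view only ever has to cover the short window since the last batch run, so it stays small and fast while the batch layer remains the system of record.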
However, although the above largely Mode 2 type developments were in the ascendant, there remained a need for bi-modal development in analytics: the ability to carry out Mode 1, rigidly phased, waterfall-style deliveries that supported business processes alongside the more agile, fail-fast, iterative Mode 2 approach.
Something new was needed, but this new approach couldn’t completely throw out the old. It had to support the new agile approaches whilst also allowing for rigid, repeatable, dependable analytic processes. Furthermore, it had to allow Mode 2 developments to mature as they became “hardened” (the rate of change slowing as they reached optimum functionality) and potentially migrate into an “operationalised” Mode 1 delivery model.
The new approach also had to deal with the individual UX needs of different analytics personas. Again, multiple personas for analytics had always existed, but many DW architectures tended to focus either on the majority (casual consumers) or on a minority (data/business analysts). With the emergence of Data Scientists came the need for the architecture to support multiple personas and recognise the differing requirements of each.
Finally, it had to be scalable enough to deal with big data volumes yet cost effective, with tiered storage capability. It needed to couple loosely with source systems so that all data, in any form and at the lowest level of detail available, could be landed from source without being overly expensive – breaking the dependency on custom extracts and source-system processing. It needed to cope with near real time streaming at tens of thousands of messages a second, and at the same time with the big, unstructured data and complex tooling requirements of data science.
Mostly, it had to be agile. It had to provide multiple options for rapid delivery, scale-up, scale-out, and for building and tearing down sandpits and prototype environments quickly. It therefore needed not to operate at “one capacity” but to scale as needed in any of its key capabilities.
Introducing … the Modern Data Platform
A modern data platform should address these issues, and above is one technology-free representation of a Modern Data Platform. Drawing on features of the Gartner Logical Data Warehouse, the Lambda Architecture and Microsoft’s own Azure Data Services architecture, it is intended to abstract above specific technologies and provide a capability architecture. Most importantly, it is not intended to be radical or ground-breaking; instead it will likely resonate with existing data platform/data warehousing architectures in many organisations.
This post briefly summarised the two main design patterns for traditional data warehousing, their differences and similarities. It then discussed the key pressures these architectures have come under and the requirements for any succeeding architecture, before briefly introducing one version of a capability architecture for a modern data platform. In the next post we’ll go into more detail on this architecture, and in future posts we’ll overlay Azure Data Services to illustrate how the Azure cloud can meet the needs of a Modern Data Platform.