Power BI is Microsoft’s Data Discovery Tool however it has 3 distinct methods for data access. How do these differ and how do we choose between them?
Power BI is designed first and foremost as a self-service data discovery tool aimed at business analysts and data miners however it can also be used to build rich dashboards and BI “applications” for more casual users. What isn’t necessarily obvious however is that it has 3 distinct data access methods, “Import”, “DirectQuery” & “Explore”, each having its place & its own set of pros & cons. This post will look at these 3 options, compare them and try to advise on when to use each one.
The 3 options are shown below.
- Import – Create cached copy, periodically refresh from data source
- DirectQuery – push down queries to data source
- Live/Exploration – Explore rich semantic layer of Analysis Services
Use when: Multiple data sources are needed, potentially in a mash up and/or when data volumes are relatively low.
Import/cache mode is the typical mode for Power BI (as it is for similar tools in this space) and is the “In memory” feature which data discovery tools are renowned for. In this mode a data model is created across potentially multiple data sources. Full data shaping capabilities are available which allows the analyst to wrangle the data, either to cleanse it, prepare it for analysis or to facilitate joins to other date sets.
Note that one of Power BI’s most powerful features in this mode is its “data step” paradigm – each transformation is placed in a queue of data manipulation events and this queue can be altered or reordered.
Once the data is wrangled relationships can be created between tables (if they haven’t already been inferred) which reflect the underlying data structure – this allows filtering on one table for example to reduce the rows returned from a related table.
In operation “Import mode” works as it suggests, it preloads data from the underlying data sources into a compressed format in memory. This approach gives “Import mode” its power and its limitations.
Because it creates an internal representation of its data Import Mode is the only mode which can span multiple data sources (for example SQL Server, MySQL & an Excel spreadsheet of targets in 1 report). This compressed representation also makes Import mode very fast once the initial loads are completed from the underlying sources.
There is still, however, a start up overhead when that initial load occurs (though published Power BI reports can periodically refresh the data automatically). Perhaps more importantly the data cache is limited in volume. Although the in memory compression used is highly efficient ultimately “Import mode” cannot be used for analysing very large data sets. You may find you need to filter or aggregate the source data before import which in turn narrows your scope for analysis.
However, as a data mash up tool for iterative data analysis and ad-hoc data wrangling “Import” has the most flexibility of all the modes.
Direct Query Mode
Use when: Data volumes are large and/or the underlying data is changing in real time which needs to be reflected in reporting.
The next option is “DirectQuery” mode – in this mode there is no up front loading of the cache, instead every change of filter or click on a visualization generates 1 or more queries which are sent to the underlying database.
A minor disadvantage here is that almost every action on the dashboard has at least some lag as each action can result in the generation of a query, its transmission to the underlying database and the wait for the return of any result set. As such “DirectQuery Mode” cannot compete with “Import Mode” for query response time once the cache has been refreshed.
However, the significant benefit is that the scale and scope of the data queried depends only on the power of the underlying DBMS (which might be a very large scale SQL Data Warehouse). Depending on the power of the DBMS being queried billions of rows can be aggregated and accessed in real time, especially if it’s a columnar store database such as SQL DW. Furthermore, whereas Import Mode can only reflect the data state since the last refresh, DirectQuery returns the latest data from the DBMS each time.
The key factor here however is “query push down”. Consider the query below produced by Power BI to display a bar chart of regional sales on a [Business Type] filter:
Total Sales By Region where [Business Type] = “BU001”
[Business Type] reduces the result set to be aggregated, however the behaviour of the interaction will vary depending on how [Business Type] is defined. If it’s a calculated column in Power BI by definition it does not exist on the underlying database, this in turn means the DB cannot filter on BU001 before grouping the data on Region. The result is that the DB is forced to return the entire data set to Power BI so that filtering and then aggregation can be performed locally. This clearly becomes unfeasible if the underlying table has millions or billions of rows.
However, if [Business Type] is an attribute on the underlying table this filter can occur at the DBMS, aggregation can follow & the only data returned is the set of regions & their total sales. This is known as “push down”, the ability for a DBMS client to drive most of the work to the underlying database thus reducing network traffic, and it can drastically alter performance. 2 superficially similar visualizations can perform very differently simply because one is unable to push down the majority of the work to the DB. Pushdown is a complex area and is evolving all the time (some platforms rewrite queries on the fly to allow more work to be pushed to the Database). However a good rule of thumb is to not use calculated columns as filters or group by columns and to only use columns that exist on the underlying database tables for these purposes.
The other major limitation of DirectQuery mode is that it can only operate over one data source. You cannot for example add in a spreadsheet of targets to your dashboard if you are in DirectQuery.
Finally, not all data shaping abilities are available in DirectQuery. Many features such as column renaming are & of course calculated columns & measures are still available. Other features however will generate a warning, suggesting that you switch to Import mode instead.
With all that said, Direct Query mode is still the most scalable in terms of data volumes & as mentioned it will also reflect the latest state of the underlying data, not the state since the last cache refresh.
Use when: You have a tabular or OLAP model OR you want a rich semantic layer which is shareable across multiple clients & use cases
Although Import and DirectQuery mode differ in how they manipulate data they have one major factor in common – each expects the Power BI developer to understand data models, joins, relationships in general and the structure of the underlying data in particular. “Explore Mode” does not expose the data model but instead utilises an abstracted version.
In this mode Power BI operates over an existing OLAP or Tabular model such as that provided by Azure Analysis Services. In effect, the same process is occurring as in “Import Mode” in that multiple data sources can be added together, relationships can be created and new, complex metrics developed. However, in “Explore mode” this is happening under the covers and Power BI in effect becomes a browser of the “semantic layer”.
Although similar to those models developed in “Import mode” an Azure Analysis Services model can scale to a far greater degree and also has the advantage of being available to clients other than Power BI. This post goes into more details on the trade offs between Power BI & Analysis Services.
As well as providing this rich semantic layer “Explore mode”, in comparison to DirectQuery Mode, will typically be faster in terms of query performance as it operates in memory. The key benefit is the provision of a scalable, flexible, multipurpose (Analysis Services supports perspectives which in effect create metadata defined data marts for different user bases) and highly available consumption layer which does not require the user to understand data models or relationships or how to create complex, potentially multidimensional DAX calculations. The flipside of this is that the user is confined to whatever design decisions were made by the semantic layer developer.
Also, and similar to DirectQuery, only one semantic layer model can be queried at a time. 2 models cannot be “mashed together” nor can a model be extended with say an Excel spreadsheet.
Summary of Capabilities
Below is a more complete list of capabilities with some of the areas discussed above highlighted.
As previously outlined, each mode is very different and has strengths & weaknesses. For pure brute force scale over billions of rows DirectQuery over a powerful backend (such as Azure SQL DW) is most appropriate. When the problem space requires data discovery & mashup across multiple sources with differing data structures, the powerful data shaping capabilities of Import combined with Power BI’s visualizations is likely the way to go. Or, finally, if you are looking to deliver self serve Analytics but don’t want your users to become data model experts, the combination of Analysis Services and Power BI explore mode lets your business users slice & dice and drill up, down and across a business friendly semantic layer, whilst the underlying environment can scale up to 400 GB (compressed) data to support the workload.
3 Modes, each with their own strengths and weaknesses & each suited to a slightly different analytics use case.
Power BI is evolving rapidly – click here for the Power BI blog.
There are some important new announcements for Azure Analysis Services
A new “optimised for Compute” tier was recently announced for Azure SQL Data Warehouse.