OneLake: Microsoft Fabric’s Open Storage Architecture
Data practitioners encounter the same challenges every day. Data is typically dispersed across several sources, in a variety of file formats, and too often of questionable quality. Determining where data lives and who has access to it can take a lot of time. Consequently, less time is spent on what really counts: using the data to make decisions.
To address the issues surrounding data fragmentation, Microsoft introduced a lake-centric and open architecture in their new Fabric offering. While each tool within Microsoft Fabric caters to specific requirements, they all share a common data foundation: OneLake, based on the increasingly popular “data lakehouse” paradigm.
The Data Lakehouse, A Pattern Aiming for the Best of Data Lakes and Warehouses
The lakehouse term and pattern were initially made popular by Databricks in the past few years as a way to bring governance and structure to the vast amounts of unstructured raw data typically loaded in data lakes. Lakehouses follow an Extract Load Transform (ELT) pattern also known as “schema on read”, as opposed to the ETL pattern associated with “schema on write” in traditional data warehouses.
This architectural shift has been driven by the explosion in the volume and variety of data and enabled by massive public cloud infrastructure investments. The lakehouse aims to combine cheap data lake storage with the sense of structure and the ability to run efficient aggregate queries associated with data warehouses.
Microsoft’s Take on the Lakehouse
“OneLake is the OneDrive for data and like OneDrive, OneLake is provisioned automatically with every Fabric tenant with no infrastructure to manage.” -Microsoft
Microsoft already started supporting lakehouses years ago with Azure Synapse Analytics. OneLake now underpins Microsoft’s expanded vision of the lakehouse architecture. This unified, logical data lake supports lakehouses, warehouses and all other workloads.
OneLake comes automatically provisioned for each tenant and is the home for all your data, moving away from silos and the need to copy data to every place you need it. Under the hood, it is still Azure Data Lake Storage (ADLS) Gen2, where Parquet files are managed with Delta metadata.
This means that Fabric avoids vendor lock-in and that the data in your OneLake can be used with a wide range of tools and technologies. Fabric's many compute engines all rely on the Delta/Parquet format, providing you with a stable foundation and reducing the need for format conversions.
You can also easily bring data from the outside without copying it. Shortcuts are embedded references within OneLake that point to other storage locations. The embedded reference makes it appear as though the files and folders are stored locally but in reality, they exist only in their original storage. Shortcuts can be updated or removed, but these changes don’t affect the original data and its source.
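To make the mechanics concrete, here is a toy model of how shortcut resolution behaves from the consumer's point of view: paths under a shortcut appear local but resolve to the original storage. The shortcut names and target URIs below are illustrative placeholders, and this is not the actual Fabric shortcut API.

```python
# Toy model of OneLake shortcut resolution. The shortcut names and
# target URIs are illustrative, not the real Fabric API.

shortcuts = {
    "Files/external_sales": "https://mybucket.s3.amazonaws.com/sales",
    "Files/partner_data": "https://partneracct.dfs.core.windows.net/share",
}

def resolve(path: str) -> str:
    """Return the backing location for a path: paths under a shortcut
    map to their original storage, everything else stays in OneLake."""
    for prefix, target in shortcuts.items():
        if path == prefix or path.startswith(prefix + "/"):
            return target + path[len(prefix):]
    return "onelake://" + path

print(resolve("Files/external_sales/2023/orders.parquet"))
# https://mybucket.s3.amazonaws.com/sales/2023/orders.parquet
```

Note that removing an entry from the mapping would make the path unreachable through OneLake, but (as the article says) would never touch the data in its original location.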
OneLake supports existing ADLS Gen2 APIs and SDKs giving you the ability to hook up existing applications and tools to a OneLake endpoint.
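Because OneLake speaks the ADLS Gen2 protocol, existing tools mostly just need the right endpoint. A minimal sketch (pure Python; the workspace and lakehouse names are placeholders) of building the two URI forms that OneLake accepts:

```python
# Build OneLake URIs in the two forms most tools expect.
# The workspace ("Sales") and item ("Contoso.Lakehouse") are placeholders.

ONELAKE_HOST = "onelake.dfs.fabric.microsoft.com"

def onelake_https_url(workspace: str, item: str, path: str) -> str:
    """HTTPS form, as used with ADLS Gen2 REST APIs and SDKs."""
    return f"https://{ONELAKE_HOST}/{workspace}/{item}/{path}"

def onelake_abfss_url(workspace: str, item: str, path: str) -> str:
    """abfss:// form, as used by Spark and other Hadoop-based engines."""
    return f"abfss://{workspace}@{ONELAKE_HOST}/{item}/{path}"

print(onelake_abfss_url("Sales", "Contoso.Lakehouse", "Tables/orders"))
# abfss://Sales@onelake.dfs.fabric.microsoft.com/Contoso.Lakehouse/Tables/orders
```

An existing application built for ADLS Gen2 (for example, one using the Azure Storage SDK) can typically be pointed at such a URL without code changes beyond the endpoint itself.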
There are several ways to get your data into the OneLake:
- Manually with the OneLake file explorer
- Data Factory pipelines
- Notebooks
- Shortcuts, symbolic links to external storage locations and file paths (currently supporting ADLS Gen2 and AWS S3 buckets)
The Delta metadata format offers optimized storage for data engineering workflows with versioning, schema enforcement, and ACID transactions to guarantee data integrity. It also integrates well with Apache Spark, making it ideal for large-scale data processing applications.
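The versioning mentioned above rests on Delta's transaction log: next to the Parquet data files sits a `_delta_log` folder of zero-padded JSON commit files, and the highest commit number is the current table version. A simplified stdlib-only sketch of that layout (real commit files carry much richer action records than shown here):

```python
import json
import tempfile
from pathlib import Path

# Simplified sketch of Delta's transaction log layout: each commit is a
# zero-padded JSON file under _delta_log; the highest-numbered file is
# the current table version. Real commits contain richer metadata.

table = Path(tempfile.mkdtemp()) / "orders"
log = table / "_delta_log"
log.mkdir(parents=True)

for version, action in enumerate([
    {"add": {"path": "part-0000.parquet"}},     # v0: initial load
    {"add": {"path": "part-0001.parquet"}},     # v1: append
    {"remove": {"path": "part-0000.parquet"}},  # v2: delete/compaction
]):
    (log / f"{version:020d}.json").write_text(json.dumps(action))

def current_version(log_dir: Path) -> int:
    """The table version is the highest commit number in the log."""
    return max(int(p.stem) for p in log_dir.glob("*.json"))

print(current_version(log))  # 2
```

Because old commit files are retained, readers can reconstruct any earlier version of the table, which is what enables Delta's time-travel queries.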
From Unmanaged to Managed Data Through the Medallion Architecture… And the Lasting Need for Data Modeling & Integration
Not all the data needs to be managed or is ready to be managed yet. In fact, there are many cases where data is ingested in raw form and should be preserved in that state, if only for auditing and traceability purposes. For these cases, OneLake provides the Files section where you can store and access any file format. Data stored in those files is called unmanaged data.
However, in the lakehouse architecture, tables play a vital role in managing and organizing data. Once you set up tables you have several new options for browsing, querying, and analyzing them. Data stored in those tables is called managed data.
[Image: Lakehouse menu within Fabric, showing the Tables and Files sections in Explorer.]
Both managed and unmanaged data can be stored and handled using a medallion architecture where data is corralled through three layers of progressive refinement, with the aim of turning low-grade raw ore into ready-for-analytics gold. That gold layer at the end of the medallion architecture should be ready for consumption in a semantic model – typically a Power BI dataset – without requiring further transformations.
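The three layers can be sketched as a toy pipeline on in-memory records (the column names and cleaning rules are illustrative only): bronze preserves raw input as ingested, silver deduplicates and types it, and gold aggregates it into a consumption-ready shape.

```python
# Toy medallion flow: bronze (raw) -> silver (cleaned) -> gold (aggregated).
# Column names and rules are illustrative placeholders.

bronze = [  # raw, as ingested: duplicates and bad rows preserved
    {"order_id": "1", "amount": "10.50", "region": "west"},
    {"order_id": "1", "amount": "10.50", "region": "west"},  # duplicate
    {"order_id": "2", "amount": "n/a",   "region": "east"},  # bad amount
    {"order_id": "3", "amount": "7.25",  "region": "east"},
]

def to_silver(rows):
    """Deduplicate on order_id and drop rows with unparseable amounts."""
    seen, out = set(), []
    for r in rows:
        try:
            amount = float(r["amount"])
        except ValueError:
            continue
        if r["order_id"] in seen:
            continue
        seen.add(r["order_id"])
        out.append({**r, "amount": amount})
    return out

def to_gold(rows):
    """Aggregate cleaned rows into analytics-ready totals per region."""
    totals = {}
    for r in rows:
        totals[r["region"]] = totals.get(r["region"], 0.0) + r["amount"]
    return totals

silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # {'west': 10.5, 'east': 7.25}
```

The point of keeping bronze untouched is exactly the auditing and traceability argument made earlier: silver and gold can always be rebuilt from it if the cleaning rules change.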
This brings up a debate that never seems to get resolved in the data industry. Every few years, someone loudly proclaims the end of dimensional modeling in favor of wide tables that blend granular facts with their descriptive attributes: "Hey, it worked in our startup, so established data patterns are now obsolete!" Let's be clear: this approach simply will not perform well for large data models in Power BI, so it can be ruled out on technical grounds alone if Power BI is part of your data stack.
Perhaps more importantly, trying to dodge the analytical work required to reconcile various data sources will lead to analytical silos for lack of conformed dimensions. The integration mantra championed by Bill Inmon since the 1990s is more relevant than ever. And dimensional modeling, following the long-established patterns promoted by Ralph Kimball, continues to be relevant today.
The star schema still rules for Power BI performance, and making sense of your business entities across the enterprise is still needed sooner or later. The good news is (and we’ll get back to it in future entries) Fabric has all the tools to ingest raw data and turn it into a data model that’s optimally stored physically and structured logically for Power BI consumption and other workloads, based on the OneLake foundation. The core pattern remains ELT, not just EL!
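For readers new to the pattern, here is a toy sketch of what that modeling step does: splitting one wide denormalized table into a dimension (one row per product, keyed by a surrogate integer) and a slim fact table at the original grain. Table and column names are illustrative placeholders.

```python
# Toy star-schema split: derive a product dimension with surrogate keys
# and a slim fact table from one wide denormalized table.
# Names and columns are illustrative placeholders.

wide = [
    {"date": "2024-01-01", "product": "Widget", "category": "Tools", "qty": 3},
    {"date": "2024-01-01", "product": "Gadget", "category": "Toys",  "qty": 1},
    {"date": "2024-01-02", "product": "Widget", "category": "Tools", "qty": 2},
]

# Dimension: one row per distinct product, keyed by a surrogate integer.
dim_product, key_of = [], {}
for row in wide:
    name = row["product"]
    if name not in key_of:
        key_of[name] = len(dim_product) + 1
        dim_product.append({"product_key": key_of[name],
                            "product": name,
                            "category": row["category"]})

# Fact: same grain as the source, descriptive attributes replaced by the key.
fact_sales = [{"date": r["date"],
               "product_key": key_of[r["product"]],
               "qty": r["qty"]} for r in wide]

print(len(dim_product), len(fact_sales))  # 2 3
```

In Power BI terms, the narrow fact table compresses far better in the VertiPaq engine, and the shared dimension is what makes facts from different sources conformable.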
Contact us for a free strategy briefing to discuss how Microsoft Fabric can be part of your data infrastructure roadmap.
Venkataramana
Principal Architect
About Author
With over 20 years of experience, Venkat is an accomplished senior professional who excels in managing teams and driving success in Data, Analytics, Cloud Computing, and Digital Transformation. He plays a pivotal role in shaping strategy and architecture, transforming ad-hoc assessments into scalable software solutions. His expertise lies in enhancing the effectiveness and usability of analytics platforms by collaborating closely with stakeholders and strategic partners.