
Databricks: Delta Lake Data Lakehouse

William Tsu
Data Analyst
Experienced data analyst working with data visualization, cloud computing and ETL solutions.
July 20, 2022


Data Lakehouse

The data lakehouse refers to a new generation of open platforms that unify data warehousing and advanced analytics. It is an architectural pattern characterized by open, direct-access data formats, first-class support for machine learning and data science workloads, and state-of-the-art performance. The lakehouse aims to solve challenges left by earlier architectures. First-generation systems coupled data storage and compute in on-premises appliances. Second-generation platforms followed, offloading raw data into data lakes.

A data lake is a low-cost storage system with a file API (application programming interface) that holds data in generic, open file formats such as Apache Parquet and ORC. The approach began with the Apache Hadoop movement, which used the Hadoop Distributed File System (HDFS) for cheap storage. A data lake follows a schema-on-read architecture, which gives the agility to store any data at low cost and apply structure only when the data is read. In first-generation platforms, data was extracted, transformed, and loaded (ETL) from operational systems directly into a data warehouse. In modern two-tier architectures, data is first loaded into the data lake and then ETL'd again into downstream data warehouses, a two-step process that adds complexity and delay.
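
As a rough illustration of the schema-on-read pattern described above, the sketch below writes raw records to a data lake path as open-format Parquet files with PySpark and reads them back, applying the schema only at read time. The path and column names are hypothetical, and the snippet assumes a working Spark installation.

```python
# Minimal schema-on-read sketch with PySpark (hypothetical path and columns).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data-lake-demo").getOrCreate()

# Land raw operational records in the lake as open-format Parquet files.
raw = spark.createDataFrame(
    [(1, "2022-07-20", 49.99), (2, "2022-07-21", 15.00)],
    ["order_id", "order_date", "amount"],
)
raw.write.mode("append").parquet("/data-lake/raw/orders")

# Schema-on-read: structure is recovered from the Parquet files at query time.
orders = spark.read.parquet("/data-lake/raw/orders")
orders.printSchema()
```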

Challenges faced by this two-tier architecture include reliability, data staleness, limited support for advanced analytics, and a high total cost of ownership. The data lakehouse addresses these problems directly: reliable data management on data lakes, support for machine learning and data science, and competitive SQL performance. Data warehouses remain critical for many business processes, yet they suffer from familiar issues such as incorrect data, staleness, and high cost. The industry is already taking steps toward the lakehouse in response: virtually all data warehouses now support external tables in Parquet and ORC format.

This lets data warehouse users query the data lake from the same SQL engine. However, data lakes still lack basic management features such as ACID (atomicity, consistency, isolation, durability) transactions and indexing, so they cannot match data warehouse performance.

The Data Lakehouse Architecture

A data lakehouse is a data management system built on low-cost, directly accessible storage that also provides traditional DBMS (database management system) management and performance features such as ACID transactions, data versioning, auditing, indexing, caching, and query optimization. It combines the key benefits of data lakes and data warehouses: low-cost storage in an open format accessible by a variety of systems, together with the powerful management and optimization features of data warehouses. The lakehouse is a particularly good fit for cloud environments with separate compute and storage, but it can also be implemented over an on-premises storage system such as HDFS. One of the cleanest designs for building a lakehouse is provided by Databricks through the combination of Delta Lake, Delta Engine, and the Databricks Machine Learning Runtime.
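
To make that combination concrete, here is a hedged sketch of a Delta Lake table on storage, assuming the open source delta-spark package is installed and the session is configured with the Delta extensions. The table path and columns are invented for illustration; the point is that the data stays in open Parquet files while the table still supports warehouse-style ACID operations such as MERGE.

```python
# Sketch only: a Delta table is open Parquet files plus a transaction log,
# yet it supports ACID operations. Assumes delta-spark is installed.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = (
    SparkSession.builder.appName("lakehouse-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Create the table: Parquet data files plus a _delta_log directory.
spark.createDataFrame([(1, "new"), (2, "new")], ["id", "status"]) \
    .write.format("delta").save("/lakehouse/orders")

# ACID upsert: readers never observe a half-applied change.
updates = spark.createDataFrame([(2, "shipped"), (3, "new")], ["id", "status"])
target = DeltaTable.forPath(spark, "/lakehouse/orders")
(target.alias("t")
    .merge(updates.alias("u"), "t.id = u.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```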

Implementing a Data Lakehouse System

The first step in implementing a lakehouse is to store the data in a low-cost object store using a standard file format such as Apache Parquet. A transactional metadata layer is then implemented on top of the object store; it defines which objects are part of each table version. This layer makes it possible to add management features such as ACID transactions and versioning while keeping the bulk of the data in the low-cost object store, where clients can still read it directly using a standard file format.
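
A hedged sketch of that idea, reusing the Spark session and the hypothetical paths from the earlier snippets: the data already sits in the store as plain Parquet, and the lakehouse is created by placing Delta Lake's transactional metadata layer on top of those files.

```python
from delta.tables import DeltaTable

# Build the transaction log over the existing Parquet objects in place;
# the data files are not rewritten and remain directly readable as Parquet.
DeltaTable.convertToDelta(spark, "parquet.`/data-lake/raw/orders`")

# Clients now get table versions and ACID guarantees through the metadata
# layer, while Parquet-aware tools can still read the underlying files.
```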

Metadata Layers for Data Management

Data lake storage systems such as S3 and HDFS expose only low-level object-store or filesystem interfaces, where even simple operations, such as updating a table that spans multiple files, are not atomic. To address this, organizations began designing richer data management layers, starting with Apache Hive ACID.

In recent years, newer systems have added capabilities that improve scalability. In 2016, Databricks started developing Delta Lake, which stores the information about which objects belong to a table as a transaction log in Parquet format within the data lake itself, enabling it to scale to billions of objects per table.
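
As an illustration of the transaction log at work (continuing the hypothetical table from the earlier sketches), each commit produces a new table version that can be listed and queried, which is the basis for auditing and time travel.

```python
# List the commit history recorded in the Delta transaction log.
history = spark.sql("DESCRIBE HISTORY delta.`/lakehouse/orders`")
history.select("version", "timestamp", "operation").show()

# Time travel: read the table as it looked at an earlier version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/lakehouse/orders")
v0.show()
```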

Databricks Open-Sourcing Delta Lake

Databricks is open-sourcing Delta Lake in part to counter criticism from rivals. The move also helps Databricks take on Apache Iceberg and data warehouse products from Snowflake, Starburst, Google Cloud, AWS, and Oracle, among others.

To push past doubts cast by its data lake and data warehouse rivals, Databricks announced that it is open-sourcing all Delta Lake APIs as part of the Delta Lake 2.0 release. The company will also contribute all of its Delta Lake enhancements to the Linux Foundation.

Rivals such as Oracle, Google, Microsoft, AWS, and Snowflake have criticized the company, casting doubt on whether Delta Lake was truly open source or effectively proprietary. According to analysts, that criticism cost Databricks a share of prospective customers. Open-sourcing Delta Lake provides much-needed clarity and continuity, helping to arrest the slide of customers toward competitors and addressing both customer concerns and competitive criticism.

Databricks customers can now trust that their data sits in an open platform and that they are not locked into Delta Lake. Databricks refers to the data architecture built on Delta Lake as the data lakehouse.

Competition in the Commercial Open Source Market

The number of commercial open source projects in the data lake market is on the rise. Databricks' Delta Lake now faces new competition, including Apache Iceberg, which offers high-performance querying over large analytic tables. Several open source projects have begun to be commercialized, including Onehouse for Apache Hudi, while both Starburst and Dremio have come out with their own Apache Iceberg offerings. With these offerings on the market, Delta Lake faces pressure from other open source lakehouse table formats, and many competing vendors have invested in Apache Iceberg as an alternative to Delta tables. Delta tables store data in columnar Parquet files with ACID transaction support and keep table metadata in the transaction log, which helps speed up data ingestion and queries.

In April 2022, Google announced BigLake with Iceberg support, while Snowflake had earlier announced support for Apache Iceberg. These moves appealed to prospective customers concerned about committing to a single vendor. In the face of this competition, Databricks committed to open-sourcing Delta Lake, the foundation of its data lakehouse. According to a former research vice president for big data and analytics at Gartner, it is a strong move that helps Databricks compete with the big players in the marketplace and widen its adoption.

Delta Lake 2.0 Offers Faster Query Performance

According to a Databricks spokesperson, Delta Lake 2.0, which will be fully available later this year, offers faster query performance for all kinds of data analysis. The company also released MLflow 2.0, the second major edition of its open source platform for managing the end-to-end machine learning lifecycle (MLOps). MLflow 2.0 introduces MLflow Pipelines, which give data scientists predefined, production-ready templates based on the type of model being built.

The spokesperson said these templates allow data scientists to accelerate model development without requiring much intervention from production engineers. Analysts see MLflow 2.0 as a more mature option for data scientists, because taking machine learning to production remains challenging: translating algorithmic models into production-grade application code on securely governed resources continues to be difficult. Although there are several vendor solutions, analysts view Databricks as a more natural vendor here than the hyperscalers. Its unified approach to data and model management is a differentiator from MLOps vendors that focus only on the coding and production challenges of model operationalization. The release of MLflow 2.0 thus eases the path to bringing streaming and its analysis into production data pipelines.
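
For context, the snippet below is a minimal sketch of MLflow's core experiment tracking, not of the MLflow Pipelines templates mentioned above (those are driven by project configuration rather than a few lines of code). The model, parameters, and metric are illustrative, and the sketch assumes mlflow and scikit-learn are installed.

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=5, random_state=0)

# Track a training run: parameters, metrics, and the model artifact are logged
# so the hand-off to production (MLOps) starts from a versioned, reproducible run.
with mlflow.start_run(run_name="demo-run"):
    model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
    mlflow.log_param("n_estimators", 50)
    mlflow.log_metric("train_r2", model.score(X, y))
    mlflow.sklearn.log_model(model, "model")
```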

Conclusion

Databricks has decided to open-source its Delta Lake data lakehouse. The move helps it reach prospective customers as a natural vendor alongside the likes of Google, AWS, and Oracle. Companies looking for solid data management and advanced analytics, built on data lakehouse and data warehouse techniques, should take a close look at Databricks.