"Art of Managing & Working around Data: DataLake"

What is a Data Lake?

A data lake is a centralised storage system that holds all the unprocessed data ingested from various sources. It can scale to accommodate all of an enterprise's data, in any type or format: structured (rows and columns), semi-structured (CSV, XML, JSON), unstructured (documents, emails, PDFs), and binary (audio, images, video). Insights for the business are then extracted from this data using a variety of big data processing techniques.

A data lake is an infrastructure that expands to meet changing organisational needs. It offers storage for all types of data, multiple modes of processing, and different kinds of analytics.

A data lake typically stores two forms of data: raw and processed. Keeping the raw data makes it possible to trace data through every stage of its lifecycle: ingestion, storage, processing, analytics, and application. A data lake supports a variety of compute engines, including batch processing, stream processing, interactive analytics, and machine learning, which can be used to analyse the data and derive new insights from it. And because it holds data in many formats and types, a data lake serves as a multi-modal storage engine.
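As a concrete illustration, here is a minimal PySpark sketch of the batch path: raw JSON events are read from the lake, aggregated, and written back to a curated zone. The bucket name and the event fields (`timestamp`, `event_type`) are illustrative assumptions, not taken from any particular product.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lake-batch-demo").getOrCreate()

# Read raw JSON events from a hypothetical raw zone of the lake.
events = spark.read.json("s3a://my-data-lake/raw/events/")

# Aggregate: daily counts per event type (field names are assumptions).
daily_counts = (
    events
    .withColumn("day", F.to_date("timestamp"))
    .groupBy("day", "event_type")
    .count()
)

# Write the processed result back to a curated zone of the same lake.
daily_counts.write.mode("overwrite").parquet("s3a://my-data-lake/curated/daily_counts/")
```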

The big data processing ecosystem has evolved from Hadoop MapReduce to batch processing and then stream processing. At each stage, the computation paradigm changed to keep up with the volume and velocity of big data. MapReduce was slow because it saved intermediate results to disk. Apache Spark solved that problem with in-memory data processing, although it was designed for bounded datasets. Apache Flink is designed for low-latency processing of unbounded streams of data. A data lake uses these processing engines to process the raw data.
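To make the bounded/unbounded distinction concrete, here is a small Spark Structured Streaming sketch. Spark's built-in `rate` source generates rows continuously, standing in for an unbounded event stream; the windowed counts update incrementally as data arrives, instead of waiting for a complete dataset as MapReduce or batch Spark would.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stream-demo").getOrCreate()

# The 'rate' source emits rows continuously (columns: timestamp, value),
# standing in for an unbounded stream such as IoT events or clickstreams.
stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Windowed counts over the unbounded stream, updated as new rows arrive.
counts = stream.groupBy(F.window("timestamp", "10 seconds")).count()

query = (
    counts.writeStream
    .outputMode("complete")
    .format("console")
    .start()
)
query.awaitTermination(30)  # let it run briefly for the demo
query.stop()
```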

Why you need a Data Lake

Data warehouses (DWH) have traditionally been used to store data and extract insights, but the DWH cannot keep up with the expanding volume and velocity of data. The AWS website offers a good comparison of the data lake and the DWH: a DWH stores data from business applications in a structured format, whereas a data lake stores data in any form from any source (such as IoT devices, logs, mobile game play, and online user activity).

A DWH is schema-on-write: before data is loaded into the warehouse, a schema must be designed based on business requirements. Because that schema is built with the expected queries in mind, extracting business intelligence is simpler and faster, and the underlying engine does not need to combine numerous tables to produce results. The data lake, on the other hand, is schema-on-read and is flexible enough to adapt to unforeseen business changes and data types. It allows flexible data storage, but consuming the data effectively calls for data processing techniques.
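A short sketch of the two approaches side by side, assuming hypothetical table, field, and path names. In the schema-on-write case the structure is declared before any data is loaded; in the schema-on-read case raw JSON is dumped as-is and the structure is inferred only at query time.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-demo").getOrCreate()

# Schema-on-write (warehouse style): the schema must exist before any data
# is loaded, and writes that do not match it are rejected.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales (
        order_id BIGINT,
        amount   DOUBLE,
        region   STRING
    ) USING parquet
""")

# Schema-on-read (lake style): raw JSON lands as-is, and the structure is
# discovered when the data is read, so new fields can appear at any time.
raw = spark.read.json("/lake/raw/sales/")  # hypothetical lake path
raw.printSchema()                          # schema inferred on read
raw.createOrReplaceTempView("raw_sales")
spark.sql("SELECT region, SUM(amount) FROM raw_sales GROUP BY region").show()
```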

A data warehouse is also more expensive than a data lake: it requires premium storage to deliver good performance, whereas a data lake uses low-cost bulk storage. In addition, a DWH demands significant administrative expense for backup and upkeep.

A DWH restricts exploration: access is limited to people with certain skills, must go through the compute layer, and is mediated by the DBA. Data lakes, on the other hand, give users access to both storage and compute, including the unprocessed datasets, for further exploration and research.

Finally, a DWH permanently stores only aggregated or curated data, which makes in-depth analysis of raw data types such as JSON challenging.
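This is the kind of analysis a lake makes natural: querying nested objects and arrays in raw JSON directly, without first flattening them into warehouse tables. The path and the fields (`user_id`, `device.os`, `actions`) below are illustrative assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("nested-json-demo").getOrCreate()

# Hypothetical raw events with a nested 'device' object and an 'actions' array.
raw = spark.read.json("/lake/raw/clickstream/")

# Drill into nested fields and explode the array, all on the raw data.
per_os_action = (
    raw
    .select("user_id", "device.os", F.explode("actions").alias("action"))
    .groupBy("os", "action")
    .count()
)
per_os_action.show()
```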

How — Data Lake solutions

Data lake solutions on AWS support different data sources, including S3, NoSQL stores, and AWS relational databases. With the help of AWS Lake Formation, data is gathered from these databases and object stores and moved into the lake storage, S3. You can also build a data catalogue and process your data with additional methods, such as machine learning. Later, analytics tools such as Athena and Redshift can use these datasets to deliver insights.
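A hedged boto3 sketch of the last step of that flow: raw data sits in S3, the catalogue describes it, and Athena queries it in place. The database, table, and bucket names are illustrative assumptions.

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Kick off an Athena query against a hypothetical catalogued events table;
# results are written to an S3 location of your choosing.
response = athena.start_query_execution(
    QueryString="SELECT event_type, COUNT(*) FROM events GROUP BY event_type",
    QueryExecutionContext={"Database": "my_lake_catalogue"},
    ResultConfiguration={"OutputLocation": "s3://my-data-lake/athena-results/"},
)
print("Query started:", response["QueryExecutionId"])
```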

The Azure Data Lake solution includes a compute engine, storage, and numerous additional services, and Azure's documentation depicts the end-to-end analytics pipeline from ingestion through to the data warehouse in high-level detail. Data keeps expanding and more businesses are choosing to adopt a data-oriented approach, and a data lake is the only cost-effective way to manage that increasing volume and velocity of data.

Azure provides ADLS (Azure Data Lake Storage) Gen2, storage purpose-built for data lakes. It offers efficient, cost-effective support for a variety of data types, a hierarchical file system, and access control. Gen2 also offers tiered storage: Hot, Cool, and Archive. Users can store data in whichever tier fits their needs and pay according to that tier.
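A minimal sketch using the azure-storage-file-datalake SDK to write a file into an ADLS Gen2 hierarchical namespace. The account, container, and path names are illustrative assumptions.

```python
from azure.storage.filedatalake import DataLakeServiceClient

# Hypothetical storage account; the credential here is an account key string.
service = DataLakeServiceClient(
    account_url="https://mylakeaccount.dfs.core.windows.net",
    credential="<account-key>",
)

# A container acts as a lake zone; directories form the hierarchical namespace.
fs = service.get_file_system_client(file_system="raw")
directory = fs.get_directory_client("events/2024/01")
directory.create_directory()

# Create a file in that directory and upload a small JSON payload.
file_client = directory.create_file("events.json")
file_client.upload_data(b'{"event": "signup"}', overwrite=True)
```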

Azure Data Factory is a GUI-based tool for building data ingestion dataflows, which run as Spark jobs on the Databricks platform. Developers, on the other hand, can author the transformation workflows in Databricks notebooks.
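For instance, a transformation a developer might write in a Databricks notebook, downstream of a Data Factory ingestion flow, could look like the following sketch. The ADLS paths and field names (`order_id`, `created_at`, `amount`) are illustrative assumptions, and `spark` is the session the notebook runtime provides.

```python
from pyspark.sql import functions as F

# Read the raw orders the ingestion flow landed in the lake.
raw = spark.read.json("abfss://raw@mylakeaccount.dfs.core.windows.net/orders/")

# Clean and shape the data for downstream analytics.
curated = (
    raw
    .dropDuplicates(["order_id"])
    .withColumn("order_date", F.to_date("created_at"))
    .filter(F.col("amount") > 0)
)

# Persist the curated dataset back to the lake in a columnar format.
curated.write.mode("overwrite").parquet(
    "abfss://curated@mylakeaccount.dfs.core.windows.net/orders/"
)
```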