Data Lakes

Much has been written about the concept of a data lake. Originally it was an open data environment for exploration and analytics development, where a wide variety of data sets from many sources were stored and analyzed to see if they could be used to develop new business value. The data lake was assumed to be implemented on an Apache Hadoop cluster.

When I started looking at the architecture of a data lake back in 2013, there were a number of common challenges associated with data lakes, particularly as the amount of data stored grew and the number of people using the data lake increased:

  • How is the right information located by the users of the data lake?
  • How is this information protected whilst still being open for sharing?
  • How is new insight derived from the data lake shared across the organization?
  • How is the data within the data lake managed so it supports the organization’s workloads?

Working with ING and with IBM colleagues, we developed a robust data lake reference architecture that was marketed under the name “Data Reservoir”.

This reference architecture differed from other work at the time in three significant ways:

  • It defined a set of services around the data lake repositories that managed all access to and use of the data. Individuals did not have direct access to the data, but worked from automatically populated sandboxes.
  • Metadata about the data was used to provide a comprehensive catalog of the data and its properties. The services used this metadata to enable self-service access to the data, business-driven data protection and governance of the data.
  • The data repositories that organized the data could be hosted on a variety of data platforms, from Apache Hadoop to relational stores, graph databases and document stores. The data was organized on these platforms to provide the appropriate performance for the workloads they supported.

The result is that the metadata and governance capability prevent the data lake from becoming a data swamp; the services provide business-friendly facades that give easy access to data; and new data platforms can be brought into the solution as needed without impacting business users, since they still access the data through the services.
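To make the shape of this architecture concrete, here is a minimal sketch in Python. It is not the Data Reservoir implementation; every class, method and field name below (CatalogEntry, Repository, DataLakeService, provision_sandbox and so on) is hypothetical and chosen only for illustration. It assumes a much simplified model in which the catalog records which platform holds each data set and how it is classified, and in which a user's clearance is just a set of permitted classifications.

```python
from dataclasses import dataclass


@dataclass
class CatalogEntry:
    """Metadata describing one data set (hypothetical, simplified model)."""
    name: str
    platform: str        # which data platform hosts it, e.g. "hadoop"
    classification: str  # governance label, e.g. "public" or "confidential"
    description: str = ""


class Repository:
    """Stand-in for a data platform; a real one would wrap Hadoop, an RDBMS, etc."""

    def __init__(self, rows_by_dataset):
        self._rows = rows_by_dataset

    def read(self, dataset):
        # Return a copy so the caller works on sandbox data, not the original.
        return list(self._rows[dataset])


class DataLakeService:
    """Single access point: users never touch the repositories directly."""

    def __init__(self, catalog, repositories, clearances):
        self._catalog = {entry.name: entry for entry in catalog}
        self._repositories = repositories  # platform name -> Repository
        self._clearances = clearances      # user -> set of allowed classifications

    def search(self, keyword):
        """Self-service discovery driven purely by catalog metadata."""
        kw = keyword.lower()
        return sorted(
            entry.name
            for entry in self._catalog.values()
            if kw in entry.name.lower() or kw in entry.description.lower()
        )

    def provision_sandbox(self, user, dataset):
        """Copy a data set into a sandbox if the user's clearance permits it."""
        entry = self._catalog[dataset]
        if entry.classification not in self._clearances.get(user, set()):
            raise PermissionError(f"{user} may not access {dataset}")
        # The catalog, not the user, decides which platform is queried.
        return self._repositories[entry.platform].read(dataset)


# Example wiring with made-up data sets and a made-up user.
catalog = [
    CatalogEntry("customer_profiles", "relational", "confidential", "Customer master data"),
    CatalogEntry("web_clicks", "hadoop", "public", "Raw clickstream events"),
]
repositories = {
    "relational": Repository({"customer_profiles": [{"id": 1}]}),
    "hadoop": Repository({"web_clicks": [{"page": "/home"}]}),
}
service = DataLakeService(catalog, repositories, {"analyst": {"public"}})
```

In this sketch, moving web_clicks to a different platform only requires changing its catalog entry and registering the new repository; callers of search and provision_sandbox are unaffected, which is the property the services layer is meant to provide.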

Building out this data lake reference architecture created significant challenges for the pioneering organizations that were attempting to use the data lake as a means of becoming more data-driven:

  • Many data tools tended to see metadata as documentation, not as the configuration of an operational system. This meant they did not offer APIs to access the metadata at runtime, nor were they mature enough to support high availability (HA) and recovery scenarios.
  • There were no data tools that covered all of the metadata and functions needed by the data lake. In general, the ETL tools had the most mature metadata capability, since they were managing the integration and movement of data between heterogeneous systems[1]. However, even the ETL portfolios did not integrate seamlessly with information virtualization engines, business intelligence reporting tools, data security functions and information lifecycle management tools.
  • Many data experts were used to building data warehouses. They were not comfortable with the lack of a common data model, nor were they used to building highly available real-time systems. This led to
  • Data security practices were built around the notion that data and people are siloed, which limits the amount of data each person can access. The data lake consolidates data from many silos and as such requires a rethink of how data is secured in this environment.

Today the reference architecture has been hardened to address these challenges, and many other thought leaders have added to our knowledge of how to build successful data lakes. In addition, the work to integrate data tools and drive the management of data through metadata has led to a focus on the Apache Atlas project as an open metadata and governance platform for solutions such as data lakes.

Notes:

  1. We used IBM’s InfoSphere Information Governance Catalog as the core metadata store for the data lake because it had a comprehensive metadata model out of the box, plus tools to populate and use the data lake and open APIs to extend the data model.

 

Photo: Entering the Rybinsk Reservoir, Russia