Data Lake, Governance

What do you govern?

Governance is a practice that you apply to “something”. Just like James Watt’s fly-ball governor for the steam engine, a governance program seeks to keep an engine in balance so it works effectively. This engine may be a process, an organization, or a flow of information. The important point is that the target of what you are governing is clearly defined.

Approaches to governance, particularly around a data lake, vary widely because organizations define the engine being managed in different ways. For example, the IT department may see the data lake engine as a collection of technology working together. The business may see the data lake as part of an innovation engine helping them to create new value from data. So which is the right engine to govern? It depends on the objective for the data lake.

A good starting point in defining the governance program for the data lake is to consider the perspective of each of the principal groups of users. Define the engine that each group sees, then think about what mechanisms it would take to balance effort and value from each of these perspectives.

So for example, the owner of a system that is supplying data to the data lake is required to maintain the catalog entry for the data coming from their system, and in return, they could get analysis on the quality or consistency of this data that helps them provide a better service to their users.

A data scientist may be restricted in how they work with sensitive data, but in return they get a rich catalog of data to choose from and easy processes to get permission to use the data sets they need.  They may also be given the ability to contribute data and content for the catalog.  The more they contribute, the easier the discovery process becomes.

When the needs of the suppliers are weighed against the needs of the consumers in this way, effort and value come into balance for both groups, creating a sustainable ecosystem.

In addition to designing the governance program around the perspective of the users, it is also necessary to decide who is in control of the data lake, IT or the business, because that affects how the data lake is governed.

When IT is in control, normal IT governance can manage many aspects of the data lake. However, when the business is in control, the mechanisms that operate the data lake, and the classifications that identify the different types of data, need to be abstracted through services and metadata. This creates a view of the data lake that makes sense to the business and can be modified by them as needed. This view is then mapped to the actual data and technology through the metadata in the catalog, and the metadata settings are used by the data lake services to drive the behaviour of the data lake.
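To make the idea of metadata driving behaviour a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the classification names, the handling rules, the CatalogEntry fields and the access_decision function are hypothetical and do not come from any particular catalog product. The point is only that the business edits the classification on a catalog entry, and a data lake service looks up that classification to decide how to behave.

```python
from dataclasses import dataclass

# Hypothetical business-facing classifications and the technical handling
# each one implies. The names and rules are illustrative assumptions.
CLASSIFICATION_RULES = {
    "public": {"mask": False, "approval_required": False},
    "internal": {"mask": False, "approval_required": True},
    "sensitive": {"mask": True, "approval_required": True},
}


@dataclass
class CatalogEntry:
    """A simplified catalog record linking a business name to real data."""
    name: str            # business-friendly name of the data set
    location: str        # where the data actually lives (hypothetical path)
    classification: str  # set and maintained by the business


def access_decision(entry: CatalogEntry, has_approval: bool) -> str:
    """Decide how a data lake service should respond to an access request."""
    rules = CLASSIFICATION_RULES[entry.classification]
    if rules["approval_required"] and not has_approval:
        return "deny"
    return "allow-masked" if rules["mask"] else "allow"


entry = CatalogEntry("customer_contacts", "/lake/raw/crm/contacts", "sensitive")
print(access_decision(entry, has_approval=False))  # deny
print(access_decision(entry, has_approval=True))   # allow-masked

# The business reclassifies the data; the service's behaviour changes
# without any change to the service's own code.
entry.classification = "public"
print(access_decision(entry, has_approval=False))  # allow
```

In this sketch the service never hard-codes a rule about a specific data set. It reads the classification from the catalog entry, so changing the metadata is enough to change the behaviour.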

Once the engine has been defined, the governance program is designed in the normal way:

  • Setting standards for the metadata, formats and best practices for the data lake.
  • Measuring and monitoring adherence to these standards.
  • Taking action as appropriate, such as managing exceptions, answering compliance questions and modifying the program based on feedback.
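
The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the required metadata fields, the shape of a catalog entry and the adherence_report function are all hypothetical, and a real program would measure far more than missing fields. It shows a standard being set, adherence being measured, and a list of exceptions being produced for someone to act on.

```python
from __future__ import annotations

# Step 1: set a standard. Here the (hypothetical) standard is simply that
# every catalog entry must carry these metadata fields.
REQUIRED_FIELDS = ("owner", "description", "classification")


def missing_fields(entry: dict) -> list[str]:
    """Return the required fields that are absent or empty in an entry."""
    return [field for field in REQUIRED_FIELDS if not entry.get(field)]


def adherence_report(catalog: list[dict]) -> dict:
    """Step 2: measure adherence and collect exceptions for follow-up."""
    exceptions = {}
    for entry in catalog:
        gaps = missing_fields(entry)
        if gaps:
            exceptions[entry["name"]] = gaps
    compliant = len(catalog) - len(exceptions)
    rate = compliant / len(catalog) if catalog else 1.0
    return {"adherence_rate": rate, "exceptions": exceptions}


# Illustrative catalog entries, invented for this example.
catalog = [
    {"name": "orders", "owner": "sales-ops",
     "description": "Daily order extract", "classification": "internal"},
    {"name": "clickstream", "owner": "",
     "description": "Web events", "classification": "public"},
]

# Step 3: take action. The exceptions tell you whom to follow up with.
report = adherence_report(catalog)
print(report["adherence_rate"])  # 0.5
print(report["exceptions"])      # {'clickstream': ['owner']}
```

The adherence rate is also a simple piece of feedback: tracked over time, it shows whether the standards are being followed or whether they need to change.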

I would like to end by emphasizing the importance of feedback in achieving balance and value. Governance programs must be dynamic and must demonstrate the value that they deliver. The feedback mechanisms should not be forgotten, as they enable the governance program to stay relevant to the changing needs of the business, which in turn change the nature of the engines we need to govern.

Photo: Ginger Lily, Sao Jorge Island, Azores