If Data is the New Oil, who is protecting your “Data”?
Back in 2017, The Economist published a story titled "The world's most valuable resource is no longer oil, but data". Looking back, it was quite a prophetic statement. Companies have used data for strategic purposes (customer relationship marketing, product/service strategy, etc.) while ensuring regulatory compliance (GDPR, CCPA, PCI). All of this begs the question: who is securing your data? Unlike the good old days, when data sat in a centralized repository where traditional file-level and database security approaches could be employed effectively, today's dispersion and fragmentation of data (across devices, servers, organizational boundaries, private and public clouds, and federated and shared entities) make Data Security and Governance a non-trivial challenge. The challenge is further amplified as the compute model graduates from relatively static virtual machines to highly ephemeral and transient Kubernetes environments. This article discusses Data Provenance, a very powerful technique for managing Data Governance in Kubernetes environments.
Data Provenance middleware lets individuals and applications use a common framework for reporting, storing, and querying records that characterize the history of computational processes and the resulting data artifacts. A Data Provenance platform allows you to answer the following important Data Governance questions:
1. Which process [e.g., app] was used to create this data object [e.g., file]?
2. When the process ran, what other data objects did it write?
3. What data objects did the process read?
4. Could any data have flowed from this data object to that data object?
5. What is the sensitivity of a given data flow or connection between processes?
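The first four questions reduce to lookups and reachability queries over a graph of processes and data objects. The following minimal sketch (a hypothetical in-memory model, not any specific provenance product's API) shows how such queries can be answered once reads and writes are recorded:

```python
from collections import defaultdict

class ProvenanceGraph:
    """Toy provenance store: processes read and write data objects."""

    def __init__(self):
        self.wrote = defaultdict(set)   # process -> data objects it wrote
        self.read = defaultdict(set)    # process -> data objects it read
        self.writer_of = {}             # data object -> process that created it

    def record_read(self, process, obj):
        self.read[process].add(obj)

    def record_write(self, process, obj):
        self.wrote[process].add(obj)
        self.writer_of.setdefault(obj, process)

    def creator(self, obj):
        """Q1: which process was used to create this data object?"""
        return self.writer_of.get(obj)

    def outputs(self, process):
        """Q2: what data objects did the process write?"""
        return self.wrote[process]

    def inputs(self, process):
        """Q3: what data objects did the process read?"""
        return self.read[process]

    def may_flow(self, src, dst):
        """Q4: could data have flowed from src to dst?  A conservative
        over-approximation: data flows from an object to every object
        written by any process that read it, transitively."""
        frontier, seen = {src}, {src}
        while frontier:
            nxt = set()
            for obj in frontier:
                for proc, objs in self.read.items():
                    if obj in objs:
                        nxt |= self.wrote[proc] - seen
            seen |= nxt
            frontier = nxt
        return dst in seen
```

For example, if an ETL job reads `raw.csv` and writes `clean.parquet`, and a reporting app reads `clean.parquet` and writes `report.pdf`, then `may_flow("raw.csv", "report.pdf")` answers `True` while the reverse direction answers `False`.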
SRI International has done seminal work in this area over the past decade. The provenance kernel is agnostic to the domain from which activity is reported. It exposes an interface that allows provenance elements to be reported, initially using the Open Provenance Model (OPM) and more recently the W3C PROV data model.
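To make this concrete, a provenance record in the W3C PROV data model (shown here in its PROV-N notation) might describe a report-generation run as an activity linking input and output entities to the responsible agent. The identifiers below are purely illustrative:

```
document
  prefix ex <http://example.org/>

  entity(ex:dataset)                                         // input data object
  entity(ex:report)                                          // output data object
  activity(ex:generate, 2020-01-01T10:00:00, 2020-01-01T10:05:00)
  agent(ex:reportApp)

  used(ex:generate, ex:dataset, -)                           // the process read it
  wasGeneratedBy(ex:report, ex:generate, -)                  // the process wrote it
  wasAssociatedWith(ex:generate, ex:reportApp, -)
endDocument
```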
An Enterprise-grade Data Governance platform needs to provide the following:
1. Map container processing pipelines to a sensitive data configuration, which then drives the runtime analysis to track where sensitive information flows (across container-local data files, cloud data stores, DB tables, and network data flows)
2. A live Provenance Engine that supports Cloud Data Stores, Big Data Stores, SQL Databases, etc.
3. A container-aware data-flow auditing and alert production system
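Requirements 1 and 3 above can be sketched together: a sensitivity configuration labels source data objects, labels propagate along observed data flows, and an alert fires when labeled data reaches an external sink outside the approved set. All object names, labels, and sink names below are invented for illustration, not a specific product's API:

```python
# Hypothetical sensitivity configuration (requirement 1).
SENSITIVITY = {"customers.db": "L2", "clickstream.log": "L1"}
EXTERNAL_SINKS = {"encrypted-bucket", "analytics-store", "public-bucket"}
APPROVED_SINKS = {
    "L2": {"encrypted-bucket"},
    "L1": {"encrypted-bucket", "analytics-store"},
}

def audit_flows(flows):
    """flows: (source, destination) pairs observed at runtime, in order.
    Returns alert strings for unapproved sensitive flows (requirement 3)."""
    labels = dict(SENSITIVITY)
    alerts = []
    for src, dst in flows:
        label = labels.get(src)
        if label is None:
            continue  # unlabeled data: nothing to track
        labels.setdefault(dst, label)  # propagate the label downstream
        if dst in EXTERNAL_SINKS and dst not in APPROVED_SINKS[label]:
            alerts.append(f"{label} data flowed from {src} to unapproved sink {dst}")
    return alerts
```

Note how the label follows the data through an intermediate container process: a flow from `customers.db` into an ETL pod and from that pod into a public bucket is flagged, even though the pod itself is not in the sensitivity configuration.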
As an example, here is an asset map of Processes (Blue Boxes) and Data Elements (Yellow Ovals), with the associated sensitivity labels (L1, L2).
In this scenario, a Data Provenance Engine should be able to provide a temporal graph of the data elements accessed by pertinent processes, along with the associated sensitivity.
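A minimal sketch of such a temporal view follows; the event tuples, process names, and sensitivity configuration are invented for illustration:

```python
from datetime import datetime

# Hypothetical sensitivity labels, as in the asset map (L1, L2).
SENSITIVITY = {"customers.db": "L2", "clickstream.log": "L1"}

def temporal_view(events, only_label=None):
    """events: (timestamp, process, data_element, operation) tuples.
    Returns the events sorted by time and annotated with the data
    element's sensitivity, optionally filtered to one label."""
    annotated = sorted(
        (ts, proc, obj, op, SENSITIVITY.get(obj, "unlabeled"))
        for ts, proc, obj, op in events
    )
    if only_label is not None:
        annotated = [e for e in annotated if e[4] == only_label]
    return annotated
```

An auditor could then ask, for instance, for the time-ordered sequence of all L2 accesses during an incident window.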
With standards such as PCI, GDPR, and CCPA, it is becoming increasingly clear that consumers expect organizations to be responsible guardians of private and privileged data. Data Provenance engines provide a very powerful framework for tracking data compliance, monitoring policy violations, and ensuring Data Governance. Data can be a strategic asset if managed meticulously and with precision; conversely, it can be a liability if Governance measures that keep pace with modern compute platforms (Kubernetes, Big Data, complex cloud data pipelines) are not in place.
The Economist, May 7, 2017.
Ashish Gehani et al., "Scaling SPADE to Big Provenance," SRI International.
Luc Moreau et al., "The Open Provenance Model Core Specification (v1.1)," Future Generation Computer Systems, 2010.
W3C PROV, http://www.w3.org/TR/prov-overview/