- Start Date: 2020-08-28
- RFC PR: #1841
- Discussion Issue: #1731
- Implementation PR(s):
DataHub supports dataset level lineage. UpStreamLineage is an aspect of dataset that powers the dataset level lineage (a.k.a. coarse-grained lineage). However, there is also a need to understand lineage at the field level (a.k.a. fine-grained lineage).
In this RFC, we discuss the following and seek consensus on the modelling involved.
- Representation of a field in a dataset
- Representation of the field level lineage
- Process of creating dataset fields and their relations to other entities.
- Transformation function involved in the field level lineage is out of scope of the current RFC.
A unique identifier for a field in a dataset will be introduced in the form of DatasetFieldUrn. This URN will be the key for the DatasetField entity. A sample is as below.
There is a lot of interest in field level lineage for datasets. Related issues and RFCs are:
- when does Fine grain lineage feature come out? #1649
- dataset field level Lineage support? #1519
- add lineage workflow schedule support #1615
- Design Review: column level lineage feature #1731
- Alternate proposal for field level lineage #1784
There are alternate proposals for field level lineage (refer to #1731 and #1784). However, I believe there is a need to uniquely identify a dataset field and represent it as a URN, for the following reasons.
- It provides a natural path forward to make the dataset field a first class entity. Producers and consumers of a dataset field can then attach metadata that is not, or cannot be, expressed as part of the schema definition.
- Search and discovery of datasets based on a field and its metadata will be a natural extension of this.
We propose a standard identifier for the dataset field in the below format.
It contains two parts:
- Dataset Urn -> Standard Identifier of the dataset. This URN is already part of DataHub models
- Field Path -> Represents the field of a dataset
In the most typical cases, FieldPath is the field name or column name in the dataset. Where fields are nested, it is the path that leads to the leaf node.
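As a rough illustration of the two parts, the sketch below flattens a nested schema into leaf field paths and pairs each path with a dataset URN. The nested-dict schema shape, the `/` path separator, the helper names `field_paths` and `dataset_field_urn`, and the `urn:li:datasetField:(...)` layout are all assumptions made for this example; they are not the format this RFC settles on.

```python
from typing import Any, Dict, List


def field_paths(schema: Dict[str, Any], prefix: str = "") -> List[str]:
    """Return one path per leaf field, walking nested records depth-first."""
    paths: List[str] = []
    for name, child in schema.items():
        path = f"{prefix}/{name}"
        if isinstance(child, dict):
            # Nested record: keep descending until a leaf is reached.
            paths.extend(field_paths(child, path))
        else:
            paths.append(path)
    return paths


def dataset_field_urn(dataset_urn: str, field_path: str) -> str:
    # Hypothetical layout: the dataset URN followed by the field path.
    return f"urn:li:datasetField:({dataset_urn},{field_path})"


# Toy schema (an assumption): values are type names or nested records.
schema = {"id": "string", "address": {"city": "string", "zip": "string"}}
dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:kafka,demo.orders,PROD)"
urns = [dataset_field_urn(dataset_urn, p) for p in field_paths(schema)]
```

A flat column such as `id` yields a single-segment path, while a nested field such as `address.city` yields one path per leaf.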
To standardize the field paths for different formats, there is a need to build standardized schema normalizers.
If this is the schema of the dataset, then the dataset fields that emanate from this schema are
- An aspect named `DatasetUpstreamLineage` will be introduced to capture fine-grained lineage. Technically, coarse-grained lineage is already captured by fine-grained lineage.
- One can also provide a transformation function describing how the data was transformed from the source fields to the destination field. The exact syntax of such a function is out of scope of this document.
- BlackBox UDF means the destination field is derived from the source fields, but the transformation function is not known.
- Identity UDF means the destination field is a pure copy of the source field and the transformation is the identity.
- Upstream sources in the field level relations are dataset field URNs, and the model is extensible to support other types in the future. For example, a REST API could be an upstream that produces a field in a dataset.
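The shape described in the list above might look roughly like the following. This is a minimal sketch, not the proposed model: the class `FieldMapping`, its attribute names, the string constants for the two UDF kinds, and the rule that an Identity mapping has exactly one source are assumptions introduced here for illustration, as is the field URN layout used in the sample values.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical markers for the two transformation kinds named above.
BLACK_BOX = "BlackBox"
IDENTITY = "Identity"


@dataclass
class FieldMapping:
    source_fields: List[str]  # upstream dataset field URNs
    destination_field: str    # downstream dataset field URN
    transformation: str = BLACK_BOX

    def __post_init__(self) -> None:
        # Assumption: a pure copy has exactly one upstream field.
        if self.transformation == IDENTITY and len(self.source_fields) != 1:
            raise ValueError("Identity mapping needs exactly one source field")


@dataclass
class DatasetUpstreamLineage:
    field_mappings: List[FieldMapping]


src = "urn:li:datasetField:(urn:li:dataset:(urn:li:dataPlatform:kafka,demo.orders,PROD),/id)"
dst = "urn:li:datasetField:(urn:li:dataset:(urn:li:dataPlatform:hdfs,demo.orders,PROD),/id)"
lineage = DatasetUpstreamLineage([FieldMapping([src], dst, IDENTITY)])
```

Keeping the sources as plain URN strings leaves room for non-field upstreams later, which is the extensibility the last bullet asks for.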
As part of the POC we did, we used the workflow below. Essentially, DatasetFieldUrn is introduced, paving the way for the dataset field to become a first class entity.
- GraphBuilder, on receiving an MAE for the `SchemaMetadata` aspect, will do the following:
- Create the Dataset entity in graph db.
- Use schema normalizers to extract field paths. This schema, and hence its fields, is the source of truth for dataset fields.
- Create dataset field entities in graph db.
- A new relationship builder, `Data Derived From Relation Builder`, will create
- GraphBuilder, on receiving an MAE for the `DatasetUpstreamLineage` aspect, will create the field level lineages (relationship
Two new relationships can be introduced to represent the relationships in the graph.
- `HasField` relationship represents the relation from a dataset to a dataset field.
- `DataDerivedFrom` relationship represents that the data in a destination dataset field is derived from source dataset fields.
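The workflow and the two relationships can be pictured with a small in-memory graph. This is only a sketch under stated assumptions: `InMemoryGraph` and its two handler methods are hypothetical stand-ins for GraphBuilder and the graph db, the field URN layout is assumed, and edges are stored as plain tuples.

```python
from typing import List, Set, Tuple

# (source node, relationship name, destination node)
Edge = Tuple[str, str, str]


class InMemoryGraph:
    """Toy stand-in for the graph db; not DataHub's GraphBuilder."""

    def __init__(self) -> None:
        self.nodes: Set[str] = set()
        self.edges: Set[Edge] = set()

    def on_schema_metadata(self, dataset_urn: str, field_paths: List[str]) -> None:
        # Create the dataset node, one node per field, and HasField edges
        # pointing from the dataset to each of its fields.
        self.nodes.add(dataset_urn)
        for path in field_paths:
            field_urn = f"urn:li:datasetField:({dataset_urn},{path})"  # assumed layout
            self.nodes.add(field_urn)
            self.edges.add((dataset_urn, "HasField", field_urn))

    def on_dataset_upstream_lineage(self, destination_field: str, source_fields: List[str]) -> None:
        # DataDerivedFrom points from the destination field to each source field.
        for source in source_fields:
            self.edges.add((destination_field, "DataDerivedFrom", source))


graph = InMemoryGraph()
raw = "urn:li:dataset:(urn:li:dataPlatform:kafka,demo.orders,PROD)"
clean = "urn:li:dataset:(urn:li:dataPlatform:hdfs,demo.orders,PROD)"
graph.on_schema_metadata(raw, ["/id", "/amount"])
graph.on_schema_metadata(clean, ["/id"])
graph.on_dataset_upstream_lineage(
    f"urn:li:datasetField:({clean},/id)",
    [f"urn:li:datasetField:({raw},/id)"],
)
```

Because nodes and edges are sets, replaying the same MAE leaves the graph unchanged.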
Once we decide to make the dataset field a first class entity, producers can start emitting MCEs for dataset fields. The following describes how the end to end flow for the dataset field entity will look in the larger picture.
- Schema Normalizers as a utility will be developed.
- `DatasetField` entity will be introduced with aspect
- Producers can use Schema Normalizers and emit `DatasetField` MCEs for every field in the schema.
- Producers will still emit `SchemaMetadata` as an aspect of the `Dataset` entity. This aspect serves as the metadata for the relationship
- An aspect named `DatasetUpstreamLineage` will be introduced to capture field level lineage. Technically, coarse-grained lineage is already captured by fine-grained lineage.
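The remark that coarse-grained lineage is already contained in fine-grained lineage can be illustrated by collapsing field level sources to their datasets. The sketch assumes a field URN layout of `urn:li:datasetField:(<dataset urn>,<field path>)` and a field path without commas; `dataset_of` and `upstream_datasets` are hypothetical helper names, and the input dict is a simplified stand-in for the aspect.

```python
from typing import Dict, List, Set

_PREFIX = "urn:li:datasetField:("  # assumed layout, see the note above


def dataset_of(field_urn: str) -> str:
    """Extract the dataset URN from a field URN of the assumed layout."""
    if not (field_urn.startswith(_PREFIX) and field_urn.endswith(")")):
        raise ValueError(f"not a dataset field URN: {field_urn}")
    body = field_urn[len(_PREFIX):-1]
    # The dataset URN itself contains commas, so split on the last one.
    dataset_urn, _, _field_path = body.rpartition(",")
    return dataset_urn


def upstream_datasets(field_lineage: Dict[str, List[str]]) -> Set[str]:
    """Map destination field -> source fields down to the set of source datasets."""
    return {dataset_of(src) for sources in field_lineage.values() for src in sources}


orders = "urn:li:dataset:(urn:li:dataPlatform:kafka,demo.orders,PROD)"
users = "urn:li:dataset:(urn:li:dataPlatform:kafka,demo.users,PROD)"
report = "urn:li:dataset:(urn:li:dataPlatform:hdfs,demo.report,PROD)"
field_lineage = {
    f"urn:li:datasetField:({report},/order_id)": [f"urn:li:datasetField:({orders},/id)"],
    f"urn:li:datasetField:({report},/label)": [
        f"urn:li:datasetField:({orders},/status)",
        f"urn:li:datasetField:({users},/name)",
    ],
}
coarse = upstream_datasets(field_lineage)
```

Here three field level sources collapse to two upstream datasets, which is the dataset level view.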
We are introducing the capability of field level lineage in DataHub. As part of this, below are the salient features one should know:
- `Schema Normalizers` will be defined to standardize the field paths in a schema. Once this is done, field level lineage will be a relation between the standardized field paths of the source and the destination.
- `Dataset Field URN` will be introduced and `DatasetField` will be a first class entity in DataHub.
- `HasField` relations will be populated in graph db between
- `DataDerivedFrom` relations will be populated at the field level.
- `SchemaMetadata` will still serve the schema information of a dataset. However, it is used as the source of truth (SOT) for the presence of dataset field entities.
This is an extension to the current support of coarse grained lineage by DataHub.
The Relationships tab in the DataHub UI can also be enhanced to show field level lineage.
Haven't thought about any potential drawbacks.
In the alternate design, we wouldn't need to define a dataset field URN. There is an extensive RFC and discussion on this at #1784.
This introduces a new aspect, `DatasetUpstreamLineage`, which is capable of defining lineage at the field level. Hence, existing customers shouldn't be impacted by this change.
- The syntax of transformation function representing how the source fields got transformed to destination fields is not thought through.
- How to automatically get the field level lineage by parsing the higher level languages or query plans of different execution environments.
For the above two, we need to have more detailed RFCs.