
Dataset

The dataset entity is one of the most important entities in the metadata model. Datasets represent collections of data, typically materialized as Tables or Views in a database (e.g. BigQuery, Snowflake, Redshift), Streams in a stream-processing environment (e.g. Kafka, Pulsar), or bundles of data found as Files or Folders in data lake systems (e.g. S3, ADLS).

Identity

Datasets are identified by three pieces of information:

  • The platform that they belong to: this is the specific data technology that hosts this dataset. Examples are hive, bigquery, redshift etc. See dataplatform for more details.
  • The name of the dataset in the specific platform. Each platform has its own way of naming assets within its system. Usually, names are composed by combining the structural elements of the name, separated by a period (.); e.g. relational datasets are usually named <db>.<schema>.<table>, except on platforms like MySQL, which do not have the concept of a schema, so MySQL datasets are named <db>.<table>. In cases where the specific platform can have multiple instances (e.g. multiple MySQL databases hosting different data assets), names can also include an instance id, making the general pattern for a name <platform_instance>.<db>.<schema>.<table>.
  • The environment or fabric to which the dataset belongs: this is an additional qualifier on the identifier that makes it possible to disambiguate datasets living in Production environments from datasets living in Non-production environments, such as Staging, QA, etc. The full list of supported environments / fabrics is available in FabricType.pdl.

An example of a dataset identifier is urn:li:dataset:(urn:li:dataPlatform:redshift,userdb.public.customer_table,PROD).
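As a quick illustration of how these three pieces of information combine into an identifier, the sketch below uses the urn builder helpers from the Python SDK that the examples later on this page rely on; the platform, name, and instance values are purely illustrative.

Python SDK (sketch): Construct a dataset urn
from datahub.emitter.mce_builder import (
    make_dataset_urn,
    make_dataset_urn_with_platform_instance,
)

# platform + name + env
# -> urn:li:dataset:(urn:li:dataPlatform:redshift,userdb.public.customer_table,PROD)
dataset_urn = make_dataset_urn(
    platform="redshift", name="userdb.public.customer_table", env="PROD"
)

# when the platform has multiple instances, the instance id is folded into the name
dataset_urn_with_instance = make_dataset_urn_with_platform_instance(
    platform="mysql",
    name="userdb.customer_table",
    platform_instance="cluster_1",
    env="PROD",
)

print(dataset_urn)
print(dataset_urn_with_instance)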

Important Capabilities

Schemas

Datasets support flat and nested schemas. Metadata about schemas is contained in the schemaMetadata aspect. Schemas are represented as an array of fields, each identified by a specific field path.

Field Paths explained

Fields that are either top-level or expressible unambiguously using a .-based notation can be identified via a v1 path name, whereas fields that are part of a union need further disambiguation using [type=X] markers. Consider the simple nested schema described below:

{
  "type": "record",
  "name": "Customer",
  "fields": [
    {
      "type": "record",
      "name": "address",
      "fields": [
        { "name": "zipcode", "type": "string" },
        { "name": "street", "type": "string" }
      ]
    }
  ]
}
  • v1 field path: address.zipcode
  • v2 field path: [version=2.0].[type=struct].address.[type=string].zipcode. More examples and a formal specification of a v2 fieldPath can be found here.

Understanding field paths is important, because they are the identifiers through which tags, terms, and documentation on fields are expressed. Besides the type and name of each field, schemas also contain descriptions attached to individual fields, as well as information about primary and foreign keys.
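As a small illustration, the helper below (the same one used in the column-level examples later on this page) recovers the simple v1 path from a v2 fieldPath; the sample path is the zipcode field from the schema above.

Python SDK (sketch): Convert a v2 fieldPath to its simple v1 form
from datahub.utilities.urns.field_paths import get_simple_field_path_from_v2_field_path

v2_field_path = "[version=2.0].[type=struct].address.[type=string].zipcode"
v1_field_path = get_simple_field_path_from_v2_field_path(v2_field_path)
print(v1_field_path)  # address.zipcode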

The following code snippet shows you how to add a Schema containing 3 fields to a dataset.

Python SDK: Add a schema to a dataset
# Inlined from /metadata-ingestion/examples/library/dataset_schema.py
# Imports for urn construction utility methods
from datahub.emitter.mce_builder import make_data_platform_urn, make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter

# Imports for metadata model classes
from datahub.metadata.schema_classes import (
    AuditStampClass,
    DateTypeClass,
    OtherSchemaClass,
    SchemaFieldClass,
    SchemaFieldDataTypeClass,
    SchemaMetadataClass,
    StringTypeClass,
)

event: MetadataChangeProposalWrapper = MetadataChangeProposalWrapper(
    entityUrn=make_dataset_urn(platform="hive", name="realestate_db.sales", env="PROD"),
    aspect=SchemaMetadataClass(
        schemaName="customer",  # not used
        platform=make_data_platform_urn("hive"),  # important <- platform must be an urn
        version=0,  # when the source system has a notion of versioning of schemas, insert this in, otherwise leave as 0
        hash="",  # when the source system has a notion of unique schemas identified via hash, include a hash, else leave it as empty string
        platformSchema=OtherSchemaClass(rawSchema="__insert raw schema here__"),
        lastModified=AuditStampClass(
            time=1640692800000, actor="urn:li:corpuser:ingestion"
        ),
        fields=[
            SchemaFieldClass(
                fieldPath="address.zipcode",
                type=SchemaFieldDataTypeClass(type=StringTypeClass()),
                nativeDataType="VARCHAR(50)",  # use this to provide the type of the field in the source system's vernacular
                description="This is the zipcode of the address. Specified using extended form and limited to addresses in the United States",
                lastModified=AuditStampClass(
                    time=1640692800000, actor="urn:li:corpuser:ingestion"
                ),
            ),
            SchemaFieldClass(
                fieldPath="address.street",
                type=SchemaFieldDataTypeClass(type=StringTypeClass()),
                nativeDataType="VARCHAR(100)",
                description="Street corresponding to the address",
                lastModified=AuditStampClass(
                    time=1640692800000, actor="urn:li:corpuser:ingestion"
                ),
            ),
            SchemaFieldClass(
                fieldPath="last_sold_date",
                type=SchemaFieldDataTypeClass(type=DateTypeClass()),
                nativeDataType="Date",
                description="Date of the last sale date for this property",
                created=AuditStampClass(
                    time=1640692800000, actor="urn:li:corpuser:ingestion"
                ),
                lastModified=AuditStampClass(
                    time=1640692800000, actor="urn:li:corpuser:ingestion"
                ),
            ),
        ],
    ),
)

# Create rest emitter
rest_emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
rest_emitter.emit(event)

Tags and Glossary Terms

Datasets can have Tags or Terms attached to them. Read this blog to understand the difference between tags and terms, so you know when to use which.

Adding Tags or Glossary Terms at the top-level to a dataset

At the top-level, tags are added to datasets using the globalTags aspect, while terms are added using the glossaryTerms aspect.

Here is an example of how to add a tag to a dataset. Note that this involves reading the currently set tags on the dataset and then adding a new one if needed.

Python SDK: Add a tag to a dataset at the top-level
# Inlined from /metadata-ingestion/examples/library/dataset_add_tag.py
import logging
from typing import Optional

from datahub.emitter.mce_builder import make_dataset_urn, make_tag_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper

# read-modify-write requires access to the DataHubGraph (RestEmitter is not enough)
from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph

# Imports for metadata model classes
from datahub.metadata.schema_classes import GlobalTagsClass, TagAssociationClass

log = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)


# First we get the current tags
gms_endpoint = "http://localhost:8080"
graph = DataHubGraph(DatahubClientConfig(server=gms_endpoint))

dataset_urn = make_dataset_urn(platform="hive", name="realestate_db.sales", env="PROD")

current_tags: Optional[GlobalTagsClass] = graph.get_aspect(
    entity_urn=dataset_urn,
    aspect_type=GlobalTagsClass,
)

tag_to_add = make_tag_urn("purchase")
tag_association_to_add = TagAssociationClass(tag=tag_to_add)

need_write = False
if current_tags:
    if tag_to_add not in [x.tag for x in current_tags.tags]:
        # tags exist, but this tag is not present in the current tags
        current_tags.tags.append(tag_association_to_add)
        need_write = True
else:
    # create a brand new tags aspect
    current_tags = GlobalTagsClass(tags=[tag_association_to_add])
    need_write = True

if need_write:
    event: MetadataChangeProposalWrapper = MetadataChangeProposalWrapper(
        entityUrn=dataset_urn,
        aspect=current_tags,
    )
    graph.emit(event)
    log.info(f"Tag {tag_to_add} added to dataset {dataset_urn}")
else:
    log.info(f"Tag {tag_to_add} already exists, omitting write")

Here is an example of adding a term to a dataset. Note that this involves reading the currently set terms on the dataset and then adding a new one if needed.

Python SDK: Add a term to a dataset at the top-level
# Inlined from /metadata-ingestion/examples/library/dataset_add_term.py
import logging
from typing import Optional

from datahub.emitter.mce_builder import make_dataset_urn, make_term_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper

# read-modify-write requires access to the DataHubGraph (RestEmitter is not enough)
from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph

# Imports for metadata model classes
from datahub.metadata.schema_classes import (
    AuditStampClass,
    GlossaryTermAssociationClass,
    GlossaryTermsClass,
)

log = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)


# First we get the current terms
gms_endpoint = "http://localhost:8080"
graph = DataHubGraph(DatahubClientConfig(server=gms_endpoint))

dataset_urn = make_dataset_urn(platform="hive", name="realestate_db.sales", env="PROD")

current_terms: Optional[GlossaryTermsClass] = graph.get_aspect(
    entity_urn=dataset_urn, aspect_type=GlossaryTermsClass
)

term_to_add = make_term_urn("Classification.HighlyConfidential")
term_association_to_add = GlossaryTermAssociationClass(urn=term_to_add)
# an audit stamp that basically says we have no idea when these terms were added to this dataset
# change the time value to (time.time() * 1000) if you want to specify the current time of running this code as the time
unknown_audit_stamp = AuditStampClass(time=0, actor="urn:li:corpuser:ingestion")

need_write = False
if current_terms:
    if term_to_add not in [x.urn for x in current_terms.terms]:
        # terms exist, but this term is not present in the current terms
        current_terms.terms.append(term_association_to_add)
        need_write = True
else:
    # create a brand new terms aspect
    current_terms = GlossaryTermsClass(
        terms=[term_association_to_add],
        auditStamp=unknown_audit_stamp,
    )
    need_write = True

if need_write:
    event: MetadataChangeProposalWrapper = MetadataChangeProposalWrapper(
        entityUrn=dataset_urn,
        aspect=current_terms,
    )
    graph.emit(event)
else:
    log.info(f"Term {term_to_add} already exists, omitting write")

Adding Tags or Glossary Terms to columns / fields of a dataset

Tags and Terms can also be attached to an individual column (field) of a dataset. These attachments are made via the schemaMetadata aspect by ingestion connectors / transformers, and via the editableSchemaMetadata aspect by the UI. This separation isolates the metadata replicated from the source system from the edits made in the UI.

Here is an example of how you can add a tag to a field in a dataset using the low-level Python SDK.

Python SDK: Add a tag to a column (field) of a dataset
# Inlined from /metadata-ingestion/examples/library/dataset_add_column_tag.py
import logging
import time

from datahub.emitter.mce_builder import make_dataset_urn, make_tag_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper

# read-modify-write requires access to the DataHubGraph (RestEmitter is not enough)
from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph

# Imports for metadata model classes
from datahub.metadata.schema_classes import (
    AuditStampClass,
    EditableSchemaFieldInfoClass,
    EditableSchemaMetadataClass,
    GlobalTagsClass,
    TagAssociationClass,
)
from datahub.utilities.urns.field_paths import get_simple_field_path_from_v2_field_path

log = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)


# Inputs -> the column, dataset and the tag to set
column = "user_name"
dataset_urn = make_dataset_urn(platform="hive", name="fct_users_created", env="PROD")
tag_to_add = make_tag_urn("deprecated")


# First we get the current editable schema metadata
gms_endpoint = "http://localhost:8080"
graph = DataHubGraph(DatahubClientConfig(server=gms_endpoint))

current_editable_schema_metadata = graph.get_aspect(
    entity_urn=dataset_urn,
    aspect_type=EditableSchemaMetadataClass,
)


# Some pre-built objects to help all the conditional pathways
tag_association_to_add = TagAssociationClass(tag=tag_to_add)
tags_aspect_to_set = GlobalTagsClass(tags=[tag_association_to_add])
field_info_to_set = EditableSchemaFieldInfoClass(
    fieldPath=column, globalTags=tags_aspect_to_set
)


need_write = False
field_match = False
if current_editable_schema_metadata:
    for fieldInfo in current_editable_schema_metadata.editableSchemaFieldInfo:
        if get_simple_field_path_from_v2_field_path(fieldInfo.fieldPath) == column:
            # we have some editable schema metadata for this field
            field_match = True
            if fieldInfo.globalTags:
                if tag_to_add not in [x.tag for x in fieldInfo.globalTags.tags]:
                    # this tag is not present
                    fieldInfo.globalTags.tags.append(tag_association_to_add)
                    need_write = True
            else:
                fieldInfo.globalTags = tags_aspect_to_set
                need_write = True

    if not field_match:
        # this field isn't present in the editable schema metadata aspect, add it
        field_info = field_info_to_set
        current_editable_schema_metadata.editableSchemaFieldInfo.append(field_info)
        need_write = True

else:
    # create a brand new editable schema metadata aspect
    now = int(time.time() * 1000)  # milliseconds since epoch
    current_timestamp = AuditStampClass(time=now, actor="urn:li:corpuser:ingestion")
    current_editable_schema_metadata = EditableSchemaMetadataClass(
        editableSchemaFieldInfo=[field_info_to_set],
        created=current_timestamp,
    )
    need_write = True

if need_write:
    event: MetadataChangeProposalWrapper = MetadataChangeProposalWrapper(
        entityUrn=dataset_urn,
        aspect=current_editable_schema_metadata,
    )
    graph.emit(event)
    log.info(f"Tag {tag_to_add} added to column {column} of dataset {dataset_urn}")
else:
    log.info(f"Tag {tag_to_add} already attached to column {column}, omitting write")

Similarly, here is an example of how you would add a term to a field in a dataset using the low-level Python SDK.

Python SDK: Add a term to a column (field) of a dataset
# Inlined from /metadata-ingestion/examples/library/dataset_add_column_term.py
import logging
import time

from datahub.emitter.mce_builder import make_dataset_urn, make_term_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper

# read-modify-write requires access to the DataHubGraph (RestEmitter is not enough)
from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph

# Imports for metadata model classes
from datahub.metadata.schema_classes import (
    AuditStampClass,
    EditableSchemaFieldInfoClass,
    EditableSchemaMetadataClass,
    GlossaryTermAssociationClass,
    GlossaryTermsClass,
)
from datahub.utilities.urns.field_paths import get_simple_field_path_from_v2_field_path

log = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)


# Inputs -> the column, dataset and the term to set
column = "address.zipcode"
dataset_urn = make_dataset_urn(platform="hive", name="realestate_db.sales", env="PROD")
term_to_add = make_term_urn("Classification.Location")


# First we get the current editable schema metadata
gms_endpoint = "http://localhost:8080"
graph = DataHubGraph(DatahubClientConfig(server=gms_endpoint))

current_editable_schema_metadata = graph.get_aspect(
    entity_urn=dataset_urn, aspect_type=EditableSchemaMetadataClass
)


# Some pre-built objects to help all the conditional pathways
now = int(time.time() * 1000)  # milliseconds since epoch
current_timestamp = AuditStampClass(time=now, actor="urn:li:corpuser:ingestion")

term_association_to_add = GlossaryTermAssociationClass(urn=term_to_add)
term_aspect_to_set = GlossaryTermsClass(
    terms=[term_association_to_add], auditStamp=current_timestamp
)
field_info_to_set = EditableSchemaFieldInfoClass(
    fieldPath=column, glossaryTerms=term_aspect_to_set
)

need_write = False
field_match = False
if current_editable_schema_metadata:
    for fieldInfo in current_editable_schema_metadata.editableSchemaFieldInfo:
        if get_simple_field_path_from_v2_field_path(fieldInfo.fieldPath) == column:
            # we have some editable schema metadata for this field
            field_match = True
            if fieldInfo.glossaryTerms:
                if term_to_add not in [x.urn for x in fieldInfo.glossaryTerms.terms]:
                    # this term is not present
                    fieldInfo.glossaryTerms.terms.append(term_association_to_add)
                    need_write = True
            else:
                fieldInfo.glossaryTerms = term_aspect_to_set
                need_write = True

    if not field_match:
        # this field isn't present in the editable schema metadata aspect, add it
        field_info = field_info_to_set
        current_editable_schema_metadata.editableSchemaFieldInfo.append(field_info)
        need_write = True

else:
    # create a brand new editable schema metadata aspect
    current_editable_schema_metadata = EditableSchemaMetadataClass(
        editableSchemaFieldInfo=[field_info_to_set],
        created=current_timestamp,
    )
    need_write = True

if need_write:
    event: MetadataChangeProposalWrapper = MetadataChangeProposalWrapper(
        entityUrn=dataset_urn,
        aspect=current_editable_schema_metadata,
    )
    graph.emit(event)
    log.info(f"Term {term_to_add} added to column {column} of dataset {dataset_urn}")
else:
    log.info(f"Term {term_to_add} already attached to column {column}, omitting write")

Ownership

Ownership is associated with a dataset using the ownership aspect. Owners can be of a few different types, e.g. DATAOWNER, PRODUCER, DEVELOPER, CONSUMER. See OwnershipType.pdl for the full list of ownership types and their meanings. Ownership can be inherited from source systems, or additionally added in DataHub using the UI. Ingestion connectors will automatically set owners when the source system supports it.

Adding Owners

The following script shows you how to add an owner to a dataset using the low-level Python SDK.

Python SDK: Add an owner to a dataset
# Inlined from /metadata-ingestion/examples/library/dataset_add_owner.py
import logging
from typing import Optional

from datahub.emitter.mce_builder import make_dataset_urn, make_user_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper

# read-modify-write requires access to the DataHubGraph (RestEmitter is not enough)
from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph

# Imports for metadata model classes
from datahub.metadata.schema_classes import (
    OwnerClass,
    OwnershipClass,
    OwnershipTypeClass,
)

log = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)


# Inputs -> owner, ownership_type, dataset
owner_to_add = make_user_urn("jdoe")
ownership_type = OwnershipTypeClass.TECHNICAL_OWNER
dataset_urn = make_dataset_urn(platform="hive", name="realestate_db.sales", env="PROD")

# Some objects to help with conditional pathways later
owner_class_to_add = OwnerClass(owner=owner_to_add, type=ownership_type)
ownership_to_add = OwnershipClass(owners=[owner_class_to_add])


# First we get the current owners
gms_endpoint = "http://localhost:8080"
graph = DataHubGraph(DatahubClientConfig(server=gms_endpoint))

current_owners: Optional[OwnershipClass] = graph.get_aspect(
    entity_urn=dataset_urn, aspect_type=OwnershipClass
)


need_write = False
if current_owners:
    if (owner_to_add, ownership_type) not in [
        (x.owner, x.type) for x in current_owners.owners
    ]:
        # owners exist, but this owner is not present in the current owners
        current_owners.owners.append(owner_class_to_add)
        need_write = True
else:
    # create a brand new ownership aspect
    current_owners = ownership_to_add
    need_write = True

if need_write:
    event: MetadataChangeProposalWrapper = MetadataChangeProposalWrapper(
        entityUrn=dataset_urn,
        aspect=current_owners,
    )
    graph.emit(event)
    log.info(
        f"Owner {owner_to_add}, type {ownership_type} added to dataset {dataset_urn}"
    )
else:
    log.info(f"Owner {owner_to_add} already exists, omitting write")

Fine-grained lineage

Fine-grained lineage at the field level can be associated with a dataset in two ways: either directly attached to the upstreamLineage aspect of a dataset, or captured as part of the dataJobInputOutput aspect of a dataJob.

Python SDK: Add fine-grained lineage to a dataset
# Inlined from /metadata-ingestion/examples/library/lineage_emitter_dataset_finegrained.py
import datahub.emitter.mce_builder as builder
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.com.linkedin.pegasus2avro.dataset import (
    DatasetLineageType,
    FineGrainedLineage,
    FineGrainedLineageDownstreamType,
    FineGrainedLineageUpstreamType,
    Upstream,
    UpstreamLineage,
)


def datasetUrn(tbl):
    return builder.make_dataset_urn("postgres", tbl)


def fldUrn(tbl, fld):
    return builder.make_schema_field_urn(datasetUrn(tbl), fld)


# Lineage of fields in a dataset
# c1 <-- unknownFunc(bar2.c1, bar4.c1)
# c2 <-- myfunc(bar3.c2)
# {c3,c4} <-- unknownFunc(bar2.c2, bar2.c3, bar3.c1)
# c5 <-- unknownFunc(bar3)
# {c6,c7} <-- unknownFunc(bar4)

# note that the semantic of the "transformOperation" value is contextual.
# In above example, it is regarded as some kind of UDF; but it could also be an expression etc.

fineGrainedLineages = [
    FineGrainedLineage(
        upstreamType=FineGrainedLineageUpstreamType.FIELD_SET,
        upstreams=[fldUrn("bar2", "c1"), fldUrn("bar4", "c1")],
        downstreamType=FineGrainedLineageDownstreamType.FIELD,
        downstreams=[fldUrn("bar", "c1")],
    ),
    FineGrainedLineage(
        upstreamType=FineGrainedLineageUpstreamType.FIELD_SET,
        upstreams=[fldUrn("bar3", "c2")],
        downstreamType=FineGrainedLineageDownstreamType.FIELD,
        downstreams=[fldUrn("bar", "c2")],
        confidenceScore=0.8,
        transformOperation="myfunc",
    ),
    FineGrainedLineage(
        upstreamType=FineGrainedLineageUpstreamType.FIELD_SET,
        upstreams=[fldUrn("bar2", "c2"), fldUrn("bar2", "c3"), fldUrn("bar3", "c1")],
        downstreamType=FineGrainedLineageDownstreamType.FIELD_SET,
        downstreams=[fldUrn("bar", "c3"), fldUrn("bar", "c4")],
        confidenceScore=0.7,
    ),
    FineGrainedLineage(
        upstreamType=FineGrainedLineageUpstreamType.DATASET,
        upstreams=[datasetUrn("bar3")],
        downstreamType=FineGrainedLineageDownstreamType.FIELD,
        downstreams=[fldUrn("bar", "c5")],
    ),
    FineGrainedLineage(
        upstreamType=FineGrainedLineageUpstreamType.DATASET,
        upstreams=[datasetUrn("bar4")],
        downstreamType=FineGrainedLineageDownstreamType.FIELD_SET,
        downstreams=[fldUrn("bar", "c6"), fldUrn("bar", "c7")],
    ),
]


# this is just to check if any conflicts with existing Upstream, particularly the DownstreamOf relationship
upstream = Upstream(dataset=datasetUrn("bar2"), type=DatasetLineageType.TRANSFORMED)

fieldLineages = UpstreamLineage(
    upstreams=[upstream], fineGrainedLineages=fineGrainedLineages
)

lineageMcp = MetadataChangeProposalWrapper(
    entityUrn=datasetUrn("bar"),
    aspect=fieldLineages,
)

# Create an emitter to the GMS REST API.
emitter = DatahubRestEmitter("http://localhost:8080")

# Emit metadata!
emitter.emit_mcp(lineageMcp)

Python SDK: Add fine-grained lineage to a datajob
# Inlined from /metadata-ingestion/examples/library/lineage_emitter_datajob_finegrained.py
import datahub.emitter.mce_builder as builder
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.com.linkedin.pegasus2avro.dataset import (
    FineGrainedLineage,
    FineGrainedLineageDownstreamType,
    FineGrainedLineageUpstreamType,
)
from datahub.metadata.schema_classes import DataJobInputOutputClass


def datasetUrn(tbl):
    return builder.make_dataset_urn("postgres", tbl)


def fldUrn(tbl, fld):
    return builder.make_schema_field_urn(datasetUrn(tbl), fld)


# Lineage of fields output by a job
# bar.c1 <-- unknownFunc(bar2.c1, bar4.c1)
# bar.c2 <-- myfunc(bar3.c2)
# {bar.c3,bar.c4} <-- unknownFunc(bar2.c2, bar2.c3, bar3.c1)
# bar.c5 <-- unknownFunc(bar3)
# {bar.c6,bar.c7} <-- unknownFunc(bar4)
# bar2.c9 has no upstream i.e. its values are somehow created independently within this job.

# Note that the semantic of the "transformOperation" value is contextual.
# In above example, it is regarded as some kind of UDF; but it could also be an expression etc.

fineGrainedLineages = [
    FineGrainedLineage(
        upstreamType=FineGrainedLineageUpstreamType.FIELD_SET,
        upstreams=[fldUrn("bar2", "c1"), fldUrn("bar4", "c1")],
        downstreamType=FineGrainedLineageDownstreamType.FIELD,
        downstreams=[fldUrn("bar", "c1")],
    ),
    FineGrainedLineage(
        upstreamType=FineGrainedLineageUpstreamType.FIELD_SET,
        upstreams=[fldUrn("bar3", "c2")],
        downstreamType=FineGrainedLineageDownstreamType.FIELD,
        downstreams=[fldUrn("bar", "c2")],
        confidenceScore=0.8,
        transformOperation="myfunc",
    ),
    FineGrainedLineage(
        upstreamType=FineGrainedLineageUpstreamType.FIELD_SET,
        upstreams=[fldUrn("bar2", "c2"), fldUrn("bar2", "c3"), fldUrn("bar3", "c1")],
        downstreamType=FineGrainedLineageDownstreamType.FIELD_SET,
        downstreams=[fldUrn("bar", "c3"), fldUrn("bar", "c4")],
        confidenceScore=0.7,
    ),
    FineGrainedLineage(
        upstreamType=FineGrainedLineageUpstreamType.DATASET,
        upstreams=[datasetUrn("bar3")],
        downstreamType=FineGrainedLineageDownstreamType.FIELD,
        downstreams=[fldUrn("bar", "c5")],
    ),
    FineGrainedLineage(
        upstreamType=FineGrainedLineageUpstreamType.DATASET,
        upstreams=[datasetUrn("bar4")],
        downstreamType=FineGrainedLineageDownstreamType.FIELD_SET,
        downstreams=[fldUrn("bar", "c6"), fldUrn("bar", "c7")],
    ),
    FineGrainedLineage(
        upstreamType=FineGrainedLineageUpstreamType.NONE,
        upstreams=[],
        downstreamType=FineGrainedLineageDownstreamType.FIELD,
        downstreams=[fldUrn("bar2", "c9")],
    ),
]

# The lineage of output col bar.c9 is unknown. So there is no lineage for it above.
# Note that bar2 is an input as well as an output dataset, but some fields are inputs while other fields are outputs.

dataJobInputOutput = DataJobInputOutputClass(
    inputDatasets=[datasetUrn("bar2"), datasetUrn("bar3"), datasetUrn("bar4")],
    outputDatasets=[datasetUrn("bar"), datasetUrn("bar2")],
    inputDatajobs=None,
    inputDatasetFields=[
        fldUrn("bar2", "c1"),
        fldUrn("bar2", "c2"),
        fldUrn("bar2", "c3"),
        fldUrn("bar3", "c1"),
        fldUrn("bar3", "c2"),
        fldUrn("bar4", "c1"),
    ],
    outputDatasetFields=[
        fldUrn("bar", "c1"),
        fldUrn("bar", "c2"),
        fldUrn("bar", "c3"),
        fldUrn("bar", "c4"),
        fldUrn("bar", "c5"),
        fldUrn("bar", "c6"),
        fldUrn("bar", "c7"),
        fldUrn("bar", "c9"),
        fldUrn("bar2", "c9"),
    ],
    fineGrainedLineages=fineGrainedLineages,
)

dataJobLineageMcp = MetadataChangeProposalWrapper(
    entityUrn=builder.make_data_job_urn("spark", "Flow1", "Task1"),
    aspect=dataJobInputOutput,
)

# Create an emitter to the GMS REST API.
emitter = DatahubRestEmitter("http://localhost:8080")

# Emit metadata!
emitter.emit_mcp(dataJobLineageMcp)

Querying lineage information

The standard GET APIs to retrieve entities can be used to fetch the dataset/datajob created by the above example. The response will include the fine-grained lineage information as well.

Fetch entity snapshot, including fine-grained lineages
curl 'http://localhost:8080/entities/urn%3Ali%3Adataset%3A(urn%3Ali%3AdataPlatform%3Apostgres,bar,PROD)'
curl 'http://localhost:8080/entities/urn%3Ali%3AdataJob%3A(urn%3Ali%3AdataFlow%3A(spark,Flow1,prod),Task1)'

The below queries can be used to find the upstream/downstream datasets/fields of a dataset/datajob.

Find upstream datasets and fields of a dataset
curl 'http://localhost:8080/relationships?direction=OUTGOING&urn=urn%3Ali%3Adataset%3A(urn%3Ali%3AdataPlatform%3Apostgres,bar,PROD)&types=DownstreamOf'

{
"start": 0,
"count": 9,
"relationships": [
{
"type": "DownstreamOf",
"entity": "urn:li:dataset:(urn:li:dataPlatform:postgres,bar2,PROD)"
},
{
"type": "DownstreamOf",
"entity": "urn:li:dataset:(urn:li:dataPlatform:postgres,bar4,PROD)"
},
{
"type": "DownstreamOf",
"entity": "urn:li:dataset:(urn:li:dataPlatform:postgres,bar3,PROD)"
},
{
"type": "DownstreamOf",
"entity": "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:postgres,bar3,PROD),c1)"
},
{
"type": "DownstreamOf",
"entity": "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:postgres,bar2,PROD),c3)"
},
{
"type": "DownstreamOf",
"entity": "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:postgres,bar2,PROD),c2)"
},
{
"type": "DownstreamOf",
"entity": "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:postgres,bar3,PROD),c2)"
},
{
"type": "DownstreamOf",
"entity": "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:postgres,bar4,PROD),c1)"
},
{
"type": "DownstreamOf",
"entity": "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:postgres,bar2,PROD),c1)"
}
],
"total": 9
}
Find the datasets and fields consumed by a datajob i.e. inputs to a datajob
curl 'http://localhost:8080/relationships?direction=OUTGOING&urn=urn%3Ali%3AdataJob%3A(urn%3Ali%3AdataFlow%3A(spark,Flow1,prod),Task1)&types=Consumes'

{
"start": 0,
"count": 9,
"relationships": [
{
"type": "Consumes",
"entity": "urn:li:dataset:(urn:li:dataPlatform:postgres,bar4,PROD)"
},
{
"type": "Consumes",
"entity": "urn:li:dataset:(urn:li:dataPlatform:postgres,bar3,PROD)"
},
{
"type": "Consumes",
"entity": "urn:li:dataset:(urn:li:dataPlatform:postgres,bar2,PROD)"
},
{
"type": "Consumes",
"entity": "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:postgres,bar4,PROD),c1)"
},
{
"type": "Consumes",
"entity": "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:postgres,bar3,PROD),c2)"
},
{
"type": "Consumes",
"entity": "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:postgres,bar3,PROD),c1)"
},
{
"type": "Consumes",
"entity": "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:postgres,bar2,PROD),c3)"
},
{
"type": "Consumes",
"entity": "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:postgres,bar2,PROD),c2)"
},
{
"type": "Consumes",
"entity": "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:postgres,bar2,PROD),c1)"
}
],
"total": 9
}
Find the datasets and fields produced by a datajob i.e. outputs of a datajob
curl 'http://localhost:8080/relationships?direction=OUTGOING&urn=urn%3Ali%3AdataJob%3A(urn%3Ali%3AdataFlow%3A(spark,Flow1,prod),Task1)&types=Produces'

{
"start": 0,
"count": 11,
"relationships": [
{
"type": "Produces",
"entity": "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:postgres,bar2,PROD),c9)"
},
{
"type": "Produces",
"entity": "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:postgres,bar,PROD),c9)"
},
{
"type": "Produces",
"entity": "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:postgres,bar,PROD),c7)"
},
{
"type": "Produces",
"entity": "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:postgres,bar,PROD),c6)"
},
{
"type": "Produces",
"entity": "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:postgres,bar,PROD),c5)"
},
{
"type": "Produces",
"entity": "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:postgres,bar,PROD),c4)"
},
{
"type": "Produces",
"entity": "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:postgres,bar,PROD),c3)"
},
{
"type": "Produces",
"entity": "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:postgres,bar,PROD),c2)"
},
{
"type": "Produces",
"entity": "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:postgres,bar,PROD),c1)"
},
{
"type": "Produces",
"entity": "urn:li:dataset:(urn:li:dataPlatform:postgres,bar2,PROD)"
},
{
"type": "Produces",
"entity": "urn:li:dataset:(urn:li:dataPlatform:postgres,bar,PROD)"
}
],
"total": 11
}

Documentation for Datasets is available via the datasetProperties aspect (typically filled out via ingestion connectors when the information is already present in the source system) and via the editableDatasetProperties aspect (typically filled out via the UI).

Links that contain more knowledge about the dataset (e.g. links to Confluence pages) can be added via the institutionalMemory aspect.

Here is a simple script that shows you how to add documentation to a dataset, including some links to pages, using the low-level Python SDK.

Python SDK: Add documentation, links to a dataset
# Inlined from /metadata-ingestion/examples/library/dataset_add_documentation.py
import logging
import time

from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper

# read-modify-write requires access to the DataHubGraph (RestEmitter is not enough)
from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph

# Imports for metadata model classes
from datahub.metadata.schema_classes import (
    AuditStampClass,
    EditableDatasetPropertiesClass,
    InstitutionalMemoryClass,
    InstitutionalMemoryMetadataClass,
)

log = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)


# Inputs -> documentation, link, dataset
documentation_to_add = "## The Real Estate Sales Dataset\nThis is a really important Dataset that contains all the relevant information about sales that have happened organized by address.\n"
link_to_add = "https://wikipedia.com/real_estate"
link_description = "This is the definition of what real estate means"
dataset_urn = make_dataset_urn(platform="hive", name="realestate_db.sales", env="PROD")

# Some helpful variables to fill out objects later
now = int(time.time() * 1000)  # milliseconds since epoch
current_timestamp = AuditStampClass(time=now, actor="urn:li:corpuser:ingestion")
institutional_memory_element = InstitutionalMemoryMetadataClass(
    url=link_to_add,
    description=link_description,
    createStamp=current_timestamp,
)


# First we get the current editable properties
gms_endpoint = "http://localhost:8080"
graph = DataHubGraph(config=DatahubClientConfig(server=gms_endpoint))

current_editable_properties = graph.get_aspect(
    entity_urn=dataset_urn, aspect_type=EditableDatasetPropertiesClass
)

need_write = False
if current_editable_properties:
    if documentation_to_add != current_editable_properties.description:
        current_editable_properties.description = documentation_to_add
        need_write = True
else:
    # create a brand new editable dataset properties aspect
    current_editable_properties = EditableDatasetPropertiesClass(
        created=current_timestamp, description=documentation_to_add
    )
    need_write = True

if need_write:
    event: MetadataChangeProposalWrapper = MetadataChangeProposalWrapper(
        entityUrn=dataset_urn,
        aspect=current_editable_properties,
    )
    graph.emit(event)
    log.info(f"Documentation added to dataset {dataset_urn}")
else:
    log.info("Documentation already exists and is identical, omitting write")


current_institutional_memory = graph.get_aspect(
    entity_urn=dataset_urn, aspect_type=InstitutionalMemoryClass
)

need_write = False

if current_institutional_memory:
    if link_to_add not in [x.url for x in current_institutional_memory.elements]:
        current_institutional_memory.elements.append(institutional_memory_element)
        need_write = True
else:
    # create a brand new institutional memory aspect
    current_institutional_memory = InstitutionalMemoryClass(
        elements=[institutional_memory_element]
    )
    need_write = True

if need_write:
    event = MetadataChangeProposalWrapper(
        entityUrn=dataset_urn,
        aspect=current_institutional_memory,
    )
    graph.emit(event)
    log.info(f"Link {link_to_add} added to dataset {dataset_urn}")
else:
    log.info(f"Link {link_to_add} already exists and is identical, omitting write")

Notable Exceptions

The following overloaded uses of the Dataset entity exist for convenience, but will likely move to fully modeled entity types in the future.

  • OpenAPI endpoints: the GET APIs of OpenAPI endpoints are currently modeled as Datasets, but should really be modeled as a Service/API entity once one is created in the metadata model.
  • DataHub's Logical Entities (e.g. Dataset, Chart, Dashboard) are represented as Datasets, with sub-type Entity. These should really be modeled as Entities in a logical ER model once this is created in the metadata model.

Aspects

datasetKey

Key for a Dataset

Schema
{
"type": "record",
"Aspect": {
"name": "datasetKey"
},
"name": "DatasetKey",
"namespace": "com.linkedin.metadata.key",
"fields": [
{
"Searchable": {
"enableAutocomplete": true,
"fieldType": "URN"
},
"java": {
"class": "com.linkedin.common.urn.Urn"
},
"type": "string",
"name": "platform",
"doc": "Data platform urn associated with the dataset"
},
{
"Searchable": {
"boostScore": 10.0,
"enableAutocomplete": true,
"fieldName": "id",
"fieldType": "WORD_GRAM"
},
"type": "string",
"name": "name",
"doc": "Unique guid for dataset"
},
{
"Searchable": {
"addToFilters": true,
"fieldType": "TEXT_PARTIAL",
"filterNameOverride": "Environment",
"queryByDefault": false
},
"type": {
"type": "enum",
"symbolDocs": {
"CORP": "Designates corporation fabrics",
"DEV": "Designates development fabrics",
"EI": "Designates early-integration fabrics",
"NON_PROD": "Designates non-production fabrics",
"PRE": "Designates pre-production fabrics",
"PROD": "Designates production fabrics",
"QA": "Designates quality assurance fabrics",
"RVW": "Designates review fabrics",
"STG": "Designates staging fabrics",
"TEST": "Designates testing fabrics",
"UAT": "Designates user acceptance testing fabrics"
},
"name": "FabricType",
"namespace": "com.linkedin.common",
"symbols": [
"DEV",
"TEST",
"QA",
"UAT",
"EI",
"PRE",
"STG",
"NON_PROD",
"PROD",
"CORP",
"RVW"
],
"doc": "Fabric group type"
},
"name": "origin",
"doc": "Fabric type where dataset belongs to or where it was generated."
}
],
"doc": "Key for a Dataset"
}

datasetProperties

Properties associated with a Dataset

Schema
{
"type": "record",
"Aspect": {
"name": "datasetProperties"
},
"name": "DatasetProperties",
"namespace": "com.linkedin.dataset",
"fields": [
{
"Searchable": {
"/*": {
"fieldType": "TEXT",
"queryByDefault": true
}
},
"type": {
"type": "map",
"values": "string"
},
"name": "customProperties",
"default": {},
"doc": "Custom property bag."
},
{
"Searchable": {
"fieldType": "KEYWORD"
},
"java": {
"class": "com.linkedin.common.url.Url",
"coercerClass": "com.linkedin.common.url.UrlCoercer"
},
"type": [
"null",
"string"
],
"name": "externalUrl",
"default": null,
"doc": "URL where the reference exist"
},
{
"Searchable": {
"boostScore": 10.0,
"enableAutocomplete": true,
"fieldNameAliases": [
"_entityName"
],
"fieldType": "WORD_GRAM"
},
"type": [
"null",
"string"
],
"name": "name",
"default": null,
"doc": "Display name of the Dataset"
},
{
"Searchable": {
"addToFilters": false,
"boostScore": 10.0,
"enableAutocomplete": true,
"fieldType": "WORD_GRAM"
},
"type": [
"null",
"string"
],
"name": "qualifiedName",
"default": null,
"doc": "Fully-qualified name of the Dataset"
},
{
"Searchable": {
"fieldType": "TEXT",
"hasValuesFieldName": "hasDescription"
},
"type": [
"null",
"string"
],
"name": "description",
"default": null,
"doc": "Documentation of the dataset"
},
{
"deprecated": "Use ExternalReference.externalUrl field instead.",
"java": {
"class": "java.net.URI"
},
"type": [
"null",
"string"
],
"name": "uri",
"default": null,
"doc": "The abstracted URI such as hdfs:///data/tracking/PageViewEvent, file:///dir/file_name. Uri should not include any environment specific properties. Some datasets might not have a standardized uri, which makes this field optional (i.e. kafka topic)."
},
{
"Searchable": {
"/time": {
"fieldName": "createdAt",
"fieldType": "DATETIME"
}
},
"type": [
"null",
{
"type": "record",
"name": "TimeStamp",
"namespace": "com.linkedin.common",
"fields": [
{
"type": "long",
"name": "time",
"doc": "When did the event occur"
},
{
"java": {
"class": "com.linkedin.common.urn.Urn"
},
"type": [
"null",
"string"
],
"name": "actor",
"default": null,
"doc": "Optional: The actor urn involved in the event."
}
],
"doc": "A standard event timestamp"
}
],
"name": "created",
"default": null,
"doc": "A timestamp documenting when the asset was created in the source Data Platform (not on DataHub)"
},
{
"Searchable": {
"/time": {
"fieldName": "lastModifiedAt",
"fieldType": "DATETIME"
}
},
"type": [
"null",
"com.linkedin.common.TimeStamp"
],
"name": "lastModified",
"default": null,
"doc": "A timestamp documenting when the asset was last modified in the source Data Platform (not on DataHub)"
},
{
"deprecated": "Use GlobalTags aspect instead.",
"type": {
"type": "array",
"items": "string"
},
"name": "tags",
"default": [],
"doc": "[Legacy] Unstructured tags for the dataset. Structured tags can be applied via the `GlobalTags` aspect.\nThis is now deprecated."
}
],
"doc": "Properties associated with a Dataset"
}
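While the read-modify-write examples earlier on this page target the editable aspects, a connector-style write of this aspect is a straightforward emit. Below is a minimal sketch (the dataset urn and property values are hypothetical) of setting datasetProperties with the Python SDK.

Python SDK (sketch): Set datasetProperties on a dataset
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

dataset_urn = make_dataset_urn(platform="hive", name="realestate_db.sales", env="PROD")

# properties as they exist in the source system; these values are illustrative
properties = DatasetPropertiesClass(
    name="sales",
    description="Sales of properties, organized by address",
    customProperties={"governance_tier": "tier_1"},
)

event = MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=properties)
DatahubRestEmitter(gms_server="http://localhost:8080").emit(event)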

editableDatasetProperties

EditableDatasetProperties stores editable changes made to dataset properties. This separates changes made from ingestion pipelines and edits in the UI to avoid accidental overwrites of user-provided data by ingestion pipelines

Schema
{
"type": "record",
"Aspect": {
"name": "editableDatasetProperties"
},
"name": "EditableDatasetProperties",
"namespace": "com.linkedin.dataset",
"fields": [
{
"type": {
"type": "record",
"name": "AuditStamp",
"namespace": "com.linkedin.common",
"fields": [
{
"type": "long",
"name": "time",
"doc": "When did the resource/association/sub-resource move into the specific lifecycle stage represented by this AuditEvent."
},
{
"java": {
"class": "com.linkedin.common.urn.Urn"
},
"type": "string",
"name": "actor",
"doc": "The entity (e.g. a member URN) which will be credited for moving the resource/association/sub-resource into the specific lifecycle stage. It is also the one used to authorize the change."
},
{
"java": {
"class": "com.linkedin.common.urn.Urn"
},
"type": [
"null",
"string"
],
"name": "impersonator",
"default": null,
"doc": "The entity (e.g. a service URN) which performs the change on behalf of the Actor and must be authorized to act as the Actor."
},
{
"type": [
"null",
"string"
],
"name": "message",
"default": null,
"doc": "Additional context around how DataHub was informed of the particular change. For example: was the change created by an automated process, or manually."
}
],
"doc": "Data captured on a resource/association/sub-resource level giving insight into when that resource/association/sub-resource moved into a particular lifecycle stage, and who acted to move it into that specific lifecycle stage."
},
"name": "created",
"default": {
"actor": "urn:li:corpuser:unknown",
"impersonator": null,
"time": 0,
"message": null
},
"doc": "An AuditStamp corresponding to the creation of this resource/association/sub-resource. A value of 0 for time indicates missing data."
},
{
"type": "com.linkedin.common.AuditStamp",
"name": "lastModified",
"default": {
"actor": "urn:li:corpuser:unknown",
"impersonator": null,
"time": 0,
"message": null
},
"doc": "An AuditStamp corresponding to the last modification of this resource/association/sub-resource. If no modification has happened since creation, lastModified should be the same as created. A value of 0 for time indicates missing data."
},
{
"type": [
"null",
"com.linkedin.common.AuditStamp"
],
"name": "deleted",
"default": null,
"doc": "An AuditStamp corresponding to the deletion of this resource/association/sub-resource. Logically, deleted MUST have a later timestamp than creation. It may or may not have the same time as lastModified depending upon the resource/association/sub-resource semantics."
},
{
"Searchable": {
"fieldName": "editedDescription",
"fieldType": "TEXT"
},
"type": [
"null",
"string"
],
"name": "description",
"default": null,
"doc": "Documentation of the dataset"
},
{
"Searchable": {
"fieldName": "editedName",
"fieldType": "TEXT_PARTIAL"
},
"type": [
"null",
"string"
],
"name": "name",
"default": null,
"doc": "Editable display name of the Dataset"
}
],
"doc": "EditableDatasetProperties stores editable changes made to dataset properties. This separates changes made from\ningestion pipelines and edits in the UI to avoid accidental overwrites of user-provided data by ingestion pipelines"
}

datasetUpstreamLineage

Fine Grained upstream lineage for fields in a dataset

Schema
{
"type": "record",
"Aspect": {
"name": "datasetUpstreamLineage"
},
"deprecated": "use UpstreamLineage.fineGrainedLineages instead",
"name": "DatasetUpstreamLineage",
"namespace": "com.linkedin.dataset",
"fields": [
{
"type": {
"type": "array",
"items": {
"type": "record",
"deprecated": "use FineGrainedLineage instead",
"name": "DatasetFieldMapping",
"namespace": "com.linkedin.dataset",
"fields": [
{
"type": {
"type": "record",
"name": "AuditStamp",
"namespace": "com.linkedin.common",
"fields": [
{
"type": "long",
"name": "time",
"doc": "When did the resource/association/sub-resource move into the specific lifecycle stage represented by this AuditEvent."
},
{
"java": {
"class": "com.linkedin.common.urn.Urn"
},
"type": "string",
"name": "actor",
"doc": "The entity (e.g. a member URN) which will be credited for moving the resource/association/sub-resource into the specific lifecycle stage. It is also the one used to authorize the change."
},
{
"java": {
"class": "com.linkedin.common.urn.Urn"
},
"type": [
"null",
"string"
],
"name": "impersonator",
"default": null,
"doc": "The entity (e.g. a service URN) which performs the change on behalf of the Actor and must be authorized to act as the Actor."
},
{
"type": [
"null",
"string"
],
"name": "message",
"default": null,
"doc": "Additional context around how DataHub was informed of the particular change. For example: was the change created by an automated process, or manually."
}
],
"doc": "Data captured on a resource/association/sub-resource level giving insight into when that resource/association/sub-resource moved into a particular lifecycle stage, and who acted to move it into that specific lifecycle stage."
},
"name": "created",
"doc": "Audit stamp containing who reported the field mapping and when"
},
{
"type": [
{
"type": "enum",
"symbolDocs": {
"BLACKBOX": "Field transformation expressed as unknown black box function.",
"IDENTITY": "Field transformation expressed as Identity function."
},
"name": "TransformationType",
"namespace": "com.linkedin.common.fieldtransformer",
"symbols": [
"BLACKBOX",
"IDENTITY"
],
"doc": "Type of the transformation involved in generating destination fields from source fields."
},
{
"type": "record",
"name": "UDFTransformer",
"namespace": "com.linkedin.common.fieldtransformer",
"fields": [
{
"type": "string",
"name": "udf",
"doc": "A UDF mentioning how the source fields got transformed to destination field. This is the FQCN(Fully Qualified Class Name) of the udf."
}
],
"doc": "Field transformation expressed in UDF"
}
],
"name": "transformation",
"doc": "Transfomration function between the fields involved"
},
{
"type": {
"type": "array",
"items": [
"string"
]
},
"name": "sourceFields",
"doc": "Source fields from which the fine grained lineage is derived"
},
{
"deprecated": "use SchemaFieldPath and represent as generic Urn instead",
"java": {
"class": "com.linkedin.common.urn.DatasetFieldUrn"
},
"type": "string",
"name": "destinationField",
"doc": "Destination field which is derived from source fields"
}
],
"doc": "Representation of mapping between fields in source dataset to the field in destination dataset"
}
},
"name": "fieldMappings",
"doc": "Upstream to downstream field level lineage mappings"
}
],
"doc": "Fine Grained upstream lineage for fields in a dataset"
}

upstreamLineage

Upstream lineage of a dataset

Schema
{
"type": "record",
"Aspect": {
"name": "upstreamLineage"
},
"name": "UpstreamLineage",
"namespace": "com.linkedin.dataset",
"fields": [
{
"type": {
"type": "array",
"items": {
"type": "record",
"name": "Upstream",
"namespace": "com.linkedin.dataset",
"fields": [
{
"type": {
"type": "record",
"name": "AuditStamp",
"namespace": "com.linkedin.common",
"fields": [
{
"type": "long",
"name": "time",
"doc": "When did the resource/association/sub-resource move into the specific lifecycle stage represented by this AuditEvent."
},
{
"java": {
"class": "com.linkedin.common.urn.Urn"
},
"type": "string",
"name": "actor",
"doc": "The entity (e.g. a member URN) which will be credited for moving the resource/association/sub-resource into the specific lifecycle stage. It is also the one used to authorize the change."
},
{
"java": {
"class": "com.linkedin.common.urn.Urn"
},
"type": [
"null",
"string"
],
"name": "impersonator",
"default": null,
"doc": "The entity (e.g. a service URN) which performs the change on behalf of the Actor and must be authorized to act as the Actor."
},
{
"type": [
"null",
"string"
],
"name": "message",
"default": null,
"doc": "Additional context around how DataHub was informed of the particular change. For example: was the change created by an automated process, or manually."
}
],
"doc": "Data captured on a resource/association/sub-resource level giving insight into when that resource/association/sub-resource moved into a particular lifecycle stage, and who acted to move it into that specific lifecycle stage."
},
"name": "auditStamp",
"default": {
"actor": "urn:li:corpuser:unknown",
"impersonator": null,
"time": 0,
"message": null
},
"doc": "Audit stamp containing who reported the lineage and when."
},
{
"type": [
"null",
"com.linkedin.common.AuditStamp"
],
"name": "created",
"default": null,
"doc": "Audit stamp containing who created the lineage and when."
},
{
"Relationship": {
"createdActor": "upstreams/*/created/actor",
"createdOn": "upstreams/*/created/time",
"entityTypes": [
"dataset"
],
"isLineage": true,
"name": "DownstreamOf",
"properties": "upstreams/*/properties",
"updatedActor": "upstreams/*/auditStamp/actor",
"updatedOn": "upstreams/*/auditStamp/time",
"via": "upstreams/*/query"
},
"Searchable": {
"fieldName": "upstreams",
"fieldType": "URN",
"queryByDefault": false
},
"java": {
"class": "com.linkedin.common.urn.DatasetUrn"
},
"type": "string",
"name": "dataset",
"doc": "The upstream dataset the lineage points to"
},
{
"type": {
"type": "enum",
"symbolDocs": {
"COPY": "Direct copy without modification",
"TRANSFORMED": "Transformed data with modification (format or content change)",
"VIEW": "Represents a view defined on the sources e.g. Hive view defined on underlying hive tables or a Hive table pointing to a HDFS dataset or DALI view defined on multiple sources"
},
"name": "DatasetLineageType",
"namespace": "com.linkedin.dataset",
"symbols": [
"COPY",
"TRANSFORMED",
"VIEW"
],
"doc": "The various types of supported dataset lineage"
},
"name": "type",
"doc": "The type of the lineage"
},
{
"type": [
"null",
{
"type": "map",
"values": "string"
}
],
"name": "properties",
"default": null,
"doc": "A generic properties bag that allows us to store specific information on this graph edge."
},
{
"java": {
"class": "com.linkedin.common.urn.Urn"
},
"type": [
"null",
"string"
],
"name": "query",
"default": null,
"doc": "If the lineage is generated by a query, a reference to the query"
}
],
"doc": "Upstream lineage information about a dataset including the source reporting the lineage"
}
},
"name": "upstreams",
"doc": "List of upstream dataset lineage information"
},
{
"Relationship": {
"/*/upstreams/*": {
"entityTypes": [
"dataset",
"schemaField"
],
"name": "DownstreamOf"
}
},
"type": [
"null",
{
"type": "array",
"items": {
"type": "record",
"name": "FineGrainedLineage",
"namespace": "com.linkedin.dataset",
"fields": [
{
"type": {
"type": "enum",
"symbolDocs": {
"DATASET": " Indicates that this lineage is originating from upstream dataset(s)",
"FIELD_SET": " Indicates that this lineage is originating from upstream field(s)",
"NONE": " Indicates that there is no upstream lineage i.e. the downstream field is not a derived field"
},
"name": "FineGrainedLineageUpstreamType",
"namespace": "com.linkedin.dataset",
"symbols": [
"FIELD_SET",
"DATASET",
"NONE"
],
"doc": "The type of upstream entity in a fine-grained lineage"
},
"name": "upstreamType",
"doc": "The type of upstream entity"
},
{
"type": [
"null",
{
"type": "array",
"items": "string"
}
],
"name": "upstreams",
"default": null,
"doc": "Upstream entities in the lineage"
},
{
"type": {
"type": "enum",
"symbolDocs": {
"FIELD": " Indicates that the lineage is for a single, specific, downstream field",
"FIELD_SET": " Indicates that the lineage is for a set of downstream fields"
},
"name": "FineGrainedLineageDownstreamType",
"namespace": "com.linkedin.dataset",
"symbols": [
"FIELD",
"FIELD_SET"
],
"doc": "The type of downstream field(s) in a fine-grained lineage"
},
"name": "downstreamType",
"doc": "The type of downstream field(s)"
},
{
"type": [
"null",
{
"type": "array",
"items": "string"
}
],
"name": "downstreams",
"default": null,
"doc": "Downstream fields in the lineage"
},
{
"type": [
"null",
"string"
],
"name": "transformOperation",
"default": null,
"doc": "The transform operation applied to the upstream entities to produce the downstream field(s)"
},
{
"type": "float",
"name": "confidenceScore",
"default": 1.0,
"doc": "The confidence in this lineage between 0 (low confidence) and 1 (high confidence)"
},
{
"java": {
"class": "com.linkedin.common.urn.Urn"
},
"type": [
"null",
"string"
],
"name": "query",
"default": null,
"doc": "The query that was used to generate this lineage. \nPresent only if the lineage was generated from a detected query."
}
],
"doc": "A fine-grained lineage from upstream fields/datasets to downstream field(s)"
}
}
],
"name": "fineGrainedLineages",
"default": null,
"doc": " List of fine-grained lineage information, including field-level lineage"
}
],
"doc": "Upstream lineage of a dataset"
}

institutionalMemory

Institutional memory of an entity. This is a way to link to relevant documentation and provide description of the documentation. Institutional or tribal knowledge is very important for users to leverage the entity.

Schema
{
"type": "record",
"Aspect": {
"name": "institutionalMemory"
},
"name": "InstitutionalMemory",
"namespace": "com.linkedin.common",
"fields": [
{
"type": {
"type": "array",
"items": {
"type": "record",
"name": "InstitutionalMemoryMetadata",
"namespace": "com.linkedin.common",
"fields": [
{
"java": {
"class": "com.linkedin.common.url.Url",
"coercerClass": "com.linkedin.common.url.UrlCoercer"
},
"type": "string",
"name": "url",
"doc": "Link to an engineering design document or a wiki page."
},
{
"type": "string",
"name": "description",
"doc": "Description of the link."
},
{
"type": {
"type": "record",
"name": "AuditStamp",
"namespace": "com.linkedin.common",
"fields": [
{
"type": "long",
"name": "time",
"doc": "When did the resource/association/sub-resource move into the specific lifecycle stage represented by this AuditEvent."
},
{
"java": {
"class": "com.linkedin.common.urn.Urn"
},
"type": "string",
"name": "actor",
"doc": "The entity (e.g. a member URN) which will be credited for moving the resource/association/sub-resource into the specific lifecycle stage. It is also the one used to authorize the change."
},
{
"java": {
"class": "com.linkedin.common.urn.Urn"
},
"type": [
"null",
"string"
],
"name": "impersonator",
"default": null,
"doc": "The entity (e.g. a service URN) which performs the change on behalf of the Actor and must be authorized to act as the Actor."
},
{
"type": [
"null",
"string"
],
"name": "message",
"default": null,
"doc": "Additional context around how DataHub was informed of the particular change. For example: was the change created by an automated process, or manually."
}
],
"doc": "Data captured on a resource/association/sub-resource level giving insight into when that resource/association/sub-resource moved into a particular lifecycle stage, and who acted to move it into that specific lifecycle stage."
},
"name": "createStamp",
"doc": "Audit stamp associated with creation of this record"
}
],
"doc": "Metadata corresponding to a record of institutional memory."
}
},
"name": "elements",
"doc": "List of records that represent institutional memory of an entity. Each record consists of a link, description, creator and timestamps associated with that record."
}
],
"doc": "Institutional memory of an entity. This is a way to link to relevant documentation and provide description of the documentation. Institutional or tribal knowledge is very important for users to leverage the entity."
}

ownership

Ownership information of an entity.

Schema
{
"type": "record",
"Aspect": {
"name": "ownership"
},
"name": "Ownership",
"namespace": "com.linkedin.common",
"fields": [
{
"type": {
"type": "array",
"items": {
"type": "record",
"name": "Owner",
"namespace": "com.linkedin.common",
"fields": [
{
"Relationship": {
"entityTypes": [
"corpuser",
"corpGroup"
],
"name": "OwnedBy"
},
"Searchable": {
"addToFilters": true,
"fieldName": "owners",
"fieldType": "URN",
"filterNameOverride": "Owned By",
"hasValuesFieldName": "hasOwners",
"queryByDefault": false
},
"java": {
"class": "com.linkedin.common.urn.Urn"
},
"type": "string",
"name": "owner",
"doc": "Owner URN, e.g. urn:li:corpuser:ldap, urn:li:corpGroup:group_name, and urn:li:multiProduct:mp_name\n(Caveat: only corpuser is currently supported in the frontend.)"
},
{
"deprecated": true,
"type": {
"type": "enum",
"symbolDocs": {
"BUSINESS_OWNER": "A person or group who is responsible for logical, or business related, aspects of the asset.",
"CONSUMER": "A person, group, or service that consumes the data\nDeprecated! Use TECHNICAL_OWNER or BUSINESS_OWNER instead.",
"CUSTOM": "Set when ownership type is unknown or a when new one is specified as an ownership type entity for which we have no\nenum value for. This is used for backwards compatibility",
"DATAOWNER": "A person or group that is owning the data\nDeprecated! Use TECHNICAL_OWNER instead.",
"DATA_STEWARD": "A steward, expert, or delegate responsible for the asset.",
"DELEGATE": "A person or a group that overseas the operation, e.g. a DBA or SRE.\nDeprecated! Use TECHNICAL_OWNER instead.",
"DEVELOPER": "A person or group that is in charge of developing the code\nDeprecated! Use TECHNICAL_OWNER instead.",
"NONE": "No specific type associated to the owner.",
"PRODUCER": "A person, group, or service that produces/generates the data\nDeprecated! Use TECHNICAL_OWNER instead.",
"STAKEHOLDER": "A person or a group that has direct business interest\nDeprecated! Use TECHNICAL_OWNER, BUSINESS_OWNER, or STEWARD instead.",
"TECHNICAL_OWNER": "person or group who is responsible for technical aspects of the asset."
},
"deprecatedSymbols": {
"CONSUMER": true,
"DATAOWNER": true,
"DELEGATE": true,
"DEVELOPER": true,
"PRODUCER": true,
"STAKEHOLDER": true
},
"name": "OwnershipType",
"namespace": "com.linkedin.common",
"symbols": [
"CUSTOM",
"TECHNICAL_OWNER",
"BUSINESS_OWNER",
"DATA_STEWARD",
"NONE",
"DEVELOPER",
"DATAOWNER",
"DELEGATE",
"PRODUCER",
"CONSUMER",
"STAKEHOLDER"
],
"doc": "Asset owner types"
},
"name": "type",
"doc": "The type of the ownership"
},
{
"Relationship": {
"entityTypes": [
"ownershipType"
],
"name": "ownershipType"
},
"java": {
"class": "com.linkedin.common.urn.Urn"
},
"type": [
"null",
"string"
],
"name": "typeUrn",
"default": null,
"doc": "The type of the ownership\nUrn of type O"
},
{
"type": [
"null",
{
"type": "record",
"name": "OwnershipSource",
"namespace": "com.linkedin.common",
"fields": [
{
"type": {
"type": "enum",
"symbolDocs": {
"AUDIT": "Auditing system or audit logs",
"DATABASE": "Database, e.g. GRANTS table",
"FILE_SYSTEM": "File system, e.g. file/directory owner",
"ISSUE_TRACKING_SYSTEM": "Issue tracking system, e.g. Jira",
"MANUAL": "Manually provided by a user",
"OTHER": "Other sources",
"SERVICE": "Other ownership-like service, e.g. Nuage, ACL service etc",
"SOURCE_CONTROL": "SCM system, e.g. GIT, SVN"
},
"name": "OwnershipSourceType",
"namespace": "com.linkedin.common",
"symbols": [
"AUDIT",
"DATABASE",
"FILE_SYSTEM",
"ISSUE_TRACKING_SYSTEM",
"MANUAL",
"SERVICE",
"SOURCE_CONTROL",
"OTHER"
]
},
"name": "type",
"doc": "The type of the source"
},
{
"type": [
"null",
"string"
],
"name": "url",
"default": null,
"doc": "A reference URL for the source"
}
],
"doc": "Source/provider of the ownership information"
}
],
"name": "source",
"default": null,
"doc": "Source information for the ownership"
}
],
"doc": "Ownership information"
}
},
"name": "owners",
"doc": "List of owners of the entity."
},
{
"Searchable": {
"/*": {
"fieldType": "MAP_ARRAY",
"queryByDefault": false
}
},
"type": [
{
"type": "map",
"values": {
"type": "array",
"items": "string"
}
},
"null"
],
"name": "ownerTypes",
"default": {},
"doc": "Ownership type to Owners map, populated via mutation hook."
},
{
"type": {
"type": "record",
"name": "AuditStamp",
"namespace": "com.linkedin.common",
"fields": [
{
"type": "long",
"name": "time",
"doc": "When did the resource/association/sub-resource move into the specific lifecycle stage represented by this AuditEvent."
},
{
"java": {
"class": "com.linkedin.common.urn.Urn"
},
"type": "string",
"name": "actor",
"doc": "The entity (e.g. a member URN) which will be credited for moving the resource/association/sub-resource into the specific lifecycle stage. It is also the one used to authorize the change."
},
{
"java": {
"class": "com.linkedin.common.urn.Urn"
},
"type": [
"null",
"string"
],
"name": "impersonator",
"default": null,
"doc": "The entity (e.g. a service URN) which performs the change on behalf of the Actor and must be authorized to act as the Actor."
},
{
"type": [
"null",
"string"
],
"name": "message",
"default": null,
"doc": "Additional context around how DataHub was informed of the particular change. For example: was the change created by an automated process, or manually."
}
],
"doc": "Data captured on a resource/association/sub-resource level giving insight into when that resource/association/sub-resource moved into a particular lifecycle stage, and who acted to move it into that specific lifecycle stage."
},
"name": "lastModified",
"default": {
"actor": "urn:li:corpuser:unknown",
"impersonator": null,
"time": 0,
"message": null
},
"doc": "Audit stamp containing who last modified the record and when. A value of 0 in the time field indicates missing data."
}
],
"doc": "Ownership information of an entity."
}
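
As with other aspects, ownership can be emitted with the Python SDK. The snippet below is a minimal sketch under the same assumptions as the previous example (placeholder dataset URN, owner and GMS endpoint); verify the OwnershipTypeClass constant and other names against your SDK version.

Python SDK: Add an owner to a dataset (illustrative sketch)
# Illustrative sketch: set a technical owner on a dataset.
from datahub.emitter.mce_builder import make_dataset_urn, make_user_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    OwnerClass,
    OwnershipClass,
    OwnershipTypeClass,
)

dataset_urn = make_dataset_urn(platform="hive", name="fct_users_created", env="PROD")

# Mirrors the ownership schema above: a list of Owner records, each with an owner urn and a type.
ownership = OwnershipClass(
    owners=[
        OwnerClass(
            owner=make_user_urn("jdoe"),  # placeholder corpuser
            type=OwnershipTypeClass.TECHNICAL_OWNER,
        )
    ]
)

mcp = MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=ownership)
DatahubRestEmitter(gms_server="http://localhost:8080").emit(mcp)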

status

The lifecycle status metadata of an entity, e.g. dataset, metric, feature, etc. This aspect is used to represent soft deletes conventionally.

Schema
{
"type": "record",
"Aspect": {
"name": "status"
},
"name": "Status",
"namespace": "com.linkedin.common",
"fields": [
{
"Searchable": {
"fieldType": "BOOLEAN"
},
"type": "boolean",
"name": "removed",
"default": false,
"doc": "Whether the entity has been removed (soft-deleted)."
}
],
"doc": "The lifecycle status metadata of an entity, e.g. dataset, metric, feature, etc.\nThis aspect is used to represent soft deletes conventionally."
}
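
Because the status aspect is the conventional representation of soft deletes, a dataset can be soft-deleted (or restored) by emitting it directly. The sketch below again assumes a placeholder dataset URN and a local GMS endpoint.

Python SDK: Soft-delete a dataset via the status aspect (illustrative sketch)
# Illustrative sketch: mark a dataset as soft-deleted.
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import StatusClass

dataset_urn = make_dataset_urn(platform="hive", name="fct_users_created", env="PROD")

# removed=True soft-deletes the dataset; emit removed=False to restore it.
mcp = MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=StatusClass(removed=True))
DatahubRestEmitter(gms_server="http://localhost:8080").emit(mcp)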

schemaMetadata

SchemaMetadata to describe metadata related to the store's schema

Schema
{
"type": "record",
"Aspect": {
"name": "schemaMetadata"
},
"name": "SchemaMetadata",
"namespace": "com.linkedin.schema",
"fields": [
{
"validate": {
"strlen": {
"max": 500,
"min": 1
}
},
"type": "string",
"name": "schemaName",
"doc": "Schema name e.g. PageViewEvent, identity.Profile, ams.account_management_tracking"
},
{
"java": {
"class": "com.linkedin.common.urn.DataPlatformUrn"
},
"type": "string",
"name": "platform",
"doc": "Standardized platform urn where schema is defined. The data platform Urn (urn:li:platform:{platform_name})"
},
{
"type": "long",
"name": "version",
"doc": "Every change to SchemaMetadata in the resource results in a new version. Version is server assigned. This version is differ from platform native schema version."
},
{
"type": {
"type": "record",
"name": "AuditStamp",
"namespace": "com.linkedin.common",
"fields": [
{
"type": "long",
"name": "time",
"doc": "When did the resource/association/sub-resource move into the specific lifecycle stage represented by this AuditEvent."
},
{
"java": {
"class": "com.linkedin.common.urn.Urn"
},
"type": "string",
"name": "actor",
"doc": "The entity (e.g. a member URN) which will be credited for moving the resource/association/sub-resource into the specific lifecycle stage. It is also the one used to authorize the change."
},
{
"java": {
"class": "com.linkedin.common.urn.Urn"
},
"type": [
"null",
"string"
],
"name": "impersonator",
"default": null,
"doc": "The entity (e.g. a service URN) which performs the change on behalf of the Actor and must be authorized to act as the Actor."
},
{
"type": [
"null",
"string"
],
"name": "message",
"default": null,
"doc": "Additional context around how DataHub was informed of the particular change. For example: was the change created by an automated process, or manually."
}
],
"doc": "Data captured on a resource/association/sub-resource level giving insight into when that resource/association/sub-resource moved into a particular lifecycle stage, and who acted to move it into that specific lifecycle stage."
},
"name": "created",
"default": {
"actor": "urn:li:corpuser:unknown",
"impersonator": null,
"time": 0,
"message": null
},
"doc": "An AuditStamp corresponding to the creation of this resource/association/sub-resource. A value of 0 for time indicates missing data."
},
{
"type": "com.linkedin.common.AuditStamp",
"name": "lastModified",
"default": {
"actor": "urn:li:corpuser:unknown",
"impersonator": null,
"time": 0,
"message": null
},
"doc": "An AuditStamp corresponding to the last modification of this resource/association/sub-resource. If no modification has happened since creation, lastModified should be the same as created. A value of 0 for time indicates missing data."
},
{
"type": [
"null",
"com.linkedin.common.AuditStamp"
],
"name": "deleted",
"default": null,
"doc": "An AuditStamp corresponding to the deletion of this resource/association/sub-resource. Logically, deleted MUST have a later timestamp than creation. It may or may not have the same time as lastModified depending upon the resource/association/sub-resource semantics."
},
{
"java": {
"class": "com.linkedin.common.urn.DatasetUrn"
},
"type": [
"null",
"string"
],
"name": "dataset",
"default": null,
"doc": "Dataset this schema metadata is associated with."
},
{
"type": [
"null",
"string"
],
"name": "cluster",
"default": null,
"doc": "The cluster this schema metadata resides from"
},
{
"type": "string",
"name": "hash",
"doc": "the SHA1 hash of the schema content"
},
{
"type": [
{
"type": "record",
"name": "EspressoSchema",
"namespace": "com.linkedin.schema",
"fields": [
{
"type": "string",
"name": "documentSchema",
"doc": "The native espresso document schema."
},
{
"type": "string",
"name": "tableSchema",
"doc": "The espresso table schema definition."
}
],
"doc": "Schema text of an espresso table schema."
},
{
"type": "record",
"name": "OracleDDL",
"namespace": "com.linkedin.schema",
"fields": [
{
"type": "string",
"name": "tableSchema",
"doc": "The native schema in the dataset's platform. This is a human readable (json blob) table schema."
}
],
"doc": "Schema holder for oracle data definition language that describes an oracle table."
},
{
"type": "record",
"name": "MySqlDDL",
"namespace": "com.linkedin.schema",
"fields": [
{
"type": "string",
"name": "tableSchema",
"doc": "The native schema in the dataset's platform. This is a human readable (json blob) table schema."
}
],
"doc": "Schema holder for MySql data definition language that describes an MySql table."
},
{
"type": "record",
"name": "PrestoDDL",
"namespace": "com.linkedin.schema",
"fields": [
{
"type": "string",
"name": "rawSchema",
"doc": "The raw schema in the dataset's platform. This includes the DDL and the columns extracted from DDL."
}
],
"doc": "Schema holder for presto data definition language that describes a presto view."
},
{
"type": "record",
"name": "KafkaSchema",
"namespace": "com.linkedin.schema",
"fields": [
{
"type": "string",
"name": "documentSchema",
"doc": "The native kafka document schema. This is a human readable avro document schema."
},
{
"type": [
"null",
"string"
],
"name": "documentSchemaType",
"default": null,
"doc": "The native kafka document schema type. This can be AVRO/PROTOBUF/JSON."
},
{
"type": [
"null",
"string"
],
"name": "keySchema",
"default": null,
"doc": "The native kafka key schema as retrieved from Schema Registry"
},
{
"type": [
"null",
"string"
],
"name": "keySchemaType",
"default": null,
"doc": "The native kafka key schema type. This can be AVRO/PROTOBUF/JSON."
}
],
"doc": "Schema holder for kafka schema."
},
{
"type": "record",
"name": "BinaryJsonSchema",
"namespace": "com.linkedin.schema",
"fields": [
{
"type": "string",
"name": "schema",
"doc": "The native schema text for binary JSON file format."
}
],
"doc": "Schema text of binary JSON schema."
},
{
"type": "record",
"name": "OrcSchema",
"namespace": "com.linkedin.schema",
"fields": [
{
"type": "string",
"name": "schema",
"doc": "The native schema for ORC file format."
}
],
"doc": "Schema text of an ORC schema."
},
{
"type": "record",
"name": "Schemaless",
"namespace": "com.linkedin.schema",
"fields": [],
"doc": "The dataset has no specific schema associated with it"
},
{
"type": "record",
"name": "KeyValueSchema",
"namespace": "com.linkedin.schema",
"fields": [
{
"type": "string",
"name": "keySchema",
"doc": "The raw schema for the key in the key-value store."
},
{
"type": "string",
"name": "valueSchema",
"doc": "The raw schema for the value in the key-value store."
}
],
"doc": "Schema text of a key-value store schema."
},
{
"type": "record",
"name": "OtherSchema",
"namespace": "com.linkedin.schema",
"fields": [
{
"type": "string",
"name": "rawSchema",
"doc": "The native schema in the dataset's platform."
}
],
"doc": "Schema holder for undefined schema types."
}
],
"name": "platformSchema",
"doc": "The native schema in the dataset's platform."
},
{
"type": {
"type": "array",
"items": {
"type": "record",
"name": "SchemaField",
"namespace": "com.linkedin.schema",
"fields": [
{
"Searchable": {
"boostScore": 5.0,
"fieldName": "fieldPaths",
"fieldType": "TEXT",
"queryByDefault": "true"
},
"type": "string",
"name": "fieldPath",
"doc": "Flattened name of the field. Field is computed from jsonPath field."
},
{
"Deprecated": true,
"type": [
"null",
"string"
],
"name": "jsonPath",
"default": null,
"doc": "Flattened name of a field in JSON Path notation."
},
{
"type": "boolean",
"name": "nullable",
"default": false,
"doc": "Indicates if this field is optional or nullable"
},
{
"Searchable": {
"boostScore": 0.1,
"fieldName": "fieldDescriptions",
"fieldType": "TEXT"
},
"type": [
"null",
"string"
],
"name": "description",
"default": null,
"doc": "Description"
},
{
"Deprecated": true,
"Searchable": {
"boostScore": 0.2,
"fieldName": "fieldLabels",
"fieldType": "TEXT"
},
"type": [
"null",
"string"
],
"name": "label",
"default": null,
"doc": "Label of the field. Provides a more human-readable name for the field than field path. Some sources will\nprovide this metadata but not all sources have the concept of a label. If just one string is associated with\na field in a source, that is most likely a description.\n\nNote that this field is deprecated and is not surfaced in the UI."
},
{
"type": [
"null",
"com.linkedin.common.AuditStamp"
],
"name": "created",
"default": null,
"doc": "An AuditStamp corresponding to the creation of this schema field."
},
{
"type": [
"null",
"com.linkedin.common.AuditStamp"
],
"name": "lastModified",
"default": null,
"doc": "An AuditStamp corresponding to the last modification of this schema field."
},
{
"type": {
"type": "record",
"name": "SchemaFieldDataType",
"namespace": "com.linkedin.schema",
"fields": [
{
"type": [
{
"type": "record",
"name": "BooleanType",
"namespace": "com.linkedin.schema",
"fields": [],
"doc": "Boolean field type."
},
{
"type": "record",
"name": "FixedType",
"namespace": "com.linkedin.schema",
"fields": [],
"doc": "Fixed field type."
},
{
"type": "record",
"name": "StringType",
"namespace": "com.linkedin.schema",
"fields": [],
"doc": "String field type."
},
{
"type": "record",
"name": "BytesType",
"namespace": "com.linkedin.schema",
"fields": [],
"doc": "Bytes field type."
},
{
"type": "record",
"name": "NumberType",
"namespace": "com.linkedin.schema",
"fields": [],
"doc": "Number data type: long, integer, short, etc.."
},
{
"type": "record",
"name": "DateType",
"namespace": "com.linkedin.schema",
"fields": [],
"doc": "Date field type."
},
{
"type": "record",
"name": "TimeType",
"namespace": "com.linkedin.schema",
"fields": [],
"doc": "Time field type. This should also be used for datetimes."
},
{
"type": "record",
"name": "EnumType",
"namespace": "com.linkedin.schema",
"fields": [],
"doc": "Enum field type."
},
{
"type": "record",
"name": "NullType",
"namespace": "com.linkedin.schema",
"fields": [],
"doc": "Null field type."
},
{
"type": "record",
"name": "MapType",
"namespace": "com.linkedin.schema",
"fields": [
{
"type": [
"null",
"string"
],
"name": "keyType",
"default": null,
"doc": "Key type in a map"
},
{
"type": [
"null",
"string"
],
"name": "valueType",
"default": null,
"doc": "Type of the value in a map"
}
],
"doc": "Map field type."
},
{
"type": "record",
"name": "ArrayType",
"namespace": "com.linkedin.schema",
"fields": [
{
"type": [
"null",
{
"type": "array",
"items": "string"
}
],
"name": "nestedType",
"default": null,
"doc": "List of types this array holds."
}
],
"doc": "Array field type."
},
{
"type": "record",
"name": "UnionType",
"namespace": "com.linkedin.schema",
"fields": [
{
"type": [
"null",
{
"type": "array",
"items": "string"
}
],
"name": "nestedTypes",
"default": null,
"doc": "List of types in union type."
}
],
"doc": "Union field type."
},
{
"type": "record",
"name": "RecordType",
"namespace": "com.linkedin.schema",
"fields": [],
"doc": "Record field type."
}
],
"name": "type",
"doc": "Data platform specific types"
}
],
"doc": "Schema field data types"
},
"name": "type",
"doc": "Platform independent field type of the field."
},
{
"type": "string",
"name": "nativeDataType",
"doc": "The native type of the field in the dataset's platform as declared by platform schema."
},
{
"type": "boolean",
"name": "recursive",
"default": false,
"doc": "There are use cases when a field in type B references type A. A field in A references field of type B. In such cases, we will mark the first field as recursive."
},
{
"Relationship": {
"/tags/*/tag": {
"entityTypes": [
"tag"
],
"name": "SchemaFieldTaggedWith"
}
},
"Searchable": {
"/tags/*/attribution/actor": {
"fieldName": "fieldTagAttributionActors",
"fieldType": "URN"
},
"/tags/*/attribution/source": {
"fieldName": "fieldTagAttributionSources",
"fieldType": "URN"
},
"/tags/*/attribution/time": {
"fieldName": "fieldTagAttributionDates",
"fieldType": "DATETIME"
},
"/tags/*/tag": {
"boostScore": 0.5,
"fieldName": "fieldTags",
"fieldType": "URN"
}
},
"type": [
"null",
{
"type": "record",
"Aspect": {
"name": "globalTags"
},
"name": "GlobalTags",
"namespace": "com.linkedin.common",
"fields": [
{
"Relationship": {
"/*/tag": {
"entityTypes": [
"tag"
],
"name": "TaggedWith"
}
},
"Searchable": {
"/*/tag": {
"addToFilters": true,
"boostScore": 0.5,
"fieldName": "tags",
"fieldType": "URN",
"filterNameOverride": "Tag",
"hasValuesFieldName": "hasTags",
"queryByDefault": true
}
},
"type": {
"type": "array",
"items": {
"type": "record",
"name": "TagAssociation",
"namespace": "com.linkedin.common",
"fields": [
{
"java": {
"class": "com.linkedin.common.urn.TagUrn"
},
"type": "string",
"name": "tag",
"doc": "Urn of the applied tag"
},
{
"type": [
"null",
"string"
],
"name": "context",
"default": null,
"doc": "Additional context about the association"
},
{
"Searchable": {
"/actor": {
"fieldName": "tagAttributionActors",
"fieldType": "URN"
},
"/source": {
"fieldName": "tagAttributionSources",
"fieldType": "URN"
},
"/time": {
"fieldName": "tagAttributionDates",
"fieldType": "DATETIME"
}
},
"type": [
"null",
{
"type": "record",
"name": "MetadataAttribution",
"namespace": "com.linkedin.common",
"fields": [
{
"type": "long",
"name": "time",
"doc": "When this metadata was updated."
},
{
"java": {
"class": "com.linkedin.common.urn.Urn"
},
"type": "string",
"name": "actor",
"doc": "The entity (e.g. a member URN) responsible for applying the assocated metadata. This can\neither be a user (in case of UI edits) or the datahub system for automation."
},
{
"java": {
"class": "com.linkedin.common.urn.Urn"
},
"type": [
"null",
"string"
],
"name": "source",
"default": null,
"doc": "The DataHub source responsible for applying the associated metadata. This will only be filled out\nwhen a DataHub source is responsible. This includes the specific metadata test urn, the automation urn."
},
{
"type": {
"type": "map",
"values": "string"
},
"name": "sourceDetail",
"default": {},
"doc": "The details associated with why this metadata was applied. For example, this could include\nthe actual regex rule, sql statement, ingestion pipeline ID, etc."
}
]