The Kafka emitter or Rest emitter can be used to push metadata to DataHub. The DataHub graph client extends the Rest emitter with additional functionality.

class datahub.emitter.rest_emitter.DataHubRestEmitter(gms_server, token=None, timeout_sec=None, connect_timeout_sec=None, read_timeout_sec=None, retry_status_codes=None, retry_methods=None, retry_max_times=None, extra_headers=None, ca_certificate_path=None, client_certificate_path=None, disable_ssl_verification=False)

Bases: Closeable, Emitter

Parameters:
  • gms_server (str)

  • token (Optional[str])

  • timeout_sec (Optional[float])

  • connect_timeout_sec (Optional[float])

  • read_timeout_sec (Optional[float])

  • retry_status_codes (Optional[List[int]])

  • retry_methods (Optional[List[str]])

  • retry_max_times (Optional[int])

  • extra_headers (Optional[Dict[str, str]])

  • ca_certificate_path (Optional[str])

  • client_certificate_path (Optional[str])

  • disable_ssl_verification (bool)

test_connection()
Return type:

None

get_server_config()
Return type:

dict

to_graph()
Return type:

DataHubGraph

emit(item, callback=None)
Parameters:
Return type:

None

emit_mce(mce)
Parameters:

mce (MetadataChangeEventClass)

Return type:

None

emit_mcp(mcp)
Parameters:

mcp (Union[MetadataChangeProposalClass, MetadataChangeProposalWrapper])

Return type:

None

emit_usage(usageStats)
Parameters:

usageStats (UsageAggregationClass)

Return type:

None

flush()
Return type:

None

close()
Return type:

None

datahub.emitter.rest_emitter.DatahubRestEmitter

alias of DataHubRestEmitter

class datahub.emitter.kafka_emitter.KafkaEmitterConfig(**data)

Bases: ConfigModel

Parameters:
  • data (Any)

  • connection (KafkaProducerConnectionConfig)

  • topic_routes (Dict[str, str])

connection: KafkaProducerConnectionConfig
topic_routes: Dict[str, str]
classmethod validate_topic_routes(v)
Parameters:

v (Dict[str, str])

Return type:

Dict[str, str]

model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {'_schema_extra': <function ConfigModel.Config._schema_extra>, 'extra': 'forbid', 'ignored_types': (<class 'cached_property.cached_property'>,), 'json_schema_extra': <function ConfigModel.Config._schema_extra>}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'connection': FieldInfo(annotation=KafkaProducerConnectionConfig, required=False, default_factory=KafkaProducerConnectionConfig), 'topic_routes': FieldInfo(annotation=Dict[str, str], required=False, default={'MetadataChangeEvent': 'MetadataChangeEvent_v4', 'MetadataChangeProposal': 'MetadataChangeProposal_v1'})}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

class datahub.emitter.kafka_emitter.DatahubKafkaEmitter(config)

Bases: Closeable, Emitter

Parameters:

config (KafkaEmitterConfig)

emit(item, callback=None)
Parameters:
Return type:

None

emit_mce_async(mce, callback)
Parameters:
Return type:

None

emit_mcp_async(mcp, callback)
Parameters:
Return type:

None

flush()
Return type:

None

close()
Return type:

None

class datahub.ingestion.graph.client.DatahubClientConfig(**data)

Bases: ConfigModel

Configuration class for holding connectivity to datahub gms

Parameters:
  • data (Any)

  • server (str)

  • token (str | None)

  • timeout_sec (int | None)

  • retry_status_codes (List[int] | None)

  • retry_max_times (int | None)

  • extra_headers (Dict[str, str] | None)

  • ca_certificate_path (str | None)

  • client_certificate_path (str | None)

  • disable_ssl_verification (bool)

server: str
token: Optional[str]
timeout_sec: Optional[int]
retry_status_codes: Optional[List[int]]
retry_max_times: Optional[int]
extra_headers: Optional[Dict[str, str]]
ca_certificate_path: Optional[str]
client_certificate_path: Optional[str]
disable_ssl_verification: bool
model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {'_schema_extra': <function ConfigModel.Config._schema_extra>, 'extra': 'forbid', 'ignored_types': (<class 'cached_property.cached_property'>,), 'json_schema_extra': <function ConfigModel.Config._schema_extra>}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[dict[str, FieldInfo]] = {'ca_certificate_path': FieldInfo(annotation=Union[str, NoneType], required=False, default=None), 'client_certificate_path': FieldInfo(annotation=Union[str, NoneType], required=False, default=None), 'disable_ssl_verification': FieldInfo(annotation=bool, required=False, default=False), 'extra_headers': FieldInfo(annotation=Union[Dict[str, str], NoneType], required=False, default=None), 'retry_max_times': FieldInfo(annotation=Union[int, NoneType], required=False, default=None), 'retry_status_codes': FieldInfo(annotation=Union[List[int], NoneType], required=False, default=None), 'server': FieldInfo(annotation=str, required=False, default='http://localhost:8080'), 'timeout_sec': FieldInfo(annotation=Union[int, NoneType], required=False, default=None), 'token': FieldInfo(annotation=Union[str, NoneType], required=False, default=None)}

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo].

This replaces Model.__fields__ from Pydantic V1.

datahub.ingestion.graph.client.DataHubGraphConfig

alias of DatahubClientConfig

class datahub.ingestion.graph.client.RelatedEntity(urn, relationship_type, via=None)

Bases: object

Parameters:
  • urn (str)

  • relationship_type (str)

  • via (Optional[str])

urn: str
relationship_type: str
via: Optional[str] = None
class datahub.ingestion.graph.client.DataHubGraph(config)

Bases: DataHubRestEmitter

Parameters:

config (DatahubClientConfig)

test_connection()
Return type:

None

classmethod from_emitter(emitter)
Parameters:

emitter (DataHubRestEmitter)

Return type:

DataHubGraph

make_rest_sink(run_id='__datahub-graph-client')
Parameters:

run_id (str)

Return type:

Iterator[DatahubRestSink]

emit_all(items, run_id='__datahub-graph-client')

Emit all items in the iterable using multiple threads.

Parameters:
Return type:

None

get_aspect(entity_urn, aspect_type, version=0)

Get an aspect for an entity.

Parameters:
  • entity_urn (str) – The urn of the entity

  • aspect_type (Type[TypeVar(Aspect, bound= _Aspect)]) – The type class of the aspect being requested (e.g. datahub.metadata.schema_classes.DatasetProperties)

  • version (int) – The version of the aspect to retrieve. The default of 0 means latest. Versions > 0 go from oldest to newest, so 1 is the oldest.

Return type:

Optional[TypeVar(Aspect, bound= _Aspect)]

Returns:

the Aspect as a dictionary if present, None if no aspect was found (HTTP status 404)

Raises:
  • TypeError – if the aspect type is a timeseries aspect

  • HttpError – if the HTTP response is not a 200 or a 404

get_aspect_v2(entity_urn, aspect_type, aspect, aspect_type_name=None, version=0)
Parameters:
  • entity_urn (str)

  • aspect_type (Type[TypeVar(Aspect, bound= _Aspect)])

  • aspect (str)

  • aspect_type_name (Optional[str])

  • version (int)

Return type:

Optional[TypeVar(Aspect, bound= _Aspect)]

get_config()
Return type:

Dict[str, Any]

get_ownership(entity_urn)
Parameters:

entity_urn (str)

Return type:

Optional[OwnershipClass]

get_schema_metadata(entity_urn)
Parameters:

entity_urn (str)

Return type:

Optional[SchemaMetadataClass]

get_domain_properties(entity_urn)
Parameters:

entity_urn (str)

Return type:

Optional[DomainPropertiesClass]

get_dataset_properties(entity_urn)
Parameters:

entity_urn (str)

Return type:

Optional[DatasetPropertiesClass]

get_tags(entity_urn)
Parameters:

entity_urn (str)

Return type:

Optional[GlobalTagsClass]

get_glossary_terms(entity_urn)
Parameters:

entity_urn (str)

Return type:

Optional[GlossaryTermsClass]

get_domain(entity_urn)
Parameters:

entity_urn (str)

Return type:

Optional[DomainsClass]

get_browse_path(entity_urn)
Parameters:

entity_urn (str)

Return type:

Optional[BrowsePathsClass]

get_usage_aspects_from_urn(entity_urn, start_timestamp, end_timestamp)
Parameters:
  • entity_urn (str)

  • start_timestamp (int)

  • end_timestamp (int)

Return type:

Optional[List[DatasetUsageStatisticsClass]]

list_all_entity_urns(entity_type, start, count)
Parameters:
  • entity_type (str)

  • start (int)

  • count (int)

Return type:

Optional[List[str]]

get_latest_timeseries_value(entity_urn, aspect_type, filter_criteria_map)
Parameters:
  • entity_urn (str)

  • aspect_type (Type[TypeVar(Aspect, bound= _Aspect)])

  • filter_criteria_map (Dict[str, str])

Return type:

Optional[TypeVar(Aspect, bound= _Aspect)]

get_entity_raw(entity_urn, aspects=None)
Parameters:
  • entity_urn (str)

  • aspects (Optional[List[str]])

Return type:

Dict

get_aspects_for_entity(entity_urn, aspects, aspect_types)

Get multiple aspects for an entity.

Deprecated in favor of get_aspect (single aspect) or get_entity_semityped (full entity without manually specifying a list of aspects).

Warning: Do not use this method to determine if an entity exists! This method will always return an entity, even if it doesn’t exist. This is an issue with how DataHub server responds to these calls, and will be fixed automatically when the server-side issue is fixed.

Parameters:
  • entity_urn (str) – The urn of the entity

  • aspect_type_list (List[Type[Aspect]]) – List of aspect type classes being requested (e.g. [datahub.metadata.schema_classes.DatasetProperties])

  • aspects_list (List[str]) – List of aspect names being requested (e.g. [schemaMetadata, datasetProperties])

  • entity_urn

  • aspects (List[str])

  • aspect_types (List[Type[TypeVar(Aspect, bound= _Aspect)]])

Return type:

Dict[str, Optional[TypeVar(Aspect, bound= _Aspect)]]

Returns:

Optionally, a map of aspect_name to aspect_value as a dictionary if present, aspect_value will be set to None if that aspect was not found. Returns None on HTTP status 404.

Raises:

HttpError – if the HTTP response is not a 200

get_entity_semityped(entity_urn)

Get all non-timeseries aspects for an entity (experimental).

This method is called “semityped” because it returns aspects as a dictionary of properly typed objects. While the returned dictionary is constrained using a TypedDict, the return type is still fairly loose.

Warning: Do not use this method to determine if an entity exists! This method will always return something, even if the entity doesn’t actually exist in DataHub.

Parameters:

entity_urn (str) – The urn of the entity

Return type:

AspectBag

Returns:

A dictionary of aspect name to aspect value. If an aspect is not found, it will not be present in the dictionary. The entity’s key aspect will always be present.

get_domain_urn_by_name(domain_name)

Retrieve a domain urn based on its name. Returns None if there is no match found

Parameters:

domain_name (str)

Return type:

Optional[str]

get_container_urns_by_filter(env=None, search_query='*')

Return container urns that match based on query

Parameters:
  • env (Optional[str])

  • search_query (str)

Return type:

Iterable[str]

get_urns_by_filter(*, entity_types=None, platform=None, platform_instance=None, env=None, query=None, container=None, status=RemovedStatusFilter.NOT_SOFT_DELETED, batch_size=10000, extraFilters=None)

Fetch all urns that match all of the given filters.

Filters are combined conjunctively. If multiple filters are specified, the results will match all of them. Note that specifying a platform filter will automatically exclude all entity types that do not have a platform. The same goes for the env filter.

Parameters:
  • entity_types (Optional[List[str]]) – List of entity types to include. If None, all entity types will be returned.

  • platform (Optional[str]) – Platform to filter on. If None, all platforms will be returned.

  • platform_instance (Optional[str]) – Platform instance to filter on. If None, all platform instances will be returned.

  • env (Optional[str]) – Environment (e.g. PROD, DEV) to filter on. If None, all environments will be returned.

  • query (Optional[str]) – Query string to filter on. If None, all entities will be returned.

  • container (Optional[str]) – A container urn that entities must be within. This works recursively, so it will include entities within sub-containers as well. If None, all entities will be returned. Note that this requires browsePathV2 aspects (added in 0.10.4+).

  • status (RemovedStatusFilter) – Filter on the deletion status of the entity. The default is only return non-soft-deleted entities.

  • extraFilters (Optional[List[Dict[str, Any]]]) – Additional filters to apply. If specified, the results will match all of the filters.

  • batch_size (int)

Return type:

Iterable[str]

Returns:

An iterable of urns that match the filters.

get_latest_pipeline_checkpoint(pipeline_name, platform)
Parameters:
  • pipeline_name (str)

  • platform (str)

Return type:

Optional[Checkpoint[GenericCheckpointState]]

get_search_results(start=0, count=1, entity='dataset')
Parameters:
  • start (int)

  • count (int)

  • entity (str)

Return type:

Dict

get_aspect_counts(aspect, urn_like=None)
Parameters:
  • aspect (str)

  • urn_like (Optional[str])

Return type:

int

execute_graphql(query, variables=None, operation_name=None)
Parameters:
  • query (str)

  • variables (Optional[Dict])

  • operation_name (Optional[str])

Return type:

Dict

class RelationshipDirection(value)

Bases: str, Enum

An enumeration.

INCOMING = 'INCOMING'
OUTGOING = 'OUTGOING'
Parameters:
Return type:

Iterable[RelatedEntity]

exists(entity_urn)
Parameters:

entity_urn (str)

Return type:

bool

soft_delete_entity(urn, run_id='__datahub-graph-client', deletion_timestamp=None)

Soft-delete an entity by urn.

Parameters:
  • urn (str) – The urn of the entity to soft-delete.

  • run_id (str)

  • deletion_timestamp (Optional[int])

Return type:

None

hard_delete_entity(urn)

Hard delete an entity by urn.

Parameters:

urn (str) – The urn of the entity to hard delete.

Return type:

Tuple[int, int]

Returns:

A tuple of (rows_affected, timeseries_rows_affected).

delete_entity(urn, hard=False)

Delete an entity by urn.

Parameters:
  • urn (str) – The urn of the entity to delete.

  • hard (bool) – Whether to hard delete the entity. If False (default), the entity will be soft deleted.

Return type:

None

hard_delete_timeseries_aspect(urn, aspect_name, start_time, end_time)

Hard delete timeseries aspects of an entity.

Parameters:
  • urn (str) – The urn of the entity.

  • aspect_name (str) – The name of the timeseries aspect to delete.

  • start_time (Optional[datetime]) – The start time of the timeseries data to delete. If not specified, defaults to the beginning of time.

  • end_time (Optional[datetime]) – The end time of the timeseries data to delete. If not specified, defaults to the end of time.

Return type:

int

Returns:

The number of timeseries rows affected.

delete_references_to_urn(urn, dry_run=False)

Delete references to a given entity.

This is useful for cleaning up references to an entity that is about to be deleted. For example, when deleting a tag, you might use this to remove that tag from all other entities that reference it.

This does not delete the entity itself.

Parameters:
  • urn (str) – The urn of the entity to delete references to.

  • dry_run (bool) – If True, do not actually delete the references, just return the count of references and the list of related aspects.

Return type:

Tuple[int, List[Dict]]

Returns:

A tuple of (reference_count, sample of related_aspects).

initialize_schema_resolver_from_datahub(platform, platform_instance, env, batch_size=100)
Parameters:
  • platform (str)

  • platform_instance (Optional[str])

  • env (str)

  • batch_size (int)

Return type:

SchemaResolver

parse_sql_lineage(sql, *, platform, platform_instance=None, env='PROD', default_db=None, default_schema=None)
Parameters:
  • sql (str)

  • platform (str)

  • platform_instance (Optional[str])

  • env (str)

  • default_db (Optional[str])

  • default_schema (Optional[str])

Return type:

SqlParsingResult

create_tag(tag_name)
Parameters:

tag_name (str)

Return type:

str

close()
Return type:

None

datahub.ingestion.graph.client.get_default_graph()
Return type:

DataHubGraph