Explore key concepts of DataHub to take full advantage of its capabilities in managing your data.
URN (Uniform Resource Name)
URN (Uniform Resource Name) is the URI scheme DataHub uses to uniquely identify any resource. It has the following form: urn:<Namespace>:<Entity Type>:<ID>
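As a rough illustration, assuming the general form urn:<Namespace>:<Entity Type>:<ID> and the `li` namespace that DataHub URNs use, a URN can be assembled and split with plain string handling. The helper names `make_urn` and `split_urn` are hypothetical and not part of any DataHub library.

```python
from typing import NamedTuple


class UrnParts(NamedTuple):
    namespace: str
    entity_type: str
    entity_id: str


def make_urn(entity_type: str, entity_id: str, namespace: str = "li") -> str:
    """Assemble a URN string of the form urn:<namespace>:<entity type>:<id>."""
    return f"urn:{namespace}:{entity_type}:{entity_id}"


def split_urn(urn: str) -> UrnParts:
    """Split a URN into its parts; the ID may itself contain colons."""
    # Split on the first three colons only, so a nested URN in the ID stays intact.
    prefix, namespace, entity_type, entity_id = urn.split(":", 3)
    if prefix != "urn":
        raise ValueError(f"not a URN: {urn!r}")
    return UrnParts(namespace, entity_type, entity_id)


tag_urn = make_urn("tag", "pii")
print(tag_urn)
print(split_urn(tag_urn))
```

Splitting on only the first three colons matters because some IDs, such as dataset keys, embed another URN.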
Access policies in DataHub define who can do what to which resources.
DataHub provides the ability to use Roles to manage permissions.
Access Token (Personal Access Token)
Personal Access Tokens, or PATs for short, allow users to represent themselves in code and use DataHub's APIs programmatically in deployments where security is a concern. Used alongside an authentication-enabled metadata service, PATs add a layer of protection to DataHub: only authorized users can perform actions in an automated way.
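A minimal sketch of how a token might be attached to an API call, assuming the token is sent as a Bearer credential in the Authorization header. The host, the `/api/graphql` path, and the query text are assumptions for illustration and should be checked against your deployment; `build_graphql_request` is a hypothetical helper. The request is only constructed here, not sent.

```python
import json
import urllib.request


def build_graphql_request(base_url: str, token: str, query: str) -> urllib.request.Request:
    """Prepare (but do not send) a POST request that carries the token as a Bearer credential."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/graphql",  # assumed endpoint path
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_graphql_request(
    "http://localhost:8080",             # placeholder host
    "<personal-access-token>",           # placeholder; never hard-code real tokens
    "{ me { corpUser { username } } }",  # assumed example query
)
print(request.get_header("Authorization"))
```

In practice the token would be read from an environment variable or a secret store rather than written into source code.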
Views allow you to save and share sets of filters for reuse when browsing DataHub. A view can either be public or personal.
Deprecation is an aspect that indicates the deprecation status of an entity. Typically it is expressed as a Boolean value.
Ingestion sources refer to the data systems that we are extracting metadata from. For example, we have sources for BigQuery, Looker, Tableau and many others.
A container of related data assets.
Data Platforms are systems or tools that contain Datasets, Dashboards, Charts, and all other kinds of data assets modeled in the metadata graph.
List of Data Platforms
- Azure Data Lake (Gen 1)
- Azure Data Lake (Gen 2)
- External Source
- SAP HANA
- AWS S3
- Kafka Connect
Reference: data_platforms.json
Datasets represent collections of data that are typically represented as Tables or Views in a database (e.g. BigQuery, Snowflake, Redshift), Streams in a stream-processing environment (e.g. Kafka, Pulsar), or bundles of data found as Files or Folders in data lake systems (e.g. S3, ADLS).
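To show how a Dataset ties back to its Data Platform, the sketch below composes a dataset URN from a platform name, a dataset name, and an environment, assuming the form urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>). The helper names and the example table name are hypothetical.

```python
def data_platform_urn(platform: str) -> str:
    """Build the URN of a data platform, e.g. bigquery or kafka."""
    return f"urn:li:dataPlatform:{platform}"


def dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a dataset URN whose ID is a (platform URN, name, environment) tuple."""
    return f"urn:li:dataset:({data_platform_urn(platform)},{name},{env})"


print(dataset_urn("bigquery", "my_project.my_dataset.my_table"))
```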
A single data visualization derived from a Dataset. A single Chart can be a part of multiple Dashboards. Charts can have tags, owners, links, glossary terms, and descriptions attached to them. Examples include a Superset or Looker Chart.
A collection of Charts for visualization. Dashboards can have tags, owners, links, glossary terms, and descriptions attached to them. Examples include a Superset or Mode Dashboard.
An executable job that processes data assets, where "processing" implies consuming data, producing data, or both. In orchestration systems, this is sometimes referred to as an individual "Task" within a "DAG". Examples include an Airflow Task.
An executable collection of Data Jobs with dependencies among them, or a DAG. Sometimes referred to as a "Pipeline". Examples include an Airflow DAG.
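The relationship between the two concepts can be sketched as a small data structure: a flow holds jobs, and each job names the jobs it depends on. This is a toy model for illustration only, not DataHub's schema; the class names, fields, and the `execution_order` method are assumptions.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class DataJob:
    name: str
    # Names of jobs in the same flow that must finish before this one starts.
    upstream: list[str] = field(default_factory=list)


@dataclass
class DataFlow:
    name: str
    jobs: list[DataJob] = field(default_factory=list)

    def execution_order(self) -> list[str]:
        """Return job names in an order that respects the declared dependencies."""
        graph = {job.name: set(job.upstream) for job in self.jobs}
        return list(TopologicalSorter(graph).static_order())


flow = DataFlow(
    name="daily_sales",
    jobs=[
        DataJob("extract"),
        DataJob("transform", upstream=["extract"]),
        DataJob("load", upstream=["transform"]),
    ],
)
print(flow.execution_order())
```

`TopologicalSorter` raises `CycleError` if the dependencies form a cycle, which matches the requirement that a flow be a DAG.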
Shared vocabulary within the data ecosystem.
Glossary Term Group
Glossary Term Group is similar to a folder, containing Terms and even other Term Groups to allow for a nested structure.
Tags are informal, loosely controlled labels that help in search and discovery. They can be added to datasets, dataset schemas, or containers as an easy way to label or categorize entities, without having to associate them with a broader business glossary or vocabulary.
Domains are curated, top-level folders or categories where related assets can be explicitly grouped.
Owner refers to the users or groups that have ownership rights over entities. For example, an owner can be assigned to a dataset or to a column of a dataset.
CorpUser represents an identity of a person (or an account) in the enterprise.
CorpGroup represents an identity of a group of users in the enterprise.
An entity is the primary node in the metadata graph. For example, an instance of a Dataset or a CorpUser is an Entity.
An aspect is a collection of attributes that describes a particular facet of an entity. Aspects can be shared across entities, for example "Ownership" is an aspect that is re-used across all the Entities that have owners.
A relationship represents a named edge between two entities. Relationships are declared via foreign key attributes within Aspects along with a custom annotation (@Relationship).
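How entities, aspects, and relationships fit together can be sketched with a toy in-memory model. This is an illustration under stated assumptions, not DataHub's implementation: the classes, the dictionary layout of the ownership aspect, and the `ownership_edges` helper are hypothetical, and the edge name "OwnedBy" is used only as an example label.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    urn: str
    # Aspect name -> attributes describing that facet of the entity.
    aspects: dict[str, dict] = field(default_factory=dict)


@dataclass(frozen=True)
class Relationship:
    source: str       # URN of the entity holding the aspect
    name: str         # label of the edge
    destination: str  # URN referenced by the foreign-key attribute


def ownership_edges(entity: Entity) -> list[Relationship]:
    """Derive one named edge per owner URN found in the entity's ownership aspect."""
    owners = entity.aspects.get("ownership", {}).get("owners", [])
    return [Relationship(entity.urn, "OwnedBy", owner) for owner in owners]


dataset = Entity(
    urn="urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)",
    aspects={"ownership": {"owners": ["urn:li:corpuser:jdoe"]}},
)
for edge in ownership_edges(dataset):
    print(edge)
```

The owner URN stored inside the aspect plays the role of the foreign key: the edge is not stored separately but derived from the aspect's contents.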