DataHub Components Overview
The DataHub platform consists of the components shown in the following diagram.
The Metadata Store is responsible for storing the Entities & Aspects comprising the Metadata Graph. This includes exposing an API for ingesting metadata, fetching Metadata by primary key, searching entities, and fetching Relationships between entities. It consists of a Spring Java Service hosting a set of Rest.li API endpoints, along with MySQL, Elasticsearch, & Kafka for primary storage & indexing.
Get started with the Metadata Store by following the Quickstart Guide.
Metadata Models are schemas defining the shape of the Entities & Aspects comprising the Metadata Graph, along with the relationships between them. They are defined using PDL, a modeling language quite similar in form to Protobuf that serializes to JSON. Entities represent a specific class of Metadata Asset, such as a Dataset, a Dashboard, a Data Pipeline, and beyond. Each instance of an Entity is identified by a unique identifier called an urn. Aspects represent related bundles of data attached to an instance of an Entity, such as its descriptions, tags, and more. View the current set of Entities supported here.
Learn more about how DataHub models Metadata here.
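To make the Entity, urn, and Aspect concepts above more concrete, here is a minimal sketch in plain Python. It is not DataHub's model code: the `Entity` class and the `make_dataset_urn` helper are hypothetical, the urn layout is an assumption based on the commonly documented dataset urn pattern, and the aspect names and values are purely illustrative.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


def make_dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a dataset urn string (hypothetical helper, assumed layout)."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


@dataclass
class Entity:
    """Toy in-memory stand-in for one Entity instance and its Aspects."""

    urn: str
    aspects: Dict[str, Dict[str, Any]] = field(default_factory=dict)

    def upsert_aspect(self, aspect_name: str, value: Dict[str, Any]) -> None:
        # Each Aspect is stored under its name; writing it again replaces it.
        self.aspects[aspect_name] = value


# One Dataset instance, identified by its urn, with two Aspects attached.
dataset = Entity(urn=make_dataset_urn("snowflake", "analytics.public.orders"))
dataset.upsert_aspect("datasetProperties", {"description": "One row per customer order."})
dataset.upsert_aspect("globalTags", {"tags": [{"tag": "urn:li:tag:pii"}]})

print(dataset.urn)
# urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.public.orders,PROD)
```

The point of the sketch is the shape: the urn is the primary key, and everything else known about the asset hangs off it as independently replaceable Aspects.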
The Ingestion Framework is a modular, extensible Python library for extracting Metadata from external source systems (e.g. Snowflake, Looker, MySQL, Kafka), transforming it into DataHub's Metadata Model, and writing it into DataHub via either Kafka or using the Metadata Store Rest APIs directly. DataHub supports an extensive list of Source connectors to choose from, along with a host of capabilities including schema extraction, table & column profiling, usage information extraction, and more.
Getting started with the Ingestion Framework is simple: just define a YAML file and execute the datahub ingest command.
Learn more by heading over to the Metadata Ingestion guide.
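As a rough illustration of the "define a YAML file, then run the command" workflow, the sketch below writes a small recipe to disk. The source type, hostnames, database name, credentials, and server URL are all placeholder assumptions for a local setup, and the exact config keys a given connector accepts should be checked against that connector's documentation.

```python
from pathlib import Path

# Recipe text: a source to read metadata from and a sink to write it to.
# Every value below is a placeholder (assumption), not a documented default.
RECIPE = """\
source:
  type: mysql
  config:
    host_port: localhost:3306
    database: example_db
    username: datahub_reader
    password: ${MYSQL_PASSWORD}
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080
"""


def write_recipe(path: str = "recipe.yml") -> Path:
    """Write the recipe text to disk and return its path."""
    recipe_path = Path(path)
    recipe_path.write_text(RECIPE, encoding="utf-8")
    return recipe_path


recipe_path = write_recipe()
# The recipe would then be run from a shell with:
#   datahub ingest -c recipe.yml
print(f"wrote {recipe_path}")
```

The sink here writes through the Metadata Store's REST API; writing via Kafka instead would mean swapping the sink section, leaving the source untouched.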
The GraphQL API provides a strongly-typed, entity-oriented API that makes interacting with the Entities comprising the Metadata Graph simple, including APIs for adding tags, owners, links & more to Metadata Entities, and for removing them! Most notably, this API is consumed by the User Interface (discussed below) for enabling Search & Discovery, Governance, Observability and more.
To get started using the GraphQL API, check out the Getting Started with GraphQL guide.
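For a feel of what a call to the GraphQL API might look like, here is a hedged sketch that builds (but does not send) a search request using only the Python standard library. The endpoint URL, the query shape, and the field names are assumptions for a local deployment and should be verified against the schema your instance actually exposes; `build_search_request` is a hypothetical helper.

```python
import json
import urllib.request

# Assumed endpoint for a local deployment; adjust for your environment.
GRAPHQL_URL = "http://localhost:8080/api/graphql"

# Assumed query shape: search Datasets by text and return their urns.
SEARCH_QUERY = """
query searchDatasets($text: String!) {
  search(input: {type: DATASET, query: $text, start: 0, count: 5}) {
    total
    searchResults {
      entity {
        urn
        type
      }
    }
  }
}
"""


def build_search_request(text: str, token: str = "") -> urllib.request.Request:
    """Build a POST request carrying the GraphQL query and its variables."""
    payload = json.dumps({"query": SEARCH_QUERY, "variables": {"text": text}})
    headers = {"Content-Type": "application/json"}
    if token:
        # Only attach credentials when a token is supplied.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        GRAPHQL_URL, data=payload.encode("utf-8"), headers=headers, method="POST"
    )


request = build_search_request("orders")
# To send it against a running instance:
#   with urllib.request.urlopen(request) as response:
#       print(json.load(response))
```

Building the request separately from sending it keeps the example runnable without a live server and makes the payload easy to inspect.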
DataHub comes with a React UI including an ever-evolving set of features to make Discovering, Governing, & Debugging your Data Assets easy & delightful. For a full overview of the capabilities currently supported, take a look at the Features overview. For a look at what's coming next, head over to the Roadmap.
Learn more about the specifics of the DataHub Architecture in the Architecture Overview. Learn about using & developing the components of the Platform by visiting the Module READMEs.
Feedback / Questions / Concerns
We want to hear from you! For any inquiries, including Feedback, Questions, or Concerns, reach out on Slack!