
Metadata Ingestion

Python version 3.6+

This module hosts an extensible Python-based metadata ingestion system for DataHub. This supports sending data to DataHub using Kafka or through the REST API. It can be used through our CLI tool, with an orchestrator like Airflow, or as a library.

Getting Started


Before running any metadata ingestion job, make sure that the DataHub backend services are all running. If you are trying this out locally, the easiest way to do that is through the quickstart Docker images.

Install from PyPI

The folks over at Acryl Data maintain a PyPI package for DataHub metadata ingestion.

# Requires Python 3.6+
python3 -m pip install --upgrade pip wheel setuptools
python3 -m pip install --upgrade acryl-datahub
datahub version
# If you see "command not found", try running this instead: python3 -m datahub version

If you run into an error, try checking the common setup issues.

Installing Plugins

We use a plugin architecture so that you can install only the dependencies you actually need. Click the plugin name to learn more about the specific source recipe and any FAQs!


Sources

| Plugin Name | Install Command | Provides |
| --- | --- | --- |
| file | included by default | File source and sink |
| athena | pip install 'acryl-datahub[athena]' | AWS Athena source |
| bigquery | pip install 'acryl-datahub[bigquery]' | BigQuery source |
| bigquery-usage | pip install 'acryl-datahub[bigquery-usage]' | BigQuery usage statistics source |
| datahub-business-glossary | no additional dependencies | Business Glossary File source |
| dbt | no additional dependencies | dbt source |
| druid | pip install 'acryl-datahub[druid]' | Druid source |
| feast | pip install 'acryl-datahub[feast]' | Feast source |
| glue | pip install 'acryl-datahub[glue]' | AWS Glue source |
| hive | pip install 'acryl-datahub[hive]' | Hive source |
| kafka | pip install 'acryl-datahub[kafka]' | Kafka source |
| kafka-connect | pip install 'acryl-datahub[kafka-connect]' | Kafka Connect source |
| ldap | pip install 'acryl-datahub[ldap]' (extra requirements) | LDAP source |
| looker | pip install 'acryl-datahub[looker]' | Looker source |
| lookml | pip install 'acryl-datahub[lookml]' | LookML source, requires Python 3.7+ |
| mongodb | pip install 'acryl-datahub[mongodb]' | MongoDB source |
| mssql | pip install 'acryl-datahub[mssql]' | SQL Server source |
| mysql | pip install 'acryl-datahub[mysql]' | MySQL source |
| mariadb | pip install 'acryl-datahub[mariadb]' | MariaDB source |
| openapi | pip install 'acryl-datahub[openapi]' | OpenAPI source |
| oracle | pip install 'acryl-datahub[oracle]' | Oracle source |
| postgres | pip install 'acryl-datahub[postgres]' | Postgres source |
| redash | pip install 'acryl-datahub[redash]' | Redash source |
| redshift | pip install 'acryl-datahub[redshift]' | Redshift source |
| sagemaker | pip install 'acryl-datahub[sagemaker]' | AWS SageMaker source |
| snowflake | pip install 'acryl-datahub[snowflake]' | Snowflake source |
| snowflake-usage | pip install 'acryl-datahub[snowflake-usage]' | Snowflake usage statistics source |
| sql-profiles | pip install 'acryl-datahub[sql-profiles]' | Data profiles for SQL-based systems |
| sqlalchemy | pip install 'acryl-datahub[sqlalchemy]' | Generic SQLAlchemy source |
| superset | pip install 'acryl-datahub[superset]' | Superset source |
| trino | pip install 'acryl-datahub[trino]' | Trino source |
| starburst-trino-usage | pip install 'acryl-datahub[starburst-trino-usage]' | Starburst Trino usage statistics source |


Sinks

| Plugin Name | Install Command | Provides |
| --- | --- | --- |
| file | included by default | File source and sink |
| console | included by default | Console sink |
| datahub-rest | pip install 'acryl-datahub[datahub-rest]' | DataHub sink over REST API |
| datahub-kafka | pip install 'acryl-datahub[datahub-kafka]' | DataHub sink over Kafka |

These plugins can be mixed and matched as desired. For example:

pip install 'acryl-datahub[bigquery,datahub-rest]'

You can check the active plugins:

datahub check plugins
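
If you provision environments from a script, the combined-extras form above can be assembled programmatically. The following is a minimal sketch, not part of the DataHub tooling: the helper name `pip_install_command` is hypothetical, and it only builds the argument list rather than running the install.

```python
import sys


def pip_install_command(plugins, package="acryl-datahub"):
    """Build an argv list that installs the package with the given plugin extras."""
    # Deduplicate and sort so the same plugin set always yields the same command.
    extras = ",".join(sorted(set(plugins)))
    requirement = f"{package}[{extras}]" if extras else package
    return [sys.executable, "-m", "pip", "install", "--upgrade", requirement]


command = pip_install_command(["bigquery", "datahub-rest"])
print(" ".join(command))
# To perform the install, pass the list to subprocess.run(command, check=True).
```

Building a list instead of a single string avoids shell quoting problems with the square brackets.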

Basic Usage

pip install 'acryl-datahub[datahub-rest]'  # install the required plugin
datahub ingest -c ./examples/recipes/example_to_datahub_rest.yml

The --dry-run option of the ingest command performs all of the ingestion steps except writing to the sink. This is useful for checking that the ingestion recipe produces the desired workunits before ingesting them into DataHub.

# Dry run
datahub ingest -c ./examples/recipes/example_to_datahub_rest.yml --dry-run
# Short-form
datahub ingest -c ./examples/recipes/example_to_datahub_rest.yml -n

The --preview option of the ingest command performs all of the ingestion steps, but limits the processing to only the first 10 workunits produced by the source. This option helps with quick end-to-end smoke testing of the ingestion recipe.

# Preview
datahub ingest -c ./examples/recipes/example_to_datahub_rest.yml --preview
# Preview with dry-run
datahub ingest -c ./examples/recipes/example_to_datahub_rest.yml -n --preview
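
When the CLI is driven from a test or deployment script, the same flags can be composed in Python. This is a rough sketch under the assumption that the `datahub` executable is on the PATH. The helper names `ingest_command` and `run_ingest` are hypothetical, and only the `-c`, `--dry-run`, and `--preview` options described above are used.

```python
import subprocess


def ingest_command(recipe_path, dry_run=False, preview=False):
    """Return the argv list for `datahub ingest` with the requested options."""
    command = ["datahub", "ingest", "-c", recipe_path]
    if dry_run:
        command.append("--dry-run")  # run every step except writing to the sink
    if preview:
        command.append("--preview")  # limit processing to the first workunits
    return command


def run_ingest(recipe_path, dry_run=False, preview=False):
    """Run the ingestion and raise CalledProcessError on a non-zero exit code."""
    return subprocess.run(ingest_command(recipe_path, dry_run, preview), check=True)


smoke_test = ingest_command(
    "./examples/recipes/example_to_datahub_rest.yml", dry_run=True, preview=True
)
print(" ".join(smoke_test))
```

Combining both flags gives a quick smoke test that neither processes the full source nor writes anything to DataHub.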

Install using Docker


If you don't want to install locally, you can instead run metadata ingestion within a Docker container. We have prebuilt images available on Docker Hub. All plugins will be installed and enabled automatically.

Limitation: the convenience script assumes that the recipe and any input/output files are accessible in the current working directory or its subdirectories. Files outside the current working directory will not be found, and you'll need to invoke the Docker image directly.

# Assumes the DataHub repo is cloned locally.
./metadata-ingestion/scripts/ ingest -c ./examples/recipes/example_to_datahub_rest.yml

Install from source

If you'd like to install from source, see the developer guide.


A recipe is a configuration file that tells our ingestion scripts where to pull data from (source) and where to put it (sink). Here's a simple example that pulls metadata from MSSQL and puts it into DataHub.

# A sample recipe that pulls metadata from MSSQL and puts it into DataHub
# using the REST API.
source:
  type: mssql
  config:
    username: sa
    password: ${MSSQL_PASSWORD}
    database: DemoData

transformers:
  - type: "fully-qualified-class-name-of-transformer"
    config:
      some_property: "some.value"

sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"

We automatically expand environment variables in the config, similar to variable substitution in GNU bash or in docker-compose files.
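
To get a feel for what this substitution does to a recipe, the sketch below applies Python's standard `os.path.expandvars` to a small recipe fragment. It only illustrates the `${VAR}` behavior and is not DataHub's own implementation, which may handle edge cases such as unset variables differently. The password value is a placeholder.

```python
import os

# A fragment of recipe text containing a variable reference.
recipe_fragment = (
    "username: sa\n"
    "password: ${MSSQL_PASSWORD}\n"
    "database: DemoData\n"
)

# Placeholder value for the demonstration; normally this is set outside the script.
os.environ["MSSQL_PASSWORD"] = "example-password"

# expandvars replaces ${NAME} with the value of the environment variable NAME.
expanded = os.path.expandvars(recipe_fragment)
print(expanded)
```

Keeping secrets in environment variables means the recipe file itself can be committed to version control without exposing credentials.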

Running a recipe is quite easy.

datahub ingest -c ./examples/recipes/mssql_to_datahub.yml

A number of recipes are included in the examples/recipes directory. For full info and context on each source and sink, see the pages described in the table of plugins.


If you'd like to modify data before it reaches the ingestion sinks – for instance, adding additional owners or tags – you can use a transformer to write your own module and integrate it with DataHub.

Check out the transformers guide for more info!

Using as a library

In some cases, you might want to construct the MetadataChangeEvents yourself but still use this framework to emit that metadata to DataHub. In this case, take a look at the emitter interfaces, which can easily be imported and called from your own code.
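
As a rough illustration of the library route, the sketch below emits a single dataset description over REST. It assumes the `datahub-rest` plugin is installed and a DataHub server is reachable. The function name `emit_dataset_description` and the example dataset are hypothetical, and exact class signatures may differ between `acryl-datahub` versions.

```python
def emit_dataset_description(gms_server, platform, dataset_name, description):
    """Emit a datasetProperties aspect for one dataset through the REST emitter."""
    # Imported inside the function so the module loads even without acryl-datahub.
    import datahub.emitter.mce_builder as builder
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        ChangeTypeClass,
        DatasetPropertiesClass,
    )

    emitter = DatahubRestEmitter(gms_server=gms_server)
    proposal = MetadataChangeProposalWrapper(
        entityType="dataset",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn=builder.make_dataset_urn(platform, dataset_name),
        aspectName="datasetProperties",
        aspect=DatasetPropertiesClass(description=description),
    )
    emitter.emit_mcp(proposal)


# Example call (requires a running DataHub instance):
# emit_dataset_description(
#     "http://localhost:8080", "mssql", "DemoData.dbo.orders", "Order facts"
# )
```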

Sample code


The following samples will cover emitting dataset-to-dataset, dataset-to-job-to-dataset, chart-to-dataset, dashboard-to-chart and job-to-dataflow lineages.


  • Emitting aspects as MetadataChangeProposalWrapper is recommended over emitting aspects via the MetadataChangeEvent.
  • Emitting any aspect associated with an entity completely overwrites the previous value of the aspect associated with the entity. This means that emitting a lineage aspect associated with a dataset will overwrite lineage edges that already exist.
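
Because each emit replaces the whole aspect, adding one lineage edge safely means sending the full set of upstreams, old and new together. The sketch below shows that merge-then-emit pattern under stated assumptions: how the existing upstream URNs are fetched is left to the caller, the helper names are hypothetical, and the audit actor URN is a placeholder.

```python
import time


def merge_upstreams(existing_urns, new_urns):
    """Combine upstream URNs, keeping the original order and dropping duplicates."""
    return list(dict.fromkeys(list(existing_urns) + list(new_urns)))


def emit_upstream_lineage(gms_server, downstream_urn, upstream_urns):
    """Emit the complete upstreamLineage aspect for a dataset (overwrites the old one)."""
    # Imported inside the function so the module loads even without acryl-datahub.
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        AuditStampClass,
        ChangeTypeClass,
        DatasetLineageTypeClass,
        UpstreamClass,
        UpstreamLineageClass,
    )

    stamp = AuditStampClass(
        time=int(time.time() * 1000),
        actor="urn:li:corpuser:datahub",  # placeholder actor
    )
    upstreams = [
        UpstreamClass(
            dataset=urn, type=DatasetLineageTypeClass.TRANSFORMED, auditStamp=stamp
        )
        for urn in upstream_urns
    ]
    proposal = MetadataChangeProposalWrapper(
        entityType="dataset",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn=downstream_urn,
        aspectName="upstreamLineage",
        aspect=UpstreamLineageClass(upstreams=upstreams),
    )
    DatahubRestEmitter(gms_server=gms_server).emit_mcp(proposal)


# Merge first, then emit the full list in a single call.
all_upstreams = merge_upstreams(["urn:a", "urn:b"], ["urn:b", "urn:c"])
print(all_upstreams)
```

Emitting only the new edge would silently drop the two edges that were already recorded.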

Programmatic Pipeline

In some cases, you might want to configure and run a pipeline entirely from within your custom Python script. Here is an example of how to do it.
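
A minimal sketch of such a script follows, assuming the `mssql` and `datahub-rest` plugins are installed. The dictionary mirrors the layout of a recipe file (a source and a sink, each with a type and a config). The helper names `build_recipe` and `run_pipeline` are hypothetical, and the `Pipeline` interface may vary between `acryl-datahub` versions.

```python
def build_recipe(mssql_password, server="http://localhost:8080"):
    """Return a recipe as a plain dictionary, equivalent to a recipe file."""
    return {
        "source": {
            "type": "mssql",
            "config": {
                "username": "sa",
                "password": mssql_password,
                "database": "DemoData",
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": server},
        },
    }


def run_pipeline(recipe):
    """Create and run an ingestion pipeline, raising if it reports failures."""
    # Imported inside the function so the module loads even without acryl-datahub.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(recipe)
    pipeline.run()
    pipeline.raise_from_status()


recipe = build_recipe(mssql_password="example-password")  # placeholder secret
# run_pipeline(recipe)  # uncomment once the source and DataHub are reachable
```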

Lineage with Airflow

There are a couple of ways to get lineage information from Airflow into DataHub.


If you're simply looking to run ingestion on a schedule, take a look at these sample DAGs:


The Airflow lineage backend is only supported in Airflow 1.10.15+ and 2.0.2+.

Running on Docker locally

If you are looking to run Airflow and DataHub using Docker locally, follow the guide here. Otherwise, proceed with the instructions below.

Setting up Airflow to use DataHub as Lineage Backend

  First, install the required dependency in your Airflow environment:
  pip install 'acryl-datahub[airflow]'
  1. You must configure an Airflow hook for Datahub. We support both a Datahub REST hook and a Kafka-based hook, but you only need one.

    # For REST-based:
    airflow connections add --conn-type 'datahub_rest' 'datahub_rest_default' --conn-host 'http://localhost:8080'
    # For Kafka-based (standard Kafka sink config can be passed via extras):
    airflow connections add --conn-type 'datahub_kafka' 'datahub_kafka_default' --conn-host 'broker:9092' --conn-extra '{}'
  2. Add the following lines to your airflow.cfg file.

    [lineage]
    backend = datahub_provider.lineage.datahub.DatahubLineageBackend
    datahub_kwargs = {
        "datahub_conn_id": "datahub_rest_default",
        "cluster": "prod",
        "capture_ownership_info": true,
        "capture_tags_info": true,
        "graceful_exceptions": true }
    # The above indentation is important!

    Configuration options:

    • datahub_conn_id (required): Usually datahub_rest_default or datahub_kafka_default, depending on what you named the connection in step 1.
    • cluster (defaults to "prod"): The "cluster" to associate Airflow DAGs and tasks with.
    • capture_ownership_info (defaults to true): If true, the owners field of the DAG will be captured as a DataHub corpuser.
    • capture_tags_info (defaults to true): If true, the tags field of the DAG will be captured as DataHub tags.
    • graceful_exceptions (defaults to true): If set to true, most runtime errors in the lineage backend will be suppressed and will not cause the overall task to fail. Note that configuration issues will still throw exceptions.
  3. Configure inlets and outlets for your Airflow operators. For reference, look at the sample DAGs, including one for the TaskFlow API.

  4. [optional] Learn more about Airflow lineage, including shorthand notation and some automation.
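
The datahub_kwargs value in step 2 is a JSON object spread over several lines, and its continuation lines must be indented for the config parser to treat them as part of the same value. The sketch below renders such a block and checks it with Python's standard `configparser`, used here as a stand-in for Airflow's own parser. It assumes the settings live under a `[lineage]` section, and the helper name `render_lineage_section` is hypothetical.

```python
import configparser
import json

datahub_kwargs = {
    "datahub_conn_id": "datahub_rest_default",
    "cluster": "prod",
    "capture_ownership_info": True,
    "capture_tags_info": True,
    "graceful_exceptions": True,
}


def render_lineage_section(kwargs):
    """Render a [lineage] section whose multi-line JSON value is safely indented."""
    lines = json.dumps(kwargs, indent=4).splitlines()
    # Indent every line after the first, including the closing brace, so that
    # the parser reads them as continuations of the datahub_kwargs value.
    continuation = ["    " + line for line in lines[1:]]
    value = "\n".join([lines[0]] + continuation)
    return (
        "[lineage]\n"
        "backend = datahub_provider.lineage.datahub.DatahubLineageBackend\n"
        f"datahub_kwargs = {value}\n"
    )


section = render_lineage_section(datahub_kwargs)
print(section)

# Round-trip check: parse the section and decode the JSON value again.
parser = configparser.ConfigParser(interpolation=None)
parser.read_string(section)
round_tripped = json.loads(parser["lineage"]["datahub_kwargs"])
```

Note that JSON booleans are lowercase (`true`), unlike Python's `True`, which is why the value is generated with `json.dumps` rather than written by hand.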

Emitting lineage via a separate operator

Take a look at this sample DAG:

To use this example, you must first configure the DataHub hook. As with ingestion, we support a DataHub REST hook and a Kafka-based hook. See step 1 above for details.


See the guides on developing, adding a source and using transformers.