Skip to main content

dbt

Module dbt

Certified

Important Capabilities

CapabilityStatusNotes
Dataset Usage
Detect Deleted EntitiesEnabled via stateful ingestion
Table-Level LineageEnabled by default

This plugin pulls metadata from dbt's artifact files and generates:

  • dbt Tables: for nodes in the dbt manifest file that are models materialized as tables
  • dbt Views: for nodes in the dbt manifest file that are models materialized as views
  • dbt Ephemeral: for nodes in the dbt manifest file that are ephemeral models
  • dbt Sources: for nodes that are sources on top of the underlying platform tables
  • dbt Seed: for seed entities
  • dbt Tests as Assertions: for dbt test entities (starting with version 0.8.38.1)

Note:

  1. It also generates lineage between the dbt nodes (e.g. ephemeral nodes that depend on other dbt sources) as well as lineage between the dbt nodes and the underlying (target) platform nodes (e.g. BigQuery Table -> dbt Source, dbt View -> BigQuery View).
  2. We also support automated actions (like add a tag, term or owner) based on properties defined in dbt meta.

The artifacts used by this source are:

  • dbt manifest file
    • This file contains model, source, tests and lineage data.
  • dbt catalog file
    • This file contains schema data.
    • dbt does not record schema data for Ephemeral models, as such datahub will show Ephemeral models in the lineage, however there will be no associated schema for Ephemeral models
  • dbt sources file
    • This file contains metadata for sources with freshness checks.
    • We transfer dbt's freshness checks to DataHub's last-modified fields.
    • Note that this file is optional – if not specified, we'll use time of ingestion instead as a proxy for time last-modified.
  • dbt run_results file
    • This file contains metadata from the result of a dbt run, e.g. dbt test
    • When provided, we transfer dbt test run results into assertion run events to see a timeline of test runs on the dataset

CLI based Ingestion

Install the Plugin

pip install 'acryl-datahub[dbt]'

Starter Recipe

Check out the following recipe to get started with ingestion! See below for full configuration options.

For general pointers on writing and running a recipe, see our main recipe guide.

source:
type: "dbt"
config:
# Coordinates
# To use this as-is, set the environment variable DBT_PROJECT_ROOT to the root folder of your dbt project
manifest_path: "${DBT_PROJECT_ROOT}/target/manifest_file.json"
catalog_path: "${DBT_PROJECT_ROOT}/target/catalog_file.json"
sources_path: "${DBT_PROJECT_ROOT}/target/sources_file.json" # optional for freshness
test_results_path: "${DBT_PROJECT_ROOT}/target/run_results.json" # optional for recording dbt test results after running dbt test

# Options
target_platform: "my_target_platform_id" # e.g. bigquery/postgres/etc.

# sink configs

Config Details

Note that a . is used to denote nested fields in the YAML recipe.

View All Configuration Options
FieldRequiredTypeDescriptionDefault
envstringEnvironment to use in namespace when constructing URNs.PROD
platformstringThe platform that this source connects toNone
platform_instancestringThe instance of the platform that all assets produced by this recipe belong toNone
manifest_pathstringPath to dbt manifest JSON. See https://docs.getdbt.com/reference/artifacts/manifest-json Note this can be a local file or a URI.None
catalog_pathstringPath to dbt catalog JSON. See https://docs.getdbt.com/reference/artifacts/catalog-json Note this can be a local file or a URI.None
sources_pathstringPath to dbt sources JSON. See https://docs.getdbt.com/reference/artifacts/sources-json. If not specified, last-modified fields will not be populated. Note this can be a local file or a URI.None
test_results_pathstringPath to output of dbt test run as run_results file in JSON format. See https://docs.getdbt.com/reference/artifacts/run-results-json. If not specified, test execution results will not be populated in DataHub.None
target_platformstringThe platform that dbt is loading onto. (e.g. bigquery / redshift / postgres etc.)None
target_platform_instancestringThe platform instance for the platform that dbt is operating on. Use this if you have multiple instances of the same platform (e.g. redshift) and need to distinguish between them.None
use_identifiersbooleanUse model identifier instead of model name if defined (if not, default to model name).False
tag_prefixstringPrefix added to tags during ingestion.dbt:
meta_mappingDictmapping rules that will be executed against dbt meta properties. Refer to the section below on dbt meta automated mappings.{}
column_meta_mappingDictmapping rules that will be executed against dbt column meta properties. Refer to the section below on dbt meta automated mappings.{}
query_tag_mappingDictmapping rules that will be executed against dbt query_tag meta properties. Refer to the section below on dbt meta automated mappings.{}
write_semanticsstringWhether the new tags, terms and owners to be added will override the existing ones added only by this source or not. Value for this config can be "PATCH" or "OVERRIDE"PATCH
strip_user_ids_from_emailbooleanWhether or not to strip email id while adding owners using dbt meta actions.False
enable_owner_extractionbooleanWhen enabled, ownership info will be extracted from the dbt metaTrue
owner_extraction_patternstringRegex string to extract owner from the dbt node using the (?P<name>...) syntax of the match object, where the group name must be owner. Examples: (1)r"(?P<owner>(.*)): (\w+) (\w+)" will extract jdoe as the owner from "jdoe: John Doe" (2) r"@(?P<owner>(.*))" will extract alice as the owner from "@alice".None
backcompat_skip_source_on_lineage_edgebooleanPrior to version 0.8.41, lineage edges to sources were directed to the target platform node rather than the dbt source node. This contradicted the established pattern for other lineage edges to point to upstream dbt nodes. To revert lineage logic to this legacy approach, set this flag to true.False
enable_meta_mappingbooleanWhen enabled, applies the mappings that are defined through the meta_mapping directives.True
enable_query_tag_mappingbooleanWhen enabled, applies the mappings that are defined through the query_tag_mapping directives.True
stateful_ingestionDBTStatefulIngestionConfig (see below for fields)DBT Stateful Ingestion Config.
stateful_ingestion.enabledbooleanThe type of the ingestion state provider registered with datahub.False
stateful_ingestion.max_checkpoint_state_sizeintegerThe maximum size of the checkpoint state in bytes. Default is 16MB16777216
stateful_ingestion.state_providerDynamicTypedStateProviderConfig (see below for fields)The ingestion state provider configuration.
stateful_ingestion.state_provider.type❓ (required if stateful_ingestion.state_provider is set)stringThe type of the state provider to use. For DataHub use datahubNone
stateful_ingestion.state_provider.configGeneric dictThe configuration required for initializing the state provider. Default: The datahub_api config if set at pipeline level. Otherwise, the default DatahubClientConfig. See the defaults (https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/graph/client.py#L19).None
stateful_ingestion.ignore_old_statebooleanIf set to True, ignores the previous checkpoint state.False
stateful_ingestion.ignore_new_statebooleanIf set to True, ignores the current checkpoint state.False
stateful_ingestion.remove_stale_metadatabooleanSoft-deletes the entities of type in the last successful run but missing in the current run with stateful_ingestion enabled.True
stateful_ingestion.fail_safe_thresholdnumberPrevents large amount of soft deletes & the state from committing from accidental changes to the source configuration if the relative change percent in entities compared to the previous state is above the 'fail_safe_threshold'.95.0
node_type_patternAllowDenyPattern (see below for fields)Deprecated: use entities_enabled instead. Regex patterns for dbt nodes to filter in ingestion.{'allow': ['.*'], 'deny': [], 'ignoreCase': True}
node_type_pattern.allowArray of stringList of regex patterns to include in ingestion['.*']
node_type_pattern.denyArray of stringList of regex patterns to exclude from ingestion.[]
node_type_pattern.ignoreCasebooleanWhether to ignore case sensitivity during pattern matching.True
entities_enabledDBTEntitiesEnabled (see below for fields)Controls for enabling / disabling metadata emission for different dbt entities (models, test definitions, test results, etc.){'models': <EmitDirective.YES: 'YES'>, 'sources': <EmitDirective.YES: 'YES'>, 'seeds': <EmitDirective.YES: 'YES'>, 'test_definitions': <EmitDirective.YES: 'YES'>, 'test_results': <EmitDirective.YES: 'YES'>}
entities_enabled.modelsenum(EmitDirective)Emit metadata for dbt models when set to Yes or Only. Allowed symbols are YES, NO, ONLYYES
entities_enabled.sourcesenum(EmitDirective)Emit metadata for dbt sources when set to Yes or Only. Allowed symbols are YES, NO, ONLYYES
entities_enabled.seedsenum(EmitDirective)Emit metadata for dbt seeds when set to Yes or Only. Allowed symbols are YES, NO, ONLYYES
entities_enabled.test_definitionsenum(EmitDirective)Emit metadata for test definitions when enabled when set to Yes or Only. Allowed symbols are YES, NO, ONLYYES
entities_enabled.test_resultsenum(EmitDirective)Emit metadata for test results when set to Yes or Only. Allowed symbols are YES, NO, ONLYYES
node_name_patternAllowDenyPattern (see below for fields)regex patterns for dbt model names to filter in ingestion.{'allow': ['.*'], 'deny': [], 'ignoreCase': True}
node_name_pattern.allowArray of stringList of regex patterns to include in ingestion['.*']
node_name_pattern.denyArray of stringList of regex patterns to exclude from ingestion.[]
node_name_pattern.ignoreCasebooleanWhether to ignore case sensitivity during pattern matching.True
aws_connectionAwsConnectionConfig (see below for fields)When fetching manifest files from s3, configuration for aws connection details
aws_connection.aws_access_key_idstringAWS access key ID. Can be auto-detected, see https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html for details.None
aws_connection.aws_secret_access_keystringAWS secret access key. Can be auto-detected, see https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html for details.None
aws_connection.aws_session_tokenstringAWS session token. Can be auto-detected, see https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html for details.None
aws_connection.aws_roleGeneric dictAWS roles to assume. If using the string format, the role ARN can be specified directly. If using the object format, the role can be specified in the RoleArn field and additional available arguments are documented at https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sts.html?highlight=assume_role#STS.Client.assume_roleNone
aws_connection.aws_profilestringNamed AWS profile to use. Only used if access key / secret are unset. If not set the default will be usedNone
aws_connection.aws_region❓ (required if aws_connection is set)stringAWS region code.None
aws_connection.aws_endpoint_urlstringAutodetected. See https://boto3.amazonaws.com/v1/documentation/api/latest/reference/core/session.htmlNone
aws_connection.aws_proxyDict[str,string]Autodetected. See https://boto3.amazonaws.com/v1/documentation/api/latest/reference/core/session.html
github_infoGitHubInfo (see below for fields)Reference to your github location to enable easy navigation from DataHub to your dbt files.
github_info.repo❓ (required if github_info is set)stringName of your github repository in org/repo format. e.g. repo for https://github.com/datahub-project/datahub is datahub-project/datahub.None
github_info.branchstringBranch on which your files live by default. Typically main or master.main
github_info.base_urlstringBase url for Github. Used to construct clickable links on the UI.https://github.com
github_info.deploy_key_filestringA private key file that contains an ssh key that has been configured as a deploy key for this repository. Use a file where possible, else see deploy_key for a config field that accepts a raw string.None
github_info.deploy_keystringA private key that contains an ssh key that has been configured as a deploy key for this repository. See deploy_key_file if you want to use a file that contains this key.None
github_info.repo_ssh_locatorstringAuto-inferred from repo as git@github.com:{repo}, but you can override this if needed.None

dbt meta automated mappings

dbt allows authors to define meta properties for datasets. Checkout this link to know more - dbt meta. Our dbt source allows users to define actions such as add a tag, term or owner. For example if a dbt model has a meta config "has_pii": True, we can define an action that evaluates if the property is set to true and add, lets say, a pii tag. To leverage this feature we require users to define mappings as part of the recipe. The following section describes how you can build these mappings. Listed below is a meta_mapping and column_meta_mapping section that among other things, looks for keys like business_owner and adds owners that are listed there.

meta_mapping:
business_owner:
match: ".*"
operation: "add_owner"
config:
owner_type: user
owner_category: BUSINESS_OWNER
has_pii:
match: True
operation: "add_tag"
config:
tag: "has_pii_test"
int_property:
match: 1
operation: "add_tag"
config:
tag: "int_meta_property"
double_property:
match: 2.5
operation: "add_term"
config:
term: "double_meta_property"
data_governance.team_owner:
match: "Finance"
operation: "add_term"
config:
term: "Finance_test"
terms_list:
match: ".*"
operation: "add_terms"
config:
separator: ","
column_meta_mapping:
terms_list:
match: ".*"
operation: "add_terms"
config:
separator: ","
is_sensitive:
match: True
operation: "add_tag"
config:
tag: "sensitive"

We support the following operations:

  1. add_tag - Requires tag property in config.
  2. add_term - Requires term property in config.
  3. add_terms - Accepts an optional separator property in config.
  4. add_owner - Requires owner_type property in config which can be either user or group. Optionally accepts the owner_category config property which you can set to one of ['TECHNICAL_OWNER', 'BUSINESS_OWNER', 'DATA_STEWARD', 'DATAOWNER' (defaults to DATAOWNER).

Note:

  1. The dbt meta_mapping config works at the model level, while the column_meta_mapping config works at the column level. The add_owner operation is not supported at the column level.
  2. For string meta properties we support regex matching.

With regex matching, you can also use the matched value to customize how you populate the tag, term or owner fields. Here are a few advanced examples:

Data Tier - Bronze, Silver, Gold

If your meta section looks like this:

meta:
data_tier: Bronze # chosen from [Bronze,Gold,Silver]

and you wanted to attach a glossary term like urn:li:glossaryTerm:Bronze for all the models that have this value in the meta section attached to them, the following meta_mapping section would achieve that outcome:

meta_mapping:
data_tier:
match: "Bronze|Silver|Gold"
operation: "add_term"
config:
term: "{{ $match }}"

to match any data_tier of Bronze, Silver or Gold and maps it to a glossary term with the same name.

Case Numbers - create tags

If your meta section looks like this:

meta:
case: PLT-4678 # internal Case Number

and you want to generate tags that look like case_4678 from this, you can use the following meta_mapping section:

meta_mapping:
case:
match: "PLT-(.*)"
operation: "add_tag"
config:
tag: "case_{{ $match }}"

Stripping out leading @ sign

You can also match specific groups within the value to extract subsets of the matched value. e.g. if you have a meta section that looks like this:

meta:
owner: "@finance-team"
business_owner: "@janet"

and you want to mark the finance-team as a group that owns the dataset (skipping the leading @ sign), while marking janet as an individual user (again, skipping the leading @ sign) that owns the dataset, you can use the following meta-mapping section.

meta_mapping:
owner:
match: "^@(.*)"
operation: "add_owner"
config:
owner_type: group
business_owner:
match: "^@(?P<owner>(.*))"
operation: "add_owner"
config:
owner_type: user
owner_category: BUSINESS_OWNER

In the examples above, we show two ways of writing the matching regexes. In the first one, ^@(.*) the first matching group (a.k.a. match.group(1)) is automatically inferred. In the second example, ^@(?P<owner>(.*)), we use a named matching group (called owner, since we are matching an owner) to capture the string we want to provide to the ownership urn.

dbt query_tag automated mappings

This works similarly as the dbt meta mapping but for the query tags

We support the below actions -

  1. add_tag - Requires tag property in config.

The below example set as global tag the query tag tag key's value.

"query_tag_mapping":
{
"tag":
"match": ".*"
"operation": "add_tag"
"config":
"tag": "{{ $match }}"
}

Integrating with dbt test

To integrate with dbt tests, the dbt source needs access to the run_results.json file generated after a dbt test execution. Typically, this is written to the target directory. A common pattern you can follow is:

  1. Run dbt docs generate and upload manifest.json and catalog.json to a location accessible to the dbt source (e.g. s3 or local file system)
  2. Run dbt test and upload run_results.json to a location accessible to the dbt source (e.g. s3 or local file system)
  3. Run datahub ingest -c dbt_recipe.dhub.yaml with the following config parameters specified
    • test_results_path: pointing to the run_results.json file that you just created

The connector will produce the following things:

  • Assertion definitions that are attached to the dataset (or datasets)
  • Results from running the tests attached to the timeline of the dataset

View of dbt tests for a dataset

test view

Viewing the SQL for a dbt test

test logic view

Viewing timeline for a failed dbt test

test view

Separating test result emission from other metadata emission

You can segregate emission of test results from the emission of other dbt metadata using the entities_enabled config flag. The following recipe shows you how to emit only test results.

source:
type: dbt
config:
manifest_path: _path_to_manifest_json
catalog_path: _path_to_catalog_json
test_results_path: _path_to_run_results_json
target_platform: postgres
entities_enabled:
test_results: Only

Similarly, the following recipe shows you how to emit everything (i.e. models, sources, seeds, test definitions) but not test results:

source:
type: dbt
config:
manifest_path: _path_to_manifest_json
catalog_path: _path_to_catalog_json
run_results_path: _path_to_run_results_json
target_platform: postgres
entities_enabled:
test_results: No

Code Coordinates

  • Class Name: datahub.ingestion.source.dbt.DBTSource
  • Browse on GitHub

Questions

If you've got any questions on configuring ingestion for dbt, feel free to ping us on our Slack