Great Expectations

This guide explains how to set up and configure DataHubValidationAction in Great Expectations to send assertions (expectations) and their results to DataHub using DataHub's Python REST emitter.

Capabilities

DataHubValidationAction pushes assertion metadata to DataHub. This includes:

  • Assertion Details: Details of the assertions (i.e. expectations) set on a Dataset (Table).
  • Assertion Results: Evaluation results for an assertion tracked over time.

This integration supports v3 API datasources that use SqlAlchemyExecutionEngine.

Limitations

This integration does not support:

  • v2 Datasources such as SqlAlchemyDataset
  • v3 Datasources using an execution engine other than SqlAlchemyExecutionEngine (e.g. Spark, Pandas)
  • Cross-dataset expectations (those involving more than one table)

Setting up

  1. Install the required dependency in your Great Expectations environment.
    pip install 'acryl-datahub[great-expectations]'
  2. To add DataHubValidationAction to a Great Expectations Checkpoint, add the following configuration to the Checkpoint's action_list. For more details on setting action_list, see Checkpoints and Actions.
    action_list:
      - name: datahub_action
        action:
          module_name: datahub.integrations.great_expectations.action
          class_name: DataHubValidationAction
          server_url: http://localhost:8080 # DataHub server URL
    Configuration options:
    • server_url (required): URL of DataHub GMS endpoint
    • env (optional, defaults to "PROD"): Environment to use in namespace when constructing dataset URNs.
    • exclude_dbname (optional): Exclude the dbname / catalog when constructing dataset URNs. This is particularly useful for Trino / Presto, where the catalog (e.g. hive) should be omitted.
    • platform_alias (optional): Platform alias to use when constructing dataset URNs. For example, the main data platform is presto-on-hive but the tests are run through trino.
    • platform_instance_map (optional): Platform instance mapping to use when constructing dataset URNs. Maps the GE 'data source' name to a platform instance on DataHub. e.g. platform_instance_map: { "datasource_name": "warehouse" }
    • graceful_exceptions (optional, defaults to true): If set to true, most runtime errors in the action are suppressed and do not cause the overall checkpoint to fail. Note that configuration issues will still throw exceptions.
    • token (optional): Bearer token used for authentication.
    • timeout_sec (optional): Per-HTTP request timeout.
    • retry_status_codes (optional): Additional HTTP status codes on which the request is retried.
    • retry_max_times (optional): Maximum number of times to retry if an HTTP request fails. The delay between retries increases exponentially.
    • extra_headers (optional): Extra headers to add to requests sent to DataHub.
    • parse_table_names_from_sql (optional, defaults to false): The integration can use an SQL parser to try to identify the datasets being asserted. This parsing is disabled by default, but can be enabled by setting parse_table_names_from_sql: True. The parser is based on the sqllineage package.
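For Checkpoints that are defined in Python rather than YAML, the options above could be laid out as a plain dict along the following lines. This is only a sketch: the variable name checkpoint_actions is hypothetical, and the values for timeout_sec, retry_max_times, and platform_instance_map are illustrative placeholders, not recommended settings.

```python
# A sketch of the action_list entry as a Python structure.
# All option names come from the list above; the values are
# illustrative assumptions, not defaults or recommendations.
checkpoint_actions = [
    {
        "name": "datahub_action",
        "action": {
            "module_name": "datahub.integrations.great_expectations.action",
            "class_name": "DataHubValidationAction",
            # Required: URL of the DataHub GMS endpoint.
            "server_url": "http://localhost:8080",
            # Optional: environment used in dataset URNs (defaults to "PROD").
            "env": "PROD",
            # Optional: map a GE datasource name to a DataHub platform instance.
            "platform_instance_map": {"datasource_name": "warehouse"},
            # Optional: suppress most runtime errors (defaults to true).
            "graceful_exceptions": True,
            # Optional: placeholder HTTP settings, chosen for illustration only.
            "timeout_sec": 30,
            "retry_max_times": 3,
            # Optional: SQL-based table name parsing (defaults to false).
            "parse_table_names_from_sql": False,
        },
    }
]
```

The nesting mirrors the YAML form: each entry has a name and an action, and every DataHubValidationAction option sits inside action next to module_name and class_name.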

Debugging

Set the environment variable DATAHUB_DEBUG (default false) to true to enable debug logging for DataHubValidationAction.
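When the checkpoint is launched from a Python script, the variable can also be set from within the process. The sketch below assumes that the variable must be in place before the DataHub action is loaded, so it is set at the very top of the script; that timing is an assumption, not something this guide states.

```python
import os

# Assumption: DATAHUB_DEBUG is read when the DataHub action is loaded,
# so set it before importing Great Expectations or running the checkpoint.
os.environ["DATAHUB_DEBUG"] = "true"

# ... import Great Expectations and run the checkpoint after this point.
```

Exporting the variable in the shell that launches the checkpoint has the same effect and avoids any dependence on import order.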

Learn more

To see the Great Expectations integration in action, check out this demo from the Feb 2022 townhall.