Version: Next

DataHub APIs

DataHub has several APIs to manipulate metadata on the platform. Here's the list of APIs and their pros and cons to help you choose the right one for your use case.

| API | Definition | Pros | Cons |
|---|---|---|---|
| Python SDK | SDK | Highly flexible; good for bulk execution | Requires an understanding of the metadata change event |
| Java SDK | SDK | Highly flexible; good for bulk execution | Requires an understanding of the metadata change event |
| GraphQL API | GraphQL interface | Intuitive; mirrors UI capabilities | Less flexible than SDKs; requires knowledge of GraphQL syntax |
| OpenAPI (Not Recommended) | Lower-level API for advanced users | | Generally not recommended for typical use cases |

In general, Python and Java SDKs are our most recommended tools for extending and customizing the behavior of your DataHub instance. We don't recommend using the OpenAPI directly, as it's more complex and less user-friendly than the other APIs.

Python and Java SDK

We offer SDKs for both Python and Java that provide full CRUD functionality, along with any more complex behavior you may want to build into DataHub. We recommend using the SDKs for most use cases. Here are some examples of what you can do with them:

  • Defining lineage between data entities
  • Executing bulk operations, e.g. adding tags to multiple datasets
  • Creating custom metadata entities
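As a rough illustration of the bulk-operation case, the sketch below attaches one tag to several datasets with the Python SDK. It assumes the `acryl-datahub` package is installed and that a DataHub GMS server is reachable at `http://localhost:8080`. The dataset names, the tag name, and the helper names (`dataset_urn`, `tag_urn`, `build_tag_proposals`, `emit_all`) are hypothetical. The SDK ships its own URN builders in `datahub.emitter.mce_builder`; local versions are used here only to keep the string format visible.

```python
from typing import Iterable, List


def dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a dataset URN string in DataHub's standard format."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def tag_urn(tag: str) -> str:
    """Build a tag URN string."""
    return f"urn:li:tag:{tag}"


def build_tag_proposals(dataset_urns: Iterable[str], tags: Iterable[str]) -> List[object]:
    """Create one metadata change proposal per dataset carrying the given tags.

    Note: emitting a GlobalTags aspect replaces whatever tags the dataset
    already has. To append, read the current aspect first and merge.
    """
    # Imported here so the URN helpers above work without the SDK installed.
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.metadata.schema_classes import GlobalTagsClass, TagAssociationClass

    aspect = GlobalTagsClass(
        tags=[TagAssociationClass(tag=tag_urn(t)) for t in tags]
    )
    return [
        MetadataChangeProposalWrapper(entityUrn=urn, aspect=aspect)
        for urn in dataset_urns
    ]


def emit_all(proposals: Iterable[object], gms_server: str = "http://localhost:8080") -> None:
    """Send each proposal to the DataHub server (server URL is an assumption)."""
    from datahub.emitter.rest_emitter import DatahubRestEmitter

    emitter = DatahubRestEmitter(gms_server=gms_server)
    for proposal in proposals:
        emitter.emit(proposal)
```

A call such as `emit_all(build_tag_proposals([dataset_urn("hive", "fct_users_created"), dataset_urn("hive", "fct_users_deleted")], ["Legacy"]))` would then tag both datasets in one pass.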

Learn more about the SDKs:

GraphQL API

The GraphQL API serves as the primary public API for the platform. It can be used to fetch and update metadata programmatically in the language of your choice. It is intended as a higher-level API that simplifies the most common operations.

We recommend using the GraphQL API if you're getting started with DataHub, since it's more user-friendly and straightforward. Here are some examples of how to use the GraphQL API:

  • Search for datasets with conditions
  • Update a certain field of a dataset
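To make the search case concrete, here is a minimal sketch that posts a GraphQL `search` query over HTTP using only the Python standard library. The query shape (a `search` field taking a `SearchInput` with `type`, `query`, `start`, and `count`) follows DataHub's GraphQL schema as best understood here, so check the field names against your deployed version. The endpoint URL, the access token, and the helper names `build_search_payload` and `post_graphql` are assumptions made for illustration.

```python
import json
import urllib.request

# Query text; the selected fields are kept deliberately small.
SEARCH_QUERY = """
query search($input: SearchInput!) {
  search(input: $input) {
    total
    searchResults {
      entity {
        urn
        type
      }
    }
  }
}
"""


def build_search_payload(text: str, start: int = 0, count: int = 10) -> dict:
    """Build the JSON body for a dataset search request."""
    return {
        "query": SEARCH_QUERY,
        "variables": {
            "input": {
                "type": "DATASET",  # restrict the search to datasets
                "query": text,
                "start": start,
                "count": count,
            }
        },
    }


def post_graphql(payload: dict, url: str, token: str) -> dict:
    """POST the payload to a GraphQL endpoint and return the decoded JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

For example, `post_graphql(build_search_payload("fct_users"), "http://localhost:9002/api/graphql", "<your-token>")` would run the search against a local instance, assuming that URL and a valid token. Keeping payload construction separate from the HTTP call makes the query easy to inspect or reuse from another client.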

Learn more about the GraphQL API:

DataHub API Comparison

DataHub supports several APIs, each with its own unique usage and format. Here's an overview of what each API can do.

Last Updated: Feb 16, 2024

| Feature | GraphQL | Python SDK | OpenAPI |
|---|---|---|---|
| Create a Dataset | 🚫 | [Guide] | |
| Delete a Dataset (Soft Delete) | [Guide] | [Guide] | |
| Delete a Dataset (Hard Delete) | 🚫 | [Guide] | |
| Search a Dataset | [Guide] | | |
| Read a Dataset Deprecation | | | |
| Read Dataset Entities (V2) | | | |
| Create a Tag | [Guide] | [Guide] | |
| Read a Tag | [Guide] | [Guide] | |
| Add Tags to a Dataset | [Guide] | [Guide] | |
| Add Tags to a Column of a Dataset | [Guide] | [Guide] | |
| Remove Tags from a Dataset | [Guide] | [Guide] | |
| Create Glossary Terms | [Guide] | [Guide] | |
| Read Terms from a Dataset | [Guide] | [Guide] | |
| Add Terms to a Column of a Dataset | [Guide] | [Guide] | |
| Add Terms to a Dataset | [Guide] | [Guide] | |
| Create Domains | [Guide] | [Guide] | |
| Read Domains | [Guide] | [Guide] | |
| Add Domains to a Dataset | [Guide] | [Guide] | |
| Remove Domains from a Dataset | [Guide] | [Guide] | |
| Create / Upsert Users | [Guide] | [Guide] | |
| Create / Upsert Group | [Guide] | [Guide] | |
| Read Owners of a Dataset | [Guide] | [Guide] | |
| Add Owner to a Dataset | [Guide] | [Guide] | |
| Remove Owner from a Dataset | [Guide] | [Guide] | |
| Add Lineage | [Guide] | [Guide] | |
| Add Column Level (Fine Grained) Lineage | 🚫 | [Guide] | |
| Add Documentation (Description) to a Column of a Dataset | [Guide] | [Guide] | |
| Add Documentation (Description) to a Dataset | [Guide] | [Guide] | |
| Add / Remove / Replace Custom Properties on a Dataset | 🚫 | [Guide] | |
| Add ML Feature to ML Feature Table | 🚫 | [Guide] | |
| Add ML Feature to MLModel | 🚫 | [Guide] | |
| Add ML Group to MLFeatureTable | 🚫 | [Guide] | |
| Create MLFeature | 🚫 | [Guide] | |
| Create MLFeatureTable | 🚫 | [Guide] | |
| Create MLModel | 🚫 | [Guide] | |
| Create MLModelGroup | 🚫 | [Guide] | |
| Create MLPrimaryKey | 🚫 | [Guide] | |
| Create MLFeatureTable | 🚫 | [Guide] | |
| Read MLFeature | [Guide] | [Guide] | |
| Read MLFeatureTable | [Guide] | [Guide] | |
| Read MLModel | [Guide] | [Guide] | |
| Read MLModelGroup | [Guide] | [Guide] | |
| Read MLPrimaryKey | [Guide] | [Guide] | |
| Create Data Product | 🚫 | [Code] | |
| Create Lineage Between Chart and Dashboard | 🚫 | [Code] | |
| Create Lineage Between Dataset and Chart | 🚫 | [Code] | |
| Create Lineage Between Dataset and DataJob | 🚫 | [Code] | |
| Create Finegrained Lineage as DataJob for Dataset | 🚫 | [Code] | |
| Create Finegrained Lineage for Dataset | 🚫 | [Code] | |
| Create Dataset Lineage with Kafka | 🚫 | [Code] | |
| Create Dataset Lineage with MCPW & Rest Emitter | 🚫 | [Code] | |
| Create Dataset Lineage with Rest Emitter | 🚫 | [Code] | |
| Create DataJob with Dataflow | 🚫 | [Code] [Simple] [Verbose] | |
| Create Programmatic Pipeline | 🚫 | [Code] | |