DataHub is a self-service data portal that provides search and discovery capabilities (and more!) over the data assets of an organization. It can improve the productivity of data scientists, data analysts, and engineers at organizations dealing with massive amounts of data. The regulatory environment (GDPR, CCPA, etc.) also requires a company to know what data it has, who is using it, and how long it will be retained. DataHub addresses these challenges by gathering metadata across a distributed data ecosystem and surfacing it as a data catalog, thereby easing the burden of data privacy and compliance.
Common problems with commercial solutions can be summarized as:
- No direct access to source code: any feature gaps can only be closed by external parties, which can be both time-consuming and expensive.
- Dependency on larger proprietary systems or environments, e.g. AWS, Azure, Cloudera, etc., which makes adoption infeasible if the tool doesn’t fit your environment.
- Expensive to acquire, integrate, and operate.
- Vendor lock-in.
DataHub can be right for you if you want an open source unbundled solution (front-end application completely decoupled from a “battle-tested” metadata store), that you are free to modify, extend and integrate with your data ecosystem. In our experience at LinkedIn and talking to other companies in a similar situation, metadata always has a very company specific implementation and meaning. Commercial tools will typically drop-in and solve a few use-cases well out of the gate, but will need much more investment or will be impossible to extend for some specific kinds of metadata.
Currently LinkedIn engineers. However, we’re receiving more and more PRs from individuals working at various companies.
Check out our adoption here
We welcome contributions from everyone in the community. Please read our contributing guidelines. In general, we will review PRs with the same rigor as our internal code review process to maintain overall quality.
We organize public town hall meetings at a monthly cadence. We use Slack as one of the main ways to support the community.
The best way to engage is through the Slack channel. You’ll get to interact with the developers and the community. It is a vibrant community and most questions are answered within a few hours by the community.
The docs are the best resource. We have documented the steps to install and test DataHub thoroughly there.
You can learn more about DataHub's product roadmap, which gets updated regularly.
You can learn more about the current list of features.
A mix of both the LinkedIn DataHub team and the community. The roadmap will be a joint effort of both LinkedIn and the community. However, we’ll most likely prioritize tasks that align with the community's asks.
LinkedIn is not using GCP so we cannot commit to building and testing that connectivity. However, we’ll be happy to accept community contributions for GCP integration. Also, our Slack channel and regularly scheduled town hall meetings are a good opportunity to meet with people from different companies who have similar requirements and might be interested in collaborating on these features.
Please take a look at our roadmap & features to get a sense of what’s being open sourced in the near future. If there’s something missing from the list, we’re open to discussion. In fact, the town hall would be the perfect venue for such discussions.
All PRs are reviewed by the LinkedIn team. Any extension or contribution from the community that the LinkedIn team has no expertise in will be placed into an incubation directory first (/contrib). Once it’s blessed and adopted by the community, we’ll graduate it from incubation and move it into the main code base.
In the beginning, LinkedIn will play a big role in terms of stewardship of the code base. We’ll re-evaluate this strategy based on the amount of engagement from the community. There is a large backlog of features that we have only for the internal DataHub. We are going through the effort of generalizing and open sourcing these features. This will lead to batches of large commits from LinkedIn in the near future until the two code bases get closely aligned. See our blog post for more details.
It varies depending on the data platform. Ingestion from HDFS, MySQL, Oracle, Teradata, and LDAP is scheduled on a daily basis. We rely on real-time pushes to ingest from several data platforms such as Hive, Presto, Kafka, Pinot, Espresso, Ambry, Galene, Venice, and more.
URN is the only sensible option to ensure that events for the same entity land in the same partition and get processed in chronological order.
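The ordering argument above can be illustrated with a minimal sketch. It assumes the URN is used as the partitioning key and swaps in CRC32 as a stand-in for whatever hash the Kafka client actually applies; the partition count, the example URNs, and the function names are hypothetical and not part of DataHub.

```python
import zlib

NUM_PARTITIONS = 8  # hypothetical partition count, for illustration only


def partition_for(urn: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition with a stable hash (CRC32 is a stand-in here)."""
    return zlib.crc32(urn.encode("utf-8")) % num_partitions


def route(events, num_partitions: int = NUM_PARTITIONS):
    """Append each (urn, payload) event to its partition log in arrival order."""
    partitions = {p: [] for p in range(num_partitions)}
    for urn, payload in events:
        partitions[partition_for(urn, num_partitions)].append((urn, payload))
    return partitions


# Hypothetical URNs; updates to entity "a" are interleaved with another entity.
events = [
    ("urn:li:dataset:a", "v1"),
    ("urn:li:dataset:b", "v1"),
    ("urn:li:dataset:a", "v2"),
    ("urn:li:dataset:a", "v3"),
]

logs = route(events)
a_log = logs[partition_for("urn:li:dataset:a")]
a_updates = [payload for urn, payload in a_log if urn == "urn:li:dataset:a"]
print(a_updates)
```

Because the same key always hashes to the same partition, every update to one entity ends up in a single ordered log, so a consumer of that partition sees them in the order they were produced. Keying by anything that varies between events for the same entity would scatter them across partitions and lose that guarantee.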
In addition to leveraging Kafka schema validation to validate the MXEs emitted by metadata producers, we also actively monitor the ingestion streaming pipeline at the snapshot level with status.
This talk (slides, video) describes the role of metadata (DataHub) in the data governance/privacy space at LinkedIn. Field-level, dataset-level classification, governed data movement, automated data deletion, data export etc. are the supported use cases. We have plans to open source some of the compliance capabilities, listed as part of our roadmap.
You can configure the compatibility level per topic in Confluent Schema Registry. The default is “Backward”, so you’re only allowed to make backward-compatible changes to the topic schema. You can change this configuration to relax the compatibility check. However, as a best practice, we suggest not making backward-incompatible changes to the topic schema, because doing so will break your old metadata producers’ flows. Instead, consider creating a new Kafka topic (a new version).
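As a rough sketch of the per-topic setting described above, the snippet below builds (but does not send) a request against the Schema Registry REST endpoint for subject-level compatibility. The registry URL and the subject name are assumptions made for illustration; the subject for your topic depends on the naming strategy your producers use.

```python
import json
import urllib.parse
import urllib.request


def build_compatibility_request(registry_url: str, subject: str,
                                level: str = "BACKWARD") -> urllib.request.Request:
    """Build a PUT /config/{subject} request that sets the compatibility level."""
    body = json.dumps({"compatibility": level}).encode("utf-8")
    encoded_subject = urllib.parse.quote(subject, safe="")
    return urllib.request.Request(
        url=f"{registry_url.rstrip('/')}/config/{encoded_subject}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    )


# Hypothetical registry address and subject name.
req = build_compatibility_request("http://localhost:8081", "MetadataChangeEvent-value")
print(req.get_method(), req.full_url)

# To apply the change against a running registry:
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode("utf-8"))
```

Keeping the level at BACKWARD and publishing incompatible schemas to a new topic, as suggested above, means this call is rarely needed in practice.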
We plan to add “fine-grain lineage” in the near future, which should cover the transformation documentation. DataHub currently has a simple “Docs” feature that allows capturing of tribal knowledge. We also plan to expand it significantly going forward.
We are adding some “social features” and documentation captures to DataHub. However, we do welcome the community to contribute in this area.
It’s very similar to what you see on the community version. We have added screenshots of the internal version of the catalog in our blog post.
We’re working on a similar feature internally. We will evaluate and update the roadmap once we have a better idea of the timeline.
Is DataHub capturing/showing column level constraints set at table definition?
The SchemaField model currently does not capture any property/field corresponding to constraints defined in the table definition. However, it should be fairly easy to extend the model to support that if needed.
MCE is the ideal way to push metadata from different security zones, assuming there is a common Kafka infrastructure that aggregates the events from various security zones.
Currently, DataHub supports all major database providers that are supported by Ebean as the document store, i.e., Oracle, Postgres, MySQL, and H2. We also support Espresso, which is LinkedIn's proprietary document store. Beyond that, we support Elasticsearch and Neo4j for search and graph use cases, respectively. However, because the backend data stores are all abstracted and accessed through DAOs, you should be able to support other data stores by plugging in your own DAO implementations. Please refer to Metadata Serving for more details.
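The DAO idea can be sketched in a few lines. This is only an illustration of the pattern, written in Python for brevity; the class names, method signatures, and the in-memory store are hypothetical and are not DataHub's actual DAO interfaces.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional, Tuple


class AspectDao(ABC):
    """Hypothetical storage interface: callers depend on this, not on a database."""

    @abstractmethod
    def put(self, urn: str, aspect_name: str, value: dict) -> None:
        ...

    @abstractmethod
    def get(self, urn: str, aspect_name: str) -> Optional[dict]:
        ...


class InMemoryAspectDao(AspectDao):
    """Toy implementation backed by a dict; a real one would wrap a data store."""

    def __init__(self) -> None:
        self._rows: Dict[Tuple[str, str], dict] = {}

    def put(self, urn: str, aspect_name: str, value: dict) -> None:
        self._rows[(urn, aspect_name)] = dict(value)

    def get(self, urn: str, aspect_name: str) -> Optional[dict]:
        return self._rows.get((urn, aspect_name))


def record_owner(dao: AspectDao, urn: str, owner: str) -> None:
    """Service-level code that works unchanged with any AspectDao implementation."""
    dao.put(urn, "ownership", {"owner": owner})


dao = InMemoryAspectDao()
record_owner(dao, "urn:li:dataset:example", "jdoe")
print(dao.get("urn:li:dataset:example", "ownership"))
```

Supporting another store under this pattern means writing one more subclass; the code that calls `record_owner` does not change.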
Supported data sources are listed here. It's also fairly easy to add your own sources.
You can call the rest.li API to ingest metadata into DataHub directly instead of using Kafka events. Metadata ingestion is real-time if you're updating via the rest.li API. It's near real-time in the case of Kafka events due to the asynchronous nature of Kafka processing.