Version: 0.14.1

Monitoring DataHub

Monitoring DataHub's system components is critical for operating and improving DataHub. This doc explains how to add tracing and metrics measurements in the DataHub containers.

Tracing

Traces let us track the life of a request across multiple components. Each trace consists of multiple spans. A span is a unit of work that carries context about the work being done and the time taken to finish it. By looking at a trace, we can more easily identify performance bottlenecks.
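To make the trace and span vocabulary concrete, here is a minimal sketch in plain Java. The Span record, the span names, and the timings are all invented for illustration. It is a conceptual model only, not the OpenTelemetry API.

```java
import java.util.Comparator;
import java.util.List;

// Conceptual model only: these types are hypothetical and are NOT the OpenTelemetry API.
final class TraceSketch {

    /** One unit of work inside a trace; parent is null for the root span. */
    record Span(String name, String parent, long startMs, long endMs) {
        long durationMs() {
            return endMs - startMs;
        }
    }

    /** Returns the slowest non-root span, a first guess at where the bottleneck is. */
    static Span slowestChild(List<Span> trace) {
        return trace.stream()
                .filter(span -> span.parent() != null)
                .max(Comparator.comparingLong(Span::durationMs))
                .orElseThrow();
    }

    public static void main(String[] args) {
        // Invented timings for a single request that touches three components.
        List<Span> trace = List.of(
                new Span("GET /entities", null, 0, 120),
                new Span("jdbc.query", "GET /entities", 5, 25),
                new Span("elasticsearch.search", "GET /entities", 30, 110),
                new Span("kafka.send", "GET /entities", 111, 118));

        Span slowest = slowestChild(trace);
        System.out.println(slowest.name() + " took " + slowest.durationMs() + " ms");
    }
}
```

With these made-up numbers, main prints "elasticsearch.search took 80 ms". A tracing UI such as Jaeger makes the same comparison visually across the spans of a trace.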

We enable tracing by using the OpenTelemetry Java instrumentation library. This project provides a Java agent JAR that is attached to Java applications. The agent injects bytecode to capture telemetry from popular libraries.

Using the agent, we are able to:

1) Plug and play different tracing tools based on the user's setup: Jaeger, Zipkin, or other tools
2) Get traces for Kafka, JDBC, and Elasticsearch without any additional code
3) Track traces of any function with a simple @WithSpan annotation

You can enable the agent by setting the env variable ENABLE_OTEL to true for GMS and the MAE/MCE consumers. In our example docker-compose, we export traces to a local Jaeger instance by setting the env variable OTEL_TRACES_EXPORTER to jaeger and OTEL_EXPORTER_JAEGER_ENDPOINT to http://jaeger-all-in-one:14250, but you can change this behavior by setting the corresponding env variables. Refer to this doc for all configs.

Once the above is set up, each request sent to GMS should produce a detailed trace in the tracing collector of your choice. We added the @WithSpan annotation in various places to make the trace more readable. Our example docker-compose deploys an instance of Jaeger on port 16686, so the traces should be available at http://localhost:16686.

Metrics

With tracing, we can observe how a request flows through our system into the persistence layer. However, for a more holistic picture, we need to be able to export metrics and measure them over time. Unfortunately, OpenTelemetry's Java metrics library is still in active development.

As such, we decided to use Dropwizard Metrics to export custom metrics to JMX, and then use Prometheus-JMX exporter to export all JMX metrics to Prometheus. This allows our code base to be independent of the metrics collection tool, making it easy for people to use their tool of choice. You can enable the agent by setting env variable ENABLE_PROMETHEUS to true for GMS and MAE/MCE consumers. Refer to this example docker-compose for setting the variables.
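The JMX hop in this pipeline can be illustrated with the JDK alone. The sketch below hand-writes a standard MBean, registers it on the platform MBean server, and reads the value back the way any JMX client (such as a JMX exporter) would. The class names and the object name are hypothetical. DataHub relies on Dropwizard's JMX reporter rather than hand-written MBeans like this one.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicLong;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Illustration only: hypothetical names, not DataHub's or Dropwizard's classes.
final class JmxCounterSketch {

    /** Standard MBean contract: the interface name is the class name plus "MBean". */
    public interface RequestCounterMBean {
        long getCount();
    }

    public static final class RequestCounter implements RequestCounterMBean {
        private final AtomicLong count = new AtomicLong();

        void increment() {
            count.incrementAndGet();
        }

        @Override
        public long getCount() {
            return count.get();
        }
    }

    /** Registers a counter, increments it, and reads the "Count" attribute back through JMX. */
    static long publishAndRead(String objectName, int increments) {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        try {
            ObjectName name = new ObjectName(objectName);
            RequestCounter counter = new RequestCounter();
            server.registerMBean(counter, name);
            try {
                for (int i = 0; i < increments; i++) {
                    counter.increment();
                }
                return (Long) server.getAttribute(name, "Count");
            } finally {
                server.unregisterMBean(name); // keep the sketch re-runnable
            }
        } catch (JMException e) {
            throw new IllegalStateException("JMX operation failed", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical object name chosen for this example.
        System.out.println(publishAndRead("com.example.sketch:type=RequestCounter", 3));
    }
}
```

Because the application only publishes to JMX, the scraping side can be swapped without touching application code, which is the independence described above.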

In our example docker-compose, we have configured Prometheus to scrape port 4318 of each container, which is the port the JMX exporter uses to export metrics. We also configured Grafana to read from Prometheus and create useful dashboards. By default, we provide two dashboards: a JVM dashboard and a DataHub dashboard.

In the JVM dashboard, you can find detailed charts based on JVM metrics like CPU/memory/disk usage. In the DataHub dashboard, you can find charts to monitor each endpoint and the Kafka topics. Using the example implementation, go to http://localhost:3001 to find the Grafana dashboards (username: admin, password: admin).

To make it easy to track various metrics within the code base, we created the MetricUtils class. This util class creates a central metric registry, sets up the JMX reporter, and provides convenient functions for setting up counters and timers. You can run the following to create a counter and increment it.

MetricUtils.counter(this.getClass(),"metricName").increment();

You can run the following to time a block of code.

// Timer.time() starts the clock; closing the context records the elapsed time.
try (Timer.Context ignored = MetricUtils.timer(this.getClass(), "timerName").time()) {
    // ...block of code
}
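The two snippets above depend on DataHub and Dropwizard classes. For readers who want to try the counter and try-with-resources timing pattern in isolation, here is a JDK-only sketch. MetricsSketch and all of its methods are hypothetical stand-ins and are not the MetricUtils or Dropwizard API. The class-plus-label naming is an assumption made for this example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical stand-in for illustration; NOT DataHub's MetricUtils or Dropwizard's API.
final class MetricsSketch {
    private static final Map<String, LongAdder> COUNTERS = new ConcurrentHashMap<>();
    private static final Map<String, LongAdder> TIMER_NANOS = new ConcurrentHashMap<>();

    /** AutoCloseable with close() narrowed so callers need no catch block. */
    interface TimerContext extends AutoCloseable {
        @Override
        void close();
    }

    /** Builds a dotted metric name from the owning class and a short label. */
    static String name(Class<?> owner, String metric) {
        return owner.getName() + "." + metric;
    }

    static void increment(Class<?> owner, String metric) {
        COUNTERS.computeIfAbsent(name(owner, metric), k -> new LongAdder()).increment();
    }

    static long count(Class<?> owner, String metric) {
        LongAdder adder = COUNTERS.get(name(owner, metric));
        return adder == null ? 0 : adder.sum();
    }

    /** Starts timing; closing the returned context adds the elapsed nanoseconds. */
    static TimerContext time(Class<?> owner, String metric) {
        long start = System.nanoTime();
        LongAdder total = TIMER_NANOS.computeIfAbsent(name(owner, metric), k -> new LongAdder());
        return () -> total.add(System.nanoTime() - start);
    }

    static long totalNanos(Class<?> owner, String metric) {
        LongAdder adder = TIMER_NANOS.get(name(owner, metric));
        return adder == null ? 0 : adder.sum();
    }

    public static void main(String[] args) {
        increment(MetricsSketch.class, "requests");
        try (TimerContext ignored = time(MetricsSketch.class, "work")) {
            Math.sqrt(42.0); // ...block of code being timed
        }
        System.out.println(count(MetricsSketch.class, "requests") + " request(s), "
                + totalNanos(MetricsSketch.class, "work") + " ns timed");
    }
}
```

The try-with-resources form guarantees the timer is stopped even if the timed block throws, which is the main reason to prefer it over manual start and stop calls.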

Enable monitoring through docker-compose

We provide some example configuration for enabling monitoring in this directory. Take a look at the docker-compose files, which add the necessary env variables to existing containers and spawn new containers (Jaeger, Prometheus, Grafana).

You can layer the monitoring docker-compose file on top of an existing one by passing -f <path-to-compose-file> when running docker-compose commands. For instance,

docker-compose \
-f quickstart/docker-compose.quickstart.yml \
-f monitoring/docker-compose.monitoring.yml \
pull && \
docker-compose -p datahub \
-f quickstart/docker-compose.quickstart.yml \
-f monitoring/docker-compose.monitoring.yml \
up

We set up quickstart.sh, dev.sh, and dev-without-neo4j.sh to add the above docker-compose when MONITORING=true. For instance, MONITORING=true ./docker/quickstart.sh will add the correct env variables to start collecting traces and metrics, and will also deploy Jaeger, Prometheus, and Grafana. We will soon support this as a flag during quickstart.

Health check endpoint

To monitor the health of your DataHub service, you can use the /admin endpoint.
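A probe against this endpoint could look like the sketch below, which uses only the JDK HTTP client. The base URL http://localhost:8080 and the rule that any 2xx status means healthy are assumptions for illustration. Adjust both to match your deployment and what the endpoint actually returns.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical helper for illustration; not part of DataHub.
final class HealthProbe {

    /** Returns true when a GET on the given URI answers with a 2xx status (assumed rule). */
    static boolean isHealthy(URI adminUri) {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(2))
                .build();
        HttpRequest request = HttpRequest.newBuilder(adminUri)
                .timeout(Duration.ofSeconds(5))
                .GET()
                .build();
        try {
            HttpResponse<Void> response =
                    client.send(request, HttpResponse.BodyHandlers.discarding());
            return response.statusCode() / 100 == 2;
        } catch (IOException e) {
            return false; // unreachable, refused, or timed out counts as unhealthy
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed address; pass your own base URL as the first argument.
        String base = args.length > 0 ? args[0] : "http://localhost:8080";
        System.out.println(isHealthy(URI.create(base + "/admin")) ? "healthy" : "unhealthy");
    }
}
```

Treating connection failures as unhealthy rather than throwing keeps the probe easy to call from a scheduler or a readiness check.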