A urn path extractor specifies which parts of the urn need to be indexed, along with the logic for obtaining them from the urn. Refer to DatasetUrnPathExtractor as an example.
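As a rough illustration of what such an extractor does, the sketch below splits a dataset urn into a map from path to value. It is written in Python for brevity and is not the DatasetUrnPathExtractor implementation: the function name, the assumed urn layout urn:li:dataset:(platform,name,origin), and the path names /platform, /datasetName and /origin are all assumptions made for this example.

```python
from typing import Dict


def extract_dataset_urn_paths(urn: str) -> Dict[str, str]:
    """Split a dataset urn into the parts a secondary index might store.

    Assumes the layout urn:li:dataset:(<platform>,<name>,<origin>).
    The returned path names are hypothetical.
    """
    prefix = "urn:li:dataset:("
    if not urn.startswith(prefix) or not urn.endswith(")"):
        raise ValueError(f"not a dataset urn: {urn!r}")

    # Drop the prefix and the closing parenthesis.
    inner = urn[len(prefix):-1]
    parts = inner.split(",")
    if len(parts) < 3:
        raise ValueError(f"expected platform, name and origin in: {urn!r}")

    # The first part is the platform and the last is the origin;
    # everything in between is treated as the dataset name.
    return {
        "/platform": parts[0],
        "/datasetName": ",".join(parts[1:-1]),
        "/origin": parts[-1],
    }
```

Each key in the returned map corresponds to one indexable part of the urn, so a filter on, say, /origin can be answered from the index without parsing urns at query time.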
Enable SCSI by adding your entity's variable to the docker environment file of datahub-gms. Each entity has its own environment variable. If the variable for your entity is already defined in the docker environment file, make sure it is set so that SCSI is enabled.
Import the docker environment variable in your local DAO factory to enable SCSI. Refer to DatasetDaoFactory as an example.
In addition to the urn parts, you may want to index certain fields of an aspect. The indexable aspect fields of a given entity are configured in a JSON file, which must be provided when your local DAO is instantiated. Refer to the storage config for datasets as an example.
If you have already enabled SCSI, the write path ensures that every new urn inserted into the primary document store (i.e. the metadata_aspect table) is also inserted into the index table. However, for urns that already exist in the metadata_aspect table, you will need to bootstrap the index table. Refer to the bootstrap script for datasets as an example.
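The difference between the write path and the bootstrap can be sketched as follows. This is a self-contained illustration, not the actual bootstrap script: it uses an in-memory SQLite database in place of MySQL, and the index table name metadata_index, the column layouts of both tables, and all function names are assumptions made for this example.

```python
import sqlite3


def create_tables(conn: sqlite3.Connection) -> None:
    # Simplified, hypothetical schemas for the two tables.
    conn.execute(
        "CREATE TABLE metadata_aspect ("
        "urn TEXT, aspect TEXT, metadata TEXT, PRIMARY KEY (urn, aspect))"
    )
    conn.execute(
        "CREATE TABLE metadata_index ("
        "urn TEXT, path TEXT, value TEXT, PRIMARY KEY (urn, path))"
    )


def index_urn(conn: sqlite3.Connection, urn: str) -> None:
    # For brevity, a single index row per urn; a real index would
    # hold one row per extracted urn part.
    conn.execute(
        "INSERT OR IGNORE INTO metadata_index (urn, path, value) "
        "VALUES (?, ?, ?)",
        (urn, "/urn", urn),
    )


def write_aspect(conn, urn, aspect, metadata, scsi_enabled):
    """Write path: with SCSI enabled, the index is updated in the same transaction."""
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO metadata_aspect (urn, aspect, metadata) "
            "VALUES (?, ?, ?)",
            (urn, aspect, metadata),
        )
        if scsi_enabled:
            index_urn(conn, urn)


def bootstrap_index(conn: sqlite3.Connection) -> int:
    """Backfill index rows for urns written before SCSI was enabled."""
    missing = conn.execute(
        "SELECT DISTINCT urn FROM metadata_aspect "
        "WHERE urn NOT IN (SELECT urn FROM metadata_index)"
    ).fetchall()
    with conn:
        for (urn,) in missing:
            index_urn(conn, urn)
    return len(missing)
```

Running bootstrap_index a second time finds nothing left to backfill, so in this sketch the bootstrap is safe to repeat.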
BaseEntityResource currently exposes a Finder resource method called filter, which returns a list of entities that satisfy the filter conditions specified in the query parameters. Refer to the Datasets resource to see how to override the filter method. Once the resource method is defined, you can also expose client methods that take different input arguments. Refer to the listUrnsFromIndex and filter methods in the Datasets client for reference.
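Conceptually, a filter call matches indexed values against a set of conditions and returns one page of results. The sketch below shows that idea in Python. It is not the BaseEntityResource or Datasets client API: the Criterion class, the filter_urns function, the equality-only matching, and the start/count pagination parameters are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class Criterion:
    """A hypothetical equality condition on one indexed path."""
    path: str
    value: str


def filter_urns(
    index: Dict[str, Dict[str, str]],
    criteria: Sequence[Criterion],
    start: int = 0,
    count: int = 10,
) -> List[str]:
    """Return one page of urns whose indexed values satisfy every criterion.

    `index` maps each urn to its indexed path/value pairs.
    """
    matching = sorted(
        urn
        for urn, paths in index.items()
        if all(paths.get(c.path) == c.value for c in criteria)
    )
    # Sorting gives a stable order so that consecutive pages do not overlap.
    return matching[start:start + count]
```

An empty criteria list matches every urn, which corresponds to listing all indexed urns page by page.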
Once you have onboarded your entity to SCSI, you can test the changes as described below.
For the steps below, we assume you have already enabled SCSI by following the steps mentioned above.
Run the ingestion script, if you haven't already, using:
Connect to the MySQL server and you should be able to see the records.
Now that the urn parts are ingested, we will try some API calls in the following section.
Note that the results are paginated.
Get all datasets along with the Ownership aspects (if they exist)
The storage config for datasets looks like the following:
This means that the removed field of the Status aspect should be indexed in SCSI.
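To make the effect of such a config concrete, the sketch below derives index rows from an aspect value according to a list of indexed fields. The actual storage config file is not reproduced here: the INDEXED_FIELDS dictionary is a hypothetical stand-in for it, and the index_rows function and the row layout are likewise assumptions made for this example.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical stand-in for a storage config: aspect name -> fields to index.
INDEXED_FIELDS: Dict[str, List[str]] = {"Status": ["removed"]}


def index_rows(
    urn: str,
    aspect_name: str,
    aspect_value: Dict[str, Any],
    config: Dict[str, List[str]] = INDEXED_FIELDS,
) -> List[Tuple[str, str, str, Any]]:
    """Return (urn, aspect, path, value) rows for the configured fields only."""
    rows = []
    for field in config.get(aspect_name, []):
        # Fields absent from the aspect value produce no index row.
        if field in aspect_value:
            rows.append((urn, aspect_name, "/" + field, aspect_value[field]))
    return rows
```

Aspects that are not listed in the config, such as Ownership in this stand-in, produce no index rows at all, so only the configured fields can be used in filter conditions.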
None of the dataset urns ingested so far has a Status aspect. Let us ingest a new dataset with several metadata aspects, including the Status aspect.
You should be able to see the urn parts of the newly ingested urn in the index table, along with the removed field of the Status aspect.
Next, let's try some API calls to test the filter conditions.
Get all dataset urns that are non-removed, i.e. whose removed field in the Status aspect is false.
You can try similar API calls to return metadata aspects of urns that meet the filter criteria.