Version: Next

MongoDB

Important Capabilities

Capability	Status	Notes
Detect Deleted Entities	✅	Optionally enabled via `stateful_ingestion.remove_stale_metadata`
Platform Instance	✅	Enabled by default
Schema Metadata	✅	Enabled by default

This plugin extracts the following:

Databases and associated metadata
Collections in each database and schemas for each collection (via schema inference)

By default, schema inference samples 1,000 documents from each collection. Setting schemaSamplingSize: null scans the entire collection instead. Setting useRandomSampling: False samples the first documents found rather than selecting them at random, which may be faster for large collections.

Note that schemaSamplingSize has no effect if enableSchemaInference: False is set.

Schemas with more fields than the limit are truncated to a maximum of 300 schema fields by default. The limit is configurable using the maxSchemaSize parameter.
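
As an illustration of the sampling and truncation options described above, the following fragment is a minimal sketch of a source config that enlarges the sample, turns off random selection, and raises the field limit. The values 5000 and 500 are arbitrary examples, not recommendations.

```yaml
source:
  type: "mongodb"
  config:
    connect_uri: "mongodb://localhost"

    # schemaSamplingSize has no effect when this is False
    enableSchemaInference: True

    # Example value; the default is 1000, and null scans the whole collection
    schemaSamplingSize: 5000

    # Take the first documents found instead of a random sample
    useRandomSampling: False

    # Example value; the default is 300
    maxSchemaSize: 500
```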

CLI based Ingestion

Install the Plugin

pip install 'acryl-datahub[mongodb]'

Starter Recipe

Check out the following recipe to get started with ingestion! See below for full configuration options.

For general pointers on writing and running a recipe, see our main recipe guide.

source:
  type: "mongodb"
  config:
    # Coordinates
    connect_uri: "mongodb://localhost"

    # Credentials
    username: admin
    password: password
    authMechanism: "DEFAULT"

    # Options
    enableSchemaInference: True
    useRandomSampling: True
    maxSchemaSize: 300

sink:
  # sink configs

Config Details


Note that a . is used to denote nested fields in the YAML recipe.

Field	Description
authMechanism string	MongoDB authentication mechanism.
connect_uri string	MongoDB connection URI. Default: mongodb://localhost
enableSchemaInference boolean	Whether to infer schemas. Default: True
hostingEnvironment Enum	Hosting environment of MongoDB. Currently supported values: `SELF_HOSTED`, `ATLAS`, `AWS_DOCUMENTDB`. Default: SELF_HOSTED
maxDocumentSize integer	Default: 16793600
maxSchemaSize integer	Maximum number of fields to include in the schema. Default: 300
options object	Additional options to pass to `pymongo.MongoClient()`. Default: {}
password string	MongoDB password.
platform_instance string	The instance of the platform that all assets produced by this recipe belong to
schemaSamplingSize integer	Number of documents to use when inferring the schema. If set to `null`, all documents will be scanned. Default: 1000
useRandomSampling boolean	Whether documents for schema inference are selected at random. If `False`, documents are taken from the start of the collection. Default: True
username string	MongoDB username.
env string	The environment that all assets produced by this connector belong to. Default: PROD
collection_pattern AllowDenyPattern	Regex patterns for collections to filter in ingestion. Default: {'allow': ['.*'], 'deny': [], 'ignoreCase': True}
collection_pattern.ignoreCase boolean	Whether to ignore case sensitivity during pattern matching. Default: True
collection_pattern.allow array	List of regex patterns to include in ingestion. Default: ['.*']
collection_pattern.allow.string string
collection_pattern.deny array	List of regex patterns to exclude from ingestion. Default: []
collection_pattern.deny.string string
database_pattern AllowDenyPattern	Regex patterns for databases to filter in ingestion. Default: {'allow': ['.*'], 'deny': [], 'ignoreCase': True}
database_pattern.ignoreCase boolean	Whether to ignore case sensitivity during pattern matching. Default: True
database_pattern.allow array	List of regex patterns to include in ingestion. Default: ['.*']
database_pattern.allow.string string
database_pattern.deny array	List of regex patterns to exclude from ingestion. Default: []
database_pattern.deny.string string
stateful_ingestion StatefulStaleMetadataRemovalConfig	Base specialized config for Stateful Ingestion with stale metadata removal capability.
stateful_ingestion.enabled boolean	Whether to enable stateful ingestion. Default: True if a pipeline_name is set and either a datahub-rest sink or `datahub_api` is specified, otherwise False.
stateful_ingestion.remove_stale_metadata boolean	Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled. Default: True
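
The filtering and stateful ingestion options in the table above can be combined in a single recipe. The following is a sketch only: the pipeline name, the database and collection patterns, and the sink server URL are hypothetical placeholders chosen for illustration.

```yaml
# Hypothetical name; stateful ingestion relies on a pipeline_name being set
pipeline_name: "mongodb_prod_ingestion"

source:
  type: "mongodb"
  config:
    connect_uri: "mongodb://localhost"
    env: PROD

    # Only ingest the (hypothetical) "analytics" database
    database_pattern:
      allow:
        - "^analytics$"

    # Skip collections whose names end in "_tmp"
    collection_pattern:
      deny:
        - ".*_tmp$"
      ignoreCase: True

    stateful_ingestion:
      enabled: True
      # Soft-delete entities seen in the last successful run but missing now
      remove_stale_metadata: True

sink:
  type: "datahub-rest"
  config:
    # Assumed local DataHub instance; replace with your own server
    server: "http://localhost:8080"
```

Nested keys such as collection_pattern.deny in the table correspond to the indented YAML keys shown here.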

The JSONSchema for this configuration is inlined below.

{
  "title": "MongoDBConfig",
  "description": "Any source that connects to a platform should inherit this class",
  "type": "object",
  "properties": {
    "stateful_ingestion": {
      "$ref": "#/definitions/StatefulStaleMetadataRemovalConfig"
    },
    "env": {
      "title": "Env",
      "description": "The environment that all assets produced by this connector belong to",
      "default": "PROD",
      "type": "string"
    },
    "platform_instance": {
      "title": "Platform Instance",
      "description": "The instance of the platform that all assets produced by this recipe belong to",
      "type": "string"
    },
    "connect_uri": {
      "title": "Connect Uri",
      "description": "MongoDB connection URI.",
      "default": "mongodb://localhost",
      "type": "string"
    },
    "username": {
      "title": "Username",
      "description": "MongoDB username.",
      "type": "string"
    },
    "password": {
      "title": "Password",
      "description": "MongoDB password.",
      "type": "string"
    },
    "authMechanism": {
      "title": "Authmechanism",
      "description": "MongoDB authentication mechanism.",
      "type": "string"
    },
    "options": {
      "title": "Options",
      "description": "Additional options to pass to `pymongo.MongoClient()`.",
      "default": {},
      "type": "object"
    },
    "enableSchemaInference": {
      "title": "Enableschemainference",
      "description": "Whether to infer schemas. ",
      "default": true,
      "type": "boolean"
    },
    "schemaSamplingSize": {
      "title": "Schemasamplingsize",
      "description": "Number of documents to use when inferring schema size. If set to `null`, all documents will be scanned.",
      "default": 1000,
      "exclusiveMinimum": 0,
      "type": "integer"
    },
    "useRandomSampling": {
      "title": "Userandomsampling",
      "description": "If documents for schema inference should be randomly selected. If `False`, documents will be selected from start.",
      "default": true,
      "type": "boolean"
    },
    "maxSchemaSize": {
      "title": "Maxschemasize",
      "description": "Maximum number of fields to include in the schema.",
      "default": 300,
      "exclusiveMinimum": 0,
      "type": "integer"
    },
    "maxDocumentSize": {
      "title": "Maxdocumentsize",
      "default": 16793600,
      "exclusiveMinimum": 0,
      "type": "integer"
    },
    "hostingEnvironment": {
      "description": "Hosting environment of MongoDB, default is SELF_HOSTED, currently support `SELF_HOSTED`, `ATLAS`, `AWS_DOCUMENTDB`",
      "default": "SELF_HOSTED",
      "allOf": [
        {
          "$ref": "#/definitions/HostingEnvironment"
        }
      ]
    },
    "database_pattern": {
      "title": "Database Pattern",
      "description": "regex patterns for databases to filter in ingestion.",
      "default": {
        "allow": [
          ".*"
        ],
        "deny": [],
        "ignoreCase": true
      },
      "allOf": [
        {
          "$ref": "#/definitions/AllowDenyPattern"
        }
      ]
    },
    "collection_pattern": {
      "title": "Collection Pattern",
      "description": "regex patterns for collections to filter in ingestion.",
      "default": {
        "allow": [
          ".*"
        ],
        "deny": [],
        "ignoreCase": true
      },
      "allOf": [
        {
          "$ref": "#/definitions/AllowDenyPattern"
        }
      ]
    }
  },
  "additionalProperties": false,
  "definitions": {
    "DynamicTypedStateProviderConfig": {
      "title": "DynamicTypedStateProviderConfig",
      "type": "object",
      "properties": {
        "type": {
          "title": "Type",
          "description": "The type of the state provider to use. For DataHub use `datahub`",
          "type": "string"
        },
        "config": {
          "title": "Config",
          "description": "The configuration required for initializing the state provider. Default: The datahub_api config if set at pipeline level. Otherwise, the default DatahubClientConfig. See the defaults (https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/graph/client.py#L19).",
          "default": {},
          "type": "object"
        }
      },
      "required": [
        "type"
      ],
      "additionalProperties": false
    },
    "StatefulStaleMetadataRemovalConfig": {
      "title": "StatefulStaleMetadataRemovalConfig",
      "description": "Base specialized config for Stateful Ingestion with stale metadata removal capability.",
      "type": "object",
      "properties": {
        "enabled": {
          "title": "Enabled",
          "description": "Whether or not to enable stateful ingest. Default: True if a pipeline_name is set and either a datahub-rest sink or `datahub_api` is specified, otherwise False",
          "default": false,
          "type": "boolean"
        },
        "remove_stale_metadata": {
          "title": "Remove Stale Metadata",
          "description": "Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.",
          "default": true,
          "type": "boolean"
        }
      },
      "additionalProperties": false
    },
    "HostingEnvironment": {
      "title": "HostingEnvironment",
      "description": "An enumeration.",
      "enum": [
        "SELF_HOSTED",
        "ATLAS",
        "AWS_DOCUMENTDB"
      ]
    },
    "AllowDenyPattern": {
      "title": "AllowDenyPattern",
      "description": "A class to store allow deny regexes",
      "type": "object",
      "properties": {
        "allow": {
          "title": "Allow",
          "description": "List of regex patterns to include in ingestion",
          "default": [
            ".*"
          ],
          "type": "array",
          "items": {
            "type": "string"
          }
        },
        "deny": {
          "title": "Deny",
          "description": "List of regex patterns to exclude from ingestion.",
          "default": [],
          "type": "array",
          "items": {
            "type": "string"
          }
        },
        "ignoreCase": {
          "title": "Ignorecase",
          "description": "Whether to ignore case sensitivity during pattern matching.",
          "default": true,
          "type": "boolean"
        }
      },
      "additionalProperties": false
    }
  }
}

Code Coordinates

Class Name: datahub.ingestion.source.mongodb.MongoDBSource

Questions

If you've got any questions on configuring ingestion for MongoDB, feel free to ping us on our Slack.
