MongoDB

Module mongodb

Certified

Important Capabilities

| Capability | Status | Notes |
| --- | --- | --- |
| Table-Level Lineage | | Enabled by default |

This plugin extracts the following:

  • Databases and associated metadata
  • Collections in each database and schemas for each collection (via schema inference)

By default, schema inference samples 1,000 documents from each collection. Setting schemaSamplingSize: null scans the entire collection instead. Setting useRandomSampling: False takes the first documents found rather than a random selection, which may be faster for large collections.

Note that schemaSamplingSize has no effect if enableSchemaInference: False is set.
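
The sampling options described above can be pictured as a choice between MongoDB aggregation stages. The following is a minimal sketch, not the plugin's actual code: the function name `build_sampling_pipeline` is hypothetical, and the mapping of `useRandomSampling` to the `$sample` and `$limit` stages is an assumption about one reasonable way to implement the documented behaviour.

```python
from typing import Any, Dict, List, Optional


def build_sampling_pipeline(
    sample_size: Optional[int] = 1000,
    use_random_sampling: bool = True,
) -> List[Dict[str, Any]]:
    """Return aggregation stages that pick documents for schema inference.

    Hypothetical helper: mirrors schemaSamplingSize / useRandomSampling.
    """
    if sample_size is None:
        # No limit: every document in the collection is scanned.
        return []
    if use_random_sampling:
        # $sample selects documents pseudo-randomly.
        return [{"$sample": {"size": sample_size}}]
    # $limit takes the first documents the server returns.
    return [{"$limit": sample_size}]
```

With pymongo, the returned list could be passed to `collection.aggregate(...)`. An empty pipeline returns every document, which corresponds to schemaSamplingSize: null.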

Very large schemas are truncated to a maximum of 300 schema fields. This limit is configurable using the maxSchemaSize parameter.
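
To make schema inference and truncation concrete, here is a small sketch that counts field paths across sampled documents and keeps at most `max_schema_size` of them. The helper names are hypothetical, and ranking fields by how often they occur is an assumption: the text above only says that large schemas are truncated, not which fields are kept.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List


def infer_fields(documents: Iterable[Dict[str, Any]]) -> Counter:
    """Count how often each dotted field path appears in the sample."""
    counts: Counter = Counter()

    def walk(value: Dict[str, Any], prefix: str) -> None:
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            counts[path] += 1
            if isinstance(child, dict):
                # Descend into embedded documents.
                walk(child, path)

    for doc in documents:
        walk(doc, "")
    return counts


def truncate_schema(counts: Counter, max_schema_size: int = 300) -> List[str]:
    """Keep at most max_schema_size paths (assumed order: most frequent first)."""
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [path for path, _ in ranked[:max_schema_size]]
```

Ties are broken alphabetically here only to keep the output deterministic.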

Install the Plugin

pip install 'acryl-datahub[mongodb]'

Quickstart Recipe

Check out the following recipe to get started with ingestion! See below for full configuration options.

For general pointers on writing and running a recipe, see our main recipe guide.

source:
  type: "mongodb"
  config:
    # Coordinates
    connect_uri: "mongodb://localhost"

    # Credentials
    username: admin
    password: password
    authMechanism: "DEFAULT"

    # Options
    enableSchemaInference: True
    useRandomSampling: True
    maxSchemaSize: 300

sink:
  # sink configs
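
The same recipe can also be expressed as a Python dictionary and run programmatically. This is a sketch under assumptions: `build_recipe` and `run_recipe` are hypothetical helper names, the `console` sink is chosen only so the example needs no DataHub server, and running it requires the `acryl-datahub[mongodb]` package and a reachable MongoDB instance.

```python
from typing import Any, Dict


def build_recipe(connect_uri: str = "mongodb://localhost") -> Dict[str, Any]:
    """Build the quickstart recipe as a plain dictionary."""
    return {
        "source": {
            "type": "mongodb",
            "config": {
                # Coordinates
                "connect_uri": connect_uri,
                # Credentials (placeholder values from the recipe above)
                "username": "admin",
                "password": "password",
                "authMechanism": "DEFAULT",
                # Options
                "enableSchemaInference": True,
                "useRandomSampling": True,
                "maxSchemaSize": 300,
            },
        },
        # Assumption: print metadata to stdout instead of sending it anywhere.
        "sink": {"type": "console"},
    }


def run_recipe(recipe: Dict[str, Any]) -> None:
    """Run the recipe; imported lazily so build_recipe works without datahub."""
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(recipe)
    pipeline.run()
    pipeline.raise_from_status()
```

Calling `run_recipe(build_recipe())` would then start ingestion against the local MongoDB instance.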

Config Details

Note that a . is used to denote nested fields in the YAML recipe.
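
As an illustration of the dot notation, the sketch below expands dotted field names into the nested structure they stand for in a recipe. The function `nest_dotted_keys` is a hypothetical helper written for this example, not part of the plugin.

```python
from typing import Any, Dict


def nest_dotted_keys(flat: Dict[str, Any]) -> Dict[str, Any]:
    """Expand keys such as 'database_pattern.allow' into nested dictionaries."""
    nested: Dict[str, Any] = {}
    for dotted_key, value in flat.items():
        *parents, leaf = dotted_key.split(".")
        node = nested
        for part in parents:
            # Create intermediate levels on demand.
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested
```

For example, `database_pattern.allow` in the table corresponds to an `allow` key nested under `database_pattern` in the YAML config.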

View All Configuration Options
| Field | Required | Type | Description | Default |
| --- | --- | --- | --- | --- |
| env | | string | The environment that all assets produced by this connector belong to. | PROD |
| connect_uri | | string | MongoDB connection URI. | mongodb://localhost |
| username | | string | MongoDB username. | None |
| password | | string | MongoDB password. | None |
| authMechanism | | string | MongoDB authentication mechanism. | None |
| options | | Dict | Additional options to pass to pymongo.MongoClient(). | {} |
| enableSchemaInference | | boolean | Whether to infer schemas. | True |
| schemaSamplingSize | | integer | Number of documents to sample when inferring the schema. If set to null, all documents will be scanned. | 1000 |
| useRandomSampling | | boolean | Whether documents for schema inference are selected randomly. If False, documents are taken from the start of the collection. | True |
| maxSchemaSize | | integer | Maximum number of fields to include in the schema. | 300 |
| maxDocumentSize | | integer | | 16793600 |
| database_pattern | | AllowDenyPattern (see below for fields) | Regex patterns for databases to filter in ingestion. | {'allow': ['.*'], 'deny': [], 'ignoreCase': True, 'alphabet': '[A-Za-z0-9 _.-]'} |
| database_pattern.allow | | Array of string | List of regex patterns for databases to include in ingestion. | ['.*'] |
| database_pattern.deny | | Array of string | List of regex patterns for databases to exclude from ingestion. | [] |
| database_pattern.ignoreCase | | boolean | Whether to ignore case during pattern matching. | True |
| database_pattern.alphabet | | string | Allowed alphabets pattern. | [A-Za-z0-9 _.-] |
| collection_pattern | | AllowDenyPattern (see below for fields) | Regex patterns for collections to filter in ingestion. | {'allow': ['.*'], 'deny': [], 'ignoreCase': True, 'alphabet': '[A-Za-z0-9 _.-]'} |
| collection_pattern.allow | | Array of string | List of regex patterns for collections to include in ingestion. | ['.*'] |
| collection_pattern.deny | | Array of string | List of regex patterns for collections to exclude from ingestion. | [] |
| collection_pattern.ignoreCase | | boolean | Whether to ignore case during pattern matching. | True |
| collection_pattern.alphabet | | string | Allowed alphabets pattern. | [A-Za-z0-9 _.-] |
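
The allow, deny, and ignoreCase fields can be read as a simple filter over database or collection names. The sketch below is an approximation for illustration: `is_allowed` is a hypothetical function, and both the precedence of deny over allow and the use of `re.match` (which anchors at the start of the name) are assumptions rather than documented behaviour.

```python
import re
from typing import Sequence


def is_allowed(
    name: str,
    allow: Sequence[str] = (".*",),
    deny: Sequence[str] = (),
    ignore_case: bool = True,
) -> bool:
    """Return True if name matches an allow pattern and no deny pattern."""
    flags = re.IGNORECASE if ignore_case else 0
    # Assumption: a deny match always wins over an allow match.
    if any(re.match(pattern, name, flags) for pattern in deny):
        return False
    return any(re.match(pattern, name, flags) for pattern in allow)
```

With the defaults every name passes. Adding `^admin$` to deny would exclude a database named admin while leaving all others in place.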

Code Coordinates

  • Class Name: datahub.ingestion.source.mongodb.MongoDBSource
  • Browse on GitHub

Questions

If you've got any questions on configuring ingestion for MongoDB, feel free to ping us on our Slack.