Classification

The classification feature enables sources to be configured to automatically predict info types for columns and use them as glossary terms. This is an explicit opt-in feature and is not enabled by default.

Config details

Note that a . is used to denote nested fields in the YAML recipe.

| Field | Required | Type | Description | Default |
| --- | --- | --- | --- | --- |
| enabled | | boolean | Whether classification should be used to auto-detect glossary terms | False |
| info_type_to_term | | Dict[str,string] | Optional mapping to provide glossary term identifier for info type. | By default, info type is used as glossary term identifier. |
| classifiers | | Array of object | Classifiers to use to auto-detect glossary terms. If more than one classifier is configured, infotype predictions from the classifier defined later in the sequence take precedence. | [{'type': 'datahub', 'config': None}] |
| table_pattern | | AllowDenyPattern (see below for fields) | Regex patterns to filter tables for classification. This is used in combination with other patterns in parent config. Specify regex to match the entire table name in database.schema.table format. e.g. to match all tables starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.*' | {'allow': ['.*'], 'deny': [], 'ignoreCase': True} |
| table_pattern.allow | | Array of string | List of regex patterns to include in ingestion | ['.*'] |
| table_pattern.deny | | Array of string | List of regex patterns to exclude from ingestion. | [] |
| table_pattern.ignoreCase | | boolean | Whether to ignore case sensitivity during pattern matching. | True |
| column_pattern | | AllowDenyPattern (see below for fields) | Regex patterns to filter columns for classification. This is used in combination with other patterns in parent config. Specify regex to match the column name in database.schema.table.column format. | {'allow': ['.*'], 'deny': [], 'ignoreCase': True} |
| column_pattern.allow | | Array of string | List of regex patterns to include in ingestion | ['.*'] |
| column_pattern.deny | | Array of string | List of regex patterns to exclude from ingestion. | [] |
| column_pattern.ignoreCase | | boolean | Whether to ignore case sensitivity during pattern matching. | True |
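
To make the allow/deny semantics of table_pattern and column_pattern concrete, here is a minimal Python sketch. It is not DataHub's AllowDenyPattern implementation: the helper name is_allowed is hypothetical, and it assumes that patterns are anchored at the start of the name (re.match) and that a deny match wins over an allow match.

```python
import re
from typing import List


def is_allowed(name: str, allow: List[str], deny: List[str], ignore_case: bool = True) -> bool:
    """Return True if `name` matches some allow pattern and no deny pattern.

    Assumption: patterns are matched from the start of the name (re.match),
    and a deny match takes priority over an allow match.
    """
    flags = re.IGNORECASE if ignore_case else 0
    if any(re.match(pattern, name, flags) for pattern in deny):
        return False
    return any(re.match(pattern, name, flags) for pattern in allow)


# Hypothetical patterns in database.schema.table format.
allow = [r"Customer\.public\.customer.*"]
deny = [r".*_backup$"]

print(is_allowed("Customer.public.customer_orders", allow, deny))
print(is_allowed("CUSTOMER.PUBLIC.CUSTOMER_ORDERS", allow, deny))
print(is_allowed("Customer.public.customer_orders_backup", allow, deny))
print(is_allowed("Sales.public.customer_orders", allow, deny))
```

Escaping the dots (\.) keeps them from matching arbitrary characters. The unescaped form shown in the table also matches the intended names, but is slightly more permissive.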

DataHub Classifier

DataHub Classifier is the default classifier implementation. It uses the acryl-datahub-classify library to predict info types.

Config Details

| Field | Required | Type | Description | Default |
| --- | --- | --- | --- | --- |
| confidence_level_threshold | | number | | 0.6 |
| info_types | | list[string] | List of infotypes to be predicted. By default, all supported infotypes are considered. If specified, this should be a subset of ['Email_Address', 'Gender', 'Credit_Debit_Card_Number', 'Phone_Number', 'Street_Address', 'Full_Name', 'Age', 'IBAN', 'US_Social_Security_Number', 'Vehicle_Identification_Number', 'IP_Address_v4', 'IP_Address_v6', 'US_Driving_License_Number', 'Swift_Code'] | None |
| info_types_config | | Dict[str, InfoTypeConfig] | Configuration details for infotypes | See reference_input.py for default configuration. |
| info_types_config.key.prediction_factors_and_weights | ❓ (required if info_types_config.key is set) | Dict[str,number] | Factors and their weights to consider when predicting info types | |
| info_types_config.key.name | | NameFactorConfig (see below for fields) | | |
| info_types_config.key.name.regex | | Array of string | List of regex patterns the column name follows for the info type | ['.*'] |
| info_types_config.key.description | | DescriptionFactorConfig (see below for fields) | | |
| info_types_config.key.description.regex | | Array of string | List of regex patterns the column description follows for the info type | ['.*'] |
| info_types_config.key.datatype | | DataTypeFactorConfig (see below for fields) | | |
| info_types_config.key.datatype.type | | Array of string | List of data types for the info type | ['.*'] |
| info_types_config.key.values | | ValuesFactorConfig (see below for fields) | | |
| info_types_config.key.values.prediction_type | ❓ (required if info_types_config.key.values is set) | string | | None |
| info_types_config.key.values.regex | | Array of string | List of regex patterns the column value follows for the info type | None |
| info_types_config.key.values.library | | Array of string | Library used for prediction | None |
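
One plausible reading of prediction_factors_and_weights and confidence_level_threshold is a weighted sum of per-factor scores compared against the threshold. The Python sketch below illustrates that reading only. The function weighted_confidence, the per-factor scores, and the ">=" comparison are assumptions made for illustration and are not taken from the acryl-datahub-classify source.

```python
from typing import Dict


def weighted_confidence(weights: Dict[str, float], factor_scores: Dict[str, float]) -> float:
    """Combine per-factor scores (each assumed to lie in [0, 1]) using the configured weights.

    Factors without a score contribute nothing to the total.
    """
    return sum(weight * factor_scores.get(factor, 0.0) for factor, weight in weights.items())


# Weights in the shape of prediction_factors_and_weights.
weights = {"name": 0.4, "description": 0.0, "datatype": 0.0, "values": 0.6}

# Hypothetical scores: the column name matched a regex, and half of the
# sampled values matched a value pattern.
scores = {"name": 1.0, "values": 0.5}

confidence = weighted_confidence(weights, scores)  # roughly 0.4 + 0.3
threshold = 0.6  # the documented default for confidence_level_threshold
is_predicted = confidence >= threshold  # assumption: inclusive comparison

print(round(confidence, 2), is_predicted)
```

Under this reading, a factor with weight 0 never influences the prediction, even when a matching regex is configured for it.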

Supported sources

  • snowflake

Example

source:
  type: snowflake
  config:
    env: PROD
    # Coordinates
    account_id: account_name
    warehouse: "COMPUTE_WH"

    # Credentials
    username: user
    password: pass
    role: "sysadmin"

    # Options
    top_n_queries: 10
    email_domain: mycompany.com

    classification:
      enabled: True
      classifiers:
        - type: datahub
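
The classifiers list and the info_type_to_term mapping interact in two documented ways: predictions from a classifier defined later in the list take precedence, and an info type without a mapping is used as the glossary term identifier itself. The Python sketch below illustrates both rules. The helpers merge_predictions and to_glossary_term, the column names, and the sample predictions are hypothetical and do not reflect DataHub's internal code.

```python
from typing import Dict, List, Optional


def merge_predictions(per_classifier: List[Dict[str, str]]) -> Dict[str, str]:
    """Merge column -> info type predictions; later classifiers override earlier ones."""
    merged: Dict[str, str] = {}
    for predictions in per_classifier:
        merged.update(predictions)
    return merged


def to_glossary_term(info_type: str, info_type_to_term: Optional[Dict[str, str]] = None) -> str:
    """Map an info type to a glossary term identifier, defaulting to the info type itself."""
    return (info_type_to_term or {}).get(info_type, info_type)


# Hypothetical predictions from two classifiers, in the order they are configured.
first = {"contact": "Phone_Number", "email": "Email_Address", "age": "Age"}
second = {"contact": "Email_Address"}

merged = merge_predictions([first, second])
terms = {column: to_glossary_term(info_type, {"Email_Address": "Email"})
         for column, info_type in merged.items()}
print(terms)
```

Here the second classifier's prediction for the contact column replaces the first one, and Age passes through unchanged because it has no entry in the mapping.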

Example with Advanced Configuration: Specifying custom info_types_config

source:
  type: snowflake
  config:
    env: PROD
    # Coordinates
    account_id: account_name
    warehouse: "COMPUTE_WH"

    # Credentials
    username: user
    password: pass
    role: "sysadmin"

    # Options
    top_n_queries: 10
    email_domain: mycompany.com

    classification:
      enabled: True
      info_type_to_term:
        Email_Address: "Email"
      classifiers:
        - type: datahub
          config:
            confidence_level_threshold: 0.7
            info_types_config:
              Email_Address:
                prediction_factors_and_weights:
                  name: 0.4
                  description: 0
                  datatype: 0
                  values: 0.6
                name:
                  regex:
                    - "^.*mail.*id.*$"
                    - "^.*id.*mail.*$"
                    - "^.*mail.*add.*$"
                    - "^.*add.*mail.*$"
                    - email
                    - mail
                description:
                  regex:
                    - "^.*mail.*id.*$"
                    - "^.*mail.*add.*$"
                    - email
                    - mail
                datatype:
                  type:
                    - str
                values:
                  prediction_type: regex
                  regex:
                    - "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Z|a-z]{2,}"
                  library: []
              Gender:
                prediction_factors_and_weights:
                  name: 0.4
                  description: 0
                  datatype: 0
                  values: 0.6
                name:
                  regex:
                    - "^.*gender.*$"
                    - "^.*sex.*$"
                    - gender
                    - sex
                description:
                  regex:
                    - "^.*gender.*$"
                    - "^.*sex.*$"
                    - gender
                    - sex
                datatype:
                  type:
                    - int
                    - str
                values:
                  prediction_type: regex
                  regex:
                    - male
                    - female
                    - man
                    - woman
                    - m
                    - f
                    - w
                    - men
                    - women
                  library: []
              Credit_Debit_Card_Number:
                prediction_factors_and_weights:
                  name: 0.4
                  description: 0
                  datatype: 0
                  values: 0.6
                name:
                  regex:
                    - "^.*card.*number.*$"
                    - "^.*number.*card.*$"
                    - "^.*credit.*card.*$"
                    - "^.*debit.*card.*$"
                description:
                  regex:
                    - "^.*card.*number.*$"
                    - "^.*number.*card.*$"
                    - "^.*credit.*card.*$"
                    - "^.*debit.*card.*$"
                datatype:
                  type:
                    - str
                    - int
                values:
                  prediction_type: regex
                  regex:
                    - "^4[0-9]{12}(?:[0-9]{3})?$"
                    - "^(?:5[1-5][0-9]{2}|222[1-9]|22[3-9][0-9]|2[3-6][0-9]{2}|27[01][0-9]|2720)[0-9]{12}$"
                    - "^3[47][0-9]{13}$"
                    - "^3(?:0[0-5]|[68][0-9])[0-9]{11}$"
                    - "^6(?:011|5[0-9]{2})[0-9]{12}$"
                    - "^(?:2131|1800|35\\d{3})\\d{11}$"
                    - "^(6541|6556)[0-9]{12}$"
                    - "^389[0-9]{11}$"
                    - "^63[7-9][0-9]{13}$"
                    - "^9[0-9]{15}$"
                    - "^(6304|6706|6709|6771)[0-9]{12,15}$"
                    - "^(5018|5020|5038|6304|6759|6761|6763)[0-9]{8,15}$"
                    - "^(62[0-9]{14,17})$"
                    - "^(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14})$"
                    - "^(4903|4905|4911|4936|6333|6759)[0-9]{12}|(4903|4905|4911|4936|6333|6759)[0-9]{14}|(4903|4905|4911|4936|6333|6759)[0-9]{15}|564182[0-9]{10}|564182[0-9]{12}|564182[0-9]{13}|633110[0-9]{10}|633110[0-9]{12}|633110[0-9]{13}$"
                    - "^(6334|6767)[0-9]{12}|(6334|6767)[0-9]{14}|(6334|6767)[0-9]{15}$"
                  library: []
              Phone_Number:
                prediction_factors_and_weights:
                  name: 0.4
                  description: 0
                  datatype: 0
                  values: 0.6
                name:
                  regex:
                    - ".*phone.*(num|no).*"
                    - ".*(num|no).*phone.*"
                    - ".*[^a-z]+ph[^a-z]+.*(num|no).*"
                    - ".*(num|no).*[^a-z]+ph[^a-z]+.*"
                    - ".*mobile.*(num|no).*"
                    - ".*(num|no).*mobile.*"
                    - ".*telephone.*(num|no).*"
                    - ".*(num|no).*telephone.*"
                    - ".*cell.*(num|no).*"
                    - ".*(num|no).*cell.*"
                    - ".*contact.*(num|no).*"
                    - ".*(num|no).*contact.*"
                    - ".*landline.*(num|no).*"
                    - ".*(num|no).*landline.*"
                    - ".*fax.*(num|no).*"
                    - ".*(num|no).*fax.*"
                    - phone
                    - telephone
                    - landline
                    - mobile
                    - tel
                    - fax
                    - cell
                    - contact
                description:
                  regex:
                    - ".*phone.*(num|no).*"
                    - ".*(num|no).*phone.*"
                    - ".*[^a-z]+ph[^a-z]+.*(num|no).*"
                    - ".*(num|no).*[^a-z]+ph[^a-z]+.*"
                    - ".*mobile.*(num|no).*"
                    - ".*(num|no).*mobile.*"
                    - ".*telephone.*(num|no).*"
                    - ".*(num|no).*telephone.*"
                    - ".*cell.*(num|no).*"
                    - ".*(num|no).*cell.*"
                    - ".*contact.*(num|no).*"
                    - ".*(num|no).*contact.*"
                    - ".*landline.*(num|no).*"
                    - ".*(num|no).*landline.*"
                    - ".*fax.*(num|no).*"
                    - ".*(num|no).*fax.*"
                    - phone
                    - telephone
                    - landline
                    - mobile
                    - tel
                    - fax
                    - cell
                    - contact
                datatype:
                  type:
                    - int
                    - str
                values:
                  prediction_type: library
                  regex: []
                  library:
                    - phonenumbers
              Street_Address:
                prediction_factors_and_weights:
                  name: 0.5
                  description: 0
                  datatype: 0
                  values: 0.5
                name:
                  regex:
                    - ".*street.*add.*"
                    - ".*add.*street.*"
                    - ".*full.*add.*"
                    - ".*add.*full.*"
                    - ".*mail.*add.*"
                    - ".*add.*mail.*"
                    - add[^a-z]+
                    - address
                    - street
                description:
                  regex:
                    - ".*street.*add.*"
                    - ".*add.*street.*"
                    - ".*full.*add.*"
                    - ".*add.*full.*"
                    - ".*mail.*add.*"
                    - ".*add.*mail.*"
                    - add[^a-z]+
                    - address
                    - street
                datatype:
                  type:
                    - str
                values:
                  prediction_type: library
                  regex: []
                  library:
                    - spacy
              Full_Name:
                prediction_factors_and_weights:
                  name: 0.3
                  description: 0
                  datatype: 0
                  values: 0.7
                name:
                  regex:
                    - ".*person.*name.*"
                    - ".*name.*person.*"
                    - ".*user.*name.*"
                    - ".*name.*user.*"
                    - ".*full.*name.*"
                    - ".*name.*full.*"
                    - fullname
                    - name
                    - person
                    - user
                description:
                  regex:
                    - ".*person.*name.*"
                    - ".*name.*person.*"
                    - ".*user.*name.*"
                    - ".*name.*user.*"
                    - ".*full.*name.*"
                    - ".*name.*full.*"
                    - fullname
                    - name
                    - person
                    - user
                datatype:
                  type:
                    - str
                values:
                  prediction_type: library
                  regex: []
                  library:
                    - spacy
              Age:
                prediction_factors_and_weights:
                  name: 0.65
                  description: 0
                  datatype: 0
                  values: 0.35
                name:
                  regex:
                    - age[^a-z]+.*
                    - ".*[^a-z]+age"
                    - ".*[^a-z]+age[^a-z]+.*"
                    - age
                description:
                  regex:
                    - age[^a-z]+.*
                    - ".*[^a-z]+age"
                    - ".*[^a-z]+age[^a-z]+.*"
                    - age
                datatype:
                  type:
                    - int
                values:
                  prediction_type: library
                  regex: []
                  library:
                    - rule_based_logic
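
Before placing custom value regexes in info_types_config, it can help to try them against sample column values. The Python sketch below does this for the Email_Address pattern and the first Credit_Debit_Card_Number pattern from the recipe above. The helper match_fraction and the sample values are hypothetical, and the use of re.fullmatch is an assumption: the classifier library may apply the patterns differently (for example with other anchoring or case flags).

```python
import re
from typing import Iterable

# Patterns copied from the recipe above. The YAML "\\." becomes "\." here.
EMAIL_REGEX = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}"
VISA_REGEX = r"^4[0-9]{12}(?:[0-9]{3})?$"


def match_fraction(pattern: str, values: Iterable[object]) -> float:
    """Return the fraction of values whose string form fully matches the pattern."""
    compiled = re.compile(pattern)
    items = [str(value) for value in values]
    if not items:
        return 0.0
    hits = sum(1 for item in items if compiled.fullmatch(item))
    return hits / len(items)


# Hypothetical sample values, as they might be drawn from a column.
emails = ["alice@example.com", "bob.smith@mail.example.org", "not-an-email", "carol@example"]
cards = ["4111111111111111", "4012888888881881", "5500000000000004", "1234"]

print(match_fraction(EMAIL_REGEX, emails))
print(match_fraction(VISA_REGEX, cards))
```

A low match fraction on known-good samples suggests that the pattern is too strict, while matches on obviously unrelated values suggest that it is too loose.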