Skip to main content
Version: 0.14.0

Classification

The classification feature enables sources to be configured to automatically predict info types for columns and use them as glossary terms. This is an explicit opt-in feature and is not enabled by default.

Config details

Note that a . is used to denote nested fields in the YAML recipe.

FieldRequiredTypeDescriptionDefault
enabledbooleanWhether classification should be used to auto-detect glossary termsFalse
sample_sizeintNumber of sample values used for classification.100
max_workersintNumber of worker processes to use for classification. Set to 1 to disable.Number of cpu cores or 4
info_type_to_termDict[str,string]Optional mapping to provide glossary term identifier for info type.By default, info type is used as glossary term identifier.
classifiersArray of objectClassifiers to use to auto-detect glossary terms. If more than one classifier, infotype predictions from the classifier defined later in sequence take precedance.[{'type': 'datahub', 'config': None}]
table_patternAllowDenyPattern (see below for fields)Regex patterns to filter tables for classification. This is used in combination with other patterns in parent config. Specify regex to match the entire table name in database.schema.table format. e.g. to match all tables starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.*'{'allow': ['.*'], 'deny': [], 'ignoreCase': True}
table_pattern.allowArray of stringList of regex patterns to include in ingestion['.*']
table_pattern.denyArray of stringList of regex patterns to exclude from ingestion.[]
table_pattern.ignoreCasebooleanWhether to ignore case sensitivity during pattern matching.True
column_patternAllowDenyPattern (see below for fields)Regex patterns to filter columns for classification. This is used in combination with other patterns in parent config. Specify regex to match the column name in database.schema.table.column format.{'allow': ['.*'], 'deny': [], 'ignoreCase': True}
column_pattern.allowArray of stringList of regex patterns to include in ingestion['.*']
column_pattern.denyArray of stringList of regex patterns to exclude from ingestion.[]
column_pattern.ignoreCasebooleanWhether to ignore case sensitivity during pattern matching.True

DataHub Classifier

DataHub Classifier is the default classifier implementation, which uses acryl-datahub-classify library to predict info types.

Config Details

FieldRequiredTypeDescriptionDefault
confidence_level_thresholdnumber0.68
strip_exclusion_formattingboolA flag that determines whether the exclusion list uses exact matching or format stripping (case-insensitivity, punctuation removal, and special character removal).True
info_typeslist[string]List of infotypes to be predicted. By default, all supported infotypes are considered, along with any custom infotypes configured in info_types_config.None
info_types_configConfiguration details for infotypesDict[str, InfoTypeConfig]See reference_input.py for default configuration.
info_types_config.key.prediction_factors_and_weights❓ (required if info_types_config.key is set)Dict[str,number]Factors and their weights to consider when predicting info types
info_types_config.key.exclude_namelist[string]Optional list of names to exclude from classification.None
info_types_config.key.nameNameFactorConfig (see below for fields)
info_types_config.key.name.regexArray of stringList of regex patterns the column name follows for the info type['.*']
info_types_config.key.descriptionDescriptionFactorConfig (see below for fields)
info_types_config.key.description.regexArray of stringList of regex patterns the column description follows for the info type['.*']
info_types_config.key.datatypeDataTypeFactorConfig (see below for fields)
info_types_config.key.datatype.typeArray of stringList of data types for the info type['.*']
info_types_config.key.valuesValuesFactorConfig (see below for fields)
info_types_config.key.values.prediction_type❓ (required if info_types_config.key.values is set)stringNone
info_types_config.key.values.regexArray of stringList of regex patterns the column value follows for the info typeNone
info_types_config.key.values.libraryArray of stringLibrary used for predictionNone
minimum_values_thresholdnumberMinimum number of non-null column values required to process values prediction factor.50

Supported infotypes

  • Email_Address
  • Gender
  • Credit_Debit_Card_Number
  • Phone_Number
  • Street_Address
  • Full_Name
  • Age
  • IBAN
  • US_Social_Security_Number
  • Vehicle_Identification_Number
  • IP_Address_v4
  • IP_Address_v6
  • US_Driving_License_Number
  • Swift_Code
  • Regex based Custom InfoTypes

Supported sources

  • All SQL sources

Future Work

  • Classification for nested columns (struct, array type)

Examples

Basic

source:
type: snowflake
config:
env: PROD
# Coordinates
account_id: account_name
warehouse: "COMPUTE_WH"

# Credentials
username: user
password: pass
role: "sysadmin"

# Options
top_n_queries: 10
email_domain: mycompany.com

classification:
enabled: True
classifiers:
- type: datahub

Advanced Configuration: Customizing configuration for supported info types

source:
type: snowflake
config:
env: PROD
# Coordinates
account_id: account_name
warehouse: "COMPUTE_WH"

# Credentials
username: user
password: pass
role: "sysadmin"

# Options
top_n_queries: 10
email_domain: mycompany.com

classification:
enabled: True
info_type_to_term:
Email_Address: "Email"
classifiers:
- type: datahub
config:
confidence_level_threshold: 0.7
info_types_config:
Email_Address:
prediction_factors_and_weights:
name: 0.4
description: 0
datatype: 0
values: 0.6
name:
regex:
- "^.*mail.*id.*$"
- "^.*id.*mail.*$"
- "^.*mail.*add.*$"
- "^.*add.*mail.*$"
- email
- mail
description:
regex:
- "^.*mail.*id.*$"
- "^.*mail.*add.*$"
- email
- mail
datatype:
type:
- str
values:
prediction_type: regex
regex:
- "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Z|a-z]{2,}"
library: []
Gender:
prediction_factors_and_weights:
name: 0.4
description: 0
datatype: 0
values: 0.6
name:
regex:
- "^.*gender.*$"
- "^.*sex.*$"
- gender
- sex
description:
regex:
- "^.*gender.*$"
- "^.*sex.*$"
- gender
- sex
datatype:
type:
- int
- str
values:
prediction_type: regex
regex:
- male
- female
- man
- woman
- m
- f
- w
- men
- women
library: []
Credit_Debit_Card_Number:
prediction_factors_and_weights:
name: 0.4
description: 0
datatype: 0
values: 0.6
name:
regex:
- "^.*card.*number.*$"
- "^.*number.*card.*$"
- "^.*credit.*card.*$"
- "^.*debit.*card.*$"
description:
regex:
- "^.*card.*number.*$"
- "^.*number.*card.*$"
- "^.*credit.*card.*$"
- "^.*debit.*card.*$"
datatype:
type:
- str
- int
values:
prediction_type: regex
regex:
- "^4[0-9]{12}(?:[0-9]{3})?$"
- "^(?:5[1-5][0-9]{2}|222[1-9]|22[3-9][0-9]|2[3-6][0-9]{2}|27[01][0-9]|2720)[0-9]{12}$"
- "^3[47][0-9]{13}$"
- "^3(?:0[0-5]|[68][0-9])[0-9]{11}$"
- "^6(?:011|5[0-9]{2})[0-9]{12}$"
- "^(?:2131|1800|35\\d{3})\\d{11}$"
- "^(6541|6556)[0-9]{12}$"
- "^389[0-9]{11}$"
- "^63[7-9][0-9]{13}$"
- "^9[0-9]{15}$"
- "^(6304|6706|6709|6771)[0-9]{12,15}$"
- "^(5018|5020|5038|6304|6759|6761|6763)[0-9]{8,15}$"
- "^(62[0-9]{14,17})$"
- "^(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14})$"
- "^(4903|4905|4911|4936|6333|6759)[0-9]{12}|(4903|4905|4911|4936|6333|6759)[0-9]{14}|(4903|4905|4911|4936|6333|6759)[0-9]{15}|564182[0-9]{10}|564182[0-9]{12}|564182[0-9]{13}|633110[0-9]{10}|633110[0-9]{12}|633110[0-9]{13}$"
- "^(6334|6767)[0-9]{12}|(6334|6767)[0-9]{14}|(6334|6767)[0-9]{15}$"
library: []
Phone_Number:
prediction_factors_and_weights:
name: 0.4
description: 0
datatype: 0
values: 0.6
name:
regex:
- ".*phone.*(num|no).*"
- ".*(num|no).*phone.*"
- ".*[^a-z]+ph[^a-z]+.*(num|no).*"
- ".*(num|no).*[^a-z]+ph[^a-z]+.*"
- ".*mobile.*(num|no).*"
- ".*(num|no).*mobile.*"
- ".*telephone.*(num|no).*"
- ".*(num|no).*telephone.*"
- ".*cell.*(num|no).*"
- ".*(num|no).*cell.*"
- ".*contact.*(num|no).*"
- ".*(num|no).*contact.*"
- ".*landline.*(num|no).*"
- ".*(num|no).*landline.*"
- ".*fax.*(num|no).*"
- ".*(num|no).*fax.*"
- phone
- telephone
- landline
- mobile
- tel
- fax
- cell
- contact
description:
regex:
- ".*phone.*(num|no).*"
- ".*(num|no).*phone.*"
- ".*[^a-z]+ph[^a-z]+.*(num|no).*"
- ".*(num|no).*[^a-z]+ph[^a-z]+.*"
- ".*mobile.*(num|no).*"
- ".*(num|no).*mobile.*"
- ".*telephone.*(num|no).*"
- ".*(num|no).*telephone.*"
- ".*cell.*(num|no).*"
- ".*(num|no).*cell.*"
- ".*contact.*(num|no).*"
- ".*(num|no).*contact.*"
- ".*landline.*(num|no).*"
- ".*(num|no).*landline.*"
- ".*fax.*(num|no).*"
- ".*(num|no).*fax.*"
- phone
- telephone
- landline
- mobile
- tel
- fax
- cell
- contact
datatype:
type:
- int
- str
values:
prediction_type: library
regex: []
library:
- phonenumbers
Street_Address:
prediction_factors_and_weights:
name: 0.5
description: 0
datatype: 0
values: 0.5
name:
regex:
- ".*street.*add.*"
- ".*add.*street.*"
- ".*full.*add.*"
- ".*add.*full.*"
- ".*mail.*add.*"
- ".*add.*mail.*"
- add[^a-z]+
- address
- street
description:
regex:
- ".*street.*add.*"
- ".*add.*street.*"
- ".*full.*add.*"
- ".*add.*full.*"
- ".*mail.*add.*"
- ".*add.*mail.*"
- add[^a-z]+
- address
- street
datatype:
type:
- str
values:
prediction_type: library
regex: []
library:
- spacy
Full_Name:
prediction_factors_and_weights:
name: 0.3
description: 0
datatype: 0
values: 0.7
name:
regex:
- ".*person.*name.*"
- ".*name.*person.*"
- ".*user.*name.*"
- ".*name.*user.*"
- ".*full.*name.*"
- ".*name.*full.*"
- fullname
- name
- person
- user
description:
regex:
- ".*person.*name.*"
- ".*name.*person.*"
- ".*user.*name.*"
- ".*name.*user.*"
- ".*full.*name.*"
- ".*name.*full.*"
- fullname
- name
- person
- user
datatype:
type:
- str
values:
prediction_type: library
regex: []
library:
- spacy
Age:
prediction_factors_and_weights:
name: 0.65
description: 0
datatype: 0
values: 0.35
name:
regex:
- age[^a-z]+.*
- ".*[^a-z]+age"
- ".*[^a-z]+age[^a-z]+.*"
- age
description:
regex:
- age[^a-z]+.*
- ".*[^a-z]+age"
- ".*[^a-z]+age[^a-z]+.*"
- age
datatype:
type:
- int
values:
prediction_type: library
regex: []
library:
- rule_based_logic

Advanced Configuration: Specifying Custom InfoType

source:
type: snowflake
config:
env: PROD
# Coordinates
account_id: account_name
warehouse: "COMPUTE_WH"

# Credentials
username: user
password: pass
role: "sysadmin"

# Options
top_n_queries: 10
email_domain: mycompany.com

classification:
enabled: True
classifiers:
- type: datahub
config:
confidence_level_threshold: 0.7
minimum_values_threshold: 10
info_types_config:
CloudRegion:
prediction_factors_and_weights:
name: 0
description: 0
datatype: 0
values: 1
values:
prediction_type: regex
regex:
- "(af|ap|ca|eu|me|sa|us)-(central|north|(north(?:east|west))|south|south(?:east|west)|east|west)-\\d+"
library: []

Additional Resources

DataHub Blog