Classification
The classification feature enables sources to be configured to automatically predict info types for columns and use them as glossary terms. This is an explicit opt-in feature and is not enabled by default.
Config Details
Note that a `.` is used to denote nested fields in the YAML recipe.
Field | Required | Type | Description | Default |
---|---|---|---|---|
enabled | | boolean | Whether classification should be used to auto-detect glossary terms. | False |
sample_size | | int | Number of sample values used for classification. | 100 |
max_workers | | int | Number of worker processes to use for classification. Set to 1 to disable. | Number of CPU cores or 4 |
info_type_to_term | | Dict[str,string] | Optional mapping to provide the glossary term identifier for an info type. | By default, the info type is used as the glossary term identifier. |
classifiers | | Array of object | Classifiers to use to auto-detect glossary terms. If more than one classifier is configured, info type predictions from the classifier defined later in the sequence take precedence. | [{'type': 'datahub', 'config': None}] |
table_pattern | | AllowDenyPattern (see below for fields) | Regex patterns to filter tables for classification. This is used in combination with other patterns in the parent config. Specify a regex that matches the entire table name in database.schema.table format, e.g. to match all tables starting with customer in the Customer database and public schema, use the regex 'Customer.public.customer.*'. | {'allow': ['.*'], 'deny': [], 'ignoreCase': True} |
table_pattern.allow | | Array of string | List of regex patterns to include in ingestion. | ['.*'] |
table_pattern.deny | | Array of string | List of regex patterns to exclude from ingestion. | [] |
table_pattern.ignoreCase | | boolean | Whether to ignore case during pattern matching. | True |
column_pattern | | AllowDenyPattern (see below for fields) | Regex patterns to filter columns for classification. This is used in combination with other patterns in the parent config. Specify a regex that matches the column name in database.schema.table.column format. | {'allow': ['.*'], 'deny': [], 'ignoreCase': True} |
column_pattern.allow | | Array of string | List of regex patterns to include in ingestion. | ['.*'] |
column_pattern.deny | | Array of string | List of regex patterns to exclude from ingestion. | [] |
column_pattern.ignoreCase | | boolean | Whether to ignore case during pattern matching. | True |
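For example, the sampling and pattern options above can be combined in a recipe fragment like the one below (a minimal sketch; the table regex reuses the example from the table above, and the denied column name is purely illustrative):
```yaml
classification:
  enabled: True
  sample_size: 50                  # sample 50 values per column instead of the default 100
  table_pattern:                   # matched against database.schema.table
    allow:
      - "Customer.public.customer.*"
  column_pattern:                  # matched against database.schema.table.column
    deny:
      - ".*\\.row_id"              # e.g. skip surrogate-key columns (illustrative name)
  classifiers:
    - type: datahub
```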
DataHub Classifier
DataHub Classifier is the default classifier implementation. It uses the acryl-datahub-classify library to predict info types.
Config Details
Field | Required | Type | Description | Default |
---|---|---|---|---|
confidence_level_threshold | | number | | 0.68 |
strip_exclusion_formatting | | bool | Whether the exclusion list uses format stripping (case-insensitivity, punctuation removal, and special character removal) rather than exact matching. | True |
info_types | | list[string] | List of infotypes to be predicted. By default, all supported infotypes are considered, along with any custom infotypes configured in info_types_config. | None |
info_types_config | | Dict[str, InfoTypeConfig] | Configuration details for infotypes. | See reference_input.py for default configuration. |
info_types_config.key.prediction_factors_and_weights | ❓ (required if info_types_config.key is set) | Dict[str,number] | Factors and their weights to consider when predicting info types. | |
info_types_config.key.exclude_name | | list[string] | Optional list of names to exclude from classification. | None |
info_types_config.key.name | | NameFactorConfig (see below for fields) | | |
info_types_config.key.name.regex | | Array of string | List of regex patterns the column name follows for the info type. | ['.*'] |
info_types_config.key.description | | DescriptionFactorConfig (see below for fields) | | |
info_types_config.key.description.regex | | Array of string | List of regex patterns the column description follows for the info type. | ['.*'] |
info_types_config.key.datatype | | DataTypeFactorConfig (see below for fields) | | |
info_types_config.key.datatype.type | | Array of string | List of data types for the info type. | ['.*'] |
info_types_config.key.values | | ValuesFactorConfig (see below for fields) | | |
info_types_config.key.values.prediction_type | ❓ (required if info_types_config.key.values is set) | string | | None |
info_types_config.key.values.regex | | Array of string | List of regex patterns the column value follows for the info type. | None |
info_types_config.key.values.library | | Array of string | Library used for prediction. | None |
minimum_values_threshold | | number | Minimum number of non-null column values required to process the values prediction factor. | 50 |
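As a sketch of how these options fit together, the fragment below restricts prediction to two info types and tightens the thresholds (the specific values are illustrative, not recommendations):
```yaml
classification:
  enabled: True
  classifiers:
    - type: datahub
      config:
        confidence_level_threshold: 0.8    # accept only higher-confidence predictions
        minimum_values_threshold: 25       # require at least 25 non-null sample values
        strip_exclusion_formatting: False  # use exact matching for exclusion lists
        info_types:                        # only these info types are predicted
          - Email_Address
          - Phone_Number
```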
Supported infotypes
- Email_Address
- Gender
- Credit_Debit_Card_Number
- Phone_Number
- Street_Address
- Full_Name
- Age
- IBAN
- US_Social_Security_Number
- Vehicle_Identification_Number
- IP_Address_v4
- IP_Address_v6
- US_Driving_License_Number
- Swift_Code
- Regex-based Custom InfoTypes
Supported sources
- All SQL sources
Future Work
- Classification for nested columns (struct, array type)
Examples
Basic
```yaml
source:
type: snowflake
config:
env: PROD
# Coordinates
account_id: account_name
warehouse: "COMPUTE_WH"
# Credentials
username: user
password: pass
role: "sysadmin"
# Options
top_n_queries: 10
email_domain: mycompany.com
classification:
enabled: True
classifiers:
- type: datahub
```
Advanced Configuration: Customizing configuration for supported info types
```yaml
source:
type: snowflake
config:
env: PROD
# Coordinates
account_id: account_name
warehouse: "COMPUTE_WH"
# Credentials
username: user
password: pass
role: "sysadmin"
# Options
top_n_queries: 10
email_domain: mycompany.com
classification:
enabled: True
info_type_to_term:
Email_Address: "Email"
classifiers:
- type: datahub
config:
confidence_level_threshold: 0.7
info_types_config:
Email_Address:
prediction_factors_and_weights:
name: 0.4
description: 0
datatype: 0
values: 0.6
name:
regex:
- "^.*mail.*id.*$"
- "^.*id.*mail.*$"
- "^.*mail.*add.*$"
- "^.*add.*mail.*$"
- email
- mail
description:
regex:
- "^.*mail.*id.*$"
- "^.*mail.*add.*$"
- email
- mail
datatype:
type:
- str
values:
prediction_type: regex
regex:
- "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Z|a-z]{2,}"
library: []
Gender:
prediction_factors_and_weights:
name: 0.4
description: 0
datatype: 0
values: 0.6
name:
regex:
- "^.*gender.*$"
- "^.*sex.*$"
- gender
- sex
description:
regex:
- "^.*gender.*$"
- "^.*sex.*$"
- gender
- sex
datatype:
type:
- int
- str
values:
prediction_type: regex
regex:
- male
- female
- man
- woman
- m
- f
- w
- men
- women
library: []
Credit_Debit_Card_Number:
prediction_factors_and_weights:
name: 0.4
description: 0
datatype: 0
values: 0.6
name:
regex:
- "^.*card.*number.*$"
- "^.*number.*card.*$"
- "^.*credit.*card.*$"
- "^.*debit.*card.*$"
description:
regex:
- "^.*card.*number.*$"
- "^.*number.*card.*$"
- "^.*credit.*card.*$"
- "^.*debit.*card.*$"
datatype:
type:
- str
- int
values:
prediction_type: regex
regex:
- "^4[0-9]{12}(?:[0-9]{3})?$"
- "^(?:5[1-5][0-9]{2}|222[1-9]|22[3-9][0-9]|2[3-6][0-9]{2}|27[01][0-9]|2720)[0-9]{12}$"
- "^3[47][0-9]{13}$"
- "^3(?:0[0-5]|[68][0-9])[0-9]{11}$"
- "^6(?:011|5[0-9]{2})[0-9]{12}$"
- "^(?:2131|1800|35\\d{3})\\d{11}$"
- "^(6541|6556)[0-9]{12}$"
- "^389[0-9]{11}$"
- "^63[7-9][0-9]{13}$"
- "^9[0-9]{15}$"
- "^(6304|6706|6709|6771)[0-9]{12,15}$"
- "^(5018|5020|5038|6304|6759|6761|6763)[0-9]{8,15}$"
- "^(62[0-9]{14,17})$"
- "^(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14})$"
- "^(4903|4905|4911|4936|6333|6759)[0-9]{12}|(4903|4905|4911|4936|6333|6759)[0-9]{14}|(4903|4905|4911|4936|6333|6759)[0-9]{15}|564182[0-9]{10}|564182[0-9]{12}|564182[0-9]{13}|633110[0-9]{10}|633110[0-9]{12}|633110[0-9]{13}$"
- "^(6334|6767)[0-9]{12}|(6334|6767)[0-9]{14}|(6334|6767)[0-9]{15}$"
library: []
Phone_Number:
prediction_factors_and_weights:
name: 0.4
description: 0
datatype: 0
values: 0.6
name:
regex:
- ".*phone.*(num|no).*"
- ".*(num|no).*phone.*"
- ".*[^a-z]+ph[^a-z]+.*(num|no).*"
- ".*(num|no).*[^a-z]+ph[^a-z]+.*"
- ".*mobile.*(num|no).*"
- ".*(num|no).*mobile.*"
- ".*telephone.*(num|no).*"
- ".*(num|no).*telephone.*"
- ".*cell.*(num|no).*"
- ".*(num|no).*cell.*"
- ".*contact.*(num|no).*"
- ".*(num|no).*contact.*"
- ".*landline.*(num|no).*"
- ".*(num|no).*landline.*"
- ".*fax.*(num|no).*"
- ".*(num|no).*fax.*"
- phone
- telephone
- landline
- mobile
- tel
- fax
- cell
- contact
description:
regex:
- ".*phone.*(num|no).*"
- ".*(num|no).*phone.*"
- ".*[^a-z]+ph[^a-z]+.*(num|no).*"
- ".*(num|no).*[^a-z]+ph[^a-z]+.*"
- ".*mobile.*(num|no).*"
- ".*(num|no).*mobile.*"
- ".*telephone.*(num|no).*"
- ".*(num|no).*telephone.*"
- ".*cell.*(num|no).*"
- ".*(num|no).*cell.*"
- ".*contact.*(num|no).*"
- ".*(num|no).*contact.*"
- ".*landline.*(num|no).*"
- ".*(num|no).*landline.*"
- ".*fax.*(num|no).*"
- ".*(num|no).*fax.*"
- phone
- telephone
- landline
- mobile
- tel
- fax
- cell
- contact
datatype:
type:
- int
- str
values:
prediction_type: library
regex: []
library:
- phonenumbers
Street_Address:
prediction_factors_and_weights:
name: 0.5
description: 0
datatype: 0
values: 0.5
name:
regex:
- ".*street.*add.*"
- ".*add.*street.*"
- ".*full.*add.*"
- ".*add.*full.*"
- ".*mail.*add.*"
- ".*add.*mail.*"
- add[^a-z]+
- address
- street
description:
regex:
- ".*street.*add.*"
- ".*add.*street.*"
- ".*full.*add.*"
- ".*add.*full.*"
- ".*mail.*add.*"
- ".*add.*mail.*"
- add[^a-z]+
- address
- street
datatype:
type:
- str
values:
prediction_type: library
regex: []
library:
- spacy
Full_Name:
prediction_factors_and_weights:
name: 0.3
description: 0
datatype: 0
values: 0.7
name:
regex:
- ".*person.*name.*"
- ".*name.*person.*"
- ".*user.*name.*"
- ".*name.*user.*"
- ".*full.*name.*"
- ".*name.*full.*"
- fullname
- name
- person
- user
description:
regex:
- ".*person.*name.*"
- ".*name.*person.*"
- ".*user.*name.*"
- ".*name.*user.*"
- ".*full.*name.*"
- ".*name.*full.*"
- fullname
- name
- person
- user
datatype:
type:
- str
values:
prediction_type: library
regex: []
library:
- spacy
Age:
prediction_factors_and_weights:
name: 0.65
description: 0
datatype: 0
values: 0.35
name:
regex:
- age[^a-z]+.*
- ".*[^a-z]+age"
- ".*[^a-z]+age[^a-z]+.*"
- age
description:
regex:
- age[^a-z]+.*
- ".*[^a-z]+age"
- ".*[^a-z]+age[^a-z]+.*"
- age
datatype:
type:
- int
values:
prediction_type: library
regex: []
library:
- rule_based_logic
```
Advanced Configuration: Specifying Custom InfoType
```yaml
source:
type: snowflake
config:
env: PROD
# Coordinates
account_id: account_name
warehouse: "COMPUTE_WH"
# Credentials
username: user
password: pass
role: "sysadmin"
# Options
top_n_queries: 10
email_domain: mycompany.com
classification:
enabled: True
classifiers:
- type: datahub
config:
confidence_level_threshold: 0.7
minimum_values_threshold: 10
info_types_config:
CloudRegion:
prediction_factors_and_weights:
name: 0
description: 0
datatype: 0
values: 1
values:
prediction_type: regex
regex:
- "(af|ap|ca|eu|me|sa|us)-(central|north|(north(?:east|west))|south|south(?:east|west)|east|west)-\\d+"
library: []
```
Additional Resources
DataHub Blog