
SchemaFieldPath Specification (Version 2)

This document outlines the formal specification for the fieldPath member of the SchemaField model. This specification (version 2) takes into account the unique requirements of supporting a wide variety of nested types, unions and optional fields and is a substantial improvement over the current implementation (version 1).

Requirements#

The fieldPath field is currently used by DataHub not just for rendering the schema fields in the UI, but also as the primary identifier of a field in other places such as EditableSchemaFieldInfo, usage stats and data profiles. Therefore, it must satisfy the following requirements.

  • must be unique across all fields within a schema.
  • make schema navigation in the UI more intuitive.
  • allow for identifying the type of schema the field is part of, such as a key-schema or a value-schema.
  • allow for future evolution.

Existing Convention(v1)#

The existing convention is to use the field's name as the fieldPath for simple fields, and the dot-delimited names for nested fields. This scheme does not satisfy the requirements stated above. The following example illustrates a case where the uniqueness requirement is not satisfied.

Example: Ambiguous field path#

Consider the following Avro schema, which is a union of two record types A and B, each having a simple field with the same name f of type string. The v1 naming scheme cannot tell whether fieldPath=f refers to the field of record type A or of record type B.

[
    {
        "type": "record",
        "name": "A",
        "fields": [{ "name": "f", "type": "string" } ]
    }, {
        "type": "record",
        "name": "B",
        "fields": [{ "name": "f", "type": "string" } ]
    }
]
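The collision can be reproduced with a few lines of Python. This is only a sketch: `v1_field_paths` is a hypothetical helper written for this illustration (it is not DataHub's v1 implementation), and it assumes v1 paths are built purely from dot-delimited field names as described above.

```python
import json
from collections import Counter

# The union schema from the example above.
union_schema = json.loads("""
[
    {"type": "record", "name": "A", "fields": [{"name": "f", "type": "string"}]},
    {"type": "record", "name": "B", "fields": [{"name": "f", "type": "string"}]}
]
""")


def v1_field_paths(schema, prefix=""):
    """Hypothetical helper: collect v1-style paths (field names only)."""
    if isinstance(schema, list):  # union: visit every member
        paths = []
        for member in schema:
            paths.extend(v1_field_paths(member, prefix))
        return paths
    if isinstance(schema, dict) and schema.get("type") == "record":
        paths = []
        for field in schema["fields"]:
            path = f"{prefix}.{field['name']}" if prefix else field["name"]
            paths.append(path)
            paths.extend(v1_field_paths(field["type"], path))
        return paths
    return []  # primitives contribute no further paths


paths = v1_field_paths(union_schema)
duplicates = [p for p, count in Counter(paths).items() if count > 1]
print(paths)       # ['f', 'f']
print(duplicates)  # ['f']
```

Both records contribute the same path `f`, so a consumer keyed on fieldPath cannot tell the two fields apart.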

The FieldPath encoding scheme(v2)#

The syntax for V2 encoding of the fieldPath is captured in the following grammar. The FieldPathSpec is essentially the type annotated path of the member, with each token along the path representing one level of nested member, starting from the most-enclosing type, leading up to the member. In the case of unions that have one-of semantics, the corresponding field will be emitted once for each member of the union as its type, along with one path corresponding to the union itself.

Formal Spec:#

<SchemaFieldPath> := <VersionToken>.<PartOfKeySchemaToken>.<FieldPathSpec>  // when part of a key-schema
                   | <VersionToken>.<FieldPathSpec>  // when part of a value schema
<VersionToken> := [version=<VersionId>]  // [version=2.0] for v2
<PartOfKeySchemaToken> := [key=True]  // when part of a key schema
<FieldPathSpec> := <FieldToken>+  // this is the type prefixed path field (nested if repeats).
<FieldToken> := <TypePrefixToken>.<name_of_the_field>  // type prefixed path of a field.
<TypePrefixToken> := <NestedTypePrefixToken>.<SimpleTypeToken> | <SimpleTypeToken>
<NestedTypePrefixToken> := [type=<NestedType>]
<SimpleTypeToken> := [type=<SimpleType>]
<NestedType> := <name of a struct/record> | union | array | map
<SimpleType> := int | float | double | string | fixed | enum

For the example above, this encoding would produce the following 2 unique paths corresponding to the A.f and B.f fields.

unique_v2_field_paths = [
    "[version=2.0].[type=union].[type=A].[type=string].f",
    "[version=2.0].[type=union].[type=B].[type=string].f"
]
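To make the token order concrete, here is a minimal sketch that assembles a v2 path from its parts. `build_v2_field_path` and the `Segment` alias are hypothetical names introduced for this illustration; they are not part of the DataHub API. Each segment is assumed to be a pair of type names (outermost first) and an optional field name.

```python
from typing import Iterable, Optional, Sequence, Tuple

# Hypothetical type for illustration: (type names outermost-first, field name).
Segment = Tuple[Sequence[str], Optional[str]]


def build_v2_field_path(
    segments: Iterable[Segment],
    is_key_schema: bool = False,
    version: str = "2.0",
) -> str:
    """Join version, optional key marker, and type-prefixed field tokens with dots."""
    tokens = [f"[version={version}]"]
    if is_key_schema:
        tokens.append("[key=True]")
    for type_names, field_name in segments:
        tokens.extend(f"[type={name}]" for name in type_names)
        if field_name is not None:  # a bare primitive schema has no field name
            tokens.append(field_name)
    return ".".join(tokens)


path_a = build_v2_field_path([(["union", "A", "string"], "f")])
path_b = build_v2_field_path([(["union", "B", "string"], "f")])
print(path_a)  # [version=2.0].[type=union].[type=A].[type=string].f
print(path_b)  # [version=2.0].[type=union].[type=B].[type=string].f
```

Because the record name is part of each path, the two `f` fields no longer collide.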

NOTE:

  • this encoding always ensures uniqueness within a schema, since the full type annotation leading to a field is encoded in the fieldPath itself.
  • processing a fieldPath, for example in the UI, reduces to walking the tokens along the path from left to right.
  • adding PartOfKeySchemaToken allows for identifying whether the field is part of a key-schema.
  • adding VersionToken allows for future evolvability.
  • to represent optional fields, which are sometimes modeled as unions in formats like Avro, set the nullable member of SchemaField to True instead of treating the field as a union member.
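The left-to-right walk mentioned in the notes can be sketched as follows. `parse_v2_field_path` is a hypothetical helper for illustration, not DataHub's parser. It tokenizes with a regular expression rather than splitting on ".", because the version id (2.0) itself contains a dot.

```python
import re
from typing import List, Optional, Tuple

# A token is either a bracketed annotation or a plain field name.
_TOKEN_RE = re.compile(r"\[[^\]]*\]|[^.\[\]]+")


def parse_v2_field_path(path: str) -> Tuple[str, bool, List[Tuple[Tuple[str, ...], Optional[str]]]]:
    """Return (version, is_key_schema, [(type names, field name), ...])."""
    tokens = _TOKEN_RE.findall(path)
    if not tokens or not tokens[0].startswith("[version="):
        raise ValueError(f"not a v2 fieldPath: {path!r}")
    version = tokens[0][len("[version="):-1]
    rest = tokens[1:]
    is_key_schema = bool(rest) and rest[0] == "[key=True]"
    if is_key_schema:
        rest = rest[1:]

    segments = []
    pending_types: List[str] = []
    for token in rest:
        if token.startswith("[type="):
            pending_types.append(token[len("[type="):-1])
        elif token.startswith("["):
            raise ValueError(f"unexpected token {token!r}")
        else:
            # A field name closes the current run of type annotations.
            segments.append((tuple(pending_types), token))
            pending_types = []
    if pending_types:  # e.g. a bare primitive schema with no field name
        segments.append((tuple(pending_types), None))
    return version, is_key_schema, segments


print(parse_v2_field_path("[version=2.0].[type=union].[type=A].[type=string].f"))
# ('2.0', False, [(('union', 'A', 'string'), 'f')])
```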

Examples#

Primitive types#

avro_schema = """
{
  "type": "string"
}
"""
unique_v2_field_paths = [
  "[version=2.0].[type=string]"
]

Records#

Simple Record

avro_schema = """
{
  "type": "record",
  "name": "some.event.E",
  "namespace": "some.event.N",
  "doc": "this is the event record E",
  "fields": [
    {
      "name": "a",
      "type": "string",
      "doc": "this is string field a of E"
    },
    {
      "name": "b",
      "type": "string",
      "doc": "this is string field b of E"
    }
  ]
}
"""
unique_v2_field_paths = [
    "[version=2.0].[type=E].[type=string].a",
    "[version=2.0].[type=E].[type=string].b",
]

Nested Record

avro_schema = """
{
    "type": "record",
    "name": "SimpleNested",
    "namespace": "com.linkedin",
    "fields": [{
        "name": "nestedRcd",
        "type": {
            "type": "record",
            "name": "InnerRcd",
            "fields": [{
                "name": "aStringField",
                "type": "string"
            } ]
        }
    }]
}
"""
# The [key=True] token appears because the schema is treated as a key-schema in this example.
unique_v2_field_paths = [
  "[version=2.0].[key=True].[type=SimpleNested].[type=InnerRcd].nestedRcd",
  "[version=2.0].[key=True].[type=SimpleNested].[type=InnerRcd].nestedRcd.[type=string].aStringField",
]

Recursive Record

avro_schema = """
{
    "type": "record",
    "name": "Recursive",
    "namespace": "com.linkedin",
    "fields": [{
        "name": "r",
        "type": {
            "type": "record",
            "name": "R",
            "fields": [
                { "name" : "anIntegerField", "type" : "int" },
                { "name": "aRecursiveField", "type": "com.linkedin.R"}
            ]
        }
    }]
}
"""
unique_v2_field_paths = [
  "[version=2.0].[type=Recursive].[type=R].r",
  "[version=2.0].[type=Recursive].[type=R].r.[type=int].anIntegerField",
  "[version=2.0].[type=Recursive].[type=R].r.[type=R].aRecursiveField"
]

avro_schema = """
{
    "type": "record",
    "name": "TreeNode",
    "fields": [
        {
            "name": "value",
            "type": "long"
        },
        {
            "name": "children",
            "type": { "type": "array", "items": "TreeNode" }
        }
    ]
}
"""
unique_v2_field_paths = [
  "[version=2.0].[type=TreeNode].[type=long].value",
  "[version=2.0].[type=TreeNode].[type=array].[type=TreeNode].children",
]
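The record examples above follow a simple recursive pattern, which the sketch below tries to capture. It is an illustration under stated assumptions, not the AvroToMceSchemaConverter implementation: `record_field_paths` and `_type_token` are hypothetical names, only inline records and primitive or by-name type references are handled (arrays, maps and unions raise), and the short type name is assumed to be the last dot-separated segment of the Avro name.

```python
import json
from typing import Any, Dict, List


def _type_token(avro_type: Any) -> str:
    """Hypothetical helper: build the [type=...] token for a field's type."""
    if isinstance(avro_type, dict):
        if avro_type.get("type") != "record":
            raise NotImplementedError("sketch handles inline records only")
        full_name = avro_type["name"]
    elif isinstance(avro_type, str):
        full_name = avro_type  # primitive or a by-name reference
    else:
        raise NotImplementedError("unions are out of scope for this sketch")
    return f"[type={full_name.split('.')[-1]}]"


def record_field_paths(schema: Dict[str, Any], is_key_schema: bool = False) -> List[str]:
    prefix = "[version=2.0]"
    if is_key_schema:
        prefix += ".[key=True]"
    prefix += "." + _type_token(schema)
    paths: List[str] = []

    def visit(parent: str, fields: List[Dict[str, Any]]) -> None:
        for field in fields:
            path = f"{parent}.{_type_token(field['type'])}.{field['name']}"
            paths.append(path)
            field_type = field["type"]
            # Recurse only into inline record definitions; a by-name
            # reference such as "com.linkedin.R" is not expanded again.
            if isinstance(field_type, dict):
                visit(path, field_type["fields"])

    visit(prefix, schema["fields"])
    return paths


recursive_schema = json.loads("""
{
    "type": "record", "name": "Recursive", "namespace": "com.linkedin",
    "fields": [{
        "name": "r",
        "type": {
            "type": "record", "name": "R",
            "fields": [
                {"name": "anIntegerField", "type": "int"},
                {"name": "aRecursiveField", "type": "com.linkedin.R"}
            ]
        }
    }]
}
""")
recursive_paths = record_field_paths(recursive_schema)
for p in recursive_paths:
    print(p)
```

Stopping at by-name references is what keeps the output finite for recursive schemas: `aRecursiveField` is emitted once with `[type=R]` and is not descended into.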

Unions#

from typing import List

avro_schema = """
{
    "type": "record",
    "name": "ABUnion",
    "namespace": "com.linkedin",
    "fields": [{
        "name": "a",
        "type": [{
            "type": "record",
            "name": "A",
            "fields": [{ "name": "f", "type": "string" } ]
            }, {
            "type": "record",
            "name": "B",
            "fields": [{ "name": "f", "type": "string" } ]
            }
        ]
    }]
}
"""
# The [key=True] token appears because the schema is treated as a key-schema in this example.
unique_v2_field_paths: List[str] = [
    "[version=2.0].[key=True].[type=ABUnion].[type=union].a",
    "[version=2.0].[key=True].[type=ABUnion].[type=union].[type=A].a",
    "[version=2.0].[key=True].[type=ABUnion].[type=union].[type=A].a.[type=string].f",
    "[version=2.0].[key=True].[type=ABUnion].[type=union].[type=B].a",
    "[version=2.0].[key=True].[type=ABUnion].[type=union].[type=B].a.[type=string].f",
]
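The union rule (one path for the union itself, plus one path per member) can be sketched as below. `union_field_paths` is a hypothetical helper for illustration; it assumes the member names are already the short type names and that any "null" member has been dropped and expressed through the nullable flag instead. It emits only the paths of the union field itself, not those of the members' nested fields.

```python
from typing import List


def union_field_paths(parent_path: str, field_name: str, member_type_names: List[str]) -> List[str]:
    """Emit the union-level path, then one path per union member."""
    union_prefix = f"{parent_path}.[type=union]"
    paths = [f"{union_prefix}.{field_name}"]
    for member in member_type_names:
        paths.append(f"{union_prefix}.[type={member}].{field_name}")
    return paths


paths = union_field_paths("[version=2.0].[key=True].[type=ABUnion]", "a", ["A", "B"])
for p in paths:
    print(p)
```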

Arrays#

from typing import List

avro_schema = """
{
    "type": "record",
    "name": "NestedArray",
    "namespace": "com.linkedin",
    "fields": [{
        "name": "ar",
        "type": {
            "type": "array",
            "items": {
                "type": "array",
                "items": [
                    "null",
                    {
                        "type": "record",
                        "name": "Foo",
                        "fields": [ {
                            "name": "a",
                            "type": "long"
                        } ]
                    }
                ]
            }
        }
    }]
}
"""
unique_v2_field_paths: List[str] = [
  "[version=2.0].[type=NestedArray].[type=array].[type=array].[type=Foo].ar",
  "[version=2.0].[type=NestedArray].[type=array].[type=array].[type=Foo].ar.[type=long].a",
]

Maps#

avro_schema = """
{
  "type": "record",
  "name": "R",
  "namespace": "some.namespace",
  "fields": [
    {
      "name": "a_map_of_longs_field",
      "type": {
        "type": "map",
        "values": "long"
      }
    }
  ]
}
"""
unique_v2_field_paths = [
  "[version=2.0].[type=R].[type=map].[type=long].a_map_of_longs_field",
]

Mixed Complex Type Examples#

from typing import List

# Combines arrays, unions and records.
avro_schema = """
{
    "type": "record",
    "name": "ABFooUnion",
    "namespace": "com.linkedin",
    "fields": [{
        "name": "a",
        "type": [ {
            "type": "record",
            "name": "A",
            "fields": [{ "name": "f", "type": "string" } ]
            }, {
            "type": "record",
            "name": "B",
            "fields": [{ "name": "f", "type": "string" } ]
            }, {
            "type": "array",
            "items": {
                "type": "array",
                "items": [
                    "null",
                    {
                        "type": "record",
                        "name": "Foo",
                        "fields": [{ "name": "f", "type": "long" }]
                    }
                ]
            }
    }]
    }]
}
"""
unique_v2_field_paths: List[str] = [
  "[version=2.0].[type=ABFooUnion].[type=union].a",
  "[version=2.0].[type=ABFooUnion].[type=union].[type=A].a",
  "[version=2.0].[type=ABFooUnion].[type=union].[type=A].a.[type=string].f",
  "[version=2.0].[type=ABFooUnion].[type=union].[type=B].a",
  "[version=2.0].[type=ABFooUnion].[type=union].[type=B].a.[type=string].f",
  "[version=2.0].[type=ABFooUnion].[type=union].[type=array].[type=array].[type=Foo].a",
  "[version=2.0].[type=ABFooUnion].[type=union].[type=array].[type=array].[type=Foo].a.[type=long].f",
]

For more examples, see the unit-tests for AvroToMceSchemaConverter.

Backward-compatibility#

While this format is not directly compatible with the v1 format, the v1 equivalent can easily be constructed from the v2 encoding by stripping away all the v2 tokens enclosed in square brackets [<new_in_v2>].
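A minimal sketch of that stripping step is shown below. `v2_to_v1_field_path` is a hypothetical name chosen for this illustration. The path is tokenized with a regular expression instead of a plain split on ".", since the version token [version=2.0] contains a dot of its own.

```python
import re

# A token is either a bracketed v2 annotation or a plain field name.
_TOKEN_RE = re.compile(r"\[[^\]]*\]|[^.\[\]]+")


def v2_to_v1_field_path(v2_path: str) -> str:
    """Drop every bracketed token and re-join the remaining field names."""
    names = [tok for tok in _TOKEN_RE.findall(v2_path) if not tok.startswith("[")]
    return ".".join(names)


print(v2_to_v1_field_path(
    "[version=2.0].[type=Recursive].[type=R].r.[type=int].anIntegerField"
))  # r.anIntegerField
print(v2_to_v1_field_path(
    "[version=2.0].[type=union].[type=A].[type=string].f"
))  # f
```

The conversion is lossy by design: the two union paths for A.f and B.f both map back to the same v1 path f.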