Extend data model to model Notebook entity

Background

Querybook is Pinterest’s open-source big data IDE with a notebook interface. We (Included Health) use it as our main querying tool. One of its features, DataDoc, organizes rich text, queries, and charts into a notebook to easily document analyses. People can work collaboratively in a DataDoc and get real-time updates. We believe it would be valuable to ingest DataDoc metadata into DataHub and make it easily searchable and discoverable by others.

Summary

This RFC proposes a data model for the DataDoc entity. It does not cover architecture, APIs, or other implementation details, and it includes only the minimum data model needed to meet our initial goal. If the community decides to adopt this new entity, further work will be needed.

Detailed design

DataDoc Model

[Figure: DataDoc High Level Model]

As shown in the diagram above, a DataDoc is a document that contains a list of DataDoc cells. It organizes rich text, queries, and charts into a notebook to easily document analyses. The DataDoc model is very similar to a Notebook, and a DataDoc can be viewed as a subset of Notebook. We therefore model Notebook rather than DataDoc, and include the "subTypes" aspect to differentiate Notebook from DataDoc.

Notebook Data Model

This section describes the minimum Notebook data model that meets our needs.

  • notebookKey (keyAspect)
    • notebookTool: The name of the DataDoc tool, such as QueryBook, Notebook, etc.
    • notebookId: Unique id for the DataDoc
  • notebookInfo
    • title (Searchable): The title of this DataDoc
    • description (Searchable): Detailed description about the DataDoc
    • lastModified: Captures information about who created/last modified/deleted this DataDoc and when
  • notebookContent
    • content: The content of a DataDoc, which is composed of a list of DataDocCell
  • editableDataDocProperties
  • ownership
  • status
  • globalTags
  • institutionalMemory
  • browsePaths
  • domains
  • subTypes
  • dataPlatformInstance
  • glossaryTerms
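
To make the shape of the key and info aspects concrete, here is a minimal Python sketch. It is an illustration only, not the proposed schema definition: the class names, the snake_case field names, and the AuditStamp shape are assumptions made for this example, and only a few of the aspects listed above are shown.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class NotebookKey:
    """Key aspect: identifies a Notebook by the tool it lives in and its id."""
    notebook_tool: str  # e.g. "QueryBook"
    notebook_id: str    # unique id for the DataDoc within that tool


@dataclass
class AuditStamp:
    """Hypothetical 'who and when' record; its exact shape is not given in this RFC."""
    actor: str
    time_ms: int


@dataclass
class NotebookInfo:
    """Info aspect: title and description are the searchable fields."""
    title: str
    description: Optional[str] = None
    last_modified: Optional[AuditStamp] = None


@dataclass
class Notebook:
    """A Notebook entity carrying a subset of the aspects listed above."""
    key: NotebookKey
    info: NotebookInfo
    # The subTypes aspect differentiates a generic Notebook from a DataDoc.
    sub_types: List[str] = field(default_factory=list)


# Example: a Querybook DataDoc modeled as a Notebook with a "DataDoc" subtype.
doc = Notebook(
    key=NotebookKey(notebook_tool="QueryBook", notebook_id="1234"),
    info=NotebookInfo(title="Weekly signups", description="Signup trend analysis"),
    sub_types=["DataDoc"],
)
```

Because the key is a frozen dataclass, two keys with the same tool and id compare equal and can be used as dictionary keys, which mirrors the role of a key aspect.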

Notebook Cells

A Notebook cell is the unit that composes a Notebook. There are three types of cells: Text Cell, Query Cell, and Chart Cell. Each type of cell has its own metadata. Since a cell only lives within a Notebook, we model cells as an aspect of Notebook rather than as a separate entity. Here is the metadata for each type of cell:

  • TextCell
    • cellTitle: Title of the cell
    • cellId: Unique id for the cell.
    • lastModified: Captures information about who created/last modified/deleted this Notebook cell and when
    • text: The actual text in a TextCell in a Notebook
  • QueryCell
    • cellTitle: Title of the cell
    • cellId: Unique id for the cell.
    • lastModified: Captures information about who created/last modified/deleted this Notebook cell and when
    • rawQuery: Raw query to explain some specific logic in a Notebook
    • lastExecuted: Captures information about who last executed this query cell and when
  • ChartCell
    • cellTitle: Title of the cell
    • cellId: Unique id for the cell.
    • lastModified: Captures information about who created/last modified/deleted this Notebook cell and when
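
The three cell types above can be sketched as a tagged union held inside the notebookContent aspect. This is a minimal Python illustration under stated assumptions: the Stamp class, the shared Cell base class, and the raw_queries helper are hypothetical names introduced for this example and are not part of the proposal.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class Stamp:
    """Hypothetical 'who and when' record used for lastModified / lastExecuted."""
    actor: str
    time_ms: int


@dataclass
class Cell:
    """Fields shared by every cell type."""
    cell_id: str
    cell_title: str
    last_modified: Stamp


@dataclass
class TextCell(Cell):
    text: str  # the actual text in the cell


@dataclass
class QueryCell(Cell):
    raw_query: str                         # raw query explaining some logic
    last_executed: Optional[Stamp] = None  # who last ran the query, and when


@dataclass
class ChartCell(Cell):
    pass  # no chart-specific metadata in the minimum model


NotebookCell = Union[TextCell, QueryCell, ChartCell]


@dataclass
class NotebookContent:
    """Cells live inside this aspect of Notebook, not as separate entities."""
    content: List[NotebookCell]


def raw_queries(notebook_content: NotebookContent) -> List[str]:
    """Collect the raw queries of all query cells, in document order."""
    return [c.raw_query for c in notebook_content.content if isinstance(c, QueryCell)]


stamp = Stamp(actor="alice", time_ms=0)
content = NotebookContent(content=[
    TextCell("c1", "Intro", stamp, text="Signup trend for the last week."),
    QueryCell("c2", "Signups", stamp, raw_query="SELECT COUNT(*) FROM signups"),
    ChartCell("c3", "Signups chart", stamp),
])
```

Keeping the cells as a list inside one aspect reflects the design choice stated above: a cell has no identity outside its Notebook, so it does not need its own entity or key.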

Future Work

Querybook provides an embeddable feature. We could use it to embed a query tab in DataHub, which would give users a search-and-explore experience.