Reviewing the health monitoring report and creating data quality alerts.
Once enabled, the health monitoring tool can be accessed in the catalog by selecting a dataset, and navigating to the Health page.
The Health monitoring page is divided into 3 sections; Alerts, Row and Column changes, and Column Metrics.
Alerts
Dataset alerts can be created by users with Admin or Manage permissions on a dataset. Alerts are applied to the dataset and notify all users with access to the dataset when an alert constraint is infringed.
To create a new alert, select Create alert rule under the Alerts section of the Health UI. You will be prompted to make several selections via drop down menus. The first selection is to identify what type of constraint you would like to create a rule for, the options are as follows;
- Absolute: an absolute constraints compares a calculated metric to a constraint value on the refreshed table. For example, if a user was looking at the total row count metric, and set a constraint of greater than or equal to 1000, that means that when the health monitoring statistics are refreshed at the set schedule, and the dataset has a row count less than 1000, then an alert will be triggered.
- Relative: a relative constrain compares a calculated metric as a percentage change from the last refresh of the health metrics to the current refresh. Using the global row count metric example again, if a user set an expected relative change percentage of greater than or equal to 5%, that means that they expect the number of rows to increase by 5% or more every refresh. If the last refresh had 1000 rows, then the current refresh would need to have 1050 rows or more to satisfy the constraint, otherwise an alert will be issued.
- Query: selecting Query as the constraint type creates a text box where you can use SQL based language to select specific columns, apply aggregate calculations available in SQL, as well as conditions using the WHERE clause to create a broad range of calculations/metrics with a high degree of flexibility.
Once the constraint type, property (AKA column), metric, and alert constraint value are selected, the user can submit the alert. Once submitted there are toggle options to set the alert to active or inactive, and to toggle email alerts to not only provide notifications in the Health monitoring UI, but also via email to all users with access to the dataset.
Row and Column Changes
As you scroll past the alerts section of the Health monitoring UI you will land on the Row and Column changes section. This section of the Health monitoring UI allows you to track how row count and the table schema are changing over time. You can select a date range to view using the "Filter by date" calendar drop downs for start date and end date. You can also limit the number of revisions, also referred to as refreshes, to review using the "Revision limit" up-down incremental field. Below these options there are some global statistics based on the latest refresh of the dataset, including;
- Version: this is the current version of the dataset. A new version is created every time there is a schema change in the dataset. The first version is Version 1.
- Columns: the current column count of the table.
- Rows: the total number of rows in the current table.
- Imported on: the last time the health monitoring scan was run on the dataset
The UI also has a graphical representation of the row count over time, and a list showing the column changes that have occurred in the current version compared to the previous version.
Column Metrics
Column level out of the box statistics include:
- Min and max values along the entire column for integers, dates, and float data types
- Average value along the entire column for integers and float data types
- Min, max, and average string length along the entire column for string data types
- Null count and null percentage for all data types
A detailed view can be accessed by selecting "Details". The following information is available in the Details view:
- For columns with string data type:
-
- A composition graph with distinct count, average string length, minimum string length, and maximum string length, as well as the standard deviation of string length
- Change: how the distinct count and null count are changing over time (with each scheduled update of the health monitoring)
- Top values: a histogram showing the frequency of top values accompanied by a list showing the top values and their frequency
-
- For columns with integer or float data type:
- A composition graph showing distinct count, average value, min value, max value, and standard deviation.
- A table that provides the average, distinct count, max, min, null percentage, standard deviation, and sum
- A table with the top 5 values by frequency (if no top value found, that is because a large number of values have the same maximum frequency, e.g. that frequency would be 1 if all values are unique)
- Change: distinct and null count changes over time.
- For columns with date data type:
- A table with minimum and maximum dates
- A table with the top 5 values by frequency.
- Change: distinct and null count change over time.