Skip to main content

๐Ÿ“œ Data Contracts

What is a Data Contract (DC)?โ€‹

A data contract defines the agreement between a data producer and consumers. More details about how to write a Data Contract (DC) you can find here open-data-contract-standard.

In X-Automate from the Data Contract specification we can extract and execute DQ/DB tests. Each Data Contract will be imported in the app as a Test Suite with test type DC. Each quality specification will be a test case of that test suite.

How to import and sync DCs?โ€‹

To add Data Contracts in the application you will need to connect a git repository and click Import Data Contracts from repository all files with extension *contract.yaml from repository which respects the OCDS schema will be imported.

DC Import - Git import

After import, you will see the imported files in Data Quality > Data Contracts page. On DC expand there will be available 3 tabs: Test Cases (the Quality Checks extracted), Data Contract (the yaml file view), Execution History. You can execute the DC and get a robot report. Data Contracts can be updated only from git repository, in the app they can be only executed. There is a configurable cron job which syncs DC from git repo or you can go to Git Connection page and run Import Data Contracts from repository to sync updates.

DC page

Here is a trimmed down example of a Data Contract (DC). Checkout the comments on the fields. Field quality is specific to X-Automate the rest of the fields can be taken from open-data-contract-standard.

# yaml-language-server: $schema=https://xampublicdocs.blob.core.windows.net/data-contract-schema/data_contract_schema.json

version: 1.0.0
kind: DataContract
apiVersion: v3.1.0
id: dc-suite-with-dq-checks
name: dc-suite-with-dq-checks # This needs to be unique is equivalent to Test Suite name
status: active # DC is active or inactive
servers:
- id: DB-1
host: xam-mysql.mysql.database.azure.com # This must correspond to the host name specified on Database Connection page
schema: CORE # This schema will be used for Quality Tests
type: MYSQL
description: Internal DB For XAM
database: core
port: 3306
server: xam-mysql.mysql.database.azure.com

schema:
- quality: # This is a quality check specification, you can add as many as needed
- name: random-custom-checks-22 # This will be the Test Case Name which needs to be unique
metric: xam_custom # This is a xam_custom Quality Check equivalent to DQ Custom
implementation:
threshold: percentage # Threshold can be percentage or amount
threshold_critical: 98.5 # Value critical for threshold
threshold_warning: 99.7 # Value warning for threshold
count_query: select count(*) from core.employee; # 1. All Records Query
bad_count_query: select count(*) from core.employee; # 2. Bad Records Query
verification_query: select count(*) from core.employee; # 3. Validate Query
- physicalType: table
physicalName: employee # This is the Table Name mandatory for quality of type xam_simple
properties:
- physicalType: int # This is the Column Data Type mandatory for quality of type xam_simple
physicalName: Employee_ID # This is the Column Name mandatory for quality of type xam_simple
required: true # This will create a dq test named: "dq-required-{schema}-{table}-{column}" with MetricName.TEXT_MISSING_VALUES or MetricName.NUMBER_NULL_COUNT based on column data type and both thresholds at 100%
unique: true # This will create a dq test named: "dq-unique-{schema}-{table}-{column}" with MetricName.UNIQUENESS and both thresholds at 100%
quality:
- name: employee-id-is-unique-22 # Qulity name which will be Test Case name (must be unique)
description: Has some random filters # Test Case Description
metric: xam_simple # For metric xam_simple we have a different implementation schema
implementation:
threshold: percentage # Can be percentage or amount same as with DQ checks
threshold_critical: 98.5
threshold_warning: 99.7
metric: uniqueness # Must be choosen carefully based on Column Data Type (some metrics can be applyied only for numeric column data types for example and some can be applied to all column data types)
filters: # Here we can specify filters that will be applied before the metric is executed
- column_name: Country # Column Name
column_type: varchar # Column Data Type
filter_name: not_equals # Filter name - similar to metric must be choosed carefully based on column_type
values:
- romania
- column_name: Employee_ID
column_type: int
filter_name: greater_than
values:
- 0

Metrics for xam_simpleโ€‹

Here are the metric names available for xam_simple. They are equivalent to Quality Simple Checks in the UI, only prefixed to make it easier to identify proper metric based on Column Data Type.

Metrics available for all Column Data Types:

  • contains (Allowed Values (in))
  • not_contains (Allowed Values (not in))
  • uniqueness (Uniqueness)

Metrics available for datetime Column Data Types:

  • datetime_value_range (Value Range (between))

Metrics available for numeric Column Data Types:

  • number_value_range (Value Range (between))
  • number_null_count (Null Count)
  • number_zeroes_count (Zeroes Count)
  • number_equal (Math all below)
  • number_not_equal
  • number_greater_than
  • number_greater_than_or_equal_to
  • number_less_than_or_equal_to
  • number_less_than

Metrics available for text Column Data Types:

  • text_missing_values (Missing Values (= ''))
  • text_length_range (Length Range (between_str))
  • text_like (LIKE)
  • text_not_like (NOT LIKE)
  • text_match_regex (Regular Expression)
  • text_not_match_regex (Regular Expression)

Here are the available filters you can use:

Filters for xam_simpleโ€‹

Same as with metrics, filters must be choosed based on column data type.

Tier 1: Core Logic & Comparison (Daily Use)

  • equals
  • not_equals
  • and
  • or
  • not
  • is_null
  • is_not_null
  • in
  • not_in
  • greater_than
  • less_than
  • greater_than_or_equal_to
  • less_than_or_equal_to
  • is_true
  • is_false

Tier 2: Text Search & Common Ranges

  • ilike
  • like
  • between
  • not_between
  • contains
  • not_ilike
  • not_like
  • age
  • length

Tier 3: Math & JSON/Key Operations

  • addition
  • subtraction
  • multiplication
  • division
  • key_exists
  • any_key_exists
  • all_keys_exist
  • contained_by
  • is_distinct_from

Tier 4: Full-Text Search & Logic Extensions

  • to_tsvector
  • to_tsquery
  • path_query
  • similar_to
  • logical_and
  • logical_or
  • logical_not
  • xor

Tier 5: Specialized / Geospatial / Bitwise

  • distance
  • intersects
  • absolute_value
  • modulus
  • square_root
  • bitwise_and
  • bitwise_or
  • bitwise_not
  • bitwise_xor
  • shift_left
  • shift_right
  • above
  • below
  • is_above_or_same
  • is_below_or_same
  • is_left_or_same
  • is_right_or_same
  • is_same_horizontal
  • is_same_vertical
  • is_parallel
  • is_perpendicular
  • distance_between_centers
  • contained_by_or_equal
  • contains_or_equal
  • same_as
  • is_distinct
  • is_not_distinct_from
  • is_not_true
  • is_not_false
  • is_unknown
  • is_not_unknown
  • path_querytext
  • delete_path
  • setweight
  • strip
  • not_greater_than
  • not_less_than
  • not_similar_to