2.0.156
Features
- The API now allows creating Data Quality Tests
Fixes
- Fixed downloading of test cases
- Inactive tests are no longer generated
- Fixed profiling of SQL Server databases
Improvements
- Better logging in the backend
- Internal reorganization
- Better filters for database profiling
Known issues
- Reconciliation tests may not be rerunnable
- Data Quality tests may not be rerunnable
Data Quality Test Fluff
Creating a data quality test in XAM is easy, but there are a few steps to follow in order to get the most out of it.
Only via the API for now: the first step is to calculate the current global aggregates. You can do this by selecting your database connection and pressing the "Gather Global Aggregates" button. This gathers the database's metadata, profiles it, and collects the statistical aggregates and other useful information you might need in your data quality tests.
As an example, it might return that the minimum Age defined in the Users table is 23 and the maximum is 65, while the unique values in the User.Role column are "tester" and "admin". You will see similar data for each column, with the metadata varying depending on the column's type.
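To make the idea concrete, here is a hedged sketch of the kind of per-column aggregates described above, modelled as a plain Python dictionary. The shape, keys, and names are illustrative assumptions, not the actual XAM API response format.

```python
# Hypothetical illustration of gathered global aggregates, keyed by
# "Table.Column". The real XAM response shape is not documented here.
aggregates = {
    "Users.Age": {"type": "numeric", "min": 23, "max": 65},
    "Users.Role": {"type": "text", "unique_values": ["tester", "admin"]},
}

# Numeric columns carry min/max; text columns carry their unique values.
```

Whatever the concrete format, the point is that each column gets metadata appropriate to its type.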
Your next step is to define a data quality rule that you expect your data to follow. A data quality rule describes a fact about reality that we need to control, and it consists of a condition and thresholds for failure.
Let's use the example of an online liquor store. It will have Buyer, Order and Product tables. In the background, there's some majestic backend-software-solution controlling and mangling the data back and forth.
By gathering the global aggregates, you will be able to see facets of your data that you may have never considered. It may turn out that the oldest user you have is 40 years old, and the youngest is 17 years old.
Wait, 17? But the legal drinking age is 18! That points us to some potential problems. Is there something wrong with the data? Or is the data correct, but there's some failure in the process?
Let's create a Data Quality rule to inform us that such a problem in the data might exist.
Using the auto-completion feature (currently API-only), select:
- the database connection,
- a schema from its schemas,
- the table Buyer,
- the column Age.
Set the operator to >= (greater than or equal to) and the value to 18. Since 18 is the smallest allowed value, this is a "minimum" aggregation type.
Thus our condition is: Buyer.Age >= 18
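The rule built above can be sketched as a small data structure that renders its condition as SQL. This is a minimal illustration of the concept; the class and field names are assumptions, not the XAM API.

```python
from dataclasses import dataclass

# Hedged sketch: a data quality rule as plain data. Names are
# illustrative only and do not reflect XAM's internal model.
@dataclass
class DataQualityRule:
    table: str
    column: str
    operator: str  # e.g. ">=" for a "minimum" aggregation type
    value: object

    def condition_sql(self) -> str:
        # Render the rule as a SQL boolean expression, e.g. "Age >= 18"
        return f"{self.column} {self.operator} {self.value}"

rule = DataQualityRule(table="Buyer", column="Age", operator=">=", value=18)
```

Keeping the rule as data (rather than a raw SQL string) is what makes auto-completion over connections, schemas, tables, and columns possible.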
Then we set two thresholds for the data.
The warning threshold is the minimum percentage of records we expect to satisfy the condition. We can set it, as an example, to 99% => "I expect that at least 99% of recorded buyers will be no younger than 18".
The critical threshold is the percentage below which the data is unacceptable. Let's set it to 95%. Here is what happens when we run the test:
- If at least 99% of the records satisfy the condition, the check passes. Information about this check is written to the dq_stats table in the target database.
- If between 95% and 99% of the records satisfy the condition, the check still passes. The information is written to the dq_stats table in the target database, but with the status "WARNING".
- If fewer than 95% of the records satisfy the condition, the check fails. The information is written to the dq_stats table in the target database with the status "FAIL". The "offending" records are copied to the {schema}_{table_name}_errors table in the target database, allowing you to review the failures even after fixing the records in the main table.
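The three outcomes above reduce to a simple comparison of the matching-row ratio against the two thresholds. The sketch below shows that logic; the function name and signature are illustrative, not XAM's implementation.

```python
def evaluate_check(total_rows: int, matching_rows: int,
                   warning_threshold: float = 0.99,
                   critical_threshold: float = 0.95) -> str:
    """Classify a data quality check run from row counts.

    Hedged sketch of the threshold logic described above; names and
    defaults are assumptions for illustration only.
    """
    ratio = matching_rows / total_rows if total_rows else 1.0
    if ratio >= warning_threshold:
        return "PASS"
    if ratio >= critical_threshold:
        return "WARNING"
    return "FAIL"
```

For example, with the 99%/95% thresholds, 970 matching rows out of 1000 would land in the "WARNING" band: still passing, but flagged.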
Disclaimers:
- Gathering global aggregates utilizes database mechanisms (i.e. no piece of production data ever enters XAM).
- Copying the records to the 'error' table uses database mechanisms (i.e. no piece of production data ever enters XAM).
- Checking the thresholds is achieved by counting the records in the database (i.e. no piece of production data ever enters XAM).
- Doing anything DQ-related requires a user that has both READ access to the production data and WRITE access to a dq schema for statistics.
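To illustrate the "counting in the database" point, here is a self-contained sketch using an in-memory SQLite database as a stand-in for the real target database. The actual SQL that XAM issues is not shown in this document; this only demonstrates that a threshold check needs nothing but row counts.

```python
import sqlite3

# Hedged sketch: compute the check entirely inside the database by
# counting rows, per the disclaimers above. SQLite stands in for the
# real target database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Buyer (Age INTEGER)")
conn.executemany("INSERT INTO Buyer (Age) VALUES (?)",
                 [(17,), (23,), (40,), (31,)])

total, matching = conn.execute("""
    SELECT COUNT(*),
           SUM(CASE WHEN Age >= 18 THEN 1 ELSE 0 END)
    FROM Buyer
""").fetchone()
# Only the two counts leave the database; no row data does.
```

With this sample data, 3 of 4 buyers satisfy Age >= 18 (75%), which would fall below a 95% critical threshold and fail the check.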
XAM data quality checks/rules created in this manner are handled the same way as other test cases, which allows them to be run, scheduled, reused, added to test suites and removed from existence.