Issues in source data
- out of acceptable range
- non-standard data/ varying formats
- outliers, outliers is above/below 3* standard deviation
- invalid values
- null values
- differing culture rules (like, nz/us time date format)
- cosmetic
- verification (wrong post code)
strategy
- filter/clean/correct errors in source date before transform
- quality assured by verifying again
- correct and complete
- pre-processing
Data Quality(DQ) Assurance process
- checking
- checking if there is an error or dirty data
- based on pre-defined data quality rules
- Logging
- record the data with errors
- notify DB users/admin to correct the data
- Correcting
- clean or ignore the erros in the data – data cleaning
Logging errors
DQ rule is violated, log the event into a table. (the below is the fields of the table)
- RowID of the record that vialoates the rule a physical address the row stored on disk
- rule no that is violated
- table name of that row id
- Database name of the table
- Action -> Reject/Allow/Fix
DQ Actions
- Reject
- Allow (low risk error, the error is in within tolerance limits)
- Fix (Correct the error)
Reading
- sql server trigger