In my several years of DW/BI experience, one of the toughest aspects of the entire lifecycle has been testing the quality of data in the DWH. What are the different types of testing we can do on the data in the warehouse, and what strategies do we need to consider for each? How we actually perform this testing (including some proprietary frameworks we have built at Marlabs for row-by-row validation) is a subject for another day.
Data Completeness (can be tested with both designer data and full load)
One of the most basic tests of data completeness is to verify that all expected data loads into the data warehouse. This includes validating that all records, all fields, and the full contents of each field are loaded. Strategies to consider include:
- Comparing record counts between source data, data loaded into the ODS and data loaded into the warehouse. You should also account for the rejected/suspense records (see the sketch after this list for one way to automate this comparison).
- Comparing the count of distinct values of key fields between source data and data loaded to the warehouse. You could also compare the distinct values themselves, not just their counts.
- Utilizing a data profiling tool (Talend, http://www.talend.com, provides a free tool) that shows the type, range and distributions of data in a data set. This can be used during testing and in production to compare source and target data sets. The profilers can also identify defects from source systems that may be missed even when the data movement is correct.
- Ensuring that each field is tested to its maximum length – for example, for a varchar(10) field, test it with 10 characters, 11 characters, blank and null.
- Testing the boundaries of each field to find any database limitations. For example, for a decimal field with a precision of 3, include values of -99 and 999, and for date fields include the entire range of dates expected (including dates in multiple formats).
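The record-count and distinct-key comparisons above can be scripted. Below is a minimal sketch of how such a check might look; the table names, key column and the sqlite3 in-memory connections are illustrative stand-ins for your actual source and warehouse connections, not part of any specific framework.

```python
# Minimal sketch of automated completeness checks: compare record counts and
# distinct key values between a source table and the warehouse table.
# Table names, column names and the sqlite3 stand-in connections are
# illustrative assumptions.
import sqlite3

def count_rows(conn, table):
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

def distinct_keys(conn, table, key_col):
    rows = conn.execute(f"SELECT DISTINCT {key_col} FROM {table}").fetchall()
    return {r[0] for r in rows}

def check_completeness(src_conn, dwh_conn, src_table, dwh_table, key_col,
                       rejected_count=0):
    src_count = count_rows(src_conn, src_table)
    dwh_count = count_rows(dwh_conn, dwh_table)
    # Loaded rows plus rejected/suspense rows should account for every source row.
    assert dwh_count + rejected_count == src_count, (
        f"count mismatch: source={src_count}, warehouse={dwh_count}, "
        f"rejected={rejected_count}")
    # Compare the distinct key values themselves, not just their counts.
    src_keys = distinct_keys(src_conn, src_table, key_col)
    dwh_keys = distinct_keys(dwh_conn, dwh_table, key_col)
    missing = src_keys - dwh_keys
    assert not missing, f"keys present in source but not in warehouse: {missing}"

if __name__ == "__main__":
    # In-memory stand-ins for the real source and warehouse connections.
    src, dwh = sqlite3.connect(":memory:"), sqlite3.connect(":memory:")
    src.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")
    dwh.execute("CREATE TABLE dim_customer (customer_id INTEGER, name TEXT)")
    src.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])
    dwh.executemany("INSERT INTO dim_customer VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])
    check_completeness(src, dwh, "customers", "dim_customer", "customer_id")
    print("completeness checks passed")
```

The same pattern can be pointed at the ODS and at the reject/suspense tables so that every source row is accounted for somewhere.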
Data Transformation (typically tested with designer data)
Often the real value of data in a warehouse lies in the transformation rules that are applied. Validating that data is transformed correctly based on business rules/logic can often be the most complex part of testing an ETL application. One typical method is to pick some sample records and eyeball them to validate the data transformations manually. This can be useful, but it requires manual testing steps and testers who understand the ETL logic. A combination of automated data profiling and automated data movement validation is a better long-term strategy. Here are some simple automated data movement validation techniques:
- Create a set of scenarios of input data and expected outcomes and validate these with the customer. This is a good exercise to be performed during requirements and can also be used during testing.
- Create a test bed with test data that covers all scenarios. Automate the process of populating the data sets from the scenarios sheet prepared in the previous step (see the sketch after this list). As more scenarios are added, the testing automation remains valid.
- Use the results of the data profiling tool to compare range and distribution of values in each field between source and target data.
- Validate correct processing of technical fields generated during ETL design such as flags and surrogate keys.
- Validate that data types in the warehouse are as specified in the design and/or the data model.
- Set up data scenarios that test referential integrity between tables. This is especially important when testing the behavior of child records that have no parent and when checking many-to-many relationships.
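A minimal sketch of the scenario-driven approach follows; the transform_customer function, its business rules and the field names are illustrative assumptions standing in for your real ETL transformation logic, and the SCENARIOS list plays the role of the scenarios sheet agreed with the customer.

```python
# Minimal sketch of scenario-driven transformation testing: each scenario pairs
# an input record with the output expected after the ETL business rules run.
# The transformation logic and field names below are illustrative assumptions.

def transform_customer(row):
    # Hypothetical business rules: trim names, derive full_name, map status codes.
    status_map = {"A": "ACTIVE", "I": "INACTIVE"}
    return {
        "full_name": f"{row['first_name'].strip()} {row['last_name'].strip()}",
        "status": status_map.get(row["status_code"], "UNKNOWN"),
    }

# Scenarios agreed with the customer during requirements; extend this list as
# new rules are added and the test automation stays valid.
SCENARIOS = [
    ({"first_name": " Ann ", "last_name": "Lee", "status_code": "A"},
     {"full_name": "Ann Lee", "status": "ACTIVE"}),
    ({"first_name": "Bob", "last_name": "Ray", "status_code": "X"},
     {"full_name": "Bob Ray", "status": "UNKNOWN"}),
]

def run_scenarios():
    for source_row, expected in SCENARIOS:
        actual = transform_customer(source_row)
        assert actual == expected, f"expected {expected}, got {actual}"

if __name__ == "__main__":
    run_scenarios()
    print(f"{len(SCENARIOS)} transformation scenarios passed")
```

Because the expected outcomes live in data rather than in test code, new scenarios can be added to the sheet without changing the automation.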
Data Quality (typically tested with designer data)
For comprehensive data quality testing, make sure to include as many scenarios as possible. Typically, data quality rules are defined during design, for example (the first two rules are sketched after the list):
- If a state name is misspelled, look it up in the master list and correct it.
- Substitute null if a decimal field contains nonnumeric data.
- Validate the city and state based on ZIP code.
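The first two rules above can be expressed in a few lines; the master list, the fuzzy-match cutoff and the field handling below are illustrative assumptions rather than prescribed implementations.

```python
# Minimal sketch of two data quality rules: correct a misspelled state against
# a master list, and substitute None (null) when a decimal field holds
# nonnumeric data. The master list and cutoff are illustrative assumptions.
import difflib
from decimal import Decimal, InvalidOperation

STATE_MASTER = ["New Jersey", "New York", "North Carolina"]

def correct_state(value):
    # Look up the closest entry in the master list; keep the original when no
    # reasonable match exists so the record can be reported rather than guessed.
    match = difflib.get_close_matches(value.title(), STATE_MASTER, n=1, cutoff=0.8)
    return match[0] if match else value

def to_decimal_or_null(value):
    try:
        return Decimal(value)
    except (InvalidOperation, TypeError):
        return None  # nonnumeric data becomes null

if __name__ == "__main__":
    print(correct_state("New Jersy"))    # -> New Jersey
    print(to_decimal_or_null("12.50"))   # -> 12.50
    print(to_decimal_or_null("N/A"))     # -> None
```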
Depending on the data quality rules of the application being tested, scenarios to test might include null key values, duplicate records in source data and invalid data types in fields (e.g. special characters in a date field). Data quality rules applied to the data will usually be invisible to the users once the application is in production; users will only see what's loaded to the database. For this reason, it is important to ensure users are aware of how bad data is being handled. The data quality reports often present patterns and trends that hint at underlying structural issues. Quite often, the DQ reports are published before the reporting tables are fully processed. It might make sense to stop processing and publishing reports if the number of data quality issues passes a certain threshold.
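One way to implement such a threshold gate is sketched below; the threshold value, the issue structure and the publish/halt hooks are illustrative assumptions.

```python
# Minimal sketch of a data quality threshold gate: count the DQ issues logged
# during the load and halt report publication when they pass an agreed limit.
# The threshold value and the publish/halt hooks are illustrative assumptions.
DQ_ISSUE_THRESHOLD = 100  # assumed to be agreed with the business during design

def gate_report_publication(dq_issues, publish, halt):
    """Publish reports only when DQ issues stay under the agreed threshold."""
    if len(dq_issues) > DQ_ISSUE_THRESHOLD:
        halt(f"{len(dq_issues)} data quality issues exceed the threshold of "
             f"{DQ_ISSUE_THRESHOLD}; report publication stopped")
    else:
        publish()

if __name__ == "__main__":
    issues = [{"rule": "nonnumeric decimal", "row_id": i} for i in range(3)]
    gate_report_publication(issues,
                            publish=lambda: print("reports published"),
                            halt=print)
```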
Performance and Scalability (typically tested with mock-up data)
As the volume of data in a data warehouse grows, ETL load times can be expected to increase and query performance can be expected to degrade. The aim of performance testing is to point out any potential weaknesses in the ETL architecture and design, such as reading a file multiple times or creating unnecessary intermediate files. The following strategies will help discover performance issues:
- Load the database with peak expected production volumes (this is often much higher than current production data and some method of statistical extrapolation might need to be considered) to ensure that this volume of data can be loaded by the ETL process within the agreed-upon window.
- Compare these ETL loading times to loads performed with a smaller amount of data to anticipate scalability issues.
- Test the ETL processing time module by module to understand potential bottlenecks (see the timing sketch after this list).
- If parallelism is present in the ETL design, test it fully to ensure that there are no memory bottlenecks.
- Monitor the timing of the reject/suspense process and consider how large volumes of rejected data will be handled – often this leads to source tables being scanned in full twice, which degrades performance.
- Perform simple and complex outer join queries to validate query performance on large database volumes. Identify acceptable performance criteria for each query and benchmark against this. This is especially true for ad-hoc reporting scenarios.
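Module-by-module timing can be captured with a simple harness like the sketch below; the stage names, the per-stage budgets and the placeholder stage bodies are illustrative assumptions, not a prescribed tool.

```python
# Minimal sketch of module-by-module ETL timing: wrap each ETL stage, record
# its elapsed time and flag any stage that exceeds its assumed time budget.
# Stage names, budgets and placeholder stage bodies are illustrative.
import time

def run_with_timing(stages):
    timings = {}
    for name, stage_fn in stages:
        start = time.perf_counter()
        stage_fn()
        timings[name] = time.perf_counter() - start
    return timings

if __name__ == "__main__":
    # Placeholder stage bodies; in practice each would invoke the real ETL module.
    stages = [
        ("extract", lambda: time.sleep(0.2)),
        ("transform", lambda: time.sleep(0.5)),
        ("load", lambda: time.sleep(0.3)),
    ]
    budgets = {"extract": 0.5, "transform": 0.4, "load": 0.5}  # seconds, assumed
    for name, elapsed in run_with_timing(stages).items():
        flag = "OVER BUDGET" if elapsed > budgets[name] else "ok"
        print(f"{name:10s} {elapsed:6.2f}s  {flag}")
```

Running the same harness against peak-volume data and against a smaller data set makes the scalability comparison in the list above straightforward to repeat.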
In part 2 of this article, we will look at Integration Testing, Audit, Logging and Error Handling Testing, Deployment Testing, User-Acceptance Testing, and Regression Testing.
This article by Rajesh Ramaswamy, Head of the BI/DW/Analytics practice at Marlabs, was first published at siliconindia.com.