In part 1 of our Introduction to Data Warehouse Testing, we introduced the different types of data warehouse testing while taking a closer look at testing for data completeness, data transformation, data quality, performance and scalability. Now let us examine integration testing, audit, logging and error handling testing, deployment testing, user-acceptance testing, and regression testing.
Integration Testing (typically tested with full load testing)
Integration testing exercises the interfaces of the DWH process: the source systems, the scheduling tool, the alerting mechanism, and the publishing agent.
Most issues found during integration testing are related either to the data ingested into the DWH or to a wrong assumption about the design of another application. As most applications today expose their interfaces as services, it is fairly easy to get comprehensive information on the design of these services. It is critical to test integration with production-like data. Real production data is ideal, but depending on its contents, there could be privacy or security concerns that require certain fields to be masked before using it in a test environment. Good communication between the testing and design teams of all systems involved is of paramount importance here. To help bridge this communication gap, framing test scenarios should be a cross-team activity. Integration testing should always be run end-to-end and with extreme inputs from the interfaces, mimicking production systems as closely as possible.
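As a minimal sketch of the masking step mentioned above: sensitive fields can be replaced with a short, deterministic hash so that joins and uniqueness still behave like production while no real personal data reaches the test environment. The field names here (`customer_name`, `email`, `ssn`) are hypothetical placeholders; real schemas and masking policies will differ.

```python
import hashlib

# Hypothetical sensitive fields; substitute your real schema's columns.
SENSITIVE_FIELDS = {"customer_name", "email", "ssn"}

def mask_row(row: dict) -> dict:
    """Replace sensitive values with a stable hash so joins still work,
    but no real personal data reaches the test environment."""
    masked = {}
    for field, value in row.items():
        if field in SENSITIVE_FIELDS and value is not None:
            digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
            masked[field] = digest[:12]  # short, deterministic token
        else:
            masked[field] = value
    return masked

row = {"customer_id": 42, "email": "a@example.com", "amount": 19.99}
masked = mask_row(row)  # non-sensitive fields pass through unchanged
```

Because the hash is deterministic, the same source value masks to the same token on every run, which keeps referential integrity intact across masked tables.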
Audit, Logging and Error Handling Testing
This testing is more for validation of the technical processes of a data warehouse. Every ETL system should log what processes have run, when and for how long. Consider the following strategies:
- Deliberately abort processes and validate that they restart from the point of failure.
- Sometimes restarting from the point of failure may not be possible; in that case, validate that the system can roll back to its original state so the ETL can be reprocessed.
- Check the depth of error logging: does it log the error alone, or does it also capture additional details such as when the error occurred and in which data row?
- The scheduling mechanisms in the DWH may need manual overrides for special cases (Feb 29th) or "back-data" loading; identifying the various automated and manual scheduling scenarios and testing for them may help identify production issues.
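To illustrate the logging depth the third point asks for, a loader can record not just that a row failed but which row and why, so a failed run can be diagnosed from the log alone. This is a generic sketch; the validation rule and the "insert" are stand-ins for the real ETL step.

```python
import logging

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("etl")

def load_rows(rows):
    """Load rows one at a time, capturing which row failed and why,
    so a failed run can be diagnosed (and resumed) from the log."""
    loaded = 0
    for i, row in enumerate(rows):
        try:
            if row.get("amount") is None:   # stand-in validation rule
                raise ValueError("amount is NULL")
            loaded += 1                     # stand-in for the real insert
        except ValueError as exc:
            # Log the error *and* the row number and offending data,
            # not just the message -- this is the depth the test checks for.
            log.error("row %d rejected: %s (row=%r)", i, exc, row)
    return loaded

count = load_rows([{"amount": 10}, {"amount": None}, {"amount": 3}])
```

A test for this strategy would assert both that good rows are loaded and that the log entry for the bad row carries its position and contents.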
Deployment Testing
Eventually, ETL processes are run from data centers by production-support operators. The deployment process and its documentation should be complete and comprehensive enough for them to run the process end to end. A couple of areas to keep in mind here:
- Ensure that deployment testing is performed by a person who is new to the project and does not understand the nuances of the system you have designed.
- When things go fine, data center personnel have little to do beyond monitoring; it is when things go wrong that they swing into action. Testing should cover instances where things go wrong (an ETL process hangs, a data file fails to load) and verify how effectively the system captures and presents these negative scenarios.
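As a concrete sketch of the "process hangs" scenario above, a test harness can run each ETL step under a timeout so a stuck job surfaces as an explicit failure that operators can act on, instead of a silently hung process. The command and timeout below are hypothetical placeholders for the real job runner's configuration.

```python
import subprocess
import sys

# Hypothetical step: a trivial command standing in for a real ETL job.
ETL_COMMAND = [sys.executable, "-c", "print('load complete')"]
TIMEOUT_SECONDS = 60  # placeholder; tune per step in real configuration

def run_step(command, timeout):
    """Run one ETL step, treating a hang as a hard failure so
    operators get a clear signal instead of a silently stuck job."""
    try:
        result = subprocess.run(command, capture_output=True,
                                text=True, timeout=timeout)
        if result.returncode == 0:
            return ("ok", result.stdout.strip())
        return ("failed", result.stderr.strip())
    except subprocess.TimeoutExpired:
        return ("hung", f"no result after {timeout}s")

status, detail = run_step(ETL_COMMAND, TIMEOUT_SECONDS)
```

The point of the test is the third branch: deliberately running a step that never finishes and confirming the harness reports "hung" rather than waiting forever.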
User-Acceptance Testing
The eventual goal of a data warehouse application is to make data available to end business users. Users know the data best, and their participation in the testing effort is a key component to the success of a data warehouse implementation. User-acceptance testing (UAT) typically focuses on data loaded to the data warehouse and any views that have been created on top of the tables. Consider the following strategies:
- Use data that is either from production or as near to production data as possible. Users typically find issues once they see the “real” data, sometimes leading to design changes.
- Plan for the system test team to support users during UAT. The users will likely have questions about how the data is populated and need to understand details of how the ETL works.
- Determine how users need the data loaded during UAT and negotiate how often the data will be refreshed.
Regression Testing
Regression testing is revalidation of existing functionality with each new release of code. When building test cases, remember that they will likely be executed multiple times as new releases are created due to defect fixes, enhancements or upstream systems changes. Building automation during system testing will make the process of regression testing much smoother. Test cases should be prioritized by risk in order to help determine which need to be rerun for each new release. A simple but effective and efficient strategy to retest basic functionality is to store source data sets and results from successful runs of the code and compare new test results with previous runs. When doing a regression test, it is much quicker to compare results to a previous execution than to do an entire data validation again.
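The compare-to-previous-run strategy above can be sketched in a few lines: load the stored baseline and the new run's output, and report only the rows that differ, rather than re-validating every row from scratch. The CSV layout and column names here are illustrative assumptions.

```python
import csv
import io

def rows_from_csv(text):
    """Parse CSV text into a comparable set of row tuples."""
    return {tuple(r) for r in csv.reader(io.StringIO(text.strip()))}

def compare_to_baseline(baseline_csv, new_csv):
    """Diff a new run's output against a stored baseline; only the
    differences need investigation, not the whole data set."""
    baseline = rows_from_csv(baseline_csv)
    new = rows_from_csv(new_csv)
    return {"missing": baseline - new, "unexpected": new - baseline}

# Illustrative data: a prior successful run vs. the current release's output.
baseline = "id,total\n1,100\n2,250\n"
new_run = "id,total\n1,100\n2,275\n"
diff = compare_to_baseline(baseline, new_run)
```

In practice the baseline files come from a known-good release, and a non-empty diff either flags a regression or, if the change was intentional, prompts a baseline refresh.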
There are no industry-standard automation tools for data transformation and data quality checks, but consider using a tool like Concordion (http://www.concordion.org/), especially for unit testing.
Taking these considerations into account during the design and testing portions of building a data warehouse will ensure that a quality product is delivered and prevent costly mistakes from being discovered in production.
This article by Rajesh Ramaswamy, Head of the BI/DW/Analytics practice at Marlabs, was first published at siliconindia.com.