This chapter focuses on two important database design issues related to the integrity of databases: semantic and physical data integrity. First and foremost, enforcing database integrity means getting the correct results. This is fundamental.
The Data Warehousing Institute (TDWI) reports that businesses in the United States alone lose $600 billion each year due to bad data (Eckerson, 2002), spending as much as 10% to 20% of their operating revenue just to scrap and rework poor-quality data (Stahl, 2004). The same TDWI white paper notes that 78% of survey respondents believe their organization needs more education about the importance of data quality and how to maintain and improve it.
In a 2005 follow-up study (Russom, 2005; Whiting, 2006), TDWI reports that 82.5% of the organizations surveyed “…continue to perceive their data as good or okay. However, half of the practitioners surveyed warn that data quality is worse than their organization realizes… Two-thirds of respondents have studied the problems of data quality, while less than half have studied its benefits.”
Furthermore, when asked whether their company has suffered losses, problems, or costs due to poor-quality data, 53% say yes, only 11% say no, and an alarming 36% have not even studied the issue (Russom, 2005; Whiting, 2006, where the relevant pie chart on page 42 is labelled “Database Debacles”).
Shilakes and Tylman (1998) estimate that data cleansing accounts for between 30% and 80% of the development time for a typical data warehouse. Most organizations do their best to fix bad data at the source, long before it arrives in the data warehouse (see English, 1997; Lee et al., 2006; Olson, 2002; Piattini et al., 2006; Redman, 1997, 2001). But even if only 1% of the errors remain by the time data reaches the data warehouse, and if losses scale roughly in proportion to the errors, that still accounts for roughly $6 billion of the $600 billion lost each year because of bad data. Russom (2008) also calls attention to the fact that the majority of enterprises focus their data quality efforts on customer data to the exclusion of other important data domains such as product, financial, and asset data. He writes, “Customer‑oriented data quality techniques and tools can be retrofitted to operate on other data domains, but with limited success.”
Of course, the quality of data received from different sources can vary immensely. Efforts are currently underway to develop data cleaning tools that let users estimate the reliability of uncertain data using Bayesian statistical methods. These methods can be used both to build probability models for such data and to improve its quality.
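To make the Bayesian idea concrete, the following is a minimal sketch of one way a source's reliability might be estimated from audited records. It is not drawn from any of the tools alluded to above. The Beta-Binomial model, the class name SourceReliability, the uniform prior, and the audit counts are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceReliability:
    """Beta-distributed belief about the fraction of correct records
    a data source delivers (hypothetical model, for illustration only)."""

    alpha: float = 1.0  # prior pseudo-count of correct records
    beta: float = 1.0   # prior pseudo-count of incorrect records

    def update(self, correct: int, incorrect: int) -> "SourceReliability":
        # Conjugate update: audited counts are added to the prior pseudo-counts.
        if correct < 0 or incorrect < 0:
            raise ValueError("counts must be non-negative")
        return SourceReliability(self.alpha + correct, self.beta + incorrect)

    @property
    def mean(self) -> float:
        # Posterior mean: the expected probability that a record is correct.
        return self.alpha / (self.alpha + self.beta)


if __name__ == "__main__":
    # Assumed audit result: 18 of 20 sampled records were found to be correct.
    posterior = SourceReliability().update(correct=18, incorrect=2)
    print(f"{posterior.mean:.3f}")  # prints 0.864
```

Starting from a uniform prior, each audit shifts the estimate toward the observed share of correct records. Sources with few audited records stay close to the prior, which is one reason a Bayesian treatment suits data whose reliability is itself uncertain.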