Data Warehousing Institute 2002 Survey
The following table reports the results of an industry survey of sources of data quality problems:
Source | Percentage |
---|---|
Data entry by employees | 76 |
Changes to source systems | 53 |
Data migration or conversion projects | 48 |
Mixed expectations by users | 46 |
External data | 34 |
System errors | 26 |
Data entry by customers* | 25 |
Other | 12 |

*Includes typographical errors and information typed into the wrong fields when entering data into forms on the World Wide Web.

Source: Wayne W. Eckerson, Data Quality and the Bottom Line: Achieving Business Success Through a Commitment to High Quality Data, The Data Warehousing Institute, 2002.
Data Warehousing Institute 2005 Follow-Up Survey
Three years later, a TDWI survey of many of the same industry sources reported the following list of most frequent contributors to data quality problems in their organizations:
Source | Percentage |
---|---|
Data entry by employees | 75 |
Inconsistent definitions for common terms | 75 |
Data migration or conversion projects | 46 |
Mixed expectations by users | 40 |
External data | 38 |
Data entry by customers* | 26 |
System errors | 25 |
Changes to source systems | 20 |
Other | 7 |

*Includes typographical errors and information typed into the wrong fields when entering data into forms on the World Wide Web. Note that human error can also have a significant effect on the security of your site (Liginlal, Sim, and Khansa, 2009).

Source: Philip Russom, Taking Data Quality to the Enterprise Through Data Governance, The Data Warehousing Institute, 2005.
The percentages do not sum to 100 because they represent the percentage of survey respondents who reported the associated problem as a significant source of data quality issues in their organization, not the share of all errors contributed by each source.
The surveys cover the same error sources except that the 2005 survey adds the category “Inconsistent definitions for common terms,” which tied with “Data entry by employees” as the number one source of data quality problems in the organization.
Data Quality Problem Source | 2002 Survey Score | 2005 Survey Score | Change (percentage points) |
---|---|---|---|
Data entry by employees | 76 | 75 | -1 |
Inconsistent definitions for common terms | ø | 75 | ø |
Data migration or conversion projects | 48 | 46 | -2 |
Mixed expectations by users | 46 | 40 | -6 |
External data | 34 | 38 | +4 |
Data entry by customers | 25 | 26 | +1 |
System errors | 26 | 25 | -1 |
Changes to source systems | 53 | 20 | -33 |
Other | 12 | 7 | -5 |

Sources: Eckerson (2002) and Russom (2005), cited above.

ø = The category “Inconsistent definitions for common terms” was not measured in the 2002 survey.
Note that over the three years between surveys, the changes, whether positive or negative, are, with two exceptions, not significant. The exceptions are the modest 6-point drop in “Mixed expectations by users” and the substantial 33-point drop in “Changes to source systems.”
It is interesting to note, however, that of the 79% of surveyed organizations with a data quality initiative in place, the 2005 survey found the data warehousing group most likely to be leading that initiative, whereas in the 2002 survey the data warehousing team ranked second to the IT department. Unfortunately, a full 42% of those surveyed had no plans to institute a data governance initiative, while a mere 8% had such an initiative already in place. Another 33% had an initiative under consideration, and the remaining 17% had such a plan in either its design or implementation phase.
Database Constraints and Enterprise Business Rules
Database constraints can easily validate data entries and enforce referential integrity, both of which address commonly reported sources of data errors. CHECK constraints prevent an operator from successfully typing values into a table column that fall outside the range permitted for that column by enterprise business rules. Referential integrity constraints prevent child table rows from becoming orphaned as the result of a mistaken deletion of a parent table row, or of an update to a parent table primary, or alternate, key.
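The following sketch shows both kinds of declarative constraint on a pair of hypothetical tables (table names, column names, and the salary range are illustrative only; verify the full syntax against “CREATE TABLE” in the SQL Data Definition Language documentation):

```sql
CREATE TABLE department (
    dept_no   INTEGER NOT NULL PRIMARY KEY,
    dept_name VARCHAR(30) NOT NULL
);

CREATE TABLE employee (
    emp_no  INTEGER NOT NULL PRIMARY KEY,
    dept_no INTEGER NOT NULL,
    salary  DECIMAL(10,2) NOT NULL
        CHECK (salary BETWEEN 0 AND 500000),           -- rejects out-of-range entries
    FOREIGN KEY (dept_no) REFERENCES department (dept_no)  -- prevents orphaned child rows
);
```

With these declarations in place, an INSERT with a negative salary is rejected, as is a DELETE of a department row that employee rows still reference.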
Constraints are a physical implementation of the business rules under which an enterprise operates. Integrity constraints are a method of restricting database updates to a set of specified values or ranges of values. They ensure not only that bad data does not get into the database, but also that intertable relationships do not become corrupted by the improper deletion or updating of existing data. Integrity constraints fall into two general categories:

- Semantic
- Physical
Teradata Database also provides referential constraints, which give the Optimizer a means for devising better query plans, but which do not enforce the referential integrity of the database. See “CREATE TABLE” in Teradata Vantage™ - SQL Data Definition Language Syntax and Examples, B035-1144 for more information.
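A minimal sketch of such an unenforced referential constraint, using hypothetical tables (the WITH NO CHECK OPTION clause is shown as the author understands Teradata syntax; verify the exact grammar against the CREATE TABLE documentation cited above):

```sql
CREATE TABLE sales (
    sale_no INTEGER NOT NULL PRIMARY KEY,
    cust_no INTEGER NOT NULL,
    -- The Optimizer can use this declared relationship for join planning,
    -- but the database does not verify it on insert, update, or delete.
    FOREIGN KEY (cust_no) REFERENCES WITH NO CHECK OPTION customer (cust_no)
);
```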
Definitions
Term | Definition |
---|---|
Business rule | A component of the business model that defines specific conditional modes of activity. Business rules are expressed in natural language and are represented in a relational database by integrity constraints. They are the ultimate determining factor for proper database design. |
Constraint | A predicate that must evaluate to TRUE if a database DELETE, INSERT, or UPDATE operation is to be permitted. |
Integrity constraint | A component of the logical database model that formalizes business rules by specifying the boundary conditions and ranges permitted for various database parameters. Integrity constraints are one of the four components of an integrity rule and are expressed in the language of databases: tables, columns, rows, and so on. |
Integrity rule | A set of rules for ensuring the integrity of a database. Integrity rules are specified by the SQL CREATE TABLE and ALTER TABLE statements. Two of their components, the checking time and the violation response, are described below. |

The checking time specifies the processing point at which the constraint is checked. In the ANSI/ISO SQL standard, a checking time has either of the following possible values:

- Immediate
- Deferred

All Teradata Database constraints have an immediate checking time. Deferred constraint checking, which is never a good way to ensure integrity, is not permitted.

The violation response specifies the action to be taken when an integrity constraint is violated. In the ANSI/ISO SQL standard, a violation response has either of the following possible values:

- Reject
- Compensate

All Teradata Database constraints have a reject violation response. Compensate violation responses are not permitted.
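Because integrity rules are specified with CREATE TABLE and ALTER TABLE, a constraint can also be added to an existing table after the fact. A sketch using a hypothetical table and an illustrative constraint name (the syntax follows the general ANSI pattern; verify against the SQL Data Definition Language documentation):

```sql
-- Adds a named referential constraint to an existing employee table.
ALTER TABLE employee
    ADD CONSTRAINT fk_dept
    FOREIGN KEY (dept_no) REFERENCES department (dept_no);
```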
Semantic Data Integrity Constraints
Semantic integrity constraints enforce the logical meaning of the data and its relationships. Physical integrity constraints enforce the physical integrity of the data, for example ensuring that the bit sequence 00000001 is stored as 00000001 and not 00000010.
The emphasis of this section is on declarative semantic constraints: constraints that are part of the definition of the database itself. It is also possible to implement procedural support for integrity using database features such as triggers and stored procedures. The principal drawbacks to procedural enforcement of database integrity are performance, the added programming burden, and, critically, the increased opportunity for errors in enforcement that these methods introduce. If a declarative constraint alternative is available, it is almost always a better choice than a trigger (which can introduce serious sequencing and timing issues) or a stored procedure.
The least favorable method for enforcing database integrity is through application programming. Besides the increased programming burden and its likely introduction of errors into integrity enforcement, application programming introduces the additional problem of application specificity. If one application enforces database integrity and another does not, or if two programs enforce integrity in different, perhaps contradictory, ways, the chance of corrupting the database grows greater still. Worse, application-based integrity enforcement cannot affect ad hoc inserts, deletes, and updates to the database, placing still further burdens on the DBA to find other mechanisms for preventing database corruption.
The point of relational databases is to be application-neutral, thus serving all applications equally well.
These constraints, in addition to the other constraints defined for the system during the process of normalization, such as functional and multivalued dependencies, have the added advantage of facilitating query and join optimization. The richer the constraint set specified for a database, the more opportunities there are to enhance query optimization.
Semantic constraints can be declared at the following levels:

- Column-level constraints
- Table-level constraints
- Database constraints
The recommendation is to specify constraints wherever you can, while remaining mindful of the performance cost their enforcement incurs. If you cannot trust your data, there is little point in maintaining the databases that contain it.
You can never declare semantic database constraints on columns defined with the XML, JSON, BLOB, or CLOB data types.
If performance issues make declarative constraints an unattractive option, you can implement integrity constraints by some other mechanism; however, the performance savings attained by implementing integrity constraints outside the database are often just transferred from the database to the application, negating any actual gain from not declaring the constraints.
The most important consideration must be that database integrity is consistently enforced.