Data Warehousing Institute 2002 Survey
The following table reports the results of an industry survey of sources of data quality problems:
|Data entry by employees||76|
|Changes to source systems||53|
|Data migration or conversion projects||48|
|Mixed expectations by users||46|
|Data entry by customers
This includes typographical errors and information typed into the wrong fields made when entering data into forms on the World Wide Web.
|Source: Wayne W. Eckerson, Data Quality and the Bottom Line: Achieving Business Success Through a Commitment to High Quality Data, The Data Warehousing Institute, 2002.|
Data Warehousing Institute 2005 Follow-Up Survey
Three years later, a TDWI survey of many of the same industry sources reported the following list of most frequent contributors to data quality problems in their organizations:
|Data entry by employees||75|
|Inconsistent definitions for common terms||75|
|Data migration or conversion projects||46|
|Mixed expectations by users||40|
|Data entry by customers
This includes typographical errors and information typed into the wrong fields made when entering data into forms on the World Wide Web. Note that human error can also have a significant effect on the security of your site (Liginlal, Sim, and Khansa, 2009).
|Changes to source systems||20|
|Source: Philip Russom, Taking Data Quality to the Enterprise Through Data Governance, The Data Warehousing Institute, 2005.|
The percentages do not add to 100 because they represent the relative number of survey respondents who reported the associated data quality problem as a significant source of data quality problems in their organization, not the percentage of all errors contributed by each source.
The surveys cover the same error sources except that the 2005 survey adds the category “Inconsistent definitions for common terms,” which tied with “Employee data entry” as the number one source of data quality problems in the organization.
|Data Quality Problem Source||2002 Survey Score||2005 Survey Score||Percent Change|
|Data entry by employees||76||75||-1|
|Inconsistent definitions for common terms||ø||75||0|
|Data migration or conversion problems||48||46||-2|
|Mixed expectations by users||46||40||-6|
|Data entry by customers||25||26||1|
|Changes to source systems||53||20||-33|
The category “inconsistent definitions for common terms” was not measured in the 2002 survey.
Note that over three years time, the percent change, whether positive or negative, is, with two possible exceptions, not significant. The exceptions are the relatively small 6% drop in “Mixed expectations by users” and the significantly large 33% drop in “Changes to source systems.”
It is interesting to note, however, that of the 79% of the organizations surveyed that have a data quality initiative in place, the team leading that initiative in the 2005 survey is most likely to be the data warehousing group, whereas in the 2002 survey, the data warehousing team was second to the IT department in terms of which was the more likely leader of the initiative. Unfortunately, a whopping 42% of those surveyed had no plans to institute a data governance initiative, while a mere 8% had such an initiative already in place. 33% had an initiative under consideration, while another 17% had such a plan in either its design or implementation phase.
The Problem of Capturing the External World as Data
Fundamental to the problem of data quality is the fact that it is virtually impossible to capture the reality of the external world as data. At the same time, the Closed World Hypothesis (see The Closed World Assumption and The Closed World Assumption Revisited) tells us that if an otherwise valid tuple does not appear in the body of a relation, then the proposition representing that tuple must be false. In other words, the assumption is made that facts not known to be true in a relational database are false.
The following diagram is from an early draft of a book by Kowalski (2011) on computational logic for people who are not computing professionals.
Think of the Agent as a business analyst or user, the Logical Representation of the World as the database, and The World as reality external to the database.
The relationship between logic and the world can be represented from two points of view:
- The world perspective.
- The logical representation of the world.
From the world perspective, logical sentences represent selected world features.
From the logical perspective, the world provides semantics for logical sentences. A world structure is a collection of individuals and the relationships among them, and only true sentences are useful to a business analyst or user, which is why only true sentences are stored in a database.
A world structure corresponds to a single, static state of the world. In the relational model, this corresponds to a relation value. See Relations, Relation Values, and Relation Variables.
An atomic sentence is true in a world structure if, and only if, the relationship it expresses holds in the world structure, and otherwise it is false.
The difference between such atomic sentences and the world structure they represent is that in a world structure the individuals and the relationships between them have a kind of external existence that is independent of language. Atomic sentences, on the other hand, are just symbolic expressions that stand for such external relationships.
When studied closely, it becomes apparent that the philosophical nature of data itself is important enough to warrant serious investigation, particularly as applied to issues of database management.
This is not a problem of any particular vendor implementation, nor even of the relational model: it is inherent in data as data.
Database Constraints and Enterprise Business Rules
Database constraints can easily validate data entries and enforce referential integrity, both of which are commonly reported sources of data errors. CHECK constraints can prevent a keypunch operator from successfully typing values into a table column that are outside the range of values permitted for that column by enterprise business rules, and referential integrity constraints can prevent child table rows from becoming orphaned as the result of a mistaken deletion of a parent table row or update to a parent table primary, or alternate, key.
Constraints are a physical implementation of the business rules under which an enterprise operates. Integrity constraints are a method of restricting database updates to a set of specified values or ranges of values. They ensure not only that bad data does not get into the database, but also that intertable relationships do not become corrupted by the improper deletion or updating of data from the existing database.
The basic types of integrity constraints are:
Teradata Database also provides a referential constraints, which provides the Optimizer with a means for devising better query plans, but which does not enforce the referential integrity of the database. See “CREATE TABLE” in SQL Data Definition Language for more information.
|Business rule||A component of the business model that defines specific conditional modes of activity.
Business rules are expressed in natural language and are represented in a relational database by integrity constraints. They are the ultimate determining factor for proper database design.
|Constraint||A predicate that must evaluate to TRUE if a database DELETE, INSERT, or UPDATE operation is to be permitted.|
|Integrity constraint||A component of the logical database model that formalizes business rules by specifying the boundary conditions and ranges permitted for various database parameters.
Integrity constraints are one of the four components of an integrity rule and are expressed in the language of databases: tables, columns, rows, and so on.
|Integrity rule||A set of rules for ensuring the integrity of a database. Each integrity rule is composed of a:
The checking time specifies the processing point at which the constraint is checked. In the ANSI/ISO SQL standard, a checking time has either of the following possible values:
All Teradata Database constraints have an immediate checking time. Deferred constraint checking, which is never a good way to ensure integrity, is not permitted.
The violation response specifies the action to be taken when an integrity constraint is violated. In the ANSI/ISO SQL standard, a violation response has either of the following possible values:
All Teradata Database constraints have a reject violation response. Compensate violation responses are not permitted.
Integrity rules are specified by the SQL CREATE TABLE and ALTER TABLE statements.
Semantic Data Integrity Constraints
Semantic integrity constraints enforce the logical meaning of the data and its relationships. Physical integrity constraints enforce the physical integrity of the data, for example ensuring that the bit sequence 00000001 is stored as 00000001 and not 00000010.
The emphasis of this chapter is on declarative semantic constraints: constraints that are part of the definition of the database itself. It is also possible to implement procedural support for integrity using database features such as triggers and stored procedures. The principal drawbacks to procedural enforcement of database integrity are performance, the added programming requirements and, critically, the increased opportunity for errors in enforcement introduced by these methods. If a declarative constraint alternative is available, it is almost always a better choice than a trigger (which can introduce serious sequential timing issues) or stored procedure.
The least favorable method for enforcing database integrity is through application programming. Besides the increased programming burden and its likely introduction of errors into integrity enforcement, application programming introduces the additional fault of application specificity. If one application enforces database integrity and another does not, or if two programs enforce integrity in different, perhaps contradictory ways, then a still greater chance of corrupting the database results. Worse still, application-based database integrity cannot affect ad hoc inserts, deletes, and updates to the database, and as a result place still further burdens on the DBA to find other mechanisms of preventing database corruption.
The point of relational databases is to be application-neutral, thus serving all applications equally well.
These constraints, in addition to the other constraints defined for the system during the process of normalization, such as functional and multivalued dependencies, have the added advantage of facilitating query and join optimization. The richer the constraint set specified for a database, the more opportunities there are to enhance query optimization.
The principal types of declarative constraints described are as follows:
- Column-level constraints
- Table-level constraints
- Database constraints
The recommendation for constraints is to specify them anywhere you can, being mindful of the performance debit their enforcement accrues. If you cannot trust your data, there is little point in maintaining the databases that contain it.
You can never declare semantic database constraints on columns defined with the XML, JSON, BLOB, or CLOB data types.
If performance issues make declarative constraints an unattractive option, then you should always implement integrity constraints by some other mechanism. The performance savings attained by implementing integrity constraints outside the database are often just transferred from the database to the application, negating any actual performance gains realized by not implementing the constraints declaratively.
The most important consideration must be that database integrity is consistently enforced.