17.10 - Sources of Data Quality Problems - Advanced SQL Engine - Teradata Database

Teradata Vantage™ - Database Design

Product
Advanced SQL Engine
Teradata Database
Release Number
17.10
Release Date
July 2021
Content Type
User Guide
Publication ID
B035-1094-171K
Language
English (United States)

Data Warehousing Institute 2002 Survey

The following table reports the results of an industry survey of sources of data quality problems.

Source Percentage
Data entry by employees 76
Changes to source systems 53
Data migration or conversion projects 48
Mixed expectations by users 46
External data 34
System errors 26
Data entry by customers

This includes typographical errors and information typed into the wrong fields made when entering data into forms on the World Wide Web.

25
Other 12
Source: Wayne W. Eckerson, Data Quality and the Bottom Line: Achieving Business Success Through a Commitment to High Quality Data, The Data Warehousing Institute, 2002.

Data Warehousing Institute 2005 Follow-Up Survey

Three years later, a TDWI survey of many of the same industry sources reported the following list of most frequent contributors to data quality problems in their organizations:

Source Percentage
Data entry by employees 75
Inconsistent definitions for common terms 75
Data migration or conversion projects 46
Mixed expectations by users 40
External data 38
Data entry by customers

This includes typographical errors and information typed into the wrong fields made when entering data into forms on the World Wide Web. Note that human error can also have a significant effect on the security of your site (Liginlal, Sim, and Khansa, 2009).

26
System errors 25
Changes to source systems 20
Other   7
Source: Philip Russom, Taking Data Quality to the Enterprise Through Data Governance, The Data Warehousing Institute, 2005.

The percentages do not add to 100 because they represent the relative number of survey respondents who reported the associated data quality problem as a significant source of data quality problems in their organization, not the percentage of all errors contributed by each source.

The surveys cover the same error sources except that the 2005 survey adds the category “Inconsistent definitions for common terms,” which tied with “Employee data entry” as the number one source of data quality problems in the organization.

Data Quality Problem Source 2002 Survey Score 2005 Survey Score Percent Change
Data entry by employees 76 75           -1
Inconsistent definitions for common terms ø 75            0
Data migration or conversion problems 48 46           -2
Mixed expectations by users 46 40           -6
External data 34 38            4
Data entry by customers 25 26            1
System errors 26 25           -1
Changes to source systems 53 20         -33
Other 12   7           -5
Sources:
  • Wayne W. Eckerson, Data Quality and the Bottom Line: Achieving Business Success Through a Commitment to High Quality Data, Seattle, WA: The Data Warehousing Institute, 2002.
  • Philip Russom, Taking Data Quality to the Enterprise Through Data Governance, Chatsworth, CA: The Data Warehousing Institute, 2005.

The category “inconsistent definitions for common terms” was not measured in the 2002 survey.

Note that over three years time, the percent change, whether positive or negative, is, with two possible exceptions, not significant. The exceptions are the relatively small 6% drop in “Mixed expectations by users” and the significantly large 33% drop in “Changes to source systems.”

It is interesting to note, however, that of the 79% of the organizations surveyed that have a data quality initiative in place, the team leading that initiative in the 2005 survey is most likely to be the data warehousing group, whereas in the 2002 survey, the data warehousing team was second to the IT department in terms of which was the more likely leader of the initiative. Unfortunately, a whopping 42% of those surveyed had no plans to institute a data governance initiative, while a mere 8% had such an initiative already in place. 33% had an initiative under consideration, while another 17% had such a plan in either its design or implementation phase.

Database Constraints and Enterprise Business Rules

Database constraints can easily validate data entries and enforce referential integrity, both of which are commonly reported sources of data errors. CHECK constraints can prevent a keypunch operator from successfully typing values into a table column that are outside the range of values permitted for that column by enterprise business rules, and referential integrity constraints can prevent child table rows from becoming orphaned as the result of a mistaken deletion of a parent table row or update to a parent table primary, or alternate, key.

Constraints are a physical implementation of the business rules under which an enterprise operates. Integrity constraints are a method of restricting database updates to a set of specified values or ranges of values. They ensure not only that bad data does not get into the database, but also that intertable relationships do not become corrupted by the improper deletion or updating of data from the existing database.

The basic types of integrity constraints are:
  • Semantic
  • Physical

Vantage also provides a referential constraint, which provides the Optimizer with a means for devising better query plans, but which does not enforce the referential integrity of the database (see the information about CREATE TABLE in Teradata Vantage™ - SQL Data Definition Language Syntax and Examples, B035-1144).

Definitions

Term Definition
Business rule A component of the business model that defines specific conditional modes of activity.

Business rules are expressed in natural language and are represented in a relational database by integrity constraints. They are the ultimate determining factor for proper database design.

Constraint A predicate that must evaluate to TRUE if a database DELETE, INSERT, or UPDATE operation is to be permitted.
Integrity constraint A component of the logical database model that formalizes business rules by specifying the boundary conditions and ranges permitted for various database parameters.

Integrity constraints are one of the four components of an integrity rule and are expressed in the language of databases: tables, columns, rows, and so on.

Integrity rule A set of rules for ensuring the integrity of a database. Each integrity rule is composed of a:
  • Name
  • Integrity constraint set
  • Checking time
  • Violation response
The checking time specifies the processing point at which the constraint is checked. In the ANSI/ISO SQL standard, a checking time has either of the following possible values:
  • Immediate
  • Deferred

All Teradata constraints have an immediate checking time. Deferred constraint checking, which is never a good way to ensure integrity, is not permitted.

The violation response specifies the action to be taken when an integrity constraint is violated. In the ANSI/ISO SQL standard, a violation response has either of the following possible values:
  • Reject
  • Compensate

All Teradata constraints have a reject violation response. Compensate violation responses are not permitted.

Integrity rules are specified by the SQL CREATE TABLE and ALTER TABLE statements.

Semantic Data Integrity Constraints

Semantic integrity constraints enforce the logical meaning of the data and its relationships. Physical integrity constraints enforce the physical integrity of the data, for example ensuring that the bit sequence 00000001 is stored as 00000001 and not 00000010.

The emphasis of this section is on declarative semantic constraints: constraints that are part of the definition of the database itself. It is also possible to implement procedural support for integrity using database features such as triggers and stored procedures. The principal drawbacks to procedural enforcement of database integrity are performance, the added programming requirements and, critically, the increased opportunity for errors in enforcement introduced by these methods. If a declarative constraint alternative is available, it is almost always a better choice than a trigger (which can introduce serious sequential timing issues) or stored procedure.

The least favorable method for enforcing database integrity is through application programming. Besides the increased programming burden and its likely introduction of errors into integrity enforcement, application programming introduces the additional fault of application specificity. If one application enforces database integrity and another does not, or if two programs enforce integrity in different, perhaps contradictory ways, then a still greater chance of corrupting the database results. Worse still, application-based database integrity cannot affect ad hoc inserts, deletes, and updates to the database, and as a result place still further burdens on the DBA to find other mechanisms of preventing database corruption.

The point of relational databases is to be application-neutral, thus serving all applications equally well.

These constraints, in addition to the other constraints defined for the system during the process of normalization, such as functional and multivalued dependencies, have the added advantage of facilitating query and join optimization. The richer the constraint set specified for a database, the more opportunities there are to enhance query optimization.

The principal types of declarative constraints described are as follows:
  • Column-level constraints
  • Table-level constraints
  • Database constraints

The recommendation for constraints is to specify them anywhere you can, being mindful of the performance debit their enforcement accrues. If you cannot trust your data, there is little point in maintaining the databases that contain it.

You can never declare semantic database constraints on columns defined with the XML, JSON, BLOB, or CLOB data types.

If performance issues make declarative constraints an unattractive option, then you should always implement integrity constraints by some other mechanism. The performance savings attained by implementing integrity constraints outside the database are often just transferred from the database to the application, negating any actual performance gains realized by not implementing the constraints declaratively.

The most important consideration must be that database integrity is consistently enforced.