The pursuit of theory is frequently perceived as being unnecessary in the day-to-day practice of administering relational database management systems, but a small dose of theory is essential to understanding the issues you encounter when designing and maintaining databases.
The extent to which practical relational database management can be formalized using the principles of set theory and formal logic is seldom appreciated. The correspondence between relational set theory and relational database theory is not always direct. While it is true that relational database theory is based on the relations of set theory, it necessarily introduces the additional constructs listed below.
- Database relations are typed, while set theory relations are not. Note that data types, or domains, are also a form of constraint.
- Database relations are not ordered left-to-right, while set theory relations are.
- Dependency theory is critical to database relations, while no corresponding theory exists for set theory relations.
- Relation variables are critical to database relations, while no corresponding theory exists for set theory relations. As a result, the concept of integrity as it is understood for relational database management is foreign to set theory relations.
Definitions of Terms
Before proceeding further, some definitions for terms used in predicate logic and the predicate calculus are in order.
|Existential quantifier||The symbolic quantifier ∃ of predicate logic, signifying the logically equivalent English language phrases “for some,” “for any,” and “there exists.”|
|Identity predicate||The symbolic operator = of predicate logic, signifying the logically equivalent English language phrases “is identical to” or “is equal to.”|
|Inference rules||The rule set of a formal system that determines the steps of reasoning that are valid for proving logical propositions.|
|Predicate||A truth-valued function.
The attributes of a relation (columns), as well as the relation heading (relation variable) itself, can be represented formally as logical predicates.
This is true whether an explicit constraint is defined over the column or not, because when no explicit constraint is defined, the implicit constraint specified by the data type for the column specifies its minimum, essential, constraint. You cannot, for example, insert the character string ‘character string’ into a column typed as INTEGER without first converting the string into an integer value.
This type of constraint is known as a domain constraint (see Domain Constraints).
|Predicate calculus||The set of inference rules by which propositions in predicate logic are proven.|
|Predicate logic||The study of statement validity using the truth-functional operators of the propositional calculus, the universal and existential quantifiers, and the identity predicate.|
|Proposition||An assertion that can be proven unequivocally to be either true or false.
In a relational table, or relation variable, all rows are assumed to be true propositions by default, because if they were false, they would have been prevented from entry in the database by the various integrity constraints, both implicit and explicit, defined on that database. Each proposition (tuple) in a relation is an instantiation of its relation variable predicate that evaluates to TRUE.
This important property is sometimes called the Closed World Assumption (see The Closed World Assumption).
|Truth-valued function||A function that evaluates unequivocally to either TRUE or FALSE.|
|Universal quantifier||The symbolic quantifier ∀ of predicate logic, signifying the English language phrase “for all.”|
Relations, Relation Values, and Relation Variables
As the relational model has matured, careful analysis has continued to reveal facets of its definition that are obvious in hindsight, but had remained undiscovered and unstated until recently. An important example of such a revelation of the obvious is the relation variable, or relvar.
Note that relations, as defined in set theory, do not vary over time. A relation is a value, and values do not vary as a function of time. The number 3, for example, is always the number 3, and never 2 or 13 or 724.
Variables, on the other hand, take on different values as a function of time, so it is formally correct to state that the value of a relation variable changes as a function of time.
The distinction between a relation variable and its associated relation value is an important one because without it, the application of the axioms of traditional set theory to database relations is suspect at best. That set theory deals entirely with extensional sets: a set is determined entirely by its population. There is simply no notion of a set with changing population; each different population constitutes a different set.”
In the context of business rules and their relationship with integrity constraints, the concept of the relvar is useful because it provides a formal, yet simple, framework in which to situate the idea of integrity constraints. Note that the relvar concept applies equally well to views (virtual relvars) and base tables (base relvars).
The first principle captured by the concept of the relvar is that the value of a relation is unrelated to its defining variable. The easiest way to explain this is by analogy. A variable in any programming language represents the possible values that can be taken for that variable, but in itself it is not those values: it is, instead, a place holder for them. The values represented by the program variable can take on any number of values, but the variable itself is always an abstraction.
Think of a relvar as the heading definition for the attributes of a relation: it is the ANDing of all the column headings, including their constraints, defined for the relation. In terms of logic, a relvar is the predicate for a relation. More specifically, a relvar is the internal predicate for a relation. The real world that relvar represents is its external predicate. Any lack of correspondence between the external and internal predicates for a relvar is the defining characteristic of the data quality issue (see the introduction to this chapter for more information) as well as the inspiration for the maxim that a database can only enforce consistency, not truth. Unless otherwise stated, the phrase relvar predicate refers to the internal relvar predicate in this book, while each individual instantiation of that relvar, a tuple (or row), is a proposition for that predicate that evaluates to TRUE.
The relation value, then, is the net value of all the data contained by the relation. Each time the relation is updated, its value changes, but its relation variable does not.In a very real sense, a relation becomes an entirely new relation when an update occurs, but its relation variable does not change.
For example, consider the following equation:
x + y = 7
If you set the variable x=4, then the value for variable y must be 3. If you then change the value of the variable x to 3, you have updated x, but you have not changed the value of 4 to 3, you have just changed the value of the variable x from 4 to 3. This holds for any other valid number over the domain of x. Another way to think of this is as follows: values are not time-varying. A 3 is always a 3 and never a 4. On the other hand, a variable can represent any number of different values over the course of time. An old joke also illustrates this point by confusing variables with values: “Do you realize that 2 + 2 = 5 for large values of 2”?
The integrity of this update relationship is maintained by the relvar predicate, which is defined by the logical ANDing of all the integrity constraints defined on the relation in question. The result of enforcing the relvar predicate is what Date and Darwen (1998, 2000) call The Golden Rule: no statement can leave any relvar with a value that violates its relvar predicate. This means that no update operation can ever cause any database constraint to evaluate to FALSE. In other words, the integrity of the database cannot be violated. The Golden Rule is one of the fundamental principles of relational database theory.
It follows that tuples (“rows” in everyday language: see Definitions) are instantiations of the relvar predicate that represent various propositions about the relation they constitute. When you think of a row as a proposition, it immediately follows that it must also be a true proposition because, by definition, false propositions (wrong, or corrupt data in everyday language) are not permitted in a database that enforces integrity constraints (see The Closed World Assumption). If they were, the database would not represent facts and the Golden Rule would be violated. In other words, if a given tuple is an element of a particular relvar, then that tuple, by default, satisfies its predicate by evaluating to TRUE.
Unfortunately, the representation of the real world by a database is only as good as its input, and any database will permit the insertion of factually wrong data as long as that data does not violate the constraints specified for the base table or view through which the table is updated (see Sources of Data Quality Problems. In other words, a database can only enforce consistency, not truth. All truths are consistent, but not all consistent things are true.
This is an important consideration: the database itself cannot know whether its information corresponds to real world facts (see Sources of Data Quality Problems). It can, however, constrain the data with which it is updated by ensuring that no business rules are violated, and it is these constraints that provide the mechanism for ensuring the success of the data integrity rules enforced by relational database management systems.
The Logic of Database Integrity
To ensure that no row in the database represents a logically false proposition, each table in the database must be capable of evaluating its own predicate, or relvar, for its truth value, and the only way it can enforce this integrity is to enforce the set of business rules defined by the enterprise it supports. In practice, this means that the predicate for any table is its sole criterion for update acceptability. From this, it follows that the formal meaning of any table is defined by the logical ANDing of its component column and table constraints. This logically ANDed constraint set is also the best approximation the database management system can make to any table predicate.
From this, it follows that the meaning of an entire database, its predicate, can be represented formally as the logical ORing of its table constraints logically ANDed with the set of all database constraints.
The Closed World Assumption
The Closed World Assumption, or CWA, asserts that if an otherwise valid tuple does not appear in the body of a relation, then the proposition representing that tuple must be false. In other words, the assumption is made that facts not known to be true in a relational database are false. Two other assumptions are: the Unique Name Assumption, which states that any item in a database has a unique name and, further, that objects with different names are not the same; and the Domain Closure Assumption, which states that there are no objects other than those within the database. Two closely related rules are the Completed Database Rule and the Negation As Finite Failure Rule.
The CWA has become one of the fundamental principles of relational database theory.
In symbolic logic, such an assertion is called a completion axiom. It follows that a relation contains all of, and only those, tuples whose corresponding propositions are true. In other words, a relation body contains only those tuples that satisfy the completion axiom for its relvar. By generalization, the complete set of completion axioms for a given database defines its CWA, and the set of all tuples in a given database must also satisfy its CWA.
The following Venn diagram illustrates this point:
Note that, by the definition of the CWA, the sets of false and true propositions for a database do not intersect. See The Closed World Assumption Revisited for an application of the CWA to missing information in relational databases and the paradox that nulls present to the definition of a well-formed completion axiom.