First Order Predicate Logic and Bivalent Logic
In his original statement of the database relational model, Codd (1970) explicitly stated that it was based on the foundation of first order predicate logic. The classic first order predicate logic is based on bivalent logic (a bivalent, or Boolean, logic has exactly two truth values: TRUE and FALSE, as noted in the previous topic. Higher-valued logics have three or more truth values. For example, SQL uses a 3VL having the following three truth values: TRUE, FALSE, and UNKNOWN), and though the theory of higher‑valued logics has been studied by mathematicians and philosophers, no true multivalued logics are used for real world applications.
Upon close examination, it is apparent that the logic supporting SQL is actually a bivalent logic (2VL) with additional ad hoc support for nulls. Note that the inconsistent treatment of missing values in SQL is common to all vendors: it is not an implementation problem per se, but rather a problem with the way SQL itself is defined.
Indeed, Codd did not add nulls to the relational model until 1979 (see section 2.3 of Codd, 1979). Codd did not originate the use of nulls to represent missing information in relational systems. The Project MAC Advanced Interactive Management System at the Massachusetts Institute of Technology pioneered their use as early as 1970 (see Goldstein and Strnad, 1970, 1971; Strnad, 1970, 1971 for details).
Perhaps significantly, the pre‑System R relational system pioneered in the early 1970s at the IBM Peterlee Scientific Centre in the UK, which was based on solid principles drawn from the relational algebra, did not (see Notley, 1972; Hall, Hitchcock, and Todd, 1975; Todd, 1976 for details).
Pearson (2006), in reviewing the problems that missing data present to data mining, notes that the use of nulls in SQL databases “introduces significant practical complications” because of the faults of applying 3VL to a language that is based on 2VL. For example, the truth value NOT TRUE is not equivalent to the truth value FALSE in 3VLs, but the SQL language can deal only with truth values of TRUE and FALSE. SQL dialects typically evaluate a truth value of unknown as if it were FALSE.
Schafer and Graham (2002) review the statistical literature on dealing with missing data and evaluate the various methodologies that have been developed to handle the issue. Unfortunately, their recommendation of the methods of maximum likelihood estimation and Bayesian multiple imputation for estimating missing data values are of no use to data warehousing because they do not deal with the specifics of detail data. For example, how might one do a maximum likelihood estimate of a missing employee home address, or of an undisclosed employee age (see “The Closed World Assumption Revisited” on page 677)?