15.00 - The Problem of Capturing the External World as Data - Teradata Database

Teradata Database Design

The Problem of Capturing the External World as Data

Fundamental to the problem of data quality is the fact that it is virtually impossible to capture the reality of the external world as data. At the same time, the Closed World Assumption (see “The Closed World Assumption” on page 630 and “The Closed World Assumption Revisited” on page 677) tells us that if an otherwise valid tuple does not appear in the body of a relation, then the proposition that tuple represents must be false. In other words, facts not known to be true in a relational database are assumed to be false.
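
The consequence of this assumption can be sketched in a few lines of Python. This is only an illustration, not Teradata syntax: the relation WORKS_IN, its tuples, and the helper holds() are hypothetical names invented for the example.

```python
# Hypothetical relation body: each tuple asserts "employee works in department".
# The names and values are invented for illustration only.
WORKS_IN = {
    ("Alice", "Finance"),
    ("Bob", "Shipping"),
}


def holds(relation_body: set, candidate: tuple) -> bool:
    """Closed World Assumption: a proposition is true only if its tuple
    appears in the relation body. Absence is read as false, not as unknown."""
    return candidate in relation_body


# A recorded fact evaluates to True.
recorded = holds(WORKS_IN, ("Alice", "Finance"))

# Suppose Carol really does work in Finance, but nobody entered the row.
# Under the assumption, the database still treats the proposition as false.
unrecorded = holds(WORKS_IN, ("Carol", "Finance"))
```

The second lookup shows the data quality issue in miniature: the database gives a definite answer about a fact that it simply failed to capture.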

The following diagram is from an early draft of a book by Kowalski (2011) on computational logic for people who are not computing professionals.

Think of the Agent as a business analyst or user, the Logical Representation of the World as the database, and The World as reality external to the database.

Still following Kowalski (2011), though with numerous modifications, the relationship between logic and the world can be represented from two points of view:

  • The world perspective.
  • The logical representation of the world.

    From the world perspective, logical sentences represent selected world features.

    From the logical perspective, the world provides semantics for logical sentences. A world structure is a collection of individuals and the relationships among them, a notion that corresponds closely to the Entity-Relationship model of Chen (1976). Only true sentences are useful to a business analyst or user, which is why only true sentences are stored in a database.

    A world structure corresponds to a single, static state of the world. In the relational model, this corresponds to a relation value. See “Relations, Relation Values, and Relation Variables” on page 627.
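
The distinction between a static relation value and a relation variable that takes different values over time can be sketched as follows. This is a minimal Python illustration under assumed data; the names state_monday and works_in are hypothetical.

```python
# A relation value is modeled as an immutable frozenset of tuples:
# one single, static state of the world. All data here is hypothetical.
state_monday = frozenset({("Alice", "Finance"), ("Bob", "Shipping")})

# The relation variable is a name whose current value can be replaced.
works_in = state_monday

# Bob moves to Finance. No value is mutated in place; the variable is
# assigned a new relation value that describes the new state of the world.
works_in = (works_in - {("Bob", "Shipping")}) | {("Bob", "Finance")}
```

After the assignment, state_monday still describes the earlier state unchanged, while works_in refers to a different relation value.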

    An atomic sentence is true in a world structure if, and only if, the relationship it expresses holds in the world structure, and otherwise it is false.
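
The truth condition above can be expressed as a simple membership test. The following Python sketch is an assumption-laden illustration: the classes WorldStructure and AtomicSentence, the function is_true(), and all sample data are hypothetical and are not drawn from Kowalski (2011).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorldStructure:
    """A collection of individuals and the relationships among them."""
    individuals: frozenset
    # Each relationship is a pair: (relationship name, tuple of individuals).
    relationships: frozenset


@dataclass(frozen=True)
class AtomicSentence:
    """A symbolic expression that stands for a relationship in the world."""
    predicate: str
    arguments: tuple


def is_true(sentence: AtomicSentence, world: WorldStructure) -> bool:
    """True if, and only if, the relationship the sentence expresses
    holds in the world structure; otherwise False."""
    return (sentence.predicate, sentence.arguments) in world.relationships


# Hypothetical world structure, invented for illustration.
world = WorldStructure(
    individuals=frozenset({"Alice", "Bob", "Finance"}),
    relationships=frozenset({("works_in", ("Alice", "Finance"))}),
)

alice_in_finance = is_true(AtomicSentence("works_in", ("Alice", "Finance")), world)
bob_in_finance = is_true(AtomicSentence("works_in", ("Bob", "Finance")), world)
```

Note that the sketch necessarily models the world structure as data too, which is exactly the limitation the surrounding text describes: the code can only compare one symbolic representation with another.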

    The difference between such atomic sentences and the world structure they represent is that in a world structure the individuals and the relationships between them have a kind of external existence that is independent of language. Atomic sentences, on the other hand, are just symbolic expressions that stand for such external relationships.

    When studied closely, the philosophical nature of data itself proves important enough to warrant serious investigation, particularly as applied to issues of database management. See Kent (2000) for a highly recommended book-length study of this problem. Much of what Kent describes is the disconnect between the fuzzy logic and fuzzy sets that underlie the real world and the classical set theory and logic that underlie the data in relational and other databases, although he never frames his arguments in those terms. While some research has been done on fuzzy databases (see, for example, Raju and Majumdar, 1988, and Buckles and Petry, 1995), most research in this area has involved fuzzy queries of relational databases and the use of fuzzy logic in data mining. The current state of research makes a clean, easy-to-use application of fuzzy sets and fuzzy logic to database systems seem intractable, in no small part because fuzzy logic suffers from the same difficulties as other multivalued logics. See Klir, St. Clair, and Yuan (1997) for an introduction to fuzzy set theory and fuzzy logic that builds on classical set theory and logic.
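
The contrast between classical and fuzzy membership can be sketched in Python. The predicate "tall", the 180 cm cutoff, and the 160 to 190 cm ramp are arbitrary assumptions chosen for illustration; the connectives are the standard min, max, and complement operators commonly used to introduce fuzzy logic.

```python
def crisp_tall(height_cm: float) -> bool:
    """Classical set: a person either is or is not a member.
    The 180 cm cutoff is an arbitrary assumption."""
    return height_cm >= 180.0


def fuzzy_tall(height_cm: float) -> float:
    """Fuzzy set: membership is a degree between 0.0 and 1.0.
    The linear ramp from 160 cm to 190 cm is an arbitrary assumption."""
    if height_cm <= 160.0:
        return 0.0
    if height_cm >= 190.0:
        return 1.0
    return (height_cm - 160.0) / 30.0


def fuzzy_and(a: float, b: float) -> float:
    return min(a, b)


def fuzzy_or(a: float, b: float) -> float:
    return max(a, b)


def fuzzy_not(a: float) -> float:
    return 1.0 - a


# A height of 175 cm is simply "not tall" in the classical set,
# but tall to degree 0.5 in the fuzzy set.
degree = fuzzy_tall(175.0)

# With these operators, "tall or not tall" is only true to degree 0.5,
# so the classical law of the excluded middle no longer holds.
excluded_middle = fuzzy_or(degree, fuzzy_not(degree))
```

The last line shows one commonly cited way in which multivalued logics depart from the classical logic on which relational databases rest.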

    Also see Date and Darwen (1998, 2000) for a briefer and somewhat more formal study of the semantic limitations of data, which they couch in terms of the inability of data represented by an internal predicate to capture the full semantics of the external world, or external (to the database) predicate.

    This is not a problem of any particular vendor implementation, nor even of the relational model: it is inherent in data as data.