Normalization and Database Design Problems | Teradata Vantage - Normalization and Database Design Problems - Advanced SQL Engine - Teradata Database

Database Design

Product
Advanced SQL Engine
Teradata Database
Release Number
17.05
17.00
Published
June 2020
Language
English (United States)
Last Update
2021-01-22
dita:mapPath
qby1588121512748.ditamap
dita:ditavalPath
lze1555437562152.ditaval
dita:id
B035-1094
lifecycle
previous
Product Category
Teradata Vantage™

Although normalization is the only component of database design based in provably correct mathematics, it also retains some of the more subjective elements common to other aspects of database design. For example, it is usually possible to normalize a database schema in multiple ways. Normalization theory does not guarantee, or even suggest, that it produces a canonical set of relations for any given database. Instead, it guarantees that a normalized database schema is free of a number of problems that otherwise cannot be excluded, the most widely known of these problems being update anomalies.

This topic describes some of the problems commonly encountered in database design that cannot be solved merely by fully normalizing the database schema.

Preserving Functional Dependencies

It is often true that a relation can be nonloss-decomposed in more than one way; however, some of those ways are better than others. For example, consider the following relvar:

employee
emp_num dept_num budget
PK

This relvar has the following FDs:

  • emp_no  dept_no
  • dept_no  budget
  • emp_no » budget

    where » indicates the dependency is transitive.

A relvar like employee suffers from certain well known update anomalies that can be avoided by nonloss-decomposition into various projections. Relvar employee can be nonloss-decomposed into two valid 5NF projections.

For example, consider the following set of 5NF projections:

emp_dept
emp_num dept_num
PK FK
dept_budget
dept_num budget
PK

Call this decomposition A. The second valid 5NF decomposition of employee is as follows:

emp_dept
emp_num dept_num
PK
emp_budget
emp_num budget
PK

Call this decomposition B.

Note that the projection emp_dept is the same for both decompositions A and B.

Even though both decompositions are valid, decomposition A is better than decomposition B. For example, it is not possible to represent the fact that a given department has a budget unless that department also has at least one employee in decomposition B.

You can make the determination that decomposition A is better than decomposition B solely on an analysis of the FDs in the original employee relvar. Note that the projections in decomposition A correspond to the nontransitive FDs in employee (indicated by → ). As a result of this, you can update either projection without regard for the other as long as the referential constraint from emp_dept to dept_budget is satisfied. Provided only that the update in question is legal within the context of the given projection, which means only that it must not violate the primary key constraint for that projection, the join of the two projections after the update is still be a valid value for the employee relvar. In other words, the join cannot possibly violate the FD constraints on employee.

In decomposition B, by contrast, one of the two projections corresponds to the transitive dependency in employee (indicated by »). As a result, updates to either projection must at least be monitored to ensure that the FD dept_no  budget is not violated (in other words, if two employees have the same department, they must also have the same budget - consider, for example, the machinations required in decomposition B to move an employee from department D1 to department D2.)

The problem with decomposition B is that the FD dept_no  budget spans two relvars. On the other hand, while it is true that the transitive FD dept_no » budget in Decomposition A spans relvars, it is also true that the FD is enforced by default as long as the FDs emp_no  dept_no and dept_no  budget, which do not span relvars, are enforced, which only requires enforcement of the corresponding primary key uniqueness constraints.

The point being made is that you should decompose relvars in such a way that functional dependencies are preserved.

Full Normalization and Dependency Preservation

Sometimes the goals of decomposing to full normalization and decomposing them in a way that preserves functional dependencies can conflict. Consider the following example relvar:

student_subject_teacher
student subject teacher
CK  
  CK
Smith Logic Alonzo Church
Smith Physics Robert Millikan
Jones Logic Alfred Tarski
Jones Physics Max Planck

The predicate for the student_subject_teacher relvar is as follows: the student named student is taught the subject named subject by the teacher named teacher.

Assume the following restrictions for purposes of illustration:

  • For each subject, each student of that subject is taught by only one teacher.
  • Each teacher teaches only one subject, but each subject is taught by several teachers.

What are the FDs for the student_subject_teacher relvar?

  • The first bullet implies the following FD:

    student, subjectteacher

  • The second bullet implies the following FD:

    teacher — subject

  • The second bullet also implies that the following FD is not true:

    subject — teacher

For the sake of illustration, this set of dependencies is portrayed symbolically as follows: student — teacher — subject.

As the representation of relvar student_subject_teacher indicates, it has two candidate keys, the composite column sets student-subject and student-teacher.

Both CKs have the property that all columns of student_subject_teacher are functionally dependent on them, and that property no longer holds if any column is discarded from either column set.

Note, too, that student_subject_teacher also satisfies the FD teachersubject, which is not implied by either CK. This absence of implication is indicative that the student_subject_teacher relvar is not in BCNF. As a result, student_subject_teacher suffers from certain update anomalies. Suppose, for example, that you want to delete the information that Jones is studying physics. You cannot do this without also losing the information that Max Planck teaches physics.

As usual, you can avoid the update anomalies by performing a nonloss decomposition into projections. In this particular case, the projections, both of which are in 5NF, are as follows:

student_teacher
student teacher
PK
Smith Alonzo Church
Smith Robert Millikan
Jones Alfred Tarski
Jones Max Planck
teacher_subject
teacher subject
PK  
Alonzo Church Logic
Alfred Tarski Logic
Robert Millikan Physics
Max Planck Physics

It is now possible to delete the information that Jones is studying physics by deleting the row for Jones and Max Planck from projection student_teacher without losing the row for Max Planck and Physics from projection teacher_subject. So the problem is solved … or is it?

Yes, that particular problem is solved. A remaining problem, however, is that this decomposition violates the FD preservation objective.

Specifically, the FD student, subjectteacher now spans two relvars, student_teacher and teacher_subject, so it cannot be deduced from the FD teachersubject, which is the only nontrivial FD represented in student_teacher and teacher_subject. As a result, student_teacher and teacher_subject cannot be updated independently of one another.

For example, an attempt to insert a row for Smith and Max Planck into student_teacher must be rejected, because Max Planck teaches physics and Smith is already being taught physics by Robert Millikan, yet this fact cannot be detected without examining teacher_subject.

The point of this analysis is to highlight the fact that the following equally desirable objectives of database design can sometimes clash:

  • Decomposing relvars to the their ultimate normal form (or even just to BCNF)
  • Decomposing relvars in such a way that preserve FDs

In other words, it is not always possible to achieve both objectives simultaneously.

Other points arise from this example:

  • Neither design is superior to the other on its face. The principles of normalization suggest that one design is better, while the principle of FD preservation suggest that the other is.

    As a result, the ultimate design choice must be based on other criteria.

  • Assume that you settle on the two-relvar design.

    In this case, the relvar-spanning FD must be declared, at least for documentation purposes, even if the design cannot enforce it.

    A formal version of that declaration might look something like this:

        FORALL s IN S, t1,t2 IN T, j1,j2 IN J
        IF     EXISTS { S:s, T:t1 } IN ST
           AND EXISTS { S:s, T:t2 } IN ST
           AND EXISTS { T:t1, J:j1 } IN TJ
           AND EXISTS { T:t2, J:j2 } IN TJ
        THEN j1 , j2

where the student, teacher, and subject domains are represented by S, T, and J, respectively.

If you think this example is overly contrived, and that involuted dependency structures such as studentteachersubject never occur in practice, consider the following, more familiar, example:

address
street city state zip_code

This relvar satisfies the following set of FDs:

  • zip_codecity, state
  • street, city, state  zip_code

In other words, it is identical to the student_subject_teacher example, where student maps to the composite street-city, state maps to subject, and zip_code maps to teacher.