A denormalized physical schema creates ambiguity because within a denormalized table, it is not possible to determine which columns are related to the key, to parent tables, or to one another (see Functional, Transitive, and Multivalued Dependencies, The Referential Integrity Rule, and Domains and Referential Integrity for explanations of how fully normalized tables avoid this ambiguity). Normalized models maintain all relationships through the association of primary keys with their functionally dependent attributes and with foreign keys.
Consider the following non-normalized table, where all but one of the column names have been abbreviated to fit on the page.
The expanded column heading names are given in the following table.
|Abbreviated Column Name||Actual Column Name|
This table meets the criteria for 2NF, but not for 3NF, because not only does every non-key attribute not depend on the primary key, but several non-key attributes are functionally dependent on other non-key attributes (see Definition of Third Normal Form).
You cannot perceive the relationships among the commodity_id, group_id, and family_id attributes by looking at this table. While it is fairly obvious that commodity_name is a functional dependency of the primary key, it could also be true that the commodity attribute has separate relationships with groups and families, or it could be true that commodities are related to groups, which are, in turn, related to families. Moreover, the relationship of the Sequence attribute to the primary key, or to anything else, cannot be resolved. Is it used to order commodities within groups, commodities within families, or groups within families? It is surprising how often such dangling relationships are found in denormalized logical models.
In contrast, normalized models maintain all relationships through the association of primary keys with their attributes and with foreign keys. As a result, you can clearly see the relationships among the various tables in the database schema simply by looking at the data model.
Referential Integrity and a Denormalized Schema
Referential integrity cannot be effectively maintained within a denormalized database schema.
By definition, denormalized structures compromise data quality. Relational database management systems enforce referential integrity to ensure that whenever foreign keys exist, the instances of the objects to which they refer also exist. Denormalized tables cannot support referential integrity through declarative constraints. For example, users have no guarantee that the children in a denormalized table have parents in that table unless they implement some other, more costly method of programmatically ensuring semantic data integrity. The problems with maintaining referential integrity by means of application code rather than through declarative constraints are myriad (see Designing for Database Integrity).
Also, as new incremental data loads are inserted into the database, changed fields are not reflected in the old data. When an attribute is changed in the source system, only the newly loaded rows reflect those changes. For example, suppose the marital status, last name, or another field in the customer table changes. Any new loads for that customer reflect the different data in the changed fields, causing inconsistencies and incorrect results in certain types of analyses because both the changed and unchanged attributes are stored redundantly, violating the concept of a single version of truth that is so easily maintained with a fully normalized database schema.
Denormalized Views Versus Physical Denormalization of the Database Schema
A fully-normalized database schema accessed through denormalized views is far more flexible in creating and managing dimensional consolidation hierarchies for OLAP models than a star schema.
The dimensional view approach, by using one physical database, eliminates both the additional costs and the data quality problems of maintaining redundant data.
DBAs who deal with denormalized methods typically define a single set of mutually exclusive dimensions for a given OLAP model. However, it is possible to create many nonexclusive dimensional paths for an individual OLAP model that also are reusable across many different OLAP models.
The following figure shows two denormalized virtual schemas that share a number of common dimensional views:
The number of virtual schemas that can be created and aligned with particular users or user communities is unlimited. As DBAs become more sophisticated in their application of this concept, they can manage an extensive number of database views tailored toward the specific needs of individual users as well as the specific needs of various user groups. It is not possible to maintain such an extensive schema management program when the data must be physically propagated for performance reasons.
Consider another example: suppose there are eight potential dimensional consolidation paths that could arise from the following three entities:
In OLAP terminology, a consolidation path is a set of hierarchically related data. Consolidation itself is the process of aggregating the hierarchical data to form subtotals. The highest level in the consolidation path is the dimension for the data.
The eight possible consolidation paths for this data are the following, where → indicates a downward path through a hierarchy and in E-R terminology is equivalent to the word have:
- Households → Customers → Accounts
- Households → Accounts → Customers
- Households → Customers
- Households → Accounts
- Customers → Accounts
When either these consolidation paths or the previously mentioned schemas are created virtually, any problems with managing referential integrity are rendered moot because the only data that must be maintained is in the base tables. However, when these tables are instead instantiated as separate physical tables, managing referential integrity across the multiple alternate dimensional paths becomes very difficult to maintain. As more dimensional entities and potentially hundreds more dimensional consolidation paths are added to the schema, it becomes impossible to continue to propagate data physically. As a result, if a physical star schema is implemented, compromises to the maintenance of the semantic integrity of the data must be made to accommodate the resulting complexity.
A star schema is designed specifically to support dimensional analysis, but not all analysis is dimensional. Many types of quantitative analysis, such as data mining, statistical analysis, and case-based reasoning, are actually inhibited by a physical star schema design.
Given the definition of 3NF as the key, the whole key, and nothing but the key, any attributes that do not modify the key, the whole key, and nothing but the key are “noise” in the context of normalization theory.
If you consider the household/customer/account example of the previous section (see Denormalized Views Versus Physical Denormalization of the Database Schema) in a different light, you encounter an interesting problem with denormalization. An OLAP modeler, focused on creating dimensional views and drill-downs, is likely to assume that a household can have many customers but that a customer can belong to only one household. However, the underlying normalized data model might reveal a many-to-many relationship between households and customers. For example, a customer can belong to many different households over time.
A modeler focused on OLAP applications typically is not concerned with this relationship because it is unlikely that OLAP users would want to drill down into previous households. However, a data mining user might find tremendous value in exploring this relationship for patterns of behavior that revolve around the changing composition of households.
For example, consider the following possible household change behaviors:
- Children leave their parents to start their own households.
- Couples marry and form one household from two.
- Married couples divorce and form two households from one.
And so on.
The enterprise needs a logical model to address the precise needs of any analytical methodology, whether it be a complicated ad hoc SQL query, an OLAP analysis, exploratory work using data mining, or something entirely different.
Unbalanced Hierarchies and Unnormalized Physical Schemas
You cannot make the universal assumption that all levels in a dimensional model are balanced. A star schema dimension table, which requires fixed columns for each level, has considerable difficulty in handling unbalanced hierarchies.
An unbalanced hierarchy, as shown in the following figure, is a hierarchy in which a particular level can have its parent at different levels above it. An account executive, for example, might report to a region or a subregion.
Because a star schema creates specific columns to handle each level in a hierarchy, it requires that all hierarchies be balanced so that, using the current example, an account executive would always need to be at level three.
Analyses Driven By Normalized Relationships
Many types of analysis are driven by the normalized relationships in the database.
There is, for example, a considerable amount of analysis that is not quantitative in any way. For example, market basket studies determine clusters of products that are most frequently purchased together. A market basket study does no counting, aggregating, or any other quantitative measure or analysis.
Tracking the cycle times for various entities requires a sophisticated method of following the relationships between statuses and events. Modeling strategic frameworks such as value chains, supply chain management, competency models, industry structure analysis, and total quality management requires sophisticated relational models for which denormalized approaches are inadequate.
Costs of Denormalization
In the long term, the actual cost of maintaining a denormalized environment exceeds the costs of a normalized environment, and the cost of decreased cross-functional information opportunities is much greater.
There are economic costs associated with denormalization that often are not considered. In a star schema, each row in a dimension table contains all of the attributes of every entity mapped within it. In the accounts, customers, and households example described in the previous section, you would carry all the redundant household data and all the redundant customer data for each account. Not only does the redundancy from the expansion of columns exist, but in the case of the many-to-many relationship between accounts and customers, the number of rows also increases because a separate row is required for each legitimate account-customer combination. When there are millions of accounts, thousands of which are jointly held, this horizontal and vertical redundancy can add significant storage overhead.
These costs generally are not significant in light of performance gains, but they are significant in terms of the additional DBA and application coding costs incurred to programmatically maintain referential integrity.
More important is the cost of business opportunities lost because of compromises that render entire categories of data analysis difficult, if not impossible, to perform. Total benefits are much greater for the normalized approach because of its adaptability and generalizability. As an enterprise begins to define its business analysis requirements, it initially identifies only a fraction of what it ultimately needs. Possibly as much as 95% of the real analytic needs of the enterprise go undefined in the infancy of the data warehouse.
The key to being able to capitalize on unsuspected opportunity is flexibility. By building an adaptable database schema from the beginning, an enterprise enables itself to address new business needs as they are identified without having to compromise or restructure the database. The enterprise is also able to address any new challenges in the shortest time frame because it does not need to involve its IT staff in designing, building, and propagating data for the new queries. The faster a company can respond to unexpected challenges and opportunities, the higher the business value that can be realized from its data.
Cross-Functional Analysis and Denormalization
Cross-functional analysis becomes increasingly difficult as a database becomes more and more denormalized.
Data warehousing provides a means to build complex interrelated models for cross-subject area analyses in ways no other system can. These models can move beyond the traditional financial measures to begin interrelating internal process measures and customer-oriented measures as well. More importantly, these more sophisticated analytical models can begin to push from results-oriented or outcome-oriented measures toward measures directly linked to organizational activities. The following figure shows an example of interrelated measures from a credit-card model.
The problem with denormalized data for this example is that measures cross many different denormalized schemas. Even though Teradata Database is optimized to handle them, star joins are cumbersome to process, and making joins across multiple models also makes them extremely complex. In this particular example, there could easily be four or more distinct star schema structures.
No Crystal Ball
There is no crystal ball for predicting the future data analysis needs of the enterprise. The tendency to compromise and build the enterprise data model based on current needs is much greater for those building denormalized enterprise database schemas than for those building normalized schemas.
Even when aware of the compromises being made, individuals undertaking a denormalized implementation of the logical model tend to discount the possibilities of those compromises working against future growth of the system because they believe they have already anticipated all future analytical opportunities.
However, as business discovery evolves, it inevitably becomes clear that those initial projections do not fully support the requirements and vision of the business.
Design For Decision Support and Tactical Analysis
Not only is basic schema normalization an important design consideration, but normalization tailored specifically for the analytical needs of business intelligence is also critical. Even normalized data warehouses are often built as if they were designed to support OLTP applications.
Designers often fail to examine the following list of considerations:
- Integration of subject areas
- Versioning across time
- Generalization of key entities in anticipation of change
- Interrelation of life cycles across many cross-subject area entities and events
- Integration of measures, calculations, and dimensional context paths across the enterprise
These factors all support the immediate and long-term benefits provided by a fully normalized enterprise data model. The fully normalized schema provides a framework both for continued growth and for increasing user insight.
The moral of the story is this: it is often a mistake to compromise your logical data model when you select tools to extend the value of your data and to optimize the delivery of information to your users. When necessary to facilitate ease of use or to improve tool performance, implement views to help ensure the continued integrity of your logical data model. When you are evaluating the data model for your data warehouse, ensure that model not only supports your business today, but that it is capable of supporting the enterprise well into the future.
Design for Flexible Access Using Views
You should design your applications to access the database through views rather than directly accessing base tables. Administrators sometimes avoid application designs that access the database through views because of a perceived performance penalty. In practice, this is rarely true for access operations, and then only in exceptionally rare cases. In fact, view projections of their underlying base tables can minimize the use of spool space by joins by reducing the number of columns that must be redistributed to make the join. Furthermore, you can more readily control locking for access through careful view design, and so minimize the complexity of applications that access the database through those views.
As a general rule, the best practice is to design applications to access the database only through views rather than accessing base tables directly.
Besides providing you with the ability to provide pseudo-denormalized access for enhanced usability (see Denormalized Views Versus Physical Denormalization of the Database Schema), views also provide a degree of data independence that permits you to make changes either to your applications or to the physical database those applications access without having to worry about rewriting application software. Views can also provide a coarse level of schema versioning that is otherwise not available.