- Greater number of relations
- More primary index choices
- Optimal distribution of data
- Fewer full-table scans
For example, consider the multicolumn primary index of a Fact table, which has the primary key of each of its dimension tables as a component. The Optimizer cannot retrieve a row with a partial primary index, so many, if not all, accesses to the Fact table must use a full-table scan.
This assumes that you implement the natural primary key of the fact table as the primary index. If you instead define a surrogate key column to be the primary index, the full-table scan issue is moot (see the definition for Surrogate key in Definitions). This practice is not generally advised.
- More joins possible
- Enhanced likelihood the Optimizer will use the high-performing merge or nested join methods
- Optimal data separation to eliminate redundancy from the database
- Optimal control of data by eliminating update anomalies
- Fewer columns per row
- Optimal application separation
- Optimal control of data
- Smaller rows
- Optimal data blocking by the file system
- Reduced transient and permanent journal space
- Reduced physical I/O
How Normalization Is Beneficial for Physical Databases
- Different applications require different views of the same data.
- Data must be extendable, and a framework must exist to support the introduction of new applications and analyses without having to modify existing applications.
This topic develops some of themes described elsewhere in this document by explaining their relevance to everyday exploratory data analysis problems in data warehousing.
Data Dependence and Data Independence
The main objective of relational DBMSs is data independence. For years, the relational database management systems used to run businesses, often referred to as OLTP systems, made data independence obligatory. In an OLTP database, data is stored in nonredundant tables that demand that every column of the table be rigorously related to its primary key alone and to no other tables. This ensures that information is available to all applications and analyses that use it, and it provides a mechanism for maintaining consistency and reliability across applications: a single source of each particular data element, a single version of the truth.
Data independence works well for OLTP systems because the applications accessing the data generally access single tables or join only a few, small tables in relatively simple queries. With the introduction of the data warehouse, previously unheard of demands were placed on the relational database management systems underlying them. In the data warehouse environment, large tables must be scanned and large result sets are frequently returned. Many tables are joined together, complicated calculations are made, and detailed data is aggregated directly in the queries. In addition, large data volumes are extracted, transformed and loaded into the tables concurrently with users running queries against the data. It quickly became apparent that databases created and tuned for OLTP could not sustain the performance levels required to support the demands of business intelligence processing. The OLTP databases could not perform the queries within their allotted time window or, in some cases, at all.
This situation highlights the potential for contradiction between designing databases for optimum integrity and designing databases for optimum performance. The key to data independence is data normalization, and normalized data schemas are the most demanding of system performance.
To address the issue of poor performance, data independence has often been abandoned in many environments and denormalized schemas have been used to address a few particular, rather than all general, analytical needs of the enterprise.
Although this arrangement addresses short-term decision support needs, it compromises the enterprise view of the data and its adaptability. Data independence, adaptability, and cross-enterprise functionality go hand in hand, and a normalized data schema is critical to reaching these objectives.
The following topics provide some detail about why this is true.
The star schema (see Dimensional Modeling, Star, and Snowflake Schemas), which is the most common form of denormalization used in contemporary data warehousing, cannot handle every kind of relationship that can exist comfortably in a fully-normalized environment. Recursive relationships are one such example. Recursion, as the term is generally used in computer science, is only a small subset of the recursive function theory of formal logic.
A recursive relationship exists when the parent of a member in a hierarchical relationship is also a member of the same entity. As demonstrated by the following figure, there are two ways that this can manifest itself: with only a single parent or with multiple parents:
The most commonly used example of a single-parent recursive relationship is an employee table, where both an employee and its manager have rows. From an E-R perspective, you would say a manager has employees. But managers, too, are employees. This also means that managers can have managers who are employees, and so on.
In the diagram, the single-parent recursive relationship is a customer table in which a customer can be a customer of yet another customer in the table. The classic multiple-parent recursive relationship is the bill of material. The diagram shows an example in which multiple organizations can have multiple organizational affiliations. Project work breakdown hierarchies are another common example of a multiple parent recursive structure.
In a recursive structure, there can be an unlimited number of levels without knowing how many levels each member hierarchy currently has or potentially can have. One hierarchy have only two levels, while another might be 15 levels deep. Herein lies the limitation of the star schema for handling recursive relationships: it requires a fixed number of levels, because each level is set up by a series of fixed columns in a dimension table. Because you do not know the number of levels in a recursive structure, you cannot predefine the columns.
The most critical entities in an enterprise data model frequently have recursive structures. Organizational hierarchies such as internal, customer, supplier, and competitor entities are usually recursive relationships.
Arguments Against Denormalizing the Physical Database Schema to Increase Its Usability
Many data warehouse designers argue that denormalized physical database schemas are easier for end users to navigate than fully normalized schemas.
Denormalized physical schemas certainly seem more user-friendly than the complexity of a highly generalized, normalized data model. However, denormalized physical schemas are driven by known queries, so their ease of use is somewhat illusory. Formulating queries to address novel requirements, a task that is nearly definitive of the data warehouse process model, is made more difficult, if not impossible, in a denormalized environment. A fully normalized enterprise data model is flexible enough to support the undertaking of any new analyses of the data.
That said, the reality is that end users typically do not write queries anyway, and when they do, they are likely to use a third party natural language query generator, so the usability argument is often moot. Coding novel queries is often the responsibility of an application developer or a natural language query writing tool.
More importantly, you can create “denormalized” views to implement a semantic layer that makes the normalized data model easier to navigate (see Denormalizing Through Views). Few sites permit users to query base tables directly anyway, so creating views on base tables that look exactly like their denormalized table counterparts should not be an issue.
If there were no issues of performance for those database management systems that lack the parallel processing power of Vantage, then denormalization could be handled universally by implementing views. Star schemas, snowflakes, summary tables, derived data, and the like could be built as virtual clusters of tables that look exactly like their physical counterparts. By handling denormalization virtually, the relationships within, between, and among the underlying base tables of the schema remain intact, and referential integrity can be maintained by the system regardless of how many virtual denormalized relationships are created. This flexibility frees DBAs to create any number of denormalized views for users while simultaneously maintaining semantic data integrity and eliminating the data redundancies required by denormalized physical schemas.
DBAs can create virtual, subject-oriented schemas for specific applications as well as creating views for more general database access without affecting the underlying base table data. These same views can also be used to enforce security constraints for the different communities of business users who must access the database.
Consider another argument that favors the ease of use of a normalized database schema over a denormalized schema. A physical star schema has physical dimensions that support a physical fact table. However, for some dimensions there can be mutually exclusive substitutes for the same data. For example, suppose an airline is interested in both the point-to-point travel of customers between segments in addition to their travel between their true origins and destinations. This discussion abbreviates this dimensional family as O&D.
The true O&D dimension is different from the segment O&D, although superficially, it looks the same. Moreover, their respective consolidation of facts is different as well, although the detailed base table data is the same. If the star schemas are physicalized, two very large schemas must be created, maintained, and coordinated to represent them, whereas with virtual star schemas, the data is maintained only in the base tables, producing a single, consistent version of the truth.
Arguments Against Denormalizing for Performance
Data warehousing authorities often argue that physical denormalization of the logical data model offers better performance than a physical instantiation of the normalized logical data model. This question is mitigated when the SQL Engine can handle a normalized physical design and scale linearly. Vantage does just that.
If further need to improve performance remains aside from scaling the database, then implement a well-planned data propagation strategy that maintains and complements the underlying normalized base table substructure.
To begin, consider propagating denormalized data within the same database instance. Weigh propagating to another environment only if there are other considerations for the target data source, such as geographic needs or the need to support proprietary data structures. In any case, the propagation is from the data warehouse and not directly from source systems.
- The fully-parallel capabilities of the data warehouse can be used to optimize the propagation of data.
- By keeping the data in the same instance of the database, you can perform hybrid queries that take advantage of both the denormalized and normalized data. For example, large volumes of nonvolatile data from transaction detail rows can be propagated into new physical fact tables, and smaller volume, highly volatile dimensional data can be built into virtual dimension tables.
- The resulting administration of the complete data warehousing environment is not only easier, but also less expensive.