Data warehouse is a repository of an organization's electronically stored data. Data warehouses are designed to facilitate reporting and analysis.
This definition of the data warehouse focuses on data storage. However, the means to retrieve and analyze data, to extract, transform and load data, and to manage the data dictionary are also considered essential components of a data warehousing system. Many references to data warehousing use this broader context. Thus, an expanded definition for data warehousing includes business intelligence tools, tools to extract, transform, and load data into the repository, and tools to manage and retrieve metadata.
In contrast to data warehouses are operational databases that support day-to-day transaction processing.
History of data warehousing
The concept of data warehousing dates back to the late 1980s when IBM researchers Barry Devlin and Paul Murphy developed the "business data warehouse". In essence, the data warehousing concept was intended to provide an architectural model for the flow of data from operational systems to decision support environments. The concept attempted to address the various problems associated with this flow - mainly, the high costs associated with it. In the absence of a data warehousing architecture, an enormous amount of redundancy was required to support multiple decision support environments. In larger corporations it was typical for multiple decision support environments to operate independently. Each environment served different users but often required much of the same data. The process of gathering, cleaning and integrating data from various sources, usually long existing operational systems (usually referred to as legacy systems), was typically in part replicated for each environment. Moreover, the operational systems were frequently reexamined as new decision support requirements emerged. Often new requirements necessitated gathering, cleaning and integrating new data from the operational systems that were logically related to prior gathered data.
Based on analogies with real-life warehouses, data warehouses were intended as large-scale collection/storage/staging areas for corporate data. Data could be retrieved from one central point or data could be distributed to "retail stores" or "data marts" that were tailored for ready access by users.
Key developments in early years of data warehousing were:
- 1960s - General Mills and Dartmouth College, in a joint research project, develop the terms dimensions and facts.
- 1970s - ACNielsen and IRI provide dimensional data marts for retail sales.
- 1983 - Teradata introduces a database management system specifically designed for decision support.
- 1988 - Barry Devlin and Paul Murphy publish the article An architecture for a business and information systems in IBM Systems Journal where they introduce the term "business data warehouse".
- 1990 - Red Brick Systems introduces Red Brick Warehouse, a database management system specifically for data warehousing.
- 1991 - Prism Solutions introduces Prism Warehouse Manager, software for developing a data warehouse.
- 1991 - Bill Inmon publishes the book Building the Data Warehouse.
- 1995 - The Data Warehousing Institute, a for-profit organization that promotes data warehousing, is founded.
- 1996 - Ralph Kimball publishes the book The Data Warehouse Toolkit.
- 1997 - Oracle 8, with support for star queries, is released.
Normalized versus dimensional approach for storage of data
There are two leading approaches to storing data in a data warehouse - the dimensional approach and the normalized approach.
In the dimensional approach, transaction data are partitioned into either "facts", which are generally numeric transaction data, or "dimensions", which are the reference information that gives context to the facts. For example, a sales transaction can be broken up into facts such as the number of products ordered and the price paid for the products, and into dimensions such as order date, customer name, product number, order ship-to and bill-to locations, and salesperson responsible for receiving the order. A key advantage of a dimensional approach is that the data warehouse is easier for the user to understand and to use. Also, the retrieval of data from the data warehouse tends to operate very quickly. The main disadvantages of the dimensional approach are: 1) In order to maintain the integrity of facts and dimensions, loading the data warehouse with data from different operational systems is complicated, and 2) It is difficult to modify the data warehouse structure if the organization adopting the dimensional approach changes the way in which it does business.
In the normalized approach, the data in the data warehouse are stored following, to a degree, database normalization rules. Tables are grouped together by subject areas that reflect general data categories (e.g., data on customers, products, finance, etc.) The main advantage of this approach is that it is straightforward to add information into the database. A disadvantage of this approach is that, because of the number of tables involved, it can be difficult for users both to 1) join data from different sources into meaningful information and then 2) access the information without a precise understanding of the sources of data and of the data structure of the data warehouse.
These approaches are not mutually exclusive, and of course there are other approaches. Dimensional approaches can involve normalizing data to a degree
The future of data warehousing
Data warehousing, like any technology niche, has a history of innovations that did not receive market acceptance.
A 2009 Gartner Group paper predicted these developments in business intelligence/data warehousing market .
Because of lack of information, processes, and tools, through 2012, more than 35 percent of the top 5,000 global companies will regularly fail to make insightful decisions about significant changes in their business and markets.
By 2012, business units will control at least 40 percent of the total budget for business intelligence.
By 2010, 20 percent of organizations will have an industry-specific analytic application delivered via software as a service as a standard component of their business intelligence portfolio.
In 2009, collaborative decision making will emerge as a new product category that combines social software with business intelligence platform capabilities.
By 2012, one-third of analytic applications applied to business processes will be delivered through coarse-grained application mashups.