Data Warehouse

Data warehouse is a repository of an organization’s electronically stored data. Data warehouses are designed to facilitate reporting and analysis.
Data mining potential can be enhanced if the appropriate data has been collected and stored in a data warehouse. A data warehouse is a relational database management system (RDBMS) designed specifically to meet the needs of transaction processing systems. It can be loosely defined as any centralized data repository which can be queried for business benefit. Data warehousing is a new powerful technique making it possible to extract archived operational data and overcome inconsistencies between different legacy data formats.
A data warehouse houses a standardized, consistent, clean and integrated form of data sourced from various operational systems in use in the organization, structured in a way to specifically address the reporting and analytic requirements.
This definition of data warehouse focuses on data storage. However, the means to retrieve and analyze data, to extract, transform and load data, and to manage the data dictionary are also considered essential components of a data warehousing system. Many references to data warehousing use this broader context. Thus, an expanded definition for data warehousing includes business intelligence tools, tools to extract, transform, and load data into  the repository, and tools to manage and retrieve metadata.
Characteristics of a data warehouse

  • subject-oriented: data are organized according to subject instead of application e.g. an insurance company using a data warehouse would organize their data by customer, premium, and claim, instead of by different products (auto, life etc). The data organized by subject contain only the information necessary for decision support processing.
  • integrated: when data resides in many separate applications in the operational environment, encoding of data is often inconsistent. For instance, in one application gender might be coded as “m” and “f” in another by 0 and 1. When data are moved form the operational environment into the data warehouse, they assume a consistent coding convention eg. gender data is transformed to “m” and “f”.
  • time-variant: The data warehouse contains a place for storing data that are five to 10 years old, or older, to be used for comparisons, trends, and forecasting. These data are not updated.
  • non-volatile: Data are not updated or changed in any way once they enter the data warehouse, but are only loaded and accessed.

Criteria for a data warehouse

  • Load Performance: performance of the load process should be measured in hundreds of millions of rows and gigabytes per hour and must not artificially constrain the volume of data required by the business.
  • Load Processing: Many steps must be taken to load new or updated data into the data warehouse including data conversions, filtering, reformatting, integrity checks, physical storage, indexing, and ,metadata update. There steps must be executed as a single seamless unit of work.
  • Data Quality Management: The shift to fact-based management demands  the highest data quality. The warehouse must ensure local consistency, global consistency, and referential integrity despite “dirty” sources and massive database size.
  • Query Performance: Fact-based management and ad-hoc analysis must not be slower or inhibited by the performance of the data warehouse RDBMS: large, complex queries for key business operations must complete in seconds not days.
  • Terabyte Scalability – Data warehouse sizes are growing at astonishing rates. The RDBMS must not have any architectural limitations. It must support modular and parallel management. It must support continued availability in the event of a point failure and must provide a fundamentally different mechanism for recovery.
  • Mass User Scalability: the RDBMS server must support hundreds, even thousands of concurrent users while maintaining acceptable query performance.
  • Networked Data Warehouse: The server must include tools that coordinate the movement of subsets of data between warehouses.Users must be able to look at and work with multiple warehouses from a single client workstation. Warehouse managers have to manage and administer a network of warehouses from a single physical location.
  • Warehouse Administration: The RDBMS must provide controls for implementing resource limits, chargeback accounting to allocate costs back to users and query prioritization to address the needs of different user classes and activities.
  • Integrated Dimensional Analysis: The power of multidimensional views is widely accepted, and dimensional support must be inherent in the warehouse RDBMS to provide the highest performance for relational OLAP tools. The RDBMS must support fast,easy creation of precomputed summaries common in large data warehouses.
  • Advanced Query functionality: The RDBMS must provide a complete set of analytical operations including core sequential and statistical operations.

Conceptualization of a data warehouse architecture consists of the following interconnected layers:
Operational database layer:http://pratikshya.com.np/wp-admin/post-new.php
the source data layer for the data warehouse- an organization’s enterprise resource planning systems fall into this layer
Data Access layer:
The interface between the operational and informational access layer- Tools to extract, transform, load data into the warehouse fall into this layer
Metadata layer:
THe data dictionary- This is usually more detailed than an operational sysem data directory. There are dictionaries for the entire warehouse and sometimes dictionaries for the data that can be accessed by a particular reporting and analysis tool.
Informational access layer:
The data accessed for reporting and analysing and the tools for reporting and analyzing data- business intelligence tools fall into this layer.
Some of the Benefits that a data warehouse provides are as follows:
A data warehouse provides a common data model for all data of interest regardless of the data’s source. This makes it easier to report and analyse information than it would be if multiple data models were used to retrieve information such as sales invoices, order receipts, general ledger charges, etc.

  • Prior to loading data into the data warehouse, inconsistencies are identified and resolved. This greatly simplifies reporting and analysis
  • information in the data warehouse is under the control of data warehouse users so that, even if the source system data is purged over time, the information in the warehouse can be stored safely for extended periods of time
  • Because they are separate form operational systems, data warehouses provides retrieval of data without slowing down operational system
  • Data warehouses can work in conjunction  with and hence, enhance the value of operational business applications, notably customer relationship management (CRM) systems
  • Data warehouses facilitates decision support system applications such as trend reports (eg. the items with the most sales in a particular area within the last two years), exception reports, and reports that show actual performance versus goals.

Disadvantages

  • Data warehouses are not the optimal environment for unstructured data
  • Because data must be extracted, transformed and loaded into the warehouse, there is an element of latency in data warehouse data
  • Over their life, data warehouses can have high cost
  • Data warehouses can get outdated relatively quickly. There is a cost of delivering suboptimal information to organization
  • There is often a fine line between data warehouses and operational systems. Duplicate, expensive functionality may be developed. Or, functionality may be developed in the data warehouse that, in retrospect, should have been developed in the operational systems.

Sample Applications

  • Decision Support
  • Trend analysis
  • Financial forecasting
  • Churn prediction for telecom subscribers, credit card users etc
  • insurance fraud analysis
  • call record analysis
  • Logistics and inventory management
  • Agriculture

Leave a Reply

Your email address will not be published. Required fields are marked *