EDW (Enterprise datawarehousing)
1- It
provides a central database for decision support throughout the enterprise
2- It
is a collection of DATAMARTS
DATAMART
1- It
is a subset of Datawarehousing
2- It
is a subject oriented database which supports the needs of
individuals depts. in an organizations
3- It is
called high performance query structure
4- It supports particular line of business like sales, marketing etc..
ODS (Operational data store)
5- It is
defined as an integrated view of operational database designed to support
operational monitoring
6- It
is a collection of operational data sources designed to support Transaction
processing
7- Data is
refreshed near real-time and used for business activity
8- It
is an intermediate between the OLTP and OLAP which helps to create an
instance reports
What are the types of
Approach in DWH?
Bottom up approach: first we need to develop data mart then we integrate these data
mart into EDW
Top down approach: first we need to develop EDW then form that EDW we develop
data mart
Bottom up
OLTP
ETL
Data mart
DWH OLAP
Top down
OLTP
ETL
DWH
Data mart
OLAP
Top down
1- Cost of
initial planning & design is high
2- Takes
longer duration of more than an year
Bottom up
1- Planning
& Designing the Data Marts without waiting for the Global warehouse design
2- Immediate
results from the data marts
3- Tends to
take less time to implement
4- Errors
in critical modules are detected earlier.
5- Benefits
are realized in the early phases.
6- It is a
Best Approach
Data Modeling Types:
1 Conceptual
Data Modeling
2 Logical
Data Modeling
3 Physical
Data Modeling
4 Dimensional
Data Modeling
1.
Conceptual Data
Modeling
1- Conceptual
data model includes all major entities and relationships and does not contain
much detailed level of information about attributes and is often used in the
INITIAL PLANNING PHASE
2- Conceptual
data model is created by gathering business requirements from various sources
like business documents, discussion with functional teams, business analysts,
smart management experts and end users who do the reporting on the database.
Data modelers create conceptual data model and forward that model to functional
team for their review.
3- Conceptual
data modeling gives an idea to the functional and technical team about how
business requirements would be projected in the logical data model.
2.
Logical Data Modeling
1- This is the
actual implementation and extension of a conceptual data model. Logical
data model includes all required entities, attributes, key groups, and
relationships that represent business information and define business
rules.
3.
Physical Data Modeling
1-
Physical data model
includes all required tables,
columns, relationships, database properties for
the physical implementation of databases. Database performance, indexing
strategy, physical storage and demoralization are important parameters of a
physical model.
Logical vs. Physical Data Modeling
Logical Data Model
|
Physical Data Model
|
Represents business information and defines business rules
|
Represents the physical implementation of the model in a
database.
|
Entity
|
Table
|
Attribute
|
Column
|
Primary Key
|
Primary Key Constraint
|
Alternate Key
|
Unique Constraint or Unique Index
|
Inversion Key Entry
|
Non Unique Index
|
Rule
|
Check Constraint, Default Value
|
Relationship
|
Foreign Key
|
Definition
|
Comment
|
Dimensional Data Modeling
1- Dimension
model consists of fact and dimension tables
2-
It is an approach to
develop the schema DB designs
Types of Dimensional modeling
1- Star
schema
ü Snow
flake schema
ü Star
flake schema (or) Hybrid schema
ü Multi
star schema
ü The
Star Schema Logical database design which contains a centrally located fact
table surrounded by at least one or more dimension tables
ü Since
the database design looks like a star, hence it is called star schema db
ü The
Dimension table contains Primary keys and the textual descriptions
ü It
contain de-normalized business information
ü A
Fact table contains a composite key and measures
ü The
measure are of types of key performance indicators which are
used to evaluate the enterprise performance in the form of success and failure
ü Eg:
Total revenue , Product sale , Discount given, no of customers
ü To
generate meaningful report the report should contain at least one dimension and
one fact table
The advantage of star
schema
ü Less
number of joins
ü Improve
query performance
ü Slicing
down
ü Easy
understanding of data.
Disadvantage:
ü Require
more storage space
Example of Star Schema:
Snowflake Schema
ü In
star schema, If the dimension tables are spitted into one or more dimension
tables
ü The
de-normalized dimension tables are spitted into a normalized dimension table
Example of Snowflake Schema:
ü In
Snowflake schema, the example diagram shown below has 4 dimension tables, 4
lookup tables and 1 fact table. The reason is that hierarchies (category,
branch, state, and month) are being broken out of the dimension tables
(PRODUCT, ORGANIZATION, LOCATION, and TIME) respectively and separately.
ü It increases
the number of joins and poor performance in retrieval of data.
ü In
few organizations, they try to normalize the dimension tables to save space.
ü Since
dimension tables hold less space snow flake schema approach may be avoided.
ü Bit
map indexes cannot be effectively utilized
Important aspects of Star Schema & Snow
Flake Schema
ü In a
star schema every dimension will have a primary key.
ü In a
star schema, a dimension table will not have any parent table.
ü Whereas
in a snow flake schema, a dimension table will have one or more parent tables.
ü Hierarchies
for the dimensions are stored in the dimensional table itself in star schema.
ü Whereas
hierarchies are broken into separate tables in snow flake schema. These
hierarchies help to drill down the data from topmost hierarchies to the
lowermost hierarchies.
Star flake schema (or) Hybrid Schema
ü Hybrid
schema is a combination of Star and Snowflake schema
Multi Star schema
ü Multiple
fact tables sharing a set of dimension tables
ü Confirmed
Dimensions are nothing but Reusable Dimensions.
ü The
dimensions which u r using multiple times or in multiple data marts.
ü Those
are common in different data marts
Measure Types (or) Types of Facts
- Additive -
Measures that can be summed up across all dimensions.
- Ex:
Sales Revenue
- Semi
Additive - Measures that can be
summed up across few dimensions and not with others
- Ex:
Current Balance
- Non
Additive - Measures that cannot be
summed up across any of the dimensions.
- Ex:
Student attendance
Surrogate Key
ü Joins
between fact and dimension tables should be based on surrogate keys
ü Users
should not obtain any information by looking at these keys
ü These
keys should be simple integers
A sample data warehouse schema
Why need staging area for DWH?
ü Staging
area needs to clean operational data before loading into data warehouse.
ü Cleaning
in the sense your merging data which comes from different source.
ü It’s
the area where most of the ETL is done
Data Cleansing
ü It is
used to remove duplications
ü It is
used to correct wrong email addresses
ü It is
used to identify missing data
ü It
used to convert the data types
ü It is
used to capitalize name & addresses.
Types of Dimensions:
There are three types of Dimensions
ü Confirmed
Dimensions
ü Junk
Dimensions Garbage Dimension
ü Degenerative
Dimensions
Slowly
changing Dimensions
Garbage Dimension or Junk Dimension
ü Confirmed
is something which can be shared by multiple Fact Tables or multiple Data
Marts.
ü Junk
Dimensions is grouping flagged values
ü Degenerative
Dimension is something dimensional in nature but exist fact table.(Invoice No)
Which is neither fact nor
strictly dimension attributes.
These are useful for some kind of analysis.
These are kept as attributes in fact table called degenerated dimension
Degenerate dimension: A column of the key section of the fact table that does
not have the associated dimension table but used for reporting and
analysis, such column is called degenerate dimension or line item dimension.
For ex,
we have a
fact table with customer_id, product_id, branch_id, employee_id, bill_no, and
date in key section and price, quantity, amount in measure section. In this
fact table, bill_no from key section is a single value; it has no associated
dimension table. Instead of creating a
Separate dimension table for that single value,
we can Include it in fact table to improve performance. SO here the column,
bill_no is a degenerate dimension or line item dimension.
Interesting blog, here a lot of valuable information is available, it is very useful information Keep do posting i like to follow this informatica online training
ReplyDeleteinformatica online course
informatica bdm training
informatica developer training
informatica training
informatica course
informatica axon training
Glad to chat your blog, I seem to be forward to more reliable articles and I think we all wish to thank so many good articles, blog to share with us. Flat for sale Slough
ReplyDeleteAfter reading your article I was amazed. I know that you explain it very well. And I hope that other readers will also experience how I feel after reading your article. Flat for sale Slough
ReplyDeleteThanks for sharing valuable information and very well explained. Keep posting.
ReplyDeleteonline etl testing training
etl testing online training