A mapplet is a
reusable object that we create in the Mapplet Designer.
It contains a set of
transformations and lets us reuse that transformation logic in multiple
mappings.
Mapplets are created in the Mapplet Designer in the Designer tool.
Suppose we need to use the same set of 5 transformations in, say, 10 mappings. Instead of building those 5 transformations in each of the 10 mappings, we create a mapplet of these 5 transformations and then use the mapplet in all 10 mappings. Example: to create a surrogate key in the target, we create a mapplet that uses a stored procedure to generate the primary key for the target table. We give the target table name and key column name as input to the mapplet and get the surrogate key as output.
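For illustration only, a minimal sketch of such a stored procedure in Oracle PL/SQL might look like the following; the function name GET_SURROGATE_KEY and its max-plus-one logic are assumptions, not part of the original example:

CREATE OR REPLACE FUNCTION GET_SURROGATE_KEY
    (p_table_name IN VARCHAR2,
     p_key_column IN VARCHAR2)
RETURN NUMBER
IS
    v_next_key NUMBER;
BEGIN
    -- Look up the current maximum key value in the given column and add 1.
    EXECUTE IMMEDIATE
        'SELECT NVL(MAX(' || p_key_column || '), 0) + 1 FROM ' || p_table_name
        INTO v_next_key;
    RETURN v_next_key;
END GET_SURROGATE_KEY;

In the mapplet, a Stored Procedure transformation would call this function and expose the returned value as the surrogate key output port.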
Mapplets help
simplify mappings in the following ways:
Include source definitions: Use multiple source definitions and source qualifiers to provide source data for a mapping.
Accept data from sources in a mapping.
Include multiple transformations: As many transformations as we need.
Pass data to multiple transformations: We can create a mapplet to feed data to multiple transformations. Each Output transformation in a mapplet represents one output group in a mapplet.
Contain unused ports: We do not have to connect all mapplet input and output ports in a mapping.
Mapplet Input:
Mapplet input can originate from a source definition and/or from an Input transformation in the mapplet. We can create multiple pipelines in a mapplet.
We use the Mapplet Input transformation to give input to the mapplet. Use of the Mapplet Input transformation is optional.
Mapplet Output:
The output of a mapplet is not connected to any target table.
We must use a Mapplet Output transformation to pass the mapplet output.
A mapplet must contain at least one Output transformation with at least one connected port in the mapplet.
Example 1: We will join the EMP and DEPT tables, then calculate total salary, and give the output to a Mapplet Output transformation.
· EMP and DEPT will be the source tables.
· The output will be given to the transformation Mapplet_Out.
Steps:
· Open the folder where we want to create the mapping.
· Click Tools -> Mapplet Designer.
· Click Mapplets -> Create -> Give name. Ex: mplt_example1
· Drag the EMP and DEPT tables in.
· Use a Joiner transformation, as described earlier, to join them.
· Transformation -> Create -> Select Expression from list -> Create -> Done
· Pass all ports from the Joiner to the Expression and then calculate total salary as described in the Expression transformation (a sample expression follows these steps).
· Now Transformation -> Create -> Select Mapplet Out from list -> Create -> Give name and then Done.
· Pass all ports from the Expression to the Mapplet Output.
· Mapplet -> Validate
· Repository -> Save
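As a minimal sketch of the total-salary calculation, the Expression transformation could carry an output port named TOTAL_SAL (the port name is an assumption) with an expression along these lines:

TOTAL_SAL = SAL + IIF(ISNULL(COMM), 0, COMM)

IIF and ISNULL guard against employees with no commission, so a NULL COMM does not make the total NULL.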
Use of mapplet in mapping:
We can use a mapplet in a mapping simply by dragging the mapplet from the Mapplets folder in the left pane, just as we drag source and target tables.
When we use the mapplet in a mapping, the mapplet object displays only the ports from the Input and Output transformations. These are referred to as the mapplet input and mapplet output ports.
Make sure to give the correct connection information in the session.
Making a mapping:
We will use mplt_example1 and then create a Filter transformation to filter records whose Total Salary is >= 1500.
· mplt_example1 will be the source.
· Create a target table with the same structure as the Mapplet_Out transformation, as in the picture above.
Creating Mapping:
· Open the folder where we want to create the mapping.
· Click Tools -> Mapping Designer.
· Click Mapping -> Create -> Give name. Ex: m_mplt_example1
· Drag mplt_example1 and the target table in.
· Transformation -> Create -> Select Filter from list -> Create -> Done.
· Drag all ports from mplt_example1 to the Filter and give the filter condition (a sample condition follows these steps).
· Connect all ports from the Filter to the target. We can add more transformations after the Filter if needed.
· Validate the mapping and save it.
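Assuming the total-salary port coming out of the mapplet is named TOTAL_SAL (a placeholder name), the Filter transformation condition would simply be:

TOTAL_SAL >= 1500

Rows for which this condition evaluates to FALSE are dropped by the Filter transformation.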
Make the session and workflow. Give connection information for the mapplet source tables and for the target table. Run the workflow and see the result.
Indirect Loading For Flat Files
Suppose you have 10 flat files of the same structure: all the flat files have the same number of columns and the same data types. Now we need to load all 10 files into the same target. The file names are, say, EMP1, EMP2, and so on.
Solution1:
1. Import one flat file definition and make the mapping as per need.
2. Now in the session, give the Source Filename and Source File Directory location of one file.
3. Make the workflow and run it.
4. After the workflow completes, open the session and change the filename and directory to point to the second file. Run the workflow again.
5. Do the above for all 10 files.
Solution2:
1. Import one flat file definition and make the mapping as per need.
2. Now in the session, give the Source File Directory location of the files.
3. In the Source Filename field, use $InputFileName. This is a session parameter.
4. Now make a parameter file and give the value of $InputFileName:
$InputFileName=EMP1.txt
5. Run the workflow.
6. Now edit the parameter file and give the value of the second file. Run the workflow again.
7. Do the same for the remaining files.
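For reference, a parameter file normally also names the folder, workflow, and session that the parameter applies to in a bracketed header. A minimal sketch, assuming the folder, workflow, and session are named MyFolder, wf_load_emp, and s_m_load_emp (all three names are placeholders):

[MyFolder.WF:wf_load_emp.ST:s_m_load_emp]
$InputFileName=EMP1.txt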
Solution3:
1. Import one flat file definition and make the mapping as per need.
2. Now make a notepad file that contains the location and name of each of the 10 flat files.
Sample:
D:\EMP1.txt
E:\EMP2.txt
E:\FILES\DWH\EMP3.txt
and so on
3. Now make a session and, in the Source Filename and Source File Directory location fields, give the name and location of the list file created above.
4. In the Source Filetype field, select Indirect.
5. Click Apply.
6. Validate the session.
7. Make the workflow, save it to the repository, and run it.
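Assuming the list file from step 2 is saved as D:\emp_file_list.txt (name and location are placeholders), the relevant session source properties would be set roughly as follows:

Source filetype       : Indirect
Source file directory : D:\
Source filename       : emp_file_list.txt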
Incremental Aggregation
When we enable the session option Incremental Aggregation, the Integration Service performs incremental aggregation: it passes source data through the mapping and uses historical cache data to perform aggregation calculations incrementally.
When using
incremental aggregation, you apply captured changes in the source to aggregate
calculations in a session. If the source changes incrementally and you can
capture changes, you can configure the session to process those changes. This
allows the Integration Service to update the target incrementally, rather than
forcing it to process the entire source and recalculate the same data each time
you run the session.
For example, you
might have a session using a source that receives new data every day. You can
capture those incremental changes because you have added a filter condition to
the mapping that removes pre-existing data from the flow of data. You then
enable incremental aggregation.
When the session runs
with incremental aggregation enabled for the first time on March 1, you use the
entire source. This allows the Integration Service to read and store the
necessary aggregate data. On March 2, when you run the session again, you
filter out all the records except those time-stamped March 2. The Integration
Service then processes the new data and updates the target accordingly.
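As a sketch of such a capture filter, assuming the source rows carry a load timestamp column named LOAD_DATE (a placeholder column name), a Filter transformation condition that keeps only rows stamped with the current run date could be:

TRUNC(LOAD_DATE) = TRUNC(SESSSTARTTIME)

Here SESSSTARTTIME is the built-in variable holding the session start time, and TRUNC without a format argument truncates a date to the day.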
Consider using incremental aggregation in the following circumstances:
You can capture new
source data. Use incremental aggregation when you can capture new source
data each time you run the session. Use a Stored Procedure or Filter
transformation to process new data.
Incremental changes
do not significantly change the target. Use incremental aggregation when
the changes do not significantly change the target. If processing the
incrementally changed source alters more than half the existing target, the
session may not benefit from using incremental aggregation. In this case, drop
the table and recreate the target with complete source data.
Note: Do not use
incremental aggregation if the mapping contains percentile or median functions.
The Integration Service uses system memory to process these functions in
addition to the cache memory you configure in the session properties. As a
result, the Integration Service does not store incremental aggregation values
for percentile and median functions in disk caches.
Integration Service
Processing for Incremental Aggregation
(i) The first time you
run an incremental aggregation session, the Integration Service processes the
entire source. At the end of the session, the Integration Service stores
aggregate data from that session run in two files, the index file and the data
file. The Integration Service creates the files in the cache directory
specified in the Aggregator transformation properties.
(ii) Each subsequent
time you run the session with incremental aggregation, you use the incremental
source changes in the session. For each input record, the Integration Service checks
historical information in the index file for a corresponding group. If it finds
a corresponding group, the Integration Service performs the aggregate operation
incrementally, using the aggregate data for that group, and saves the
incremental change. If it does not find a corresponding group, the Integration
Service creates a new group and saves the record data.
(iii) When writing to
the target, the Integration Service applies the changes to the existing target.
It saves modified aggregate data in the index and data files to be used as
historical data the next time you run the session.
(iv) If the source
changes significantly and you want the Integration Service to continue saving
aggregate data for future incremental changes, configure the Integration Service
to overwrite existing aggregate data with new aggregate data.
Each subsequent time
you run a session with incremental aggregation, the Integration Service creates
a backup of the incremental aggregation files. The cache directory for the
Aggregator transformation must contain enough disk space for two sets of the
files.
(v) When you partition
a session that uses incremental aggregation, the Integration Service creates
one set of cache files for each partition.
The Integration
Service creates new aggregate data, instead of using historical data, when you
perform one of the following tasks:
· Save a new version of the mapping.
· Configure the session to reinitialize the aggregate cache.
· Move the aggregate files without correcting the configured path or directory for the files in the session properties.
· Change the configured path or directory for the aggregate files without moving the files to the new location.
· Delete cache files.
· Decrease the number of partitions.
When the Integration Service rebuilds incremental aggregation files, the data in the previous files is lost.
Note: To protect
the incremental aggregation files from file corruption or disk failure,
periodically back up the files.
Preparing for
Incremental Aggregation:
When you use
incremental aggregation, you need to configure both mapping and session
properties:
· Implement mapping logic or a filter to remove pre-existing data.
· Configure the session for incremental aggregation and verify that the file directory has enough disk space for the aggregate files.
Configuring the
Mapping
Before enabling incremental
aggregation, you must capture changes in source data. You can use a Filter or
Stored Procedure transformation in the mapping to remove pre-existing source
data during a session.
Configuring the
Session
Use the following
guidelines when you configure the session for incremental aggregation:
(i) Verify the
location where you want to store the aggregate files.
The index and data
files grow in proportion to the source data. Be sure the cache directory has
enough disk space to store historical data for the session.
When you run multiple
sessions with incremental aggregation, decide where you want the files stored.
Then, enter the appropriate directory for the process variable, $PMCacheDir, in
the Workflow Manager. You can enter session-specific directories for the index
and data files. However, by using the process variable for all sessions using
incremental aggregation, you can easily change the cache directory when
necessary by changing $PMCacheDir.
Changing the cache
directory without moving the files causes the Integration Service to
reinitialize the aggregate cache and gather new aggregate data.
In a grid,
Integration Services rebuild incremental aggregation files they cannot find.
When an Integration Service rebuilds incremental aggregation files, it loses
aggregate history.
(ii) Verify the
incremental aggregation settings in the session properties.
You can configure the
session for incremental aggregation in the Performance settings on the
Properties tab.
You can also
configure the session to reinitialize the aggregate cache. If you choose to
reinitialize the cache, the Workflow Manager displays a warning indicating the
Integration Service overwrites the existing cache and a reminder to clear this
option after running the session.
Scenarios for choosing hash partitioning:
• Not enough knowledge about how much data maps into a given range.
• Sizes of range partitions differ quite substantially, or are difficult to balance manually.
• Range partitioning would cause data to be clustered undesirably.
• Features such as parallel DML, partition pruning, joins, etc. are important.
You Can Define the Following Partition Types in the Workflow Manager:
1) Database Partitioning
The Integration Service queries the IBM DB2 or Oracle system for table partition information. It reads partitioned data from the corresponding nodes in the database. Use database partitioning with Oracle or IBM DB2 source instances on a multi-node tablespace. Use database partitioning with DB2 targets.
2) Hash Partitioning
Use hash partitioning when you want the Integration Service to distribute rows to the partitions by group. For example, you need to sort items by item ID, but you do not know how many items have a particular ID number.
3) Key Range
You specify one or more ports to form a compound partition key. The Integration Service passes data to each partition depending on the ranges you specify for each port. Use key range partitioning where the sources or targets in the pipeline are partitioned by key range.
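As an illustrative sketch (the port name and range values are assumptions), a key range configuration on an ITEM_ID port across three partitions might look like:

Partition key : ITEM_ID
Partition 1   : Start Range = 1     End Range = 1000
Partition 2   : Start Range = 1000  End Range = 2000
Partition 3   : Start Range = 2000  End Range = (blank, open-ended)

Rows whose ITEM_ID falls within a partition's range are routed to that partition.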
4) Simple Pass-Through
The Integration Service passes all rows at one partition point to the next partition point without redistributing them. Choose pass-through partitioning where you want to create an additional pipeline stage to improve performance, but do not want to change the distribution of data across partitions.
5) Round-Robin
The Integration Service distributes data evenly among all partitions. Use round-robin partitioning where you want each partition to process approximately the same number of rows.