Informatica Interview Questions Part 12 - ETL - Informatica Developer

Tuesday, June 25, 2019


Mapplet

A mapplet is a reusable object that we create in the Mapplet Designer.
It contains a set of transformations and lets us reuse that transformation logic in multiple mappings.


Mapplets are created in the Mapplet Designer within the Designer tool.

Suppose we need to use the same set of 5 transformations in, say, 10 mappings. Instead of building those 5 transformations in each of the 10 mappings, we create a mapplet containing them and reuse it in all 10 mappings. Example: creating a surrogate key in the target. We create a mapplet that uses a stored procedure to generate the primary key for the target table. We pass the target table name and key column name as input to the mapplet and get the surrogate key as output.
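As an illustration only (the procedure and port names below are hypothetical), the Expression transformation inside such a mapplet could call an unconnected Stored Procedure transformation and capture its return value with the PROC_RESULT keyword:

Sample:
SURROGATE_KEY = :SP.SP_GET_NEXT_KEY(IN_TABLE_NAME, IN_KEY_COLUMN, PROC_RESULT) -- SP_GET_NEXT_KEY is a hypothetical procedure returning the next key value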


Mapplets help simplify mappings in the following ways:

Include source definitions: Use multiple source definitions and source qualifiers to provide source data for a mapping.
Accept data from sources in a mapping: Use an Input transformation to accept data from sources in a mapping.
Include multiple transformations: A mapplet can contain as many transformations as we need.
Pass data to multiple transformations: We can create a mapplet to feed data to multiple transformations. Each Output transformation in a mapplet represents one output group in a mapplet.
Contain unused ports: We do not have to connect all mapplet input and output ports in a mapping.


Mapplet Input:

Mapplet input can originate from a source definition and/or from an Input transformation in the mapplet. We can create multiple pipelines in a mapplet.
We use the Input transformation to pass data from the mapping into the mapplet.
Using an Input transformation is optional; the mapplet can instead read directly from a source definition inside it.


Mapplet Output:

The output of a mapplet is not connected to any target table.
We must use an Output transformation to pass mapplet output to the mapping.
A mapplet must contain at least one Output transformation with at least one connected port in the mapplet.

Example 1: We will join the EMP and DEPT tables, then calculate the total salary and give the output to a mapplet Output transformation.
· EMP and DEPT will be the source tables.
· The output will be given to the Mapplet_Out transformation.

Steps:

·         Open folder where we want to create the mapping.
·         Click Tools -> Mapplet Designer.
·         Click Mapplets-> Create-> Give name. Ex: mplt_example1
·         Drag in the EMP and DEPT tables.
·         Use a Joiner transformation, as described earlier, to join them.
·         Transformation -> Create -> Select Expression from the list -> Create -> Done
·         Pass all ports from the Joiner to the Expression and calculate the total salary as described in the Expression transformation (see the sample settings after these steps).
·         Now Transformation -> Create -> Select Mapplet Out from the list -> Create -> Give name and then Done.
·         Pass all ports from the Expression to the Mapplet output.
·         Mapplet -> Validate
·         Repository -> Save
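For reference, a minimal sketch of the two key settings in this mapplet; the ports assume the standard EMP/DEPT sample tables, and treating a null COMM as zero is just one common choice, not the only one:

Sample:
Joiner condition: DEPTNO1 = DEPTNO -- the duplicate DEPT port is typically renamed DEPTNO1 by the Joiner
Expression output port: TOTAL_SAL = SAL + IIF(ISNULL(COMM), 0, COMM) -- null commission counted as 0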

Use of mapplet in mapping:

We can use a mapplet in a mapping by simply dragging it from the Mapplets folder in the left pane, just as we drag source and target tables.
When we use the mapplet in a mapping, the mapplet object displays only the ports from the Input and Output transformations. These are referred to as the mapplet input and mapplet output ports.
Make sure to give the correct connection information in the session.

Making a mapping:

We will use mplt_example1 and then create a Filter transformation to pass only records whose Total Salary is >= 1500.
·         mplt_example1 will be the source.
·         Create a target table with the same structure as the Mapplet_Out transformation.
Creating the mapping:
·         Open the folder where we want to create the mapping.
·         Click Tools -> Mapping Designer.
·         Click Mapping-> Create-> Give name. Ex: m_mplt_example1
·         Drag in mplt_example1 and the target table.
·         Transformation -> Create -> Select Filter from the list -> Create -> Done.
·         Drag all ports from mplt_example1 to the Filter and give the filter condition (see the sample condition after these steps).
·         Connect all ports from the Filter to the target. We can add more transformations after the Filter if needed.
·         Validate mapping and Save it.
·         Make the session and workflow.
·         Give connection information for the mapplet source tables.
·         Give connection information for the target table.
·         Run the workflow and see the result.
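Assuming the total-salary port coming out of mplt_example1 is named TOTAL_SAL, the filter condition for this example would simply be:

Sample:
TOTAL_SAL >= 1500 -- only rows satisfying this condition pass through to the target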



Indirect Loading For Flat Files
Suppose you have 10 flat files with the same structure: all the files have the same columns and data types. We need to load all 10 files into the same target.
The files are named, say, EMP1, EMP2, and so on.

Solution 1:

1. Import one flat file definition and make the mapping as needed.
2. In the session, give the Source Filename and Source File Directory of the first file.
3. Make the workflow and run it.
4. After the workflow completes, open the session again and change the filename and directory to point to the second file. Run the workflow again.
5. Repeat the above for all 10 files.

Solution 2:


1. Import one flat file definition and make the mapping as needed.
2. In the session, give the Source File Directory of the files.
3. In the Source Filename field, use $InputFileName. This is a session parameter.
4. Make a parameter file and give the value of $InputFileName (see the sample parameter file after these steps):
$InputFileName=EMP1.txt
5. Run the workflow.
6. Edit the parameter file and give the value for the second file. Run the workflow again.
7. Do the same for the remaining files.
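A minimal parameter file for this solution might look like the sample below; the folder, workflow, and session names (MyFolder, wf_load_emp, s_m_load_emp) are placeholders for your own object names. Before the next run, change the value to EMP2.txt, and so on.

Sample:
[MyFolder.WF:wf_load_emp.ST:s_m_load_emp]
$InputFileName=EMP1.txt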

Solution 3:


1. Import one flat file definition and make the mapping as needed.
2. Create a text file (a file list) that contains the location and name of each of the 10 flat files.


Sample:
D:\EMP1.txt
E:\EMP2.txt
E:\FILES\DWH\EMP3.txt
and so on for the remaining files.
3. Create a session and, in the Source Filename and Source File Directory fields, give the name and location of the file list created above.
4. In the Source Filetype field, select Indirect.
5. Click Apply.
6. Validate the session.
7. Make the workflow, save it to the repository, and run it.



Incremental Aggregation


When we enable the session option Incremental Aggregation, the Integration Service performs incremental aggregation: it passes new source data through the mapping and uses historical cache data to perform the aggregation calculations incrementally.
When using incremental aggregation, you apply captured changes in the source to aggregate calculations in a session. If the source changes incrementally and you can capture changes, you can configure the session to process those changes. This allows the Integration Service to update the target incrementally, rather than forcing it to process the entire source and recalculate the same data each time you run the session.
For example, you might have a session using a source that receives new data every day. You can capture those incremental changes because you have added a filter condition to the mapping that removes pre-existing data from the flow of data. You then enable incremental aggregation.
When the session runs with incremental aggregation enabled for the first time on March 1, you use the entire source. This allows the Integration Service to read and store the necessary aggregate data. On March 2, when you run the session again, you filter out all records except those time-stamped March 2. The Integration Service then processes the new data and updates the target accordingly.
Consider using incremental aggregation in the following circumstances:
You can capture new source data. Use incremental aggregation when you can capture new source data each time you run the session. Use a Stored Procedure or Filter transformation to process new data.
Incremental changes do not significantly change the target. Use incremental aggregation when the changes do not significantly change the target. If processing the incrementally changed source alters more than half the existing target, the session may not benefit from using incremental aggregation. In this case, drop the table and recreate the target with complete source data.
Note: Do not use incremental aggregation if the mapping contains percentile or median functions. The Integration Service uses system memory to process these functions in addition to the cache memory you configure in the session properties. As a result, the Integration Service does not store incremental aggregation values for percentile and median functions in disk caches.


Integration Service Processing for Incremental Aggregation


(i) The first time you run an incremental aggregation session, the Integration Service processes the entire source. At the end of the session, the Integration Service stores aggregate data from that session run in two files, the index file and the data file. The Integration Service creates the files in the cache directory specified in the Aggregator transformation properties.
(ii) Each subsequent time you run the session with incremental aggregation, you use the incremental source changes in the session. For each input record, the Integration Service checks historical information in the index file for a corresponding group. If it finds a corresponding group, the Integration Service performs the aggregate operation incrementally, using the aggregate data for that group, and saves the incremental change. If it does not find a corresponding group, the Integration Service creates a new group and saves the record data.
(iii) When writing to the target, the Integration Service applies the changes to the existing target. It saves modified aggregate data in the index and data files to be used as historical data the next time you run the session.
(iv) If the source changes significantly and you want the Integration Service to continue saving aggregate data for future incremental changes, configure the Integration Service to overwrite existing aggregate data with new aggregate data.
Each subsequent time you run a session with incremental aggregation, the Integration Service creates a backup of the incremental aggregation files. The cache directory for the Aggregator transformation must contain enough disk space for two sets of the files.
(v) When you partition a session that uses incremental aggregation, the Integration Service creates one set of cache files for each partition.
The Integration Service creates new aggregate data, instead of using historical data, when you perform one of the following tasks:
·         Save a new version of the mapping.
·         Configure the session to reinitialize the aggregate cache.
·         Move the aggregate files without correcting the configured path or directory for the files in the session properties.
·         Change the configured path or directory for the aggregate files without moving the files to the new location.
·         Delete cache files.
·         Decrease the number of partitions.
When the Integration Service rebuilds incremental aggregation files, the data in the previous files is lost.
Note: To protect the incremental aggregation files from file corruption or disk failure, periodically back up the files.



Preparing for Incremental Aggregation:


When you use incremental aggregation, you need to configure both mapping and session properties:
·         Implement mapping logic or filter to remove pre-existing data.
·         Configure the session for incremental aggregation and verify that the file directory has enough disk space for the aggregate files.


Configuring the Mapping
Before enabling incremental aggregation, you must capture changes in source data. You can use a Filter or Stored Procedure transformation in the mapping to remove pre-existing source data during a session.
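As a rough sketch, assuming the source rows carry a load date in a column named LOAD_DATE (a hypothetical column), a Filter transformation condition like the one below keeps only the rows stamped with the current session date. SESSSTARTTIME is a built-in variable, and TRUNC(date, 'DD') truncates a date to the day.

Sample:
TRUNC(LOAD_DATE, 'DD') = TRUNC(SESSSTARTTIME, 'DD') -- pass only rows loaded on the session start date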


Configuring the Session
Use the following guidelines when you configure the session for incremental aggregation:


(i) Verify the location where you want to store the aggregate files.


The index and data files grow in proportion to the source data. Be sure the cache directory has enough disk space to store historical data for the session.
When you run multiple sessions with incremental aggregation, decide where you want the files stored. Then, enter the appropriate directory for the process variable, $PMCacheDir, in the Workflow Manager. You can enter session-specific directories for the index and data files. However, by using the process variable for all sessions using incremental aggregation, you can easily change the cache directory when necessary by changing $PMCacheDir.
Changing the cache directory without moving the files causes the Integration Service to reinitialize the aggregate cache and gather new aggregate data.
In a grid, Integration Services rebuild incremental aggregation files they cannot find. When an Integration Service rebuilds incremental aggregation files, it loses aggregate history.


(ii) Verify the incremental aggregation settings in the session properties.


You can configure the session for incremental aggregation in the Performance settings on the Properties tab.
You can also configure the session to reinitialize the aggregate cache. If you choose to reinitialize the cache, the Workflow Manager displays a warning indicating the Integration Service overwrites the existing cache and a reminder to clear this option after running the session.







Scenarios for choosing hash partitioning:
• There is not enough knowledge about how much data maps into a given range.
• The sizes of range partitions differ substantially, or are difficult to balance manually.
• Range partitioning would cause data to be clustered undesirably.
• Features such as parallel DML, partition pruning, and joins are important.
You can define the following partition types in the Workflow Manager:


1) Database Partitioning

The Integration Service queries the IBM DB2 or Oracle system for table partition information and reads partitioned data from the corresponding nodes in the database. Use database partitioning with Oracle or IBM DB2 source instances on a multi-node tablespace, and with DB2 targets.


2) Hash Partitioning

Use hash partitioning when you want the Integration Service to distribute rows to the partitions by group. For example, you need to sort items by item ID, but you do not know how many items have a particular ID number.


3) Key Range

You specify one or more ports to form a compound partition key. The Integration Service passes data to each partition depending on the ranges you specify for each port. Use key range partitioning where the sources or targets in the pipeline are partitioned by key range.
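For example (hypothetical port and values), a key range partition on an ITEM_ID port with two partitions could be configured as:

Sample:
Partition #1: ITEM_ID start range 1, end range 1000
Partition #2: ITEM_ID start range 1001, end range 2000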


4) Simple Pass-Through

The Integration Service passes all rows at one partition point to the next partition point without redistributing them. Choose pass-through partitioning where you want to create an additional pipeline stage to improve performance but do not want to change the distribution of data across partitions.


5) Round-Robin


The Integration Service distributes data evenly among all partitions. Use round-robin partitioning where you want each partition to process approximately the same number of rows.
