Pages

Wednesday 9 September 2015

Data Warehousing From the Front Lines(Pravat)

Introduction:
I know that the education system in India still follows once in a 10 year syllabus appraisal system and students still think that computer industry means C, C++, Java and .Net (in some colleges, in fact they still teach COBOL.
So ultimately they fall in love with these languages and feel disheartened if somebody asks them to work on something new. his is especially true for BI & DWH as other IT work would have some amount of language programming (e.g. in Mobile/Web development etc.) but BI & DWH is mostly playing with software tools at junior level.
LinkedIn Reveals 5 Hottest Skills in Demand:
Professional networking website LinkedIn has revealed 5 hottest skills that were in demand in 2014 & 2015. LinkedIn analyzed the skills and experience data in over 330 million member profiles to gather the data.
The LinkedIn report shows that tech skills dominated the 'hottest skills' list. "Across the globe, Business intelligence and Data Warehousing (Data analysis) skills were highly valued. In the US, India, and France, cloud and distributed computing skills were in particularly high demand".

                5 Hottest Skills in India on LinkedIn:
                     1. Business intelligence/data analysis
                     2. Data engineering and data warehousing
                     3. Statistical analysis and data mining
                     4. Market research and insights
                     5. Social media marketing /Email Marketing

The demand of most businesses in today's time is, in context of the question asked, data collection, data cleaning, reporting and gathering insights/ answering business problems/ suggesting methods and ways to boost business.   And so it comes down to two broad categories Business Intelligence and Data Warehousing, and Business Consulting. 
Job Requirements:
·         Candidate should be a graduate or post graduate in the field of arts, science or engineering or MBA. Final Year College students can apply.
·         Should have minimum 50% marks in school and college.
·         Age should be less than 30 years for Fresher. No age restriction for experienced people.

Essential Skill :

Anybody new to industry first need to understand what it is and then start working on coding assignments. I would classify skill and knowledge in two categories, one technical and others soft skills.
Technical SKILL:

1. Solid concepts of Data warehousing
2. Basic knowledge of Database concepts
3. Good knowledge of SQL
4. OLAP Tools
5. MS Excel
6. Basic knowledge on BI/ETL/Analytics tools.
Soft SKILL:

*Excellent in mathematics (numerical aptitude)

* Enjoy Learning tools and new technologies
* Positive attitude
* Love for data, calculations and reports
Note:
I would recommend that you take some Software tool training Like ETL courses (Not mandatory.) Some ETL Courses Like:
1. Informatica  2. Oracle Data Integrator (ODI) 3. Oracle - Warehouse Builder 4. IBM - Cognos  ,etc
Data warehousing (DWH):
Data warehousing is the combination of
 Business Intelligence (BI) + Data Analyst (DA) + Master Data Management (MDD) + Mapping Analyst + ETL Developer + data steward +Database Marketing.

This person is going to perform almost all the tasks. They will work on crude data (unclean) using the many tools available in the market to clean data, generate reports/ dashboards.
Also, technically this profile is a part of Data Warehousing and Business Intelligence domain.
Ø  Data warehouse (DW or DWH), also known as an enterprise data warehouse (EDW), is a system used for reporting and data analysis. DWs are central repositories of integrated data from one or more disparate sources.
Ø  They store current and historical data and are used for creating trending reports for senior management reporting such as annual and quarterly comparisons.

Ø  The data stored in the warehouse is uploaded from the operational systems (such as marketing, sales, etc., shown in the figure to the left). The data may pass through an operational data store for additional operations before it is used in the DW for reporting.

Ø  Any Enterprise must necessarily maintain a lot of data about its operation. This is its "Operational Data".

Note: Business Analyst is generalist and role varys from company to company based on their nature of business or even on project level.

 


What is Data Analyst/Business intelligence/ DWH?

This is the set of techniques and tools for the transformation of raw data into meaningful and useful information for business analysis purposes.

BI technologies are capable of handling large amounts of unstructured data to help identify, develop and otherwise create new strategic business opportunities.
            BI technologies provide historical, current and predictive views of business operations. Common functions of business intelligence technologies are reporting, online analytical processing, analytics, data mining, process mining, complex event processing, business performance management, benchmarking, text mining, and predictive analytics and prescriptive analytics.
BI can be used to support a wide range of business decisions ranging from operational to strategic. Basic operating decisions include product positioning or pricing. Strategic business decisions include.

1. What is Data?
Ans:
 Data is a collection of facts, such as numbers, words, measurements, observations or even just    descriptions of things.
               Data can be qualitative or quantitative.
  • Qualitative data is descriptive information (it describes something)
  • Quantitative data is numerical information (numbers).
2. What is Data Cleansing?
Ans:
·         Data cleansing is the process of removing incomplete, inaccurate or duplicated data from a database.
·         Data cleansing tools can help an organisation save time and costs spent on fixing data errors manually, and, can lead to a more efficient customer service.

3. What is raw data?
Ans: Raw data (sometimes called source data or atomic data) is data that has not been processed for use. A distinction is sometimes made between data and information to the effect that information is the end product of data processing.
4. What is Datawarehousing?
A Datawarehouse is the repository of a data and it is used for Management decision support system. Datawarehouse consists of wide variety of data that has high level of business conditions at a single point in time.
In single sentence, it is repository of integrated information which can be available for queries and analysis.


5.What are the benefits of data warehouse?
A data warehouse helps to integrate data (see Data integration) and store them historically so that we can analyze different aspects of business including, performance analysis, trend, prediction etc. over a given time frame and use the result of our analysis to improve the efficiency of business processes.
cooked data.

6. What is Business Intelligence?
Business Intelligence is also known as DSS – Decision support system which refers to the technologies, application and practices for the collection, integration and analysis of the business related information or data. Even, it helps to see the data on the information itself.
7. What is Dimension Table?
Dimension table is a table which contain attributes of measurements stored in fact tables. This table consists of hierarchies, categories and logic that can be used to traverse in nodes.
8. What is Fact Table?
Fact table contains the measurement of business processes, and it contains foreign keys for the dimension tables.
Example – If the business process is manufacturing of bricks
Average number of bricks produced by one person/machine – measure of the business process

9. What are the stages of Datawarehousing?
There are four stages of Datawarehousing:
  • Offline Operational Database
  • Offline Data Warehouse
  • Real Time Datawarehouse
  • Integrated Datawarehouse

10. What is Data Mining?
Data Mining is set to be a process of analyzing the data in different dimensions or perspectives and summarizing into a useful information. Can be queried and retrieved the data from database in their own format.
11. What is OLTP?
OLTP is abbreviated as On-Line Transaction Processing, and it is an application that modifies the data whenever it received and has large number of simultaneous users.
12. What is OLAP?
OLAP is abbreviated as Online Analytical Processing, and it is set to be a system which collects, manages, processes multi-dimensional data for analysis and management purposes.
13. What is the difference between OLTP and OLAP?
Following are the differences between OLTP and OLAP:
OLTP
OLAP
Data is from original data source
Data is from various data sources
Simple queries by users
Complex queries by system
Normalized small database
De-normalized Large Database
Fundamental business tasks
Multi-dimensional business tasks

14. What is ODS?
ODS is abbreviated as Operational Data Store and it is a repository of real time operational data rather than long term trend data.
15. What is the difference between View and Materialized View?
A view is nothing but a virtual table which takes the output of the query and it can be used in place of tables.
A materialized view is nothing but an indirect access to the table data by storing the results of a query in a separate schema.
16. What is ETL?
ETL is abbreviated as Extract, Transform and Load. ETL is a software which is used to reads the data from the specified data source and extracts a desired subset of data. Next, it transform the data using rules and lookup tables and convert it to a desired state.
Then, load function is used to load the resulting data to the target database.
17. What is VLDB?
VLDB is abbreviated as Very Large Database and its size is set to be more than one terabyte database. These are decision support systems which is used to server large number of users.
18. What is real-time datawarehousing?
Real-time datawarehousing captures the business data whenever it occurs. When there is business activity gets completed, that data will be available in the flow and become available for use instantly.
19. What are Aggregate tables?
Aggregate tables are the tables which contain the existing warehouse data which has been grouped to certain level of dimensions. It is easy to retrieve data from the aggregated tables than the original table which has more number of records.
This table reduces the load in the database server and increases the performance of the query.
20. What is factless fact tables?
A factless fact tables are the fact table which doesn’t contain numeric fact column in the fact table.
21. How can we load the time dimension?
Time dimensions are usually loaded through all possible dates in a year and it can be done through a program. Here, 100 years can be represented with one row per day.
22. What are Non-additive facts?
Non-Addictive facts are said to be facts that cannot be summed up for any of the dimensions present in the fact table. If there are changes in the dimensions, same facts can be useful.
23. What is conformed fact?
Conformed fact is a table which can be used  across multiple data marts in combined with the multiple fact tables.
24. What is Datamart?
A Datamart is a specialized version of Datawarehousing and it contains a snapshot of operational data that helps the business people to decide with the analysis of past trends and experiences. A data mart helps to emphasizes on easy access to relevant information.
25. What is Active Datawarehousing?
An active datawarehouse is a datawarehouse that enables decision makers within a company or organization to manage customer relationships effectively and efficiently.
26. What is the difference between Datawarehouse and OLAP?
Datawarehouse is a place where the whole data is stored for analyzing, but OLAP is used for analyzing the data, managing aggregations, information partitioning into minor level information.
27. What is ER Diagram?
ER diagram is abbreviated as Entity-Relationship diagram which illustrates the interrelationships between the entities in the database. This diagram shows the structure of each tables and the links between the tables.
28. What are the key columns in Fact and dimension tables?
Foreign keys of dimension tables are primary keys of entity tables. Foreign keys of fact tables are the primary keys of the dimension tables.
30. What is SCD?
SCD is defined as slowly changing dimensions, and it applies to the cases where record changes over time.
31. What are the types of SCD?
There are three types of SCD and they are as follows:
SCD 1 – The new record replaces the original record
SCD 2 – A new record is added to the existing customer dimension table
SCD 3 – A original data is modified to include new data
32. What is called data cleaning?
Name itself implies that it is a self explanatory term. Cleaning of Orphan records, Data breaching business rules, Inconsistent data and missing information in a database.
33. What is Metadata?
Metadata is defined as data about the data. The metadata contains information like number of columns used, fix width and limited width, ordering of fields and data types of the fields.
34. What are loops in Datawarehousing?
In datawarehousing, loops are existing between the tables. If there is a loop between the tables, then the query generation will take more time and it creates ambiguity. It is advised to avoid loop between the tables.
35. What is the definition of Cube in Datawarehousing?
Cubes are logical representation of multidimensional data. The edge of the cube has the dimension members,and the body of the cube contains the data values.
36. What is called Dimensional Modelling?
Dimensional Modeling is a concept which can be used by dataware house designers to build their own datawarehouse. This model can be stored in two types of tables – Facts and Dimension table.
Fact table has facts and measurements of the business and dimension table contains the context of measurements.
37. What are the types of Dimensional Modeling?
There are three types of Dimensional Modeling and they are as follows:
  • Conceptual Modeling
  • Logical Modeling
  • Physical Modeling
38. What is surrogate key?
Surrogate key is nothing but a substitute for the natural primary key. It is set to be a unique identifier for each row that can be used for the primary key to a table.
39. What is the difference between ER Modeling and Dimensional Modeling?
ER modeling will have logical and physical model but Dimensional modeling will have only Physical model.
ER Modeling is used for normalizing the OLTP database design whereas Dimensional Modeling is used for de-normalizing the ROLAP and MOLAP design.
40. What are the steps to build the datawarehouse?
Following are the steps to be followed to build the datawaerhouse:
  • Gathering business requirements
  • Identifying the necessary sources
  • Identifying the facts
  • Defining the dimensions
  • Defining the attributes
  • Redefine the dimensions and attributes if required
  • Organize the Attribute hierarchy
  • Define Relationships
  • Assign unique Identifiers
41. What are the different types of datawarehosuing?
Following are the different types of Datawarehousing:

  • Enterprise Datawarehousing
  • Operational Data Store
  • Data Mart
42. What needs to be done while starting the database?
Following need to be done to start the database:
  1. Start an Instance
  2. Mount the database
  3. Open the database
43. What needs to be done when the database is shutdown?
Following needs to be done when the database is shutdown:
  1. Close the database
  2. Dismount the database
  3. Shutdown the Instance
44. Can we take backup when the database is opened?
No, We cannot take full backup when the database is opened.
45. What is defined as Partial Backup?
A Partial backup in an operating system is a backup short of full backup and it can be done while the database is opened or shutdown.
46. What is the goal of Optimizer?
The goal to Optimizer is to find the most efficient way to execute the SQL statements.
47. What is Execution Plan?
Execution Plan is a plan which is used to the optimizer to select the combination of the steps.
48. What are the approaches used by Optimizer during execution plan?
There are two approaches:
  1. Rule Based
  2. Cost Based
49. What are the tools available for ETL?
Following are the ETL tools available:
Informatica
Data Stage
Oracle
Warehouse Builder
Ab Initio
Data Junction
50.What is the difference between metadata and data dictionary?
Metadata is defined as data about the data. But, Data dictionary contain the information about the project information, graphs, abinito commands and server information.
What Is Master Data Management?
Ans:
Ø  Master Data Management (MDM) refers to the process of creating and managing data that an organization must have as a single master copy, called the master data. Usually, master data can include customers, vendors, employees, and products, but can differ by different industries and even different companies within the same industry.
Ø  MDM is important because it offers the enterprise a single version of the truth. Without a clearly defined master data, the enterprise runs the risk of having multiple copies of data that are inconsistent with one another.
Ø  MDM is typically more important in larger organizations. In fact, the bigger the organization, the more important the discipline of MDM is, because a bigger organization means that there are more disparate systems within the company, and the difficulty on providing a single source of truth, as well as the benefit of having master data, grows with each additional data source.
Ø   A particularly big challenge to maintaining master data occurs when there is a merger/acquisition. Each of the organizations will have its own master data, and how to merge the two sets of data will be challenging. Let's take a look at the customer files: The two companies will likely have different unique identifiers for each customer. Addresses and phone numbers may not match. One may have a person's maiden name and the other the current last name. One may have a nickname (such as "Bill") and the other may have the full name (such as "William"). All these contribute to the difficulty in creating and maintain in a single set of master data.

Ø  At the heart of the master data management program is the definition of the master data. Therefore, it is essential that we identify who is responsible for defining and enforcing the definition. Due to the importance of master data, a dedicated person or team should be appointed. At the minimum, a data steward should be identified. The responsible party can also be a group -- such as a data governance committee or a data governance council.

Tuesday 7 April 2015

what is dwh?
ans:
  1. In computing, a data warehouse (DW or DWH), also known as an enterprise data warehouse (EDW), is a system used for reporting and data analysis. DWs are central repositories of integrated data from one or more disparate sources.

Tuesday 18 November 2014

Master Data Management?

Why Should I Manage Master Data?

 http://msdn.microsoft.com/en-us/library/bb190163.aspx

Because it is used by multiple applications, an error in master data can cause errors in all the applications that use it. For example, an incorrect address in the customer master might mean orders, bills, and marketing literature are all sent to the wrong address. Similarly, an incorrect price on an item master can be a marketing disaster, and an incorrect account number in an Account Master can lead to huge fines or even jail time for the CEO—a career-limiting move for the person who made the mistake!

Here is a typical master-data horror story: A credit-card customer moves from 2847 North 9th St. to 1001 11th St. North. The customer changed his billing address immediately, but did not receive a bill for several months. One day, the customer received a threatening phone call from the credit-card billing department, asking why the bill has not been paid. The customer verifies that they have the new address, and the billing department verifies that the address on file is 1001 11th St. N. The customer asks for a copy of the bill, to settle the account. After two more weeks without a bill, the customer calls back and finds the account has been turned over to a collection agency. This time, they find out that even though the address in the file was 1001 11th St. N, the billing address is 101 11th St. N. After a bunch of phone calls and letters between lawyers, the bill finally gets resolved and the credit-card company has lost a customer for life. In this case, the master copy of the data was accurate, but another copy of it was flawed. Master data must be both correct and consistent.
Even if the master data has no errors, few organizations have just one set of master data. Many companies grow through mergers and acquisitions. Each company you acquire comes with its own customer master, item master, and so forth. This would not be bad if you could just Union the new master data with your current master data, but unless the company you acquire is in a completely different business in a faraway country, there's a very good chance that some customers and products will appear in both sets of master data—usually, with different formats and different database keys. If both companies use the Dun & Bradstreet number or Social Security number as the customer identifier, discovering which customer records are for the same customer is a straightforward issue; but that seldom happens. In most cases, customer numbers and part numbers are assigned by the software that creates the master records, so the chances of the same customer or the same product having the same identifier in both databases is pretty remote. Item masters can be even harder to reconcile, if equivalent parts are purchased from different vendors with different vendor numbers.
Merging master lists together can be very difficult. The same customer may have different names, customer numbers, addresses, and phone numbers in different databases. For example, William Smith might appear as Bill Smith, Wm. Smith, and William Smithe. Normal database joins and searches will not be able to resolve these differences. A very sophisticated tool that understands nicknames, alternate spellings, and typing errors will be required. The tool will probably also have to recognize that different name variations can be resolved, if they all live at the same address or have the same phone number. While creating a clean master list can be a daunting challenge, there are many positive benefits to your bottom line from a common master list:
A single, consolidated bill saves money and improves customer satisfaction.
Sending the same marketing literature to a customer from multiple customer lists wastes money and irritates the customer.
Before you turn a customer account over to a collection agency, it would be good to know if they owe other parts of your company money or, more importantly, that they are another division's biggest customer.
Stocking the same item under different part numbers is not only a waste of money and shelf space, but can potentially lead to artificial shortages.
The recent movements toward SOA and SaaS make Master Data Management a critical issue. For example, if you create a single customer service that communicates through well-defined XML messages, you may think you have defined a single view of your customers. But if the same customer is stored in five databases with three different addresses and four different phone numbers, what will your customer service return? Similarly, if you decide to subscribe to a CRM service provided through SaaS, the service provider will need a list of customers for their database. Which one will you send them?
For all these reasons, maintaining a high-quality, consistent set of master data for your organization is rapidly becoming a necessity. The systems and processes required to maintain this data are known as Master Data Management.

What Is Master Data Management?

For purposes of this article, we define Master Data Management (MDM) as the technology, tools, and processes required to create and maintain consistent and accurate lists of master data. There are a couple things worth noting in this definition. One is that MDM is not just a technological problem. In many cases, fundamental changes to business process will be required to maintain clean master data, and some of the most difficult MDM issues are more political than technical. The second thing to note is that MDM includes both creating and maintaining master data. Investing a lot of time, money, and effort in creating a clean, consistent set of master data is a wasted effort unless the solution includes tools and processes to keep the master data clean and consistent as it is updated and expanded.
While MDM is most effective when applied to all the master data in an organization, in many cases the risk and expense of an enterprise-wide effort are difficult to justify. It may be easier to start with a few key sources of Master Data and expand the effort, once success has been demonstrated and lessons have been learned. If you do start small, you should include an analysis of all the master data that you might eventually want to include, so you do not make design decisions or tool choices that will force you to start over when you try to incorporate a new data source. For example, if your initial Customer master implementation only includes the 10,000 customers your direct-sales force deals with, you don't want to make design decisions that will preclude adding your 10,000,000 Web customers later.
An MDM project plan will be influenced by requirements, priorities, resource availability, time frame, and the size of the problem. Most MDM projects include at least these phases:
Identify sources of master data. This step is usually a very revealing exercise. Some companies find they have dozens of databases containing customer data that the IT department did not know existed.
Identify the producers and consumers of the master data. Which applications produce the master data identified in the first step, and—generally more difficult to determine—which applications use the master data. Depending on the approach you use for maintaining the master data, this step might not be necessary. For example, if all changes are detected and handled at the database level, it probably does not matter where the changes come from.
Collect and analyze metadata about for your master data. For all the sources identified in step one, what are the entities and attributes of the data, and what do they mean? This should include attribute name, datatype, allowed values, constraints, default values, dependencies, and who owns the definition and maintenance of the data. The owner is the most important and often the hardest to determine. If you have a repository loaded with all your metadata, this step is an easy one. If you have to start from database tables and source code, this could be a significant effort.
Appoint data stewards. These should be the people with the knowledge of the current source data and the ability to determine how to transform the source into the master-data format. In general, stewards should be appointed from the owners of each master-data source, the architects responsible for the MDM systems, and representatives from the business users of the master data.
Implement a data-governance program and data-governance council. This group must have the knowledge and authority to make decisions on how the master data is maintained, what it contains, how long it is kept, and how changes are authorized and audited. Hundreds of decisions must be made in the course of a master-data project, and if there is not a well-defined decision-making body and process, the project can fail, because the politics prevent effective decision making.
Develop the master-data model. Decide what the master records look like: what attributes are included, what size and datatype they are, what values are allowed, and so forth. This step should also include the mapping between the master-data model and the current data sources. This is normally both the most important and most difficult step in the process. If you try to make everybody happy by including all the source attributes in the master entity, you often end up with master data that is too complex and cumbersome to be useful. For example, if you cannot decide whether weight should be in pounds or kilograms, one approach would be to include both (WeightLb and WeightKg). While this might make people happy, you are wasting megabytes of storage for numbers that can be calculated in microseconds, as well as running the risk of creating inconsistent data (WeightLb = 5 and WeightKg = 5). While this is a pretty trivial example, a bigger issue would be maintaining multiple part numbers for the same part. As in any committee effort, there will be fights and deals resulting in sub-optimal decisions. It's important to work out the decision process, priorities, and final decision maker in advance, to make sure things run smoothly.
Choose a toolset. You will need to buy or build tools to create the master lists by cleaning, transforming, and merging the source data. You will also need an infrastructure to use and maintain the master list. These functions are covered in detail later in the paper.
You can use a single toolset from a single vendor for all of these functions, or you might want to take a best-of-breed approach. In general, the techniques to clean and merge data are different for different types of data, so there are not a lot of tools that span the whole range of master data.
The two main categories of tools are Customer Data Integration (CDI) tools for creating the customer master and Product Information Management (PIM) tools for creating the product master. Some tools will do both, but generally they are better at one or the other.
The toolset should also have support for finding and fixing data-quality issues and maintaining versions and hierarchies. Versioning is a critical feature, because understanding the history of a master-data record is vital to maintaining its quality and accuracy over time. For example, if a merge tool combines two records for John Smith in Boston, and you decide there really are two different John Smiths in Boston, you need to know what the records looked like before they were merged, in order to "unmerge" them.
Design the infrastructure. Once you have clean, consistent master data, you will need to expose it to your applications and provide processes to manage and maintain it. This step is a big-enough issue, I devote a section to it later in the document. When this infrastructure is implemented, you will have a number of applications that will depend on it being available, so reliability and scalability are important considerations to include in your design. In most cases, you will have to implement significant parts of the infrastructure yourself, because it will be designed to fit into your current infrastructure, platforms, and applications.
Generate and test the master data. This step is where you use the tools you have developed or purchased to merge your source data into your master-data list. This is often an iterative process requiring tinkering with rules and settings to get the matching right. This process also requires a lot of manual inspection to ensure that the results are correct and meet the requirements established for the project. No tool will get the matching done correctly 100 percent of the time, so you will have to weigh the consequences of false matches versus missed matches to determine how to configure the matching tools. False matches can lead to customer dissatisfaction, if bills are inaccurate or the wrong person is arrested. Too many missed matches make the master data less useful, because you are not getting the benefits you invested in MDM to get.
Modify the producing and consuming systems. Depending on how your MDM implementation is designed, you might have to change the systems that produce, maintain, or consume master data to work with the new source of master data. If the master data is used in a system separate from the source systems—a data warehouse, for example—the source systems might not have to change. If the source systems are going to use the master data, however, there will likely be changes required. Either the source systems will have to access the new master data or the master data will have to be synchronized with the source systems, so that the source systems have a copy of the cleaned-up master data to use. If it's not possible to change one or more of the source systems, either that source system might not be able to use the master data or the master data will have to be integrated with the source system's database through external processes, such as triggers and SQL commands.
The source systems generating new records should be changed to look up existing master record sets before creating new records or updating existing master records. This ensures that the quality of data being generated upstream is good, so that the MDM can function more efficiently and the application itself manages data quality. MDM should be leveraged not only as a system of record, but also as an application that promotes cleaner and more efficient handling of data across all applications in the enterprise. As part of MDM strategy, all three pillars of data management need to be looked into: data origination, data management, and data consumption. It is not possible to have a robust enterprise-level MDM strategy if any one of these aspects is ignored.
Implement the maintenance processes. As we stated earlier, any MDM implementation must incorporate tools, processes, and people to maintain the quality of the data. All data must have a data steward who is responsible for ensuring the quality of the master data. The data steward is normally a business person who has knowledge of the data, can recognize incorrect data, and has the knowledge and authority to correct the issues. The MDM infrastructure should include tools that help the data steward recognize issues and simplify corrections. A good data-stewardship tool should point out questionable matches that were made—customers with different names and customer numbers that live at the same address, for example. The steward might also want to review items that were added as new, because the match criteria were close but below the threshold. It is important for the data steward to see the history of changes made to the data by the MDM systems, to isolate the source of errors and undo incorrect changes. Maintenance also includes the processes to pull changes and additions into the MDM system, and to distribute the cleansed data to the required places.
As you can see, MDM is a complex process that can go on for a long time. Like most things in software, the key to success is to implement MDM incrementally, so that the business realizes a series of short-term benefits while the complete project is a long-term process. No MDM project can be successful without the support and participation of the business users. IT professionals do not have the domain knowledge to create and maintain high-quality master data. Any MDM project that does not include changes to the processes that create, maintain, and validate master data is likely to fail. The rest of this paper will cover the details of the technology and processes for creating and maintaining master data.
How Do I Create a Master List?
Whether you buy a tool or decide to roll your own, there are two basic steps to creating master data: clean and standardize the data, and match data from all the sources to consolidate duplicates. Before you can start cleaning and normalizing your data, you must understand the data model for the master data. As part of the modeling process, the contents of each attribute were defined, and a mapping was defined from each source system to the master-data model. This information is used to define the transformations necessary to clean your source data.
Cleaning the data and transforming it into the master data model is very similar to the Extract, Transform, and Load (ETL) processes used to populate a data warehouse. If you already have ETL tools and transformation defined, it might be easier just to modify these as required for the master data, instead of learning a new tool. Here are some typical data-cleansing functions:
Normalize data formats. Make all the phone numbers look the same, transform addresses (and so on) to a common format.
Replace missing values. Insert defaults, look up ZIP codes from the address, look up the Dun & Bradstreet number.
Standardize values. Convert all measurements to metric, convert prices to a common currency, change part numbers to an industry standard.
Map attributes. Parse the first name and last name out of a contact-name field, move Part# and partno to the PartNumber field.
Most tools will cleanse the data that they can, and put the rest into an error table for hand processing. Depending on how the matching tool works, the cleansed data will be put into a master table or a series of staging tables. As each source is cleansed, the output should be examined to ensure the cleansing process is working correctly.
Matching master-data records to eliminate duplicates is both the hardest and most important step in creating master data. False matches can actually lose data (two Acme Corporations become one, for example) and missed matches reduce the value of maintaining a common list. The matching accuracy of MDM tools is one of the most important purchase criteria. Some matches are pretty trivial to do. If you have Social Security numbers for all your customers, or if all your products use a common numbering scheme, a database JOIN will find most of the matches. This hardly ever happens in the real world, however, so matching algorithms are normally very complex and sophisticated. Customers can be matched on name, maiden name, nickname, address, phone number, credit-card number, and so on, while products are matched on name, description, part number, specifications, and price. The more attribute matches and the closer the match, the higher degree of confidence the MDM system has in the match. This confidence factor is computed for each match, and if it surpasses a threshold, the records match. The threshold is normally adjusted depending on the consequences of a false match. For example, you might specify that if the confidence level is over 95 percent, the records are merged automatically, and if the confidence is between 80 percent and 95 percent, a data steward should approve the match before they are merged.
Most merge tools merge one set of input into the master list, so the best procedure is to start the list with the data in which you have the most confidence, and then merge the other sources in one at a time. If you have a lot of data and a lot of problems with it, this process can take a long time. You might want to start with the data from which you expect to get the most benefit having consolidated; run a pilot project with that data, to ensure your processes work and you are seeing the business benefits you expect; and then start adding other sources, as time and resources permit. This approach means your project will take longer and possibly cost more, but the risk is lower. This approach also lets you start with a few organizations and add more as the project demonstrates success, instead of trying to get everybody on board from the start.
Another factor to consider when merging your source data into the master list is privacy. When customers become part of the customer master, their information might be visible to any of the applications that have access to the customer master. If the customer data was obtained under a privacy policy that limited its use to a particular application, you might not be able to merge it into the customer master. You might want to add a lawyer to your MDM planning team.
At this point, if your goal was to produce a list of master data, you are done. Print it out or burn it to a CD, and move on. If you want your master data to stay current as data is added and changed, you will have to develop infrastructure and processes to manage the master data over time. The next section provides some options on how to do just that.
How Do I Maintain a Master List?
There are many different tools and techniques for managing and using master data. We will cover three of the more common scenarios here:
Single-copy approach—In this approach, there is only one master copy of the master data. All additions and changes are made directly to the master data. All applications that use master data are rewritten to use the new data instead of their current data. This approach guarantees consistency of the master data, but in most cases it's not practical. Modifying all your applications to use a new data source with a different schema and different data is, at least, very expensive; if some of your applications are purchased, it might even be impossible.
Multiple copies, single maintenance—In this approach, master data is added or changed in the single master copy of the data, but changes are sent out to the source systems in which copies are stored locally. Each application can update the parts of the data that are not part of the master data, but they cannot change or add master data. For example, the inventory system might be able to change quantities and locations of parts, but new parts cannot be added, and the attributes that are included in the product master cannot be changed. This reduces the number of application changes that will be required, but the applications will minimally have to disable functions that add or update master data. Users will have to learn new applications to add or modify master data, and some of the things they normally do will not work anymore.
Continuous merge—In this approach, applications are allowed to change their copy of the master data. Changes made to the source data are sent to the master, where they are merged into the master list. The changes to the master are then sent to the source systems and applied to the local copies. This approach requires few changes to the source systems; if necessary, the change propagation can be handled in the database, so no application code is changed. On the surface, this seems like the ideal solution. Application changes are minimized, and no retraining is required. Everybody keeps doing what they are doing, but with higher-quality, more complete data. This approach does have several issues:
Update conflicts are possible and difficult to reconcile. What happens if two of the source systems change a customer's address to different values? There's no way for the MDM system to decide which one to keep, so intervention by the data steward is required; in the meantime, the customer has two different addresses. This must be addressed by creating data-governance rules and standard operating procedures, to ensure that update conflicts are reduced or eliminated.
Additions must be remerged. When a customer is added, there is a chance that another system has already added the customer. To deal with this situation, all data additions must go through the matching process again to prevent new duplicates in the master.
Maintaining consistent values is more difficult. If the weight of a product is converted from pounds to kilograms and then back to pounds, rounding can change the original weight. This can be disconcerting to a user who enters a value and then sees it change a few seconds later.
In general, all these things can be planned for and dealt with, making the user's life a little easier, at the expense of a more complicated infrastructure to maintain and more work for the data stewards. This might be an acceptable trade-off, but it's one that should be made consciously.
Versioning and Auditing
No matter how you manage your master data, it's important to be able to understand how the data got to the current state. For example, if a customer record was consolidated from two different merged records, you might need to know what the original records looked like, in case a data steward determines that the records were merged by mistake and really should be two different customers. The version management should include a simple interface for displaying versions and reverting all or part of a change to a previous version. The normal branching of versions and grouping of changes that source-control systems use can also be very useful for maintaining different derivation changes and reverting groups of changes to a previous branch.
Data stewardship and compliance requirements will often include a way to determine who made each change and when it was made. To support these requirements, an MDM system should include a facility for auditing changes to the master data. In addition to keeping an audit log, the MDM system should include a simple way to find the particular change you are looking for. An MDM system can audit thousands of changes a day, so search and reporting facilities for the audit log are important.
Hierarchy Management
In addition to the master data itself, the MDM system must maintain data hierarchies—for example, bill of materials for products, sales territory structure, organization structure for customers, and so forth. It's important for the MDM system to capture these hierarchies, but it's also useful for an MDM system to be able to modify the hierarchies independently of the underlying systems. For example, when an employee moves to a different cost center, there might be impacts to the Travel and Expense system, payroll, time reporting, reporting structures, and performance management. If the MDM system manages hierarchies, a change to the hierarchy in a single place can propagate the change to all the underlying systems. There might also be reasons to maintain hierarchies in the MDM system that do not exist in the source systems. For example, revenue and expenses might need to be rolled up into territory or organizational structures that do not exist in any single source system. Planning and forecasting might also require temporary hierarchies to calculate "what if" numbers for proposed organizational changes. Historical hierarchies are also required in many cases to roll up financial information into structures that existed in the past, but not in the current structure. For these reasons, a powerful, flexible hierarchy-management feature is an important part of an MDM system.

Conclusion:

The recent emphasis on regulatory compliance, SOA, and mergers and acquisitions has made the creating and maintaining of accurate and complete master data a business imperative. Both large and small businesses must develop data-maintenance and governance processes and procedures, to obtain and maintain accurate master data. While it's easy to think of master-data management as a technological issue, a purely technological solution without corresponding changes to business processes and controls will likely fail to produce satisfactory results. This paper has covered the reasons for adopting master-data management, the process of developing a solution, and several options for the technological implementation of the solution. Future papers in this series will explain the technological and procedural issues that must be resolved to implement an MDM system.

Sunday 26 October 2014

Some of the topics covered in this tutorial are:-


1.Understanding Mapping Analyst for Excel
2.Configuring one Mappings
3.Configuring Multiple Mappings
4.Importing and exporting Mapping Specifications
5.Customizing Mapping Specification Templates

Mapping Analyst for Excel in Informatica

Hi friends , hope you all have come across the Mapping Analyst for Excel in Informatica Power Centre 8.6 version.I would like to just give a brief introduction about the same.In the coming days we will discuss more about the same.

As you all know before when we start a DWH project we need to gather business requirements from the client.It takes some time for us to analyze the functionality of the project and then make that as a document in technical perspective for ETL. This process is too tedious as a lot of valuable development time is spent gathering business requirements and then translating them into technical specifications, and then converting these  specifications to PowerCenter data integration mappings.So in order to avoid this Informatica has come with a new feature called Mapping Analyst

for Excel:-




Mapping Analyst for Excel simplifies the complexity in mapping by directly converting the data in the Excel into Informatica Mappings regardless of the format.This will mainly help the Analyst.They can follow a specific template for defining the source and target in the Excel and then send the same to the Developers who can directly convert this template into Informatica Mappings and execute the workflows.