Reltio Connect

 View Only

Match Tuning Best Practices: Data Profiling and Analysis

By Joel Snipes posted 08-30-2021 08:40


Reltio Master Data Management’s purpose is to consolidate thousands of profiles and data sources to a cloud-native, user-friendly system that ultimately helps businesses run smoothly. Match Tuning’s purpose is to ensure that consolidation happens in an effective and efficient manner.

What is Match Tuning?

Match tuning is the iterative process of developing match rules while minimizing bad matches, maximizing automation, and preserving system performance. In order: 

Match tuning is best done by utilizing a three-step process, or the match tuning life cycle. These three steps are:

  •   Data profiling and analysis
  •   Rule design and implementation
  •   Testing and improving

Data Profiling Tools and Analysis

Though underappreciated, data profiling is an important first step in the match tuning process. If you do not understand your data, you will not be able to effectively design rules to match your data. There are lots of tools available to help with the profiling process, or it can be completed via a series of sql queries in RIQ/Spark. Below I will cover some important factors to consider in your profiling process.

  •   Cardinality is the number of distinct values in a dataset. For example, in a column titled gender we might have 4 distinct values: Female, Male, Other, Unknown. This column would have a cardinality of 4. 

           Cardinality is important because high cardinality attributes are more valuable to identifying potential matches than low cardinality attributes. After  profiling for cardinality we will be able to identify the best attributes to build our match rules around.

               Example SQL Query: SELECT COUNT(DISTINCT GENDER) AS Gender_Cardinality FROM Customer

  •   Completeness indicates the percentage of records for which an attribute is populated. You cannot match data that does not exist, but you can design your match rules to compensate for it. 

             Example SQL Query: SELECT COUNT(GENDER)/COUNT(*) AS Gender_Completeness FROM Customer

  •   Uniqueness indicates the percentage of records that do not share a value with another record. In our previous example of gender, there are likely 0 records with unique values. However an attribute called social security may have a uniqueness of 100%.

    Example SQL Query: SELECT COUNT(DISTINCT SocialSecurity)/COUNT(*) AS SocialSecurity_Uniqueness FROM Customer
  •   Common values are often used in place of nulls. Look for the ten most common values for each field. You may notice incorrect or junk data, for example a birthdate of 12/31/9999 or a social security of 555-55-5555. It’s best practice not to load these placeholder values as it will affect matching. By catching some of these anomalies before they enter your system, you will avoid the process of deleting and reloading.

         Example SQL Query
: SELECT Birthday, Count(Birthday) AS BDayCount  FROM Customer ORDER BY BDayCount DESC LIMIT 10

  Standardization and Noise Words are handled out of the box in Reltio saving you the time and energy of managing them. Data cleansing comes pre-configured in your tenant for Addresses,  Phone numbers and email addresses. Noise words are handled by some token classes such as “OrganizationNameMatchToken” but custom lists can be added.

Design and Implementation

After obtaining a firm grasp of what your data looks like, it’s time to start designing your match rules. I like to use the results of my profiling to divide my attributes into three buckets. These three buckets are identifying attributes, differentiating attributes, and categorizing attributes.

  1.     Identifying attributes are attributes that have high levels of uniqueness and completeness, such as social security numbers, email addresses, or phone numbers. This may vary in different datasets. These attributes will be the center pieces of your match rules. Consider using  the “Exact” operator for precise attributes like IDs, phone numbers and emails and “Fuzzy” for attributes like first name and last name.
  2.     Differentiating attributes have low levels of uniqueness, and varying levels of completeness. While identifying attributes will help your match rules decide which records to compare, differentiating attributes help you validate that match.
           For example, if you created a match rule around the identifying attribute “Phone.Number” you might consider using differentiating attributes such as              “Gender” or “Date of Birth” to validate. If the “Gender” and “Date of Birth” differ, it is likely note a valid match.

           For differentiating attributes consider the ExactOrNull and ExactOrAllNull operators if you are working with data that have low levels of completeness. 

  1.     Categorizing attributes describe the nature of other data and will be a factor when working with nested attributes in Reltio. For example, if two individuals shared an address, you may assume two individuals are a match. However, if “Address.Type” is billing, sharing an address may only indicate that two individuals share a CPA. 

Categorizing attributes help to further exclude or reduce the set of data considered in a match and merge. This is accomplished with the  “Equals”  operator. Equals operates similarly to a WHERE clause in SQL by filtering the set to only consider a match when “Address.Type” equals “Home”.  

In part II of this blog we will talk about effective match tuning...Stay TUNED! In the meantime consider reviewing the docs on designing comparison formulas for more details.