In this video, we're talking all things Matching and Merging. Reltio community program manager, Chris Detzel, and Reltio technical consultant, Joel Snipes, discuss match rule features, design and tuning in order to get the most out of your data. This session explores what makes a good merge rule, what makes a good potential match, and shows some of the common pitfalls new MDM practitioners fall into.
Summary:
Developing Effective Match Rules in Reltio for MDM Matching and Merging
In this article, we will discuss the importance of match tuning in Master Data Management (MDM) and explore the three-step process for developing effective match rules in Reltio. We will also delve into the significance of profiling in the MDM matching and merging process.
Introduction
Chris Detzel, the Community Program Manager, introduces a webinar on MDM matching and merging with Joel Snipes, a senior technical consultant. Chris asks attendees to post their questions in the chat and reminds them that the webinar will be recorded and shared with the Reltio community. He also announces that the community is looking for presenters and encourages attendees to email him if they are interested.
The Three-Step Process for Developing Effective Match Rules
Joel Snipes explains that match tuning is the process of developing and improving match rules. In Reltio, building and tuning match rules is a prerequisite for matching and merging. He outlines a three-step process for developing match rules: analysis, design, and tuning.
First, the process begins with analyzing the source data and performing profiling. Second, the profiling results inform the design of the match rules. Finally, after implementation, testing, and iteration, the match rules are tuned based on the analysis tools and on how the match and merge process performs on the data.
The Importance of Profiling
Joel emphasizes the importance of profiling in developing effective match rules. Profiling helps understand the data and informs the development of match rules. Without a strong understanding of the data, the initial cut of match rules can be challenging to create.
Joel further notes that cardinality is a critical factor in profiling. Cardinality is the number of distinct values an attribute takes across the data set. A good matching attribute has high cardinality, because a value shared by only a few records reduces the likelihood of bad matches. Perfect uniqueness, however, makes an attribute useless for matching: if every record has its own unique customer ID, no two records ever share a value, so a match rule based on customer ID is not practical. In contrast, an account number, which is highly but not perfectly unique, may be a more suitable matching attribute.
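The cardinality check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `profile_cardinality`, the field names, and the sample records are all hypothetical and are not part of Reltio or the webinar.

```python
from collections import Counter

def profile_cardinality(records, attribute):
    """Summarize how unique an attribute's values are across a list of dict records."""
    values = [r[attribute] for r in records if r.get(attribute) not in (None, "")]
    counts = Counter(values)
    populated = len(values)
    distinct = len(counts)
    # Values that appear on more than one record are the only ones a match rule can act on.
    shared = sum(1 for c in counts.values() if c > 1)
    return {
        "populated": populated,
        "distinct": distinct,
        "uniqueness": distinct / populated if populated else 0.0,
        "shared_values": shared,
    }

# Hypothetical sample data for illustration only.
records = [
    {"customer_id": "C1", "account_number": "A-100", "gender": "F"},
    {"customer_id": "C2", "account_number": "A-100", "gender": "F"},
    {"customer_id": "C3", "account_number": "A-200", "gender": "M"},
    {"customer_id": "C4", "account_number": "A-300", "gender": "F"},
]

customer_id_profile = profile_cardinality(records, "customer_id")
account_profile = profile_cardinality(records, "account_number")
gender_profile = profile_cardinality(records, "gender")
```

In this sample, `customer_id` has a uniqueness of 1.0 and no shared values, so it cannot surface duplicates, while `account_number` is mostly unique but still shared by two records.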
Conclusion
In conclusion, developing effective match rules in Reltio for MDM matching and merging requires a three-step process: analysis, design, and tuning. Profiling is a crucial aspect of this process, and it is essential to understand the data's cardinality to identify suitable matching attributes. Understanding these concepts will help ensure the success of MDM matching and merging in Reltio.
13:59 - 27:51
The Challenges of Address Verification and Standardization
Introduction
Address verification and standardization are essential for accurate data matching and merging. However, neither is a simple process, especially when dealing with different countries, languages, and formats. In this article, we will explore the challenges of address verification and standardization and how they affect a business's data quality.
Partial Verification and Localization
Partial verification is a common issue in address verification. In some cases, an address remains only partially verified even after address cleansing tools have been applied. One reason is that the tool cannot match the address down to the required level of detail. Different countries have different address formats and structures, and it is not easy to standardize them all.
Another issue is localization. Not all address cleansing tools are created equal, and some work better in certain countries than others. For example, a tool that works well in the US may not work well in Germany or India. This raises the question of how reliable the data is when the address verification tool is not effective for a given country.
Standardization Across Different Languages
Another challenge in address verification and standardization is the use of different languages. In some languages there are several ways to write the same name or word, which becomes a problem when matching or merging data that comes from different sources or different countries. Noise words, such as prepositions and conjunctions, also differ from language to language, which makes standardization even more challenging.
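As a rough illustration of the language problem, the sketch below normalizes a name by removing accents and dropping language-specific noise words. It is a minimal Python example under stated assumptions: the `NOISE_WORDS` lists and the `normalize_name` function are hypothetical, and a real deployment would rely on curated lists or a cleansing service rather than this simplification.

```python
import re
import unicodedata

# Hypothetical noise-word lists; real lists would be curated per language.
NOISE_WORDS = {
    "en": {"of", "the", "and"},
    "de": {"der", "die", "das", "und"},
}

def normalize_name(text, language):
    """Lowercase, strip accents, and remove the noise words for the given language."""
    # Decompose accented characters and drop the combining marks (for example "ü" becomes "u").
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    words = re.findall(r"[a-z0-9]+", stripped.lower())
    noise = NOISE_WORDS.get(language, set())
    return " ".join(w for w in words if w not in noise)

english = normalize_name("Bank of the West", "en")
german = normalize_name("Universität der Künste", "de")
```

The same word list cannot be reused across languages: "die" is a noise word in German but a meaningful word in English, which is why the lists are keyed by language here.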
Third-Party Address Cleansing and Standardization Services
One potential solution to the challenges of address verification and standardization is to use third-party address cleansing and standardization services. However, this solution is not without its challenges. Some address cleansing services may not have the same level of coverage or accuracy in all countries, which can lead to partial verification or inaccurate data. Additionally, some services may not be able to handle specific languages or naming conventions, which can lead to issues with standardization.
Conclusion
Address verification and standardization are essential for accurate data matching and merging, but they come with their own set of challenges. The challenges of partial verification and localization, standardization across different languages, and the use of third-party address cleansing and standardization services must be addressed to ensure high-quality data. Businesses need to be aware of these challenges and find the right tools and solutions to address them. Only then can they be confident in the accuracy and reliability of their data.
27:54 - 40:36
Best Practices for Fuzzy Matching and Merging Data
Introduction
Fuzzy matching and merging is a powerful technique for consolidating data from different sources, but it can also be a risky process. In this article, we will go over some best practices for fuzzy matching and merging to ensure that your data is accurate and reliable.
Suspect Setting
Before enabling automatic merging, it's important to use the suspect setting. This allows you to manually review and approve potential matches before they are merged. You should only switch to automatic merging after a lot of investigation and with a high confidence rating that all your matches are correct.
Match Rules
Creating match rules is key to successful fuzzy matching. You need to define rules for what constitutes a match, such as fuzzy first name, fuzzy last name, and exact email. After developing a rule, click through to one of the profiles and review the potential match view. Once you have reviewed a large set of potential matches without finding any bad ones, you can turn the rule into an automatic one.
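To make the example rule concrete (fuzzy first name, fuzzy last name, exact email), here is a minimal Python sketch. It is not Reltio's comparator logic: it substitutes the standard library's `difflib.SequenceMatcher` for the fuzzy comparison, and the threshold, function names, and field names are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

FUZZY_THRESHOLD = 0.8  # assumed cutoff for illustration, not a Reltio default

def fuzzy_equal(a, b, threshold=FUZZY_THRESHOLD):
    """True when two non-empty strings are similar enough after basic cleanup."""
    a, b = a.strip().lower(), b.strip().lower()
    if not a or not b:
        return False
    return SequenceMatcher(None, a, b).ratio() >= threshold

def exact_equal(a, b):
    """True when two non-empty strings are identical ignoring case and outer spaces."""
    a, b = a.strip().lower(), b.strip().lower()
    return bool(a) and a == b

def is_potential_match(left, right):
    """All three conditions must hold, mirroring an AND of the rule's attributes."""
    return (
        fuzzy_equal(left["first_name"], right["first_name"])
        and fuzzy_equal(left["last_name"], right["last_name"])
        and exact_equal(left["email"], right["email"])
    )

# Hypothetical profiles for illustration only.
a = {"first_name": "Jonathan", "last_name": "Smith", "email": "j.smith@example.com"}
b = {"first_name": "Jonathon", "last_name": "Smithe", "email": "J.Smith@example.com"}
c = {"first_name": "Jonathan", "last_name": "Smith", "email": "other@example.com"}

same_person = is_potential_match(a, b)
different_email = is_potential_match(a, c)
```

Requiring the exact email alongside the fuzzy names is what keeps the rule tight: profile `c` has identical names but a different email, so it is not returned as a potential match.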
Data Stewards
Data stewards should be responsible for manually merging potential matches that are not automatically merged. If you find matches that are not the same person, you should click "not a match" and leave the rule as a suspect one.
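The review loop described above can be expressed as a simple decision: a rule stays suspect until a large reviewed sample shows no bad matches. The Python sketch below is only an illustration; the `MIN_REVIEWED` value and the `recommend_rule_type` function are hypothetical, and the source does not state how large the reviewed set should be.

```python
MIN_REVIEWED = 200  # hypothetical sample size; the source only says "a large set"

def recommend_rule_type(decisions, min_reviewed=MIN_REVIEWED):
    """decisions: list of steward outcomes, each either "merge" or "not a match"."""
    bad_matches = sum(1 for d in decisions if d == "not a match")
    if bad_matches > 0:
        return "suspect"    # any bad match keeps the rule under manual review
    if len(decisions) < min_reviewed:
        return "suspect"    # not enough reviewed pairs yet to trust the rule
    return "automatic"

clean_review = recommend_rule_type(["merge"] * 250)
one_bad_match = recommend_rule_type(["merge"] * 249 + ["not a match"])
too_few_reviews = recommend_rule_type(["merge"] * 10)
```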
Start with Suspect Setting
In general, every rule should start with the suspect setting, even if it is based on something very tight like a social security number or an old MDM system's ID. It is best to make sure that things are working the way you expect before switching to automatic merging.
Fuzzy Matching
Fuzzy matching is a complex process that involves tokenization and comparators. There are many options for fuzzy matching, such as Levenshtein distance and Soundex. Reltio lets you mix and match these options to create your own special sauce for fuzzy matching.
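For readers unfamiliar with the two techniques named above, the following Python sketch shows textbook versions of Levenshtein distance and American Soundex. These are generic implementations written for illustration; they are not Reltio's comparator or tokenizer classes, and the function names are assumptions.

```python
def levenshtein(a, b):
    """Minimum number of single-character inserts, deletes, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep when equal
            ))
        previous = current
    return previous[-1]

SOUNDEX_CODES = {
    **dict.fromkeys("bfpv", "1"),
    **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"),
    "l": "4",
    **dict.fromkeys("mn", "5"),
    "r": "6",
}

def soundex(name):
    """American Soundex: first letter plus three digits, padded with zeros."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    last_digit = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = SOUNDEX_CODES.get(ch, "")
        if digit and digit != last_digit:
            code += digit
        # "h" and "w" do not separate letters with the same code; vowels do.
        if ch not in "hw":
            last_digit = digit
    return (code + "000")[:4]

distance = levenshtein("smith", "smyth")
phonetic_a = soundex("Smith")
phonetic_b = soundex("Smyth")
```

The two approaches complement each other: edit distance measures how many keystrokes separate two spellings, while Soundex groups names that sound alike, so "Smith" and "Smyth" are one edit apart and also share the same phonetic code.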
Relevance-based Match
Relevance-based matching is different from automatic and suspect rule matches. Rather than a fixed outcome per rule, it calculates a relevance score for each pair of records from weighted comparisons and uses that score to decide the outcome. It is important to use this feature carefully and to monitor its results to ensure accuracy.
Match IQ
Match IQ is another feature that uses machine learning to improve matching accuracy. It is important to keep in mind that Match IQ should be used in addition to, not instead of, your own match rules.
In conclusion, fuzzy matching and merging can be a powerful tool for consolidating data, but it requires careful planning and execution. By following these best practices, you can ensure that your data is accurate and reliable. Remember to always start with the suspect setting, create clear match rules, and use data stewards to manually merge potential matches.
40:39 - 53:22
Understanding Cardinality and Tokenization in Data Matching
Introduction
Matching data is crucial in data management, as it helps to find similar records and merge them to eliminate duplicates. In this article, we will discuss the importance of cardinality and tokenization in data matching and how to configure them for optimal performance.
Cardinality Attributes
Cardinality describes how many distinct values an attribute has. For example, gender is a low cardinality attribute because it has only a handful of possible values, such as male or female. On the other hand, IDs and social security numbers are high cardinality attributes because each value is unique, or nearly so. When matching data, it is important to first bring together records with similar identifying attributes, such as name and address, and only then reduce the set with differentiating attributes like gender. Low cardinality attributes can be ignored in tokenization, while high cardinality attributes should be included.
Tokenization
Tokenization is the process of breaking down data into tokens, which are small, normalized representations of the data. Tokenization is crucial in data matching because it enables fuzzy matching, which is the process of matching data that is similar but not exactly the same. Tokens are generated for each attribute of a record and are compared to tokens from other records to find matches.
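The role of tokens as a way to narrow down which records get compared can be sketched as follows. This Python example is a simplified illustration, not Reltio's tokenization: the token scheme (first three letters of the last name plus the postal code), the function names, and the sample records are all assumptions.

```python
from collections import defaultdict
from itertools import combinations

def make_token(record):
    """Hypothetical token: a short normalized key built from identifying attributes."""
    last = "".join(ch for ch in record["last_name"].lower() if ch.isalnum())
    postal = "".join(ch for ch in record["postal_code"] if ch.isalnum())
    return f"{last[:3]}|{postal}"

def candidate_pairs(records):
    """Group records by token and return only the pairs that share a token."""
    buckets = defaultdict(list)
    for record in records:
        buckets[make_token(record)].append(record["id"])
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

# Hypothetical records for illustration only.
records = [
    {"id": 1, "last_name": "Snipes", "postal_code": "30301"},
    {"id": 2, "last_name": "Snipe", "postal_code": "30301"},
    {"id": 3, "last_name": "Detzel", "postal_code": "75001"},
]

pairs = candidate_pairs(records)
```

Only records that share a token are handed to the more expensive comparison step, which is why "Snipes" and "Snipe" are paired here even though the names are not identical, while the third record is never compared with either.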
Ignore in Token
When configuring tokenization, the "ignore in token" option can be set on an attribute to exclude it from token generation. Static or unique values like social security numbers, by contrast, should not be set to "ignore in token", because they are high cardinality identifying attributes.
Similar Match Rules
When multiple match rules generate similar tokens, the "ignore in token" option can be used on most of the attributes for one of the rules to reduce the number of tokens in the set. This can improve performance by making the matching process more efficient.
Troubleshooting
If you are not getting the matches you expect, it is possible that you have too many attributes set to "ignore in token". Make sure that high cardinality attributes are not set to "ignore in token" and that low cardinality attributes are ignored. Additionally, segment-based matching can be used to execute matching only for specific types of customers. This is done using the "equals" operand to match records with specific values for a certain attribute.
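The two settings mentioned above, "ignore in token" and an "equals" segment filter, can be illustrated together. The Python sketch below uses a made-up dictionary layout for a rule; it is not Reltio's match rule configuration syntax, and every key, function name, and value is a hypothetical stand-in.

```python
# Hypothetical rule layout for illustration; not Reltio's configuration format.
RULE = {
    "equals": {"customer_type": "Retail"},               # segment filter
    "compare": ["last_name", "postal_code", "gender"],   # attributes the rule compares
    "ignore_in_token": ["gender"],                       # low cardinality: compared, not tokenized
}

def in_segment(record, rule):
    """True when the record has the exact value required for every 'equals' attribute."""
    return all(record.get(attr) == value for attr, value in rule["equals"].items())

def token_for(record, rule):
    """Build a token from the non-ignored attributes, or None outside the segment."""
    if not in_segment(record, rule):
        return None
    parts = [
        str(record[attr]).strip().lower()
        for attr in rule["compare"]
        if attr not in rule["ignore_in_token"]
    ]
    return "|".join(parts)

retail = {"customer_type": "Retail", "last_name": "Snipes", "postal_code": "30301", "gender": "M"}
wholesale = {"customer_type": "Wholesale", "last_name": "Snipes", "postal_code": "30301", "gender": "M"}

retail_token = token_for(retail, RULE)
wholesale_token = token_for(wholesale, RULE)
```

If every attribute in `compare` were also listed in `ignore_in_token`, the token would be empty and no candidates would be found, which corresponds to the troubleshooting symptom described above.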
Reports
Finally, reports can be generated to review potential matches in bulk. An open-source tool for potential match reports can be used to compare two records side-by-side externally.
In conclusion, understanding cardinality and tokenization is crucial for successful data matching. By configuring these settings appropriately and troubleshooting any issues, data stewards can achieve optimal performance and eliminate duplicate records.
53:23 - 1:05:58
Maximizing Token Phrase Creation for Efficient Matching
Introduction
In this article, we will discuss token phrase creation and how it can be used to make data matching efficient. We will also explore how to keep the number of tokens generated in check, as generating too many tokens can slow down the system.
Background
Token phrases are created by analyzing the data and identifying phrases that commonly appear together. Tokenizing the data allows us to identify the most common phrases and create token phrases that represent them. These token phrases can be used for efficient matching of data, as they represent the most common patterns in the data.
Generating Token Phrases
To generate token phrases, we first tokenize the data and identify the most common phrases, then create token phrases that represent them.
It is important to keep the number of tokens generated in check, as generating too many can slow down the system. The lower limit for the number of tokens should be five, and the upper limit should be around 300. Staying within this range keeps matching efficient.
Examining Token Phrases
To examine token phrases, we can use a map that shows the lower and upper limits of tokens. By examining this map, we can see where our data set falls between these two limits. If our data set is too close to the lower limit, we may be ignoring attributes in token generation in too many places. If our data set is too close to the upper limit, we may be generating needless tokens that slow down the system.
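The limits above lend themselves to a quick check. The Python sketch below classifies token counts against the lower limit of five and the upper limit of 300 mentioned in the session. It assumes the count is tracked per record, which the source does not state explicitly, and the function names and sample counts are hypothetical.

```python
LOWER_LIMIT = 5    # lower limit quoted in the session
UPPER_LIMIT = 300  # approximate upper limit quoted in the session

def classify_token_count(count, lower=LOWER_LIMIT, upper=UPPER_LIMIT):
    """Label a token count as below, within, or above the recommended range."""
    if count < lower:
        return "too few"
    if count > upper:
        return "too many"
    return "ok"

def summarize(token_counts):
    """token_counts: mapping of record id to the number of tokens generated for it."""
    summary = {"too few": 0, "ok": 0, "too many": 0}
    for count in token_counts.values():
        summary[classify_token_count(count)] += 1
    return summary

# Hypothetical counts for illustration only.
summary = summarize({"e1": 2, "e2": 40, "e3": 512, "e4": 5})
```

A large "too few" bucket suggests that too many attributes are being ignored in token generation, while a large "too many" bucket suggests needless tokens that slow down the system.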
We can examine token phrases on a rule-by-rule basis by using a drop-down menu. This allows us to examine individual graphs and get an idea of what our data looks like.
For more information, head to the Reltio Community and get answers to the questions that matter to you: https://community.reltio.com/home