Then the next record, E2, arrives. E2 generates a similar token, and the new record is associated with the token that already exists, landing in the same bucket as E1.
Without tokens, the only way to match an entire data set would be to compare every record with every other record, which takes a substantial amount of time because the number of comparisons grows quadratically with the number of records. Tokens, on the other hand, create smaller universes inside the data set: only the records inside a bucket are compared with one another, and nothing else.
So, if two records are not matching, first check whether they are in the same bucket. If they are not, that is why they are not matching; it is impossible for records to match while they sit in separate universes. This is why tokenization is so important.
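To make the bucket idea concrete, here is a minimal Python sketch of token-based bucketing. It is illustrative only, not Reltio's internal implementation, and the make_tokens() scheme (last name plus ZIP) is a hypothetical stand-in for whatever the match rules actually define.

```python
from collections import defaultdict

# token -> set of entity IDs that produced it (each token is a "bucket")
buckets = defaultdict(set)

def make_tokens(record):
    # Hypothetical scheme: one token built from last name + ZIP. A real
    # scheme is derived from the match rule configuration.
    return {f"{record['last_name'].lower()}|{record['zip']}"}

def index_record(entity_id, record):
    # Each token either creates a new bucket or joins an existing one,
    # which is how E2 lands next to E1.
    for token in make_tokens(record):
        buckets[token].add(entity_id)

index_record("E1", {"last_name": "Branson", "zip": "94105"})
index_record("E2", {"last_name": "Branson", "zip": "94105"})
print(dict(buckets))  # {'branson|94105': {'E1', 'E2'}} -> match candidates
```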
The Data Comparison Process
Once the tokens are created, they are picked up by the match rules. If there are four tokens for the entity E1, Reltio picks up the four corresponding buckets and constructs a list of possible candidates across all of them, because each of those token buckets includes E1. In the example depicted in the picture below, E1 may match with E2, E3, E4, E7, E8, and E9.
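Continuing the sketch above, the candidate list for an entity is simply the union of every bucket that contains it. The helper below is hypothetical and only illustrates this step:

```python
def candidates_for(entity_id, buckets):
    """Collect match candidates: every other entity that shares at
    least one token bucket with entity_id."""
    result = set()
    for members in buckets.values():
        if entity_id in members:
            result |= members - {entity_id}
    return result

# With buckets built from E1's four tokens, this returns
# {"E2", "E3", "E4", "E7", "E8", "E9"}.
print(candidates_for("E1", {"t1": {"E1", "E2", "E3"},
                            "t2": {"E1", "E4"},
                            "t3": {"E1", "E7", "E8"},
                            "t4": {"E1", "E9"}}))
```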
Once a candidate list for comparison is created, there may be multiple match rules to consider, and there is no hierarchy among them: matching with the different rules is triggered at once. So E1 and E2 will be picked first, and the matching for all rules will execute at the same time.
Every rule has an action associated with it: if the rule evaluates a pair to be true, whatever was configured for that rule is triggered. So, if E1 and E3 are matched by a rule marked as automatic, the pair will be merged automatically. On the other hand, if rule N is marked as suspect and evaluates E1 and E2 to be true, the pair will be submitted for data stewardship review. This is how the end-to-end matching process works.
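A sketch of that dispatch logic might look like the following. The rule structure, the predicates, and the merge/queue_for_review helpers are all hypothetical stand-ins for what the platform does internally:

```python
def merge(e1, e2):
    print(f"auto-merge {e1} + {e2}")

def queue_for_review(e1, e2):
    print(f"send {e1} + {e2} to data stewardship review")

def evaluate_pair(e1, e2, pair_attrs, rules):
    # No hierarchy: every rule is evaluated for the candidate pair,
    # and each rule that evaluates true triggers its configured action.
    for rule in rules:
        if rule["predicate"](pair_attrs):
            if rule["type"] == "automatic":
                merge(e1, e2)
            elif rule["type"] == "suspect":
                queue_for_review(e1, e2)

rules = [
    {"type": "automatic", "predicate": lambda a: a["exact_name"]},
    {"type": "suspect",   "predicate": lambda a: a["fuzzy_name"]},
]
evaluate_pair("E1", "E2", {"exact_name": False, "fuzzy_name": True}, rules)
```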
Tokenization Scenarios
Having multiple redundant match rules hurts your matching: the more tokens you create, the more evaluations Reltio must perform, whether each candidate pair turns out to be true or not. Here are some examples:
- Suppose you have six records, each with a different version of Michael Branson. One of them has a different state in its physical address, which indicates that it is probably a different person.
- Your sixth record, Rachel Branson, is most definitely a different person.
- If you do not define a tokenization scheme, each of these records will go into a separate bucket and never match.
- So, you ignore the first name while creating the tokenization, which creates a single Branson bucket. Now the Branson records can match, but if address line one is slightly different, they still do not land in the same bucket.
- So, follow the same approach and leave address line one out of the tokenization.
- Continue this process until all of your records are in the same bucket.
However, the Michael Branson with a different address, and Rachel Branson, are now in that bucket too. This is one example of a tokenization issue: loosening the scheme until every record buckets together also pulls in records that should not match.
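A small sketch of that progression, with illustrative attribute names: each attribute you ignore widens the bucket, until it also contains records that are clearly different people.

```python
def make_token(record, ignore=()):
    # Build one token from every attribute not in the ignore list.
    return "|".join(v for k, v in sorted(record.items()) if k not in ignore)

michael_other_state = {"first": "Michael", "last": "Branson",
                       "addr1": "9 Oak Ave", "state": "TX"}
rachel = {"first": "Rachel", "last": "Branson",
          "addr1": "1 Main St", "state": "CA"}

# Ignoring first name and address line one leaves last name + state...
print(make_token(rachel, ignore=("first", "addr1")))           # Branson|CA
# ...and ignoring state as well puts every Branson in one bucket,
# including Rachel and the out-of-state Michael.
print(make_token(michael_other_state, ignore=("first", "addr1", "state")))
print(make_token(rachel, ignore=("first", "addr1", "state")))  # identical token
```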
Tokenization Issues
Another tokenization issue is an excessive number of tokens for an entity. One cause is fuzzy matching on more than one multi-valued attribute; another is fuzzy matching on an attribute in combination with synonym matching.
Another example of a tokenization issue is an excessive number of entities sharing a token phrase. You may experience this when too many entities share the same set of values across different attributes, or when there are multiple duplicate copies of profiles from the source systems.
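The arithmetic behind the token explosion is easy to see in a sketch: fuzzy variants multiply across the values of a multi-valued attribute, so one entity generates their cross product. The names and counts below are illustrative:

```python
from itertools import product

# Fuzzy matching generates several token variants per name, and a
# multi-valued attribute contributes one token per value; the entity
# ends up with the cross product of the two.
name_variants = ["michael", "micheal", "mikael", "miguel"]  # fuzzy/synonym variants
phone_values = ["555-0100", "555-0101", "555-0102"]         # multi-valued attribute
tokens = {f"{n}|{p}" for n, p in product(name_variants, phone_values)}
print(len(tokens))  # 4 * 3 = 12 tokens for a single entity
```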
Configuration Examples
Let’s talk through another configuration example. In the webinar, you can watch our presenter, Suchen Chodankar, load all of the records in question in real time.
- Micahel Branson is listed with the same address.
- Miguel and Michael are listed, with different first name spellings.
- Michael is listed twice, exactly identical.
- We refresh and look at Michael Branson for potential matches.
- We see that Michael matched with Michael, so the match worked. It worked because the configuration was put in correctly.
- Next, Suchen removes rules two and three, leaving only the first rule.
- In that rule, he places fuzzy on the first name. The rule is otherwise unchanged, so the supporting token configuration is gone.
- This triggers the rebuilding and rematching.
- Now, Michael behaves very differently: it only matches the exactly identical Michael entity.
- So, he adds a double metaphone token so that Michael should match. He reviews the matches and finds that the pair did match by document, but not in the UI, as the final match outcome is false.
- This tells you that your tokenization is either wrong or insufficient.
- So, we must go back and find a common tokenization. Suchen chooses to try ignore, leaving the problem attribute out of the token (see the configuration sketch after this list).
- Now there is an intersecting token that matches, and the record returns in the UI.
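For reference, here is a hedged sketch of the kind of match group the walkthrough is editing, written as a Python dict that mirrors Reltio-style JSON configuration. The exact field and class names are assumptions and may differ in your tenant's configuration:

```python
# Assumed structure of a Reltio-style match group (field and class
# names are approximations, not authoritative).
match_group = {
    "uri": "configuration/entityTypes/Individual/matchGroups/Rule1",
    "type": "suspect",  # "automatic" would merge without review
    "rule": {
        "fuzzy": ["attributes/FirstName"],  # compare first names fuzzily
        "exact": ["attributes/LastName"],
        "matchTokenClasses": [{
            # double metaphone, so Michael/Micahel tokenize together
            "attribute": "attributes/FirstName",
            "class": "com.reltio.match.token.DoubleMetaphoneMatchToken",
        }],
        # leave a noisy attribute out of token generation ("ignore")
        "ignoreInToken": ["attributes/AddressLine1"],
    },
}
```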
All told, it is of the utmost importance to configure your tokenization carefully, so that you get optimal performance and the desired output when forming match pairs.
Rule-Based Matching
Rule-based matching is a complex process. If you have any further questions about this or any other Reltio process, simply post to our community page, and look ahead to further webinars on this subject and many others coming up at Reltio.
Relevant content:
- Rule-Based Matching: Matching Anatomy
- The Rule-Based Matching Process
- Watch the Rule-Based Matching: A Deep Dive webinar to understand more.