The Rule-Based Matching Process

By Suchen Chodankar posted 10-31-2021 17:49

Reltio’s webinar, Rule-Based Matching: A Deep Dive, complements another webinar, Reltio MDM Matching and Merging. In a previous blog, we broke down the beginning of this webinar by reviewing the match tuning best practices presented by Joel Snipes.

Then, we began to break down matching into a general overview. Here, we discussed how matching is the process of identifying records that are either identical or similar, and how Reltio can assist you in completing this process with rule-based matching.

Rule-based matching is based on instructions: the configuration provides the instructions, and Reltio executes the matching that those instructions dictate. Then, Reltio either merges these records automatically or creates a suspect match for the data stewards to review and resolve manually.

Then, we discussed the anatomy of these instructions, or match rules. We also gave examples of different properties of a match rule, like uri, label, and type. We broke down the components of each match rule configuration, and then considered certain tuning APIs and strategies for match troubleshooting.

Now, we’d like to touch on that API example and a few notes on important questions posed by our users that may be useful before beginning to break down the full process of rule-based matching.

An API Example

To view these APIs in real time, be sure to tune into the webinar and watch our presenter, Suchen Chodankar, perform many of the aforementioned processes. For example, he starts in his Reltio tenant with zero records available. He uses the data modeler to review the list of rules available. In this example, he has five rules.

In one of these rules, the scope is internal, so the rule is used only for intra-tenant matching. If a rule is configured as external, the rule will not be used for matching your tenant data. Internal, relevance-based, and custom rules are also listed. He loads a record and walks us through the matching and merging of that record, according to the rules.
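To make the effect of scope concrete, here is a minimal sketch of filtering a rule list by scope. The dictionaries are a simplified, hypothetical stand-in and not Reltio's actual configuration schema: the uri, label, and type values are invented for illustration, and the scope strings are assumed.

```python
# Hypothetical, simplified rule list. Only the property names (uri, label,
# type, scope) come from the discussion above; every value is illustrative.
rules = [
    {"uri": "rule-1", "label": "Exact name and address", "type": "automatic", "scope": "INTERNAL"},
    {"uri": "rule-2", "label": "Fuzzy name", "type": "suspect", "scope": "EXTERNAL"},
    {"uri": "rule-3", "label": "Relevance-based name", "type": "relevance_based", "scope": "INTERNAL"},
]


def rules_for_tenant_matching(all_rules):
    """Keep only the rules that take part in intra-tenant matching."""
    return [rule for rule in all_rules if rule["scope"] == "INTERNAL"]


internal_labels = [rule["label"] for rule in rules_for_tenant_matching(rules)]
print(internal_labels)
```

In this sketch the external rule is simply skipped when tenant data is matched, which mirrors the behavior described in the webinar.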

A Note on Complex Match Rules

You may wonder whether you can design match rules that concatenate a set of attributes before matching. For example, it may seem easier to compare an entire address against another address, instead of comparing each part of the address individually.

However, the more complexity you bring into a match rule, the slower it will run. You may also hurt other match pairs that you would rather have kept. Instead, create that value and store it in a different attribute. Then, use it in your matching. In other words, create a concatenated string, push it to a different attribute, and use the same concatenation on the other records as well.
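As a rough illustration of that advice, the sketch below builds a concatenated address value before the record is loaded, so that a simple rule can match on the derived attribute. The field names, the "|" separator, and the upper-casing are assumptions made for this example, not Reltio conventions.

```python
# Hypothetical source fields that make up the derived attribute.
ADDRESS_FIELDS = ("address_line_1", "city", "postal_code")


def build_address_key(record: dict) -> str:
    """Concatenate the address parts into one normalized string."""
    cleaned = (
        " ".join(str(record.get(field) or "").split()).upper()
        for field in ADDRESS_FIELDS
    )
    # Skip parts that are missing or blank after whitespace cleanup.
    return "|".join(part for part in cleaned if part)


record = {"address_line_1": "100 Main  St", "city": "Springfield", "postal_code": "12345"}
# Store the result in a separate attribute and match on that attribute.
record["address_key"] = build_address_key(record)
print(record["address_key"])  # 100 MAIN ST|SPRINGFIELD|12345
```

Applying the same function to every record keeps the derived values comparable across profiles.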

A Note on the Sub-String of a Given Attribute

Can users in Reltio match on a sub-string of a given attribute, like character two to seven of one of their crosswalk attributes? The answer is yes. Reltio users may use a custom comparator to match the sub-string. Furthermore, if it is an organization name, or something similar, they may use a comparator.

However, characters seven to eleven are not as easy to configure. All told, it is better to have a standardized or trimmed attribute, as creating a new attribute in Reltio only takes a matter of seconds. So, create the sub-string, push it into a new attribute, and follow the same process for all profiles so that you can apply different algorithms to that derived attribute.
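A derived sub-string attribute of this kind could be prepared before loading, along the lines of the sketch below. The function and attribute names are hypothetical, and the positions are treated as 1-based and inclusive, which is an assumption about how "character two to seven" is meant.

```python
def derive_substring(value: str, start: int, end: int) -> str:
    """Return characters start..end (1-based, inclusive) of the trimmed value."""
    trimmed = value.strip()
    return trimmed[start - 1:end]


profile = {"source_id": "  AB123456XYZ "}
# Push the sub-string into its own attribute so ordinary comparators can use it.
profile["source_id_2_to_7"] = derive_substring(profile["source_id"], 2, 7)
print(profile["source_id_2_to_7"])  # B12345
```

Because the value is trimmed first, leading or trailing spaces in the source data do not shift the character positions.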

A Note on Multiple Match Actions

Reltio supports multiple match actions. However, this strategy is only supported when using relevance-based matching.

A Note on External Files

Reltio does not create a file format for external matches based on entity. The external file is something that you bring to the platform, and comma-separated values (CSV) is the only file format currently supported.

The Matching Process Overview

Wondering if these rules have a hierarchy? The matching process overview will cover that question and more.

The matching process begins with a data load. After this data is pushed into Reltio, it is placed in primary storage. You will always find this initially loaded data in primary storage.

After the data is in primary storage, information from that data is received in the CRUD queue. The CRUD queue receives every update and every new record pushed into Reltio. Messages flow through the CRUD queue, and you have access to this information from the queue monitor in your data tenant. The queue then publishes the data to all other consumers, which take the data and update it back to a component of the Reltio platform.

These components might be Reltio UI, Analytics, other services, or the Match Document Processor. Whenever a new record is created, the queue informs these components.

The Match

Once the Match Document Processor receives the copy of the new record, it immediately creates a match document. This may be where your API practice comes in. If you return to your match document API, you will be able to see all of the attributes and sub-attributes that you have already configured in your match rule.

Along with these raw values, you will also see the tokens created for the entity and the rules you have in the system.

After the update, a new record is created. This process is repeated for every single record.

If the record has been updated, it’s time to see whether something about the record changed. This will indicate whether or not it makes sense to run the match. If nothing has changed, the process is complete. If something has changed, we proceed to the match queue.

The match queue is the component that triggers matching and comparison. After the message is delivered here, the matching service triggers these actions. If a pair is found, a potential match is created. If a pair matches and needs to be automatically merged, the record is updated and placed back in the correct queue. After the merge, you must then consider if a recalculation is necessary to get more matches, or if you must remove some existing matches. The cycle continues, as each merge changes your data set.
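The flow above can be sketched in a few lines of Python. This is a toy model and not Reltio's implementation: the attribute names, status strings, and function names are all hypothetical, and it assumes that only changes to match attributes make a rematch worthwhile.

```python
from collections import deque

# Hypothetical attributes that take part in matching.
MATCH_ATTRIBUTES = ("first_name", "last_name", "address_line_1")


def needs_rematch(previous, current):
    """A new record, or a change to any match attribute, warrants a match run."""
    if previous is None:
        return True
    return any(previous.get(a) != current.get(a) for a in MATCH_ATTRIBUTES)


def on_record_event(previous, current, match_queue):
    """Stop when nothing relevant changed; otherwise send the record to the match queue."""
    if not needs_rematch(previous, current):
        return "complete"
    match_queue.append(current["id"])
    return "queued"


def on_pair_found(pair, auto_merge, match_queue):
    """Either merge and re-queue the surviving record, or record a potential match."""
    if auto_merge:
        match_queue.append(pair[0])  # the merged record may now match others
        return "merged"
    return "potential_match"


queue = deque()
old = {"id": "e1", "first_name": "Ann", "last_name": "Lee", "phone": "555-0100"}
status_unchanged = on_record_event(old, {**old, "phone": "555-0199"}, queue)
status_changed = on_record_event(old, {**old, "last_name": "Li"}, queue)
print(status_unchanged, status_changed, list(queue))
```

Re-queuing the surviving record after a merge is what keeps the cycle going: the merged data may produce new matches or invalidate existing ones.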

Data Tokenization Process

When the match document is created, token phrases are created along with it. Each rule has its own tokenization rule that has been constructed in the background. In other words, for every rule there is a tokenization rule.

Every attribute that participates in the rule, or that is configured as a match attribute, becomes part of the token phrase. For example, for a rule with the attributes first name, last name, and address line one, the token phrase consists of the first name, a colon, the last name, a colon, address line one, and so on.
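A token phrase of that shape can be mocked up as follows. This is only a sketch of the colon-joined format described above: the rule names and attribute names are hypothetical, and the real tokenization also applies whatever token classes the rule configures, which this example leaves out.

```python
def token_phrase(record: dict, rule_attributes) -> str:
    """Join the values of a rule's match attributes with colons."""
    values = (str(record.get(attr, "")).strip() for attr in rule_attributes)
    return ":".join(values)


# Hypothetical rules, each listing the attributes that feed its token phrase.
RULES = {
    "name_and_address": ("first_name", "last_name", "address_line_1"),
    "name_only": ("first_name", "last_name"),
}

record = {"first_name": "Ann", "last_name": "Lee", "address_line_1": "100 Main St"}
phrases = {name: token_phrase(record, attrs) for name, attrs in RULES.items()}
print(phrases["name_and_address"])  # Ann:Lee:100 Main St
print(phrases["name_only"])         # Ann:Lee
```

Each rule yields its own phrase for the same record, which is why a single entity can carry several tokens.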

Multiple rules and tokens may generate the same results. This may happen if two rules are very similar, or if two rules are aligning on the same token. Fuzzy matching is a common cause of this. If you have certain conditions in your match rule, like exact or null, all null, not equal, and in, some attributes may be automatically excluded from tokenization, resulting in better matches.