Reltio Connect

  • 1.  Can someone help me understand token generation better?

    Posted 08-04-2021 15:59
    We have noticed that when we use ignoreInToken, the tenant's performance is sometimes good, but sometimes it doesn't match on similar values for two different entities.

    Kartik Shah

  • 2.  RE: Can someone help me understand token generation better?
    Best Answer

    Reltio Employee
    Posted 08-04-2021 16:02

    @Kartik Shah

    Great question. Let me talk about what a match token is first. A match token is basically a shortcut: instead of comparing every record in one data set to every other record, it reduces the data to likely matches so that there's a smaller set to compare. It does that by taking the most identifying attributes and stringing them together into what we call a token. When those tokens match, it then considers the other attributes that are not in the token.
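    To make the idea concrete, here is a minimal sketch of token-based candidate selection. It is an illustration of the general technique only, not Reltio's implementation; the records, attribute names, and the `make_token` and `candidate_pairs` helpers are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records; attribute names are illustrative only.
records = [
    {"id": 1, "last_name": "Smith", "email": "j.smith@example.com"},
    {"id": 2, "last_name": "Smith", "email": "j.smith@example.com"},
    {"id": 3, "last_name": "Smith", "email": "a.smith@example.com"},
    {"id": 4, "last_name": "Jones", "email": "b.jones@example.com"},
]


def make_token(record, token_attrs):
    """String the normalized values of the token attributes together."""
    return "|".join(str(record[a]).strip().lower() for a in token_attrs)


def candidate_pairs(records, token_attrs):
    """Group records by token and return the id pairs that share a token."""
    buckets = defaultdict(list)
    for record in records:
        buckets[make_token(record, token_attrs)].append(record["id"])
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Without tokens, every record is compared with every other record.
all_pairs = set(combinations([r["id"] for r in records], 2))

# A token built from a common last name still groups all the Smiths.
by_last_name = candidate_pairs(records, ["last_name"])

# Adding a more identifying attribute shrinks the candidate set further.
by_name_and_email = candidate_pairs(records, ["last_name", "email"])

print(len(all_pairs), len(by_last_name), len(by_name_and_email))
```

    Only the pairs returned by `candidate_pairs` would go on to the detailed comparison, which is where the performance saving comes from.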

    For example, if you had last name as one of your match attributes, and you had a data set with 10,000 Smiths, with that as a token or a major part of the token, you could end up comparing many, many Smiths to one another and creating a huge performance overhead. Marking last name with ignoreInToken in that scenario would improve performance, but the danger in ignoreInToken is that you are going to reduce the number of things being compared, so you are more likely to miss matches. There is a bit of an art to deciding which things need to be in a token and which don't. I have some base rules here on things that should pretty much always be marked with ignoreInToken. If you are doing a negative rule, where something does not equal a certain value, you almost always want to have ignoreInToken turned on for that. The same goes whenever you are using ExactOrNull or ExactAndAllNull. Say you were comparing on a suffix, and you had John Sr and John Jr.

    A lot of times suffix might not be populated, so if you just do an exact match, and suffix was populated on one record (John Sr) and empty on the other (just John), they would not match, because null does not match Sr. But ExactOrNull would allow Sr to match null, and ExactAndAllNull would allow null to match null. Whenever you are using ExactOrNull or ExactAndAllNull, you want to make sure ignoreInToken is turned on so that you don't create too many tokens and reduce your match quality.

    The threshold character setting, available with some of the matching tokens, requires a certain number of characters to be compared before a value is considered for a match. If you have two short strings that match, like Inc, the match might be meaningless, and it wouldn't meet a threshold of, say, five characters.
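    The suffix example above can be sketched as three small comparison functions. This is a minimal sketch that follows the behaviour described in this post, not Reltio's code; the function names are hypothetical, and treating two missing values as a match under `exact_or_null` is an assumption.

```python
from typing import Optional


def exact(a: Optional[str], b: Optional[str]) -> bool:
    """Plain exact match: both values must be present and equal."""
    return a is not None and b is not None and a == b


def exact_or_null(a: Optional[str], b: Optional[str]) -> bool:
    """Match when the values are equal or when either side is missing.

    Assumption: two missing values also count as a match here.
    """
    return a is None or b is None or a == b


def exact_and_all_null(a: Optional[str], b: Optional[str]) -> bool:
    """Match when the values are equal or when both sides are missing."""
    return (a is None and b is None) or exact(a, b)


# John Sr compared with a John whose suffix is not populated.
print(exact("Sr", None))               # False: null does not match Sr
print(exact_or_null("Sr", None))       # True: Sr is allowed to match null
print(exact_and_all_null(None, None))  # True: null is allowed to match null
print(exact_or_null("Sr", "Jr"))       # False: two different populated values
```

    Because a missing value can match almost anything under these comparisons, a token built from that attribute carries little identifying information, which is one way to read the advice to keep it out of the token.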

    Anytime you are using the threshold character setting, you want ignoreInToken turned on, because small strings are more likely to generate many matches in the tokens. The same goes for low cardinality attributes. With things like gender, you don't need gender considered in your token; you want to bring records together on the identifying attributes. Then, after the tokens match, you want to reduce the set with the differentiating attributes like gender or name. So for low cardinality attributes you want ignoreInToken turned on, and for high cardinality attributes you want the token to consider them. Things like IDs, Social Security numbers, and emails are good candidates to keep in the token, that is, not to mark with ignoreInToken.
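    One rough way to see which attributes are low or high cardinality is to compare the number of distinct values with the number of populated values. The sketch below is illustrative only; the sample records and the `cardinality_ratio` helper are hypothetical, and where to draw the line between "low" and "high" is a judgment call for your own data.

```python
# Hypothetical sample of entity attribute values.
records = [
    {"gender": "M", "last_name": "Smith", "email": "j.smith@example.com"},
    {"gender": "M", "last_name": "Smith", "email": "john.s@example.com"},
    {"gender": "F", "last_name": "Smith", "email": "a.smith@example.com"},
    {"gender": "F", "last_name": "Jones", "email": "b.jones@example.com"},
]


def cardinality_ratio(records, attr):
    """Distinct values divided by populated values (0.0 if none populated)."""
    values = [r[attr] for r in records if r.get(attr) is not None]
    return len(set(values)) / len(values) if values else 0.0


for attr in ("gender", "last_name", "email"):
    print(attr, cardinality_ratio(records, attr))
```

    On a realistic data set, an attribute like gender has a ratio close to 0 and is a natural candidate for ignoreInToken, while an attribute like email has a ratio close to 1 and is worth keeping in the token.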

    And when you have similar match rules that are generating similar tokens, you might want to consider ignoreInToken on most of the attributes in one of those rules, so that you're getting fewer tokens in your set, and that'll improve your performance. So to go back to the question specifically: if you're not getting the matches you expect, I would say you probably have too many things in ignoreInToken. I would go back and make sure that your high cardinality attributes are not set to ignoreInToken, and that might help you there. A little later in the presentation below, I show how to troubleshoot some matches.

    Make sure you check out the webinar I hosted on the Community below:

    Joel Snipes

  • 3.  RE: Can someone help me understand token generation better?

    Reltio Employee
    Posted 08-10-2021 06:56

    @Kartik Shah

    Matching has 2 separate parts. The first is the tokenization process and the second is the comparison process.

    Tokenization is needed to find pairs of entities for comparison. The comparison process is a more detailed comparison of entities and works with the candidate pairs produced by tokenization. We need to build good tokens that give us only potential matches for comparison. In some cases we can use all attributes for tokens, but with fuzzy matching or a large number of attributes, the ignoreInToken element helps to reduce the number of tokens. Since the token is the concatenation of attribute values, fewer attribute values -> fewer tokens -> fewer pairs for comparison -> better performance.
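    The chain "fewer attribute values -> fewer tokens" can be sketched as follows, assuming (for illustration only) that each attribute contributes a list of token variants and that an entity's tokens are every combination of one variant per attribute. The variant lists and the `tokens_for` helper are hypothetical and are not taken from Reltio's implementation.

```python
from itertools import product

# Hypothetical token variants per attribute for one entity; a fuzzy
# attribute typically contributes several variants.
entity_variants = {
    "first_name": ["john", "jon", "jonathan"],
    "last_name": ["smith", "smyth"],
    "suffix": ["sr", "<null>"],
}


def tokens_for(variants, ignore_in_token=()):
    """Concatenate one variant per attribute, skipping ignored attributes."""
    attrs = [a for a in variants if a not in ignore_in_token]
    return {"|".join(combo) for combo in product(*(variants[a] for a in attrs))}


all_tokens = tokens_for(entity_variants)
without_suffix = tokens_for(entity_variants, ignore_in_token=("suffix",))

print(len(all_tokens))      # 3 * 2 * 2 = 12 tokens
print(len(without_suffix))  # 3 * 2 = 6 tokens
```

    Under this model each attribute left in the token multiplies the token count, so removing one with ignoreInToken divides it by that attribute's number of variants.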

    You can find more details in this documentation: ignoreInToken

    The ignoreInToken element prevents the generation of tokens for attributes that are specified within it. The ignoreInToken functionality is used to suppress generation of tokens for certain attributes when you feel those tokens will not serve a meaningful benefit toward the goal of finding match candidates and will reduce the performance of your rules due to either the quantity of tokens generated or the quantity of match candidates returned.


    Alexander Gusarov

  • 4.  RE: Can someone help me understand token generation better?

    Posted 08-11-2021 09:35
    Kartik, if you would like a fuller and more detailed treatment of tokens and token generation, I encourage you to watch module 6.020, "Identifying Match Candidate Pairs through Tokenization", within the course Creating Reltio Match Rules, which is available at Reltio Academy.

    Curt Pearlman
    Next Phase Solutions and Services, Inc
    Baltimore MD