@Kartik Shah,
Great question. Let me first explain what a match token is. A match token is a shortcut that avoids comparing every record in one data set against every record in another: it narrows things down to likely matches so that there is a much smaller set to compare. It does that by taking the most identifying attributes and stringing them together into what we call a token. When two records' tokens match, the match rule then considers the other attributes that are not in the token.
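To make the idea concrete, here is a minimal sketch of token-based candidate selection in Python. It is only an illustration of the concept, not how the platform generates tokens internally; the function names (`build_token`, `candidate_pairs`) and the sample records are made up for this example.

```python
from collections import defaultdict
from itertools import combinations

def build_token(record, token_attrs):
    """String the normalized values of the token attributes together."""
    return "|".join(str(record.get(a, "")).strip().lower() for a in token_attrs)

def candidate_pairs(records, token_attrs):
    """Group records by token and return only the pairs that share a token."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[build_token(rec, token_attrs)].append(rec["id"])
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

# Hypothetical sample data, for illustration only.
records = [
    {"id": 1, "last_name": "Smith", "email": "j.smith@example.com"},
    {"id": 2, "last_name": "Smith", "email": "j.smith@example.com"},
    {"id": 3, "last_name": "Smith", "email": "a.smith@example.com"},
    {"id": 4, "last_name": "Jones", "email": "b.jones@example.com"},
]

all_pairs = set(combinations([r["id"] for r in records], 2))  # no tokens at all
by_last_name = candidate_pairs(records, ["last_name"])        # every Smith vs. every Smith
by_email = candidate_pairs(records, ["email"])                # far fewer candidates

print(len(all_pairs), len(by_last_name), len(by_email))
```

With a token built on last name, every Smith is still compared with every other Smith; a token built on a more identifying attribute leaves far fewer pairs for the rest of the rule to evaluate.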
For example, if last name is one of your match attributes and your data set has 10,000 Smiths, then with last name as the token, or a major part of the token, you could end up comparing many, many Smiths to one another and creating a huge performance overhead. Marking last name with ignoreInToken in that scenario would improve performance. The danger with ignoreInToken is that you reduce the number of records being compared, so you are more likely to miss matches. There is a bit of an art to deciding which attributes need to be in a token and which don't, but I have some base rules for things that should pretty much always be marked with ignoreInToken.
If you are doing a negative rule, where something does not equal a certain value, you almost always want ignoreInToken turned on for that attribute.
The same goes for ExactOrNull and ExactOrAllNull. Say you are comparing on a suffix, and you have John Sr and John Jr. A lot of the time suffix is not populated, so with a plain exact match, if suffix is populated on one record (John Sr) and the other record is just John, they would not match, because null does not match Sr. ExactOrNull would allow Sr to match null, and ExactOrAllNull would allow null to match null. Whenever you use ExactOrNull or ExactOrAllNull, make sure ignoreInToken is turned on so that you don't create too many tokens and reduce your match quality.
The threshold character setting, available through some of the matching tokens, lets you require that a certain number of characters be compared before a value is considered for a match. Two short strings that match, like Inc, might be meaningless, and they wouldn't meet a threshold of, say, five characters.
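The comparison behaviors described above (null-tolerant exact matches and a minimum character threshold) can be sketched roughly as follows. These Python functions are my own simplified model for illustration, not the platform's implementation; in particular, treating a plain exact match as requiring both values to be populated, and the default of five characters, are assumptions.

```python
def exact(a, b):
    """Plain exact match: both values must be populated and equal (assumed)."""
    return a is not None and b is not None and a == b

def exact_or_null(a, b):
    """Exact match that also tolerates a missing value on either side."""
    return a is None or b is None or a == b

def exact_or_all_null(a, b):
    """Exact match that also accepts the case where both values are missing."""
    return (a is None and b is None) or exact(a, b)

def exact_with_threshold(a, b, min_chars=5):
    """Exact match that only counts when the value has at least min_chars characters."""
    return exact(a, b) and len(a) >= min_chars

print(exact("Sr", None))                                       # False
print(exact_or_null("Sr", None))                               # True
print(exact_or_null("Sr", "Jr"))                               # False
print(exact_or_all_null(None, None))                           # True
print(exact_with_threshold("Inc", "Inc"))                      # False
print(exact_with_threshold("Incorporated", "Incorporated"))    # True
```

Because a null suffix or a three-character string tells you very little about who a record is, values like these are poor material for a token, which is why these comparisons pair well with ignoreInToken.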
Anytime you are using the threshold character setting, you want ignoreInToken turned on, because short strings are more likely to generate many matches in the tokens. The same applies to low-cardinality attributes. With something like gender, you don't need it considered in your token; you want the token to bring records together on the identifying attributes, and then, after the tokens match, reduce the set with the differentiating attributes like gender or name. So for low-cardinality attributes you want ignoreInToken turned on, and for high-cardinality attributes you want the token to consider them. IDs, Social Security numbers, and emails are good candidates to keep in the token, that is, not to mark with ignoreInToken.
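If you want a quick, rough way to see which attributes are low or high cardinality in a sample of your data, something like the following Python sketch can help. The 0.9 cutoff, the function names, and the sample records are all hypothetical; treat the output as a starting point for the judgment call, not a rule.

```python
def cardinality_ratio(records, attr):
    """Distinct populated values divided by populated values for one attribute."""
    values = [r[attr] for r in records if r.get(attr) not in (None, "")]
    return len(set(values)) / len(values) if values else 0.0

def suggest_token_attributes(records, attrs, cutoff=0.9):
    """Split attributes into 'keep in token' and 'ignoreInToken candidates'."""
    keep, ignore = [], []
    for attr in attrs:
        (keep if cardinality_ratio(records, attr) >= cutoff else ignore).append(attr)
    return keep, ignore

# Hypothetical sample data, for illustration only.
sample = [
    {"email": "a@example.com", "gender": "M"},
    {"email": "b@example.com", "gender": "F"},
    {"email": "c@example.com", "gender": "M"},
    {"email": "d@example.com", "gender": "F"},
    {"email": "e@example.com", "gender": "M"},
    {"email": "f@example.com", "gender": "M"},
]

keep, ignore = suggest_token_attributes(sample, ["email", "gender"])
print("keep in token:", keep)
print("ignoreInToken candidates:", ignore)
```

Here email is unique per record, so it stays in the token, while gender has only two distinct values across six records and ends up as an ignoreInToken candidate.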
When you have similar match rules that generate similar tokens, consider turning on ignoreInToken for most of the attributes in one of those rules, so that you get fewer tokens in your set, which will improve performance. To go back to your question specifically: if you're not getting matches you expect to get, you probably have too many attributes marked ignoreInToken. I would go back and make sure that your high-cardinality attributes are not set to ignoreInToken; that might help you there. A little later in the presentation below, I also show how to troubleshoot matches.
Make sure you check out the webinar I hosted on the Community below:
------------------------------
Joel Snipes
------------------------------
Original Message:
Sent: 08-04-2021 15:58
From: Kartik Shah
Subject: Can someone help me understand token generation better?
We have noticed that when we use ignoreInToken, sometimes the tenant's performance is good, but sometimes it doesn't match on similar values for two different entities.
------------------------------
Kartik Shah
BCBSNC
------------------------------