Join Suchen Chodankar, Principal Product Manager at Reltio, and Chris Detzel, Director of Customer Community and Engagement, for another Community show on the topic of Traditional Match Rule Based Matching. In this one-hour session, you, the audience, will come with questions to ask Suchen about all things traditional match rules within Reltio.
Check out the Reltio Community for questions about Match Rules. Reltio Online Master Data Management Community: https://community.reltio.com/home
Questions asked here: https://community.reltio.com/discussion/ask-me-anything-traditional-match-rule-based-matching
Find the Transcript here
Chris Detzel:
Again, ask your questions either in the chat and/or you can take yourself off mute and ask directly. I do have some questions directly and quickly let me share my screen. We do have actually two more shows coming up. I have not pushed those out yet, but one of the shows is called, The Reltio Way Delivery Methodology Implementation. So I've heard from you some of the shows that you want and one is about our implementation team and our methodology and things like that. So please sign up for that and then we've got a couple more that I haven't pushed out yet, but they're coming soon. So let me get out of that real quick. And as I mentioned a little bit ago, I'm having potential audio issues or internet issues, so if I drop it's okay. Suchen will take it and I'll get back on, I promise. So our first question does come from the community. Suchen, are you ready? All right. Get your questions ready everyone, but right. And the first question is, Suchen you can ask it if you want that. That'd be fine. Go ahead or I can do it.
Suchen Chodankar:
Yeah. So Chris, actually my audio is bit noisy so can you please take it over?
Chris Detzel:
Yeah, I got it. So the first question is regarding traditional matching rules, which follow the edit distance concept. Based on reviewing Reltio's advanced match, I can see, how am I going to say this, that the Damerau-Levenshtein comparator functionality matches the same. So does the scoring algorithm at the back end also work the same way? For example, differences between the patterns used on multiple hops in determining the score between two strings. Suchen, thoughts there?
Suchen Chodankar:
Yeah, actually you know what? I'm going to share my screen and we can just play around some of the examples. It'll be fun that way. Right? Let me do...
Chris Detzel:
Here's the question in the chat too.
Suchen Chodankar:
Okay. So I have prepared a few examples, more generic ones, that we can use. All right. I think the question is about the algorithms that we use, and actually it's not really a question about a specific comparator, right? In general the question is, the formula that we use for calculating the various similarities, does that basically translate into the raw value of the underlying function in relevance-based matching? So just for context, we have two types of rules within rule-based matching. One is binary-outcome-based matching, the other one is relevance-based matching, which is basically score-based matching. In binary-outcome-based matching, we use the output of the underlying function, in this case the Levenshtein algorithm. For example, take an edit distance of three or more, right? In the binary outcome, the answer is true or false. If it falls within the range of the Levenshtein algorithm, then the answer is true.
If it is outside of that range, the answer of that comparison is false. So as you can see here, I have a rule that is configured here, which is the Levenshtein algorithm for first name. And the data that I'm looking at, let me pull that up very quickly. So I'm looking at these two records here, Robbie and Robert Smith. As you can see, the number of replacements or edits that you'll have to do to make these two strings the same is more than three. So there is an E, there is an R, there is a T which is missing here, and there is obviously the insertion also, so that makes it more than three, which is why the result is coming out as false. So let me go back here. The match comparison between these two strings for first name is coming out as false. Now if we go back to the same rule, but now configured as a relevance-based match rule, you will see that it is giving a relevance score of 0.5, which means it is a 50% match.
Now to answer your question, the score that comes out of the Levenshtein algorithm is not used as is. We have our internal algorithm for generating the relevance score, and for the Levenshtein algorithm, I think I documented it here so that we all can see it. It's the distance between s1 and s2, which is what the Levenshtein algorithm provides, and then we divide it by the maximum of the length of s1 and the length of s2. So that is for the Levenshtein algorithm. For the Damerau one, we have a slightly different one here. So I think the answer that you're looking for is whether we are using the value that is coming from this algorithm as is, and the answer is no. This gives us a similarity of 50% between Robert and Robbie. If we were to use the Levenshtein algorithm as is, this could be slightly lower, which is why we are not using the raw value. I hope that answers your question.
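The difference between the binary outcome and the relevance score described above can be sketched in a few lines of Python. This is an illustration only, not Reltio's implementation: the function names are made up, the threshold of "fewer than three edits" is taken from the spoken example, and the `1 - distance / max length` form is an assumption (the speaker only states the ratio of distance to maximum length; for this pair both readings give 0.5).

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def binary_match(a: str, b: str, max_distance: int) -> bool:
    """Binary-outcome style: true only if the distance is within the allowed range."""
    return levenshtein(a, b) <= max_distance


def relevance_score(a: str, b: str) -> float:
    """Assumed normalisation: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


print(levenshtein("Robbie", "Robert"))                   # 3
print(binary_match("Robbie", "Robert", max_distance=2))  # False
print(relevance_score("Robbie", "Robert"))               # 0.5
```

The same pair of names is rejected by the binary rule but still carries a 0.5 score under the relevance rule, which is the behaviour shown in the demo.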
Chris Detzel:
Truly Suchen yeah. Thank you so much.
Suchen Chodankar:
Sure. Chris, do you mind if I just read through this one because it's the same questions that I got from you?
Chris Detzel:
Yes, sorry about that. Go ahead. Yeah. And I'll start looking at some of the questions directly from the chat once you kind of answer some of these. So yep, go ahead. Just if you can read the question. Yeah, just read the question.
Suchen Chodankar:
Or you can just add to this document, then we all get to kind of see it here in one.
Chris Detzel:
That's good yeah. And I'll be posting these on the community at some point soon. So go ahead.
Suchen Chodankar:
Okay. All right, so the next one is from Mark. How is matching impacted by multiple survivorship groups? Again, just to provide a little bit of background here for anyone who is not familiar with survivorship groups or multiple survivorship groups. In Reltio we have this concept where you can define different sets of survivorship groups for different consuming systems or applications and whatnot. For example, marketing could say, I trust the ABC source system when it comes to first name or the address. Now a different application could say, I trust XYZ when it comes to address, for whatever purpose. So you get to define these multiple survivorship groups. When you request an entity, you basically will get a version based on your role, which is tied to that survivorship group. So the question here is, how does that impact the matching?
How is matching impacted by multiple survivorship groups? So if I have survivorship group one telling me that 101 North Street is the surviving address and survivorship group two telling me that 101 South Street is the surviving one, right, this matters a lot, because when you're comparing and your rule says match only the surviving value, it matters which value you're going to pick. So the answer here is that matching is not impacted by this, because we do not consider multiple survivorship groups.
There is a flag on the survivorship group which customer gets to set as a default survivorship group and that is what is used by matching in the matching process. So I see an opportunity here, probably an enhancement in future where you could say that, I don't like the default one for matching. Can you use the matching for generating this potential match using this different survivorship group? That is currently not supported. So the one which is marked as default will be picked up and the values that comes out from that match group calculation will be used for matching purpose. Let me know if there are any follow up questions. Feel free to interrupt me, otherwise I'm just going to go through the list here.
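A minimal sketch of the behaviour just described, assuming a simplified model in which each survivorship group trusts exactly one source. The class, field names, and the group names `marketing` and `billing` are hypothetical and are not Reltio configuration; the point is only that the value handed to matching comes from the group flagged as default.

```python
from dataclasses import dataclass


@dataclass
class SurvivorshipGroup:
    name: str
    trusted_source: str       # simplification: one trusted source per group
    is_default: bool = False  # the flag the customer sets


def surviving_value(values_by_source: dict, group: SurvivorshipGroup) -> str:
    """Value that survives under the given group."""
    return values_by_source[group.trusted_source]


def value_used_for_matching(values_by_source: dict, groups: list) -> str:
    """Matching only ever looks at the group flagged as default."""
    default_group = next(g for g in groups if g.is_default)
    return surviving_value(values_by_source, default_group)


address_by_source = {"ABC": "101 North Street", "XYZ": "101 South Street"}
groups = [
    SurvivorshipGroup("marketing", trusted_source="ABC", is_default=True),
    SurvivorshipGroup("billing", trusted_source="XYZ"),
]

print(surviving_value(address_by_source, groups[1]))       # 101 South Street
print(value_used_for_matching(address_by_source, groups))  # 101 North Street
```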
Chris Detzel:
Yeah, just keep going.
Mark Burlock:
Okay. Actually could just on top of that if I can Suchen. So with that explanation makes total sense. Are there, just from your experience, any gotchas or are folks using that? I mean are you seeing people doing that and are any gotchas that we need to be just obviously concerned about?
Suchen Chodankar:
No. Usually, especially for the automatic merges, you want to have an agreement between these different applications on one survivorship rule. So it is very important that you share that information with the rest of the applications, which might have configured slightly different match groups. It is very common for people to assume that since we allow multiple survivorship groups, there is a way for the matching engine to pick up on those different survivorship groups. So I think that's the only thing I will point out here: it is very important to call out to the consuming applications that the survivorship rule that is marked as default is the only one that is going to get used. And it becomes super important when you are using automatic merging, because if you do not agree with that value, then everybody is basically going to disagree with the version that comes out as the consolidated value.
With potential matches it doesn't matter as much, because you look at the pair, you don't agree, you change the match rule and you're fine. But the moment you consolidate, it's kind of committed. And then if you don't agree with that, you'll have the task of unmerging them and re-merging them and those kinds of things. I would just point out that I've seen this, where people assume that since we allow multiple survivorship groups based on the roles and whatnot, it'll automatically pick the survivorship group during the match calculation, which is not the case.
Mark Burlock:
Okay, terrific. And just a couple of other things that we can cover with this, because this is great by the way, so I appreciate it. The potential matches, though, do they also use the default survivorship group then?
Suchen Chodankar:
Yes. So the way it works is like, and let me bring up this one. I like to use this one. So this is the overall matching process. When the data is loaded here, it moves from primary into all these different secondary storages that we have, which are used for UI, analytics, history, logging and all that kind of stuff. Matching is one of the consumers of that primary storage. So the matching processor gets that data, and it has multiple survivorship groups, and it creates this match document, which is basically a copy of your record, but not the whole thing, only what is required for matching purposes.
This is where it is getting calculated. This is where basically it is ignoring anything that is not relevant to the matching. That includes the survivorship value, which comes out from non-default survivorship group. So the moment it gets here, everything related to the matching happens on this match document data now. It is not even looking here. So this whole process is keeping the match document and the primary storage in sync, but the moment data is persisted in this matched document, that is what is used for the matching, everything matching automatic relevance based matching, suspect matching, any kind of matching. So if the data is not here, we're not going to match that. So that's how it works.
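The idea of a match document can be sketched as a projection of the full record: keep only the attributes the match rules reference, and only the values that survive under the default survivorship group. The record layout, attribute names, and function name below are assumptions made for illustration and do not reflect Reltio's internal storage format.

```python
def build_match_document(entity: dict, match_attributes: list, default_group: str) -> dict:
    """Copy only what matching needs: match attributes, default-group values."""
    return {
        attribute: entity["attributes"][attribute][default_group]
        for attribute in match_attributes
    }


entity = {
    "id": "e1",
    "attributes": {
        # attribute -> surviving value per survivorship group (hypothetical layout)
        "FirstName": {"default": "Robert", "marketing": "Bob"},
        "Address": {"default": "101 North Street", "marketing": "101 South Street"},
        "Notes": {"default": "VIP", "marketing": "VIP"},  # not used by any match rule
    },
}

match_document = build_match_document(entity, ["FirstName", "Address"], "default")
print(match_document)
# {'FirstName': 'Robert', 'Address': '101 North Street'}
```

Anything that never reaches this projection, such as the `marketing` values or the `Notes` attribute here, is invisible to every kind of matching that runs on the match document.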
Chandra:
Okay, thank you. So what is the switch agenda here? What is the improvement that you're saying that there's an opportunity for improvement?
Suchen Chodankar:
The opportunity is basically that right now we are pulling just the default survivorship values. So there is an opportunity here for an enhancement where you could say, do not just take the survivorship values from the default one, but let me tell you which survivorship group you should use for calculating my matches.
Chandra:
Or all of them.
Suchen Chodankar:
All of them could be little confusing because again, going back to that example, you have to pick let's say one address. One survivorship group says it's 101 North Street, the other one says 101 South Street. You have to make up your mind.
Chandra:
Do I need to have separate set of match rules for each survivorship then?
Suchen Chodankar:
Yeah, that's why it gets confusing. So that's why I don't like the idea of having all the survivorship rules used during the matching. What I was talking about, the improvement is you get to basically tell or pick a survivorship rules that we should use. Right now default is used for showing on the UI and all of that stuff, but what if you want to use the survivorship values, which is different from what you see in the UI for the matching purpose. So that's where I was suggesting that we can probably have a flag or something which will tell this is a default for everybody looking at the UI record and whatnot, but use this survivorship rule for the matching purpose. So yeah, something that we can think about.
Chandra:
But in a way, it is coming to the point that we may have to tie the user role and the matching engine and the survivorship engine together.
Suchen Chodankar:
Yeah. There are multiple things that we will have to think about there and we can definitely chat more about that. Probably we can create an idea and have a discussion there. But from the implementation point of view, the survivorship rule is tied to the role today. That's the only way we understand what user is asking for, what version of consolidated copy. So we'll have to basically think about a different way of doing that for matching.
Chandra:
Okay. Yeah, because we are actually planning on having multiple OVs. So we have a case where I may have to produce more than one OV: one OV for my corporate groups versus, let's say, a different OV for my clinical business, for the same head. So there we need to know all the pros and cons of doing it. I'm even wondering, is there anybody who is using this multiple OV concept in production today?
Suchen Chodankar:
Yeah. So the multi-OV concept is definitely useful for the use cases that I mentioned at the beginning of this question, which is, you have multiple applications or downstream systems that want the consolidated profile to look different than it does for any other application. So the multiple OV concept is definitely used. I would not say almost every customer is using it, but we have many customers who are using it. But from the matching point of view, I have not seen a requirement where we hear that you have multiple OV rules and want to use one for matching which is different from what you show on the UI. So I've not heard anybody basically say, I want this to be the default, but I want this other one to be used in the matching. That never happens. You basically change the default to this second survivorship rule, and we use that for matching and for the UI or any default value. So that's why we have the default as the one which we pick for matching.
Mark Burlock:
And Suchen just one [inaudible 00:18:16] on this if I may. As far as people using the multiple survivorship groups and related to the matching, are you seeing people usually basing it OV only or not OV only? We're going to be doing this very shortly and that's why these questions are so near and dear to us.
Suchen Chodankar:
Right. So I think this is the question that Mark is asking about, right? We have two types of setting in the match rules. You get to basically say, match all the values for any particular value, oh sorry, any particular attribute. For first name for example, say you have Robbie, Robert, Bob, all of those in one profile, and you picked Robert as your surviving name, and then you have another record which has just, say, Robbie, right? You have a setting in the match rule which is match on OV only. If you do that in this example that I just talked about, they won't match, because one record has the surviving value as Robert and the other record has the surviving value as Robbie. Unless you have on matching in that kind of stuff, it won't match. So the other setting is basically to set that match OV only to false.
If you do that, what matching engine does, it considers all the values of the first record, which is Robert, Robbie, Bob and all of that stuff. And the second record now has Robbie, which will match with one of the values from the first record and that's how you have a match. So to answer your question, I think everybody probably will prefer non OV matching because it captures more record.
But there's a performance implication to that because the moment you have multiple values here and multiple values here and then you have some sort of fuzzy matching on that, the combination it'll create in the tokenization and the comparison that adds to the processing time. So it really depends, like I've seen people use a match on a non OV a lot because it captures a lot of things, but when they see that the performance is a little slower and if they know that the data does not require you to basically compare the non OV, they would basically split that so that the first part like auto merging is done only on the OV value so that you're not really comparing everything there in the auto merging rule, and then whatever remains can be captured or can be caught by the second rule which goes after non OV values.
So that's kind of pattern that I have seen. But I would say that it's very common that people go for the non OV first to see what all it catches and if there are no slowness or whatever they stick with that and the OV value only basically is configured when you want to separate it out and you don't want to use the non OV value. In some cases, let's say there is a source system that you do not trust at all and they're sending names.
If you use that name for matching, you might end up with over merging. So that is one other case where you do not want to go with the non OV matching. So it depends on the use case really. And I talked about pros and cons of this one. The biggest thing that you should look for here is if there is any source system which is sending data, especially the match attribute data that you do not trust or you trust little, then we should stick with the match using OV value only because now what you're doing is your OV rule make sure that the non-trusting source system values are not survived and your match engine will never use that during the matching.
So auto merging I will start with OV and for potential matching I can start with non OV and see whether you can move some of the non OV into OV or the other way around.
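The trade-off between OV-only and non-OV matching can be sketched as follows. The record layout and function names are hypothetical; the flag is modelled as a single boolean for the whole rule, and an exact comparison stands in for whatever comparator the rule would actually use.

```python
def candidate_values(record: dict, attribute: str, use_ov_only: bool) -> set:
    """Values a rule may compare: just the surviving value, or every value."""
    values = record[attribute]
    return {values["ov"]} if use_ov_only else set(values["all"])


def rule_matches(a: dict, b: dict, attribute: str, use_ov_only: bool) -> bool:
    """Exact comparison: match if any allowed value is shared."""
    return bool(candidate_values(a, attribute, use_ov_only)
                & candidate_values(b, attribute, use_ov_only))


def comparison_count(a: dict, b: dict, attribute: str, use_ov_only: bool) -> int:
    """Number of value pairs a pairwise (for example fuzzy) comparator would visit."""
    return (len(candidate_values(a, attribute, use_ov_only))
            * len(candidate_values(b, attribute, use_ov_only)))


record_a = {"FirstName": {"ov": "Robert", "all": ["Robert", "Robbie", "Bob"]}}
record_b = {"FirstName": {"ov": "Robbie", "all": ["Robbie"]}}

print(rule_matches(record_a, record_b, "FirstName", use_ov_only=True))       # False
print(rule_matches(record_a, record_b, "FirstName", use_ov_only=False))      # True
print(comparison_count(record_a, record_b, "FirstName", use_ov_only=False))  # 3
```

Non-OV matching catches the pair, at the cost of more value combinations to tokenize and compare, which is the performance implication raised in the discussion.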
Chris Detzel:
So quickly Suchen, there's a lot of questions around the OV stuff. So is there an option where we can specify OV for few attributes and non OV for the rest of the attributes?
Suchen Chodankar:
Unfortunately no. It's basically, if you look at this one, this is the property in the match configuration that we are talking about. And that's a good question, because we are thinking about that particular thing. There are scenarios where you want to use the OV value for only specific attributes and non-OV values for the rest of the attributes used within a match rule. But this particular property is at the match rule level, which means it is applied to all the attributes that are used. So if you set this to true, only the OV value will be used for all the attributes. If it is false, then non-OV values are also included for all the attributes.
Chris Detzel:
Great, and I'm going to go through some of these questions that are asked and then we go back to Mark's questions. Do we have translations supported in the rule-based matching?
Suchen Chodankar:
Yes, we do have and I do have example which I can quickly show.
Chris Detzel:
Keep the questions coming and I promise that if we don't get to the question, we'll definitely have you post it on the community and then I'll have Suchen answer those. But we still have 30 minutes plus so we're doing good.
Suchen Chodankar:
Okay. So here is an example of how translation can be done. Here we set this up as a cleanse function, just like you use the name dictionary for first name synonym matching. You provide the list of attributes that you want to translate and it basically does the matching based on that. Actually I might have an example here which... So if you look at this particular record, there are two different languages here. If you use Google Translate or something, you will see that this character here transliterates to Zao, and this is the same character. So it was able to transliterate and find this as a match using the one match rule that I've configured to use the translation. So yeah, it is set up as a cleanse function, and you use the cleanse adapter just like you use the name dictionary and then provide the list of all the attributes that you want to translate.
Chandra:
Yeah. And does it need to be specific to any language or something? Because what we wanted to do is we want to do a comparison between two real environments, one with the science and the commercial standpoint. And most of the science are in the English and we want to compare with any of the languages within the Reltio. So what your example says, we want to define a particular language, which means if you're supporting 64 markets and or 80 markets, we need to define the same rules in each of the languages in the translation or something like a single market can.
Suchen Chodankar:
No, you don't have to provide the language here. As you can see here, it says Any-Latin. So it just transliterates the data to English first and then does the comparison. It has auto-detection to transliterate from any language to English, and then it applies all the comparison functions. So it does not translate, I wanted to point that out. It just transliterates. So if you have something in a different language, that will just be transliterated to English first and then...
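To make the distinction between translating and transliterating concrete, here is a toy sketch. The four-entry Cyrillic table is hand-written purely for illustration and stands in for a real transliteration library; nothing here is the utility used by the product, and the choice of Cyrillic (rather than the script shown in the demo) is only to keep the table tiny.

```python
# Hypothetical, deliberately tiny character table (Cyrillic to Latin).
# Unicode escapes are used so the exact code points are unambiguous.
CYRILLIC_TO_LATIN = {
    "\u0418": "I",  # Cyrillic capital letter I
    "\u0432": "v",  # Cyrillic small letter ve
    "\u0430": "a",  # Cyrillic small letter a
    "\u043d": "n",  # Cyrillic small letter en
}


def transliterate(text: str) -> str:
    """Map each character to a Latin equivalent; unknown characters pass through."""
    return "".join(CYRILLIC_TO_LATIN.get(ch, ch) for ch in text)


def names_match(a: str, b: str) -> bool:
    """Transliterate both sides first, then apply the comparison (exact here)."""
    return transliterate(a).casefold() == transliterate(b).casefold()


ivan_cyrillic = "\u0418\u0432\u0430\u043d"  # the name "Ivan" written in Cyrillic

print(transliterate(ivan_cyrillic))        # Ivan
print(names_match(ivan_cyrillic, "ivan"))  # True
```

The characters are mapped by script, not by meaning, so no language has to be declared per market: the comparison simply runs on the Latin form of whatever was loaded.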
Chandra:
Okay great. Thank you.
Chris Detzel:
And then I do have this one question I want to make sure I ask. It's from Sheth at Pfizer, on the scalability of the match engine, so the number of rules. If you scroll down you can see it. At Pfizer, we have a hundred plus countries loaded into Reltio and we are adding new data sources as we load new markets. Even for the same source, each market has a different quality and profile of data. We hear from the Reltio PS team that we should try to limit the number of match rules per tenant to about 20 to 25 max. So assuming it's not a hard limit, is this a real constraint which could cause performance issues? If we want to have generic rules across markets but then have specific rules for outlier markets et cetera, and want to increase it to 50 or a hundred rules for example, is this going to be an issue? And he says go ahead.
Suchen Chodankar:
No, I think I get the question here. So I want to go back to this diagram here. So 50, a hundred, or anything more than 25 is a little unusual or uncommon as a number of match rules, from what I have seen, but it really comes down to this. The number of records can be 10 records, a thousand records, 10 million, a hundred million records, as long as you can efficiently create these buckets. So this is your universe, which is basically the number of records that you load, and this can be a hundred million records, and these are the buckets that you basically put them in for the comparison process. So if you have Suchen and Chris, you don't want to put them in the same bucket, because the end result is never going to be even close to a match.
So why put them in the same bucket and spend that processing time and execution time on that? So this bucket is what you should configure efficiently and I understand that 50 and a hundred match rules really translate to the different scenarios that you might have, which you may not be able to kind of stuff inside say 20 or 25 match rules. So in those cases it is okay to have more rules provided each of those rules are not super complex and spends a lot of time in tokenization process, and the overall tokens that are going to come out of every record should be within the range that we expect. So to answer your question, the number of rules really doesn't matter. You could have basically hundred rules, but if they all basically point to one or two or three token buckets, then it doesn't matter, you have a hundred rules or 200 rules, it's just the management of the rules is going to be super difficult.
But if you have 200 unique rules which basically create their own token, then it might be a problem because this tokenization process is what takes a lot of time. So here we are just dividing the whole universe into smaller universe and now we are comparing records only within this bucket. So it doesn't matter, there are other 200 rules which is not applicable for this bucket and it doesn't matter. We'll just kind of skip through that or basically run through that very quickly.
So one other thing to consider is consolidation of the rules. Even if they are for different regions, if there are similarities in the scenarios, then we can look at combining those rules. It just makes the management of the rules easier. But to answer your question, it's the tokenization: if you can devise a tokenization scheme that does not create a lot of tokens per entity, then we are fine. The 50 or a hundred rules would not matter. But again, like I said, it's very uncommon, and if you have that then we would definitely like to have a look at it and see how we can make it a shorter list, which will be easier for you to manage.
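The bucketing argument can be illustrated with a small sketch: records are grouped by a token, and comparisons only happen inside each bucket. The lower-cased last name as the token and the sample records are assumptions chosen for illustration; real tokenizer classes are more elaborate.

```python
from collections import defaultdict
from itertools import combinations


def bucket_by_token(records: list, token_fn) -> dict:
    """Group records so that only records sharing a token are ever compared."""
    buckets = defaultdict(list)
    for record in records:
        buckets[token_fn(record)].append(record)
    return buckets


def candidate_pairs(buckets: dict) -> list:
    """All pairs that would be handed to the comparators."""
    pairs = []
    for members in buckets.values():
        pairs.extend(combinations(members, 2))
    return pairs


records = [
    ("Robert", "Smith"), ("Robbie", "Smith"), ("John", "Smith"),
    ("Ali", "Khan"), ("Alina", "Khan"),
]

buckets = bucket_by_token(records, token_fn=lambda r: r[1].lower())
all_pairs = list(combinations(records, 2))

print(len(all_pairs))                 # 10 comparisons without bucketing
print(len(candidate_pairs(buckets)))  # 4 comparisons with bucketing
```

The saving grows quickly with data volume, which is why the number of distinct tokens produced per record matters more than the raw number of rules.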
Chris Detzel:
Great. Well, so let's see this other question. Can we use multiple comparator classes and tokenization classes for an attribute? I'm not sure if you answered that. And which is the best comparator?
Suchen Chodankar:
No, I have not. Let me take the first one. The answer is that, at this point in time, we cannot use multiple comparators or tokenizers in the same match rule. If you have a need for multiple comparators or tokenizers on any given attribute, they have to be in two different rules. They can have the exact same rule definition, and the comparison logic at the attribute level can be different in these two rules. So just one example here. If you look at this, okay, rule number two I think. I have exact for this, sorry no, let's look at the comparator. Okay, I have the Levenshtein algorithm here. Now if I also want to use, say, Double Metaphone or whatever, you cannot use that in the same rule. You have to basically copy this same rule and then use a different comparator for the second rule.
This part can remain the same. So what it'll do is, during comparison for the second rule it'll use Double Metaphone, and for rule number one it'll use the Levenshtein algorithm. And then you also get to see that record one matched with record two using rule number one, and you know that it matched because it uses the Levenshtein algorithm, and that under the second rule record one did not match with record two but matched with record three using a different comparator. So that's how you do it. Yeah, so the short answer is you cannot use multiple comparators and tokenizers within the same match rule, but for the same entity type you can definitely use multiple comparators using multiple rules.
Which is the best comparator and tokenization class for first name, last name, or full name? Again, I would say this really depends on the use case. If you want to match, say, Robert, Robbie, Bobby and those kinds of names, then obviously you want to use fuzzy matching bundled with synonym name matching. So again I have an example here.
So this is how you do it. As you can see, there's a cleanse set on the match on the first name here, and the comparison basically uses Double Metaphone. And this basically covers Soundex or something like it, similar-sounding names, those kinds of things. So this is applied on top of the synonym names. Just to give you an example here, if you look at this, if Ali and Alina were records that you want to compare, what will happen is that the values extracted into the match document for Ali will be all of these values, and then the comparator that you use, Double Metaphone, Soundex, fuzzy text match or whatever, is applied on all these values. So again, the short answer is that in a lot of implementations I've seen, especially for the first name, we use the synonym name dictionary and then use Double Metaphone.
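The pattern of applying a phonetic comparator on top of synonym expansion can be sketched as below. Two substitutions are made for brevity and should be read as such: a plain Soundex implementation stands in for Double Metaphone, and the synonym dictionary is a hypothetical one-entry table rather than a real name dictionary.

```python
SOUNDEX_CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        SOUNDEX_CODES[letter] = digit


def soundex(name: str) -> str:
    """American Soundex: first letter plus three digits."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0]
    previous = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":            # H and W do not separate equal codes
            continue
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != previous:
            encoded += code
        previous = code           # vowels reset the previous code
    return (encoded + "000")[:4]


# Hypothetical synonym dictionary; a real name dictionary is far larger.
SYNONYMS = {"bob": {"robert", "robbie"}}


def expand(name: str, synonyms: dict) -> set:
    """The name itself plus any dictionary synonyms."""
    key = name.lower()
    return {key} | set(synonyms.get(key, ()))


def phonetic_match(a: str, b: str, synonyms: dict) -> bool:
    """Match if any expanded value of a sounds like any expanded value of b."""
    codes_a = {soundex(v) for v in expand(a, synonyms)}
    codes_b = {soundex(v) for v in expand(b, synonyms)}
    return bool(codes_a & codes_b)


print(soundex("Robert"), soundex("Robbert"))       # R163 R163
print(phonetic_match("Bob", "Robbert", {}))        # False
print(phonetic_match("Bob", "Robbert", SYNONYMS))  # True
```

Without the dictionary, "Bob" and the misspelled "Robbert" share no phonetic code; once "Bob" is expanded to its synonyms, the comparator finds the overlap.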
I hope that answers the question. The full name. I want to just quickly answer this one as well. Personally I don't like matching on the full name, especially when you have the breakdown of the first name and the last name because you have more control over the components of the name, which is first and last name, you can match them differently and you can basically go a little granular level where you can say the last name matches exactly and the first name is slightly off. That is a better match.
But if I were to look at the whole string and match based on just the whole string, and it says 50% match, you would not know whether it's a 50% match between the first name and the last part of the last name or the other way around. So that's why it becomes a little difficult with the full name. I would strongly recommend that we have the first name and last name strategy, and for full name, just for catching anything that you cannot catch using first name and last name, you can have a slightly loose comparator on the full name, but that should be a potential match, just to see what kind of matches it brings. Should we move to the next one?
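The point about full-name scores hiding where the difference lies can be shown with a small worked example. The normalised score used here (`1 - distance / longest length`) is an assumed formula for illustration, and the names are made up.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def score_components(a: tuple, b: tuple) -> dict:
    """Score first name, last name, and the concatenated full name separately."""
    return {
        "first": round(similarity(a[0], b[0]), 2),
        "last": round(similarity(a[1], b[1]), 2),
        "full": round(similarity(" ".join(a), " ".join(b)), 2),
    }


base = ("Robert", "Smith")
print(score_components(base, ("Robbie", "Smith")))
# {'first': 0.5, 'last': 1.0, 'full': 0.75}
print(score_components(base, ("Robert", "Smart")))
# {'first': 1.0, 'last': 0.4, 'full': 0.75}
```

Both pairs score 0.75 on the full name, yet one differs only in the first name and the other only in the last name. Component-level rules can treat those two cases differently; a full-name rule cannot.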
Chris Detzel:
Yes, please do.
Suchen Chodankar:
Yeah. For potential matches under a relevance-based match rule, is there a way we could publish the score at the profile level? We currently don't have that score here, but it's on our roadmap, and what you would see here is, let me quickly bring that up. Okay, so if you look at this record, it matched on two rules. This one is binary-outcome-based matching, the second one is score-based matching. So looking at this, you cannot tell how close this one is, whether it is 90%, 70% or whatnot. We do have that score in the API, it's just that it is not available on the UI just yet. And it is on our roadmap.
We will plan to bring this here on the UI and you will see that not at the profile level but on the potential match, something like next to this rule in parenthesis or something, you will see that if there are multiple relevance based match rule for example, and this tells me that this is 70% and there is another record which is 98% and that the one with the 98% has a better match using the same rule for the first name and exact you can go, with the second record for the merging purpose. So the score is not available at this point of time in the UI, but we are working on bringing that on the UI and it will be available very soon.
Is there a way to do an external matching via API based on the rule? Yes. You can basically trigger the external match and say, just use this one particular rule and not the whole thing, if that is the question here. So you can specify a list of rules or a specific rule, and if you don't provide any rule parameter there, then it'll just execute all the match rules which are marked for external matching or for both external and internal matching. Please let me know if that was not the question and I can answer that.
Chandra:
Yeah, that was great.
Suchen Chodankar:
Yeah, we can provide this information. That is a utility that we use. I see Java utility probably, I can confirm that again with my engineering team and we can answer this offline. But yeah, it's a utility that we use for doing the translation.
Any plan to separate tokenization from individual match rule groups and comparator classes? I'll come to this, I want to answer as many questions as possible, but if somebody can just elaborate on this question, I didn't get that in the chat.
Chandra:
Yeah, I can elaborate. I have seen matching tools and products where you don't always have to tie tokenization together with the match rule. You can have the tokenization separated altogether and generate your tokens; then you have your own match rules the way you want them, where you use all kinds of comparator classes and so on.
Suchen Chodankar:
Yes, definitely. That's a good idea and actually something that we are already thinking about. By the way, this is already happening in Reltio today: the tokenization process is separate from the comparison process. What you see here, T1, T2, T3, T4, are the tokenization rules that are derived from the match rules that you've configured. If I go back here, you'll see that I have seven match rules, right? Now, rule number one is ignoring the first name, I think, which means it is using only the last name in the tokenization, and the class, if you look at this, says exact token. Now if you look at the second rule here, let me close this. Again, it is ignoring the first name, and the match token class is on last name. And yeah, it is ignoring the first name.
So it doesn't matter what I have here; it'll ignore this. So technically, the first rule and the second rule have the same tokenization rule, and that's what we are going to use. So the tokenization rule seems like it is tied to the match rule at this point in time. But what we do is we go through the match rules, and we don't derive the tokenization rule separately. The data basically does that for us.
For example, if I have John Smith getting loaded, it'll go through this tokenization process for every single rule, but the end token that will come out of that John Smith is just Smith. For rule number two, again, the token that will come out of that record is Smith, but then we will see that the Smith token already exists. In this case it is this token. So it'll simply go and attach that new record to the existing token. So what you're saying is already happening, but I think if your question is whether there is a way to define that tokenization process separately and leave it out of the comparison, then yes, it's something that will be very helpful and we are considering that. Yes.
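The token sharing described here can be sketched roughly as follows. This is a minimal Python illustration, not Reltio's implementation: the rule names, the `tokenize` and `index_record` functions, and the in-memory index are all hypothetical, and it assumes an exact token built from the last name only, with the first name ignored.

```python
from collections import defaultdict

# Hypothetical tokenization rules derived from two match rules. Both ignore
# the first name, so both produce the same token for a given record.
TOKEN_ATTRIBUTES = {
    "rule1": ("last_name",),
    "rule2": ("last_name",),
}

def tokenize(record, attributes):
    """Build an exact token from the selected attributes (lower-cased)."""
    return "|".join(record[a].strip().lower() for a in attributes)

def index_record(index, record_id, record):
    """Attach the record to every token it produces. Identical tokens coming
    from different rules collapse into a single index entry."""
    for attributes in TOKEN_ATTRIBUTES.values():
        index[tokenize(record, attributes)].add(record_id)

index = defaultdict(set)
index_record(index, "e1", {"first_name": "John", "last_name": "Smith"})
index_record(index, "e2", {"first_name": "Jon", "last_name": "Smith"})
index_record(index, "e3", {"first_name": "Ann", "last_name": "Jones"})

# Only records that share a token become candidates for comparison.
candidates = {token: ids for token, ids in index.items() if len(ids) > 1}
print(candidates)
```

In this sketch the second Smith record is simply attached to the existing "smith" entry, and the comparison step would only ever look inside each token's group.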
Chandra:
Yeah, I want to separate it then. Yes, I can have exceptions in certain match rules saying that, okay, ignore this tokenization.
Suchen Chodankar:
Right. Yeah, that makes sense, and that's what we are really doing. Create the smaller universes first and then apply the match rules on all these universes, right? So yeah, any plans... Okay, we answered that. Just to clarify, are you able to set match rules to have a mix and match of non-OV and OV for different attributes? No. Okay, so we answered this one. But yeah, that is something that we can consider. Can be sure relevance, we talked about that. Apart from traditional and relevance matching, can you speak a little bit about... I will save this for last. If time permits, I will come back to this one, because we wanted to focus on the rule-based matching. We had a session done on this ML-based matching. Chris can provide a link to that recording and to the blog that we had written.
But yeah, I can come back and talk a little more. Can we add the source system name in the filter option when defining the match rule? Filter option when defining the match. There's a way you can basically... Here I have an example of how you can use the source system in the match rule. So here, as you can see, it is matching for exact sources, and then you can define which source system you want to consider.
The way it works is, if you have two records, one having just one crosswalk from SAP and the second record having two crosswalks, one from SAP and another from, say, CRM or whatever, it basically filters out everything which is not SAP and sees whether these two records should still be compared. Now going back to the same example, if the first record was SAP and the second record was just CRM, after I take the CRM out, I see that there is no SAP here. So those records will not be matched. So that's how this filtering works here. It's not really comparing at this point; it filters the records first, before sending them for matching, when it comes to the equals constraint.
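The filtering step just described can be sketched as a small pre-check. This is a hedged Python illustration only: the function name `eligible_for_comparison` and the representation of crosswalks as plain source names are assumptions, and the behavior with more than one allowed source is a guess, not something confirmed in the session.

```python
def eligible_for_comparison(sources_a, sources_b, allowed):
    """Drop crosswalk sources that are not in `allowed`, then let the pair
    through only if both records still have at least one source left.
    Assumption: with several allowed sources, any of them qualifies."""
    kept_a = set(sources_a) & set(allowed)
    kept_b = set(sources_b) & set(allowed)
    return bool(kept_a) and bool(kept_b)

# First record: SAP only. Second record: SAP and CRM. CRM is filtered out,
# SAP remains on both sides, so the pair is still sent for matching.
print(eligible_for_comparison(["SAP"], ["SAP", "CRM"], {"SAP"}))

# First record: SAP only. Second record: CRM only. After filtering, nothing
# is left on the second record, so the pair is never compared.
print(eligible_for_comparison(["SAP"], ["CRM"], {"SAP"}))
```

The point of the sketch is the ordering: the filter runs before any comparator does, so a filtered-out pair never reaches the match rule at all.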
Mark Burlock:
And could they filter on multiple source systems as opposed to just-
Suchen Chodankar:
Yeah, you can basically have multiple values in here. And you can also have different operators here, for AND and OR and those kinds of combinations. There was a question I want to go back to, which is sort of related to this topic, which I cannot find here now. But the recommendation, or basically the ask, was... okay, here it is. Consider using a custom attribute for storing the source system. I personally strongly recommend this one, because when you have this data as an attribute, you get to use all these other comparators that you can apply on the source system name. Imagine a situation where you want to basically say, only if there is a surviving source system SAP, then you match. There is no way to do that using this approach. But now if you were to use a custom attribute for storing the source system and you define the survivorship rule to say, survive SAP if SAP exists, right?
So only those records will be matched when you set match OV only equal to true. That's why I like it. It is very flexible, and something that you can play around with for a lot of use cases. So I personally like that option very much. Even if you don't have any use case for matching on the source system, if there is an opportunity to store that information as an attribute, I would always suggest that you do that. In future you might use it, and if not, you can simply hide it and it stays there.
Okay. I think we are at the end of the list here, but I'm very sure I probably might have skipped one or two questions here. Jump this one here.
Chris Detzel:
Go back to those for sure.
Suchen Chodankar:
Yeah, right here. Best practice for implementing a new match rule when using event streaming for integration? If I understand this question correctly, maybe the question is, if you are introducing a new match rule and you have streaming enabled, the moment you run the match it's going to publish a lot of events. So the question here is, what is the best practice for implementing such scenarios? Did I get that right?
Mark Burlock:
Yes.
Suchen Chodankar:
Okay. So when you're introducing a new match rule, you want to basically have those events even for the existing records. Let's say you are introducing a new match rule which you know should go after only a subset of the data. When you run the re-index job, you are not going to get matches for the existing records that have nothing to do with this new match rule that you have implemented. But for whatever reason, if you missed something and those records start giving you potential matches because of this new rule, that's a good thing, because now you're catching all those things.
So I would say have it enabled, so that you can capture all of that and synchronize your data. But if you have some sort of batch process in place where you can just let Reltio do all the matching first and then grab the latest and greatest copy of the consolidated profiles and the potential matches, then yes, the recommendation is of course to switch off the integration and let the matching happen. You can have multiple iterations and whatnot, instead of publishing over and over again potential matches between A and B, and B and C, and C and A, and those kinds of things, which could be transitive in nature. So if you have that option, it's always recommended to turn off the streaming, run your matching, run any other processing that you might have, and then pull the records from Reltio once.
Mark Burlock:
Could the streaming not be refined so that other things that are happening, not related to that new match rule, still go downstream, as opposed to just an off or on? Is there any way to just-
Suchen Chodankar:
Oh sorry, can you repeat that please?
Mark Burlock:
Yes, sure. So if we're putting a new match rule in, and say it's very effective, so there's a lot of volume with it, is there a way to have just the changes caused by that match rule not go downstream, but all other activities go downstream? Is there a way to just filter that out, or none?
Suchen Chodankar:
No. So once you set up a match rule, it is going to publish all the pairs, unless you have defined specific custom actions for all of the rules. So in Reltio there is a way you can basically say, for rule number one, send the event; for rule number two, send the event. But you cannot use the automatic rule type or the suspect rule type for that. Those have specific behavior tied to them, basically sort of hard-coded in those rule types. But there's a way you can create your own custom rule type, which basically gives you control over whether that rule type should publish events or not. If you use that, what you can do is say that for all my existing rule types I want the event to be published, and for the new match rules you create another rule type which will not publish. So it'll skip publishing only for that specific rule. Let me see if I can quickly bring that up here.
Chris Detzel:
Quickly. We have five minutes.
Suchen Chodankar:
Sure. Oh yeah. Okay. So I can send more details on this one. Are there any other questions? Otherwise we can pick up this.
Mark Burlock:
Just one other thing also, I just wanted to make sure. With the transitive matching, that only has to do with potential matches, right? That's not for automatic?
Suchen Chodankar:
That is correct. So, transitive matching also applies to automatic, in an indirect way. If you have A merging with B, and B is a match of C, then all those records come together, but the merging happens in pairs. So because of that A-B merge, it is going to trigger the matching evaluation again with C. And if it is still a perfect match, it'll bring that record together. But let's say after A and B merge, it's no longer a match with C; then it'll leave that C entity alone. But yeah, transitive matching is applicable to all kinds of matching. In some cases it is indirect; for potential matches it is sort of direct, in the way that it calculates and stores those transitive matches.
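The pairwise merge and re-evaluation described here can be sketched as a small loop. This is a hedged Python illustration, not how Reltio implements it: the match rule (same name or same email), the "most recently updated wins" survivorship, and every function and field name are hypothetical.

```python
from itertools import combinations

def is_match(p, q):
    """Hypothetical automatic rule: same name OR same email."""
    return p["name"] == q["name"] or p["email"] == q["email"]

def survive(p, q):
    """Hypothetical survivorship: values come from the most recently updated record."""
    winner = p if p["updated"] >= q["updated"] else q
    return {"ids": p["ids"] | q["ids"], "name": winner["name"],
            "email": winner["email"], "updated": winner["updated"]}

def merge_all(records):
    """Merge one pair at a time, then re-evaluate the consolidated profile
    against everything else until no matching pair is left."""
    profiles = list(records)
    while True:
        pair = next(((p, q) for p, q in combinations(profiles, 2)
                     if is_match(p, q)), None)
        if pair is None:
            return profiles
        p, q = pair
        profiles = [r for r in profiles if r is not p and r is not q]
        profiles.append(survive(p, q))

a = {"ids": {"A"}, "name": "Ann Lee", "email": "a@x.example", "updated": 1}
b = {"ids": {"B"}, "name": "Ann Lee", "email": "b@x.example", "updated": 2}
c = {"ids": {"C"}, "name": "A. Lee", "email": "b@x.example", "updated": 3}

# A and B merge on name. B's email survives, so the merged profile still
# matches C on email and C is pulled in as well.
print([sorted(p["ids"]) for p in merge_all([a, b, c])])
```

If A were the most recently updated record instead, A's email would survive the A-B merge, the consolidated profile would no longer match C, and C would be left alone.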
Mark Burlock:
And I guess the OV only would tie into that then also, I mean if A and B because of the OV, you're limiting what matches that.
Suchen Chodankar:
Exactly, yes. So after merging, if something else survives and now C does not match the A-B consolidated profile, then that basically stops the match there. Otherwise, it's going to just keep consuming all the entities. So transitivity applies to that as well.
Mark Burlock:
Thank you.
Suchen Chodankar:
Okay, I think we have just four minutes. Any other questions? The only thing that we didn't answer here is this one here, apart from traditional and relevance, please can you speak a little bit about matching based on has been talking about it for a while. Yeah.
Chandra:
Yes, Suchen. This is Chandra. So yeah, I understand. You have the traditional rule-based matching with the tokenization. You have relevance-based scoring, and then of course transitive matching, which is something that I like. Are there any other things where we can reduce the suspect reviews and the data stewards going in and doing all that?
Suchen Chodankar:
Yeah. So I would recommend trying Match IQ. Basically, this obviously requires your training. It really comes down to the scenarios that you train. If you have perfectly set up match rules and there is nothing else you can do to reduce the number of potential matches, because they do require human intervention, then Match IQ is not magically going to reduce that. What it'll do is, well, here is an example. As you can see here, Match IQ is recommending this, meaning Match IQ also agrees with this pair. It gives me a little more confidence to just go ahead and click on merge. So data stewards can spend a little less time worrying about doing any wrong merges here, knowing that there are two rules that are recommending this, and then I have Match IQ also recommending this. So that's how some of our customers are looking at this: Match IQ as another matching engine, which is kind of validating independently of the match rules.
So in places where you might have missed a data scenario, missed something to configure in the match rule, that's where Match IQ can discover those, provided you have answered those scenarios during the training process. That's how Match IQ can help. So it's a reverse process. It's not "define the instructions and I'll do the matching based on the instructions." It's the other way around: I will show you some sample pairs and you tell me whether they match or not. That approach basically always ends up getting you better matches, because you're looking at the data and not really looking at the instructions. So you might answer yes to something that you would've missed putting in as a match rule.
Chandra:
Okay. This is [inaudible 00:57:25] because I was thinking about it the other way. So I may have some match rules I am not confident in, so I never turn them into auto, but based on the way my data stewards are using those potential matches, can Reltio recommend that to me?
Suchen Chodankar:
Yes, that's where we are going with this one. So as you merge more and more on any specific rule or Match IQ recommendation, we have a plan for something which will basically say, hey, we have been seeing that out of a hundred profiles, you always end up merging a hundred of them when rule one exists. So do you want to convert this into auto merge? Those kinds of things will come in the future.
Mark Burlock:
Could Match IQ be auto, or is it only for suspect?
Suchen Chodankar:
Match IQ can also be auto. So you can configure Match IQ for automatic merge based on the score. There is a setting there, just like relevance-based. You can pick a score; you can say, anything which is a perfect, hundred percent match, go ahead and merge them. So you can do that. And you do that using the publish option, which is available in Match IQ.
Mark Burlock:
Is Match IQ a core product, or is it extra licensing? Do you have any idea?
Suchen Chodankar:
It is included in your subscription now. So this is how you basically configure it. You can say, for everything with more than an 83% match score, do auto merge. So Match IQ can auto merge as well.
Chandra:
The way we get the scoring is really important then.
Suchen Chodankar:
Correct. So there is a review process. You get to see what scores it is generating for sample pairs, and you can basically get comfortable based on the pairs that you see. You can say, for anything which is more than 83%, I've never seen anything that I don't agree with, which tells me that 83 and higher is auto merge. Or you can be a little cautious and say, okay, 90 and higher. So let the other, slightly lower-scored match pairs be reviewed first, before you lower the threshold for auto merge. So it can be an iterative, incremental process.
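The review-then-threshold reasoning here can be sketched in a few lines. This is a minimal Python illustration under stated assumptions: the function `lowest_safe_threshold`, the `floor` parameter, and the sample scores are all made up for the example, and none of this reflects how Match IQ itself computes or stores scores.

```python
def lowest_safe_threshold(reviewed, floor=0.0):
    """Return the lowest score s such that every reviewed pair scoring at
    least s was accepted by the reviewer, never going below `floor`.
    `reviewed` is a list of (score, accepted) tuples. Returns None if even
    the top-scored pair was rejected or falls below the floor."""
    threshold = None
    # Highest scores first; on ties, rejected pairs sort first so that a
    # rejected pair blocks its own score from becoming the threshold.
    for score, accepted in sorted(reviewed, key=lambda p: (-p[0], p[1])):
        if not accepted or score < floor:
            break
        threshold = score
    return threshold

# Hypothetical review results: (match score, did the steward agree?)
reviewed = [(0.98, True), (0.91, True), (0.86, True),
            (0.83, True), (0.79, False), (0.75, True)]

print(lowest_safe_threshold(reviewed))              # nothing rejected at 0.83 or above
print(lowest_safe_threshold(reviewed, floor=0.90))  # the more cautious choice
```

Starting with a higher `floor` and lowering it as more pairs are reviewed mirrors the iterative, incremental process described in the answer.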
Chris Detzel:
And Suchen, it seems like there could be an opportunity to do an ask me anything around this particular topic. Thank you so much, Suchen, for answering all these questions. Thank you everyone for bringing your questions to this ask me anything. It's a lot of fun. So hopefully you thought it was, Suchen, and I think there's a lot of-
Suchen Chodankar:
Yeah, it was lots of fun.
Chris Detzel:
Great questions. So I did post those questions on the community. This will be recorded, or has been recorded. We'll post it to the community in the next day or two. And so if you have additional questions, go to community.reltio.com and ask the questions there. I'll push Suchen and others to answer some of those questions. Until next time. And please take the survey on other potential ask me anything types of sessions like this. We'd love your feedback on that. So thank you everyone.
Suchen Chodankar:
Thank you.
Chris Detzel:
Bye-bye.
Suchen Chodankar:
Thank you.