Reltio Connect


Unlocking Entity Resolution with AI: How Fern is Revolutionizing Data Matching

By Christopher Detzel posted 11 days ago

  

Find the PPT here: How Fern is Revolutionizing Data Matching.

In this insightful Reltio Community Show, senior staff machine learning engineer Rob Sylvester dives deep into the world of entity resolution and how Reltio's Fern is revolutionizing data matching using AI.

Rob explains the challenges of traditional entity resolution techniques and demonstrates how Fern leverages large language models (LLMs) and modern AI to unlock patterns and improve accuracy without the need for extensive rule-based systems. Discover how Fern handles various data attributes such as names, addresses, phone numbers, and more, while considering language variations, nicknames, and typographic errors.

Rob also discusses the importance of customization and performance profiling to cater to different use cases and industries. Throughout the show, Rob answers thought-provoking questions from the audience and provides a live demo of Fern in action. He also touches on the future of entity resolution and how Fern can be integrated into existing MDM solutions. Whether you're a data steward, data scientist, or simply interested in the latest advancements in entity resolution, this video offers valuable insights into how AI is transforming data matching. Don't miss the opportunity to learn from one of Reltio's top machine learning experts!

Transcript: 

Chris: [00:00:00] So welcome, everyone, to another show. Really appreciate you being here. Today's topic is somewhat similar to last week's; it's called Unlocking Entity Resolution with AI: How Fern Is Revolutionizing Data Matching. We have our machine learning specialist.

Chris: His name's Rob Sylvester. He's a senior staff machine learning engineer here at Reltio. How's it going, Rob?

Rob: Hey, Chris. How's it going?

Chris: I've been trying to get him on a show for a long time now, and here he is. So I'm really excited about this. Finally made it happen. We made it happen. This is the day. The rules of the show are the usual.

Chris: Keep yourself on mute and push questions into the chat, or take yourself off mute. We're going to have a lot of questions, and we'll get to as many as possible. I've been talking to Rob, and based on last week, we will be doing a live Ask Me Anything around this topic. We'll just have to get that scheduled, because there will be tons, and we'll get to as many questions as possible.

Chris: We are recording it. [00:01:00] So if you miss it, or want to go deeper into it, we will definitely make sure you get that recording. Here are some upcoming events. We have a couple more that we've added and that are in the works, but beyond today's show on unlocking entity resolution, we have a show next week on Customer 360.

Chris: Data Product, powering AI-driven data unification for enhanced CX. We have Venky, our product leader, who will be presenting on that. And then we have some stuff on Reltio Integration Hub, which is what everybody loves. I'm really excited: Dan Gage is going to be presenting two shows, starting with enhancing data integration and address cleansing techniques with Reltio Integration Hub.

Chris: And then, on the 20th of June, we'll do one on Advanced Data Enrichment with Reltio Integration Hub. And lastly, two days from now, we have our Reltio Life Science event. I'll push the links into the chat. If you haven't registered [00:02:00] and you're at a life science company, it's an on-site, half-day event.

Chris: Really excited for that. There'll be breakfast and lunch and a lot of networking, with lots of people talking about some of their implementations and what they're doing and things like that. It's going to be a lot of fun. Really excited about that as well, but let's get to our show.

Chris: Rob, what do you think?

Rob: I think I'm ready.

Chris: All right, let's do it.

Rob: Let's take it away. So, thanks for having me on here. Chris wasn't lying; he's been trying to get me on here for quite a while. We finally were able to make it happen. And of course I got sick a few days before it, so I don't have my normal panache.

Rob: That's what's going on. I'm going to get my screen up here. Let me know when you've got it.

Rob: we're good. Yeah. Yeah.

Chris: We're good.

Rob: All right. So I was at the show last week; a lot of you guys were at the show last week. We'd been planning for this [00:03:00] for a few weeks now, to do this one-two punch with Suchen, where we would talk a little bit at a high level and collect people's questions, and then I could dive in a bit more technically. I was surprised at just how many people attended, how many questions there were, and which kinds of questions there were. There's a lot of interest, not only in entity resolution, but in the underlying mechanisms

Rob: we're using to solve it. So what I'm going to do for the next 20 to 25 minutes is run through a deeper technical dive into Fern, a little into the nitty-gritty. And after that, what I want to make sure to do is try to capture as much coverage as I can

Rob: on those questions, because inevitably you guys are going to have more of them, and I'm going to answer most of the [00:04:00] high-level themes that you guys asked about last week. So what I want you to do is put the questions in the chat; I'll stop two or three times and talk to Chris, and we'll do that.

Rob: Otherwise, I'll just blabber about machine learning for like hours and hours and nobody will ever get anything done. Let's dive in.

Rob: Most of you know what entity resolution is, but let's talk about it a bit from my perspective. I was brought in to Reltio, for the most part, to work on this problem. And it is a beast of a computer science problem. This is an example of, I would say, nine entities with three attributes.

Rob: And if I sat and asked you guys to build a machine learning model that could solve this and do it well, it would already be difficult, right? It would already be difficult when you have to consider names, prefixes, suffixes, [00:05:00] different alphabets, abbreviations and nicknames, locations, formatting differences, standardization, typographic errors, combinations of different errors.

Rob: You have to add all of these things up, and there wasn't really a great way to do this. Arguably, even today, nobody's dominating the entity resolution space, because this is such a difficult thing to solve. It may not be difficult when you have, in this case, nine entities with three attributes, but what if you have a hundred million entities, or a billion entities?

Rob: What if you're trying to resolve or de-duplicate match pairs of individuals in the United States, and you have U.S. Census data with 300-some-odd million people, and these people have not just names and addresses and emails, but phone numbers, multiple addresses, and

Rob: Social [00:06:00] Security numbers and different identifiers, birth information. You have all this, and it spirals out of control very quickly. Most of the players have tried to solve this with traditional matching techniques: string comparisons, edit distances. Then people start to throw in phonetic distances and say, this word sounds like this word. It gets you some of the way there, and these techniques work pretty fast, but that landscape has changed a lot in the last two years.
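
As a concrete aside on those traditional techniques, here is a minimal sketch using the open-source jellyfish library; the example names are illustrative and this is not Reltio's implementation.

```python
# pip install jellyfish
import jellyfish

a, b = "Catherine", "Katherine"

# Edit distance: one substitution apart, so a fuzzy-match rule catches this.
print(jellyfish.levenshtein_distance(a, b))         # -> 1

# Phonetic codes: Soundex keeps the first letter, so two names that sound
# identical still get different codes. Rule-based pipelines accumulate
# exceptions for exactly this kind of brittleness.
print(jellyfish.soundex(a), jellyfish.soundex(b))   # codes differ on 'C' vs 'K'
```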

Rob: And for somebody like me that's been doing computational linguistics for most of my life, when I saw this problem, I thought to myself, I'm going to make something new. So last year, basically on weekends, I sat for a few months and built what I call Flexible Entity Resolution Networks, and I brought it to Reltio. It started to take off for reasons that I'm going to demo to you [00:07:00] guys today, because what it does is allow us to not sit there and try to code out all of these little rules

Rob: around which words sound like this word, which characters in this alphabet map to that alphabet. What if this guy has a vacation home and another home? What if this guy's formatting on his email is different from this email? What if it's a work email? What if it's a phone number?

Rob: What if that phone number has a country code? All of this stuff just balloons out of control. These are the kinds of things that machine learning is good at, right? Really good at. And that's why we went down that direction. So I wanted to preface what I'm talking about today with that,

Rob: to let you know the motivations for it, and also to capture, for those of you that are new to this or didn't see the show last week with Suchen, the motivation: how can we de-duplicate at scale, across many different sources of variation, something like [00:08:00] two humans, without creating tons and tons of these rules?

Rob: A technical note, then: there are two steps to entity resolution. We didn't touch on this last week, but I want to touch on it now. Grouping together candidates is something called blocking. We're not talking about blocking today; inevitably, I will probably be back here with Chris sometime later this year to talk about it.

Rob: But blocking is the step where we go in and find the good candidates to send to our machine learning model. If we have a hundred million people, we can't compare a hundred million people against a hundred million people; that would take thousands of years on the machine.

Rob: So this first step in entity resolution is called blocking. That's not what we're going to talk about today. If you have questions about blocking, you can send them to us and we'll try to answer them now or later, preferably later. Fern, though, is an object comparison model. Fern, like most machine learning models, and really [00:09:00] like the academic definition of entity resolution, is the second step.

Rob: It's the step that says, okay, take, for example, Rob Sylvester in San Francisco over here, and Robert Sylvester in San Francisco over here, similar email. Those would get blocked together and the machine would say, okay, these are similar enough that they should get sent to Fern. It's a high recall, low latency process.

Rob: Whereas step two here is a high-precision, higher-latency process, right? Go get the candidates first, and then spend the bulk of your time on whether or not they're good (a minimal sketch of this two-step shape follows below). So let's dive in. What happens when we have many people to choose from? I'm going to run with the example of people today, because I think it's one that most of you guys are trying to solve with your data sets.
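
A minimal sketch of that two-step shape, under our own assumptions: the blocking key and the sample records below are invented for illustration, and the toy key stands in for whatever high-recall scheme a real system would use.

```python
from collections import defaultdict
from itertools import combinations

def blocked_pairs(records):
    """Step 1 (blocking): bucket records by a cheap key so we never
    enumerate all N x N pairs. Only pairs sharing a bucket move on to
    the expensive step-2 comparison model (Fern's role)."""
    blocks = defaultdict(list)
    for rec in records:
        key = (rec["last"][:4].lower(), rec["city"].lower())  # toy key
        blocks[key].append(rec)
    for bucket in blocks.values():
        yield from combinations(bucket, 2)

people = [
    {"first": "Rob", "last": "Sylvester", "city": "San Francisco"},
    {"first": "Robert", "last": "Sylvester", "city": "San Francisco"},
    {"first": "Sarah", "last": "Jackson", "city": "Baltimore"},
]
for a, b in blocked_pairs(people):
    print(a["first"], "<->", b["first"])   # only the two Robs pair up
```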

Rob: Most people have data sets of people. If you have, let's say, just [00:10:00] names together, Rob Sylvester and Rob Sylvester, that's a pretty easy one. You probably don't need the fanciest machine learning model in the world. If you have a misspelling, you maybe don't need the fanciest machine learning model in the world there either.

Rob: Maybe a perfect string comparison wouldn't catch it.

Rob: If you throw in middle names, things get a little hairier. Some of those edit distances won't work out; you can look at substrings and maximum kinds of variation. You start throwing in nicknames, and things get a little more complicated.

Rob: We can start to throw in nickname dictionaries. Those are fun. They don't have the greatest coverage in the world, but maybe you can download a bunch of nickname dictionaries off the internet and try them. Then you start throwing in different alphabets, transliterated alphabets, and all of your edit distances break down; your phonetic edit distances break down.

Rob: That becomes a little more difficult. You have translations that happen. You have different, I live about half of my life down in Columbia. And a lot of people call me

Chris: Roberto

Rob: when I'm down there. That [00:11:00] it's a source of variance that, that I didn't I'll be honest. I underestimated, but it does exist, especially from the, on the end of people that do data entry.

Rob: And then you have, of course, the false positives, names that are pretty similar, but they actually are not the same person. And then you have something that should actually get blocked out of the way. So the question is, how long do you want to sit and spend to try to code up these rules? How are you going to capture all the nicknames, all the phonetic variants, all the typographical errors?

Rob: Standardizations, formatting, different languages. In Latin America and in India, people tend to have multiple first names and last names; in Japanese culture, last names are listed before first names. With all of these sources of variance, pretty soon things get out of control.

Rob: And this is [00:12:00] just names, right? This is just first, middle, last name; try throwing in another 10 or 15 attributes. So when I started working with this problem, I said to myself: what if I had a machine learning model out there that had just seen all of this stuff in its training data? That has seen nicknames, different alphabets, transliterations and alphabetic variants, different versions of names depending on the country where the person is. That has seen

Rob: formatting differences, that has seen common misspellings. Those models exist. They're called large language models. We don't necessarily think about them in that way; we tend to think about them in terms of which words appear in the context of other words. But you know what? Misspellings often [00:13:00] occur in the context of the correct spelling of that word,

Rob: in the same kind of context. Words in different languages often tend to appear together in the same way, and phonetic variants and nicknames tend to appear in the same contexts. And when I made this realization, I thought, holy hell, if we get the prompts right, and we find the right way to ask for these probabilities,

Rob: we can capture all of this right off the bat, and our customers don't need to come in and code rules for any of it. We don't need to go in there and say: this edit distance, this fuzzy match, first name exact, last name exact, transliterate this name, approximate distance on first name can be this, look up this nickname table.

Rob: Get it all, and get it for free. [00:14:00] And that's what modern machine learning, generative AI models, are good at, if you use them the right way. Now, maybe that's not super clear with something like a name, but tell me how you are going to differentiate two product descriptions on Amazon, paragraph-long product descriptions.

Rob: How are you going to get that right? This is where entity resolution with modern AI, large language models, transformers, sentence transformers, these kinds of vector-based representations of text, tends to shine, because they unlock patterns that we haven't been able to code for traditionally.

Rob: So under the hood, Fern is built almost entirely with large language models and these different kinds of transformers. Chris, do we want to stop and ask any questions [00:15:00] before we start diving into the real nitty-gritty?

Chris: Right now, let's just dive in. There are no questions; I think you've got people just waiting.

Chris: Hey, actually, I have one question. I know we were talking about a person, but is this applicable to organizations as well?

Rob: Can you repeat that?

Chris: Currently we are discussing the person entity, correct? Will this be applicable to the organization entity as well?

Rob: Yeah. We won't talk too much about the organization Fern model; there is an organization Fern model, by the way. I'm not sure if that's been publicly announced, but there you go. And it turns out that the contact for the organization model tends to not be super important in differentiating whether or not two organizations are the same, though maybe for some customers it is.

Rob: But the same strategies are used, right? A [00:16:00] contact point could be Rob Sylvester for this organization, and the same strategies, the same kinds of prompts, can be used. The organization name itself can be treated the same way as what we do here with a person's name, comparing organization name to organization name. That's a good way to think about it, because it is similar.

Rob: The prompts are different, and some of the features are different, but at a high level it's the same.

Chris: Quickly, I would think that the first example is actually really hard: it looks like the same person, but it's two different people. We'll probably get there; you'll show us that.

Rob: Okay. Yeah.

Rob: Absolutely. All right. So

Rob: Let's dive in a little more. If it wasn't obvious, by the way, I'm not a designer; I'm definitely an engineer. So if you're wondering why my slides and my colors look really stupid, it's because I made them. We'll have to deal with that. We'll all be engineers today.

Rob: So here's an example of what a large language model gives us. I look at the attributes for a Reltio entity: prefix, first name, middle [00:17:00] name, last name, and suffix. And I put them all together, Dr. Sarah Jackson, Jr., something like that, and I say to myself, how similar is this person, this group of attributes put together, to this other person?

Rob: So I actually load this into Fern, not talking about an entity model, just asking: give me a large language model representation of the similarity between these names. The name that I'm showing here in the name-one prefix says Alexander Ivanov, in Cyrillic, in Russian. How similar is that name to this candidate, right?

Rob: And the first candidate is Dr. Sarah Jackson. Now, I've redacted the suffix; I'm not going to give you guys how we put these prompts together. I'm also not going to tell you how we turn them into scores. Because obviously, if you guys run things like ChatGPT, you don't see numbers come out, right?

Rob: [00:18:00] Words come out. But anybody that works with these models knows that what they actually do under the hood is give you probabilities of words. You can pull out those probabilities and do things with them. Now, this is when the moment hit me last year, when I said, okay, this is powerful. I was messing around with these models.

Rob: And I said, I'm going to take this name, Alexander Ivanov, and I'm going to slide it from a name that looks nothing like that to a name that looks a lot like that. And I'm going to ask something similar to, how likely is it that these are the same people? And you can watch these, if you look over on that table on the right, watch what happens.

Rob: Sarah Jackson sort of transitions from being an American female name to a Russian male name, and you get a [00:19:00] distribution that tends to reflect that proximity, that semantic proximity. It's capturing a lot under the hood here: it's capturing gender, it's capturing phonetics, it's capturing frequency, it's capturing nicknames.
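
A hedged sketch of the general trick Rob is describing: rather than reading an LLM's generated words, read the probabilities it assigns to a yes-or-no answer about a pair of names. The model choice, prompt wording, and probability-to-score mapping below are all illustrative assumptions; Rob explicitly keeps Fern's actual prompts and scoring redacted.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any small instruction-tuned causal LM works for the demo; this checkpoint
# is an assumption, not the model Fern uses.
MODEL = "Qwen/Qwen2.5-0.5B-Instruct"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

def name_similarity(name_a: str, name_b: str) -> float:
    # Illustrative prompt; the real prompts are patent pending.
    prompt = (f"Could '{name_a}' and '{name_b}' refer to the same person? "
              f"Answer yes or no.\nAnswer:")
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]   # next-token logits
    probs = torch.softmax(logits, dim=-1)
    yes_id = tok(" yes", add_special_tokens=False).input_ids[0]
    no_id = tok(" no", add_special_tokens=False).input_ids[0]
    # Normalize the two answer probabilities into one sliding score.
    return float(probs[yes_id] / (probs[yes_id] + probs[no_id]))

print(name_similarity("Александр Иванов", "Alexander Ivanov"))  # high
print(name_similarity("Sarah Jackson", "Александр Иванов"))     # low
```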

Rob: This is a powerful realization, because it means that you get something for free, something that would otherwise be very hard to get. If I were to ask you guys to make a machine learning model that could give you this sort of distribution, this smooth, real distance that behaves the right way,

Rob: it would be very difficult to make. And large language models give you this, if you use them the right way, almost for free. Now, does it work with other attributes? Yes, of course it does. These things are trained on reading most of the internet. I grew up in Juneau, Alaska; I [00:20:00] don't live there anymore. So I said, let's give an example with an address object: we will slide, for example, an address in Juneau, Alaska, 123 Depression Boulevard, and transition it to some address in Baltimore, and you get the same resulting distribution for, are these two addresses referring to the same location. That's not the actual prompt; there are actually a few different tweaks under the hood. I can get into some of them, and I will a little bit, but some of these things are patent pending and I won't get into those, though I'll answer the best that I can.

Rob: The important part here is to realize that if you look at this table, you get the same thing: the same sliding distribution, going from something that's nowhere close to something very close. And the nice thing about a distribution that behaves well is that we can use these as attribute scores. As data stewards, we can set a threshold and say, okay, show me [00:21:00] 80 percent, show me 70 percent, show me 90 percent on something like addresses.

Rob: And you get something that behaves the right way. So I'm going to stop here again. Let's see, I saw 10 or so chats pop in. Chris, do we have anything pressing or similar questions people are asking?

Chris: We do have several questions. Let's get to a couple of them. There's some thumbs up on this one.

Chris: Is the primary use case for this to operate in a batch mode, like a warehousing environment, or can this be leveraged in a real-time environment, where you have to compare incoming transactions to an existing data set of millions of records and find the match in real time, search-before-create, like SBC?

Rob: Yeah, so we'll get to this in a little bit, but basically the flexible part of Flexible Entity Resolution Networks is the set of switches on that trade-off. A lot of it is the switches on that trade-off of: how big of a model are we running? [00:22:00] Are we running this in batch, where we don't care if it takes 500 milliseconds to resolve an entity?

Rob: Or are we aiming to use smaller LLMs, smaller sets of features, a smaller number of slices, those kinds of tweaks, so that we can get a much, much faster result? So the way that Fern actually works is it takes all of these different, let's call them features, these LLM features, as well as a few that are not LLM-based, and combines them together to make one score for an entity.

Rob: Now, the way that combination works is very customizable. Very customizable. We can make it fast, or really slow but more accurate. We can do it for this entity; we can do it for an organization entity. We can do it for single attributes or groups of attributes. We can do it for some customer that comes to us and says, I have a very specialized [00:23:00] use case.

Rob: The reason we call it a Flexible Entity Resolution Network is that, unlike a lot of traditional machine learning models, this address feature that I'm using here, I can pick it up and throw it at, for example, two organizations. Because it's a Reltio address object, right?

Rob: It's standardized. Two addresses are two addresses, period, whether they're addresses for resolving some guy's work, or for two organizations, or for where some piece of mail is going. A large language model doesn't care, and an entity resolution model doesn't really care. So that sort of plug-and-play modular approach is a lot of where the flexible piece comes in.

Rob: Do we want to answer one more?

Chris: Sure. Are there any benchmarks on things the traditional approaches couldn't do, but this new framework using LLMs has either solved or improved the accuracy of resolution on? And what's the largest dataset? [00:24:00]

Rob: Yeah, so we run a lot of our benchmarks on real data sets; we have partnerships with customers whose data we're able to actually look at. I'm not going to show you guys customer data set scoring metrics, but basically we look at this under two lenses.

Rob: One is aggregated totals on match rates and consolidation rates, latency, costs, things like that. The other is specific, catered types of matches, the things that people have traditionally struggled with:

Rob: non-English characters, for example. People have traditionally struggled a lot with nicknames. People have traditionally struggled with long text descriptions, [00:25:00] like product descriptions. People really struggle with organization names. So what we do is build out specific evaluation data sets around just those, and we try to solve them with Fern, or large language models, or whatever.

Rob: And then we also try to solve them with traditional string matching and whatnot. That ends up being what motivates us to use the LLM, because with something like a product description, you just can't get anywhere otherwise. Hey, buddy.

Suchen: Yeah. So basically, one of the ways to look at this is: imagine you have a pile of potential matches, say 20,000 potential matches, currently sitting in your queue.

Suchen: If you enable Fern, what we have seen is that it provides a score for every single potential match you have there. Without it, your data stewards [00:26:00] are required to go through every single one of those 20,000 to resolve them, and spend time on every single pair to decide whether it's a match or not a match.

Suchen: With Fern providing a score for those potential matches, suddenly you have 7,000 of them getting a very high score. Now your data stewards can go after those records first, thereby basically improving their productivity, right? So that has been one of the

Suchen: popular use cases with our early adopters: for the existing match pairs, you get to segregate them and go after the ones which Fern thinks have a very high chance of getting merged. And on the flip side, the ones which have generated a very low score, you can look at as well and quickly mark as not a match.

Suchen: So you can clear your potential-matches queue pretty quickly, right? So [00:27:00] that is another way of looking at the benefit of enabling Fern right away, whether you are in the early stages of development or already in production.
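
A tiny sketch of the steward-queue triage Suchen describes: scored pairs split into a merge-first band, a quick not-a-match band, and a middle band for careful review. The threshold values are examples a steward might pick, not product defaults.

```python
def triage(scored_pairs, hi=0.9, lo=0.2):
    """Split scored potential matches into three work queues."""
    merge_first = [p for p in scored_pairs if p["score"] >= hi]
    not_a_match = [p for p in scored_pairs if p["score"] <= lo]
    needs_review = [p for p in scored_pairs if lo < p["score"] < hi]
    return merge_first, needs_review, not_a_match

queue = [
    {"pair": ("entity-1", "entity-2"), "score": 0.97},
    {"pair": ("entity-3", "entity-4"), "score": 0.56},
    {"pair": ("entity-5", "entity-6"), "score": 0.08},
]
merge, review, clear = triage(queue)
print(len(merge), len(review), len(clear))   # -> 1 1 1
```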

Rob: Alright. Let's tackle a few more slides here and then jump back into questions in a bit.

Rob: Because I was looking at a few of them, and I think I'm going to capture some of this stuff in the next couple of slides.

Chris: Perfect.

Rob: How does it actually work? You guys asked this a lot. Last week there were about five to six questions on: what are the actual features, what are the actual models, what are the actual LLMs? I'll try to answer that here without giving away what's patent pending.

Rob: Okay. I'm not going to tell you which LLMs, but instruction-tuned LLMs are used for Fern. If you're a data scientist, instruction-tuned LLMs are very useful because they're really good at answering new kinds of questions; they're basically [00:28:00] made to answer questions that they haven't seen before.

Rob: And that's very useful for something like entity resolution, because chances are these LLMs weren't specially trained to resolve entities, but an instruction-tuned LLM is going to be better at it than, um, ChatGPT. You can't really take the word probabilities out of ChatGPT, but even ChatGPT does not actually do a great job at being used to resolve

Rob: entities or produce resolution scores, which is strange. A lot of people think, oh, ChatGPT, that's the best one there is, right? No, it's not. There are LLMs that are a lot better at it, and which LLM you choose is quite variable as well. One of the questions was: how do we sit on that performance versus accuracy trade-off?

Rob: For people that want to run in batch, you can use a much bigger and slower LLM. For people that need real-time performance, you can run a much smaller LLM, or you can even make small neural networks that approximate LLMs, which is something that [00:29:00] we've applied for a patent on. And I'll show a little demo of it a little later today,

Rob: in this presentation. On top of the LLMs, there are other sentence transformer features. Sentence transformers are like LLMs: they're modern AI models that read pieces of text and output numbers. Those numbers represent everything from how the text is written, the syntactic, morphological stuff, to what it means, the semantic stuff.

Rob: These are very useful because some sentence transformers are very good at things like different languages, and some are very good at things like long pieces of text. So for something like a product description, it can help a lot to add in some of these additional machine learning models.
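
A minimal sketch of the sentence-transformer idea for long text such as product descriptions: embed each description and compare the vectors. The checkpoint below is a public model chosen for illustration, not one Fern is known to use.

```python
from sentence_transformers import SentenceTransformer, util

# Public multilingual checkpoint, for illustration only.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

desc_a = "Stainless steel insulated water bottle, 32 oz, keeps drinks cold 24 hours"
desc_b = "32-ounce vacuum-insulated steel bottle with 24-hour cold retention"

emb = model.encode([desc_a, desc_b], convert_to_tensor=True)
score = util.cos_sim(emb[0], emb[1]).item()     # cosine similarity
print(f"description similarity: {score:.2f}")   # high despite different wording
```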

Rob: So the way that Fern works under the hood is that we have a combination of several of these LLMs that are delicately prompted, and some of these sentence transformers that swap in and out based on the entities, based [00:30:00] on the languages that are used, based on the length of the text, and based on the performance characteristics that we need and how many requests are sent.

Rob: And we do still have a few traditional, simple machine learning features, things like edit distances, but not many; for the most part, those get turned off. So what happens then? I've shown you guys a full-name model, where we slid two names back and forth and said, okay, Sarah Jackson and Alexander Ivanov, 84%.

Rob: Great. And we have an address model: okay, here is an address in Juneau, Alaska, and an address in Baltimore, Maryland, 63%. Now consider, for example, an individual entity: a person has a name, an address, a phone number, an email, all that. So what do you do when you have

Rob: different scores for these attributes: [00:31:00] name score, some number; address score, some number; phone score, some number; identifier score, some number; email score, some number? That's not super useful. You need one number, right? One final number that tells you if Entity A and Entity B are the same thing.

Rob: There's a final model Fern uses, and this is where most of the customization and work going on right now is: this custom scoring model. The earlier pieces are pretty interchangeable; we can swap them out between organization entities, HCP entities, HCO entities, product entities, attribute models.

Rob: It doesn't really matter, but how we combine them together to make one entity score is pretty critical. And there's a reason for it. A lot of people have asked us, if you have an address and you have a name, can't you just take like a weighted average of those two things, right?

Rob: Let's say we had a model for [00:32:00] humans, and the name scored 50 percent and the address scored 30 percent; average them, 40 percent, that's the score. It turns out that doesn't work. It actually fails spectacularly, and that's the central theme of machine learning: most of the reason we do this stuff is that weighted averages don't work, and we want to find patterns that we couldn't find before.

Rob: And it's really hard, not just in Reltio, but anywhere. It would be really hard to take something like a name and an address, just those two, and combine them together to make a score with just rules, with just weights, for example: okay, fuzzy address plus fuzzy name, or exact name plus fuzzy address, [00:33:00] or score on address is this much and name is this much.

Rob: A machine learning guy, an academic, will tell you simply: you can't fit a curve that way. So what we decided with Fern last year was that instead of asking customers to come out here and give these weighted averages, saying, okay, I want the name to score this much, I want the address to score this much, the phone to score this much,

Rob: and then, in these cases, the phone switches back to this much, pretty soon you just have a big old bowl of soup, a whole bunch of if statements, and it spirals out of control. So what we decided was: we would do all that for you. We would fit a model on top that would take all of these scores and give you what we think is the best single score, because it can do things like represent simple curves.

Rob: And that's what this horribly drawn little pink line [00:34:00] here is. I realize this is a bit technical, but it's a very important piece to understand, as data stewards and as anybody working on machine learning for entity resolution: you will not fit a curve with a bunch of boolean logic, you will not fit a curve with a bunch of weighted averages, you will not fit a curve with a bunch of linear models.

Rob: This is why, in a sentence, machine learning is used for entity resolution: to approximate a really difficult space. And this is 2D, just addresses and names. Throw in phones and emails and identifiers and birthdates, and that curve gets really complex. You're never going to code for it, even if you write thousands and thousands of lines of if-else statements and weights. You won't do it.
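
A toy illustration of that curve-fitting point: train a small nonlinear scoring head over per-attribute scores and compare it with a fixed weighted average. The synthetic labeling rule below is an assumption invented for the demo; Fern's actual scoring model is not public.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(5000, 2))   # columns: [name_score, address_score]

# Hypothetical ground truth with an interaction no weighted average can
# express: a near-perfect name carries a weak address; otherwise both
# attributes must be strong.
y = ((X[:, 0] > 0.95) | ((X[:, 0] > 0.7) & (X[:, 1] > 0.7))).astype(int)

head = MLPClassifier(hidden_layer_sizes=(16, 16), max_iter=3000,
                     random_state=0).fit(X, y)

for name_s, addr_s in [(0.97, 0.20), (0.60, 0.60)]:
    avg = 0.5 * name_s + 0.5 * addr_s                 # fixed weights
    p = head.predict_proba([[name_s, addr_s]])[0, 1]  # learned curve
    print(f"name={name_s} addr={addr_s} avg={avg:.2f} model={p:.2f}")
# The weighted average ranks the two pairs almost identically; the learned
# head separates them the way the scenario labels demand.
```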

Rob: I want to get your opinion on that. And this isn't a rhetorical [00:35:00] question; I actually want to get somebody's opinion here. Where would you score these two people? And by the way, this is not entirely me. I do have an Alaskan phone number.

Rob: Almost all Alaskans have 907 phone numbers, by the way; a large language model would know that, for what that's worth, but a non-Alaskan might not. We also almost all have 574 Social Security numbers, and that's also not my Social Security number. It's also not my birth date, but it is my birth year. I would want to know, if you were a data steward, where you would put this.

Rob: Where would you want this to score, between 0 percent and 100%?

Chris: Post it in the chat.

Rob: Rob Sylvester: there are probably not a lot of Rob Sylvesters out in Medellín; I've never met one. So that's just a common name. Got the same phone number; what's the likelihood that somebody has the same phone number and the same name? A phone number is a [00:36:00] very good identifier, right?

Rob: But at the same time, look at that birthday. The birth year is the same, and there's a one-zero, one-zero, and then you start thinking: I've signed up for a bunch of free trials before where I just put in my birthday as January 1st, because I don't want to actually give them my birthday.

Rob: A lot of people do that. A large language model knows that, because it sees it a lot. It also knows that sometimes formatting differences can be there; maybe the person really meant 10/10 when they wrote 1/1, right? The addresses are nowhere close, right? But what if the guy lives in two different places?

Rob: Now imagine trying to code all of that in. Imagine trying to put in all that placeholder info, abbreviations and nicknames. Rob Sylvester in Colombia: people there tend to have two first names and two last names. [00:37:00] This person doesn't; this is first name, middle name, last name. Typographic errors.

Rob: Country codes. Formatting codes. SSNs, which don't really exist if you're in a different country. Frequency statistics on the names: how popular is Sylvester as a last name? Popular. There's actually another Rob Sylvester where I grew up in Alaska. So all the scores

Chris: are 50 percent, 65, 70. Yeah, it's all over the place. Go ahead. Take a look.

Rob: And that's the other piece I want to talk to you about: it's all over the place. This is a tough task, and what one data steward thinks, another data steward might not, and another company has a different profile. Some companies say, more than anything, I need to not miss matches.

Rob: We lose revenue for every match that we don't make. [00:38:00] We want recall. What do we do? We would want those scores higher; we boost them higher because we want the data stewards to see them above that threshold. Other companies would say, oh, over-merging is a major problem for us. You put those scores too high, that costs us money; you've got to do that unmerge. It's like a bank or something: put two people together who are not actually the same person, and that's a nightmare, right?

Rob: Not only do we have subjective opinion on this, and those opinions vary by a little bit, but the actual use cases for the places we work are different.

Rob: So what I've done is I've thrown this into Fern, and I want to throw one more thing at you before we jump into the live demo and spend the rest of the time on questions, which is, look, the way that we build these models.

Rob: This is another piece that's patent pending: we generate examples, we generate [00:39:00] scenarios. We evaluate on customer data from the people that have these partnerships with us. But when it comes to actually building out the models, training on the data, labeling it ourselves, saying, does this look right,

Rob: Does this not look right? We actually generate synthetic data. There are a few reasons for it. The main reason is we can do whatever the hell we want, right? It's very difficult when you have customers in all different clouds. Some say you can train on their data. Some say you can train on it, but not evaluate on it.

Rob: Some say you can train on it, but only we can use it. And then some say you can train on it, you can use it, anybody else can use it, but when we leave, if we leave, then you've got to pull away that data from the model, right? You, you have all of these different scenarios, and it just becomes a little easier to just all work on the same page.

Rob: So we actually use synthetic data that tries to capture scenarios. And when I say scenarios, I mean something like: names the same, [00:40:00] emails not, phone numbers not, and the city is not the same. You say, okay, what's the score going to be? Now, the reason we want to do something like that is that we can then pass these examples to you guys.
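
A small sketch of scenario-style synthetic pair generation along the lines Rob outlines: fix which attributes agree and which differ, then ask where the score should land. The attribute pools and perturbations are invented for illustration; Reltio's actual generator is patent pending.

```python
import random

FIRST = ["Rob", "Robert", "Sarah", "Alexander"]
LAST = ["Sylvester", "Jackson", "Ivanov"]
CITIES = ["San Francisco", "Medellin", "Juneau", "Baltimore"]

def make_pair(scenario, rng):
    """Build one synthetic pair for a named scenario, e.g. names and
    emails the same, phone and city different."""
    first, last = rng.choice(FIRST), rng.choice(LAST)
    a = {
        "name": f"{first} {last}",
        "email": f"{first.lower()}.{last.lower()}@example.com",
        "phone": f"+1-907-555-{rng.randrange(1000, 9999)}",
        "city": rng.choice(CITIES),
    }
    b = dict(a)
    if scenario == "same_name_email_diff_phone_city":
        b["phone"] = f"+57-4-555-{rng.randrange(1000, 9999)}"
        b["city"] = rng.choice([c for c in CITIES if c != a["city"]])
    return scenario, a, b

rng = random.Random(7)
print(make_pair("same_name_email_diff_phone_city", rng))
```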

Rob: And we could say, where do you want these scores to be? And if you score them differently than somebody else, a whole lot differently, then we say, okay, maybe this company gets a different model head for the individual entity. When I say model head, I'm talking about the scoring model here.

Rob: So if you've got a use case that's so different that you guys agree some number should be 0.8 and somebody else says it should be 0.5, that's different enough that maybe you deserve a different kind of model. But we don't have to reinvent the universe there, right? 99.9 percent of that model is already built.

Rob: It's just how we fit those little curves for your use case. That's something a lot of you guys asked about last week: that [00:41:00] customization.

Rob: One customer's entity model and somebody else's entity model might be very different, and not just with respect to how those curves are set, but in the performance profiling. How fast do you need it? How many LLMs do we use, and how many smaller LLMs or smaller transformers? How many slices do we use?

Rob: What happens if you have five different first names and seven different addresses, right? You've got to compare all of these things. So under the hood, we're running variable numbers of sampling and slicing algorithms. Deterministic, for what that's worth. But it's an important piece to capture, because a lot of you guys asked about that last week.

Rob: What about slicing? What about permutation? So if you're looking at this, and you're a customer, and you say: I've been struggling with taking two individual entities and getting a [00:42:00] good score on them; I've been struggling with these two product entities; I've been struggling with these two organization entities;

Rob: I've been struggling with these two location entities, and I keep making these rules and getting a little better coverage, where something goes up and something goes down. This is what you want to be on the lookout for. When we start to GA multiple models, multiple entity types, multiple attribute models for Fern, try these things out and see what you think.

Rob: Because make no mistake, this is where the space is going, right? Everybody is starting to use this. I started building this about a year ago; look at AWS entity resolution, look at Google's entity resolution, look at large language models being used for entity resolution. This is where entity resolution is going, make no mistake, and it's where almost all software is going, right?

Rob: And when you build these models, it becomes very clear why. Maybe it's not very clear why if you just use them, but once you guys start [00:43:00] to use them, you'll start to see. Now, I don't know if we're going to have a lot of time for it, but I want to dive into a demo.

Rob: And I also want to get to some questions. So let's play with the demo real quick, and maybe while I do it, we can fire away some questions. Like I said, I'm not a designer, I'm not a product guy, I don't make PowerPoints; I'm an engineer. And I'm going to give you the nitty-gritty and show you what Fern actually does on my local machine.

Rob: So remember that example I showed you, the two Robs, these guys. I said, what does this model say? I'd be curious. So I load this one up, and that's what's loaded here on my screen. Sorry, I've got to move my window and send it to myself. The model score is about 79%. [00:44:00]

Rob: And this is not a full Fern model; it doesn't fit on my laptop. Like I said, there are a lot of different ways we do it, so there are a few smaller transformers and a few bigger transformers here, but it's the same example. So we score this: it's, let's say, 79.8%. On top of that, it shows a bunch of different scores. Notice there's 0.5 here, because there's some missing information, like the email and the identifier, right? There's no common identifier here, there's no common email, so 50 percent; that makes sense, right? Rob Michael Sylvester versus R. Sylvester, oops, sorry about that, about 86%. City: Medellín is definitely not San Francisco.

Rob: About 86%. City, Medellin is definitely not San Francisco. In many ways, so that's nearly 0%. On top of this, you also have explainability built in to SPHERN. We haven't released a lot of this yet, but this actually, literally gives you an [00:45:00] explanation of what these individual scores were and where they rank, generally speaking.

Rob: I'm not sure we've even announced that either; you guys are learning a lot of fun new stuff today. But that's one piece about generative AI that's nice: you can turn these things on and actually ask for reasons as to why certain scores came out the way they did.

Rob: So it's not just about these scores being right. Sometimes, like I said, it's about how they behave when you slide them around. So I thought we could try a few things. We had our R. Sylvester; let's put in, say, R. M. Sylvester. That would be the equivalent of comparing these people.

Rob: Or we can try something else; give me some shout-outs if you want me to try something, and we'll see what happens. We boosted up to 83.4%, right? And the name score boosted up; obviously all the other ones stayed exactly the same. Cool. [00:46:00] What happens if, I don't know, we take a different Social Security number?

Rob: What was the last social security number? 574 123456. Let's put in 574 265. It

Rob: drops to 56%. That's a weird one, right? This is one where I would want to get your opinions, because it doesn't make a ton of sense that somebody would have the exact same phone number and the same name, or a very similar name, but have a different Social Security number, right?

Rob: If I saw Rob Sylvester, same phone number, in Colombia, and Rob Sylvester, same phone number, in San [00:47:00] Francisco, I'd say it's probably the same guy, more than 50%. Most of us agree; you don't even have to look at these scores. There are people saying 60; some people said 70.

Rob: But if the Social Security number's different, then you'd say, huh, not sure about that one. Fern says 56%. Part of what you want to do when you make these models, obviously, is entity resolution, but another really good use case, I think, even if you don't explicitly code for it, is to find weird data, find anomalies, find things that are not behaving correctly.

Rob: And to me, this is an example of a piece of data that, if I were a data steward, I'd probably want verified by a human in some way. So who is this guy? Same phone number, Rob Sylvester, two different Social Security numbers, what's going on here? Probably an error in the data, right? So that's the kind of use case that you maybe don't get if you try to explicitly code for something with rules, because you're not searching for it. It's something you get [00:48:00] for free when you approximate a space, instead of trying to rigidly poke at it and grab it, which is what you do when you build out rules for something.

Rob: Let me do a quick time check. We've got eight minutes. Let's try to slam through as many questions as we can; I know there's 80 in here.

Chris: Yeah, let's do it. So: I understand we're focused on individual entity resolution; can the technology be used for business names and addresses? I feel like you might have covered that.

Chris: Okay.

Rob: Yeah, we can. This model itself actually has an address model set into it, individual address models. In this case, of course, the address score is really low, because we're talking about Medellín, but it can change its functionality. Ask me another one while I do this.

Chris: Does the model require any initial configuration or training, or is it ready to deploy for life sciences data?

Rob: For life sciences data, the HCP one is almost built. We've got [00:49:00] some features around education and taxonomy that need to be tweaked a little bit, but otherwise it's pretty similar. You could use this. Actually, there is an HCP model built in here; we'll just undo this real quick.

Rob: Nothing like real time. I love it. This one scores 24% on the HCP model. It's a smaller model. Oh, it doesn't have the approximate features; this one doesn't have the phone and the email in there. But it believes, at least on this one, because there's no education, no degree, no taxonomy, that these people are likely not the same person in HCP data.

Rob: Does that make sense? To me, that might be a little low, but at the same time, you have no other information. What's the name of the HCP model that has the phone and the email in it? I can't remember. But anyway, the question was about life sciences.

Rob: Yeah, so the HCP model is pretty much, [00:50:00] it'll be ready to go as far as configuration. You'd probably customize basic performance stuff: do you want the big dog, or do you want the smaller one that runs faster? And that goes back to the batch versus real-time question. Most people in life sciences care about accuracy, less about performance.

Rob: So they go for batch models. They just want something very accurate: run it overnight, come back in, resolve the matches in the morning. Most of the data stewards in life sciences I've talked to are on that page. So when that gets deployed, anybody that wants to use it, feel free.

Chris: Yeah. So: getting semantic proximity without additional attributes, such as gender, which in many use cases are not available, could be a powerful thing. Sorry, that's just a comment.

Rob: It is. It's very powerful. There are a lot of names where the semantic proximity, it's like the LLMs are able to realize they've seen that name in multiple gender [00:51:00] contexts. Try finding a data set of that and loading a rule in for it.

Chris: Several questions around this: is Fern like a plug-in? Can we use it with our existing MDM solutions, or is it something that needs to be set up and turned on in the initial stages of configuration and architecture? Suchen might be the best to answer this.

Rob: Suchen is probably the best for it. But from my side, it's an API call. Instead of hitting some other pre-trained model, instead of the rest of Reltio matching pointing somewhere else, it's just going to point at Fern. All it's going to do is look at the two entities and give you a score back, or write a score somewhere.

Rob: That input-output is the same. Suchen, do you want to give a more detailed answer?

Suchen: Yeah. So there is no additional configuration; it is just the enabling. It's a packaged matching method. Once enabled, you basically just have to say, okay, run my match pairs on Fern in addition to the existing match [00:52:00] rules that you might have.

Suchen: So there is no additional configuration. The only thing that you might have to do is map the attributes. In our Fern model, if we have an attribute called first name and you're calling it something else, you just have to let us know that, say, your "first names", plural, is the same as first name. That kind of mapping, but other than that, there is absolutely no configuration.

Suchen: It is just enabling the model.

Chris: Great. Can we use Fern externally via API request, for example, in the score matches API?

Rob: It runs right now within the Reltio matching engine. So, yes.

Suchen: I can answer that. So the question is basically about our existing APIs that return the score. We have integrated Fern into the core matching engine,

Suchen: so all your existing APIs will work with Fern as well. If you say, give me [00:53:00] Fern-based matching on this pair, using existing APIs like matches, or the scored API, or transitive matches, all of those APIs are compatible with Fern as well, and there is no additional configuration.

Chris: Can you provide some example of response rate on a 5 million record organization database, for example?

Rob: Off the top of my head, no, but we ran a match rebuild for consolidation rate on 2 million entities, and it ran in, I want to say, a few hours. But that was on one model head, and I think we had the cache turned way, way down. Obviously, these models have a cache;

Rob: We're not going to run the same thing 500 times with John Smith. I want to say it ran in a few hours and it was 2 million with 10 million comparisons or something like that.

Suchen: And it depends on your tokenization scheme, right? If you have a large number of common entities to be compared, [00:54:00] it's just going to take a little bit longer, but on the typical data sets that we have seen, the performance is on par with the rule-based matching that you see today.

Suchen: So there is no additional time required for Fern-based matching; that's what we have seen so far. And there are a lot of optimizations that Rob's team is working on. We want to make it even faster than the current approach, right? Because in the future it can do blocking better, so that we don't have to go and unnecessarily compare Sarah with Alexander, as in that example, even though they have no same last name or same address or phone number or whatever it is.

Suchen: So it will have more accurate matching candidates to compare with, and the performance will be even better than what we are seeing today.

Chris: Right. Unfortunately, we're out of time. It sounds like the two of you are open to doing an Ask Me Anything, because we have literally 20-plus questions in this one and the last one. I can capture all of those, and in the next few weeks we can do a live Ask Me Anything session [00:55:00] with both Suchen and Rob, if that's possible.

Rob: Yeah, I'm a hundred percent down for it.

Chris: Great. Thank you, everyone, for coming. I'm going to have to stop us here, unfortunately, but please give us a rating and maybe some other thoughts around future shows after you leave; the Zoom pop-up will ask for your rating and opinions.

Chris: Thank you, everyone, for coming. Lots of great questions. Great job, Rob, on this; this is really cool. Suchen, thanks for your input there. As usual, everyone, thank you so much. We'll see you in another week or so. Thanks, everyone.

Rob: Thanks, guys. I really appreciate everybody jumping in to nerd out with me.

Chris: That was awesome, man. That was really awesome. I enjoyed it. There's lots of stuff we still need to cover and answer, but that's good. Let's do that.

Rob: [00:56:00] Let's do the Q&A session. I think I would really like something like that.

Chris: Yeah, I figured you might. All right, everybody. Take care. Bye bye.


#CommunityWebinar
#Featured
