Entity Resolution or Identity Resolution? Here Is How We Distinguish the Two.

By Dr. John R. Talburt

Does your organization use data for analytics? Are you trying to? Does your company use internal or external data for marketing campaigns, cross-sell opportunities, or customer analysis of any kind? If your business, institution, or organization is involved in data management or data integration, key people in the organization should understand how data is brought into the IT systems, merged, stored, and managed before use. One part of the data management lifecycle is a process called entity resolution or identity resolution, but there is some confusion in the industry about what these terms mean.

An Example of Entity Resolution vs. Identity Resolution

Entity resolution and identity resolution are terms associated with record linkage, data matching, and deduplication. Even though these terms are often used interchangeably, there is a subtle but important difference in meaning. The difference revolves around the use of the words “entity” and “identity.”

In an entity resolution context, “entity” is defined more narrowly than in general use. It denotes a real-world object that is distinguishable from other objects of the same type, i.e., an object with a distinct identity. Such entities, including customers, patients, students, organizations, products, events, and locations, are often the subject of master data management (MDM).

Here is where the nuance comes in with the word “identity.” In particular, it is the idea of “known identity” versus “distinct identity.”

Perhaps this is best illustrated by a simple example related to crime solving. Suppose there has been a burglary at a business. The police investigate and find a number of fingerprints. When the police laboratory examines the prints, they discover that there are two distinct sets of fingerprints from two different people. We know the burglars are people, and these people have distinct identities, but at this point the police don’t actually “know” those identities. However, after sending the fingerprints to the FBI, the burglars are identified because their prints are in a database of previously convicted criminals.

So, let’s put this story in the context of data integration. The fingerprints are “references” to the persons who left them. The fingerprints are not themselves persons, but only references to persons. Similarly, the records we create in information systems to describe entities such as customers or patients are simply references to those customers or patients, not the actual persons. Entity resolution is the process of determining whether two entity references are for the same entity or for different entities. So, in our story, what the police did in their local laboratory was entity resolution. They sorted the fingerprints into two groups referencing two different people.
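
The sorting step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record fields (`id`, `name`, `dob`) and the match rule in `same_entity` are hypothetical, and a real system would use far more careful matching logic. The point is that the output is a set of groups, not a set of names.

```python
from itertools import combinations

def normalize(value):
    """Lowercase and collapse whitespace so trivial differences do not block a match."""
    return " ".join(value.lower().split())

def same_entity(a, b):
    """Hypothetical match rule: same normalized name and same date of birth."""
    return normalize(a["name"]) == normalize(b["name"]) and a["dob"] == b["dob"]

def resolve_entities(references, match=same_entity):
    """Group references into clusters, one per distinct (but still unnamed) entity."""
    parent = list(range(len(references)))

    def find(i):
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare every pair of references and merge the clusters of matching pairs.
    for i, j in combinations(range(len(references)), 2):
        if match(references[i], references[j]):
            parent[find(j)] = find(i)

    clusters = {}
    for i, ref in enumerate(references):
        clusters.setdefault(find(i), []).append(ref["id"])
    return list(clusters.values())

# Made-up references for illustration only.
references = [
    {"id": "r1", "name": "Mary  Jones", "dob": "1980-02-14"},
    {"id": "r2", "name": "mary jones", "dob": "1980-02-14"},
    {"id": "r3", "name": "Robert Lee", "dob": "1975-09-30"},
]
clusters = resolve_entities(references)
print(clusters)
```

Here `r1` and `r2` land in one group and `r3` in another. Like the police laboratory, the code knows there are two distinct entities without knowing who they are.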

On the other hand, identity resolution is resolving an entity reference against a collection of known identities. In our crime story, the FBI performed identity resolution. They took the fingerprints from the police and matched them against fingerprints of known criminals, i.e. known identities. Identity resolution is sometimes referred to as “recognition” as in “customer recognition.” From this perspective, identity resolution can be considered a special case of entity resolution in which one of the two references being resolved is from a known identity.
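
As a rough sketch of the same idea in code, identity resolution looks up a single reference in a collection of known identities. Everything here is assumed for illustration: the `print_code` field stands in for whatever attribute is compared, and exact equality stands in for a real matching rule.

```python
# Hypothetical collection of known identities (the FBI database in the story).
KNOWN_IDENTITIES = [
    {"identity_id": "ID-001", "name": "Alex Doe", "print_code": "A7F3"},
    {"identity_id": "ID-002", "name": "Sam Roe", "print_code": "C19B"},
]

def resolve_identity(reference, known_identities):
    """Return the known identity the reference matches, or None if it is not recognized."""
    for identity in known_identities:
        if identity["print_code"] == reference["print_code"]:
            return identity
    return None

recognized = resolve_identity({"print_code": "C19B"}, KNOWN_IDENTITIES)
unrecognized = resolve_identity({"print_code": "ZZZZ"}, KNOWN_IDENTITIES)
print(recognized, unrecognized)
```

One side of every comparison is always a known identity, which is what makes this a special case of entity resolution.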

How the Burglar Story Relates to Data Management

To come full circle, let’s cast the burglary story into the world of data management. Suppose data are being collected from website visits and the only identity information collected from the visitor is a personal email address. The email address is a reference to the visitor. These references can be easily sorted into distinct groups (entity resolution), but the email address itself does not give us the identity of the visitor. However, if the company has a master customer list in which one of the attributes is an email address, it may be discovered that some of the visitors are known customers (identity resolution).
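
The website example can be sketched as two separate steps. This is a minimal illustration, assuming hypothetical field names (`visit_id`, `email`, `customer_id`) and treating a case-insensitive, whitespace-trimmed email address as the only matching attribute.

```python
from collections import defaultdict

def clean_email(email):
    """Trim and lowercase so formatting differences do not split a group."""
    return email.strip().lower()

def group_visits_by_email(visits):
    """Entity resolution: sort visit references into groups, one per distinct email."""
    groups = defaultdict(list)
    for visit in visits:
        groups[clean_email(visit["email"])].append(visit["visit_id"])
    return dict(groups)

def recognize_customers(groups, master_customers):
    """Identity resolution: look up each group in the master customer list."""
    by_email = {clean_email(c["email"]): c["customer_id"] for c in master_customers}
    # Groups with no matching customer stay unidentified (None).
    return {email: by_email.get(email) for email in groups}

# Made-up data for illustration only.
visits = [
    {"visit_id": "v1", "email": "Pat@Example.com"},
    {"visit_id": "v2", "email": "pat@example.com "},
    {"visit_id": "v3", "email": "lee@example.org"},
]
master_customers = [{"customer_id": "C-100", "email": "pat@example.com"}]

groups = group_visits_by_email(visits)
recognized = recognize_customers(groups, master_customers)
print(groups)
print(recognized)
```

The first step yields two groups of visits. Only the second step attaches a known customer to one of them, and the other group remains a distinct but unidentified visitor.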

A couple of notes here. First, fingerprints are a much more distinctive reference than an email address, since different people could use the same email address. However, the same principles of the story still apply. Second, a collection of entity references for which the known identity has not yet been determined is sometimes called a “persona.” A persona is a person’s identity projected into a specific context. Personas are often aggregated based on an alias or handle in social media, such as a Twitter handle; an anonymous token in financial transactions, such as a digital wallet; or, in some cases, a device identifier.

Entity Resolution in Healthcare: Master Patient Index (MPI)

An industry example is healthcare, a field in which Black Oak has worked for the past several years. Healthcare institutions have (or should have; if yours doesn’t, call us) data governance and master data management policies around an enterprise master patient index (MPI). In a field as regulated as healthcare, they must be sure that every unique identifier refers to one and only one individual. Splitting or wrongly merging patient data can result in many adverse events, in addition to fines and legal entanglements. While the MPI itself undergoes identity resolution, it is built from data sources where entity resolution is required, because social security numbers cannot be used on every data source. Healthcare deals with three primary types of data:

  • Patient data includes identity information as well as medical history, billing notes, etc.
  • Provider data includes the data from external sources, specifically whoever is insuring the patient and can include personally identifiable information as well as other unique IDs such as policy number, Medicaid ID, etc.
  • Supplier data is more ambiguous and refers to all other sources of information. This might be billing, customer service, biomedical devices, or any number of other things.

Because the matching involved is not limited to matching unknown references against known identities, the process used to develop and maintain an MPI is an entity resolution process, not just an identity resolution process.

If you are interested in improving your data quality, building or upgrading an existing enterprise Master Patient Index system, or decoupling your matching from an existing MDM platform through HiPER software, contact Black Oak Analytics today!
