By Dr. John R. Talburt
As Black Oak Analytics’ team has interacted with companies at the national and international level in 2016, we have seen several trends in entity resolution. The most obvious is that companies are adopting Big Data technologies that exploit the power of distributed processing. At the same time, entity resolution is turning many traditional IT paradigms upside down. For example, identity resolution software seeks to structure and model data when reading from the data store rather than transforming it when writing to the data store. This so-called “data lake” strategy is beginning to supplant the old extract-transform-load (ETL) model. The following trends in entity resolution will continue to drive innovation in Big Data technologies.
New Challenges and Solutions in Entity Resolution
As much as new distributed computing environments like Hadoop MapReduce have enabled improved analytics on massive amounts of data, these technologies are also creating a new set of challenges for entity resolution. Even in a non-distributed environment, entity resolution is an N² problem because it is about comparing and matching pairs of records: N records yield N(N-1)/2 distinct pairs. So even if you only have 1,000 records to process, there are still 499,500 pairs of records to review.
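The pair count above follows from the formula N(N-1)/2. A minimal Python sketch of that arithmetic (the function name pair_count is just an illustrative choice):

```python
from itertools import combinations
from math import comb


def pair_count(n: int) -> int:
    """Number of unordered record pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# 1,000 records give 499,500 candidate pairs, as stated in the text.
assert pair_count(1_000) == comb(1_000, 2) == 499_500

# Enumerating the pairs explicitly gives the same count.
assert sum(1 for _ in combinations(range(1_000), 2)) == 499_500

# The growth is quadratic: ten times the records means roughly
# one hundred times the pairs.
print(pair_count(10_000))
```

Because the count grows with the square of N, brute-force pairwise comparison quickly becomes impractical, which is what motivates the blocking approach.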
The traditional approach has been to cut the input data into subsets called “blocks” and perform entity resolution within each of these smaller blocks. The blocks are generally formed by bringing together records that share certain common values called match keys. In cases where records are associated with more than one match key, the blocks are formed by performing transitive closure on the match key values, so that records linked through a chain of shared keys end up in the same block.
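As a rough illustration of forming blocks by transitive closure over match keys, here is a minimal single-machine sketch using a union-find structure. The record IDs, the match key strings, and the function name build_blocks are hypothetical, and this is one common in-memory technique rather than the method of any particular product.

```python
from collections import defaultdict


def build_blocks(record_keys: dict[str, set[str]]) -> list[set[str]]:
    """Group records into blocks by transitive closure over shared match keys."""
    parent = {rid: rid for rid in record_keys}

    def find(x: str) -> str:
        # Follow parent links to the representative, shortening the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a: str, b: str) -> None:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    # Link each record to the first record seen with the same match key.
    first_seen: dict[str, str] = {}
    for rid, keys in record_keys.items():
        for key in keys:
            if key in first_seen:
                union(rid, first_seen[key])
            else:
                first_seen[key] = rid

    blocks: dict[str, set[str]] = defaultdict(set)
    for rid in record_keys:
        blocks[find(rid)].add(rid)
    return sorted(blocks.values(), key=min)


# Hypothetical records, each with one or more match keys.
example_keys = {
    "r1": {"SMITH|72204", "PHONE|5015550100"},
    "r2": {"PHONE|5015550100", "EMAIL|js@example.com"},
    "r3": {"EMAIL|js@example.com"},
    "r4": {"JONES|10001"},
}
blocks = build_blocks(example_keys)
print(blocks)
```

Here r1 and r3 share no match key directly, but both share one with r2, so all three land in the same block while r4 stays alone.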
Traditional algorithms for transitive closure assume all the match keys can be processed in a single memory space, but this is not possible for large datasets in new distributed environments. For this reason, one of the important trends in entity resolution is developing new algorithms for transitive closure that work in distributed processing environments like Hadoop MapReduce. Most of the new transitive closure algorithms being used to solve the problem take an iterative approach.
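To give a feel for what an iterative approach can look like, the sketch below simulates, in a single Python process, a minimum-label propagation scheme in which each pass resembles one map and reduce round. This is an assumption-laden illustration and not necessarily one of the algorithms referred to above; the function name iterative_closure and the sample data are hypothetical.

```python
def iterative_closure(record_keys: dict[str, set[str]]) -> tuple[dict[str, str], int]:
    """Label every record with the smallest record ID reachable via shared keys.

    Each loop iteration stands in for one distributed round; the loop stops
    when a full round changes no label.
    """
    labels = {rid: rid for rid in record_keys}
    rounds = 0
    while True:
        rounds += 1
        # Round part 1 (group by match key): smallest label seen per key.
        key_min: dict[str, str] = {}
        for rid, keys in record_keys.items():
            for key in keys:
                key_min[key] = min(key_min.get(key, labels[rid]), labels[rid])
        # Round part 2 (group by record): adopt the smallest label on any key.
        changed = False
        for rid, keys in record_keys.items():
            best = min([labels[rid], *(key_min[k] for k in keys)])
            if best != labels[rid]:
                labels[rid] = best
                changed = True
        if not changed:
            return labels, rounds


# Hypothetical chain: r1-r2 share A, r2-r3 share B, r3-r4 share C.
chain = {
    "r1": {"A"},
    "r2": {"A", "B"},
    "r3": {"B", "C"},
    "r4": {"C"},
    "r5": {"D"},
}
labels, rounds = iterative_closure(chain)
print(labels, rounds)
```

No round needs the whole dataset in one memory space, only the records that share a key, which is why schemes of this kind fit distributed frameworks. The cost is that long chains of linked records take several rounds to converge.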
Another trend is the incorporation of non-traditional matching attributes and the creation of identity personas. A persona is the expression of your identity within a particular context. For example, I may have one persona using my legal name and address when I apply for a job, but I may have a “handle” I use when I am on a social media site, and these handles may be different for each site. Even now, under certain credit card plans designed to protect users from fraud, when I purchase an item in a store, the point-of-sale terminal may record only my name and an anonymous token in place of my credit card number.
This has pushed identity resolution software to broaden the scope of identity attributes from traditional name and address fields to include things like email addresses, IP addresses, social media handles, and anonymous tokens. In addition, many new entity resolution systems are being designed to keep and manage these incomplete identities, such as a Twitter handle with an IP address, as persistent personas in the hope that they can later be connected with other personas to create a more complete identity structure.
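One way to picture a persistent persona is as a small bag of typed identity attributes that is stored even when incomplete and merged once a later persona shares an attribute with it. The sketch below is purely illustrative: the Persona class, its methods, the attribute names, and the sample values are all hypothetical, and a real system would weigh how reliable each shared attribute is before linking.

```python
from dataclasses import dataclass, field


@dataclass
class Persona:
    """A partial identity: attribute kind -> set of observed values."""
    persona_id: str
    attributes: dict[str, set[str]] = field(default_factory=dict)

    def add(self, kind: str, value: str) -> None:
        self.attributes.setdefault(kind, set()).add(value.lower())

    def shares_attribute_with(self, other: "Persona") -> bool:
        # True if any attribute kind has at least one value in common.
        return any(
            values & other.attributes.get(kind, set())
            for kind, values in self.attributes.items()
        )


def merge(a: Persona, b: Persona) -> Persona:
    """Combine two personas into a fuller identity structure (keeps a's ID)."""
    merged = Persona(persona_id=a.persona_id)
    for source in (a, b):
        for kind, values in source.attributes.items():
            merged.attributes.setdefault(kind, set()).update(values)
    return merged


# An incomplete persona captured first: a handle plus an IP address.
early = Persona("p1")
early.add("twitter_handle", "@example_user")
early.add("ip_address", "203.0.113.7")

# A later persona that happens to carry the same handle and an email.
later = Persona("p2")
later.add("twitter_handle", "@Example_User")
later.add("email", "user@example.com")

if early.shares_attribute_with(later):
    combined = merge(early, later)
    print(sorted(combined.attributes))
```

Keeping the early persona around, rather than discarding it as unmatched, is what makes the later connection possible.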
Another emerging trend in entity resolution is replacing the process of building matching rules by hand with smart machines that create matching rules using machine learning. This trend started when scoring-based probabilistic rules began to replace traditional if-then (deterministic) rules. It is now expanding into the realm of machine learning, using techniques such as support vector machines, neural networks, and rough sets.
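The difference between a deterministic rule and a scoring-based rule can be sketched as follows. The attribute names, the agreement and disagreement weights, and the threshold are invented for illustration only; in a probabilistic or machine-learned system they would be estimated from data rather than set by hand.

```python
def deterministic_match(a: dict, b: dict) -> bool:
    """If-then rule: match only when last name AND zip agree exactly."""
    return a["last_name"] == b["last_name"] and a["zip"] == b["zip"]


# Hypothetical (agreement, disagreement) weights per attribute.
WEIGHTS = {
    "last_name": (4.0, -2.0),
    "first_name": (3.0, -1.5),
    "zip": (2.0, -1.0),
    "email": (6.0, -0.5),
}
THRESHOLD = 7.0  # hypothetical cut-off for declaring a match


def match_score(a: dict, b: dict) -> float:
    """Sum agreement or disagreement weights over attributes both records have."""
    score = 0.0
    for attr, (agree, disagree) in WEIGHTS.items():
        if a.get(attr) is None or b.get(attr) is None:
            continue  # a missing value contributes no evidence either way
        score += agree if a[attr] == b[attr] else disagree
    return score


rec_a = {"last_name": "smith", "first_name": "jon", "zip": "72204",
         "email": "js@example.com"}
rec_b = {"last_name": "smyth", "first_name": "jon", "zip": "72204",
         "email": "js@example.com"}

print(deterministic_match(rec_a, rec_b))       # the spelling variant blocks the rule
print(match_score(rec_a, rec_b) >= THRESHOLD)  # other evidence outweighs it
```

The if-then rule rejects the pair because of a single spelling variation, while the scoring rule lets agreement on other attributes compensate. Machine learning techniques extend this idea by learning the weights, or a more flexible decision boundary, from labeled pairs.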
A more technical issue is indexing, which is used to create the blocks as described above.
Chief Data Officer
One of the latest trends in entity resolution is the rise of the Chief Data Officer (CDO) in all sectors. What do CDOs have to do with entity resolution? One of a CDO’s primary functions is to leverage an organization’s data to create competitive advantage for the business, and a first step in that role is understanding the organization’s data quality, including its matching processes.
If you would like to learn more about how your business can benefit from implementing the latest trends in entity resolution, contact Black Oak Analytics today. Black Oak Analytics employs High Performance Entity Resolution (HiPER), an entity identity information management (EIIM) system that allows you to effectively identify and market to a targeted audience with increased efficiency through entity matching and resolution.