Architectures for Entity Resolution: Part 2
Wednesday, March 10th, 2010
By John Talburt, PhD, CDMP, Director, UALR Laboratory for Advanced Research in Entity Resolution and Information Quality (ERIQ)
In the last post we examined how entity resolution (ER) systems are actually implemented, starting with the most basic merge/purge process and heterogeneous join systems. Both of these approaches focus on collecting equivalent references from among the sources provided, either as a large batch of references in a single file, or through queries against a federation of databases. The entity identities found by these ER systems are transient in the sense that they depend upon the sources input into the process. When different sources are provided, different identities will emerge.
On the other hand, there are ER systems that retain and manage identity information. By doing this they are able to “recognize” the same identity over time and assign it the same entity identifier each time (such identifiers are sometimes called “persistent identifiers” or “persistent links”). In Customer Data Integration (CDI) applications, these kinds of systems are sometimes called Customer Recognition Systems.
Two major types of ER systems perform identity management. The first type is the “identity resolution” system. It is most effective in situations where a fairly stable set of known identities of interest exists, such as the set of vendors or customers of a company, a set of products, or the students enrolled in a school. The attributes of these identities are pre-loaded into the system and assigned identifiers. When a reference is given to the system, it then decides whether the reference is to one of the known identities, and if so, returns the identifier of that identity.
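To make the lookup concrete, here is a minimal sketch of an identity resolution step in Python. It is only an illustration under stated assumptions: the `Identity` and `IdentityResolver` names, the sample customers, and the choice of a normalized name-plus-email match key are all hypothetical, and a production system would typically use far richer matching rules than an exact key lookup.

```python
from dataclasses import dataclass
from typing import Optional


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not block a match."""
    return " ".join(value.lower().split())


@dataclass(frozen=True)
class Identity:
    identifier: str  # persistent identifier assigned when the identity is pre-loaded
    name: str
    email: str


class IdentityResolver:
    """Holds a pre-loaded set of known identities and resolves references against it."""

    def __init__(self, identities):
        # Hypothetical match key: normalized (name, email). Real systems use richer rules.
        self._index = {
            (normalize(i.name), normalize(i.email)): i.identifier
            for i in identities
        }

    def resolve(self, name: str, email: str) -> Optional[str]:
        """Return the identifier of the matching known identity, or None if unrecognized."""
        return self._index.get((normalize(name), normalize(email)))


# Pre-load the known identities (illustrative data only).
known = [
    Identity("C001", "Ann Lee", "ann.lee@example.com"),
    Identity("C002", "Raj Patel", "raj.patel@example.com"),
]
resolver = IdentityResolver(known)

# A reference with different casing and spacing resolves to the same identifier.
match = resolver.resolve("  ann  LEE ", "Ann.Lee@Example.com")
# A reference to someone outside the known set is not recognized.
no_match = resolver.resolve("Bob Stone", "bob.stone@example.com")
```

The point of the sketch is the contract rather than the matching logic: the same reference always comes back with the same identifier, and a reference to an unknown entity comes back with nothing.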
Identity resolution systems can operate in either batch or transactional mode. In cases where there is a large number of pre-stored identities, the performance of batch operations can be improved through distributed processing, in which the identities are partitioned over multiple processors and resolved in parallel.
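The partitioning idea can be sketched in a few lines of Python. This is a simplified illustration, not a description of any particular product: the function names and sample data are hypothetical, a thread pool stands in for the separate processors or machines a real deployment would use, and routing by a hash of the match key assumes exact-key matching (fuzzy matching would need a different routing scheme).

```python
from concurrent.futures import ThreadPoolExecutor
from zlib import crc32


def partition_of(key: str, n_partitions: int) -> int:
    """Stable hash routing: the same key always lands in the same partition."""
    return crc32(key.encode("utf-8")) % n_partitions


def build_partitions(identities: dict, n_partitions: int) -> list:
    """Split a {match_key: identifier} store into n_partitions smaller stores."""
    partitions = [{} for _ in range(n_partitions)]
    for key, identifier in identities.items():
        partitions[partition_of(key, n_partitions)][key] = identifier
    return partitions


def resolve_batch(references: list, partitions: list) -> dict:
    """Resolve a batch of reference keys, using one worker per partition."""
    n = len(partitions)
    # Route each reference to the partition that would hold its identity.
    routed = [[] for _ in range(n)]
    for ref in references:
        routed[partition_of(ref, n)].append(ref)

    def resolve_partition(i: int) -> dict:
        store = partitions[i]
        return {ref: store.get(ref) for ref in routed[i]}

    results = {}
    # Threads stand in here for the multiple processors described above.
    with ThreadPoolExecutor(max_workers=n) as pool:
        for partial in pool.map(resolve_partition, range(n)):
            results.update(partial)
    return results


# Illustrative data: match keys mapped to pre-assigned identifiers.
identities = {
    "ann lee": "C001",
    "raj patel": "C002",
    "mei chen": "C003",
    "omar diaz": "C004",
}
partitions = build_partitions(identities, n_partitions=3)
resolved = resolve_batch(["raj patel", "mei chen", "bob stone"], partitions)
```

Because a reference is routed with the same hash used to place the identities, each worker only searches its own slice of the identity store, which is where the batch speed-up comes from.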
However, there are many situations where the identities are not necessarily known in advance, or where the entities are known but not organized in a way that allows them to be easily pre-loaded. For example, suppose two companies merge and each company has its own customer database. The customers are identified in different ways in each database. Furthermore, for one of the companies, poor systems and practices make it impossible to be confident that its customer master records are free of duplicates across business lines or company locations.
The type of system often applied in these situations is an “identity capture” system. The identity capture architecture can be seen as a hybrid of merge/purge and identity resolution systems. It supports identity management and persistent identifiers, but without starting with a preloaded set of identities. In my next post, we’ll delve deeper into the identity capture process.
