By John Talburt, PhD, CDMP, Director, UALR Laboratory for Advanced Research in Entity Resolution and Information Quality (ERIQ)
In an earlier post, I stated my view that identity resolution and entity resolution are somewhat different processes. In particular, I consider identity resolution a special form of entity resolution in which entity references are resolved by comparing them to the characteristics of a given set of known entities. Regardless of the approach, identity plays an important role in all forms of entity resolution.
The identity of an entity is a set of attributes and rules for comparing the attribute values that allow it to be distinguished from all other entities of the same type in a given context. A key feature is that identity is context-dependent, i.e., it depends upon the total set of entities under consideration. For example, a common scheme for creating email addresses in an organization uses a person’s first two initials and last name, e.g. jrtalburt. In a small organization, this is usually sufficient to make a unique address for each employee. However, applying this scheme to a much larger pool of users, such as the yahoo.com or gmail.com domains, quickly shows that these attributes are insufficient to distinguish one user from another.
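The context dependence of the email scheme can be illustrated with a short Python sketch. The function names and the sample people below are hypothetical, made up only to show that the same two attributes separate everyone in a small pool but collide in a larger one.

```python
from collections import defaultdict

def email_local_part(first, middle, last):
    """Build an address from the first two initials plus the last name."""
    return (first[0] + middle[0] + last).lower()

def find_collisions(people):
    """Return the generated addresses that more than one person would claim."""
    owners_by_address = defaultdict(list)
    for person in people:
        owners_by_address[email_local_part(*person)].append(person)
    return {addr: owners for addr, owners in owners_by_address.items()
            if len(owners) > 1}

# Hypothetical sample data: (first name, middle name or initial, last name).
small_org = [("John", "R", "Talburt"), ("Mary", "Ann", "Smith")]
large_pool = small_org + [("Jane", "Rose", "Talburt"),
                          ("Michael", "Andrew", "Smith")]

print(find_collisions(small_org))   # no collisions in the small context
print(find_collisions(large_pool))  # both addresses now have two claimants
```

In the small organization the initials and last name act as an identity; in the larger pool they no longer do, even though no individual record has changed.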
For a more relevant business example, consider the case of a customer, Mary Smith. For simplicity, assume that the totality of her adult residential address history comprises:
1. Mary Smith, 123 Oak St, Anytown, NY, 1998-06 to 2000-03
2. Mary Jones, 234 Elm St, Anytown, NY, 2000-04 to 2002-11
3. Mary Jones, 345 Pine St, Anytown, NY, 2002-12 to present
Although Mary has used two names and three addresses, these are all references to the same person. There are two ways to view the issue of identity as illustrated by this history.
One is to start with the identity based on vital statistics, e.g. Mary Smith, a female born on December 3, 1980, in Anytown, NY, to parents Robert and Susan Smith, then to follow that identity through its various representations of name and address as shown above. This “internal view of identity” is the view of Mary Smith herself and might well be the view of a sibling or other close relative, someone with complete knowledge about her address history. The internal view of identity represents a closed universe model in which all of the possible occupancy variants are known to the internal viewer (system) and any occupancy record not equivalent to one of the known variants must belong to some other identity.
On the other hand, an external view of identity is one in which some number of address records for a customer’s identity have been linked, but the viewer (system) does not know if it is the complete history. Given another customer address record not equivalent to one of the records in the history, it must be determined if it does or does not belong to Mary’s history.
Suppose that a system has only the first two address records of Mary’s history. In this case, the system’s knowledge of Mary’s identity would be incomplete. It may be incomplete either because the third address record is not in the system (has not been acquired) or because the system hasn’t linked it to the first two records. In the latter case, the system would assume that the third record is part of a different customer’s identity. Even though an internal viewer would know that the third address record should also be part of Mary’s complete history, the external viewer has not made that determination.
Conversely, an external viewer may assemble an inaccurate view of Mary’s history by linking the first two records of her address history to an address for a different Mary Smith. These entity resolution failures, incomplete and inaccurate histories, are information quality dimensions and indicate why the areas of entity resolution and information quality are so closely related. (Several classes of failures were discussed in another recent post.)
In an external view, the identity of the customer is equivalent to the set of occupancy records that have been resolved (i.e. linked). The known address records comprise the external viewer’s (or system’s) entire knowledge of the customer’s identity. If additional occupancy records are acquired and are correctly determined to be for this same customer, then the system’s knowledge about this identity increases.
The external view of identity reflects the experience of a business or government agency using entity resolution tools and processes in an effort to link disparate records into a single view of a customer or agency client. The “external view of identity” represents an open universe model because, if the system is presented with a new occupancy record, it does not necessarily follow that the new record must be a part of a different identity. It may or may not be part of an existing identity, something that the ER process must decide.
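The difference between the closed and open universe models can be sketched in Python. This is only an illustration under stated assumptions: the `Occupancy` type, the `normalize` equivalence rule (ignoring case and extra whitespace), both function names, and the "9 Birch Rd" record are hypothetical, and the date ranges are left out for brevity.

```python
from typing import NamedTuple, Optional

class Occupancy(NamedTuple):
    name: str
    address: str

def normalize(rec: Occupancy) -> tuple:
    # Assumed equivalence rule: ignore case and repeated whitespace.
    return tuple(" ".join(field.lower().split()) for field in rec)

def closed_universe_belongs(rec, complete_history) -> bool:
    """Internal view: every variant is known, so a non-match is a definite 'no'."""
    return normalize(rec) in {normalize(r) for r in complete_history}

def open_universe_belongs(rec, linked_records) -> Optional[bool]:
    """External view: a match is a 'yes', but a non-match is only 'undecided'."""
    if normalize(rec) in {normalize(r) for r in linked_records}:
        return True
    return None  # the ER process still has to decide

mary_complete = [
    Occupancy("Mary Smith", "123 Oak St, Anytown, NY"),
    Occupancy("Mary Jones", "234 Elm St, Anytown, NY"),
    Occupancy("Mary Jones", "345 Pine St, Anytown, NY"),
]
mary_linked = mary_complete[:2]  # the external system holds only two records

third = Occupancy("mary jones", "345 Pine St,  Anytown, NY")
stranger = Occupancy("Mary Smith", "9 Birch Rd, Othertown, NY")  # hypothetical

print(closed_universe_belongs(third, mary_complete))     # True
print(closed_universe_belongs(stranger, mary_complete))  # False
print(open_universe_belongs(third, mary_linked))         # None
print(open_universe_belongs(stranger, mary_linked))      # None
```

The external viewer returns the same "undecided" answer for Mary's genuine third address and for the unrelated record, which is exactly the decision an ER process has to resolve with further evidence.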
The major point to note is that an internal viewer is in a position to judge the quality of an external view. With complete knowledge, the internal viewer can determine if any particular external viewer has omitted some records (completeness) or has linked records from different identities or failed to link records for the same identity (accuracy).
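As a rough illustration of this judgment, the sketch below compares an external view against an internal one. It is a simple pairwise tally written for this example, not a published metric; the record IDs, identity labels, and the function name `assess_external_view` are all hypothetical.

```python
from itertools import combinations

def assess_external_view(internal, external):
    """Compare an external view with the internal (complete, correct) view.

    internal: record id -> true identity
    external: record id -> cluster label assigned by the external system,
              covering only the records that system actually holds
    """
    missing = set(internal) - set(external)       # completeness problems
    held = sorted(set(internal) & set(external))
    false_links, missed_links = [], []            # accuracy problems
    for a, b in combinations(held, 2):
        same_truth = internal[a] == internal[b]
        same_system = external[a] == external[b]
        if same_system and not same_truth:
            false_links.append((a, b))    # linked records of different identities
        elif same_truth and not same_system:
            missed_links.append((a, b))   # failed to link the same identity
    return {"missing": missing,
            "false_links": false_links,
            "missed_links": missed_links}

# Internal view: r1-r3 are Mary's three addresses, r4 is a different Mary Smith.
internal = {"r1": "mary", "r2": "mary", "r3": "mary", "r4": "other_mary"}

# External view: r3 was never acquired, and r4 was wrongly linked to Mary.
external = {"r1": "c1", "r2": "c1", "r4": "c1"}

print(assess_external_view(internal, external))
```

Here the internal viewer can see both kinds of failure at once: one record is absent from the external view, and two pairs of records are linked that should not be.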
Along with Dr. Wang at MIT, I have introduced a quality metric in the form of an index for assessing the similarity of two identity resolutions. In cases where one resolution represents an internal view (correct) and the other is an external view, the index provides a metric for entity resolution accuracy. I plan to explain this metric in my next post.