In a previous post I talked about the two sides of entity resolution - locating and merging. Before continuing the discussion on entity extraction, let’s briefly revisit the issue of merging, in particular the popular misconception that matching or record de-duplication is the same as entity resolution. Matching is a necessary part of entity resolution, but it is not sufficient.
Record matching as a proxy for entity resolution is based on the premise that “if two records share the same (or almost the same) set of identity attributes (i.e. they match), then they represent the same entity.” However, there are two problems with this assumption:
(1) The set of identity attributes being used is not always sufficient to differentiate among all of the entity references.
(2) The converse of the statement is not true: if two records represent the same entity, then they do not necessarily have the same set of identity attributes (i.e., they may not match).
Anyone who has worked with name and address information has been stung by the situation where “John Doe, 123 Oak St” was matched with “John Doe, 123 Oak St” and the two records were linked as references to the same entity, only to discover later that one record referred to John Doe, Sr. and the other to John Doe, Jr., two different people.
The problem here is the absence of the name suffix attribute, or of other attributes such as age, that would allow us to disambiguate between these references. The result is a false positive resolution. The collection of identity attributes should be sufficient to differentiate entities within a specific context. For example, a simple email prefix created from initials and last name may be a unique identifier in your company, but is likely to collide with another email identifier in a larger context such as Yahoo Mail.
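The John Doe case can be sketched in a few lines of Python. This is only an illustration under assumptions: the `records_match` helper, the dictionary layout, and the `suffix` field are hypothetical names chosen for the example, and exact equality stands in for whatever matching rule a real system would use.

```python
def records_match(a, b, attributes):
    """Return True when both records agree on every listed identity attribute."""
    return all(a.get(attr) == b.get(attr) for attr in attributes)


# Two references that are known to be different people (father and son).
senior = {"name": "John Doe", "address": "123 Oak St", "suffix": "Sr."}
junior = {"name": "John Doe", "address": "123 Oak St", "suffix": "Jr."}

# With only name and address, the records are indistinguishable.
false_positive = records_match(senior, junior, ["name", "address"])

# Adding the suffix attribute is enough to tell them apart in this context.
resolved_apart = records_match(senior, junior, ["name", "address", "suffix"])

print(false_positive)   # True: matched, but wrongly
print(resolved_apart)   # False: correctly kept separate
```

The matching rule is the same in both calls. Only the set of identity attributes changed, which is the point of problem (1).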
Even given that the set of identity attributes is large enough to avoid a false positive, the larger problem with matching as a surrogate for entity resolution is that it produces false negatives. For example, “Mary Doe, 234 Elm St” and “Mary Smith, 456 Pine St” do not match, but does that mean they are not references to the same entity? It could very well be the case that Mary Doe married John Smith and moved to his house at 456 Pine St.
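The Mary Doe case can be sketched the same way. Again this is a minimal illustration with hypothetical names (`records_match`, `same_person`), and the ground truth is simply asserted for the sake of the example, since in practice that is exactly the piece of knowledge we lack.

```python
def records_match(a, b, attributes):
    """Return True when both records agree on every listed identity attribute."""
    return all(a.get(attr) == b.get(attr) for attr in attributes)


# Two references to one person, captured before and after marriage and a move.
before = {"name": "Mary Doe", "address": "234 Elm St"}
after = {"name": "Mary Smith", "address": "456 Pine St"}

# Assumed ground truth for the illustration: both records are the same Mary.
same_person = True

matched = records_match(before, after, ["name", "address"])

# A false negative: the records refer to the same entity but do not match.
false_negative = same_person and not matched

print(matched)          # False
print(false_negative)   # True
```

No choice of comparison threshold on these two attributes fixes this, because the records share neither a surname nor an address. Something beyond pairwise matching has to supply the connection.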
The false negative problem is a more difficult problem to solve. In the best case, it is a matter of updating our entity view with readily available information. In the worst case, it is a deliberate attempt to conceal a connection or collaboration, something that can be much harder to determine. In any case, it is an area of active research and development.
Currently there are two primary approaches to solving the false-negative problem. The first is to enlarge the scope of identity information in an attempt to ensure there will be a path connecting any two references to the same entity. The second is to locate and save associative information among the entities of interest to build explicit declarations of connection. Both of these approaches have their advantages and disadvantages. In the next post I will discuss both of these approaches in more detail.
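As a rough preview, here is one way the two approaches could look in code. This is a sketch under stated assumptions, not a description of any particular product or of the method the next post will cover. The `resolve` function, the "share a name or an address" linking rule, the bridging record `r3`, and the `asserted_links` parameter are all hypothetical, and a union-find structure is just one convenient way to follow paths between references.

```python
from itertools import combinations


def find(parent, x):
    """Follow parent pointers to the cluster root, compressing the path."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def union(parent, x, y):
    """Merge the clusters that contain x and y."""
    parent[find(parent, x)] = find(parent, y)


def resolve(records, asserted_links=()):
    """Cluster record ids that are connected by matches or asserted links."""
    parent = {rid: rid for rid in records}
    # Approach 1: link records that share a name or an address (assumed rule).
    for a, b in combinations(records, 2):
        ra, rb = records[a], records[b]
        if ra["name"] == rb["name"] or ra["address"] == rb["address"]:
            union(parent, a, b)
    # Approach 2: honor explicit declarations that two records are connected.
    for a, b in asserted_links:
        union(parent, a, b)
    clusters = {}
    for rid in records:
        clusters.setdefault(find(parent, rid), set()).add(rid)
    return sorted(sorted(c) for c in clusters.values())


two_records = {
    "r1": {"name": "Mary Doe", "address": "234 Elm St"},
    "r2": {"name": "Mary Smith", "address": "456 Pine St"},
}
# Hypothetical extra record: Mary under her new name at her old address.
with_bridge = dict(two_records, r3={"name": "Mary Smith", "address": "234 Elm St"})

unlinked = resolve(two_records)
bridged = resolve(with_bridge)
asserted = resolve(two_records, asserted_links=[("r1", "r2")])

print(unlinked)   # [['r1'], ['r2']]
print(bridged)    # [['r1', 'r2', 'r3']]
print(asserted)   # [['r1', 'r2']]
```

In the bridged case, r1 and r2 still do not match each other directly; they end up together only because r3 provides a path between them. Note that a linking rule this loose would also reintroduce the false positives discussed earlier, which is one of the trade-offs between the two approaches.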