
The Myth of Matching: Why We Need Entity Resolution

By John Talburt, PhD, CDMP, Director, UALR Laboratory for Advanced Research in Entity Resolution and Information Quality (ERIQ)

In a previous post I talked about the two sides of entity resolution - locating and merging.  Before continuing the discussion on entity extraction, let’s briefly revisit the issue of merging, in particular the popular misconception that matching or record de-duplication is the same as entity resolution.  Matching is a necessary part of entity resolution, but it is not sufficient.

Record matching as a proxy for entity resolution is based on the premise that “if two records share the same (or almost the same) set of identity attributes (i.e. they match), then they represent the same entity.”  However, there are two problems with this assumption:

(1) The set of identity attributes being used is not always sufficient to differentiate among all of the entity references.

(2) The converse of the statement is not true: if two records represent the same entity, then they do not necessarily have the same set of identity attributes (i.e., they may not match).

Anyone who has worked with name and address information has been stung by the situation where “John Doe, 123 Oak St” was matched with “John Doe, 123 Oak St,” the records were linked as references to the same entity, and only later was it discovered that one record referred to John Doe, Sr. and the other to John Doe, Jr., two different people.

The problem here is the absence of the name suffix attribute, or of other attributes such as age, that would allow us to disambiguate between these references; the result is a false positive resolution. The collection of identity attributes should be sufficient to differentiate entities within a specific context.  For example, a simple email prefix created from initials and last name may be a unique identifier in your company, but is likely to collide with another email identifier in a larger context such as Yahoo mail.
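To make the false positive concrete, here is a small Python sketch. The records, attribute names, and matching rule are all invented for illustration; the point is only that a match key built from too few identity attributes collapses distinct entities, and adding the missing attribute separates them again:

```python
# Hypothetical records: two different people who share a name and an address.
records = [
    {"name": "John Doe", "address": "123 Oak St", "suffix": "Sr."},
    {"name": "John Doe", "address": "123 Oak St", "suffix": "Jr."},
]

def match_key(record, attributes):
    """Build a normalized match key from the selected identity attributes."""
    return tuple(record.get(a, "").strip().lower() for a in attributes)

# Matching on name + address alone collapses the two people: a false positive.
keys = {match_key(r, ["name", "address"]) for r in records}
print(len(keys))  # 1 -- the two records "match"

# Adding the suffix attribute disambiguates them.
keys = {match_key(r, ["name", "address", "suffix"]) for r in records}
print(len(keys))  # 2 -- distinct entities
```

The same effect appears with any under-specified key, which is why the sufficiency of the attribute set always depends on the context in which it must discriminate.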

Even given that the set of identity attributes is large enough to avoid a false positive, the larger problem with matching as a surrogate for entity resolution is that it produces false negatives.  For example, “Mary Doe, 234 Elm St” and “Mary Smith, 456 Pine St” do not match, but does that mean they are not references to the same entity?  It could very well be the case that Mary Doe married John Smith and moved to his house at 456 Pine St.
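A companion sketch of the false negative, again with hypothetical records: exact matching declares the pair a non-match, and only outside associative information (here, an invented change-of-name table) can bridge them:

```python
# Hypothetical records for the same person before and after a name/address change.
rec_a = {"name": "Mary Doe", "address": "234 Elm St"}
rec_b = {"name": "Mary Smith", "address": "456 Pine St"}

def records_match(r1, r2):
    """Exact match on the identity attributes."""
    return r1["name"] == r2["name"] and r1["address"] == r2["address"]

print(records_match(rec_a, rec_b))  # False -- a false negative if both refer to Mary

# Only an external assertion (e.g., a marriage or change-of-name record)
# can bridge the two references; this table is invented for illustration.
known_links = {("Mary Doe", "Mary Smith")}
same_entity = records_match(rec_a, rec_b) or (rec_a["name"], rec_b["name"]) in known_links
print(same_entity)  # True once the associative link is available
```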

The false negative problem is the more difficult problem to solve.  In the best case, it is a matter of updating our entity view with readily available information.  In the worst case, it is a deliberate attempt to conceal a connection or collaboration, something that can be much harder to determine.  In any case, it is an area of active research and development.

Currently there are two primary approaches to solving the false-negative problem.  The first is to enlarge the scope of identity information in an attempt to ensure there will be a path connecting any two references to the same entity.  The second is to locate and save associative information among the entities of interest to build explicit declarations of connection. Both of these approaches have their advantages and disadvantages.  In the next post I will discuss both of these approaches in more detail.
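As a rough illustration of the first approach, here is a Python sketch in which two references that do not match directly are still linked through a bridging record via simple transitive closure (union-find). The records and the one-shared-attribute linking rule are invented for illustration only:

```python
class UnionFind:
    """Minimal union-find for grouping linked references."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

records = {
    "r1": {"name": "Mary Doe", "address": "234 Elm St"},
    # A bridging record captured during the transition carries both values:
    "r2": {"name": "Mary Smith", "address": "234 Elm St"},
    "r3": {"name": "Mary Smith", "address": "456 Pine St"},
}

uf = UnionFind()
ids = list(records)
for i, a in enumerate(ids):
    for b in ids[i + 1:]:
        shared = sum(records[a][k] == records[b][k] for k in ("name", "address"))
        if shared >= 1:  # toy rule: any shared identity attribute links the pair
            uf.union(a, b)

print(uf.find("r1") == uf.find("r3"))  # True -- linked via the bridging record r2
```

With a wider identity scope there is a better chance that such a bridging record exists; without r2, r1 and r3 would remain a false negative.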

5 Responses to “The Myth of Matching: Why We Need Entity Resolution”

  1. Dan Power Says:

    Thanks for an interesting, thought-provoking piece!

    The “deliberate attempt to conceal a connection” seems to be common in fraud, law enforcement and homeland security applications.

    It’s important to remind people about one of your main points (matching being a necessary but not sufficient part of entity resolution).

    There’s a lot more going on in true entity resolution than simple matching!

  2. John Talburt Says:

Dan, thanks for your comment. I find the confusion of ER with matching fairly common. I am also seeing interest in finding the “concealed” connections in commercial practice starting to catch up with that of the government security agencies. BTW I am planning a graduate course in ER for this fall; if you or any other readers have suggestions for topics, please let me know. -jrt-

  3. Jim Harris Says:

    I just finished publishing a five part series of articles on data matching methodology for dealing with the common data quality problem of identifying duplicate customers.

    Topics covered in the series:

    • Why a symbiosis of technology and methodology is necessary when approaching the common data quality problem of identifying duplicate customers
    • How performing a preliminary analysis on a representative sample of real project data prepares effective examples for discussion
    • Why using a detailed, interrogative analysis of those examples is imperative for defining your business rules
    • How both false negatives and false positives illustrate the highly subjective nature of this problem
    • How to document your business rules for identifying duplicate customers
    • How to set realistic expectations about application development
    • How to foster a collaboration of the business and technical teams throughout the entire project
    • How to consolidate identified duplicates by creating a “best of breed” representative record

    Here is the link to the article series on my blog:

    http://www.ocdqblog.com/home/identifying-duplicate-customers.html

    Best Regards…

    Jim Harris

  4. Daragh O Brien Says:

    John,

    Great post. You may recall the slides I’ve used at IAIDQ conferences about my name and how it got me into information quality at an early age.

    13+ spelling variants, can be male/female, can be miskeyed as Tara, or mangled to be Darren, Daryn, Daryl (also a male/female name), Dora (hence my love of exploring). And let’s not get started on my home address as a kid which seems to still confuse data quality tools (here’s a hint… St. in an address is not always an abbreviation of “street”). I have other examples…

    I think one of the mental gear-shifts that needs to be made when looking at these issues is to remember that data is a representation of a real-world thing (in this case a person). It is not the thing itself. When we are elbows deep in the data it can be all too easy to lose sight of that.

    Looking forward to the follow ups to this.

  5. Steve Sieloff Says:

    John –

    Another great post and on point! I find it very interesting to link “point in time” occupancies to the current-state location of an entity. Public records, while fruitful, are spotty in availability and lack many standard data quality measures. Name distributions for a given geography (zip or zip+4) are helping to make links between names with materially different addresses. For example, take Zawarek Timonsky at 123 Main St and Zawarek Timonsky at 456 Elm Dr in the same zip code, where only one Zawarek first name is known and only 3 Timonsky surnames are known: the unique combination creates a high degree of confidence that we are talking about the same person, even with differing addresses.
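    (A toy sketch of the name-frequency idea described above. The counts are invented for illustration: the rarer the name combination within the geography, the more confidence a cross-address link carries.)

```python
# Hypothetical per-zip-code counts; in practice these would come from
# observed name distributions for the geography.
first_name_counts = {"Zawarek": 1}   # distinct people with this first name in the zip
surname_counts = {"Timonsky": 3}     # distinct holders of this surname in the zip

def link_confidence(first, last, default=1000):
    """Score a candidate link between two records with differing addresses:
    rarer name combinations give higher confidence. Unknown names are
    treated as common (the default count)."""
    f = first_name_counts.get(first, default)
    s = surname_counts.get(last, default)
    return 1.0 / (f * s)

print(link_confidence("Zawarek", "Timonsky"))  # ~0.33: rare combo, strong evidence
print(link_confidence("John", "Smith"))        # tiny: common combo, weak evidence
```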

    As for the example of St. in the street not always meaning Street, it is clear that the software causing the incorrect classification and standardization is not looking at both the keyword AND the pattern or semantics in which the keyword or phrase is referenced. This type of semantic parsing and standardization is gaining traction in document classification and phrase searching (aka Google).

    Keep up the thought provoking articles!

    Steve
