HOME

Entity Resolution Metrics

By John Talburt, PhD, CDMP, Director, UALR Laboratory for Advanced Research in Entity Resolution and Information Quality (ERIQ)

In the last post we looked at the problem of measuring the accuracy of entity resolution processes.  As with any accuracy measure, comparing to a known standard of correctness or benchmark is required.  However, even without a benchmark, other measures are also important in evaluating ER outcomes.

As discussed previously, if S is a list of references, we can think of the outcome of an ER process E acting on S as the partition of S produced by the assignment of links by E.  If the process E assigns each and every reference a single link ID, then the collection of subsets of S sharing the same link ID value will form a partition of S.

It is fairly easy to see that this must be true.  Since E assigns every reference a link ID, every reference must be in one of the subsets even if it is in a set by itself.  Because it assigns every reference only one link ID, a reference can only be a member of one of the link ID sharing subsets.  Therefore, the collection of subsets of S that share a common link ID fits the definition of a partition, i.e. a set of non-empty, non-overlapping subsets of S whose union is S.

The notation for this is P = (E, S, λ).  Wait a second – we talked about the process E and the set of references S, so where did λ come from?  Well, λ represents the order in which E processes the set S.  Unfortunately, many ER processes are order-dependent so that, even with the same process E acting on the same set of references S, it’s possible that the outcome (i.e. the partition of S) will be different if the order of processing is different.  In other words, it may be that (E, S, λ1) and (E, S, λ2) are different partitions when λ1 and λ2 are different orders of processing, either because of the physical ordering of S or because of differences in logical order due to parallel or distributed processing.

One of the reasons that ER processes may be order-dependent is that they often rely upon probabilistic matching of identity attributes.  Take as a simple example the Levenshtein Edit Distance that is commonly used for approximate string matching.  Edit distance is defined as the minimum number of “edits” that will transform one string into another string.  The edits most commonly used are “insert character”, “delete character”, and “replace character”.

Consider the three strings “JOHNSON”, “JOHNSTON”, and “JOHANSON”.  The edit distance from “JOHNSON” to “JOHNSTON” is 1 because we can simply insert a “T”.  The edit distance from “JOHNSON” to “JOHANSON” is also 1 because we can simply insert an “A”.  However, the edit distance from “JOHNSTON” to “JOHANSON” is 2 because there is no single edit that will convert one to the other, it requires at least 2, inserting an “A” and deleting the “T”.

So to illustrate order dependence of processing, suppose that S comprises the three names just given and that the E is a process that assigns links by building identity groups.  The process for building an identity group is that a reference belongs to the first existing identity group for which it differs from one of the identity group’s exemplars by an edit distance of 1 or less.  If there is no identity group that satisfies this condition, then the reference starts a new identity group.

Using this scheme, suppose that “JOHNSON” is the first reference to be processed.  Then by default, Identity Group 1 comprises [JOHNSON].  Suppose that the second reference processed is “JOHNSTON”.  Then by our rule, it also belongs to Identity Group 1 because the edit distance from “JOHNSTON” to the Identity 1 exemplar “JOHNSON” is 1.  So now Identity Group 1 comprises [JOHNSON, JOHNSTON].  Finally, suppose that the last reference processed is “JOHANSON”.  Since the edit distance from “JOHANSON” to “JOHNSON” is one, E will also assign it to Identity Group 1.  The final result is that the partition of S created by E is a single set, the set S itself, i.e. all 3 references are to the same entity.

However, using the same set S and process E, the result will be different if the order of processing is reversed.  If we suppose that “JOHANSON” is processed first, then Identity Group 1 comprises [JOHANSON].  If “JOHNSTON” is processed second, E will determine that it is a new identity 2 because its edit distance from “JOHANSON” is greater than 1.  Thus Identity Group 2 comprises [JOHANSON].  Finally, when “JOHNSON” is processed, it will differ from “JOHANSON” in Identity Group 1 by an edit distance of 1, thus it will be assigned to Identity Group 1.  The final result is that S is partitioned into two subsets [“JOHANSON”, “JOHNSON”] and [“JOHNSTON”] a different outcome than for the original order.

The underlying problem is that probabilistic matching processes such as edit distance are not transitive.  Equivalence relations such as equality exhibit the property that if A=B and B=C, then we can conclude that A=C.  This is called the transitive property.  However, as we have seen, “JOHNSTON” ≅ “JOHNSON” and “JOHNSON” ≅ “JOHANSON” does not imply that “JOHNSTON” ≅ “JOHANSON” where ≅ represents the relation that two strings differ by an edit distance of less than 2.

This issue of order dependence is basic to the development of entity resolution metrics.  Next time, I will wrap up this discussion and move into a model for entity-based information integration.

2 Responses to “Entity Resolution Metrics”

  1. Jim Zaiss Says:

    Suppose that, for a given S, there is a process E such that each possible ordering λi always results in the same partition P. Can we then conclude – ceteris paribus — that P is more likely to be the _correct_ partition of S than another partition P’ based on another process E’ such that the λi chosen sometimes results in a different partition of S?

    Please excuse me if this is a naïve question, or if it describes a situation that rarely occurs in practice. I’m new to this particular topic, but find it fascinating. I’ve long been interested in personal identity issues, though mostly from a logical, metaphysical, or phenomenological perspective.

    Regards,
    Jim Zaiss
    AWARE Software, Inc.

  2. John Talburt Says:

    Jim,
    Thank you. That is an interesting question that I have not fully considered. It is clear that if E is order dependent then at most only one of the partitions that it produces can be correct. However it is not clear if an order independent process E would necessarily have a better chance of producing a “more correct” partition. My first thought is that you could construct examples that work both ways, but I will think about this further.
    -jrt-

Leave a Reply


Bad Behavior has blocked 1420 access attempts in the last 7 days.

Close
E-mail It
Portfolio Strategy News The Direct Marketing Voice