HOME

Metrics for Entity Resolution

By John Talburt, PhD, CDMP, Director, UALR Laboratory for Advanced Research in Entity Resolution and Information Quality (ERIQ)

In the last post I discussed the concepts of internal and external views of identity.  The fact that we can have different views of the same identity then raises the question of how to go about comparing different views.  What complicates this issue is that, even though we can talk about resolving references in pairs (i.e. linking two records if they refer to the same entity), the total number of references can be quite large, and consequently, there are many possible pair-wise combinations to consider.

The number of pair combinations increases geometrically as the size of the list grows linearly.  The basic formula is that the number of pairs of distinct items from a list of N items is calculated by the formula N*(N-1)/2.  So even in a list of 10 items, there are 45 possible pair combinations – not so bad.

But now consider the issue of how many different ways these 45 links (comparisons) among the 10 references could be labeled as true (the two references are to the same entity) or false (the two references are to different entities) and at the same time make sense as links.  By making sense, I mean that if we were to label the link between references 1 and 2 as true, and the link between references 2 and 3 as true, then we would also have to label the link between references 1 and 3 as true.  Even in light of this condition, it still turns out that there are 115,975 ways that a set of 10 references could be linked together.

So to follow our example. Suppose that the 10 records are represented by the first 10 letters of the alphabet {a, b, c, d, e, f, g, h, i, j}.  One of the 115,975 ways they could be linked together would be {a, b, c} all linked as belonging to the same entity, {d, e} together, {f} by itself, and {g, h, i, j} together.  Another way is {a, c, d} together, {b, f} together, {e} by itself, and {g, h, i} together and {j} by itself.  So how similar is the first way of linking these 10 references to the second way of linking them?

This is not a new problem, and there are many ways to approach it.  The problem is easier to visualize if we build an “intersection matrix”.  The matrix is simply a table that lists one set of groupings as row labels and the other set as column labels.  The cell at a particular row and column is the size of the intersection between the groupings.  Here is the intersection matrix for the example just given:

talburt-matrix.png

The number 2 in the cell at the second row and second column of the table indicates that the first grouping in the first way the references were linked and the first grouping in the second way the references were linked share 2 elements in common, “a” and “c”.

In statistics, this is called cluster analysis, and there are several methods for comparing these clusters.  Most notable is the Rand Index that has a value from 0 to 1, with values closer to zero indicating less similarity and closer to one indicating more similarity.  The value is equal to 1 only when the two sets of groupings are identical.

The calculation of the Rand Index takes a bit of explaining, but basically it involves counting the pair-wise links in the various cells shown in the table above.  In this example, the value of the Rand Index turns out to be 0.800, or 80% similar.

A few years ago, Dr. Rich Wang, Director of the MIT Information Quality Program, and I wanted a simpler similarity index that could be used as a quick way to assess entity resolution results.  The method we developed is much simpler to calculate, in that it does not involve the formula for combinations.  The key values for calculating our index are just the number of groupings and the number of overlaps between those groupings.  The formula is as follows:

talburt-formula-091609.png

Where
|A| represents the number of groupings in the first linkage (number of rows in the table)
|B| represents the number of groupings in the second linkage (number of columns)
|C| represents the number of overlaps between the groupings (number of cells > 0)

For the example given in the table the value is TW = SQRT(4 x 5)/7 = 0.639.

According to our index, the two grouping are only about 64% similar.  In the next post I will discuss the application of our index and other metrics that can be used to assess entity resolution outcomes.

Leave a Reply


Bad Behavior has blocked 1208 access attempts in the last 7 days.

Close
E-mail It
Portfolio Strategy News The Direct Marketing Voice