By John Talburt, PhD, CDMP, Director, UALR Laboratory for Advanced Research in Entity Resolution and Information Quality (ERIQ)
In the last post, we looked at the problem of comparing two entity resolution (ER) outcomes. If S represents a list of entity references, then the effect of applying an ER process is to divide S into subsets where each subset comprises all of the references to the same entity. More formally, this is called a “partition” of S. A partition of a set S is simply a collection of non-empty, non-overlapping subsets of S that together contain all of the elements of S. In other words, it is a way to divide S into subsets so that every element of S is in one, and only one, of the subsets.
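To make the definition concrete, here is a minimal sketch in Python; the record identifiers and the particular grouping are invented purely for illustration:

```python
# A list S of entity references, identified here by record IDs.
S = ["r1", "r2", "r3", "r4", "r5"]

# One possible ER outcome: a partition of S into three subsets,
# each holding the references resolved to a single entity.
partition = [{"r1", "r2"}, {"r3"}, {"r4", "r5"}]

def is_partition(subsets, universe):
    """Check the defining properties: non-empty subsets that do not
    overlap and that together cover every element of the universe."""
    members = [x for s in subsets for x in s]
    return (all(subsets) and                        # non-empty
            len(members) == len(set(members)) and   # non-overlapping
            set(members) == set(universe))          # covers all of S

print(is_partition(partition, S))  # True
```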
By viewing ER outcomes as partitions of the underlying set of references, the problem of comparing outcomes translates into the problem of comparing two partitions of the same list S. As pointed out in the last post, there are several methods for making these comparisons, including the Rand Index and the Talburt-Wang Index.
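As a rough illustration of the pairwise view behind the Rand Index, here is a minimal sketch that scores two partitions by how often they agree on whether a pair of references belongs together; the two partitions shown are invented for illustration:

```python
from itertools import combinations

def cluster_labels(partition):
    """Map each reference to the index of the subset containing it."""
    return {x: i for i, subset in enumerate(partition) for x in subset}

def rand_index(p1, p2):
    """Fraction of reference pairs on which the two partitions agree,
    i.e., both place the pair together or both place it apart."""
    l1, l2 = cluster_labels(p1), cluster_labels(p2)
    elements = list(l1)
    agree = sum((l1[a] == l1[b]) == (l2[a] == l2[b])
                for a, b in combinations(elements, 2))
    total = len(elements) * (len(elements) - 1) // 2
    return agree / total

p1 = [{"r1", "r2"}, {"r3", "r4"}]
p2 = [{"r1", "r2", "r3"}, {"r4"}]
print(rand_index(p1, p2))  # 0.5 -> the partitions agree on 3 of 6 pairs
```

The Talburt-Wang Index takes a different route, counting overlaps between subsets rather than enumerating every pair; a sketch of it appears further below.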
This contrasts with the traditional view of evaluating record linking in terms of “merging” or “de-duplicating” two lists of records. The book Data Quality and Record Linkage Techniques by Herzog et al. provides a great overview of this treatment of resolution. The list-versus-list approach focuses on analyzing the set of all possible pairs of records that can be formed between the two lists. For example, if List A has 80 records and List B has 100 records, there would be 8,000 (80 × 100) possible record pairs in which the first record comes from List A and the second from List B. However, most of the analytical techniques based on this approach, such as the Fellegi-Sunter model, start with the assumption that neither list has any internal duplication. This is a convenient, but often unrealistic, assumption when working with large lists, especially those from external providers.
When the ER outcome problem is cast in terms of merging two lists, ER accuracy can be viewed in terms of precision and recall, measures borrowed from information retrieval. Each record in List A can be thought of as a query into List B. The precision of that query would be the ratio of the correct links it makes with records in List B to the total number of links it makes with records in List B. Similarly, its recall would be the ratio of its correct links with records in List B to the total number of records in B that it should be linked with. By extending these measures over all the records in List A, it is possible to define an overall precision and recall measure for the linkage between the two lists.
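Here is a minimal sketch of how these overall measures might be computed, assuming each link is represented as an (A-record, B-record) pair; the predicted and true link sets are invented for illustration:

```python
def precision_recall(predicted_links, true_links):
    """Precision: correct links / links made.
    Recall: correct links / links that should have been made."""
    correct = predicted_links & true_links
    precision = len(correct) / len(predicted_links) if predicted_links else 1.0
    recall = len(correct) / len(true_links) if true_links else 1.0
    return precision, recall

# Hypothetical links between List A and List B.
predicted = {("a1", "b1"), ("a2", "b2"), ("a3", "b9")}
true      = {("a1", "b1"), ("a2", "b2"), ("a2", "b3")}

p, r = precision_recall(predicted, true)
print(f"precision={p:.2f}, recall={r:.2f}")  # precision=0.67, recall=0.67
```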
My preference, however, is simply to view List A and List B as forming a combined list in which linking can take place not only between the records of A and B, but also internally among the records within A and within B. In my opinion, this is a better reflection of what is usually done when processing real files. The records from two or more lists, or at least the identifying attributes from the records, are standardized into a common file format and combined into a single list. An ER process is then performed on the combined list, leading to its partition into subsets as described above.
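A rough sketch of that pipeline follows, with invented records and field names, and a deliberately naive exact match key standing in for a real ER process:

```python
from collections import defaultdict

# Hypothetical records from two source lists; field names are invented.
list_a = [{"name": "James Doe",  "addr": "Main St"}]
list_b = [{"name": "JAMES DOE",  "addr": "Main Street"},
          {"name": "Mary Smith", "addr": "Elm St"}]

def standardize(rec, source, seq):
    """Normalize the identifying attributes into a common format."""
    return {"id":   f"{source}{seq}",
            "name": rec["name"].upper(),
            "addr": rec["addr"].upper().replace("STREET", "ST")}

# Combine both lists into one standardized list...
combined = ([standardize(r, "A", i) for i, r in enumerate(list_a)] +
            [standardize(r, "B", i) for i, r in enumerate(list_b)])

# ...then run one ER pass over the combined list. Links can now form
# both across the two lists and internally within each list.
clusters = defaultdict(set)
for rec in combined:
    clusters[(rec["name"], rec["addr"])].add(rec["id"])

print(list(clusters.values()))  # two clusters: {'A0', 'B0'} and {'B1'}
```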
If the correct partition of a list of references is known, then the accuracy of a given ER process acting on that list can be represented as the value of the similarity index (e.g., Rand or Talburt-Wang) obtained by comparing the partition generated by the ER process to the correct partition. Partition similarity indices are designed to take values from 0 to 1. Values closer to 0 indicate less similarity, and values closer to 1 indicate greater similarity, with the value equal to 1 if and only if the two partitions are identical.
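For readers who want to experiment, here is a sketch of such a comparison using the Talburt-Wang Index, assuming the definition sqrt(|A|·|B|) / |Φ|, where Φ counts the pairs of subsets, one from each partition, that share at least one element; treat the formula here as a reconstruction to be checked against the published definition:

```python
import math

def talburt_wang(p1, p2):
    """Assumed definition: sqrt(|A| * |B|) / |Phi|, where Phi is the set
    of subset pairs (one from each partition) that overlap."""
    overlaps = sum(1 for a in p1 for b in p2 if a & b)
    return math.sqrt(len(p1) * len(p2)) / overlaps

identical = [{"r1", "r2"}, {"r3"}]
print(talburt_wang(identical, identical))      # 1.0 for identical partitions
print(talburt_wang([{"r1", "r2", "r3"}],       # one big cluster vs.
                   [{"r1"}, {"r2"}, {"r3"}]))  # singletons: ~0.577
```

Note that the index reaches 1 exactly when the two partitions coincide, since in that case every subset overlaps only its own counterpart.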
Whether you measure accuracy with precision and recall or with a partition similarity index, both approaches require knowing the correct links. Of course, if we knew all of the correct links, we wouldn’t need the ER process to begin with. In general, we only know the correct links for some sample of the references we are dealing with.
When the entities are people, e.g., customers, obtaining even a relatively small sample of records with the correct links can be difficult. My experience is that organizations generally do this in one of three ways: inspection by domain experts, information volunteered by employees, or telemarketing confirmation.
A random selection of records for inspection can be useful, but it is biased toward true positive linking and has little value in detecting false negatives. An expert might determine that “Jaems Doe on Main St” should link to “James Doe on Main St”, but is unlikely to determine that “Mary Doe on Main St” should link to “Mary Smith on Elm St” without prior knowledge that these are the same customer (or the presence of additional attributes besides “name”).
Employees and their families are often called upon to volunteer benchmark data for linking. Because this represents an internal view of identity, it can be very rich and replete with prior and alternate names and addresses, dates of birth, and other biographical information. However, unless the company is very diverse, the benchmark will only address a very narrow population demographic.
Perhaps the least biased sample is one obtained through a third-party telemarketing firm. While this approach can reach a broad sample of the population with varying demographics, it is the most expensive of the three options and, without some internal validation, may not be as accurate as the others.
Next time, I will continue the discussion of entity resolution metrics. In the meantime, your thoughts are welcome.