In my last post, I suggested that entity resolution in the broadest sense (“Big ER”) really encompasses three activities. The first is locating and collecting entity references from unstructured sources (entity extraction), the second is resolving and merging references to the same entity (“Little ER”), and the third is analyzing associations among entities. Not every ER process involves all three activities. As I noted, the “pre-activity” of entity extraction only comes into play when the entity reference sources are unstructured, for example facial recognition in surveillance videos. Before the facial characteristics can be analyzed and compared to known faces, the portion of the images in the video that represents the face must first be located and extracted. In image processing this is called “feature extraction” and is the genesis of my use of the term “entity extraction” for this activity.
When the notion of entity resolution first developed, it was in the context of a database entity-relation schema. In those days, ER was just about merging all the references to the same entity. There was no entity extraction activity because the information in the database was already structured. The entity extraction activity grew out of the realization that useful information may also reside in an unstructured format.
Now I’d like to talk about the third activity, exploring networks of associations. Once you have located and merged all of the references to the same entity, the next step is to ask whether any relationships exist among the entities. One of the first to be explored was the “household” relationship. Companies realize that there is value in understanding who’s living with whom at the same location, yet interestingly it is still one of the hardest relationships to define and manage. The simplest definition is “all the people at the same address with the same last name.” While simple, it doesn’t capture the nuances of current demographics such as unmarried couples, stepchildren, and extended families.
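To make the simple household definition concrete, here is a minimal Python sketch. It assumes the references have already been resolved into one record per person; the field names (`name`, `last_name`, `address`), the `households` function, and the sample records are made up for illustration. Note how it fails in exactly the way described above: anyone at the address with a different last name ends up in a separate "household."

```python
from collections import defaultdict

def households(records):
    """Group person records by (address, last name), the simplest household rule."""
    groups = defaultdict(list)
    for rec in records:
        # Light normalization only; real address matching is much harder than this.
        key = (rec["address"].strip().lower(), rec["last_name"].strip().lower())
        groups[key].append(rec["name"])
    return dict(groups)

# Hypothetical sample data, one record per already-resolved person.
people = [
    {"name": "Ann Smith", "last_name": "Smith", "address": "123 Oak Street"},
    {"name": "Bill Smith", "last_name": "Smith", "address": "123 Oak Street"},
    {"name": "John Doe", "last_name": "Doe", "address": "123 Oak Street"},
]

result = households(people)
for key, members in result.items():
    print(key, members)
```

All three people live at 123 Oak Street, yet the rule splits them into two groups, which is the limitation with unmarried couples, stepchildren, and extended families.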
Exploring entity relationships brings us to the intersection of entity resolution with data mining. Data mining is all about discovering non-explicit (non-obvious) relationships. A record or database instance by definition is an explicit relationship among the attribute values, i.e. they belong to the same entity. However, just as in the case of households, we can discover relationships that are not explicitly given, e.g. people living at the same address.
Building associations is a natural extension of the Little ER process. Even when there is not enough asserted or inferred evidence to conclude that two references are to the same entity, it may still be possible to establish an association between them. For example, a record for Bill Smith at 123 Oak Street and a record for John Doe at 123 Oak Street would not resolve as references to the same person (unless there was evidence of deliberate deception), but together they do establish that the two shared a residence at some time. If they shared it at the same time, it might be an important relationship in the context of a criminal investigation, e.g. looking for known associates of Bill Smith.
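The shared-address idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each residence record carries hypothetical `start` and `end` dates (the dates below are invented), addresses are compared by exact string match, and the function and field names are my own rather than part of any particular ER system.

```python
from datetime import date
from itertools import combinations

def overlaps(a, b):
    """True if the two residence periods share at least one day."""
    return a["start"] <= b["end"] and b["start"] <= a["end"]

def shared_address_associations(residences):
    """Return (person, person, address, same_time) for each pair sharing an address."""
    links = []
    for a, b in combinations(residences, 2):
        if a["person"] != b["person"] and a["address"] == b["address"]:
            links.append((a["person"], b["person"], a["address"], overlaps(a, b)))
    return links

# Hypothetical residence records; the date ranges are invented for the example.
residences = [
    {"person": "Bill Smith", "address": "123 Oak Street",
     "start": date(2001, 1, 1), "end": date(2003, 12, 31)},
    {"person": "John Doe", "address": "123 Oak Street",
     "start": date(2003, 6, 1), "end": date(2005, 1, 1)},
]

links = shared_address_associations(residences)
print(links)
```

The last element of each tuple separates "lived there at some time" from "lived there at the same time," which is the distinction that matters when looking for known associates.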
Like the small world hypothesis and six degrees of separation, entity association can extend many levels beyond direct associations like a shared address. For example, Bill Smith and John Doe may never have shared the same address, but they may have both shared the same address with Fred Johnson, thus establishing an indirect connection.
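One way to picture indirect connections is as a graph in which entities are nodes and associations are edges. The sketch below uses a plain breadth-first search to count the number of association steps between two people. The function names and the link list are assumptions for illustration; the links simply encode the Bill Smith, John Doe, and Fred Johnson example.

```python
from collections import defaultdict, deque

def build_graph(links):
    """Build an undirected association graph from (entity, entity) pairs."""
    graph = defaultdict(set)
    for a, b in links:
        graph[a].add(b)
        graph[b].add(a)
    return graph

def degrees_of_separation(graph, start, goal):
    """Breadth-first search; returns the number of steps, or None if unconnected."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor == goal:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None

# Each pair means "shared an address at some time" (illustrative data).
graph = build_graph([
    ("Bill Smith", "Fred Johnson"),
    ("John Doe", "Fred Johnson"),
])

print(degrees_of_separation(graph, "Bill Smith", "John Doe"))
```

Here Bill Smith and John Doe have no direct link, but the search finds them two steps apart through Fred Johnson.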
This simple example is based on shared address, but entity connections can be established through many combinations of inferred associations such as shared telephone or PO Box address as well as asserted associations such as call records between telephone numbers or change-of-address records. Just as with entity extraction, the analysis of association networks has its own body of research and knowledge that practitioners can draw upon.
I hope that this series of posts has provided a broader perspective on the variety of activities that comprise entity resolution. I certainly find it a fascinating subject. In my next post, I will discuss the concept of an internal view of identity versus an external view of identity.