
Entity Extraction

By John Talburt, PhD, CDMP, Director, UALR Laboratory for Advanced Research in Entity Resolution and Information Quality (ERIQ)

In my last post I discussed my definitions for entity resolution, entity identification, entity disambiguation, and anonymous entity resolution.  (And I reiterate that these are just my definitions and are not binding on anyone except possibly my students.)

Let’s go back to the overarching term entity resolution (ER).  In its broadest sense, I see ER as encompassing three major activities:

1.    Extracting or collecting entity references from sources
2.    Linking references to the same entity
3.    Exploring networks of entity associations

In this post let’s focus on the first activity, extracting and collecting entity references.  For many academic researchers, this is what entity resolution is all about; to them it is the really interesting and challenging part.  An extensive body of research literature discusses methods and techniques for finding entity references in unstructured information, especially unstructured textual information (UTI).

There are many other ER participants for whom this activity holds little or no interest.  These are primarily the commercial ER processors, who expect the information they begin with to be already structured.  For them, the starting point is a record or database instance that is assumed to refer to an entity (e.g. a customer) and that has well-defined fields or columns.  Their game is all about the process of linking these records.

However, there’s a growing realization that most of an organization’s information assets reside in unstructured data stores such as emails, reports, spreadsheets, photos, graphs, comments, notes, and other sources that are not only unstructured but may not even be in computer-readable format.  A commonly cited rule of thumb is that the 80-20 rule applies: 80% unstructured to 20% structured.  The actual proportion will vary from organization to organization, but there is no denying that a tremendous amount of information is tucked away in unstructured formats.  Consequently, the text miners, law enforcement, the intelligence community, and other old hands at entity extraction are now being joined by the commercial world in the rush to exploit this new source of information and, potentially, business intelligence (BI).

Researchers in image processing have long recognized the process of “feature extraction,” in which the parts of an image that are of interest (such as a human face) are located within the larger image.  By analogy, I like the term “entity extraction” to describe this process in a broader sense that includes text, audio, and other media, not just images.

The level or degree to which an entity reference is classified is another important issue in entity extraction.  Entities are just the people, places, or things we are interested in for a given application, and as we have learned from object-oriented analysis, these entity/objects often exist within a logical hierarchy.  The level of classification often impacts the strategy and the complexity of the extraction process.

To illustrate levels of classification, consider the following example of unstructured text that might appear in a newspaper announcement:

“On July 21, 2008, Mary Jo Smith, daughter of Sam and Sue Smith of Ft. Worth, was married to John Doe, son of Bill and Mary Doe, in a ceremony at the St. Joe Church in Dallas.”

At the highest level, any of several existing parsers could read the text and classify most of these references as people (e.g. Mary Jo Smith, John Doe), places (e.g. Ft. Worth, Dallas), and dates (e.g. July 21, 2008).  At a deeper level, however, we are interested not just in the entity’s class, but more particularly in its sub-class or role.  The roles are often specified in the form of an “ontology” that defines the entities and roles within a given context.  In this case, a “marriage ontology” would have roles for Bride, Groom, Date-Of-Marriage, Parent-of-Bride, Parent-of-Groom, and Place-of-Marriage.  In our example above, determining that the reference Mary Jo Smith is not only to a person, but to the person in the role of “bride” in the context of a marriage announcement, is a more demanding problem than simply discovering that Mary Jo Smith is a person.
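
To make the difference between class-level and role-level extraction concrete, here is a minimal Python sketch.  It is only an illustration under strong assumptions: the regular expression, the function names (extract_marriage_roles, split_parents), and the ROLE_CLASS table are all hypothetical, and the pattern handles only announcements worded exactly like the example above.  It also assumes that both parents share the surname given last.  A real extractor would need far more general techniques.

```python
import re

# The example announcement from the text (plain quotes omitted).
TEXT = (
    "On July 21, 2008, Mary Jo Smith, daughter of Sam and Sue Smith of "
    "Ft. Worth, was married to John Doe, son of Bill and Mary Doe, in a "
    "ceremony at the St. Joe Church in Dallas."
)

# One or more capitalized words, e.g. "Mary Jo Smith".
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"

# Hypothetical template for ONE wording of a marriage announcement.
PATTERN = re.compile(
    r"On (?P<date>[A-Z][a-z]+ \d{1,2}, \d{4}), "
    r"(?P<bride>" + NAME + r"), daughter of "
    r"(?P<bride_parents>" + NAME + r" and " + NAME + r") of [^,]+, "
    r"was married to (?P<groom>" + NAME + r"), son of "
    r"(?P<groom_parents>" + NAME + r" and " + NAME + r"), "
    r"in a ceremony at (?:the )?(?P<venue>.+?) in (?P<city>[^.,]+)\.$"
)

# Each role in the (assumed) marriage ontology belongs to a broader class.
ROLE_CLASS = {
    "Bride": "Person",
    "Groom": "Person",
    "Parent-of-Bride": "Person",
    "Parent-of-Groom": "Person",
    "Date-Of-Marriage": "Date",
    "Place-of-Marriage": "Place",
}


def split_parents(phrase):
    """Turn 'Sam and Sue Smith' into ['Sam Smith', 'Sue Smith'].

    Assumes the first parent shares the surname given for the second.
    """
    first, second = phrase.split(" and ")
    if len(first.split()) == 1:
        first = f"{first} {second.split()[-1]}"
    return [first, second]


def extract_marriage_roles(text):
    """Return a role -> reference mapping, or None if the template fails."""
    m = PATTERN.search(text)
    if m is None:
        return None
    return {
        "Date-Of-Marriage": m["date"],
        "Bride": m["bride"],
        "Groom": m["groom"],
        "Parent-of-Bride": split_parents(m["bride_parents"]),
        "Parent-of-Groom": split_parents(m["groom_parents"]),
        "Place-of-Marriage": f'{m["venue"]}, {m["city"]}',
    }


if __name__ == "__main__":
    roles = extract_marriage_roles(TEXT)
    for role, reference in roles.items():
        print(f"{ROLE_CLASS[role]:6} {role:18} {reference}")
```

Note that the class (Person, Place, Date) can be read off from the role, but not the other way around: knowing that Mary Jo Smith is a Person says nothing about whether she is the bride or a parent.  The role comes only from the surrounding wording ("daughter of", "was married to"), which is why this sketch breaks as soon as the announcement is phrased differently.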

Even from this simple example, it is clear that developing a general solution for extracting and classifying entity references is a formidable challenge.  Another growing area of ER research moves beyond linking references to the same entity and toward exploring networks of linkages among entities, a topic for my next post.
