By John Talburt, PhD, CDMP, Director of the Laboratory for Advanced Research in Entity Resolution and Information Quality (ERIQ) at the University of Arkansas at Little Rock
Under our working definition of entity resolution as locating and merging references to the same entity, the last installment focused on the merge problem, and how matching is often used as a stand-in for ER. Now let’s take a look at the locating problem.
First we should note that information comes to us in two forms, structured and unstructured. The traditional world of IT has been built around structured information based on the discipline of relational database schemas. In essence, data is structured if it is ready to be loaded into a relational database, i.e., all of the entities and their attributes are clearly delimited or tagged in a way that a computer can correctly read the entire data set by following one simple, repeating pattern. In the good ole days, the flat-file format gave us this by requiring that every record have a fixed length and every attribute occupy a fixed position in the record. Inspired by the spreadsheet paradigm, a friendlier version came along, requiring only that all of the attributes be presented together in a fixed order, each separated from the next by a specially designated character, the delimiter. Now XML has brought us yet another discipline: explicitly tagging the start and end of records and attributes with a consistent naming convention.
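To make the "one repeating pattern" idea concrete, here is a small sketch (the record layout, field names, and sample values are made up for illustration) showing the same record read all three ways: fixed-length, delimited, and tagged.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Fixed-length flat file: every attribute occupies a fixed position.
# (Hypothetical layout: name fields in columns 1-20, ZIP in columns 21-25.)
flat_record = "JOHN      SMITH     72201"
name_first = flat_record[0:10].strip()
name_last  = flat_record[10:20].strip()
zip_code   = flat_record[20:25].strip()

# Delimited file: attributes in a fixed order, separated by a delimiter.
delimited = io.StringIO("JOHN,SMITH,72201")
row = next(csv.reader(delimited))

# XML: the start and end of each record and attribute explicitly tagged.
xml_doc = "<person><first>JOHN</first><last>SMITH</last><zip>72201</zip></person>"
person = ET.fromstring(xml_doc)

# All three disciplines let a program recover the same attributes
# by following one simple, repeating pattern.
print(name_first, name_last, zip_code)
print(row)
print(person.find("first").text, person.find("last").text, person.find("zip").text)
```

In each case the program never has to understand the content; it only follows the layout convention.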
So in the structured world, locating is easy: you just follow the pattern. The problem is that we are now beginning to realize that there is a tremendous amount of information in unstructured formats such as free-form documents, photos, videos, audio files, and sensor data, formats that are not easily mapped into an entity-attribute schema. Even if we focus only on information encoded in character (text) format, the total amount of unstructured information in most organizations often exceeds the amount of structured information by a considerable margin. What's more, we now realize that some of this information could be important, i.e., that processes like customer relationship management (CRM) could be transformed if the company only knew what its customers were saying in their emails to the company, or in the comments they gave to telemarketers or technical support personnel who typed those comments into a free-form notes field.
So how did we end up with so much unstructured information? Did good information go bad? No, the reason is that the information age operates on four channels – people to computers, computers to people, computers to computers, and people to people – and it is the latter that generates the unstructured information. Person-to-person communication is inherently complex and often carries a tremendous amount of implicit and explicit context that people understand, but computers don’t.
Early in my career, I worked with a professor on the problem of disambiguation of homographs using thesauri, a fancy way of asking whether a computer can tell the difference in meaning between two words that are spelled the same but mean different things, just by looking at the synonyms of the words around them (e.g., “I can open this can.”). His favorite test was “Time flies like an arrow, but fruit flies like a banana.”
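The thesaurus idea can be sketched in a few lines. This is a toy illustration, not the professor's actual method: the synonym lists and sense labels below are made up, and a real lexicon would be far larger. Each sense of the homograph is scored by how much its synonym set overlaps with the synonyms of the surrounding words.

```python
# Made-up miniature thesaurus: word -> set of synonyms.
SYNONYMS = {
    "open": {"unseal", "uncover", "unfasten"},
}

# Two hypothetical senses of the homograph "can", each with its own synonyms.
SENSES = {
    "can/verb": {"able", "capable", "may"},
    "can/noun": {"container", "tin", "unseal"},
}

def disambiguate(context_words):
    """Pick the sense whose synonyms overlap most with the context."""
    # Expand the context words with their own synonyms.
    context = set(context_words)
    for word in context_words:
        context |= SYNONYMS.get(word, set())
    # Score each sense by overlap with the expanded context.
    return max(SENSES, key=lambda sense: len(SENSES[sense] & context))

# For the second "can" in "I can open this can.", the nearby word "open"
# (via its synonym "unseal") pulls the answer toward the container sense.
print(disambiguate(["open", "this"]))
```

The trick is that the overlap is computed between synonym sets, not the words themselves, which is exactly what makes it a thesaurus-based method rather than simple keyword matching.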
But getting back on topic, if you want to resolve whether references are to the same or different entities, you must first have the references. So if the information sources are unstructured, the locating side of entity resolution is about finding the entity references. This process is variously referred to as “named entity recognition”, “entity identification”, or “entity extraction”. In the next installment we will discuss some of the strategies for entity extraction from unstructured text documents.
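As a preview of what entity extraction looks like at its simplest, here is a rule-based sketch (an illustration only; the patterns and sample text are invented, and real named entity recognition systems are far more sophisticated): hand-written patterns pull candidate person-name and email references out of a free-form note.

```python
import re

# Made-up patterns: adjacent capitalized words as candidate person names,
# plus a rough email pattern. Real NER uses richer linguistic evidence.
NAME_PATTERN  = re.compile(r"\b([A-Z][a-z]+ [A-Z][a-z]+)\b")
EMAIL_PATTERN = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")

def extract_references(text):
    """Return candidate entity references found in unstructured text."""
    refs = []
    for match in NAME_PATTERN.finditer(text):
        refs.append(("PERSON?", match.group(1)))
    for match in EMAIL_PATTERN.finditer(text):
        refs.append(("EMAIL", match.group(0)))
    return refs

# A hypothetical free-form notes-field entry from a support call.
note = "Spoke with Mary Jones about her order; follow up at mjones@example.com."
print(extract_references(note))
```

Once the references have been located this way, they can be handed to the merge side of entity resolution, which is why extraction is the natural starting point when the source is unstructured.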