Archive for the ‘Mistaken Identity Resolution’ Category

Identity Resolution Daily Links 2009-10-16

Friday, October 16th, 2009

[Post from Infoglide] Avoiding False Positives: Analytics or Humans?

“The European Union recently started a five-year research program in conjunction with its expanding role in fighting crime and terrorism. The purpose of Project Indect is to develop advanced analytics that help monitor human activity for ‘automatic detection of threats and abnormal behaviour and violence.’ Naturally, the project has drawn suspicion and criticism, both from those who oppose the growing power of the EU and from watchdog groups concerned about encroachments into privacy and civil liberty…”

SDTimes: Old thinking does a disservice to new data hubs

“The enterprise needs to be able to understand the origin, the time and possibly the reason for a change. These audit needs must be supported by the data hub at the attribute level. MDM solutions that maintain the golden record dynamically address this need by supporting the history of changes in the source systems record content.”

Accision Health Blog: Surveys Show Importance of EHR

“A new Rand study is one of the first to link the use of electronic health records in community-based medical practices with higher quality of care.  Rand Corporation researchers found in a study of 305 groups of primary care physicians that the routine use of multifunctional EHRs was more likely to be linked to higher quality care than other common strategies, such as structural changes used for improving care.”

NYSIF: Central NY Contractor Hit with Workers Comp Fraud Charges

“Investigators said Mr. Decker previously had an insurance policy with NYSIF when he operated RD Builders in November 2005, a policy cancelled for non-payment a few months later. In 2008, he applied to NYSIF’s Syracuse office for workers’ compensation insurance doing business as Bull Rock Development, Inc.”

public intelligence: Office of Intelligence and Analysis (DHS)

“These entities are unified under local fusion centers, which provide state and local officials with intelligence products while simultaneously gathering information for federal sources.  As of July 2009, there were 72 designated fusion centers around the country with 36 field representatives deployed. The Department has provided more than $254 million from FY 2004-2007 to state and local governments to support the centers.”

Avoiding False Positives: Analytics or Humans?

Wednesday, October 14th, 2009

By Robert Barker, Infoglide Senior VP & Chief Marketing Officer

The European Union recently started a five-year research program in conjunction with its expanding role in fighting crime and terrorism. The purpose of Project Indect is to develop advanced analytics that help monitor human activity for “automatic detection of threats and abnormal behaviour and violence.”

Naturally, the project has drawn suspicion and criticism, both from those who oppose the growing power of the EU and from watchdog groups concerned about encroachments into privacy and civil liberty:

According to the Open Europe think tank, the increased emphasis on co-operation and sharing intelligence means that European police forces are likely to gain access to sensitive information held by UK police, including the British DNA database. It also expects the number of UK citizens extradited under the controversial European Arrest Warrant to triple. Stephen Booth, an Open Europe analyst who has helped compile a dossier on the European justice agenda, said these developments and projects such as Indect sounded “Orwellian” and raised serious questions about individual liberty.

Shami Chakrabarti of Liberty, a UK human rights group, said, “Profiling whole populations instead of monitoring individual suspects is a sinister step in any society. It’s dangerous enough at [the] national level, but on a Europe-wide scale the idea becomes positively chilling.”

At IdentityResolutionDaily, we’ve consistently supported open and civil discussion about balancing security requirements with individual rights of privacy and liberty (e.g. “Walking the Privacy/Security Tightrope”). We’ve also dealt with the criticality of using analytic technology that minimizes false positives (e.g. “False Positives versus Citizen Profiles”).

Not long ago, James Taylor of Decision Management Solutions made an excellent point about whether using analytic technologies (e.g. identity resolution) versus relying totally on human judgment increases or decreases the risk of false positives:

Humans, unlike analytics, are prone to prejudices and personal biases. They judge people too much by how they look (stopping the Indian with a beard for instance) and not enough by behavior (stopping the white guy who is nervously fiddling with his shoes say)… If we bring analytics to bear on a problem the question should be does it eliminate more biases and bad decision making than it creates new false positives… Over and over again studies show analytics do better in this regard… I think analytics are ethically neutral and the risk of something going “to the dark side” is the risk that comes from the people involved, with or without analytics.

We couldn’t have said it better ourselves.

Entity Resolution vs. Entity Identification

Wednesday, June 3rd, 2009

By John Talburt, PhD, CDMP, Director, UALR Laboratory for Advanced Research in Entity Resolution and Information Quality (ERIQ)

In entity resolution, as in any new research area, different authors or practitioners may use the same term but intend different meanings. You always have to be careful to understand exactly what a writer means when he or she uses a particular term. For example, I have found that the terms “entity resolution”, “entity identification”, and “entity disambiguation” are often used with different meanings by different writers.

Over the years, I have developed my own definitions.  I don’t claim that these are standard definitions, but they are the way I use them in my own work.

First of all, entity resolution is the most general term that encompasses the other two.  Entity resolution (ER) is a process that covers everything from the extracting or collecting of entity references from sources, to linking references to the same entity, to exploring networks of entity associations.  Having said that, I find there are generally two uses of the term entity resolution.  Just as we see the term information technology used in the sense of “Big IT” (anything to do with computers) and “Little IT” (a specific curriculum of computer studies), the same can be said for entity resolution.  Big ER is when entity resolution is used to describe the entire process from end to end (as in my definition above).  On the other hand, Little ER is when the same term is used to describe just the middle step, the logic of determining which references are to the same entities, i.e. “resolving” the references.

Whereas entity resolution is the process of resolving whether references are to the same entity or to different entities, entity identification describes the special case of entity resolution in which the references are linked to “known” entities, i.e. matching to a set of previously established identities (probably a better term for this than entity identification would be “entity recognition” as in “customer recognition”).  Thus, entity resolution and entity identification (or recognition) mean different things because it is possible to resolve two references without actually knowing the identity of the entities to which they refer.

A good analogy is in criminal investigation.  If two sets of fingerprints are found at a crime scene, it is possible to determine from their characteristics that they belong to two different suspects.  However, the identification of the suspects to whom the fingerprints belong depends upon the completeness of the fingerprint files (known identities).  This is also an example of what is meant by the third term, entity disambiguation, i.e. resolving that two references are to different entities.  In this example, we can resolve that the two references are to different entities without knowing their identities.  Another example might be two records with the name “John Smith” but with different dates of birth.  Without other information we may not know exactly which John Smiths they are, but we could conclude that they are different John Smiths.
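To make the distinction concrete, here is a minimal Python sketch of the fingerprint analogy. Everything in it is hypothetical: the pattern labels, the reference names, and the one-entry “fingerprint file” are invented for illustration, and real fingerprint comparison is far more involved than an equality test.

```python
# Hypothetical "fingerprint file": the set of previously established identities.
KNOWN_IDENTITIES = {"whorl-17": "Alice Example"}

def resolve(references):
    """Entity resolution: group references that point to the same entity."""
    clusters = {}
    for ref_id, pattern in references:
        clusters.setdefault(pattern, []).append(ref_id)
    return list(clusters.values())

def identify(pattern, known=KNOWN_IDENTITIES):
    """Entity identification: link a reference to a known identity, if one exists."""
    return known.get(pattern)

# Two prints found at scene 1, one at scene 2 (all values are made up).
references = [
    ("scene1-a", "whorl-17"),
    ("scene1-b", "loop-42"),
    ("scene2-a", "loop-42"),
]

clusters = resolve(references)
# scene1-a and scene1-b land in different clusters (disambiguation), and
# scene1-b and scene2-a land in the same cluster (resolution), even though
# identify("loop-42") finds nothing on file.
```

The point of the sketch is that `resolve` never consults the file of known identities; only `identify` does.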

Similarly, the same sets of fingerprints could be found at two different crime scenes, but again without the prints being on file.  This would be another case of entity resolution without entity identification, i.e. we know they belong to the same person, but just don’t know whose they are.  When this process is done intentionally, we call it anonymous entity resolution.  As an example, for privacy reasons we may give school records anonymous identifiers that allow us to collect and analyze all of the grades for the same student without revealing the identity of the student.
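One possible way to sketch the school records example in Python is to replace each student identifier with a keyed hash, so that records for the same student still link together. The key, the student IDs, and the grades below are all assumptions made for illustration; this is not a description of any particular system, and a real deployment would need proper key management.

```python
import hashlib
import hmac
from collections import defaultdict

# Assumption: a secret key held only by the records custodian.
SECRET_KEY = b"replace-with-a-real-secret"

def anonymous_id(student_id: str) -> str:
    """Derive a stable anonymous identifier from a student ID using HMAC-SHA256."""
    digest = hmac.new(SECRET_KEY, student_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]

# Hypothetical grade records: (student ID, course, grade).
records = [
    ("S1001", "Math", 91),
    ("S1002", "Math", 78),
    ("S1001", "History", 85),
]

# Collect all grades for the same student under the anonymous identifier only.
grades_by_token = defaultdict(list)
for student_id, course, grade in records:
    grades_by_token[anonymous_id(student_id)].append((course, grade))
```

An analyst working with `grades_by_token` can see that two grades belong to one student without ever seeing the original student ID.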

Entity extraction is another term that I see used to describe entirely different processes, but let’s save that discussion for next time.

Identity Resolution Daily Links 2009-04-27

Monday, April 27th, 2009

By the Infoglide Team

New York Times: Name Not on Our List? Change It, China Says

“By some estimates, 100 surnames cover 85 percent of China’s citizens. Laobaixing, or “old hundred names,” is a colloquial term for the masses. By contrast, 70,000 surnames cover 90 percent of Americans. The number of Chinese family names in use has tended to shrink as China’s population has grown, a winnowing of surnames that has occurred in many cultures over time.”

OCDQ Blog: All I Really Need To Know About Data Quality I Learned In Kindergarten

“When you present the business case for your data quality initiative to executive management and other corporate stakeholders, remember the lessons of show and tell.  Poor data quality is not a theoretical problem - it is a real business problem that negatively impacts the quality of decision critical enterprise information.”

BTNonline: Secure Flight Roils Booking Tech

“To facilitate the implementation of Secure Flight’s new data requirements for the travel industry, officials from the International Air Transport Association and Department of Homeland Security this year decided to use passenger data fields already used to transmit visa and passport information. TSA noted those IATA standards go into effect May 1.”

Security Systems News: Retail industry to ‘speak with a single voice’

“There will now be a single entity both helping to establish best practices for loss prevention and lobbying state and federal government in regard to major security issues like organized retail crime.”

Solving the False Negative Problem

Wednesday, April 22nd, 2009

By John Talburt, PhD, CDMP, Director, UALR Laboratory for Advanced Research in Entity Resolution and Information Quality (ERIQ)

In my March 25, 2009 post “The Myth of Matching,” I discussed the confusion between entity resolution and matching as in record de-duplication.  Matching is a necessary part of entity resolution, but it is not sufficient.  In particular I brought up the issue of “false negatives,” cases where records don’t match, but are in fact references to the same entity.  I used the example of Mary Doe living on Elm Street who married John Smith living on Pine Street resulting in two references “Mary Doe, 234 Elm St” and “Mary Smith, 456 Pine St” that don’t match, but are nevertheless references to the same person.  Let’s discuss a couple of approaches to solving this problem - enlarging the scope of identity attributes and utilizing asserted associations.

The Mary Doe - Mary Smith case might be resolved if the scope of identity attributes were increased, i.e. if additional information such as date-of-birth, driver’s license, or social security number were available in both records.  But as anyone acquainted with information quality understands, acquiring and maintaining additional information can create as many problems as it solves.  It also brings up a number of questions that the information custodians and collectors must answer.

Is this information available? Is it costly? Is use for this purpose permissible/legal?  Even if expanding the number of identity attributes is an option, it is not necessarily a panacea.  Increasing the number of identity attributes also increases the complexity of the matching.  What if some values are missing?  What if some values agree, but others disagree?
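The added complexity is easy to see in a small Python sketch. The records and attribute names below are invented for illustration. The comparison has three outcomes per attribute rather than two, and someone still has to decide what a mix of agreements, disagreements, and missing values should mean.

```python
def compare_attributes(rec_a, rec_b, attributes):
    """Classify each identity attribute as agreeing, disagreeing, or missing."""
    result = {"agree": [], "disagree": [], "missing": []}
    for attr in attributes:
        a, b = rec_a.get(attr), rec_b.get(attr)
        if a is None or b is None:
            result["missing"].append(attr)    # cannot compare at all
        elif a == b:
            result["agree"].append(attr)
        else:
            result["disagree"].append(attr)
    return result

# Hypothetical records with an expanded set of identity attributes.
mary_doe = {"first": "Mary", "last": "Doe", "street": "234 Elm St",
            "dob": "1980-05-01", "license": None}
mary_smith = {"first": "Mary", "last": "Smith", "street": "456 Pine St",
              "dob": "1980-05-01", "license": "D123"}

result = compare_attributes(
    mary_doe, mary_smith, ["first", "last", "street", "dob", "license"]
)
# First name and date of birth agree, last name and street disagree,
# and the license cannot be compared because one value is missing.
```

Whether two agreements outweigh two disagreements is exactly the kind of rule the extra attributes force you to write.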

A second approach is to collect and use asserted associations.  The fundamental problem is that if Mary Doe and Mary Smith do not share any matching identity attributes, you cannot know that they are the same person without some separately acquired knowledge that they are in fact the same person.  Moreover, because not every Mary Doe is the same person as Mary Smith, you also need additional context such as the address to make the connection clear.  The upshot is that you need to possess the explicit knowledge that “Mary Doe at 234 Elm St is the same person as Mary Smith at 456 Pine St.”
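As a rough illustration, an asserted association can be thought of as a stored fact that is consulted when attribute matching fails. The Python sketch below uses a deliberately naive exact-match rule and a hypothetical one-entry association table; both are assumptions, not a description of how any product stores this knowledge.

```python
# Hypothetical store of asserted associations: each entry says that two
# (name, address) references are known to be the same person.
ASSERTED_SAME = {
    frozenset({("Mary Doe", "234 Elm St"), ("Mary Smith", "456 Pine St")}),
}

def attributes_match(ref_a, ref_b):
    """Naive matching rule, for illustration only: exact equality."""
    return ref_a == ref_b

def same_entity(ref_a, ref_b):
    """Resolve by matching first, then fall back to asserted associations."""
    return attributes_match(ref_a, ref_b) or frozenset({ref_a, ref_b}) in ASSERTED_SAME

resolved = same_entity(("Mary Doe", "234 Elm St"), ("Mary Smith", "456 Pine St"))
other_mary = same_entity(("Mary Doe", "999 Oak St"), ("Mary Smith", "456 Pine St"))
```

Note that the address is part of the assertion: a Mary Doe at some other address is not linked to Mary Smith.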

If Mary lives in the United States and Mary registers her change of name and address with the US Postal Service, then you might be able to resolve this through the USPS Change of Address file.  Besides the fact that this is only helpful in the US, relying on the USPS COA file has other disadvantages, not the least of which is that Mary may have decided not to register with the USPS.  For this reason, some companies choose to maintain their own knowledge by acquiring information from other public and private sources.

For example in the US, marriage records are publicly available and are a possible source of this associative information.  It may also be true that while Mary didn’t register her change of address with the USPS, she may have wanted to avoid missing any issues of her Modern Square Dancing magazine subscription and promptly registered her change of address with the publisher.  There are potentially many other data sources, such as changes in utility service, cable service, or required licensure notifications.

Even though the application of external association information can alleviate the false negative problem, it comes at a cost.  The collection and maintenance of associative information can be a monumental task for some types of entities. For example, at least 20% of the US population moves each year.  Because it is too large a task for most organizations to take on by themselves, companies that aggregate large amounts of associative data sometimes offer the application of this knowledge as a product.

In the next installment, I will discuss another common confusion, the difference between entity resolution and identity resolution.

The Human Element in Identity Resolution

Wednesday, February 18th, 2009

By Robert Barker, Infoglide Senior VP and Chief Marketing Officer

We’ve written quite a few posts here on the subject of identity resolution’s application to a broad range of problems that include terrorism, insurance fraud, crime, lottery fraud, sexual predators, workers comp employer fraud, and retail returns fraud. What we haven’t discussed very much is the relationship between the technology and the human beings that employ it.

We software marketers are sometimes tempted to make it sound as though our products solve problems automatically. The truth is that identity resolution software performs tasks that humans could do, but it does them at a level of speed and precision that significantly enhances the results accomplished through those tasks. In order for the software to achieve excellent results, however, human judgment is required both in implementing the software and in applying the results.

The specifics of a particular problem differ markedly, and every solution is different. A person of interest in airline passenger screening has very different characteristics from a person of interest in workers compensation fraud, for example. Solutions differ even within a single problem domain, e.g. Nordstrom and Walmart have very different philosophies for merchandise returns.

In simpler data quality applications, default configurations can address many problems, but in identity resolution, a little tuning by experts greatly increases the solution’s value.  A domain expert may not understand the technology, but they understand their problem, industry, application, and company. And because of their depth of understanding of their domain, they can tell great results from good results in a heartbeat.

For maximum benefit, human domain experts work with technology experts to tune the software during implementation so that it applies the same kind of “judgment” the experts themselves would use to resolve multiple identities, uncover hidden relationships, and minimize false positives and false negatives. Technology’s critical role is to automate the process of sifting through the data to find likely matches and non-obvious relationships, and to prioritize the cases that require human intervention so that finite human resources can focus on the most important things first.
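A simple way to picture this division of labor is a triage step over scored candidate matches. The Python sketch below is only an illustration: the scores, the pair names, and both thresholds are hypothetical, and in practice the thresholds are the kind of setting domain and technology experts would tune together.

```python
def triage(scored_pairs, auto_match=0.9, review=0.6):
    """Split scored candidate pairs into automatic matches and a review queue.

    Thresholds are hypothetical defaults, not recommended values.
    """
    matches, review_queue = [], []
    for pair, score in scored_pairs:
        if score >= auto_match:
            matches.append(pair)                # confident enough to act on
        elif score >= review:
            review_queue.append((score, pair))  # needs human judgment
        # anything below the review threshold is treated as a non-match
    review_queue.sort(reverse=True)             # highest-scoring cases first
    return matches, [pair for _, pair in review_queue]

# Hypothetical similarity scores produced by an earlier matching step.
scored = [
    (("A", "B"), 0.95),
    (("C", "D"), 0.65),
    (("E", "F"), 0.80),
    (("G", "H"), 0.20),
]

matches, to_review = triage(scored)
```

Here one pair is matched automatically, one is discarded, and the two borderline pairs are handed to a person in priority order.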

While it’s critical to have software that can produce results right “off the shelf,” it is the domain expertise coupled with the technology expertise that creates a solution that is perfectly matched to the needs of a particular industry, application, and company.

Identity Resolution Daily Links 2008-5-30

Friday, May 30th, 2008

[Post from Infoglide] Mistaken Identity Resolution Part V: Identity Resolution vs. Data Quality

“In this series of posts on Mistaken Identity Resolution we have compared identity resolution with other market spaces that it’s sometimes confused with, such as Master Data Management (MDM), data integration, and data warehousing. With Informatica’s recent acquisition of Identity Systems, now’s a good time to address the confusion between identity resolution and data quality.”

The Daily News Tribune: Shoplifting a matter of opportunity

“LaRocca said the theft of items that are specifically targeted can be associated with the economic status of the country, however a direct correlation is difficult to make. ‘The economy is just one of a number of different triggers. It is not the driving force behind it,’ LaRocca said. ‘The largest factor is always the opportunity.’ . . . Last year, retailers nationwide lost $40.5 billion worth of merchandise due to theft. . . . That monetary loss, he said, is unfortunately passed on to honest consumers who must pay higher prices. According to National Retailers Federation, consumers pay 1.5 cents more per dollar spent because of the toll shoplifting takes on retail business. Furthermore, when merchandise is stolen, no one is paying sales tax to the state, he said.”

Government Computer News: Big Brother may listen in Britain

“The British government’s Home Office reportedly is considering building a massive database of virtually all the electronic communications generated by residents of the United Kingdom, including voice and data communications along with Web site views and other online traffic.”

Andy on Enterprise Software: A Burning Platform

“I was amused by a piece regarding data quality in which a data quality initiative at a chemical manufacturer was kicked off only after a warehouse burnt down and the company discovered that they had no way of tracing which customers would be affected.”

STORES Knowledge Series: Refunds Management: Balancing Customer Service and Loss Prevention (Free Webinar)

“Retailers are continually challenged with balancing good customer service and loss prevention. Good customer service breeds loyalty and has a strong impact on top-line revenue. Refunds management practices and tools provide retailers with a well-managed solution that can provide top- and bottom-line revenue and profit. Join us as we explore the refunds challenge, tools and practices in place today that can offer a way to manage the refund process, and explore emerging tools that can set your stores apart from the competition.”

PogoWasRight.org: Department of Homeland Security Information Sharing Strategy

“The Department’s Information Sharing Strategy provides strategic direction and guidance for all Department of Homeland Security information sharing efforts.”

Forbes.com: Supermarket group stocked $1.4M into 1Q lobbying

“The Food Marketing Institute, the trade group for food retailers and wholesalers, spent nearly $1.4 million in the first quarter to lobby on food safety, public health and other issues, according to a disclosure report. . . . FMI also lobbied on retail pharmacy and Medicaid drug reimbursement, credit card interchange fees, identity theft, organized retail crime, and pension, wage and tax issues, according to the report filed April 21 with the House clerk’s office.”

Mistaken Identity Resolution Part V: Identity Resolution vs. Data Quality

Wednesday, May 28th, 2008

By Robert Barker, Infoglide Senior Vice President & Chief Marketing Officer

In this series of posts on Mistaken Identity Resolution we have compared identity resolution with other market spaces that it’s sometimes confused with, such as Master Data Management (MDM), data integration, and data warehousing. With Informatica’s recent acquisition of Identity Systems, now’s a good time to address the confusion between identity resolution and data quality.

A Gartner study done several years ago estimated that poor-quality customer data costs U.S. businesses $611 billion a year [see correction]. So obviously data quality is a very important component of data management.

Data quality is defined by Whatis.com as “the reliability and effectiveness of data… maintaining data quality requires going through the data periodically and scrubbing it. Typically this involves updating it, standardizing it, and de-duplicating records to create a single view of the data, even if it is stored in multiple disparate systems.”

Identity resolution is defined by Wikipedia as the process that “analyzes all of the information relating to individuals and/or entities from multiple sources of data, and then applies likelihood and probability scoring to determine which identities are a match and what, if any, non-obvious relationships exist between those identities.”

So while both data quality and identity resolution seek to create a unified view of the data and determine which entities are the same, identity resolution takes the process further by also determining which entities are related.

While both technologies can de-dupe data records, identity resolution adds powerful matching functions to data quality, so numerous patterns that otherwise would go undetected are uncovered quickly and accurately. These data patterns include multi-cultural name and address matching, character insertions/deletions, nicknames, abbreviations, transpositions, repetitions, etc. And perhaps the most valuable reason for incorporating identity resolution into data quality is being able to automate the decision process and integrate it into existing business processes. Specific rules can be applied to the intelligence gathered from search, relationship, and identity results, and an explicit action executed based on business requirements.
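To give a feel for two of these patterns, nicknames and transpositions, here is a small Python sketch built on the standard library’s `difflib.SequenceMatcher`. It is a generic illustration and not how IRE or any other product works; the nickname table is a hypothetical three-entry stand-in for the much larger reference data a real solution would use.

```python
from difflib import SequenceMatcher

# Hypothetical, tiny nickname table for illustration.
NICKNAMES = {"bill": "william", "bob": "robert", "liz": "elizabeth"}

def normalize(name: str) -> str:
    """Lowercase a name and map known nicknames to their formal form."""
    name = name.strip().lower()
    return NICKNAMES.get(name, name)

def name_similarity(a: str, b: str) -> float:
    """Return a similarity score between 0.0 and 1.0 for two names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

nickname_score = name_similarity("Bill", "William")   # nickname resolved first
transposed_score = name_similarity("Jonh", "John")    # two letters swapped
unrelated_score = name_similarity("Mary", "Zed")      # no letters in common
```

An exact-match de-duplication step would treat all three pairs as simply “different”; a similarity score lets the first two surface as likely matches while the third does not.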

Perhaps instead of asking how data quality and identity resolution differ, the better question to ask is, “Are you risking poor data quality if identity resolution is absent?”

Identity Resolution Daily Links 2008-4-18

Friday, April 18th, 2008

[Post from Infoglide Software] Mistaken Identity Resolution Part IV: Identity Resolution vs. Data Warehousing

“Thus far this series of posts on ‘Mistaken Identity Resolution’ has contrasted Identity Resolution with Master Data Management (MDM) and Data Integration. What about one of the more established data concepts - data warehouses? How are they different? And how do they work together? If at all?”

Ovum: Informatica gets new identity, reports steady Q1 growth

“Informatica has agreed to buy Identity Systems, a subsidiary of Nokia, for around $85m in cash. . . . Informatica believes that Identity’s software will give it a ‘differentiated’ cross-language identity matching capabilities. That said, InfoGlide and IBM (from its SRD acquisition) also have identity resolution engines. Arguably these two products are more complete in that they use rules engine and workflow processes to automate decision making.”

Vindy.com: Organized retail theft to equal racketeering

“A Cincinnati lawmaker introduced legislation Tuesday cracking down on organized retail theft rings that target stores for everything from baby formula to small appliances. Republican Sen. Bill Seitz said such crime is costing retailers billions of dollars annually in lost sales and state and local governments some $88 million in sales tax revenues. Seitz’s SB 320 would add ‘organized retail theft’ to the list of offenses Ohio law categorizes as ‘corrupt activity,’ ranking it with racketeering.”

b-eye.com - Business Intelligence Network - Blog: Shawn Rogers: The stack is moving….

“Generally when we use the term stack in the business intelligence world we are talking about the big 4: IBM, Microsoft, Oracle and SAP along with all their recent acquisitions Cognos, BO etc. All of these companies claim to provide a platform to enable every type of business intelligence you might need to have. It’s interesting to me that one of the largest platforms out there is often ignored but continues to grow into our business world through the internet. Google is much more than a search engine. Over the past couple years the company has branched out to become a business solution platform. Some of the players in the BI space are already taking advantage of it.”

KNXV-TV: Website tells you how long security lines are at Sky Harbor Airport

“The next time you fly you may want to click on the TSA website in order to save some time or prepare to wait a little longer in security lines. . . . The government owned site allows a user to click on various airports, terminals, and security checkpoints to see how long you should expect to wait in line before making it to the gate.”

DataFlux Community of Experts: Master Data Discovery and Data Mapping

“In my experience, if you follow processes associated with customer data you start to discover master data residing in other unexpected places, e.g. Access databases, Excel spreadsheets etc. . . . Having discovered disparate master data, you will most likely find that even if copies of master data reside in different systems the data may be named differently across those systems. So, the next challenge is for each master data entity (e.g. customer, product, employee, asset etc.) to map disparate master data in identified systems to commonly defined master data described using the shared business vocabulary. This is a difficult and time-consuming exercise to do manually. Fortunately, there are tools now available to help explore systems looking for master data and to automatically map that disparate data to commonly defined enterprise wide definitions of master data.”

Homeland Security Watch: Homeland Secretary Offers 10-year Vision, in 4 Parts

“Last week DHS Secretary Michael Chertoff delivered a speech at Yale University entitled ‘Confronting The Threats To Our Homeland.’ Citing the five-year mark for DHS and the three-year mark for his tenure as its head, he explained that such an occasion warrants not just one speech, but four. And the first — the one he delivered at Yale — offers insight into the way he views the ‘challenges and threats that we face over the next five and ten years relating to homeland security in the broadest sense.’”

Mistaken Identity Resolution Part IV: Identity Resolution vs. Data Warehousing

Wednesday, April 16th, 2008

By Robert Barker, Infoglide Senior Vice President & Chief Marketing Officer

Thus far this series of posts on “Mistaken Identity Resolution” has contrasted Identity Resolution with Master Data Management (MDM) and Data Integration. What about one of the more established data concepts - data warehouses? How are they different? And how do they work together? If at all?

The differences are pretty simple. In 1990 Bill Inmon defined a data warehouse as “a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision making process.” What was a new concept in 1990 has passed through Gartner’s hype cycle to become an expected component of IT infrastructure for most large- and many medium-sized organizations. Data is transferred into a data warehouse and is expected to reside there indefinitely (and without changing) to support mostly analytical, but increasingly operational, activities. So a data warehouse is a repository for data. Identity resolution technologies are software that operate on data.

So is the relationship between identity resolution and warehousing complementary, beneficial, or problematic? At times the presence of a well-managed data warehouse can ease the identity resolution process by providing one reliable data source that can be similarity searched with other “dirtier” data sources. At other times, the cleansing methodology that is often used to prepare data for warehousing is problematic because it can hide or even obliterate data variances caused by fraudsters. These data variances are exactly what identity resolution uses to resolve multiple identities into one and uncover hidden relationships.

To further confuse the matter, in certain cases, identity resolution technologies like our Identity Resolution Engine(tm) (IRE) operate on data in remote, disparate databases, acting as a virtual data warehouse by combining data “on the fly.” In other cases, IRE works perfectly well on existing data warehouses.

So the answer to our previous question about identity resolution and data warehouses is the famous one you often get from consultants: “it depends.” However, by carefully determining the right points in existing processes to implement identity resolution, data warehousing and identity resolution can be both complementary to each other and beneficial to existing systems.

Make sense? Or maybe not? Let us know what you think.
