
Archive for October, 2009

Identity Resolution Daily Links 2009-10-30

Friday, October 30th, 2009

[Post from Infoglide] Enriching E-discovery Results with Identity Resolution

“Civil lawsuits often result in discovery orders from the court to produce every shred of possibly relevant internal communication. The need to comprehend patterns across the resulting vast amount of aggregated data is critical. To help organizations respond to these demands, powerful e-discovery software systems (e.g., see StoredIQ) create data topology maps that identify the relationships between active sources of multiple forms of electronically stored information (ESI).”

USA Players: Lottery Winner Demands Payment After Crooked Clerk Pilfers Ticket 

“Pankaj Joshi, the accused, was an employee at the convenience store in which Willis purchased his tickets. Joshi had allegedly told Willis that the ticket that he presumed was worth millions was worth only $2 dollars, which Joshi presumably paid to Willis. Joshi was charged with lottery fraud, and it is suspected that he took the winnings and fled to his homeland of Nepal.”

For more, see Lottery Fraud by Retailers Is an Identity Resolution Problem

The Daily Texan: Civil liberties groups voice ‘fusion center’ apprehension

“It will be funded initially by U.S. Department of Homeland Security grants and will then become self-sustaining, using personnel already within APD’s budget. ‘It is really important for law enforcement to be able to share information in a timely fashion, because when you share information, you can solve crimes quicker and, in some cases, prevent another serial offense from happening,’ Carter said. Carter said Central Texas agencies possess large amounts of lawfully collected information, but separate information systems hinder the sharing of information.”

BeyeNETWORK: Master Data Management Checklist #5: Data Quality Mechanics

[David Loshin] “The ability to use the traditional data quality toolset of data parsing, standardization and matching enables the development of a “customer master,” “product master,” “security master,” etc. that becomes the master entity index to be used for ongoing identity resolution and elimination of duplicate entries.”

Airlines and Destinations: Passenger Info Required at Booking under TSA’s Secure Flight Program

“When making a flight booking, each passenger must declare their full name just as it appears in their passport, as well as their gender and date of birth. The airline sends the information to the TSA 72 hours before the flight departure time. The TSA compares the information with watch lists with the purpose of identifying suspected terrorists, preventing access to flights by passengers prohibited from flying, and identifying individuals for whom an enhanced security check should be performed.”

Enriching E-discovery Results with Identity Resolution

Wednesday, October 28th, 2009

By Robert Barker, Infoglide Senior VP & Chief Marketing Officer

Civil lawsuits often result in discovery orders from the court to produce every shred of possibly relevant internal communication. The need to comprehend patterns across the resulting vast amount of aggregated data is critical. To help organizations respond to these demands, powerful e-discovery software systems (e.g., see StoredIQ) create data topology maps that identify the relationships between active sources of multiple forms of electronically stored information (ESI).

Reading about how it works made me speculate, “What value might identity resolution bring to e-discovery?” Identity resolution (AKA “entity resolution”) technology has been used to create solutions for a wide range of problems. Most often, this involves creating an understanding about people and their hidden relationships with other people and organizations. Lottery retailer fraud, airline passenger screening, and workers’ compensation are just a few examples of areas that have benefited from applying this emerging technology.

Since lawsuits revolve around people, it shouldn’t be surprising that technology capturing enriched information about the identities of the actors in a suit could greatly illuminate what’s known about the parties involved. For example, imagine if augmenting e-discovery with identity resolution could do two things for each person involved in the suit (a toy sketch follows the list below):

  1. Automate detection of hidden relationships between the participants and other relevant players by drawing from multiple public and private data sources; and
  2. Generate link analyses, including graphical depictions involving the participants and other “entities” like conversation threads, that greatly enrich the litigants’ understanding.
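
Purely as an illustration of point 2, here is a toy sketch of the kind of link graph identity resolution could feed. It is my own invention, not a description of any product: it assumes the networkx Python library is available, and every address, name, and thread ID in it is made up.

    import networkx as nx

    # Invented example: two e-mail addresses that identity resolution
    # has already collapsed into one canonical person.
    resolved = {
        "j.doe@corp.example": "John Doe",
        "jdoe.home@mail.example": "John Doe",   # personal account, same person
        "a.lee@corp.example": "Anna Lee",
    }

    # Each message links its resolved sender to a conversation-thread node.
    messages = [
        ("j.doe@corp.example", "thread-17"),
        ("jdoe.home@mail.example", "thread-17"),
        ("a.lee@corp.example", "thread-17"),
    ]

    G = nx.Graph()
    for address, thread in messages:
        G.add_edge(resolved[address], thread)

    # John Doe's corporate and personal mail now converge on one node,
    # exposing a relationship that a per-address view would split in two.
    print(sorted(G.neighbors("thread-17")))   # ['Anna Lee', 'John Doe']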

Caveat: I’m not an e-discovery expert. However, based on what we know about identity resolution, I wonder whether it’s possible to enhance the results of the e-discovery process.

If you have knowledge and experience in the space, I’d like to hear your thoughts.

Identity Resolution Daily Links 2009-10-26

Monday, October 26th, 2009

By the Infoglide Team

Come by and see us at TDWI World in Orlando Nov. 3 & 4!

Forbes.com: Who Is In Charge Of Your Data?

[Dan Woods] “But in most companies, no single person is charged with the task of making sure that the right data is being captured in an efficient way that ensures data quality. The Data Warehousing Institute estimated the annual cost of poor data quality at $600 billion in 2002. Other studies have produced similar estimates.”

Austin American Statesman: Clerk accused of absconding with lottery cash

“So when the 25-year-old quit his job at the convenience store and claimed a $1 million lottery jackpot in Austin, Joshi’s co-workers were suspicious and told investigators, the affidavit said. Those investigators now believe that in May, after a regular customer brought in his lottery tickets and asked Joshi to check if they were winners, Joshi kept the winning ticket, did not tell the customer and claimed the prize for himself, according to the affidavit and Travis County Assistant District Attorney Patty Robertson.”

Hartford Business: State Recommits To Fighting Shadow Labor

“The state board charged with cracking down on employers who fail to pay employee taxes and workers’ compensation premiums will meet on Nov. 5, following a 10-month hiatus.”

cnet news: Gartner: Brace yourself for cloud computing

“Cloud computing takes several forms, from the nuts and bolts of Amazon Web Services to the more finished foundation of Google App Engine to the full-on application of Salesforce.com. Companies should figure out what if any of those approaches are most suited to their challenges, Gartner said.”

Identity Resolution Daily Links 2009-10-23

Friday, October 23rd, 2009

[Post from Infoglide] Measuring Entity Resolution Accuracy

“In the last post we looked at the problem of comparing two entity resolution (ER) outcomes.  If S represents a list of entity references, then the effect of applying an ER process is to divide S into subsets where each subset comprises all of the references to the same entity.”

Cloud Avenue: Gartner Says Cloud Computing Is The Top Technology Trend In 2010

“Compared to the beginning of 2009, the cloud computing landscape now is very different with a huge potential to change the face of IT forever.”

iHealthBeat: Blumenthal: Officials Working To Boost EHR Connectivity, Security

“Blumenthal also addressed concerns about whether EHR systems would compromise the privacy and security of personal health data. He said regulations are in place to ensure that any health data used for research purposes are stripped of all individually identifiable information.”

Informatica Blog: Data Sharing and Privacy - Eternally Opposed?

“Nevertheless, the risks to privacy from data breaches and concerns about government access to vast stores of private citizen information continue to be recurring themes in today’s security environment. But do the benefits of complete and actionable data always conflict with the desire to secure and maintain privacy?”

Workers’ Comp Insider: Fraud is on the rise

“Steve Tuckey is currently writing an in-depth series on fraud for Risk and Insurance. The first installment, Transparency of Evidence, deals with fraud by doctors, hospitals and other healthcare professionals. He notes that ‘grayer areas of so-called abuse or overutilization continue to vex payers, insurance companies and lawmakers eager to maintain the financial stability and integrity of the system that has protected workers for nearly a century.’”

Measuring Entity Resolution Accuracy

Wednesday, October 21st, 2009

By John Talburt, PhD, CDMP, Director, UALR Laboratory for Advanced Research in Entity Resolution and Information Quality (ERIQ)

In the last post we looked at the problem of comparing two entity resolution (ER) outcomes.  If S represents a list of entity references, then the effect of applying an ER process is to divide S into subsets where each subset comprises all of the references to the same entity.  More formally, this is called a “partition” of S.  A partition of a set S is simply a collection of non-empty, non-overlapping subsets of S that contain all of the elements of S.  In other words, it is a way to divide S into subsets so that every element of S is in one, and only one, of the subsets.
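
To make the definition concrete, here is a minimal Python sketch (my own illustration, not from the post) that represents an ER outcome as a list of sets and checks the three partition conditions: non-empty subsets, no overlaps, and full coverage of S.

    def is_partition(s, subsets):
        """True if `subsets` is a partition of the set `s`: every subset
        is non-empty, no two subsets overlap, and their union is `s`."""
        if any(len(part) == 0 for part in subsets):
            return False                   # no empty subsets allowed
        seen = set()
        for part in subsets:
            if part & seen:                # overlaps an earlier subset
                return False
            seen |= part
        return seen == s                   # union must cover all of S

    # Five references resolved to three entities:
    S = {"r1", "r2", "r3", "r4", "r5"}
    er_outcome = [{"r1", "r2"}, {"r3"}, {"r4", "r5"}]
    print(is_partition(S, er_outcome))     # True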

By viewing ER outcomes as partitions of the underlying set of references, the problem of comparing outcomes translates into the problem of comparing two partitions of the same list S.  As pointed out in the last post, there are several methods for making these comparisons, including the Rand Index and the Talburt-Wang Index.
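
Both indices are straightforward to sketch in Python. This is my own code, written from the published definitions as I understand them: the Rand Index is the fraction of element pairs on which two partitions agree, and the Talburt-Wang Index divides the geometric mean of the two partition sizes by the number of non-empty overlaps between their subsets.

    from itertools import combinations
    from math import sqrt

    def rand_index(p1, p2):
        """Rand Index between two partitions (lists of sets) of the
        same set: the fraction of element pairs they agree on."""
        label1 = {x: i for i, part in enumerate(p1) for x in part}
        label2 = {x: i for i, part in enumerate(p2) for x in part}
        pairs = list(combinations(label1, 2))
        agree = sum((label1[x] == label1[y]) == (label2[x] == label2[y])
                    for x, y in pairs)
        return agree / len(pairs)

    def talburt_wang_index(p1, p2):
        """Talburt-Wang Index: sqrt(|P1| * |P2|) divided by the number of
        non-empty pairwise intersections between subsets of P1 and P2."""
        overlaps = sum(1 for a in p1 for b in p2 if a & b)
        return sqrt(len(p1) * len(p2)) / overlaps

    p1 = [{"r1", "r2"}, {"r3"}, {"r4", "r5"}]
    p2 = [{"r1"}, {"r2"}, {"r3"}, {"r4", "r5"}]
    print(rand_index(p1, p2))              # 0.9
    print(talburt_wang_index(p1, p2))      # ~0.87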

This contrasts with the traditional view of evaluating record linking in terms of “merging” or “de-duplicating” two lists of records.  The book Data Quality and Record Linkage Techniques by Herzog et al. provides a great overview of this treatment of resolution.  The list-versus-list approach focuses on analyzing the set of all possible pairs of records that can be formed between the two lists.  For example, if List A has 80 records and List B has 100 records, there would be 8,000 (80 x 100) possible record pairs in which the first record comes from List A and the second from List B.  However, most of the analytical techniques based on this approach, such as the Fellegi-Sunter Model, start with the assumption that neither list has any internal duplication.  This is a convenient but often unrealistic assumption when working with large lists, especially those from external providers.
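
The pair space itself is easy to picture; this short sketch (mine, not the book’s) just enumerates the 80 x 100 example from the paragraph above.

    from itertools import product

    list_a = [f"A{i}" for i in range(80)]     # 80 records in List A
    list_b = [f"B{j}" for j in range(100)]    # 100 records in List B

    # Every candidate pair takes its first record from A, its second from B.
    candidate_pairs = list(product(list_a, list_b))
    print(len(candidate_pairs))               # 8000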

When the ER outcome problem is cast in terms of merging two lists, ER accuracy can be viewed in terms of precision and recall, measures borrowed from information retrieval.  Each record in List A can be thought of as a query into List B.  The precision of that query would be the ratio of the correct links it makes with records in List B to the total number of links it makes with records in List B.  Similarly, its recall would be the ratio of its correct links with records in List B to the total number of records in B that it should be linked with.  By extending these measures over all the records in List A, it is possible to define an overall precision and recall measure for the linkage between the two lists.
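
In code, the overall (micro-averaged) versions of these two ratios reduce to set operations on link pairs. A minimal sketch with invented record IDs:

    def precision_recall(predicted, truth):
        """Overall precision/recall for A-to-B links.  Both arguments are
        sets of (a_record, b_record) pairs: the links an ER process made
        and the links it should have made."""
        correct = predicted & truth
        precision = len(correct) / len(predicted) if predicted else 1.0
        recall = len(correct) / len(truth) if truth else 1.0
        return precision, recall

    predicted = {("A1", "B1"), ("A1", "B7"), ("A2", "B3")}   # links made
    truth = {("A1", "B1"), ("A2", "B3"), ("A3", "B9")}       # links that exist
    print(precision_recall(predicted, truth))   # (0.667, 0.667), approximately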

My preference, however, is to simply view List A and List B as forming a combined list in which linking can take place not only between the records of A and B, but also internally between records within A and within B.  In my opinion, this is a better reflection of what is usually done in the processing of real list files.  The records from two or more lists, or at least the identifying attributes from the records, are standardized into a common file format and combined into a single list.  An ER process is then performed on the combined list, leading to its partition into subsets as described above.
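
A toy version of that combined-list flow, again my own sketch, with a deliberately naive exact-match ER step standing in for a real matcher:

    from collections import defaultdict

    def standardize(record):
        """Toy standardization: trim, uppercase, collapse internal spaces."""
        return " ".join(record.strip().upper().split())

    list_a = ["james doe ", "Mary Smith"]
    list_b = ["JAMES DOE", "mary smith", "James  Doe"]   # internal duplicate

    # Tag each reference with its source list, then combine into one list.
    combined = [("A", i, r) for i, r in enumerate(list_a)] + \
               [("B", i, r) for i, r in enumerate(list_b)]

    # ER pass over the combined list: group on the standardized key.
    clusters = defaultdict(list)
    for source, idx, raw in combined:
        clusters[standardize(raw)].append((source, idx))

    partition = list(clusters.values())
    print(partition)
    # [[('A', 0), ('B', 0), ('B', 2)], [('A', 1), ('B', 1)]]
    # Links form across A and B *and* internally within B.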

If the correct partition of a list of references is known, then the accuracy of a given ER process acting on that list can be represented as the value of the similarity index (e.g., Rand or T-W) obtained by comparing the partition generated by the ER process to the correct partition.  Partition similarity indices are designed to take values from 0 to 1.  Values closer to 0 indicate less similarity, and values closer to 1 indicate closer similarity, with the value equal to 1 if and only if the two partitions are identical.
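
With the index functions sketched earlier, the accuracy measurement itself is then a single comparison (illustrative values only):

    truth = [{"r1", "r2", "r3"}, {"r4", "r5"}]         # known correct partition
    er_result = [{"r1", "r2"}, {"r3"}, {"r4", "r5"}]   # what the ER process produced

    accuracy = rand_index(er_result, truth)   # 1.0 only when the partitions match
    print(accuracy)                           # 0.8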

Whether you measure accuracy with precision and recall or with a partition similarity index, you must know the correct links.  Of course, if we knew all of the correct links, we wouldn’t need the ER process to begin with.  In general, we only know the correct links for some sample of the references that we are dealing with.

When the entities are people, e.g. customers, obtaining even a relatively small sample of records with the correct links can be difficult.  My experience is that organizations generally do this in three ways: inspection by domain experts, information volunteered by employees, or telemarketing confirmation.

A random selection of records for inspection can be useful, but it is biased toward verifying the links that were made (true positives) and has little value in detecting false negatives.  An expert might determine that “Jaems Doe on Main St” should link to “James Doe on Main St”, but is unlikely to determine that “Mary Doe on Main St” should link to “Mary Smith on Elm St” without prior knowledge that these are the same customer (or the presence of additional attributes besides “name”).
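
A small illustration of that asymmetry using plain character-level similarity (difflib is in the Python standard library; the names are the paragraph’s own examples):

    from difflib import SequenceMatcher

    def name_similarity(a, b):
        """Rough character-level similarity between two reference strings."""
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    print(name_similarity("Jaems Doe on Main St", "James Doe on Main St"))
    # high (~0.95): a reviewer or a matcher has a clear signal to flag this pair
    print(name_similarity("Mary Doe on Main St", "Mary Smith on Elm St"))
    # much lower: nothing in the strings hints these are the same customer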

Employees and their families are often called upon to volunteer benchmark data for linking.  Because this represents an internal view of identity, it can be very rich and replete with prior and alternate names and addresses, dates of birth, and other biographical information.  However, unless the company is very diverse, the benchmark will only address a very narrow population demographic.

Perhaps the most unbiased sample is that obtained by a third-party telemarketing firm.  While this approach can reach a broad sample of the population with varying demographics, it is the most expensive of the three options, and without some internal validation, may not be as accurate as the others.

Next time, I will continue the discussion of entity resolution metrics. In the meantime, your thoughts are welcome.

Identity Resolution Daily Links 2009-10-19

Monday, October 19th, 2009

By the Infoglide Team

Information Management: Multi-Entity MDM Enablement

“Most efforts, however, are executed in surroundings inhibited by existing infrastructure (legacy applications, tools, hardware and integration), dispersed organizational structures and suboptimal processes. This reality introduces challenges in architecting and deploying efficient and effective multi-entity MDM solutions.”

BAM INTEL: BAM’s Thinking on the New DHS Standards

“Public Fusion Centers must be seen by citizens and policy-makers to play a direct role in the response to disasters as well as intelligence gathering. They cannot remain in the intelligence-sharing role only and not take some of the spotlight when their good work prevents or lessens the impact of America’s next disaster.”

newsday.com: OPINION: Revolution right in your doctor’s hand

“For doctors and their patients (in other words, all of us), the electronic health record is a far more revolutionary idea than those that brought us the ability to download a song, post a video online or read and send e-mails when you’re on a camping trip. While those other innovations indirectly enhance the quality of life, they are designed for entertainment or business purposes. The EHR directly improves quality of life because the end result of its design is better health.”

SmartData Collective: Data May Require Unique Data Quality Processes

“All data quality projects can appear the same from afar but ultimately can be as different as stars and planets. One of the biggest ways they vary is in the data itself and whether it is chiefly made up of name and address data or some other type of data.”

Identity Resolution Daily Links 2009-10-16

Friday, October 16th, 2009

[Post from Infoglide] Avoiding False Positives: Analytics or Humans?

“The European Union recently started a five-year research program in conjunction with its expanding role in fighting crime and terrorism. The purpose of Project Indect is to develop advanced analytics that help monitor human activity for ‘automatic detection of threats and abnormal behaviour and violence.’ Naturally, the project has drawn suspicion and criticism, both from those who oppose the growing power of the EU and from watchdog groups concerned about encroachments into privacy and civil liberty…”

SDTimes: Old thinking does a disservice to new data hubs

“The enterprise needs to be able to understand the origin, the time and possibly the reason for a change. These audit needs must be supported by the data hub at the attribute level. MDM solutions that maintain the golden record dynamically address this need by supporting the history of changes in the source systems record content.”

Accision Health Blog: Surveys Show Importance of EHR

“A new Rand study is one of the first to link the use of electronic health records in community-based medical practices with higher quality of care.  Rand Corporation researchers found in a study of 305 groups of primary care physicians that the routine use of multifunctional EHRs was more likely to be linked to higher quality care than other common strategies, such as structural changes used for improving care.”

NYSIF: Central NY Contractor Hit with Workers Comp Fraud Charges

“Investigators said Mr. Decker previously had an insurance policy with NYSIF when he operated RD Builders in November 2005, a policy cancelled for non-payment a few months later. In 2008, he applied to NYSIF’s Syracuse office for workers’ compensation insurance doing business as Bull Rock Development, Inc.”

public intelligence: Office of Intelligence and Analysis (DHS)

“These entities are unified under local fusion centers, which provide state and local officials with intelligence products while simultaneously gathering information for federal sources.  As of July 2009, there were 72 designated fusion centers around the country with 36 field representatives deployed. The Department has provided more than $254 million from FY 2004-2007 to state and local governments to support the centers.”

Avoiding False Positives: Analytics or Humans?

Wednesday, October 14th, 2009

By Robert Barker, Infoglide Senior VP & Chief Marketing Officer

The European Union recently started a five-year research program in conjunction with its expanding role in fighting crime and terrorism. The purpose of Project Indect is to develop advanced analytics that help monitor human activity for “automatic detection of threats and abnormal behaviour and violence.”

Naturally, the project has drawn suspicion and criticism, both from those who oppose the growing power of the EU and from watchdog groups concerned about encroachments into privacy and civil liberty:

According to the Open Europe think tank, the increased emphasis on co-operation and sharing intelligence means that European police forces are likely to gain access to sensitive information held by UK police, including the British DNA database. It also expects the number of UK citizens extradited under the controversial European Arrest Warrant to triple. Stephen Booth, an Open Europe analyst who has helped compile a dossier on the European justice agenda, said these developments and projects such as Indect sounded “Orwellian” and raised serious questions about individual liberty.

Shami Chakrabarti of Liberty, a UK human rights group, said, “Profiling whole populations instead of monitoring individual suspects is a sinister step in any society. It’s dangerous enough at [the] national level, but on a Europe-wide scale the idea becomes positively chilling.”

At IdentityResolutionDaily, we’ve consistently supported open and civil discussion about balancing security requirements with individual rights of privacy and liberty (e.g., “Walking the Privacy/Security Tightrope”). We’ve also dealt with the criticality of using analytic technology that minimizes false positives (e.g., “False Positives versus Citizen Profiles”).

Not long ago, James Taylor of Decision Management Solutions made an excellent point about whether using analytic technologies (e.g. identity resolution) versus relying totally on human judgment increases or decreases the risk of false positives:

Humans, unlike analytics, are prone to prejudices and personal biases. They judge people too much by how they look (stopping the Indian with a beard for instance) and not enough by behavior (stopping the white guy who is nervously fiddling with his shoes say)… If we bring analytics to bear on a problem the question should be does it eliminate more biases and bad decision making than it creates new false positives… Over and over again studies show analytics do better in this regard… I think analytics are ethically neutral and the risk of something going “to the dark side” is the risk that comes from the people involved, with or without analytics.

We couldn’t have said it better ourselves.

Identity Resolution Daily Links 2009-10-12

Monday, October 12th, 2009

By the Infoglide Team

revenueXL: Web based EMR - ASP vs. SaaS? Should you really care?

“SaaS applications differ from ASP applications in that SaaS solutions are developed specifically to leverage web technologies such as the browser, thereby making them web-native. The database design and architecture of SaaS applications are specifically built with ‘multi-tenancy’ in mind, thereby enabling multiple tenants (customers or users) to access a shared data model. An ASP application on the other hand in most cases is a typical Client-Server application (meant for a single client) that is accessed over the internet and therefore includes an independent instance of Database that is specifically meant for your medical office.”

The Data Asset: Closing the Loop: Selecting the Right Technology

“Data management tools include those for data profiling, data quality and identity resolution. Measures that need to be addressed include data standardization, pattern standardization, address verification, and adherence to business rules.”

Homeland Security: DHS Announces New Information-Sharing Tool to Help Fusion Centers Combat Terrorism

“State and major urban area fusion centers provide critical links for information sharing between and across all levels of government, and help fulfill key recommendations of the 9/11 Commission. This initiative will serve as a valuable resource to enhance situational awareness and support more timely and complete analysis of national security threats.”

ITBusinessEdge: Seven Data Integration Trends

“Master data management, which should be an enterprisewide endeavor, is being deployed for tactical purposes. The result? MDM projects support specific business needs and aren’t fully integrated across the enterprise.”

Identity Resolution Daily Links 2009-10-09

Friday, October 9th, 2009

[Post from Infoglide] Privacy – A Dying Concept?

“An intriguing post by Nate Anderson on Ars Technica highlights a difficult reality about today’s easy availability of vast quantities of ‘anonymized’ data. Quoting from a recent paper by Paul Ohm at the University of Colorado Law School, Anderson writes that ‘as Ohm notes, this illustrates a central reality of data collection: data can either be useful or perfectly anonymous but never both.’”

ComputerworldUK: Data quality tools sub-par, says analyst

“A recent study on data quality by the Information Difference revealed that respondents view data quality as something that is not restricted to one area within the organisation. Instead, two-thirds of respondents said it is an issue spanning the entire organisation…Specifically, 81 per cent of respondents reported being focused on a broader scope than merely customer name and address data.”

BeyeNETWORK: Master Data Management and the Challenge of Reality

“One of the central problems of master data management, which is often poorly stated, is the need to determine if one individual thing is the same as another individual thing. But the only way we have to do this is by matching records, and a record is not the same as the thing it represents. Unlike The Matrix, we are more in danger of confounding two ‘realities’ rather than recognizing them as distinct.”

Information Management: Business Intelligence: A Blueprint to Success

“Fraud detection. Claims managers are using predictive analytics to help identify potentially fraudulent claims as early as the first notice of loss, and are analyzing claims costs to get a better handle on negative trends.”

Government Computer News: How entity resolution can help agencies connect the dots in investigations

“Imagine a law-enforcement scenario. A local police department has information on a crime suspect. Court systems, corrections facilities, the department of motor vehicles and even child-support enforcement may also have information on this person of interest, each specific to its own needs and applications. Implementation of an entity-centric environment would enable each of the organizations and systems to continue its operations while also providing the police a much more holistic view of the crime suspect along with potentially important pieces of information.”

