Identity Resolution Daily Links 2009-10-26

October 26th, 2009

By the Infoglide Team

Come by and see us at TDWI World in Orlando Nov. 3 & 4!

Forbes.com: Who Is In Charge Of Your Data?

[Dan Woods] “But in most companies, no single person is charged with the task of making sure that the right data is being captured in an efficient way that ensures data quality. The Data Warehousing Institute estimated the annual cost of poor data quality at $600 billion in 2002. Other studies have produced similar estimates.”

Austin American Statesman: Clerk accused of absconding with lottery cash

“So when the 25-year-old quit his job at the convenience store and claimed a $1 million lottery jackpot in Austin, Joshi’s co-workers were suspicious and told investigators, the affidavit said. Those investigators now believe that in May, after a regular customer brought in his lottery tickets and asked Joshi to check if they were winners, Joshi kept the winning ticket, did not tell the customer and claimed the prize for himself, according to the affidavit and Travis County Assistant District Attorney Patty Robertson.”

cnet news: Gartner: Brace yourself for cloud computing

“Cloud computing takes several forms, from the nuts and bolts of Amazon Web Services to the more finished foundation of Google App Engine to the full-on application of Salesforce.com. Companies should figure out what if any of those approaches are most suited to their challenges, Gartner said.”

Identity Resolution Daily Links 2009-10-23

October 23rd, 2009

[Post from Infoglide] Measuring Entity Resolution Accuracy

“In the last post we looked at the problem of comparing two entity resolution (ER) outcomes.  If S represents a list of entity references, then the effect of applying an ER process is to divide S into subsets where each subset comprises all of the references to the same entity.”

Cloud Avenue: Gartner Says Cloud Computing Is The Top Technology Trend In 2010

“Compared to the beginning of 2009, the cloud computing landscape now is very different with a huge potential to change the face of IT forever.”

iHealthBeat: Blumenthal: Officials Working To Boost EHR Connectivity, Security

“Blumenthal also addressed concerns about whether EHR systems would compromise the privacy and security of personal health data. He said regulations are in place to ensure that any health data used for research purposes are stripped of all individually identifiable information.”

Informatica Blog: Data Sharing and Privacy - Eternally Opposed?

“Nevertheless, the risks to privacy from data breaches and concerns about government access to vast stores of private citizen information continue to be recurring themes in today’s security environment. But do the benefits of complete and actionable data always conflict with the desire to secure and maintain privacy?”

Workers’ Comp Insider: Fraud is on the rise

“Steve Tuckey is currently writing an in-depth series on fraud for Risk and Insurance. The first installment, Transparency of Evidence, deals with fraud by doctors, hospitals and other healthcare professionals. He notes that ‘grayer areas of so-called abuse or overutilization continue to vex payers, insurance companies and lawmakers eager to maintain the financial stability and integrity of the system that has protected workers for nearly a century.’”

Measuring Entity Resolution Accuracy

October 21st, 2009

By John Talburt, PhD, CDMP, Director, UALR Laboratory for Advanced Research in Entity Resolution and Information Quality (ERIQ)

In the last post we looked at the problem of comparing two entity resolution (ER) outcomes.  If S represents a list of entity references, then the effect of applying an ER process is to divide S into subsets where each subset comprises all of the references to the same entity.  More formally this is called a “partition” of S.  A partition of a set S is simply a collection of non-empty, non-overlapping subsets of S that contain all of the elements of S.  In other words it is a way to divide S into subsets so that every element of S is in one, and only one, of the subsets.
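
The partition definition above is easy to check mechanically. Here is a minimal Python sketch (with made-up reference identifiers) that verifies the three conditions: non-empty subsets, pairwise disjointness, and full coverage of S.

```python
def is_partition(subsets, s):
    """Check that `subsets` is a partition of the reference list `s`:
    non-empty, pairwise disjoint, and covering every element of `s`."""
    if any(len(block) == 0 for block in subsets):
        return False                      # no empty subsets allowed
    union = set()
    total = 0
    for block in subsets:
        union |= set(block)
        total += len(block)
    # disjoint blocks imply the union size equals the sum of block sizes
    return union == set(s) and total == len(s)

refs = ["r1", "r2", "r3", "r4", "r5"]
# A possible ER outcome: r1/r2 resolve to one entity, r3/r4 to another, r5 alone
outcome = [{"r1", "r2"}, {"r3", "r4"}, {"r5"}]
print(is_partition(outcome, refs))  # True
```

If any reference appeared in two subsets, the total element count would exceed the size of the union, and the check would fail.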

By viewing ER outcomes as partitions of the underlying set of references, the problem of comparing outcomes translates into the problem of comparing two partitions of the same list S.  As pointed out in the last post, there are several methods for making these comparisons including the Rand Index and the Talburt-Wang Index.
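
To make the comparison concrete, here is a short Python sketch of the Rand Index, the simpler of the two measures mentioned: it is the fraction of all reference pairs on which two partitions agree, meaning both place the pair in the same subset or both place it in different subsets. The reference identifiers are illustrative, not from any real data set.

```python
from itertools import combinations

def rand_index(p1, p2):
    """Rand index between two partitions (lists of sets) of the same
    reference list: the fraction of element pairs on which the two
    partitions agree (both together, or both apart)."""
    def block_of(partition):
        # map each reference to the index of the subset containing it
        return {ref: i for i, block in enumerate(partition) for ref in block}
    b1, b2 = block_of(p1), block_of(p2)
    pairs = list(combinations(sorted(b1), 2))
    agree = sum(1 for x, y in pairs
                if (b1[x] == b1[y]) == (b2[x] == b2[y]))
    return agree / len(pairs)

# Two ER outcomes over the same five references
run_a = [{"r1", "r2"}, {"r3", "r4"}, {"r5"}]
run_b = [{"r1", "r2", "r3"}, {"r4"}, {"r5"}]
print(rand_index(run_a, run_b))  # 0.7
```

Of the ten possible pairs, the two outcomes disagree only on the pairs involving r3, which run_b groups with r1 and r2 but run_a groups with r4.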

This contrasts with the traditional view of evaluating record linking in terms of “merging” or “de-duplicating” two lists of records.  The book Data Quality and Record Linkage Techniques by Herzog et al. provides a great overview of this treatment of resolution.  The list-versus-list approach focuses on analyzing the set of all possible pairs of records that can be formed between the two lists.  For example, if List A has 80 records and List B has 100 records, there would be 8,000 (80 x 100) possible record pairs in which the first record comes from List A, and the second from List B.  However, most of the analytical techniques based on this approach, such as the Fellegi-Sunter Model, start with the assumption that each of the lists has no internal duplication.  This is a convenient but often unrealistic assumption to make when working with large lists, especially those from external providers.

When the ER outcome problem is cast in terms of merging two lists, ER accuracy can be viewed in terms of precision and recall, measures borrowed from information retrieval.  Each record in List A can be thought of as a query into List B.  The precision of that query would be the ratio of the correct links it makes with records in List B to the total number of links it makes with records in List B.  Similarly, its recall would be the ratio of its correct links with records in List B to the total number of records in B that it should be linked with.  By extending these measures over all the records in List A, it is possible to define an overall precision and recall measure for the linkage between the two lists.
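
Aggregated over all records, these definitions reduce to set operations on pairs of linked records. The following Python sketch computes overall precision and recall for a linkage, using hypothetical record identifiers and a made-up set of known-correct links.

```python
def link_precision_recall(predicted, truth):
    """Precision and recall for a set of links, where each link is an
    unordered (record_A, record_B) pair."""
    predicted = {frozenset(p) for p in predicted}
    truth = {frozenset(p) for p in truth}
    correct = predicted & truth
    precision = len(correct) / len(predicted) if predicted else 0.0
    recall = len(correct) / len(truth) if truth else 0.0
    return precision, recall

# Hypothetical linkage: the ER process proposed three links,
# two of which appear in the known-correct set of four links.
proposed = [("a1", "b1"), ("a2", "b2"), ("a3", "b9")]
correct = [("a1", "b1"), ("a2", "b2"), ("a4", "b4"), ("a5", "b5")]
p, r = link_precision_recall(proposed, correct)
print(p, r)  # precision = 2/3, recall = 1/2
```

The familiar tension is visible even in this toy case: a cautious process that proposes fewer links tends to raise precision at the expense of recall, and vice versa.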

My preference, however, is to simply view List A and List B as forming a combined list in which linking can take place, not only between the records of A and B, but also internally between records within A and within B.  In my opinion this is a better reflection of what is usually done in the processing of real list files.  The records from two or more lists, or at least the identifying attributes from the records, are standardized into a common file format and combined into a single list.  An ER process is then performed on the combined list leading to its partition into subsets as described above.

If the correct partition of a list of references is known, then the accuracy of a given ER process acting on that list can be represented as the value of the similarity index (e.g., Rand or T-W) obtained by comparing the partition generated by the ER process to the correct partition.  Partition similarity indices are designed to take values from 0 to 1.  Values closer to 0 indicate less similarity, and values closer to 1 indicate closer similarity, with the value equal to 1 if and only if the two partitions are identical.

Whether you are using precision and recall measures or a partition similarity measure of accuracy, both require knowing the correct links.  Of course, if we knew all of the correct links, we wouldn’t need the ER process to begin with.  In general, we only know the correct links for some sample of the references that we are dealing with.

When the entities are people, e.g. customers, obtaining even a relatively small sample of records with the correct links can be difficult.  My experience is that organizations generally do this in three ways: inspection by domain experts, information volunteered by employees, or telemarketing confirmation.

A random selection of records for inspection can be useful, but it is biased toward true positive linking and has little value in detecting false negatives.  An expert might determine that “Jaems Doe on Main St” should link to “James Doe on Main St”, but is unlikely to determine that “Mary Doe on Main St” should link to “Mary Smith on Elm St” without prior knowledge that these are the same customer (or the presence of additional attributes besides “name”).

Employees and their families are often called upon to volunteer benchmark data for linking.  Because this represents an internal view of identity, it can be very rich and replete with prior and alternate names and addresses, dates of birth, and other biographical information.  However, unless the company is very diverse, the benchmark will only address a very narrow population demographic.

Perhaps the most unbiased sample is that obtained by a third-party telemarketing firm.  While this approach can reach a broad sample of the population with varying demographics, it is the most expensive of the three options, and without some internal validation, may not be as accurate as the others.

Next time,  I will continue the discussion of entity resolution metrics. In the meantime, your thoughts are welcome.

Identity Resolution Daily Links 2009-10-19

October 19th, 2009

By the Infoglide Team

information management: Multi-Entity MDM Enablement

“Most efforts, however, are executed in surroundings inhibited by existing infrastructure (legacy applications, tools, hardware and integration), dispersed organizational structures and suboptimal processes. This reality introduces challenges in architecting and deploying efficient and effective multi-entity MDM solutions.”

BAM INTEL: BAM’s Thinking on the New DHS Standards

“Public Fusion Centers must be seen by citizens and policy-makers to play a direct role in the response to disasters as well as intelligence gathering. They cannot remain in the intelligence-sharing role only and not take some of the spotlight when their good work prevents or lessens the impact of America’s next disaster.”

newsday.com: OPINION: Revolution right in your doctor’s hand

“For doctors and their patients (in other words, all of us), the electronic health record is a far more revolutionary idea than those that brought us the ability to download a song, post a video online or read and send e-mails when you’re on a camping trip. While those other innovations indirectly enhance the quality of life, they are designed for entertainment or business purposes. The EHR directly improves quality of life because the end result of its design is better health.”

SmartData Collective: Data May Require Unique Data Quality Processes

“All data quality projects can appear the same from afar but ultimately can be as different as stars and planets. One of the biggest ways they vary is in the data itself and whether it is chiefly made up of name and address data or some other type of data.”

Identity Resolution Daily Links 2009-10-16

October 16th, 2009

[Post from Infoglide] Avoiding False Positives: Analytics or Humans?

“The European Union recently started a five-year research program in conjunction with its expanding role in fighting crime and terrorism. The purpose of Project Indect is to develop advanced analytics that help monitor human activity for ‘automatic detection of threats and abnormal behaviour and violence.’ Naturally, the project has drawn suspicion and criticism, both from those who oppose the growing power of the EU and from watchdog groups concerned about encroachments into privacy and civil liberty…”

SDTimes: Old thinking does a disservice to new data hubs

“The enterprise needs to be able to understand the origin, the time and possibly the reason for a change. These audit needs must be supported by the data hub at the attribute level. MDM solutions that maintain the golden record dynamically address this need by supporting the history of changes in the source systems record content.”

Accision Health Blog: Surveys Show Importance of EHR

“A new Rand study is one of the first to link the use of electronic health records in community-based medical practices with higher quality of care.  Rand Corporation researchers found in a study of 305 groups of primary care physicians that the routine use of multifunctional EHRs was more likely to be linked to higher quality care than other common strategies, such as structural changes used for improving care.”

NYSIF: Central NY Contractor Hit with Workers Comp Fraud Charges

“Investigators said Mr. Decker previously had an insurance policy with NYSIF when he operated RD Builders in November 2005, a policy cancelled for non-payment a few months later. In 2008, he applied to NYSIF’s Syracuse office for workers’ compensation insurance doing business as Bull Rock Development, Inc.”

public intelligence: Office of Intelligence and Analysis (DHS)

“These entities are unified under local fusion centers, which provide state and local officials with intelligence products while simultaneously gathering information for federal sources.  As of July 2009, there were 72 designated fusion centers around the country with 36 field representatives deployed. The Department has provided more than $254 million from FY 2004-2007 to state and local governments to support the centers.”

Avoiding False Positives: Analytics or Humans?

October 14th, 2009

By Robert Barker, Infoglide Senior VP & Chief Marketing Officer

The European Union recently started a five-year research program in conjunction with its expanding role in fighting crime and terrorism. The purpose of Project Indect is to develop advanced analytics that help monitor human activity for “automatic detection of threats and abnormal behaviour and violence.”

Naturally, the project has drawn suspicion and criticism, both from those who oppose the growing power of the EU and from watchdog groups concerned about encroachments into privacy and civil liberty:

According to the Open Europe think tank, the increased emphasis on co-operation and sharing intelligence means that European police forces are likely to gain access to sensitive information held by UK police, including the British DNA database. It also expects the number of UK citizens extradited under the controversial European Arrest Warrant to triple. Stephen Booth, an Open Europe analyst who has helped compile a dossier on the European justice agenda, said these developments and projects such as Indect sounded “Orwellian” and raised serious questions about individual liberty.

Shami Chakrabarti of Liberty, a UK human rights group, said, “Profiling whole populations instead of monitoring individual suspects is a sinister step in any society. It’s dangerous enough at [the] national level, but on a Europe-wide scale the idea becomes positively chilling.”

At IdentityResolutionDaily, we’ve consistently supported open and civil discussion about balancing security requirements with individual rights of privacy and liberty (e.g., “Walking the Privacy/Security Tightrope”). We’ve also dealt with the criticality of using analytic technology that minimizes false positives (e.g., “False Positives versus Citizen Profiles”).

Not long ago, James Taylor of Decision Management Solutions made an excellent point about whether using analytic technologies (e.g. identity resolution) versus relying totally on human judgment increases or decreases the risk of false positives:

Humans, unlike analytics, are prone to prejudices and personal biases. They judge people too much by how they look (stopping the Indian with a beard for instance) and not enough by behavior (stopping the white guy who is nervously fiddling with his shoes say)… If we bring analytics to bear on a problem the question should be does it eliminate more biases and bad decision making than it creates new false positives… Over and over again studies show analytics do better in this regard… I think analytics are ethically neutral and the risk of something going “to the dark side” is the risk that comes from the people involved, with or without analytics.

We couldn’t have said it better ourselves.

Identity Resolution Daily Links 2009-10-12

October 12th, 2009

By the Infoglide Team

revenueXL: Web based EMR - ASP vs. SaaS? Should you really care?

“SaaS applications differ from ASP applications in that SaaS solutions are developed specifically to leverage web technologies such as the browser, thereby making them web-native. The database design and architecture of SaaS applications are specifically built with ‘multi-tenancy’ in mind, thereby enabling multiple tenants (customers or users) to access a shared data model. An ASP application on the other hand in most cases is a typical Client-Server application (meant for a single client) that is accessed over the internet and therefore includes an independent instance of Database that is specifically meant for your medical office.”

The Data Asset: Closing the Loop: Selecting the Right Technology

“Data management tools include those for data profiling, data quality and identity resolution. Measures that need to be addressed include data standardization, pattern standardization, address verification, and adherence to business rules.”

Homeland Security: DHS Announces New Information-Sharing Tool to Help Fusion Centers Combat Terrorism

“State and major urban area fusion centers provide critical links for information sharing between and across all levels of government, and help fulfill key recommendations of the 9/11 Commission. This initiative will serve as a valuable resource to enhance situational awareness and support more timely and complete analysis of national security threats.”

ITBusinessEdge: Seven Data Integration Trends

“Master data management, which should be an enterprisewide endeavor, is being deployed for tactical purposes. The result? MDM projects support specific business needs and aren’t fully integrated across the enterprise.”

Identity Resolution Daily Links 2009-10-09

October 9th, 2009

[Post from Infoglide] Privacy – A Dying Concept?

“An intriguing post by Nate Anderson on Ars Technica highlights a difficult reality about today’s easy availability of vast quantities of ‘anonymized’ data. Quoting from a recent paper by Paul Ohm at the University of Colorado Law School, Anderson writes that ‘as Ohm notes, this illustrates a central reality of data collection: data can either be useful or perfectly anonymous but never both.’”

ComputerworldUK: Data quality tools sub-par, says analyst

“A recent study on data quality by the Information Difference revealed that respondents view data quality as something that is not restricted to one area within the organisation. Instead, two-thirds of respondents said it is an issue spanning the entire organisation…Specifically, 81 per cent of respondents reported being focused on a broader scope than merely customer name and address data.”

BeyeNETWORK: Master Data Management and the Challenge of Reality

“One of the central problems of master data management, which is often poorly stated, is the need to determine if one individual thing is the same as another individual thing. But the only way we have to do this is by matching records, and a record is not the same as the thing it represents. Unlike The Matrix, we are more in danger of confounding two ‘realities’ rather than recognizing them as distinct.”

Information Management: Business Intelligence: A Blueprint to Success

“Fraud detection. Claims managers are using predictive analytics to help identify potentially fraudulent claims as early as the first notice of loss, and are analyzing claims costs to get a better handle on negative trends.”

Government Computer News: How entity resolution can help agencies connect the dots in investigations

“Imagine a law-enforcement scenario. A local police department has information on a crime suspect. Court systems, corrections facilities, the department of motor vehicles and even child-support enforcement may also have information on this person of interest, each specific to its own needs and applications. Implementation of an entity-centric environment would enable each of the organizations and systems to continue its operations while also providing the police a much more holistic view of the crime suspect along with potentially important pieces of information.”

Privacy – A Dying Concept?

October 7th, 2009

By Gary Seeger, Infoglide Vice President

An intriguing post by Nate Anderson on Ars Technica highlights a difficult reality about today’s easy availability of vast quantities of “anonymized” data. Quoting from a recent paper by Paul Ohm at the University of Colorado Law School, Anderson writes that “as Ohm notes, this illustrates a central reality of data collection: ‘data can either be useful or perfectly anonymous but never both.’”

A seminal study published in 2000 by Latanya Sweeney at Carnegie Mellon opened the issue by proving that a simple combination of a very small number of publicly available attributes can uniquely identify individuals:

“It was found that 87% (216 million of 248 million) of the population in the United States had reported characteristics that likely made them unique based only on {5-digit ZIP, gender, date of birth}. About half of the U.S. population (132 million of 248 million or 53%) are likely to be uniquely identified by only {place, gender, date of birth}, where place is basically the city, town, or municipality in which the person resides… In general, few characteristics are needed to uniquely identify a person.”
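
Sweeney’s re-identification result is easy to illustrate: group records by the quasi-identifier combination and count how many combinations occur only once. The Python sketch below does this on a small invented population; the records and the resulting fraction are purely illustrative, not Sweeney’s data.

```python
from collections import Counter

def unique_fraction(records, keys):
    """Fraction of records whose combination of the given
    quasi-identifier fields is unique within the data set."""
    combos = Counter(tuple(rec[k] for k in keys) for rec in records)
    unique = sum(1 for rec in records
                 if combos[tuple(rec[k] for k in keys)] == 1)
    return unique / len(records)

# Toy population: two people share every quasi-identifier, three do not
people = [
    {"zip": "78701", "gender": "F", "dob": "1980-01-02"},
    {"zip": "78701", "gender": "F", "dob": "1980-01-02"},
    {"zip": "78704", "gender": "M", "dob": "1975-06-30"},
    {"zip": "78745", "gender": "F", "dob": "1990-11-15"},
    {"zip": "78702", "gender": "M", "dob": "1982-03-09"},
]
print(unique_fraction(people, ["zip", "gender", "dob"]))  # 0.6
```

Even in this tiny example, three of five “anonymized” records are uniquely identified by ZIP, gender, and date of birth alone, which is exactly the mechanism behind Sweeney’s 87% figure.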

Faced with a choice between exploiting easily obtainable data for righteous ends versus the potential misuse of identifying individuals, can an appropriate balance be struck by privacy legislation? Anderson points out that:

“Because most data privacy laws focus on restricting personally identifiable information (PII), most data privacy laws need to be rethought. And there won’t be any magic bullet; the measures that are taken will increase privacy or reduce the utility of data, but there will be no way to guarantee maximal usefulness and maximal privacy at the same time.”

Looking at the subject from a business perspective, using technologies such as identity resolution to connect non-obvious data relationships serves many initiatives. It would seem admirable to exploit public records and other forms of publicly available information to mitigate risks, uncover fraud, or track down “bad” guys. Yet some cry foul when the technology exposes individuals who didn’t anticipate that their “private” information would be used to identify and/or track them down.

In the rapidly evolving cyber-information age, the desires, conflicts, and limitations of protecting privacy will continue to be sorted out in the legal realm. Those of us who solve business issues using identity resolution technology will swim in this legal quagmire for many years. Finding an appropriate balance between the protection of individual privacy and bona fide business uses of “public” data will almost certainly be a growing challenge to the moral and legal minds of our community.

Identity Resolution Daily Links 2009-10-05

October 5th, 2009

By the Infoglide Team

todaysthv.com: Arkansas Business on Today’s THV: Arkansas Lottery

“The efforts start at the lottery’s west little rock distribution center, home to 26 million lottery tickets potentially worth about $48 million in winnings. But those tickets are worthless until they pass through multiple security scans. The system ensures that no one can redeem a winning ticket if it was taken from a hijacked delivery truck, or a smash-and-grab at a convenience store that sells the tickets.”

Telegraph.co.uk: EU funding ‘Orwellian’ artificial intelligence plan to monitor public for ‘abnormal behaviour’

“York University’s computer science department website details how its task is to develop ‘computational linguistic techniques for information gathering and learning from the web’… ‘Our focus is on novel techniques for word sense induction, entity resolution, relationship mining, social network analysis [and] sentiment analysis,’ it says.”

Information Management: Risk Management and the Need for Master Data Management

“By reconciling disparate master data (clients, products, vendors, chart-of-accounts, reference data) across the enterprise, MDM can provide organizations with a comprehensive and accurate view of their businesses, helping them understand their risk exposure to clients and vendors and their overall financial health.”

Government Computer News: Fusion center approach could be effective in other areas

“Closely related cousins to fusion centers are emergency operations centers. Although these centers might also deal with security-related data feeds, their main function is to import real-time data that’s related to specific events such as national disasters or terrorist incidents. An emergency operations center may track everything from the location of ambulances or rescue personnel to available hospital beds or even the location of victims who need to be rescued.”
