
Archive for the ‘Identity Matching’ Category

Identity Resolution Daily Links 2009-10-16

Friday, October 16th, 2009

[Post from Infoglide] Avoiding False Positives: Analytics or Humans?

“The European Union recently started a five-year research program in conjunction with its expanding role in fighting crime and terrorism. The purpose of Project Indect is to develop advanced analytics that help monitor human activity for ‘automatic detection of threats and abnormal behaviour and violence.’ Naturally, the project has drawn suspicion and criticism, both from those who oppose the growing power of the EU and from watchdog groups concerned about encroachments into privacy and civil liberty…”

SDTimes: Old thinking does a disservice to new data hubs

“The enterprise needs to be able to understand the origin, the time and possibly the reason for a change. These audit needs must be supported by the data hub at the attribute level. MDM solutions that maintain the golden record dynamically address this need by supporting the history of changes in the source systems record content.”

Accision Health Blog: Surveys Show Importance of EHR

“A new Rand study is one of the first to link the use of electronic health records in community-based medical practices with higher quality of care.  Rand Corporation researchers found in a study of 305 groups of primary care physicians that the routine use of multifunctional EHRs was more likely to be linked to higher quality care than other common strategies, such as structural changes used for improving care.”

NYSIF: Central NY Contractor Hit with Workers Comp Fraud Charges

“Investigators said Mr. Decker previously had an insurance policy with NYSIF when he operated RD Builders in November 2005, a policy cancelled for non-payment a few months later. In 2008, he applied to NYSIF’s Syracuse office for workers’ compensation insurance doing business as Bull Rock Development, Inc.”

public intelligence: Office of Intelligence and Analysis (DHS)

“These entities are unified under local fusion centers, which provide state and local officials with intelligence products while simultaneously gathering information for federal sources.  As of July 2009, there were 72 designated fusion centers around the country with 36 field representatives deployed. The Department has provided more than $254 million from FY 2004-2007 to state and local governments to support the centers.”

Avoiding False Positives: Analytics or Humans?

Wednesday, October 14th, 2009

By Robert Barker, Infoglide Senior VP & Chief Marketing Officer

The European Union recently started a five-year research program in conjunction with its expanding role in fighting crime and terrorism. The purpose of Project Indect is to develop advanced analytics that help monitor human activity for “automatic detection of threats and abnormal behaviour and violence.”

Naturally, the project has drawn suspicion and criticism, both from those who oppose the growing power of the EU and from watchdog groups concerned about encroachments into privacy and civil liberty:

According to the Open Europe think tank, the increased emphasis on co-operation and sharing intelligence means that European police forces are likely to gain access to sensitive information held by UK police, including the British DNA database. It also expects the number of UK citizens extradited under the controversial European Arrest Warrant to triple. Stephen Booth, an Open Europe analyst who has helped compile a dossier on the European justice agenda, said these developments and projects such as Indect sounded “Orwellian” and raised serious questions about individual liberty.

Shami Chakrabarti of Liberty, a UK human rights group, said, “Profiling whole populations instead of monitoring individual suspects is a sinister step in any society. It’s dangerous enough at [the] national level, but on a Europe-wide scale the idea becomes positively chilling.”

At IdentityResolutionDaily, we’ve consistently supported open and civil discussion about balancing security requirements with individual rights of privacy and liberty (e.g. “Walking the Privacy/Security Tightrope”). We’ve also addressed the importance of using analytic technology that minimizes false positives (e.g. “False Positives versus Citizen Profiles”).

Not long ago, James Taylor of Decision Management Solutions made an excellent point about whether using analytic technologies (e.g. identity resolution) rather than relying entirely on human judgment increases or decreases the risk of false positives:

Humans, unlike analytics, are prone to prejudices and personal biases. They judge people too much by how they look (stopping the Indian with a beard for instance) and not enough by behavior (stopping the white guy who is nervously fiddling with his shoes say)… If we bring analytics to bear on a problem the question should be does it eliminate more biases and bad decision making than it creates new false positives… Over and over again studies show analytics do better in this regard… I think analytics are ethically neutral and the risk of something going “to the dark side” is the risk that comes from the people involved, with or without analytics.

We couldn’t have said it better ourselves.

Identity Resolution Daily Links 2009-10-09

Friday, October 9th, 2009

[Post from Infoglide] Privacy – A Dying Concept?

“An intriguing post by Nate Anderson on Ars Technica highlights a difficult reality about today’s easy availability of vast quantities of ‘anonymized’ data. Quoting from a recent paper by Paul Ohm at the University of Colorado Law School, Anderson writes that ‘as Ohm notes, this illustrates a central reality of data collection: data can either be useful or perfectly anonymous but never both.’”

ComputerworldUK: Data quality tools sub-par, says analyst

“A recent study on data quality by the Information Difference revealed that respondents view data quality as something that is not restricted to one area within the organisation. Instead, two-thirds of respondents said it is an issue spanning the entire organisation…Specifically, 81 per cent of respondents reported being focused on a broader scope than merely customer name and address data.”

BeyeNETWORK: Master Data Management and the Challenge of Reality

“One of the central problems of master data management, which is often poorly stated, is the need to determine if one individual thing is the same as another individual thing. But the only way we have to do this is by matching records, and a record is not the same as the thing it represents. Unlike The Matrix, we are more in danger of confounding two ‘realities’ rather than recognizing them as distinct.”

Information Management: Business Intelligence: A Blueprint to Success

“Fraud detection. Claims managers are using predictive analytics to help identify potentially fraudulent claims as early as the first notice of loss, and are analyzing claims costs to get a better handle on negative trends.”

Government Computer News: How entity resolution can help agencies connect the dots in investigations

“Imagine a law-enforcement scenario. A local police department has information on a crime suspect. Court systems, corrections facilities, the department of motor vehicles and even child-support enforcement may also have information on this person of interest, each specific to its own needs and applications. Implementation of an entity-centric environment would enable each of the organizations and systems to continue its operations while also providing the police a much more holistic view of the crime suspect along with potentially important pieces of information.”

Privacy – A Dying Concept?

Wednesday, October 7th, 2009

By Gary Seeger, Infoglide Vice President

An intriguing post by Nate Anderson on Ars Technica highlights a difficult reality about today’s easy availability of vast quantities of “anonymized” data. Quoting from a recent paper by Paul Ohm at the University of Colorado Law School, Anderson writes that “as Ohm notes, this illustrates a central reality of data collection: ‘data can either be useful or perfectly anonymous but never both.’”

A seminal study published in 2000 by Latanya Sweeney at Carnegie Mellon opened the issue by proving that a simple combination of a very small number of publicly available attributes can uniquely identify individuals:

“It was found that 87% (216 million of 248 million) of the population in the United States had reported characteristics that likely made them unique based only on {5-digit ZIP, gender, date of birth}. About half of the U.S. population (132 million of 248 million or 53%) are likely to be uniquely identified by only {place, gender, date of birth}, where place is basically the city, town, or municipality in which the person resides… In general, few characteristics are needed to uniquely identify a person.”

Faced with a choice between exploiting easily obtainable data for righteous ends versus the potential misuse of identifying individuals, can an appropriate balance be struck by privacy legislation? Anderson points out that:

“Because most data privacy laws focus on restricting personally identifiable information (PII), most data privacy laws need to be rethought. And there won’t be any magic bullet; the measures that are taken will increase privacy or reduce the utility of data, but there will be no way to guarantee maximal usefulness and maximal privacy at the same time.”

Looking at the subject from a business perspective, using technologies such as identity resolution to connect non-obvious data relationships serves many initiatives. It would seem admirable to exploit public records and other forms of publicly available information to mitigate risks, uncover fraud, or track down “bad” guys. Yet some cry foul when the technology exposes individuals who didn’t anticipate that their “private” information would be used to identify and/or track them down.

In the rapidly evolving cyber-information age, the desires, conflicts, and limitations of protecting privacy will continue to be sorted out in the legal realm. Those of us who solve business issues using identity resolution technology will swim in this legal quagmire for many years. Finding an appropriate balance between the protection of individual privacy and bona fide business uses of “public” data will almost certainly be a growing challenge to the moral and legal minds of our community.

Identity Resolution Daily Links 2009-10-05

Monday, October 5th, 2009

By the Infoglide Team

todaysthv.com: Arkansas Business on Today’s THV: Arkansas Lottery

“The efforts start at the lottery’s West Little Rock distribution center, home to 26 million lottery tickets potentially worth about $48 million in winnings. But those tickets are worthless until they pass through multiple security scans. The system ensures that no one can redeem a winning ticket if it was taken from a hijacked delivery truck, or a smash-and-grab at a convenience store that sells the tickets.”

Telegraph.co.uk: EU funding ‘Orwellian’ artificial intelligence plan to monitor public for ‘abnormal behaviour’

“York University’s computer science department website details how its task is to develop ‘computational linguistic techniques for information gathering and learning from the web’… ‘Our focus is on novel techniques for word sense induction, entity resolution, relationship mining, social network analysis [and] sentiment analysis,’ it says.”

Information Management: Risk Management and the Need for Master Data Management

“By reconciling disparate master data (clients, products, vendors, chart-of-accounts, reference data) across the enterprise, MDM can provide organizations with a comprehensive and accurate view of their businesses, helping them understand their risk exposure to clients and vendors and their overall financial health.”

Government Computer News: Fusion center approach could be effective in other areas

“Closely related cousins to fusion centers are emergency operations centers. Although these centers might also deal with security-related data feeds, their main function is to import real-time data that’s related to specific events such as national disasters or terrorist incidents. An emergency operations center may track everything from the location of ambulances or rescue personnel to available hospital beds or even the location of victims who need to be rescued.”

Identity Resolution Daily Links 2009-10-02

Friday, October 2nd, 2009

[Post from Infoglide] To Move or Not to Move: That is the Question

A continual theme at IdentityResolutionDaily is maintaining the privacy and confidentiality of data at all times. Two recent posts concerned fusion centers and citizen profiling, but the same issues apply to virtually any application of entity resolution technology. The fact is that, in some cases, anonymous identity resolution is a requirement for more sensitive identity resolution implementations.

GCN: Entity resolution’s growing role in security efforts

“Research firm and consultancy Gartner has been tracking the entity-resolution market for several years. ‘Entity resolution and analysis was previously an obscure technology that has come to the forefront as a result of world events and market forces where it is used to identify the use of false identities and networks of individuals who are attempting to hide their relationships to each other,’ stated Gartner in ‘Hype Cycle for Master Data Management,’ a report released in June.”

iHealthBeat: Consensus Needed on EHR Access, Privacy Issues, Panelists Say

“Panelists noted that although some patients want the ability to segregate and mask certain sections of their EHRs, physicians are wary of protections that would deny them access to critical patient medical information.”

Security Management: Fusion Centers Forge Ahead

“More than 70 operate at the state, regional, and urban levels. The question eight years after 9-11 is: How well are these centers fulfilling their goals of information collection, analysis, and dissemination—and to the extent that these efforts are falling short, what remains to be done to meet the goals and to ensure the future sustainability of these centers?”

SmartData Collective: Poor Data Quality is a Virus

“Poor data quality is a viral contaminant that will undermine the operational, tactical, and strategic initiatives essential to the enterprise’s mission to survive and thrive in today’s highly competitive and rapidly evolving marketplace. Left untreated or unchecked, this infectious agent will negatively impact the quality of business decisions.”

Identity Resolution Daily Links 2009-09-28

Monday, September 28th, 2009

[Post from Infoglide] Social CRM, CDI, and Identity Resolution

“In her well-read book on CDI, Jill Dyché offers a definition of CDI that also seems to describe social CRM. Try reading her definition of CDI, replacing ‘CDI’ with ‘social CRM’: CDI is a set of procedures, controls, skills and automation that standardize and integrate customer data originating from multiple sources.”

Concord Monitor: Don’t play games when giving your name

“What do they want? Your date of birth, your gender and your middle initial. This information will be relayed to the TSA, and the TSA will match the information against information maintained by the Terrorist Screening Center (an arm of the FBI that gathers and consolidates watch lists). The theory is that a 12-year-old boy named John X. Doe can more easily be separated from John Z. Doe, who happens to be a 37-year-old man with a history of making bombs, if additional information is collected during the booking process. Once TSA has cleared you, you’ll be issued a boarding pass.”

pressdemocrat.com: Achieving paperless health care

“Medical record-keeping, until recently, relied on rooms full of paper files that were easily misplaced and filled with hurried, handwritten entries that could be hard to read. Electronic records hold orderly, keyboard-entered data that never leaves a hard drive and have the potential to move seamlessly from a primary care provider’s office to an emergency room or specialist’s suite.”

ebizQ: MDM Becoming More Critical in Light of Cloud Computing

[David Linthicum] “We’re moving from complex federated on-premise systems, to complex federated on-premise and cloud-delivered systems.   Typically, we’re moving in these new directions without regard for an underlying strategy around MDM, or other data management issues for that matter.”

Homeland Security: I&A Reconceived: Defining a Homeland Security Intelligence Role

“There are currently 72 fusion centers up and running around the country (a substantial increase from 38 centers in 2006).  I&A has deployed 39 intelligence officers to fusion centers nationwide, with another five in pre-deployment training and nearly 20 in various stages of administrative processing.  I&A will deploy a total of 70 officers by the end of FY 2010, and will complete installation of the Homeland Secure Data Network (HSDN), which allows the federal government to share Secret-level intelligence and information with state and local partners, at all 72 fusion centers.”

Identity Resolution Daily Links 2009-09-18

Friday, September 18th, 2009

[Post from Infoglide] Metrics for Entity Resolution

“In the last post I discussed the concepts of internal and external views of identity.  The fact that we can have different views of the same identity then raises the question of how to go about comparing different views.  What complicates this issue is that, even though we can talk about resolving references in pairs (i.e. linking two records if they refer to the same entity), the total number of references can be quite large, and consequently, there are many possible pair-wise combinations to consider.”

FederalComputerWeek: DOD opens some classified information to non-federal officials

“The non-federal officials will get access via the Homeland Security department’s secret-level Homeland Security Data Network. That network is currently deployed at 27 of the more than 70 fusion centers located around the country, according to DHS.”

Gerson Lehrman Group: Stylish Master Data Management

“In my experience, one of these styles is nigh-on impractical.  ‘Centralized’ (also called ‘transaction’) implies a wrenching architectural shift whereby a master data hub becomes the one and only source of master data for an enterprise, replacing the functionality of generating master data in existing transaction systems, and serving up ‘golden copy’ data to other systems, perhaps via an enterprise service bus architecture. This sounds elegant, but is extremely invasive.”

INFORMATICA PERSPECTIVES: Get To “Meaningful Use” Faster

“Identity resolution’s goal is to find the right person at the right time, regardless of the potential for error and variation in what information is available at the time of request.  This could be during patient registration and admission, patient transfers or referrals, emergency room visits, and simply sharing information across providers or insurers.  The ability to do this effectively must become the most basic and core function.”


Metrics for Entity Resolution

Thursday, September 17th, 2009

By John Talburt, PhD, CDMP, Director, UALR Laboratory for Advanced Research in Entity Resolution and Information Quality (ERIQ)

In the last post I discussed the concepts of internal and external views of identity.  The fact that we can have different views of the same identity then raises the question of how to go about comparing different views.  What complicates this issue is that, even though we can talk about resolving references in pairs (i.e. linking two records if they refer to the same entity), the total number of references can be quite large, and consequently, there are many possible pair-wise combinations to consider.

The number of pair combinations grows quadratically even though the size of the list grows only linearly.  The number of distinct pairs that can be drawn from a list of N items is N*(N-1)/2, so even in a list of 10 items, there are 45 possible pair combinations – not so bad.

But now consider the issue of how many different ways these 45 links (comparisons) among the 10 references could be labeled as true (the two references are to the same entity) or false (the two references are to different entities) and at the same time make sense as links.  By making sense, I mean that if we were to label the link between references 1 and 2 as true, and the link between references 2 and 3 as true, then we would also have to label the link between references 1 and 3 as true.  Even in light of this condition, it still turns out that there are 115,975 ways that a set of 10 references could be linked together.
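To make these counts concrete, here is a minimal Python sketch (my own illustration, not code from the post) that computes the pair count N*(N-1)/2 and the number of consistent ways to partition N references – the Bell number, computed here with the Bell triangle.  For N = 10 it reproduces the 45 pairs and the 115,975 possible linkages mentioned above.

```python
def bell(n):
    """Number of distinct ways to partition a set of n items (the Bell number B_n)."""
    row = [1]                        # row 0 of the Bell triangle
    for _ in range(n):
        new_row = [row[-1]]          # each new row starts with the last entry of the previous row
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[0]

n = 10
pairs = n * (n - 1) // 2             # N*(N-1)/2 distinct pairs
print(pairs)                         # 45
print(bell(n))                       # 115975 consistent ways to link 10 references
```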

To follow our example, suppose that the 10 references are represented by the first 10 letters of the alphabet {a, b, c, d, e, f, g, h, i, j}.  One of the 115,975 ways they could be linked together would be {a, b, c} all linked as belonging to the same entity, {d, e} together, {f} by itself, and {g, h, i, j} together.  Another way is {a, c, d} together, {b, f} together, {e} by itself, {g, h, i} together, and {j} by itself.  So how similar is the first way of linking these 10 references to the second way of linking them?

This is not a new problem, and there are many ways to approach it.  The problem is easier to visualize if we build an “intersection matrix”.  The matrix is simply a table that lists one set of groupings as row labels and the other set as column labels.  The cell at a particular row and column is the size of the intersection between the groupings.  Here is the intersection matrix for the example just given:

              {a,c,d}   {b,f}   {e}   {g,h,i}   {j}
  {a,b,c}        2        1      0       0       0
  {d,e}          1        0      1       0       0
  {f}            0        1      0       0       0
  {g,h,i,j}      0        0      0       3       1

The number 2 in the cell at the second row and second column of the table indicates that the first grouping in the first linkage ({a, b, c}) and the first grouping in the second linkage ({a, c, d}) share 2 elements in common, “a” and “c”.
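As an illustration (not part of the original post), the intersection matrix above can be reproduced with a few lines of Python, representing each grouping as a set:

```python
linkage_1 = [{"a", "b", "c"}, {"d", "e"}, {"f"}, {"g", "h", "i", "j"}]    # first way of linking
linkage_2 = [{"a", "c", "d"}, {"b", "f"}, {"e"}, {"g", "h", "i"}, {"j"}]  # second way of linking

# Cell (i, j) is the number of references shared by grouping i of linkage_1 and grouping j of linkage_2.
matrix = [[len(row_group & col_group) for col_group in linkage_2] for row_group in linkage_1]
for row in matrix:
    print(row)
# [2, 1, 0, 0, 0]
# [1, 0, 1, 0, 0]
# [0, 1, 0, 0, 0]
# [0, 0, 0, 3, 1]

overlaps = sum(1 for row in matrix for cell in row if cell > 0)
print(overlaps)   # 7 non-empty cells (used later as |C| in the Talburt-Wang formula)
```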

In statistics, this is called cluster analysis, and there are several methods for comparing clusterings.  Most notable is the Rand Index, which has a value from 0 to 1, with values closer to zero indicating less similarity and values closer to one indicating more similarity.  The value is equal to 1 only when the two sets of groupings are identical.

The calculation of the Rand Index takes a bit of explaining, but basically it involves counting the pair-wise links in the various cells shown in the table above.  In this example, the value of the Rand Index turns out to be 0.800, or 80% similar.
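As a rough sketch of that calculation (again my own illustration, not code from the post), the Rand Index can be computed by counting, over all 45 pairs, how often the two linkages agree on whether a pair is linked:

```python
from itertools import combinations

linkage_1 = [{"a", "b", "c"}, {"d", "e"}, {"f"}, {"g", "h", "i", "j"}]
linkage_2 = [{"a", "c", "d"}, {"b", "f"}, {"e"}, {"g", "h", "i"}, {"j"}]

def linked_pairs(linkage):
    """The set of reference pairs that a linkage places in the same grouping."""
    return {frozenset(p) for group in linkage for p in combinations(sorted(group), 2)}

def rand_index(linkage_a, linkage_b):
    elements = sorted({e for group in linkage_a for e in group})
    all_pairs = [frozenset(p) for p in combinations(elements, 2)]
    a, b = linked_pairs(linkage_a), linked_pairs(linkage_b)
    agreements = sum(1 for p in all_pairs if (p in a) == (p in b))   # linked in both or in neither
    return agreements / len(all_pairs)

print(round(rand_index(linkage_1, linkage_2), 3))   # 0.8
```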

A few years ago, Dr. Rich Wang, Director of the MIT Information Quality Program, and I wanted a simpler similarity index that could be used as a quick way to assess entity resolution results.  The method we developed is much simpler to calculate, in that it does not involve the formula for combinations.  The key values for calculating our index are just the number of groupings and the number of overlaps between those groupings.  The formula is as follows:

TW = SQRT( |A| x |B| ) / |C|

Where
|A| represents the number of groupings in the first linkage (number of rows in the table)
|B| represents the number of groupings in the second linkage (number of columns)
|C| represents the number of overlaps between the groupings (number of cells > 0)

For the example given in the table, the value is TW = SQRT(4 x 5) / 7 = 0.639.
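In code, the index is essentially a one-liner over the same two linkages (my own sketch, assuming the set representation used above):

```python
import math

linkage_1 = [{"a", "b", "c"}, {"d", "e"}, {"f"}, {"g", "h", "i", "j"}]
linkage_2 = [{"a", "c", "d"}, {"b", "f"}, {"e"}, {"g", "h", "i"}, {"j"}]

def talburt_wang_index(linkage_a, linkage_b):
    """TW = SQRT(|A| x |B|) / |C|, where |C| counts the non-empty intersections."""
    overlaps = sum(1 for group_a in linkage_a for group_b in linkage_b if group_a & group_b)
    return math.sqrt(len(linkage_a) * len(linkage_b)) / overlaps

print(round(talburt_wang_index(linkage_1, linkage_2), 3))   # 0.639
```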

According to our index, the two groupings are only about 64% similar.  In the next post I will discuss the application of our index and other metrics that can be used to assess entity resolution outcomes.

Identity Resolution Daily Links 2009-09-14

Monday, September 14th, 2009

By the Infoglide Team

MAINJUSTICE: Report Finds Flaws in DOJ Worker Comp Oversight

[easy registration required] “The Justice Department does not have effective measures in place to prevent fraud, abuse and waste in its program to provide compensation for employees with work-related injuries or illnesses, according to a DOJ Office of Inspector General report released today.”

Information Management: HP and Informatica’s Expanded Relationship: Portent of Bigger Deals to Come?

“So is the partnership with Informatica a ‘proof of concept’ for future acquisition or is it simply HP BIS’s answer: ‘We are a services business and we will leave software to our partners’?”

FederalComputerWeek: 5 decisions that will determine the fate of e-health records

“Under the economic stimulus law passed earlier this year, as much as $45 billion will be distributed to health care providers who buy and use approved electronic health record systems. The road ahead is still bumpy for EHRs, but experts say success hinges on the outcomes of five major decisions.”

Dalton’s Blog: Migrating Data into an MDM Repository - Case Study

“Notice that if you’re using Data Federation to implement your MDM solution, there is no data migration. Data Federation acts as a virtual central repository, and as such, does not require a physical copy of your source data. Data Federation “translates” the source information in real time according to required business rules and definitions. It is, so to speak, a real-time Extract-Transform process.”

