Archive for the ‘Anonymous Identity Resolution’ Category

Architectures for Entity Resolution-Part 2

Wednesday, March 10th, 2010

By John Talburt, PhD, CDMP, Director, UALR Laboratory for Advanced Research in Entity Resolution and Information Quality (ERIQ)

In the last post we examined how entity resolution (ER) systems are actually implemented, starting with the most basic merge/purge process and heterogeneous join systems. Both of these approaches focus on collecting equivalent references from among the sources provided, either as a large batch of references in a single file, or through queries against a federation of databases.  The entity identities found by these ER systems are transient in the sense that they depend upon the sources input into the process.  When different sources are provided, different identities will emerge.

On the other hand, there are ER systems that retain and manage identity information.  By doing this they are able to “recognize” the same identity over time and assign that identity the same entity identifier (sometimes called “persistent identifiers” or “persistent links”).  In Customer Data Integration (CDI) applications, these kinds of systems are sometimes called Customer Recognition Systems.

Two major types of ER systems perform identity management.  The first type is the “identity resolution” system.  It is most effective in situations where a fairly stable set of known identities of interest exists, such as the set of vendors or customers of a company, a set of products, or the students enrolled in a school.  The attributes of these identities are pre-loaded into the system and assigned identifiers.  When a reference is given to the system, it then decides whether the reference is to one of the known identities, and if so, returns the identifier of that identity.
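
To make the lookup concrete, here is a minimal sketch of an identity resolution step in Python. The pre-loaded identities, the two attributes, the averaged similarity score, and the 0.85 threshold are all assumptions made for illustration; a production system would use its own matching rules.

```python
from difflib import SequenceMatcher

# Hypothetical pre-loaded identities: identifier -> attributes.
KNOWN_IDENTITIES = {
    "C001": {"name": "John Kennedy", "city": "Boston"},
    "C002": {"name": "Maria Lopez", "city": "Austin"},
}


def similarity(a, b):
    """Case-insensitive string similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def resolve(reference, identities=KNOWN_IDENTITIES, threshold=0.85):
    """Return the identifier of the best-matching known identity, or None."""
    best_id, best_score = None, 0.0
    for identifier, attrs in identities.items():
        # Assumed scoring rule: plain average of name and city similarity.
        score = (similarity(reference["name"], attrs["name"])
                 + similarity(reference["city"], attrs["city"])) / 2
        if score > best_score:
            best_id, best_score = identifier, score
    return best_id if best_score >= threshold else None
```

Calling `resolve({"name": "Jhon Kennedy", "city": "Boston"})` returns the stored identifier for John Kennedy, while a reference that resembles none of the known identities returns `None`.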

Identity resolution systems can operate in either batch or transactional mode.  In cases where there are a large number of pre-stored identities, the performance of batch operations can be improved through distributed processing where the identities are partitioned over multiple processors and resolved in parallel.
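
The partitioning idea can be sketched as follows. This is only an illustration: the partitions, the name-only scoring, and the threshold are assumptions, and a thread pool stands in for the separate processors a real deployment would use.

```python
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher

# Hypothetical identity store split into two non-empty partitions.
PARTITIONS = [
    {"C001": "John Kennedy", "C002": "Maria Lopez"},
    {"C003": "Wei Zhang", "C004": "Aisha Bello"},
]


def best_match(reference_name, partition):
    """Return (score, identifier) for the best match inside one partition."""
    return max(
        (SequenceMatcher(None, reference_name.lower(), name.lower()).ratio(),
         identifier)
        for identifier, name in partition.items()
    )


def resolve_partitioned(reference_name, partitions, threshold=0.85):
    """Search every partition concurrently, then keep the overall best match."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        results = list(
            pool.map(lambda p: best_match(reference_name, p), partitions)
        )
    score, identifier = max(results)
    return identifier if score >= threshold else None
```

Each partition reports only its own best candidate, so the final comparison touches one result per partition rather than every stored identity.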

However, there are many situations where the identities are not necessarily known in advance or, in some cases, the entities are known but simply not organized in such a way that they can be easily pre-loaded.  For example, suppose two companies merge and each company has its own customer database. The customers are identified in different ways in each database and, furthermore, poor systems and practices at one of the companies leave no confidence that its master records are unduplicated across business lines or company locations.

The type of system often applied in these situations is an “identity capture” system.  The identity capture architecture can be seen as a hybrid of merge/purge and identity resolution systems.  It supports identity management and persistent identifiers, but without starting with a preloaded set of identities.  In my next post, we’ll delve deeper into the identity capture process.
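
As a rough preview, the behavior described above might be sketched like this. The class name, the identifier format, the name-only matching, and the threshold are hypothetical; the sketch only shows the idea of starting empty, capturing new identities as references arrive, and returning the same identifier when an identity is seen again.

```python
from difflib import SequenceMatcher
from itertools import count


class IdentityCapture:
    """Hypothetical identity capture store that starts with no identities."""

    def __init__(self, threshold=0.85):
        self.threshold = threshold
        self.identities = {}        # identifier -> names captured so far
        self._next_id = count(1)

    def _score(self, name, known_names):
        """Best similarity between a name and any name already captured."""
        return max(
            SequenceMatcher(None, name.lower(), known.lower()).ratio()
            for known in known_names
        )

    def process(self, name):
        """Return a persistent identifier, creating one if nothing matches."""
        best_id, best_score = None, 0.0
        for identifier, names in self.identities.items():
            score = self._score(name, names)
            if score > best_score:
                best_id, best_score = identifier, score
        if best_id is not None and best_score >= self.threshold:
            self.identities[best_id].append(name)   # enrich the known identity
            return best_id
        new_id = f"E{next(self._next_id):04d}"       # capture a new identity
        self.identities[new_id] = [name]
        return new_id
```

Feeding it "John Kennedy", then "Maria Lopez", then "Jhon Kennedy" yields two identifiers, with the third reference receiving the same identifier as the first.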

Master Data Movement

Thursday, January 28th, 2010

By Douglas Wood, Infoglide Senior Vice President

I read with interest yesterday’s article at SeekingAlpha which discusses rumors swirling around the MDM software industry.  According to the article, sources suggest that two deals are very near completion.  The first of those rumored transactions would see Informatica picking up MDM provider Siperian.  On the heels of their acquisitions of Identity Systems and AddressDoctor, the Siperian purchase could not be totally unexpected – but would most certainly create some ripple effect worth watching.

The first thing that springs to mind is what Oracle would intend to do with Informatica.  A long-time business partner of Oracle, strengthened through the 2008 purchase of Identity Systems, Informatica could now only be classified as a true and direct competitor to Oracle.  Can Oracle continue to OEM technology (SSA Name3, for example) from what would instantly become a major competitor?  Sleeping with the enemy is one thing… leaving money on the nightstand afterwards is another thing altogether!  It will be interesting to see what happens here, to say the least.

The other rumored acquisition is that of Initiate Systems by IBM.  Thought to be roughly twice the size of Siperian, Initiate would tend to give further credibility to IBM’s vast – and growing – presence in the Health Care industry, where Initiate has become a recognized industry leader.  What muddies the waters, however, would be the question of what IBM would intend to do with Initiate’s entity resolution engine.  In a nutshell, Initiate has been one of two software vendors doing an excellent job of providing technologies applicable for both MDM and fraud/risk related implementations.  Infoglide Software Corporation is the other.

Marketed in an eerily similar fashion to Infoglide’s earlier-released Identity Resolution Engine (is imitation the most sincere form of flattery?), Initiate’s offering in this identity resolution space could become short-lived given IBM’s large and ongoing investment in InfoSphere Identity Insight Solutions (formerly Entity Analytics Solutions).  How soon that would happen, of course, is anyone’s guess.

One thing is certain, however: the need for technology that both supports MDM initiatives and exposes risk and fraud through matching and linking of entities is very real and growing.  How the other major industry players react – should either or both of these rumors become reality – will define the industry for years to come.

Actionable Identity Intelligence from Identity Resolution

Friday, January 8th, 2010

By Brian Calvert, Infoglide Senior Software Architect

The recent “Christmas Bomber” incident prompted many posts about applying technology to address the gaps that allowed it to happen. For example, David Loshin wrote a piece for BeyeNETWORK about a “master terrorist system,” while Lawrence Dubov suggested improving the watch list process using entity resolution. While technology is a critical component of any solution, some specific issues about the technology are important to understand.

In an address this week, President Obama outlined the shortcomings in people, processes, and technologies that gave the now infamous Christmas Bomber the opportunity to take down a Detroit-bound flight.

President Obama identified three major problem areas:

It’s now clear that shortcomings occurred in three broad and compounding ways. First, although our intelligence community had learned a great deal about the al Qaeda affiliate in Yemen called al Qaeda in the Arabian Peninsula — that we knew that they sought to strike the United States, and that they were recruiting operatives to do so — the intelligence community did not aggressively follow up on and prioritize particular streams of intelligence related to a possible attack against the homeland.

Second, this contributed to a larger failure of analysis — a failure to connect the dots of intelligence that existed across our intelligence community, and which together could have revealed that Abdulmutallab was planning an attack.

Third, this in turn fed into shortcomings in the watch-listing system which resulted in this person not being placed on the no-fly list; thereby allowing him to board that plane in Amsterdam for Detroit.

CNN highlighted one additional failing that’s relevant to the topic of Identity Resolution (my emphasis):

A timeline provided by the State Department officials, who spoke on condition of anonymity, showed that an initial check of the suspect based on his father’s information failed to disclose he had a multiple-entry U.S. visa. The reason was that AbdulMutallab’s name was misspelled. “That search did not come back positive,” said one official, who called it a quick search without using multiple variants of spelling.

What are the specific technology issues?

While the details of the technologies used by the State Department are not identified, the story is typically the same for government and industry. Simple equivalency lookups are not enough. “John Kennedy” will not match “Jhon Kennedy” with standard database lookups. Furthermore, some technologies rely on strategies that actually destroy the forensic integrity of the data. They force it into pre-existing molds in a variety of ways to perform similarity matching. We’ve addressed the many challenges to matching names in this blog in the past, especially in “Playing the Name Game with Terrorist Watch Lists and Shoplifter Databases”.
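
The gap between an equivalency lookup and a similarity search is easy to demonstrate. The sketch below uses Python's standard `difflib` as a stand-in similarity measure; it is not the algorithm used by any particular product, and the one-entry watch list and the 0.85 threshold are assumptions.

```python
from difflib import SequenceMatcher

# Hypothetical one-entry watch list.
WATCH_LIST = {"John Kennedy"}


def exact_lookup(name):
    """Simple equivalency lookup: a single transposed letter defeats it."""
    return name in WATCH_LIST


def similarity_lookup(name, threshold=0.85):
    """Return watch list entries whose similarity to the name meets the threshold."""
    return [
        entry for entry in WATCH_LIST
        if SequenceMatcher(None, name.lower(), entry.lower()).ratio() >= threshold
    ]
```

With these definitions, `exact_lookup("Jhon Kennedy")` is `False`, while `similarity_lookup("Jhon Kennedy")` returns the "John Kennedy" entry. Raising the threshold trades false positives for false negatives, and lowering it does the reverse.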

Indexing is one approach that can fail. It tries to turn common names and known variations and nicknames into identical easily matched tokens. So John, Jack, and Johnny might all translate to “F12391”, facilitating a quick match. But what happens when John’s name, like AbdulMutallab’s, is misspelled? “Jhon” will fail to be matched to the common code and, thus, the match will quickly fail. Encoding is another common example that we addressed. Algorithms like “soundex” attempt to translate words into a fuzzy phonetic equivalent. But the promise of these algorithms falls short, especially when they encounter misspellings, nicknames, and cultural variations.
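
Both failure modes can be reproduced in a few lines. The nickname table below is hypothetical (it simply reuses the “F12391” token from the example above), and the `soundex` function is a plain implementation of the classic American Soundex rules rather than any vendor's encoder.

```python
# Hypothetical nickname index: known variants map to one shared token.
NICKNAME_INDEX = {"john": "F12391", "jack": "F12391", "johnny": "F12391"}

# Classic American Soundex consonant codes.
SOUNDEX_CODES = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}


def index_token(name):
    """Return the shared token for a known variant, or None if not indexed."""
    return NICKNAME_INDEX.get(name.lower())


def soundex(name):
    """Encode a name as its first letter plus three digits."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0]
    previous = SOUNDEX_CODES.get(letters[0], "")
    for c in letters[1:]:
        code = SOUNDEX_CODES.get(c, "")
        if code and code != previous:
            encoded += code
        if c not in "HW":            # H and W do not separate repeated codes
            previous = code
    return (encoded + "000")[:4]
```

Here `index_token("Jhon")` returns `None` because the misspelling is not in the table. Soundex happens to rescue that particular transposition, since "John" and "Jhon" receive the same code, but it gives "John" and "Jack" different codes, and a misspelled first letter such as "Cennedy" for "Kennedy" changes the code as well.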

So while merging all information into a common view or improving watchlist management might be part of the solution, they will still fail if the technology used to merge or search is not up to the task.

Not all identity resolution technologies are the same. Ours can be configured using a number of strategies to fit particular customer performance requirements, sensitivity to false positives or false negatives, and Similarity Search behaviors, including specialized name algorithms that catch misspellings, nicknames, and ordering variations.

Although the consequences are grimmer in homeland security situations, the challenges are the same for financial, healthcare, gaming, state and local government, and marketing applications. While it remains to be seen what improvements the US government will apply to the people, processes, and technology used to secure the country, it’s easy to see that simple misspellings need not break that system or, for that matter, any other.

Identity Resolution Daily Links 2009-10-16

Friday, October 16th, 2009

[Post from Infoglide] Avoiding False Positives: Analytics or Humans?

“The European Union recently started a five-year research program in conjunction with its expanding role in fighting crime and terrorism. The purpose of Project Indect is to develop advanced analytics that help monitor human activity for ‘automatic detection of threats and abnormal behaviour and violence.’ Naturally, the project has drawn suspicion and criticism, both from those who oppose the growing power of the EU and from watchdog groups concerned about encroachments into privacy and civil liberty…”

SDTimes: Old thinking does a disservice to new data hubs

“The enterprise needs to be able to understand the origin, the time and possibly the reason for a change. These audit needs must be supported by the data hub at the attribute level. MDM solutions that maintain the golden record dynamically address this need by supporting the history of changes in the source systems record content.”

Accision Health Blog: Surveys Show Importance of EHR

“A new Rand study is one of the first to link the use of electronic health records in community-based medical practices with higher quality of care.  Rand Corporation researchers found in a study of 305 groups of primary care physicians that the routine use of multifunctional EHRs was more likely to be linked to higher quality care than other common strategies, such as structural changes used for improving care.”

NYSIF: Central NY Contractor Hit with Workers Comp Fraud Charges

“Investigators said Mr. Decker previously had an insurance policy with NYSIF when he operated RD Builders in November 2005, a policy cancelled for non-payment a few months later. In 2008, he applied to NYSIF’s Syracuse office for workers’ compensation insurance doing business as Bull Rock Development, Inc.”

public intelligence: Office of Intelligence and Analysis (DHS)

“These entities are unified under local fusion centers, which provide state and local officials with intelligence products while simultaneously gathering information for federal sources.  As of July 2009, there were 72 designated fusion centers around the country with 36 field representatives deployed. The Department has provided more than $254 million from FY 2004-2007 to state and local governments to support the centers.”

Avoiding False Positives: Analytics or Humans?

Wednesday, October 14th, 2009

By Robert Barker, Infoglide Senior VP & Chief Marketing Officer

The European Union recently started a five-year research program in conjunction with its expanding role in fighting crime and terrorism. The purpose of Project Indect is to develop advanced analytics that help monitor human activity for “automatic detection of threats and abnormal behaviour and violence.”

Naturally, the project has drawn suspicion and criticism, both from those who oppose the growing power of the EU and from watchdog groups concerned about encroachments into privacy and civil liberty:

According to the Open Europe think tank, the increased emphasis on co-operation and sharing intelligence means that European police forces are likely to gain access to sensitive information held by UK police, including the British DNA database. It also expects the number of UK citizens extradited under the controversial European Arrest Warrant to triple. Stephen Booth, an Open Europe analyst who has helped compile a dossier on the European justice agenda, said these developments and projects such as Indect sounded “Orwellian” and raised serious questions about individual liberty.

Shami Chakrabarti of Liberty, a UK human rights group, said, “Profiling whole populations instead of monitoring individual suspects is a sinister step in any society. It’s dangerous enough at [the] national level, but on a Europe-wide scale the idea becomes positively chilling.”

At IdentityResolutionDaily, we’ve consistently supported open and civil discussion about balancing security requirements with individual rights of privacy and liberty (e.g. “Walking the Privacy/Security Tightrope”). We’ve also dealt with the criticality of using analytic technology that minimizes false positives (e.g. “False Positives versus Citizen Profiles”).

Not long ago, James Taylor of Decision Management Solutions made an excellent point about whether using analytic technologies (e.g. identity resolution), rather than relying totally on human judgment, increases or decreases the risk of false positives:

Humans, unlike analytics, are prone to prejudices and personal biases. They judge people too much by how they look (stopping the Indian with a beard for instance) and not enough by behavior (stopping the white guy who is nervously fiddling with his shoes say)… If we bring analytics to bear on a problem the question should be does it eliminate more biases and bad decision making than it creates new false positives… Over and over again studies show analytics do better in this regard… I think analytics are ethically neutral and the risk of something going “to the dark side” is the risk that comes from the people involved, with or without analytics.

We couldn’t have said it better ourselves.

Identity Resolution Daily Links 2009-10-09

Friday, October 9th, 2009

[Post from Infoglide] Privacy – A Dying Concept?

“An intriguing post by Nate Anderson on Ars Technica highlights a difficult reality about today’s easy availability of vast quantities of ‘anonymized’ data. Quoting from a recent paper by Paul Ohm at the University of Colorado Law School, Anderson writes that ‘as Ohm notes, this illustrates a central reality of data collection: data can either be useful or perfectly anonymous but never both.’”

ComputerworldUK: Data quality tools sub-par, says analyst

“A recent study on data quality by the Information Difference revealed that respondents view data quality as something that is not restricted to one area within the organisation. Instead, two-thirds of respondents said it is an issue spanning the entire organisation…Specifically, 81 per cent of respondents reported being focused on a broader scope than merely customer name and address data.”

BeyeNETWORK: Master Data Management and the Challenge of Reality

“One of the central problems of master data management, which is often poorly stated, is the need to determine if one individual thing is the same as another individual thing. But the only way we have to do this is by matching records, and a record is not the same as the thing it represents. Unlike The Matrix, we are more in danger of confounding two ‘realities’ rather than recognizing them as distinct.”

Information Management: Business Intelligence: A Blueprint to Success

“Fraud detection. Claims managers are using predictive analytics to help identify potentially fraudulent claims as early as the first notice of loss, and are analyzing claims costs to get a better handle on negative trends.”

Government Computer News: How entity resolution can help agencies connect the dots in investigations

“Imagine a law-enforcement scenario. A local police department has information on a crime suspect. Court systems, corrections facilities, the department of motor vehicles and even child-support enforcement may also have information on this person of interest, each specific to its own needs and applications. Implementation of an entity-centric environment would enable each of the organizations and systems to continue its operations while also providing the police a much more holistic view of the crime suspect along with potentially important pieces of information.”

Privacy – A Dying Concept?

Wednesday, October 7th, 2009

By Gary Seeger, Infoglide Vice President

An intriguing post by Nate Anderson on Ars Technica highlights a difficult reality about today’s easy availability of vast quantities of “anonymized” data. Quoting from a recent paper by Paul Ohm at the University of Colorado Law School, Anderson writes that “as Ohm notes, this illustrates a central reality of data collection: ‘data can either be useful or perfectly anonymous but never both.’”

A seminal study published in 2000 by Latanya Sweeney at Carnegie Mellon opened the issue by proving that a simple combination of a very small number of publicly available attributes can uniquely identify individuals:

“It was found that 87% (216 million of 248 million) of the population in the United States had reported characteristics that likely made them unique based only on {5-digit ZIP, gender, date of birth}. About half of the U.S. population (132 million of 248 million or 53%) are likely to be uniquely identified by only {place, gender, date of birth}, where place is basically the city, town, or municipality in which the person resides… In general, few characteristics are needed to uniquely identify a person.”
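
The kind of measurement behind those figures can be sketched in a few lines: group records by a chosen set of attributes and count how many people fall into a group of one. The four records below are made up for illustration and have nothing to do with the census data used in the study.

```python
from collections import Counter

# Made-up records, used only to illustrate the calculation.
RECORDS = [
    {"zip": "72204", "gender": "F", "dob": "1970-03-10"},
    {"zip": "72204", "gender": "F", "dob": "1970-03-10"},
    {"zip": "72204", "gender": "M", "dob": "1970-03-10"},
    {"zip": "78701", "gender": "F", "dob": "1982-11-02"},
]


def fraction_unique(records, quasi_identifiers):
    """Fraction of records that no other record matches on the given attributes."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(records)
```

On this toy data, `fraction_unique(RECORDS, ["zip", "gender", "dob"])` is 0.5, while `fraction_unique(RECORDS, ["gender"])` is only 0.25: each added attribute makes more people stand alone.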

Faced with a choice between exploiting easily obtainable data for righteous ends and guarding against the misuse of data that identifies individuals, can privacy legislation strike an appropriate balance? Anderson points out that:

“Because most data privacy laws focus on restricting personally identifiable information (PII), most data privacy laws need to be rethought. And there won’t be any magic bullet; the measures that are taken will increase privacy or reduce the utility of data, but there will be no way to guarantee maximal usefulness and maximal privacy at the same time.”

Looking at the subject from a business perspective, using technologies such as identity resolution to connect non-obvious data relationships serves many initiatives. It would seem admirable to exploit public records and other forms of publicly available information to mitigate risks, uncover fraud, or track down “bad” guys. Yet some cry foul when the technology exposes individuals who didn’t anticipate that their “private” information would be used to identify and/or track them down.

In the rapidly evolving cyber-information age, the desires, conflicts, and limitations of protecting privacy will continue to be sorted out in the legal realm. Those of us who solve business issues using identity resolution technology will swim in this legal quagmire for many years. Finding an appropriate balance between the protection of individual privacy and bona fide business uses of “public” data will almost certainly be a growing challenge to the moral and legal minds of our community.

Identity Resolution Daily Links 2009-10-02

Friday, October 2nd, 2009

[Post from Infoglide] To Move or Not to Move: That is the Question

“A continual theme at IdentityResolutionDaily is maintaining the privacy and confidentiality of data at all times. Two recent posts concerned fusion centers and citizen profiling, but the same issues apply to virtually any application of entity resolution technology. The fact is that, in some cases, anonymous identity resolution is a requirement for more sensitive identity resolution implementations.”

GCN: Entity resolution’s growing role in security efforts

“Research firm and consultancy Gartner has been tracking the entity-resolution market for several years. ‘Entity resolution and analysis was previously an obscure technology that has come to the forefront as a result of world events and market forces where it is used to identify the use of false identities and networks of individuals who are attempting to hide their relationships to each other,’ stated Gartner in ‘Hype Cycle for Master Data Management,’ a report released in June.”

iHealthBeat: Consensus Needed on EHR Access, Privacy Issues, Panelists Say

“Panelists noted that although some patients want the ability to segregate and mask certain sections of their EHRs, physicians are wary of protections that would deny them access to critical patient medical information.”

Security Management: Fusion Centers Forge Ahead

“More than 70 operate at the state, regional, and urban levels. The question eight years after 9-11 is: How well are these centers fulfilling their goals of information collection, analysis, and dissemination—and to the extent that these efforts are falling short, what remains to be done to meet the goals and to ensure the future sustainability of these centers?”

SmartData Collective: Poor Data Quality is a Virus

“Poor data quality is a viral contaminant that will undermine the operational, tactical, and strategic initiatives essential to the enterprise’s mission to survive and thrive in today’s highly competitive and rapidly evolving marketplace. Left untreated or unchecked, this infectious agent will negatively impact the quality of business decisions.”

To Move or Not to Move: That is the Question

Wednesday, September 30th, 2009

By Robert Barker, Infoglide Senior VP & Chief Marketing Officer

A continual theme at IdentityResolutionDaily is maintaining the privacy and confidentiality of data at all times. Two recent posts concerned fusion centers and citizen profiling, but the same issues apply to virtually any application of entity resolution technology. The fact is that, in some cases, anonymous identity resolution is a requirement for more sensitive identity resolution implementations.

The strong emphasis in data management for the last decade or so has been to implement data warehouses, data marts, and master data management. When bundled with associated processes like data extraction, transformation, and cleansing, these methods have been widely accepted as the best approach to solve any data problem. Here at IdentityResolutionDaily, we tend to talk about this over-handling of data as “data deterioration.”

A more basic approach is simply working with data sources undisturbed in their native environments. New principles suggest that you should perform scoring analyses as close to the source as possible. By exploiting existing security layers already in place, the need to add new layers of security is obviated.

Of course, for key sources of operational data, existing IT policies may deny direct access. In other cases, it may be necessary or preferable to move data for other reasons. For example, achieving desired performance parameters may dictate working with an extracted subset of the data rather than the entire data store.

The point I’m making is not to forbid moving data or creating data marts under any circumstances. Rather, I’m suggesting that the most rational approach is the following:

  1. Develop solutions that adapt easily to multiple, disparate, remote data sources.
  2. Default to leaving data where it lives whenever and wherever possible.
  3. Provide the appropriate levels of entity anonymity within the solution and with the least possible intrusion to the enterprise.
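
The three points above can be sketched roughly as follows. Everything here is hypothetical: the `SourceAdapter` class stands in for whatever connects to a remote source, the in-memory rows stand in for a table that stays where it lives, and the truncated SHA-256 token is only a placeholder for the anonymization a real deployment would need, since a plain hash of a guessable key is not strong protection.

```python
import hashlib
from difflib import SequenceMatcher


class SourceAdapter:
    """Hypothetical wrapper around one data source that is left in place."""

    def __init__(self, name, rows):
        self.name = name
        self._rows = rows   # stand-in for a remote table; never copied out

    def score(self, query_name, threshold=0.85):
        """Score rows at the source; return only tokens and scores."""
        hits = []
        for row in self._rows:
            s = SequenceMatcher(
                None, query_name.lower(), row["name"].lower()
            ).ratio()
            if s >= threshold:
                raw_key = f"{self.name}:{row['id']}".encode()
                token = hashlib.sha256(raw_key).hexdigest()[:12]
                hits.append(
                    {"source": self.name, "token": token, "score": round(s, 3)}
                )
        return hits


def federated_search(query_name, adapters):
    """Ask every source to score locally and collect the anonymized hits."""
    return [hit for adapter in adapters for hit in adapter.score(query_name)]


# Two made-up sources with different contents.
ADAPTERS = [
    SourceAdapter("crm", [{"id": 1, "name": "John Kennedy"}]),
    SourceAdapter("billing", [{"id": 7, "name": "Maria Lopez"}]),
]
```

Searching for "Jhon Kennedy" across `ADAPTERS` returns a single hit from the "crm" source, carrying a token and a score but no name or other attribute from the underlying row.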
