Archive for the ‘Data Matching’ Category

Privacy – A Dying Concept?

Wednesday, October 7th, 2009

By Gary Seeger, Infoglide Vice President

An intriguing post by Nate Anderson on Ars Technica highlights a difficult reality about today’s easy availability of vast quantities of “anonymized” data. Quoting from a recent paper by Paul Ohm at the University of Colorado Law School, Anderson writes that “as Ohm notes, this illustrates a central reality of data collection: ‘data can either be useful or perfectly anonymous but never both.’”

A seminal study published in 2000 by Latanya Sweeney at Carnegie Mellon opened the issue by proving that a simple combination of a very small number of publicly available attributes can uniquely identify individuals:

“It was found that 87% (216 million of 248 million) of the population in the United States had reported characteristics that likely made them unique based only on {5-digit ZIP, gender, date of birth}. About half of the U.S. population (132 million of 248 million or 53%) are likely to be uniquely identified by only {place, gender, date of birth}, where place is basically the city, town, or municipality in which the person resides… In general, few characteristics are needed to uniquely identify a person.”
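The kind of uniqueness test the study describes can be sketched in a few lines. The records below are invented purely for illustration (they are not from the study), and the function name `unique_fraction` is an assumption of this sketch:

```python
from collections import Counter

# Hypothetical toy records; the values are made up for illustration only.
records = [
    {"zip": "78701", "gender": "F", "dob": "1980-12-03"},
    {"zip": "78701", "gender": "F", "dob": "1980-12-03"},
    {"zip": "78701", "gender": "M", "dob": "1975-01-15"},
    {"zip": "78704", "gender": "F", "dob": "1962-07-30"},
    {"zip": "78704", "gender": "M", "dob": "1990-04-22"},
]

def unique_fraction(rows, fields):
    """Fraction of rows whose combination of the given fields occurs exactly once."""
    keys = [tuple(row[f] for f in fields) for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(keys)

print(unique_fraction(records, ["gender"]))                # 0.0
print(unique_fraction(records, ["zip", "gender", "dob"]))  # 0.6
```

No single attribute singles anyone out in this toy list, but the three attributes together isolate three of the five records.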

Faced with a choice between exploiting easily obtainable data for righteous ends and the potential misuse of that data to identify individuals, can an appropriate balance be struck by privacy legislation? Anderson points out that:

“Because most data privacy laws focus on restricting personally identifiable information (PII), most data privacy laws need to be rethought. And there won’t be any magic bullet; the measures that are taken will increase privacy or reduce the utility of data, but there will be no way to guarantee maximal usefulness and maximal privacy at the same time.”

Looking at the subject from a business perspective, using technologies such as identity resolution to connect non-obvious data relationships serves many initiatives. It would seem admirable to exploit public records and other forms of publicly available information to mitigate risks, uncover fraud, or track down “bad” guys. Yet some cry foul when the technology exposes individuals who didn’t anticipate that their “private” information would be used to identify and/or track them down.

In the rapidly evolving cyber-information age, the desires, conflicts, and limitations of protecting privacy will continue to be sorted out in the legal realm. Those of us who solve business issues using identity resolution technology will swim in this legal quagmire for many years. Finding an appropriate balance between the protection of individual privacy and bona fide business uses of “public” data will almost certainly be a growing challenge to the moral and legal minds of our community.

Identity Resolution Daily Links 2009-09-18

Friday, September 18th, 2009

[Post from Infoglide] Metrics for Entity Resolution

“In the last post I discussed the concepts of internal and external views of identity.  The fact that we can have different views of the same identity then raises the question of how to go about comparing different views.  What complicates this issue is that, even though we can talk about resolving references in pairs (i.e. linking two records if they refer to the same entity), the total number of references can be quite large, and consequently, there are many possible pair-wise combinations to consider.”

FederalComputerWeek: DOD opens some classified information to non-federal officials

“The non-federal officials will get access via the Homeland Security department’s secret-level Homeland Security Data Network. That network is currently deployed at 27 of the more than 70 fusion centers located around the country, according to DHS.”

Gerson Lehrman Group: Stylish Master Data Management

“In my experience, one of these styles is nigh-on impractical.  ‘Centralized’ (also called ‘transaction’) implies a wrenching architectural shift whereby a master data hub becomes the one and only source of master data for an enterprise, replacing the functionality of generating master data in existing transaction systems, and serving up ‘golden copy’ data to other systems, perhaps via an enterprise service bus architecture. This sounds elegant, but is extremely invasive.”

INFORMATICA PERSPECTIVES: Get To “Meaningful Use” Faster

“Identity resolution’s goal is to find the right person at the right time, regardless of the potential for error and variation in what information is available at the time of request.  This could be during patient registration and admission, patient transfers or referrals, emergency room visits, and simply sharing information across providers or insurers.  The ability to do this effectively must become the most basic and core function.”


Metrics for Entity Resolution

Thursday, September 17th, 2009

By John Talburt, PhD, CDMP, Director, UALR Laboratory for Advanced Research in Entity Resolution and Information Quality (ERIQ)

In the last post I discussed the concepts of internal and external views of identity.  The fact that we can have different views of the same identity then raises the question of how to go about comparing different views.  What complicates this issue is that, even though we can talk about resolving references in pairs (i.e. linking two records if they refer to the same entity), the total number of references can be quite large, and consequently, there are many possible pair-wise combinations to consider.

The number of pair combinations grows quadratically as the size of the list grows linearly.  The number of pairs of distinct items from a list of N items is N*(N-1)/2.  So in a list of 10 items, there are 45 possible pair combinations – not so bad.
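As a quick check of the pair-count formula, here is a minimal sketch that compares the closed form against explicit enumeration. The function name `pair_count` is an assumption of this sketch:

```python
from itertools import combinations

def pair_count(n: int) -> int:
    """Number of unordered pairs of distinct items in a list of n items."""
    return n * (n - 1) // 2

# Cross-check the closed form against explicit enumeration for a 10-item list.
records = list("abcdefghij")
enumerated = sum(1 for _ in combinations(records, 2))

print(pair_count(10), enumerated)  # 45 45
print(pair_count(1_000_000))       # 499999500000
```

The second figure shows why pair-wise comparison becomes expensive quickly: a million references already imply roughly half a trillion candidate pairs.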

But now consider the issue of how many different ways these 45 links (comparisons) among the 10 references could be labeled as true (the two references are to the same entity) or false (the two references are to different entities) and at the same time make sense as links.  By making sense, I mean that if we were to label the link between references 1 and 2 as true, and the link between references 2 and 3 as true, then we would also have to label the link between references 1 and 3 as true.  Even in light of this condition, it still turns out that there are 115,975 ways that a set of 10 references could be linked together.
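Labelings that respect this transitivity condition correspond to partitions of the set of references, and the number of partitions of an N-element set is the Bell number. The sketch below computes it with the Bell triangle; the function name `bell_number` is an assumption of this sketch:

```python
def bell_number(n: int) -> int:
    """Count the partitions of an n-element set using the Bell triangle."""
    row = [1]  # triangle row whose first entry is the count for the empty set
    for _ in range(n):
        # Each new row starts with the last entry of the previous row;
        # every later entry is its left neighbour plus the entry above that neighbour.
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[0]

print(bell_number(10))  # 115975
```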

To continue the example, suppose that the 10 records are represented by the first 10 letters of the alphabet {a, b, c, d, e, f, g, h, i, j}.  One of the 115,975 ways they could be linked together would be {a, b, c} all linked as belonging to the same entity, {d, e} together, {f} by itself, and {g, h, i, j} together.  Another way is {a, c, d} together, {b, f} together, {e} by itself, {g, h, i} together, and {j} by itself.  So how similar is the first way of linking these 10 references to the second way of linking them?

This is not a new problem, and there are many ways to approach it.  The problem is easier to visualize if we build an “intersection matrix”.  The matrix is simply a table that lists one set of groupings as row labels and the other set as column labels.  The cell at a particular row and column is the size of the intersection between the groupings.  Here is the intersection matrix for the example just given:

               {a, c, d}   {b, f}   {e}   {g, h, i}   {j}
{a, b, c}          2          1      0        0        0
{d, e}             1          0      1        0        0
{f}                0          1      0        0        0
{g, h, i, j}       0          0      0        3        1

The number 2 in the cell at the second row and second column of the table (counting the label row and label column) indicates that the first grouping in the first way the references were linked and the first grouping in the second way the references were linked share 2 elements, “a” and “c”.
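The intersection matrix for the two example linkages can be built directly from its definition. This is a minimal sketch; the variable and function names are assumptions, and the groupings are the two given in the example:

```python
# The two example linkages, each a list of groupings (sets of references).
first = [set("abc"), set("de"), set("f"), set("ghij")]
second = [set("acd"), set("bf"), set("e"), set("ghi"), set("j")]

def intersection_matrix(rows, cols):
    """Cell [i][j] is the number of references shared by rows[i] and cols[j]."""
    return [[len(r & c) for c in cols] for r in rows]

matrix = intersection_matrix(first, second)
for counts in matrix:
    print(counts)
# [2, 1, 0, 0, 0]
# [1, 0, 1, 0, 0]
# [0, 1, 0, 0, 0]
# [0, 0, 0, 3, 1]
```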

In statistics, this is called cluster analysis, and there are several methods for comparing these clusters.  Most notable is the Rand Index, which has a value from 0 to 1, with values closer to zero indicating less similarity and values closer to one indicating more similarity.  The value is equal to 1 only when the two sets of groupings are identical.

The calculation of the Rand Index takes a bit of explaining, but basically it involves counting the pair-wise links in the various cells shown in the table above.  In this example, the value of the Rand Index turns out to be 0.800, or 80% similar.
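For readers who want to verify the figure, the sketch below uses the standard pair-counting definition of the Rand Index: the fraction of reference pairs on which the two linkages agree, either both linking the pair or both keeping it apart. It loops over pairs directly rather than working from the table cells, and the function name `rand_index` is an assumption of this sketch:

```python
from itertools import combinations

first = [set("abc"), set("de"), set("f"), set("ghij")]
second = [set("acd"), set("bf"), set("e"), set("ghi"), set("j")]

def rand_index(linkage_a, linkage_b):
    """Fraction of reference pairs on which the two linkages agree."""
    # Map every reference to the index of the grouping that contains it.
    label_a = {ref: i for i, group in enumerate(linkage_a) for ref in group}
    label_b = {ref: i for i, group in enumerate(linkage_b) for ref in group}
    agree = total = 0
    for x, y in combinations(sorted(label_a), 2):
        linked_a = label_a[x] == label_a[y]
        linked_b = label_b[x] == label_b[y]
        agree += linked_a == linked_b
        total += 1
    return agree / total

print(f"{rand_index(first, second):.3f}")  # 0.800
```

Of the 45 pairs, 4 are linked in both linkages and 32 are separated in both, giving 36/45 = 0.800.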

A few years ago, Dr. Rich Wang, Director of the MIT Information Quality Program, and I wanted a simpler similarity index that could be used as a quick way to assess entity resolution results.  The method we developed is much simpler to calculate, in that it does not involve the formula for combinations.  The key values for calculating our index are just the number of groupings and the number of overlaps between those groupings.  The formula is as follows:

TW = SQRT(|A| x |B|) / |C|

Where
|A| represents the number of groupings in the first linkage (number of rows in the table)
|B| represents the number of groupings in the second linkage (number of columns)
|C| represents the number of overlaps between the groupings (number of cells > 0)

For the example given in the table the value is TW = SQRT(4 x 5)/7 = 0.639.
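The calculation can be sketched as follows. The formula in the code, the square root of the product of the two grouping counts divided by the number of overlaps, is inferred from the worked value above; the function name `tw_index` is an assumption of this sketch:

```python
from math import sqrt

first = [set("abc"), set("de"), set("f"), set("ghij")]
second = [set("acd"), set("bf"), set("e"), set("ghi"), set("j")]

def tw_index(linkage_a, linkage_b):
    """sqrt(|A| * |B|) / |C|, where |C| counts the non-empty overlaps."""
    overlaps = sum(1 for a in linkage_a for b in linkage_b if a & b)
    return sqrt(len(linkage_a) * len(linkage_b)) / overlaps

print(f"{tw_index(first, second):.3f}")  # 0.639
print(tw_index(first, first))            # 1.0
```

When the two linkages are identical, every grouping overlaps exactly one grouping in the other linkage, so the index equals 1.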

According to our index, the two groupings are only about 64% similar.  In the next post I will discuss the application of our index and other metrics that can be used to assess entity resolution outcomes.

Identity Resolution Daily Links 2009-09-12

Saturday, September 12th, 2009

[Post from Infoglide] False Positives versus Citizen Profiles

“A post from Steve Bennett in Australia refers to an announcement by the Dutch government about their intent to prevent crime by profiling their citizens. By creating a digital profile of each citizen using banking, flight, and internet usage information, their justice department plans to compare citizen profiles with those of convicted criminals, then let law enforcement authorities know when matches are found. Needless to say, the move has created quite a bit of discussion in the Netherlands.”

MAINJUSTICE: Rival Agencies Agree to Halt Turf Battles

[quick registration] “‘By bringing together the agencies and personnel with existing resources and expertise we can work more effectively as partners to shut down organized crime networks, seize assets and save taxpayer dollars in the process,’ said Deputy Attorney General David Ogden in [a] statement announcing the partnership.”

HealthData Management: Assessing Demand for EHRs

“On Aug. 20, David Blumenthal, M.D., national coordinator for health information technology, predicted that the final definition of the “meaningful use” of electronic health records that will be used to determine eligibility for incentive payments under the economic stimulus program will not be available until the middle or end of spring 2010.”

South Florida Business Journal: NICB: Suspicious insurance claims up

“The number of suspicious insurance claims rose to 41,619 in the first half of the year, up from 36,743 in the prior-year period, according to a review of insurance claims referred to the National Insurance Crime Bureau.”

False Positives versus Citizen Profiles

Wednesday, September 9th, 2009

By Mike Shultz, Infoglide Software CEO

A post from Steve Bennett in Australia refers to an announcement by the Dutch government about their intent to prevent crime by profiling their citizens. By creating a digital profile of each citizen using banking, flight, and internet usage information, their justice department plans to compare citizen profiles with those of convicted criminals, then let law enforcement authorities know when matches are found. Needless to say, the move has created quite a bit of discussion in the Netherlands.

In no way would such a move fly in the United States. Since the nation’s founding, our citizens have consistently shown a distrust of government that has limited government control over basic freedoms. While some would argue that the U.S. government has gained too much control over the years, that healthy distrust has definitely limited government intrusion into our personal freedoms.

In contrast to the broad approach proposed by the Dutch Minister of Justice, systems using entity resolution can avoid the “boil the ocean” approach. You can target specific data sources that hold relevant information, and then compare the bare minimum of attributes needed to discover hidden relationships, all without creating and storing profiles on millions of non-criminal citizens.

With such a system, can false positives occur? Yes, but the technology has become so sophisticated that the chance of a false positive is minuscule. The judgment to be made is whether the number of false positives outweighs the increased level of security afforded the public.

No doubt, the lively discussion between those concerned about invasion of privacy and those focused on keeping the populace safe will continue. And that’s how it should be.

Identity Resolution Daily Links 2009-09-04

Friday, September 4th, 2009

[Post from Infoglide] Shell Games

“We’ve talked before about how some employers will dissolve a company, then re-form it with the same people but under a new name. The objective? Reduce payments to workers’ compensation programs, where premiums are based on the historical level of claims. Erase the history by forming a new company, and voila! Your premiums are now lower, but there’s a catch – doing that constitutes fraud and it’s illegal.”

OCDQ Blog: To Parse or Not To Parse

[Jim Harris] “Data matching often uses data standardization to prepare its input.  This allows for more direct and reliable comparisons of parsed sub-fields with standardized values, decreases the failure to match records because of data variations, and increases the probability of effective match results.”

kpvi.com: Idaho Falls Woman Arrested in Undercover Lottery Sting

“An undercover operative gave McKelley a decoy lottery ticket that McKelley thought was worth at least $100,000. She kept the ticket and took [it] to Boise to the Idaho Lottery Headquarters to collect. Police arrested her when she showed up and she now faces felony charges of presenting an illegally obtained lottery ticket.”

Initiate Blog: Data Hubs: Master Data Repository or Master Data Service?

“The current reality is that the concept of a data hub includes a much more active approach to data than just storage of a “golden record”. The data hub makes the best decisions on entity and relationship resolution by arbitrating the content of data in the source systems where the master data is created.”

Identity Resolution Daily Links 2009-08-28

Friday, August 28th, 2009

[Post from Infoglide] Internal and External Views of Identity

“In an earlier post, I stated my view that identity resolution and entity resolution are somewhat different processes.  In particular, I consider identity resolution as a special form of entity resolution in which entity references are resolved by comparing them to the characteristics of a given set of known entities.  Regardless of the approach, identity plays an important role in all forms of entity resolution.”

Modern Medicine: State privacy laws deter EHR adoption in hospitals

“The study looked at 19 states, and shows that states that have enacted medical privacy laws restricting hospitals from disclosing patient information have seen a 24 percent reduction in EHR adoption over a ten-year period, while states without these regulations experienced a 21 percent gain in hospital EHR adoption.”

Workers’ Comp Kit Blog: CALIFORNIA Millions of Dollars Medical Insurance Fraud Scheme

“The defendants  in the outpatient surgery center were accused of participating in a $154 million medical insurance fraud scheme by recruiting 2,841 healthy people nationwide and bribing them with money or low cost cosmetic surgery, to receive unnecessary and dangerous surgeries and submitting fraudulent claims to medical insurance companies.”

OCDQ Blog: Adventures in Data Profiling (Part 4)

“In Part 4, you will continue your adventures in data profiling by going postal…postal address that is, by first analyzing the following fields: City Name, State Abbreviation, Zip Code, and Country Code.”

Internal and External Views of Identity

Thursday, August 27th, 2009

By John Talburt, PhD, CDMP, Director, UALR Laboratory for Advanced Research in Entity Resolution and Information Quality (ERIQ)

In an earlier post, I stated my view that identity resolution and entity resolution are somewhat different processes.  In particular, I consider identity resolution as a special form of entity resolution in which entity references are resolved by comparing them to the characteristics of a given set of known entities.  Regardless of the approach, identity plays an important role in all forms of entity resolution.

The identity of an entity is a set of attributes and rules for comparing the attribute values that allow it to be distinguished from all other entities of the same type in a given context.  A key feature is that identity is context-dependent, i.e., it depends upon the total set of entities under consideration.  For example, a common scheme for creating email addresses in an organization uses a person’s first two initials and last name, e.g. jrtalburt.  In a small organization, this is usually sufficient to make a unique address for each employee.  However, applying this scheme to a much larger pool of users, such as the yahoo.com or gmail.com domains, quickly shows that these attributes are insufficient to distinguish one user from another.
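A small sketch makes the context dependence concrete. All of the names below are hypothetical, chosen only so that the scheme produces addresses like the jrtalburt example; `address_key` and `collisions` are names assumed for this sketch:

```python
from collections import Counter

def address_key(first, middle, last):
    """Address scheme from the text: first two initials plus last name."""
    return (first[0] + middle[0] + last).lower()

# Hypothetical people, for illustration only.
small_org = [("John", "R", "Talburt"), ("Mary", "A", "Smith"), ("Robert", "L", "Jones")]
large_pool = small_org + [("James", "R", "Talburt"), ("Michael", "A", "Smith")]

def collisions(people):
    """Addresses that the scheme would assign to more than one person."""
    counts = Counter(address_key(*person) for person in people)
    return {key: n for key, n in counts.items() if n > 1}

print(collisions(small_org))   # {}
print(collisions(large_pool))  # {'jrtalburt': 2, 'masmith': 2}
```

The same attributes that identify everyone in the small organization fail once the set of entities under consideration grows.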

For a more relevant business example, consider the case of a customer, Mary Smith.  For simplicity, assume that the totality of her adult residential address history comprises:
1.    Mary Smith, 123 Oak St, Anytown, NY, 1998-06 to 2000-03
2.    Mary Jones, 234 Elm St, Anytown, NY, 2000-04 to 2002-11
3.    Mary Jones, 345 Pine St, Anytown, NY, 2002-12 to present

Despite having used 2 names and 3 addresses, these are all references to the same person. There are two ways to view the issue of identity as illustrated by this history.

One is to start with the identity based on vital statistics, e.g. Mary Smith, a female born on December 3, 1980, in Anytown, NY, to parents Robert and Susan Smith, then to follow that identity through its various representations of name and address as shown above.  This “internal view of identity” is the view of Mary Smith herself and might well be the view of a sibling or other close relative, someone with complete knowledge about her address history.  The internal view of identity represents a closed universe model in which all of the possible occupancy variants are known to the internal viewer (system) and any occupancy record not equivalent to one of the known variants must belong to some other identity.

On the other hand, an external view of identity is one in which some number of address records for a customer’s identity have been linked, but the viewer (system) does not know if it is the complete history.  Given another customer address record not equivalent to one of the records in the history, it must be determined if it does or does not belong to Mary’s history.

Suppose that a system has only the first two address records of Mary’s history.  In this case, the system’s knowledge of Mary’s identity would be incomplete.  It may be incomplete either because the third address record is not in the system (has not been acquired) or because the system hasn’t linked it to the first two records.  In the latter case, the system would assume that the third record is part of a different customer’s identity.  Even though an internal viewer would know that the third address record should also be part of Mary’s complete history, the external viewer has not made that determination.

Conversely, an external viewer may assemble an inaccurate view of Mary’s history by linking the first two records of her address history to an address for a different Mary Smith.  These entity resolution failures, incomplete and inaccurate histories, are information quality dimensions and indicate why the areas of entity resolution and information quality are so closely related. (Several classes of failures were discussed in another recent post.)

In an external view, the identity of the customer is equivalent to the set of occupancy records that have been resolved (i.e. linked).  The known address records comprise the external viewer’s (or system’s) entire knowledge of the customer’s identity.  If additional occupancy records are acquired and are correctly determined to be for this same customer, then the system’s knowledge about this identity increases.

The external view of identity reflects the experience of a business or government agency using entity resolution tools and processes in an effort to link disparate records into a single view of a customer or agency client.  The “external view of identity” represents an open universe model because if the system is presented with a new occupancy record, it does not necessarily follow that the new record must be a part of a different identity.  It may or may not be part of an existing identity, something that the ER process must decide.

The major point to note is that an internal viewer is in a position to judge the quality of an external view.  With complete knowledge, the internal viewer can determine if any particular external viewer has omitted some records (completeness) or has linked records from different identities or failed to link records for the same identity (accuracy).
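As a rough illustration of that judgment, the sketch below treats each view as a set of record numbers. Records 1 to 3 are Mary’s three address records listed earlier; record 4 is a hypothetical address record for a different Mary Smith, introduced only for this example:

```python
# Internal view: Mary's complete, known-correct address history.
internal_view = {1, 2, 3}

# External view (hypothetical): record 3 was never linked, and record 4,
# which belongs to a different Mary Smith, was linked by mistake.
external_view = {1, 2, 4}

omitted = internal_view - external_view         # completeness failures
wrongly_linked = external_view - internal_view  # accuracy failures

print(omitted)         # {3}
print(wrongly_linked)  # {4}
```

Only a viewer holding the internal view can compute either difference, which is why the internal view serves as the reference for quality assessment.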

Along with Dr. Wang at MIT, I have introduced a quality metric in the form of an index for assessing the similarity of two identity resolutions.  In cases where one resolution represents an internal view (correct) and the other is an external view, the index provides a metric for entity resolution accuracy. I plan to explain this metric in my next post.

Walking the Privacy/Security Tightrope

Wednesday, August 19th, 2009

By Mike Shultz, Infoglide Software CEO

In a post last April, we talked about the privacy/security balance issue for fusion centers and for vendors with supporting technology. Now an article in the Austin Sunday paper about a proposed fusion center again highlights the tension between security and privacy. Each time a fusion center is proposed, the story goes like this:

“Local law enforcement officials see benefit of two-way information sharing with other local, state, and national agencies… privacy groups are concerned about unnecessary intrusions into personal information.”

As of July 2009, 72 such centers have been put in place and are operational across the country. The Department of Homeland Security (DHS), in conjunction with the Justice Department, has tried to address the need for consistent operating principles. Starting in 2005, they published and continue to maintain a set of guidelines suggesting how to establish collaboration and data sharing between agencies while protecting the privacy and civil liberties of citizens.

It would be nice to report that every fusion center has performed flawlessly in solving crimes while preserving American freedoms. Given that they are run by human beings, execution at every center hasn’t always fallen within the guidelines. There are instances where the centers have been ineffective, and there are instances where controversial privacy issues have been raised when centers overstepped their bounds.

The Austin American-Statesman article presented a balanced view of the issues surrounding fusion centers without sensationalizing them. It discussed controversies surrounding fusion centers, yet also gave examples of the benefits of existing centers.

As Jack Thomas Tomarchio, former deputy undersecretary for intelligence and analysis operations at DHS, was quoted as saying, “These things are brand new. They haven’t been around 20 years, and even the ones that have been around three or four years are still in their formative years. In many cases, they don’t have a track record.”

While existing software technology addresses both privacy and security issues, the ultimate decision to use it wisely falls to the people who run the fusion centers. In the City of Austin case, the concerns of privacy and security seem to be receiving equal consideration so that the best results can be achieved without trampling on civil liberties.

Identity Resolution Daily Links 2009-08-14

Friday, August 14th, 2009

[Post from Infoglide] Vetting Sharks and Whales

“If you’re not in the casino industry, the title of this post may be meaningless, but for casino managers, “sharks” are the bad guys and “whales” are the good guys. Sharks are people who try to defraud the casino through illegal activities, while whales are the high rollers who are apt to win $20,000 one trip and lose $25,000 the next. If there’s any environment where you’d be motivated as a businessperson to know as much as you can about who you’re dealing with, it’s a casino.”

DATAWARE HOUSING: Business Intelligence and Identity Recognition—IBM’s Entity Analytics

“This article will define master data management (MDM) and explain how customer data integration (CDI) fits within MDM’s framework. Additionally, this article will provide an understanding of how MDM and CDI differ from entity analytics, outline their practical uses, and discuss how organizations can leverage their benefits.”

Workers’ Comp Kit Blog: Failure to Pay Workers Compensation Premiums

“A New York asbestos  contractor failed to pay $1.6 Million in workers’ compensation premiums and will serve four years in prison. Upon his release he will be deported to his home country as he is an illegal immigrant… He repeatedly changed the name of his company.”

The TSA Blog: Secure Flight Q&A II

“Each one of these layers alone is capable of stopping a terrorist attack. In combination their security value is multiplied, creating a much stronger, formidable system. A terrorist who has to overcome multiple security layers in order to carry out an attack is more likely to be pre-empted, deterred, or to fail during the attempt.”

