Archive for the ‘Data Warehousing’ Category

Privacy – A Dying Concept?

Wednesday, October 7th, 2009

By Gary Seeger, Infoglide Vice President

An intriguing post by Nate Anderson on Ars Technica highlights a difficult reality about today’s easy availability of vast quantities of “anonymized” data. Quoting from a recent paper by Paul Ohm at the University of Colorado Law School, Anderson writes that “as Ohm notes, this illustrates a central reality of data collection: ‘data can either be useful or perfectly anonymous but never both.’”

A seminal study published in 2000 by Latanya Sweeney at Carnegie Mellon opened the issue by proving that a simple combination of a very small number of publicly available attributes can uniquely identify individuals:

“It was found that 87% (216 million of 248 million) of the population in the United States had reported characteristics that likely made them unique based only on {5-digit ZIP, gender, date of birth}. About half of the U.S. population (132 million of 248 million or 53%) are likely to be uniquely identified by only {place, gender, date of birth}, where place is basically the city, town, or municipality in which the person resides… In general, few characteristics are needed to uniquely identify a person.”
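The kind of measurement Sweeney describes can be sketched in a few lines of Python. This is a minimal illustration, not her method or her data: the records are made up, and `unique_fraction` is a hypothetical helper that counts how many records are the only ones with their combination of attributes.

```python
from collections import Counter

# Made-up records for illustration only: (zip_code, gender, date_of_birth, place)
records = [
    ("78701", "F", "1970-03-14", "Austin"),
    ("78701", "F", "1970-03-14", "Austin"),  # same attributes as the record above
    ("78701", "M", "1982-11-02", "Austin"),
    ("78704", "F", "1982-11-02", "Austin"),
    ("78704", "M", "1982-11-02", "Austin"),
]

def unique_fraction(rows, key_indexes):
    """Return the fraction of rows whose values at key_indexes appear exactly once."""
    keys = [tuple(row[i] for i in key_indexes) for row in rows]
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(rows)

# {5-digit ZIP, gender, date of birth}
zip_gender_dob = unique_fraction(records, (0, 1, 2))
# {place, gender, date of birth}: a coarser key than ZIP in this sample
place_gender_dob = unique_fraction(records, (3, 1, 2))

print(f"unique on ZIP + gender + DOB:   {zip_gender_dob:.0%}")
print(f"unique on place + gender + DOB: {place_gender_dob:.0%}")
```

In this toy sample three of five records are unique on the ZIP-based key and one of five on the place-based key. The numbers mean nothing in themselves; the point is only that a handful of ordinary attributes can single out a record.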

Faced with a choice between exploiting easily obtainable data for righteous ends and guarding against its misuse to identify individuals, can privacy legislation strike an appropriate balance? Anderson points out that:

“Because most data privacy laws focus on restricting personally identifiable information (PII), most data privacy laws need to be rethought. And there won’t be any magic bullet; the measures that are taken will increase privacy or reduce the utility of data, but there will be no way to guarantee maximal usefulness and maximal privacy at the same time.”

Looking at the subject from a business perspective, using technologies such as identity resolution to connect non-obvious data relationships serves many initiatives. It would seem admirable to exploit public records and other forms of publicly available information to mitigate risks, uncover fraud, or track down “bad” guys. Yet some cry foul when the technology exposes individuals who didn’t anticipate that their “private” information would be used to identify and/or track them down.

In the rapidly evolving cyber-information age, the desires, conflicts, and limitations of protecting privacy will continue to be sorted out in the legal realm. Those of us who solve business issues using identity resolution technology will swim in this legal quagmire for many years. Finding an appropriate balance between the protection of individual privacy and bona fide business uses of “public” data will almost certainly be a growing challenge to the moral and legal minds of our community.

To Move or Not to Move: That is the Question

Wednesday, September 30th, 2009

By Robert Barker, Infoglide Senior VP & Chief Marketing Officer

A continual theme at IdentityResolutionDaily is maintaining the privacy and confidentiality of data at all times. Two recent posts concerned fusion centers and citizen profiling, but the same issues apply to virtually any application of entity resolution technology. The fact is that, in some cases, anonymous identity resolution is a requirement for more sensitive identity resolution implementations.

The strong emphasis in data management for the last decade or so has been to implement data warehouses, data marts, and master data management. When bundled with associated processes like data extraction, transformation, and cleansing, these methods have been widely accepted as the best approach to solve any data problem. Here at IdentityResolutionDaily, we tend to talk about this over-handling of data as “data deterioration.”

A more basic approach is simply working with data sources undisturbed in their native environments. New principles suggest that you should perform scoring analyses as close to the source as possible. Exploiting the security layers already in place obviates the need to add new ones.

Of course, for key sources of operational data, existing IT policies may deny direct access. In other cases, it may be necessary or preferable to move data for other reasons. For example, achieving desired performance parameters may dictate working with an extracted subset of the data rather than the entire data store.

The point I’m making is not to forbid moving data or creating data marts under any circumstances. Rather, I’m suggesting that the most rational approach is the following:

  1. Develop solutions that adapt easily to multiple, disparate, remote data sources.
  2. Default to leaving data where it lives whenever and wherever possible.
  3. Provide the appropriate levels of entity anonymity within the solution and with the least possible intrusion to the enterprise.

Identity Resolution Daily Links 2009-08-14

Friday, August 14th, 2009

[Post from Infoglide] Vetting Sharks and Whales

“If you’re not in the casino industry, the title of this post may be meaningless, but for casino managers, “sharks” are the bad guys and “whales” are the good guys. Sharks are people who try to defraud the casino through illegal activities, while whales are the high rollers who are apt to win $20,000 one trip and lose $25,000 the next. If there’s any environment where you’d be motivated as a businessperson to know as much as you can about who you’re dealing with, it’s a casino.”

DATAWARE HOUSING: Business Intelligence and Identity Recognition—IBM’s Entity Analytics

“This article will define master data management (MDM) and explain how customer data integration (CDI) fits within MDM’s framework. Additionally, this article will provide an understanding of how MDM and CDI differ from entity analytics, outline their practical uses, and discuss how organizations can leverage their benefits.”

Workers’ Comp Kit Blog: Failure to Pay Workers Compensation Premiums

“A New York asbestos contractor failed to pay $1.6 Million in workers’ compensation premiums and will serve four years in prison. Upon his release he will be deported to his home country as he is an illegal immigrant… He repeatedly changed the name of his company.”

The TSA Blog: Secure Flight Q&A II

“Each one of these layers alone is capable of stopping a terrorist attack. In combination their security value is multiplied, creating a much stronger, formidable system. A terrorist who has to overcome multiple security layers in order to carry out an attack is more likely to be pre-empted, deterred, or to fail during the attempt.”

Identity Resolution Daily Links 2009-07-31

Friday, July 31st, 2009

[Post from Infoglide] Data Finds Data in Real-Time Entity Resolution

“Jeff Jonas of IBM recently quoted from a chapter called “Data Finds Data”  that he co-wrote for a book entitled Beautiful Data: The Stories Behind Elegant Data Solutions, and I was impressed by how well this passage describes the effective use of entity resolution software (e.g., IRE 2.2)…”

IT-Director.com: GRC is not enough

[Philip Howard] “If you think about these different forms of risk, they can mostly be managed within existing GRC frameworks: business risk, data and IT governance and compliance cover five of these seven types of risk. But they don’t cover fraud or cyber attacks or similar security issues.”

SunSentinel.com: Roofer ducked $400,000 in worker’s comp premiums

“Investigators with the state’s Division of Insurance Fraud said Robert McDonald, owner of Gulfstream Roofing Inc., funneled $3 million in payroll through several fake companies between 2002 and 2006, claiming the money was being paid to insured subcontractors instead of his own workers.”

BNET Healthcare: What Can US Learn From European Health IT Experience?

“The three countries also use universal patient identification numbers in health care. This is much easier to do in Europe than it is in the U.S., where the mistrust of government is so high that the issue of having a single patient identifier number is no longer even under discussion. There’s also the small matter of our low EHR adoption rate, which is less than 20 percent for physicians and lower for hospitals. By contrast, most physicians in the three European countries are using some kind of EHR.”

Identity Resolution Daily Links 2009-07-24

Friday, July 24th, 2009

[Post from Infoglide] Entity Resolution as Data Mining

“In my last post, I suggested that entity resolution in the broadest sense (“Big ER”) really encompasses three activities.  The first is locating and collecting entity references from unstructured sources (entity extraction), the second is resolving and merging references to the same entity (“Little ER”), and the third is analyzing associations among entities.  Not every ER process involves all three activities.”

BeyeNETWORK: Some Perspectives on Quality

[Bill Inmon] “There are then very legitimate circumstances where incorrect data is best left in the database or data warehouse. Stated differently, there is no circumstance where correcting data or not correcting data is the right thing to do. In order to determine which approach is proper, the context of the corrections has to be known. Only then can it be determined whether correcting errors is the proper thing to do.”

Homeland Security Watch: How To Improve Homeland Security: Give the ODNI Oversight Responsibility for Fusion Centers

“To me, fusion centers are a fine example of Darwinian logic in homeland security.  There was no comprehensive national plan to create fusion centers.  In original intent, Founding-Fathers-federalism fashion, states and cities decided they were not getting the intelligence they wanted.  Arizona, Georgia, Illinois, New York and a handful of other jurisdictions took responsibility for processing - or “fusing” - their own intelligence.”

ITBusinessEdge: Master Data Management and the CIO’s Strategic Plan

“If we look at MDM as a collection of techniques providing enterprise-wide data requirements analysis and subsequent implementation of best practices in data management, then the savvy IT manager might cherry-pick from the tools offered by vendors to provide the optimal solution that unifies the view of critical data concepts while satisfying the data quality requirements imposed by a horizontal information solution.”

I, Cringely: Medical Records R Us

“So medical records are an area where IT could make us healthier and, if done correctly, ought to save lots of money, too.  What we need is some form of centralized medical record keeping that preserves patient privacy yet, at the same time, keeps us from shopping all over town for bogus Oxycontin prescriptions.”

Identity Resolution Daily Links 2009-07-17

Friday, July 17th, 2009

[Post from Infoglide] iPhones, Identity Resolution, and Cloud Computing

“A personal favorite saying for years has been “invention is the mother of necessity” (a twist on the original saying, of course). It aptly conveys what has driven the high tech industry for the last several decades. Principles like Moore’s Law and its equivalent for the internet have created unanticipated waves of computing and networking power. All that available power has released the combined creativity of tens of thousands of engineers and marketers who dreamed up ways of interacting and managing our lives and businesses that were inconceivable 30 years ago…”

Liliendahl on Data Quality: Match Destinations

“When matching party data – names and addresses – very often it is not just only about hitting similar records, but also about performing some form of transformation with the data before, during and after the hitting.”

Tech Law Notes: Health IT & Open Source

“Repeatedly, I hear the refrain that this stimulus money is going to go to systems that can be put to a “meaningful use,” and that is going to exclude rogue open source Health IT developers from being funded, squelching innovation in the market place.  I imagine that complying with the security regulations under HIPAA probably hinder innovation, too, but they increase the reliability of the system vendors that remain in the market place and reduce the risk to the data of patients that might be in their computer systems.”

The Data Doghouse: People, Process & Politics: Integration Portfolio

“Existing IT projects may be under the label of: Corporate Performance Management (CPM), Master Data Management (MDM), Customer Data Integration (CDI), Product Information Management (PIM), Enterprise Information Management (EIM), Data Warehousing (DW) and Business Intelligence (BI).”

Identity Resolution Daily Links 2009-05-18

Monday, May 18th, 2009

By the Infoglide Team

e-patients.net: Meaningful Use: The Elephant IS In The Room

“A recent NPR/Kaiser Family Foundation poll shows that the American public is surprisingly more positive about the potentials of EHRs than most professionals. People already are familiar with computerized information and accept its risks.”

IT-Director.com: Trends in Master Data Management

“The interesting question is how much pressure this puts on the other MDM players with data quality solutions (like Dataflux and SAP/Business Objects) to build out their data profiling capabilities into the area of data discovery.”


“There are less than 400,000 individuals on the consolidated terrorist watch list and less than 50,000 individuals on the no-fly and selectee lists. Individuals on the no-fly and selectee lists are identified by law enforcement and intelligence partners as legitimate threats to transportation requiring either additional screening or prohibition from boarding an aircraft.”

OCDQ Blog: TDWI World Conference Chicago 2009

“TDWI World Conference Chicago 2009 was held May 3-8 in Chicago, Illinois at the Hyatt Regency Hotel and was a tremendous success.  I attended as a Data Quality Journalist for the International Association for Information and Data Quality (IAIDQ). I used Twitter to provide live reporting from the conference.  Here are my notes from the courses I attended…”

Solving the False Negative Problem

Wednesday, April 22nd, 2009

By John Talburt, PhD, CDMP, Director, UALR Laboratory for Advanced Research in Entity Resolution and Information Quality (ERIQ)

In my March 25, 2009 post “The Myth of Matching,” I discussed the confusion between entity resolution and matching, as in record de-duplication.  Matching is a necessary part of entity resolution, but it is not sufficient.  In particular, I brought up the issue of “false negatives,” cases where records don’t match but are in fact references to the same entity.  I used the example of Mary Doe living on Elm Street who married John Smith living on Pine Street, resulting in two references, “Mary Doe, 234 Elm St” and “Mary Smith, 456 Pine St,” that don’t match but are nevertheless references to the same person.  Let’s discuss a couple of approaches to solving this problem: enlarging the scope of identity attributes and using asserted associations.

The Mary Doe - Mary Smith case might be resolved if the scope of identity attributes were increased, i.e. if additional information such as date of birth, driver’s license, or social security number were available in both records.  But as anyone acquainted with information quality understands, acquiring and maintaining additional information can create as many problems as it solves.  It also brings up a number of questions that the information custodians and collectors must answer.

Is this information available? Is it costly? Is use for this purpose permissible/legal?  Even if expanding the number of identity attributes is an option, it is not necessarily a panacea.  Increasing the number of identity attributes also increases the complexity of the matching.  What if some values are missing?  What if some values agree, but others disagree?
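One way to make those last two questions concrete is a weighted agreement score that skips any attribute missing from either record. The sketch below is an assumption for illustration, not a description of any particular product: the attribute names, the weights, the date of birth, and the `agreement_score` helper are all hypothetical.

```python
# Hypothetical weights: how much each attribute counts toward a match.
WEIGHTS = {
    "first_name": 1.0,
    "last_name": 1.0,
    "address": 1.0,
    "date_of_birth": 3.0,
    "ssn": 5.0,
}

def agreement_score(a, b, weights=WEIGHTS):
    """Score two records in [-1, 1] using only attributes present in both.

    Returns (score, compared), where compared lists the attributes used.
    A score of 0.0 with an empty list means nothing could be compared.
    """
    total = 0.0
    possible = 0.0
    compared = []
    for attr, weight in weights.items():
        value_a, value_b = a.get(attr), b.get(attr)
        if value_a is None or value_b is None:
            continue  # missing value: neither evidence for nor against
        compared.append(attr)
        possible += weight
        total += weight if value_a == value_b else -weight
    score = total / possible if possible else 0.0
    return score, compared

# Made-up records; the date of birth is invented for the example.
mary_doe = {"first_name": "Mary", "last_name": "Doe",
            "address": "234 Elm St", "date_of_birth": "1975-06-01"}
mary_smith = {"first_name": "Mary", "last_name": "Smith",
              "address": "456 Pine St", "date_of_birth": "1975-06-01",
              "ssn": None}

score, compared = agreement_score(mary_doe, mary_smith)
print(round(score, 2), compared)
```

With name and address alone, this pair would score below zero; the agreeing date of birth is what tips it positive. Choosing the weights and deciding what score counts as a match are exactly the added complexity described above.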

A second approach is to collect and use asserted associations.  The fundamental problem is that if Mary Doe and Mary Smith do not share any matching identity attributes, you cannot know that they are the same person without some separately acquired knowledge that they are in fact the same person.  Moreover, because not every Mary Doe is the same person as Mary Smith, you also need additional context such as the address to make the connection clear.  The upshot is that you need to possess the explicit knowledge that “Mary Doe at 234 Elm St is the same person as Mary Smith at 456 Pine St.”
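An asserted association can be stored as an explicit link between two references, independent of whether their attributes match. The following Python sketch uses a simple union-find structure for this; it is one possible representation chosen for illustration, and the function names and the Oak Ave address are made up.

```python
# A reference is a (name, address) pair; the address supplies the context
# that distinguishes one Mary Doe from another.

def find(parent, ref):
    """Return the representative reference for the cluster containing ref."""
    parent.setdefault(ref, ref)
    while parent[ref] != ref:
        parent[ref] = parent[parent[ref]]  # shorten the path as we walk it
        ref = parent[ref]
    return ref

def assert_same(parent, ref_a, ref_b):
    """Record an asserted association: ref_a and ref_b are the same entity."""
    root_a, root_b = find(parent, ref_a), find(parent, ref_b)
    if root_a != root_b:
        parent[root_b] = root_a

def same_entity(parent, ref_a, ref_b):
    """True if the two references have been linked, directly or transitively."""
    return find(parent, ref_a) == find(parent, ref_b)

parent = {}
doe_elm = ("Mary Doe", "234 Elm St")
smith_pine = ("Mary Smith", "456 Pine St")
doe_oak = ("Mary Doe", "789 Oak Ave")  # a different Mary Doe (made-up address)

# The separately acquired knowledge, e.g. from a change-of-address record.
assert_same(parent, doe_elm, smith_pine)

print(same_entity(parent, doe_elm, smith_pine))  # True
print(same_entity(parent, doe_oak, smith_pine))  # False
```

Because the link is keyed on name plus address, the other Mary Doe stays separate even though her name matches.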

If Mary lives in the United States and Mary registers her change of name and address with the US Postal Service, then you might be able to resolve this through the USPS Change of Address file.  Besides the fact that this is only helpful in the US, relying on the USPS COA file has other disadvantages, not the least of which is that Mary may have decided not to register with the USPS.  For this reason, some companies choose to maintain their own knowledge by acquiring information from other public and private sources.

For example in the US, marriage records are publicly available and are a possible source of this associative information.  It may also be true that while Mary didn’t register her change of address with the USPS, she may have wanted to avoid missing any issues of her Modern Square Dancing magazine subscription and promptly registered her change of address with the publisher.  There are potentially many other data sources, such as changes in utility service, cable service, or required licensure notifications.

Even though the application of external association information can alleviate the false negative problem, it comes at a cost.  The collection and maintenance of associative information can be a monumental task for some types of entities. For example, at least 20% of the US population moves each year.  Because it is too large a task for most organizations to take on by themselves, companies that aggregate large amounts of associative data sometimes offer the application of this knowledge as a product.

In the next installment, I will discuss another common confusion, the difference between entity resolution and identity resolution.

Identity Resolution Daily Links 2009-04-21

Tuesday, April 21st, 2009

By the Infoglide Team

Los Angeles Times: L.A. County reserve deputy is accused of fraud at his security firm

“Jane Robison, a spokeswoman for Dist. Atty. Steve Cooley, said the men created a shell company, International Armored Solutions Inc., to hide the true number of employees at the security firm to avoid paying higher workers’ compensation insurance premiums to the State Compensation Insurance Fund.”

ArticleRooms.com: The Benefits of Master Data Management

“Next, Master Data Management can also help prevent fraud. With the passing of Sarbanes-Oxley which holds executives of public companies accountable for their financial statement, these executives have now placed pressure on the organization to get things right.”

Greene County Daily World: Looking back: Area schools safer because of Columbine shooting incident

“Fusion centers are central locations where local, state and federal officials work to receive, integrate and analyze intelligence. The ultimate goal of a fusion center is to provide a mechanism where law enforcement, public safety, and private partners can come together with a common purpose and improve the ability to safeguard our homeland and prevent criminal activity.”

SmartDataCollective: Enterprise Data World 2009

[Jim Harris] “Enterprise Data World is the business world’s most comprehensive vendor-neutral educational event about data and information management.  This year’s program was bigger than ever before, with more sessions, more case studies, and more can’t-miss content.”

All About B2B: PAXLST and CUSRES – How EDI keeps our planes safe from Terrorists

“Through government ownership, the risk of security breaches is minimized and a higher level of consistency can be enforced across airlines.  In the first phase of the program, TSA will perform screening of only US domestic flights.  In future versions of the program, monitoring will expand to include international flights as well.”

TDWI Interview: Identity Resolution Reveals

Wednesday, April 1st, 2009

This post is based on a March 31, 2009 interview with Infoglide Senior Vice President Douglas Wood by Linda L. Briggs for TDWI (The Data Warehousing Institute).

In a comprehensive discussion, Doug Wood of Infoglide Software spoke about an area of confusion that exists when people discuss identity resolution. He pointed out that the term is sometimes misapplied to describe software that performs data matching alone.

[An identity resolution engine is] software that allows organizations to connect disparate data sources in order to understand possible entity matches and non-obvious relationships. It boils down to this: Providing capabilities for organizations to understand “who’s who” and “who knows whom” across multiple data silos. Occasionally, when we introduce the concept of identity resolution technology to a new customer, their immediate response is “I see, but we already have a data matching engine.” The truth is, identity resolution engines have data matching at the core, but [they] provide much more functionality and flexibility than that.

So if identity resolution incorporates data matching, what really differentiates it from data matching products?

Perhaps the key element of an identity resolution engine is the ability to take the entity match and relationship results, then apply domain-specific rules to them. How does the enterprise treat Customer A by virtue of the fact that he or she has some non-obvious relationship with Customer B? The answer to that question is specific to the domain.
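As a rough illustration of that last step, the Python sketch below applies one domain rule to a list of relationship results. Everything in it is hypothetical: the `Relationship` fields, the rule, the threshold, and the flagged-customer set stand in for whatever a given enterprise would define for its own domain.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Relationship:
    customer_a: str
    customer_b: str
    kind: str     # hypothetical label, e.g. "shared_phone"
    score: float  # hypothetical relationship strength, 0.0 to 1.0

def review_rule(rel, flagged_customers, threshold=0.8):
    """Hypothetical domain rule: escalate A when strongly tied to a flagged B."""
    if rel.customer_b in flagged_customers and rel.score >= threshold:
        return f"review {rel.customer_a}: {rel.kind} with {rel.customer_b}"
    return None

# Made-up relationship results, as an engine might hand them to the rule layer.
relationships = [
    Relationship("Customer A", "Customer B", "shared_phone", 0.92),
    Relationship("Customer C", "Customer B", "shared_phone", 0.41),
    Relationship("Customer A", "Customer D", "shared_address", 0.95),
]
flagged = {"Customer B"}

results = [review_rule(rel, flagged) for rel in relationships]
actions = [action for action in results if action is not None]
print(actions)
```

Only the first relationship triggers an action: the second is too weak, and the third involves no flagged customer. A different domain would swap in different rules while leaving the relationship results untouched.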

Most of us are familiar with the use of data matching for removing duplicates from databases and in cleansing input for data warehouses and master data management, but identity resolution is used in unique ways.

With identity resolution technology, data isn’t subjected to deterioration processes such as cleansing or record merging. Rather, the data can remain in its original state and in its original location.

What this means is that identity resolution is especially adept at uncovering risk, fraud, and conflicts of interest since the “forensic value of individual records is preserved for ongoing analysis.” While government is a key market for identity resolution engines, a growing number of commercial applications are emerging in financial services (e.g., PATRIOT Act compliance, detection of loan fraud and credit card fraud), retail (e.g., uncovering organized retail crime activities by comparing shoplifters with employees or frequent merchandise returners), workers’ compensation (e.g., finding employers that change attributes to avoid paying premiums), lottery corporations (e.g., comparing winning ticket holders against lottery retail employees to discover potential fraud), and customer relationship management (e.g., keeping two similar records unmerged until a birth date attribute uncovers a father/son relationship).

What prevents traditional data matching products from addressing these problems?

There are no problems with matching engines per se. They just aren’t identity resolution engines. Matching engines typically use one or two algorithms, perhaps mathematical in nature, that examine structured data looking for name matches. They provide an improvement over Soundex, but little more. What they do is say “Yes, this is a match” or “No, this is not a match.”

Common requirements for solving identity resolution problems include connecting to more diverse data types and formats, handling higher volumes, and delivering deeper insight into non-obvious relationships from existing uncleansed data sources.

If identity resolution engines have existed for years, what has caused the recent rush to employ them?

As part of the 9/11 Commission recommendations, the Department of Homeland Security began a robust search for identity resolution technology that could keep terrorists off airplanes by comparing passenger attributes against terrorist watch lists, no-fly lists, and other types of threat-related data. Through a few iterations, the selection and implementation process evolved into the Transportation Security Administration’s (TSA) Secure Flight Program, which ultimately thrust identity resolution technology into the limelight. Secure Flight is now widely recognized as the pre-eminent use case for identity resolution technology requirements today, and a number of companies are involved in delivering that solution.
