
Archive for the ‘Data-Mining’ Category

Solving the False Negative Problem

Wednesday, April 22nd, 2009

By John Talburt, PhD, CDMP, Director, UALR Laboratory for Advanced Research in Entity Resolution and Information Quality (ERIQ)

In my March 25, 2009 post “The Myth of Matching,” I discussed the confusion between entity resolution and matching as in record de-duplication.  Matching is a necessary part of entity resolution, but it is not sufficient.  In particular I brought up the issue of “false negatives,” cases where records don’t match, but are in fact references to the same entity.  I used the example of Mary Doe living on Elm Street who married John Smith living on Pine Street, resulting in two references “Mary Doe, 234 Elm St” and “Mary Smith, 456 Pine St” that don’t match, but are nevertheless references to the same person.  Let’s discuss a couple of approaches to solving this problem: enlarging the scope of identity attributes and utilizing asserted associations.

The Mary Doe - Mary Smith case might be resolved if the scope of identity attributes were increased, i.e. if additional information such as date of birth, driver’s license number, or social security number were available in both records.  But as anyone acquainted with information quality understands, acquiring and maintaining additional information can create as many problems as it solves.  It also brings up a number of questions that the information custodians and collectors must answer.

Is this information available? Is it costly? Is use for this purpose permissible/legal?  Even if expanding the number of identity attributes is an option, it is not necessarily a panacea.  Increasing the number of identity attributes also increases the complexity of the matching.  What if some values are missing?  What if some values agree, but others disagree?
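The last two questions can be made concrete with a minimal sketch. It assumes records are simple dictionaries and that a missing value is stored as None; the attribute names and the sample values for date of birth and driver's license are hypothetical, invented only for illustration.

```python
def compare_attributes(rec_a, rec_b, attributes):
    """Sort identity attributes into agreeing, disagreeing, and missing."""
    outcome = {"agree": [], "disagree": [], "missing": []}
    for name in attributes:
        value_a, value_b = rec_a.get(name), rec_b.get(name)
        if value_a is None or value_b is None:
            # A missing value is neither evidence for nor against a match.
            outcome["missing"].append(name)
        elif value_a == value_b:
            outcome["agree"].append(name)
        else:
            outcome["disagree"].append(name)
    return outcome


# Hypothetical records: the dates and license number are made up.
mary_doe = {
    "first_name": "Mary", "last_name": "Doe", "street": "234 Elm St",
    "date_of_birth": "1975-03-14", "drivers_license": "D1234567",
}
mary_smith = {
    "first_name": "Mary", "last_name": "Smith", "street": "456 Pine St",
    "date_of_birth": "1975-03-14", "drivers_license": None,
}

attributes = ["first_name", "last_name", "street", "date_of_birth", "drivers_license"]
outcome = compare_attributes(mary_doe, mary_smith, attributes)
print(outcome)
```

Even with the extra attributes, the sketch only reports the pattern of agreement. Deciding whether two agreements, two disagreements, and one missing value amount to a match is still a rule someone has to design.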

A second approach is to collect and use asserted associations.  The fundamental problem is that if Mary Doe and Mary Smith do not share any matching identity attributes, you cannot know that they are the same person without some separately acquired knowledge that they are in fact the same person.  Moreover, because not all Mary Does are the same person as Mary Smith, you also need additional context such as the address to make the connection clear.  The upshot is that you need to possess the explicit knowledge that “Mary Doe at 234 Elm St is the same person as Mary Smith at 456 Pine St.”
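One way to picture an asserted association is as a stored link between two references, each identified by name plus address. The sketch below is an assumption about how such knowledge could be held, using a small union-find structure; the class name ReferenceLinker and its methods are hypothetical and do not describe any particular product.

```python
def normalize(name, address):
    """Build a comparable key from a name and its address context."""
    return (name.strip().lower(), address.strip().lower())


class ReferenceLinker:
    """Holds asserted 'same entity' links between references (union-find)."""

    def __init__(self):
        self._parent = {}

    def _find(self, key):
        # Register unseen references as their own group, then walk to the root.
        self._parent.setdefault(key, key)
        while self._parent[key] != key:
            self._parent[key] = self._parent[self._parent[key]]  # path halving
            key = self._parent[key]
        return key

    def assert_same(self, ref_a, ref_b):
        """Record externally acquired knowledge that two references co-refer."""
        root_a = self._find(normalize(*ref_a))
        root_b = self._find(normalize(*ref_b))
        if root_a != root_b:
            self._parent[root_b] = root_a

    def same_entity(self, ref_a, ref_b):
        return self._find(normalize(*ref_a)) == self._find(normalize(*ref_b))


linker = ReferenceLinker()
linker.assert_same(("Mary Doe", "234 Elm St"), ("Mary Smith", "456 Pine St"))

# The asserted pair resolves to one entity even though nothing matches.
print(linker.same_entity(("Mary Smith", "456 Pine St"), ("Mary Doe", "234 Elm St")))
# A Mary Doe at an unrelated (hypothetical) address is not linked.
print(linker.same_entity(("Mary Doe", "99 Oak Ave"), ("Mary Smith", "456 Pine St")))
```

Keying each reference on name and address together is what keeps a different Mary Doe from being pulled into the same group.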

If Mary lives in the United States and Mary registers her change of name and address with the US Postal Service, then you might be able to resolve this through the USPS Change of Address file.  Besides the fact that this is only helpful in the US, relying on the USPS COA file has other disadvantages, not the least of which is that Mary may have decided not to register with the USPS.  For this reason, some companies choose to maintain their own knowledge by acquiring information from other public and private sources.

For example in the US, marriage records are publicly available and are a possible source of this associative information.  It may also be true that while Mary didn’t register her change of address with the USPS, she may have wanted to avoid missing any issues of her Modern Square Dancing magazine subscription and promptly registered her change of address with the publisher.  There are potentially many other data sources, such as changes in utility service, cable service, or required licensure notifications.

Even though the application of external association information can alleviate the false negative problem, it comes at a cost.  The collection and maintenance of associative information can be a monumental task for some types of entities. For example, at least 20% of the US population moves each year.  Because it is too large a task for most organizations to take on by themselves, companies that aggregate large amounts of associative data sometimes offer the application of this knowledge as a product.

In the next installment, I will discuss another common confusion, the difference between entity resolution and identity resolution.

Identity Resolution Daily Links 2009-04-21

Tuesday, April 21st, 2009

By the Infoglide Team

Los Angeles Times: L.A. County reserve deputy is accused of fraud at his security firm

“Jane Robison, a spokeswoman for Dist. Atty. Steve Cooley, said the men created a shell company, International Armored Solutions Inc., to hide the true number of employees at the security firm to avoid paying higher workers’ compensation insurance premiums to the State Compensation Insurance Fund.”

ArticleRooms.com: The Benefits of Master Data Management

“Next, Master Data Management can also help prevent fraud. With the passing of Sarbanes-Oxley which holds executives of public companies accountable for their financial statement, these executives have now placed pressure on the organization to get things right.”

Greene County Daily World: Looking back: Area schools safer because of Columbine shooting incident

“Fusion centers are central locations where local, state and federal officials work to receive, integrate and analyze intelligence. The ultimate goal of a fusion center is to provide a mechanism where law enforcement, public safety, and private partners can come together with a common purpose and improve the ability to safeguard our homeland and prevent criminal activity.”

SmartDataCollective: Enterprise Data World 2009

[Jim Harris] “Enterprise Data World is the business world’s most comprehensive vendor-neutral educational event about data and information management.  This year’s program was bigger than ever before, with more sessions, more case studies, and more can’t-miss content.”

All About B2B: PAXLST and CUSRES – How EDI keeps our planes safe from Terrorists

“Through government ownership, the risk of security breaches is minimized and a higher level of consistency can be enforced across airlines.  In the first phase of the program, TSA will perform screening of only US domestic flights.  In future versions of the program, monitoring will expand to include international flights as well.”

Data Quality, Entity Resolution, and OFAC Compliance

Wednesday, April 8th, 2009

By Robert Barker, Infoglide Senior Vice President & Chief Marketing Officer

In a February post blogger Steve Sarsfield talked about government mandates that direct financial institutions to avoid doing business with known “bad guys”:

The mandates have to do with the lists of terrorists offered by the European Union, Australia, Canada and the United States. For example, in the U.S., the US Treasury Department publishes a list of terrorists and narcotics traffickers. These individuals and companies are called “Specially Designated Nationals” or “SDNs.” Their assets are blocked and companies in the U.S. are discouraged from dealing with them by the Office of Foreign Asset Control (OFAC)… If your company fails to identify and block a bad guy… there could be real world consequences such as an enforcement action against your bank or company, and negative publicity.

He goes on to describe the role that data quality software plays in addressing the problem. While I agree with Steve that improving data quality is an important component of some solutions, I’d emphasize that it’s critical to know when and where to improve it. Too much “quality” can actually hurt a solution’s effectiveness.

Professor John Talburt (ERIQ) illustrated this notion in a recent guest post. Making the case for using entity resolution to find hidden relationships, he first showed how the absence of sufficient attributes can cause false positives. He then went on to say:

Even given that the set of identity attributes is large enough to avoid a false positive, the larger problem with matching as a surrogate for entity resolution is that it produces false negatives.  For example, “Mary Doe, 234 Elm St” and “Mary Smith, 456 Pine St” do not match, but does that mean they are not references to the same entity?  It could very well be the case that Mary Doe married John Smith and moved to his house at 456 Pine St.

So in looking for bad actors, suppose the address of one of the two Mary Does above had been resolved to the “correct” address by applying data quality software before using entity resolution to search for hidden relationships. We might have never discovered that Mary Doe married the nefarious John Smith who is on the OFAC list!

If the goal of the solution is merely compliance with a minimum of false positives, data quality can help achieve these goals. But if the goal of the solution is to find bad guys by discovering non-obvious relationships, false negatives are a more important consideration. While false positives are a costly annoyance that require extra resources to resolve, false negatives can mean missing bad guys altogether, and that hurts much more than the bottom line. It can mean not complying with the mandates.

Identity Resolution Daily Links 2009-03-02

Monday, March 2nd, 2009

By the Infoglide Team

Background Now: AG Seeks Injunction Against Contractors Asset Protection Association, Inc. (ConAPA) and Eugene Magre

“‘This company falsely promised its clients that if they gave their employees empty titles and worthless shares of stock they could avoid tens of thousands of dollars in workers compensation premiums,’ Attorney General Brown said. ‘But you can’t simply call a security guard a vice president and avoid complying with the law through a sophisticated and fraudulent scheme.’”

DailyTech: New Bills Target Stolen Merchandise Sold Online

“Under the new legislation, the brick and mortar retailers would score a major coup in that they could order eBay.com, Overstock.com, and Amazon.com to remove numerous goods without any proof.  Under the proposed laws, failure by the online retailers to ‘expeditiously investigate’ and remove the items would result in criminal penalties.”

BeyeNetwork: Business Drivers and Master Data

“Is the actual business need for a single version of the data, or just multiple versions, each of which is of higher quality? Drill down into this a little bit and you may need additional information from your business customers. What constitutes a requirement for master data? A situation in which two business processes need to have a fully shared view of the same representation of a data item?”

Web of Data: Report on Data Discovery by Bloor Research

“…there are now a number of products on the market that can discover data relationships that do not fall within the category of either data profiling or data quality. As a result, it is time to consider the importance of data discovery, and its requirements, as a market in its own right.”

Identity Resolution Daily Links 2009-02-06

Friday, February 6th, 2009

[Post from Infoglide] Identity Resolution: Taking Off in 2009?

“On February 2nd, the Wall Street Journal ran an article about IBM’s application of its identity resolution technology in government organizations, noting they expect it to generate $1 billion in the next four years. Just 18 months ago, Gartner initially identified entity resolution and analysis (aka identity resolution) as a technology “on the rise” in its analysis of the business intelligence (BI) market.  A year later in July 2008, it had moved from the bottom to near the top of the curve.”

Wall Street Journal: At IBM, New Uses for Old Software

“IBM’s software compares data in various databases and finds suspicious relationships. For example, if several applications for visa extensions had different addresses, but all used the same cellphone number, the system would alert immigration staffers that they might be associates requiring a closer look.”

PostalNewsBlog: Massachusetts carrier charged with workers comp fraud

“According to authorities, in November 2007, McComb allegedly intimidated a former customer who spoke to investigators from USPS regarding McComb’s alleged employment status. The alleged fraudulent activities were initially detected by investigators from the OWCP and USPS who referred the case to the Attorney General’s Office. Authorities allege McComb fraudulently collected payments totaling $25,431.09.”

Gartner: Best of Breed MDM versus Generalist MDM – which is best?

[Andrew White] “Users have to decide what they need to focus on – and this may change over time.  Business drivers may lead to the recognition that “deep MDM” skills are needed first hand to get to grips with very complex product data workflows, but later, a more general approach is needed to master other domains.”

Coalition Against Insurance Fraud: 2008 Insurance Fraud Hall of Shame

“Thousands of employees had no workers’ compensation protection when three men helped sell fake policies to small businesses. The scheme stole at least $70 million in premiums. One injured worker couldn’t afford a prosthetic leg. Another lost his home and marriage. A grandmother lost her home and lived in her car.”

Free Information Flow

Thursday, January 22nd, 2009

By Mike Shultz, Infoglide Software CEO

The introduction of free flowing information open to all in the form of a blog is a great way to communicate and share information and ideas.  I especially appreciate the opportunity to push ideas and concepts around in an open forum, in agreement or disagreement.

Doug Wood clearly scratched a sore spot with his post of January 13, “Does Data Matching Qualify as Identity Resolution?”.  Dan Power of Hub Solution Designs, a sometimes guest writer on Identity Resolution Daily, chimed in with his thoughts on how and when identity resolution fits into the MDM world. Then Tom Allen posted a spirited response that put forth his thoughts on the subject of Identity Resolution that elicited a strong response on January 21 from Bob Barker.

No single person’s position is as important as their collective opportunity to openly communicate their ideas and to debate the issues.  I’m delighted that Identity Resolution Daily is the place on the web to host some of the leading thoughts from some of the best thinkers on the complex subject of identity and entity resolution.  Infoglide Software is committed to remaining the leader in this exciting technology, and we really appreciate the interest.

Thanks for joining in.

Are You Serious?

Wednesday, January 21st, 2009

By Robert Barker, Infoglide Senior Vice President & Chief Marketing Officer

We received a comment on our recent post that contrasted identity resolution with data matching that I can’t let go unanswered. Here’s what the respondent said:

“Interesting.  So what Identity Resolution consists of is a bunch of data standardization tables and a matching tool?  Seems like a name equivalence table and a color equivalence table and any of the off the shelf matching tools would solve your problem.  That is a pretty trivial solution to a pretty complicated problem.  Thanks for [sic] you insight.”

Wow, talk about missing the point! Data matching products are clearly useful for certain types of problems, e.g. cleansing data before insertion into a data warehouse. What we stated is that whole classes of problems demand a different approach, and the current tendency to re-brand data quality products as “identity resolution” is misleading. Here’s why.

First, identity resolution is far more than a name equivalence table and a color equivalence table. There is no such thing as a passport equivalence table. The technology has to “understand” the standard formats for passports from many different countries and also understand the possible ways those passport numbers may be manipulated. Also, an equivalence table and COTS matching tools wouldn’t be able to determine that two homes are right next to each other even though they have totally different street names.

Additionally, a robust identity resolution technology needs to be able to search and analyze free text and compare different elements in an unstructured blob of text to find similarities. Those are just a few examples of the types of data comparisons that can be accomplished with identity resolution. Our Identity Resolution Engine™, for example, uses over 50 domain-specific Similarity Search algorithms, each with its own intellectual property, to compare many different types of attributes.

Second, data matching tools typically reduce the amount of available data by combining “like” entities. The goal is “de-duping” and standardization of the data. Typical responses are simply “yes it’s a match” or “no it’s not a match.”  While fine for basic MDM and data warehousing efforts, it’s not so great for mission-critical applications, or if you’re trying to retain the data for future analysis. Losing data is not an option – the diversity of the data contains valuable forensic information about how (id)entities are matched or linked as relationships.
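The contrast between merging and linking can be sketched in a few lines. This is only an illustration under stated assumptions: the generic string similarity from Python's standard difflib module stands in for the domain-specific algorithms discussed in this post, the threshold of 0.5 is arbitrary, and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Generic string similarity in [0, 1]; a stand-in for real analytics."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link_records(records, threshold=0.5):
    """Return scored links between record pairs without merging anything."""
    links = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            score = similarity(records[i], records[j])
            if score >= threshold:
                links.append((i, j, round(score, 2)))
    return links


records = [
    "Mary Doe, 234 Elm St",
    "Mary Smith, 456 Pine St",
    "John Smith, 456 Pine St",
]
links = link_records(records)

# Every original record is still present; the links carry a score
# rather than a bare yes/no, so they can be reviewed or re-scored later.
for i, j, score in links:
    print(records[i], "<->", records[j], score)
```

Because nothing is collapsed, the differences between linked records remain available for later analysis.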

With ten years of R&D, we’ve perfected the combination of lexigraphic algorithms with over 50 domain-specific algorithms to deliver a high degree of precision which precludes false positives – something that can’t be approached using a single generic equation.  For example, DHS’s Secure Flight program required true identity resolution, and that’s why we won that business over hundreds of others.

And finally, a comprehensive identity resolution technology, in addition to data matching, should have the ability to:
•    Uncover non-obvious relationships between seemingly disparate identities/entities,
•    Apply rules and decisioning based on the specific industry, application, and organization,
•    And integrate that knowledge back into existing business applications.

If you’re seriously interested in educating yourself on how identity resolution differs from data matching, we’ve written extensively about the subject. Check out posts in early December, then a week later, then once more right before the holidays.

If you’re not serious about identity resolution, then I’m not sure why you’re reading this! If you ARE serious, we’d like to hear your thoughts.

#3: If Only Data Quality Were That Simple

Wednesday, December 17th, 2008

By Robert Barker, Infoglide Senior Vice President & Chief Marketing Officer

Our previous post was a response to Philip Howard at Bloor, who recently raised questions about data quality solutions in a series of posts.  One final point he made raises an issue that deserves an extended comment.

On this point Philip says “where I think there may be a significant difference between products is in their ability to discover relationships. However, I will not comment on this now as I am conducting research into this issue and plan to publish a detailed report in the New Year.”  This brief comment highlights what I believe is a rapidly emerging market.

Discovering hidden relationships is a crucial part of a market known by various aliases: entity resolution and analysis, entity analytics, and our favorite of course, identity resolution. Regardless of the name you use, identity resolution problems have distinct characteristics not adequately addressed by existing solutions for data quality (DQ). Neither are they addressed by customer relationship management (CRM), master data management (MDM), or business intelligence (BI).

Identity resolution solutions focus on a horizontal need: identifying bad actors in multiple industries. They require at a minimum the following capabilities:
1.    Identity matching through an extensive library of attribute-specific analytics
2.    Relationship detection and resolution
3.    Decisioning that leverages industry-standard and other rules-based systems
4.    Seamless integration with existing business processes via web services and APIs.

Until recently, identity resolution problems were overlooked, addressed by custom in-house applications, or served by cobbling together products from adjacent and overlapping markets. As the identity resolution market has emerged, vendors from these adjacent markets have tried to address the needs with existing products, but customers quickly learn that these products lack the combination of integrated capabilities needed to adequately address this unique problem area.

Identity resolution has a similar relationship to each of the 4 adjacent areas mentioned above. Each area causes the creation of data sources that can be consumed by identity resolution solutions, while the addition of identity resolution technology enhances the accuracy of DQ, BI, CRM, and MDM.

Here’s a chart that seeks to clarify these relationships and to characterize the strengths and weaknesses of the adjacent products when applied to identity resolution problems. The adjacent products simply can’t address identity resolution well because they were built for a different purpose. Do you agree?

Identity Resolution Daily Links 2008-12-12

Friday, December 12th, 2008

[Post from Infoglide] Part Deux: If Only Data Quality Were That Simple

“Applying generic algorithms to data attributes with wildly varying characteristics simply can’t match the accuracy of applying a family of deterministic analytics, each built around specific characteristics of a particular attribute type.”

Data Value Talk: The added value of an integrated customer view

“So it appears that the data itself plays a crucial role in the lack of an integrated customer view. Or more accurately, the better the data - the better the customer view.  And the better the matching of customer records across separate systems the better the integrated customer view. So Data Quality and Matching (Identity Resolution) determine in large parts the quality of the integrated customer view and the added value that it delivers.”

Marion Star: Muzzle loading and compensation

“Investigators from the Ohio Bureau of Workers’ Compensation, posing as gun enthusiasts, twice visited SMS. Those visits consisted primarily of small talk about guns and ammo. McGraw discussed some pistols that he had recently sold and invited one of the investigators to bring in an allegedly defective gun, telling them he would ‘take a look at it.’”

Intelligent Enterprise: ‘Surround Strategy:’ A Prediction for 2009

“Rather than trying to remodel the data warehouse to accommodate fresher and more detailed operational data (near real-time activity in operational systems, process logs, etc.), these data sources will operate in parallel (or horizontally, whichever word you like) as complementary feeds to analytics. It takes too long and is too expensive to expand the data warehouse concept to do this.”

New York State Insurance Department: Cortland Woman Accused of Workers’ Comp Fraud

“Horton is charged with making false statements and submitting false testimony to the Workers’ Compensation Board to receive benefits. She claimed that an April 2006 back injury she suffered while she was a health aide prevented her from working or attending school. Investigators learned that she was attending school full-time.”

Gartner: When is SOA, DOA? When it’s without MDM!

[Andrew White] “Clearly, if every SOA-based application interaction had to incur the costs of data reconciliation, mapping, clean up etc, then the cost of building and maintaining that SOA-based application would exceed what it costs today without SOA.  The bottom line: SOA needs MDM to help with the evolution of the information infrastructure.”

The State Journal: Insurance Fraud Unit Wins 45 Convictions This Year

“Since January 2007, the fraud unit has received 1,703 case referrals for review from those in the insurance industry and private citizens. After reviewing the referrals, field investigators have been assigned 397 cases to pursue. During that time, [West Va. Insurance Commissioner Jane] Cline said, 292 criminal cases have been referred to various prosecuting authorities, as well as in-house prosecutors who have been assigned to the unit on a full-time basis. Further, the fraud unit has secured indictments on 84 individuals for 294 felony counts and successfully obtained 73 convictions, including 45 in 2008.”

Part Deux: If Only Data Quality Were That Simple

Wednesday, December 10th, 2008

By Robert Barker, Infoglide Senior Vice President & Chief Marketing Officer

During the past two weeks, Philip Howard at Bloor Research has raised interesting questions about the nature and efficiency of data quality solutions in a series of posts entitled “The problem with data quality solutions.” Last week I responded on his blog and posted an expanded discussion of the same points here.

His fourth installment opens some interesting new topics. Perhaps the best approach is to lift some quotes and then respond below.

“Where I will comment is on the importance of understanding relationships not just between data elements but also between data and applications and even between data and the business. Understanding data relationships is arguably the most important factor whenever you are moving and transforming data, especially in data migration and data archiving environments but also for moving data into a warehouse and similar applications.” We agree that finding non-obvious connections is crucial to building effective data quality solutions. Many technologies fall short in this regard. They are unable to evaluate relationships based on similarity when data is inconsistent. Philip’s simple example baffles many technologies:

“A typical case might be where one application required a five digit numeric field and another application requires the same five numbers plus an additional two alphabetic characters. So, here’s a question for data quality vendors: can your software tell the difference?”  Applying generic algorithms to data attributes with wildly varying characteristics simply can’t match the accuracy of applying a family of deterministic analytics, each built around specific characteristics of a particular attribute type.
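Philip’s example can be sketched as a comparison that first recognizes which format each value uses and only then compares. The format labels and function names below are hypothetical, and the code is a minimal illustration of the idea of an attribute-specific comparison, not a description of any vendor’s analytics.

```python
import re

# Format 1: exactly five digits. Format 2: five digits plus two letters.
FIVE_DIGITS = re.compile(r"[0-9]{5}")
FIVE_DIGITS_TWO_LETTERS = re.compile(r"[0-9]{5}[A-Za-z]{2}")


def classify(value):
    """Name the format a code value follows, or 'other' if neither."""
    value = value.strip()
    if FIVE_DIGITS.fullmatch(value):
        return "numeric5"
    if FIVE_DIGITS_TWO_LETTERS.fullmatch(value):
        return "numeric5_alpha2"
    return "other"


def compare_codes(a, b):
    """Compare two codes while staying aware of their formats."""
    a, b = a.strip().upper(), b.strip().upper()
    kind_a, kind_b = classify(a), classify(b)
    if "other" in (kind_a, kind_b):
        return "incomparable"
    if a[:5] != b[:5]:
        return "different"
    if kind_a == kind_b:
        return "same" if a == b else "different"
    # Same five digits, but one value carries the two-letter suffix.
    return "same_base_different_format"


print(compare_codes("12345", "12345AB"))
print(compare_codes("12345", "54321XY"))
```

A generic string comparison would score "12345" and "12345AB" as merely similar; the format-aware version can say exactly how they differ.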

He goes on: “Unfortunately, discovering relationships is not just about profiling your database. There may be relationships that exist across data sources (and types of data source) that you need to understand; and then there is the application factor. While it may not be theoretically correct from a purist data management perspective the fact is that many data relationships are defined within applications so, in one way or another, you really need to discover these.”  We couldn’t have articulated it any better. Many data quality solutions assume a higher degree of order than actually exists in the real world. Being able to deal with ambiguity (e.g., data sometimes missing, data entered in wrong fields) distinguishes the best technologies from their more simplistic brethren.

This post is getting a little long, so we’ll continue this discussion next week. In the meantime, we’d like to hear your reaction.

