Archive for the ‘Data-Mining’ Category

Identity Resolution Daily Links 2009-08-21

Friday, August 21st, 2009

[Post from Infoglide] Walking the Privacy/Security Tightrope

“In a post last April, we talked about the privacy/security balance issue for fusion centers and for vendors with supporting technology. Now an article in the Austin Sunday paper about a proposed fusion center again highlights the tension between security and privacy. Each time a fusion center is proposed, the story goes like this…”

information management: MDM for Tough Times: 5 trends to strengthen organizations during recession

[Aaron Zornes] “Enterprise MDM solutions are steadily but rapidly evolving away from data-centric hubs into full-blown application stacks. In other words, MDM is becoming less of a standalone technology infrastructure as the emphasis is increasingly on relationships between domains, user interface and integration with other emerging and adjacent technologies such as RFID, entity analytics and business intelligence.”

InformationWeek: Healthcare Tech: Can BI Help Save The System?

“Healthcare IT is a good place to be these days. While IT budgets in many verticals have been tightly reined, healthcare is enjoying multiple government mandates. This has resulted in an infusion of funds to modernize and integrate IT infrastructure, applications, and data. However, we aren’t starting from a high ground. There are multiple challenges to attaining a 21st century-grade IT environment.”

OCDQ Blog: Adventures in Data Profiling (Part 2)

“The adventures began with the following scenario – You are an external consultant on a new data quality initiative.  You have got 3,338,190 customer records to analyze, a robust data profiling tool, half a case of Mountain Dew, it’s dark, and you’re wearing sunglasses…ok, maybe not those last two or three things – but the rest is true.”

VIDEO: Interview with Secure Flight

TSA Secure Flight Program Director Paul Leyh is interviewed about recent developments.

Identity Resolution Daily Links 2009-07-31

Friday, July 31st, 2009

[Post from Infoglide] Data Finds Data in Real-Time Entity Resolution

“Jeff Jonas of IBM recently quoted from a chapter called “Data Finds Data”  that he co-wrote for a book entitled Beautiful Data: The Stories Behind Elegant Data Solutions, and I was impressed by how well this passage describes the effective use of entity resolution software (e.g., IRE 2.2)…”

IT-Director.com: GRC is not enough

[Philip Howard] “If you think about these different forms of risk, they can mostly be managed within existing GRC frameworks: business risk, data and IT governance and compliance cover five of these seven types of risk. But they don’t cover fraud or cyber attacks or similar security issues.”

SunSentinel.com: Roofer ducked $400,000 in worker’s comp premiums

“Investigators with the state’s Division of Insurance Fraud said Robert McDonald, owner of Gulfstream Roofing Inc., funneled $3 million in payroll through several fake companies between 2002 and 2006, claiming the money was being paid to insured subcontractors instead of his own workers.”

BNET Healthcare: What Can US Learn From European Health IT Experience?

“The three countries also use universal patient identification numbers in health care. This is much easier to do in Europe than it is in the U.S., where the mistrust of government is so high that the issue of having a single patient identifier number is no longer even under discussion. There’s also the small matter of our low EHR adoption rate, which is less than 20 percent for physicians and lower for hospitals. By contrast, most physicians in the three European countries are using some kind of EHR.”

Identity Resolution Daily Links 2009-07-24

Friday, July 24th, 2009

[Post from Infoglide] Entity Resolution as Data Mining

“In my last post, I suggested that entity resolution in the broadest sense (“Big ER”) really encompasses three activities.  The first is locating and collecting entity references from unstructured sources (entity extraction), the second is resolving and merging references to the same entity (“Little ER”), and the third is analyzing associations among entities.  Not every ER process involves all three activities.”

BeyeNETWORK: Some Perspectives on Quality

[Bill Inmon] “There are then very legitimate circumstances where incorrect data is best left in the database or data warehouse. Stated differently, there is no circumstance where correcting data or not correcting data is the right thing to do. In order to determine which approach is proper, the context of the corrections has to be known. Only then can it be determined whether correcting errors is the proper thing to do.”

Homeland Security Watch: How To Improve Homeland Security: Give the ODNI Oversight Responsibility for Fusion Centers

“To me, fusion centers are a fine example of Darwinian logic in homeland security.  There was no comprehensive national plan to create fusion centers.  In original intent, Founding-Fathers-federalism fashion, states and cities decided they were not getting the intelligence they wanted.  Arizona, Georgia, Illinois, New York and a handful of other jurisdictions took responsibility for processing - or “fusing” - their own intelligence.”

ITBusinessEdge: Master Data Management and the CIO’s Strategic Plan

“If we look at MDM as a collection of techniques providing enterprise-wide data requirements analysis and subsequent implementation of best practices in data management, then the savvy IT manager might cherry-pick from the tools offered by vendors to provide the optimal solution that unifies the view of critical data concepts while satisfying the data quality requirements imposed by a horizontal information solution.”

I, Cringely: Medical Records R Us

“So medical records are an area where IT could make us healthier and, if done correctly, ought to save lots of money, too.  What we need is some form of centralized medical record keeping that preserves patient privacy yet, at the same time, keeps us from shopping all over town for bogus Oxycontin prescriptions.”

Entity Resolution as Data Mining

Wednesday, July 22nd, 2009

By John Talburt, PhD, CDMP, Director, UALR Laboratory for Advanced Research in Entity Resolution and Information Quality (ERIQ)

In my last post, I suggested that entity resolution in the broadest sense (“Big ER”) really encompasses three activities.  The first is locating and collecting entity references from unstructured sources (entity extraction), the second is resolving and merging references to the same entity (“Little ER”), and the third is analyzing associations among entities.  Not every ER process involves all three activities.  As I noted, the “pre-activity” of entity extraction only comes into play when the entity reference sources are unstructured, for example facial recognition in surveillance videos.  Before the facial characteristics can be analyzed and compared to known faces, the portion of the images in the video that represents the face must first be located and extracted.  In image processing this is called “feature extraction” and is the genesis of my use of the term “entity extraction” for this activity.

When the notion of entity resolution first developed, it was in the context of a database entity-relation schema.  In those days, ER was just about merging all the references to the same entity.  There was no entity extraction activity because the information in the database was already structured.  The entity extraction activity grew out of the realization that useful information may also reside in an unstructured format.

Now I’d like to talk about the third activity, exploring networks of associations.  Once you have located and merged all of the references to the same entity, the next step is to ask whether any relationships exist among the entities.  One of the first to be explored was the “household” relationship.  Companies realize that there is value in understanding who’s living with whom at the same location, yet interestingly it is still one of the hardest relationships to define and manage.  The simplest definition is “all the people at the same address with the same last name.”  While simple, it doesn’t capture the nuances of current demographics such as unmarried couples, stepchildren, and extended families.
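
The simplest household definition above can be sketched in a few lines of Python. This is only an illustration: the sample records and the `group_households` helper are hypothetical, and the normalization (trimming and lowercasing) is an assumption, not something the post prescribes.

```python
from collections import defaultdict

# Hypothetical sample records: (first name, last name, address).
records = [
    ("Bill", "Smith", "123 Oak Street"),
    ("Ann", "Smith", "123 Oak Street"),
    ("John", "Doe", "123 Oak Street"),
    ("Mary", "Doe", "234 Elm St"),
]


def group_households(records):
    """Group people who share both an address and a last name."""
    households = defaultdict(list)
    for first, last, address in records:
        # Assumed normalization: ignore case and surrounding whitespace.
        key = (address.strip().lower(), last.strip().lower())
        households[key].append(f"{first} {last}")
    return dict(households)


households = group_households(records)
for (address, last_name), members in households.items():
    print(address, last_name, members)
```

Note that John Doe at 123 Oak Street ends up in a household of his own, separate from the Smiths at the same address. That is exactly the kind of nuance (a stepchild, an unmarried partner, an extended family member) the simple definition misses.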

Exploring entity relationships brings us to the intersection of entity resolution with data mining.  Data mining is all about discovering non-explicit (non-obvious) relationships.  A record or database instance by definition is an explicit relationship among the attribute values, i.e. they belong to the same entity.  However, just as in the case of households, we can discover relationships that are not explicitly given, e.g. people living at the same address.

Building associations is a natural extension of the Little ER process.  Even when there is not enough asserted or inferred evidence to conclude that two references are to the same entity, it may still be possible to establish an association.  For example, a record for Bill Smith at 123 Oak Street and a record for John Doe at 123 Oak Street would not resolve as references to the same person (unless there were evidence of deliberate deception), but together they do establish that the two shared a residence at some time.  If they shared it at the same time, it might be an important relationship in the context of a criminal investigation, e.g. looking for known associates of Bill Smith.

Like the small world hypothesis and six degrees of separation, entity association can extend many levels beyond direct associations like a shared address.  For example, Bill Smith and John Doe may never have shared the same address, but they may have both shared the same address with Fred Johnson, thus establishing an indirect connection.
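
One way to picture indirect associations is as a graph in which people are nodes and a shared address creates an edge. The sketch below is a minimal illustration under that assumption; the residence data, `build_association_graph`, and `association_path` are all hypothetical names, and the example ignores when each person lived at each address.

```python
from collections import defaultdict, deque

# Hypothetical residence records: (person, address).
residences = [
    ("Bill Smith", "123 Oak Street"),
    ("Fred Johnson", "123 Oak Street"),
    ("Fred Johnson", "77 Maple Ave"),
    ("John Doe", "77 Maple Ave"),
]


def build_association_graph(residences):
    """Link every pair of people who appear at the same address."""
    by_address = defaultdict(set)
    for person, address in residences:
        by_address[address].add(person)
    graph = defaultdict(set)
    for people in by_address.values():
        for person in people:
            graph[person] |= people - {person}
    return graph


def association_path(graph, start, goal):
    """Breadth-first search for the shortest chain of associations."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in sorted(graph.get(path[-1], ())):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None  # no chain of shared addresses connects the two


graph = build_association_graph(residences)
path = association_path(graph, "Bill Smith", "John Doe")
print(path)
```

Here Bill Smith and John Doe never appear at the same address, yet the search connects them through Fred Johnson. The length of the returned path, minus one, is the number of "degrees" separating the two.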

This simple example is based on shared address, but entity connections can be established through many combinations of inferred associations such as shared telephone or PO Box address as well as asserted associations such as call records between telephone numbers or change-of-address records.  Just as with entity extraction, the analysis of association networks has its own body of research and knowledge that practitioners can draw upon.

I hope that this series of posts has provided a broader perspective on the variety of activities that comprise entity resolution.  I certainly find it a fascinating subject.  In my next post, I will discuss the concept and internal view of identity versus an external view of identity.

Identity Resolution Daily Links 2009-06-22

Monday, June 22nd, 2009

By the Infoglide Team

intelligent enterprise: They Better Get This MDM Program Right

“As reported in The New York Times and on the TSA Web site, the Secure Flight program will improve upon current practices in matching passenger identities to watch lists in many ways. At first glance, this appears to be a well thought-out program that conforms to several basic tenets of Master Data Management (in bold below), in this case for the ‘Customer’ entity.”

EHRWMS: Georgia’s Best EMR Used By Three of Top Ten Pediatricians

“Of approximately 100 respondents, 28 used an EMR, of which 40% used the EncounterPRO Pediatric EMR. There were only three other EMRs used more than once, and they were used by only 10%, 7%, and 7% of the survey respondents respectively.”

Government Executive: Enforcement agencies boost cooperation on drug investigations

“In addition, ICE agents for the first time will fully participate in the Organized Crime Drug Enforcement Task Force Fusion Center. The center allows participating federal, state and local law enforcement agencies, including DEA and the FBI, to share information and analytical resources to enhance their overall investigative capacity.”

SmartData Collective: The Data-Information Continuum

“Data could be considered a constant while information is a variable that redefines data for each specific use. Data is not truly a constant since it is constantly changing. However, information is still derived from data and many different derivations can be performed while data is in the same state (i.e. before it changes again).”

Identity Resolution Daily Links 2009-06-12

Friday, June 12th, 2009

[Post from Infoglide] Data Source Disintermediation?

“According to Wikipedia, ‘disintermediation is the removal of intermediaries in a supply chain: ‘cutting out the middleman’… Buyers bypass the middlemen (wholesalers and retailers) in order to buy directly from the manufacturer and thereby pay less.’”

[Jim Harris] OCDQ Blog: The Two Headed Monster of Data Matching

“Data matching is commonly defined as the comparison of two or more records in order to evaluate if they correspond to the same real world entity (i.e. are duplicates) or represent some other data relationship (e.g. a family household). Data matching is commonly plagued by what I refer to as The Two Headed Monster…”

CorpWatch: CorpWatch announces release of the CrocTail application and open CorpWatch API

“CrocTail provides an interface for browsing information about several hundred thousand U.S. publicly traded corporations and their many foreign and domestic subsidiaries. Information from company Securities and Exchange Commission (SEC) filings has been parsed and annotated by CorpWatch to highlight specific corporate accountability issues. CrocTail also serves as a demonstration of the features and data available through the CorpWatch API.”

Vos Is Neias: Washington - TSA Advising Travelers To Book Airline Tickets Using Full Real Names

“While the T.S.A. has announced Aug. 15 as a target date for the airlines to begin asking for each passenger’s full name, gender and date of birth, and has already begun publicizing the program, called Secure Flight, the agency acknowledged that it would go into effect in phases as the airlines update their systems.”

Data Source Disintermediation?

Wednesday, June 10th, 2009

By Robert Barker, Infoglide Senior VP & Chief Marketing Officer

According to Wikipedia, “disintermediation is the removal of intermediaries in a supply chain: ‘cutting out the middleman’… Buyers bypass the middlemen (wholesalers and retailers) in order to buy directly from the manufacturer and thereby pay less.” Some famous disintermediation examples are:

•    Bookselling (e.g., Amazon’s long-tail marketing of millions of books online)
•    Travel (e.g., Southwest Airlines selling tickets direct to consumers on the web)
•    Computers (e.g., Dell selling computers direct to consumers and businesses over the internet).

Disintermediation was THE hot topic during the dot com boom, but the heady prediction that virtually every industry would be disintermediated has yet to become a reality. Nevertheless, over the past decade or so we’ve all tracked the news as one business model after another is attacked by competitors who seek a way to “disintermediate” a particular sector.

Part of the power of identity resolution solutions derives from the data sources upon which they’re based, and both the quantity and quality of data sources can affect the results. One challenging identity resolution problem we’ve written about that relies on a variety of data sources is insider trading (see Leveraging Identity Resolution Data Sources). Drawing on multiple internal and external, public and private data sources, identity resolution unwinds multiple degrees of business, friendship, and familial relationships to uncover likely illegal stock market gains.

Now potential disintermediation plays related to data sources are emerging. CrunchBase is a well-known example, offering a free database of technology companies, people, and investors that anyone can edit. San Francisco-based CorpWatch is a non-profit engaged in “investigative research and journalism to expose corporate malfeasance and to advocate for multinational corporate accountability and transparency”. They’ve just announced an API that makes it easier to search SEC data:

“Although the SEC provides a search interface for locating company filings (EDGAR / IDEA), the subsidiary information is not presented in a standardized format suitable for automated use or insertion into a database. The CorpWatch API uses parsers to “scrape” the subsidiary relationship information from Exhibit 21 of the 10-K filings and provides a well-structured interface for programs to query and process the subsidiary data.”

The free CorpWatch API enables identity resolution and other applications to look up the formal names of corporations, ascertain their relationships to other corporations, find their locations around the world, learn their alternate and formal names, and access other useful information. Up to now, you could only get this kind of information from relatively expensive paid subscriptions from commercial data providers.

Is it possible that the efforts of organizations like CorpWatch point to a future in which an abundance of new, free sources of data will make it even easier to create identity resolution applications?

Solving the False Negative Problem

Wednesday, April 22nd, 2009

By John Talburt, PhD, CDMP, Director, UALR Laboratory for Advanced Research in Entity Resolution and Information Quality (ERIQ)

In my March 25, 2009 post “The Myth of Matching,” I discussed the confusion between entity resolution and matching as in record de-duplication.  Matching is a necessary part of entity resolution, but it is not sufficient.  In particular I brought up the issue of “false negatives,” cases where records don’t match, but are in fact references to the same entity.  I used the example of Mary Doe living on Elm Street who married John Smith living on Pine Street resulting in two references “Mary Doe, 234 Elm St” and “Mary Smith, 456 Pine St” that don’t match, but are nevertheless references to the same person.  Let’s discuss a couple of approaches to solving this problem - enlarging the scope of identity attributes and utilizing asserted associations.

The Mary Doe - Mary Smith case might be resolved if the scope of identity attributes were increased, i.e. if additional information such as date-of-birth, driver’s license, or social security number were available in both records.  But as anyone acquainted with information quality understands, acquiring and maintaining additional information can create as many problems as it solves.  It also brings up a number of questions that the information custodians and collectors must answer.

Is this information available? Is it costly? Is use for this purpose permissible/legal?  Even if expanding the number of identity attributes is an option, it is not necessarily a panacea.  Increasing the number of identity attributes also increases the complexity of the matching.  What if some values are missing?  What if some values agree, but others disagree?
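
To make the added complexity concrete, here is a minimal sketch that compares two references attribute by attribute and reports whether each value agrees, disagrees, or is missing. The field names, sample values, and the `compare_attributes` helper are hypothetical; a real matching engine would use fuzzier comparisons and a weighting scheme on top of an outcome like this.

```python
def compare_attributes(rec_a, rec_b, attributes):
    """Classify each attribute as 'agree', 'disagree', or 'missing'."""
    outcome = {}
    for name in attributes:
        a, b = rec_a.get(name), rec_b.get(name)
        if not a or not b:
            # One or both records lack a value for this attribute.
            outcome[name] = "missing"
        elif a.strip().lower() == b.strip().lower():
            outcome[name] = "agree"
        else:
            outcome[name] = "disagree"
    return outcome


# Hypothetical references with an enlarged set of identity attributes.
mary_doe = {
    "first_name": "Mary",
    "last_name": "Doe",
    "address": "234 Elm St",
    "date_of_birth": "1975-03-02",
    "drivers_license": None,
}
mary_smith = {
    "first_name": "Mary",
    "last_name": "Smith",
    "address": "456 Pine St",
    "date_of_birth": "1975-03-02",
    "drivers_license": "D1234567",
}

fields = ["first_name", "last_name", "address", "date_of_birth", "drivers_license"]
result = compare_attributes(mary_doe, mary_smith, fields)
for name, verdict in result.items():
    print(f"{name}: {verdict}")
```

With these sample values, two attributes agree, two disagree, and one is missing, so the extra attributes alone do not settle whether the references are to the same person. Some decision rule still has to weigh the mixed evidence.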

A second approach is to collect and use asserted associations.  The fundamental problem is that if Mary Doe and Mary Smith do not share any matching identity attributes, you cannot know that they are the same person without some separately acquired knowledge that they are in fact the same person.  Moreover, because not all Mary Does are the same person as Mary Smith, you also need additional context such as the address to make the connection clear.  The upshot is that you need to possess the explicit knowledge that “Mary Doe at 234 Elm St is the same person as Mary Smith at 456 Pine St.”
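
A minimal sketch of the idea, assuming the asserted knowledge is stored as pairs of (name, address) references known to be the same person. The `asserted_same` set and both helper functions are hypothetical, and exact equality stands in for whatever matching logic is really in use.

```python
# Hypothetical asserted associations: pairs of references that a separately
# acquired source says are the same person.
asserted_same = {
    frozenset({("Mary Doe", "234 Elm St"), ("Mary Smith", "456 Pine St")}),
}


def matches(ref_a, ref_b):
    """Stand-in for attribute matching: exact equality of name and address."""
    return ref_a == ref_b


def same_entity(ref_a, ref_b, asserted=asserted_same):
    """Resolve by matching first, then fall back to asserted associations."""
    return matches(ref_a, ref_b) or frozenset({ref_a, ref_b}) in asserted


doe = ("Mary Doe", "234 Elm St")
smith = ("Mary Smith", "456 Pine St")
other_doe = ("Mary Doe", "9 Birch Rd")

print(matches(doe, smith))          # matching alone: a false negative
print(same_entity(doe, smith))      # resolved through the assertion
print(same_entity(other_doe, smith))
```

Because the assertion is keyed on name and address together, a different Mary Doe at another address is not linked to Mary Smith, which reflects the point about needing additional context.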

If Mary lives in the United States and Mary registers her change of name and address with the US Postal Service, then you might be able to resolve this through the USPS Change of Address file.  Besides the fact that this is only helpful in the US, relying on the USPS COA file has other disadvantages, not the least of which is that Mary may have decided not to register with the USPS.  For this reason, some companies choose to maintain their own knowledge by acquiring information from other public and private sources.

For example in the US, marriage records are publicly available and are a possible source of this associative information.  It may also be true that while Mary didn’t register her change of address with the USPS, she may have wanted to avoid missing any issues of her Modern Square Dancing magazine subscription and promptly registered her change of address with the publisher.  There are potentially many other data sources, such as changes in utility service, cable service, or required licensure notifications.

Even though the application of external association information can alleviate the false negative problem, it comes at a cost.  The collection and maintenance of associative information can be a monumental task for some types of entities. For example, at least 20% of the US population moves each year.  Because it is too large a task for most organizations to take on by themselves, companies that aggregate large amounts of associative data sometimes offer the application of this knowledge as a product.

In the next installment, I will discuss another common confusion, the difference between entity resolution and identity resolution.

Identity Resolution Daily Links 2009-04-21

Tuesday, April 21st, 2009

By the Infoglide Team

Los Angeles Times: L.A. County reserve deputy is accused of fraud at his security firm

“Jane Robison, a spokeswoman for Dist. Atty. Steve Cooley, said the men created a shell company, International Armored Solutions Inc., to hide the true number of employees at the security firm to avoid paying higher workers’ compensation insurance premiums to the State Compensation Insurance Fund.”

ArticleRooms.com: The Benefits of Master Data Management

“Next, Master Data Management can also help prevent fraud. With the passing of Sarbanes-Oxley which holds executives of public companies accountable for their financial statement, these executives have now placed pressure on the organization to get things right.”

Greene County Daily World: Looking back: Area schools safer because of Columbine shooting incident

“Fusion centers are central locations where local, state and federal officials work to receive, integrate and analyze intelligence. The ultimate goal of a fusion center is to provide a mechanism where law enforcement, public safety, and private partners can come together with a common purpose and improve the ability to safeguard our homeland and prevent criminal activity.”

SmartDataCollective: Enterprise Data World 2009

[Jim Harris] “Enterprise Data World is the business world’s most comprehensive vendor-neutral educational event about data and information management.  This year’s program was bigger than ever before, with more sessions, more case studies, and more can’t-miss content.”

All About B2B: PAXLST and CUSRES – How EDI keeps our planes safe from Terrorists

“Through government ownership, the risk of security breaches is minimized and a higher level of consistency can be enforced across airlines.  In the first phase of the program, TSA will perform screening of only US domestic flights.  In future versions of the program, monitoring will expand to include international flights as well.”

Data Quality, Entity Resolution, and OFAC Compliance

Wednesday, April 8th, 2009

By Robert Barker, Infoglide Senior Vice President & Chief Marketing Officer

In a February post blogger Steve Sarsfield talked about government mandates that direct financial institutions to avoid doing business with known “bad guys”:

The mandates have to do with the lists of terrorists offered by the European Union, Australia, Canada and the United States. For example, in the U.S., the US Treasury Department publishes a list of terrorists and narcotics traffickers. These individuals and companies are called “Specially Designated Nationals” or “SDNs.” Their assets are blocked and companies in the U.S. are discouraged from dealing with them by the Office of Foreign Assets Control (OFAC)… If your company fails to identify and block a bad guy… there could be real world consequences such as an enforcement action against your bank or company, and negative publicity.

He goes on to describe the role that data quality software plays in addressing the problem. While I agree with Steve that improving data quality is an important component of some solutions, I’d emphasize that it’s critical to know when and where to improve it. Too much “quality” can actually hurt a solution’s effectiveness.

Professor John Talburt (ERIQ) illustrated this notion in a recent guest post. Making the case for using entity resolution to find hidden relationships, he first showed how the absence of sufficient attributes can cause false positives. He then went on to say:

Even given that the set of identity attributes is large enough to avoid a false positive, the larger problem with matching as a surrogate for entity resolution is that it produces false negatives.  For example, “Mary Doe, 234 Elm St” and “Mary Smith, 456 Pine St” do not match, but does that mean they are not references to the same entity?  It could very well be the case that Mary Doe married John Smith and moved to his house at 456 Pine St.

So in looking for bad actors, suppose the address in one of the two Mary records above had been resolved to the “correct” address by applying data quality software before using entity resolution to search for hidden relationships. We might have never discovered that Mary Doe married the nefarious John Smith who is on the OFAC list!

If the goal of the solution is merely compliance with a minimum of false positives, data quality can help achieve that goal. But if the goal of the solution is to find bad guys by discovering non-obvious relationships, false negatives are a more important consideration. While false positives are a costly annoyance that requires extra resources to resolve, false negatives can mean missing bad guys altogether, and that hurts much more than the bottom line. It can mean not complying with the mandates.
