Archive for the ‘Data Synchronization’ Category

Identity Resolution Daily Links 2009-03-02

Monday, March 2nd, 2009

By the Infoglide Team

Background Now: AG Seeks Injunction Against Contractors Asset Protection Association, Inc. (ConAPA) and Eugene Magre

“‘This company falsely promised its clients that if they gave their employees empty titles and worthless shares of stock they could avoid tens of thousands of dollars in workers compensation premiums,’ Attorney General Brown said. ‘But you can’t simply call a security guard a vice president and avoid complying with the law through a sophisticated and fraudulent scheme.’”

DailyTech: New Bills Target Stolen Merchandise Sold Online

“Under the new legislation, the brick and mortar retailers would score a major coup in that they could order eBay.com, Overstock.com, and Amazon.com to remove numerous goods without any proof.  Under the proposed laws, failure by the online retailers to ‘expeditiously investigate’ and remove the items would result in criminal penalties.”

BeyeNetwork: Business Drivers and Master Data

“Is the actual business need for a single version of the data, or just multiple versions, each of which is of higher quality? Drill down into this a little bit and you may need additional information from your business customers. What constitutes a requirement for master data? A situation in which two business processes need to have a fully shared view of the same representation of a data item?”

Web of Data: Report on Data Discovery by Bloor Research

“…there are now a number of products on the market that can discover data relationships that do not fall within the category of either data profiling or data quality. As a result, it is time to consider the importance of data discovery, and its requirements, as a market in its own right.”

Entity Extraction: The Flip Side of Entity Resolution

Wednesday, February 25th, 2009

By John Talburt, PhD, CDMP, Director of the Laboratory for Advanced Research in Entity Resolution and Information Quality (ERIQ) at the University of Arkansas at Little Rock

Under our working definition of entity resolution as locating and merging references to the same entity, the last installment focused on the merge problem, and how matching is often used as a stand-in for ER.  Now let’s take a look at the locating problem.

First we should note that information comes to us in two forms, structured and unstructured.  The traditional world of IT has been built around structured information based on the discipline of relational database schemas.  In essence, data is structured if it is ready to be loaded into a relational database, i.e. all of the entities and their attributes are clearly delimited or tagged in a way that a computer can correctly read the entire data set by following one simple, repeating pattern.  In the good ole days, the flat-file format gave us this by requiring that every record must have a fixed length and every attribute must occupy a fixed position in the record.  Inspired by the spreadsheet paradigm, a friendlier version came along only requiring that all of the attributes be presented together in a fixed order, each separated from the other by a specially designated character, the delimiter.  Now XML has brought us yet another discipline of explicitly tagging the start and end of records and attributes with a consistent naming convention.
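To make the three disciplines concrete, here is a minimal Python sketch that reads the same record in fixed-width, delimited, and XML form. The field names, column widths, and sample values are assumptions invented for illustration; the point is only that each format can be read by following one simple, repeating pattern.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical layout (an assumption): name is 10 characters wide,
# city is 12, zip is 5. Each tuple is (field, start, end).
FIXED_LAYOUT = [("name", 0, 10), ("city", 10, 22), ("zip", 22, 27)]
FIELD_ORDER = [name for name, _, _ in FIXED_LAYOUT]


def parse_fixed_width(line):
    """Flat file: every attribute occupies a fixed position in the record."""
    return {name: line[start:end].strip() for name, start, end in FIXED_LAYOUT}


def parse_delimited(line, delimiter=","):
    """Delimited: attributes appear in a fixed order, split by a delimiter."""
    values = next(csv.reader(io.StringIO(line), delimiter=delimiter))
    return dict(zip(FIELD_ORDER, (value.strip() for value in values)))


def parse_xml(text):
    """XML: the start and end of each attribute are tagged explicitly."""
    record = ET.fromstring(text)
    return {child.tag: (child.text or "").strip() for child in record}


fixed_line = "Jane Smith".ljust(10) + "Little Rock".ljust(12) + "72204"
delimited_line = "Jane Smith,Little Rock,72204"
xml_text = (
    "<record><name>Jane Smith</name>"
    "<city>Little Rock</city><zip>72204</zip></record>"
)

for parsed in (
    parse_fixed_width(fixed_line),
    parse_delimited(delimited_line),
    parse_xml(xml_text),
):
    print(parsed)
```

All three parsers recover the same entity and attributes, which is what makes the locating step trivial for structured sources.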

So in the structured world, locating is easy: you just follow the pattern.  The problem is that we are now beginning to realize that there is a tremendous amount of information in unstructured formats such as free-form documents, photos, videos, audio files, sensor data, and other formats that are not easily mapped into an entity-attribute schema.  Even if we just focus on information encoded in character (text) format, the total amount of unstructured information in most organizations often exceeds the amount of structured information by a considerable amount.  What’s more, we now realize that some of this information could be important, i.e. that processes like customer relationship management (CRM) could be transformed if the company only knew what its customers were saying in their emails to the company or in the comments they gave to telemarketers or technical support personnel, who typed those comments into a free-form notes field.

So how did we end up with so much unstructured information? Did good information go bad?  No, the reason is that the information age operates on four channels –  people to computers, computers to people, computers to computers, and people to people – and it is the last of these that generates the unstructured information.  Person-to-person communication is inherently complex and often carries a tremendous amount of implicit and explicit context that people understand, but computers don’t.

Early in my career, I worked with a professor on the problem of disambiguation of homographs using thesauri (a fancy way of asking if a computer can understand the difference in meaning between two words that are spelled the same, but mean different things, just by looking at the synonyms of the words around them, e.g. “I can open this can.”)  His favorite test was “Time flies like an arrow, but fruit flies like a banana.”

But getting back on topic, if you want to resolve whether references are to the same or different entities, you must first have the references.  So if the information sources are unstructured, the locating side of entity resolution is about finding the entity references.  This process is variously referred to as “named entity recognition”, “entity identification”, or “entity extraction”.  In the next installment we will discuss some of the strategies for entity extraction from unstructured text documents.
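As a rough illustration of what “locating” means for a free-form notes field, the Python sketch below pulls candidate entity references out of a sentence with two regular expressions. This is a deliberately naive stand-in, not one of the extraction strategies promised for the next installment: the patterns, the sample note, and the function name are all assumptions made up for this example.

```python
import re

# Hypothetical patterns (assumptions): a simple email shape and a
# US-style phone number written as 3-3-4 digits with hyphens.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def extract_references(text):
    """Return (kind, value, position) for each candidate reference found."""
    found = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((kind, match.group(), match.start()))
    # Report references in the order they appear in the text.
    return sorted(found, key=lambda reference: reference[2])


note = (
    "Customer called from 501-555-0123, asked us to reply to "
    "jane.smith@example.com about the refund."
)

for kind, value, position in extract_references(note):
    print(kind, value, position)
```

Only once references like these have been located can the merge side of entity resolution decide whether they point to the same entity.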

Identity Resolution Daily Links 2008-12-12

Friday, December 12th, 2008

[Post from Infoglide] Part Deux: If Only Data Quality Were That Simple

“Applying generic algorithms to data attributes with wildly varying characteristics simply can’t match the accuracy of applying a family of deterministic analytics, each built around specific characteristics of a particular attribute type.”

Data Value Talk: The added value of an integrated customer view

“So it appears that the data itself plays a crucial role in the lack of an integrated customer view. Or more accurately, the better the data - the better the customer view.  And the better the matching of customer records across separate systems the better the integrated customer view. So Data Quality and Matching (Identity Resolution) determine in large parts the quality of the integrated customer view and the added value that it delivers.”

Marion Star: Muzzle loading and compensation

“Investigators from the Ohio Bureau of Workers’ Compensation, posing as gun enthusiasts, twice visited SMS. Those visits consisted primarily of small talk about guns and ammo. McGraw discussed some pistols that he had recently sold and invited one of the investigators to bring in an allegedly defective gun, telling them he would ‘take a look at it.’”

Intelligent Enterprise: ‘Surround Strategy:’ A Prediction for 2009

“Rather than trying to remodel the data warehouse to accommodate fresher and more detailed operational data (near real-time activity in operational systems, process logs, etc.), these data sources will operate in parallel (or horizontally, whichever word you like) as complementary feeds to analytics. It takes too long and is too expensive to expand the data warehouse concept to do this.”

New York State Insurance Department: Cortland Woman Accused of Workers’ Comp Fraud

“Horton is charged with making false statements and submitting false testimony to the Workers’ Compensation Board to receive benefits. She claimed that an April 2006 back injury she suffered while she was a health aide prevented her from working or attending school. Investigators learned that she was attending school full-time.”

Gartner: When is SOA, DOA? When it’s without MDM!

[Andrew White] “Clearly, if every SOA-based application interaction had to incur the costs of data reconciliation, mapping, clean up etc, then the cost of building and maintaining that SOA-based application would exceed what it costs today without SOA.  The bottom line: SOA needs MDM to help with the evolution of the information infrastructure.”

The State Journal: Insurance Fraud Unit Wins 45 Convictions This Year

“Since January 2007, the fraud unit has received 1,703 case referrals for review from those in the insurance industry and private citizens. After reviewing the referrals, field investigators have been assigned 397 cases to pursue. During that time, [West Va. Insurance Commissioner Jane] Cline said, 292 criminal cases have been referred to various prosecuting authorities, as well as in-house prosecutors who have been assigned to the unit on a full-time basis. Further, the fraud unit has secured indictments on 84 individuals for 294 felony counts and successfully obtained 73 convictions, including 45 in 2008.”

Part Deux: If Only Data Quality Were That Simple

Wednesday, December 10th, 2008

By Robert Barker, Infoglide Senior Vice President & Chief Marketing Officer

During the past two weeks, Philip Howard at Bloor Research has raised interesting questions about the nature and efficiency of data quality solutions in a series of posts entitled “The problem with data quality solutions.” Last week I responded on his blog and posted an expanded discussion of the same points here.

His fourth installment opens some interesting new topics. Perhaps the best approach is to lift some quotes and then respond below.

“Where I will comment is on the importance of understanding relationships not just between data elements but also between data and applications and even between data and the business. Understanding data relationships is arguably the most important factor whenever you are moving and transforming data, especially in data migration and data archiving environments but also for moving data into a warehouse and similar applications.”

We agree that finding non-obvious connections is crucial to building effective data quality solutions. Many technologies fall short in this regard. They are unable to evaluate relationships based on similarity when data is inconsistent. Philip’s simple example baffles many technologies:

“A typical case might be where one application required a five digit numeric field and another application requires the same five numbers plus an additional two alphabetic characters. So, here’s a question for data quality vendors: can your software tell the difference?”

Applying generic algorithms to data attributes with wildly varying characteristics simply can’t match the accuracy of applying a family of deterministic analytics, each built around specific characteristics of a particular attribute type.
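To show what telling the difference could look like for this one example, here is a small Python sketch of a deterministic check built around the two field formats in the quote. It is an illustration only, not a description of any vendor's product; the labels and function names are assumptions.

```python
import re

# The two formats from the quoted example: five digits, and the same
# five digits followed by two alphabetic characters.
FIVE_DIGITS = re.compile(r"\d{5}")
FIVE_DIGITS_TWO_LETTERS = re.compile(r"\d{5}[A-Za-z]{2}")
# Captures the five-digit core whether or not the suffix is present.
CORE = re.compile(r"(\d{5})(?:[A-Za-z]{2})?")


def classify(value):
    """Label a value by which of the two field formats it follows."""
    value = value.strip()
    if FIVE_DIGITS.fullmatch(value):
        return "numeric5"
    if FIVE_DIGITS_TWO_LETTERS.fullmatch(value):
        return "numeric5+alpha2"
    return "unknown"


def same_base_number(a, b):
    """True when both values are valid and share the same five digits."""
    match_a = CORE.fullmatch(a.strip())
    match_b = CORE.fullmatch(b.strip())
    if match_a is None or match_b is None:
        return False
    return match_a.group(1) == match_b.group(1)


print(classify("12345"), classify("12345AB"), classify("1234A"))
print(same_base_number("12345", "12345AB"))
```

A generic string comparison would treat "12345" and "12345AB" as simply different, while a rule written for this attribute type can recognize both the difference in format and the shared underlying number.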

He goes on: “Unfortunately, discovering relationships is not just about profiling your database. There may be relationships that exist across data sources (and types of data source) that you need to understand; and then there is the application factor. While it may not be theoretically correct from a purist data management perspective the fact is that many data relationships are defined within applications so, in one way or another, you really need to discover these.”

We couldn’t have articulated it any better. Many data quality solutions assume a higher degree of order than actually exists in the real world. Being able to deal with ambiguity (e.g., data sometimes missing, data entered in wrong fields) distinguishes the best technologies from their more simplistic brethren.

This post is getting a little long, so we’ll continue this discussion next week. In the meantime, we’d like to hear your reaction.

Identity Resolution Daily Links 2008-11-17

Monday, November 17th, 2008

[Post from Infoglide] Identity Resolution Daily: Proud of Our Heritage

“When we examine our company’s roots, we see that our heritage is finding bad guys. That’s what David Wheeler set out to do when he saw that detectives had a critical need for better tools for criminal investigations. That is what we are beginning to do in the great State of Washington to identify businesses trying to cheat on their workers’ compensation premiums. From desktops to mainframes and everything in between, our roots have spread and have helped keep us stable as the winds of change have buffeted us about.”

Miami Herald: Workers’ compensation investigator accused of fraud

“In September, according to an arrest warrant, Vega visited Pipe Designs Inc., 7710 NW 72nd Ave., in Miami-Dade. The company did not have any workers’ compensation coverage, Vega found. Vega told owner Ronald Triana that he would lower the hefty penalty — between $27,000 and $30,000 — if Triana gave him a $2,500 money order with the payee information blank, according to the warrant.”

onestopclick: MDM ‘driving software development’

“Studies carried out by IT industry analyst Gartner indicate the necessity for firms to increase the effectiveness of their database development, while reducing costs and meeting compliance requirements, is driving the take-up of MDM technologies.”

Computing SA: IT downturn: every cloud has a silver lining

“Open source data integration, data quality, and extraction, transformation and loading (ETL) applications will flourish in these conditions because they are less costly to obtain, widely supported and constantly updated.”

opodo: Travellers reminded of Esta regulations

“Jim Forster, British Airways’ government and industry affairs manager, said: ‘The US is our biggest overseas market and we have been working hard to advise our visa waiver customers that they must apply to the Department of Homeland Security well in advance of travel.’”

The Two Sides of Entity Resolution

Thursday, November 6th, 2008

By John Talburt, Professor of Information Science and Director of the Laboratory for Advanced Research in Entity Resolution and Information Quality (ERIQ) at the University of Arkansas at Little Rock

I have always liked the definition of Entity Resolution put forward by the Infolab at Stanford University  - “locating and merging records that refer to the same real-world entities”.  The reason is that it succinctly describes the two primary facets of entity resolution, namely locating and merging.  If you look at the literature in the area of entity and identity resolution, you generally find that the focus is on one, but not the other.  Until recently, commercial entity resolution has focused almost entirely on the merge side, mainly because the records being processed were coming from databases, flat files, or other structured sources.  In a structured source, the entity attributes, and consequently the identity attributes, are given explicitly.  In this case, most of the work centers on the process of record linking, i.e. assigning a common identifier to records referring to the same entity.  Unfortunately all too many of these record linking processes subscribe to the “matching myth,” the false assumption that two records represent the same entity if and only if their identifying attributes match, but more about that in another article.
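Record linking as “assigning a common identifier” can be sketched in a few lines of Python. The version below uses a union-find structure so that records connected through a chain of matches end up with the same link identifier. The sample records and the match rule (sharing a phone number or an email address) are assumptions invented for the example, not a recommended matching strategy.

```python
def link_records(records, is_match):
    """Assign a common link identifier to records connected by matches."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent pointers to the group representative,
        # shortening the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if is_match(records[i], records[j]):
                parent[find(j)] = find(i)

    return [find(i) for i in range(len(records))]


def shares_phone_or_email(a, b):
    """Hypothetical match rule: any shared, non-empty contact attribute."""
    return any(a[key] and a[key] == b[key] for key in ("phone", "email"))


records = [
    {"name": "Mary Jones", "phone": "555-0101", "email": None},
    {"name": "Mary Smith", "phone": "555-0101", "email": "mary@example.com"},
    {"name": "M. Smith", "phone": None, "email": "mary@example.com"},
    {"name": "John Doe", "phone": "555-0199", "email": "john@example.com"},
]

link_ids = link_records(records, shares_phone_or_email)
for record, link_id in zip(records, link_ids):
    print(link_id, record["name"])
```

In this sample the first and third records receive the same identifier even though they share no attribute with each other directly; they are linked only through the second record.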

However, now I am seeing increasing attention on the locating side of entity resolution.  Locating is required when information is presented in an unstructured format, such as text documents or images. In this case, the entity references must first be located (identified) and extracted in the source before the merging process can take place.  Once considered the purview of academics, the art of “feature extraction” has gone mainstream as organizations realize that they often possess more information in unstructured format than in structured files.  Recent books like Tapping into Unstructured Data by Inmon and Nesavich, and a number of new commercial software packages for processing unstructured data are evidence of this emerging trend.  Interest by the US intelligence community in developing techniques for efficient, large-scale entity extraction has also motivated new research and interest in this area.  Like so many areas of information technology, the advent of low-cost, high-performance computing has opened the door to many new approaches to entity extraction and identification that were not practical before.  Based on some of the work I have seen, I believe we are rapidly approaching a point where the expression “machine readable” will no longer mean just binary encoding, but reading and understanding in a human sense.

Identity Resolution Daily Links 2007-09-27

Thursday, September 27th, 2007

[Daily Post from Infoglide Software] Resist the Urge to Merge Purge Data

“Mr. Jonas published a great post yesterday comparing identity resolution against match merge, merge purge and list de-duplication systems that is a must-read for the CIOs doing all the hefty lifting in financial services, government and insurance industries. (Note: identity resolution is called “entity resolution” in IBM parlance.)”

GovernmentExecutive.com: Privacy advocates wary of data ‘fusion centers’

“Riegle called the initiative ‘a novel and different approach to information-sharing’ and said privacy was a top concern when each center was formed. Wobbleton said transparency is a core part of the mission. ‘We do not want to mess this up,’ he added.”

Washington Post: Patriot Act Provisions Voided

“In a case brought by a Portland man who was wrongly detained as a terrorism suspect in 2004, U.S. District Judge Ann Aiken ruled that the Patriot Act violates the Constitution because it ‘permits the executive branch of government to conduct surveillance and searches of American citizens without satisfying the probable cause requirements of the Fourth Amendment.’”

Leadership Journal: Privacy And Security

Writes Michael Chertoff in the DHS blog: “But what about the tension between privacy and security? Is it true that whatever we do to strengthen our security must be at the expense of privacy? It is not. Our efforts to secure our homeland need not harm our privacy. Rather, in many cases they can actually strengthen it. […] Privacy and security are fundamental rights and we will continue to defend both in our post-9/11 world.”

EFF: “Secure Flight” Returns, Lacking Privacy Protections

“When it enacted the Privacy Act in 1974, Congress sought to restrict the amount of personal information that federal agencies could collect and, significantly, required agencies to be transparent in their information practices. The Privacy Act is intended ‘to promote accountability, responsibility, legislative oversight, and open government with respect to the use of computer technology in the personal information systems and data banks of the Federal Government[.]’ Adherence to these requirements is critical for a system like Secure Flight.”

Resist the Urge to Merge Purge Data

Wednesday, September 26th, 2007

Surely you’re aware of Jeff Jonas, identity resolution’s poster boy and first real celebrity. Mr. Jonas is the chief scientist behind IBM’s Entity Analytic Solutions (EAS) and the founder of Systems Research & Development (SRD). He’s a media magnet who’s been featured in CNN, Forbes, Newsweek, NPR, Time and Wired. While he hasn’t quite made it to the cover of People Magazine yet, he was recently sought out by an NBC affiliate in Philadelphia to comment on the plot of the new NBC show Chuck.

Mr. Jonas published a great post yesterday comparing identity resolution against match merge, merge purge and list de-duplication systems that is a must-read for the CIOs doing all the hefty lifting in financial services, government and insurance industries. (Note: identity resolution is called “entity resolution” in IBM parlance.)

For example, when two insurance companies get the urge to merge, they always run into data compatibility issues when they begin to look at the databases they currently have in place. For claims and underwriting purposes, data synchronization is not a luxury — it’s essential to the daily flow of business.

Back in August, when Dutch AEGON acquired the life insurance operations of U.S. investment bank Merrill Lynch in a $1.3 billion cash deal, our own Glenn Hopkins commented:

“Instead of implementing a master data management program, I know that AEGON would be better served — and save lots of capital — if they bolted an identity resolution solution onto their existing architecture. And instead of merging or purging all the identities both companies possess, an identity resolution solution can sift through all the information and keep it all in its native formats for future use.”

The always eloquent Mr. Jonas extends this argument further in his post, Entity Resolution Systems vs. Match Merge/Merge Purge/List De-duplication Systems.

If it’s not too late for AEGON, they should consider Mr. Jonas’ reasons below to avoid a “ground up reload”:

  • Batch versus real-time
  • Snapshot in time versus perpetually current
  • Data survivorship versus full attribution
  • Data drifting versus self-correcting
  • Single version of truth versus every version of truth
  • Outlier attribute suppression versus context accumulating
  • Binary list processing versus “n” data source ingestion
  • Limited scalability versus massive scalability

Please read the post in its entirety for a full explanation of these points and you’ll see why we agree with Mr. Jonas’ conclusion below.

“Entity resolution systems are best suited for real-time missions where processes require access to the most accurate and most current view of that which is knowable [to the enterprise].”

To conclude this post, we’d like to point out to insurance execs that there are other business applications to consider, as mentioned by Glenn in his post on the AEGON merger last month:

“With mergers and acquisitions, there’s always customer overlap issues in any industry. In the insurance industry, with some customers attempting to defraud insurers by using multiple identities, it’s critical to the bottom-line to cross-reference multiple identity records against not just watch lists but also the following business applications:

  • Automated fraud detection
  • Underwriting risk management
  • Enterprise identity management
  • Transactional CRM
  • Compliance
  • Background checks
  • Producer risk assessment”

Identity Resolution Daily Links 2007-08-15

Wednesday, August 15th, 2007

[Daily Post from Infoglide Software] Identity Resolution and Mergers and Acquisitions

“With mergers and acquisitions, there’s always customer overlap issues in any industry. In the insurance industry, with some customers attempting to defraud insurers by using multiple identities, it’s critical to the bottom-line to cross-reference multiple identity records against not just watch lists but also the following business applications…”

Christian Science Monitor: Airport screening raises privacy issue

“The TSA will also offer people an opportunity to provide their date of birth and gender, so that if their name is the same or very similar to one on the watch list the TSA can easily, and permanently, distinguish between them. TSA officials contend that with just a name, 95 percent of the matches against the watch list will be accurate. With a name and date of birth, that percentage goes up to 98.5 percent. Add date of birth, and officials insist there will be 99.5 percent accuracy.”

Government Computer News: DHS bares upgrades to immigration travel databases

“The Homeland Security Department has unveiled several important upgrades to databases that collectively contain tens of millions of personal records concerning immigration and travel. Some of the changes are intended to foster information sharing among organizations inside DHS as well as with outside government agencies. Others aim to reorganize the databases internally so as to make them easier to use.”

Retail Solutions Online: Grow Business - Reduce Shrinkage: The 2% Solution

“On a daily basis, retail loss prevention (LP) executives face the daunting task of diminishing corporate financial loss. Add protecting assets and increasing profitability to the mix, and LP executives have a lot on their plates. Still, one thing is on almost everyone’s mind: Shrink rates are on the rise, costing retailers upwards of 2 percent of their sales each year. It’s a constant struggle to detect and prevent inventory shrinkage, as well as internal losses like check fraud, cash theft, and Internet sales fraud.”

Washington Post: Former Guard Accused Of Hiding Muslim Ties

“Federal prosecutors now allege that Jackson intentionally withheld his Muslim name, Abdul-Jalil Mohammad, to conceal a connection to a controversial imam in Southeast Washington.”

Why identity resolution is necessary in behavioral marketing

Thursday, July 5th, 2007

It’s probably obvious to anyone interested in identity resolution that it has some important applications to both business and national security. From employee screening and loss prevention to IT security to Homeland Security, there are lots of reasons an organization would want to resolve identities over potentially complicated data sources. But what about all the recent talk about so-called “micro-targeting” or “behavioral marketing?”

At its best, it’s basically the collection of identity data for the purpose of highly-specialized messaging. At its worst, it’s - well - the collection of identity data for the purpose of highly-specialized messaging. Still, nobody really seems to be talking about the fact that without the right software in place, the data being collected is less likely to produce much of a return and more likely to create significant security risks.

Thus far this year, advertisers have dropped about $575 million into behavioral advertising, and that figure is expected to rise to $1 billion or so next year, according to information on Marketing Vox.

Some of that money is spent on contextual modeling like Yahoo’s SmartAds, which Yahoo’s senior vice president of display marketplaces, Todd Teresi, says will create “scaleable one-to-one marketing” (from the New York Times). In a post on his Web Analysis blog Anil Batra argues that “this will allow Yahoo to build [a] richer set of data” that will lead to higher conversion rates.

All of that seems great for increasing revenue, but it seems like identity resolution should be an integral part of this process as well. Without intelligent systems in place to create what Infoglide Chief Software Architect John Ripley recently called a “living context,” what you’ve got is a huge data store with tons of identity data but without any built-in accuracy or privacy protection, especially since the data’s purpose suggests a great deal of sharing.

Putting it another way, HP Labs researcher Marco Casassa Mont says that identity management and sharing, especially from “an Identity Provider to a Service Provider or between two Identity Providers,” has to have both data and protection policies in place. He says that those policies have to:

  • Enable users to provide their (privacy) preferences in a more explicit and fine-grained way (e.g. in terms of consent, disclosure list, deletion, notification, etc.);
  • Enable enterprise back-end Identity Management solutions to manage the association of preferences and (data handling) policies to data and take them into account during data processing steps;
  • Enable the exchange of data along with associated preferences/policies;
  • Introduce accountability, tracing and auditing mechanisms.
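One way to picture these four requirements together is the Python sketch below, in which each identity record carries its owner's preferences, every disclosure request is checked against them, and every decision is written to an audit log. All class names, fields, and sample values are hypothetical; this is not taken from HP Labs' work or from any identity management product.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Preferences:
    """User-supplied preferences (hypothetical, simplified fields)."""
    consented_purposes: set
    disclosure_list: set  # recipients the user allows


@dataclass
class IdentityRecord:
    """Identity data stored together with the preferences that govern it."""
    attributes: dict
    preferences: Preferences


@dataclass
class AuditLog:
    entries: list = field(default_factory=list)

    def record(self, recipient, purpose, allowed):
        self.entries.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "recipient": recipient,
            "purpose": purpose,
            "allowed": allowed,
        })


def disclose(record, recipient, purpose, log):
    """Release data only if preferences allow it; audit every decision."""
    prefs = record.preferences
    allowed = (
        purpose in prefs.consented_purposes
        and recipient in prefs.disclosure_list
    )
    log.record(recipient, purpose, allowed)
    if not allowed:
        return None
    # The data travels together with its associated preferences.
    return {"attributes": dict(record.attributes), "preferences": prefs}


log = AuditLog()
customer = IdentityRecord(
    attributes={"name": "Jane Smith", "email": "jane.smith@example.com"},
    preferences=Preferences(
        consented_purposes={"billing"},
        disclosure_list={"partner-a"},
    ),
)

granted = disclose(customer, "partner-a", "billing", log)
refused = disclose(customer, "partner-b", "marketing", log)
print(granted is not None, refused is None, len(log.entries))
```

The design choice worth noting is that the preferences are stored with the record rather than in a separate system, so any recipient of the data also receives the rules attached to it.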

Lots of people are saying that the identity data collection of behavioral marketing is going to produce incredible results for everyone from retail organizations to political campaigns (as an article on MSNBC.com points out). However, while there is a rush to get data collection and resolution solutions deployed, nobody seems to be asking if those solutions are working hard enough to protect privacy or even accurately navigate the vast stores of identity data being collected.
