[NOTE: We apologize that this is late in getting posted. We were having some technical problems that we just resolved.]
We left off Wednesday with Fisherman Bob and his catch of fish, lobsters, rusty hubcaps, and old boots.
Practically speaking, the balance of “false positives” (records I shouldn’t have found but did) and “false negatives” (records I should have found but didn’t) is something that challenges anyone in the business of information retrieval.
In the world of retail returns fraud, it is a question of potentially turning a loyal customer into a former customer. We either:
- Deny their return because of a false positive hit against a “known shoplifters” database or
- Let “Ima Theef” return a shoplifted item (a false negative): the shoplifter database contained “Iama Thief,” but the matching policy had been tightened so that “Ima McLoyalCustomer” would not be denied.
In the retail world, I suspect the tendency is to lean toward avoiding false positives because, after all, the customer is always right.
Contrast that with the world of background checks and employee vetting. The consequences of missing a hit against the sex offender registry because a newly hired school cafeteria worker misspelled his last name and changed his date of birth by 1 digit on his application are immeasurable. The parameters of this search would most likely err on the side of caution and consider multiple variations on name and date of birth.
In either case, the search problems are similar regardless of the tendency towards false positives or false negatives. Ultimately, whether searching with a wide net or a standard one, the results of the search should be qualified. There needs to be a way to remove the “old boot” from the record set before it gets to the consumers of the data. They should have confidence that records of relevance were returned to them. In principle it sounds like a simple expectation to fulfill, but in practical terms, it is not easy to achieve in a consistent, automated fashion. We hear it regularly from our customers in need of a better solution.
Infoglide Software to the RESQ! No, that is not a typo; it is my not-so-clever acronym to describe what Identity Resolution Engine™ (IRE) offers for the search challenges I have mentioned: Restrict/Expand, Search, and Qualify.
Restrict. No, you can’t boil the ocean. Whether limited by processing power, memory, single search performance, bandwidth, data volume, or overall throughput, at some point you will be unable to do a complete “deep-dive” search and still achieve the required performance characteristics. The system needs an intelligent means to logically subset the searchable data.
Expand. Using exact matches against indexed fields has been the bread and butter of the relational database market for the past 30 years. Relational databases are incredibly fast at doing it, but that is not enough. We need to find variations in the data. We need to find things that are “like” what we are looking for. We need to find things that are “near” what we are looking for. The system needs an intelligent means to expand our search net so that our candidate set more-than-likely contains our records of interest. The more “more-than-likely” the better. Obviously, returning every record on every search guarantees 100% recall, but it also buries the good records in junk and brings us right back to boiling the ocean. The answer lies somewhere in between.
Search. Of course, we have to physically search at some point in our processing. That search may be a SQL query with all the search criteria restricted and expanded as appropriate, a high-performance in-memory database, or a web-service call to a data provider, to name a few. In any case, we end up with a result set that, in some shape or form, more than likely contains the records we are looking for.
Qualify. The records in our candidate set have been included because they have, for better or worse, met one or more of the restrictive or expanded parameters in our search. However, a record taken as a whole may still be discarded because of its other attributes, contradictory information, or some other disqualifying reason.
As a practical (yet trivial) example, one of my restriction/expansion strategies may have been to do a “starts with” when searching for “John Ripley.” That would yield a search that might be: find me all records where first name starts with ‘Jo’ and last name starts with ‘Rip.’ In this simple example, I have avoided boiling the ocean and only asked to return records that are like Jo___ and Rip___. In my record set I have Joe Riplinski, Jon Ripley, Jolene Ripple, Joseph Ripcord, Johnathan Smith (Riptide California), John Ripley, Rip Johnson, and Cal Joe Ripken.
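One way to picture how a record set like this could come back is a loose prefix test applied to every word in a record rather than to one fixed field. The Python sketch below is only an illustration. The function name `restrict_expand`, the any-word matching rule, and the two extra non-matching records are assumptions made for the example, not a description of how IRE works.

```python
def restrict_expand(records, first_prefix, last_prefix):
    """Keep records in which some word starts with each prefix.

    Matching on any word (not just the name fields) is an assumed,
    deliberately loose rule so that transposed names and stray
    city matches land in the candidate set too.
    """
    fp, lp = first_prefix.lower(), last_prefix.lower()
    candidates = []
    for record in records:
        # Split into words and strip surrounding punctuation.
        tokens = [t.strip(".,()").lower() for t in record.split()]
        has_first = any(t.startswith(fp) for t in tokens)
        has_last = any(t.startswith(lp) for t in tokens)
        if has_first and has_last:
            candidates.append(record)
    return candidates


RECORDS = [
    "Joe Riplinski",
    "Jon Ripley",
    "Jolene Ripple",
    "Joseph Ripcord",
    "Johnathan Smith (Riptide California)",
    "John Ripley",
    "Rip Johnson",
    "Cal Joe Ripken",
    "Mary Jones",   # hypothetical record, fails the 'Rip' test
    "Bob Fisher",   # hypothetical record, fails both tests
]

candidates = restrict_expand(RECORDS, "Jo", "Rip")
for record in candidates:
    print(record)
```

The ocean has been reduced to a bucket, but the bucket still holds boots and hubcaps. Nothing here ranks or rejects a candidate; that is left to the qualify step.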
Effectively, most of the records are “old boots” and “rusty hubcaps.” The system needs an intelligent means to separate the quality matches from the junk. IRE’s patented “Similarity Search,” with its large library of configurable measures, both generic and domain-specific, can be applied across one or more attributes, qualify the degree of similarity of a potential record, and either include or exclude it from the final record set based on configuration-time or search-time business logic. In this example, John Ripley would qualify 1st, with Jon Ripley coming in 2nd and so on. At some point the scores would be low enough to discard the record completely.
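To make the qualify step concrete, here is a minimal sketch that scores each candidate against the search name and drops anything below a cutoff. It uses `difflib.SequenceMatcher` from the Python standard library purely as a stand-in similarity measure. It is not IRE’s Similarity Search, and the `similarity` and `qualify` names and the 0.5 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def similarity(query, record):
    """Return a 0.0-1.0 score; a generic stand-in for a real measure."""
    return SequenceMatcher(None, query.lower(), record.lower()).ratio()


def qualify(query, candidates, threshold=0.5):
    """Score every candidate, discard weak ones, rank the rest."""
    scored = [(record, similarity(query, record)) for record in candidates]
    kept = [(record, score) for record, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


CANDIDATES = [
    "Joe Riplinski",
    "Jon Ripley",
    "Jolene Ripple",
    "Joseph Ripcord",
    "Johnathan Smith (Riptide California)",
    "John Ripley",
    "Rip Johnson",
    "Cal Joe Ripken",
]

ranked = qualify("John Ripley", CANDIDATES)
for record, score in ranked:
    print(f"{score:.2f}  {record}")
```

Comparing whole strings is the crudest possible choice. A more realistic qualifier would score each attribute with a measure suited to it and combine the results, which is where configurable, domain-specific measures earn their keep.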
IRE has many configurable techniques to handle the “intelligent” part of restrict and expand such as nicknames, field transpositions, word stemming, ranges, and geographic proximity just to name a few.
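Two of those techniques, nicknames and field transpositions, are easy to sketch as query expansion: generate the variants up front, then search for each. The nickname table and the `expand_name` function below are hypothetical and exist only to show the idea. A real system would draw on a much larger, curated nickname source.

```python
# Hypothetical nickname table, for illustration only.
NICKNAMES = {
    "john": ["johnny", "jack"],
}


def expand_name(first, last):
    """Return (first, last) variants covering nicknames and swapped fields."""
    first, last = first.lower(), last.lower()
    firsts = [first] + NICKNAMES.get(first, [])
    variants = []
    for name in firsts:
        variants.append((name, last))   # name as entered, or a nickname
        variants.append((last, name))   # first and last name transposed
    return variants


for variant in expand_name("John", "Ripley"):
    print(variant)
```

Each variant widens the net a little, so a “Rip Johnson”-style transposition or a “Jack Ripley” makes it into the candidate set and gets its chance to be qualified.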
The example above uses two fields (first and last name). With so little information, our ability to qualify a record against the search criteria is limited to the name alone. Imagine if we also had address and date of birth information. We could make more informed decisions . . . perhaps house-holding logic, familial relationship information (parent and children in the same house), and neighbors. But like the HDTV search discussed on Wednesday, we have been trained by ineffective search strategies to limit the types of things we ask for. We went from searching brand, model number, type of electronics, and feature keywords (“Samsung LX-53567 TV HD 1080p”) down to brand and type (“Samsung TV”). With the techniques of restrict, expand, search, and qualify, our initial search would not have yielded 0 results but instead “Samsung LX-53576 TV HD 1080i” and “Samsung VX-3345 TV HD 1080i” and so on, ranked by strength of match.
The power of the RESQ search pattern lies not only in the ability to configure each phase of the process with a variety of restrictors, expanders, and qualifiers, but in your ability to actually replace them with a completely different implementation to meet a particular customer and data need. Meanwhile, the pattern remains the same: Flexible yet powerful.
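The replaceable-parts idea can be sketched by passing each phase in as a plain function. Everything named below (`resq`, `prefix_criteria`, `list_search`, `fuzzy_score`, `exact_score`) is hypothetical and written for this example. It shows the shape of the pattern under those assumptions, not IRE’s actual interfaces.

```python
from difflib import SequenceMatcher


def prefix_criteria(query):
    """Restrict/expand: build a loose 'starts with' test from the query."""
    first, last = query.lower().split()
    fp, lp = first[:2], last[:3]

    def matches(record):
        tokens = [t.strip(".,()").lower() for t in record.split()]
        return (any(t.startswith(fp) for t in tokens)
                and any(t.startswith(lp) for t in tokens))

    return matches


def list_search(records, matches):
    """Search: scan an in-memory list (could be SQL or a web service)."""
    return [record for record in records if matches(record)]


def fuzzy_score(query, record):
    """Qualify: stand-in similarity measure from the standard library."""
    return SequenceMatcher(None, query.lower(), record.lower()).ratio()


def exact_score(query, record):
    """Qualify: a much stricter drop-in replacement."""
    return 1.0 if query.lower() == record.lower() else 0.0


def resq(query, records, restrict_expand, search, qualify, threshold=0.5):
    """Run the phases in order; each one is supplied by the caller."""
    criteria = restrict_expand(query)
    candidates = search(records, criteria)
    scored = [(qualify(query, record), record) for record in candidates]
    return sorted((pair for pair in scored if pair[0] >= threshold),
                  reverse=True)


RECORDS = ["Joe Riplinski", "Jon Ripley", "John Ripley",
           "Rip Johnson", "Mary Jones"]

fuzzy = resq("John Ripley", RECORDS, prefix_criteria, list_search, fuzzy_score)
strict = resq("John Ripley", RECORDS, prefix_criteria, list_search, exact_score)
print(fuzzy)
print(strict)
```

Swapping `fuzzy_score` for `exact_score` changes what survives qualification without touching `resq` itself, which is the point: the pattern stays put while the pieces change.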