Record linking is the matching of records across data sets where there is no common identifier such as a national identification number. Record linking, or entity resolution, is used in data mining applications to allow searching across multiple data sets. Record linking projects are used to de-duplicate data in many applications from mailing lists to medical records.
Record linkage is also used to reconstitute populations for use by genealogists, geneticists and social scientists. Our record linking software, called GenMergeDB, was developed over more than 25 years especially for these types of projects. Given digitized historical documents in a variety of formats, GenMergeDB can be used to match records across files to give a more complete picture of individuals over time. GenMergeDB can be licensed, but also forms the basis of our record linking services.
The record linking algorithms of GenMergeDB are based on probabilistic techniques that have been under development since the 1950s. (See record linking references.) These techniques compute a weight for each data value based on its frequency in the population, then use those weights to score a match, partial match or mismatch. For example, two records that agree on "Smith" receive a lower score than two records that agree on "Zepka", because Smith is usually a much more common value. Weights are computed separately for each project, so they reflect the specific population being processed.
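As a rough illustration of frequency-based weighting, the Python sketch below derives a weight for each surname from how rare it is in a sample population. It is not GenMergeDB's implementation: the function names, the log2 formula, and the fixed mismatch penalty are all assumptions chosen to show the general idea.

```python
import math
from collections import Counter

def frequency_weights(values):
    """Map each value to log2(total / count), so rarer values get larger weights.

    Assumed formula for illustration; a real project would tune this.
    """
    counts = Counter(values)
    total = sum(counts.values())
    return {value: math.log2(total / count) for value, count in counts.items()}

def field_score(a, b, weights, mismatch_penalty=-4.0):
    """Score one field: agreement earns the value's weight, disagreement a penalty.

    The penalty of -4.0 is a hypothetical constant, not a recommended setting.
    """
    if a == b:
        # Values never seen in the sample are treated like the rarest seen value.
        return weights.get(a, max(weights.values()))
    return mismatch_penalty

# Toy population: Smith is common, Zepka is rare.
surnames = ["Smith"] * 8 + ["Jones"] * 3 + ["Zepka"]
weights = frequency_weights(surnames)

smith_score = field_score("Smith", "Smith", weights)
zepka_score = field_score("Zepka", "Zepka", weights)
mismatch_score = field_score("Smith", "Zepka", weights)
```

In this toy population, agreement on "Zepka" scores higher than agreement on "Smith", and a disagreement scores below both.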
Because names are so important in matching historical data, we use a full set of matching algorithms, including those based on phonetic changes (Utah Phonetic Transducer, NYSIIS) and edit distance (Jaro-Winkler, Levenshtein). Names change over time, and even the same name can be altered as it is recorded or transcribed. GenMergeDB contains special functionality for women's married vs. maiden names.
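To make the edit-distance idea concrete, here is a minimal Python sketch of the standard Levenshtein distance, plus a normalized similarity derived from it. The `name_similarity` helper and its normalization by the longer name's length are assumptions for illustration, not a description of how GenMergeDB scores names.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the working row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete from a
                current[j - 1] + 1,      # insert into a
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]

def name_similarity(a: str, b: str) -> float:
    """Hypothetical helper: 1.0 for identical names, falling toward 0.0."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest

distance = levenshtein("Smith", "Smyth")
similarity = name_similarity("Smith", "Smyth")
```

A spelling variant such as "Smyth" is one substitution away from "Smith", so it keeps a high similarity rather than being treated as a flat mismatch.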
What makes GenMergeDB unique is the handling of family data. Even with limited demographic information, family information can provide the details necessary to identify individuals. Our algorithms compare each common data field for the two individuals, and then add scoring information from relatives and household members to increase or decrease the confidence of the match.
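One way to picture family-based scoring is sketched below in Python: the score for two individuals is adjusted by how well their relatives match each other. Everything here is hypothetical, including the function names, the best-match averaging, and the `family_weight` of 0.5; the actual GenMergeDB algorithms are not described in enough detail to reproduce.

```python
def relative_support(relatives_a, relatives_b, compare):
    """Average, over relatives in record A, of the best score against record B."""
    if not relatives_a or not relatives_b:
        return 0.0  # no family information: leave the individual score unchanged
    best_scores = [
        max(compare(rel_a, rel_b) for rel_b in relatives_b)
        for rel_a in relatives_a
    ]
    return sum(best_scores) / len(best_scores)

def family_adjusted_score(individual_score, relatives_a, relatives_b,
                          compare, family_weight=0.5):
    """Raise or lower the individual score using evidence from relatives."""
    support = relative_support(relatives_a, relatives_b, compare)
    return individual_score + family_weight * support

def compare_names(a, b):
    """Toy comparison: +1.0 for a case-insensitive match, -1.0 otherwise."""
    return 1.0 if a.lower() == b.lower() else -1.0

# Same individual-level score, different households.
same_family = family_adjusted_score(
    2.0, ["Mary", "John"], ["mary", "john", "Ann"], compare_names)
other_family = family_adjusted_score(
    2.0, ["Mary", "John"], ["Ellen", "Peter"], compare_names)
```

With identical individual scores, the pair whose household members also agree ends up with the higher combined score, while the pair with unrelated households is pushed down.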
Are these two records for the same person?
Copyright © 2011 Pleiades Software Development, Inc.
1338 S. Foothill Suite 324
Salt Lake City, UT 84108
Tel: (801) 349-5559